ML Models

Machine learning models are akin to mathematical functions: they take a request in the form of input data, make a prediction on that data, and serve a response. Layer takes a declarative approach to streamline the ML Model development process.

ML Models are first-class entities in Layer. They are integral to and built within a Layer Project. They are versioned and stored in the Layer Model Catalog.

Each model is defined in its own directory, with a model.yml file at the root that links to one or more Python files. Models have the following basic layout:

models/
├── my_model/
│   ├── model.yml
│   └── my_model.py

Model Configuration

Models are configured in a model.yml file, which looks like this:

# required. this is used to make sure backwards-incompatible changes
# in config format do not break layer CLI
apiVersion: 1

# required.
name: my_model

# optional.
description: "My model description"

# required. used to determine how to train this model
training:
  - name: my_model_training
    description: "My Model Training"
    entrypoint: my_model.py
    environment: requirements.txt
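
The environment key points to a standard pip requirements file listing the packages your training code needs. As an illustrative sketch (the specific packages are assumptions based on the training code below, not a fixed Layer convention), a matching requirements.txt might look like:

# requirements.txt (illustrative; pin versions as appropriate for your project)
scikit-learn
xgboost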

Model Training

Model training is done via Python files referenced from the model.yml file.

This Python code uses the Layer SDK to define a train_model function that takes a train argument as well as one or more featureset arguments. You can train your model however you like within this function. While training, you can use train.log_parameter and train.log_metric to save the parameters and metrics of your training runs, which are then viewable in the Layer Model Catalog UI. You can also use train.register_input and train.register_output to define the model signature, which can then be used to determine the data lineage of this model.

Here's an example model training code outline:

from typing import Any

from sklearn.model_selection import train_test_split
from sklearn.metrics import average_precision_score
import xgboost as xgb

from layer import Featureset, Train


def train_model(train: Train, tf: Featureset("transaction_features")) -> Any:
    # Create the training and label data
    train_df = tf.to_pandas()
    X = train_df.drop(["unused_features"], axis=1)
    Y = train_df["labeled_data"]

    random_state = 13
    test_size = 0.2
    train.log_parameter("random_state", random_state)
    train.log_parameter("test_size", test_size)
    trainX, testX, trainY, testY = train_test_split(
        X, Y, test_size=test_size, random_state=random_state
    )

    # Register the input and output of the train. Layer uses these
    # registrations to extract the signature of the model and to
    # calculate drift.
    train.register_input(trainX)
    train.register_output(trainY)

    max_depth = 3
    objective = "binary:logitraw"
    train.log_parameter("max_depth", max_depth)
    train.log_parameter("objective", objective)

    # Train the model
    param = {"max_depth": max_depth, "objective": objective}
    dtrain = xgb.DMatrix(trainX, label=trainY)
    model_xg = xgb.train(param, dtrain)

    dtest = xgb.DMatrix(testX)
    preds = model_xg.predict(dtest)

    # Since the data is highly skewed, we use the area under the
    # precision-recall curve (AUPRC) rather than the conventional area
    # under the receiver operating characteristic (AUROC), because the
    # AUPRC is more sensitive to differences between algorithms and
    # their parameter settings (see Davis and Goadrich, 2006:
    # http://pages.cs.wisc.edu/~jdavis/davisgoadrichcamera2.pdf)
    auprc = average_precision_score(testY, preds)
    train.log_metric("auprc", auprc)

    # Return the model
    return model_xg

Now that you've defined your model, you can go ahead and run your project!
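
For instance, assuming you are using the Layer CLI from the root of your project (the exact command is an assumption here; check layer --help for your installed version), starting a run typically looks like:

layer start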