Python Features

Some ML projects need advanced scripting to extract features from a dataset. In those cases you need a scripting language that offers far more transformation capabilities than SQL. That's why we provide Python Features.

Some use cases are:

  • Lemmatizing words with nltk in an NLP project
  • Extracting product similarity scores from product images

Here are the steps to add a Python Feature to your Layer Project:

  1. Develop your feature in a Python file by implementing the build_feature function. You can add parameters to this function to load entities from the Data Catalog and Model Catalog. Example:

    from typing import Any

    from layer import Dataset
    from sklearn.preprocessing import LabelEncoder


    def build_feature(sdf: Dataset("spam_messages")) -> Any:
        df = sdf.to_pandas()
        feature_data = df[["id", "label"]]
        # Create a LabelEncoder instance and encode the label column as integers
        labelencoder = LabelEncoder()
        feature_data = feature_data.assign(is_spam=labelencoder.fit_transform(feature_data["label"]))
        feature_data.drop(columns=["label"], inplace=True)
        return feature_data
  2. If your Python code requires external libraries, list them in requirements.txt:

    scikit-learn==0.22.2.post1
  3. Add your feature to a featureset by listing it in dataset.yml:

    # Spam Detection Project Example
    apiVersion: 1
    type: featureset
    name: "sms_featureset"
    description: "SMS features extracted from the labeled sms messages"
    features:
      - name: is_spam
        description: "Target label"
        source: is_spam/feature.py
        environment: is_spam/requirements.txt
      - name: message
        description: "Lemmatized messages"
        source: message/feature.py
        environment: message/requirements.txt
    schema:
      # All of the features above should include this primary key. It will be used to join the features
      # together.
      primary_keys: ["id"]
    materializations:
      - target: layer
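
The comment in dataset.yml notes that the primary key is used to join the features together. As a rough local illustration of why every feature must return the id column, the sketch below joins two hypothetical feature outputs with plain pandas. The sample rows and the join_features helper are invented for this example, and the inner join is an assumption; it does not show how Layer performs the join internally.

```python
import pandas as pd

# Hypothetical outputs of the two build_feature functions listed in dataset.yml.
# The rows are invented sample data; each frame carries the "id" primary key.
is_spam = pd.DataFrame({"id": [1, 2, 3], "is_spam": [0, 1, 0]})
message = pd.DataFrame(
    {"id": [1, 2, 3], "message": ["see you soon", "win cash now", "call me later"]}
)


def join_features(frames, primary_key="id"):
    """Join feature frames on the primary key.

    Illustrative helper only (not part of Layer); the inner join is an assumption.
    """
    # Validate every frame before joining so a bad feature fails early.
    for frame in frames:
        if primary_key not in frame.columns:
            raise ValueError(f"feature output is missing primary key {primary_key!r}")
        if frame[primary_key].duplicated().any():
            raise ValueError(f"primary key {primary_key!r} contains duplicate values")

    # Merge the frames one by one on the shared primary key.
    joined = frames[0]
    for frame in frames[1:]:
        joined = joined.merge(frame, on=primary_key, how="inner")
    return joined


featureset = join_features([is_spam, message])
print(featureset)
```

A feature output that lacks the id column, for example one that returns only the encoded label, fails the check before any join is attempted.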