
Featuresets

A featureset is a group of calculated features that provides a high-level interface for accessing the individual features. Featuresets differ from static datasets and ordinary database tables in that they support time travel: you can retrieve the point-in-time values of their underlying features.
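To make "point-in-time value" concrete, here is a minimal sketch in plain Python. It is not Layer's API: the `history` list and the `value_as_of` helper are hypothetical, and the sketch assumes that a feature's history is available as timestamped values sorted by time.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical history of one feature for one entity: (timestamp, value),
# sorted by timestamp. How Layer stores this internally is not covered here.
history = [
    (datetime(2021, 1, 1), 3),
    (datetime(2021, 3, 1), 5),
    (datetime(2021, 6, 1), 8),
]


def value_as_of(history, when):
    """Return the latest value recorded at or before `when`, or None."""
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, when)  # number of entries with ts <= when
    return history[idx - 1][1] if idx > 0 else None


mid_april = value_as_of(history, datetime(2021, 4, 15))
before_any = value_as_of(history, datetime(2020, 12, 31))
```

A static table would only hold the most recent value; a point-in-time lookup returns the value that was current at the requested moment.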

Featuresets are first-class entities in Layer. They are built within a Layer project and stored in the Layer Data Catalog.

Each featureset is defined in its own directory, with a dataset.yml file at the root that references one or more SQL or Python files. Featuresets have the following basic layout:

data/
└── my_featureset/
    ├── dataset.yml
    ├── my_feature_1.sql
    └── my_feature_2.py

Featureset configuration

Featuresets are configured in a dataset.yml file. The fields are defined below, followed by a complete example.

apiVersion

Version of the featureset definition file. It ensures that backwards-incompatible changes to the config format do not break the Layer CLI.

name

Name of the featureset. The name will be used to identify the featureset in the Data Catalog.

description

Description of the featureset. We recommend writing a description that future coworkers (or future you) will be grateful for. This description is displayed on the featureset card in the Data Catalog.

type

Type of data contained in this dataset. For featuresets, the only valid value is featureset.

features

List of features.

name

Name of the feature.

description

Description of the feature.

source

Source file of the feature. A SQL query file indicates a SQL feature; a Python source file indicates either a Python or a Spark feature.

schema

Contains primary_keys, which is used to join the features in a featureset. Every feature has an ID column and a Value column; primary_keys tells Layer which column(s) to join the features on.
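The role of the primary key can be sketched in plain Python. This is an illustration only, not Layer's implementation: the `join_features` helper and the sample data are hypothetical, and the sketch assumes an inner join, which the text above does not specify.

```python
def join_features(features):
    """Join per-feature {ID: value} mappings into one row per shared ID.

    `features` maps a feature name to its {ID: value} column.
    Assumption: an inner join, so IDs missing from any feature are dropped.
    """
    if not features:
        return {}
    shared_ids = set.intersection(*(set(column) for column in features.values()))
    return {
        id_: {name: column[id_] for name, column in features.items()}
        for id_ in sorted(shared_ids)
    }


# Hypothetical car features, each keyed by the ID column.
features = {
    "transmission": {1: "manual", 2: "automatic", 3: "manual"},
    "age": {1: 4, 2: 7},
}
joined = join_features(features)
```

Here car 3 has no age value, so under the inner-join assumption it does not appear in `joined`.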

materialization

This field is required.

type

table is the only option.

target or integration

Name of the integration where the features are materialized. You assign names to integrations in Layer Settings > Integrations.

# required.
apiVersion: 1
# required.
name: "my_featureset"
# optional.
description: "Car features with transmission and age"
# required.
type: featureset
# required.
features:
  - name: my_feature_1
    description: "My SQL Feature's description"
    source: my_feature_1.sql
  - name: my_feature_2
    description: "My Python Feature's description"
    source: my_feature_2.py
# required.
schema:
  primary_keys: ["ID"]
# required.
materialization:
  type: table
  target: my_db
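The required and optional markers in the example can be checked mechanically before a file is handed to the Layer CLI. The sketch below is a hypothetical pre-flight check, not part of Layer: `find_config_problems` is an invented helper, and the configuration is written as a Python dict so that the snippet runs without a YAML parser.

```python
# Top-level fields marked "required" in the example; description is optional.
REQUIRED_FIELDS = ("apiVersion", "name", "type", "features", "schema", "materialization")


def find_config_problems(config):
    """Return human-readable problems found in a featureset config dict."""
    problems = [f"missing required field: {key}" for key in REQUIRED_FIELDS if key not in config]
    if "type" in config and config["type"] != "featureset":
        problems.append("type must be 'featureset'")
    if "schema" in config and not config["schema"].get("primary_keys"):
        problems.append("schema.primary_keys must list at least one key")
    if "materialization" in config and config["materialization"].get("type") != "table":
        problems.append("materialization.type must be 'table'")
    return problems


# The example configuration, expressed as a dict.
config = {
    "apiVersion": 1,
    "name": "my_featureset",
    "description": "Car features with transmission and age",
    "type": "featureset",
    "features": [
        {"name": "my_feature_1", "source": "my_feature_1.sql"},
        {"name": "my_feature_2", "source": "my_feature_2.py"},
    ],
    "schema": {"primary_keys": ["ID"]},
    "materialization": {"type": "table", "target": "my_db"},
}

problems = find_config_problems(config)
```

For the example configuration, `problems` is empty; removing `materialization` or changing `type` produces one message per violation.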

Defining features

There are two ways to define features in a Layer project:

  • SQL features: Write a SQL query that transforms your dataset to extract a feature.
  • Python features: For advanced feature extraction, write a Python script with the help of libraries such as NLTK or scikit-learn.
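The difference between the two styles can be sketched with Python's built-in sqlite3 module. Everything here is illustrative: the `cars` table, its columns, the reference year 2021, and the `is_manual_feature` function are hypothetical, sqlite3 only stands in for a real data source, and the sketch does not show the interface Layer expects inside a feature source file.

```python
import sqlite3

# Stand-in source table; the table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cars (id INTEGER PRIMARY KEY, model_year INTEGER, transmission TEXT)")
conn.executemany(
    "INSERT INTO cars VALUES (?, ?, ?)",
    [(1, 2015, "manual"), (2, 2019, "automatic")],
)

# SQL-style feature: the transformation is a single query, one row per ID.
AGE_FEATURE_SQL = """
SELECT id AS ID, 2021 - model_year AS age
FROM cars
ORDER BY id
"""
age_rows = conn.execute(AGE_FEATURE_SQL).fetchall()


# Python-style feature: the same kind of result, computed in code.
def is_manual_feature(rows):
    """Map (id, transmission) rows to (id, 1 if manual else 0)."""
    return [(car_id, int(transmission == "manual")) for car_id, transmission in rows]


manual_rows = is_manual_feature(
    conn.execute("SELECT id, transmission FROM cars ORDER BY id").fetchall()
)
```

Both results have the same shape, an ID plus one value per row, which is what allows features from different source files to be joined on the primary key.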