Datasets

A dataset is a reference to your data in an external data source. You can add a database table, or a CSV file in cloud file storage, as a dataset to the Layer Data Catalog.

Datasets are first-class entities in Layer. The Layer Data Catalog centralizes all your datasets from different sources for better discovery and efficient analysis. You can run complex quality tests on your datasets, check their profile before starting a new ML project, or see which featuresets or ML models depend on them.

Dataset configuration

Datasets are configured in a dataset.yml file. The fields are defined below, followed by a complete example.

apiVersion

Version of the dataset definition file. Layer uses it to make sure that backwards-incompatible changes to the config format do not break the Layer CLI. Optional; defaults to 1.

name

Unique name used to refer to this dataset within the project.

description

Description of the dataset. We recommend writing a description that future coworkers (or future you) will be grateful for.

type

Tells Layer that this is a dataset from an external data source. For datasets, source is the only valid value.

materialization

Required. Specifies where the dataset lives, using the target and table_name fields described below.

target

Name of the integration where this dataset lives. You assign names to integrations in Layer Settings > Integrations.

table_name

Name of the table in your database that this dataset references.

# optional. 1 by default.
apiVersion: 1
# required
name: "orders"
# optional
description: "Order details, including table name
and what was ordered."
# required
type: source
# required
materialization:
    target: my_snowflake
    table_name: table_orders
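
The field rules above can be checked before handing the file to Layer. The following is a minimal sketch, not Layer's own validation: the function `validate_dataset_config` is hypothetical, it takes the configuration as an already-parsed Python dictionary, and it assumes that both `target` and `table_name` must be present under `materialization`.

```python
from typing import Any

# Fields marked "required" in the example configuration.
REQUIRED_TOP_LEVEL = ("name", "type", "materialization")
# Assumption: both sub-fields are needed to locate the table.
REQUIRED_MATERIALIZATION = ("target", "table_name")


def validate_dataset_config(config: dict[str, Any]) -> list[str]:
    """Return a list of problems found in a parsed dataset.yml mapping."""
    problems = []
    for field in REQUIRED_TOP_LEVEL:
        if field not in config:
            problems.append(f"missing required field: {field}")
    # "source" is the only documented value for datasets.
    if "type" in config and config["type"] != "source":
        problems.append("type must be 'source' for datasets")
    materialization = config.get("materialization")
    if isinstance(materialization, dict):
        for field in REQUIRED_MATERIALIZATION:
            if field not in materialization:
                problems.append(f"missing materialization.{field}")
    elif "materialization" in config:
        problems.append("materialization must be a mapping")
    return problems


# The example configuration, as it would look once parsed.
orders_config = {
    "apiVersion": 1,
    "name": "orders",
    "description": "Order details, including table name and what was ordered.",
    "type": "source",
    "materialization": {"target": "my_snowflake", "table_name": "table_orders"},
}

print(validate_dataset_config(orders_config))  # prints []
```

An empty list means that every required field is present and that `type` is set to `source`.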

Reuse datasets

You can load the orders dataset configured above by entering the following commands in Jupyter Notebook cells:

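import layer

# Fetch the dataset by the name given in dataset.yml
orders_dataset = layer.get_dataset("orders")
# Load it into a pandas DataFrame and preview the first rows
df = orders_dataset.to_pandas()
df.head()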

You can also refer to this dataset in a SQL Feature query like this:

SELECT
    user_id, AVG(amount) AS average_order_amount
FROM
    orders
GROUP BY 1
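
To see what this query computes, it can be run locally against a stand-in table. The sketch below uses Python's built-in `sqlite3` module with an in-memory database instead of a Layer integration. The sample rows are made up for illustration, and the `orders` schema is assumed to contain only the two columns the query uses.

```python
import sqlite3

# Made-up sample rows, for illustration only: (user_id, amount).
sample_orders = [(1, 10.0), (1, 20.0), (2, 5.0)]

# An in-memory SQLite table stands in for the real orders table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (user_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", sample_orders)

query = """
SELECT
    user_id, AVG(amount) AS average_order_amount
FROM
    orders
GROUP BY 1
"""
# GROUP BY 1 groups by the first selected column, user_id.
average_by_user = dict(conn.execute(query).fetchall())
conn.close()

print(average_by_user)
```

Each key in the resulting dictionary is a `user_id`, and each value is that user's average order amount.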