A dataset is a reference to your data in an external data source. You can add a database table or a CSV file in cloud file storage as a dataset to the Layer Data Catalog.
Datasets are first-class entities in Layer. The Layer Data Catalog centralizes all your datasets from different sources for better discovery and efficient analysis. You can run complex quality tests on your datasets, check their profiles before starting a new ML project, or check which featuresets or ML models depend on them.
Datasets are configured in a dataset.yml file. The fields of this file are defined below.
Version of the dataset definition file. This ensures that backwards-incompatible changes in the config format do not break the Layer CLI.
Unique name of this dataset, used to refer to it within this project.
Description of the dataset. We recommend writing a description that future coworkers (or future you) will be grateful for.
This tells Layer that this is a dataset from an external data source. The only option for datasets is "source".
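To make the field definitions above concrete, here is a minimal Python sketch that models them and checks the one rule stated above (the type must be "source"). The class name, attribute names, the version number 1, and the description text are all illustrative assumptions; they are not Layer's actual dataset.yml keys or values, so consult the Layer reference for the real schema.

```python
from dataclasses import dataclass


# Hypothetical stand-in for the fields of a dataset definition.
# Attribute names are illustrative assumptions, not Layer's real schema.
@dataclass(frozen=True)
class DatasetDefinition:
    version: int           # version of the definition file format
    name: str              # unique name used to refer to the dataset in the project
    description: str       # human-readable description for future readers
    type: str = "source"   # datasets only support the "source" type

    def validate(self) -> None:
        """Raise ValueError if the definition breaks the rules described above."""
        if not self.name.strip():
            raise ValueError("name must not be empty")
        if self.type != "source":
            raise ValueError(
                f"unsupported type {self.type!r}: only 'source' is allowed"
            )


# Placeholder values for illustration only.
order = DatasetDefinition(
    version=1,
    name="order",
    description="Orders placed by customers, one row per order.",
)
order.validate()  # passes silently because type defaults to "source"
```

Calling validate() on a definition with any other type raises a ValueError, which mirrors the statement that "source" is the only option for datasets.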
You can load the order dataset configured above by entering the following commands in Jupyter Notebook cells:
You can also refer to this dataset in a SQL Feature query like this: