Skip to main content

SQL Features

You can use SQL queries to tranform your datasets to extract features. SQL Features are defined with SQL files and referred to from the dataset.yml file. These SQL files contain SQL SELECT statements that select the features desired from a dataset. Every feature definition MUST select the primary keys to support joins.

Here is a simple SQL Feature to extract the Title of the passengers from their Name field:


Titanic Survival Project Example
`Title` Feature
WITH titleDF as (
SELECT PassengerId,
regexp_extract(Name, ' (\\\w+)\\\.',1)
,'^(Don|Countess|Col|Rev|Lady|Capt|Dr|Sir|Jonkheer|Major)$', 'Rare')
,'^(Mlle|Ms|Mme)$', 'Miss'
as parsedTitle
FROM titanic
SELECT PassengerId,
WHEN parsedTitle = "Mr" THEN 1
WHEN parsedTitle = "Miss" THEN 2
WHEN parsedTitle = "Mrs" THEN 3
WHEN parsedTitle = "Master" THEN 4
WHEN parsedTitle = "Rare" THEN 5
END as Title
FROM titleDF

You can add this feature to your Layer project by adding it to the dataset.yml


# Titanic Survival Project Example
# Any directory includes an `dataset.yml` will be treated as a dataset
# project. In this `yaml` file, we will define the attributes of our
# featureset.
apiVersion: 1
type: featureset
name: "passenger_features"
description: "Passenger features from Titanic Dataset"
- name: title
description: "Extracts the title of the pasennger from their name (`Zabour, Miss. Hileni` > `Miss`)"
source: title.sql