Skip to main content

SQL Features

You can use SQL queries to tranform your datasets to extract features. SQL Features are defined with SQL files and referred to from the dataset.yml file. These SQL files contain SQL SELECT statements that select the features desired from a dataset. Every feature definition MUST select the primary keys to support joins.

Here is a simple SQL Feature to extract the Title of the passengers from their Name field:

title.sql

/*
Titanic Survival Project Example
`Title` Feature
*/
WITH titleDF as (
SELECT PassengerId,
regexp_replace(
regexp_replace(
regexp_extract(Name, ' (\\\w+)\\\.',1)
,'^(Don|Countess|Col|Rev|Lady|Capt|Dr|Sir|Jonkheer|Major)$', 'Rare')
,'^(Mlle|Ms|Mme)$', 'Miss'
)
as parsedTitle
FROM titanic
)
SELECT PassengerId,
CASE
WHEN parsedTitle = "Mr" THEN 1
WHEN parsedTitle = "Miss" THEN 2
WHEN parsedTitle = "Mrs" THEN 3
WHEN parsedTitle = "Master" THEN 4
WHEN parsedTitle = "Rare" THEN 5
END as Title
FROM titleDF

You can add this feature to your Layer project by adding it to the dataset.yml

dataset.yml

# Titanic Survival Project Example
#
# Any directory includes an `dataset.yml` will be treated as a dataset
# project. In this `yaml` file, we will define the attributes of our
# featureset.
apiVersion: 1
type: featureset
name: "passenger_features"
description: "Passenger features from Titanic Dataset"
features:
- name: title
description: "Extracts the title of the pasennger from their name (`Zabour, Miss. Hileni` > `Miss`)"
source: title.sql