ADQM functions for machine learning

ADQM/ClickHouse provides a set of machine learning functions — aggregate functions implementing machine learning algorithms that you can train on known data sets and then use to predict unknown values.

Basic concepts

In this article, the following basic concepts are used to describe machine learning algorithms and ADQM functions that implement them:

  • Independent and dependent variables. Independent variables are the input to a machine learning algorithm. Dependent variables (also known as target variables) are the output that depend on independent variables. In other words, independent variables are features of an object (or event) whose values can be used to predict a value of a target variable using a trained machine learning model.

  • Training and test data sets. Training data is used to train a machine learning model. It should include columns with values of independent variables and a column with target values that the model should learn to predict. Test data is another data set (independent of the training data set) used to evaluate the model fit on the training data.

  • Model training. A machine learning algorithm finds patterns in the provided training data that map values of independent variables to target variable values, and outputs a machine learning model that captures these patterns. The model can then be used to make predictions based on new data for which values of the target variable are unknown.

  • Model parameters or regression weights (coefficients). Model parameters (or regression weights) define how independent variables affect a dependent one. For example, if the linear regression equation is y = w1·x1 + w2·x2 + b, the model parameters are the constants w1, w2, and b (b is the bias). Model parameters are automatically estimated during model training based on the training data and returned by an ADQM machine learning function.

  • Model hyperparameters. Hyperparameters are preset configurations that are manually adjusted before the training process starts and determine the model structure and the way it is trained. Properly selected hyperparameters can significantly affect the model efficiency and accuracy.

Machine learning algorithms

ADQM functions for machine learning implement the following regression algorithms.

Simple linear regression

The simpleLinearRegression(<feature>, <target>) function performs simple linear regression, i.e. calculates a linear relationship between two variables — one independent variable and a target one. Pass columns with independent and target variable values from a training data set as the <feature> and <target> function parameters, respectively.
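A call might look like the following sketch (the points table with x and y columns is hypothetical; the function returns a tuple with the slope and intercept of the fitted line):

```sql
-- Hypothetical table `points` with numeric columns x (feature) and y (target)
SELECT simpleLinearRegression(x, y) FROM points;
-- Returns a tuple (k, b) for the fitted line y = k*x + b
```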

Stochastic regression

To predict values of a target variable based on two or more independent variables, ADQM provides the following multiple-regression functions that use the stochastic gradient descent method to train a model:

  • stochasticLinearRegression([<hyperparameters>])(<target>, <feature1>, <feature2>, …​) — estimates a linear relationship between a target variable (<target>) and a set of independent variables (<feature1>, <feature2>, …​) and allows prediction of continuous numerical values of the dependent variable based on the observed values of the independent variables.

  • stochasticLogisticRegression([<hyperparameters>])(<target>, <feature1>, <feature2>, …​) — allows estimating the probability of an event occurrence based on values of multiple input features. Since the outcome is a probability, the dependent variable can take one of two values — usually 0 (the event did not occur) and 1 (the event did occur). This algorithm is used for binary classification tasks.

When using a stochastic regression function to build a machine learning model, you can customize four hyperparameters (<hyperparameters> in the function syntax descriptions above). Pass them to a function in the following order:

  • learning rate — coefficient that scales the step length of each gradient descent step. A learning rate that is too high can cause the model weights to diverge to infinity. Default value: 0.00001.

  • l2 regularization coefficient — L2 regularization coefficient, which may help to prevent overfitting. Default value: 0.1.

  • mini-batch size — number of training set elements whose gradients are computed and summed to perform one gradient descent step. Pure stochastic gradient descent uses a single element; however, using a mini-batch of a fixed small size (about 10 elements) makes the gradient steps more stable. Default value: 15.

  • method for updating weights — one of Adam, SGD, Momentum, or Nesterov. The last two methods require more computation and memory, but offer higher convergence speed and stability than plain stochastic gradient methods. Default value: Adam.
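Hyperparameters are passed positionally before the column arguments. For example, the following sketch sets all four explicitly (here to their default values, assuming a training table train_data with target, feature1, and feature2 columns):

```sql
-- learning rate = 0.00001, l2 coefficient = 0.1, mini-batch size = 15, method = 'Adam'
SELECT stochasticLinearRegression(0.00001, 0.1, 15, 'Adam')(target, feature1, feature2)
FROM train_data;
```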

How to use machine learning functions

Machine learning functions are used to fit a model and predict new data.

Fit a model

To fit a model, follow the steps:

  1. Create a table based on a training data set. A sample query:

    CREATE TABLE train_data
    (
        feature1 Float64,
        feature2 Float64,
        target Float64
    )
    ENGINE = MergeTree;

    For linear regression, independent (feature1, feature2…​) and target (target) variables should be numeric. For logistic regression, the target variable should be binary (takes only two values — for example, 0 and 1) and predicted values (probabilities) will range between 0 and 1.

  2. Use a machine learning function with the State combinator to fit a model and save its state for later use. An example of a query with the stochasticLinearRegression function:

    CREATE TABLE my_model ENGINE = Memory AS SELECT
    stochasticLinearRegressionState(0.001, 0.1, 10, 'SGD')(target, feature1, feature2)
    AS state FROM train_data;

    If you don’t apply the State combinator to a machine learning function, you can fetch the parameters of the created model without saving it. For example, the following query fits a model and returns a column with its parameters (the first values correspond to the regression weights and the last one is the bias):

    SELECT stochasticLinearRegression(0.001, 0.1, 10, 'SGD')(target, feature1, feature2)
    FROM train_data;

Predict new values

After saving a model state into a table, you can use it multiple times to make predictions or merge it with other states to create new (improved) models.

  1. Create a table based on a test data set (test_data in the example below). This table should have the same structure as train_data, but it does not need to contain the target column (its values will be predicted).

  2. To make predictions, use the evalMLMethod function, which accepts a model state and features to predict on as parameters. For example, the following query returns a column with target variable values predicted based on values of the feature1 and feature2 features in the test_data table:

    WITH (SELECT state FROM my_model) AS model SELECT
    evalMLMethod(model, feature1, feature2) FROM test_data;
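For a model trained with stochasticLogisticRegression, evalMLMethod returns probabilities, which you can turn into class labels by thresholding. A sketch, assuming a saved logistic model state in a hypothetical my_logistic_model table:

```sql
WITH (SELECT state FROM my_logistic_model) AS model
SELECT
    evalMLMethod(model, feature1, feature2) AS probability,
    probability > 0.5 AS predicted_class  -- 1 if the event is predicted to occur
FROM test_data;
```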

Example

Prepare data

  1. Create tables with data for training (trips_train) and testing (trips_test) as described in the New York Taxi Data example from the ClickHouse documentation:

    CREATE TABLE trips_train (
        trip_id             UInt32,
        pickup_datetime     DateTime,
        dropoff_datetime    DateTime,
        pickup_longitude    Nullable(Float64),
        pickup_latitude     Nullable(Float64),
        dropoff_longitude   Nullable(Float64),
        dropoff_latitude    Nullable(Float64),
        passenger_count     UInt8,
        trip_distance       Float32,
        fare_amount         Float32,
        extra               Float32,
        tip_amount          Float32,
        tolls_amount        Float32,
        total_amount        Float32,
        payment_type        Enum('CSH' = 1, 'CRE' = 2, 'NOC' = 3, 'DIS' = 4, 'UNK' = 5),
        pickup_ntaname      LowCardinality(String),
        dropoff_ntaname     LowCardinality(String)
    )
    ENGINE = MergeTree
    PRIMARY KEY (pickup_datetime, dropoff_datetime);
    CREATE TABLE trips_test (
        trip_id             UInt32,
        pickup_datetime     DateTime,
        dropoff_datetime    DateTime,
        pickup_longitude    Nullable(Float64),
        pickup_latitude     Nullable(Float64),
        dropoff_longitude   Nullable(Float64),
        dropoff_latitude    Nullable(Float64),
        passenger_count     UInt8,
        trip_distance       Float32,
        fare_amount         Float32,
        extra               Float32,
        tip_amount          Float32,
        tolls_amount        Float32,
        total_amount        Float32,
        payment_type        Enum('CSH' = 1, 'CRE' = 2, 'NOC' = 3, 'DIS' = 4, 'UNK' = 5),
        pickup_ntaname      LowCardinality(String),
        dropoff_ntaname     LowCardinality(String)
    )
    ENGINE = MergeTree
    PRIMARY KEY (pickup_datetime, dropoff_datetime);
  2. Fill the trips_train and trips_test tables with data from the trips_0.gz and trips_1.gz files in the S3 bucket, respectively:

    INSERT INTO trips_train
    SELECT
        trip_id,
        pickup_datetime,
        dropoff_datetime,
        pickup_longitude,
        pickup_latitude,
        dropoff_longitude,
        dropoff_latitude,
        passenger_count,
        trip_distance,
        fare_amount,
        extra,
        tip_amount,
        tolls_amount,
        total_amount,
        payment_type,
        pickup_ntaname,
        dropoff_ntaname
    FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/trips_0.gz', 'TabSeparatedWithNames');
    INSERT INTO trips_test
    SELECT
        trip_id,
        pickup_datetime,
        dropoff_datetime,
        pickup_longitude,
        pickup_latitude,
        dropoff_longitude,
        dropoff_latitude,
        passenger_count,
        trip_distance,
        fare_amount,
        extra,
        tip_amount,
        tolls_amount,
        total_amount,
        payment_type,
        pickup_ntaname,
        dropoff_ntaname
    FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/trips_1.gz', 'TabSeparatedWithNames');

Create a machine learning model

  1. Create a machine learning model to predict taxi fare (a target variable is fare_amount) depending on the number of passengers and trip distance (the passenger_count and trip_distance independent variables):

    SELECT stochasticLinearRegression(0.001, 0.1, 5)(fare_amount, passenger_count, trip_distance)
    FROM trips_train;

    This query returns the model parameters, i.e. the weights of the linear regression equation fare_amount = w1·passenger_count + w2·trip_distance + b (the first two values are the weights w1 and w2, the last one is the bias b):

    ┌─stochasticLinearRegression(0.001, 0.1, 5)(fare_amount, passenger_count, trip_distance)─┐
    │ [0.08194997204590103,2.706812913667359,4.897836872083942]                              │
    └────────────────────────────────────────────────────────────────────────────────────────┘
    
    1 row in set. Elapsed: 0.036 sec. Processed 1.00 million rows, 9.01 MB (27.71 million rows/s., 249.35 MB/s.)
    Peak memory usage: 2.38 MiB.

    As you can see, the first weight (for passenger_count) is close to zero, so the fare is not much affected by the number of passengers.

  2. Save the model state to the taxi_trips_model table based on the Memory engine:

    CREATE TABLE taxi_trips_model ENGINE = Memory AS
    SELECT stochasticLinearRegressionState(0.001, 0.1, 5)(fare_amount, passenger_count, trip_distance)
    AS state FROM trips_train;

Evaluate the model

Use the trained model to make predictions based on the test data set and compare those predictions to the actual values (note that the evalMLMethod function accepts the model state and the same set of independent variables that were used to train the model — passenger_count and trip_distance):

WITH (SELECT state FROM taxi_trips_model) AS model
SELECT
    fare_amount,
    evalMLMethod(model, passenger_count, trip_distance) AS prediction
FROM trips_test LIMIT 10;

In the output table, the resulting predictions are not too far from the actual data:

┌─fare_amount─┬─────────prediction─┐
│         8.5 │   9.20394916259993 │
│        13.5 │ 13.714548983540645 │
│        41.5 │  41.78716293769729 │
│          24 │ 22.230999964014817 │
│         8.5 │  9.018605114531145 │
│         6.5 │ 7.6682400731869595 │
│        19.5 │ 21.287184808552087 │
│          19 │ 20.377558220769647 │
│         7.5 │ 7.8100114849842495 │
│           9 │  8.954756513157905 │
└─────────────┴────────────────────┘

For more accurate predictions, you can select another set of hyperparameters.
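To compare hyperparameter sets objectively rather than eyeballing individual rows, you can compute an aggregate error metric over the whole test set. For example, a sketch that calculates the root-mean-square error for the model above:

```sql
WITH (SELECT state FROM taxi_trips_model) AS model
SELECT sqrt(avg(pow(fare_amount - evalMLMethod(model, passenger_count, trip_distance), 2))) AS rmse
FROM trips_test;
```

The lower the rmse value, the better the chosen hyperparameters fit the data.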
