ADQM functions for machine learning
ADQM/ClickHouse provides a set of machine learning functions — aggregate functions implementing machine learning algorithms that you can train on known data sets and then use to predict unknown values.
Basic concepts
In this article, the following basic concepts are used to describe machine learning algorithms and ADQM functions that implement them:

Independent and dependent variables. Independent variables are the input to a machine learning algorithm. Dependent variables (also known as target variables) are the output that depend on independent variables. In other words, independent variables are features of an object (or event) whose values can be used to predict a value of a target variable using a trained machine learning model.

Training and test data sets. Training data is used to train a machine learning model. It should include columns with values of independent variables and a column with target values that the model should learn to predict. Test data is another data set (independent of the training data set) used to evaluate the model fit on the training data.

Model training. A machine learning algorithm finds patterns in the provided training data that map values of independent variables to target variable values, and outputs a machine learning model that captures these patterns. The model can then be used to make predictions based on new data for which values of the target variable are unknown.

Model parameters or regression weights (coefficients). Model parameters (or regression weights) define how independent variables affect a dependent one. For example, if the linear regression equation is , the model parameters are the constants ( is the bias). Model parameters are automatically estimated during model training based on the training data and returned by an ADQM machine learning function.

Model hyperparameters. Hyperparameters are preset configurations that are manually adjusted before the training process starts and determine the model structure and the way it is trained. Properly selected hyperparameters can significantly affect the model efficiency and accuracy.
Machine learning algorithms
ADQM functions for machine learning implement the following regression algorithms.
Simple linear regression
The simpleLinearRegression(<feature>, <target>)
function performs simple linear regression, i.e. calculates a linear relationship between two variables — one independent variable and a target one. Pass columns with independent and target variable values from a training data set as the <feature>
and <target>
function parameters, respectively.
Stochastic regression
To predict values of a target variable based on two or more independent variables, ADQM provides the following multipleregression functions that use the stochastic gradient descent method to train a model:

stochasticLinearRegression([<hyperparameters>])(<target>, <feature1>, <feature2>, …)
— estimates a linear relationship between a target variable (<target>
) and a set of independent variables (<feature1>, <feature2>, …
) and allows prediction of continuous numerical values of the dependent variable based on the observed values of the independent variables. 
stochasticLogisticRegression([<hyperparameters>])(<target>, <feature1>, <feature2>, …)
— allows estimating the probability of an event occurrence based on values of multiple input features. Since the outcome is a probability, the dependent variable can take one of two values — usually0
(the event did not occur) and1
(the event did occur). This algorithm is used for binary classification tasks.
When using a stochastic regression function to build a machine learning model, you can customize four hyperparameters (<hyperparameters>
in the function syntax descriptions above). The following table lists the hyperparameters in the order they should be passed to a function sequentially.
Hyperparameter  Description  Default value 

learning rate 
Step length coefficient, when the gradient descent step is performed. Too high learning rate may cause infinite weights of the model 
0.00001 
l2 regularization coefficient 
L2 regularization coefficient, which may help to avoid overfitting 
0.1 
minibatch size 
Number of elements of the training set whose gradients will be computed and summed to perform one step of gradient descent. Pure stochastic descent uses a single element, however, using a minibatch of a fixed small number of elements (about 10 elements) makes the gradient steps more stable 
15 
method for updating weights 
Method for updating weights. Possible values: 
Adam 
How to use machine learning functions
Machine learning functions are used to fit a model and predict new data.
Fit a model
To fit a model, follow the steps:

Create a table based on a training data set. A sample query:
CREATE TABLE train_data ( feature1 Float64, feature2 Float64, target Float64 ) ENGINE = MergeTree;
For linear regression, independent (
feature1
,feature2
…) and target (target
) variables should be numeric. For logistic regression, the target variable should be binary (takes only two values — for example,0
and1
) and predicted values (probabilities) will range between0
and1
. 
Use a machine learning function with the
State
combinator to fit a model and save its state for later use. An example of a query with thestochasticLinearRegression
function:CREATE TABLE my_model ENGINE = Memory AS SELECT stochasticLinearRegressionState(0.001, 0.1, 10, 'SGD')(target, feature1, feature2) AS state FROM train_data;
If you don’t apply the
State
combinator to a machine learning function, you can fetch the parameters of the created model without saving it. For example, the following query fits a model and returns a column with its parameters (the first values correspond to the regression weights and the last one is bias):stochasticLinearRegression(0.001, 0.1, 10, 'SGD')(target, feature1, feature2) FROM train_data;
Predict new values
After saving a model state into a table, you can use it multiple times to make predictions or merge it with other states to create new (improved) models.

Create a table based on a test data set (
test_data
in the example below). This table should be similar in structure totrain_data
, but may not contain a target value (it will be predicted). 
To make predictions, use the
evalMLMethod
function, which accepts a model state and features to predict on as parameters. For example, the following query returns a column with target variable values predicted based on values of thefeature1
andfeature2
features in thetest_data
table:WITH (SELECT state FROM my_model) AS model SELECT evalMLMethod(model, feature1, feature2) FROM test_data;
Example
Prepare data

Create tables with data for training (
trips_train
) and testing (trips_test
) as described in the New York Taxi Data example that the ClickHouse documentation provides:CREATE TABLE trips_train ( trip_id UInt32, pickup_datetime DateTime, dropoff_datetime DateTime, pickup_longitude Nullable(Float64), pickup_latitude Nullable(Float64), dropoff_longitude Nullable(Float64), dropoff_latitude Nullable(Float64), passenger_count UInt8, trip_distance Float32, fare_amount Float32, extra Float32, tip_amount Float32, tolls_amount Float32, total_amount Float32, payment_type Enum('CSH' = 1, 'CRE' = 2, 'NOC' = 3, 'DIS' = 4, 'UNK' = 5), pickup_ntaname LowCardinality(String), dropoff_ntaname LowCardinality(String) ) ENGINE = MergeTree PRIMARY KEY (pickup_datetime, dropoff_datetime);
CREATE TABLE trips_test ( trip_id UInt32, pickup_datetime DateTime, dropoff_datetime DateTime, pickup_longitude Nullable(Float64), pickup_latitude Nullable(Float64), dropoff_longitude Nullable(Float64), dropoff_latitude Nullable(Float64), passenger_count UInt8, trip_distance Float32, fare_amount Float32, extra Float32, tip_amount Float32, tolls_amount Float32, total_amount Float32, payment_type Enum('CSH' = 1, 'CRE' = 2, 'NOC' = 3, 'DIS' = 4, 'UNK' = 5), pickup_ntaname LowCardinality(String), dropoff_ntaname LowCardinality(String) ) ENGINE = MergeTree PRIMARY KEY (pickup_datetime, dropoff_datetime);

Fill the
trips_train
andtrips_test
tables with data from the trips_0.gz and trips_1.gz files in the S3 bucket, respectively:INSERT INTO trips_train SELECT trip_id, pickup_datetime, dropoff_datetime, pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude, passenger_count, trip_distance, fare_amount, extra, tip_amount, tolls_amount, total_amount, payment_type, pickup_ntaname, dropoff_ntaname FROM s3('https://datasetsdocumentation.s3.euwest3.amazonaws.com/nyctaxi/trips_0.gz', 'TabSeparatedWithNames');
INSERT INTO trips_test SELECT trip_id, pickup_datetime, dropoff_datetime, pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude, passenger_count, trip_distance, fare_amount, extra, tip_amount, tolls_amount, total_amount, payment_type, pickup_ntaname, dropoff_ntaname FROM s3('https://datasetsdocumentation.s3.euwest3.amazonaws.com/nyctaxi/trips_1.gz', 'TabSeparatedWithNames');
Create a machine learning model

Create a machine learning model to predict taxi fare (a target variable is
fare_amount
) depending on the number of passengers and trip distance (thepassenger_count
andtrip_distance
independent variables):SELECT stochasticLinearRegression(0.001, 0.1, 5)(fare_amount, passenger_count, trip_distance) FROM trips_train;
This query returns model parameters that determine the weights of the linear regression equation ( ):
┌─stochasticLinearRegression(0.001, 0.1, 5)(fare_amount, passenger_count, trip_distance)─┐ │ [0.08194997204590103,2.706812913667359,4.897836872083942] │ └────────────────────────────────────────────────────────────────────────────────────────┘ 1 row in set. Elapsed: 0.036 sec. Processed 1.00 million rows, 9.01 MB (27.71 million rows/s., 249.35 MB/s.) Peak memory usage: 2.38 MiB.
As you can see, the fare is not much affected by the number of passengers.

Save the model state to the
taxi_trips_model
table based on the Memory engine:CREATE TABLE taxi_trips_model ENGINE = Memory AS SELECT stochasticLinearRegressionState(0.001, 0.1, 5)(fare_amount, passenger_count, trip_distance) AS state FROM trips_train;
Evaluate the model
Use the trained model to make predictions based on the test data set and compare those predictions to the actual values (note that the evalMLMethod
function accepts the model state and the same set of independent variables that were used to train the model — passenger_count
and trip_distance
):
WITH (SELECT state FROM taxi_trips_model) AS model
SELECT
fare_amount,
evalMLMethod(model, passenger_count, trip_distance) AS prediction
FROM trips_test LIMIT 10;
In the output table, the resulting predictions are not too far from the actual data:
┌─fare_amount─┬─────────prediction─┐ │ 8.5 │ 9.20394916259993 │ │ 13.5 │ 13.714548983540645 │ │ 41.5 │ 41.78716293769729 │ │ 24 │ 22.230999964014817 │ │ 8.5 │ 9.018605114531145 │ │ 6.5 │ 7.6682400731869595 │ │ 19.5 │ 21.287184808552087 │ │ 19 │ 20.377558220769647 │ │ 7.5 │ 7.8100114849842495 │ │ 9 │ 8.954756513157905 │ └─────────────┴────────────────────┘
For more accurate predictions, you can select another set of hyperparameters.