MySQL HeatWave User Guide
MySQL 9.4.1 introduces the
TRAIN_TEST_SPLIT
routine, which
automatically splits your data into training and testing
datasets.
Two new tables in the same database are created with the following names:
[original_table_name]_train
[original_table_name]_test
The split of the data between training and testing datasets depends on the machine learning task.
Classification: A stratified split of data. For each class in the dataset, 80% of the samples goes into the training dataset, and the remaining goes into the testing dataset. If the number of samples in the 80% subset is fewer than 5, then instead select 5 of the samples for the training dataset.
Regression: A random split of data.
Forecasting: A time-based split of data. Order the table by
the datetime_index
values and select the
first 80% of the samples for the training dataset. Insert
the subsequent samples into the testing dataset.
Unsupervised anomaly detection: A random split of data. Select 80% of the samples for the training dataset and select the remaining samples for the testing dataset.
Semi-supervised anomaly detection: A stratified split of data.
Anomaly detection for log data: A split of data based on primary key values. The first 80% of the samples go into the training dataset. The remaining samples go into the testing dataset. Review requirements when running Anomaly Detection for Logs.
Recommendations: A random split of data.
Topic modeling: A random split of data.
mysql>CALL sys.TRAIN_TEST_SPLIT ('
table_name
', 'target_column_name
', [options
| NULL]);options
: { JSON_OBJECT("key
","value
"[,"key
","value
"] ...)"key","value"
: { ['task', {'classification'|'regression'|'forecasting'|'anomaly_detection'|'log_anomaly_detection'|'recommendation'|'topic_modeling'}] ['datetime_index', 'column
'] ['semisupervised', {'true'|'false'}]
TRAIN_TEST_SPLIT
parameters:
table_name
: You must provide the fully
qualified name of the table that contains the dataset to
split (schema_name.table_name
).
target_column_name
: Classification and
semi-supervised anomaly detection tasks require a target
column. All other tasks do not require a target column. If
a target column is not required you can set this parameter
to NULL
.
options
: Set the following options as
needed as key-value pairs in JSON object format. If no
options are needed, set this to NULL
.
task
: If the machine learning task
is not set, the default task is
classification
.
datetime_index
: The column that has
datetime values. This parameter is required for
forecasting tasks.
The following data types for this column are supported:
semisupervised
: If running an
anomaly detection task, set this to
true
for semi-supervised learning,
or false
for unsupervised learning.
If this is set to NULL
, then the
default value of false
is selected.
A classification task:
mysql>CALL sys.TRAIN_TEST_SPLIT('data_files_db.data_files_1', 'class', JSON_OBJECT('task', 'classification'));
mysql>SHOW TABLES;
+-------------------------+ | Tables_in_data_files_db | +-------------------------+ | data_files_1 | | data_files_1_test | | data_files_1_train | +-------------------------+
A regression task:
mysql>CALL sys.TRAIN_TEST_SPLIT('data_files_db.food_delivery_data', NULL, JSON_OBJECT('task', 'regression'));
mysql>SHOW TABLES;
+--------------------------+ | Tables_in_data_files_db | +--------------------------+ | food_delivery_data | | food_delivery_data_test | | food_delivery_data_train | +--------------------------+
A forecasting task:
mysql>CALL sys.TRAIN_TEST_SPLIT('data_files_db.forecasting_data', NULL, JSON_OBJECT('task', 'forecasting', 'datetime_index', 'timestamp'));
mysql>SHOW TABLES;
+-------------------------+ | Tables_in_data_files_db | +-------------------------+ | forecasting_data | | forecasting_data_test | | forecasting_data_train | +-------------------------+
An unsupervised anomaly detection task:
mysql>CALL sys.TRAIN_TEST_SPLIT('data_files_db.anomaly_detection_data', NULL, JSON_OBJECT('task', 'anomaly_detection'));
mysql>SHOW TABLES;
+------------------------------+ | Tables_in_data_files_db | +------------------------------+ | anomaly_detection_data | | anomaly_detection_data_test | | anomaly_detection_data_train | +------------------------------+
A semi-supervised anomaly detection task:
mysql>CALL sys.TRAIN_TEST_SPLIT('data_files_db.anomaly_detection_semi', 'anomaly', JSON_OBJECT('task', 'anomaly_detection', 'semisupervised', 'true'));
mysql>SHOW TABLES;
+------------------------------+ | Tables_in_data_files_db | +------------------------------+ | anomaly_detection_semi | | anomaly_detection_semi_test | | anomaly_detection_semi_train | +------------------------------+
A task for anomaly detection on log data:
mysql>CALL sys.TRAIN_TEST_SPLIT('data_files_db.anomaly_detection_logs', NULL, JSON_OBJECT('task', 'log_anomaly_detection'));
mysql>SHOW TABLES;
+------------------------------+ | Tables_in_data_files_db | +------------------------------+ | anomaly_detection_logs | | anomaly_detection_logs_test | | anomaly_detection_logs_train | +------------------------------+
A recommendation task:
mysql>CALL sys.TRAIN_TEST_SPLIT('data_files_db.rec_data', NULL, JSON_OBJECT('task', 'recommendation'));
mysql>SHOW TABLES;
+-------------------------+ | Tables_in_data_files_db | +-------------------------+ | rec_data | | rec_data_test | | rec_data_train | +-------------------------+
A topic modeling task:
mysql>CALL sys.TRAIN_TEST_SPLIT('data_files_db.text_data', NULL, JSON_OBJECT('task', 'topic_modeling'));
mysql>SHOW TABLES;
+-------------------------+ | Tables_in_data_files_db | +-------------------------+ | text_data | | text_data_test | | text_data_train | +-------------------------+