MySQL HeatWave User Guide
MySQL 9.2.2 introduces the ability to detect anomalies in log data. To perform anomaly detection on logs, log data is cleaned, segemented, and encoded before running anomaly detection. This feature leverages the log template miner Drain3.
Consider the following when running anomaly detection on logs.
The input table can only have the following columns:
The column containing the logs.
If including logs from different sources, a column containing the source of each log. The values in this column contain the names of the sources that each log belongs to. These values are used to group each host's logs together. If this column is not present, it is assumed that all logs originate from the same source.
If including labeled data, a column identifying the labeled log lines. See Semi-supervised Anomaly Detection to learn more.
As of MySQL 9.4.1, at least one column must act as the
primary key to establish the temporal order of logs.
If the primary key column (or columns) is not one of
the previous required columns (log data, source of
log, or label), then you must use the
exclude_column_list option when
running ML_TRAIN to
exclude all primary key columns that don't include
required data. See
Syntax Examples for Anomaly Detection Training
to review relevant examples.
If the input table has additional columns to the ones
permitted, you must use the
exclude_column_list option when running
ML_TRAIN to exclude
irrelevant columns.
The data collected for anomaly detection can be unsupervised or semi-supervised. To run semi-supervised anomaly detection, you can provide a separate column in the input table with labels for the labeled log lines. This column labels identified anomalous logs with a value of 1, non-anomalous logs with 0, and unlabeled logs with NULL. See Semi-supervised Anomaly Detection to learn more.
In addition to the anomaly scores included in the output table, you have the option to leverage MySQL HeatWave GenAI to provide textual log summaries.
By default the following parameters are masked in the
input data (training or test data): IP, DATETIME, TIME,
HEX, IPPORT, and OCID. You have the option to mask
additional regex patterns with the
additional_masking_regex option.
For versions before MySQL 9.5.0, textual log data is
transformed into numerical vectors using TF-IDF (Term
Frequency-Inverse Document Frequency), which is a
statistical feature extractor that assigns numerical
weights to words according to their frequency across
documents. As of MySQL 9.5.0, MySQL HeatWave uses a combination
of a keyword feature extractor and an embedding model to
train models. This allows trained models to capture
semantic meanings in log data. You have the option to
select the keyword model and any embedding model supported
by MySQL HeatWave with the training options
keyword_model and
embedding_model. The available keyword
model options are tf-idf and
NULL. To review supported embedding
models, run the following query: SELECT
sys.ML_LIST_LLMS(); and see models that have
capabilities with
TEXT_EMBEDDINGS. The default keyword
feature extractor is tf-idf, and the
default embedding model is
multilingual-e5-small. Using an
embedding model causes higher memory usage. You can set
either embedding_model or
keyword_model to
NULL, but you cannot set both to
NULL.
Learn more about the following:
Learn how to Prepare Data for an Anomaly Detection Model.