6.7.4.3 Anomaly Detection for Logs

MySQL 9.2.2 introduces the ability to detect anomalies in log data. To perform anomaly detection on logs, log data is cleaned, segemented, and encoded before running anomaly detection. This feature leverages the log template miner Drain3.

Consider the following when running anomaly detection on logs.

The input table can only have the following columns:
- The column containing the logs.
- If including logs from different sources, a column containing the source of each log. The values in this column contain the names of the sources that each log belongs to. These values are used to group each host's logs together. If this column is not present, it is assumed that all logs originate from the same source.
- If including labeled data, a column identifying the labeled log lines. See Semi-supervised Anomaly Detection to learn more.
- As of MySQL 9.4.1, at least one column must act as the primary key to establish the temporal order of logs. If the primary key column (or columns) is not one of the previous required columns (log data, source of log, or label), then you must use the exclude_column_list option when running ML_TRAIN to exclude all primary key columns that don't include required data. See Syntax Examples for Anomaly Detection Training to review relevant examples.
If the input table has additional columns to the ones permitted, you must use the exclude_column_list option when running ML_TRAIN to exclude irrelevant columns.
The data collected for anomaly detection can be unsupervised or semi-supervised. To run semi-supervised anomaly detection, you can provide a separate column in the input table with labels for the labeled log lines. This column labels identified anomalous logs with a value of 1, non-anomalous logs with 0, and unlabeled logs with NULL. See Semi-supervised Anomaly Detection to learn more.
In addition to the anomaly scores included in the output table, you have the option to leverage MySQL HeatWave GenAI to provide textual log summaries.
By default the following parameters are masked in the input data (training or test data): IP, DATETIME, TIME, HEX, IPPORT, and OCID. You have the option to mask additional regex patterns with the additional_masking_regex option.
For versions before MySQL 9.5.0, textual log data is transformed into numerical vectors using TF-IDF (Term Frequency-Inverse Document Frequency), which is a statistical feature extractor that assigns numerical weights to words according to their frequency across documents. As of MySQL 9.5.0, MySQL HeatWave uses a combination of a keyword feature extractor and an embedding model to train models. This allows trained models to capture semantic meanings in log data. You have the option to select the keyword model and any embedding model supported by MySQL HeatWave with the training options keyword_model and embedding_model. The available keyword model options are tf-idf and NULL. To review supported embedding models, run the following query: SELECT sys.ML_LIST_LLMS(); and see models that have capabilities with TEXT_EMBEDDINGS. The default keyword feature extractor is tf-idf, and the default embedding model is multilingual-e5-small. Using an embedding model causes higher memory usage. You can set either embedding_model or keyword_model to NULL, but you cannot set both to NULL.

What's Next

Learn more about the following:
- Anomaly Detection Algorithm Model Types
- Anomaly Detection Learning Types
Learn how to Prepare Data for an Anomaly Detection Model.