MySQL HeatWave User Guide
The VECTOR_STORE_LOAD routine generates
vector embedding for the specified files or folders that are
stored in the bucket, and loads the embeddings into a new vector
store table.
This routine creates an asynchronous task which loads vector store tables in the background. It also returns a query that you can run to track the status of the vector store load task that is running in the background.
It is recommended that you create five or less vector store load tasks at a time. Too many tasks running at the same time might cause overloading issues.
This topic contains the following sections:
To learn about the privileges you need to run this routine, see Section 7.3, “MySQL HeatWave GenAI Roles and Privileges”.
mysql>CALL sys.VECTOR_STORE_LOAD('URI'[,options]);options: JSON_OBJECT(keyvalue[,keyvalue]...)keyvalue: { 'formats', JSON_ARRAY('Format'[, 'Format'] ...) |'schema_name', 'SchemaName' |'table_name', 'TableName' |'region', 'Region' |'task_name', 'TaskName' |'language', 'Language' |'embed_model_id', {'EmbeddingModelID'} |'description', 'Description' |'uris', JSON_ARRAY(urioptions[,urioptions] ...) |'ocr', {true|false} |'chunking',chunkingoptions}
Following are VECTOR_STORE_LOAD parameters:
URI: specifies the unique
reference index (URI) or pre-authenticated request (PAR)
of the bucket files or folders to be ingested into the
vector store.
A URI is considered to be one of the following:
A
glob
pattern, if it contains at least one unescaped
? or *
character.
A prefix, if it is not a pattern and ends with a
/ character like a folder path.
A file path, if it is neither a glob pattern nor a prefix.
To learn how to create PAR for your , see Creating a PAR request in .
options: specifies optional parameters
as key-value pairs in JSON format. It can include the
following parameters:
formats: specifies the list of
formats to be loaded. The supported file formats are
pdf, ppt,
txt, html, and
doc. By default, the routine uses
all the supported formats.
If the routine detects multiple files with the same
or different file formats in a single load, it
creates a separate table for every format it finds.
The table name for each format is the specified or
default table name followed by the format:
TableName_Format.
schema_name: specifies the name
of the schema where the vector embeddings are to be
loaded. By default, this procedure uses the current
schema from the session.
table_name: specifies the name of
the vector store table to create. By default, the
routine generates a unique table name with format
vector_store_data_x, where
x is a counter.
region: specifies the region of
the bucket. Default value is the region where the
current DB System is running.
task_name: specifies a name for
the loading task to be mentioned in the task status.
Default value is Vector Store
Loader.
language: specifies the text
content language used in the files to be ingested
into the vector store. To set the value of the
language parameter, use the
two-letter ISO 639-1 code for the
language. This parameter is available as of MySQL
9.0.1-u1.
Default value is en.
For possible values, to view the list of supported languages, see Languages.
embed_model_id: specifies the
embedding model to use for encoding the text.
As of 9.3.0, default value is
multilingual-e5-small. In earlier
versions of MySQL, default value is
minilm for English text and
multilingual-e5-small for text in
languages other than English.
For possible values, to view the list of available embedding models, see MySQL HeatWave In-Database Embedding Models and OCI Generative AI Service Embedding Models. OCI Generative AI Service models are supported as of MySQL 9.5.0.
The VECTOR_STORE_LOAD routine
does not support
OCI Generative AI Service Embedding Models.
If you want to use
OCI Generative AI Service Embedding Models,
then you can use the
HEATWAVE_LOAD
routine to ingest files into a vector store.
This parameter is available as of MySQL 9.2.1.
description: specifies a
description of document collection being loaded to
be mentioned in the task status. Default value is
NULL.
uris: specifies a list of
additional URIs to include along with an optional
name for the vector store table to be created for
the specified URI. Default value is
NULL.
Each object in the list can include the following parameters:
urioptions: JSON_OBJECT(urioptskeyvalue[,urioptskeyvalue])urioptskeyvalue: { 'uri', 'URI' |'table_name', 'TableName' }
uri: specifies the additional
URI. If only uri is provided,
the routine uses the specified URI as an
additional URI, and loads it into the main table
options.table_name or the
generated table with the unique table name.
table_name: if both
uri and
table_name are provided, the
routine loads the specified URI into the
specified table.
If only table_name is
provided, the routine loads the specified vector
store table into MySQL HeatWave.
ocr: specifies whether to enable
or disable
Optical
Character Recognition (OCR). If set to
false, disables OCR. Default
value is true, which means OCR is
enabled by default.
This parameter is available as of MySQL 9.1.1.
As of MySQL 9.1.2, the default value is
true.
As of MySQL 9.1.1, default value is
false.
chunking: specifies parameter
values for customized text segmentation during
vector store creation.
It can include the following parameters:
chunkingoptions: JSON_OBJECT(chunkingkeyvalue[,chunkingkeyvalue]...)chunkingkeyvalue: { 'split_by', {'page'|'paragraph'|'sentence'|'document'|'recursive'} |'max',MaxValue|'by', {'characters'|'words'} |'truncate', {true|false} |'overlap',OverlapValue}
split_by: specifies the
method to use for splitting the text into
segments. It can be one of the following:
page: for text
segmentation based on the pagination
available in the document. This segmentation
method is supported for PDF and PPT
documents only. If used for other documents,
the routine falls back to the
document text
segmentation method for unsupported
documents.
paragraph: for text
segmentation based on the paragraphs
identified in the document. Wherein, a
paragraph is a piece of text separated from
another piece of text using
\n\n characters.
sentence: for text
segmentation based on the sentences
identified in the document. Wherein, a
sentence is a sequence of words that is
separated from another sequence of words
using a punctuation that marks the end of a
sentence: .,
!, or
?. For PDF files, when
OCR is enabled, any sentence that spills
into the next page is considered as a two
separate sentences.
document: for putting an
entire document into one text segment. In
case the max parameter is
not set, you cannot overlap across
documents.
recursive: for using the
default text segmentation method used while
creating vector store tables in previous
versions of MySQL. This method provides
backward compatibility with vector store
tables created in previous versions of
MySQL.
Default value is recursive.
max: specifies the maximum
number of characters or words to be included in
each segment. If left unspecified, there is no
maximum size enforcement on the chunks of text.
For words, this value can be up to
100000 and for characters,
this value can be up to
1000000. This is unspecified
by default.
by: specifies the unit to use
for defining the maximum and overlap text
segment limits using the max
and overlap parameter. It can
be set to characters or
words. Default value is
characters.
truncate: if set to
true, enables truncation of
text segments that are too large for the
embedding model to handle without truncation.
However, when tructation is enabled, the end of
the text segment is trimmed out, and you might
lose the information from the part of the text
segment that is trimmed out.
If set to false, disables
truncation and instead throw an error when a
text segment that is too large for the embeeding
model is found. In this case, the entire load
fails and exits with an error. You can adjust
the text segmentation method used to ensure that
the size of the text segments is within the
segment size limit of the embedding model.
Default value is true.
overlap: specifies the
maximum number of characters or words to overlap
between segments on each side. For words, this
value can be up to 50000 and
for characters, this value can be up to
500000. Default value is
0.
The chunking parameters are
available as of MySQL 9.5.0.
Specifying the file to ingest using the URI in
VECTOR_STORE_LOAD:
mysql> CALL sys.VECTOR_STORE_LOAD('oci://demo_bucket@demo_namespace/heatwave-en.pdf', '{"table_name": "demo_embeddings"}');
Specifying the file to ingest using the PAR in
VECTOR_STORE_LOAD:
mysql> CALL sys.VECTOR_STORE_LOAD('https://demo.objectstorage.us-ashburn-1.oci.customer-oci.com/p/demo-url/n/demo/b/demo-bucket/o/heatwave-en.pdf', '{"table_name": "demo_embeddings_par"}');
Specifying additional options such the schema name, table
name, language, format, and table description in
VECTOR_STORE_LOAD:
mysql> CALL sys.VECTOR_STORE_LOAD('oci://demo_bucket@demo_namespace/german_files/de*', '{"schema_name": "demo_db", "table_name": "german_embeddings", "language": "de", "formats": ["pdf"], "description": "Vector store table containing German PDF files."}');
Tracking the progress of a load task by running the task
query displayed as output for the
VECTOR_STORE_LOAD routine:
As of MySQL 9.3.1:
SELECT mysql_tasks.task_status_brief("TaskID");
The output looks similar to the following:
+-----------------------------------------------------------------------------------------+
| mysql_tasks.task_status_brief("TaskID") |
+-----------------------------------------------------------------------------------------+
| {"data": {"tables_to_load": "[{\"table_name\": \"quickstart_embeddings\", |
| \"load_progress\": 40.0}]"}, "status": "RUNNING", "message": "Loading in progress...", |
| "progress": 40} |
+-----------------------------------------------------------------------------------------+
In earlier versions:
mysql> SELECT id, name, message, progress, status, scheduled_time,estimated_completion_time, estimated_remaining_time, progress_bar FROM mysql_task_management.task_status WHERE id=1\GThe output looks similar to the following:
id: 1
name: Vector Store Loader
message: Task starting.
progress: 0
status: RUNNING
scheduled_time: 2024-07-02 14:42:38
estimated_completion_time: 2024-07-22 10:19:53
estimated_remaining_time: 52.50000
progress_bar: __________
Getting more details about the load task by querying the task logs for the given task id:
SELECT * from mysql_task_management.task_log where task_id = 1;
The output looks similar to the following:
+------------------------------------+---------+----------------------------+------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+-----------+
| id | task_id | log_time | message | data | progress | status |
+------------------------------------+---------+----------------------------+------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+-----------+
| 0x11EF799F5D99054288CC020017091C01 | 1 | 2024-09-23 11:31:24.884514 | Task created by user. | NULL | 0 | SCHEDULED |
| 0x11EF799F5D99812188CC020017091C01 | 1 | 2024-09-23 11:31:24.887685 | Task starting. | {"engine_attributes": "{\"demo_embeddings\":\"'{\\n\\\"dialect\\\": {\\\"format\\\": \\\"pdf\\\", \\\"language\\\": \\\"en\\\", \\\"is_strict_mode\\\": false },\\n\\\"file\\\": [\\n{\\n \\\"name\\\": \\\"demo_folder/demo_file.pdf\\\",\\n \\\"bucket\\\": \\\"demo_bucket\\\",\\n \\\"region\\\": \\\"demo_region\\\",\\n \\\"namespace\\\": \\\"demo_namespace\\\"\\n}\\n]}'\"}"} | 0 | RUNNING |
| 0x11EF799F6390FD9788CC020017091C01 | 1 | 2024-09-23 11:31:34.898219 | Loading in progress... | {"tables_to_load": "[{\"table_name\": \"demo_embeddings\", \"load_progress\": 10.0}]"} | 10 | RUNNING |
| 0x11EF799F668C172F88CC020017091C01 | 1 | 2024-09-23 11:31:39.899271 | Loading in progress... | {"tables_to_load": "[{\"table_name\": \"demo_embeddings\", \"load_progress\": 40.0}]"} | 40 | RUNNING |
| 0x11EF799F6987348588CC020017091C01 | 1 | 2024-09-23 11:31:44.900419 | Loading in progress... | {"tables_to_load": "[{\"table_name\": \"demo_embeddings\", \"load_progress\": 40.0}]"} | 40 | RUNNING |
| 0x11EF799F6C82547D88CC020017091C01 | 1 | 2024-09-23 11:31:49.901634 | Loading in progress... | {"tables_to_load": "[{\"table_name\": \"demo_embeddings\", \"load_progress\": 40.0}]"} | 40 | RUNNING |
+------------------------------------+---------+----------------------------+------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+-----------+