UTL_TO_CHUNKS
Use the DBMS_VECTOR_CHAIN.UTL_TO_CHUNKS
chainable utility function to split a large plain text document into smaller chunks of text.
Purpose
To perform a text-to-chunks transformation. This chainable utility function internally calls the VECTOR_CHUNKS
SQL function for the operation.
To embed a large document, you may first need to split it into multiple appropriate-sized segments or chunks through a splitting process known as chunking (as explained in Understand the Stages of Data Transformations). A chunk can be words (to capture specific words or word pieces), sentences (to capture a specific meaning), or paragraphs (to capture broader themes). A single document may be split into multiple chunks, each transformed into a vector.
Syntax
DBMS_VECTOR_CHAIN.UTL_TO_CHUNKS (
DATA IN CLOB | VARCHAR2,
PARAMS IN JSON default NULL
) return VECTOR_ARRAY_T;
DATA
This function accepts the input data type as CLOB
or VARCHAR2
.
It returns an array of CLOB
s, where each CLOB
contains a chunk along with its metadata in JSON format, as follows:
{
"chunk_id" : NUMBER,
"chunk_offset" : NUMBER,
"chunk_length" : NUMBER,
"chunk_data" : "VARCHAR2(4000)"
}
{"chunk_id":1,"chunk_offset":1,"chunk_length":6,"chunk_data":"sample"}
-
chunk_id
specifies the chunk ID for each chunk. -
chunk_offset
specifies the original position of each chunk in the source document, relative to the start of document which has a position of1
. -
chunk_length
specifies the character length of each chunk. -
chunk_data
displays text pieces from each chunk.
PARAMS
Specify input parameters in JSON format.
{
"by" : mode,
"max" : max,
"overlap" : overlap,
"split" : split_condition,
"custom_list" : [ split_chars1, ... ],
"vocabulary" : vocabulary_name,
"language" : nls_language,
"normalize" : normalize_mode,
"norm_options" : [ normalize_option1, ... ],
"extended" : boolean
}
For example:
JSON('
{ "by" : "vocabulary",
"vocabulary" : "myvocab",
"max" : "100",
"overlap" : "0",
"split" : "custom",
"custom_list" : [ "<p>" , "<s>" ],
"language" : "american",
"normalize" : "options",
"norm_options" : [ "whitespace" ]
}')
Here is a complete description of these parameters:
Parameter | Description and Acceptable Values |
---|---|
|
Specify a mode for splitting your data, that is, to split by counting the number of characters, words, or vocabulary tokens. Valid values:
Default value: |
|
Specify a limit on the maximum size of each chunk. This setting splits the input text at a fixed point where the maximum limit occurs in the larger text. The units of Valid values:
Default value: |
|
Specify where to split the input text when it reaches the maximum size limit. This helps to keep related data together by defining appropriate boundaries for chunks. Valid values:
Default value: |
|
Specify the amount (as a positive integer literal or zero) of the preceding text that the chunk should contain, if any. This helps in logically splitting up related text (such as a sentence) by including some amount of the preceding chunk text. The amount of overlap depends on how the maximum size of the chunk is measured (in characters, words, or vocabulary tokens). The overlap begins at the specified Valid value: Default value: |
|
Specify the language of your input data. This clause is important, especially when your text contains certain characters (for example, punctuations or abbreviations) that may be interpreted differently in another language. Valid values:
Note: You must use escape characters with any language abbreviation that is also a SQL reserved word (for example, language abbreviations such as For example:
Default value: |
|
Automatically pre-processes or post-processes issues (such as multiple consecutive spaces and smart quotes) that may arise when documents are converted into text. Oracle recommends you to use a normalization mode to extract high-quality chunks. Valid values:
Default value: |
|
Increases the output limit of a Default value: |
Example
SELECT D.id doc,
JSON_VALUE(C.column_value, '$.chunk_id' RETURNING NUMBER) AS id,
JSON_VALUE(C.column_value, '$.chunk_offset' RETURNING NUMBER) AS pos,
JSON_VALUE(C.column_value, '$.chunk_length' RETURNING NUMBER) AS siz,
JSON_VALUE(C.column_value, '$.chunk_data') AS txt
FROM docs D,
dbms_vector_chain.utl_to_chunks(D.text,
JSON('{ "by" : "words",
"max" : "100",
"overlap" : "0",
"split" : "recursively",
"language" : "american",
"normalize": "all" }')) C;
End-to-end examples:
To run end-to-end example scenarios using this function, see Perform Chunking With Embedding and Configure Chunking Parameters.
Related Topics
Parent topic: DBMS_VECTOR_CHAIN