CREATE_PREFERENCE
Use the DBMS_VECTOR_CHAIN.CREATE_PREFERENCE
helper
procedure to create a vectorizer preference, to be used when creating or updating hybrid vector
indexes.
Purpose
To create a vectorizer preference.
This allows you to customize vector search parameters of a hybrid vector indexing pipeline. The goal of a vectorizer preference is to provide you with a straightforward way to configure how to chunk or embed your documents, without requiring a deep understanding of various chunking or embedding strategies.
Usage Notes
A vectorizer preference is a JSON object that collectively holds user-specified values related to the following chunking, embedding, or vector index creation parameters:
-
Chunking (
UTL_TO_CHUNKS
andVECTOR_CHUNKS
) -
Embedding (
UTL_TO_EMBEDDING
,UTL_TO_EMBEDDINGS
, andVECTOR_EMBEDDING
) -
Vector index creation (
distance
,accuracy
, andvector_idxtype
)
All vector index preferences follow the same JSON syntax as defined for their corresponding DBMS_VECTOR
and DBMS_VECTOR_CHAIN
APIs.
After creating a vectorizer preference, you can use the VECTORIZER
parameter to pass this preference name in the paramstring
of the PARAMETERS
clause for CREATE_HYBRID_VECTOR_INDEX
and ALTER_INDEX
SQL statements.
Creating a preference is optional. If you do not specify any optional preference, then the index is created with system defaults.
Syntax
DBMS_VECTOR_CHAIN.CREATE_PREFERENCE (
PREF_NAME IN VARCHAR2,
PREF_TYPE IN VARCHAR2,
PARAMS IN JSON default NULL
);
PREF_NAME
Specify the name of the vectorizer preference to create.
PREF_TYPE
Type of preference. The only supported preference type is:
DBMS_VECTOR_CHAIN.VECTORIZER
PARAMS
Specify vector search-specific parameters in JSON format:
Embedding Parameter:
{ "model" : <embedding_model_for_vector_generation> }
For example:
{ "model" : MY_INDB_MODEL }
model
specifies the name under which your ONNX embedding model is stored in the database.
If you do not have an in-database embedding model in ONNX format, then perform the steps listed in Oracle Database AI Vector Search User's Guide.
Chunking Parameters:
{
"by" : mode,
"max" : max,
"overlap" : overlap,
"split" : split_condition,
"vocabulary" : vocabulary_name,
"language" : nls_language,
"normalize" : normalize_mode,
"extended" : boolean
}
For example:
JSON(
'{ "by" : "vocabulary",
"max" : "100",
"overlap" : "0",
"split" : "none",
"vocabulary" : "myvocab",
"language" : "american",
"normalize" : "all"
}')
If you specify split
as custom
and normalize
as options
, then you must additionally specify the custom_list
and norm_options
parameters, respectively:
JSON(
'{ "by" : "vocabulary",
"max" : "100",
"overlap" : "0",
"split" : "custom",
"custom_list" : [ "<p>" , "<s>" ],
"vocabulary" : "myvocab",
"language" : "american",
"normalize" : "options",
"norm_options" : [ "whitespace" ]
}')
The following table describes all the chunking parameters:
Parameter | Description and Acceptable Values |
---|---|
|
Specify a mode for splitting your data, that is, to split by counting the number of characters, words, or vocabulary tokens. Valid values:
Default value: |
|
Specify a limit on the maximum size of each chunk. This setting splits the input text at a fixed point where the maximum limit occurs in the larger text. The units of Valid values:
Default value: |
|
Specify where to split the input text when it reaches the maximum size limit. This helps to keep related data together by defining appropriate boundaries for chunks. Valid values:
Default value: |
|
Specify the amount (as a positive integer literal or zero) of the preceding text that the chunk should contain, if any. This helps in logically splitting up related text (such as a sentence) by including some amount of the preceding chunk text. The amount of overlap depends on how the maximum size of the chunk is measured (in characters, words, or vocabulary tokens). The overlap begins at the specified Valid value: Default value: |
|
Specify the language of your input data. This clause is important, especially when your text contains certain characters (for example, punctuations or abbreviations) that may be interpreted differently in another language. Valid values:
Note: You must use escape characters with any language abbreviation that is also a SQL reserved word (for example, language abbreviations such as For example:
Default value: |
|
Automatically pre-processes or post-processes issues (such as multiple consecutive spaces and smart quotes) that may arise when documents are converted into text. Oracle recommends you to use a normalization mode to extract high-quality chunks. Valid values:
Default value: |
|
Increases the output limit of a Default value: |
Vector Index Parameters:
{
"distance" : <vector_distance>,
"accuracy" : <vector_accuracy>,
"vector_idxtype" : <vector_idxtype>
}
For example:
{
"distance" : COSINE,
"accuracy" : 95,
"vector_idxtype" : HNSW
}
Parameter | Description |
---|---|
|
Distance metric or mathematical function used to compute the distance between vectors:
Note: Currently, the For detailed information on each of these metrics, see Vector Distance Functions and Operators. Default value: |
|
Target accuracy at which the approximate search should be performed when running an approximate search query using vector indexes. As explained in Understand Approximate Similarity Search Using Vector Indexes, you can specify non-default target accuracy values either by specifying a percentage value or by specifying internal parameters values, depending on the index type you are using.
Valid range for both HNSW and IVF vector indexes is:
Default value: None |
|
Type of vector index to create:
For detailed information on each of these index types, see Manage the Different Categories of Vector Indexes. Default value: |
Paths Parameter:
type
and a
path_list
.
Note:
If the user does not specify thepaths
field, the whole document would be considered.
"paths":[
{"type" : "<path_type>",
"path_list" : ["<path_list_array>"]
}
]
Let us consider a sample JSON document:
{
"person":
{
"bio": "James is a data scientist who specializes in natural language .. ",
"profile":
{
"text" : "James is a data scientist with expertise in Python programming...",
"embedding" : [1.60541728E-001,5.76677322E-002,4.0473938E-003,1.2037459E-001,-5.98970801E-004, ..]
},
"avatar": "https://example.com/images/James.jpg"
},
"product":
{
"description": "A social media analytics tool.", "It helps brands track...",
"image": "https://example.com/images/data_tool.jpg",
"embedding" : [1.60541728E-001,5.76677322E-002,4.0473938E-003,1.2037459E-001,-5.98970801E-004, ..]
}
}
And a path_list
corresponding to the above JSON is provided
here:
"paths": [
{"type" : "VECTOR",
"path_list" : ["$.person.profile.embedding", "$.product.embedding"]
},
{"type" : "STRING",
"path_list" : ["$.person.bio", "$.product.description"]
}
]
The following table describes the details of paths
parameter:
Parameter |
Accepted Values |
|
The possible values for this field are:
|
|
Accepts an array of paths with at least one path in valid JSON format -
( Note: For the |
Example
begin
DBMS_VECTOR_CHAIN.CREATE_PREFERENCE(
'my_vec_spec',
DBMS_VECTOR_CHAIN.VECTORIZER,
json('{ "vector_idxtype" : "hnsw",
"model" : "my_doc_model",
"by" : "words",
"max" : 100,
"overlap" : 10,
"split" : "recursively,
"language" : "english",
"paths": : [
{
"type" : "VECTOR",
"path_list" : ["$.person.profile.embedding"]
}
]
}'));
end;
/
CREATE HYBRID VECTOR INDEX my_hybrid_idx on
doc_table(text_column)
parameters('VECTORIZER my_vec_spec');
Related Topics
Parent topic: DBMS_VECTOR_CHAIN