CREATE_PREFERENCE

Use the DBMS_VECTOR_CHAIN.CREATE_PREFERENCE helper procedure to create a vectorizer preference, to be used when creating or updating hybrid vector indexes.

Purpose

To create a vectorizer preference.

This allows you to customize vector search parameters of a hybrid vector indexing pipeline. The goal of a vectorizer preference is to provide you with a straightforward way to configure how to chunk or embed your documents, without requiring a deep understanding of various chunking or embedding strategies.

Usage Notes

A vectorizer preference is a JSON object that collectively holds user-specified values related to the following chunking, embedding, or vector index creation parameters:

  • Chunking (UTL_TO_CHUNKS and VECTOR_CHUNKS)

  • Embedding (UTL_TO_EMBEDDING, UTL_TO_EMBEDDINGS, and VECTOR_EMBEDDING)

  • Vector index creation (distance, accuracy, and vector_idxtype)

All vector index preferences follow the same JSON syntax as defined for their corresponding DBMS_VECTOR and DBMS_VECTOR_CHAIN APIs.

After creating a vectorizer preference, you can use the VECTORIZER parameter to pass the preference name in the paramstring of the PARAMETERS clause for the CREATE HYBRID VECTOR INDEX and ALTER INDEX SQL statements.

Creating a preference is optional. If you do not specify any optional preference, then the index is created with system defaults.

Syntax

DBMS_VECTOR_CHAIN.CREATE_PREFERENCE (
    PREF_NAME     IN VARCHAR2,
    PREF_TYPE     IN VARCHAR2,
    PARAMS        IN JSON DEFAULT NULL
);

PREF_NAME

Specify the name of the vectorizer preference to create.

PREF_TYPE

Type of preference. The only supported preference type is:

DBMS_VECTOR_CHAIN.VECTORIZER

PARAMS

Specify vector search-specific parameters in JSON format:

Embedding Parameter:

{ "model" : <embedding_model_for_vector_generation> }

For example:

{ "model" : "MY_INDB_MODEL" }

model specifies the name under which your ONNX embedding model is stored in the database.

If you do not have an in-database embedding model in ONNX format, then perform the steps listed in Oracle Database AI Vector Search User's Guide.
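A minimal preference, then, needs only the model name. The following sketch assumes an ONNX embedding model has already been loaded into the database under the placeholder name my_doc_model; all other settings fall back to system defaults:

```sql
-- Minimal sketch: only the embedding model is named; chunking and
-- vector index settings use system defaults.
-- "my_doc_model" and "my_model_spec" are placeholder names.
BEGIN
  DBMS_VECTOR_CHAIN.CREATE_PREFERENCE(
    'my_model_spec',
    DBMS_VECTOR_CHAIN.VECTORIZER,
    JSON('{ "model" : "my_doc_model" }'));
END;
/
```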

Chunking Parameters:

{
    "by"           :  mode,
    "max"          :  max,
    "overlap"      :  overlap,
    "split"        :  split_condition,
    "vocabulary"   :  vocabulary_name,
    "language"     :  nls_language,
    "normalize"    :  normalize_mode,
    "extended"     :  boolean
}

For example:

JSON(
 '{ "by"           :    "vocabulary",
    "max"          :    "100",
    "overlap"      :    "0",
    "split"        :    "none",
    "vocabulary"   :    "myvocab",
    "language"     :    "american",
    "normalize"    :    "all" 
  }')

If you specify split as custom and normalize as options, then you must additionally specify the custom_list and norm_options parameters, respectively:

JSON(
 '{ "by"           :    "vocabulary",
    "max"          :    "100",
    "overlap"      :    "0",
    "split"        :    "custom",
    "custom_list"  :    [ "<p>" , "<s>" ],
    "vocabulary"   :    "myvocab",
    "language"     :    "american",
    "normalize"    :    "options",
    "norm_options" :    [ "whitespace" ] 
  }')
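Putting these together, a chunking-focused preference might look like the following sketch. The names my_chunk_spec, my_doc_model, and myvocab are placeholders; myvocab is assumed to have been loaded beforehand with DBMS_VECTOR_CHAIN.CREATE_VOCABULARY:

```sql
-- Sketch (placeholder names): pair an in-database ONNX model with
-- vocabulary-based chunking settings in one vectorizer preference.
BEGIN
  DBMS_VECTOR_CHAIN.CREATE_PREFERENCE(
    'my_chunk_spec',
    DBMS_VECTOR_CHAIN.VECTORIZER,
    JSON('{ "model"      : "my_doc_model",
            "by"         : "vocabulary",
            "max"        : "100",
            "overlap"    : "0",
            "split"      : "none",
            "vocabulary" : "myvocab",
            "language"   : "american",
            "normalize"  : "all" }'));
END;
/
```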

The following describes each chunking parameter and its acceptable values:

by

Specify a mode for splitting your data, that is, to split by counting the number of characters, words, or vocabulary tokens.

Valid values:

  • characters (or chars):

    Splits by counting the number of characters.

  • words:

    Splits by counting the number of words.

    Words are defined as sequences of alphabetic characters, sequences of digits, individual punctuation marks, or symbols. For segmented languages without whitespace word boundaries (such as Chinese, Japanese, or Thai), each native character is considered a word (that is, unigram).

  • vocabulary:

    Splits by counting the number of vocabulary tokens.

    Vocabulary tokens are words or word pieces, recognized by the vocabulary of the tokenizer that your embedding model uses. You can load your vocabulary file using the chunker helper API DBMS_VECTOR_CHAIN.CREATE_VOCABULARY.

    Note: For accurate results, ensure that the chosen model matches the vocabulary file used for chunking. If you are not using a vocabulary file, then ensure that the input length is defined within the token limits of your model.

Default value: words

max

Specify a limit on the maximum size of each chunk. This setting splits the input text at a fixed point where the maximum limit occurs in the larger text. The units of max correspond to the by mode, that is, to split data when it reaches the maximum size limit of a certain number of characters, words, numbers, punctuation marks, or vocabulary tokens.

Valid values:

  • by characters: 50 to 4000 characters

  • by words: 10 to 1000 words

  • by vocabulary: 10 to 1000 tokens

Default value: 100

split [by]

Specify where to split the input text when it reaches the maximum size limit. This helps to keep related data together by defining appropriate boundaries for chunks.

Valid values:

  • none:

    Splits at the max limit of characters, words, or vocabulary tokens.

  • newline, blankline, and space:

    These are single-split character conditions that split at the last split character before the max value.

    Use newline to split at the end of a line of text. Use blankline to split at the end of a blank line (a sequence of newline characters, such as two newlines). Use space to split at the end of a blank space.

  • recursively:

    This is a multiple-split character condition that breaks the input text using an ordered list of characters (or sequences).

    recursively is predefined as blankline, newline, space, none, in this order:

    1. If the input text is more than the max value, then split by the first split character.

    2. If that fails, then split by the second split character.

    3. And so on.

    4. If no split characters exist, then split by max wherever it appears in the text.

  • sentence:

    This is an end-of-sentence split condition that breaks the input text at a sentence boundary.

    This condition automatically determines sentence boundaries by using knowledge of the input language's sentence punctuation and contextual rules. This language-specific condition relies mostly on end-of-sentence (EOS) punctuations and common abbreviations.

    Contextual rules are based on word information, so this condition is only valid when splitting the text by words or vocabulary (not by characters).

    Note: This condition obeys the by word and max settings, and thus may not determine accurate sentence boundaries in some cases. For example, when a sentence is larger than the max value, it splits the sentence at max. Similarly, it includes multiple sentences in the text only when they fit within the max limit.

  • custom:

    Splits based on a custom split characters list. You can provide custom sequences up to a limit of 16 split character strings, with a maximum length of 10 each.

    Specify an array of valid text literals using the custom_list parameter.

    {
        "split"        :  "custom",
        "custom_list"  :  [ "split_chars1", ... ]
    }

    For example:

    {
        "split"        :    "custom",
        "custom_list"  :    [ "<p>" , "<s>" ]
    }

    Note: You can omit sequences only for tab (\t), newline (\n), and linefeed (\r).

Default value: recursively
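For example, the following fragment (a sketch) keeps chunks aligned to sentence boundaries while counting words, which is valid because the sentence condition requires splitting by words or vocabulary:

```sql
JSON('{ "by" : "words", "max" : "120", "split" : "sentence" }')
```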

overlap

Specify the amount (as a positive integer literal or zero) of the preceding text that the chunk should contain, if any. This helps in logically splitting up related text (such as a sentence) by including some amount of the preceding chunk text.

The amount of overlap depends on how the maximum size of the chunk is measured (in characters, words, or vocabulary tokens). The overlap begins at the specified split condition (for example, at newline).

Valid value: 5% to 20% of max

Default value: 0
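For example, with max set to 200 words, a valid overlap ranges from 10 to 40 words (5% to 20% of max). The fragment below (a sketch) carries the last 20 words, that is 10% of max, of each chunk into the next:

```sql
JSON('{ "by" : "words", "max" : "200", "overlap" : "20" }')
```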

language

Specify the language of your input data.

This clause is important, especially when your text contains certain characters (for example, punctuations or abbreviations) that may be interpreted differently in another language.

Valid values: any NLS-supported language name or abbreviation.

Note: You must use escape characters with any language abbreviation that is also a SQL reserved word (for example, language abbreviations such as IN, AS, OR, IS).

For example:

SELECT dbms_vector_chain.utl_to_chunks('this is an example', 
   JSON('{ "language" : "\"in\"" }')) 
from dual;
SELECT dbms_vector_chain.utl_to_chunks('this is an example', 
   JSON_OBJECT('language' value '"in"' RETURNING JSON)) 
from dual;

Default value: NLS_LANGUAGE from session

normalize

Automatically pre-processes or post-processes issues (such as multiple consecutive spaces and smart quotes) that may arise when documents are converted into text. Oracle recommends that you use a normalization mode to extract high-quality chunks.

Valid values:

  • none:

    Applies no normalization.

  • all:

    Normalizes common multibyte (Unicode) punctuation to standard single-byte form.

  • options:

    Specify an array of normalization options using the norm_options parameter.

    {
        "normalize"    :  "options",
        "norm_options" :  [ "normalize_option1", ... ]
    }
    • punctuation:

      Converts quotes, dashes, and other punctuation characters supported in the character set of the text to their common ASCII form. For example:

      • U+2013 (En Dash) maps to U+002D (Hyphen-Minus)

      • U+2018 (Left Single Quotation Mark) maps to U+0027 (Apostrophe)

      • U+2019 (Right Single Quotation Mark) maps to U+0027 (Apostrophe)

      • U+201B (Single High-Reversed-9 Quotation Mark) maps to U+0027 (Apostrophe)

    • whitespace:

      Minimizes whitespace by eliminating unnecessary characters.

      For example, retain blanklines, but remove any extra newlines and interspersed spaces or tabs: " \n \n " => "\n\n"

    • widechar:

      Normalizes wide, multi-byte digits and (a-z) letters to single-byte.

      These are multi-byte equivalents for 0-9 and a-z A-Z, which can show up in Chinese, Japanese, or Korean text.

    For example:

    {
        "normalize"    :  "options",
        "norm_options" :  [ "whitespace" ]
    }

Default value: none

extended

Increases the output limit of a VARCHAR2 string to 32767 bytes, without requiring you to set the max_string_size parameter to extended.

Default value: 4000 or 32767 (when max_string_size=extended)
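This matters, for instance, with multibyte text, where a chunk of up to 4000 characters can exceed 4000 bytes. The following fragment (a sketch) enables the larger output limit:

```sql
JSON('{ "by" : "characters", "max" : "4000", "extended" : true }')
```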

Vector Index Parameters:

{ 
  "distance"        :  <vector_distance>,
  "accuracy"        :  <vector_accuracy>, 
  "vector_idxtype"  :  <vector_idxtype>
}

For example:

{ 
  "distance"        :  "COSINE",
  "accuracy"        :  95,
  "vector_idxtype"  :  "HNSW"
}

The following describes each vector index parameter:

distance

Distance metric or mathematical function used to compute the distance between vectors:

  • COSINE
  • MANHATTAN
  • DOT
  • EUCLIDEAN
  • L2_SQUARED
  • EUCLIDEAN_SQUARED

Note: Currently, the HAMMING and JACCARD vector distance metrics are not supported with hybrid vector indexes.

For detailed information on each of these metrics, see Vector Distance Functions and Operators.

Default value: COSINE

accuracy

Target accuracy at which the approximate search should be performed when running an approximate search query using vector indexes.

As explained in Understand Approximate Similarity Search Using Vector Indexes, you can specify non-default target accuracy values either by specifying a percentage value or by specifying internal parameters values, depending on the index type you are using.

  • For a Hierarchical Navigable Small World (HNSW) approximate search:

    In the case of an HNSW approximate search, you can specify a target accuracy percentage value to influence the number of candidates considered during the search. This is automatically calculated by the algorithm. A value of 100 tends to produce results similar to an exact search, although the system may still use the index rather than perform an exact search; the optimizer may choose to use the index if doing so is faster given the predicates in the query. Instead of specifying a target accuracy percentage value, you can specify the EFSEARCH parameter to impose a maximum number of candidates to be considered while probing the index. The higher that number, the higher the accuracy.

    For detailed information, see Understand Hierarchical Navigable Small World Indexes.

  • For an Inverted File Flat (IVF) approximate search:

    In the case of an IVF approximate search, you can specify a target accuracy percentage value to influence the number of partitions used to probe the search. This is automatically calculated by the algorithm. A value of 100 tends to impose an exact search, although the system may still use the index rather than perform an exact search; the optimizer may choose to use the index if doing so is faster given the predicates in the query. Instead of specifying a target accuracy percentage value, you can specify the NEIGHBOR PARTITION PROBES parameter to impose a maximum number of partitions to be probed by the search. The higher that number, the higher the accuracy.

    For detailed information, see Understand Inverted File Flat Vector Indexes.

Valid range for both HNSW and IVF vector indexes is:

> 0 and <= 100

Default value: None

vector_idxtype

Type of vector index to create:

  • HNSW for an HNSW vector index

  • IVF for an IVF vector index

For detailed information on each of these index types, see Manage the Different Categories of Vector Indexes.

Default value: IVF
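The vector index parameters can be combined with an embedding model in a single preference. The following sketch (placeholder names my_idx_spec and my_doc_model) requests an HNSW index built with the EUCLIDEAN distance metric at a 90 percent target accuracy:

```sql
-- Sketch (placeholder names): vectorizer preference selecting an
-- HNSW vector index with EUCLIDEAN distance and 90% target accuracy.
BEGIN
  DBMS_VECTOR_CHAIN.CREATE_PREFERENCE(
    'my_idx_spec',
    DBMS_VECTOR_CHAIN.VECTORIZER,
    JSON('{ "model"          : "my_doc_model",
            "distance"       : "EUCLIDEAN",
            "accuracy"       : 90,
            "vector_idxtype" : "HNSW" }'));
END;
/
```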

Paths Parameter:

This field allows specification of an array of path objects. There can be many path objects, and each path object must specify a type and a path_list.

Note: If you do not specify the paths field, the entire document is considered.

"paths":[
        {"type"      : "<path_type>",
         "path_list" : ["<path_list_array>"]
        }
        ]

Let us consider a sample JSON document:

 {
      "person": 
       {
         "bio": "James is a data scientist who specializes in natural language .. ",
         "profile":
          {
              "text" : "James is a data scientist with expertise in Python programming...",
              "embedding" : [1.60541728E-001,5.76677322E-002,4.0473938E-003,1.2037459E-001,-5.98970801E-004, ..]
          },
         "avatar": "https://example.com/images/James.jpg"
       },
      "product": 
      {
        "description": "A social media analytics tool. It helps brands track...",
        "image": "https://example.com/images/data_tool.jpg",
        "embedding" : [1.60541728E-001,5.76677322E-002,4.0473938E-003,1.2037459E-001,-5.98970801E-004, ..]
      }
 }

And a path_list corresponding to the above JSON is provided here:

"paths": [
         {"type"       : "VECTOR",
          "path_list" : ["$.person.profile.embedding", "$.product.embedding"]
         },
         {"type"       : "STRING",
          "path_list" : ["$.person.bio", "$.product.description"]
         }
         ] 

In each path object, type identifies the kind of data at the listed locations (VECTOR or STRING in the example above), and path_list is an array of JSON path expressions to be processed.

Example

begin
  DBMS_VECTOR_CHAIN.CREATE_PREFERENCE(
    'my_vec_spec',
     DBMS_VECTOR_CHAIN.VECTORIZER,
     json('{ "vector_idxtype" :  "hnsw",
             "model"          :  "my_doc_model",
             "by"             :  "words",
             "max"            :  100,
             "overlap"        :  10,
             "split"          :  "recursively",
             "language"       :  "english",
             "paths"          :  [
                                 {
                                  "type"      : "VECTOR",
                                  "path_list" : ["$.person.profile.embedding"]
                                  }
                                 ]  
              }'));
end;
/

CREATE HYBRID VECTOR INDEX my_hybrid_idx on 
    doc_table(text_column) 
    parameters('VECTORIZER my_vec_spec');