CREATE_VOCABULARY
Use the DBMS_VECTOR_CHAIN.CREATE_VOCABULARY chunker helper procedure to load your own token vocabulary file into the database.
Purpose
To create a custom token vocabulary that is recognized by the tokenizer used by your vector embedding model.
A vocabulary contains a set of tokens (words and word pieces) that are collected during a model's statistical training process. You can supply this data to the chunker to help in accurately selecting the text size that approximates the maximum input limit imposed by the tokenizer of your embedding model.
Usage Notes
- Usually, the supported vocabulary files (containing recognized tokens) are included as part of a model's distribution. Oracle recommends that you use the vocabulary files associated with your model.

  If a vocabulary file is not available, then you can download one of the following files, depending on the tokenizer type:

  - WordPiece: Vocabulary file (vocab.txt) for the "bert-base-uncased" (English) or "bert-base-multilingual-cased" model

  - Byte-Pair Encoding (BPE): Vocabulary file (vocab.json) for the "GPT2" model

    Use the following Python script to extract the file:

    ```python
    import json
    import sys

    with open(sys.argv[1], encoding="utf-8") as f:
        d = json.load(f)
    for term in d:
        print(term)
    ```

  - SentencePiece: Vocabulary file (tokenizer.json) for the "xlm-roberta-base" model

    Use the following Python script to extract the file:

    ```python
    import json
    import sys

    with open(sys.argv[1], encoding="utf-8") as f:
        d = json.load(f)
    for entry in d["model"]["vocab"]:
        print(entry[0])
    ```

  Ensure that you save your vocabulary files in UTF-8 encoding.
- You can create a vocabulary based on the tokens loaded in the schema.table.column, using a user-specified vocabulary name (vocabulary_name).

  After loading your vocabulary data, you can use the "by vocabulary" chunking mode (with VECTOR_CHUNKS or UTL_TO_CHUNKS) to split input data by counting the number of tokens.
- You can query these data dictionary views to access existing vocabulary data:

  - ALL_VECTOR_VOCAB displays all available vocabularies.
  - USER_VECTOR_VOCAB displays vocabularies from the schema of the current user.
  - ALL_VECTOR_VOCAB_TOKENS displays a list of tokens from all available vocabularies.
  - USER_VECTOR_VOCAB_TOKENS displays a list of tokens from the vocabularies owned by the current user.
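For example, after creating a vocabulary you can confirm that it was loaded by querying the views listed above. This is a minimal sketch; only the view names come from this page, so any column-level filtering you add is an assumption about the view definitions:

```sql
-- List the vocabularies owned by the current user.
SELECT * FROM user_vector_vocab;

-- List the tokens loaded into those vocabularies.
SELECT * FROM user_vector_vocab_tokens;
```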
Syntax
DBMS_VECTOR_CHAIN.CREATE_VOCABULARY(
PARAMS IN JSON default NULL
);
PARAMS
{
table_name,
column_name,
vocabulary_name,
format,
cased
}
Table 12-20 Parameter Details

Parameter | Description | Required | Default Value
---|---|---|---
table_name | Name of the table (along with the optional table owner) in which you want to load the vocabulary file | Yes | No value
column_name | Column name in the vocabulary table in which you want to load the vocabulary file | Yes | No value
vocabulary_name | User-specified name of the vocabulary, along with the optional owner name (if other than the current owner) | Yes | No value
format | Format of the vocabulary file (for example, bert) | Yes | No value
cased | Character-casing of the vocabulary, that is, whether the vocabulary is to be treated as cased or uncased | No | 
Example
DECLARE
params clob := '{"table_name" : "doc_vocabtab",
"column_name" : "token",
"vocabulary_name" : "doc_vocab",
"format" : "bert",
"cased" : false}';
BEGIN
dbms_vector_chain.create_vocabulary(json(params));
END;
/
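Once the vocabulary is created, it can be referenced from the chunker in the "by vocabulary" mode mentioned in the usage notes. The following is a sketch only: the table doc_tab, its text column, and the MAX, OVERLAP, and SPLIT BY settings are illustrative assumptions, not values prescribed by this procedure:

```sql
-- Sketch: split rows of an assumed doc_tab.text column into chunks whose
-- size is counted in tokens from the doc_vocab vocabulary created above.
SELECT c.*
FROM   doc_tab d,
       VECTOR_CHUNKS(d.text
                     BY VOCABULARY doc_vocab
                     MAX 100
                     OVERLAP 0
                     SPLIT BY sentence) c;
```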
End-to-end example:
To run an end-to-end example scenario using this procedure, see Create and Use Custom Vocabulary.
Parent topic: DBMS_VECTOR_CHAIN