SPARSE Vectors

The storage format of a vector can be specified as SPARSE or DENSE. Sparse vectors are vectors that typically have a large number of dimensions but with very few non-zero dimension values, while Dense vectors are vectors where every dimension stores a value, including zero-values.

Sparse vectors can be generated by Sparse Encoding models such as SPLADE or BM25. Generally speaking, sparse models such as SPLADE outperform dense models, such as BERT and All-MiniLM, in keyword awareness search. They are also widely used for Hybrid Vector Search by combining sparse and dense vectors.

Conceptually, a sparse vector can be thought of as a vector where every dimension corresponds to a keyword in a certain vocabulary. For a given document, the sparse vector contains non-zero dimension values representing the number of occurrences for the keywords within that document. For example, BERT has a vocabulary size of 30,522 and several sparse encoders generate vectors of this dimensionality.

Representing a dense vector with 30,522 dimensions with only 100 non-zero FLOAT32 dimension values would still require 30,522 * 4 = ~120KB of storage. Such a format takes up a lot of space for no reason as most of the dimension values are 0. This would cause a huge performance deficit compared to the SPARSE representation of such vectors.

That is why when using SPARSE vectors, only the non-zero dimension values are physically stored.

Here is an example of creating and inserting a SPARSE vector:

DROP TABLE my_sparse_tab PURGE;
CREATE TABLE my_sparse_tab (v01 VECTOR(5, INT8, SPARSE));

INSERT INTO my_sparse_tab VALUES('[5,[2,4],[10,20]]');
INSERT INTO my_sparse_tab VALUES('[[2,4],[10,20]]');

SELECT * FROM my_sparse_tab;

V01
--------------------
[5,[2,4],[10,20]]
[5,[2,4],[10,20]]

You can see the difference between the SPARSE textual form within the INSERT statement with the one used for a DENSE vector. The SPARSE textual form looks like:

'[Total Dimension Count, [Dimension Index Array], [Dimension Value Array]]'

The example uses the number of dimensions in total (5 here but it is optional to specify it in this case as it is defined in the column's declaration), then gives the list of coordinates that have non-zero values, then the list of the corresponding values. In this example, coordinate 2 has the value 10 and coordinate 4 has the value 20. Coordinates 1, 3, and 5 have the value 0.

It is not permitted to use a DENSE textual form for SPARSE vectors and vice versa. However, it is possible to use vector functions to transform one into the other as illustrated in the following sample code using the table my_sparse_tab (created in the previous snippet):

The following INSERT statement fails:

INSERT INTO my_sparse_tab VALUES('[0, 10, 0, 20, 0]');

Error starting at line : 1 in command -
INSERT INTO my_sparse_tab VALUES ('[0,10,0,20,0]')
Error at Command Line : 1 Column : 33
Error report -
SQL Error: ORA-51833: Textual input conversion between sparse and dense vector is not
supported.

However, this insertion works:

INSERT INTO my_sparse_tab VALUES (TO_VECTOR('[0,0,10,0,20]', 5, INT8, DENSE));

SELECT * FROM my_sparse_tab;

V01
____________________
[5,[2,4],[10,20]]
[5,[2,4],[10,20]]
[5,[3,5],[10,20]]

You can also transform a SPARSE vector into a DENSE textual form if needed and vice versa:

SELECT FROM_VECTOR(v01 RETURNING CLOB FORMAT DENSE) 
FROM my_sparse_tab 
WHERE ROWNUM<2;

FROM_VECTOR(V01RETURNINGCLOBFORMATDENSE)
___________________________________________
[0,10,0,20,0]

Note:

The RETURNING clause used in the preceding example can also return a VARCHAR2 or a BLOB.