About Oracle Text Indexes

3.1 About Oracle Text Indexes

The discussion of Oracle Text indexes includes the different types of indexes, their structure, the indexing process, and limitations.

The following topics provide information about Oracle Text indexes:

3.1.1 Types of Oracle Text Indexes

With Oracle Text, you create indexes by using the CREATE INDEX statement.

Table 3-1 Oracle Text Index Types

Index Type Description Supported Preferences and Parameters Query Operator Notes

Index Type	Description	Supported Preferences and Parameters	Query Operator	Notes
`CONTEXT`	Use this index to build a text retrieval application when your text consists of large, coherent documents in, for example, MS Word, HTML, or plain text. You can customize the index in a variety of ways. This index type requires `CTX_DDL.SYNC_INDEX` after insert, update, and delete operations to the base table.	All `CREATE` `INDEX` preferences and parameters are supported, except for `INDEX` `SET.` Supported parameters: index partition clause format, charset, and language columns	`CONTAINS` The `CONTEXT` grammar supports a rich set of operations. Use the `CTXCAT` grammar with query templating.	Supports all documents services and query services. Supports indexing of partitioned text tables. Supports `FILTER` `BY` and `ORDER` `BY` clauses of `CREATE` `INDEX` to index structured column values for more efficient processing of mixed queries.
`SEARCH INDEX`	Use this index to build a text retrieval application when your text consists of large, coherent documents in, for example, MS Word, HTML, or plain text. You can customize the index in a variety of ways. This index type requires `CTX_DDL.SYNC_INDEX` after insert, update, and delete operations to the base table.	All `CREATE` `INDEX` preferences and parameters are supported, except for `INDEX` `SET.` Supported parameters: index partition clause format, charset, and language columns	`CONTAINS` The `SEARCH INDEX` grammar supports a rich set of operations. Use the `CONTEXT` and `CTXCAT` grammar with query templating.	Supports all documents services and query services. Supports indexing of partitioned text tables. Supports sharded databases and system managed partitioning for index storage tables.
`CTXCAT`	Use this index for better mixed query performance of small documents and text fragments. To improve mixed query performance, include other columns in the base table, such as item names, prices, and descriptions. This index type is transactional. It automatically updates itself after inserts, updates, or deletes to the base table. `CTX_DDL.SYNC_INDEX` is not necessary. Note: The Oracle Text indextype `CTXCAT` is deprecated with Oracle Database 23ai. The indextype itself, and it's operator `CTXCAT`, can be removed in a future release. `CTXCAT` was introduced when indexes were typically a few megabytes in size. Modern, large indexes, can be difficult to manage with `CTXCAT`. The addition of index sets to `CTXCAT` can be achieved more effectively by the use of `FILTER BY` and `ORDER BY` columns, or `SDATA`, or both, in the `CONTEXT` indextype. `CTXCAT` is therefore rarely an appropriate choice. Oracle recommends that you choose the more efficient `CONTEXT` indextype.	`INDEX` `SET` `LEXER` `STOPLIST` `STORAGE` `WORDLIST` (The `prefix_index` attribute is supported only for Japanese data.) Not supported: Format, charset, and language columns Table and index partitioning	`CATSEARCH` The `CTXCAT` grammar supports logical operations, phrase queries, and wildcarding. Use the `CONTEXT` grammar with query templating. Theme querying is supported.	This index is larger and takes longer to build than a `CONTEXT` index. The size of a `CTXCAT` index is related to the total amount of text to be indexed, the number of indexes in the index set, and the number of columns indexed. Carefully consider your queries and your resources before adding indexes to the index set. The `CTXCAT` index does not support index partitioning, documents services (highlighting, markup, themes, and gists) or query services (explain, query feedback, and browse words.)
`CTXRULE`	Use this index to build a document classification or routing application. Create this index on a table of queries, where the queries define the classification or routing criteria..	See "CTXRULE Parameters and Limitations".	`MATCHES`	Use the `MATCHES` operator to classify single documents (plain text, HTML, or XML). `MATCHES` turns a document into a set of queries and finds the matching rows in the index. To build a document classification application by using simple or rule-based classification, create an index of type `CTXRULE.` This index classifies plain text, HTML, or XML documents by using the `MATCHES` operator. Store your defining query set in the text table that you index.

CONTEXT

Use this index to build a text retrieval application when your text consists of large, coherent documents in, for example, MS Word, HTML, or plain text.

You can customize the index in a variety of ways.

This index type requires CTX_DDL.SYNC_INDEX after insert, update, and delete operations to the base table.

All CREATE INDEX preferences and parameters are supported, except for INDEX SET.

Supported parameters: index partition clause format, charset, and language columns

CONTAINS

The CONTEXT grammar supports a rich set of operations.

Use the CTXCAT grammar with query templating.

Supports all documents services and query services.

Supports indexing of partitioned text tables.

Supports FILTER BY and ORDER BY clauses of CREATE INDEX to index structured column values for more efficient processing of mixed queries.

SEARCH INDEX

Use this index to build a text retrieval application when your text consists of large, coherent documents in, for example, MS Word, HTML, or plain text.

You can customize the index in a variety of ways.

This index type requires CTX_DDL.SYNC_INDEX after insert, update, and delete operations to the base table.

All CREATE INDEX preferences and parameters are supported, except for INDEX SET.

Supported parameters: index partition clause format, charset, and language columns

CONTAINS

The SEARCH INDEX grammar supports a rich set of operations.

Use the CONTEXT and CTXCAT grammar with query templating.

Supports all documents services and query services.

Supports indexing of partitioned text tables.

Supports sharded databases and system managed partitioning for index storage tables.

CTXCAT

Use this index for better mixed query performance of small documents and text fragments. To improve mixed query performance, include other columns in the base table, such as item names, prices, and descriptions.

This index type is transactional. It automatically updates itself after inserts, updates, or deletes to the base table. CTX_DDL.SYNC_INDEX is not necessary.

Note: The Oracle Text indextype CTXCAT is deprecated with Oracle Database 23ai. The indextype itself, and it's operator CTXCAT, can be removed in a future release.

CTXCAT was introduced when indexes were typically a few megabytes in size. Modern, large indexes, can be difficult to manage with CTXCAT. The addition of index sets to CTXCAT can be achieved more effectively by the use of FILTER BY and ORDER BY columns, or SDATA, or both, in the CONTEXT indextype. CTXCAT is therefore rarely an appropriate choice. Oracle recommends that you choose the more efficient CONTEXT indextype.

INDEX SET

LEXER

STOPLIST

STORAGE

WORDLIST (The prefix_index attribute is supported only for Japanese data.)

Not supported: Format, charset, and language columns

Table and index partitioning

CATSEARCH

The CTXCAT grammar supports logical operations, phrase queries, and wildcarding.

Use the CONTEXT grammar with query templating.

Theme querying is supported.

This index is larger and takes longer to build than a CONTEXT index.

The size of a CTXCAT index is related to the total amount of text to be indexed, the number of indexes in the index set, and the number of columns indexed. Carefully consider your queries and your resources before adding indexes to the index set.

The CTXCAT index does not support index partitioning, documents services (highlighting, markup, themes, and gists) or query services (explain, query feedback, and browse words.)

CTXRULE

Use this index to build a document classification or routing application. Create this index on a table of queries, where the queries define the classification or routing criteria..

See "CTXRULE Parameters and Limitations".

MATCHES

Use the MATCHES operator to classify single documents (plain text, HTML, or XML). MATCHES turns a document into a set of queries and finds the matching rows in the index.

To build a document classification application by using simple or rule-based classification, create an index of type CTXRULE. This index classifies plain text, HTML, or XML documents by using the MATCHES operator. Store your defining query set in the text table that you index.

An Oracle Text index is an Oracle Database domain index. To build your query application, you can create an index of type CONTEXT with a mixture of text and structured data columns, and query it with the CONTAINS operator.

You create an index from a populated text table. In a query application, the table must contain the text or pointers to the location of the stored text. Text is usually a collection of documents, but it can also be small text fragments.

Note:

If you are building a new application that uses XML data, Oracle recommends that you use XMLIndex, not CTXRULE.

Create an Oracle Text index as a type of extensible index to Oracle Database by using standard SQL. This means that an Oracle Text index operates like an Oracle Database index. It has a name by which it is referenced and can be manipulated with standard SQL statements.

The benefit of creating an Oracle Text index is fast response time for text queries with the CONTAINS, CATSEARCH, and MATCHES operators. These operators query the CONTEXT, CTXCAT, and CTXRULE index types, respectively.

Note:

Because a Transparent Data Encryption-enabled column does not support domain indexes, do not use it with Oracle Text. However, you can create an Oracle Text index on a column in a table that is stored in a Transparent Data Encryption-enabled tablespace.

See Also:

"Creating Oracle Text Indexes"
Oracle XML DB Developer's Guide for information about XMLIndex and indexing XMLType data

3.1.2 Structure of the Oracle Text CONTEXT Index

Oracle Text indexes text by converting all words into tokens. The general structure of an Oracle Text CONTEXT index is an inverted index, where each token contains the list of documents (rows) that contain the token.

For example, after a single initial indexing operation, the word DOG might have an entry as follows:

Word	Appears in Document
DOG	DOC1 DOC3 DOC5

This means that the word DOG is contained in the rows that store documents one, three, and five.

Merged Word and Theme Indexing

By default in English and French, Oracle Text indexes theme information with word information. You can query theme information with the ABOUT operator. You can also enable and disable theme indexing.

See Also:

"Creating Preferences " to learn more about indexing theme information

3.1.3 Oracle Text Indexing Process

This section describes the Oracle Text indexing process. Initiate the indexing process by using the CREATE INDEX statement to create an Oracle Text index of tokens, organized according to your parameters and preferences.

Figure 3-1 shows the indexing process. This process is a data stream that is acted upon by the different indexing objects. Each object corresponds to an indexing preference type or section group that you can specify in the parameter string of CREATE INDEX or ALTER INDEX.

Figure 3-1 Oracle Text Indexing Process

Description of "Figure 3-1 Oracle Text Indexing Process"

Oracle Text processes the data stream with the following objects and engine:

3.1.3.1 Datastore Object

The stream starts with the datastore reading in the documents as they are stored in the system according to your datastore preference.

For example, if you defined your datastore as DIRECTORY_DATASTORE, then the stream starts by reading the files from an Oracle directory object. You can also store your documents on the internet or in Oracle Database. Wherever your files reside physically, a text table in Oracle Database must always point to the files.

3.1.3.2 Filter Object

The stream then passes through the filter. Your FILTER preference determines what happens. The stream can be acted upon in one of the following ways:

No filtering takes place when you specify the NULL_FILTER preference type or when the value of the format column is IGNORE. Documents that are plain text, HTML, or XML need no filtering.
Formatted documents (binary) are filtered to marked-up text when you specify the AUTO_FILTER preference type or when the value of the format column is BINARY.

3.1.3.3 Sectioner Object

After being filtered, the marked-up text passes through the sectioner, which separates the stream into text and section information. Section information includes where sections begin and end in the text stream. The type of sections that are extracted is determined by your section group type.

The text is passed to the lexer. The section information is passed directly to the indexing engine, which uses it later.

3.1.3.4 Lexer Object

You create a lexer preference by using one of the Oracle Text lexer types to specify the language of the text to be indexed. The lexer breaks the text into tokens according to your language. These tokens are usually words. To extract tokens, the lexer uses the parameters that are defined in your lexer preference. These parameters include the definitions for the characters that separate tokens, such as whitespace. Parameters also include whether to convert the text to all uppercase or to leave it in mixed case.

When you enable theme indexing, the lexer analyzes your text to create theme tokens for indexing.

3.1.3.5 Indexing Engine

The indexing engine creates the inverted index that maps tokens to the documents that contain them. In this phase, Oracle Text uses the stoplist that you specify to exclude stopwords or stopthemes from the index. Oracle Text also uses the parameters that are defined in your WORDLIST preference. Those parameters tell the system how to create a prefix index or substring index, if enabled.

3.1.4 About Updates to Indexed Columns

You can keep documents available for search operations until the index is synchronized, without immediately performing index synchronization.

In releases prior to Oracle Database 12c Release 2 (12.2), when there is an update to the column on which an Oracle Text index is based, the document is unavailable for search operations until the index is synchronized. User queries cannot perform a search of this document. Starting with Oracle Database 12c Release 2 (12.2), you can specify that documents must be searchable after updates, without immediately performing index synchronization. Before the index is synchronized, queries use the old index entries to fetch the contents of the old document. After index synchronization, user queries fetch the contents of the updated document.

The ASYNCHRONOUS_UPDATE option for indexes enables you to retain the old contents of a document after an update and then use this index to answer user queries.

Note:

The ASYNCHRONOUS_UPDATE setting of the CONTEXT indextype is deprecated in Oracle Database 23ai, and can be ignored or removed in a future release.

Oracle can ignore or remove this attribute in a future release. Oracle recommends that you allow this value to be set to its default value, SYNCHRONOUS_UPDATE. To avoid unexpected loss of results during updates, use SYNC (ON COMMIT) or SYNC(EVERY [time-period]) with a short time period.

The ASYNCHRONOUS_UPDATE setting was introduced as a workaround for the fact that updates are implemented as "delete followed by insert," and that deletes are immediate (on commit), while inserts are only performed during an index sync. However, this setting is incompatible with several other index options. Oracle recommends that you discontinue its use.

Related Topics

3.1.5 Partitioned Tables and Indexes

When you create a partitioned CONTEXT index on a partitioned text table, you must partition the table by range. Hash, composite, and list partitions are not supported.

You can create a partitioned text table to partition your data by date. For example, if your application maintains a large library of dated news articles, you can partition your information by month or year. Partitioning simplifies the manageability of large databases, because querying, insert, update, delete operations, and backup and recovery can act on a single partition.

On local CONTEXT indexes with multiple table sets, Oracle Text supports the number of partitions supported by Oracle Database.

Note:

The number of partitions that are supported in Oracle Text is approximately 1024K-1. This limit, which should be more than adequate, is not applicable to a CONTEXT index on partitioned tables.

See Also:

Oracle Database Concepts for more information about partitioning

To query a partitioned table, use CONTAINS in the WHERE clause of a SELECT statement as you query a regular table. You can query the entire table or a single partition. However, if you are using the ORDER BY SCORE clause, Oracle recommends that you query single partitions unless you include a range predicate that limits the query to a single partition.

3.1.6 Online Indexes

When it is not practical to lock your base table for indexing because of ongoing updates, you can create your index online with the ONLINE parameter of CREATE INDEX statement. This way an application with frequent inserts, updates, or deletes does not have to stop updating the base table for indexing.

There are short periods, however, when the base table is locked at the beginning and end of the indexing process.

See Also:

Oracle Text Reference to learn more about creating an index online

3.1.7 Parallel Indexing

Oracle Text supports parallel indexing with the CREATE INDEX statement.

When you enter a parallel indexing statement on a nonpartitioned table, Oracle Text splits the base table into temporary partitions, spawns child processes, and assigns a child to a partition. Each child then indexes the rows in its partition. The method of slicing the base table into partitions is determined by Oracle and is not under your direct control. This is true as well for the number of child processes actually spawned, which depends on machine capabilities, system load, your init.ora settings, and other factors. Because of these variables, the actual parallel degree may not match the degree of parallelism requested.

Because indexing is an intensive I/O operation, parallel indexing is most effective in decreasing your indexing time when you have distributed disk access and multiple CPUs. Parallel indexing can affect the performance of an initial index only with the CREATE INDEX statement. It does not affect insert, update, and delete operations with ALTER INDEX, and has minimal effect on query performance.

Because parallel indexing decreases the initial indexing time, it is useful for the following scenarios:

Data staging, when your product includes an Oracle Text index
Rapid initial startup of applications based on large data collections
Application testing, when you need to test different index parameters and schemas while developing your application
See Also:
- "Parallel Queries on a Local Context Index"
- "Frequently Asked Questions About Indexing Performance"

3.1.8 Indexing and Views

If you want to index documents that have contents in different tables, then create a ubiquitous search index.

You can use the DBMS_SEARCH PL/SQL package to create a ubiquitous search index on multiple tables and views within a schema. You can create this index only on the views that have a primary key and a foreign key constraint relationship with the component table.

Alternatively, you can create a data storage preference by using the USER_DATASTORE object. With this object, you can define a procedure that synthesizes documents from different tables at index time.

Oracle Text supports the creation of CONTEXT, CTXCAT, and CTXRULE indexes on materialized views (MVIEW).

Related Topics