Considerations for Oracle Text Indexing

3.2 Considerations for Oracle Text Indexing

Use the CREATE INDEX statement to create an Oracle Text index. When you create an index but do not specify a parameter string, an index is created with default parameters. You can create a CONTEXT index, a CTXCAT index, or a CTXRULE index.

You can also override the defaults and customize your index to suit your query application. The parameters and preference types that you use to customize your index with the CREATE INDEX statement fall into the following general categories.

This section contains the following topics:

3.2.1 Location of Text

The basic prerequisite for a text query application is a text table that is populated with your document collection. The text table is required for indexing.

When you create a CONTEXT index, populate rows in your text table with one of the following elements. CTXCAT and CTXRULE indexes support only the first method.

Text information (Documents or text fragments. By default, the indexing operation expects your document text to be directly loaded in your text table.)
Path names of documents in your file system
URLs of web documents

Figure 3-2 illustrates these different methods.

Figure 3-2 Different Ways of Storing Text

Description of "Figure 3-2 Different Ways of Storing Text"

3.2.2 Supported Column Types

With Oracle Text, you can create a CONTEXT index with columns of type VARCHAR2, CLOB (limited to 4294967295 bytes), BLOB, CHAR, BFILE, XMLType, and URIType.

Note:

You cannot index the NCLOB, DATE, and NUMBER column types.

3.2.3 Storing Text in the Text Table

For CONTEXT data storage, use these datastore types to store documents in your text table.

DIRECT_DATASTORE: In one column
MULTI_COLUMN_DATASTORE: In multiple columns (Oracle Text concatenates the columns into a virtual document, one document for each row.)
DETAIL_DATASTORE: Primary-detail relationships (Store one document across a number of rows.)
NESTED_DATASTORE: In a nested table

Oracle Text supports the indexing of the XMLType data type, which you use to store XML documents.

For CTXCAT data storage, you can store short text fragments, such as names, descriptions, and addresses, over a number of columns. A CTXCAT index improves performance for mixed queries.

3.2.4 Storing File Path Names

In your text table, store path names to files stored in your file system. During indexing, use the DIRECTORY_DATASTORE preference type. This method of data storage is supported only for CONTEXT indexes.

Note:

Starting with Oracle Database 19c, the Oracle Text type FILE_DATASTORE is deprecated. Use DIRECTORY_DATASTORE instead.

Oracle recommends that you replace FILE_DATASTORE text indexes with the DIRECTORY_DATASTORE index type, which is available starting with Oracle Database 19c. DIRECTORY_DATASTORE provides greater security because it enables file access to be based on directory objects.

3.2.5 Storing URLs

Store URL names to index websites. During indexing, use the NETWORK_DATASTORE preference type. This method of data storage is supported only for CONTEXT indexes.

Note:

Starting with Oracle Database 19c, the Oracle Text type URL_DATASTORE is deprecated. Use NETWORK_DATASTORE instead.

The URL_DATASTORE type is used for text stored in files on the internet (accessed through HTTP or FTP), and for text stored in local file system files (accessed through the file protocol). It is replaced with NETWORK_DATASTORE, which uses ACLs to allow access to specific servers. This change aligns Oracle Text more closely with the standard operating and security model for accessing URLs from the database.

3.2.6 Storing Associated Document Information

In your text table, create additional columns to store structured information that your query application might need, such as primary key, date, description, or author.

3.2.7 Format and Character Set Columns

If your documents consist of mixed formats or mixed character sets, create the following additional columns:

A format column to record the format (TEXT or BINARY) to help filtering during indexing. You can also use the format column to ignore rows for indexing by setting the format column to IGNORE. IGNORE is useful for bypassing rows containing data that is incompatible with Oracle Text indexing, such as images.
A character set column to record the document character set for each row.

When you create your index, specify the name of the format or character set column in the parameter clause of the CREATE INDEX statement.

For all rows containing the AUTO or AUTOMATIC keywords in character set or language columns, Oracle Text applies statistical techniques to determine the character set and language of the documents and modify document indexing appropriately.

3.2.8 Supported Document Formats

Because the system can index most document formats, including HTML, PDF, Microsoft Word, and plain text, you can load any supported type into the text column.

When your text column has mixed formats, you can include a format column to help filtering during indexing, and you can specify whether a document is binary (formatted) or text (nonformatted, such as HTML). If you mix HTML and XML documents in one index, you might not be able to configure your index to your needs; you cannot prevent style sheet information from being added to the index.

See Also:

Oracle Text Reference for more information about the supported document formats

3.2.9 Summary of DATASTORE Types

When you use CREATE INDEX, specify the location that uses the datastore preference. Use an appropriate datastore according to your application.

These are the different ways that you can store your text with datastore preference types.

Table 3-2 Summary of DATASTORE Types

Datastore Type	Use When
`DIRECT_DATASTORE`	Data is stored internally in a text column. Each row is indexed as a single document. Your text column can be `VARCHAR2, CLOB, BLOB, CHAR, or BFILE`. `XMLType` columns are supported for the context index type.
`MULTI_COLUMN_DATASTORE`	Data is stored in a text table in more than one column. Columns are concatenated to create a virtual document, one document for each row.
`DETAIL_DATASTORE`	Data is stored internally in a text column. Document consists of one or more rows stored in a text column in a detail table, with header information stored in a primary table.
`FILE_DATASTORE`	Data is stored externally in operating system files. File names are stored in the text column, one for each row. Note: Starting with Oracle Database 19c, the Oracle Text type `FILE_DATASTORE` is deprecated. Use `DIRECTORY_DATASTORE` instead.
`DIRECTORY_DATASTORE`	Data is stored externally in Oracle directory objects. File names are stored in the text column, one for each row.
`NESTED_DATASTORE`	Data is stored in a nested table.
`URL_DATASTORE`	Data is stored externally in files located on an intranet or the internet. URLs are stored in the text column. Note: Starting with Oracle Database 19c, the Oracle Text type `URL_DATASTORE` is deprecated. Use `NETWORK_DATASTORE` instead.
`NETWORK_DATASTORE`	Data is stored externally in files located on an intranet or the internet. URLs are stored in the text column.
`USER_DATASTORE`	Documents are synthesized at index time by a user-defined stored procedure.

Indexing time and document retrieval time increases for indexing URLs, because the system must retrieve the document from the network.

Note:

To troubleshoot issues with triggers and MULTI_COLUMN_DATASTORE or USER_DATASTORE, refer to My Oracle Support document 1613741.1.

Related Topics

3.2.10 Document Formats and Filtering

To index formatted documents, such as Microsoft Word and PDF, you must filter them to text. The FILTER preference type determines the type of filtering that the system uses. By default, the system uses the AUTO_FILTER filter type, which automatically detects the format of your documents and filters them to text.

Oracle Text can index most formats. It can also index columns that contain mixed-format documents.

See Also:

Oracle Text Reference for information about AUTO_FILTER supported document and graphics formats

3.2.10.1 No Filtering for HTML

If you are indexing HTML or plain-text files, do not use the AUTO_FILTER type. For best results, use the NULL_FILTER preference type.

3.2.10.2 Mixed-Format Columns Filtering

For a mixed-format column, such as one that contains Microsoft Word, plain text, and HTML documents, you can bypass filtering for plain text or HTML by including a format column in your text table. In the format column, tag each row TEXT or BINARY. Rows that are tagged TEXT are not filtered.

For example, tag the HTML and plain text rows as TEXT and the Microsoft Word rows as BINARY. You specify the format column in the CREATE INDEX parameter clause.

When you do not want a document to be indexed, you can use a third format column type, IGNORE. This column type is useful, for example, when a mixed-format table includes plain-text documents in Japanese and English, but you only want to process the English documents. This column type is also useful when a mixed-format table includes plain-text documents and images. Because IGNORE is implemented at the datastore level, you can use it with all filters.

3.2.10.3 Custom Filtering

You can create a custom filter to filter documents for indexing. You can create either an external filter that is executed from the file system or an internal filter as a PL/SQL or Java-stored procedure.

For external custom filtering, use the USER_FILTER filter preference type.

For internal filtering, use the PROCEDURE_FILTER filter type.

See Also:

"PROCEDURE_FILTER Example"

3.2.11 Bypass Rows

In your text table, you can bypass rows that you do not want to index, such as rows that contain image data. To bypass rows, you create a format column, set it to IGNORE, and name the format column in the parameter clause of the CREATE INDEX statement.

3.2.12 Document Character Set

The indexing engine expects filtered text to be in the database character set. When you use the AUTO_FILTER filter type, formatted documents are converted to text in the database character set.

If your source is text and your document character set is not the database character set, then you can use the AUTO_FILTER filter type to convert your text for indexing.

Character Set Detection

When you set the CHARSET column to AUTO, the AUTO_FILTER filter detects the character set of the document and converts it from the detected character set to the database character set, if there is a difference.

Mixed Character Set Columns

If your document set contains documents with different character sets, such as JA16EUC and JA16SJIS, you can index the documents, provided that you create a CHARSET column, populate this column with the name of the document character set for each row, and name the column in the parameter clause of the CREATE INDEX statement.