3.2 Considerations for Oracle Text Indexing
Use the CREATE
INDEX
statement to create an Oracle Text index. When you create an index but do not specify a parameter string, an index is created with default parameters. You can create a CONTEXT
index, a CTXCAT
index, or a CTXRULE
index.
You can also override the defaults and customize your index to suit your query application. The parameters and preference types that you use to customize your index with the CREATE
INDEX
statement fall into the following general categories.
This section contains the following topics:
3.2.1 Location of Text
The basic prerequisite for a text query application is a text table that is populated with your document collection. The text table is required for indexing.
When you create a CONTEXT
index, populate rows in your text table with one of the following elements. CTXCAT
and CTXRULE
indexes support only the first method.
-
Text information (Documents or text fragments. By default, the indexing operation expects your document text to be directly loaded in your text table.)
-
Path names of documents in your file system
-
URLs of web documents
Figure 3-2 illustrates these different methods.
3.2.3 Storing Text in the Text Table
For CONTEXT
data storage, use these datastore types to store documents in your text table.
-
DIRECT_DATASTORE:
In one column -
MULTI_COLUMN_DATASTORE:
In multiple columns (Oracle Text concatenates the columns into a virtual document, one document for each row.) -
DETAIL_DATASTORE:
Primary-detail relationships (Store one document across a number of rows.) -
NESTED_DATASTORE:
In a nested table
Oracle Text supports the indexing of the XMLType
data type, which you use to store XML documents.
For CTXCAT
data storage, you can store short text fragments, such as names, descriptions, and addresses, over a number of columns. A CTXCAT
index improves performance for mixed queries.
3.2.4 Storing File Path Names
In your text table, store path names to files stored in your file system. During indexing, use the DIRECTORY_DATASTORE
preference type. This method of data storage is supported only for CONTEXT
indexes.
Note:
Starting with Oracle Database 19c, the Oracle Text type
FILE_DATASTORE
is deprecated. Use
DIRECTORY_DATASTORE
instead.
Oracle
recommends that you replace FILE_DATASTORE
text indexes with the
DIRECTORY_DATASTORE
index type, which is available starting
with Oracle Database 19c. DIRECTORY_DATASTORE
provides greater
security because it enables file access to be based on directory objects.
3.2.5 Storing URLs
Store URL names to index websites. During indexing, use the NETWORK_DATASTORE
preference type. This method of data storage is supported only for CONTEXT
indexes.
Note:
Starting with Oracle Database 19c, the Oracle Text type
URL_DATASTORE
is deprecated. Use
NETWORK_DATASTORE
instead.
The
URL_DATASTORE
type is used for text stored in files on the
internet (accessed through HTTP or FTP), and for text stored in local file system
files (accessed through the file protocol). It is replaced with
NETWORK_DATASTORE
, which uses ACLs to allow access to specific
servers. This change aligns Oracle Text more closely with the standard operating and
security model for accessing URLs from the database.
3.2.6 Storing Associated Document Information
In your text table, create additional columns to store structured information that your query application might need, such as primary key, date, description, or author.
3.2.7 Format and Character Set Columns
If your documents consist of mixed formats or mixed character sets, create the following additional columns:
-
A format column to record the format (
TEXT
orBINARY
) to help filtering during indexing. You can also use the format column to ignore rows for indexing by setting the format column toIGNORE.
IGNORE
is useful for bypassing rows containing data that is incompatible with Oracle Text indexing, such as images. -
A character set column to record the document character set for each row.
When you create your index, specify the name of the format or character set column in the parameter clause of the CREATE
INDEX
statement.
For all rows containing the AUTO
or AUTOMATIC
keywords in character set or language columns, Oracle Text applies statistical techniques to determine the character set and language of the documents and modify document indexing appropriately.
3.2.8 Supported Document Formats
Because the system can index most document formats, including HTML, PDF, Microsoft Word, and plain text, you can load any supported type into the text column.
When your text column has mixed formats, you can include a format column to help filtering during indexing, and you can specify whether a document is binary (formatted) or text (nonformatted, such as HTML). If you mix HTML and XML documents in one index, you might not be able to configure your index to your needs; you cannot prevent style sheet information from being added to the index.
See Also:
Oracle Text Reference for more information about the supported document formats
3.2.9 Summary of DATASTORE Types
When you use CREATE INDEX
, specify the location that uses the datastore preference. Use an appropriate datastore according to your application.
These are the different ways that you can store your text with datastore preference types.
Table 3-2 Summary of DATASTORE Types
Datastore Type | Use When |
---|---|
|
Data is stored internally in a text column. Each row is indexed as a single document. Your text column can be |
|
Data is stored in a text table in more than one column. Columns are concatenated to create a virtual document, one document for each row. |
|
Data is stored internally in a text column. Document consists of one or more rows stored in a text column in a detail table, with header information stored in a primary table. |
|
Data is stored externally in operating system files. File names are stored in the text column, one for each row. Note: Starting with Oracle Database 19c, the Oracle Text typeFILE_DATASTORE is deprecated. Use
DIRECTORY_DATASTORE instead. |
|
Data is stored externally in Oracle directory objects. File names are stored in the text column, one for each row. |
|
Data is stored in a nested table. |
|
Data is stored externally in files located on an intranet or the internet. URLs are stored in the text column. Note: Starting with Oracle Database 19c, the Oracle Text typeURL_DATASTORE is deprecated. Use
NETWORK_DATASTORE instead. |
|
Data is stored externally in files located on an intranet or the internet. URLs are stored in the text column. |
|
Documents are synthesized at index time by a user-defined stored procedure. |
Indexing time and document retrieval time increases for indexing URLs, because the system must retrieve the document from the network.
Note:
To troubleshoot issues with triggers and MULTI_COLUMN_DATASTORE
or
USER_DATASTORE
, refer to My Oracle Support document
1613741.1.
Related Topics
3.2.10 Document Formats and Filtering
To index formatted documents, such as Microsoft Word and PDF, you must filter them to text. The FILTER
preference type determines the type of filtering that the system uses. By default, the system uses the AUTO_FILTER
filter type, which automatically detects the format of your documents and filters them to text.
Oracle Text can index most formats. It can also index columns that contain mixed-format documents.
See Also:
Oracle Text Reference for information about AUTO_FILTER
supported document and graphics formats
3.2.10.1 No Filtering for HTML
3.2.10.2 Mixed-Format Columns Filtering
For a mixed-format column, such as one that contains Microsoft Word, plain text, and HTML documents, you can bypass filtering for plain text or HTML by including a format column in your text table. In the format column, tag each row TEXT
or BINARY.
Rows that are tagged TEXT
are not filtered.
For example, tag the HTML and plain text rows as TEXT
and the Microsoft Word rows as BINARY.
You specify the format column in the CREATE INDEX
parameter clause.
When you do not want a document to be indexed, you can use a third format column type, IGNORE.
This column type is useful, for example, when a mixed-format table includes plain-text documents in Japanese and English, but you only want to process the English documents. This column type is also useful when a mixed-format table includes plain-text documents and images. Because IGNORE
is implemented at the datastore level, you can use it with all filters.
3.2.10.3 Custom Filtering
You can create a custom filter to filter documents for indexing. You can create either an external filter that is executed from the file system or an internal filter as a PL/SQL or Java-stored procedure.
For external custom filtering, use the USER_FILTER
filter preference type.
For internal filtering, use the PROCEDURE_FILTER
filter type.
See Also:
3.2.11 Bypass Rows
In your text table, you can bypass rows that you do not want to index, such as rows that contain image data. To bypass rows, you create a format column, set it to IGNORE,
and name the format column in the parameter clause of the CREATE INDEX
statement.
3.2.12 Document Character Set
The indexing engine expects filtered text to be in the database character set. When you use the AUTO_FILTER
filter type, formatted documents are converted to text in the database character set.
If your source is text and your document character set is not the database character set, then you can use the AUTO_FILTER
filter type to convert your text for indexing.
Character Set Detection
When you set the CHARSET
column to AUTO,
the AUTO_FILTER
filter detects the character set of the document and converts it from the detected character set to the database character set, if there is a difference.
Mixed Character Set Columns
If your document set contains documents with different character sets, such as JA16EUC and JA16SJIS, you can index the documents, provided that you create a CHARSET
column, populate this column with the name of the document character set for each row, and name the column in the parameter clause of the CREATE INDEX
statement.