3.3 Document Language

Oracle Text can index most languages. By default, Oracle Text assumes that the language of the text to be indexed is the language that you specify in your database setup.

Depending on the language of your documents, use one of the following lexer types:

AUTO_LEXER: To automatically detect the language being indexed by examining the content, and to apply suitable options (including stemming) for that language. Works best when each document is written in a single language and contains at least a couple of paragraphs of text to aid identification.
BASIC_LEXER: To index whitespace-delimited languages such as English, French, German, and Spanish. For some of these languages, you can enable alternate spelling, composite word indexing, and base-letter conversion.
MULTI_LEXER: To index tables containing documents of different languages such as English, German, and Japanese.
CHINESE_VGRAM_LEXER: To extract tokens from Chinese text.
CHINESE_LEXER: To extract tokens from Chinese text. This lexer offers the following benefits over CHINESE_VGRAM_LEXER:
- Generates a smaller index
- Provides better query response time
- Generates real-world tokens, resulting in better query precision
- Supports stop words
JAPANESE_VGRAM_LEXER: To extract tokens from Japanese text.
JAPANESE_LEXER: To extract tokens from Japanese text. This lexer offers the following benefits over JAPANESE_VGRAM_LEXER:
- Generates a smaller index
- Provides better query response time
- Generates real-world tokens, resulting in better query precision
KOREAN_MORPH_LEXER: To extract tokens from Korean text.
USER_LEXER: To create your own lexer for indexing a particular language.
WORLD_LEXER: To index tables containing documents of different languages and to autodetect the languages in the document.
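
As an illustration of the BASIC_LEXER options mentioned in the list above, the following sketch creates a lexer preference for German text and enables base-letter conversion, alternate spelling, and composite word indexing. The table `docs`, its column `text`, the preference name `german_lexer`, and the index name `docs_idx` are hypothetical names chosen for this example, and the session is assumed to have execute privilege on CTX_DDL.

```sql
-- Hypothetical table holding German documents.
CREATE TABLE docs (
  doc_id NUMBER PRIMARY KEY,
  text   CLOB
);

BEGIN
  -- Create a BASIC_LEXER preference (the name is an assumption for this sketch).
  ctx_ddl.create_preference('german_lexer', 'BASIC_LEXER');

  -- Base-letter conversion: index accented characters as their base letters.
  ctx_ddl.set_attribute('german_lexer', 'base_letter', 'YES');

  -- Alternate spelling: treat forms such as "ae" and "ä" as equivalent.
  ctx_ddl.set_attribute('german_lexer', 'alternate_spelling', 'GERMAN');

  -- Composite word indexing for German compound words.
  ctx_ddl.set_attribute('german_lexer', 'composite', 'GERMAN');
END;
/

-- Reference the lexer preference when creating the CONTEXT index.
CREATE INDEX docs_idx ON docs(text)
  INDEXTYPE IS CTXSYS.CONTEXT
  PARAMETERS ('LEXER german_lexer');
```

Only the attributes that suit your language need to be set; attributes left unset keep their default values.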

With the BASIC_LEXER preference, Oracle Text provides a lexing solution for most languages. For Japanese, Chinese, and Korean, you can use the dedicated lexers listed above or create your own lexing solution through the user-defined lexer interface.

Language Features Outside BASIC_LEXER: The user-defined lexer interface enables you to create a PL/SQL or Java procedure to process your documents during indexing and querying. With the user-defined lexer, you can also create your own theme lexing solution or linguistic processing engine.
Multilanguage Columns: Oracle Text can index text columns that contain documents in different languages, such as a column that contains documents written in English, German, and Japanese. To index a multilanguage column, you add a language column to your text table and use the MULTI_LEXER preference type. You can also incorporate a multilanguage stoplist when you index multilanguage columns.
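
The multilanguage setup described above might look like the following sketch. It assumes a hypothetical table `globaldoc` with a language column `lang`, and the preference, stoplist, and index names are likewise chosen only for illustration. Each row's `lang` value selects the sub-lexer used for that row, and rows with an unrecognized language fall back to the default sub-lexer.

```sql
-- Hypothetical text table with a language column.
CREATE TABLE globaldoc (
  doc_id NUMBER PRIMARY KEY,
  lang   VARCHAR2(10),   -- for example 'english', 'german', 'japanese'
  text   CLOB
);

BEGIN
  -- One lexer preference per language stored in the column.
  ctx_ddl.create_preference('english_lexer',  'BASIC_LEXER');
  ctx_ddl.create_preference('german_lexer',   'BASIC_LEXER');
  ctx_ddl.set_attribute('german_lexer', 'composite', 'GERMAN');
  ctx_ddl.create_preference('japanese_lexer', 'JAPANESE_VGRAM_LEXER');

  -- The MULTI_LEXER preference maps language values to sub-lexers.
  ctx_ddl.create_preference('global_lexer', 'MULTI_LEXER');
  ctx_ddl.add_sub_lexer('global_lexer', 'default',  'english_lexer');
  ctx_ddl.add_sub_lexer('global_lexer', 'german',   'german_lexer',   'ger');
  ctx_ddl.add_sub_lexer('global_lexer', 'japanese', 'japanese_lexer', 'jpn');

  -- A multilanguage stoplist holds language-specific stop words.
  ctx_ddl.create_stoplist('global_stoplist', 'MULTI_STOPLIST');
  ctx_ddl.add_stopword('global_stoplist', 'the', 'english');
  ctx_ddl.add_stopword('global_stoplist', 'die', 'german');
END;
/

-- Name the language column so that each row is routed to its sub-lexer.
CREATE INDEX globalx ON globaldoc(text)
  INDEXTYPE IS CTXSYS.CONTEXT
  PARAMETERS ('LEXER global_lexer LANGUAGE COLUMN lang STOPLIST global_stoplist');
```

The fourth argument to `add_sub_lexer` is an optional alternate value, so in this sketch a row with `lang` set to either 'german' or 'ger' would be processed by `german_lexer`.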
