3.4 Special Characters
When you use the BASIC_LEXER preference type, you can specify how nonalphanumeric characters, such as hyphens and periods, are indexed in relation to the tokens that contain them. For example, you can specify that Oracle Text include or exclude the hyphen (-) when it indexes a word such as vice-president.
These characters fall into BASIC_LEXER categories according to the behavior that you require during indexing. The lexer behavior that you set for indexing also applies to query parsing.
Some of the special characters you can set are as follows:
- Printjoin Characters: Define a nonalphanumeric character as a printjoin when you want this character to be included in the token during indexing. For example, if you want your index to include hyphens and underscores, define them as printjoins. This means that a word such as vice-president is indexed as vice-president. A query on vicepresident does not find vice-president.
- Skipjoin Characters: Define a nonalphanumeric character as a skipjoin when you do not want this character to be indexed with the token that contains it. For example, with the hyphen (-) defined as a skipjoin, vice-president is indexed as vicepresident. A query on vice-president finds documents containing vice-president and vicepresident.
- Other Characters: You can specify other characters to control other tokenization behavior, such as token separation (startjoins, endjoins, whitespace), punctuation identification (punctuations), number tokenization (numjoins), and word continuation after line breaks (continuation). These categories of characters have modifiable defaults.
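The printjoin and skipjoin settings described above might be configured as in the following minimal sketch. It assumes the CTX_DDL package procedures CREATE_PREFERENCE and SET_ATTRIBUTE; the preference name mylex, the table mytable with its text column docs, and the index name myindex are hypothetical and used only for illustration.

```sql
-- Hypothetical names: mylex (preference), mytable.docs (text column), myindex (index).
BEGIN
  -- Create a lexer preference of type BASIC_LEXER.
  ctx_ddl.create_preference('mylex', 'BASIC_LEXER');

  -- Keep hyphens and underscores inside tokens:
  -- vice-president is indexed as vice-president.
  ctx_ddl.set_attribute('mylex', 'printjoins', '_-');

  -- Alternative: to index vice-president as vicepresident,
  -- define the hyphen as a skipjoin instead of a printjoin:
  -- ctx_ddl.set_attribute('mylex', 'skipjoins', '-');
END;
/

-- Reference the lexer preference when creating the index.
CREATE INDEX myindex ON mytable (docs)
  INDEXTYPE IS CTXSYS.CONTEXT
  PARAMETERS ('LEXER mylex');
```

Because the same lexer settings apply to query parsing, a query against myindex tokenizes its terms the same way the indexed documents were tokenized.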
See Also:
- Oracle Text Reference to learn more about the BASIC_LEXER type