Identifying Collocates

16.5 Identifying Collocates

Collocates are a group of words that frequently co-occur in a document. They provide a quick summary of other keywords or concepts that are related to a specified keyword. You can then use the other keywords in queries to fetch more relevant results.

You identify collocates based on a search query. For each document that is returned by the query, snippets of text around the search keyword are automatically extracted. Next, the words in these snippets are correlated to the query keyword by using statistical measures and, depending on how frequently the extracted words occur in the overall document set, a score is assigned to each returned co-occurring word.

Use the RSI to identify collocates. You can specify the number of co-occurring words that must be returned by the query. You can also specify whether to identify collocates that are common nouns or collocates that emphasize uniqueness. Synonyms of the specified search keyword can also be returned.

Note:

Collocates are supported only for BASIC_LEXER.

To identify collocates:

Create the document set table for the query.
Create an Oracle Text index on the document set table.
Use the XML Query RSI to define and input a query that identifies collocates. Include the collocates element with the required attributes.

Example 16-1 Identifying Collocates Within a Document Set

In this example, the keyword used to query documents in a data set is ‘Nobel.’ Oracle Text searches for occurrences of this keyword in the document set. In addition to the result set, use collocates to search for five common words that co-occur with ‘Nobel.’ Use the max_words attribute to identify the number of collocates to be generated. Set the use_tscore attribute to TRUE to specify that common words must be identified for the collocates. The number of words to pick on either side of the keyword in order to identify collocates is 10.

The following is the input RSI descriptor that is used to determine collocates:

declare
rsd varchar2(32767);
 begin
  ctx_query.result_set('tdrbnbsan01idx', 'nobel',
  <ctx_result_set_descriptor>
  <collocates radius = "10" max_words="5" use_tscore="TRUE"/>
  </ctx_result_set_descriptor>',
  :rs);
  end;
/

Here is the output result set for the query:

<ctx_result_set>
<collocates>
    <collocation>
       <word>PRIZE</word>
       <score>82</score>
    </collocation>
    <collocation>
       <word>LAUREATE</word>
       <score>70</score>
    </collocation>
    <collocation>
       <word>NOBELPRIZE</word>
       <score>44</score>
    </collocation>
    <collocation>
       <word>AWARD</word>
       <score>42</score>
    </collocation>
    <collocation>
       <word>ORG</word>
       <score>41</score>
    </collocation
</collocates>
</ctx_result_set>

For ‘Nobel,’ the top five common collocates, in order, are Prize, Laureate, Nobelprize, award, and org. Each word is assigned a score that indicates the frequency of occurrence. Collocates are always returned after any hitlist elements are returned.

If you set use_tscore to FALSE in the same example, then less common (unique) words are identified. Here is the output result set:

<ctx_result_set>
<collocates>   
    <collocation>
       <word>MOLA</word>
       <score>110</score>   
    </collocation>
    <collocation>
       <word>BISMARCK</word>
       <score>89</score>
    </collocation>
    <collocation>
       <word>COLONNA</word>
       <score>67</score>
    </collocation>
    <collocation>
       <word>LYNEN</word>
       <score>55</score>
    </collocation>
    <collocation>
       <word>TIMBERGEN</word>
       <score>25</score>
    </collocation>
    </collocates>
</ctx_result_set>