Supervised Classification

9.5 Supervised Classification

With supervised classification, you use the CTX_CLS.TRAIN procedure to automate the rule-writing step. CTX_CLS.TRAIN uses a training set of sample documents to deduce classification rules. This training set is the major advantage over rule-based classification, where you must write the classification rules.

However, before you can run the CTX_CLS.TRAIN procedure, you must manually create categories and assign each document in the sample training set to a category.

9.5.1 Decision Tree Supervised Classification

To use Decision Tree classification, you set the preference argument of CTX_CLS.TRAIN to RULE_CLASSIFIER.

This form of classification uses a decision tree algorithm for creating rules. Generally speaking, a decision tree is a method of deciding between two (or more, but usually two) choices. In document classification, the choices are "the document matches the training set" or "the document does not match the training set."

A decision tree has a set of attributes that can be tested. In this case, the attributes include:

words from the document
stems of words from the document (for example, the stem of running is run)
themes from the document (if themes are supported for the language in use)

The learning algorithm in Oracle Text builds one or more decision trees for each category provided in the training set. These decision trees are then coded into queries that are suitable for use by a CTXRULE index. For example, one category has a training document for "Japanese beetle," and another category has a document for "Japanese currency." The algorithm may create decision trees based on "Japanese," "beetle," and "currency," and then classify documents accordingly.

The decision trees include the concept of confidence. Each generated rule is allocated a percentage value that represents the accuracy of the rule, given the current training set. In trivial examples, the accuracy is almost always 100 percent, but this percentage merely represents the limitations of the training set. Similarly, the rules generated from a trivial training set may seem to be less than what you might expect, but they sufficiently distinguish the different categories in the current training set.

The advantage of the Decision Tree classification is that it can generate rules that users can easily inspect and modify. The Decision Tree classification makes sense when you want to the computer to generate the bulk of the rules, but you want to fine-tune them afterward by editing the rule sets.

9.5.2 Decision Tree Supervised Classification Example

The following SQL example steps through creating your document and classification tables, classifying the documents, and generating the rules. It then goes on to generate rules with CTX_CLS.TRAIN.

Rules are then indexed to create CTXRULE index and new documents are classified with MATCHES.

The CTX_CLS.TRAIN procedure requires an input training document set. A training set is a set of documents that have already been assigned a category.

After you generate the rules, you can test them by first indexing them and then using MATCHES to classify new documents.

To create and index the category rules:

Create and load a table of training documents.

This example uses a simple set of three fast food documents and three computer documents.

create table docs (
  doc_id number primary key,
  doc_text   clob);

insert into docs values
(1, 'MacTavishes is a fast-food chain specializing in burgers, fries and -
shakes. Burgers are clearly their most important line.');
insert into docs values
(2, 'Burger Prince are an up-market chain of burger shops, who sell burgers -
and fries in competition with the likes of MacTavishes.');
insert into docs values
(3, 'Shakes 2 Go are a new venture in the low-cost restaurant arena, 
specializing in semi-liquid frozen fruit-flavored vegetable oil products.');
insert into docs values
(4, 'TCP/IP network engineers generally need to know about routers, 
firewalls, hosts, patch cables networking etc');
insert into docs values
(5, 'Firewalls are used to protect a network from attack by remote hosts,
 generally across TCP/IP');

Create category tables, category descriptions and IDs.

----------------------------------------------------------------------------

-- Create category tables
-- Note that "category_descriptions" isn't really needed for this demo -
-- it just provides a descriptive name for the category numbers in
-- doc_categories
----------------------------------------------------------------------------

create table category_descriptions (
  cd_category    number,
  cd_description varchar2(80));

create table doc_categories (
  dc_category    number,
  dc_doc_id      number,
  primary key (dc_category, dc_doc_id)) 
  organization index;

-- descriptions for categories

insert into category_descriptions values (1, 'fast food');
insert into category_descriptions values (2, 'computer networking');

Assign each document to a category.

In this case, the fast food documents all go into category 1, and the computer documents go into category 2.

insert into doc_categories values (1, 1);
insert into doc_categories values (1, 2);
insert into doc_categories values (1, 3);
insert into doc_categories values (2, 4);
insert into doc_categories values (2, 5);

Create a CONTEXT index to be used by CTX_CLS.TRAIN.

To experiment with the effects of turning themes on and off, create an Oracle Text preference for the index.

exec ctx_ddl.create_preference('my_lex', 'basic_lexer');
exec ctx_ddl.set_attribute    ('my_lex', 'index_themes', 'no');
exec ctx_ddl.set_attribute    ('my_lex', 'index_text',   'yes');

create index docsindex on docs(doc_text) indextype is ctxsys.context
parameters ('lexer my_lex');

Create the rules table that will be populated by the generated rules.

create table rules(
  rule_cat_id     number,
  rule_text       varchar2(4000),
  rule_confidence number
);

Generate category rules.

All arguments are the names of tables, columns, or indexes previously created in this example. The rules table now contains the rules, which you can view.

begin
  ctx_cls.train(
    index_name => 'docsindex',
    docid      => 'doc_id',
    cattab     => 'doc_categories',
    catdocid   => 'dc_doc_id',
    catid      => 'dc_category',
    restab     => 'rules',
    rescatid   => 'rule_cat_id',
    resquery   => 'rule_text',
    resconfid  => 'rule_confidence'
  );
end;
/

Fetch the generated rules, viewed by category.

For convenience's sake, the rules table is joined with category_descriptions so that you can see the category that each rule applies to.
```
select cd_description, rule_confidence, rule_text from rules, 
category_descriptions where cd_category = rule_cat_id;
```
Use the CREATE INDEX statement to create the CTXRULE index on the previously generated rules.
```
create index rules_idx on rules (rule_text) indextype is ctxsys.ctxrule;
```

Test an incoming document by using MATCHES.

set serveroutput on;

declare
   incoming_doc clob;
begin
   incoming_doc 
       := 'I have spent my entire life managing restaurants selling burgers';
   for c in 
     ( select distinct cd_description from rules, category_descriptions
       where cd_category = rule_cat_id
       and matches (rule_text, incoming_doc) > 0) loop
     dbms_output.put_line('CATEGORY: '||c.cd_description);
   end loop;
end;
/

9.5.3 SVM-Based Supervised Classification

The second method that you can use for training purposes is Support Vector Machine (SVM) classification. SVM is a type of machine learning algorithm derived from statistical learning theory. A property of SVM classification is the ability to learn from a very small sample set.

Using the SVM classifier is much the same as using the Decision Tree classifier, except for the following differences:

In the call to CTX_CLS.TRAIN, use the SVM_CLASSIFIER preference instead of the RULE_CLASSIFIER preference. (If you do not want to modify any attributes, use the predefined CTXSYS.SVM_CLASSIFIER preference.)
Use the NOPOPULATE keyword if you do not want to populate the CONTEXT index on the table. The classifier uses it only to find the source of the text, by means of datastore and filter preferences, and to determine how to process the text through lexer and sectioner preferences.
In the generated rules table, use at least the following columns:
```
cat_id      number,
type        number,
rule        blob;
```

As you can see, the generated rule is written into a BLOB column. It is therefore opaque to the user, and unlike Decision Tree classification rules, it cannot be edited or modified. The trade-off here is that you often get considerably better accuracy with SVM than with Decision Tree classification.

With SVM classification, allocated memory has to be large enough to load the SVM model; otherwise, the application built on SVM incurs an out-of-memory error. Here is how to calculate the memory allocation:

Minimum memory request (in bytes) = number of unique categories x number of features 
                                    example: (value of MAX_FEATURES attributes) x 8

If necessary to meet the minimum memory requirements, increase one of the following memories:

SGA (if in shared server mode)
PGA (if in dedicated server mode)

9.5.4 SVM-Based Supervised Classification Example

This example uses SVM-based classification. The steps are essentially the same as the Decision Tree example, except for the following differences:

Set the SVM_CLASSIFIER preference with CTX_DDL.CREATE_PREFERENCE rather than setting it in CTX_CLS.TRAIN. (You can do it either way.)
Include category descriptions in the category table. (You can do it either way.)
Because rules are opaque to the user, use fewer arguments in CTX_CLS.TRAIN.

To create a SVM-based supervised classification:

Create and populate the training document table.

create table doc (id number primary key, text varchar2(2000));
insert into doc values(1,'1 2 3 4 5 6');
insert into doc values(2,'3 4 7 8 9 0');
insert into doc values(3,'a b c d e f');
insert into doc values(4,'g h i j k l m n o p q r');
insert into doc values(5,'g h i j k s t u v w x y z');

Create and populate the category table.

create table testcategory (
        doc_id number, 
        cat_id number, 
        cat_name varchar2(100)
         );
insert into testcategory values (1,1,'number');
insert into testcategory values (2,1,'number');
insert into testcategory values (3,2,'letter');
insert into testcategory values (4,2,'letter');
insert into testcategory values (5,2,'letter');

Create the CONTEXT index on the document table without populating it.

create index docx on doc(text) indextype is ctxsys.context 
       parameters('nopopulate');

Set the SVM_CLASSIFIER.

You can also set it in CTX.CLS_TRAIN.

exec ctx_ddl.create_preference('my_classifier','SVM_CLASSIFIER'); 
exec ctx_ddl.set_attribute('my_classifier','MAX_FEATURES','100');

Create the result (rule) table.

create table restab (
  cat_id number,
  type number(3) not null,
  rule blob
 );

Perform the training.

exec ctx_cls.train('docx', 'id','testcategory','doc_id','cat_id',
     'restab','my_classifier');

Create a CTXRULE index on the rules table.

exec ctx_ddl.create_preference('my_filter','NULL_FILTER');
create index restabx on restab (rule) 
       indextype is ctxsys.ctxrule 
       parameters ('filter my_filter classifier my_classifier');

Now you can classify two unknown documents, as follows:

select cat_id, match_score(1) from restab 
       where matches(rule, '4 5 6',1)>50;

select cat_id, match_score(1) from restab 
       where matches(rule, 'f h j',1)>50;

drop table doc;
drop table testcategory;
drop table restab;
exec ctx_ddl.drop_preference('my_classifier');
exec ctx_ddl.drop_preference('my_filter');