9.3 Classification Solutions

Oracle Text enables you to classify documents in the following ways:

Rule-Based Classification. For this solution, you group your documents, choose categories, and formulate the rules that define those categories; these rules are actually query phrases. You then index the rules and use the MATCHES operator to classify documents.

Advantages: This solution is very accurate for small document sets. Results are always based on what you define, because you write the rules.

Disadvantages: Defining rules can be tedious for large document sets with many categories. As your document set grows, you may need to write correspondingly more rules.
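The idea that rules are query phrases matched against each incoming document can be illustrated with a small conceptual sketch. This is plain Python, not Oracle Text code: the categories, the rules, and the `classify` helper are hypothetical, and each rule is simplified to a list of terms that must all appear in the document. In Oracle Text itself, the rules are indexed and the MATCHES operator performs this step.

```python
import re

# Hypothetical categories and rules. Each rule is a list of terms that must
# all appear in the document (a toy stand-in for a query phrase).
RULES = {
    "databases": [["oracle"], ["sql", "index"]],
    "finance": [["market", "share"], ["stock"]],
}


def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def classify(document, rules=RULES):
    """Return, in sorted order, every category with at least one matching rule."""
    tokens = tokenize(document)
    return sorted(
        category
        for category, queries in rules.items()
        if any(all(term in tokens for term in query) for query in queries)
    )


doc = "Oracle announced that its market share in databases increased."
print(classify(doc))  # ['databases', 'finance']
```

Because you write every rule, the result is fully predictable, but each new category means more rules to author and maintain.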

Supervised Classification. This solution is similar to rule-based classification, but the rule-writing step is automated with CTX_CLS.TRAIN. This procedure formulates a set of classification rules from a sample set of preclassified documents that you provide. As with rule-based classification, you use the MATCHES operator to classify documents.

Oracle Text offers two versions of supervised classification, one using the RULE_CLASSIFIER preference and one using the SVM_CLASSIFIER preference. These preferences are discussed in "Supervised Classification".

Advantages: Rules are written for you automatically. This method is useful for large document sets.

Disadvantages: You must assign documents to categories before generating the rules. Rules may not be as specific or accurate as those you write yourself.
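To make the training step concrete, here is a minimal conceptual sketch of deriving rules from preclassified samples. It is not the algorithm that CTX_CLS.TRAIN uses: the `train` and `classify` helpers, the `max_terms` parameter, and the sample data are all assumptions made for illustration. The toy heuristic keeps, for each category, the terms that occur in the most sample documents of that category and in no other category.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(samples, max_terms=3):
    """Derive toy rules from (category, text) pairs.

    For each category, keep up to max_terms terms that appear in that
    category's samples and in no other category's samples.
    """
    doc_freq = {}
    for category, text in samples:
        # Count each term once per document (document frequency).
        doc_freq.setdefault(category, Counter()).update(set(tokenize(text)))

    rules = {}
    for category, counter in doc_freq.items():
        other_terms = set()
        for other, other_counter in doc_freq.items():
            if other != category:
                other_terms.update(other_counter)
        distinctive = [(t, n) for t, n in counter.items() if t not in other_terms]
        # Most frequent first; ties broken alphabetically for determinism.
        distinctive.sort(key=lambda item: (-item[1], item[0]))
        rules[category] = [term for term, _ in distinctive[:max_terms]]
    return rules


def classify(document, rules):
    """Return, in sorted order, every category whose rule terms match the document."""
    tokens = set(tokenize(document))
    return sorted(c for c, terms in rules.items() if any(t in tokens for t in terms))


# Hypothetical preclassified training samples.
SAMPLES = [
    ("sports", "the team won the match"),
    ("sports", "the team lost the final match"),
    ("cooking", "bake the bread in the oven"),
    ("cooking", "knead the bread dough"),
]

rules = train(SAMPLES, max_terms=2)
print(rules)  # {'sports': ['match', 'team'], 'cooking': ['bread', 'bake']}
print(classify("the home team played a great match", rules))  # ['sports']
```

The sketch also shows the trade-off described above: the generated rules depend entirely on the samples you labeled, and they may be coarser than rules written by hand.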

Unsupervised Classification (Clustering). All steps, from grouping your documents to writing the category rules, are automated with CTX_CLS.CLUSTERING. Oracle Text statistically analyzes your document set and correlates the documents with clusters according to their content.

Advantages:
- You do not need to provide the classification rules or the sample documents as a training set.
- This solution helps to discover overlooked patterns and content similarities in your document set.
  
  You can use this solution even when you do not have a clear idea of rules or classifications. For example, use it to produce an initial set of categories, and then build on those categories through supervised classification.

Disadvantages:
- Clustering is based on an internal algorithm that you do not define, so it might produce unexpected groupings.
- You do not see the rules that create the clusters.
- The clustering operation is CPU-intensive and can take at least as long as indexing.
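
The following conceptual sketch shows what grouping documents by content alone, with no rules and no training set, can look like. It deliberately swaps in a very simple technique, a single-pass grouping by word overlap (Jaccard similarity), which is not the algorithm behind CTX_CLS.CLUSTERING. The `cluster` helper, the `threshold` parameter, and the sample documents are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Return the Jaccard similarity of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(documents, threshold=0.2):
    """Group document indexes by word overlap in a single pass.

    Each document joins the first cluster whose seed document is similar
    enough; otherwise it becomes the seed of a new cluster.
    """
    clusters = []  # list of (seed_tokens, member_indexes)
    for index, text in enumerate(documents):
        tokens = tokenize(text)
        for seed_tokens, members in clusters:
            if jaccard(tokens, seed_tokens) >= threshold:
                members.append(index)
                break
        else:
            clusters.append((tokens, [index]))
    return [members for _, members in clusters]


# Hypothetical unlabeled documents.
DOCS = [
    "database index query performance",
    "bread oven baking recipe",
    "query optimizer and database index",
    "sourdough bread recipe",
]

print(cluster(DOCS))  # [[0, 2], [1, 3]]
```

Note that the output contains only groups of documents, not named categories or readable rules, and that changing `threshold` changes the groups, which mirrors the disadvantages listed above.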