9.3 Classification Solutions
Oracle Text enables you to classify documents in the following ways:
-
Rule-Based Classification. For this solution, you group your documents, choose categories, and formulate the rules that define those categories; these rules are actually query phrases. You then index the rules and use the MATCHES operator to classify documents.
Advantages: This solution is very accurate for small document sets. Results are always based on what you define, because you write the rules.
Disadvantages: Defining rules can be tedious for large document sets with many categories. As your document set grows, you may need to write correspondingly more rules.
-
Supervised Classification. This solution is similar to rule-based classification, but the rule-writing step is automated with CTX_CLS.TRAIN. This procedure formulates a set of classification rules from a sample set of preclassified documents that you provide. As with rule-based classification, you use the MATCHES operator to classify documents.
Oracle Text offers two versions of supervised classification, one using the RULE_CLASSIFIER preference and one using the SVM_CLASSIFIER preference. These preferences are discussed in "Supervised Classification".
Advantages: Rules are written for you automatically. This method is useful for large document sets.
Disadvantages: You must assign documents to categories before generating the rules. Rules may not be as specific or accurate as those you write yourself.
-
Unsupervised Classification (Clustering). All steps, from grouping your documents to writing the category rules, are automated with CTX_CLS.CLUSTERING. Oracle Text statistically analyzes your document set and assigns the documents to clusters according to their content.
Advantages:
-
You do not need to provide the classification rules or the sample documents as a training set.
-
This solution helps to discover overlooked patterns and content similarities in your document set.
In fact, you can use this solution when you do not have a clear idea of rules or classifications. For example, use it to provide an initial set of categories and to build on the categories through supervised classification.
Disadvantages:
-
Clustering is based on an internal algorithm. Because the clustering operation is not user-defined, it might produce unexpected groupings.
-
You do not see the rules that create the clusters.
-
The clustering operation is CPU-intensive and can take at least as long as indexing.
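The three solutions above can be sketched in SQL as follows. This is a minimal illustration, not a complete procedure: every table, column, index, and preference name is hypothetical, the sample rules are invented, and the tables must be loaded with your own documents and category assignments before the training and clustering calls produce useful results. The arguments to CTX_CLS.TRAIN and CTX_CLS.CLUSTERING are passed positionally; check the order against the Oracle Text Reference for your release.

```sql
-- All object names below are hypothetical placeholders.

-- 1. Rule-based classification: hand-written rules are query phrases.
CREATE TABLE category_rules (
  rule_id   NUMBER PRIMARY KEY,
  category  VARCHAR2(40),
  rule_text VARCHAR2(2000)
);

INSERT INTO category_rules VALUES (1, 'Sports',   'tennis OR football');
INSERT INTO category_rules VALUES (2, 'Politics', 'election OR senate');
COMMIT;

-- Index the rules with a CTXRULE index.
CREATE INDEX category_rules_idx ON category_rules (rule_text)
  INDEXTYPE IS CTXSYS.CTXRULE;

-- Classify one document: MATCHES selects every rule the text satisfies.
SELECT category
FROM   category_rules
WHERE  MATCHES(rule_text, 'The senate vote follows a close election') > 0;

-- 2. Supervised classification: CTX_CLS.TRAIN writes the rules.
--    docs holds the sample documents; doc_categories holds the
--    category you assigned to each sample document.
CREATE TABLE docs (doc_id NUMBER PRIMARY KEY, text VARCHAR2(4000));
CREATE TABLE doc_categories (doc_id NUMBER, cat_id NUMBER);
CREATE INDEX docs_idx ON docs (text) INDEXTYPE IS CTXSYS.CONTEXT;

-- Result table that receives the generated rules.
CREATE TABLE generated_rules (
  cat_id     NUMBER,
  rule_text  VARCHAR2(4000),
  confidence NUMBER
);

BEGIN
  CTX_DDL.CREATE_PREFERENCE('my_rule_classifier', 'RULE_CLASSIFIER');
  CTX_CLS.TRAIN(
    'docs_idx',            -- CONTEXT index on the sample documents
    'doc_id',              -- document ID column in docs
    'doc_categories',      -- table of preclassified assignments
    'doc_id',              -- document ID column in doc_categories
    'cat_id',              -- category ID column in doc_categories
    'generated_rules',     -- result table
    'cat_id',              -- category ID column in the result table
    'rule_text',           -- generated query column in the result table
    'confidence',          -- confidence column in the result table
    'my_rule_classifier'   -- classifier preference
  );
END;
/

-- The generated rules are then indexed and queried like hand-written ones.
CREATE INDEX generated_rules_idx ON generated_rules (rule_text)
  INDEXTYPE IS CTXSYS.CTXRULE;

-- 3. Unsupervised classification: CTX_CLS.CLUSTERING groups the documents.
BEGIN
  CTX_DDL.CREATE_PREFERENCE('my_cluster', 'KMEAN_CLUSTERING');
  CTX_DDL.SET_ATTRIBUTE('my_cluster', 'CLUSTER_NUM', '3');
  CTX_CLS.CLUSTERING(
    'docs_idx',        -- CONTEXT index on the documents
    'doc_id',          -- document ID column in docs
    'doc_clusters',    -- document-to-cluster assignment table
    'cluster_desc',    -- cluster description table
    'my_cluster'       -- clustering preference
  );
END;
/
```

Steps 1 and 2 end the same way, with a CTXRULE index queried through MATCHES; only the origin of the rules differs. Step 3 produces cluster assignments rather than rules, which is why its output is typically reviewed and then used as the preclassified sample for step 2.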