9.3 Classification Solutions

Oracle Text enables you to classify documents in the following ways:

  • Rule-Based Classification. For this solution, you group your documents, choose categories, and formulate the rules that define those categories; these rules are actually query phrases. You then index the rules and use the MATCHES operator to classify documents.

    Advantages: This solution is very accurate for small document sets. Results are always based on what you define, because you write the rules.

    Disadvantages: Defining rules can be tedious for large document sets with many categories. As your document set grows, you may need to write correspondingly more rules.

  • Supervised Classification. This solution is similar to rule-based classification, but the rule-writing step is automated with CTX_CLS.TRAIN. This procedure formulates a set of classification rules from a sample set of preclassified documents that you provide. As with rule-based classification, you use the MATCHES operator to classify documents.

    Oracle Text offers two versions of supervised classification, one using the RULE_CLASSIFIER preference and one using the SVM_CLASSIFIER preference. These preferences are discussed in "Supervised Classification".

    Advantages: Rules are written for you automatically. This method is useful for large document sets.

    Disadvantages: You must assign documents to categories before generating the rules. Rules may not be as specific or accurate as those you write yourself.

  • Unsupervised Classification (Clustering). All steps, from grouping your documents to writing the category rules, are automated with CTX_CLS.CLUSTERING. Oracle Text statistically analyzes your document set and assigns the documents to clusters according to their content.

    Advantages:

    • You do not need to provide the classification rules or the sample documents as a training set.

    • This solution helps to discover overlooked patterns and content similarities in your document set.

      In fact, you can use this solution when you do not have a clear idea of rules or classifications. For example, use it to produce an initial set of categories, and then build on those categories through supervised classification.

    Disadvantages:

    • Clustering is based on an internal algorithm. Because the clustering operation is not user-defined, it might result in unexpected groupings.

    • You do not see the rules that create the clusters.

    • The clustering operation is CPU-intensive and can take at least as long as indexing.
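
The rule-based workflow described above (write rules as query phrases, index them, then classify with MATCHES) can be sketched as follows. This is a minimal illustration, not a definitive implementation: the table name doc_rules, its columns, the index name doc_rules_idx, and the sample rules and document text are all hypothetical.

```sql
-- Hypothetical rule table: one row per rule, each rule is a query phrase.
CREATE TABLE doc_rules (
  rule_id   NUMBER PRIMARY KEY,
  category  VARCHAR2(30),
  rule_text VARCHAR2(200)
);

-- Sample rules (assumed categories and query phrases, for illustration only).
INSERT INTO doc_rules VALUES (1, 'Sports',   'soccer OR football');
INSERT INTO doc_rules VALUES (2, 'Politics', 'election AND (senate OR congress)');
COMMIT;

-- Index the rules with the CTXRULE index type so that MATCHES can use them.
CREATE INDEX doc_rules_idx ON doc_rules (rule_text)
  INDEXTYPE IS CTXSYS.CTXRULE;

-- Classify one document: return the category of every rule the text satisfies.
SELECT category
  FROM doc_rules
 WHERE MATCHES(rule_text,
               'The senate candidates debated ahead of the election.') > 0;
```

A document can satisfy more than one rule, so the query may return several categories for the same text. In supervised classification, the rows of a rule table like this one would be generated by CTX_CLS.TRAIN rather than written by hand.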