9.1 Overview of Document Classification
Each theme is a single word, a single phrase, or a hierarchical list of parent themes.
To sift through numerous documents, you can use keyword search engines. However, keyword searches have limitations. One major drawback is that they do not discriminate by context. In many languages, a word or phrase can have multiple meanings, so a search can return many matches that are not about the intended topic. For example, a query on the phrase "river bank" might return documents about the Hudson River Bank & Trust Company, because the word "bank" has two meanings.
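The context problem can be illustrated with a minimal sketch. This is a toy keyword matcher written for illustration only, not how Oracle Text or any search engine is implemented; the function name `keyword_search` and the two sample documents are assumptions made up for this example.

```python
# Toy keyword search (illustrative assumption, not an Oracle Text API):
# a document matches when it contains every word of the query.
def keyword_search(query, documents):
    terms = query.lower().split()
    return [
        title
        for title, text in documents.items()
        if all(term in text.lower().split() for term in terms)
    ]


# Hypothetical sample documents: one about a riverside, one about a company.
documents = {
    "geography": "Willows grow along the river bank north of the town",
    "finance": "The Hudson River Bank & Trust Company opened a new branch",
}

# Both documents contain "river" and "bank", so both are returned,
# although only the first is about the side of a river.
matches = keyword_search("river bank", documents)
```

Because the matcher looks only at which words occur, it has no way to tell the two senses of "bank" apart.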
Alternatively, you can sort through documents yourself and classify them by content. However, this approach is not feasible for very large volumes of documents.
Oracle Text offers three approaches to document classification. With rule-based classification (sometimes called simple classification), you write the classification rules yourself. With supervised classification, Oracle Text creates the classification rules from a set of sample documents that you preclassify. Finally, with unsupervised classification (also known as clustering), Oracle Text performs every step for you, from writing the classification rules to classifying the documents.
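To make the idea of rule-based classification concrete, the following is a minimal sketch in which the rules are written by hand and then applied to incoming text. It does not use Oracle Text and does not reflect its rule syntax; the `RULES` table, the category names, and the `classify` function are all hypothetical and exist only for this illustration.

```python
# Hand-written rules (hypothetical): each category lists alternative sets of
# words, and a document belongs to the category if all the words of at least
# one set occur in it.
RULES = {
    "Finance": [{"bank", "loan"}, {"trust", "company"}],
    "Geography": [{"river", "flood"}, {"river", "valley"}],
}


def classify(text, rules=RULES):
    """Return every category whose rule matches the text (possibly none)."""
    words = set(text.lower().split())
    return [
        category
        for category, alternatives in rules.items()
        if any(required <= words for required in alternatives)
    ]


finance_doc = classify("The bank approved the loan")
river_doc = classify("The river valley flooded after the bank collapsed")
unmatched = classify("nothing relevant here")
```

In this sketch the second document is assigned only to "Geography" even though it contains the word "bank", because each rule requires a combination of words rather than a single keyword. In supervised classification, a table like `RULES` would be derived from preclassified sample documents instead of being written by hand.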