17.3 Training Sentiment Classifiers

Training a sentiment classifier generates the classification rules that are used to provide a positive or negative sentiment for a search keyword.

The following example trains a sentiment classifier that can perform sentiment analysis on user reviews of cameras:

  1. Create and populate the training document table. This table contains the actual text of the training set documents or the file names (if the documents are present externally).

    Select the training documents randomly to avoid bias in the trained sentiment classifier. The distribution of positive and negative documents must not be skewed. Oracle Text checks this distribution when it trains the sentiment classifier.

    create table training_camera (review_id number primary key, text varchar2(2000));
    insert into training_camera values( 1,'/sa/reviews/cameras/review1.txt');
    insert into training_camera values( 2,'/sa/reviews/cameras/review2.txt');
    insert into training_camera values( 3,'/sa/reviews/cameras/review3.txt');
    insert into training_camera values( 4,'/sa/reviews/cameras/review4.txt');
    
  2. Create and populate the category table.

    This table specifies training labels for the documents present in the document table. It tells the classifier the true sentiment of the training set documents.

    The primary key of the document table must have a foreign key relationship with the unique key of the category table. The names of these columns must be passed to the CTX_CLS.SA_TRAIN_MODEL procedure so that the sentiment label can be associated with the corresponding document.

    Oracle Text validates the parameters specified for the classifier preference and the category values. The category values are restricted to 1 for positive, 2 for negative, and 0 for neutral sentiment. Documents with a category of 0 (neutral documents) are not used while training the classifier. Additional columns in the category table, other than document ID and category, are also not used by the classifier.

    create table train_category (doc_id number, category number, category_desc varchar2(100));
    
    insert into train_category values (1,0,'neutral');
    insert into train_category values (2,1,'positive');
    insert into train_category values (3,2,'negative');
    insert into train_category values (4,2,'negative');
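
    The CREATE TABLE statement above does not declare the key relationship described in this step. A minimal sketch of one way to declare it follows. It assumes that the relationship means doc_id is unique in the category table and references review_id in the document table. The constraint names train_category_doc_uk and train_category_doc_fk are hypothetical and do not come from the original example.

    -- Assumption: doc_id is the unique key of the category table.
    alter table train_category add constraint train_category_doc_uk unique (doc_id);
    -- Assumption: each label row points at an existing training document.
    alter table train_category add constraint train_category_doc_fk foreign key (doc_id) references training_camera (review_id);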
    
  3. Create the context index on the training document table. This index is used to extract metadata for training documents while training the sentiment classifier.

    In this example, create an index without populating it.

    exec ctx_ddl.create_preference('fds','DIRECTORY_DATASTORE');
    create index docx on training_camera(text) indextype is ctxsys.context parameters ('datastore fds nopopulate');

  4. (Optional) Create a sentiment classifier preference for performing sentiment analysis on a document set consisting of camera reviews. The training call in the next step refers to this preference as clsfier.
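
    The original example shows no statements for this step. The following sketch shows what creating the preference might look like. It assumes the SENTIMENT_CLASSIFIER object type and its MAX_FEATURES and NUM_ITERATIONS attributes as described in the Oracle Text Reference. The attribute values are illustrative only, and the preference name clsfier is chosen to match the training call in the next step. Verify the attribute names and permitted values for your release before relying on them.

    -- Assumption: SENTIMENT_CLASSIFIER is the preference object type for sentiment classifiers.
    exec ctx_ddl.create_preference('clsfier','SENTIMENT_CLASSIFIER');
    -- Illustrative values, not recommendations.
    exec ctx_ddl.set_attribute('clsfier','MAX_FEATURES','1000');
    exec ctx_ddl.set_attribute('clsfier','NUM_ITERATIONS','600');
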
  5. Train the sentiment classifier clsfier_camera.

    During training, Oracle Text determines the ratio of positive to negative documents. If this ratio is not in the range of 0.4 to 0.6, then a warning written to the CTX log indicates that the sentiment classifier is skewed. After the sentiment classifier is trained, it is ready to be used in sentiment queries to perform sentiment analysis.
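
    To see the label balance before training, you can count the labels yourself. The query below is a sketch and not the check that Oracle Text performs. It assumes that the ratio refers to the share of positive documents among the non-neutral ones, which the text above does not state explicitly. The column aliases are hypothetical.

    -- Neutral rows (category 0) are excluded because they are not used in training.
    select sum(case when category = 1 then 1 else 0 end) as positive_docs,
           sum(case when category = 2 then 1 else 0 end) as negative_docs,
           round(sum(case when category = 1 then 1 else 0 end) /
                 nullif(sum(case when category in (1,2) then 1 else 0 end), 0), 2) as positive_share
    from train_category;

    For the four sample rows inserted earlier, this query counts one positive and two negative documents. Under the stated assumption, that sample set falls outside the 0.4 to 0.6 range, so a real training set needs a more even mix of labels.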

    In the following example, clsfier_camera is the name of the sentiment classifier that is being trained, docx is the name of the context index on the training document table, review_id is the name of the document ID column in the document training set, train_category is the name of the category table that contains the labels for the training set documents, doc_id is the document ID column in the category table, category is the category column in the category table, and clsfier is the name of the sentiment classifier preference that is used to train the classifier.

    exec ctx_cls.sa_train_model('clsfier_camera','docx','review_id','train_category','doc_id','category','clsfier');

    Note:

    If you do not specify a sentiment classifier preference when running the CTX_CLS.SA_TRAIN_MODEL procedure, then Oracle Text uses the default preference CTXSYS.DEFAULT_SENTIMENT_CLASSIFIER.