3.2.3 Build Model

Build your model using the training data set. Use the oml.rf function to build your model and specify the model settings.

For supervised learning, such as classification, split the data into training and test data before you create the model. Although you can use the entire data set to build a model, it is difficult to validate the model unless new data sets are available. Therefore, to evaluate the model and accurately assess its performance on data that it was not trained on, you generally split the data into training and test data. You use the training data set to train the model, and then use the test data set to test the accuracy of the model by running prediction queries. The test data set already contains known values for the attribute that you want to predict, so it is easy to determine whether the predictions of the model are correct.
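The check described above amounts to comparing the predicted values with the known values in the test data. The following is a minimal plain-Python sketch of that comparison, independent of OML4Py. The function name `accuracy` and the two label lists are illustrative assumptions, not part of the use case data.

```python
def accuracy(actual, predicted):
    """Return the fraction of predictions that match the known labels."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy for an empty test set")
    matches = sum(1 for a, p in zip(actual, predicted) if a == p)
    return matches / len(actual)


# Hypothetical known values of the target in a held-out test set.
known = [1, 0, 1, 1, 0]
# Hypothetical predictions that a model produced for the same rows.
predicted = [1, 0, 0, 1, 0]

print(accuracy(known, predicted))  # 4 of 5 predictions match: 0.8
```

Because the known values come from rows that were not used for training, this number estimates how the model behaves on new data.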

Algorithm Selection

Before you build a model, choose a suitable algorithm. You can choose one of the following algorithms to solve a classification problem:

Decision Tree
Generalized Linear Model
Naive Bayes
Neural Network
Random Forest
Support Vector Machine

This use case uses the Random Forest algorithm because interpretability is not a major concern. Random Forest is an ensemble method used for classification. It builds a number of independent decision trees and combines their outputs to make predictions. Each tree is built from a random sample of the input, and each tree uses a random subset of the features. This reduces the risk of overfitting while improving accuracy. To build a model with a supervised learning algorithm such as Random Forest, first split the data into train and test data. Then build the model using the train data and, once the model is built, score the test data using the model.
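The idea of combining trees built on random samples and random features can be sketched in plain Python. This is a toy illustration only and not how the in-database algorithm is implemented: each "tree" is reduced to a single split on one feature, rows are sampled with replacement as in the classic formulation of random forests, and the data, function names, and seed are made up for the example.

```python
import random
from collections import Counter


def majority(labels):
    """Return the most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def train_stump(rows, labels, feature):
    """Fit a one-split tree on one feature, splitting at the feature mean."""
    threshold = sum(row[feature] for row in rows) / len(rows)
    left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
    right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
    fallback = majority(labels)
    return (feature, threshold,
            majority(left) if left else fallback,
            majority(right) if right else fallback)


def predict_stump(stump, row):
    """Predict the label stored on the side of the split where the row falls."""
    feature, threshold, left_label, right_label = stump
    return left_label if row[feature] <= threshold else right_label


def train_forest(rows, labels, num_trees, seed):
    """Build num_trees one-split trees, each on its own random sample and feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(num_trees):
        # Random sample of the input: draw row indices with replacement.
        picks = [rng.randrange(len(rows)) for _ in rows]
        # Random subset of the features: here a single feature per tree.
        feature = rng.randrange(len(rows[0]))
        forest.append(train_stump([rows[i] for i in picks],
                                  [labels[i] for i in picks], feature))
    return forest


def predict_forest(forest, row):
    """Combine the individual trees by majority vote."""
    return majority([predict_stump(stump, row) for stump in forest])


# Made-up, well-separated toy data: two numeric features, labels 0 and 1.
rows = [(1, 1), (2, 1), (1, 2), (2, 2), (18, 19), (19, 18), (19, 19), (18, 18)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

forest = train_forest(rows, labels, num_trees=25, seed=1)
print(predict_forest(forest, (1, 2)), predict_forest(forest, (19, 18)))
```

A single tree trained on an unlucky sample can be wrong, but the vote across many trees smooths out such errors, which is the intuition behind the reduced overfitting.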

Split the CUSTOMER_DATA data so that 60 percent of the records go to the train data set and 40 percent go to the test data set. The seed parameter seeds the random splitting, so using the same seed reproduces the same split. The split method splits the data referenced by the DataFrame proxy object CUSTOMER_DATA into two new DataFrame proxy objects, TRAIN and TEST. Run the following script.
```
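# Split the data 60/40 into train and test proxy objects; seed=1 seeds the random split.
TRAIN, TEST = CUSTOMER_DATA.split(ratio = (0.6,0.4),seed=1)
# Predictors: every training column except the target.
TRAIN_X = TRAIN.drop('HOME_THEATER_PACKAGE')
# Target: the column to predict.
TRAIN_Y = TRAIN['HOME_THEATER_PACKAGE']
# TEST_X keeps every column, including the target.
TEST_X = TEST
TEST_Y = TEST['HOME_THEATER_PACKAGE']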
```
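The effect of the ratio and seed arguments can be illustrated with a small plain-Python sketch. It is not how OML4Py implements split, which operates on data in the database. The function `split_rows` and the ten stand-in records are assumptions made for this example.

```python
import random


def split_rows(rows, ratio=(0.6, 0.4), seed=1):
    """Randomly partition rows into two lists whose sizes follow ratio."""
    if abs(sum(ratio) - 1.0) > 1e-9:
        raise ValueError("ratio must sum to 1")
    rng = random.Random(seed)      # private generator, so the split is reproducible
    shuffled = list(rows)          # copy, so the caller's list is left unchanged
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * ratio[0])
    return shuffled[:cut], shuffled[cut:]


rows = list(range(10))             # stand-in for ten customer records
train, test = split_rows(rows)
print(len(train), len(test))       # 6 4
```

Calling `split_rows(rows)` again with the same seed returns the same two lists, while a different seed generally gives a different partition of the same sizes.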
Run the following statement to view a few rows of the training data set.
```
z.show(TRAIN)
```
To specify model settings and build a Random Forest model object for predicting the HOME_THEATER_PACKAGE attribute, run the following script. The settings are given as a dictionary of key-value pairs, where each key is a parameter name and each value is the setting for that parameter. The settings specified here include PREP_AUTO and RFOR_NUM_TREES. Random Forest uses the Decision Tree settings to configure the construction of individual trees. The fit function builds the Random Forest model according to the training data and the parameter settings.
```
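# Drop any existing model named MODEL_RF so that the name can be reused.
try:
    oml.drop(model = 'MODEL_RF')
except Exception:
    # Ignore the error raised when there is no such model to drop.
    pass

# Model settings, given as parameter name and value pairs.
settings = {'PREP_AUTO': 'ON',
            'ALGO_NAME': 'ALGO_RANDOM_FOREST',
            'RFOR_NUM_TREES': '25'}

# Create the model object, then build the model from the training data.
rf_mod = oml.rf(**settings)
rf_mod.fit(TRAIN_X, TRAIN_Y, case_id = 'CUST_ID', model_name = 'MODEL_RF')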
```
Model setting parameters:
- RFOR_NUM_TREES: Specifies the number of trees in the random forest.
- PREP_AUTO: Specifies fully automated or user-directed general data preparation. It is enabled by default. You can give the setting with the constant, as 'PREP_AUTO': PREP_AUTO_ON, or with the string value, as 'PREP_AUTO': 'ON'.
  
  Note:
  Any settings that you do not specify are either determined by the system or take their default values.
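The way a settings dictionary is passed with `**` and combined with defaults can be sketched in plain Python. The function `resolve_settings` and the values in `ILLUSTRATIVE_DEFAULTS` are hypothetical stand-ins for this example. They are not part of OML4Py, and the actual defaults are determined by the database.

```python
# Hypothetical defaults, for illustration only.
ILLUSTRATIVE_DEFAULTS = {'PREP_AUTO': 'ON', 'RFOR_NUM_TREES': '20'}


def resolve_settings(**user_settings):
    """Return the defaults overridden by any settings the caller supplied."""
    resolved = dict(ILLUSTRATIVE_DEFAULTS)
    resolved.update(user_settings)
    return resolved


settings = {'RFOR_NUM_TREES': '25'}

# ** unpacks the dictionary into keyword arguments,
# so the next two calls are equivalent.
from_dict = resolve_settings(**settings)
from_keywords = resolve_settings(RFOR_NUM_TREES='25')

print(from_dict == from_keywords)  # True
print(from_dict)                   # {'PREP_AUTO': 'ON', 'RFOR_NUM_TREES': '25'}
```

Only RFOR_NUM_TREES was supplied here, so PREP_AUTO keeps the illustrative default, which mirrors the behavior that the note describes.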
