3.2.3 Build Model
Build your model using the training data set. Use the oml.rf function to build your model and specify the model settings.
For supervised learning, such as classification, split the data into training and test data before creating the model. Although you can use the entire data set to build a model, it is difficult to validate the model unless new data sets are available. Therefore, to evaluate the model and accurately assess its performance on data from the same source, you generally split the data into training and test data. You use the training data set to train the model and then use the test data set to test the accuracy of the model by running prediction queries. The test data set already contains known values for the attribute that you want to predict, so it is easy to determine whether the predictions of the model are correct.
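The hold-out idea described above can be sketched in plain Python. This is only an illustration of the concept, not how the OML4Py split method is implemented; the helper names split_rows and accuracy, the toy records, and the always-0 baseline are all hypothetical.

```python
import random

def split_rows(rows, ratio=(0.6, 0.4), seed=1):
    """Shuffle rows reproducibly and cut them into train and test lists."""
    rng = random.Random(seed)        # a fixed seed makes the split repeatable
    shuffled = list(rows)            # copy so the caller's data is untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * ratio[0])
    return shuffled[:cut], shuffled[cut:]

def accuracy(predicted, actual):
    """Fraction of predictions that match the known values."""
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)

# Hypothetical toy records: (customer id, known label)
records = [(i, i % 2) for i in range(100)]
train, test = split_rows(records)

# The test rows keep their known labels, so any prediction can be checked.
actual = [label for _, label in test]
predicted = [0] * len(test)          # trivial baseline: always predict 0
baseline = accuracy(predicted, actual)
print(len(train), len(test), baseline)
```

Because the test labels are never used for training, the accuracy computed on them is an estimate of how the model behaves on unseen records.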
Algorithm Selection
Before you build a model, choose a suitable algorithm. You can choose one of the following algorithms to solve a classification problem:
- Decision Tree
- Generalized Linear Model
- Naive Bayes
- Neural Network
- Random Forest
- Support Vector Machine
Here you will use the Random Forest algorithm because interpretability is not a major concern. The Random Forest algorithm is a type of ensemble method used for classification. Random Forest builds a number of independent decision trees and combines their outputs to make predictions. Each of these decision trees is built using a random sample of the input, and each tree uses a random subset of the features. This reduces the risk of overfitting while improving accuracy. To build a model using a supervised learning algorithm such as Random Forest, first split the data into training and test data. Then build the model using the training data and, once the model is built, score the test data using the model.
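The ensemble idea can be illustrated with a deliberately simplified sketch in plain Python. It is not the Oracle implementation: each "tree" here is a one-split stump on a single randomly chosen feature, whereas a real random forest grows deeper trees and samples features at each split. All function names and the toy data are hypothetical.

```python
import random
from collections import Counter

def train_stump(rows, labels, feature):
    """Fit a one-split rule on one feature, using the sample mean as threshold."""
    threshold = sum(r[feature] for r in rows) / len(rows)
    left = [y for r, y in zip(rows, labels) if r[feature] <= threshold]
    right = [y for r, y in zip(rows, labels) if r[feature] > threshold]
    fallback = Counter(labels).most_common(1)[0][0]
    left_label = Counter(left).most_common(1)[0][0] if left else fallback
    right_label = Counter(right).most_common(1)[0][0] if right else fallback
    return feature, threshold, left_label, right_label

def predict_stump(stump, row):
    feature, threshold, left_label, right_label = stump
    return left_label if row[feature] <= threshold else right_label

def train_forest(rows, labels, num_trees=25, seed=1):
    rng = random.Random(seed)
    forest = []
    for _ in range(num_trees):
        # Random sample of the input: draw rows with replacement (bootstrap).
        idx = [rng.randrange(len(rows)) for _ in rows]
        sample_rows = [rows[i] for i in idx]
        sample_labels = [labels[i] for i in idx]
        # Random subset of the features: here, a single random feature.
        feature = rng.randrange(len(rows[0]))
        forest.append(train_stump(sample_rows, sample_labels, feature))
    return forest

def predict_forest(forest, row):
    """Combine the trees by majority vote."""
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two features, label 1 for the upper half.
rows = [(i, 2 * i) for i in range(20)]
labels = [0] * 10 + [1] * 10
forest = train_forest(rows, labels)
print(predict_forest(forest, (2, 4)), predict_forest(forest, (17, 34)))
```

Each stump sees a slightly different sample and so places its threshold slightly differently; the vote averages out those differences, which is the intuition behind the ensemble's robustness.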
- You will split the CUSTOMER_DATA data with 60 percent of the records for the training data set and 40 percent for the test data set. The seed parameter sets the seed used for random splitting. The split method splits the data referenced by the DataFrame proxy object CUSTOMER_DATA into two new DataFrame proxy objects, TRAIN and TEST. Run the following script.
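# Split the data 60/40 into training and test proxy objects
TRAIN, TEST = CUSTOMER_DATA.split(ratio = (0.6,0.4), seed = 1)
# Training predictors (target column removed) and training target
TRAIN_X = TRAIN.drop('HOME_THEATER_PACKAGE')
TRAIN_Y = TRAIN['HOME_THEATER_PACKAGE']
# TEST_X keeps every column of TEST, including the target
TEST_X = TEST
TEST_Y = TEST['HOME_THEATER_PACKAGE']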
- Run the following statement to view a few rows of the training data set.
z.show(TRAIN)
- To specify model settings and build a Random Forest model object for predicting the HOME_THEATER_PACKAGE attribute, run the following script. The settings are given as a dictionary of key-value pairs, where each key is a parameter name and each value is its setting. The settings specified here include PREP_AUTO and RFOR_NUM_TREES. Random Forest uses the Decision Tree settings to configure the construction of individual trees. The fit function builds the Random Forest model according to the training data and parameter settings.
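# Drop any existing model with the same name; ignore the error if there is none
try:
    oml.drop(model = 'MODEL_RF')
except:
    pass

# Model settings as a dictionary of parameter names and values
settings = {'PREP_AUTO': 'ON',
            'ALGO_NAME': 'ALGO_RANDOM_FOREST',
            'RFOR_NUM_TREES': '25'}

# Create the Random Forest model object and build it on the training data
rf_mod = oml.rf(**settings)
rf_mod.fit(TRAIN_X, TRAIN_Y, case_id = 'CUST_ID', model_name = 'MODEL_RF')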
Model setting parameters:
- RFOR_NUM_TREES: Denotes the number of trees the random forest can have.
- PREP_AUTO: Used to specify fully automated or user-directed general data preparation. By default, it is enabled with the constant value 'PREP_AUTO': PREP_AUTO_ON. Alternatively, it can also be given as 'PREP_AUTO': 'ON'.
Note: Any parameters or settings that are not specified are either system-determined or set to their default values.
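The script above passes the settings dictionary with Python's ** operator. The following sketch shows what that unpacking does, using a hypothetical stand-in function (show_settings is not part of OML4Py) so that it runs without a database connection.

```python
def show_settings(**settings):
    """Hypothetical stand-in for a constructor that accepts settings as keywords."""
    return [f"{name} = {value}" for name, value in settings.items()]

settings = {'PREP_AUTO': 'ON', 'RFOR_NUM_TREES': '25'}

# ** unpacks the dictionary: each key becomes a keyword and each value its
# argument, so this call is equivalent to
# show_settings(PREP_AUTO='ON', RFOR_NUM_TREES='25').
lines = show_settings(**settings)
for line in lines:
    print(line)
```

Keeping the settings in a dictionary makes it easy to add or change a parameter without touching the call that creates the model object.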