3.1.3 Build Model

Build your model using the training data set. Use the oml.glm function to build your model and specify model settings.

For supervised learning, such as regression, split the data into training and test data before you create the model. Although you can use the entire data set to build a model, it is difficult to validate that model unless new data sets are available, because performance measured on the same data that trained the model is overly optimistic. Therefore, to evaluate the model and accurately assess its performance, you generally split the data into training and test data. You use the training data set to train the model, and then use the test data set to test the accuracy of the model by running prediction queries. The test data set already contains known values for the attribute that you want to predict, so it is easy to determine whether the model's predictions are correct.
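The idea of a seeded train/test split can be sketched in plain Python, independent of OML4Py. This is a minimal illustration only: the helper split_rows and the sample records are hypothetical and are not part of the oml API.

```python
import random

def split_rows(rows, train_fraction=0.8, seed=15):
    """Shuffle a copy of rows with a fixed seed, then cut it into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # same seed gives the same shuffle every run
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical records: (id, gross_square_feet, sale_price)
rows = [(i, 500 + 10 * i, 100000 + 2000 * i) for i in range(100)]
train, test = split_rows(rows)

print(len(train), len(test))
```

Because the seed is fixed, rerunning the split returns the same two subsets, which keeps later evaluation results reproducible.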

Algorithm Selection

Before you build a model, choose a suitable algorithm. You can choose one of the following algorithms to solve a regression problem:

  • Extreme Gradient Boosting
  • Generalized Linear Model
  • Neural Network
  • Support Vector Machine

To understand a new data set, start with a simple baseline model. The Generalized Linear Model algorithm is a good choice because it is easy to interpret: it fits a linear relationship between the features and the target. The result of the linear model gives you an initial understanding of the data set.
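To see why a linear model is easy to interpret, consider a least-squares line fitted to a single feature. The sketch below uses plain Python and made-up numbers. It does not show how oml.glm works internally, and the function fit_line and the sample values are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data: floor area in square feet and sale price in dollars.
area = [600, 800, 1000, 1200, 1400]
price = [310000, 390000, 470000, 550000, 630000]

intercept, slope = fit_line(area, price)
print(f"each extra square foot adds about {slope:.0f} dollars")
```

The slope reads directly as the change in predicted price per unit change in the feature, which is what makes a linear baseline a useful first look at a data set.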

The following steps guide you to split your data and build your model with the selected algorithm.

  1. Split Data: Train/Test:

    Split the data into training and test data with an 80/20 ratio, respectively. The seed parameter makes the random split reproducible. The split method splits the data referenced by the DataFrame proxy object BROOKLYN5 into two new DataFrame proxy objects, TRAIN and TEST.
    TRAIN, TEST = BROOKLYN5.split(ratio = (0.8,0.2), seed=15)
    TRAIN_X = TRAIN.drop('sale_price')
    TRAIN_Y = TRAIN['sale_price']
    TEST_X = TEST
    TEST_Y = TEST['sale_price']
  2. Model Building:

    To specify the model settings and build a Generalized Linear Model (GLM) object that predicts the sale_price attribute, run the following script. The settings are given as a dictionary of key-value pairs, where each key is a parameter name and each value is its setting.

    # Drop any existing model with the same name so that the build can reuse the name.
    try:
        oml.drop(model = 'BROOKLYN_GLM_REGRESSION_MODEL')
    except Exception:
        print('No such model')
     
    setting = {'PREP_AUTO':'ON',
               'GLMS_ROW_DIAGNOSTICS':'GLMS_ROW_DIAG_ENABLE',
               'GLMS_FTR_SELECTION':'GLMS_FTR_SELECTION_ENABLE',
               'GLMS_FTR_GENERATION':'GLMS_FTR_GENERATION_ENABLE'}
                
    glm_mod = oml.glm("regression", **setting)
    glm_mod = glm_mod.fit(TRAIN_X,TRAIN_Y,model_name = 'BROOKLYN_GLM_REGRESSION_MODEL',case_id = 'ID')
    

    Model setting parameters:

    • PREP_AUTO: Specifies fully automated or user-directed general data preparation. By default, automatic data preparation is enabled ('PREP_AUTO': 'ON', which corresponds to the constant PREP_AUTO_ON).
    • GLMS_ROW_DIAGNOSTICS: Enables or disables the row diagnostics. By default, row diagnostics are disabled.
    • GLMS_FTR_SELECTION: Enables or disables feature selection for GLM. By default, feature selection is not enabled.
    • GLMS_FTR_GENERATION: Specifies whether or not feature generation is enabled for GLM. By default, feature generation is not enabled.

      Note:

      Feature generation can only be enabled when feature selection is also enabled.
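The dependency described in the note can be checked before a settings dictionary is passed to the model. The following is a minimal sketch in plain Python. The helper check_glm_settings is hypothetical and not part of OML4Py; it only inspects the dictionary, using the setting names and values shown above.

```python
def check_glm_settings(settings):
    """Raise ValueError if feature generation is enabled without feature selection."""
    generation_on = settings.get('GLMS_FTR_GENERATION') == 'GLMS_FTR_GENERATION_ENABLE'
    selection_on = settings.get('GLMS_FTR_SELECTION') == 'GLMS_FTR_SELECTION_ENABLE'
    if generation_on and not selection_on:
        raise ValueError('GLMS_FTR_GENERATION requires GLMS_FTR_SELECTION to be enabled')
    return settings

# Valid: both feature selection and feature generation are enabled.
valid = check_glm_settings({'PREP_AUTO': 'ON',
                            'GLMS_FTR_SELECTION': 'GLMS_FTR_SELECTION_ENABLE',
                            'GLMS_FTR_GENERATION': 'GLMS_FTR_GENERATION_ENABLE'})

# Invalid: feature generation is enabled on its own, so the check raises.
try:
    check_glm_settings({'GLMS_FTR_GENERATION': 'GLMS_FTR_GENERATION_ENABLE'})
    rejected = False
except ValueError:
    rejected = True

print(rejected)
```

Running such a check up front surfaces an inconsistent combination before the model build is attempted.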