7.18 Random Forest Model

The ore.odmRF class creates an in-database Random Forest (RF) model that provides an ensemble learning technique for classification.

By combining the ideas of bagging and random selection of variables, the Random Forest algorithm produces a collection of decision trees with controlled variance while avoiding overfitting, which is a common problem for decision trees.
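The two sources of randomness mentioned above can be pictured with a small base-R sketch. It is only an illustration, not how the in-database algorithm is implemented: the variable names are hypothetical, the values mirror the documented defaults, and sampling rows without replacement is an assumption.

```r
# Illustrative base-R sketch; this is NOT the in-database implementation.
set.seed(42)                      # fixed seed so the sketch is repeatable

n_rows     <- nrow(iris)
predictors <- setdiff(names(iris), "Species")

num_trees      <- 20              # mirrors the documented RFOR_NUM_TREES default
sampling_ratio <- 0.5             # mirrors the documented RFOR_SAMPLING_RATIO default
mtry           <- 2               # half of the 4 predictor columns

# Bagging: each tree is built on its own random sample of rows.
# Sampling without replacement is an assumption made for this sketch.
row_samples <- lapply(seq_len(num_trees), function(i) {
  sample(n_rows, size = floor(sampling_ratio * n_rows))
})

# Random variable selection: every split draws a fresh candidate pool of
# the same size; only the columns in the pool change.
draw_candidates <- function() sample(predictors, size = mtry)

length(row_samples)        # one row sample per tree
length(row_samples[[1]])   # number of rows used by the first tree
draw_candidates()          # candidate columns for one split
draw_candidates()          # a new draw for the next split
```

Because each tree sees different rows and each split sees different candidate columns, the trees disagree in their individual errors, and averaging their votes tends to reduce variance.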

Settings for a Random Forest Model

The following table lists settings that apply to Random Forest models.

Table 7-19 Random Forest Model Settings

| Setting Name | Setting Value | Description |
|---|---|---|
| RFOR_MTRY | a number >= 0 | Size of the random subset of columns to be considered when choosing a split at a node. For each node, the size of the pool remains the same, but the specific candidate columns change. The default is half of the columns in the model signature. The special value 0 indicates that the candidate pool includes all columns. |
| RFOR_NUM_TREES | 1 <= a number <= 65535 | Number of trees in the forest. The default is 20. |
| RFOR_SAMPLING_RATIO | 0 < a fraction <= 1 | Fraction of the training data to be randomly sampled for use in the construction of an individual tree. The default is 0.5, that is, half of the rows in the training data. |
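As a rough sketch, the documented ranges can be checked in R before a model is built. The helper `make_rf_settings` below is hypothetical and not part of OML4R. It also assumes that the Random Forest settings can be passed through `odm.settings` under lowercase names, in the same way that Example 7-21 passes the tree settings.

```r
# Hypothetical helper (not part of OML4R): validates values against the
# documented ranges and returns a list suitable for odm.settings.
make_rf_settings <- function(mtry = NULL, num_trees = 20, sampling_ratio = 0.5) {
  stopifnot(
    is.null(mtry) || (is.numeric(mtry) && mtry >= 0),        # RFOR_MTRY >= 0
    is.numeric(num_trees), num_trees >= 1, num_trees <= 65535,
    is.numeric(sampling_ratio), sampling_ratio > 0, sampling_ratio <= 1
  )
  settings <- list(rfor_num_trees      = num_trees,
                   rfor_sampling_ratio = sampling_ratio)
  # Leave RFOR_MTRY unset unless requested, so the database default applies.
  if (!is.null(mtry)) settings$rfor_mtry <- mtry
  settings
}

rf_settings <- make_rf_settings(mtry = 2, num_trees = 50, sampling_ratio = 0.6)
str(rf_settings)

# With a database connection and an ore.frame proxy named IRIS (assumed):
# mod.rf <- ore.odmRF(Species ~ ., IRIS, odm.settings = rf_settings)
```

Validating on the client side only catches out-of-range values early; the database remains the authority on which settings and values it accepts.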

Example 7-21 Using the ore.odmRF Function

This example pushes the data frame iris to a temporary database table IRIS and creates a Random Forest model.


# Turn off row ordering warnings

options(ore.warn.order=FALSE)

# Create a temporary OML4R proxy object IRIS.

IRIS <- ore.push(iris)

# Create an RF model object. Fit the RF model according to the data and setting parameters.

mod.rf <- ore.odmRF(Species ~ ., IRIS,
                    odm.settings = list(tree_impurity_metric = 'TREE_IMPURITY_ENTROPY',
                                        tree_term_max_depth = 5,
                                        tree_term_minrec_split = 5,
                                        tree_term_minpct_split = 2,
                                        tree_term_minrec_node = 5,
                                        tree_term_minpct_node = 0.05))
                        
# Show the model summary and attribute importance.

summary(mod.rf)
importance(mod.rf)

# Use the model to make predictions on the input data.

pred.rf <- predict(mod.rf, IRIS, supplemental.cols="Species")

# Generate a confusion matrix.

with(pred.rf, table(Species, PREDICTION))

Listing for This Example

Call: ore.odmRF(formula = Species ~ ., data = IRIS, odm.settings = list(tree_impurity_metric = "TREE_IMPURITY_ENTROPY", tree_term_max_depth = 5, tree_term_minrec_split = 5, tree_term_minpct_split = 2, tree_term_minrec_node = 5, tree_term_minpct_node = 0.05))
Settings:
                                                     value
      clas.max.sup.bins                                 32
      clas.weights.balanced                            OFF
      odms.details                             odms.enable
      odms.missing.value.treatment odms.missing.value.auto
      odms.random.seed                                   0
      odms.sampling                  odms.sampling.disable
      prep.auto                                         ON
      rfor.num.trees                                    20
      rfor.sampling.ratio                               .5
      impurity.metric                     impurity.entropy
      term.max.depth                                     5
      term.minpct.node                                0.05
      term.minpct.split                                  2
      term.minrec.node                                   5
      term.minrec.split                                  5
Importance:
    ATTRIBUTE_NAME ATTRIBUTE_SUBNAME ATTRIBUTE_IMPORTANCE 
1   Petal.Length             <NA>              0.60890776 
2   Petal.Width              <NA>              0.53412466
3   Sepal.Length             <NA>              0.23343292
4   Sepal.Width              <NA>              0.06182114

Table 7-20 A data.frame: 4 x 3

ATTRIBUTE_NAME ATTRIBUTE_SUBNAME ATTRIBUTE_IMPORTANCE
<chr> <chr> <dbl>
Petal.Length NA 0.60890776
Petal.Width NA 0.53412466
Sepal.Length NA 0.23343292
Sepal.Width NA 0.06182114
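The confusion matrix produced in the last step of the example can be reduced to summary metrics. The sketch below assumes the prediction results are available as a local data frame; `pred.local` is a hypothetical stand-in filled with placeholder values for illustration only, not output from the model above.

```r
# Hypothetical stand-in for prediction results held in a local data frame.
# The values are placeholders for illustration, not actual model output.
pred.local <- data.frame(
  Species    = factor(c("setosa", "setosa", "versicolor",
                        "versicolor", "virginica", "virginica")),
  PREDICTION = factor(c("setosa", "setosa", "versicolor",
                        "virginica", "virginica", "virginica"))
)

# Same confusion-matrix call as in the example: rows are actual classes,
# columns are predicted classes.
cm <- with(pred.local, table(Species, PREDICTION))

# Overall accuracy: correctly classified rows divided by all rows.
accuracy <- sum(diag(cm)) / sum(cm)

# Per-class recall: correct predictions divided by actual rows per class.
per_class_recall <- diag(cm) / rowSums(cm)

accuracy
per_class_recall
```

Taking the diagonal this way relies on both columns being factors with the same levels in the same order, so that the table is square.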