7.18 Random Forest Model

The ore.odmRF class creates an in-database Random Forest (RF) model that provides an ensemble learning technique for classification.

By combining bagging with random selection of variables, the Random Forest algorithm produces a collection of decision trees with controlled variance and reduces overfitting, a common problem for individual decision trees.
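The two sources of randomness described above can be sketched in base R. This is only an illustration of the idea, not the in-database algorithm: the helper names `draw_row_sample` and `draw_split_candidates` are hypothetical, and sampling rows without replacement is an assumption made here for simplicity.

```r
# Illustrative sketch only; not the in-database implementation.
# Both helper names are hypothetical.

# Draw the row indices used to grow one tree.
# Assumption: rows are sampled without replacement at the given ratio.
draw_row_sample <- function(n_rows, sampling_ratio = 0.5) {
  stopifnot(sampling_ratio > 0, sampling_ratio <= 1)
  size <- max(1, floor(n_rows * sampling_ratio))
  sample.int(n_rows, size = size)
}

# Draw the candidate predictors examined at one node.
# The pool size stays fixed; the columns in it change from node to node.
draw_split_candidates <- function(predictors, mtry) {
  stopifnot(mtry >= 1, mtry <= length(predictors))
  sample(predictors, size = mtry)
}

set.seed(1)
predictors <- setdiff(names(iris), "Species")

# Each tree sees its own random half of the rows.
rows_tree1 <- draw_row_sample(nrow(iris), sampling_ratio = 0.5)

# Each node considers its own random pair of predictors.
node_a <- draw_split_candidates(predictors, mtry = 2)
node_b <- draw_split_candidates(predictors, mtry = 2)
```

Because every tree is grown on a different row sample and every split sees a different subset of columns, the trees are less correlated with one another, which is what lets their combined vote vary less than any single tree.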

Settings for a Random Forest Model

The following table lists settings that apply to Random Forest models.

Table 7-19 Random Forest Model Settings

Setting Name	Setting Value	Description
`RFOR_MTRY`	`a number >= 0`	Size of the random subset of columns to be considered when choosing a split at a node. For each node, the size of the pool remains the same, but the specific candidate columns change. The default is half of the columns in the model signature. The special value `0` indicates that the candidate pool includes all columns.
`RFOR_NUM_TREES`	`1 <= a number <= 65535`	Number of trees in the forest. The default is `20`.
`RFOR_SAMPLING_RATIO`	`0 < a fraction <= 1`	Fraction of the training data to be randomly sampled for use in the construction of an individual tree. The default is `0.5`, that is, half of the rows in the training data.
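As a minimal sketch of how these settings might be assembled, the following base R helper checks values against the ranges in the table and returns a list for the `odm.settings` argument. The helper `rf_settings` is hypothetical, and the lowercase names `rfor_num_trees`, `rfor_sampling_ratio`, and `rfor_mtry` are an assumption that follows the lowercase pattern of the `tree_*` settings used in Example 7-21.

```r
# Hypothetical helper: validates values against the documented ranges and
# returns a list intended for the odm.settings argument of ore.odmRF.
# Assumption: setting names are passed in lowercase, as in Example 7-21.
rf_settings <- function(num_trees = 20, sampling_ratio = 0.5, mtry = NULL) {
  stopifnot(num_trees >= 1, num_trees <= 65535, num_trees == round(num_trees))
  stopifnot(sampling_ratio > 0, sampling_ratio <= 1)
  settings <- list(rfor_num_trees = num_trees,
                   rfor_sampling_ratio = sampling_ratio)
  # Leave rfor_mtry unset to keep the database default;
  # 0 means every column is a candidate at each node.
  if (!is.null(mtry)) {
    stopifnot(mtry >= 0)
    settings$rfor_mtry <- mtry
  }
  settings
}

settings <- rf_settings(num_trees = 50, sampling_ratio = 0.8, mtry = 2)

# With an active OML4R session and a proxy object such as IRIS:
# mod.rf <- ore.odmRF(Species ~ ., IRIS, odm.settings = settings)
```

Validating on the client side is optional; it simply surfaces an out-of-range value before the model build is sent to the database.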

Example 7-21 Using the ore.odmRF Function

This example pushes the data frame iris to a temporary database table, accessed through the proxy object IRIS, and creates a Random Forest model from it.


# Load OML4R. The example assumes an active database connection (see ore.connect).
library(ORE)

# Turn off row ordering warnings.
options(ore.warn.order = FALSE)

# Create the temporary OML4R proxy object IRIS.
IRIS <- ore.push(iris)

# Create an RF model object, fitting the model to the data with the given settings.
mod.rf <- ore.odmRF(Species ~ ., IRIS,
                    odm.settings = list(tree_impurity_metric = 'TREE_IMPURITY_ENTROPY',
                                        tree_term_max_depth = 5,
                                        tree_term_minrec_split = 5,
                                        tree_term_minpct_split = 2,
                                        tree_term_minrec_node = 5,
                                        tree_term_minpct_node = 0.05))

# Show the model summary and attribute importance.
summary(mod.rf)
importance(mod.rf)

# Use the model to make predictions on the input data.
pred.rf <- predict(mod.rf, IRIS, supplemental.cols = "Species")

# Generate a confusion matrix.
with(pred.rf, table(Species, PREDICTION))

Listing for This Example

Call:
ore.odmRF(formula = Species ~ ., data = IRIS, odm.settings = list(tree_impurity_metric = "TREE_IMPURITY_ENTROPY",
    tree_term_max_depth = 5, tree_term_minrec_split = 5, tree_term_minpct_split = 2,
    tree_term_minrec_node = 5, tree_term_minpct_node = 0.05))

Settings:
                                               value
clas.max.sup.bins                                 32
clas.weights.balanced                            OFF
odms.details                             odms.enable
odms.missing.value.treatment odms.missing.value.auto
odms.random.seed                                   0
odms.sampling                  odms.sampling.disable
prep.auto                                         ON
rfor.num.trees                                    20
rfor.sampling.ratio                               .5
impurity.metric                     impurity.entropy
term.max.depth                                     5
term.minpct.node                                0.05
term.minpct.split                                  2
term.minrec.node                                   5
term.minrec.split                                  5

Importance:
  ATTRIBUTE_NAME ATTRIBUTE_SUBNAME ATTRIBUTE_IMPORTANCE
1   Petal.Length              <NA>           0.60890776
2    Petal.Width              <NA>           0.53412466
3   Sepal.Length              <NA>           0.23343292
4    Sepal.Width              <NA>           0.06182114

Table 7-20 A data.frame: 4 x 3

ATTRIBUTE_NAME	ATTRIBUTE_SUBNAME	ATTRIBUTE_IMPORTANCE
<chr>	<chr>	<dbl>
Petal.Length	NA	0.60890776
Petal.Width	NA	0.53412466
Sepal.Length	NA	0.23343292
Sepal.Width	NA	0.06182114
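The confusion matrix produced by the last step of the example is often summarized as overall accuracy and per-class recall. The listing does not show that matrix, so the sketch below uses a small set of made-up labels rather than model output; with a live session, `ore.pull(pred.rf)` would be one way to bring the predictions into a local data frame first.

```r
# Made-up labels for illustration only; these are not results from mod.rf.
actual    <- factor(c("setosa", "setosa", "versicolor", "versicolor",
                      "virginica", "virginica"))
predicted <- factor(c("setosa", "setosa", "versicolor", "virginica",
                      "virginica", "virginica"),
                    levels = levels(actual))

# Same layout as the example: true class in rows, prediction in columns.
cm <- table(Species = actual, PREDICTION = predicted)

# Overall accuracy: correctly classified rows over all rows.
accuracy <- sum(diag(cm)) / sum(cm)

# Per-class recall: diagonal count over the row total for each true class.
recall <- diag(cm) / rowSums(cm)
```

Giving both factors the same levels keeps the table square, so the diagonal always lines up with the correct predictions even when a class is never predicted.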
