6.18 Random Forest Model
The ore.odmRF class creates an in-database Random Forest (RF) model that provides an ensemble learning technique for classification.
By combining the ideas of bagging and random selection of variables, the Random Forest algorithm produces a collection of decision trees with controlled variance while avoiding overfitting, which is a common problem for decision trees.
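As a rough illustration of the two ideas named above, the following base R sketch draws, for each tree, a random sample of rows (bagging) and a random subset of candidate columns. It does not fit any trees and does not use the database. The variable names (`num.trees`, `sampling.ratio`, `mtry`, `forest.plan`) and their values are assumptions made for this illustration. Rows are drawn with replacement, as in classic bagging; this section does not state how the in-database algorithm samples.

```r
# Illustrative sketch only: base R, no database connection, no trees are fit.
set.seed(42)                        # hypothetical seed, for repeatability

predictors <- setdiff(names(iris), "Species")
n.rows     <- nrow(iris)

num.trees      <- 3                 # hypothetical small forest
sampling.ratio <- 0.5               # fraction of rows drawn per tree
mtry           <- 2                 # candidate columns considered at a split

forest.plan <- lapply(seq_len(num.trees), function(i) {
  list(
    # Bagging: each tree gets its own random sample of rows.
    rows = sample(n.rows, size = floor(sampling.ratio * n.rows), replace = TRUE),
    # Random variable selection: candidate columns for one split of this tree.
    # A real forest redraws this subset at every node.
    split.candidates = sample(predictors, size = mtry)
  )
})

str(forest.plan, max.level = 2)
```

Because every tree sees different rows and different candidate columns, the trees disagree in their individual errors, which is what lets the combined vote reduce variance.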
Settings for a Random Forest Model
The following table lists settings that apply to Random Forest models.
Table 6-19 Random Forest Model Settings
Setting Name | Setting Value | Description |
---|---|---|
RFOR_MTRY | A number >= 0 | Size of the random subset of columns to be considered when choosing a split at a node. For each node, the size of the pool remains the same, but the specific candidate columns change. The default is half of the columns in the model signature. The special value 0 indicates that the candidate pool includes all columns. |
RFOR_NUM_TREES | 1 <= a number <= 65535 | Number of trees in the forest. The default is 20. |
RFOR_SAMPLING_RATIO | A fraction in the range 0 < fraction <= 1 | Fraction of the training data to be randomly sampled for use in the construction of an individual tree. The default is half of the number of rows in the training data. |
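The settings described in this table could be collected in a list and passed through `odm.settings`, in the same way the example in this section passes tree settings. The sketch below is a minimal illustration under assumptions: the lowercase names `rfor_num_trees`, `rfor_sampling_ratio`, and `rfor_mtry` are assumed to be the accepted spellings (they mirror the `rfor.num.trees` and `rfor.sampling.ratio` entries shown in the model summary listing), the values are arbitrary, and `fit.rf` is a hypothetical wrapper that is defined but not called because it needs an OML4R database connection.

```r
# Hypothetical values chosen for illustration, not recommendations.
rf.settings <- list(
  rfor_num_trees      = 50,   # number of trees in the forest
  rfor_sampling_ratio = 0.7,  # fraction of training rows sampled per tree
  rfor_mtry           = 2     # size of the random column subset per split
)

# The sampling ratio is described as a fraction, so check it before use.
stopifnot(rf.settings$rfor_sampling_ratio > 0,
          rf.settings$rfor_sampling_ratio <= 1)

# Hypothetical wrapper: requires library(ORE) and an active database
# connection, so it is only defined here and not called.
fit.rf <- function(data, settings) {
  ore.odmRF(Species ~ ., data, odm.settings = settings)
}
```

With a connection in place, the call would be `fit.rf(ore.push(iris), rf.settings)`. If the names are accepted as assumed, the Settings block of `summary()` should then report the supplied values instead of the defaults.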
Example 6-21 Using the ore.odmRF Function
This example pushes the data frame iris to a temporary database table IRIS and creates a Random Forest model.
# Turn off row ordering warnings
options(ore.warn.order=FALSE)
# Create the temporary OML4R proxy object IRIS.
IRIS <- ore.push(iris)
# Create an RF model object. Fit the RF model according to the data and setting parameters.
mod.rf <- ore.odmRF(Species ~ ., IRIS,
                    odm.settings = list(tree_impurity_metric = 'TREE_IMPURITY_ENTROPY',
                                        tree_term_max_depth = 5,
                                        tree_term_minrec_split = 5,
                                        tree_term_minpct_split = 2,
                                        tree_term_minrec_node = 5,
                                        tree_term_minpct_node = 0.05))
# Show the model summary and attribute importance.
summary(mod.rf)
importance(mod.rf)
# Use the model to make predictions on the input data.
pred.rf <- predict(mod.rf, IRIS, supplemental.cols="Species")
# Generate a confusion matrix.
with(pred.rf, table(Species, PREDICTION))
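A confusion matrix is usually summarized by a few ratios. The following base R sketch shows one way to compute overall accuracy and per-class recall and precision from such a table. It runs without a database: `pred.local` is a stand-in for the prediction result, and its `PREDICTION` column comes from a simple hand-written threshold rule on the local `iris` data, not from a Random Forest, so the numbers it prints say nothing about the model in this example.

```r
# Stand-in for pred.rf: a local data frame with the same two columns.
# PREDICTION comes from a hand-written rule, not from a Random Forest model.
rule.pred <- ifelse(iris$Petal.Length < 2.5, "setosa",
                    ifelse(iris$Petal.Width < 1.75, "versicolor", "virginica"))
pred.local <- data.frame(
  Species    = iris$Species,
  PREDICTION = factor(rule.pred, levels = levels(iris$Species))
)

# Rows are actual classes, columns are predicted classes.
cm <- with(pred.local, table(Species, PREDICTION))

accuracy  <- sum(diag(cm)) / sum(cm)   # share of rows on the diagonal
recall    <- diag(cm) / rowSums(cm)    # per actual class
precision <- diag(cm) / colSums(cm)    # per predicted class

print(cm)
print(round(c(accuracy = accuracy), 3))
print(round(rbind(recall, precision), 3))
```

To apply the same summary to the real predictions, `pred.local` could be replaced by a local copy of the result, for example `ore.pull(pred.rf)`.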
Listing for This Example
Call:
ore.odmRF(formula = Species ~ ., data = IRIS, odm.settings = list(tree_impurity_metric = "TREE_IMPURITY_ENTROPY",
    tree_term_max_depth = 5, tree_term_minrec_split = 5, tree_term_minpct_split = 2,
    tree_term_minrec_node = 5, tree_term_minpct_node = 0.05))
Settings:
                              value
clas.max.sup.bins             32
clas.weights.balanced         OFF
odms.details                  odms.enable
odms.missing.value.treatment  odms.missing.value.auto
odms.random.seed              0
odms.sampling                 odms.sampling.disable
prep.auto                     ON
rfor.num.trees                20
rfor.sampling.ratio           .5
impurity.metric               impurity.entropy
term.max.depth                5
term.minpct.node              0.05
term.minpct.split             2
term.minrec.node              5
term.minrec.split             5
Importance:
  ATTRIBUTE_NAME ATTRIBUTE_SUBNAME ATTRIBUTE_IMPORTANCE
1   Petal.Length              <NA>           0.60890776
2    Petal.Width              <NA>           0.53412466
3   Sepal.Length              <NA>           0.23343292
4    Sepal.Width              <NA>           0.06182114
Table 6-20 A data.frame: 4 x 3
ATTRIBUTE_NAME | ATTRIBUTE_SUBNAME | ATTRIBUTE_IMPORTANCE |
---|---|---|
<chr> | <chr> | <dbl> |
Petal.Length | NA | 0.60890776 |
Petal.Width | NA | 0.53412466 |
Sepal.Length | NA | 0.23343292 |
Sepal.Width | NA | 0.06182114 |
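The importance scores are easier to compare once they are ranked and put on a common scale. The base R sketch below copies the four values from the listing in this example into a local data frame, sorts them, and adds each attribute's share of the total. The `SHARE` column is only a normalization introduced for this illustration, not a quantity defined by the algorithm, and the name `imp` is likewise an assumption.

```r
# Values copied from the importance listing in this example.
imp <- data.frame(
  ATTRIBUTE_NAME       = c("Petal.Length", "Petal.Width", "Sepal.Length", "Sepal.Width"),
  ATTRIBUTE_IMPORTANCE = c(0.60890776, 0.53412466, 0.23343292, 0.06182114)
)

# Order from most to least important.
imp <- imp[order(-imp$ATTRIBUTE_IMPORTANCE), ]

# Express each score as a share of the total (illustrative normalization).
imp$SHARE <- imp$ATTRIBUTE_IMPORTANCE / sum(imp$ATTRIBUTE_IMPORTANCE)

print(imp, row.names = FALSE)
```

Read this way, the two petal measurements account for most of the total importance, which matches their position at the top of the listing.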