9.16 Random Forest
The oml.rf class creates a Random Forest (RF) model, which provides an ensemble learning technique for classification.
By combining the ideas of bagging and random selection of variables, the Random Forest algorithm produces a collection of decision trees with controlled variance while avoiding overfitting, which is a common problem for decision trees.
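To make the two sources of randomness concrete, the following minimal sketch illustrates them in plain Python. This is a conceptual illustration only, not the in-database implementation used by oml.rf; the array sizes and variable names are hypothetical.
import numpy as np

# Conceptual sketch of the randomness in a Random Forest;
# oml.rf performs the actual work in the database.
rng = np.random.default_rng(0)
n_rows, n_cols, n_trees = 100, 8, 20
X = rng.normal(size=(n_rows, n_cols))

for t in range(n_trees):
    # Bagging: each tree is grown on a random sample of the training
    # rows (cf. RFOR_SAMPLING_RATIO, which defaults to half the rows).
    rows = rng.choice(n_rows, size=n_rows // 2, replace=False)
    # Random variable selection: each split considers only a random
    # subset of the columns (cf. RFOR_MTRY, which defaults to half the
    # columns; in oml.rf the candidate pool changes from node to node).
    cols = rng.choice(n_cols, size=n_cols // 2, replace=False)
    sample = X[np.ix_(rows, cols)]
    # ... a decision tree would be fit to `sample` here ...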
For information on the oml.rf class attributes and methods, invoke help(oml.rf) or see Oracle Machine Learning for Python API Reference.
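For example, after importing oml you can display the class documentation directly in a Python session:
import oml

# Print the documentation for the oml.rf class, including its
# attributes and methods.
help(oml.rf)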
Settings for a Random Forest Model
The following table lists settings for RF models.
Table 9-14 Random Forest Model Settings
Setting Name | Setting Value | Description |
---|---|---|
CLAS_COST_TABLE_NAME | table_name | The name of a table that stores a cost matrix for the algorithm to use in scoring the model. The cost matrix specifies the costs associated with misclassifications. The cost matrix table is user-created. The following are the column requirements for the table: ACTUAL_TARGET_VALUE (valid target data type), PREDICTED_TARGET_VALUE (valid target data type), and COST (NUMBER). |
CLAS_MAX_SUP_BINS | 2 <= a number <= 2147483647 | Specifies the maximum number of bins for each attribute. The default value is 32. |
CLAS_WEIGHTS_BALANCED | ON or OFF | Indicates whether the algorithm must create a model that balances the target distribution. This setting is most relevant in the presence of rare targets, as balancing the distribution may enable better average accuracy (average of per-class accuracy) instead of overall accuracy (which favors the dominant class). The default value is OFF. |
ODMS_RANDOM_SEED | A non-negative integer | Controls the random number seed used by the hash function to generate a random number with uniform distribution. The default value is 0. |
RFOR_MTRY | A number >= 0 | Size of the random subset of columns to consider when choosing a split at a node. For each node, the size of the pool remains the same but the specific candidate columns change. The default is half of the columns in the model signature. The special value 0 indicates that the candidate pool includes all columns. |
RFOR_NUM_TREES | 1 <= a number <= 65535 | Number of trees in the forest. The default value is 20. |
RFOR_SAMPLING_RATIO | 0 < a fraction <= 1 | Fraction of the training data to be randomly sampled for use in the construction of an individual tree. The default is half of the number of rows in the training data. |
TREE_IMPURITY_METRIC | TREE_IMPURITY_ENTROPY or TREE_IMPURITY_GINI | Tree impurity metric for a decision tree model. Tree algorithms seek the best test question for splitting data at each node. The best splitter and split value are those that result in the largest increase in target value homogeneity (purity) for the entities in the node. Purity is measured in accordance with a metric. Decision trees can use either gini (TREE_IMPURITY_GINI) or entropy (TREE_IMPURITY_ENTROPY) as the purity metric. The default value is TREE_IMPURITY_GINI. |
TREE_TERM_MAX_DEPTH | 2 <= a number <= 100 | Criteria for splits: maximum tree depth (the maximum number of nodes between the root and any leaf node, including the leaf node). The default value is 16. |
TREE_TERM_MINPCT_NODE | 0 <= a number <= 10 | The minimum number of training rows in a node expressed as a percentage of the rows in the training data. The default value is 0.05. |
TREE_TERM_MINPCT_SPLIT | 0 < a number <= 20 | Minimum number of rows required to consider splitting a node expressed as a percentage of the training rows. The default value is 0.1. |
TREE_TERM_MINREC_NODE | A number >= 0 | Minimum number of rows in a node. The default value is 10. |
TREE_TERM_MINREC_SPLIT | A number > 1 | Criteria for splits: minimum number of records in a parent node expressed as a value. No split is attempted if the number of records is below this value. The default value is 20. |
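Example 9-16 below passes the TREE_TERM_MAX_DEPTH setting to the oml.rf constructor as the lowercase keyword argument tree_term_max_depth with a string value. A minimal sketch, assuming the same keyword-argument convention applies to the other settings in the table (the specific values here are illustrative):
import oml

# Create an RF model object with several non-default settings.
rf_mod = oml.rf(rfor_num_trees = '25',
                rfor_sampling_ratio = '0.6',
                tree_impurity_metric = 'TREE_IMPURITY_ENTROPY',
                clas_weights_balanced = 'ON')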
Example 9-16 Using the oml.rf Class
This example creates an RF model and uses some of the methods of the oml.rf class.
import oml
import pandas as pd
from sklearn import datasets
# Load the iris data set and create a pandas.DataFrame for it.
iris = datasets.load_iris()
x = pd.DataFrame(iris.data,
columns = ['Sepal_Length','Sepal_Width',
'Petal_Length','Petal_Width'])
y = pd.DataFrame(list(map(lambda x:
{0: 'setosa', 1: 'versicolor',
2:'virginica'}[x], iris.target)),
columns = ['Species'])
try:
oml.drop('IRIS')
oml.drop(table = 'RF_COST')
except:
pass
# Create the IRIS database table and the proxy object for the table.
oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')
# Create training and test data.
dat = oml.sync(table = 'IRIS').split()
train_x = dat[0].drop('Species')
train_y = dat[0]['Species']
test_dat = dat[1]
# Create a cost matrix table in the database.
cost_matrix = [['setosa', 'setosa', 0],
['setosa', 'virginica', 0.2],
['setosa', 'versicolor', 0.8],
['virginica', 'virginica', 0],
['virginica', 'setosa', 0.5],
['virginica', 'versicolor', 0.5],
['versicolor', 'versicolor', 0],
['versicolor', 'setosa', 0.4],
['versicolor', 'virginica', 0.6]]
cost_matrix = \
oml.create(pd.DataFrame(cost_matrix,
columns = ['ACTUAL_TARGET_VALUE',
'PREDICTED_TARGET_VALUE',
'COST']),
table = 'RF_COST')
# Create an RF model object.
rf_mod = oml.rf(tree_term_max_depth = '2')
# Fit the RF model according to the training data and parameter
# settings.
rf_mod = rf_mod.fit(train_x, train_y, cost_matrix = cost_matrix)
# Show details of the model.
rf_mod
# Use the model to make predictions on the test data.
rf_mod.predict(test_dat.drop('Species'),
supplemental_cols = test_dat[:, ['Sepal_Length',
'Sepal_Width',
'Petal_Length',
'Species']])
# Return the prediction probability.
rf_mod.predict(test_dat.drop('Species'),
supplemental_cols = test_dat[:, ['Sepal_Length',
'Sepal_Width',
'Species']],
proba = True)
# Return the top two most influential attributes of the highest
# probability class.
rf_mod.predict_proba(test_dat.drop('Species'),
supplemental_cols = test_dat[:, ['Sepal_Length',
'Species']],
topN = 2).sort_values(by = ['Sepal_Length', 'Species'])
rf_mod.score(test_dat.drop('Species'), test_dat[:, ['Species']])
# Reset TREE_TERM_MAX_DEPTH and refit the model.
rf_mod.set_params(tree_term_max_depth = '3').fit(train_x, train_y, cost_matrix = cost_matrix)
Listing for This Example
>>> import oml
>>> import pandas as pd
>>> from sklearn import datasets
>>>
>>> # Load the iris data set and create a pandas.DataFrame for it.
... iris = datasets.load_iris()
>>> x = pd.DataFrame(iris.data,
... columns = ['Sepal_Length','Sepal_Width',
... 'Petal_Length','Petal_Width'])
>>> y = pd.DataFrame(list(map(lambda x:
... {0: 'setosa', 1: 'versicolor',
... 2:'virginica'}[x], iris.target)),
... columns = ['Species'])
>>>
>>> try:
... oml.drop('IRIS')
... oml.drop(table = 'RF_COST')
... except:
... pass
>>>
>>> # Create the IRIS database table and the proxy object for the table.
... oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')
>>>
>>> # Create training and test data.
... dat = oml.sync(table = 'IRIS').split()
>>> train_x = dat[0].drop('Species')
>>> train_y = dat[0]['Species']
>>> test_dat = dat[1]
>>>
>>> # Create a cost matrix table in the database.
... cost_matrix = [['setosa', 'setosa', 0],
... ['setosa', 'virginica', 0.2],
... ['setosa', 'versicolor', 0.8],
... ['virginica', 'virginica', 0],
... ['virginica', 'setosa', 0.5],
... ['virginica', 'versicolor', 0.5],
... ['versicolor', 'versicolor', 0],
... ['versicolor', 'setosa', 0.4],
... ['versicolor', 'virginica', 0.6]]
>>> cost_matrix = \
... oml.create(pd.DataFrame(cost_matrix,
... columns = ['ACTUAL_TARGET_VALUE',
... 'PREDICTED_TARGET_VALUE',
... 'COST']),
... table = 'RF_COST')
>>>
>>> # Create an RF model object.
... rf_mod = oml.rf(tree_term_max_depth = '2')
>>>
>>> # Fit the RF model according to the training data and parameter
... # settings.
>>> rf_mod = rf_mod.fit(train_x, train_y, cost_matrix = cost_matrix)
>>>
>>> # Show details of the model.
... rf_mod
Algorithm Name: Random Forest
Mining Function: CLASSIFICATION
Target: Species
Settings:
setting name setting value
0 ALGO_NAME ALGO_RANDOM_FOREST
1 CLAS_COST_TABLE_NAME "OML_USER"."RF_COST"
2 CLAS_MAX_SUP_BINS 32
3 CLAS_WEIGHTS_BALANCED OFF
4 ODMS_DETAILS ODMS_ENABLE
5 ODMS_MISSING_VALUE_TREATMENT ODMS_MISSING_VALUE_AUTO
6 ODMS_RANDOM_SEED 0
7 ODMS_SAMPLING ODMS_SAMPLING_DISABLE
8 PREP_AUTO ON
9 RFOR_NUM_TREES 20
10 RFOR_SAMPLING_RATIO .5
11 TREE_IMPURITY_METRIC TREE_IMPURITY_GINI
12 TREE_TERM_MAX_DEPTH 2
13 TREE_TERM_MINPCT_NODE .05
14 TREE_TERM_MINPCT_SPLIT .1
15 TREE_TERM_MINREC_NODE 10
16 TREE_TERM_MINREC_SPLIT 20
Computed Settings:
setting name setting value
0 RFOR_MTRY 2
Global Statistics:
attribute name attribute value
0 AVG_DEPTH 2
1 AVG_NODECOUNT 3
2 MAX_DEPTH 2
3 MAX_NODECOUNT 2
4 MIN_DEPTH 2
5 MIN_NODECOUNT 2
6 NUM_ROWS 104
Attributes:
Petal_Length
Petal_Width
Sepal_Length
Partition: NO
Importance:
ATTRIBUTE_NAME ATTRIBUTE_SUBNAME ATTRIBUTE_IMPORTANCE
0 Petal_Length None 0.329971
1 Petal_Width None 0.296799
2 Sepal_Length None 0.037309
3 Sepal_Width None 0.000000
>>> # Use the model to make predictions on the test data.
... rf_mod.predict(test_dat.drop('Species'),
... supplemental_cols = test_dat[:, ['Sepal_Length',
... 'Sepal_Width',
... 'Petal_Length',
... 'Species']])
Sepal_Length Sepal_Width Petal_Length Species PREDICTION
0 4.9 3.0 1.4 setosa setosa
1 4.9 3.1 1.5 setosa setosa
2 4.8 3.4 1.6 setosa setosa
3 5.8 4.0 1.2 setosa setosa
... ... ... ... ... ...
42 6.7 3.3 5.7 virginica virginica
43 6.7 3.0 5.2 virginica virginica
44 6.5 3.0 5.2 virginica virginica
45 5.9 3.0 5.1 virginica virginica
>>> # Return the prediction probability.
... rf_mod.predict(test_dat.drop('Species'),
... supplemental_cols = test_dat[:, ['Sepal_Length',
... 'Sepal_Width',
... 'Species']],
... proba = True)
Sepal_Length Sepal_Width Species PREDICTION PROBABILITY
0 4.9 3.0 setosa setosa 0.989130
1 4.9 3.1 setosa setosa 0.989130
2 4.8 3.4 setosa setosa 0.989130
3 5.8 4.0 setosa setosa 0.950000
... ... ... ... ... ...
42 6.7 3.3 virginica virginica 0.501016
43 6.7 3.0 virginica virginica 0.501016
44 6.5 3.0 virginica virginica 0.501016
45 5.9 3.0 virginica virginica 0.501016
>>> # Return the top two most influential attributes of the highest
... # probability class.
>>> rf_mod.predict_proba(test_dat.drop('Species'),
... supplemental_cols = test_dat[:, ['Sepal_Length',
... 'Species']],
... topN = 2).sort_values(by = ['Sepal_Length', 'Species'])
Sepal_Length Species TOP_1 TOP_1_VAL TOP_2 TOP_2_VAL
0 4.4 setosa setosa 0.989130 versicolor 0.010870
1 4.4 setosa setosa 0.989130 versicolor 0.010870
2 4.5 setosa setosa 0.989130 versicolor 0.010870
3 4.8 setosa setosa 0.989130 versicolor 0.010870
... ... ... ... ... ... ...
42 6.7 virginica virginica 0.501016 versicolor 0.498984
43 6.9 versicolor virginica 0.501016 versicolor 0.498984
44 6.9 virginica virginica 0.501016 versicolor 0.498984
45 7.0 versicolor virginica 0.501016 versicolor 0.498984
>>> rf_mod.score(test_dat.drop('Species'), test_dat[:, ['Species']])
0.76087
>>> # Reset TREE_TERM_MAX_DEPTH and refit the model.
... rf_mod.set_params(tree_term_max_depth = '3').fit(train_x, train_y, cost_matrix = cost_matrix)
Algorithm Name: Random Forest
Mining Function: CLASSIFICATION
Target: Species
Settings:
setting name setting value
0 ALGO_NAME ALGO_RANDOM_FOREST
1 CLAS_COST_TABLE_NAME "OML_USER"."RF_COST"
2 CLAS_MAX_SUP_BINS 32
3 CLAS_WEIGHTS_BALANCED OFF
4 ODMS_DETAILS ODMS_ENABLE
5 ODMS_MISSING_VALUE_TREATMENT ODMS_MISSING_VALUE_AUTO
6 ODMS_RANDOM_SEED 0
7 ODMS_SAMPLING ODMS_SAMPLING_DISABLE
8 PREP_AUTO ON
9 RFOR_NUM_TREES 20
10 RFOR_SAMPLING_RATIO .5
11 TREE_IMPURITY_METRIC TREE_IMPURITY_GINI
12 TREE_TERM_MAX_DEPTH 3
13 TREE_TERM_MINPCT_NODE .05
14 TREE_TERM_MINPCT_SPLIT .1
15 TREE_TERM_MINREC_NODE 10
16 TREE_TERM_MINREC_SPLIT 20
Computed Settings:
setting name setting value
0 RFOR_MTRY 2
Global Statistics:
attribute name attribute value
0 AVG_DEPTH 3
1 AVG_NODECOUNT 5
2 MAX_DEPTH 3
3 MAX_NODECOUNT 6
4 MIN_DEPTH 3
5 MIN_NODECOUNT 4
6 NUM_ROWS 104
Attributes:
Petal_Length
Petal_Width
Sepal_Length
Partition: NO
Importance:
ATTRIBUTE_NAME ATTRIBUTE_SUBNAME ATTRIBUTE_IMPORTANCE
0 Petal_Length None 0.501022
1 Petal_Width None 0.568170
2 Sepal_Length None 0.091617
3 Sepal_Width None 0.000000
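When you no longer need the database tables that the example created, you can drop them with oml.drop, as at the start of the example:
import oml

# Drop the tables created by this example.
oml.drop(table = 'IRIS')
oml.drop(table = 'RF_COST')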