9.9 Decision Tree
The oml.dt class uses the Decision Tree algorithm for classification.
Decision Tree models are classification models that contain axis-parallel rules. A rule is a conditional statement that can be understood by humans and may be used within a database to identify a set of records.
A decision tree predicts a target value by asking a sequence of questions. At a given stage in the sequence, the question that is asked depends upon the answers to the previous questions. The goal is to ask questions that, taken together, uniquely identify specific target values. Graphically, this process forms a tree structure.
During the training process, the Decision Tree algorithm must repeatedly find the most efficient way to split a set of cases (records) into two child nodes. The oml.dt class offers two homogeneity metrics, gini and entropy, for calculating the splits. The default metric is gini.
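For example, you can select the entropy metric instead of the default by passing the TREE_IMPURITY_METRIC setting when creating the model object. The following sketch is illustrative rather than part of Example 9-9; it assumes an established OML4Py connection and training proxy objects such as the train_x and train_y created later in that example.
# Illustrative sketch: select the entropy impurity metric instead of the
# default gini. Assumes train_x and train_y proxy objects as in Example 9-9.
dt_entropy = oml.dt(**{'TREE_IMPURITY_METRIC': 'TREE_IMPURITY_ENTROPY'})
dt_entropy.fit(train_x, train_y)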
For information on the oml.dt class attributes and methods, invoke help(oml.dt) or see Oracle Machine Learning for Python API Reference.
Settings for a Decision Tree Model
The following table lists settings that apply to Decision Tree models.
Table 9-4 Decision Tree Model Settings
Setting Name | Setting Value | Description |
---|---|---|
CLAS_COST_TABLE_NAME | table_name | The name of a table that stores a cost matrix for the algorithm to use in building and applying the model. The cost matrix specifies the costs associated with misclassifications. The cost matrix table is user-created and must contain the columns ACTUAL_TARGET_VALUE and PREDICTED_TARGET_VALUE, of a valid target data type, and COST, of type NUMBER, as in the cost matrix created in Example 9-9. |
CLAS_MAX_SUP_BINS | 2 <= a number <= 2147483647 | Specifies the maximum number of bins for each attribute. The default value is 32. |
CLAS_WEIGHTS_BALANCED | ON or OFF | Indicates whether the algorithm must create a model that balances the target distribution. This setting is most relevant in the presence of rare targets, as balancing the distribution may enable better average accuracy (average of per-class accuracy) instead of overall accuracy (which favors the dominant class). The default value is OFF. |
TREE_IMPURITY_METRIC | TREE_IMPURITY_ENTROPY or TREE_IMPURITY_GINI | Tree impurity metric for a Decision Tree model. Tree algorithms seek the best test question for splitting data at each node. The best splitter and split value are those that result in the largest increase in target value homogeneity (purity) for the entities in the node. Purity is measured in accordance with a metric. Decision trees can use either gini (TREE_IMPURITY_GINI) or entropy (TREE_IMPURITY_ENTROPY) as the purity metric. The default value is TREE_IMPURITY_GINI. |
TREE_TERM_MAX_DEPTH | 2 <= a number <= 20 | Criteria for splits: maximum tree depth (the maximum number of nodes between the root and any leaf node, including the leaf node). The default value is 7. |
TREE_TERM_MINPCT_NODE | 0 <= a number <= 10 | The minimum number of training rows in a node, expressed as a percentage of the rows in the training data. The default value is 0.05. |
TREE_TERM_MINPCT_SPLIT | 0 < a number <= 20 | Minimum number of rows required to consider splitting a node, expressed as a percentage of the training rows. The default value is 0.1. |
TREE_TERM_MINREC_NODE | a number >= 0 | Minimum number of rows in a node. The default value is 10. |
TREE_TERM_MINREC_SPLIT | a number > 1 | Criteria for splits: minimum number of records in a parent node, expressed as a value. No split is attempted if the number of records is below this value. The default value is 20. |
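Multiple settings can be combined in a single Python dictionary and passed to the oml.dt constructor, as Example 9-9 does with TREE_TERM_MAX_DEPTH. The following sketch is illustrative only; the setting names come from Table 9-4 and the values shown are arbitrary choices, not recommended defaults.
# Illustrative sketch: combine several Table 9-4 settings in one dictionary.
setting = {'TREE_TERM_MAX_DEPTH': '5',
           'CLAS_MAX_SUP_BINS': '16',
           'CLAS_WEIGHTS_BALANCED': 'ON'}

# Create a Decision Tree model object with these settings; fit it as in
# Example 9-9.
dt_custom = oml.dt(**setting)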
Example 9-9 Using the oml.dt Class
This example demonstrates the use of various methods of the oml.dt class. In the listing for this example, some of the output is not shown, as indicated by ellipses.
import oml
import pandas as pd
from sklearn import datasets
# Load the iris data set and create a pandas.DataFrame for it.
iris = datasets.load_iris()
x = pd.DataFrame(iris.data,
                 columns = ['Sepal_Length','Sepal_Width',
                            'Petal_Length','Petal_Width'])
y = pd.DataFrame(list(map(lambda x:
                          {0: 'setosa', 1: 'versicolor',
                           2:'virginica'}[x], iris.target)),
                 columns = ['Species'])

try:
    oml.drop('COST_MATRIX')
    oml.drop('IRIS')
except:
    pass

# Create the IRIS database table and the proxy object for the table.
oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')

# Create training and test data.
dat = oml.sync(table = 'IRIS').split()
train_x = dat[0].drop('Species')
train_y = dat[0]['Species']
test_dat = dat[1]

# Create a cost matrix table in the database.
cost_matrix = [['setosa', 'setosa', 0],
               ['setosa', 'virginica', 0.2],
               ['setosa', 'versicolor', 0.8],
               ['virginica', 'virginica', 0],
               ['virginica', 'setosa', 0.5],
               ['virginica', 'versicolor', 0.5],
               ['versicolor', 'versicolor', 0],
               ['versicolor', 'setosa', 0.4],
               ['versicolor', 'virginica', 0.6]]
cost_matrix = oml.create(
    pd.DataFrame(cost_matrix,
                 columns = ['ACTUAL_TARGET_VALUE',
                            'PREDICTED_TARGET_VALUE', 'COST']),
    table = 'COST_MATRIX')

# Specify settings.
setting = {'TREE_TERM_MAX_DEPTH':'2'}

# Create a DT model object.
dt_mod = oml.dt(**setting)

# Fit the DT model according to the training data and parameter
# settings.
dt_mod.fit(train_x, train_y, cost_matrix = cost_matrix)

# Use the model to make predictions on the test data.
dt_mod.predict(test_dat.drop('Species'),
               supplemental_cols = test_dat[:, ['Sepal_Length',
                                                'Sepal_Width',
                                                'Petal_Length',
                                                'Species']])

# Return the prediction probability.
dt_mod.predict(test_dat.drop('Species'),
               supplemental_cols = test_dat[:, ['Sepal_Length',
                                                'Sepal_Width',
                                                'Species']],
               proba = True)

# Make predictions and return the probability for each class
# on new data.
dt_mod.predict_proba(test_dat.drop('Species'),
                     supplemental_cols = test_dat[:,
                       ['Sepal_Length',
                        'Species']]).sort_values(by = ['Sepal_Length',
                                                       'Species'])
dt_mod.score(test_dat.drop('Species'), test_dat[:, ['Species']])
Listing for This Example
>>> import oml
>>> import pandas as pd
>>> from sklearn import datasets
>>>
>>> # Load the iris data set and create a pandas.DataFrame for it.
... iris = datasets.load_iris()
>>> x = pd.DataFrame(iris.data,
... columns = ['Sepal_Length','Sepal_Width',
... 'Petal_Length','Petal_Width'])
>>> y = pd.DataFrame(list(map(lambda x:
... {0: 'setosa', 1: 'versicolor',
... 2:'virginica'}[x], iris.target)),
... columns = ['Species'])
>>>
>>> try:
...     oml.drop('COST_MATRIX')
...     oml.drop('IRIS')
... except:
...     pass
>>>
>>> # Create the IRIS database table and the proxy object for the table.
... oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')
>>>
>>> # Create training and test data.
... dat = oml.sync(table = 'IRIS').split()
>>> train_x = dat[0].drop('Species')
>>> train_y = dat[0]['Species']
>>> test_dat = dat[1]
>>>
>>> # Create a cost matrix table in the database.
... cost_matrix = [['setosa', 'setosa', 0],
... ['setosa', 'virginica', 0.2],
... ['setosa', 'versicolor', 0.8],
... ['virginica', 'virginica', 0],
... ['virginica', 'setosa', 0.5],
... ['virginica', 'versicolor', 0.5],
... ['versicolor', 'versicolor', 0],
... ['versicolor', 'setosa', 0.4],
... ['versicolor', 'virginica', 0.6]]
>>> cost_matrix = oml.create(
... pd.DataFrame(cost_matrix,
... columns = ['ACTUAL_TARGET_VALUE',
... 'PREDICTED_TARGET_VALUE',
... 'COST']),
... table = 'COST_MATRIX')
>>>
>>> # Specify settings.
... setting = {'TREE_TERM_MAX_DEPTH':'2'}
>>>
>>> # Create a DT model object.
... dt_mod = oml.dt(**setting)
>>>
>>> # Fit the DT model according to the training data and parameter
... # settings.
>>> dt_mod.fit(train_x, train_y, cost_matrix = cost_matrix)
Algorithm Name: Decision Tree
Mining Function: CLASSIFICATION
Target: Species
Settings:
setting name setting value
0 ALGO_NAME ALGO_DECISION_TREE
1 CLAS_COST_TABLE_NAME "OML_USER"."COST_MATRIX"
2 CLAS_MAX_SUP_BINS 32
3 CLAS_WEIGHTS_BALANCED OFF
4 ODMS_DETAILS ODMS_ENABLE
5 ODMS_MISSING_VALUE_TREATMENT ODMS_MISSING_VALUE_AUTO
6 ODMS_SAMPLING ODMS_SAMPLING_DISABLE
7 PREP_AUTO ON
8 TREE_IMPURITY_METRIC TREE_IMPURITY_GINI
9 TREE_TERM_MAX_DEPTH 2
10 TREE_TERM_MINPCT_NODE .05
11 TREE_TERM_MINPCT_SPLIT .1
12 TREE_TERM_MINREC_NODE 10
13 TREE_TERM_MINREC_SPLIT 20
Global Statistics:
attribute name attribute value
0 NUM_ROWS 104
Attributes:
Petal_Length
Petal_Width
Partition: NO
Distributions:
NODE_ID TARGET_VALUE TARGET_COUNT
0 0 setosa 36
1 0 versicolor 35
2 0 virginica 33
3 1 setosa 36
4 2 versicolor 35
5 2 virginica 33
Nodes:
parent node.id row.count prediction \
0 0.0 1 36 setosa
1 0.0 2 68 versicolor
2 NaN 0 104 setosa
split \
0 (Petal_Length <=(2.4500000000000002E+000))
1 (Petal_Length >(2.4500000000000002E+000))
2 None
surrogate \
0 Petal_Width <=(8.0000000000000004E-001))
1 Petal_Width >(8.0000000000000004E-001))
2 None
full.splits
0 (Petal_Length <=(2.4500000000000002E+000))
1 (Petal_Length >(2.4500000000000002E+000))
2 (
>>>
>>> # Use the model to make predictions on the test data.
... dt_mod.predict(test_dat.drop('Species'),
... supplemental_cols = test_dat[:, ['Sepal_Length',
... 'Sepal_Width',
... 'Petal_Length',
... 'Species']])
Sepal_Length Sepal_Width Petal_Length Species PREDICTION
0 4.9 3.0 1.4 setosa setosa
1 4.9 3.1 1.5 setosa setosa
2 4.8 3.4 1.6 setosa setosa
3 5.8 4.0 1.2 setosa setosa
... ... ... ... ... ...
44 6.7 3.3 5.7 virginica versicolor
45 6.7 3.0 5.2 virginica versicolor
46 6.5 3.0 5.2 virginica versicolor
47 5.9 3.0 5.1 virginica versicolor
>>>
>>> # Return the prediction probability.
... dt_mod.predict(test_dat.drop('Species'),
... supplemental_cols = test_dat[:, ['Sepal_Length',
... 'Sepal_Width',
... 'Species']],
... proba = True)
Sepal_Length Sepal_Width Species PREDICTION PROBABILITY
0 4.9 3.0 setosa setosa 1.000000
1 4.9 3.1 setosa setosa 1.000000
2 4.8 3.4 setosa setosa 1.000000
3 5.8 4.0 setosa setosa 1.000000
... ... ... ... ... ...
44 6.7 3.3 virginica versicolor 0.514706
45 6.7 3.0 virginica versicolor 0.514706
46 6.5 3.0 virginica versicolor 0.514706
47 5.9 3.0 virginica versicolor 0.514706
>>> # Make predictions and return the probability for each class
>>> # on new data.
>>> dt_mod.predict_proba(test_dat.drop('Species'),
... supplemental_cols = test_dat[:,
... ['Sepal_Length',
... 'Species']]).sort_values(by = ['Sepal_Length',
... 'Species'])
Sepal_Length Species PROBABILITY_OF_SETOSA \
0 4.4 setosa 1.0
1 4.4 setosa 1.0
2 4.5 setosa 1.0
3 4.8 setosa 1.0
... ... ... ...
42 6.7 virginica 0.0
43 6.9 versicolor 0.0
44 6.9 virginica 0.0
45 7.0 versicolor 0.0
PROBABILITY_OF_VERSICOLOR PROBABILITY_OF_VIRGINICA
0 0.000000 0.000000
1 0.000000 0.000000
2 0.000000 0.000000
3 0.000000 0.000000
... ... ...
42 0.514706 0.485294
43 0.514706 0.485294
44 0.514706 0.485294
45 0.514706 0.485294
>>>
>>> dt_mod.score(test_dat.drop('Species'), test_dat[:, ['Species']])
0.645833