9.14 Naive Bayes
The oml.nb
class creates a Naive Bayes (NB) model for classification.
The Naive Bayes algorithm is based on conditional probabilities. Naive Bayes looks at the historical data and calculates conditional probabilities for the target values by observing the frequency of attribute values and of combinations of attribute values.
Naive Bayes assumes that each predictor is conditionally independent of the others. (Bayes' Theorem requires that the predictors be independent.)
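The counting scheme described above can be sketched in plain Python, independent of the database (this is an illustration of the idea, not the oml.nb implementation):

```python
from collections import Counter, defaultdict

# Tiny categorical training set: (attribute value, target) pairs.
data = [('sunny', 'play'), ('sunny', 'play'), ('rainy', 'skip'),
        ('rainy', 'play'), ('sunny', 'play'), ('rainy', 'skip')]

# Frequency of each target value in the historical data.
target_counts = Counter(t for _, t in data)

# Frequency of each attribute value within each target class.
cond_counts = defaultdict(Counter)
for value, target in data:
    cond_counts[target][value] += 1

def posterior(value):
    """Unnormalized P(target) * P(value | target) for each target."""
    total = sum(target_counts.values())
    return {t: (n / total) * (cond_counts[t][value] / n)
            for t, n in target_counts.items()}

scores = posterior('sunny')
prediction = max(scores, key=scores.get)  # -> 'play'
```

With several predictors, Naive Bayes multiplies one such conditional probability per predictor, which is where the conditional-independence assumption enters.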
For information on the oml.nb
class attributes and methods, invoke help(oml.nb)
or see Oracle Machine Learning for Python API Reference.
Settings for a Naive Bayes Model
The following table lists the settings that apply to NB models.
Table 9-12 Naive Bayes Model Settings
Setting Name | Setting Value | Description |
---|---|---|
CLAS_COST_TABLE_NAME | table_name | The name of a table that stores a cost matrix for the algorithm to use in building the model. The cost matrix specifies the costs associated with misclassifications. The cost matrix table is user-created; it must contain the columns ACTUAL_TARGET_VALUE, PREDICTED_TARGET_VALUE, and COST. |
CLAS_MAX_SUP_BINS | 2 <= a number <= 2147483647 | Specifies the maximum number of bins for each attribute. The default value is 32. |
CLAS_PRIORS_TABLE_NAME | table_name | The name of a table that stores prior probabilities to offset differences in distribution between the build data and the scoring data. The priors table is user-created; it must contain the columns TARGET_VALUE and PRIOR_PROBABILITY. |
CLAS_WEIGHTS_BALANCED | ON or OFF | Indicates whether the algorithm must create a model that balances the target distribution. This setting is most relevant in the presence of rare targets, as balancing the distribution may enable better average accuracy (average of per-class accuracy) instead of overall accuracy (which favors the dominant class). The default value is OFF. |
NABS_PAIRWISE_THRESHOLD | 0 <= a number <= 1 | Value of the pairwise threshold for the NB algorithm. The default value is 0. |
NABS_SINGLETON_THRESHOLD | 0 <= a number <= 1 | Value of the singleton threshold for the NB algorithm. The default value is 0. |
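The effect of a CLAS_COST_TABLE_NAME cost matrix can be illustrated in plain Python with hypothetical probabilities and costs (not tied to oml): instead of predicting the most probable class, the model predicts the class that minimizes expected misclassification cost.

```python
# Hypothetical class probabilities for one row, and a cost matrix where
# cost[actual][predicted] is the cost of predicting `predicted` when the
# truth is `actual`. Correct predictions cost nothing.
probs = {'setosa': 0.1, 'versicolor': 0.6, 'virginica': 0.3}
cost = {
    'setosa':     {'setosa': 0, 'versicolor': 1, 'virginica': 1},
    'versicolor': {'setosa': 1, 'versicolor': 0, 'virginica': 1},
    'virginica':  {'setosa': 1, 'versicolor': 8, 'virginica': 0},
}

def expected_cost(predicted):
    # Average the cost over the possible true classes, weighted by
    # their predicted probabilities.
    return sum(probs[actual] * cost[actual][predicted] for actual in probs)

prediction = min(probs, key=expected_cost)  # -> 'virginica'
```

Here the most probable class is versicolor, but mistaking virginica for versicolor is expensive (cost 8), so the cost-minimizing prediction is virginica.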
Example 9-14 Using the oml.nb Class
This example creates an NB model and uses some of the methods of the oml.nb
class.
import oml
import pandas as pd
from sklearn import datasets
# Load the iris data set and create a pandas.DataFrame for it.
iris = datasets.load_iris()
x = pd.DataFrame(iris.data,
columns = ['Sepal_Length','Sepal_Width',
'Petal_Length','Petal_Width'])
y = pd.DataFrame(list(map(lambda x:
{0: 'setosa', 1: 'versicolor',
2:'virginica'}[x], iris.target)),
columns = ['Species'])
try:
oml.drop(table = 'NB_PRIOR_PROBABILITY_DEMO')
oml.drop('IRIS')
except Exception:
pass
# Create the IRIS database table and the proxy object for the table.
oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')
# Create training and test data.
dat = oml.sync(table = 'IRIS').split()
train_x = dat[0].drop('Species')
train_y = dat[0]['Species']
test_dat = dat[1]
# User specified settings.
setting = {'CLAS_WEIGHTS_BALANCED': 'ON'}
# Create an oml NB model object.
nb_mod = oml.nb(**setting)
# Fit the NB model according to the training data and parameter
# settings.
nb_mod = nb_mod.fit(train_x, train_y)
# Show details of the model.
nb_mod
# Create a priors table in the database.
priors = {'setosa': 0.2, 'versicolor': 0.3, 'virginica': 0.5}
priors = oml.create(pd.DataFrame(list(priors.items()),
columns = ['TARGET_VALUE',
'PRIOR_PROBABILITY']),
table = 'NB_PRIOR_PROBABILITY_DEMO')
# Change the setting parameter and refit the model
# with a user-defined prior table.
new_setting = {'CLAS_WEIGHTS_BALANCED': 'OFF'}
nb_mod = nb_mod.set_params(**new_setting).fit(train_x,
train_y,
priors = priors)
nb_mod
# Use the model to make predictions on test data.
nb_mod.predict(test_dat.drop('Species'),
supplemental_cols = test_dat[:, ['Sepal_Length',
'Sepal_Width',
'Petal_Length',
'Species']])
# Return the prediction probability.
nb_mod.predict(test_dat.drop('Species'),
supplemental_cols = test_dat[:, ['Sepal_Length',
'Sepal_Width',
'Species']],
proba = True)
# Return the top two most influential attributes of the highest
# probability class.
nb_mod.predict(test_dat.drop('Species'),
supplemental_cols = test_dat[:, ['Sepal_Length',
'Sepal_Width',
'Petal_Length',
'Species']],
topN_attrs = 2)
# Make predictions and return the probability for each class
# on new data.
nb_mod.predict_proba(test_dat.drop('Species'),
supplemental_cols = test_dat[:,
['Sepal_Length',
'Species']]).sort_values(by =
['Sepal_Length',
'Species',
'PROBABILITY_OF_setosa',
'PROBABILITY_OF_versicolor'])
# Make predictions on new data and return the mean accuracy.
nb_mod.score(test_dat.drop('Species'), test_dat[:, ['Species']])
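How a user-defined priors table, such as the one created above, shifts predictions can be sketched independently of the database (the likelihood values below are hypothetical, and this is not the oml internals):

```python
# Hypothetical per-class likelihoods P(row | class) for one scored row.
likelihood = {'setosa': 0.001, 'versicolor': 0.003, 'virginica': 0.002}

def predict(priors):
    # By Bayes' theorem, the posterior is proportional to
    # prior * likelihood; normalize so the posteriors sum to 1.
    post = {c: priors[c] * likelihood[c] for c in likelihood}
    z = sum(post.values())
    return {c: p / z for c, p in post.items()}

# Equal priors: the likelihoods alone decide the prediction.
uniform = predict({'setosa': 1/3, 'versicolor': 1/3, 'virginica': 1/3})
# Skewed priors (as in the priors table above): virginica is favored.
skewed = predict({'setosa': 0.2, 'versicolor': 0.3, 'virginica': 0.5})
```

With equal priors this row is classified as versicolor; weighting virginica at 0.5 flips the prediction, which is the kind of adjustment the priors table makes when the build data's class distribution differs from the scoring data's.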
Listing for This Example
>>> import oml
>>> import pandas as pd
>>> from sklearn import datasets
>>>
>>> # Load the iris data set and create a pandas.DataFrame for it.
... iris = datasets.load_iris()
>>> x = pd.DataFrame(iris.data,
... columns = ['Sepal_Length','Sepal_Width',
... 'Petal_Length','Petal_Width'])
>>> y = pd.DataFrame(list(map(lambda x:
... {0: 'setosa', 1: 'versicolor',
... 2:'virginica'}[x], iris.target)),
... columns = ['Species'])
>>>
>>> try:
... oml.drop(table = 'NB_PRIOR_PROBABILITY_DEMO')
... oml.drop('IRIS')
... except Exception:
... pass
>>>
>>> # Create the IRIS database table and the proxy object for the table.
... oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')
>>>
>>> # Create training and test data.
>>> dat = oml.sync(table = 'IRIS').split()
>>> train_x = dat[0].drop('Species')
>>> train_y = dat[0]['Species']
>>> test_dat = dat[1]
>>>
>>> # User specified settings.
... setting = {'CLAS_WEIGHTS_BALANCED': 'ON'}
>>>
>>> # Create an oml NB model object.
... nb_mod = oml.nb(**setting)
>>>
>>> # Fit the NB model according to the training data and parameter
... # settings.
>>> nb_mod = nb_mod.fit(train_x, train_y)
>>>
>>> # Show details of the model.
... nb_mod
Algorithm Name: Naive Bayes
Mining Function: CLASSIFICATION
Target: Species
Settings:
setting name setting value
0 ALGO_NAME ALGO_NAIVE_BAYES
1 CLAS_WEIGHTS_BALANCED ON
2 NABS_PAIRWISE_THRESHOLD 0
3 NABS_SINGLETON_THRESHOLD 0
4 ODMS_DETAILS ODMS_ENABLE
5 ODMS_MISSING_VALUE_TREATMENT ODMS_MISSING_VALUE_AUTO
6 ODMS_SAMPLING ODMS_SAMPLING_DISABLE
7 PREP_AUTO ON
Global Statistics:
attribute name attribute value
0 NUM_ROWS 104
Attributes:
Petal_Length
Petal_Width
Sepal_Length
Sepal_Width
Partition: NO
Priors:
TARGET_NAME TARGET_VALUE PRIOR_PROBABILITY COUNT
0 Species setosa 0.333333 36
1 Species versicolor 0.333333 35
2 Species virginica 0.333333 33
Conditionals:
TARGET_NAME TARGET_VALUE ATTRIBUTE_NAME ATTRIBUTE_SUBNAME ATTRIBUTE_VALUE \
0 Species setosa Petal_Length None ( ; 1.05]
1 Species setosa Petal_Length None (1.05; 1.2]
2 Species setosa Petal_Length None (1.2; 1.35]
3 Species setosa Petal_Length None (1.35; 1.45]
... ... ... ... ... ...
152 Species virginica Sepal_Width None (3.25; 3.35]
153 Species virginica Sepal_Width None (3.35; 3.45]
154 Species virginica Sepal_Width None (3.55; 3.65]
155 Species virginica Sepal_Width None (3.75; 3.85]
CONDITIONAL_PROBABILITY COUNT
0 0.027778 1
1 0.027778 1
2 0.083333 3
3 0.277778 10
... ... ...
152 0.030303 1
153 0.060606 2
154 0.030303 1
155 0.060606 2
[156 rows x 7 columns]
>>> # Create a priors table in the database.
... priors = {'setosa': 0.2, 'versicolor': 0.3, 'virginica': 0.5}
>>> priors = oml.create(pd.DataFrame(list(priors.items()),
... columns = ['TARGET_VALUE',
... 'PRIOR_PROBABILITY']),
... table = 'NB_PRIOR_PROBABILITY_DEMO')
>>>
>>> # Change the setting parameter and refit the model
... # with a user-defined prior table.
... new_setting = {'CLAS_WEIGHTS_BALANCED': 'OFF'}
>>> nb_mod = nb_mod.set_params(**new_setting).fit(train_x,
... train_y,
... priors = priors)
>>> nb_mod
Algorithm Name: Naive Bayes
Mining Function: CLASSIFICATION
Target: Species
Settings:
setting name setting value
0 ALGO_NAME ALGO_NAIVE_BAYES
1 CLAS_PRIORS_TABLE_NAME "OML_USER"."NB_PRIOR_PROBABILITY_DEMO"
2 CLAS_WEIGHTS_BALANCED OFF
3 NABS_PAIRWISE_THRESHOLD 0
4 NABS_SINGLETON_THRESHOLD 0
5 ODMS_DETAILS ODMS_ENABLE
6 ODMS_MISSING_VALUE_TREATMENT ODMS_MISSING_VALUE_AUTO
7 ODMS_SAMPLING ODMS_SAMPLING_DISABLE
8 PREP_AUTO ON
Global Statistics:
attribute name attribute value
0 NUM_ROWS 104
Attributes:
Petal_Length
Petal_Width
Sepal_Length
Sepal_Width
Partition: NO
Priors:
TARGET_NAME TARGET_VALUE PRIOR_PROBABILITY COUNT
0 Species setosa 0.2 36
1 Species versicolor 0.3 35
2 Species virginica 0.5 33
Conditionals:
TARGET_NAME TARGET_VALUE ATTRIBUTE_NAME ATTRIBUTE_SUBNAME ATTRIBUTE_VALUE \
0 Species setosa Petal_Length None ( ; 1.05]
1 Species setosa Petal_Length None (1.05; 1.2]
2 Species setosa Petal_Length None (1.2; 1.35]
3 Species setosa Petal_Length None (1.35; 1.45]
... ... ... ... ... ...
152 Species virginica Sepal_Width None (3.25; 3.35]
153 Species virginica Sepal_Width None (3.35; 3.45]
154 Species virginica Sepal_Width None (3.55; 3.65]
155 Species virginica Sepal_Width None (3.75; 3.85]
CONDITIONAL_PROBABILITY COUNT
0 0.027778 1
1 0.027778 1
2 0.083333 3
3 0.277778 10
... ... ...
152 0.030303 1
153 0.060606 2
154 0.030303 1
155 0.060606 2
[156 rows x 7 columns]
>>> # Use the model to make predictions on test data.
... nb_mod.predict(test_dat.drop('Species'),
... supplemental_cols = test_dat[:, ['Sepal_Length',
... 'Sepal_Width',
... 'Petal_Length',
... 'Species']])
Sepal_Length Sepal_Width Petal_Length Species PREDICTION
0 4.9 3.0 1.4 setosa setosa
1 4.9 3.1 1.5 setosa setosa
2 4.8 3.4 1.6 setosa setosa
3 5.8 4.0 1.2 setosa setosa
... ... ... ... ... ...
42 6.7 3.3 5.7 virginica virginica
43 6.7 3.0 5.2 virginica virginica
44 6.5 3.0 5.2 virginica virginica
45 5.9 3.0 5.1 virginica virginica
>>> # Return the prediction probability.
>>> nb_mod.predict(test_dat.drop('Species'),
... supplemental_cols = test_dat[:, ['Sepal_Length',
... 'Sepal_Width',
... 'Species']],
... proba = True)
Sepal_Length Sepal_Width Species PREDICTION PROBABILITY
0 4.9 3.0 setosa setosa 1.000000
1 4.9 3.1 setosa setosa 1.000000
2 4.8 3.4 setosa setosa 1.000000
3 5.8 4.0 setosa setosa 1.000000
... ... ... ... ... ...
42 6.7 3.3 virginica virginica 1.000000
43 6.7 3.0 virginica virginica 0.953848
44 6.5 3.0 virginica virginica 1.000000
45 5.9 3.0 virginica virginica 0.932334
>>> # Return the top two most influential attributes of the highest
... # probability class.
>>> nb_mod.predict(test_dat.drop('Species'),
... supplemental_cols = test_dat[:, ['Sepal_Length',
... 'Sepal_Width',
... 'Petal_Length',
... 'Species']],
... topN_attrs = 2)
Sepal_Length Sepal_Width Petal_Length Species PREDICTION \
0 4.9 3.0 1.4 setosa setosa
1 4.9 3.1 1.5 setosa setosa
2 4.8 3.4 1.6 setosa setosa
3 5.8 4.0 1.2 setosa setosa
... ... ... ... ... ...
42 6.7 3.3 5.7 virginica virginica
43 6.7 3.0 5.2 virginica virginica
44 6.5 3.0 5.2 virginica virginica
45 5.9 3.0 5.1 virginica virginica
TOP_N_ATTRIBUTES
0 <Details algorithm="Naive Bayes" class="setosa...
1 <Details algorithm="Naive Bayes" class="setosa...
2 <Details algorithm="Naive Bayes" class="setosa...
3 <Details algorithm="Naive Bayes" class="setosa...
...
42 <Details algorithm="Naive Bayes" class="virgin...
43 <Details algorithm="Naive Bayes" class="virgin...
44 <Details algorithm="Naive Bayes" class="virgin...
45 <Details algorithm="Naive Bayes" class="virgin...
>>> # Make predictions and return the probability for each class
... # on new data.
>>> nb_mod.predict_proba(test_dat.drop('Species'),
... supplemental_cols = test_dat[:,
... ['Sepal_Length',
... 'Species']]).sort_values(by =
... ['Sepal_Length',
... 'Species',
... 'PROBABILITY_OF_setosa',
... 'PROBABILITY_OF_versicolor'])
Sepal_Length Species PROBABILITY_OF_SETOSA \
0 4.4 setosa 1.000000e+00
1 4.4 setosa 1.000000e+00
2 4.5 setosa 1.000000e+00
3 4.8 setosa 1.000000e+00
... ... ... ...
42 6.7 virginica 1.412132e-13
43 6.9 versicolor 5.295492e-20
44 6.9 virginica 5.295492e-20
45 7.0 versicolor 6.189014e-14
PROBABILITY_OF_VERSICOLOR PROBABILITY_OF_VIRGINICA
0 9.327306e-21 7.868301e-20
1 3.497737e-20 1.032715e-19
2 2.238553e-13 2.360490e-19
3 6.995487e-22 2.950617e-21
... ... ...
42 4.741700e-13 1.000000e+00
43 1.778141e-07 9.999998e-01
44 2.963565e-20 1.000000e+00
45 4.156340e-01 5.843660e-01
>>> # Make predictions on new data and return the mean accuracy.
... nb_mod.score(test_dat.drop('Species'), test_dat[:, ['Species']])
0.934783
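The score value returned above is the mean accuracy: the fraction of test rows whose prediction matches the actual label. In plain Python terms (with illustrative labels, not the iris results):

```python
# Hypothetical predicted vs. actual labels for a small test set.
predicted = ['setosa', 'setosa', 'virginica', 'versicolor', 'virginica']
actual    = ['setosa', 'setosa', 'virginica', 'virginica',  'virginica']

# Count matching pairs and divide by the number of rows.
matches = sum(p == a for p, a in zip(predicted, actual))
accuracy = matches / len(actual)  # 4 of 5 correct -> 0.8
```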