10.4 Model Tuning
The oml.automl.ModelTuning class tunes the hyperparameters of the specified classification or regression algorithm for the given training data.
Model tuning is a laborious machine learning task that relies heavily on data scientist expertise. With limited user input, the oml.automl.ModelTuning class automates this process, using a highly parallel, asynchronous, gradient-based hyperparameter optimization algorithm to tune the hyperparameters of an Oracle Machine Learning algorithm.
The oml.automl.ModelTuning class supports classification and regression algorithms. To use the oml.automl.ModelTuning class, you specify a data set and an algorithm to obtain a tuned model and its corresponding hyperparameters. An advanced user can provide a customized hyperparameter search space and a non-default scoring metric to this black-box optimizer.
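The following is a minimal sketch of that workflow. It assumes an existing oml.DataFrame proxy object named dat that contains a TARGET column; the names dat and mt are illustrative only, and Example 10-3 below shows the complete, verified workflow.

import oml
from oml import automl

# Assumption: dat is an oml.DataFrame proxy for a database table
# that contains a 'TARGET' column.
X, y = dat.drop('TARGET'), dat['TARGET']

# Create the tuning object and tune a Decision Tree classifier.
mt = automl.ModelTuning(mining_function='classification',
                        score_metric='accuracy', parallel=4)
results = mt.tune('dt', X, y)

tuned_model = results['best_model']        # tuned in-database model
score, params = results['all_evals'][0]    # best score and its hyperparameters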
For a partitioned model, if you pass the column on which to partition in the param_space argument of the tune method, oml.automl.ModelTuning tunes the partitioned model's hyperparameters.
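A minimal sketch of the partitioned case follows. The ODMS_PARTITION_COLUMNS key and the REGION column name are assumptions for illustration; ODMS_PARTITION_COLUMNS is the Oracle Machine Learning setting that names partition columns, but confirm the exact param_space format that the tune method expects with help(oml.automl.ModelTuning) or the API reference.

# A sketch only, assuming the partition column is passed through the
# ODMS_PARTITION_COLUMNS setting in param_space; 'REGION' is a hypothetical
# column name, and mt is the ModelTuning object from the sketch above.
search_space = {'ODMS_PARTITION_COLUMNS': {'type': 'categorical',
                                           'range': ['REGION']}}
results = mt.tune('dt', X, y, param_space=search_space)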
For information on the parameters and methods of the class, invoke help(oml.automl.ModelTuning) or see Oracle Machine Learning for Python API Reference.
Example 10-3 Using the oml.automl.ModelTuning Class
This example creates an oml.automl.ModelTuning object and uses it to tune Decision Tree and Random Forest models on the breast cancer data set.
import oml
from oml import automl
import pandas as pd
from sklearn import datasets
# Load the breast cancer data set.
bc = datasets.load_breast_cancer()
bc_data = bc.data.astype(float)
X = pd.DataFrame(bc_data, columns = bc.feature_names)
y = pd.DataFrame(bc.target, columns = ['TARGET'])
# Create the database table BreastCancer.
oml_df = oml.create(pd.concat([X, y], axis=1),
                    table = 'BreastCancer')
# Split the data set into training and test data.
train, test = oml_df.split(ratio=(0.8, 0.2), seed = 1234)
X, y = train.drop('TARGET'), train['TARGET']
X_test, y_test = test.drop('TARGET'), test['TARGET']
# Start an automated model tuning run with a Decision Tree model.
at = automl.ModelTuning(mining_function='classification', score_metric='accuracy',
                        parallel=4)
results = at.tune('dt', X, y)
# Show the tuned model details.
tuned_model = results['best_model']
tuned_model
# Show the best tuned model train score and the
# corresponding hyperparameters.
score, params = results['all_evals'][0]
"{:.2}".format(score), ["{}:{}".format(k, params[k])
for k in sorted(params)]
# Use the tuned model to get the score on the test set.
"{:.2}".format(tuned_model.score(X_test, y_test))
# An example invocation of model tuning with user-defined
# search ranges for selected hyperparameters on a new tuning
# metric (f1_macro).
search_space = {
    'RFOR_SAMPLING_RATIO': {'type': 'continuous',
                            'range': [0.01, 0.5]},
    'RFOR_NUM_TREES': {'type': 'discrete',
                       'range': [50, 100]},
    'TREE_IMPURITY_METRIC': {'type': 'categorical',
                             'range': ['TREE_IMPURITY_ENTROPY',
                                       'TREE_IMPURITY_GINI']},}
results = at.tune('rf', X, y, score_metric='f1_macro', param_space=search_space)
score, params = results['all_evals'][0]
("{:.2}".format(score), ["{}:{}".format(k, params[k])
for k in sorted(params)])
# Some hyperparameter search ranges need to be defined based on the
# training data set sizes (for example, the number of samples and
# features). You can use placeholders specific to the data set,
# such as $nr_features and $nr_samples, as the search ranges.
search_space = {'RFOR_MTRY': {'type': 'discrete',
                              'range': [1, '$nr_features/2']}}
results = at.tune('rf', X, y, param_space=search_space)
score, params = results['all_evals'][0]
("{:.2}".format(score), ["{}:{}".format(k, params[k])
for k in sorted(params)])
# Drop the database table.
oml.drop('BreastCancer')
Listing for This Example
>>> import oml
>>> from oml import automl
>>> import pandas as pd
>>> from sklearn import datasets
>>>
>>> # Load the breast cancer data set.
... bc = datasets.load_breast_cancer()
>>> bc_data = bc.data.astype(float)
>>> X = pd.DataFrame(bc_data, columns = bc.feature_names)
>>> y = pd.DataFrame(bc.target, columns = ['TARGET'])
>>>
>>> # Create the database table BreastCancer.
>>> oml_df = oml.create(pd.concat([X, y], axis=1),
...                     table = 'BreastCancer')
>>>
>>> # Split the data set into training and test data.
... train, test = oml_df.split(ratio=(0.8, 0.2), seed = 1234)
>>> X, y = train.drop('TARGET'), train['TARGET']
>>> X_test, y_test = test.drop('TARGET'), test['TARGET']
>>>
>>> # Start an automated model tuning run with a Decision Tree model.
... at = automl.ModelTuning(mining_function='classification', score_metric='accuracy',
...                         parallel=4)
>>> results = at.tune('dt', X, y)
>>>
>>> # Show the tuned model details.
... tuned_model = results['best_model']
>>> tuned_model
Algorithm Name: Decision Tree
Mining Function: CLASSIFICATION
Target: TARGET
Settings:
                    setting name            setting value
0                      ALGO_NAME       ALGO_DECISION_TREE
1              CLAS_MAX_SUP_BINS                       32
2          CLAS_WEIGHTS_BALANCED                      OFF
3                   ODMS_DETAILS             ODMS_DISABLE
4   ODMS_MISSING_VALUE_TREATMENT  ODMS_MISSING_VALUE_AUTO
5                  ODMS_SAMPLING    ODMS_SAMPLING_DISABLE
6                      PREP_AUTO                       ON
7           TREE_IMPURITY_METRIC       TREE_IMPURITY_GINI
8            TREE_TERM_MAX_DEPTH                        8
9          TREE_TERM_MINPCT_NODE                     3.34
10        TREE_TERM_MINPCT_SPLIT                      0.1
11         TREE_TERM_MINREC_NODE                       10
12        TREE_TERM_MINREC_SPLIT                       20
Attributes:
mean radius
mean texture
mean perimeter
mean area
mean smoothness
mean compactness
mean concavity
mean concave points
mean symmetry
mean fractal dimension
radius error
texture error
perimeter error
area error
smoothness error
compactness error
concavity error
concave points error
symmetry error
fractal dimension error
worst radius
worst texture
worst perimeter
worst area
worst smoothness
worst compactness
worst concavity
worst concave points
worst symmetry
worst fractal dimension
Partition: NO
>>>
>>> # Show the best tuned model train score and the
... # corresponding hyperparameters.
... score, params = results['all_evals'][0]
>>> "{:.2}".format(score), ["{}:{}".format(k, params[k])
... for k in sorted(params)]
('0.92', ['CLAS_MAX_SUP_BINS:32', 'TREE_IMPURITY_METRIC:TREE_IMPURITY_GINI', 'TREE_TERM_MAX_DEPTH:7', 'TREE_TERM_MINPCT_NODE:0.05', 'TREE_TERM_MINPCT_SPLIT:0.1'])
>>>
>>> # Use the tuned model to get the score on the test set.
... "{:.2}".format(tuned_model.score(X_test, y_test))
'0.92'
>>>
>>> # An example invocation of model tuning with user-defined
... # search ranges for selected hyperparameters on a new tuning
... # metric (f1_macro).
... search_space = {
...     'RFOR_SAMPLING_RATIO': {'type': 'continuous',
...                             'range': [0.01, 0.5]},
...     'RFOR_NUM_TREES': {'type': 'discrete',
...                        'range': [50, 100]},
...     'TREE_IMPURITY_METRIC': {'type': 'categorical',
...                              'range': ['TREE_IMPURITY_ENTROPY',
...                                        'TREE_IMPURITY_GINI']},}
>>> results = at.tune('rf', X, y, score_metric='f1_macro', param_space=search_space)
>>> score, params = results['all_evals'][0]
>>> ("{:.2}".format(score), ["{}:{}".format(k, params[k])
... for k in sorted(params)])
('0.92', ['RFOR_NUM_TREES:53', 'RFOR_SAMPLING_RATIO:0.4999951', 'TREE_IMPURITY_METRIC:TREE_IMPURITY_ENTROPY'])
>>>
>>> # Some hyperparameter search ranges need to be defined based on the
... # training data set sizes (for example, the number of samples and
... # features). You can use placeholders specific to the data set,
... # such as $nr_features and $nr_samples, as the search ranges.
... search_space = {'RFOR_MTRY': {'type': 'discrete',
...                               'range': [1, '$nr_features/2']}}
>>> results = at.tune('rf', X, y, param_space=search_space)
>>> score, params = results['all_evals'][0]
>>> ("{:.2}".format(score), ["{}:{}".format(k, params[k])
... for k in sorted(params)])
('0.93', ['RFOR_MTRY:10'])
>>>
>>> # Drop the database table.
... oml.drop('BreastCancer')