10.6 AutoML Pipeline

The Automated Machine Learning Pipeline uses the oml.automl.Pipeline class to automatically identify the most relevant algorithms, features, and hyperparameters for a given training dataset.

The oml.automl.Pipeline class automates three major stages of the machine learning workflow: algorithm selection, adaptive data reduction (feature and sample size selection), and hyperparameter optimization. It chains these stages into a single AutoML pipeline that optimizes the whole workflow with limited user input.

The oml.automl.Pipeline class supports classification and regression algorithms. To use the class, you specify the mining function, the score metric, and the degree of parallelism for the AutoML module.
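To make the score metric concrete, the following is a minimal sketch of macro-averaged F1 in plain Python. It assumes that the f1_macro metric follows the usual definition (the unweighted mean of the per-class F1 scores); the helper name f1_macro and the sample labels are illustrative only and are not part of the oml API.

```python
def f1_macro(y_true, y_pred):
    """Macro-averaged F1: the unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); treat an empty class as 0.0
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Illustrative labels (made up for this sketch)
y_true = [0, 0, 1, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1, 0]

# Class 0 scores 0.5, class 1 scores 0.75, so the macro average is 0.625
print(f1_macro(y_true, y_pred))
```

Because every class contributes equally to the average, this metric penalizes poor performance on a minority class more than plain accuracy does.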

For information on the parameters and methods of the class, invoke help(oml.automl.Pipeline) or see Oracle Machine Learning for Python API Reference.

Example 10-5 Using the oml.automl.Pipeline Class

This example creates an automl.Pipeline object, fits it to the training data, uses it to predict, and then inspects the best tuned and fitted model produced by the AutoML pipeline.

import oml 
from oml import automl
import pandas as pd
import numpy as np
from sklearn import datasets

# Load the breast cancer dataset into the database
bc = datasets.load_breast_cancer()
bc_data = bc.data.astype(float)
X = pd.DataFrame(bc_data, columns = bc.feature_names)
y = pd.DataFrame(bc.target, columns = ['TARGET'])
row_id = pd.DataFrame(np.arange(bc_data.shape[0]), columns = ['CASE_ID'])
df = oml.create(pd.concat([row_id, X, y], axis=1), table = 'BreastCancer')

# Split dataset into train and test
train, test = df.split(ratio=(0.8, 0.2), seed = 1234, hash_cols=['CASE_ID'])
X, y = train.drop('TARGET'), train['TARGET']
X_test, y_test = test.drop('TARGET'), test['TARGET']

# Create an automated machine learning pipeline object with f1_macro score_metric
pipeline = automl.Pipeline(mining_function='classification',
                           score_metric='f1_macro', parallel=4)

# Fit the pipeline to perform automated algorithm selection, feature selection, and model tuning on the dataset
pipeline = pipeline.fit(X, y, case_id='CASE_ID')

# Use the pipeline for prediction
pipeline.predict(X_test, supplemental_cols=y_test)

# For classification tasks, the pipeline can also predict class probabilities
pipeline.predict_proba(X_test, supplemental_cols=y_test)

# Inspect the best tuned and fitted model produced by the AutoML pipeline
pipeline.top_k_tuned_models[0]['fitted_model']

# Drop the database table created for this example
oml.drop('BreastCancer')

Listing for This Example

>>> import oml
>>> from oml import automl
>>> import pandas as pd
>>> import numpy as np
>>> from sklearn import datasets

# Load the breast cancer dataset into the database
>>> bc = datasets.load_breast_cancer()
>>> bc_data = bc.data.astype(float)
>>> X = pd.DataFrame(bc_data, columns = bc.feature_names)
>>> y = pd.DataFrame(bc.target, columns = ['TARGET'])
>>> row_id = pd.DataFrame(np.arange(bc_data.shape[0]), columns = ['CASE_ID'])
>>> df = oml.create(pd.concat([row_id, X, y], axis=1), table = 'BreastCancer')

# Split dataset into train and test
>>> train, test = df.split(ratio=(0.8, 0.2), seed = 1234, hash_cols=['CASE_ID'])
>>> X, y = train.drop('TARGET'), train['TARGET']
>>> X_test, y_test = test.drop('TARGET'), test['TARGET']

# Create an automated machine learning pipeline object with f1_macro score_metric
>>> pipeline = automl.Pipeline(mining_function='classification',
...                            score_metric='f1_macro', parallel=4)

# Fit the pipeline to perform automated algorithm selection, feature selection, and model tuning on the dataset
>>> pipeline = pipeline.fit(X, y, case_id='CASE_ID')

# Use the pipeline for prediction
>>> pipeline.predict(X_test, supplemental_cols=y_test)
     TARGET  PREDICTION
0         0           0
1         0           0
2         0           0
3         0           0
4         1           1
..      ...         ...
109       1           1
110       1           1
111       0           0
112       1           1
113       0           0

[114 rows x 2 columns]

# For classification tasks, the pipeline can also predict class probabilities
>>> pipeline.predict_proba(X_test, supplemental_cols=y_test)
     TARGET  PROBABILITY_OF_0  PROBABILITY_OF_1
0         1          0.295266          0.704734
1         1          0.000947          0.999053
2         0          0.999089          0.000911
3         1          0.576097          0.423903
4         1          0.000350          0.999650
..      ...               ...               ...
109       0          0.991110          0.008890
110       0          0.999776          0.000224
111       0          0.966321          0.033679
112       1          0.000134          0.999866
113       1          0.000744          0.999256

[114 rows x 3 columns]

# Inspect the best tuned and fitted model produced by the AutoML pipeline
>>> pipeline.top_k_tuned_models[0]['fitted_model']
Algorithm Name: Support Vector Machine
Mining Function: CLASSIFICATION
Target: TARGET
Settings:
                    setting name                 setting value
0                      ALGO_NAME  ALGO_SUPPORT_VECTOR_MACHINES
1          CLAS_WEIGHTS_BALANCED                           OFF
2                   ODMS_DETAILS                   ODMS_ENABLE
3   ODMS_MISSING_VALUE_TREATMENT       ODMS_MISSING_VALUE_AUTO
4                  ODMS_SAMPLING         ODMS_SAMPLING_DISABLE
5                      PREP_AUTO                            ON
6         SVMS_COMPLEXITY_FACTOR                            10
7            SVMS_CONV_TOLERANCE                         .0001
8           SVMS_KERNEL_FUNCTION                 SVMS_GAUSSIAN
9                SVMS_NUM_PIVOTS                           200
10                  SVMS_STD_DEV            3.6742346141747673

Computed Settings:
          setting name    setting value
0  SVMS_NUM_ITERATIONS               30
1          SVMS_SOLVER  SVMS_SOLVER_IPM

Global Statistics:
   attribute name  attribute value
0       CONVERGED              YES
1      ITERATIONS               13
2        NUM_ROWS              455

2 NUM_ROWS 455
Attributes:
area error
compactness error
concave points error
concavity error
fractal dimension error
mean area
mean compactness
mean concave points
mean concavity
mean fractal dimension
mean perimeter
mean radius
mean smoothness
mean symmetry
mean texture
perimeter error
radius error
worst area
worst compactness
worst concave points
worst concavity
worst fractal dimension
worst perimeter
worst radius
worst smoothness
worst symmetry
worst texture
Partition: NO

# Drop the database table created for this example
>>> oml.drop('BreastCancer')
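The PROBABILITY_OF_<class> columns returned by predict_proba can be turned into class labels by picking the class with the highest probability. The following is a minimal sketch in plain Python that does this for the first five rows of the predict_proba output shown in the listing. The rows are copied by hand into dictionaries so that the sketch runs without a database connection; the helper name most_likely_class is illustrative and is not part of the oml API. In practice the result would first be brought to the client, for example with the pull method of the returned proxy object.

```python
# First five rows of the predict_proba output in the listing, copied by hand
rows = [
    {"TARGET": 1, "PROBABILITY_OF_0": 0.295266, "PROBABILITY_OF_1": 0.704734},
    {"TARGET": 1, "PROBABILITY_OF_0": 0.000947, "PROBABILITY_OF_1": 0.999053},
    {"TARGET": 0, "PROBABILITY_OF_0": 0.999089, "PROBABILITY_OF_1": 0.000911},
    {"TARGET": 1, "PROBABILITY_OF_0": 0.576097, "PROBABILITY_OF_1": 0.423903},
    {"TARGET": 1, "PROBABILITY_OF_0": 0.000350, "PROBABILITY_OF_1": 0.999650},
]

PREFIX = "PROBABILITY_OF_"


def most_likely_class(row):
    """Return the class whose probability column holds the largest value."""
    probs = {
        name[len(PREFIX):]: value
        for name, value in row.items()
        if name.startswith(PREFIX)
    }
    return int(max(probs, key=probs.get))


labels = [most_likely_class(row) for row in rows]

# Positions where the most likely class disagrees with the actual target
misses = [i for i, (row, label) in enumerate(zip(rows, labels))
          if row["TARGET"] != label]

print(labels)  # [1, 1, 0, 0, 1]
print(misses)  # [3]
```

Row 3 is the only disagreement among these five rows: the model assigns probability 0.576097 to class 0, while the actual target is 1.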