10.6 AutoML Pipeline
The Automated Machine Learning (AutoML) Pipeline uses the oml.automl.Pipeline class to automatically identify the most relevant algorithms, features, and hyperparameters for a given training dataset.
The oml.automl.Pipeline class automates three major stages of the machine learning pipeline: algorithm selection, adaptive data reduction (feature and sample size selection), and hyperparameter optimization. These stages are combined into a single AutoML pipeline that optimizes the whole workflow with limited user input.
The oml.automl.Pipeline class supports classification and regression algorithms. To use the class, you specify the mining function, the score metric, and the degree of parallelism for the AutoML module.
For information on the parameters and methods of the class, invoke help(oml.automl.Pipeline) or see Oracle Machine Learning for Python API Reference.
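The score metric determines how candidate models are compared. As an illustration only, the following sketch computes f1_macro, the metric used in Example 10-5, as the unweighted mean of the per-class F1 scores. The function f1_macro_score and the sample labels are hypothetical and are not part of the oml API. The sketch runs on the client and does not show how the AutoML module computes the metric internally.

```python
def f1_macro_score(y_true, y_pred):
    """Return the unweighted mean of the per-class F1 scores.

    Hypothetical helper for illustration; not part of the oml API.
    """
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); treat an empty class as 0.0
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# Made-up labels, used only to exercise the function
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 0]

score = f1_macro_score(y_true, y_pred)
print(round(score, 4))  # 0.7333
```

Because every class contributes equally to the mean, f1_macro gives a minority class the same weight as a majority class, which is one reason to consider it when the classes are imbalanced.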
Example 10-5 Using the oml.automl.Pipeline Class
This example creates an automated machine learning pipeline object with the automl.Pipeline class, fits it to training data, uses it for prediction, and then inspects the best tuned and fitted model that the AutoML pipeline produces.
import oml
from oml import automl
import pandas as pd
import numpy as np
from sklearn import datasets
# Load the breast cancer dataset into the database
bc = datasets.load_breast_cancer()
bc_data = bc.data.astype(float)
X = pd.DataFrame(bc_data, columns = bc.feature_names)
y = pd.DataFrame(bc.target, columns = ['TARGET'])
row_id = pd.DataFrame(np.arange(bc_data.shape[0]), columns = ['CASE_ID'])
df = oml.create(pd.concat([row_id, X, y], axis=1), table = 'BreastCancer')
# Split dataset into train and test
train, test = df.split(ratio=(0.8, 0.2), seed = 1234, hash_cols=['CASE_ID'])
X, y = train.drop('TARGET'), train['TARGET']
X_test, y_test = test.drop('TARGET'), test['TARGET']
# Create an automated machine learning pipeline object with f1_macro score_metric
pipeline = automl.Pipeline(mining_function='classification',
                           score_metric='f1_macro', parallel=4)
# Fit the pipeline to perform automated algorithm selection, feature selection, and model tuning on the dataset
pipeline = pipeline.fit(X, y, case_id='CASE_ID')
# Use the pipeline for prediction
pipeline.predict(X_test, supplemental_cols=y_test)
# For classification tasks, the pipeline can also predict class probabilities
pipeline.predict_proba(X_test, supplemental_cols=y_test)
# Inspect the best tuned and fitted model produced by the AutoML pipeline
pipeline.top_k_tuned_models[0]['fitted_model']
# Drop the database table created for this example
oml.drop('BreastCancer')
Listing for This Example
>>> import oml
>>> from oml import automl
>>> import pandas as pd
>>> import numpy as np
>>> from sklearn import datasets
# Load the breast cancer dataset into the database
>>> bc = datasets.load_breast_cancer()
>>> bc_data = bc.data.astype(float)
>>> X = pd.DataFrame(bc_data, columns = bc.feature_names)
>>> y = pd.DataFrame(bc.target, columns = ['TARGET'])
>>> row_id = pd.DataFrame(np.arange(bc_data.shape[0]), columns = ['CASE_ID'])
>>> df = oml.create(pd.concat([row_id, X, y], axis=1), table = 'BreastCancer')
# Split dataset into train and test
>>> train, test = df.split(ratio=(0.8, 0.2), seed = 1234, hash_cols=['CASE_ID'])
>>> X, y = train.drop('TARGET'), train['TARGET']
>>> X_test, y_test = test.drop('TARGET'), test['TARGET']
# Create an automated machine learning pipeline object with f1_macro score_metric
>>> pipeline = automl.Pipeline(mining_function='classification',
...                            score_metric='f1_macro', parallel=4)
# Fit the pipeline to perform automated algorithm selection, feature selection, and model tuning on the dataset
>>> pipeline = pipeline.fit(X, y, case_id='CASE_ID')
# Use the pipeline for prediction
>>> pipeline.predict(X_test, supplemental_cols=y_test)
     TARGET  PREDICTION
0         0           0
1         0           0
2         0           0
3         0           0
4         1           1
..      ...         ...
109       1           1
110       1           1
111       0           0
112       1           1
113       0           0
[114 rows x 2 columns]
# For classification tasks, the pipeline can also predict class probabilities
>>> pipeline.predict_proba(X_test, supplemental_cols=y_test)
     TARGET  PROBABILITY_OF_0  PROBABILITY_OF_1
0         1          0.295266          0.704734
1         1          0.000947          0.999053
2         0          0.999089          0.000911
3         1          0.576097          0.423903
4         1          0.000350          0.999650
..      ...               ...               ...
109       0          0.991110          0.008890
110       0          0.999776          0.000224
111       0          0.966321          0.033679
112       1          0.000134          0.999866
113       1          0.000744          0.999256
[114 rows x 3 columns]
# Inspect the best tuned and fitted model produced by the AutoML pipeline
>>> pipeline.top_k_tuned_models[0]['fitted_model']
Algorithm Name: Support Vector Machine
Mining Function: CLASSIFICATION
Target: TARGET
Settings:
                    setting name                 setting value
0                      ALGO_NAME  ALGO_SUPPORT_VECTOR_MACHINES
1          CLAS_WEIGHTS_BALANCED                           OFF
2                   ODMS_DETAILS                   ODMS_ENABLE
3   ODMS_MISSING_VALUE_TREATMENT       ODMS_MISSING_VALUE_AUTO
4                  ODMS_SAMPLING         ODMS_SAMPLING_DISABLE
5                      PREP_AUTO                            ON
6         SVMS_COMPLEXITY_FACTOR                            10
7            SVMS_CONV_TOLERANCE                         .0001
8           SVMS_KERNEL_FUNCTION                 SVMS_GAUSSIAN
9                SVMS_NUM_PIVOTS                           200
10                  SVMS_STD_DEV            3.6742346141747673
Computed Settings:
          setting name    setting value
0  SVMS_NUM_ITERATIONS               30
1          SVMS_SOLVER  SVMS_SOLVER_IPM
Global Statistics:
   attribute name  attribute value
0       CONVERGED              YES
1      ITERATIONS               13
2        NUM_ROWS              455
Attributes:
area error
compactness error
concave points error
concavity error
fractal dimension error
mean area
mean compactness
mean concave points
mean concavity
mean fractal dimension
mean perimeter
mean radius
mean smoothness
mean symmetry
mean texture
perimeter error
radius error
worst area
worst compactness
worst concave points
worst concavity
worst fractal dimension
worst perimeter
worst radius
worst smoothness
worst symmetry
worst texture
Partition: NO
# Drop the database table created for this example
>>> oml.drop('BreastCancer')
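To see how the class probabilities in the listing relate to a predicted label, the following sketch takes the first five rows of the predict_proba output shown above and chooses the more probable class for each row. It runs on the client without a database connection. Choosing the most probable class is an assumption made for illustration, not a description of how predict is implemented, and most_probable_class is a hypothetical helper that is not part of the oml API.

```python
# First five rows of the predict_proba output in the listing:
# (TARGET, PROBABILITY_OF_0, PROBABILITY_OF_1)
rows = [
    (1, 0.295266, 0.704734),
    (1, 0.000947, 0.999053),
    (0, 0.999089, 0.000911),
    (1, 0.576097, 0.423903),
    (1, 0.000350, 0.999650),
]


def most_probable_class(prob_of_0, prob_of_1):
    """Return 0 or 1, whichever has the higher probability (hypothetical helper)."""
    return 0 if prob_of_0 >= prob_of_1 else 1


# Derive a label for each row from its two class probabilities
labels = [most_probable_class(p0, p1) for _, p0, p1 in rows]

# Count the rows where the derived label equals the actual TARGET value
matches = sum(1 for (target, _, _), label in zip(rows, labels) if target == label)

print(labels)               # [1, 1, 0, 0, 1]
print(matches / len(rows))  # 0.8
```

In this small sample, the fourth row has TARGET 1 but a higher probability for class 0, so the derived label disagrees with the actual value. Inspecting the probabilities in this way shows how close a given row is to the decision boundary, which the predicted label alone does not reveal.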