9.22 XGBoost
The oml.xgb class supports the in-database scalable gradient tree boosting algorithm for classification, regression, ranking, and survival models. It makes available the open source gradient boosting framework: it prepares the categorical encoding and missing value replacement through the OML infrastructure, calls the in-database XGBoost, builds and persists a model as a first-class database model object, and supports using the model for prediction.
You can use oml.xgb as a stand-alone predictor or incorporate it into real-world production pipelines for a wide range of problems, such as ad click-through rate prediction, hazard risk prediction, and web text classification.
The oml.xgb algorithm takes three types of parameters: general parameters, booster parameters, and task parameters. You set the parameters through the model settings. The algorithm supports most of the settings of the open source XGBoost project. For more information on the supported settings, see XGBoost parameters.
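As a minimal sketch, the following settings dictionary mixes the three parameter types; the specific values are illustrative only, and each value is passed as a string:

setting = {'xgboost_booster': 'gbtree',          # general parameter: the booster to use
           'xgboost_max_depth': '6',             # booster parameter: maximum tree depth
           'xgboost_eta': '0.3',                 # booster parameter: learning rate
           'xgboost_objective': 'multi:softprob'} # task parameter: the learning objective
xgb_mod = oml.xgb('classification', **setting)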
Through oml.xgb, OML4Py supports a number of different classification and regression specifications, ranking models, and survival models. Binary and multi-class models are supported under the classification machine learning technique, while regression, ranking, count, and survival models are supported under the regression machine learning technique.
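As a sketch, the mining function selects the model family, while the xgboost_objective setting selects the specification within it. The survival:aft objective name here is the open source XGBoost one; per the settings table later in this section, the AFT settings are assumed to require Oracle Database 23ai:

clf_mod = oml.xgb('classification')                                 # binary or multi-class
reg_mod = oml.xgb('regression')                                     # plain regression
rnk_mod = oml.xgb('regression', xgboost_objective='rank:pairwise')  # ranking
aft_mod = oml.xgb('regression', xgboost_objective='survival:aft')   # survival (assumes 23ai)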
oml.xgb also supports partitioned models and internalizes the data preparation.
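A hedged sketch of a partitioned model, assuming the generic ODMS_PARTITION_COLUMNS setting applies (written here in the lowercase form used for the other settings); one XGBoost sub-model is built per distinct value of the partition column:

setting = {'xgboost_booster': 'gbtree',
           'odms_partition_columns': 'Species'}  # assumption: standard partition column setting
part_mod = oml.xgb('regression', **setting)      # fit builds one sub-model per Species value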
XGBoost feature interaction constraints allow users to specify which variables can and cannot interact. By focusing on key interactions and eliminating noise, the constraints can improve predictive performance, which in turn may lead to more generalized predictions. For more information about XGBoost feature interaction constraints, see the Oracle Machine Learning for SQL Concepts Guide.
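A sketch, assuming Oracle Database 23ai and the xgboost_interaction_constraints setting described in the table below; the feature names x0 through x4 are the hypothetical columns used in that description:

# features within the same inner list may interact; features in different lists may not
setting = {'xgboost_interaction_constraints': '[["x0","x1","x2"],["x0","x4"]]'}
xgb_mod = oml.xgb('regression', **setting)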
Settings for an XGBoost model
The following table lists settings that apply to XGBoost models.
Table 9-19 XGBoost Model Settings
Setting Name | Setting Value | Description |
---|---|---|
xgboost_booster | A string that is one of the following: dart, gblinear, gbtree | The booster to use. dart and gbtree use tree-based models, while gblinear uses linear functions. The default value is gbtree. |
xgboost_num_round | A non-negative integer | The number of rounds for boosting. The default value is 10. |
xgboost_interaction_constraints | A string containing a nested list of feature groups, for example '[["x0","x1","x2"],["x0","x4"]]' | This setting specifies the permitted interactions in the model. Specify the constraints as a nested list in which each inner list is a group of features (column names) that are allowed to interact with each other. If a single column is passed in the interactions, the input is ignored. In the example value, features x0, x1, and x2 are allowed to interact with each other but with no other feature; similarly, x0 and x4 are allowed to interact with each other but with no other feature, and so on. This setting is applicable to 2-Dimensional features. An error occurs if you pass columns of a non-supported type or non-existing feature names. Note: Available only in Oracle Database 23ai. |
xgboost_decreasing_constraints | A string of comma-separated feature (column) names | This setting specifies the features that must obey the decreasing constraint. For example, the setting value 'x4,x5' sets a decreasing constraint on features x4 and x5. This setting applies to numeric columns and 2-Dimensional features. An error occurs if you pass columns of a non-supported type or non-existing feature names. Note: Available only in Oracle Database 23ai. |
xgboost_increasing_constraints | A string of comma-separated feature (column) names | This setting specifies the features that must obey the increasing constraint. For example, the setting value 'x0,x3' sets an increasing constraint on features x0 and x3. This setting is applicable to 2-Dimensional features. An error occurs if you pass columns of a non-supported type or non-existing feature names. Note: Available only in Oracle Database 23ai. |
xgboost_objective | For a classification model, a string naming an XGBoost classification objective, such as binary:logistic, multi:softmax, or multi:softprob. For a regression model, a string naming an XGBoost regression, ranking, count, or survival objective, such as reg:squarederror, rank:pairwise, count:poisson, or survival:aft. | The learning objective of the model. The default value for a classification model is multi:softprob; the default value for a regression model is reg:squarederror. Note: Available only in Oracle Database 23ai. |
xgboost_aft_loss_distribution | A string that is one of the following: normal, logistic, extreme | Specifies the distribution of the Z term in the AFT (Accelerated Failure Time) model, that is, the probability density function used by the survival:aft objective. The default value is normal. Note: Available only in Oracle Database 23ai. |
xgboost_aft_loss_distribution_scale | A positive number | Specifies the scaling factor σ, which scales the size of the Z term in the AFT model. The default value is 1.0. Note: Available only in Oracle Database 23ai. |
xgboost_aft_right_bound_column_name | A column name | Specifies the column containing the right bounds of the labels for an AFT model. You cannot select this parameter for a non-AFT model. Oracle Machine Learning does not support BOOLEAN values for this setting. Note: Available only in Oracle Database 23ai. |
For more information on the booster settings, see XGBoost parameters.
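Putting the 23ai AFT settings above together, a hedged sketch of a survival model; the SURV proxy table and its TIME_LOWER and TIME_UPPER columns are hypothetical:

setting = {'xgboost_objective': 'survival:aft',                 # open source AFT objective
           'xgboost_aft_loss_distribution': 'logistic',         # distribution of the Z term
           'xgboost_aft_loss_distribution_scale': '1.5',        # scaling factor σ
           'xgboost_aft_right_bound_column_name': 'TIME_UPPER'} # right bound of the label
aft_mod = oml.xgb('regression', **setting)
surv = oml.sync(table = 'SURV')                                 # hypothetical survival data
aft_mod.fit(surv.drop('TIME_LOWER'), surv['TIME_LOWER'])        # left bound as the target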
Example 9-21 Using the oml.xgb Class
This example creates an XGBoost model and uses some of the methods of the oml.xgb class.
#Load the iris data from sklearn and combine the target and predictors into a single DataFrame, which matches the form of a database table.
Use the oml.create function to load this pandas DataFrame into the database, which creates a persistent table and returns a proxy object that you assign to z.#
import oml
from sklearn import datasets
import pandas as pd
iris = datasets.load_iris()
x = pd.DataFrame(iris.data, columns = ['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width'])
y = pd.DataFrame(list(map(lambda x: {0: 'setosa', 1: 'versicolor', 2:'virginica'}[x], iris.target)), columns = ['Species'])
#For an on-premises database, use the following command to connect to the database.#
oml.connect("<username>","<password>", dsn="<dsn>")
z = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')
#Create training data and test data.#
dat = oml.sync(table = "IRIS").split()
train_x = dat[0].drop('Species')
train_y = dat[0]['Species']
test_dat = dat[1]
#Classification Example:#
#Create an XGBoost model object.#
setting = {'xgboost_max_depth': '3',
... 'xgboost_eta': '1',
... 'xgboost_num_round': '10'}
xgb_mod = oml.xgb('classification', **setting)
#Fit the XGBoost model to the training data.#
xgb_mod.fit(train_x, train_y)
#Use the model to make predictions on the test data and return the prediction probabilities for each category in Species.#
xgb_mod.predict(test_dat.drop('Species'), supplemental_cols = test_dat[:, ['Sepal_Length', 'Sepal_Width', 'Species']], proba = True).sort_values(by = ['Sepal_Length', 'Sepal_Width'])
Sepal_Length Sepal_Width Species TOP_1 TOP_1_VAL
0 4.4 3.0 setosa setosa 0.993619
1 4.4 3.2 setosa setosa 0.993619
2 4.5 2.3 setosa setosa 0.942128
3 4.8 3.4 setosa setosa 0.993619
... ... ... ... ... ...
42 6.7 3.3 virginica virginica 0.996170
43 6.9 3.1 versicolor versicolor 0.925217
44 6.9 3.1 virginica virginica 0.996170
45 7.0 3.2 versicolor versicolor 0.990586
#Regression Example:#
#Create training data and test data.#
dat = oml.sync(table = "IRIS").split()
train_x = dat[0].drop('Sepal_Length')
train_y = dat[0]['Sepal_Length']
test_dat = dat[1]
#Create an XGBoost model object.#
setting = {'xgboost_booster': 'gblinear'}
xgb_mod = oml.xgb('regression', **setting)
#Fit the XGBoost Model according to the training data and parameter settings.#
xgb_mod.fit(train_x, train_y)
xgb_mod.predict(test_dat.drop('Species'), supplemental_cols = test_dat[:, ['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Species']]) # doctest: +NORMALIZE_WHITESPACE, +ELLIPSIS
#Ranking Example:#
#Create an XGBoost model object.#
setting = {'xgboost_objective': 'rank:pairwise',
... 'xgboost_max_depth': '3',
... 'xgboost_eta': '0.1',
... 'xgboost_gamma': '1.0',
... 'xgboost_num_round': '4'}
xgb_mod = oml.xgb('regression', **setting)
#Fit the XGBoost Model according to the training data and parameter settings.#
xgb_mod.fit(train_x, train_y)
#Use the model to make predictions on the test data, returning the Sepal_Length, Sepal_Width, Petal_Length, and Species columns in the result.#
xgb_mod.predict(test_dat.drop('Species'), supplemental_cols = test_dat[:, ['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Species']])
Listing for This Example
#Load the iris data from sklearn and combine the target and predictors into a single DataFrame, which matches the form of a database table.
Use the oml.create function to load this pandas DataFrame into the database, which creates a persistent table and returns a proxy object that you assign to z.#
>>> import oml
>>> from sklearn import datasets
>>> import pandas as pd
>>> iris = datasets.load_iris()
>>> x = pd.DataFrame(iris.data, columns = ['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width'])
>>> y = pd.DataFrame(list(map(lambda x: {0: 'setosa', 1: 'versicolor', 2:'virginica'}[x], iris.target)), columns = ['Species'])
>>> #For an on-premises database, use the following command to connect to the database.#
>>> oml.connect("<username>","<password>", dsn="<dsn>")
>>> z = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')
#Create training data and test data.#
>>> dat = oml.sync(table = "IRIS").split()
>>> train_x = dat[0].drop('Species')
>>> train_y = dat[0]['Species']
>>> test_dat = dat[1]
#Classification Example:#
#Create an XGBoost model object.#
>>> setting = {'xgboost_max_depth': '3',
... 'xgboost_eta': '1',
... 'xgboost_num_round': '10'}
>>> xgb_mod = oml.xgb('classification', **setting)
#Fit the XGBoost model to the training data.#
>>> xgb_mod.fit(train_x, train_y)
Algorithm Name: XGBOOST
Mining Function: CLASSIFICATION
Target: Species
Settings:
setting name setting value
0 ALGO_NAME ALGO_XGBOOST
1 CLAS_WEIGHTS_BALANCED OFF
2 ODMS_DETAILS ODMS_ENABLE
3 ODMS_MISSING_VALUE_TREATMENT ODMS_MISSING_VALUE_AUTO
4 ODMS_SAMPLING ODMS_SAMPLING_DISABLE
5 PREP_AUTO ON
6 booster gbtree
7 eta 1
8 max_depth 3
9 ntree_limit 0
10 num_round 10
11 objective multi:softprob
Global Statistics:
attribute name attribute value
0 NUM_ROWS 104
1 mlogloss 0.024858
Attributes:
Petal_Length
Petal_Width
Sepal_Length
Sepal_Width
Partition: NO
ATTRIBUTE IMPORTANCE:
PNAME ATTRIBUTE_NAME ATTRIBUTE_SUBNAME ATTRIBUTE_VALUE GAIN COVER \
0 None Petal_Length None None 0.743941 0.560554
1 None Petal_Width None None 0.162191 0.245400
2 None Sepal_Length None None 0.003738 0.044741
3 None Sepal_Width None None 0.090129 0.149306
FREQUENCY
0 0.447761
1 0.268657
2 0.119403
3 0.164179
#Use the model to make predictions on the test data and return the prediction probabilities for each category in Species.#
>>> xgb_mod.predict(test_dat.drop('Species'), supplemental_cols = test_dat[:, ['Sepal_Length', 'Sepal_Width', 'Species']], proba = True).sort_values(by = ['Sepal_Length', 'Sepal_Width'])
Sepal_Length Sepal_Width Species TOP_1 TOP_1_VAL
0 4.4 3.0 setosa setosa 0.993619
1 4.4 3.2 setosa setosa 0.993619
2 4.5 2.3 setosa setosa 0.942128
3 4.8 3.4 setosa setosa 0.993619
... ... ... ... ... ...
42 6.7 3.3 virginica virginica 0.996170
43 6.9 3.1 versicolor versicolor 0.925217
44 6.9 3.1 virginica virginica 0.996170
45 7.0 3.2 versicolor versicolor 0.990586
#Regression Example:#
#Create training data and test data.#
>>> dat = oml.sync(table = "IRIS").split()
>>> train_x = dat[0].drop('Sepal_Length')
>>> train_y = dat[0]['Sepal_Length']
>>> test_dat = dat[1]
#Create an XGBoost model object.#
>>> setting = {'xgboost_booster': 'gblinear'}
>>> xgb_mod = oml.xgb('regression', **setting)
#Fit the XGBoost Model according to the training data and parameter settings.#
>>> xgb_mod.fit(train_x, train_y)
Algorithm Name: XGBOOST
Mining Function: REGRESSION
Target: Sepal_Length
Settings:
setting name setting value
0 ALGO_NAME ALGO_XGBOOST
1 ODMS_DETAILS ODMS_ENABLE
2 ODMS_MISSING_VALUE_TREATMENT ODMS_MISSING_VALUE_AUTO
3 ODMS_SAMPLING ODMS_SAMPLING_DISABLE
4 PREP_AUTO ON
5 booster gblinear
6 ntree_limit 0
7 num_round 10
Computed Settings:
setting name setting value
0 ODMS_EXPLOSION_MIN_SUPP 1
Global Statistics:
attribute name attribute value
0 NUM_ROWS 104
1 rmse 0.364149
Attributes:
Petal_Length
Petal_Width
Sepal_Width
Species
Partition: NO
ATTRIBUTE IMPORTANCE:
PNAME ATTRIBUTE_NAME ATTRIBUTE_SUBNAME ATTRIBUTE_VALUE WEIGHT CLASS
0 None Petal_Length None None 0.335183 0
1 None Petal_Width None None 0.368738 0
2 None Sepal_Width None None 0.249208 0
3 None Species None versicolor -0.197582 0
4 None Species None virginica -0.170522 0
>>> xgb_mod.predict(test_dat.drop('Species'), supplemental_cols = test_dat[:, ['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Species']]) # doctest: +NORMALIZE_WHITESPACE, +ELLIPSIS
Sepal_Length Sepal_Width Petal_Length Species PREDICTION
0 4.9 3.0 1.4 setosa 4.797075
1 4.9 3.1 1.5 setosa 4.818641
2 4.8 3.4 1.6 setosa 4.963796
3 5.8 4.0 1.2 setosa 4.979247
... ... ... ... ... ...
42 6.7 3.3 5.7 virginica 6.990700
43 6.7 3.0 5.2 virginica 6.674599
44 6.5 3.0 5.2 virginica 6.563977
45 5.9 3.0 5.1 virginica 6.456711
#Ranking Example:#
#Create an XGBoost model object.#
>>> setting = {'xgboost_objective': 'rank:pairwise',
... 'xgboost_max_depth': '3',
... 'xgboost_eta': '0.1',
... 'xgboost_gamma': '1.0',
... 'xgboost_num_round': '4'}
>>> xgb_mod = oml.xgb('regression', **setting)
#Fit the XGBoost Model according to the training data and parameter settings.#
>>> xgb_mod.fit(train_x, train_y)
Algorithm Name: XGBOOST
Mining Function: REGRESSION
Target: Sepal_Length
Settings:
setting name setting value
0 ALGO_NAME ALGO_XGBOOST
1 ODMS_DETAILS ODMS_ENABLE
2 ODMS_MISSING_VALUE_TREATMENT ODMS_MISSING_VALUE_AUTO
3 ODMS_SAMPLING ODMS_SAMPLING_DISABLE
4 PREP_AUTO ON
5 booster gbtree
6 eta 0.1
7 gamma 1.0
8 max_depth 3
9 ntree_limit 0
10 num_round 4
11 objective rank:pairwise
Computed Settings:
setting name setting value
0 ODMS_EXPLOSION_MIN_SUPP 1
Global Statistics:
attribute name attribute value
0 NUM_ROWS 104
1 map 1
Attributes:
Petal_Length
Petal_Width
Sepal_Width
Species
Partition: NO
ATTRIBUTE IMPORTANCE:
PNAME ATTRIBUTE_NAME ATTRIBUTE_SUBNAME ATTRIBUTE_VALUE GAIN COVER \
0 None Petal_Length None None 0.873855 0.677624
1 None Petal_Width None None 0.083504 0.184802
2 None Sepal_Width None None 0.042641 0.137574
FREQUENCY
0 0.500000
1 0.285714
2 0.214286
#Use the model to make predictions on the test data, returning the Sepal_Length, Sepal_Width, Petal_Length, and Species columns in the result.#
>>> xgb_mod.predict(test_dat.drop('Species'), supplemental_cols = test_dat[:, ['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Species']])
Sepal_Length Sepal_Width Petal_Length Species PREDICTION
0 4.9 3.0 1.4 setosa 0.243485
1 4.9 3.1 1.5 setosa 0.243485
2 4.8 3.4 1.6 setosa 0.243485
3 5.8 4.0 1.2 setosa 0.310980
... ... ... ... ... ...
42 6.7 3.3 5.7 virginica 0.771761
43 6.7 3.0 5.2 virginica 0.728637
44 6.5 3.0 5.2 virginica 0.728637
45 5.9 3.0 5.1 virginica 0.674835