6.22 XGBoost Model

The ore.odmXGB class is a scalable gradient tree boosting system that supports both classification and regression. It exposes the open source XGBoost gradient boosting framework: it prepares the training data, calls the in-database XGBoost implementation, builds and persists a model, and applies the model for prediction.

Note:

The ore.odmXGB algorithm is available in Oracle Database 21c or later.

You can use ore.odmXGB as a stand-alone predictor or incorporate it into real-world production pipelines for a wide range of problems, such as ad click-through rate prediction, hazard risk prediction, and web text classification.

The ore.odmXGB algorithm takes three types of parameters: general parameters, booster parameters, and task parameters.

A booster is an ensemble learning method that combines a set of weak learners into a strong learner to minimize training error. The booster in XGBoost determines the type of model used in the ensemble.

  • General parameters relate to which booster is used for boosting, commonly a tree or a linear model.

  • Booster parameters depend on the booster you have chosen.

  • Learning task parameters decide the learning scenario. For example, regression tasks may use different parameters than ranking tasks.

The algorithm supports most of the settings of the open source project. While the CREATE_MODEL procedure requires a separate model settings table for parameter specification, CREATE_MODEL2 offers a more streamlined approach: it allows you to pass a list of parameters directly within the function call.
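As an illustration, the sketch below builds a small regression model with explicit booster settings. The dot-separated argument names (booster, num.round) are an assumption based on the setting names shown in the model summaries later in this section; check the ore.odmXGB reference for the exact signature.

# Minimal sketch (argument names assumed): build an XGBoost regression
# model with the dart booster and 50 boosting rounds.
options(ore.warn.order = FALSE)
df   <- data.frame(x = seq(0.1, 5, by = 0.02))
df$y <- log(df$x) + rnorm(nrow(df), sd = 0.2)
DAT  <- ore.push(df)
xgb.mod <- ore.odmXGB(y ~ x, DAT, "regression",
                      booster   = "dart",   # tree booster with dropout
                      num.round = 50)       # number of boosting iterations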

Through ore.odmXGB, OML4R supports multiple classification and regression specifications, as well as ranking and survival models. Binary and multi-class models are used for classification tasks, while regression is used to predict continuous values. Ranking, count, and survival analysis are separate tasks addressed by specialized machine learning techniques.

ore.odmXGB also supports partitioned models and internalizes the data preparation.

XGBoost feature interaction constraints allow you to specify which variables can and cannot interact. By focusing on key interactions and eliminating noise, this aids in improving predictive performance and, in turn, may lead to more generalized predictions. For more information about XGBoost feature interaction constraints, see Oracle Machine Learning for SQL Concepts Guide.

Settings for an XGBoost model

The following table lists settings that apply to XGBoost models.

Table 6-23 XGBoost Model Settings

booster

Setting value: One string value from the list:

  • dart
  • gblinear
  • gbtree

Description: The booster to use. The dart and gbtree boosters use tree-based models, whereas gblinear uses linear functions. Use dart to prevent overfitting by dropping trees and introducing randomness; use gbtree for traditional gradient boosting without dropout.

The default value is gbtree.

num_round

Setting value: X >= 0

Description: The number of rounds for boosting, that is, the number of iterations used to build the final model.

The default value is 10.

xgboost_interaction_constraints

Note:

Available only in Oracle Database 23ai.

Setting value: One string in the format of a nested list, for example, [[x0,x1,x2],[x0,x4],[x5,x6]], where x0 through x6 are feature (column) names.

Description: This setting specifies the permitted interactions in the model. Specify the constraints as a nested list in which each inner list is a group of features (column names) that are allowed to interact with each other. If an inner list contains only a single column, that input is ignored.

In the example above, features x0, x1, and x2 are allowed to interact with each other but with no other feature. Similarly, x0 and x4 are allowed to interact with each other but with no other feature, and so on. This setting is applicable to 2-dimensional features. An error occurs if you pass columns of a non-supported type or nonexistent feature names.
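For illustration, the following hedged sketch passes such a constraint string when building a model from an ore.frame DAT that is assumed to contain columns x0 through x6 and a target y. The argument name xgboost.interaction.constraints is a hypothetical R-side mapping of this setting; verify the exact name in the ore.odmXGB reference.

# Hypothetical sketch: only features listed in the same inner group may
# appear together on a tree path (the argument name below is assumed).
constr  <- "[[x0,x1,x2],[x0,x4],[x5,x6]]"
xgb.mod <- ore.odmXGB(y ~ ., DAT, "regression",
                      xgboost.interaction.constraints = constr)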

xgboost_decrease_constraints

Note:

Available only in Oracle Database 23ai.

Setting value: One string of comma-separated feature (column) names. For example, x4,x5.

Description: This setting specifies the features (column names) that must obey a decreasing constraint. For example, the setting value 'x4,x5' sets a decreasing constraint on features x4 and x5. This setting applies to numeric columns and 2-dimensional features. An error occurs if you pass columns of a non-supported type or nonexistent feature names.

xgboost_increase_constraints

Note:

Available only in Oracle Database 23ai.

Setting value: One string of comma-separated feature (column) names. For example, x0,x3.

Description: This setting specifies the features (column names) that must obey an increasing constraint. For example, the setting value 'x0,x3' sets an increasing constraint on features x0 and x3. This setting is applicable to 2-dimensional features. An error occurs if you pass columns of a non-supported type or nonexistent feature names.
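A combined sketch follows, under the same assumption that these settings map to dot-separated named arguments of ore.odmXGB. The argument names are hypothetical, and DAT is assumed to be an ore.frame with numeric columns x0 through x5 and a target y.

# Hypothetical sketch: force a monotonically increasing effect for x0 and
# x3 and a decreasing effect for x4 and x5 (argument names are assumed).
xgb.mod <- ore.odmXGB(y ~ ., DAT, "regression",
                      xgboost.increase.constraints = "x0,x3",
                      xgboost.decrease.constraints = "x4,x5")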

objective

Note:

Available only in Oracle Database 23ai.

Setting value: For a classification model, one string value from the list:

  • binary:hinge
  • binary:logistic
  • multi:softmax
  • multi:softprob

For a regression model, one string value from the list:

  • binary:logitraw
  • count:poisson
  • rank:map
  • rank:ndcg
  • rank:pairwise
  • reg:gamma
  • reg:logistic
  • reg:tweedie
  • survival:aft
  • survival:cox
  • reg:squarederror
  • reg:squaredlogerror

Description: Objective values for a classification model:

  • binary:hinge: Hinge loss for binary classification. This setting makes predictions of 0 or 1, rather than producing probabilities.
  • binary:logistic: Logistic regression for binary classification. The output is the probability.
  • multi:softmax: Performs multiclass classification using the softmax objective; you must also set num_class (the number of classes).
  • multi:softprob: Same as softmax, except the output is a vector of ndata * nclass, which can be further reshaped to an ndata * nclass matrix. The result contains the predicted probability of each data point belonging to each class.

The default objective value for classification is multi:softprob.

Objective values for a regression model:

  • binary:logitraw: Logistic regression for binary classification; the output is the score before logistic transformation.
  • count:poisson: Poisson regression for count data; the output is the mean of the Poisson distribution. The max_delta_step value is set to 0.7 by default in Poisson regression to safeguard optimization.
  • rank:map: Using LambdaMART, performs list-wise ranking in which the Mean Average Precision (MAP) is maximized.
  • rank:ndcg: Using LambdaMART, performs list-wise ranking in which the Normalized Discounted Cumulative Gain (NDCG) is maximized.
  • rank:pairwise: Performs ranking by minimizing the pairwise loss.
  • reg:gamma: Gamma regression with log-link; the output is the mean of the gamma distribution. This setting might be useful for any outcome that might be gamma-distributed, such as modeling insurance claims severity.
  • reg:logistic: Logistic regression.
  • reg:tweedie: Tweedie regression with log-link. This setting might be useful for any outcome that might be Tweedie-distributed, such as modeling total loss in insurance.
  • survival:aft: Applies the Accelerated Failure Time (AFT) model for censored survival time data. When you select this option, eval_metric uses aft-nloglik as the default value.
  • survival:cox: Cox regression for right-censored survival time data (negative values are considered right-censored). Predictions are returned on the hazard ratio scale (that is, as HR = exp(marginal_prediction) in the proportional hazard function h(t) = h0(t) * HR).
  • reg:squarederror: Regression with squared loss.
  • reg:squaredlogerror: Regression with squared log loss. All input labels must be greater than -1.

The default objective value for regression is reg:squarederror.
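For example, a count-valued target suggests Poisson regression. The sketch below assumes that the objective setting maps to an objective argument of ore.odmXGB, as the setting name in the classification model summary later in this section suggests, and that DAT is an ore.frame whose column cnt holds counts; verify the argument name in the ore.odmXGB reference.

# Minimal sketch: override the default regression objective to model a
# count-valued target with Poisson regression (argument name is assumed).
xgb.mod <- ore.odmXGB(cnt ~ ., DAT, "regression",
                      objective = "count:poisson")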

xgboost_aft_loss_distribution

Note:

Available only in Oracle Database 23ai.

Setting value: One string value from the list: normal, logistic, extreme.

Description: Specifies the distribution of the Z term in the AFT model, that is, the probability density function used by the survival:aft objective and the aft-nloglik evaluation metric.

The default value is normal.

xgboost_aft_loss_distribution_scale

Note:

Available only in Oracle Database 23ai.

Setting value: X > 0

Description: Specifies the scaling factor σ, which scales the size of the Z term in the AFT model.

The default value is 1.

xgboost_aft_right_bound_column_name

Note:

Available only in Oracle Database 23ai.

Setting value: column_name

Description: Specifies the column containing the right bounds of the labels for an AFT model. You cannot select this parameter for a non-AFT model.

Note:

Oracle Machine Learning does not support BOOLEAN values for this setting.
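Taken together, the AFT settings might be combined as in the following sketch for censored survival data. Here SURV is assumed to be an ore.frame whose target column left_bound holds the lower bounds of the survival times and whose column right_bound holds the upper bounds; all dot-separated argument names are hypothetical R-side mappings of the SQL settings above.

# Hypothetical sketch: an AFT survival model with a logistic Z-term
# distribution and an explicit right-bound column (argument names assumed).
xgb.mod <- ore.odmXGB(left_bound ~ . - right_bound, SURV, "regression",
                      objective = "survival:aft",
                      xgboost.aft.loss.distribution       = "logistic",
                      xgboost.aft.loss.distribution.scale = 1.5,
                      xgboost.aft.right.bound.column.name = "right_bound")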

Example 6-27 Using the ore.odmXGB Regression Function

This example pushes the data frame to a temporary database table DAT and creates an XGBoost model.

# Turn off row ordering warnings

options(ore.warn.order=FALSE)

# Data setup

x <- seq(0.1, 5, by = 0.02)
y <- log(x) + rnorm(x, sd = 0.2)


# Create a temporary OML4R proxy object DAT.

DAT <- ore.push(data.frame(x = x, y = y))

# Create an XGBoost regression model object. Fit the XGBoost model according to the data and setting parameters.

xgb.mod <- ore.odmXGB(y ~ x, DAT, "regression")

# Display the model summary and attribute importance

summary(xgb.mod)
importance(xgb.mod)

# Use the model to make predictions on the input data.

xgb.res <- predict(xgb.mod, DAT, supplemental.cols = "x")
head(xgb.res, 6)

Listing for This Example

>   x <- seq(0.1, 5, by = 0.02)
>   y <- log(x) + rnorm(x, sd = 0.2)
>   DAT <- ore.push(data.frame(x = x, y = y))

>   xgb.mod <- ore.odmXGB(y ~ x, DAT, "regression")
>   summary(xgb.mod)

Call:
ore.odmXGB(formula = y ~ x, data = DAT, type = "regression")

Settings: 
                                               value
odms.details                             odms.enable
odms.missing.value.treatment odms.missing.value.auto
odms.sampling                  odms.sampling.disable
prep.auto                                         ON
booster                                       gbtree
ntree.limit                                        0
num.round                                         10

Importance: 
  PNAME ATTRIBUTE_NAME ATTRIBUTE_SUBNAME ATTRIBUTE_VALUE GAIN COVER FREQUENCY
1  <NA>              x              <NA>            <NA>    1     1         1

>   importance(xgb.mod)
  PNAME ATTRIBUTE_NAME ATTRIBUTE_SUBNAME ATTRIBUTE_VALUE GAIN COVER FREQUENCY
1  <NA>              x              <NA>            <NA>    1     1         1
>   xgb.res <- predict(xgb.mod, DAT, supplemental.cols = "x")
>   head(xgb.res, 6)
     x PREDICTION
1 0.10  -1.957506
2 0.12  -1.957506
3 0.14  -1.957506
4 0.16  -1.484602
5 0.18  -1.559072
6 0.20  -1.559072

Example 6-28 Using the ore.odmXGB Classification Function

This example pushes the data frame mtcars to a temporary database table MTCARS and creates an XGBoost model.

# Turn off row ordering warnings

options(ore.warn.order=FALSE)

# Data setup

m <- mtcars
m$gear <- as.factor(m$gear)
m$cyl  <- as.factor(m$cyl)
m$vs   <- as.factor(m$vs)
m$ID   <- 1:nrow(m)

# Create a temporary OML4R proxy object MTCARS.

MTCARS <- ore.push(m)

# Create an XGBoost classification model object. Fit the XGBoost model according to the data and setting parameters.

xgb.mod <- ore.odmXGB(gear ~ .-ID, MTCARS, "classification")

# Display the model summary and attribute importance

summary(xgb.mod)
importance(xgb.mod)

# Use the model to make predictions on the input data.

xgb.res <- predict(xgb.mod, MTCARS, supplemental.cols = "gear")

# Generate a confusion matrix.
with(xgb.res, table(gear, PREDICTION))

Listing for This Example

>   m <- mtcars
>   m$gear <- as.factor(m$gear)
>   m$cyl  <- as.factor(m$cyl)
>   m$vs   <- as.factor(m$vs)
>   m$ID   <- 1:nrow(m)
>   MTCARS <- ore.push(m)

>   xgb.mod  <- ore.odmXGB(gear ~ .-ID, MTCARS,"classification")
>   summary(xgb.mod)

Call:
ore.odmXGB(formula = gear ~ . - ID, data = MTCARS, type = "classification")

Settings: 
                                               value
clas.weights.balanced                            OFF
odms.details                             odms.enable
odms.missing.value.treatment odms.missing.value.auto
odms.sampling                  odms.sampling.disable
prep.auto                                         ON
booster                                       gbtree
ntree.limit                                        0
num.round                                         10
objective                             multi:softprob

Importance: 
  PNAME ATTRIBUTE_NAME ATTRIBUTE_SUBNAME ATTRIBUTE_VALUE         GAIN
1  <NA>             am              <NA>            <NA> 0.1062399524
2  <NA>           carb              <NA>            <NA> 0.0001902411
3  <NA>           disp              <NA>            <NA> 0.1903797590
4  <NA>           drat              <NA>            <NA> 0.5099772379
5  <NA>             hp              <NA>            <NA> 0.0120000788
6  <NA>            mpg              <NA>            <NA> 0.0040766784
7  <NA>           qsec              <NA>            <NA> 0.1771360524
        COVER  FREQUENCY
1 0.121840842 0.13924051
2 0.009026413 0.02531646
3 0.292335393 0.36708861
4 0.320671772 0.24050633
5 0.028994248 0.02531646
6 0.022994361 0.03797468
7 0.204136970 0.16455696

>   importance(xgb.mod)
  PNAME ATTRIBUTE_NAME ATTRIBUTE_SUBNAME ATTRIBUTE_VALUE         GAIN
1  <NA>             am              <NA>            <NA> 0.1062399524
2  <NA>           carb              <NA>            <NA> 0.0001902411
3  <NA>           disp              <NA>            <NA> 0.1903797590
4  <NA>           drat              <NA>            <NA> 0.5099772379
5  <NA>             hp              <NA>            <NA> 0.0120000788
6  <NA>            mpg              <NA>            <NA> 0.0040766784
7  <NA>           qsec              <NA>            <NA> 0.1771360524
        COVER  FREQUENCY
1 0.121840842 0.13924051
2 0.009026413 0.02531646
3 0.292335393 0.36708861
4 0.320671772 0.24050633
5 0.028994248 0.02531646
6 0.022994361 0.03797468
7 0.204136970 0.16455696
>   xgb.res <- predict(xgb.mod, MTCARS, supplemental.cols = "gear")
>   with(xgb.res, table(gear, PREDICTION))
    PREDICTION
gear  3  4  5
   3 15  0  0
   4  0 12  0
   5  0  0  5