Generalized Linear Models

4.2.8 Generalized Linear Models

The ore.odmGLM function builds a Generalized Linear Model (GLM) model, which includes and extends the class of linear models (linear regression).

Generalized linear models relax the restrictions on linear models, which are often violated in practice. For example, binary (yes/no or 0/1) responses do not have same variance across classes.

The OML4SQL GLM is a parametric modeling technique. Parametric models make assumptions about the distribution of the data. When the assumptions are met, parametric models can be more efficient than non-parametric models.

The challenge in developing models of this type involves assessing the extent to which the assumptions are met. For this reason, quality diagnostics are key to developing quality parametric models.

In addition to the classical weighted least squares estimation for linear regression and iteratively re-weighted least squares estimation for logistic regression, both solved through Cholesky decomposition and matrix inversion, OML4SQL GLM provides a conjugate gradient-based optimization algorithm that does not require matrix inversion and is very well suited to high-dimensional data. The choice of algorithm is handled internally and is transparent to the user.

GLM can be used to build classification or regression models as follows:

Classification: Binary logistic regression is the GLM classification algorithm. The algorithm uses the logit link function and the binomial variance function.
Regression: Linear regression is the GLM regression algorithm. The algorithm assumes no target transformation and constant variance over the range of target values.

The ore.odmGLM function allows you to build two different types of models. Some arguments apply to classification models only and some to regression models only.

For information on the ore.odmGLM function arguments, invoke help(ore.odmGLM).

The following examples build several models using GLM. The input ore.frame objects are R data sets pushed to the database.

Settings for a Generalized Linear Models

The following table lists settings that apply to Generalized Linear models.

Table 4-8 Generalized Linear Model Settings

Setting Name	Setting Value	Description
`GLMS_CONF_LEVEL`	`TO_CHAR(0< numeric_expr <1)`	The confidence level for coefficient confidence intervals. The default confidence level is `0.95`.
`GLMS_FTR_GEN_METHOD`	`GLMS_FTR_GEN_QUADRATIC` `GLMS_FTR_GEN_CUBIC`	Whether feature generation is quadratic or cubic. When feature generation is enabled, the algorithm automatically chooses the most appropriate feature generation method based on the data.
`GLMS_FTR_GENERATION`	`GLMS_FTR_GENERATION_ENABLE` `GLMS_FTR_GENERATION_DISABLE`	Whether or not feature generation is enabled for GLM. By default, feature generation is not enabled. Note: Feature generation can only be enabled when feature selection is also enabled.
`GLMS_FTR_SEL_CRIT`	`GLMS_FTR_SEL_AIC` `GLMS_FTR_SEL_SBIC` `GLMS_FTR_SEL_RIC` `GLMS_FTR_SEL_ALPHA_INV`	Feature selection penalty criterion for adding a feature to the model. When feature selection is enabled, the algorithm automatically chooses the penalty criterion based on the data.
`GLMS_FTR_SELECTION`	`GLMS_FTR_SELECTION_ENABLE` `GLMS_FTR_SELECTION_DISABLE`	Whether or not feature selection is enabled for GLM. By default, feature selection is not enabled.
`GLMS_MAX_FEATURES`	`TO_CHAR(0 < numeric_expr <= 2000)`	When feature selection is enabled, this setting specifies the maximum number of features that can be selected for the final model. By default, the algorithm limits the number of features to ensure sufficient memory.
`GLMS_PRUNE_MODEL`	`GLMS_PRUNE_MODEL_ENABLE` `GLMS_PRUNE_MODEL_DISABLE`	Prune enable or disable for features in the final model. Pruning is based on T-Test statistics for linear regression, or Wald Test statistics for logistic regression. Features are pruned in a loop until all features are statistically significant with respect to the full data. When feature selection is enabled, the algorithm automatically performs pruning based on the data.
`GLMS_REFERENCE_CLASS_NAME`	target_value	The target value used as the reference class in a binary logistic regression model. Probabilities are produced for the non-reference class. By default, the algorithm chooses the value with the highest prevalence (the most cases) for the reference class.
`GLMS_RIDGE_REGRESSION`	`GLMS_RIDGE_REG_ENABLE` `GLMS_RIDGE_REG_DISABLE`	Enable or disable Ridge Regression. Ridge applies to both regression and Classification mining functions. When ridge is enabled, prediction bounds are not produced by the `PREDICTION_BOUNDS` SQL function. Note: Ridge may only be enabled when feature selection is not specified, or has been explicitly disabled. If Ridge Regression and feature selection are both explicitly enabled, then an exception is raised.
`GLMS_RIDGE_VALUE`	`TO_CHAR (numeric_expr > 0)`	The value of the ridge parameter. This setting is only used when the algorithm is configured to use Ridge Regression. If Ridge Regression is enabled internally by the algorithm, then the ridge parameter is determined by the algorithm.
`GLMS_ROW_DIAGNOSTICS`	`GLMS_ROW_DIAG_ENABLE` `GLMS_ROW_DIAG_DISABLE (default).`	Enable or disable row diagnostics.
`GLMS_CONV_TOLERANCE`	The range is (`0, 1`) non-inclusive.	Convergence Tolerance setting of the GLM algorithm The default value is system-determined.
`GLMS_NUM_ITERATIONS`	Positive integer	Maximum number of iterations for the GLM algorithm. The default value is system-determined.
`GLMS_BATCH_ROWS`	`0` or Positive integer	Number of rows in a batch used by the SGD solver. The value of this parameter sets the size of the batch for the SGD solver. An input of 0 triggers a data driven batch size estimate. The default is `2000`
`GLMS_SOLVER`	`GLMS_SOLVER_SGD (StochasticGradient Descent)` `GLMS_SOLVER_CHOL (Cholesky)` `GLMS_SOLVER_QR` `GLMS_SOLVER_LBFGS_ADMM`	This setting allows the user to choose the GLM solver. The solver cannot be selected if `GLMS_FTR_SELECTION` setting is enabled. The default value is system determined.
`GLMS_SPARSE_SOLVER`	`GLMS_SPARSE_SOLVER_ENABLE` `GLMS_SPARSE_SOLVER_DISABLE` (default).	This setting allows the user to use sparse solver if it is available. The default value is `GLMS_SPARSE_SOLVER_DISABLE`.

Example 4-15 Building a Linear Regression Model

This example builds a linear regression model using the longley data set.

longley_of <- ore.push(longley)
longfit1 <- ore.odmGLM(Employed ~ ., data = longley_of)
summary(longfit1)

Listing for This Example

R> longley_of <- ore.push(longley)
R> longfit1 <- ore.odmGLM(Employed ~ ., data = longley_of)
R> summary(longfit1)
 
Call:
ore.odmGLM(formula = Employed ~ ., data = longely_of)
 
Residuals:
     Min       1Q   Median       3Q      Max 
-0.41011 -0.15767 -0.02816  0.10155  0.45539 
 
Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -3.482e+03  8.904e+02  -3.911 0.003560 ** 
GNP.deflator  1.506e-02  8.492e-02   0.177 0.863141    
GNP          -3.582e-02  3.349e-02  -1.070 0.312681    
Unemployed   -2.020e-02  4.884e-03  -4.136 0.002535 ** 
Armed.Forces -1.033e-02  2.143e-03  -4.822 0.000944 ***
Population   -5.110e-02  2.261e-01  -0.226 0.826212    
Year          1.829e+00  4.555e-01   4.016 0.003037 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
 
Residual standard error: 0.3049 on 9 degrees of freedom
Multiple R-squared:  0.9955,    Adjusted R-squared:  0.9925 
F-statistic: 330.3 on 6 and 9 DF,  p-value: 4.984e-10

Example 4-16 Using Ridge Estimation for the Coefficients of the ore.odmGLM Model

This example uses the longley_of ore.frame from the previous example. This example invokes the ore.odmGLM function and specifies using ridge estimation for the coefficients.

longfit2 <- ore.odmGLM(Employed ~ ., data = longley_of, ridge = TRUE,
                       ridge.vif = TRUE)
summary(longfit2)

Listing for This Example

R> longfit2 <- ore.odmGLM(Employed ~ ., data = longley_of, ridge = TRUE,
+                         ridge.vif = TRUE)
R> summary(longfit2)
 
Call:
ore.odmGLM(formula = Employed ~ ., data = longley_of, ridge = TRUE, 
    ridge.vif = TRUE)
 
Residuals:
    Min      1Q  Median      3Q     Max 
-0.4100 -0.1579 -0.0271  0.1017  0.4575 
 
Coefficients:
               Estimate   VIF
(Intercept)  -3.466e+03 0.000
GNP.deflator  1.479e-02 0.077
GNP          -3.535e-02 0.012
Unemployed   -2.013e-02 0.000
Armed.Forces -1.031e-02 0.000
Population   -5.262e-02 0.548
Year          1.821e+00 2.212
 
Residual standard error: 0.3049 on 9 degrees of freedom
Multiple R-squared:  0.9955,    Adjusted R-squared:  0.9925 
F-statistic: 330.2 on 6 and 9 DF,  p-value: 4.986e-10

Example 4-17 Building a Logistic Regression GLM

This example builds a logistic regression (classification) model. It uses the infert data set. The example invokes the ore.odmGLM function and specifies logistic as the type argument, which builds a binomial GLM.

infert_of <- ore.push(infert)
infit1 <- ore.odmGLM(case ~ age+parity+education+spontaneous+induced,
                     data = infert_of, type = "logistic")
infit1

Listing for This Example

R> infert_of <- ore.push(infert)
R> infit1 <- ore.odmGLM(case ~ age+parity+education+spontaneous+induced,
+                       data = infert_of, type = "logistic")
R> infit1
 
Response:
case == "1"
 
Call:  ore.odmGLM(formula = case ~ age + parity + education + spontaneous + 
    induced, data = infert_of, type = "logistic")
 
Coefficients:
     (Intercept)               age            parity   education0-5yrs  education12+ yrs       spontaneous           induced  
        -2.19348           0.03958          -0.82828           1.04424          -0.35896           2.04590           1.28876  
 
Degrees of Freedom: 247 Total (i.e. Null);  241 Residual
Null Deviance:      316.2 
Residual Deviance: 257.8        AIC: 271.8

Example 4-18 Specifying a Reference Value in Building a Logistic Regression GLM

This example builds a logistic regression (classification) model and specifies a reference value. The example uses the infert_of ore.frame from Example 4-17.

infit2 <- ore.odmGLM(case ~ age+parity+education+spontaneous+induced,
                     data = infert_of, type = "logistic", reference = 1)
infit2

Listing for This Example

infit2 <- ore.odmGLM(case ~ age+parity+education+spontaneous+induced,
                     data = infert_of, type = "logistic", reference = 1)
infit2

Response:
case == "0"
 
Call:  ore.odmGLM(formula = case ~ age + parity + education + spontaneous + 
    induced, data = infert_of, type = "logistic", reference = 1)
 
Coefficients:
     (Intercept)               age            parity   education0-5yrs  education12+ yrs       spontaneous           induced  
         2.19348          -0.03958           0.82828          -1.04424           0.35896          -2.04590          -1.28876  
 
Degrees of Freedom: 247 Total (i.e. Null);  241 Residual
Null Deviance:      316.2 
Residual Deviance: 257.8        AIC: 271.8

Parent topic: Build Oracle Machine Learning for SQL Models