4.2.8 Generalized Linear Models

The ore.odmGLM function builds a Generalized Linear Model (GLM), which includes and extends the class of linear models (linear regression).

Generalized linear models relax the restrictions on linear models, which are often violated in practice. For example, binary (yes/no or 0/1) responses do not have the same variance across classes.

The OML4SQL GLM is a parametric modeling technique. Parametric models make assumptions about the distribution of the data. When the assumptions are met, parametric models can be more efficient than non-parametric models.

The challenge in developing models of this type involves assessing the extent to which the assumptions are met. For this reason, quality diagnostics are key to developing quality parametric models.

In addition to classical weighted least squares estimation for linear regression and iteratively re-weighted least squares estimation for logistic regression, both solved through Cholesky decomposition and matrix inversion, OML4SQL GLM provides a conjugate gradient-based optimization algorithm. This algorithm does not require matrix inversion and is well suited to high-dimensional data. The choice of algorithm is handled internally and is transparent to the user.
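The iteratively re-weighted least squares idea mentioned above can be sketched in base R. This is a minimal illustration and not the OML4SQL implementation: the function name irls_logistic, the iteration limit, and the tolerance are assumptions made for this sketch, and it runs in a local R session against the infert data set rather than in the database.

```r
# Minimal IRLS sketch for logistic regression (illustrative only).
# X must already contain an intercept column; y is a 0/1 vector.
irls_logistic <- function(X, y, max_iter = 25, tol = 1e-8) {
  beta <- rep(0, ncol(X))
  for (iter in seq_len(max_iter)) {
    eta <- drop(X %*% beta)        # linear predictor
    mu  <- plogis(eta)             # inverse logit link
    w   <- mu * (1 - mu)           # binomial variance function
    z   <- eta + (y - mu) / w      # working response
    # Weighted normal equations (X'WX) beta = X'Wz, solved by Cholesky
    R <- chol(crossprod(X, w * X))
    beta_new <- drop(backsolve(R, forwardsolve(t(R), crossprod(X, w * z))))
    converged <- max(abs(beta_new - beta)) < tol
    beta <- beta_new
    if (converged) break
  }
  beta
}

X <- model.matrix(case ~ age + parity + spontaneous + induced, data = infert)
y <- infert$case

beta_irls <- irls_logistic(X, y)
beta_glm  <- coef(glm(case ~ age + parity + spontaneous + induced,
                      data = infert, family = binomial))

# The two sets of estimates should agree closely
cbind(irls = beta_irls, glm = beta_glm)
```

Each pass solves one weighted least squares problem, which is why the linear regression case needs only a single solve.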

GLM can be used to build classification or regression models as follows:

  • Classification: Binary logistic regression is the GLM classification algorithm. The algorithm uses the logit link function and the binomial variance function.

  • Regression: Linear regression is the GLM regression algorithm. The algorithm assumes no target transformation and constant variance over the range of target values.
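The link and variance functions named in the two items above can be inspected through the family objects in base R. This is a local illustration of the statistical definitions only, and it does not call ore.odmGLM; the variable names are chosen for this sketch.

```r
# Classification: logit link and binomial variance function
logit_fam <- binomial(link = "logit")
p <- c(0.1, 0.5, 0.9)

eta       <- logit_fam$linkfun(p)    # log(p / (1 - p))
p_back    <- logit_fam$linkinv(eta)  # maps the linear predictor back to p
var_binom <- logit_fam$variance(p)   # p * (1 - p), depends on the mean

# Regression: identity link (no target transformation), constant variance
gauss_fam <- gaussian(link = "identity")
mu        <- c(-2, 0, 5)
var_gauss <- gauss_fam$variance(mu)  # 1 for every mean value

var_binom
var_gauss
```

The binomial variance changes with the mean, which is the reason a binary response violates the constant-variance assumption of ordinary linear regression.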

The ore.odmGLM function allows you to build two different types of models. Some arguments apply to classification models only and some to regression models only.

For information on the ore.odmGLM function arguments, invoke help(ore.odmGLM).

The following examples build several models using GLM. The input ore.frame objects are R data sets pushed to the database.

Settings for a Generalized Linear Model

The following table lists settings that apply to Generalized Linear Models.

Table 4-8 Generalized Linear Model Settings

Setting Name: GLMS_CONF_LEVEL
Setting Value: TO_CHAR(0 < numeric_expr < 1)
Description: The confidence level for coefficient confidence intervals. The default confidence level is 0.95.

Setting Name: GLMS_FTR_GEN_METHOD
Setting Value: GLMS_FTR_GEN_QUADRATIC, GLMS_FTR_GEN_CUBIC
Description: Whether feature generation is quadratic or cubic. When feature generation is enabled, the algorithm automatically chooses the most appropriate feature generation method based on the data.

Setting Name: GLMS_FTR_GENERATION
Setting Value: GLMS_FTR_GENERATION_ENABLE, GLMS_FTR_GENERATION_DISABLE
Description: Whether or not feature generation is enabled for GLM. By default, feature generation is not enabled. Note: Feature generation can only be enabled when feature selection is also enabled.

Setting Name: GLMS_FTR_SEL_CRIT
Setting Value: GLMS_FTR_SEL_AIC, GLMS_FTR_SEL_SBIC, GLMS_FTR_SEL_RIC, GLMS_FTR_SEL_ALPHA_INV
Description: Feature selection penalty criterion for adding a feature to the model. When feature selection is enabled, the algorithm automatically chooses the penalty criterion based on the data.

Setting Name: GLMS_FTR_SELECTION
Setting Value: GLMS_FTR_SELECTION_ENABLE, GLMS_FTR_SELECTION_DISABLE
Description: Whether or not feature selection is enabled for GLM. By default, feature selection is not enabled.

Setting Name: GLMS_MAX_FEATURES
Setting Value: TO_CHAR(0 < numeric_expr <= 2000)
Description: When feature selection is enabled, this setting specifies the maximum number of features that can be selected for the final model. By default, the algorithm limits the number of features to ensure sufficient memory.

Setting Name: GLMS_PRUNE_MODEL
Setting Value: GLMS_PRUNE_MODEL_ENABLE, GLMS_PRUNE_MODEL_DISABLE
Description: Prune enable or disable for features in the final model. Pruning is based on T-Test statistics for linear regression, or Wald Test statistics for logistic regression. Features are pruned in a loop until all features are statistically significant with respect to the full data. When feature selection is enabled, the algorithm automatically performs pruning based on the data.

Setting Name: GLMS_REFERENCE_CLASS_NAME
Setting Value: target_value
Description: The target value used as the reference class in a binary logistic regression model. Probabilities are produced for the non-reference class. By default, the algorithm chooses the value with the highest prevalence (the most cases) for the reference class.

Setting Name: GLMS_RIDGE_REGRESSION
Setting Value: GLMS_RIDGE_REG_ENABLE, GLMS_RIDGE_REG_DISABLE
Description: Enable or disable Ridge Regression. Ridge applies to both regression and classification mining functions. When ridge is enabled, prediction bounds are not produced by the PREDICTION_BOUNDS SQL function. Note: Ridge may only be enabled when feature selection is not specified, or has been explicitly disabled. If Ridge Regression and feature selection are both explicitly enabled, then an exception is raised.

Setting Name: GLMS_RIDGE_VALUE
Setting Value: TO_CHAR(numeric_expr > 0)
Description: The value of the ridge parameter. This setting is only used when the algorithm is configured to use Ridge Regression. If Ridge Regression is enabled internally by the algorithm, then the ridge parameter is determined by the algorithm.

Setting Name: GLMS_ROW_DIAGNOSTICS
Setting Value: GLMS_ROW_DIAG_ENABLE, GLMS_ROW_DIAG_DISABLE (default)
Description: Enable or disable row diagnostics.

Setting Name: GLMS_CONV_TOLERANCE
Setting Value: The range is (0, 1) non-inclusive.
Description: Convergence tolerance setting of the GLM algorithm. The default value is system-determined.

Setting Name: GLMS_NUM_ITERATIONS
Setting Value: Positive integer
Description: Maximum number of iterations for the GLM algorithm. The default value is system-determined.

Setting Name: GLMS_BATCH_ROWS
Setting Value: 0 or positive integer
Description: Number of rows in a batch used by the SGD solver. The value of this parameter sets the size of the batch for the SGD solver. An input of 0 triggers a data-driven batch size estimate. The default is 2000.

Setting Name: GLMS_SOLVER
Setting Value: GLMS_SOLVER_SGD (Stochastic Gradient Descent), GLMS_SOLVER_CHOL (Cholesky), GLMS_SOLVER_QR, GLMS_SOLVER_LBFGS_ADMM
Description: This setting allows the user to choose the GLM solver. The solver cannot be selected if the GLMS_FTR_SELECTION setting is enabled. The default value is system-determined.

Setting Name: GLMS_SPARSE_SOLVER
Setting Value: GLMS_SPARSE_SOLVER_ENABLE, GLMS_SPARSE_SOLVER_DISABLE (default)
Description: This setting allows the user to use a sparse solver if it is available. The default value is GLMS_SPARSE_SOLVER_DISABLE.
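Several settings in the table constrain one another. The following sketch assembles a settings list and checks it against three of the constraints stated above before any model is built. The helper check_glm_settings is hypothetical and written only for this illustration, the setting values are arbitrary, and passing the list through an odm.settings argument of ore.odmGLM is an assumption to verify with help(ore.odmGLM) for your release.

```r
# Hypothetical helper: reports conflicts among GLM settings, based on the
# constraints described in the settings table. Not part of OML4R.
check_glm_settings <- function(settings) {
  is_set <- function(name, value) identical(settings[[name]], value)
  selection_on <- is_set("GLMS_FTR_SELECTION", "GLMS_FTR_SELECTION_ENABLE")
  problems <- character(0)
  if (is_set("GLMS_FTR_GENERATION", "GLMS_FTR_GENERATION_ENABLE") && !selection_on)
    problems <- c(problems, "feature generation requires feature selection")
  if (is_set("GLMS_RIDGE_REGRESSION", "GLMS_RIDGE_REG_ENABLE") && selection_on)
    problems <- c(problems, "ridge regression conflicts with feature selection")
  if (!is.null(settings[["GLMS_SOLVER"]]) && selection_on)
    problems <- c(problems, "solver cannot be chosen with feature selection")
  problems
}

# Illustrative values only; numeric settings are given as strings here
glm_settings <- list(
  GLMS_FTR_SELECTION = "GLMS_FTR_SELECTION_ENABLE",
  GLMS_FTR_SEL_CRIT  = "GLMS_FTR_SEL_AIC",
  GLMS_MAX_FEATURES  = "10",
  GLMS_CONF_LEVEL    = "0.9"
)
bad_settings <- c(glm_settings,
                  list(GLMS_RIDGE_REGRESSION = "GLMS_RIDGE_REG_ENABLE"))

ok_problems  <- check_glm_settings(glm_settings)   # no conflicts expected
bad_problems <- check_glm_settings(bad_settings)   # ridge + selection conflict

# Assumed usage; requires a database connection and is not run here:
# fit <- ore.odmGLM(Employed ~ ., data = longley_of, odm.settings = glm_settings)
```

Checking the list locally gives a readable message before the database raises an exception for an invalid combination.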

Example 4-15 Building a Linear Regression Model

This example builds a linear regression model using the longley data set.

longley_of <- ore.push(longley)
longfit1 <- ore.odmGLM(Employed ~ ., data = longley_of)
summary(longfit1)

Listing for This Example

R> longley_of <- ore.push(longley)
R> longfit1 <- ore.odmGLM(Employed ~ ., data = longley_of)
R> summary(longfit1)
 
Call:
ore.odmGLM(formula = Employed ~ ., data = longley_of)
 
Residuals:
     Min       1Q   Median       3Q      Max 
-0.41011 -0.15767 -0.02816  0.10155  0.45539 
 
Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -3.482e+03  8.904e+02  -3.911 0.003560 ** 
GNP.deflator  1.506e-02  8.492e-02   0.177 0.863141    
GNP          -3.582e-02  3.349e-02  -1.070 0.312681    
Unemployed   -2.020e-02  4.884e-03  -4.136 0.002535 ** 
Armed.Forces -1.033e-02  2.143e-03  -4.822 0.000944 ***
Population   -5.110e-02  2.261e-01  -0.226 0.826212    
Year          1.829e+00  4.555e-01   4.016 0.003037 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
 
Residual standard error: 0.3049 on 9 degrees of freedom
Multiple R-squared:  0.9955,    Adjusted R-squared:  0.9925 
F-statistic: 330.3 on 6 and 9 DF,  p-value: 4.984e-10
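As a sanity check on the listing above, the same formula can be fitted locally with base R lm on the in-memory longley data set. This sketch assumes a local R session and does not use the database; the fit statistics should match the listing to the printed precision.

```r
# Local ordinary least squares fit on the same data and formula
local_fit     <- lm(Employed ~ ., data = longley)
local_summary <- summary(local_fit)

# Quantities to compare with the ore.odmGLM summary
round(local_summary$sigma, 4)          # residual standard error
round(local_summary$r.squared, 4)      # multiple R-squared
round(local_summary$adj.r.squared, 4)  # adjusted R-squared
coef(local_summary)[, c("Estimate", "Std. Error")]
```

Agreement between the two fits is a quick way to confirm that the data was pushed to the database intact.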

Example 4-16 Using Ridge Estimation for the Coefficients of the ore.odmGLM Model

This example uses the longley_of ore.frame from the previous example. It invokes the ore.odmGLM function with ridge = TRUE to request ridge estimation for the coefficients and ridge.vif = TRUE to report variance inflation factors.

longfit2 <- ore.odmGLM(Employed ~ ., data = longley_of, ridge = TRUE,
                       ridge.vif = TRUE)
summary(longfit2)

Listing for This Example

R> longfit2 <- ore.odmGLM(Employed ~ ., data = longley_of, ridge = TRUE,
+                         ridge.vif = TRUE)
R> summary(longfit2)
 
Call:
ore.odmGLM(formula = Employed ~ ., data = longley_of, ridge = TRUE, 
    ridge.vif = TRUE)
 
Residuals:
    Min      1Q  Median      3Q     Max 
-0.4100 -0.1579 -0.0271  0.1017  0.4575 
 
Coefficients:
               Estimate   VIF
(Intercept)  -3.466e+03 0.000
GNP.deflator  1.479e-02 0.077
GNP          -3.535e-02 0.012
Unemployed   -2.013e-02 0.000
Armed.Forces -1.031e-02 0.000
Population   -5.262e-02 0.548
Year          1.821e+00 2.212
 
Residual standard error: 0.3049 on 9 degrees of freedom
Multiple R-squared:  0.9955,    Adjusted R-squared:  0.9925 
F-statistic: 330.2 on 6 and 9 DF,  p-value: 4.986e-10
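The effect of ridge estimation can be illustrated locally with the closed-form ridge solution on standardized predictors. This is a sketch under stated assumptions: the function ridge_coef and the lambda values are chosen for illustration, and the parameterization and the ridge value that the database algorithm selects may differ.

```r
# Standardize predictors and center the response so no intercept is needed
X <- scale(as.matrix(longley[, setdiff(names(longley), "Employed")]))
y <- longley$Employed - mean(longley$Employed)

# Closed-form ridge estimate: solve (X'X + lambda * I) b = X'y
ridge_coef <- function(X, y, lambda) {
  drop(solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y)))
}

lambdas <- c(0, 0.01, 0.1, 1)   # illustrative penalties; 0 is plain least squares
coefs   <- sapply(lambdas, function(l) ridge_coef(X, y, l))
colnames(coefs) <- paste0("lambda=", lambdas)

# The overall size of the coefficient vector shrinks as the penalty grows
coef_norms <- sqrt(colSums(coefs^2))
round(coefs, 3)
coef_norms
```

The longley predictors are highly collinear, so even a small penalty changes the least squares coefficients noticeably.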

Example 4-17 Building a Logistic Regression GLM

This example builds a logistic regression (classification) model. It uses the infert data set. The example invokes the ore.odmGLM function and specifies logistic as the type argument, which builds a binomial GLM.

infert_of <- ore.push(infert)
infit1 <- ore.odmGLM(case ~ age+parity+education+spontaneous+induced,
                     data = infert_of, type = "logistic")
infit1

Listing for This Example

R> infert_of <- ore.push(infert)
R> infit1 <- ore.odmGLM(case ~ age+parity+education+spontaneous+induced,
+                       data = infert_of, type = "logistic")
R> infit1
 
Response:
case == "1"
 
Call:  ore.odmGLM(formula = case ~ age + parity + education + spontaneous + 
    induced, data = infert_of, type = "logistic")
 
Coefficients:
     (Intercept)               age            parity   education0-5yrs  education12+ yrs       spontaneous           induced  
        -2.19348           0.03958          -0.82828           1.04424          -0.35896           2.04590           1.28876  
 
Degrees of Freedom: 247 Total (i.e. Null);  241 Residual
Null Deviance:      316.2 
Residual Deviance: 257.8        AIC: 271.8
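The listing above reports coefficients for education0-5yrs and education12+ yrs, which suggests that 6-11yrs serves as the baseline education level. Under that assumption, a local base R fit with the factor releveled should give comparable estimates and deviances. This sketch runs in a local R session and does not use the database.

```r
# Copy the data and make "6-11yrs" the baseline education level (assumption
# inferred from the coefficient names in the listing)
infert_local <- infert
infert_local$education <- relevel(infert_local$education, ref = "6-11yrs")

local_logit <- glm(case ~ age + parity + education + spontaneous + induced,
                   data = infert_local, family = binomial(link = "logit"))

# Compare with the coefficients and deviances in the listing
round(coef(local_logit), 5)
fit_stats <- c(null     = local_logit$null.deviance,
               residual = deviance(local_logit),
               aic      = AIC(local_logit))
round(fit_stats, 1)
```

Changing the baseline level alters the intercept and the education coefficients but leaves the deviances and AIC unchanged.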

Example 4-18 Specifying a Reference Value in Building a Logistic Regression GLM

This example builds a logistic regression (classification) model and specifies a reference value. The example uses the infert_of ore.frame from Example 4-17.

infit2 <- ore.odmGLM(case ~ age+parity+education+spontaneous+induced,
                     data = infert_of, type = "logistic", reference = 1)
infit2

Listing for This Example

R> infit2 <- ore.odmGLM(case ~ age+parity+education+spontaneous+induced,
+                       data = infert_of, type = "logistic", reference = 1)
R> infit2

Response:
case == "0"
 
Call:  ore.odmGLM(formula = case ~ age + parity + education + spontaneous + 
    induced, data = infert_of, type = "logistic", reference = 1)
 
Coefficients:
     (Intercept)               age            parity   education0-5yrs  education12+ yrs       spontaneous           induced  
         2.19348          -0.03958           0.82828          -1.04424           0.35896          -2.04590          -1.28876  
 
Degrees of Freedom: 247 Total (i.e. Null);  241 Residual
Null Deviance:      316.2 
Residual Deviance: 257.8        AIC: 271.8
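Comparing this listing with Example 4-17 shows that changing the reference value flips the sign of every coefficient while the deviances stay the same. The same behavior can be reproduced locally in base R by modeling the complementary outcome; the object names below are chosen for this sketch.

```r
# Model the probability of case == 1
fit_case1 <- glm(case ~ age + parity + education + spontaneous + induced,
                 data = infert, family = binomial)

# Model the probability of case == 0 (the complementary outcome)
fit_case0 <- glm(I(1 - case) ~ age + parity + education + spontaneous + induced,
                 data = infert, family = binomial)

# Coefficients are negated; the fit quality is identical
sign_check <- max(abs(coef(fit_case0) + coef(fit_case1)))
dev_check  <- abs(deviance(fit_case0) - deviance(fit_case1))
cbind(case1 = coef(fit_case1), case0 = coef(fit_case0))
```

The models are equivalent because the log odds of one class are the negative of the log odds of the other.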