4.2.8 Generalized Linear Models
The ore.odmGLM function builds a Generalized Linear Model (GLM), which includes and extends the class of linear models (linear regression).
Generalized linear models relax the restrictions on linear models, which are often violated in practice. For example, binary (yes/no or 0/1) responses do not have the same variance across classes.
The OML4SQL GLM is a parametric modeling technique. Parametric models make assumptions about the distribution of the data. When the assumptions are met, parametric models can be more efficient than non-parametric models.
The challenge in developing models of this type involves assessing the extent to which the assumptions are met. For this reason, quality diagnostics are key to developing quality parametric models.
In addition to the classical weighted least squares estimation for linear regression and iteratively re-weighted least squares estimation for logistic regression, both solved through Cholesky decomposition and matrix inversion, OML4SQL GLM provides a conjugate gradient-based optimization algorithm that does not require matrix inversion and is very well suited to high-dimensional data. The choice of algorithm is handled internally and is transparent to the user.
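The iteratively re-weighted least squares (IRLS) idea mentioned above can be sketched in base R. This is a textbook illustration only, not the in-database OML4SQL implementation: the function name irls_logistic, the starting values, and the stopping rule are assumptions made for the example. Each iteration solves the weighted normal equations through a Cholesky factorization rather than an explicit matrix inverse, and the result is compared against base R's glm on the infert data set.

```r
# Textbook IRLS for binary logistic regression (illustrative sketch only;
# irls_logistic is a hypothetical helper, not part of OML4R or OML4SQL).
irls_logistic <- function(X, y, max_iter = 25, tol = 1e-10) {
  beta <- rep(0, ncol(X))                 # assumed starting point: all zeros
  for (iter in seq_len(max_iter)) {
    eta <- drop(X %*% beta)               # linear predictor
    mu  <- 1 / (1 + exp(-eta))            # inverse logit link
    w   <- mu * (1 - mu)                  # binomial variance function = weights
    z   <- eta + (y - mu) / w             # working response
    XtWX <- crossprod(X, w * X)           # t(X) %*% W %*% X
    XtWz <- crossprod(X, w * z)           # t(X) %*% W %*% z
    R <- chol(XtWX)                       # XtWX = t(R) %*% R, R upper triangular
    beta_new <- drop(backsolve(R, forwardsolve(t(R), XtWz)))
    converged <- max(abs(beta_new - beta)) < tol
    beta <- beta_new
    if (converged) break
  }
  beta
}

# Compare with base R's glm() on a local copy of the data.
X <- model.matrix(case ~ age + parity + spontaneous + induced, data = infert)
beta_irls <- irls_logistic(X, infert$case)
beta_glm  <- coef(glm(case ~ age + parity + spontaneous + induced,
                      data = infert, family = binomial))
max_abs_diff <- max(abs(beta_irls - unname(beta_glm)))
print(max_abs_diff)
```

For high-dimensional data, forming and factoring the cross-product matrix becomes the expensive step, which is the situation in which a gradient-based solver that avoids it is attractive.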
GLM can be used to build classification or regression models as follows:
- Classification: Binary logistic regression is the GLM classification algorithm. The algorithm uses the logit link function and the binomial variance function.
- Regression: Linear regression is the GLM regression algorithm. The algorithm assumes no target transformation and constant variance over the range of target values.
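In standard GLM notation, the two cases can be written side by side. The symbols are assumptions introduced here for illustration: \mu is the conditional mean of the target, \eta = \mathbf{x}^{\top}\boldsymbol{\beta} is the linear predictor, and V(\mu) is the variance function, so that the target variance is proportional to V(\mu).

```latex
\begin{aligned}
\text{Logistic regression:} \quad & \eta = \log\frac{\mu}{1-\mu}, \qquad V(\mu) = \mu(1-\mu) \\
\text{Linear regression:} \quad & \eta = \mu, \qquad V(\mu) = 1
\end{aligned}
```

The first line shows why a binary target cannot have constant variance: the variance changes with the mean. The second line is the "no target transformation, constant variance" case.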
The ore.odmGLM function allows you to build two different types of models. Some arguments apply to classification models only and some to regression models only.
For information on the ore.odmGLM function arguments, invoke help(ore.odmGLM).
The following examples build several models using GLM. The input ore.frame objects are R data sets pushed to the database.
Settings for Generalized Linear Models
The following table lists settings that apply to Generalized Linear Models.
Table 4-8 Generalized Linear Model Settings
| Setting Name | Setting Value | Description |
|---|---|---|
| | | The confidence level for coefficient confidence intervals. The default confidence level is |
| | | Whether feature generation is quadratic or cubic. When feature generation is enabled, the algorithm automatically chooses the most appropriate feature generation method based on the data. |
| | | Whether or not feature generation is enabled for GLM. By default, feature generation is not enabled. Note: Feature generation can only be enabled when feature selection is also enabled. |
| | | Feature selection penalty criterion for adding a feature to the model. When feature selection is enabled, the algorithm automatically chooses the penalty criterion based on the data. |
| | | Whether or not feature selection is enabled for GLM. By default, feature selection is not enabled. |
| | | When feature selection is enabled, this setting specifies the maximum number of features that can be selected for the final model. By default, the algorithm limits the number of features to ensure sufficient memory. |
| GLMS_PRUNE_MODEL | | Enable or disable pruning of features in the final model. Pruning is based on T-Test statistics for linear regression, or Wald Test statistics for logistic regression. Features are pruned in a loop until all features are statistically significant with respect to the full data. When feature selection is enabled, the algorithm automatically performs pruning based on the data. |
| | target_value | The target value used as the reference class in a binary logistic regression model. Probabilities are produced for the non-reference class. By default, the algorithm chooses the value with the highest prevalence (the most cases) for the reference class. |
| GLMS_RIDGE_REGRESSION | | Enable or disable Ridge Regression. Ridge applies to both regression and classification mining functions. When ridge is enabled, prediction bounds are not produced by the Note: Ridge may only be enabled when feature selection is not specified, or has been explicitly disabled. If Ridge Regression and feature selection are both explicitly enabled, then an exception is raised. |
| | | The value of the ridge parameter. This setting is only used when the algorithm is configured to use Ridge Regression. If Ridge Regression is enabled internally by the algorithm, then the ridge parameter is determined by the algorithm. |
| | | Enable or disable row diagnostics. |
| | The range is ( | Convergence tolerance setting of the GLM algorithm. The default value is system-determined. |
| | Positive integer | Maximum number of iterations for the GLM algorithm. The default value is system-determined. |
| | 0 or positive integer | Number of rows in a batch used by the SGD solver. The value of this parameter sets the size of the batch for the SGD solver. An input of 0 triggers a data-driven batch size estimate. The default is |
| | | This setting allows the user to choose the GLM solver. The solver cannot be selected if |
| | | This setting allows the user to use a sparse solver if it is available. The default value is |
Example 4-15 Building a Linear Regression Model
This example builds a linear regression model using the longley data set.
longley_of <- ore.push(longley)
longfit1 <- ore.odmGLM(Employed ~ ., data = longley_of)
summary(longfit1)
Listing for This Example
R> longley_of <- ore.push(longley)
R> longfit1 <- ore.odmGLM(Employed ~ ., data = longley_of)
R> summary(longfit1)
Call:
ore.odmGLM(formula = Employed ~ ., data = longley_of)
Residuals:
Min 1Q Median 3Q Max
-0.41011 -0.15767 -0.02816 0.10155 0.45539
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.482e+03 8.904e+02 -3.911 0.003560 **
GNP.deflator 1.506e-02 8.492e-02 0.177 0.863141
GNP -3.582e-02 3.349e-02 -1.070 0.312681
Unemployed -2.020e-02 4.884e-03 -4.136 0.002535 **
Armed.Forces -1.033e-02 2.143e-03 -4.822 0.000944 ***
Population -5.110e-02 2.261e-01 -0.226 0.826212
Year 1.829e+00 4.555e-01 4.016 0.003037 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3049 on 9 degrees of freedom
Multiple R-squared: 0.9955, Adjusted R-squared: 0.9925
F-statistic: 330.3 on 6 and 9 DF, p-value: 4.984e-10
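Because longley ships with base R, the in-database result can be sanity-checked against a local ordinary least squares fit. The sketch below uses only base R's lm and assumes that the in-database estimates are expected to agree with ordinary least squares, which is what the listing above suggests. The names local_fit and local_summary are introduced only for this example.

```r
# Local cross-check with base R; no database connection is involved.
local_fit <- lm(Employed ~ ., data = longley)
local_summary <- summary(local_fit)

# Coefficient table (estimate, standard error, t value, p-value).
print(coef(local_summary))

# Goodness-of-fit figures to compare with the in-database summary.
print(local_summary$r.squared)
print(local_summary$sigma)   # residual standard error
```

A large discrepancy between the two summaries would usually point to a difference in the data that was pushed, not in the estimation itself.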
Example 4-16 Using Ridge Estimation for the Coefficients of the ore.odmGLM Model
This example uses the longley_of ore.frame from the previous example. This example invokes the ore.odmGLM function and specifies using ridge estimation for the coefficients.
longfit2 <- ore.odmGLM(Employed ~ ., data = longley_of, ridge = TRUE, ridge.vif = TRUE)
summary(longfit2)
Listing for This Example
R> longfit2 <- ore.odmGLM(Employed ~ ., data = longley_of, ridge = TRUE,
+ ridge.vif = TRUE)
R> summary(longfit2)
Call:
ore.odmGLM(formula = Employed ~ ., data = longley_of, ridge = TRUE,
ridge.vif = TRUE)
Residuals:
Min 1Q Median 3Q Max
-0.4100 -0.1579 -0.0271 0.1017 0.4575
Coefficients:
Estimate VIF
(Intercept) -3.466e+03 0.000
GNP.deflator 1.479e-02 0.077
GNP -3.535e-02 0.012
Unemployed -2.013e-02 0.000
Armed.Forces -1.031e-02 0.000
Population -5.262e-02 0.548
Year 1.821e+00 2.212
Residual standard error: 0.3049 on 9 degrees of freedom
Multiple R-squared: 0.9955, Adjusted R-squared: 0.9925
F-statistic: 330.2 on 6 and 9 DF, p-value: 4.986e-10
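For orientation, the textbook form of the ridge estimator is given below. The notation is an assumption introduced here (X is the design matrix, \mathbf{y} the target vector, \lambda > 0 the ridge parameter). How the algorithm chooses \lambda when none is supplied, and whether the intercept is penalized, are not stated in this section.

```latex
\hat{\boldsymbol{\beta}}_{\text{ridge}}
  = \arg\min_{\boldsymbol{\beta}} \; \lVert \mathbf{y} - X\boldsymbol{\beta} \rVert_2^2
    + \lambda \lVert \boldsymbol{\beta} \rVert_2^2
  = \left( X^{\top}X + \lambda I \right)^{-1} X^{\top}\mathbf{y},
  \qquad \lambda > 0
```

The penalty term shrinks the coefficients toward zero, which stabilizes the estimates when predictors are strongly correlated, as they are in the longley data.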
Example 4-17 Building a Logistic Regression GLM
This example builds a logistic regression (classification) model. It uses the infert data set. The example invokes the ore.odmGLM function and specifies logistic as the type argument, which builds a binomial GLM.
infert_of <- ore.push(infert)
infit1 <- ore.odmGLM(case ~ age+parity+education+spontaneous+induced, data = infert_of, type = "logistic")
infit1
Listing for This Example
R> infert_of <- ore.push(infert)
R> infit1 <- ore.odmGLM(case ~ age+parity+education+spontaneous+induced,
+ data = infert_of, type = "logistic")
R> infit1
Response:
case == "1"
Call: ore.odmGLM(formula = case ~ age + parity + education + spontaneous +
induced, data = infert_of, type = "logistic")
Coefficients:
(Intercept) age parity education0-5yrs education12+ yrs spontaneous induced
-2.19348 0.03958 -0.82828 1.04424 -0.35896 2.04590 1.28876
Degrees of Freedom: 247 Total (i.e. Null); 241 Residual
Null Deviance: 316.2
Residual Deviance: 257.8 AIC: 271.8
Example 4-18 Specifying a Reference Value in Building a Logistic Regression GLM
This example builds a logistic regression (classification) model and specifies a reference value. The example uses the infert_of ore.frame from Example 4-17.
infit2 <- ore.odmGLM(case ~ age+parity+education+spontaneous+induced, data = infert_of, type = "logistic", reference = 1)
infit2
Listing for This Example
R> infit2 <- ore.odmGLM(case ~ age+parity+education+spontaneous+induced,
+ data = infert_of, type = "logistic", reference = 1)
R> infit2
Response:
case == "0"
Call: ore.odmGLM(formula = case ~ age + parity + education + spontaneous +
induced, data = infert_of, type = "logistic", reference = 1)
Coefficients:
(Intercept) age parity education0-5yrs education12+ yrs spontaneous induced
2.19348 -0.03958 0.82828 -1.04424 0.35896 -2.04590 -1.28876
Degrees of Freedom: 247 Total (i.e. Null); 241 Residual
Null Deviance: 316.2
Residual Deviance: 257.8 AIC: 271.8
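The two listings differ only in the sign of every coefficient, while the deviances and AIC are unchanged. That is the expected effect of swapping the reference class in a binary logistic model. A minimal local sketch with base R's glm reproduces the behavior by modeling 1 - case instead of case. The object names are assumptions for this example, and base R uses a different dummy coding for education, so the intercept and education estimates are not expected to match the in-database listing.

```r
# Fit the same binomial GLM twice, once per choice of "event" class.
fit_case1 <- glm(case ~ age + parity + education + spontaneous + induced,
                 data = infert, family = binomial)
fit_case0 <- glm(I(1 - case) ~ age + parity + education + spontaneous + induced,
                 data = infert, family = binomial)

# Swapping the modeled class negates every coefficient ...
sign_flip_ok <- isTRUE(all.equal(unname(coef(fit_case0)),
                                 -unname(coef(fit_case1)),
                                 tolerance = 1e-6))

# ... and leaves the fit quality unchanged.
same_deviance <- isTRUE(all.equal(deviance(fit_case0), deviance(fit_case1)))

print(c(sign_flip_ok = sign_flip_ok, same_deviance = same_deviance))
```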