9.12 Generalized Linear Model
The oml.glm class builds a Generalized Linear Model (GLM) model.
GLM models include and extend the class of linear models. They relax the restrictions on linear models, which are often violated in practice. For example, binary (yes/no or 0/1) responses do not have the same variance across classes.
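The following plain-Python illustration (not part of the oml API) makes that concrete: for a binary response with success probability p, the variance is p(1 - p), so the variance changes with the mean rather than remaining constant as ordinary linear models assume.

# Binomial variance p * (1 - p) depends on p, so binary responses
# violate the constant-variance assumption of linear models.
for p in (0.1, 0.5, 0.9):
    print(f"p = {p}: variance = {p * (1 - p):.2f}")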
GLM is a parametric modeling technique. Parametric models make assumptions about the distribution of the data. When the assumptions are met, parametric models can be more efficient than non-parametric models.
The challenge in developing models of this type involves assessing the extent to which the assumptions are met. For this reason, quality diagnostics are key to developing quality parametric models.
In addition to the classical weighted least squares estimation for linear regression and iteratively re-weighted least squares estimation for logistic regression, both solved through Cholesky decomposition and matrix inversion, Oracle Machine Learning GLM provides a conjugate gradient-based optimization algorithm that does not require matrix inversion and is very well suited to high-dimensional data. The choice of algorithm is handled internally and is transparent to the user.
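For readers who want the classical scheme spelled out, the following is a minimal NumPy sketch of iteratively re-weighted least squares for logistic regression, with each iteration solved through a Cholesky decomposition. It illustrates the textbook algorithm only, not the Oracle Machine Learning implementation; the function name irls_logistic is invented for this example.

import numpy as np

def irls_logistic(X, y, n_iter=25, tol=1e-8):
    """Textbook IRLS for logistic regression (illustration only)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))   # inverse logit link
        W = p * (1.0 - p)                     # binomial variance weights
        z = X @ beta + (y - p) / W            # working response
        A = X.T @ (W[:, None] * X)            # weighted normal equations
        L = np.linalg.cholesky(A)             # Cholesky factor of A
        u = np.linalg.solve(L, X.T @ (W * z))
        new_beta = np.linalg.solve(L.T, u)
        if np.max(np.abs(new_beta - beta)) < tol:
            return new_beta
        beta = new_beta
    return beta

# Toy usage: recover an intercept of 0.5 and a slope of 2.0.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-(0.5 + 2.0 * x)))).astype(float)
X = np.column_stack([np.ones_like(x), x])
print(irls_logistic(X, y))  # approximately [0.5, 2.0]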
GLM can be used to build classification or regression models, as follows (see the sketch after this list):

- Classification: Binary logistic regression is the GLM classification algorithm. The algorithm uses the logit link function and the binomial variance function.

- Regression: Linear regression is the GLM regression algorithm. The algorithm assumes no target transformation and constant variance over the range of target values.
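A minimal sketch of creating each model type, assuming a connected OML4Py session (that is, oml.connect has already been called):

import oml

# Binary logistic regression model object for classification.
clf_mod = oml.glm("classification")

# Linear regression model object for regression.
reg_mod = oml.glm("regression")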
The oml.glm class allows you to build two different types of models. Some arguments apply to classification models only and some to regression models only.

For information on the oml.glm class attributes and methods, invoke help(oml.glm) or see Oracle Machine Learning for Python API Reference.
Settings for a Generalized Linear Model
The following table lists the settings that apply to GLM models.
Table 9-10 Generalized Linear Model Settings
Setting Name | Setting Value | Description |
---|---|---|
CLAS_COST_TABLE_NAME | table_name | The name of a table that stores a cost matrix for the algorithm to use in scoring the model. The cost matrix specifies the costs associated with misclassifications. The cost matrix table is user-created and must contain the columns ACTUAL_TARGET_VALUE (valid target data type), PREDICTED_TARGET_VALUE (valid target data type), and COST (NUMBER). |
CLAS_WEIGHTS_BALANCED | ON, OFF | Indicates whether the algorithm must create a model that balances the target distribution. This setting is most relevant in the presence of rare targets, as balancing the distribution may enable better average accuracy (average of per-class accuracy) instead of overall accuracy (which favors the dominant class). The default value is OFF. |
CLAS_WEIGHTS_TABLE_NAME | table_name | The name of a table that stores weighting information for individual target values in GLM logistic regression models. The weights are used by the algorithm to bias the model in favor of higher weighted classes. The class weights table is user-created and must contain the columns TARGET_VALUE (valid target data type) and CLASS_WEIGHT (NUMBER). |
GLMS_BATCH_ROWS | 0 or a positive integer | Number of rows in a batch used by the SGD solver. The value of this parameter sets the size of the batch for the SGD solver. An input of 0 triggers a data-driven batch size estimate. The default value is 2000. |
GLMS_CONF_LEVEL | TO_CHAR(0 < numeric_expr < 1) | The confidence level for coefficient confidence intervals. The default confidence level is 0.95. |
GLMS_CONV_TOLERANCE | TO_CHAR(0 < numeric_expr < 1) | Convergence tolerance setting of the GLM algorithm. The range is (0, 1) non-inclusive. The default value is system-determined. |
GLMS_FTR_GEN_METHOD | GLMS_FTR_GEN_QUADRATIC, GLMS_FTR_GEN_CUBIC | Whether feature generation is quadratic or cubic. When you enable feature generation, the algorithm automatically chooses the most appropriate feature generation method based on the data. |
GLMS_FTR_GENERATION | GLMS_FTR_GENERATION_ENABLE, GLMS_FTR_GENERATION_DISABLE | Whether or not feature generation is enabled for GLM. By default, feature generation is not enabled. Note: Feature generation can only be enabled when feature selection is also enabled. |
GLMS_FTR_SEL_CRIT | GLMS_FTR_SEL_AIC, GLMS_FTR_SEL_ALPHA_INV, GLMS_FTR_SEL_RIC, GLMS_FTR_SEL_SBIC | Feature selection penalty criterion for adding a feature to the model. When feature selection is enabled, the algorithm automatically chooses the penalty criterion based on the data. |
GLMS_FTR_SELECTION | GLMS_FTR_SELECTION_ENABLE, GLMS_FTR_SELECTION_DISABLE | Enable or disable feature selection for GLM. By default, feature selection is not enabled. |
GLMS_MAX_FEATURES | TO_CHAR(0 < numeric_expr <= 2000) | When feature selection is enabled, this setting specifies the maximum number of features that can be selected for the final model. By default, the algorithm limits the number of features to ensure sufficient memory. |
GLMS_NUM_ITERATIONS | A positive integer | Maximum number of iterations for the GLM algorithm. The default value is system-determined. |
GLMS_PRUNE_MODEL | GLMS_PRUNE_MODEL_ENABLE, GLMS_PRUNE_MODEL_DISABLE | Whether to prune features from the final model. When feature selection is enabled, the algorithm automatically performs pruning based on the data. |
GLMS_REFERENCE_CLASS_NAME | target_value | The target value used as the reference class in a binary logistic regression model. Probabilities are produced for the other class. By default, the algorithm chooses the value with the highest prevalence (the most cases) for the reference class. |
GLMS_RIDGE_REGRESSION | GLMS_RIDGE_REG_ENABLE, GLMS_RIDGE_REG_DISABLE | Enable or disable ridge regression. Ridge applies to both regression and classification machine learning functions. When ridge is enabled, prediction bounds are not produced by the PREDICTION_BOUNDS SQL function. |
GLMS_RIDGE_VALUE | TO_CHAR(numeric_expr > 0) | The value of the ridge parameter. Use this setting only when you have configured the algorithm to use ridge regression. If ridge regression is enabled internally by the algorithm, then the ridge parameter is determined by the algorithm. |
GLMS_ROW_DIAGNOSTICS | GLMS_ROW_DIAG_ENABLE, GLMS_ROW_DIAG_DISABLE | Enable or disable row diagnostics. By default, row diagnostics are disabled. |
GLMS_SOLVER | GLMS_SOLVER_CHOL, GLMS_SOLVER_LBFGS_ADMM, GLMS_SOLVER_QR, GLMS_SOLVER_SGD | Specifies the GLM solver. You cannot select the solver if the GLMS_FTR_SELECTION setting is enabled. The default value is system-determined. |
GLMS_SPARSE_SOLVER | GLMS_SPARSE_SOLVER_ENABLE, GLMS_SPARSE_SOLVER_DISABLE | Enable or disable the use of a sparse solver if it is available. The default value is GLMS_SPARSE_SOLVER_DISABLE. |
ODMS_ROW_WEIGHT_COLUMN_NAME | column_name | The name of a column in the training data that contains a weighting factor for the rows. The column datatype must be NUMBER. You can use row weights as a compact representation of repeated rows, as in the design of experiments where a specific configuration is repeated several times. You can also use row weights to emphasize certain rows during model construction, for example, to bias the model toward rows that are more recent and away from potentially obsolete data. |
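Settings from this table are passed as keyword arguments when the model object is created, as Example 9-12 below does for GLMS_SOLVER. The following is a minimal sketch, assuming a connected OML4Py session; the particular combination of settings is illustrative only.

import oml

# Enable feature selection and row diagnostics for a regression GLM,
# using constants from Table 9-10.
settings = {'GLMS_FTR_SELECTION': 'GLMS_FTR_SELECTION_ENABLE',
            'GLMS_ROW_DIAGNOSTICS': 'GLMS_ROW_DIAG_ENABLE'}
glm_mod = oml.glm("regression", **settings)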
Example 9-12 Using the oml.glm Class
This example demonstrates the use of various methods of the oml.glm class. In the listing for this example, some of the output is not shown, as indicated by ellipses.
import oml
import pandas as pd
from sklearn import datasets
# Load the iris data set and create a pandas.DataFrame for it.
iris = datasets.load_iris()
x = pd.DataFrame(iris.data,
columns = ['Sepal_Length','Sepal_Width',
'Petal_Length','Petal_Width'])
y = pd.DataFrame(list(map(lambda x:
{0: 'setosa', 1: 'versicolor',
2:'virginica'}[x], iris.target)),
columns = ['Species'])
try:
oml.drop('IRIS')
except:
pass
# Create the IRIS database table and the proxy object for the table.
oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')
# Create training and test data.
dat = oml.sync(table = 'IRIS').split()
train_x = dat[0].drop('Petal_Width')
train_y = dat[0]['Petal_Width']
test_dat = dat[1]
# Specify settings.
setting = {'GLMS_SOLVER': 'dbms_data_mining.GLMS_SOLVER_QR'}
# Create a GLM model object.
glm_mod = oml.glm("regression", **setting)
# Fit the GLM model according to the training data and parameter
# settings.
glm_mod = glm_mod.fit(train_x, train_y)
# Show the model details.
glm_mod
# Use the model to make predictions on the test data.
glm_mod.predict(test_dat.drop('Petal_Width'),
supplemental_cols = test_dat[:,
['Sepal_Length', 'Sepal_Width',
'Petal_Length', 'Species']])
# Return the prediction probability.
glm_mod.predict(test_dat.drop('Petal_Width'),
supplemental_cols = test_dat[:,
['Sepal_Length', 'Sepal_Width',
'Petal_Length', 'Species']],
proba = True)
glm_mod.score(test_dat.drop('Petal_Width'),
test_dat[:, ['Petal_Width']])
# Change the parameter setting and refit the model.
new_setting = {'GLMS_SOLVER': 'GLMS_SOLVER_SGD'}
glm_mod.set_params(**new_setting).fit(train_x, train_y)
Listing for This Example
>>> import oml
>>> import pandas as pd
>>> from sklearn import datasets
>>>
>>> # Load the iris data set and create a pandas.DataFrame for it.
... iris = datasets.load_iris()
>>> x = pd.DataFrame(iris.data,
... columns = ['Sepal_Length','Sepal_Width',
... 'Petal_Length','Petal_Width'])
>>> y = pd.DataFrame(list(map(lambda x:
... {0: 'setosa', 1: 'versicolor',
... 2:'virginica'}[x], iris.target)),
... columns = ['Species'])
>>>
>>> try:
... oml.drop('IRIS')
... except:
... pass
>>>
>>> # Create the IRIS database table and the proxy object for the table.
... oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')
>>>
>>> # Create training and test data.
... dat = oml.sync(table = 'IRIS').split()
>>> train_x = dat[0].drop('Petal_Width')
>>> train_y = dat[0]['Petal_Width']
>>> test_dat = dat[1]
>>>
>>> # Specify settings.
... setting = {'GLMS_SOLVER': 'dbms_data_mining.GLMS_SOLVER_QR'}
>>>
>>> # Create a GLM model object.
... glm_mod = oml.glm("regression", **setting)
>>>
>>> # Fit the GLM model according to the training data and parameter
... # settings.
>>> glm_mod = glm_mod.fit(train_x, train_y)
>>>
>>> # Show the model details.
... glm_mod
Algorithm Name: Generalized Linear Model
Mining Function: REGRESSION
Target: Petal_Width
Settings:
setting name setting value
0 ALGO_NAME ALGO_GENERALIZED_LINEAR_MODEL
1 GLMS_CONF_LEVEL .95
2 GLMS_FTR_GENERATION GLMS_FTR_GENERATION_DISABLE
3 GLMS_FTR_SELECTION GLMS_FTR_SELECTION_DISABLE
4 GLMS_SOLVER GLMS_SOLVER_QR
5 ODMS_DETAILS ODMS_ENABLE
6 ODMS_MISSING_VALUE_TREATMENT ODMS_MISSING_VALUE_AUTO
7 ODMS_SAMPLING ODMS_SAMPLING_DISABLE
8 PREP_AUTO ON
Computed Settings:
setting name setting value
0 GLMS_CONV_TOLERANCE .0000050000000000000004
1 GLMS_NUM_ITERATIONS 30
2 GLMS_RIDGE_REGRESSION GLMS_RIDGE_REG_ENABLE
Global Statistics:
attribute name attribute value
0 ADJUSTED_R_SQUARE 0.949634
1 AIC -363.888
2 COEFF_VAR 14.6284
3 CONVERGED YES
4 CORRECTED_TOTAL_DF 103
5 CORRECTED_TOT_SS 58.4565
6 DEPENDENT_MEAN 1.15577
7 ERROR_DF 98
8 ERROR_MEAN_SQUARE 0.028585
9 ERROR_SUM_SQUARES 2.80131
10 F_VALUE 389.405
11 GMSEP 0.030347
12 HOCKING_SP 0.000295
13 J_P 0.030234
14 MODEL_DF 5
15 MODEL_F_P_VALUE 0
16 MODEL_MEAN_SQUARE 11.131
17 MODEL_SUM_SQUARES 55.6552
18 NUM_PARAMS 6
19 NUM_ROWS 104
20 RANK_DEFICIENCY 0
21 ROOT_MEAN_SQ 0.16907
22 R_SQ 0.952079
23 SBIC -348.021
24 VALID_COVARIANCE_MATRIX YES
[1 rows x 25 columns]
Attributes:
Petal_Length
Sepal_Length
Sepal_Width
Species
Partition: NO
Coefficients:
name level estimate
0 (Intercept) None -0.600603
1 Petal_Length None 0.239775
2 Sepal_Length None -0.078338
3 Sepal_Width None 0.253996
4 Species versicolor 0.652420
5 Species virginica 1.010438
Fit Details:
name value
0 ADJUSTED_R_SQUARE 9.496338e-01
1 AIC -3.638876e+02
2 COEFF_VAR 1.462838e+01
3 CORRECTED_TOTAL_DF 1.030000e+02
...
21 ROOT_MEAN_SQ 1.690704e-01
22 R_SQ 9.520788e-01
23 SBIC -3.480213e+02
24 VALID_COVARIANCE_MATRIX 1.000000e+00
Rank:
6
Deviance:
2.801309
AIC:
-364
Null Deviance:
58.456538
DF Residual:
98.0
DF Null:
103.0
Converged:
True
>>>
>>> # Use the model to make predictions on the test data.
... glm_mod.predict(test_dat.drop('Petal_Width'),
... supplemental_cols = test_dat[:,
... ['Sepal_Length', 'Sepal_Width',
... 'Petal_Length', 'Species']])
Sepal_Length Sepal_Width Petal_Length Species PREDICTION
0 4.9 3.0 1.4 setosa 0.113215
1 4.9 3.1 1.5 setosa 0.162592
2 4.8 3.4 1.6 setosa 0.270602
3 5.8 4.0 1.2 setosa 0.248752
... ... ... ... ... ...
42 6.7 3.3 5.7 virginica 2.089876
43 6.7 3.0 5.2 virginica 1.893790
44 6.5 3.0 5.2 virginica 1.909457
45 5.9 3.0 5.1 virginica 1.932483
>>> # Return the prediction probability.
... glm_mod.predict(test_dat.drop('Petal_Width'),
... supplemental_cols = test_dat[:,
... ['Sepal_Length', 'Sepal_Width',
... 'Petal_Length', 'Species']],
... proba = True)
Sepal_Length Sepal_Width Species PREDICTION
0 4.9 3.0 setosa 0.113215
1 4.9 3.1 setosa 0.162592
2 4.8 3.4 setosa 0.270602
3 5.8 4.0 setosa 0.248752
... ... ... ... ...
42 6.7 3.3 virginica 2.089876
43 6.7 3.0 virginica 1.893790
44 6.5 3.0 virginica 1.909457
45 5.9 3.0 virginica 1.932483
>>>
>>> glm_mod.score(test_dat.drop('Petal_Width'),
... test_dat[:, ['Petal_Width']])
0.951252
>>>
>>> # Change the parameter setting and refit the model.
... new_setting = {'GLMS_SOLVER': 'GLMS_SOLVER_SGD'}
>>> glm_mod.set_params(**new_setting).fit(train_x, train_y)
Algorithm Name: Generalized Linear Model
Mining Function: REGRESSION
Target: Petal_Width
Settings:
setting name setting value
0 ALGO_NAME ALGO_GENERALIZED_LINEAR_MODEL
1 GLMS_CONF_LEVEL .95
2 GLMS_FTR_GENERATION GLMS_FTR_GENERATION_DISABLE
3 GLMS_FTR_SELECTION GLMS_FTR_SELECTION_DISABLE
4 GLMS_SOLVER GLMS_SOLVER_SGD
5 ODMS_DETAILS ODMS_ENABLE
6 ODMS_MISSING_VALUE_TREATMENT ODMS_MISSING_VALUE_AUTO
7 ODMS_SAMPLING ODMS_SAMPLING_DISABLE
8 PREP_AUTO ON
Computed Settings:
setting name setting value
0 GLMS_BATCH_ROWS 2000
1 GLMS_CONV_TOLERANCE .0001
2 GLMS_NUM_ITERATIONS 500
3 GLMS_RIDGE_REGRESSION GLMS_RIDGE_REG_ENABLE
4 GLMS_RIDGE_VALUE .01
Global Statistics:
attribute name attribute value
0 ADJUSTED_R_SQUARE 0.94175
1 AIC -348.764
2 COEFF_VAR 15.7316
3 CONVERGED NO
4 CORRECTED_TOTAL_DF 103
5 CORRECTED_TOT_SS 58.4565
6 DEPENDENT_MEAN 1.15577
7 ERROR_DF 98
8 ERROR_MEAN_SQUARE 0.033059
9 ERROR_SUM_SQUARES 3.23979
10 F_VALUE 324.347
11 GMSEP 0.035097
12 HOCKING_SP 0.000341
13 J_P 0.034966
14 MODEL_DF 5
15 MODEL_F_P_VALUE 0
16 MODEL_MEAN_SQUARE 10.7226
17 MODEL_SUM_SQUARES 53.613
18 NUM_PARAMS 6
19 NUM_ROWS 104
20 RANK_DEFICIENCY 0
21 ROOT_MEAN_SQ 0.181821
22 R_SQ 0.944578
23 SBIC -332.898
24 VALID_COVARIANCE_MATRIX NO
[1 rows x 25 columns]
Attributes:
Petal_Length
Sepal_Length
Sepal_Width
Species
Partition: NO
Coefficients:
name level estimate
0 (Intercept) None -0.338046
1 Petal_Length None 0.378658
2 Sepal_Length None -0.084440
3 Sepal_Width None 0.137150
4 Species versicolor 0.151916
5 Species virginica 0.337535
Fit Details:
name value
0 ADJUSTED_R_SQUARE 9.417502e-01
1 AIC -3.487639e+02
2 COEFF_VAR 1.573164e+01
3 CORRECTED_TOTAL_DF 1.030000e+02
... ... ...
21 ROOT_MEAN_SQ 1.818215e-01
22 R_SQ 9.445778e-01
23 SBIC -3.328975e+02
24 VALID_COVARIANCE_MATRIX 0.000000e+00
Rank:
6
Deviance:
3.239787
AIC:
-349
Null Deviance:
58.456538
Prior Weights:
1
DF Residual:
98.0
DF Null:
103.0
Converged:
False