9.17 Singular Value Decomposition
Use the oml.svd class to build a model for feature extraction.

The oml.svd class creates a model that uses the Singular Value Decomposition (SVD) algorithm for feature extraction. SVD performs orthogonal linear transformations that capture the underlying variance of the data by decomposing a rectangular matrix into three matrices: U, V, and D. Columns of matrix V contain the right singular vectors and columns of matrix U contain the left singular vectors. Matrix D is a diagonal matrix and its singular values reflect the amount of data variance captured by the bases.
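For intuition only, the following short numpy sketch shows the factorization that the algorithm is based on; it is not part of the oml.svd API and assumes only a small in-memory matrix.

import numpy as np

# A small rectangular matrix standing in for the build data.
X = np.random.rand(6, 4)

# numpy returns the singular values as a vector s; D is diag(s).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
D = np.diag(s)

# Columns of U are the left singular vectors, columns of V (= Vt.T) are
# the right singular vectors, and the singular values in s indicate how
# much data variance each basis captures.
print(np.allclose(X, U @ D @ Vt))    # True: X = U D V**T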
The SVDS_MAX_NUM_FEATURES constant specifies the maximum number of features supported by SVD. The value of the constant is 2500.
For information on the oml.svd class attributes and methods, invoke help(oml.svd) or see Oracle Machine Learning for Python API Reference.
Settings for a Singular Value Decomposition Model
Table 9-15 Singular Value Decomposition Model Settings
Setting Name | Setting Value | Description |
---|---|---|
FEAT_NUM_FEATURES | A positive integer | The number of features to extract. The default value is estimated by the algorithm. If the matrix rank is smaller than this number, fewer features are returned. |
SVDS_OVER_SAMPLING | Range [1, 5000] | Configures the number of columns in the sampling matrix used by the Stochastic SVD solver. The number of columns in this matrix is equal to the requested number of features plus the oversampling setting. |
SVDS_POWER_ITERATIONS | Range [0, 20] | Improves the accuracy of the Stochastic SVD (SSVD) solver. The default value is 2. |
SVDS_RANDOM_SEED | Range [0 - 4294967296] | The random seed value for initializing the sampling matrix used by the Stochastic SVD solver. The default value is 0. |
SVDS_SCORING_MODE | SVDS_SCORING_SVD, SVDS_SCORING_PCA | Whether to use SVD or PCA scoring for the model. When the build data is scored with SVD, the projections are the same as the U matrix. When the build data is scored with PCA, the projections are the product of the U and D matrices. The default value is SVDS_SCORING_SVD. |
SVDS_SOLVER | SVDS_SOLVER_TSSVD, SVDS_SOLVER_TSEIGEN, SVDS_SOLVER_SSVD, SVDS_SOLVER_STEIGEN | Specifies the solver used to compute the SVD of the data. For PCA, the solver setting indicates the type of SVD solver used to compute the PCA for the data. When this setting is not specified, the solver type selection is data driven: if the number of attributes is greater than 3240, then the default wide data solver is used; otherwise, the default narrow data solver is selected. Narrow data solvers: SVDS_SOLVER_TSSVD, SVDS_SOLVER_TSEIGEN. Wide data solvers: SVDS_SOLVER_SSVD, SVDS_SOLVER_STEIGEN. |
SVDS_TOLERANCE | Range [0, 1] | Defines the minimum value, as a share of the first eigenvalue, that the eigenvalue of a feature must have in order not to be pruned. Use this setting to prune features. The default value is data driven. |
SVDS_U_MATRIX_OUTPUT | SVDS_U_MATRIX_ENABLE, SVDS_U_MATRIX_DISABLE | Specifies whether to persist the U matrix produced by SVD. The U matrix in SVD has as many rows as the number of rows in the build data. To avoid creating a large model, the U matrix is persisted only when SVDS_U_MATRIX_OUTPUT is enabled. The default value is SVDS_U_MATRIX_DISABLE. |
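The settings in Table 9-15 are supplied when the model object is created. As a minimal sketch, assuming that they are passed as keyword arguments in the same way that ODMS_DETAILS is passed in Example 9-17, a model that extracts two features and uses PCA scoring might be created as follows; the specific values are illustrative choices, not recommended defaults.

# Sketch only: the setting values below are illustrative assumptions.
svd_pca_mod = oml.svd(FEAT_NUM_FEATURES = 2,
                      SVDS_SCORING_MODE = 'SVDS_SCORING_PCA')

# The model object is then fit to training data exactly as in Example 9-17.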
Example 9-17 Using the oml.svd Class
This example uses some of the methods of the oml.svd class. In the listing for this example, some of the output is not shown, as indicated by ellipses.
import oml
import pandas as pd
from sklearn import datasets
# Load the iris data set and create a pandas.DataFrame for it.
iris = datasets.load_iris()
x = pd.DataFrame(iris.data,
                 columns = ['Sepal_Length','Sepal_Width',
                            'Petal_Length','Petal_Width'])
y = pd.DataFrame(list(map(lambda x:
                           {0: 'setosa', 1: 'versicolor',
                            2:'virginica'}[x], iris.target)),
                 columns = ['Species'])

try:
    oml.drop('IRIS')
except:
    pass
# Create the IRIS database table and the proxy object for the table.
oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')
# Create training and test data.
dat = oml.sync(table = 'IRIS').split()
train_dat = dat[0]
test_dat = dat[1]
# Create an SVD model object.
svd_mod = oml.svd(ODMS_DETAILS = 'ODMS_ENABLE')
# Fit the model according to the training data and parameter
# settings.
svd_mod = svd_mod.fit(train_dat)
# Show the model details.
svd_mod
# Use the model to make predictions on the test data.
svd_mod.predict(test_dat,
                supplemental_cols = test_dat[:,
                                    ['Sepal_Length',
                                     'Sepal_Width',
                                     'Petal_Length',
                                     'Species']])

# Perform dimensionality reduction and return values for the two
# features that have the highest topN values.
svd_mod.transform(test_dat,
                  supplemental_cols = test_dat[:, ['Sepal_Length']],
                  topN = 2).sort_values(by = ['Sepal_Length',
                                              'TOP_1',
                                              'TOP_1_VAL'])
Listing for This Example
>>> import oml
>>> import pandas as pd
>>> from sklearn import datasets
>>>
>>> # Load the iris data set and create a pandas.DataFrame for it.
... iris = datasets.load_iris()
>>> x = pd.DataFrame(iris.data,
... columns = ['Sepal_Length','Sepal_Width',
... 'Petal_Length','Petal_Width'])
>>> y = pd.DataFrame(list(map(lambda x:
... {0: 'setosa', 1: 'versicolor',
... 2:'virginica'}[x], iris.target)),
... columns = ['Species'])
>>>
>>> try:
... oml.drop('IRIS')
... except:
... pass
>>>
>>> # Create the IRIS database table and the proxy object for the table.
... oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')
>>>
>>> # Create training and test data.
... dat = oml.sync(table = 'IRIS').split()
>>> train_dat = dat[0]
>>> test_dat = dat[1]
>>>
>>> # Create an SVD model object.
... svd_mod = oml.svd(ODMS_DETAILS = 'ODMS_ENABLE')
>>>
>>> # Fit the model according to the training data and parameter
... # settings.
>>> svd_mod = svd_mod.fit(train_dat)
>>>
>>> # Show the model details.
... svd_mod
Algorithm Name: Singular Value Decomposition
Mining Function: FEATURE_EXTRACTION
Settings:
setting name setting value
0 ALGO_NAME ALGO_SINGULAR_VALUE_DECOMP
1 ODMS_DETAILS ODMS_ENABLE
2 ODMS_MISSING_VALUE_TREATMENT ODMS_MISSING_VALUE_AUTO
3 ODMS_SAMPLING ODMS_SAMPLING_DISABLE
4 PREP_AUTO ON
5 SVDS_SCORING_MODE SVDS_SCORING_SVD
6 SVDS_U_MATRIX_OUTPUT SVDS_U_MATRIX_DISABLE
Computed Settings:
setting name setting value
0 FEAT_NUM_FEATURES 8
1 SVDS_SOLVER SVDS_SOLVER_TSEIGEN
2 SVDS_TOLERANCE .000000000000024646951146678475
Global Statistics:
attribute name attribute value
0 NUM_COMPONENTS 8
1 NUM_ROWS 111
2 SUGGESTED_CUTOFF 1
Attributes:
Petal_Length
Petal_Width
Sepal_Length
Sepal_Width
Species
Partition: NO
Features:
FEATURE_ID ATTRIBUTE_NAME ATTRIBUTE_VALUE VALUE
0 1 ID None 0.996297
1 1 Petal_Length None 0.046646
2 1 Petal_Width None 0.015917
3 1 Sepal_Length None 0.063312
... ... ... ... ...
60 8 Sepal_Width None -0.030620
61 8 Species setosa 0.431543
62 8 Species versicolor 0.566418
63 8 Species virginica 0.699261
[64 rows x 4 columns]
D:
FEATURE_ID VALUE
0 1 886.737809
1 2 32.736792
2 3 10.043389
3 4 5.270496
4 5 2.708602
5 6 1.652340
6 7 0.938640
7 8 0.452170
V:
'1' '2' '3' '4' '5' '6' '7' \
0 0.001332 0.156581 -0.317375 0.113462 -0.154414 -0.113058 0.799390
1 0.003692 0.052289 0.316295 0.733040 0.190746 0.022285 -0.046406
2 0.005267 -0.051498 -0.052111 0.527881 -0.066995 0.046461 -0.469396
3 0.015917 0.008741 0.263614 0.244811 0.460445 0.767503 0.262966
4 0.030208 0.550384 -0.358277 0.041807 0.689962 -0.261815 -0.143258
5 0.046646 0.189325 0.766663 0.326363 0.079611 -0.479070 0.177661
6 0.063312 0.790864 0.097964 -0.051230 -0.490804 0.312159 -0.131337
7 0.996297 -0.076079 -0.035940 -0.017429 -0.000960 -0.001908 0.001755
'8'
0 0.431543
1 0.566418
2 0.699261
3 0.005000
4 -0.030620
5 -0.016932
6 -0.052185
7 -0.001415
>>> # Use the model to make predictions on the test data.
>>> svd_mod.predict(test_dat,
...                 supplemental_cols = test_dat[:,
... ['Sepal_Length',
... 'Sepal_Width',
... 'Petal_Length',
... 'Species']])
Sepal_Length Sepal_Width Petal_Length Species FEATURE_ID
0 5.0 3.6 1.4 setosa 2
1 5.0 3.4 1.5 setosa 2
2 4.4 2.9 1.4 setosa 8
3 4.9 3.1 1.5 setosa 2
... ... ... ... ... ...
35 6.9 3.1 5.4 virginica 1
36 5.8 2.7 5.1 virginica 1
37 6.2 3.4 5.4 virginica 5
38 5.9 3.0 5.1 virginica 1
>>> # Perform dimensionality reduction and return values for the two
... # features that have the highest topN values.
>>> svd_mod.transform(test_dat,
... supplemental_cols = test_dat[:, ['Sepal_Length']],
... topN = 2).sort_values(by = ['Sepal_Length',
... 'TOP_1',
... 'TOP_1_VAL'])
Sepal_Length TOP_1 TOP_1_VAL TOP_2 TOP_2_VAL
0 4.4 7 0.153125 3 -0.130778
1 4.4 8 0.171819 2 0.147070
2 4.8 2 0.159324 6 -0.085194
3 4.8 7 0.157187 3 -0.141668
... ... ... ... ... ...
35 7.2 6 -0.167688 1 0.142545
36 7.2 7 -0.176290 6 -0.175527
37 7.6 4 0.205779 3 0.141533
38 7.9 8 -0.253194 7 -0.166967
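The D values in the listing show that the first singular value dominates the rest. As an illustration outside of the OML4Py API, and assuming the standard relationship that captured variance is proportional to the squared singular value, the share of variance captured by each basis can be computed from the listed D values with numpy.

import numpy as np

# Singular values from the D output in the listing above.
d = np.array([886.737809, 32.736792, 10.043389, 5.270496,
              2.708602, 1.652340, 0.938640, 0.452170])

# Variance captured by each basis, as a share of the total.
share = d**2 / (d**2).sum()
print(share.round(6))
# The first basis captures nearly all of the variance, which is consistent
# with the SUGGESTED_CUTOFF of 1 reported in the model's global statistics.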