9.10 Expectation Maximization
The oml.em class uses the Expectation Maximization (EM) algorithm to create a clustering model.
EM is a density estimation algorithm that performs probabilistic clustering. In density estimation, the goal is to construct a density function that captures how a given population is distributed. The density estimate is based on observed data that represents a sample of the population.
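To make the idea of probabilistic clustering concrete, the following is a minimal sketch of textbook EM for a one-dimensional Gaussian mixture, written in plain Python. It is not the in-database implementation behind oml.em; the function names (gaussian_pdf, fit_em), the quantile-based initialization, and the synthetic sample are all assumptions made for illustration.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_em(data, n_components=2, n_iterations=100):
    """Fit a 1-D Gaussian mixture with plain EM (illustrative helper).

    Returns (weights, means, variances), one entry per component.
    """
    n = len(data)
    ordered = sorted(data)
    # Deterministic start: spread the initial means over evenly spaced quantiles.
    means = [ordered[int((k + 0.5) * n / n_components)] for k in range(n_components)]
    overall_mean = sum(data) / n
    overall_var = sum((x - overall_mean) ** 2 for x in data) / n
    variances = [max(overall_var, 1e-6)] * n_components
    weights = [1.0 / n_components] * n_components

    for _ in range(n_iterations):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            dens = [w * gaussian_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens) or 1e-300  # guard against underflow to zero
            resp.append([d / total for d in dens])

        # M-step: re-estimate weights, means, and variances from the responsibilities.
        for k in range(n_components):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue  # leave an effectively empty component unchanged
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var_k = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var_k, 1e-6)  # keep variances strictly positive
    return weights, means, variances


# Synthetic sample: two well-separated groups (assumed data, not from the guide).
rng = random.Random(42)
sample = ([rng.gauss(0.0, 1.0) for _ in range(100)]
          + [rng.gauss(10.0, 1.0) for _ in range(100)])
weights, means, variances = fit_em(sample)
print(weights, means, variances)
```

The fitted weights, means, and variances together define the density estimate; the per-point responsibilities computed in the E-step are the soft cluster memberships that make the clustering probabilistic.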
For information on the oml.em class methods, invoke help(oml.em) or see Oracle Machine Learning for Python API Reference.
Settings for an Expectation Maximization Model
The following table lists settings for data preparation and analysis for EM models.
Table 9-5 Expectation Maximization Settings for Data Preparation and Analysis
Setting Name | Setting Value | Description |
---|---|---|
EMCS_ATTRIBUTE_FILTER | | Whether to include uncorrelated attributes in the model. Note: This setting applies only to attributes that are not nested. The default value is system-determined. |
EMCS_MAX_NUM_ATTR_2D | | Maximum number of correlated attributes to include in the model. Note: This setting applies only to attributes that are not nested (2D). The default value is 50. |
EMCS_NUM_DISTRIBUTION | | The distribution for modeling numeric attributes. The setting applies to the input table or view as a whole and does not allow per-attribute specifications. The options are Bernoulli, Gaussian, or a system-determined distribution. When Bernoulli or Gaussian is chosen, all numeric attributes are modeled using that same type of distribution. When the distribution is system-determined, individual attributes may use different distributions (either Bernoulli or Gaussian), depending on the data. The default value is EMCS_NUM_DISTR_SYSTEM. |
EMCS_NUM_EQUIWIDTH_BINS | | Number of equi-width bins used for gathering cluster statistics for numeric columns. The default value is 11. |
EMCS_NUM_PROJECTIONS | | Specifies the number of projections to use for each nested column. If a column has fewer distinct attributes than the specified number of projections, then the data is not projected. The setting applies to all nested columns. The default value is 50. |
EMCS_NUM_QUANTILE_BINS | | Specifies the number of quantile bins to use for modeling numeric columns with multivalued Bernoulli distributions. The default value is system-determined. |
EMCS_NUM_TOPN_BINS | | Specifies the number of top-N bins to use for modeling categorical columns with multivalued Bernoulli distributions. The default value is system-determined. |
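As an illustration of what equi-width binning means for a numeric column, here is a minimal sketch in plain Python. The helper name equiwidth_bins and the sample values are hypothetical, and the database may treat outliers and missing values differently; the sketch only shows the basic idea of splitting the observed range into bins of equal width.

```python
def equiwidth_bins(values, n_bins):
    """Return (edges, counts) for n_bins equal-width bins spanning min..max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    edges = [lo + i * width for i in range(n_bins + 1)]
    counts = [0] * n_bins
    for v in values:
        if width > 0:
            # The maximum value is placed in the last bin rather than past it.
            idx = min(int((v - lo) / width), n_bins - 1)
        else:
            idx = 0  # all values are identical
        counts[idx] += 1
    return edges, counts


# Hypothetical petal-length-like values, binned into 10 equal-width bins.
petal_lengths = [1.0, 1.4, 1.5, 4.7, 5.1, 6.9]
edges, counts = equiwidth_bins(petal_lengths, n_bins=10)
print(edges)
print(counts)
```

Quantile bins differ in that the bin edges are chosen so that each bin holds roughly the same number of rows, while equi-width bins keep the width constant and let the counts vary.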
The following table lists settings for learning for EM models.
Table 9-6 Expectation Maximization Settings for Learning
Setting Name | Setting Value | Description |
---|---|---|
EMCS_CONVERGENCE_CRITERION | | The convergence criterion for EM. The criterion may be based on a held-aside data set or on the Bayesian Information Criterion (BIC). The default value is system-determined. |
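EMCS_LOGLIKE_IMPROVEMENT | | When the convergence criterion is based on a held-aside data set, this setting specifies the percentage improvement in the value of the log likelihood function that is required for adding a new component to the model. The default value is 0.001. |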
EMCS_MODEL_SEARCH | | Enables model search in EM, where different model sizes are explored and the best size is selected. The default value is EMCS_MODEL_SEARCH_DISABLE. |
EMCS_NUM_COMPONENTS | | Maximum number of components in the model. If model search is enabled, the algorithm automatically determines the number of components based on improvements in the likelihood function or based on regularization, up to the specified maximum. The number of components must be greater than or equal to the number of clusters. The default value is 20. |
EMCS_NUM_ITERATIONS | | Specifies the maximum number of iterations in the EM algorithm. The default value is 100. |
EMCS_RANDOM_SEED | Non-negative integer | Controls the seed of the random generator used in EM. The default value is 0. |
EMCS_REMOVE_COMPONENTS | | Allows the EM algorithm to remove a small component from the solution. The default value is EMCS_REMOVE_COMPS_ENABLE. |
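The role of the Bayesian Information Criterion in choosing a model size can be sketched with the textbook formula BIC = k ln(n) - 2 ln(L), where k is the number of free parameters, n the number of rows, and L the likelihood. The helper names and the candidate log-likelihood values below are hypothetical, and the exact criterion the database applies internally is not described in this section.

```python
import math


def bic(total_log_likelihood, n_parameters, n_rows):
    """Textbook Bayesian Information Criterion; lower values are better."""
    return n_parameters * math.log(n_rows) - 2.0 * total_log_likelihood


def select_model_size(candidates, n_rows):
    """Pick the number of components whose fit has the lowest BIC.

    candidates maps n_components -> (total_log_likelihood, n_parameters).
    """
    return min(candidates, key=lambda k: bic(*candidates[k], n_rows))


# Hypothetical fits of 1, 2, and 3 components on 104 rows (made-up numbers).
candidates = {
    1: (-400.0, 2),
    2: (-300.0, 5),
    3: (-298.0, 8),
}
best_size = select_model_size(candidates, n_rows=104)
print(best_size)
```

In this made-up case the third component improves the log likelihood only slightly, so the parameter penalty outweighs the gain and the two-component model is preferred.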
The following table lists the settings for component clustering for EM models.
Table 9-7 Expectation Maximization Settings for Component Clustering
Setting Name | Setting Value | Description |
---|---|---|
CLUS_NUM_CLUSTERS | | The maximum number of leaf clusters generated by the algorithm. The algorithm may return fewer clusters than the specified number, depending on the data, but it cannot return more clusters than the number of components, which is governed by algorithm-specific settings. (See Table 9-6.) Depending on these settings, there may be fewer clusters than components. If component clustering is disabled, then the number of clusters equals the number of components. The default value is system-determined. |
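EMCS_CLUSTER_COMPONENTS | | Enables or disables the grouping of EM components into high-level clusters. When disabled, the components themselves are treated as clusters. When component clustering is enabled, model scoring through the SQL CLUSTER_ID function produces assignments to the higher-level clusters. When clustering is disabled, the CLUSTER_ID function produces assignments to the original components. The default value is EMCS_CLUSTER_COMP_ENABLE. |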
EMCS_CLUSTER_THRESH | | Dissimilarity threshold that controls the clustering of EM components. When the dissimilarity measure is less than the threshold, the components are combined into a single cluster. A lower threshold may produce more clusters that are more compact. A higher threshold may produce fewer clusters that are more spread out. The default value is 2. |
EMCS_LINKAGE_FUNCTION | | Allows the specification of a linkage function for the agglomerative clustering step. The default value is EMCS_LINKAGE_SINGLE. |
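The interaction between the dissimilarity threshold and a single linkage function can be sketched as follows. With single linkage, two groups are merged as soon as any pair of their members is closer than the threshold, which amounts to finding connected components. The function name cluster_components and the dissimilarity matrix are hypothetical; the dissimilarity measure the database computes between EM components is not specified in this section.

```python
def cluster_components(dissimilarity, threshold):
    """Group component indices by single linkage under a dissimilarity threshold.

    dissimilarity is a symmetric matrix (list of lists). Components i and j end
    up in the same cluster when a chain of pairs, each below the threshold,
    connects them.
    """
    n = len(dissimilarity)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the group representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dissimilarity[i][j] < threshold:
                parent[find(i)] = find(j)  # merge the two groups

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical pairwise dissimilarities between four components.
d = [
    [0.0, 1.0, 5.0, 6.0],
    [1.0, 0.0, 1.5, 7.0],
    [5.0, 1.5, 0.0, 8.0],
    [6.0, 7.0, 8.0, 0.0],
]
print(cluster_components(d, threshold=2.0))
print(cluster_components(d, threshold=1.2))
```

Lowering the threshold from 2.0 to 1.2 splits the chained group apart, which matches the description above: a lower threshold tends to produce more, tighter clusters.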
The following table lists the settings for cluster statistics for EM models.
Table 9-8 Expectation Maximization Settings for Cluster Statistics
Setting Name | Setting Value | Description |
---|---|---|
EMCS_CLUSTER_STATISTICS | | Enables or disables the gathering of descriptive statistics for clusters (centroids, histograms, and rules). When statistics are disabled, model size is reduced. The default value is EMCS_CLUS_STATS_ENABLE. |
EMCS_MIN_PCT_ATTR_SUPPORT | | Minimum support required for including an attribute in the cluster rule. The support is the percentage of the data rows assigned to a cluster that must have non-null values for the attribute. The default value is 0.1. |
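The support check described above can be sketched in a few lines. The helper names, the row data, and the interpretation of the threshold as a fraction between 0 and 1 are assumptions made for illustration.

```python
def attribute_support(rows, attribute):
    """Fraction of rows that have a non-null value for the attribute."""
    if not rows:
        return 0.0
    present = sum(1 for row in rows if row.get(attribute) is not None)
    return present / len(rows)


def rule_attributes(rows, attributes, min_support):
    """Attributes whose support within the cluster meets the minimum."""
    return [a for a in attributes if attribute_support(rows, a) >= min_support]


# Hypothetical rows assigned to one cluster: Sepal_Width is null in 19 of 20 rows.
cluster_rows = [{'Petal_Length': 1.4, 'Sepal_Width': None} for _ in range(19)]
cluster_rows.append({'Petal_Length': 1.5, 'Sepal_Width': 3.4})

kept = rule_attributes(cluster_rows, ['Petal_Length', 'Sepal_Width'], min_support=0.1)
print(kept)
```

Here Sepal_Width has non-null values in only 1 of 20 rows (support 0.05), so it falls below the threshold and is left out of the rule, while Petal_Length is kept.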
Example 9-10 Using the oml.em Class
This example creates an EM model and uses some of the methods of the oml.em class.
import oml
import pandas as pd
from sklearn import datasets
# Load the iris data set and create a pandas.DataFrame for it.
iris = datasets.load_iris()
x = pd.DataFrame(iris.data,
                 columns = ['Sepal_Length','Sepal_Width',
                            'Petal_Length','Petal_Width'])
y = pd.DataFrame(list(map(lambda x:
                          {0: 'setosa', 1: 'versicolor',
                           2:'virginica'}[x], iris.target)),
                 columns = ['Species'])
# Drop the IRIS table if it already exists.
try:
    oml.drop('IRIS')
except Exception:
    pass
# Create the IRIS database table and the proxy object for the table.
oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')
# Create training and test data.
dat = oml.sync(table = 'IRIS').split()
train_dat = dat[0]
test_dat = dat[1]
# Specify settings.
setting = {'emcs_num_iterations': 100}
# Create an EM model object
em_mod = oml.em(n_clusters = 2, **setting)
# Fit the EM model according to the training data and parameter
# settings.
em_mod = em_mod.fit(train_dat)
# Show details of the model.
em_mod
# Use the model to make predictions on the test data.
em_mod.predict(test_dat)
# Make predictions and return the probability for each cluster
# on new data.
em_mod.predict_proba(test_dat,
    supplemental_cols = test_dat[:,
        ['Sepal_Length', 'Sepal_Width',
         'Petal_Length']]).sort_values(by = ['Sepal_Length',
             'Sepal_Width', 'Petal_Length',
             'PROBABILITY_OF_2', 'PROBABILITY_OF_3'])
# Change the random seed and refit the model.
em_mod.set_params(EMCS_RANDOM_SEED = '5').fit(train_dat)
Listing for This Example
>>> import oml
>>> import pandas as pd
>>> from sklearn import datasets
>>>
>>> # Load the iris data set and create a pandas.DataFrame for it.
... iris = datasets.load_iris()
>>> x = pd.DataFrame(iris.data,
...                  columns = ['Sepal_Length','Sepal_Width',
...                             'Petal_Length','Petal_Width'])
>>> y = pd.DataFrame(list(map(lambda x:
...                           {0: 'setosa', 1: 'versicolor',
...                            2:'virginica'}[x], iris.target)),
...                  columns = ['Species'])
>>>
>>> try:
...     oml.drop('IRIS')
... except Exception:
...     pass
>>>
>>> # Create the IRIS database table and the proxy object for the table.
... oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')
>>>
>>> # Create training and test data.
... dat = oml.sync(table = 'IRIS').split()
>>> train_dat = dat[0]
>>> test_dat = dat[1]
>>>
>>> # Specify settings.
... setting = {'emcs_num_iterations': 100}
>>>
>>> # Create an EM model object.
... em_mod = oml.em(n_clusters = 2, **setting)
>>>
>>> # Fit the EM model according to the training data and parameter
... # settings.
>>> em_mod = em_mod.fit(train_dat)
>>>
>>> # Show details of the model.
... em_mod
Algorithm Name: Expectation Maximization
Mining Function: CLUSTERING
Settings:
setting name setting value
0 ALGO_NAME ALGO_EXPECTATION_MAXIMIZATION
1 CLUS_NUM_CLUSTERS 2
2 EMCS_CLUSTER_COMPONENTS EMCS_CLUSTER_COMP_ENABLE
3 EMCS_CLUSTER_STATISTICS EMCS_CLUS_STATS_ENABLE
4 EMCS_CLUSTER_THRESH 2
5 EMCS_LINKAGE_FUNCTION EMCS_LINKAGE_SINGLE
6 EMCS_LOGLIKE_IMPROVEMENT .001
7 EMCS_MAX_NUM_ATTR_2D 50
8 EMCS_MIN_PCT_ATTR_SUPPORT .1
9 EMCS_MODEL_SEARCH EMCS_MODEL_SEARCH_DISABLE
10 EMCS_NUM_COMPONENTS 20
11 EMCS_NUM_DISTRIBUTION EMCS_NUM_DISTR_SYSTEM
12 EMCS_NUM_EQUIWIDTH_BINS 11
13 EMCS_NUM_ITERATIONS 100
14 EMCS_NUM_PROJECTIONS 50
15 EMCS_RANDOM_SEED 0
16 EMCS_REMOVE_COMPONENTS EMCS_REMOVE_COMPS_ENABLE
17 ODMS_DETAILS ODMS_ENABLE
18 ODMS_MISSING_VALUE_TREATMENT ODMS_MISSING_VALUE_AUTO
19 ODMS_SAMPLING ODMS_SAMPLING_DISABLE
20 PREP_AUTO ON
Computed Settings:
setting name setting value
0 EMCS_ATTRIBUTE_FILTER EMCS_ATTR_FILTER_DISABLE
1 EMCS_CONVERGENCE_CRITERION EMCS_CONV_CRIT_BIC
2 EMCS_NUM_QUANTILE_BINS 3
3 EMCS_NUM_TOPN_BINS 3
Global Statistics:
attribute name attribute value
0 CONVERGED YES
1 LOGLIKELIHOOD -2.10044
2 NUM_CLUSTERS 2
3 NUM_COMPONENTS 8
4 NUM_ROWS 104
5 RANDOM_SEED 0
6 REMOVED_COMPONENTS 12
Attributes:
Petal_Length
Petal_Width
Sepal_Length
Sepal_Width
Species
Partition: NO
Clusters:
CLUSTER_ID CLUSTER_NAME RECORD_COUNT PARENT TREE_LEVEL \
0 1 1 104 NaN 1
1 2 2 68 1.0 2
2 3 3 36 1.0 2
LEFT_CHILD_ID RIGHT_CHILD_ID
0 2.0 3.0
1 NaN NaN
2 NaN NaN
Taxonomy:
PARENT_CLUSTER_ID CHILD_CLUSTER_ID
0 1 2.0
1 1 3.0
2 2 NaN
3 3 NaN
Centroids:
CLUSTER_ID ATTRIBUTE_NAME MEAN MODE_VALUE VARIANCE
0 1 Petal_Length 3.721154 None 3.234694
1 1 Petal_Width 1.155769 None 0.567539
2 1 Sepal_Length 5.831731 None 0.753255
3 1 Sepal_Width 3.074038 None 0.221358
4 1 Species NaN setosa NaN
5 2 Petal_Length 4.902941 None 0.860588
6 2 Petal_Width 1.635294 None 0.191572
7 2 Sepal_Length 6.266176 None 0.545555
8 2 Sepal_Width 2.854412 None 0.128786
9 2 Species NaN versicolor NaN
10 3 Petal_Length 1.488889 None 0.033016
11 3 Petal_Width 0.250000 None 0.012857
12 3 Sepal_Length 5.011111 None 0.113016
13 3 Sepal_Width 3.488889 None 0.134159
14 3 Species NaN setosa NaN
Leaf Cluster Counts:
CLUSTER_ID CNT
0 2 68
1 3 36
Attribute Importance:
ATTRIBUTE_NAME ATTRIBUTE_IMPORTANCE_VALUE ATTRIBUTE_RANK
0 Petal_Length 0.558311 2
1 Petal_Width 0.556300 3
2 Sepal_Length 0.469978 4
3 Sepal_Width 0.196211 5
4 Species 0.612463 1
Components:
COMPONENT_ID CLUSTER_ID PRIOR_PROBABILITY
0 1 2 0.115366
1 2 2 0.079158
2 3 3 0.113448
3 4 2 0.148059
4 5 3 0.126979
5 6 2 0.134402
6 7 3 0.105727
7 8 2 0.176860
Cluster Hists:
cluster.id variable bin.id lower.bound upper.bound \
0 1 Petal_Length 1 1.00 1.59
1 1 Petal_Length 2 1.59 2.18
2 1 Petal_Length 3 2.18 2.77
3 1 Petal_Length 4 2.77 3.36
... ... ... ... ... ...
137 3 Sepal_Width 11 NaN NaN
138 3 Species:'Other' 1 NaN NaN
139 3 Species:setosa 2 NaN NaN
140 3 Species:versicolor 3 NaN NaN
label count
0 1:1.59 25
1 1.59:2.18 11
2 2.18:2.77 0
3 2.77:3.36 3
... ... ...
137 : 0
138 : 0
139 : 36
140 : 0
[141 rows x 7 columns]
Rules:
cluster.id rhs.support rhs.conf lhr.support lhs.conf lhs.var \
0 1 104 1.000000 93 0.892157 Sepal_Width
1 1 104 1.000000 93 0.892157 Sepal_Width
2 1 104 1.000000 99 0.892157 Petal_Length
3 1 104 1.000000 99 0.892157 Petal_Length
... ... ... ... ... ... ...
26 3 36 0.346154 36 0.972222 Petal_Length
27 3 36 0.346154 36 0.972222 Sepal_Length
28 3 36 0.346154 36 0.972222 Sepal_Length
29 3 36 0.346154 36 0.972222 Species
lhs.var.support lhs.var.conf predicate
0 93 0.400000 Sepal_Width <= 3.92
1 93 0.400000 Sepal_Width > 2.48
2 93 0.222222 Petal_Length <= 6.31
3 93 0.222222 Petal_Length >= 1
... ... ... ...
26 35 0.134398 Petal_Length >= 1
27 35 0.094194 Sepal_Length <= 5.74
28 35 0.094194 Sepal_Length >= 4.3
29 35 0.281684 Species = setosa
[30 rows x 9 columns]
>>> # Use the model to make predictions on the test data.
... em_mod.predict(test_dat)
CLUSTER_ID
0 3
1 3
2 3
3 3
... ...
42 2
43 2
44 2
45 2
>>> # Make predictions and return the probability for each cluster
... # on new data.
>>> em_mod.predict_proba(test_dat,
...     supplemental_cols = test_dat[:,
...         ['Sepal_Length', 'Sepal_Width',
...          'Petal_Length']]).sort_values(by = ['Sepal_Length',
...              'Sepal_Width', 'Petal_Length',
...              'PROBABILITY_OF_2', 'PROBABILITY_OF_3'])
Sepal_Length Sepal_Width Petal_Length PROBABILITY_OF_2 \
0 4.4 3.0 1.3 4.680788e-20
1 4.4 3.2 1.3 1.052071e-20
2 4.5 2.3 1.3 7.751240e-06
3 4.8 3.4 1.6 5.363418e-19
... ... ... ... ...
43 6.9 3.1 4.9 1.000000e+00
44 6.9 3.1 5.4 1.000000e+00
45 7.0 3.2 4.7 1.000000e+00
PROBABILITY_OF_3
0 1.000000e+00
1 1.000000e+00
2 9.999922e-01
3 1.000000e+00
... ...
43 3.295578e-97
44 6.438740e-137
45 3.853925e-89
>>>
>>> # Change the random seed and refit the model.
... em_mod.set_params(EMCS_RANDOM_SEED = '5').fit(train_dat)
Algorithm Name: Expectation Maximization
Mining Function: CLUSTERING
Settings:
setting name setting value
0 ALGO_NAME ALGO_EXPECTATION_MAXIMIZATION
1 CLUS_NUM_CLUSTERS 2
2 EMCS_CLUSTER_COMPONENTS EMCS_CLUSTER_COMP_ENABLE
3 EMCS_CLUSTER_STATISTICS EMCS_CLUS_STATS_ENABLE
4 EMCS_CLUSTER_THRESH 2
5 EMCS_LINKAGE_FUNCTION EMCS_LINKAGE_SINGLE
6 EMCS_LOGLIKE_IMPROVEMENT .001
7 EMCS_MAX_NUM_ATTR_2D 50
8 EMCS_MIN_PCT_ATTR_SUPPORT .1
9 EMCS_MODEL_SEARCH EMCS_MODEL_SEARCH_DISABLE
10 EMCS_NUM_COMPONENTS 20
11 EMCS_NUM_DISTRIBUTION EMCS_NUM_DISTR_SYSTEM
12 EMCS_NUM_EQUIWIDTH_BINS 11
13 EMCS_NUM_ITERATIONS 100
14 EMCS_NUM_PROJECTIONS 50
15 EMCS_RANDOM_SEED 5
16 EMCS_REMOVE_COMPONENTS EMCS_REMOVE_COMPS_ENABLE
17 ODMS_DETAILS ODMS_ENABLE
18 ODMS_MISSING_VALUE_TREATMENT ODMS_MISSING_VALUE_AUTO
19 ODMS_SAMPLING ODMS_SAMPLING_DISABLE
20 PREP_AUTO ON
Computed Settings:
setting name setting value
0 EMCS_ATTRIBUTE_FILTER EMCS_ATTR_FILTER_DISABLE
1 EMCS_CONVERGENCE_CRITERION EMCS_CONV_CRIT_BIC
2 EMCS_NUM_QUANTILE_BINS 3
3 EMCS_NUM_TOPN_BINS 3
Global Statistics:
attribute name attribute value
0 CONVERGED YES
1 LOGLIKELIHOOD -1.75777
2 NUM_CLUSTERS 2
3 NUM_COMPONENTS 9
4 NUM_ROWS 104
5 RANDOM_SEED 5
6 REMOVED_COMPONENTS 11
Attributes:
Petal_Length
Petal_Width
Sepal_Length
Sepal_Width
Species
Partition: NO
Clusters:
CLUSTER_ID CLUSTER_NAME RECORD_COUNT PARENT TREE_LEVEL LEFT_CHILD_ID \
0 1 1 104 NaN 1 2.0
1 2 2 36 1.0 2 NaN
2 3 3 68 1.0 2 NaN
RIGHT_CHILD_ID
0 3.0
1 NaN
2 NaN
Taxonomy:
PARENT_CLUSTER_ID CHILD_CLUSTER_ID
0 1 2.0
1 1 3.0
2 2 NaN
3 3 NaN
Centroids:
CLUSTER_ID ATTRIBUTE_NAME MEAN MODE_VALUE VARIANCE
0 1 Petal_Length 3.721154 None 3.234694
1 1 Petal_Width 1.155769 None 0.567539
2 1 Sepal_Length 5.831731 None 0.753255
3 1 Sepal_Width 3.074038 None 0.221358
4 1 Species NaN setosa NaN
5 2 Petal_Length 1.488889 None 0.033016
6 2 Petal_Width 0.250000 None 0.012857
7 2 Sepal_Length 5.011111 None 0.113016
8 2 Sepal_Width 3.488889 None 0.134159
9 2 Species NaN setosa NaN
10 3 Petal_Length 4.902941 None 0.860588
11 3 Petal_Width 1.635294 None 0.191572
12 3 Sepal_Length 6.266176 None 0.545555
13 3 Sepal_Width 2.854412 None 0.128786
14 3 Species NaN versicolor NaN
Leaf Cluster Counts:
CLUSTER_ID CNT
0 2 36
1 3 68
Attribute Importance:
ATTRIBUTE_NAME ATTRIBUTE_IMPORTANCE_VALUE ATTRIBUTE_RANK
0 Petal_Length 0.558311 2
1 Petal_Width 0.556300 3
2 Sepal_Length 0.469978 4
3 Sepal_Width 0.196211 5
4 Species 0.612463 1
Components:
COMPONENT_ID CLUSTER_ID PRIOR_PROBABILITY
0 1 2 0.113452
1 2 2 0.105727
2 3 3 0.114202
3 4 3 0.086285
4 5 3 0.067294
5 6 2 0.124365
6 7 3 0.126975
7 8 3 0.105761
8 9 3 0.155939
Cluster Hists:
cluster.id variable bin.id lower.bound upper.bound \
0 1 Petal_Length 1 1.00 1.59
1 1 Petal_Length 2 1.59 2.18
2 1 Petal_Length 3 2.18 2.77
3 1 Petal_Length 4 2.77 3.36
... ... ... ... ... ...
137 3 Sepal_Width 11 NaN NaN
138 3 Species:'Other' 1 NaN NaN
139 3 Species:setosa 3 NaN NaN
140 3 Species:versicolor 2 NaN NaN
label count
0 1:1.59 25
1 1.59:2.18 11
2 2.18:2.77 0
3 2.77:3.36 3
... ... ...
137 : 0
138 : 33
139 : 0
140 : 35
[141 rows x 7 columns]
Rules:
cluster.id rhs.support rhs.conf lhr.support lhs.conf lhs.var \
0 1 104 1.000000 93 0.894231 Sepal_Width
1 1 104 1.000000 93 0.894231 Sepal_Width
2 1 104 1.000000 99 0.894231 Petal_Length
3 1 104 1.000000 99 0.894231 Petal_Length
... ... ... ... ... ... ...
26 3 68 0.653846 68 0.955882 Sepal_Length
27 3 68 0.653846 68 0.955882 Sepal_Length
28 3 68 0.653846 68 0.955882 Species
29 3 68 0.653846 68 0.955882 Species
lhs.var.support lhs.var.conf predicate
0 93 0.400000 Sepal_Width <= 3.92
1 93 0.400000 Sepal_Width > 2.48
2 93 0.222222 Petal_Length <= 6.31
3 93 0.222222 Petal_Length >= 1
... ... ... ...
26 65 0.026013 Sepal_Length <= 7.9
27 65 0.026013 Sepal_Length > 4.66
28 65 0.125809 Species IN 'Other'
29 65 0.125809 Species IN versicolor
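One way to read the Components and Clusters sections of the first listing together is to sum the component prior probabilities per cluster and compare the result with each leaf cluster's share of the training rows. The sketch below copies the numbers from that listing into plain Python structures; the variable names are illustrative, and the close agreement it shows is an observation about this particular output rather than a documented guarantee.

```python
from collections import defaultdict

# (COMPONENT_ID, CLUSTER_ID, PRIOR_PROBABILITY) from the first model's Components output.
components = [
    (1, 2, 0.115366),
    (2, 2, 0.079158),
    (3, 3, 0.113448),
    (4, 2, 0.148059),
    (5, 3, 0.126979),
    (6, 2, 0.134402),
    (7, 3, 0.105727),
    (8, 2, 0.176860),
]

# Leaf cluster record counts from the same listing.
record_counts = {2: 68, 3: 36}

# Sum the component priors that belong to each cluster.
cluster_prior = defaultdict(float)
for _component_id, cluster_id, prior in components:
    cluster_prior[cluster_id] += prior

total_rows = sum(record_counts.values())
for cluster_id in sorted(cluster_prior):
    share = record_counts[cluster_id] / total_rows
    print(cluster_id, round(cluster_prior[cluster_id], 6), round(share, 6))
```

For this model the summed priors land very close to 68/104 and 36/104, which is also the rhs.conf value of 0.346154 reported for cluster 3 in the Rules output.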