9.13 k-Means
The oml.km class uses the k-Means (KM) algorithm, which is a hierarchical, distance-based clustering algorithm that partitions data into a specified number of clusters.
The algorithm has the following features:

- Several distance functions: Euclidean, Cosine, and Fast Cosine. The default is Euclidean. (A brief sketch of these distance measures follows this list.)
- For each cluster, the algorithm returns the centroid, a histogram for each attribute, and a rule describing the hyperbox that encloses the majority of the data assigned to the cluster. The centroid reports the mode for categorical attributes and the mean and variance for numeric attributes.
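To make the distance functions and centroid statistics concrete, here is a minimal standalone sketch in plain numpy. It is an illustration only, not the oml implementation; the variable names are hypothetical.

```python
import numpy as np

# Two numeric points (rows of a data set).
a = np.array([5.1, 3.5, 1.4, 0.2])
b = np.array([6.2, 2.9, 4.3, 1.3])

# Euclidean distance: straight-line distance between the two points.
euclidean = np.sqrt(np.sum((a - b) ** 2))

# Cosine distance: 1 minus the cosine of the angle between the vectors;
# it compares orientation and ignores magnitude.
cosine = 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# A centroid summarizes the rows assigned to a cluster: for numeric
# attributes it reports the per-attribute mean and variance.
cluster_rows = np.array([[5.1, 3.5], [4.9, 3.0], [4.7, 3.2]])
centroid_mean = cluster_rows.mean(axis=0)
centroid_var = cluster_rows.var(axis=0)

print(euclidean, cosine)
print(centroid_mean, centroid_var)
```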
For information on the oml.km class attributes and methods, invoke help(oml.km) or see Oracle Machine Learning for Python API Reference.
Settings for a k-Means Model
The following table lists the settings that apply to KM models.
Table 9-11 k-Means Model Settings
Setting Name | Setting Value | Description
---|---|---
CLUS_NUM_CLUSTERS | A positive integer | The maximum number of leaf clusters generated by the algorithm. The algorithm produces the specified number of clusters unless there are fewer distinct data points. The default value is 10.
KMNS_CONV_TOLERANCE | A number greater than 0 and less than 1 | Minimum Convergence Tolerance for k-Means. The algorithm iterates until the minimum Convergence Tolerance is satisfied or until the maximum number of iterations, specified in KMNS_ITERATIONS, is reached. Decreasing the Convergence Tolerance produces a more accurate solution but may result in longer run times. The default Convergence Tolerance is 0.001.
KMNS_DETAILS | KMNS_DETAILS_NONE, KMNS_DETAILS_HIERARCHY, KMNS_DETAILS_ALL | Determines the level of cluster detail that is computed during the build. With KMNS_DETAILS_NONE, no cluster details are computed. With KMNS_DETAILS_HIERARCHY, the cluster hierarchy and cluster record counts are computed. With KMNS_DETAILS_ALL, the cluster hierarchy, record counts, and descriptive statistics (means, variances, modes, histograms, and rules) are computed. The default value is KMNS_DETAILS_HIERARCHY.
KMNS_DISTANCE | KMNS_COSINE, KMNS_EUCLIDEAN, KMNS_FAST_COSINE | Distance function for k-Means. The default distance function is KMNS_EUCLIDEAN.
KMNS_ITERATIONS | A positive integer | Maximum number of iterations for k-Means. The algorithm iterates until either the maximum number of iterations is reached or the minimum Convergence Tolerance, specified in KMNS_CONV_TOLERANCE, is satisfied. The default number of iterations is 20.
KMNS_MIN_PCT_ATTR_SUPPORT | A number between 0 and 1 | Minimum percentage of attribute values that must be non-null in order for the attribute to be included in the rule description for the cluster. If the data is sparse or includes many missing values, a minimum support that is too high can cause very short rules or even empty rules. The default minimum support is 0.1.
KMNS_NUM_BINS | A positive integer | Number of bins in the attribute histogram produced by k-Means. The bin boundaries for each attribute are computed globally on the entire training data set. The binning method is equi-width. All attributes have the same number of bins with the exception of attributes with a single value, which have only one bin. The default number of histogram bins is 11.
KMNS_RANDOM_SEED | A non-negative integer | Controls the seed of the random generator used during the k-Means initialization. It must be a non-negative integer value. The default value is 0.
KMNS_SPLIT_CRITERION | KMNS_SIZE, KMNS_VARIANCE | Split criterion for k-Means. The split criterion controls the initialization of new k-Means clusters. The algorithm builds a binary tree and adds one new cluster at a time. When the split criterion is based on size, the new cluster is placed in the area where the largest current cluster is located. When the split criterion is based on variance, the new cluster is placed in the area of the most spread-out cluster. The default split criterion is KMNS_VARIANCE.
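These settings are passed to the oml.km constructor as lowercase keyword arguments, typically collected in a dict, as Example 9-13 below does with kmns_iterations. The following is a minimal sketch, assuming a database connection is already established and that train_dat is an oml.DataFrame proxy of training data; the specific setting values shown are illustrative, not recommendations.

```python
import oml

# Setting names are the lowercase forms of the names in Table 9-11.
settings = {'kmns_iterations':      30,             # max iterations
            'kmns_conv_tolerance':  0.0005,         # tighter convergence
            'kmns_distance':        'KMNS_COSINE',  # cosine distance
            'kmns_split_criterion': 'KMNS_SIZE'}    # split the largest cluster

# n_clusters corresponds to CLUS_NUM_CLUSTERS; the remaining settings
# are unpacked as keyword arguments.
km_mod = oml.km(n_clusters = 5, **settings).fit(train_dat)
```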
Example 9-13 Using the oml.km Class
This example creates a KM model and uses some of the methods of that model. In the listing for this example, some of the output is not shown, as indicated by ellipses.
import oml
import pandas as pd
from sklearn import datasets
# Load the iris data set and create a pandas.DataFrame for it.
iris = datasets.load_iris()
x = pd.DataFrame(iris.data,
columns = ['Sepal_Length','Sepal_Width',
'Petal_Length','Petal_Width'])
y = pd.DataFrame(list(map(lambda x:
{0: 'setosa', 1: 'versicolor',
2:'virginica'}[x], iris.target)),
columns = ['Species'])
try:
oml.drop('IRIS')
except:
pass
# Create the IRIS database table and the proxy object for the table.
oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')
# Create training and test data.
dat = oml.sync(table = 'IRIS').split()
train_dat = dat[0]
test_dat = dat[1]
# Specify settings.
setting = {'kmns_iterations': 20}
# Create a KM model object and fit it.
km_mod = oml.km(n_clusters = 3, **setting).fit(train_dat)
# Show model details.
km_mod
# Use the model to make predictions on the test data.
km_mod.predict(test_dat,
supplemental_cols =
test_dat[:, ['Sepal_Length', 'Sepal_Width',
'Petal_Length', 'Species']])
km_mod.predict_proba(test_dat,
supplemental_cols =
test_dat[:, ['Species']]).sort_values(by =
['Species', 'PROBABILITY_OF_3'])
km_mod.transform(test_dat)
km_mod.score(test_dat)
Listing for This Example
>>> import oml
>>> import pandas as pd
>>> from sklearn import datasets
>>>
>>> # Load the iris data set and create a pandas.DataFrame for it.
... iris = datasets.load_iris()
>>> x = pd.DataFrame(iris.data,
... columns = ['Sepal_Length','Sepal_Width',
... 'Petal_Length','Petal_Width'])
>>> y = pd.DataFrame(list(map(lambda x:
... {0: 'setosa', 1: 'versicolor',
... 2:'virginica'}[x], iris.target)),
... columns = ['Species'])
>>>
>>> try:
... oml.drop('IRIS')
... except:
... pass
>>>
>>> # Create the IRIS database table and the proxy object for the table.
... oml_iris = oml.create(pd.concat([x, y], axis=1), table = 'IRIS')
>>>
>>> # Create training and test data.
... dat = oml.sync(table = 'IRIS').split()
>>> train_dat = dat[0]
>>> test_dat = dat[1]
>>>
>>> # Specify settings.
... setting = {'kmns_iterations': 20}
>>>
>>> # Create a KM model object and fit it.
... km_mod = oml.km(n_clusters = 3, **setting).fit(train_dat)
>>>
>>> # Show model details.
... km_mod
Algorithm Name: K-Means
Mining Function: CLUSTERING
Settings:
setting name setting value
0 ALGO_NAME ALGO_KMEANS
1 CLUS_NUM_CLUSTERS 3
2 KMNS_CONV_TOLERANCE .001
3 KMNS_DETAILS KMNS_DETAILS_HIERARCHY
4 KMNS_DISTANCE KMNS_EUCLIDEAN
5 KMNS_ITERATIONS 20
6 KMNS_MIN_PCT_ATTR_SUPPORT .1
7 KMNS_NUM_BINS 11
8 KMNS_RANDOM_SEED 0
9 KMNS_SPLIT_CRITERION KMNS_VARIANCE
10 ODMS_DETAILS ODMS_ENABLE
11 ODMS_MISSING_VALUE_TREATMENT ODMS_MISSING_VALUE_AUTO
12 ODMS_SAMPLING ODMS_SAMPLING_DISABLE
13 PREP_AUTO ON
Global Statistics:
attribute name attribute value
0 CONVERGED YES
1 NUM_ROWS 104.0
Attributes: Petal_Length
Petal_Width
Sepal_Length
Sepal_Width
Species
Partition: NO
Clusters:
CLUSTER_ID ROW_CNT PARENT_CLUSTER_ID TREE_LEVEL DISPERSION
0 1 104 NaN 1 0.986153
1 2 68 1.0 2 1.102147
2 3 36 1.0 2 0.767052
3 4 37 2.0 3 1.015669
4 5 31 2.0 3 1.205363
Taxonomy:
PARENT_CLUSTER_ID CHILD_CLUSTER_ID
0 1 2.0
1 1 3.0
2 2 4.0
3 2 5.0
4 3 NaN
5 4 NaN
6 5 NaN
Leaf Cluster Counts:
CLUSTER_ID CNT
0 3 50
1 4 53
2 5 47
>>>
>>> # Use the model to make predictions on the test data.
... km_mod.predict(test_dat,
...                supplemental_cols =
...                test_dat[:, ['Sepal_Length', 'Sepal_Width',
...                             'Petal_Length', 'Species']])
Sepal_Length Sepal_Width Petal_Length Species CLUSTER_ID
0 4.9 3.0 1.4 setosa 3
1 4.9 3.1 1.5 setosa 3
2 4.8 3.4 1.6 setosa 3
3 5.8 4.0 1.2 setosa 3
... ... ... ... ... ...
38 6.4 2.8 5.6 virginica 5
39 6.9 3.1 5.4 virginica 5
40 6.7 3.1 5.6 virginica 5
41 5.8 2.7 5.1 virginica 5
>>>
>>> km_mod.predict_proba(test_dat,
... supplemental_cols =
... test_dat[:, ['Species']]).sort_values(by =
... ['Species', 'PROBABILITY_OF_3'])
Species PROBABILITY_OF_3 PROBABILITY_OF_4 PROBABILITY_OF_5
0 setosa 0.791267 0.208494 0.000240
1 setosa 0.971498 0.028350 0.000152
2 setosa 0.981020 0.018499 0.000481
3 setosa 0.981907 0.017989 0.000104
... ... ... ... ...
42 virginica 0.000655 0.316671 0.682674
43 virginica 0.001036 0.413744 0.585220
44 virginica 0.001036 0.413744 0.585220
45 virginica 0.002452 0.305021 0.692527
>>>
>>> km_mod.transform(test_dat)
CLUSTER_DISTANCE
0 1.050234
1 0.859817
2 0.321065
3 1.427080
... ...
42 0.837757
43 0.479313
44 0.448562
45 1.123587
>>>
>>> km_mod.score(test_dat)
-47.487712
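As a follow-on to the example, the cluster assignments can be compared with the known species labels. This sketch is an addition to the example above, assuming the predictions fit comfortably in client memory; pull() fetches the proxy result into a local pandas.DataFrame.

```python
# Fetch the predictions into local memory and cross-tabulate the
# cluster assignments against the known species labels.
res = km_mod.predict(test_dat,
                     supplemental_cols = test_dat[:, ['Species']]).pull()
print(pd.crosstab(res['Species'], res['CLUSTER_ID']))
```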