3.2.3 Build Model

To evaluate a model's performance, it is common practice to split the data into training and test sets, which lets you assess how well the model generalizes to unseen data. In unsupervised learning such as clustering, however, there is no target label against which accuracy or other performance metrics can be calculated. As a result, you can use the entire dataset to build the model without splitting it. Because there is no ground truth to compare the results against, a training-test split is neither applicable nor useful in unsupervised learning.

Algorithm Selection

Using OML4R, you can choose one of the following algorithms to solve a clustering problem:
  1. K-Means (KM)
  2. Expectation-Maximization (EM)
  3. Orthogonal Partitioning Cluster (O-Cluster)

The k-Means (KM) algorithm is a distance-based clustering algorithm that partitions the data into a specified number of clusters. Distance-based algorithms rely on the idea that nearby data points are more related to each other than data points that are farther away. The algorithm iteratively minimizes the variance of each data point with respect to its nearest cluster centroid. The Expectation-Maximization (EM) algorithm performs probabilistic clustering based on a density estimation algorithm. The Orthogonal Partitioning Cluster (O-Cluster) algorithm is a density-based clustering method designed for large, high-dimensional datasets.

A good starting point for clustering is the k-Means algorithm. It works by assigning each data point to the closest cluster center (centroid) and makes few assumptions about the underlying structure of the data. This simplicity makes it a user-friendly choice for many applications, and it is the method we use for this use case.

We will use the elbow method to determine the number of clusters in the dataset. The elbow method is a heuristic for choosing the number of clusters: plot the within-cluster variance (dispersion) as a function of the number of clusters and pick the "elbow" of the curve as the number of clusters to use. Because the OML4R k-Means model builds a cluster hierarchy, the dispersion it reports corresponds to its leaf clusters. We will build models with 1 through 8 clusters and look for the elbow in the resulting dispersion curve to assess which number of clusters seems best.
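
The dispersion curve can be computed in several ways. As a rough local sketch of the heuristic (not the in-database build used in the rest of this section), the following assumes that CUST_DF_CLEAN can be pulled into the R session with ore.pull() and that its non-identifier columns are numeric; base R's kmeans() and its total within-cluster sum of squares stand in for the in-database dispersion:

local_df <- ore.pull(CUST_DF_CLEAN)                       # bring the data into local memory
feat <- local_df[, setdiff(names(local_df), "CUST_ID")]   # drop the case identifier

set.seed(1)                                               # reproducible centroid initialization
wss <- sapply(1:8, function(k)
    kmeans(feat, centers = k, nstart = 10)$tot.withinss)  # dispersion for k = 1 through 8

plot(1:8, wss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Total within-cluster dispersion")            # look for the "elbow" in this curve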

To specify the model settings and build a k-Means model object that segments the data, run the following commands. The settings are given as a list of name-value pairs, where each element's name identifies the algorithm setting and its value is the setting value. The settings specified here are KMNS_ITERATIONS, KMNS_RANDOM_SEED, KMNS_CONV_TOLERANCE, KMNS_NUM_BINS, KMNS_DETAILS, and CASE_ID_COLUMN_NAME; automatic data preparation (PREP_AUTO) is enabled by default. The k-Means algorithm uses the number of clusters (k) and these settings to configure the build, as shown here:

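# Algorithm settings for the in-database k-Means build, given as name-value pairs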
settings <- list(
    KMNS_ITERATIONS = 15,
    KMNS_RANDOM_SEED = 1,
    KMNS_CONV_TOLERANCE = 0.001,
    KMNS_NUM_BINS = 11,
    KMNS_DETAILS = "KMNS_DETAILS_HIERARCHY",
    CASE_ID_COLUMN_NAME = "CUST_ID"
)
 
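# Build the k-Means model with 3 clusters on all columns except CUST_ID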
KM.MOD <- ore.odmKMeans(
    formula = ~.-CUST_ID,
    data = CUST_DF_CLEAN,
    num.centers = 3,
    odm.settings = settings
)
 
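# Typing the model object's name displays its call and settings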
KM.MOD

The following is a list of the algorithm settings and function arguments used in this example:

  • KMNS_ITERATIONS: The maximum number of iterations allowed for the k-Means algorithm. The default is 20.
  • KMNS_RANDOM_SEED: Controls the seed of the random number generator that k-Means uses to select the initial cluster centroids. It must be a non-negative integer. The default is 0.
  • KMNS_CONV_TOLERANCE: The minimum convergence tolerance for k-Means, that is, the threshold for the change in the centroids between consecutive iterations. The algorithm iterates until this tolerance is satisfied or until the maximum number of iterations specified in KMNS_ITERATIONS is reached. Decreasing the tolerance produces a more accurate solution but may result in longer run times. The default is 0.001.
  • KMNS_NUM_BINS: The number of bins in the attribute histograms produced by k-Means. The bin boundaries for each attribute are computed globally on the entire training data set using equi-width binning. All attributes have the same number of bins, except that attributes with a single value have only one bin.
  • KMNS_DETAILS: Determines the level of cluster detail computed during the build. KMNS_DETAILS_ALL, the default, computes the cluster hierarchy, record counts, and descriptive statistics (means, variances, modes, histograms, and rules). KMNS_DETAILS_NONE computes no cluster details; only the scoring information is persisted. KMNS_DETAILS_HIERARCHY, used in this example, computes the cluster hierarchy and cluster record counts.
  • CASE_ID_COLUMN_NAME: Specifies the column that uniquely identifies each case (row); in this example it is CUST_ID.
  • PREP_AUTO: Specifies whether automatic data preparation is used or the user performs algorithm-specific data preparation. Automatic data preparation is enabled by default (the DBMS_DATA_MINING constant PREP_AUTO_ON); it can also be set explicitly in the settings list as PREP_AUTO = "ON", as sketched after this list.
  • ~.-CUST_ID: The formula argument; it clusters all columns of the CUST_DF_CLEAN data frame except the CUST_ID column.
  • CUST_DF_CLEAN: The data frame to be clustered.
  • num.centers: The number of clusters for the clustering model, an integer greater than or equal to 1. The default is 10.
  • odm.settings: A list of in-database algorithm parameter settings. This argument applies when building a model against Oracle Database 12.2 or later. Each list element's name and value refer to the setting name and value, respectively. The setting value must be numeric or a string.
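
As a small sketch of the PREP_AUTO setting described in the list above, the same build could name automatic data preparation explicitly in the settings list. Because PREP_AUTO is on by default, this variant is expected to behave the same as the build shown earlier; the variable names used here are illustrative only:

settings_explicit <- list(
    KMNS_ITERATIONS = 15,
    KMNS_RANDOM_SEED = 1,
    KMNS_CONV_TOLERANCE = 0.001,
    KMNS_NUM_BINS = 11,
    KMNS_DETAILS = "KMNS_DETAILS_HIERARCHY",
    CASE_ID_COLUMN_NAME = "CUST_ID",
    PREP_AUTO = "ON"          # automatic data preparation, enabled by default
)

KM.MOD2 <- ore.odmKMeans(
    formula = ~.-CUST_ID,
    data = CUST_DF_CLEAN,
    num.centers = 3,
    odm.settings = settings_explicit
)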

The output of the KM.MOD command appears as follows:

    Call:
    ore.odmKMeans(formula = ~. - CUST_ID, data = CUST_DF_CLEAN, num.centers = 3,
        odm.settings = settings)
     
    Settings:
                                                   value
    clus.num.clusters                                  3
    block.growth                                       2
    conv.tolerance                                 0.001
    details                            details.hierarchy
    distance                                   euclidean
    iterations                                        15
    min.pct.attr.support                             0.1
    num.bins                                          11
    random.seed                                        1
    split.criterion                             variance
    odms.details                             odms.enable
    odms.missing.value.treatment odms.missing.value.auto
    odms.sampling                  odms.sampling.disable
    prep.auto                                         ON
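
After the build, more detail than the printed settings can be obtained from the model object. As a brief sketch, the standard summary() generic can be applied to the fitted model; the cluster-level detail it reports depends on the KMNS_DETAILS level chosen at build time:

summary(KM.MOD)    # model settings plus the cluster details computed during the build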