3.3.3 Build Model

To evaluate a model's performance, it is common practice to split the data into training and test sets so that you can assess how well the model generalizes to unseen data. In unsupervised learning such as clustering, however, there are no labels or target values against which to calculate accuracy. Because there is no ground truth to compare the results against, a training-test split is not applicable, and you can use the entire dataset to build the model.

Algorithm Selection

Using OML4Py, you can choose one of the following algorithms to solve a clustering problem:
  1. Expectation-Maximization (EM)
  2. k-Means (KM)

The Expectation-Maximization (EM) algorithm performs probabilistic clustering based on density estimation. EM is used when the data contains hidden components or when some values are missing. In contrast, the k-Means (KM) algorithm is a distance-based clustering algorithm that partitions data into a specified number of clusters. Distance-based algorithms rely on the principle that nearby data points are more closely related to one another than to those that are farther away. k-Means works iteratively to minimize the within-cluster variance, measured as the distance of each point to its nearest cluster centroid.
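To make the iterative behavior of k-Means concrete, the following is a minimal pure-Python sketch of the standard k-means iteration (often called Lloyd's algorithm). It is an illustration only and is not how OML4Py implements k-Means in the database. The function names (squared_distance, assign, kmeans) and the sample points are assumptions made up for this example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
        for p in points
    ]


def kmeans(points, k, iterations=10, seed=0):
    """Illustrative k-means: returns (centroids, labels, within-cluster sum of squares)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k distinct points as starting centroids
    for _ in range(iterations):
        labels = assign(points, centroids)  # assignment step
        updated = []
        for c in range(k):  # update step: move each centroid to the mean of its members
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(centroids[c])  # keep an empty cluster's centroid in place
        if updated == centroids:  # converged: no centroid moved
            break
        centroids = updated
    labels = assign(points, centroids)
    wcss = sum(squared_distance(p, centroids[label]) for p, label in zip(points, labels))
    return centroids, labels, wcss


# Made-up sample data: two visibly separated groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels, wcss = kmeans(points, k=2)
print(labels, round(wcss, 3))
```

The iterations argument plays a role similar to a maximum-iteration setting: the loop stops early once the centroids no longer move.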

The k-Means algorithm is chosen because it is simpler than the Expectation-Maximization (EM) algorithm. Since the optimal number of clusters is unknown, start with one cluster and gradually increase the number of clusters. Use the Elbow method to determine the optimal number of clusters.
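The Elbow method is usually applied by eye on a plot of within-cluster variance against the number of clusters, looking for the point where adding clusters stops paying off. As a rough sketch, one common heuristic picks the k whose point lies farthest below the straight line joining the first and last points of that curve. The function name elbow_k and the numbers below are illustrative assumptions, not results from the customer dataset.

```python
def elbow_k(wcss_by_k):
    """Pick the k farthest below the chord joining the first and last (k, WCSS) points.

    wcss_by_k maps each candidate number of clusters to its within-cluster
    sum of squares. This is a heuristic, not a definitive rule.
    """
    ks = sorted(wcss_by_k)
    if len(ks) < 3:
        raise ValueError("need at least three candidate values of k")
    k_first, k_last = ks[0], ks[-1]
    k_span = k_last - k_first
    w_span = wcss_by_k[k_first] - wcss_by_k[k_last]
    if w_span <= 0:
        raise ValueError("WCSS should decrease as k grows")

    best_k, best_gap = k_first, 0.0
    for k in ks:
        # Rescale both axes to [0, 1] so the chord runs from (0, 1) to (1, 0).
        x = (k - k_first) / k_span
        y = (wcss_by_k[k] - wcss_by_k[k_last]) / w_span
        gap = 1.0 - x - y  # how far the point sits below the chord
        if gap > best_gap:
            best_k, best_gap = k, gap
    return best_k


# Illustrative values only: the curve drops steeply up to k = 3, then flattens.
example_wcss = {1: 1000.0, 2: 520.0, 3: 210.0, 4: 180.0, 5: 160.0, 6: 150.0}
print(elbow_k(example_wcss))  # prints 3
```

A heuristic like this is best treated as a starting point; inspecting the curve and the resulting clusters remains the deciding step.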

The script in the following step specifies the model settings and builds a k-Means model object that partitions the data into segments. The settings are provided as key-value pairs in a Python dictionary, where each key is a parameter name and each value is the corresponding setting. The specified settings include KMNS_ITERATIONS and KMNS_DISTANCE. The k-Means algorithm uses the number of clusters (k) together with these settings.

The following steps guide you to build your model with the selected algorithm.

  • Use the oml.km algorithm to build your model and specify the model settings. Run the following script:

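    # Assumes oml is already imported and connected, and that
    # CUSTOMER_DATA_CLEAN was created in the earlier data preparation steps.

    # Drop any existing model with the same name so the build can be rerun.
    try:
        oml.drop(model="CUST_CLUSTER_MODEL")
    except Exception:
        pass  # Ignore the error raised when the model does not exist yet.

    # Model settings as key-value pairs.
    setting = {'KMNS_ITERATIONS': 10,
               'KMNS_DISTANCE': 'KMNS_EUCLIDEAN',
               'KMNS_NUM_BINS': 10,
               'KMNS_DETAILS': 'KMNS_DETAILS_ALL',
               'PREP_AUTO': 'ON'}

    # Start with a single cluster; CUST_ID identifies each case (row).
    km_mod1 = oml.km(n_clusters=1, **setting).fit(
        CUSTOMER_DATA_CLEAN,
        model_name="CUST_CLUSTER_MODEL",
        case_id='CUST_ID')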

    Examine the script:

    • KMNS_ITERATIONS: Specifies the maximum number of allowed iterations, with a default of 20.
    • KMNS_DISTANCE: Specifies the distance function used. The default is the Euclidean distance.
    • KMNS_NUM_BINS: Specifies the number of bins in the attribute histogram produced by k-Means.
    • KMNS_DETAILS: Determines the level of cluster details that are computed during the build. The value KMNS_DETAILS_ALL indicates that the cluster hierarchy, record counts, and descriptive statistics such as variances, modes, histograms, and rules are computed.
    • PREP_AUTO: Enables or disables Automatic Data Preparation. By default, it is enabled with the constant value 'PREP_AUTO': PREP_AUTO_ON, which requires the DBMS_DATA_MINING package. Referencing the constant allows the compiler to validate that the PL/SQL constant name is correct. Alternatively, it can be set with the string value 'PREP_AUTO': 'ON', as in this script.
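The script passes the settings dictionary with the ** operator, which unpacks each key-value pair into a keyword argument. The short sketch below shows that mechanism in isolation. The function build_stub is a hypothetical stand-in that only echoes its arguments; it is not part of OML4Py and builds no model.

```python
def build_stub(n_clusters, **params):
    """Hypothetical stand-in for a model constructor; it only echoes its arguments."""
    return {"n_clusters": n_clusters, "params": params}


setting = {'KMNS_ITERATIONS': 10,
           'KMNS_DISTANCE': 'KMNS_EUCLIDEAN'}

# Same call as build_stub(n_clusters=1, KMNS_ITERATIONS=10, KMNS_DISTANCE='KMNS_EUCLIDEAN').
result = build_stub(n_clusters=1, **setting)
print(result["params"])
```

Keeping the settings in one dictionary makes it straightforward to reuse the same configuration while varying only the number of clusters.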