1.6 Example: Using Vector Data for Dimensionality Reduction and Clustering

The example demonstrates how to use vector data for dimensionality reduction and clustering, using Principal Component Analysis (PCA) and k-Means.

  1. Assume that there is a data set called datavec containing one ID column and a vector column with 100 dimensions.
    Name                   Null?    Type
    ---------------------- -------- ----------------------------
    ID                              NUMBER
    PROD_DATA                       VECTOR(100, FLOAT32, DENSE)
  2. Build a PCA feature extraction model. The following step creates a model that uses PCA scoring to reduce dimensionality.
    DECLARE
      v_setlst DBMS_DATA_MINING.SETTING_LIST;
    BEGIN
      v_setlst('ALGO_NAME')         := 'ALGO_SINGULAR_VALUE_DECOMP';
      v_setlst('SVDS_SCORING_MODE') := 'SVDS_SCORING_PCA';
      
      DBMS_DATA_MINING.CREATE_MODEL2(
            MODEL_NAME          => 'pca_model',
            MINING_FUNCTION     => 'FEATURE_EXTRACTION',
            DATA_QUERY          => 'SELECT * FROM DATAVEC',
            CASE_ID_COLUMN_NAME => 'id',
            SET_LIST            => v_setlst);    
    END;
    /
  3. Transform PCA results into a vector table pca_data with reduced dimensions by using the VECTOR_EMBEDDING() operator.
    CREATE table pca_data as SELECT id, VECTOR_EMBEDDING(pca_model using *) embedding FROM datavec;
  4. The new pca_data contains one ID column and one vector with 10 dimensions based on the data characteristics.
    DESC pca_data;
    Name             Null?    Type
    ---------------- -------- ----------------------------
    ID                    	  NUMBER
    EMBEDDING                 VECTOR(10, FLOAT64, DENSE)
  5. Build a k-Means clustering model on pca_data, leveraging its reduced dimensions.
    DECLARE
      v_setlst DBMS_DATA_MINING.SETTING_LIST;
    BEGIN
      v_setlst('ALGO_NAME')         := 'ALGO_KMEANS';
      v_setlst('KMNS_DETAILS')      := 'KMNS_DETAILS_ALL';
      v_setlst('CLUS_NUM_CLUSTERS') := '2';
    
      DBMS_DATA_MINING.CREATE_MODEL2(
            MODEL_NAME          => 'km_model',
            MINING_FUNCTION     => 'CLUSTERING',
            DATA_QUERY          => 'SELECT * FROM PCA_DATA',
            CASE_ID_COLUMN_NAME => 'id',
            SET_LIST            => v_setlst);    
    END;
    /
  6. Check the data dictionary settings.
    SELECT model_name, attribute_name, data_type, target, vector_info 
    FROM   USER_MINING_MODEL_ATTRIBUTES
    WHERE  model_name='KM_MODEL’ ORDER BY attribute_name;
     
    MODEL_NAME ATTRIBUTE_NAME  DATA_TYPE	        TAR VECTOR_INFO
    ---------- --------------- ------------------------ --- --------------------
    KM_MODEL   EMBEDDING	   VECTOR		         NO  VECTOR(10,FLOAT64)
  7. You can check the model detail views for KM_MODEL model.
    SELECT model_name, view_name, view_type
    FROM USER_MINING_MODEL_VIEWS
    WHERE model_name='KM_MODEL’ ORDER BY view_name; 
     
    MODEL_NAME VIEW_NAME	     VIEW_TYPE
    ---------- --------------- ----------------------------------------
    KM_MODEL   DM$VAKM_MODEL   Clustering Attribute Statistics
    KM_MODEL   DM$VCKM_MODEL   k-Means Scoring Centroids
    KM_MODEL   DM$VDKM_MODEL   Clustering Description
    KM_MODEL   DM$VGKM_MODEL   Global Name-Value Pairs
    KM_MODEL   DM$VHKM_MODEL   Clustering Histograms
    KM_MODEL   DM$VNKM_MODEL   Normalization and Missing Value Handling
    KM_MODEL   DM$VRKM_MODEL   Clustering Rules
    KM_MODEL   DM$VSKM_MODEL   Computed Settings
    KM_MODEL   DM$VWKM_MODEL   Model Build Alerts 
  8. You can also view each vector dimension as a predictor from the model details.
    SELECT * FROM(SELECT cluster_id, attribute_name, attribute_subname, 
           mean, variance, mode_value 
    FROM DM$VAKM_MODEL ORDER BY cluster_id, attribute_name,attribute_subname) 
    
    CLUSTER_ID ATTRIBUTE_NAME  ATTRIBUTE_SUBNAME    MEAN	        VARIANCE       MODE_VALUE
    ---------- --------------- -------------------- ------------- ------------- --------------------
             1 EMBEDDING	      DM$$VEC1             28.9538       3.4382
             2 EMBEDDING	      DM$$VEC1             27.9580       5.5661
             3 EMBEDDING	      DM$$VEC1             29.9495       2.1698
  9. Use scoring operators CLUSTER_ID and CLUSTER_PROBABILITY to find cluster assignments and probabilities for each record in pca_data.
    SELECT id, cluster_id(km_model using *) cluster_id, cluster_probability(km_model using *)probability FROM pca_data ORDER BY id;
    
    ID 		CLUSTER_ID      PROBABILITY
    ----------      ----------      -----------
    1                1              .617
    2                2              .584
    3                1              .579
    4                1              .605
    5                1              .621
    6                1              .642
    7                2              .598
    8                2              .614
    9                2              .650
    10               2              .618