Example: Using Vector Data for Dimensionality Reduction and Clustering
The example demonstrates how to use vector data for dimensionality reduction and clustering, using Principal Component Analysis (PCA) and k-Means.
- Assume that there is a data set called
datavec
containing oneID
column and a vector column with 100 dimensions.Name Null? Type ---------------------- -------- ---------------------------- ID NUMBER PROD_DATA VECTOR(100, FLOAT32, DENSE)
- Build a PCA feature extraction model. The following step creates a model that
uses PCA scoring to reduce dimensionality.
DECLARE v_setlst DBMS_DATA_MINING.SETTING_LIST; BEGIN v_setlst('ALGO_NAME') := 'ALGO_SINGULAR_VALUE_DECOMP'; v_setlst('SVDS_SCORING_MODE') := 'SVDS_SCORING_PCA'; DBMS_DATA_MINING.CREATE_MODEL2( MODEL_NAME => 'pca_model', MINING_FUNCTION => 'FEATURE_EXTRACTION', DATA_QUERY => 'SELECT * FROM DATAVEC', CASE_ID_COLUMN_NAME => 'id', SET_LIST => v_setlst); END; /
- Transform PCA results into a vector table
pca_data
with reduced dimensions by using theVECTOR_EMBEDDING()
operator.CREATE table pca_data as SELECT id, VECTOR_EMBEDDING(pca_model using *) embedding FROM datavec;
- The new
pca_data
contains one ID column and one vector with 10 dimensions based on the data characteristics.DESC pca_data; Name Null? Type ---------------- -------- ---------------------------- ID NUMBER EMBEDDING VECTOR(10, FLOAT64, DENSE)
- Build a k-Means clustering model on
pca_data
, leveraging its reduced dimensions.DECLARE v_setlst DBMS_DATA_MINING.SETTING_LIST; BEGIN v_setlst('ALGO_NAME') := 'ALGO_KMEANS'; v_setlst('KMNS_DETAILS') := 'KMNS_DETAILS_ALL'; v_setlst('CLUS_NUM_CLUSTERS') := '2'; DBMS_DATA_MINING.CREATE_MODEL2( MODEL_NAME => 'km_model', MINING_FUNCTION => 'CLUSTERING', DATA_QUERY => 'SELECT * FROM PCA_DATA', CASE_ID_COLUMN_NAME => 'id', SET_LIST => v_setlst); END; /
- Check the data dictionary settings.
SELECT model_name, attribute_name, data_type, target, vector_info FROM USER_MINING_MODEL_ATTRIBUTES WHERE model_name='KM_MODEL’ ORDER BY attribute_name; MODEL_NAME ATTRIBUTE_NAME DATA_TYPE TAR VECTOR_INFO ---------- --------------- ------------------------ --- -------------------- KM_MODEL EMBEDDING VECTOR NO VECTOR(10,FLOAT64)
- You can check the model detail views for
KM_MODEL
model.SELECT model_name, view_name, view_type FROM USER_MINING_MODEL_VIEWS WHERE model_name='KM_MODEL’ ORDER BY view_name; MODEL_NAME VIEW_NAME VIEW_TYPE ---------- --------------- ---------------------------------------- KM_MODEL DM$VAKM_MODEL Clustering Attribute Statistics KM_MODEL DM$VCKM_MODEL k-Means Scoring Centroids KM_MODEL DM$VDKM_MODEL Clustering Description KM_MODEL DM$VGKM_MODEL Global Name-Value Pairs KM_MODEL DM$VHKM_MODEL Clustering Histograms KM_MODEL DM$VNKM_MODEL Normalization and Missing Value Handling KM_MODEL DM$VRKM_MODEL Clustering Rules KM_MODEL DM$VSKM_MODEL Computed Settings KM_MODEL DM$VWKM_MODEL Model Build Alerts
- You can also view each vector dimension as a predictor from the model
details.
SELECT * FROM(SELECT cluster_id, attribute_name, attribute_subname, mean, variance, mode_value FROM DM$VAKM_MODEL ORDER BY cluster_id, attribute_name,attribute_subname) CLUSTER_ID ATTRIBUTE_NAME ATTRIBUTE_SUBNAME MEAN VARIANCE MODE_VALUE ---------- --------------- -------------------- ------------- ------------- -------------------- 1 EMBEDDING DM$$VEC1 28.9538 3.4382 2 EMBEDDING DM$$VEC1 27.9580 5.5661 3 EMBEDDING DM$$VEC1 29.9495 2.1698
- Use scoring operators
CLUSTER_ID
andCLUSTER_PROBABILITY
to find cluster assignments and probabilities for each record inpca_data
.SELECT id, cluster_id(km_model using *) cluster_id, cluster_probability(km_model using *)probability FROM pca_data ORDER BY id; ID CLUSTER_ID PROBABILITY ---------- ---------- ----------- 1 1 .617 2 2 .584 3 1 .579 4 1 .605 5 1 .621 6 1 .642 7 2 .598 8 2 .614 9 2 .650 10 2 .618