Vectorize Relational Tables Using OML Feature Extraction Algorithms
This example shows you how to use OML's Feature Extraction algorithms in conjunction with the VECTOR_EMBEDDING() operator to vectorize sets of relational data, build similarity indexes, and perform similarity searches on the resulting vectors.
Feature Extraction algorithms extract the most informative features from the data. They reduce the dimensionality of large data sets by identifying the principal components that capture the most variance in the data. This reduction simplifies the data set while retaining the most important information, making it easier to analyze correlations and redundancies in the data.
The Principal Component Analysis (PCA) algorithm, a widely used dimensionality reduction technique in machine learning, is used in this tutorial.
Note:
This example uses customer bank marketing data available at https://archive.ics.uci.edu/dataset/222/bank+marketing.
The relational data table includes a mix of numeric and categorical columns. It has more than 4000 records.
SELECT column_name, data_type
FROM user_tab_columns
WHERE table_name = 'BANK'
ORDER BY data_type, column_name;
Output:
COLUMN_NAME          DATA_TYPE
-------------------- --------------------
AGE                  NUMBER
CAMPAIGN             NUMBER
CONS_CONF_IDX        NUMBER
CONS_PRICE_IDX       NUMBER
DURATION             NUMBER
EMP_VAR_RATE         NUMBER
EURIBOR3M            NUMBER
ID                   NUMBER
NR_EMPLOYED          NUMBER
PDAYS                NUMBER
PREVIOUS             NUMBER
CONTACT              VARCHAR2
CREDIT_DEFAULT       VARCHAR2
DAY_OF_WEEK          VARCHAR2
EDUCATION            VARCHAR2
HOUSING              VARCHAR2
JOB                  VARCHAR2
LOAN                 VARCHAR2
MARITAL              VARCHAR2
MONTH                VARCHAR2
POUTCOME             VARCHAR2
Y                    VARCHAR2
To perform a similarity search, you need to vectorize the relational data. To do so, you can first use an OML Feature Extraction algorithm to project the data onto a more compact numeric space. In this example, you configure the Singular Value Decomposition (SVD) algorithm to perform a Principal Component Analysis (PCA) projection of the original data table. The number of features (5 in this case) is specified in the settings table. This value determines how many principal features, or columns, are retained after the dimensionality reduction process. Each of these columns represents a direction in the feature space along which the data varies the most.
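As a minimal sketch of the configuration described above, the following PL/SQL creates a settings table, selects the SVD algorithm with PCA scoring, requests 5 features, and builds the model. The settings table name mod_sett and the model name pcamod are illustrative assumptions, as is the choice of ID as the case identifier. Adjust them to your schema.

```sql
-- Settings table for the model (the name mod_sett is an assumption for this sketch)
CREATE TABLE mod_sett (
  setting_name  VARCHAR2(30),
  setting_value VARCHAR2(4000)
);

BEGIN
  -- Use the SVD algorithm
  INSERT INTO mod_sett VALUES (DBMS_DATA_MINING.ALGO_NAME,
                               DBMS_DATA_MINING.ALGO_SINGULAR_VALUE_DECOMP);
  -- Score in PCA mode, so that rows are projected onto principal components
  INSERT INTO mod_sett VALUES (DBMS_DATA_MINING.SVDS_SCORING_MODE,
                               DBMS_DATA_MINING.SVDS_SCORING_PCA);
  -- Retain 5 features after dimensionality reduction
  INSERT INTO mod_sett VALUES (DBMS_DATA_MINING.FEAT_NUM_FEATURES, '5');
  COMMIT;
END;
/

BEGIN
  -- Build the feature extraction model on the BANK table
  -- (the model name pcamod is an assumption for this sketch)
  DBMS_DATA_MINING.CREATE_MODEL(
    model_name          => 'pcamod',
    mining_function     => DBMS_DATA_MINING.FEATURE_EXTRACTION,
    data_table_name     => 'BANK',
    case_id_column_name => 'ID',
    settings_table_name => 'mod_sett');
END;
/

-- Project a few rows onto the reduced space as vectors
SELECT id, VECTOR_EMBEDDING(pcamod USING *) AS embedding
FROM bank
FETCH FIRST 3 ROWS ONLY;
```

With these settings, each returned embedding should have 5 dimensions, matching the FEAT_NUM_FEATURES value.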
Because you need to use the DBMS_DATA_MINING package to create the model, you need the CREATE MINING MODEL privilege in addition to the other privileges relevant to vector indexes and similarity search. For more information about the CREATE MINING MODEL privilege, see Oracle Machine Learning for SQL User’s Guide.
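For illustration, a suitably privileged user might grant the privilege as follows. The user name vector_user is a hypothetical placeholder. Substitute the schema that will create the model.

```sql
-- Run as a user allowed to grant system privileges.
-- vector_user is a hypothetical schema name used only for this sketch.
GRANT CREATE MINING MODEL TO vector_user;
```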