3.1 Data Requirements
Understand how data is stored and viewed for Oracle Machine Learning.
Machine learning activities require data that is defined within a single table or view. The information for each record must be stored in a separate row. The data records are commonly called cases. Each case can optionally be identified by a unique case ID. The table or view itself can be referred to as a case table.
The CUSTOMERS
table in the
SH
schema is an example of a table that
could be used for machine learning. All the
information for each customer is contained in a single row. The
case ID is the CUST_ID
column. The rows listed in
the following example are
selected from SH.CUSTOMERS
.
Note:
Oracle Machine Learning requires single-record case data for all types of models except association models, which can be built on native transactional data.
Example 3-1 Sample Case Table
select cust_id, cust_gender, cust_year_of_birth, cust_main_phone_number from sh.customers where cust_id < 11;
The output is as follows:
CUST_ID CUST_GENDER CUST_YEAR_OF_BIRTH CUST_MAIN_PHONE_NUMBER
------- ----------- ---- ------------- -------------------------
1 M 1946 127-379-8954
2 F 1957 680-327-1419
3 M 1939 115-509-3391
4 M 1934 577-104-2792
5 M 1969 563-667-7731
6 F 1925 682-732-7260
7 F 1986 648-272-6181
8 F 1964 234-693-8728
9 F 1936 697-702-2618
10 F 1947 601-207-4099
- Column Data Types
Understand the different types of column data in a case table. - Vector Data Type
You can provideVECTOR
data as input to Oracle Machine Learning in-database algorithms to complement other structured data or be used alone. The vector data type is supported for clustering, classification, anomaly detection, and feature extraction. - Data Sets for Classification and Regression
Understand how data sets are used for training and testing the model. - Scoring Requirements
Learn how scoring is done in Oracle Machine Learning for SQL.
Related Topics
Parent topic: Prepare the Data
3.1.1 Column Data Types
Understand the different types of column data in a case table.
The columns of the case table hold the attributes that describe each case. In Example 3-1, the attributes are: CUST_GENDER
, CUST_YEAR_OF_BIRTH
, and CUST_MAIN_PHONE_NUMBER
. The attributes are the predictors in a supervised model or the descriptors in an unsupervised model. The case ID, CUST_ID
, can be viewed as a special attribute; it is not a predictor or a descriptor.
OML4SQL supports standard Oracle data types except DATE
, TIMESTAMP
, RAW
, and LONG
. Oracle Machine Learning supports date type (datetime, date, timestamp) for case_id, CLOB
/BLOB
/FILE
that are interpreted as text columns, and the following collection types as well:
DM_NESTED_CATEGORICALS
DM_NESTED_NUMERICALS
DM_NESTED_BINARY_DOUBLES
DM_NESTED_BINARY_FLOATS
Note:
The attributes with the data type BOOLEAN
are treated as numeric with the following values: TRUE
means 1
, FALSE
means 0
, and NULL
is interpreted as an unknown value. The CASE_ID_COLUMN_NAME
attribute does not support BOOLEAN
data type.
Parent topic: Data Requirements
3.1.2 Vector Data Type
You can provide
VECTOR
data as input to Oracle Machine Learning
in-database algorithms to complement other structured data or be used alone. The vector data
type is supported for clustering, classification, anomaly detection, and feature extraction.
While dense vectors with arbitrary precision and dimensions are supported, in a flex vector column, precision may differ and dimensions within a single vector column must remain consistent. Errors are raised for mismatched dimensions.
Partitioned models track vector dimensions alongside partition statistics. Different partitions can have different vector dimensions, however, dimensions must remain consistent within a single partition. Errors are raised for mismatched dimensions within a single partition.
The system supports FLOAT32
, FLOAT64
, and
INT8
as datatypes. Vectors with FLEX
dimension and
precision are supported. These features can be used in combination with the other data
types supported by OML (numerical, categorical, nested, and text).
Scoring with Vectors
The system treats each vector dimension as an individual predictor and
provides model details at the vector component level, labeled as
DM$$VECxxx
, where xxx
represents the
component's position. For example, DM$$VEC1
. During scoring, the
system matches vector dimensions between the model and input data at compile time or
runtime, raising errors if mismatches occur. A vector cannot be a target or a
case_id
column, errors are raised if you set vector as a target
or case_id
.
- analytic scoring with vectors, analytic scoring operators skip the vector inputs without displaying any error.
- sparse vectors, raises an error that the format is not supported if sparse vectors are identified. To learn more about sparse vectors, see Create Tables Using the VECTOR Data Type.
- binary vector precision, raises an error that the format is not supported
OML supports the vector data type for the following algorithms and the scoring operators:
Technique | Algorithms | Scoring Operator |
---|---|---|
Classification or Regression | SVM, Neural Network, GLM | PREDICTION ,
PREDICTION_PROBABILITY ,
PREDICTION_SET ,
PREDICTION_BOUNDS |
Anomaly Detection | One-class SVM, Expectation Maximization | PREDICTION ,
PREDICTION_PROBABILITY ,
PREDICTION_SET |
Clustering | k-Means, Expectation Maximization | CLUSTER_ID ,
CLUSTER_PROBABILITY ,
CLUSTER_SET ,
CLUSTER_DISTANCE |
Feature Extraction | SVD, PCA | FEATURE_ID ,
FEATURE_VALUE , FEATURE_SET ,
VECTOR_EMBEDDING |
See Example: Using Vector Data for Dimensionality Reduction and Clustering for more details.
Parent topic: Data Requirements
3.1.3 Data Sets for Classification and Regression
Understand how data sets are used for training and testing the model.
You need two case tables to build and validate classification and regression models. One set of rows is used for training the model, another set of rows is used for testing the model. It is often convenient to derive the build data and test data from the same data set. For example, you could randomly select 60% of the rows for training the model; the remaining 40% could be used for testing the model.
Models that implement other machine learning functions, such as attribute importance, clustering, association, or feature extraction, do not use separate test data.
Parent topic: Data Requirements
3.1.4 Scoring Requirements
Learn how scoring is done in Oracle Machine Learning for SQL.
Most machine learning models can be applied to separate data in a process known as scoring. Oracle Machine Learning for SQL supports the scoring operation for classification, regression, anomaly detection, clustering, and feature extraction.
The scoring process matches column names in the scoring data with the names of the columns that were used to build the model. The scoring process does not require all the columns to be present in the scoring data. If the data types do not match, OML4SQL attempts to perform type coercion. For example, if a column called PRODUCT_RATING
is VARCHAR2
in the training data but NUMBER
in the scoring data, OML4SQL effectively applies a TO_CHAR()
function to convert it.
The column in the test or scoring data must undergo the same transformations as the corresponding column in the build data. For example, if the AGE
column in the build data was transformed from numbers to the values CHILD
, ADULT
, and SENIOR
, then the AGE
column in the scoring data must undergo the same transformation so that the model can properly evaluate it.
Note:
OML4SQL can embed user-specified transformation instructions in the model and reapply them whenever the model is applied. When the transformation instructions are embedded in the model, you do not need to specify them for the test or scoring data sets.
OML4SQL also supports Automatic Data Preparation (ADP). When ADP is enabled, the transformations required by the algorithm are performed automatically and embedded in the model along with any user-specified transformations.
See Also:
Automatic Data Preparation and Embed Transformations in a Model for more information on automatic and embedded data transformations
Parent topic: Data Requirements