3.1.2 Explore Data
Explore the data to understand and assess its quality. At this stage, identify data types and noise in the data, and look for missing values and numeric outliers.
Identify Target Variable
For this use case, the task is to train a Support Vector Machine model that predicts which customers are most likely to be positive responders to an Affinity Card loyalty program. Therefore, the target variable is the attribute AFFINITY_CARD.
Data Understanding and Preparation
To access database data from R using OML4R, you must first create an ore.frame proxy object in R that represents a database table, view, or query. In this example, the proxy object is created using a query. Create a proxy object for the SUPPLEMENTARY_DEMOGRAPHICS table, and then assess the data to identify data types and noise. Look for missing values, outlier numeric values, or inconsistently labeled categorical values.
To understand and prepare the data, run the following steps:
- Run the following commands in an R interpreter paragraph (using %r) to import the Oracle Machine Learning for R libraries and to suppress warnings regarding row ordering:
library(ORE)
options(ore.warn.order=FALSE)
- Use the ore.sync function to create the ore.frame object that is a proxy for the SUPPLEMENTARY_DEMOGRAPHICS table in the SH schema:
ore.sync(query = c("SUP_DEM" = "select * from SH.SUPPLEMENTARY_DEMOGRAPHICS"))
ore.attach()
- Run the following command to display the first few rows of the SUPPLEMENTARY_DEMOGRAPHICS table:
z.show(head(SUP_DEM))
- To display the number of rows and columns in the ore.frame object SUP_DEM, use dim:
z.show(dim(SUP_DEM))
(4500, 14)
- View the data types of the columns in SUP_DEM with the @desc operator:
SUP_DEM@desc
- Run the following command to check whether the data contains any missing values. The code returns the total number of missing values in the SUP_DEM proxy object:
sum(is.na(SUP_DEM))
205
The value 205 indicates that there are missing values in the SUP_DEM proxy object.
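The total alone does not show which columns hold the missing values. The following is a minimal sketch in base R of a per-column breakdown. It assumes a small local data.frame, the hypothetical demo_df with invented values, standing in for the SUP_DEM proxy object. Whether these functions behave the same way directly on an ore.frame is not confirmed here, so one option is to pull the data into a local data.frame first (for example with ore.pull) before trying it.

```r
# Hypothetical local stand-in for SUP_DEM; all values are invented for illustration.
demo_df <- data.frame(
  CUST_ID       = c(1, 2, 3, 4, 5),
  EDUCATION     = c("Masters", NA, "HS-grad", "Bach.", NA),
  YRS_RESIDENCE = c(4, 3, NA, 5, 2),
  AFFINITY_CARD = c(0, 1, 0, 0, 1),
  stringsAsFactors = FALSE
)

# Total number of missing values, as in the step above.
total_missing <- sum(is.na(demo_df))

# Missing values per column, to see where the gaps are.
missing_by_column <- colSums(is.na(demo_df))

# Number of rows that contain at least one missing value.
incomplete_rows <- sum(!complete.cases(demo_df))

print(total_missing)
print(missing_by_column)
print(incomplete_rows)
```

A per-column count helps decide whether a gap is confined to one or two attributes or spread across the table, which in turn affects how much missing value treatment matters.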
OML supports Automatic Data Preparation (ADP). ADP is enabled through the model settings. When ADP is enabled, the transformations required by the algorithm are performed automatically and embedded in the model. You can enable ADP during the Build Model stage. The commonly used methods of data preparation are binning, normalization, and missing value treatment.
See How ADP Transforms the Data to understand how ADP prepares the data for some algorithms.
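As a rough illustration of two of the methods named above, the following base R sketch applies missing value treatment and normalization by hand to a hypothetical numeric column. It shows the general techniques only. Mean imputation and min-max scaling are assumptions chosen for the example, and the transformations that ADP actually applies depend on the algorithm and are not reproduced here.

```r
# Hypothetical numeric column with one missing value (invented data).
yrs_residence <- c(4, 3, NA, 5, 2)

# Missing value treatment: replace NA with the mean of the observed values.
# Mean imputation is one common choice, assumed here for illustration.
col_mean <- mean(yrs_residence, na.rm = TRUE)
imputed <- ifelse(is.na(yrs_residence), col_mean, yrs_residence)

# Normalization: min-max scaling to the range [0, 1].
normalized <- (imputed - min(imputed)) / (max(imputed) - min(imputed))

print(imputed)
print(normalized)
```

With ADP enabled, steps of this kind do not need to be coded manually, because the transformations are embedded in the model.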
This completes the data understanding and data preparation stage.