3.1.2 Explore Data

Explore the data to understand it and assess its quality. At this stage, identify the data types and any noise in the data, and look for missing values and numeric outliers.

Identify Target Variable

Data Understanding and Preparation

For this use case, the task is to train a Support Vector Machine model that predicts which customers are most likely to be positive responders to an Affinity Card loyalty program. Therefore, the target variable is the attribute AFFINITY_CARD.
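
Before training a classifier, it can help to check how the target values are distributed. The following is a minimal sketch in base R, assuming a small invented vector (toy_target, a hypothetical name) stands in for the AFFINITY_CARD column. The values are made up for illustration and are not taken from SH.SUPPLEMENTARY_DEMOGRAPHICS.

```r
# Invented 0/1 values standing in for AFFINITY_CARD (not real data).
toy_target <- c(0, 1, 0, 0, 1, 0, 0, 0, 1, 0)

# Count how many rows fall into each class.
counts <- table(toy_target)
print(counts)

# Express the same counts as proportions to see the class balance.
print(round(prop.table(counts), 2))
```

A strongly skewed split between the two classes is worth noting at this stage, because it affects how the model's accuracy should be interpreted later.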

To access database data from R using OML4R, you must first create an ore.frame proxy object in R that represents a database table, view, or query. In this example, the proxy object is created using a query. Create a proxy object for the SUPPLEMENTARY_DEMOGRAPHICS table, and then assess the data to identify data types and noise. Look for missing values, numeric outliers, and inconsistently labeled categorical values.

To understand and prepare the data, run the following steps:

  1. Run the following commands in an R interpreter paragraph (using %r) to import the Oracle Machine Learning for R libraries and to suppress warnings regarding row ordering:
    library(ORE)
    options(ore.warn.order=FALSE)
  2. Use the ore.sync function to create the ore.frame object SUP_DEM, which is a proxy for the SUPPLEMENTARY_DEMOGRAPHICS table in the SH schema:
    ore.sync(query = c("SUP_DEM" = "select * from SH.SUPPLEMENTARY_DEMOGRAPHICS"))
    ore.attach()
  3. Run the following command to display the first few rows of the SUPPLEMENTARY_DEMOGRAPHICS table:
    z.show(head(SUP_DEM))

    Shows the first few rows of SUPPLEMENTARY_DEMOGRAPHICS.

  4. Run the following command to display the number of rows and columns in the ore.frame object SUP_DEM:
    z.show(dim(SUP_DEM))
    (4500, 14)
  5. View the data types of the columns in SUP_DEM with the @desc operator:
    SUP_DEM@desc

    Shows the data types of the columns in the data set.

  6. Run the following command to check whether there are any missing values in the data. The code returns the total number of missing values in the SUP_DEM proxy object.
    sum(is.na(SUP_DEM))
    205

    The value 205 indicates that there are missing values in the SUP_DEM proxy object.

    OML supports Automatic Data Preparation (ADP). ADP is enabled through the model settings. When ADP is enabled, the transformations required by the algorithm are performed automatically and embedded in the model. You can enable ADP during the Build Model stage. The commonly used methods of data preparation are binning, normalization, and missing value treatment.

    See How ADP Transforms the Data to understand how ADP prepares the data for some algorithms.
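
The total in step 6 does not show which columns contain the missing values, and the checks for numeric outliers and inconsistent category labels are not shown in the steps above. The following is a minimal sketch of those checks in base R, assuming a small local data.frame (toy, a hypothetical name) with invented values stands in for SUP_DEM. The 1.5 * IQR rule used here is one common convention for flagging outliers, not a method prescribed by this use case. Whether each call also works directly on an ore.frame proxy is not assumed here; a sample can be brought into local memory with ore.pull if needed.

```r
# Toy local data.frame standing in for SUP_DEM (all values are invented).
toy <- data.frame(
  CUST_ID       = 1:8,
  YRS_RESIDENCE = c(3, 4, NA, 5, 4, 3, 40, 4),
  EDUCATION     = c("Bach.", "HS-grad", NA, "Bach.", "bach.",
                    "Masters", "HS-grad", NA),
  AFFINITY_CARD = c(0, 1, 0, 0, 1, 0, 0, 1),
  stringsAsFactors = FALSE
)

# Missing values per column, rather than a single total.
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Flag numeric outliers with the 1.5 * IQR rule.
yrs <- toy$YRS_RESIDENCE
q <- quantile(yrs, c(0.25, 0.75), na.rm = TRUE)
fence <- 1.5 * (q[2] - q[1])
is_outlier <- !is.na(yrs) & (yrs < q[1] - fence | yrs > q[2] + fence)
print(toy[is_outlier, c("CUST_ID", "YRS_RESIDENCE")])

# List distinct labels to spot inconsistent spellings such as "Bach." and "bach.".
print(sort(unique(toy$EDUCATION)))
```

In this toy data, the per-column counts point to EDUCATION and YRS_RESIDENCE as the columns with missing values, and the row with 40 years of residence is flagged for review.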

This completes the data understanding and data preparation stage.