3.3.2 Explore Data

Explore the data to understand its structure and assess its quality. At this stage, identify the data types and detect any noise present. Check for missing values as well as numeric outliers.

Note:

Each record in the database is called a case, and each case is identified by a case_id. Here, the case id is TIME_ID, which serves as the independent variable. You are forecasting sales over evenly spaced time intervals.

The following steps will guide you through exploratory data analysis.

  1. Import libraries

    Run the following command in an R interpreter paragraph (using %r) in a notebook to import the Oracle Machine Learning for R libraries and suppress warnings regarding row ordering. This establishes a default connection to the database for use by OML4R.

    Note:

    The OML4R functions are prefixed with "ore," which refers to the original product name "Oracle R Enterprise."
    library(ORE)
    options(ore.warn.order=FALSE)
  2. Create a DataFrame proxy object on the SH.SALES table

    Use the ore.sync function to create the a proxy object ESM_SALES for the SH.SALES database table. The ore.sync function returns an ore.frame object that represents the table in OML4R.

    This script uses the ore.sync function to create a data.frame proxy object to the table SH.SALES in the object ESM_SALES. It then attaches the synchronized object to the R environment with ore.attach(), allowing direct access to ESM_SALES as if it were a local data frame. Finally, it displays the first few rows of this data using head() combined with ORE’s formatted output function z.show().
    ore.sync(query = c(ESM_SALES = "select * from SH.SALES"))
    ore.attach()
    z.show(head(ESM_SALES))

    The output appears as follows:


    Shows the top 5 rows of sales data.

    This output displays the first 6 rows of the SH.SALES database table, including all available columns. It allows you to quickly inspect a sample of the sales data stored in the Oracle database directly from your OML4R session.

  3. Sales dataset row and column count

    To determine the number of rows and columns in the ore.frame object SALES, use the function dim(SALES).

    This script runs an R command inside a notebook to return the number of rows and columns of the ESM_SALES data frame or table.

    %r
     
    dim(ESM_SALES)

    This image shows the number of rows and columns in the ESM_SALES dataset.

    The output shows the number of rows and columns in the ESM_SALES dataset to understand the shape and scale of the data you are working with in OML4R.

  4. Sales Dataset Column Types

    Use the following code to view the data type of each column.

    This script retrieves and displays metadata for the ESM_SALES object in OML4R, providing details about the structure and columns of the linked Oracle database table.

    
    %r
     
    ESM_SALES@desc
    The output appears as follows:
    The image shows the data type of each column in the ESM_SALES dataset

    The output of ESM_SALES@desc is a structure of the ESM_SALES object. It includes Column names and Data types.

  5. Count of missing values by column

    To check if there are any missing values in the data, run the following code. It returns you the total number of missing values in the ESM_SALES proxy object to assess data completeness.

    %r
     
    sum(is.na(ESM_SALES))

    The output appears as 0, which indicates that there are no missing values in the ESM_SALES proxy object.

  6. Prepare the data for sales forecasting, start by selecting the required columns from the SH.SALES table.

    Use the ore.sync() function to create a proxy object named SALES_TIME_AMOUNT containing the TIME_ID and AMOUNT_SOLD columns..

    This defines a new ore.data.frame object in R called SALES_TIME_AMOUNT by linking it to a query that selects TIME_ID and AMOUNT_SOLD columns from the SH.SALES table in an Oracle database. It helps in-database analytics without fully loading the data into R.

    %r
     
    ore.sync(query = c(SALES_TIME_AMOUNT = "select TIME_ID, AMOUNT_SOLD from SH.SALES"))
    z.show(head(SALES_TIME_AMOUNT)))

    This image shows the first 6 rows of the SALES_TIME_AMOUNT dataset.

    The output shows the formatted table with the first 6 rows of the SALES_TIME_AMOUNT dataset. It includes two columns:
    1. TIME_ID – A time identifier (date).
    2. AMOUNT_SOLD – A numeric value showing the amount sold for each transaction.
  7. Use the dim() function to view the dimensions (number of rows and columns) of the SALES_TIME_AMOUNT proxy object. This helps confirm the size of the dataset retrieved from the SH.SALES table.

    The returns the number of rows and columns in the SALES_TIME_AMOUNT dataset, showing you the shape of the data queried from the database.

    %r
     
    dim(SALES_TIME_AMOUNT)

    The image shows that the shows that the dataset contains 918843 rows and 2 columns.

    The output of dim(SALES_TIME_AMOUNT) shows that the dataset contains 918843 rows and 2 columns. This gives you a quick understanding of the dataset’s size and structure.

This gives you a quick understanding of the dataset’s size and structure.