3.4.2 Explore Data

Explore the data to understand and assess the quality of the data. At this stage assess the data to identify data types and noise in the data. Look for missing values and numeric outlier values.

Note:

Each record in the database is called a case and each case is identified by a case_id. Here, the case id is TIME_ID, which is an independent variable. You are forecasting the sales for evenly spaced time.

The following steps help you with exploratory analysis of the data.

  1. Import libraries

    Run the following script in a %python interpreter paragraph to import the oml modules, the Panda's module, and set the display options:

    import oml
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
      
    pd.set_option('display.max_rows', 500)
    pd.set_option('display.max_columns', 500)
    pd.set_option('display.width', 1000)
    
    import warnings
    warnings.simplefilter(action='ignore', category=FutureWarning)
  2. Create a DataFrame proxy object on the SH.SALES table

    Use the oml.sync function to create the Python object SALES as a proxy for a database table SALES. The oml.sync function returns an oml.DataFrame object.

    Note:

    Only one environment for a given database schema can exist at a time. If "schema=None", then objects are created searched in the current user's schema.
    SALES = oml.sync(table = "SALES", schema = "SH") 
    z.show(SALES.head())

    Shows the top 5 rows of sales data.

  3. Sales dataset row and column Count

    To determine the number of rows and columns in the oml.DataFrame object SALES, use DataFrame.shape.

    print(f"Rows: {SALES.shape[0]}, Columns: SALES.shape[1]}")
    Rows: 918843, Columns: 7
  4. Sales Dataset Column Types

    Run the following script to view the data type of each column.
    print(f"Data types of each column in the Sales dataset:\n{SALES.dtypes}")

    Shows the datatypes of each column.

  5. Count of missing values by column

    To check if there are any missing values run the following script. The count function returns the number of elements that are not NULL for each column and the len() function returns the number of rows in the dataset.

    print("Number of missing values in each column is : \n")
    print(len(SALES)-SALES.count())

    Shows the count of missing values in each column.

  6. Prepare data to forecast sales by selecting needed columns and view content

    Now, prepare a Python proxy object called ESM_SH_DATA by selecting the necessary columns from SH.SALES table. For this use case, select TIME_ID and AMOUNT_SOLD.

    ESM_SH_DATA= SALES[['TIME_ID', 'AMOUNT_SOLD']]
    z.show(ESM_SH_DATA.head())

    Shows the top 5 rows of ESM_SH_DATA dataset.

  7. ESM_SH_DATA rows and columns

    Determine the shape of ESM_SH_DATA:

    print(f"Rows: {ESM_SH_DATA.shape[0]}, Columns: ESM_SH_DATA.shape[1]}")
    Rows: 918843, Columns: 2

This completes the data exploration stage.