3.3.2 Explore Data

Once the data is accessible, explore the data to understand and assess the quality of the data. At this stage assess the data to identify data types and noise in the data. Look for missing values and numeric outlier values.

Assess Data Quality

To access database data from Python using OML4Py, you must first create a oml.DataFrame proxy object in Python which represents a database table, view, or query. Create a oml.DataFrame proxy object for SUPPLEMENTARY_DEMOGRAPHICS and CUSTOMERS and then merge them by inner join on a key column, e.g., CUST_ID. Assess the data to identify data types and noise in the data. Look for missing values, outlier numeric values, or inconsistently labeled categorical values.

The following steps help you with the exploratory analysis of the data:

  1. Run the following script in a %python interpreter paragraph to import the oml modules, the Panda's module, and set the display options:
    import pandas as pd
    import matplotlib.pyplot as plt
    import oml
      
    pd.set_option('display.max_rows', 500)
    pd.set_option('display.max_columns', 500)
    pd.set_option('display.width', 1000)
    
    import warnings
    warnings.simplefilter(action='ignore', category=FutureWarning)
  2. Use the oml.sync function to create the Python object DEMOGRAPHICS as a proxy for a database table SUPPLEMENTARY_DEMOGRAPHICS. The oml.sync function returns an oml.DataFrame object. The oml.DataFrame object returned by oml.sync is a proxy for the database object.

    Note:

    Only one environment for a given database schema can exist at a time. If "schema=None", then objects are created searched in the current user's schema.
    DEMOGRAPHICS = oml.sync(table = "SUPPLEMENTARY_DEMOGRAPHICS", schema = "SH")
    z.show(DEMOGRAPHICS.head())

    Shows data for demographics

  3. To determine the number of rows and columns in the oml.DataFrame object DEMOGRAPHICS, use DataFrame.shape.
    print("Shape:",DEMOGRAPHICS.shape)
    (4500, 14)
  4. Use the oml.sync function to create the Python object CUSTOMERS as a proxy for a database table SH.CUSTOMERS. The query argument uses the SQL SELECT statement for selecting columns to include for use through the proxy object.c
    CUSTOMERS = oml.sync(query = 'SELECT CUST_ID, CUST_GENDER, CUST_MARITAL_STATUS, CUST_YEAR_OF_BIRTH, CUST_INCOME_LEVEL, CUST_CREDIT_LIMIT FROM SH.CUSTOMERS')
    z.show(CUSTOMERS.head())

    Shows data for customers

  5. To determine the number of rows and columns in the oml.DataFrame object CUSTOMERS, use DataFrame.shape.
    print("Shape:",CUSTOMERS.shape)
    (55500, 6)
  6. Create a new oml.DataFrame CUSTOMER_DATA by merging the table CUSTOMERS and DEMOGRAPHICS with an inner join on the common column CUST_ID. The merge function joins one oml.DataFrame to another oml.DataFrame. The suffixes parameter is used when the two oml.DataFrame have conflicting column names.
    CUSTOMER_DATA = CUSTOMERS[["CUST_ID", "CUST_GENDER", "CUST_MARITAL_STATUS", "CUST_YEAR_OF_BIRTH", "CUST_INCOME_LEVEL", "CUST_CREDIT_LIMIT"]].merge(DEMOGRAPHICS[["CUST_ID", "HOUSEHOLD_SIZE","YRS_RESIDENCE", "Y_BOX_GAMES"]], how = "inner", on = 'CUST_ID',suffixes = ["",""])
    
  7. To determine the number of rows and columns in the oml.DataFrame object CUSTOMER_DATA, use DataFrame.shape.
    print("Shape:",CUSTOMER_DATA.shape)
    Shape: (4500, 9)
  8. Use the concat function to concatenate the new column CUST_AGE in an oml.DataFrame object CUSTOMER_DATA. The column CUST_AGE contains the age based on the column CUST_YEAR_OF_BIRTH where the year of birth is converted to age in the year 2005. The information in the CUST_YEAR_OF_BIRTH column has been modified and maintained in CUST AGE, so drop the CUST_YEAR_OF_BIRTH column.
    CUSTOMER_DATA=CUSTOMER_DATA.concat({'CUST_AGE':abs(CUSTOMER_DATA['CUST_YEAR_OF_BIRTH'] -2005)})
    CUSTOMER_DATA=CUSTOMER_DATA.drop('CUST_YEAR_OF_BIRTH')
    CUSTOMER_DATA.head()

    shows customer data

  9. Run the following script to view the data type of each column.
    print("The datatypes of the column: ","\n")
    print(CUSTOMER_DATA.dtypes)

    Shows data types.

  10. To check if there are any missing values run the following script. The count function returns the number of elements that are not NULL for each column and the len() function returns the number of rows in the dataset.
    print("Number of missing values in each column is : \n")
    print(len(CUSTOMER_DATA)-CUSTOMER_DATA.count())

    Shows the missing values.

  11. Use the crosstab method to perform a cross-column analysis of an oml.DataFrame object in the database. The crosstab method computes a cross-tabulation of two or more columns. By default, it computes a frequency table for the columns unless a column and an aggregation function have been passed to it. In this example, the crosstab function displays the distribution of unique values of CUST_CREDIT_LIMIT along the x-axis and its occurrence frequency along the y-axis.
    z.show(CUSTOMER_DATA.crosstab(‘CUST_CREDIT_LIMIT’))

    Cross tab for customer limit.

  12. Use the transparency layer method describe to calculate descriptive statistics that summarize the central tendency, dispersion, and shape of the CUSTOMER_DATA table in each numeric column.

    Note:

    All computations are computed in the database and only the result statistics are returned to the Python client, in this case, the notebook. Eliminating the need to move data and using the database as a high-performance compute engine greatly increases scalability.
    CUSTOMER_DATA.describe()

    Describe

  13. Before building the model, it's important to ensure that the data is clean. Data often contains outliers, which can form separate clusters that negatively impact model quality.

    The following script defines a function, IQR, to calculate the interquartile range for a dataframe. It takes two arguments: SUMMARY_DF (which contains summary statistics of the dataframe, generated using the describe method) and a list of features. The IQR function uses a for loop to compute the interquartile range for each feature in the list.

    Run the script to calculate the interquartile range for the specified columns:
    def IQR(SUMMARY_DF, features):
        result = [0]*len(features)   
        for i, feature in enumerate(features):
            result[i] = abs(SUMMARY_DF[feature]['75%'] - SUMMARY_DF[feature]['25%'])
        return result
     
    print(IQR(CUSTOMER_DATA.describe(),['CUST_AGE', 'CUST_CREDIT_LIMIT', 'YRS_RESIDENCE', 'Y_BOX_GAMES']))
    
    [20.0, 6000.0, 2.0, 1.0]

    The user-defined function remove_outlier uses the interquartile range to find outliers in the data and remove them. In boxplot, outliers are points that lie outside of the upper and lower quartiles by 1.5 times the interquartile range (Q1 - 1.5 * IQR or Q3 + 1.5 * IQR). Another form of outlier treatment is clipping or capping, where more extreme values are replaced with a max or min value, e.g., the 1.5 IRQ values.

    The following function removes rows with outliers of a given feature based on quantiles:
    def remove_outlier(DF, SUMMARY_DF, features):
        iqrs = IQR(SUMMARY_DF, features)   
        for i, iqr in enumerate(iqrs):
             H = 1.5*iqr
             DF = DF[ ( DF[features[i]] > SUMMARY_DF[features[i]]['25%'] - H ) & ( DF[features[i]] < SUMMARY_DF[features[i]]['75%'] + H )]
        print(DF.shape)
        return DF
         
    CUSTOMER_DATA_CLEAN= remove_outlier(CUSTOMER_DATA, CUSTOMER_DATA.describe(), ['CUST_AGE', 'CUST_CREDIT_LIMIT', 'YRS_RESIDENCE', 'Y_BOX_GAMES'])
     
    print("Shape:",CUSTOMER_DATA_CLEAN.shape)
    Shape: (4233, 9)

This completes the data understanding and data preparation stage.