3.3.2 Explore Data
Once the data is accessible, explore the data to understand and assess the quality of the data. At this stage assess the data to identify data types and noise in the data. Look for missing values and numeric outlier values.
Assess Data Quality
To access database data from Python using OML4Py, you must first create a
oml.DataFrame
proxy object in Python which represents a
database table, view, or query. Create a oml.DataFrame
proxy object
for SUPPLEMENTARY_DEMOGRAPHICS and CUSTOMERS and then merge them by inner join on a
key column, e.g., CUST_ID
. Assess the data to identify data types
and noise in the data. Look for missing values, outlier numeric values, or
inconsistently labeled categorical values.
The following steps help you with the exploratory analysis of the data:
- Run the following script in a
%python
interpreter paragraph to import theoml
modules, the Panda's module, and set the display options:import pandas as pd import matplotlib.pyplot as plt import oml pd.set_option('display.max_rows', 500) pd.set_option('display.max_columns', 500) pd.set_option('display.width', 1000) import warnings warnings.simplefilter(action='ignore', category=FutureWarning)
- Use the
oml.sync
function to create the Python object DEMOGRAPHICS as a proxy for a database table SUPPLEMENTARY_DEMOGRAPHICS. Theoml.sync
function returns anoml.DataFrame
object. Theoml.DataFrame
object returned byoml.sync
is a proxy for the database object.Note:
Only one environment for a given database schema can exist at a time. If"schema=None"
, then objects are created searched in the current user's schema.DEMOGRAPHICS = oml.sync(table = "SUPPLEMENTARY_DEMOGRAPHICS", schema = "SH") z.show(DEMOGRAPHICS.head())
- To determine the number of rows and columns in the
oml.DataFrame
object DEMOGRAPHICS, useDataFrame.shape
.print("Shape:",DEMOGRAPHICS.shape)
(4500, 14)
- Use the
oml.sync
function to create the Python object CUSTOMERS as a proxy for a database table SH.CUSTOMERS. The query argument uses the SQL SELECT statement for selecting columns to include for use through the proxy object.cCUSTOMERS = oml.sync(query = 'SELECT CUST_ID, CUST_GENDER, CUST_MARITAL_STATUS, CUST_YEAR_OF_BIRTH, CUST_INCOME_LEVEL, CUST_CREDIT_LIMIT FROM SH.CUSTOMERS') z.show(CUSTOMERS.head())
- To determine the number of rows and columns in the
oml.DataFrame
object CUSTOMERS, useDataFrame.shape
.print("Shape:",CUSTOMERS.shape)
(55500, 6)
- Create a new
oml.DataFrame
CUSTOMER_DATA by merging the table CUSTOMERS and DEMOGRAPHICS with an inner join on the common column CUST_ID. The merge function joins one oml.DataFrame to another oml.DataFrame. The suffixes parameter is used when the two oml.DataFrame have conflicting column names.CUSTOMER_DATA = CUSTOMERS[["CUST_ID", "CUST_GENDER", "CUST_MARITAL_STATUS", "CUST_YEAR_OF_BIRTH", "CUST_INCOME_LEVEL", "CUST_CREDIT_LIMIT"]].merge(DEMOGRAPHICS[["CUST_ID", "HOUSEHOLD_SIZE","YRS_RESIDENCE", "Y_BOX_GAMES"]], how = "inner", on = 'CUST_ID',suffixes = ["",""])
- To determine the number of rows and columns in the
oml.DataFrame
object CUSTOMER_DATA, use DataFrame.shape.print("Shape:",CUSTOMER_DATA.shape)
Shape: (4500, 9)
- Use the
concat
function to concatenate the new column CUST_AGE in anoml.DataFrame
object CUSTOMER_DATA. The column CUST_AGE contains the age based on the column CUST_YEAR_OF_BIRTH where the year of birth is converted to age in the year 2005. The information in the CUST_YEAR_OF_BIRTH column has been modified and maintained in CUST AGE, so drop the CUST_YEAR_OF_BIRTH column.CUSTOMER_DATA=CUSTOMER_DATA.concat({'CUST_AGE':abs(CUSTOMER_DATA['CUST_YEAR_OF_BIRTH'] -2005)}) CUSTOMER_DATA=CUSTOMER_DATA.drop('CUST_YEAR_OF_BIRTH') CUSTOMER_DATA.head()
- Run the following script to view the data type of each
column.
print("The datatypes of the column: ","\n") print(CUSTOMER_DATA.dtypes)
- To check if there are any missing values run the following script.
The count function returns the number of elements that are not NULL for each
column and the
len()
function returns the number of rows in the dataset.print("Number of missing values in each column is : \n") print(len(CUSTOMER_DATA)-CUSTOMER_DATA.count())
- Use the crosstab method to perform a cross-column analysis of an
oml.DataFrame
object in the database. The crosstab method computes a cross-tabulation of two or more columns. By default, it computes a frequency table for the columns unless a column and an aggregation function have been passed to it. In this example, the crosstab function displays the distribution of unique values of CUST_CREDIT_LIMIT along the x-axis and its occurrence frequency along the y-axis.z.show(CUSTOMER_DATA.crosstab(‘CUST_CREDIT_LIMIT’))
- Use the transparency layer method
describe
to calculate descriptive statistics that summarize the central tendency, dispersion, and shape of the CUSTOMER_DATA table in each numeric column.Note:
All computations are computed in the database and only the result statistics are returned to the Python client, in this case, the notebook. Eliminating the need to move data and using the database as a high-performance compute engine greatly increases scalability.CUSTOMER_DATA.describe()
- Before building the model, it's important to ensure that the data
is clean. Data often contains outliers, which can form separate clusters that
negatively impact model quality.
The following script defines a function, IQR, to calculate the interquartile range for a dataframe. It takes two arguments: SUMMARY_DF (which contains summary statistics of the dataframe, generated using the
Run the script to calculate the interquartile range for the specified columns:describe
method) and a list of features. The IQR function uses a for loop to compute the interquartile range for each feature in the list.def IQR(SUMMARY_DF, features): result = [0]*len(features) for i, feature in enumerate(features): result[i] = abs(SUMMARY_DF[feature]['75%'] - SUMMARY_DF[feature]['25%']) return result print(IQR(CUSTOMER_DATA.describe(),['CUST_AGE', 'CUST_CREDIT_LIMIT', 'YRS_RESIDENCE', 'Y_BOX_GAMES']))
[20.0, 6000.0, 2.0, 1.0]
The user-defined function
The following function removes rows with outliers of a given feature based on quantiles:remove_outlier
uses the interquartile range to find outliers in the data and remove them. In boxplot, outliers are points that lie outside of the upper and lower quartiles by 1.5 times the interquartile range (Q1 - 1.5 * IQR or Q3 + 1.5 * IQR). Another form of outlier treatment is clipping or capping, where more extreme values are replaced with a max or min value, e.g., the 1.5 IRQ values.def remove_outlier(DF, SUMMARY_DF, features): iqrs = IQR(SUMMARY_DF, features) for i, iqr in enumerate(iqrs): H = 1.5*iqr DF = DF[ ( DF[features[i]] > SUMMARY_DF[features[i]]['25%'] - H ) & ( DF[features[i]] < SUMMARY_DF[features[i]]['75%'] + H )] print(DF.shape) return DF CUSTOMER_DATA_CLEAN= remove_outlier(CUSTOMER_DATA, CUSTOMER_DATA.describe(), ['CUST_AGE', 'CUST_CREDIT_LIMIT', 'YRS_RESIDENCE', 'Y_BOX_GAMES']) print("Shape:",CUSTOMER_DATA_CLEAN.shape)
Shape: (4233, 9)
This completes the data understanding and data preparation stage.
Parent topic: Clustering Use Case