3.2.2 Explore Data
Once the data is accessible, explore the data to understand and assess the quality of the data. At this stage assess the data to identify data types and noise in the data. Look for missing values and numeric outlier values.
To gain a broader understanding of the data and identify potential issues, we will now explore the dataset, focusing on data quality assessment and identifying missing or outlier values.
- To determine the number of rows and columns in the
ore.frame
objectCUSTOMERS
, usedim(CUSTOMERS)
.dim(CUSTOMERS) 55500 6
- To determine the number of rows and columns in the ore.frame object
SUPPLEMENTARY_DEMOGRAPHICS
, usedim(SUPPLEMENTARY_DEMOGRAPHICS)
dim(SUPPLEMENTARY_DEMOGRAPHICS) 4500 4
- Create a new
ore.frame
objectCUST_DF
by merging the tableCUSTOMERS
andSUPPLEMENTARY_DEMOGRAPHICS
with an inner join on the common columnCUST_ID
. The merge function joins oneore.frame
to anotherore.frame
.CUST_DF <- merge(SUPPLEMENTARY_DEMOGRAPHICS,CUSTOMERS, by="CUST_ID")
- To display first 5 rows of
CUST_DF
data run the following code:z.show(head(CUST_DF,5))
- To get the dimensions using
CUST_DF
proxy object, use 'dim':dim(CUST_DF) 4500 9
- To transform the column
CUST_YEAR_OF_BIRTH
toCUST_AGE
in theCUST_DF
proxy object and produce a barplot of the distribution of customer ages, use the following code.Date1 <- format(Sys.Date(), "%Y") Date2 <- as.numeric(Date1) CUST_DF$CUST_AGE <- Date2-CUST_DF$CUST_YEAR_OF_BIRTH CUST_DF$CUST_YEAR_OF_BIRTH <- NULL tbl <- with(CUST_DF, table(CUST_AGE)) barplot(tbl, ylim=c(0,150), ylab = "Number of Customers", xlab = "Customer Age", las=3)
- View the data type of the columns in
CUST_DF
with the @desc operator, which is crucial for understanding to understand your data and perform calculations accurately.CUST_DF@desc
- To check if there are any missing values in the data, run the following code. The following code gives you the total number of missing values in the
CUST_DF
proxy object.sum(is.na(CUST_DF)) 0
The value 0 indicates that there are no missing values in the
CUST_DF
proxy object. - Use the crosstab method to perform a cross-column analysis of the ore.frame object in the database. By default, it computes a frequency table for the columns unless a column and an aggregation function have been passed to it. In this example, the crosstab function displays the distribution of unique values of
CUST_CREDIT_LIMIT
along the x-axis and its occurrence frequency along the y-axis.In the output, click the Bar chart. In the Settings tab, choose “CUST_CREDIT_LIMIT” as the Group By column, and use “Last” as the Aggregate Duplicates function.
ct <- ore.crosstab(~CUST_CREDIT_LIMIT, data=CUST_DF) z.show(ct)
- To compute the statistics of the
CUST_DF
table, use thesummary
function.options(width = 80) summary(subset(CUST_DF, select = -CUST_ID)) CUST_GENDER CUST_MARITAL_STATUS CUST_INCOME_LEVEL Length:4500 Length:4500 Length:4500 Class :ore.character Class :ore.character Class :ore.character Mode :character Mode :character Mode :character CUST_CREDIT_LIMIT HOUSEHOLD_SIZE YRS_RESIDENCE Y_BOX_GAMES Min. : 1500 Length:4500 Min. : 0.000 Min. :0.0000 1st Qu.: 5000 Class :ore.character 1st Qu.: 3.000 1st Qu.:0.0000 Median : 9000 Mode :character Median : 4.000 Median :0.0000 Mean : 7924 Mean : 4.022 Mean :0.3124 3rd Qu.:11000 3rd Qu.: 5.000 3rd Qu.:1.0000 Max. :15000 Max. :14.000 Max. :1.0000 CUST_AGE Min. : 39.00 1st Qu.: 49.00 Median : 59.00 Mean : 60.38 3rd Qu.: 69.00 Max. :112.00
This completes the data understanding stage.
Data Preparation
Before building the model you want to clean the data, if needed. Usually, data can contain outliers that may form a separate cluster, which can affect model quality. The command below defines the function filter_outliers
to calculate the interquartile range for a dataframe object. The function remove_outliers
uses a for loop to compute the interquartile range for the list of features. The user-defined function remove_outliers
uses the interquartile range to find outliers in the data and remove them.
# create filter outliers function
filter_outliers <- function(x) {
# calculate first quantile
Quantile1 <- quantile(x, probs=.25)
# calculate third quantile
Quantile3 <- quantile(x, probs=.75)
# calculate inter quartile range
IQR = Quantile3-Quantile1
# return true or false
x < Quantile3 + (IQR*1.5) & x > Quantile1 - (IQR*1.5)
}
# create remove outliers function
remove_outliers <- function(dataframe,
columns=names(dataframe)) {
# for loop to traverse in columns vector
for (col in columns) {
# remove observation if it satisfies outlier function
dataframe <- dataframe[filter_outliers(dataframe[[col]]),]
}
ore.pull(dataframe)
}
CUST_DF_CLEAN <- remove_outliers(CUST_DF, c('CUST_AGE', 'CUST_CREDIT_LIMIT', 'YRS_RESIDENCE', 'Y_BOX_GAMES'))
CUST_DF_CLEAN <- ore.push(CUST_DF_CLEAN)
dim(CUST_DF_CLEAN)
4233 9
This completes the data preparation stage.
Parent topic: Clustering Use Case