8.2.6 Summarize Data
The describe method calculates descriptive statistics that summarize the central tendency, dispersion, and shape of the data in each column.
You can also specify the types of columns to include or exclude from the results.
With the sum and cumsum methods, you can compute the sum and cumulative sum of each Float or Boolean column of an oml.DataFrame.
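The same sum and ordered cumulative sum can be sketched locally with pandas. This is only an analog under stated assumptions: it uses a plain pandas.DataFrame (not an oml.DataFrame) holding the numeric and bytes values from this section, and pandas' sort_values and cumsum stand in for the by and ascending arguments of the oml cumsum method.

```python
import pandas as pd

# Local pandas stand-in for the data in this section (not an oml.DataFrame).
df = pd.DataFrame({
    'numeric': [1, 1.4, -4, 3.145, 5, None],
    'bytes': [b'a', b'b', b'c', b'c', b'd', b'e'],
})

# Column sum; pandas skips missing values by default.
total = df['numeric'].sum()

# Sort by the bytes column in descending order, then accumulate.
ordered = df.sort_values('bytes', ascending=False, kind='stable')
running = ordered['numeric'].cumsum()

print(total)
print(running.tolist())
```

The Boolean column is left out of this sketch on purpose: pandas evaluates a missing value compared with 3 as False, while the listing in this section shows True for that row, so the Boolean totals would not match.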
The describe method supports finding the following statistics:
- Mean, minimum, maximum, median, standard deviation
- Number of non-null values, number of unique values, the most frequent value (top) and its frequency
- Percentiles, specified as fractions between 0 and 1
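As a rough local illustration of these statistics, the sketch below uses the pandas describe method on a plain pandas.DataFrame. It is an analog rather than the oml API: the include, exclude, and percentiles arguments shown are those of pandas, and whether the oml method accepts the same values is an assumption to check against the OML4Py API reference.

```python
import numpy as np
import pandas as pd

# Local pandas stand-in for two of the columns used in this section.
df = pd.DataFrame({
    'numeric': [1, 1.4, -4, 3.145, 5, None],
    'string': [None, None, 'a', 'a', 'a', 'b'],
})

# Summarize every column, whatever its type.
all_stats = df.describe(include='all')

# Summarize only the non-numeric columns.
non_numeric = df.describe(exclude=[np.number])

# Request specific percentiles, given as fractions between 0 and 1.
tails = df['numeric'].describe(percentiles=[0.1, 0.9])

print(all_stats)
print(non_numeric)
print(tails)
```

Numeric columns report count, mean, standard deviation, minimum, percentiles, and maximum, while non-numeric columns report count, unique, top, and freq. Statistics that do not apply to a column appear as NaN.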
Example 8-13 Calculating Descriptive Statistics
The following example demonstrates these operations.
import pandas as pd
import oml
df = pd.DataFrame({'numeric': [1, 1.4, -4, 3.145, 5, None],
                   'string' : [None, None, 'a', 'a', 'a', 'b'],
                   'bytes' : [b'a', b'b', b'c', b'c', b'd', b'e']})
oml_df = oml.push(df, dbtypes = {'numeric': 'BINARY_DOUBLE',
                                 'string':'CHAR(1)',
                                 'bytes':'RAW(1)'})
# Combine a Boolean column with oml_df.
oml_bool = oml_df['numeric'] > 3
oml_df = oml_df.concat(oml_bool)
oml_df.rename({'COL4':'boolean'})
# Describe all of the columns.
oml_df.describe(include='all')
# Exclude Float columns.
oml_df.describe(exclude=[oml.Float])
# Get the sum of values in each Float or Boolean column.
oml_df.sum()
# Find the cumulative sum of values in each Float or Boolean column
# after oml_df is sorted by the bytes column in descending order.
oml_df.cumsum(by = 'bytes', ascending = False)
# Compute the skewness of values in the Float columns.
oml_df.skew()
# Find the median value of Float columns.
oml_df.median()
# Calculate the kurtosis of Float columns.
oml_df.kurtosis()
Listing for This Example
>>> import pandas as pd
>>> import oml
>>>
>>> df = pd.DataFrame({'numeric': [1, 1.4, -4, 3.145, 5, None],
...                    'string' : [None, None, 'a', 'a', 'a', 'b'],
...                    'bytes' : [b'a', b'b', b'c', b'c', b'd', b'e']})
>>>
>>> oml_df = oml.push(df, dbtypes = {'numeric': 'BINARY_DOUBLE',
...                                  'string':'CHAR(1)',
...                                  'bytes':'RAW(1)'})
>>>
>>> # Combine a Boolean column with oml_df.
... oml_bool = oml_df['numeric'] > 3
>>> oml_df = oml_df.concat(oml_bool)
>>> oml_df.rename({'COL4':'boolean'})
  bytes  numeric string  boolean
0  b'a'    1.000   None    False
1  b'b'    1.400   None    False
2  b'c'   -4.000      a    False
3  b'c'    3.145      a     True
4  b'd'    5.000      a     True
5  b'e'      NaN      b     True
>>>
>>> # Describe all of the columns.
... oml_df.describe(include='all')
       bytes   numeric string boolean
count      6  5.000000      4       6
unique     5       NaN      2       2
top     b'c'       NaN      a   False
freq       2       NaN      3       3
mean     NaN  1.309000    NaN     NaN
std      NaN  3.364655    NaN     NaN
min      NaN -4.000000    NaN     NaN
25%      NaN  1.000000    NaN     NaN
50%      NaN  1.400000    NaN     NaN
75%      NaN  3.145000    NaN     NaN
max      NaN  5.000000    NaN     NaN
>>>
>>> # Exclude Float columns.
... oml_df.describe(exclude=[oml.Float])
       bytes string boolean
count      6      4       6
unique     5      2       2
top     b'c'      a   False
freq       2      3       3
>>>
>>> # Get the sum of values in each Float or Boolean column.
... oml_df.sum()
numeric    6.545
boolean    3.000
dtype: float64
>>>
>>> # Find the cumulative sum of values in each Float or Boolean column
... # after oml_df is sorted by the bytes column in descending order.
... oml_df.cumsum(by = 'bytes', ascending = False)
   numeric  boolean
0      NaN        1
1    5.000        2
2    1.000        2
3    4.145        3
4    5.545        3
5    6.545        3
>>>
>>> # Compute the skewness of values in the Float columns.
... oml_df.skew()
numeric -0.683838
dtype: float64
>>>
>>> # Find the median value of Float columns.
... oml_df.median()
numeric 1.4
dtype: float64
>>>
>>> # Calculate the kurtosis of Float columns.
... oml_df.kurtosis()
numeric -0.582684
dtype: float64
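The skewness and kurtosis values in the listing can be cross-checked by hand. The sketch below is a plain-Python check, not part of the oml API. It assumes the population (divide by n) central moments for skewness and excess kurtosis, and the sample (divide by n - 1) form for the standard deviation. These choices are inferred from the numbers in the listing rather than stated in the text, so treat them as an observation about this example only.

```python
import math

# Non-null entries of the numeric column from the example.
values = [1, 1.4, -4, 3.145, 5]

n = len(values)
mean = sum(values) / n


def central_moment(k):
    """Return the k-th central moment using the divisor n."""
    return sum((v - mean) ** k for v in values) / n


m2 = central_moment(2)
m3 = central_moment(3)
m4 = central_moment(4)

# Assumed definitions: population skewness and excess kurtosis.
skewness = m3 / m2 ** 1.5
excess_kurtosis = m4 / m2 ** 2 - 3

# Assumed definition: sample standard deviation (divisor n - 1).
sample_std = math.sqrt(m2 * n / (n - 1))

print(skewness, excess_kurtosis, sample_std)
```

Under these assumptions the three results agree with the skew, kurtosis, and std values shown in the listing to the precision displayed there. Libraries that apply a bias correction to skewness or kurtosis by default will report different numbers for the same data.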