8.2.3 Cross-Tabulate Data
Use the crosstab
method to perform cross-column analysis of an oml.DataFrame
object and the pivot_table
method to convert an oml.DataFrame
to a spreadsheet-style pivot table.
Cross-tabulation is a statistical technique that finds an interdependent relationship between two columns of values. The crosstab
method computes a cross-tabulation of two or more columns. By default, it computes a frequency table for the columns unless a column and an aggregation function have been passed to it.
The pivot_table
method converts a data set into a pivot table. Due to the database 1000 column limit, pivot tables with more than 1000 columns are automatically truncated to display the categories with the most entries for each column value.
For details about the method arguments, invoke help(oml.DataFrame.crosstab)
or help(oml.DataFrame.pivot_table)
, or see Oracle Machine Learning for Python API Reference.
Example 8-11 Producing Cross-Tabulation and Pivot Tables
This example demonstrates the use of the crosstab
and pivot_table
methods.
import pandas as pd
import oml
x = pd.DataFrame({
'GENDER': ['M', 'M', 'F', 'M', 'F', 'M', 'F', 'F',
None, 'F', 'M', 'F'],
'HAND': ['L', 'R', 'R', 'L', 'R', None, 'L', 'R',
'R', 'R', 'R', 'R'],
'SPEED': [40.5, 30.4, 60.8, 51.2, 54, 29.3, 34.1,
39.6, 46.4, 12, 25.3, 37.5],
'ACCURACY': [.92, .94, .87, .9, .85, .97, .96, .93,
.89, .84, .91, .95]
})
x = oml.push(x)
# Find the categories that the most entries belonged to.
x.crosstab('GENDER', 'HAND').sort_values('count', ascending=False)
# For each gender value and across all entries, find the ratio of entries
# with different hand values.
x.crosstab('GENDER', 'HAND', pivot = True, margins = True, normalize = 0)
# Find the mean speed across all gender and hand combinations.
x.pivot_table('GENDER', 'HAND', 'SPEED')
# Find the median accuracy and speed for every gender and hand combination.
x.pivot_table('GENDER', 'HAND', aggfunc = oml.DataFrame.median)
# Find the max and min speeds for every gender and hand combination and
# across all combinations.
x.pivot_table('GENDER', 'HAND', 'SPEED',
aggfunc = [oml.DataFrame.max, oml.DataFrame.min],
margins = True)
Listing for This Example
>>> import pandas as pd
>>> import oml
>>>
>>> x = pd.DataFrame({
... 'GENDER': ['M', 'M', 'F', 'M', 'F', 'M', 'F', 'F',
... None, 'F', 'M', 'F'],
... 'HAND': ['L', 'R', 'R', 'L', 'R', None, 'L', 'R',
... 'R', 'R', 'R', 'R'],
... 'SPEED': [40.5, 30.4, 60.8, 51.2, 54, 29.3, 34.1,
... 39.6, 46.4, 12, 25.3, 37.5],
... 'ACCURACY': [.92, .94, .87, .9, .85, .97, .96, .93,
... .89, .84, .91, .95]
... })
>>> x = oml.push(x)
>>>
>>> # Find the categories that the most entries belonged to.
... x.crosstab('GENDER', 'HAND').sort_values('count', ascending=False)
GENDER HAND count
0 F R 5
1 M L 2
2 M R 2
3 M None 1
4 F L 1
5 None R 1
>>>
>>> # For each gender value and across all entries, find the ratio of entries
... # with different hand values.
... x.crosstab('GENDER', 'HAND', pivot = True, margins = True, normalize = 0)
GENDER count_(L) count_(R) count_(None)
0 None 0.000000 1.000000 0.000000
1 F 0.166667 0.833333 0.000000
2 M 0.400000 0.400000 0.200000
3 All 0.250000 0.666667 0.083333
>>>
>>> # Find the mean speed across all gender and hand combinations.
... x.pivot_table('GENDER', 'HAND', 'SPEED')
GENDER mean(SPEED)_(L) mean(SPEED)_(R) mean(SPEED)_(None)
0 None NaN 46.40 NaN
1 F 34.10 40.78 NaN
2 M 45.85 27.85 29.3
>>>
>>> # Find the median accuracy and speed for every gender and hand combination.
... x.pivot_table('GENDER', 'HAND', aggfunc = oml.DataFrame.median)
GENDER median(ACCURACY)_(L) median(ACCURACY)_(R) median(ACCURACY)_(None) \
0 None NaN 0.890 NaN
1 F 0.96 0.870 NaN
2 M 0.91 0.925 0.97
median(SPEED)_(L) median(SPEED)_(R) median(SPEED)_(None)
0 NaN 46.40 NaN
1 34.10 39.60 NaN
2 45.85 27.85 29.3
>>>
>>> # Find the max and min speeds for every gender and hand combination and
... # across all combinations.
... x.pivot_table('GENDER', 'HAND', 'SPEED',
... aggfunc = [oml.DataFrame.max, oml.DataFrame.min],
... margins = True)
GENDER max(SPEED)_(L) max(SPEED)_(R) max(SPEED)_(None) max(SPEED)_(All) \
0 None NaN 46.4 NaN 46.4
1 F 34.1 60.8 NaN 60.8
2 M 51.2 30.4 29.3 51.2
3 All 51.2 60.8 29.3 60.8
min(SPEED)_(L) min(SPEED)_(R) min(SPEED)_(None) min(SPEED)_(All)
0 NaN 46.4 NaN 46.4
1 34.1 12.0 NaN 12.0
2 40.5 25.3 29.3 25.3
3 34.1 12.0 29.3 12.0
Parent topic: Explore Data