Aggregation And Grouping

**Aggregation:** We will study aggregations like sum(), mean(), median(), min(), and max(), in which a single number gives insight into the nature of a potentially large dataset.

**Grouping:** When we are dealing with large datasets, it is useful to group the data by common categories (the values of a particular column). To accomplish this, we use the groupby() method of a pandas DataFrame.
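As a quick illustration of these aggregation functions, here is a minimal sketch on a small, made-up Series (the values are arbitrary and only serve as an example):

import pandas as pd

# a small, made-up Series to illustrate the basic aggregations
s = pd.Series([3, 7, 1, 9, 5])

print(s.sum())     # 25
print(s.mean())    # 5.0
print(s.median())  # 5.0
print(s.min())     # 1
print(s.max())     # 9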

🛳 Titanic Ship Data for Demonstration

  • We will use the Titanic survivors dataset for our analysis

  • You can download your own copy here

import numpy as np
import pandas as pd 
titanic = pd.read_csv('data/titanic.csv')

# fetching column names using the .keys() method
print(titanic.keys()) 
Index(['survived', 'pclass', 'sex', 'age', 'fare', 'embarked', 'who',
       'embark_town', 'alive', 'alone'],
      dtype='object')
print(titanic.head()) # show the first five rows
   survived  pclass     sex   age     fare embarked    who  embark_town alive  \
0         0       3    male  22.0   7.2500        S    man  Southampton    no   
1         1       1  female  38.0  71.2833        C  woman    Cherbourg   yes   
2         1       3  female  26.0   7.9250        S  woman  Southampton   yes   
3         1       1  female  35.0  53.1000        S  woman  Southampton   yes   
4         0       3    male  35.0   8.0500        S    man  Southampton    no   
   alone  
0  False  
1  False  
2   True  
3  False  
4   True  

1. SIMPLE AGGREGATION

**Features of a Dataset:** The columns of a DataFrame are referred to as the features of the dataset; these features can be:

  • Quantitative feature: Any value represented by numbers. We apply numerical metrics (sum, mean, std) to these columns

  • Categorical feature: Values are categories that can be used to group the dataset. GroupBy is applied to categorical features (a quick dtype check is sketched after this list)
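One quick way to tell quantitative and categorical features apart is to inspect the column dtypes, or to select columns by type; a minimal sketch on the Titanic DataFrame loaded above:

# numeric (quantitative) columns vs. non-numeric (mostly categorical) columns
print(titanic.dtypes)
print(titanic.select_dtypes(include='number').columns)
print(titanic.select_dtypes(exclude='number').columns)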

1.1. describe()

Computes several common aggregates for each numeric column (containing integers and floats), i.e., .describe() applies to quantitative features

print(titanic.describe())
         survived      pclass         age        fare
count  891.000000  891.000000  714.000000  891.000000
mean     0.383838    2.308642   29.699118   32.204208
std      0.486592    0.836071   14.526497   49.693429
min      0.000000    1.000000    0.420000    0.000000
25%      0.000000    2.000000   20.125000    7.910400
50%      0.000000    3.000000   28.000000   14.454200
75%      1.000000    3.000000   38.000000   31.000000
max      1.000000    3.000000   80.000000  512.329200
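By default, .describe() summarizes only the numeric columns. If we also want a quick summary of the categorical columns, we can pass include='object'; a short sketch (the exact columns returned depend on their dtypes):

# object columns are summarized with count, unique, top, and freq
print(titanic.describe(include='object'))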

1.2. Fetching Unique Values

For categorical features, we can use the .unique() method to obtain the unique values of each categorical feature:

# fetching unique values under column 'embark_town'
titanic['embark_town'].unique()
array(['Southampton', 'Cherbourg', 'Queenstown', nan], dtype=object)

1.3. Frequency of Unique Values

For categorical features, we can use the .value_counts() method to obtain the frequency count of each category in a column

# fetching all unique instances of column 'embark_town'
# along with the frequency count
titanic['embark_town'].value_counts()
Southampton    644
Cherbourg      168
Queenstown      77
Name: embark_town, dtype: int64

We can use the normalize=True argument to get the proportions instead of the raw counts

# sum of all values is equal to 1
titanic['embark_town'].value_counts(normalize=True)
Southampton    0.724409
Cherbourg      0.188976
Queenstown     0.086614
Name: embark_town, dtype: float64
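Note that .value_counts() excludes missing values by default, which is why the counts above add up to 889 rather than 891. Passing dropna=False includes the NaN entries as their own category:

# include missing values as a separate category in the counts
titanic['embark_town'].value_counts(dropna=False)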

2. GROUPBY: SPLIT, APPLY, COMBINE

GroupBy, or split-apply-combine, is one of the most common and useful strategies in data analysis (a minimal sketch of the three steps follows the list below):

  • The split step involves breaking up and grouping a DataFrame depending on the values of the specified column

  • The apply step involves computing some function, usually an aggregate, transformation, or filter, within the individual groups. Apply can take the following forms:

    • Aggregation

    • Transformation

    • Filtering

  • The combine step merges the results of these operations into an output array.
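To make the three steps concrete, here is a minimal sketch that performs split, apply, and combine by hand on a tiny made-up DataFrame, and then shows that groupby() does all three in a single call:

import pandas as pd

# a tiny made-up DataFrame just for illustration
toy = pd.DataFrame({'key': ['A', 'B', 'A', 'B'],
                    'val': [1, 2, 3, 4]})

# split: select the rows belonging to each key
parts = {k: toy[toy['key'] == k] for k in toy['key'].unique()}

# apply: compute an aggregate (the mean of 'val') within each piece
means = {k: part['val'].mean() for k, part in parts.items()}

# combine: put the per-group results back into a single Series
print(pd.Series(means))                   # A: 2.0, B: 3.0

# groupby() performs the same split-apply-combine in one call
print(toy.groupby('key')['val'].mean())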

2.1. Grouping by Single Column

a. Split

Split (Creating a DataFrameGroupBy Object): Suppose we would like to group by the column sex. Doing this creates a GroupBy object, which stores the row indices of the individual groups as key-value pairs; we can fetch them using the .groups attribute of this object

#grouping by column 'sex'
group_by_sex = titanic.groupby('sex') 

# the output is DataFrameGroupBy object
group_by_sex
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f9778b203d0>
# getting all groups as key-value pairs
group_by_sex.groups
{'female': Int64Index([  1,   2,   3,   8,   9,  10,  11,  14,  15,  18,
             ...
             866, 871, 874, 875, 879, 880, 882, 885, 887, 888],
            dtype='int64', length=314),
 'male': Int64Index([  0,   4,   5,   6,   7,  12,  13,  16,  17,  20,
             ...
             873, 876, 877, 878, 881, 883, 884, 886, 889, 890],
            dtype='int64', length=577)}

b. Apply-Combine

Suppose we are interested in the .mean() of all column values once the data is grouped by sex. This step will apply the mean to each group and then combine the results to give us a Series (for a single selected column) or a DataFrame (for multiple selected columns)

titanic.groupby('sex').mean()
        survived    pclass        age       fare     alone
sex                                                       
female  0.742038  2.159236  27.915709  44.479818  0.401274
male    0.188908  2.389948  30.726645  25.523893  0.712305

➞ Suppose further that instead of getting the mean of all column values, we are only interested in the mean of the columns survived and age

print(titanic.groupby('sex')[['survived','age']].mean())
        survived        age
sex                        
female  0.742038  27.915709
male    0.188908  30.726645

➞ In addition, we can fetch a specific group using get_group('group_name_here') and apply an aggregator to that object:

group_by_sex.get_group('male').mean()
survived     0.188908
pclass       2.389948
age         30.726645
fare        25.523893
alone        0.712305
dtype: float64

2.2. Grouping by Two Columns

We are not limited to grouping by a single column. In this example, we group the DataFrame by the columns sex and alive and then apply the mean aggregator. Further, we only want the output for the age and fare columns

print(titanic.groupby(['sex','alive'])[['age','fare']].mean())
                    age       fare
sex    alive                      
female no     25.046875  23.024385
       yes    28.847716  51.938573
male   no     31.618056  21.960993
       yes    27.276022  40.821484
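Because grouping by two columns produces a result with a hierarchical (MultiIndex) row index, it can be convenient to pivot one of the grouping levels into columns with unstack(); a short sketch for the age column:

# pivot the 'alive' level into columns for easier side-by-side comparison
print(titanic.groupby(['sex','alive'])['age'].mean().unstack())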

2.3. Iteration over groups

We discussed earlier that groupby() creates a GroupBy object, which stores the groups as dictionary-style key-value pairs. To demonstrate this concept, we iterate over the GroupBy object with a for loop

# let get the shape of each group, when we GroupBy 'pclass'
for (name, group) in titanic.groupby('pclass'):
    print(f"{name}: shape={group.shape}")
1: shape=(216, 10)
2: shape=(184, 10)
3: shape=(491, 10)

3. AGGREGATE, FILTER, TRANSFORM, APPLY

GroupBy objects have aggregate(), filter(), transform(), and apply() methods that efficiently implement a variety of useful operations before combining the grouped data

The first step is to construct a DataFrame for this discussion

rand = np.random.RandomState(42)
df = pd.DataFrame({'key': ['A','B','C','A','B','C'],
                  'data1': rand.randint(1,10, size=6),
                  'data2': rand.randint(1,20, size=6)})

print(df)
  key  data1  data2
0   A      7     11
1   B      4     11
2   C      8      4
3   A      5      8
4   B      7      3
5   C      3      2

3.1. Aggregation

In the above examples, we calculated only a single aggregate (the mean); using the aggregate() method, we can compute multiple aggregates in a single command. The aggregate() method takes a string, a function, or a list of all the required aggregates to compute.

# passing the list of aggregators we need for data1 and data2
df.groupby('key').aggregate([min,max,np.median,np.mean,np.std])
    data1                           data2                          
      min max median mean       std   min max median mean       std
key                                                                
A       5   7    6.0  6.0  1.414214     8  11    9.5  9.5  2.121320
B       4   7    5.5  5.5  2.121320     3  11    7.0  7.0  5.656854
C       3   8    5.5  5.5  3.535534     2   4    3.0  3.0  1.414214

Suppose we don’t want to compute the same aggregate for all columns, but rather the median for data1 and the mean for data2

# we will pass a dictionary of key (column name) and value (aggregator)
df.groupby('key').aggregate({'data1':np.median,
                            'data2':np.mean})
     data1  data2
key              
A      6.0    9.5
B      5.5    7.0
C      5.5    3.0

➞ Now that we grasp the concept of the aggregate() method, let's apply it to our Titanic data

titanic.groupby('sex')[['age','fare']].aggregate([min,max,np.mean])
         age                   fare                     
         min   max       mean   min       max       mean
sex                                                     
female  0.75  63.0  27.915709  6.75  512.3292  44.479818
male    0.42  80.0  30.726645  0.00  512.3292  25.523893

3.2. Filtering

A filtering operation allows us to drop data based on some group property. The filter is applied in the form of a function that returns True or False for each group

Let’s apply a filter that keeps only the groups where the standard deviation of data1 is greater than 2 (any group that does not satisfy this condition is dropped)

df.groupby('key').filter(lambda x: x['data1'].std() > 2)
  key  data1  data2
1   B      4     11
2   C      8      4
4   B      7      3
5   C      3      2

Let’s find the standard deviation of each group, so we can confirm that key A was dropped because the standard deviation of its data1 is less than 2

print(df.groupby('key').std())
        data1     data2
key                    
A    1.414214  2.121320
B    2.121320  5.656854
C    3.535534  1.414214

3.3. Transformation

In the aggregation examples above, we ended up with a reduced version of the data (for example, 3 rows from the original 6). A transformation, however, returns a transformed version of the full data: the output has the same shape as the input.

Suppose we would like to compute, for each row, the group sum of data1 and data2 corresponding to that row’s key

df_new_col = df.groupby('key').transform('sum')
print(df_new_col)
   data1  data2
0     12     19
1     11     14
2     11      6
3     12     19
4     11     14
5     11      6
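Another common use of transform() is group-wise centering, where each value is replaced by its deviation from its group's mean; the output again has the same shape as the input (a minimal sketch on the same df):

# subtract the group mean from every value; the result has the same shape as df[['data1','data2']]
print(df.groupby('key').transform(lambda x: x - x.mean()))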

3.4. Apply

The apply() method applies an arbitrary function to each group and then combines the results. In the example below, we add another column data3, which is the sum of the corresponding row values of columns data1 and data2

def apply_func(x):
    x['data3'] = x['data1'] + x['data2']
    return x

print(df.groupby('key').apply(apply_func))
  key  data1  data2  data3
0   A      7     11     18
1   B      4     11     15
2   C      8      4     12
3   A      5      8     13
4   B      7      3     10
5   C      3      2      5
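Note that apply_func above works row by row and does not actually use any group-level information. apply() becomes more interesting when the function combines each group's rows with a group-level aggregate; a hedged sketch (the column name data1_norm is made up for illustration), dividing each data1 value by the sum of data2 within its group:

def norm_by_group(x):
    # x is the sub-DataFrame for one key; use a group-level aggregate of data2
    x = x.copy()
    x['data1_norm'] = x['data1'] / x['data2'].sum()
    return x

print(df.groupby('key').apply(norm_by_group))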
