Aggregation And Grouping
Aggregation: We will study aggregations like sum()
, mean()
, median()
, min()
, and max()
, in which a single number gives insight into the nature of a potentially large dataset **Grouping: **When we are dealing with large datasets, it is useful to group data by common categories (value of particular column). To accomplish this, we use the GroupBy
function pandas DataFrame
๐ณ Titanic Ship Data for Demonstration
We will use Titanic survivors data from our analysis
You can download your own copy here
1. SIMPLE AGGREGATION
**Features of Dataset: ** Columns of a DataFrame is referred to as features of a dataset, these features can be:
Quantitative feature: Any value represented by numbers. We apply numerical metrics (sum, mean, std) on them
Categorical feature: Values are categories that can be used to group the dataset. GroupBy is applied on categorical features
1.1. describe()
Computes several common aggregates for each column (containing integers and floats) i.e, .describe
applies to categorical features
1.2. Fetching Unique Values
For categorical features, we can use .unique()
method to obtain unique instances of each categorical feature:
1.3. Frequency of Unique Values
For categorical features, we can use .value_counts()
method to obtain frequency counts of each category in column feature
We can use normalize=True
argument to get proportion of frequency count
2. GROUPBY: SPLIT, APPLY, COMBINE
GroupBy
, split-apply-combine; is one of the most common and useful strategy of data analysis:
The split step involves breaking up and grouping a DataFrame depending on the values of the specified column
The apply step involves computing some function - usually an aggregate, transformation, or filtering; within the individual groups. Apply can take following forms:
Aggregation
Tranformation
Filteration
The combine step merges the results of these operations into an output array.
2.1. Grouping by Single Column
a. Split
Split (Creating a DataFrame GroupBy Object): Let suppose we would like to GroupBy
column name sex
. Doing this will create a GroupBy object which stores the data of the individual groups in the form of key value pairs, which we can fetch using .groups
method on this DataFrame object
b. Apply-Combine
Suppose we are interested to know .mean()
of all columns values, once the data is grouped by sex
This step will apply the mean to each instance of group and then combine the result to give us a Series (for single column selection) or DataFrame (for multiple columns selection)
โ Let suppose further that instead of getting mean of all column values, we are only interested in finding mean of values under column survived
and age
โ In addition, we can also fetch specific group using get_group('group_name_here')
and apply aggregator on that object:
2.2. Grouping by two Columns
We are not limited to GroupBy single column. In this example, we group the DataFrame by columns sex
and alive
and then apply the mean
aggregator. Further, we are only interested to get the output on age
and fare
columns
2.3. Iteration over groups
We have discussed earlier that GroupBy function creates a GroupBy object, which stores the values in dictionary style key-value pair. To demonstrate this concept, we will apply a for
loop on GroupBy
object
3. AGGREGATE, FILTER, TRANSFORM, APPLY
GroupBy objects have aggregate()
, filter()
, transform()
, and apply()
methods that efficiently implement a variety of useful operations before combining the grouped data
First step is to construct a DataFrame for this discussion
3.1. Aggregation
In the above example, we just calculated a single aggregator(mean), however using the aggregate()
method, we can compute multiple aggregators in a single command. The aggregate()
method takes a string, a function or a list of all the required aggregates to compute.
Let suppose, we donโt want to compute one type of aggregator for all columns, rather, we want to compute median for data1
and mean
for data2
โ Now, as we grasp the concept of aggregate()
method, let apply this on our titanic data
3.2 Filtering
A filtering operation allows us to drop data based on some group properties Filter is applied in form of function
Letโs apply filter
to drop rows where standard deviation of data1
is greater than 2
Letโs find out the standard deviation of each key, so we can say for sure that key A
is dropped because standard deviation of data1
is less than 2
3.3. Transformation
In the above example, when we applied the aggregate
function, we end up with reduced version of the data (For example, in aggregate
example above, we end up with 3 rows from 6 rows). However, transformation
function/method can return some transformed version of the full data but the output remains the same shape as the input.
Suppose we would like to create a new column that list sum of each key for โdata1โ and โdata2โ
3.4. Apply
The apply()
method applies function to the group results. In the example below, let me add another column data3
which is sum of corresponding row value of columns data1
and data2
Last updated
Was this helpful?