Hierarchical Indexing In Pandas
Older versions of Pandas provided Panel and Panel4D objects to natively handle three-dimensional and four-dimensional data, but both have since been removed from the library. The far more common practice is to use hierarchical indexing (also known as multi-indexing) to incorporate multiple index levels within a single index. In this way, higher-dimensional data can be compactly represented with the familiar one-dimensional Series and two-dimensional DataFrame objects.
1. CREATING MULTI-INDEXED SERIES
First, let’s define the index values as a list of tuples:
Second, pass these tuples to the pd.MultiIndex.from_tuples() function:
Third, define the data, pop, for our multi-index Series in the form of a list:
Fourth, use the pd.Series constructor with the data and the index as arguments:
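The four steps above can be sketched as follows. The city labels, the year 2019, and all population figures are placeholders chosen for illustration, not real data, and the variable names other than pop are assumptions.

```python
import pandas as pd

# Step 1: index values as (city, year) tuples
index_tuples = [("City A", 2018), ("City A", 2019),
                ("City B", 2018), ("City B", 2019)]

# Step 2: build the MultiIndex from the tuples
multi_index = pd.MultiIndex.from_tuples(index_tuples)

# Step 3: the data, one value per tuple (placeholder figures)
pop_values = [1_000_000, 1_050_000, 2_000_000, 2_100_000]

# Step 4: combine data and index into a multi-indexed Series
pop = pd.Series(pop_values, index=multi_index)
print(pop)
```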
In the above example, the first two columns of the Series representation show the multiple index values, while the third column shows the data.
Notice that some entries are missing in the first column: in this multi-index representation, a blank entry indicates the same value as the line above it.
➞ The indexing and slicing syntax is the same as the one we covered in Indexing Pandas Series and DataFrame:
Population of City A, for all years:
Population of all cities, for year 2018:
Population of City A, for year 2018:
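A minimal sketch of these three lookups, assuming the same placeholder pop Series as in the steps above (all figures are illustrative). It uses .loc throughout, which is one of several equivalent spellings.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples([("City A", 2018), ("City A", 2019),
                                   ("City B", 2018), ("City B", 2019)])
pop = pd.Series([1_000_000, 1_050_000, 2_000_000, 2_100_000], index=index)

# Population of City A, for all years (result is indexed by year)
city_a = pop.loc["City A"]

# Population of all cities, for year 2018 (result is indexed by city)
year_2018 = pop.loc[:, 2018]

# Population of City A for year 2018 (a single value)
city_a_2018 = pop.loc[("City A", 2018)]

print(city_a, year_2018, city_a_2018, sep="\n\n")
```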
1.1. Stack and Unstack
a. unstack Method
We could easily have stored the same data in a simple DataFrame with index and column labels. The unstack() method quickly converts a multi-indexed Series into a conventionally indexed DataFrame:
b. stack Method
The stack() method provides the opposite operation of unstack(): it converts a DataFrame into a multi-indexed Series:
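The round trip between the two methods might look like this. The data is the same placeholder pop Series used earlier; the names pop_df and restored are assumptions made for this sketch.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples([("City A", 2018), ("City A", 2019),
                                   ("City B", 2018), ("City B", 2019)])
pop = pd.Series([1_000_000, 1_050_000, 2_000_000, 2_100_000], index=index)

# unstack(): the inner index level (year) becomes the column labels
pop_df = pop.unstack()
print(pop_df)

# stack(): the column labels move back into the index as the inner level
restored = pop_df.stack()
print(restored)
```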
1.2. Handling three or more Dimensions
Just as we were able to use multi-indexing to represent two-dimensional data within a one-dimensional Series, we can also use it to represent data of three or more dimensions in a Series or DataFrame. Each extra level in a multi-index represents an extra dimension of data.
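As a rough illustration of an extra level acting as an extra dimension, the sketch below adds a third, hypothetical age_group level to the city and year levels. The level names, group labels, and all figures are assumptions made for this example.

```python
import pandas as pd

cities = ["City A", "City B"]
years = [2018, 2019]
groups = ["adult", "under18"]  # hypothetical third dimension

# One tuple per (city, year, age_group) combination
tuples = [(c, y, g) for c in cities for y in years for g in groups]
index = pd.MultiIndex.from_tuples(tuples, names=["city", "year", "age_group"])

values = [750_000, 250_000, 787_500, 262_500,
          1_600_000, 400_000, 1_659_000, 441_000]  # placeholder figures
pop3 = pd.Series(values, index=index)

# Three index levels on a one-dimensional Series
print(pop3.index.nlevels)

# Moving the last level into the columns gives a DataFrame that still
# carries two dimensions (city, year) in its row index
by_group = pop3.unstack()
print(by_group)
```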
1.3. Applying UFunc
All the ufuncs and arithmetic operations work with hierarchical indices as well. Let’s find the percentage of the under-18 population in each city for each year:
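A minimal sketch of this calculation, assuming a DataFrame with hypothetical total and under18 columns that share one (city, year) multi-index. The column names and all figures are placeholders.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples([("City A", 2018), ("City A", 2019),
                                   ("City B", 2018), ("City B", 2019)],
                                  names=["city", "year"])
pop_df = pd.DataFrame({"total": [1_000_000, 1_050_000, 2_000_000, 2_100_000],
                       "under18": [250_000, 262_500, 400_000, 441_000]},
                      index=index)

# Element-wise arithmetic aligns on the full (city, year) index
pct_under18 = 100 * pop_df["under18"] / pop_df["total"]

# unstack() makes the result easier to read: cities as rows, years as columns
print(pct_under18.unstack())
```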
2. VARIOUS METHODS OF MULTI-INDEX CREATION
In Section 1, we studied one way to create a multi-index object, using tuples. In this section, we will study other techniques for creating a multi-index object and using it to build a Series or DataFrame:
➞ The most straightforward way to construct a multi-indexed Series or DataFrame is to pass a list of two or more index arrays to the pd.Series() or pd.DataFrame() constructor. Each index array must contain exactly one label per row of data.
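For instance, passing two parallel label lists as the index might look like this. The labels, column names, and figures are placeholders for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [[1_000_000, 250_000],
     [1_050_000, 262_500],
     [2_000_000, 400_000],
     [2_100_000, 441_000]],
    # Two index arrays, each with one label per row of data
    index=[["City A", "City A", "City B", "City B"],
           [2018, 2019, 2018, 2019]],
    columns=["total", "under18"],
)
print(df)
print(type(df.index).__name__)
```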
➞ We can also create a multi-index Series by passing a dictionary with appropriate tuples as keys; Pandas will automatically recognize the indices and data values:
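A short sketch of the dictionary form, again with placeholder labels and figures:

```python
import pandas as pd

# Each key is an (outer level, inner level) tuple; each value is a data point
pop = pd.Series({("City A", 2018): 1_000_000,
                 ("City A", 2019): 1_050_000,
                 ("City B", 2018): 2_000_000,
                 ("City B", 2019): 2_100_000})
print(pop)
```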
2.1. Explicit Multi-index constructors
For more flexibility, we can use the class methods available on pd.MultiIndex:
a. from_arrays
b. from_tuples
c. from_product
This is the easiest of the three because it needs the least input: it builds the index from the Cartesian product of the lists we pass in:
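The three constructors can be compared side by side. In this sketch each call is given placeholder labels arranged so that all three produce the same index.

```python
import pandas as pd

# a. from_arrays: one array per level, all the same length
idx_arrays = pd.MultiIndex.from_arrays(
    [["City A", "City A", "City B", "City B"], [2018, 2019, 2018, 2019]])

# b. from_tuples: one tuple per row
idx_tuples = pd.MultiIndex.from_tuples(
    [("City A", 2018), ("City A", 2019), ("City B", 2018), ("City B", 2019)])

# c. from_product: every combination of the given level values
idx_product = pd.MultiIndex.from_product([["City A", "City B"], [2018, 2019]])

print(idx_arrays.equals(idx_tuples), idx_tuples.equals(idx_product))
```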
2.2. Multi-Index level names
In this sub-section, we will learn various ways to name the levels of a multi-index:
a. Directly as argument in Explicit Multi-Index constructor
In sub-section 2.1, we studied three explicit multi-index constructors. Each of them accepts the keyword argument names=[] to define the name of each index level:
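For example, with from_product (the level names city and year, and the data, are assumptions for this sketch):

```python
import pandas as pd

index = pd.MultiIndex.from_product([["City A", "City B"], [2018, 2019]],
                                   names=["city", "year"])
pop = pd.Series([1_000_000, 1_050_000, 2_000_000, 2_100_000], index=index)

# The level names appear as a header row above the index columns
print(pop)
```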
b. By setting index.names for a Series or DataFrame
If we have already created a multi-index Series or DataFrame object without index level names, we can assign a list to the index.names attribute to explicitly set the name of each index level:
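A small sketch of naming the levels after the fact, using placeholder data and the assumed names city and year:

```python
import pandas as pd

pop = pd.Series([1_000_000, 1_050_000, 2_000_000, 2_100_000],
                index=[["City A", "City A", "City B", "City B"],
                       [2018, 2019, 2018, 2019]])
print(list(pop.index.names))  # levels start out unnamed

# Assign one name per level
pop.index.names = ["city", "year"]
print(pop)
```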
2.3. Multi-levels for Columns
In a DataFrame, the rows and columns are completely symmetric: just as the rows can have multiple levels of indices, so can the columns.
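One way to build such a DataFrame is sketched below. The name df_multi, the students Tom and John, the subject HR, and the level names year, exam, student, and marks follow the text; the year 2019, the exam labels, the second subject Math, and every score are placeholders invented for illustration.

```python
import numpy as np
import pandas as pd

# Row index: (year, exam)
index = pd.MultiIndex.from_product([[2018, 2019], ["exam1", "exam2"]],
                                   names=["year", "exam"])
# Column index: (student, marks), where the marks level holds the subject
columns = pd.MultiIndex.from_product([["John", "Tom"], ["HR", "Math"]],
                                     names=["student", "marks"])

scores = np.arange(16).reshape(4, 4) + 60  # placeholder scores 60..75
df_multi = pd.DataFrame(scores, index=index, columns=columns)
print(df_multi)
```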
3. INDEXING AND SLICING A MULTI-INDEX
3.1. Multi-Index Series
Remember that the [] method of indexing and slicing on a Series object applies to the index labels.
Let’s use indexing techniques to answer some basic questions about the above multi-indexed Series, pop:
→ What is the population of City A in 2018?
→ What is the population of City A in all available years?
→ To slice with .loc, make sure that the index is sorted. If it is not, call .sort_index() on the Series or DataFrame object. Now, let’s ask: what is the population of City A and City B in all available years?
→ We can also use integer-based indexing. Let’s fetch the first two rows using [:2]
→ What is the population of all cities for the year 2018?
→ What is the population of City A and City B in all available years?
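The questions above could be answered as follows. This is a sketch on placeholder data: City C, the year 2019, and all figures are assumptions, and the index is built already sorted so that label slicing works.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["City A", "City B", "City C"], [2018, 2019]], names=["city", "year"])
pop = pd.Series([1_000_000, 1_050_000, 2_000_000, 2_100_000, 500_000, 520_000],
                index=index)

a_2018 = pop["City A", 2018]               # City A in 2018 (single value)
a_all_years = pop["City A"]                # City A, all years
a_to_b = pop.loc["City A":"City B"]        # label slice on the outer level
first_two = pop[:2]                        # integer-based slice: first two rows
all_2018 = pop.loc[:, 2018]                # all cities, year 2018
a_and_b = pop.loc[["City A", "City B"]]    # explicit list of outer labels

print(a_to_b)
print(all_2018)
```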
3.2. Multi-Index DataFrame
Remember that the [] method of indexing and slicing on a DataFrame object applies to the column labels. Therefore, to index on the row index, we can use .iloc[] and .loc[]
Let’s use indexing techniques to answer some basic questions about the above multi-indexed DataFrame, df_multi:
→ What are the marks of the student Tom in all subjects, for all available years and exams?
→ What are Tom’s marks in HR, for all available years and exams?
➞ Fetching the first row of a multi-index DataFrame using the iloc[] method
➞ Fetching the first two rows and first two columns using the iloc[ , ] method. The integers that we provide before the , in iloc[ , ] apply to the rows, and those after the , apply to the columns
→ We can also use the explicit values of the index and column labels with .loc[]
For example, let’s get the scores of all students, in all subjects, for all exams, but only in the year 2018:
→ We can also use .loc[ , ] to slice at both the index and the column levels. Let’s fetch the scores of Tom, in all subjects and all exams, but only in the year 2018:
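A sketch of these lookups on a small stand-in for df_multi. The students Tom and John and the subject HR come from the text; the exam labels, the year 2019, the subject Math, and all scores are placeholders.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2018, 2019], ["exam1", "exam2"]],
                                   names=["year", "exam"])
columns = pd.MultiIndex.from_product([["John", "Tom"], ["HR", "Math"]],
                                     names=["student", "marks"])
df_multi = pd.DataFrame(np.arange(16).reshape(4, 4) + 60,
                        index=index, columns=columns)

tom = df_multi["Tom"]                  # Tom, all subjects, all years and exams
tom_hr = df_multi["Tom", "HR"]         # Tom's HR marks only (a Series)
first_row = df_multi.iloc[0]           # first row by position
top_left = df_multi.iloc[:2, :2]       # first two rows and first two columns
year_2018 = df_multi.loc[2018]         # every student and subject, year 2018
tom_2018 = df_multi.loc[2018, "Tom"]   # Tom only, year 2018 only

print(tom_hr)
print(tom_2018)
```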
4. REARRANGING MULTI-INDICES
We saw a few examples of this concept in sub-section 1.1, under the stack() and unstack() methods, but there are many more ways to finely control the rearrangement of data between hierarchical indices and columns.
4.1. Sorted and Unsorted indices
Many multi-index slicing operations require the index to be sorted. We can sort the index of a Series or DataFrame object using the .sort_index() method:
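A minimal sketch, assuming a Series whose (city, year) index was deliberately created out of order. Labels and figures are placeholders.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples([("City B", 2019), ("City A", 2018),
                                   ("City B", 2018), ("City A", 2019)],
                                  names=["city", "year"])
pop_unsorted = pd.Series([2_100_000, 1_000_000, 2_000_000, 1_050_000],
                         index=index)
print(pop_unsorted.index.is_monotonic_increasing)  # index is out of order

# sort_index() returns a new object sorted by every level, outermost first
pop_sorted = pop_unsorted.sort_index()
print(pop_sorted)

# Label slicing on the outer level is now safe
print(pop_sorted.loc["City A":"City B"])
```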
4.2. Stacking and Unstacking indices
Earlier, in sub-section 1.1, we applied stack and unstack to a Pandas Series object. Let’s apply the same methods to a DataFrame (in the example below, we intentionally edit the DataFrame by removing “John” so that the DataFrame is easy to read and understand).
a. Unstack
Let’s unstack the DataFrame. By default, unstack() applies to level=-1, i.e., the last level of the row multi-index.
As we can see, the index level ‘exam’ is now unstacked and has become another level in the columns:
Let’s unstack with level=0, which moves the index level named year into another level of the columns:
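Both calls can be sketched on a reduced DataFrame that keeps only Tom’s columns. The exam labels, the year 2019, the subject Math, and all scores are placeholders.

```python
import pandas as pd

index = pd.MultiIndex.from_product([[2018, 2019], ["exam1", "exam2"]],
                                   names=["year", "exam"])
columns = pd.MultiIndex.from_product([["Tom"], ["HR", "Math"]],
                                     names=["student", "marks"])
df = pd.DataFrame([[62, 63], [66, 67], [70, 71], [74, 75]],
                  index=index, columns=columns)

# Default (level=-1): the last row level, exam, moves into the columns
by_exam = df.unstack()
print(by_exam)

# level=0: the first row level, year, moves into the columns instead
by_year = df.unstack(level=0)
print(by_year)
```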
b. Stack
Let’s stack one of the DataFrame column levels into the index. By default, stack() applies to the last level of the columns, which is exam in our example:
Suppose we would like to stack the student level instead of exam. To do that, we can provide level=0, because student is at position 0:
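A sketch of both stack() calls, starting from an unstacked DataFrame whose column levels are student, marks, and exam. Only Tom’s columns are kept, and the labels and scores are placeholders. Depending on the Pandas version, stack() may emit a FutureWarning about its newer implementation.

```python
import pandas as pd

index = pd.MultiIndex.from_product([[2018, 2019], ["exam1", "exam2"]],
                                   names=["year", "exam"])
columns = pd.MultiIndex.from_product([["Tom"], ["HR", "Math"]],
                                     names=["student", "marks"])
df = pd.DataFrame([[62, 63], [66, 67], [70, 71], [74, 75]],
                  index=index, columns=columns)
wide = df.unstack()  # column levels: student, marks, exam

# Default: the last column level (exam) moves back into the row index
back = wide.stack()
print(back)

# level=0: the first column level (student) moves into the row index
by_student = wide.stack(level=0)
print(by_student)
```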
4.3. Index Resetting and Setting
a. Index Reset
Index to column: We can use the reset_index() method to turn the index labels into columns. We can also fine-tune the result using the various parameters of this method:
Let’s apply reset_index(), which turns the old indices into columns and uses a new sequential integer index:
The last column has no name, so let’s give it one to make the results both presentable and meaningful:
If we don’t want to reset all indices to columns, we can use the level= argument to choose which levels are reset:
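The three variants might look like this on a multi-indexed Series. The level names, the column name population, and the data are assumptions for this sketch.

```python
import pandas as pd

index = pd.MultiIndex.from_product([["City A", "City B"], [2018, 2019]],
                                   names=["city", "year"])
pop = pd.Series([1_000_000, 1_050_000, 2_000_000, 2_100_000], index=index)

# All index levels become columns; the unnamed data column is labelled 0
flat = pop.reset_index()
print(flat)

# name= gives the data column a meaningful label
named = pop.reset_index(name="population")
print(named)

# level= resets only the chosen level; city stays in the index
partial = pop.reset_index(level="year", name="population")
print(partial)
```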
b. Set Index
Column to index: We can use the set_index() method to build a multi-indexed DataFrame by providing the list of column labels that we would like to convert into indices:
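A short sketch of the reverse direction, starting from a flat table with placeholder columns city, year, and population:

```python
import pandas as pd

flat = pd.DataFrame({
    "city": ["City A", "City A", "City B", "City B"],
    "year": [2018, 2019, 2018, 2019],
    "population": [1_000_000, 1_050_000, 2_000_000, 2_100_000],
})

# The listed columns become the levels of a MultiIndex, in the given order
pop_df = flat.set_index(["city", "year"])
print(pop_df)
```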
5. DATA AGGREGATIONS ON MULTI-INDICES
In this section, we will perform sum(), mean(), max() and similar aggregations on a multi-index DataFrame.
a. Along rows
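Suppose we would like to find the mean score in each year, for each subject and each student. Older Pandas versions accepted a level='year' keyword argument on the aggregation methods for this. That keyword was removed in Pandas 2.0, so we now group by the index level with groupby(level='year') and then call mean():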
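A minimal sketch of the row-wise aggregation using groupby(level=...), on a stand-in for df_multi. The exam labels, the year 2019, the subject Math, and all scores are placeholders.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2018, 2019], ["exam1", "exam2"]],
                                   names=["year", "exam"])
columns = pd.MultiIndex.from_product([["John", "Tom"], ["HR", "Math"]],
                                     names=["student", "marks"])
df_multi = pd.DataFrame(np.arange(16).reshape(4, 4) + 60,
                        index=index, columns=columns)

# Average the exams within each year; the columns are left untouched
yearly_mean = df_multi.groupby(level="year").mean()
print(yearly_mean)
```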
b. Along column and rows
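Suppose we would like to go one step further and find the mean score in each year for each subject, averaged over all students and exams. Older Pandas versions used two keyword arguments for this, level='marks' and axis=1 (to tell Pandas to look for the level in the columns). In current versions, the same result is obtained by grouping the columns on the marks level, for example by transposing, calling groupby(level='marks'), and transposing back: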
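One possible way to write the combined row and column aggregation is sketched below. The transpose-based approach is just one option that avoids axis=1 grouping; the data is the same placeholder stand-in for df_multi, with invented exam labels, subjects, and scores.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2018, 2019], ["exam1", "exam2"]],
                                   names=["year", "exam"])
columns = pd.MultiIndex.from_product([["John", "Tom"], ["HR", "Math"]],
                                     names=["student", "marks"])
df_multi = pd.DataFrame(np.arange(16).reshape(4, 4) + 60,
                        index=index, columns=columns)

# Step 1 (rows): mean over the exams within each year
yearly_mean = df_multi.groupby(level="year").mean()

# Step 2 (columns): transpose, group on the marks level to average over
# students, then transpose back so that years are rows again
subject_mean = yearly_mean.T.groupby(level="marks").mean().T
print(subject_mean)
```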