# Hierarchical Indexing In Pandas

While Pandas does provide **Panel** and **Panel4D** objects to natively handle ***three-dimensional*** and ***four-dimensional data***, a far more common practice is to use **hierarchical indexing** (also known as multi-indexing) to incorporate multiple index levels within a single index. Doing this, higher-dimensional data can be compactly represented with the familiar [one-dimensional Series and two-dimensional DataFrame objects](https://tahamaddam.com/coding/pandas/pandas-series-and-dataframe-object/)

```python
import numpy as np
import pandas as pd 
```

## 1. CREATING MULTI-INDEXED SERIES

First, let’s create a multi-index data from the tuples as follows:

```python
# defining multi-index in form of tuples
index = [('City A', 2018),('City A', 2019),
        ('City B', 2018), ('City B', 2019),
        ('City C', 2018), ('City C', 2019)]

index
```

```
[('City A', 2018),
 ('City A', 2019),
 ('City B', 2018),
 ('City B', 2019),
 ('City C', 2018),
 ('City C', 2019)]
```

Second, provide the above multi-index data to Pandas `pd.MultiIndex.from_tuples()` function:

```python
# creating multi-index in Pandas from Tuples (we created above)
index_pd = pd.MultiIndex.from_tuples(index)

index_pd
```

```
MultiIndex([('City A', 2018),
            ('City A', 2019),
            ( 'City B', 2018),
            ( 'City B', 2019),
            ('City C', 2018),
            ('City C', 2019)],
           )
```

Third, define the data, `pop` for our multi-index series, in the form of `list`:

```python
# Defining data for the multi-index
pop = [33871648, 37253956, 18976457, 19378102, 20851820, 25145561]
```

Fourth, use `pd.Series` constructor with data and index as arguments:

```python
# Linking multi-index with population data
pop = pd.Series(pop, index=index_pd)

pop
```

```
City A  2018    33871648
        2019    37253956
City B  2018    18976457
        2019    19378102
City C  2018    20851820
        2019    25145561
dtype: int64
```

* In the above example, the first two columns of the Series representation show the **multiple index values**, while the third column shows the data
* Notice that some entries are missing in the first column: in this multi-index representation, any blank entry indicates the same value as the line above it

➞ Indexing and Slicing syntax will be the same as we covered in [Indexing Pandas Series and DataFrame](https://tahamaddam.com/coding/pandas/indexing-pandas-series-and-dataframe/)

Population of City A, for all years:

```python
pop['City A']
```

```
2018    33871648
2019    37253956
dtype: int64
```

Population of all cities, for year 2018

```python
pop[:,2018]
```

```
City A    33871648
City B     18976457
City C    20851820
dtype: int64
```

Population of City A for year 2018

```python
pop['City A',2018]
```

```
33871648
```

### 1.1. Stack and Unstack

#### a. unstack Method

We could easily have stored the same data using a simple DataFrame with index and column labels. The `unstack()` method will quickly convert a multi-indexed **Series** into a conventionally indexed **DataFrame**

```python
pop_df = pop.unstack()
print(pop_df)
```

```
year     2018      2019
city                        
City B  18976457  19378102
City A  33871648  37253956
City C  20851820  25145561
```

#### b. stack

The `stack()` method provides the opposite operation than `unstack()` — converts DataFrame to multi-indexed Series

```python
pop_df = pop_df.stack()
pop_df
```

```
City B   2018    18976457
         2019    19378102
City A   2018    33871648
         2019    37253956
City C   2018    20851820
         2019    25145561
dtype: int64
```

### 1.2. Handling three or more Dimensions

Just as we were able to use multi-indexing to represent two-dimensional DataFrame within a one-dimensional Series, we can also use it to represent data of three or more dimensions in a Series or DataFrame. Each extra level in a multi-index represents an extra dimension of data.

```python
pop_df = pd.DataFrame({'total': pop,
                      'under18': [9267089, 9284094, 4687374, 4318033, 5906301, 6879014]})
print(pop_df)
```

```
               total  under18
city     year                   
City B  2018  18976457  9267089
        2019  19378102  9284094
City A  2018  33871648  4687374
        2019  37253956  4318033
City C  2018  20851820  5906301
        2019  25145561  6879014
```

### 1.3. Applying UFunc

Let’s find the percentage of under 18 population in each city, each year:

```python
pop_u18_percent = pop_df['under18'] / pop_df['total']
pop_u18_percent
```

```
City A  2018    0.273594
        2019    0.249211
City B  2018    0.247010
        2019    0.222831
City C  2018    0.283251
        2019    0.273568
dtype: float64
```

```python
print(pop_u18_percent.unstack())
```

```
year      2018      2019
city                        
City B  0.247010  0.222831
City A  0.273594  0.249211
City C  0.283251  0.273568
```

## 2. VARIOUS METHODS OF MULTI-INDEX CREATION

In Section 1, we studied one way to create a multi-index object using tuples. In this section, we will study various methods/techniques for creating multi-index object and using it to create Series and DataFrame:

➞ The most straightforward way to construct a multi-indexed Series or DataFrame is *to simply pass a list of two or more index arrays* to the `pd.Series()` or `pd.DataFrame` constructor. The number of data points should be equal to number of indices.

```python
create_mi_df = pd.DataFrame(np.random.rand(4,2),
                           index=[['a','a','b','b'],[1,2,1,2]],
                           columns=['data1','data2'])
```

```python
print(create_mi_df)
```

```
        data1     data2
a 1  0.695555  0.776309
  2  0.696634  0.502602
b 1  0.322619  0.127614
  2  0.293457  0.415007
```

➞ We can also create multi-index Series by passing dictionary with appropriate *tuples as keys*, Pandas will automatically recognize the indices and data values:

```python
data_for_series = {('California', 2000): 33871648,
                   ('California', 2010): 37253956,
                   ('Texas', 2000): 20851820,
                   ('Texas', 2010): 25145561,
                   ('New York', 2000): 18976457,
                   ('New York', 2010): 19378102}
pd.Series(data_for_series)
```

```
California  2000    33871648
            2010    37253956
Texas       2000    20851820
            2010    25145561
New York    2000    18976457
            2010    19378102
dtype: int64
```

### 2.1. Explicit Multi-index constructors

We can use the *class method* available in the `pd.MultiIndex`

#### a. from\_arrays

```python
pd.MultiIndex.from_arrays([['a','a','b','b'],[1,2,1,2]])
```

```
MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )
```

#### b. from\_tuples

```python
pd.MultiIndex.from_tuples([('a',1),('a',2),('b',1),('b',2)])
```

```
MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )
```

#### c. from\_product

This one is easiest of all three, needs to input least amount of data:

```python
pd.MultiIndex.from_product([['a','b'],[1,2]])
```

```
MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )
```

### 2.2. Multi-Index level names

In this sub-section, we will learn various methods to name the multi-index:

#### a. Directly as argument in Explicit Multi-Index constructor

In sub-section 2.1, we studied three explicit multi-index constructor. In them, we can provide keyword argument `names=[]` to define the name of each index level:

```python
pd.MultiIndex.from_arrays([['a','a','b','b'],[1,2,1,2]], names=['alpha','num'])
```

```
MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           names=['alpha', 'num'])
```

#### b. By setting [index.name](http://index.name) for Series or DataFrame

If we have already created a multi-index Series or DataFrame object without index level names, we can use the method `index.names=[]` to explicitly set the names of each index level

```python
# reproducing Series that don't have index name
pop
```

```
City A  2018    33871648
        2019    37253956
City B  2018    18976457
        2019    19378102
City C  2018    20851820
        2019    25145561
dtype: int64
```

```python
# setting index level names
pop.index.names = ['city','year']
pop
```

```
city      year
City A  2018    33871648
          2019    37253956
City B   2018    18976457
          2019    19378102
City C  2018    20851820
          2019    25145561
dtype: int64
```

### 2.3. Multi-levels for Columns

In a DataFrame, the rows and columns are completely symmetric, and just as *the rows can have multiple levels of indices*, *the columns can also have multiple levels*

```python
# creating hierarchical (multiple) indices and columns
indices = pd.MultiIndex.from_product([[2018,2019],[1,2]],
                                     names=['year','exam'])

columns = pd.MultiIndex.from_product([['Tom', 'Harry', 'John'], ['HR', 'Marketing']], 
                                     names=['student', 'marks'])

# mock data
# how do we know that we need to have 4*6 = 24 data points?
data = np.random.randint(50,100, size=(4,6))

# create Dataframe with multiple indices and colums
df_multi = pd.DataFrame(data,
                       index=indices,
                       columns=columns)
# result
print(df_multi)
```

```
student   Tom           Harry           John          
marks      HR Marketing    HR Marketing   HR Marketing
year exam                                             
2018 1     58        85    54        91   73        69
     2     70        98    64        70   79        96
2019 1     67        79    87        79   81        70
     2     81        72    90        88   80        66
```

## 3. INDEXING AND SLICING A MULTI-INDEX

### 3.1. Multi-Index Series

> Remember that `[]` method of indexing and slicing on Series object, applies on the index labels

```python
# multi-index series to perform indexing and slicing
pop
```

```
city    year
City A  2018    33871648
        2019    37253956
City B  2018    18976457
        2019    19378102
City C  2018    20851820
        2019    25145561
dtype: int64
```

Let’s use indexing techniques to answer some basic question about the above multi-indexed Series, `pop`

→ What is population of City A in 2018

```python
pop['City A', 2018]
```

```
33871648
```

→ What is population of City A in all available years

```python
pop['City A']
```

```
year
2018    33871648
2019    37253956
dtype: int64
```

→ To use `.loc`, make sure that index is sorted. If it is not, use `.index.sort()` on Series or DataFrame object. Now, let ask question, what is the population of City A and City B, in all available year

```python
# using .loc
pop.loc['City A':'City B']
```

```
city    year
City A  2018    33871648
        2019    37253956
City B  2018    18976457
        2019    19378102
dtype: int64
```

→ We can also use the integer based indexing. Let’s fetch first two rows using `[:2]`

```python
# using integer value of index
pop[:2]
```

```
city    year
City A  2018    33871648
        2019    37253956
dtype: int64
```

→ What is population of all cities, for year 2018

```python
pop[:,2018]
```

```
city
City A    33871648
City B    18976457
City C    20851820
dtype: int64
```

→ What is population of City A and City B in all available years:

```python
# fancy indexing
# pay attention that we use the array here
pop[['City A','City B']]
```

```
city      year
City B   2018    18976457
          2019    19378102
City A  2018    33871648
          2019    37253956
dtype: int64
```

### 3.2. Multi-Index DataFrame

> Remember that `[]` method of indexing and slicing on DataFrame object, applies on the column labels. Therefore to apply indexing on index level, we can use `.iloc[]` and `loc[]`

```python
# reproduing dataframe we will work on
print(df_multi)
```

```python
student        Tom           Harry           John          
marks      HR Marketing    HR Marketing   HR Marketing
year exam                                             
2018 1     58        85    54        91   73        69
     2     70        98    64        70   79        96
2019 1     67        79    87        79   81        70
     2     81        72    90        88   80        66
```

Let’s use indexing techniques to answer some basic question about the above multi-indexed DataFrame, `df_multi`

→ What are marks of student, Tom, in all the subjects , for all available years and exams:

```python
print(df_multi['Tom'])
```

```
marks      HR  Marketing
year exam               
2018 1     58         85
     2     70         98
2019 1     67         79
     2     81         72
```

→ Tom marks in HR, for all available years and exams:

```python
df_multi['Tom','HR']
```

```
year  exam
2018  1       81
      2       80
2019  1       79
      2       55
Name: (Tom, HR), dtype: int64
```

➞ Fetching first row of a multi-index DataFrame using `iloc[]` method

```python
print(df_multi.iloc[:1])
```

```
student        Tom           Harry           John          
marks     HR Marketing    HR Marketing   HR Marketing
year exam                                             
2018 1     58        85    54        91   73        69
```

➞ Fetching first two rows and first two columns using `iloc[,]` method. The integers for slicing that we provide before the `,` in `iloc[ , ]` applies to row and after the `,` applies to column

```python
print(df_multi.iloc[:2,:2])
```

```
student   Tom          
marks      HR Marketing
year exam              
2018 1     58        85
     2     70        98
```

→ We can also use the explicit values of index and column labels using `.loc[]` For example, let’s get score of all students, in all the subjects, for all the exams, but only in year 2018:

```python
print(df_multi.loc[2018])
```

```
student Tom           Harry           John          
marks    HR Marketing    HR Marketing   HR Marketing
exam                                                
1        58        85    54        91   73        69
2        70        98    64        70   79        96
```

→ We can also use `.loc[ , ]` to slice at both index and column levels. Let’s fetch scores of Tom, in all subjects and all exams, but only in year 2018:

```python
print(df_multi.loc[2018,'Tom'])
```

```
marks  HR  Marketing
exam                
1      58         85
2      70         98
```

## 4. REARRANGING MULTI-INDICES

We saw few examples of this concept, sub-section 1.1. under `stack()` and `unstack()` methods, but there are many more ways to finely control the rearrangement of data between hierarchical indices and columns

### 4.1. Sorted and Unsorted indices

We can sort the index of a Series or DataFrame object using `.sort_index()` method:

```python
# defining index of Series
index_series = pd.MultiIndex.from_product([['c','a','b'],[1,2,3]])

# defining data
data_series = np.random.randint(100, size=9)

# constructing Series object
series_object = pd.Series(data_series, index=index_series)

# printing Series in unordered (original) form
print(series_object)

# printing Series in ordered (alphabetically) form
print(series_object.sort_index())
```

```
c  1    34
   2    72
   3    98
a  1    40
   2    40
   3    74
b  1    48
   2    19
   3    11
dtype: int64
a  1    40
   2    40
   3    74
b  1    48
   2    19
   3    11
c  1    34
   2    72
   3    98
dtype: int64
```

### 4.2. Stacking and Unstacking indices

Earlier, in sub-section 1.1, we applied `stack` and `unstuck` on Pandas Series object. Let’s us apply the same methods on DataFrame (in the example below, we intentionally edit the DataFrame by removing “John” so that the DataFrame is easy to read and understand)

```python
# reproducing multi-index DataFrame
print(df_multi)
```

```
student   Tom           Harry          
marks      HR Marketing    HR Marketing
year exam                              
2018 1     65        69    87        84
     2     76        98    53        87
2019 1     58        62    70        93
     2     62        71    83        66
```

#### a. Unstack

Let unstack the results, which by default applies to `level=-1`, i.e, the last index in the multi-index series.

```python
# unstack
df_multi_unstack = df_multi.unstack()
print(df_multi_unstack)
```

As we can see, the index name ‘exam’ is now unstacked and become part of another level in the column:

```
student Tom                   Harry                  
marks    HR     Marketing        HR     Marketing    
exam      1   2         1   2     1   2         1   2
year                                                 
2018     65  76        69  98    87  53        84  87
2019     58  62        62  71    70  83        93  66
```

Let’s `unstack` with `level=0` which will unstack the index, name `year` into another level in the column

```python
# unstack with 'level=0'
print(df_multi.unstack(level=0))
```

```
student  Tom                     Harry                    
marks     HR      Marketing         HR      Marketing     
year    2018 2019      2018 2019  2018 2019      2018 2019
exam                                                      
1         65   58        69   62    87   70        84   93
2         76   62        98   71    53   83        87   66
```

#### b. Stack

Let’s stack one of the DataFrame columns into index. By default, it applies to the last level in the column, which is `exam` in our example:

```python
df_multi_unstack.stack()
```

```
student   Harry           Tom          
marks        HR Marketing  HR Marketing
year exam                              
2018 1       87        84  65        69
     2       53        87  76        98
2019 1       70        93  58        62
     2       83        66  62        71
```

Let suppose, we would like to stack the `student` column instead of `exam` To do that we can provide `level=0` because `student` is the at position of `0`

```python
print(df_multi_unstack.stack(level=0))
```

```
marks         HR     Marketing    
exam           1   2         1   2
year student                      
2018 Harry    87  53        84  87
     Tom      65  76        69  98
2019 Harry    70  83        93  66
     Tom      58  62        62  71
```

### 4.3. Index Resetting and Setting

#### a. Index Reset

**Index to column:** We can use `reset_index` method to turn the index labels into columns. We can also fine control the result using [various parameters](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html) of this method :

```python
# reproducing multi-index series, pop
pop 
```

```
city     year
City B   2018    18976457
         2019    19378102
City A   2018    33871648
         2019    37253956
City C   2018    20851820
         2019    25145561
dtype: int64
```

Let’s apply `reset_index()` which will turn the old indices into columns and new integer based sequential index is used:

```python
print(pop.reset_index())
```

```
    city   year     0
0  City B  2018  18976457
1  City B  2019  19378102
2  City A  2018  33871648
3  City A  2019  37253956
4  City C  2018  20851820
5  City C  2019  25145561
```

Last column has no name, so let’s give it a name to make the results both presentable and meaningful:

```python
# let give name to the column
pop_resetindex = pop.reset_index(name='population')
print(pop_resetindex)
```

```
   city    year    population
0  City B  2018    18976457
1  City B  2019    19378102
2  City A  2018    33871648
3  City A  2019    37253956
4  City C  2018    20851820
5  City C  2019    25145561
```

If we don’t want to reset all indices to columns, we can use the argument `level=` to fine tune our results

#### b. Set Index

**Column-to-index:** We can use `set_index()` method to build a multi-index Series or DataFrame by providing the list of column labels that we would like to convert into indices:

```python
print(pop_resetindex.set_index(['city','year']))
```

```
                population
city     year            
City B   2018    18976457
         2019    19378102
City A   2018    33871648
         2019    37253956
City C   2018    20851820
         2019    25145561
```

## 5. DATA AGGREGATIONS ON MULTI-INDICES

In this section, we will perform `sum()`, `mean()`, `max()` kind of aggregation on multi-index DataFrame

```python
# reproducing multi-index dataframe
print(df_multi)
```

```
student   Tom           Harry           John          
marks      HR Marketing    HR Marketing   HR Marketing
year exam                                             
2018 1     58        85    54        91   73        69
     2     70        98    64        70   79        96
2019 1     67        79    87        79   81        70
     2     81        72    90        88   80        66
```

#### a. Along rows

Let suppose we would like to find the *mean scores in each year*, for each subject and each student. To accomplish this, we will use the keyword argument `level=year`

```python
df_mean = df_multi.mean(level='year')
print(df_mean)
```

```
student   Tom           Harry            John          
marks      HR Marketing    HR Marketing    HR Marketing
year                                                   
2018     64.0      91.5  59.0      80.5  76.0      82.5
2019     74.0      75.5  88.5      83.5  80.5      68.0
```

#### b. Along column and rows

Let suppose we would like to find the *mean scores each year for each subject*, for all subjects and exams. To accomplish this, we will use two keyword arguments `level='marks` and `axis=1` (to tell Pandas to look for level under column)

```python
print(df_mean.mean(axis=1, level='marks'))
```

```
marks         HR  Marketing
year                       
2018   66.333333  84.833333
2019   81.000000  75.666667
```
