# Combining Datasets In Pandas

## A: COMBINING DATASETS: CONCAT AND APPEND

We will start with basic examples of concatenation of `Series` and `DataFrames` objects, with the `pd.concat` function; later we will dive into more sophisticated in-memory `merge` and `join` implemented in Pandas.

```python
import numpy as np
import pandas as pd 
```

#### Function to construct DataFrame

We will first define a function that will be used to make the DataFrame from letters and numbers, with fewer keystrokes and help us keep the code clean:

```python
def make_df(columns, indices):
    data = {c: [f"{c}{i}" for i in indices]
           for c in columns}
    return pd.DataFrame(data, indices)
```

```
# understanding the components of function
data = {c: [f"{c}{i}" for i in range(3)]
           for c in 'ABC'}
data
```

```
{'A': ['A0', 'A1', 'A2'], 'B': ['B0', 'B1', 'B2'], 'C': ['C0', 'C1', 'C2']}
```

```python
# example to run function
example_df = make_df('ABC', [1,2,3])
print(example_df)
```

```
    A   B   C
1  A1  B1  C1
2  A2  B2  C2
3  A3  B3  C3
```

## 1. SIMPLE CONCATENATION WITH pd.concat

Pandas has a function, `pd.concat()`, which has a similar syntax to `np.concatenate` but contains a number of options Syntax for Pandas concat function:

`pd.concat(objs, paramters)`

| Parameter          | Default |
| ------------------ | ------- |
| axis=              | 0       |
| join=              | ‘outer’ |
| join\_axes=        | None    |
| ignore\_index=     | False   |
| keys=              | None    |
| levels=            | None    |
| names=             | None    |
| verify\_integrity= | False   |
| copy               | True    |

### 1.1. Concatenating Series

```python
# concatenating along columns (stacking one on top of other), 
# because default value of keyword argument, index=0 
ser1 = pd.Series(['A','B','C'], index=[1,2,3])
ser2 = pd.Series(['X','Y','Z'], index=[4,5,6])

pd.concat([ser1,ser2])
```

```
1    A
2    B
3    C
4    X
5    Y
6    Z
dtype: object
```

### 1.2. Concatenating DataFrame

→ In this first example, for both DataFrame objects that we are going to concatenate, column names are the same i.e, A,B and C and indices are different. Because we are going to use `index=0` default value which will concatenate along columns:

```python
# concatenating along columns (stacking one on top of other),
# because default value of keyword argument, index=0 
df1 = make_df('ABC',[1,2,3])
df2 = make_df('ABC',[4,5,6])

print(df1)
print(df2)
print(pd.concat([df1,df2]))
```

```
    A   B   C
1  A1  B1  C1
2  A2  B2  C2
3  A3  B3  C3
    A   B   C
4  A4  B4  C4
5  A5  B5  C5
6  A6  B6  C6
    A   B   C
1  A1  B1  C1
2  A2  B2  C2
3  A3  B3  C3
4  A4  B4  C4
5  A5  B5  C5
6  A6  B6  C6
```

→ In this second example, for both DataFrame objects that we are going to concatenate, column names are the different and indices are same (1,2). Because we are going to use `index=1` which will concatenate along rows:

```python
# concatenating along rows
# keyword argument, axis=1
df3 = make_df(['A','B'], [1,2])
df4 = make_df(['C','D'], [1,2])

print(df3); print(df4); print(pd.concat([df3,df4], axis=1))
```

```
    A   B
1  A1  B1
2  A2  B2
    C   D
1  C1  D1
2  C2  D2
    A   B   C   D
1  A1  B1  C1  D1
2  A2  B2  C2  D2
```

### 1.3. Duplicate Indices

One important difference between `np.concatenate` and `pd.concat` is that Pandas concatenation **preserves indices**, even if the result will have duplicate indices

```python
# making two series with same index
sr3 = pd.Series(['A','B'], index=[1,2])
sr4 = pd.Series(['A','B'], index=[1,2])

pd.concat([sr3,sr4])
```

```
1    A
2    B
1    A
2    B
dtype: object
```

`pd.concat()` gives us a few ways to handle the repeated indices issue

#### a) Handling duplicate indices through `varify_integrity`

`varify_integrity=True` checks whether the new concatenated axis contains duplicates. If yes, it will raise the `ValueEror`

```python
pd.concat([sr3,sr4], verify_integrity=True)
```

```
ValueError: Indexes have overlapping values: Int64Index([1, 2], dtype='int64')
```

#### b) Using `keys` Argument

We can use `keys=[]` kwarg to form Multi-index Series or DataFrame

```python
pd.concat([sr3,sr4], keys=['y','z'])
```

```
y  1    A
   2    B
z  1    A
   2    B
dtype: object
```

### 1.4. Concatenation with “Join”

In the examples above, we discussed cases of concatenating DataFrames with shared column names. In practice, data from different sources might have different sets of column names, and `pd.concat` offers several options to handle this.

```python
# look what happens when we concat the two df with not identical columns
df5 = make_df('ABC',[1,2]) 
df6 = make_df('BCD', [3,4])

pd.concat([df5,df6], sort=False) 
#remember the default value of kwarg, join=outer
```

```
     A   B   C    D
1   A1  B1  C1  NaN
2   A2  B2  C2  NaN
3  NaN  B3  C3   D3
4  NaN  B4  C4   D4
```

> `join='inner'`: joins intersection of columns in the DataFrames

```python
print(pd.concat([df5,df6], join='inner'))
```

```
    B   C
1  B1  C1
2  B2  C2
3  B3  C3
4  B4  C4
```

### 1.5. The append() method

Series and DataFrame objects have an `.append()` method that can accomplish the same behaviour like `.concat()` but in fewer keystrokes Unlike the `append()` and `extend()` methods of Python lists, the `append()` method in Pandas does not modify the original object

```python
print(df1.append(df2))
```

```
    A   B   C
1  A1  B1  C1
2  A2  B2  C2
3  A3  B3  C3
4  A4  B4  C4
5  A5  B5  C5
6  A6  B6  C6
```

## B: COMBINING DATASETS: MERGE AND JOIN

We will very briefly discuss `pd.join`, but the discussion is mainly focused on `pd.merge` function

**A note on Relational Algebra**

`pd.merge()` works in a manner that is considered to be a subset of what is known as relational algebra — formal set of rules for manipulating relational data. Pandas implements several of these fundamental building blocks in the `pd.merge()` function and the related `join()` method of Series and DataFrames.

### 1. CATEGORIES OF MERGE

The `pd.merge()` function implements a number of types of joins: the one-to-one, many-to-one, and many-to-many joins — the type of join performed depends on the form of the input data

### 1.1. One-to-one Merge

```python
# creating two DataFrames with one identical column name
df11 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                      'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
                    
df12 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                     'hire_date': [2004, 2008, 2012, 2014]})

print(df11)
print(df12)
```

```
  employee        group
0      Bob   Accounting
1     Jake  Engineering
2     Lisa  Engineering
3      Sue           HR
  employee  hire_date
0     Lisa       2004
1      Bob       2008
2     Jake       2012
3      Sue       2014
```

When we merge these two DataFrame object, `df11` and `df12`, the `pd.merge()` function recognizes that each DataFrame has a common `employee`column, and automatically joins using this column as a key

```python
df13 = pd.merge(df11,df12)
print(df13)
```

```
  employee        group  hire_date
0      Bob   Accounting       2008
1     Jake  Engineering       2012
2     Lisa  Engineering       2004
3      Sue           HR       2014
```

### 1.2. Many-to-one Merge

It is type of Join in which one of the two matching key columns contains duplicate entries. For the many-to-one case, the resulting DataFrame will preserve those duplicate entries as appropriate:

```python
df14 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
                     'supervisor': ['Carly', 'Guido', 'Steve']})
print(df13)
print(df14)
```

```
  employee        group  hire_date
0      Bob   Accounting       2008
1     Jake  Engineering       2012
2     Lisa  Engineering       2004
3      Sue           HR       2014
         group supervisor
0   Accounting      Carly
1  Engineering      Guido
2           HR      Steve
```

When we merge these two DataFrame object, `df13` and `df14`, the `pd.merge()` function recognizes the common column `group` and automatically joins using this column as a key. However, `group` column under `df13` contain 4 values (3 unique), while under `df14` contain 3 values (all 3 unique) In this case, the **many-to-one** use same `supervisor` value for both `Engineering` row

```python
print(pd.merge(df13,df14))
```

```
  employee        group  hire_date supervisor
0      Bob   Accounting       2008      Carly
1     Jake  Engineering       2012      Guido
2     Lisa  Engineering       2004      Guido
3      Sue           HR       2014      Steve
```

### 1.3. Many-to-many Merge

If the key column in both the left and right DataFrame contains duplicates, then the result is a many-to-many merge. Example makes understanding it easier:

```python
df15 = pd.DataFrame({'group': ['Accounting', 'Accounting', 'Engineering', 'Engineering', 'HR', 'HR'],
                     'skills': ['math', 'spreadsheets', 'coding', 'linux', 'spreadsheets', 'organization']})

print(df11); print(df15)
```

```
  employee        group
0      Bob   Accounting
1     Jake  Engineering
2     Lisa  Engineering
3      Sue           HR
         group        skills
0   Accounting          math
1   Accounting  spreadsheets
2  Engineering        coding
3  Engineering         linux
4           HR  spreadsheets
5           HR  organization
```

When we merge these two DataFrame objects, `df11` and `df15`, the `pd.merge()` function recognizes the common column `group` and automatically joins using this column as a key. However, there are duplicates values under `group` for both DataFrame object so the join would be **many-to-many** join. `Bob` is related to `accounting` `group` but `accounting` has two `skills`, so both will have their own row. Same logic goes for `Jake`, `Lisa` under `engineering`

```python
print(pd.merge(df11,df15))
```

```
  employee        group        skills
0      Bob   Accounting          math
1      Bob   Accounting  spreadsheets
2     Jake  Engineering        coding
3     Jake  Engineering         linux
4     Lisa  Engineering        coding
5     Lisa  Engineering         linux
6      Sue           HR  spreadsheets
7      Sue           HR  organization
```

## 2. SPECIFICATION OF MERGE KEY

Often the column names will not match so nicely as we have seen in above examples, but `pd.merge()` provides a variety of options to handle this scenario and explicitly tell to merge which column/key.

### 2.1. The `on` Keyword Argument

Explicitly specify the name of the key column using the `on=` keyword argument, which takes a column name or a list of column names

```python
print(pd.merge(df11,df12, on='employee'))
```

```
  employee        group  hire_date
0      Bob   Accounting       2008
1     Jake  Engineering       2012
2     Lisa  Engineering       2004
3      Sue           HR       2014
```

### 2.2. The `left_on` and `right_on` Keyword Argument

For example, we may have a dataset in which the employee name is labeled as “name” rather than “employee” but the values are same inside i.e, just the column label is different. In such case, we can specify which column to merge `left_on=` and which to merge on `right_on=` Example will make it clearer:

```python
df16 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                     'salary': [70000, 80000, 120000, 90000]})
print(df16)
```

```
   name  salary
0   Bob   70000
1  Jake   80000
2  Lisa  120000
3   Sue   90000
```

```python
df17 = pd.merge(df11,df16,  left_on='employee', right_on='name')
print(df17)
```

```
  employee        group  name  salary
0      Bob   Accounting   Bob   70000
1     Jake  Engineering  Jake   80000
2     Lisa  Engineering  Lisa  120000
3      Sue           HR   Sue   90000
```

### 2.3. `drop` method

What to do if we want to delete a specified label from an index or column? use `.drop()` method. There are two ways to implement this:

* Specify the label names and `axis` (0 if along row, 1 if along column)
* Directly specify the `columns` or `index` name to remove the column or row, respectively.

This is the example of the first case — specifying the label name and axis:

```python
print(df17.drop('name', axis=1))
```

```
  employee        group  salary
0      Bob   Accounting   70000
1     Jake  Engineering   80000
2     Lisa  Engineering  120000
3      Sue           HR   90000
```

### 2.4. The `left_index` and `right_index`

Sometimes, rather than merging on a column, we would instead like to merge on an index. `left_index=` and `right_index=` keyword arguments, takes in boolean value of True or False

`.set_index()` method will set the provided column as the index of a DataFrame

```python
# first let set_index of df11 and df12 so we can merge on rows
df11a = df11.set_index('employee')
df12a = df12.set_index('employee')

print(df11a)
print(df12a)
```

```
                group
employee             
Bob        Accounting
Jake      Engineering
Lisa      Engineering
Sue                HR
          hire_date
employee           
Lisa           2004
Bob            2008
Jake           2012
Sue            2014
```

```python
print(pd.merge(df11a, df12a, right_index=True, left_index=True))
```

```
                group  hire_date
employee                        
Bob        Accounting       2008
Jake      Engineering       2012
Lisa      Engineering       2004
Sue                HR       2014
```

### 2.5. Join method

To join by indices we can also use `.join` method, which by-default merges by indices. Here is an example:

```python
print(df11a.join(df12a))
```

```
                group  hire_date
employee                        
Bob        Accounting       2008
Jake      Engineering       2012
Lisa      Engineering       2004
Sue                HR       2014
```

### 2.6. Mix Merge with Indices and Columns

```python
print(df11a)
print(df16)
```

```
                group
employee             
Bob        Accounting
Jake      Engineering
Lisa      Engineering
Sue                HR
   name  salary
0   Bob   70000
1  Jake   80000
2  Lisa  120000
3   Sue   90000
```

We would like to merge DataFrame `df11a` and `df16` However, the common key is index, `employee` from `df11a` and column, `name` from `df16` Therefore, we will use `left_index=True` for `df11a` and `right_on='name'` for `df16`

```python
print(pd.merge(df11a, df16, left_index=True, right_on='name'))
```

```
         group  name  salary
0   Accounting   Bob   70000
1  Engineering  Jake   80000
2  Engineering  Lisa  120000
3           HR   Sue   90000
```

## 3. SPECIFYING `how` TO MERGE

The `how` keyword argument defines the type of merge to perform, which takes one of these four values:

* `how=inner` which is default value and keeps the intersection of keys. It preserve the keys order of the DataFrame mentioned on the left
* `how=outer` keeps the union of the keys
* `how=left` keeps the keys of DataFrame mentioned on the left side
* `how=right` keeps the keys of DataFrame mentioned on the right side

```python
# starting by defining dataframes

df18 = pd.DataFrame({'name': ['Parker', 'Piper', 'Mia'],
                    'food': ['chicken', 'rice', 'bread']}, columns=['name', 'food'])

df19 = pd.DataFrame({'name': ['Mia', 'Justin'], 
                     'drink': ['soda', 'water']}, columns=['name', 'drink'])

print(df18)
print(df19)
```

```
    name      food
0  Parker   chicken
1  Piper    rice
2  Mia      bread
    name     drink
0  Mia      soda
1 Justin    water
```

### 3.1. `how='inner'`

```python
print(pd.merge(df18, df19))
```

```
   name   food drink
0  Mia  bread  soda
```

Let change the order of frames in the `pd.merge()`

```python
print(pd.merge(df19, df18))
```

```
   name drink   food
0  Mia  soda  bread
```

### 3.2. `how='outer'`

```python
print(pd.merge(df18,df19, how='outer'))
```

```
     name    food     drink
0  Parker   chicken   NaN
1  Piper    rice      NaN
2  Mia      bread     soda
3  Justin   NaN       water
```

### 3.3. `how=‘left’`

```python
print(pd.merge(df18,df19, how='left'))
```

```
    name   food       drink
0  Parker   chicken   NaN
1  Piper    rice      NaN
2  Mia      bread     soda
```

### 3.4. `how=‘right’`

```python
print(pd.merge(df18,df19, how='right'))
```

```
   name      food   drink
0  Mia       bread  soda
1  Justin    NaN    water
```

## 4. OVERLAPPING COLUMN NAMES

We may end up in a situation where our two input DataFrames have two or more same labeled column. We can use `on=` to specify column name to merge on.

```python
df20 = pd.DataFrame({'name': ['A','B','C'],
                    'rank': [1,2,3]})
df21 = pd.DataFrame({'name':['C','A','B'],
                    'rank':[1,2,3]})

print(pd.merge(df20,df21, on='name'))
```

```
  name  rank_x  rank_y
0    A       1       2
1    B       2       3
2    C       3       1
```

Instead of appending the column names with *\_x* and *\_y* we can give our own keywords, using argument `suffixes=[]`

```python
print(pd.merge(df20,df21, on='name', suffixes=['_bio','_stats']))
```

```
  name  rank_bio  rank_stats
0    A         1           2
1    B         2           3
2    C         3           1
```


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://codingnotes.gitbook.io/coding_notes/coding/pandas/combining-datasets-in-pandas.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
