Combining Datasets In Pandas

A: COMBINING DATASETS: CONCAT AND APPEND

We will start with basic examples of concatenation of Series and DataFrames objects, with the pd.concat function; later we will dive into more sophisticated in-memory merge and join implemented in Pandas.

import numpy as np
import pandas as pd 

Function to construct DataFrame

We will first define a function that will be used to make the DataFrame from letters and numbers, with fewer keystrokes and help us keep the code clean:

def make_df(columns, indices):
    data = {c: [f"{c}{i}" for i in indices]
           for c in columns}
    return pd.DataFrame(data, indices)
# understanding the components of function
data = {c: [f"{c}{i}" for i in range(3)]
           for c in 'ABC'}
data
{'A': ['A0', 'A1', 'A2'], 'B': ['B0', 'B1', 'B2'], 'C': ['C0', 'C1', 'C2']}
# example to run function
example_df = make_df('ABC', [1,2,3])
print(example_df)

1. SIMPLE CONCATENATION WITH pd.concat

Pandas has a function, pd.concat(), which has a similar syntax to np.concatenate but contains a number of options Syntax for Pandas concat function:

pd.concat(objs, paramters)

Parameter
Default

axis=

0

join=

β€˜outer’

join_axes=

None

ignore_index=

False

keys=

None

levels=

None

names=

None

verify_integrity=

False

copy

True

1.1. Concatenating Series

1.2. Concatenating DataFrame

β†’ In this first example, for both DataFrame objects that we are going to concatenate, column names are the same i.e, A,B and C and indices are different. Because we are going to use index=0 default value which will concatenate along columns:

β†’ In this second example, for both DataFrame objects that we are going to concatenate, column names are the different and indices are same (1,2). Because we are going to use index=1 which will concatenate along rows:

1.3. Duplicate Indices

One important difference between np.concatenate and pd.concat is that Pandas concatenation preserves indices, even if the result will have duplicate indices

pd.concat() gives us a few ways to handle the repeated indices issue

a) Handling duplicate indices through varify_integrity

varify_integrity=True checks whether the new concatenated axis contains duplicates. If yes, it will raise the ValueEror

b) Using keys Argument

We can use keys=[] kwarg to form Multi-index Series or DataFrame

1.4. Concatenation with β€œJoin”

In the examples above, we discussed cases of concatenating DataFrames with shared column names. In practice, data from different sources might have different sets of column names, and pd.concat offers several options to handle this.

join='inner': joins intersection of columns in the DataFrames

1.5. The append() method

Series and DataFrame objects have an .append() method that can accomplish the same behaviour like .concat() but in fewer keystrokes Unlike the append() and extend() methods of Python lists, the append() method in Pandas does not modify the original object

B: COMBINING DATASETS: MERGE AND JOIN

We will very briefly discuss pd.join, but the discussion is mainly focused on pd.merge function

A note on Relational Algebra

pd.merge() works in a manner that is considered to be a subset of what is known as relational algebra β€” formal set of rules for manipulating relational data. Pandas implements several of these fundamental building blocks in the pd.merge() function and the related join() method of Series and DataFrames.

1. CATEGORIES OF MERGE

The pd.merge() function implements a number of types of joins: the one-to-one, many-to-one, and many-to-many joins β€” the type of join performed depends on the form of the input data

1.1. One-to-one Merge

When we merge these two DataFrame object, df11 and df12, the pd.merge() function recognizes that each DataFrame has a common employeecolumn, and automatically joins using this column as a key

1.2. Many-to-one Merge

It is type of Join in which one of the two matching key columns contains duplicate entries. For the many-to-one case, the resulting DataFrame will preserve those duplicate entries as appropriate:

When we merge these two DataFrame object, df13 and df14, the pd.merge() function recognizes the common column group and automatically joins using this column as a key. However, group column under df13 contain 4 values (3 unique), while under df14 contain 3 values (all 3 unique) In this case, the many-to-one use same supervisor value for both Engineering row

1.3. Many-to-many Merge

If the key column in both the left and right DataFrame contains duplicates, then the result is a many-to-many merge. Example makes understanding it easier:

When we merge these two DataFrame objects, df11 and df15, the pd.merge() function recognizes the common column group and automatically joins using this column as a key. However, there are duplicates values under group for both DataFrame object so the join would be many-to-many join. Bob is related to accounting group but accounting has two skills, so both will have their own row. Same logic goes for Jake, Lisa under engineering

2. SPECIFICATION OF MERGE KEY

Often the column names will not match so nicely as we have seen in above examples, but pd.merge() provides a variety of options to handle this scenario and explicitly tell to merge which column/key.

2.1. The on Keyword Argument

Explicitly specify the name of the key column using the on= keyword argument, which takes a column name or a list of column names

2.2. The left_on and right_on Keyword Argument

For example, we may have a dataset in which the employee name is labeled as β€œname” rather than β€œemployee” but the values are same inside i.e, just the column label is different. In such case, we can specify which column to merge left_on= and which to merge on right_on= Example will make it clearer:

2.3. drop method

What to do if we want to delete a specified label from an index or column? use .drop() method. There are two ways to implement this:

  • Specify the label names and axis (0 if along row, 1 if along column)

  • Directly specify the columns or index name to remove the column or row, respectively.

This is the example of the first case β€” specifying the label name and axis:

2.4. The left_index and right_index

Sometimes, rather than merging on a column, we would instead like to merge on an index. left_index= and right_index= keyword arguments, takes in boolean value of True or False

.set_index() method will set the provided column as the index of a DataFrame

2.5. Join method

To join by indices we can also use .join method, which by-default merges by indices. Here is an example:

2.6. Mix Merge with Indices and Columns

We would like to merge DataFrame df11a and df16 However, the common key is index, employee from df11a and column, name from df16 Therefore, we will use left_index=True for df11a and right_on='name' for df16

3. SPECIFYING how TO MERGE

The how keyword argument defines the type of merge to perform, which takes one of these four values:

  • how=inner which is default value and keeps the intersection of keys. It preserve the keys order of the DataFrame mentioned on the left

  • how=outer keeps the union of the keys

  • how=left keeps the keys of DataFrame mentioned on the left side

  • how=right keeps the keys of DataFrame mentioned on the right side

3.1. how='inner'

Let change the order of frames in the pd.merge()

3.2. how='outer'

3.3. how=β€˜left’

3.4. how=β€˜right’

4. OVERLAPPING COLUMN NAMES

We may end up in a situation where our two input DataFrames have two or more same labeled column. We can use on= to specify column name to merge on.

Instead of appending the column names with _x and _y we can give our own keywords, using argument suffixes=[]

Last updated

Was this helpful?