Handling Missing Data In Pandas

1. INTRODUCTION

Detection of Missing Data

Two schemes to indicate the presence of missing data in a table or DataFrame:

  1. Masking Approach: The mask that can be a separate Boolean array

  2. Sentinel Approach: The sentinel value could be some;

    • data-specific convention, such as indicating a missing integer value with –9999 or some rare bit pattern, or

    • global convention, such as indicating a missing floating-point value with NaN (Not a Number), a special value which is part of the IEEE floating-point specification.

Handling Missing Data in Python

Pandas chose to use sentinels for missing data , and further chose to use two already-existing Python null values: the special floating-point NaN value, and the Python None object.

  1. None: Pythonic Missing Data: Because None is a Python object, it cannot be used in any arbitrary NumPy array, but only in arrays with data type ‘object’ (i.e., arrays of Python objects)

  2. NaN: Missing Numerical Data: NaN (acronym for Not a Number), is different; it is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation

import numpy as np
import pandas as pd
# None, in an array, makes the whole array an object
None_object = np.array([1,None,3,4])
None_object
array([1, None, 3, 4], dtype=object)
# Nan, in an array, returns a standard flaoting point type array
NaN_type = np.array([1,np.nan,2,3])
NaN_type
array([ 1., nan,  2.,  3.])
# checking data type
NaN_type.dtype
dtype('float64')

Operating on Null Values

Pandas treats None and NaN as essentially interchangeable for indicating missing or null values. To facilitate this convention, there are several useful methods for detecting, removing, and replacing null values in Pandas data structures.

2. DETECTING NULL VALUES

Pandas data structures have two useful methods for detecting null data: isnull() and notnull() Either one will return a Boolean mask over the data.

a. isnull

data_pd = pd.Series([1,np.nan,2,3,None])
data_pd.isnull()
0    False
1     True
2    False
3    False
4     True
dtype: bool

b. notnull

# is not null? True means yes, not null
data_pd.notnull()
0     True
1    False
2     True
3     True
4    False
dtype: bool
# masking to get only the data which is not null
data_pd[data_pd.notnull()]
0    1.0
2    2.0
3    3.0
dtype: float64

3. DROPPING NULL VALUES

We use dropna() method on Series or DataFrame, which removes NaN values

a. On Series

data_pd.dropna() 
0    1.0
2    2.0
3    3.0
dtype: float64

b. On DataFrame

# Creating df with some missing values
data_df = pd.DataFrame([[1, np.nan, 2],
                       [2, 200, 300],
                       

[np.nan, 0,1]

]) print(data_df)

     0      1    2
0  1.0    NaN    2
1  2.0  200.0  300
2  NaN    0.0    1
print(data_df.dropna())
     0      1    2
1  2.0  200.0  300
  • Using dropna() method, we cannot drop single values from a DataFrame; we can only drop complete row(s) or complete column(s), where one of the cell contains NaN

  • Depending on the application, you might want one or the other, so dropna() gives a number of options to handle this

➞ Using axis=column keyword argument to apply the dropna() to columns of a DataFrame

print(data_df.dropna(axis='columns'))
     2
0    2
1  300
2    1

➞ We can drop column(s)/row(s) whose all cell values are NaN through kwarg how='all'

# adding column to data_df with all values as nan
data_df[3] =np.nan
print(data_df)
     0      1    2   3
0  1.0    NaN    2 NaN
1  2.0  200.0  300 NaN
2  NaN    0.0    1 NaN
# dropping the column(s) whose all values are NaN
print(data_df.dropna(axis='columns', how='all'))
     0      1    2
0  1.0    NaN    2
1  2.0  200.0  300
2  NaN    0.0    1
# by default, it applies on axis=index
# as none of the rows in our dataframe contains all NaN, so no row is dropped
print(data_df.dropna(how='all'))
     0      1    2   3
0  1.0    NaN    2 NaN
1  2.0  200.0  300 NaN
2  NaN    0.0    1 NaN

➞ Using keyword argument thresh=integer we can specify min number of non-null values, that must exist in row/column

print(data_df.dropna(thresh=3))
     0      1    2   3
1  2.0  200.0  300 NaN
print(data_df.dropna(axis='columns', thresh=3))
     2
0    2
1  300
2    1

4. FILLING THE NULL VALUES

  • We use fillna() method on a Series or DataFrame, which fills NaN values with a given value. This value might be a single number like zero or some other good-values

a. On Series

# reproducing series that contains NaN
data_pd
0    1.0
1    NaN
2    2.0
3    3.0
4    NaN
dtype: float64
# filling all NaN with integer, 101
data_pd.fillna(101)
0      1.0
1    101.0
2      2.0
3      3.0
4    101.0
dtype: float64

b. On DataFrame

For aDataFrame, we use same method but can also mention the axis keyword argument

# reproducing a DataFrame
print(data_df)
     0      1    2   3
0  1.0    NaN    2 NaN
1  2.0  200.0  300 NaN
2  NaN    0.0    1 NaN
# fill all instances of NaN with integer, 101
print(data_df.fillna(101))
       0      1    2      3
0    1.0  101.0    2  101.0
1    2.0  200.0  300  101.0
2  101.0    0.0    1  101.0

c. Types of Fill

We can use the keyword argument method=ffill or method=bfill to fill the values

Forward Fill

We can use forward fill (method=ffill) — to propagate previous value forward

➞ On Series

data_pd.fillna(method='ffill')
0    1.0
1    1.0
2    2.0
3    3.0
4    3.0
dtype: float64

➞ On DataFrame

print(data_df.fillna(method='ffill'))
     0      1    2   3
0  1.0    NaN    2 NaN
1  2.0  200.0  300 NaN
2  2.0    0.0    1 NaN
# with `axis=1`
print(data_df.fillna(method='ffill', axis=1))
     0      1      2      3
0  1.0    1.0    2.0    2.0
1  2.0  200.0  300.0  300.0
2  NaN    0.0    1.0    1.0

Backward Fill

We can use backward fill(method=bfill) — to propagate the next value backward

➞ On Series

data_pd.fillna(method='bfill')
0    1.0
1    2.0
2    2.0
3    3.0
4    NaN
dtype: float64

➞ On DataFrame

df_new = pd.DataFrame({'a':[1,np.nan,2],
                      'b':[3,np.nan,4],
                      'c':[5,np.nan,6]})
print(f"DataFrame:\n{df_new}")
print(f"DataFrame with bfill along axis=0:\n{df_new.fillna(method='bfill')}")
print(f"DataFrame with bfill along axis=1:\n{df_new.fillna(method='bfill', axis=1)}")
      a    b    c
0  1.0  3.0  5.0
1  NaN  NaN  NaN
2  2.0  4.0  6.0
DataFrame with bfill along axis=0:
     a    b    c
0  1.0  3.0  5.0
1  2.0  4.0  6.0
2  2.0  4.0  6.0
DataFrame with bfill along axis=1:
     a    b    c
0  1.0  3.0  5.0
1  NaN  NaN  NaN
2  2.0  4.0  6.0

Last updated