Handling Missing Data In Pandas
1. INTRODUCTION
Detection of Missing Data
There are two schemes for indicating the presence of missing data in a table or DataFrame:
Masking Approach: The mask can be a separate Boolean array that flags which values are missing.
Sentinel Approach: The sentinel value can be either:
a data-specific convention, such as indicating a missing integer value with -9999 or some rare bit pattern, or
a global convention, such as indicating a missing floating-point value with NaN (Not a Number), a special value that is part of the IEEE floating-point specification.
Handling Missing Data in Python
Pandas chose to use sentinels for missing data, and further chose to use two already-existing Python null values: the special floating-point NaN value and the Python None object.
None: Pythonic Missing Data: Because None is a Python object, it cannot be used in any arbitrary NumPy array, but only in arrays with data type 'object' (i.e., arrays of Python objects).
NaN: Missing Numerical Data: NaN (short for Not a Number) is different; it is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation.
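The difference between the two sentinels shows up in the array's data type. The following is a minimal sketch; the array contents and variable names are illustrative assumptions, not taken from the original.

```python
import numpy as np

# None is a Python object, so NumPy falls back to the generic 'object' dtype.
obj_arr = np.array([1, None, 3, 4])
print(obj_arr.dtype)

# NaN is a floating-point value, so the array keeps a native float dtype.
nan_arr = np.array([1, np.nan, 3, 4])
print(nan_arr.dtype)

# NaN propagates through arithmetic: any sum involving NaN is NaN.
plain_sum = nan_arr.sum()
print(plain_sum)
```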
Operating on Null Values
Pandas treats None and NaN as essentially interchangeable for indicating missing or null values. To facilitate this convention, there are several useful methods for detecting, removing, and replacing null values in Pandas data structures.
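As a small illustration of this interchangeability (the values below are assumed for the example), a numeric Series built with both None and NaN stores both as NaN:

```python
import numpy as np
import pandas as pd

# Both None and np.nan end up as NaN in a numeric Series.
s = pd.Series([1, np.nan, 2, None])
print(s)

null_count = int(s.isnull().sum())
print(null_count)
```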
2. DETECTING NULL VALUES
Pandas data structures have two useful methods for detecting null data: isnull() and notnull(). Either one will return a Boolean mask over the data.
a. isnull
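A minimal sketch of isnull(), assuming a small example Series named data (the values are illustrative):

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, "hello", None])

# True wherever the value is missing (NaN or None).
null_mask = data.isnull()
print(null_mask)
```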
b. notnull
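A minimal sketch of notnull(), again assuming an illustrative Series named data. Because the result is a Boolean mask, it can also be used directly as an index to keep only the non-null entries.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, "hello", None])

# True wherever a value is present.
present_mask = data.notnull()
print(present_mask)

# The mask can be used to select only the non-null entries.
present_values = data[present_mask]
print(present_values)
```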
3. DROPPING NULL VALUES
We use the dropna() method on a Series or DataFrame to remove NaN values.
a. On Series
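A minimal sketch on a Series, with assumed example values:

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, "hello", None])

# dropna() returns a new Series without the null entries;
# the original index labels of the surviving entries are kept.
cleaned = data.dropna()
print(cleaned)
```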
b. On DataFrame
    [np.nan, 0, 1]
])
print(data_df)
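The listing for this subsection is incomplete: only the final row and the name data_df are visible. As a minimal sketch, the first two rows below are assumptions added to make the example runnable; they are not the original data.

```python
import numpy as np
import pandas as pd

# Only the last row and the name data_df come from the original listing;
# the first two rows are assumed for illustration.
data_df = pd.DataFrame([
    [1, np.nan, 2],
    [2, 3, 5],
    [np.nan, 0, 1],
])
print(data_df)

# By default, dropna() removes every row that contains at least one null value.
rows_kept = data_df.dropna()
print(rows_kept)
```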
With the dropna() method, we cannot drop single values from a DataFrame; we can only drop complete rows or complete columns in which at least one cell contains NaN. Depending on the application, you might want one or the other, so dropna() gives a number of options to handle this:
➞ Use the keyword argument axis='columns' (or axis=1) to apply dropna() to the columns of a DataFrame
➞ Use the keyword argument how='all' to drop only the columns/rows whose cell values are all NaN
➞ Use the keyword argument thresh=integer to specify the minimum number of non-null values that a row/column must contain in order to be kept
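The three options can be sketched as follows. The DataFrame contents, including the extra all-NaN column, are assumptions chosen so that each option has a visible effect.

```python
import numpy as np
import pandas as pd

data_df = pd.DataFrame([
    [1, np.nan, 2],
    [2, 3, 5],
    [np.nan, 0, 1],
])
data_df[3] = np.nan  # add a column that contains only NaN

# axis='columns': drop every column that contains at least one NaN.
cols_kept = data_df.dropna(axis="columns")

# how='all': drop only the columns in which every value is NaN.
not_all_nan = data_df.dropna(axis="columns", how="all")

# thresh=3: keep only the rows that have at least 3 non-null values.
thresh_rows = data_df.dropna(thresh=3)

print(cols_kept)
print(not_all_nan)
print(thresh_rows)
```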
4. FILLING THE NULL VALUES
We use the fillna() method on a Series or DataFrame to replace NaN values with a given value. This value might be a single number such as zero, or some other sensible default.
a. On Series
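A minimal sketch on a Series, with assumed example values and index labels:

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 2, None, 3], index=list("abcde"))

# Replace every null entry with a single value, here zero.
filled = data.fillna(0)
print(filled)
```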
b. On DataFrame
For a DataFrame, we use the same method, but can also specify the axis keyword argument along which values are filled.
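A minimal sketch on a DataFrame with assumed values. With a single fill value every null cell is replaced regardless of axis; the axis only matters for the directional fills described under Types of Fill.

```python
import numpy as np
import pandas as pd

data_df = pd.DataFrame([
    [1, np.nan, 2],
    [2, 3, 5],
    [np.nan, 0, 1],
])

# Replace every null cell in the DataFrame with zero.
filled_df = data_df.fillna(0)
print(filled_df)
```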
c. Types of Fill
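We can pass the keyword argument method='ffill' or method='bfill' to fillna() to choose how the values are filled. Recent pandas releases deprecate this argument in favor of the dedicated ffill() and bfill() methods.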
Forward Fill
We can use forward fill (method='ffill') to propagate the previous value forward.
➞ On Series
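A minimal sketch with assumed values. It uses the dedicated ffill() method, which gives the same result as fillna(method='ffill') without relying on the method keyword.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 2, None, 3], index=list("abcde"))

# Each null entry takes the last valid value that precedes it.
forward = data.ffill()
print(forward)
```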
➞ On DataFrame
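On a DataFrame the axis decides the direction of the fill. The values below are assumed for illustration. Note that a null with no preceding value along the chosen axis stays null.

```python
import numpy as np
import pandas as pd

data_df = pd.DataFrame([
    [1, np.nan, 2],
    [2, 3, 5],
    [np.nan, 0, 1],
])

# Default (axis=0): fill down each column from the row above.
ffill_down = data_df.ffill()

# axis=1: fill along each row from the column to the left.
ffill_across = data_df.ffill(axis=1)

print(ffill_down)
print(ffill_across)
```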
Backward Fill
We can use backward fill (method='bfill') to propagate the next value backward.
➞ On Series
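A minimal sketch with assumed values, using the dedicated bfill() method, which gives the same result as fillna(method='bfill'):

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 2, None, 3], index=list("abcde"))

# Each null entry takes the next valid value that follows it.
backward = data.bfill()
print(backward)
```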
➞ On DataFrame
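As with forward fill, the axis sets the direction on a DataFrame. The values are assumed for illustration. A null with no following value along the chosen axis stays null.

```python
import numpy as np
import pandas as pd

data_df = pd.DataFrame([
    [1, np.nan, 2],
    [2, 3, 5],
    [np.nan, 0, 1],
])

# Default (axis=0): fill up each column from the row below.
bfill_up = data_df.bfill()

# axis=1: fill along each row from the column to the right.
bfill_across = data_df.bfill(axis=1)

print(bfill_up)
print(bfill_across)
```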