MicromOne: Handling Missing Data in Pandas: A Complete Guide

Before training any machine learning model, one of the most important steps is preparing the data. Real-world datasets often contain errors, outliers, or inconsistent values, but the most common issue is missing data. In Pandas, missing values are usually represented as NaN. In this article, you will learn how to detect, count, remove, and replace NaN values using practical examples.

Creating a DataFrame with Missing Values

To begin, let’s create a simple DataFrame that contains some NaN values. We define a list of dictionaries and convert it into a DataFrame. Because the dictionaries do not all share the same keys, Pandas fills each missing entry with NaN:

import pandas as pd

# Each dictionary is one store; keys that a store lacks become NaN
items2 = [
    {'bikes': 20, 'pants': 30, 'watches': 35, 'shirts': 15, 'shoes': 8, 'suits': 45},
    {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants': 5, 'shirts': 2, 'shoes': 5, 'suits': 7},
    {'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4, 'shoes': 10}
]

store_items = pd.DataFrame(items2, index=['store 1', 'store 2', 'store 3'])
store_items

The resulting DataFrame contains three NaN values: glasses for store 1, and shirts and suits for store 3.

Detecting Missing Values

When working with large datasets, visually identifying NaN values is not practical. Pandas provides useful methods to detect and count them.

Counting Total NaN Values

You can combine the isnull() method with sum() to count all NaN values:

store_items.isnull().sum().sum()

The isnull() method returns a Boolean DataFrame where True marks missing values. Since True is treated as 1, the first sum() counts the NaN values in each column, and the second sum() adds those column counts into a single total.
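The two-step sum can be made explicit by keeping the intermediate per-column counts. This is a minimal sketch that rebuilds the example DataFrame so it runs on its own; the variable names nan_per_column and total_nan are illustrative and not part of the original article.

```python
import pandas as pd

items2 = [
    {'bikes': 20, 'pants': 30, 'watches': 35, 'shirts': 15, 'shoes': 8, 'suits': 45},
    {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants': 5, 'shirts': 2, 'shoes': 5, 'suits': 7},
    {'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4, 'shoes': 10},
]
store_items = pd.DataFrame(items2, index=['store 1', 'store 2', 'store 3'])

# First sum(): one NaN count per column (True counts as 1)
nan_per_column = store_items.isnull().sum()

# Second sum(): add the column counts into a single total
total_nan = int(nan_per_column.sum())

print(nan_per_column)
print("Total NaN values:", total_nan)  # Total NaN values: 3
```

The per-column Series is often more useful than the grand total, because it shows which columns need attention.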

Checking Which Values Are Missing

store_items.isnull()

This shows True wherever the DataFrame contains NaN.
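On a large DataFrame the full Boolean mask is hard to read, so it can help to reduce it with any(). The sketch below is one possible way to list only the rows and columns that contain at least one NaN; the names missing_mask, rows_with_nan, and cols_with_nan are illustrative.

```python
import pandas as pd

items2 = [
    {'bikes': 20, 'pants': 30, 'watches': 35, 'shirts': 15, 'shoes': 8, 'suits': 45},
    {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants': 5, 'shirts': 2, 'shoes': 5, 'suits': 7},
    {'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4, 'shoes': 10},
]
store_items = pd.DataFrame(items2, index=['store 1', 'store 2', 'store 3'])

missing_mask = store_items.isnull()

# Rows where at least one column is NaN
rows_with_nan = store_items[missing_mask.any(axis=1)]

# Column labels where at least one row is NaN
cols_with_nan = store_items.columns[missing_mask.any(axis=0)].tolist()

print(rows_with_nan.index.tolist())  # ['store 1', 'store 3']
print(cols_with_nan)                 # ['shirts', 'suits', 'glasses']
```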

Counting Non-Missing Values

If you want the opposite, use the count() method:

store_items.count()

This returns the number of non-NaN values in each column.
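As a quick check, count() and isnull().sum() should always add up to the number of rows. The sketch below also shows count(axis=1), which counts non-missing values per row instead of per column; the variable names are illustrative.

```python
import pandas as pd

items2 = [
    {'bikes': 20, 'pants': 30, 'watches': 35, 'shirts': 15, 'shoes': 8, 'suits': 45},
    {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants': 5, 'shirts': 2, 'shoes': 5, 'suits': 7},
    {'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4, 'shoes': 10},
]
store_items = pd.DataFrame(items2, index=['store 1', 'store 2', 'store 3'])

# Non-missing values per column (the default, axis=0)
non_missing_per_column = store_items.count()

# Non-missing values per row
non_missing_per_row = store_items.count(axis=1)

print(non_missing_per_column)
print(non_missing_per_row)

# Missing plus non-missing equals the number of rows, for every column
check = non_missing_per_column + store_items.isnull().sum()
print((check == len(store_items)).all())  # True
```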

Removing NaN Values

Once you detect missing data, you can choose to eliminate it. Pandas allows you to drop rows or columns that contain NaN values.

Dropping Rows with NaN

store_items.dropna(axis=0)

This removes any row that has at least one NaN.

Dropping Columns with NaN

store_items.dropna(axis=1)

This removes any column containing NaN.

By default, dropna() returns a new DataFrame and leaves the original unchanged. To modify the original, pass inplace=True or assign the result back to a variable.
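To see how much data each option discards on the example DataFrame, the sketch below keeps both results in new variables (rows_kept and cols_kept are illustrative names) and confirms that store_items itself is untouched.

```python
import pandas as pd

items2 = [
    {'bikes': 20, 'pants': 30, 'watches': 35, 'shirts': 15, 'shoes': 8, 'suits': 45},
    {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants': 5, 'shirts': 2, 'shoes': 5, 'suits': 7},
    {'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4, 'shoes': 10},
]
store_items = pd.DataFrame(items2, index=['store 1', 'store 2', 'store 3'])

# Drop every row that has at least one NaN
rows_kept = store_items.dropna(axis=0)

# Drop every column that has at least one NaN
cols_kept = store_items.dropna(axis=1)

print(rows_kept.index.tolist())    # ['store 2']
print(cols_kept.columns.tolist())  # ['bikes', 'pants', 'watches', 'shoes']
print(store_items.shape)           # (3, 7), the original is unchanged
```

On this small dataset, dropping rows leaves a single store and dropping columns removes three of the seven products, which illustrates why replacing values is often preferred.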

Replacing NaN Values

Instead of deleting data, a common approach is replacing NaN values. Pandas provides several useful methods for this.

Replacing NaN with a Fixed Value

You can fill all missing values with zero:

store_items.fillna(0)
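Zero is not always a sensible substitute, since it can read as "no stock" rather than "unknown". The sketch below shows the zero fill next to one alternative that the article does not cover, filling each column with its own mean, purely as an illustration; the names filled_zero and filled_mean are hypothetical.

```python
import pandas as pd

items2 = [
    {'bikes': 20, 'pants': 30, 'watches': 35, 'shirts': 15, 'shoes': 8, 'suits': 45},
    {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants': 5, 'shirts': 2, 'shoes': 5, 'suits': 7},
    {'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4, 'shoes': 10},
]
store_items = pd.DataFrame(items2, index=['store 1', 'store 2', 'store 3'])

# Replace every NaN with the same fixed value
filled_zero = store_items.fillna(0)

# Alternative (assumption, not from the article): fill each column with its mean.
# mean() skips NaN by default, and fillna() matches the Series to columns by label.
filled_mean = store_items.fillna(store_items.mean())

print(filled_zero.loc['store 1', 'glasses'])  # 0.0
print(filled_mean.loc['store 1', 'glasses'])  # 27.0, the mean of 50 and 4
```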

Forward Fill (Using Previous Values)

Forward filling replaces a missing value with the most recent non-missing value.

Down a Column (axis=0)

store_items.ffill(axis=0)

Each NaN takes the nearest non-missing value above it in the same column. A NaN in the first row stays NaN because there is nothing above it to copy.

Across a Row (axis=1)

store_items.ffill(axis=1)

Each NaN takes the nearest non-missing value to its left in the same row. A NaN in the first column stays NaN for the same reason.
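The difference between the two directions is easiest to see on specific cells. The sketch below applies both forward fills to the example DataFrame; filled_down and filled_across are illustrative names.

```python
import pandas as pd

items2 = [
    {'bikes': 20, 'pants': 30, 'watches': 35, 'shirts': 15, 'shoes': 8, 'suits': 45},
    {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants': 5, 'shirts': 2, 'shoes': 5, 'suits': 7},
    {'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4, 'shoes': 10},
]
store_items = pd.DataFrame(items2, index=['store 1', 'store 2', 'store 3'])

# Down each column: copy the value from the row above
filled_down = store_items.ffill(axis=0)

# Across each row: copy the value from the column to the left
filled_across = store_items.ffill(axis=1)

# store 3 'shirts' takes store 2's value when filling down
print(filled_down.loc['store 3', 'shirts'])     # 2.0

# store 1 'glasses' is in the first row, so filling down leaves it NaN
print(filled_down.loc['store 1', 'glasses'])    # nan

# Across the row, store 1 'glasses' takes the 'suits' value to its left
print(filled_across.loc['store 1', 'glasses'])  # 45.0
```

Filling across a row copies a value from a different product, so it only makes sense when neighbouring columns measure comparable things.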

Backward Fill (Using Next Values)

Backward filling uses the next available non-missing value.

Down a Column

store_items.bfill(axis=0)

Across a Row

store_items.bfill(axis=1)

Like the previous methods, forward and backward filling do not modify the original DataFrame unless you set inplace=True.
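Backward fill mirrors forward fill, so the cells that remain NaN are the ones at the end of a column or row rather than at the start. A minimal sketch on the example DataFrame, with illustrative variable names:

```python
import pandas as pd

items2 = [
    {'bikes': 20, 'pants': 30, 'watches': 35, 'shirts': 15, 'shoes': 8, 'suits': 45},
    {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants': 5, 'shirts': 2, 'shoes': 5, 'suits': 7},
    {'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4, 'shoes': 10},
]
store_items = pd.DataFrame(items2, index=['store 1', 'store 2', 'store 3'])

# Down each column: copy the value from the row below
back_down = store_items.bfill(axis=0)

# Across each row: copy the value from the column to the right
back_across = store_items.bfill(axis=1)

# store 1 'glasses' takes store 2's value when filling backward down the column
print(back_down.loc['store 1', 'glasses'])   # 50.0

# store 3 is the last row, so its NaN values have nothing below to copy
print(back_down.loc['store 3', 'shirts'])    # nan

# Across the row, store 3 'shirts' takes the 'shoes' value to its right
print(back_across.loc['store 3', 'shirts'])  # 10.0
```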

Interpolating Missing Values

Interpolation estimates missing values from nearby data points. Linear interpolation, the most common choice, places each missing value on a straight line between its nearest non-missing neighbors.

Interpolating Down a Column

store_items.interpolate(method='linear', axis=0)

Interpolating Across a Row

store_items.interpolate(method='linear', axis=1)
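The sketch below applies both calls to the example DataFrame and inspects individual cells. It assumes the default fill direction of interpolate(), under which a NaN before the first valid value is left as NaN and a NaN after the last valid value repeats that last value. The variable names are illustrative.

```python
import pandas as pd

items2 = [
    {'bikes': 20, 'pants': 30, 'watches': 35, 'shirts': 15, 'shoes': 8, 'suits': 45},
    {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants': 5, 'shirts': 2, 'shoes': 5, 'suits': 7},
    {'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4, 'shoes': 10},
]
store_items = pd.DataFrame(items2, index=['store 1', 'store 2', 'store 3'])

# Interpolate down each column, treating rows as equally spaced
interp_down = store_items.interpolate(method='linear', axis=0)

# Interpolate across each row, treating columns as equally spaced
interp_across = store_items.interpolate(method='linear', axis=1)

# store 3 'shirts' sits between watches (35) and shoes (10) in its row,
# so the straight-line estimate is their midpoint
print(interp_across.loc['store 3', 'shirts'])  # 22.5

# store 3 'suits' sits between shoes (10) and glasses (4)
print(interp_across.loc['store 3', 'suits'])   # 7.0

# store 1 'glasses' has no earlier row, so it stays NaN when going down
print(interp_down.loc['store 1', 'glasses'])   # nan
```

With only three stores, no missing cell in a column has valid values both above and below it, so the column-wise result here matches a forward fill. The row-wise result shows true midpoints, but it mixes unrelated products, so treat it as a demonstration of the mechanics only.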
