Before training a machine learning model, you need to prepare the data. Real-world datasets often contain errors, outliers, or inconsistent values, and missing data is one of the most common problems. In Pandas, missing values are usually represented as NaN (Not a Number). In this article, you will learn how to detect, count, remove, and replace NaN values using practical examples.
Creating a DataFrame with Missing Values
To begin, let’s create a simple DataFrame that contains some NaN values. We define a list of dictionaries, one per store, and convert it into a DataFrame. Any item that a store does not list becomes NaN:
import pandas as pd

# One dictionary per store; the keys become the column names
items2 = [
    {'bikes': 20, 'pants': 30, 'watches': 35, 'shirts': 15, 'shoes': 8, 'suits': 45},
    {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants': 5, 'shirts': 2, 'shoes': 5, 'suits': 7},
    {'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4, 'shoes': 10}
]

# Label the rows with store names instead of the default 0, 1, 2
store_items = pd.DataFrame(items2, index=['store 1', 'store 2', 'store 3'])
store_items
The resulting DataFrame contains three NaN values: store 1 has no glasses entry, and store 3 has no shirts or suits entries.
Detecting Missing Values
When working with large datasets, visually identifying NaN values is not practical. Pandas provides useful methods to detect and count them.
Counting Total NaN Values
You can combine the isnull() method with sum() to count all NaN values:
store_items.isnull().sum().sum()
The isnull() method returns a Boolean DataFrame where True marks missing values. Because True counts as 1, the first sum() gives the number of NaN values in each column, and the second sum() adds those column counts into a single total.
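To make the two sum() calls easier to follow, here is a minimal sketch that keeps the intermediate result. The variable names (nan_per_column, nan_per_row, total_nan) are illustrative and not part of the original example.

```python
import pandas as pd

# Same data as the store_items DataFrame built earlier in the article.
items2 = [
    {'bikes': 20, 'pants': 30, 'watches': 35, 'shirts': 15, 'shoes': 8, 'suits': 45},
    {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants': 5, 'shirts': 2, 'shoes': 5, 'suits': 7},
    {'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4, 'shoes': 10},
]
store_items = pd.DataFrame(items2, index=['store 1', 'store 2', 'store 3'])

# First sum(): one NaN count per column (a Series indexed by column name).
nan_per_column = store_items.isnull().sum()

# Summing along axis=1 instead gives one NaN count per row (per store).
nan_per_row = store_items.isnull().sum(axis=1)

# Second sum(): add the per-column counts into a single total.
total_nan = int(nan_per_column.sum())

print(nan_per_column)
print(nan_per_row)
print(total_nan)  # 3
```

The per-column and per-row counts are often more useful than the grand total, because they show where the gaps are concentrated.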
Checking Which Values Are Missing
store_items.isnull()
This returns a DataFrame of the same shape, with True wherever the original contains NaN and False everywhere else.
Counting Non-Missing Values
If you want the opposite, use the count() method:
store_items.count()
This returns the number of non-NaN values in each column.
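As a quick sanity check, the counts from count() and isnull().sum() should add up to the number of rows in every column. The sketch below verifies that and also computes the fraction of missing values per column; the names non_missing, missing, and missing_fraction are assumptions made for illustration.

```python
import pandas as pd

# Same data as the store_items DataFrame built earlier in the article.
items2 = [
    {'bikes': 20, 'pants': 30, 'watches': 35, 'shirts': 15, 'shoes': 8, 'suits': 45},
    {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants': 5, 'shirts': 2, 'shoes': 5, 'suits': 7},
    {'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4, 'shoes': 10},
]
store_items = pd.DataFrame(items2, index=['store 1', 'store 2', 'store 3'])

non_missing = store_items.count()        # non-NaN values per column
missing = store_items.isnull().sum()     # NaN values per column

# Every column should satisfy: non-missing + missing == number of rows.
print((non_missing + missing == len(store_items)).all())  # True

# The mean of a Boolean column is the fraction of True values,
# so this gives the share of missing entries in each column.
missing_fraction = store_items.isnull().mean()
print(missing_fraction)
```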
Removing NaN Values
Once you detect missing data, you can choose to eliminate it. Pandas allows you to drop rows or columns that contain NaN values.
Dropping Rows with NaN
store_items.dropna(axis=0)
This removes any row that has at least one NaN.
Dropping Columns with NaN
store_items.dropna(axis=1)
This removes any column containing NaN.
dropna() returns a new DataFrame and leaves the original unchanged unless you pass inplace=True.
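The following sketch shows what each call keeps for this particular dataset and confirms that the original DataFrame is untouched. The names rows_kept and cols_kept are illustrative only.

```python
import pandas as pd

# Same data as the store_items DataFrame built earlier in the article.
items2 = [
    {'bikes': 20, 'pants': 30, 'watches': 35, 'shirts': 15, 'shoes': 8, 'suits': 45},
    {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants': 5, 'shirts': 2, 'shoes': 5, 'suits': 7},
    {'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4, 'shoes': 10},
]
store_items = pd.DataFrame(items2, index=['store 1', 'store 2', 'store 3'])

# Drop every row that has at least one NaN: only store 2 is complete.
rows_kept = store_items.dropna(axis=0)
print(list(rows_kept.index))       # ['store 2']

# Drop every column that has at least one NaN: shirts, suits, glasses go.
cols_kept = store_items.dropna(axis=1)
print(sorted(cols_kept.columns))   # ['bikes', 'pants', 'shoes', 'watches']

# The original still has all 3 rows, 7 columns, and 3 NaN values.
print(store_items.shape)           # (3, 7)
```

On a dataset this small, dropping rows discards two of the three stores, which illustrates why replacing NaN values is often preferred over removing them.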
Replacing NaN Values
Instead of deleting data, a common approach is replacing NaN values. Pandas provides several useful methods for this.
Replacing NaN with a Fixed Value
You can fill all missing values with zero:
store_items.fillna(0)
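Zero is not always a sensible stand-in. As one possible alternative (an illustration, not something the example above prescribes), fillna() also accepts a Series, in which case each column is filled with the value whose label matches the column name. The sketch below uses the column means for that purpose; filled_zero and filled_mean are hypothetical names.

```python
import pandas as pd

# Same data as the store_items DataFrame built earlier in the article.
items2 = [
    {'bikes': 20, 'pants': 30, 'watches': 35, 'shirts': 15, 'shoes': 8, 'suits': 45},
    {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants': 5, 'shirts': 2, 'shoes': 5, 'suits': 7},
    {'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4, 'shoes': 10},
]
store_items = pd.DataFrame(items2, index=['store 1', 'store 2', 'store 3'])

# Replace every NaN with the same fixed value.
filled_zero = store_items.fillna(0)

# Replace each NaN with the mean of its own column.
# store_items.mean() skips NaN values and returns one mean per column.
filled_mean = store_items.fillna(store_items.mean())

print(filled_zero.loc['store 1', 'glasses'])   # 0.0
print(filled_mean.loc['store 1', 'glasses'])   # 27.0, the mean of 50 and 4
print(filled_mean.loc['store 3', 'shirts'])    # 8.5, the mean of 15 and 2
```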
Forward Fill (Using Previous Values)
Forward filling replaces a missing value with the most recent non-missing value.
Down a Column (axis=0)
store_items.ffill(axis=0)
Each NaN takes the value from the nearest earlier row in the same column that has one. A NaN in the first row has nothing above it, so it stays NaN.
Across a Row (axis=1)
store_items.ffill(axis=1)
Each NaN takes the most recent non-missing value to its left in the same row. A NaN in the first column has nothing to its left, so it stays NaN.
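To see the difference between the two directions on this dataset, the sketch below prints the cells that were originally NaN. The column order is pinned explicitly with the columns argument (an assumption added here so that the row-wise result is predictable); down and across are illustrative names.

```python
import pandas as pd

items2 = [
    {'bikes': 20, 'pants': 30, 'watches': 35, 'shirts': 15, 'shoes': 8, 'suits': 45},
    {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants': 5, 'shirts': 2, 'shoes': 5, 'suits': 7},
    {'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4, 'shoes': 10},
]
# Pin the column order so "the value to the left" is unambiguous.
cols = ['bikes', 'pants', 'watches', 'shirts', 'shoes', 'suits', 'glasses']
store_items = pd.DataFrame(items2, index=['store 1', 'store 2', 'store 3'], columns=cols)

# Fill down each column: values come from the row above.
down = store_items.ffill(axis=0)
print(down.loc['store 1', 'glasses'])   # nan, store 1 is the first row
print(down.loc['store 3', 'shirts'])    # 2.0, copied from store 2
print(down.loc['store 3', 'suits'])     # 7.0, copied from store 2

# Fill across each row: values come from the column to the left.
across = store_items.ffill(axis=1)
print(across.loc['store 1', 'glasses'])  # 45.0, copied from suits
print(across.loc['store 3', 'shirts'])   # 35.0, copied from watches
print(across.loc['store 3', 'suits'])    # 10.0, copied from shoes
```

Filling across a row copies a value from a different product, so it only makes sense when neighboring columns measure comparable things.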
Backward Fill (Using Next Values)
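Backward filling uses the next available non-missing value. A NaN in the last row (or in the last column, when filling across a row) has no later value to borrow, so it stays NaN.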
Down a Column
store_items.bfill(axis=0)
Across a Row
store_items.bfill(axis=1)
Like dropna() and fillna(), forward and backward filling return a new DataFrame and leave the original unchanged unless you set inplace=True.
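The sketch below shows which cells backward filling can and cannot reach on this dataset. It also chains a forward fill with a backward fill, which is one possible way (an assumption, not the only option) to cover gaps at both ends of a column. The column order is pinned explicitly, and the names down_b, across_b, and both are illustrative.

```python
import pandas as pd

items2 = [
    {'bikes': 20, 'pants': 30, 'watches': 35, 'shirts': 15, 'shoes': 8, 'suits': 45},
    {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants': 5, 'shirts': 2, 'shoes': 5, 'suits': 7},
    {'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4, 'shoes': 10},
]
# Pin the column order so "the value to the right" is unambiguous.
cols = ['bikes', 'pants', 'watches', 'shirts', 'shoes', 'suits', 'glasses']
store_items = pd.DataFrame(items2, index=['store 1', 'store 2', 'store 3'], columns=cols)

# Fill up each column: values come from the row below.
down_b = store_items.bfill(axis=0)
print(down_b.loc['store 1', 'glasses'])   # 50.0, copied from store 2
print(down_b.loc['store 3', 'shirts'])    # nan, store 3 is the last row

# Fill across each row: values come from the column to the right.
across_b = store_items.bfill(axis=1)
print(across_b.loc['store 3', 'shirts'])   # 10.0, copied from shoes
print(across_b.loc['store 3', 'suits'])    # 4.0, copied from glasses
print(across_b.loc['store 1', 'glasses'])  # nan, glasses is the last column

# Forward fill first, then backward fill whatever is still missing.
both = store_items.ffill(axis=0).bfill(axis=0)
print(int(both.isnull().sum().sum()))         # 0

# None of the calls above modified the original.
print(int(store_items.isnull().sum().sum()))  # 3
```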
Interpolating Missing Values
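Interpolation estimates missing values from nearby data points. With method='linear', Pandas ignores the index, treats the values as equally spaced, and places each missing value on a straight line between its nearest non-missing neighbors.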
Interpolating Down a Column
store_items.interpolate(method='linear', axis=0)
Interpolating Across a Row
store_items.interpolate(method='linear', axis=1)
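The sketch below shows the interpolated values for this dataset. With the default settings, a NaN that has no earlier value stays NaN, and a NaN that has no later value is filled with the last valid value. The explicit column order and the cast to float are assumptions added here to keep the row-wise result predictable; down_i and across_i are illustrative names.

```python
import pandas as pd

items2 = [
    {'bikes': 20, 'pants': 30, 'watches': 35, 'shirts': 15, 'shoes': 8, 'suits': 45},
    {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants': 5, 'shirts': 2, 'shoes': 5, 'suits': 7},
    {'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4, 'shoes': 10},
]
cols = ['bikes', 'pants', 'watches', 'shirts', 'shoes', 'suits', 'glasses']
store_items = pd.DataFrame(items2, index=['store 1', 'store 2', 'store 3'], columns=cols)

# Cast to float so every column can hold fractional interpolated values.
store_items = store_items.astype(float)

# Interpolate down each column.
down_i = store_items.interpolate(method='linear', axis=0)
print(down_i.loc['store 1', 'glasses'])  # nan, no earlier row to start from
print(down_i.loc['store 3', 'shirts'])   # 2.0, no later row, last value reused

# Interpolate across each row.
across_i = store_items.interpolate(method='linear', axis=1)
print(across_i.loc['store 3', 'shirts'])   # 22.5, halfway between 35 and 10
print(across_i.loc['store 3', 'suits'])    # 7.0, halfway between 10 and 4
print(across_i.loc['store 1', 'glasses'])  # 45.0, no later column, last value reused
```

Only the two store 3 values in the row-wise result are true interpolations between two neighbors. The other filled cells simply repeat the last known value, which is the same result a forward fill would give.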