How to Remove Missing Values from your Data in Python?
In data science, you will remove missing values in almost all your projects. Empty values in a dataset can cause inaccuracies and inconsistencies in the data. Removing these empty values is important for ensuring that the data is accurate and reliable.
If you are new to data science, you may need to delete missing values because:
- you need to reduce bias in your data
- 60-70% of your data is missing
- you need to improve your model’s performance
- among other reasons
Whatever the reason, you will need to know how to delete missing values. In this article, I will show you code snippets for deleting missing values in Python using both NumPy and pandas.
How to remove missing values with pandas
More often than not, you will be working with DataFrames. Since removing missing values is such a common task, you need to know the various ways of doing it.
Let’s first import the dataset. For this one, we will be using the Melbourne Housing Market dataset available on Kaggle.
import pandas as pd
df = pd.read_csv("Melbourne_housing_FULL.csv")
df.shape
(34857, 21)
How to remove all missing values in the dataframe with Python?
The simplest way to delete all missing values is to call the dropna() method available in pandas. With its default arguments, it removes every row in your dataframe that contains at least one empty value.
df2 = df.dropna()
df2.shape
(8887, 21)
As you can see, the dataframe went from ~35k to ~9k rows. We have roughly 4x fewer rows after calling dropna() on the whole dataset. Remember that imputing the missing values may sometimes be better than removing them.
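To make the trade-off between dropping and imputing concrete, here is a minimal sketch on a small made-up dataframe (the `toy` data is hypothetical and is not taken from the Melbourne dataset). Filling with the median is only one possible imputation strategy, chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the Melbourne dataset
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "Price": [850000.0, np.nan, 1200000.0, np.nan],
})

# Option A: drop every row that has at least one missing value
dropped = toy.dropna()
print(dropped.shape)  # (2, 2)

# Option B: keep all rows and fill missing prices with the column median
imputed = toy.fillna({"Price": toy["Price"].median()})
print(imputed.shape)  # (4, 2)
```

Dropping keeps only the two complete rows, while imputing keeps all four rows at the cost of inventing plausible values for the gaps.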
How to find which column/feature contains empty values
Sometimes you do not want to remove every single missing value in your dataframe, but only the ones in certain columns. To do that, you first need to identify which columns contain missing values.
If you only want the names of the columns that contain missing values, here are two ways to get them.
# get the name of the columns containing missing values
# Method 1
missing = df.columns[df.isnull().any()]
print(missing)
# Method 2
missing = [col for col in df.columns if df[col].isna().any()]
print(missing)
# Method 1 Output
Index(['Price', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
'Longtitude', 'Regionname', 'Propertycount'],
dtype='object')
# Method 2 Output
['Price', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car', 'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude', 'Longtitude', 'Regionname', 'Propertycount']
If you want to get the column names along with the number of missing values, you can use the code snippet below.
# Get the number of missing values per column
df.isna().sum()
Suburb 0
Address 0
Rooms 0
Type 0
Price 7610
Method 0
SellerG 0
Date 0
Distance 1
Postcode 1
Bedroom2 8217
Bathroom 8226
Car 8728
Landsize 11810
BuildingArea 21115
YearBuilt 19306
CouncilArea 3
Lattitude 7976
Longtitude 7976
Regionname 3
Propertycount 3
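Raw counts are easier to judge as a share of the column. The sketch below, again on a hypothetical toy dataframe, computes the percentage of missing values per column and lists the columns above a threshold. The 60% cutoff is an assumption picked to echo the figure mentioned in the introduction, not a rule.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the Melbourne dataset
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "Price": [850000.0, np.nan, 1200000.0, np.nan],
    "BuildingArea": [np.nan, np.nan, 120.0, np.nan],
})

# isna() gives booleans; their mean is the fraction of missing values
missing_pct = (toy.isna().mean() * 100).sort_values(ascending=False)
print(missing_pct)

# Columns where more than 60% of the values are missing (assumed cutoff)
mostly_missing = missing_pct[missing_pct > 60].index.tolist()
print(mostly_missing)  # ['BuildingArea']
```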
How to delete empty values in specific columns?
Once we have an idea of which variables contain missing values, we can drop empty values from specific columns. To do that, you can still use the dropna() method; you simply need to pass the subset of columns you want the drop to consider.
df.dropna(subset=["Bedroom2", "Bathroom"], inplace=True)
df.shape
(26631, 21)
The inplace=True argument tells pandas to modify your original dataframe. If you do not want your original dataframe changed, leave out inplace and assign the result of dropna() to a new variable; you can add copy() to make it explicit that the result is an independent dataframe.
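A minimal sketch of the non-inplace variant, using a hypothetical toy dataframe whose column names mirror the ones in the dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the same column names as above
toy = pd.DataFrame({
    "Bedroom2": [2.0, np.nan, 3.0],
    "Bathroom": [1.0, 1.0, np.nan],
    "Car": [np.nan, 1.0, 2.0],
})

# Without inplace=True, dropna() returns a new dataframe
cleaned = toy.dropna(subset=["Bedroom2", "Bathroom"]).copy()

print(toy.shape)      # (3, 3), the original is unchanged
print(cleaned.shape)  # (1, 3)
```

Note that the surviving row still has a missing Car value, because only the columns listed in subset are checked.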
How to remove missing values from one column in pandas?
If you want to remove missing values from just one column, there are essentially two ways of doing that with pandas. You can use the dropna() method, or you can build a boolean mask that keeps only the rows whose cell is not empty (na). The same masking pattern works for other checks, such as flagging duplicated rows with duplicated().
# Method 1
df.dropna(subset=["Car"], inplace=True)
# Method 2
df = df.loc[~df.Car.isna()]
How to remove missing values with NumPy?
Using pandas is not the only way of dealing with missing values. You can similarly use NumPy for it. Let’s build a matrix from two numeric columns of the dataframe and apply the same idea as above.
import numpy as np
# Convert two numeric columns to a 2D float array
X = df[["Rooms", "Price"]].to_numpy()
# Keep only the rows that contain no NaN
X[~np.isnan(X).any(axis=1)]
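If you do not have the dataset at hand, the same masking idea can be tried on a small hand-written matrix. This is a sketch on made-up numbers; note that np.isnan only works on numeric arrays, so an array of strings or mixed objects would need a different check.

```python
import numpy as np

# Hypothetical toy matrix: column 0 is complete, column 1 has gaps
X = np.array([
    [2.0, 850000.0],
    [3.0, np.nan],
    [4.0, np.nan],
    [3.0, 990000.0],
])

# Drop rows that contain at least one NaN
row_has_nan = np.isnan(X).any(axis=1)
X_rows = X[~row_has_nan]
print(X_rows.shape)  # (2, 2)

# Drop columns that contain at least one NaN
col_has_nan = np.isnan(X).any(axis=0)
X_cols = X[:, ~col_has_nan]
print(X_cols.shape)  # (4, 1)
```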
How to delete missing values in a list in Python?
If you are not working with the above two and want to remove empty values from a list, here’s how you can do it.
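arr = ["a", "", "b", "c", np.nan]
# Remove empty strings (returns a new list; np.nan is truthy, so it stays)
[e for e in arr if e]
# ['a', 'b', 'c', nan]
# Remove nan in place (this works because the list holds the np.nan object itself)
arr.remove(np.nan)
# arr is now ['a', '', 'b', 'c']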
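Calling remove(np.nan) relies on the list containing that exact np.nan object, and it only removes the first match. A more general sketch is to filter with a small helper. The `is_missing` function below is a hypothetical helper written for this example, and treating None as missing is my own assumption.

```python
import math

def is_missing(value):
    """Return True for None, empty strings, and float NaN."""
    if value is None or value == "":
        return True
    return isinstance(value, float) and math.isnan(value)

arr = ["a", "", "b", "c", float("nan"), None]

# Keep only the elements that are not considered missing
cleaned = [e for e in arr if not is_missing(e)]
print(cleaned)  # ['a', 'b', 'c']
```

Since np.nan is a regular Python float, the same helper also catches it.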
How to remove missing values in a set in Python?
Removing missing values from a set is quite similar to removing them from a list. The main differences are that a set doesn’t contain duplicated values and doesn’t keep its elements in any particular order.
s = {"a", "", "b", "c", np.nan}
# Remove empty strings from the set (returns a list; the order may differ on your machine)
[e for e in s if e]
# ['c', nan, 'a', 'b']
# Remove nan in place
s.remove(np.nan)
# s is now {'', 'a', 'b', 'c'}
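If you want the result to stay a set and to drop both kinds of empty values in one pass, a set comprehension is one option. This sketch uses the fact that NaN is the only common value that is not equal to itself; it is an illustration, not the only way to do it.

```python
s = {"a", "", "b", "c", float("nan")}

# e != "" drops empty strings; e == e is False only for NaN
cleaned = {e for e in s if e != "" and e == e}
print(cleaned == {"a", "b", "c"})  # True
```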
If you made it this far in the article, thank you very much.
I hope this information was of use to you.
Feel free to use any information from this page. I’d appreciate it if you can simply link to this article as the source. If you have any additional questions, you can reach out to malick@malicksarr.com or message me on Twitter. If you want more content like this, join my email list to receive the latest articles. I promise I do not spam.