Table of contents
1. Why Should You Care About Missing Data?
2. Why Does Data Go Missing?
3. Different Types of Missing Data
4. How to Handle Missing Data: Easy Techniques
5. Simple Example Using Python (pandas)
6. Best Tips for Handling Missing Data
7. Mistakes to Avoid
8. Tools You Can Use
9. Final Thoughts
Have you ever looked at a dataset and noticed some empty boxes or blank spaces? That’s called missing data. It happens often and can create problems in your analysis or machine learning models.
In this blog by Apponix Academy, let’s learn how to handle missing data easily, using simple language and practical tips. By the end, you’ll feel confident cleaning your data for any project.
You might think, “Why does it matter if some data is missing?”
Here’s why:
Missing data can bias your results and lead to wrong conclusions.
Many machine learning models raise errors or cannot be trained when the input contains missing values.
Your reports and decisions will be more reliable when the data is clean.
That’s why data scientists spend a lot of time fixing missing data before any analysis.
Data can be missing for many reasons:
Someone forgot to enter it.
There was a technical issue while recording it.
People didn’t want to share that information.
Knowing why data is missing helps you decide how to handle it properly.
Missing data is usually grouped into three types:
Missing Completely at Random (MCAR): The data is missing purely by chance, unrelated to any value in the dataset. For example, a sensor stops working at random and skips recording the temperature for one hour.
Missing at Random (MAR): The missingness is related to other data in your dataset, not to the missing value itself. For example, income is missing for some people, but you know each person's job title, which can help you estimate it.
Missing Not at Random (MNAR): The data is missing for a reason related to the missing value itself. For example, people with very high incomes may not want to reveal their salary.
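One rough way to explore the second type (missingness related to another column) is to compare how often a value is missing within each group of that other column. The sketch below uses a small made-up table; the column names `job_title` and `income` are assumptions for illustration. A clear difference between groups suggests the gaps are not purely random, but it cannot tell you whether the gaps depend on the missing values themselves.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: income is missing more often for one job title.
df = pd.DataFrame({
    "job_title": ["Engineer", "Engineer", "Manager", "Manager", "Manager", "Clerk"],
    "income": [50000, 52000, np.nan, np.nan, 90000, 30000],
})

# Share of rows with missing income within each job title.
missing_rate_by_job = df["income"].isna().groupby(df["job_title"]).mean()
print(missing_rate_by_job)
```

Here the missing rate is zero for two job titles and about two thirds for the third, which hints that job title is related to whether income was recorded.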
If only a few rows or columns have missing data, you can simply remove them.
Drop rows with missing values if they are not important.
Drop columns if almost all the data is missing in them.
But remember, don’t remove too much data. You might lose important information.
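As a minimal sketch of the removal approach, the example below first drops columns that are mostly empty and then drops the remaining rows that still have gaps. The toy data and the 50% cutoff (`max_missing_share`) are assumptions chosen for illustration, not a recommended rule.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one mostly empty column.
df = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "city": ["Bangalore", "Mysore", None, "Bangalore"],
    "notes": [np.nan, np.nan, np.nan, "ok"],
})

# Keep only columns where at most half of the values are missing
# (the 0.5 cutoff is an assumption for this example).
max_missing_share = 0.5
keep_cols = df.columns[df.isna().mean() <= max_missing_share]
df_cols = df[keep_cols]

# Drop the remaining rows that still contain any missing value.
df_clean = df_cols.dropna()
print(df_clean)
```

Note that half of the rows are lost here, which is exactly the risk described above when too much data is removed.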
If you don’t want to remove data, you can fill it with other values. This is called imputation.
a. Mean or Median
For numbers, you can fill in missing values with:
Mean: The average value
Median: The middle value (better if the data has extreme values)
Example: If someone’s age is missing, fill it with the average or median age of the group.
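To see why the median can be the safer choice when there are extreme values, here is a small sketch with made-up ages that include one outlier (95). The numbers are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one extreme value and one missing entry.
ages = pd.Series([22, 25, 27, 30, 95, np.nan], name="age")

# Both statistics skip the missing value by default.
mean_age = ages.mean()      # pulled upward by the outlier
median_age = ages.median()  # middle of the observed values

ages_mean_filled = ages.fillna(mean_age)
ages_median_filled = ages.fillna(median_age)

print("mean:", mean_age, "median:", median_age)
```

The mean comes out near 40 even though most people in the group are under 30, while the median stays at 27.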
b. Mode
For categories like city or gender, fill in missing values with the most common value.
Example: If many people live in Bangalore and one entry is missing the city, fill it with Bangalore.
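The same idea can be sketched in pandas. The city values below are made up; `mode()` returns a Series (there can be ties), so the example takes the first entry.

```python
import pandas as pd

# Hypothetical city column with one missing entry.
cities = pd.Series(["Bangalore", "Bangalore", "Mysore", None, "Bangalore"], name="city")

# mode() ignores missing values and may return several values on a tie;
# taking [0] picks the first one.
most_common = cities.mode()[0]
cities_filled = cities.fillna(most_common)
print(cities_filled.tolist())
```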
c. Forward Fill / Backward Fill
For time-based data like stock prices, use:
Forward fill: Fill the missing value with the last available value
Backward fill: Fill the missing value with the next available value
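A short sketch of both fills on a made-up daily price series (the dates and prices are assumptions). Keep in mind that backward fill copies a later value into an earlier row, so it may not be appropriate when you are predicting the future.

```python
import numpy as np
import pandas as pd

# Hypothetical daily closing prices with gaps.
prices = pd.Series(
    [100.0, np.nan, np.nan, 103.0, np.nan],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
    name="close",
)

forward = prices.ffill()   # carry the last known value forward
backward = prices.bfill()  # pull the next known value backward

print(pd.DataFrame({"original": prices, "ffill": forward, "bfill": backward}))
```

The last row stays empty under backward fill because there is no later value to copy from.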
d. Constant Value
Sometimes, you can fill missing values with a constant like:
“Unknown” for categories
0 for numbers (only if it makes sense)
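As a small sketch, `fillna` accepts a dictionary so each column can get its own constant. The column names are hypothetical, and filling with 0 here assumes that a missing purchase count really means "no purchases".

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in a category and a numeric column.
df = pd.DataFrame({
    "city": ["Bangalore", None, "Mysore"],
    "num_purchases": [3, np.nan, 1],
})

# One constant per column: a label for categories, 0 for the count.
df_filled = df.fillna({"city": "Unknown", "num_purchases": 0})
print(df_filled)
```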
e. Predicting Missing Values
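This is an advanced method where you use other data to predict missing values, for example with a regression model or KNN (k-nearest neighbors). It takes extra effort, but it can give more accurate fills than a single mean or mode when the other columns carry useful information about the missing one.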
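One possible way to try this is scikit-learn's `KNNImputer`, which fills a gap with the average of that column among the most similar rows. The sketch below assumes scikit-learn is installed; the data and the choice of `n_neighbors=2` are made up for illustration. Because KNN relies on distances, in real projects you would usually scale the columns first.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical data: one income value is missing.
df = pd.DataFrame({
    "age": [25, 27, 26, 52, 50],
    "years_experience": [2, 4, 3, 28, 25],
    "income": [30000, 34000, np.nan, 90000, 88000],
})

# Fill each gap with the mean of that column over the 2 nearest rows,
# where "nearest" is measured on the columns that are present.
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns, index=df.index)
print(imputed)
```

The row with the missing income is closest to the two other young, low-experience rows, so it receives the average of their incomes rather than the overall average.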
Some machine learning algorithms, like XGBoost, can handle missing data on their own. Even so, it is often worth cleaning the data yourself so that you control how the gaps are treated.
For categories, you can create a new value called “Missing” or “Unknown”. This way, you keep the data and let your model know it was missing.
Create a new column showing if data was missing (1) or not (0). This helps your model learn patterns related to missing data.
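Both ideas can be sketched together. The column names, including the flag column `age_was_missing`, are hypothetical. The flag is created before filling, because once the gaps are filled there is nothing left to flag.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in a category and a numeric column.
df = pd.DataFrame({
    "city": ["Bangalore", None, "Mysore", None],
    "age": [25, np.nan, 40, 31],
})

# 1 if age was missing, 0 otherwise (created before any filling).
df["age_was_missing"] = df["age"].isna().astype(int)

# Treat missing categories as their own value.
df["city"] = df["city"].fillna("Missing")

# Fill the numeric gap afterwards (median used here as an example).
df["age"] = df["age"].fillna(df["age"].median())
print(df)
```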
Here’s how you can handle missing data using Python:
import pandas as pd
# Load data
df = pd.read_csv('data.csv')
# Check the number of missing values in each column
print(df.isnull().sum())
# Fill missing ages with the median
# (assign the result back; inplace=True on a selected column is unreliable in recent pandas)
df['Age'] = df['Age'].fillna(df['Age'].median())
# Fill missing city with the mode (most common value)
df['City'] = df['City'].fillna(df['City'].mode()[0])
# Drop rows where purchase is missing
df = df.dropna(subset=['Purchase'])
In the data science course in Bangalore by Apponix Academy, you will practice such techniques with real datasets.
Check why the data is missing before deciding what to do.
Look at missing data patterns using tools like missingno in Python.
Don’t blindly fill in values without thinking about the business meaning.
Keep notes of what you did for future reference.
Ignoring missing data completely
Removing too many rows and losing useful data
Using the mean for skewed data instead of the median
Filling the target (output) variable with guesses – never do this
Here are some tools that make handling missing data easier:
Excel: For small datasets
Python (pandas, scikit-learn): For powerful data cleaning
R (dplyr, tidyr, mice): For statistical imputations
Power Query (Excel/Power BI): For BI workflows
OpenRefine: For text data cleaning
At Apponix Academy, we teach these tools step by step to build your practical confidence.
Missing data is a normal part of working with real-world data. Don’t be scared of it. Just remember:
Understand why data is missing
Choose the best way to handle it
Keep your data clean for better results
If you want to learn data cleaning, data preparation, and data science skills from scratch, join the data science course in Bangalore offered by Apponix Academy. You will practice these techniques with expert guidance and real projects.
Key Takeaways
Missing data can affect your analysis results.
Handle missing data by removing, filling, predicting, or creating missing flags.
Practice these techniques to become confident in data preparation.
Apponix Academy