Data Cleaning in Python: Handle Missing Values

Section 01

The Dirty Data Problem

In an ideal world, every dataset arrives perfectly formatted, complete, and consistent. In the real world, data scientists spend 60–80% of their time cleaning data before a single model is trained. Missing values, duplicate rows, inconsistent formats, and corrupted entries are not edge cases — they are the norm in every real production dataset.

📖 Real-World Story

The Hospital That Almost Got It Wrong

In 2019, a hospital in the United States used a machine learning model to predict which patients needed urgent care. The model consistently underestimated the severity of certain patients' conditions — not because of a flawed algorithm, but because the training data had a systematic missing value problem: some groups had fewer recorded diagnoses, not because they were healthier, but because of unequal access to care. The missing values were not random — they were structurally biased. This is why understanding WHY data is missing is just as important as knowing HOW to fill it. Data cleaning is never just a technical exercise — it carries ethical weight.

⚠️

Data Cleaning Is Not Optional

A model trained on dirty data will produce dirty predictions. Missing values cause statistical bias. Duplicates artificially inflate certain patterns. Inconsistent entries create phantom categories. None of these problems are fixed by a more powerful algorithm — they must be fixed in the data itself, before any analysis begins.

🗺️ The Data Cleaning Pipeline

Data cleaning is a structured pipeline — not a single step. Each stage feeds the next. Skipping any stage allows hidden errors to survive into your model.

Section 02

Step 1 — Audit Before You Clean

Before touching a single value, you must understand the full extent of the problem. Cleaning without auditing is like performing surgery without an X-ray. The audit gives you a complete map of where the damage is and how serious it is.

import pandas as pd
import numpy as np

df = pd.read_csv('sales_data.csv')

# ── Shape and types ──────────────────────────────────
print(df.shape)
df.info()

# ── Missing value audit ──────────────────────────────
null_report = pd.DataFrame({
    'missing': df.isnull().sum(),
    'pct':     (df.isnull().mean() * 100).round(2),
    'dtype':   df.dtypes
}).query('missing > 0').sort_values('pct', ascending=False)
print(null_report)

# ── Duplicate audit ──────────────────────────────────
print(f"Duplicate rows : {df.duplicated().sum()}")
print(f"Duplicate IDs  : {df['customer_id'].duplicated().sum()}")

📊 Output: Null Audit Report — Before Cleaning

Columns beyond the 5% threshold (green line) need active imputation strategies. age at 15.7% and rating at 9.8% are the priority targets.

Section 03

Understanding WHY Data Is Missing

Not all missing data is the same. The mechanism behind why values are missing determines which treatment strategy is valid. Using the wrong strategy for the wrong type can introduce more bias than the original missing values themselves.

🎲

MCAR

Missing Completely At Random

No relationship to any other variable — like a sensor randomly dropping packets. Safe to delete rows or use simple imputation. No systematic bias introduced.

🔗

MAR

Missing At Random

Missingness depends on other observed variables. Example: younger customers less likely to fill in income. Imputation using other features is valid and recommended.

⚠️

MNAR

Missing Not At Random

The missing value is related to the value itself. Example: high earners skip the income field. Most dangerous — simple imputation introduces structural bias. Requires domain expertise.

📊 MCAR vs MNAR — The Pattern Tells the Story

MCAR — Random scatter (safe to impute)

MNAR — Clustered in high-value zone (dangerous)

MCAR: missing values (red crosses) are scattered randomly. MNAR: missing values cluster in the high-income, high-spend region — biasing any imputation toward underestimating those values.

💡

How to Test for MCAR

Create a binary indicator: 1 where the value is missing, 0 where present. Check its correlation with all other columns. If correlations are near zero across the board — MCAR. If certain columns correlate strongly with the missing indicator — MAR or MNAR. Simple deletion will bias your results in those cases.

Section 04

Handling Missing Values — Strategy by Strategy

There is no single correct strategy for handling missing values. The right approach depends on the type of missingness, the percentage missing, the column's role in your model, and the domain context.

🗺️ Missing Value Treatment — Decision Tree

Follow this decision tree for every column with missing values. The percentage missing and data type together determine the correct treatment strategy.

Strategy 1 — Drop Rows or Columns (dropna)

📖 When to Use This

The Survey That Lost 3% of Responses

A retail company ran a 50,000-row customer satisfaction survey. After auditing, 1,400 rows (2.8%) had missing responses in the "overall satisfaction" column — which was also the target variable for the model. You cannot impute a target variable. The right answer: drop those 1,400 rows. The remaining 48,600 rows were clean, and the small loss had no statistical impact. Dropping is correct when the missing percentage is below 5%, rows are MCAR, or a column is over 50% missing.

# Drop rows where ANY value is missing
df_clean = df.dropna()

# Drop rows only where SPECIFIC columns are missing
df_clean = df.dropna(subset=['age', 'purchase_amount'])

# Drop columns more than 50% missing
threshold = len(df) * 0.5
df_clean  = df.dropna(axis=1, thresh=threshold)

# Always check row count before and after
print(f"Before: {len(df):,}  After: {len(df_clean):,}  Lost: {len(df)-len(df_clean):,}")

Strategy 2 — Fill with Statistics (fillna)

📖 When to Use This

The E-Commerce Platform's Age Problem

An Indian e-commerce platform had 15.7% missing values in the customer age column — too many to drop without serious data loss. A correlation test showed no relationship between age-missingness and any other variable (MCAR). The solution: fill with the median age (34 years). Why median and not mean? Because age distributions are often slightly right-skewed by a small number of very elderly customers — the median is more robust to this. After filling, the age distribution looked natural and downstream model performance improved by 4% on the validation set.

# Fill numeric with median (robust to skew)
df['age'].fillna(df['age'].median(), inplace=True)

# Fill categorical with most frequent value
df['city'].fillna(df['city'].mode()[0], inplace=True)

# Fill boolean flags with False
df['is_returned'].fillna(False, inplace=True)

# Group-wise fill — smarter than global median
# Fill age using median per income bracket
df['age'] = df.groupby('income_bracket')['age'].transform(
    lambda x: x.fillna(x.median())
)

❌ Before fillna

id	age	city	rating
1001	28	Mumbai	4.5
1002	NaN	Delhi	NaN
1003	42	NaN	3.8
1004	NaN	Chennai	5.0
1005	31	Pune	NaN

✅ After fillna

id	age	city	rating
1001	28	Mumbai	4.5
1002	34	Delhi	4.2
1003	42	Mumbai	3.8
1004	34	Chennai	5.0
1005	31	Pune	4.2

Strategy 3 — Interpolation (for time series)

📖 When to Use This

The Weather Station Gap

A meteorological research team collected hourly temperature readings for a full year. A power outage caused 6 consecutive hours of missing readings in July. Filling with the monthly mean (32°C) would have created a jarring flat line in the middle of a smooth daily curve — corrupting the time series pattern. Interpolation solved this elegantly: it drew a smooth line between the last known reading (31.2°C) and the first reading after the gap (34.1°C), estimating each missing hour proportionally. The resulting series looked exactly as the sensor would have recorded it.

# Linear interpolation
df['temperature'] = df['temperature'].interpolate(method='linear')

# Time-aware interpolation (respects unequal gaps)
df = df.set_index('timestamp')
df['temperature'] = df['temperature'].interpolate(method='time')

# Forward-fill: copy last known value forward
df['price'] = df['price'].ffill()   # ideal for stock prices

# Backward-fill: copy next known value backward
df['price'] = df['price'].bfill()

📊 Output: Interpolation vs Mean-Fill — Time Series Gap

Actual data Interpolated (correct) Mean-fill (wrong)

Mean-fill (red) creates a flat, unnatural plateau. Interpolation (green) follows the natural curve of the data, preserving the time series pattern the model needs to learn.

Strategy 4 — KNN Imputation

from sklearn.impute import KNNImputer

# KNN finds k most similar rows, uses their average
imputer    = KNNImputer(n_neighbors=5, weights='distance')
num_cols   = df.select_dtypes(include='number').columns
df[num_cols] = imputer.fit_transform(df[num_cols])

# IterativeImputer — multiple regression imputation
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute       import IterativeImputer
iter_imp   = IterativeImputer(max_iter=10, random_state=42)
df[num_cols] = iter_imp.fit_transform(df[num_cols])

🎯

Strategy Quick Reference

MCAR + numeric + low % → median fill. MCAR + categorical → mode fill. Time series → interpolate. MAR + numeric + model matters → KNN or IterativeImputer. MNAR → add missing indicator, consult domain expert. Over 50% missing → drop the column.

Section 05

Removing Duplicates

A duplicate row contains the same information as another row — either identical across all columns, or identical only in the key columns that should be unique. Duplicates inflate dataset size, distort statistics, and teach models to overfit to repeated patterns.

📖 Real-World Story

The Retailer With 23% Phantom Sales

A major retail chain was building a demand forecasting model. Their data engineering team merged three regional databases — North, South, and West — into a single national table. What they missed: 23% of transactions appeared in two or more regional databases due to a shared centralised payment processor. The model trained on this data massively overestimated demand in overlapping regions. A procurement order was placed based on the model's output — and excess inventory worth ₹2.4 crore piled up in warehouses. A single df.duplicated().sum() call before training would have caught the entire problem in seconds.

📊 Three Types of Duplicate Rows

Exact duplicates are the easiest to fix. Key duplicates require a business decision (keep first? keep last? aggregate?). Fuzzy duplicates require string normalisation and libraries like rapidfuzz.

# ── Check for duplicates ─────────────────────────────
print(df.duplicated().sum())
df[df.duplicated(keep=False)].head(10)       # show ALL copies

# ── Drop exact duplicates ────────────────────────────
df = df.drop_duplicates()                      # keep first occurrence
df = df.drop_duplicates(keep='last')          # keep most recent

# ── Key-based deduplication ──────────────────────────
df = df.drop_duplicates(subset=['customer_id'], keep='last')

# ── Keep most recent record per customer ─────────────
df = (df
      .sort_values('order_date', ascending=False)
      .drop_duplicates(subset=['customer_id'], keep='first')
      .sort_index())

# ── Fuzzy deduplication ──────────────────────────────
df['city'] = df['city'].str.lower().str.strip()
city_map  = {'mum':'mumbai', 'del':'delhi', 'blr':'bengaluru'}
df['city'] = df['city'].replace(city_map)

❌ Before drop_duplicates

id	name	amount	date
101	Priya	4500	2024-03-01
102	Arjun	8200	2024-03-02
102	Arjun	8200	2024-03-02
103	Deepa	6100	2024-03-03
103	Deepa	6100	2024-03-03
104	Ravi	3300	2024-03-04

✅ After drop_duplicates()

id	name	amount	date
101	Priya	4500	2024-03-01
102	Arjun	8200	2024-03-02
103	Deepa	6100	2024-03-03
104	Ravi	3300	2024-03-04

📊 Output: Dataset Before vs After Full Cleaning

Before After

After removing 2,287 duplicate rows and fixing 1,570 missing values, the clean dataset has 9,713 high-quality rows — ready for modelling with zero nulls and zero duplicates.

Section 06

Step 3 — Validate the Clean Dataset

Cleaning without validating is like washing dishes without checking if they are actually clean. After every cleaning operation, programmatically verify that the problems you fixed are actually gone — and that no new problems were introduced.

# ── Post-cleaning assertions ──────────────────────────
assert df.isnull().sum().sum() == 0,     "Still has missing values!"
assert df.duplicated().sum()     == 0,     "Still has duplicates!"
assert df['age'].between(18,100).all(),  "Age out of valid range!"
assert df['purchase_amount'].ge(0).all(),  "Negative purchase found!"
assert df['rating'].between(1,5).all(),    "Rating outside 1–5 range!"

# ── Cleaning summary report ───────────────────────────
report = {
    'rows_before':     len(df_raw),
    'rows_after':      len(df),
    'rows_removed':    len(df_raw) - len(df),
    'nulls_remaining': df.isnull().sum().sum(),
    'dupes_remaining': df.duplicated().sum()
}
pd.Series(report)

📊 Output: Cleaning Report — Full Summary

Save this report as a JSON file alongside every cleaned dataset. It is your audit trail — essential for reproducibility, regulatory compliance, and handing work to teammates.

Section 07

Complete Data Cleaning Template

Copy this function into any project for an instant production-ready cleaning pipeline. It handles missing values, removes duplicates, and prints a full before/after report.

def clean_dataframe(df, id_col=None, date_col=None, verbose=True):
    df_clean = df.copy()
    report   = {'rows_start': len(df_clean)}

    # 1. Drop entirely empty rows / columns
    df_clean = df_clean.dropna(how='all')
    df_clean = df_clean.dropna(axis=1, how='all')

    # 2. Drop columns > 50% missing
    thresh   = len(df_clean) * 0.5
    df_clean = df_clean.dropna(axis=1, thresh=thresh)
    report['cols_dropped'] = df.shape[1] - df_clean.shape[1]

    # 3. Fill numeric with median
    for col in df_clean.select_dtypes(include='number').columns:
        df_clean[col] = df_clean[col].fillna(df_clean[col].median())

    # 4. Fill categorical with mode
    for col in df_clean.select_dtypes(include=['object','category']).columns:
        if df_clean[col].isnull().any():
            df_clean[col] = df_clean[col].fillna(df_clean[col].mode()[0])

    # 5. Remove duplicates
    if id_col and date_col:
        df_clean = (df_clean
                    .sort_values(date_col, ascending=False)
                    .drop_duplicates(subset=[id_col], keep='first'))
    else:
        df_clean = df_clean.drop_duplicates()

    report.update({
        'rows_end':       len(df_clean),
        'rows_removed':   report['rows_start'] - len(df_clean),
        'nulls_remaining':df_clean.isnull().sum().sum(),
        'dupes_remaining':df_clean.duplicated().sum()
    })
    if verbose:
        print(f"✅ {report['rows_start']:,} rows → {report['rows_end']:,}")
        print(f"✅ Nulls remaining : {report['nulls_remaining']}")
        print(f"✅ Dupes remaining : {report['dupes_remaining']}")
    return df_clean, report

# Usage
df_clean, report = clean_dataframe(
    df_raw, id_col='customer_id', date_col='order_date'
)

✅

What This Template Produces

Running clean_dataframe(df_raw) on a 12,000-row messy dataset produces: 0 remaining nulls, 0 remaining duplicates, a full row-count report, and a DataFrame ready to pass directly into any sklearn pipeline. Save the report dict alongside your clean CSV for full reproducibility.

Section 08

Golden Rules of Data Cleaning

🎯 8 Rules Every Data Scientist Must Follow

Never clean the original file. Always work on a copy: df_clean = df.copy(). The raw data is your single source of truth — overwriting it makes cleaning mistakes unrecoverable.

Always audit before you clean. Run .info(), .isnull().sum(), and .duplicated().sum() first. Cleaning without auditing means guessing at problems rather than solving them.

Understand WHY data is missing before choosing a strategy. MCAR allows deletion or simple imputation. MAR requires smarter imputation. MNAR requires domain expertise. The mechanism determines the method.

For MAR and MNAR columns, always add a binary missing-indicator before imputing: df['col_missing'] = df['col'].isnull().astype(int). The fact that a value was missing is itself predictive information.

Use median, not mean, for numeric imputation in skewed distributions. Income, house prices, revenue, and purchase amounts are almost always right-skewed. The mean is dragged upward by extremes; the median is not.

Before removing duplicates, always inspect them first: df[df.duplicated(keep=False)]. Understand whether they are exact copies, key conflicts, or fuzzy near-matches. Each type requires a different fix.

Validate after every cleaning step using assert statements. A cleaning step that silently introduces a new problem is worse than no cleaning at all. Assertions catch mistakes immediately.

Document every cleaning decision. Which rows were dropped? Which values were imputed with what strategy? Why? This documentation is essential for audit trails, regulatory compliance, and handoffs to teammates.

🧮

Key Takeaway

Data cleaning is not a chore you rush through to get to modelling — it is where the most important analytical decisions are made. Every choice about a missing value or a duplicate row encodes an assumption about the world. Make those assumptions consciously, document them clearly, and validate them rigorously. Clean data is not just technically correct — it is ethically sound.