The Dirty Data Problem
In an ideal world, every dataset arrives perfectly formatted, complete, and consistent. In the real world, data scientists spend 60–80% of their time cleaning data before a single model is trained. Missing values, duplicate rows, inconsistent formats, and corrupted entries are not edge cases — they are the norm in every real production dataset.
A model trained on dirty data will produce dirty predictions. Missing values cause statistical bias. Duplicates artificially inflate certain patterns. Inconsistent entries create phantom categories. None of these problems are fixed by a more powerful algorithm — they must be fixed in the data itself, before any analysis begins.
Data cleaning is a structured pipeline — not a single step. Each stage feeds the next. Skipping any stage allows hidden errors to survive into your model.
Step 1 — Audit Before You Clean
Before touching a single value, you must understand the full extent of the problem. Cleaning without auditing is like performing surgery without an X-ray. The audit gives you a complete map of where the damage is and how serious it is.
import pandas as pd
import numpy as np
df = pd.read_csv('sales_data.csv')
# ── Shape and types ──────────────────────────────────
print(df.shape)
df.info()
# ── Missing value audit ──────────────────────────────
null_report = pd.DataFrame({
'missing': df.isnull().sum(),
'pct': (df.isnull().mean() * 100).round(2),
'dtype': df.dtypes
}).query('missing > 0').sort_values('pct', ascending=False)
print(null_report)
# ── Duplicate audit ──────────────────────────────────
print(f"Duplicate rows : {df.duplicated().sum()}")
print(f"Duplicate IDs : {df['customer_id'].duplicated().sum()}")
Columns beyond the 5% threshold (green line) need active imputation strategies. age at 15.7% and rating at 9.8% are the priority targets.
Understanding WHY Data Is Missing
Not all missing data is the same. The mechanism behind why values are missing determines which treatment strategy is valid. Using the wrong strategy for the wrong type can introduce more bias than the original missing values themselves.
MCAR — Random scatter (safe to impute)
MNAR — Clustered in high-value zone (dangerous)
MCAR: missing values (red crosses) are scattered randomly. MNAR: missing values cluster in the high-income, high-spend region — biasing any imputation toward underestimating those values.
Create a binary indicator: 1 where the value is missing, 0 where present. Check its correlation with all other columns. If correlations are near zero across the board — MCAR. If certain columns correlate strongly with the missing indicator — MAR or MNAR. Simple deletion will bias your results in those cases.
Handling Missing Values — Strategy by Strategy
There is no single correct strategy for handling missing values. The right approach depends on the type of missingness, the percentage missing, the column's role in your model, and the domain context.
Follow this decision tree for every column with missing values. The percentage missing and data type together determine the correct treatment strategy.
Strategy 1 — Drop Rows or Columns (dropna)
# Drop rows where ANY value is missing
df_clean = df.dropna()
# Drop rows only where SPECIFIC columns are missing
df_clean = df.dropna(subset=['age', 'purchase_amount'])
# Drop columns more than 50% missing
threshold = len(df) * 0.5
df_clean = df.dropna(axis=1, thresh=threshold)
# Always check row count before and after
print(f"Before: {len(df):,} After: {len(df_clean):,} Lost: {len(df)-len(df_clean):,}")
Strategy 2 — Fill with Statistics (fillna)
# Fill numeric with median (robust to skew)
df['age'].fillna(df['age'].median(), inplace=True)
# Fill categorical with most frequent value
df['city'].fillna(df['city'].mode()[0], inplace=True)
# Fill boolean flags with False
df['is_returned'].fillna(False, inplace=True)
# Group-wise fill — smarter than global median
# Fill age using median per income bracket
df['age'] = df.groupby('income_bracket')['age'].transform(
lambda x: x.fillna(x.median())
)
| id | age | city | rating |
|---|---|---|---|
| 1001 | 28 | Mumbai | 4.5 |
| 1002 | NaN | Delhi | NaN |
| 1003 | 42 | NaN | 3.8 |
| 1004 | NaN | Chennai | 5.0 |
| 1005 | 31 | Pune | NaN |
| id | age | city | rating |
|---|---|---|---|
| 1001 | 28 | Mumbai | 4.5 |
| 1002 | 34 | Delhi | 4.2 |
| 1003 | 42 | Mumbai | 3.8 |
| 1004 | 34 | Chennai | 5.0 |
| 1005 | 31 | Pune | 4.2 |
Strategy 3 — Interpolation (for time series)
# Linear interpolation
df['temperature'] = df['temperature'].interpolate(method='linear')
# Time-aware interpolation (respects unequal gaps)
df = df.set_index('timestamp')
df['temperature'] = df['temperature'].interpolate(method='time')
# Forward-fill: copy last known value forward
df['price'] = df['price'].ffill() # ideal for stock prices
# Backward-fill: copy next known value backward
df['price'] = df['price'].bfill()
Mean-fill (red) creates a flat, unnatural plateau. Interpolation (green) follows the natural curve of the data, preserving the time series pattern the model needs to learn.
Strategy 4 — KNN Imputation
from sklearn.impute import KNNImputer
# KNN finds k most similar rows, uses their average
imputer = KNNImputer(n_neighbors=5, weights='distance')
num_cols = df.select_dtypes(include='number').columns
df[num_cols] = imputer.fit_transform(df[num_cols])
# IterativeImputer — multiple regression imputation
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
iter_imp = IterativeImputer(max_iter=10, random_state=42)
df[num_cols] = iter_imp.fit_transform(df[num_cols])
MCAR + numeric + low % → median fill. MCAR + categorical → mode fill. Time series → interpolate. MAR + numeric + model matters → KNN or IterativeImputer. MNAR → add missing indicator, consult domain expert. Over 50% missing → drop the column.
Removing Duplicates
A duplicate row contains the same information as another row — either identical across all columns, or identical only in the key columns that should be unique. Duplicates inflate dataset size, distort statistics, and teach models to overfit to repeated patterns.
Exact duplicates are the easiest to fix. Key duplicates require a business decision (keep first? keep last? aggregate?). Fuzzy duplicates require string normalisation and libraries like rapidfuzz.
# ── Check for duplicates ─────────────────────────────
print(df.duplicated().sum())
df[df.duplicated(keep=False)].head(10) # show ALL copies
# ── Drop exact duplicates ────────────────────────────
df = df.drop_duplicates() # keep first occurrence
df = df.drop_duplicates(keep='last') # keep most recent
# ── Key-based deduplication ──────────────────────────
df = df.drop_duplicates(subset=['customer_id'], keep='last')
# ── Keep most recent record per customer ─────────────
df = (df
.sort_values('order_date', ascending=False)
.drop_duplicates(subset=['customer_id'], keep='first')
.sort_index())
# ── Fuzzy deduplication ──────────────────────────────
df['city'] = df['city'].str.lower().str.strip()
city_map = {'mum':'mumbai', 'del':'delhi', 'blr':'bengaluru'}
df['city'] = df['city'].replace(city_map)
| id | name | amount | date |
|---|---|---|---|
| 101 | Priya | 4500 | 2024-03-01 |
| 102 | Arjun | 8200 | 2024-03-02 |
| 102 | Arjun | 8200 | 2024-03-02 |
| 103 | Deepa | 6100 | 2024-03-03 |
| 103 | Deepa | 6100 | 2024-03-03 |
| 104 | Ravi | 3300 | 2024-03-04 |
| id | name | amount | date |
|---|---|---|---|
| 101 | Priya | 4500 | 2024-03-01 |
| 102 | Arjun | 8200 | 2024-03-02 |
| 103 | Deepa | 6100 | 2024-03-03 |
| 104 | Ravi | 3300 | 2024-03-04 |
After removing 2,287 duplicate rows and fixing 1,570 missing values, the clean dataset has 9,713 high-quality rows — ready for modelling with zero nulls and zero duplicates.
Step 3 — Validate the Clean Dataset
Cleaning without validating is like washing dishes without checking if they are actually clean. After every cleaning operation, programmatically verify that the problems you fixed are actually gone — and that no new problems were introduced.
# ── Post-cleaning assertions ──────────────────────────
assert df.isnull().sum().sum() == 0, "Still has missing values!"
assert df.duplicated().sum() == 0, "Still has duplicates!"
assert df['age'].between(18,100).all(), "Age out of valid range!"
assert df['purchase_amount'].ge(0).all(), "Negative purchase found!"
assert df['rating'].between(1,5).all(), "Rating outside 1–5 range!"
# ── Cleaning summary report ───────────────────────────
report = {
'rows_before': len(df_raw),
'rows_after': len(df),
'rows_removed': len(df_raw) - len(df),
'nulls_remaining': df.isnull().sum().sum(),
'dupes_remaining': df.duplicated().sum()
}
pd.Series(report)
Save this report as a JSON file alongside every cleaned dataset. It is your audit trail — essential for reproducibility, regulatory compliance, and handing work to teammates.
Complete Data Cleaning Template
Copy this function into any project for an instant production-ready cleaning pipeline. It handles missing values, removes duplicates, and prints a full before/after report.
def clean_dataframe(df, id_col=None, date_col=None, verbose=True):
df_clean = df.copy()
report = {'rows_start': len(df_clean)}
# 1. Drop entirely empty rows / columns
df_clean = df_clean.dropna(how='all')
df_clean = df_clean.dropna(axis=1, how='all')
# 2. Drop columns > 50% missing
thresh = len(df_clean) * 0.5
df_clean = df_clean.dropna(axis=1, thresh=thresh)
report['cols_dropped'] = df.shape[1] - df_clean.shape[1]
# 3. Fill numeric with median
for col in df_clean.select_dtypes(include='number').columns:
df_clean[col] = df_clean[col].fillna(df_clean[col].median())
# 4. Fill categorical with mode
for col in df_clean.select_dtypes(include=['object','category']).columns:
if df_clean[col].isnull().any():
df_clean[col] = df_clean[col].fillna(df_clean[col].mode()[0])
# 5. Remove duplicates
if id_col and date_col:
df_clean = (df_clean
.sort_values(date_col, ascending=False)
.drop_duplicates(subset=[id_col], keep='first'))
else:
df_clean = df_clean.drop_duplicates()
report.update({
'rows_end': len(df_clean),
'rows_removed': report['rows_start'] - len(df_clean),
'nulls_remaining':df_clean.isnull().sum().sum(),
'dupes_remaining':df_clean.duplicated().sum()
})
if verbose:
print(f"✅ {report['rows_start']:,} rows → {report['rows_end']:,}")
print(f"✅ Nulls remaining : {report['nulls_remaining']}")
print(f"✅ Dupes remaining : {report['dupes_remaining']}")
return df_clean, report
# Usage
df_clean, report = clean_dataframe(
df_raw, id_col='customer_id', date_col='order_date'
)
Running clean_dataframe(df_raw) on a 12,000-row messy dataset produces: 0 remaining nulls, 0 remaining duplicates, a full row-count report, and a DataFrame ready to pass directly into any sklearn pipeline. Save the report dict alongside your clean CSV for full reproducibility.
Golden Rules of Data Cleaning
Data cleaning is not a chore you rush through to get to modelling — it is where the most important analytical decisions are made. Every choice about a missing value or a duplicate row encodes an assumption about the world. Make those assumptions consciously, document them clearly, and validate them rigorously. Clean data is not just technically correct — it is ethically sound.