Fixing Inconsistent Data & Outlier Detection in Python

Section 01

The Hidden Cost of Inconsistent Data

After missing values and duplicates are handled, a subtler and often more dangerous class of data quality problem remains: inconsistency. Inconsistent data looks valid on the surface — it passes null checks, it has the right data type, it appears in the right column — but it is quietly wrong in ways that corrupt every downstream calculation.

📖 Real-World Story

The ₹12 Crore City That Never Existed

A logistics company was analysing delivery performance by city. Their dataset had a "city" column filled by sales representatives manually typing values into a mobile app. When the data team ran their first groupby, they found 847 unique city values instead of the expected 28. "Mumbai" appeared as: "mumbai", "Mumbai", "MUMBAI", "Mum.", "mum", "Mumbay", "Bombay", and 6 other variations. Each variation was treated as a separate city — meaning every aggregate, every KPI, and every heatmap was wrong. The logistics model trained on this data routed ₹12 crore worth of deliveries suboptimally for three months before the problem was found. The fix was 11 lines of pandas code. The cost of not fixing it was discovered only in a quarterly audit.

⚠️

Inconsistency Is Silent — It Never Throws an Error

Missing values raise flags. Wrong data types cause immediate errors. But inconsistent data — "Male" vs "male" vs "M", "2024/03/12" vs "12-03-2024" — silently passes through every check and corrupts every result. You must actively hunt for it, because it will never announce itself.

🗺️ Six Types of Data Inconsistency

Case Inconsistency

"Mumbai" / "mumbai"
"MUMBAI" / "Mum."
"Male" / "MALE" / "male"

Whitespace & Padding

"Delhi " / " Delhi"
" Chennai "
"Pune\t" / "Pune\n"

Abbreviation Variants

"Dr." / "Doctor" / "dr"
"Ltd" / "Limited" / "ltd."
"kg" / "KG" / "Kg"

Date Format Mix

"2024-03-12"
"12/03/2024"
"March 12, 2024"
"12-Mar-24"

Unit Inconsistency

500 (grams) / 0.5 (kg)
100 (cm) / 1 (meter)
USD 50 / ₹4150

Encoding Variants

True / "True" / 1 / "yes"
NaN / "NA" / "N/A" / "none"
0 / "0" / "zero" / False

Each type requires a different fix. Case issues need .str.lower(). Whitespace needs .str.strip(). Abbreviations need a mapping dictionary. Dates need pd.to_datetime(). Units need multiplication. Encodings need .replace().

Section 02

Fixing String Inconsistencies

String columns are the most common source of data inconsistency. Text entered by humans — through forms, CRMs, mobile apps — accumulates case variations, trailing spaces, abbreviations, and typos that pandas treats as entirely different values.

import pandas as pd
import numpy  as np

# ── Step 1: Diagnose the problem ─────────────────────
print(df['city'].value_counts().head(20))  # spot the variants
print(f"Unique values: {df['city'].nunique()}") # 847 → should be 28

# ── Step 2: Basic normalisation ──────────────────────
df['city'] = (df['city']
    .str.lower()          # "Mumbai" → "mumbai"
    .str.strip()          # " mumbai " → "mumbai"
    .str.replace(r'\s+', ' ', regex=True)  # "new  delhi" → "new delhi"
)

# ── Step 3: Abbreviation map ─────────────────────────
city_map = {
    'mum':     'mumbai',   'mum.':    'mumbai',
    'bombay':  'mumbai',   'mumbay':  'mumbai',
    'del':     'delhi',    'new del': 'delhi',
    'blr':     'bengaluru','bangalore':'bengaluru',
    'chn':     'chennai',  'madras':  'chennai',
    'hyd':     'hyderabad','hyd.':    'hyderabad'
}
df['city'] = df['city'].replace(city_map)

# ── Step 4: Validate ─────────────────────────────────
print(f"Unique values after: {df['city'].nunique()}") # now 28 ✅

❌ Before — 8 Variants of Mumbai

city (raw)	count
Mumbai	4821
mumbai	2108
MUMBAI	876
Mum.	412
Bombay	301
Mumbay	89
mum	64
Mumbai	31

✅ After — 1 Canonical Form

city (clean)	count
mumbai	8702
delhi	6441
bengaluru	4821
hyderabad	3102
chennai	2901
pune	1844
847 variants → 28 clean cities

📊 Output: Unique Category Count — Before vs After Normalisation

String normalisation reduced "city" from 847 phantom categories to 28 real ones. "gender" went from 9 variants down to 3. "product_category" from 24 to 8. Each reduction is a direct improvement in model quality.

Fuzzy String Matching for Harder Cases

from rapidfuzz import process, fuzz

# Find the best canonical match for each dirty value
canonical = ['mumbai', 'delhi', 'bengaluru', 'hyderabad', 'chennai']

def fuzzy_fix(val, choices, threshold=80):
    result = process.extractOne(val, choices, scorer=fuzz.ratio)
    return result[0] if result and result[1] >= threshold else val

df['city'] = df['city'].apply(lambda x: fuzzy_fix(x, canonical))

# Install: pip install rapidfuzz

📐

Normalisation Order Matters

Always apply normalisation in this order: (1) lowercase, (2) strip whitespace, (3) remove special characters, (4) apply abbreviation map, (5) fuzzy match. Each step makes the next step more effective. Applying the map before lowercasing means half your mappings will miss.

Section 03

Fixing Date & Time Inconsistencies

Date columns are the second most common source of inconsistency in real datasets. When data flows from multiple systems — a legacy ERP, a web form, a mobile app, and a spreadsheet — each source uses its own date format. Pandas treats every format as a separate string value until you explicitly parse and standardise them.

📖 Real-World Story

The Sales Model That Lost Six Months of History

A sales analytics team was building a monthly forecasting model for a retail chain. Their transaction table merged data from three sources: a POS system (dates as "DD/MM/YYYY"), a web store (dates as "YYYY-MM-DD"), and a legacy ERP (dates as "DD-Mon-YY", e.g. "12-Mar-24"). When pandas read the CSV, all three formats were stored as plain strings. When the team sorted by date to extract the training window, the sort was alphabetical — not chronological. "12-Mar-24" sorted before "2024-03-01", which sorted before "01/03/2024". The model trained on a randomly scrambled timeline. Six months of data was effectively noise. One call to pd.to_datetime(col, dayfirst=True, errors='coerce') and subsequent formatting fixed the entire problem in under 30 seconds.

# ── Diagnose: spot the format variants ───────────────
print(df['order_date'].value_counts().head(10))

# ── Parse mixed formats ──────────────────────────────
df['order_date'] = pd.to_datetime(
    df['order_date'],
    dayfirst=True,      # DD/MM/YYYY → day comes first
    errors='coerce'    # unparseable → NaT (not error)
)

# ── Check how many dates failed to parse ─────────────
failed = df['order_date'].isnull().sum()
print(f"{failed} dates could not be parsed → inspect those rows")

# ── Standardise to a single string format if needed ──
df['order_date_str'] = df['order_date'].dt.strftime('%Y-%m-%d')

# ── Extract useful features from clean dates ─────────
df['year']       = df['order_date'].dt.year
df['month']      = df['order_date'].dt.month
df['day_of_week'] = df['order_date'].dt.day_name()
df['quarter']    = df['order_date'].dt.quarter
df['is_weekend']  = df['order_date'].dt.dayofweek >= 5

❌ Before — Mixed Date Formats

order_date (raw)	source
12/03/2024	POS system
2024-03-15	Web store
20-Mar-24	Legacy ERP
March 22, 2024	Manual entry
25.03.2024	Finance tool
20240328	API export

✅ After — Unified datetime64

order_date (clean)	dtype
2024-03-12	datetime64
2024-03-15	datetime64
2024-03-20	datetime64
2024-03-22	datetime64
2024-03-25	datetime64
2024-03-28	datetime64

Section 04

Fixing Units, Encodings & Type Inconsistencies

Unit and encoding inconsistencies are particularly dangerous because they produce values that look numerically valid — they pass all range checks — but are off by a constant factor. A weight column mixing grams and kilograms will make 500g look like 500kg to every downstream calculation.

# ── Fix boolean / yes-no encoding ────────────────────
yes_no_map = {
    'yes':True,  'y':True,   '1':True,  'true':True,
    'no':False,  'n':False,  '0':False, 'false':False
}
df['is_returned'] = (df['is_returned']
    .astype('str').str.lower().str.strip()
    .map(yes_no_map))

# ── Fix "NA" strings that pandas misses ──────────────
na_strings = ['na', 'n/a', 'none', 'null', '-', '', 'nan', 'unknown']
df = df.replace(na_strings, np.nan)

# ── Fix mixed currency (USD → INR) ───────────────────
usd_mask = df['currency'] == 'USD'
df.loc[usd_mask, 'amount'] = df.loc[usd_mask, 'amount'] * 83.5
df['currency'] = 'INR'

# ── Fix mixed weight units (kg/g → g) ────────────────
kg_mask = df['unit'].str.lower() == 'kg'
df.loc[kg_mask, 'weight'] *= 1000
df['unit'] = 'g'

# ── Standardise gender column ─────────────────────────
gender_map = {
    'm':'Male',    'male':'Male',   'man':'Male',   'boy':'Male',
    'f':'Female',  'female':'Female','woman':'Female',
    'other':'Other', 'o':'Other',  'non-binary':'Other'
}
df['gender'] = (df['gender']
    .str.lower().str.strip()
    .map(gender_map)
    .fillna('Unknown'))

💡

Use pd.read_csv(na_values=...) at Load Time

Catch fake NA strings before they even enter the DataFrame. Pass a custom list at read time: pd.read_csv('file.csv', na_values=['NA', 'N/A', 'none', '-', 'unknown', '']). This converts all variants to real NaN immediately, so your null audit is accurate from the first line of code.

Section 05

Outlier Detection — Understanding the Problem

An outlier is a data point that lies abnormally far from the bulk of the data. But "abnormally far" is not a fixed rule — it depends on the distribution of the data, the domain context, and what the outlier represents. Before detecting outliers, you must ask: is this a genuine extreme value, or is it a data quality error?

📖 Real-World Story

The Data Scientist Who Deleted Their Best Customer

A data scientist at a B2B SaaS company was cleaning the company's customer transaction dataset before training a churn prediction model. She ran an IQR outlier detector on the "annual_revenue_from_customer" column and automatically deleted all rows beyond the 1.5×IQR fence. One of the deleted rows was the company's single largest enterprise client — a ₹4.2 crore annual contract that represented 18% of total revenue. The model trained without this customer learned that high-revenue customers were not at risk of churning — because there was no high-revenue customer in the training data. The model then confidently predicted the enterprise client as "low churn risk" when the client was actively negotiating with a competitor. The company lost the contract six months later. The lesson: always inspect outliers before removing them. They may be your most important data points.

🗺️ The Outlier Decision Framework

The decision to keep, remove, or transform an outlier is never purely statistical — it requires domain knowledge. Always inspect outliers manually before applying any treatment.

Section 06

Outlier Detection Methods

Three statistical methods cover the vast majority of outlier detection needs in data science. Each makes different assumptions about the data's distribution and each has its correct use case.

📦

IQR Method

Q1 − 1.5×IQR / Q3 + 1.5×IQR

Non-parametric. Makes no distribution assumption. Works for skewed data. The default choice for most business data (income, prices, counts).

📐

Z-Score

z = (x − μ) / σ |z| > 3

Parametric. Assumes normality. Fast and interpretable. Use only when the column is approximately normally distributed. Sensitive to the outliers it's trying to find.

🌲

Isolation Forest

anomaly_score < threshold

Machine learning method. Detects multivariate outliers — points that are anomalous across multiple dimensions simultaneously. Use for high-dimensional datasets.

📊 Three Methods — Same Data, Different Detections

Normal data IQR outlier Z-score outlier Isolation Forest outlier

The three methods overlap but are not identical. IQR is the widest net. Z-score only catches extreme univariate values. Isolation Forest can flag points in clusters that are anomalous in context but not extreme individually.

Method 1 — IQR Method (Recommended Default)

# ── Compute IQR bounds ────────────────────────────────
Q1  = df['purchase_amount'].quantile(0.25)
Q3  = df['purchase_amount'].quantile(0.75)
IQR = Q3 - Q1

lower_fence = Q1 - 1.5 * IQR
upper_fence = Q3 + 1.5 * IQR

print(f"Lower fence : ₹{lower_fence:,.0f}")
print(f"Upper fence : ₹{upper_fence:,.0f}")

# ── Flag outliers ─────────────────────────────────────
df['is_outlier_iqr'] = (
    (df['purchase_amount'] < lower_fence) |
    (df['purchase_amount'] > upper_fence)
)
print(f"Outliers: {df['is_outlier_iqr'].sum()} rows ({df['is_outlier_iqr'].mean()*100:.1f}%)")

# ── Inspect before deciding treatment ────────────────
df[df['is_outlier_iqr']].sort_values('purchase_amount', ascending=False).head(20)

Method 2 — Z-Score Method

from scipy import stats

# ── Standard Z-score ─────────────────────────────────
z_scores = np.abs(stats.zscore(df['purchase_amount'].dropna()))
df['is_outlier_z'] = z_scores > 3

# ── Modified Z-score (robust — uses median) ──────────
median_val = df['purchase_amount'].median()
mad        = (df['purchase_amount'] - median_val).abs().median()
mod_z      = 0.6745 * (df['purchase_amount'] - median_val) / mad
df['is_outlier_modz'] = mod_z.abs() > 3.5

# Modified Z is better for skewed data — use it over standard Z

Method 3 — Isolation Forest (Multivariate)

from sklearn.ensemble import IsolationForest

# ── Select numeric features ───────────────────────────
features = df[['age', 'purchase_amount', 'rating', 'delivery_days']].dropna()

# ── Fit Isolation Forest ──────────────────────────────
iso = IsolationForest(
    contamination=0.05,   # expect ~5% of data to be anomalous
    random_state=42,
    n_estimators=200
)
df['is_outlier_iso'] = iso.fit_predict(features) == -1  # -1 = anomaly

# ── Anomaly scores (more negative = more anomalous) ──
df['anomaly_score'] = iso.score_samples(features)

print(f"Isolation Forest anomalies: {df['is_outlier_iso'].sum()}")

📊 Output: Distribution with IQR Fences & Outliers Flagged

Normal values Outlier values IQR fences

The IQR method catches 3.7% of values as outliers — both on the very low end (returns/refunds) and the very high end (enterprise orders). Inspect both groups before deciding treatment.

Section 07

Outlier Treatment Strategies

Once outliers are identified and inspected, there are four treatment strategies. The choice depends entirely on whether the outlier is a data error or a legitimate extreme value, and on the requirements of the downstream model.

Strategy	Method	When to Use	Risk
Remove	df = df[~df['is_outlier']]	Confirmed data entry errors only	High if legitimate
Cap (Winsorise)	.clip(lower, upper)	Extreme but legitimate values; regression models	Medium
Transform	np.log1p(col)	Right-skewed data with long tail (income, revenue)	Low
Keep + Scale	RobustScaler	Legitimate extremes; tree models handle naturally	Low

Treatment 1 — Remove (Errors Only)

# ONLY for confirmed data errors
df = df[
    (df['age'].between(18, 100)) &
    (df['purchase_amount'] >= 0) &
    (df['rating'].between(1, 5))
]

Treatment 2 — Cap / Winsorise

# ── Manual clip at IQR fences ────────────────────────
df['purchase_amount'] = df['purchase_amount'].clip(
    lower=lower_fence, upper=upper_fence
)

# ── Winsorise at 1st / 99th percentile ───────────────
p01 = df['purchase_amount'].quantile(0.01)
p99 = df['purchase_amount'].quantile(0.99)
df['purchase_amount'] = df['purchase_amount'].clip(p01, p99)

# ── Using scipy.stats.mstats.winsorize ───────────────
from scipy.stats.mstats import winsorize
df['purchase_winsorized'] = winsorize(
    df['purchase_amount'], limits=[0.01, 0.01]  # clip 1% from each tail
)

Treatment 3 — Log Transformation

# ── Log transformation (right-skewed data) ───────────
df['purchase_log'] = np.log1p(df['purchase_amount'])  # log(1+x) safe for 0

# ── Square root (moderate skew) ──────────────────────
df['purchase_sqrt'] = np.sqrt(df['purchase_amount'])

# ── Box-Cox (finds best transformation automatically) ─
from scipy.stats import boxcox
df['purchase_boxcox'], lambda_val = boxcox(df['purchase_amount'] + 1)
print(f"Best Box-Cox lambda: {lambda_val:.3f}")

📊 Output: Log Transformation — Before (right-skewed) vs After (near-normal)

Before: purchase_amount (right-skewed)

After: np.log1p(purchase_amount) (near-normal)

Log transformation compresses the long right tail and brings extreme values closer to the bulk of the data. Most linear models assume normality — log transforming right-skewed features often improves them significantly.

Treatment 4 — RobustScaler (Keep Outliers, Scale Robustly)

from sklearn.preprocessing import RobustScaler, StandardScaler

# RobustScaler uses median and IQR — not affected by outliers
scaler    = RobustScaler()
df['purchase_robust'] = scaler.fit_transform(df[['purchase_amount']])

# StandardScaler — affected by outliers (uses mean and std)
std_scaler = StandardScaler()
df['purchase_std'] = std_scaler.fit_transform(df[['purchase_amount']])

# RobustScaler is the recommended default when outliers are legitimate

📊 Output: RobustScaler vs StandardScaler — Effect of Outliers on Scaling

Original RobustScaler (resistant) StandardScaler (distorted)

StandardScaler (red) compresses all normal values toward zero because the mean and std are dragged by outliers. RobustScaler (green) preserves the spread of normal values because it uses the median and IQR instead.

Section 08

Complete Inconsistency & Outlier Fixing Template

def fix_inconsistencies(df, string_maps=None):
    """Fix case, whitespace, and abbreviation issues in all string columns."""
    df = df.copy()
    cat_cols = df.select_dtypes(include=['object', 'category']).columns
    for col in cat_cols:
        df[col] = df[col].str.lower().str.strip()
        if string_maps and col in string_maps:
            df[col] = df[col].replace(string_maps[col])
    return df

def detect_and_treat_outliers(df, cols, method='iqr', treatment='cap'):
    """
    Detect and treat outliers in specified numeric columns.
    method    : 'iqr' | 'zscore'
    treatment : 'cap' | 'log' | 'flag_only'
    """
    df = df.copy()
    report = {}
    for col in cols:
        series = df[col].dropna()
        if method == 'iqr':
            Q1, Q3  = series.quantile([0.25, 0.75])
            IQR     = Q3 - Q1
            lo, hi  = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
        else:  # zscore
            lo = series.mean() - 3 * series.std()
            hi = series.mean() + 3 * series.std()

        outlier_mask = (df[col] < lo) | (df[col] > hi)
        report[col]  = {'n_outliers': outlier_mask.sum(), 'lo': lo, 'hi': hi}
        df[col + '_outlier'] = outlier_mask

        if treatment == 'cap':
            df[col] = df[col].clip(lo, hi)
        elif treatment == 'log':
            df[col] = np.log1p(df[col].clip(lower=0))

    return df, report

# ── Usage ─────────────────────────────────────────────
maps = {
    'city':   {'mum':'mumbai', 'bombay':'mumbai', 'del':'delhi'},
    'gender': {'m':'male', 'f':'female', 'man':'male'}
}
df = fix_inconsistencies(df, string_maps=maps)
df, out_report = detect_and_treat_outliers(
    df, cols=['purchase_amount', 'age', 'delivery_days'],
    method='iqr', treatment='cap'
)
print(pd.DataFrame(out_report).T)

✅

What This Template Produces

Running both functions on a raw dataset produces: normalised string columns, standardised categories, a per-column outlier report with counts and fences, outlier flag columns for downstream auditing, and treated numeric values ready for modelling — all in under 50 lines of reusable code.

Section 09

Golden Rules — Inconsistency & Outliers

🎯 8 Rules Every Data Scientist Must Follow

Always run df['col'].value_counts() on every categorical column before modelling. The number of unique values should match your domain knowledge. 847 cities when you expect 28 is a signal that demands investigation.

Normalise strings in the order: lowercase → strip → remove extra spaces → map abbreviations → fuzzy match. Each step must come before the next. Skipping the order means earlier steps miss values that later steps would have caught.

Always use pd.to_datetime(col, errors='coerce') for date columns. The errors='coerce' parameter converts unparseable dates to NaT instead of raising an exception — you then audit those NaT rows separately.

Replace fake NA strings at data load time using na_values=['NA','N/A','none','-','unknown']. Fake NAs that survive as strings corrupt null counts, fill operations, and every downstream calculation that depends on them.

Never automatically delete outliers. Always inspect them first with df[df['is_outlier']].sort_values('col', ascending=False).head(20). Your most important data points are often your most extreme ones.

Use the IQR method as your default outlier detector — it makes no distribution assumption. Only switch to Z-score when you have confirmed the column is approximately normally distributed using a histogram or Q-Q plot.

Use RobustScaler instead of StandardScaler whenever your data has legitimate extreme values. StandardScaler uses the mean and standard deviation — both are severely distorted by outliers. RobustScaler uses the median and IQR, which are not.

Document every inconsistency fix and outlier treatment decision in a data dictionary. List what was changed, why, and what assumption underlies each decision. Undocumented cleaning is a liability — for your model, your team, and any regulatory audit.

🧮

Key Takeaway

Inconsistent data and outliers are not the same problem but they share the same cure: systematic investigation before treatment. The data scientist who runs value_counts() on every column before cleaning, and inspects every flagged outlier before removing it, will build models that are more accurate, more robust, and more trustworthy than one who applies blanket fixes. Curiosity is the most powerful data cleaning tool.