The Hidden Cost of Inconsistent Data
After missing values and duplicates are handled, a subtler and often more dangerous class of data quality problem remains: inconsistency. Inconsistent data looks valid on the surface — it passes null checks, it has the right data type, it appears in the right column — but it is quietly wrong in ways that corrupt every downstream calculation.
Missing values raise flags. Wrong data types cause immediate errors. But inconsistent data — "Male" vs "male" vs "M", "2024/03/12" vs "12-03-2024" — silently passes through every check and corrupts every result. You must actively hunt for it, because it will never announce itself.
"MUMBAI" / "Mum."
"Male" / "MALE" / "male"
" Chennai "
"Pune\t" / "Pune\n"
"Ltd" / "Limited" / "ltd."
"kg" / "KG" / "Kg"
"12/03/2024"
"March 12, 2024"
"12-Mar-24"
100 (cm) / 1 (meter)
USD 50 / ₹4150
NaN / "NA" / "N/A" / "none"
0 / "0" / "zero" / False
Each type requires a different fix. Case issues need .str.lower(). Whitespace needs .str.strip(). Abbreviations need a mapping dictionary. Dates need pd.to_datetime(). Units need multiplication. Encodings need .replace().
Fixing String Inconsistencies
String columns are the most common source of data inconsistency. Text entered by humans — through forms, CRMs, mobile apps — accumulates case variations, trailing spaces, abbreviations, and typos that pandas treats as entirely different values.
import pandas as pd
import numpy as np
# ── Step 1: Diagnose the problem ─────────────────────
print(df['city'].value_counts().head(20)) # spot the variants
print(f"Unique values: {df['city'].nunique()}") # 847 → should be 28
# ── Step 2: Basic normalisation ──────────────────────
df['city'] = (df['city']
.str.lower() # "Mumbai" → "mumbai"
.str.strip() # " mumbai " → "mumbai"
.str.replace(r'\s+', ' ', regex=True) # "new delhi" → "new delhi"
)
# ── Step 3: Abbreviation map ─────────────────────────
city_map = {
'mum': 'mumbai', 'mum.': 'mumbai',
'bombay': 'mumbai', 'mumbay': 'mumbai',
'del': 'delhi', 'new del': 'delhi',
'blr': 'bengaluru','bangalore':'bengaluru',
'chn': 'chennai', 'madras': 'chennai',
'hyd': 'hyderabad','hyd.': 'hyderabad'
}
df['city'] = df['city'].replace(city_map)
# ── Step 4: Validate ─────────────────────────────────
print(f"Unique values after: {df['city'].nunique()}") # now 28 ✅
| city (raw) | count |
|---|---|
| Mumbai | 4821 |
| mumbai | 2108 |
| MUMBAI | 876 |
| Mum. | 412 |
| Bombay | 301 |
| Mumbay | 89 |
| mum | 64 |
| Mumbai | 31 |
| city (clean) | count |
|---|---|
| mumbai | 8702 |
| delhi | 6441 |
| bengaluru | 4821 |
| hyderabad | 3102 |
| chennai | 2901 |
| pune | 1844 |
| 847 variants → 28 clean cities | |
String normalisation reduced "city" from 847 phantom categories to 28 real ones. "gender" went from 9 variants down to 3. "product_category" from 24 to 8. Each reduction is a direct improvement in model quality.
Fuzzy String Matching for Harder Cases
from rapidfuzz import process, fuzz
# Find the best canonical match for each dirty value
canonical = ['mumbai', 'delhi', 'bengaluru', 'hyderabad', 'chennai']
def fuzzy_fix(val, choices, threshold=80):
result = process.extractOne(val, choices, scorer=fuzz.ratio)
return result[0] if result and result[1] >= threshold else val
df['city'] = df['city'].apply(lambda x: fuzzy_fix(x, canonical))
# Install: pip install rapidfuzz
Always apply normalisation in this order: (1) lowercase, (2) strip whitespace, (3) remove special characters, (4) apply abbreviation map, (5) fuzzy match. Each step makes the next step more effective. Applying the map before lowercasing means half your mappings will miss.
Fixing Date & Time Inconsistencies
Date columns are the second most common source of inconsistency in real datasets. When data flows from multiple systems — a legacy ERP, a web form, a mobile app, and a spreadsheet — each source uses its own date format. Pandas treats every format as a separate string value until you explicitly parse and standardise them.
# ── Diagnose: spot the format variants ───────────────
print(df['order_date'].value_counts().head(10))
# ── Parse mixed formats ──────────────────────────────
df['order_date'] = pd.to_datetime(
df['order_date'],
dayfirst=True, # DD/MM/YYYY → day comes first
errors='coerce' # unparseable → NaT (not error)
)
# ── Check how many dates failed to parse ─────────────
failed = df['order_date'].isnull().sum()
print(f"{failed} dates could not be parsed → inspect those rows")
# ── Standardise to a single string format if needed ──
df['order_date_str'] = df['order_date'].dt.strftime('%Y-%m-%d')
# ── Extract useful features from clean dates ─────────
df['year'] = df['order_date'].dt.year
df['month'] = df['order_date'].dt.month
df['day_of_week'] = df['order_date'].dt.day_name()
df['quarter'] = df['order_date'].dt.quarter
df['is_weekend'] = df['order_date'].dt.dayofweek >= 5
| order_date (raw) | source |
|---|---|
| 12/03/2024 | POS system |
| 2024-03-15 | Web store |
| 20-Mar-24 | Legacy ERP |
| March 22, 2024 | Manual entry |
| 25.03.2024 | Finance tool |
| 20240328 | API export |
| order_date (clean) | dtype |
|---|---|
| 2024-03-12 | datetime64 |
| 2024-03-15 | datetime64 |
| 2024-03-20 | datetime64 |
| 2024-03-22 | datetime64 |
| 2024-03-25 | datetime64 |
| 2024-03-28 | datetime64 |
Fixing Units, Encodings & Type Inconsistencies
Unit and encoding inconsistencies are particularly dangerous because they produce values that look numerically valid — they pass all range checks — but are off by a constant factor. A weight column mixing grams and kilograms will make 500g look like 500kg to every downstream calculation.
# ── Fix boolean / yes-no encoding ────────────────────
yes_no_map = {
'yes':True, 'y':True, '1':True, 'true':True,
'no':False, 'n':False, '0':False, 'false':False
}
df['is_returned'] = (df['is_returned']
.astype('str').str.lower().str.strip()
.map(yes_no_map))
# ── Fix "NA" strings that pandas misses ──────────────
na_strings = ['na', 'n/a', 'none', 'null', '-', '', 'nan', 'unknown']
df = df.replace(na_strings, np.nan)
# ── Fix mixed currency (USD → INR) ───────────────────
usd_mask = df['currency'] == 'USD'
df.loc[usd_mask, 'amount'] = df.loc[usd_mask, 'amount'] * 83.5
df['currency'] = 'INR'
# ── Fix mixed weight units (kg/g → g) ────────────────
kg_mask = df['unit'].str.lower() == 'kg'
df.loc[kg_mask, 'weight'] *= 1000
df['unit'] = 'g'
# ── Standardise gender column ─────────────────────────
gender_map = {
'm':'Male', 'male':'Male', 'man':'Male', 'boy':'Male',
'f':'Female', 'female':'Female','woman':'Female',
'other':'Other', 'o':'Other', 'non-binary':'Other'
}
df['gender'] = (df['gender']
.str.lower().str.strip()
.map(gender_map)
.fillna('Unknown'))
Catch fake NA strings before they even enter the DataFrame. Pass a custom list at read time: pd.read_csv('file.csv', na_values=['NA', 'N/A', 'none', '-', 'unknown', '']). This converts all variants to real NaN immediately, so your null audit is accurate from the first line of code.
Outlier Detection — Understanding the Problem
An outlier is a data point that lies abnormally far from the bulk of the data. But "abnormally far" is not a fixed rule — it depends on the distribution of the data, the domain context, and what the outlier represents. Before detecting outliers, you must ask: is this a genuine extreme value, or is it a data quality error?
The decision to keep, remove, or transform an outlier is never purely statistical — it requires domain knowledge. Always inspect outliers manually before applying any treatment.
Outlier Detection Methods
Three statistical methods cover the vast majority of outlier detection needs in data science. Each makes different assumptions about the data's distribution and each has its correct use case.
The three methods overlap but are not identical. IQR is the widest net. Z-score only catches extreme univariate values. Isolation Forest can flag points in clusters that are anomalous in context but not extreme individually.
Method 1 — IQR Method (Recommended Default)
# ── Compute IQR bounds ────────────────────────────────
Q1 = df['purchase_amount'].quantile(0.25)
Q3 = df['purchase_amount'].quantile(0.75)
IQR = Q3 - Q1
lower_fence = Q1 - 1.5 * IQR
upper_fence = Q3 + 1.5 * IQR
print(f"Lower fence : ₹{lower_fence:,.0f}")
print(f"Upper fence : ₹{upper_fence:,.0f}")
# ── Flag outliers ─────────────────────────────────────
df['is_outlier_iqr'] = (
(df['purchase_amount'] < lower_fence) |
(df['purchase_amount'] > upper_fence)
)
print(f"Outliers: {df['is_outlier_iqr'].sum()} rows ({df['is_outlier_iqr'].mean()*100:.1f}%)")
# ── Inspect before deciding treatment ────────────────
df[df['is_outlier_iqr']].sort_values('purchase_amount', ascending=False).head(20)
Method 2 — Z-Score Method
from scipy import stats
# ── Standard Z-score ─────────────────────────────────
z_scores = np.abs(stats.zscore(df['purchase_amount'].dropna()))
df['is_outlier_z'] = z_scores > 3
# ── Modified Z-score (robust — uses median) ──────────
median_val = df['purchase_amount'].median()
mad = (df['purchase_amount'] - median_val).abs().median()
mod_z = 0.6745 * (df['purchase_amount'] - median_val) / mad
df['is_outlier_modz'] = mod_z.abs() > 3.5
# Modified Z is better for skewed data — use it over standard Z
Method 3 — Isolation Forest (Multivariate)
from sklearn.ensemble import IsolationForest
# ── Select numeric features ───────────────────────────
features = df[['age', 'purchase_amount', 'rating', 'delivery_days']].dropna()
# ── Fit Isolation Forest ──────────────────────────────
iso = IsolationForest(
contamination=0.05, # expect ~5% of data to be anomalous
random_state=42,
n_estimators=200
)
df['is_outlier_iso'] = iso.fit_predict(features) == -1 # -1 = anomaly
# ── Anomaly scores (more negative = more anomalous) ──
df['anomaly_score'] = iso.score_samples(features)
print(f"Isolation Forest anomalies: {df['is_outlier_iso'].sum()}")
The IQR method catches 3.7% of values as outliers — both on the very low end (returns/refunds) and the very high end (enterprise orders). Inspect both groups before deciding treatment.
Outlier Treatment Strategies
Once outliers are identified and inspected, there are four treatment strategies. The choice depends entirely on whether the outlier is a data error or a legitimate extreme value, and on the requirements of the downstream model.
| Strategy | Method | When to Use | Risk |
|---|---|---|---|
| Remove | df = df[~df['is_outlier']] | Confirmed data entry errors only | High if legitimate |
| Cap (Winsorise) | .clip(lower, upper) | Extreme but legitimate values; regression models | Medium |
| Transform | np.log1p(col) | Right-skewed data with long tail (income, revenue) | Low |
| Keep + Scale | RobustScaler | Legitimate extremes; tree models handle naturally | Low |
Treatment 1 — Remove (Errors Only)
# ONLY for confirmed data errors
df = df[
(df['age'].between(18, 100)) &
(df['purchase_amount'] >= 0) &
(df['rating'].between(1, 5))
]
Treatment 2 — Cap / Winsorise
# ── Manual clip at IQR fences ────────────────────────
df['purchase_amount'] = df['purchase_amount'].clip(
lower=lower_fence, upper=upper_fence
)
# ── Winsorise at 1st / 99th percentile ───────────────
p01 = df['purchase_amount'].quantile(0.01)
p99 = df['purchase_amount'].quantile(0.99)
df['purchase_amount'] = df['purchase_amount'].clip(p01, p99)
# ── Using scipy.stats.mstats.winsorize ───────────────
from scipy.stats.mstats import winsorize
df['purchase_winsorized'] = winsorize(
df['purchase_amount'], limits=[0.01, 0.01] # clip 1% from each tail
)
Treatment 3 — Log Transformation
# ── Log transformation (right-skewed data) ───────────
df['purchase_log'] = np.log1p(df['purchase_amount']) # log(1+x) safe for 0
# ── Square root (moderate skew) ──────────────────────
df['purchase_sqrt'] = np.sqrt(df['purchase_amount'])
# ── Box-Cox (finds best transformation automatically) ─
from scipy.stats import boxcox
df['purchase_boxcox'], lambda_val = boxcox(df['purchase_amount'] + 1)
print(f"Best Box-Cox lambda: {lambda_val:.3f}")
Before: purchase_amount (right-skewed)
After: np.log1p(purchase_amount) (near-normal)
Log transformation compresses the long right tail and brings extreme values closer to the bulk of the data. Most linear models assume normality — log transforming right-skewed features often improves them significantly.
Treatment 4 — RobustScaler (Keep Outliers, Scale Robustly)
from sklearn.preprocessing import RobustScaler, StandardScaler
# RobustScaler uses median and IQR — not affected by outliers
scaler = RobustScaler()
df['purchase_robust'] = scaler.fit_transform(df[['purchase_amount']])
# StandardScaler — affected by outliers (uses mean and std)
std_scaler = StandardScaler()
df['purchase_std'] = std_scaler.fit_transform(df[['purchase_amount']])
# RobustScaler is the recommended default when outliers are legitimate
StandardScaler (red) compresses all normal values toward zero because the mean and std are dragged by outliers. RobustScaler (green) preserves the spread of normal values because it uses the median and IQR instead.
Complete Inconsistency & Outlier Fixing Template
def fix_inconsistencies(df, string_maps=None):
"""Fix case, whitespace, and abbreviation issues in all string columns."""
df = df.copy()
cat_cols = df.select_dtypes(include=['object', 'category']).columns
for col in cat_cols:
df[col] = df[col].str.lower().str.strip()
if string_maps and col in string_maps:
df[col] = df[col].replace(string_maps[col])
return df
def detect_and_treat_outliers(df, cols, method='iqr', treatment='cap'):
"""
Detect and treat outliers in specified numeric columns.
method : 'iqr' | 'zscore'
treatment : 'cap' | 'log' | 'flag_only'
"""
df = df.copy()
report = {}
for col in cols:
series = df[col].dropna()
if method == 'iqr':
Q1, Q3 = series.quantile([0.25, 0.75])
IQR = Q3 - Q1
lo, hi = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
else: # zscore
lo = series.mean() - 3 * series.std()
hi = series.mean() + 3 * series.std()
outlier_mask = (df[col] < lo) | (df[col] > hi)
report[col] = {'n_outliers': outlier_mask.sum(), 'lo': lo, 'hi': hi}
df[col + '_outlier'] = outlier_mask
if treatment == 'cap':
df[col] = df[col].clip(lo, hi)
elif treatment == 'log':
df[col] = np.log1p(df[col].clip(lower=0))
return df, report
# ── Usage ─────────────────────────────────────────────
maps = {
'city': {'mum':'mumbai', 'bombay':'mumbai', 'del':'delhi'},
'gender': {'m':'male', 'f':'female', 'man':'male'}
}
df = fix_inconsistencies(df, string_maps=maps)
df, out_report = detect_and_treat_outliers(
df, cols=['purchase_amount', 'age', 'delivery_days'],
method='iqr', treatment='cap'
)
print(pd.DataFrame(out_report).T)
Running both functions on a raw dataset produces: normalised string columns, standardised categories, a per-column outlier report with counts and fences, outlier flag columns for downstream auditing, and treated numeric values ready for modelling — all in under 50 lines of reusable code.
Golden Rules — Inconsistency & Outliers
Inconsistent data and outliers are not the same problem but they share the same cure: systematic investigation before treatment. The data scientist who runs value_counts() on every column before cleaning, and inspects every flagged outlier before removing it, will build models that are more accurate, more robust, and more trustworthy than one who applies blanket fixes. Curiosity is the most powerful data cleaning tool.