
Encoding Categorical Variables

A story-driven, comprehensive guide to converting categorical text data into numeric form using Label Encoding, One-Hot Encoding, Ordinal Encoding, and advanced techniques — with live diagrams, before/after tables, real-world stories, and complete reusable code.

Section 01

Why Machines Cannot Read Text Labels

Every machine learning algorithm is fundamentally a mathematical function — it multiplies, adds, takes derivatives, and computes distances. Numbers are the only currency it understands. When your dataset contains columns like "city", "gender", "product_category", or "education_level", you are holding data in a format that a model cannot process directly. Encoding categorical variables is the process of translating human-readable text labels into numeric representations that preserve the right information — and crucially, that do not introduce false mathematical relationships that don't actually exist.

The Loan Model That Thought Delhi Was Better Than Mumbai
A junior data scientist at an NBFC (Non-Banking Financial Company) was building a credit risk model. The dataset had a "city" column with values: Mumbai, Delhi, Bengaluru, Chennai, Pune. To convert it to numbers, she applied scikit-learn's LabelEncoder, which sorts the labels and assigned: Bengaluru=0, Chennai=1, Delhi=2, Mumbai=3, Pune=4. She fed this directly into a Logistic Regression model. Because a linear model fits a single coefficient to that column, it could only learn a risk that rises or falls steadily along the codes. The model began learning a pattern: Delhi (2) borrowers were lower risk than Mumbai (3) borrowers, but higher risk than Chennai (1). This was completely fabricated: the numbers implied Delhi > Chennai mathematically, but no such ordering existed in the data. The model's credit decisions were influenced by an ordering that had been invented by alphabetical sorting. After replacing Label Encoding with One-Hot Encoding, the model's Gini coefficient improved from 0.48 to 0.61, a 27% relative improvement from fixing one encoding mistake.
⚠️
The Cardinal Sin of Encoding

Using Label Encoding on a nominal (unordered) categorical column for a linear or distance-based model is the most common encoding mistake in data science. It silently introduces a false mathematical ordering — the model learns it as a real signal and produces biased predictions. Always match the encoding method to the nature of the categorical variable.

🗺️ Six Encoding Methods — When to Use Each
🔢
Label Encoding
LabelEncoder
Assigns each category an integer 0, 1, 2, …, N−1. Use ONLY for binary features or tree-based models. Implies a false ordering for linear models.
🔳
One-Hot Encoding
get_dummies / OHE
Creates one binary column per category. No ordering implied. Best for nominal features with under 15 unique values.
📶
Ordinal Encoding
OrdinalEncoder
Assigns integers that preserve a user-defined order. Use ONLY when order genuinely exists: Low < Medium < High.
🎯
Target Encoding
mean target per cat.
Replaces category with mean target value. Powerful for high cardinality. Requires careful cross-validation to prevent leakage.
📊
Frequency Encoding
value_counts / len
Replaces category with its frequency in the dataset. Safe (no leakage), good for tree models, handles high cardinality well.
🔗
Binary / Hash Encoding
category_encoders
Encodes categories in binary bits. Reduces dimensionality vs OHE for high-cardinality columns. Useful for tree models at scale.

Match the encoding to the feature type: nominal (no order) → One-Hot. Ordinal (has order) → Ordinal. Binary → Label. High cardinality (>15) → Target or Frequency. The choice is not aesthetic — it is mathematical.
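The rule of thumb above can be written down as a small helper. This is only a sketch: `suggest_encoding` is a hypothetical function, not part of any library, and the threshold of 15 is this guide's heuristic rather than a hard limit.

```python
from typing import Iterable, Optional, Sequence


def suggest_encoding(values: Iterable[str],
                     order: Optional[Sequence[str]] = None,
                     max_one_hot: int = 15) -> str:
    """Suggest an encoding for one categorical column (heuristic only)."""
    n_unique = len(set(values))
    if order is not None:          # caller states that a real rank exists
        return "ordinal"
    if n_unique <= 2:              # binary column
        return "label"
    if n_unique <= max_one_hot:    # nominal, low cardinality
        return "one-hot"
    return "target or frequency"   # nominal, high cardinality


print(suggest_encoding(["Yes", "No", "Yes"]))
print(suggest_encoding(["Mumbai", "Delhi", "Pune"]))
print(suggest_encoding(["Low", "High"], order=["Low", "Medium", "High"]))
print(suggest_encoding([f"pin_{i}" for i in range(500)]))
```

The helper only looks at cardinality and a caller-supplied order; the choice between Target and Frequency Encoding still depends on the model and on how carefully leakage can be controlled.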


Section 02

Label Encoding — Integers for Categories

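Label Encoding assigns each unique category value an integer, starting from 0. It is the simplest encoding method and the most frequently misused one. The integers are assigned in sorted (alphabetical) order by default, which means the model sees a mathematical ranking that has no relationship to the actual data. Note that scikit-learn documents LabelEncoder as a tool for encoding the target y; for input features, OrdinalEncoder is the intended equivalent.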

🗺️ When Is Label Encoding Safe?
[Decision tree: when is label encoding safe to use?]
How many unique values does the column have?
2 (binary) → ✅ Label Encoding OK (Male/Female, Yes/No, 0/1)
2+ ordered → ✅ Use OrdinalEncoder and specify the order explicitly
3+ unordered → ❌ Do NOT Label Encode; use One-Hot Encoding
Exception: tree-based models (Random Forest, XGBoost) are safe with Label Encoding on nominal features because splits are threshold-based, not distance- or gradient-based.

Label Encoding — Code

from sklearn.preprocessing import LabelEncoder
import pandas as pd

# Assumes df is a pandas DataFrame with 'gender', 'is_returned' and 'has_discount' columns

# ── Simple LabelEncoder ───────────────────────────────
le = LabelEncoder()
df['gender_enc'] = le.fit_transform(df['gender'])

# See the mapping (a label's position in classes_ is its code)
mapping = {label: code for code, label in enumerate(le.classes_)}
print(mapping)   # {'Female': 0, 'Male': 1, 'Other': 2}

# ── Pandas map (explicit, always safer) ──────────────
gender_map = {'Female': 0, 'Male': 1, 'Other': 2}
df['gender_enc'] = df['gender'].map(gender_map)

# ── Encode multiple columns at once ──────────────────
binary_cols = ['is_returned', 'gender', 'has_discount']
encoders = {}   # keep one fitted encoder per column so each can be inverted later
for col in binary_cols:
    encoders[col] = LabelEncoder()
    df[col + '_enc'] = encoders[col].fit_transform(df[col].astype('str'))

# ── Inverse transform: numbers back to labels ─────────
# Use the encoder that was fitted on 'gender', not whichever was fitted last
df['gender_original'] = encoders['gender'].inverse_transform(df['gender_enc'])
❌ Before — Text labels
customer_id  gender  is_returned
1001         Male    Yes
1002         Female  No
1003         Other   Yes
1004         Male    No
1005         Female  Yes
✅ After LabelEncoder
customer_id  gender_enc  is_returned_enc
1001         1           1
1002         0           0
1003         2           1
1004         1           0
1005         0           1
📊 The Label Encoding Problem — False Mathematical Distance
[Diagram: the five cities placed on a number line at their label codes: Bengaluru=0, Chennai=1, Delhi=2, Mumbai=3, Pune=4. The model sees Pune (4) as 4× further from Bengaluru (0) than Chennai (1) is. This mathematical relationship does not exist in reality; it was invented by alphabetical sorting.]

LabelEncoder assigns integers alphabetically. The model interprets these integers as a number line — implying that Pune is further from Bengaluru than Chennai is. There is no geographic or business logic behind this ordering.
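To make the false distance concrete, the following sketch compares pairwise distances between cities under integer codes and under one-hot vectors. It uses only the standard library; the codes are built with `sorted()` to mirror the alphabetical assignment described above, rather than by calling LabelEncoder itself.

```python
from itertools import combinations
from math import dist

cities = ["Bengaluru", "Chennai", "Delhi", "Mumbai", "Pune"]

# Integer codes in sorted order, as in the diagram above
label_code = {city: i for i, city in enumerate(sorted(cities))}

# One-hot vectors: a 1 in the city's own position, 0 elsewhere
one_hot = {
    city: [1.0 if j == i else 0.0 for j in range(len(cities))]
    for city, i in label_code.items()
}

for a, b in combinations(cities, 2):
    d_label = abs(label_code[a] - label_code[b])   # varies from 1 to 4
    d_onehot = dist(one_hot[a], one_hot[b])        # identical for every pair
    print(f"{a:>9} vs {b:<9} label={d_label}  one-hot={d_onehot:.3f}")
```

Under integer codes the distance depends on where a name falls in the alphabet; under one-hot vectors every pair of distinct cities is equally far apart, which is the neutral geometry a nominal feature should have.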


Section 03

One-Hot Encoding — The Gold Standard for Nominal Categories

One-Hot Encoding (OHE) creates one new binary column for each unique category value. Each row has exactly one 1 and all other values are 0 — indicating which category this row belongs to. No ordering is implied. No false mathematical distance is created. It is the correct default for any nominal (unordered) categorical feature going into a linear model, neural network, or SVM.
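As an illustration of what the transformation does, here is a from-scratch sketch in plain Python. The function `one_hot_encode` is a hypothetical helper written for this example; real projects would normally use pandas or scikit-learn.

```python
def one_hot_encode(values):
    """Return (categories, rows); each row is a list of 0/1 flags."""
    categories = sorted(set(values))
    position = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[position[v]] = 1       # exactly one flag is set per row
        rows.append(row)
    return categories, rows


cities = ["Mumbai", "Delhi", "Bengaluru", "Mumbai", "Chennai", "Pune"]
categories, rows = one_hot_encode(cities)

print(categories)
for city, row in zip(cities, rows):
    print(f"{city:<10} {row}")
```

Each output column answers a yes/no question ("is this row Mumbai?"), so a model can learn a separate weight per city.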

The Retail Recommendation System That Finally Got City Right
After fixing the label encoding mistake in their credit model, the same NBFC team rebuilt a product recommendation system. They had 28 Indian cities in their dataset. With Label Encoding, city had been a single column with values 0–27. With One-Hot Encoding, it became 28 binary columns. Each city was now its own independent feature — the model learned separate purchase-pattern weights for Mumbai, Delhi, and Bengaluru independently rather than trying to fit a single coefficient to a number line. The recommendation click-through rate improved by 34% in A/B testing. The 28 new columns added computation cost — but the signal quality improvement was worth it by a wide margin.
📊 One-Hot Encoding — Before and After Visual
❌ Before — 1 column, 5 categories
id  city       amount
1   Mumbai     4500
2   Delhi      8200
3   Bengaluru  6100
4   Mumbai     3300
5   Chennai    9800
6   Pune       2200
✅ After OHE — 5 binary columns
id  Bengaluru  Chennai  Delhi  Mumbai  Pune
1   0          0        0      1       0
2   0          0        1      0       0
3   1          0        0      0       0
4   0          0        0      1       0
5   0          1        0      0       0
6   0          0        0      0       1

One-Hot Encoding — Three Ways in Python

from sklearn.preprocessing import OneHotEncoder
import pandas as pd

# ── Method 1: pd.get_dummies (quickest for notebooks) ─
df_enc = pd.get_dummies(df, columns=['city', 'gender'], dtype=int)
print(df_enc.columns.tolist())
# ['amount','city_Bengaluru','city_Chennai','city_Delhi','city_Mumbai','city_Pune','gender_Female',...]

# ── Method 2: pd.get_dummies with drop_first ──────────
# Drop one column to avoid multicollinearity (dummy variable trap)
df_enc = pd.get_dummies(df, columns=['city'], drop_first=True, dtype=int)
# 5 cities → 4 columns (Bengaluru is the reference/baseline)

# ── Method 3: sklearn OneHotEncoder (pipeline-friendly) ─
ohe = OneHotEncoder(
    sparse_output=False,       # return dense array (scikit-learn >= 1.2; older versions use sparse=False)
    handle_unknown='ignore',   # unseen categories → all zeros (safe for test data)
    drop='first'               # drop first category; an unseen category then looks the same as the dropped baseline
)
ohe.fit(X_train[['city']])
encoded = ohe.transform(X_test[['city']])

# Get column names for the encoded matrix
feature_names = ohe.get_feature_names_out(['city'])
print(feature_names)   # ['city_Chennai','city_Delhi','city_Mumbai','city_Pune']

# ── Convert back to DataFrame ─────────────────────────
df_ohe = pd.DataFrame(encoded, columns=feature_names, index=X_test.index)
X_test  = pd.concat([X_test.drop(columns=['city']), df_ohe], axis=1)
📊 Output: One-Hot Encoding — Column Expansion by Category Count
[Bar chart comparing the number of columns before encoding with the number of columns after One-Hot Encoding]

One-Hot Encoding expands each column into as many columns as there are unique values. Low-cardinality columns (gender: 3 categories) are fine. High-cardinality columns (city: 28 categories, postcode: 6000+) can cause the curse of dimensionality — use Target or Frequency Encoding instead.
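Before encoding, it can help to estimate how wide the matrix will become. The sketch below counts unique values per categorical column on a small made-up DataFrame; the column names and the `MAX_ONE_HOT` constant are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["Male", "Female", "Other", "Female"],
    "city": ["Mumbai", "Delhi", "Pune", "Chennai"],
    "amount": [4500, 8200, 6100, 3300],
})
cat_cols = ["gender", "city"]
MAX_ONE_HOT = 15  # rule-of-thumb threshold from this guide, not a hard limit

n_unique = df[cat_cols].nunique()

# Each categorical column is replaced by n_unique binary columns,
# so the net growth is (n_unique - 1) per column
extra_cols = int((n_unique - 1).sum())
too_wide = n_unique[n_unique > MAX_ONE_HOT].index.tolist()

print(n_unique.to_dict())
print("net new columns:", extra_cols)
print("consider target/frequency encoding for:", too_wide)
```

Running the same check on a real dataset gives a quick list of columns that are candidates for Target or Frequency Encoding instead.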

⚠️
The Dummy Variable Trap

When you One-Hot Encode a column with N categories, you get N binary columns. But only N−1 are needed, because if all other columns are 0, the remaining category is implied. In a model with an intercept, keeping all N columns creates perfect multicollinearity (the N columns always sum to 1, which duplicates the intercept), and this destabilises linear model coefficient estimation. Use drop='first' in sklearn or drop_first=True in pandas to drop one reference column. Tree-based models are not affected and can keep all N columns.
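The linear dependence behind the trap can be checked directly. This sketch, in plain Python, shows that the full set of dummy columns always sums to the intercept column, and that dropping one reference column breaks that relationship.

```python
cities = ["Mumbai", "Delhi", "Bengaluru", "Mumbai", "Chennai", "Pune"]
categories = sorted(set(cities))

# Full one-hot: N columns per row
full = [[1 if c == cat else 0 for cat in categories] for c in cities]
# Reduced one-hot: drop the first category (the reference level)
reduced = [row[1:] for row in full]

intercept = [1] * len(cities)   # the constant column a linear model adds

# With all N columns, the dummies sum to the intercept in every row,
# so one column is an exact linear combination of the others
full_sums = [sum(row) for row in full]
print(full_sums == intercept)   # True

# After dropping one column the row sums are no longer constant:
# rows belonging to the reference city sum to 0
reduced_sums = [sum(row) for row in reduced]
print(reduced_sums)
```

Because the reference city is represented by an all-zero row, its effect is absorbed into the intercept and the remaining coefficients are read relative to it.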


Section 04

Ordinal Encoding — When Order Genuinely Exists

Ordinal Encoding is the correct choice when your categorical variable has a natural, meaningful order that you want the model to learn from. Unlike Label Encoding, you specify the order explicitly — so the integers assigned reflect actual rank, not alphabetical accident.

from sklearn.preprocessing import OrdinalEncoder

# ── Define the explicit order ─────────────────────────
income_order    = [['Low', 'Medium', 'High', 'Very High']]
education_order = [['High School', 'Bachelor', 'Master', 'PhD']]
size_order      = [['XS', 'S', 'M', 'L', 'XL', 'XXL']]

# ── Fit OrdinalEncoder with specified order ───────────
oe = OrdinalEncoder(
    categories=income_order,
    handle_unknown='use_encoded_value',
    unknown_value=-1          # unseen categories → -1
)
df['income_encoded'] = oe.fit_transform(df[['income_bracket']]).ravel()   # flatten (n, 1) → (n,)
# Low→0  Medium→1  High→2  Very High→3

# ── Verify the mapping ────────────────────────────────
for code, cat in enumerate(oe.categories_[0]):
    print(f"  {cat:12} → {code}")

# ── Pandas equivalent (simpler for notebooks) ─────────
income_map = {'Low':0, 'Medium':1, 'High':2, 'Very High':3}
df['income_encoded'] = df['income_bracket'].map(income_map)
❌ Before — Unencoded ordinal
customer  income_bracket  education
C001      Low             Bachelor
C002      Very High       PhD
C003      Medium          Master
C004      High            High School
C005      Low             Bachelor
✅ After OrdinalEncoder
customer  income_enc   edu_enc
C001      0 (Low)      1 (Bach.)
C002      3 (V.High)   3 (PhD)
C003      1 (Med.)     2 (Master)
C004      2 (High)     0 (HS)
C005      0 (Low)      1 (Bach.)
📊 Ordinal Encoding vs Label Encoding — Order Matters

OrdinalEncoder assigns integers that match the true rank — Low=0, Medium=1, High=2, Very High=3. LabelEncoder would assign alphabetically — High=0, Low=1, Medium=2, Very High=3 — which places "High" below "Low". The model would learn the wrong relationship.
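The contrast described above can be reproduced with a few lines of scikit-learn. This is a minimal sketch on the four income levels only; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

order = ["Low", "Medium", "High", "Very High"]
levels = np.array(order, dtype=object)

# LabelEncoder sorts the labels, so the codes follow alphabetical order
alphabetical = LabelEncoder().fit_transform(levels)

# OrdinalEncoder with an explicit category list follows the stated rank
ranked = (
    OrdinalEncoder(categories=[order])
    .fit_transform(levels.reshape(-1, 1))
    .ravel()
)

for level, a, r in zip(levels, alphabetical, ranked):
    print(f"{level:<10} alphabetical={a}  ranked={int(r)}")
```

Only the second column of codes increases with income, which is the property a linear model relies on when it fits one coefficient to the feature.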


Section 05

Target Encoding — For High-Cardinality Categories

When a categorical column has many unique values — 28 cities, 500 product SKUs, 10,000 user IDs — One-Hot Encoding creates a sparse, high-dimensional matrix that causes the curse of dimensionality. Target Encoding replaces each category with the mean of the target variable for that category. It produces a single numeric column that directly encodes the relationship between category and outcome.

The E-Commerce Postcode Problem
A large Indian e-commerce platform had 8,200 unique pin codes in their delivery dataset. One-Hot Encoding would have created 8,200 new columns — an unusable explosion of dimensionality. Target Encoding solved the problem cleanly: each pin code was replaced by the mean delivery satisfaction rating for orders from that pin code. A pin code where 90% of deliveries were rated 5/5 became 4.8. A consistently problematic area where deliveries were rated 2.1 on average became 2.1. One column, fully numeric, carrying the full signal of 8,200 categories. The model's R² on delivery satisfaction improved from 0.61 to 0.79. The only engineering challenge: implementing it correctly with cross-validation to prevent leakage.
# ── Basic Target Encoding (risk: leakage on train set) ─
target_means = df.groupby('city')['purchase_amount'].mean()
df['city_target'] = df['city'].map(target_means)

# ── Proper Target Encoding with K-Fold (no leakage) ───
import numpy as np
from sklearn.model_selection import KFold

def target_encode_kfold(df, col, target, n_splits=5):
    df = df.copy()
    te_col = col + '_te'
    df[te_col] = np.nan
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    for train_idx, val_idx in kf.split(df):
        # Means come from the other folds only, never from the rows being encoded
        means = df.iloc[train_idx].groupby(col)[target].mean()
        df.iloc[val_idx, df.columns.get_loc(te_col)] = (
            df.iloc[val_idx][col].map(means).to_numpy()   # positional assignment, no index alignment
        )
    # Categories missing from a fold's training part get the global mean
    global_mean = df[target].mean()
    df[te_col] = df[te_col].fillna(global_mean)
    return df

df = target_encode_kfold(df, col='city', target='purchase_amount')

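# ── Using category_encoders library ───────────────────
# pip install category_encoders
import category_encoders as ce
te = ce.TargetEncoder(cols=['city'], smoothing=10)
te.fit(X_train[['city']], y_train)
# transform() returns a DataFrame; select the encoded column explicitly.
# Smoothing dampens overfitting on rare categories but is not out-of-fold encoding:
# the training rows below are still encoded with means that include their own targets.
X_train['city_te'] = te.transform(X_train[['city']])['city']
X_test['city_te']  = te.transform(X_test[['city']])['city']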
📊 Target Encoding — City → Mean Purchase Amount
[Bar chart: mean purchase (₹) per city, with the global mean shown as a reference line]

Each city is replaced by its mean purchase amount. Mumbai customers spend the most (₹9,200 avg), Patna the least (₹3,800 avg). The model now sees a continuous numeric signal that directly captures the city-spend relationship — far richer than a one-hot binary column.

⚠️
Target Encoding Leakage — The Most Dangerous Pitfall

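If you compute target means using the full training set and then train a model on the same data, the encoded values contain information about each row's own target, so the model is peeking at the answer. Always use K-Fold cross-validation or a holdout set to compute target means. The smoothing parameter in category_encoders shrinks the means of rare categories toward the global mean, which dampens the problem but does not remove it; the encoding of the training rows should still be computed out-of-fold. scikit-learn's own TargetEncoder (version 1.3 and later) applies this cross-fitting inside fit_transform.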
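To show what smoothing does, the sketch below implements the generic m-estimate, which blends each category mean with the global mean. This is an assumption chosen for illustration: `smoothed_target_means` is a hypothetical helper, and library encoders may use a different weighting formula.

```python
from collections import defaultdict


def smoothed_target_means(categories, targets, m=10.0):
    """m-estimate: (sum_of_targets + m * global_mean) / (count + m) per category.

    A category seen n times puts weight n / (n + m) on its own mean
    and m / (n + m) on the global mean.
    """
    global_mean = sum(targets) / len(targets)
    total = defaultdict(float)
    count = defaultdict(int)
    for cat, y in zip(categories, targets):
        total[cat] += y
        count[cat] += 1
    means = {
        cat: (total[cat] + m * global_mean) / (count[cat] + m)
        for cat in count
    }
    return means, global_mean


# Made-up data: one common city and one rare city
cities = ["Mumbai"] * 50 + ["Patna"] * 2
spend = [9000.0] * 50 + [3000.0] * 2

means, global_mean = smoothed_target_means(cities, spend, m=10.0)
print(round(global_mean, 1))
print({city: round(value, 1) for city, value in means.items()})
```

The common city stays close to its raw mean, while the rare city, backed by only two rows, is pulled strongly toward the global mean. That protects against noisy estimates, but it is a separate concern from computing the encoding out-of-fold.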


Section 06

Frequency Encoding — Safe High-Cardinality Alternative

Frequency Encoding replaces each category with the proportion of rows that belong to that category. It captures the idea that rarer categories are different from common ones, without using the target variable, so there is no target leakage. It is particularly effective for tree-based models. One caveat: two categories that occur equally often receive the same value and become indistinguishable to the model.

# ── Frequency encoding ────────────────────────────────
freq_map = df['city'].value_counts(normalize=True)  # proportion of total rows
df['city_freq'] = df['city'].map(freq_map)

# ── Count encoding (raw counts instead of proportion) ─
count_map = df['city'].value_counts()
df['city_count'] = df['city'].map(count_map)

# ── Apply to multiple high-cardinality columns ────────
high_card_cols = ['city', 'product_sku', 'agent_id']
for col in high_card_cols:
    freq = df[col].value_counts(normalize=True)
    df[col + '_freq'] = df[col].map(freq).fillna(0)   # unseen → 0

print(df[['city', 'city_freq']].drop_duplicates().sort_values('city_freq', ascending=False))
📊 Output: Frequency Encoding — City Distribution

Mumbai (18.4%) and Delhi (15.2%) are the most frequent cities — they get higher encoded values. Patna (0.8%) and Nagpur (1.1%) are rare — they get near-zero values. This signal can be useful: rare categories often have different behaviour patterns than common ones.
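In a train/test setting, the frequencies are learned from the training split and reused on new data. The following sketch uses two tiny made-up DataFrames to show that flow, including a city that never appears in training.

```python
import pandas as pd

train = pd.DataFrame({"city": ["Mumbai", "Mumbai", "Delhi", "Pune"]})
test = pd.DataFrame({"city": ["Delhi", "Patna"]})   # Patna never appears in train

# Learn the proportions from the training split only
freq_map = train["city"].value_counts(normalize=True)

train["city_freq"] = train["city"].map(freq_map)
# Categories unseen in training map to NaN; treat them as frequency 0
test["city_freq"] = test["city"].map(freq_map).fillna(0.0)

print(train)
print(test)
```

Mapping unseen categories to 0 is one reasonable convention; it tells the model "this category was never observed", which sits naturally at the rare end of the scale.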


Section 07

Encoding Methods — Side-by-Side Comparison

📊 Model Accuracy by Encoding Method — Same Dataset, Same Model
[Grouped bar chart: accuracy per encoding method, one bar each for Logistic Regression and Random Forest]

Label Encoding hurts Logistic Regression badly (false ordering signals). One-Hot Encoding is the best for Logistic Regression. Target Encoding wins for both models when implemented correctly. For Random Forest, Label Encoding works reasonably well — trees are not misled by false ordering.
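The effect on a linear model can be reproduced on synthetic data. The sketch below is not the experiment behind the chart: it generates five made-up categories whose positive rate alternates across the integer codes, a pattern that a single coefficient on one integer column cannot follow.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_cities = 5
# Synthetic assumption: the positive rate alternates across the alphabetical codes
p_positive = np.array([0.9, 0.1, 0.9, 0.1, 0.9])

codes = rng.integers(0, n_cities, size=4000)
y = (rng.random(codes.size) < p_positive[codes]).astype(int)

X_label = codes.reshape(-1, 1).astype(float)   # one integer column
X_onehot = np.eye(n_cities)[codes]             # five 0/1 columns


def holdout_accuracy(X):
    """Fit Logistic Regression on 75% of the rows and score on the rest."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=0
    )
    return LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)


acc_label = holdout_accuracy(X_label)
acc_onehot = holdout_accuracy(X_onehot)
print(f"label-encoded: {acc_label:.3f}   one-hot: {acc_onehot:.3f}")
```

With one-hot columns the model learns a separate weight per category and recovers the alternating pattern; with integer codes it can only draw a single threshold along the number line.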

Method              Output Cols  Handles High Cardinality  Leakage Risk   Best For
Label Encoding      1            Yes                       None           Binary cols, tree models only
One-Hot Encoding    N unique     No (>15 = problem)        None           Nominal, low cardinality, linear models
Ordinal Encoding    1            Yes                       None           Ordered categories: Low/Med/High
Target Encoding     1            Yes                       High (use CV)  High cardinality, strong signal needed
Frequency Encoding  1            Yes                       None           High cardinality, safe alternative to target
Binary Encoding     log₂(N)      Yes                       None           Very high cardinality, space-efficient
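Binary Encoding is the only method in the table without code elsewhere in this guide, so here is a from-scratch sketch of the idea. `binary_encode` is a hypothetical helper; the category_encoders implementation may number categories differently and therefore produce different bit patterns.

```python
from math import ceil, log2


def binary_encode(values):
    """Give each category an integer code, then spread the code over bit columns."""
    categories = sorted(set(values))
    n_bits = max(1, ceil(log2(len(categories))))
    code = {cat: i for i, cat in enumerate(categories)}
    rows = [
        [(code[v] >> bit) & 1 for bit in reversed(range(n_bits))]
        for v in values
    ]
    return categories, n_bits, rows


cities = ["Bengaluru", "Chennai", "Delhi", "Mumbai", "Pune"]
categories, n_bits, rows = binary_encode(cities)

print(n_bits)
for city, bits in zip(cities, rows):
    print(f"{city:<10} {bits}")
```

Five categories fit into three bit columns instead of five one-hot columns. The saving grows with cardinality, at the cost of columns that are shared between categories and harder to interpret.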

Section 08

Complete Encoding Pipeline with sklearn

from sklearn.pipeline      import Pipeline
from sklearn.compose       import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.impute         import SimpleImputer
from sklearn.ensemble       import RandomForestClassifier

# ── Column groups ─────────────────────────────────────
numeric_cols = ['age', 'purchase_amount', 'delivery_days']
nominal_cols = ['city', 'gender', 'product_category']   # unordered
ordinal_cols = ['income_bracket']                         # ordered

# ── Numeric pipeline ─────────────────────────────────
num_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler',  StandardScaler())
])

# ── Nominal pipeline ─────────────────────────────────
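nom_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ohe',     OneHotEncoder(
        handle_unknown='ignore',   # unseen categories → all zeros
        sparse_output=False        # dense output; requires scikit-learn >= 1.2
        # No drop='first' here: the model below is a Random Forest, which can keep all N columns.
        # Add drop='first' if you swap in a linear model.
    ))
])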

# ── Ordinal pipeline ─────────────────────────────────
ord_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ordinal', OrdinalEncoder(
        categories=[['Low', 'Medium', 'High', 'Very High']],
        handle_unknown='use_encoded_value',
        unknown_value=-1
    ))
])

# ── Combine all transformers ──────────────────────────
preprocessor = ColumnTransformer([
    ('numeric',  num_pipe, numeric_cols),
    ('nominal',  nom_pipe, nominal_cols),
    ('ordinal',  ord_pipe, ordinal_cols),
], remainder='drop')

# ── Full model pipeline ───────────────────────────────
full_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model',        RandomForestClassifier(n_estimators=200, random_state=42))
])

full_pipeline.fit(X_train, y_train)
print(f"Test accuracy: {full_pipeline.score(X_test, y_test):.3f}")

# ── Save and reload the pipeline ─────────────────────
import joblib
joblib.dump(full_pipeline, 'model_pipeline.pkl')
pipeline_loaded = joblib.load('model_pipeline.pkl')
# new_data: a DataFrame with the same raw columns as X_train
predictions = pipeline_loaded.predict(new_data)          # raw data → predictions
📊 Complete Encoding Pipeline — Architecture
[Architecture diagram: raw X_train splits into three branches. Numeric cols (age, purchase_amount, delivery_days): SimpleImputer (median) → StandardScaler → z-scores. Nominal cols (city, gender, product_category): SimpleImputer (mode) → OneHotEncoder → 0/1 columns. Ordinal cols (income_bracket): SimpleImputer (mode) → OrdinalEncoder → 0, 1, 2, 3. The branches merge into an all-numeric, model-ready feature matrix that feeds the RandomForest classifier, which predicts ŷ = 0 / 1. Calling pipeline.fit(X_train, y_train) fits every encoder on training data only, so no test information leaks into the encoders.]

The ColumnTransformer applies each encoding branch to the correct columns in parallel, then concatenates the results. Save the fitted pipeline with joblib.dump() — at inference time, raw data flows in and predictions come out.


Section 09

Golden Rules of Categorical Encoding

🎯 8 Rules Every Data Scientist Must Follow
1. Never use Label Encoding on a nominal (unordered) categorical column for linear models, SVM, KNN, or neural networks. It invents a false mathematical ordering. The only safe uses are binary columns (2 values) and tree-based models.
2. Always specify the order explicitly with OrdinalEncoder's categories parameter. Never let the encoder decide the order alphabetically; alphabetical is almost never the correct rank for your domain.
3. Use handle_unknown='ignore' in OneHotEncoder and handle_unknown='use_encoded_value' (together with an unknown_value such as -1) in OrdinalEncoder. In production, new data will sooner or later contain categories the model has never seen; without this setting, prediction will throw an error.
4. Drop one column after One-Hot Encoding for linear models using drop='first'. This prevents the dummy variable trap (perfect multicollinearity), which makes regression coefficients unidentifiable. Tree models don't need this; keep all N columns.
5. Treat roughly 15 unique values as the threshold for One-Hot Encoding. Above that, dimensionality becomes a problem. Switch to Target Encoding (with K-Fold CV) or Frequency Encoding for high-cardinality columns like city, postcode, or product SKU.
6. Never compute target encoding means using the full training set and then train on the same data. This leaks target information into the features. Use K-Fold cross-validation or an encoder that cross-fits internally, such as scikit-learn's TargetEncoder (1.3 and later). Smoothing, as in category_encoders.TargetEncoder, reduces the effect but does not replace out-of-fold encoding.
7. Always fit encoders on the training set only, then transform() the test set. Fitting on the full dataset leaks test statistics (category frequencies, target means) into the encoder's parameters, inflating your validation metrics.
8. Wrap all encoding steps inside an sklearn Pipeline with ColumnTransformer. Hand-applying encoders step by step is error-prone and inconsistent. A saved pipeline guarantees that the same transformations applied at training are applied identically at inference time.
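Rule 3 is easy to verify on a toy example. The sketch below fits both encoders on three known values and then transforms data containing a value that was never seen during fitting; the arrays are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

train_city = np.array([["Mumbai"], ["Delhi"], ["Pune"]], dtype=object)
new_city = np.array([["Delhi"], ["Patna"]], dtype=object)   # Patna is unseen

ohe = OneHotEncoder(handle_unknown="ignore").fit(train_city)
ohe_rows = ohe.transform(new_city).toarray()   # default output is sparse
print(ohe_rows)   # the Patna row is all zeros

train_income = np.array([["Low"], ["Medium"], ["High"]], dtype=object)
new_income = np.array([["High"], ["Unknown"]], dtype=object)

oe = OrdinalEncoder(
    categories=[["Low", "Medium", "High"]],
    handle_unknown="use_encoded_value",
    unknown_value=-1,
)
oe_rows = oe.fit(train_income).transform(new_income)
print(oe_rows)    # the unseen value is encoded as -1
```

With the default handle_unknown='error', both transform calls would raise instead, which is usually not what a production scoring service should do on a single unexpected value.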
🧮
Key Takeaway

Encoding is not a mechanical translation step — it is a modelling decision. The encoding you choose determines what mathematical relationships the model is allowed to learn from each categorical feature. The NBFC team that switched from Label Encoding to One-Hot Encoding did not write a better model — they gave their existing model better information. Understanding the mathematics behind each encoding method is what separates data scientists who get good results from those who get great ones.