Why Machines Cannot Read Text Labels
Every machine learning algorithm is fundamentally a mathematical function — it multiplies, adds, takes derivatives, and computes distances. Numbers are the only currency it understands. When your dataset contains columns like "city", "gender", "product_category", or "education_level", you are holding data in a format that a model cannot process directly. Encoding categorical variables is the process of translating human-readable text labels into numeric representations that preserve the right information — and crucially, that do not introduce false mathematical relationships that don't actually exist.
Using Label Encoding on a nominal (unordered) categorical column for a linear or distance-based model is the most common encoding mistake in data science. It silently introduces a false mathematical ordering — the model learns it as a real signal and produces biased predictions. Always match the encoding method to the nature of the categorical variable.
Match the encoding to the feature type: nominal (no order) → One-Hot. Ordinal (has order) → Ordinal. Binary → Label. High cardinality (>15) → Target or Frequency. The choice is not aesthetic — it is mathematical.
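The rule of thumb above can be sketched as a small decision helper. This is only an illustration: `suggest_encoding`, its `ordered` flag and the `max_onehot` threshold are hypothetical names introduced here, and the threshold of 15 is taken from the guideline above rather than from any library.

```python
import pandas as pd


def suggest_encoding(column: pd.Series, ordered: bool = False, max_onehot: int = 15) -> str:
    """Suggest an encoding for one categorical column (hypothetical helper)."""
    n_unique = column.nunique(dropna=True)
    if ordered:                      # the caller knows the categories have a real rank
        return "ordinal"
    if n_unique <= 2:                # binary column: a single 0/1 column is enough
        return "label"
    if n_unique <= max_onehot:       # nominal with few categories
        return "one-hot"
    return "target or frequency"     # nominal with many categories


print(suggest_encoding(pd.Series(["Yes", "No", "Yes"])))
print(suggest_encoding(pd.Series(["Low", "High", "Medium"]), ordered=True))
print(suggest_encoding(pd.Series(["Mumbai", "Delhi", "Pune", "Chennai"])))
```

Whether a column is ordered cannot be inferred from the data, which is why the sketch takes it as an explicit argument.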
Label Encoding — Integers for Categories
Label Encoding assigns each unique category value an integer, starting from 0. It is the simplest encoding method and the most frequently misused one. The integers are assigned in alphabetical order by default, which means the model sees a mathematical ranking that has no relationship to the actual data.
Label Encoding — Code
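import pandas as pd
from sklearn.preprocessing import LabelEncoder
# Assumes df is a DataFrame with 'gender', 'is_returned' and 'has_discount' columns
# ── Simple LabelEncoder ───────────────────────────────
# Note: scikit-learn documents LabelEncoder as an encoder for target labels (y);
# it handles one column at a time
le = LabelEncoder()
df['gender_enc'] = le.fit_transform(df['gender'])
# See the mapping (classes_ is sorted, so position == assigned integer)
mapping = dict(zip(le.classes_.tolist(), range(len(le.classes_))))
print(mapping) # {'Female': 0, 'Male': 1, 'Other': 2}
# ── Pandas map (explicit, always safer) ──────────────
gender_map = {'Female': 0, 'Male': 1, 'Other': 2}
df['gender_enc'] = df['gender'].map(gender_map)
# ── Encode multiple columns at once ──────────────────
cols_to_encode = ['is_returned', 'gender', 'has_discount']
encoders = {}  # keep one fitted encoder per column for later inverse transforms
for col in cols_to_encode:
    encoders[col] = LabelEncoder()
    df[col + '_enc'] = encoders[col].fit_transform(df[col].astype('str'))
# ── Inverse transform: numbers back to labels ─────────
# Use the encoder fitted on 'gender', not the last one created in the loop
df['gender_original'] = encoders['gender'].inverse_transform(df['gender_enc'])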
| customer_id | gender | is_returned |
|---|---|---|
| 1001 | Male | Yes |
| 1002 | Female | No |
| 1003 | Other | Yes |
| 1004 | Male | No |
| 1005 | Female | Yes |
| customer_id | gender_enc | is_returned_enc |
|---|---|---|
| 1001 | 1 | 1 |
| 1002 | 0 | 0 |
| 1003 | 2 | 1 |
| 1004 | 1 | 0 |
| 1005 | 0 | 1 |
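LabelEncoder assigns integers in sorted (alphabetical) order. Applied to a city column, that gives Bengaluru=0, Chennai=1, Delhi=2, Mumbai=3, Pune=4. A linear or distance-based model interprets these integers as positions on a number line, implying that Pune is further from Bengaluru than Chennai is. There is no geographic or business logic behind this ordering.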
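A minimal sketch of that false ordering, using a made-up list of five cities. The gap between two codes depends only on where the names fall alphabetically.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative city list (not taken from a real dataset)
cities = ["Mumbai", "Delhi", "Bengaluru", "Chennai", "Pune"]

le = LabelEncoder().fit(cities)
# classes_ is sorted, so each class's position is its integer code
codes = dict(zip(le.classes_.tolist(), le.transform(le.classes_).tolist()))
print(codes)  # {'Bengaluru': 0, 'Chennai': 1, 'Delhi': 2, 'Mumbai': 3, 'Pune': 4}

# "Distance" as a linear or distance-based model would see it
gap_chennai = abs(codes["Chennai"] - codes["Bengaluru"])
gap_pune = abs(codes["Pune"] - codes["Bengaluru"])
print(gap_chennai, gap_pune)  # 1 4
```

Renaming a city would change these gaps, which is a sign that they carry no real information.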
One-Hot Encoding — The Gold Standard for Nominal Categories
One-Hot Encoding (OHE) creates one new binary column for each unique category value. Each row has exactly one 1 and all other values are 0 — indicating which category this row belongs to. No ordering is implied. No false mathematical distance is created. It is the correct default for any nominal (unordered) categorical feature going into a linear model, neural network, or SVM.
| id | city | amount |
|---|---|---|
| 1 | Mumbai | 4500 |
| 2 | Delhi | 8200 |
| 3 | Bengaluru | 6100 |
| 4 | Mumbai | 3300 |
| 5 | Chennai | 9800 |
| 6 | Pune | 2200 |
| id | Bengaluru | Chennai | Delhi | Mumbai | Pune |
|---|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 1 | 0 |
| 2 | 0 | 0 | 1 | 0 | 0 |
| 3 | 1 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 1 | 0 |
| 5 | 0 | 1 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 | 0 | 1 |
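The claim that no false distance is created can be checked directly. The sketch below rebuilds the table above with `pd.get_dummies` and measures the Euclidean distance between every pair of cities; the variable names are illustrative.

```python
from itertools import combinations

import numpy as np
import pandas as pd

cities = pd.Series(["Mumbai", "Delhi", "Bengaluru", "Mumbai", "Chennai", "Pune"], name="city")
onehot = pd.get_dummies(cities, dtype=int)  # one column per city, sorted alphabetically

# One representative one-hot vector per city
vectors = {c: onehot.loc[cities == c].iloc[0].to_numpy() for c in onehot.columns}

# Euclidean distance between every pair of distinct cities
distances = {
    (a, b): float(np.linalg.norm(vectors[a] - vectors[b]))
    for a, b in combinations(vectors, 2)
}
print(sorted(set(round(d, 4) for d in distances.values())))  # [1.4142]
```

Every pair of cities is exactly √2 apart, so no city is "closer" to another than any other.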
One-Hot Encoding — Three Ways in Python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
# Assumes df, X_train and X_test are DataFrames that contain a 'city' column
# ── Method 1: pd.get_dummies (quickest for notebooks) ─
df_enc = pd.get_dummies(df, columns=['city', 'gender'], dtype=int)
print(df_enc.columns.tolist())
# ['amount','city_Bengaluru','city_Chennai','city_Delhi','city_Mumbai','city_Pune','gender_Female',...]
# ── Method 2: pd.get_dummies with drop_first ──────────
# Drop one column to avoid multicollinearity (dummy variable trap)
df_enc = pd.get_dummies(df, columns=['city'], drop_first=True, dtype=int)
# 5 cities → 4 columns (Bengaluru is the reference/baseline)
# ── Method 3: sklearn OneHotEncoder (pipeline-friendly) ─
# sparse_output requires scikit-learn >= 1.2 (older releases call it sparse)
ohe = OneHotEncoder(
    sparse_output=False,      # return dense array (not sparse matrix)
    handle_unknown='ignore',  # unseen categories → all zeros
    drop='first'              # drop first category to avoid multicollinearity
)
# Caveat: with drop='first', an all-zero row also means the reference category
# (Bengaluru), so an unseen city is indistinguishable from it
ohe.fit(X_train[['city']])    # fit on training data only
encoded = ohe.transform(X_test[['city']])
# Get column names for the encoded matrix
feature_names = ohe.get_feature_names_out(['city'])
print(feature_names) # ['city_Chennai','city_Delhi','city_Mumbai','city_Pune']
# ── Convert back to DataFrame ─────────────────────────
df_ohe = pd.DataFrame(encoded, columns=feature_names, index=X_test.index)
X_test = pd.concat([X_test.drop(columns=['city']), df_ohe], axis=1)
One-Hot Encoding expands each column into as many columns as there are unique values. Low-cardinality columns (gender: 3 categories) are fine. High-cardinality columns (city: 28 categories, postcode: 6000+) can cause the curse of dimensionality — use Target or Frequency Encoding instead.
When you One-Hot Encode a column with N categories, you get N binary columns. But only N−1 are needed — because if all other columns are 0, the remaining category is implied. Keeping all N columns creates perfect multicollinearity, which destabilises linear model coefficient estimation. Use drop='first' in sklearn or drop_first=True in pandas to drop one reference column. Tree-based models are not affected and can keep all N columns.
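The dummy variable trap shows up as a rank deficiency in the design matrix. The sketch below, on a toy city column, compares the rank of "intercept plus all N dummies" with "intercept plus N−1 dummies"; the data and variable names are illustrative.

```python
import numpy as np
import pandas as pd

city = pd.Series(["Mumbai", "Delhi", "Bengaluru", "Mumbai", "Chennai", "Pune"])
intercept = np.ones((len(city), 1))

full = pd.get_dummies(city, dtype=float).to_numpy()                      # 5 dummy columns
reduced = pd.get_dummies(city, drop_first=True, dtype=float).to_numpy()  # 4 dummy columns

X_full = np.hstack([intercept, full])        # 6 columns in total
X_reduced = np.hstack([intercept, reduced])  # 5 columns in total

rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)
print(X_full.shape[1], rank_full)        # 6 5  -> one column is redundant
print(X_reduced.shape[1], rank_reduced)  # 5 5  -> full column rank
```

With all N dummies the columns sum to the intercept, so an unregularised linear model has no unique solution for its coefficients. Dropping one column removes the redundancy.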
Ordinal Encoding — When Order Genuinely Exists
Ordinal Encoding is the correct choice when your categorical variable has a natural, meaningful order that you want the model to learn from. Unlike Label Encoding, you specify the order explicitly — so the integers assigned reflect actual rank, not alphabetical accident.
from sklearn.preprocessing import OrdinalEncoder
# Assumes df is a DataFrame with an 'income_bracket' column
# ── Define the explicit order (one inner list per encoded column) ──
income_order = [['Low', 'Medium', 'High', 'Very High']]
education_order = [['High School', 'Bachelor', 'Master', 'PhD']]
size_order = [['XS', 'S', 'M', 'L', 'XL', 'XXL']]
# ── Fit OrdinalEncoder with specified order ───────────
oe = OrdinalEncoder(
    categories=income_order,
    handle_unknown='use_encoded_value',
    unknown_value=-1  # unseen categories → -1
)
# fit_transform returns a 2-D array; flatten it to fill a single column
df['income_encoded'] = oe.fit_transform(df[['income_bracket']]).ravel()
# Low→0 Medium→1 High→2 Very High→3
# ── Verify the mapping ────────────────────────────────
for code, cat in enumerate(oe.categories_[0]):
    print(f" {cat:12} → {code}")
# ── Pandas equivalent (simpler for notebooks) ─────────
income_map = {'Low': 0, 'Medium': 1, 'High': 2, 'Very High': 3}
df['income_encoded'] = df['income_bracket'].map(income_map)
| customer | income_bracket | education |
|---|---|---|
| C001 | Low | Bachelor |
| C002 | Very High | PhD |
| C003 | Medium | Master |
| C004 | High | High School |
| C005 | Low | Bachelor |
| customer | income_enc | edu_enc |
|---|---|---|
| C001 | 0 (Low) | 1 (Bach.) |
| C002 | 3 (V.High) | 3 (PhD) |
| C003 | 1 (Med.) | 2 (Master) |
| C004 | 2 (High) | 0 (HS) |
| C005 | 0 (Low) | 1 (Bach.) |
OrdinalEncoder assigns integers that match the true rank — Low=0, Medium=1, High=2, Very High=3. LabelEncoder would assign alphabetically — High=0, Low=1, Medium=2, Very High=3 — which places "High" below "Low". The model would learn the wrong relationship.
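The difference is easy to reproduce on a handful of rows. This sketch encodes the same income column both ways; the sample values are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

df = pd.DataFrame({"income_bracket": ["Low", "Very High", "Medium", "High", "Low"]})

# LabelEncoder: integers follow sorted (alphabetical) order of the labels
alphabetical = LabelEncoder().fit_transform(df["income_bracket"]).tolist()

# OrdinalEncoder: integers follow the rank we pass in explicitly
oe = OrdinalEncoder(categories=[["Low", "Medium", "High", "Very High"]])
ranked = oe.fit_transform(df[["income_bracket"]]).ravel().astype(int).tolist()

print(alphabetical)  # [1, 3, 2, 0, 1]  -> High (0) sits below Low (1)
print(ranked)        # [0, 3, 1, 2, 0]  -> matches the real order
```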
Target Encoding — For High-Cardinality Categories
When a categorical column has many unique values — 28 cities, 500 product SKUs, 10,000 user IDs — One-Hot Encoding creates a sparse, high-dimensional matrix that causes the curse of dimensionality. Target Encoding replaces each category with the mean of the target variable for that category. It produces a single numeric column that directly encodes the relationship between category and outcome.
import numpy as np
import pandas as pd
# Assumes df has 'city' and 'purchase_amount' columns
# ── Basic Target Encoding (risk: leakage on train set) ─
target_means = df.groupby('city')['purchase_amount'].mean()
df['city_target'] = df['city'].map(target_means)
# ── Proper Target Encoding with K-Fold (no leakage) ───
from sklearn.model_selection import KFold
def target_encode_kfold(df, col, target, n_splits=5):
    df = df.copy()
    df[col + '_te'] = np.nan
    te_pos = df.columns.get_loc(col + '_te')
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    for train_idx, val_idx in kf.split(df):
        # Means come from the other folds only, never from the rows being encoded
        means = df.iloc[train_idx].groupby(col)[target].mean()
        # to_numpy() makes the assignment purely positional
        df.iloc[val_idx, te_pos] = df.iloc[val_idx][col].map(means).to_numpy()
    # Fill any unseen categories with global mean
    global_mean = df[target].mean()
    df[col + '_te'] = df[col + '_te'].fillna(global_mean)
    return df
df = target_encode_kfold(df, col='city', target='purchase_amount')
# ── Using category_encoders library ───────────────────
# pip install category_encoders
import category_encoders as ce
te = ce.TargetEncoder(cols=['city'], smoothing=10)
te.fit(X_train[['city']], y_train)  # learn the means from training data only
# transform returns a DataFrame; pick the encoded 'city' column
X_train['city_te'] = te.transform(X_train[['city']])['city']
X_test['city_te'] = te.transform(X_test[['city']])['city']
Each city is replaced by its mean purchase amount. Mumbai customers spend the most (₹9,200 avg), Patna the least (₹3,800 avg). The model now sees a continuous numeric signal that directly captures the city-spend relationship — far richer than a one-hot binary column.
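If you compute target means on the full training set and then train a model on the same rows, each encoded value already contains that row's own target, so the model is peeking at the answer. Compute the means out-of-fold with K-Fold cross-validation, or on a separate holdout set, and encode the test set with means learned from the training data only. The smoothing parameter in category_encoders helps in a different way: it pulls the estimate for rare categories toward the global mean, which reduces overfitting but does not by itself make the training-set encoding out-of-fold.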
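To show what smoothing does, here is a minimal sketch of one common formulation, additive (m-estimate) smoothing. It is an assumption for illustration only: category_encoders may use a different formula internally, and `m` is a hypothetical smoothing strength.

```python
import pandas as pd

# Toy training data: Mumbai has four rows, Patna only one
train = pd.DataFrame({
    "city": ["Mumbai", "Mumbai", "Mumbai", "Mumbai", "Patna"],
    "purchase_amount": [9000.0, 9400.0, 9100.0, 9300.0, 2000.0],
})
m = 10  # hypothetical smoothing strength: larger m pulls harder toward the global mean

global_mean = train["purchase_amount"].mean()
stats = train.groupby("city")["purchase_amount"].agg(["mean", "count"])

# Weighted blend of the category mean and the global mean
smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)

print(stats["mean"].round(1).to_dict())
print(smoothed.round(1).to_dict())
```

Patna's single observation moves most of the way toward the global mean, while Mumbai, with more rows, keeps more of its own average.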
Frequency Encoding — Safe High-Cardinality Alternative
Frequency Encoding replaces each category with the proportion of rows that belong to that category. It captures the idea that rarer categories are different from common ones, without using the target variable — so there is no leakage risk. It is particularly effective for tree-based models.
# ── Frequency encoding ────────────────────────────────
freq_map = df['city'].value_counts(normalize=True) # proportion of total rows
df['city_freq'] = df['city'].map(freq_map)
# ── Count encoding (raw counts instead of proportion) ─
count_map = df['city'].value_counts()
df['city_count'] = df['city'].map(count_map)
# ── Apply to multiple high-cardinality columns ────────
high_card_cols = ['city', 'product_sku', 'agent_id']
for col in high_card_cols:
    freq = df[col].value_counts(normalize=True)
    # fillna(0) matters when freq is learned on training data and mapped onto new rows
    df[col + '_freq'] = df[col].map(freq).fillna(0) # unseen → 0
print(df[['city', 'city_freq']].drop_duplicates().sort_values('city_freq', ascending=False))
Mumbai (18.4%) and Delhi (15.2%) are the most frequent cities — they get higher encoded values. Patna (0.8%) and Nagpur (1.1%) are rare — they get near-zero values. This signal can be useful: rare categories often have different behaviour patterns than common ones.
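In a train/test setting the frequencies would normally be learned on the training rows and then applied to new data. A minimal sketch with made-up frames follows; assigning 0 to unseen categories is one possible convention, not the only one.

```python
import pandas as pd

train = pd.DataFrame({"city": ["Mumbai", "Mumbai", "Delhi", "Patna"]})
test = pd.DataFrame({"city": ["Delhi", "Nagpur"]})  # Nagpur never appears in train

# Learn the proportions from the training data only
freq_map = train["city"].value_counts(normalize=True)

train["city_freq"] = train["city"].map(freq_map)
test["city_freq"] = test["city"].map(freq_map).fillna(0.0)  # unseen category → 0

print(train["city_freq"].tolist())  # [0.5, 0.5, 0.25, 0.25]
print(test["city_freq"].tolist())   # [0.25, 0.0]
```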
Encoding Methods — Side-by-Side Comparison
Label Encoding hurts Logistic Regression badly (false ordering signals). One-Hot Encoding is the best for Logistic Regression. Target Encoding wins for both models when implemented correctly. For Random Forest, Label Encoding works reasonably well — trees are not misled by false ordering.
| Method | Output Cols | Handles High Cardinality | Leakage Risk | Best For |
|---|---|---|---|---|
| Label Encoding | 1 | Yes | None | Binary cols, tree models only |
| One-Hot Encoding | N unique | No (>15 = problem) | None | Nominal, low cardinality, linear models |
| Ordinal Encoding | 1 | Yes | None | Ordered categories: Low/Med/High |
| Target Encoding | 1 | Yes | High — use CV | High cardinality, strong signal needed |
| Frequency Encoding | 1 | Yes | None | High cardinality, safe alternative to target |
| Binary Encoding | log₂(N) | Yes | None | Very high cardinality, space-efficient |
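Binary Encoding appears in the table but is not shown elsewhere, so here is a hand-rolled sketch of the idea: give each category an integer, then spread that integer's binary digits over about log₂(N) columns. `binary_encode` is a hypothetical helper, and library implementations may number categories or name columns differently.

```python
import numpy as np
import pandas as pd


def binary_encode(series: pd.Series) -> pd.DataFrame:
    """Encode a categorical column as the binary digits of its integer code (sketch)."""
    codes, uniques = pd.factorize(series, sort=True)        # categories → 0..N-1
    n_bits = max(1, int(np.ceil(np.log2(len(uniques)))))    # columns needed for N codes
    shifts = np.arange(n_bits - 1, -1, -1)                  # most significant bit first
    bits = (codes[:, None] >> shifts) & 1
    cols = [f"{series.name}_bit{i}" for i in range(n_bits)]
    return pd.DataFrame(bits, columns=cols, index=series.index)


city = pd.Series(["Mumbai", "Delhi", "Bengaluru", "Chennai", "Pune"], name="city")
encoded = binary_encode(city)
print(encoded.shape)  # (5, 3): five cities fit in three binary columns
```

Unlike One-Hot Encoding, categories share columns here, so the representation is compact but the individual columns are harder to interpret.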
Complete Encoding Pipeline with sklearn
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
# ── Column groups ─────────────────────────────────────
numeric_cols = ['age', 'purchase_amount', 'delivery_days']
nominal_cols = ['city', 'gender', 'product_category'] # unordered
ordinal_cols = ['income_bracket'] # ordered
# ── Numeric pipeline ─────────────────────────────────
num_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
# ── Nominal pipeline ─────────────────────────────────
nom_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ohe', OneHotEncoder(
        handle_unknown='ignore',
        sparse_output=False,
        drop='first'
    ))
])
# ── Ordinal pipeline ─────────────────────────────────
ord_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ordinal', OrdinalEncoder(
        categories=[['Low', 'Medium', 'High', 'Very High']],
        handle_unknown='use_encoded_value',
        unknown_value=-1
    ))
])
# ── Combine all transformers ──────────────────────────
preprocessor = ColumnTransformer([
    ('numeric', num_pipe, numeric_cols),
    ('nominal', nom_pipe, nominal_cols),
    ('ordinal', ord_pipe, ordinal_cols),
], remainder='drop')
# ── Full model pipeline ───────────────────────────────
full_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', RandomForestClassifier(n_estimators=200, random_state=42))
])
full_pipeline.fit(X_train, y_train)
print(f"Test accuracy: {full_pipeline.score(X_test, y_test):.3f}")
# ── Save and reload the pipeline ─────────────────────
import joblib
joblib.dump(full_pipeline, 'model_pipeline.pkl')
pipeline_loaded = joblib.load('model_pipeline.pkl')
predictions = pipeline_loaded.predict(new_data) # raw data → predictions
The ColumnTransformer applies each encoding branch to its own set of columns, independently of the others, then concatenates the results. Save the fitted pipeline with joblib.dump(); at inference time, raw data flows in and predictions come out.
Golden Rules of Categorical Encoding
Encoding is not a mechanical translation step — it is a modelling decision. The encoding you choose determines what mathematical relationships the model is allowed to learn from each categorical feature. The NBFC team that switched from Label Encoding to One-Hot Encoding did not write a better model — they gave their existing model better information. Understanding the mathematics behind each encoding method is what separates data scientists who get good results from those who get great ones.