Data Transformation in Python

Section 01

Why Data Transformation Is Essential

Raw data arrives in the units humans find convenient: age in years, salary in rupees, height in centimetres, distance in kilometres. But machine learning algorithms do not care about human convenience. They care about mathematics. When one feature ranges from 0 to 1 and another ranges from 10,000 to 10,000,000, the large-scale feature dominates the distance calculations in algorithms like KNN and SVM, and the small one is effectively ignored. Data transformation is the process of converting raw feature values into a scale that algorithms can learn from fairly and efficiently.

📖 Real-World Story

The Model That Only Learned From Salary

A fintech startup in Bengaluru built a loan default prediction model using KNN (K-Nearest Neighbours). Their features were: age (18–65), years of employment (0–40), number of dependents (0–8), and annual salary (₹200,000–₹5,000,000). They skipped data transformation and fed raw values directly into the model. The model achieved 71% accuracy in testing, but when the team visualised which features were driving the predictions, salary accounted for 98% of the distance between data points. Age, employment years, and dependents had virtually zero influence on the nearest-neighbour search because their ranges were tiny compared to the salary column. After applying StandardScaler to all four features, accuracy jumped to 88% and the model began using all four features rather than salary alone. The transformation took 4 lines of code. The accuracy gain took four months to find.
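The effect in this story can be reproduced with a few lines of NumPy. The sketch below is illustrative only: the two applicants and the per-feature means and standard deviations are made-up values, not data from the startup. It measures what fraction of the squared Euclidean distance comes from each feature, before and after standardisation.

```python
import numpy as np

# Two hypothetical applicants: [age, years_employed, dependents, annual_salary]
a = np.array([25.0, 2.0, 0.0, 600_000.0])
b = np.array([55.0, 30.0, 4.0, 650_000.0])


def squared_distance_share(u, v):
    """Fraction of the squared Euclidean distance contributed by each feature."""
    sq = (u - v) ** 2
    return sq / sq.sum()


raw_share = squared_distance_share(a, b)

# Illustrative (assumed) per-feature means and standard deviations
mean = np.array([40.0, 15.0, 2.0, 1_500_000.0])
std = np.array([12.0, 10.0, 2.0, 1_000_000.0])
scaled_share = squared_distance_share((a - mean) / std, (b - mean) / std)

print("raw share   :", np.round(raw_share, 4))
print("scaled share:", np.round(scaled_share, 4))
```

On raw values the two applicants look almost identical to KNN because their salaries are close, even though they differ by 30 years of age and 28 years of employment. After standardisation those differences carry most of the distance.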

⚠️

Which Algorithms Require Transformation?

Distance-based algorithms (KNN, SVM, K-Means), gradient-based algorithms (Linear Regression, Logistic Regression, Neural Networks), and regularised models (Ridge, Lasso, ElasticNet) are all sensitive to feature scale. Tree-based algorithms (Decision Trees, Random Forest, XGBoost) are insensitive to it: each split compares a single feature against a threshold, so only the ordering of values matters, not their magnitude. Always scale before distance or gradient methods. You can skip it for trees.
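Why trees ignore scale can be checked directly. The sketch below is a simplified stand-in for a single decision-tree split (a hand-written Gini search, not scikit-learn's implementation), run on a toy salary column with an assumed label. It shows that the best split puts exactly the same rows on each side whether the feature is raw or standardised.

```python
import numpy as np


def best_split_mask(x, y):
    """Return which rows go left under the Gini-optimal threshold split on x."""
    order = np.argsort(x, kind="stable")
    xs, ys = x[order], y[order]
    best_gini, best_k = np.inf, None
    for k in range(1, len(xs)):
        if xs[k] == xs[k - 1]:
            continue  # no threshold can separate equal values
        gini = 0.0
        for part in (ys[:k], ys[k:]):
            p = part.mean()  # share of class 1 in this branch
            gini += len(part) / len(ys) * 2 * p * (1 - p)
        if gini < best_gini:
            best_gini, best_k = gini, k
    threshold = (xs[best_k - 1] + xs[best_k]) / 2
    return x <= threshold


rng = np.random.default_rng(0)
salary = rng.uniform(200_000, 5_000_000, size=50)
default = (salary < 1_000_000).astype(int)  # toy label, assumed for illustration

salary_scaled = (salary - salary.mean()) / salary.std()

same_partition = np.array_equal(
    best_split_mask(salary, default),
    best_split_mask(salary_scaled, default),
)
print("Same rows on each side of the split:", same_partition)
```

The impurity calculation only ever looks at the labels in sorted order, and standardisation does not change that order, so the chosen partition is unchanged.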

🗺️ Data Transformation Pipeline

Every feature must pass through this pipeline before entering a model. Numeric features need scaling. Skewed distributions need reshaping. Categorical features need encoding. Optional feature engineering can create new signals.

Section 02

Scaling Methods: The Core Six

Six transformations cover most situations in data science. Each transforms the feature values into a new range or distribution, and each is the correct answer for a different scenario.

📏

Min-Max Normalisation

x' = (x − min) / (max − min)

Scales to [0, 1]. Preserves the shape of the original distribution. Sensitive to outliers — one extreme value compresses everything else.

📐

Z-Score Standardisation

x' = (x − μ) / σ

Mean=0, Std=1. Most widely used. Works best for roughly normally distributed features. Still affected by extreme outliers.

🛡️

Robust Scaling

x' = (x − Q2) / IQR

Uses median and IQR. Outlier-resistant. Best choice when your data has legitimate extreme values that you don't want to remove.

📉

Log Transformation

x' = log(1 + x)

Compresses right-skewed distributions. Ideal for income, prices, counts. Plain log(x) is undefined at zero and for negative values; np.log1p computes log(1 + x), which handles zeros but still requires x > −1.

🔢

Max-Abs Scaling

x' = x / max(|x|)

Scales to [−1, 1]. Does not shift the mean. Ideal for sparse data (NLP features) where you need to preserve zero entries.

🔲

Power Transform

Box-Cox / Yeo-Johnson

Finds the optimal transformation to make data as normal as possible. Yeo-Johnson handles negative values; Box-Cox requires positive data.
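The closed-form formulas above are short enough to write by hand. The following is a minimal NumPy sketch on a made-up six-value array (the values are an assumption chosen to include a negative number and a zero). It is meant to make the formulas concrete, not to replace the scikit-learn scalers.

```python
import numpy as np

x = np.array([-4.0, 0.0, 1.0, 2.0, 3.0, 8.0])

# Min-max: (x - min) / (max - min) -> values in [0, 1]
min_max = (x - x.min()) / (x.max() - x.min())

# Z-score: (x - mean) / std -> mean 0, standard deviation 1
z_score = (x - x.mean()) / x.std()

# Robust: (x - median) / IQR -> median 0, middle 50% spans one unit
q1, q2, q3 = np.percentile(x, [25, 50, 75])
robust = (x - q2) / (q3 - q1)

# Max-abs: x / max(|x|) -> values in [-1, 1], zeros stay exactly zero
max_abs = x / np.abs(x).max()

# log1p: log(1 + x) is only defined for x > -1, so it is applied
# to the non-negative values only in this example
log1p_nonneg = np.log1p(x[x >= 0])

print("min-max :", np.round(min_max, 3))
print("z-score :", np.round(z_score, 3))
print("robust  :", np.round(robust, 3))
print("max-abs :", np.round(max_abs, 3))
print("log1p   :", np.round(log1p_nonneg, 3))
```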

Section 03

Min-Max Normalisation

Min-Max normalisation scales every value into the range [0, 1] by subtracting the minimum and dividing by the full range. The shape of the distribution is preserved: the transformed data looks the same as the original, just rescaled. This makes it a good fit for models that expect inputs in a fixed range, such as neural networks, and for data with known bounds, such as image pixel values.

📖 When to Use This

The Neural Network That Refused to Converge

A research team was training a neural network to classify medical images. Their tabular metadata features included pixel intensity (0–255), blood pressure (60–180 mmHg), and a binary flag (0–1). The network refused to converge: loss oscillated wildly for 200 epochs without improving. The diagnosis: pixel intensity values of 200+ were causing gradient explosions during backpropagation because the weight updates scaled proportionally to the input magnitude. After applying Min-Max normalisation to bring all features into [0, 1], the network converged cleanly in 40 epochs. For neural networks, scaling inputs to [0, 1] or [-1, 1] is best treated as a prerequisite rather than an optional step.
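The mechanism in this story, weight updates growing with input magnitude, shows up even in the simplest gradient-descent model. The sketch below is a toy stand-in, not the team's network: it fits a one-feature linear model with plain gradient descent on synthetic data (the target relationship and learning rate are assumptions) and compares raw 0–255 inputs against min-max scaled inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
pixel = rng.uniform(0, 255, size=200)                   # raw intensity, 0-255
target = 0.01 * pixel + rng.normal(0, 0.1, size=200)    # toy linear target (assumed)


def gradient_descent_loss(x, y, lr=0.1, steps=2000):
    """Fit y ~ w*x + b by plain gradient descent; return the final mean squared error."""
    w, b = 0.0, 0.0
    # Silence overflow warnings: the raw-input run is expected to blow up
    with np.errstate(over="ignore", invalid="ignore"):
        for _ in range(steps):
            err = w * x + b - y
            w -= lr * 2 * np.mean(err * x)  # this gradient scales with the size of x
            b -= lr * 2 * np.mean(err)
        return float(np.mean((w * x + b - y) ** 2))


pixel_scaled = (pixel - pixel.min()) / (pixel.max() - pixel.min())

raw_loss = gradient_descent_loss(pixel, target)
scaled_loss = gradient_descent_loss(pixel_scaled, target)

print("Final loss on raw inputs   :", raw_loss)
print("Final loss on scaled inputs:", scaled_loss)
```

With the same learning rate, the raw-input run overshoots on every step until the loss is no longer a finite number, while the scaled run settles near the noise level of the data. A smaller learning rate would rescue the raw run, but at the cost of very slow progress on any smaller-scale features sharing the model.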

from sklearn.preprocessing import MinMaxScaler
import pandas as pd
import numpy  as np

# ── Manual calculation ────────────────────────────────
df['age_norm'] = (df['age'] - df['age'].min()) / (df['age'].max() - df['age'].min())

# ── Using sklearn MinMaxScaler ────────────────────────
scaler    = MinMaxScaler(feature_range=(0, 1))   # default: [0, 1]
num_cols  = ['age', 'purchase_amount', 'delivery_days']
df[num_cols] = scaler.fit_transform(df[num_cols])

# ── Custom range (e.g. for neural networks) ──────────
scaler_neg = MinMaxScaler(feature_range=(-1, 1))
df[num_cols] = scaler_neg.fit_transform(df[num_cols])

# ── CRITICAL: fit on train, transform both ────────────
scaler.fit(X_train[num_cols])                   # learn min/max from train only
X_train[num_cols] = scaler.transform(X_train[num_cols])
X_test[num_cols]  = scaler.transform(X_test[num_cols])  # use train's min/max

# ── Inverse transform to get original values back ─────
original = scaler.inverse_transform(X_train[num_cols])

❌ Before — Raw (different scales)

age	salary (₹)	exp (yrs)
22	240,000	0
35	850,000	8
28	420,000	3
52	2,100,000	22
45	1,500,000	18

✅ After MinMaxScaler [0,1]

age	salary	exp
0.00	0.00	0.00
0.43	0.33	0.36
0.20	0.10	0.14
1.00	1.00	1.00
0.77	0.68	0.82

📊 Output: Min-Max Normalisation — Before vs After Distribution

Panels: Before (raw ₹); After MinMaxScaler [0,1]

The shape of the distribution is preserved — it is simply rescaled. The skew remains. This is why normalisation alone does not fix skewed distributions — you need a power or log transform for that.

⚠️

The Train-Test Leakage Trap

Always fit your scaler only on the training set, never on the full dataset or the test set. Fitting on the full dataset leaks test set statistics (min, max, mean, std) into the model's view of the world, a subtle form of data leakage that inflates validation scores. Call scaler.fit(X_train), then scaler.transform() on both X_train and X_test.
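The difference between the correct and the leaky approach is easy to see with the min-max formula written out by hand. The sketch below uses synthetic purchase amounts (an assumption) and an 800/200 split. It is a NumPy illustration of the idea rather than a replacement for MinMaxScaler.

```python
import numpy as np

rng = np.random.default_rng(42)
values = rng.exponential(scale=1000.0, size=1000)  # hypothetical purchase amounts
train, test = values[:800], values[800:]

# Correct: learn min and max from the training rows only
lo, hi = train.min(), train.max()
train_scaled = (train - lo) / (hi - lo)
test_scaled = (test - lo) / (hi - lo)  # may fall outside [0, 1], and that is expected

# Leaky: statistics computed on train and test together
lo_all, hi_all = values.min(), values.max()
test_leaky = (test - lo_all) / (hi_all - lo_all)

print("train range (train stats):", train_scaled.min(), train_scaled.max())
print("test range  (train stats):", test_scaled.min(), test_scaled.max())
print("test range  (leaky stats):", test_leaky.min(), test_leaky.max())
```

Test values outside [0, 1] under the correct approach are not a bug: they are what genuinely unseen data looks like. The leaky version hides this by letting the test set define its own bounds.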

Section 04

Z-Score Standardisation

Standardisation transforms each feature to have a mean of 0 and a standard deviation of 1. Unlike normalisation, it does not bound values to a fixed range — outliers remain proportionally distant from the mean. This makes it the standard preprocessing step for linear models, logistic regression, PCA, and SVM.

📖 When to Use This

The Loan Model's Hidden Bias

A bank built a logistic regression model to predict loan approval. Without standardisation, the coefficient on "annual salary" (₹500,000–₹5,000,000) was 0.000002 while the coefficient on "credit score" (300–900) was 1.4. A manager reviewing the model concluded that credit score was 700,000 times more important than salary — because the coefficient was 700,000× larger. This was completely wrong. The coefficients had vastly different magnitudes only because their input features had vastly different scales. After standardising all features to mean=0, std=1, the coefficients became directly comparable: salary was actually the strongest predictor, with a standardised coefficient of 2.1. The model was the same — but now it was interpretable.

from sklearn.preprocessing import StandardScaler

# ── Manual Z-score ────────────────────────────────────
# ddof=0 matches StandardScaler (population std); pandas defaults to ddof=1
df['age_std'] = (df['age'] - df['age'].mean()) / df['age'].std(ddof=0)

# ── Using sklearn StandardScaler ─────────────────────
scaler   = StandardScaler()
num_cols = ['age', 'salary', 'credit_score', 'experience_years']
scaler.fit(X_train[num_cols])
X_train[num_cols] = scaler.transform(X_train[num_cols])
X_test[num_cols]  = scaler.transform(X_test[num_cols])

# ── Verify output: mean≈0, std≈1 ─────────────────────
# describe() reports the sample std (ddof=1), so expect values just above 1
print(X_train[num_cols].describe().loc[['mean','std']].round(4))

# ── Access learned parameters ─────────────────────────
print("Means:", scaler.mean_)
print("Stds :", np.sqrt(scaler.var_))

📊 Output: Z-Score Standardisation — Four Features on Same Scale

Panels: Before (raw feature ranges); After StandardScaler (mean=0, std=1)

After standardisation all four features are on a comparable scale, each centred on 0 with unit variance, so no feature dominates gradient updates or distance calculations simply because of its units. Coefficients in a linear model become directly comparable.
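The point about comparable coefficients can be demonstrated with ordinary least squares. The sketch below uses linear rather than logistic regression to keep it short, and the data is synthetic: the target is constructed (by assumption) so that salary matters twice as much as credit score in standardised units.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
salary = rng.uniform(500_000, 5_000_000, n)
credit = rng.uniform(300, 900, n)

z_sal = (salary - salary.mean()) / salary.std()
z_cred = (credit - credit.mean()) / credit.std()

# Toy target (assumed): salary has twice the standardised effect of credit score
y = 2.0 * z_sal + 1.0 * z_cred + rng.normal(0, 0.1, n)


def ols_slopes(features, target):
    """Least-squares slopes; the intercept is fitted but dropped from the result."""
    design = np.column_stack([np.ones(len(target)), features])
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    return beta[1:]


raw_coef = ols_slopes(np.column_stack([salary, credit]), y)
std_coef = ols_slopes(np.column_stack([z_sal, z_cred]), y)

print("raw coefficients         [salary, credit]:", raw_coef)
print("standardised coefficients [salary, credit]:", np.round(std_coef, 3))
```

Both fits describe the same relationship. On raw features the salary coefficient is tiny only because one rupee is a tiny step; on standardised features the coefficients reflect the effect of a one-standard-deviation change and can be ranked.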

Section 05

Robust Scaling — When Outliers Are Legitimate

RobustScaler centres each feature on its median and divides by the interquartile range (IQR). Because it uses statistics that are insensitive to outliers — median and IQR instead of mean and standard deviation — a single extreme value cannot distort the scaling of the entire feature.

from sklearn.preprocessing import RobustScaler

scaler   = RobustScaler(quantile_range=(25.0, 75.0))  # uses IQR by default
num_cols = ['purchase_amount', 'salary', 'debt_amount']

scaler.fit(X_train[num_cols])
X_train[num_cols] = scaler.transform(X_train[num_cols])
X_test[num_cols]  = scaler.transform(X_test[num_cols])

# The outliers still exist in the data — they just don't
# distort the scaling of all other values around them.

📊 Output: MinMax vs Standard vs Robust — Effect of One Outlier

Series: MinMaxScaler; StandardScaler; RobustScaler

A single outlier (point 20) compresses all MinMaxScaler values into the bottom-left corner. StandardScaler is also pulled toward the outlier. RobustScaler (green) maintains a natural spread for all other points because it uses the median and IQR, not the mean and standard deviation.
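The comparison in this figure can be approximated numerically. The sketch below uses made-up data, 19 ordinary points plus one outlier as point 20, and hand-written versions of the three formulas. It reports how much spread the ordinary points keep after each scaler.

```python
import numpy as np

ordinary = np.arange(1.0, 20.0)               # 19 ordinary points: 1, 2, ..., 19
with_outlier = np.append(ordinary, 1000.0)    # point 20 is the outlier


def min_max(x):
    return (x - x.min()) / (x.max() - x.min())


def z_score(x):
    return (x - x.mean()) / x.std()


def robust(x):
    q1, q2, q3 = np.percentile(x, [25, 50, 75])
    return (x - q2) / (q3 - q1)


# Spread (max - min) of the 19 ordinary points after each scaler
spread = {}
for name, scale in [("min_max", min_max), ("z_score", z_score), ("robust", robust)]:
    scaled_ordinary = scale(with_outlier)[:-1]
    spread[name] = float(scaled_ordinary.max() - scaled_ordinary.min())

for name, value in spread.items():
    print(f"{name:8s} spread of ordinary points: {value:.3f}")
```

Min-max squeezes the ordinary points into a sliver of the [0, 1] range, z-score does somewhat better, and the robust formula keeps them spread over nearly two units because the median and IQR barely move when one value is extreme.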

Scaler	Formula	Output Range	Outlier Sensitive	Best For
MinMaxScaler	(x−min)/(max−min)	[0, 1]	Yes	Neural networks, image data
StandardScaler	(x−μ)/σ	(−∞, +∞)	Partially	Linear models, PCA, SVM, logistic regression
RobustScaler	(x−median)/IQR	(−∞, +∞)	No	Data with legitimate extreme values
MaxAbsScaler	x / max(|x|)	[−1, +1]	Yes	Sparse matrices, NLP/TF-IDF features

Section 06

Log & Power Transformations — Fixing Skewed Distributions

Scaling methods change the range of a feature but preserve its shape. If a feature is heavily right-skewed — the long tail that appears on income, transaction amounts, and page views — no amount of scaling fixes that. You need a shape-changing transformation: log, square root, or a power transform.

📖 When to Use This

The Insurance Model That Underpriced High-Risk Customers

An insurance company built a linear regression model to predict claim amounts. Their "annual premium paid" feature had a massive right skew: most customers paid ₹5,000–₹15,000 per year, but a handful of corporate clients paid ₹2,000,000+. With raw premiums, those few extreme values dominated the fit, and the model could not represent the tail of the distribution well. It consistently underestimated claims for the highest-premium customers. After applying np.log1p() to the premium column, the skewness dropped from 4.8 to 0.3, close to symmetric. The model's RMSE on high-value customers improved by 62% and the company stopped systematically underpricing its most expensive segment.

from sklearn.preprocessing import PowerTransformer, QuantileTransformer

# ── Log transform (most common for right-skewed data) ─
df['purchase_log']  = np.log1p(df['purchase_amount'])  # log(1+x) safe for 0
df['salary_log']    = np.log(df['salary'])             # log(x) — x must be > 0

# ── Square root (moderate skew) ──────────────────────
df['purchase_sqrt']  = np.sqrt(df['purchase_amount'])

# ── Yeo-Johnson (handles negatives, finds best lambda) ─
pt = PowerTransformer(method='yeo-johnson', standardize=True)
df[['purchase_yj']] = pt.fit_transform(df[['purchase_amount']])
print(f"Best lambda: {pt.lambdas_[0]:.3f}")

# ── Box-Cox (strictly positive values only) ──────────
pt_bc = PowerTransformer(method='box-cox')
# The + 1 shift keeps zero purchases valid; it assumes purchase_amount >= 0
df[['purchase_bc']] = pt_bc.fit_transform(df[['purchase_amount']] + 1)

# ── Quantile Transform (forces normal distribution) ───
qt = QuantileTransformer(output_distribution='normal', random_state=42)
df[['purchase_qt']] = qt.fit_transform(df[['purchase_amount']])

# ── Check skewness improvement ────────────────────────
print(f"Original skew : {df['purchase_amount'].skew():.2f}")
print(f"Log skew      : {df['purchase_log'].skew():.2f}")
print(f"Yeo-Johnson   : {df['purchase_yj'].skew():.2f}")

📊 Output: Four Transformations — Skewness Comparison

Original — Skew: 4.8 (heavy right tail)

Square Root — Skew: 2.1

Log (np.log1p) — Skew: 0.3 (near-normal)

Yeo-Johnson — Skew: 0.02 (almost perfect)

Each transformation reduces skewness further than the one before it. Log is the practical default for most cases. Yeo-Johnson estimates its power parameter λ by maximum likelihood, so it usually gets closest to a normal shape; use it when you need the best possible normality.
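The progression from raw to square root to log can be checked on synthetic data. The sketch below draws a right-skewed sample from a log-normal distribution (an assumed stand-in for premium amounts) and uses a hand-written skewness function. pandas' `.skew()` applies a small-sample bias correction, so its numbers will differ slightly from these.

```python
import numpy as np


def skewness(x):
    """Population skewness: the mean of the cubed z-scores."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))


rng = np.random.default_rng(7)
premium = rng.lognormal(mean=9.0, sigma=1.0, size=5000)  # synthetic, heavy right tail

skew_raw = skewness(premium)
skew_sqrt = skewness(np.sqrt(premium))
skew_log = skewness(np.log1p(premium))

print(f"raw   skew: {skew_raw:.2f}")
print(f"sqrt  skew: {skew_sqrt:.2f}")
print(f"log1p skew: {skew_log:.2f}")
```

The exact values depend on the sample, but the ordering is the useful part: square root is a mild correction and the log is a strong one, which is why the log is the usual first attempt on heavily skewed positive data.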

Section 07

Categorical Encoding — Converting Text to Numbers

Machine learning algorithms cannot process text labels directly. Every categorical column — city, gender, product_category, income_bracket — must be converted to numbers before training. There are several encoding strategies, and choosing the wrong one can introduce ordering where none exists or create high-dimensional sparse data.

from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

# ── 1. One-Hot Encoding (nominal categories) ──────────
# Best for unordered categories with < 15 unique values
df_encoded = pd.get_dummies(df, columns=['city', 'gender'], dtype=int)

# Using sklearn (pipeline-friendly)
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoded = ohe.fit_transform(df[['city']])
print(ohe.get_feature_names_out())

# ── 2. Ordinal Encoding (ordered categories) ──────────
# Use ONLY when order matters (Low < Medium < High)
order = [['Low', 'Medium', 'High', 'Very High']]
oe    = OrdinalEncoder(categories=order)
df['income_encoded'] = oe.fit_transform(df[['income_bracket']])
# Low→0  Medium→1  High→2  VeryHigh→3

# ── 3. Label Encoding (binary or tree models only) ────
# LabelEncoder is designed for target labels; it assigns integers in sorted order
le = LabelEncoder()
df['gender_le'] = le.fit_transform(df['gender'])

# ── 4. Target Encoding (high cardinality) ─────────────
# Replace category with mean target value per category.
# Compute the means on training rows only; using the full df leaks the target.
target_means = df.groupby('city')['target'].mean()
df['city_target_enc'] = df['city'].map(target_means)

# ── 5. Frequency Encoding (high cardinality) ──────────
freq_map = df['city'].value_counts() / len(df)
df['city_freq_enc'] = df['city'].map(freq_map)

📊 Encoding Strategy Selector — Which Method for Which Situation?

The most common mistake is using Label Encoding on a nominal categorical column. LabelEncoder assigns integers in alphabetical order, so this tells the model that "Bengaluru" (0) is less than "Delhi" (1), which is less than "Mumbai" (2). That ordering doesn't exist and will mislead the model.
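A quick way to see the false ordering is to print the mapping LabelEncoder learns and compare it with one-hot output. The sketch below uses a five-row made-up city column.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

cities = pd.Series(["Mumbai", "Delhi", "Bengaluru", "Delhi", "Mumbai"], name="city")

# Label encoding: one integer per category, assigned in sorted order
le = LabelEncoder()
codes = le.fit_transform(cities)
mapping = {str(label): int(code) for code, label in enumerate(le.classes_)}
print("Label mapping:", mapping)

# One-hot encoding: one 0/1 column per category, no implied order
one_hot = pd.get_dummies(cities, prefix="city", dtype=int)
print(one_hot)
```

With label encoding, a linear model sees Mumbai as "twice Delhi" on a numeric axis. With one-hot encoding each city gets its own column and its own coefficient, so no ordering is implied.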

Section 08

Feature Engineering — Creating New Signals

Feature engineering is the process of creating new columns from existing ones that better expose the underlying patterns in the data. A model trained on engineered features often dramatically outperforms one trained on raw features — even if the raw data contains the same underlying information.

# ── Binning continuous to categorical ────────────────
df['age_group'] = pd.cut(
    df['age'],
    bins=[0, 25, 35, 45, 55, 100],
    labels=['18-25', '26-35', '36-45', '46-55', '56+']
)

# ── Interaction features ─────────────────────────────
df['spend_per_delivery_day'] = df['purchase_amount'] / (df['delivery_days'] + 1)
df['age_income_ratio']       = df['age'] / (df['income_amount'] + 1)

# ── Date features ────────────────────────────────────
df['order_date']     = pd.to_datetime(df['order_date'])
df['day_of_week']    = df['order_date'].dt.dayofweek
df['is_weekend']     = df['day_of_week'] >= 5
df['month']          = df['order_date'].dt.month
df['quarter']        = df['order_date'].dt.quarter
df['days_since_reg']  = (pd.Timestamp.today() - pd.to_datetime(df['registration_date'])).dt.days

# ── Polynomial features (for linear models) ──────────
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(df[['age', 'purchase_amount']])
print(poly.get_feature_names_out())  # ['age', 'purchase_amount', 'age^2', 'age purchase_amount', 'purchase_amount^2']

📊 Output: Feature Importance — Raw Features vs Engineered Features

Series: Raw features; Engineered features

Engineered features like spend_per_delivery_day and is_weekend rank higher in feature importance than their raw components. Creating ratio and interaction features often surfaces signals that were hidden inside individual columns.

Section 09

Building a Complete Transformation Pipeline

In production, every transformation must be applied consistently to both training and test data, and later to new prediction data. sklearn's Pipeline and ColumnTransformer encapsulate all transformation steps into a single reusable object that guarantees consistency and prevents data leakage.

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

# ── Define column groups ──────────────────────────────
num_cols  = ['age', 'purchase_amount', 'delivery_days']
nom_cols  = ['city', 'gender', 'product_category']
ord_cols  = ['income_bracket']

# ── Numeric pipeline: impute → scale ─────────────────
num_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler',  StandardScaler())
])

# ── Nominal pipeline: impute → one-hot encode ─────────
nom_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ohe',     OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

# ── Ordinal pipeline: impute → ordinal encode ─────────
ord_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ordinal', OrdinalEncoder(categories=[['Low','Medium','High','Very High']]))
])

# ── Combine with ColumnTransformer ────────────────────
preprocessor = ColumnTransformer([
    ('numeric',  num_pipe, num_cols),
    ('nominal',  nom_pipe, nom_cols),
    ('ordinal',  ord_pipe, ord_cols),
])

# ── Full model pipeline ───────────────────────────────
model_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model',        LogisticRegression(max_iter=1000))
])

# ── Fit on training data only ─────────────────────────
model_pipeline.fit(X_train, y_train)
score = model_pipeline.score(X_test, y_test)
print(f"Test accuracy: {score:.3f}")

📊 Full sklearn Pipeline Architecture

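The Pipeline learns its parameters only from the data passed to fit(), so fitting on the training set keeps test statistics out of every step. When you call predict() on new data, all transformations are applied automatically using the parameters learned from training. Leakage is still possible if you fit the pipeline on the full dataset, so split before fitting.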
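Both properties, parameters learned from the training rows only and identical behaviour after saving and reloading, can be verified on a small example. The sketch below uses synthetic two-feature data with an assumed toy label and a two-step pipeline; the file name is arbitrary.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic columns: age-like and salary-like values
X = rng.normal(loc=[40, 1_000_000], scale=[10, 500_000], size=(300, 2))
y = (X[:, 1] > 1_000_000).astype(int)  # toy label, assumed for illustration

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)

# The scaler's statistics come from the rows passed to fit(), here X_train only
learned_mean = pipe.named_steps["scaler"].mean_
matches_train = bool(np.allclose(learned_mean, X_train.mean(axis=0)))

# Persist the whole fitted pipeline, then reload it as an inference service would
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model_pipeline.joblib")
    joblib.dump(pipe, path)
    restored = joblib.load(path)

same_predictions = bool(
    np.array_equal(pipe.predict(X_test), restored.predict(X_test))
)

print("Scaler mean equals training mean:", matches_train)
print("Reloaded pipeline predicts identically:", same_predictions)
```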

Section 10

Golden Rules of Data Transformation

🎯 8 Rules Every Data Scientist Must Follow

Always fit scalers and encoders on the training set only. Using the full dataset to fit a scaler leaks test statistics into the model's training process — a subtle but real form of data leakage that inflates validation scores.

Use StandardScaler as your default for linear models, PCA, and SVM. Use MinMaxScaler for neural networks. Use RobustScaler when your data has legitimate extreme values you are keeping. Skip scaling entirely for tree-based models.

Always check skewness before scaling: df['col'].skew(). If |skew| > 1, reshape the feature before scaling: a log transform for right skew, or a power transform such as Yeo-Johnson for either direction. Scaling a skewed feature does not fix the skew, it just rescales it.

Never use Label Encoding on a nominal (unordered) categorical column for linear models or neural networks. It introduces a false ordering that the model will learn as a real signal — corrupting predictions. Use One-Hot Encoding instead.

For high-cardinality categoricals (>15 unique values), avoid One-Hot Encoding — it creates too many columns and can cause the curse of dimensionality. Use Target Encoding (with cross-validation to prevent leakage) or Frequency Encoding instead.

Use sklearn's Pipeline and ColumnTransformer in every project. Hand-applying transformations step by step is error-prone and creates inconsistency between training and inference. Pipelines prevent this by design.

Save the fitted pipeline with joblib.dump(pipeline, 'model.pkl') — not just the model. At inference time, raw data must pass through the same fitted transformations the model was trained with. A saved model without its fitted scaler is unusable.

Check distributions after every transformation. Run df.describe() and plot histograms after scaling. A scaler applied to a column that still contains outliers from a data error will produce scaled values that look correct but are wrong. Transformation is the last step — cleaning must come first.
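The rule above on target encoding mentions cross-validation as the guard against leakage. One common way to do that is out-of-fold encoding: each row is encoded with target means computed from the other folds, so a row never sees its own label. The sketch below is one possible implementation in pandas; the function name `out_of_fold_target_encode`, the fold assignment, and the ten-row dataset are all illustrative assumptions.

```python
import numpy as np
import pandas as pd


def out_of_fold_target_encode(df, cat_col, target_col, n_splits=5, seed=0):
    """Encode cat_col using target means computed on the other folds only."""
    rng = np.random.default_rng(seed)
    fold = rng.permutation(len(df)) % n_splits  # balanced random fold id per row
    global_mean = df[target_col].mean()
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    for k in range(n_splits):
        held_out = fold == k
        # Category means from every row that is NOT in the held-out fold
        means = df.loc[~held_out].groupby(cat_col)[target_col].mean()
        encoded.loc[held_out] = df.loc[held_out, cat_col].map(means).to_numpy()
    # Categories unseen in the other folds fall back to the global mean
    return encoded.fillna(global_mean)


df = pd.DataFrame({
    "city": ["Mumbai", "Delhi", "Mumbai", "Pune", "Delhi",
             "Mumbai", "Delhi", "Pune", "Mumbai", "Delhi"],
    "target": [1, 0, 1, 0, 0, 1, 1, 0, 0, 1],
})
df["city_target_enc"] = out_of_fold_target_encode(df, "city", "target")
print(df)
```

In a real project this would be applied to the training set only, with the test set encoded from means computed on the full training set.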

🧮

Key Takeaway

Data transformation is not preprocessing bureaucracy. It is the bridge between raw data and learnable patterns. The fintech team that discovered their KNN model only learned from salary did not have a model problem or a data problem. They had a transformation problem. Four lines of code converted a 71% model into an 88% model. Transformation decisions made before training echo through every metric you will ever report on that model.