Why Data Transformation Is Essential
Raw data arrives in the units humans find convenient: age in years, salary in rupees, height in centimetres, distance in kilometres. But machine learning algorithms do not care about human convenience. They care about mathematics. When one feature ranges from 0 to 1 and another ranges from 10,000 to 10,000,000, the distances computed by algorithms like KNN and SVM are dominated by the large-scale feature, and the small one is effectively ignored. Data transformation is the process of converting raw feature values into a scale that algorithms can learn from fairly and efficiently.
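To make the scale problem concrete, here is a minimal sketch using three made-up applicants (the numbers and variable names are illustrative assumptions, not data from this article). It measures how much of the squared Euclidean distance between two rows comes from age, before and after standardisation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical applicants: [age in years, salary in rupees]
X = np.array([
    [25.0, 500_000.0],   # a: young, mid salary
    [60.0, 510_000.0],   # b: much older, nearly the same salary
    [26.0, 900_000.0],   # c: close to a in age, much higher salary
])

def age_share(rows):
    """Fraction of the squared a-b distance contributed by the age column."""
    diff_sq = (rows[0] - rows[1]) ** 2
    return float(diff_sq[0] / diff_sq.sum())

age_share_raw = age_share(X)
age_share_scaled = age_share(StandardScaler().fit_transform(X))

print(f"Age share of a-b distance (raw)   : {age_share_raw:.6f}")
print(f"Age share of a-b distance (scaled): {age_share_scaled:.6f}")
```

On the raw values the 35-year age gap contributes almost nothing to the distance, because a 10,000-rupee salary gap is numerically far larger. After scaling, the age gap accounts for nearly all of it.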
Distance-based algorithms (KNN, SVM, K-Means), gradient-based algorithms (Linear Regression, Logistic Regression, Neural Networks), and regularised models (Ridge, Lasso, ElasticNet) are all sensitive to feature scale. Tree-based algorithms (Decision Trees, Random Forest, XGBoost) are effectively scale-invariant: each split is a threshold on a single feature, so only the ordering of values matters, not their magnitude. Always scale before distance or gradient methods. You can skip it for trees.
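The contrast between trees and distance-based models can be checked empirically. The sketch below uses a synthetic dataset with one feature deliberately inflated (the dataset, the `agreement` helper, and the hyperparameters are all assumptions made for illustration). It compares each model's test predictions with and without standardisation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)
X[:, 0] *= 10_000  # put one feature on a much larger scale
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler().fit(X_train)  # fit on training rows only
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

def agreement(model_factory):
    """Share of test rows predicted identically on raw and scaled inputs."""
    raw = model_factory().fit(X_train, y_train).predict(X_test)
    scaled = model_factory().fit(X_train_s, y_train).predict(X_test_s)
    return float(np.mean(raw == scaled))

tree_agreement = agreement(lambda: DecisionTreeClassifier(max_depth=4, random_state=0))
knn_agreement = agreement(lambda: KNeighborsClassifier(n_neighbors=5))

print(f"Decision tree agreement: {tree_agreement:.2f}")
print(f"KNN agreement          : {knn_agreement:.2f}")
```

The tree is expected to make the same predictions either way, while KNN's predictions can change once the inflated feature stops dominating the distance.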
Before entering a model, each feature passes through the relevant steps of a transformation pipeline. Numeric features need scaling. Skewed distributions need reshaping. Categorical features need encoding. Optional feature engineering can create new signals.
Scaling Methods — The Core Four
Four scaling methods cover most situations you will meet in practice: Min-Max normalisation, Z-score standardisation, robust scaling, and max-abs scaling. Each transforms the feature values into a new range or distribution, and each suits a different scenario.
Min-Max Normalisation
Min-Max normalisation scales every value into the range [0, 1] by subtracting the minimum and dividing by the full range. The shape of the distribution is preserved: the transformed data looks the same as the original, just rescaled. This makes it a good fit for models and inputs that expect a fixed range, such as neural networks and image pixel values.
from sklearn.preprocessing import MinMaxScaler
import pandas as pd
import numpy as np
# ── Manual calculation ────────────────────────────────
df['age_norm'] = (df['age'] - df['age'].min()) / (df['age'].max() - df['age'].min())
# ── Using sklearn MinMaxScaler ────────────────────────
scaler = MinMaxScaler(feature_range=(0, 1)) # default: [0, 1]
num_cols = ['age', 'purchase_amount', 'delivery_days']
df[num_cols] = scaler.fit_transform(df[num_cols])
# ── Custom range (e.g. for neural networks) ──────────
scaler_neg = MinMaxScaler(feature_range=(-1, 1))
df[num_cols] = scaler_neg.fit_transform(df[num_cols])
# ── CRITICAL: fit on train, transform both ────────────
scaler.fit(X_train[num_cols]) # learn min/max from train only
X_train[num_cols] = scaler.transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols]) # use train's min/max
# ── Inverse transform to get original values back ─────
original = scaler.inverse_transform(X_train[num_cols])
Before (raw values):

| age | salary (₹) | exp (yrs) |
|---|---|---|
| 22 | 240,000 | 0 |
| 35 | 850,000 | 8 |
| 28 | 420,000 | 3 |
| 52 | 2,100,000 | 22 |
| 45 | 1,500,000 | 18 |

After Min-Max normalisation:

| age | salary | exp |
|---|---|---|
| 0.00 | 0.00 | 0.00 |
| 0.43 | 0.33 | 0.36 |
| 0.20 | 0.10 | 0.14 |
| 1.00 | 1.00 | 1.00 |
| 0.77 | 0.68 | 0.82 |
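The worked example in the tables can be recomputed directly. This short sketch rebuilds the five rows as a DataFrame (the column names `salary` and `exp` are shortened for convenience) and applies `MinMaxScaler`, so each scaled value can be checked against (x − min) / (max − min).

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# The five rows from the raw-value table
df = pd.DataFrame({
    "age": [22, 35, 28, 52, 45],
    "salary": [240_000, 850_000, 420_000, 2_100_000, 1_500_000],
    "exp": [0, 8, 3, 22, 18],
})

# Each column is scaled independently using its own min and max
scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(df),
    columns=df.columns,
).round(2)

print(scaled)
```

For instance, a salary of 850,000 maps to (850,000 − 240,000) / (2,100,000 − 240,000), which is roughly 0.328.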
The shape of the distribution is preserved — it is simply rescaled. The skew remains. This is why normalisation alone does not fix skewed distributions — you need a power or log transform for that.
Always fit your scaler only on the training set — never on the full dataset or the test set. Fitting on the full dataset leaks test set statistics (min, max, mean, std) into the model's view of the world — a subtle form of data leakage that inflates validation scores. Use scaler.fit(X_train) then scaler.transform(X_test).
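The fit-on-train rule can be seen in a small sketch on synthetic data (the feature values and split sizes are arbitrary assumptions). The scaler's learned minimum and maximum come from the training rows alone, so the scaled test values are not guaranteed to stay inside [0, 1].

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(200, 1))   # one synthetic feature
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = MinMaxScaler().fit(X_train)     # statistics come from training rows only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)      # reuses the training min and max

print("Learned min/max :", scaler.data_min_, scaler.data_max_)
print("Train range     :", X_train_s.min(), X_train_s.max())  # 0 and 1 up to float error
print("Test range      :", X_test_s.min(), X_test_s.max())    # may fall outside [0, 1]
```

Test values slightly below 0 or above 1 are not a bug. They are what an honest evaluation looks like when the test set contains values the training set never saw.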
Z-Score Standardisation
Standardisation transforms each feature to have a mean of 0 and a standard deviation of 1. Unlike normalisation, it does not bound values to a fixed range — outliers remain proportionally distant from the mean. This makes it the standard preprocessing step for linear models, logistic regression, PCA, and SVM.
from sklearn.preprocessing import StandardScaler
# ── Manual Z-score (ddof=0 matches StandardScaler) ────
df['age_std'] = (df['age'] - df['age'].mean()) / df['age'].std(ddof=0)
# ── Using sklearn StandardScaler ─────────────────────
scaler = StandardScaler()
num_cols = ['age', 'salary', 'credit_score', 'experience_years']
scaler.fit(X_train[num_cols])
X_train[num_cols] = scaler.transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])
# ── Verify output: mean≈0, std≈1 ─────────────────────
print(pd.DataFrame(X_train[num_cols]).describe().loc[['mean','std']].round(4))
# ── Access learned parameters ─────────────────────────
print("Means:", scaler.mean_)
print("Stds :", np.sqrt(scaler.var_))
Figure: feature ranges before scaling (raw) and after standardisation (mean=0, std=1).
After standardisation all four features sit on a comparable scale (mean 0, standard deviation 1), so gradient updates are no longer dominated by the largest-valued feature. Coefficients in a linear model also become directly comparable.
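One detail worth checking when computing Z-scores by hand: pandas' `.std()` defaults to the sample standard deviation (`ddof=1`), whereas `StandardScaler` divides by the population standard deviation (`ddof=0`). The sketch below, on a five-value toy column, shows which manual formula matches sklearn.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"age": [22, 35, 28, 52, 45]})
centred = df["age"] - df["age"].mean()

z_sklearn = StandardScaler().fit_transform(df[["age"]]).ravel()
z_pop = (centred / df["age"].std(ddof=0)).to_numpy()   # population std
z_sample = (centred / df["age"].std()).to_numpy()      # pandas default, ddof=1

print("sklearn :", z_sklearn.round(3))
print("ddof=0  :", z_pop.round(3))
print("ddof=1  :", z_sample.round(3))
```

The difference shrinks as the number of rows grows, so it rarely matters for modelling, but it explains small mismatches when comparing manual results with sklearn output.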
Robust Scaling — When Outliers Are Legitimate
RobustScaler centres each feature on its median and divides by the interquartile range (IQR). Because it uses statistics that are insensitive to outliers — median and IQR instead of mean and standard deviation — a single extreme value cannot distort the scaling of the entire feature.
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler(quantile_range=(25.0, 75.0)) # uses IQR by default
num_cols = ['purchase_amount', 'salary', 'debt_amount']
scaler.fit(X_train[num_cols])
X_train[num_cols] = scaler.transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])
# The outliers still exist in the data — they just don't
# distort the scaling of all other values around them.
A single outlier (point 20) compresses all MinMaxScaler values into the bottom-left corner. StandardScaler is also pulled toward the outlier. RobustScaler (green) maintains a natural spread for all other points because it uses the median and IQR, not the mean and standard deviation.
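The effect described here can be reproduced with a synthetic column of 19 ordinary values plus one extreme value (the data is an assumption chosen to mirror the 20-point example). For each scaler, the sketch measures how much spread is left among the 19 ordinary points after scaling.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# 19 ordinary values (1..19) and one extreme value as the 20th point
x = np.append(np.arange(1.0, 20.0), 1000.0).reshape(-1, 1)

scalers = {
    "minmax": MinMaxScaler(),
    "standard": StandardScaler(),
    "robust": RobustScaler(),
}

spreads = {}
for name, scaler in scalers.items():
    scaled = scaler.fit_transform(x).ravel()
    inliers = scaled[:-1]                      # everything except the outlier
    spreads[name] = float(inliers.max() - inliers.min())
    print(f"{name:<8} spread of the 19 ordinary points: {spreads[name]:.3f}")
```

With Min-Max and standard scaling the ordinary points are squeezed into a narrow band, while robust scaling keeps them spread out because the median and IQR barely move when one value is extreme.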
| Scaler | Formula | Output Range | Outlier Sensitive | Best For |
|---|---|---|---|---|
| MinMaxScaler | (x−min)/(max−min) | [0, 1] | Yes | Neural networks, image data |
| StandardScaler | (x−μ)/σ | (−∞, +∞) | Partially | Linear models, PCA, SVM, logistic regression |
| RobustScaler | (x−median)/IQR | (−∞, +∞) | No | Data with legitimate extreme values |
| MaxAbsScaler | x / max(abs(x)) | [−1, +1] | Yes | Sparse matrices, NLP/TF-IDF features |
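MaxAbsScaler is the one scaler in the table without a code example, so here is a minimal sketch on a tiny hand-made sparse matrix (the values are arbitrary). Because it only divides each column by its largest absolute value and never shifts the data, zeros stay zero and the matrix stays sparse.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler

# A small sparse matrix standing in for TF-IDF style features
X = sparse.csr_matrix(np.array([
    [0.0,  2.0, 0.0],
    [4.0,  0.0, 0.0],
    [0.0, -8.0, 1.0],
]))

scaler = MaxAbsScaler()
X_scaled = scaler.fit_transform(X)

print("Column max-abs values:", scaler.max_abs_)
print("Still sparse         :", sparse.issparse(X_scaled))
print(X_scaled.toarray())
```

Scalers that subtract a mean or a minimum would turn most zeros into non-zero values, which is why they are a poor match for large sparse inputs.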
Log & Power Transformations — Fixing Skewed Distributions
Scaling methods change the range of a feature but preserve its shape. If a feature is heavily right-skewed — the long tail that appears on income, transaction amounts, and page views — no amount of scaling fixes that. You need a shape-changing transformation: log, square root, or a power transform.
from sklearn.preprocessing import PowerTransformer, QuantileTransformer
# ── Log transform (most common for right-skewed data) ─
df['purchase_log'] = np.log1p(df['purchase_amount']) # log(1+x) safe for 0
df['salary_log'] = np.log(df['salary']) # log(x) — x must be > 0
# ── Square root (moderate skew) ──────────────────────
df['purchase_sqrt'] = np.sqrt(df['purchase_amount'])
# ── Yeo-Johnson (handles negatives, finds best lambda) ─
pt = PowerTransformer(method='yeo-johnson', standardize=True)
df[['purchase_yj']] = pt.fit_transform(df[['purchase_amount']])
print(f"Best lambda: {pt.lambdas_[0]:.3f}")
# ── Box-Cox (only positive values) ───────────────────
pt_bc = PowerTransformer(method='box-cox')
df[['purchase_bc']] = pt_bc.fit_transform(df[['purchase_amount']] + 1)
# ── Quantile Transform (forces normal distribution) ───
qt = QuantileTransformer(output_distribution='normal', random_state=42)
df[['purchase_qt']] = qt.fit_transform(df[['purchase_amount']])
# ── Check skewness improvement ────────────────────────
print(f"Original skew : {df['purchase_amount'].skew():.2f}")
print(f"Log skew : {df['purchase_log'].skew():.2f}")
print(f"Yeo-Johnson : {df['purchase_yj'].skew():.2f}")
Figure: skewness after each transformation. Original: 4.8 (heavy right tail). Square root: 2.1. Log (np.log1p): 0.3 (near-normal). Yeo-Johnson: 0.02 (almost perfect).
Each transformation in this example reduces skewness further. Log is the practical default for most cases. Yeo-Johnson estimates its power parameter (lambda) by maximum likelihood, so it usually gets closest to a normal shape. Use it when normality matters most.
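The same comparison can be run on synthetic data. The sketch below draws right-skewed values from a log-normal distribution (an assumption standing in for real purchase amounts) and measures skewness before and after `np.log1p` and Yeo-Johnson. Exact numbers will differ from the figures quoted above, since the data is different.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(42)
amounts = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=5000))  # synthetic, right-skewed

log_amounts = np.log1p(amounts)

pt = PowerTransformer(method="yeo-johnson", standardize=True)
yj_values = pt.fit_transform(amounts.to_numpy().reshape(-1, 1)).ravel()
yj_amounts = pd.Series(yj_values)

skews = {
    "original": float(amounts.skew()),
    "log1p": float(log_amounts.skew()),
    "yeo_johnson": float(yj_amounts.skew()),
}
for name, value in skews.items():
    print(f"{name:<12} skew: {value:.2f}")
print(f"Estimated lambda: {pt.lambdas_[0]:.3f}")
```

In a real project the transformer should be fitted on the training split only, for the same leakage reasons that apply to scalers.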
Categorical Encoding — Converting Text to Numbers
Machine learning algorithms cannot process text labels directly. Every categorical column — city, gender, product_category, income_bracket — must be converted to numbers before training. There are several encoding strategies, and choosing the wrong one can introduce ordering where none exists or create high-dimensional sparse data.
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
# ── 1. One-Hot Encoding (nominal categories) ──────────
# Best for unordered categories with < 15 unique values
df_encoded = pd.get_dummies(df, columns=['city', 'gender'], dtype=int)
# Using sklearn (pipeline-friendly)
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoded = ohe.fit_transform(df[['city']])
print(ohe.get_feature_names_out())
# ── 2. Ordinal Encoding (ordered categories) ──────────
# Use ONLY when order matters (Low < Medium < High)
order = [['Low', 'Medium', 'High', 'Very High']]
oe = OrdinalEncoder(categories=order)
df['income_encoded'] = oe.fit_transform(df[['income_bracket']])
# Low→0 Medium→1 High→2 VeryHigh→3
# ── 3. Label Encoding (binary or tree models only) ────
le = LabelEncoder()
df['gender_le'] = le.fit_transform(df['gender'])
# ── 4. Target Encoding (high cardinality) ─────────────
# Replace category with mean target value per category.
# Compute the means on training rows only, otherwise the target leaks into the feature.
target_means = df.groupby('city')['target'].mean()
df['city_target_enc'] = df['city'].map(target_means)
# ── 5. Frequency Encoding (high cardinality) ──────────
freq_map = df['city'].value_counts() / len(df)
df['city_freq_enc'] = df['city'].map(freq_map)
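Target encoding is the strategy most prone to leakage, so it is worth sketching a train-only version. The toy frames, the `bought`-style binary target, and the choice to fall back to the global training mean for unseen cities are all assumptions made for illustration.

```python
import pandas as pd

train = pd.DataFrame({
    "city": ["Mumbai", "Mumbai", "Delhi", "Delhi", "Pune"],
    "target": [1, 0, 1, 1, 0],
})
test = pd.DataFrame({"city": ["Delhi", "Mumbai", "Chennai"]})  # Chennai never appears in train

# Learn the mapping from the training rows only
global_mean = train["target"].mean()
city_means = train.groupby("city")["target"].mean()

train["city_target_enc"] = train["city"].map(city_means)
# Unseen categories map to NaN, so fall back to the global training mean
test["city_target_enc"] = test["city"].map(city_means).fillna(global_mean)

print(test)
```

Even this version lets each training row see its own target. Out-of-fold encoding reduces that effect, and scikit-learn 1.3 and later include a `TargetEncoder` that cross-fits internally.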
The most common mistake is using Label Encoding on a nominal categorical column. LabelEncoder assigns integer codes in alphabetical order, so this tells the model that "Bengaluru" (0) is less than "Delhi" (1), which is less than "Mumbai" (2). That ordering doesn't exist and will mislead the model.
Feature Engineering — Creating New Signals
Feature engineering is the process of creating new columns from existing ones that better expose the underlying patterns in the data. A model trained on engineered features often dramatically outperforms one trained on raw features — even if the raw data contains the same underlying information.
# ── Binning continuous to categorical ────────────────
df['age_group'] = pd.cut(
df['age'],
bins=[0, 25, 35, 45, 55, 100],
labels=['18-25', '26-35', '36-45', '46-55', '55+']
)
# ── Interaction features ─────────────────────────────
df['spend_per_delivery_day'] = df['purchase_amount'] / (df['delivery_days'] + 1)
df['age_income_ratio'] = df['age'] / (df['income_amount'] + 1)
# ── Date features ────────────────────────────────────
df['order_date'] = pd.to_datetime(df['order_date'])
df['day_of_week'] = df['order_date'].dt.dayofweek
df['is_weekend'] = df['day_of_week'] >= 5
df['month'] = df['order_date'].dt.month
df['quarter'] = df['order_date'].dt.quarter
df['days_since_reg'] = (pd.Timestamp.today() - pd.to_datetime(df['registration_date'])).dt.days
# ── Polynomial features (for linear models) ──────────
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(df[['age', 'purchase_amount']])
print(poly.get_feature_names_out()) # ['age', 'purchase_amount', 'age^2', 'age purchase_amount', 'purchase_amount^2']
In this example, engineered features like spend_per_delivery_day and is_weekend rank higher in feature importance than their raw components. Ratio and interaction features often surface signals that were hidden inside individual columns.
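To see what these engineered columns look like on real rows, here is a small runnable sketch on a four-row toy frame (the values and dates are invented for illustration). It applies the same binning, ratio, and date logic as the snippets above.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [19, 31, 47, 62],
    "purchase_amount": [1200.0, 450.0, 3000.0, 800.0],
    "delivery_days": [2, 0, 5, 3],
    "order_date": ["2024-01-06", "2024-01-08", "2024-03-17", "2024-07-04"],
})

# Binning: pd.cut uses right-inclusive intervals, e.g. (25, 35] -> '26-35'
df["age_group"] = pd.cut(
    df["age"],
    bins=[0, 25, 35, 45, 55, 100],
    labels=["18-25", "26-35", "36-45", "46-55", "55+"],
)

# Ratio feature: the +1 avoids division by zero for same-day delivery
df["spend_per_delivery_day"] = df["purchase_amount"] / (df["delivery_days"] + 1)

# Date features: dayofweek is 0 for Monday through 6 for Sunday
df["order_date"] = pd.to_datetime(df["order_date"])
df["day_of_week"] = df["order_date"].dt.dayofweek
df["is_weekend"] = df["day_of_week"] >= 5

print(df[["age_group", "spend_per_delivery_day", "day_of_week", "is_weekend"]])
```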
Building a Complete Transformation Pipeline
In production, every transformation must be applied consistently to both training and test data, and later to new prediction data. sklearn's Pipeline and ColumnTransformer encapsulate all transformation steps into a single reusable object that guarantees consistency and prevents data leakage.
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
# ── Define column groups ──────────────────────────────
num_cols = ['age', 'purchase_amount', 'delivery_days']
nom_cols = ['city', 'gender', 'product_category']
ord_cols = ['income_bracket']
# ── Numeric pipeline: impute → scale ─────────────────
num_pipe = Pipeline([
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])
# ── Nominal pipeline: impute → one-hot encode ─────────
nom_pipe = Pipeline([
('imputer', SimpleImputer(strategy='most_frequent')),
('ohe', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])
# ── Ordinal pipeline: impute → ordinal encode ─────────
ord_pipe = Pipeline([
('imputer', SimpleImputer(strategy='most_frequent')),
('ordinal', OrdinalEncoder(categories=[['Low','Medium','High','Very High']]))
])
# ── Combine with ColumnTransformer ────────────────────
preprocessor = ColumnTransformer([
('numeric', num_pipe, num_cols),
('nominal', nom_pipe, nom_cols),
('ordinal', ord_pipe, ord_cols),
])
# ── Full model pipeline ───────────────────────────────
model_pipeline = Pipeline([
('preprocessor', preprocessor),
('model', LogisticRegression(max_iter=1000))
])
# ── Fit on training data only ─────────────────────────
model_pipeline.fit(X_train, y_train)
score = model_pipeline.score(X_test, y_test)
print(f"Test accuracy: {score:.3f}")
As long as you call fit() on the training data only, the Pipeline learns every transformation parameter from that data. When you call predict() on new data, all transformations are applied automatically using those learned parameters, which removes the most common source of preprocessing leakage.
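A compact sketch of that behaviour, using a six-row toy training frame and two new rows (all values, and the `bought` target, are hypothetical). One new row contains a city that never appeared in training, which `handle_unknown='ignore'` tolerates.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

train = pd.DataFrame({
    "age": [22, 35, 28, 52, 45, 31],
    "city": ["Mumbai", "Delhi", "Delhi", "Mumbai", "Pune", "Pune"],
    "bought": [0, 1, 0, 1, 1, 0],
})
new = pd.DataFrame({"age": [40, 24], "city": ["Chennai", "Delhi"]})  # Chennai is unseen

preprocessor = ColumnTransformer([
    ("numeric", StandardScaler(), ["age"]),
    ("nominal", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])
pipe = Pipeline([
    ("preprocessor", preprocessor),
    ("model", LogisticRegression(max_iter=1000)),
])
pipe.fit(train[["age", "city"]], train["bought"])

# The scaler's mean was learned from the training rows and is reused at predict time
fitted_scaler = pipe.named_steps["preprocessor"].named_transformers_["numeric"]
learned_mean = float(fitted_scaler.mean_[0])
preds = pipe.predict(new)

print("Mean age learned from train:", learned_mean)
print("Predictions for new rows   :", preds)
```

Because the fitted pipeline is a single object, it can be saved and reloaded as one unit, which keeps training-time and prediction-time preprocessing in sync.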
Golden Rules of Data Transformation
Data transformation is not preprocessing bureaucracy — it is the bridge between raw data and learnable patterns. The fintech team that discovered their KNN model only learned from salary did not have a model problem or a data problem. They had a transformation problem. Four lines of code and two hours of investigation converted a 71% model into an 88% model. Transformation decisions made before training echo through every metric you will ever report on that model.