
Feature Engineering & Feature Scaling

A story-driven, comprehensive guide to creating powerful new features from raw data and scaling them correctly — covering polynomial features, interaction terms, binning, date extraction, target statistics, and all major scaling methods — with live diagrams, real-world stories, before/after comparisons, and complete reusable sklearn pipeline code.

Section 01

Why Feature Engineering Is the Most Impactful Step

Machine learning algorithms are mathematical functions — they can only learn from the signals you give them. If the raw data expresses those signals poorly, no amount of hyperparameter tuning or model complexity will recover them. Feature engineering is the art and science of transforming raw data into representations that expose the underlying patterns more clearly — creating new variables, combining existing ones, and scaling everything so that each feature speaks to the model in a language it can learn from.

The ₹18 Crore Revenue Unlock From Three New Columns
A major Indian e-commerce platform had been running a product recommendation model for two years. The model used 45 raw features (user age, purchase amount, product category, city, and so on) and achieved a click-through rate (CTR) of 3.2%. A data science consultant was brought in to improve it. Instead of changing the model architecture, she spent three days on feature engineering. She created: (1) days_since_last_purchase from the order timestamps, capturing recency, the strongest predictor of future purchase intent; (2) avg_spend_per_visit by dividing total lifetime spend by session count, capturing customer value density; (3) hour_of_day_encoded using sine/cosine encoding of the purchase hour, capturing time-of-day shopping patterns without imposing a false linear ordering in which 23:00 and 00:00 look far apart. CTR improved from 3.2% to 5.1%, a 59% relative gain. At the platform's scale, this translated to ₹18 crore in additional annual revenue. The model did not change. The features did.
💡
Feature Engineering vs Model Selection

Write-ups from Kaggle competition winners frequently credit feature engineering, rather than model selection, for their largest gains. A well-engineered feature set with a simple logistic regression can outperform a poorly engineered one with XGBoost. The model learns patterns, but only the patterns you put in front of it. Feature engineering determines which patterns are visible.

🗺️ The Feature Engineering + Scaling Pipeline
[Diagram: Raw Data (45 features) → Interaction & Polynomial (a×b, a², a³) → Date / Time Extraction (hour, day, sin/cos) → Binning & Aggregation (cut, qcut, groupby) → Scaling & Normalisation (MinMax, Std, Robust) → Encoding & Selection (OHE, Lasso, RFE) → Model ready ✅]

Feature engineering happens before scaling. You must create, transform, and combine features first — then scale the result. Scaling a feature before creating interactions would change the meaning of the interaction term.


Section 02

Six Types of Feature Engineering

✖️
Interaction Features
Multiply or divide two features to capture their combined effect — often more predictive than either alone.
spend / delivery_days
age × income
price × quantity
📈
Polynomial Features
Raise features to powers (x², x³) to let linear models fit non-linear relationships without switching to a non-linear algorithm.
age² → diminishing returns
distance³ → cubic cost
x¹ x² x₁x₂
📅
Date / Time Features
Extract hour, day, month, quarter, weekday, is_weekend, days_since from timestamps. Cyclical features need sin/cos encoding.
dt.hour, dt.dayofweek
sin(2π×hour/24)
days_since_signup
📦
Binning / Discretisation
Convert continuous features into discrete buckets. Equal-width (cut) or equal-frequency (qcut) bins expose non-linear patterns to linear models.
age → 18-25, 26-35
income → Low/Med/High
pd.cut, pd.qcut
📊
Aggregation Features
Compute group statistics (mean, std, count, max) per entity — capturing behaviour patterns that single-row features cannot express.
customer avg purchase
city median salary
product return rate
🔄
Domain-Specific Features
Features crafted from domain knowledge — ratios, flags, and transformations that encode expert understanding of the problem.
debt_to_income_ratio
bmi = weight/height²
churn_risk_score

Section 03

Interaction Features — Combining Columns

Two features that are individually weak can become a powerful predictor when combined. Interaction features capture the joint effect of two variables — a signal that exists only in their relationship, not in either variable alone.

The Loan Model That Finally Understood Risk
A microfinance institution's default prediction model used income and debt as separate features. Income alone was a weak predictor — some high earners defaulted due to lifestyle spending. Debt alone was also weak — some high-debt customers were high earners who easily serviced it. When the data scientist created debt_to_income_ratio = debt / (income + 1) and monthly_obligation_pct = emi / monthly_income × 100, the model's AUC improved from 0.71 to 0.84. These interaction features encoded the fundamental credit risk concept that neither income nor debt alone captured: how much of your income is consumed by your obligations? The two raw features were individually mediocre. Their ratio was the single most predictive feature in the entire dataset.
import pandas as pd
import numpy  as np

# ── Ratio features ────────────────────────────────────
df['debt_to_income']       = df['total_debt'] / (df['annual_income'] + 1)
df['spend_per_delivery_day'] = df['purchase_amount'] / (df['delivery_days'] + 1)
df['revenue_per_visit']     = df['total_revenue'] / (df['visit_count'] + 1)

# ── Product features ──────────────────────────────────
df['age_x_income']     = df['age'] * df['income_bracket_enc']
df['price_x_quantity']  = df['unit_price'] * df['quantity']

# ── Difference features ───────────────────────────────
df['balance_change']    = df['balance_end'] - df['balance_start']
df['price_discount']    = df['original_price'] - df['sale_price']
df['discount_pct']      = df['price_discount'] / (df['original_price'] + 1) * 100

# ── Flag / binary interaction ─────────────────────────
df['is_high_value_weekend'] = (
    (df['purchase_amount'] > df['purchase_amount'].quantile(0.75)) &
    (df['is_weekend'] == True)
).astype(int)

# ── Always protect against division by zero ───────────
df['safe_ratio'] = np.where(df['denominator'] > 0,
                            df['numerator'] / df['denominator'],
                            0)
📊 Feature Importance — Raw Features vs Engineered Interaction Features
Legend: Raw features · Engineered interaction features

debt_to_income_ratio is the single most important feature, more predictive than either total_debt or annual_income alone. spend_per_delivery_day also ranks above its component features. In this example, the interaction features outperform the raw inputs they were built from.
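The ratio patterns in the code above have two side effects worth knowing: a `+ 1` offset shifts the ratio noticeably when the denominator is small (delivery days, for example), and `np.where` still computes the division for every row before masking. A minimal sketch of an alternative follows, assuming a hypothetical helper named `safe_ratio` and made-up order data; it uses `np.divide` with a `where` mask so that the division is only performed where the denominator is non-zero.

```python
import numpy as np
import pandas as pd

def safe_ratio(numerator, denominator, fill=0.0):
    """Element-wise numerator / denominator; `fill` where the denominator is zero.

    Hypothetical helper for illustration, not part of pandas or NumPy.
    """
    num = np.asarray(numerator, dtype=float)
    den = np.asarray(denominator, dtype=float)
    out = np.full(num.shape, fill, dtype=float)
    # The division only runs where the mask is True, so zero-denominator
    # rows keep the fill value and never produce inf or NaN.
    np.divide(num, den, out=out, where=den != 0)
    return out

# Made-up example data
orders = pd.DataFrame({
    "purchase_amount": [100.0, 250.0, 80.0],
    "delivery_days":   [2, 0, 4],
})
orders["spend_per_delivery_day"] = safe_ratio(
    orders["purchase_amount"], orders["delivery_days"]
)
print(orders)
```

Whether 0 is the right fill value is a modelling decision; for some features a missing-value marker plus an explicit "denominator was zero" flag may describe the data better.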


Section 04

Polynomial Features — Capturing Non-Linear Relationships

A linear model can only fit straight lines. When the true relationship between a feature and the target is curved — diminishing returns, exponential growth, a U-shape — a linear model will never capture it regardless of how much data you have. Polynomial features solve this by creating x², x³, and cross-terms, allowing a linear model to fit non-linear patterns without switching to a non-linear algorithm.

from sklearn.preprocessing import PolynomialFeatures

# ── Degree 2: adds x², y², and x×y ───────────────────
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X_train[['age', 'income']])

feature_names = poly.get_feature_names_out(['age', 'income'])
print(feature_names)
# ['age', 'income', 'age^2', 'age income', 'income^2']

# ── Manual polynomial (more readable) ────────────────
df['age_sq']      = df['age'] ** 2
df['age_cubed']   = df['age'] ** 3
df['income_sq']   = df['income'] ** 2
df['age_x_inc']   = df['age'] * df['income']    # cross-term

# ── IMPORTANT: Standardise before polynomial expansion ──
# This is the one deliberate exception to "engineer first, scale later":
# powers of unscaled features → huge numbers → numerical instability
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model  import Ridge

poly_pipeline = Pipeline([
    ('scaler', StandardScaler()),           # scale FIRST
    ('poly',   PolynomialFeatures(degree=2, include_bias=False)),
    ('model',  Ridge(alpha=1.0))
])
poly_pipeline.fit(X_train, y_train)
📊 Output: Linear vs Polynomial Fit — Capturing Non-Linear Relationship
Legend: Data points · Linear fit (underfit) · Degree-2 polynomial (correct)

The linear fit (red) misses the curve entirely — it systematically underpredicts at both extremes. The degree-2 polynomial (green) follows the true U-shaped relationship. Adding age² to the feature set lets the linear model fit this curve without changing the algorithm.
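The gap between the two fits can be reproduced on synthetic data. The sketch below is an illustration under stated assumptions: the U-shaped target, its minimum near age 44, and the noise level are all made up, and R² is measured in-sample purely to compare how well each model can follow the curve.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
age = rng.uniform(18, 70, size=300)
# Synthetic U-shaped target with its minimum near age 44, plus noise
y = 0.05 * (age - 44) ** 2 + rng.normal(0, 1.0, size=300)
X = age.reshape(-1, 1)

# Straight-line fit on age alone
linear = LinearRegression().fit(X, y)

# Same linear algorithm, but fed age and age^2
quadratic = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
).fit(X, y)

r2_linear = linear.score(X, y)
r2_quadratic = quadratic.score(X, y)
print(f"R^2 linear:   {r2_linear:.3f}")
print(f"R^2 degree-2: {r2_quadratic:.3f}")
```

The algorithm is identical in both cases; only the feature set changes, which is the point of the section.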


Section 05

Date & Time Feature Extraction

Datetime columns are one of the richest sources of feature engineering in business data. A single timestamp contains many potential signals — hour of day, day of week, seasonality, recency, and cyclical patterns. Extracting these signals explicitly is far more informative than feeding a raw timestamp or integer to a model.

The Delivery Model That Did Not Know It Was Friday
A logistics company's delivery time prediction model used raw timestamps as Unix integers (seconds since epoch). The model treated Monday at 9 AM and Friday at 9 AM as numerically similar — because their Unix timestamps differed by only 4 days × 86,400 seconds. But Friday deliveries in India take 40% longer on average due to weekend traffic and reduced warehouse staffing. By extracting is_friday, day_of_week, and hour_of_day from the timestamp and adding them as explicit features, the model's RMSE on delivery time prediction dropped from 4.2 hours to 1.8 hours. The information had always been inside the timestamp — but buried in arithmetic the model could not decode without explicit extraction.
# ── Parse and extract all datetime features ───────────
df['order_date'] = pd.to_datetime(df['order_date'])

# Basic calendar features
df['year']        = df['order_date'].dt.year
df['month']       = df['order_date'].dt.month
df['day']         = df['order_date'].dt.day
df['hour']        = df['order_date'].dt.hour
df['day_of_week']  = df['order_date'].dt.dayofweek   # 0=Mon, 6=Sun
df['quarter']     = df['order_date'].dt.quarter
df['week_of_year'] = df['order_date'].dt.isocalendar().week

# Derived flag features
df['is_weekend']   = (df['day_of_week'] >= 5).astype(int)
df['is_friday']    = (df['day_of_week'] == 4).astype(int)
df['is_month_end']  = df['order_date'].dt.is_month_end.astype(int)
df['is_quarter_end']= df['order_date'].dt.is_quarter_end.astype(int)

# Recency features: days elapsed relative to a fixed reference date
# (a fixed date keeps the feature reproducible; pd.Timestamp.today() changes on every run)
ref_date = pd.Timestamp('2024-01-01')
df['days_since_signup']  = (df['order_date'] - pd.to_datetime(df['signup_date'])).dt.days
df['days_since_purchase'] = (ref_date - df['order_date']).dt.days

# ── Cyclical encoding for hour and month ──────────────
# Prevents model from thinking Hour 23 is far from Hour 0
df['hour_sin']   = np.sin(2 * np.pi * df['hour']  / 24)
df['hour_cos']   = np.cos(2 * np.pi * df['hour']  / 24)
df['month_sin']  = np.sin(2 * np.pi * df['month'] / 12)
df['month_cos']  = np.cos(2 * np.pi * df['month'] / 12)
📊 Cyclical Encoding — Hour of Day as Sin/Cos (Hour 23 ≈ Hour 0)
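Legend: sin(2π×hour/24) · cos(2π×hour/24) · Raw hour (wrong: treats 23 and 0 as far apart)

Raw hour (red dashed) treats 23:00 and 00:00 as maximally different, a distance of 23. Sin/cos encoding places them next to each other on the unit circle: hour 23 maps to (sin, cos) ≈ (−0.26, 0.97) and hour 0 maps to (0, 1). Both components are needed, because sine alone assigns the same value to two different hours (for example, hours 0 and 12 both give sin = 0). This is essential for any cyclical variable: hours, months, days of week.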
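The claim can be checked numerically by measuring distances in the encoded space. This is a small sketch; the helper names `encode_hour` and `cyclical_distance` are hypothetical and exist only for this illustration.

```python
import numpy as np

def encode_hour(hour):
    """Map an hour (0-23) to a (sin, cos) point on the unit circle."""
    angle = 2 * np.pi * hour / 24
    return np.array([np.sin(angle), np.cos(angle)])

def cyclical_distance(h1, h2):
    """Euclidean distance between two hours in (sin, cos) space."""
    return float(np.linalg.norm(encode_hour(h1) - encode_hour(h2)))

d_23_0 = cyclical_distance(23, 0)    # one hour apart across midnight
d_0_6 = cyclical_distance(0, 6)      # a quarter of the day apart
d_0_12 = cyclical_distance(0, 12)    # opposite sides of the clock

print(f"23:00 vs 00:00 -> {d_23_0:.3f}")
print(f"00:00 vs 06:00 -> {d_0_6:.3f}")
print(f"00:00 vs 12:00 -> {d_0_12:.3f}")
```

With the raw integer encoding the first pair would be the farthest apart (23 units); in the encoded space it is the closest, and midnight versus noon is the farthest.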


Section 06

Binning & Discretisation — Grouping Continuous Values

Binning converts a continuous numeric column into ordered categorical buckets. This is useful when: the relationship between a feature and the target is step-wise rather than linear; you want to create interaction features between two continuous variables; or you want to let a linear model learn non-linear thresholds without using polynomial features.

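# ── Explicit bin edges (pd.cut) ───────────────────────
# Pass a list of edges for custom ranges, or an integer (e.g. bins=5)
# for equal-width bins that each cover the same numeric range
df['age_group'] = pd.cut(
    df['age'],
    bins=[0, 25, 35, 45, 55, 100],
    labels=['18-25', '26-35', '36-45', '46-55', '56+'],
    right=True    # intervals are (0, 25], (25, 35], ... so age 55 falls in '46-55'
)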

# ── Equal-frequency bins (pd.qcut) ────────────────────
# Each bin contains the same number of rows
df['income_quartile'] = pd.qcut(
    df['annual_income'],
    q=4,
    labels=['Q1_Low', 'Q2_Med', 'Q3_High', 'Q4_VHigh']
)

# ── KBinsDiscretizer (sklearn, pipeline-friendly) ─────
# In a real project, fit on the training split only and reuse it on test data
from sklearn.preprocessing import KBinsDiscretizer
kbd = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile')
df['income_bin'] = kbd.fit_transform(df[['annual_income']]).ravel()

# ── Custom domain bins ────────────────────────────────
# right=False gives [0, 18.5), [18.5, 25), [25, 30), [30, 100), matching the
# usual BMI cut-offs (underweight < 18.5, obese >= 30)
bmi_bins   = [0, 18.5, 25, 30, 100]
bmi_labels = ['Underweight', 'Normal', 'Overweight', 'Obese']
df['bmi_category'] = pd.cut(df['bmi'], bins=bmi_bins, labels=bmi_labels, right=False)

# ── Encode binned features for modelling ──────────────
df['age_group_enc'] = df['age_group'].cat.codes     # ordinal
df = pd.get_dummies(df, columns=['age_group'], dtype=int) # one-hot
📊 pd.cut vs pd.qcut — Equal-Width vs Equal-Frequency Bins

pd.cut — Equal-Width Bins

pd.qcut — Equal-Frequency Bins

pd.cut creates bins of equal numeric width — some bins may have very few data points if the distribution is skewed. pd.qcut creates bins with equal numbers of data points — better for skewed data as every bin gets balanced representation.
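The difference is easy to see by counting rows per bin on a skewed column. The sketch below assumes a synthetic, log-normally distributed income column; the distribution parameters are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic right-skewed incomes (assumed log-normal for illustration)
income = pd.Series(rng.lognormal(mean=13, sigma=0.8, size=1000),
                   name="annual_income")

equal_width = pd.cut(income, bins=4)    # 4 bins of equal numeric width
equal_freq = pd.qcut(income, q=4)       # 4 bins with (almost) equal row counts

width_counts = equal_width.value_counts(sort=False)
freq_counts = equal_freq.value_counts(sort=False)

print("pd.cut row counts per bin: ", width_counts.tolist())
print("pd.qcut row counts per bin:", freq_counts.tolist())
```

On data like this, the equal-width bins are dominated by the lowest range because a handful of very large incomes stretch the axis, while the quantile bins stay balanced.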


Section 07

Aggregation Features — Capturing Group Behaviour

Aggregation features compute statistics about a group that a single row cannot express. A customer's individual purchase amount is limited information. The customer's average purchase amount over 12 months, their maximum single purchase, and the standard deviation of their purchases — these reveal spending personality. Aggregation features are among the most powerful in time-series and customer analytics problems.

# ── Customer-level aggregations ───────────────────────
cust_agg = df.groupby('customer_id').agg(
    cust_total_spend    = ('purchase_amount', 'sum'),
    cust_avg_spend      = ('purchase_amount', 'mean'),
    cust_max_spend      = ('purchase_amount', 'max'),
    cust_std_spend      = ('purchase_amount', 'std'),
    cust_order_count    = ('order_id',        'count'),
    cust_days_active    = ('order_date',      'nunique'),
    cust_avg_rating     = ('rating',          'mean'),
    cust_return_rate    = ('is_returned',     'mean'),
).reset_index()

# Merge back to original dataframe
# (for modelling, compute these statistics on the training split only; see Rule 7)
df = df.merge(cust_agg, on='customer_id', how='left')

# ── Product-level aggregations ────────────────────────
prod_agg = df.groupby('product_id').agg(
    prod_avg_rating  = ('rating',       'mean'),
    prod_sale_count  = ('order_id',     'count'),
    prod_return_rate = ('is_returned',  'mean'),
    prod_avg_price   = ('unit_price',   'mean')
).reset_index()
df = df.merge(prod_agg, on='product_id', how='left')

# ── City-level aggregations ───────────────────────────
city_agg = df.groupby('city')['purchase_amount'].agg(
    city_avg_spend='mean', city_spend_std='std'
).reset_index()
df = df.merge(city_agg, on='city', how='left')

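# ── Rolling window aggregations (time series) ─────────
# rolling(30) is a window of the last 30 ROWS (orders) per customer, not 30 days.
# A calendar window such as '30D' needs a datetime index.
df = df.sort_values(['customer_id', 'order_date'])
df['rolling_30_order_avg_spend'] = (df
    .groupby('customer_id')['purchase_amount']
    .transform(lambda x: x.rolling(30, min_periods=1).mean())
)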
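An integer window counts rows, so a true calendar window needs an offset string and a datetime index. The following is a minimal sketch on made-up orders; the column name `rolling_30d_avg_spend` is an assumption, and the positional assignment at the end relies on the frame already being sorted by customer and date.

```python
import pandas as pd

# Made-up orders, sorted by customer and date with a clean 0..n-1 index
orders = pd.DataFrame({
    "customer_id": ["A", "A", "A", "B", "B"],
    "order_date": pd.to_datetime(
        ["2024-01-01", "2024-01-20", "2024-03-15", "2024-02-01", "2024-02-10"]
    ),
    "purchase_amount": [100.0, 200.0, 50.0, 40.0, 60.0],
}).sort_values(["customer_id", "order_date"]).reset_index(drop=True)

# '30D' = every order in the 30 days up to and including the current one
rolled = (
    orders.set_index("order_date")
          .groupby("customer_id")["purchase_amount"]
          .rolling("30D", min_periods=1)
          .mean()
)

# The result is ordered by customer, then date, which matches the sorted
# frame, so the values can be assigned back by position.
orders["rolling_30d_avg_spend"] = rolled.to_numpy()
print(orders)
```

Customer A's third order falls more than 30 days after the previous one, so its window contains only itself; a 30-row window would have averaged all three orders.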

Section 08

Domain-Specific Feature Engineering

The most powerful features are those derived from domain knowledge: features that encode what a human expert would consider when making the same prediction. These are hard for automated feature engineering tools to discover, because they depend on understanding the business context, not just the data.

Domain | Raw Features | Engineered Feature | Why It Works
Credit Risk | debt, income | debt / income | Standard credit risk metric; captures affordability
E-Commerce | total_spend, visits | spend / visits | Captures value per engagement; predicts premium behaviour
Healthcare | weight (kg), height (m) | weight / height² | BMI, an established medical risk indicator
Logistics | order_date, delivery_date | (delivery − order).days | Actual delivery time, the metric customers care about
Retail | last_purchase_date, today | days_since_last_purchase | Recency, the strongest predictor of next purchase intent
Telecom | calls_made, plan_limit | calls / plan_limit × 100 | Plan utilisation %; predicts upsell and churn
Finance | current_assets, liabilities | current_assets / liabilities | Current ratio, a standard liquidity metric

Section 09

Feature Scaling — Why It Must Come After Engineering

Feature scaling transforms numeric values into a consistent range so that distance-based and gradient-based algorithms treat all features equally. It must come after feature engineering because: (1) scaling before creating interaction features changes the meaning of the interaction; (2) newly created features may have very different ranges and also need scaling; and (3) binned and encoded features must be created from the raw values before any scaling is applied to numeric columns.

⚠️
The Correct Order Is Non-Negotiable

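Always follow this order: (1) clean data, (2) engineer features, (3) encode categoricals, (4) scale numerics, (5) select features. If you scale before engineering, your interaction features will be computed on scaled values, losing their interpretability and changing their mathematical relationship to the target. If you scale before encoding, it is easy to lose track of which columns are numeric and which are categorical, and passing a text column to a scaler raises an error. The one deliberate exception is polynomial expansion (Section 04), where standardising first keeps the powers numerically stable.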
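A small numeric experiment shows why the order matters for ratio features. The sketch below uses made-up debt and income values and a hypothetical `standardise` helper; it compares a ratio built from raw values with one built from already-standardised values.

```python
import numpy as np

rng = np.random.default_rng(1)
# Made-up, strictly positive raw values
income = rng.uniform(200_000, 2_000_000, size=500)
debt = rng.uniform(10_000, 800_000, size=500)

def standardise(x):
    """Z-score: subtract the mean, divide by the standard deviation."""
    return (x - x.mean()) / x.std()

# Correct order: engineer the ratio from raw values (scale it afterwards)
raw_ratio = debt / income

# Wrong order: scale first, then divide two z-scores
ratio_of_scaled = standardise(debt) / standardise(income)

print("raw ratio range:       ", raw_ratio.min(), raw_ratio.max())
print("ratio-of-scaled range: ", ratio_of_scaled.min(), ratio_of_scaled.max())
print("share of negative ratio-of-scaled values:", (ratio_of_scaled < 0).mean())
```

The raw ratio is always positive and bounded by the data. The ratio of z-scores changes sign whenever the two values sit on opposite sides of their means and explodes whenever income is close to its mean, so it no longer measures how much of a person's income is consumed by debt.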

MinMaxScaler
(x − min) / (max − min)
Output: [0, 1]
Neural networks, image data. Sensitive to outliers. Bounded range.
StandardScaler
(x − μ) / σ
Output: mean=0, std=1
Default for linear models, SVM, PCA. Partially affected by outliers.
RobustScaler
(x − Q2) / IQR
Output: centred on median
When legitimate outliers exist. Uses median + IQR — outlier-resistant.
MaxAbsScaler
x / max(|x|)
Output: [−1, +1]
Sparse data (TF-IDF). Preserves zero entries. No mean shifting.
📊 Output: All Four Scalers — Same Engineered Features, Four Transformations
Legend: Raw (unscaled) · MinMaxScaler · StandardScaler · RobustScaler

Raw features have wildly different ranges — salary dwarfs all others. After scaling, all features are comparable. Notice debt_to_income_ratio (an engineered feature) also needs scaling — newly created features do not arrive pre-scaled.

Scaling the Engineered Features

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# All columns to scale — including engineered ones
scale_cols = [
    'age', 'income', 'salary',            # original numeric
    'debt_to_income', 'spend_per_day',       # engineered ratios
    'age_sq', 'income_sq', 'age_x_income',   # polynomial features
    'cust_avg_spend', 'cust_return_rate',    # aggregation features
    'days_since_signup', 'hour_sin',          # time features
]

# Standard: for linear models, SVM, PCA
std_scaler = StandardScaler()
std_scaler.fit(X_train[scale_cols])
X_train[scale_cols] = std_scaler.transform(X_train[scale_cols])
X_test[scale_cols]  = std_scaler.transform(X_test[scale_cols])

# Robust: for features with outliers (e.g. aggregated revenue)
robust_cols = ['cust_total_spend', 'cust_max_spend']
rob_scaler  = RobustScaler()
rob_scaler.fit(X_train[robust_cols])
X_train[robust_cols] = rob_scaler.transform(X_train[robust_cols])
X_test[robust_cols]  = rob_scaler.transform(X_test[robust_cols])
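To see what each scaler does to an outlier, the four formulas from the cards above can be applied by hand to a tiny array. This sketch uses plain NumPy with made-up values rather than the sklearn classes; it assumes the population standard deviation for the standard formula and the 25th to 75th percentile range for the IQR.

```python
import numpy as np

# Made-up feature with one extreme outlier
x = np.array([20.0, 30.0, 40.0, 50.0, 1000.0])

# MinMax: (x - min) / (max - min)
min_max = (x - x.min()) / (x.max() - x.min())

# Standard: (x - mean) / std, using the population std (ddof=0)
standard = (x - x.mean()) / x.std()

# Robust: (x - median) / IQR
q1, median, q3 = np.percentile(x, [25, 50, 75])
robust = (x - median) / (q3 - q1)

# MaxAbs: x / max(|x|)
max_abs = x / np.abs(x).max()

for name, values in [("min_max", min_max), ("standard", standard),
                     ("robust", robust), ("max_abs", max_abs)]:
    print(f"{name:>8}: {np.round(values, 3)}")
```

With min-max and max-abs scaling the outlier squeezes the four ordinary values into a sliver near the bottom of the range. The robust formula keeps them evenly spaced around zero and lets the outlier remain an obvious outlier.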

Section 10

Complete Feature Engineering + Scaling Pipeline

In production, all feature engineering and scaling steps must be encapsulated in a single sklearn pipeline so they are applied identically at training and inference time. A custom transformer wraps the engineering logic; ColumnTransformer applies different scalers to different column groups.

from sklearn.pipeline        import Pipeline
from sklearn.compose         import ColumnTransformer
from sklearn.preprocessing   import StandardScaler, RobustScaler, MinMaxScaler
from sklearn.preprocessing   import OneHotEncoder, PolynomialFeatures
from sklearn.base             import BaseEstimator, TransformerMixin
from sklearn.impute           import SimpleImputer
from sklearn.ensemble         import GradientBoostingClassifier
import joblib
import numpy  as np
import pandas as pd

# ── 1. Custom feature engineering transformer ─────────
class FeatureEngineer(BaseEstimator, TransformerMixin):
    def __init__(self, reference_date=None):
        # Pass a fixed date (e.g. the training cut-off) for reproducible recency
        # features; None falls back to today's date at transform time.
        self.reference_date = reference_date

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        order_dt = pd.to_datetime(X['order_date'])
        ref_date = (pd.Timestamp.today() if self.reference_date is None
                    else pd.Timestamp(self.reference_date))
        # Interaction features
        X['debt_to_income']  = X['debt']  / (X['income']  + 1)
        X['spend_per_day']   = X['spend']  / (X['days']    + 1)
        # Percentage in [0, 100], matching bounded_cols below
        X['discount_pct']    = (X['orig_price'] - X['sale_price']) / (X['orig_price'] + 1) * 100
        # Date features
        X['day_of_week']     = order_dt.dt.dayofweek
        X['is_weekend']      = (X['day_of_week'] >= 5).astype(int)
        X['hour_sin']        = np.sin(2 * np.pi * order_dt.dt.hour / 24)
        X['hour_cos']        = np.cos(2 * np.pi * order_dt.dt.hour / 24)
        X['days_since_signup'] = (ref_date - pd.to_datetime(X['signup_date'])).dt.days
        return X.drop(columns=['order_date', 'signup_date'])

# ── 2. Column groups after engineering ────────────────
standard_cols = ['age', 'income', 'credit_score', 'debt_to_income',
                 'spend_per_day', 'hour_sin', 'hour_cos', 'days_since_signup']
robust_cols   = ['total_spend', 'max_spend']    # outlier-prone
bounded_cols  = ['discount_pct', 'utilisation_pct']  # naturally [0,100]
cat_cols      = ['city', 'gender', 'product_type']

# ── 3. Preprocessing sub-pipelines ───────────────────
std_pipe    = Pipeline([('imp',SimpleImputer(strategy='median')),    ('sc',StandardScaler())])
robust_pipe = Pipeline([('imp',SimpleImputer(strategy='median')),    ('sc',RobustScaler())])
mm_pipe     = Pipeline([('imp',SimpleImputer(strategy='median')),    ('sc',MinMaxScaler())])
cat_pipe    = Pipeline([('imp',SimpleImputer(strategy='most_frequent')),
                         ('ohe',OneHotEncoder(handle_unknown='ignore',sparse_output=False))])

preprocessor = ColumnTransformer([
    ('std',    std_pipe,    standard_cols),
    ('robust', robust_pipe, robust_cols),
    ('mm',     mm_pipe,     bounded_cols),
    ('cat',    cat_pipe,    cat_cols),
])

# ── 4. Full pipeline ─────────────────────────────────
full_pipeline = Pipeline([
    ('engineer',    FeatureEngineer()),
    ('preprocessor',preprocessor),
    ('model',       GradientBoostingClassifier(n_estimators=300, random_state=42))
])

full_pipeline.fit(X_train, y_train)
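# Pipeline.score() returns accuracy for a classifier, so compute AUC explicitly
# (assumes a binary target; column 1 is the positive-class probability)
from sklearn.metrics import roc_auc_score
test_auc = roc_auc_score(y_test, full_pipeline.predict_proba(X_test)[:, 1])
print(f"Test AUC: {test_auc:.3f}")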
joblib.dump(full_pipeline, 'feature_eng_pipeline.pkl')

# At inference — raw data in, predictions out
pipeline = joblib.load('feature_eng_pipeline.pkl')
predictions = pipeline.predict(new_raw_df)
📊 Output: Model Performance — Raw Features vs Engineered + Scaled
Legend: Raw features only · + Feature Engineering · + Engineering + Scaling

Feature engineering alone provides the biggest accuracy jump across all algorithms, especially for linear models, where engineered features expose non-linear patterns. Adding scaling on top delivers a further gain for distance-based and gradient-based algorithms (KNN, SVM, Neural Net, Logistic Regression). Tree models (RF, XGB) split on thresholds, so they gain little or nothing from scaling, but they still benefit from engineering.


Section 11

Multicollinearity — Where It Enters & How to Fix It

Multicollinearity occurs when two or more features are so highly correlated with each other that one can be predicted from the others. It does not affect every algorithm. In unregularised linear and logistic regression it silently destroys coefficient interpretability, inflates standard errors, and makes the coefficients unstable; Ridge and Lasso reduce the damage but do not remove the redundancy; and in KNN the duplicated signal is counted more than once in every distance. Crucially, it enters your dataset most often as a result of feature engineering itself, the very step designed to improve your model.

The Bank That Could Not Explain Its Own Model
A private bank built a logistic regression model to predict loan default. Their data science team had done excellent feature engineering: they created debt_to_income, income_sq, debt_sq, and age_x_income among others. The model achieved 0.84 AUC. But when the RBI auditor asked the team to explain the coefficient on annual_income, the answer was a large coefficient whose sign implied that higher income predicts more default. This was obviously wrong. The real problem: annual_income, debt_to_income, and income_sq were highly correlated (VIF > 45). Multicollinearity had made the coefficients mathematically unstable, so a tiny change in the data could flip the sign of the income coefficient entirely. The model's predictions were fine. Its explanations were meaningless. After removing the multicollinear features and switching to an L2-regularised (Ridge-style) logistic regression, the income coefficient took the sign that domain knowledge predicts, and the model passed the audit.
🗺️ Where Multicollinearity Enters the Pipeline
[Diagram: Raw Data (clean) → Feature Engineering → One-Hot Encoding → Polynomial Features → MC check (VIF + correlation heatmap) → Model (safe to train). Multicollinearity enters at three stages. Feature engineering: debt and debt/income are correlated, as are income and income². One-hot encoding: the dummy variable trap, where the N dummy columns always sum to 1.0; fix with drop_first=True. Polynomial features: age and age² are correlated; use Ridge after PolynomialFeatures.]

Multicollinearity is a post-engineering, pre-modelling problem. Always run the VIF check and correlation heatmap after completing all feature engineering and encoding — before fitting any linear or distance-based model.

Which Models Are Harmed vs Safe

Algorithm | Harmed? | Why | Fix
Linear Regression | Severely | Coefficients become unstable; tiny data changes flip signs | Remove features or use Ridge
Logistic Regression | Severely | Same as linear: gradient instability, uninterpretable coefficients | Remove features or use Ridge/L2
Ridge Regression | Partially | L2 penalty stabilises correlated coefficients; still interpretable | Already mitigated; preferred fix
Lasso Regression | Partially | Auto-selects one of a correlated pair and zeros the other | Already mitigated; auto-selects
KNN | Yes | Redundant features inflate distance; similar points look far apart | Drop duplicated-signal features
SVM | Mildly | Affects margin geometry but kernel methods partially mitigate it | Apply PCA before SVM
Decision Tree | No | Splits one feature at a time; redundant features are simply unused | None needed
Random Forest / XGBoost | No | Tree-based; threshold splits are rank-based, not distance-based | None needed
Neural Network | Mildly | Handles it but slows convergence; redundant features waste capacity | Drop highly correlated pairs

Detecting Multicollinearity

import pandas as pd
import numpy  as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

# ── Method 1: Correlation matrix ─────────────────────
corr = X_train.corr().abs()
upper = corr.where(
    np.triu(np.ones(corr.shape, dtype=bool), k=1)
)
problem_pairs = [
    (col, row, upper.loc[row, col])
    for col in upper.columns
    for row in upper.index
    if upper.loc[row, col] > 0.85
]
print(pd.DataFrame(problem_pairs, columns=['feature_1','feature_2','correlation']))

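# ── Method 2: Variance Inflation Factor (VIF) ────────
# Common rule of thumb: VIF > 10 = serious problem
# statsmodels does not add an intercept, so add a constant column first;
# without it the VIFs of uncentred features are misleading
from statsmodels.tools.tools import add_constant

X_vif = add_constant(X_train.astype(float), has_constant='add')
vif_data = pd.DataFrame({
    'feature': X_vif.columns[1:],                      # skip the constant
    'VIF':     [variance_inflation_factor(X_vif.values, i)
               for i in range(1, X_vif.shape[1])]
}).sort_values('VIF', ascending=False)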

print(vif_data)
# VIF > 10  → serious multicollinearity — must fix
# VIF 5-10  → moderate — worth investigating
# VIF < 5   → acceptable

# ── Flag the worst offenders ──────────────────────────
high_vif = vif_data[vif_data['VIF'] > 10]
print(f"Features with VIF > 10: {high_vif['feature'].tolist()}")
📊 Output: Variance Inflation Factor (VIF) — Before & After Fixing
Legend: Before fixing · After fixing · VIF=10 danger threshold · VIF=5 warning threshold

income_sq (VIF=48.2) and debt_to_income (VIF=31.6) are severely collinear with annual_income and total_debt. After dropping one feature from each collinear pair, all VIF values fall below 5.0 (green bars). Switching to Ridge does not change the VIF values themselves; it stabilises the coefficients of the features that remain.
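The VIF of a feature is defined as 1 / (1 − R²), where R² comes from regressing that feature on all the other features plus an intercept. The sketch below computes it directly with NumPy to make the definition concrete; the function name `vif_scores` and the synthetic income and age columns are assumptions for illustration, not the bank's data.

```python
import numpy as np

def vif_scores(X):
    """VIF_j = 1 / (1 - R^2_j), regressing column j on the others plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    scores = []
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residuals = target - others @ coef
        r2 = 1.0 - residuals.var() / target.var()
        scores.append(1.0 / (1.0 - r2))
    return np.array(scores)

rng = np.random.default_rng(0)
income = rng.normal(50, 10, size=1000)     # synthetic, far from zero
income_sq = income ** 2                    # engineered from income
age = rng.normal(40, 8, size=1000)         # unrelated to income

vif = vif_scores(np.column_stack([income, income_sq, age]))
for name, value in zip(["income", "income_sq", "age"], vif):
    print(f"{name:>10}: VIF = {value:.1f}")
```

The squared term and its parent inflate each other's VIF, while the unrelated column stays close to 1, which is the pattern the chart describes.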

Fixing Multicollinearity — Four Strategies

# ── Strategy 1: Drop one of each correlated pair ─────
# Keep the one with higher target correlation
target_corr = X_train.corrwith(y_train).abs()

def drop_multicollinear(X, threshold=0.85):
    corr = X.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = set()
    for col in upper.columns:
        correlated = upper.index[upper[col] > threshold].tolist()
        for row in correlated:
            # Drop whichever has lower target correlation
            drop = col if target_corr.get(col, 0) < target_corr.get(row, 0) else row
            to_drop.add(drop)
    print(f"Dropping {len(to_drop)} collinear features: {to_drop}")
    return X.drop(columns=list(to_drop))

X_train = drop_multicollinear(X_train, threshold=0.85)
X_test  = X_test[X_train.columns]    # keep the same columns at test time

# ── Strategy 2: Ridge Regression (L2 regularisation) ─
# Does NOT remove features — stabilises unstable coefficients
from sklearn.linear_model import RidgeCV
ridge = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0], cv=5)
ridge.fit(X_train_scaled, y_train)
print(f"Best alpha: {ridge.alpha_}")

# ── Strategy 3: PCA — compress correlated features ───
from sklearn.decomposition import PCA
# All correlated features → a few orthogonal (uncorrelated) components
pca = PCA(n_components=0.95)   # keep 95% of variance
X_pca = pca.fit_transform(X_train_scaled)
print(f"Reduced from {X_train.shape[1]} to {X_pca.shape[1]} components")

# ── Strategy 4: Fix dummy variable trap ──────────────
# Use drop_first=True for unregularised linear/logistic regression
df_encoded = pd.get_dummies(df, columns=['city', 'gender'],
                             drop_first=True, dtype=int)
# sklearn OneHotEncoder
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(drop='first', sparse_output=False)
🗺️ Multicollinearity Fix Decision Tree
[Diagram: decision tree for fixing multicollinearity based on VIF score and model type. VIF < 5: safe to proceed, no action needed. VIF 5–10 or above 10: if the model is linear (linear / logistic regression, SVM), use Ridge or Lasso or drop one of the correlated features; if it is tree-based, it is safe to ignore.]

Decision rule: VIF below 5 — safe for all models. VIF 5–10 — investigate; fix if using a linear model. VIF above 10 — must fix for linear models; tree models can safely ignore it.

🎯
Pro Tip — After Polynomial Features Always Use Ridge

Whenever you use PolynomialFeatures(degree≥2) on features that are not centred, the resulting columns (x, x², x³, x×y) are usually highly correlated by construction; standardising first reduces, but rarely eliminates, that correlation. Avoid plain ordinary least squares after polynomial expansion and prefer Ridge or Lasso, which are designed to cope with correlated inputs. This is the standard workflow for polynomial regression.
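How strongly x and x² are correlated depends on where x sits relative to zero. The sketch below measures it on a synthetic, uniformly distributed age column (an assumption for illustration), before and after standardising.

```python
import numpy as np

rng = np.random.default_rng(0)
age = rng.uniform(18, 70, size=2000)    # synthetic ages, all far from zero

# Raw: age and age^2 rise together over the whole range
corr_raw = np.corrcoef(age, age ** 2)[0, 1]

# Standardised: values on both sides of zero, so the square is U-shaped
age_centred = (age - age.mean()) / age.std()
corr_centred = np.corrcoef(age_centred, age_centred ** 2)[0, 1]

print(f"corr(age, age^2) on raw values:          {corr_raw:.3f}")
print(f"corr(age, age^2) on standardised values: {corr_centred:.3f}")
```

The near-zero result after centring holds for roughly symmetric features like this one; a skewed feature keeps a noticeable correlation with its own square even after standardising, which is one reason regularisation is still recommended.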


Section 12

Golden Rules of Feature Engineering & Scaling

🎯 9 Rules Every Data Scientist Must Follow
1
Engineer features before scaling. Creating a ratio from two raw features and then scaling it gives a meaningful, scaled ratio. Scaling first and then creating the ratio gives an arbitrary combination of two dimensionless numbers that loses interpretability. The one exception is polynomial expansion (Rule 5).
2
Start with domain knowledge, not automation. The most powerful features come from understanding what an expert human would consider when making the same prediction. Ask domain experts: "What ratios and combinations do you mentally compute when assessing a case?"
3
Always protect ratio features against division by zero: use numerator / (denominator + 1) when the denominator is a count or a large amount (the +1 distorts the ratio when the denominator is small), or use np.where(denominator > 0, numerator/denominator, 0). A single zero in the denominator will create NaN or inf that propagates through every downstream calculation.
4
Use cyclical encoding (sin/cos) for hour, month, day-of-week, and any other circular variable. Encoding hour as a raw integer tells the model that 23:00 is maximally different from 00:00 — which is false. The sin/cos pair correctly places them adjacent on the unit circle.
5
Scale polynomial features: StandardScaler → PolynomialFeatures, never the reverse. Polynomial expansion of unscaled features creates astronomically large values (salary² = 10¹²) that cause numerical instability in gradient-based models.
6
Prefer pd.qcut (equal frequency) over pd.cut (equal width) for skewed distributions. Equal-width bins on a skewed feature can put, say, 80% of the data in one bin and 5% in each of the others, so the model sees one large bucket and several nearly empty ones.
7
Aggregation features must be computed on training data and then mapped to test data — never computed on the full dataset. Computing customer average spend on the full dataset leaks test rows into the training features. Group by on train, then merge to test using the train-computed statistics.
8
Wrap all feature engineering in a custom sklearn transformer (BaseEstimator + TransformerMixin). This makes your engineering steps part of the pipeline — they are fitted on training data, applied to test data, and automatically applied at inference time. Hand-engineering outside the pipeline creates inconsistency and leakage.
9
Validate every engineered feature with a distribution check and a correlation-with-target check before including it in the model. A feature that seems logically correct might have a zero-variance implementation bug, or might already be perfectly correlated with another feature you created.
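Rule 7 is the one most often broken in practice, so here is a minimal sketch of the leak-free pattern on made-up data. The column name `cust_avg_spend` follows the earlier aggregation example; the choice to fall back to the training-set global mean for customers never seen in training is an assumption, and other fallbacks are equally reasonable.

```python
import pandas as pd

# Made-up train/test splits
train = pd.DataFrame({
    "customer_id": ["A", "A", "B", "B"],
    "purchase_amount": [100.0, 300.0, 50.0, 70.0],
})
test = pd.DataFrame({
    "customer_id": ["A", "C"],          # "C" never appears in train
    "purchase_amount": [900.0, 40.0],
})

# 1. Compute the statistic on the training rows only
cust_stats = (
    train.groupby("customer_id")["purchase_amount"]
         .mean()
         .rename("cust_avg_spend")
         .reset_index()
)
global_avg = train["purchase_amount"].mean()

# 2. Map the train-computed statistic onto both splits
train = train.merge(cust_stats, on="customer_id", how="left")
test = test.merge(cust_stats, on="customer_id", how="left")

# 3. Unseen customers get the training-set global average
test["cust_avg_spend"] = test["cust_avg_spend"].fillna(global_avg)
print(test)
```

Customer A's test purchase of 900 has no influence on the feature value the model sees for A, which is exactly what prevents test information from leaking into the features.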
🧮
Key Takeaway

The e-commerce platform that unlocked ₹18 crore in revenue did not build a better model — they built better features. Feature engineering is where a data scientist's domain knowledge, creativity, and mathematical intuition combine. Scaling is where mathematical rigour ensures the model sees those features fairly. Together they are the highest-leverage activity in the entire machine learning workflow. A brilliant feature set with a simple model will almost always outperform a poor feature set with a complex model. Invest your time here first.