Feature Engineering & Feature Scaling

Section 01

Why Feature Engineering Is the Most Impactful Step

Machine learning algorithms are mathematical functions — they can only learn from the signals you give them. If the raw data expresses those signals poorly, no amount of hyperparameter tuning or model complexity will recover them. Feature engineering is the art and science of transforming raw data into representations that expose the underlying patterns more clearly — creating new variables, combining existing ones, and scaling everything so that each feature speaks to the model in a language it can learn from.

📖 Real-World Story

The ₹18 Crore Revenue Unlock From Three New Columns

A major Indian e-commerce platform had been running a product recommendation model for two years. The model used 45 raw features (user age, purchase amount, product category, city, and so on) and achieved a click-through rate (CTR) of 3.2%. A data science consultant was brought in to improve it. Instead of changing the model architecture, she spent three days on feature engineering. She created: (1) days_since_last_purchase from the order timestamps, capturing recency, the strongest predictor of future purchase intent; (2) avg_spend_per_visit by dividing total lifetime spend by session count, capturing customer value density; (3) hour_of_day_encoded using sine/cosine encoding of the purchase hour, capturing time-of-day shopping patterns without the false linear ordering of a raw integer, which places 23:00 and 00:00 far apart. CTR improved from 3.2% to 5.1%, a 59% relative gain. At the platform's scale, this translated to ₹18 crore in additional annual revenue. The model did not change. The features did.

💡

Feature Engineering vs Model Selection

Write-ups from Kaggle competition winners frequently credit feature engineering, rather than model selection, for their largest gains. A well-engineered feature set with a simple logistic regression often outperforms a poorly engineered one with XGBoost. The model learns patterns, but only the patterns you put in front of it. Feature engineering determines what patterns are visible.

🗺️ The Feature Engineering + Scaling Pipeline

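Feature engineering happens before scaling. You must create, transform, and combine features first, then scale the result. Scaling a feature before creating interactions would change the meaning of the interaction term. The one deliberate exception, standardising inputs before polynomial expansion for numerical stability, is covered in Section 04.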

Section 02

Six Types of Feature Engineering

✖️

Interaction Features

Multiply or divide two features to capture their combined effect — often more predictive than either alone.

spend / delivery_days
age × income
price × quantity

📈

Polynomial Features

Raise features to powers (x², x³) to let linear models fit non-linear relationships without switching to a non-linear algorithm.

age² → diminishing returns
distance³ → cubic cost
x₁², x₂², x₁x₂

📅

Date / Time Features

Extract hour, day, month, quarter, weekday, is_weekend, days_since from timestamps. Cyclical features need sin/cos encoding.

dt.hour, dt.dayofweek
sin(2π×hour/24)
days_since_signup

📦

Binning / Discretisation

Convert continuous features into discrete buckets. Equal-width (cut) or equal-frequency (qcut) bins expose non-linear patterns to linear models.

age → 18-25, 26-35
income → Low/Med/High
pd.cut, pd.qcut

📊

Aggregation Features

Compute group statistics (mean, std, count, max) per entity — capturing behaviour patterns that single-row features cannot express.

customer avg purchase
city median salary
product return rate

🔄

Domain-Specific Features

Features crafted from domain knowledge — ratios, flags, and transformations that encode expert understanding of the problem.

debt_to_income_ratio
bmi = weight/height²
churn_risk_score

Section 03

Interaction Features — Combining Columns

Two features that are individually weak can become a powerful predictor when combined. Interaction features capture the joint effect of two variables — a signal that exists only in their relationship, not in either variable alone.

📖 Real-World Story

The Loan Model That Finally Understood Risk

A microfinance institution's default prediction model used income and debt as separate features. Income alone was a weak predictor — some high earners defaulted due to lifestyle spending. Debt alone was also weak — some high-debt customers were high earners who easily serviced it. When the data scientist created debt_to_income_ratio = debt / (income + 1) and monthly_obligation_pct = emi / monthly_income × 100, the model's AUC improved from 0.71 to 0.84. These interaction features encoded the fundamental credit risk concept that neither income nor debt alone captured: how much of your income is consumed by your obligations? The two raw features were individually mediocre. Their ratio was the single most predictive feature in the entire dataset.

import pandas as pd
import numpy  as np

# ── Ratio features ────────────────────────────────────
df['debt_to_income']       = df['total_debt'] / (df['annual_income'] + 1)
df['spend_per_delivery_day'] = df['purchase_amount'] / (df['delivery_days'] + 1)
df['revenue_per_visit']     = df['total_revenue'] / (df['visit_count'] + 1)

# ── Product features ──────────────────────────────────
df['age_x_income']     = df['age'] * df['income_bracket_enc']
df['price_x_quantity']  = df['unit_price'] * df['quantity']

# ── Difference features ───────────────────────────────
df['balance_change']    = df['balance_end'] - df['balance_start']
df['price_discount']    = df['original_price'] - df['sale_price']
df['discount_pct']      = df['price_discount'] / (df['original_price'] + 1) * 100

# ── Flag / binary interaction ─────────────────────────
df['is_high_value_weekend'] = (
    (df['purchase_amount'] > df['purchase_amount'].quantile(0.75)) &
    (df['is_weekend'] == 1)    # is_weekend is stored as a 0/1 flag
).astype(int)

# ── Always protect against division by zero ───────────
df['safe_ratio'] = np.where(df['denominator'] > 0,
                            df['numerator'] / df['denominator'],
                            0)
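
The two protection styles above behave differently at the edges. The sketch below is a minimal comparison under assumed toy values: `safe_ratio` is a hypothetical helper name (not part of pandas or NumPy), and the numbers are invented for illustration. It suggests why the `+ 1` shortcut can produce misleading ratios when the denominator is zero or small.

```python
import numpy as np

def safe_ratio(numerator, denominator, fill=0.0):
    """Element-wise numerator / denominator; `fill` wherever the denominator is not positive."""
    num = np.asarray(numerator, dtype=float)
    den = np.asarray(denominator, dtype=float)
    out = np.full(num.shape, fill, dtype=float)
    # `where=` restricts the division to valid positions, so no divide-by-zero occurs
    np.divide(num, den, out=out, where=den > 0)
    return out

# Toy values (assumed): the second customer reports zero income
debt   = np.array([5000.0, 1200.0, 0.0])
income = np.array([20000.0, 0.0, 4.0])

ratios   = safe_ratio(debt, income)   # explicit fill for invalid rows
smoothed = debt / (income + 1)        # the "+ 1" shortcut

print(ratios)     # second entry is the fill value
print(smoothed)   # second entry equals the raw debt, which is not a ratio at all
```

With the shortcut, a zero-income row receives a "ratio" equal to its raw debt. Whether a fill value of 0, a NaN, or a separate missing-income flag is the right choice depends on the problem.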

📊 Feature Importance — Raw Features vs Engineered Interaction Features

[Chart legend: raw features; engineered interaction features]

debt_to_income_ratio is the single most important feature, more predictive than either total_debt or annual_income alone. spend_per_delivery_day ranks above its component features. In this example, the interaction features outperform their raw inputs.

Section 04

Polynomial Features — Capturing Non-Linear Relationships

A linear model can only fit straight lines. When the true relationship between a feature and the target is curved — diminishing returns, exponential growth, a U-shape — a linear model will never capture it regardless of how much data you have. Polynomial features solve this by creating x², x³, and cross-terms, allowing a linear model to fit non-linear patterns without switching to a non-linear algorithm.

from sklearn.preprocessing import PolynomialFeatures

# ── Degree 2: adds x², y², and x×y ───────────────────
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X_train[['age', 'income']])

feature_names = poly.get_feature_names_out(['age', 'income'])
print(feature_names)
# ['age', 'income', 'age^2', 'age income', 'income^2']

# ── Manual polynomial (more readable) ────────────────
df['age_sq']      = df['age'] ** 2
df['age_cubed']   = df['age'] ** 3
df['income_sq']   = df['income'] ** 2
df['age_x_inc']   = df['age'] * df['income']    # cross-term

# ── IMPORTANT: Scale before polynomial! ──────────────
# Polynomial of unscaled features → huge numbers → numerical instability
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model  import Ridge

poly_pipeline = Pipeline([
    ('scaler', StandardScaler()),           # scale FIRST
    ('poly',   PolynomialFeatures(degree=2, include_bias=False)),
    ('model',  Ridge(alpha=1.0))
])
poly_pipeline.fit(X_train, y_train)

📊 Output: Linear vs Polynomial Fit — Capturing Non-Linear Relationship

[Chart legend: data points; linear fit (underfit); degree-2 polynomial (correct)]

The linear fit (red) misses the curve entirely — it systematically underpredicts at both extremes. The degree-2 polynomial (green) follows the true U-shaped relationship. Adding age² to the feature set lets the linear model fit this curve without changing the algorithm.
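
The comparison described above can be reproduced on synthetic data. The following is a minimal sketch, assuming an invented U-shaped target with its minimum at age 45 and Gaussian noise; the helper `r_squared` is a hypothetical name and uses plain least squares from NumPy rather than any particular modelling library.

```python
import numpy as np

rng = np.random.default_rng(0)
age = rng.uniform(20, 70, size=500)
# Synthetic U-shaped target (assumption for illustration only)
y = 0.05 * (age - 45) ** 2 + rng.normal(0, 1.0, size=500)

def r_squared(X, y):
    """Fit ordinary least squares with an intercept and return R^2."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1 - resid.var() / y.var()

r2_linear = r_squared(age.reshape(-1, 1), y)               # age only
r2_quad   = r_squared(np.column_stack([age, age ** 2]), y)  # age and age^2

print(f"R^2 with age only:      {r2_linear:.3f}")
print(f"R^2 with age and age^2: {r2_quad:.3f}")
```

The model is still linear in its coefficients in both cases; only the feature set changes.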

Section 05

Date & Time Feature Extraction

Datetime columns are one of the richest sources of feature engineering in business data. A single timestamp contains many potential signals — hour of day, day of week, seasonality, recency, and cyclical patterns. Extracting these signals explicitly is far more informative than feeding a raw timestamp or integer to a model.

📖 Real-World Story

The Delivery Model That Did Not Know It Was Friday

A logistics company's delivery time prediction model used raw timestamps as Unix integers (seconds since epoch). The model treated Monday at 9 AM and Friday at 9 AM as numerically similar — because their Unix timestamps differed by only 4 days × 86,400 seconds. But Friday deliveries in India take 40% longer on average due to weekend traffic and reduced warehouse staffing. By extracting is_friday, day_of_week, and hour_of_day from the timestamp and adding them as explicit features, the model's RMSE on delivery time prediction dropped from 4.2 hours to 1.8 hours. The information had always been inside the timestamp — but buried in arithmetic the model could not decode without explicit extraction.

# ── Parse and extract all datetime features ───────────
df['order_date'] = pd.to_datetime(df['order_date'])

# Basic calendar features
df['year']        = df['order_date'].dt.year
df['month']       = df['order_date'].dt.month
df['day']         = df['order_date'].dt.day
df['hour']        = df['order_date'].dt.hour
df['day_of_week']  = df['order_date'].dt.dayofweek   # 0=Mon, 6=Sun
df['quarter']     = df['order_date'].dt.quarter
df['week_of_year'] = df['order_date'].dt.isocalendar().week

# Derived flag features
df['is_weekend']   = (df['day_of_week'] >= 5).astype(int)
df['is_friday']    = (df['day_of_week'] == 4).astype(int)
df['is_month_end']  = df['order_date'].dt.is_month_end.astype(int)
df['is_quarter_end']= df['order_date'].dt.is_quarter_end.astype(int)

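# Recency features: days relative to a fixed reference date
# (a fixed date keeps the feature reproducible; pd.Timestamp.today() changes on every run)
ref_date = pd.Timestamp('2024-01-01')
df['signup_date'] = pd.to_datetime(df['signup_date'])
df['days_since_signup']  = (df['order_date'] - df['signup_date']).dt.days
df['days_since_purchase'] = (ref_date - df['order_date']).dt.days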

# ── Cyclical encoding for hour and month ──────────────
# Prevents model from thinking Hour 23 is far from Hour 0
df['hour_sin']   = np.sin(2 * np.pi * df['hour']  / 24)
df['hour_cos']   = np.cos(2 * np.pi * df['hour']  / 24)
df['month_sin']  = np.sin(2 * np.pi * df['month'] / 12)
df['month_cos']  = np.cos(2 * np.pi * df['month'] / 12)

📊 Cyclical Encoding — Hour of Day as Sin/Cos (Hour 23 ≈ Hour 0)

[Chart legend: sin(2π×hour/24); cos(2π×hour/24); raw hour, which wrongly treats 23 and 0 as far apart]

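Raw hour (red dashed) treats 23:00 and 00:00 as maximally different, a distance of 23. Sin/cos encoding places them next to each other on the unit circle: hour 23 maps to (sin, cos) ≈ (−0.26, 0.97) and hour 0 maps to (0, 1), the same spacing as any other pair of adjacent hours. Both components are needed, because sin alone gives the same value to different hours (for example, 01:00 and 11:00). This is essential for any cyclical variable: hours, months, days of week.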
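
The adjacency claim can be checked directly by measuring distances in the encoded space. This is a small sketch; `encode_hour` and `distance` are hypothetical helper names introduced only for this illustration.

```python
import numpy as np

def encode_hour(hour):
    """Map hour-of-day (0-23) to a point on the unit circle as (sin, cos)."""
    angle = 2 * np.pi * np.asarray(hour, dtype=float) / 24
    return np.column_stack([np.sin(angle), np.cos(angle)])

points = encode_hour([0, 1, 12, 23])   # rows: hour 0, hour 1, hour 12, hour 23

def distance(i, j):
    """Euclidean distance between two encoded hours."""
    return float(np.linalg.norm(points[i] - points[j]))

d_0_23 = distance(0, 3)   # midnight vs 23:00
d_0_1  = distance(0, 1)   # midnight vs 01:00
d_0_12 = distance(0, 2)   # midnight vs noon

print(f"0 vs 23: {d_0_23:.3f}, 0 vs 1: {d_0_1:.3f}, 0 vs 12: {d_0_12:.3f}")
```

Hours 23 and 1 sit at the same distance from midnight, and noon is the farthest point, which matches how a clock face works.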

Section 06

Binning & Discretisation — Grouping Continuous Values

Binning converts a continuous numeric column into ordered categorical buckets. This is useful when: the relationship between a feature and the target is step-wise rather than linear; you want to create interaction features between two continuous variables; or you want to let a linear model learn non-linear thresholds without using polynomial features.

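# ── Explicit bin edges (pd.cut) ───────────────────────
# Custom edges give domain-meaningful groups; pass an integer
# (e.g. bins=5) instead to get true equal-width bins
df['age_group'] = pd.cut(
    df['age'],
    bins=[0, 25, 35, 45, 55, 100],
    labels=['18-25', '26-35', '36-45', '46-55', '56+'],
    right=True    # intervals are (0, 25], (25, 35], ... so age 55 falls in '46-55'
)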

# ── Equal-frequency bins (pd.qcut) ────────────────────
# Each bin contains the same number of rows
df['income_quartile'] = pd.qcut(
    df['annual_income'],
    q=4,
    labels=['Q1_Low', 'Q2_Med', 'Q3_High', 'Q4_VHigh']
)

# ── KBinsDiscretizer (sklearn — pipeline-friendly) ────
from sklearn.preprocessing import KBinsDiscretizer
kbd = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile')
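# fit_transform returns a 2-D array of shape (n_rows, 1); flatten it for a single column.
# In a real workflow, fit on the training split only and call transform on the test split.
df['income_bin'] = kbd.fit_transform(df[['annual_income']]).ravel()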

# ── Custom domain bins ────────────────────────────────
# Standard BMI cut-offs: below 18.5, 18.5 to below 25, 25 to below 30, 30 and above
bmi_bins   = [0, 18.5, 25, 30, 100]
bmi_labels = ['Underweight', 'Normal', 'Overweight', 'Obese']
df['bmi_category'] = pd.cut(df['bmi'], bins=bmi_bins, labels=bmi_labels, right=False)

# ── Encode binned features for modelling ──────────────
df['age_group_enc'] = df['age_group'].cat.codes     # ordinal
df = pd.get_dummies(df, columns=['age_group'], dtype=int) # one-hot

📊 pd.cut vs pd.qcut — Equal-Width vs Equal-Frequency Bins

[Chart panels: pd.cut (equal-width bins); pd.qcut (equal-frequency bins)]

pd.cut creates bins of equal numeric width — some bins may have very few data points if the distribution is skewed. pd.qcut creates bins with equal numbers of data points — better for skewed data as every bin gets balanced representation.
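
The difference is easy to see by counting rows per bin on a skewed column. The sketch below assumes a synthetic log-normal "income" series (invented data, not from the article) and compares four bins from each method.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic right-skewed income column (assumption for illustration)
income = pd.Series(rng.lognormal(mean=10, sigma=1, size=1000))

width_counts = pd.cut(income, bins=4).value_counts(sort=False)   # equal-width
freq_counts  = pd.qcut(income, q=4).value_counts(sort=False)     # equal-frequency

print("pd.cut rows per bin: ", width_counts.tolist())
print("pd.qcut rows per bin:", freq_counts.tolist())
```

On data like this, the equal-width bins are dominated by the lowest range because a few very large values stretch the scale, while the quantile bins each hold a quarter of the rows.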

Section 07

Aggregation Features — Capturing Group Behaviour

Aggregation features compute statistics about a group that a single row cannot express. A customer's individual purchase amount is limited information. The customer's average purchase amount over 12 months, their maximum single purchase, and the standard deviation of their purchases — these reveal spending personality. Aggregation features are among the most powerful in time-series and customer analytics problems.

# ── Customer-level aggregations ───────────────────────
cust_agg = df.groupby('customer_id').agg(
    cust_total_spend    = ('purchase_amount', 'sum'),
    cust_avg_spend      = ('purchase_amount', 'mean'),
    cust_max_spend      = ('purchase_amount', 'max'),
    cust_std_spend      = ('purchase_amount', 'std'),
    cust_order_count    = ('order_id',        'count'),
    cust_days_active    = ('order_date',      'nunique'),
    cust_avg_rating     = ('rating',          'mean'),
    cust_return_rate    = ('is_returned',     'mean'),
).reset_index()

# Merge back to original dataframe
df = df.merge(cust_agg, on='customer_id', how='left')

# ── Product-level aggregations ────────────────────────
prod_agg = df.groupby('product_id').agg(
    prod_avg_rating  = ('rating',       'mean'),
    prod_sale_count  = ('order_id',     'count'),
    prod_return_rate = ('is_returned',  'mean'),
    prod_avg_price   = ('unit_price',   'mean')
).reset_index()
df = df.merge(prod_agg, on='product_id', how='left')

# ── City-level aggregations ───────────────────────────
city_agg = df.groupby('city')['purchase_amount'].agg(
    city_avg_spend='mean', city_spend_std='std'
).reset_index()
df = df.merge(city_agg, on='city', how='left')

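# ── Rolling window aggregations (time series) ─────────
# A time-based window ('30D') needs a datetime index; rolling(30) would mean
# "last 30 rows", not "last 30 days". Assumes order_date is already datetime.
df = df.sort_values(['customer_id', 'order_date']).reset_index(drop=True)
df['rolling_30d_spend'] = (df
    .set_index('order_date')
    .groupby('customer_id')['purchase_amount']
    .rolling('30D', min_periods=1).mean()
    .to_numpy()   # row order matches df because df is sorted by the same keys
)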
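
The aggregations above are computed on a single dataframe for brevity. When a train/test split is involved, group statistics are usually computed on the training rows only and then mapped onto the test rows, so that test information does not leak into training features. The following is a minimal sketch under assumed toy data; the fallback choices for unseen customers (global training mean, zero orders) are assumptions, not the only reasonable option.

```python
import pandas as pd

# Toy data (assumed): customer 3 appears only in the test split
train = pd.DataFrame({
    'customer_id':     [1, 1, 2, 2, 2],
    'purchase_amount': [100.0, 300.0, 50.0, 70.0, 90.0],
})
test = pd.DataFrame({
    'customer_id':     [1, 3],
    'purchase_amount': [999.0, 10.0],
})

# 1. Compute statistics from the training rows only
cust_stats = (train.groupby('customer_id')['purchase_amount']
              .agg(cust_avg_spend='mean', cust_order_count='count')
              .reset_index())
global_avg = train['purchase_amount'].mean()

# 2. Map the train-derived statistics onto both splits
train_fe = train.merge(cust_stats, on='customer_id', how='left')
test_fe  = test.merge(cust_stats, on='customer_id', how='left')

# 3. Customers unseen in training get a fallback value
test_fe['cust_avg_spend']   = test_fe['cust_avg_spend'].fillna(global_avg)
test_fe['cust_order_count'] = test_fe['cust_order_count'].fillna(0).astype(int)

print(test_fe)
```

Note that the test purchase of 999.0 for customer 1 has no influence on that customer's average, which is the point of the exercise.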

Section 08

Domain-Specific Feature Engineering

The most powerful features are often those derived from domain knowledge: features that encode what a human expert would consider when making the same prediction. Automated feature engineering tools rarely discover them, because they require understanding of the business context, not just the data.

Domain	Raw Features	Engineered Feature	Why It Works
Credit Risk	debt, income	debt / income	Standard credit risk metric — captures affordability
E-Commerce	total_spend, visits	spend / visits	Captures value per engagement — predicts premium behaviour
Healthcare	weight (kg), height (m)	weight / height²	BMI — established medical risk indicator
Logistics	order_date, delivery_date	(delivery − order).days	Actual delivery time — the metric customers care about
Retail	last_purchase_date, today	days_since_last_purchase	Recency — strongest predictor of next purchase intent
Telecom	calls_made, plan_limit	calls / plan_limit × 100	Plan utilisation % — predicts upsell and churn
Finance	current_assets, liabilities	current_assets / liabilities	Current ratio — standard liquidity metric

Section 09

Feature Scaling — Why It Must Come After Engineering

Feature scaling transforms numeric values into a consistent range so that distance-based and gradient-based algorithms treat all features equally. It must come after feature engineering because: (1) scaling before creating interaction features changes the meaning of the interaction; (2) newly created features may have very different ranges and also need scaling; and (3) binned and encoded features must be created from the raw values before any scaling is applied to numeric columns.

⚠️

The Correct Order Is Non-Negotiable

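Always follow this order: (1) clean data, (2) engineer features, (3) encode categoricals, (4) scale numerics, (5) select features. If you scale before engineering, your interaction features will be computed on scaled values, losing their interpretability and changing their mathematical relationship to the target. Scaling should also be restricted to numeric columns: passing unencoded text columns to a scaler raises an error. The one deliberate exception to the ordering is polynomial expansion (Section 04), where inputs are standardised first to keep the squared and cubed terms numerically stable.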
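
The effect of the ordering can be checked numerically. The sketch below assumes two invented, independent columns (`unit_price` and `quantity`) and a hypothetical `zscore` helper; it compares "multiply, then standardise" with "standardise, then multiply".

```python
import numpy as np

rng = np.random.default_rng(7)
# Synthetic raw columns (assumption for illustration)
unit_price = rng.normal(50, 10, size=10_000)
quantity   = rng.normal(100, 20, size=10_000)

def zscore(x):
    """Standardise to mean 0 and standard deviation 1."""
    return (x - x.mean()) / x.std()

engineer_then_scale = zscore(unit_price * quantity)          # scaled order value
scale_then_engineer = zscore(unit_price) * zscore(quantity)  # product of deviations

corr = np.corrcoef(engineer_then_scale, scale_then_engineer)[0, 1]
print(f"Correlation between the two versions: {corr:.2f}")
```

The first version is still an order value, just on a standard scale. The second is large whenever both inputs are far from their means in the same direction, which is a different quantity, so the two are only weakly related here.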

MinMaxScaler

(x − min) / (max − min)

Output: [0, 1]

Neural networks, image data. Sensitive to outliers. Bounded range.

StandardScaler

(x − μ) / σ

Output: mean=0, std=1

Default for linear models, SVM, PCA. Partially affected by outliers.

RobustScaler

(x − Q2) / IQR

Output: centred on median

When legitimate outliers exist. Uses median + IQR — outlier-resistant.

MaxAbsScaler

x / max(|x|)

Output: [−1, +1]

Sparse data (TF-IDF). Preserves zero entries. No mean shifting.

📊 Output: All Four Scalers — Same Engineered Features, Four Transformations

[Chart legend: raw (unscaled); MinMaxScaler; StandardScaler; RobustScaler]

Raw features have wildly different ranges — salary dwarfs all others. After scaling, all features are comparable. Notice debt_to_income_ratio (an engineered feature) also needs scaling — newly created features do not arrive pre-scaled.
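
To make the four formulas concrete, the sketch below applies each one by hand with NumPy to a small invented array containing one outlier. It mirrors the formulas on the cards above rather than calling the sklearn classes, so treat it as an illustration of the arithmetic.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # toy values with one outlier

# MinMaxScaler: (x - min) / (max - min)
min_max = (x - x.min()) / (x.max() - x.min())

# StandardScaler: (x - mean) / std, using the population std (ddof=0)
standard = (x - x.mean()) / x.std()

# RobustScaler: (x - median) / IQR
q1, median, q3 = np.percentile(x, [25, 50, 75])
robust = (x - median) / (q3 - q1)

# MaxAbsScaler: x / max(|x|)
max_abs = x / np.abs(x).max()

for name, values in [('min-max', min_max), ('standard', standard),
                     ('robust', robust), ('max-abs', max_abs)]:
    print(f"{name:>8}: {np.round(values, 3)}")
```

In the min-max and max-abs outputs the outlier squeezes the four ordinary values into a narrow band near zero, while the robust version keeps them spread between −1 and 0.5 and lets the outlier sit far away.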

Scaling the Engineered Features

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# All columns to scale — including engineered ones
scale_cols = [
    'age', 'income', 'salary',            # original numeric
    'debt_to_income', 'spend_per_day',       # engineered ratios
    'age_sq', 'income_sq', 'age_x_income',   # polynomial features
    'cust_avg_spend', 'cust_return_rate',    # aggregation features
    'days_since_signup', 'hour_sin',          # time features
]

# Standard: for linear models, SVM, PCA
std_scaler = StandardScaler()
std_scaler.fit(X_train[scale_cols])
X_train[scale_cols] = std_scaler.transform(X_train[scale_cols])
X_test[scale_cols]  = std_scaler.transform(X_test[scale_cols])

# Robust: for features with outliers (e.g. aggregated revenue)
robust_cols = ['cust_total_spend', 'cust_max_spend']
rob_scaler  = RobustScaler()
rob_scaler.fit(X_train[robust_cols])
X_train[robust_cols] = rob_scaler.transform(X_train[robust_cols])
X_test[robust_cols]  = rob_scaler.transform(X_test[robust_cols])

Section 10

Complete Feature Engineering + Scaling Pipeline

In production, all feature engineering and scaling steps must be encapsulated in a single sklearn pipeline so they are applied identically at training and inference time. A custom transformer wraps the engineering logic; ColumnTransformer applies different scalers to different column groups.

import numpy  as np
import pandas as pd
from sklearn.pipeline        import Pipeline
from sklearn.compose         import ColumnTransformer
from sklearn.preprocessing   import StandardScaler, RobustScaler, MinMaxScaler
from sklearn.preprocessing   import OneHotEncoder, PolynomialFeatures
from sklearn.base             import BaseEstimator, TransformerMixin
from sklearn.impute           import SimpleImputer
from sklearn.ensemble         import GradientBoostingClassifier
import joblib

# ── 1. Custom feature engineering transformer ─────────
class FeatureEngineer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None): return self
    def transform(self, X):
        X = X.copy()
        order_ts  = pd.to_datetime(X['order_date'])    # parse once, reuse below
        signup_ts = pd.to_datetime(X['signup_date'])
        # Interaction features
        X['debt_to_income']  = X['debt']  / (X['income']  + 1)
        X['spend_per_day']   = X['spend']  / (X['days']    + 1)
        # Percentage in [0, 100], matching the bounded_cols group below
        X['discount_pct']    = (X['orig_price'] - X['sale_price']) / (X['orig_price'] + 1) * 100
        # Date features
        X['day_of_week']     = order_ts.dt.dayofweek
        X['is_weekend']      = (X['day_of_week'] >= 5).astype(int)
        X['hour_sin']        = np.sin(2*np.pi*order_ts.dt.hour/24)
        X['hour_cos']        = np.cos(2*np.pi*order_ts.dt.hour/24)
        # Measured relative to the order, not to today(), so the same row always
        # produces the same value at training and inference time
        X['days_since_signup'] = (order_ts - signup_ts).dt.days
        return X.drop(columns=['order_date', 'signup_date'])

# ── 2. Column groups after engineering ────────────────
standard_cols = ['age', 'income', 'credit_score', 'debt_to_income',
                 'spend_per_day', 'hour_sin', 'hour_cos', 'days_since_signup']
robust_cols   = ['total_spend', 'max_spend']    # outlier-prone
bounded_cols  = ['discount_pct', 'utilisation_pct']  # naturally [0,100]
cat_cols      = ['city', 'gender', 'product_type']

# ── 3. Preprocessing sub-pipelines ───────────────────
std_pipe    = Pipeline([('imp',SimpleImputer(strategy='median')),    ('sc',StandardScaler())])
robust_pipe = Pipeline([('imp',SimpleImputer(strategy='median')),    ('sc',RobustScaler())])
mm_pipe     = Pipeline([('imp',SimpleImputer(strategy='median')),    ('sc',MinMaxScaler())])
cat_pipe    = Pipeline([('imp',SimpleImputer(strategy='most_frequent')),
                         ('ohe',OneHotEncoder(handle_unknown='ignore',sparse_output=False))])

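preprocessor = ColumnTransformer([
    ('std',    std_pipe,    standard_cols),
    ('robust', robust_pipe, robust_cols),
    ('mm',     mm_pipe,     bounded_cols),
    ('cat',    cat_pipe,    cat_cols),
    # Columns not listed in any group are dropped by default (remainder='drop'),
    # so pass the engineered calendar flags through explicitly
    ('flags',  'passthrough', ['day_of_week', 'is_weekend']),
])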

# ── 4. Full pipeline ─────────────────────────────────
full_pipeline = Pipeline([
    ('engineer',    FeatureEngineer()),
    ('preprocessor',preprocessor),
    ('model',       GradientBoostingClassifier(n_estimators=300, random_state=42))
])

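full_pipeline.fit(X_train, y_train)
# .score() on a classifier returns accuracy; compute AUC from predicted probabilities
from sklearn.metrics import roc_auc_score
test_auc = roc_auc_score(y_test, full_pipeline.predict_proba(X_test)[:, 1])
print(f"Test AUC: {test_auc:.3f}")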
joblib.dump(full_pipeline, 'feature_eng_pipeline.pkl')

# At inference — raw data in, predictions out
pipeline = joblib.load('feature_eng_pipeline.pkl')
predictions = pipeline.predict(new_raw_df)

📊 Output: Model Performance — Raw Features vs Engineered + Scaled

[Chart legend: raw features only; + feature engineering; + engineering + scaling]

Feature Engineering alone provides the biggest accuracy jump across all algorithms — especially for linear models where engineered features expose non-linear patterns. Adding scaling on top delivers a further gain for distance-based and gradient-based algorithms (KNN, SVM, Neural Net, Logistic Regression). Tree models (RF, XGB) see less benefit from scaling but still benefit from engineering.

Section 11

Multicollinearity — Where It Enters & How to Fix It

Multicollinearity occurs when two or more features are so highly correlated with each other that one can be predicted from the others. It does not affect every algorithm. For unregularised linear and logistic regression it silently destroys coefficient interpretability, inflates standard errors, and makes the fitted coefficients unstable; Ridge and Lasso soften the effect, and KNN suffers in a different way because redundant features are counted more than once in the distance. Crucially, it enters your dataset most often as a result of feature engineering itself, the very step designed to improve your model.

📖 Real-World Story

The Bank That Could Not Explain Its Own Model

A private bank built a logistic regression model to predict loan default. Their data science team had done excellent feature engineering: they created debt_to_income, income_sq, debt_sq, and age_x_income among others. The model achieved 0.84 AUC. But when the RBI auditor asked the team to explain the coefficient on annual_income, the answer was a coefficient of +2.3, meaning higher income predicts more default. This was obviously wrong. The real problem: annual_income, debt_to_income, and income_sq were highly correlated (VIF > 45). Multicollinearity had made the coefficients mathematically unstable, so a tiny change in the data could flip the sign of the income coefficient entirely. The model's predictions were fine. Its explanations were meaningless. After removing multicollinear features and switching to an L2-regularised (Ridge-style) fit, the income coefficient took the expected negative sign, so that higher income predicted lower default risk, and the model passed the audit.

🗺️ Where Multicollinearity Enters the Pipeline

Multicollinearity is a post-engineering, pre-modelling problem. Always run the VIF check and correlation heatmap after completing all feature engineering and encoding — before fitting any linear or distance-based model.

Which Models Are Harmed vs Safe

Algorithm	Harmed?	Why	Fix
Linear Regression	Severely	Coefficients become unstable — tiny data changes flip signs	Remove features or use Ridge
Logistic Regression	Severely	Same as linear — gradient instability, uninterpretable coefficients	Remove features or use Ridge/L2
Ridge Regression	Partially	L2 penalty stabilises correlated coefficients — still interpretable	Already mitigated — preferred fix
Lasso Regression	Partially	Auto-selects one of a correlated pair and zeros the other	Already mitigated — auto-selects
KNN	Yes	Redundant features inflate distance — similar points look far apart	Drop duplicated-signal features
SVM	Mildly	Affects margin geometry but kernel methods partially mitigate it	Apply PCA before SVM
Decision Tree	No	Splits one feature at a time — redundant features are simply unused	None needed
Random Forest / XGBoost	No	Tree-based — threshold splits are rank-based, not distance-based	None needed
Neural Network	Mildly	Handles it but slows convergence — redundant features waste capacity	Drop highly correlated pairs

Detecting Multicollinearity

import pandas as pd
import numpy  as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

# ── Method 1: Correlation matrix ─────────────────────
corr = X_train.corr().abs()
upper = corr.where(
    np.triu(np.ones(corr.shape, dtype=bool), k=1)
)
problem_pairs = [
    (col, row, upper.loc[row, col])
    for col in upper.columns
    for row in upper.index
    if upper.loc[row, col] > 0.85
]
print(pd.DataFrame(problem_pairs, columns=['feature_1','feature_2','correlation']))

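# ── Method 2: Variance Inflation Factor (VIF) ────────
# Rule of thumb: VIF > 10 signals a serious problem
# statsmodels does not add an intercept, so add a constant column first;
# without it the reported VIFs can be misleading
from statsmodels.tools.tools import add_constant
X_vif = add_constant(X_train.astype(float))
vif_data = pd.DataFrame({
    'feature': X_vif.columns[1:],                 # skip the constant itself
    'VIF':     [variance_inflation_factor(X_vif.values, i)
               for i in range(1, X_vif.shape[1])]
}).sort_values('VIF', ascending=False)

print(vif_data)
# VIF > 10  → serious multicollinearity, fix for linear models
# VIF 5-10  → moderate, worth investigating
# VIF < 5   → acceptable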

# ── Flag the worst offenders ──────────────────────────
high_vif = vif_data[vif_data['VIF'] > 10]
print(f"Features with VIF > 10: {high_vif['feature'].tolist()}")
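
To make the VIF number less of a black box, the sketch below computes it from its definition, VIF_j = 1 / (1 − R_j²), where R_j² comes from regressing feature j on all the other features. It uses plain NumPy least squares on invented data (income in thousands, its square, and an unrelated age column); the function name `vif` is a hypothetical helper, not the statsmodels routine.

```python
import numpy as np

def vif(X):
    """VIF for each column: regress it on the others (with intercept), then 1 / (1 - R^2)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ coef
        r2 = 1 - resid.var() / X[:, j].var()
        out[j] = 1 / (1 - r2)
    return out

rng = np.random.default_rng(0)
income = rng.uniform(20, 200, size=1000)    # synthetic, in thousands
age    = rng.uniform(21, 65, size=1000)     # synthetic, unrelated to income
X = np.column_stack([income, income ** 2, age])

vifs = vif(X)
for name, value in zip(['income', 'income_sq', 'age'], vifs):
    print(f"{name:>10}: VIF = {value:.1f}")
```

income and income_sq explain each other almost completely, so both receive a high VIF, while the independent age column stays close to 1.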

📊 Output: Variance Inflation Factor (VIF) — Before & After Fixing

[Chart legend: before fixing; after fixing; VIF=10 danger threshold; VIF=5 warning threshold]

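income_sq (VIF=48.2) and debt_to_income (VIF=31.6) are severely collinear with annual_income and total_debt. After dropping one feature from each collinear pair, all VIF values fall below 5.0 (green bars). Switching to Ridge does not change the VIF values themselves, which depend only on the features; it stabilises the coefficients fitted on whatever collinearity remains.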

Fixing Multicollinearity — Four Strategies

# ── Strategy 1: Drop one of each correlated pair ─────
# Keep the one with higher target correlation
def drop_multicollinear(X, y, threshold=0.85):
    target_corr = X.corrwith(y).abs()
    corr = X.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = set()
    for col in upper.columns:
        correlated = upper.index[upper[col] > threshold].tolist()
        for row in correlated:
            # Skip pairs already resolved by an earlier drop
            if col in to_drop or row in to_drop:
                continue
            # Drop whichever has lower target correlation
            drop = col if target_corr.get(col, 0) < target_corr.get(row, 0) else row
            to_drop.add(drop)
    print(f"Dropping {len(to_drop)} collinear features: {sorted(to_drop)}")
    return X.drop(columns=sorted(to_drop))

X_train = drop_multicollinear(X_train, y_train, threshold=0.85)
X_test  = X_test[X_train.columns]   # apply the same column selection to the test set

# ── Strategy 2: Ridge Regression (L2 regularisation) ─
# Does NOT remove features — stabilises unstable coefficients
from sklearn.linear_model import RidgeCV
ridge = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0], cv=5)
ridge.fit(X_train_scaled, y_train)
print(f"Best alpha: {ridge.alpha_}")

# ── Strategy 3: PCA — compress correlated features ───
from sklearn.decomposition import PCA
# All correlated features → a few orthogonal (uncorrelated) components
pca = PCA(n_components=0.95)   # keep 95% of variance
X_pca = pca.fit_transform(X_train_scaled)
print(f"Reduced from {X_train.shape[1]} to {X_pca.shape[1]} components")

# ── Strategy 4: Fix dummy variable trap ──────────────
# Always use drop_first=True for linear/logistic regression
df_encoded = pd.get_dummies(df, columns=['city', 'gender'],
                             drop_first=True, dtype=int)
# sklearn OneHotEncoder
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(drop='first', sparse_output=False)
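
The reason behind Strategy 4 can be shown with a rank check. In a model with an intercept, a full set of one-hot columns always sums to the intercept column, so the design matrix loses a rank. The sketch below uses an invented `city` column and a hypothetical helper `design_rank`.

```python
import numpy as np
import pandas as pd

city = pd.Series(['Delhi', 'Mumbai', 'Pune', 'Delhi', 'Pune', 'Mumbai'], name='city')

full    = pd.get_dummies(city, dtype=int)                    # one column per city
reduced = pd.get_dummies(city, drop_first=True, dtype=int)   # first category dropped

def design_rank(dummies):
    """Return (rank, number of columns) of [intercept | dummies]."""
    A = np.column_stack([np.ones(len(dummies)), dummies.to_numpy()])
    return int(np.linalg.matrix_rank(A)), A.shape[1]

rank_full, ncol_full = design_rank(full)
rank_red,  ncol_red  = design_rank(reduced)

print(f"all dummies:      rank {rank_full} of {ncol_full} columns")
print(f"drop_first=True:  rank {rank_red} of {ncol_red} columns")
```

With all dummies present the matrix is rank-deficient, which is exact multicollinearity; dropping one category restores full column rank. Regularised and tree-based models tolerate the full encoding, so this matters mainly for unregularised linear and logistic regression.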

🗺️ Multicollinearity Fix Decision Tree

Decision rule: VIF below 5 — safe for all models. VIF 5–10 — investigate; fix if using a linear model. VIF above 10 — must fix for linear models; tree models can safely ignore it.

🎯

Pro Tip — After Polynomial Features Always Use Ridge

Whenever you use PolynomialFeatures(degree≥2) on raw, positive-valued features, the resulting terms (x, x², x³, x×y) are usually highly correlated by construction. Standardising first reduces this for roughly symmetric features but rarely removes it, especially for cross-terms and higher degrees. Prefer Ridge or Lasso over ordinary least squares after polynomial expansion: the penalty keeps the coefficients stable when the expanded features overlap. This is the standard workflow for polynomial regression.
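
How strongly a feature correlates with its own square depends on where the feature is centred. The sketch below uses an invented, uniformly distributed `age` column to compare the raw and mean-centred cases; it is an illustration of the x versus x² pair only, and higher powers and cross-terms typically remain correlated even after centring.

```python
import numpy as np

rng = np.random.default_rng(1)
age = rng.uniform(20, 70, size=2000)   # synthetic, strictly positive

# Raw feature vs its square
corr_raw = np.corrcoef(age, age ** 2)[0, 1]

# Mean-centred feature vs its square
age_c = age - age.mean()
corr_centred = np.corrcoef(age_c, age_c ** 2)[0, 1]

print(f"corr(age, age^2), raw:     {corr_raw:.3f}")
print(f"corr(age, age^2), centred: {corr_centred:.3f}")
```

This is one reason the StandardScaler → PolynomialFeatures ordering from Section 04 pairs well with a regularised model: centring removes part of the overlap, and the penalty handles what is left.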

Section 12

Golden Rules of Feature Engineering & Scaling

🎯 9 Rules Every Data Scientist Must Follow

Engineer features before scaling. Creating a ratio from two raw features and then scaling gives a meaningful, scaled ratio. Scaling first and then creating the ratio gives an arbitrary combination of two dimensionless numbers that loses interpretability. The one deliberate exception is polynomial expansion, where inputs are standardised first for numerical stability.

Start with domain knowledge, not automation. The most powerful features come from understanding what an expert human would consider when making the same prediction. Ask domain experts: "What ratios and combinations do you mentally compute when assessing a case?"

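Always protect ratio features against division by zero: numerator / (denominator + 1) or use np.where(denominator > 0, numerator/denominator, 0). A single zero in the denominator will create NaN or inf that propagates through every downstream calculation. Note that the + 1 shortcut shifts every ratio slightly and distorts it badly when the denominator is small (for example, a fraction between 0 and 1), so prefer the np.where form for such columns.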

Use cyclical encoding (sin/cos) for hour, month, day-of-week, and any other circular variable. Encoding hour as a raw integer tells the model that 23:00 is maximally different from 00:00 — which is false. The sin/cos pair correctly places them adjacent on the unit circle.

Scale polynomial features: StandardScaler → PolynomialFeatures, never the reverse. Polynomial expansion of unscaled features creates astronomically large values (salary² = 10¹²) that cause numerical instability in gradient-based models.

Use pd.qcut (equal frequency) over pd.cut (equal width) for skewed distributions. Equal-width bins on a heavily skewed feature can put the large majority of the data in one bin and only a handful of rows in each of the others, so the model sees one large bucket and several nearly empty ones.

Aggregation features must be computed on training data and then mapped to test data — never computed on the full dataset. Computing customer average spend on the full dataset leaks test rows into the training features. Group by on train, then merge to test using the train-computed statistics.

Wrap all feature engineering in a custom sklearn transformer (BaseEstimator + TransformerMixin). This makes your engineering steps part of the pipeline — they are fitted on training data, applied to test data, and automatically applied at inference time. Hand-engineering outside the pipeline creates inconsistency and leakage.

Validate every engineered feature with a distribution check and a correlation-with-target check before including it in the model. A feature that seems logically correct might have a zero-variance implementation bug, or might already be perfectly correlated with another feature you created.

🧮

Key Takeaway

The e-commerce platform that unlocked ₹18 crore in revenue did not build a better model — they built better features. Feature engineering is where a data scientist's domain knowledge, creativity, and mathematical intuition combine. Scaling is where mathematical rigour ensures the model sees those features fairly. Together they are the highest-leverage activity in the entire machine learning workflow. A brilliant feature set with a simple model will almost always outperform a poor feature set with a complex model. Invest your time here first.