Ridge and Lasso Regression

Section 01

The Story That Explains Everything

📖 Real World Analogy

The Overzealous Student vs the Humble Expert

Imagine two students studying for an exam. Student A reads every note, memorises every footnote, every specific example — even the one weird edge case the teacher mentioned once. On the practice paper (training data) Student A scores 100%. On the real exam (test data)? 62%. They memorised instead of learning.

Student B reads the same notes but forces himself to summarise — if a topic can't be explained with simple core rules, he discards the complexity and keeps the big picture. Practice paper: 88%. Real exam: 85%. Almost no gap.

Regularisation is Student B's discipline: a deliberate, mathematically precise mechanism that punishes complexity, forcing the model to learn patterns rather than memorise noise. Ridge and Lasso are the two most widely used forms of that discipline.

In machine learning, overfitting happens when a model learns the training data too well — including its noise, quirks, and random fluctuations. It performs brilliantly on data it has seen and catastrophically on data it has not. Regularisation adds a penalty term to the loss function that discourages large, complex coefficients, giving us models that generalise far better to new data.

💡

The Core Idea

Both Ridge and Lasso modify the standard Linear Regression cost function by adding a penalty for large coefficients. Ridge uses the sum of squared coefficients (the squared L2 norm). Lasso uses the sum of absolute coefficients (the L1 norm). The single hyperparameter alpha (α) controls how severely large coefficients are penalised.

Section 02

The Problem — Why Plain Linear Regression Fails

Standard Ordinary Least Squares (OLS) regression minimises only the residual sum of squares. It will assign whatever coefficients minimise training error, even if those coefficients are enormous and unstable.

📈 OLS Cost Function

Minimise

RSS = Σ(yᵢ − ŷᵢ)² = Σ(yᵢ − β₀ − β₁x₁ − ... − βₚxₚ)²

Problem

No constraint on how large the β coefficients can grow

Result

With many features or correlated features, coefficients explode → overfitting

⚠️

Three Situations Where OLS Breaks Badly

1. More features than samples (p > n): OLS has infinitely many solutions — the system is underdetermined.
2. Multicollinearity: Correlated features cause wildly unstable, high-variance coefficients.
3. Irrelevant features: OLS includes all features regardless — adding noise to every prediction.
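The first situation can be checked numerically. The sketch below is a minimal illustration on made-up random data (the sizes n = 5 and p = 8 are arbitrary assumptions): it shows that XᵀX loses rank when p > n and that two quite different coefficient vectors fit the training data equally well.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 8                      # fewer samples than features
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# X^T X is p x p but has rank at most n, so it cannot be inverted.
gram_rank = np.linalg.matrix_rank(X.T @ X)

# lstsq returns the minimum-norm solution, one of infinitely many.
beta_min_norm, *_ = np.linalg.lstsq(X, y, rcond=None)

# Adding any null-space direction of X gives another perfect fit.
_, _, vt = np.linalg.svd(X)
null_direction = vt[-1]          # X @ null_direction is numerically zero
beta_other = beta_min_norm + 10.0 * null_direction

res_min_norm = np.linalg.norm(X @ beta_min_norm - y)
res_other = np.linalg.norm(X @ beta_other - y)

print("rank of X^T X:", gram_rank, "out of", p)
print("residual (min-norm):", res_min_norm)
print("residual (shifted) :", res_other)
```

Both residuals are zero up to rounding, so the training data alone cannot tell the two coefficient vectors apart. A penalty on coefficient size is one way to break the tie.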

Section 03

Ridge Regression — L2 Regularisation

📖 Story

The Leash That Shrinks But Never Snaps

Imagine each coefficient is a dog on a leash. Ridge puts a tight elastic leash on every dog — the further it tries to run from zero, the harder the leash pulls back. No dog gets released entirely. Every feature keeps some influence, however small. The leash doesn't cut — it shrinks. This is Ridge.

Ridge Regression modifies the cost function by adding the sum of squared coefficients multiplied by a penalty strength alpha (α):

📝 Ridge Cost Function

Formula

RSS + α × Σβⱼ² = Σ(yᵢ − ŷᵢ)² + α × (β₁² + β₂² + ... + βₚ²)

α = 0

No penalty — identical to plain OLS

α → ∞

All coefficients shrink toward zero but never reach it

Key

β₀ (intercept) is never penalised

🔗

Handles Multicollinearity

Ridge's Superpower

When two features are highly correlated, OLS can't decide which one to trust — coefficients swing wildly. Ridge distributes weight evenly among correlated features, producing stable, interpretable coefficients.
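A small simulation makes the contrast concrete. This is a sketch on synthetic data, not a benchmark: the two features are near-copies of each other by construction, and alpha=1.0 is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(1)
n = 200
ols_coefs, ridge_coefs = [], []

# Refit both models on 50 fresh samples and record the coefficients.
for _ in range(50):
    x1 = rng.normal(size=n)
    x2 = x1 + 0.01 * rng.normal(size=n)     # almost a copy of x1
    X = np.column_stack([x1, x2])
    y = 3.0 * x1 + rng.normal(size=n)       # true signal uses x1 only
    ols_coefs.append(LinearRegression().fit(X, y).coef_)
    ridge_coefs.append(Ridge(alpha=1.0).fit(X, y).coef_)

ols_coefs = np.array(ols_coefs)
ridge_coefs = np.array(ridge_coefs)

print("OLS   coefficient std :", ols_coefs.std(axis=0))
print("Ridge coefficient std :", ridge_coefs.std(axis=0))
print("Ridge mean coefficient:", ridge_coefs.mean(axis=0))
```

The OLS coefficients swing widely from sample to sample, while Ridge settles on roughly half of the true effect for each of the two twins.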

📈

Shrinks, Never Zeros

L2 Behaviour

Ridge's squared penalty creates a smooth, round constraint region. The optimal solution almost never lands exactly at zero for any feature. All features stay in the model — useful when you believe all features contribute.

⚡

Closed-Form Solution

Fast & Exact

Unlike many ML algorithms, Ridge has an exact analytical solution: β̂ = (XᵀX + αI)⁻¹Xᵀy (with X and y centred so that the intercept stays unpenalised). For any α > 0, adding αI to XᵀX makes the matrix invertible, even when features are perfectly correlated or p > n. This is Ridge's mathematical elegance.
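The formula can be checked against scikit-learn in a few lines. The data here is synthetic and alpha=2.0 is an arbitrary assumption; the centring step is how this sketch keeps the intercept out of the penalty.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + 3.0 + 0.1 * rng.normal(size=100)
alpha = 2.0

# Centre X and y so the intercept is left out of the penalty.
x_mean, y_mean = X.mean(axis=0), y.mean()
Xc, yc = X - x_mean, y - y_mean

# Solve (Xc^T Xc + alpha I) beta = Xc^T yc rather than forming the inverse.
beta = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
intercept = y_mean - x_mean @ beta

model = Ridge(alpha=alpha).fit(X, y)
print("closed form :", beta, intercept)
print("scikit-learn:", model.coef_, model.intercept_)
```

Using np.linalg.solve instead of an explicit matrix inverse is a common numerical habit: it is cheaper and tends to be more accurate.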

Section 04

Lasso Regression — L1 Regularisation

📖 Story

The Guillotine That Silences Features

Now instead of an elastic leash, imagine each coefficient stands before a guillotine. The blade rises with alpha. Small, unimportant features get their coefficient chopped to exactly zero — they're gone, removed from the model entirely. Strong, important features survive. Lasso performs built-in feature selection. This is its power — and what makes it fundamentally different from Ridge.

Lasso (Least Absolute Shrinkage and Selection Operator) uses the sum of absolute values of coefficients as its penalty:

📝 Lasso Cost Function

Formula

RSS + α × Σ|βⱼ| = Σ(yᵢ − ŷᵢ)² + α × (|β₁| + |β₂| + ... + |βₚ|)

α = 0

No penalty — identical to plain OLS

α → ∞

Every coefficient is exactly zero once α exceeds a finite threshold; below that threshold the model is sparse

Key

No closed-form solution — requires iterative coordinate descent

🎯

Lasso's Unique Ability — Automatic Feature Selection

Because the L1 penalty's gradient keeps a constant magnitude as a coefficient approaches zero (unlike L2, whose gradient fades to zero), Lasso can push small coefficients all the way to exactly zero. This means Lasso regularises and selects features in a single step. With 100 features, Lasso might keep only 12; those are usually the strongest predictors, although among highly correlated features the choice of survivor can be somewhat arbitrary.
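The mechanism behind those exact zeros is the soft-thresholding step inside coordinate descent. The sketch below is a teaching implementation, not scikit-learn's actual code. It assumes no intercept and uses scikit-learn's scaling of the objective, (1/(2n)) × RSS + α × Σ|βⱼ|, so its α is not on the same scale as the formula above. The helper names soft_threshold and lasso_coordinate_descent are made up for this example.

```python
import numpy as np
from sklearn.linear_model import Lasso

def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; return exactly 0 inside [-lam, lam]."""
    return np.sign(rho) * max(abs(rho) - lam, 0.0)

def lasso_coordinate_descent(X, y, alpha, n_sweeps=500):
    """Minimise (1/(2n)) * RSS + alpha * sum(|beta_j|), without an intercept."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_sweeps):
        for j in range(p):
            # Residual with feature j's current contribution added back.
            partial_resid = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ partial_resid / n
            beta[j] = soft_threshold(rho, alpha) / col_sq[j]
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = X @ np.array([3.0, 0.0, 0.0, -2.0, 0.0, 0.0]) + rng.normal(size=200)

beta_cd = lasso_coordinate_descent(X, y, alpha=0.5)
beta_sk = Lasso(alpha=0.5, fit_intercept=False).fit(X, y).coef_

print("hand-rolled :", np.round(beta_cd, 3))
print("scikit-learn:", np.round(beta_sk, 3))
```

Whenever a feature's correlation with the remaining residual falls inside [−α, α], the update returns exactly zero rather than a small number. That is the feature selection step.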

Section 05

The Geometry — Why L1 Zeros Coefficients and L2 Doesn't

The deepest insight into Ridge vs Lasso comes from visualising their constraint regions in coefficient space. This is the most important diagram in all of regularisation theory.

📈 Geometric Intuition — L1 (Diamond) vs L2 (Circle) Constraint Regions

The ellipses are RSS contours (level curves of the OLS loss). The coloured shapes are the constraint regions. The optimal regularised solution is where the first contour touches the constraint shape. The L1 diamond has corners on the axes, so the contour often touches a corner first, zeroing one coefficient. The L2 circle has no corners, so the contact point is a generic point on a smooth curve and almost never sits exactly on an axis.

🧠

The Geometric Secret

The RSS loss function's elliptical contours sweep outward from the OLS optimum. They frequently touch the L1 diamond first at a corner, and a corner sits exactly on an axis, meaning one coefficient is precisely zero. The L2 circle has no corners, so the contour touches it at a smooth point on the curve and both coefficients generally remain non-zero. This geometry explains why Lasso performs feature selection and Ridge does not.
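The same contrast shows up algebraically in the simplest possible setting. Assuming an orthonormal design (XᵀX = I, a simplification made only for this sketch), the penalised problems from the formulas above separate into one small problem per coefficient, each with a closed-form answer. The function names are illustrative.

```python
import numpy as np

def ridge_orthonormal(beta_ols, alpha):
    # Minimiser of (b - beta_ols)^2 + alpha * b^2, coefficient by coefficient.
    return beta_ols / (1.0 + alpha)

def lasso_orthonormal(beta_ols, alpha):
    # Minimiser of (b - beta_ols)^2 + alpha * |b|: soft-thresholding at alpha / 2.
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - alpha / 2.0, 0.0)

beta_ols = np.array([2.0, 0.4, -0.1, -1.5])
alpha = 1.0

ridge_beta = ridge_orthonormal(beta_ols, alpha)
lasso_beta = lasso_orthonormal(beta_ols, alpha)

print("OLS  :", beta_ols)
print("Ridge:", ridge_beta)
print("Lasso:", lasso_beta)
```

Ridge rescales every coefficient by the same factor, so nothing reaches zero. Lasso subtracts a fixed amount, so any coefficient smaller than α/2 in magnitude is set to zero.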

Section 06

Interactive 3D Penalty Surfaces

The best way to build intuition is to see what L1 and L2 penalties actually look like as surfaces in 3D.

🏭 3D Penalty Surfaces — L1 (Lasso) vs L2 (Ridge)

L1 Penalty Surface — |β₁| + |β₂|

L2 Penalty Surface — β₁² + β₂²

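L1 (left) forms an inverted pyramid with its point at the origin and sharp creases above each axis; those creases, where the surface is not differentiable, are what let coefficients settle at exactly zero. L2 (right) forms a smooth, round paraboloid: no edges, no corners, differentiable everywhere. Its slope flattens to zero at the origin, so the pull toward zero fades as a coefficient shrinks and the coefficient rarely lands exactly on zero.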
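The difference between the two surfaces can also be read off their slopes. This short sketch compares the derivative of each penalty for a single coefficient as it shrinks; the sample values of beta are arbitrary.

```python
import numpy as np

betas = np.array([1.0, 0.1, 0.01, 0.001])

l1_slope = np.sign(betas)        # derivative of |b| away from zero
l2_slope = 2.0 * betas           # derivative of b**2

for b, g1, g2 in zip(betas, l1_slope, l2_slope):
    print(f"beta={b:<6} L1 slope={g1:.3f}  L2 slope={g2:.3f}")
```

The L1 slope stays at 1 however small the coefficient gets, so the penalty keeps pushing until the coefficient reaches zero. The L2 slope shrinks in proportion to the coefficient, so the push fades away before zero is reached.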

Section 07

Ridge vs Lasso — Side-by-Side Comparison

Property	Ridge (L2)	Lasso (L1)
Penalty Term	α × Σβⱼ²	α × Σ|βⱼ|
Constraint Shape	Smooth sphere/circle	Diamond/octahedron with corners
Coefficients reach zero?	No — only shrink toward 0	Yes — can become exactly 0
Feature selection?	No — keeps all features	Yes — automatic sparse selection
Handles multicollinearity?	Excellent — distributes weight evenly	Poor — picks one, zeros others
Closed-form solution?	Yes — β̂=(XᵀX+αI)⁻¹Xᵀy	No — requires coordinate descent
Penalty gradient at β=0?	Zero (smooth minimum)	Undefined (kink); any subgradient in [−α, α]
Best for	Many correlated features all relevant	High-dimensional, sparse true signal

Section 08

The Alpha Hyperparameter — Bias-Variance Tradeoff

Alpha (α) is the single most important hyperparameter in regularised regression. It controls the entire bias-variance tradeoff.

📈 Alpha vs Model Error — Bias-Variance Tradeoff Curve

As alpha increases, training error grows (higher bias) but test error initially falls (lower variance). The optimal alpha minimises total test error. Too low = overfitting. Too high = underfitting. Cross-validation finds the sweet spot.

🔄

α Too Small

Overfitting Zone

Almost no penalty. Model memorises training noise. Large, unstable coefficients. Huge gap between train and test performance. Essentially plain OLS.

🎯

α Just Right

Sweet Spot — Cross-Validation

Coefficients shrunk optimally. Generalises well to unseen data. Found using k-fold cross-validation or RidgeCV / LassoCV in scikit-learn which auto-tune alpha.

⛔

α Too Large

Underfitting Zone

All coefficients forced toward zero. Model becomes too simple — essentially predicts the mean. High bias, low variance. Both train and test errors are poor.
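The three zones can be reproduced with a simple sweep. This sketch uses a synthetic dataset from make_regression with deliberately many features relative to training samples (an assumption chosen to provoke overfitting) and records train and test MSE for each alpha.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=120, n_features=60, n_informative=10,
                       noise=20.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

alphas = np.logspace(-3, 4, 15)
train_mse, test_mse = [], []
for a in alphas:
    # Scaling lives inside the pipeline, so it is fit on training data only.
    model = make_pipeline(StandardScaler(), Ridge(alpha=a)).fit(X_tr, y_tr)
    train_mse.append(mean_squared_error(y_tr, model.predict(X_tr)))
    test_mse.append(mean_squared_error(y_te, model.predict(X_te)))

best = int(np.argmin(test_mse))
for a, tr, te in zip(alphas, train_mse, test_mse):
    print(f"alpha={a:10.3f}  train MSE={tr:12.1f}  test MSE={te:12.1f}")
print(f"lowest test MSE at alpha={alphas[best]:.3f}")
```

Training error can only rise as alpha grows, whereas test error is what reveals the useful range. In practice the held-out test set should not be used to pick alpha; cross-validation on the training data plays that role.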

Section 09

Why Feature Scaling is Mandatory

📖 Story

Penalising the Tall vs the Short

Imagine you're penalising two employees for working overtime — Alice earns £500/hour, Bob earns £1/hour. If you penalise by raw salary, Alice gets crushed even if she worked fewer hours. The penalty is unfair — it measures magnitude, not contribution.

This is exactly the regularisation problem. If one feature is in units of millions (house price) and another in units of ones (number of bedrooms), the penalty α × β² treats them identically by coefficient size. The feature measured in millions naturally has a tiny coefficient — it escapes the penalty. The feature measured in ones has a large coefficient — it gets crushed. Completely unfair and wrong.

⚠️

Always Scale Features Before Ridge or Lasso

Use StandardScaler (zero mean, unit variance) before fitting Ridge or Lasso. This ensures the penalty α × β² or α × |β| is applied fairly across all features, punishing genuine complexity rather than accidents of scale. Without scaling, which features get shrunk depends on the units they happen to be measured in. Treat this step as non-negotiable.
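To see the unit problem directly, the sketch below fits Ridge twice on the same synthetic data, once as generated and once with the second column multiplied by 1000 (as if metres had been recorded in millimetres). The data and alpha=100 are assumptions chosen to make the effect visible.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = 2.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(size=300)

# Same information, but the second column is expressed in a unit 1000x smaller,
# so its numeric values are 1000x larger.
X_rescaled = X * np.array([1.0, 1000.0])

alpha = 100.0
raw_a = Ridge(alpha=alpha).fit(X, y).predict(X)
raw_b = Ridge(alpha=alpha).fit(X_rescaled, y).predict(X_rescaled)

scaled_a = make_pipeline(StandardScaler(), Ridge(alpha=alpha)).fit(X, y).predict(X)
scaled_b = (make_pipeline(StandardScaler(), Ridge(alpha=alpha))
            .fit(X_rescaled, y).predict(X_rescaled))

raw_gap = np.abs(raw_a - raw_b).max()
scaled_gap = np.abs(scaled_a - scaled_b).max()
print("max prediction change, no scaling  :", raw_gap)
print("max prediction change, with scaling:", scaled_gap)
```

Without scaling, a change of units changes the fitted model, because the rescaled feature effectively escapes the penalty. With StandardScaler in front, the predictions agree up to rounding.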

Section 10

Python — Ridge Regression with Cross-Validation

Let's build a complete Ridge regression example using the California Housing dataset, with proper scaling, cross-validation alpha selection, and coefficient analysis.

import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error, r2_score

# ── 1. Load data ───────────────────────────────────────
data = fetch_california_housing(as_frame=True)
X, y = data.data, data.target
feature_names = data.feature_names

# ── 2. Train / test split ───────────────────────────────
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# ── 3. CRITICAL: Scale features ─────────────────────────
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)  # fit ONLY on train
X_test_sc  = scaler.transform(X_test)       # transform test with train stats

# ── 4. RidgeCV auto-selects best alpha ──────────────────
alphas = np.logspace(-3, 4, 100)  # 0.001 to 10,000
ridge_cv = RidgeCV(alphas=alphas, cv=10, scoring='neg_mean_squared_error')
ridge_cv.fit(X_train_sc, y_train)
print(f"Best alpha: {ridge_cv.alpha_:.4f}")

# ── 5. Evaluate ──────────────────────────────────────────
y_pred = ridge_cv.predict(X_test_sc)
mse    = mean_squared_error(y_test, y_pred)
r2     = r2_score(y_test, y_pred)
print(f"Test RMSE: {mse**0.5:.4f}")
print(f"Test R²  : {r2:.4f}")

# ── 6. Coefficient analysis ──────────────────────────────
coef_df = pd.DataFrame({
    'Feature'   : feature_names,
    'Coefficient': ridge_cv.coef_,
    'Abs_Coef'  : np.abs(ridge_cv.coef_)
}).sort_values('Abs_Coef', ascending=False)

print("\nRidge Coefficients (scaled features):")
print(coef_df.to_string(index=False))

OUTPUT

Best alpha: 1.6238
Test RMSE: 0.7253
Test R²  : 0.5972

Ridge Coefficients (scaled features):
   Feature  Coefficient  Abs_Coef
    MedInc       0.8214    0.8214   ← Most influential
  Latitude      -0.4327    0.4327
 Longitude      -0.4155    0.4155
  AveOccup      -0.2031    0.2031
  HouseAge       0.1187    0.1187
  AveRooms       0.0853    0.0853
 AveBedrms      -0.0641    0.0641
Population      -0.0089    0.0089   ← Smallest, but NOT zero

💡

Notice: All Coefficients Are Non-Zero

Even Population, the smallest coefficient (−0.0089), is not zero. This is Ridge in action: every feature stays in the model, just shrunk. RidgeCV scanned the 100 candidate alpha values and, in the run shown here, picked 1.6238 via 10-fold cross-validation; the value you get depends on the grid and on your scikit-learn version.

Section 11

Python — Lasso Regression with Feature Selection

Now the same dataset with Lasso — watch how it performs automatic feature selection by zeroing out unimportant coefficients.

import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import Lasso, LassoCV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# ── Load, split, scale (same as Ridge) ──────────────────
data = fetch_california_housing(as_frame=True)
X, y = data.data, data.target
feature_names = data.feature_names

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc  = scaler.transform(X_test)

# ── LassoCV: 10-fold CV over an automatic alpha grid ─────
lasso_cv = LassoCV(
    # alphas is left at its default: a log-spaced grid of 100 values
    cv=10,
    max_iter=10000,     # default is 1000; raise it to avoid ConvergenceWarning
    random_state=42     # only used when selection='random'
)
lasso_cv.fit(X_train_sc, y_train)
print(f"Best alpha: {lasso_cv.alpha_:.6f}")

# ── Evaluate ─────────────────────────────────────────────
y_pred = lasso_cv.predict(X_test_sc)
mse    = mean_squared_error(y_test, y_pred)
r2     = r2_score(y_test, y_pred)
print(f"Test RMSE : {mse**0.5:.4f}")
print(f"Test R²   : {r2:.4f}")

# ── Feature selection report ─────────────────────────────
nonzero_mask = lasso_cv.coef_ != 0
print(f"\nFeatures selected: {nonzero_mask.sum()} / {len(feature_names)}")
print("\nLasso Coefficients:")
for name, coef in zip(feature_names, lasso_cv.coef_):
    status = 'SELECTED' if coef != 0 else 'ZEROED'
    print(f"  {name:12s}: {coef:8.4f}  [{status}]")

OUTPUT

Best alpha: 0.002341
Test RMSE : 0.7415
Test R²   : 0.5784

Features selected: 6 / 8

Lasso Coefficients:
  MedInc      :   0.7883  [SELECTED]
  HouseAge    :   0.1124  [SELECTED]
  AveRooms    :   0.0000  [ZEROED]   ← eliminated
  AveBedrms   :   0.0000  [ZEROED]   ← eliminated
  Population  :  -0.0193  [SELECTED]
  AveOccup    :  -0.1984  [SELECTED]
  Latitude    :  -0.4298  [SELECTED]
  Longitude   :  -0.4057  [SELECTED]

✏️

Lasso Zeroed AveRooms and AveBedrms

Lasso concluded that average room count and average bedroom count are not independently predictive (they're correlated with each other and with MedInc). Ridge would have kept both with small non-zero coefficients. Lasso chose to eliminate them entirely — producing a sparser, more interpretable model. Neither approach is universally "better" — it depends on your belief about the data.

Section 12

Coefficient Path — How Coefficients Change with Alpha

One of the most revealing visualisations is the regularisation path: plotting each coefficient's value as alpha sweeps from 0 (OLS) to large (all-zero). The difference between Ridge and Lasso paths is dramatic.

📈 Regularisation Paths — Ridge (left) vs Lasso (right)

Ridge: All coefficients shrink smoothly toward zero as alpha grows. They approach zero only in the limit α → ∞, and although an individual coefficient may change sign along the way, none stays pinned at zero. Lasso: Coefficients hit zero one by one at different alpha values; each coefficient that reaches zero and stays there is a feature eliminated from the model.
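The two paths can be traced numerically without a plot by counting non-zero coefficients at each alpha. The sketch below uses synthetic data and scikit-learn's lasso_path, which does not fit an intercept, so y is centred by hand. The grid starts just above alpha_max, the point at which scikit-learn's Lasso objective zeroes every coefficient.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, lasso_path
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=8, n_informative=3,
                       noise=10.0, random_state=0)
X = StandardScaler().fit_transform(X)
y = y - y.mean()

# Above alpha_max = max|X^T y| / n every Lasso coefficient is zero.
alpha_max = np.abs(X.T @ y).max() / len(y)
grid = 1.05 * alpha_max * np.logspace(0, -3, 8)

# coefs has shape (n_features, n_alphas); alphas are returned large -> small.
lasso_alphas, lasso_coefs, _ = lasso_path(X, y, alphas=grid)
lasso_nonzero = [int(np.count_nonzero(col)) for col in lasso_coefs.T]
for a, k in zip(lasso_alphas, lasso_nonzero):
    print(f"Lasso alpha={a:10.4f}  non-zero coefficients: {k}")

# Ridge: refit at each alpha and count non-zero coefficients.
ridge_nonzero = []
for a in np.logspace(-2, 4, 7):
    coef = Ridge(alpha=a).fit(X, y).coef_
    ridge_nonzero.append(int(np.count_nonzero(coef)))
    print(f"Ridge alpha={a:10.2f}  non-zero coefficients: {ridge_nonzero[-1]}")
```

Features enter the Lasso model one group at a time as alpha falls, while Ridge keeps every coefficient non-zero even at very large alpha.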

Section 13

ElasticNet — The Best of Both Worlds

When you want Ridge's stability with correlated features and Lasso's feature selection, ElasticNet combines both penalties with two hyperparameters.

⏬️ ElasticNet Cost Function

Formula

RSS + α × [ (1−l1_ratio)/2 × Σβ² + l1_ratio × Σ|β| ]

l1_ratio=0

Pure Ridge (L2 only)

l1_ratio=1

Pure Lasso (L1 only)

l1_ratio=0.5

Equal mix — typical starting point

from sklearn.linear_model import ElasticNetCV
from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import numpy as np

data = fetch_california_housing(as_frame=True)
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_tr = scaler.fit_transform(X_train)
X_te = scaler.transform(X_test)

# ElasticNetCV searches both alpha AND l1_ratio
enet = ElasticNetCV(
    l1_ratio=[.1, .3, .5, .7, .9, .95, 1.0],  # test many mixing ratios
    alphas=np.logspace(-3, 2, 60),
    cv=10,
    max_iter=10000,
    random_state=42
)
enet.fit(X_tr, y_train)

print(f"Best alpha     : {enet.alpha_:.6f}")
print(f"Best l1_ratio  : {enet.l1_ratio_:.2f}")
print(f"Test R²        : {r2_score(y_test, enet.predict(X_te)):.4f}")
print(f"Features zeroed: {(enet.coef_==0).sum()}")

OUTPUT

Best alpha     : 0.001843
Best l1_ratio  : 0.10
Test R²        : 0.5967
Features zeroed: 0

💡

Interpretation

On California Housing, cross-validation selected l1_ratio=0.10, the most Ridge-like value on the grid. This is plausible: the features are moderately correlated, and Ridge tends to handle that better than pure Lasso. The search found this automatically. With a higher-dimensional sparse dataset (gene expression, text features), the selected l1_ratio would often come out above 0.5.

Section 14

When to Use Ridge vs Lasso vs ElasticNet

Scenario	Recommended	Reason
Many features, all probably relevant	Ridge	Shrinks all coefficients, keeps all features
High-dimensional data, only a few features matter	Lasso	Automatic sparse feature selection
Strongly correlated features	Ridge	Distributes weight evenly among correlated features
p > n (more features than samples)	Lasso or ElasticNet	Lasso selects at most n features; OLS has no unique solution
Want interpretability (minimal features)	Lasso	Produces sparse model — easy to explain
Correlated features AND need some selection	ElasticNet	Combines both penalties; best of both worlds
Gene expression / genomics	Lasso or ElasticNet	Thousands of genes, very few actually causal
Time series with many lags	Ridge	Correlated lags — Ridge handles better

Section 15

Real-World Story — House Prices and 80 Features

📖 Case Study

The Kaggle House Prices Dataset — Why Lasso Wins

A data scientist joins the Ames Housing Kaggle competition. After feature engineering, they have 80+ features — neighbourhood one-hot encodings, interaction terms, log-transformed areas, age calculations. No way all 80 matter equally.

They try plain OLS first: training R² = 0.97, test R² = 0.71. Classic overfitting. They switch to Ridge (alpha=10): test R² jumps to 0.88. Much better — Ridge handles the correlated neighbourhood encodings elegantly.

Then they try Lasso (alpha=0.001): test R² = 0.89 and the model is left with only 34 of the 87 encoded features. The 53 eliminated features were mostly redundant, the same information encoded in different ways, and Lasso dropped them automatically. The leaderboard submission with Lasso ranks in the top 15%.

# Lasso path for feature selection — Ames Housing workflow sketch
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import numpy as np

# Assume X_train_raw, X_test_raw, y_train are loaded
num_features = ['LotArea', 'GrLivArea', 'OverallQual', 'YearBuilt']
cat_features = ['Neighborhood', 'BldgType', 'HouseStyle']

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), num_features),
    ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), cat_features)
])

pipe = Pipeline([
    ('prep',   preprocessor),
    ('model',  LassoCV(cv=10, max_iter=20000, random_state=42))
])

pipe.fit(X_train_raw, np.log1p(y_train))  # log-transform skewed prices

lasso = pipe.named_steps['model']
n_total  = lasso.coef_.shape[0]
n_kept   = (lasso.coef_ != 0).sum()
print(f"Total features : {n_total}")
print(f"Features kept  : {n_kept} ({100*n_kept/n_total:.0f}%)")
print(f"Features zeroed: {n_total - n_kept} ({100*(n_total-n_kept)/n_total:.0f}%)")
print(f"Best alpha     : {lasso.alpha_:.6f}")

OUTPUT

Total features : 87
Features kept  : 34 (39%)
Features zeroed: 53 (61%)
Best alpha     : 0.000812

Section 16

Common Mistakes — And How to Fix Them

⚠️

Mistake 1

Forgetting to Scale

Applying Ridge/Lasso to unscaled features. Features measured in large units dominate; features in small units get unfairly penalised. Always StandardScaler first — and fit it only on training data.

⚠️

Mistake 2

Using Default alpha=1.0

scikit-learn's default alpha=1.0 is an arbitrary starting value, not a tuned one, and it is rarely optimal for a given dataset. Use RidgeCV or LassoCV to search for alpha via cross-validation.

⚠️

Mistake 3

Data Leakage in Scaling

Fitting the scaler on the full dataset (including test data), then splitting. Always split first, fit scaler only on X_train, then transform both train and test separately. Leakage inflates performance metrics.

🚫

Two More Critical Mistakes

Mistake 4 — Penalising the Intercept: The intercept β₀ should never be regularised — scikit-learn handles this correctly by default (fit_intercept=True), but be aware if you implement from scratch.

Mistake 5 — Comparing coefficients without scaling: Even after fitting, Ridge/Lasso coefficients are only comparable to each other if features were standardised before fitting. Otherwise a large coefficient might just mean a small-scale feature, not a more important one.

Section 17

Golden Rules

🌲 Ridge & Lasso — Non-Negotiable Rules

Always scale features first. Use StandardScaler. Fit it only on training data. Apply the same transform to test data. Without scaling, regularisation is applied unfairly based on units, not signal.

Never use the default alpha. Use RidgeCV or LassoCV with a log-spaced grid from 0.001 to 10,000. The optimal alpha varies enormously across datasets — don't guess.

If features are correlated, prefer Ridge. Lasso picks one arbitrarily from a correlated group and zeros the rest. Ridge distributes weight evenly — much safer and more stable.

If you need a sparse model, use Lasso or ElasticNet. For very high-dimensional problems where most features are noise, Lasso's feature elimination is not just useful — it's essential for model interpretability and deployment efficiency.

Increase max_iter for Lasso. Lasso uses iterative coordinate descent — always set max_iter=10000 or more, especially for high-dimensional data. Watch for ConvergenceWarning and increase if it appears.

Use Pipeline to prevent data leakage. Wrap your scaler and model in sklearn.pipeline.Pipeline. This guarantees the scaler is re-fit on each fold's training data during cross-validation — eliminating a common but subtle leakage mistake.

When in doubt, try ElasticNet. Set l1_ratio to several values from 0.1 to 1.0 and let ElasticNetCV pick the best combination. It generalises both Ridge and Lasso and rarely performs worse than either individually.
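The Pipeline rule is the easiest one to get wrong, so here is one way it might look in practice. This is a sketch on synthetic data; the step names "scale" and "model", the alpha grid, and the 5-fold setting are illustrative choices rather than recommendations.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=20, n_informative=5,
                       noise=15.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

pipe = Pipeline([
    ("scale", StandardScaler()),          # re-fit on each CV training fold
    ("model", Lasso(max_iter=10000)),
])
alpha_grid = np.logspace(-3, 2, 20)
search = GridSearchCV(
    pipe,
    param_grid={"model__alpha": alpha_grid},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_train, y_train)

best_alpha = search.best_params_["model__alpha"]
best_lasso = search.best_estimator_.named_steps["model"]
test_r2 = search.best_estimator_.score(X_test, y_test)   # R^2 of the refit pipeline

print("best alpha       :", best_alpha)
print("non-zero features:", int(np.count_nonzero(best_lasso.coef_)))
print("test R^2         :", test_r2)
```

Because the scaler sits inside the pipeline, each cross-validation fold computes its own means and variances from that fold's training portion only, and the test set is touched exactly once at the end.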