The Story That Explains Everything
Student B reads the same notes but forces himself to summarise — if a topic can't be explained with simple core rules, he discards the complexity and keeps the big picture. Practice paper: 88%. Real exam: 85%. Almost no gap.
Regularisation is Student B's discipline: a deliberate, mathematically precise mechanism that punishes complexity, forcing the model to learn patterns rather than memorise noise. Ridge and Lasso are the two most widely used forms of that discipline.
In machine learning, overfitting happens when a model learns the training data too well — including its noise, quirks, and random fluctuations. It performs brilliantly on data it has seen and catastrophically on data it has not. Regularisation adds a penalty term to the loss function that discourages large, complex coefficients, giving us models that generalise far better to new data.
Both Ridge and Lasso modify the standard Linear Regression cost function by adding a penalty for large coefficients. Ridge uses the sum of squared coefficients (L2 norm). Lasso uses the sum of absolute coefficients (L1 norm). The single hyperparameter alpha (α) controls how severely those large coefficients are penalised.
The Problem — Why Plain Linear Regression Fails
Standard Ordinary Least Squares (OLS) regression minimises only the residual sum of squares. It will assign whatever coefficients minimise training error, even if those coefficients are enormous and unstable.
1. More features than samples (p > n): OLS has infinitely many solutions — the system is underdetermined.
2. Multicollinearity: Correlated features cause wildly unstable, high-variance coefficients.
3. Irrelevant features: OLS includes all features regardless — adding noise to every prediction.
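The first two failure modes can be seen numerically. The sketch below uses small synthetic matrices (the sizes, seed, and noise level are arbitrary assumptions, not taken from any dataset in this article) to show that XᵀX loses rank when p > n and becomes badly conditioned when two features are nearly identical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Case 1: more features than samples (p > n).
n, p = 5, 8
X_wide = rng.normal(size=(n, p))
rank = np.linalg.matrix_rank(X_wide)   # rank(X^T X) equals rank(X), at most n
print(f"p = {p}, rank of X = {rank}")  # rank < p, so X^T X cannot be inverted

# Case 2: two almost identical (highly correlated) features.
x1 = rng.normal(size=200)
x2 = x1 + 1e-4 * rng.normal(size=200)
X_col = np.column_stack([x1, x2])
cond = np.linalg.cond(X_col.T @ X_col)
print(f"condition number of X^T X: {cond:.2e}")  # huge, so OLS coefficients are unstable
```

A rank below p means the normal equations have no unique solution, and a very large condition number means tiny changes in the data can swing the OLS coefficients wildly.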
Ridge Regression — L2 Regularisation
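Ridge Regression modifies the cost function by adding the sum of squared coefficients multiplied by a penalty strength alpha (α):
Cost = Σᵢ(yᵢ − ŷᵢ)² + α × Σⱼβⱼ²
Setting α = 0 recovers OLS; larger values of α shrink the coefficients harder.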
Lasso Regression — L1 Regularisation
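Lasso (Least Absolute Shrinkage and Selection Operator) uses the sum of absolute values of coefficients as its penalty:
Cost = Σᵢ(yᵢ − ŷᵢ)² + α × Σⱼ|βⱼ|
Note that scikit-learn's Lasso scales the squared-error term by 1/(2n), so its alpha values are not directly comparable to Ridge's.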
Because the L1 penalty's gradient doesn't approach zero as coefficients approach zero (unlike L2), Lasso pushes small coefficients all the way to exactly zero. This means Lasso simultaneously regularises and selects features in a single step. With 100 features, Lasso might keep only 12, ideally the ones that carry real signal, although with correlated features the choice among them can be somewhat arbitrary.
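The difference is easiest to see in one dimension. The sketch below assumes a single coefficient with unpenalised estimate z and compares the two penalised minimisers; the function names are illustrative, not library functions.

```python
import numpy as np

def ridge_shrink(z, alpha):
    """Minimiser of 0.5*(b - z)**2 + 0.5*alpha*b**2: rescales z, never zeroes it."""
    return z / (1.0 + alpha)

def lasso_soft_threshold(z, alpha):
    """Minimiser of 0.5*(b - z)**2 + alpha*|b|: exactly zero whenever |z| <= alpha."""
    return np.sign(z) * np.maximum(np.abs(z) - alpha, 0.0)

z = np.array([-2.0, -0.3, 0.1, 0.8, 3.0])   # hypothetical unpenalised estimates
alpha = 0.5

ridge_out = ridge_shrink(z, alpha)
lasso_out = lasso_soft_threshold(z, alpha)

print("ridge:", ridge_out)   # every entry scaled by 1/(1 + alpha), none is zero
print("lasso:", lasso_out)   # entries with |z| <= alpha become exactly zero
```

Ridge multiplies every estimate by the same factor, so small values stay small but non-zero. Lasso subtracts a fixed amount and clips at zero, which is where the sparsity comes from.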
The Geometry — Why L1 Zeros Coefficients and L2 Doesn't
The deepest insight into Ridge vs Lasso comes from visualising their constraint regions in coefficient space.
In the standard two-coefficient picture, the ellipses are RSS contours (level curves of the OLS loss) and the diamond and circle are the L1 and L2 constraint regions. The regularised solution is the point where the smallest contour touches the constraint region. The L1 diamond has corners on the axes, so the contact point is often a corner, which zeroes one coefficient. The L2 circle has no corners, so the contact point is a smooth point of the boundary and, except by coincidence, neither coefficient is exactly zero.
Picture the RSS contours expanding outward from the OLS optimum. They tend to meet the L1 diamond first at a corner, and a corner sits exactly on an axis, meaning one coefficient is precisely zero. They meet the L2 circle at a smooth point, so both coefficients generally remain non-zero. This geometry is why Lasso performs feature selection and Ridge does not.
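The corner effect can be checked with a brute-force search over the two constraint boundaries. This is a simplified sketch: it assumes a hypothetical OLS optimum at (2.0, 0.5), a constraint budget of 1, and circular rather than elliptical loss contours.

```python
import numpy as np

b_ols = np.array([2.0, 0.5])   # hypothetical unconstrained optimum

def loss(b):
    """Squared distance to b_ols: a stand-in for RSS with circular contours."""
    return np.sum((b - b_ols) ** 2, axis=-1)

# Points on the boundary of the L1 diamond |b1| + |b2| = 1.
s = np.linspace(-1.0, 1.0, 2001)
diamond = np.concatenate([
    np.column_stack([s, 1.0 - np.abs(s)]),
    np.column_stack([s, -(1.0 - np.abs(s))]),
])

# Points on the boundary of the L2 circle b1^2 + b2^2 = 1.
theta = np.linspace(0.0, 2.0 * np.pi, 20001)
circle = np.column_stack([np.cos(theta), np.sin(theta)])

b_l1 = diamond[np.argmin(loss(diamond))]
b_l2 = circle[np.argmin(loss(circle))]

print("L1-constrained solution:", b_l1)   # lands on the corner (1, 0)
print("L2-constrained solution:", b_l2)   # both coordinates non-zero
```

Moving the hypothetical optimum closer to the diagonal (for example to (1.2, 1.0)) makes the L1 solution land on an edge instead of a corner, which is why the corner outcome is common rather than guaranteed.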
Interactive 3D Penalty Surfaces
The best way to build intuition is to see what L1 and L2 penalties actually look like as surfaces in 3D.
[Figure: two 3D penalty surfaces. Left: L1 penalty surface, |β₁| + |β₂|. Right: L2 penalty surface, β₁² + β₂².]
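L1 (left) forms a sharp inverted pyramid with a pointed tip at the origin and creases along each axis, which are the places where one coefficient is exactly zero. L2 (right) forms a smooth, round paraboloid with no edges or corners, differentiable everywhere. The L2 gradient shrinks to zero as a coefficient approaches zero, so its pull fades before the coefficient gets there, whereas the L1 pull keeps a constant magnitude all the way to zero.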
Ridge vs Lasso — Side-by-Side Comparison
| Property | Ridge (L2) | Lasso (L1) |
|---|---|---|
| Penalty Term | α × Σβⱼ² | α × Σ|βⱼ| |
| Constraint Shape | Smooth sphere/circle | Diamond/octahedron with corners |
| Coefficients reach zero? | No — only shrink toward 0 | Yes — can become exactly 0 |
| Feature selection? | No — keeps all features | Yes — automatic sparse selection |
| Handles multicollinearity? | Excellent — distributes weight evenly | Poor — picks one, zeros others |
| Closed-form solution? | Yes — β̂=(XᵀX+αI)⁻¹Xᵀy | No — requires coordinate descent |
| Gradient at β=0? | Zero — smooth minimum | Non-zero — subgradient required |
| Best for | Many correlated features all relevant | High-dimensional, sparse true signal |
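The closed-form Ridge solution in the table can be checked numerically. The sketch below uses synthetic data (sizes, seed, and true coefficients are assumptions for illustration), omits the intercept for brevity, and verifies that the gradient of the Ridge cost vanishes at the computed solution.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, alpha = 100, 4, 3.0
X = rng.normal(size=(n, p))
y = X @ np.array([1.5, -2.0, 0.0, 0.7]) + 0.1 * rng.normal(size=n)

# Closed form: beta = (X^T X + alpha I)^{-1} X^T y, solved without forming the inverse.
beta_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

# Gradient of ||y - X b||^2 + alpha * ||b||^2 evaluated at the solution.
grad = -2.0 * X.T @ (y - X @ beta_ridge) + 2.0 * alpha * beta_ridge

# Plain OLS for comparison.
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

print("ridge:", np.round(beta_ridge, 4))
print("ols  :", np.round(beta_ols, 4))
print("max |gradient| at ridge solution:", np.abs(grad).max())
```

Using `np.linalg.solve` instead of an explicit matrix inverse is the usual choice for numerical stability. Lasso has no analogous formula because the absolute value is not differentiable at zero.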
The Alpha Hyperparameter — Bias-Variance Tradeoff
Alpha (α) is the single most important hyperparameter in regularised regression. It controls the entire bias-variance tradeoff.
As alpha increases, training error grows (higher bias) but test error initially falls (lower variance). The optimal alpha minimises total test error. Too low = overfitting. Too high = underfitting. Cross-validation finds the sweet spot.
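A small sweep makes the tradeoff concrete. The sketch below fits closed-form Ridge (no intercept) on synthetic data for a range of alpha values; the data, seed, and grid are assumptions, and the test-error column depends on the random draw, so treat it as indicative only.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 60, 10
true_beta = rng.normal(size=p)
X_train = rng.normal(size=(n, p))
X_test = rng.normal(size=(n, p))
y_train = X_train @ true_beta + rng.normal(size=n)
y_test = X_test @ true_beta + rng.normal(size=n)

alphas = np.logspace(-2, 4, 7)   # 0.01 to 10,000
coef_norms, train_mse, test_mse = [], [], []
for a in alphas:
    # Closed-form Ridge solution for this alpha.
    beta = np.linalg.solve(X_train.T @ X_train + a * np.eye(p), X_train.T @ y_train)
    coef_norms.append(np.linalg.norm(beta))
    train_mse.append(np.mean((y_train - X_train @ beta) ** 2))
    test_mse.append(np.mean((y_test - X_test @ beta) ** 2))
    print(f"alpha={a:>9.2f}  ||beta||={coef_norms[-1]:.3f}  "
          f"train MSE={train_mse[-1]:.3f}  test MSE={test_mse[-1]:.3f}")
```

The coefficient norm falls and the training error rises at every step; where the test error bottoms out is what cross-validation estimates.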
Why Feature Scaling is Mandatory
Regularisation penalises coefficients by their size, and coefficient size depends on each feature's units. If one feature is in units of millions (house price) and another in units of ones (number of bedrooms), the penalty α × β² looks only at coefficient size and ignores scale. The feature measured in millions naturally has a tiny coefficient, so it largely escapes the penalty. The feature measured in ones needs a much larger coefficient, so it is shrunk hard. The penalty ends up reflecting units rather than importance.
Use StandardScaler (zero mean, unit variance) before fitting Ridge or Lasso. This ensures the penalty α × β² or α × |β| is applied fairly across all features, punishing genuine complexity rather than scale accidents. Without scaling, the coefficients and the selected alpha depend on arbitrary unit choices, so treat this step as mandatory.
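The unfairness can be reproduced with two synthetic features that matter equally but are recorded in very different units. Everything in this sketch (feature names, scales, alpha, seed) is an assumption chosen for illustration, and the Ridge fit is the closed form on centred data.

```python
import numpy as np

rng = np.random.default_rng(3)
n, alpha = 2000, 2000.0
x_big = 1e6 * rng.normal(size=n)    # feature recorded in "millions"
x_small = rng.normal(size=n)        # feature recorded in "ones"
# Both features have the same effect per standard deviation.
y = x_big / 1e6 + x_small + 0.1 * rng.normal(size=n)

X = np.column_stack([x_big, x_small])
Xc, yc = X - X.mean(axis=0), y - y.mean()
std = Xc.std(axis=0)

def ridge(X, y, alpha):
    """Closed-form Ridge on centred data (no intercept needed)."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

beta_raw = ridge(Xc, yc, alpha)          # fitted on raw units
beta_std = ridge(Xc / std, yc, alpha)    # fitted on standardised features

effect_raw = beta_raw * std   # effect per standard deviation, raw-unit fit
effect_std = beta_std         # effect per standard deviation, standardised fit

print("raw units   :", np.round(effect_raw, 3))   # big-unit feature barely shrunk
print("standardised:", np.round(effect_std, 3))   # both shrunk by a similar amount
```

On raw units the large-scale feature keeps almost its full effect while the small-scale feature is roughly halved; after standardisation the two are treated alike.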
Python — Ridge Regression with Cross-Validation
Let's build a complete Ridge regression example using the California Housing dataset, with proper scaling, cross-validation alpha selection, and coefficient analysis.
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error, r2_score
# ── 1. Load data ───────────────────────────────────────
data = fetch_california_housing(as_frame=True)
X, y = data.data, data.target
feature_names = data.feature_names
# ── 2. Train / test split ───────────────────────────────
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# ── 3. CRITICAL: Scale features ─────────────────────────
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train) # fit ONLY on train
X_test_sc = scaler.transform(X_test) # transform test with train stats
# ── 4. RidgeCV auto-selects best alpha ──────────────────
alphas = np.logspace(-3, 4, 100) # 0.001 to 10,000
ridge_cv = RidgeCV(alphas=alphas, cv=10, scoring='neg_mean_squared_error')
ridge_cv.fit(X_train_sc, y_train)
print(f"Best alpha: {ridge_cv.alpha_:.4f}")
# ── 5. Evaluate ──────────────────────────────────────────
y_pred = ridge_cv.predict(X_test_sc)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Test RMSE: {mse**0.5:.4f}")
print(f"Test R² : {r2:.4f}")
# ── 6. Coefficient analysis ──────────────────────────────
coef_df = pd.DataFrame({
    'Feature'    : feature_names,
    'Coefficient': ridge_cv.coef_,
    'Abs_Coef'   : np.abs(ridge_cv.coef_)
}).sort_values('Abs_Coef', ascending=False)
print("\nRidge Coefficients (scaled features):")
print(coef_df.to_string(index=False))
Even Population, the smallest coefficient (−0.0089), is not zero. This is Ridge in action — every feature stays in the model, just shrunk. RidgeCV automatically scanned 100 alpha values and picked 1.6238 as optimal via 10-fold cross-validation.
Python — Lasso Regression with Feature Selection
Now the same dataset with Lasso — watch how it performs automatic feature selection by zeroing out unimportant coefficients.
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import Lasso, LassoCV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# ── Load, split, scale (same as Ridge) ──────────────────
data = fetch_california_housing(as_frame=True)
X, y = data.data, data.target
feature_names = data.feature_names
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc = scaler.transform(X_test)
# ── LassoCV: 10-fold CV over 100 alpha values ────────────
lasso_cv = LassoCV(
    # alphas is left at its default, so LassoCV generates a log-spaced grid itself
    cv=10,
    max_iter=10000,     # coordinate descent may need many iterations to converge
    random_state=42
)
lasso_cv.fit(X_train_sc, y_train)
print(f"Best alpha: {lasso_cv.alpha_:.6f}")
# ── Evaluate ─────────────────────────────────────────────
y_pred = lasso_cv.predict(X_test_sc)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Test RMSE : {mse**0.5:.4f}")
print(f"Test R² : {r2:.4f}")
# ── Feature selection report ─────────────────────────────
nonzero_mask = lasso_cv.coef_ != 0
print(f"\nFeatures selected: {nonzero_mask.sum()} / {len(feature_names)}")
print("\nLasso Coefficients:")
for name, coef in zip(feature_names, lasso_cv.coef_):
    status = 'SELECTED' if coef != 0 else 'ZEROED'
    print(f"  {name:12s}: {coef:8.4f}  [{status}]")
Lasso concluded that average room count and average bedroom count are not independently predictive (they're correlated with each other and with MedInc). Ridge would have kept both with small non-zero coefficients. Lasso chose to eliminate them entirely — producing a sparser, more interpretable model. Neither approach is universally "better" — it depends on your belief about the data.
Coefficient Path — How Coefficients Change with Alpha
One of the most revealing visualisations is the regularisation path: plotting each coefficient's value as alpha sweeps from 0 (OLS) to large (all-zero). The difference between Ridge and Lasso paths is dramatic.
Ridge: All coefficients shrink smoothly toward zero but do not become exactly zero at any finite alpha; they reach zero together only in the limit α → ∞. Lasso: Coefficients hit zero one by one at different alpha values, and each coefficient that reaches zero is a feature eliminated from the model entirely.
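The two paths can be traced with a few lines of NumPy. The sketch below implements a bare-bones cyclic coordinate descent for Lasso using the scaled objective (1/(2n))·‖y − Xβ‖² + α‖β‖₁, and compares it with closed-form Ridge at matching penalty strengths. The helper names, the synthetic data, and the fixed number of sweeps are all assumptions; a production solver would use a convergence check.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t; return exactly 0 when |z| <= t."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, alpha, n_sweeps=200):
    """Minimise (1/(2n))*||y - Xb||^2 + alpha*||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_sweeps):
        for j in range(p):
            # Residual with feature j's current contribution added back in.
            r_j = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r_j / n
            beta[j] = soft_threshold(rho, alpha) / col_sq[j]
    return beta

rng = np.random.default_rng(4)
n, p = 80, 6
X = rng.normal(size=(n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)          # standardise features
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0, 0.0]) + 0.5 * rng.normal(size=n)
y = y - y.mean()

alpha_max = np.max(np.abs(X.T @ y)) / n           # above this, Lasso zeroes everything
lasso_counts, ridge_counts = [], []
for frac in [1.01, 0.5, 0.1, 0.001]:
    a = frac * alpha_max
    b_lasso = lasso_cd(X, y, a)
    # Ridge with the same 1/(2n) scaling of the squared-error term.
    b_ridge = np.linalg.solve(X.T @ X + n * a * np.eye(p), X.T @ y)
    lasso_counts.append(int(np.count_nonzero(b_lasso)))
    ridge_counts.append(int(np.count_nonzero(b_ridge)))
    print(f"alpha={a:.4f}  lasso non-zero={lasso_counts[-1]}  ridge non-zero={ridge_counts[-1]}")
```

As alpha falls from just above `alpha_max`, Lasso admits features a few at a time, while Ridge keeps all six coefficients non-zero at every setting.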
ElasticNet — The Best of Both Worlds
When you want Ridge's stability with correlated features and Lasso's feature selection, ElasticNet combines both penalties with two hyperparameters.
from sklearn.linear_model import ElasticNetCV
from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import numpy as np
data = fetch_california_housing(as_frame=True)
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_tr = scaler.fit_transform(X_train)
X_te = scaler.transform(X_test)
# ElasticNetCV searches both alpha AND l1_ratio
enet = ElasticNetCV(
    l1_ratio=[.1, .3, .5, .7, .9, .95, 1.0],  # test many mixing ratios
    alphas=np.logspace(-3, 2, 60),
    cv=10,
    max_iter=10000,
    random_state=42
)
enet.fit(X_tr, y_train)
print(f"Best alpha : {enet.alpha_:.6f}")
print(f"Best l1_ratio : {enet.l1_ratio_:.2f}")
print(f"Test R² : {r2_score(y_test, enet.predict(X_te)):.4f}")
print(f"Features zeroed: {(enet.coef_==0).sum()}")
On California Housing, ElasticNet converged toward l1_ratio=0.10 — nearly Ridge-like behaviour. This makes sense: the features are moderately correlated, and Ridge handles that better than pure Lasso. The cross-validation found this automatically. With a higher-dimensional sparse dataset (gene expression, text features), l1_ratio would typically converge toward values above 0.5.
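To make the role of l1_ratio concrete, the sketch below evaluates the penalty term as it appears in scikit-learn's documented ElasticNet objective, α·l1_ratio·‖β‖₁ + 0.5·α·(1 − l1_ratio)·‖β‖₂². The helper function and the coefficient vector are illustrative assumptions, not part of scikit-learn.

```python
import numpy as np

def elastic_net_penalty(beta, alpha, l1_ratio):
    """Penalty term of the ElasticNet objective:
    alpha * l1_ratio * ||b||_1 + 0.5 * alpha * (1 - l1_ratio) * ||b||_2^2
    """
    beta = np.asarray(beta, dtype=float)
    l1 = np.abs(beta).sum()
    l2_sq = (beta ** 2).sum()
    return alpha * l1_ratio * l1 + 0.5 * alpha * (1.0 - l1_ratio) * l2_sq

beta = np.array([0.5, -1.5, 0.0, 2.0])   # hypothetical coefficient vector
alpha = 0.1

pure_l1 = elastic_net_penalty(beta, alpha, l1_ratio=1.0)   # Lasso-style penalty
pure_l2 = elastic_net_penalty(beta, alpha, l1_ratio=0.0)   # Ridge-style penalty
mostly_l2 = elastic_net_penalty(beta, alpha, l1_ratio=0.1)

print(f"l1_ratio=1.0: {pure_l1:.4f}")
print(f"l1_ratio=0.0: {pure_l2:.4f}")
print(f"l1_ratio=0.1: {mostly_l2:.4f}")
```

An l1_ratio of 0.10 therefore means the penalty is dominated by the squared term, which matches the "nearly Ridge-like" reading above.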
When to Use Ridge vs Lasso vs ElasticNet
| Scenario | Recommended | Reason |
|---|---|---|
| Many features, all probably relevant | Ridge | Shrinks all coefficients, keeps all features |
| High-dimensional data, only a few features matter | Lasso | Automatic sparse feature selection |
| Strongly correlated features | Ridge | Distributes weight evenly among correlated features |
| p > n (more features than samples) | Lasso or ElasticNet | Lasso selects at most n features; OLS has no unique solution |
| Want interpretability (minimal features) | Lasso | Produces sparse model — easy to explain |
| Correlated features AND need some selection | ElasticNet | Combines both penalties; best of both worlds |
| Gene expression / genomics | Lasso or ElasticNet | Thousands of genes, very few actually causal |
| Time series with many lags | Ridge | Correlated lags — Ridge handles better |
Real-World Story — House Prices and 80 Features
They try plain OLS first: training R² = 0.97, test R² = 0.71. Classic overfitting. They switch to Ridge (alpha=10): test R² jumps to 0.88. Much better — Ridge handles the correlated neighbourhood encodings elegantly.
Then they try Lasso (alpha=0.001): test R² = 0.89 and the model is left with only 34 of the 80 features. The 46 eliminated features were noise — duplicate information encoded differently. Lasso found the true signal automatically. The leaderboard submission with Lasso ranks in the top 15%.
# Lasso path for feature selection — Ames Housing workflow sketch
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import numpy as np
# Assume X_train_raw, X_test_raw, y_train are loaded
num_features = ['LotArea', 'GrLivArea', 'OverallQual', 'YearBuilt']
cat_features = ['Neighborhood', 'BldgType', 'HouseStyle']
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), num_features),
    ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), cat_features)
])
pipe = Pipeline([
    ('prep', preprocessor),
    ('model', LassoCV(cv=10, max_iter=20000, random_state=42))
])
pipe.fit(X_train_raw, np.log1p(y_train)) # log-transform skewed prices
lasso = pipe.named_steps['model']
n_total = lasso.coef_.shape[0]
n_kept = (lasso.coef_ != 0).sum()
print(f"Total features : {n_total}")
print(f"Features kept : {n_kept} ({100*n_kept/n_total:.0f}%)")
print(f"Features zeroed: {n_total - n_kept} ({100*(n_total-n_kept)/n_total:.0f}%)")
print(f"Best alpha : {lasso.alpha_:.6f}")
Common Mistakes — And How to Fix Them
Mistake 4: Penalising the intercept. The intercept β₀ should never be regularised. scikit-learn handles this correctly by default (fit_intercept=True), but be aware of it if you implement from scratch.
Mistake 5: Comparing coefficients without scaling. Even after fitting, Ridge/Lasso coefficients are only comparable to each other if features were standardised before fitting. Otherwise a large coefficient might just mean a small-scale feature, not a more important one.
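For a from-scratch implementation, one common way to leave the intercept unpenalised is to centre the data, fit the penalised slopes, and then recover the intercept from the means. The sketch below contrasts that with the mistaken version that penalises a column of ones; the data, alpha, and variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n, alpha = 100, 50.0
x = rng.normal(size=(n, 1))
y = 100.0 + 2.0 * x[:, 0] + rng.normal(size=n)   # true intercept is 100

# Mistake: add a column of ones and penalise it together with the slope.
X_aug = np.column_stack([np.ones(n), x])
b_pen = np.linalg.solve(X_aug.T @ X_aug + alpha * np.eye(2), X_aug.T @ y)

# Fix: centre x and y, penalise only the slope, then recover the intercept.
x_mean, y_mean = x.mean(axis=0), y.mean()
xc, yc = x - x_mean, y - y_mean
slope = np.linalg.solve(xc.T @ xc + alpha * np.eye(1), xc.T @ yc)
intercept = y_mean - x_mean @ slope

print(f"penalised intercept  : {b_pen[0]:.2f}")    # dragged well below 100
print(f"unpenalised intercept: {intercept:.2f}")   # stays close to 100
```

Shrinking the intercept biases every prediction toward zero, which has nothing to do with model complexity.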
Golden Rules
1. Always scale features with StandardScaler. Fit it only on training data. Apply the same transform to test data. Without scaling, regularisation is applied unfairly based on units, not signal.
2. Choose alpha by cross-validation using RidgeCV or LassoCV with a log-spaced grid from 0.001 to 10,000. The optimal alpha varies enormously across datasets, so don't guess.
3. Raise max_iter for Lasso. Lasso uses iterative coordinate descent, so set max_iter=10000 or more, especially for high-dimensional data. Watch for ConvergenceWarning and increase if it appears.
4. Wrap scaling and model in sklearn.pipeline.Pipeline. This guarantees the scaler is re-fit on each fold's training data during cross-validation, eliminating a common but subtle leakage mistake.
5. When unsure, use ElasticNet. Set l1_ratio to several values from 0.1 to 1.0 and let ElasticNetCV pick the best combination. It generalises both Ridge and Lasso and rarely performs worse than either individually.
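The Pipeline rule can be sketched as follows. This example uses a synthetic dataset from `make_regression` and a fixed Ridge alpha purely for illustration; the step names are arbitrary.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem (an assumption; substitute your own X and y).
X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),   # re-fit on each fold's training portion
    ("model", Ridge(alpha=1.0)),
])

# cross_val_score clones and fits the whole pipeline per fold,
# so the held-out fold never influences the scaler's mean or variance.
scores = cross_val_score(pipe, X, y, cv=5, scoring="r2")
print("R² per fold:", scores.round(4))
print("Mean R²    :", scores.mean().round(4))
```

Scaling the full dataset once before cross-validation would let statistics from the validation folds leak into training; passing the pipeline itself avoids that.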