Why Feature Selection Is Critical
More data is not always better. More features are almost never better — at least not without selection. Every irrelevant feature you add to a machine learning model is a source of noise, a contributor to overfitting, and a drain on training time. The curse of dimensionality means that as you add features, the volume of the feature space grows exponentially — and the data you have becomes increasingly sparse, making it harder for any algorithm to find meaningful patterns.
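The sparsity claim can be made concrete with a small simulation. The sketch below is illustrative only: it assumes points drawn uniformly from the unit hypercube, and the helper name `relative_contrast` is hypothetical. It measures how different the nearest and farthest neighbours of a query point are as the number of dimensions grows.

```python
import numpy as np

def relative_contrast(n_points: int, n_dims: int, seed: int = 0) -> float:
    """Return (max - min) / min distance from one query point to a random sample."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))   # uniform in the unit hypercube
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return float((dists.max() - dists.min()) / dists.min())

# Same number of rows, increasing number of features
contrasts = {d: relative_contrast(500, d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"{d:>5} dims: relative contrast = {c:.2f}")
```

As the dimension grows the contrast shrinks, meaning every point looks roughly equally far from every other point. Distance-based notions of "similar" rows lose their meaning, which is one way the sparsity described above hurts a model.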
Removing irrelevant features achieves four things simultaneously: (1) reduces overfitting by removing noise the model memorises instead of learning from; (2) improves accuracy by focusing the model on truly predictive signals; (3) reduces training time — fewer features mean fewer computations per training step; (4) improves interpretability — a 10-feature model that a human can reason about is more trustworthy than a 100-feature black box.
The Three Families of Feature Selection
The three families differ in how they evaluate features: filter methods use statistics with no model involved, wrapper methods train a model repeatedly on different feature subsets, and embedded methods perform selection as part of the model training itself.
Filter Methods — Statistical Feature Ranking
Step 1 — Remove Zero-Variance Features
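The first and cheapest filter: remove any feature whose values are all identical. A feature that never changes contains zero information for any model. Setting a small non-zero threshold, as in the snippet below, also removes near-constant columns. Because variance depends on a feature's units, apply such a threshold only to features on comparable scales.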
from sklearn.feature_selection import VarianceThreshold
# Remove features where variance < threshold
sel = VarianceThreshold(threshold=0.01) # also removes near-constant cols
X_filtered = sel.fit_transform(X_train)
removed = X_train.columns[~sel.get_support()].tolist()
print(f"Removed {len(removed)} low-variance features: {removed}")
# Manual variance check
low_var = X_train.var()[X_train.var() < 0.01]
print(low_var.sort_values())
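One caveat with a fixed threshold such as 0.01 is that variance depends on units. The following is a minimal sketch with made-up data and hypothetical column names, assuming min-max scaling is an acceptable way to put features on a common scale before thresholding.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
distance_km = rng.normal(0.5, 0.05, 200)
demo = pd.DataFrame({
    "distance_km": distance_km,          # variance around 0.0025
    "distance_m": distance_km * 1000,    # same information, variance around 2500
    "constant_flag": np.ones(200),       # truly constant
})

# Thresholding the raw values keeps or drops a feature depending on its unit
raw_mask = VarianceThreshold(threshold=0.01).fit(demo).get_support()
raw_kept = demo.columns[raw_mask].tolist()

# After scaling every column to [0, 1], only the constant column is removed
scaled = pd.DataFrame(MinMaxScaler().fit_transform(demo), columns=demo.columns)
scaled_mask = VarianceThreshold(threshold=0.01).fit(scaled).get_support()
scaled_kept = demo.columns[scaled_mask].tolist()

print("Kept on raw values   :", raw_kept)
print("Kept on scaled values:", scaled_kept)
```

The two distance columns carry identical information, yet on raw values only one of them survives. Scaling first makes the threshold mean the same thing for every column.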
Step 2 — Remove Highly Correlated Features
Two features that correlate above 0.90 carry nearly identical information — one of them is redundant. Keeping both wastes model capacity and causes instability in linear models.
import numpy as np
# Compute absolute pairwise correlations
corr_matrix = X_train.corr().abs()
# Keep only the upper triangle so each pair is counted once
upper = corr_matrix.where(
    np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1)
)
# A column is dropped if it correlates > 0.90 with any earlier column
to_drop = [col for col in upper.columns if upper[col].max() > 0.90]
print(f"Dropping {len(to_drop)} highly correlated features")
X_train = X_train.drop(columns=to_drop)
X_test = X_test.drop(columns=to_drop)
Pairs above 0.90 (dark amber): bmi ↔ weight (0.94), income ↔ salary (0.97), age ↔ exp_years (0.91). Drop one from each pair — whichever has lower correlation with the target variable.
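The rule "drop whichever member of the pair is less correlated with the target" can be sketched as follows. The helper name `drop_weaker_of_correlated_pairs` and the synthetic data are assumptions made for illustration, and the sketch assumes a numeric target so that Pearson correlation with it is meaningful.

```python
import numpy as np
import pandas as pd

def drop_weaker_of_correlated_pairs(X: pd.DataFrame, y: pd.Series,
                                    threshold: float = 0.90) -> list:
    """For each feature pair with |r| > threshold, mark for removal the member
    that is less correlated with the target."""
    corr = X.corr().abs()
    target_corr = X.corrwith(y).abs()
    to_drop = set()
    cols = list(X.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if a in to_drop or b in to_drop:
                continue  # this pair was already resolved by an earlier drop
            if corr.loc[a, b] > threshold:
                to_drop.add(a if target_corr[a] < target_corr[b] else b)
    return sorted(to_drop)

# Synthetic example: salary is a noisy copy of income, age is unrelated to both
rng = np.random.default_rng(0)
n = 500
income = rng.normal(50, 10, n)
X = pd.DataFrame({
    "income": income,
    "salary": income + rng.normal(0, 3, n),
    "age": rng.normal(40, 8, n),
})
y = pd.Series(2.0 * income + 0.5 * X["age"] + rng.normal(0, 5, n))

dropped = drop_weaker_of_correlated_pairs(X, y)
print("Dropped:", dropped)
```

Compute the list on the training split only, then drop the same columns from the test split.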
Step 3 — SelectKBest with Statistical Tests
SelectKBest ranks features by a statistical score and keeps the top K. The scoring function depends on whether the target is continuous or categorical.
import pandas as pd
from sklearn.feature_selection import (
    SelectKBest, SelectPercentile,
    f_classif, f_regression,
    chi2, mutual_info_classif, mutual_info_regression
)
# ── For classification (categorical target) ───────────
# f_classif: ANOVA F-test (assumes normality)
sel_f = SelectKBest(score_func=f_classif, k=10)
X_selected = sel_f.fit_transform(X_train, y_train)
scores_df = pd.DataFrame({
    'feature': X_train.columns,
    'score': sel_f.scores_,
    'p_value': sel_f.pvalues_
}).sort_values('score', ascending=False)
# chi2: for non-negative features (counts, frequencies, one-hot values)
sel_chi = SelectKBest(score_func=chi2, k=10)
X_selected_chi = sel_chi.fit_transform(X_train_counts, y_train)
# mutual_info: non-parametric, handles non-linear relationships
sel_mi = SelectKBest(score_func=mutual_info_classif, k=10)
X_selected_mi = sel_mi.fit_transform(X_train, y_train)
# ── For regression (continuous target) ────────────────
sel_reg = SelectKBest(score_func=f_regression, k=10)
X_selected_reg = sel_reg.fit_transform(X_train, y_reg_train)
# ── Top-k% instead of fixed k ─────────────────────────
sel_pct = SelectPercentile(score_func=f_classif, percentile=30) # top 30%
X_top30 = sel_pct.fit_transform(X_train, y_train)
Features above the amber threshold line are selected. credit_score and income have the highest F-scores — they discriminate best between classes. employee_id and badge_colour score near zero — pure noise.
Chi-Square Test vs Correlation Analysis — When to Use Which
Two of the most commonly confused statistical tools in feature selection are the Chi-Square test and Pearson Correlation. Both measure relationships between variables — but they apply to completely different data types, and using the wrong one on the wrong data produces meaningless results. The choice is determined entirely by the nature of your variables, not your preference.
Follow the tree from top: identify your variable types first, then choose the test. The most common mistake is applying Pearson Correlation to label-encoded categoricals — the result is mathematically computed but statistically meaningless.
| Property | Chi-Square Test | Pearson Correlation | Mutual Information | ANOVA (f_classif) |
|---|---|---|---|---|
| Feature type | Categorical | Numeric (continuous) | Any (numeric or cat) | Numeric feature |
| Target type | Categorical | Numeric (continuous) | Any | Categorical (classes) |
| Detects non-linear? | No | No | Yes | No |
| Assumes normality? | No | Partial | No | Partial |
| sklearn function | chi2 | r_regression (f_regression for F-scores) | mutual_info_classif / mutual_info_regression | f_classif |
| Use case example | gender → churn | age → salary | age → churn (non-linear) | income → churn |
| Output | χ² statistic + p-value | r (−1 to +1) | Nats in scikit-learn (≥ 0) | F-statistic + p-value |
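The "Detects non-linear?" row can be checked on synthetic data. This is a sketch under stated assumptions: the three features and the labelling rule are invented for the demonstration, and the mutual information values come from a nearest-neighbour estimator, so they are approximate.

```python
import numpy as np
from sklearn.feature_selection import f_classif, mutual_info_classif

rng = np.random.default_rng(0)
n = 2000
x_linear = rng.normal(size=n)    # class depends on whether this is high
x_ushaped = rng.normal(size=n)   # class depends on distance from zero
x_noise = rng.normal(size=n)     # unrelated to the class

# Class 1 when x_linear is high OR x_ushaped is far from zero in either direction
y = ((x_linear > 0.5) | (np.abs(x_ushaped) > 1.5)).astype(int)
X = np.column_stack([x_linear, x_ushaped, x_noise])

f_scores, _ = f_classif(X, y)
mi_scores = mutual_info_classif(X, y, random_state=0)

for name, f, mi in zip(["x_linear", "x_ushaped", "x_noise"], f_scores, mi_scores):
    print(f"{name:<10} F = {f:8.2f}   MI = {mi:.3f}")
```

The U-shaped feature has nearly equal class means, so the F-test scores it much like noise, while mutual information still ranks it above the noise feature. A cheap check is to run both scorers and look more closely at any feature where the rankings disagree.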
Chi-Square Test — When Both Variables Are Categorical
The Chi-Square test measures whether two categorical variables are independent. A low p-value means the variables are associated, so the feature is likely useful. A high p-value means the test found no evidence of association, so the feature is likely irrelevant.
from sklearn.feature_selection import SelectKBest, chi2
from scipy.stats import chi2_contingency
import pandas as pd
# ── sklearn chi2 (for feature selection pipelines) ────
# Features must be non-negative (counts, freq, OHE values)
sel_chi = SelectKBest(score_func=chi2, k=8)
X_chi = sel_chi.fit_transform(X_train_counts, y_train) # y must be categorical
chi_df = pd.DataFrame({
'feature': X_train_counts.columns,
'chi2': sel_chi.scores_,
'p_value': sel_chi.pvalues_
}).sort_values('chi2', ascending=False)
# Features with p_value < 0.05 are statistically significant
significant = chi_df[chi_df['p_value'] < 0.05]
print(significant)
# ── scipy chi2 (manual — for any two categorical cols) ─
contingency_table = pd.crosstab(df['gender'], df['churn'])
chi2_stat, p_val, dof, expected = chi2_contingency(contingency_table)
print(f"Chi2 statistic : {chi2_stat:.2f}")
print(f"p-value : {p_val:.4f}")
print(f"Degrees of freedom: {dof}")
print(f"Significant : {p_val < 0.05}")
A p-value below 0.05 means you can reject the null hypothesis that the two variables are independent: they are associated and the feature is worth keeping. A p-value above 0.05 means no significant association was found, so the feature may be irrelevant. The Chi-Square statistic (χ²) is not a clean measure of strength on its own, because it grows with sample size and with the number of categories. Use the p-value to judge significance, and a normalised effect size such as Cramér's V to compare strength across features.
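The dependence of χ² on sample size is easy to demonstrate. The sketch below uses two invented contingency tables with identical proportions and a hypothetical helper `cramers_v`. Yates' continuity correction is switched off so that the statistic scales exactly with the counts.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(table: pd.DataFrame) -> float:
    """Cramér's V: chi-square rescaled to the range [0, 1]."""
    chi2_stat, _, _, _ = chi2_contingency(table, correction=False)
    n = table.to_numpy().sum()
    min_dim = min(table.shape) - 1
    return float(np.sqrt(chi2_stat / (n * min_dim)))

# Same proportions, 100 times more rows in the second table
small = pd.DataFrame([[30, 20], [20, 30]],
                     index=["female", "male"], columns=["stayed", "churned"])
large = small * 100

results = {}
for name, table in {"small": small, "large": large}.items():
    chi2_stat, p_val, _, _ = chi2_contingency(table, correction=False)
    results[name] = (chi2_stat, p_val, cramers_v(table))
    print(f"{name}: chi2 = {chi2_stat:.1f}, p = {p_val:.3g}, V = {results[name][2]:.2f}")
```

Here χ² is 4.0 for the small table and 400.0 for the large one, while Cramér's V is 0.20 in both cases. The association is equally strong, and only the amount of evidence for it has changed.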
Correlation Analysis — When Both Variables Are Numeric
Pearson Correlation measures the linear relationship between two continuous numeric variables. The result r ranges from −1 (perfect negative) to +1 (perfect positive). For feature selection, use it to find features that correlate strongly with a continuous target, and to identify pairs of features that are too similar to both keep (multicollinearity).
# ── Feature-target correlation (regression problems) ──
target_corr = X_train.corrwith(y_reg_train).abs().sort_values(ascending=False)
print(target_corr.head(10))
# Select features with |correlation| > 0.20 with target
strong_features = target_corr[target_corr > 0.20].index.tolist()
# ── Feature-feature correlation (multicollinearity) ───
corr_matrix = X_train.corr().abs()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if upper[col].max() > 0.90]
# ── Spearman correlation (for monotonic non-linear) ───
spearman_corr = X_train.corrwith(y_reg_train, method='spearman').abs()
# ── Point-Biserial (numeric feature, binary target) ───
from scipy.stats import pointbiserialr
r_val, p_val = pointbiserialr(df['income'], df['churn'])
print(f"r = {r_val:.3f}, p = {p_val:.4f}")
Categorical features (left group, blue) are scored by Chi-Square — higher is more associated with the target. Numeric features (right group, amber) are scored by absolute Pearson correlation with the target — closer to 1.0 means stronger linear relationship. The two scores are not comparable across groups — only rank within each group.
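To see why the Spearman variant is worth computing alongside Pearson, consider a relationship that is perfectly monotonic but far from linear. The data below is synthetic and chosen only to make the contrast visible.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = pd.Series(rng.uniform(0, 10, 500))
y = np.exp(x)   # strictly increasing in x, but strongly non-linear

pearson_r = x.corr(y)                       # linear association
spearman_r = x.corr(y, method="spearman")   # rank (monotonic) association

print(f"Pearson  r = {pearson_r:.3f}")
print(f"Spearman r = {spearman_r:.3f}")
```

Spearman works on ranks, so it reports a perfect relationship here, while Pearson understates it. When the two disagree noticeably for a feature, that usually suggests a monotonic but non-linear effect, and a transformation such as a logarithm may be worth trying.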
Density Plots — Visualising Feature Separability
Before running any statistical test, a density (KDE) plot for each feature grouped by the target class is one of the most powerful visual tools for feature selection. A feature that separates classes well — where the two density curves are clearly separated — is likely to be predictive. A feature where the two curves completely overlap is likely uninformative.
Look at how much the two class distributions overlap. If they are almost completely on top of each other — the feature cannot distinguish between classes and should be dropped. If they are clearly separated with little overlap — the feature is highly predictive and should be kept. The degree of separation directly correlates with the feature's F-score or mutual information score.
✅ credit_score — Good Separator (F=284.2)
✅ monthly_charges — Good Separator (F=198.7)
❌ employee_id — Zero Separation (F=0.3)
❌ badge_colour_code — Zero Separation (F=0.8)
Top row: good features where the blue (Class 0) and red (Class 1) density curves are clearly separated — the model can use these to distinguish classes. Bottom row: bad features where the curves completely overlap — these features carry zero discriminative signal and should be dropped.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

def plot_feature_density(df, feature, target, figsize=(10, 4)):
    """Plot KDE density for a feature split by target class."""
    fig, ax = plt.subplots(figsize=figsize)
    colors = {0: '#60a5fa', 1: '#f87171'}
    labels = {0: 'Class 0 (Not Churned)', 1: 'Class 1 (Churned)'}
    for cls in sorted(df[target].unique()):
        data = df.loc[df[target] == cls, feature].dropna()
        if data.nunique() < 2:
            continue  # gaussian_kde cannot be fitted to a constant column
        kde = gaussian_kde(data, bw_method='scott')
        x_range = np.linspace(data.min(), data.max(), 300)
        density = kde(x_range)
        ax.plot(x_range, density, label=labels.get(cls, f'Class {cls}'),
                color=colors.get(cls), linewidth=2.5)
        ax.fill_between(x_range, density, alpha=0.15, color=colors.get(cls))
    ax.set_title(f'Density: {feature} by {target}', fontsize=13)
    ax.set_xlabel(feature)
    ax.set_ylabel('Density')
    ax.legend()
    plt.tight_layout()
    plt.show()

# Plot all numeric features and visually inspect separation
for col in X_train.select_dtypes(include='number').columns:
    plot_feature_density(df, feature=col, target='churn')
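With dozens of features, inspecting every plot by eye gets tedious. One possible way to summarise the plots in a single number is the overlap coefficient, the area shared by the two class densities. The helper `density_overlap` below is a hypothetical sketch that approximates this area on a grid. It assumes two classes and enough distinct values per class for a KDE to be fitted.

```python
import numpy as np
from scipy.stats import gaussian_kde

def density_overlap(values_a, values_b, grid_size: int = 512) -> float:
    """Approximate the integral of min(pdf_a, pdf_b): near 0 means well
    separated, near 1 means the class distributions are almost identical."""
    a = np.asarray(values_a, dtype=float)
    b = np.asarray(values_b, dtype=float)
    lo, hi = min(a.min(), b.min()), max(a.max(), b.max())
    pad = 0.1 * (hi - lo)                      # let the KDE tails decay
    grid = np.linspace(lo - pad, hi + pad, grid_size)
    pdf_a = gaussian_kde(a)(grid)
    pdf_b = gaussian_kde(b)(grid)
    step = grid[1] - grid[0]
    return float(np.sum(np.minimum(pdf_a, pdf_b)) * step)   # Riemann sum

rng = np.random.default_rng(0)
separated = density_overlap(rng.normal(0, 1, 1000), rng.normal(4, 1, 1000))
identical = density_overlap(rng.normal(0, 1, 1000), rng.normal(0, 1, 1000))
print(f"Well-separated feature overlap: {separated:.2f}")
print(f"Uninformative feature overlap : {identical:.2f}")
```

Sorting features by this score gives a shortlist of plots to inspect by eye. Treat it as a screening aid rather than a replacement for the statistical tests above.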
Wrapper Methods — Model-Based Feature Search
Wrapper methods search for the best subset of features by actually training the model on different combinations and evaluating performance. They are often more accurate than filter methods because they account for feature interactions, but they are computationally expensive because each candidate subset requires a full model fit.
from sklearn.feature_selection import RFE, RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
# ── RFE: Recursive Feature Elimination ────────────────
# Trains model, removes weakest feature, repeats
estimator = LogisticRegression(max_iter=1000, C=1.0)
rfe = RFE(estimator=estimator, n_features_to_select=10, step=1)
rfe.fit(X_train_scaled, y_train)
selected_cols = X_train.columns[rfe.get_support()].tolist()
print("Selected by RFE:", selected_cols)
ranking = pd.DataFrame({
'feature': X_train.columns,
'rank': rfe.ranking_ # 1 = selected, higher = less important
}).sort_values('rank')
print(ranking)
# ── RFECV: Cross-Validated RFE (finds optimal k) ─────
rfecv = RFECV(
estimator=RandomForestClassifier(n_estimators=100, random_state=42),
step=1,
cv=5,
scoring='roc_auc',
min_features_to_select=3,
n_jobs=-1
)
rfecv.fit(X_train, y_train)
print(f"Optimal number of features: {rfecv.n_features_}")
print(f"Selected features: {X_train.columns[rfecv.get_support()].tolist()}")
AUC improves rapidly up to 12 features then plateaus — adding more features provides no benefit. The amber marker at 12 features is the optimal point selected by RFECV. Features 13–87 are noise — they add computational cost without improving the model.
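RFE searches backwards from the full feature set. The same wrapper idea can run forwards, starting from no features and adding the one that most improves the cross-validated score at each step. The sketch below uses scikit-learn's `SequentialFeatureSelector` on synthetic data. The dataset parameters are assumptions chosen so that the first three columns are the informative ones.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# shuffle=False keeps the 3 informative features in columns 0, 1 and 2
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_redundant=0, n_clusters_per_class=1,
                           shuffle=False, random_state=0)

sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=3,
    direction="forward",     # "backward" starts from all features instead
    scoring="roc_auc",
    cv=5,
)
sfs.fit(X, y)

selected_idx = sfs.get_support(indices=True)
X_selected = sfs.transform(X)
print("Selected column indices:", selected_idx.tolist())
```

Forward selection tends to be cheaper than backward elimination when only a few features are wanted from a large set, because the early models are small. The cost is still one cross-validated fit per candidate feature per step.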
Embedded Methods — Selection During Training
Embedded methods perform feature selection as an intrinsic part of the model training process. They often strike a good balance between filter methods (fast, but blind to the model) and wrapper methods (accurate but slow). The two most important embedded methods are Lasso regularisation and tree-based feature importance.
Lasso Regularisation (L1)
Lasso adds an L1 penalty to the loss function that drives the coefficients of uninformative features to exactly zero. Features with zero coefficients are automatically eliminated — Lasso performs simultaneous model fitting and feature selection.
from sklearn.linear_model import LassoCV, Lasso
from sklearn.feature_selection import SelectFromModel
# ── LassoCV: finds optimal alpha via cross-validation ─
# Note: LassoCV is a regressor, so a 0/1 class label in y_train is treated as a number
lasso_cv = LassoCV(cv=5, random_state=42, max_iter=5000)
lasso_cv.fit(X_train_scaled, y_train)
print(f"Optimal alpha: {lasso_cv.alpha_:.4f}")
# ── See which coefficients are non-zero ───────────────
coef_df = pd.DataFrame({
'feature': X_train.columns,
'coefficient': lasso_cv.coef_
})
selected = coef_df[coef_df['coefficient'] != 0].sort_values('coefficient', key=abs, ascending=False)
zeroed = coef_df[coef_df['coefficient'] == 0]
print(f"Features kept : {len(selected)}")
print(f"Features zeroed: {len(zeroed)}")
# ── SelectFromModel: use any model's coefficients/importances
sfm = SelectFromModel(lasso_cv, prefit=True)
X_lasso_selected = sfm.transform(X_train_scaled)
As alpha increases (stronger regularisation), coefficients are driven toward zero. Uninformative features (red — employee_id) are zeroed out first at very small alpha values. Important features (blue — credit_score) persist until much higher alpha. The vertical line marks the optimal alpha chosen by LassoCV.
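The shrinking behaviour can be reproduced by refitting Lasso at a few penalty strengths and counting the surviving coefficients. This is an illustrative sketch on synthetic regression data. The alpha grid and dataset parameters are arbitrary assumptions, not tuned values.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# 20 features, of which only 4 actually drive the target
X, y = make_regression(n_samples=200, n_features=20, n_informative=4,
                       noise=10.0, random_state=0)
X = StandardScaler().fit_transform(X)   # L1 penalties assume comparable scales

alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
nonzero_counts = []
for alpha in alphas:
    model = Lasso(alpha=alpha, max_iter=10000).fit(X, y)
    nonzero_counts.append(int(np.sum(model.coef_ != 0)))

for alpha, count in zip(alphas, nonzero_counts):
    print(f"alpha = {alpha:>6}: {count} non-zero coefficients")
```

A very small alpha keeps almost everything, and a very large one removes informative features as well. That is why the snippet above lets LassoCV choose alpha by cross-validation rather than fixing it by hand.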
Random Forest Feature Importance
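Random Forest measures feature importance by tracking how much each feature reduces impurity (Gini or entropy) across all splits in all trees. Features that are never selected for splitting have zero importance. Impurity-based importances are computed on the training data and can be inflated for high-cardinality features, which is why the permutation importance at the end of the snippet is a useful cross-check.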
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
import matplotlib.pyplot as plt
# ── Train Random Forest ───────────────────────────────
rf = RandomForestClassifier(n_estimators=300, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
# ── Extract and rank importances ──────────────────────
importance_df = pd.DataFrame({
'feature': X_train.columns,
'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)
print(importance_df.head(15))
# ── Select features above mean importance ────────────
sfm = SelectFromModel(rf, prefit=True, threshold='mean')
X_rf_selected = sfm.transform(X_train)
print(f"Selected: {sfm.get_support().sum()} features")
# ── Permutation importance (more reliable) ────────────
from sklearn.inspection import permutation_importance
result = permutation_importance(rf, X_test, y_test, n_repeats=20, random_state=42)
perm_df = pd.DataFrame({
'feature': X_train.columns,
'importance_mean': result.importances_mean,
'importance_std': result.importances_std
}).sort_values('importance_mean', ascending=False)
Features above the amber threshold are selected. The error bars show variability across permutation repeats — wide bars mean unstable importance estimates. credit_score, monthly_charges, and tenure are stable top features. employee_id has near-zero importance with high variance — confirming it is noise.
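The gap between the two importance measures shows up clearly with an ID-like column. The following sketch uses invented data with one genuine signal and one unique-per-row `random_id` column. Both names are hypothetical, and the exact numbers will vary with the seed.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
signal = rng.normal(size=n)
y = (signal + rng.normal(scale=1.0, size=n) > 0).astype(int)   # noisy labels
X = pd.DataFrame({
    "signal": signal,
    "random_id": rng.permutation(n),   # unique per row, unrelated to y
})
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Impurity importance: computed from training-set splits, always sums to 1
impurity = pd.Series(rf.feature_importances_, index=X.columns)

# Permutation importance: drop in held-out accuracy when a column is shuffled
perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
permutation = pd.Series(perm.importances_mean, index=X.columns)

print(pd.DataFrame({"impurity": impurity, "permutation": permutation}))
```

Fully grown trees use the ID column to memorise noisy training labels, so it earns a sizeable share of impurity importance. Shuffling it on held-out data changes almost nothing, which exposes it as noise.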
Before vs After Feature Selection — The Results
| Metric | Before selection | After selection |
|---|---|---|
| Features | 87 | 12 |
| Train Accuracy | 94.2% | 82.1% |
| Test Accuracy | 61.3% | 79.4% |
| Train AUC | 0.99 | 0.88 |
| Test AUC | 0.67 | 0.89 |
| Training Time | 48.3 sec | 4.8 sec |
| Overfit Gap | 32.9% | 2.7% |
All selection methods improve test AUC over the 87-feature baseline. RFECV achieves the highest AUC (0.89) because it optimises feature count with cross-validation. Lasso is close behind (0.87) and is much faster. The overfit gap drops from 32.9% to under 5% across all methods.
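The "Overfit Gap" figures are simply train accuracy minus test accuracy, in percentage points. A tiny helper (the name `overfit_gap` is an assumption) reproduces them from the accuracies reported above.

```python
def overfit_gap(train_acc: float, test_acc: float) -> float:
    """Gap between train and test accuracy, in percentage points."""
    return round(train_acc - test_acc, 1)

gap_before = overfit_gap(94.2, 61.3)   # 87-feature model
gap_after = overfit_gap(82.1, 79.4)    # 12-feature model
print(f"Before selection: {gap_before} points")   # 32.9
print(f"After selection : {gap_after} points")    # 2.7
```

Tracking this gap alongside test AUC for each selection method makes it easier to tell whether a change actually improved generalisation or only reduced training accuracy.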
Complete Feature Selection Pipeline
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_selection import SelectFromModel, RFECV
from sklearn.linear_model import LogisticRegression, LassoCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
import joblib
num_cols = ['age', 'salary', 'credit_score', 'monthly_charges']
cat_cols = ['city', 'gender', 'product_type']
# ── Preprocessing ─────────────────────────────────────
num_pipe = Pipeline([
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])
cat_pipe = Pipeline([
('imputer', SimpleImputer(strategy='most_frequent')),
('ohe', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])
preprocessor = ColumnTransformer([
('num', num_pipe, num_cols),
('cat', cat_pipe, cat_cols),
])
# ── Option A: Embedded (Lasso + SelectFromModel) ──────
lasso_pipeline = Pipeline([
('prep', preprocessor),
('selection', SelectFromModel(LassoCV(cv=5, random_state=42))),
('model', LogisticRegression(max_iter=1000))
])
# ── Option B: Wrapper (RFECV) ─────────────────────────
rfecv_pipeline = Pipeline([
('prep', preprocessor),
('selection', RFECV(RandomForestClassifier(n_estimators=100, random_state=42),
cv=5, scoring='roc_auc', min_features_to_select=3)),
('model', RandomForestClassifier(n_estimators=200, random_state=42))
])
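# ── Fit and evaluate ──────────────────────────────────
# Pipeline.score() returns accuracy for classifiers, so compute AUC explicitly (binary target assumed)
from sklearn.metrics import roc_auc_score
lasso_pipeline.fit(X_train, y_train)
lasso_auc = roc_auc_score(y_test, lasso_pipeline.predict_proba(X_test)[:, 1])
print(f"Lasso pipeline AUC: {lasso_auc:.3f}")
rfecv_pipeline.fit(X_train, y_train)
rfecv_auc = roc_auc_score(y_test, rfecv_pipeline.predict_proba(X_test)[:, 1])
print(f"RFECV pipeline AUC: {rfecv_auc:.3f}")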
# ── Save the complete pipeline ────────────────────────
joblib.dump(lasso_pipeline, 'lasso_fs_pipeline.pkl')
joblib.dump(rfecv_pipeline, 'rfecv_fs_pipeline.pkl')
# ── Predict on new raw data ───────────────────────────
pipeline = joblib.load('lasso_fs_pipeline.pkl')
predictions = pipeline.predict(new_raw_data) # preprocessing + selection + model
Start with filter methods (VarianceThreshold + correlation + SelectKBest) to quickly eliminate the worst features in seconds. If accuracy matters more than speed, follow with Lasso (fast embedded selection) or RFECV (most accurate, slowest). For any production model, wrap everything in a sklearn Pipeline — it guarantees the same feature selection is applied identically at inference time, with zero risk of accidentally using all 87 features when only 12 were selected.
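A second reason to keep selection inside the Pipeline is honest evaluation: if features are selected using all rows before cross-validation, the validation folds have already influenced the choice. The sketch below shows the effect on pure noise, where no real signal exists. The data shape and `k` are arbitrary assumptions picked to make the leak obvious.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))      # pure noise features
y = rng.integers(0, 2, size=100)      # labels unrelated to X

# Leaky: the selector sees every row, including rows later used for validation
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_acc = cross_val_score(LogisticRegression(max_iter=1000),
                            X_leaky, y, cv=5).mean()

# Honest: the selector is refit on the training part of each fold
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),
    ("model", LogisticRegression(max_iter=1000)),
])
honest_acc = cross_val_score(pipe, X, y, cv=5).mean()

print(f"Selection outside CV: accuracy = {leaky_acc:.2f}")
print(f"Selection inside CV : accuracy = {honest_acc:.2f}")
```

The leaky estimate looks impressive even though there is nothing to learn, while the pipeline version stays near chance. The same applies to wrapper and embedded selectors, so evaluate the whole pipeline rather than the final model alone.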
Golden Rules of Feature Selection
Feature selection is not about discarding data — it is about discarding noise. The HR model that learned from employee ID numbers was not wrong because of bad code or a bad algorithm. It was wrong because it was allowed to see features that should never have been in the dataset. Removing 75 features from 87 reduced training accuracy by 12 percentage points and improved test accuracy by 18. That trade is not a loss — it is the entire point of machine learning. A model that generalises is worth a thousand models that memorise.