
Feature Selection in Machine Learning

A story-driven, comprehensive guide to selecting the most relevant features for machine learning models — covering filter methods, wrapper methods, embedded methods, and dimensionality reduction, with live diagrams, density charts, real-world stories, and complete reusable code.

Section 01

Why Feature Selection Is Critical

More data is not always better. More features are almost never better — at least not without selection. Every irrelevant feature you add to a machine learning model is a source of noise, a contributor to overfitting, and a drain on training time. The curse of dimensionality means that as you add features, the volume of the feature space grows exponentially — and the data you have becomes increasingly sparse, making it harder for any algorithm to find meaningful patterns.
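
The sparsity effect can be seen in a small simulation. The sketch below is illustrative only and uses synthetic uniform data; the helper name relative_contrast is hypothetical. It measures how much farther the farthest point is than the nearest one as the number of dimensions grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_points: int, n_dims: int) -> float:
    """Return (max - min) / min distance from one random query to n_points."""
    X = rng.random((n_points, n_dims))      # synthetic uniform data
    query = rng.random(n_dims)
    dists = np.linalg.norm(X - query, axis=1)
    return float((dists.max() - dists.min()) / dists.min())

# Same number of points, increasing number of features
contrast = {d: relative_contrast(500, d) for d in (2, 10, 100, 1000)}
for d, c in contrast.items():
    print(f"{d:>5} dims: relative contrast = {c:.2f}")
```

As the dimension count rises, the contrast shrinks: nearest and farthest neighbours become almost equally distant, which is one reason distance-based and density-based methods struggle with many features.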

The HR Model That Learned From the Employee ID Number
A tech company built an employee attrition prediction model. Their dataset had 87 features, everything from salary and performance score to the employee's badge colour, desk floor number, and employee ID. The model achieved 94% accuracy on the training set and 61% on the test set, a classic sign of overfitting. When the team used Random Forest feature importance to inspect what the model had learned, they found the top predictor was employee_id. The model had memorised that employees in certain ID ranges (those who joined in a specific year) had left the company. Employee ID is not a causal feature; it is a leakage artefact. After removing 75 irrelevant and leaking features, accuracy on the training set dropped to 82%, but test set accuracy improved to 79%. The simpler 12-feature model generalised far better because it had learned real patterns, not memorised noise.
💡
The Four Benefits of Feature Selection

Removing irrelevant features achieves four things simultaneously: (1) reduces overfitting by removing noise the model memorises instead of learning from; (2) improves accuracy by focusing the model on truly predictive signals; (3) reduces training time — fewer features mean fewer computations per training step; (4) improves interpretability — a 10-feature model that a human can reason about is more trustworthy than a 100-feature black box.

🗺️ Feature Selection — The Three Families of Methods
[Diagram: 87 raw features flow into one of three families. Filter methods (statistical scores, no model: correlation, Chi2, mutual information, variance). Wrapper methods (train a model and evaluate subsets: RFE, RFECV, forward/backward selection). Embedded methods (selection during training: Lasso, Random Forest importance). Each path yields the selected top-k features, a lean model with better generalisation, and the prediction ŷ.]

The three families differ in how they evaluate features: Filter methods use statistics with no model involved. Wrapper methods train a model repeatedly on different feature subsets. Embedded methods perform selection as part of the model training itself.


Section 02

The Three Families of Feature Selection

🔵 Filter Methods
No model required
Rank features using statistical measures computed directly from the data. Fast, scalable, and model-agnostic. Run once before any training.
✅ Very fast — O(n features)
✅ Works with any downstream model
❌ Ignores feature interactions
❌ May miss conditionally useful features
🟢 Wrapper Methods
Train model per subset
Search for the best subset of features by training the model on different combinations and comparing performance. Finds optimal combinations but is expensive.
✅ Accounts for feature interactions
✅ Model-specific optimal selection
❌ Slow: one model fit per candidate subset, up to O(2^n) subsets for an exhaustive search
❌ Risk of overfitting to validation set
🟡 Embedded Methods
Selection during training
Feature selection happens automatically as part of model training through regularisation or built-in importance scoring. Best balance of performance and speed.
✅ No extra training cycles
✅ Accounts for feature interactions
❌ Tied to a specific model type
❌ Less interpretable selection logic
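
The "ignores feature interactions" limitation of filter methods can be demonstrated on synthetic data. In the sketch below (an illustration under assumed data, not taken from the original examples) the target is the XOR of two binary features, so each feature alone is uninformative while the pair determines the target.

```python
import numpy as np
from sklearn.feature_selection import f_classif
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
n = 2000
a = rng.integers(0, 2, n)
b = rng.integers(0, 2, n)
noise = rng.normal(size=n)
y = a ^ b                                  # target depends only on the PAIR (a, b)
X = np.column_stack([a, b, noise]).astype(float)

# Filter view: each feature is scored on its own
f_scores, p_values = f_classif(X, y)
print("Univariate F-scores [a, b, noise]:", np.round(f_scores, 2))

# Wrapper view: score a model trained on a candidate subset
tree = DecisionTreeClassifier(random_state=0)
acc_single = cross_val_score(tree, X[:, [0]], y, cv=5).mean()
acc_pair = cross_val_score(tree, X[:, [0, 1]], y, cv=5).mean()
print(f"Accuracy with a only : {acc_single:.2f}")
print(f"Accuracy with a and b: {acc_pair:.2f}")
```

A univariate score gives a and b roughly the same low ranking as pure noise, while evaluating the subset {a, b} with a model reveals that together they predict the target.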

Section 03

Filter Methods — Statistical Feature Ranking

Step 1 — Remove Zero-Variance Features

The first and cheapest filter: remove any feature where all values are identical. A feature that never changes contains zero information for any model.

from sklearn.feature_selection import VarianceThreshold

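# Remove features whose variance is below the threshold.
# threshold=0.0 (the default) drops only constant columns; 0.01 also drops
# near-constant ones. Variance depends on units, so compare on a common scale.
sel = VarianceThreshold(threshold=0.01)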
X_filtered = sel.fit_transform(X_train)

removed = X_train.columns[~sel.get_support()].tolist()
print(f"Removed {len(removed)} low-variance features: {removed}")

# Manual variance check
low_var = X_train.var()[X_train.var() < 0.01]
print(low_var.sort_values())

Step 2 — Remove Highly Correlated Features

Two features that correlate above 0.90 carry nearly identical information — one of them is redundant. Keeping both wastes model capacity and causes instability in linear models.

import numpy as np

# Compute absolute pairwise correlations (X_train is assumed to be all numeric)
corr_matrix = X_train.corr().abs()

# Keep the upper triangle so each pair is considered once
upper = corr_matrix.where(
    np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1)
)

# Flag the later column of every pair with |correlation| > 0.90
to_drop = [col for col in upper.columns if upper[col].max() > 0.90]
print(f"Dropping {len(to_drop)} highly correlated features")
X_train = X_train.drop(columns=to_drop)
X_test  = X_test.drop(columns=to_drop)
📊 Output: Correlation Heatmap — Identifying Redundant Features

Pairs above 0.90 (dark amber): bmi ↔ weight (0.94), income ↔ salary (0.97), age ↔ exp_years (0.91). Drop one from each pair — whichever has lower correlation with the target variable.
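
The rule "drop whichever member of the pair has the lower correlation with the target" can be sketched as follows. This is a minimal illustration on synthetic data; the function name drop_weaker_of_correlated_pairs and the toy columns are hypothetical, and a numeric target is assumed.

```python
import numpy as np
import pandas as pd

def drop_weaker_of_correlated_pairs(X: pd.DataFrame, y: pd.Series,
                                    threshold: float = 0.90) -> list:
    """For each feature pair with |r| > threshold, mark the one that is
    less correlated with the target for removal."""
    corr = X.corr().abs()
    target_corr = X.corrwith(y).abs()
    to_drop = set()
    cols = list(X.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if a in to_drop or b in to_drop:
                continue                     # pair already resolved
            if corr.loc[a, b] > threshold:
                weaker = a if target_corr[a] < target_corr[b] else b
                to_drop.add(weaker)
    return sorted(to_drop)

# Synthetic example: bmi is a noisy copy of weight
rng = np.random.default_rng(0)
n = 5000
weight = rng.normal(size=n)
X = pd.DataFrame({
    "weight": weight,
    "bmi": weight + 0.4 * rng.normal(size=n),
    "age": rng.normal(size=n),
})
y = pd.Series(2 * X["weight"] + X["age"] + 0.5 * rng.normal(size=n))

dropped = drop_weaker_of_correlated_pairs(X, y)
print("Dropped:", dropped)
```

Compute the list on the training set only and apply the same drop to the test set, as in the snippet above.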

Step 3 — SelectKBest with Statistical Tests

SelectKBest ranks features by a statistical score and keeps the top K. The scoring function depends on whether the target is continuous or categorical.

import pandas as pd
from sklearn.feature_selection import (
    SelectKBest, SelectPercentile,
    f_classif, f_regression,
    chi2, mutual_info_classif, mutual_info_regression
)

# ── For classification (categorical target) ───────────
# f_classif: ANOVA F-test (assumes normality)
sel_f = SelectKBest(score_func=f_classif, k=10)
X_selected = sel_f.fit_transform(X_train, y_train)
scores_df   = pd.DataFrame({
    'feature': X_train.columns,
    'score':   sel_f.scores_,
    'p_value': sel_f.pvalues_
}).sort_values('score', ascending=False)

# chi2: for non-negative integer features (counts, freq)
sel_chi = SelectKBest(score_func=chi2, k=10)
sel_chi.fit_transform(X_train_counts, y_train)

# mutual_info: non-parametric, handles non-linear relationships
sel_mi = SelectKBest(score_func=mutual_info_classif, k=10)
sel_mi.fit_transform(X_train, y_train)

# ── For regression (continuous target) ────────────────
sel_reg = SelectKBest(score_func=f_regression, k=10)
X_selected_reg = sel_reg.fit_transform(X_train, y_reg_train)

# ── Top-k% instead of fixed k ─────────────────────────
sel_pct = SelectPercentile(score_func=f_classif, percentile=30)  # top 30%
X_top30 = sel_pct.fit_transform(X_train, y_train)
📊 Output: SelectKBest (f_classif) — Feature Scores Ranked
Legend: selected (top 8) · rejected · selection threshold

Features above the amber threshold line are selected. credit_score and income have the highest F-scores — they discriminate best between classes. employee_id and badge_colour score near zero — pure noise.


Section 04

Chi-Square Test vs Correlation Analysis — When to Use Which

Two of the most commonly confused statistical tools in feature selection are the Chi-Square test and Pearson Correlation. Both measure relationships between variables — but they apply to completely different data types, and using the wrong one on the wrong data produces meaningless results. The choice is determined entirely by the nature of your variables, not your preference.

The Marketing Team That Measured the Wrong Relationship
A marketing analytics team at a large FMCG company was trying to determine which customer attributes predicted premium product purchases. They ran Pearson correlation between "product_tier_purchased" (Tier 1, Tier 2, Tier 3 — stored as 1, 2, 3) and "region" (North=1, South=2, East=3, West=4). The correlation came back at 0.38 and they concluded that region was "moderately correlated" with product tier. This was completely wrong — both variables were categorical, not numeric. The numbers 1, 2, 3, 4 for regions had no mathematical ordering. Pearson correlation was measuring a relationship between arbitrary integer labels, not between actual categories. When they reran the analysis using a Chi-Square test (the correct test for two categorical variables), the result showed no significant association (p=0.42). They had nearly made a regional marketing budget decision based on a statistically invalid calculation.
🗺️ When to Use Chi-Square vs Correlation — Decision Guide
[Decision tree diagram] What are your variable types?
Both categorical → Chi-Square test (gender ↔ product_type, city ↔ churn); sklearn: chi2
Both numeric, linear relationship → Pearson Correlation (age ↔ salary, height ↔ weight); df.corr() / f_regression
Both numeric, non-linear relationship → Mutual Information; mutual_info_regression
Mixed types (numeric feature ↔ categorical target) → ANOVA / f_classif; sklearn: f_classif
⚠ Never run Pearson Correlation on categorical columns, even if they are stored as integers. Region encoded as 1/2/3/4 is still categorical; the numbers have no mathematical meaning.

Follow the tree from top: identify your variable types first, then choose the test. The most common mistake is applying Pearson Correlation to label-encoded categoricals — the result is mathematically computed but statistically meaningless.
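
Why Pearson on label-encoded categories is meaningless can be shown directly: relabel the categories and the coefficient changes, while the Chi-Square result does not. The data below is synthetic and the purchase probabilities are assumptions made up for the illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

rng = np.random.default_rng(7)
n = 4000
region = rng.choice(["North", "South", "East", "West"], size=n)

# Assumed association: East buys premium more often than the other regions
p_premium = pd.Series(region).map(
    {"North": 0.2, "South": 0.2, "East": 0.6, "West": 0.2}
).to_numpy()
premium = (rng.random(n) < p_premium).astype(int)

def pearson_for(encoding: dict) -> float:
    """Pearson r between an arbitrary integer encoding of region and premium."""
    codes = pd.Series(region).map(encoding).to_numpy()
    return float(np.corrcoef(codes, premium)[0, 1])

# Two equally arbitrary integer labellings of the same categories
r_a = pearson_for({"North": 1, "South": 2, "East": 3, "West": 4})
r_b = pearson_for({"East": 1, "North": 2, "South": 3, "West": 4})
print(f"Pearson r, encoding A: {r_a:+.3f}")
print(f"Pearson r, encoding B: {r_b:+.3f}")

# Chi-Square uses only the category counts, so labels do not matter
chi2_stat, p_val, dof, _ = chi2_contingency(pd.crosstab(region, premium))
print(f"Chi2 = {chi2_stat:.1f}, dof = {dof}, p = {p_val:.2e}")
```

The same data yields a Pearson coefficient of different sign depending on which integer each region happened to receive, whereas the contingency-table test gives one answer.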

Property | Chi-Square Test | Pearson Correlation | Mutual Information | ANOVA (f_classif)
Feature type | Categorical | Numeric (continuous) | Any (numeric or categorical) | Numeric
Target type | Categorical | Numeric (continuous) | Any | Categorical (classes)
Detects non-linear? | No | No | Yes | No
Assumes normality? | No | Partial | No | Partial
sklearn function | chi2 | f_regression | mutual_info_classif / mutual_info_regression | f_classif
Use case example | gender → churn | age → salary | age → churn (non-linear) | income → churn
Output | χ² statistic + p-value | r (−1 to +1) | Score ≥ 0 (nats in sklearn) | F-statistic + p-value

Chi-Square Test — When Both Variables Are Categorical

The Chi-Square test measures whether two categorical variables are independent. A low p-value means the variables are associated — the feature is likely useful. A high p-value means the feature and target are independent — the feature is likely irrelevant.

from sklearn.feature_selection import SelectKBest, chi2
from scipy.stats               import chi2_contingency
import pandas as pd

# ── sklearn chi2 (for feature selection pipelines) ────
# Features must be non-negative (counts, freq, OHE values)
sel_chi = SelectKBest(score_func=chi2, k=8)
X_chi   = sel_chi.fit_transform(X_train_counts, y_train)  # y must be categorical

chi_df  = pd.DataFrame({
    'feature':  X_train_counts.columns,
    'chi2':     sel_chi.scores_,
    'p_value':  sel_chi.pvalues_
}).sort_values('chi2', ascending=False)

# Features with p_value < 0.05 are statistically significant
significant = chi_df[chi_df['p_value'] < 0.05]
print(significant)

# ── scipy chi2 (manual — for any two categorical cols) ─
contingency_table = pd.crosstab(df['gender'], df['churn'])
chi2_stat, p_val, dof, expected = chi2_contingency(contingency_table)
print(f"Chi2 statistic : {chi2_stat:.2f}")
print(f"p-value        : {p_val:.4f}")
print(f"Degrees of freedom: {dof}")
print(f"Significant    : {p_val < 0.05}")
💡
Reading the Chi-Square Result

A p-value below 0.05 means you can reject the null hypothesis that the two variables are independent: they are associated and the feature is worth keeping. A p-value above 0.05 means no significant association was found, so the feature may be irrelevant. The Chi-Square statistic (χ²) is not a pure measure of strength: for the same underlying association it grows with sample size, so a large χ² on a large dataset can reflect a weak effect. Use the p-value to judge significance, and a normalised measure such as Cramér's V to compare strength across features.
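
One common way to separate strength from sample size is Cramér's V, which rescales χ² to the range 0 to 1. It is not used elsewhere in this guide; the sketch below is an added illustration, and the helper name cramers_v and the toy tables are hypothetical.

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(table: np.ndarray) -> float:
    """Cramér's V = sqrt(chi2 / (n * (min(rows, cols) - 1)))."""
    chi2_stat, _, _, _ = chi2_contingency(table, correction=False)
    n = table.sum()
    min_dim = min(table.shape) - 1
    return float(np.sqrt(chi2_stat / (n * min_dim)))

# The same proportions observed on 100 rows and on 10,000 rows
small = np.array([[30, 20],
                  [20, 30]])
large = small * 100

chi2_small = chi2_contingency(small, correction=False)[0]
chi2_large = chi2_contingency(large, correction=False)[0]
v_small = cramers_v(small)
v_large = cramers_v(large)

print(f"chi2: small={chi2_small:.1f}, large={chi2_large:.1f}")
print(f"V   : small={v_small:.2f}, large={v_large:.2f}")
```

The χ² statistic is 100 times larger on the bigger table even though the proportions are identical, while Cramér's V stays the same.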

Correlation Analysis — When Both Variables Are Numeric

Pearson Correlation measures the linear relationship between two continuous numeric variables. The result r ranges from −1 (perfect negative) to +1 (perfect positive). For feature selection, use it to find features that correlate strongly with a continuous target, and to identify pairs of features that are too similar to both keep (multicollinearity).

import numpy as np

# ── Feature-target correlation (regression problems) ──
target_corr = X_train.corrwith(y_reg_train).abs().sort_values(ascending=False)
print(target_corr.head(10))

# Select features with |correlation| > 0.20 with target
strong_features = target_corr[target_corr > 0.20].index.tolist()

# ── Feature-feature correlation (multicollinearity) ───
corr_matrix = X_train.corr().abs()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if upper[col].max() > 0.90]

# ── Spearman correlation (for monotonic non-linear) ───
spearman_corr = X_train.corrwith(y_reg_train, method='spearman').abs()

# ── Point-Biserial (numeric feature, binary target) ───
from scipy.stats import pointbiserialr
r_val, p_val = pointbiserialr(df['income'], df['churn'])
print(f"r = {r_val:.3f}, p = {p_val:.4f}")
📊 Chi-Square Scores vs Correlation Scores — Same Features, Different Lenses
Legend: Chi-Square score (categorical features) · |Correlation| with target (numeric features)

Categorical features (left group, blue) are scored by Chi-Square — higher is more associated with the target. Numeric features (right group, amber) are scored by absolute Pearson correlation with the target — closer to 1.0 means stronger linear relationship. The two scores are not comparable across groups — only rank within each group.

🎯 Quick Decision Rules — Chi-Square vs Correlation
1. Both variables categorical (gender, city, product_type) → Chi-Square test. Use chi2_contingency() for manual testing or SelectKBest(chi2) for pipeline selection.
2. Both variables numeric and target is continuous (age → salary, area → house_price) → Pearson Correlation. Use df.corrwith(target) or SelectKBest(f_regression).
3. Numeric feature with categorical target (income → churn) → ANOVA F-test. Use SelectKBest(f_classif), which computes the F-ratio of between-class to within-class variance.
4. Non-linear relationships or mixed types → Mutual Information. Use mutual_info_classif or mutual_info_regression. It detects any statistical dependency, not just linear ones.
5. Never run Pearson Correlation on label-encoded categorical columns. Region stored as 1/2/3/4 is still categorical; computing correlation treats it as if North < South < East < West mathematically, which is meaningless.
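
The rules above can be written down as a small lookup. This is only a sketch: the function name suggest_test and the returned labels are hypothetical, and the caller must state whether each variable is categorical or numeric, because a dtype check cannot tell a label-encoded region from a true number.

```python
def suggest_test(feature_kind: str, target_kind: str,
                 expect_linear: bool = True) -> str:
    """Map declared variable kinds to the test recommended by the rules.

    feature_kind / target_kind must be "categorical" or "numeric".
    """
    kinds = {"categorical", "numeric"}
    if feature_kind not in kinds or target_kind not in kinds:
        raise ValueError("kinds must be 'categorical' or 'numeric'")

    if feature_kind == "categorical" and target_kind == "categorical":
        return "chi2"                        # rule 1
    if not expect_linear:
        return "mutual_info"                 # rule 4 (non-linear)
    if feature_kind == "numeric" and target_kind == "numeric":
        return "pearson / f_regression"      # rule 2
    if feature_kind == "numeric" and target_kind == "categorical":
        return "f_classif"                   # rule 3
    return "mutual_info"                     # rule 4 (remaining mixed case)

print(suggest_test("categorical", "categorical"))
print(suggest_test("numeric", "categorical"))
print(suggest_test("numeric", "numeric", expect_linear=False))
```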

Section 05

Density Plots — Visualising Feature Separability

Before running any statistical test, a density (KDE) plot for each feature grouped by the target class is one of the most powerful visual tools for feature selection. A feature that separates classes well — where the two density curves are clearly separated — is likely to be predictive. A feature where the two curves completely overlap is likely uninformative.

💡
Reading Density Plots for Feature Selection

Look at how much the two class distributions overlap. If they are almost completely on top of each other — the feature cannot distinguish between classes and should be dropped. If they are clearly separated with little overlap — the feature is highly predictive and should be kept. The degree of separation directly correlates with the feature's F-score or mutual information score.

📊 Density Plots — Good Features vs Irrelevant Features by Target Class
Legend: Class 0 (Not Churned) · Class 1 (Churned)

✅ credit_score — Good Separator (F=284.2)

✅ monthly_charges — Good Separator (F=198.7)

❌ employee_id — Zero Separation (F=0.3)

❌ badge_colour_code — Zero Separation (F=0.8)

Top row: good features where the blue (Class 0) and red (Class 1) density curves are clearly separated — the model can use these to distinguish classes. Bottom row: bad features where the curves completely overlap — these features carry zero discriminative signal and should be dropped.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

def plot_feature_density(df, feature, target, figsize=(10,4)):
    """Plot KDE density for a feature split by target class."""
    fig, ax = plt.subplots(figsize=figsize)
    colors = {0: '#60a5fa', 1: '#f87171'}
    labels = {0: 'Class 0 (Not Churned)', 1: 'Class 1 (Churned)'}
    for cls in df[target].unique():
        data  = df[df[target] == cls][feature].dropna()
        kde   = gaussian_kde(data, bw_method='scott')
        x_range = np.linspace(data.min(), data.max(), 300)
        ax.plot(x_range, kde(x_range), label=labels[cls],
               color=colors[cls], linewidth=2.5)
        ax.fill_between(x_range, kde(x_range), alpha=0.15, color=colors[cls])
    ax.set_title(f'Density: {feature} by {target}', fontsize=13)
    ax.set_xlabel(feature); ax.set_ylabel('Density'); ax.legend()
    plt.tight_layout(); plt.show()

# Plot all features and visually inspect separation
for col in X_train.select_dtypes(include='number').columns:
    plot_feature_density(df, feature=col, target='churn')
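
When there are too many features to inspect by eye, the visual idea of "how much do the two curves overlap" can be approximated with a number. The sketch below uses a histogram-based overlap coefficient on synthetic data; it is an added illustration, and the helper name overlap_coefficient is hypothetical.

```python
import numpy as np

def overlap_coefficient(values_a: np.ndarray, values_b: np.ndarray,
                        bins: int = 50) -> float:
    """Shared area of two normalised histograms: ~1 = same, ~0 = separated."""
    lo = min(values_a.min(), values_b.min())
    hi = max(values_a.max(), values_b.max())
    if lo == hi:
        return 1.0                           # constant feature: total overlap
    edges = np.linspace(lo, hi, bins + 1)
    hist_a, _ = np.histogram(values_a, bins=edges)
    hist_b, _ = np.histogram(values_b, bins=edges)
    p_a = hist_a / hist_a.sum()
    p_b = hist_b / hist_b.sum()
    return float(np.minimum(p_a, p_b).sum())

rng = np.random.default_rng(0)
class0 = rng.normal(0, 1, 5000)

# A feature with the same distribution in both classes
overlap_same = overlap_coefficient(class0, rng.normal(0, 1, 5000))
# A feature whose class means are four standard deviations apart
overlap_far = overlap_coefficient(class0, rng.normal(4, 1, 5000))

print(f"Overlap, identical distributions: {overlap_same:.2f}")
print(f"Overlap, well separated         : {overlap_far:.2f}")
```

Sorting features by this value gives a shortlist of the density plots worth opening first; it complements the F-score and does not replace it.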

Section 06

Wrapper Methods — Model-Based Feature Search

Wrapper methods search for the best subset of features by actually training the model on different combinations and evaluating performance. They are more accurate than filter methods because they account for feature interactions — but they are computationally expensive because each candidate subset requires a full model fit.

The Credit Model That Found the Hidden Pair
A bank's data science team was building a credit default model. Filter methods (F-score) identified "monthly_income" and "monthly_expenses" as the top two features individually. When RFECV was run, however, it showed that neither feature alone was particularly powerful, but that the model performed best when both were kept: together they let it approximate the ratio of expenses to income, the strongest predictive signal in the entire dataset. A filter method cannot discover this because it evaluates features independently. RFECV, by training the model on combinations, found that including both features simultaneously unlocked a combined signal stronger than either alone. The final model used 12 features selected by RFECV versus the 10 selected by F-score, and the AUC improved from 0.82 to 0.89.
import pandas as pd
from sklearn.feature_selection import RFE, RFECV
from sklearn.linear_model     import LogisticRegression
from sklearn.ensemble         import RandomForestClassifier

# ── RFE: Recursive Feature Elimination ────────────────
# Trains model, removes weakest feature, repeats
estimator = LogisticRegression(max_iter=1000, C=1.0)
rfe = RFE(estimator=estimator, n_features_to_select=10, step=1)
rfe.fit(X_train_scaled, y_train)

selected_cols = X_train.columns[rfe.get_support()].tolist()
print("Selected by RFE:", selected_cols)

ranking = pd.DataFrame({
    'feature': X_train.columns,
    'rank':    rfe.ranking_          # 1 = selected, higher = less important
}).sort_values('rank')
print(ranking)

# ── RFECV: Cross-Validated RFE (finds optimal k) ─────
rfecv = RFECV(
    estimator=RandomForestClassifier(n_estimators=100, random_state=42),
    step=1,
    cv=5,
    scoring='roc_auc',
    min_features_to_select=3,
    n_jobs=-1
)
rfecv.fit(X_train, y_train)

print(f"Optimal number of features: {rfecv.n_features_}")
print(f"Selected features: {X_train.columns[rfecv.get_support()].tolist()}")
📊 Output: RFECV — Cross-Validated AUC vs Number of Features
Legend: mean CV AUC · optimal point · ±1 std band

AUC improves rapidly up to 12 features, then plateaus: adding more features provides no benefit. The amber marker at 12 features is the optimal point selected by RFECV. The remaining 75 features add computational cost without improving the model.
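
Forward and backward selection, the other wrapper methods named in the overview, are available in scikit-learn as SequentialFeatureSelector. The sketch below runs forward selection on a synthetic dataset; the dataset and the choice of three features are assumptions made for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, of which 3 carry signal
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)

# Forward selection: start empty, add the feature that most improves CV AUC
sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=3,
    direction="forward",      # use "backward" to start full and remove
    scoring="roc_auc",
    cv=3,
)
sfs.fit(X, y)

selected_idx = sfs.get_support(indices=True)
X_selected = sfs.transform(X)
print("Selected column indices:", selected_idx)
print("Reduced shape:", X_selected.shape)
```

Unlike RFE, this selector does not need coefficients or importances from the estimator, so it works with any model, at the cost of one cross-validated fit per candidate feature at each step.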


Section 07

Embedded Methods — Selection During Training

Embedded methods perform feature selection as an intrinsic part of the model training process. They offer a middle ground between filter methods (fast, but blind to the model) and wrapper methods (model-aware, but slow). The two most important embedded methods are Lasso regularisation and tree-based feature importance.

Lasso Regularisation (L1)

Lasso adds an L1 penalty to the loss function that drives the coefficients of uninformative features to exactly zero. Features with zero coefficients are automatically eliminated — Lasso performs simultaneous model fitting and feature selection.

from sklearn.linear_model     import LassoCV, Lasso
from sklearn.feature_selection import SelectFromModel

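# ── LassoCV: finds optimal alpha via cross-validation ─
# LassoCV is a regressor: a 0/1 target is treated as a number here.
# Scale features first so the L1 penalty treats them equally.
lasso_cv = LassoCV(cv=5, random_state=42, max_iter=5000)
lasso_cv.fit(X_train_scaled, y_train)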
print(f"Optimal alpha: {lasso_cv.alpha_:.4f}")

# ── See which coefficients are non-zero ───────────────
coef_df = pd.DataFrame({
    'feature':     X_train.columns,
    'coefficient': lasso_cv.coef_
})
selected = coef_df[coef_df['coefficient'] != 0].sort_values('coefficient', key=abs, ascending=False)
zeroed   = coef_df[coef_df['coefficient'] == 0]
print(f"Features kept  : {len(selected)}")
print(f"Features zeroed: {len(zeroed)}")

# ── SelectFromModel: use any model's coefficients/importances
sfm = SelectFromModel(lasso_cv, prefit=True)
X_lasso_selected = sfm.transform(X_train_scaled)
📊 Output: Lasso Coefficient Path — Features Zeroed as Alpha Increases
Legend: credit_score · monthly_charges · tenure · employee_id (noise) · optimal α

As alpha increases (stronger regularisation), coefficients are driven toward zero. Uninformative features (red — employee_id) are zeroed out first at very small alpha values. Important features (blue — credit_score) persist until much higher alpha. The vertical line marks the optimal alpha chosen by LassoCV.
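
LassoCV is a regression model. For a classification target, the same L1 idea is usually applied through L1-penalised logistic regression. The sketch below is an added illustration on synthetic data; the value C=0.05 is an arbitrary assumption and would normally be tuned by cross-validation.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: 30 features, 5 of them informative
X, y = make_classification(n_samples=500, n_features=30, n_informative=5,
                           n_redundant=0, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Smaller C means a stronger L1 penalty and more coefficients at exactly zero
l1_logit = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
selector = SelectFromModel(l1_logit).fit(X_scaled, y)

n_kept = int(selector.get_support().sum())
X_l1_selected = selector.transform(X_scaled)
print(f"Kept {n_kept} of {X.shape[1]} features")
```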

Random Forest Feature Importance

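Random Forest measures feature importance by tracking how much each feature reduces impurity (Gini or entropy) across all splits in all trees. Features that are never selected for splitting have zero importance. Because this score is computed on the training data, it tends to be inflated for high-cardinality features such as IDs, which offer many possible split points; permutation importance on held-out data is the more reliable check.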

from sklearn.ensemble         import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
import matplotlib.pyplot as plt

# ── Train Random Forest ───────────────────────────────
rf = RandomForestClassifier(n_estimators=300, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)

# ── Extract and rank importances ──────────────────────
importance_df = pd.DataFrame({
    'feature':    X_train.columns,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)

print(importance_df.head(15))

# ── Select features above mean importance ────────────
sfm = SelectFromModel(rf, prefit=True, threshold='mean')
X_rf_selected = sfm.transform(X_train)
print(f"Selected: {sfm.get_support().sum()} features")

# ── Permutation importance (more reliable) ────────────
from sklearn.inspection import permutation_importance
result = permutation_importance(rf, X_test, y_test, n_repeats=20, random_state=42)
perm_df = pd.DataFrame({
    'feature': X_train.columns,
    'importance_mean': result.importances_mean,
    'importance_std':  result.importances_std
}).sort_values('importance_mean', ascending=False)
📊 Output: Random Forest Feature Importance (with error bars)
Legend: selected (above mean) · rejected (below mean) · mean threshold

Features above the amber threshold are selected. The error bars show variability across permutation repeats — wide bars mean unstable importance estimates. credit_score, monthly_charges, and tenure are stable top features. employee_id has near-zero importance with high variance — confirming it is noise.
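
The difference between impurity-based and permutation importance is easiest to see on a column that is known to be noise. The sketch below appends a made-up ID column (unique per row, unrelated to the target) to a synthetic dataset and compares the two scores; all data here is an assumption for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)

# Hypothetical ID column: a random permutation of row numbers, pure noise
rng = np.random.default_rng(0)
fake_id = rng.permutation(len(y)).reshape(-1, 1)
X_with_id = np.hstack([X, fake_id])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_with_id, y, test_size=0.3, random_state=0, stratify=y
)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Impurity importance: computed from training-set splits
impurity_id = rf.feature_importances_[-1]

# Permutation importance: drop in held-out accuracy when the column is shuffled
perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
perm_id = perm.importances_mean[-1]

print(f"ID column, impurity importance   : {impurity_id:.3f}")
print(f"ID column, permutation importance: {perm_id:.3f}")
```

The forest still spends some of its splits on the ID column, so its impurity importance is clearly above zero, while shuffling the column on the test set barely changes accuracy.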


Section 08

Before vs After Feature Selection — The Results

Metric | ❌ Before: 87 features (raw) | ✅ After: 12 features (RFECV)
Features | 87 | 12
Train Accuracy | 94.2% | 82.1%
Test Accuracy | 61.3% | 79.4%
Train AUC | 0.99 | 0.88
Test AUC | 0.67 | 0.89
Training Time | 48.3 sec | 4.8 sec
Overfit Gap | 32.9% | 2.7%
📊 Output: Performance Comparison — All Methods vs Baseline
Legend: test AUC · train–test gap (overfit)

All selection methods improve test AUC over the 87-feature baseline. RFECV achieves the highest AUC (0.89) because it optimises feature count with cross-validation. Lasso is close behind (0.87) and is much faster. The overfit gap drops from 32.9% to under 5% across all methods.


Section 09

Complete Feature Selection Pipeline

from sklearn.pipeline         import Pipeline
from sklearn.compose          import ColumnTransformer
from sklearn.preprocessing    import StandardScaler, OneHotEncoder
from sklearn.feature_selection import SelectFromModel, RFECV
from sklearn.linear_model     import LogisticRegression, LassoCV
from sklearn.ensemble         import RandomForestClassifier
from sklearn.impute            import SimpleImputer
import joblib

num_cols = ['age', 'salary', 'credit_score', 'monthly_charges']
cat_cols = ['city', 'gender', 'product_type']

# ── Preprocessing ─────────────────────────────────────
num_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler',  StandardScaler())
])
cat_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ohe',     OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])
preprocessor = ColumnTransformer([
    ('num', num_pipe, num_cols),
    ('cat', cat_pipe, cat_cols),
])

# ── Option A: Embedded (Lasso + SelectFromModel) ──────
lasso_pipeline = Pipeline([
    ('prep',      preprocessor),
    ('selection', SelectFromModel(LassoCV(cv=5, random_state=42))),
    ('model',     LogisticRegression(max_iter=1000))
])

# ── Option B: Wrapper (RFECV) ─────────────────────────
rfecv_pipeline = Pipeline([
    ('prep',      preprocessor),
    ('selection', RFECV(RandomForestClassifier(n_estimators=100, random_state=42),
                        cv=5, scoring='roc_auc', min_features_to_select=3)),
    ('model',     RandomForestClassifier(n_estimators=200, random_state=42))
])

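# ── Fit and evaluate ──────────────────────────────────
# Pipeline.score() returns accuracy for classifiers, so compute AUC explicitly
from sklearn.metrics import roc_auc_score

lasso_pipeline.fit(X_train, y_train)
lasso_auc = roc_auc_score(y_test, lasso_pipeline.predict_proba(X_test)[:, 1])
print(f"Lasso pipeline AUC: {lasso_auc:.3f}")

rfecv_pipeline.fit(X_train, y_train)
rfecv_auc = roc_auc_score(y_test, rfecv_pipeline.predict_proba(X_test)[:, 1])
print(f"RFECV pipeline AUC: {rfecv_auc:.3f}")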

# ── Save the complete pipeline ────────────────────────
joblib.dump(lasso_pipeline,  'lasso_fs_pipeline.pkl')
joblib.dump(rfecv_pipeline,  'rfecv_fs_pipeline.pkl')

# ── Predict on new raw data ───────────────────────────
pipeline = joblib.load('lasso_fs_pipeline.pkl')
predictions = pipeline.predict(new_raw_data)   # preprocessing + selection + model
Which Method Should You Use?

Start with filter methods (VarianceThreshold + correlation + SelectKBest) to quickly eliminate the worst features in seconds. If accuracy matters more than speed, follow with Lasso (fast embedded selection) or RFECV (most accurate, slowest). For any production model, wrap everything in a sklearn Pipeline — it guarantees the same feature selection is applied identically at inference time, with zero risk of accidentally using all 87 features when only 12 were selected.


Section 10

Golden Rules of Feature Selection

🎯 8 Rules Every Data Scientist Must Follow
1. Select features before fitting the final model, and fit the selector on training data only. If the selector sees the test set, or the full dataset before splitting, the evaluation is contaminated and the reported score will be optimistic. Feature selection must happen as part of the training pipeline.
2. Always start with VarianceThreshold and correlation filtering. These eliminate the most obvious junk features in milliseconds. Running expensive wrapper methods on 87 features when 30 have near-zero variance is pure waste.
3. Use density plots (KDE by class) to visually validate which features separate your target classes before running any statistical test. A feature where both class densities completely overlap is almost certainly noise. Confirm it with the score, but the eye test is faster.
4. Never use feature importance from a model trained on the full dataset to select features for that same model. This creates a circular dependency and leaks information. Fit the importance-estimating model on the training fold only.
5. Prefer RFECV over RFE when you do not know how many features to keep. RFECV uses cross-validation to find the optimal number automatically. RFE requires you to specify k in advance, and a poor guess either discards useful features or keeps noise.
6. Use Lasso (LassoCV) when you have many features and need a fast embedded selection that is also interpretable. Lasso's zero-coefficient mechanism is mathematically principled: a feature zeroed by Lasso adds nothing to the linear fit beyond the features that were kept. With strongly correlated features, Lasso tends to keep one and zero the rest, so a zero coefficient does not prove a feature is useless on its own.
7. Always compare test performance, not training performance, when evaluating feature selection methods. A model that selects 50 features and achieves 95% training accuracy but 60% test accuracy has overfit. A model with 10 features and 80% on both is the better model.
8. Save the complete fitted pipeline (preprocessing, feature selection, and model) as a single object using joblib.dump(). At inference time, you must apply exactly the same feature selection that was applied during training. A model separated from its fitted selector will fail or produce wrong predictions if given all 87 features.
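
Rules 1 and 4 can be checked with a quick experiment. In the sketch below both the features and the labels are pure random noise (an assumption chosen so that the true accuracy is 50%), and the only difference between the two runs is whether the selector is fitted before cross-validation or inside it.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))     # 2000 noise features
y = rng.integers(0, 2, size=100)     # random labels: nothing to learn

# Leaky: the selector sees every row, including future validation folds
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_acc = cross_val_score(
    LogisticRegression(max_iter=1000), X_leaky, y, cv=5
).mean()

# Honest: the selector is refitted on the training part of each fold
honest_pipe = make_pipeline(
    SelectKBest(f_classif, k=20),
    LogisticRegression(max_iter=1000),
)
honest_acc = cross_val_score(honest_pipe, X, y, cv=5).mean()

print(f"Selection outside CV (leaky): {leaky_acc:.2f}")
print(f"Selection inside pipeline   : {honest_acc:.2f}")
```

The leaky version reports accuracy well above chance on data that contains no signal at all, which is exactly the optimistic bias that putting the selector inside the Pipeline prevents.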
🧮
Key Takeaway

Feature selection is not about discarding data — it is about discarding noise. The HR model that learned from employee ID numbers was not wrong because of bad code or a bad algorithm. It was wrong because it was allowed to see features that should never have been in the dataset. Removing 75 features from 87 reduced training accuracy by 12 percentage points and improved test accuracy by 18. That trade is not a loss — it is the entire point of machine learning. A model that generalises is worth a thousand models that memorise.