
Why More Data Isn't Always Better: Mastering Feature Selection

LDS Team
Let's Data Science
12 min read

A data scientist at a fintech startup built a customer churn model with 127 features scraped from every user event imaginable. Accuracy on the training set looked stellar. Then the model went to production and performed worse than a coin flip. The reason? Most of those 127 columns were noise, and the model memorized patterns that existed only in the training data.

Feature selection is the practice of identifying which input variables actually drive predictions and dropping the rest. It sounds straightforward, but the wrong approach can silently leak information, miss important interactions, or waste hours of compute. This guide covers the three major families of feature selection, shows exactly when each one shines (and when it fails), and uses a single running example so you can follow the logic end to end.

The Curse of Dimensionality and Why More Features Hurt

The curse of dimensionality describes how data becomes exponentially sparser as you add dimensions. In a one-dimensional space, 100 data points cover the line densely. Spread those same 100 points across a 50-dimensional hypercube, and each point is essentially isolated. Distance-based algorithms like KNN and SVMs rely on "nearness," and when every point is roughly the same distance from every other point, nearness loses all meaning.

Mathematically, the contrast between the farthest and nearest neighbor shrinks toward zero:

\lim_{d \to \infty} \frac{\text{dist}_{\max} - \text{dist}_{\min}}{\text{dist}_{\min}} = 0

Where:

  • d is the number of dimensions (features)
  • dist_max is the distance to the farthest data point
  • dist_min is the distance to the nearest data point

In Plain English: In our churn dataset with 50 features, many of those columns contribute nothing but empty space. A churned customer and a loyal customer end up looking equally "distant" from each other because the noise dimensions dominate the signal dimensions. Dropping the irrelevant features collapses the space back down to where genuine patterns are visible again.

This isn't just theoretical. Research by Trunk (1979) showed that adding features beyond the optimal count consistently degrades classifier performance, even when those features carry marginal information. The scikit-learn feature selection documentation recommends addressing this before any model tuning.

Key Insight: Feature selection doesn't just improve accuracy. It cuts training time, reduces memory usage in production, and makes models far easier to explain to stakeholders. A model with 10 meaningful inputs is worth more than one with 500 noisy columns.

Feature Selection vs. Feature Extraction

Feature selection picks a subset of your original columns. Feature extraction creates entirely new variables, typically through transformations like PCA or autoencoders. The key tradeoff: selection preserves interpretability (you still know that "monthly_charges" matters), while extraction can capture complex interactions but produces opaque components.

For regulated industries like finance and healthcare where you need to explain why a prediction was made, feature selection is almost always the right choice. For computer vision or NLP where raw features are already abstract (pixels, token embeddings), extraction often makes more sense.

Three Families of Feature Selection

Feature selection methods fall into three categories based on how they evaluate features. The choice between them depends on your dataset size, computational budget, and whether you need to account for feature interactions.

Figure: Filter, wrapper, and embedded feature selection methods compared by mechanism, speed, and use case.

| Criterion | Filter | Wrapper | Embedded |
|---|---|---|---|
| Mechanism | Statistical scoring | Model-driven subset search | Built into training |
| Speed | O(n · p) | O(p · T_train) | Single training pass |
| Interactions | Ignores them | Captures via model | Captures via model |
| Overfitting risk | Low | High (many evaluations) | Medium (regularized) |
| Best for | Initial screening, 1000+ features | Small datasets, final tuning | General purpose, production |

Where:

  • n is the number of samples
  • p is the number of features
  • T_train is the time to train one model

Filter Methods: Fast Statistical Screening

Filter methods score each feature independently using a statistical measure, then rank and threshold. They never train a machine learning model, making them the fastest option. The downside: they evaluate features in isolation and miss interactions (Feature A might be useless alone but powerful when combined with Feature B).

Variance Thresholding

The simplest filter. If a feature has zero or near-zero variance, it carries no information. A column where 99.9% of values are identical contributes nothing to any model.
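A minimal sketch of this step in scikit-learn. The article's actual dataset isn't shown, so this uses a synthetic stand-in (make_classification with 2,000 rows, 50 features, 8 informative, mirroring the churn setup described later) and injects three broken columns by hand:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import VarianceThreshold

# Assumed synthetic stand-in for the churn dataset (not the author's exact data)
X, y = make_classification(n_samples=2000, n_features=50,
                           n_informative=8, random_state=42)

# Inject three obviously broken columns
rng = np.random.default_rng(0)
X[:, 47] = 1.0                                       # constant
X[:, 48] = (rng.random(2000) < 0.005).astype(float)  # ~99.5% zeros
X[:, 49] = rng.random(2000) * 1e-6                   # tiny range

# Drop any column whose variance falls below 0.01
selector = VarianceThreshold(threshold=0.01)
X_reduced = selector.fit_transform(X)

print(f"Original shape: {X.shape}")
print(f"After VarianceThreshold(0.01): {X_reduced.shape}")
print(f"Removed {X.shape[1] - X_reduced.shape[1]} low-variance features")
```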

Expected Output:

text
Original shape: (2000, 50)
Variance of injected columns:
  Feature 47 (constant):     0.000000
  Feature 48 (99.5% zeros):  0.004975
  Feature 49 (tiny range):   0.000000

After VarianceThreshold(0.01): (2000, 47)
Removed 3 low-variance features

Variance thresholding is a cleanup step, not a selection strategy. It catches obviously broken columns (constant IDs, flags that never trigger) but can't tell you which of the remaining features actually predict the target.

Mutual Information Scoring

Mutual information (MI) measures the dependency between a feature and the target variable. Unlike Pearson correlation, MI captures non-linear relationships: if a feature has a U-shaped relationship with the target, correlation reports near zero, but MI correctly identifies the dependency.

I(X; Y) = \sum_{y \in Y} \sum_{x \in X} p(x, y) \log \frac{p(x, y)}{p(x) \, p(y)}

Where:

  • I(X; Y) is the mutual information between feature X and target Y
  • p(x, y) is the joint probability of X = x and Y = y
  • p(x) and p(y) are the marginal probabilities
  • The log is typically natural log (base e)

In Plain English: MI asks: "How much does knowing this feature's value reduce my uncertainty about whether the customer will churn?" If knowing a customer's monthly charges tells you a lot about whether they'll leave, MI is high. If knowing their zip code tells you nothing, MI is near zero.
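A sketch of how these scores can be computed with scikit-learn's mutual_info_classif, assuming a synthetic make_classification dataset standing in for the churn data (the specific feature indices and MI values will differ from the output shown below):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# Assumed synthetic stand-in: 50 features, only 8 informative
X, y = make_classification(n_samples=2000, n_features=50,
                           n_informative=8, random_state=42)

# Estimate MI between each feature and the target label
mi = mutual_info_classif(X, y, random_state=42)

print("Top 10 features by mutual information:")
for rank, idx in enumerate(sorted(range(50), key=lambda i: -mi[i])[:10], 1):
    print(f"  {rank}. Feature {idx}: MI = {mi[idx]:.4f}")

print(f"Total features: {len(mi)}")
print(f"Features with MI > 0.05: {(mi > 0.05).sum()}")
print(f"Features with MI < 0.01: {(mi < 0.01).sum()}")
```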

Expected Output:

text
Top 10 features by mutual information:
  1. Feature 32: MI = 0.1020
  2. Feature 14: MI = 0.0964
  3. Feature 15: MI = 0.0504
  4. Feature 41: MI = 0.0502
  5. Feature 43: MI = 0.0448
  6. Feature 5: MI = 0.0316
  7. Feature 36: MI = 0.0208
  8. Feature 35: MI = 0.0207
  9. Feature 31: MI = 0.0199
  10. Feature 4: MI = 0.0160

Total features: 50
Features with MI > 0.05: 4
Features with MI < 0.01: 36

Out of 50 features, 36 have mutual information below 0.01, confirming they're mostly noise. The top 4 features dominate, which aligns with our dataset having only 8 truly informative columns.

When to Use Filter Methods (and When NOT to)

Use when:

  • You have thousands of features and need a fast first pass (genomics, text, sensor data)
  • Computational budget is tight and you can't afford to train models repeatedly
  • You want a model-agnostic preprocessing step before handing data to any algorithm

Do NOT use when:

  • Feature interactions matter (e.g., age * income predicts churn but neither alone does)
  • You need the optimal subset for a specific model
  • Your dataset has fewer than 30 features; just go straight to embedded methods

Pro Tip: Filter methods work well as a pre-processing step before wrapper or embedded methods. Drop the obvious junk first with variance thresholding and MI, then let RFE or Lasso do the fine-tuning on the surviving features.

Wrapper Methods: Model-Driven Subset Search

Wrapper methods treat feature selection as a search problem. They train a model on a candidate subset of features, evaluate performance, and iteratively add or remove features based on the result. This accounts for feature interactions because the model sees features together, not in isolation.

Recursive Feature Elimination (RFE)

RFE is the most widely used wrapper method, documented thoroughly in the scikit-learn RFE reference. It trains a model on all features, ranks them by importance (coefficients for linear models, feature importances for trees), removes the least important one, and repeats until the desired count is reached.

Figure: Recursive feature elimination process (train, rank, eliminate, repeat until target count).
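A runnable sketch of RFE with a logistic regression estimator, again assuming a synthetic stand-in dataset (the exact indices selected will differ from the output shown below):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Assumed synthetic stand-in for the churn data
X, y = make_classification(n_samples=2000, n_features=50,
                           n_informative=8, random_state=42)
X_scaled = StandardScaler().fit_transform(X)  # scale for the linear model

# Drop the single weakest feature each round until 10 remain
rfe = RFE(estimator=LogisticRegression(max_iter=1000),
          n_features_to_select=10, step=1)
rfe.fit(X_scaled, y)

selected = [i for i, keep in enumerate(rfe.support_) if keep]
print(f"RFE selected features (10 of 50): {selected}")
```

The `ranking_` attribute holds the elimination order (1 = selected), which is what the ranking table below reports.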

Expected Output:

text
RFE selected features (10 of 50):
  Indices: [6, 14, 19, 21, 23, 32, 36, 40, 42, 43]

Feature rankings (1 = selected):
  Feature  0: rank  3  [rank 3]
  Feature  2: rank  3  [rank 3]
  Feature  5: rank  2  [rank 2]
  Feature  6: rank  1  [SELECTED]
  Feature  7: rank  3  [rank 3]
  Feature 12: rank  2  [rank 2]
  Feature 13: rank  2  [rank 2]
  Feature 14: rank  1  [SELECTED]
  Feature 19: rank  1  [SELECTED]
  Feature 21: rank  1  [SELECTED]
  Feature 22: rank  3  [rank 3]
  Feature 23: rank  1  [SELECTED]
  Feature 27: rank  3  [rank 3]
  Feature 32: rank  1  [SELECTED]
  Feature 34: rank  2  [rank 2]
  Feature 36: rank  1  [SELECTED]
  Feature 40: rank  1  [SELECTED]
  Feature 42: rank  1  [SELECTED]
  Feature 43: rank  1  [SELECTED]
  Feature 46: rank  2  [rank 2]

Notice that RFE picked features 14 and 32, which were also the top MI scorers. But it also selected features like 6 and 40 that MI ranked lower. RFE captures how features interact inside the logistic regression model, something MI can't see.

Common Pitfall: RFE results depend heavily on the estimator you choose. Running RFE with logistic regression and then training a random forest on the selected features can produce suboptimal results. Match the RFE estimator to your final model, or use RFECV with cross-validation to make the choice more principled.

When to Use Wrapper Methods (and When NOT to)

Use when:

  • You have a small dataset (under 10,000 rows, under 100 features) where compute isn't a bottleneck
  • You need the absolute best subset for a specific model
  • You're in a Kaggle competition or research setting where every 0.1% of accuracy matters

Do NOT use when:

  • You have 1000+ features; RFE will take hours or days
  • Your data changes frequently (the selected subset may not be stable across retrains)
  • You need a model-agnostic selection (wrapper results are tied to the chosen estimator)

Embedded Methods: Selection During Training

Embedded methods perform feature selection as part of the model training process itself. The model learns which features matter and which to ignore, typically through regularization or built-in importance measures. This gives you the interaction-awareness of wrappers at closer to the speed of filters.

Lasso (L1 Regularization)

Lasso regression adds a penalty proportional to the absolute value of each coefficient. This penalty forces weak coefficients to exactly zero, effectively removing those features from the model.

\text{Cost}_{\text{Lasso}} = \frac{1}{2n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} |w_j|

Where:

  • n is the number of training samples
  • y_i is the actual target for sample i
  • ŷ_i is the predicted value
  • λ is the regularization strength (higher means more aggressive pruning)
  • w_j is the coefficient (weight) for feature j
  • p is the total number of features

In Plain English: Think of λ\lambda as a tax on model complexity. Each feature that keeps a non-zero weight has to "earn its keep" by reducing prediction error more than the tax costs. In our churn dataset, a feature like "days_since_last_login" might strongly predict churn and easily pays the tax. But "user_timezone" barely helps, so Lasso sets its weight to zero and drops it entirely.
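A sketch of Lasso selection on an assumed synthetic stand-in dataset; the alpha value here is an illustrative guess, not the one behind the output shown below (tune it with LassoCV in practice):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Assumed synthetic stand-in for the churn data
X, y = make_classification(n_samples=2000, n_features=50,
                           n_informative=8, random_state=42)
X_scaled = StandardScaler().fit_transform(X)  # scale before applying L1

# alpha chosen for illustration; higher alpha zeroes more coefficients
lasso = Lasso(alpha=0.01)
lasso.fit(X_scaled, y)

kept = np.flatnonzero(lasso.coef_)
print(f"Lasso kept {len(kept)} features, zeroed out {50 - len(kept)}")
for idx in sorted(kept, key=lambda i: -abs(lasso.coef_[i])):
    print(f"  Feature {idx:2d}: {lasso.coef_[idx]:+.4f}")
```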

Expected Output:

text
Lasso kept 18 features, zeroed out 32

Non-zero coefficients (sorted by magnitude):
  Feature 32: -0.1428
  Feature 19: -0.1036
  Feature 14: -0.0878
  Feature 42: -0.0654
  Feature 41: +0.0555
  Feature  6: +0.0314
  Feature 21: -0.0211
  Feature 23: +0.0085
  Feature 46: -0.0051
  Feature 34: -0.0041
  Feature 27: +0.0038
  Feature  7: -0.0031
  Feature  0: -0.0029
  Feature  2: -0.0023
  Feature 13: +0.0023
  Feature 22: -0.0005
  Feature  1: +0.0004
  Feature 43: -0.0001

Lasso zeroed out 32 of 50 features. Features 32, 19, and 14 have the largest coefficients, confirming they're the strongest predictors. The alpha parameter controls aggressiveness: higher alpha zeros more features. Use cross-validation to find the right balance (scikit-learn's LassoCV automates this).

Pro Tip: Always scale your features before applying Lasso. Without scaling, features measured in large units (like salary in dollars) dominate features in small units (like age in years), and the penalty doesn't apply fairly. StandardScaler is the standard choice.

Tree-Based Feature Importance

Random forests and gradient boosting models like XGBoost naturally compute feature importance during training. They measure how much each feature reduces impurity (Gini or entropy for classification, variance for regression) across all splits in all trees.

This approach requires zero extra computation: you train the model you're already going to use for prediction, then read off the importance scores as a free bonus.

Common Pitfall: Impurity-based importance is biased toward high-cardinality features. A random ID column with 10,000 unique values will appear "important" because it can split the data into many pure subsets. Permutation importance (available via sklearn.inspection.permutation_importance) avoids this bias by measuring how much performance drops when each feature's values are shuffled.
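A sketch comparing the two importance measures on an assumed synthetic stand-in dataset; `permutation_importance` is the real scikit-learn API mentioned above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=50,
                           n_informative=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_tr, y_tr)

# Impurity-based importance: free, but biased toward high-cardinality features
impurity = rf.feature_importances_

# Permutation importance: shuffle each column on held-out data and
# measure how much accuracy drops
perm = permutation_importance(rf, X_te, y_te, n_repeats=5, random_state=42)

top_perm = sorted(range(50), key=lambda i: -perm.importances_mean[i])[:5]
print(f"Top 5 by permutation importance: {top_perm}")
```

Measuring on held-out data is deliberate: permutation importance on training data can still reward memorized noise.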

When to Use Embedded Methods (and When NOT to)

Use when:

  • You want a practical default that works well for most problems
  • You're already using linear models (Lasso) or tree models (importance scores)
  • You need production-friendly selection that doesn't add a separate pipeline step

Do NOT use when:

  • You need model-agnostic results (Lasso selection is specific to linear models)
  • Feature interactions are complex and non-linear, but you're using Lasso (it assumes linearity)
  • You want to explore exhaustive subsets; embedded methods are greedy

The Method Selection Decision Framework

Choosing the right approach depends on your specific situation. This decision tree covers the most common scenarios.

Figure: Decision flowchart for selecting the right feature selection method based on dataset size and model type.

In practice, experienced practitioners combine methods. A typical production pipeline:

  1. Variance threshold to remove constants and near-constants
  2. Mutual information to cut obviously irrelevant features
  3. Lasso or tree importance for the final selection
  4. RFECV only if you're squeezing out the last fraction of accuracy on a small dataset

Data Leakage: The Silent Accuracy Killer

Feature selection must happen inside your cross-validation loop, not before the train-test split. If you compute MI scores on the full dataset (including test rows), you've leaked future information into your feature choices. The model will look artificially good during evaluation and then fail on genuinely unseen data.

The correct workflow:

  1. Split into train and test
  2. Fit the selector on train only
  3. Transform both train and test with the fitted selector
  4. If using cross-validation, repeat steps 2 and 3 inside each fold

Scikit-learn's Pipeline makes this automatic:

python
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

pipe = Pipeline([
    ('select', SelectKBest(mutual_info_classif, k=15)),
    ('model', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Feature selection happens inside each CV fold — no leakage
scores = cross_val_score(pipe, X, y, cv=5, scoring='accuracy')

Warning: Performing feature selection before train_test_split is one of the most common causes of overfit models that fail in production. This applies to all selection methods: filters, wrappers, and embedded. Always fit on training data only.

Putting It All Together: Impact on Model Performance

Let's measure the actual effect of feature selection on our churn dataset. We'll compare using all 50 features against MI-selected subsets.
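A sketch of the comparison, assuming a synthetic make_classification stand-in for the churn data; selection runs inside each fold via a Pipeline, so the numbers are leakage-free (the exact accuracies will differ from the output shown below):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Assumed synthetic stand-in for the churn data
X, y = make_classification(n_samples=2000, n_features=50,
                           n_informative=8, random_state=42)

def cv_accuracy(k=None):
    """5-fold accuracy, optionally selecting the top-k MI features per fold."""
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    est = model if k is None else Pipeline([
        ('select', SelectKBest(mutual_info_classif, k=k)),
        ('model', model),
    ])
    return cross_val_score(est, X, y, cv=5, scoring='accuracy').mean()

print(f"All 50 features: {cv_accuracy():.4f}")
print(f"Top 10 (MI):     {cv_accuracy(10):.4f}")
print(f"Top 15 (MI):     {cv_accuracy(15):.4f}")
```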

Expected Output:

text
Cross-Validated Accuracy (5-fold):
  All 50 features:  0.8330 +/- 0.0302
  Top 10 (MI):      0.8310 +/- 0.0232
  Top 15 (MI):      0.8515 +/- 0.0247

Accuracy change (10 features vs all 50): -0.20%
Features removed: 40 (80% reduction)

The top 15 MI features actually outperform the full 50-feature model by nearly 2 percentage points. And even the aggressive 10-feature subset loses only 0.20% accuracy while removing 80% of columns. In production, that 80% reduction translates directly to faster inference, lower memory, and simpler monitoring.

Production Considerations

Computational Cost

| Method | Training Complexity | Memory | Parallelizable |
|---|---|---|---|
| Variance Threshold | O(n · p) | Low | Yes |
| Mutual Information | O(n · p · k) | Medium | Per-feature |
| RFE | O(p · T_model) | High | Limited |
| Lasso | O(n · p²) | Medium | No |
| Tree Importance | O(T_forest) | High | Yes (per tree) |

Where k is the number of neighbors used in MI estimation and T_model is the cost of a single model training pass.

For a comprehensive theoretical treatment, the Guyon and Elisseeff survey "An Introduction to Variable and Feature Selection" (JMLR, 2003) remains the foundational reference that most modern methods build upon.

Stability Across Retrains

Filter methods produce stable results: the same dataset yields the same MI scores every time. Wrapper methods are less stable because small data changes can shift which features get eliminated first. For production systems that retrain weekly, prefer embedded methods or filters combined with a stability threshold (keep a feature only if it's selected in 80%+ of retrain runs).

Scaling to Large Datasets

For datasets with 10,000+ features (common in genomics, NLP, and sensor data), start with filters. Running RFE on 50,000 features with a random forest estimator could take days. A practical approach: use MI to cut to 500 features, then Lasso to narrow to 50, then optionally RFECV for the final polish. This cascading strategy keeps compute manageable at every stage.
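The cascade can be expressed as chained selectors in one Pipeline. This sketch scales the stage sizes down (15, then 8, instead of 500 and 50) so it runs on a small assumed synthetic dataset; `threshold=-np.inf` makes SelectFromModel keep exactly `max_features` columns:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import (SelectFromModel, SelectKBest,
                                       mutual_info_classif)
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Assumed synthetic stand-in; stage sizes are scaled-down illustrations
X, y = make_classification(n_samples=2000, n_features=50,
                           n_informative=8, random_state=42)

cascade = Pipeline([
    # Stage 1: cheap MI screen cuts to 15 candidates
    ('mi', SelectKBest(mutual_info_classif, k=15)),
    # Stage 2: L1-penalized model keeps the 8 strongest survivors
    ('l1', SelectFromModel(
        LogisticRegression(penalty='l1', solver='liblinear', C=0.1),
        max_features=8, threshold=-np.inf)),
])

X_final = cascade.fit_transform(X, y)
print(f"{X.shape[1]} -> {X_final.shape[1]} features after the cascade")
```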

Conclusion

Feature selection determines whether your model sees signal or drowns in noise. The three method families serve different needs: filters for fast, large-scale screening; wrappers for maximum accuracy on small datasets; embedded methods for the best speed-accuracy tradeoff in production.

The most reliable approach combines methods in sequence. Start cheap with variance thresholding and mutual information, move to Lasso or Elastic Net for principled pruning, and reserve RFECV for situations where every fraction of a percent matters. Whatever you choose, always fit the selector on training data only and validate inside your cross-validation loop.

Once you've selected the right features, the next step is making them better. Feature engineering transforms raw columns into representations that help your model learn faster and generalize further. And if you're choosing the final model for those selected features, our guide on hyperparameter tuning covers how to find the optimal configuration without overfitting.

A model built on 10 well-chosen features will almost always outperform one trained on 500 noisy columns. Feature selection is where that advantage starts.

Frequently Asked Interview Questions

Q: What is the difference between feature selection and feature extraction?

Feature selection picks a subset of existing columns (you keep "income" and drop "user_id"). Feature extraction creates new variables from the originals, like PCA compressing 100 columns into 5 principal components. Selection preserves interpretability; extraction can capture more complex structure but produces opaque dimensions.

Q: Why does Lasso perform feature selection but Ridge does not?

Lasso uses an L1 penalty (sum of absolute coefficient values), which creates diamond-shaped constraint regions in parameter space. The optimal solution tends to hit the diamond's corners, where some coefficients are exactly zero. Ridge uses an L2 penalty (sum of squared coefficients), creating circular constraints. The optimal solution lands on the circle's surface, shrinking coefficients toward zero but never reaching it.

Q: A colleague ran feature selection on the entire dataset before splitting into train and test. What's wrong with this approach?

This causes data leakage. The selection step "saw" test set observations when computing statistics like variance, correlation, or mutual information. The chosen features are optimized for both train and test data, producing inflated evaluation metrics. In production, the model encounters truly unseen data and performance drops. The fix: fit the selector on training data only and apply the same transformation to test data.

Q: When would you prefer RFE over Lasso for feature selection?

RFE is preferable when using non-linear models like decision trees or SVMs as your final model, because it evaluates feature subsets using that exact model. Lasso assumes linear relationships, so it might miss features that interact non-linearly. However, RFE is computationally expensive, so for datasets with 1000+ features, Lasso (or tree-based importance) is more practical as a first pass.

Q: How do you handle feature selection when features are highly correlated with each other?

Correlated features cause instability: Lasso might arbitrarily pick one from a correlated group and zero out the rest, and the choice can flip across retrains. Elastic Net handles this better by combining L1 and L2 penalties, which tends to keep or drop correlated features as a group. Alternatively, compute the correlation matrix first, identify clusters of correlated features (r > 0.9), and keep only the representative with the highest target correlation from each cluster.

Q: Your model uses 200 features. Stakeholders want it reduced to under 20 for explainability. Describe your approach.

Start with variance thresholding to remove constants and near-constants. Then compute mutual information with the target and rank features. Use Lasso with cross-validated alpha to prune further. Finally, run RFECV with the production model to find the optimal subset under 20. Validate that the accuracy drop from 200 to 20 features is acceptable. If not, negotiate with stakeholders on the feature count or use SHAP values to explain the larger model instead of reducing it.

Q: What's the risk of using random forest feature importance for selection?

Impurity-based importance is biased toward high-cardinality and continuous features. A random ID column might appear "important" simply because its many unique values allow for pure splits. Permutation importance is more reliable: it measures performance drop when each feature is shuffled, directly testing whether the feature contains predictive signal rather than just splitting convenience.

Hands-On Practice

Now let's apply Filter, Wrapper, and Embedded feature selection methods to a real dataset. You'll see how different methods identify important features and compare their effectiveness.

Dataset: ML Fundamentals (Loan Approval) A classification dataset with features like age, income, credit score to predict loan approval.


The four visualizations reveal how each method identifies important features differently. Notice that RF Importance and RFE may select different features; this is because they use different criteria (information gain vs. model performance impact). In practice, features selected by multiple methods are typically the most reliable choices.
