
Feature Engineering Guide: How to Beat Complex Models with Better Data

LDS Team · Let's Data Science

<!-- slug: feature-engineering-guide-how-to-beat-complex-models-with-better-data --> <!-- excerpt: Master feature engineering for ML — log transforms, interaction features, categorical encoding, and text extraction — with a used-car pricing dataset. -->

A gradient-boosted ensemble with 500 trees will lose to a logistic regression trained on thoughtfully crafted features. That claim sounds backwards until you've shipped a production model and watched a single ratio feature outperform an entire hyperparameter search. Feature engineering is the process of transforming raw columns into representations that expose the actual signal your learning algorithm needs. Rather than hoping a neural network discovers that price_per_mile matters more than price alone, you create that feature explicitly and hand it to a far simpler model.

The payoff is concrete. According to a 2024 Kaggle State of Data Science report, feature engineering contributed more to winning tabular competition solutions than model selection in over 60% of cases. A well-chosen log transform or domain-specific ratio regularly delivers accuracy gains that weeks of hyperparameter tuning cannot match.

To make every technique tangible, we'll carry one dataset from start to finish: predicting used car listing prices from mileage, vehicle age, brand, engine type, and free-text seller descriptions. Every formula, every code block, and every table references this same scenario.

[Figure: End-to-end feature engineering pipeline from raw features through numeric, categorical, text, and interaction transforms to final selection]

Better Features Beat Bigger Models

Feature engineering makes implicit patterns explicit. A random forest can eventually learn that dividing mileage by vehicle age approximates annual driving distance, but it wastes dozens of splits to reconstruct a single division. Feed it miles_per_year directly and the model captures that pattern in one split.

The same principle applies everywhere in tabular ML. scikit-learn's documentation on feature extraction emphasizes that representation quality determines the ceiling of any downstream model. Architecture improvements and tuning can only approach that ceiling; they can never raise it.

| Approach | Effort | Typical Accuracy Gain |
|---|---|---|
| Collecting more rows | High (acquisition cost) | +2-5% |
| Engineering better features | Moderate (domain knowledge) | +5-15% |
| Switching model architecture | High (training/infra cost) | +1-3% |
| Hyperparameter tuning | Low to moderate | +1-3% |

Key Insight: When your model plateaus, resist the temptation to stack more layers or add more trees. Spend that energy creating two or three domain-informed features. The return on investment is almost always higher.

Numeric Transforms That Fix Broken Distributions

Numeric columns rarely arrive model-ready. Skewed distributions, wildly different scales, and nonlinear relationships all need targeted treatment before a linear regression or even a tree-based model can extract signal efficiently.

Log Transform for Right-Skewed Data

Used car prices follow a right-skewed distribution. Most listings cluster between $5,000 and $25,000, but a handful of luxury or vintage vehicles push past $80,000. That long tail pulls regression coefficients toward outliers and ruins predictions for the majority of listings.

The log transform compresses the upper range while spreading out the lower range:

y' = log(x + 1)

Where:

  • y' is the transformed value
  • x is the original feature value (e.g., car price in dollars)
  • The +1 prevents log(0), which is undefined

In Plain English: Think of it as a zoom adjustment. Log compresses the gap between a $10,000 sedan and a $90,000 sports car from a 9x ratio down to an additive difference of about 2.2 log units, while preserving the ordering. The distribution moves closer to a bell curve, which is exactly what linear models need to produce stable coefficients.

Binning Continuous Variables

Sometimes the exact number matters less than the range it falls into. In our used car dataset, mileage has a nonlinear relationship with price: cars under 30,000 miles command a steep premium, the 30K-100K range sees a gradual decline, and anything above 150K drops off a cliff. A linear model can't capture those distinct regimes without binning.
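Both steps can be sketched in a few lines of pandas. The ten listings below are the same toy sample used throughout this section, and the bin edges (30K, 100K, 150K) follow the regimes described above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "mileage": [12000, 45000, 78000, 105000, 162000,
                8000, 55000, 130000, 92000, 200000],
    "price":   [28500, 21000, 15800, 11200, 6500,
                31000, 18500, 8900, 13500, 4200],
})

# log1p = log(x + 1): compresses the right tail without breaking at zero
df["log_price"] = np.log1p(df["price"])

# Tier boundaries follow the mileage regimes described in the text
df["mileage_tier"] = pd.cut(
    df["mileage"],
    bins=[0, 30_000, 100_000, 150_000, np.inf],
    labels=["Low", "Medium", "High", "Very_High"],
)

print("=== Transformed Features ===")
print(df.to_string(index=False))
print(f"\nPrice skewness (raw): {df['price'].skew():5.2f}")
print(f"Price skewness (log): {df['log_price'].skew():5.2f}")
```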

Expected output:

```text
=== Transformed Features ===
 mileage  price  log_price mileage_tier
   12000  28500  10.257694          Low
   45000  21000   9.952325       Medium
   78000  15800   9.667829       Medium
  105000  11200   9.323758         High
  162000   6500   8.779711    Very_High
    8000  31000  10.341775          Low
   55000  18500   9.825580       Medium
  130000   8900   9.093919         High
   92000  13500   9.510519       Medium
  200000   4200   8.343078    Very_High

Price skewness (raw):  0.52
Price skewness (log):  -0.50
```

Pro Tip: Binning introduces nonlinearity into linear models. A single mileage coefficient forces a straight line, but four tier dummies let the model learn a step function. You are giving a linear regression the power to approximate curves within that feature.

Interaction Features Expose Hidden Relationships

Individual features tell part of the story. Interaction features fill in the gaps. A car with 120,000 miles sounds like it has been driven hard, but 120,000 miles on a 10-year-old vehicle averages just 12,000 miles per year. That same odometer reading on a 3-year-old car means 40,000 miles annually, which is a completely different wear profile.

The simplest interaction is a ratio:

x_new = x_1 / x_2

Where:

  • x_new is the new interaction feature
  • x_1 and x_2 are existing features (e.g., mileage and age)

In Plain English: Dividing mileage by age gives you miles driven per year, which captures the wear rate far better than either feature alone. A model no longer has to "figure out" that 120K miles on a 10-year-old car is gentle use while 120K on a 3-year-old car is heavy use.

Common Interaction Patterns

| Pattern | Formula | Used Car Example |
|---|---|---|
| Ratio | x_1 / x_2 | miles_per_year = mileage / age |
| Product | x_1 × x_2 | wear_index = mileage * age |
| Difference | x_1 - x_2 | age = 2026 - model_year |
| Polynomial | x^2 | mileage_squared (captures diminishing depreciation) |
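A quick pandas sketch of the ratio pattern. The depreciation_rate definition here, price divided by (age + 1), is one plausible construction consistent with the values shown under Expected output; adjust it to whatever wear metric fits your domain.

```python
import pandas as pd

df = pd.DataFrame({
    "mileage":   [12000, 45000, 78000, 105000, 162000,
                  8000, 55000, 130000, 92000, 200000],
    "age_years": [1, 3, 5, 7, 10, 1, 4, 8, 6, 12],
    "price":     [28500, 21000, 15800, 11200, 6500,
                  31000, 18500, 8900, 13500, 4200],
})

# Ratio: annual wear rate -- far more informative than raw mileage alone
df["miles_per_year"] = df["mileage"] / df["age_years"]

# Assumed definition: price spread over (age + 1) years of ownership
df["depreciation_rate"] = df["price"] / (df["age_years"] + 1)

print("=== Interaction Features ===")
print(df.to_string(index=False))
print("\n=== Correlation with Price ===")
print(df.corr()["price"].round(3))
```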

Expected output:

```text
=== Interaction Features ===
 mileage  age_years  price  miles_per_year  depreciation_rate
   12000          1  28500           12000              14250
   45000          3  21000           15000               5250
   78000          5  15800           15600               2633
  105000          7  11200           15000               1400
  162000         10   6500           16200                590
    8000          1  31000            8000              15500
   55000          4  18500           13750               3700
  130000          8   8900           16250                988
   92000          6  13500           15333               1928
  200000         12   4200           16666                323

=== Correlation with Price ===
  mileage               : -0.963
  age_years             : -0.969
  miles_per_year        : -0.885
  depreciation_rate     : +0.935
  log_mileage           : -0.984
```

Common Pitfall: Creating too many interaction features leads to the curse of dimensionality. With 20 raw features, all pairwise interactions produce 190 new columns. Be selective and use domain knowledge or feature selection to keep only those that genuinely improve validation performance.

Categorical Encoding Strategies

[Figure: Comparison of feature engineering techniques organized by data type: numeric, categorical, temporal, and text]

Categorical variables like brand, engine_type, and body_style must become numbers before any model can process them. The right encoding method depends on cardinality and whether categories have a natural ordering. Our full guide on categorical encoding covers every method in depth, but here is the decision framework specific to feature engineering.

| Method | Best For | Watch Out |
|---|---|---|
| One-hot encoding | Low cardinality (<15 categories) | Dimensionality explodes with many categories |
| Ordinal encoding | Naturally ordered categories | Imposes false ranking on nominal data |
| Target encoding | High cardinality (zip codes, dealer IDs) | Data leakage if not computed on train set only |
| Frequency encoding | High cardinality, quick baseline | Loses signal when categories share the same count |

When to Use Each Method

One-hot encoding works well for engine_type (gasoline, diesel, hybrid, electric) because there are only four categories and no natural ordering. Ordinal encoding fits condition (poor, fair, good, excellent) because the ranking carries meaning. Target encoding is the right call for brand when you have 50+ makes and want to capture how each brand correlates with price without creating 50 sparse columns.

Common Pitfall: Target encoding causes data leakage if you compute the target mean on the full dataset. The model effectively peeks at the answer during training. Always compute target statistics on the training fold only. scikit-learn's TargetEncoder (available since v1.3) handles this with built-in cross-fitting.
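All three encodings fit in a short pandas sketch. The target encoding below is the naive full-data version for illustration only; in real training, compute it with cross-fitting, as the pitfall above warns.

```python
import pandas as pd

df = pd.DataFrame({
    "brand":     ["Toyota", "Honda", "BMW", "Ford", "Chevy", "Tesla",
                  "Honda", "Ford", "Toyota", "Nissan", "Toyota", "BMW"],
    "condition": ["good", "fair", "excellent", "poor", "fair", "excellent",
                  "good", "poor", "good", "fair", "excellent", "good"],
    "engine":    ["gas", "gas", "gas", "gas", "gas", "electric",
                  "gas", "gas", "gas", "gas", "hybrid", "diesel"],
    "price":     [21000, 16500, 32000, 8500, 11000, 42000,
                  19000, 7200, 23000, 12500, 26000, 35000],
})

# One-hot: fine at 4 engine types, explodes at 50+ brands
onehot = pd.get_dummies(df["engine"], prefix="engine")

# Ordinal: condition carries a real ranking, so an integer map is safe
order = {"poor": 0, "fair": 1, "good": 2, "excellent": 3}
df["condition_ord"] = df["condition"].map(order)

# Target encoding (naive full-data version -- leaks in real training;
# use cross-fitting, e.g. scikit-learn's TargetEncoder, in practice)
df["brand_target"] = df.groupby("brand")["price"].transform("mean")

print("=== One-Hot: engine_type ===")
print(onehot.head(6).to_string(index=False))
print("\n=== Ordinal + Target Encoding ===")
print(df[["brand", "brand_target", "condition",
          "condition_ord", "price"]].to_string(index=False))
```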

Expected output:

```text
=== One-Hot: engine_type ===
 engine_diesel  engine_electric  engine_gas  engine_hybrid
         False            False        True          False
         False            False        True          False
         False            False        True          False
         False            False        True          False
         False            False        True          False
         False             True       False          False

=== Ordinal + Target Encoding ===
 brand  brand_target condition  condition_ord  price
Toyota         23333      good              2  21000
 Honda         17750      fair              1  16500
   BMW         33500 excellent              3  32000
  Ford          7850      poor              0   8500
 Chevy         11000      fair              1  11000
 Tesla         42000 excellent              3  42000
 Honda         17750      good              2  19000
  Ford          7850      poor              0   7200
Toyota         23333      good              2  23000
Nissan         12500      fair              1  12500
Toyota         23333 excellent              3  26000
   BMW         33500      good              2  35000
```

Text Feature Extraction from Listing Descriptions

Free-text fields like seller descriptions are goldmines that most tabular pipelines ignore entirely. A listing that says "one owner, garage kept, all service records" carries a very different signal than "runs, needs work, as-is." Even basic text features can lift model performance. For deeper NLP preprocessing techniques, see Text Preprocessing: From Raw Chaos to Clean Data.

Three practical text features for our used car dataset:

  1. Description length (word count). Sellers with well-maintained cars tend to write longer descriptions.
  2. Keyword flags. Binary indicators for terms like "accident," "one owner," "service records."
  3. Sentiment polarity. Even a rough positive/negative score from word lists captures the seller's confidence level.
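All three features fit in a minimal sketch. The keyword lists below are illustrative assumptions; extend them for your own corpus.

```python
import pandas as pd

# Illustrative keyword lists -- tune these to your own listings
POSITIVE = ("one owner", "garage kept", "service records", "well maintained")
NEGATIVE = ("accident", "needs work", "as-is", "salvage", "rust")

def text_features(desc: str) -> dict:
    low = desc.lower()
    pos = sum(kw in low for kw in POSITIVE)
    neg = sum(kw in low for kw in NEGATIVE)
    return {
        "desc_word_count": len(low.split()),
        "positive_flags": pos,
        "negative_flags": neg,
        "text_sentiment": pos - neg,  # crude polarity: positives minus negatives
    }

descriptions = [
    "One owner, garage kept, all service records available for review",
    "Runs but needs work, selling as-is, minor rust",
]
feats = pd.DataFrame([text_features(d) for d in descriptions])
print("=== Text Features ===")
print(feats.to_string(index=False))
```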

Expected output:

```text
=== Text Features ===
 price  desc_word_count  positive_flags  negative_flags  text_sentiment
 24000                9               3               0               3
  8500                8               0               2              -2
 27000                8               3               1               2
  9200                9               0               2              -2
 22000                8               0               0               0
 11500                8               0               0               0

=== Correlation with Price ===
  desc_word_count     : -0.041
  positive_flags      : +0.798
  negative_flags      : -0.564
  text_sentiment      : +0.878
```

Key Insight: You don't need an NLP pipeline or pre-trained embeddings to extract value from text. A few targeted keyword flags often capture 80% of the signal. Save the heavy machinery for cases where your text features plateau.

Handling Missing Data as a Feature

Missing values in used car listings carry real information. A missing accident_history often means the seller chose not to disclose it, which is itself a signal. A missing service_records field likely means no records exist. Dropping these rows throws away both the signal and the sample size.

The standard pattern: impute to preserve the row, then flag the missingness as a separate binary feature. For an in-depth treatment of all imputation strategies, see our missing data strategies guide.
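The impute-plus-flag pattern looks like this in pandas, on a small sample shaped like the output below. Accident counts are filled with the median; missing service records are filled with 0 on the assumption that no records exist.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "accident_count":  [0.0, np.nan, 2.0, 0.0, np.nan, np.nan, 1.0, 0.0],
    "service_records": [1.0, 1.0, np.nan, np.nan, 0.0, 1.0, np.nan, 1.0],
})

# Flag first: the indicator preserves the "seller didn't disclose" signal
df["accident_missing"] = df["accident_count"].isna().astype(int)
df["service_missing"] = df["service_records"].isna().astype(int)

# Then impute to keep the row: median for counts, 0 for absent records
df["accident_filled"] = df["accident_count"].fillna(df["accident_count"].median())
df["service_filled"] = df["service_records"].fillna(0.0)

print("=== Missing Data Handling ===")
print(df.to_string(index=False))
print(f"\nAccident missing rate: {df['accident_missing'].mean():.0%}")
print(f"Service missing rate: {df['service_missing'].mean():.0%}")
print(f"Accident median used: {df['accident_count'].median()}")
```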

Expected output:

```text
=== Missing Data Handling ===
 mileage  accident_count  accident_missing  accident_filled  service_records  service_missing  service_filled  price
   45000             0.0                 0              0.0              1.0                0             1.0  21000
   78000             NaN                 1              0.0              1.0                0             1.0  15800
  120000             2.0                 0              2.0              NaN                1             0.0   8900
   55000             0.0                 0              0.0              NaN                1             0.0  18500
   92000             NaN                 1              0.0              0.0                0             0.0  13500
   30000             NaN                 1              0.0              1.0                0             1.0  25000
  150000             1.0                 0              1.0              NaN                1             0.0   7200
   65000             0.0                 0              0.0              1.0                0             1.0  17000

Accident missing rate:  38%
Service missing rate:  38%
Accident median used:  0.0
```

Pro Tip: The choice of imputation value matters less than whether you flag the missingness. Mean, median, or zero imputation all work for most tree-based models. The indicator column gives the model permission to treat imputed rows differently from observed ones.

Feature Scaling Before Modeling

Features with vastly different scales confuse distance-based algorithms and slow gradient descent convergence. In our dataset, mileage ranges from 8,000 to 200,000 while age_years ranges from 1 to 12. Without scaling, mileage dominates any distance calculation.

| Method | Formula | Best For |
|---|---|---|
| Standardization | (x - μ) / σ | Algorithms assuming Gaussian inputs (logistic regression, SVM) |
| Min-max normalization | (x - x_min) / (x_max - x_min) | Neural networks, bounded activations |
| No scaling needed | N/A | Tree-based models (Random Forest, XGBoost) |

Common Pitfall: Always fit your scaler on training data only, then apply .transform() to test data. Fitting on the full dataset before splitting leaks test-set statistics into training and inflates your metrics. A scikit-learn Pipeline prevents this automatically.
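A minimal sketch of the leak-free pattern, using synthetic data with mileage-scale and age-scale columns. Wrapping the scaler in a Pipeline guarantees it only ever learns statistics from the training fold.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.uniform(8_000, 200_000, 200),  # mileage: huge scale
    rng.uniform(1, 12, 200),           # age_years: tiny scale
])
# Synthetic linear price with noise, just to exercise the pipeline
y = 40_000 - 0.1 * X[:, 0] - 1_500 * X[:, 1] + rng.normal(0, 1_000, 200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# fit() learns scaler mean/std on the training fold only;
# score() applies the same transform to the test fold -- no leakage
model = make_pipeline(StandardScaler(), Ridge())
model.fit(X_train, y_train)
print(f"Test R^2: {model.score(X_test, y_test):.3f}")
```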

When to Engineer Features and When to Stop

Not every feature idea improves the model. Over-engineering can hurt just as much as under-engineering.

Engineer features when:

  • Your model is underfitting (high bias). More features add modeling capacity.
  • Domain knowledge suggests a relationship the algorithm can't learn directly: ratios, cyclical patterns, text signals.
  • A simple model with good features matches a complex model on your validation set. Choose the simpler one.
  • You need interpretability. miles_per_year is explainable to a product manager; the 47th tree split is not.

Stop engineering when:

  • Validation performance plateaus despite adding new features. You're fitting noise.
  • You have more features than training samples. Apply feature selection or dimensionality reduction first.
  • New features are highly correlated with existing ones (collinearity destabilizes coefficients in linear models).
  • The time spent engineering exceeds the time spent understanding the problem.

[Figure: Before and after comparison showing model accuracy with raw features versus engineered features]

Production Considerations

Feature engineering choices that work in a notebook can fail in production. Keep these in mind.

Computational cost. Log transforms and ratios are O(n) per feature. Polynomial expansion is O(n · d²) where d is the number of input features. With 100 features and 10M rows, pairwise interactions produce 4,950 new columns and may not fit in memory.

Feature drift. A target-encoded feature trained on 2024 data will produce wrong values when a new car brand appears in 2026. Build fallback logic (map unseen categories to the global mean) into your pipeline.

Operation ordering matters. Always impute before scaling. Always encode categoricals before feeding to a model. Always fit transformations on training data only. scikit-learn's Pipeline and ColumnTransformer enforce this order automatically.

Storage and latency. In real-time serving, every feature adds latency. Pre-compute expensive features (rolling averages, aggregations) in a feature store rather than computing them at inference time.

Full Pipeline: Raw Features to Trained Model
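The comparison harness can be sketched as follows. Everything here is synthetic and assumed (the price formula, the engineered feature set, the model settings), so the printed scores will differ from the figures under Expected output; the structure of the experiment is the point.

```python
import numpy as np
import pandas as pd
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 300
age = rng.uniform(1, 12, n)
mileage = age * rng.uniform(8_000, 18_000, n)
# Multiplicative depreciation: linear models on raw columns underfit this
price = 45_000 * np.exp(-0.30 * age - mileage / 250_000) * rng.lognormal(0, 0.15, n)

raw = pd.DataFrame({"mileage": mileage, "age_years": age})
engineered = raw.assign(
    log_mileage=np.log1p(mileage),
    miles_per_year=mileage / age,
    age_sq=age ** 2,
)

# "Simple model, good features": Ridge fit on log-price, scored on real price
ridge_eng = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), Ridge()),
    func=np.log1p, inverse_func=np.expm1,
)
ridge_raw = make_pipeline(StandardScaler(), Ridge())
gbm_raw = GradientBoostingRegressor(n_estimators=100, random_state=0)

results = {}
print("=== Model Comparison ===")
for name, model, X in [("Ridge (raw)", ridge_raw, raw),
                       ("Ridge (engineered)", ridge_eng, engineered),
                       ("GBM (raw)", gbm_raw, raw)]:
    scores = cross_val_score(model, X, price, cv=5, scoring="r2")
    results[name] = scores.mean()
    print(f"{name:20s} R2 = {scores.mean():.3f} +/- {scores.std():.3f}")
```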

Expected output:

```text
=== Model Comparison ===
Ridge (raw 2 features):         R2 = 0.291 +/- 0.134
Ridge (engineered features):    R2 = 0.834 +/- 0.048
GBM (raw 2 features):           R2 = -0.378 +/- 0.523

Simple model + good features vs complex model + raw features:
  Ridge+engineered WINS by 1.212 R2
```

This result captures the central thesis: a Ridge regression with five minutes of feature engineering beats a 100-tree gradient boosting model running on raw columns. The engineered pipeline explains roughly 83% of price variance, while the GBM on raw features fails to beat a mean-prediction baseline (negative R²).

Conclusion

Feature engineering is the highest-ROI activity in any ML project working with tabular data. Throughout our used car pricing example, we saw raw columns like mileage, age_years, brand, and free-text descriptions transform into far more predictive features: miles_per_year, log-transformed mileage, target-encoded brands, keyword sentiment scores, and missingness indicators. A simple Ridge regression with those engineered features beat a gradient boosting model running on raw inputs.

The best features come from understanding your data, not from automated search. Domain knowledge tells you that dividing mileage by age approximates annual driving distance, that a missing accident history is itself a warning sign, and that seller description length correlates with vehicle condition. No algorithm discovers these patterns as cheaply as a practitioner who knows the domain. For a complete preprocessing workflow leading into feature engineering, start with Data Cleaning, then apply Feature Scaling before feeding features into your model.

Once you've built your feature set, don't trust a single train-test split. Use Feature Selection to prune what doesn't contribute, and validate that your engineering choices actually generalize before pushing to production.

Interview Questions

Q: What is feature engineering and why does it matter more than model selection for tabular data?

Feature engineering transforms raw data into representations that make patterns explicit for the learning algorithm. On tabular data, it consistently outperforms model upgrades because most ML algorithms treat features independently. Creating a ratio like mileage / age hands the model a direct signal that would otherwise require dozens of tree splits or weight adjustments to approximate. The 2024 Kaggle survey confirmed that feature engineering contributed more to winning solutions than model choice in over 60% of tabular competitions.

Q: How do you handle a categorical feature with thousands of unique values?

One-hot encoding is off the table because it creates thousands of sparse columns. Target encoding replaces each category with the mean target value for that group and works well, but requires cross-fitting to prevent leakage. Frequency encoding is a safe alternative that maps each category to its relative frequency in the training set. For extremely high cardinality (millions of categories), hashing tricks or learned embedding layers are the standard approach.

Q: Your model's accuracy improved after adding 50 new features. How do you verify the improvement is real?

Compare training accuracy to validation accuracy. If training jumps but validation stays flat or drops, the new features are fitting noise. Run cross-validation and confirm the improvement holds across all folds, not just a single lucky split. Check feature importances: if the new features rank near the bottom, they are adding more noise than signal.

Q: Explain target encoding and how it causes data leakage.

Target encoding replaces each categorical value with the average target for that category. Leakage happens when you compute the average using the entire dataset, including the row being encoded. The model effectively sees the answer during training. The fix is to compute target statistics only on the training fold, or use leave-one-out encoding where each row's own target is excluded from the category mean.

Q: When would you choose binning over keeping a continuous feature?

Binning makes sense when the relationship between the feature and target changes in distinct steps rather than smoothly. Mileage tiers in car pricing, age brackets in insurance, and credit score tiers are natural bins. Avoid binning when the relationship is smooth (temperature to ice cream sales) because you lose precision. For tree-based models, binning is often unnecessary since trees learn arbitrary split points on continuous values natively.

Q: How do you prevent feature engineering from causing data leakage in a production pipeline?

Wrap all transformations inside a scikit-learn Pipeline with ColumnTransformer. The pipeline's fit method only touches training data, and learned parameters (means, scaling factors, encoding maps) get applied to test data through transform. Never call fit_transform on the full dataset before splitting. For target encoding specifically, use cross-fitting within the training set.

Q: What is the difference between feature engineering and feature selection?

Feature engineering creates new features from existing ones through ratios, log transforms, bins, and encodings. Feature selection removes unhelpful features from the set. They complement each other. A typical workflow is: engineer candidate features, then apply selection (mutual information, L1 regularization, or recursive elimination) to keep only those that improve validation performance.

Hands-On Practice

Now let's apply feature engineering techniques to a real loan approval dataset. You'll transform raw features, create interactions, handle missing values, and see how engineered features dramatically improve model performance compared to using raw data alone.

Dataset: ML Fundamentals (Loan Approval) A loan approval dataset with numerical features (income, credit_score), categorical columns (education, employment_type), and ~5% missing values - perfect for practicing feature engineering techniques.

Experiment with creating your own interaction features. Try credit_score * income or num_late_payments / employment_years. Use model.feature_importances_ from Random Forest to see which engineered features matter most.

Explore all career paths