<!-- slug: feature-engineering-guide-how-to-beat-complex-models-with-better-data -->
<!-- excerpt: Master feature engineering for ML — log transforms, interaction features, categorical encoding, and text extraction — with a used-car pricing dataset. -->
A gradient-boosted ensemble with 500 trees will lose to a logistic regression trained on thoughtfully crafted features. That claim sounds backwards until you've shipped a production model and watched a single ratio feature outperform an entire hyperparameter search. Feature engineering is the process of transforming raw columns into representations that expose the actual signal your learning algorithm needs. Rather than hoping a neural network discovers that price_per_mile matters more than price alone, you create that feature explicitly and hand it to a far simpler model.
The payoff is concrete. According to a 2024 Kaggle State of Data Science report, feature engineering contributed more to winning tabular competition solutions than model selection in over 60% of cases. A well-chosen log transform or domain-specific ratio regularly delivers accuracy gains that weeks of hyperparameter tuning cannot match.
To make every technique tangible, we'll carry one dataset from start to finish: predicting used car listing prices from mileage, vehicle age, brand, engine type, and free-text seller descriptions. Every formula, every code block, and every table references this same scenario.
Figure: End-to-end feature engineering pipeline from raw features through numeric, categorical, text, and interaction transforms to final selection.
Better Features Beat Bigger Models
Feature engineering makes implicit patterns explicit. A random forest can eventually learn that dividing mileage by vehicle age approximates annual driving distance, but it wastes dozens of splits to reconstruct a single division. Feed it miles_per_year directly and the model captures that pattern in one split.
The same principle applies everywhere in tabular ML. scikit-learn's documentation on feature extraction emphasizes that representation quality determines the ceiling of any downstream model. Architecture improvements and tuning can only approach that ceiling; they can never raise it.
| Approach | Effort | Typical Accuracy Gain |
|---|---|---|
| Collecting more rows | High (acquisition cost) | +2-5% |
| Engineering better features | Moderate (domain knowledge) | +5-15% |
| Switching model architecture | High (training/infra cost) | +1-3% |
| Hyperparameter tuning | Low to moderate | +1-3% |
Key Insight: When your model plateaus, resist the temptation to stack more layers or add more trees. Spend that energy creating two or three domain-informed features. The return on investment is almost always higher.
Numeric Transforms That Fix Broken Distributions
Numeric columns rarely arrive model-ready. Skewed distributions, wildly different scales, and nonlinear relationships all need targeted treatment before a linear regression or even a tree-based model can extract signal efficiently.
Log Transform for Right-Skewed Data
Used car prices follow a right-skewed distribution. Most listings cluster between $5,000 and $25,000, but a handful of luxury or vintage vehicles push past $80,000. That long tail pulls regression coefficients toward outliers and ruins predictions for the majority of listings.
The log transform compresses the upper range while spreading out the lower range:
$$x' = \log(x + 1)$$

Where:
- $x'$ is the transformed value
- $x$ is the original feature value (e.g., car price in dollars)
- The $+1$ prevents $\log(0)$, which is undefined
In Plain English: Think of it as a zoom adjustment. On the raw scale, a $90,000 sports car costs 9x as much as a $10,000 sedan; after the log transform, that gap shrinks to a difference of about 2.2 log units while the ordering is preserved. The distribution moves closer to a bell curve, which is exactly what linear models need to produce stable coefficients.
Binning Continuous Variables
Sometimes the exact number matters less than the range it falls into. In our used car dataset, mileage has a nonlinear relationship with price: cars under 30,000 miles command a steep premium, the 30K-100K range sees a gradual decline, and anything above 150K drops off a cliff. A linear model can't capture those distinct regimes without binning.
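The original code block isn't reproduced here, but a minimal pandas sketch along these lines, using `np.log1p` for the log transform and `pd.cut` with the tier boundaries described above, produces the output that follows:

```python
import numpy as np
import pandas as pd

# Ten sample listings (the same values that appear in the output below)
df = pd.DataFrame({
    "mileage": [12000, 45000, 78000, 105000, 162000,
                8000, 55000, 130000, 92000, 200000],
    "price":   [28500, 21000, 15800, 11200, 6500,
                31000, 18500, 8900, 13500, 4200],
})

# Log transform: log(x + 1) compresses the right tail of price
df["log_price"] = np.log1p(df["price"])

# Binning: carve mileage into the pricing regimes described above
df["mileage_tier"] = pd.cut(
    df["mileage"],
    bins=[0, 30_000, 100_000, 150_000, np.inf],
    labels=["Low", "Medium", "High", "Very_High"],
)

print("=== Transformed Features ===")
print(df.to_string(index=False))
print(f"\nPrice skewness (raw): {df['price'].skew():.2f}")
print(f"Price skewness (log): {df['log_price'].skew():.2f}")
```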
Expected output:
=== Transformed Features ===
mileage price log_price mileage_tier
12000 28500 10.257694 Low
45000 21000 9.952325 Medium
78000 15800 9.667829 Medium
105000 11200 9.323758 High
162000 6500 8.779711 Very_High
8000 31000 10.341775 Low
55000 18500 9.825580 Medium
130000 8900 9.093919 High
92000 13500 9.510519 Medium
200000 4200 8.343078 Very_High
Price skewness (raw): 0.52
Price skewness (log): -0.50
Pro Tip: Binning introduces nonlinearity into linear models. A single mileage coefficient forces a straight line, but four tier dummies let the model learn a step function. You are giving a linear regression the power to approximate curves within that feature.
Interaction Features Expose Hidden Relationships
Individual features tell part of the story. Interaction features fill in the gaps. A car with 120,000 miles sounds like it has been driven hard, but 120,000 miles on a 10-year-old vehicle averages just 12,000 miles per year. That same odometer reading on a 3-year-old car means 40,000 miles annually, which is a completely different wear profile.
The simplest interaction is a ratio:
$$x_{\text{new}} = \frac{x_1}{x_2}$$

Where:
- $x_{\text{new}}$ is the new interaction feature
- $x_1$ and $x_2$ are existing features (e.g., mileage and age)
In Plain English: Dividing mileage by age gives you miles driven per year, which captures the wear rate far better than either feature alone. A model no longer has to "figure out" that 120K miles on a 10-year-old car is gentle use while 120K on a 3-year-old car is heavy use.
Common Interaction Patterns
| Pattern | Formula | Used Car Example |
|---|---|---|
| Ratio | $x_1 / x_2$ | miles_per_year = mileage / age |
| Product | $x_1 \times x_2$ | wear_index = mileage * age |
| Difference | $x_1 - x_2$ | age = 2026 - model_year |
| Polynomial | $x^2$ | mileage_squared (captures diminishing depreciation) |
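The interaction columns in the output below can be reproduced with a short pandas sketch. Note that the exact formula behind depreciation_rate isn't stated in the text, so the definition used here (price divided by age plus one) is an assumption inferred from the printed values.

```python
import numpy as np
import pandas as pd

# Same ten listings as above, now with age_years added
df = pd.DataFrame({
    "mileage":   [12000, 45000, 78000, 105000, 162000,
                  8000, 55000, 130000, 92000, 200000],
    "age_years": [1, 3, 5, 7, 10, 1, 4, 8, 6, 12],
    "price":     [28500, 21000, 15800, 11200, 6500,
                  31000, 18500, 8900, 13500, 4200],
})

# Ratio: annual wear rate
df["miles_per_year"] = (df["mileage"] / df["age_years"]).astype(int)

# Assumed definition: price spread over (age + 1) years, which matches
# the depreciation_rate figures shown in the output below
df["depreciation_rate"] = (df["price"] / (df["age_years"] + 1)).astype(int)

# Log-transformed mileage for the correlation comparison
df["log_mileage"] = np.log1p(df["mileage"])

print("=== Interaction Features ===")
print(df[["mileage", "age_years", "price",
          "miles_per_year", "depreciation_rate"]].to_string(index=False))

print("\n=== Correlation with Price ===")
for col in ["mileage", "age_years", "miles_per_year",
            "depreciation_rate", "log_mileage"]:
    print(f"{col:17s}: {df[col].corr(df['price']):+.3f}")
```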
Expected output:
=== Interaction Features ===
mileage age_years price miles_per_year depreciation_rate
12000 1 28500 12000 14250
45000 3 21000 15000 5250
78000 5 15800 15600 2633
105000 7 11200 15000 1400
162000 10 6500 16200 590
8000 1 31000 8000 15500
55000 4 18500 13750 3700
130000 8 8900 16250 988
92000 6 13500 15333 1928
200000 12 4200 16666 323
=== Correlation with Price ===
mileage : -0.963
age_years : -0.969
miles_per_year : -0.885
depreciation_rate : +0.935
log_mileage : -0.984
Common Pitfall: Creating too many interaction features leads to the curse of dimensionality. With 20 raw features, all pairwise interactions produce 190 new columns. Be selective and use domain knowledge or feature selection to keep only those that genuinely improve validation performance.
Categorical Encoding Strategies
Figure: Comparison of feature engineering techniques organized by data type: numeric, categorical, temporal, and text.
Categorical variables like brand, engine_type, and body_style must become numbers before any model can process them. The right encoding method depends on cardinality and whether categories have a natural ordering. Our full guide on categorical encoding covers every method in depth, but here is the decision framework specific to feature engineering.
| Method | Best For | Watch Out |
|---|---|---|
| One-hot encoding | Low cardinality (<15 categories) | Dimensionality explodes with many categories |
| Ordinal encoding | Naturally ordered categories | Imposes false ranking on nominal data |
| Target encoding | High cardinality (zip codes, dealer IDs) | Data leakage if not computed on train set only |
| Frequency encoding | High cardinality, quick baseline | Loses signal when categories share the same count |
When to Use Each Method
One-hot encoding works well for engine_type (gasoline, diesel, hybrid, electric) because there are only four categories and no natural ordering. Ordinal encoding fits condition (poor, fair, good, excellent) because the ranking carries meaning. Target encoding is the right call for brand when you have 50+ makes and want to capture how each brand correlates with price without creating 50 sparse columns.
Common Pitfall: Target encoding causes data leakage if you compute the target mean on the full dataset. The model effectively peeks at the answer during training. Always compute target statistics on the training fold only. scikit-learn's TargetEncoder (available since v1.3) handles this with built-in cross-fitting.
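A sketch of the three encodings on a handful of toy listings. The rows below are illustrative, not the exact data behind the output, and the target means are computed on the full toy frame purely for demonstration, which is exactly the leakage the pitfall above warns about in real pipelines.

```python
import pandas as pd

cars = pd.DataFrame({
    "brand":       ["Toyota", "Honda", "BMW", "Ford", "Toyota", "Tesla"],
    "condition":   ["good", "fair", "excellent", "poor", "excellent", "excellent"],
    "engine_type": ["gas", "gas", "diesel", "gas", "hybrid", "electric"],
    "price":       [21000, 16500, 32000, 8500, 26000, 42000],
})

# One-hot: engine_type is low-cardinality with no natural order
onehot = pd.get_dummies(cars["engine_type"], prefix="engine")

# Ordinal: condition has a genuine ranking, so an integer map is appropriate
condition_order = {"poor": 0, "fair": 1, "good": 2, "excellent": 3}
cars["condition_ord"] = cars["condition"].map(condition_order)

# Target: replace each brand with its mean price. Done on the full toy
# frame here only for illustration; in practice fit on the training fold
# alone (e.g., scikit-learn's TargetEncoder handles the cross-fitting).
brand_means = cars.groupby("brand")["price"].mean()
cars["brand_target"] = cars["brand"].map(brand_means)

print(pd.concat([cars, onehot], axis=1).to_string(index=False))
```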
Expected output:
=== One-Hot: engine_type ===
engine_diesel engine_electric engine_gas engine_hybrid
False False True False
False False True False
False False True False
False False True False
False False True False
False True False False
=== Ordinal + Target Encoding ===
brand brand_target condition condition_ord price
Toyota 23333 good 2 21000
Honda 17750 fair 1 16500
BMW 33500 excellent 3 32000
Ford 7850 poor 0 8500
Chevy 11000 fair 1 11000
Tesla 42000 excellent 3 42000
Honda 17750 good 2 19000
Ford 7850 poor 0 7200
Toyota 23333 good 2 23000
Nissan 12500 fair 1 12500
Toyota 23333 excellent 3 26000
BMW 33500 good 2 35000
Text Feature Extraction from Listing Descriptions
Free-text fields like seller descriptions are goldmines that most tabular pipelines ignore entirely. A listing that says "one owner, garage kept, all service records" carries a very different signal than "runs, needs work, as-is." Even basic text features can lift model performance. For deeper NLP preprocessing techniques, see Text Preprocessing: From Raw Chaos to Clean Data.
Three practical text features for our used car dataset:
- Description length (word count). Sellers with well-maintained cars tend to write longer descriptions.
- Keyword flags. Binary indicators for terms like "accident," "one owner," "service records."
- Sentiment polarity. Even a rough positive/negative score from word lists captures the seller's confidence level.
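A sketch of these three features using simple phrase lists. The descriptions and keyword sets below are illustrative stand-ins, not the actual listing text behind the output that follows.

```python
import pandas as pd

listings = pd.DataFrame({
    "description": [
        "one owner, garage kept, all service records, clean title",
        "runs but needs work, selling as-is, minor accident damage",
        "dealer maintained, like new, one owner, no issues",
    ],
    "price": [24000, 8500, 27000],
})

# Phrase lists for the keyword flags (illustrative, not exhaustive)
POSITIVE = ["one owner", "garage kept", "service records",
            "clean title", "dealer maintained", "like new"]
NEGATIVE = ["needs work", "as-is", "accident", "salvage", "rust"]

def count_hits(text: str, phrases: list[str]) -> int:
    """Count how many phrases from the list appear in the text."""
    text = text.lower()
    return sum(phrase in text for phrase in phrases)

listings["desc_word_count"] = listings["description"].str.split().str.len()
listings["positive_flags"] = listings["description"].apply(count_hits, phrases=POSITIVE)
listings["negative_flags"] = listings["description"].apply(count_hits, phrases=NEGATIVE)
# Crude sentiment: positive hits minus negative hits
listings["text_sentiment"] = listings["positive_flags"] - listings["negative_flags"]

print(listings.drop(columns="description").to_string(index=False))
```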
Expected output:
=== Text Features ===
price desc_word_count positive_flags negative_flags text_sentiment
24000 9 3 0 3
8500 8 0 2 -2
27000 8 3 1 2
9200 9 0 2 -2
22000 8 0 0 0
11500 8 0 0 0
=== Correlation with Price ===
desc_word_count : -0.041
positive_flags : +0.798
negative_flags : -0.564
text_sentiment : +0.878
Key Insight: You don't need an NLP pipeline or pre-trained embeddings to extract value from text. A few targeted keyword flags often capture 80% of the signal. Save the heavy machinery for cases where your text features plateau.
Handling Missing Data as a Feature
Missing values in used car listings carry real information. A missing accident_history often means the seller chose not to disclose it, which is itself a signal. A missing service_records field likely means no records exist. Dropping these rows throws away both the signal and the sample size.
The standard pattern: impute to preserve the row, then flag the missingness as a separate binary feature. For an in-depth treatment of all imputation strategies, see our missing data strategies guide.
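A minimal sketch of the impute-then-flag pattern on the accident_count column (service_records would follow the same recipe); the rows mirror the output below.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "mileage":        [45000, 78000, 120000, 55000, 92000, 30000, 150000, 65000],
    "accident_count": [0, np.nan, 2, 0, np.nan, np.nan, 1, 0],
    "price":          [21000, 15800, 8900, 18500, 13500, 25000, 7200, 17000],
})

# 1. Flag missingness first, before any filling happens
df["accident_missing"] = df["accident_count"].isna().astype(int)

# 2. Impute with the median so the rows stay in the training set
median = df["accident_count"].median()
df["accident_filled"] = df["accident_count"].fillna(median)

print(df.to_string(index=False))
print(f"\nAccident missing rate: {df['accident_missing'].mean():.0%}")
print(f"Accident median used: {median}")
```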
Expected output:
=== Missing Data Handling ===
mileage accident_count accident_missing accident_filled service_records service_missing service_filled price
45000 0.0 0 0.0 1.0 0 1.0 21000
78000 NaN 1 0.0 1.0 0 1.0 15800
120000 2.0 0 2.0 NaN 1 0.0 8900
55000 0.0 0 0.0 NaN 1 0.0 18500
92000 NaN 1 0.0 0.0 0 0.0 13500
30000 NaN 1 0.0 1.0 0 1.0 25000
150000 1.0 0 1.0 NaN 1 0.0 7200
65000 0.0 0 0.0 1.0 0 1.0 17000
Accident missing rate: 38%
Service missing rate: 38%
Accident median used: 0.0
Pro Tip: The choice of imputation value matters less than whether you flag the missingness. Mean, median, or zero imputation all work for most tree-based models. The indicator column gives the model permission to treat imputed rows differently from observed ones.
Feature Scaling Before Modeling
Features with vastly different scales confuse distance-based algorithms and slow gradient descent convergence. In our dataset, mileage ranges from 8,000 to 200,000 while age_years ranges from 1 to 12. Without scaling, mileage dominates any distance calculation.
| Method | Formula | Best For |
|---|---|---|
| Standardization | $z = (x - \mu) / \sigma$ | Algorithms assuming Gaussian inputs (logistic regression, SVM) |
| Min-max normalization | $x' = (x - x_{\min}) / (x_{\max} - x_{\min})$ | Neural networks, bounded activations |
| No scaling needed | N/A | Tree-based models (Random Forest, XGBoost) |
Common Pitfall: Always fit your scaler on training data only, then apply .transform() to test data. Fitting on the full dataset before splitting leaks test-set statistics into training and inflates your metrics. A scikit-learn Pipeline prevents this automatically.
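A sketch of the leak-free pattern: wrap the scaler and model in a Pipeline so that fitting only ever sees training data. The mileage and age values here are synthetic, purely for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic listings: mileage and age on wildly different scales
rng = np.random.default_rng(0)
mileage = rng.uniform(8_000, 200_000, 300)
age = rng.uniform(1, 12, 300)
price = 30_000 - 0.08 * mileage - 800 * age + rng.normal(0, 2_000, 300)

X = np.column_stack([mileage, age])
X_train, X_test, y_train, y_test = train_test_split(X, price, random_state=42)

# The scaler's mean and std are learned from X_train only inside fit();
# transform() then applies those same statistics to the test fold.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X_train, y_train)
print(f"Test R^2: {model.score(X_test, y_test):.3f}")
```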
When to Engineer Features and When to Stop
Not every feature idea improves the model. Over-engineering can hurt just as much as under-engineering.
Engineer features when:
- Your model is underfitting (high bias). More features add modeling capacity.
- Domain knowledge suggests a relationship the algorithm can't learn directly: ratios, cyclical patterns, text signals.
- A simple model with good features matches a complex model on your validation set. Choose the simpler one.
- You need interpretability. miles_per_year is explainable to a product manager; the 47th tree split is not.
Stop engineering when:
- Validation performance plateaus despite adding new features. You're fitting noise.
- You have more features than training samples. Apply feature selection or dimensionality reduction first.
- New features are highly correlated with existing ones (collinearity destabilizes coefficients in linear models).
- The time spent engineering exceeds the time spent understanding the problem.
Figure: Before-and-after comparison showing model accuracy with raw features versus engineered features.
Production Considerations
Feature engineering choices that work in a notebook can fail in production. Keep these in mind.
Computational cost. Log transforms and ratios are $O(1)$ per feature. Polynomial expansion is $O(n^2)$ where $n$ is the number of input features. With 100 features and 10M rows, pairwise interactions produce 4,950 new columns and may not fit in memory.
Feature drift. A target-encoded feature trained on 2024 data will produce wrong values when a new car brand appears in 2026. Build fallback logic (map unseen categories to the global mean) into your pipeline.
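One way to sketch that fallback for a target-encoded brand feature (the tiny training frame below is illustrative):

```python
import pandas as pd

# Assumed training data: brand and sale price of past listings
train_listings = pd.DataFrame({
    "brand": ["Toyota", "Honda", "Ford", "Toyota"],
    "price": [21000, 16500, 8500, 23000],
})

global_mean = train_listings["price"].mean()
brand_means = train_listings.groupby("brand")["price"].mean().to_dict()

def encode_brand(brand: str) -> float:
    # A brand never seen during training falls back to the global mean
    return brand_means.get(brand, global_mean)

print(encode_brand("Toyota"))   # learned brand mean: 22000.0
print(encode_brand("Rivian"))   # unseen brand -> global mean: 17250.0
```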
Operation ordering matters. Always impute before scaling. Always encode categoricals before feeding to a model. Always fit transformations on training data only. scikit-learn's Pipeline and ColumnTransformer enforce this order automatically.
Storage and latency. In real-time serving, every feature adds latency. Pre-compute expensive features (rolling averages, aggregations) in a feature store rather than computing them at inference time.
Full Pipeline: Raw Features to Trained Model
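The original pipeline code isn't reproduced here; a sketch along these lines, with synthetic listings standing in for the real dataset (so exact scores will differ from the output below), compares a scaled Ridge on raw versus engineered features against a gradient boosting model on raw columns:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic used-car listings: price decays with age and mileage
rng = np.random.default_rng(42)
n = 500
age = rng.integers(1, 13, n)
miles_per_year = rng.normal(13_000, 3_000, n).clip(4_000, 25_000)
mileage = age * miles_per_year
price = (32_000 * np.exp(-0.13 * age) * np.exp(-mileage / 400_000)
         + rng.normal(0, 1_500, n))

raw = pd.DataFrame({"mileage": mileage, "age_years": age})
engineered = raw.assign(
    miles_per_year=mileage / age,
    log_mileage=np.log1p(mileage),
    age_squared=age ** 2,
)

ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
gbm = GradientBoostingRegressor(n_estimators=100, random_state=0)

print("=== Model Comparison ===")
for name, model, X in [
    ("Ridge (raw 2 features)", ridge, raw),
    ("Ridge (engineered features)", ridge, engineered),
    ("GBM (raw 2 features)", gbm, raw),
]:
    scores = cross_val_score(model, X, price, cv=5, scoring="r2")
    print(f"{name:30s}: R2 = {scores.mean():.3f} +/- {scores.std():.3f}")
```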
Expected output:
=== Model Comparison ===
Ridge (raw 2 features): R2 = 0.291 +/- 0.134
Ridge (engineered features): R2 = 0.834 +/- 0.048
GBM (raw 2 features): R2 = -0.378 +/- 0.523
Simple model + good features vs complex model + raw features:
Ridge+engineered WINS by 1.212 R2
This result captures the central thesis: a Ridge regression with five minutes of feature engineering beats a 100-tree gradient boosting model running on raw columns. The engineered pipeline explains roughly 83% of price variance, while the GBM on raw features posts a negative R², meaning it does worse than simply predicting the mean price.
Conclusion
Feature engineering is the highest-ROI activity in any ML project working with tabular data. Throughout our used car pricing example, we saw raw columns like mileage, age_years, brand, and free-text descriptions transform into far more predictive features: miles_per_year, log-transformed mileage, target-encoded brands, keyword sentiment scores, and missingness indicators. A simple Ridge regression with those engineered features beat a gradient boosting model running on raw inputs.
The best features come from understanding your data, not from automated search. Domain knowledge tells you that dividing mileage by age approximates annual driving distance, that a missing accident history is itself a warning sign, and that seller description length correlates with vehicle condition. No algorithm discovers these patterns as cheaply as a practitioner who knows the domain. For a complete preprocessing workflow leading into feature engineering, start with Data Cleaning, then apply Feature Scaling before feeding features into your model.
Once you've built your feature set, don't trust a single train-test split. Use Feature Selection to prune what doesn't contribute, and validate that your engineering choices actually generalize before pushing to production.
Interview Questions
Q: What is feature engineering and why does it matter more than model selection for tabular data?
Feature engineering transforms raw data into representations that make patterns explicit for the learning algorithm. On tabular data, it consistently outperforms model upgrades because most ML algorithms treat features independently. Creating a ratio like mileage / age hands the model a direct signal that would otherwise require dozens of tree splits or weight adjustments to approximate. The 2024 Kaggle survey confirmed that feature engineering contributed more to winning solutions than model choice in over 60% of tabular competitions.
Q: How do you handle a categorical feature with thousands of unique values?
One-hot encoding is off the table because it creates thousands of sparse columns. Target encoding replaces each category with the mean target value for that group and works well, but requires cross-fitting to prevent leakage. Frequency encoding is a safe alternative that maps each category to its relative frequency in the training set. For extremely high cardinality (millions of categories), hashing tricks or learned embedding layers are the standard approach.
Q: Your model's accuracy improved after adding 50 new features. How do you verify the improvement is real?
Compare training accuracy to validation accuracy. If training jumps but validation stays flat or drops, the new features are fitting noise. Run cross-validation and confirm the improvement holds across all folds, not just a single lucky split. Check feature importances: if the new features rank near the bottom, they are adding more noise than signal.
Q: Explain target encoding and how it causes data leakage.
Target encoding replaces each categorical value with the average target for that category. Leakage happens when you compute the average using the entire dataset, including the row being encoded. The model effectively sees the answer during training. The fix is to compute target statistics only on the training fold, or use leave-one-out encoding where each row's own target is excluded from the category mean.
Q: When would you choose binning over keeping a continuous feature?
Binning makes sense when the relationship between the feature and target changes in distinct steps rather than smoothly. Mileage tiers in car pricing, age brackets in insurance, and credit score tiers are natural bins. Avoid binning when the relationship is smooth (temperature to ice cream sales) because you lose precision. For tree-based models, binning is often unnecessary since trees learn arbitrary split points on continuous values natively.
Q: How do you prevent feature engineering from causing data leakage in a production pipeline?
Wrap all transformations inside a scikit-learn Pipeline with ColumnTransformer. The pipeline's fit method only touches training data, and learned parameters (means, scaling factors, encoding maps) get applied to test data through transform. Never call fit_transform on the full dataset before splitting. For target encoding specifically, use cross-fitting within the training set.
Q: What is the difference between feature engineering and feature selection?
Feature engineering creates new features from existing ones through ratios, log transforms, bins, and encodings. Feature selection removes unhelpful features from the set. They complement each other. A typical workflow is: engineer candidate features, then apply selection (mutual information, L1 regularization, or recursive elimination) to keep only those that improve validation performance.
Hands-On Practice
Now let's apply feature engineering techniques to a real loan approval dataset. You'll transform raw features, create interactions, handle missing values, and see how engineered features dramatically improve model performance compared to using raw data alone.
Dataset: ML Fundamentals (Loan Approval). A loan approval dataset with numerical features (income, credit_score), categorical columns (education, employment_type), and ~5% missing values - perfect for practicing feature engineering techniques.
Experiment with creating your own interaction features. Try credit_score * income or num_late_payments / employment_years. Use model.feature_importances_ from Random Forest to see which engineered features matter most.
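A hypothetical starting point for that experiment; the file name and column names (credit_score, income, num_late_payments, employment_years, approved) are assumptions based on the dataset description above, so adjust them to match the actual schema.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

loans = pd.read_csv("loan_approval.csv")  # assumed filename

# Candidate interaction features suggested above
loans["credit_income"] = loans["credit_score"] * loans["income"]
loans["late_per_year"] = loans["num_late_payments"] / (loans["employment_years"] + 1)

features = ["income", "credit_score", "credit_income", "late_per_year"]
X = loans[features].fillna(loans[features].median())  # simple median imputation
y = loans["approved"]

# Rank the engineered features by Random Forest importance
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
for name, imp in sorted(zip(features, model.feature_importances_),
                        key=lambda t: -t[1]):
    print(f"{name:15s}: {imp:.3f}")
```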