A bank receives ten loan applications. Each applicant walks in with an income figure, a credit score, and years of employment. The loan officer needs a systematic rule (not gut feeling) to decide: approve or deny. A decision tree builds exactly that rule by asking one question at a time, choosing each question to separate approved applicants from denied ones as cleanly as possible.
Decision trees form the backbone of every tree-based model used in production today. Random forests, gradient boosting machines, and XGBoost all start here. If you don't understand how a single tree selects its splits, measures impurity, and controls its own complexity, the ensemble methods built on top will remain a black box.
This guide covers the full mechanics: splitting criteria (entropy and Gini impurity), the greedy search algorithm, stopping rules, pruning strategies, regression trees, and the practical trade-offs that determine when a decision tree is the right tool, and when it isn't.
Recursive binary splitting
A decision tree is a non-parametric supervised learning model that predicts a target variable by partitioning the feature space into axis-aligned rectangular regions. Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone formalized this in the CART (Classification and Regression Trees) algorithm in their 1984 monograph. Around the same time, Ross Quinlan introduced the entropy-based ID3 algorithm (1986), later extending it to C4.5 in 1993. Scikit-learn's decision tree module (v1.8, December 2025) implements an optimized version of CART.
The recursive splitting procedure works as follows:
- Start with the full dataset at a single node (the root).
- Evaluate every feature and every possible threshold within that feature.
- Select the split that produces the greatest reduction in impurity.
- Divide the data into two child nodes based on that split.
- Repeat steps 2 to 4 on each child node until a stopping condition is met.
- Assign a prediction to each terminal node (leaf): the majority class for classification, or the mean target value for regression.
This procedure is greedy: it picks the locally optimal split at each step without backtracking. It never reconsiders an earlier split in light of later ones. Hyafil and Rivest proved in 1976 that finding the globally optimal binary decision tree is NP-complete, which is why every practical implementation relies on greedy heuristics instead.
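The greedy loop above can be sketched in a few lines. This is a from-scratch illustration, not scikit-learn's optimized implementation; the arrays hard-code the loan table introduced below.

```python
import numpy as np

def entropy(y):
    """Shannon entropy in bits for an array of class labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def best_split(X, y):
    """Greedy exhaustive search: every feature, every midpoint threshold;
    return the (feature index, threshold, information gain) that wins."""
    parent = entropy(y)
    best = (None, None, 0.0)
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        for lo, hi in zip(values[:-1], values[1:]):
            thr = (lo + hi) / 2                    # midpoint candidate
            left = X[:, j] <= thr
            child = (left.mean() * entropy(y[left])
                     + (~left).mean() * entropy(y[~left]))
            if parent - child > best[2]:
                best = (j, thr, parent - child)
    return best

# Loan table: [Income_k, Credit_Score, Employment_yrs], 1 = Approved
X = np.array([[72, 750, 6], [45, 680, 2], [91, 620, 8], [38, 610, 1],
              [65, 740, 5], [42, 690, 4], [48, 580, 3], [85, 760, 7],
              [55, 710, 4], [60, 730, 5]], dtype=float)
y = np.array([1, 0, 0, 0, 1, 1, 0, 1, 1, 1])

j, thr, gain = best_split(X, y)
print(j, thr, round(gain, 3))   # feature 1 (Credit_Score), threshold 685.0
```

Note that the exhaustive midpoint search lands on 685 rather than the round 700 used in the hand-worked example later; both thresholds belong to Credit Score, but 685 happens to separate the ten applicants perfectly.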
*Figure: Decision tree recursive splitting algorithm flowchart showing the greedy search process*
Tree anatomy
Every decision tree has three types of nodes:
| Node type | Role | Example (loan approval) |
|---|---|---|
| Root node | Contains the entire dataset; performs the first split | "Credit Score > 700?" |
| Internal nodes | Each tests a single feature against a threshold and routes samples left or right | "Employment > 3 years?" |
| Leaf nodes | Hold the final prediction; no further splitting | "Approved" or "Denied" |
Key Insight: Decision trees are white-box models. You can trace any prediction from root to leaf and explain exactly which feature thresholds led to the outcome. This transparency is why regulated industries (banking, healthcare, insurance) still rely on tree-based models when they need to justify decisions to auditors and regulators.
The loan approval dataset
To make every formula and concept concrete, here's the dataset we'll use from start to finish: 10 loan applicants.
| Applicant | Income | Credit Score | Employment (yrs) | Decision |
|---|---|---|---|---|
| 1 | $72k | 750 | 6 | Approved |
| 2 | $45k | 680 | 2 | Denied |
| 3 | $91k | 620 | 8 | Denied |
| 4 | $38k | 610 | 1 | Denied |
| 5 | $65k | 740 | 5 | Approved |
| 6 | $42k | 690 | 4 | Approved |
| 7 | $48k | 580 | 3 | Denied |
| 8 | $85k | 760 | 7 | Approved |
| 9 | $55k | 710 | 4 | Approved |
| 10 | $60k | 730 | 5 | Approved |
We have 6 Approved and 4 Denied outcomes. The tree must decide: which feature and which threshold should it split on first?
To answer that, it needs a way to measure disorder.
Shannon entropy as a splitting criterion
Entropy, introduced by Claude Shannon in his landmark 1948 paper "A Mathematical Theory of Communication," quantifies the average amount of information (in bits) needed to identify the class of a randomly selected sample. For a node $S$ containing $C$ classes:

$$H(S) = -\sum_{i=1}^{C} p_i \log_2 p_i$$

Where:
- $H(S)$ is the entropy of node $S$ (measured in bits)
- $p_i$ is the proportion of samples belonging to class $i$
- $C$ is the total number of distinct classes
- $\log_2$ is the base-2 logarithm
In Plain English: Entropy measures how mixed a node is. If every loan applicant in the node got approved, entropy is 0, meaning there's no uncertainty. If half were approved and half denied, entropy hits its maximum of 1.0 bit; it's a coin flip. The tree's job is to find splits that drive entropy toward zero in the child nodes.
Computing entropy for the root node
The root contains 6 Approved and 4 Denied out of 10 total, so $p_{\text{approved}} = 0.6$ and $p_{\text{denied}} = 0.4$:

$$H(\text{root}) = -0.6 \log_2 0.6 - 0.4 \log_2 0.4 \approx 0.971 \text{ bits}$$

An entropy of 0.971, close to the theoretical maximum of 1.0 for two classes, confirms the root node is heavily mixed. Almost no useful signal yet.
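You can check this number with a few lines of NumPy (a quick throwaway helper, not part of any library):

```python
import numpy as np

def entropy_bits(counts):
    """Entropy in bits from class counts, e.g. [6, 4] for 6 Approved / 4 Denied."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]                    # 0 * log(0) is defined as 0
    return float(-(p * np.log2(p)).sum())

print(entropy_bits([6, 4]))   # root node: ≈ 0.971
print(entropy_bits([5, 0]))   # pure node: 0.0
print(entropy_bits([5, 5]))   # 50/50 node: 1.0
```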
Information gain drives every split
Entropy measures disorder at a single node. Information gain measures how much a particular split reduces that disorder. It equals the parent entropy minus the weighted average entropy of the children:

$$IG(S, A) = H(S) - \sum_{v \in \text{values}(A)} \frac{|S_v|}{|S|} H(S_v)$$

Where:
- $IG(S, A)$ is the information gain from splitting set $S$ on attribute $A$
- $H(S)$ is the entropy of the parent node
- $S_v$ is the subset of samples that take value $v$ after the split
- $\frac{|S_v|}{|S|}$ is the proportion of samples routed to child $v$
In Plain English: Information gain tells you how many bits of uncertainty you remove by asking a specific question about the loan applicants. The tree evaluates every candidate question (every feature-threshold pair) and picks the one that removes the most uncertainty. Higher gain means a more informative split.
Evaluating candidate splits on the loan data
Let's work through three candidate splits and compare their information gain.
Candidate 1: Credit Score > 700
- Left child (score > 700): applicants 1, 5, 8, 9, 10; all 5 Approved, 0 Denied. $H_{\text{left}} = 0$ (pure node).
- Right child (score ≤ 700): applicants 2, 3, 4, 6, 7, with 1 Approved (applicant 6) and 4 Denied. $H_{\text{right}} = -0.2 \log_2 0.2 - 0.8 \log_2 0.8 \approx 0.722$.

$$IG = 0.971 - \tfrac{5}{10}(0) - \tfrac{5}{10}(0.722) \approx 0.610$$
Candidate 2: Income > $50k
- Left child (income > $50k): applicants 1, 3, 5, 8, 9, 10, yielding 5 Approved and 1 Denied (applicant 3 earns $91k but has a credit score of only 620). $H_{\text{left}} \approx 0.650$.
- Right child (income ≤ $50k): applicants 2, 4, 6, 7, with 1 Approved (applicant 6) and 3 Denied. $H_{\text{right}} \approx 0.811$.

$$IG = 0.971 - \tfrac{6}{10}(0.650) - \tfrac{4}{10}(0.811) \approx 0.257$$
Candidate 3: Employment > 3 years
- Left child (> 3 years): applicants 1, 3, 5, 6, 8, 9, 10; 6 Approved, 1 Denied (applicant 3 has 8 years of experience but was denied for poor credit). $H_{\text{left}} \approx 0.592$.
- Right child (≤ 3 years): applicants 2, 4, 7, all Denied (pure node). $H_{\text{right}} = 0$.

$$IG = 0.971 - \tfrac{7}{10}(0.592) - \tfrac{3}{10}(0) \approx 0.557$$
Result: Credit Score > 700 wins with an information gain of 0.610, ahead of Employment > 3 years (0.557) and Income > $50k (0.257). The tree picks this as its root split.
| Candidate Split | Information Gain | Winner? |
|---|---|---|
| Credit Score > 700 | 0.610 | Yes |
| Employment > 3 years | 0.557 | |
| Income > $50k | 0.257 |
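The table can be verified with a short script; the `entropy` and `info_gain` helpers below are ad hoc names, not library functions:

```python
import numpy as np

def entropy(y):
    """Entropy in bits for a 0/1 label array."""
    p = np.bincount(y) / len(y)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def info_gain(feature, threshold, y):
    """Parent entropy minus weighted child entropy for 'feature > threshold'."""
    mask = feature > threshold
    w = mask.mean()
    return entropy(y) - w * entropy(y[mask]) - (1 - w) * entropy(y[~mask])

income = np.array([72, 45, 91, 38, 65, 42, 48, 85, 55, 60])
score  = np.array([750, 680, 620, 610, 740, 690, 580, 760, 710, 730])
years  = np.array([6, 2, 8, 1, 5, 4, 3, 7, 4, 5])
y      = np.array([1, 0, 0, 0, 1, 1, 0, 1, 1, 1])   # 1 = Approved

print(f"Credit Score > 700:   {info_gain(score, 700, y):.3f}")   # ≈ 0.610
print(f"Employment > 3 years: {info_gain(years, 3, y):.3f}")     # ≈ 0.557
print(f"Income > $50k:        {info_gain(income, 50, y):.3f}")   # ≈ 0.256
```

The income split comes out at ≈0.256 in full precision; the 0.257 in the table comes from rounding the intermediate entropies to three decimals first.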
The algorithm then recurses into each child node. The left child is already pure (all Approved), so it becomes a leaf. The right child (1 Approved, 4 Denied) gets split further on a different feature.
*Figure: Decision tree for loan approval showing credit score as root split with information gain values*
Gini impurity
Gini impurity is the default splitting criterion in scikit-learn's DecisionTreeClassifier (as of version 1.8). It measures the probability that a randomly chosen sample from a node would be misclassified if labeled randomly according to the class distribution at that node:

$$G(S) = 1 - \sum_{i=1}^{C} p_i^2$$

Where:
- $G(S)$ is the Gini impurity of node $S$
- $p_i$ is the proportion of samples belonging to class $i$
- $C$ is the total number of classes
In Plain English: For our loan approval node, Gini impurity equals $2 \times p_{\text{approved}} \times p_{\text{denied}}$, which represents the probability that if you randomly pick two loan applicants from the node, they'd have different outcomes. A pure node (all approved or all denied) has Gini = 0. A perfectly balanced node has Gini = 0.5: maximum confusion.
Computing Gini for the root node

$$G(\text{root}) = 1 - (0.6^2 + 0.4^2) = 1 - 0.52 = 0.48$$

Gini reduction for Credit Score > 700

The left child (5 Approved, 0 Denied) has $G_{\text{left}} = 0$. The right child (1 Approved, 4 Denied) has $G_{\text{right}} = 1 - (0.2^2 + 0.8^2) = 0.32$.

$$\Delta G = 0.48 - \tfrac{5}{10}(0) - \tfrac{5}{10}(0.32) = 0.32$$
The same split wins by Gini impurity. In practice, both metrics agree on the optimal split in the vast majority of cases.
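The same arithmetic in code, using a throwaway `gini` helper:

```python
def gini(counts):
    """Gini impurity from a list of class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

root  = gini([6, 4])    # 1 - (0.36 + 0.16) = 0.48
left  = gini([5, 0])    # pure node: 0.0
right = gini([1, 4])    # 1 - (0.04 + 0.64) = 0.32

reduction = root - (0.5 * left + 0.5 * right)
print(root, right, reduction)   # 0.48, 0.32, 0.32
```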
Entropy vs Gini impurity in practice
Both metrics range from 0 (pure) to a maximum at a balanced class distribution. Here's how they compare:
| Property | Gini Impurity | Entropy |
|---|---|---|
| Formula | $1 - \sum p_i^2$ | $-\sum p_i \log_2 p_i$ |
| Max value (binary) | 0.5 | 1.0 |
| Computation cost | No log needed | Requires logarithm |
| Default in scikit-learn | Yes (criterion='gini') | No (set criterion='entropy') |
| Used by | CART | ID3, C4.5 |
| Sensitivity to skew | Slightly less | Slightly more |
Empirical studies consistently find that the two criteria produce nearly identical trees on real-world datasets. Breiman himself noted this in the original CART monograph. The choice rarely affects model performance by more than a fraction of a percent.
Pro Tip: Unless you have a specific theoretical reason to prefer entropy (e.g., you're working on information-theoretic problems or optimizing mutual information), stick with the default Gini. It trains marginally faster because it avoids logarithm calculations, and performance is equivalent.
Gain ratio: correcting for high-cardinality bias
One weakness of raw information gain is that it favors features with many unique values. A feature like "Applicant ID" (unique per row) would achieve perfect information gain by creating one pure leaf per sample, which is obviously useless for generalization.
Quinlan's C4.5 addresses this with the gain ratio, which normalizes information gain by the feature's intrinsic information:

$$\text{GainRatio}(S, A) = \frac{IG(S, A)}{IV(A)}, \qquad IV(A) = -\sum_{v \in \text{values}(A)} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}$$

Where:
- $IV(A)$ is the intrinsic value (split information) of attribute $A$
In Plain English: Gain ratio penalizes features that split the loan applicants into many tiny groups. A feature that creates 10 groups of 1 applicant each has high intrinsic value (high split information), which deflates its gain ratio. This prevents the tree from choosing splits that are technically pure but practically useless.
Scikit-learn's CART implementation doesn't use gain ratio; it handles this bias through max_features and tree depth constraints instead. But you'll encounter gain ratio in interview questions and in implementations based on C4.5.
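A small sketch of the C4.5-style calculation; the helper names are mine, and the "Applicant ID" feature is a hypothetical worst case:

```python
import numpy as np

def split_info(sizes):
    """Intrinsic value: entropy of the split proportions themselves."""
    p = np.asarray(sizes, dtype=float)
    p = p / p.sum()
    return float(-(p * np.log2(p)).sum())

def gain_ratio(info_gain, sizes):
    return info_gain / split_info(sizes)

# Credit Score > 700 splits the 10 applicants 5/5: split info = 1.0 bit
print(gain_ratio(0.610, [5, 5]))        # ratio stays 0.610

# A hypothetical "Applicant ID" split: 10 singleton groups
print(split_info([1] * 10))             # log2(10) ≈ 3.32 bits of split info
print(gain_ratio(0.971, [1] * 10))      # perfect gain deflated to ≈ 0.29
```

Even though the ID split achieves the maximum possible gain (0.971), its gain ratio drops below the credit-score split's 0.610, so C4.5 rejects it.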
Decision trees for regression
When the target variable is continuous (e.g., predicting loan amount rather than approve/deny), the tree can't measure "class purity." Instead, it minimizes variance (equivalently, mean squared error) within each node. Scikit-learn's DecisionTreeRegressor uses squared_error as its default criterion.
The variance reduction for a split is:

$$\Delta \text{Var} = \text{Var}(S) - \frac{n_L}{n}\,\text{Var}(S_L) - \frac{n_R}{n}\,\text{Var}(S_R)$$

Where:
- $\text{Var}(S)$ is the variance of the target in the parent node
- $n_L, n_R$ are the sample counts in each child
- $n$ is the total sample count in the parent
In Plain English: Imagine the tree is predicting loan amounts, ranging from $5,000 to $500,000. A good split groups similar amounts together, maybe separating the $5k to $50k loans from the $100k to $500k ones. Each leaf predicts the average loan amount of its members. The tree keeps splitting until each leaf covers a tight enough range.
Each leaf predicts the mean of the training samples that fall into it. This produces the characteristic staircase-shaped predictions of regression trees, where the prediction is constant within each rectangular partition of the feature space.
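A quick numeric sketch of one such split; the loan amounts below are hypothetical, chosen to show a small-loan/large-loan separation:

```python
import numpy as np

def variance_reduction(parent, left, right):
    """Var(parent) minus the sample-weighted variance of the children."""
    n = len(parent)
    return (np.var(parent)
            - len(left) / n * np.var(left)
            - len(right) / n * np.var(right))

# Hypothetical loan amounts in $k: five small loans, five large ones
amounts = np.array([5, 12, 20, 35, 50, 120, 200, 310, 420, 500], dtype=float)
left, right = amounts[amounts <= 50], amounts[amounts > 50]

print(f"variance reduction: {variance_reduction(amounts, left, right):,.0f}")
print(f"leaf predictions:   ${left.mean():.1f}k / ${right.mean():.1f}k")
```

Each leaf's prediction is just the mean of its members: $24.4k for the small-loan leaf, $310.0k for the large-loan leaf.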
The greedy search algorithm in detail
At every node, the algorithm runs an exhaustive search:
- For each feature $j$ in the dataset:
  - Sort the samples by $x_j$ (cost: $O(n \log n)$).
  - For each pair of adjacent distinct values, compute the midpoint as a candidate threshold.
  - Calculate the impurity reduction for that threshold.
- Select the (feature, threshold) pair that maximizes impurity reduction.
- Split and recurse.
Computational complexity
| Operation | Time Complexity | Notes |
|---|---|---|
| Building a balanced tree | $O(d \cdot n \log^2 n)$ | $d$ features, $n$ samples, sorting at each of $\log n$ levels |
| Building a worst-case tree | $O(d \cdot n^2 \log n)$ | Unbalanced tree with $n$ levels |
| Prediction (single sample) | $O(\text{depth})$ | depth = tree depth |
| Prediction ($m$ samples) | $O(m \cdot \text{depth})$ | Embarrassingly parallel |
Scikit-learn optimizes this with cached sort indices: features are sorted once at the root, and child nodes reuse the sorted order. This brings the practical cost closer to $O(d \cdot n)$ per level.
Pro Tip: For datasets beyond 100K rows, consider HistGradientBoostingClassifier instead of growing a single deep tree. It bins continuous features into 256 discrete buckets, reducing split evaluation from $O(n \log n)$ to $O(n)$ per feature. This histogram-binning speedup was popularized by LightGBM.
Overfitting and why single trees need constraints
An unconstrained decision tree will grow until every leaf contains a single sample or is pure. On the training set, this produces 100% accuracy. On unseen data, it performs terribly because the tree has memorized noise rather than learning patterns.
This is the fundamental weakness of decision trees: their flexibility is also their liability. A tree with 500 leaves on a dataset with 500 samples has essentially memorized the training data. It's a lookup table, not a model.
The bias-variance tradeoff is stark with decision trees. An unconstrained tree has very low bias (it can fit almost anything) but extremely high variance (it changes drastically with small data perturbations). Regularization brings variance under control at the cost of slightly higher bias.
*Figure: Pre-pruning vs post-pruning strategies for controlling decision tree overfitting*
Pre-pruning (early stopping)
Pre-pruning stops the tree before it overfits. As of scikit-learn 1.8, DecisionTreeClassifier exposes these hyperparameters:
| Parameter | Default | Typical Range | Effect on Loan Tree |
|---|---|---|---|
| `max_depth` | None (unlimited) | 3 to 12 | Limits how many questions deep each decision path goes |
| `min_samples_split` | 2 | 5 to 50 | A node with fewer applicants than this won't be split further |
| `min_samples_leaf` | 1 | 3 to 20 | Each resulting group must contain at least this many applicants |
| `max_leaf_nodes` | None (unlimited) | 10 to 100 | Hard cap on total number of final decision groups |
| `max_features` | None (all) | 'sqrt', 'log2', int | Restricts how many features are evaluated per split |
Common Pitfall: Setting max_depth too low (e.g., 1 or 2) creates an underfitting stump that captures only the most obvious pattern. Setting it too high (e.g., 20+) gives the tree enough room to memorize noise. Start with max_depth between 3 and 8 for a baseline, then tune with cross-validation.
Post-pruning (cost-complexity pruning)
Post-pruning grows the full tree first, then trims branches that don't justify their complexity. The CART algorithm defines the cost-complexity measure:

$$R_\alpha(T) = R(T) + \alpha\,|\tilde{T}|$$

Where:
- $R_\alpha(T)$ is the regularized cost of tree $T$
- $R(T)$ is the total weighted impurity summed across all leaf nodes
- $|\tilde{T}|$ is the number of leaves in the tree
- $\alpha$ is the complexity parameter (higher = more aggressive pruning)
In Plain English: Think of $\alpha$ as a "tax per leaf." Each leaf in the loan approval tree costs $\alpha$ units of complexity. If a branch improves impurity by less than its leaf tax, it gets pruned. Larger $\alpha$ means a heavier tax, which means fewer leaves survive. You pick the optimal $\alpha$ by finding the value where cross-validated accuracy peaks.
In scikit-learn, ccp_alpha controls this. The cost_complexity_pruning_path() method returns the sequence of effective alpha values, letting you search for the optimal value.
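A typical search loop might look like this; the breast-cancer dataset stands in for any tabular problem, and the seeds are arbitrary:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Sequence of effective alphas for this dataset (one fully-grown tree per call)
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Cross-validate a tree at each alpha and keep the best
scores = [
    (alpha, cross_val_score(
        DecisionTreeClassifier(ccp_alpha=alpha, random_state=0), X, y, cv=5
    ).mean())
    for alpha in path.ccp_alphas
]
best_alpha, best_score = max(scores, key=lambda t: t[1])
print(f"best ccp_alpha={best_alpha:.5f}, CV accuracy={best_score:.3f}")
```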
Native missing value support
As of scikit-learn 1.3 (released mid-2023, stable through 1.8), DecisionTreeClassifier and DecisionTreeRegressor handle missing values natively when using splitter='best'. You no longer need to impute NaN values before training.
The approach is straightforward: for each candidate threshold, the splitter evaluates two configurations (all missing values going left, and all missing values going right) and picks whichever yields greater impurity reduction. During prediction, samples with missing values follow whichever direction was learned at training time.
This works with criterion set to 'gini', 'entropy', or 'log_loss' for classification, and 'squared_error', 'friedman_mse', or 'poisson' for regression.
Pro Tip: Native NaN handling doesn't mean you should skip data cleaning entirely. It handles missing values gracefully, but if 40% of your credit scores are missing, the tree is working with much less signal than if you'd investigated why the data is missing and addressed the root cause. Check the missing data strategies guide for a systematic approach.
Feature importance and its limitations
After training, feature_importances_ gives you a ranked list of which features mattered most. Scikit-learn computes this using Mean Decrease in Impurity (MDI): for each feature, sum the total impurity reduction across every node where that feature is the splitter, weighted by sample count.
The weighted impurity decrease at node $t$:

$$\Delta i(t) = \frac{n_t}{N}\left( i(t) - \frac{n_{t_L}}{n_t}\, i(t_L) - \frac{n_{t_R}}{n_t}\, i(t_R) \right)$$

Where:
- $N$ is the total number of training samples
- $n_t$ is the number of samples reaching node $t$
- $n_{t_L}, n_{t_R}$ are the sample counts in the left and right children
In Plain English: In our loan approval tree, credit score contributed the biggest impurity reduction at the root split, so it gets the highest importance score. Employment years contributed at a lower level, and income might not appear at all. The importance values sum to 1.0 across all features.
The high-cardinality bias problem
MDI-based importance has a known flaw: it's biased toward features with many unique values. A feature with 100 distinct values offers 99 candidate thresholds, while a binary feature offers just 1. The high-cardinality feature gets more chances to look useful, even if it carries no real predictive signal.
The fix: use permutation importance as a complement. Permutation importance randomly shuffles a feature's values and measures how much accuracy drops. It's model-agnostic, unbiased, and available via sklearn.inspection.permutation_importance. For a deeper look at this, see feature selection.
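The bias and its fix are easy to reproduce on synthetic data; the appended column below is deliberately pure noise with many unique values:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X, y = make_classification(n_samples=500, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)
# High-cardinality noise feature: 500 unique values, zero signal
X = np.hstack([X, rng.uniform(size=(500, 1))])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# MDI often assigns the noise feature nonzero importance (the bias)
print("MDI importance of noise feature:", clf.feature_importances_[-1])

# Permutation importance on held-out data exposes it as noise
perm = permutation_importance(clf, X_te, y_te, n_repeats=20, random_state=0)
print("Permutation importance of noise feature:", perm.importances_mean[-1])
```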
When to use decision trees (and when not to)
Decision trees aren't always the right tool. Here's a decision framework:
*Figure: Decision framework flowchart for choosing when to use decision trees vs other models*
Use a single decision tree when:
- Interpretability is non-negotiable. Regulatory requirements (GDPR right to explanation, Fair Lending Act), stakeholder presentations, or clinical decision support all demand explainable models.
- You need a quick baseline. A depth-limited tree trains in seconds and reveals which features matter before you invest in complex models.
- Features are mixed types: numeric and categorical together, with potential non-linear interactions. Trees handle this without feature engineering.
- Dataset is small to medium, under 10K rows. Ensemble methods often don't have enough data to show their advantage over a well-tuned single tree.
Don't use a single decision tree when:
- Accuracy is the priority and interpretability is secondary. Random forests or gradient boosting will almost always outperform a single tree.
- Relationships are fundamentally linear. Logistic regression or linear regression will be more efficient and just as accurate with fewer parameters.
- Stability matters. Removing one loan applicant from training data can completely restructure the tree. If you need stable predictions, use ensembles.
- You have diagonal decision boundaries. A tree needs many axis-aligned splits to approximate a diagonal boundary. SVMs handle this natively.
Implementation in Python
The following code trains a DecisionTreeClassifier on our loan approval dataset, then demonstrates visualization, feature importance, and cost-complexity pruning on Iris for broader illustration.
Loan approval classification
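A minimal script that produces the output shown below; the entropy criterion and random seed are my choices, and the data comes straight from the loan table:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

feature_names = ["Income_k", "Credit_Score", "Employment_yrs"]
X = np.array([
    [72, 750, 6], [45, 680, 2], [91, 620, 8], [38, 610, 1], [65, 740, 5],
    [42, 690, 4], [48, 580, 3], [85, 760, 7], [55, 710, 4], [60, 730, 5],
])
y = np.array([1, 0, 0, 0, 1, 1, 0, 1, 1, 1])   # 1 = Approved, 0 = Denied

clf = DecisionTreeClassifier(criterion="entropy", random_state=42).fit(X, y)

print("Decision Tree Rules (Loan Approval):")
print(export_text(clf, feature_names=feature_names))

print("Feature Importances:")
for name, imp in zip(feature_names, clf.feature_importances_):
    print(f"  {name}: {imp:.4f}")

new_applicant = [[62, 715, 5]]
pred = clf.predict(new_applicant)[0]
print("New applicant (Income: $62k, Credit: 715, Employment: 5 yrs)")
print("Prediction:", "Approved" if pred == 1 else "Denied")
```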
Expected output:
Decision Tree Rules (Loan Approval):
|--- Credit_Score <= 685.00
| |--- class: 0
|--- Credit_Score > 685.00
| |--- class: 1
Feature Importances:
Income_k: 0.0000
Credit_Score: 1.0000
Employment_yrs: 0.0000
New applicant (Income: $62k, Credit: 715, Employment: 5 yrs)
Prediction: Approved
Notice that the tree never uses Income or Employment years; Credit Score alone carries all the predictive signal for this dataset. The root splits on Credit Score, exactly the feature our information gain calculations selected, but the exhaustive midpoint search lands on 685 (halfway between applicant 2's score of 680 and applicant 6's 690), a threshold that separates the two classes perfectly in a single cut.
Visualization and pruning on a larger dataset
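One way to reproduce this experiment; the seeds and the 70/30 stratified split are my assumptions, so exact figures may differ slightly from the output below:

```python
import matplotlib
matplotlib.use("Agg")   # headless backend for scripted runs
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree

iris = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42, stratify=iris.target)

clf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_tr, y_tr)
print(f"Depth-3 tree test accuracy: {clf.score(X_te, y_te):.2f}")

print("Feature importances:")
for name, imp in zip(iris.feature_names, clf.feature_importances_):
    print(f"  {name}: {imp:.4f}")

# Render the fitted tree to a file
fig, ax = plt.subplots(figsize=(12, 6))
plot_tree(clf, feature_names=iris.feature_names,
          class_names=iris.target_names, filled=True, ax=ax)
fig.savefig("iris_tree.png")

# Cost-complexity pruning: one tree per effective alpha, keep the best
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_tr, y_tr)
results = [(a, DecisionTreeClassifier(ccp_alpha=a, random_state=42)
               .fit(X_tr, y_tr).score(X_te, y_te)) for a in path.ccp_alphas]
best_alpha, best_acc = max(results, key=lambda t: t[1])
print(f"Best ccp_alpha: {best_alpha:.4f}")
print(f"Best test accuracy: {best_acc:.2f}")
print(f"Number of alpha values tested: {len(path.ccp_alphas)}")
```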
Expected output:
Depth-3 tree test accuracy: 1.00
Feature importances:
sepal length (cm): 0.0000
sepal width (cm): 0.0000
petal length (cm): 0.9346
petal width (cm): 0.0654
Best ccp_alpha: 0.0000
Best test accuracy: 1.00
Number of alpha values tested: 8
Petal length alone accounts for 93.5% of the total impurity reduction. The tree's first split, petal length <= 2.45, perfectly separates the setosa class from the other two in a single cut.
Regression tree example
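The original data behind the numbers below isn't shown, so this sketch uses hypothetical synthetic loan amounts; expect the same qualitative pattern (test error rising with depth) rather than the exact figures:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data: three loan-amount tiers driven by income, plus noise
rng = np.random.RandomState(42)
income = rng.uniform(25, 150, size=400)
tier = np.where(income < 50, 20_000, np.where(income < 100, 60_000, 120_000))
amount = tier + rng.normal(0, 15_000, size=400)
X = income.reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(X, amount, random_state=42)

for depth in (2, 5, 10):
    reg = DecisionTreeRegressor(max_depth=depth, random_state=42).fit(X_tr, y_tr)
    pred = reg.predict(X_te)
    rmse = mean_squared_error(y_te, pred) ** 0.5
    print(f"Depth {depth:>2} | Leaves: {reg.get_n_leaves():>3} "
          f"| RMSE: ${rmse:,.0f} | R²: {r2_score(y_te, pred):.3f}")
```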
Expected output:
Depth 2 | Leaves: 4 | RMSE: $3,255 | R²: 0.480
Depth 5 | Leaves: 26 | RMSE: $3,551 | R²: 0.382
Depth 10 | Leaves: 115 | RMSE: $3,639 | R²: 0.351
The depth-2 tree generalizes best here. Each added level grows more leaves than the data justifies: test RMSE climbs from $3,255 at depth 2 to $3,639 at depth 10 as the extra splits chase noise rather than signal. When deeper trees lose to a near-stump, that's overfitting showing up directly in the test metrics.
Production considerations
When deploying decision trees in real systems, keep these practical constraints in mind:
| Consideration | Single Decision Tree | Ensemble (RF / XGBoost) |
|---|---|---|
| Training time (100K rows) | < 1 second | 5 to 30 seconds |
| Inference time (per sample) | ~1 microsecond | ~50 microseconds |
| Model size on disk | < 1 MB | 10 to 500 MB |
| Interpretability | Full path trace | SHAP values needed |
| Memory at training time | $O(n)$ | $O(T \cdot n)$, where $T$ = number of trees |
| Handles concept drift | Retrain entirely | Retrain entirely |
Scaling notes:
- Scikit-learn's implementation is single-threaded for tree construction but parallelizes `predict()` via NumPy. For training parallelism, you need ensembles with `n_jobs=-1`.
- On datasets beyond 1M rows, a single tree still trains in seconds (assuming reasonable feature count). The bottleneck is the sorting step at each node.
- For real-time inference, a depth-10 tree needs about 10 comparisons per prediction, fast enough for sub-millisecond response times even in a Python web server.
Warning: Decision trees don't extrapolate. A regression tree trained on credit scores between 500 and 800 will predict the same constant value for a score of 900 as it does for 800: the rightmost leaf's mean. If your production data can contain values outside the training range, add explicit bounds checking or switch to a model that handles extrapolation (like linear regression).
Conclusion
Decision trees partition the feature space through recursive binary splitting, selecting each split to maximize purity, whether measured by entropy, Gini impurity, or variance reduction. The CART algorithm's greedy search evaluates every feature-threshold combination at every node, building a model you can explain to anyone with a flowchart.
The trade-off is clear and worth stating plainly: a single tree is unstable. Remove one loan applicant from training data and you might get a completely different tree. Pre-pruning parameters like max_depth and min_samples_leaf constrain this tendency, while post-pruning via cost-complexity analysis (the ccp_alpha parameter) provides a principled way to trim unnecessary branches.
These weaknesses are precisely what motivate ensemble methods. Random forests average hundreds of decorrelated trees to reduce variance. Gradient boosting trains shallow trees sequentially, each correcting the errors of the last. Both methods inherit every concept covered here (Gini splits, depth control, leaf constraints) and build on them. Understanding the single decision tree is what makes those ensemble methods interpretable rather than opaque. For the theory behind why ensembles fix single-tree instability, see The Bias-Variance Tradeoff.
Frequently Asked Interview Questions
Q: What's the difference between Gini impurity and entropy, and when would you choose one over the other?
Both measure node impurity and produce nearly identical trees in practice. Gini impurity is $1 - \sum p_i^2$; entropy is $-\sum p_i \log_2 p_i$. Entropy penalizes impure nodes slightly more aggressively due to the concavity of the log function, which can matter with highly skewed class distributions, while Gini is marginally cheaper to compute because it avoids logarithms. In my experience, I've never seen the choice change a model's real-world performance by more than a fraction of a percent.
Q: A decision tree achieved 100% training accuracy but only 60% test accuracy. What happened and how would you fix it?
The tree overfit. It grew deep enough to memorize the training data, including noise. Each leaf probably contains one or two samples. The fixes are regularization: set max_depth (start with 3 to 8), increase min_samples_leaf (try 5 to 20), or use post-pruning with ccp_alpha. Cross-validate to find the sweet spot. If accuracy is still too low after pruning, the signal might be too weak for a single tree; consider ensemble methods.
Q: Why are decision trees considered high-variance models?
A single decision tree is extremely sensitive to the training data. Removing or adding even one sample can change the root split, which cascades through the entire tree structure. This instability is the "variance" in the bias-variance tradeoff. The greedy splitting algorithm amplifies this: if the root split changes, every subsequent split changes too. Random forests fix this by training many trees on bootstrapped samples and averaging their predictions, which cancels out individual tree variance.
Q: How do decision trees handle categorical features?
In scikit-learn's CART implementation, you must encode categorical features as numbers (one-hot or ordinal encoding) before training. The tree then treats them as numeric and finds thresholds. Some implementations (LightGBM, CatBoost) handle categoricals natively by finding optimal groupings of categories. For ordinal categoricals (e.g., education level: high school < bachelor's < master's), ordinal encoding works well. For nominal categoricals (e.g., city names), one-hot encoding is safer but can inflate feature count.
Q: Explain the cost-complexity pruning parameter ccp_alpha. How do you select its value?
ccp_alpha adds a penalty of $\alpha$ per leaf to the tree's cost function: $R_\alpha(T) = R(T) + \alpha|\tilde{T}|$. When $\alpha = 0$, no pruning occurs. As $\alpha$ increases, branches whose impurity reduction is smaller than $\alpha$ get pruned. To find the optimal value, call cost_complexity_pruning_path() to get the full range of effective alphas, train a tree at each one, and pick the $\alpha$ that maximizes cross-validated accuracy. The "elbow" in the accuracy-vs-alpha curve is typically the right choice.
Q: Your decision tree feature importance shows that a noisy random feature ranks higher than domain-relevant features. What's going on?
MDI (Mean Decrease in Impurity) based feature importance is biased toward high-cardinality features. Features with many unique values get more candidate split points and more opportunities to reduce impurity by chance. A random continuous feature with many unique values can look important even though it carries no signal. The fix is permutation importance: shuffle the feature's values and measure how much accuracy drops. If it doesn't drop, the feature is noise regardless of what MDI says.
Q: Can decision trees capture feature interactions without explicit feature engineering?
Yes, and this is one of their biggest advantages over linear models. Each path from root to leaf implicitly encodes an interaction. If the tree splits on Credit Score > 700 at the root and then on Employment > 3 years inside the low-score branch, it's capturing the interaction "low credit score AND short employment leads to denial." Linear models need you to manually create interaction terms (e.g., $x_1 \times x_2$). Trees discover these automatically.
Q: How would you handle a 50-million-row dataset with a decision tree?
A single tree can still train on 50M rows, but you should use scikit-learn's HistGradientBoostingClassifier instead, which bins continuous features into 256 buckets and reduces per-split cost from $O(n \log n)$ to $O(n)$. If you specifically need a single interpretable tree, subsample the data (stratified to preserve class balance), train the tree on the subsample, then validate on the full dataset. Alternatively, use Spark MLlib's DecisionTreeClassifier, which distributes the sort-and-split computation across a cluster.
Hands-On Practice
Decision Trees are powerful because they model complex, non-linear relationships by breaking data down into simple, interpretable rules, much like a game of '20 Questions'. You'll build a Decision Tree Regressor from scratch to model factory efficiency based on temperature, visualizing exactly how the tree partitions the input space. We will use the Factory Efficiency dataset, which contains temperature and efficiency readings, making it perfect for seeing how a decision tree captures non-linear patterns that a simple linear line cannot.
Dataset: Factory Efficiency (Polynomial) Factory temperature vs efficiency data with a clear parabolic relationship. A straight line can't track the peak (linear regression scores R² ≈ 0.00, while a polynomial fit reaches R² ≈ 0.98), which makes it an ideal testbed for watching a decision tree approximate the same curve with piecewise-constant splits.
Try changing the max_depth parameter from 3 to 1, and then to 10, to observe how the model transitions from underfitting (too simple) to overfitting (too jagged/noisy). Experiment with min_samples_split to prevent nodes from splitting if they contain too few samples. These adjustments will help you intuitively grasp the trade-off between model complexity and generalization.
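As a starting point for the exercise, here is a compact from-scratch regressor on a single feature; the temperature/efficiency data below is a synthetic stand-in for the actual dataset:

```python
import numpy as np

def build_tree(x, y, depth=0, max_depth=3, min_samples_split=5):
    """Recursive regression tree on one feature: pick the threshold that
    minimizes total within-child variance; leaves predict the mean."""
    if depth >= max_depth or len(y) < min_samples_split or len(np.unique(x)) < 2:
        return {"leaf": float(y.mean())}
    best_thr, best_cost = None, np.inf
    for thr in np.unique(x)[:-1]:              # every value except the max
        left, right = y[x <= thr], y[x > thr]
        cost = len(left) * left.var() + len(right) * right.var()
        if cost < best_cost:
            best_thr, best_cost = thr, cost
    mask = x <= best_thr
    return {"thr": float(best_thr),
            "left": build_tree(x[mask], y[mask], depth + 1, max_depth, min_samples_split),
            "right": build_tree(x[~mask], y[~mask], depth + 1, max_depth, min_samples_split)}

def predict_one(node, xi):
    """Walk the nested dict from root to leaf."""
    while "leaf" not in node:
        node = node["left"] if xi <= node["thr"] else node["right"]
    return node["leaf"]

# Hypothetical stand-in for the factory data: efficiency peaks near 25 °C
rng = np.random.RandomState(0)
temp = rng.uniform(0, 50, 200)
eff = 90 - 0.15 * (temp - 25) ** 2 + rng.normal(0, 2, 200)

tree = build_tree(temp, eff)
print(f"efficiency near the 25 °C peak: {predict_one(tree, 25.0):.1f}")
print(f"efficiency at the 2 °C extreme: {predict_one(tree, 2.0):.1f}")
```

With `max_depth=3` the tree carves the temperature axis into at most eight constant steps, which is already enough to trace the parabola's rise and fall; raising or lowering `max_depth` reproduces the underfitting/overfitting transition described above.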