
Decision Trees: The Definitive Guide to Recursive Partitioning

LDS Team
Let's Data Science

A bank receives ten loan applications. Each applicant walks in with an income figure, a credit score, and years of employment. The loan officer needs a systematic rule (not gut feeling) to decide: approve or deny. A decision tree builds exactly that rule by asking one question at a time, choosing each question to separate approved applicants from denied ones as cleanly as possible.

Decision trees form the backbone of every tree-based model used in production today. Random forests, gradient boosting machines, and XGBoost all start here. If you don't understand how a single tree selects its splits, measures impurity, and controls its own complexity, the ensemble methods built on top will remain a black box.

This guide covers the full mechanics: splitting criteria (entropy and Gini impurity), the greedy search algorithm, stopping rules, pruning strategies, regression trees, and the practical trade-offs that determine when a decision tree is the right tool, and when it isn't.

Recursive binary splitting

A decision tree is a non-parametric supervised learning model that predicts a target variable by partitioning the feature space into axis-aligned rectangular regions. Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone formalized this in the CART (Classification and Regression Trees) algorithm in their 1984 monograph. Around the same time, Ross Quinlan introduced the entropy-based ID3 algorithm (1986), later extending it to C4.5 in 1993. Scikit-learn's decision tree module (v1.8, December 2025) implements an optimized version of CART.

The recursive splitting procedure works as follows:

  1. Start with the full dataset at a single node (the root).
  2. Evaluate every feature and every possible threshold within that feature.
  3. Select the split that produces the greatest reduction in impurity.
  4. Divide the data into two child nodes based on that split.
  5. Repeat steps 2 to 4 on each child node until a stopping condition is met.
  6. Assign a prediction to each terminal node (leaf): the majority class for classification, or the mean target value for regression.

This procedure is greedy: it picks the locally optimal split at each step without backtracking. It never reconsiders an earlier split in light of later ones. Hyafil and Rivest proved in 1976 that finding the globally optimal binary decision tree is NP-complete, which is why every practical implementation relies on greedy heuristics instead.

[Figure: Decision tree recursive splitting algorithm flowchart showing the greedy search process]

Tree anatomy

Every decision tree has three types of nodes:

| Node type | Role | Example (loan approval) |
|---|---|---|
| Root node | Contains the entire dataset; performs the first split | "Credit Score > 700?" |
| Internal nodes | Each tests a single feature against a threshold and routes samples left or right | "Employment > 3 years?" |
| Leaf nodes | Hold the final prediction; no further splitting | "Approved" or "Denied" |

Key Insight: Decision trees are white-box models. You can trace any prediction from root to leaf and explain exactly which feature thresholds led to the outcome. This transparency is why regulated industries (banking, healthcare, insurance) still rely on tree-based models when they need to justify decisions to auditors and regulators.

The loan approval dataset

To make every formula and concept concrete, here's the dataset we'll use from start to finish: 10 loan applicants.

| Applicant | Income | Credit Score | Employment (yrs) | Decision |
|---|---|---|---|---|
| 1 | $72k | 750 | 6 | Approved |
| 2 | $45k | 680 | 2 | Denied |
| 3 | $91k | 620 | 8 | Denied |
| 4 | $38k | 610 | 1 | Denied |
| 5 | $65k | 740 | 5 | Approved |
| 6 | $42k | 690 | 4 | Approved |
| 7 | $48k | 580 | 3 | Denied |
| 8 | $85k | 760 | 7 | Approved |
| 9 | $55k | 710 | 4 | Approved |
| 10 | $60k | 730 | 5 | Approved |

We have 6 Approved and 4 Denied outcomes. The tree must decide: which feature and which threshold should it split on first?

To answer that, it needs a way to measure disorder.

Shannon entropy as a splitting criterion

Entropy, introduced by Claude Shannon in his landmark 1948 paper "A Mathematical Theory of Communication," quantifies the average amount of information (in bits) needed to identify the class of a randomly selected sample. For a node $S$ containing $c$ classes:

$$H(S) = -\sum_{i=1}^{c} p_i \log_2(p_i)$$

Where:

  • $H(S)$ is the entropy of node $S$ (measured in bits)
  • $p_i$ is the proportion of samples belonging to class $i$
  • $c$ is the total number of distinct classes
  • $\log_2$ is the base-2 logarithm

In Plain English: Entropy measures how mixed a node is. If every loan applicant in the node got approved, entropy is 0, meaning there's no uncertainty. If half were approved and half denied, entropy hits its maximum of 1.0 bit; it's a coin flip. The tree's job is to find splits that drive entropy toward zero in the child nodes.

Computing entropy for the root node

The root contains 6 Approved and 4 Denied out of 10 total, so $p_{\text{approved}} = 0.6$ and $p_{\text{denied}} = 0.4$:

$$H(\text{root}) = -(0.6 \times \log_2 0.6 + 0.4 \times \log_2 0.4)$$

$$H(\text{root}) = -(0.6 \times (-0.737) + 0.4 \times (-1.322))$$

$$H(\text{root}) = 0.442 + 0.529 = 0.971 \text{ bits}$$

An entropy of 0.971, close to the theoretical maximum of 1.0 for two classes, confirms the root node is heavily mixed. Almost no useful signal yet.
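
As a sanity check, the root-entropy arithmetic above can be reproduced with a few lines of standard-library Python (a minimal sketch, not production code):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# Root node: 6 Approved ("A"), 4 Denied ("D")
root = ["A"] * 6 + ["D"] * 4
print(f"{entropy(root):.3f}")  # 0.971
```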

Information gain drives every split

Entropy measures disorder at a single node. Information gain measures how much a particular split reduces that disorder. It equals the parent entropy minus the weighted average entropy of the children:

$$IG(S, A) = H(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|} H(S_v)$$

Where:

  • $IG(S, A)$ is the information gain from splitting set $S$ on attribute $A$
  • $H(S)$ is the entropy of the parent node
  • $S_v$ is the subset of samples that take value $v$ after the split
  • $|S_v| / |S|$ is the proportion of samples routed to child $v$

In Plain English: Information gain tells you how many bits of uncertainty you remove by asking a specific question about the loan applicants. The tree evaluates every candidate question (every feature-threshold pair) and picks the one that removes the most uncertainty. Higher gain means a more informative split.

Evaluating candidate splits on the loan data

Let's work through three candidate splits and compare their information gain.

Candidate 1: Credit Score > 700

  • Left child (score > 700): applicants 1, 5, 8, 9, 10; all 5 Approved, 0 Denied. $H_{\text{left}} = 0$ (pure node).
  • Right child (score $\leq$ 700): applicants 2, 3, 4, 6, 7, with 1 Approved (applicant 6) and 4 Denied. $H_{\text{right}} = -(0.2 \times \log_2 0.2 + 0.8 \times \log_2 0.8) = 0.722$.

$$IG = 0.971 - \left(\frac{5}{10} \times 0 + \frac{5}{10} \times 0.722\right) = 0.971 - 0.361 = 0.610$$

Candidate 2: Income > $50k

  • Left child (income > $50k): applicants 1, 3, 5, 8, 9, 10, yielding 5 Approved and 1 Denied (applicant 3 earns $91k but has a credit score of only 620). $H_{\text{left}} = 0.650$.
  • Right child (income $\leq$ $50k): applicants 2, 4, 6, 7, with 1 Approved (applicant 6) and 3 Denied. $H_{\text{right}} = 0.811$.

$$IG = 0.971 - \left(\frac{6}{10} \times 0.650 + \frac{4}{10} \times 0.811\right) = 0.971 - 0.714 = 0.257$$

Candidate 3: Employment > 3 years

  • Left child (> 3 years): applicants 1, 3, 5, 6, 8, 9, 10; 6 Approved, 1 Denied (applicant 3 has 8 years of experience but was denied for poor credit). $H_{\text{left}} = 0.592$.
  • Right child ($\leq$ 3 years): applicants 2, 4, 7, all Denied (pure node). $H_{\text{right}} = 0$.

$$IG = 0.971 - \left(\frac{7}{10} \times 0.592 + \frac{3}{10} \times 0\right) = 0.971 - 0.414 = 0.557$$

Result: Credit Score > 700 wins with an information gain of 0.610, ahead of Employment > 3 years (0.557) and Income > $50k (0.257). The tree picks this as its root split.

| Candidate Split | Information Gain | Winner? |
|---|---|---|
| Credit Score > 700 | 0.610 | Yes |
| Employment > 3 years | 0.557 | |
| Income > $50k | 0.257 | |

The algorithm then recurses into each child node. The left child is already pure (all Approved), so it becomes a leaf. The right child (1 Approved, 4 Denied) gets split further on a different feature.
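
The three candidate computations can be verified in code. A sketch using the table's ten applicants; note that with unrounded entropies the income split evaluates to 0.256 rather than the 0.257 obtained from rounded intermediates:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(parent, left, right):
    """Parent entropy minus the weighted average entropy of the children."""
    n = len(parent)
    return entropy(parent) - (len(left) / n * entropy(left) + len(right) / n * entropy(right))

# (income_k, credit_score, employment_yrs, decision) for the 10 applicants
data = [(72, 750, 6, "A"), (45, 680, 2, "D"), (91, 620, 8, "D"), (38, 610, 1, "D"),
        (65, 740, 5, "A"), (42, 690, 4, "A"), (48, 580, 3, "D"), (85, 760, 7, "A"),
        (55, 710, 4, "A"), (60, 730, 5, "A")]
labels = [row[3] for row in data]

def split(feature, threshold):
    left = [row[3] for row in data if row[feature] > threshold]
    right = [row[3] for row in data if row[feature] <= threshold]
    return left, right

for name, feature, threshold in [("Credit Score > 700", 1, 700),
                                 ("Employment > 3 yrs", 2, 3),
                                 ("Income > $50k", 0, 50)]:
    left, right = split(feature, threshold)
    print(f"{name}: IG = {info_gain(labels, left, right):.3f}")
```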

[Figure: Decision tree for loan approval showing credit score as root split with information gain values]

Gini impurity

Gini impurity is the default splitting criterion in scikit-learn's DecisionTreeClassifier (as of version 1.8). It measures the probability that a randomly chosen sample from a node would be misclassified if labeled according to the class distribution at that node:

$$Gini(S) = 1 - \sum_{i=1}^{c} p_i^2$$

Where:

  • $Gini(S)$ is the Gini impurity of node $S$
  • $p_i$ is the proportion of samples belonging to class $i$
  • $c$ is the total number of classes

In Plain English: For our loan approval node, Gini impurity equals $2 \times p_{\text{approved}} \times p_{\text{denied}}$, which represents the probability that if you randomly pick two loan applicants from the node, they'd have different outcomes. A pure node (all approved or all denied) has Gini = 0. A perfectly balanced node has Gini = 0.5: maximum confusion.

Computing Gini for the root node

$$Gini(\text{root}) = 1 - (0.6^2 + 0.4^2) = 1 - (0.36 + 0.16) = 0.480$$

Gini reduction for Credit Score > 700

The left child (5 Approved, 0 Denied) has $Gini_{\text{left}} = 0$. The right child (1 Approved, 4 Denied) has $Gini_{\text{right}} = 1 - (0.2^2 + 0.8^2) = 0.320$.

$$\Delta Gini = 0.480 - \left(\frac{5}{10} \times 0 + \frac{5}{10} \times 0.320\right) = 0.480 - 0.160 = 0.320$$

The same split wins by Gini impurity. In practice, both metrics agree on the optimal split in the vast majority of cases.
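
A quick sketch of the same arithmetic in Python:

```python
def gini(counts):
    """Gini impurity from a list of per-class sample counts."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

root = gini([6, 4])  # 6 Approved, 4 Denied
weighted_children = 5 / 10 * gini([5, 0]) + 5 / 10 * gini([1, 4])
print(f"Gini(root) = {root:.3f}, reduction = {root - weighted_children:.3f}")
# Gini(root) = 0.480, reduction = 0.320
```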

Entropy vs Gini impurity in practice

Both metrics range from 0 (pure) to a maximum at a balanced class distribution. Here's how they compare:

| Property | Gini Impurity | Entropy |
|---|---|---|
| Formula | $1 - \sum p_i^2$ | $-\sum p_i \log_2 p_i$ |
| Max value (binary) | 0.5 | 1.0 |
| Computation cost | No log needed | Requires logarithm |
| Default in scikit-learn | Yes (criterion='gini') | No (set criterion='entropy') |
| Used by | CART | ID3, C4.5 |
| Sensitivity to skew | Slightly less | Slightly more |

Empirical studies consistently find that the two criteria produce nearly identical trees on real-world datasets. Breiman himself noted this in the original CART monograph. The choice rarely affects model performance by more than a fraction of a percent.

Pro Tip: Unless you have a specific theoretical reason to prefer entropy (e.g., you're working on information-theoretic problems or optimizing mutual information), stick with the default Gini. It trains marginally faster because it avoids logarithm calculations, and performance is equivalent.

Gain ratio: correcting for high-cardinality bias

One weakness of raw information gain is that it favors features with many unique values. A feature like "Applicant ID" (unique per row) would achieve perfect information gain by creating one pure leaf per sample, which is obviously useless for generalization.

Quinlan's C4.5 addresses this with the gain ratio, which normalizes information gain by the feature's intrinsic information:

$$GainRatio(S, A) = \frac{IG(S, A)}{IV(A)}$$

Where:

  • $IV(A) = -\sum_{v} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}$ is the intrinsic value (split information) of attribute $A$

In Plain English: Gain ratio penalizes features that split the loan applicants into many tiny groups. A feature that creates 10 groups of 1 applicant each has high intrinsic value (high split information), which deflates its gain ratio. This prevents the tree from choosing splits that are technically pure but practically useless.

Scikit-learn's CART implementation doesn't use gain ratio; it handles this bias through max_features and tree depth constraints instead. But you'll encounter gain ratio in interview questions and in implementations based on C4.5.
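
To see the penalty at work, compare the gain ratio of the credit-score split against a hypothetical "Applicant ID" feature (the ID feature is invented for illustration; the 0.610 and 0.971 gains come from the worked examples above):

```python
import math

def iv(sizes):
    """Intrinsic value (split information) from the child-node sizes."""
    n = sum(sizes)
    return -sum(s / n * math.log2(s / n) for s in sizes)

# Credit Score > 700: IG = 0.610, splits the 10 applicants into groups of 5 and 5
print(f"Credit split gain ratio: {0.610 / iv([5, 5]):.3f}")  # 0.610 / 1.000 = 0.610

# Applicant ID: IG = 0.971 (every child is pure) but ten singleton groups
print(f"ID split gain ratio:     {0.971 / iv([1] * 10):.3f}")  # 0.971 / 3.322 = 0.292
```

Despite its perfect information gain, the ID feature's gain ratio (0.292) falls well below the credit-score split's (0.610), so C4.5 would reject it.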

Decision trees for regression

When the target variable is continuous (e.g., predicting loan amount rather than approve/deny), the tree can't measure "class purity." Instead, it minimizes variance (equivalently, mean squared error) within each node. Scikit-learn's DecisionTreeRegressor uses squared_error as its default criterion.

The variance reduction for a split is:

$$\Delta Var = Var(\text{parent}) - \left(\frac{n_{\text{left}}}{n} \times Var(\text{left}) + \frac{n_{\text{right}}}{n} \times Var(\text{right})\right)$$

Where:

  • $Var(\text{parent})$ is the variance of the target in the parent node
  • $n_{\text{left}}, n_{\text{right}}$ are the sample counts in each child
  • $n$ is the total sample count in the parent

In Plain English: Imagine the tree is predicting loan amounts, ranging from $5,000 to $500,000. A good split groups similar amounts together, maybe separating the $5k to $50k loans from the $100k to $500k ones. Each leaf predicts the average loan amount of its members. The tree keeps splitting until each leaf covers a tight enough range.

Each leaf predicts the mean of the training samples that fall into it. This produces the characteristic staircase-shaped predictions of regression trees, where the prediction is constant within each rectangular partition of the feature space.
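
A tiny sketch (with invented numbers) makes the leaf-mean behavior concrete:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Two clusters of target values; a depth-1 stump should separate them
X = np.array([[1], [2], [3], [10], [11], [12]], dtype=float)
y = np.array([5.0, 6.0, 7.0, 50.0, 51.0, 52.0])

stump = DecisionTreeRegressor(max_depth=1).fit(X, y)
# Each leaf predicts the mean of its training samples: 6.0 and 51.0
print(stump.predict([[2.0], [11.0]]))
```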

The greedy search algorithm in detail

At every node, the algorithm runs an exhaustive search:

  1. For each feature $f$ in the dataset:
     • Sort the samples by $f$ (cost: $O(n \log n)$).
     • For each pair of adjacent distinct values, compute the midpoint as a candidate threshold.
     • Calculate the impurity reduction for that threshold.
  2. Select the (feature, threshold) pair that maximizes impurity reduction.
  3. Split and recurse.
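
The steps above can be sketched from scratch in pure Python using the entropy criterion. On the loan data, the exhaustive search lands on a credit-score threshold of 685 (the midpoint between the observed values 680 and 690), which happens to separate the two classes perfectly:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(X, y):
    """Exhaustive greedy search over every (feature, midpoint-threshold) pair."""
    parent, n = entropy(y), len(y)
    best = (None, None, -1.0)  # (feature index, threshold, information gain)
    for f in range(len(X[0])):
        order = sorted(range(n), key=lambda i: X[i][f])
        values = [X[i][f] for i in order]
        for j in range(n - 1):
            if values[j] == values[j + 1]:
                continue  # only adjacent *distinct* values yield a candidate
            threshold = (values[j] + values[j + 1]) / 2
            left = [y[i] for i in order[:j + 1]]
            right = [y[i] for i in order[j + 1:]]
            gain = parent - (len(left) / n * entropy(left)
                             + len(right) / n * entropy(right))
            if gain > best[2]:
                best = (f, threshold, gain)
    return best

# Loan data: income ($k), credit score, employment (yrs); 1 = Approved
X = [[72, 750, 6], [45, 680, 2], [91, 620, 8], [38, 610, 1], [65, 740, 5],
     [42, 690, 4], [48, 580, 3], [85, 760, 7], [55, 710, 4], [60, 730, 5]]
y = [1, 0, 0, 0, 1, 1, 0, 1, 1, 1]

feature, threshold, gain = best_split(X, y)
print(feature, threshold, round(gain, 3))  # 1 685.0 0.971
```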

Computational complexity

| Operation | Time Complexity | Notes |
|---|---|---|
| Building a balanced tree | $O(p \cdot n \log^2 n)$ | $p$ features, $n$ samples, sorting at each of $\log n$ levels |
| Building a worst-case tree | $O(p \cdot n^2)$ | Unbalanced tree with $n$ levels |
| Prediction (single sample) | $O(d)$ | $d$ = tree depth |
| Prediction ($m$ samples) | $O(m \cdot d)$ | Embarrassingly parallel |

Scikit-learn optimizes this with cached sort indices: features are sorted once at the root, and child nodes reuse the sorted order. This brings the practical cost closer to $O(p \cdot n \log n)$ per level.

Pro Tip: For datasets beyond 100K rows, consider HistGradientBoostingClassifier instead of growing a single deep tree. It bins continuous features into 256 discrete buckets, reducing split evaluation from $O(n)$ to $O(256)$ per feature, a histogram technique popularized by LightGBM.

Overfitting and why single trees need constraints

An unconstrained decision tree will grow until every leaf contains a single sample or is pure. On the training set, this produces 100% accuracy. On unseen data, it performs terribly because the tree has memorized noise rather than learning patterns.

This is the fundamental weakness of decision trees: their flexibility is also their liability. A tree with 500 leaves on a dataset with 500 samples has essentially memorized the training data. It's a lookup table, not a model.

The bias-variance tradeoff is stark with decision trees. An unconstrained tree has very low bias (it can fit almost anything) but extremely high variance (it changes drastically with small data perturbations). Regularization brings variance under control at the cost of slightly higher bias.

[Figure: Pre-pruning vs post-pruning strategies for controlling decision tree overfitting]

Pre-pruning (early stopping)

Pre-pruning stops the tree before it overfits. As of scikit-learn 1.8, DecisionTreeClassifier exposes these hyperparameters:

| Parameter | Default | Typical Range | Effect on Loan Tree |
|---|---|---|---|
| max_depth | None (unlimited) | 3 to 12 | Limits how many questions deep each decision path goes |
| min_samples_split | 2 | 5 to 50 | A node with fewer applicants than this won't be split further |
| min_samples_leaf | 1 | 3 to 20 | Each resulting group must contain at least this many applicants |
| max_leaf_nodes | None (unlimited) | 10 to 100 | Hard cap on total number of final decision groups |
| max_features | None (all) | 'sqrt', 'log2', int | Restricts how many features are evaluated per split |

Common Pitfall: Setting max_depth too low (e.g., 1 or 2) creates an underfitting stump that captures only the most obvious pattern. Setting it too high (e.g., 20+) gives the tree enough room to memorize noise. Start with max_depth between 3 and 8 for a baseline, then tune with cross-validation.

Post-pruning (cost-complexity pruning)

Post-pruning grows the full tree first, then trims branches that don't justify their complexity. The CART algorithm defines the cost-complexity measure:

$$R_\alpha(T) = R(T) + \alpha \cdot |T|$$

Where:

  • $R_\alpha(T)$ is the regularized cost of tree $T$
  • $R(T)$ is the total weighted impurity summed across all leaf nodes
  • $|T|$ is the number of leaves in the tree
  • $\alpha \geq 0$ is the complexity parameter (higher = more aggressive pruning)

In Plain English: Think of $\alpha$ as a "tax per leaf." Each leaf in the loan approval tree costs $\alpha$ units. If a branch improves impurity by less than its leaf tax, it gets pruned. Larger $\alpha$ means a heavier tax, which means fewer leaves survive. You pick the optimal $\alpha$ by finding the value where cross-validated accuracy peaks.

In scikit-learn, ccp_alpha controls this. The cost_complexity_pruning_path() method returns the sequence of effective alpha values, letting you search for the optimal value.

Native missing value support

As of scikit-learn 1.3 (released mid-2023, stable through 1.8), DecisionTreeClassifier and DecisionTreeRegressor handle missing values natively when using splitter='best'. You no longer need to impute NaN values before training.

The approach is straightforward: for each candidate threshold, the splitter evaluates two configurations (all missing values going left, and all missing values going right) and picks whichever yields greater impurity reduction. During prediction, samples with missing values follow whichever direction was learned at training time.

This works with criterion set to 'gini', 'entropy', or 'log_loss' for classification, and 'squared_error', 'friedman_mse', or 'poisson' for regression.

Pro Tip: Native NaN handling doesn't mean you should skip data cleaning entirely. It handles missing values gracefully, but if 40% of your credit scores are missing, the tree is working with much less signal than if you'd investigated why the data is missing and addressed the root cause. Check the missing data strategies guide for a systematic approach.

Feature importance and its limitations

After training, feature_importances_ gives you a ranked list of which features mattered most. Scikit-learn computes this using Mean Decrease in Impurity (MDI): for each feature, sum the total impurity reduction across every node where that feature is the splitter, weighted by sample count.

The weighted impurity decrease at node $t$:

$$\text{Importance}(t) = \frac{N_t}{N} \left( \text{impurity}_t - \frac{N_{t_L}}{N_t} \cdot \text{impurity}_{t_L} - \frac{N_{t_R}}{N_t} \cdot \text{impurity}_{t_R} \right)$$

Where:

  • $N$ is the total number of training samples
  • $N_t$ is the number of samples reaching node $t$
  • $N_{t_L}, N_{t_R}$ are the sample counts in the left and right children

In Plain English: In our loan approval tree, credit score contributed the biggest impurity reduction at the root split, so it gets the highest importance score. Employment years contributed at a lower level, and income might not appear at all. The importance values sum to 1.0 across all features.

The high-cardinality bias problem

MDI-based importance has a known flaw: it's biased toward features with many unique values. A feature with 100 distinct values offers 99 candidate thresholds, while a binary feature offers just 1. The high-cardinality feature gets more chances to look useful, even if it carries no real predictive signal.

The fix: use permutation importance as a complement. Permutation importance randomly shuffles a feature's values and measures how much accuracy drops. It's model-agnostic, unbiased, and available via sklearn.inspection.permutation_importance. For a deeper look at this, see feature selection.
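
A sketch of the bias on synthetic data: a deep tree memorizes label noise using a pure-noise feature, MDI credits that feature, and permutation importance on held-out data exposes it (all data here is invented for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 500
signal = rng.integers(0, 2, n)        # binary feature that drives the label
noise = rng.normal(size=n)            # continuous feature with no signal
X = np.column_stack([signal, noise])
y = (signal ^ (rng.random(n) < 0.1)).astype(int)  # signal with 10% label noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# MDI credits the noise feature for the deep splits that memorized label noise
print("MDI importances:        ", clf.feature_importances_.round(3))

# Permutation importance on held-out data shows the noise feature adds nothing
perm = permutation_importance(clf, X_te, y_te, n_repeats=20, random_state=0)
print("Permutation importances:", perm.importances_mean.round(3))
```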

When to use decision trees (and when not to)

Decision trees aren't always the right tool. Here's a decision framework:

[Figure: Decision framework flowchart for choosing when to use decision trees vs other models]

Use a single decision tree when:

  1. Interpretability is non-negotiable. Regulatory requirements (GDPR right to explanation, Fair Lending Act), stakeholder presentations, or clinical decision support all demand explainable models.
  2. You need a quick baseline. A depth-limited tree trains in seconds and reveals which features matter before you invest in complex models.
  3. Features are mixed types: numeric and categorical together, with potential non-linear interactions. Trees handle this without feature engineering.
  4. Dataset is small to medium, under 10K rows. Ensemble methods often don't have enough data to show their advantage over a well-tuned single tree.

Don't use a single decision tree when:

  1. Accuracy is the priority and interpretability is secondary. Random forests or gradient boosting will almost always outperform a single tree.
  2. Relationships are fundamentally linear. Logistic regression or linear regression will be more efficient and just as accurate with fewer parameters.
  3. Stability matters. Removing one loan applicant from training data can completely restructure the tree. If you need stable predictions, use ensembles.
  4. You have diagonal decision boundaries. A tree needs many axis-aligned splits to approximate a diagonal boundary. SVMs handle this natively.

Implementation in Python

The following code trains a DecisionTreeClassifier on our loan approval dataset, then demonstrates visualization, feature importance, and cost-complexity pruning on Iris for broader illustration.

Loan approval classification
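
The code listing itself didn't survive this page's conversion; here is a reconstruction (a sketch, with feature names such as Income_k inferred from the printed output below):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# The 10-applicant table: income ($k), credit score, employment (yrs)
X = np.array([[72, 750, 6], [45, 680, 2], [91, 620, 8], [38, 610, 1], [65, 740, 5],
              [42, 690, 4], [48, 580, 3], [85, 760, 7], [55, 710, 4], [60, 730, 5]])
y = np.array([1, 0, 0, 0, 1, 1, 0, 1, 1, 1])  # 1 = Approved, 0 = Denied
feature_names = ["Income_k", "Credit_Score", "Employment_yrs"]

clf = DecisionTreeClassifier(criterion="gini", random_state=42).fit(X, y)

print("Decision Tree Rules (Loan Approval):")
print(export_text(clf, feature_names=feature_names))

print("Feature Importances:")
for name, importance in zip(feature_names, clf.feature_importances_):
    print(f"  {name}: {importance:.4f}")

new_applicant = [[62, 715, 5]]
label = "Approved" if clf.predict(new_applicant)[0] == 1 else "Denied"
print("\nNew applicant (Income: $62k, Credit: 715, Employment: 5 yrs)")
print(f"Prediction: {label}")
```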

Expected output:

```text
Decision Tree Rules (Loan Approval):
|--- Credit_Score <= 685.00
|   |--- class: 0
|--- Credit_Score >  685.00
|   |--- class: 1

Feature Importances:
  Income_k: 0.0000
  Credit_Score: 1.0000
  Employment_yrs: 0.0000

New applicant (Income: $62k, Credit: 715, Employment: 5 yrs)
Prediction: Approved
```

Notice that the tree never uses Income or Employment: Credit Score alone carries enough signal to separate this dataset. Also note the threshold: the exhaustive search found 685, the midpoint between the observed scores 680 and 690, which splits the applicants perfectly, a slightly better cut than the round 700 we evaluated by hand.

Visualization and pruning on a larger dataset
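
This listing is also missing; a reconstruction under assumed settings (random_state=42, 30% stratified test split; the exact accuracies and alpha count depend on the split, so results may differ slightly from the printed output):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42, stratify=iris.target)

tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)
print(f"Depth-3 tree test accuracy: {tree.score(X_test, y_test):.2f}")

print("\nFeature importances:")
for name, importance in zip(iris.feature_names, tree.feature_importances_):
    print(f"  {name}: {importance:.4f}")

# Post-pruning: get the path from a full tree, then retrain at each effective alpha
full = DecisionTreeClassifier(random_state=42)
path = full.cost_complexity_pruning_path(X_train, y_train)
results = []
for alpha in path.ccp_alphas:
    alpha = max(alpha, 0.0)  # guard against tiny negative values from float error
    pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(X_train, y_train)
    results.append((alpha, pruned.score(X_test, y_test)))

best_alpha, best_score = max(results, key=lambda pair: pair[1])
print(f"\nBest ccp_alpha: {best_alpha:.4f}")
print(f"Best test accuracy: {best_score:.2f}")
print(f"Number of alpha values tested: {len(path.ccp_alphas)}")
```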

Expected output:

```text
Depth-3 tree test accuracy: 1.00

Feature importances:
  sepal length (cm): 0.0000
  sepal width (cm): 0.0000
  petal length (cm): 0.9346
  petal width (cm): 0.0654

Best ccp_alpha: 0.0000
Best test accuracy: 1.00
Number of alpha values tested: 8
```

Petal length alone accounts for 93.5% of the total impurity reduction. The tree's first split, petal length <= 2.45, perfectly separates the setosa class from the other two in a single cut.

Regression tree example
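
The listing is missing here as well. A sketch of the depth comparison on a synthetic stand-in (the dataset behind the printed numbers isn't shown, so the values will differ):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in: a simple dollar-amount signal buried in heavy noise
rng = np.random.default_rng(7)
X = rng.uniform(0, 1, size=(300, 1))
y = 10_000 * (X[:, 0] > 0.5) + rng.normal(0, 3_000, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for depth in [2, 5, 10]:
    reg = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    residuals = y_te - reg.predict(X_te)
    rmse = float(np.sqrt(np.mean(residuals ** 2)))
    print(f"Depth {depth:2d} | Leaves: {reg.get_n_leaves():3d} | "
          f"RMSE: ${rmse:,.0f} | R²: {reg.score(X_te, y_te):.3f}")
```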

Expected output:

```text
Depth  2 | Leaves:   4 | RMSE: $3,255 | R²: 0.480
Depth  5 | Leaves:  26 | RMSE: $3,551 | R²: 0.382
Depth 10 | Leaves: 115 | RMSE: $3,639 | R²: 0.351
```

On this data the depth-2 tree generalizes best: test RMSE is lowest and R² highest at depth 2, and both degrade as the tree deepens. The extra leaves at depths 5 and 10 fit training noise rather than signal, so test error worsens, the textbook overfitting pattern.

Production considerations

When deploying decision trees in real systems, keep these practical constraints in mind:

| Consideration | Single Decision Tree | Ensemble (RF / XGBoost) |
|---|---|---|
| Training time (100K rows) | < 1 second | 5 to 30 seconds |
| Inference time (per sample) | ~1 microsecond | ~50 microseconds |
| Model size on disk | < 1 MB | 10 to 500 MB |
| Interpretability | Full path trace | SHAP values needed |
| Memory at training time | $O(n \cdot p)$ | $O(n \cdot p \cdot T)$ where $T$ = number of trees |
| Handles concept drift | Retrain entirely | Retrain entirely |

Scaling notes:

  • Scikit-learn's implementation is single-threaded for tree construction but parallelizes predict() via NumPy. For training parallelism, you need ensembles with n_jobs=-1.
  • On datasets beyond 1M rows, a single tree still trains in seconds (assuming reasonable feature count). The bottleneck is the sorting step at each node.
  • For real-time inference, a depth-10 tree needs about 10 comparisons per prediction, fast enough for sub-millisecond response times even in a Python web server.

Warning: Decision trees don't extrapolate. A regression tree trained on credit scores between 500 and 800 will predict the same constant value for a score of 900 as it does for 800: the rightmost leaf's mean. If your production data can contain values outside the training range, add explicit bounds checking or switch to a model that handles extrapolation (like linear regression).
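
A quick sketch of that constant-extrapolation behavior (toy numbers):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Train on credit scores in the 500-800 range only
X = np.array([[500.0], [600.0], [700.0], [800.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])

reg = DecisionTreeRegressor().fit(X, y)
# Everything past the training range falls into the rightmost leaf
print(reg.predict([[800.0], [900.0], [10_000.0]]))  # [4. 4. 4.]
```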

Conclusion

Decision trees partition the feature space through recursive binary splitting, selecting each split to maximize purity, whether measured by entropy, Gini impurity, or variance reduction. The CART algorithm's greedy search evaluates every feature-threshold combination at every node, building a model you can explain to anyone with a flowchart.

The trade-off is clear and worth stating plainly: a single tree is unstable. Remove one loan applicant from training data and you might get a completely different tree. Pre-pruning parameters like max_depth and min_samples_leaf constrain this tendency, while post-pruning via cost-complexity analysis (the ccp_alpha parameter) provides a principled way to trim unnecessary branches.

These weaknesses are precisely what motivate ensemble methods. Random forests average hundreds of decorrelated trees to reduce variance. Gradient boosting trains shallow trees sequentially, each correcting the errors of the last. Both methods inherit every concept covered here (Gini splits, depth control, leaf constraints) and build on them. Understanding the single decision tree is what makes those ensemble methods interpretable rather than opaque. For the theory behind why ensembles fix single-tree instability, see The Bias-Variance Tradeoff.

Frequently Asked Interview Questions

Q: What's the difference between Gini impurity and entropy, and when would you choose one over the other?

Both measure node impurity and produce nearly identical trees in practice. Gini impurity ($1 - \sum p_i^2$) is computationally cheaper because it avoids logarithms, which is why it's scikit-learn's default. Entropy ($-\sum p_i \log_2 p_i$) penalizes impure nodes slightly more aggressively due to the concavity of the log function, which can matter with highly skewed class distributions. In my experience, I've never seen the choice change a model's real-world performance by more than a fraction of a percent.

Q: A decision tree achieved 100% training accuracy but only 60% test accuracy. What happened and how would you fix it?

The tree overfit. It grew deep enough to memorize the training data, including noise. Each leaf probably contains one or two samples. The fixes are regularization: set max_depth (start with 3 to 8), increase min_samples_leaf (try 5 to 20), or use post-pruning with ccp_alpha. Cross-validate to find the sweet spot. If accuracy is still too low after pruning, the signal might be too weak for a single tree; consider ensemble methods.

Q: Why are decision trees considered high-variance models?

A single decision tree is extremely sensitive to the training data. Removing or adding even one sample can change the root split, which cascades through the entire tree structure. This instability is the "variance" in the bias-variance tradeoff. The greedy splitting algorithm amplifies this: if the root split changes, every subsequent split changes too. Random forests fix this by training many trees on bootstrapped samples and averaging their predictions, which cancels out individual tree variance.

Q: How do decision trees handle categorical features?

In scikit-learn's CART implementation, you must encode categorical features as numbers (one-hot or ordinal encoding) before training. The tree then treats them as numeric and finds thresholds. Some implementations (LightGBM, CatBoost) handle categoricals natively by finding optimal groupings of categories. For ordinal categoricals (e.g., education level: high school < bachelor's < master's), ordinal encoding works well. For nominal categoricals (e.g., city names), one-hot encoding is safer but can inflate feature count.

Q: Explain the cost-complexity pruning parameter ccp_alpha. How do you select its value?

ccp_alpha adds a penalty of $\alpha$ per leaf to the tree's cost function: $R_\alpha(T) = R(T) + \alpha |T|$. When $\alpha = 0$, no pruning occurs. As $\alpha$ increases, leaves whose impurity reduction is smaller than $\alpha$ get pruned. To find the optimal value, call cost_complexity_pruning_path() to get the full range of effective alphas, train a tree at each one, and pick the $\alpha$ that maximizes cross-validated accuracy. The "elbow" in the accuracy-vs-alpha curve is typically the right choice.

Q: Your decision tree feature importance shows that a noisy random feature ranks higher than domain-relevant features. What's going on?

MDI (Mean Decrease in Impurity) based feature importance is biased toward high-cardinality features. Features with many unique values get more candidate split points and more opportunities to reduce impurity by chance. A random continuous feature with many unique values can look important even though it carries no signal. The fix is permutation importance: shuffle the feature's values and measure how much accuracy drops. If it doesn't drop, the feature is noise regardless of what MDI says.

Q: Can decision trees capture feature interactions without explicit feature engineering?

Yes, and this is one of their biggest advantages over linear models. Each path from root to leaf implicitly encodes an interaction. If the tree splits on Credit Score > 700 at the root and then on Employment > 3 years in the right child, it's capturing the interaction "low credit score AND low employment leads to denial." Linear models need you to manually create interaction terms ($\text{CreditScore} \times \text{Employment}$). Trees discover these automatically.

Q: How would you handle a 50-million-row dataset with a decision tree?

A single tree can still train on 50M rows, but you should use scikit-learn's HistGradientBoostingClassifier instead, which bins continuous features into 256 buckets and reduces per-split cost from $O(n)$ to $O(256)$. If you specifically need a single interpretable tree, subsample the data (stratified to preserve class balance), train the tree on the subsample, then validate on the full dataset. Alternatively, use Spark MLlib's DecisionTreeClassifier, which distributes the sort-and-split computation across a cluster.

Hands-On Practice

Decision Trees are powerful because they model complex, non-linear relationships by breaking data down into simple, interpretable rules, much like a game of '20 Questions'. You'll build a Decision Tree Regressor from scratch to model factory efficiency based on temperature, visualizing exactly how the tree partitions the input space. We will use the Factory Efficiency dataset, which contains temperature and efficiency readings, making it perfect for seeing how a decision tree captures non-linear patterns that a simple linear line cannot.

Dataset: Factory Efficiency (Polynomial). Factory temperature vs efficiency data with a clear parabolic relationship. Linear regression fails on it (R² ≈ 0.00), which makes it a perfect dataset for seeing how a decision tree captures non-linear structure that a straight line cannot.

Try changing the max_depth parameter from 3 to 1, and then to 10, to observe how the model transitions from underfitting (too simple) to overfitting (too jagged/noisy). Experiment with min_samples_split to prevent nodes from splitting if they contain too few samples. These adjustments will help you intuitively grasp the trade-off between model complexity and generalization.
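
The interactive dataset isn't included on this page; a standalone sketch with a synthetic parabolic stand-in shows the same underfit-to-overfit transition in the training fit:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the Factory Efficiency data: parabolic temp-efficiency curve
rng = np.random.default_rng(42)
temp = rng.uniform(0, 100, size=(200, 1))
eff = 90 - 0.03 * (temp.ravel() - 50) ** 2 + rng.normal(0, 3, size=200)

for depth in [1, 3, 10]:
    model = DecisionTreeRegressor(max_depth=depth).fit(temp, eff)
    print(f"max_depth={depth}: training R² = {model.score(temp, eff):.3f}")
```

The depth-1 stump barely tracks the curve, depth 3 follows it closely, and depth 10 chases individual noisy points; comparing against a held-out split would show the depth-10 tree's advantage is illusory.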

Explore all career paths