Linear Regression: The Comprehensive Guide to Predictive Modeling

LDS Team · Let's Data Science · 10 min read

<!-- slug: linear-regression-the-comprehensive-guide-to-predictive-modeling --> <!-- excerpt: Master linear regression from equation to Python code. OLS, gradient descent, LINE assumptions, residual diagnostics, and a complete house-price model. -->

A $350,000 house just sold on your street. The one next door, 200 square feet larger, closed at $383,000. Linear regression is the algorithm that turns those two data points into a rule: each additional square foot adds roughly $165 to the price. That rule — a slope, an intercept, and an error term — has been powering predictions in economics, medicine, and engineering since Legendre published the method of least squares in 1805.

Dismissing linear regression as a beginner's exercise is a mistake. Major banks, pharmaceutical firms, and regulatory agencies rely on it daily — not because they can't afford neural networks, but because linear regression gives you something deep learning can't: a coefficient you can read aloud in a courtroom, defend to a regulator, and explain to a product manager over coffee. According to the scikit-learn documentation, LinearRegression remains one of the most-used estimators in the library as of version 1.8.

Every concept in this guide maps to a single running example: predicting house prices from square footage and bedroom count. The same dataset threads through every formula, every code block, and every residual plot, so the math never feels disconnected from reality.

[Figure: Linear regression model fitting pipeline from data to predictions]

The regression equation

Linear regression models the relationship between one or more input features and a continuous output. For a single feature, the equation is the slope-intercept form you've known since algebra class:

y = \beta_0 + \beta_1 x + \varepsilon

Where:

  • y is the dependent variable (the value you predict — house sale price)
  • x is the independent variable (the feature you measure — square footage)
  • \beta_0 is the intercept, the predicted value of y when x = 0
  • \beta_1 is the slope coefficient, how much y changes per one-unit increase in x
  • \varepsilon is the error term, random noise the model cannot explain

In Plain English: The predicted price equals a baseline of $50,000 (representing land value when square footage is theoretically zero), plus $167 for every square foot, plus random variation from renovations, neighborhood quirks, and market timing. That sentence is the model.

Here's each symbol mapped to our house-price example:

| Symbol | Name | House-Price Meaning |
| --- | --- | --- |
| y | Dependent variable | Sale price ($350,000) |
| x | Independent variable | Square footage (1,800 sq ft) |
| \beta_0 | Intercept | $50,000 baseline (land value alone) |
| \beta_1 | Slope coefficient | $167 per additional square foot |
| \varepsilon | Error term | Variation from renovations, school district, market timing |
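The fitted line is just arithmetic once the coefficients are known. A minimal sketch, using the example's coefficients (the function name `predict_price` is ours, not the article's):

```python
# Hypothetical coefficients taken from the running house-price example
beta_0 = 50_000   # intercept: baseline land value in dollars
beta_1 = 167      # slope: dollars per additional square foot

def predict_price(sqft):
    """Apply the fitted line: price = beta_0 + beta_1 * sqft."""
    return beta_0 + beta_1 * sqft

print(predict_price(1_800))  # 50,000 + 167 * 1,800 = 350,600
```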

Pro Tip: The error term isn't a flaw. It explicitly acknowledges that no single feature perfectly determines the outcome. Honesty about uncertainty is one of linear regression's greatest strengths — and one reason regulators trust it.

The cost function that drives optimization

The algorithm can't eyeball a line through a scatter plot. It needs a numerical score that quantifies how wrong a candidate line is. That score is the Mean Squared Error (MSE):

\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2

Where:

  • n is the number of observations (houses in our dataset)
  • y_i is the actual price of house i
  • \hat{y}_i is the predicted price of house i
  • (y_i - \hat{y}_i) is the residual — the gap between reality and prediction

In Plain English: For each house, measure the gap between the actual price and what the model predicted. Square that gap (so negative and positive errors both count, and big misses hurt disproportionately). Then average everything. A $100,000 miss contributes 10,000 times more to the cost than a $1,000 miss — MSE punishes large errors hard.
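In code, the recipe above is three lines. A sketch with made-up prices (the specific numbers are assumptions for illustration):

```python
import numpy as np

actual = np.array([350_000, 383_000, 425_000])     # observed sale prices
predicted = np.array([348_000, 390_000, 420_000])  # a candidate line's predictions

residuals = actual - predicted        # gaps: 2,000, -7,000, 5,000
mse = np.mean(residuals ** 2)         # square (big misses hurt more), then average
print(mse)                            # 26,000,000.0
```

Note how the single $7,000 miss contributes 49 million to the sum, dwarfing the $2,000 miss's 4 million.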

The algorithm's job is simple: find the specific values of \beta_0 and \beta_1 that produce the lowest possible MSE. Two classical methods accomplish this.

Ordinary Least Squares and the normal equation

Ordinary Least Squares (OLS) is a closed-form solution — it calculates the optimal coefficients in a single matrix operation, with no iterations and no hyperparameters.

Stack all n observations into a matrix X (with a column of ones for the intercept) and a vector \mathbf{y}:

\mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}

The sum of squared residuals is S(\boldsymbol{\beta}) = (\mathbf{y} - X\boldsymbol{\beta})^\top(\mathbf{y} - X\boldsymbol{\beta}). Take the derivative with respect to \boldsymbol{\beta}, set it to zero, and solve:

\frac{\partial S}{\partial \boldsymbol{\beta}} = -2X^\top\mathbf{y} + 2X^\top X\boldsymbol{\beta} = 0

Rearranging gives the normal equation:

\hat{\boldsymbol{\beta}} = (X^\top X)^{-1}X^\top\mathbf{y}

Where:

  • X^\top is the transpose of the feature matrix
  • (X^\top X)^{-1} is the inverse of the Gram matrix
  • \mathbf{y} is the vector of observed target values
  • \hat{\boldsymbol{\beta}} is the vector of optimal coefficients

In Plain English: Multiply the transpose of the feature matrix by itself, invert the result, then multiply by the transpose times the target vector. Out comes the exact coefficient vector that minimizes squared errors. For our house data, this produces \beta_0 \approx 50,000 and \beta_1 \approx 167 in one shot.
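The normal equation is a few lines of NumPy. A sketch on a noiseless toy dataset (the five house sizes are assumptions chosen to follow the example's true line), using `np.linalg.solve` rather than an explicit inverse for numerical stability:

```python
import numpy as np

# Five toy houses lying exactly on the example's true line (no noise)
sqft = np.array([1200.0, 1500.0, 1800.0, 2100.0, 2400.0])
y = 50_000 + 167 * sqft

# Design matrix X: a column of ones for the intercept, then the feature
X = np.column_stack([np.ones_like(sqft), sqft])

# Normal equation: beta_hat = (X^T X)^{-1} X^T y.
# solve() is more stable than forming the inverse explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # approximately [50000., 167.]
```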

The Gauss-Markov theorem guarantees that when the standard regression assumptions hold, OLS produces the Best Linear Unbiased Estimator (BLUE): among all linear, unbiased estimators, OLS has the smallest variance. That's a strong guarantee — and one reason OLS has survived over two centuries.

When OLS breaks down

The matrix X^\top X must be invertible. If two features are perfectly correlated (perfect multicollinearity), the matrix is singular and no inverse exists. Even near-singular matrices produce wildly unstable coefficients — tiny data changes cause huge swings in \beta values.

OLS also hits computational limits when the feature count p grows large. Inverting a (p+1) \times (p+1) matrix scales as O(p^3), which becomes painful above a few thousand features.

Gradient descent as an iterative alternative

When your dataset has millions of rows or thousands of features, gradient descent offers a scalable alternative to OLS. The intuition: imagine standing on a hillside in dense fog. You can't see the valley floor, but you can feel which direction slopes downward. Take a small step that way, reassess, repeat.

[Figure: Gradient descent optimization flow for linear regression cost function]

Formally, the update rule for each parameter is:

\beta_j \leftarrow \beta_j - \alpha \frac{\partial}{\partial \beta_j}\text{MSE}

Where:

  • \beta_j is the current value of coefficient j
  • \alpha is the learning rate, a hyperparameter controlling step size
  • \frac{\partial}{\partial \beta_j}\text{MSE} is the partial derivative of the cost function with respect to \beta_j

In Plain English: Each iteration nudges the house-price coefficients a little closer to the minimum-error line. If the slope \beta_1 is too high (overpredicting expensive homes), the gradient with respect to \beta_1 is positive, so the update shrinks \beta_1. Repeat until the changes become negligible.

For simple linear regression, the partial derivatives expand to:

\beta_0 \leftarrow \beta_0 - \alpha \cdot \frac{2}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)

\beta_1 \leftarrow \beta_1 - \alpha \cdot \frac{2}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i) \cdot x_i
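These two update rules translate directly into a loop. A minimal sketch on a standardized toy feature (the data and learning rate are assumptions; a standardized feature is used so one fixed learning rate suits both parameters):

```python
import numpy as np

def gradient_descent(x, y, alpha=0.01, iters=10_000):
    """Batch gradient descent for simple linear regression (a sketch)."""
    b0, b1 = 0.0, 0.0
    n = len(x)
    for _ in range(iters):
        y_hat = b0 + b1 * x
        b0 -= alpha * (2 / n) * np.sum(y_hat - y)        # intercept update
        b1 -= alpha * (2 / n) * np.sum((y_hat - y) * x)  # slope update
    return b0, b1

# Standardized toy feature; true intercept 3, true slope 2
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = 3.0 + 2.0 * x
b0, b1 = gradient_descent(x, y)
print(b0, b1)  # converges to approximately 3.0, 2.0
```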

Too large a learning rate and the algorithm overshoots the minimum — the cost actually increases. Too small and convergence takes thousands of unnecessary iterations. In practice, learning rate schedules or adaptive optimizers like Adam handle this automatically.

Batch, mini-batch, and stochastic variants

Batch gradient descent computes the gradient over the entire dataset each iteration. Exact but slow for large n.

Stochastic gradient descent (SGD) updates on a single random sample. Noisy but fast — scikit-learn's SGDRegressor handles datasets that don't fit in memory.

Mini-batch gradient descent splits the difference: compute gradients on random subsets of 32-256 samples. This is the standard in deep learning and works well for large-scale regression too.

| Method | Computation per step | Convergence | Best for |
| --- | --- | --- | --- |
| OLS (Normal Equation) | O(np^2 + p^3) once | Exact in one step | n < 100,000, p < 1,000 |
| Batch Gradient Descent | O(np) per iteration | Smooth, deterministic | Medium datasets, convex problems |
| SGD | O(p) per iteration | Noisy, fast | n > 1,000,000, streaming data |
| Mini-Batch | O(bp) per iteration | Balanced | Deep learning, GPU pipelines |
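The mini-batch variant can be sketched in a few lines of NumPy (the batch size, learning rate, and synthetic data are assumptions; scikit-learn's SGDRegressor would be the production route):

```python
import numpy as np

rng = np.random.default_rng(0)

def minibatch_gd(X, y, alpha=0.05, batch_size=32, epochs=200):
    """Mini-batch gradient descent: each step uses a random subset of rows."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(epochs):
        order = rng.permutation(n)                 # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            grad = (2 / len(idx)) * X[idx].T @ (X[idx] @ beta - y[idx])
            beta -= alpha * grad
    return beta

# Synthetic standardized features with known coefficients [1.0, -2.0]
X = rng.standard_normal((1_000, 2))
y = X @ np.array([1.0, -2.0])
beta = minibatch_gd(X, y)
print(beta)  # close to [1.0, -2.0]
```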

Key Insight: For linear regression, the cost function is convex — it has a single global minimum. Gradient descent is guaranteed to find it (given a small enough learning rate). This is not true for neural networks, where local minima and saddle points complicate optimization.

The four regression assumptions (LINE)

Linear regression isn't a black box that works on any dataset. Its statistical guarantees — unbiased coefficients, valid p-values, correct confidence intervals — depend on four assumptions. The mnemonic LINE makes them easy to remember:

[Figure: The four LINE assumptions for valid linear regression]

L — Linearity. The true relationship between x and y must be approximately linear. If the scatter plot of house prices versus square footage curves upward (luxury homes appreciating nonlinearly), a straight line will systematically underpredict at both ends and overpredict in the middle. Check this with a residuals-vs-fitted plot.

I — Independence. Each observation must be independent. This assumption is routinely violated in time-series data (yesterday's stock price influences today's) and in clustered data (multiple houses from the same neighborhood sharing unmeasured confounders like school quality). Violation inflates the apparent precision of your estimates.

N — Normality of residuals. The residuals (not the raw data) should follow an approximately normal distribution centered at zero. This matters most for hypothesis testing and confidence intervals. With large samples (n > 30), the Central Limit Theorem kicks in and makes coefficient estimates approximately normal regardless, so this assumption becomes less critical for pure prediction.

E — Equal variance (Homoscedasticity). The spread of residuals should stay constant across all predicted values. In house prices, if the model predicts $200,000 homes with $10,000 errors but $800,000 homes with $80,000 errors, the residuals "fan out" into a cone. This pattern — heteroscedasticity — doesn't bias the coefficients, but it invalidates standard errors, p-values, and confidence intervals.

Pro Tip: Assumption violations don't always mean you must abandon linear regression. Weighted Least Squares corrects heteroscedasticity. Log-transforming the target often stabilizes variance. Adding polynomial features or interaction terms can address mild nonlinearity. Know your remedies before you switch models.

Diagnosing violations with residual plots

Checking assumptions requires plotting residuals, not staring at summary statistics. Three plots reveal almost everything you need to know.

Residuals vs. Fitted Values. Plot \hat{y} on the x-axis and residuals (y - \hat{y}) on the y-axis. A healthy model produces a horizontal band of random scatter around zero. A U-shape signals nonlinearity. A funnel (narrow left, wide right) signals heteroscedasticity.

Q-Q Plot (Quantile-Quantile). Plot theoretical normal quantiles against residual quantiles. Points hugging the diagonal confirm approximate normality. Heavy tails (points curving away at the ends) indicate outliers or a non-normal error distribution.

Residuals vs. Order. For time-ordered data, plot residuals in collection order. A wave-like pattern means autocorrelation (independence violation). The Durbin-Watson test formalizes this: values near 2.0 mean no autocorrelation, below 1.5 or above 2.5 flags a problem.
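Code along the following lines would produce the diagnostics above. The synthetic dataset here is an assumption chosen to echo the running example, so its numbers will track, not exactly match, the output quoted below:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic houses: true line 50,000 + 167*sqft, noise std 20,000 (assumptions)
sqft = rng.uniform(800, 3500, 200)
price = 50_000 + 167 * sqft + rng.normal(0, 20_000, 200)

# Fit by least squares, then compute the residuals the three plots would use
X = np.column_stack([np.ones_like(sqft), sqft])
beta, *_ = np.linalg.lstsq(X, price, rcond=None)
residuals = price - X @ beta

print(f"Intercept: {beta[0]:,.0f}, Slope: {beta[1]:.2f}")
print(f"Residual std: {residuals.std(ddof=2):,.0f}")

# For the plots themselves (matplotlib/scipy assumed available):
#   plt.scatter(X @ beta, residuals)   -> residuals vs. fitted
#   scipy.stats.probplot(residuals)    -> Q-Q plot
#   plt.plot(residuals)                -> residuals vs. order
```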

Expected output:

```text
Intercept: 52,567, Slope: 166.42
Residual std: 19,327
```

The residuals-vs-fitted plot shows a roughly horizontal band (good — no nonlinearity, and no funnel, so variance looks equal). The Q-Q plot follows the diagonal closely (good — approximately normal residuals). The residuals-vs-order plot shows no wave pattern (good — observations are independent). Every LINE assumption we can check visually looks healthy.

Multiple linear regression

Real estate agents don't price homes on square footage alone. Bedrooms, bathrooms, lot size, year built, and school district ratings all matter. Multiple linear regression extends the model to p features:

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon

Where:

  • \beta_0 is the intercept (baseline price when all features are zero)
  • \beta_j is the coefficient for feature x_j
  • p is the total number of features

In Plain English: Each coefficient tells you how much the house price changes for a one-unit increase in that feature, holding all other features constant. If \beta_{\text{bedrooms}} = 15,000, that means adding one bedroom increases the predicted price by $15,000 after controlling for square footage, lot size, and everything else in the model.

That "holding constant" clause is what makes multiple regression so powerful for causal reasoning in observational data. It's also what makes it tricky — you're trusting that you've included all relevant confounders.

The OLS solution generalizes naturally: the same normal equation \hat{\boldsymbol{\beta}} = (X^\top X)^{-1}X^\top\mathbf{y} applies, with X now an n \times (p+1) matrix.

R-squared and adjusted R-squared

R-squared (R^2) measures the proportion of variance in the target explained by the model:

R^2 = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}

Where:

  • \sum(y_i - \hat{y}_i)^2 is the sum of squared residuals (unexplained variance)
  • \sum(y_i - \bar{y})^2 is the total sum of squares (total variance)
  • \bar{y} is the mean of observed values

In Plain English: An R^2 of 0.87 means the model explains 87% of the variance in house prices. The remaining 13% comes from factors not captured by the features — maybe the house has a pool, or it's on a noisy street.

The problem with R^2: it never decreases when you add another feature, even a completely irrelevant one like the seller's shoe size. Adjusted R-squared penalizes model complexity:

R^2_{\text{adj}} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}

Where:

  • n is the number of observations
  • p is the number of predictors
  • R^2 is the ordinary R-squared

In Plain English: If adding "number of bathrooms" to the house-price model bumps R^2 from 0.87 to 0.88 but adjusted R^2 drops from 0.865 to 0.860, that feature isn't pulling its weight. The complexity penalty outweighs the marginal improvement.

Always use adjusted R^2 when comparing models with different numbers of features. For a deeper treatment of evaluation metrics, see Why 99% Accuracy Can Be a Disaster.
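The formula is one line of Python. Whether a given R^2 bump survives the penalty depends on n and p; the sample size of n = 50 below is an assumption for illustration:

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared: discount R^2 by the number of predictors p."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# A new feature bumps R^2 from 0.87 to 0.88. Worth it? (n = 50 assumed)
before = adjusted_r2(0.87, n=50, p=3)   # ~0.8615
after = adjusted_r2(0.88, n=50, p=4)    # ~0.8693 — here the gain survives the penalty
print(before, after)
```

With a much smaller sample or many more predictors, the same 0.01 bump in R^2 can make the adjusted value fall instead of rise.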

Common Pitfall: A high R^2 doesn't guarantee a valid model. A parabolic relationship forced through a straight line can yield R^2 = 0.80 while systematically mispredicting at the extremes. Always pair R^2 with residual plots.

Multicollinearity and the Variance Inflation Factor

Multicollinearity occurs when two or more features are highly correlated. In our house-price example, total square footage and number of rooms are strongly correlated — larger homes tend to have more rooms. When the model tries to separate their individual effects, the coefficients become unstable: small data changes produce large swings in estimated \beta values, and standard errors inflate dramatically.

The Variance Inflation Factor (VIF) quantifies the severity:

\text{VIF}_j = \frac{1}{1 - R_j^2}

Where:

  • R_j^2 is the R-squared from regressing feature x_j on all other features
  • A VIF of 1 means zero correlation with other features

In Plain English: A VIF of 8 for square footage means that 87.5% of the variation in square footage can be predicted from the other features. The coefficient's variance is inflated 8x compared to what it would be without collinearity. In our house model, that means the $167/sqft estimate becomes unreliable.

According to Penn State's STAT 462 course, a VIF above 5 warrants investigation and above 10 indicates severe multicollinearity requiring action.

Three remedies for multicollinearity:

  1. Drop one correlated feature. If total square footage and room count both have high VIFs, keep the one with stronger domain relevance (probably square footage).
  2. Combine correlated features. Create a "size index" that merges square footage and room count into a single variable using PCA or domain logic.
  3. Use regularization. Ridge regression (L2 penalty) shrinks correlated coefficients toward each other, stabilizing estimates without dropping features.
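The VIF definition can be computed directly from its formula, feature by feature. A sketch in plain NumPy (statsmodels' `variance_inflation_factor` is the usual shortcut; the synthetic correlated features here are assumptions):

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    column j on all the other columns (plus an intercept)."""
    n, p = X.shape
    out = []
    for j in range(p):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ beta
        r2 = 1 - resid.var() / X[:, j].var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Correlated-by-construction features: room count tracks square footage
rng = np.random.default_rng(1)
sqft = rng.uniform(800, 3500, 500)
rooms = sqft / 400 + rng.normal(0, 0.8, 500)
vifs = vif(np.column_stack([sqft, rooms]))
print(vifs)  # both well above 1; with only two features the VIFs are equal
```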

Full implementation in Python

Here's the complete house-price workflow using scikit-learn 1.8 for prediction and statsmodels 0.14 for statistical inference.

Expected output:

```text
scikit-learn Results
  Intercept:    45,609
  Coefficients: sqft=168.42, bedrooms=15,631
  R-squared:    0.9852
  RMSE:         18,408

statsmodels OLS Summary (abbreviated):
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       4.561e+04   3281.781     13.898      0.000    3.92e+04    5.21e+04
sqft         168.4231      2.547     66.114      0.000     163.415     173.431
bedrooms    1.563e+04   1716.936      9.104      0.000    1.23e+04     1.9e+04
==============================================================================

Variance Inflation Factors:
 Feature       VIF
   const 10.554782
    sqft  4.162673
bedrooms  4.162673
```

The coefficients closely recover the true data-generating parameters ($50,000 intercept, $167/sqft, $15,000/bedroom). Every coefficient has p < 0.001, confirming statistical significance. The VIFs for sqft and bedrooms are about 4.2 — below the threshold of 5, so multicollinearity isn't a serious concern here despite the features being correlated by construction. (The high VIF on the constant term is expected and not a problem.)

Key Insight: Notice the difference in purpose between the two libraries. scikit-learn gives you .predict() — it's built for production pipelines and cross-validation. statsmodels gives you p-values, confidence intervals, and diagnostic tests — it's built for inference and hypothesis testing. Use both.

Feature scaling for gradient descent

OLS doesn't care about feature scales because the normal equation solves the system exactly. But gradient descent does care — a lot. If square footage ranges from 800 to 3,500 while bedrooms range from 1 to 6, the cost function forms an elongated ellipse. The gradient oscillates along the sqft axis and creeps along the bedrooms axis.

Standardization fixes this by centering each feature at zero with unit variance: x' = (x - \mu) / \sigma. After scaling, gradient descent converges in roughly 50 iterations instead of 5,000.

Warning: Always fit the scaler on training data only, then transform both train and test sets. Fitting on the full dataset leaks information from the test set into the model, producing overly optimistic performance estimates.
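The warning above maps to a specific API pattern: call `fit` on the training set only, then `transform` both splits with the same fitted scaler. A minimal sketch (the tiny arrays are assumptions for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1200.0, 2.0], [1800.0, 3.0], [2400.0, 4.0], [3000.0, 5.0]])
X_test = np.array([[1500.0, 2.0], [2700.0, 4.0]])

scaler = StandardScaler().fit(X_train)   # learn mu and sigma from TRAINING data only
X_train_s = scaler.transform(X_train)    # training features: mean 0, std 1
X_test_s = scaler.transform(X_test)      # reuse the SAME mu and sigma at test time
print(X_train_s.mean(axis=0), X_train_s.std(axis=0))
```

Calling `fit` (or `fit_transform`) on the combined train-plus-test data is the leak the warning describes.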

When to use linear regression

Linear regression is the right choice more often than people think. Here's a decision framework:

Use linear regression when:

  • You need interpretable coefficients (finance, healthcare, legal)
  • The relationship is approximately linear (check a scatter plot first)
  • You have more observations than features (n >> p)
  • You need confidence intervals on predictions, not just point estimates
  • Regulatory or compliance requirements demand explainable models
  • You're building a baseline before trying complex models (always a good idea)

Don't use linear regression when:

  • The relationship is clearly nonlinear — try polynomial regression or tree-based methods
  • You have more features than observations (p > n) — use Ridge or Lasso instead
  • The target is categorical — that's logistic regression territory
  • Outliers dominate and can't be removed — consider robust regression (Huber loss, RANSAC)
  • You need to capture feature interactions automatically — random forests and gradient boosting handle this natively
  • The data has complex spatial or temporal structure — specialized models exist

Pro Tip: Even when you end up using XGBoost or a neural network, always fit a linear regression first. It takes 30 seconds and gives you a baseline that tells you how much the complex model is actually contributing. If linear regression gets R^2 = 0.95 and your deep network gets 0.96, the complexity isn't worth it.

Production considerations

Computational complexity

| Operation | Time Complexity | Space Complexity |
| --- | --- | --- |
| OLS training | O(np^2 + p^3) | O(np + p^2) |
| SGD training (one epoch) | O(np) | O(p) |
| Prediction (single sample) | O(p) | O(p) |
| Prediction (batch of m) | O(mp) | O(mp) |

For n = 10,000 and p = 50, OLS training takes under a millisecond on modern hardware. Prediction is essentially free — it's a dot product. This makes linear regression one of the fastest models to deploy and serve.

Scaling behavior

  • 100K rows, 50 features: OLS trains in ~0.1 seconds. No issues.
  • 10M rows, 50 features: OLS still works but eats memory. Consider SGDRegressor with partial_fit() for streaming.
  • 100K rows, 10K features: OLS matrix inversion becomes the bottleneck (O(p^3)). Switch to iterative solvers or regularized methods.
  • Sparse features (NLP, one-hot encoding): Use SGDRegressor with sparse matrices. scikit-learn handles scipy.sparse inputs natively.

Common deployment pitfalls

  1. Training-serving skew. If you standardize features during training but forget to apply the same scaler at serving time, predictions will be garbage. Save the scaler alongside the model.
  2. Extrapolation. Linear regression happily predicts negative prices for sufficiently small houses. Add bounds checks or clipping in production.
  3. Concept drift. Housing market coefficients from 2020 don't apply in 2026. Retrain periodically or monitor prediction residuals.
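Pitfalls 1 and 2 have a compact code answer: persist the scaler together with the model, and clip predictions at serving time. A sketch using pickle and an in-memory buffer (a file path would work the same way; the tiny dataset is an assumption):

```python
import io
import pickle
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Tiny stand-in training set (exactly linear, so the fit is exact)
X = np.array([[1200.0], [1800.0], [2400.0]])
y = np.array([250_000.0, 350_000.0, 450_000.0])

scaler = StandardScaler().fit(X)
model = LinearRegression().fit(scaler.transform(X), y)

# Persist the scaler WITH the model so serving applies the identical transform
buf = io.BytesIO()
pickle.dump({"scaler": scaler, "model": model}, buf)

# Serving time: load both, transform, predict, and clip to guard extrapolation
buf.seek(0)
bundle = pickle.load(buf)
pred = bundle["model"].predict(bundle["scaler"].transform([[1500.0]]))
pred = np.clip(pred, 0.0, None)   # never serve a negative price
print(pred)                       # [300000.]
```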

Conclusion

Linear regression holds a unique position in data science: it's one of the few algorithms where mastering the theory directly improves your practical results. Every coefficient has a direct interpretation. Every assumption is testable. The closed-form OLS solution guarantees the global minimum of the cost function — something gradient-based methods for complex models can never promise.

The progression from understanding to application follows a clear path: internalize the regression equation and what each term means, understand the cost function driving optimization, verify the LINE assumptions through residual diagnostics, and recognize when the data demands a more flexible approach.

When that moment arrives, the natural next steps build directly on this foundation. Polynomial Regression adds curved terms to capture nonlinear patterns while preserving the OLS framework. Ridge, Lasso, and Elastic Net introduce penalty terms that tame multicollinearity and prevent overfitting. And The Bias-Variance Tradeoff explains why constraining a model often improves its predictions on new data.

Start with the straight line. Master it completely. The curves will follow.

Frequently Asked Interview Questions

Q: What assumptions does linear regression make, and what happens when they're violated?

Linear regression assumes linearity, independence of errors, normally distributed residuals, and equal variance (homoscedasticity) — the LINE mnemonic. When linearity breaks, the model systematically mispredicts. When independence breaks (common in time series), standard errors shrink artificially, making coefficients look more significant than they are. Normality violations affect confidence intervals and p-values but matter less with large samples. Heteroscedasticity doesn't bias coefficients but invalidates all your inferential statistics.

Q: Explain the difference between R-squared and adjusted R-squared. When does it matter?

R-squared measures the proportion of target variance explained by the model, but it never decreases when you add features — even useless ones. Adjusted R-squared penalizes for the number of predictors, so it can decrease if a new feature doesn't contribute enough explanatory power. It matters whenever you're comparing models with different feature counts. In practice, I'd rely more on cross-validated RMSE than either metric, since both can mislead when assumptions are violated.

Q: Your model's training RMSE is low but test RMSE is much higher. What's going on?

That's classic overfitting — the model has memorized training noise rather than learning the true signal. In linear regression, this typically happens when you have too many features relative to observations (a high p/n ratio) or severe multicollinearity inflating coefficient magnitudes. The fix is regularization: Ridge or Lasso penalize large coefficients and force the model to generalize. You should also check VIF values and consider dropping correlated features.

Q: When would you choose linear regression over a random forest or XGBoost?

Three scenarios. First, when interpretability is non-negotiable — regulated industries like finance and healthcare need coefficients that humans can audit. Second, when the relationship is genuinely linear — adding model complexity won't help and might hurt through overfitting. Third, when you need prediction intervals — linear regression provides them analytically through standard errors, while tree-based methods require bootstrapping or conformal prediction.

Q: How do you detect and handle multicollinearity?

Compute the Variance Inflation Factor for each feature. VIF above 5 warrants investigation; above 10 needs action. The most common fixes: drop one of the correlated features, combine them into a composite variable (like a "size index" from square footage and room count), or use Ridge regression, which handles collinearity gracefully by shrinking correlated coefficients toward each other. The correlation matrix alone isn't sufficient — VIF catches multicollinearity that involves three or more features simultaneously.

Q: Why does gradient descent need feature scaling but OLS doesn't?

OLS solves the normal equation in one matrix operation — the scale of features doesn't affect the result because the matrix inverse accounts for it. Gradient descent, however, takes steps proportional to the gradient. If one feature ranges from 0 to 1 and another from 0 to 1,000,000, the gradient along the second dimension is orders of magnitude larger, causing the optimizer to oscillate wildly. Standardizing features creates a roughly spherical loss surface where gradient steps make equal progress in all directions.

Q: A colleague adds 50 features to your 10-feature linear regression and says "R-squared went up, so the model is better." How do you respond?

R-squared mechanically increases with more features — it can't decrease. Adding random noise columns would still bump it up. I'd ask two questions: did adjusted R-squared improve, and did cross-validated RMSE decrease? If neither, the extra features are adding noise, not signal. I'd also check for overfitting by comparing training and test performance, and run Lasso regression to see which of the 50 features actually survive regularization.

Q: What's the geometric interpretation of the normal equation?

The normal equation projects the target vector \mathbf{y} onto the column space of the feature matrix X. The resulting prediction \hat{\mathbf{y}} = X\hat{\boldsymbol{\beta}} is the closest point to \mathbf{y} in the subspace spanned by the features. The residual vector \mathbf{y} - \hat{\mathbf{y}} is orthogonal to every column of X — that's where the name "normal" equation comes from (normal as in perpendicular, not normal as in ordinary).

Hands-On Practice

Linear Regression is often the first step into machine learning, but applying it to real sensor data transforms abstract equations into tangible insights. In this hands-on tutorial, you will build a regression model to predict sensor readings based on time, allowing you to understand the baseline behavior of an industrial device. We will use the Sensor Anomalies dataset, which provides a time-series of sensor values, offering a perfect playground to see how regression lines attempt to fit noisy, real-world data.

Dataset: Sensor Anomalies (Detection) — sensor readings with 5% labeled anomalies (extreme values) and clear separation between normal and anomalous data. Isolation Forest reaches precision ≈ 94% on it.

Experiment with splitting the data differently by changing test_size=0.2 to 0.5 or setting shuffle=True to see how time-dependence affects accuracy. Try filtering for anomalies (where is_anomaly == 1) specifically to see if regression works better or worse on outlier data. These adjustments will reveal the limitations of linear models on complex, potentially non-linear sensor patterns.
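The tutorial's split experiment can be sketched on stand-in data (the real sensor dataset isn't bundled here, so a synthetic drift-plus-noise series and its parameters are assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Stand-in sensor series: a slow linear drift plus noise, indexed by time step
rng = np.random.default_rng(7)
t = np.arange(1_000, dtype=float).reshape(-1, 1)
value = 0.05 * t.ravel() + rng.normal(0.0, 2.0, 1_000)

# shuffle=False keeps time order: train on the past, evaluate on the future
t_train, t_test, v_train, v_test = train_test_split(
    t, value, test_size=0.2, shuffle=False
)
model = LinearRegression().fit(t_train, v_train)
score = model.score(t_test, v_test)
print(f"Test R^2: {score:.3f}")
```

Flipping to `shuffle=True` or changing `test_size` and rerunning shows how much the time-ordered split changes the score.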

Explore all career paths