
Logistic Regression: The Definitive Guide to Classification

LDS Team — Let's Data Science · 13 min

You're building a fraud detection system. Your linear regression model spits out a prediction of 1.5 for a transaction. What does 150% probability of fraud even mean? It doesn't. Linear regression can't handle classification problems because it produces outputs that violate the basic rules of probability.

Logistic regression fixes this. It's the most widely deployed classification algorithm in production machine learning, used everywhere from credit scoring to medical diagnosis. Despite the misleading name, logistic regression is a classifier, not a regression model. It wraps a linear equation inside a sigmoid function that squashes any real number into a valid probability between 0 and 1.

Throughout this guide, we'll build a customer churn classifier from scratch, using that single example to understand every piece of logistic regression: the sigmoid curve, log-odds, the cost function, coefficient interpretation, threshold tuning, and multiclass extensions.

Logistic regression pipeline from raw data to classification output

Linear Regression Breaks Down for Classification

Linear regression predicts continuous values using the equation:

y = \beta_0 + \beta_1 x

Where:

  • $y$ is the predicted output
  • $\beta_0$ is the intercept (bias term)
  • $\beta_1$ is the slope coefficient for feature $x$

The problem is obvious: if $x$ gets large enough, $y$ shoots past 1.0. If $x$ is very negative, $y$ drops below 0. Neither outcome is a valid probability. In our churn example, predicting a 250% chance of leaving or a negative 30% chance of staying makes no business sense.

Linear regression also places the decision boundary in a position that's extremely sensitive to outliers. A single extreme data point can shift the entire fitted line, flipping predictions for dozens of other customers. We need a function that accepts any real number and maps it strictly into $[0, 1]$.

Key Insight: The fundamental mismatch is that a straight line has an unlimited range ($-\infty$ to $+\infty$), but probability is bounded between 0 and 1. Logistic regression solves this by fitting the line in "log-odds space" and then converting back to probability.

The Sigmoid Function

The sigmoid function (also called the logistic function) is the S-shaped curve that transforms any real number into a probability between 0 and 1. It's differentiable everywhere, which makes it compatible with gradient-based optimization.

Given a linear combination $z = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n$, the sigmoid function is:

\sigma(z) = \frac{1}{1 + e^{-z}}

Where:

  • $\sigma(z)$ is the output probability (between 0 and 1)
  • $e$ is Euler's number (approximately 2.718)
  • $z$ is the linear combination of inputs, also called the log-odds

In Plain English: Think of the sigmoid as a compressor. No matter how extreme the input, the output always lands between 0 and 1. A customer with an absurdly high bill and zero usage gets a churn probability near 0.999. A loyal customer with low bills and heavy usage gets something near 0.001. And right at $z = 0$, the sigmoid returns exactly 0.5, the coin-flip point.

Let's see this in action:
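A minimal NumPy sketch (the `sigmoid` helper name and the choice of z values are ours) that generates the table below:

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print("Sigmoid Function Values:")
print(f"{'z (log-odds)':<15} {'sigma(z)':<12} Interpretation")
print("-" * 52)
for z in [-6, -3, -1, 0, 1, 3, 6]:
    if z == 0:
        label = "Decision boundary"
    else:
        label = "Strong negative class" if z < 0 else "Strong positive class"
    print(f"{z:<15} {sigmoid(z):<12.6f} {label}")
```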

Expected Output:

```text
Sigmoid Function Values:
z (log-odds)    sigma(z)     Interpretation
----------------------------------------------------
-6              0.002473     Strong negative class
-3              0.047426     Strong negative class
-1              0.268941     Strong negative class
0               0.500000     Decision boundary
1               0.731059     Strong positive class
3               0.952574     Strong positive class
6               0.997527     Strong positive class
```

Notice the symmetry: $\sigma(-z) = 1 - \sigma(z)$. The sigmoid is centered at $z = 0$, where it returns exactly 0.5.

Odds represent the ratio of an event happening to it not happening. Log-odds (also called logits) are the natural logarithm of the odds. Logistic regression fits a linear equation to the log-odds, not to the raw probability.

This is the single most misunderstood aspect of logistic regression. When people call it a "linear" classifier, they mean linear in the log-odds space.

Probability to Odds

If a customer has a 0.8 probability of churning, the odds are:

\text{Odds} = \frac{p}{1 - p} = \frac{0.8}{0.2} = 4

This means the customer is 4 times more likely to churn than to stay. Odds range from 0 to $+\infty$.

Odds to Log-Odds (The Logit)

Taking the natural logarithm of the odds produces a value that ranges from $-\infty$ to $+\infty$, which perfectly matches the output of a linear equation:

\text{logit}(p) = \ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n

Where:

  • $\text{logit}(p)$ is the log-odds of the positive class
  • $\ln$ is the natural logarithm
  • $\beta_0$ is the intercept
  • $\beta_1, \dots, \beta_n$ are the feature coefficients
  • $x_1, \dots, x_n$ are the input features

In Plain English: Probability is trapped between 0 and 1, so a straight line can't model it directly. But log-odds can stretch all the way from $-\infty$ to $+\infty$. Logistic regression calculates the linear log-odds first, then converts backward through the sigmoid to get the churn probability. It's a mathematical trick that lets us use linear algebra for a classification problem.
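The round trip can be verified in a few lines of standard-library Python (0.8 is the example churn probability from above):

```python
import math

p = 0.8                    # probability of churn (example from above)
odds = p / (1 - p)         # ≈ 4: four times as likely to churn as to stay
log_odds = math.log(odds)  # ≈ 1.386: unbounded, so a line can model it

# The sigmoid inverts the logit, recovering the original probability
p_back = 1 / (1 + math.exp(-log_odds))
print(f"odds={odds:.3f}, log-odds={log_odds:.3f}, recovered p={p_back:.3f}")
```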

Relationship between probability, odds, and log-odds in logistic regression

The Decision Boundary

The decision boundary is the threshold where the model flips from predicting one class to the other. By default, logistic regression uses 0.5: if $\sigma(z) > 0.5$, predict "churned"; otherwise, predict "loyal."

Geometrically, $\sigma(z) = 0.5$ happens exactly when $z = 0$:

\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n = 0

Where:

  • This equation defines a line (in 2D), a plane (in 3D), or a hyperplane (in higher dimensions)
  • Points on one side of this boundary have $z > 0$ (positive class), and points on the other side have $z < 0$ (negative class)

Common Pitfall: A 0.5 threshold is not always appropriate. In churn prediction, missing a churner (false negative) might cost your company far more than incorrectly flagging a loyal customer. In fraud detection where only 0.1% of transactions are fraudulent, a 0.5 threshold would classify almost everything as "not fraud." Always tune the threshold based on your business cost structure.

Log Loss: The Cost Function

Logistic regression uses Log Loss (Binary Cross-Entropy) instead of Mean Squared Error because plugging the sigmoid into MSE creates a non-convex surface with many local minima. Log Loss produces a smooth, convex cost function that gradient descent can optimize reliably.

J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \right]

Where:

  • $J(\theta)$ is the average cost across all training examples
  • $m$ is the number of training samples
  • $y^{(i)}$ is the actual label (0 or 1) for sample $i$
  • $\hat{y}^{(i)}$ is the predicted probability for sample $i$
  • $\log$ is the natural logarithm

In Plain English: Log Loss is a "surprise penalty." If the model says there's a 99% chance a customer will churn, and the customer actually stays, the model gets hammered with a massive penalty. If it says 50/50 and gets it wrong, the penalty is moderate. The optimizer adjusts the weights to minimize total surprise across all training data.

This formulation comes from maximum likelihood estimation, where we maximize the probability of observing the actual labels given the model's predictions (see Bishop's Pattern Recognition and Machine Learning, Chapter 4). When $y = 1$, only the $-\log(\hat{y})$ term survives. If $\hat{y}$ is close to 1, the loss is near zero. If $\hat{y}$ is close to 0, the loss approaches infinity. This asymmetric penalty drives the model toward confident, correct predictions.
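As a sketch, the formula can be implemented directly (the `log_loss` helper name and the example probabilities are ours; scikit-learn ships an equivalent `sklearn.metrics.log_loss`):

```python
import math

def log_loss(y_true, y_pred, eps=1e-15):
    """Binary cross-entropy averaged over samples; eps guards log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

# Confident and correct: tiny penalty
print(log_loss([1], [0.99]))  # ≈ 0.01
# Confident and wrong: the "surprise penalty" explodes
print(log_loss([0], [0.99]))  # ≈ 4.61
```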

Building a Churn Classifier in Python

Let's train a logistic regression model on synthetic customer data with two features: monthly bill amount and total usage hours. Higher bills and lower usage correlate with churn.
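One way to set this up is sketched below; the feature distributions, random seed, and sample sizes are our assumptions, so the exact metrics will differ slightly from the output shown:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 200

# Loyal customers: lower bills, higher usage; churners: the opposite
bill = np.concatenate([rng.normal(50, 10, n // 2), rng.normal(80, 10, n // 2)])
usage = np.concatenate([rng.normal(30, 8, n // 2), rng.normal(12, 8, n // 2)])
X = np.column_stack([bill, usage])
y = np.concatenate([np.zeros(n // 2), np.ones(n // 2)])  # 1 = churned

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Scale features so gradient-based optimization converges cleanly
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=["Loyal", "Churned"]))
```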

Expected Output:

```text
Accuracy: 0.975

Confusion Matrix:
[[21  0]
 [ 1 18]]

Classification Report:
              precision    recall  f1-score   support

       Loyal       0.95      1.00      0.98        21
     Churned       1.00      0.95      0.97        19

    accuracy                           0.97        40
   macro avg       0.98      0.97      0.97        40
weighted avg       0.98      0.97      0.97        40
```

Pro Tip: Always scale features before training logistic regression. The model uses gradient-based optimization, and features on wildly different scales (bill in dollars vs. usage in hours) cause the optimizer to zigzag instead of converging directly. StandardScaler centers each feature at mean 0 with standard deviation 1.

Interpreting Coefficients as Odds Ratios

Logistic regression coefficients represent the change in log-odds for a one-standard-deviation increase in a feature (when features are scaled). To make this actionable, exponentiate the coefficient to get the odds ratio.
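A sketch of the computation, re-using the same synthetic churn setup (seed-dependent, so exact values will differ slightly from the output shown):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 200
bill = np.concatenate([rng.normal(50, 10, n // 2), rng.normal(80, 10, n // 2)])
usage = np.concatenate([rng.normal(30, 8, n // 2), rng.normal(12, 8, n // 2)])
X = StandardScaler().fit_transform(np.column_stack([bill, usage]))
y = np.concatenate([np.zeros(n // 2), np.ones(n // 2)])  # 1 = churned

model = LogisticRegression().fit(X, y)

print("Coefficient Interpretation:")
print("-" * 50)
for name, coef in zip(["Monthly Bill", "Usage Hours"], model.coef_[0]):
    # exp(coef) converts a log-odds change into an odds ratio
    print(f"{name}: coef={coef:.4f}, odds ratio={np.exp(coef):.4f}")
print(f"Intercept: {model.intercept_[0]:.4f}")
```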

Expected Output:

```text
Coefficient Interpretation:
--------------------------------------------------
Monthly Bill: coef=2.3257, odds ratio=10.2339
Usage Hours: coef=-2.7433, odds ratio=0.0644
Intercept: 0.5132
```

Here's how to read this:

Feature        Coefficient   Odds Ratio   Interpretation
Monthly Bill   2.33          10.23        Each 1 SD increase in bill makes churn 10.2x more likely
Usage Hours    -2.74         0.06         Each 1 SD increase in usage makes churn 94% less likely

In Plain English: The raw coefficient tells you direction (positive means higher feature values increase churn risk). The odds ratio tells you magnitude. An odds ratio of 10.2 for Monthly Bill means that for every standard deviation increase in the bill, the customer's odds of churning multiply by 10.2. Usage Hours has an odds ratio of 0.06, which means higher usage drastically reduces churn risk.

Predicted Probabilities and Threshold Tuning

Logistic regression doesn't just output a class label. It outputs the probability of belonging to each class, giving you control over the precision-recall tradeoff through threshold selection.
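A sketch using `predict_proba` on the same synthetic setup (seed-dependent, so the exact probabilities will vary):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 200
bill = np.concatenate([rng.normal(50, 10, n // 2), rng.normal(80, 10, n // 2)])
usage = np.concatenate([rng.normal(30, 8, n // 2), rng.normal(12, 8, n // 2)])
X = StandardScaler().fit_transform(np.column_stack([bill, usage]))
y = np.concatenate([np.zeros(n // 2), np.ones(n // 2)])  # 1 = churned
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

model = LogisticRegression().fit(X_train, y_train)
proba = model.predict_proba(X_test)  # columns: [P(Loyal), P(Churned)]

print("Predicted Probabilities (first 5 test samples):")
print(f"{'Sample':<10} {'P(Loyal)':<12} {'P(Churned)':<12} Prediction")
print("-" * 46)
for i, (p_loyal, p_churn) in enumerate(proba[:5], start=1):
    label = "Churned" if p_churn > 0.5 else "Loyal"
    print(f"{i:<10} {p_loyal:<12.4f} {p_churn:<12.4f} {label}")
```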

Expected Output:

```text
Predicted Probabilities (first 5 test samples):
Sample     P(Loyal)     P(Churned)   Prediction
----------------------------------------------
1          0.9946       0.0054       Loyal
2          0.9989       0.0011       Loyal
3          0.9784       0.0216       Loyal
4          0.0020       0.9980       Churned
5          0.0004       0.9996       Churned
```

The model isn't guessing. It's highly confident in each prediction. But what if your business cares more about catching churners (recall) than avoiding false alarms (precision)? Lower the threshold:
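A sketch of the threshold sweep on the same synthetic setup (exact metrics are seed-dependent):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 200
bill = np.concatenate([rng.normal(50, 10, n // 2), rng.normal(80, 10, n // 2)])
usage = np.concatenate([rng.normal(30, 8, n // 2), rng.normal(12, 8, n // 2)])
X = StandardScaler().fit_transform(np.column_stack([bill, usage]))
y = np.concatenate([np.zeros(n // 2), np.ones(n // 2)])  # 1 = churned
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

model = LogisticRegression().fit(X_train, y_train)
p_churn = model.predict_proba(X_test)[:, 1]

print(f"{'Threshold':<12} {'Precision':<12} {'Recall':<12} F1 Score")
print("-" * 48)
for t in [0.3, 0.4, 0.5, 0.6, 0.7]:
    y_pred = (p_churn >= t).astype(int)  # apply a custom decision threshold
    print(f"{t:<12} {precision_score(y_test, y_pred):<12.4f} "
          f"{recall_score(y_test, y_pred):<12.4f} {f1_score(y_test, y_pred):.4f}")
```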

Expected Output:

```text
Threshold    Precision    Recall       F1 Score
------------------------------------------------
0.3          0.9474       0.9474       0.9474
0.4          1.0000       0.9474       0.9730
0.5          1.0000       0.9474       0.9730
0.6          1.0000       0.9474       0.9730
0.7          1.0000       0.8947       0.9444
```

Lowering the threshold to 0.3 catches the same number of churners but introduces one false positive, dropping precision from 1.00 to 0.95. Raising it to 0.7 loses a churner, dropping recall to 0.89. The sweet spot depends entirely on the cost of each type of error in your specific business context.

Regularization: Controlling Overfitting

Regularization prevents logistic regression from overfitting by penalizing large coefficients. This is identical to the techniques used in ridge, lasso, and elastic net regression.

Penalty       How It Works                            Effect on Coefficients        Feature Selection?
L2 (Ridge)    Adds $\lambda \sum \beta_j^2$ to cost   Shrinks toward zero           No
L1 (Lasso)    Adds $\lambda \sum |\beta_j|$ to cost   Drives some to exactly zero   Yes
Elastic Net   Combines L1 + L2                        Mixture of both effects       Yes

In scikit-learn, the C parameter controls regularization strength. Important: C is the inverse of regularization strength.

C Value         Regularization   Model Behavior
0.001           Very strong      Underfitting risk; very simple boundary
1.0 (default)   Moderate         Good starting point for most problems
100.0           Very weak        Overfitting risk; complex boundary

Pro Tip: Start with the default C=1.0. If your training accuracy is much higher than validation accuracy, reduce C to add more regularization. If both are low, increase C to give the model more flexibility. Use LogisticRegressionCV or cross-validation to automate this search.
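A quick way to see the inverse relationship is to fit the same data at several values of C and watch the coefficients shrink as C gets smaller (the synthetic data here is our own construction):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only the first two features carry signal
z = 2.0 * X[:, 0] - 1.5 * X[:, 1]
y = (rng.random(200) < 1 / (1 + np.exp(-z))).astype(int)

# Smaller C = stronger L2 penalty = coefficients pulled toward zero
for C in [0.001, 1.0, 100.0]:
    model = LogisticRegression(C=C).fit(X, y)
    print(f"C={C:<7} coefficients={np.round(model.coef_[0], 3)}")
```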

Multiclass Logistic Regression

Logistic regression extends beyond binary problems to handle multiple classes through two strategies: One-vs-Rest (OvR) and Multinomial (Softmax).

One-vs-Rest (OvR)

OvR trains a separate binary classifier for each class. For three classes (Apple, Banana, Orange), it trains three models: Apple vs. not-Apple, Banana vs. not-Banana, Orange vs. not-Orange. The class with the highest confidence wins.

Multinomial (Softmax)

The Softmax function generalizes the sigmoid to $K$ classes, forcing all class probabilities to sum to 1.0:

P(y = k \mid x) = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}

Where:

  • $P(y = k \mid x)$ is the probability of class $k$ given input $x$
  • $z_k$ is the linear output for class $k$
  • $K$ is the total number of classes
  • The denominator ensures all probabilities sum to 1

In Plain English: Softmax is a competition. Each class produces a score, and Softmax converts those scores into probabilities by dividing each score's exponential by the sum of all exponentials. The class with the highest score gets the largest probability slice.
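A minimal sketch of that competition (the `softmax` helper and the example scores are ours):

```python
import numpy as np

def softmax(z):
    """Turn a vector of class scores into probabilities summing to 1."""
    e = np.exp(z - np.max(z))  # subtract max for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])  # linear outputs for Apple, Banana, Orange
probs = softmax(scores)
print(np.round(probs, 4))  # largest score gets the largest probability slice
print(probs.sum())         # sums to 1 (up to floating point)
```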

Scikit-learn handles this automatically (see the LogisticRegression documentation). In recent versions, multinomial (softmax) is the default for multiclass problems; the older multi_class='multinomial' flag is deprecated. Softmax is generally preferred over OvR because it learns inter-class relationships jointly rather than in isolation.

Comparison of One-vs-Rest and Softmax multiclass strategies

When to Use Logistic Regression (and When Not To)

Logistic regression isn't always the right tool. Here's a decision framework:

Use logistic regression when:

  • You need interpretable predictions (healthcare, finance, regulatory environments)
  • The relationship between features and log-odds is approximately linear
  • You have a moderate dataset size (works well from hundreds to millions of rows)
  • You need probability estimates, not just class labels
  • You want a fast baseline before trying complex models

Don't use logistic regression when:

  • The decision boundary is highly nonlinear (try decision trees or gradient boosting)
  • You have complex feature interactions the model can't capture
  • You need state-of-the-art accuracy on tabular data (use XGBoost instead)
  • Features are highly correlated and you haven't applied PCA or regularization

Decision guide for when to use logistic regression versus other classifiers

Production Considerations

Computational complexity: Training is $O(n \cdot p)$ per gradient descent iteration, where $n$ is the number of samples and $p$ is the number of features. The lbfgs solver typically converges in 10 to 100 iterations. Prediction is $O(p)$ per sample, extremely fast for real-time inference.

Memory: Logistic regression stores only the coefficient vector ($p + 1$ values including the intercept), making it one of the most memory-efficient classifiers. A model with 1,000 features uses roughly 8 KB of memory.

Scaling: Handles millions of rows easily with the saga solver and max_iter=200. For datasets exceeding available memory, use SGDClassifier(loss='log_loss') which processes data in mini-batches and supports online learning through partial_fit.

Deployment: The prediction function is just a dot product followed by sigmoid. You can export coefficients and implement inference in any language without needing scikit-learn at runtime. This makes logistic regression ideal for edge devices, embedded systems, and latency-sensitive APIs.
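A sketch of that exported-coefficient inference path in pure standard-library Python (the coefficient values are hypothetical, borrowed from the earlier example output, and assume features scaled the same way as at training time):

```python
import math

# Hypothetical coefficients exported from a trained model
INTERCEPT = 0.5132
COEFS = [2.3257, -2.7433]  # [monthly_bill_scaled, usage_hours_scaled]

def predict_churn_probability(features):
    """The entire inference path: dot product, then sigmoid."""
    z = INTERCEPT + sum(c * x for c, x in zip(COEFS, features))
    return 1.0 / (1.0 + math.exp(-z))

# A customer with an above-average bill and below-average usage
p = predict_churn_probability([1.2, -0.8])
print(f"Churn probability: {p:.4f}")
```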

Conclusion

Logistic regression is the most important classification algorithm to understand deeply. It's the foundation that support vector machines, neural networks, and even attention mechanisms build upon. The sigmoid function, log-odds transformation, and log loss cost function appear repeatedly throughout machine learning, so mastering them here pays dividends everywhere.

The coefficient interpretability makes logistic regression irreplaceable in regulated industries. When a bank needs to explain why it declined a loan, or a hospital needs to justify a diagnosis recommendation, logistic regression provides clear, auditable reasoning that no black-box model can match.

For structured tabular data where you need a reliable baseline, start with logistic regression. If accuracy falls short, graduate to ensemble methods like gradient boosting or XGBoost. But don't be surprised if the logistic model holds its own, especially after thoughtful feature engineering.

Frequently Asked Interview Questions

Q: Why is logistic regression called "regression" if it's a classification algorithm?

The name comes from the mathematical technique, not the task. Logistic regression uses regression to model the log-odds of class membership as a linear function of the features. The sigmoid function then converts these continuous log-odds into probabilities, and a threshold converts probabilities into discrete class labels. The "regression" part describes the fitting process; the output is classification.

Q: What happens if you use MSE instead of Log Loss for logistic regression?

The cost surface becomes non-convex with multiple local minima because of the sigmoid function's shape. Gradient descent may get stuck in a local minimum and fail to find the optimal parameters. Log Loss produces a convex cost surface with a single global minimum, guaranteeing convergence. In practice, using MSE with sigmoid also produces weak gradients when predictions are confident but wrong, slowing learning dramatically.

Q: How do you handle class imbalance in logistic regression?

Three main approaches: (1) Set class_weight='balanced' in scikit-learn, which automatically adjusts the cost function to penalize misclassifying the minority class more heavily. (2) Adjust the classification threshold, lowering it from 0.5 to increase recall for the minority class. (3) Use resampling techniques like SMOTE or random undersampling before training. The first option is usually the best starting point because it requires no data manipulation.

Q: Can logistic regression handle nonlinear decision boundaries?

Not directly, since it learns a linear boundary in the original feature space. However, you can manually engineer polynomial or interaction features (e.g., $x_1^2$, $x_1 \cdot x_2$) to model nonlinear boundaries. Scikit-learn's PolynomialFeatures does this automatically. The boundary is still linear in the expanded feature space but appears nonlinear in the original space. For genuinely complex boundaries, tree-based methods are usually a better choice.

Q: What's the difference between solver='lbfgs' and solver='saga' in scikit-learn?

lbfgs is a quasi-Newton method that approximates the Hessian matrix for faster convergence. It works well for small to medium datasets and is the default solver. saga is a stochastic average gradient method that processes random subsets of data per iteration, making it efficient for large datasets (over 100K samples). saga also supports all penalty types including Elastic Net, while lbfgs only supports L2.

Q: Your logistic regression model shows 99% accuracy on an imbalanced fraud dataset. Is this a good model?

Almost certainly not. If only 1% of transactions are fraudulent, a model that predicts "not fraud" for every transaction achieves 99% accuracy while catching zero fraud. Accuracy is misleading for imbalanced datasets. Look at precision, recall, F1-score for the minority class, and the area under the precision-recall curve (AUPRC) instead. A model with 95% accuracy but 80% fraud recall is far more valuable than one with 99% accuracy and 0% recall.

Q: How do you interpret the intercept term in logistic regression?

The intercept $\beta_0$ is the log-odds of the positive class when all features are zero (or at their mean, if features are standardized). Exponentiating it gives the baseline odds. For example, if $\beta_0 = -2.3$, the baseline odds are $e^{-2.3} \approx 0.10$, meaning when all features are at their mean values, the probability of the positive class is about 9%. The intercept shifts the entire sigmoid curve left or right along the z-axis.

Hands-On Practice

While theoretical understanding of the sigmoid function and probability thresholds is crucial, the true power of Logistic Regression is best revealed through hands-on implementation. You'll build a solid multi-class classification model to predict biological species based on physical measurements, effectively moving beyond binary 'yes/no' predictions to handle complex, real-world categorization. We will use the Species Classification dataset, which provides clear separation between classes, making it an ideal sandbox for visualizing decision boundaries and understanding how logistic regression calculates probabilities for multiple categories simultaneously.

Dataset: Species Classification (Multi-class) Iris-style species classification with 3 well-separated classes. Perfect for multi-class algorithms. Expected accuracy ≈ 95%+.

To deepen your understanding, try adjusting the regularization parameter C (e.g., compare C=0.01 vs C=100) to see how it affects the model's bias-variance tradeoff and coefficients. You might also experiment with removing one feature (like sepal_width) to observe if the model maintains accuracy with less information, which simulates real-world feature selection. Finally, investigate the predict_proba outputs on misclassified points to see if the model was 'confidently wrong' or just uncertain.

Explore all career paths