# Logistic Regression: The Definitive Guide (2021)¶

This guide will cover

What is logistic regression

How to fit logistic regression models

How to apply the latest research in hyperparameter optimization to prevent overfitting

How to use logistic regression with more than two classes

**Contents**

## Introduction¶

*Logistic regression* is a supervised learning algorithm for classification tasks.

In classification tasks, we’re given a sample training data set consisting of feature vectors

and categorical target values

We seek to fit a model so that given new out-of-sample feature vectors $\boldsymbol{x}_1', \dots, \boldsymbol{x}_m'$ we can predict their associated target values.

Real-world examples of classification tasks include

The Wisconsin Breast Cancer Data Set where the task is to predict whether a mass of skin is malignant or not given physical properties of the mass such as thickness and shape.

The Iris Data Set where the task is to predict what species of iris a flower is given its septal and petal length and width.

What form could a categorical prediction from a model take? The simplest form is for a model to predict which categorical value is most likely; but more useful, is for a model to predict a probability distribution, where for each category $C_k$ the model assigns a probability $p_k$ such that

Let us consider how we might build a simple model that provides probabilistic predictions for the case of binary classification.

With binary classification, we need only predict the probability of one of the categories, $p_1$ , since $p_0 = 1 - p_1$ . For simplicity, we will use $-1$ and $+1$ to denote the two possible target values so that

### Why Not Linear Regression?¶

First, let’s address: Why can’t we use linear regression?

Given a feature vector $\boldsymbol{x}$ , a linear regression model with weights $\boldsymbol{w}$ predicts

But predictions from this linear regression model don’t meet the requirements of a probability distribution. If the feature vector $\boldsymbol{x}$ has large enough magnitude, then $p_1$ will be outside of the zero to one range required by a probability distribution.

How can we fix this?

### The Logistic Function¶

Suppose $g\colon \mathbb{R} \to \mathbb{R}$ is a function where

and

Then given $\boldsymbol{w}\in\mathbb{R}^p$ , a model that predicts

meets the requirements of a probability distribution.

Logistic regression uses the function

called the *logistic function*.

Note

There are other possible functions that are suitable for binary classification. For example,
the *probit function*

is also sometimes used.

## Fitting Logistic Regression¶

Now that we’ve described what logistic regression is, how do we go about fitting a model to the training data?

Given weights $\boldsymbol{w}$
, we can compute the negative log likelihood of the training data,
called a *cost function* (or sometimes *loss function*), as

where

Let’s see how we can find weights that minimize this cost function; or, equivalently, maximize the log likelihood of the training data.

Caution

We’ll see later why maximizing the log likehood of the training data is a bad idea.

We’re going to find optimal weights by using a *trust-region optimizer*.

### Trust-region Optimization¶

Trust-region optimizers are a powerful class of iterative, second-order optimization algorithms.

Given a twice-differentiable objective function, $f\colon \mathbb{R}^p \to \mathbb{R}$ , a trust-region optimizer produces a sequence $\{\boldsymbol{x}_k\}$ that converges to an optimum of the objective. The kth iteration of the sequence is generated by updating the previous iteration with a solution to the subproblem

The subproblem minimizes the second-order approximation of $f$ at $\boldsymbol{x}_{k-1}$ within the neighborhood $\| \boldsymbol{s} \| \le \delta_k$ , called the trust-region. Using the trust-region, the optimizer restricts the second-order approximation to areas where it models $f$ well, and it continually expands or shrinks the trust-region to adapt to the accuracy of the local quadratic approximation as the algorithm progresses.

**Trust-region optimizers are both efficient and robust**, being
able to solve convex and non-convex optimization problems.

See also

For a full description of trust-region optimization, see

Norcedal and S. J. Wright. Numerical Optimization. 1999.

To apply a trust-region optimizer, we need to compute the gradient and hessian of the cost function $J(\boldsymbol{w})$ .

### Cost Function Gradient and Hessian¶

Let’s start with the gradient. Differentiating the cost function with respect to one of the weights, we get

where

Note

We use the notation $\dot{f}(u)$ to denote the first derivative of a single variable function:

Similarly, $\ddot{f}(u)$ denotes the second derivative.

Now, for the hessian of $J$ , we derive

where

### Example: Breast Cancer Diagonostic (w/o Regularization)¶

Let’s see how this works on an example data set.

Note

We use the bbai package to fit logistic regression models. See Getting Started for instructions on how to install it.

We’ll look at the Wisconsin Breast Cancer Data Set where the task is to determine whether a mass of skin is malignant or not from physical properties of the mass.

The data set is bundled with sklearn and we start by loading and normalizing the data.

```
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
data = load_breast_cancer()
X = data['data']
X = StandardScaler().fit_transform(X)
y = data['target']
```

Next, we fit a logistic regression model to the data and print out the weights.

```
import numpy as np
import bbai.glm
model = bbai.glm.LogisticRegression(
C = np.inf, # This specifies to use no regularization.
# We'll explain later what this means.
)
model.fit(X, y)
print(model.intercept_[0], "intercept")
for wi, name in zip(model.coef_[0], data['feature_names']):
print(wi, name)
```

## Overfitting and Regularization¶

Question: What’s the probability of events A, B, and C if we know the probability of event B is zero?

It’s zero – and this leads us to the problem with fitting models to maximize the likelihood of the training data.

If the training data set lacks less likely feature-category combinations, our model may overconfidently predict a category probability to be close to zero when it isn’t. Then when the combination occurs in out-of-sample data, our model erroneously predicts the data to have near zero probability.

This problem is known as *overfitting*.

One way to control overfitting is to use a technique called *regularization*. With regularization, we
change the cost function so that instead of minimizing the negative log likelihood of the training data,
we minimize

$C$
is called a *hyperparameter*, and it acts as a knob that controls the complexity
of the model. When the knob is dialed up to full regularization strength
(i.e. $C$
is close to zero), it forces the model to take its simplest form,
$\boldsymbol{w} = 0.$
On the other hand, when $C$
is at its weakest (i.e. $C$
is large),
it’s as if regularization isn’t there.

By setting $C$ to a good value, we can prevent overfitting while still keeping the predictive strength of our model.

### Cross-validation¶

How do we find a good value of $C$ ?

This is the basic idea: We construct a function $f(C)$
that estimates the out-of-sample
prediction error for a given $C$
. Then we select $C$
so as to minimize $f$
. This
process is called *hyperparameter optimization*.

One way to estimate out-of-sample prediction error is with a technique called *cross-validation*.
Cross-validation partitions the training data set into folds. For each fold, we form two sets of
data: one consisting of the training data points outside of the fold and one consisting of the training
data points in the fold. We fit a model to the first set and measure the prediction error on the second
and then average all the fold prediction errors to form the out-of-sample prediction error estimate.

Cross-validation comes in different forms. *K-fold cross-validation* partitions the training data set
into $k$
different folds. The larger the value of $k$
, the more accurate
the estimate; and largest value of $k$
we can use is $k = n-1$
, called
*leave-one-out cross-validation*.

While leave-one-out cross-validation provides the best estimate for out-of-sample prediction error,
it is *expensive* to compute. It requires that we train a different model for each data point in the
training data set. Fortunately, for logistic regression, there exists a different form of
cross-validation that is nearly as accurate as leave-one-out, but much more efficient to compute.

See also

For a comparison of leave-one-out cross-validation with K-fold, see the paper

### Approximate Leave-one-out Cross-validation¶

Enter *Approximate Leave-one-out Cross-validation* (ALO). It has much of the benefits of
leave-one-out cross-validation, while being efficient to compute and optimize.
Here’s how it works: Given $C$
we
first fit logistic regression by minimizing the regularized cost function of the whole training data
set. Let $\hat{\boldsymbol{w}}$
denote the optimal weights

and $\boldsymbol{H}$ denote the hessian of the cost function at the optimum

Note: The gradient will be zero.

Now, suppose we modify the cost function by removing the ith training data point. We call this new cost function $J_{/i}$

and we can compute its gradient and hessian at $\hat{\boldsymbol{w}}$ :

Using the gradient and hessian of $J_{/i},$ we can take a single step of Newton’s optimization algorithm and get a close approximation for what the weights would be without the ith data point

and then estimate the leave-one-out cross-validation with

Note

Here’s a quick refresher on Newton’s method.

Let $f(\boldsymbol{x})$ denote the function we’re minimizing. The second-order Taylor series approximation cenerted at point $\boldsymbol{x}_0$ is

If $\nabla^2 f(\boldsymbol{x}_0)$ is positive-definite, then the approximation is minimized when the gradient with respect to $\boldsymbol{h}$ is zero, or equivalently

Using this formula we avoid having to fit a model for each data point, but we still need to invert a $p$ -by-$p$ matrix $n$ times. We can achieve an even more efficient formula that only requires a single matrix inversion by applying the matrix inversion lemma to get

where

Note

If $\boldsymbol{A} \in \mathbb{R}^{n \times n}$ is an invertible square matrix, $\boldsymbol{u}, \boldsymbol{v} \in \mathbb{R}^n$ are vectors, and $\boldsymbol{A} + \boldsymbol{u} \boldsymbol{v}^\top$ is invertible, then the matrix inversion lemma says

$\left(\boldsymbol{A} + \boldsymbol{u} \boldsymbol{v}^\top\right)^{-1} = \boldsymbol{A}^{-1} - \frac{\boldsymbol{A}^{-1} \boldsymbol{u} \boldsymbol{v}^\top \boldsymbol{A}^{-1}} {1 + \boldsymbol{v}^\top \boldsymbol{A}^{-1} \boldsymbol{u}}.$Thus,

$\begin{align*} \boldsymbol{x}_i^\top \tilde{\boldsymbol{w}}_{/i} &= \boldsymbol{x}_i^\top\left( \hat{\boldsymbol{w}} - \left[\boldsymbol{H} - \ddot{\ell}_i(u_i) \boldsymbol{x}_i \boldsymbol{x}_i^\top \right]^{-1} \left(-\dot{\ell}_i(u_i) \boldsymbol{x}_i\right) \right) \\ &= u_i + \dot{\ell}_i(u_i) \boldsymbol{x}_i^\top \left[\boldsymbol{H} - \ddot{\ell}_i(u_i) \boldsymbol{x}_i \boldsymbol{x}_i^\top\right]^{-1} \boldsymbol{x}_i \\ &= u_i + \dot{\ell}_i(u_i) \boldsymbol{x}_i^\top \left[\boldsymbol{H}^{-1} + \frac{\ddot{\ell}_i(u_i) \boldsymbol{H}^{-1} \boldsymbol{x}_i \boldsymbol{x}_i^\top \boldsymbol{H}^{-1}} {1 - \ddot{\ell}_i(u_i) \boldsymbol{x}_i \boldsymbol{H}^{-1} \boldsymbol{x}_i} \right] \boldsymbol{x}_i \\ &= u_i + \dot{\ell}_i(u_i) \left[ h_i + \frac{\ddot{\ell}_i(u_i) h_i^2}{1 - \ddot{\ell}_i(u_i) h_i}\right] \\ &= u_i + \frac{\dot{\ell}_i(u_i) h_i}{1 - \ddot{\ell}_i(u_i) h_i}. \end{align*}$

See also

For more background on approximate leave-one-out cross-validation, see the paper

A scalable estimate of the extra-sample prediction error via approximate leave-one-out.

### Optimizing Approximate Leave-one-out Cross-validation¶

But accurately estimating out-of-sample prediction error is just one part of hyperparameter optimization. Now that we have our efficient ALO estimator, how do we find the hyperparameters that minimize the estimated prediction error?

Fortunately, ALO isn’t just efficient to compute. We can also compute exact derivatives of ALO with respect to the hyperparameter, making it extremely efficient to dial in to the parameters that minimizes estimated prediction error.

First, we make a minor change to the way the cost function is parameterized. Define

let $\hat{\boldsymbol{w}}_\alpha$ denote the weights that minimize $J_\alpha$

and define

$f$ is our ALO measurement and what we need to minimize.

To optimize $f,$ we can again apply trust-region optimization using the first and second derivatives

Note

While efficient to compute, the derivative formulas for $f$ are quite involved, so we won’t derive them in this guide. But, if you want all the details, check out the paper

Optimizing Approximate Leave-one-out Cross-validation to Tune Hyperparameters.

### Example: Breast Cancer Diagonostic (with Regularization)¶

Let’s now revisit the Breast Cancer Data Set and use regularization.

We again start by loading and normalizing the data set.

```
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
data = load_breast_cancer()
X = data['data']
X = StandardScaler().fit_transform(X)
y = data['target']
```

While a lot happens behind the scenes, from the user point of view fitting a model to minimize ALO is easy. We’ll fit a model and print out the value of $C$ found.

```
import bbai.glm
model = bbai.glm.LogisticRegression()
# Note: bbai.glm.LogisticRegression defaults to find the C that
# minimizes ALO if none is specified.
model.fit(X, y)
print("C =", model.C_)
```

This prints out

The Breast Cancer Data Set is small enough that we can compute the leave-one-out cross-validation across a range of values of $C$ by brute force. Let’s plot it out, compare it to ALO, and see where the value we found for $C$ lies along the curve.

First, we write the function to compute leave-one-out cross-validation by brute force.

```
import numpy as np
def compute_loocv(X, y, C):
model = bbai.glm.LogisticRegression(C=C)
n = len(y)
loo_likelihoods = []
for i in range(n):
train_indexes = [i_p for i_p in range(n) if i_p != i]
test_indexes = [i]
X_train, X_test = X[train_indexes], X[test_indexes]
y_train, y_test = y[train_indexes], y[test_indexes]
model.fit(X_train, y_train)
pred = model.predict_proba(X_test)[0]
loo_likelihoods.append(pred[y_test[0]])
return sum(np.log(loo_likelihoods))
```

Next, we write the function to compute ALO using the efficient formula.

```
import scipy
def fit_logistic_regression(X, y, C):
model = bbai.glm.LogisticRegression(C=C)
model.fit(X, y)
return np.array(list(model.coef_[0]) + list(model.intercept_))
def compute_hessian(p_vector, X, alpha):
n, k = X.shape
a_vector = np.sqrt((1 - p_vector)*p_vector)
R = scipy.linalg.qr(a_vector.reshape((n, 1))*X, mode='r')[0]
H = np.dot(R.T, R)
for i in range(k-1):
H[i, i] += alpha
return H
def compute_alo(X, y, C):
alpha = 1.0 / C
w = fit_logistic_regression(X, y, C)
X = np.hstack((X, np.ones((X.shape[0], 1))))
n = X.shape[0]
y = 2*y - 1
u_vector = np.dot(X, w)
p_vector = scipy.special.expit(u_vector*y)
H = compute_hessian(p_vector, X, alpha)
L = np.linalg.cholesky(H)
T = scipy.linalg.solve_triangular(L, X.T, lower=True)
h_vector = np.array([np.dot(ti, ti) for pi, ti in zip(p_vector, T.T)])
loo_u_vector = u_vector - \
y * (1 - p_vector)*h_vector / (1 - p_vector*(1 - p_vector)*h_vector)
loo_likelihoods = scipy.special.expit(y*loo_u_vector)
return sum(np.log(loo_likelihoods))
```

Now, let’s plot out leave-one-out cross-validation and ALO. We’ll draw a vertical line to indicate the optimal value of $C$ we found.

```
import matplotlib.pyplot as plt
Cs = np.arange(0.1, 2.0, 0.1)
loocvs = [compute_loocv(X, y, C) for C in Cs]
alos = [compute_alo(X, y, C) for C in Cs]
fig, ax = plt.subplots()
ax.plot(Cs, loocvs, label='LOOCV', marker='o')
ax.plot(Cs, alos, label='ALO', marker='x')
ax.axvline(model.C_, color='tab:green', label='C_opt')
ax.set_xlabel('C')
ax.set_ylabel('Log-Likelihood')
ax.set_title("Breast Cancer Dataset")
ax.legend()
```

This displays

## Handling More Than Two Classes¶

The logistic function we presented only works with two classes. So what do we do when we have more than two classes?

Let $K$ denote the number of classes and let $\boldsymbol{W}$ represent a $K\times p$ matrix of weights. We compute the probability of the kth class for the ith data entry as

This is called the *softmax function*, and we call the version of logistic regression that
uses it *multinomial logistic regression*. Put

The cost function for multinomial logistic regression is

We compute its gradient as

and we compute the hessian as

We can similarly adopt ALO for multinomial logistic regression.

### Example: Iris Species¶

We’ll apply multinomial logistic regression to fit a model to the Iris Data Set.

Load and normalize the data set.

```
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
data = load_iris()
X = data['data']
X = StandardScaler().fit_transform(X)
y = data['target']
```

Fit multinomial logistic regression with $C$ set so as to minimize the error on the approximate leave-one-out cross-validation.

```
import bbai.glm
model = bbai.glm.LogisticRegression()
model.fit(X, y)
print(model.C_)
```

Prints