Logistic Regression: The Definitive Guide (2021)¶
This guide will cover
What is logistic regression
How to fit logistic regression models
How to apply the latest research in hyperparameter optimization to prevent overfitting
How to use logistic regression with more than two classes
Introduction¶
Logistic regression is a supervised learning algorithm for classification tasks.
In classification tasks, we're given a sample training data set consisting of feature vectors $\mathbf{x}_1, \ldots, \mathbf{x}_n \in \mathbb{R}^p$ and categorical target values $y_1, \ldots, y_n$.
We seek to fit a model so that, given new out-of-sample feature vectors, we can predict their associated target values.
Real-world examples of classification tasks include
The Wisconsin Breast Cancer Data Set, where the task is to predict whether a breast mass is malignant or not given physical properties of the mass such as its thickness and shape.
The Iris Data Set, where the task is to predict what species of iris a flower is given its sepal and petal length and width.
What form could a categorical prediction from a model take? The simplest form is for a model to predict which categorical value is most likely; but more useful is for a model to predict a probability distribution, where the model assigns each category $k$ a probability $p_k$ such that $p_k \ge 0$ and $\sum_k p_k = 1$.
Let us consider how we might build a simple model that provides probabilistic predictions for the case of binary classification.
With binary classification, we need only predict the probability of one of the categories, $p$, since the probability of the other category is $1 - p$. For simplicity, we will use $+1$ and $-1$ to denote the two possible target values, so that $y_i \in \{-1, +1\}$.
Why Not Linear Regression?¶
First, let’s address: Why can’t we use linear regression?
Given a feature vector $\mathbf{x}$, a linear regression model with weights $\mathbf{w}$ predicts

$$\hat{y} = \mathbf{w}^\top \mathbf{x}.$$

But predictions from this linear regression model don't meet the requirements of a probability distribution. If the feature vector has large enough magnitude, then $\mathbf{w}^\top \mathbf{x}$ will be outside of the zero-to-one range required by a probability distribution.
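For instance, here is a toy illustration (with made-up weights and features) of how a linear model's output escapes the unit interval:

import numpy as np

w = np.array([0.8, 0.3])   # hypothetical regression weights
x = np.array([4.0, 2.0])   # a feature vector with large magnitude
print(np.dot(w, x))        # 3.8 -- not a valid probability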
How can we fix this?
The Logistic Function¶
Suppose $f : \mathbb{R} \to (0, 1)$ is a function where

$$\lim_{u \to -\infty} f(u) = 0$$

and

$$\lim_{u \to +\infty} f(u) = 1.$$

Then given $\mathbf{x}$, a model that predicts

$$P(y = 1 \mid \mathbf{x}) = f\!\left(\mathbf{w}^\top \mathbf{x}\right)$$

meets the requirements of a probability distribution.
Logistic regression uses the function

$$f(u) = \frac{1}{1 + e^{-u}},$$

called the logistic function.
Note
There are other possible functions that are suitable for binary classification. For example, the probit function

$$\Phi(u) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{u} e^{-t^2/2}\, dt,$$

the cumulative distribution function of the standard normal distribution, is also sometimes used.
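As a quick illustration, here is a minimal sketch (not bbai's implementation) of the logistic function and the kind of probabilistic prediction it produces; the weights and feature vector are made up:

import numpy as np

def logistic(u):
    # Maps any real number into the interval (0, 1).
    return 1.0 / (1.0 + np.exp(-u))

w = np.array([0.5, -1.2])      # hypothetical weights
x = np.array([2.0, 1.0])       # hypothetical feature vector
print(logistic(np.dot(w, x)))  # the predicted probability that y = 1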
Fitting Logistic Regression¶
Now that we’ve described what logistic regression is, how do we go about fitting a model to the training data?
Given weights $\mathbf{w}$, we can compute the negative log likelihood of the training data, called a cost function (or sometimes loss function), as

$$L(\mathbf{w}) = -\sum_{i=1}^n \log f(u_i) = \sum_{i=1}^n \log\left(1 + e^{-u_i}\right),$$

where

$$u_i = y_i\, \mathbf{w}^\top \mathbf{x}_i.$$

Let's see how we can find weights that minimize this cost function; or, equivalently, maximize the log likelihood of the training data.
Caution
We'll see later why maximizing the log likelihood of the training data is a bad idea.
We’re going to find optimal weights by using a trust-region optimizer.
Trust-region Optimization¶
Trust-region optimizers are a powerful class of iterative, second-order optimization algorithms.
Given a twice-differentiable objective function, $g(\mathbf{w})$, a trust-region optimizer produces a sequence $\mathbf{w}_0, \mathbf{w}_1, \mathbf{w}_2, \ldots$ that converges to an optimum of the objective. The $k$th iteration of the sequence is generated by updating the previous iteration with a solution to the subproblem

$$\min_{\boldsymbol{\delta}}\; g(\mathbf{w}_{k-1}) + \nabla g(\mathbf{w}_{k-1})^\top \boldsymbol{\delta} + \tfrac{1}{2}\, \boldsymbol{\delta}^\top \nabla^2 g(\mathbf{w}_{k-1})\, \boldsymbol{\delta} \quad \text{subject to} \quad \|\boldsymbol{\delta}\| \le r_k,$$

so that $\mathbf{w}_k = \mathbf{w}_{k-1} + \boldsymbol{\delta}_k$.
The subproblem minimizes the second-order approximation of $g$ at $\mathbf{w}_{k-1}$ within the neighborhood $\|\boldsymbol{\delta}\| \le r_k$, called the trust region. Using the trust region, the optimizer restricts the second-order approximation to areas where it models the objective well, and it continually expands or shrinks the trust region to adapt to the accuracy of the local quadratic approximation as the algorithm progresses.
Trust-region optimizers are both efficient and robust, able to solve convex and non-convex optimization problems alike.
See also
For a full description of trust-region optimization, see
J. Nocedal and S. J. Wright. Numerical Optimization. 1999.
To apply a trust-region optimizer, we need to compute the gradient and Hessian of the cost function $L$.
Cost Function Gradient and Hessian¶
Let's start with the gradient. Differentiating the cost function with respect to one of the weights, we get

$$\frac{\partial L}{\partial w_j}(\mathbf{w}) = -\sum_{i=1}^n \frac{\dot{f}(u_i)}{f(u_i)}\, y_i\, x_{ij},$$

where

$$u_i = y_i\, \mathbf{w}^\top \mathbf{x}_i.$$

Note

We use the notation $\dot{f}$ to denote the first derivative of a single-variable function:

$$\dot{f}(u) = \frac{df}{du}(u).$$

Similarly, $\ddot{f}$ denotes the second derivative.

Now, for the Hessian of $L$, we derive

$$\frac{\partial^2 L}{\partial w_j\, \partial w_{j'}}(\mathbf{w}) = \sum_{i=1}^n a_i\, x_{ij}\, x_{ij'},$$

where

$$a_i = \frac{\dot{f}(u_i)^2 - f(u_i)\, \ddot{f}(u_i)}{f(u_i)^2}.$$

For the logistic function, $a_i$ simplifies to $f(u_i)\left(1 - f(u_i)\right)$.
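To make the formulas concrete, here is a minimal sketch (independent of bbai, with targets coded as $\pm 1$ and made-up toy data) that implements the cost, gradient, and Hessian above with NumPy and hands them to one of SciPy's trust-region optimizers:

import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # expit is the logistic function f

def cost(w, X, y):
    u = y * (X @ w)
    return np.sum(np.logaddexp(0.0, -u))   # sum of log(1 + exp(-u_i))

def grad(w, X, y):
    u = y * (X @ w)
    return -X.T @ (y * (1.0 - expit(u)))

def hess(w, X, y):
    a = expit(y * (X @ w))
    a = a * (1.0 - a)                       # f(u_i)(1 - f(u_i))
    return (X * a[:, None]).T @ X

# Toy data with two features and targets in {-1, +1}.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 2))
y_toy = np.where(X_toy[:, 0] + 0.5 * rng.normal(size=100) > 0, 1.0, -1.0)

res = minimize(cost, np.zeros(2), args=(X_toy, y_toy), jac=grad, hess=hess,
               method='trust-exact')
print(res.x)  # fitted weights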
Example: Breast Cancer Diagnostic (w/o Regularization)¶
Let’s see how this works on an example data set.
Note
We use the bbai package to fit logistic regression models. See Getting Started for instructions on how to install it.
We'll look at the Wisconsin Breast Cancer Data Set, where the task is to determine whether a breast mass is malignant or not from physical properties of the mass.
The data set is bundled with sklearn and we start by loading and normalizing the data.
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
data = load_breast_cancer()
X = data['data']
X = StandardScaler().fit_transform(X)
y = data['target']
Next, we fit a logistic regression model to the data and print out the weights.
import numpy as np
import bbai.glm

model = bbai.glm.LogisticRegression(
    C = np.inf, # This specifies to use no regularization.
                # We'll explain later what this means.
)
model.fit(X, y)

print(model.intercept_[0], "intercept")
for wi, name in zip(model.coef_[0], data['feature_names']):
    print(wi, name)
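Once fit, the model exposes the familiar scikit-learn-style interface; for example, predict_proba (used again later in this guide) returns one probability per class for each sample:

print(model.predict_proba(X[:3]))  # each row holds the two class probabilities and sums to one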
Overfitting and Regularization¶
Question: What’s the probability of events A, B, and C if we know the probability of event B is zero?
It’s zero – and this leads us to the problem with fitting models to maximize the likelihood of the training data.
If the training data set lacks less likely feature-category combinations, our model may overconfidently predict a category probability to be close to zero when it isn’t. Then when the combination occurs in out-of-sample data, our model erroneously predicts the data to have near zero probability.
This problem is known as overfitting.
One way to control overfitting is to use a technique called regularization. With regularization, we change the cost function so that instead of minimizing the negative log likelihood of the training data, we minimize

$$L_\alpha(\mathbf{w}) = L(\mathbf{w}) + \frac{\alpha}{2}\, \|\mathbf{w}\|^2, \qquad \alpha = \frac{1}{C}.$$

$C$ is called a hyperparameter, and it acts as a knob that controls the complexity of the model. When the knob is dialed up to full regularization strength (i.e. $C$ is close to zero), it forces the model to take its simplest form, $\mathbf{w} \approx \mathbf{0}$. On the other hand, when regularization is at its weakest (i.e. $C$ is large), it's as if regularization isn't there.
By setting $C$ to a good value, we can prevent overfitting while still keeping the predictive strength of our model.
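In code, the regularized objective is just the negative log likelihood plus a penalty term; a minimal sketch, assuming the $\pm 1$ target coding from earlier and the convention $\alpha = 1/C$ used throughout this guide:

import numpy as np

def regularized_cost(w, X, y, C):
    # Negative log likelihood plus an L2 penalty with strength alpha = 1 / C.
    alpha = 1.0 / C
    u = y * (X @ w)
    return np.sum(np.logaddexp(0.0, -u)) + 0.5 * alpha * np.dot(w, w)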
Cross-validation¶
How do we find a good value of $C$?
This is the basic idea: We construct a function $\mathrm{err}(C)$ that estimates the out-of-sample prediction error for a given $C$. Then we select $C$ so as to minimize $\mathrm{err}(C)$. This process is called hyperparameter optimization.
One way to estimate out-of-sample prediction error is with a technique called cross-validation. Cross-validation partitions the training data set into folds. For each fold, we form two sets of data: one consisting of the training data points outside of the fold and one consisting of the training data points in the fold. We fit a model to the first set and measure the prediction error on the second and then average all the fold prediction errors to form the out-of-sample prediction error estimate.
Cross-validation comes in different forms. K-fold cross-validation partitions the training data set into $K$ different folds. The larger the value of $K$, the more accurate the estimate; and the largest value of $K$ we can use is $K = n$, the number of training data points, which gives us leave-one-out cross-validation.
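To make the procedure concrete, here is a sketch of a K-fold estimate of the out-of-sample log likelihood for a given value of C, using scikit-learn's KFold helper (nothing here is specific to bbai beyond the model class used earlier):

import numpy as np
import bbai.glm
from sklearn.model_selection import KFold

def kfold_log_likelihood(X, y, C, n_splits=10):
    # Total held-out log likelihood summed across the K folds.
    total = 0.0
    splitter = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_index, test_index in splitter.split(X):
        model = bbai.glm.LogisticRegression(C=C)
        model.fit(X[train_index], y[train_index])
        probs = model.predict_proba(X[test_index])
        total += np.sum(np.log(probs[np.arange(len(test_index)), y[test_index]]))
    return total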
While leave-one-out cross-validation provides the best estimate for out-of-sample prediction error, it is expensive to compute. It requires that we train a different model for each data point in the training data set. Fortunately, for logistic regression, there exists a different form of cross-validation that is nearly as accurate as leave-one-out, but much more efficient to compute.
See also
For a comparison of leave-one-out cross-validation with K-fold, see the paper
Approximate Leave-one-out Cross-validation¶
Enter Approximate Leave-one-out Cross-validation (ALO). It has much of the benefit of leave-one-out cross-validation, while being efficient to compute and optimize. Here's how it works: Given $C$, we first fit logistic regression by minimizing the regularized cost function on the whole training data set. Let $\hat{\mathbf{w}}$ denote the optimal weights

$$\hat{\mathbf{w}} = \operatorname*{arg\,min}_{\mathbf{w}} L_\alpha(\mathbf{w}),$$

and let $H$ denote the Hessian of the cost function at the optimum,

$$H = \nabla^2 L_\alpha(\hat{\mathbf{w}}).$$

Note: The gradient $\nabla L_\alpha(\hat{\mathbf{w}})$ will be zero.
Now, suppose we modify the cost function by removing the $i$th training data point. We call this new cost function

$$L_\alpha^{-i}(\mathbf{w}) = L_\alpha(\mathbf{w}) - \log\left(1 + e^{-y_i \mathbf{w}^\top \mathbf{x}_i}\right),$$

and we can compute its gradient and Hessian at $\hat{\mathbf{w}}$:

$$\nabla L_\alpha^{-i}(\hat{\mathbf{w}}) = \left(1 - f(\hat{u}_i)\right) y_i\, \mathbf{x}_i, \qquad \nabla^2 L_\alpha^{-i}(\hat{\mathbf{w}}) = H - f(\hat{u}_i)\left(1 - f(\hat{u}_i)\right) \mathbf{x}_i \mathbf{x}_i^\top,$$

where $\hat{u}_i = y_i\, \hat{\mathbf{w}}^\top \mathbf{x}_i$.
Using the gradient and Hessian of $L_\alpha^{-i}$, we can take a single step of Newton's optimization algorithm and get a close approximation of what the weights would be without the $i$th data point,

$$\tilde{\mathbf{w}}_{-i} = \hat{\mathbf{w}} - \left[\nabla^2 L_\alpha^{-i}(\hat{\mathbf{w}})\right]^{-1} \nabla L_\alpha^{-i}(\hat{\mathbf{w}}),$$

and then estimate the leave-one-out cross-validation with

$$\mathrm{ALO}(C) = \sum_{i=1}^n \log f\!\left(y_i\, \tilde{\mathbf{w}}_{-i}^\top \mathbf{x}_i\right),$$

the approximate out-of-sample log likelihood.
Note
Here's a quick refresher on Newton's method.
Let $g$ denote the function we're minimizing. The second-order Taylor series approximation centered at a point $\mathbf{w}_0$ is

$$g(\mathbf{w}) \approx g(\mathbf{w}_0) + \nabla g(\mathbf{w}_0)^\top (\mathbf{w} - \mathbf{w}_0) + \tfrac{1}{2}\, (\mathbf{w} - \mathbf{w}_0)^\top \nabla^2 g(\mathbf{w}_0)\, (\mathbf{w} - \mathbf{w}_0).$$

If $\nabla^2 g(\mathbf{w}_0)$ is positive-definite, then the approximation is minimized when its gradient with respect to $\mathbf{w}$ is zero, or equivalently

$$\mathbf{w} = \mathbf{w}_0 - \left[\nabla^2 g(\mathbf{w}_0)\right]^{-1} \nabla g(\mathbf{w}_0).$$
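In code, the single Newton step is one linear solve; a minimal sketch where grad_minus_i and hess_minus_i stand for the gradient and Hessian of the cost function with the $i$th point removed (hypothetical variables, computed as above):

import numpy as np

def newton_step(w_hat, grad_minus_i, hess_minus_i):
    # One Newton step from the full-data optimum toward the leave-one-out optimum.
    return w_hat - np.linalg.solve(hess_minus_i, grad_minus_i)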
Using this formula we avoid having to fit a model for each data point, but we still need to invert a $p$-by-$p$ matrix $n$ times. We can achieve an even more efficient formula that only requires a single matrix inversion by applying the matrix inversion lemma to get

$$y_i\, \tilde{\mathbf{w}}_{-i}^\top \mathbf{x}_i = \hat{u}_i - \frac{\left(1 - f(\hat{u}_i)\right) h_i}{1 - f(\hat{u}_i)\left(1 - f(\hat{u}_i)\right) h_i},$$

where

$$h_i = \mathbf{x}_i^\top H^{-1} \mathbf{x}_i.$$

Note

If $A$ is an invertible square matrix, $\mathbf{u}$ and $\mathbf{v}$ are vectors, and $A + \mathbf{u}\mathbf{v}^\top$ is invertible, then the matrix inversion lemma says

$$\left(A + \mathbf{u}\mathbf{v}^\top\right)^{-1} = A^{-1} - \frac{A^{-1}\mathbf{u}\mathbf{v}^\top A^{-1}}{1 + \mathbf{v}^\top A^{-1} \mathbf{u}}.$$

Thus,

$$\left[H - f(\hat{u}_i)\left(1 - f(\hat{u}_i)\right)\mathbf{x}_i\mathbf{x}_i^\top\right]^{-1} = H^{-1} + \frac{f(\hat{u}_i)\left(1 - f(\hat{u}_i)\right) H^{-1}\mathbf{x}_i\mathbf{x}_i^\top H^{-1}}{1 - f(\hat{u}_i)\left(1 - f(\hat{u}_i)\right) h_i}.$$
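The lemma is easy to sanity-check numerically; a small sketch with random matrices:

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4)) + 4.0 * np.eye(4)   # a well-conditioned square matrix
u = rng.normal(size=4)
v = rng.normal(size=4)

A_inv = np.linalg.inv(A)
lhs = np.linalg.inv(A + np.outer(u, v))
rhs = A_inv - np.outer(A_inv @ u, v @ A_inv) / (1.0 + v @ A_inv @ u)
print(np.allclose(lhs, rhs))  # True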
See also
For more background on approximate leave-one-out cross-validation, see the paper
A scalable estimate of the extra-sample prediction error via approximate leave-one-out.
Optimizing Approximate Leave-one-out Cross-validation¶
But accurately estimating out-of-sample prediction error is just one part of hyperparameter optimization. Now that we have our efficient ALO estimator, how do we find the hyperparameters that minimize the estimated prediction error?
Fortunately, ALO isn't just efficient to compute. We can also compute exact derivatives of ALO with respect to the hyperparameter, making it extremely efficient to dial in the hyperparameter value that minimizes estimated prediction error.
First, we make a minor change to the way the cost function is parameterized: we express the regularization strength through a single unconstrained parameter $t$ (for example, $\alpha = e^{-t}$, so that $t = \log C$). Let $\hat{\mathbf{w}}(t)$ denote the weights that minimize the reparameterized cost function, and define $A(t)$ to be the negative approximate leave-one-out log likelihood evaluated at $\hat{\mathbf{w}}(t)$, that is, $A(t) = -\mathrm{ALO}(e^t)$. $A(t)$ is our ALO measurement and what we need to minimize.
To optimize $A(t)$, we can again apply trust-region optimization using the first and second derivatives $\frac{dA}{dt}$ and $\frac{d^2 A}{dt^2}$.
Note
While efficient to compute, the derivative formulas for $A(t)$ are quite involved, so we won't derive them in this guide. But if you want all the details, check out the paper
Optimizing Approximate Leave-one-out Cross-validation to Tune Hyperparameters.
Example: Breast Cancer Diagnostic (with Regularization)¶
Let’s now revisit the Breast Cancer Data Set and use regularization.
We again start by loading and normalizing the data set.
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
data = load_breast_cancer()
X = data['data']
X = StandardScaler().fit_transform(X)
y = data['target']
While a lot happens behind the scenes, from the user's point of view, fitting a model to minimize ALO is easy. We'll fit a model and print out the value of $C$ found.
import bbai.glm
model = bbai.glm.LogisticRegression()
# Note: bbai.glm.LogisticRegression defaults to find the C that
# minimizes ALO if none is specified.
model.fit(X, y)
print("C =", model.C_)
This prints out
The Breast Cancer Data Set is small enough that we can compute the leave-one-out cross-validation across a range of values of $C$ by brute force. Let's plot it out, compare it to ALO, and see where the value we found for $C$ lies along the curve.
First, we write the function to compute leave-one-out cross-validation by brute force.
import numpy as np
def compute_loocv(X, y, C):
    model = bbai.glm.LogisticRegression(C=C)
    n = len(y)
    loo_likelihoods = []
    for i in range(n):
        train_indexes = [i_p for i_p in range(n) if i_p != i]
        test_indexes = [i]
        X_train, X_test = X[train_indexes], X[test_indexes]
        y_train, y_test = y[train_indexes], y[test_indexes]
        model.fit(X_train, y_train)
        pred = model.predict_proba(X_test)[0]
        loo_likelihoods.append(pred[y_test[0]])
    return sum(np.log(loo_likelihoods))
Next, we write the function to compute ALO using the efficient formula.
import scipy.linalg
import scipy.special

def fit_logistic_regression(X, y, C):
    model = bbai.glm.LogisticRegression(C=C)
    model.fit(X, y)
    return np.array(list(model.coef_[0]) + list(model.intercept_))

def compute_hessian(p_vector, X, alpha):
    n, k = X.shape
    a_vector = np.sqrt((1 - p_vector)*p_vector)
    R = scipy.linalg.qr(a_vector.reshape((n, 1))*X, mode='r')[0]
    H = np.dot(R.T, R)
    for i in range(k-1):
        H[i, i] += alpha
    return H
def compute_alo(X, y, C):
    alpha = 1.0 / C
    w = fit_logistic_regression(X, y, C)
    X = np.hstack((X, np.ones((X.shape[0], 1))))
    n = X.shape[0]
    y = 2*y - 1
    u_vector = np.dot(X, w)
    p_vector = scipy.special.expit(u_vector*y)
    H = compute_hessian(p_vector, X, alpha)
    L = np.linalg.cholesky(H)
    T = scipy.linalg.solve_triangular(L, X.T, lower=True)
    h_vector = np.array([np.dot(ti, ti) for ti in T.T])
    loo_u_vector = u_vector - \
        y * (1 - p_vector)*h_vector / (1 - p_vector*(1 - p_vector)*h_vector)
    loo_likelihoods = scipy.special.expit(y*loo_u_vector)
    return sum(np.log(loo_likelihoods))
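As an aside, with compute_alo in hand we could also tune C by a direct one-dimensional numeric search; a sketch using SciPy's bounded scalar minimizer (bbai itself uses exact first and second derivatives of ALO, which is far more efficient):

from scipy.optimize import minimize_scalar

result = minimize_scalar(lambda log_C: -compute_alo(X, y, np.exp(log_C)),
                         bounds=(-5.0, 5.0), method='bounded')
print("C (numeric search) =", np.exp(result.x))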
Now, let's plot out leave-one-out cross-validation and ALO. We'll draw a vertical line to indicate the optimal value of $C$ we found.
import matplotlib.pyplot as plt
Cs = np.arange(0.1, 2.0, 0.1)
loocvs = [compute_loocv(X, y, C) for C in Cs]
alos = [compute_alo(X, y, C) for C in Cs]
fig, ax = plt.subplots()
ax.plot(Cs, loocvs, label='LOOCV', marker='o')
ax.plot(Cs, alos, label='ALO', marker='x')
ax.axvline(model.C_, color='tab:green', label='C_opt')
ax.set_xlabel('C')
ax.set_ylabel('Log-Likelihood')
ax.set_title("Breast Cancer Dataset")
ax.legend()
This displays
Handling More Than Two Classes¶
The logistic function we presented only works with two classes. So what do we do when we have more than two classes?
Let $K$ denote the number of classes and let $W = (\mathbf{w}_1, \ldots, \mathbf{w}_K)$ represent a matrix of weights, with one weight vector per class. We compute the probability of the $k$th class for the $i$th data entry as

$$p_{ik} = \frac{\exp\!\left(\mathbf{w}_k^\top \mathbf{x}_i\right)}{\sum_{k'=1}^{K} \exp\!\left(\mathbf{w}_{k'}^\top \mathbf{x}_i\right)}.$$

This is called the softmax function, and we call the version of logistic regression that uses it multinomial logistic regression. Put

$$z_{ik} = \begin{cases} 1 & \text{if } y_i = k, \\ 0 & \text{otherwise.} \end{cases}$$

The cost function for multinomial logistic regression is

$$L(W) = -\sum_{i=1}^n \sum_{k=1}^K z_{ik} \log p_{ik}.$$

We compute its gradient as

$$\frac{\partial L}{\partial w_{kj}}(W) = \sum_{i=1}^n \left(p_{ik} - z_{ik}\right) x_{ij},$$

and we compute the Hessian as

$$\frac{\partial^2 L}{\partial w_{kj}\, \partial w_{k'j'}}(W) = \sum_{i=1}^n p_{ik}\left(\delta_{kk'} - p_{ik'}\right) x_{ij}\, x_{ij'},$$

where $\delta_{kk'}$ equals $1$ if $k = k'$ and $0$ otherwise.
We can similarly adapt ALO for multinomial logistic regression.
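Here is a minimal sketch of the softmax computation (made-up weights, not bbai's implementation) for a single feature vector:

import numpy as np

def softmax_probabilities(W, x):
    # One row of weights per class; returns a probability for each class.
    scores = W @ x
    scores -= scores.max()          # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

W = np.array([[0.2, -0.5], [1.0, 0.3], [-0.7, 0.9]])  # hypothetical weights for 3 classes
x = np.array([1.5, -0.2])
print(softmax_probabilities(W, x))  # the class probabilities sum to one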
Example: Iris Species¶
We’ll apply multinomial logistic regression to fit a model to the Iris Data Set.
Load and normalize the data set.
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
data = load_iris()
X = data['data']
X = StandardScaler().fit_transform(X)
y = data['target']
Fit multinomial logistic regression with $C$ set so as to minimize the approximate leave-one-out cross-validation error.
import bbai.glm
model = bbai.glm.LogisticRegression()
model.fit(X, y)
print(model.C_)
Prints