Logistic Regression: The Definitive Guide (2021)

This guide will cover

  • What is logistic regression

  • How to fit logistic regression models

  • How to apply the latest research in hyperparameter optimization to prevent overfitting

  • How to use logistic regression with more than two classes


Introduction

Logistic regression is a supervised learning algorithm for classification tasks.

In classification tasks, we’re given a sample training data set consisting of feature vectors

$$\boldsymbol{x}_1, \boldsymbol{x}_2, \dots, \boldsymbol{x}_n, \quad \boldsymbol{x}_i \in \mathbb{R}^p,$$

and categorical target values

$$y_1, y_2, \dots, y_n, \quad y_i \in \left\{0, 1, \dots, K-1\right\}.$$

We seek to fit a model so that given new out-of-sample feature vectors $\boldsymbol{x}_1', \dots, \boldsymbol{x}_m'$ we can predict their associated target values.

Real-world examples of classification tasks include

  • The Wisconsin Breast Cancer Data Set where the task is to predict whether a breast mass is malignant or benign given physical properties of the mass such as thickness and shape.

  • The Iris Data Set where the task is to predict what species of iris a flower is given its sepal and petal length and width.

What form could a categorical prediction from a model take? The simplest option is for a model to predict the single most likely category; more useful, though, is for a model to predict a probability distribution, where for each category $C_k$ the model assigns a probability $p_k$ such that

$$0 \leq p_k \leq 1 \quad \textrm{and} \quad \sum_k p_k = 1.$$

Let us consider how we might build a simple model that provides probabilistic predictions for the case of binary classification.

With binary classification, we need only predict the probability of one of the categories, $p_1$, since $p_0 = 1 - p_1$. For simplicity, we will use $-1$ and $+1$ to denote the two possible target values so that

$$y_i \in \left\{-1, +1\right\}.$$

Why Not Linear Regression?

First, let’s address: Why can’t we use linear regression?

Given a feature vector $\boldsymbol{x}$, a linear regression model with weights $\boldsymbol{w}$ predicts

$$p_1 = \boldsymbol{x}^\top \boldsymbol{w}.$$

But predictions from this linear regression model don’t meet the requirements of a probability distribution. If the feature vector $\boldsymbol{x}$ has large enough magnitude, then $p_1$ will be outside of the zero to one range required by a probability distribution.

How can we fix this?

The Logistic Function

Suppose $g\colon \mathbb{R} \to \mathbb{R}$ is a function where

$$g(u) \leq g(u'), \quad \textrm{if}\; u \leq u',$$

and

$$0 \leq g(u) \leq 1, \quad \textrm{for all}\; u \in \mathbb{R}.$$

Then given $\boldsymbol{w} \in \mathbb{R}^p$, a model that predicts

$$p_1 = g(\boldsymbol{x}^\top \boldsymbol{w})$$

meets the requirements of a probability distribution.

Logistic regression uses the function

$$g(u) = \frac{1}{1 + \exp(-u)}$$

called the logistic function.
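
As a quick sanity check, here is a small Python sketch (illustrative only) verifying that the logistic function satisfies the two requirements above: it is non-decreasing and stays between zero and one.

import numpy as np

def logistic(u):
    return 1.0 / (1.0 + np.exp(-u))

u = np.linspace(-10.0, 10.0, 1001)
g = logistic(u)
print(np.all(np.diff(g) >= 0))        # True: non-decreasing
print(0 <= g.min() and g.max() <= 1)  # True: bounded between 0 and 1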

Note

There are other possible functions that are suitable for binary classification. For example, the probit function

$$g(u) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^u \exp\left(-\frac{1}{2} Z^2\right) dZ$$

is also sometimes used.

Fitting Logistic Regression

Now that we’ve described what logistic regression is, how do we go about fitting a model to the training data?

Given weights $\boldsymbol{w}$, we can compute the negative log likelihood of the training data, called a cost function (or sometimes loss function), as

$$J(\boldsymbol{w}) = \sum_{i=1}^n \ell_i\left(\boldsymbol{x}_i^\top \boldsymbol{w}\right),$$

where

$$\ell_i(u) = \log\left[1 + \exp(-y_i u)\right].$$
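
As a concrete illustration (a minimal sketch, not the implementation bbai uses), the cost function can be computed with a few lines of numpy, assuming the targets are coded as $-1/+1$:

import numpy as np

def cost(w, X, y):
    # J(w) = sum_i log(1 + exp(-y_i * x_i^T w))
    u = X @ w
    return np.sum(np.log1p(np.exp(-y * u)))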

Let’s see how we can find weights that minimize this cost function; or, equivalently, maximize the log likelihood of the training data.

Caution

We’ll see later why maximizing the log likelihood of the training data is a bad idea.

We’re going to find optimal weights by using a trust-region optimizer.

Trust-region Optimization

Trust-region optimizers are a powerful class of iterative, second-order optimization algorithms.

Given a twice-differentiable objective function, $f\colon \mathbb{R}^p \to \mathbb{R}$, a trust-region optimizer produces a sequence $\{\boldsymbol{x}_k\}$ that converges to an optimum of the objective. The $k$th iteration of the sequence is generated by updating the previous iteration with a solution to the subproblem

$$\begin{align*} \boldsymbol{x}_k &= \boldsymbol{x}_{k-1} + \hat{\boldsymbol{s}} \quad\textrm{and} \\ \hat{\boldsymbol{s}} &= \operatorname*{argmin}_{\boldsymbol{s}} \left\{ f\left(\boldsymbol{x}_{k-1}\right) + \nabla f\left(\boldsymbol{x}_{k-1}\right)^\top \boldsymbol{s} + \frac{1}{2} \boldsymbol{s}^\top \nabla^2 f\left(\boldsymbol{x}_{k-1}\right) \boldsymbol{s} \right\} \\ &\phantom{=}\quad\mathrm{s.t.} \quad \| \boldsymbol{s} \| \le \delta_k. \end{align*}$$

The subproblem minimizes the second-order approximation of $f$ at $\boldsymbol{x}_{k-1}$ within the neighborhood $\| \boldsymbol{s} \| \le \delta_k$, called the trust-region. Using the trust-region, the optimizer restricts the second-order approximation to areas where it models $f$ well, and it continually expands or shrinks the trust-region to adapt to the accuracy of the local quadratic approximation as the algorithm progresses.

Trust-region optimizers are both efficient and robust, being able to solve convex and non-convex optimization problems.
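
To get a feel for how such an optimizer is invoked, here is a tiny example using scipy's trust-exact method on the Rosenbrock test function (this is just an illustration; it is not necessarily the solver bbai uses internally).

import numpy as np
from scipy.optimize import minimize, rosen, rosen_der, rosen_hess

# Minimize the Rosenbrock function with an explicit gradient and hessian.
result = minimize(rosen, np.array([-1.2, 1.0]), method='trust-exact',
                  jac=rosen_der, hess=rosen_hess)
print(result.x)  # approximately [1. 1.]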

See also

For a full description of trust-region optimization, see

  1. J. Nocedal and S. J. Wright. Numerical Optimization. 1999.

To apply a trust-region optimizer, we need to compute the gradient and hessian of the cost function $J(\boldsymbol{w})$.

Cost Function Gradient and Hessian

Let’s start with the gradient. Differentiating the cost function with respect to one of the weights, we get

$$\begin{align*} \frac{\partial}{\partial w_s} J(\boldsymbol{w}) &= \sum_{i=1}^n \frac{\partial}{\partial w_s} \ell_i\left(\boldsymbol{x}_i^\top \boldsymbol{w}\right) \\ &= \sum_{i=1}^n x_{is} \dot{\ell}_i\left(\boldsymbol{x}_i^\top \boldsymbol{w}\right), \end{align*}$$

where

$$\dot{\ell}_i(u) = \frac{-y_i}{1 + \exp(y_i u)}.$$

Note

We use the notation $\dot{f}(u)$ to denote the first derivative of a single-variable function:

$$\dot{f}(u) = \frac{d}{du} f(u).$$

Similarly, $\ddot{f}(u)$ denotes the second derivative.

Now, for the hessian of $J$, we derive

$$\begin{align*} \frac{\partial^2}{\partial w_s \partial w_t} J(\boldsymbol{w}) &= \frac{\partial}{\partial w_s}\left( \sum_{i=1}^n x_{it} \dot{\ell}_i\left(\boldsymbol{x}_i^\top \boldsymbol{w}\right) \right) \\ &= \sum_{i=1}^n x_{is} x_{it} \ddot{\ell}_i\left(\boldsymbol{x}_i^\top \boldsymbol{w}\right), \end{align*}$$

where

$$\begin{align*} \ddot{\ell}_i(u) &= q p, \\ p &= \frac{1}{1 + \exp(-u)}, \quad\textrm{and} \\ q &= 1 - p. \end{align*}$$
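
Putting the pieces together, here is a minimal sketch (on synthetic data, illustrative only) that codes up the cost, gradient, and hessian above and hands them to scipy's trust-exact solver; bbai's internal solver will differ.

import numpy as np
from scipy.optimize import minimize

def cost(w, X, y):
    return np.sum(np.log1p(np.exp(-y * (X @ w))))

def gradient(w, X, y):
    u = X @ w
    d1 = -y / (1.0 + np.exp(y * u))      # \dot{\ell}_i(u_i)
    return X.T @ d1

def hessian(w, X, y):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    d2 = p * (1.0 - p)                   # \ddot{\ell}_i(u_i)
    return X.T @ (d2[:, None] * X)

# Synthetic -1/+1 classification data for the demonstration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = np.where(rng.uniform(size=200) < 1.0 / (1.0 + np.exp(-X @ w_true)), 1, -1)

result = minimize(cost, np.zeros(3), args=(X, y), method='trust-exact',
                  jac=gradient, hess=hessian)
print(result.x)  # roughly recovers w_true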

Example: Breast Cancer Diagnostic (w/o Regularization)

Let’s see how this works on an example data set.

Note

We use the bbai package to fit logistic regression models. See Getting Started for instructions on how to install it.

We’ll look at the Wisconsin Breast Cancer Data Set where the task is to determine whether a breast mass is malignant or benign from physical properties of the mass.

The data set is bundled with sklearn and we start by loading and normalizing the data.

from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = data['data']
X = StandardScaler().fit_transform(X)
y = data['target']

Next, we fit a logistic regression model to the data and print out the weights.

import numpy as np
import bbai.glm

model = bbai.glm.LogisticRegression(
    C = np.inf, # This specifies to use no regularization.
                # We'll explain later what this means.
)
model.fit(X, y)

print(model.intercept_[0], "intercept")
for wi, name in zip(model.coef_[0], data['feature_names']):
  print(wi, name)
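
With the model fit, we can also ask for probability predictions (shown here on the first few training rows; the exact numbers depend on the fit):

print(model.predict_proba(X[:3]))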

Overfitting and Regularization

Question: What’s the probability of events A, B, and C if we know the probability of event B is zero?

It’s zero – and this leads us to the problem with fitting models to maximize the likelihood of the training data.

If the training data set happens to lack certain less likely feature-category combinations, our model may overconfidently drive the predicted probability of those combinations toward zero. Then, when such a combination does occur in out-of-sample data, the model erroneously assigns it near-zero probability.

This problem is known as overfitting.

One way to control overfitting is to use a technique called regularization. With regularization, we change the cost function so that instead of minimizing the negative log likelihood of the training data, we minimize

$$J(\boldsymbol{w}) = \sum_{i=1}^n \ell_i\left(\boldsymbol{x}_i^\top \boldsymbol{w}\right) + \frac{1}{2C} \| \boldsymbol{w} \|^2.$$

$C$ is called a hyperparameter, and it acts as a knob that controls the complexity of the model. When the knob is dialed up to full regularization strength (i.e. $C$ is close to zero), it forces the model to take its simplest form, $\boldsymbol{w} = 0$. On the other hand, when regularization is at its weakest (i.e. $C$ is large), it’s as if regularization isn’t there.

By setting $C$ to a good value, we can prevent overfitting while still keeping the predictive strength of our model.
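
To see the knob in action, we can refit the model from the previous example at a few values of $C$ and watch the overall magnitude of the weights shrink as the regularization strength increases (the exact numbers are illustrative):

import numpy as np
import bbai.glm

for C in [1000.0, 1.0, 0.01]:
    model = bbai.glm.LogisticRegression(C=C)
    model.fit(X, y)
    print(C, np.linalg.norm(model.coef_[0]))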

Cross-validation

How do we find a good value of $C$?

This is the basic idea: We construct a function $f(C)$ that estimates the out-of-sample prediction error for a given $C$. Then we select $C$ so as to minimize $f$. This process is called hyperparameter optimization.

One way to estimate out-of-sample prediction error is with a technique called cross-validation. Cross-validation partitions the training data set into folds. For each fold, we form two sets of data: one consisting of the training data points outside of the fold and one consisting of the training data points in the fold. We fit a model to the first set and measure the prediction error on the second and then average all the fold prediction errors to form the out-of-sample prediction error estimate.

Cross-validation comes in different forms. K-fold cross-validation partitions the training data set into $k$ different folds. The larger the value of $k$, the more accurate the estimate; the largest value we can use is $k = n$, where each fold contains a single data point. This case is called leave-one-out cross-validation.
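
As a rough sketch (using scikit-learn's KFold splitter together with the bbai model used throughout this guide), k-fold cross-validation of the log likelihood for a given $C$ might look like this; it assumes the targets are coded as integers starting at 0.

import numpy as np
from sklearn.model_selection import KFold
import bbai.glm

def kfold_log_likelihood(X, y, C, k=10):
    total = 0.0
    for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
        model = bbai.glm.LogisticRegression(C=C)
        model.fit(X[train_idx], y[train_idx])
        probs = model.predict_proba(X[test_idx])
        # Sum the log probability assigned to each held-out target.
        total += np.sum(np.log(probs[np.arange(len(test_idx)), y[test_idx]]))
    return total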

While leave-one-out cross-validation provides the best estimate for out-of-sample prediction error, it is expensive to compute. It requires that we train a different model for each data point in the training data set. Fortunately, for logistic regression, there exists a different form of cross-validation that is nearly as accurate as leave-one-out, but much more efficient to compute.

See also

For a comparison of leave-one-out cross-validation with K-fold, see the paper

Error bounds in estimating the out-of-sample prediction error using leave-one-out cross validation in high-dimensions.

Approximate Leave-one-out Cross-validation

Enter Approximate Leave-one-out Cross-validation (ALO). It has many of the benefits of leave-one-out cross-validation while being efficient to compute and optimize. Here’s how it works: Given $C$, we first fit logistic regression by minimizing the regularized cost function on the whole training data set. Let $\hat{\boldsymbol{w}}$ denote the optimal weights

$$\hat{\boldsymbol{w}} = \operatorname*{argmin}_{\boldsymbol{w}} J\left(\boldsymbol{w}\right),$$

and $\boldsymbol{H}$ denote the hessian of the cost function at the optimum

$$\boldsymbol{H} = \nabla^2 J(\hat{\boldsymbol{w}}).$$

Note: The gradient of the cost function at $\hat{\boldsymbol{w}}$ will be zero.

Now, suppose we modify the cost function by removing the $i$th training data point. We call this new cost function $J_{/i}$

$$J_{/i}\left(\boldsymbol{w}\right) = J\left(\boldsymbol{w}\right) - \ell_i\left(\boldsymbol{x}_i^\top \boldsymbol{w}\right),$$

and we can compute its gradient and hessian at $\hat{\boldsymbol{w}}$:

$$\begin{align*} \nabla J_{/i}\left(\hat{\boldsymbol{w}}\right) &= -\dot{\ell}_i\left(\boldsymbol{x}_i^\top \hat{\boldsymbol{w}}\right) \boldsymbol{x}_i, \\ \nabla^2 J_{/i}\left(\hat{\boldsymbol{w}}\right) &= \boldsymbol{H} - \ddot{\ell}_i\left(\boldsymbol{x}_i^\top \hat{\boldsymbol{w}}\right) \boldsymbol{x}_i \boldsymbol{x}_i^\top. \end{align*}$$

Using the gradient and hessian of $J_{/i}$, we can take a single step of Newton’s optimization algorithm and get a close approximation for what the weights would be without the $i$th data point

$$\tilde{\boldsymbol{w}}_{/i} = \hat{\boldsymbol{w}} - \left[ \boldsymbol{H} - \ddot{\ell}_i\left(\boldsymbol{x}_i^\top \hat{\boldsymbol{w}}\right) \boldsymbol{x}_i \boldsymbol{x}_i^\top \right]^{-1} \left( -\dot{\ell}_i\left(\boldsymbol{x}_i^\top \hat{\boldsymbol{w}}\right) \boldsymbol{x}_i \right),$$

and then estimate the leave-one-out cross-validation with

$$\frac{1}{n}\sum_{i=1}^n \ell_i\left(\boldsymbol{x}_i^\top \tilde{\boldsymbol{w}}_{/i}\right).$$
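
A direct (and deliberately unoptimized) translation of these two formulas into numpy might look like the sketch below; it assumes the optimal weights w_hat and hessian H for the full regularized cost have already been computed and that the targets are coded as $-1/+1$.

import numpy as np

def alo_naive(X, y, w_hat, H):
    n = X.shape[0]
    total = 0.0
    for i in range(n):
        x_i, y_i = X[i], y[i]
        u_i = x_i @ w_hat
        d1 = -y_i / (1.0 + np.exp(y_i * u_i))   # \dot{\ell}_i(u_i)
        p = 1.0 / (1.0 + np.exp(-u_i))
        d2 = p * (1.0 - p)                      # \ddot{\ell}_i(u_i)
        # One Newton step on the cost with data point i removed.
        H_i = H - d2 * np.outer(x_i, x_i)
        w_loo = w_hat - np.linalg.solve(H_i, -d1 * x_i)
        total += np.log1p(np.exp(-y_i * (x_i @ w_loo)))
    return total / n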

Note

Here’s a quick refresher on Newton’s method.

Let $f(\boldsymbol{x})$ denote the function we’re minimizing. The second-order Taylor series approximation centered at the point $\boldsymbol{x}_0$ is

$$f(\boldsymbol{x}_0 + \boldsymbol{h}) \approx f(\boldsymbol{x}_0) + \nabla f(\boldsymbol{x}_0)^\top \boldsymbol{h} + \frac{1}{2} \boldsymbol{h}^\top \nabla^2 f(\boldsymbol{x}_0) \boldsymbol{h}.$$

If $\nabla^2 f(\boldsymbol{x}_0)$ is positive-definite, then the approximation is minimized when the gradient with respect to $\boldsymbol{h}$ is zero, or equivalently

$$\boldsymbol{h} = -\left(\nabla^2 f(\boldsymbol{x}_0)\right)^{-1} \nabla f(\boldsymbol{x}_0).$$

Using this formula we avoid having to fit a model for each data point, but we still need to invert a $p$-by-$p$ matrix $n$ times. We can achieve an even more efficient formula that only requires a single matrix inversion by applying the matrix inversion lemma to get

$$\frac{1}{n}\sum_{i=1}^n \ell_i\left( u_i + \frac{\dot{\ell}_i(u_i) h_i}{1 - \ddot{\ell}_i(u_i) h_i} \right),$$

where

$$u_i = \boldsymbol{x}_i^\top \hat{\boldsymbol{w}} \quad\textrm{and}\quad h_i = \boldsymbol{x}_i^\top \boldsymbol{H}^{-1} \boldsymbol{x}_i.$$

Note

If $\boldsymbol{A} \in \mathbb{R}^{n \times n}$ is an invertible square matrix, $\boldsymbol{u}, \boldsymbol{v} \in \mathbb{R}^n$ are vectors, and $\boldsymbol{A} + \boldsymbol{u} \boldsymbol{v}^\top$ is invertible, then the matrix inversion lemma says

$$\left(\boldsymbol{A} + \boldsymbol{u} \boldsymbol{v}^\top\right)^{-1} = \boldsymbol{A}^{-1} - \frac{\boldsymbol{A}^{-1} \boldsymbol{u} \boldsymbol{v}^\top \boldsymbol{A}^{-1}}{1 + \boldsymbol{v}^\top \boldsymbol{A}^{-1} \boldsymbol{u}}.$$

Thus,

$$\begin{align*} \boldsymbol{x}_i^\top \tilde{\boldsymbol{w}}_{/i} &= \boldsymbol{x}_i^\top\left( \hat{\boldsymbol{w}} - \left[\boldsymbol{H} - \ddot{\ell}_i(u_i) \boldsymbol{x}_i \boldsymbol{x}_i^\top \right]^{-1} \left(-\dot{\ell}_i(u_i) \boldsymbol{x}_i\right) \right) \\ &= u_i + \dot{\ell}_i(u_i) \boldsymbol{x}_i^\top \left[\boldsymbol{H} - \ddot{\ell}_i(u_i) \boldsymbol{x}_i \boldsymbol{x}_i^\top\right]^{-1} \boldsymbol{x}_i \\ &= u_i + \dot{\ell}_i(u_i) \boldsymbol{x}_i^\top \left[\boldsymbol{H}^{-1} + \frac{\ddot{\ell}_i(u_i) \boldsymbol{H}^{-1} \boldsymbol{x}_i \boldsymbol{x}_i^\top \boldsymbol{H}^{-1}}{1 - \ddot{\ell}_i(u_i) \boldsymbol{x}_i^\top \boldsymbol{H}^{-1} \boldsymbol{x}_i} \right] \boldsymbol{x}_i \\ &= u_i + \dot{\ell}_i(u_i) \left[ h_i + \frac{\ddot{\ell}_i(u_i) h_i^2}{1 - \ddot{\ell}_i(u_i) h_i}\right] \\ &= u_i + \frac{\dot{\ell}_i(u_i) h_i}{1 - \ddot{\ell}_i(u_i) h_i}. \end{align*}$$
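
The matrix inversion lemma itself is easy to check numerically (a throwaway example with random matrices):

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4)) + 4.0 * np.eye(4)   # keep A comfortably invertible
u = rng.normal(size=4)
v = rng.normal(size=4)

A_inv = np.linalg.inv(A)
lhs = np.linalg.inv(A + np.outer(u, v))
rhs = A_inv - (A_inv @ np.outer(u, v) @ A_inv) / (1.0 + v @ A_inv @ u)
print(np.allclose(lhs, rhs))  # True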

See also

For more background on approximate leave-one-out cross-validation, see the paper

A scalable estimate of the extra-sample prediction error via approximate leave-one-out.

Optimizing Approximate Leave-one-out Cross-validation

But accurately estimating out-of-sample prediction error is just one part of hyperparameter optimization. Now that we have our efficient ALO estimator, how do we find the hyperparameters that minimize the estimated prediction error?

Fortunately, ALO isn’t just efficient to compute. We can also compute exact derivatives of ALO with respect to the hyperparameter, making it extremely efficient to dial in the value that minimizes the estimated prediction error.

First, we make a minor change to the way the cost function is parameterized. Define

$$J_\alpha\left(\boldsymbol{w}\right) = \sum_{i=1}^n \ell_i\left(\boldsymbol{x}_i^\top \boldsymbol{w}\right) + \alpha^2 \| \boldsymbol{w} \|^2,$$

let $\hat{\boldsymbol{w}}_\alpha$ denote the weights that minimize $J_\alpha$

$$\hat{\boldsymbol{w}}_\alpha = \operatorname*{argmin}_{\boldsymbol{w}} J_\alpha\left(\boldsymbol{w}\right),$$

and define

$$f(\alpha) = \frac{1}{n} \sum_{i=1}^n \ell_i\left( u_i + \frac{\dot{\ell}_i(u_i) h_i}{1 - \ddot{\ell}_i(u_i) h_i} \right).$$

$f$ is our ALO measurement and what we need to minimize.

To optimize $f$, we can again apply trust-region optimization, using the first and second derivatives

$$\frac{d f(\alpha)}{d \alpha} \quad\textrm{and}\quad \frac{d^2 f(\alpha)}{d \alpha^2}.$$

Note

While efficient to compute, the derivative formulas for $f$ are quite involved, so we won’t derive them in this guide. But if you want all the details, check out the paper

Optimizing Approximate Leave-one-out Cross-validation to Tune Hyperparameters.

Example: Breast Cancer Diagnostic (with Regularization)

Let’s now revisit the Breast Cancer Data Set and use regularization.

We again start by loading and normalizing the data set.

from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = data['data']
X = StandardScaler().fit_transform(X)
y = data['target']

While a lot happens behind the scenes, from the user’s point of view fitting a model to minimize ALO is easy. We’ll fit a model and print out the value of $C$ found.

import bbai.glm

model = bbai.glm.LogisticRegression()
# Note: bbai.glm.LogisticRegression defaults to finding the C that
# minimizes ALO when none is specified.

model.fit(X, y)

print("C =", model.C_)

This prints out

C = 0.6655139682151275

The Breast Cancer Data Set is small enough that we can compute the leave-one-out cross-validation across a range of values of $C$ by brute force. Let’s plot it out, compare it to ALO, and see where the value we found for $C$ lies along the curve.

First, we write the function to compute leave-one-out cross-validation by brute force.

import numpy as np

def compute_loocv(X, y, C):
    model = bbai.glm.LogisticRegression(C=C)
    n = len(y)
    loo_likelihoods = []
    for i in range(n):
        train_indexes = [i_p for i_p in range(n) if i_p != i]
        test_indexes = [i]
        X_train, X_test = X[train_indexes], X[test_indexes]
        y_train, y_test = y[train_indexes], y[test_indexes]
        model.fit(X_train, y_train)
        pred = model.predict_proba(X_test)[0]
        loo_likelihoods.append(pred[y_test[0]])
    return sum(np.log(loo_likelihoods))

Next, we write the function to compute ALO using the efficient formula.

import scipy.linalg
import scipy.special

def fit_logistic_regression(X, y, C):
    model = bbai.glm.LogisticRegression(C=C)
    model.fit(X, y)
    # Return the weights with the intercept appended as the last entry.
    return np.array(list(model.coef_[0]) + list(model.intercept_))

def compute_hessian(p_vector, X, alpha):
    n, k = X.shape
    # H = X^T diag(p (1 - p)) X, computed via a QR factorization.
    a_vector = np.sqrt((1 - p_vector)*p_vector)
    R = scipy.linalg.qr(a_vector.reshape((n, 1))*X, mode='r')[0]
    H = np.dot(R.T, R)
    # Add the regularization term to every weight except the intercept.
    for i in range(k-1):
        H[i, i] += alpha
    return H

def compute_alo(X, y, C):
    alpha = 1.0 / C
    w = fit_logistic_regression(X, y, C)
    # Append a column of ones so the intercept is treated like any other weight.
    X = np.hstack((X, np.ones((X.shape[0], 1))))
    n = X.shape[0]
    y = 2*y - 1  # recode the targets as -1/+1
    u_vector = np.dot(X, w)
    p_vector = scipy.special.expit(u_vector*y)
    H = compute_hessian(p_vector, X, alpha)
    # h_i = x_i^T H^{-1} x_i, computed from a Cholesky factorization of H.
    L = np.linalg.cholesky(H)
    T = scipy.linalg.solve_triangular(L, X.T, lower=True)
    h_vector = np.array([np.dot(ti, ti) for ti in T.T])
    # Approximate leave-one-out linear predictors.
    loo_u_vector = u_vector - \
        y * (1 - p_vector)*h_vector / (1 - p_vector*(1 - p_vector)*h_vector)
    loo_likelihoods = scipy.special.expit(y*loo_u_vector)
    return sum(np.log(loo_likelihoods))

Now, let’s plot out leave-one-out cross-validation and ALO. We’ll draw a vertical line to indicate the optimal value of $C$ we found.

import matplotlib.pyplot as plt

Cs = np.arange(0.1, 2.0, 0.1)
loocvs = [compute_loocv(X, y, C) for C in Cs]
alos = [compute_alo(X, y, C) for C in Cs]

fig, ax = plt.subplots()
ax.plot(Cs, loocvs, label='LOOCV', marker='o')
ax.plot(Cs, alos, label='ALO', marker='x')
ax.axvline(model.C_, color='tab:green', label='C_opt')
ax.set_xlabel('C')
ax.set_ylabel('Log-Likelihood')
ax.set_title("Breast Cancer Dataset")
ax.legend()

This displays

[Figure: leave-one-out cross-validation and ALO log likelihood plotted against C for the Breast Cancer Dataset]

Handling More Than Two Classes

The logistic function we presented only works with two classes. So what do we do when we have more than two classes?

Let $K$ denote the number of classes and let $\boldsymbol{W}$ represent a $K \times p$ matrix of weights. We compute the probability of the $k$th class for the $i$th data entry as

$$\frac{\exp\left(\boldsymbol{e}_k^\top \boldsymbol{W} \boldsymbol{x}_i\right)}{\sum_{k'=1}^K \exp\left(\boldsymbol{e}_{k'}^\top \boldsymbol{W}\boldsymbol{x}_i\right)}.$$

This is called the softmax function (here $\boldsymbol{e}_k$ denotes the $k$th standard basis vector), and we call the version of logistic regression that uses it multinomial logistic regression. Put

$$\ell_i(\boldsymbol{u}) = -\boldsymbol{e}_{y_i}^\top \boldsymbol{u} + \log\left[\sum_{k=1}^K \exp\left(\boldsymbol{e}_k^\top \boldsymbol{u}\right)\right].$$

The cost function for multinomial logistic regression is

$$J\left(\boldsymbol{W}\right) = \sum_{i=1}^n \ell_i\left(\boldsymbol{W} \boldsymbol{x}_i\right) + \frac{1}{2 C} \sum_{k=1}^K \sum_{j=1}^p W_{kj}^2.$$

We compute its gradient as

$$\frac{\partial}{\partial W_{kj}} J\left(\boldsymbol{W}\right) = \sum_{i=1}^n x_{ij} \left(\frac{\partial \ell_i }{\partial u_k}\right) \left(\boldsymbol{W} \boldsymbol{x}_i\right) + \frac{1}{C} W_{kj},$$

and we compute the hessian as

$$\begin{align*} \frac{\partial^2}{\partial W_{kj} \partial W_{k'j'}} J\left(\boldsymbol{W}\right) &= \sum_{i=1}^n x_{ij} x_{ij'} \left(\frac{\partial^2 \ell_i }{\partial u_k \partial u_{k'}}\right) \left(\boldsymbol{W} \boldsymbol{x}_i\right) \\ &\phantom{=}\quad + \frac{1}{C} \delta_{kk'} \delta_{jj'}. \end{align*}$$

We can similarly adapt ALO for multinomial logistic regression.
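
Before turning to an example, here is a small numpy sketch (illustrative only) of the softmax probabilities and the regularized multinomial cost above; it assumes the targets are coded as integers 0 through K-1.

import numpy as np

def softmax_probabilities(W, x):
    # Class probabilities for a single feature vector x given the K x p
    # weight matrix W; subtract the max for numerical stability.
    u = W @ x
    z = np.exp(u - np.max(u))
    return z / z.sum()

def multinomial_cost(W, X, y, C):
    total = 0.0
    for x_i, y_i in zip(X, y):
        p = softmax_probabilities(W, x_i)
        total -= np.log(p[y_i])              # \ell_i(W x_i)
    return total + np.sum(W**2) / (2.0 * C)  # plus the regularization term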

Example: Iris Species

We’ll apply multinomial logistic regression to fit a model to the Iris Data Set.

Load and normalize the data set.

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

data = load_iris()
X = data['data']
X = StandardScaler().fit_transform(X)
y = data['target']

Fit multinomial logistic regression with $C$ set so as to minimize the approximate leave-one-out cross-validation error.

import bbai.glm
model = bbai.glm.LogisticRegression()
model.fit(X, y)
print(model.C_)

Prints

C = 43.70957582240895
