How to Fit Logistic Regression with a Noninformative Prior

This guide will cover

  • What is the difference between MAP and ML models

  • What is a non-informative prior

  • What is Jeffreys prior

  • How to fit logistic regression models with Jeffreys prior

Introduction

Suppose \X denotes a matrix of regressors and \y,

y_i \in \{-1, 1\},

denotes a vector of target values.

If we hypothesize that the data is generated from a logistic regression model, then our belief in the weights \w after seeing the data is given by

\Prb(\w|\X, \y) = \frac{\Prb(\y|\X, \w) \cdot \Prb(\w)}{\Prb(\y)}.

Here, \Prb(\y|\X,\w) is called the likelihood and is given by

\Prb(\y|\X, \w) = \prod_i \frac{1}{1 + \exp(-y_i \cdot \x_i^\top \w)}.

And \Prb(\w) is called the prior and quantifies our belief in \w before seeing data.

Frequently, logistic regression is fit to maximize \Prb(\y|\X,\w) without regard to the prior

\wml = \argmax_{\w} \Prb(\y|\X, \w),

and \wml is what we call a maximum likelihood estimate for the weights.
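
As a concrete aside (not part of the original derivation), here's a minimal sketch of how \wml could be computed by minimizing the negative log-likelihood directly; the toy data and the use of scipy.optimize.minimize are assumptions made only for this illustration.

import numpy as np
from scipy.optimize import minimize

def negative_log_likelihood(w, X, y):
    # sum_i log(1 + exp(-y_i x_i^T w)), computed stably; y takes values in {-1, 1}
    return np.logaddexp(0.0, -y * (X @ w)).sum()

# Hypothetical toy data, for illustration only
np.random.seed(0)
X = np.random.normal(size=(50, 2))
y = np.sign(X @ np.array([1.0, -2.0]) + 0.5 * np.random.normal(size=50))

# w_ML maximizes the likelihood, i.e. minimizes the negative log-likelihood
w_ml = minimize(negative_log_likelihood, x0=np.zeros(2), args=(X, y)).x
print(w_ml)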

But disregarding the prior is generally a bad idea. Let’s see why.

What’s Wrong with Maximum Likelihood Models

Consider a concrete example.

Suppose you’re a biologist researching wolves. As part of your work, you’ve been tracking wolf packs and observing their hunting behavior.

You’d like to understand what makes a hunt successful. To that end, you fit a logistic regression model to predict whether a hunt will succeed or fail. It would be interesting to consider factors such as pack size and prey type, but to keep it simple you start with only a bias term.

Let n denote the total number of hunts. If b denotes the value of our bias term, then the likelihood of observing k successful hunts is given by

\Prb(k|b) \propto \left(\frac{1}{1 + \exp(-b)}\right)^k \left(\frac{1}{1 + \exp(b)}\right)^{n-k}.

Now, let's suppose that we observe 8 hunts and record no successful hunts. (The hunter success rate for wolves is actually quite low, so such a result would not be surprising.) What would be a good working estimate for b? The maximum likelihood estimate would be -\infty, yet we know that wouldn't make sense: why bother hunting at all if there's no chance of success?

We can obtain a better working estimate if, instead of selecting b to maximize the likelihood, we select b to maximize the posterior density

b_{\maxpost} = \argmax_{b} \Prb(k|b) \Prb(b)

with a suitably chosen prior \Prb(b). We'll discuss the process for selecting priors later, but for now consider the prior

\Prb(b) \propto \left(\frac{1}{1 + \exp(-b)}\right)^{1/2} \left(\frac{1}{1 + \exp(b)}\right)^{1/2}.

The posterior for this prior is

\Prb(b|k) \propto \left(\frac{1}{1 + \exp(-b)}\right)^{k + 1/2} \left(\frac{1}{1 + \exp(b)}\right)^{n-k + 1/2}.

Put p = (1 + \exp(-b))^{-1}. We can find the MAP estimate by taking the logarithm, differentiating, and setting the derivative to zero

\begin{align*} \frac{d}{dp} &\left[ (k + 1/2) \log p + (n - k + 1/2) \log (1 - p) \right] \\ & \quad = (k + 1/2) \frac{1}{p} - (n - k + 1/2) \frac{1}{1-p}. \end{align*}

Setting this to zero and solving, we have

\begin{align*} &(k + 1/2) \frac{1}{p} - (n - k + 1/2) \frac{1}{1-p} = 0 \Leftrightarrow \\ &(k + 1/2) \frac{1}{p} = (n - k + 1/2) \frac{1}{1-p} \Leftrightarrow \\ &(k + 1/2) (1 - p) = (n - k + 1/2) p \Leftrightarrow \\ &(k + 1/2) = (n + 1) p \Leftrightarrow \\ &p = \frac{k + 1/2}{n + 1}. \end{align*}

In our example, this then gives us the much more reasonable estimate

p = \frac{k + 1/2}{n + 1} = \frac{0 + 1/2}{8 + 1} = \frac{1}{18}.
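
Here's a small numeric restatement of the example (a sketch; numpy assumed): with k = 0 successes out of n = 8 hunts, the ML estimate of the success probability is 0 (i.e. b = -\infty), while the MAP estimate under the prior above is 1/18.

import numpy as np

n, k = 8, 0

# Maximum likelihood estimate of the success probability: k / n = 0 (b_ML = -infinity)
p_ml = k / n

# MAP estimate under the prior above: (k + 1/2) / (n + 1) = 1/18
p_map = (k + 0.5) / (n + 1)

# The corresponding bias term, b = log(p / (1 - p))
b_map = np.log(p_map / (1 - p_map))

print(p_ml, p_map, b_map)  # 0.0 0.0555... -2.833...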

The prior we used is an example of a noninformative prior.

What Are Noninformative Priors and How Do We Find Them

Suppose we're given a likelihood function L(\theta|\y) for a model with parameters \theta and data \y. What's a good prior to use when we have no particular knowledge about \theta?

Let's look at a concrete example. Suppose data is generated from a normal distribution with known variance \sigma^2 but unknown mean \theta. The likelihood function for this model is

\begin{align*} L(\theta|\y) &\propto \prod_i \exp\left[-\frac{1}{2\sigma^2} \left(y_i - \theta\right)^2\right] \\ &= \exp\left[-\frac{1}{2\sigma^2} \sum_i\left(y_i^2 - 2 y_i \theta + \theta^2\right)\right] \\ &= \exp\left[-\frac{n}{2\sigma^2} \left(\bar{y} - \theta \right)^2 - \frac{1}{2\sigma^2}\left(\sum_i y_i^2 - n \bar{y}^2\right) \right] \\ &\propto \exp\left[-\frac{n}{2\sigma^2} \left(\bar{y} - \theta\right)^2\right]. \end{align*}

And here's a plot of the likelihood for n=5 and different values of \bar{y}:

../_images/mean-likeplot.svg

Likelihood function for the mean of normally distributed data with variance one, n=5, and different values of \bar{y}.

Note how the likelihood curves have the same shape for all values of \bar{y} and vary only by translation. When the likelihood curve has this property, we say that it is data translated.
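
To make the translation property concrete, here is a small numeric check (a sketch; numpy, n = 5, and \sigma = 1 are assumed, matching the plot above): shifting \bar{y} by t shifts the likelihood curve by exactly t without changing its shape.

import numpy as np

n, sigma = 5, 1.0

def likelihood(theta, ybar):
    # L(theta | ybar), up to a constant factor
    return np.exp(-n / (2 * sigma**2) * (ybar - theta)**2)

thetas = np.linspace(-5, 5, 1001)
t = 1.5

# The curve for ybar + t, evaluated on a grid shifted by t, matches the curve for ybar
curve_a = likelihood(thetas, 0.0)
curve_b = likelihood(thetas + t, 0.0 + t)
print(np.allclose(curve_a, curve_b))  # True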

Now consider two ranges of the same size, [a, b] and [a + t, b + t]. Both would be equivalently placed relative to the likelihood curve if the observed mean were \bar{y} + t instead of \bar{y}. Thus, we should have

\Prb(\theta \in [a, b]) = \Prb(\theta \in [a + t, b + t]),

which implies the uniform prior

\Prb(\theta) \propto 1.

Now suppose we use an alternative parameterization for a likelihood function L(\theta|\y): we parameterize by u = \phi(\theta), where \phi is some strictly increasing transformation, and apply the uniform prior to u. The cumulative distribution function is then given by

\Prb(u < t|\y) = \frac{1}{Z}\int_{-\infty}^t L(\phi^{-1}(u)|\y)\, du,

where Z is some normalizing constant. Making the substitution \theta = \phi^{-1}(u) to go back to the original parameterization then gives us

\Prb(\theta < \phi^{-1}(t)|\y) = \frac{1}{Z}\int_{-\infty}^{\phi^{-1}(t)} L(\theta|\y)\, \phi'(\theta)\, d\theta.

So we see that a uniform prior over u is equivalent to a prior \Prb(\theta) \propto \phi'(\theta) over \theta. And finding a noninformative prior for \theta is equivalent to finding a transformation \phi that makes the likelihood function data translated.

Let’s now return to the likelihood function for a logistic regression model with only a bias term

L(b|k) \propto \left(\frac{1}{1 + \exp(-b)}\right)^k \left(\frac{1}{1 + \exp(b)}\right)^{n-k}.

Plotting the likelihood for n=8 and different values of k gives us

../_images/logit-likeplot.svg

Likelihood function for a logistic regression model with only a bias term, n=8, and different values of k.

It's pretty clear from looking at the graph that the likelihood function is not data translated. The curve for k=4 is more peaked than the others, and the curves for k=1 and k=6 are skewed.

Now, for suitably behaved likelihood functions, we can approximate L(\theta|\y) by a Gaussian using a second-order Taylor series expansion of \log L(\theta|\y) about \theta_{\maxlike}

L(\theta|\y) \approx L(\theta_{\maxlike}|\y) \exp\left[ -\frac{n}{2 \sigma_{\maxlike}^2}\left(\theta - \theta_{\maxlike}\right)^2 \right],

where

\sigma_{\maxlike}^2 = -\left(\frac{1}{n}\frac{\partial^2 \log L(\theta|\y)}{\partial \theta^2}\right)_{\theta_{\maxlike}}^{-1}.

As n \to \infty, the approximation becomes more and more accurate.

Let’s see how this works for our simplified logistic regression model.

Differentiating \frac{1}{n}\log L(b|k) gives us

\begin{align*} \frac{\partial}{\partial b} \left( \frac{1}{n}\log L(b|k)\right) &= \frac{1}{n}\frac{\partial}{\partial b} \left(-k \log[1 + \exp(-b)] - (n - k) \log[1 + \exp(b)]\right) \\ &= \frac{1}{n}\left(\frac{k\exp(-b)}{1 + \exp(-b)} - \frac{(n - k)\exp(b)}{1 + \exp(b)}\right) \\ &= \frac{1}{n}\left(\frac{k}{1 + \exp(b)} - \frac{n - k}{1 + \exp(-b)}\right). \end{align*}

Differentiating again gives us

\begin{align*} \frac{\partial^2}{\partial b^2} \left( \frac{1}{n}\log L(b|k)\right) &= \frac{1}{n}\frac{\partial}{\partial b} \left(\frac{k}{1 + \exp(b)} - \frac{n - k}{1 + \exp(-b)}\right) \\ &= \frac{1}{n}\left(\frac{-k\exp(b)}{(1 + \exp(b))^2} - \frac{(n - k)\exp(-b)}{(1 + \exp(-b))^2}\right) \\ &= -\left(\frac{1}{1 + \exp(-b)}\right)\left(\frac{1}{1 + \exp(b)}\right). \end{align*}
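
The signs here are easy to get wrong, so the following sketch (numpy assumed; the particular values of n, k, and b are arbitrary) checks the analytic second derivative against a central finite difference.

import numpy as np

def log_likelihood(b, n, k):
    # log L(b|k), up to an additive constant
    return -k * np.log1p(np.exp(-b)) - (n - k) * np.log1p(np.exp(b))

n, k, b, h = 8, 1, 0.7, 1e-4

# Central finite difference of (1/n) log L(b|k)
numeric = (log_likelihood(b + h, n, k)
           - 2 * log_likelihood(b, n, k)
           + log_likelihood(b - h, n, k)) / (n * h**2)

analytic = -1 / ((1 + np.exp(-b)) * (1 + np.exp(b)))
print(numeric, analytic)  # both approximately -0.2217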

And here’s what the Gaussian approximation looks like

../_images/gauss-approx.svg

Likelihood function for a logistic regression model with only a bias term, together with its Gaussian approximation, for n=8 and k=1.

Now, consider what happens to our Gaussian approximation of L(\theta|\y) when we reparameterize with \theta = \phi^{-1}(u).

Differentiating gives us

\frac{\partial}{\partial u} \log L(\phi^{-1}(u)|\y) = \left(\frac{\partial \log L}{\partial \theta}\right) \frac{1}{\phi'(\theta)}.

Differentiating again gives us

\begin{align*} \frac{\partial^2}{\partial u^2} \log L(\phi^{-1}(u)|\y) &= \frac{\partial}{\partial u} \left( \left(\frac{\partial \log L}{\partial \theta}\right) \frac{1}{\phi'(\theta)} \right) \\ &= \left(\frac{\partial^2 \log L}{\partial \theta^2}\right) \frac{1}{\phi'(\theta)^2} - \left(\frac{\partial \log L}{\partial \theta}\right) \frac{\phi''(\theta)}{\phi'(\theta)^3}. \end{align*}

Evaluating at a maximum u_{\maxlike} = \phi(\theta_{\maxlike}), we have

\left(\frac{\partial^2}{\partial u^2} \log L(\phi^{-1}(u)|\y)\right)_{u_{\maxlike}} = \left(\frac{\partial^2 \log L}{\partial \theta^2}\right)_{\theta_{\maxlike}} \frac{1}{\phi'(\theta_{\maxlike})^2},

where we’ve applied

\left(\frac{\partial \log L}{\partial \theta}\right)_{\theta_{\maxlike}} = 0.

Returning to logistic regression, put

u = \phi(b) = \int_{-\infty}^b \left(\frac{1}{1 + \exp(-s)}\right)^{1/2} \left(\frac{1}{1 + \exp(s)}\right)^{1/2} ds.

Then we have

\begin{align*} \sigma_{\maxlike}^2 &= -\left(\frac{1}{n}\frac{\partial^2 \log L(\phi^{-1}(u)|\y)}{\partial u^2}\right)_{u_{\maxlike}}^{-1} \\ &= \left[\frac{1}{n} \cdot n \left(\frac{1}{1 + \exp(-b_{\maxlike})}\right) \left(\frac{1}{1 + \exp(b_{\maxlike})}\right) \frac{1}{\phi'(b_{\maxlike})^2}\right]^{-1} \\ &= 1. \end{align*}

Thus, the Gaussian approximation about u_{\maxlike} becomes

\begin{align*} L(\phi^{-1}(u)|k) &\approx L(b_{\maxlike}|k) \exp\left[ -\frac{n}{2}\left(u - u_{\maxlike}\right)^2 \right] \\ &\propto \exp\left[ -\frac{n}{2}\left(u - u_{\maxlike}\right)^2 \right], \end{align*}

so that the likelihood function is now approximately data translated.
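
One way to check this numerically is to note that the curvature of \log L in u at its maximum works out to -n for every k, so every curve has the same width. The sketch below (numpy assumed) uses the closed form \phi(b) = 2\arctan(e^{b/2}) for the integral above; that closed form is an observation not made in the text, but it's easy to verify by differentiation.

import numpy as np

n = 8

def log_likelihood(b, k):
    return -k * np.log1p(np.exp(-b)) - (n - k) * np.log1p(np.exp(b))

def phi(b):
    # Assumed closed form of the integral above:
    # d/db 2*arctan(exp(b/2)) = exp(b/2) / (1 + exp(b)) = sqrt(p * q)
    return 2 * np.arctan(np.exp(b / 2))

def phi_inv(u):
    return 2 * np.log(np.tan(u / 2))

h = 1e-4
for k in range(1, n):
    b_ml = np.log(k / (n - k))   # maximum likelihood estimate of the bias
    u_ml = phi(b_ml)
    f = lambda u: log_likelihood(phi_inv(u), k)
    curvature = (f(u_ml + h) - 2 * f(u_ml) + f(u_ml - h)) / h**2
    print(k, curvature)          # approximately -n = -8 for every k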

../_images/transformed-logit-likeplot.svg

Reparameterized likelihood function for a logistic regression model with only a bias term, n=8, and different values of k.

Jeffreys Prior

Let's now give a brief sketch of Jeffreys prior for the multiparameter case. Let \thetav^\top = (\theta_1, \ldots, \theta_k) denote the vector of parameters.

As in the single parameter case, the likelihood function can be approximated by a Gaussian about the optimum

L(\thetav|\y) \approx L(\thetav_{\maxlike}|\y) \exp\left[ -\frac{1}{2}\left(\thetav - \thetav_{\maxlike}\right)^\top \Sigma^{-1}_{\maxlike} \left(\thetav - \thetav_{\maxlike}\right) \right],

where

\Sigma^{-1}_{\maxlike} = -\Ex{\nabla^2_{\thetav} \log L(\thetav|\y)}.

Now, suppose we reparameterize with \uv = \phi(\thetav). Then the updated hessian for the Gaussian approximation becomes

\Ex{\nabla^2_{\uv} \log L(\phi^{-1}(\uv)|\y)} = \A \Ex{\nabla^2_{\thetav} \log L(\thetav|\y)} \A^\top,

where

A_{st} = \frac{\partial \phi^{-1}_s(\uv)}{\partial u_t}.

In the general multivariable case, we may not be able to reparameterize in such a way that the Gaussian approximation becomes data translated, but suppose we select \phi so that

\det \A = \left(\det \Ex{-\nabla^2_{\thetav} \log L(\thetav|\y)}\right)^{-1/2}.

Then

\begin{align*} \det \Ex{-\nabla^2_{\uv} \log L(\phi^{-1}(\uv)|\y)} &= \det \A \left(\det \Ex{-\nabla^2_{\thetav} \log L(\thetav|\y)}\right) \det \A^\top \\ &= 1. \end{align*}

Having a constant determinant means that the volume of regions is preserved.

Let S \subset \mathbb{R}^k be a region about the origin and let \K be the covariance matrix of the reparameterized Gaussian approximation about \thetav_{\maxlike}. If \L is the Cholesky factorization of \K

\K = \L \L^\top,

then S corresponds to the region \L^{-1}S + \thetav_{\maxlike} about \thetav_{\maxlike} that has fixed probability mass and volume

Vol(\L^{-1}S) = \frac{1}{\det \L} Vol(S) = Vol(S).

The prior on \thetav that corresponds to a uniform prior on \uv is

\Prb(\thetav) \propto \left(\det \Ex{- \nabla^2_{\thetav} \log L(\thetav|\y)}\right)^{1/2}.

And this is called the Jeffreys prior.
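
As a quick sanity check (a sketch; numpy assumed), applying the general formula to the bias-only model from the wolf example recovers, up to a constant factor, the prior we used there: the Fisher information is the 1x1 matrix n p q, so the Jeffreys prior is proportional to (pq)^{1/2}.

import numpy as np

n = 8

def fisher_information(b):
    # E[-d^2/db^2 log L(b|k)] = n * p * q for the bias-only model
    p = 1 / (1 + np.exp(-b))
    return n * p * (1 - p)

def jeffreys_prior(b):
    # Pr(b) proportional to det(Fisher information)^{1/2}; here it's a 1x1 "matrix"
    return np.sqrt(fisher_information(b))

# Ratio against the prior used in the wolf example,
# (1/(1+exp(-b)))^{1/2} * (1/(1+exp(b)))^{1/2}
b = 1.3
wolf_prior = np.sqrt(1 / (1 + np.exp(-b))) * np.sqrt(1 / (1 + np.exp(b)))
print(jeffreys_prior(b) / wolf_prior)  # sqrt(n): a constant that doesn't depend on b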

Let’s apply this to the full binomial logistic regression model with weights. The negative log likelihood function is given by

J(\w) = \sum_{i=1}^n \ell_i\left(\x_i^\top \w\right),

where

\ell_i(u) = \log[1 + \exp(-y_i u)].

Differentiating gives us

\begin{align*} \frac{\partial}{\partial w_s} J(\w) &= \sum_{i=1}^n \frac{\partial}{\partial w_s} \ell_i\left(\x_i^\top \w\right) \\ &= \sum_{i=1}^n x_{is} \ell_i'\left(\x_i^\top \w\right), \end{align*}

where

\ell_i'(u) = \frac{-y_i}{1 + \exp(y_i u)}.

Differentiating again gives us

\begin{align*} \frac{\partial^2}{\partial w_s \partial w_t} J(\w) &= \frac{\partial}{\partial w_s}\left( \sum_{i=1}^n x_{it} \ell_i'\left(\x_i^\top \w\right) \right) \\ &= \sum_{i=1}^n x_{is} x_{it} \ell_i''\left(\x_i^\top \w\right), \end{align*}

where

\begin{align*} \ell_i''(u) &= q p, \\ p &= \frac{1}{1 + \exp(-u)}, \quad\textrm{and} \\ q &= 1 - p. \end{align*}

We can write the hessian as

\H = \X^\top \A \X,

where \A is the diagonal matrix with A_{ii} = \ell_i''(\x_i^\top \w).

The Jeffreys prior is then

\Prb(\w) \propto \left(\det \X^\top \A \X\right)^{1/2}.

Fitting Logistic Regression with Jeffreys Prior

Finding the MAP model for logistic regression with the Jeffreys prior amounts to solving the optimization problem

\w_{\maxpost} = \argmin_{\w} \left\{\sum_{i=1}^n \ell_i\left(\x_i^\top \w\right) - \frac{1}{2}\log \det \left(\X^\top \A_{\w} \X\right)\right\}.

Let \pi denote the log-determinant term in the Jeffreys prior

\pi(\w) = \log \det \left(\X^\top \A_{\w} \X\right).

If we can find equations for the gradient and hessian of \pi, then we can apply standard second-order optimization algorithms to find \w_{\maxpost}.

Starting with the gradient, we can apply Jacobi’s Formula for differentiating a determinant

\frac{d}{dt} \det \A = \left(\det \A\right) \tr\left\{\A^{-1} \frac{d \A}{dt}\right\}

to get

\frac{\partial\pi}{\partial w_s} (\w) = \tr\left\{ \left(\X^\top \A_{\w} \X\right)^{-1} \X^\top \frac{\partial \A_{\w}}{\partial w_s} \X \right\},

where the derivative of \A_{\w} is the diagonal matrix

\begin{align*} \left(\frac{\partial \A_{\w}}{\partial w_s}\right)_{ii} &= \ell_i'''(\x_i^\top \w)\, x_{is}, \\ \ell_i'''(u) &= q p (q - p). \end{align*}

For the hessian, we apply the formula for differentiating an inverse matrix

\frac{d}{dt} \A^{-1} = -\A^{-1} \frac{d \A}{dt} \A^{-1}

to get

\begin{align*} \frac{\partial^2 \pi}{\partial w_s\partial w_t}(\w) &= \frac{\partial}{\partial w_s} \left( \tr\left\{ \left(\X^\top \A_{\w} \X\right)^{-1} \X^\top \frac{\partial \A_{\w}}{\partial w_t} \X \right\}\right) \\ &= -\tr\left\{ \left(\X^\top \A_{\w} \X\right)^{-1} \X^\top \frac{\partial \A_{\w}}{\partial w_s} \X \left(\X^\top \A_{\w} \X\right)^{-1} \X^\top \frac{\partial \A_{\w}}{\partial w_t} \X \right\} \\ &\quad + \tr\left\{ \left(\X^\top \A_{\w} \X\right)^{-1} \X^\top \frac{\partial^2 \A_{\w}}{\partial w_s \partial w_t} \X \right\}, \end{align*}

where

\begin{align*} \left(\frac{\partial^2 \A_{\w}}{\partial w_s \partial w_t}\right)_{ii} &= \ell_i''''(\x_i^\top \w)\, x_{is} x_{it}, \\ \ell_i''''(u) &= qp(q^2 + p^2) - 4q^2p^2. \end{align*}
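
To make the formulas concrete, here's a sketch (numpy assumed; an illustration, not the library's internal implementation) that computes \pi(\w) = \log\det(\X^\top \A_{\w} \X) and its gradient, and checks the gradient against a finite-difference approximation. The hessian can be checked the same way.

import numpy as np

def pi_and_gradient(X, w):
    # A_w is diagonal with entries l_i''(u_i) = p*q at u_i = x_i^T w
    u = X @ w
    p = 1 / (1 + np.exp(-u))
    q = 1 - p
    a = p * q                  # l_i''(u_i)
    da = p * q * (q - p)       # l_i'''(u_i)

    M = X.T @ (a[:, None] * X)             # X^T A_w X
    M_inv = np.linalg.inv(M)
    pi = np.linalg.slogdet(M)[1]           # log det(X^T A_w X)

    # d pi / d w_s = tr{ M^{-1} X^T (dA_w/dw_s) X },  (dA_w/dw_s)_ii = l_i''' * x_is
    grad = np.array([
        np.trace(M_inv @ (X.T @ ((da * X[:, s])[:, None] * X)))
        for s in range(X.shape[1])
    ])
    return pi, grad

# Check the gradient against forward finite differences on random data
np.random.seed(0)
X = np.random.normal(size=(20, 3))
w = np.random.normal(size=3)

pi0, grad = pi_and_gradient(X, w)
h = 1e-6
fd = np.array([
    (pi_and_gradient(X, w + h * np.eye(3)[s])[0] - pi0) / h
    for s in range(3)
])
print(np.max(np.abs(grad - fd)))  # close to zero, up to finite-difference error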

By evaluating the gradient and hessian computations efficiently and applying a second-order trust-region optimizer, we can quickly iterate to find \w_{\maxpost}.

Let’s look at how this works on some examples.

Example 1: Simulated Data with Single Variable

We’ll start by looking at the prior and posterior for a simulation dataset with a single regressor and no bias.

We begin by generating the dataset.

import numpy as np
import matplotlib.pyplot as plt
from scipy import integrate
from bbai.glm import LogisticRegressionMAP

def generate_correlation_matrix(p, param):
    # p x p matrix with ones on the diagonal and `param` everywhere else
    res = np.zeros(shape=(p, p))
    for s in range(p):
        for t in range(0, s+1):
            corr = param
            if s == t:
                corr = 1.0
            res[s, t] = corr
            res[t, s] = corr
    return res

def generate_design_matrix(n, K):
    # Draw n regressor rows from a zero-mean multivariate normal with covariance K
    mean = np.zeros(K.shape[0])
    return np.random.multivariate_normal(mean, K, size=n)

def generate_weights(p):
    return np.random.normal(size=p)

def generate_data_set(n, K):
    p = K.shape[0]
    X = generate_design_matrix(n, K)
    w = generate_weights(p)

    u = np.dot(X, w)

    # Success probabilities under the logistic regression model
    prob = 1 / (1 + np.exp(-u))

    y = []
    for i in range(n):
        y.append(np.random.binomial(1, prob[i]))
    y = np.array(y)

    return X, y, w

np.random.seed(0)

n = 20
p = 1

K = generate_correlation_matrix(p, 0.5)
X, y, w_true = generate_data_set(n, K)

We next add a function to compute the prior

def compute_a_matrix(X, u):
    # Diagonal matrix A with A_ii = p_i * (1 - p_i), where p_i = 1 / (1 + exp(-u_i))
    p_vector = 1 / (1 + np.exp(-u))
    return np.diag(p_vector * (1 - p_vector))

def compute_fisher_information_matrix(X, u):
    # X^T A X
    A = compute_a_matrix(X, u)
    return np.dot(X.T, np.dot(A, X))

def compute_log_prior(X, u):
    # Log of the Jeffreys prior: (1/2) * log det(X^T A X)
    FIM = compute_fisher_information_matrix(X, u)
    return 0.5 * np.linalg.slogdet(FIM)[1]

def f_prior(w):
    # Unnormalized Jeffreys prior as a function of a single scalar weight (X is global)
    u = np.dot(X, np.array([w]))
    return np.exp(compute_log_prior(X, u))

And plot it out across a range of values

Z, _ = integrate.quad(f_prior, -100, 100)
wx = np.arange(-20, 20, 0.1)
yx = [f_prior(w)/Z for w in wx]
plt.plot(wx, yx)
plt.xlabel('w')
plt.ylabel('Probability')
plt.show()
../_images/jeffreys-prior-var1.svg

Jeffreys prior for a simulation dataset with a single regressor

Next, we fit a logistic regression MAP model and compare \w_{\maxpost} to \w_{\textrm{true}}.

model = LogisticRegressionMAP(fit_intercept=False)
model.fit(X, y)
w_map = model.coef_[0]
print('w_true =', w_true[0])
print('w_map = ', w_map[0])

Prints

\begin{align*} \w_{\textrm{true}} &= -2.55 \\ \w_{\maxpost} &= -3.14 \end{align*}

And finally, we plot out the posterior distribution.

def compute_log_posterior(X, y, w):
    u = np.dot(X, w)
    log_prior = compute_log_prior(X, u)
    # Convert labels from {0, 1} to {-1, 1}
    y = 2 * y - 1
    # Log-likelihood of the logistic regression model
    log_likelihood = -np.sum(np.log(1 + np.exp(-y * u)))
    return log_likelihood + log_prior

def f_posterior(w):
    # Unnormalized posterior for a single scalar weight (X and y are globals)
    return np.exp(compute_log_posterior(X, y, np.array([w])))

Z,_ = integrate.quad(f_posterior, -100, 100)
yx = [f_posterior(w)/Z for w in wx]
plt.plot(wx, yx)
plt.xlabel('w')
plt.axvline(x=w_map[0], color='red', label='w_map')
plt.ylabel('Probability')
plt.legend()
plt.show()
../_images/jeffreys-posterior-var1.svg

Posterior for a simulation dataset with a single regressor

The full example is available at github.com/rnburn/bbai/example/05-jeffreys1.ipynb.

Example 2: Simulated Data with Two Variables

Let’s look next at an example with two variables.

We generate a data set with two variables

np.random.seed(0)

n = 20
p = 2

K = generate_correlation_matrix(p, 0.5)
X, y, w_true = generate_data_set(n, K)

We compute the prior and plot it across a range of values

def evaluate_mesh(f, W1, W2):
    n, m = W1.shape
    res = np.zeros(W1.shape)
    for i in range(n):
        for j in range(m):
            res[i, j] = f(W1[i, j], W2[i, j])
    return res

def f_prior(w1, w2):
    u = np.dot(X, [w1, w2])
    return np.exp(compute_log_prior(X, u))

Z,_ = integrate.nquad(f_prior, [[-50, 50], [-50, 50]])

wx = np.arange(-7, 7, 0.025)
W1, W2 = np.meshgrid(wx, wx)

P = evaluate_mesh(f_prior, W1, W2) / Z

plt.contour(W1, W2, P, 50, colors='k')
plt.xlabel('w1')
plt.ylabel('w2')
../_images/jeffreys-prior-var2.svg

Jeffreys prior for logistic regression on simulation dataset with two regressors

Now, we'll fit a model to find \w_{\maxpost}.

model = LogisticRegressionMAP(fit_intercept=False)
model.fit(X, y)
w_map = model.coef_[0]
print('w_true =', w_true)
print('w_map = ', w_map)

Prints

\begin{align*} \w_{\textrm{true}} &= \left(-1.05, -1.42\right)^\top \\ \w_{\maxpost} &= \left(-1.68, 0.14\right)^\top \end{align*}

And finally, we plot out the posterior distribution centered at \w_{\maxpost}.

def f_posterior(w1, w2):
    return np.exp(compute_log_posterior(X, y, np.array([w1, w2])))

Z,_ = integrate.nquad(f_posterior, [[-50, 50], [-50, 50]])

# Use the Laplace approximation's standard deviations to set the plot range
s1 = model.laplace_covariance_matrix_[0, 0]**.5
s2 = model.laplace_covariance_matrix_[1, 1]**.5
w1x = np.arange(w_map[0] - 2.5*s1, w_map[0] + 2.5*s1, 0.025)
w2x = np.arange(w_map[1] - 2.5*s2, w_map[1] + 2.5*s2, 0.025)
W1, W2 = np.meshgrid(w1x, w2x)

P = evaluate_mesh(f_posterior, W1, W2) / Z

plt.contour(W1, W2, P, 50, colors='k')
plt.xlabel('w1')
plt.ylabel('w2')
../_images/jeffreys-post-var2.svg

Posterior for logistic regression with Jeffreys prior on a simulation dataset with two regressors

The full example is available at github.com/rnburn/bbai/example/06-jeffreys2.ipynb.

Example 3: Breast Cancer Data Set

Now we’ll fit a model to the real-world breast cancer data set.

We begin by loading and preprocessing the data set.

from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
import bbai.glm

data = load_breast_cancer()

X = data['data']
y = data['target']
features = list(data['feature_names'])
features.append('intercept')
n, p = X.shape
X = StandardScaler().fit_transform(X)

We then fit a logistic regression MAP model and use the Fisher information matrix to estimate the standard error.

model = bbai.glm.LogisticRegressionMAP()
model.fit(X, y)
w_map = list(model.coef_[0])
w_map.append(model.intercept_[0])

# Standard errors estimated from the Fisher information (hessian) diagonal
s_map = 1.0 / np.sqrt(np.diag(model.hessian_))

And we print out the weights along with their standard errors.

max_len = max([len(s) for s in features])
fmt = '%-' + str(max_len) + 's'
for j in range(p+1):
    print(fmt % features[j], '%f (%f)' % (w_map[j], s_map[j]))

Feature                    weight (standard error)
mean radius                18.844369 (0.246536)
mean texture               0.718237 (0.264921)
mean perimeter             -19.185223 (0.247434)
mean area                  -0.157821 (0.206162)
mean smoothness            0.367833 (0.319041)
mean compactness           3.227443 (0.262862)
mean concavity             -0.022651 (0.226283)
mean concave points        -2.040344 (0.308589)
mean symmetry              -0.325849 (0.245192)
mean fractal dimension     -0.766704 (0.210683)
radius error               -10.264099 (0.134592)
texture error              0.806598 (0.217691)
perimeter error            5.748345 (0.127539)
area error                 3.829019 (0.111033)
smoothness error           0.011620 (0.178030)
compactness error          -0.722398 (0.144933)
concavity error            0.377161 (0.105596)
concave points error       -1.083376 (0.158857)
symmetry error             -0.103406 (0.191188)
fractal dimension error    1.550249 (0.127858)
worst radius               -4.162614 (0.292164)
worst texture              -2.733197 (0.252259)
worst perimeter            -12.439716 (0.290176)
worst area                 11.414854 (0.237330)
worst smoothness           -1.011890 (0.274542)
worst compactness          1.888830 (0.218036)
worst concavity            -1.708811 (0.213529)
worst concave points       0.375874 (0.370696)
worst symmetry             -0.268151 (0.203808)
worst fractal dimension    -1.351309 (0.216937)
intercept                  1.333285 (0.246327)

The full example is available at github.com/rnburn/bbai/example/07-jeffreys-breast-cancer.py.

Conclusion and Further Reading

Unlike Ordinary Least Squares, the likelihood function for logistic regression is not data translated. Searching for a data translated transformation leads us to Jeffreys prior, a natural shrinkage prior.

Firth (1993) and Kosmidis & Firth (2021) analyzed the statistical properties of the estimator that maximizes the likelihood penalized by the Jeffreys prior and found it to have a smaller order of asymptotic bias than the standard maximum likelihood estimator. The reduced-bias property of the Jeffreys prior, combined with its ability to handle the problem of separation (Heinze & Schemper, 2002), can make it an excellent drop-in replacement for the standard approach to fitting logistic regression models.
