How to Fit Logistic Regression with a Noninformative Prior
This guide will cover:

- What is the difference between MAP and ML models
- What is a noninformative prior
- What is Jeffreys prior
- How to fit logistic regression models with Jeffreys prior
Introduction
Suppose $X$ denotes a matrix of regressors and $\mathbf{y}$ denotes a vector of target values.
If we hypothesize that the data is generated from a logistic regression model, then our belief in the weights $\mathbf{w}$ after seeing the data is given by
$$p(\mathbf{w} \mid X, \mathbf{y}) \propto p(\mathbf{y} \mid X, \mathbf{w}) \, p(\mathbf{w}).$$
Here, $p(\mathbf{y} \mid X, \mathbf{w})$ is called the likelihood and is given by
$$p(\mathbf{y} \mid X, \mathbf{w}) = \prod_{i=1}^n \sigma\left(\mathbf{x}_i^\top \mathbf{w}\right)^{y_i} \left(1 - \sigma\left(\mathbf{x}_i^\top \mathbf{w}\right)\right)^{1 - y_i}, \qquad \sigma(u) = \frac{1}{1 + e^{-u}}.$$
And $p(\mathbf{w})$ is called the prior and quantifies our belief in $\mathbf{w}$ before seeing data.
Frequently, logistic regression is fit to maximize $p(\mathbf{y} \mid X, \mathbf{w})$ without regard to the prior, and
$$\hat{\mathbf{w}}_{\mathrm{ML}} = \operatorname*{argmax}_{\mathbf{w}} \; p(\mathbf{y} \mid X, \mathbf{w})$$
is what we call a maximum likelihood estimate for the weights.
But disregarding the prior is generally a bad idea. Let’s see why.
What’s Wrong with Maximum Likelihood Models
Consider a concrete example.
Suppose you’re a biologist researching wolves. As part of your work, you’ve been tracking wolf packs and observing their hunting behavior.
You’d like to understand what makes a hunt successful. To that end, you fit a logistic regression model to predict whether a hunt will succeed or fail. It would be interesting to consider factors such as pack size and prey type, but to keep it simple you start with only a bias term.
Let $N$ denote the total number of hunts. If $b$ denotes the value of our bias term, then the likelihood of observing $k$ successful hunts is given by
$$L(b; k) = \sigma(b)^{k} \left(1 - \sigma(b)\right)^{N - k}.$$
Now, let's suppose that we observe $N$ hunts and record no successful hunts. (The hunter success rate for wolves is actually quite low, so such a result would not be surprising.) What would be a good working estimate for the success probability $\sigma(b)$? The maximum likelihood estimate would be $\hat{\sigma} = 0$ (that is, $b \to -\infty$), yet we know that wouldn't make sense: why bother hunting at all if there's no chance of success?
We can obtain a better working estimate if, instead of selecting $b$ to maximize likelihood, we select $b$ to maximize the probability of the posterior distribution
$$p(b \mid k) \propto L(b; k) \, \pi(b)$$
with a suitably chosen prior $\pi(b)$. We'll discuss later the process for selecting priors, but for now consider the prior
$$\pi(b) \propto \sqrt{\sigma(b)\left(1 - \sigma(b)\right)}.$$
The posterior for this prior is
$$p(b \mid k) \propto \sigma(b)^{k + \frac{1}{2}} \left(1 - \sigma(b)\right)^{N - k + \frac{1}{2}}.$$
Put $\hat{p} = \sigma(\hat{b})$. We can find the MAP estimate by taking the logarithm, differentiating with respect to $b$, and setting the result to 0:
$$\frac{\partial}{\partial b} \log p(b \mid k) = \left(k + \tfrac{1}{2}\right)\left(1 - \sigma(b)\right) - \left(N - k + \tfrac{1}{2}\right)\sigma(b).$$
Setting this to 0 and solving, we have
$$\hat{p} = \frac{k + \frac{1}{2}}{N + 1}.$$
In our example, where $k = 0$, this then gives us the much more reasonable estimate
$$\hat{p} = \frac{1}{2\left(N + 1\right)} > 0.$$
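To make the contrast concrete, here's a minimal sketch (with hypothetical hunt counts, not real wolf data) comparing the maximum likelihood estimate $k / N$ with the MAP estimate $(k + \frac{1}{2}) / (N + 1)$ when no successes are observed:

import numpy as np

def ml_estimate(k, N):
    # Maximum likelihood estimate of the success probability.
    return k / N

def map_estimate(k, N):
    # MAP estimate under the prior pi(b) proportional to sqrt(sigma(b) * (1 - sigma(b))).
    return (k + 0.5) / (N + 1)

for N in [5, 20, 100]:  # hypothetical numbers of observed hunts
    k = 0               # no successful hunts observed
    print(N, ml_estimate(k, N), map_estimate(k, N))

The maximum likelihood estimate is stuck at 0 no matter how few hunts we observe, while the MAP estimate shrinks toward 0 gradually as the evidence accumulates.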
The prior we used is an example of a noninformative prior.
What Are Noninformative Priors and How Do We Find Them
Suppose we’re given a likelihood function $L(\theta; y)$ for a model with parameters $\theta$ and data $y$. What’s a good prior to use when we have no particular knowledge about $\theta$?
Let’s look at a concrete example. Suppose data is generated from a normal distribution with known variance $\sigma^2$ but unknown mean $\mu$. The likelihood function for this model is
$$L(\mu; \mathbf{y}) \propto \exp\left(-\frac{n}{2\sigma^2}\left(\mu - \bar{y}\right)^2\right),$$
where $\bar{y}$ is the mean of the observed data. And here’s a plot of the likelihood for fixed $\sigma^2$ and different values of $\bar{y}$:
Note how the likelihood curves have the same shape for all values of $\bar{y}$ and vary only by translation. When the likelihood curve has this property, we say that it is data translated.
Now consider two ranges of the same size, $(a, a + h)$ and $(b, b + h)$. Both would be equivalently placed relative to the likelihood curve if the observed data mean were $\bar{y} + (b - a)$ instead of $\bar{y}$. Thus, we should have
$$\int_a^{a+h} \pi(\mu)\, d\mu = \int_b^{b+h} \pi(\mu)\, d\mu,$$
or the uniform prior
$$\pi(\mu) \propto \text{constant}.$$
Suppose we use an alternative parameterization for a likelihood function $L(\theta; y)$: we parameterize by $\phi = g(\theta)$ and apply the uniform prior to $\phi$, where $g$ is some monotonic injective transformation. The cumulative distribution function is then given by
$$F(\phi_0) = c \int^{\phi_0} \! d\phi,$$
where $c$ is some normalizing constant. Making the substitution $\phi = g(\theta)$ to go back to the original parameterization then gives us
$$F\left(g(\theta_0)\right) = c \int^{\theta_0} \! g'(\theta)\, d\theta.$$
So we see that a uniform prior over $\phi$ is equivalent to the prior $\pi(\theta) \propto g'(\theta)$ over $\theta$. And finding a noninformative prior for $\theta$ is equivalent to finding a transformation $g$ that makes the likelihood function data translated.
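As a quick sanity check on this change-of-variables argument, here's a small Monte Carlo sketch (using an arbitrary illustrative transformation $g(\theta) = \theta^3$, not one tied to any model in this post): sampling $\phi$ uniformly and mapping back through $g^{-1}$ yields a density on $\theta$ proportional to $g'(\theta)$.

import numpy as np

def g(theta):
    # Illustrative monotonic transformation on (0, 2).
    return theta**3

def g_inv(phi):
    return phi**(1.0 / 3.0)

def g_prime(theta):
    return 3 * theta**2

# Sample phi uniformly over g((0, 2)) = (0, 8) and map back to theta.
np.random.seed(0)
phi = np.random.uniform(g(0.0), g(2.0), size=1_000_000)
theta = g_inv(phi)

# Compare the empirical density of theta to the predicted density,
# g'(theta) normalized over (0, 2), i.e. 3 * theta**2 / 8.
hist, edges = np.histogram(theta, bins=20, range=(0, 2), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
predicted = g_prime(centers) / 8.0
print(np.max(np.abs(hist - predicted)))  # small, up to sampling error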
Let’s now return to the likelihood function for a logistic regression model with only a bias term,
$$L(b; k) = \sigma(b)^{k}\left(1 - \sigma(b)\right)^{N - k}.$$
Plotting the likelihood for fixed $N$ and different values of $k$ gives us
It’s pretty clear from looking at the graph that the likelihood function is not data translated. The curves for intermediate values of $k$ are more peaked than the others, and the curves for values of $k$ near the extremes are skewed.
Now, for suitably behaved likelihood functions, we can approximate $\log L(\theta; y)$ by a Gaussian using a second-order Taylor series expansion of $\log L$ about its maximum $\hat{\theta}$,
$$\log L(\theta; y) \approx \log L(\hat{\theta}; y) - \frac{1}{2} h \left(\theta - \hat{\theta}\right)^2,$$
where
$$h = -\left.\frac{\partial^2}{\partial \theta^2} \log L(\theta; y)\right|_{\theta = \hat{\theta}}.$$
As $n \to \infty$, the approximation becomes more and more accurate.
Let’s see how this works for our simplified logistic regression model.
Differentiating the log-likelihood gives us
$$\frac{\partial}{\partial b} \log L(b; k) = k - N\sigma(b).$$
Differentiating again gives us
$$\frac{\partial^2}{\partial b^2} \log L(b; k) = -N\sigma(b)\left(1 - \sigma(b)\right).$$
And here’s what the Gaussian approximation looks like
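Here's a short sketch, with hypothetical values of $N$ and $k$, that overlays the normalized bias-only likelihood with the Gaussian approximation built from these two derivatives:

import numpy as np
from scipy import integrate, optimize
import matplotlib.pyplot as plt

N, k = 20, 5  # hypothetical counts

def log_likelihood(b):
    sigma = 1 / (1 + np.exp(-b))
    return k * np.log(sigma) + (N - k) * np.log(1 - sigma)

# Maximum likelihood estimate of the bias term and the curvature there.
b_hat = optimize.minimize_scalar(lambda b: -log_likelihood(b)).x
sigma_hat = 1 / (1 + np.exp(-b_hat))
h = N * sigma_hat * (1 - sigma_hat)  # negative second derivative at b_hat

bx = np.linspace(b_hat - 3, b_hat + 3, 200)
Z, _ = integrate.quad(lambda b: np.exp(log_likelihood(b)), -20, 20)

plt.plot(bx, np.exp(log_likelihood(bx)) / Z, label='normalized likelihood')
plt.plot(bx, np.sqrt(h / (2 * np.pi)) * np.exp(-0.5 * h * (bx - b_hat)**2),
         label='Gaussian approximation')
plt.xlabel('b')
plt.legend()
plt.show()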
Now, consider what happens to our Gaussian approximation of $\log L$ when we reparameterize with $\phi = g(\theta)$.
Differentiating gives us
$$\frac{\partial}{\partial \phi} \log L = \frac{\partial}{\partial \theta} \log L \cdot \frac{d\theta}{d\phi}.$$
Differentiating again gives us
$$\frac{\partial^2}{\partial \phi^2} \log L = \frac{\partial^2}{\partial \theta^2} \log L \cdot \left(\frac{d\theta}{d\phi}\right)^2 + \frac{\partial}{\partial \theta} \log L \cdot \frac{d^2\theta}{d\phi^2}.$$
Evaluating at a maximum $\hat{\phi} = g(\hat{\theta})$, we have
$$\left.\frac{\partial^2}{\partial \phi^2} \log L\right|_{\hat{\phi}} = \left.\frac{\partial^2}{\partial \theta^2} \log L\right|_{\hat{\theta}} \cdot \left(\frac{d\theta}{d\phi}\right)^2,$$
where we’ve applied
$$\left.\frac{\partial}{\partial \theta} \log L\right|_{\hat{\theta}} = 0.$$
Returning to logistic regression, put
$$\frac{d\phi}{db} = g'(b) = \sqrt{\sigma(b)\left(1 - \sigma(b)\right)}.$$
Then we have
$$\left(\frac{db}{d\phi}\right)^2 = \frac{1}{\sigma(b)\left(1 - \sigma(b)\right)}.$$
Thus, the Gaussian approximation about $\hat{\phi}$ becomes
$$\log L \approx \log L(\hat{\phi}; k) - \frac{N}{2}\left(\phi - \hat{\phi}\right)^2,$$
whose curvature no longer depends on the observed data, so that the likelihood function is now approximately data translated.
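One convenient closed form for this transformation is $\phi = g(b) = 2\arcsin\sqrt{\sigma(b)}$, since $\frac{d}{db}\, 2\arcsin\sqrt{\sigma(b)} = \sqrt{\sigma(b)\left(1 - \sigma(b)\right)}$. Here's a small finite-difference check, with hypothetical values of $N$ and $k$, that the curvature of the log-likelihood in $\phi$ at its maximum is approximately $-N$ no matter what $k$ is:

import numpy as np

N = 20  # hypothetical number of observations

def curvature_in_phi(k, eps=1e-4):
    # Second derivative of the log-likelihood with respect to
    # phi = 2 * arcsin(sqrt(sigma(b))), estimated by finite differences
    # at the maximum phi_hat.
    def log_likelihood(phi):
        p = np.sin(phi / 2)**2  # invert phi = 2 * arcsin(sqrt(p))
        return k * np.log(p) + (N - k) * np.log(1 - p)
    phi_hat = 2 * np.arcsin(np.sqrt(k / N))
    return (log_likelihood(phi_hat + eps)
            - 2 * log_likelihood(phi_hat)
            + log_likelihood(phi_hat - eps)) / eps**2

for k in [2, 5, 10, 15, 18]:
    print(k, curvature_in_phi(k))  # each is approximately -N = -20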
Jeffreys Prior
Let’s now give a brief sketch of Jeffreys prior for the multiparameter case. Let $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_p)^\top$ denote the vector of parameters.
Similar to the single-parameter case, the likelihood function can be approximated by a Gaussian about the optimum $\hat{\boldsymbol{\theta}}$,
$$\log L(\boldsymbol{\theta}; y) \approx \log L(\hat{\boldsymbol{\theta}}; y) - \frac{1}{2}\left(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}\right)^\top H \left(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}\right),$$
where
$$H_{st} = -\left.\frac{\partial^2}{\partial \theta_s \, \partial \theta_t} \log L(\boldsymbol{\theta}; y)\right|_{\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}}.$$
Now, suppose we reparameterize with $\boldsymbol{\phi} = g(\boldsymbol{\theta})$. Then the updated hessian for the Gaussian approximation becomes
$$H' = J^\top H J,$$
where $J$ is the Jacobian of the inverse transformation,
$$J_{st} = \frac{\partial \theta_s}{\partial \phi_t}.$$
In the general multivariable case, we may not be able to reparameterize in such a way that the Gaussian approximation becomes data translated, but suppose we select $g$ so that
$$\det H' = \det\left(J^\top H J\right) = \left(\det J\right)^2 \det H$$
is constant. Then
$$\left|\det J\right| \propto \frac{1}{\sqrt{\det H}}.$$
Having a constant determinant means that the volume of regions will be preserved.
Let $R$ be a region about the origin and $\Sigma = \left(H'\right)^{-1}$ be the covariance matrix of the reparameterized Gaussian approximation about $\hat{\boldsymbol{\phi}}$. If $\Sigma = C C^\top$ is the Cholesky factorization of $\Sigma$,
then $\hat{\boldsymbol{\phi}} + C R$ corresponds to the region about $\hat{\boldsymbol{\phi}}$ that has fixed probability mass and volume
$$\operatorname{vol}\left(C R\right) = \left|\det C\right| \operatorname{vol}(R) = \frac{\operatorname{vol}(R)}{\sqrt{\det H'}}.$$
The corresponding prior for $\boldsymbol{\theta}$ of a uniform prior on $\boldsymbol{\phi}$ is
$$\pi(\boldsymbol{\theta}) \propto \left|\det J\right|^{-1} \propto \sqrt{\det H(\boldsymbol{\theta})}.$$
And this is called the Jeffreys prior.
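As a quick numerical check of the volume claim, here's a sketch (using a random positive-definite matrix in place of $H'$) verifying that the Cholesky factor $C$ of $\Sigma = (H')^{-1}$ scales volumes by $\left|\det C\right| = 1 / \sqrt{\det H'}$:

import numpy as np

np.random.seed(0)
p = 3
M = np.random.normal(size=(p, p))
H_prime = M @ M.T + p * np.eye(p)  # a random positive-definite "hessian"
Sigma = np.linalg.inv(H_prime)
C = np.linalg.cholesky(Sigma)

# A unit cube R about the origin has volume 1; C maps it to a
# parallelepiped of volume |det C| = 1 / sqrt(det H').
print(np.abs(np.linalg.det(C)))
print(1 / np.sqrt(np.linalg.det(H_prime)))  # matches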
Let’s apply this to the full binomial logistic regression model with weights $\mathbf{w}$. The negative log-likelihood function is given by
$$\ell(\mathbf{w}) = -\log L(\mathbf{w}; X, \mathbf{y}) = -\sum_{i=1}^n \left[ y_i \log \sigma(u_i) + \left(1 - y_i\right) \log\left(1 - \sigma(u_i)\right) \right],$$
where
$$u_i = \mathbf{x}_i^\top \mathbf{w}.$$
Differentiating gives us
$$\frac{\partial \ell}{\partial w_j} = \sum_{i=1}^n \left(\sigma(u_i) - y_i\right) x_{ij},$$
where $x_{ij}$ denotes the $j$th entry of the $i$th row of $X$.
Differentiating again gives us
$$\frac{\partial^2 \ell}{\partial w_j \, \partial w_k} = \sum_{i=1}^n \sigma(u_i)\left(1 - \sigma(u_i)\right) x_{ij} x_{ik},$$
where we've used
$$\frac{\partial}{\partial w_k} \sigma(u_i) = \sigma(u_i)\left(1 - \sigma(u_i)\right) x_{ik}.$$
We can write the hessian as
$$H = X^\top A X,$$
where $A$ is the diagonal matrix with
$$A_{ii} = \sigma(u_i)\left(1 - \sigma(u_i)\right).$$
The Jeffreys prior is then
$$\pi(\mathbf{w}) \propto \sqrt{\det\left(X^\top A X\right)}.$$
Fitting Logistic Regression with Jeffreys Prior
Finding the MAP model for logistic regression with the Jeffreys prior amounts to solving the optimization problem
$$\hat{\mathbf{w}}_{\mathrm{MAP}} = \operatorname*{argmax}_{\mathbf{w}} \left\{ \log L(\mathbf{w}; X, \mathbf{y}) + \frac{1}{2} \log \det\left(X^\top A X\right) \right\}.$$
Let $h(\mathbf{w})$ denote the log of Jeffreys prior,
$$h(\mathbf{w}) = \frac{1}{2} \log \det\left(X^\top A X\right).$$
If we can find equations for the gradient and hessian of $h$, then we can apply standard second-order optimization algorithms to find $\hat{\mathbf{w}}_{\mathrm{MAP}}$.
Starting with the gradient, we can apply Jacobi's formula for differentiating a determinant,
$$\frac{\partial}{\partial w_j} \log \det M = \operatorname{tr}\left(M^{-1} \frac{\partial M}{\partial w_j}\right),$$
to get
$$\frac{\partial h}{\partial w_j} = \frac{1}{2} \operatorname{tr}\left(\left(X^\top A X\right)^{-1} X^\top \frac{\partial A}{\partial w_j} X\right),$$
where the derivative of $A$ is the diagonal matrix
$$\left(\frac{\partial A}{\partial w_j}\right)_{ii} = \sigma(u_i)\left(1 - \sigma(u_i)\right)\left(1 - 2\sigma(u_i)\right) x_{ij}.$$
For the hessian, we apply the formula for differentiating an inverse matrix,
$$\frac{\partial M^{-1}}{\partial w_k} = -M^{-1} \frac{\partial M}{\partial w_k} M^{-1},$$
to get
$$\frac{\partial^2 h}{\partial w_j \, \partial w_k} = \frac{1}{2} \operatorname{tr}\left( \left(X^\top A X\right)^{-1} X^\top \frac{\partial^2 A}{\partial w_j \, \partial w_k} X - \left(X^\top A X\right)^{-1} X^\top \frac{\partial A}{\partial w_k} X \left(X^\top A X\right)^{-1} X^\top \frac{\partial A}{\partial w_j} X \right),$$
where
$$\left(\frac{\partial^2 A}{\partial w_j \, \partial w_k}\right)_{ii} = \sigma(u_i)\left(1 - \sigma(u_i)\right)\left(1 - 6\sigma(u_i) + 6\sigma(u_i)^2\right) x_{ij} x_{ik}.$$
By evaluating the gradient and hessian computations in an efficient manner and applying a second-order trust-region optimizer, we can then quickly iterate to find $\hat{\mathbf{w}}_{\mathrm{MAP}}$.
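Before turning to the examples, here's a small sketch (on a random design matrix, independent of any bbai internals) that checks the Jacobi-formula gradient of $h(\mathbf{w})$ against a central finite difference:

import numpy as np

np.random.seed(0)
n, p = 50, 3
X = np.random.normal(size=(n, p))
w = np.random.normal(size=p)

def log_prior(w):
    # h(w) = 0.5 * log det(X^T A X) with A = diag(sigma * (1 - sigma)).
    s = 1 / (1 + np.exp(-X @ w))
    A = np.diag(s * (1 - s))
    return 0.5 * np.linalg.slogdet(X.T @ A @ X)[1]

def log_prior_gradient(w):
    s = 1 / (1 + np.exp(-X @ w))
    a = s * (1 - s)
    M_inv = np.linalg.inv(X.T @ (a[:, None] * X))
    grad = np.zeros(p)
    for j in range(p):
        dA = np.diag(a * (1 - 2 * s) * X[:, j])  # dA/dw_j
        grad[j] = 0.5 * np.trace(M_inv @ X.T @ dA @ X)
    return grad

# Compare against central finite differences; the difference is approximately zero.
eps = 1e-6
fd = np.array([
    (log_prior(w + eps * e) - log_prior(w - eps * e)) / (2 * eps)
    for e in np.eye(p)
])
print(np.max(np.abs(fd - log_prior_gradient(w))))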
Let’s look at how this works on some examples.
Example 1: Simulated Data with Single Variable
We’ll start by looking at the prior and posterior for a simulated dataset with a single regressor and no bias term.
We begin by generating the dataset.
import numpy as np
from scipy import integrate
import matplotlib.pyplot as plt
from bbai.glm import LogisticRegressionMAP

def generate_correlation_matrix(p, param):
    # p x p correlation matrix with constant off-diagonal correlation param
    res = np.zeros(shape=(p, p))
    for s in range(p):
        for t in range(0, s + 1):
            corr = param
            if s == t:
                corr = 1.0
            res[s, t] = corr
            res[t, s] = corr
    return res

def generate_design_matrix(n, K):
    # n rows of regressors drawn from a zero-mean multivariate normal with covariance K
    mean = np.zeros(K.shape[0])
    return np.random.multivariate_normal(mean, K, size=n)

def generate_weights(p):
    return np.random.normal(size=p)

def generate_data_set(n, K):
    p = K.shape[0]
    X = generate_design_matrix(n, K)
    w = generate_weights(p)
    u = np.dot(X, w)
    probs = 1 / (1 + np.exp(-u))  # per-observation success probabilities
    y = np.array([np.random.binomial(1, prob) for prob in probs])
    return X, y, w

np.random.seed(0)
n = 20
p = 1
K = generate_correlation_matrix(p, 0.5)
X, y, w_true = generate_data_set(n, K)
We next add functions to compute the prior
def compute_a_matrix(X, u):
    # Diagonal matrix A with A[i, i] = sigma(u_i) * (1 - sigma(u_i))
    p_vector = 1 / (1 + np.exp(-u))
    return np.diag(p_vector * (1 - p_vector))

def compute_fisher_information_matrix(X, u):
    A = compute_a_matrix(X, u)
    return np.dot(X.T, np.dot(A, X))

def compute_log_prior(X, u):
    # Log of the Jeffreys prior (up to an additive constant): 0.5 * log det(X^T A X)
    FIM = compute_fisher_information_matrix(X, u)
    return 0.5 * np.linalg.slogdet(FIM)[1]

def f_prior(w):
    u = np.dot(X, np.array([w]))
    return np.exp(compute_log_prior(X, u))
And plot it out across a range of values
Z, _ = integrate.quad(f_prior, -100, 100)
wx = np.arange(-20, 20, 0.1)
yx = [f_prior(w)/Z for w in wx]
plt.plot(wx, yx)
plt.xlabel('w')
plt.ylabel('Probability')
plt.show()
Next, we fit a logistic regression MAP model and compare the result to the true weight $w_{\text{true}}$.
model = LogisticRegressionMAP(fit_intercept=False)
model.fit(X, y)
w_map = model.coef_[0]
print('w_true =', w_true[0])
print('w_map = ', w_map[0])
Prints
And finally, we plot out the posterior distribution.
def compute_log_posterior(X, y, w):
    u = np.dot(X, w)
    log_prior = compute_log_prior(X, u)
    z = 2 * y - 1  # convert labels from {0, 1} to {-1, +1}
    log_likelihood = -np.sum(np.log(1 + np.exp(-z * u)))
    return log_likelihood + log_prior

def f_posterior(w):
    return np.exp(compute_log_posterior(X, y, np.array([w])))
Z,_ = integrate.quad(f_posterior, -100, 100)
yx = [f_posterior(w)/Z for w in wx]
plt.plot(wx, yx)
plt.xlabel('w')
plt.axvline(x=w_map[0], color='red', label='w_map')
plt.ylabel('Probability')
plt.legend()
plt.show()
The full example is available at github.com/rnburn/bbai/example/05-jeffreys1.ipynb.
Example 2: Simulated Data with Two Variables
Let’s look next at an example with two variables.
We generate a data set with two variables
np.random.seed(0)
n = 20
p = 2
K = generate_correlation_matrix(p, 0.5)
X, y, w_true = generate_data_set(n, K)
We compute the prior and plot it across a range of values
def evaluate_mesh(f, W1, W2):
n, m = W1.shape
res = np.zeros(W1.shape)
for i in range(n):
for j in range(m):
res[i, j] = f(W1[i, j], W2[i, j])
return res
def f_prior(w1, w2):
u = np.dot(X, [w1, w2])
return np.exp(compute_log_prior(X, u))
Z,_ = integrate.nquad(f_prior, [[-50, 50], [-50, 50]])
wx = np.arange(-7, 7, 0.025)
W1, W2 = np.meshgrid(wx, wx)
P = evaluate_mesh(f_prior, W1, W2) / Z
plt.contour(W1, W2, P, 50, colors='k')
plt.xlabel('w1')
plt.ylabel('w2')
Now, we’ll fit a model to find $\hat{\mathbf{w}}_{\mathrm{MAP}}$.
model = LogisticRegressionMAP(fit_intercept=False)
model.fit(X, y)
w_map = model.coef_[0]
print('w_true =', w_true)
print('w_map = ', w_map)
Prints
And finally, we plot out the posterior distribution, centered at $\hat{\mathbf{w}}_{\mathrm{MAP}}$.
def f_posterior(w1, w2):
return np.exp(compute_log_posterior(X, y, np.array([w1, w2])))
Z,_ = integrate.nquad(f_posterior, [[-50, 50], [-50, 50]])
s1 = model.laplace_covariance_matrix_[0, 0]**.5
s2 = model.laplace_covariance_matrix_[1, 1]**.5
w1x = np.arange(w_map[0] - 2.5*s1, w_map[0] + 2.5*s1, 0.025)
w2x = np.arange(w_map[1] - 2.5*s2, w_map[1] + 2.5*s2, 0.025)
W1, W2 = np.meshgrid(w1x, w2x)
P = evaluate_mesh(f_posterior, W1, W2) / Z
plt.contour(W1, W2, P, 50, colors='k')
plt.xlabel('w1')
plt.ylabel('w2')
The full example is available at github.com/rnburn/bbai/example/06-jeffreys2.ipynb.
Example 3: Breast Cancer Data Set
Now we’ll fit a model to the real-world breast cancer data set.
We begin by loading and preprocessing the data set.
import numpy as np
import bbai.glm
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = data['data']
y = data['target']
features = list(data['feature_names'])
features.append('intercept')
n, p = X.shape
X = StandardScaler().fit_transform(X)
We then fit a logistic regression MAP model and use the Fisher information matrix to estimate the standard error.
model = bbai.glm.LogisticRegressionMAP()
model.fit(X, y)
w_map = list(model.coef_[0])
w_map.append(model.intercept_[0])
s_map = 1.0 / np.sqrt(np.diag(model.hessian_))
And we print out the weights along with their standard errors.
max_len = max([len(s) for s in features])
fmt = '%-' + str(max_len) + 's'
for j in range(p+1):
print(fmt % features[j], '%f (%f)' % (w_map[j], s_map[j]))
Feature | weight (standard error)
---|---
mean radius | 18.844369 (0.246536)
mean texture | 0.718237 (0.264921)
mean perimeter | -19.185223 (0.247434)
mean area | -0.157821 (0.206162)
mean smoothness | 0.367833 (0.319041)
mean compactness | 3.227443 (0.262862)
mean concavity | -0.022651 (0.226283)
mean concave points | -2.040344 (0.308589)
mean symmetry | -0.325849 (0.245192)
mean fractal dimension | -0.766704 (0.210683)
radius error | -10.264099 (0.134592)
texture error | 0.806598 (0.217691)
perimeter error | 5.748345 (0.127539)
area error | 3.829019 (0.111033)
smoothness error | 0.011620 (0.178030)
compactness error | -0.722398 (0.144933)
concavity error | 0.377161 (0.105596)
concave points error | -1.083376 (0.158857)
symmetry error | -0.103406 (0.191188)
fractal dimension error | 1.550249 (0.127858)
worst radius | -4.162614 (0.292164)
worst texture | -2.733197 (0.252259)
worst perimeter | -12.439716 (0.290176)
worst area | 11.414854 (0.237330)
worst smoothness | -1.011890 (0.274542)
worst compactness | 1.888830 (0.218036)
worst concavity | -1.708811 (0.213529)
worst concave points | 0.375874 (0.370696)
worst symmetry | -0.268151 (0.203808)
worst fractal dimension | -1.351309 (0.216937)
intercept | 1.333285 (0.246327)
The full example is available at github.com/rnburn/bbai/example/07-jeffreys-breast-cancer.py.
Conclusion and Further Reading
Unlike Ordinary Least Squares, the likelihood function for logistic regression is not data translated. Searching for a data translated transformation leads us to Jeffreys prior, a natural shrinkage prior.
Firth (1993) and Kosmidis & Firth (2021) analyzed the statistical properties of the estimator that maximizes the likelihood penalized by Jeffreys prior and found it to have asymptotic bias of smaller order than the standard maximum likelihood estimator. The reduced-bias property of Jeffreys prior, combined with its ability to handle the problem of separation (Heinze & Schemper, 2002), can make it an excellent drop-in replacement for the standard approach to fitting logistic regression models.