How to Fit Logistic Regression with a Noninformative Prior
This guide will cover:

- What is the difference between MAP and ML models
- What is a noninformative prior
- What is Jeffreys prior
- How to fit logistic regression models with Jeffreys prior
Introduction
Suppose $X$ denotes a matrix of regressors and $\mathbf{y}$ denotes a vector of target values.
If we hypothesize that the data is generated from a logistic regression model, then our belief in the weights $\mathbf{w}$ after seeing the data is given by
$$p(\mathbf{w} \mid X, \mathbf{y}) \propto p(\mathbf{y} \mid X, \mathbf{w}) \, p(\mathbf{w}).$$
Here, $p(\mathbf{y} \mid X, \mathbf{w})$ is called the likelihood and is given by
$$p(\mathbf{y} \mid X, \mathbf{w}) = \prod_{i=1}^n \sigma\left(\mathbf{x}_i^\top \mathbf{w}\right)^{y_i} \left(1 - \sigma\left(\mathbf{x}_i^\top \mathbf{w}\right)\right)^{1 - y_i}, \qquad \sigma(u) = \frac{1}{1 + e^{-u}}.$$
And $p(\mathbf{w})$ is called the prior and quantifies our belief in $\mathbf{w}$ before seeing data.
Frequently, logistic regression is fit to maximize $p(\mathbf{y} \mid X, \mathbf{w})$ without regard to the prior, and
$$\hat{\mathbf{w}}_{\mathrm{ML}} = \operatorname*{argmax}_{\mathbf{w}} \; p(\mathbf{y} \mid X, \mathbf{w})$$
is what we call a maximum likelihood estimate for the weights.
But disregarding the prior is generally a bad idea. Let’s see why.
What’s Wrong with Maximum Likelihood Models
Consider a concrete example.
Suppose you’re a biologist researching wolves. As part of your work, you’ve been tracking wolf packs and observing their hunting behavior.
You’d like to understand what makes a hunt successful. To that end, you fit a logistic regression model to predict whether a hunt will succeed or fail. It would be interesting to consider factors such as pack size and prey type, but to keep it simple you start with only a bias term.
Let $N$ denote the total number of hunts. If $b$ denotes the value of our bias term, then the likelihood of observing $k$ successful hunts is given by
$$L(b; k) = \sigma(b)^{k} \left(1 - \sigma(b)\right)^{N - k}.$$
Now, let's suppose that we observe $N$ hunts and record no successful hunts. (The hunter success rate for wolves is actually quite low, so such a result would not be surprising.) What would be a good working estimate for the success probability $\sigma(b)$? The maximum likelihood estimate would be $\hat{\sigma} = 0$ (that is, $b \to -\infty$), yet we know that wouldn't make sense: why bother hunting at all if there's no chance of success?
We can obtain a better working estimate if, instead of selecting $b$ to maximize likelihood, we select $b$ to maximize the probability of the posterior distribution
$$p(b \mid k) \propto L(b; k) \, \pi(b)$$
with a suitably chosen prior $\pi(b)$. We'll discuss later the process for selecting priors, but for now consider the prior
$$\pi(b) \propto \sqrt{\sigma(b)\left(1 - \sigma(b)\right)}.$$
The posterior for this prior is
$$p(b \mid k) \propto \sigma(b)^{k + \frac{1}{2}} \left(1 - \sigma(b)\right)^{N - k + \frac{1}{2}}.$$
Put $\hat{p} = \sigma(\hat{b})$. We can find the MAP estimate by taking the logarithm, differentiating with respect to $b$, and setting the result to 0:
$$\frac{\partial}{\partial b} \log p(b \mid k) = \left(k + \tfrac{1}{2}\right)\left(1 - \sigma(b)\right) - \left(N - k + \tfrac{1}{2}\right)\sigma(b).$$
Setting this to 0 and solving, we have
$$\hat{p} = \frac{k + \frac{1}{2}}{N + 1}.$$
In our example, where $k = 0$, this then gives us the much more reasonable estimate
$$\hat{p} = \frac{1}{2\left(N + 1\right)} > 0.$$
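To make the contrast concrete, here's a minimal sketch (with hypothetical hunt counts, not real wolf data) comparing the maximum likelihood estimate $k / N$ with the MAP estimate $(k + \frac{1}{2}) / (N + 1)$ when no successes are observed:

import numpy as np

def ml_estimate(k, N):
    # Maximum likelihood estimate of the success probability.
    return k / N

def map_estimate(k, N):
    # MAP estimate under the prior pi(b) proportional to sqrt(sigma(b) * (1 - sigma(b))).
    return (k + 0.5) / (N + 1)

for N in [5, 20, 100]:  # hypothetical numbers of observed hunts
    k = 0               # no successful hunts observed
    print(N, ml_estimate(k, N), map_estimate(k, N))

The maximum likelihood estimate is stuck at 0 no matter how few hunts we observe, while the MAP estimate shrinks toward 0 gradually as the evidence accumulates.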
The prior we used is an example of a noninformative prior.
What Are Noninformative Priors and How Do We Find Them
Suppose we’re given a likelihood function $L(\theta; y)$ for a model with parameters $\theta$ and data $y$. What’s a good prior to use when we have no particular knowledge about $\theta$?
Let’s look at a concrete example. Suppose data is generated from a normal distribution with known variance $\sigma^2$ but unknown mean $\mu$. The likelihood function for this model is
$$L(\mu; \mathbf{y}) \propto \exp\left(-\frac{n}{2\sigma^2}\left(\mu - \bar{y}\right)^2\right),$$
where $\bar{y}$ is the mean of the observed data. And here’s a plot of the likelihood for fixed $\sigma^2$ and different values of $\bar{y}$:
Note how the likelihood curves have the same shape for all values of $\bar{y}$ and vary only by translation. When the likelihood curve has this property, we say that it is data translated.
Now consider two ranges of the same size, $(a, a + h)$ and $(b, b + h)$. Both would be equivalently placed relative to the likelihood curve if the observed data mean were $\bar{y} + (b - a)$ instead of $\bar{y}$. Thus, we should have
$$\int_a^{a+h} \pi(\mu)\, d\mu = \int_b^{b+h} \pi(\mu)\, d\mu,$$
or the uniform prior
$$\pi(\mu) \propto \text{constant}.$$
Suppose we use an alternative parameterization for a likelihood function $L(\theta; y)$: we parameterize by $\phi = g(\theta)$ and apply the uniform prior to $\phi$, where $g$ is some monotonic injective transformation. The cumulative distribution function is then given by
$$F(\phi_0) = c \int^{\phi_0} \! d\phi,$$
where $c$ is some normalizing constant. Making the substitution $\phi = g(\theta)$ to go back to the original parameterization then gives us
$$F\left(g(\theta_0)\right) = c \int^{\theta_0} \! g'(\theta)\, d\theta.$$
So we see that a uniform prior over $\phi$ is equivalent to the prior $\pi(\theta) \propto g'(\theta)$ over $\theta$. And finding a noninformative prior for $\theta$ is equivalent to finding a transformation $g$ that makes the likelihood function data translated.
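As a quick sanity check on this change-of-variables argument, here's a small Monte Carlo sketch (using an arbitrary illustrative transformation $g(\theta) = \theta^3$, not one tied to any model in this post): sampling $\phi$ uniformly and mapping back through $g^{-1}$ yields a density on $\theta$ proportional to $g'(\theta)$.

import numpy as np

def g(theta):
    # Illustrative monotonic transformation on (0, 2).
    return theta**3

def g_inv(phi):
    return phi**(1.0 / 3.0)

def g_prime(theta):
    return 3 * theta**2

# Sample phi uniformly over g((0, 2)) = (0, 8) and map back to theta.
np.random.seed(0)
phi = np.random.uniform(g(0.0), g(2.0), size=1_000_000)
theta = g_inv(phi)

# Compare the empirical density of theta to the predicted density,
# g'(theta) normalized over (0, 2), i.e. 3 * theta**2 / 8.
hist, edges = np.histogram(theta, bins=20, range=(0, 2), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
predicted = g_prime(centers) / 8.0
print(np.max(np.abs(hist - predicted)))  # small, up to sampling error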
Let’s now return to the likelihood function for a logistic regression model with only a bias term,
$$L(b; k) = \sigma(b)^{k}\left(1 - \sigma(b)\right)^{N - k}.$$
Plotting the likelihood for fixed $N$ and different values of $k$ gives us
It’s pretty clear from looking at the graph that the likelihood function is not data translated. The curves for intermediate values of $k$ are more peaked than the others, and the curves for values of $k$ near the extremes are skewed.
Now, for suitably behaved likelihood functions, we can approximate $\log L(\theta; y)$ by a Gaussian using a second-order Taylor series expansion of $\log L$ about its maximum $\hat{\theta}$,
$$\log L(\theta; y) \approx \log L(\hat{\theta}; y) - \frac{1}{2} h \left(\theta - \hat{\theta}\right)^2,$$
where
$$h = -\left.\frac{\partial^2}{\partial \theta^2} \log L(\theta; y)\right|_{\theta = \hat{\theta}}.$$
As $n \to \infty$, the approximation becomes more and more accurate.
Let’s see how this works for our simplified logistic regression model.
Differentiating the log-likelihood gives us
$$\frac{\partial}{\partial b} \log L(b; k) = k - N\sigma(b).$$
Differentiating again gives us
$$\frac{\partial^2}{\partial b^2} \log L(b; k) = -N\sigma(b)\left(1 - \sigma(b)\right).$$
And here’s what the Gaussian approximation looks like
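Here's a short sketch, with hypothetical values of $N$ and $k$, that overlays the normalized bias-only likelihood with the Gaussian approximation built from these two derivatives:

import numpy as np
from scipy import integrate, optimize
import matplotlib.pyplot as plt

N, k = 20, 5  # hypothetical counts

def log_likelihood(b):
    sigma = 1 / (1 + np.exp(-b))
    return k * np.log(sigma) + (N - k) * np.log(1 - sigma)

# Maximum likelihood estimate of the bias term and the curvature there.
b_hat = optimize.minimize_scalar(lambda b: -log_likelihood(b)).x
sigma_hat = 1 / (1 + np.exp(-b_hat))
h = N * sigma_hat * (1 - sigma_hat)  # negative second derivative at b_hat

bx = np.linspace(b_hat - 3, b_hat + 3, 200)
Z, _ = integrate.quad(lambda b: np.exp(log_likelihood(b)), -20, 20)

plt.plot(bx, np.exp(log_likelihood(bx)) / Z, label='normalized likelihood')
plt.plot(bx, np.sqrt(h / (2 * np.pi)) * np.exp(-0.5 * h * (bx - b_hat)**2),
         label='Gaussian approximation')
plt.xlabel('b')
plt.legend()
plt.show()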
Now, consider what happens to our Gaussian approximation of $\log L$ when we reparameterize with $\phi = g(\theta)$.
Differentiating gives us
$$\frac{\partial}{\partial \phi} \log L = \frac{\partial}{\partial \theta} \log L \cdot \frac{d\theta}{d\phi}.$$
Differentiating again gives us
$$\frac{\partial^2}{\partial \phi^2} \log L = \frac{\partial^2}{\partial \theta^2} \log L \cdot \left(\frac{d\theta}{d\phi}\right)^2 + \frac{\partial}{\partial \theta} \log L \cdot \frac{d^2\theta}{d\phi^2}.$$
Evaluating at a maximum $\hat{\phi} = g(\hat{\theta})$, we have
$$\left.\frac{\partial^2}{\partial \phi^2} \log L\right|_{\hat{\phi}} = \left.\frac{\partial^2}{\partial \theta^2} \log L\right|_{\hat{\theta}} \cdot \left(\frac{d\theta}{d\phi}\right)^2,$$
where we’ve applied
$$\left.\frac{\partial}{\partial \theta} \log L\right|_{\hat{\theta}} = 0.$$
Returning to logistic regression, put
$$\frac{d\phi}{db} = g'(b) = \sqrt{\sigma(b)\left(1 - \sigma(b)\right)}.$$
Then we have
$$\left(\frac{db}{d\phi}\right)^2 = \frac{1}{\sigma(b)\left(1 - \sigma(b)\right)}.$$
Thus, the Gaussian approximation about $\hat{\phi}$ becomes
$$\log L \approx \log L(\hat{\phi}; k) - \frac{N}{2}\left(\phi - \hat{\phi}\right)^2,$$
whose curvature no longer depends on the observed data, so that the likelihood function is now approximately data translated.
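One convenient closed form for this transformation is $\phi = g(b) = 2\arcsin\sqrt{\sigma(b)}$, since $\frac{d}{db}\, 2\arcsin\sqrt{\sigma(b)} = \sqrt{\sigma(b)\left(1 - \sigma(b)\right)}$. Here's a small finite-difference check, with hypothetical values of $N$ and $k$, that the curvature of the log-likelihood in $\phi$ at its maximum is approximately $-N$ no matter what $k$ is:

import numpy as np

N = 20  # hypothetical number of observations

def curvature_in_phi(k, eps=1e-4):
    # Second derivative of the log-likelihood with respect to
    # phi = 2 * arcsin(sqrt(sigma(b))), estimated by finite differences
    # at the maximum phi_hat.
    def log_likelihood(phi):
        p = np.sin(phi / 2)**2  # invert phi = 2 * arcsin(sqrt(p))
        return k * np.log(p) + (N - k) * np.log(1 - p)
    phi_hat = 2 * np.arcsin(np.sqrt(k / N))
    return (log_likelihood(phi_hat + eps)
            - 2 * log_likelihood(phi_hat)
            + log_likelihood(phi_hat - eps)) / eps**2

for k in [2, 5, 10, 15, 18]:
    print(k, curvature_in_phi(k))  # each is approximately -N = -20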
Jeffreys Prior
Let’s now give a brief sketch of Jeffreys prior for the multiparameter case. Let $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_p)^\top$ denote the vector of parameters.
Similar to the single-parameter case, the likelihood function can be approximated by a Gaussian about the optimum $\hat{\boldsymbol{\theta}}$,
$$\log L(\boldsymbol{\theta}; y) \approx \log L(\hat{\boldsymbol{\theta}}; y) - \frac{1}{2}\left(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}\right)^\top H \left(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}\right),$$
where
$$H_{st} = -\left.\frac{\partial^2}{\partial \theta_s \, \partial \theta_t} \log L(\boldsymbol{\theta}; y)\right|_{\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}}.$$
Now, suppose we reparameterize with $\boldsymbol{\phi} = g(\boldsymbol{\theta})$. Then the updated hessian for the Gaussian approximation becomes
$$H' = J^\top H J,$$
where $J$ is the Jacobian of the inverse transformation,
$$J_{st} = \frac{\partial \theta_s}{\partial \phi_t}.$$
In the general multivariable case, we may not be able to reparameterize in such a way that the Gaussian approximation becomes data translated, but suppose we select $g$ so that
$$\det H' = \det\left(J^\top H J\right) = \left(\det J\right)^2 \det H$$
is constant. Then
$$\left|\det J\right| \propto \frac{1}{\sqrt{\det H}}.$$
Having a constant determinant means that the volume of regions will be preserved.
Let $R$ be a region about the origin and $\Sigma = \left(H'\right)^{-1}$ be the covariance matrix of the reparameterized Gaussian approximation about $\hat{\boldsymbol{\phi}}$. If $\Sigma = C C^\top$ is the Cholesky factorization of $\Sigma$,
then $\hat{\boldsymbol{\phi}} + C R$ corresponds to the region about $\hat{\boldsymbol{\phi}}$ that has fixed probability mass and volume
$$\operatorname{vol}\left(C R\right) = \left|\det C\right| \operatorname{vol}(R) = \frac{\operatorname{vol}(R)}{\sqrt{\det H'}}.$$
The corresponding prior for $\boldsymbol{\theta}$ of a uniform prior on $\boldsymbol{\phi}$ is
$$\pi(\boldsymbol{\theta}) \propto \left|\det J\right|^{-1} \propto \sqrt{\det H(\boldsymbol{\theta})}.$$
And this is called the Jeffreys prior.
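As a quick numerical check of the volume claim, here's a sketch (using a random positive-definite matrix in place of $H'$) verifying that the Cholesky factor $C$ of $\Sigma = (H')^{-1}$ scales volumes by $\left|\det C\right| = 1 / \sqrt{\det H'}$:

import numpy as np

np.random.seed(0)
p = 3
M = np.random.normal(size=(p, p))
H_prime = M @ M.T + p * np.eye(p)  # a random positive-definite "hessian"
Sigma = np.linalg.inv(H_prime)
C = np.linalg.cholesky(Sigma)

# A unit cube R about the origin has volume 1; C maps it to a
# parallelepiped of volume |det C| = 1 / sqrt(det H').
print(np.abs(np.linalg.det(C)))
print(1 / np.sqrt(np.linalg.det(H_prime)))  # matches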
Let’s apply this to the full binomial logistic regression model with weights $\mathbf{w}$. The negative log-likelihood function is given by
$$\ell(\mathbf{w}) = -\log L(\mathbf{w}; X, \mathbf{y}) = -\sum_{i=1}^n \left[ y_i \log \sigma(u_i) + \left(1 - y_i\right) \log\left(1 - \sigma(u_i)\right) \right],$$
where
$$u_i = \mathbf{x}_i^\top \mathbf{w}.$$
Differentiating gives us
$$\frac{\partial \ell}{\partial w_j} = \sum_{i=1}^n \left(\sigma(u_i) - y_i\right) x_{ij},$$
where $x_{ij}$ denotes the $j$th entry of the $i$th row of $X$.
Differentiating again gives us
$$\frac{\partial^2 \ell}{\partial w_j \, \partial w_k} = \sum_{i=1}^n \sigma(u_i)\left(1 - \sigma(u_i)\right) x_{ij} x_{ik},$$
where we've used
$$\frac{\partial}{\partial w_k} \sigma(u_i) = \sigma(u_i)\left(1 - \sigma(u_i)\right) x_{ik}.$$
We can write the hessian as
$$H = X^\top A X,$$
where $A$ is the diagonal matrix with
$$A_{ii} = \sigma(u_i)\left(1 - \sigma(u_i)\right).$$
The Jeffreys prior is then
$$\pi(\mathbf{w}) \propto \sqrt{\det\left(X^\top A X\right)}.$$
Fitting Logistic Regression with Jeffreys Prior
Finding the MAP model for logistic regression with the Jeffreys prior amounts to solving the optimization problem
$$\hat{\mathbf{w}}_{\mathrm{MAP}} = \operatorname*{argmax}_{\mathbf{w}} \left\{ \log L(\mathbf{w}; X, \mathbf{y}) + \frac{1}{2} \log \det\left(X^\top A X\right) \right\}.$$
Let $h(\mathbf{w})$ denote the log of Jeffreys prior,
$$h(\mathbf{w}) = \frac{1}{2} \log \det\left(X^\top A X\right).$$
If we can find equations for the gradient and hessian of $h$, then we can apply standard second-order optimization algorithms to find $\hat{\mathbf{w}}_{\mathrm{MAP}}$.
Starting with the gradient, we can apply Jacobi's formula for differentiating a determinant,
$$\frac{\partial}{\partial w_j} \log \det M = \operatorname{tr}\left(M^{-1} \frac{\partial M}{\partial w_j}\right),$$
to get
$$\frac{\partial h}{\partial w_j} = \frac{1}{2} \operatorname{tr}\left(\left(X^\top A X\right)^{-1} X^\top \frac{\partial A}{\partial w_j} X\right),$$
where the derivative of $A$ is the diagonal matrix
$$\left(\frac{\partial A}{\partial w_j}\right)_{ii} = \sigma(u_i)\left(1 - \sigma(u_i)\right)\left(1 - 2\sigma(u_i)\right) x_{ij}.$$
For the hessian, we apply the formula for differentiating an inverse matrix,
$$\frac{\partial M^{-1}}{\partial w_k} = -M^{-1} \frac{\partial M}{\partial w_k} M^{-1},$$
to get
$$\frac{\partial^2 h}{\partial w_j \, \partial w_k} = \frac{1}{2} \operatorname{tr}\left( \left(X^\top A X\right)^{-1} X^\top \frac{\partial^2 A}{\partial w_j \, \partial w_k} X - \left(X^\top A X\right)^{-1} X^\top \frac{\partial A}{\partial w_k} X \left(X^\top A X\right)^{-1} X^\top \frac{\partial A}{\partial w_j} X \right),$$
where
$$\left(\frac{\partial^2 A}{\partial w_j \, \partial w_k}\right)_{ii} = \sigma(u_i)\left(1 - \sigma(u_i)\right)\left(1 - 6\sigma(u_i) + 6\sigma(u_i)^2\right) x_{ij} x_{ik}.$$
By evaluating the gradient and hessian computations in an efficient manner and applying a second-order trust-region optimizer, we can then quickly iterate to find $\hat{\mathbf{w}}_{\mathrm{MAP}}$.
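Before turning to the examples, here's a small sketch (on a random design matrix, independent of any bbai internals) that checks the Jacobi-formula gradient of $h(\mathbf{w})$ against a central finite difference:

import numpy as np

np.random.seed(0)
n, p = 50, 3
X = np.random.normal(size=(n, p))
w = np.random.normal(size=p)

def log_prior(w):
    # h(w) = 0.5 * log det(X^T A X) with A = diag(sigma * (1 - sigma)).
    s = 1 / (1 + np.exp(-X @ w))
    A = np.diag(s * (1 - s))
    return 0.5 * np.linalg.slogdet(X.T @ A @ X)[1]

def log_prior_gradient(w):
    s = 1 / (1 + np.exp(-X @ w))
    a = s * (1 - s)
    M_inv = np.linalg.inv(X.T @ (a[:, None] * X))
    grad = np.zeros(p)
    for j in range(p):
        dA = np.diag(a * (1 - 2 * s) * X[:, j])  # dA/dw_j
        grad[j] = 0.5 * np.trace(M_inv @ X.T @ dA @ X)
    return grad

# Compare against central finite differences; the difference is approximately zero.
eps = 1e-6
fd = np.array([
    (log_prior(w + eps * e) - log_prior(w - eps * e)) / (2 * eps)
    for e in np.eye(p)
])
print(np.max(np.abs(fd - log_prior_gradient(w))))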
Let’s look at how this works on some examples.
Example 1: Simulated Data with Single Variable
We’ll start by looking at the prior and posterior for a simulated dataset with a single regressor and no bias term.
We begin by generating the dataset.
import numpy as np
from scipy import integrate
import matplotlib.pyplot as plt
from bbai.glm import LogisticRegressionMAP

def generate_correlation_matrix(p, param):
    # p x p correlation matrix with constant off-diagonal correlation param
    res = np.zeros(shape=(p, p))
    for s in range(p):
        for t in range(0, s + 1):
            corr = param
            if s == t:
                corr = 1.0
            res[s, t] = corr
            res[t, s] = corr
    return res

def generate_design_matrix(n, K):
    # n rows of regressors drawn from a zero-mean multivariate normal with covariance K
    mean = np.zeros(K.shape[0])
    return np.random.multivariate_normal(mean, K, size=n)

def generate_weights(p):
    return np.random.normal(size=p)

def generate_data_set(n, K):
    p = K.shape[0]
    X = generate_design_matrix(n, K)
    w = generate_weights(p)
    u = np.dot(X, w)
    probs = 1 / (1 + np.exp(-u))  # per-observation success probabilities
    y = np.array([np.random.binomial(1, prob) for prob in probs])
    return X, y, w

np.random.seed(0)
n = 20
p = 1
K = generate_correlation_matrix(p, 0.5)
X, y, w_true = generate_data_set(n, K)
We next add functions to compute the prior
def compute_a_matrix(X, u):
    # Diagonal matrix A with A[i, i] = sigma(u_i) * (1 - sigma(u_i))
    p_vector = 1 / (1 + np.exp(-u))
    return np.diag(p_vector * (1 - p_vector))

def compute_fisher_information_matrix(X, u):
    A = compute_a_matrix(X, u)
    return np.dot(X.T, np.dot(A, X))

def compute_log_prior(X, u):
    # Log of the Jeffreys prior (up to an additive constant): 0.5 * log det(X^T A X)
    FIM = compute_fisher_information_matrix(X, u)
    return 0.5 * np.linalg.slogdet(FIM)[1]

def f_prior(w):
    u = np.dot(X, np.array([w]))
    return np.exp(compute_log_prior(X, u))
And plot it out across a range of values
Z, _ = integrate.quad(f_prior, -100, 100)
wx = np.arange(-20, 20, 0.1)
yx = [f_prior(w)/Z for w in wx]
plt.plot(wx, yx)
plt.xlabel('w')
plt.ylabel('Probability')
plt.show()
Next, we fit a logistic regression MAP model and compare the result to the true weight $w_{\text{true}}$.
model = LogisticRegressionMAP(fit_intercept=False)
model.fit(X, y)
w_map = model.coef_[0]
print('w_true =', w_true[0])
print('w_map = ', w_map[0])
Prints
And finally, we plot out the posterior distribution.
def compute_log_posterior(X, y, w):
    u = np.dot(X, w)
    log_prior = compute_log_prior(X, u)
    z = 2 * y - 1  # convert labels from {0, 1} to {-1, +1}
    log_likelihood = -np.sum(np.log(1 + np.exp(-z * u)))
    return log_likelihood + log_prior

def f_posterior(w):
    return np.exp(compute_log_posterior(X, y, np.array([w])))
Z,_ = integrate.quad(f_posterior, -100, 100)
yx = [f_posterior(w)/Z for w in wx]
plt.plot(wx, yx)
plt.xlabel('w')
plt.axvline(x=w_map[0], color='red', label='w_map')
plt.ylabel('Probability')
plt.legend()
plt.show()
The full example is available at github.com/rnburn/bbai/example/05-jeffreys1.ipynb.
Example 2: Simulated Data with Two Variables
Let’s look next at an example with two variables.
We generate a data set with two variables
np.random.seed(0)
n = 20
p = 2
K = generate_correlation_matrix(p, 0.5)
X, y, w_true = generate_data_set(n, K)
We compute the prior and plot it across a range of values
def evaluate_mesh(f, W1, W2):
n, m = W1.shape
res = np.zeros(W1.shape)
for i in range(n):
for j in range(m):
res[i, j] = f(W1[i, j], W2[i, j])
return res
def f_prior(w1, w2):
u = np.dot(X, [w1, w2])
return np.exp(compute_log_prior(X, u))
Z,_ = integrate.nquad(f_prior, [[-50, 50], [-50, 50]])
wx = np.arange(-7, 7, 0.025)
W1, W2 = np.meshgrid(wx, wx)
P = evaluate_mesh(f_prior, W1, W2) / Z
plt.contour(W1, W2, P, 50, colors='k')
plt.xlabel('w1')
plt.ylabel('w2')
Now, we’ll fit a model to find $\hat{\mathbf{w}}_{\mathrm{MAP}}$.
model = LogisticRegressionMAP(fit_intercept=False)
model.fit(X, y)
w_map = model.coef_[0]
print('w_true =', w_true)
print('w_map = ', w_map)
Prints
And finally, we plot out the posterior distribution, centered at $\hat{\mathbf{w}}_{\mathrm{MAP}}$.
def f_posterior(w1, w2):
return np.exp(compute_log_posterior(X, y, np.array([w1, w2])))
Z,_ = integrate.nquad(f_posterior, [[-50, 50], [-50, 50]])
s1 = model.laplace_covariance_matrix_[0, 0]**.5
s2 = model.laplace_covariance_matrix_[1, 1]**.5
w1x = np.arange(w_map[0] - 2.5*s1, w_map[0] + 2.5*s1, 0.025)
w2x = np.arange(w_map[1] - 2.5*s2, w_map[1] + 2.5*s2, 0.025)
W1, W2 = np.meshgrid(w1x, w2x)
P = evaluate_mesh(f_posterior, W1, W2) / Z
plt.contour(W1, W2, P, 50, colors='k')
plt.xlabel('w1')
plt.ylabel('w2')
The full example is available at github.com/rnburn/bbai/example/06-jeffreys2.ipynb.
Example 3: Breast Cancer Data Set
Now we’ll fit a model to the real-world breast cancer data set.
We begin by loading and preprocessing the data set.
import numpy as np
import bbai.glm
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = data['data']
y = data['target']
features = list(data['feature_names'])
features.append('intercept')
n, p = X.shape
X = StandardScaler().fit_transform(X)
We then fit a logistic regression MAP model and use the Fisher information matrix to estimate the standard error.
model = bbai.glm.LogisticRegressionMAP()
model.fit(X, y)
w_map = list(model.coef_[0])
w_map.append(model.intercept_[0])
s_map = 1.0 / np.sqrt(np.diag(model.hessian_))
And we print out the weights along with their standard errors.
max_len = max([len(s) for s in features])
fmt = '%-' + str(max_len) + 's'
for j in range(p+1):
print(fmt % features[j], '%f (%f)' % (w_map[j], s_map[j]))
Feature | weight (standard error)
---|---
mean radius | 18.844369 (0.246536)
mean texture | 0.718237 (0.264921)
mean perimeter | -19.185223 (0.247434)
mean area | -0.157821 (0.206162)
mean smoothness | 0.367833 (0.319041)
mean compactness | 3.227443 (0.262862)
mean concavity | -0.022651 (0.226283)
mean concave points | -2.040344 (0.308589)
mean symmetry | -0.325849 (0.245192)
mean fractal dimension | -0.766704 (0.210683)
radius error | -10.264099 (0.134592)
texture error | 0.806598 (0.217691)
perimeter error | 5.748345 (0.127539)
area error | 3.829019 (0.111033)
smoothness error | 0.011620 (0.178030)
compactness error | -0.722398 (0.144933)
concavity error | 0.377161 (0.105596)
concave points error | -1.083376 (0.158857)
symmetry error | -0.103406 (0.191188)
fractal dimension error | 1.550249 (0.127858)
worst radius | -4.162614 (0.292164)
worst texture | -2.733197 (0.252259)
worst perimeter | -12.439716 (0.290176)
worst area | 11.414854 (0.237330)
worst smoothness | -1.011890 (0.274542)
worst compactness | 1.888830 (0.218036)
worst concavity | -1.708811 (0.213529)
worst concave points | 0.375874 (0.370696)
worst symmetry | -0.268151 (0.203808)
worst fractal dimension | -1.351309 (0.216937)
intercept | 1.333285 (0.246327)
The full example is available at github.com/rnburn/bbai/example/07-jeffreys-breast-cancer.py.
Conclusion and Further Reading
Unlike Ordinary Least Squares, the likelihood function for logistic regression is not data translated. Searching for a data translated transformation leads us to Jeffreys prior, a natural shrinkage prior.
Firth (1993) and Kosmidis & Firth (2021) analyzed the statistical properties of the estimator that maximizes the likelihood penalized by Jeffreys prior and found it to have asymptotic bias of smaller order than the standard maximum likelihood estimator. The reduced-bias property of Jeffreys prior, combined with its ability to handle the problem of separation (Heinze & Schemper, 2002), can make it an excellent drop-in replacement for the standard approach to fitting logistic regression models.