This guide describes a Bayesian algorithm for regularized linear regression.
The algorithm uses a hyperparameter to control
regularization strength and fully integrates over the hyperparameter in the posterior
distribution, applying a hyperprior selected so as to be approximately noninformative.
Let θ = (σ², w) denote the parameters of a linear regression model with weights w and normally distributed errors of variance σ².
If X is a full-rank n×p matrix with n rows of observations and p regressors, then θ specifies a probability distribution over possible target values y:

$$P(y \mid \theta) = \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^n \exp\left[-\frac{1}{2\sigma^2}(y - Xw)^\top (y - Xw)\right]$$
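For concreteness, here is a minimal sketch (not code from the library described here; the function and argument names are ours) of evaluating the log of this likelihood with NumPy:

```python
import numpy as np

# Log of P(y | theta) above, for theta = (sigma^2, w).
# y: targets of length n, X: n-by-p design matrix, w: weights, sigma2: error variance.
def log_likelihood(y, X, w, sigma2):
    n = len(y)
    resid = y - X @ w
    return -0.5 * n * np.log(2 * np.pi * sigma2) - (resid @ resid) / (2 * sigma2)
```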
Suppose we observe y and assume y is generated from a linear model with unknown
parameters. A Bayesian approach to inference seeks to quantify our belief in the unknown
parameters θ given the observation:

$$P(\theta \mid y)$$
Applying Bayes’ theorem, we can rewrite the probability as

$$P(\theta \mid y) = \frac{P(y \mid \theta) \cdot P(\theta)}{P(y)}$$
where we refer to
- P(θ ∣ y) as the posterior distribution
- P(y ∣ θ) as the likelihood function
- P(θ) as the prior distribution
The prior distribution describes our belief about θ before observing y, and
the posterior distribution describes our updated belief after observing y.
Suppose the prior distribution can be expressed as

$$P(\theta) = h(\theta, \eta)$$

where h(⋅, η) denotes a family of probability distributions parameterized
by what we call a hyperparameter η.
Traditional approaches to Bayesian linear regression have used what are called conjugate priors.
A family of priors h(⋅, η) is conjugate if the posterior also belongs
to the family:

$$P(\theta \mid y) = \frac{P(y \mid \theta) \cdot h(\theta, \eta)}{P(y)} = h(\theta, \eta')$$
Conjugate priors are mathematically convenient as successive observations can be viewed as making
successive updates to the parameters of a family of distributions, but requiring h to
be conjugate is a strong assumption.
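As a quick illustration of conjugacy (a standard textbook example, not part of the algorithm described here), consider a single observation y from a normal distribution with known variance σ² and unknown mean μ, with a normal prior μ ∼ N(μ₀, τ₀²). The posterior is again normal, so the family is conjugate and observing y simply updates the hyperparameters (μ₀, τ₀²):

$$\mu \mid y \;\sim\; \mathcal{N}\!\left(\frac{\tau_0^2\, y + \sigma^2 \mu_0}{\tau_0^2 + \sigma^2},\; \left(\frac{1}{\tau_0^2} + \frac{1}{\sigma^2}\right)^{-1}\right)$$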
Note
See Appendix A for a more detailed comparison to other Bayesian algorithms.
We’ll instead describe an algorithm where:
- Priors are selected to shrink w, reflecting the prior hypothesis that w is not predictive, and to be approximately noninformative for the other parameters.
- We fully integrate over hyperparameters so that no arbitrary choice of η is required.
Let’s first consider what it means for a prior to be noninformative.
See section 1.3 in “Bayesian Inference in Statistical Analysis”, Box and Tiao, 1973,
for greater detail on how to select noninformative priors.
Suppose y is data generated from a normal distribution with mean 0 but unknown
variance. Let σ denote the standard deviation
and let ℓ denote the likelihood function:

$$P(y \mid \sigma) \propto \ell(\sigma) \propto \left(\frac{1}{\sigma}\right)^n \exp\left[-\frac{1}{2\sigma^2}\, y^\top y\right]$$
Suppose we impose a uniform prior over σ so that the posterior is

$$P(\sigma \mid y) \propto \ell(\sigma)$$

and the cumulative distribution function is

$$P(\sigma < t \mid y) = \frac{1}{N} \int_0^t \ell(\sigma)\, d\sigma$$

where N is some normalizing constant.
But now let’s suppose that instead of standard deviation we parameterize over variance. Making
the substitution u = σ², with

$$d\sigma = \frac{1}{2\sqrt{u}}\, du,$$

into the cumulative distribution function gives us

$$P(u < t \mid y) = \frac{1}{N} \int_0^t \ell(\sqrt{u}) \left(\frac{1}{2\sqrt{u}}\right) du$$

Thus, we see that choosing a uniform prior over σ is equivalent to choosing the
improper prior

$$P(u) \propto \frac{1}{\sqrt{u}}$$
over variance. In general, suppose u = φ(θ) is an
alternative way of parameterizing the likelihood function, where φ is some monotonic,
one-to-one, onto function. Then a uniform prior over u is equivalent to choosing the prior

$$P(\theta) \propto \varphi'(\theta)$$

over θ.
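As a quick numerical sanity check of this rule (illustrative only; the variable names are ours), take φ(θ) = θ², draw u uniformly on [0, 1], and confirm that θ = √u has density proportional to φ′(θ) = 2θ:

```python
import numpy as np

# If u = phi(theta) = theta^2 is uniform on [0, 1], then theta = sqrt(u)
# should have density proportional to phi'(theta) = 2 * theta on [0, 1].
rng = np.random.default_rng(0)
u = rng.uniform(0.0, 1.0, size=1_000_000)
theta = np.sqrt(u)

# Compare an empirical histogram of theta against the predicted density 2 * theta.
hist, edges = np.histogram(theta, bins=20, range=(0.0, 1.0), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
print(np.max(np.abs(hist - 2 * centers)))  # small (on the order of 1e-2)
```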
So, when does it make sense to use a uniform prior if the choice is sensitive to parameterization?
Let’s consider how the likelihood is affected by changes in the observed value
y⊤y.
Figure: Likelihood function for the standard deviation of normally distributed data with zero mean, n=10, and different values of y⊤y.
As we can see from the figure, as y⊤y increases the shape of the likelihood
function changes: it becomes less peaked and more spread out.
Observe that we can rewrite the likelihood function as

$$\ell(\sigma) \propto \exp\left[-n\left(\log\sigma - \log s\right) - \frac{n}{2}\exp\left(-2\left(\log\sigma - \log s\right)\right)\right], \qquad s = \sqrt{y^\top y / n}$$

and we say that the likelihood function is data translated in log σ:
everything is known about the shape of the likelihood curve except its location, and the
observed value serves only to shift the curve along the log σ axis.
Figure: Likelihood function for the log standard deviation of normally distributed data with zero mean, n=10, and different values of y⊤y.
When the likelihood function is data translated in a parameter, it makes sense to use a
uniform prior over that parameter as a noninformative prior. Before observing data, we have no reason to prefer one
range of parameters [a, b] over another range of the same width
[t+a, t+b], because they will be equivalently placed relative to the
likelihood curve if the observed data translates it by t.
For the parameter σ, we use the noninformative prior

$$\Pr(\sigma) \propto \frac{1}{\sigma}$$

which is equivalent to using a uniform prior over the parameter log σ.
For w, we want an informative prior that shrinks the weights, reflecting a prior
belief that the weights are not predictive. Let η denote a hyperparameter that controls
the degree of shrinkage. Then we use the spherical normal distribution with covariance
matrix (σ/λη)² I:

$$\Pr(w \mid \sigma, \eta) \propto \left(\frac{1}{\sigma/\lambda_\eta}\right)^p \exp\left[-\frac{1}{2\left(\sigma/\lambda_\eta\right)^2}\, w^\top w\right]$$
Note that we haven’t yet described how η parameterizes λη. We’ll also
be integrating over η, so we additionally have a prior for η
(called a hyperprior), so that

$$\Pr(w, \sigma, \eta) = \Pr(\sigma) \cdot \Pr(w \mid \sigma, \eta) \cdot \Pr(\eta)$$

Our goal is for the prior Pr(η) to be noninformative. So we want to know: in what
parameterization would Pr(y ∣ η) be data translated?
Let U, Σ, and V denote the singular value decomposition of X,
so that

$$X = U \Sigma V^\top$$

Let ξ₁, …, ξp denote the nonzero diagonal entries of Σ.
Put Λ = ΣᵀΣ. Then Λ is a diagonal matrix with entries
ξ₁², …, ξp², and

$$X^\top X = V \Sigma^\top U^\top U \Sigma V^\top = V \Lambda V^\top$$
Note
To implement our algorithm, we will only need the matrix V and the nonzero diagonal
entries of Σ, which can be efficiently computed with a partial singular value
decomposition.
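For example, with NumPy a thin SVD is enough to recover V and the nonzero singular values (an illustrative sketch, not the library's internal code):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 10))  # an example n-by-p design matrix

# Thin SVD: U is n-by-p, xi holds the p nonzero singular values, Vt is p-by-p.
U, xi, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T
Lambda = np.diag(xi**2)  # Lambda = Sigma^T Sigma

# Sanity check: X^T X = V Lambda V^T
assert np.allclose(X.T @ X, V @ Lambda @ V.T)
```

For large sparse problems, a truncated or partial SVD routine (for example, scipy.sparse.linalg.svds) can be used in place of the full decomposition.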
The result of fitting a Bayesian model is the posterior distribution Pr(θ ∣ y).
Let’s consider how we can use the distribution to make predictions given a new data point
x′.
We’ll start by computing the expected target value

$$\mathbb{E}[y' \mid y, x'] = x'^\top\, \mathbb{E}[w \mid y] = \int x'^\top\, \mathbb{E}[w \mid y, \eta]\, \Pr(\eta \mid y)\, d\eta$$

To compute expected values of expressions involving η, we need to integrate over the
posterior distribution Pr(η ∣ y). We won’t have an analytical form for the integrals, but we can
integrate efficiently and accurately with an adaptive quadrature.
We’ve seen that computing statistics about predictions involves integrating over the
posterior distribution P(η ∣ y). We’ll briefly sketch out an algorithm for
computing such integrals. We describe it only for computing the expected value of
w; other expected values can be computed with straightforward modifications.
The procedure SHRINK-INVERSE applies Newton’s algorithm for
root-finding with r and r′ to compute r⁻¹.
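The function r is not defined in this excerpt, but the inversion step can be sketched generically: given r and its derivative r′, Newton's method solves r(x) = t for x. The helper below is a hypothetical sketch with our own names, assuming r is smooth and monotonic near the solution.

```python
def invert_with_newton(r, r_prime, t, x0, tol=1e-10, max_iter=100):
    """Solve r(x) = t by Newton root-finding on f(x) = r(x) - t.

    Hypothetical sketch: assumes r is smooth and monotonic near the solution."""
    x = x0
    for _ in range(max_iter):
        f = r(x) - t
        if abs(f) < tol:
            break
        x -= f / r_prime(x)
    return x

# Example with a stand-in function r(x) = x**3, so r^{-1}(8) = 2.
print(invert_with_newton(lambda x: x**3, lambda x: 3 * x**2, t=8.0, x0=1.0))
```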
To compute integrals over the hyperparameter posterior, we make use of an adaptive quadrature algorithm.
Quadratures approximate an integral by a weighted sum over a set of evaluation points:

$$\int_a^b f(x)\, dx \approx \sum_i w_i f(x_i)$$
In general, the more points of evaluation used, the more accurate the approximation will be. Adaptive
quadrature algorithms compare the integral approximation at different levels of refinement to
approximate the error and increase the number of points of evaluation until a desired tolerance is reached.
Note
We omit the details of the quadrature algorithm used and describe it only at a high level. For more
information on specific quadrature algorithms, refer to Gauss-Hermite Quadratures and Tanh-sinh Quadratures.
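To make the idea of comparing refinement levels concrete, here is a small, self-contained adaptive Simpson integrator. It only illustrates adaptive quadrature in general; it is not the Gauss-Hermite or tanh-sinh rule used by the algorithm.

```python
import math

def adaptive_simpson(f, a, b, tol=1e-8):
    """Recursive adaptive quadrature based on Simpson's rule."""
    def simpson(lo, hi):
        mid = 0.5 * (lo + hi)
        return (hi - lo) / 6.0 * (f(lo) + 4.0 * f(mid) + f(hi))

    def recurse(lo, hi, whole, tol):
        mid = 0.5 * (lo + hi)
        left, right = simpson(lo, mid), simpson(mid, hi)
        # The gap between the coarse and refined estimates approximates the error.
        if abs(left + right - whole) < 15.0 * tol:
            return left + right + (left + right - whole) / 15.0
        return recurse(lo, mid, left, 0.5 * tol) + recurse(mid, hi, right, 0.5 * tol)

    return recurse(a, b, simpson(a, b), tol)

# Example: integrate exp(-x^2) over [0, 5]; the exact value is close to sqrt(pi)/2.
print(adaptive_simpson(lambda x: math.exp(-x * x), 0.0, 5.0))  # ~0.886227
```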
To better understand how the algorithm works in practice, we’ll set up a small
experimental data set, and we’ll compare a model fit with the Bayesian algorithm to
Ordinary Least Squares (OLS) and to a ridge regression model fit so as to minimize error on
a Leave-one-out Cross-validation (LOOCV) of the data set.
We’ll start by setting up the data set. For the design matrix, we’ll randomly generate a 20-by-10 matrix
X using a Gaussian with zero mean and covariance matrix K where

$$K_{ij} = \begin{cases} 1, & \text{if } i = j \\ 0.5, & \text{otherwise} \end{cases}$$
We’ll generate a weight vector with 10 elements from a spherical Gaussian with unit variance, and
we’ll rescale the weights so that the signal variance is equal to 1:

$$w^\top K w = 1$$
Then we’ll set

$$y = Xw + \varepsilon$$

where ε is a vector of 20 elements drawn from a Gaussian with unit variance.
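A sketch of this setup in NumPy (the seed and variable names are our own):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 10

# Covariance matrix K: 1 on the diagonal, 0.5 off the diagonal.
K = np.full((p, p), 0.5)
np.fill_diagonal(K, 1.0)

# Design matrix: rows drawn from a zero-mean Gaussian with covariance K.
X = rng.multivariate_normal(np.zeros(p), K, size=n)

# Weights from a spherical unit-variance Gaussian, rescaled so that w^T K w = 1.
w = rng.standard_normal(p)
w /= np.sqrt(w @ K @ w)

# Targets with unit-variance Gaussian noise.
y = X @ w + rng.standard_normal(n)
```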
Now, we’ll fit a Bayesian ridge regression model, an OLS model, and a ridge regression model
with the regularization strength set so that mean squared error is minimized on a LOOCV.
```python
# OLS
from sklearn.linear_model import LinearRegression

model_ols = LinearRegression(fit_intercept=False)
model_ols.fit(X, y)

# Bayesian Linear Ridge Regression
from bbai.glm import BayesianRidgeRegression

model_bay = BayesianRidgeRegression(fit_intercept=False)
model_bay.fit(X, y)

# Ridge Regression fit to optimize LOOCV error
#
# For details refer to
# https://arxiv.org/abs/2011.10218
from bbai.glm import RidgeRegression

model_rr = RidgeRegression(fit_intercept=False)
model_rr.fit(X, y)
```
We can measure the out-of-sample error variance for each model using the equation

$$\mathbb{E}\left[(y' - x'^\top \hat{w})^2\right] = (\hat{w} - w)^\top K (\hat{w} - w) + 1$$

where ŵ denotes a model’s fitted weights, x′ is a new data point drawn from the same zero-mean Gaussian with covariance K, and the added 1 accounts for the unit-variance noise.
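A sketch of the corresponding computation, assuming w and K from the setup above and that each fitted model exposes its weights through an sklearn-style coef_ attribute:

```python
import numpy as np

def out_of_sample_error_variance(w_hat, w, K, noise_variance=1.0):
    # (w_hat - w)^T K (w_hat - w) + noise variance
    delta = np.asarray(w_hat) - w
    return delta @ K @ delta + noise_variance

for name, model in [("OLS", model_ols), ("Bayesian", model_bay), ("Ridge (LOOCV)", model_rr)]:
    print(name, out_of_sample_error_variance(model.coef_, w, K))
```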
Here, we’ll give a brief comparison of the algorithm presented to scikit-learn’s algorithm for
Bayesian ridge regression and describe the advantages.
Scikit-learn’s algorithm makes use of conjugate priors and is therefore restricted to Gamma
priors, which require four hyperparameters that are chosen, arbitrarily, to be small values. Additionally,
it requires initial values for the parameters α and λ that are then updated
from the data.
The algorithm’s performance can be sensitive to the choice of values for these parameters, and
scikit-learn’s documentation provides a curve fitting example where the defaults perform poorly.
```python
# Author: Yoshihiro Uchida <nimbus1after2a1sun7shower@gmail.com>

import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import BayesianRidge


def func(x):
    return np.sin(2 * np.pi * x)


# #############################################################################
# Generate sinusoidal data with noise
size = 25
rng = np.random.RandomState(1234)
x_train = rng.uniform(0.0, 1.0, size)
y_train = func(x_train) + rng.normal(scale=0.1, size=size)
x_test = np.linspace(0.0, 1.0, 100)


# #############################################################################
# Fit by cubic polynomial
n_order = 3
X_train = np.vander(x_train, n_order + 1, increasing=True)
X_test = np.vander(x_test, n_order + 1, increasing=True)

# #############################################################################
# Plot the true and predicted curves with log marginal likelihood (L)
reg = BayesianRidge(tol=1e-6, fit_intercept=False, compute_score=True)
fig, axes = plt.subplots(1, 2, figsize=(8, 4))
for i, ax in enumerate(axes):
    # Bayesian ridge regression with different initial value pairs
    if i == 0:
        init = [1 / np.var(y_train), 1.0]  # Default values
    elif i == 1:
        init = [1.0, 1e-3]
        reg.set_params(alpha_init=init[0], lambda_init=init[1])
    reg.fit(X_train, y_train)
    ymean, ystd = reg.predict(X_test, return_std=True)

    ax.plot(x_test, func(x_test), color="blue", label="sin($2\\pi x$)")
    ax.scatter(x_train, y_train, s=50, alpha=0.5, label="observation")
    ax.plot(x_test, ymean, color="red", label="predict mean")
    ax.fill_between(
        x_test, ymean - ystd, ymean + ystd, color="pink", alpha=0.5, label="predict std"
    )
    ax.set_ylim(-1.3, 1.3)
    ax.legend()
    title = "$\\alpha$_init$={:.2f},\\ \\lambda$_init$={}$".format(init[0], init[1])
    if i == 0:
        title += " (Default)"
    ax.set_title(title, fontsize=12)
    text = "$\\alpha={:.1f}$\n$\\lambda={:.3f}$\n$L={:.1f}$".format(
        reg.alpha_, reg.lambda_, reg.scores_[-1]
    )
    ax.text(0.05, -1.0, text, fontsize=12)

plt.tight_layout()
plt.show()
```
Figure: Predictions for sklearn’s BayesianRidge model on a curve-fitting example. The left
panel shows the predictions with default parameters (performs poorly); the right panel shows the predictions
after the initial parameters have been tweaked (performs better).
In comparison, the algorithm we presented requires no initial parameters, and because the
hyperparameter is integrated over, poorly performing values contribute little to the posterior
probability mass.
We can see that our approach handles the curve-fitting example without requiring any tweaking.
```python
# #############################################################################
# Plot the true and predicted curves for bbai's BayesianRidgeRegression model
from bbai.glm import BayesianRidgeRegression

reg = BayesianRidgeRegression(fit_intercept=False)

fig, ax = plt.subplots(1, 1, figsize=(4, 4))

# Note: there are no parameters to tweak
reg.fit(X_train, y_train)
ymean, ystd = reg.predict(X_test, return_std=True)

ax.plot(x_test, func(x_test), color="blue", label="sin($2\\pi x$)")
ax.scatter(x_train, y_train, s=50, alpha=0.5, label="observation")
ax.plot(x_test, ymean, color="red", label="predict mean")
ax.fill_between(
    x_test, ymean - ystd, ymean + ystd, color="pink", alpha=0.5, label="predict std"
)
ax.set_ylim(-1.3, 1.3)
ax.legend()
plt.tight_layout()
plt.show()
```
Figure: Predictions from our Bayesian ridge regression algorithm on the curve-fitting example. In
comparison to sklearn, our algorithm performs well without requiring any tweaking.
The full curve-fitting comparison example is available here.