An Algorithm for Bayesian Ridge Regression

This guide describes a Bayesian algorithm for regularized linear regression. The algorithm uses a hyperparameter to control regularization strength and fully integrates over the hyperparameter in the posterior distribution, applying a hyperprior selected so as to be approximately noninformative.


Introduction

Let $\theta = \left(\sigma^2, \mathbf{w}\right)$ denote the parameters of a linear regression model with weights $\mathbf{w}$ and normally distributed errors of variance $\sigma^2$.

If $\mathbf{X}$ is an $n \times p$ matrix of full rank with $p$ regressors and $n$ rows, then $\theta$ specifies a probability distribution over possible target values $\mathbf{y}$:

$$P\left(\mathbf{y}|\theta\right) = \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^n \exp\left[-\frac{1}{2 \sigma^2}\left(\mathbf{y} - \mathbf{X}\mathbf{w}\right)^\top\left(\mathbf{y} - \mathbf{X}\mathbf{w}\right)\right]$$
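To make the density concrete, here is a minimal sketch (with made-up data, weights, and noise level) that evaluates its logarithm directly:

import numpy as np

def log_likelihood(y, X, w, sigma):
    # log P(y | theta) for a linear model with Gaussian noise of variance sigma^2
    n = len(y)
    resid = y - X @ w
    return -n * np.log(sigma * np.sqrt(2 * np.pi)) - resid @ resid / (2 * sigma**2)

# Made-up example values
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
w = np.array([1.0, -0.5, 0.25])
y = X @ w + 0.3 * rng.normal(size=20)
print(log_likelihood(y, X, w, 0.3))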

Suppose we observe $\mathbf{y}$ and assume $\mathbf{y}$ is generated from a linear model of unknown parameters. A Bayesian approach to inference seeks to quantify our belief in the unknown parameters $\theta$ given the observation.

$$P\left(\theta|\mathbf{y}\right)$$

Applying Bayes’ theorem, we can rewrite the probability as

$$P\left(\theta|\mathbf{y}\right) = \frac{P\left(\mathbf{y}|\theta\right) \cdot P\left(\theta\right)}{P\left(\mathbf{y}\right)}$$

where we refer to

  • $P\left(\theta|\mathbf{y}\right)$ as the posterior distribution

  • $P\left(\mathbf{y}|\theta\right)$ as the likelihood function

  • $P\left(\theta\right)$ as the prior distribution

The prior distribution describes our belief of $\theta$ before observing $\mathbf{y}$, and the posterior distribution describes our updated belief after observing $\mathbf{y}$.

Suppose the prior distribution can be expressed as

$$P\left(\theta\right) = h\left(\theta, \eta\right)$$

where $h(\cdot,\eta)$ denotes a family of probability distributions parameterized by what we call a hyperparameter $\eta$.

Traditional approaches to Bayesian linear regression have used what are called conjugate priors. A family of priors $h(\cdot,\eta)$ is conjugate if the posterior also belongs to the family

$$\begin{align*}
P\left(\theta|\mathbf{y}\right) &= \frac{P\left(\mathbf{y}|\theta\right) \cdot h\left(\theta, \eta\right)}{P\left(\mathbf{y}\right)} \\
&= h\left(\theta, \eta'\right)
\end{align*}$$

Conjugate priors are mathematically convenient, as successive observations can be viewed as making successive updates to the parameters of a family of distributions, but requiring $h$ to be conjugate is a strong assumption.

Note

See Appendix A for a more detailed comparison to other Bayesian algorithms.

We’ll instead describe an algorithm where

  1. Priors are selected to shrink $\mathbf{w}$, reflecting the prior hypothesis that $\mathbf{w}$ is not predictive, and to be approximately noninformative for the other parameters.

  2. We fully integrate over the hyperparameter so that no arbitrary choice of $\eta$ is required.

Let’s first consider what it means for a prior to be noninformative.

How to Pick Noninformative Priors?

See also

See section 1.3 in “Bayesian Inference in Statistical Analysis”, Box and Tiao, 1973, for greater detail on how to select noninformative priors.

Suppose $\mathbf{y}$ is data generated from a normal distribution with mean $0$ but unknown variance. Let $\sigma$ denote the standard deviation and let $\ell$ denote the likelihood function

$$P\left(\mathbf{y}|\sigma\right) \propto \ell(\sigma) \propto \left(\frac{1}{\sigma}\right)^n \exp\left[-\frac{1}{2\sigma^2} \mathbf{y}^\top\mathbf{y}\right]$$

Suppose we impose a uniform prior over $\sigma$ so that the posterior is

$$P\left(\sigma|\mathbf{y}\right) \propto \ell(\sigma)$$

and the cumulative distribution function is

$$P\left(\sigma < t|\mathbf{y}\right) = \frac{1}{N} \int_0^t \ell(\sigma)\, d\sigma$$

where $N$ is some normalizing constant.

But now let’s suppose that instead of the standard deviation we parameterize over the variance. Making the substitution $u=\sigma^2$ into the cumulative distribution function gives us

$$\begin{align*}
d\sigma &= \frac{1}{2\sqrt{u}}\,du \\
P\left(u < t|\mathbf{y}\right) &= \frac{1}{N} \int_0^t \ell(\sqrt{u}) \left(\frac{1}{2\sqrt{u}}\right) du
\end{align*}$$

Thus, we see that choosing a uniform prior over $\sigma$ is equivalent to choosing the improper prior

$$P\left(u\right) \propto \frac{1}{\sqrt{u}}$$

over the variance. In general, if $u=\varphi\left(\theta\right)$ is an alternative way of parameterizing the likelihood function, where $\varphi$ is a monotonic, one-to-one, onto function, then a uniform prior over $u$ is equivalent to choosing the prior

$$P\left(\theta\right) \propto \varphi'(\theta)$$

over $\theta$.
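To make the change of variables concrete, here is a small Monte Carlo sketch (the uniform range $[0, 10]$ below is an arbitrary choice): drawing $\sigma$ uniformly and squaring it produces samples of $u$ whose implied prior has cumulative distribution proportional to $\sqrt{t}$, consistent with a density proportional to $1/\sqrt{u}$.

import numpy as np

rng = np.random.default_rng(0)
sigma = rng.uniform(0.0, 10.0, size=1_000_000)  # uniform prior over sigma
u = sigma**2                                    # reparameterize to variance

# P(u < t) should equal sqrt(t) / 10, the CDF implied by P(u) ~ 1/sqrt(u)
for t in [1.0, 4.0, 25.0, 64.0]:
    print(np.mean(u < t), np.sqrt(t) / 10)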

So, when does it make sense to use a uniform prior, given that the choice is sensitive to the parameterization?

Let’s consider how the likelihood is affected by changes in the observed value $\mathbf{y}^\top\mathbf{y}$.

[Figure: ../_images/stddev-like.svg]

Likelihood function for the standard deviation of normally distributed data with zero mean, $n=10$, and different values of $\mathbf{y}^\top\mathbf{y}$.

As we can see from the figure, as $\mathbf{y}^\top\mathbf{y}$ increases, the shape of the likelihood function changes: it becomes less peaked and more spread out.

Observe that we can rewrite the likelihood function as

$$\begin{align*}
P\left(\mathbf{y}|\sigma\right) &\propto \left(\frac{1}{\sigma}\right)^n \exp\left[-\frac{1}{2\sigma^2} \mathbf{y}^\top\mathbf{y}\right] \\
&\propto \left(\frac{1}{\sigma / \sqrt{\mathbf{y}^\top\mathbf{y}}}\right)^n \exp\left[-\frac{1}{2\left(\sigma/\sqrt{\mathbf{y}^\top\mathbf{y}}\right)^2}\right] \\
&\propto g\left(\frac{\sigma}{\sqrt{\mathbf{y}^\top\mathbf{y}}}\right)
\end{align*}$$

where

$$g(t) = \left(\frac{1}{t}\right)^n \exp\left[-\frac{1}{2 t^2}\right]$$

Thus, in the parameter $\log \sigma$, the likelihood has the form

$$\begin{align*}
P\left(\mathbf{y}|\sigma\right) &\propto g\left(\exp\left[\log \sigma - \frac{1}{2} \log \mathbf{y}^\top\mathbf{y}\right]\right) \\
&\propto \tilde{g}\left(\log \sigma - \frac{1}{2} \log \mathbf{y}^\top\mathbf{y}\right)
\end{align*}$$

where

$$\tilde{g}(t) = g\left(\exp(t)\right)$$

We say that the likelihood function is data translated in $\log \sigma$: everything about the likelihood curve is known except its location, and the observed value serves only to shift the curve.

[Figure: ../_images/log-stddev-like.svg]

Likelihood function for the log standard deviation of normally distributed data with zero mean, $n=10$, and different values of $\mathbf{y}^\top\mathbf{y}$.

When the likelihood function is data translated in a parameter, it makes sense to use a uniform function for a noninformative prior. Before observing data, we have no reason to prefer one range of parameters $\left[a, b\right]$ over another range of the same width $\left[t + a, t + b\right]$, because the two ranges will be equivalently placed relative to the likelihood curve if the observed data translates it by $t$.
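As a numerical illustration of the data-translation property (a sketch with arbitrary values of $n$ and $\mathbf{y}^\top\mathbf{y}$), shifting the curve in $\log \sigma$ by $\frac{1}{2}\log\left(\mathbf{y}_2^\top\mathbf{y}_2 / \mathbf{y}_1^\top\mathbf{y}_1\right)$ reproduces the curve for the other observation, up to a constant factor:

import numpy as np

n = 10

def log_likelihood(log_sigma, yty):
    # log of (1/sigma)^n exp(-yty / (2 sigma^2)) as a function of log sigma
    return -n * log_sigma - yty / (2 * np.exp(2 * log_sigma))

grid = np.linspace(-2.0, 3.0, 501)
yty1, yty2 = 5.0, 50.0
shift = 0.5 * np.log(yty2 / yty1)
curve1 = log_likelihood(grid, yty1)
curve2 = log_likelihood(grid + shift, yty2)

# The two curves differ only by an additive constant
# (a multiplicative constant on the likelihood scale)
print(np.allclose(curve1 - curve1.max(), curve2 - curve2.max()))  # True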

Now, let’s return to picking our priors.

Picking Regularized Bayesian Linear Regression Priors

For the parameter $\sigma$, we use the noninformative prior

$$P(\sigma) \propto \frac{1}{\sigma}$$

which is equivalent to using a uniform prior over the parameter $\log \sigma$. For $\mathbf{w}$, we want an informative prior that shrinks the weights, reflecting a prior belief that the weights are non-predictive. Let $\eta$ denote a hyperparameter that controls the degree of shrinkage. Then we use the spherical normal distribution with covariance matrix $\left(\frac{\sigma}{\lambda_\eta}\right)^2\mathbf{I}$

$$P(\mathbf{w}|\sigma, \eta) \propto \left(\frac{1}{\sigma/\lambda_\eta}\right)^p \exp\left[-\frac{1}{2 (\sigma/\lambda_\eta)^2} \mathbf{w}^\top\mathbf{w}\right]$$

Note that we haven’t yet described how $\eta$ parameterizes $\lambda_\eta$. We’ll also be integrating over $\eta$, so we additionally need a prior for $\eta$ (called a hyperprior), giving

$$P(\mathbf{w}, \sigma, \eta) = P(\sigma) \cdot P(\mathbf{w}|\sigma, \eta) \cdot P(\eta)$$

Our goal is for the prior $P(\eta)$ to be noninformative. So we want to know: in what parameterization would $P(\mathbf{y}|\eta)$ be data translated?

How to Parameterize the Shrinkage Prior?

Before thinking about how to parameterize $\lambda_\eta$, let’s characterize the likelihood function for $\lambda_\eta$.

$$P(\mathbf{y}|\eta) = \int_\sigma \int_\mathbf{w} P\left(\mathbf{y}, \sigma, \mathbf{w}|\eta\right) d\mathbf{w}\, d\sigma$$

Expanding the integrand, we have

$$\begin{align*}
P\left(\mathbf{y}, \sigma, \mathbf{w}|\eta\right) &= P\left(\mathbf{y}| \sigma, \mathbf{w}\right) \cdot P\left(\sigma, \mathbf{w} | \eta\right) \\
&= P\left(\mathbf{y}| \sigma, \mathbf{w}\right) \cdot P(\sigma) \cdot P\left(\mathbf{w} | \sigma, \eta\right) \\
&\propto \left(\frac{1}{\sigma}\right)^{n+1} \exp\left[-\frac{1}{2\sigma^2} \left(\mathbf{y} - \mathbf{X} \mathbf{w}\right)^\top\left(\mathbf{y} - \mathbf{X} \mathbf{w}\right)\right] \times \\
&\phantom{\propto} \left(\frac{\lambda_\eta}{\sigma}\right)^p \exp\left[-\frac{\lambda_\eta^2}{2 \sigma^2} \mathbf{w}^\top\mathbf{w}\right] \\
&= \left(\#1\right) \left(\frac{1}{\sigma}\right)^p \exp\left[-\frac{1}{2 \sigma^2} \left(\mathbf{w} - \hat{\mathbf{w}}\right)^\top\left(\mathbf{X}^\top\mathbf{X} + \lambda_\eta^2 \mathbf{I}\right)\left(\mathbf{w} - \hat{\mathbf{w}}\right)\right]
\end{align*}$$

where

$$\#1 = \left(\frac{1}{\sigma}\right)^{n+1} \exp\left[-\frac{1}{2\sigma^2} \left(\mathbf{y}^\top\mathbf{y} - \hat{\mathbf{w}}^\top\left(\mathbf{X}^\top\mathbf{X} + \lambda_\eta^2\mathbf{I}\right)\hat{\mathbf{w}}\right)\right] \lambda_\eta^p$$

and

$$\hat{\mathbf{w}} = \left(\mathbf{X}^\top\mathbf{X} + \lambda_\eta^2\mathbf{I}\right)^{-1}\mathbf{X}^\top\mathbf{y}$$

Observe that $\#1$ is independent of $\mathbf{w}$, so the integral with respect to $\mathbf{w}$ is equivalent to integrating over an unnormalized Gaussian.
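As an aside, $\hat{\mathbf{w}}$ is exactly the ridge-regression point estimate with penalty $\lambda_\eta^2$, so the formula can be sanity-checked against an off-the-shelf solver. A minimal sketch with made-up data:

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = rng.normal(size=20)
lam = 2.0

# w_hat = (X^T X + lam^2 I)^{-1} X^T y
w_hat = np.linalg.solve(X.T @ X + lam**2 * np.eye(5), X.T @ y)

# sklearn's Ridge minimizes ||y - Xw||^2 + alpha ||w||^2, so alpha = lam^2
model = Ridge(alpha=lam**2, fit_intercept=False).fit(X, y)
print(np.allclose(w_hat, model.coef_))  # True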

Note

Recall the formula for a normalized Gaussian integral

$$\int_\mathbf{x} \frac{1}{\sqrt{\left(2 \pi\right)^k \det \mathbf{A}}} \exp\left[-\frac{1}{2} \left(\mathbf{x} - \bar{\mathbf{x}}\right)^\top\mathbf{A}^{-1}\left(\mathbf{x} - \bar{\mathbf{x}}\right)\right] d\mathbf{x} = 1$$

Thus,

$$\int_\mathbf{w} P\left(\mathbf{y}, \sigma, \mathbf{w}|\eta\right) d\mathbf{w} \propto \left(\#1\right) \left(\det\left(\mathbf{X}^\top\mathbf{X} + \lambda_\eta^2\mathbf{I}\right)\right)^{-1/2}$$

Next let’s consider the integral over $\sigma$.

Put

$$\begin{align*}
g(\lambda) &= \mathbf{y}^\top\mathbf{y} - \hat{\mathbf{w}}^\top\left(\mathbf{X}^\top\mathbf{X} + \lambda^2 \mathbf{I}\right)\hat{\mathbf{w}} \\
&= \mathbf{y}^\top\mathbf{y} - \mathbf{y}^\top \mathbf{X} \left(\mathbf{X}^\top\mathbf{X} + \lambda^2 \mathbf{I}\right)^{-1} \mathbf{X}^\top\mathbf{y} \\
h(\lambda) &= \lambda^p \left(\det\left(\mathbf{X}^\top\mathbf{X} + \lambda^2\mathbf{I}\right)\right)^{-1/2}
\end{align*}$$

Then we can rewrite

$$\int_\mathbf{w} P\left(\mathbf{y}, \sigma, \mathbf{w}|\eta\right) d\mathbf{w} \propto \left(\frac{1}{\sigma}\right)^{n+1} \exp\left[-\frac{g(\lambda_\eta)}{2\sigma^2}\right] h(\lambda_\eta)$$

After making a change of variables, we can express the integral with respect to $\sigma$ as a form of the Gamma function.

Note

Consider an integral

$$\int_0^\infty \left(\frac{1}{\sigma}\right)^k \exp\left[-\frac{A}{2 \sigma^2}\right] d\sigma$$

Put

$$u = \frac{A}{2\sigma^2}$$

Then

$$\begin{align*}
\sigma &= \left(\frac{A}{2 u}\right)^{1/2} \\
d\sigma &= -\frac{1}{2}\left(\frac{A}{2}\right)^{1/2} u^{-3/2}\, du
\end{align*}$$

And

$$\begin{align*}
\int_0^\infty \left(\frac{1}{\sigma}\right)^k \exp\left[-\frac{A}{2 \sigma^2}\right] d\sigma &= \int_0^\infty \left(\frac{2u}{A}\right)^{k/2} \exp(-u) \left(\frac{1}{2}\left(\frac{A}{2}\right)^{1/2} u^{-3/2}\right) du \\
&= \frac{1}{2} \left(\frac{2}{A}\right)^{(k-1)/2} \int_0^\infty u^{(k-1)/2-1} \exp(-u)\, du \\
&= 2^{(k-3)/2} A^{-(k-1)/2}\, \Gamma\left(\frac{k - 1}{2}\right)
\end{align*}$$

where $\Gamma$ denotes the Gamma function

$$\Gamma(t) = \int_0^\infty x^{t-1} e^{-x}\, dx$$
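The identity is easy to verify numerically; a small sketch with arbitrary test values of $k$ and $A$:

import numpy as np
from scipy.integrate import quad
from scipy.special import gamma

k, A = 7, 2.5  # arbitrary test values

numeric, _ = quad(lambda s: s**(-k) * np.exp(-A / (2 * s**2)), 0, np.inf)
closed_form = 2**((k - 3) / 2) * A**(-(k - 1) / 2) * gamma((k - 1) / 2)
print(np.isclose(numeric, closed_form))  # True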

Thus, we can write

$$\int_\sigma \int_\mathbf{w} P\left(\mathbf{y}, \sigma, \mathbf{w}|\eta\right) d\mathbf{w}\, d\sigma \propto g(\lambda_\eta)^{-n/2} h(\lambda_\eta)$$

Let $\mathbf{U}$, $\mathbf{\Sigma}$, and $\mathbf{V}$ denote the singular value decomposition of $\mathbf{X}$ so that

$$\mathbf{X} = \mathbf{U} \mathbf{\Sigma} \mathbf{V}^\top$$

Let $\xi_1, \ldots, \xi_p$ denote the non-zero diagonal entries of $\mathbf{\Sigma}$. Put $\mathbf{\Lambda} = \mathbf{\Sigma}^\top\mathbf{\Sigma}$. Then $\mathbf{\Lambda}$ is a diagonal matrix with entries $\xi_1^2, \ldots, \xi_p^2$ and

$$\begin{align*}
\mathbf{X}^\top\mathbf{X} &= \mathbf{V} \mathbf{\Sigma}^\top \mathbf{U}^\top\mathbf{U} \mathbf{\Sigma} \mathbf{V}^\top \\
&= \mathbf{V} \mathbf{\Lambda} \mathbf{V}^\top
\end{align*}$$

Note

To implement our algorithm, we will only need the matrix $\mathbf{V}$ and the nonzero diagonal entries of $\mathbf{\Sigma}$, which can be efficiently computed with a partial singular value decomposition.

See the LAPACK function gesdd.

Put

$$\tilde{\mathbf{y}} = \mathbf{U}^\top \mathbf{y}$$

Then we can rewrite $h$ and $g$ as

$$\begin{align*}
h(\lambda) &= \lambda^p \left(\det\left(\mathbf{X}^\top\mathbf{X} + \lambda^2\mathbf{I}\right)\right)^{-1/2} \\
&= \prod_j \left(\frac{\lambda^2}{\xi_j^2 + \lambda^2}\right)^{1/2}
\end{align*}$$

And

$$\begin{align*}
g(\lambda) &= \mathbf{y}^\top\mathbf{y} - \mathbf{y}^\top \mathbf{X} \left(\mathbf{X}^\top\mathbf{X} + \lambda^2 \mathbf{I}\right)^{-1} \mathbf{X}^\top\mathbf{y} \\
&= \mathbf{y}^\top\mathbf{y} - \mathbf{y}^\top \mathbf{X} \mathbf{V} \left(\mathbf{\Lambda} + \lambda^2 \mathbf{I}\right)^{-1} \mathbf{V}^\top \mathbf{X}^\top\mathbf{y} \\
&= \mathbf{y}^\top\mathbf{y} - \mathbf{y}^\top \mathbf{U} \mathbf{\Sigma} \left(\mathbf{\Lambda} + \lambda^2 \mathbf{I}\right)^{-1} \mathbf{\Sigma}^\top \mathbf{U}^\top\mathbf{y} \\
&= \mathbf{y}^\top\mathbf{y} - \sum_j \frac{\xi_j^2}{\xi_j^2 + \lambda^2} \tilde{y}_j^2 \\
&= \mathbf{y}^\top\mathbf{y} - \sum_j \left(1 - \frac{\lambda^2}{\xi_j^2 + \lambda^2}\right) \tilde{y}_j^2 \\
&= \mathrm{RSS} + \sum_j \frac{\lambda^2}{\xi_j^2 + \lambda^2} \tilde{y}_j^2
\end{align*}$$
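The direct and SVD-based forms of $g$ and $h$ are easy to check against each other; a small sketch with made-up data:

import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 5
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
lam = 1.5

# Direct forms
A = X.T @ X + lam**2 * np.eye(p)
g_direct = y @ y - y @ X @ np.linalg.solve(A, X.T @ y)
h_direct = lam**p * np.linalg.det(A)**-0.5

# SVD forms
U, s, Vt = np.linalg.svd(X, full_matrices=False)
y_tilde = U.T @ y
g_svd = y @ y - np.sum(s**2 / (s**2 + lam**2) * y_tilde**2)
h_svd = np.prod(lam**2 / (s**2 + lam**2))**0.5
print(np.isclose(g_direct, g_svd), np.isclose(h_direct, h_svd))  # True True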

Here we adopt the terminology

$$\begin{align*}
\mathrm{TSS} &= \mathbf{y}^\top\mathbf{y} &&\quad \textrm{refers to the Total Sum of Squares} \\
\mathrm{ESS} &= \sum_{j=1}^p \tilde{y}_j^2 &&\quad \textrm{refers to the Explained Sum of Squares} \\
\mathrm{RSS} &= \mathrm{TSS} - \mathrm{ESS} &&\quad \textrm{refers to the Residual Sum of Squares}
\end{align*}$$

Put

$$r(\lambda) = \frac{1}{p} \sum_j \frac{\lambda^2}{\xi_j^2 + \lambda^2}$$

Then $r$ is a monotonically increasing function that ranges from $0$ to $1$, and we can think of $r$ as the average shrinkage factor across the eigenvectors.

Now, let’s make an approximation by replacing the individual eigenvector shrinkages with the average:

$$\frac{\lambda^2}{\xi_j^2 + \lambda^2} \approx r(\lambda) \quad \textrm{for}\; j=1, \ldots, p$$

Substituting the approximation into the likelihood then gives us

$$\begin{align*}
P(\mathbf{y}|\eta) &\propto g(\lambda_\eta)^{-n/2} h(\lambda_\eta) \\
&= \left(\mathrm{RSS} + \sum_j \frac{\lambda_\eta^2}{\xi_j^2 + \lambda_\eta^2} \tilde{y}_j^2\right)^{-n/2} \prod_j \left(\frac{\lambda_\eta^2}{\xi_j^2 + \lambda_\eta^2}\right)^{1/2} \\
&\approx \left(\mathrm{RSS} + \mathrm{ESS}\times r(\lambda_\eta)\right)^{-n/2} r(\lambda_\eta)^{p/2} \\
&\propto \left(1 + \frac{r(\lambda_\eta)}{\mathrm{RSS}/\mathrm{ESS}}\right)^{-n/2} \left(\frac{r(\lambda_\eta)}{\mathrm{RSS}/\mathrm{ESS}}\right)^{p/2}
\end{align*}$$

We see that the approximated likelihood can be expressed as a function of

$$\frac{r(\lambda_\eta)}{\mathrm{RSS}/\mathrm{ESS}}$$

and it follows that the likelihood is approximately data translated in $\log r(\lambda_\eta)$.

Thus, we can achieve an approximately noninformative prior if we let $\eta$ represent the average shrinkage, put

$$\lambda_\eta = r^{-1}(\eta)$$

and use the hyperprior

$$P(\eta) \propto \frac{1}{\eta}$$

Note

To invert $r$, we can use a standard root-finding algorithm.

Differentiating $r(\lambda)$ gives us

$$\begin{align*}
\frac{d}{d\lambda} r(\lambda) &= \frac{d}{d\lambda} \left(\frac{1}{p}\sum_j \frac{\lambda^2}{\xi_j^2 + \lambda^2}\right) \\
&= \frac{d}{d\lambda} \left(\frac{1}{p}\sum_j \frac{1}{1 + \xi_j^2 \lambda^{-2}}\right) \\
&= -\frac{1}{p}\sum_j \left(1 + \xi_j^2 \lambda^{-2}\right)^{-2} \left(-2 \xi_j^2 \lambda^{-3}\right) \\
&= \frac{1}{p}\sum_j \frac{2 \xi_j^2}{\lambda^3\left(1 + \xi_j^2\lambda^{-2}\right)^2} \\
&= \frac{1}{p}\sum_j \frac{2 \xi_j^2\lambda}{\left(\lambda^2 + \xi_j^2\right)^2}
\end{align*}$$

Using the derivative and a variant of Newton’s algorithm, we can quickly iterate to a solution of $r^{-1}(\eta)$.
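Here is a minimal sketch of the inversion; for brevity it swaps the Newton variant for SciPy's bracketing brentq solver (eig2 holds the $\xi_j^2$ values):

import numpy as np
from scipy.optimize import brentq

def shrink(lam, eig2):
    # Average shrinkage r(lambda) across the eigenvalues eig2 = xi_j^2
    return np.mean(lam**2 / (eig2 + lam**2))

def shrink_inverse(eta, eig2):
    # Solve r(lambda) = eta for lambda, with eta in (0, 1)
    hi = 1.0
    while shrink(hi, eig2) < eta:   # grow the bracket until it contains the root
        hi *= 2.0
    return brentq(lambda lam: shrink(lam, eig2) - eta, 0.0, hi)

# Example with made-up squared singular values
eig2 = np.array([5.0, 3.0, 1.0, 0.5])
lam = shrink_inverse(0.3, eig2)
print(lam, shrink(lam, eig2))  # shrink(lam) should be approximately 0.3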

Making Predictions

The result of fitting a Bayesian model is the posterior distribution $P(\theta|\mathbf{y})$. Let’s consider how we can use the distribution to make predictions given a new data point $\mathbf{x}'$.

We’ll start by computing the expected target value

$$\begin{align*}
\mathrm{E}\left[y'|\mathbf{x}', \mathbf{y}\right] &= \int_{y'} \int_\theta y'\, P(y'|\theta)\, P(\theta|\mathbf{y})\, d\theta\, dy' \\
&= \int_\theta \int_{y'} y'\, P(y'|\theta)\, P(\theta|\mathbf{y})\, dy'\, d\theta \\
&= \int_\theta \mathbf{x}'^\top \mathbf{w}\, P(\theta|\mathbf{y})\, d\theta \\
&= \mathbf{x}'^\top\, \mathrm{E}\left[\mathbf{w}|\mathbf{y}\right]
\end{align*}$$

And

$$\begin{align*}
\mathrm{E}\left[\mathbf{w}|\mathbf{y}\right] &= \int_\eta \int_\sigma \int_\mathbf{w} \mathbf{w}\, P(\mathbf{w}, \sigma, \eta|\mathbf{y})\, d\mathbf{w}\, d\sigma\, d\eta \\
&= \int_\eta \left(\mathbf{X}^\top\mathbf{X} + \lambda_\eta^2\mathbf{I}\right)^{-1} \mathbf{X}^\top \mathbf{y}\, P(\eta|\mathbf{y})\, d\eta \\
&= \mathrm{E}\left[\left(\mathbf{X}^\top\mathbf{X} + \lambda_\eta^2\mathbf{I}\right)^{-1}\mathbf{X}^\top \mathbf{y} \,\big|\, \mathbf{y}\right] \\
&= \mathbf{V} \left(\mathrm{E}\left[\left(\mathbf{\Lambda} + \lambda_\eta^2\mathbf{I}\right)^{-1}\big|\mathbf{y}\right]\right) \mathbf{V}^\top \mathbf{X}^\top \mathbf{y}
\end{align*}$$

Note

To compute expected values of expressions involving $\eta$, we need to integrate over the posterior distribution $P(\eta|\mathbf{y})$. We won’t have an analytical form for the integrals, but we can efficiently and accurately integrate with an adaptive quadrature.
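A sketch of such a posterior expectation using SciPy's adaptive quadrature; it reuses shrink_inverse from the earlier sketch and the unnormalized posterior $g(\lambda_\eta)^{-n/2} h(\lambda_\eta)/\eta$ (bbai's internal quadrature may differ):

import numpy as np
from scipy.integrate import quad

def posterior_expectation(f, X, y):
    # E[f(eta) | y], integrating the unnormalized posterior over eta in (0, 1)
    n, p = X.shape
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    eig2, y_tilde = s**2, U.T @ y
    rss = y @ y - np.sum(y_tilde**2)

    def unnormalized_pdf(eta):
        lam = shrink_inverse(eta, eig2)
        g = rss + np.sum(lam**2 / (eig2 + lam**2) * y_tilde**2)
        h = np.prod(lam**2 / (eig2 + lam**2))**0.5
        # rescale g by rss so the integrand is O(1); the constant cancels in the ratio
        return (g / rss)**(-n / 2) * h / eta

    numer, _ = quad(lambda eta: f(eta) * unnormalized_pdf(eta), 0.0, 1.0)
    denom, _ = quad(unnormalized_pdf, 0.0, 1.0)
    return numer / denom

# Example: the posterior mean of the average shrinkage eta
# print(posterior_expectation(lambda eta: eta, X, y))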

To compute variance, we have

$$\mathrm{E}\left[\left(y' - \bar{y}'\right)^2|\mathbf{x}',\mathbf{y}\right] = \mathrm{E}\left[y'^2|\mathbf{x}',\mathbf{y}\right] - \mathrm{E}\left[y'|\mathbf{x}',\mathbf{y}\right]^2$$

and

$$\begin{align*}
\mathrm{E}\left[y'^2|\mathbf{x}',\mathbf{y}\right] &= \int_\theta\int_{y'}y'^2\, P(y'|\theta)\, P(\theta|\mathbf{y})\, dy'\, d\theta \\
&= \int_\theta\left(\sigma^2 + \mathbf{x}'^\top \mathbf{w} \mathbf{w}^\top \mathbf{x}'\right) P(\theta|\mathbf{y})\, d\theta \\
&= \mathrm{E}\left[\sigma^2|\mathbf{y}\right] + \mathbf{x}'^\top\, \mathrm{E}\left[\mathbf{w} \mathbf{w}^\top|\mathbf{y}\right]\, \mathbf{x}'
\end{align*}$$

Starting with $\mathrm{E}\left[\sigma^2|\mathbf{y}\right]$, recall that

$$P(\eta|\mathbf{y}) \propto \frac{1}{\eta}\int_0^\infty \left(\frac{1}{\sigma}\right)^{n+1} \exp\left[-\frac{g(\lambda_\eta)}{2\sigma^2}\right] h(\lambda_\eta)\, d\sigma$$

And

$$\int_0^\infty \left(\frac{1}{\sigma}\right)^k \exp\left[-\frac{A}{2 \sigma^2}\right] d\sigma = 2^{(k-3)/2} A^{-(k-1)/2}\, \Gamma\left(\frac{k - 1}{2}\right)$$

Thus,

$$\begin{align*}
\int_0^\infty \left(\frac{1}{\sigma}\right)^{k-2} \exp\left[-\frac{A}{2 \sigma^2}\right] d\sigma &= 2^{(k-3)/2 - 1} A^{-(k-1)/2 + 1}\, \Gamma\left(\frac{k - 1}{2} - 1\right) \\
&= \frac{A}{2} \frac{\Gamma((k-1)/2-1)}{\Gamma((k - 1)/2)} \int_0^\infty \left(\frac{1}{\sigma}\right)^k \exp\left[-\frac{A}{2 \sigma^2}\right] d\sigma
\end{align*}$$

Note

The Gamma function has the property

$$\Gamma(x + 1) = x\, \Gamma(x)$$

So,

$$\Gamma\left(\frac{k-1}{2}\right) = \left(\frac{k - 1}{2}-1\right)\Gamma\left(\frac{k-1}{2}-1\right)$$

And

$$\int_0^\infty \left(\frac{1}{\sigma}\right)^{k-2} \exp\left[-\frac{A}{2 \sigma^2}\right] d\sigma = \frac{A}{k - 3} \int_0^\infty \left(\frac{1}{\sigma}\right)^k \exp\left[-\frac{A}{2 \sigma^2}\right] d\sigma$$

It follows that

$$\begin{align*}
\int_\eta \int_\theta \sigma^2\, P(\theta, \eta|\mathbf{y})\, d\theta\, d\eta &= \int_\eta \frac{g(\lambda_\eta)}{n - 2}\, P(\eta|\mathbf{y})\, d\eta \\
&= \mathrm{E}\left[\frac{g(\lambda_\eta)}{n - 2}\Big|\mathbf{y}\right]
\end{align*}$$
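With the posterior_expectation and shrink_inverse helpers from the earlier sketches, the posterior mean of the noise variance can be written in a few lines (again only a sketch):

import numpy as np

def expected_noise_variance(X, y):
    # E[sigma^2 | y] = E[g(lambda_eta) / (n - 2) | y]
    n, p = X.shape
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    eig2, y_tilde = s**2, U.T @ y
    rss = y @ y - np.sum(y_tilde**2)

    def f(eta):
        lam = shrink_inverse(eta, eig2)
        g = rss + np.sum(lam**2 / (eig2 + lam**2) * y_tilde**2)
        return g / (n - 2)

    return posterior_expectation(f, X, y)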

For $\mathrm{E}\left[\mathbf{w}\mathbf{w}^\top|\mathbf{y}\right]$, we have

$$\begin{align*}
\mathrm{E}\left[\mathbf{w}\mathbf{w}^\top|\mathbf{y}\right] &= \int_\eta \int_\sigma \int_\mathbf{w} \mathbf{w}\mathbf{w}^\top\, P(\mathbf{w}, \sigma, \eta|\mathbf{y})\, d\mathbf{w}\, d\sigma\, d\eta \\
&= \int_\eta \int_\sigma \left[\sigma^2 \left(\mathbf{X}^\top\mathbf{X} + \lambda_\eta^2\mathbf{I}\right)^{-1} + \hat{\mathbf{w}}_\eta \hat{\mathbf{w}}_\eta^\top\right] P(\sigma, \eta|\mathbf{y})\, d\sigma\, d\eta \\
&= \mathrm{E}\left[\frac{g(\lambda_\eta)}{n - 2}\left(\mathbf{X}^\top\mathbf{X} + \lambda_\eta^2\mathbf{I}\right)^{-1}\Big|\mathbf{y}\right] + \\
&\phantom{=}\quad \mathrm{E}\left[\left(\mathbf{X}^\top\mathbf{X} + \lambda_\eta^2\mathbf{I}\right)^{-1} \mathbf{X}^\top \mathbf{y} \mathbf{y}^\top \mathbf{X} \left(\mathbf{X}^\top\mathbf{X} + \lambda_\eta^2\mathbf{I}\right)^{-1}\Big|\mathbf{y}\right] \\
&= \mathbf{V}\left(\mathrm{E}\left[\frac{g(\lambda_\eta)}{n - 2}\left(\mathbf{\Lambda} + \lambda_\eta^2\mathbf{I}\right)^{-1}\Big|\mathbf{y}\right]\right)\mathbf{V}^\top + \\
&\phantom{=}\quad \mathbf{V} \left(\mathrm{E}\left[\left(\mathbf{\Lambda} + \lambda_\eta^2\mathbf{I}\right)^{-1} \mathbf{V}^\top \mathbf{X}^\top \mathbf{y} \mathbf{y}^\top \mathbf{X} \mathbf{V} \left(\mathbf{\Lambda} + \lambda_\eta^2\mathbf{I}\right)^{-1}\Big|\mathbf{y}\right]\right) \mathbf{V}^\top
\end{align*}$$

Outline of Algorithm

We’ve seen that computing statistics about predictions involves integrating over the posterior distribution $P(\eta|\mathbf{y})$. We’ll briefly sketch out an algorithm for computing such integrals. We describe it only for computing the expected value of $\mathbf{w}$; other expected values can be computed with straightforward modifications.

procedure compute-expected-w(X, y)
    Σ, V = partial-svd(X)
    Λ = Σ^T Σ
    procedure f(η)
        λ = shrink-inverse(Λ, η)
        return (Λ + λ^2 I)^-1
    end procedure
    n = length(y)
    z = V^T X^T y
    TSS = y^T y
    D = integrate-over-hyperparameter-posterior(f, TSS, n, Λ, z)
    procedure norm-f(η)
        return 1
    end procedure
    N = integrate-over-hyperparameter-posterior(norm-f, TSS, n, Λ, z)
    return (1/N) V D z
end procedure

The procedure shrink-inverse applies Newton’s algorithm for root-finding with $r$ and $r'$ to compute $r^{-1}$.

To compute the hyperparameter posterior integration, we make use of an adaptive quadrature algorithm. Quadratures approximate an integral by a weighted sum at different points of evaluation.

$$\int_a^b f(x)\, dx \approx \sum_i w_i f(x_i)$$

In general, the more points of evaluation used, the more accurate the approximation will be. Adaptive quadrature algorithms compare the integral approximation at different levels of refinement to approximate the error and increase the number of points of evaluation until a desired tolerance is reached.
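To make the refinement idea concrete, here is a generic sketch that doubles the number of Gauss-Legendre points until successive estimates agree to a tolerance (the quadrature used by the actual implementation is more refined than this):

import numpy as np

def adaptive_gauss_legendre(f, a, b, tol=1e-8, max_level=12):
    # Increase the number of quadrature points until the estimate stabilizes
    previous = None
    for level in range(1, max_level + 1):
        nodes, weights = np.polynomial.legendre.leggauss(2**level)
        x = 0.5 * (b - a) * nodes + 0.5 * (a + b)   # map nodes from [-1, 1] to [a, b]
        estimate = 0.5 * (b - a) * np.sum(weights * f(x))
        if previous is not None and abs(estimate - previous) < tol:
            break
        previous = estimate
    return estimate

print(adaptive_gauss_legendre(np.sin, 0.0, np.pi))  # approximately 2.0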

Note

We omit the details of the quadrature algorithm used and describe it only at a high level. For more information on specific quadrature algorithms, refer to Gauss-Hermite Quadratures and Tanh-sinh Quadratures.

procedure integrate-over-hyperparameter-posterior(F, TSS, n, Λ, z)
    level = 0
    x, w = quadrature-points-weights(0, 1, level)
    Q_prev = 0
    Q = approximate-integral(F, TSS, n, Λ, z, x, w)
    err = ∞
    while err > ε do
        Q_prev = Q
        level = level + 1
        x, w = quadrature-points-weights(0, 1, level)
        Q = approximate-integral(F, TSS, n, Λ, z, x, w)
        err = ‖Q − Q_prev‖_∞
    end while
    return Q
end procedure

procedure approximate-integral(F, TSS, n, Λ, z, x, w)
    Q = 0
    for i = 1 to length(x) do
        η = x_i
        density = unnormalized-pdf(TSS, n, Λ, z, η)
        Q = Q + F(η) · density · w_i
    end for
    return Q
end procedure

procedure unnormalized-pdf(TSS, n, Λ, z, η)
    λ = shrink-inverse(Λ, η)
    t1 = (TSS − z^T (Λ + λ^2 I)^-1 z)^(-n/2)
    p = length(z)
    t2 = λ^p (det(Λ + λ^2 I))^(-1/2)
    return t1 · t2 · (1/η)
end procedure
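Below is a compact Python sketch of the outline above; it uses SciPy's adaptive quadrature in place of the hand-rolled refinement loop, and bbai's internals may differ:

import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

def compute_expected_w(X, y):
    # Posterior mean of the weights, following the outline above (sketch only)
    n, p = X.shape
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    eig2 = s**2                       # diagonal entries of Lambda
    y_tilde = U.T @ y
    z = Vt @ (X.T @ y)                # V^T X^T y
    rss = y @ y - np.sum(y_tilde**2)

    def shrink_inverse(eta):
        hi = 1.0
        while np.mean(hi**2 / (eig2 + hi**2)) < eta:
            hi *= 2.0
        return brentq(lambda lam: np.mean(lam**2 / (eig2 + lam**2)) - eta, 0.0, hi)

    def unnormalized_pdf(eta):
        lam = shrink_inverse(eta)
        g = rss + np.sum(lam**2 / (eig2 + lam**2) * y_tilde**2)
        h = np.prod(lam**2 / (eig2 + lam**2))**0.5
        # rescale g by rss to keep the integrand O(1); the constant cancels below
        return (g / rss)**(-n / 2) * h / eta

    def diag_integrand(eta, j):
        lam = shrink_inverse(eta)
        return unnormalized_pdf(eta) / (eig2[j] + lam**2)

    d = np.array([quad(diag_integrand, 0.0, 1.0, args=(j,))[0] for j in range(p)])
    norm = quad(unnormalized_pdf, 0.0, 1.0)[0]
    return Vt.T @ (d / norm * z)

On the experiment's data set in the next section, this should roughly agree with model_bay.weight_mean_vector_ (up to implementation differences).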

Experimental Results

To better understand how the algorithm works in practice, we’ll set up a small experimental data set, and we’ll compare a model fit with the Bayesian algorithm to Ordinary Least Squares (OLS), and a ridge regression model fit so as to minimize error on a Leave-one-out Cross-validation (LOOCV) of the data set.

Full source code for the experiment is available at github.com/rnburn/bbai/example/03-bayesian.py.

Generating the Data Set

We'll start by setting up the data set. For the design matrix, we'll randomly generate a 20-by-10 matrix $\mathbf{X}$ using a Gaussian with zero mean and covariance matrix $\mathbf{K}$ where

$$\mathbf{K}_{ij} = \begin{cases} 1, & \text{if}\ i=j \\ 0.5, & \text{otherwise} \end{cases}$$

We'll generate a weight vector with 10 elements from a spherical Gaussian with unit variance, and we'll rescale the weights so that the signal variance is equal to 1.

$$\mathbf{w}^\top\mathbf{K}\mathbf{w} = 1$$

Then we'll set

$$\mathbf{y} = \mathbf{X} \mathbf{w} + \boldsymbol{\varepsilon}$$

where $\boldsymbol{\varepsilon}$ is a vector of 20 elements taken from a Gaussian with unit variance.

Here’s the Python code that sets up the data set.

import numpy as np

np.random.seed(0)

def generate_correlation_matrix(p, param):
    res = np.zeros(shape=(p, p))
    for s in range(p):
        for t in range(0, s+1):
            corr = param
            if s == t:
                corr = 1.0
            res[s, t] = corr
            res[t, s] = corr
    return res

def generate_design_matrix(n, K):
    mean = np.zeros(K.shape[0])
    return np.random.multivariate_normal(mean, K, size=n)

def generate_weights(p):
    return np.random.normal(size=p)

def generate_data_set(n, K, signal_noise_ratio):
    p = K.shape[0]
    X = generate_design_matrix(n, K)
    w = generate_weights(p)
    signal_var = np.dot(w, np.dot(K, w))
    w *= np.sqrt(signal_noise_ratio / signal_var)
    y = np.dot(X, w) + np.random.normal(size=n)
    return X, y, w

p = 10
n = 20
signal_noise_ratio = 1.0
K = generate_correlation_matrix(p, 0.5)
X, y, w_true = generate_data_set(n, K, signal_noise_ratio)

Fitting Models

Now, we’ll fit a Bayesian ridge regression model, an OLS model, and a ridge regression model with the regularization strength set so that mean squared error is minimized on a LOOCV.

# OLS
from sklearn.linear_model import LinearRegression
model_ols = LinearRegression(fit_intercept=False)
model_ols.fit(X, y)

# Bayesian Linear Ridge Regression
from bbai.glm import BayesianRidgeRegression
model_bay = BayesianRidgeRegression(fit_intercept=False)
model_bay.fit(X, y)

# Ridge Regression fit to optimize LOOCV error
#
# For details refer to
#     https://arxiv.org/abs/2011.10218
from bbai.glm import RidgeRegression
model_rr = RidgeRegression(fit_intercept=False)
model_rr.fit(X, y)

We can measure the out-of-sample error variance for each model using the equation

$$\mathrm{E}\left[\left(y' - \mathbf{x}'^\top \hat{\mathbf{w}}\right)^2\right] = \sigma^2 + \left(\mathbf{w} - \hat{\mathbf{w}}\right)^\top\mathbf{K}\left(\mathbf{w} - \hat{\mathbf{w}}\right)$$
def compute_prediction_error_variance(K, w_true, w):
    delta = w - w_true
    noise_variance = 1.0
    return noise_variance + np.dot(delta, np.dot(K, delta))

err_variance_ols = compute_prediction_error_variance(K, w_true, model_ols.coef_)
err_variance_bay = compute_prediction_error_variance(K, w_true, model_bay.weight_mean_vector_)
err_variance_rr = compute_prediction_error_variance(K, w_true, model_rr.coef_)

Doing this gives us

Out-of-sample Performance of Models

    Model               Out-of-sample Error Variance
    ----------------    ----------------------------
    OLS                 2.27
    Bayesian            1.23
    Ridge Regression    1.33

We see that both the Bayesian and ridge regression models are able to prevent overfitting and achieve better out-of-sample results.

Finally, we’ll compare the estimated noise variance from the Bayesian model to that from the OLS model.

err_ols = y - np.dot(X, model_ols.coef_)
s_ols = np.dot(err_ols, err_ols) / (n - p)
print("noise_variance_ols =", s_ols)
print("noise_variance_bay =", model_bay.noise_variance_mean_)

This gives us

Model estimates for $\sigma^2$

    Model       Noise Variance Estimate
    --------    -----------------------
    OLS         0.41
    Bayesian    0.65

Appendix A: Comparison with Other Bayesian Algorithms

Here, we’ll give a brief comparison of the algorithm presented to scikit-learn’s algorithm for Bayesian ridge regression and describe the advantages.

Scikit-learn's algorithm makes use of conjugate priors and is therefore restricted to Gamma priors, which require four hyperparameters, chosen arbitrarily to be small values. Additionally, it requires initial values for the parameters $\alpha$ and $\lambda$, which are then updated from the data.

The algorithm’s performance can be sensitive to the choice of values for these parameters, and scikit-learn’s documentation provides a curve fitting example where the defaults perform poorly.

# Author: Yoshihiro Uchida <nimbus1after2a1sun7shower@gmail.com>

import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import BayesianRidge


def func(x):
    return np.sin(2 * np.pi * x)


# #############################################################################
# Generate sinusoidal data with noise
size = 25
rng = np.random.RandomState(1234)
x_train = rng.uniform(0.0, 1.0, size)
y_train = func(x_train) + rng.normal(scale=0.1, size=size)
x_test = np.linspace(0.0, 1.0, 100)


# #############################################################################
# Fit by cubic polynomial
n_order = 3
X_train = np.vander(x_train, n_order + 1, increasing=True)
X_test = np.vander(x_test, n_order + 1, increasing=True)

# #############################################################################
# Plot the true and predicted curves with log marginal likelihood (L)
reg = BayesianRidge(tol=1e-6, fit_intercept=False, compute_score=True)
fig, axes = plt.subplots(1, 2, figsize=(8, 4))
for i, ax in enumerate(axes):
    # Bayesian ridge regression with different initial value pairs
    if i == 0:
        init = [1 / np.var(y_train), 1.0]  # Default values
    elif i == 1:
        init = [1.0, 1e-3]
        reg.set_params(alpha_init=init[0], lambda_init=init[1])
    reg.fit(X_train, y_train)
    ymean, ystd = reg.predict(X_test, return_std=True)

    ax.plot(x_test, func(x_test), color="blue", label="sin($2\\pi x$)")
    ax.scatter(x_train, y_train, s=50, alpha=0.5, label="observation")
    ax.plot(x_test, ymean, color="red", label="predict mean")
    ax.fill_between(
        x_test, ymean - ystd, ymean + ystd, color="pink", alpha=0.5, label="predict std"
    )
    ax.set_ylim(-1.3, 1.3)
    ax.legend()
    title = "$\\alpha$_init$={:.2f},\\ \\lambda$_init$={}$".format(init[0], init[1])
    if i == 0:
        title += " (Default)"
    ax.set_title(title, fontsize=12)
    text = "$\\alpha={:.1f}$\n$\\lambda={:.3f}$\n$L={:.1f}$".format(
        reg.alpha_, reg.lambda_, reg.scores_[-1]
    )
    ax.text(0.05, -1.0, text, fontsize=12)

plt.tight_layout()
plt.show()
[Figure: ../_images/curve-fit-sklearn.svg]

Shows predictions for sklearn's BayesianRidge model on a curve-fitting example. The left panel shows the predictions with default parameters (performs poorly); the right panel shows the predictions after the initial parameters have been tweaked (performs better).

In comparison, the algorithm we presented requires no initial parameters; and because the hyperparameter is integrated over, poorly performing values contribute little to the posterior probability mass.

We can see that our approach handles the curve-fitting example without requiring any tweaking.

# #############################################################################
# Plot the true and predicted curves for bbai's BayesianRidgeRegression model
from bbai.glm import BayesianRidgeRegression
reg = BayesianRidgeRegression(fit_intercept=False)
fig, ax = plt.subplots(1, 1, figsize=(4, 4))

# Note: there are no parameters to tweak
reg.fit(X_train, y_train)
ymean, ystd = reg.predict(X_test, return_std=True)

ax.plot(x_test, func(x_test), color="blue", label="sin($2\\pi x$)")
ax.scatter(x_train, y_train, s=50, alpha=0.5, label="observation")
ax.plot(x_test, ymean, color="red", label="predict mean")
ax.fill_between(
    x_test, ymean - ystd, ymean + ystd, color="pink", alpha=0.5, label="predict std"
)
ax.set_ylim(-1.3, 1.3)
ax.legend()

plt.tight_layout()
plt.show()
[Figure: ../_images/curve-fit-bbai.svg]

Shows predictions from our Bayesian ridge regression algorithm on the curve-fitting example. In comparison to sklearn, our algorithm performs well without requiring any tweaking.

The full curve-fitting comparison example is available here.
