This guide describes a Bayesian algorithm for regularized linear regression.
The algorithm uses a hyperparameter to control
regularization strength and fully integrates over the hyperparameter in the posterior
distribution, applying a hyperprior selected so as to be approximately noninformative.
Let θ = (σ², w) denote the parameters of a linear regression model with weights w and normally distributed errors of variance σ².
If X is a full-rank n×p matrix with n rows of observations and p regressors, then θ specifies a probability distribution over possible target values y:

$$P(y \mid \theta) = \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^n \exp\left[-\frac{1}{2\sigma^2}(y - Xw)^\top (y - Xw)\right]$$
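For concreteness, here is a minimal sketch (not code from the library described here; the function and argument names are ours) of evaluating the log of this likelihood with NumPy:

```python
import numpy as np

# Log of P(y | theta) above, for theta = (sigma^2, w).
# y: targets of length n, X: n-by-p design matrix, w: weights, sigma2: error variance.
def log_likelihood(y, X, w, sigma2):
    n = len(y)
    resid = y - X @ w
    return -0.5 * n * np.log(2 * np.pi * sigma2) - (resid @ resid) / (2 * sigma2)
```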
Suppose we observe y and assume y is generated from a linear model with unknown
parameters. A Bayesian approach to inference seeks to quantify our belief in the unknown
parameters θ given the observation:

$$P(\theta \mid y)$$
Applying Bayes’ theorem, we can rewrite the probability as

$$P(\theta \mid y) = \frac{P(y \mid \theta) \cdot P(\theta)}{P(y)}$$
where we refer to
- P(θ ∣ y) as the posterior distribution
- P(y ∣ θ) as the likelihood function
- P(θ) as the prior distribution
The prior distribution describes our belief about θ before observing y, and
the posterior distribution describes our updated belief after observing y.
Suppose the prior distribution can be expressed as

$$P(\theta) = h(\theta, \eta)$$

where h(⋅, η) denotes a family of probability distributions parameterized
by what we call a hyperparameter η.
Traditional approaches to Bayesian linear regression have used what are called conjugate priors.
A family of priors h(⋅, η) is conjugate if the posterior also belongs
to the family:

$$P(\theta \mid y) = \frac{P(y \mid \theta) \cdot h(\theta, \eta)}{P(y)} = h(\theta, \eta')$$
Conjugate priors are mathematically convenient as successive observations can be viewed as making
successive updates to the parameters of a family of distributions, but requiring h to
be conjugate is a strong assumption.
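As a quick illustration of conjugacy (a standard textbook example, not part of the algorithm described here), consider a single observation y from a normal distribution with known variance σ² and unknown mean μ, with a normal prior μ ∼ N(μ₀, τ₀²). The posterior is again normal, so the family is conjugate and observing y simply updates the hyperparameters (μ₀, τ₀²):

$$\mu \mid y \;\sim\; \mathcal{N}\!\left(\frac{\tau_0^2\, y + \sigma^2 \mu_0}{\tau_0^2 + \sigma^2},\; \left(\frac{1}{\tau_0^2} + \frac{1}{\sigma^2}\right)^{-1}\right)$$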
Note
See Appendix A for a more detailed comparison to other Bayesian algorithms.
We’ll instead describe an algorithm where:
- Priors are selected to shrink w, reflecting the prior hypothesis that w is not predictive, and to be approximately noninformative for the other parameters.
- We fully integrate over hyperparameters so that no arbitrary choice of η is required.
Let’s first consider what it means for a prior to be noninformative.
See section 1.3 in “Bayesian Inference in Statistical Analysis”, Box and Tiao, 1973,
for greater detail on how to select noninformative priors.
Suppose y is data generated from a normal distribution with mean 0 but unknown
variance. Let σ denote the standard deviation
and let ℓ denote the likelihood function:

$$P(y \mid \sigma) \propto \ell(\sigma) \propto \left(\frac{1}{\sigma}\right)^n \exp\left[-\frac{1}{2\sigma^2}\, y^\top y\right]$$
Suppose we impose a uniform prior over σ so that the posterior is

$$P(\sigma \mid y) \propto \ell(\sigma)$$

and the cumulative distribution function is

$$P(\sigma < t \mid y) = \frac{1}{N} \int_0^t \ell(\sigma)\, d\sigma$$

where N is some normalizing constant.
But now let’s suppose that instead of standard deviation we parameterize over variance. Making
the substitution u = σ², with

$$d\sigma = \frac{1}{2\sqrt{u}}\, du,$$

into the cumulative distribution function gives us

$$P(u < t \mid y) = \frac{1}{N} \int_0^t \ell(\sqrt{u}) \left(\frac{1}{2\sqrt{u}}\right) du$$

Thus, we see that choosing a uniform prior over σ is equivalent to choosing the
improper prior

$$P(u) \propto \frac{1}{\sqrt{u}}$$
over variance. In general, suppose u = φ(θ) is an
alternative way of parameterizing the likelihood function, where φ is some monotonic,
one-to-one, onto function. Then a uniform prior over u is equivalent to choosing the prior

$$P(\theta) \propto \varphi'(\theta)$$

over θ.
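As a quick numerical sanity check of this rule (illustrative only; the variable names are ours), take φ(θ) = θ², draw u uniformly on [0, 1], and confirm that θ = √u has density proportional to φ′(θ) = 2θ:

```python
import numpy as np

# If u = phi(theta) = theta^2 is uniform on [0, 1], then theta = sqrt(u)
# should have density proportional to phi'(theta) = 2 * theta on [0, 1].
rng = np.random.default_rng(0)
u = rng.uniform(0.0, 1.0, size=1_000_000)
theta = np.sqrt(u)

# Compare an empirical histogram of theta against the predicted density 2 * theta.
hist, edges = np.histogram(theta, bins=20, range=(0.0, 1.0), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
print(np.max(np.abs(hist - 2 * centers)))  # small (on the order of 1e-2)
```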
So, when does it make sense to use a uniform prior if the choice is sensitive to parameterization?
Let’s consider how the likelihood is affected by changes in the observed value
y⊤y.
Figure: Likelihood function for the standard deviation of normally distributed data with zero mean, n=10, and different values of y⊤y.
As we can see from the figure, as y⊤y increases the shape of the likelihood
function changes: it becomes less peaked and more spread out.
Observe that we can rewrite the likelihood function as

$$\ell(\sigma) \propto \exp\left[-n\left(\log\sigma - \log s\right) - \frac{n}{2}\exp\left(-2\left(\log\sigma - \log s\right)\right)\right], \qquad s = \sqrt{y^\top y / n}$$

and we say that the likelihood function is data translated in log σ:
everything is known about the shape of the likelihood curve except its location, and the
observed value serves only to shift the curve along the log σ axis.
Figure: Likelihood function for the log standard deviation of normally distributed data with zero mean, n=10, and different values of y⊤y.
When the likelihood function is data translated in a parameter, it makes sense to use a
uniform prior over that parameter as a noninformative prior. Before observing data, we have no reason to prefer one
range of parameters [a, b] over another range of the same width
[t+a, t+b], because they will be equivalently placed relative to the
likelihood curve if the observed data translates it by t.
For the parameter σ, we use the noninformative prior

$$\Pr(\sigma) \propto \frac{1}{\sigma}$$

which is equivalent to using a uniform prior over the parameter log σ.
For w, we want an informative prior that shrinks the weights, reflecting a prior
belief that the weights are not predictive. Let η denote a hyperparameter that controls
the degree of shrinkage. Then we use the spherical normal distribution with covariance
matrix (σ/λη)² I:

$$\Pr(w \mid \sigma, \eta) \propto \left(\frac{1}{\sigma/\lambda_\eta}\right)^p \exp\left[-\frac{1}{2\left(\sigma/\lambda_\eta\right)^2}\, w^\top w\right]$$
Note that we haven’t yet described how η parameterizes λη. We’ll also
be integrating over η, so we additionally have a prior for η
(called a hyperprior), so that

$$\Pr(w, \sigma, \eta) = \Pr(\sigma) \cdot \Pr(w \mid \sigma, \eta) \cdot \Pr(\eta)$$

Our goal is for the prior Pr(η) to be noninformative. So we want to know: in what
parameterization would Pr(y ∣ η) be data translated?
Let U, Σ, and V denote the singular value decomposition of X,
so that

$$X = U \Sigma V^\top$$

Let ξ₁, …, ξp denote the nonzero diagonal entries of Σ.
Put Λ = ΣᵀΣ. Then Λ is a diagonal matrix with entries
ξ₁², …, ξp², and

$$X^\top X = V \Sigma^\top U^\top U \Sigma V^\top = V \Lambda V^\top$$
Note
To implement our algorithm, we will only need the matrix V and the nonzero diagonal
entries of Σ, which can be efficiently computed with a partial singular value
decomposition.
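For example, with NumPy a thin SVD is enough to recover V and the nonzero singular values (an illustrative sketch, not the library's internal code):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 10))  # an example n-by-p design matrix

# Thin SVD: U is n-by-p, xi holds the p nonzero singular values, Vt is p-by-p.
U, xi, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T
Lambda = np.diag(xi**2)  # Lambda = Sigma^T Sigma

# Sanity check: X^T X = V Lambda V^T
assert np.allclose(X.T @ X, V @ Lambda @ V.T)
```

For large sparse problems, a truncated or partial SVD routine (for example, scipy.sparse.linalg.svds) can be used in place of the full decomposition.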
The result of fitting a Bayesian model is the posterior distribution Pr(θ ∣ y).
Let’s consider how we can use the distribution to make predictions given a new data point
x′.
We’ll start by computing the expected target value

$$\mathbb{E}[y' \mid y, x'] = x'^\top\, \mathbb{E}[w \mid y] = \int x'^\top\, \mathbb{E}[w \mid y, \eta]\, \Pr(\eta \mid y)\, d\eta$$

To compute expected values of expressions involving η, we need to integrate over the
posterior distribution Pr(η ∣ y). We won’t have an analytical form for the integrals, but we can
integrate efficiently and accurately with an adaptive quadrature.
We’ve seen that computing statistics about predictions involves integrating over the
posterior distribution P(η ∣ y). We’ll briefly sketch out an algorithm for
computing such integrals. We describe it only for computing the expected value of
w; other expected values can be computed with straightforward modifications.
The procedure SHRINK-INVERSE applies Newton’s algorithm for
root-finding with r and r′ to compute r⁻¹.
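The function r is not defined in this excerpt, but the inversion step can be sketched generically: given r and its derivative r′, Newton's method solves r(x) = t for x. The helper below is a hypothetical sketch with our own names, assuming r is smooth and monotonic near the solution.

```python
def invert_with_newton(r, r_prime, t, x0, tol=1e-10, max_iter=100):
    """Solve r(x) = t by Newton root-finding on f(x) = r(x) - t.

    Hypothetical sketch: assumes r is smooth and monotonic near the solution."""
    x = x0
    for _ in range(max_iter):
        f = r(x) - t
        if abs(f) < tol:
            break
        x -= f / r_prime(x)
    return x

# Example with a stand-in function r(x) = x**3, so r^{-1}(8) = 2.
print(invert_with_newton(lambda x: x**3, lambda x: 3 * x**2, t=8.0, x0=1.0))
```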
To compute integrals over the hyperparameter posterior, we make use of an adaptive quadrature algorithm.
Quadratures approximate an integral by a weighted sum over a set of evaluation points:

$$\int_a^b f(x)\, dx \approx \sum_i w_i f(x_i)$$
In general, the more points of evaluation used, the more accurate the approximation will be. Adaptive
quadrature algorithms compare the integral approximation at different levels of refinement to
approximate the error and increase the number of points of evaluation until a desired tolerance is reached.
Note
We omit the details of the quadrature algorithm used and describe it only at a high level. For more
information on specific quadrature algorithms, refer to Gauss-Hermite Quadratures and Tanh-sinh Quadratures.
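To make the idea of comparing refinement levels concrete, here is a small, self-contained adaptive Simpson integrator. It only illustrates adaptive quadrature in general; it is not the Gauss-Hermite or tanh-sinh rule used by the algorithm.

```python
import math

def adaptive_simpson(f, a, b, tol=1e-8):
    """Recursive adaptive quadrature based on Simpson's rule."""
    def simpson(lo, hi):
        mid = 0.5 * (lo + hi)
        return (hi - lo) / 6.0 * (f(lo) + 4.0 * f(mid) + f(hi))

    def recurse(lo, hi, whole, tol):
        mid = 0.5 * (lo + hi)
        left, right = simpson(lo, mid), simpson(mid, hi)
        # The gap between the coarse and refined estimates approximates the error.
        if abs(left + right - whole) < 15.0 * tol:
            return left + right + (left + right - whole) / 15.0
        return recurse(lo, mid, left, 0.5 * tol) + recurse(mid, hi, right, 0.5 * tol)

    return recurse(a, b, simpson(a, b), tol)

# Example: integrate exp(-x^2) over [0, 5]; the exact value is close to sqrt(pi)/2.
print(adaptive_simpson(lambda x: math.exp(-x * x), 0.0, 5.0))  # ~0.886227
```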
To better understand how the algorithm works in practice, we’ll set up a small
experimental data set, and we’ll compare a model fit with the Bayesian algorithm to
Ordinary Least Squares (OLS) and to a ridge regression model fit so as to minimize error on
a Leave-one-out Cross-validation (LOOCV) of the data set.
We’ll start by setting up the data set. For the design matrix, we’ll randomly generate a 20-by-10 matrix
X using a Gaussian with zero mean and covariance matrix K where

$$K_{ij} = \begin{cases} 1, & \text{if } i = j \\ 0.5, & \text{otherwise} \end{cases}$$
We’ll generate a weight vector with 10 elements from a spherical Gaussian with unit variance, and
we’ll rescale the weights so that the signal variance is equal to 1:

$$w^\top K w = 1$$
Then we’ll set

$$y = Xw + \varepsilon$$

where ε is a vector of 20 elements drawn from a Gaussian with unit variance.
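A sketch of this setup in NumPy (the seed and variable names are our own):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 10

# Covariance matrix K: 1 on the diagonal, 0.5 off the diagonal.
K = np.full((p, p), 0.5)
np.fill_diagonal(K, 1.0)

# Design matrix: rows drawn from a zero-mean Gaussian with covariance K.
X = rng.multivariate_normal(np.zeros(p), K, size=n)

# Weights from a spherical unit-variance Gaussian, rescaled so that w^T K w = 1.
w = rng.standard_normal(p)
w /= np.sqrt(w @ K @ w)

# Targets with unit-variance Gaussian noise.
y = X @ w + rng.standard_normal(n)
```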
Now, we’ll fit a Bayesian ridge regression model, an OLS model, and a ridge regression model
with the regularization strength set so that mean squared error is minimized on a LOOCV.
```python
# OLS
from sklearn.linear_model import LinearRegression

model_ols = LinearRegression(fit_intercept=False)
model_ols.fit(X, y)

# Bayesian Linear Ridge Regression
from bbai.glm import BayesianRidgeRegression

model_bay = BayesianRidgeRegression(fit_intercept=False)
model_bay.fit(X, y)

# Ridge Regression fit to optimize LOOCV error
#
# For details refer to
# https://arxiv.org/abs/2011.10218
from bbai.glm import RidgeRegression

model_rr = RidgeRegression(fit_intercept=False)
model_rr.fit(X, y)
```
We can measure the out-of-sample error variance for each model using the equation

$$\mathbb{E}\left[(y' - x'^\top \hat{w})^2\right] = (\hat{w} - w)^\top K (\hat{w} - w) + 1$$

where ŵ denotes a model’s fitted weights, x′ is a new data point drawn from the same zero-mean Gaussian with covariance K, and the added 1 accounts for the unit-variance noise.
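A sketch of the corresponding computation, assuming w and K from the setup above and that each fitted model exposes its weights through an sklearn-style coef_ attribute:

```python
import numpy as np

def out_of_sample_error_variance(w_hat, w, K, noise_variance=1.0):
    # (w_hat - w)^T K (w_hat - w) + noise variance
    delta = np.asarray(w_hat) - w
    return delta @ K @ delta + noise_variance

for name, model in [("OLS", model_ols), ("Bayesian", model_bay), ("Ridge (LOOCV)", model_rr)]:
    print(name, out_of_sample_error_variance(model.coef_, w, K))
```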
Here, we’ll give a brief comparison of the algorithm presented to scikit-learn’s algorithm for
Bayesian ridge regression and describe the advantages.
Scikit-learn’s algorithm makes use of conjugate priors and is therefore restricted to Gamma
priors, which require four hyperparameters that are chosen, arbitrarily, to be small values. Additionally,
it requires initial values for the parameters α and λ that are then updated
from the data.
The algorithm’s performance can be sensitive to the choice of values for these parameters, and
scikit-learn’s documentation provides a curve fitting example where the defaults perform poorly.
```python
# Author: Yoshihiro Uchida <nimbus1after2a1sun7shower@gmail.com>

import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import BayesianRidge


def func(x):
    return np.sin(2 * np.pi * x)


# #############################################################################
# Generate sinusoidal data with noise
size = 25
rng = np.random.RandomState(1234)
x_train = rng.uniform(0.0, 1.0, size)
y_train = func(x_train) + rng.normal(scale=0.1, size=size)
x_test = np.linspace(0.0, 1.0, 100)


# #############################################################################
# Fit by cubic polynomial
n_order = 3
X_train = np.vander(x_train, n_order + 1, increasing=True)
X_test = np.vander(x_test, n_order + 1, increasing=True)

# #############################################################################
# Plot the true and predicted curves with log marginal likelihood (L)
reg = BayesianRidge(tol=1e-6, fit_intercept=False, compute_score=True)
fig, axes = plt.subplots(1, 2, figsize=(8, 4))
for i, ax in enumerate(axes):
    # Bayesian ridge regression with different initial value pairs
    if i == 0:
        init = [1 / np.var(y_train), 1.0]  # Default values
    elif i == 1:
        init = [1.0, 1e-3]
        reg.set_params(alpha_init=init[0], lambda_init=init[1])
    reg.fit(X_train, y_train)
    ymean, ystd = reg.predict(X_test, return_std=True)

    ax.plot(x_test, func(x_test), color="blue", label="sin($2\\pi x$)")
    ax.scatter(x_train, y_train, s=50, alpha=0.5, label="observation")
    ax.plot(x_test, ymean, color="red", label="predict mean")
    ax.fill_between(
        x_test, ymean - ystd, ymean + ystd, color="pink", alpha=0.5, label="predict std"
    )
    ax.set_ylim(-1.3, 1.3)
    ax.legend()
    title = "$\\alpha$_init$={:.2f},\\ \\lambda$_init$={}$".format(init[0], init[1])
    if i == 0:
        title += " (Default)"
    ax.set_title(title, fontsize=12)
    text = "$\\alpha={:.1f}$\n$\\lambda={:.3f}$\n$L={:.1f}$".format(
        reg.alpha_, reg.lambda_, reg.scores_[-1]
    )
    ax.text(0.05, -1.0, text, fontsize=12)

plt.tight_layout()
plt.show()
```
Figure: Predictions for sklearn’s BayesianRidge model on a curve-fitting example. The left
panel shows the predictions with default parameters (performs poorly); the right panel shows the predictions
after the initial parameters have been tweaked (performs better).
In comparison, the algorithm we presented requires no initial parameters, and because the
hyperparameter is integrated over, poorly performing values contribute little to the posterior
probability mass.
We can see that our approach handles the curve-fitting example without requiring any tweaking.
```python
# #############################################################################
# Plot the true and predicted curves for bbai's BayesianRidgeRegression model
from bbai.glm import BayesianRidgeRegression

reg = BayesianRidgeRegression(fit_intercept=False)

fig, ax = plt.subplots(1, 1, figsize=(4, 4))

# Note: there are no parameters to tweak
reg.fit(X_train, y_train)
ymean, ystd = reg.predict(X_test, return_std=True)

ax.plot(x_test, func(x_test), color="blue", label="sin($2\\pi x$)")
ax.scatter(x_train, y_train, s=50, alpha=0.5, label="observation")
ax.plot(x_test, ymean, color="red", label="predict mean")
ax.fill_between(
    x_test, ymean - ystd, ymean + ystd, color="pink", alpha=0.5, label="predict std"
)
ax.set_ylim(-1.3, 1.3)
ax.legend()
plt.tight_layout()
plt.show()
```
Figure: Predictions from our Bayesian ridge regression algorithm on the curve-fitting example. In
comparison to sklearn, our algorithm performs well without requiring any tweaking.
The full curve-fitting comparison example is available here.