A glimpse on Gaussian process regression

The initial motivation for me to begin reading about Gaussian process (GP) regression came from Markus Gesmann’s blog entry about generalized linear models in R. The class of models implemented or available with the glm function in R comprises several interesting members that are standard tools in machine learning and data science, e.g. the logistic regression model.

Looking at the scatter plots shown in Markus’ post reminded me of the amazing talk by Michael Betancourt (there are actually two videos, but GPs only appear in the second – make sure you watch them both!). In one of the examples, he uses a Gaussian process with a logistic link function to model data on the acceptance ratio of gay marriage as a function of age. The results he presented were quite remarkable, and I thought that applying the methodology to Markus’ ice cream data set was a great opportunity to learn what Gaussian process regression is and how to implement it in Stan. I initially planned not to spend too much time on the theoretical background, but to get to the meat and potatoes quickly, i.e. try GPs in practice on a data set, see how they work, make some plots etc. But all the introductory texts that I found were either (a) very mathy, or (b) superficial and ad hoc in their motivation. I wasn’t satisfied and had the feeling that GPs remained a black box to me. Maybe you had the same impression and have now landed on this site?

I was therefore very happy to find this outstanding introduction by David MacKay (DM). I think it is just perfect – a meticulously prepared lecture by someone who is passionate about teaching.

This provided me with just the right amount of intuition and theoretical backdrop to get to grips with GPs and explore their properties in R and Stan.

In this post I will follow DM’s game plan and reproduce some of his examples, which provided me with a good intuition of what Gaussian process regression is. In the words of David MacKay: “Throwing mathematical precision to the winds, a Gaussian process can be defined as a probability distribution on a space of functions (…)”. In a future post, I will walk through an implementation in Stan, i.e. show how a GP regression can be fitted to data and used for prediction.

Gaussian processes

Let’s start with a standard definition of a Gaussian process:

Definition: A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.

That’s a fairly general definition, and moreover it’s not at all clear what such a collection of rv’s has to do with regression. To draw the connection, let me plot a bivariate Gaussian:

library(mixtools)
library(mvtnorm)
K <- matrix(c(1, 0.9093729, 0.9093729, 1), ncol = 2)
set.seed(1906)
p <- rmvnorm(1, mean = rep(0, 2), sigma = K)
plot(p, type = "n", xlim = c(-5, 5), ylim = c(-5, 5), ylab = expression(y[2]), 
    xlab = expression(y[1]), main = "Fig. 1: Contours of bivariate Gaussian distribution")
abline(h = 0, v = 0)
points(p, cex = 2, pch = 20)
ellipse(mu = rep(0, 2), sigma = K, alpha = 0.05, npoints = 250, col = "darkred")
ellipse(mu = rep(0, 2), sigma = K, alpha = 0.2, npoints = 250, col = "red")

[Figure 1: Contours of bivariate Gaussian distribution]

p
##           [,1]      [,2]
## [1,] -1.116858 -1.196803

The mean and covariance are given in the R code above. The point p has coordinates y_1 = -1.12 and y_2 = -1.19. One thing we can glean from the shape of the ellipse is that if y_1 is negative, y_2 is likely to be negative as well, and vice versa: there is positive correlation between the two.

To draw the connection to regression, I plot the point p in a different coordinate system

plot(c(1, 2), p, xlim = c(0, 3), ylim = c(-2, 0), type = "p", pch = 20, cex = 2, 
    xlab = "x", ylab = "y", main = "Fig. 2: Points in x-y-plane")

[Figure 2: Points in x-y-plane]

where y = [y_1 = -1.12, y_2 = -1.19] as before, but now the indexes 1 and 2 act as the explanatory/feature variable x = [1, 2]. The y coordinates give us the height of the points for each x. And there is really nothing sacred about the numbers 1 and 2: I could equally well call the coordinates in the first plot [y_{1.763}, y_{\pi}] and pick virtually any number to index them. This illustrates that a Gaussian process can be used to define a distribution over functions of the real numbers.

In our simple starting example, I can draw a line to connect the two dots, much as a regression line would do, to illustrate this for two dimensions. But you can perhaps imagine how I could move to higher-dimensional distributions and fill up any of the gaps before, after or between the two points.

The upshot of this is: every point y = (y_i, y_j) from a set Y, with indexes i and j from an index set X, can be taken to define two points in the Y \times X plane. And I deliberately wrote i and j instead of 1 and 2, because the indexes can be arbitrary real numbers. I can continue this simple example and sample more points (let me combine the graphs to save some space here):

par(mfcol = c(1, 2))
N <- 15
set.seed(1906)
p <- rmvnorm(N, mean = rep(0, 2), sigma = K)
plot(p, type = "n", xlim = c(-5, 5), ylim = c(-5, 5), ylab = expression(y[2]), 
    xlab = expression(y[1]), main = "Fig. 3a: Multiple draws from the Gaussian")
abline(h = 0, v = 0)
points(p, cex = 2, pch = 20)
ellipse(mu = rep(0, 2), sigma = K, alpha = 0.05, npoints = 250, col = "darkred")
ellipse(mu = rep(0, 2), sigma = K, alpha = 0.2, npoints = 250, col = "red")
plot(rep(1, N), p[, 1], xlim = c(0, 3), ylim = c(-3, 3), type = "p", pch = 20, 
    cex = 2, xlab = "x", ylab = "y", main = "Fig. 3b: Multiple draws in x-y-plane")
points(rep(2, N), p[, 2], pch = 20, cex = 2)
for (i in 1:N) {
    segments(x0 = 1, p[i, 1], x1 = 2, p[i, 2], col = i, lwd = 1.5)
}

[Figure 3: Multiple draws from the Gaussian (a) and in the x-y-plane (b)]

I have also drawn the line segments connecting the sampled values from the bivariate Gaussian. This illustrates nicely how a zero-mean Gaussian distribution with a simple covariance matrix can define random lines in the right-hand plot. A multivariate Gaussian is like a probability distribution over (finitely many) values of a function.

For now we only have two points on the right, but by going from the bivariate to the n-dimensional normal distribution I can bring in more points. And keep in mind that I can also insert points in between – the x domain is dense; I need not take only integer values. What I do have to do in order to add more points is to specify the mean and the covariance.

Another instructive view on this is when I introduce measurement errors or noise into the equation

y_i = f(x_i) + \epsilon_i where \epsilon_i \sim \text{Normal}(0, \sigma^2).

With this, my model looks very much like a non-parametric or non-linear regression model with some function f.
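To make the noise model concrete, here is a minimal sketch, written in Python/numpy rather than R purely for illustration; the 3 × 3 covariance matrix and the noise level are made-up numbers, not values from this post. It draws latent function values f from a zero-mean multivariate Gaussian and adds independent Gaussian noise to obtain the observations y.

```python
import numpy as np

rng = np.random.default_rng(1906)

# Illustrative 3x3 covariance with correlation decaying in distance
# (made-up numbers, not the kernel used later in the post)
K = np.array([[1.0, 0.9, 0.7],
              [0.9, 1.0, 0.9],
              [0.7, 0.9, 1.0]])

f = rng.multivariate_normal(mean=np.zeros(3), cov=K)  # latent f(x_i)
sigma = 0.2                                           # noise std. dev.
eps = rng.normal(0.0, sigma, size=3)                  # epsilon_i
y = f + eps                                           # y_i = f(x_i) + epsilon_i
```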

Updating the prior

If the Gaussian distribution that we started with is nothing but a prior belief about the shape of a function, then we can update this belief in light of the data.

There is a nice way to illustrate how learning from data actually works in this setting. Say we get to learn the value of y_1 = 1.1. In terms of fig. 3b this means we have to fix the left-hand point at 1.1, and any line segment connecting y_1 and y_2 has to originate from there. And maybe this gets the intuition across that this narrows down the range of values that y_2 is likely to take: it seems even more unlikely than before that, e.g., y_2 turns out negative.

We can try to confirm this intuition using the fact that if

K = \begin{bmatrix} k_{11} & k_{12} \\ k_{21} & k_{22} \end{bmatrix}

is the covariance matrix of the Gaussian, we can deduce (see here)

p(y_2 \mid y_1, K) \propto \text{exp} \left( -\frac{1}{2} \begin{bmatrix} y_1 & y_2 \end{bmatrix} K^{-1} \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} \right).

This can be simplified to a normal distribution over y_2:

p(y_2 \mid y_1, K) \propto \text{exp}\left(-\frac{1}{2} \, \frac{(y_2 - \hat{y}_2)^2}{k_{22} - k_{21} k_{11}^{-1} k_{12}}\right)

with mean \hat{y}_2 = 0 + k_{21} k_{11}^{-1} y_1 and variance k_{22} - k_{21} k_{11}^{-1} k_{12} = 1 - k_{12}^2 (since k_{11} = k_{22} = 1 here). Inserting the given numbers, you see that \hat{y}_2 = 1.0003 and that the conditional variance is around 0.17.
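These numbers are easy to verify. Below is a small sanity check, sketched in Python/numpy rather than R, that computes the conditional mean and variance of y_2 directly from the covariance matrix used above:

```python
import numpy as np

rho = 0.9093729                # off-diagonal entry of K from the R code
K = np.array([[1.0, rho],
              [rho, 1.0]])
y1 = 1.1                       # the observed value of y_1

# conditional mean and variance of y_2 given y_1 (prior means are zero)
y2_hat = K[1, 0] / K[0, 0] * y1                  # k21 * k11^{-1} * y1
y2_var = K[1, 1] - K[1, 0] * K[0, 1] / K[0, 0]   # k22 - k21 k11^{-1} k12

print(round(y2_hat, 4), round(y2_var, 2))  # 1.0003 0.17
```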

One noteworthy feature of the conditional distribution of y_{2} given y_{1} and x_1 is that it does not make any reference to the functional form of f(x). In that sense it is a non-parametric prediction method, because it does not depend on specifying the function f linking y to x.

Let’s update Figure 3:

par(mfcol = c(1, 2))
set.seed(1906)
y1.actual <- 1.1
y2.hat <- 0 + y1.actual * K[1, 2]
y2.var <- (1 - K[1, 2]^2)
p2 <- data.frame(y1 = rep(y1.actual, N), y2 = rnorm(N, mean = y2.hat, sd = sqrt(y2.var)))
plot(p2, type = "n", xlim = c(-5, 5), ylim = c(-5, 5), ylab = expression(y[2]), 
    xlab = expression(y[1]), main = "Fig. 3a: Multiple draws from the two-dimensional Gaussian")
abline(h = 0, v = 0)
abline(v = y1.actual, lty = 2, col = "grey", lwd = 1.5)
points(p, cex = 2, pch = 20, col = "lightgrey")
points(p2, cex = 2, pch = 20)
ellipse(mu = rep(0, 2), sigma = K, alpha = 0.05, npoints = 250, col = "darkred")
ellipse(mu = rep(0, 2), sigma = K, alpha = 0.2, npoints = 250, col = "red")
plot(rep(1, N), p2[, 1], xlim = c(0, 3), ylim = c(-3, 3), type = "n", pch = 20, 
    cex = 2, xlab = "x", ylab = "y", main = "Fig. 3b: Multiple draws in x-y-plane")
points(rep(1, N), p[, 1], pch = 20, cex = 2, col = "lightgrey")
points(rep(2, N), p[, 2], pch = 20, cex = 2, col = "lightgrey")
for (i in 1:N) {
    segments(x0 = 1, p[i, 1], x1 = 2, p[i, 2], col = "lightgrey", lwd = 1.5, 
        lty = 2)
}
for (i in 1:N) {
    segments(x0 = 1, p2[i, 1], x1 = 2, p2[i, 2], col = i, lwd = 1.5)
}
points(1, y1.actual, pch = 20, cex = 2)
points(rep(2, N), p2[, 2], pch = 20, cex = 2)

[Figure 3, updated: draws from the Gaussian conditional on y_1 = 1.1 (a) and in the x-y-plane (b)]

The conditional distribution is considerably more peaked, and the right-hand plot shows how this helps to narrow down the likely values of y_2.

The upshot here is: there is a straightforward way to update the a priori GP to obtain simple expressions for the predictive distribution of points not in our training sample.
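This conditioning generalizes beyond two dimensions via the standard formulas for a partitioned Gaussian. The sketch below (Python/numpy; the function name and partitioning code are my own, not from this post) computes the predictive mean and covariance of unobserved points given observed ones, and reproduces the bivariate numbers from above:

```python
import numpy as np

def gp_condition(K, obs_idx, y_obs):
    """Mean and covariance of the unobserved components of a zero-mean
    Gaussian with covariance K, given observed values y_obs at obs_idx."""
    n = K.shape[0]
    un_idx = [i for i in range(n) if i not in obs_idx]
    K_oo = K[np.ix_(obs_idx, obs_idx)]  # observed block
    K_uo = K[np.ix_(un_idx, obs_idx)]   # cross block
    K_uu = K[np.ix_(un_idx, un_idx)]    # unobserved block
    A = K_uo @ np.linalg.inv(K_oo)
    return A @ y_obs, K_uu - A @ K_uo.T

rho = 0.9093729
K = np.array([[1.0, rho], [rho, 1.0]])
mean, cov = gp_condition(K, [0], np.array([1.1]))
# mean is about 1.0003 and cov about 0.173, matching the bivariate case
```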

Higher dimensions

The connection to non-linear regression becomes more apparent if we move from a bivariate Gaussian to a higher-dimensional distribution. With more than two dimensions, I cannot draw the underlying contours of the Gaussian anymore, but I can continue to plot the result in the Y \times X plane.

n <- 6
K <- outer(1:n, 1:n, function(i, j) exp(-0.095 * (i - j)^2))  # covariance from the formula given below
p <- rmvnorm(N, mean = rep(0, n), sigma = K)
plot(1, type = "n", xlim = c(0, 7), ylim = c(-3, 5), xlab = "x", ylab = "y")
abline(h = 0, v = 0)
for (i in 1:n) {
    points(rep(i, N), p[, i], pch = 20, cex = 2)
}
for (i in 1:N) {
    for (j in 1:(n - 1)) {
        segments(x0 = j, p[i, j], x1 = j + 1, p[i, j + 1], col = i, lwd = 1.5)
    }
}

[Plot: draws from the six-dimensional Gaussian in the x-y-plane]

where again the mean of the Gaussian is zero and now the covariance matrix is

K
##            [,1]      [,2]      [,3]      [,4]      [,5]       [,6]
## [1,] 1.00000000 0.9093729 0.6838614 0.4252832 0.2187119 0.09301449
## [2,] 0.90937293 1.0000000 0.9093729 0.6838614 0.4252832 0.21871189
## [3,] 0.68386141 0.9093729 1.0000000 0.9093729 0.6838614 0.42528319
## [4,] 0.42528319 0.6838614 0.9093729 1.0000000 0.9093729 0.68386141
## [5,] 0.21871189 0.4252832 0.6838614 0.9093729 1.0000000 0.90937293
## [6,] 0.09301449 0.2187119 0.4252832 0.6838614 0.9093729 1.00000000

Having added more points confirms our intuition that a Gaussian process is like a probability distribution over functions. It also seems that if we added more and more points, the lines would become smoother and smoother. And in fact, for the most common specification of Gaussian processes, this will be the case, i.e. the GP prior will imply a smooth function.

Covariance function

Drawing more points into the plots was easy for me because I had the mean and the covariance matrix given, but how exactly did I choose them? Randomly? I will give you the details below, but it should be clear that when we want to define a Gaussian process over an arbitrary (but finite) number of points, we need some systematic way of producing the covariance matrix and the vector of means. The former is usually written as a covariance function k(x, x') for any two (feature) vectors x and x' in the domain of the function. The latter is usually denoted m(x) and set to zero. With this, one usually writes

y \sim GP ( m(x), k(x, x')).

With m set to zero, the entire shape or dynamics of the process are governed by the covariance matrix. If you look back at the last plot, you might notice that the covariance matrix I used to generate points from the six-dimensional Gaussian seems to imply a particular pattern: paths of the process that start above the horizontal line (with a positive value) tend to drift back towards zero, and the other way around for paths that start below. Like in the two-dimensional example that we started with, the larger covariance matrix implies positive autocorrelation that decays with the distance between points, which makes the paths mean-reverting.

Hence, we see one way we can model our prior beliefs: if we had a formula that returns covariance matrices generating this pattern, we would be able to postulate a prior belief for an arbitrary (finite) dimension.

Squared exponential kernel

The formula I used to generate the (i, j)-th element of the covariance matrix of the process was

k_{i,j}(x, x') = \text{exp}(-0.095 (x_i - x_j)^2)

More generally, one may write this covariance function in terms of hyperparameters

k_{i,j}(x, x') = \theta_1 + \theta_2 \, \text{exp}\left(-\frac{1}{2 \ell^2} (x_i - x_j)^2\right)

Because k is a function of the squared Euclidean distance between x_i and x_j, it captures the idea of diminishing correlation between distant points. How quickly that correlation decays is governed by the hyperparameter \ell, which is called the length scale: for large \ell the term inside the exponential stays very close to zero, so the kernel is nearly constant over large parts of the domain. The hyperparameter \theta_2 scales the overall variances and covariances, and \theta_1 allows for an offset.
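As a sketch of how these hyperparameters fit together (Python/numpy; the function name is my own), here is the squared exponential kernel. Setting \theta_1 = 0, \theta_2 = 1 and 1/(2\ell^2) = 0.095 reproduces the 6 × 6 covariance matrix printed earlier:

```python
import numpy as np

def se_kernel(x, theta1=0.0, theta2=1.0, ell=1.0):
    """Squared exponential covariance matrix over the input vector x."""
    d2 = (x[:, None] - x[None, :]) ** 2        # squared pairwise distances
    return theta1 + theta2 * np.exp(-d2 / (2.0 * ell ** 2))

x = np.arange(1.0, 7.0)                        # x = 1, ..., 6
ell = np.sqrt(1.0 / (2.0 * 0.095))             # so that 1/(2 ell^2) = 0.095
K = se_kernel(x, theta1=0.0, theta2=1.0, ell=ell)
print(K.shape, round(float(K[0, 1]), 7))       # (6, 6) 0.9093729
```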

The squared exponential kernel is apparently the most common functional form for the covariance function in applied work, but it may still seem like a very ad hoc assumption about the covariance structure. It turns out, however, that the squared exponential kernel can be derived from a linear model of basis functions of x (see section 3.1 here). In general, one is free to specify any function that returns a positive definite matrix k(x, x') for all possible x and x'. Discussing the wide array of possible kernels is certainly beyond the scope of this post and I therefore happily refer any reader to the introductory text by David MacKay (see previous link) and the textbook by Rasmussen and Williams who have an entire chapter on covariance functions and their properties.

Up next

Now that I have a rough idea of what Gaussian process regression is and how it can be used for nonlinear regression, the question is how to make it operational. In terms of the Bayesian paradigm, we would like to learn which values of \theta_1, \theta_2 and \ell are likely in light of the data. Then we can determine the mode of this posterior (MAP). Likewise, one may specify a likelihood function and use hill-climbing algorithms to find the ML estimates. It is not too hard to imagine that for real-world problems this can be delicate. Especially if we include more than one feature vector, the likelihood is often not unimodal, and all sorts of restrictions on the parameters need to be imposed to guarantee that the result is a covariance function that always returns positive definite matrices.

Fitting a GP to data will be the topic of the next post on Gaussian processes.
