# Programming simple economic experiments in shiny

Note, September 16, 2015: I have corrected some errors in the R code chunks that sneaked in when I transformed the post to WordPress format. Sorry for any inconvenience.

In this post, I want to present a flexible way to implement small surveys or economic experiments in shiny. If you have some background in experimental economics, you may have noticed that the most widely used software to implement economic experiments is zTree. To be honest, I never touched zTree, so I can say little about it, but my impression is that its capabilities are fairly limited. On the other hand, what are the alternatives? Well, let us assume pen & paper is not a viable alternative. A more modern approach is to code the experiment as a small web page, i.e. use languages such as HTML, JavaScript, Ruby or Python (or a mix of them). The resulting app is then deployed to a server and can be called in a standard web browser. Unfortunately, that requires that you are proficient in at least one of these languages and that you have an idea about how to set up a server to get this running, including how to handle users and sessions, create a database on the server to save the responses, etc. In case you are familiar with these things, I guess that's the way to go, because it's tremendously more flexible. Moreover, once you have an experiment running online, it's available everywhere. No messy zTree installation on PCs in the lab. All you need is that computers in the lab are connected to the internet (a very mild requirement, I'd say).

When I wanted to run my first experiments, I was not familiar with any of these things, and the learning curve was a bit steep for me (both in terms of time and effort). Fortunately, I had a good friend to help me out. So without being a web programmer, or having friends willing to sink their precious free time, what else can you do? Enter: Dean Attali, who blogged about how to mimic a Google Doc in shiny.

I think Dean wrote a great post (again someone who seems to be a great teacher) and made a tremendous effort to describe in detail how to build this little app from scratch. His solution also shows his shinyjs package in action, a package that I will make extensive use of here, too.

Taking Dean’s code from his GitHub repo makes it straightforward to customize and expand his doc to a small multi-page survey, where users are prompted to provide answers to several questions and their responses are saved in a ‘database’. He has a follow-up post, where he discusses different possibilities to store these responses not only locally, but also how to push them to a remote server, e.g. Amazon’s S3 or your Dropbox. Using remote storage, e.g. on S3, you can also grant your co-authors immediate access to the data (see here).

## Economic Experiments in shiny?

On second thought, many, if not most, experiments that economists run are just like these little surveys. Consider the following structure:

1. Login page, where participants that were recruited to the lab type in their credentials.
2. A single or a set of pages with instructions about the experiment that is to follow.
3. Pages with the main experiment, e.g. a series of pages where the participants have to provide answers to incentivized questions.
4. Logout page, where participants often receive feedback on their responses and their final payoff.

This structure suffices for many of the experiments that I am aware of. In what follows, I will demonstrate how, with the help of Dean's code, one can implement this basic structure in shiny. That being said, you may well ask for more – more interaction, fancy graphs – and find that the above structure is too rudimentary for what you have in mind. Think of what comes below as a proof of concept, not an exhaustive example of what is possible. In fact, I believe much fancier things are possible than what I will show below, e.g. having users interact.

### Prerequisites

I assume you work from RStudio. We will need the same prerequisites as listed by Dean:

• Packages: shiny, shinyjs, digest and dplyr, installed from CRAN.
• In case you want to replicate Dean's or my example, you will also need to install the package DT to render the table using the jQuery plug-in DataTables. You no longer have to install DT from GitHub, since it has been moved to CRAN as well. If you do not plan to include a data table like the one here, you will not need DT.
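If you are starting from a clean R installation, the whole setup is a one-liner (a setup sketch with the package names listed above):

```r
# install all required packages from CRAN (DT is only needed for the data tables)
install.packages(c("shiny", "shinyjs", "digest", "dplyr", "DT"))
```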

## A simple experiment

I will split the shiny app into three files: helpers.R, server.R and ui.R. In case you are not familiar with the latter two and/or the basic structure of shiny apps, I recommend you read through this introduction first. The file helpers.R will be sourced in one of the two other files and loads some workhorse functions and parameters – it keeps things tidy.

### Handling login, instructions & final page

Without any reference to the actual experiment that we will implement, we can begin to code the login page, which may serve as a blueprint independent of what comes thereafter. Let’s begin with the ui.R, i.e. the user interface that participants in the lab will see. Following the basic structure, this will be a login page. Start with a skeletal UI and consider the following basic page:

```r
shinyUI(fluidPage(
  useShinyjs(),
  div(
    id = "login_page",
    titlePanel("Welcome to the experiment!"),
    br(),
    sidebarLayout(

      sidebarPanel(
        p("Welcome to today's experiment. Please use the user name provided on the instructions to login into the experiment."),
        hidden(
          div(
            id = "login_error",
            span("Your user name is invalid. Please check for typos and try again.", style = "color:red")
          )
        )
      ),

      mainPanel(
        textInput("user", "User", ""),
        passwordInput("password", "Password", ""),
        actionButton("login", "Login")
      )
    )
  )
))
```


Let's walk through the code together. First, I load shinyjs by writing useShinyjs() to be able to call functions from that package. Next, I use the HTML tag div() to create a division, just like in plain HTML. By giving the division an id (here: id = "login_page"), I can later reference it and thereby change its properties using functions from shinyjs. Shiny keeps a list of all named items (divs, buttons, widgets and so on) in the background and allows you to call them by their name to affect them. Inside the div(...), I have put a plain shiny page that consists of a titlePanel and a sidebarLayout that you should be familiar with. I have put some basic login instructions in the sidebarPanel part of the page, and a little further down you see that the mainPanel contains three widgets: a textInput for the user name, an input for the password, and an actionButton which is there for the user to confirm her credentials.

#### Enter: shinyjs

I haven't talked about the part of the sidebarPanel that is wrapped in the hidden() command. This is our first encounter with a basic JS function provided by shinyjs. If you consult the help file ?hidden, it says:

Create a Shiny tag that is invisible when the Shiny app starts. The tag can be made visible later with shinyjs::toggle or shinyjs::show.

Thus, everything inside hidden will be invisible to the participants/users when the screen/app is loaded. In the above example, I have put an error message inside hidden, which will be uncovered in case the login credentials are invalid. The help file for hidden already mentions that anything hidden can be uncovered using the command show. Likewise, once shown, items can (again) be hidden using hide.
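To see the mechanics in isolation, here is a minimal self-contained sketch (the ids msg and toggle are made up for this example and are not part of the experiment app):

```r
library(shiny)
library(shinyjs)

ui <- fluidPage(
  useShinyjs(),
  # invisible on startup because it is wrapped in hidden()
  hidden(div(id = "msg", p("Now you see me."))),
  actionButton("toggle", "Show/hide")
)

server <- function(input, output, session) {
  observeEvent(input$toggle, {
    # toggle() flips between show() and hide() on each click
    toggle("msg")
  })
}
# runApp(shinyApp(ui, server))  # uncomment to try it interactively
```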

How and under which conditions login_error is uncovered is handled in the server.R script below. Details aside for now, the principle is as follows: the app waits for the user to press the login button. Once it is pressed, the server checks the credentials. This check returns a boolean (TRUE or FALSE), and in case it is FALSE, the div with id login_error is uncovered.

This is also the general principle by which I will build a multi-page app. For every “page” in your app, include a div in the ui.R and hide all but the initial screen using hidden. Then define certain “events” in the server.R script, e.g. the user clicking on a button that says “Continue to next page”. Let the server observe whether this event took place, and if the answer is yes, hide the current page and uncover the next. For example, what happens if the login credentials are correct? The app will hide the div login_page and uncover another div, called instructions say, which was not yet included above, but which prints the instructions to the screen. E.g.:

```r
shinyUI(fluidPage(
  useShinyjs(),
  div(
    id = "login_page",
    titlePanel("Welcome to the experiment!"),
    br(),
    sidebarLayout(

      sidebarPanel(
        p("Welcome to today's experiment. Please use the user name provided on the instructions to login into the experiment."),
        hidden(
          div(
            id = "login_error",
            span("Your user name is invalid. Please check for typos and try again.", style = "color:red")
          )
        )
      ),

      mainPanel(
        textInput("user", "User", ""),
        passwordInput("password", "Password", ""),
        actionButton("login", "Login")
      )
    )
  ),

  hidden(
    div( id = "instructions",
      h3("Here we post instructions for subjects..."),
      p("In this experiment you will have to guess in which direction
         a coin that is tossed repeatedly is biased. You will observe whether
         the coin landed heads or tails over several tosses.... Bla bla"),
      actionButton("confirm", label = "Ok, I got it...")
    )
  ),

  hidden(
    div(
      id = "end",
      titlePanel("Thank you!"),
      br(),
      p("End of experiment.")
    )
  )
))
```


These are not very carefully formatted divs, but they get the idea across. The instructions will be hidden at the outset and only uncovered once the user has submitted a valid user name and password; at the same time, the login_page will be hidden. Otherwise, the app will throw the error message. The same goes for the instructions: once the user presses the button labelled confirm, the app hides instructions and uncovers the end div. This gives the user the impression of walking through multiple pages, where in fact it is just a single page with multiple overlays that are covered and uncovered. Bingo!

Once we have understood the general “hidden-hide-show” principle, we can go on and include further pages by adding further hidden divs in the ui.R. Before we do that, let us peek into the server.R to see how the basic three-page structure we have so far is implemented server-side.

```r
shinyServer(
  function(input, output, session) {

    # When the Login button is clicked, check whether user name is in list
    observeEvent(input$login, {
      # Disable the login button, user cannot click it until it is enabled again
      disable("login")
      # Check whether the password is correct
      # Very simple here: a hard-coded password
      user_ok <- input$password == session_password

      # If credentials are valid push user into experiment
      if (user_ok) {
        hide("login_page")
        show("instructions")
      } else {
        # If credentials are invalid throw error and prompt user to try again
        reset("user")
        reset("password")
        show("login_error")
        enable("login")
      }
    })

    observeEvent(input$confirm, {
      hide("instructions")
      show("end")
    })
  }
)
```

The first handler that appears in the server script is observeEvent. Again, consulting the respective help file ?observeEvent, we get to know what this is for:

Respond to “event-like” reactive inputs, values, and expressions.

This function is used to constantly observe the value of a certain variable. For example, let us consider an actionButton, e.g. the login button:

Creates an action button or link whose value is initially zero, and increments by one each time it is pressed.

When the app is started, it initializes the action button and adds an item to the input list, i.e. for the login button it creates something like

```r
input$login <- 0
```


This value is constantly monitored by the server-side function observeEvent. Once its value changes, e.g. from zero to one, the code that appears inside the curly brackets will be executed.
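As a toy illustration of this counting behavior, consider a minimal app (separate from the experiment; the output id clicks is made up for this sketch) that simply prints the button's current value back to the user:

```r
library(shiny)

ui <- fluidPage(
  actionButton("login", "Login"),
  textOutput("clicks")
)

server <- function(input, output, session) {
  # fires every time input$login changes, i.e. on every click
  observeEvent(input$login, {
    output$clicks <- renderText(
      paste("The button has been pressed", input$login, "times.")
    )
  })
}
# runApp(shinyApp(ui, server))  # uncomment to try it interactively
```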

In the given case, this is a series of shinyjs commands together with an if-else check for whether the credentials are valid. You might wonder where the variable session_password, against which I compare the entered password, comes from. I put it into the helpers.R file, where I park all utilities, such as specific functions that I will have to call, or other hard-coded variables.

For now helpers.R, which resides in the same directory as the other two files, is simply

```r
# Password to login for this session
session_password <- "foo"
```

Of course, I have to source this file at the beginning of the server.R.

If the password is correct, user_ok evaluates to TRUE and the login page will be hidden, whereas the instructions will be shown. If user_ok evaluates to FALSE, the instructions remain hidden and the login page remains visible. However, all text inputs will be reset (they will be empty again), the error message will be shown and the login button enabled again.

This is how a basic login page can be handled. The “hidden-hide-show” principle is one of the key learning outcomes of this post.

### The main part

For the main part, we usually have to add quite a bit of functionality, because this is where the actual experiment starts and the app has to react, possibly in multiple ways, to the responses of the user. It is also common to have to store responses over multiple rounds of the same experiment.

I will demonstrate some basic functionality here by implementing a very simple (not to say boring) little experiment. The user has to guess in which direction a coin that is flipped is biased. The user observes the tosses in chunks, first one, then three, then five and so on. After each new chunk, the user has to indicate which side she thinks is the more likely one. And this will be done over several rounds. Sounds very boring – and it is – but it is a simple example that already features a lot of different things to do.
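Before wiring this into shiny, the data side of the design can be sketched in plain R. The seed, bias and chunk sizes below are illustrative only, not the values used later in helpers.R:

```r
set.seed(42)                        # illustrative seed
n_tosses <- 9
bias <- 0.6                         # probability of heads; subjects must guess the direction
tosses <- sample(c("Heads", "Tails"), n_tosses, replace = TRUE,
                 prob = c(bias, 1 - bias))

# tosses are revealed in growing chunks: first 1, then 3 more, then 5 more
chunk_ends <- cumsum(c(1, 3, 5))
views <- lapply(chunk_ends, function(k) tosses[1:k])
sapply(views, length)               # 1 4 9
```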

I will also add some functionality to the final page, by showing a table of the decisions submitted by the user together with a short information text about the payoff from the experiment.

Let us begin with the complete ui.R:

```r
library(shiny)
library(shinyjs)

source('helpers.R')

shinyUI(fluidPage(
  useShinyjs(),
  div(
    id = "login_page",
    titlePanel("Welcome to the experiment!"),
    br(),
    sidebarLayout(

      sidebarPanel(
        p("Welcome to today's experiment. Please use the user name provided on the instructions to login into the experiment."),
        hidden(
          div(
            id = "login_error",
            span("Your user name is invalid. Please check for typos and try again.", style = "color:red")
          )
        )
      ),

      mainPanel(
        textInput("user", "User", ""),
        passwordInput("password", "Password", ""),
        actionButton("login", "Login")
      )
    )
  ),

  hidden(
    div( id = "instructions",
      h3("Here we post instructions for subjects..."),
      p("In this experiment you will have to guess in which direction
         a coin that is tossed repeatedly is biased. You will observe whether
         the coin landed heads or tails over several tosses.... Bla bla"),
      actionButton("confirm", label = "Ok, I got it... let's start")
    )
  ),

  hidden(
    div(
      id = "form",
      titlePanel("Main experimental screen"),

      sidebarLayout(

        sidebarPanel(
          p("Indicate whether you think the coin that was tossed is more likely to land heads or tails based on the throws shown to you on the right."),
          radioButtons("guess",
                       label = h3("Your guess based on the tosses so far"),
                       choices = c("Heads", "Tails"),
                       selected = NULL),
          actionButton("submit", "Submit", class = "btn-primary")
        ),

        mainPanel(
          h4(textOutput("round_info")),
          dataTableOutput(outputId = "table")
        )
      )
    )
  ),

  hidden(
    div(
      id = "end",
      titlePanel("Thank you!"),

      sidebarLayout(

        sidebarPanel(
          p("You have reached the end of the experiment. Thank you for your participation."),
          textOutput("round")
        ),

        mainPanel(
          dataTableOutput(outputId = "results")
        )
      )
    )
  )
))
```


This is roughly the same as before, but with a new div for the main experimental screen. I again opted for the simple sidebarLayout, with the user inputs being a simple radio button (I'd love to initialize it with state NULL so that nothing is selected, but this is not possible – setting it to NULL is like setting it to the first option) and a submit button. The mainPanel consists of a table that lists the heads-tails outcomes for that round. It comes in the form of a DataTable, hence the DT package.

Note: You are free to do much more fancy things than a table. R is great for plotting and you can leverage its capabilities here by providing graphs that react to user input and much more.

The final page has pretty much the same structure now.

#### Handling data storage and multiple rounds

Let's think about how we want the user to experience this ui.R. First the login and instructions, as before. Then she is pushed into the first round of the main part. There she observes the split screen with the radio button on the left and the table on the right. We now wait for her to guess the bias, i.e. select an option and press submit. Then we want to show further tosses of the coin for the first round, i.e. update the table, but also not forget to save the previous response. To save user responses, I borrow the code by Dean Attali, so you might want to look at his article for further explanation. All I should mention here is that you will have to create a folder called responses inside the folder where you saved helpers.R, ui.R and server.R. This is the directory where user responses will be saved. In case you want to store the data not locally but on a remote server, you will have to change that – but Dean has a separate article on this. After the user has played through all rounds, we want to push her out of the main experiment part and route her to the final page that says good-bye and gives her the possibility to review her choices.
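The storage step itself is small. Here is a minimal sketch of the approach (file naming via timestamp plus a hash of the content, as in Dean's code; tempdir() is used here only so the sketch runs anywhere, whereas the real app writes to the responses folder next to server.R):

```r
library(digest)

responsesDir <- file.path(tempdir(), "responses")  # "responses" next to server.R in the real app
dir.create(responsesDir, showWarnings = FALSE)

humanTime <- function() format(Sys.time(), "%d-%m-%Y-%H-%M-%S")

saveData <- function(data) {
  # unique file name: timestamp plus a hash of the data frame
  fileName <- sprintf("%s_%s.csv", humanTime(), digest::digest(data))
  write.csv(data, file.path(responsesDir, fileName), row.names = FALSE)
}

saveData(data.frame(round = 1, user = "123", guess = "Heads"))
list.files(responsesDir)  # one csv file per call to saveData()
```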

That’s quite a bit to handle by server.R – let’s jump right into my final solution:

```r
library(shiny)
library(shinyjs)
require(digest)
require(dplyr)

source('helpers.R')

shinyServer(
  function(input, output, session) {

    ##########################################################
    ########### PART I: LOGIN ################################
    ##########################################################

    # When the Login button is clicked, check whether user name is in list
    observeEvent(input$login, {
      # User-experience stuff
      shinyjs::disable("login")
      # Check whether user name is correct
      # Fix me: test against a session-specific password here, not username
      user_ok <- input$password == session_password

      # If credentials are valid push user into experiment
      if (user_ok) {
        shinyjs::hide("login_page")
        shinyjs::show("instructions")

        # Save username to write into data file
        output$username <- renderText({ input$user })
      } else {
        # If credentials are invalid throw error and prompt user to try again
        reset("user")
        reset("password")
        shinyjs::show("login_error")
        shinyjs::enable("login")
      }
    })

    ##########################################################
    ########### PART II: INSTRUCTIONS ########################
    ##########################################################

    observeEvent(input$confirm, {
      hide("instructions")
      show("form")
    })

    ##########################################################
    ########### PART III: MAIN EXPERIMENT ####################
    ##########################################################

    ## Initialize reactive values
    # round is an iterator that counts how often 'submit' has been clicked.
    values <- reactiveValues(round = 1)
    # df will carry the responses submitted by the user
    values$df <- NULL

    ##########################################################
    ########## PART IIIa: MAIN HANDLER #######################
    ##########################################################

    ## This is the main experiment handler
    # Observe the submit button, if clicked... FIRE
    observeEvent(input$submit, {
      # Increment the round by one
      isolate({
        values$round <- values$round + 1
      })
      # Call function formData() (see below) to record submitted response
      newLine <- isolate(formData())
      # Write newLine into data frame df
      isolate({
        values$df <- rbind(values$df, newLine)
      })
      # Has the user reached the end of the experiment?
      # If so then...
      if (values$round > n_guesses) {

        # Draw a round from all rounds with equal probability
        # Note: the username must be numeric here, because it serves
        # as a seed for the RNG. See comment below.
        isolate(values$payroll <- payoffRound(as.numeric(input$user)))

        # Based on the drawn round determine the payoff.
        # People get a Euro for having guessed correctly.
        output$round <- renderText({
          paste0("The computer selected round ", values$payroll,
                 ". Because you guessed ",
                 ifelse(values$df[values$payroll, 3] == true_state[values$payroll],
                        "correctly ", "incorrectly "),
                 "we will add ",
                 ifelse(values$df[values$payroll, 3] == true_state[values$payroll], prize, 0),
                 " Euro to your show-up fee. Your total payoff will therefore equal ",
                 ifelse(values$df[values$payroll, 3] == true_state[values$payroll], prize, 0) + show_up,
                 " Euro.")
        })
        isolate(values$df[, 5] <- ifelse(values$df[values$payroll, 3] == true_state[values$payroll], prize, 0) + show_up)

        # The function saveData() writes the df to disk.
        # This can be a remote server!
        saveData(values$df)

        # Say good-bye
        hide(id = "form")
        show(id = "end")
      }
    })

    ## Utilities & functions

    # I take formData from Dean with minor changes.
    # When it is called, it creates a vector of data.
    # This will be a row added to values$df - one for each round.
    #
    # Gather all the form inputs (and add timestamp)
    formData <- reactive({
      data <- sapply(fieldsAll, function(x) input[[x]])
      data <- c(round = values$round - 1, data,
                timestamp = humanTime(), payoff = NA)
      data <- t(data)
      data
    })

    # The coin flips shown on the right.
    # Note I have added a small delay with progress bar, just to give
    # users a more natural look-and-feel, since throwing coins usually takes time.
    # I have disabled all of the features of DT below, because they distract users
    output$table <- renderDataTable({
      if (values$round > 1 && values$round <= n_guesses) {
        withProgress(message = 'Flipping the coin.',
                     detail = 'Please wait...', value = 0, {
          for (i in 1:15) {
            incProgress(1/15)
            Sys.sleep(0.02)
          }
        })
      }
      idx.row <- sum(!is.na(flips[, min(values$round, n_guesses)]))
      idx.col <- min(values$round, n_guesses)
      # column labels in German: 'Wurf' = toss, 'Seite' = side
      data.frame(Wurf = seq(1, idx.row),
                 Seite = flips[1:idx.row, idx.col])
    },
    options = list(paging = FALSE, searching = FALSE, ordering = FALSE)
    )

    # This renders the table of choices made by a participant that is shown
    # to them on the final screen
    output$results <- renderDataTable({
      out <- data.frame(Round = rep(seq(1, n_rounds), each = guesses_per_round),
                        Guess = rep(seq(1, guesses_per_round), times = n_rounds),
                        choice = values$df[, 3],
                        actual = rep(true_state, each = guesses_per_round))
      colnames(out) <- c("Round", "Guess no.", "Your choice", "Correct/True value")
      out
    },
    options = list(paging = FALSE, searching = FALSE, ordering = FALSE)
    )

    # Tell user where she is
    output$round_info <- renderText({
      paste0("Round ", ceiling(values$round / guesses_per_round), " of ", n_rounds)
    })
  }
)
```

Phew, there are lots of new things. Let me go from top to bottom. First, I initialize an object called values, which is a reactive object. Again, the help file has a concise description of its capabilities:

This function returns an object for storing reactive values. It is similar to a list, but with special capabilities for reactive programming. When you read a value from it, the calling reactive expression takes a reactive dependency on that value, and when you write to it, it notifies any reactive functions that depend on that value. (…)

The reactive variable values$round serves as a counter or iterator that I use to keep track of the round the user is currently in. Most experiments have participants play multiple rounds of the same task, and this variable is there to record in which round users currently are.

Next comes the main handler. The main handler observes the value of input$submit. If the user clicks it, the code in curly brackets is executed. Inside the curly brackets, we first increase values$round by one. Because this is a reactive value, it will broadcast its new value to all functions that depend on it. If you scroll down to output$table, you will note that, e.g., the data table on the right depends on values$round and hence gets updated once values$round changes. Back to the main handler. Next, the code creates an object called newLine, which is a row in our later data matrix. We write it to values$df. Then we check whether the previous round was the final round. This is the case if the iterator values$round exceeds the hard-coded number of rounds given by n_guesses. If the final round was reached, we do four things:

1. Draw one round randomly and compare the actual bias of the coin to the guess by the user.
2. Prepare a text message for the final screen which takes into account whether the user was lucky and will receive a bonus for having guessed correctly.
3. Save all received responses to disk.
4. Say good-bye by loading the final screen.

The astute reader will miss numerous functions that I call inside server.R, but that are nowhere defined.
I follow Dean Attali's style here and defer them to the helpers file:

```r
# which fields get saved
fieldsAll <- c("user", "guess")
# self-explanatory
responsesDir <- file.path("responses")
# Password to login for this session
session_password <- "foo"

### Generate data here ###

set.seed(1906)
n_rounds <- 2
n_flips <- 3
probs <- c(0.6, 0.4)
prize <- 1
show_up <- 10

probas <- array(, c(n_rounds, 2))
true_state <- sample(c("Heads", "Tails"), n_rounds, replace = TRUE)
for (i in 1:n_rounds) {
  if (true_state[i] == "Heads") {
    probas[i, ] <- probs
  } else {
    probas[i, ] <- probs[2:1]
  }
}
flips <- sapply(1:n_rounds, function(x)
  sample(c(1, -1), n_flips, replace = TRUE, prob = probas[x, ])
)
flips <- data.frame(flips)

cascade <- function(x, thin){
  tmp <- rep(1, n_flips) %x% x
  dim(tmp) <- c(n_flips, n_flips)
  tmp[lower.tri(tmp)] <- NA
  tmp[tmp == 1] <- "Heads"
  tmp[tmp != "Heads"] <- "Tails"
  if (thin > 1) {
    tmp <- tmp[, seq(1, ncol(tmp), thin)]
  }
  return(tmp)
}
tmp <- lapply(flips, cascade, thin = 2)
flips <- do.call(cbind, tmp)
n_guesses <- ncol(flips)
guesses_per_round <- n_guesses / n_rounds

# add an asterisk to an input label
labelMandatory <- function(label) {
  tagList(
    label,
    span("*", class = "mandatory_star")
  )
}

# CSS to use in the app
appCSS <- ".mandatory_star { color: red; }
.shiny-input-container { margin-top: 25px; }
.shiny-progress .progress-text {
  font-size: 18px;
  top: 50% !important;
  left: 50% !important;
  margin-top: -100px !important;
  margin-left: -250px !important;
}"

# Helper functions
humanTime <- function() format(Sys.time(), "%d-%m-%Y-%H-%M-%S")

saveData <- function(data) {
  fileName <- sprintf("%s_%s.csv", humanTime(), digest::digest(data))
  write.csv(x = data, file = file.path(responsesDir, fileName),
            row.names = FALSE, quote = TRUE)
}

payoffRound <- function(user){
  set.seed(user)
  out <- sample(seq(1, n_guesses), 1)
  return(out)
}

epochTime <- function() {
  as.integer(Sys.time())
}
```

The helper file begins with defining the user inputs that will be saved along with a timestamp etc. Another thing that it handles is generating the data. The coin flips are not drawn live during the experiment, but before the first screen is loaded. Because I set a static seed, all users will see the same sequence of flips. Finally, several workhorse functions are defined.

You may find a less thoroughly commented version of the code on GitHub.

#### Further notes & extensions

An idea that I like to borrow from Joern Hees is the possibility to use the user name (or part of it) as a seed for the random numbers that will be drawn during the experiment. Depending on your analysis and the design of the experiment, this can be very useful. For example, if you have two groups, one treatment and one control group, and you want to show one subject in the control group and one in the treatment group exactly the same stimuli, you can do this by conditioning all random numbers on their user names. For example, take users A-01 and B-01. If we denote groups by A and B, then it is easy to extract the two-digit number using a regular expression and take it as a seed for the random number generator.

The type of experiments that can be implemented in this way is still fairly limited. For example, I have not tried to program an experiment where different users interact, e.g. to run experiments that inspect how behavior changes within a group, or how certain types of communication affect outcomes. But given that there are examples of chat rooms being implemented using shiny, I see a chance this can be done.

## Conclusion

Even though the length of the post and my messy code can look daunting and give the impression that this is not much easier than becoming a web programmer, it certainly is easier. Shiny handles a lot of the server structure, the initialization of databases, sessions and much more beneath the hood. The simple experiment I implemented above is more of a proof of concept, and you may wish to expand it in many ways.
I certainly have shown only a very narrow set of the things that are possible within shiny. Thus, in case you have just started running experiments, want more flexibility than zTree offers, and do not want to hassle with the gory details, shiny might be an alternative for you.

# A glimpse on Gaussian process regression

The initial motivation for me to begin reading about Gaussian process (GP) regression came from Markus Gesmann's blog entry about generalized linear models in R. The class of models implemented or available with the glm function in R comprises several interesting members that are standard tools in machine learning and data science, e.g. the logistic regression model.

Looking at the scatter plots shown in Markus' post reminded me of the amazing talk by Michael Betancourt (there are actually two videos, but GPs only appear in the second – make sure you watch them both!). In one of the examples, he uses a Gaussian process with a logistic link function to model data on the acceptance ratio of gay marriage as a function of age. The results he presented were quite remarkable, and I thought that applying the methodology to Markus' ice cream data set was a great opportunity to learn what Gaussian process regression is and how to implement it in Stan.

I initially planned not to spend too much time on the theoretical background, but to get to the meat and potatoes quickly, i.e. try GPs in practice on a data set, see how they work, make some plots etc. But all introductory texts that I found were either (a) very mathy, or (b) superficial and ad hoc in their motivation. I wasn't satisfied and had the feeling that GPs remained a black box to me. Maybe you had the same impression and now landed on this site? I was therefore very happy to find this outstanding introduction by David MacKay (DM). I think it is just perfect – a meticulously prepared lecture by someone who is passionate about teaching.
This provided me with just the right amount of intuition and theoretical backdrop to get to grips with GPs and explore their properties in R and Stan. In this post I will follow DM's game plan and reproduce some of his examples, which provided me with a good intuition for what Gaussian process regression is, and, in the words of David MacKay, “Throwing mathematical precision to the winds, a Gaussian process can be defined as a probability distribution on a space of functions (…)”. In a future post, I will walk through an implementation in Stan, i.e. show how a GP regression can be fitted to data and be used for prediction.

### Gaussian processes

Let's start with a standard definition of a Gaussian process:

Definition: A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.

That's a fairly general definition, and moreover it's not at all clear what such a collection of rv's has to do with regressions. To draw the connection, let me plot a bivariate Gaussian:

```r
library(mixtools)
library(mvtnorm)

K <- matrix(c(1, 0.9093729, 0.9093729, 1), ncol = 2)
set.seed(1906)
p <- rmvnorm(1, mean = rep(0, 2), sigma = K)
plot(p, type = "n", xlim = c(-5, 5), ylim = c(-5, 5),
     ylab = expression(y[2]), xlab = expression(y[1]),
     main = "Fig. 1: Contours of bivariate Gaussian distribution")
abline(h = 0, v = 0)
points(p, cex = 2, pch = 20)
ellipse(mu = rep(0, 2), sigma = K, alpha = 0.05, npoints = 250, col = "darkred")
ellipse(mu = rep(0, 2), sigma = K, alpha = 0.2, npoints = 250, col = "red")

p
##           [,1]      [,2]
## [1,] -1.116858 -1.196803
```

where the mean and covariance are given in the R code. The point p has coordinates $y_1 = -1.12$ and $y_2=-1.19$. One thing we can glean from the shape of the ellipse is that if $y_1$ is negative, $y_2$ is likely to be negative as well, and vice versa. There is positive correlation between the two.
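The statement “if $y_1$ is negative, $y_2$ is likely negative” can be made precise: for a zero-mean bivariate Gaussian with unit variances, the conditional distribution of $y_2$ given $y_1$ is again Gaussian, with mean $\rho y_1$ and variance $1-\rho^2$. This is a standard result, not taken from DM's lecture, sketched here with the numbers from above:

```r
rho <- 0.9093729               # off-diagonal element of K above
y1  <- -1.116858               # observed first coordinate of p

cond_mean <- rho * y1          # E[y2 | y1]
cond_sd   <- sqrt(1 - rho^2)   # sd of y2 given y1

cond_mean
## [1] -1.01564
```

So after seeing $y_1$, our best guess for $y_2$ is pulled strongly towards a negative value as well, with much less uncertainty than the unconditional standard deviation of 1.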
To draw the connection to regression, I plot the point p in a different coordinate system:

```r
plot(c(1, 2), p, xlim = c(0, 3), ylim = c(-2, 0), type = "p", pch = 20,
     cex = 2, xlab = "x", ylab = "y", main = "Fig. 2: Points in x-y-plane")
```

where $y = [y_1=-1.12, y_2=-1.19]$ as before, but now the indexes $1$ and $2$ act as the explanatory/feature variable $x=[1,2]$. The $y$ coordinates give us the height of the points for each $x$. And there is really nothing sacred about the numbers $1$ and $2$. I could equally well call the coordinates in the first plot $[y_{1.763}, y_{\pi}]$ and pick virtually any number to index them. This illustrates that a Gaussian process can be used to define a distribution over functions on the real numbers. In our simple starting example, I can draw a line to connect the two dots, much as a regression line would do for two dimensions. But maybe you can imagine how I can go to higher-dimensional distributions and fill up any of the gaps before, after or between the two points.

The upshot of this is: every point $y=(y_i, y_j)$ from a set $Y$ with indexes $i$ and $j$ from an index set $X$ can be taken to define two points in the $Y \times X$ plane. And I deliberately wrote $i$ and $j$ instead of 1 and 2, because the indexes can be arbitrary real numbers. I can continue this simple example and sample more points (let me combine the graphs to save some space here):

```r
par(mfcol = c(1, 2))
N <- 15
set.seed(1906)
p <- rmvnorm(N, mean = rep(0, 2), sigma = K)
plot(p, type = "n", xlim = c(-5, 5), ylim = c(-5, 5),
     ylab = expression(y[2]), xlab = expression(y[1]),
     main = "Fig. 3a: Multiple draws from the Gaussian")
abline(h = 0, v = 0)
points(p, cex = 2, pch = 20)
ellipse(mu = rep(0, 2), sigma = K, alpha = 0.05, npoints = 250, col = "darkred")
ellipse(mu = rep(0, 2), sigma = K, alpha = 0.2, npoints = 250, col = "red")
plot(rep(1, N), p[, 1], xlim = c(0, 3), ylim = c(-3, 3), type = "p", pch = 20,
     cex = 2, xlab = "x", ylab = "y", main = "Fig. 3b: Multiple draws in x-y-plane")
points(rep(2, N), p[, 2], pch = 20, cex = 2)
for (i in 1:N) {
  segments(x0 = 1, p[i, 1], x1 = 2, p[i, 2], col = i, lwd = 1.5)
}
```

I have also drawn the line segments connecting the sampled values from the bivariate Gaussian. This illustrates nicely how a zero-mean Gaussian distribution with a simple covariance matrix can define random linear lines in the right-hand plot. A multivariate Gaussian is like a probability distribution over (finitely many) values of a function. For now we only have two points on the right, but by going from the bivariate to the $n$-dimensional normal distribution I can get more points in. And keep in mind, I can also insert points in between – the $x$ domain is really dense, I need not take just integer values. All I have to do in order to add more points is to specify the mean and the covariance.

Another instructive view on this arises when I introduce measurement errors or noise into the equation, $y_i = f(x_i) + \epsilon_i$ where $\epsilon_i \sim \text{Normal}(0, \sigma^2)$. With this, my model very much looks like a non-parametric or non-linear regression model with some function $f$.

#### Updating the prior

If the Gaussian distribution that we started with is nothing but a prior belief about the shape of a function, then we can update this belief in light of the data. There is a nice way to illustrate how learning from data actually works in this setting. Say we get to learn the value of $y_1=1.1$. In terms of Fig. 3b this means we have to fix the left-hand point at $1.1$, and any line segment connecting $y_1$ and $y_2$ has to originate from there. And maybe this gets the intuition across that this narrows down the range of values that $y_2$ is likely to take: it seems even more unlikely than before that, e.g., $y_2 \leq 0$.

We can try to confirm this intuition using the fact that if

$K = \begin{bmatrix} k_{11} & k_{12} \\ k_{21} & k_{22} \end{bmatrix}$

is the covariance matrix of the Gaussian, we can deduce (see here)

$p(y_2 \mid y_1, K) \propto \text{exp} \left( -\frac{1}{2} \begin{bmatrix} y_1 & y_2 \end{bmatrix} K^{-1} \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} \right)$.

This can be simplified to a normal over $y_2$,

$p(y_2 \mid y_1, K) \propto \text{exp} \left( -\frac{1}{2} \frac{(y_2 - \hat{y}_2)^2}{\sigma_{2|1}^2} \right)$

with mean $\hat{y}_2 = 0 + k_{21} k_{11}^{-1} y_1$ and variance $\sigma_{2|1}^2 = k_{22} - k_{21} k_{11}^{-1} k_{12}$, which with $k_{11} = k_{22} = 1$ reduces to $\hat{y}_2 = k_{12} y_1$ and variance $1-k_{12}^2$. Inserting the given numbers, you see that $\hat{y}_2 = 1.0003$ and that the conditional variance is around $0.17$. One noteworthy feature of the conditional distribution of $y_{2}$ given $y_{1}$ and $x_1$ is that it does not make any reference to the functional form of $f(x)$. In that sense it is a non-parametric prediction method, because it does not depend on specifying the function $f$ linking $y$ to $x$. Let’s update Figure 3:

```r
par(mfcol = c(1, 2))
set.seed(1906)
y1.actual <- 1.1
y2.hat <- 0 + y1.actual * K[1, 2]
y2.var <- (1 - K[1, 2]^2)
p2 <- data.frame(y1 = rep(y1.actual, N),
                 y2 = rnorm(N, mean = y2.hat, sd = sqrt(y2.var)))
plot(p2, type = "n", xlim = c(-5, 5), ylim = c(-5, 5),
     ylab = expression(y[2]), xlab = expression(y[1]),
     main = "Fig. 3a: Multiple draws from the two-dimensional Gaussian")
abline(h = 0, v = 0)
abline(v = y1.actual, lty = 2, col = "grey", lwd = 1.5)
points(p, cex = 2, pch = 20, col = "lightgrey")
points(p2, cex = 2, pch = 20)
ellipse(mu = rep(0, 2), sigma = K, alpha = 0.05, npoints = 250, col = "darkred")
ellipse(mu = rep(0, 2), sigma = K, alpha = 0.2, npoints = 250, col = "red")
plot(rep(1, N), p2[, 1], xlim = c(0, 3), ylim = c(-3, 3), type = "n", pch = 20,
     cex = 2, xlab = "x", ylab = "y", main = "Fig. 3b: Multiple draws in x-y-plane")
points(rep(1, N), p[, 1], pch = 20, cex = 2, col = "lightgrey")
points(rep(2, N), p[, 2], pch = 20, cex = 2, col = "lightgrey")
for (i in 1:N) {
  segments(x0 = 1, p[i, 1], x1 = 2, p[i, 2], col = "lightgrey", lwd = 1.5, lty = 2)
}
for (i in 1:N) {
  segments(x0 = 1, p2[i, 1], x1 = 2, p2[i, 2], col = i, lwd = 1.5)
}
points(1, y1.actual, pch = 20, cex = 2)
points(rep(2, N), p2[, 2], pch = 20, cex = 2)
```

The conditional distribution is considerably more pointed, and the right-hand plot shows how this helps to narrow down the likely values of $y_2$. The upshot here is: there is a straightforward way to update the a priori GP to obtain simple expressions for the predictive distribution of points not in our training sample.

#### Higher dimensions

The connection to non-linear regression becomes more apparent if we move from a bivariate Gaussian to a higher-dimensional distribution. With more than two dimensions, I cannot draw the underlying contours of the Gaussian anymore, but I can continue to plot the result in the $Y \times X$ plane.
```r
n <- 6  # dimension of the Gaussian
# covariance matrix (reconstructed here from the kernel given below)
K <- outer(1:n, 1:n, function(i, j) exp(-0.095 * (i - j)^2))
p <- rmvnorm(N, mean = rep(0, n), sigma = K)
plot(1, type = "n", xlim = c(0, 7), ylim = c(-3, 5), xlab = "x", ylab = "y")
abline(h = 0, v = 0)
for (i in 1:n) {
  points(rep(i, N), p[, i], pch = 20, cex = 2)
}
for (i in 1:N) {
  for (j in 1:(n - 1)) {
    segments(x0 = j, p[i, j], x1 = j + 1, p[i, j + 1], col = i, lwd = 1.5)
  }
}
```

where again the mean of the Gaussian is zero and now the covariance matrix is

```r
K
##            [,1]      [,2]      [,3]      [,4]      [,5]       [,6]
## [1,] 1.00000000 0.9093729 0.6838614 0.4252832 0.2187119 0.09301449
## [2,] 0.90937293 1.0000000 0.9093729 0.6838614 0.4252832 0.21871189
## [3,] 0.68386141 0.9093729 1.0000000 0.9093729 0.6838614 0.42528319
## [4,] 0.42528319 0.6838614 0.9093729 1.0000000 0.9093729 0.68386141
## [5,] 0.21871189 0.4252832 0.6838614 0.9093729 1.0000000 0.90937293
## [6,] 0.09301449 0.2187119 0.4252832 0.6838614 0.9093729 1.00000000
```

Having added more points confirms our intuition that a Gaussian process is like a probability distribution over functions. It also seems that if we added more and more points, the lines would become smoother and smoother. And in fact, for the most common specifications of Gaussian processes this will be the case, i.e. the GP prior will imply a smooth function.

#### Covariance function

Drawing more points into the plots was easy for me because I had the mean and the covariance matrix given, but how exactly did I choose them? Randomly? I will give you the details below, but it should be clear that when we want to define a Gaussian process over an arbitrary (but finite) number of points, we need some systematic way of generating the covariance matrix and the vector of means. The former is usually denoted as $k(x, x')$ for any two (feature) vectors $x$ and $x'$ in the domain of the function. The latter is usually denoted as $m(x)$ and set to zero. With this one usually writes $y \sim GP ( m(x), k(x, x'))$. With $m$ set to zero, the entire shape or dynamics of the process are governed by the covariance matrix.
If you look back at the last plot, you might notice that the covariance matrix I used to generate points from the six-dimensional Gaussian seems to imply a particular pattern. Paths of the process that start above the horizontal line (with a positive value) tend to drift towards lower subsequent values, and the other way around for paths that start below the horizontal line. Like in the two-dimensional example that we started with, the larger covariance matrix implies positive autocorrelation between nearby points, while values far apart revert towards the zero mean. Hence, we see one way we can model our prior belief. If we had a formula that returns covariance matrices generating this pattern, we would be able to postulate a prior belief for an arbitrary (finite) dimension.

#### Squared exponential kernel

The formula I used to generate the $ij$th element of the covariance matrix of the process was

$k_{i,j}(x, x') = \text{exp}(-0.095 (x_i - x_j)^2)$

More generally, one may write this covariance function in terms of hyperparameters

$k_{i,j}(x, x') = \theta_1 + \theta_2 \, \text{exp}\left(-\frac{1}{2 \ell^2} (x_i - x_j)^2\right)$

Because $k$ is a function of the squared Euclidean distance between $x_i$ and $x_j$, it captures the idea of diminishing correlation between distant points. How fast the correlation decays with distance is governed by the hyperparameter $\ell$, which is called the length scale. For large $\ell$ the term inside the exponential will be very close to zero, so the kernel is almost constant over large parts of the domain. The hyperparameter $\theta_2$ scales the overall variances and covariances and $\theta_1$ allows for an offset.
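To make the role of the hyperparameters concrete, here is a small sketch of the kernel as an R function (the function name and default values are mine). With $\theta_1 = 0$, $\theta_2 = 1$ and $1/(2\ell^2) = 0.095$ it reproduces the covariance entries used earlier:

```r
# Sketch: squared exponential kernel with hyperparameters theta1 (offset),
# theta2 (scale) and l (length scale); names and defaults are mine.
se_kernel <- function(xi, xj, theta1 = 0, theta2 = 1, l = 1) {
  theta1 + theta2 * exp(-0.5 * (xi - xj)^2 / l^2)
}

# The length scale matching exp(-0.095 * d^2) satisfies 1 / (2 * l^2) = 0.095:
l <- sqrt(1 / (2 * 0.095))
se_kernel(1, 2, l = l)  # 0.9093729, the off-diagonal entry from above
```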

The squared exponential kernel is apparently the most common choice of covariance function in applied work, but it may still seem like a very ad hoc assumption about the covariance structure. It turns out, however, that the squared exponential kernel can be derived from a linear model of basis functions of $x$ (see section 3.1 here). In general, one is free to specify any function $k(x, x')$ that returns a positive definite matrix for all possible $x$ and $x'$. Discussing the wide array of possible kernels is certainly beyond the scope of this post, and I therefore happily refer any reader to the introductory text by David MacKay (see previous link) and the textbook by Rasmussen and Williams, who have an entire chapter on covariance functions and their properties.
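As a quick sanity check of the positive definiteness requirement, one can inspect the eigenvalues of the kernel matrix on the integer grid used above (a numerical check on one grid, not a proof):

```r
# Sketch: a valid covariance function must yield a positive definite
# matrix; for the squared exponential kernel on the grid 1:6, all
# eigenvalues should be strictly positive.
x <- 1:6
K_se <- outer(x, x, function(xi, xj) exp(-0.095 * (xi - xj)^2))
min(eigen(K_se, symmetric = TRUE, only.values = TRUE)$values)  # positive
```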

#### Up next

Now that I have a rough idea of what Gaussian process regression is and how it can be used to do non-linear regression, the question is how to make it operational. In terms of the Bayesian paradigm, we would like to learn which values of $\theta_1$, $\theta_2$ and $\ell$ are likely in light of the data. Then we can determine the mode of this posterior (MAP). Likewise, one may specify a likelihood function and use hill-climbing algorithms to find the ML estimates. It is not too hard to imagine that for real-world problems this can be delicate. Especially if we include more than one feature vector, the likelihood is often not unimodal, and all sorts of restrictions on the parameters need to be imposed to guarantee that the result is a covariance function that always returns positive definite matrices.
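To make the hill-climbing idea concrete, here is a minimal sketch entirely of my own construction (assumptions: zero mean, noise-free squared exponential kernel with $\theta_1 = 0$, $\theta_2 = 1$, simulated data) of estimating the length scale by maximizing the Gaussian log-likelihood with `optim`:

```r
# Sketch: negative log-likelihood of y ~ N(0, K(l)) as a function of log(l).
nll <- function(log_l, x, y) {
  l <- exp(log_l)  # optimize on the log scale to keep l positive
  K <- outer(x, x, function(xi, xj) exp(-0.5 * (xi - xj)^2 / l^2))
  K <- K + diag(1e-8, length(x))  # jitter for numerical stability
  0.5 * (drop(t(y) %*% solve(K, y)) +
         as.numeric(determinant(K)$modulus) + length(y) * log(2 * pi))
}

# Simulated data: one draw from a GP prior with true length scale 1.
x <- seq(0, 5, length.out = 30)
set.seed(1)
K_true <- outer(x, x, function(a, b) exp(-0.5 * (a - b)^2)) + diag(1e-8, 30)
y <- drop(t(chol(K_true)) %*% rnorm(30))

fit <- optim(par = 0, fn = nll, x = x, y = y, method = "BFGS")
exp(fit$par)  # ML estimate of the length scale
```

Reparametrizing to $\log \ell$ is one simple way to impose the positivity restriction mentioned above; with several hyperparameters and features, multimodality makes this optimization much more delicate.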

Fitting a GP to data will be the topic of the next post on Gaussian processes.