Gaussian processes for regression and classification

In my previous post on Gaussian process regressions, I described the intuition behind the function space view on GPs. In this post I want to continue by illustrating how to use Gaussian processes for regression and classification on a small example dataset. To fit the model to the data I will use Stan. In case you haven't heard about Stan or its R interface, it is definitely worth a look. I use Stan here instead of native R because it makes it really, really simple to operationalize GP regressions and requires, quite frankly, minimal programming effort. First, the manual contains example code that shows how to implement some basic models, and the GitHub repo features more example models. Second, in Stan I only have to specify the probability model; no further derivations of properties of the posterior are required. Even though there is a lot of tuning and differentiation going on, it's all under the hood. Third, Stan's standard posterior simulation algorithm (HMC, or its adaptive variant NUTS) is good at handling complex posterior geometries, i.e. non-convexities or multiple modes. This becomes particularly relevant when you want to go beyond the bivariate examples here and consider large feature spaces or more complex hierarchical structures. Fourth, we get full Bayesian inference, so we can, for example, do posterior predictive checking.

Kernels revisited

In the previous post about Gaussian processes it already became clear that the mean and covariance functions m(x) and k(x, x') of the Gaussian process are at the heart of building the model

y \sim GP(m(x), k(x, x')).

Nearly all texts that I found about GPs do not explicitly model the mean function m(x), but set it to zero. This need not be the best choice for every application at hand, though. If you have substantive prior knowledge that makes you confident in postulating some non-zero prior relationship between the mean of y and x, one should arguably include it in the prior. However, the subsection on linear kernels below will illustrate how a GP prior with m(x)=0 and a linear kernel can be shown to represent the prior belief that the data come from a linear model with a zero-mean normal prior on the parameters. Hence, the zero mean function induces shrinkage towards zero (to mitigate overfitting), just like in a ridge regression.

Squared exponential kernel

I mentioned the squared exponential (SE) kernel already in the last post; in the case of a univariate feature vector it amounts to setting the ij-th element of the covariance matrix to

k_{ij}(x, x') = \theta_1 + \theta_2 * \text{exp} \left(- \frac{1}{2 \ell^2} (x_i - x_j)^2 \right)

with (hyper-)parameters such that the resulting covariance matrix is always positive-definite. In the last post on GPs, I mentioned that one implication of such a prior is that (for a given length scale \ell) the values y_i and y_j are less correlated the further apart they are in terms of the distance between x_i and x_j. What do functions from this prior distribution look like? We can use Stan to simulate processes from the GP prior. The Stan manual provides examples of how to implement this, and I largely copy their code here:

stan_code <- '
// Sample from Gaussian Process
// Hyperparameters of the covariance functions are part of the data

data {
  int<lower=1> N;
  real x[N];
  real<lower=0> ell;
  real<lower=0> theta1;
  real<lower=0> theta2;
}
transformed data {
  matrix[N,N] L;
  vector[N] mu;
  cov_matrix[N] Sigma;
  mu <- rep_vector(0, N);

  for(i in 1:N)
    for(j in i:N){
      Sigma[i, j] <- theta1 + theta2 * exp(-(1/(2 * pow(ell,2))) * pow(x[i] - x[j], 2));
    }

  for(i in 1:(N-1))
    for(j in (i+1):N){
      Sigma[j, i] <- Sigma[i, j];
    }

  L <- cholesky_decompose(Sigma);
}
parameters {
  vector[N] z;
}
model {
  z ~ normal(0,1);
}
generated quantities {
  vector[N] y;
  y <- mu + L * z;
}'
library(rstan)

stan_data <- list(N=100,
x=1:100,
ell=1,
theta1=1,
theta2=1)

model <- stan(model_code = stan_code,
data=stan_data,
iter=0, chains=0, verbose =FALSE)

We can now run the compiled model from R via rstan for different length scales:

ells <- c(0.000095, 1.1, 3.1)
offset <- c(-10, 0, 10)

for (i in 1:length(ells)) {
sims <- stan(fit = model, data = list(N = 100, x = 1:100, ell = ells[i],
    theta1 = 1, theta2 = 5), iter = 50, chains = 1, seed = c(123), verbose = FALSE)
tmp <- extract(sims, pars = "y")
if (i == 1) {
plot(offset[i] + tmp$y[15, ], type = "l", col = i, lwd = 2, ylim = c(-15,
15), ylab = "y", xlab = "x", sub = "Lines offset by -10, 0 and 10 for illustration")
legend("topleft", legend = c(paste("l = ", ells)), lwd = rep(2, 3),
col = 1:3)
} else {
lines(offset[i] + tmp$y[15, ], type = "l", col = i, lwd = 2)
}
}

[Figure: simulated draws from the SE-kernel GP prior for the three length scales, offset for illustration (chunk simulate1)]

The plots confirm a simple intuition: the values of y become more and more correlated across different values of x as the length scale \ell increases, because the term inside the exponential is pulled towards zero. Another view on \ell that explains its name is that it scales the x-axis. Imagine two points x_i and x_j with Euclidean distance d. The term inside the exponential is - \frac{1}{2\ell^2} d^2. Hence, increasing \ell is just like decreasing the distance d, i.e. pulling points closer together, which can be understood as a compression of the x-axis. Yet another view on \ell is through the lens of variable or feature selection. In a model with more than one explanatory variable x_i, each of which has its own individual length scale \ell_i, we can do feature selection through the individual length scales. To see this, note that for \ell_i \rightarrow \infty the influence of variable i in the kernel vanishes and y is essentially constant in that variable.
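
To make the effect of the length scale concrete, here is a small R sketch (se_cov and se_cov_ard are illustrative helper functions I introduce here; they are not part of the Stan models in this post):

# SE covariance between two points at distance d, same parameterization as above
se_cov <- function(d, theta1 = 1, theta2 = 1, ell = 1) {
  theta1 + theta2 * exp(-d^2 / (2 * ell^2))
}
sapply(c(0.5, 1, 3, 10), function(ell) se_cov(d = 5, ell = ell))
# for a fixed distance d the covariance approaches theta1 + theta2 as ell grows,
# i.e. y_i and y_j become ever more strongly correlated

# ARD flavour with one length scale per feature: a huge ell_k effectively
# removes feature k from the kernel
se_cov_ard <- function(x, xprime, theta2 = 1, ell = c(1, 1)) {
  theta2 * exp(-0.5 * sum((x - xprime)^2 / ell^2))
}
se_cov_ard(c(1, 2), c(3, 7), ell = c(1, 1))    # both features enter the kernel
se_cov_ard(c(1, 2), c(3, 7), ell = c(1, 1e6))  # the second feature is effectively ignored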

Linear kernel

The linear kernel has the nice property that it can be derived from a linear model. Doing so is instructive, because it establishes an example of how kernels can be elicited based on features of the data. For this, suppose we model the relationship between x and y by a set of basis functions \phi(x); e.g. \phi(x) = [1, x, x^2, x^3, \ldots]' would be a set of basis functions of x that describes a polynomial regression. Hence,

y_i = \phi(x_i) ' \theta + \epsilon_i

is still a linear model, in the sense that it is linear in the parameters \theta. We may write this more compactly as

y = Z \theta + \epsilon

where Z = [\phi(x_1)', \ldots, \phi(x_N)']' is the matrix of regressors and \epsilon is a conformable vector of iid noise with mean zero and variance \sigma^2_e. If we impose a normal prior \theta \sim \text{Normal}(0, \sigma_b^2 I_n), where the mean is a vector of zeros and I_n denotes the n-dimensional identity matrix (n being the number of parameters), it turns out that y is also normal with zero mean

\mathbb{E}[y] = \mathbb{E}[Z \theta + \epsilon] = 0

and covariance

\mathbb{E}[y y'] = \mathbb{E}[(Z \theta + \epsilon) (Z \theta + \epsilon)'] = \sigma_b^2 Z Z' + \sigma^2_e I_N, i.e. y \sim \text{Normal}(0, \sigma_b^2 Z Z' + \sigma^2_e I_N).

This means that for any finite number of points [x_1, \ldots, x_N] the joint distribution of the elements of y is Gaussian: this was the defining property of a Gaussian process stated in the previous post. We conclude that

m(x_i) = 0

and

k(x, x') = \sigma^2_b Z Z' + \sigma_e^2 I_N.

The simplest special case is given by a single feature vector x_i and \phi(x_i) = x_i. In that case

K_{i,j} = \sigma^2_b x_i x_j + \sigma_e^2 \mathbf{1}_{\{i=j\}}.

Note that if we want to include an intercept in the linear regression, we can do this either by including a column of ones in Z or by setting

K_{i,j} = \sigma^2_b (x_i - c) (x_j - c) + \sigma_e^2 \mathbf{1}_{\{i=j\}}.

In the latter case, the parameter c gives the x-coordinate at which the prior function will be zero (plus noise if we include it – as we do here).
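
As a quick sanity check of this derivation, here is a small Monte Carlo sketch in plain R (the inputs xs and the parameter values are arbitrary illustrative choices): draw \theta and \epsilon from their priors, form y = Z \theta + \epsilon, and compare the empirical covariance of y with \sigma_b^2 Z Z' + \sigma_e^2 I_N.

# Monte Carlo check of the linear-kernel derivation
set.seed(1)
xs <- c(1, 2, 4)                      # three arbitrary inputs
Z  <- cbind(xs)                       # single basis function phi(x) = x
sigma_b <- 2; sigma_e <- 1
draws <- replicate(1e5, {
  theta <- rnorm(ncol(Z), 0, sigma_b)
  as.numeric(Z %*% theta + rnorm(nrow(Z), 0, sigma_e))
})
round(cov(t(draws)), 2)                                   # empirical covariance of y
round(sigma_b^2 * Z %*% t(Z) + sigma_e^2 * diag(3), 2)    # implied kernel matrix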

One can easily simulate from such a model to get an idea of how our prior looks:

stan_code <- '
// Sample from Gaussian Process
// Hyperparameters of the covariance functions are part of the data

data {
  int<lower=1> N;
  real x[N];
  real<lower=0> sigma_b;
  real<lower=0> sigma_e;
  real c;
}
transformed data {
  matrix[N,N] L;
  vector[N] mu;
  cov_matrix[N] Sigma;
  mu <- rep_vector(0, N);

  for(i in 1:N)
    for(j in i:N){
      Sigma[i, j] <- sigma_b * (x[i] - c) * (x[j] - c) + if_else(i==j, sigma_e, 0.0);
    }

  for(i in 1:(N-1))
    for(j in (i+1):N){
      Sigma[j, i] <- Sigma[i, j];
    }

  L <- cholesky_decompose(Sigma);
}
parameters {
  vector[N] z;
}
model {
  z ~ normal(0,1);
}
generated quantities {
  vector[N] y;
  y <- mu + L * z;
}'
library(rstan)

stan_data <- list(N=100,
x=1:100,
sigma_e=1,
sigma_b=1,
c=16)

model <- stan(model_code = stan_code,
data=stan_data,
iter=0, chains=0, verbose =FALSE)
sigmas <- c(0.02, 0.15, 0.5)

for (i in 1:length(sigmas)) {
sims <- stan(fit = model, data = list(N = 100, x = 1:100, sigma_b = sigmas[i],
sigma_e = 1, c = 16), iter = 50, chains = 1, seed = c(123), verbose = FALSE)
tmp <- extract(sims, pars = "y")
if (i == 1) {
plot(tmp$y[15, ], type = "l", col = i, lwd = 2, ylim = c(-50, 50), ylab = "y",
xlab = "x")
legend("topleft", legend = c(paste("sigma_b = ", sigmas)), lwd = rep(2,
3), col = 1:3)
} else {
lines(tmp$y[15, ], type = "l", col = i, lwd = 2)
}
}

[Figure: simulated draws from the linear-kernel GP prior for three values of sigma_b (chunk gp-in-action-simulate2)]

The plots show that our prior functional form is indeed linear, with gradual fluctuations around this linear trend. The amplitude of these fluctuations is governed by the error variance: the higher it is, the more erratic the prior. Likewise, the smaller the prior variance of the coefficients, the more likely we think the linear trend is actually just flat. This shows that even with the mean function set to zero, we can still model trending data solely in terms of the covariance function. The sampled lines cross the x-axis around the value c=16, as described above. Since we add noise to the observations, the sampled paths are not exactly zero at this point. However, apart from the noise there is no uncertainty at this point, i.e. the prior variance is minimal there and gradually increases as we move away from it.
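
This last point can be read directly off the diagonal of the kernel matrix; here is a tiny sketch (values mirror the simulation above, with sigma_b and sigma_e entering as variances, as in the Stan code):

# Prior variance at each input x is the diagonal of K: sigma_b * (x - c)^2 + sigma_e,
# which is smallest exactly at x = c and grows quadratically away from it
xg <- 1:100; c0 <- 16; sigma_b <- 0.15; sigma_e <- 1
prior_var <- sigma_b * (xg - c0)^2 + sigma_e
xg[which.min(prior_var)]   # 16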

Kernels, kernels, kernels…

There are many, many more kernels that have been proposed in different places. I found this little website by David Duvenaud quite helpful (make sure you also check out the chapter of his thesis that is linked on his page) to get (a) an overview of several prominent kernels and (b) an idea of how to combine them to yield ever more flexible structures.
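
As a small illustration of such combinations (a sketch; lin_k, se_k and xg are illustrative helpers, not code from Duvenaud's page), sums and products of valid kernels are again valid kernels, so we can simply add a linear and an SE kernel and check that the resulting matrix is positive definite; this is exactly the structure of the linear plus SE model used further below.

# Combine a linear and an SE kernel by adding them
lin_k <- function(x1, x2, sigma_b = 1, c0 = 0) sigma_b * (x1 - c0) * (x2 - c0)
se_k  <- function(x1, x2, theta2 = 1, ell = 1) theta2 * exp(-(x1 - x2)^2 / (2 * ell^2))

xg <- seq(0, 10, by = 0.5)
K <- outer(xg, xg, lin_k) + outer(xg, xg, se_k) + diag(1e-6, length(xg))  # small jitter
all(eigen(K, symmetric = TRUE, only.values = TRUE)$values > 0)            # TRUE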

Univariate Gaussian process regression: icecream

In order to illustrate how to operationalize GP regressions, I decided to pick up on a little demonstration on Markus Gesmann's blog again. Hence, I wrap my discussion around the icecream dataset that Markus used to illustrate some general features of generalized linear models.

icecream <- data.frame(temp = c(11.9, 14.2, 15.2, 16.4, 17.2, 18.1, 18.5, 19.4,
22.1, 22.6, 23.4, 25.1), units = c(185L, 215L, 332L, 325L, 408L, 421L, 406L,
412L, 522L, 445L, 544L, 614L))
basicPlot <- function(...) {
plot(units ~ temp, data = icecream, bty = "n", lwd = 2, main = "Number of ice creams sold",
col = "#00526D", xlab = "Temperature (Celsius)", ylab = "Units sold",
...)
axis(side = 1, col = "grey")
axis(side = 2, col = "grey")
}
basicPlot()

[Figure: scatter plot of ice cream units sold against temperature (chunk gp-in-action-dataPlot)]

Markus has a very nice demonstration of how to gradually build a model that fits the data and its properties. The simple scatter plot already shows that if we want to model the data by means of a GP regression, there is little hope a plain SE kernel will do: instead of gradual up- and down-swings, the data exhibit a pretty obvious trend. I therefore begin with a linear kernel.

LIN_kernel <- '
data {
  int<lower=1> N1;              // number of observed points
  int<lower=1> N2;              // number of points at which to predict
  vector[N1 + N2] x;            // observed and new inputs, stacked
  int<lower=0> y1[N1];          // observed outcomes
}
transformed data {
  int<lower=1> N;               // Total length of the feature vector
  vector[N1 + N2] mu;           // Prior mean of the process

  N <- N1 + N2;
  mu <- rep_vector(0, N);
}
parameters {
  vector[N2] y2;                // function values at the new inputs
  real<lower=0> beta1;
  real<lower=0> beta2;
  real<lower=0> beta3;
}
model {
  matrix[N, N] Sigma;
  vector[N] y;

  // stack observed and unobserved outcomes into one vector
  for (n in 1:N1) y[n] <- y1[n];
  for (n in 1:N2) y[N1 + n] <- y2[n];

  // linear kernel with offset beta2 and noise variance beta3 on the diagonal
  for(i in 1:N)
    for(j in i:N){
      Sigma[i, j] <- beta1 * (x[i] - beta2) * (x[j] - beta2) + if_else(i==j, beta3, 0.0);
    }

  for(i in 1:(N-1))
    for(j in (i+1):N){
      Sigma[j, i] <- Sigma[i, j];
    }

  beta1 ~ cauchy(0,5);
  beta2 ~ cauchy(0,5);
  beta3 ~ cauchy(0,5);

  y ~ multi_normal(mu, Sigma);
}'

x1 <- icecream$temp
N1 <- nrow(icecream)
idx1 <- (1):(N1)
x2 <- seq(0, 40, length.out = 40)
N2 <- length(x2)
idx2 <- (N1+1):(N1+N2)
x <- union(x1, x2)
library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores()-1)

stan_data <- list(N1 = N1,
N2 = N2,
x = x,
y1 = icecream$units
)

ice.fit <- stan(model_code = LIN_kernel,
data=stan_data,
iter=0, chains=0)

I should mention that I went with the default priors used in some of the example models from the manual. The motivation for using the Cauchy prior is that in many settings a Cauchy with location zero and scale five is weakly informative, i.e. the implied prior effect sizes encompass any reasonable value with large probability, while values we would consider unreasonable still receive enough mass not to be ruled out entirely. But I didn't really do any sanity check here, so to build a serious model we would of course have to go through some more elaborate prior elicitation.
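
To get a feel for how weakly informative this prior is, here is a quick sketch looking at a few quantiles of a half-Cauchy(0, 5), which is what the Cauchy prior combined with the <lower=0> declarations amounts to:

# Quantiles of a half-Cauchy(0, 5): half of the mass lies below 5,
# but the fat tails still allow very large hyperparameter values
qcauchy(0.5 + 0.5 * c(0.5, 0.9, 0.99), location = 0, scale = 5)
# approximately 5, 31.6 and 318.3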

ice.res <- stan(fit = ice.fit, data = stan_data, iter = 800, chains = 2, seed = 123)

Note that I have only run 800 iterations on two chains. If you are familiar with other posterior simulation algorithms, e.g. Metropolis-Hastings, you might chuckle a bit, because these algorithms often require north of several thousand iterations to achieve convergence even for simple models. Plus, they are marred by large amounts of autocorrelation between samples from the posterior, which in turn makes a large number of iterations necessary. With Stan, 800 iterations are really enough for this simple model.

We can confirm this by looking at the summary statistics for the hyperparameters:

paras = c('beta1', 'beta2', 'beta3')
tmp.sum <- summary(ice.res, pars = paras)
tmp <- tmp.sum$summary[, "Rhat"]
tmp

## beta1 beta2 beta3
## 1.004882 1.042516 1.035404

The potential scale reduction factor Rhat indicates convergence for all parameters. Let us look at the autocorrelation:

test <- as.matrix(ice.res, pars = paras)
colnames(test) <- paras
par(mfcol = c(1, 3))
sapply(1:3, function(i) acf(test[, i], main = colnames(test)[i]))

[Figure: autocorrelation functions of the posterior draws for beta1, beta2 and beta3 (chunk gp-in-action-acf)]

Pretty good, and that with just a few hundred iterations!
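
As an additional check (a quick sketch using the n_eff column of rstan's summary output), we can look at the effective sample sizes, which summarize how much the remaining autocorrelation reduces the information in the chains:

# Effective sample size per hyperparameter
summary(ice.res, pars = paras)$summary[, "n_eff"]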

Hence, we can go on and plot the fit of the model together with out-of-sample predictions and credible bands (in case you are re-running this on your computer, you certainly do not want to follow me here and explore the models using static graphs, but use ShinyStan instead). One nice thing about doing Bayesian inference here is that the uncertainty bands surrounding our estimates (i) are actual posterior intervals (which many people wrongly equate with confidence intervals), (ii) do not have a rigid theoretical form that hinges on asymptotic approximations, and (iii) price in the estimation uncertainty surrounding the unknowns in our model, e.g. the hyperparameters in the kernel.

library(dplyr)
ys  <- extract(ice.res, pars = "y2")$y2      # posterior draws for the generated points
ym  <- apply(ys, 2, median)
ysd <- apply(ys, 2, sd)

# pointwise posterior bands at the 2.5%, 16%, 84% and 97.5% quantiles
all.bands <- apply(ys, 2, quantile, probs = c(0.025, 0.16, 0.84, 0.975))

plot.data <- data.frame(temp = x[idx2], sold = ym, lowest = all.bands[1, ],
    lower = all.bands[2, ], higher = all.bands[3, ], highest = all.bands[4, ])
daten <- plot.data %>% tbl_df %>% arrange(temp)

plot(daten$temp, daten$sold, type = "l", main = "Number of ice creams sold",
col = "#00526D", xlab = "Temperature (Celsius)", ylab = "Units sold", sub = "Dashed lines show 68% and 95% posterior interval, solid line depicts posterior median",
ylim = c(0, 900))
abline(h = 0, v = 0)
points(icecream$temp, icecream$units)
lines(daten$temp, daten$lowest, lty = 2, col = "grey")
lines(daten$temp, daten$highest, lty = 2, col = "grey")
lines(daten$temp, daten$lower, lty = 2)
lines(daten$temp, daten$higher, lty = 2)

[Figure: posterior median and 68%/95% posterior bands for the linear-kernel model, together with the observed data (chunk gp-in-action-linear-model)]

Note that the plot not only shows predictions to the left and the right of our data points, i.e. for the regions below 12 and above 26 degrees, but also between points. If you look at the code above, you'll see that I have squeezed some x-values between the observations for the algorithm to generate. Any point that is not directly observed can be (and in fact is) treated like a parameter we can estimate. So you see that estimation uncertainty remains low between two points, and that this is not just because we have drawn a straight line between the ends of the pointwise error bands – it really is fairly low. You can also see that the uncertainty increases slightly to the left and the right of our observations, which seems reasonable.
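
To see this more directly, here is a small sketch (building on the daten object constructed above) that plots the width of the pointwise 95% band against temperature:

# Width of the pointwise 95% band: narrow within the range of observed
# temperatures, widening as we extrapolate towards 0 and 40 degrees
band_width <- daten$highest - daten$lowest
plot(daten$temp, band_width, type = "l",
     xlab = "Temperature (Celsius)", ylab = "Width of 95% band")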

Just to illustrate how it affects our inference, let me add an SE kernel to the linear kernel:

LIN_SE_kernel <- '
data {
  int<lower=1> N1;
  int<lower=1> N2;
  vector[N1 + N2] x;
  int<lower=0> y1[N1];
}
transformed data {
  int<lower=1> N;               // Total length of the feature vector
  vector[N1 + N2] mu;           // Prior mean of the process

  N <- N1 + N2;
  mu <- rep_vector(0, N);
}
parameters {
  vector[N2] y2;
  real<lower=0> beta1;          // SE kernel: scale
  real<lower=0> beta2;          // SE kernel: inverse squared length scale
  real<lower=0> beta3;          // linear kernel: scale
  real<lower=0> beta4;          // linear kernel: offset
  real<lower=0> beta5;          // noise variance on the diagonal
}
model {
  matrix[N, N] Sigma;
  vector[N] y;

  for (n in 1:N1) y[n] <- y1[n];
  for (n in 1:N2) y[N1 + n] <- y2[n];

  // sum of an SE kernel and a linear kernel, plus noise on the diagonal
  for(i in 1:N)
    for(j in i:N){
      Sigma[i, j] <- beta1 * exp(-beta2 * pow(x[i] - x[j], 2)) + beta3 * (x[i] - beta4) * (x[j] - beta4) + if_else(i==j, beta5, 0.0);
    }

  for(i in 1:(N-1))
    for(j in (i+1):N){
      Sigma[j, i] <- Sigma[i, j];
    }

  beta1 ~ cauchy(0,5);
  beta2 ~ cauchy(0,5);
  beta3 ~ cauchy(0,5);
  beta4 ~ cauchy(0,5);
  beta5 ~ cauchy(0,5);

  y ~ multi_normal(mu, Sigma);
}'

x1 <- icecream$temp
N1 <- nrow(icecream)
idx1 <- (1):(N1)
x2 <- seq(0, 40, length.out = 40)
N2 <- length(x2)
idx2 <- (N1+1):(N1+N2)
x <- union(x1, x2)

stan_data <- list(N1 = N1,
N2 = N2,
x = x,
y1 = icecream$units
)

ice.fit <- stan(model_code = LIN_SE_kernel,
data=stan_data,
iter=0, chains=0)
ice.res <- stan(fit = ice.fit, data = stan_data, iter = 800, chains = 2, seed = 123)
paras = c("beta1", "beta2", "beta3", "beta4", "beta5")
tmp.sum <- summary(ice.res, pars = paras)
tmp <- tmp.sum$summary[, "Rhat"]            # convergence check as before
tmp

# rebuild the plotting data for the new fit; prepare.plot() is defined in the appendix
daten <- prepare.plot(ice.res) %>% tbl_df %>% arrange(temp)

plot(daten$temp, daten$sold, type = "l", main = "Number of ice creams sold",
col = "#00526D", xlab = "Temperature (Celsius)", ylab = "Units sold", sub = "Dashed lines show 68% and 95% posterior interval, solid line depicts posterior median",
ylim = c(0, 800))
points(icecream$temp, icecream$units)
lines(daten$temp, daten$lowest, lty = 2, col = "grey")
lines(daten$temp, daten$highest, lty = 2, col = "grey")
lines(daten$temp, daten$lower, lty = 2)
lines(daten$temp, daten$higher, lty = 2)

[Figure: posterior median and 68%/95% posterior bands for the linear + SE kernel model, together with the observed data (chunk gp-in-action-BIN-model)]

I must say I am quite amazed by the plot. The extent to which the replicated data matches the actual observations is striking. Additionally, the way we modelled the data based on Markus' initial suggestions delivers very plausible predictions overall. For example, the model predicts that with 95% posterior probability sales at zero degrees lie between 21 and 87 units, with a median of 42 units.
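
These numbers can be read straight off the daten object; a quick sketch (0 is one of the generated x-values, so we can simply pick it out):

# Posterior predictive summary at 0 degrees Celsius; lowest/highest are the
# 2.5% and 97.5% quantiles, sold is the posterior median
subset(daten, temp == 0)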

The overall shape of our posterior predictions closely matches that of Markus' binomial model, but you can see that our model is a bit more flexible in capturing the wiggly structure of the data in the middle.

Conclusion

I am happy to have taken the chance to learn about Gaussian processes. They are an interesting tool that I have added to my bag of tricks. Together with Stan, these models are very flexible and fairly easy to implement and test on a dataset. So you may expect more examples in the future.

Appendix

The helper function used to assemble the plotting data (the daten object) from a fitted model:

prepare.plot <- function(fit) {
    require(dplyr)
    ys  <- extract(fit, pars = "y2")$y2      # posterior draws for the generated points
    ym  <- apply(ys, 2, median)
    ysd <- apply(ys, 2, sd)

    # pointwise posterior bands at the 2.5%, 16%, 84% and 97.5% quantiles
    all.bands <- apply(ys, 2, quantile, probs = c(0.025, 0.16, 0.84, 0.975))

    plot.data <- data.frame(temp = x[idx2], sold = ym, lowest = all.bands[1, ],
        lower = all.bands[2, ], higher = all.bands[3, ], highest = all.bands[4, ])
    daten <- plot.data %>% tbl_df %>% arrange(temp)
    return(daten)
}

sessionInfo()
## R version 3.2.1 (2015-06-18)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
##
## locale:
## [1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252
## [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
## [5] LC_TIME=German_Germany.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] dplyr_0.4.2 rstan_2.7.0-1 inline_0.3.14 Rcpp_0.12.0
## [5] knitr_1.10.5 RWordPress_0.2-3
##
## loaded via a namespace (and not attached):
## [1] codetools_0.2-11 XML_3.98-1.3 assertthat_0.1 bitops_1.0-6
## [5] R6_2.1.0 DBI_0.3.1 stats4_3.2.1 formatR_1.2
## [9] magrittr_1.5 evaluate_0.7 stringi_0.5-5 lazyeval_0.1.10
## [13] tools_3.2.1 stringr_1.0.0 RCurl_1.95-4.7 parallel_3.2.1
## [17] XMLRPC_0.3-0

Inauguration post

I will start this blog by writing about how I started this blog. One of the first things I wanted to make sure of was being able to blog from R, or rather RStudio, and not having to use the WordPress.com editor. Using RStudio this is actually pretty simple, so I can take it easy with my first post and reblog what others have invented to make it happen.

I write my blog entries in RStudio using Rmarkdown and then push them to the WordPress servers. And the way I implemented this is much along the lines of what is described here.

Install R package and link to WordPress

Hence, first install the RWordPress package

if (!require(RWordPress)) {
    install.packages("RWordPress", repos = "http://www.omegahat.org/R")
}

and link it to your account (insert your username for ‘yourWPusername’ and your password for ‘yourWPpassword’)

options(WordPressLogin = c(yourWPusername = "yourWPpassword"), WordPressURL = "https://yourWPusername.wordpress.com/xmlrpc.php")

You can do a quick check whether your credentials are working by calling

getUsersBlogs()

So far so good.

Handling graphs and figures

When you blog about R, I guess it's reasonable to expect to produce many graphs to show on your blog. I handle graphs following Yihui Xie's suggestion and save them in my Dropbox. I created a separate folder called wp in my Public folder that serves as a bucket for all the figures I create. Then I tell knitr to save any figure it creates to this folder.

opts_knit$set(base.url = 'https://dl.dropboxusercontent.com/u/some_number/wp/',
              base.dir = 'C:/<path to dropbox>/Dropbox/Public/wp/')

The first line includes the URL of the wp folder which you can find by putting a little file in it and right-clicking on it to copy the link. Then insert the copied link and the path to your newly created folder into the snippet above.

knit2html -> WordPress.com format

Suppose you have your blog post saved in a file post.Rmd in the Rmarkdown .Rmd format. Using the function knit2html you can turn it into an HTML file (in RStudio simply press the ‘Knit HTML’ button at the top). To push the content of the HTML file to WordPress one has to extract the body of the HTML file, otherwise WordPress.com won’t accept it (at least when you are hosting your blog with WordPress.com – which I do).

On the blog I linked above, the author provides a function to sanitize the HTML file and return a string that you can pass on to WordPress using the RWordPress function newPost. But for me it didn’t work, because it inserted crazy special characters here and there. After googling a bit, I found an alternative solution (I forgot where, feel free to point me to the original and I will happily link to it):

knit2wp <- function(file) {
    require(XML)
    content <- readLines(file)
    content <- htmlTreeParse(content, trim=FALSE)

    ## WP will add the h1 header later based on the title, so delete here
    content$children$html$children$body$children$h1 <- NULL
    content <- paste(capture.output(print(content$children$html$children$body,
                                          indent=FALSE, tagSeparator="")),
                     collapse="\n")
    content <- gsub("<?.body>", "", content)         # remove body tag

    ## enclose code snippets in SyntaxHighlighter format
    ## (the exact [code] shortcode attributes may need adjusting for your setup)
    content <- gsub("<?pre><code class=\"r\">", "\\[code language=\"r\"\\]\\\n",
                    content)
    content <- gsub("<?pre><code class=\"no-highlight\">", "\\[code\\]\\\n",
                    content)
    content <- gsub("<?pre><code>", "\\[code\\]\\\n", content)
    content <- gsub("<?/code></pre>", "\\[/code\\]\\\n", content)
    return(content)
}

Workflow

The way it now works for me is as follows. Suppose I have my post written in Rmarkdown and saved in the file test.Rmd. Set the current working directory to the location of this file and then run the following script:

library(RWordPress)
options(WordPressLogin = c(yourWPusername = "yourWPpassword"), WordPressURL = "https://yourWPusername.wordpress.com/xmlrpc.php")
library(knitr)
opts_knit$set(base.url = 'https://dl.dropboxusercontent.com/u/some_number/wp/',
              base.dir = 'C:/<path to dropbox>/Dropbox/Public/wp/')

knit2wp <- function(file) {
    require(XML)
    content <- readLines(file)
    content <- htmlTreeParse(content, trim=FALSE)

    ## WP will add the h1 header later based on the title, so delete here
    content$children$html$children$body$children$h1 <- NULL
    content <- paste(capture.output(print(content$children$html$children$body,
                                          indent=FALSE, tagSeparator="")),
                     collapse="\n")
    content <- gsub("<?.body>", "", content)         # remove body tag

    ## enclose code snippets in SyntaxHighlighter format
    ## (the exact [code] shortcode attributes may need adjusting for your setup)
    content <- gsub("<?pre><code class=\"r\">", "\\[code language=\"r\"\\]\\\n",
                    content)
    content <- gsub("<?pre><code class=\"no-highlight\">", "\\[code\\]\\\n",
                    content)
    content <- gsub("<?pre><code>", "\\[code\\]\\\n", content)
    content <- gsub("<?/code></pre>", "\\[/code\\]\\\n", content)
    return(content)
}

knit2html("test.Rmd")

newPost(
    list(
        description=knit2wp('test.html'),
        title='Test',
        categories=c('Programming', 'R'),
        mt_keywords=c('rstats', 'blogging', 'XML-RPC', 'R', 'knitr', 'markdown')
    ),
    publish=FALSE)

Et voilà!