Allen B. Riddell

Epistemology of the corral: regression and variable selection with Stan and the Horseshoe prior

Mon 10 March 2014

Last summer I wrote PyStan, a Python interface for Stan. For those familiar with the world of Bayesian statistics, Stan improves on JAGS and WinBUGS. This post demonstrates how to use Stan to fit a linear regression model which includes the Horseshoe prior.

Update (2015-2-18): Corrected Stan model code per discussion on Andrew Gelman's blog. The blog post itself is worth reading as it contains a link to a paper describing a "Horseshoe+ prior", which bets even more emphatically on sparsity.

When building a predictive model, it is not unusual to have an abundance of explanatory variables. For instance, we might be trying to predict how people will vote in an upcoming election using hundreds of variables describing each voting district. These variables might include median income, probable temperature on voting day, incumbency advantage, candidate height, number of certified organic groceries in the district, and so on. Or we might be trying to predict whether or not an e-mail is spam using the presence or absence of the 10,000 most frequent words in the English language. In all likelihood, some of these variables matter quite a bit, while others are unlikely to be important.

Betting that only a subset of the explanatory variables are useful for prediction is a bet on sparsity. A popular model making this bet is the Lasso or, less handily, L1-regularized regression. A Bayesian competitor to the Lasso makes use of the "Horseshoe prior" (which I'll call "the Horseshoe" for symmetry). This prior captures the belief that regression coefficients are rather likely to be zero (the bet on sparsity). The following shows how to use the Horseshoe in Stan.
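For reference, the Lasso's bet takes the form of an \(\ell_1\) penalty on the coefficients. In scikit-learn's parameterization (the one used below), with \(n\) training samples and penalty weight \(\alpha\), the Lasso estimate is

$$\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \left\{ \frac{1}{2n} \lVert y - X\beta \rVert_2^2 + \alpha \lVert \beta \rVert_1 \right\}$$

whereas the Horseshoe expresses the same bet probabilistically, through a prior on \(\beta\) (spelled out in footnote 2).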

Data

We'll use the diabetes dataset (found in the LARS package on CRAN) and split it into training and test sets so we can evaluate how well each model performs. For those unfamiliar with the dataset, here are the details:

  • 442 diabetes patients
  • 10 main variables, \(x_1, ..., x_{10}\): age, sex, body mass index, average blood pressure (map), and six blood serum measurements (tc, ldl, hdl, tch, ltg, glu)
  • 45 interactions of the form \(x_j x_k\)
  • 9 quadratic effects of the form \(x_j^2\) (sex is binary, so \(x_2^2 = x_2\))
  • measure of disease progression taken one year later, \(y\)

We have a total of \(p = 64\) variables that we might use to predict \(y\). It seems plausible that a subset of the variables will reliably predict \(y\).

We'll evaluate each model by considering how well it generalizes to unseen ("out-of-sample") data. In order to do this, we need to hold out a portion of the data, so we randomly split the data associated with the 442 diabetes patients into 342 training samples and 100 test samples.1

Before splitting the diabetes dataset into training and test sets, all variables were standardized: I subtracted the mean and divided by the standard deviation. The data can be loaded in Python with the following lines of code:

import numpy as np

# training data (342 patients)
y = np.loadtxt("y-train.dat")
X = np.loadtxt("X-train.dat")

# held-out test data (100 patients)
y_test = np.loadtxt("y-test.dat")
X_test = np.loadtxt("X-test.dat")
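The .dat files above already contain the standardized, pre-split data. Starting from the raw 442-patient dataset, the standardization and split might look roughly like the following sketch (using numpy as imported above; the names X_full and y_full for the raw arrays and the particular seed are my own assumptions, not taken from the original setup):

rng = np.random.RandomState(0)  # a fixed seed, for reproducibility

# standardize each variable: subtract the mean, divide by the standard deviation
X_std = (X_full - X_full.mean(axis=0)) / X_full.std(axis=0)
y_std = (y_full - y_full.mean()) / y_full.std()

# randomly assign 342 patients to training and the remaining 100 to testing
idx = rng.permutation(len(y_std))
X, y = X_std[idx[:342]], y_std[idx[:342]]
X_test, y_test = X_std[idx[342:]], y_std[idx[342:]]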

Models

We will compare the performance of three models: ordinary least squares regression (OLS), the Lasso, and the Horseshoe.

Ordinary least squares

As a baseline, we'll calculate the ordinary least squares estimate of the 64 coefficients \(\beta\). Using the OLS estimate, our out-of-sample predictions are better than simply guessing the mean of \(y\) in the training set.

# OLS prediction; mean squared error: 0.67
ypred = np.dot(X_test, np.linalg.lstsq(X, y)[0])
print(np.sum((y_test - ypred)**2) / len(y_test))

# Guess the mean of y; mean squared error: 0.97
print(np.sum((y_test - np.mean(y))**2) / len(y_test))

Lasso

Fitting the Lasso is easy. scikit-learn provides an implementation that is fast and well-tested. We see that using the Lasso yields a better (lower) out-of-sample error than OLS.

from sklearn import linear_model
clf = linear_model.Lasso(alpha=0.1)  # alpha sets the strength of the L1 penalty
clf.fit(X, y)
ypred = clf.predict(X_test)

# Lasso, mean squared error: 0.49
print(np.sum((y_test - ypred)**2) / len(y_test))
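The penalty weight alpha = 0.1 above is fixed by hand. In practice one would usually choose it by cross-validation on the training set; here is a minimal sketch using scikit-learn's LassoCV (not part of the original analysis):

from sklearn import linear_model

# choose the penalty weight by 10-fold cross-validation on the training data
clf_cv = linear_model.LassoCV(cv=10)
clf_cv.fit(X, y)
print(clf_cv.alpha_)  # the selected penalty weight

Whatever alpha is selected, the fitted LassoCV object can then be used for prediction exactly as the Lasso object above.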

Horseshoe

I have followed Carvalho et al. (2009) and used the latent-variable formulation of the Horseshoe prior (p. 74). It has a direct translation into Stan model code.2

Since we are working in Python, we will use PyStan.

import pystan

model_code = """
data {
  int<lower=0> n;             // number of observations
  int<lower=0> p;             // number of predictors
  matrix[n,p] X;
  vector[n] y;
}
parameters {
  vector[p] beta;
  vector<lower=0>[p] lambda;  // local shrinkage parameters
  real<lower=0> tau;          // global shrinkage parameter
  real<lower=0> sigma;        // noise scale
}
model {
  // the <lower=0> constraints make these half-Cauchy priors
  lambda ~ cauchy(0, 1);
  tau ~ cauchy(0, 1);
  for (i in 1:p)
    beta[i] ~ normal(0, lambda[i] * tau);
  y ~ normal(X * beta, sigma);
}
"""

n, p = X.shape
data = dict(n=n, p=p, X=X, y=y)
fit = pystan.stan(model_code=model_code, data=data, seed=5)
# print(fit) shows posterior summaries and convergence diagnostics
beta = np.mean(fit.extract()['beta'], axis=0)  # posterior mean of each coefficient
ypred = np.dot(X_test, beta)

With an estimate of the regression coefficients \(\beta\) in hand, we can calculate the mean squared error:

# Horseshoe, mean squared error: 0.46
print(np.sum((y_test - ypred)**2) / len(y_test))
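Unlike the Lasso, the Horseshoe fit returns full posterior draws for each coefficient, so we can quantify uncertainty rather than relying on the posterior mean alone. For example, using the fit object above:

# posterior draws of beta: one row per draw, one column per coefficient
beta_draws = fit.extract()['beta']

# 95% posterior intervals for each of the 64 coefficients
lower, upper = np.percentile(beta_draws, [2.5, 97.5], axis=0)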

Comparing Lasso and the Horseshoe

Let's consider this bet on sparsity in more detail. Both the Lasso and the Horseshoe yielded coefficient estimates in which most of the coefficients were zero or nearly zero. Were the non-zero coefficients the same?

To answer that question, we will consider just those coefficient values that are greater than 0.01 in absolute value.3 Let's collect these variables for the Horseshoe and the Lasso respectively:

# Variable names
varnames = np.array("age sex bmi map tc ldl hdl tch ltg glu age^2 bmi^2 map^2 tc^2 ldl^2 hdl^2 tch^2 ltg^2 glu^2 age:sex age:bmi age:map age:tc age:ldl age:hdl age:tch age:ltg age:glu sex:bmi sex:map sex:tc sex:ldl sex:hdl sex:tch sex:ltg sex:glu bmi:map bmi:tc bmi:ldl bmi:hdl bmi:tch bmi:ltg bmi:glu map:tc map:ldl map:hdl map:tch map:ltg map:glu tc:ldl tc:hdl tc:tch tc:ltg tc:glu ldl:hdl ldl:tch ldl:ltg ldl:glu hdl:tch hdl:ltg hdl:glu tch:ltg tch:glu ltg:glu".split())

# Horseshoe
hs_vars = varnames[np.flatnonzero(np.abs(beta) > 0.01)]
# ['sex', 'bmi', 'map', 'tc', 'hdl', 'tch', 'ltg', 'glu', 'bmi^2', 'tch^2', 'glu^2',
#  'age:sex', 'age:map', 'age:ldl', 'age:tch', 'age:ltg', 'sex:bmi', 'sex:map',
#  'bmi:glu', 'map:ldl', 'tc:ltg', 'hdl:ltg', 'tch:ltg']

# Lasso
lasso_vars = varnames[np.flatnonzero(np.abs(clf.coef_) > 0.01)]
# ['bmi', 'map', 'hdl', 'ltg', 'bmi^2', 'age:sex', 'bmi:map']

# Coefficients in one but not the other
set(hs_vars).symmetric_difference(set(lasso_vars))
# {'age:ldl', 'age:ltg', 'age:map', 'age:tch', 'bmi:glu', 'bmi:map', 'glu',
#  'glu^2', 'hdl:ltg', 'map:ldl', 'sex', 'sex:bmi', 'sex:map', 'tc', 'tc:ltg',
#  'tch', 'tch:ltg', 'tch^2'}

While similar, these are distinct sets of variables. One obvious difference is that the Horseshoe gives weight to tc (serum total cholesterol) and the Lasso estimate does not.

Figure 1 compares the Lasso and the Horseshoe parameter estimates for the variables above.

Figure 1: Lasso and Horseshoe parameter estimates
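A comparison plot along these lines can be drawn with matplotlib; the following is a sketch, not the code used to make Figure 1:

import matplotlib.pyplot as plt

# variables that exceed the 0.01 threshold under either model
selected = np.flatnonzero((np.abs(beta) > 0.01) | (np.abs(clf.coef_) > 0.01))
pos = np.arange(len(selected))

plt.figure(figsize=(10, 4))
plt.bar(pos - 0.2, beta[selected], width=0.4, label="Horseshoe")
plt.bar(pos + 0.2, clf.coef_[selected], width=0.4, label="Lasso")
plt.xticks(pos, varnames[selected], rotation=90)
plt.ylabel("coefficient estimate")
plt.legend()
plt.tight_layout()
plt.show()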

Epistemology of the corral

It is worth highlighting that while these two sparsity-inducing approaches to linear regression both beat OLS, they lead us to believe there are different sets of "relevant" variables (i.e., variables whose coefficients are non-zero). It would seem that there are different ways to bet on sparsity.

What interpretation should we give each variable selection procedure and the estimated coefficient values for these two different models? One might argue that in practice this isn't a pressing concern; those using the Lasso or the Horseshoe will be happy to have a reliable alternative to the OLS estimates. On the other hand, it is easy to imagine contexts where we want to understand the relationship among variables.4 And in the back of our minds ought to be the concern that all these variables are related in some way to the outcome—that is, there is no such thing as a coefficient of zero.

References

Carvalho, Carlos M., Nicholas G. Polson, and James G. Scott. “Handling Sparsity via the Horseshoe.” Journal of Machine Learning Research W&CP 5 (2009): 73–80.

Gelman, Andrew, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. Bayesian Data Analysis. 3rd ed. Boca Raton: Chapman and Hall/CRC, 2013.

Hoff, Peter D. A First Course in Bayesian Statistical Methods. New York: Springer, 2009.

Tibshirani, Robert. “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society, Series B 58, no. 1 (1996): 267–288.

Footnotes


  1. I am following the setup described in chapter 9 of Hoff (2009). In that chapter Hoff considers a (Bayesian) model-averaged estimate of the regression coefficients; the approach he describes performs better than the Horseshoe. Unfortunately, Bayesian model averaging cannot currently be implemented in Stan. If you're interested in this approach, check out the R package BMA.

  2. Those new to Stan may find the Stan manual helpful. The latent variable formulation of the Horseshoe is

    $$\begin{aligned} (\beta_i \mid \lambda_i, \tau) &\sim N(0, \lambda_i^2 \tau^2) \\ \lambda_i &\sim C^+(0, 1) \end{aligned}$$
    where \(C^+(0, 1)\) is the standard half-Cauchy distribution. (In the Stan code above, \(\tau\) is also given a half-Cauchy prior.)

  3. The data have been standardized, so a coefficient of 0.01 means that a one-standard-deviation change in the corresponding variable shifts the predicted target by 0.01 standard deviations.

  4. For more discussion see Gelman et al. (2013), pp. 368–369.