This tutorial is supposed to give a starting point for Bayesian analysis with rstanarm, assuming that you have some background with linear models. It is roughly following this official tutorial.

Required libraries:

library(rstanarm)
library(tidyverse)
library(parameters)
library(HSAUR3)

The data

We are using data from a survey from 1974 / 1975 asking both female and male responders about their opinion on the statement: “Women should take care of running their homes and leave running the country up to men.” The data description is found here.

data("womensrole", package = "HSAUR3")
womensrole$total <- womensrole$agree + womensrole$disagree

womensrole %>% head
##   education gender agree disagree total
## 1         0   Male     4        2     6
## 2         1   Male     2        0     2
## 3         2   Male     4        0     4
## 4         3   Male     6        3     9
## 5         4   Male     5        5    10
## 6         5   Male    13        7    20

The survey aims at answering the question whether the responses of men and women differ.

Fit a regular GLM

This is how we would fit a non-Bayesian generalized linear model (GLM):

my_glm <- glm(cbind(agree, disagree) ~ education + gender,
                        data = womensrole, 
                        family = binomial(link = "logit"))

We can look at the coefficients that the model fitted:

round(coef(summary(my_glm)), 3)
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)     2.509      0.184  13.646    0.000
## education      -0.271      0.015 -17.560    0.000
## genderFemale   -0.011      0.084  -0.136    0.892

Because of the logit-link, the coefficients are given in log-odds: \(logit = \log(\frac{p}{1-p})\)

We can convert logit to probability as follows:

\(p = \frac{\exp(logit)}{1+\exp(logit)}\)

The plogis function does this calculation for us. For example, the intercept can be interpreted as the probability of agreeing for education = 0 and gender=Male:

coefs <- my_glm$coefficients
plogis(coefs[1])
## (Intercept) 
##   0.9247962

We can check how well the model fits the data:

womensrole %>% 
  mutate(percent_agree = agree/total) %>% 
  mutate(pred=predict(my_glm, type="response")) %>% 
  ggplot(aes(x=education, y=percent_agree, color=gender)) +
  geom_point()+
  geom_line(aes(y=pred))

Bayesian model

Set up the model

The extension to a Bayesian model is very easy to implement:

  • Instead of glm, we call stan_glm
  • specify the same linear predictor, data, and family
  • add priors on the variables and the intercept
my_bglm <- stan_glm(cbind(agree, disagree) ~ education + gender,
                              data = womensrole,
                              family = binomial(link = "logit"),
                              prior = student_t(df = 7,location= 0, scale= 5),
                              prior_intercept = student_t(df = 7, location=0, scale=5),
                              cores = 2, seed = 12345)

Information on the available priors in rstanarm is given here. The authors recommend weakly informative priors, which imply little prior knowledge, but, as opposed to flat priors, still impose some constraints.

Fitted coefficients

First of all, we can simply print out the model:

my_bglm
## stan_glm
##  family:       binomial [logit]
##  formula:      cbind(agree, disagree) ~ education + gender
##  observations: 42
##  predictors:   3
## ------
##              Median MAD_SD
## (Intercept)   2.5    0.2  
## education    -0.3    0.0  
## genderFemale  0.0    0.1  
## 
## ------
## * For help interpreting the printed output see ?print.stanreg
## * For info on the priors used see ?prior_summary.stanreg

The median for the coefficients is very similar to the MLEs. We get a median absolute deviation that should approximate the parameters’ standard deviation.

The posterior

One of the most important contents of the fit object are the posterior draws from the MCMC, which we can access directly:

draws <- as.matrix(my_bglm)
dim(draws)
## [1] 4000    3
colnames(draws)
## [1] "(Intercept)"  "education"    "genderFemale"

For each of the three parameters, the fit gives us 4000 draws from the posterior. From these draws, we can compute whatever summary statistic we like, for example:

The posterior mean for education:

mean(draws[,2])
## [1] -0.2717251

The credible interval:

quantile(draws[,2],probs = c(.025,.9725))
##       2.5%     97.25% 
## -0.3018313 -0.2425906

Or visualize the distribution:

data.frame(education_posterior = draws[,2]) %>% 
  ggplot(aes(x= education_posterior)) +
  geom_histogram(bins=50)

For many of these computations, there are also a built-in functions. For example:

Visually comparing priors and posteriors:

posterior_vs_prior(my_bglm)

Parameter overview:

library(parameters)
parameters(my_bglm)
## Parameter    | Median |         95% CI |     pd |  Rhat |     ESS |              Prior
## --------------------------------------------------------------------------------------
## (Intercept)  |   2.52 | [ 2.17,  2.89] |   100% | 1.000 | 2759.00 | Student_t (0 +- 5)
## education    |  -0.27 | [-0.30, -0.24] |   100% | 1.000 | 3103.00 | Student_t (0 +- 5)
## genderFemale |  -0.01 | [-0.17,  0.15] | 55.97% | 1.000 | 2708.00 | Student_t (0 +- 5)

Credible interval:

posterior_interval(my_bglm, prob = 0.95, pars = "education")
##                 2.5%      97.5%
## education -0.3018313 -0.2420623

Interpretation: There is a 95% probability that the coefficient is between -0.30 and -0.24.

Model checking

rstanarm comes with a shiny app that creates diagnostic plots for the fit.

launch_shinystan(my_bglm, ppd = FALSE)

Here, the “Explore” section is helpful: It gives the Kernel density estimate of the posterior distribution, and the Markov Chain. The Markov chain should fluctuate around the estimate.

Model predictions

For the model predictions, we obtain draws from the posterior again:

prediction <- posterior_predict(my_bglm, newdata=womensrole)
dim(prediction)
## [1] 4000   42
dim(womensrole)
## [1] 42  5

There are 4000 posterior draws for each of the 42 combinations of predictor values that are present in the original data.

One way to visualize the prediction is take the mean for each combination and plot it in comparison to the data:

prediction_means <- rowMeans(t(prediction))

womensrole %>% 
  mutate(percent_agree = agree/total) %>% 
  mutate(pred = prediction_means/total) %>% 
  ggplot(aes(x=education, y=percent_agree, color=gender)) +
  geom_point()+
  geom_line(aes(y=pred))

Exercises

  1. The mean of the posterior prediction is using a summary statistic, thus loosing information. Can you think of a way to visualize the posterior prediction in more detail?

  2. Remember the aim of the survey was to find whether men and women agree to different extents. Let’s put some numbers on this. Can you use posterior_predict to answer the question: Out of 100 men with average education (education=10), how many do we predict to agree to the statement? Is this prediction different for women?

Further reading