This tutorial is supposed to give a starting point for Bayesian
analysis with rstanarm
, assuming that you have some
background with linear models. It is roughly following this
official tutorial.
Required libraries:
library(rstanarm)
library(tidyverse)
library(parameters)
library(HSAUR3)
We are using data from a survey from 1974 / 1975 asking both female and male responders about their opinion on the statement: “Women should take care of running their homes and leave running the country up to men.” The data description is found here.
data("womensrole", package = "HSAUR3")
womensrole$total <- womensrole$agree + womensrole$disagree
womensrole %>% head
## education gender agree disagree total
## 1 0 Male 4 2 6
## 2 1 Male 2 0 2
## 3 2 Male 4 0 4
## 4 3 Male 6 3 9
## 5 4 Male 5 5 10
## 6 5 Male 13 7 20
The survey aims at answering the question whether the responses of men and women differ.
This is how we would fit a non-Bayesian generalized linear model (GLM):
my_glm <- glm(cbind(agree, disagree) ~ education + gender,
data = womensrole,
family = binomial(link = "logit"))
We can look at the coefficients that the model fitted:
round(coef(summary(my_glm)), 3)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.509 0.184 13.646 0.000
## education -0.271 0.015 -17.560 0.000
## genderFemale -0.011 0.084 -0.136 0.892
Because of the logit-link, the coefficients are given in log-odds: \(logit = \log(\frac{p}{1-p})\)
We can convert logit to probability as follows:
\(p = \frac{\exp(logit)}{1+\exp(logit)}\)
The plogis
function does this calculation for us. For
example, the intercept can be interpreted as the probability of agreeing
for education = 0
and gender=Male
:
coefs <- my_glm$coefficients
plogis(coefs[1])
## (Intercept)
## 0.9247962
We can check how well the model fits the data:
womensrole %>%
mutate(percent_agree = agree/total) %>%
mutate(pred=predict(my_glm, type="response")) %>%
ggplot(aes(x=education, y=percent_agree, color=gender)) +
geom_point()+
geom_line(aes(y=pred))
The extension to a Bayesian model is very easy to implement:
glm
, we call stan_glm
my_bglm <- stan_glm(cbind(agree, disagree) ~ education + gender,
data = womensrole,
family = binomial(link = "logit"),
prior = student_t(df = 7,location= 0, scale= 5),
prior_intercept = student_t(df = 7, location=0, scale=5),
cores = 2, seed = 12345)
Information on the available priors in rstanarm
is given
here.
The authors recommend weakly informative priors, which imply
little prior knowledge, but, as opposed to flat priors, still impose
some constraints.
First of all, we can simply print out the model:
my_bglm
## stan_glm
## family: binomial [logit]
## formula: cbind(agree, disagree) ~ education + gender
## observations: 42
## predictors: 3
## ------
## Median MAD_SD
## (Intercept) 2.5 0.2
## education -0.3 0.0
## genderFemale 0.0 0.1
##
## ------
## * For help interpreting the printed output see ?print.stanreg
## * For info on the priors used see ?prior_summary.stanreg
The median for the coefficients is very similar to the MLEs. We get a median absolute deviation that should approximate the parameters’ standard deviation.
One of the most important contents of the fit object are the posterior draws from the MCMC, which we can access directly:
draws <- as.matrix(my_bglm)
dim(draws)
## [1] 4000 3
colnames(draws)
## [1] "(Intercept)" "education" "genderFemale"
For each of the three parameters, the fit gives us 4000 draws from the posterior. From these draws, we can compute whatever summary statistic we like, for example:
The posterior mean for education
:
mean(draws[,2])
## [1] -0.2717251
The credible interval:
quantile(draws[,2],probs = c(.025,.9725))
## 2.5% 97.25%
## -0.3018313 -0.2425906
Or visualize the distribution:
data.frame(education_posterior = draws[,2]) %>%
ggplot(aes(x= education_posterior)) +
geom_histogram(bins=50)
For many of these computations, there are also a built-in functions. For example:
Visually comparing priors and posteriors:
posterior_vs_prior(my_bglm)
Parameter overview:
library(parameters)
parameters(my_bglm)
## Parameter | Median | 95% CI | pd | Rhat | ESS | Prior
## --------------------------------------------------------------------------------------
## (Intercept) | 2.52 | [ 2.17, 2.89] | 100% | 1.000 | 2759.00 | Student_t (0 +- 5)
## education | -0.27 | [-0.30, -0.24] | 100% | 1.000 | 3103.00 | Student_t (0 +- 5)
## genderFemale | -0.01 | [-0.17, 0.15] | 55.97% | 1.000 | 2708.00 | Student_t (0 +- 5)
Credible interval:
posterior_interval(my_bglm, prob = 0.95, pars = "education")
## 2.5% 97.5%
## education -0.3018313 -0.2420623
Interpretation: There is a 95% probability that the coefficient is between -0.30 and -0.24.
rstanarm
comes with a shiny app that creates diagnostic
plots for the fit.
launch_shinystan(my_bglm, ppd = FALSE)
Here, the “Explore” section is helpful: It gives the Kernel density estimate of the posterior distribution, and the Markov Chain. The Markov chain should fluctuate around the estimate.
For the model predictions, we obtain draws from the posterior again:
prediction <- posterior_predict(my_bglm, newdata=womensrole)
dim(prediction)
## [1] 4000 42
dim(womensrole)
## [1] 42 5
There are 4000 posterior draws for each of the 42 combinations of predictor values that are present in the original data.
One way to visualize the prediction is take the mean for each combination and plot it in comparison to the data:
prediction_means <- rowMeans(t(prediction))
womensrole %>%
mutate(percent_agree = agree/total) %>%
mutate(pred = prediction_means/total) %>%
ggplot(aes(x=education, y=percent_agree, color=gender)) +
geom_point()+
geom_line(aes(y=pred))
The mean of the posterior prediction is using a summary statistic, thus loosing information. Can you think of a way to visualize the posterior prediction in more detail?
Remember the aim of the survey was to find whether men and women
agree to different extents. Let’s put some numbers on this. Can you use
posterior_predict
to answer the question: Out of 100 men
with average education (education=10
), how many do we
predict to agree to the statement? Is this prediction different for
women?