This tutorial is supposed to give a starting point for Bayesian
analysis with `rstanarm`

, assuming that you have some
background with linear models. It is roughly following this
official tutorial.

Required libraries:

```
library(rstanarm)
library(tidyverse)
library(parameters)
library(HSAUR3)
```

We are using data from a survey from 1974 / 1975 asking both female and male responders about their opinion on the statement: “Women should take care of running their homes and leave running the country up to men.” The data description is found here.

```
data("womensrole", package = "HSAUR3")
womensrole$total <- womensrole$agree + womensrole$disagree
womensrole %>% head
```

```
## education gender agree disagree total
## 1 0 Male 4 2 6
## 2 1 Male 2 0 2
## 3 2 Male 4 0 4
## 4 3 Male 6 3 9
## 5 4 Male 5 5 10
## 6 5 Male 13 7 20
```

The survey aims at answering the question whether the responses of men and women differ.

This is how we would fit a non-Bayesian generalized linear model (GLM):

```
my_glm <- glm(cbind(agree, disagree) ~ education + gender,
data = womensrole,
family = binomial(link = "logit"))
```

We can look at the coefficients that the model fitted:

`round(coef(summary(my_glm)), 3)`

```
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.509 0.184 13.646 0.000
## education -0.271 0.015 -17.560 0.000
## genderFemale -0.011 0.084 -0.136 0.892
```

Because of the logit-link, the coefficients are given in log-odds: \(logit = \log(\frac{p}{1-p})\)

We can convert logit to probability as follows:

\(p = \frac{\exp(logit)}{1+\exp(logit)}\)

The `plogis`

function does this calculation for us. For
example, the intercept can be interpreted as the probability of agreeing
for `education = 0`

and `gender=Male`

:

```
coefs <- my_glm$coefficients
plogis(coefs[1])
```

```
## (Intercept)
## 0.9247962
```

We can check how well the model fits the data:

```
womensrole %>%
mutate(percent_agree = agree/total) %>%
mutate(pred=predict(my_glm, type="response")) %>%
ggplot(aes(x=education, y=percent_agree, color=gender)) +
geom_point()+
geom_line(aes(y=pred))
```

The extension to a Bayesian model is very easy to implement:

- Instead of
`glm`

, we call`stan_glm`

- specify the same linear predictor, data, and family

- add priors on the variables and the intercept

```
my_bglm <- stan_glm(cbind(agree, disagree) ~ education + gender,
data = womensrole,
family = binomial(link = "logit"),
prior = student_t(df = 7,location= 0, scale= 5),
prior_intercept = student_t(df = 7, location=0, scale=5),
cores = 2, seed = 12345)
```

Information on the available priors in `rstanarm`

is given
here.
The authors recommend *weakly informative priors*, which imply
little prior knowledge, but, as opposed to flat priors, still impose
some constraints.

First of all, we can simply print out the model:

`my_bglm`

```
## stan_glm
## family: binomial [logit]
## formula: cbind(agree, disagree) ~ education + gender
## observations: 42
## predictors: 3
## ------
## Median MAD_SD
## (Intercept) 2.5 0.2
## education -0.3 0.0
## genderFemale 0.0 0.1
##
## ------
## * For help interpreting the printed output see ?print.stanreg
## * For info on the priors used see ?prior_summary.stanreg
```

The median for the coefficients is very similar to the MLEs. We get a median absolute deviation that should approximate the parameters’ standard deviation.

One of the most important contents of the fit object are the posterior draws from the MCMC, which we can access directly:

```
draws <- as.matrix(my_bglm)
dim(draws)
```

`## [1] 4000 3`

`colnames(draws)`

`## [1] "(Intercept)" "education" "genderFemale"`

For each of the three parameters, the fit gives us 4000 draws from the posterior. From these draws, we can compute whatever summary statistic we like, for example:

The posterior mean for `education`

:

`mean(draws[,2])`

`## [1] -0.2717251`

The credible interval:

`quantile(draws[,2],probs = c(.025,.9725))`

```
## 2.5% 97.25%
## -0.3018313 -0.2425906
```

Or visualize the distribution:

```
data.frame(education_posterior = draws[,2]) %>%
ggplot(aes(x= education_posterior)) +
geom_histogram(bins=50)
```

For many of these computations, there are also a built-in functions. For example:

Visually comparing priors and posteriors:

`posterior_vs_prior(my_bglm)`

Parameter overview:

```
library(parameters)
parameters(my_bglm)
```

```
## Parameter | Median | 95% CI | pd | Rhat | ESS | Prior
## --------------------------------------------------------------------------------------
## (Intercept) | 2.52 | [ 2.17, 2.89] | 100% | 1.000 | 2759.00 | Student_t (0 +- 5)
## education | -0.27 | [-0.30, -0.24] | 100% | 1.000 | 3103.00 | Student_t (0 +- 5)
## genderFemale | -0.01 | [-0.17, 0.15] | 55.97% | 1.000 | 2708.00 | Student_t (0 +- 5)
```

Credible interval:

`posterior_interval(my_bglm, prob = 0.95, pars = "education")`

```
## 2.5% 97.5%
## education -0.3018313 -0.2420623
```

Interpretation: There is a 95% probability that the coefficient is between -0.30 and -0.24.

`rstanarm`

comes with a shiny app that creates diagnostic
plots for the fit.

`launch_shinystan(my_bglm, ppd = FALSE)`

Here, the “Explore” section is helpful: It gives the Kernel density estimate of the posterior distribution, and the Markov Chain. The Markov chain should fluctuate around the estimate.

For the model predictions, we obtain draws from the posterior again:

```
prediction <- posterior_predict(my_bglm, newdata=womensrole)
dim(prediction)
```

`## [1] 4000 42`

`dim(womensrole)`

`## [1] 42 5`

There are 4000 posterior draws for each of the 42 combinations of predictor values that are present in the original data.

One way to visualize the prediction is take the mean for each combination and plot it in comparison to the data:

```
prediction_means <- rowMeans(t(prediction))
womensrole %>%
mutate(percent_agree = agree/total) %>%
mutate(pred = prediction_means/total) %>%
ggplot(aes(x=education, y=percent_agree, color=gender)) +
geom_point()+
geom_line(aes(y=pred))
```