The aim is to demonstrate that shrinkage, as we covered in the Empirical Bayes book seminar, is also possible with a method called partial pooling. In addition to shrinking towards the data average, we can set a prior on this average to express our uncertainty about it.

This tutorial roughly follows this workflow, which I can also recommend for more information on the theory behind partial pooling. We use the same baseball data set as in the book.

Required libraries:

```
library(tidyverse)
library(Lahman)
library(rstanarm)
```

We prepare the data as shown in the book:

```
# Filter out pitchers
career <- Batting %>%
  filter(AB > 0) %>%
  anti_join(Pitching, by = "playerID") %>%
  group_by(playerID) %>%
  summarize(H = sum(H), AB = sum(AB)) %>%
  mutate(average = H / AB)

# Include names along with the player IDs
career <- People %>%
  as_tibble() %>%  # tbl_df() is deprecated in newer dplyr versions
  select(playerID, nameFirst, nameLast) %>%
  unite(name, nameFirst, nameLast, sep = " ") %>%
  inner_join(career, by = "playerID") %>%
  select(-playerID)
```

This gives the `career` data frame:

`career %>% head`

```
## # A tibble: 6 × 4
## name H AB average
## <chr> <int> <int> <dbl>
## 1 Hank Aaron 3771 12364 0.305
## 2 Tommie Aaron 216 944 0.229
## 3 Andy Abad 2 21 0.0952
## 4 John Abadie 11 49 0.224
## 5 Ed Abbaticchio 772 3044 0.254
## 6 Fred Abbott 107 513 0.209
```

`AB` is the total number of at bats, and `H` is the number of hits out of those, for each player (`name`).

Now we downsample this data frame to reduce the time needed for the Bayesian simulation:

```
set.seed(93)
career_sample <- career %>%
  sample_n(300)
```

This is a histogram of batting averages:

```
career_sample %>%
  mutate(evidence = ifelse(AB > 20, "high", "low")) %>%
  ggplot(aes(x = average, fill = evidence)) +
  geom_histogram(bins = 50)
```

The average is around 0.25, but quite a few batters show extreme values based on low evidence.

We have seen how to use Empirical Bayes to shrink the extreme values towards the overall mean. There is another way to achieve shrinkage, using *partial pooling*, implemented in a *generalized linear mixed model*. This model can be fit in a frequentist framework, for which we would use the function `glmer` from the `lme4` package. But we can also decide to express our beliefs and uncertainty about the overall batting average using a prior in the Bayesian framework.

The model behind partial pooling is the following:

Each player (individual \(i\)) has an individual true batting average \(p_i\) from which hits and misses are drawn using a binomial distribution:

\(H_i \sim Bin(AB_i,p_i)\)

The batting averages are, in turn, drawn from a normal distribution, as the players form a group (in the actual model, this normal distribution applies on the log-odds scale):

\(p_i \sim N(\mu_{players},\sigma_{players})\)

The parameters for these two distributions, \(p_i\), \(\mu_{players}\) and \(\sigma_{players}\), are estimated together from the data. In this process, the individual \(p_i\) are not estimated as the raw batting averages; instead, the raw batting averages get shrunk towards the estimated mean. Shrinkage is stronger when the estimated \(\sigma_{players}\) is small, when few data points are available for a specific individual, and when the raw average is extreme.
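To build intuition for this, shrinkage can be sketched as a precision-weighted average, on the log-odds scale, between a player's raw estimate and the group mean. This is only a toy illustration with made-up values for \(\mu_{players}\) and \(\sigma_{players}\), not the joint estimation that the model actually performs:

```
# Toy sketch of the shrinkage idea, NOT the exact computation the
# model performs. mu = -1.2 and sigma = 0.2 are made-up values;
# the real model estimates them jointly from the data.
shrink <- function(h, ab, mu = -1.2, sigma = 0.2) {
  p_raw <- (h + 0.5) / (ab + 1)             # lightly smoothed raw average
  se2   <- 1 / (ab * p_raw * (1 - p_raw))   # approx. variance of the logit
  w     <- sigma^2 / (sigma^2 + se2)        # weight on the individual estimate
  plogis(w * qlogis(p_raw) + (1 - w) * mu)  # back to a probability
}

shrink(h = 2, ab = 21)        # low evidence: pulled close to plogis(-1.2)
shrink(h = 3771, ab = 12364)  # high evidence: stays near 3771/12364
```

With few at bats the weight `w` is small and the estimate lands near the group mean; with many at bats `w` approaches 1 and the raw average barely moves.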

In addition to that, we can specify a prior on the location of the normal distribution that gets estimated from the data. This means:

\(\mu_{players} \sim N(\mu_{prior},\sigma_{prior})\).

The \(\sigma_{prior}\) expresses our
uncertainty about the true mean batting average. It is *not* the
spread of batting averages observed in our sample.

The prior has to be given in terms of log odds (because of the logit link). Therefore, we calculate the mean batting average and transform it into log odds.

```
BA <- career_sample %>%
  mutate(average = H / AB) %>%
  pull(average) %>%
  mean()
BA
```

`## [1] 0.227673`

`log(BA/(1-BA)) # translate to log odds`

`## [1] -1.221498`
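As a side note, base R already provides these transforms: `qlogis()` is the logit (log-odds) function and `plogis()` its inverse, so we can double-check the numbers above:

```
# qlogis() is the logit transform, plogis() its inverse
qlogis(0.227673)   # same as log(BA / (1 - BA)), about -1.2215
plogis(-1.221498)  # back to the batting average, about 0.2277
```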

This can be the location of our prior. We fit the partial pooling model with `stan_glmer` from the `rstanarm` package:

```
pp_model <- stan_glmer(cbind(H, AB - H) ~ (1 | name),
                       data = career_sample,
                       family = binomial("logit"),
                       prior_intercept = normal(-1.2, 1),
                       cores = 4, seed = 12345)
```

**What are the ingredients of this model?**

- As we have binomial counts, we need a binomial family. We set `link = "logit"` to transform to log odds. A linear model can predict log odds, but it should not predict fractions directly, as those can only take values between 0 and 1.

- The linear predictor `~ (1 | name)`: in this model, we don't predict the batting average from any variable; we just fit a global intercept. In addition, we say that there is a grouping variable `name`, and each individual in this group may show some variation in batting average around the intercept. This is called a *random effect* and is written in R as `(the term on which the variation applies | the group that shows variation)`. The `1` is the notation for the intercept.

- `prior_intercept`: we use a normal prior with `location = -1.2` and `scale = 1`. This means we believe that the intercept corresponds roughly to the overall batting average, but we are not very certain about that.

We can look at the coefficients:

`pp_model$coefficients %>% head`

```
## (Intercept) b[(Intercept) name:Al_Hermann]
## -1.089642888 -0.017206863
## b[(Intercept) name:Al_Hubbard] b[(Intercept) name:Al_Lopez]
## 0.005630797 0.048006664
## b[(Intercept) name:Al_Montreuil] b[(Intercept) name:Al_Smith]
## -0.022416848 0.099153690
```
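Each `b[...]` coefficient is a player-specific offset from the global intercept on the log-odds scale, so a player's fitted average is the inverse logit of their sum. (The coefficients above are listed alphabetically, so they need not line up with the first rows of the data.) For example:

```
# Fitted average for Al Hermann: inverse logit of
# (global intercept + his random-effect offset)
plogis(-1.089642888 + (-0.017206863))  # about 0.248
```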

The fitted values give the individual players' fitted true batting averages:

`pp_model$fitted.values %>% head`

`## [1] 0.2400022 0.2761236 0.2796841 0.2555648 0.2571514 0.2666649`

Add the Bayesian estimates to the players:

```
career_sample <- career_sample %>%
  mutate(BE = pp_model$fitted.values)
```

```
career_sample %>%
  mutate(average = H / AB) %>%
  ggplot(aes(x = average, y = BE, colour = log(AB))) +
  geom_point() +
  geom_abline() +
  geom_abline(slope = 0, intercept = mean(career_sample$BE)) +
  ylab("Bayesian estimate")
```

- You can access the posterior draws with:

`draws <- as.matrix(pp_model)`

Can you use them to create credible intervals for the true batting averages of some of the players?
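As a starting point, here is a small helper sketch. It assumes the column naming that `as.matrix()` uses for `stan_glmer` fits, `b[(Intercept) name:First_Last]`, and `Al_Lopez` is just one player from the coefficient listing above:

```
# Sketch: 95% interval for one player's true batting average,
# computed from the posterior draws on the log-odds scale
ci_player <- function(draws, player_col) {
  eta <- draws[, "(Intercept)"] + draws[, player_col]
  quantile(plogis(eta), probs = c(0.025, 0.975))
}

# usage, after fitting pp_model:
# ci_player(as.matrix(pp_model), "b[(Intercept) name:Al_Lopez]")
```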

What happens if you use a more informative prior? How does that change the estimates and the credible intervals?

Can you use the posterior draws to compare two players of interest? You may refer to chapter 6 on Bayesian A/B testing.

The function `stan_glmer` has an equivalent in the frequentist world, `lme4::glmer`. How do the results compare?