Overview

The aim is to demonstrate that shrinkage, as we covered in the Empirical Bayes book seminar, is also possible with a method called partial pooling. In addition to shrinking towards the data average, we can set a prior on this average to express our uncertainty about it.

This tutorial roughly follows this workflow, which I can also recommend for more background on the theory behind partial pooling. We use the same baseball data set as in the book.

Required libraries:

library(tidyverse)
library(Lahman)
library(rstanarm)

Data

We prepare the data as shown in the book:

# Filter out pitchers
career <- Batting %>%
  filter(AB > 0) %>%
  anti_join(Pitching, by = "playerID") %>%
  group_by(playerID) %>%
  summarize(H = sum(H), AB = sum(AB)) %>%
  mutate(average = H / AB)

# Include names along with the player IDs
career <- People %>%
  as_tibble() %>%
  select(playerID, nameFirst, nameLast) %>% 
  unite(name, nameFirst, nameLast, sep = " ") %>% 
  inner_join(career, by = "playerID") %>%
  select(-playerID)

This gives the career data frame:

career %>% head
## # A tibble: 6 × 4
##   name               H    AB average
##   <chr>          <int> <int>   <dbl>
## 1 Hank Aaron      3771 12364  0.305 
## 2 Tommie Aaron     216   944  0.229 
## 3 Andy Abad          2    21  0.0952
## 4 John Abadie       11    49  0.224 
## 5 Ed Abbaticchio   772  3044  0.254 
## 6 Fred Abbott      107   513  0.209

For each player (name), AB is the total number of at-bats, and H is the number of hits among those.

Now we downsample this data frame to reduce the time needed for the Bayesian simulation:

set.seed(93)
career_sample <- career %>% 
  sample_n(300)

This is a histogram of batting averages:

career_sample %>% 
  mutate(evidence = ifelse(AB > 20, "high", "low")) %>% 
  ggplot(aes(x = average, fill = evidence)) +
  geom_histogram(bins = 50)

The average is around 0.25, but quite a few batters show extreme values that are based on low evidence (few at-bats).

We have seen how to use Empirical Bayes to shrink the extreme values towards the overall mean. There is another way to achieve shrinkage: partial pooling, implemented in a generalized linear mixed model. Such a model can be fitted in a frequentist framework, using the function glmer from the lme4 package. But we can also decide to express our beliefs and uncertainty about the overall batting average using a prior, in the Bayesian framework.
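For reference, the corresponding frequentist model is a one-liner with lme4 (a minimal sketch, assuming the package is installed); we come back to the comparison in the exercises:

# Frequentist partial pooling: same random-intercept model, without priors
library(lme4)
freq_model <- glmer(cbind(H, AB - H) ~ (1 | name),
                    data = career_sample,
                    family = binomial("logit"))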

Partial pooling

Generative model

The model behind partial pooling is the following:

Each player (individual \(i\)) has an individual true batting average \(p_i\), from which the hits out of \(AB_i\) at-bats are drawn using a binomial distribution:

\(H_i \sim \text{Bin}(AB_i, p_i)\)

The batting averages are, in turn, drawn from a normal distribution, as the players form a group. Because of the logit link used below, this normal distribution lives on the log-odds scale:

\(\text{logit}(p_i) \sim N(\mu_{players},\sigma_{players})\)

The parameters for these two distributions, \(p_i\), \(\mu_{players}\) and \(\sigma_{players}\), are estimated jointly from the data. In this process, the individual \(p_i\) are not simply the raw batting averages; instead, the raw averages get shrunk towards the estimated mean. Shrinkage is stronger when the estimated \(\sigma_{players}\) is small, when few data points are available for a specific individual, and when the raw average is extreme.

In addition to that, we can specify a prior on the location \(\mu_{players}\) of the normal distribution that gets estimated from the data. This means:

\(\mu_{players} \sim N(\mu_{prior},\sigma_{prior})\).

The \(\sigma_{prior}\) expresses our uncertainty about the true mean batting average. It is not the spread of batting averages observed in our sample.
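To make the generative story concrete, here is a minimal simulation sketch of the model above (all numbers are made up for illustration):

# Draw the group mean from the prior, then the players, then their hits
set.seed(42)
mu_players    <- rnorm(1, mean = -1.2, sd = 1)  # mu_players ~ N(mu_prior, sigma_prior)
sigma_players <- 0.2                            # spread of true abilities (made up)
n_players     <- 5
p  <- plogis(rnorm(n_players, mu_players, sigma_players))  # true batting averages
AB <- c(10, 50, 200, 1000, 5000)                # at-bats per player (made up)
H  <- rbinom(n_players, size = AB, prob = p)    # observed hits
H / AB  # raw averages scatter most where AB is small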

The prior has to be given in terms of log odds (because of the logit link). Therefore, we calculate the mean batting average and transform it into log odds.

# Mean raw batting average in the sample
BA <- career_sample %>% 
  mutate(average = H / AB) %>% 
  pull(average) %>% 
  mean()
BA
## [1] 0.227673
log(BA/(1-BA)) # translate to log odds
## [1] -1.221498

This can be the location of our prior.
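As a sanity check, base R provides this transform and its inverse as qlogis() and plogis():

qlogis(BA)         # log(BA / (1 - BA)), the log odds
plogis(qlogis(BA)) # back-transforms to the original average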

Run the model in rstanarm

# Fit the partial-pooling model with a prior on the intercept
pp_model <- stan_glmer(cbind(H, AB - H) ~ (1 | name),
                       data = career_sample,
                       family = binomial("logit"),
                       prior_intercept = normal(-1.2, 1),
                       cores = 4, seed = 12345)

What are the ingredients of this model?

  • As we have binomial counts, we need a binomial family. We set link = "logit" to transform to log odds: a linear model can predict log odds, but it should not predict fractions directly, as those can only take values between 0 and 1.
  • The linear predictor ~(1|name): In this model, we don’t predict the batting average from any other variable; we just fit a global intercept. In addition, we declare a grouping variable name, and each individual in this group may show some variation in batting average around the intercept. This is called a random effect and is written in R as (the term the variation applies to | the group that shows variation). The 1 stands for the intercept.
  • prior_intercept: We use a normal prior with location = -1.2 and scale = 1. This means we believe that the intercept corresponds roughly to the overall batting average, but we are not very certain about that. We can verify the applied priors as shown below.
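rstanarm reports the priors it actually used in the fit (including any automatic rescaling) via prior_summary():

# Inspect the priors applied to the intercept and the random effects
prior_summary(pp_model)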

Extract the shrunken averages

We can look at the coefficients:

pp_model$coefficients %>% head
##                      (Intercept)   b[(Intercept) name:Al_Hermann] 
##                     -1.089642888                     -0.017206863 
##   b[(Intercept) name:Al_Hubbard]     b[(Intercept) name:Al_Lopez] 
##                      0.005630797                      0.048006664 
## b[(Intercept) name:Al_Montreuil]     b[(Intercept) name:Al_Smith] 
##                     -0.022416848                      0.099153690

The fitted values give each player’s estimated true batting average:

pp_model$fitted.values %>% head
## [1] 0.2400022 0.2761236 0.2796841 0.2555648 0.2571514 0.2666649
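Each fitted value is the inverse logit of the global intercept plus that player’s random intercept. A quick check with Al Lopez’s coefficients from above:

# Inverse logit of (global intercept + player deviation). This should be
# close to Al Lopez's entry in pp_model$fitted.values; small differences
# remain because fitted values average over the whole posterior
plogis(-1.089642888 + 0.048006664)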

Add the Bayesian estimates to the players:

career_sample <- career_sample %>% 
  mutate(BE = pp_model$fitted.values)

Plot the result

career_sample %>% 
  mutate(average = H / AB) %>% 
  ggplot(aes(x = average, y = BE, colour = log(AB))) +
  geom_point() +
  geom_abline() +
  geom_abline(slope = 0, intercept = mean(career_sample$BE)) +
  ylab("Bayesian estimate")

Exercises

  1. You can access the posterior draws with:
draws <- as.matrix(pp_model)

Can you use them to create confidence intervals for the true batting average for some of the players?

  2. What happens if you use a more informative prior? How does that change the estimates and the confidence intervals?

  3. Can you use the posterior draws to compare two players of interest? You may refer to chapter 6 on Bayesian A/B testing.

  4. The function stan_glmer has an equivalent in the frequentist world, which is lme4::glmer. How do the results compare?