The aim is to demonstrate that shrinkage, as we covered in the Empirical Bayes book seminar, is also possible with a method called partial pooling. In addition, to shrinking to the data average, we can set a prior on this average to express our uncertainty about it.
This tutorial roughly follows this workflow which I can also recommend for getting more information on the theory behind partial pooling. We use the same Baseball data set as in the book.
Required libraries:
library(tidyverse)
library(Lahman)
library(rstanarm)
We prepare the data as shown in the book:
# Filter out pitchers
career <- Batting %>%
filter(AB > 0) %>%
anti_join(Pitching, by = "playerID") %>% group_by(playerID) %>%
summarize(H = sum(H), AB = sum(AB)) %>% mutate(average = H / AB)
# Include names along with the player IDs
career <- People %>%
tbl_df() %>%
select(playerID, nameFirst, nameLast) %>%
unite(name, nameFirst, nameLast, sep = " ") %>%
inner_join(career, by = "playerID") %>%
select(-playerID)
This gives the career
data frame:
career %>% head
## # A tibble: 6 × 4
## name H AB average
## <chr> <int> <int> <dbl>
## 1 Hank Aaron 3771 12364 0.305
## 2 Tommie Aaron 216 944 0.229
## 3 Andy Abad 2 21 0.0952
## 4 John Abadie 11 49 0.224
## 5 Ed Abbaticchio 772 3044 0.254
## 6 Fred Abbott 107 513 0.209
AB
are the total at bats, and H
is the
number of hits out of those for each player (name
).
Now we downsample this data frame, to reduce the time needed for Bayesian simulation:
set.seed(93)
career_sample <- career %>%
sample_n(300)
This is a histogram of batting averages:
career_sample %>%
mutate("evidence" = ifelse(AB>20,"high","low")) %>%
ggplot(aes(x=average, fill=evidence))+
geom_histogram(bins=50)
The average is around 0.25, but we see quite some batters with extreme values, but low evidence.
We have seen how to use Empirical Bayes to shrink the extreme values
towards the overall mean. There is another way to achieve shrinkage,
using partial pooling, implemented in a generalized linear
mixed model. This model can be implemented in a frequentist
framework, for which we would use the function glmer
from
the lme4
package. But we can also decide to express our
beliefs and uncertainty about the overall batting average using a prior
in the Bayesian framework.
The model behind partial pooling is the following:
Each player (individual \(i\)) has an individual true batting average \(p_i\) from which hits and misses are drawn using a binomial distribution:
\(H_i \sim Bin(AB,p_i)\)
The batting averages are, in turn, drawn from a normal distribution, as the players form a group:
\(p_i \sim N(\mu_{players},\sigma_{players})\)
The parameters for these two distributions, \(p_i\), \(\mu_{players}\) and \(\sigma_{players}\), are estimated together from the data. In this process, the individual \(p_i\) are not modeled as the raw batting averages – instead the raw batting averages get shrunken towards the estimated mean. Shrinkage is higher when the estimated \(\sigma_{players}\) is small, when there are little data points available for a specific individual, and when the raw average is extreme.
In addition to that, we can specify a prior on the location of the normal distribution that gets estimated from the data. This means:
\(\mu \sim N(\mu_{prior},\sigma_{prior})\).
The \(\sigma_{prior}\) expresses our uncertainty about the true mean batting average. It is not the spread of batting averages observed in our sample.
The prior has to be given in terms of log odds (because of the logit link). Therefore, we can calculate the batting average and transform it into log odds.
BA <- career_sample %>%
mutate(average=H/AB) %>%
pull(average) %>%
mean()
BA
## [1] 0.227673
log(BA/(1-BA)) # translate to log odds
## [1] -1.221498
This can be the location of our prior.
rstanarm
pp_model <- stan_glmer(cbind(H,AB-H) ~ (1|name),
data = career_sample,
family = binomial("logit"),
prior_intercept =normal(-1.2, 1),
cores = 4, seed = 12345)
What are the ingredients of this model?
link="logit"
, to transform into log odds. A linear model
can predict log odds, but should not predict fractions, as those can
only take values between 0 and 1.~(1|name)
: In this model, we don’t
predict the batting average from any variable, we just fit a global
intercept. In addition to that, we say that there is a grouping variable
name
, and each individual of this group may show some
variation in batting average around the intercept. This is called
random effect and is written in R as
(the term on which the variation applies | the group that shows variation)
.
1
is the terminology for intercept.prior_intercept
: We use a normal prior with
location=-1.2
and scale=1
. This means we
belief that the intercept corresponds roughly to the overall batting
average, but we are not very certain about that.We can look at the coefficients:
pp_model$coefficients %>% head
## (Intercept) b[(Intercept) name:Al_Hermann]
## -1.089642888 -0.017206863
## b[(Intercept) name:Al_Hubbard] b[(Intercept) name:Al_Lopez]
## 0.005630797 0.048006664
## b[(Intercept) name:Al_Montreuil] b[(Intercept) name:Al_Smith]
## -0.022416848 0.099153690
The fitted values give the individual players’ fitted true batting averages:
pp_model$fitted.values %>% head
## [1] 0.2400022 0.2761236 0.2796841 0.2555648 0.2571514 0.2666649
Add the Bayesian estimates to the players:
career_sample <- career_sample %>%
mutate(BE = pp_model$fitted.values)
career_sample %>%
mutate(average = H/AB) %>%
ggplot(aes(x=average, y = BE, colour=log(AB))) +
geom_point()+
geom_abline()+
geom_abline(slope=0, intercept=mean(career_sample$BE))+
ylab("Bayesian estimate")
draws <- as.matrix(pp_model)
Can you use them to create confidence intervals for the true batting average for some of the players?
What happens if you put an informed prior? How that change the estimates and the confidence intervals?
Can you use the posterior draws to compare two players of interest? You may refer to chapter 6 on Bayesian A/B testing.
The function stan_glmer
has an equivalent in the
frequentist world, which is lme4:glmer
. How do the results
compare?