Load libraries:
library(fitdistrplus)
library(vcd)
library(tidyverse)
R includes a whole range of distributions: Here is a list of them.
Let’s, for example, look at the help for the Gaussian functions:
?dnorm
We see that there are four different calls:
- dnorm
: density
- pnorm
: cumulative distribution function (percentage of
values smaller than)
- qnorm
: quantile function (inverse of cumulative
distribution)
- rnorm
: generates random numbers
These four functions are available for most of the distributions. The first letter specifies if we want to look at the density, probability distribution/mass function, quantile or random numbers. The suffix specifies the distribution.
The arguments depend on the distribution we are looking at, but always include the parameters of that function.
The functions starting with d
give us densities /
probabilities for certain values to occur in a distribution that we
specify ourselves.
Let’s use the example where we caught 10 frogs and count how many of them are light-colored. For known parameters, we can calculate the the chances of counting exactly 5 light-colored frogs:
n = 10 # number of frogs we catch
p = 0.3 # true fraction of light frogs
dbinom(x=5, size=n, prob=p)
## [1] 0.1029193
We can ask for the probability of catching at most (or at least) 5
light frogs. In this case, we need the cumulative probability
distribution starting with p
:
pbinom(q=5, size=n, prob=p) # at most
## [1] 0.952651
pbinom(q=5, size=n,prob=p, lower.tail=FALSE) # larger than
## [1] 0.04734899
Catching at least 5 light frogs is a rare event.
Discrete random variables have probability mass functions, which are only defined for integer values:
dpois(1.5, lambda=4)
## Warning in dpois(1.5, lambda = 4): non-integer x = 1.500000
## [1] 0
The probability here is \(0\), and comes with a warning: Your calculations potentially make no sense.
You can draw random numbers from a certain distribution. We use this a lot for simulating random processes. If we do so, it’s advised to set a seed for reproducibility:
set.seed(59)
Now we can simulate frog counts:
frog_counts <-rpois(n = 200, lambda = 4)
head(frog_counts)
## [1] 1 4 6 5 4 6
Let’s also simulate an experiment where each frog count was done in a different lake.
For this we use the gamma-poisson distribution:
frog_counts_different_lakes <- rnbinom(n=200, size=2, mu=4) # the smaller size, the more spread out
Exercise: Simulate 200 frog sizes that follow a Gaussian distribution with an average of 7 cm, and a standard deviation of 2 cm.
Solution:
frog_sizes <- rnorm(n = 200, mean = 7, sd = 2)
head(frog_sizes)
## [1] 8.281580 8.384858 8.695852 7.667760 9.216826 7.191485
If you draw random numbers from a certain distribution, the histogram will have a shape that is specific for the distribution:
data.frame(frog_counts) %>%
ggplot(aes(x=frog_counts))+
geom_histogram(binwidth=1)
Exercise: Draw a histogram of the frog sizes:
data.frame(frog_sizes) %>%
ggplot(aes(x=frog_sizes))+
geom_histogram()
The cumulative distribution is the integral of the distribution/the
histogram: It gives you the percentage of values that are smaller than a
certain value.
We can visualize it with stat_ecdf
in
ggplot2
.
data.frame(frog_sizes) %>%
ggplot(aes(x=frog_sizes))+
stat_ecdf()
The qq-plot compares the quantiles of two distributions.
Quantiles are the inverse of the cumulative distribution, i.e., the
qnorm
function is the inverse of pnorm
:
pnorm(q = -2,mean=0, sd=2) # What is the probability of seeing a value of -2 or smaller?
## [1] 0.1586553
qnorm(p=0.158, mean=0, sd=2) # What is the value for which 15% of the other values are smaller?
## [1] -2.005423
We can compare the quantiles of the frog sizes to the theoretical quantiles of a normal distribution. There are specialized functions for qq-plots, so we don’t have to calculate the theoretical values by hand:
data.frame(frog_sizes) %>%
ggplot(aes(sample=frog_sizes))+
geom_qq()+
geom_qq_line()
The plotting function from the fitdistrplus
package lets
you compare your data to a theoretical distribution.
Let’s create an example where this plot is very useful: We take the frog counts from different lakes and try to fit it with Poisson:
my_fit = fitdist(frog_counts_different_lakes, dist = "pois")
my_fit # gives you the parameter estimates
## Fitting of the distribution ' pois ' by maximum likelihood
## Parameters:
## estimate Std. Error
## lambda 3.88 0.1392839
plot(my_fit)
This histogram shows us that there is a systematic problem: The empirical distribution exceeds the theory in the peripheries, i.e. for very low and very high values, indicating that the empirical distribution is more spread out than expected.
We could try again with fitting a gamma-poisson (negative binomial) distribution:
my_fit = fitdist(frog_counts_different_lakes, dist = "nbinom")
my_fit # gives you the parameter estimates
## Fitting of the distribution ' nbinom ' by maximum likelihood
## Parameters:
## estimate Std. Error
## size 1.690771 0.2616811
## mu 3.880206 0.2528455
plot(my_fit)
Exercise: Fit a normal distribution to the frog
sizes using fitdist
Solution:
fitdist(frog_sizes, "norm") %>% plot()