Load libraries:

library(fitdistrplus)
library(vcd)
library(tidyverse)

Distribution functions and random numbers in R

R includes a whole range of distributions: Here is a list of them.

Let’s, for example, look at the help for the Gaussian functions:

?dnorm

We see that there are four different calls:
- dnorm: density
- pnorm: cumulative distribution function (percentage of values smaller than)
- qnorm: quantile function (inverse of cumulative distribution)
- rnorm: generates random numbers

These four functions are available for most of the distributions. The first letter specifies if we want to look at the density, probability distribution/mass function, quantile or random numbers. The suffix specifies the distribution.

The arguments depend on the distribution we are looking at, but always include the parameters of that function.

Probability

The functions starting with d give us densities / probabilities for certain values to occur in a distribution that we specify ourselves.

Binomial distribution

Let’s use the example where we caught 10 frogs and count how many of them are light-colored. For known parameters, we can calculate the the chances of counting exactly 5 light-colored frogs:

n = 10 # number of frogs we catch
p = 0.3 # true fraction of light frogs
dbinom(x=5, size=n, prob=p)
## [1] 0.1029193

We can ask for the probability of catching at most (or at least) 5 light frogs. In this case, we need the cumulative probability distribution starting with p:

pbinom(q=5, size=n, prob=p) # at most
## [1] 0.952651
pbinom(q=5, size=n,prob=p, lower.tail=FALSE) # larger than
## [1] 0.04734899

Catching at least 5 light frogs is a rare event.

Discrete random variables have probability mass functions, which are only defined for integer values:

dpois(1.5, lambda=4)
## Warning in dpois(1.5, lambda = 4): non-integer x = 1.500000
## [1] 0

The probability here is \(0\), and comes with a warning: Your calculations potentially make no sense.

Random numbers

You can draw random numbers from a certain distribution. We use this a lot for simulating random processes. If we do so, it’s advised to set a seed for reproducibility:

set.seed(59)

Now we can simulate frog counts:

frog_counts <-rpois(n = 200, lambda = 4)
head(frog_counts)
## [1] 1 4 6 5 4 6

Let’s also simulate an experiment where each frog count was done in a different lake.

For this we use the gamma-poisson distribution:

frog_counts_different_lakes <- rnbinom(n=200, size=2, mu=4)   # the smaller size, the more spread out

Exercise: Simulate 200 frog sizes that follow a Gaussian distribution with an average of 7 cm, and a standard deviation of 2 cm.

Solution:

frog_sizes <- rnorm(n = 200, mean = 7, sd = 2)
head(frog_sizes)
## [1] 8.281580 8.384858 8.695852 7.667760 9.216826 7.191485

Histogram

If you draw random numbers from a certain distribution, the histogram will have a shape that is specific for the distribution:

data.frame(frog_counts) %>% 
  ggplot(aes(x=frog_counts))+
  geom_histogram(binwidth=1)

Exercise: Draw a histogram of the frog sizes:

data.frame(frog_sizes) %>% 
  ggplot(aes(x=frog_sizes))+
  geom_histogram()

Cumulative distrubution function

The cumulative distribution is the integral of the distribution/the histogram: It gives you the percentage of values that are smaller than a certain value.
We can visualize it with stat_ecdf in ggplot2.

data.frame(frog_sizes) %>% 
  ggplot(aes(x=frog_sizes))+
  stat_ecdf()

The QQ-plot

The qq-plot compares the quantiles of two distributions.

Quantiles are the inverse of the cumulative distribution, i.e., the qnorm function is the inverse of pnorm:

pnorm(q = -2,mean=0, sd=2) # What is the probability of seeing a value of -2 or smaller?
## [1] 0.1586553
qnorm(p=0.158, mean=0, sd=2) # What is the value for which 15% of the other values are smaller?
## [1] -2.005423

We can compare the quantiles of the frog sizes to the theoretical quantiles of a normal distribution. There are specialized functions for qq-plots, so we don’t have to calculate the theoretical values by hand:

data.frame(frog_sizes) %>% 
  ggplot(aes(sample=frog_sizes))+
  geom_qq()+
  geom_qq_line()

Comparing data to theory

The plotting function from the fitdistrplus package lets you compare your data to a theoretical distribution.

Let’s create an example where this plot is very useful: We take the frog counts from different lakes and try to fit it with Poisson:

my_fit = fitdist(frog_counts_different_lakes, dist = "pois")
my_fit # gives you the parameter estimates
## Fitting of the distribution ' pois ' by maximum likelihood 
## Parameters:
##        estimate Std. Error
## lambda     3.88  0.1392839
plot(my_fit)

This histogram shows us that there is a systematic problem: The empirical distribution exceeds the theory in the peripheries, i.e. for very low and very high values, indicating that the empirical distribution is more spread out than expected.

We could try again with fitting a gamma-poisson (negative binomial) distribution:

my_fit = fitdist(frog_counts_different_lakes, dist = "nbinom")
my_fit # gives you the parameter estimates
## Fitting of the distribution ' nbinom ' by maximum likelihood 
## Parameters:
##      estimate Std. Error
## size 1.690771  0.2616811
## mu   3.880206  0.2528455
plot(my_fit)

Exercise: Fit a normal distribution to the frog sizes using fitdist

Solution:

fitdist(frog_sizes, "norm") %>% plot()