2 Statistical Modeling In the previous chapter, the knowledge of both the generative model and the values of the parameters provided us with probabilities we could use for decision making – for instance, whether we had really found an epitope. In many real situations, neither the generative model nor the parameters are known, and we will need to estimate them using the data we have collected. Statistical modeling works from the data upwards to a model that might plausibly explain the data21 Even if we have found a model that perfectly explains all our current data, it could always be that reality is more complex. A new set of data lets us conclude that another model is needed, and may include the current model as a special case or approximation.. This upward-reasoning step is called statistical inference. This chapter will show us some of the distributions and estimation mechanisms that serve as building blocks for inference. Although the examples in this chapter are all parametric (i.e., the statistical models only have a small number of unknown parameters), the principles we discuss will generalize. In a statistical setting, we start with the data $$X$$ and use them to estimate the parameters. These estimates are denoted by Greek letters with what we call hats on them, as in $$\widehat{\theta}$$.

2.1 Goals for this chapter

In this chapter we will:

• See that there is a difference between two subjects that are often confused: “Probability” and “Statistics”.

• Fit data to probability distributions using histograms and other visualization tricks.

• Have a first encounter with an estimating procedure known as maximum likelihood through a simulation experiement.

• Make inferences from data for which we have prior information. For this we will use the Bayesian paradigm which will involve new distributions with specially tailored properties. We will use simulations and see how Bayesian estimation differs from simple application of maximum likelihood.

• Use statistical models and estimation to evaluate dependencies in binomial and multinomial distributions.

• Analyse some historically interesting genomic data assembled into tables.

• Make Markov chain models for dependent data.

• Do a few concrete applications counting motifs in whole genomes and manipulate special Bioconductor classes dedicated to genomic data. textbfExamples of parameters: the single parameter $$\lambda$$ defines a Poisson distribution. The letter $$\mu$$ is often used for the mean of the normal. More generally, we use the Greek letter $$\theta$$ to designate a generic tuple of parameters necessary to specify a probability model. For instance, in the case of the binomial distribution, $$\theta=(n,p)$$ comprises two numbers, a positive integer and a real number between 0 and 1.

Parameters are the key.

We saw in Chapter 1 that the knowledge of all the parameter values in the epitope example enabled us to use our probability model and test a null hypothesis based on the data we had at hand. We will see different approaches to statistical modeling through some real examples and computer simulations, but let’s start by making a distinction between two situations depending on how much information is available.

2.2 The difference between statistical and probabilistic models

A probabilistic analysis is possible when we know a good generative model for the randomness in the data, and we are provided with the parameters’ actual values. Figure 2.1: The probabilistic model we obtained in Chapter 1. The data are represented as $$x$$ in green. We can use the observed data to compute the probability if observing $$x$$ when we know the true value of $$\theta$$.

In the epitope example, knowing that false positives occured as Bernoulli(0.01) per position, the number of patients assayed and the length of the protein ensured that there were no unknown parameters.

In such a case, we can use mathematical deduction to compute the probability of an event as schematized in Figure 2.1. In the epitope examples, we used the Poisson probability as our null model with the given parameter $$\lambda=0.5$$. We were able to conclude through mathematical deduction that the chances of seeing a maximum value of 7 or larger was around $$10^{-4}$$ and thus that in fact the observed data were highly unlikely under that model (or “null hypothesis”).

Now suppose that we know the number of patients and the length of the proteins (these are given by the experimental design) but not the distribution itself and the false positive rate. Once we observe data, we need to go up from the data to estimate both a probability model $$F$$ (Poisson, normal, binomial) and eventually the missing parameter(s) for that model. This is the type of statistical inference we will explain in this chapter.

2.3 A simple example of statistical modeling

There are two parts to the modeling procedure. First we need a reasonable probability distribution to model the data generation process. As we saw in Chapter 1, discrete count data may be modeled by simple probability distributions such as binomial, multinomial or Poisson distributions. The normal distribution, or bell shaped curve, is often a good model for continuous measurements. Distributions can also be more complicated mixtures of these elementary ones (more on this in Chapter 4).

Let’s revisit the epitope example from the previous chapter, starting without the tricky outlier.

Goodness-of-fit : visual evaluation

Our first step is to find a fit from candidate distributions; this requires consulting graphical and quantitative goodness-of-fit plots. For discrete data, we can plot a barplot of frequencies (for continuous data, we would look at the histogram) as in Figure 2.2. Figure 2.2: The observed distribution of the epitope data without the outlier.

However, it is hard to decide which theoretical distribution fits the data best without using a comparison. One visual goodness-of-fit diagram is known as the rootogram (William S Cleveland 1988); it hangs the bars with the observed counts from the theoretical red points. If the counts correspond exactly to their theoretical values, the bottom of the boxes will align exactly with the horizontal axis. Figure 2.3: Rootogram showing the square root of the theoretical values as red dots and the square root of the observed frequencies as drop down rectangles. (We’ll see a bit below how the goodfit function decided which $$\lambda$$ to use.)

► Question 2.1

To calibrate what such a plot looks like with a known Poisson variable, use rpois with $$\lambda$$ = 0.05 to generate 100 Poisson distributed numbers and draw their rootogram.

We see that the rootogram for e99 seems to fit the Poisson model reasonably well. But remember, to make this happen we removed the outlier. The Poisson is completely determined by one parameter, often called the Poisson mean $$\lambda$$. In most cases where we can guess the data follows a Poisson distribution, we will need to estimate the Poisson parameter from the data. The parameter is called the Poisson mean because it is the mean of the theoretical distribution and, as it turns out, is estimated by the sample mean. This overloading of the word is confusing to everyone.

The most common way of estimating $$\lambda$$ is to choose the value $$\hat{\lambda}$$ that makes the observed data the most likely. This is called the maximum likelihood estimator (Rice 2006 Chapter 8, Section 5), often abbreviated MLE. We will illustrate this rather paradoxical idea in the next section.

Although we above took out the extreme observation before taking a guess at the probability distribution, we are going to return to the data with it for the rest of our analysis. In practice we would not know whether there is an outlier, and which data point(s) it is / they are. The effect of leaving it in is to make our estimate of the mean higher. In turns this would make it more likely that we’d observe a value of 7 under the null model, resulting in a larger p-value. So, if the resulting p-value is small even with the outlier included, we are assured that our analysis is up to something real. We call such a tactic being conservative: we err on the side of caution, of not detecting something.

Estimating the parameter of the Poisson distribution

What value for the Poisson mean makes the data the most probable? In a first step, we tally the outcomes.

## e100
##  0  1  2  7
## 58 34  7  1

Then we are going to try out different values for the Poisson mean and see which one gives the best fit to our data. If the mean $$\lambda$$ of the Poisson distribution were 3, the counts would look something like this:

##
##  0  1  2  3  4  5  6  7  8
##  3 10 23 23 18 18  2  2  1

which has many more 2’s and 3’s than we see in our data. So we see that $$\lambda=3$$ is unlikely to have produced our data, as the counts do not match up so well.

► Question 2.2

Repeat this simulation with different values of $$\lambda$$. Can you find one that gives counts close to the oberved ones just by trial and error?

So we could try out many possible values and proceed by brute force. However, we’ll do something more elegant and use a little mathematics to see which value maximizes the probability of observing our data. Let’s calculate the probability of seeing the data if the value of the Poisson parameter is $$m$$. Since we suppose the data derive from independent draws, this probability is simply the product of individual probabilities:

\begin{equation*} \begin{aligned} P(58 \times 0, 34 \times 1, 7 \times 2, \text{one }7 \;|\; \text{data are Poisson}(m)) = P(0)^{58}\times P(1)^{34}\times P(2)^{7}\times P(7)^{1}.\end{aligned} \end{equation*}

For $$m=3$$ we can compute this22 Note how we here use R’s vectorization: the call to dpois returns four values, corresponding to the four different numbers. We then take these to the powers of 58, 34, 7 and 1, respectively, using the ^ operator, resulting again in four values. Finally, we collapse them into one number, the product, with the prod function..

##  1.392143e-110

► Question 2.3

Compute the probability as above for $$m=0,1,2$$. Does $$m$$ have to be integer? Try computing the probability for $$m=0.4$$ for example.

This probability is the likelihood function of $$\lambda$$, given the data, and we write it Here $$L$$ stands for likelihood and $$f(k)=e^{-\lambda} \,\lambda^k\,/\,k!$$, the Poisson probability we saw earlier.

$\begin{equation*} L\left(\lambda ,\,\ x=(k_1,k_2,k_3,...)\right) = \prod_{i=1}^{100} f(k_i) \end{equation*}$

Instead of working with multiplications of a hundred small numbers, it is convenient23 That’s usually true both for pencil and paper and for computer calculations. to take the logarithm. Since the logarithm is strictly increasing, if there is a point where the logarithm achieves its maximum within an interval it will also be the maximum for the probability.

Let’s start with a computational illustration. We compute the likelihood for many different values of the Poisson parameter. To do this we need to write a small function that computes the probability of the data for different values24 Here we again use R’s vector syntax that allows us to write the computation without an explicit loop over the data points. Compared to the code above, here we call dpois on each of the 100 data points, rather than tabulating data with the table function before calling dpois only on the distinct values. This is a simple example for alternative solutions whose results are equivalent, but may differ in how easy it is to read the code or how long it takes to execute..

Now we can compute the likelihood for a whole series of lambda values from 0.05 to 0.95 (Figure 2.4). Figure 2.4: The red curve is the log-likelihood function. The vertical line shows the value of m (the mean) and the horizontal line the log-likelihood of m. It looks like m maximizes the likelihood.

##  0.55

► Question 2.4

What does the vapply function do in the above code? Hint: check its manual page.

► Solution

In fact there is a shortcut: the function goodfit.

##  "observed" "count"    "fitted"   "type"     "method"
##  "df"       "par"
## lambda ##  0.55 The output of goodfit is a composite object called a list. One of its components is called par and contains the value(s) of the fitted parameter(s) for the distribution studied. In this case it’s only one number, the estimate of $$\lambda$$. ► Question 2.5 What are the other components of the output from the goodfit function? ► Task Compare the value of m to the value that we used previously for $$\lambda$$, 0.5. Redo the modeling that we did in Chapter 1 with m instead of 0.5. 2.3.1 Classical statistics for classical data Here is a formal proof of our computational finding that the mean maximizes the (log-)likelihood. \begin{align} \log L(\lambda, x)=&\sum_{i=1}^{100} -\lambda +k_i \log\lambda-\log (k_i!) \tag{2.1} \\ =&-100\lambda +\log\lambda\left(\sum_{i=1}^{100} k_i\right) + \text{const.} \tag{2.2} \end{align} We use the catch-all “const.” for terms that do not depend on $$\lambda$$ (although they do depend on $$x$$, i.e., on the $$k_i$$). To find the $$\lambda$$ that maximizes this, we compute the derivative in $$\lambda$$ and set it to zero. \begin{align} \frac{d}{d\lambda}\log L=&-100 +\frac{1}{\lambda}\sum_{i=1}^{100} k_i \stackrel{?}{=}0 \tag{2.3} \\ \lambda=&\frac{1}{100}\sum_{i=1}^{100} k_i=\bar{k} \tag{2.4} \end{align} You have just seen the first steps of a statistical approach, starting from the ground up’ (from the data) to infer the model parameter(s): this is statistical estimation a parameter from data. Another important component will be choosing which family of distributions our data come from, that part is done by evaluating the goodness of fit. We will encounter this later. In the classical statistical testing framework, we consider one single model, that we call the null model, for the data. The null model formulates an “uninteresting” baseline, such as that all observations come from the same random distribution regardless of which group or treatment they are from. We then test whether there is something more interesting going on by computing the probability that the data are compatible with that model. Often, this is the best we can do, since we do not know in sufficient detail what the “interesting”, non-null or alternative model should be. In other situations, we have two competing models that we can compare, as we will see later. ► Question 2.6 What is the value of modeling with a known distribution? For instance, why is it interesting to know a variable has a Poisson distribution ? ► Solution Another useful direction is regression. We may be interested in knowing how our count-based response variable (e.g., the result of counting sequencing reads) depends on a continuous covariate, say, temperature or nutrient concentration. You may already have encountered linear regression, where our model is that the response variable $$y$$ depends on the covariate $$x$$ via the equation $$y = ax+b + e$$, with parameters $$a$$ and $$b$$ (that we need to estimate), and with residuals $$e$$ whose probability model is a normal distribution (whose variance we usually also need to estimate). For count data the same type of regression model is possible, although the probability distribution for the residuals then needs to be non-normal. In that case we use the generalized linear models framework. We will see examples when studying RNA-Seq in Chapter 8 and another type of next generation sequencing data, 16S rRNA data, in Chapter 9. Knowing that our probability model involves a Poisson, binomial, multinomial distribution or another parametric family will enable us to have quick answers to questions about the parameters of the model and compute quantities such as p-values and confidence intervals. 2.4 Binomial distributions and maximum likelihood In a binomial distribution there are two parameters: the number of trials $$n$$, which is typically known, and the probability $$p$$ of seeing a 1 in a trial. This probability is often unknown. 2.4.1 An example Suppose we take a sample of $$n=120$$ males and test them for red-green colorblindness. We can code the data as 0 if the subject is not colorblind and 1 if he is. We summarize the data by the table: ## cb ## 0 1 ## 110 10 ► Question 2.7 Which value of $$p$$ is the most likely given these data? ► Solution However, be careful: sometimes ML estimates are harder to guess and to compute, as well as being much less intuitive (see Exercise 2.2). In this special case, your intuition may give you the estimate $$\hat{p}=\frac{1}{12}$$, which turns out to be the maximum likelihood estimate. We put a hat over the letter to remind us that this is not (necessarily) the underlying true value, but an estimate we make from the data. As before in the case of the Poisson, if we compute the likelihood for many possible $$p$$, we can plot it and see where its maximum falls. Figure 2.5: Plot of the likelihood as a function of the probabilities. The likelihood is a function on $$[0, 1]$$; here we have zoomed into the range of $$[0, 0.3]$$, as the likelihood is practically zero for larger values of $$p$$. ##  0.085 Note: 0.085 is not exactly the value we expected $$(\frac{1}{12})$$, and that is because the set of values that we tried (in probs) did not include the exact value of $$\frac{1}{12}\simeq 0.0833$$, so we obtained the next best one. We could use numeric optimisation methods to overcome that. Likelihood for the binomial distribution One can come up with different criteria than ML, which lead to other estimators: they all carry hats. We'll see other examples in Chapter 4. The likelihood and the probability are the same mathematical function, only interpreted in different ways – in one case, it tells us how probable it is to see a particular set of values of the data, given the parameters; in the other case, we consider the data as fixed, and ask for the particular parameter value that makes the data more likely. Suppose $$n=300$$, and we observe $$y=40$$ successes. Then, for the binomial distribution: $\begin{equation} f(\theta\,|\,n,y) = f(y\,|\,n,\theta)={n \choose y} \, \theta^y \, (1-\theta)^{(n-y)}. \tag{2.5} \end{equation}$ As $${n \choose y}$$ is very large25 It’s around $$e^{115}$$, and this can be seen from Stirling’s formula. We can also use R: choose(300, 40) = 9.8e+49., we use the logarithm of the likelihood to give: $\begin{equation*} \log f(\theta |y) = 115 + 40\log(\theta)+(300-40)\log(1-\theta). \end{equation*}$ Here’s a function we use to calculate it: which we plot for the range of $$\theta$$ from 0 to 1 (Figure 2.6). Figure 2.6: Plot of the log likelihood function for $$n=300$$ and $$y=40$$. The maximum lies at 40/300 = 0.1333… , consistent with intuition, but we see that other values of $$\theta$$ are almost equally likely, as the function is quite flat around the maximum. We will see in a later section how Bayesian methods allow us to use a range of values for $$\theta$$. 2.5 More boxes:multinomial data 2.5.1 DNA count modeling: base pairs There are four basic molecules of DNA: A - adenine, C - cytosine, G - guanine, T - thymine. The nucleotides are classified into 2 groups: purines (A and G) and pyrimidines (C and T). The binomial would work as a model for the purine/pyrimidine groupings but not if we want to use A, C, G, T; for that we need the multinomial model from 1.4. Let’s look at noticeable patterns that occur in these frequencies. 2.5.2 Nucleotide bias This section combines estimation and testing by simulation in a real example. Data from one strand of DNA for the genes of Staphylococcus aureus bacterium are available in a fasta file staphsequence.ffn.txt, which we can read with a function from the Bioconductor package Biostrings. Let’s look at the first gene: ## A DNAStringSet instance of length 1 ## width seq names ##  1362 ATGTCGGAAAAAGAA...ATAAGAAATGTATAA lcl|NC_002952.2_c... ## A C G T ## 522 219 229 392 ► Question 2.8 Why did we use double square brackets in the second line? ► Solution ► Question 2.9 Following a similar procedure as in Exercise 1.8, test whether the nucleotides are equally distributed across the four nucleotides for this first gene. Due to their different physcal properties, evolutionary selection can act on the nucleotide frequencies. So we can ask whether, say, the first ten genes from these data come from the same multinomial. We do not have a prior reference, we just want to decide whether the nucleotides occur in the same proportions in the first 10 genes. If not, this would provide us with evidence for varying selective pressure on these ten genes. ## gene1 gene2 gene3 gene4 gene5 gene6 gene7 gene8 gene9 gene10 ## A 0.38 0.36 0.35 0.37 0.35 0.33 0.33 0.34 0.38 0.27 ## C 0.16 0.16 0.13 0.15 0.15 0.15 0.16 0.16 0.14 0.16 ## G 0.17 0.17 0.23 0.19 0.22 0.22 0.20 0.21 0.20 0.20 ## T 0.29 0.31 0.30 0.29 0.27 0.30 0.30 0.29 0.28 0.36 ## A C G T ## 0.3470531 0.1518313 0.2011442 0.2999714 So let’s suppose p0 is the vector of multinomial probabilities for all the ten genes and use a Monte Carlo simulation to test whether the departures between the observed letter frequencies and expected values under this supposition are within a plausible range. We compute the expected counts by taking the outer product of the vector of probabilities p0 with the sums of nucleotide counts from each of the 10 columns, cs. ## gene1 gene2 gene3 gene4 gene5 gene6 gene7 gene8 gene9 ## 1362 1134 246 1113 1932 2661 831 1515 1287 ## gene10 ## 696 ## gene1 gene2 gene3 gene4 gene5 gene6 gene7 gene8 gene9 gene10 ## A 473 394 85 386 671 924 288 526 447 242 ## C 207 172 37 169 293 404 126 230 195 106 ## G 274 228 49 224 389 535 167 305 259 140 ## T 409 340 74 334 580 798 249 454 386 209 We can now create a random table with the correct column sums using the rmultinom function. This table is generated according to the null hypothesis that the true proportions are given by p0. ##  TRUE Now we repeat this B = 1000 times. For each table we compute our test statistic from Section 1.4.1 in Chapter 1 (the function stat) and store the results in the vector simulstat. Together, these values constitute our null distribution, as they were generated under the null hypothesis that p0 is the vector of multinomial proportions for each of the 10 genes. ##  0 Figure 2.7: Histogram of simulstat. The value of S1 is marked by the vertical red line, those of the 0.95 and 0.99 quantiles (see next section) by the dotted lines. The histogram is shown in Figure 2.7. We see that the probability of seeing a value as large as S1=70.1 is very small under the null model. It happened 0 times in our 1000 simulations that a value as big as S1 occured. Thus the ten genes do not seem to come from the same multinomial model. 2.6 The $$\chi^2$$ distribution In fact, we could have used statistical theory to come to the same conclusion without running these simulations. The theoretical distribution of the simulstat statistic is called the $$\chi^2$$ (chi-squared) distribution26 Strictly speaking, the distribution of simulstat is approximately described by a $$\chi^2$$ distribution; the approximation is particularly good if the counts in the table are large. with parameter 30 ($$=10\times(4-1)$$). We can use this for computing the probability of having a value as large as S1 $$=$$ 70.1. As we just saw above, small probabilities are difficult to compute by Monte Carlo: the granularity of the computation is $$1/B$$, so we cannot estimate any probabilities smaller than that, and in fact the uncertainty of the estimate is larger. So if any theory is applicable, that tends to be useful. We can check how well theory and simulation match up in our case using another visual goodness-of-fit tool: the (QQ) plot. When comparing two distributions, whether from two different samples or from one sample versus a theoretical model, just looking at histograms is not informative enough. We use a method based on the quantiles of each of the distributions. 2.6.1 Intermezzo: quantiles and the quantile-quantile plot In the previous chapter, we ordered the 100 sample values $$x_{(1)},x_{(2)},...,x_{(100)}$$. Say we want the 22nd percentile. We can take any value between the 22nd and the 23rd value, i.e., any value that fulfills $$x_{(22)} \leq c_{0.22} < x_{(23)}$$ is acceptable as a 0.22 quantile ($$c_{0.22}$$). In other words, $$c_{0.22}$$ is defined by $\begin{equation*} \frac{\# x_i's \leq c_{0.22}}{n} = 0.22. \end{equation*}$ In Section 3.6.6, we’ll introduce the empirical cumulative distribution function (ECDF) $$\hat{F}$$, and we’ll see that our definition of $$c_{0.22}$$ can also be written as $$\hat{F}_n(c_{0.22}) = 0.22$$. In Figure 2.7, our histogram of the distribution of simulstat, the quantiles $$c_{0.95}$$ and $$c_{0.99}$$ are also shown. ► Question 2.10 1. Compare the simulstat values and 1000 randomly generated $$\chi^2_{30}$$ random numbers by displaying them in histograms with 50 bins each. 2. Compute the quantiles of the simulstat values and compare them to those of the $$\chi_{30}^2$$ distribution. Hint: A name collision occurs here. Statisticians call the summary statistic we just computed as simulstat (sum of squares of weighted differences), the chi-squared or $$\chi^2$$ statistic. The theoretical distribution $$\chi^2_{\nu}$$ is a distribution in its own right, with a parameter $$\nu$$ called the degrees of freedom. When reading about the chi-squared or $$\chi^2$$, you will need to pay attention to the context to see which meaning is appropriate. ► Question 2.11 Do you know another name for the 0.5 quantile? ► Solution ► Question 2.12 In the above definition, we were a little vague on how the quantile is defined in general, i.e., not just for 0.22. How is the quantile computed for any number between 0 and 1, including ones that are not multiples of $$1/n$$? ► Solution Now that we have an idea what quantiles are, we can do the quantile-quantile plot. We plot the quantiles of the simulstat values, which we simulated under the null hypothesis, against the theoretical null distribution $$\chi^2_{30}$$ (Figure 2.8): Figure 2.8: Our simulated statistic’s distribution compared to $$\chi_{30}^2$$ using a QQ-plot, which shows the theoretical quantiles for the $$\chi^2_{30}$$ distribution on the horizontal axis and the sampled ones on the vertical axis. Having convinced ourselves that simulstat is well described by a $$\chi^2_{30}$$ distribution, we can use that to compute our p-value, i.e., the probability that under the null hypothesis (counts are distributed as multinomial with probabilities $$p_{\text{A}} = 0.35$$, $$p_{\text{C}} = 0.15$$, $$p_{\text{G}} = 0.2$$, $$p_{\text{T}} = 0.3$$) we observe a value as high as S1=70.1: ##  4.74342e-05 With such a small p-value, the null hypothesis seems improbable. Note how this computation did not require the 1000 simulations and was faster. 2.7 Chargaff’s Rule The most important pattern in the nucleotide frequencies was discovered by Chargaff (Elson and Chargaff 1952). Long before DNA sequencing was available, using the weight of the molecules, he asked whether the nucleotides occurred at equal frequencies. He called this the tetranucleotide hypothesis. We would translate that into asking whether $$p_{\text{A}} = p_{\text{C}} = p_{\text{G}} = p_{\text{T}}$$. Unfortunately, Chargaff only published the percentages of the mass present in different organisms for each of the nucleotides, not the measurements themselves. ## A T C G ## Human-Thymus 30.9 29.4 19.9 19.8 ## Mycobac.Tuber 15.1 14.6 34.9 35.4 ## Chicken-Eryth. 28.8 29.2 20.5 21.5 ## Sheep-liver 29.3 29.3 20.5 20.7 ## Sea Urchin 32.8 32.1 17.7 17.3 ## Wheat 27.3 27.1 22.7 22.8 ## Yeast 31.3 32.9 18.7 17.1 ## E.coli 24.7 23.6 26.0 25.7 Figure 2.9: Barplots for the different rows in ChargaffTable. Can you spot the pattern? ► Question 2.13 • Do these data seem to come from equally likely multinomial categories? • Can you suggest an alternative pattern? • Can you do a quantitative analysis of the pattern, perhaps inspired by the simulations above? ► Solution ► Question 2.14 When computing pChf, we only looked at the values in the null distribution smaller than the observed value. Why did we do this in a one-sided way here? 2.7.1 Two categorical variables Up to now, we have visited cases where the data are taken from a sample that can be classified into different boxes: the binomial for Yes/No binary boxes and the multinomial distribution for categorical variables such as A, C, G, T or different genotypes such aa, aA, AA. However it might be that we measure two (or more) categorical variables on a set of subjects, for instance eye color and hair color. We can then cross-tabulate the counts for every combination of eye and hair color. We obtain a table of counts called a contingency table. This concept is very useful for many biological data types. ## Eye ## Hair Brown Blue Hazel Green ## Black 36 9 5 2 ## Brown 66 34 29 14 ## Red 16 7 7 7 ## Blond 4 64 5 8 ► Question 2.15 Explore the HairEyeColor object in R. What data type, shape and dimensions does it have? ► Solution Color blindness and sex Deuteranopia is a form of red-green color blindness due to the fact that medium wavelength sensitive cones (green) are missing. A deuteranope can only distinguish 2 to 3 different hues, whereas somebody with normal vision sees 7 different hues. A survey for this type of color blindness in human subjects produced a two-way table crossing color blindness and sex. ## Men Women ## Deute 19 2 ## NonDeute 1981 1998 How do we test whether there is a relationship between sex and the occurrence of color blindness? We postulate the null model with two independent binomials: one for sex and one for color blindness. Under this model we can estimate all the cells’ multinomial probabilities, and we can compare the observed counts to the expected ones. This is done through the chisq.test function in R. ## ## Pearson's Chi-squared test with Yates' continuity ## correction ## ## data: Deuteranopia ## X-squared = 12.255, df = 1, p-value = 0.0004641 The small p value tells us that we should expect to see such a table with only a very small probability under the null model – i.e., if the fractions of deuteranopic color blind among women and men were the same. We’ll see another test for this type of data called Fisher’s exact test (also known as the hypergeometric test) in section 10.3.2. This test is widely used for testing the over-representations of certain types of genes in a list of significantly expressed ones. 2.7.2 A special multinomial: Hardy-Weinberg equilibrium Here we highlight the use of a multinomial with three possible levels created by combining two alleles M and N. Suppose that the overall frequency of allele M in the population is $$p$$, so that of N is $$q = 1-p$$. The Hardy-Weinberg model looks at the relationship between $$p$$ and $$q$$ if there is independence of the frequency of both alleles in a genotype, the so-called Hardy-Weinberg equilibrium (HWE). This would be the case if there is random mating in a large population with equal distribution of the alleles among sexes. The probabilities of the three genotypes are then as follows: $\begin{equation} p_{\text{MM}}=p^2,\quad p_{\text{NN}}=q^2,\quad p_{\text{MN}}=2pq \tag{2.6} \end{equation}$ We only observe the frequencies $$(n_{\text{MM}},\,n_{\text{MN}},\,n_{\text{NN}})$$ for the genotypes MM, MN, NN and the total number $$S=n_{\text{MM}}+ n_{\text{MN}}+n_{\text{NN}}$$. We can write the likelihood, i.e., the probability of the observed data when the probabilities of the categories are given by (2.6), using the multinomial formula $\begin{equation*} P(n_{\text{MM}},\,n_{\text{MN}},\,n_{\text{NN}}\;|\;p) = {S \choose n_{\text{MM}},n_{\text{MN}},n_{\text{NN}}} (p^2)^{n_{\text{MM}}} \,\times\, (2pq)^{n_{\text{MN}}} \,\times\, (q^2)^{n_{\text{NN}}}, \label{eq:likHWE} \end{equation*}$ and the log-likelihood under HWE $\begin{equation*} L(p)=n_{\text{MM}}\log(p^2)+n_{\text{MN}} \log(2pq)+n_{\text{NN}}\log(q^2). \label{eq:loglikHWE} \end{equation*}$ The value of $$p$$ that maximizes the loglikelihood is $\begin{equation*} p = \frac{n_{\text{MM}} + n_{\text{MN}}/2}{S}. \end{equation*}$ See (Rice 2006 Chapter 8, Section 5) for the proof. Given the data $$(n_{\text{MM}},\,n_{\text{MN}},\,n_{\text{NN}})$$, the log-likelihood $$L$$ is a function of only one parameter, $$p$$. Figure 2.11 shows this log-likelihood function for different values of $$p$$ for the 216th row of the Mourant data27 This is genotype frequency data of blood group alleles from Mourant, Kopec, and Domaniewska-Sobczak (1976) available through the R package HardyWeinberg., computed in the following code. ## Population Country Total MM MN NN ## 214 Oceania Micronesia 962 228 436 298 ## 215 Oceania Micronesia 678 36 229 413 ## 216 Oceania Tahiti 580 188 296 96 Figure 2.11: Plot of the log-likelihood for the Tahiti data. The maximum likelihood estimate for the probabilities in the multinomial is also obtained by using the observed frequencies as in the binomial case, however the estimates have to account for the relationships between the three probabilities. We can compute $$\hat{p}_{\text{MM}}$$, $$\hat{p}_{\text{MN}}$$ and $$\hat{p}_{\text{NN}}$$ using the af function from the HardyWeinberg package. ##  0.5793103 The expected values under Hardy-Weinberg equilibrium are then ## MM MN NN ## 194.6483 282.7034 102.6483 which we can compare to the observed values above. We can see that they are quite close to the observed values. We could further test whether the observed values allow us to reject the Hardy-Weinberg model, either by doing a simulation or a $$\chi^2$$ test as above. A visual evaluation of the goodness-of-fit of Hardy-Weinberg was designed by de Finetti (Finetti 1926; Cannings and Edwards 1968). It places every sample at a point whose coordinates are given by the proportions of each of the different alleles. Visual comparison to the Hardy-Weinberg equilibrium We use the HWTernaryPlot functon to display the data and compare it to Hardy-Weinberg equilibrium graphically. Figure 2.12: This de Finetti plot shows the points as barycenters of the three genotypes using the frequencies as weights on each of the corners of the triangle. The Hardy-Weinberg model is the red curve, the acceptance region is between the two purple lines. We see that the US is the furthest from being in HW equilibrium. ► Question 2.16 Make the ternary plot as in the code above, then add the other data points to it, what do you notice? You could back up your discussion using the HWChisq function. ► Question 2.17 Divide all total frequencies by 50, keeping the same proportions for each of the genotypes, and recreate the ternary plot. • What happens to the points ? • What happens to the confidence regions and why? 2.7.3 Concatenating several multinomials: sequence motifs and logos The Kozak Motif is a sequence that occurs close to the start codon ATG of a coding region. The start codon itself always has a fixed spelling but in positions 5 to the left of it, there is a nucleotide pattern in which the letters are quite far from being equally likely. We summarize this by giving the position weight matrix (PWM) or position-specific scoring matrix (PSSM), which provides the multinomial probabilities at every position. This is encoded graphically by the sequence logo (Figure 2.13). ## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] ## A 0.33 0.25 0.4 0.15 0.20 1 0 0 0.05 ## C 0.12 0.25 0.1 0.40 0.40 0 0 0 0.05 ## G 0.33 0.25 0.4 0.20 0.25 0 0 1 0.90 ## T 0.22 0.25 0.1 0.25 0.15 0 1 0 0.00 Figure 2.13: Here is a diagram called a sequence logo for the position dependent multinomial used to model the Kozak motif. It codifies the amount of variation in each of the positions on a log scale. The large letters represent positions where there is no uncertainty about which nucleotide occurs. Over the last sections, we’ve seen how the different “boxes” in the multinomial distributions we have encountered very rarely have equal probabilities. In other words, the parameters $$p_1, p_2, ...$$ are often different, depending on what is being modeled. Examples of multinomials with unequal frequencies include the twenty different amino acids, blood types and hair color. If we have multiple categorical variables, we have seen that they are rarely independent (sex and colorblindness, hair and eye color, …). We will see later in Chapter 9 that we can explore the patterns in these dependencies by using multivariate decompositions of the contingency tables. Here, we’ll look at an important special case of dependencies between categorical variables: those that occur along a sequence (or “chain”) of categorical variables, e.g., over time or along a biopolymer. 2.8 Modeling sequential dependencies: Markov chains If we want to predict tomorrow’s weather, a reasonably good guess is that it will most likely be the same as today’s weather, in addition we may state the probabilties for various kinds of possible changes28 The same reasoning can also be applied in reverse: we could “predict” yesterday’s weather from today’s.. This method for weather forecasting is an example for the Markov assumption: the prediction for tomorrow only depends on the state of things today, but not on yesterday or three weeks ago (all information we could potentially use is already contained in today’s weather). The weather example also highlights that such an assumption need not necessarily be exactly true, but it should be a good enough assumption. It is fairly straightforward to extend this assumption to dependencies on the previous $$k$$ days, where $$k$$ is a finite and hopefully not too large number. The essence of the Markov assumption is that the process has a finite “memory”, so that predictions only need to look back for a finite amount of time. Instead of temporal sequences, we can also apply this to biological sequences. In DNA, we may see specific succession of patterns so that pairs of nucleotides, called digrams, say, CG, CA, CC and CT are not equally frequent. For instance, in parts of the genome we see more frequent instances of CA than we would expect under independence: $\begin{equation*} P(\mathtt{CA}) \neq P(\mathtt{C}) \, P(\mathtt{A}). \end{equation*}$ We model this dependency in the sequence as a Markov chain: $\begin{equation*} P(\mathtt{CA}) = P(\mathtt{NCA}) = P(\mathtt{NNCA}) = P(... \mathtt{CA}) = P(\mathtt{C}) \, P(\mathtt{A|C}) \end{equation*}$ where N stands for any nucleotide, and $$P(\mathtt{A|C})$$ stands for “the probability of $$\mathtt{A}$$, given that the preceding base is a $$\mathtt{C}$$”. Figure 2.14 shows a schematic representation of such transitions on a graph. Figure 2.14: Visualisation of a 4-state Markov chain. The probability of each possible digram (e.,g., CA) is given by the weight of the edge between the corresponding nodes. So for instance, the probability of CA is given by the edge C$$\to$$ A. We’ll see in Chapter 11 how to use R packages to draw these type of network graphs. 2.9 Bayesian Thinking Figure 2.15: Turtles all the way down. Bayesian modeling of the uncertainty of the parameter of a distribution is done by using a random variable whose distribution may depend on parameters whose uncertainty can be modeled as a random variable; these are called hierarchical models. Up to now we have followed a classical approach where the parameters of our distributions, i.e., the probabilities of the possible different outcomes, represent long term frequencies. The parameters are –at least conceptually– definite, knowable, fixed numbers. We may not know them, so we estimate them from the data at hand. However, such an approach does not take into account any information that we might already know, and that might constrain our parameters or make certain parameters more likely than others even before we have seen any of the current set of data. For that we need a different approach, in which we use probability distributions to express our knowledge about the parameters, and use data to update this knowledge, for instance by shifting those distributions or making them more narrow; this is provided by the Bayesian paradigm (Figure 2.15). The Bayesian paradigm is a practical approach where prior and posterior distributions are used as models of our knowledge before and after collecting some data and making an observation. It is particularly useful for integrating or combining information from different sources. Suppose we have a certain hypothesis, call it $$H$$, and we want to use data to decide whether the hypothesis is true. We can formalize our prior knowledge about $$H$$ in the form of a prior probability, written $$P(H)$$29 For a so-called frequentist, such a probability does not exist. Their viewpoint is that, although the truth is unknown, in reality the hypothesis is either true or false; there is no meaning in calling it, say, “0.7-true”.. After we see the data, we have the posterior probability. We write it as $$P(H\,|\,D)$$, the probability of $$H$$ given that we saw $$D$$. This may be higher or lower than $$P(H)$$, depending on what the data $$D$$ were. Haplotypes To keep the mathematical formalism to a minimum, we will start with an example. We study a forensics example using combined signatures from the Y chromosome called haplotypes. A haplotype is a collection of alleles (DNA sequence variants) that are spatially adjacent on a chromosome, are usually inherited together (recombination tends not to disconnect them), and thus are genetically linked. In this case we are looking at linked variants on the Y chromosome. First we’ll look at the motivation behind haplotype frequency analyses, then we’ll revisit a little the idea of likelihood. After this, we’ll explain how we can think of unknown parameters as being random numbers themselves, modeling their uncertainty with a prior distribution. Then we will see how to incorporate new data observed into the probability distributions and compute posterior confidence statements about the parameters. Figure 2.16: A short tandem repeat (STR) in DNA occurs when a pattern of two or more nucleotides is repeated and the repeated sequences are directly adjacent to each other. An STR is also known as a microsatellite. The pattern can range in length from 2 to 13 nucleotides, and the number of repeats is highly variable across individuals. STR numbers can be used as genetic signatures, called haplotypes. 2.9.1 Example: haplotype frequencies We want to estimate the proportion of a particular Y-haplotype that consists of a set of different short tandem repeats (STR). The combination of STR numbers at the specific STR locations used for DNA forensics are labeled by the number of repeats at the specific positions. The US Y-chromosome STR database can be accessed at the URL urlhttp://www.usystrdatabase.org. Here is a short excerpt of an STR haplotype table: ## Individual DYS19 DXYS156Y DYS389m DYS389n DYS389p ## 1 H1 14 12 4 12 3 ## 2 H3 15 13 4 13 3 ## 3 H4 15 11 5 11 3 ## 4 H5 17 13 4 11 3 ## 5 H7 13 12 5 12 3 ## 6 H8 16 11 5 12 3 This says that the haplotype H1 has 14 repeats at position DYS19, 12 repeats at position DXYS156Y $$...$$. The haplotypes created through the use of these Y-STR profiles are shared between men in the same patriarchal lineages. For these reasons it is possible that two different men share the same profile. We need to find the underlying proportion $$\theta$$ of the haplotype of interest in the population of interest. We are going to consider the occurrence of a haplotype as a success’ in a binomial distribution using collected observations. 2.9.2 Simulation study of the Bayesian paradigm for the binomial Instead of assuming that our parameter $$\theta$$ has one single value, the Bayesian world view allows us to see it as a draw from a statistical distribution. The distribution expresses our belief about the possible values of the parameter $$\theta$$. In principle, we can use any distribution that we like whose possible values are permissible for $$\theta$$. When we are looking at a parameter that expresses a proportion or a probability, and which takes its values between 0 and 1, it is convenient to use the beta distribution. Its density formula is written $\begin{equation*} f_{\alpha,\beta}(x) = \frac{x^{\alpha-1} (1-x)^{\beta-1}}{\text{B}(\alpha, \beta)} \quad\text{where}\quad \text{B}(\alpha ,\beta)=\frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)} \end{equation*}$ We can see in Figure 2.17 how this function depends on two parameters $$\alpha$$ and $$\beta$$, making it a very flexible family of distributions (so it can ‘fit’ a lot different situations). It has a nice mathematical property: if we start with a prior belief on $$\theta$$ that is beta-shaped, observe a dataset of $$n$$ binomial trials, then update our belief, the posterior distribution on $$\theta$$ will also have a beta distribution, albeit with updated parameters. This is a mathematical fact, we will not prove it, however we demonstrate it by simulation. Figure 2.17: Beta distributions with $$\alpha=10,20,50$$ and $$\beta=30,60,150$$ used as a {prior} for a probability of success. These three distributions have the same mean ($$\frac{\alpha}{\alpha +\beta}$$), but different concentrations around the mean. The distribution of $$Y$$ For a given choice of $$\theta$$, we know what the distribution of $$Y$$ is, by virtue of Equation (2.5). But what is the distribution of $$Y$$ if $$\theta$$ itself also varies according to some distribution? We call this the marginal distribution of $$Y$$. Let’s simulate that. First we generate a random sample of 10000 $$\theta$$s. In the code chunk, we again use vapply to apply a function, the unnamed (or ‘anonymous’) function of th, across all elements of rtheta to obtain as a result another vector y of the same length.. For each of these $$\theta$$s, we then generate a random sample of $$Y$$ (Figure 2.18). Figure 2.18: Marginal Distribution of $$Y$$. ► Question 2.18 Verify that we could have gotten the same result as in the above code chunk by using R’s vectorisation capabilities and writing rbinom(length(rtheta), rtheta, size = 300). Histogram of all the thetas such that $$Y=40$$: the posterior distribution So let’s now compute the posterior distribution of $$\theta$$ by conditioning on those outcomes where $$Y$$ was 40. We compare it to the theoretical posterior, densPostTheory30 We use thetas defined above in Section 2.4., of which more below. The results are shown in Figure 2.19. Figure 2.19: Only choosing the values of the distribution with $$Y=40$$ gives the posterior distribution of $$\theta$$. The histogram (green) shows the simulated values for the posteriror distribution, the line the theoretical density of a beta distribution with the theoretical posterior parameters. We can also check the means of both distributions computed above and see that they are close to 4 significant digits. ##  0.1286394 ##  0.1285714 To approximate the mean of the theoretical density densPostTheory, we have above literally computed the integral $$\int_0^1 \theta f(\theta) \, d\theta$$ using numerical integration, i.e., the sum over the integrant. This is not always convenient (or feasible), in particular if our parameters are high-dimensional, i.e., if our model involves not only a single, scalar $$\theta$$ parameter, but if $$\theta$$ is a high-dimensional object, as is for instance often the case in the case of image analysis, and if the integral cannot be computed analytically. Thus, let’s see how we could use Monte Carlo integration instead. This is similar to the code above, where we used numerical integration to compute the posterior mean from thetaPostEmp by calling R’s mean function. ##  0.1285445 We can check the concordance between our Monte Carlo sample thetaPostMC and our sample thetaPostEmp using a quantile-quantile plot (QQ-plot, Figure 2.20). Figure 2.20: QQ-plot of our Monte Carlo sample thetaPostMC from the theoretical distribution and our simulation sample thetaPostEmp. We could also similarly compare either of these two distributions to the theoretical distribution function pbeta(., 90, 610). If the curve lies on the line $$y=x$$ this indicates a good agreement. There are some random differences at the tails. ► Question 2.19 What is the difference between the simulation that results in thetaPostEmp and the Monte Carlo simulation that leads to thetaPostMC? Posterior distribution is also a beta Now we have seen that the posterior distribution is also a beta. In our case its parameters $$\alpha=90$$ and $$\beta=610$$ were obtained by summing the prior parameters $$\alpha=50$$, $$\beta=350$$ with the observed successes $$y=40$$ and the observed failures $$n-y=260$$, thus obtaining the posterior $\begin{equation*} \text{beta}(90,\, 610)=\text{beta}(\alpha+y,\beta+(n-y)). \end{equation*}$ We can use it to give the best31 We could take the value that maximizes the posterior distribution as our best estimate, this is called the MAP estimate and this case it would be $$\frac{\alpha-1}{\alpha+\beta-2}=\frac{89}{698}\doteq 0.1275$$. estimate we can for $$\theta$$ with its uncertainty given by the posterior distribution. Suppose we had a second series of data After seeing our previous data, we now have a new prior, $$\text{beta}(90, 610)$$. • Now we collect a new set of data with $$n=150$$ observations and $$y=25$$ successes, thus 125 failures. • Now what would we take to be our best guess at $$\theta$$? Using the same reasoning as before, the new posterior will be: $$\text{beta}(90+25=115,\, 610+125=735)$$. The mean of this distribution is $$\frac{115}{115+735}=\frac{115}{850}\simeq 0.135$$, thus one estimate of $$\theta$$ would be 0.135. The theoretical maximum a posteriori (MAP) estimate would be the mode of $$\text{beta}(115, 735)$$, ie $$\frac{114}{848}\simeq 0.134$$. Let’s check this numerically. ##  0.1352941 ##  0.1352963 ##  0.134 The last line of this code uses a Monte Carlo method for finding the MAP from a sample from rbeta(., 115, 735). ► Question 2.20 Redo all the computations replacing our original prior with a softer prior (less peaked), meaning that we use less prior information. How much does this change the final result? As a general rule, the prior rarely changes the posterior distribution substantially except if it is very peaked. This would be the case if, at the outset, we were already rather sure of what to expect. Another case when the prior has an influence is if there is very little data. The best situation to be in is to have enough data to swamp the prior so that its choice doesn’t have much impact on the final result. Confidence Statements for the proportion parameter Now it is time to conclude about where the proportion actually lies given the data. One summary is a posterior credibility interval, which is a Bayesian analog of the confidence interval. We can take the 2.5 and 97.5-th percentiles of the posterior distribution: $$P (L \leq \theta \leq U)=0.95$$ using R. ## 2.5% 97.5% ## 0.1131867 0.1590393 Figure 2.21: An example from Love, Huber, and Anders (2014) shows plots of the likelihoods (solid lines, scaled to integrate to 1) and the posteriors (dashed lines) for the green and purple genes and of the prior (solid black line): due to the higher dispersion of the purple gene, its likelihood is wider and less peaked (indicating less information), and the prior has more influence on its posterior than for the green gene. The stronger curvature of the green posterior at its maximum translates to a smaller reported standard error for the MAP logarithmic fold change (LFC) estimate (horizontal error bar). 2.10 Example: occurrence of a nucleotide pattern in a genome The examples we have seen up to now have concentrated on distributions of discrete counts and categorical data. Let’s look at an example of distributions of distances, which are quasi-continuous. This case study of the distributions of the distances between instances of a specific motif in genome sequences will also allow us to explore specific genomic sequence manipulations in Bioconductor. The Biostrings package provides tools for working with sequence data. The essential data structures, or classes as they known in R, are DNAString and DNAStringSet. These enable us to work with one or multiple DNA sequences efficiently The Biostrings package also contains additional classes for representing amino acid and general biological strings.. ► Question 2.21 Explore some of the useful data and functions provided in the Biostrings package by exploring the tutorial vignette. ► Solution This last command will open a list in your browser window from which you can access the documentation32 Vignettes are manuals for the packages complete with examples and case studies.. The BSgenome package provides access to many genomes, and you can access the names of the data packages that contain the whole genome sequences by typing ##  93 ##  "BSgenome.Alyrata.JGI.v1" ##  "BSgenome.Amellifera.BeeBase.assembly4" We are going to explore the occurrence of the AGGAGGT motif33 This is the Shine-Dalgarno motif which helps initiate protein synthesis in bacteria. in the geonome of E.coli. We use the genome sequence of one particular strain, Escherichia coli str. K12 substr.DH10B34 It is known as the laboratory workhorse, often used in experiments., whose NCBI accession number is NC_010473. We can count the pattern’s occurrence in windows of width 50000 using the countPattern function. ## numMatches ## 0 1 2 3 4 ## 48 32 8 3 2 ► Question 2.22 What distribution might this table fit ? ► Solution We can inspect the matches using the matchPattern function. You can type sdMatches in the R command line to obtain a summary of this object. It contains the locations of all 65 pattern matches, represented as a set of so-called views on the original sequence. Now what are the distances between them? So these are in fact the 66 complementary regions. Now let’s find a model for the distribution of the gap sizes between motifs. If the motifs occur at random locations, we expect the gap lengths to follow an exponential distribution35 How could we guess that the exponential is the right fit here? Whenever we have independent, random Bernoulli occurrences along a sequence, the gap lengths are exponential. You may be familiar with radioactive decay, where the waiting times between emissions are also exponentially distributed. It is a good idea if you are not familiar with this distribution to look up more details in the Wikipedia.. The code below (whose output is shown in Figure 2.23) assesses this assumption. If the exponential distribution is a good fit, the points should lie roughly on a straight line. The exponential distribution has one parameter, the rate, and the line with slope corresponding to an estimate from the data is also shown. Figure 2.23: Evaluation of fit to the exponential distribution (black line) of the gaps between the motifs. ► Question 2.23 There appears to be a slight deviation from the fitted line in Figure 2.23 at the right tail of the distibution, i.e., for the largest values. What could be the reason? 2.10.1 Modeling in the case of dependencies As we saw in Section 2.8, nucleotide sequences are often dependent: the probability of seing a certain nucleotide at a given position tends to depend on the surrounding sequence. Here we are going to put into practice dependency modeling using a Markov chain. We are going to look at regions of chromosome 8 of the human genome and try to discover differences between regions called CpG36 CpG stands for 5’-C-phosphate-G-3’; this means that a C is connected to a G through a phosphate along the strand (this is unrelated to C-G base-pairing of Section 2.7). The cytosines in the CpG dinucleotide can be methylated, changing the levels of gene expression. This type of gene regulation is part of epigenetics. Some more information is on Wikipedia: CpG site and epigenetics. islands and the rest. We use data (generated by Irizarry, Wu, and Feinberg (2009)) that tell us where the start and end points of the islands are in the genome and look at the frequencies of nucleotides and of the digrams ‘CG’, ‘CT’, ‘CA’, ‘CC’. So we can ask whether there are dependencies between the nucleotide occurrences and if so, how to model them. ##  65699 ## chr start end length CpGcount GCcontent pctGC obsExp ## 1 chr10 93098 93818 721 32 403 0.559 0.572 ## 2 chr10 94002 94165 164 12 97 0.591 0.841 ## 3 chr10 94527 95302 776 65 538 0.693 0.702 ## 4 chr10 119652 120193 542 53 369 0.681 0.866 ## 5 chr10 122133 122621 489 51 339 0.693 0.880 ## 6 chr10 180265 180720 456 32 256 0.561 0.893 We use the :: operator to call the filter function specifically from the dplyr package—and not from any other packages that may happen to be loaded and defining functions of the same name. This precaution is particularly advisable in the case of the filter function, since this name is used by quite a few other packages. You can think of the normal (without ::) way of calling R functions like calling people by their first (given) names; whereas the fully qualified version with :: corresponds to calling someone by their full name. At least within the reach of the CRAN and Bioconductor repositories, such fully qualified names are guaranteed to be unique. In the line above, we subset (filter) the data frame CpGtab to only chromosome 8, and then we create an IRanges object whose start and end positions are defined by the equally named columns of the data frame. In the IRanges function call (which constructs the object from its arguments), the first start is the argument name of the function, the second start refers to the column in the data frame obtained as an output from filter; and similarly for end. IRanges is a general container for mathematical intervals. We create the biological context37 The “I” in IRanges stands for “interval”; the “G” in GRanges for “genomic”. with the next line. Now let’s visualize; see the output in Figure 2.24. Figure 2.24: Gviz plot of CpG locations in a selected region of chromosome 8. We now define so-called views on the chromosome sequence that correspond to the CpG islands, irCpG, and to the regions in between (gaps(irCpG)). The resulting objects CGIview and NonCGIview only contain the coordinates, not the sequences themselves (these stay in the big object Hsapienschr8), so they are fairly lightweight in terms of storage.

We compute transition counts in CpG islands and non-islands using the data.

##  AA  AC  AG  AT  CA  CC  CG  CT  GA  GC  GG  GT  TA  TC  TG  TT
## 389 351 400 436 498 560 112 603 359 336 403 336 330 527 519 485

For a four state Markov chain as we have, we define the transition matrix as a matrix where the rows are the from state and the columns are the to state.

We use the counts of numbers of transitions of each type to compute frequencies and put them into two matrices. The transition probabilities are probabilities so the rows need to sum to 1.

##            A         C         G         T
## A 0.20457773 0.2652333 0.3897678 0.1404212
## C 0.20128250 0.3442381 0.2371595 0.2173200
## G 0.18657245 0.3145299 0.3450223 0.1538754
## T 0.09802105 0.3352314 0.3598984 0.2068492
##           A         C          G         T
## A 0.3351380 0.1680007 0.23080886 0.2660524
## C 0.3641054 0.2464366 0.04177094 0.3476871
## G 0.2976696 0.2029017 0.24655406 0.2528746
## T 0.2265813 0.1972407 0.24117528 0.3350027

► Question 2.24

Are the transitions different in the different rows? This would mean that, for instance, $$P(\mathtt{A}\,|\,\mathtt{C}) \neq P(\mathtt{A}\,|\,\mathtt{T})$$.

► Solution

► Question 2.25

Are the relative frequencies of the different nucleotides different in CpG islands compared to elsewhere ?

► Solution

► Question 2.26

How can we use these differences to decide whether a given sequence comes from a CpG island?

► Solution

Given a sequence for which we do not know whether it is in a CpG island or not, we can ask what is the probability it belongs to a CpG island compared to somewhere else. We compute a score based on what is called the odds ratio. Let’s do an example: suppose our sequence $$x$$ is ACGTTATACTACG, and we want to decide whether it comes from a CpG island or not.

If we model the sequence as a first order Markov chain we can write, supposing that the sequence comes from a CpG island:

\begin{align} P_{\text{i}}(x = \mathtt{ACGTTATACTACG}) = \; &P_{\text{i}}(\mathtt{A})\, P_{\text{i}}(\mathtt{AC})\, P_{\text{i}}(\mathtt{CG})\, P_{\text{i}}(\mathtt{GT})\, P_{\text{i}}(\mathtt{TT}) \times \nonumber\\ &P_{\text{i}}(\mathtt{TA})\, P_{\text{i}}(\mathtt{AT})\, P_{\text{i}}(\mathtt{TA})\, P_{\text{i}}(\mathtt{AC})\, P_{\text{i}}(\mathtt{CG}). \tag{2.7} \end{align}

We are going to compare this probability to the probability for non-islands. As we saw above, these probabilities tend to be quite different. We will take their ratio and see if it is larger or smaller than 1. These probabilties are going to be products of many small terms and become very small. We can work around this by taking logarithms.

\begin{align} \log&\frac{P(x\,|\, \text{island})}{P(x\,|\,\text{non-island})}=\nonumber\\ \log&\left( \frac{P_{\text{i}}(\mathtt{A})\, P_{\text{i}}(\mathtt{A}\rightarrow \mathtt{C})\, P_{\text{i}}(\mathtt{C}\rightarrow \mathtt{G})\, P_{\text{i}}(\mathtt{G}\rightarrow \mathtt{T})\, P_{\text{i}}(\mathtt{T}\rightarrow \mathtt{T})\, P_{\text{i}}(\mathtt{T}\rightarrow \mathtt{A})} {P_{\text{n}}(\mathtt{A})\, P_{\text{n}}(\mathtt{A}\rightarrow \mathtt{C})\, P_{\text{n}}(\mathtt{C}\rightarrow \mathtt{G})\, P_{\text{n}}(\mathtt{G}\rightarrow \mathtt{T})\, P_{\text{n}}( \mathtt{T}\rightarrow \mathtt{T})\, P_{\text{n}}( \mathtt{T}\rightarrow \mathtt{A})} \right. \times\nonumber\\ &\left. \frac{P_{\text{i}}(\mathtt{A}\rightarrow \mathtt{T})\, P_{\text{i}}(\mathtt{T}\rightarrow \mathtt{A})\, P_{\text{i}}(\mathtt{A}\rightarrow \mathtt{C})\, P_{\text{i}}(\mathtt{C}\rightarrow \mathtt{G})} {P_{\text{n}}(\mathtt{A}\rightarrow \mathtt{T})\, P_{\text{n}}(\mathtt{T}\rightarrow \mathtt{A})\, P_{\text{n}}(\mathtt{A}\rightarrow \mathtt{C})\, P_{\text{n}}(\mathtt{C}\rightarrow \mathtt{G})} \right) \tag{2.8} \end{align}

This is the log-likelihood ratio score. To speed up the calculation, we compute the log-ratios $$\log(P_{\text{i}}(\mathtt{A})/P_{\text{n}}(\mathtt{A})),..., \log(P_{\text{i}}(\mathtt{T}\rightarrow \mathtt{A})/P_{\text{n}}(\mathtt{T}\rightarrow \mathtt{A}))$$ once and for all and then sum up the relevant ones to obtain our score. Worked out examples and many useful details can be found in Durbin et al. (1998).

##          A
## -0.2824623

In the code below, we pick sequences of length len = 100 out of the 2855 sequences in the seqCGI object, and then out of the 2854 sequences in the seqNonCGI object (each of them is a DNAStringSet). In the first three lines of the generateRandomScores function, we drop sequences that contain any letters other than A, C, T, G; such as “.” (a character used for undefined nucleotides). Among the remaining sequences, we sample with probabilities proportional to their length minus len and then pick subsequences of length len out of them. The start points of the subsequences are sampled uniformly, with the constraint that the subsequences have to fit in. Figure 2.25: Island and non-island scores as generated by the function generateRandomScores. This is the first instance of a mixture we encounter. We will revisit them in Chapter 4.

We can consider these our training data: from data for which we know the types, we can see whether our score is useful for discriminating – see Figure 2.25.

2.11 Summary of this chapter

In this chapter we experienced the basic yoga of statistics: how to go from the data back to the possible generating distributions and how to estimate the parameters that define these distributions. Statistical models We showed some specific statistical models for experiments with categorical outcomes (binomial and multinomial).

Goodness of fit We used different visualizations and showed how to run simulation experiments to test whether our data could be fit by a fair four box multinomial model. We encountered the chi-square statistic and saw how to compare simulation and theory using a qq-plot.

Estimation We explained maximum likelihood and Bayesian estimation procedures. These approaches were illustrated on examples involving nucleotide pattern discovery and haplotype estimations.

Prior and posterior distributions When assessing data of a type that has been been previously studied, such as haplotypes, it can be beneficial to compute the posterior distribution of the data. This enables us to incorporate uncertainty in the decision making, by way of a simple computation. The choice of the prior has little effect on the result as long as there is sufficient data.

CpG islands and Markov chains We saw how dependencies along DNA sequences can be modeled by Markov chain transitions. We used this to build scores based on likelihood ratios that enable us to see whether long DNA sequences come from CpG islands or not. When we made the histogram of scores, we saw in Figure 2.25 a noticeable feature: it seemed to be made of two pieces. This bimodality was our first encounter with mixtures, they are the subject of Chapter 4.

This is the first instance of building a model on some training data: sequences which we knew were in CpG islands, that we could use later to classify new data. We will develop a much more complete way of doing this in Chapter 12.

One of the best introductory statistics books available is Freedman, Pisani, and Purves (1997). It uses box models to explain the important concepts. If you have never taken a statistics class, or you feel you need a refresher, we highly recommend it. Many introductory statistics classes do not cover statistics for discrete data in any depth. The subject is an important part of what we need for biological applications. A book-long introduction to these types of analyses can be found in (Agresti 2007).

Here we gave examples of simple unstructured multinomials. However, sometimes the categories (or boxes) of a multinomial have specific structure. For instance, the 64 possible codons code for 20 amino acids and the stop codons (61+3). So we can see the amino acids themselves as a multinomial with 20 degrees of freedom. Within each amino acid there are multinomials with differing numbers of categories (Proline has four: CCA, CCG, CCC, CCT, see exercise2.3). Some multivariate methods have been specificly designed to decompose the variability between codon usage within the differently abundant amino-acids (Grantham et al. 1981; Perrière and Thioulouse 2002), and this enables discovery of latent gene transfer and translational selection. We will cover the specific methods used in those papers when we delve into the multivariate exploration of categorical data in Chapter 9.

There are many examples of successful uses of the Bayesian paradigm to quantify uncertainties. In recent years the computation of the posterior distribution has been revolutionized by special types of Monte Carlo that use either a Markov chain or random walk or Hamiltonian dynamics. These methods provide approximations that converge to the correct posterior distribution after quite a few iterations. For examples and much more see (Robert and Casella 2009; Marin and Robert 2007; McElreath 2015).

2.13 Exercises

► Exercise 2.1

Generate 1,000 random 0/1 variables that model mutations occurring along a 1,000 long gene sequence. These occur independently at a rate of $$10^{-4}$$ each. Then sum the 1,000 positions to count how many mutations in sequences of length 1,000.

Find the correct distribution for these mutation sums using a goodness of fit test and make a plot to visualize the quality of the fit.

► Exercise 2.2

Make a function that generates $$n$$ random uniform numbers between $$0$$ and $$7$$ and returns their maximum. Execute the function for $$n=25$$. Repeat this procedure $$B=100$$ times. Plot the distribution of these maxima.
What is the maximum likelihood estimate of the maximum of a sample of size 25 (call it $$\hat{\theta}$$)?
Can you find a theoretical justification and the true maximum $$\theta$$?

► Exercise 2.3

A sequence of three nucleotides (a codon) taken in a coding region of a gene can be transcribed into one of 20 possible amino acids. There are $$4^3=64$$ possible codon sequences, but only 20 amino acids. We say the genetic code is redundant: there are several ways to spell each amino acid.

The multiplicity (the number of codons that code for the same amino acid) varies from 2 to 6. The different codon-spellings of each amino acid do not occur with equal probabilities. Let’s look at the data for the standard laboratory strain of tuberculosis (H37Rv):

##   AmAcid Codon Number PerThous
## 1    Gly   GGG  25874    19.25
## 2    Gly   GGA  13306     9.90
## 3    Gly   GGT  25320    18.84
## 4    Gly   GGC  68310    50.82

The codons for the amino acid proline are of the form $$CC*$$ $$*$$ stands for any of the 4 letters, using the computer notation for a regular expression., and they occur with the following frequencies in Mycobacterium turberculosis:

##  0.54302025 0.10532985 0.05859765 0.29305225

a) Explore the data mtb using table to tabulate the AmAcid and Codon variables.
b) How was the PerThous variable created?
c) Write an R function that you can apply to the table to find which of the amino acids shows the strongest codon bias, i.e., the strongest departure from uniform distribution among its possible spellings.

► Exercise 2.4

Display GC content in a running window along the sequence of Staphylococcus Aureus. Read in a fasta file sequence from a file.

a) Look at the complete staph object and then display the first three sequences in the set.
b) Find the GC content in tsequence windows of width 100.
c) Display the GC content in a sliding window as a fraction.
d) How could we visualize the overall trends of these proportions along the sequence?

► Exercise 2.5

Redo a figure similar to Figure 2.17, but include two other distributions: the uniform (which is B(1,1)) and the B($$\frac{1}{2},\frac{1}{2}$$). What do you notice?

► Exercise 2.6

Choose your own prior for the parameters of the beta distribution. You can do this by sketching it here: https://jhubiostatistics.shinyapps.io/drawyourprior. Once you have set up a prior, re-analyse the data from Section 2.9.2, where we saw $$Y = 40$$ successes out of $$n=300$$ trials. Compare your posterior distribution to the one we obtained in that section using a QQ-plot.

References

Agresti, Alan. 2007. An Introduction to Categorical Data Analysis. John Wiley.

Cannings, Chris, and Anthony WF Edwards. 1968. “Natural Selection and the de Finetti Diagram.” Annals of Human Genetics 31 (4). Wiley Online Library: 421–28.

Cleveland, William S. 1988. The Collected Works of John W. Tukey: Graphics 1965-1985. Vol. 5. CRC Press.

Durbin, Richard, Sean Eddy, Anders Krogh, and Graeme Mitchison. 1998. Biological Sequence Analysis. Cambridge University Press.

Elson, D, and E Chargaff. 1952. “On the Desoxyribonucleic Acid Content of Sea Urchin Gametes.” Experientia 8 (4). Springer: 143–45.

Finetti, Bruno de. 1926. “Considerazioni Matematiche Sull’ereditarieta Mendeliana.” Metron 6: 3–41.

Freedman, David, Robert Pisani, and Roger Purves. 1997. Statistics. New York, NY: WW Norton.

Grantham, Richard, Christian Gautier, Manolo Gouy, M Jacobzone, and R Mercier. 1981. “Codon Catalog Usage Is a Genome Strategy Modulated for Gene Expressivity.” Nucleic Acids Research 9 (1). Oxford Univ Press: 213–13.

Irizarry, Rafael A, Hao Wu, and Andrew P Feinberg. 2009. “A Species-Generalized Probabilistic Model-Based Definition of Cpg Islands.” Mammalian Genome 20 (9-10). Springer: 674–80.

Love, Michael I, Wolfgang Huber, and Simon Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for RNA-seq Data with DESeq2.” Gnome Biology 15 (12): 1–21.

Marin, Jean-Michel, and Christian Robert. 2007. Bayesian Core: A Practical Approach to Computational Bayesian Statistics. Springer Science & Business Media.

McElreath, Richard. 2015. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Chapman; Hall/CRC.

Mourant, AE, Ada Kopec, and K Domaniewska-Sobczak. 1976. “The Distribution of the Human Blood Groups 2nd Edition.” Oxford University Press London.

Perrière, Guy, and Jean Thioulouse. 2002. “Use and Misuse of Correspondence Analysis in Codon Usage Studies.” Nucleic Acids Research 30 (20). Oxford Univ Press: 4548–55.

Rice, John. 2006. Mathematical Statistics and Data Analysis. Cengage Learning.

Robert, Christian, and George Casella. 2009. Introducing Monte Carlo Methods with R. Springer Science & Business Media.

Page built: 2019-08-27

Support for maintaining the online version of this book is provided by de.NBI 