10 Networks and Trees

Networks and trees are often used to represent both biological data and knowledge about a system. Phylogenetic trees were used to represent family relationships between species even before Darwin’s famous notebook sketch. Networks are often used to schematize complex interactions between proteins involved in diseases as in Figure 10.1.

This graph represents genetic interactions that appeared to be modified in cancer patients where perturbations appeared in a chemokine signaling network. Figure 10.1: This graph represents genetic interactions that appeared to be modified in cancer patients where perturbations appeared in a chemokine signaling network.

We saw in Chapter 2 that we could model state transitions as a Markov chain and these can be usefully represented as directed graphs with weights on the edges. Metabolic pathways in which nodes are chemical metabolites and the edges represent chemical reactions. Mutation history trees are used in cancer genomics to infer lineages of mutations.

Transmission networks are important in studying the epidemiology of infectious diseases. As real networks can be very large, we will need special methods for representing and visualizing them. This chapter will be focused on ways of integrating graphs into a data analytic workflow.

10.1 Goals for this chapter

In this chapter we will

  • Use the formal definition of a graph’s components: edges, vertices, layout to see how we can manipulate them in R using both adjacency matrices and lists of edges.

  • We will transform a graph object from igraph into an object that can be visualized according to the layers approach in ggplot2 using ggnetwork. We will experiment with covariates that we attach to graph edges and nodes.

  • Graphs are useful ways of encoding prior knowledge about a system, and we will see how they enable us to go from simple gene set analyses to meaningful biological recommendations by mapping significance scores onto the network to detect perturbation hotspots.

  • We will build phylogenetic trees starting with from DNA sequences and then visualize these trees with the specifically designed R packages ape and ggtree.

  • We will combine a phylogenetic tree built from microbiome 16S rRNA data with covariates to show how the hierarchical relationship between taxa can increase the power in multiple hypothesis testing.

  • A special tree called a minimum spanning tree (MST) is very useful for testing the relations between a graph and other covariates. We’ll see how to implement different versions of what is known as the Friedman-Rafsky test. We’ll study both co-occurrence of bacteria in mice litters and strain similarities in HIV contagion networks.

10.2 Graphs

10.2.1 What is a graph and how can it be encoded?

A graph is formed by a set of nodes or vertices (often called \(V\)) and a set of edges between these vertices (\(E\)). Edges \(E\) are provided as unordered pairs of vertices in undirected graphs and ordered pairs for directed or oriented graphs.

An adjacency matrix \(\mathbf{A}\) is the matrix representation of \(E\). \(\mathbf{A}\) is a square matrix with as many rows as nodes in the graph. \(\mathbf{A}\) contains a non zero entry in the \(i\)th row and \(j\)th column if there is an edge between the \(i\)th and \(j\)th vertices.

► Question 10.1

For graphs with undirected edges what is special about the adjacency matrix \(A\)?

► Solution

A small undirected graph with numbered nodes. Figure 10.3: A small undirected graph with numbered nodes.

library("igraph")
edges1 = matrix(c(1,3,2,3,3,4,4,5,4,6),byrow = TRUE, ncol = 2)
g1 = graph_from_edgelist(edges1, directed = FALSE)
plot(g1, vertex.size = 25, edge.width = 5, vertex.color = "coral")

► Question 10.2

Can you give an alternative way of reading the graph from an edge set dataframe?

► Solution

Elements of a simple graph:

  • The nodes or vertices which are the colored circles with numbers in them in Figure 10.3.

  • Edges or connections, the segments that join the nodes and which can be directed or not.

  • Edge lengths, when not specified, we suppose they are all one and compute the distance between vertices on the graph as the number the edges traversed. On the other hand, in many situations we have meaningful edge lengths or strengths of connections between vertices that we can use both in the plots and analyses.

  • Edge and node attributes: optionally, each edge or each node can be mapped to further continuous or categorical variables.

We call a weighted, directed graph a network. Networks have adjacency matrices \({\bf A}\) which are \(n\) by \(n\) matrices of positive numbers corresponding to the edge lengths.

Graph statistics

For large graphs, some authors like to summarize overall graph structure such as vertex degrees (the number of edges connected to the vertex), centrality or betweenness. We focus on combining graphs with many other sources of data. So, we will not delve deeply into these statistics, though they are available in the various packages (network, igraph) we use.

If the number of edges is similar to the number of nodes (written \(\#E\sim O(\#V)\)), we say that the graph is sparse. Some graphs have many nodes, for instance the package ppiData contains a predicted protein interaction (ppipred) graph on about 2500 proteins with 20,000 edges. A complete adjacency matrix for this graph requires 6,250,000 elements. In fact a list of edges and `weights’ are sufficient: this only use 60,000 elements. So storing a large graph as an adjacency matrix can be wasteful. If a graph is sparse we can use sparse encodings similar to those implemented in the package Matrix.

On the other hand, if the number of edges is (approximately) a quadratic function of the nodes (written \(\#E\sim O(\#V^2)\)), our graph is dense. Memory space can be an issue for the storage of large graphs.

An enriched graph can contain: Arrows and directed edges In directed graphs we differentiate between in-degree and out-degree and can identify graphs that have cycles. Annotation variables on nodes and edges The strength of a link in a graph can be visualized by the width of the edge. Covariates can be added to nodes. A continuous covariate can be associated to the size of the node in the graphical representation. A categorical value associated to the color.

Graph layout

We will see several examples where the same graph is plotted in different ways, either for aesthetic or practical reasons. This is done through the choice of the graph layout.

When the edges have lengths representing distances the problem of a 2D representation of the graph is the same as the multidimensional scaling we saw in Chapter 9. It is often solved in a similar way by spreading out the vertex-points as much as possible. In the simple case of edges without lengths, the algorithms can choose different criteria, we’ll see that the method of Fruchterman and Reingold often appears as a default choice. It is based on a physical model where similar points attract and repel each other as if under the effect of physical forces.

► Task

Use the igraph package to do the following

  • Create a dense random graph with 12 nodes and more than 50 edges.

  • Experiment plotting the graph with different layouts: place the nodes on a circle, or represent the graph as symmetrically as possible, avoid any overlapping nodes or edges.

Graphs from data

Usually data do not arrive in the form of graphs. Graphical or network representations are often the result of transforming other data types: From distances: Graphs can simplify distance matrices by rounding the smaller distances down to zero. This requires defining a threshold \(t\) and then making edges between points whose distances are smaller than the threshold \(t\); these are often called geometric graphs.

Binary data Some data arrive naturally as 0/1 matrices, for instance presence or absence of the different finches in the Galapagos archipelago can be encoded with a 0/1 table. We can build a special type of graph between the finch-vertices and the island-vertices that is equivalent to such a table. That graph is called a bipartite graph; there are only edges between a taxon and an island, and none between taxa, nor between islands. An example is shown in Figure 10.4.

This **bipartite** graph connects each taxon to the sites where it was observed. Figure 10.4: This bipartite graph connects each taxon to the sites where it was observed.

► Question 10.3

Load the finch data from the networksis package and experiment plotting the data to highlight that this is a bipartite network by modifying the following lines of code (Figure 10.5):

► Solution

► Question 10.4

Make a dataframe from the graph g1 using the ggnetwork package, then plot it using ggplot and the provided geoms geom_edges, geom_nodes and geom_nodetext.

► Solution

An example: a four state Markov chain

In Chapter 2 we saw how a Markov chain can summarize transitions between nucleotides (considered the states of the system). This is often schematized by a graph, the igraph package enables many choices for graph `decoration’:

library("markovchain")
statesNames = c("A", "C", "G","T")
T1MC = new("markovchain", states = statesNames, transitionMatrix =
  matrix(c(0.2,0.1,0.4,0.3,0,1,0,0,0.1,0.2,0.2,0.5,0.1,0.1,0.8,0.0),
    nrow = 4,byrow = TRUE, dimnames=list(statesNames,statesNames)))
plot(T1MC, edge.arrow.size = 0.4, vertex.color="purple",
      edge.arrow.width = 2.2, edge.width = 5, edge.color = "blue",
      edge.curved = TRUE, edge.label.cex = 2.5, vertex.size= 32,
      vertex.label.cex = 3.5, edge.loop.angle = 3,
      vertex.label.family="sans", vertex.label.color = "white")

A four state Markov chain with arrows representing possible transitions between states. Figure 10.7: A four state Markov chain with arrows representing possible transitions between states.

Markov chains are idealized models of dynamical systems and the states are represented by the nodes in the graph. The transition matrix gives us the weights on the directed edges (arrows) between the states.

► Question 10.5

Which state do you think this Markov chain will end up in ?

► Task


a) Try changing your set.seed function input and see if it changes the plot.
b) Access the help for this particular plot function.
c) Redo the graph and label the edges with the transition probabilities in green and vertices in brown.

We will see how to build a complete example of annotated state space Markov chain graph in Exercise 10.3.

10.2.2 Graphs with many layers: labels on edges and nodes

Here is an example of plotting a graph downloaded from the STRING database with annotations at the vertices.

library("ggnetwork")
oldpar=par(mar=c(5.1,4.7,4.1,2.6))
datf = read.table("../data/string_graph.txt", header = TRUE)
grs = graph_from_data_frame(datf[, c("node1", "node2")], directed = FALSE)
E(grs)$weight = 1
V(grs)$size = centralization.degree(grs)$res
ggdf = ggnetwork(grs, layout = "fruchtermanreingold", cell.jitter = 0)
ggplot(ggdf, aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_edges(color = "black", curvature = 0.1, size = 0.95, alpha = 0.8)+
  geom_nodes(aes(x = x, y = y), size = 3, alpha = 0.5, color = "orange") +
geom_nodelabel_repel(aes(label = vertex.names), size=4, color="#8856a7") +
#  geom_nodetext(aes(label = vertex.names), size = 4, color = "#8856a7") +
  theme_blank() + theme(legend.position = "none")
par(oldpar)

Figure 10.8: Perturbed chemokine subnetwork uncovered in Yu et al. (2012) using differential gene expression patterns in sorted T-cells. Notice the clique-like structure of the genes CXCR3, CXCL13, CCL19, CSCR5 and CCR7 in the right hand corner.

Perturbed chemokine subnetwork uncovered in  @YuGXNA using differential gene expression patterns in sorted T-cells. Notice the clique-like structure of the genes CXCR3, CXCL13, CCL19, CSCR5 and CCR7 in the right hand corner.

Figure 10.8 shows the full perturbed chemokine subnetwork discovered in the study of breast cancer metastasis using GXNA (Nacu et al. 2007) and reported by Yu et al. (2012).

10.3 From gene set enrichment to networks

A long unstructured laundry list of possibly differentially expressed genes can be daunting. In Chapter 8, we studied methods for finding a list of differentially expressed genes. Small sample sizes, coupled with efforts to maintain low FDRs, often result in low power to detect differential expression. Therefore, obtaining a long list of genes that can be confidently declared as differentially expressed is, initially, a triumph. However, understanding the underlying biology requires more than just a laundry list of significant players in a biological system.

10.3.1 Methods using pre-defined gene sets (GSEA)

One of the earliest approaches was to look for gene attributes that are overrepresented or enriched in the laundry list of significant genes. These gene classes are often based on Gene Ontology (GO) categories (for example, genes that are involved in organ growth, or genes that are involved in feeding behavior). The Gene Ontology (GO) is a collection of three ontologies that describe genes and gene products. These ontologies are restricted vocabularies that have the structure of directed acyclic graphs (DAGS). The most specific terms are the leaves of the graph. The GO graph consists of nodes (here, Gene Ontology terms) and edges from more specific terms (children) to less specific (parents), often these edges are directed. Nodes and edges can have multiple attributes that can be visualized. The main purpose of using GO annotations for a particular set of Genes designated as significant in an experiment is to look for the enrichment of a GO term in this list, we will give this term a statistical meaning below. Many other useful lists of important gene sets exist.

► Task

Find a useful database of gene sets.

For instance the MsigDB Molecular Signature Database (Liberzon et al. 2011) contains many gene sets that can be accessed from within R using the Bioconductor package GSEABase command getBroadSets as follows:

library("GSEABase")
## Not run, this requires a login to the website.
fl  =  "/path/to/msigdb_v5.1.xml"
gss  =  getBroadSets(fl) # read entire msigdb
organism(gss[[1]])
table(sapply(gss, organism))

10.3.2 Gene set analysis with two-way table tests

Yello w Blue Red
Significant 25 25 25
Universe 500 100 400

Here, we start by explaining a basic approach often called or hypergeometric testing.

So-called 'exact' tests because they are nonparametric and based on exhaustive enumerations: not because we are sure of the answer – this is statistics after all.

Define a universe of candidate genes that may potentially be significant; say this universe is of size \(N\). We also have a record of the genes that actually did come out significant, of which we suppose there were \(m\).

We make a toy model involving balls in boxes, with a total of \(N\) balls corresponding to the genes identified in the gene universe. These genes are split into different functional categories, suppose there are \(N=1,000\) genes, of which 500 are yellow, 100 are blue and 400 are red. Then a subset of \(m=75\) genes are labeled as significant, suppose among these significantly interesting genes, there are 25 yellow, 25 red and 25 blue. Is the blue category enriched or overrepresented?

We use this hypergeometric two-way table testing to account for the fact that some categories are extremely numerous and others are rarer.

► Question 10.6

Run a Monte Carlo experiment with 20,000 simulations and compute the p-value of significance of having 25 blues under the null hypothesis that no category is over-represented in the significant set.

► Solution

In the general case, the gene universe is an urn with \(N\) balls, if we pick the \(m\) balls at random and there is a proportion of \(k/N\) blue balls, we expect to see \(km/N\) blue balls in a draw of size \(k\).

Plotting gene enrichment networks with GOplot

Here we show an attractive way of summarizing the connections between the gene functional categories and the significant gene set.

library("GOplot")
data("EC")
circ  =  circle_dat(EC$david, EC$genelist)
chord =  chord_dat(circ, EC$genes, EC$process)
GOChord(chord, limit = c(0, 5))

Figure 10.10: This graph shows the correspondence between GO terms and significantly changed genes in a study on differential expression in endothelial cells from two steady state tissues (brain and heart, see Nolan et al. (2013)). After normalization a differential expression analysis was performed giving a list of genes. A gene-annotation enrichment analysis of the set of differentially expressed genes (adjusted p-value < 0.05) was then performed with the GOplot package.

This graph shows the correspondence between GO terms and significantly changed genes in a study on differential expression in endothelial cells from two steady state tissues (brain and heart, see  @Nolan:2013). After normalization a differential expression analysis was performed giving a list of genes. A gene-annotation enrichment analysis of the set of differentially expressed genes (adjusted p-value < 0.05) was then performed with the **[GOplot](https://cran.r-project.org/web/packages/GOplot/)** package.

In fact, the Gene Ontology graph does not necessarily capture meaningful gene interactions as genes from different processes often interact productively. A large amount of information remains unused, for example, all significant genes are usually given equal weight, despite the potentially large variations in their p-values.

10.3.3 Significant subgraphs and high scoring modules

We have at our disposal more than just the Gene Ontology. There are many different databases of gene networks from which we can choose a known skeleton graph onto which we project significance scores such as p-values from our differential expression experiment. We will follow an idea first suggested by Ideker et al. (2002). This is further developed in Nacu et al. (2007). A careful implementation with many improvements is available as the Bioconductor package BioNet (Beisser et al. 2010). These method all search for the subgraphs or modules of a scored-skeleton network that seem to be particularly perturbed.

Each gene-node in the network is assigned a score that can either be calculated from a t-statistic or a p-value. Often pathways contain both upregulated and downregulated genes; as pointed out in Ideker et al. (2002), this can be captured by taking absolute values of the test statistic or just incorporating scores computed from the p-values129 We’ll want something like \(-\log p\), so that small p-values give large scores. Beisser et al. (2010) model the pvalues of the genes as we did in Chapter 6: mixture of non-perturbed genes whose pvalues will be uniformly distributed and non uniformly distributed pvalues from the perturbed genes. We model the signal in the data using a beta distribution for the p-values following Pounds and Morris (2003).

Given our node-scoring function; we search for connected hotspots in the graph, ie a subgraph of genes with high combined scores.

Using a subgraph search algorithm

Finding the maximal scoring subgraph of a generic graph is known to be intractable in general (we say it is an NP-hard problem), so various approximate algorithms have been proposed. Ideker et al. (2002) suggested using simulated annealing, however this is slow and tends to produce large subgraphs that are difficult to interpret. Nacu et al. (2007) started with a seed vertex and gradually expand around it. Beisser et al. (2010) started the search with a so-called minimal spanning tree (MST), a graph we we will study later in this chapter.

10.3.4 An example with the BioNet implementation

To illustrate the method, we show data from the BioNet package. The interactome data contains a connected component of the network comprising 2034 different gene products and 8399 interactions. This constitutes the skeleton graph with which we will work, see Beisser et al. (2010).

The dataLym contains the relevant pvalues and \(t\) statistics for 3,583 genes, you can access them and do the analysis as follows:

library("BioNet")
library("DLBCL")
data("dataLym")
data("interactome")
interactome
## A graphNEL graph with undirected edges
## Number of Nodes = 9386 
## Number of Edges = 36504
pval = dataLym$t.pval
names(pval)  =  dataLym$label
subnet = subNetwork(dataLym$label, interactome)
subnet = rmSelfLoops(subnet)
subnet
## A graphNEL graph with undirected edges
## Number of Nodes = 2559 
## Number of Edges = 7788

Fit a Beta-Uniform model

The p-values are fit to the type of mixture we studied in Chapter 4, with a uniform component from the null with probability \(\pi_0\) and a beta distribution (proportional to \(a x^{a - 1}\)) for the p-values corresponding to the alternatives (Pounds and Morris 2003). \[f(x|a,\pi_0)= \pi_0 + (1-\pi_0) a x^{a - 1}\qquad \mbox{ for } 0 <x \leq 1; \; 0<a<1\] The package actually gives a different name to \(\pi_0\): it uses \(\lambda\) and calls it the mixing parameter. Running the model with an fdr of 0.001:

fb = fitBumModel(pval, plot = FALSE)
fb
## Beta-Uniform-Mixture (BUM) model
## 
## 3583 pvalues fitted
## 
## Mixture parameter (lambda):  0.482
## shape parameter (a):         0.180
## log-likelihood:          4471.8
scores=scoreNodes(subnet, fb, fdr = 0.001)

The qqplot shows the quality of the fit of beta-uniform mixture model to the data. The red points have the theoretical quantiles from the beta distribution as the x coordinates the observed quantiles and the y coordinates. The blue line shows that this model fits nicely. Figure 10.11: The qqplot shows the quality of the fit of beta-uniform mixture model to the data. The red points have the theoretical quantiles from the beta distribution as the x coordinates the observed quantiles and the y coordinates. The blue line shows that this model fits nicely.

A histogram of the mixture components for the p-values, the beta in red and the uniform in blue, $\pi_0$ is the mixing proportion assigned to the null component whose distribution should be uniform. Figure 10.12: A histogram of the mixture components for the p-values, the beta in red and the uniform in blue, \(\pi_0\) is the mixing proportion assigned to the null component whose distribution should be uniform.

Then we run a heuristic search for a high scoring subgraph using:

hotSub  =  runFastHeinz(subnet, scores)
hotSub
## A graphNEL graph with undirected edges
## Number of Nodes = 153 
## Number of Edges = 243
logFC=dataLym$diff
names(logFC)=dataLym$label

► Question 10.7

We made Figure 10.13 using the following code:

plotModule(hotSub, layout = layout.davidson.harel, scores = scores,
                  diff.expr = logFC)

Figure 10.13: The subgraph found as maximally enriched for differential expression between ABC and GCB B-cell lymphoma. The nodes are colored in red and green: green shows an upregulation in ACB and red an upregulation in GBC. The shape of the nodes depicts the score: rectangles indicate a negative score, circles a positive score.

The subgraph found as maximally enriched for differential expression between ABC and GCB B-cell lymphoma. The nodes are colored in red and green: green shows an upregulation in ACB and red an upregulation in GBC. The shape of the nodes depicts the score: rectangles indicate a negative score, circles a positive score.

Using the function igraph.from.graphNEL, transform the module object and plot it using the ggnetwork method shown in Section 10.2.2.

10.4 Phylogenetic Trees

As mathematical objects, the hierarchical clustering trees (studied in Chapter \@ref(Chap:Clustering)) are the same as phylogenetic trees. They are **rooted binary** trees with labels at the tips. Figure 10.14: As mathematical objects, the hierarchical clustering trees (studied in Chapter 5) are the same as phylogenetic trees. They are rooted binary trees with labels at the tips.

One really important use of graphs in biology is the construction of phylogenetic trees. Trees are graphs with no cycles (the official word for loops, whether self loops, or ones that go through several vertices). Phylogenetic trees are usually rooted binary trees that only have labels on the leaves corresponding to contemporary130 Because they are contemporary, the trees are often represented so that the leaves are a ll the same distance from the root. taxa at the tips. The inner nodes correspond to ancestral sequences which have to be inferred from the contemporaneous data on the tips. Many methods use aligned DNA sequences from the different species or populations to infer or estimate the tree. The tips of the tree are usually called OTUs (Operational Taxonomic Units). The statistical parameter of interest in these analyses is the rooted binary tree with OTU labels on its leaves (see Holmes (1999, 2003b) for details).

The example of HIV

This phylogenetic tree describes the history of different HIV/SIV strains in Africa  [@Wertheim:2009], [Figure from]. Figure 10.15: This phylogenetic tree describes the history of different HIV/SIV strains in Africa (Wertheim and Worobey 2009), [Figure from].

HIV is a virus that protects itself by evolving very fast (within months, several mutations can appear). Its evolution can thus be followed in real time; whereas the evolution of large organisms which has happened over millions of years. HIV trees are built for medical purposes such as the detection and understanding of drug resistance. They are estimated for individual genes. Different genes can show differences in their evolutionary histories and thus produce different gene trees. The phylogenetic tree in Figure 10.15 shows times when the virus switched from monkeys to humans (Wertheim and Worobey 2009).

Special elements in phylogenies

  • Most phylogenetic trees are shown rooted, the `root’ is usually found by including an outgroup in the tree tips, as we will see later.
  • Characters that are derived from this common ancestry are called homologous (geneticists doing population studies replace the term homology by identity by descent (IBD)).
  • Sisters on the tree defined by a common ancestor are called clades or monophyletic groups, they have more than just similarities in common.

10.4.1 Markovian models for evolution.

To infer what happened in the ancestral species from contemporary data collected on the tips of the tree, we have to make assumptions about how substitutions and deletions occur through time. The models we use are all Markovian and said to be time homogeneous: the mutation rate is constant across history.

This is called the molecular clock hypothesis, if we do not make this assumption we run into what is known as non-identifiability (ie we can't tell the difference between the many possibile mutational histories given the observed data).

Continuous Markov chain and generator matrix

We are going to use the Markov chain we saw in Figure 10.7 on the states A,C,G,T; however now we consider that the changes of state, i.e. mutations, occur at random times. The gaps between these mutational events will follow an exponential distribution. These continuous time Markov chains have the following properties:

  • No Memory. \(P(Y(u+t)=j\;|\;Y(t)=i)\) does not depend on times before t.

  • Time homogeneity. The probability \(P(Y(h+t)=j\,|\,Y(t)=i)\) does not depend on \(t\), but on \(h\), the time between the events and on \(i\) and \(j\).

  • Linearity. The instantaneous transition rate is of an approximately linear form

We use an error term written here as \(o(h)\), we read this little \(o\) of \(h\), which just means that this error terms grows much slower (i.e., sublinear) than \(h\).

\[\begin{align} P_{ij}(h)&=q_{ij}h+o(h), \quad\text{for } j\neq i \tag{10.1} \\ P_{ii}(h)&=1-q_i(h)+ o(h), \qquad\text{where } q_i=\sum_{j\neq i}q_{ij}. \tag{10.2} \end{align}\]

\(q_{ij}\) is known as the instantaneous transition rate. These rates define matrices as in Table 10.1.

  • Exponential distribution. Times between changes are supposed to be exponentially distributed.

Table 10.1: Two examples of rate matrices, on the left:the Jukes-Cantor (JC69) model, on the right is shown the Kimura (K80) two parameter model.

\(Q=\begin{array}{lcccc} & A &T &C & G\\A&-3\alpha & \alpha &\alpha&\alpha\\T&\alpha & -3\alpha&\alpha &\alpha\\C&\alpha & \alpha &-3\alpha&\alpha\\G&\alpha &\alpha &\alpha &-3\alpha\\\end{array}\) \(Q=\begin{array}{lcccc} & A &T &C & G\\A&-\alpha-2\beta & \beta &\beta&\alpha\\T&\beta & -\alpha-2\beta &\alpha &\beta\\C&\beta & \alpha &-\alpha-2\beta&\beta\\G&\alpha &\beta &\beta &-\alpha-2\beta\\\end{array}\)

the instantaneous change probability matrix called the generator. In the simplest possible model, called a Jukes-Cantor model; all the mutations are equally likely (see the left of Table 10.1). A slightly more flexible model, called the Kimura model is shown on the right in Table 10.1.

► Question 10.8

Why do we say the Kimura model is more flexible ?

► Solution

Vocabulary overload here! : Transitions in this context mean mutational changes within the purines (tt A<->G) or within the pyrimidines (tt C <-> T); whereas when we talked about Markov chains earlier our transition matrix contains all probabilities of any state changes.

The most flexible model is called the Generalized Time Reversible (GTR) model; it has 6 free parameters. We are going to show an example of data simulated according to these generative models from a known tree.

10.4.2 Simulating data and plotting a tree

Suppose we already know our phylogenetic tree and want to simulate the evolution of the nucleotides down this tree. First we visualize the tree tree1 using ggtree; loading the tree and the relevant packages with:

library("phangorn")
library("ggtree")
load("../data/tree1.RData")

► Task

Use the ggtree function to plot tree1; make the tips of the tree green triangles, the ancestral nodes, red circles.

This is the tree we use as our *true* parameter. We generate nucleotides one at a time from the root and `dropping' them down the tree. With some probability proportional to the edge lengths, mutations occur down the branches. Figure 10.16: This is the tree we use as our true parameter. We generate nucleotides one at a time from the root and `dropping’ them down the tree. With some probability proportional to the edge lengths, mutations occur down the branches.

ggtree(tree1, lwd = 2, color = "darkgreen", alpha = 0.8, right = TRUE) +
  geom_tiplab(size = 7, angle = 90, offset = 0.05) +
  geom_point(aes(shape = isTip, color = isTip), size = 5, alpha = 0.6)

Now we generate some sequences from our tree. Each sequence starts with a new nucleotide letter generated randomly at the root; mutations may occur as we go down the tree. You can see in Figure 10.17 that the colors are not equally represented, because the frequency at the root was chosen to be different from the uniform, see the following code.

seqs6 = simSeq(tree1, l = 60, type = "DNA", bf = c(1, 1, 3, 3)/8, rate = 0.1)
seqs6
## 6 sequences with 60 character and 23 different site patterns.
## The states are a c g t
mat6df = data.frame(as.character(seqs6))
p = ggtree(tree1, lwd = 1.2) + geom_tiplab(aes(x = branch), size = 5, vjust = 2)
gheatmap(p, mat6df[, 1:60], offset = 0.01, colnames = FALSE)

Figure 10.17: The tree on the left was used to generate the sequences on the right according to a Jukes Cantor model the nucleotide frequencies generated at the root were quite unequal, with A and C being generated more rarely. As the sequences percolate down the tree, mutations occur, they are more likely to occur on the longer branches.

The tree on the left was used to generate the sequences on the right according to a Jukes Cantor model the nucleotide frequencies generated at the root were quite unequal, with `A` and `C` being generated more rarely. As the sequences percolate down the tree, mutations occur, they are more likely to occur on the longer branches.

► Question 10.9

Experiment with the code above. Change the bf and rate arguments in the simSeq function to make mutations more likely. Do you think sequences generated with a very high mutation rate would make it easier to infer the tree that generated them ?

► Solution

► Question 10.10

Estimation bias: distance underestimation.
1) If we only count the number of changes between two sequences using a simple Hamming distance, but there has been much evolutionary change between the two, why do we underestimate the distance between the sequences?
2) Will be the bias be larger for smaller evolutionary distances?

The standard Markovian models of evolution we saw above enable us to improve these estimates.

10.4.3 Estimating a phylogenetic tree

includegraphics[width=2cm][Magnify.png]{} “In solving a problem of this sort, the grand thing is to be able to reason backward. That is a very useful accomplishment, and a very easy one, but people do not practise it much. In the everyday affairs of life it is more useful to reason forward, and so the other comes to be neglected. There are fifty who can reason synthetically for one who can reason analytically”. Sherlock Holmes

When the true tree-parameter is known, the above-mentioned probabilistic generative models of evolution tells us what patterns to expect in the sequences. As we have seen in earlier chapters, statistics means going back from the data to reasonable estimates of the parameters; here the tree itself and the branch lengths, even the evolutionary rates can be considered to be the parameters.

A Steiner tree, the inner points are represented as squares. The method for creating the shortest tree thatpasses through all outer 1,2,5,6 is to create two inside ("ancester") points 3 and 4. Figure 10.18: A Steiner tree, the inner points are represented as squares. The method for creating the shortest tree thatpasses through all outer 1,2,5,6 is to create two inside (“ancester”) points 3 and 4.

There are several approaches to estimation: tree `building’ is no exception, here are the main ones:

A nonparametric estimate: the parsimony tree Parsimony is a nonparametric method that minimizes the number of changes necessary to explain the data, it’s solution is the same as that of the Steiner tree problem (see Figure 10.18).

A parametric estimate: the maximum likelihood tree In order to estimate the tree using a maximum likelihood or Bayesian approach one needs a model for molecular evolution that integrates mutation rates and branch edge lengths. ML estimation (e.g., Phyml, FastML, RaxML) use efficient optimization algorithms to maximize the likelihood of a tree under the model assumptions.

Bayesian posterior distributions for trees Bayesian estimation, MrBayes (Ronquist et al. 2012) or BEAST (Bouckaert et al. 2014) both use MCMC to find posterior distributions of the phylogenies. Bayesian methods are not directly integrated into R and require the user to import the collections of trees generated by Monte Carlo methods in order to summarize them and make confidence statements see Chakerian and Holmes (2012) for simple examples.

The semi-parametric approach: distance based methods These methods, called Neighbor Joining and UPGMA, are quite similar to the hierachical clusering algorithms we already encountered in Chapter 5. However, the distance estimation steps uses the parametric evolutionary models of Table 10.1; the `parametric’ part of why we call the method semi-parametric.

The Neighbor-joining algorithm itself uses Steiner points as the summary of two combined points, and proceeds iteratively as in hierarchical clustering. It can be quite fast and is often used as a good starting point for the more time-consuming methods.

Let’s start by estimating the tree from the data seqs6 using the nj (neighbor joining) on DNA distances based on the one-parameter Jukes-Cantor model, we make Figure 10.19 using the ggtree function:

Trees built with a neighbor joining algorithm are very fast to compute and are often used as initial values for more expensive estimation procedures such as the maximum likelihood or parsimony. Figure 10.19: Trees built with a neighbor joining algorithm are very fast to compute and are often used as initial values for more expensive estimation procedures such as the maximum likelihood or parsimony.

tree.nj = nj(dist.ml(seqs6, "JC69"))
ggtree(tree.nj) + geom_tiplab(size = 7) + ggplot2::xlim(0, 0.8)

► Question 10.11

Generate the maximum likelihood scores of the tree1 given the seqs6 data and compare them to those of the neighbor joining tree.

► Solution

► Question 10.12

When we have aligned amino acids from which we want to infer a tree, we use (\(20 \times 20\)) transition matrices. The methods for estimating the phylogenetic are very similar. Try this in phangorn with an HIV amino acid sequence downloaded from https://www.hiv.lanl.gov/content/sequence/NEWALIGN/align.html.

The quality of the tree estimates depend on the number of sequences per taxa and the distance to the root. We can evaluate the quality of the estimates either by using parametric and nonparametric bootstraps or performing Bayesian tree estimation using MCMC. For examples of how to visualize and compare the sampling distribution of trees, see Chakerian and Holmes (2012).

10.4.4 Application to 16S rRNA data

In Chapter 5 we saw how to use a probabilistic clustering method to denoise 16S rRNA sequences. We can now reload these denoised sequences and preprocess them before building their phylogeny.

library("dada2")
seqtab = readRDS("../data/seqtab.rds")
seqs = getSequences(seqtab)
names(seqs) = seqs

131 In order to keep all the information and be able to compare sequences from different experiments, we use the sequences themselves as their label(Callahan, McMurdie, and Holmes 2017). One of the benefits of using well-studied marker loci such as the 16S rRNA gene is the ability to taxonomically classify the sequenced variants. dada2 includes a naive Bayesian classifier method for this purpose (Wang et al. 2007). This classifier compares sequence variants to training sets of classified sequences. Here we use the RDP v16 training set (Cole et al. 2009)132 See the download link on the dada2 website:https://benjjneb.github.io/dada2/training.html. As an example the code, not run here, for brevity’s sake:

fastaRef = "../tmp/rdp_train_set_16.fa.gz"
taxtab = assignTaxonomy(seqtab, refFasta = fastaRef)

We obtain a table of taxonomic information, that we have stored and can reload using:

taxtab = readRDS(file= "../data/taxtab16.rds")
dim(taxtab)
## [1] 268   6

► Question 10.13

Write one line of code using the %>% operator that shows just the first 6 rows of the taxonomic information without the row names.

► Solution

► Question 10.14

What is the difference between taxonomic and phylogenetic information?

Note that as the seqs data are randomly generated, they are “cleaner” than the real data we will have to handle.

In particular naturally occurring raw sequences have to be aligned. This is necessary as there are often extra nucleotides in some sequences, a consequence of what we call indel events133 A nucleotide is deleted or inserted and it is often hard to distinguish which took place.. Also mutations occur and appear as subsitutions of one nucleotide by another.

Here is an example of what the first few characters of aligned sequences looks like:

readLines("../data/mal2.dna.txt") %>% head(12) %>% cat(sep="\n")
##     11   1620
## Pre1        GTACTTGTTA GGCCTTATAA GAAAAAAGT- TATTAACTTA AGGAATTATA
## Pme2        GTATCTGTTA AGCCTTATAA AAAGATAGT- T-TAAATTAA AGGAATTATA
## Pma3        GTATTTGTTA AGCCTTATAA GAGAAAAGTA TATTAACTTA AGGA-TTATA
## Pfa4        GTATTTGTTA GGCCTTATAA GAAAAAAGT- TATTAACTTA AGGAATTATA
## Pbe5        GTATTTGTTA AGCCTTATAA GAAAAA--T- TTTTAATTAA AGGAATTATA
## Plo6        GTATTTGTTA AGCCTTATAA GAAAAAAGT- TACTAACTAA AGGAATTATA
## Pfr7        GTACTTGTTA AGCCTTATAA GAAAGAAGT- TATTAACTTA AGGAATTATA
## Pkn8        GTACTTGTTA AGCCTTATAA GAAAAGAGT- TATTAACTTA AGGAATTATA
## Pcy9        GTACTCGTTA AGCCTTTTAA GAAAAAAGT- TATTAACTTA AGGAATTATA
## Pvi10       GTACTTGTTA AGCCTTTTAA GAAAAAAGT- TATTAACTTA AGGAATTATA
## Pga11       GTATTTGTTA AGCCTTATAA GAAAAAAGT- TATTAATTTA AGGAATTATA

We will perform this multiple-alignment on our seqs data using the DECIPHER package (Wright 2015):

library("DECIPHER")
alignment = AlignSeqs(DNAStringSet(seqs), anchor = NA, verbose = FALSE)

We use the phangorn package to build the MLE tree (under the GTR model), but will use the neighbor-joining tree as our starting point.

phangAlign = phangorn::phyDat(as(alignment, "matrix"), type = "DNA")
dm = phangorn::dist.ml(phangAlign)
treeNJ = phangorn::NJ(dm) # Note, tip order != sequence order
fit = phangorn::pml(treeNJ, data = phangAlign)
fitGTR = update(fit, k = 4, inv = 0.2)
fitGTR = phangorn::optim.pml(fitGTR, model = "GTR", optInv = TRUE,
         optGamma = TRUE,  rearrangement = "stochastic",
         control = phangorn::pml.control(trace = 0))

10.5 Combining a phylogenetic trees into a data analysis

We now need to combine the phylogenetic tree and the denoised read abundances with the complementary information provided about the samples from which the reads were gathered. This sample information is often provided as a spreadhseet (or .csv), and (mis)named the meta-data. This data integration step is facilitated by the specialized containers and accessors that phyloseq provides.

We read in the sample data from a csv file:

mimarksPathClean = "../data/MIMARKS_Data_clean.csv"
samdf = read.csv(mimarksPathClean, header = TRUE)

Then the full suite of data for this study – the sample-by-sequence feature table, the sample (meta)data, the sequence taxonomies, and the phylogenetic tree – are combined into a single object as follows:

library("phyloseq")
physeq = phyloseq(tax_table(taxtab), sample_data(samdf),
        otu_table(seqtab, taxa_are_rows = FALSE), phy_tree(fitGTR$tree))

This block will only usually be executed once. Once created, we save the object and do all the manipulations starting from that container.

We have already encountered several cases of combining heterogeneous data into special data classes (in particular in Chapter 8, when we studied the pasilla data).

► Task

Look at the detailed phyloseq documentation here. Try a few filtering operations; for instance make a subset of the data that contains the tree, taxa abundance table, the sample and taxa information for only the samples that have more than 5,000 reads.

This can be done in one line:

ps1 =  prune_samples(rowSums(otu_table(ps0)) > 5000, ps0)

We can also make data transformations, while maintaining the integrity of the links between all the data components.

► Question 10.15

What do the following lines of code do ?

prev0 = apply(X = otu_table(ps0),
              MARGIN = ifelse(taxa_are_rows(ps0), yes = 1, no = 2),
              FUN = function(x) {sum(x > 0)})
prevdf = data.frame(Prevalence = prev0,
                     TotalAbundance = taxa_sums(ps0),
                     tax_table(ps1))
keepPhyla = table(prevdf$Phylum)[table(prevdf$Phylum) >  5]
prevdf1   = subset(prevdf, Phylum %in% names(keepPhyla))
ps2v      = subset_taxa(ps1v, Phylum %in% names(keepPhyla))

Plotting the abundances for certain bacteria can easily be performed using barcharts. ggplot2 expressions have been hardwired into one-line calls in the phyloseq package. This makes performing these types of exploratory plots painless. There is even an interactive Shiny-phyloseq browser based tool (McMurdie and Holmes 2015). We do not give full details here but refer the reader to the online vignettes.

10.5.1 Hierarchical multiple testing

Hypothesis testing can identify individual bacteria whose abundance relates to sample variables of interest. A standard approach is very similar to the approach we already visited in Chapter 6. Compute a test statistic for each taxa individually; then jointly adjust p-values to ensure a false discovery rate upper bound. However, this procedure does not exploit any structure among the tested hypotheses – for example, it is likely that if one Ruminococcus species is strongly associated with age, then others are as well. To integrate this information, Benjamini and Yekutieli (2003) and Benjamini and Bogomolov (2014) proposed a hierarchical testing procedure, where taxonomic groups are only tested if higher levels are found to be be associated. In the case where many related species have a slight signal, this pooling of information can increase power.

We apply this method to test the association between microbial abundance and age. Before computing the test statistic, it is appropriate to start by using the normalization protocols we discussed in Chapter 8 following Love, Huber, and Anders (2014)134 The fourth line in the below code chunk fixes the levels of the ageBin factor, to make them admissible for DESeq2. for RNA-Seq data and McMurdie and Holmes (2014) for 16S rRNA generated count data and available in the DESeq2 package:

library("DESeq2")
library("phyloseq")
ps1 = readRDS("../data/ps1.rds")
ps_dds = phyloseq_to_deseq2(ps1, design = ~ ageBin + family_relationship)
geometricmean = function(x)
   if (all(x == 0)) { 0 } else { exp(mean(log(x[x != 0]))) }
geoMeans = apply(counts(ps_dds), 1, geometricmean)
ps_dds = estimateSizeFactors(ps_dds, geoMeans = geoMeans)
ps_dds = estimateDispersions(ps_dds)
abund = getVarianceStabilizedData(ps_dds)

We use the structSSI package to perform the hierarchical testing (Sankaran and Holmes 2014). For more convenient printing, we first shorten the names of each taxa:

rownames(abund) = substr(rownames(abund), 1, 5) %>% make.names(unique = TRUE)

Unlike standard multiple hypothesis testing, the hierarchical testing procedure needs univariate tests for each higher-level taxonomic group, not just every taxa. A helper function, treePValues, is available for this; it expects an edgelist encoding parent-child relationships, with the first row specifying the root node.

library("structSSI")
el = phy_tree(ps1)$edge
el0 = el
el0 = el0[rev(seq_len(nrow(el))), ]
el_names = c(rownames(abund), seq_len(phy_tree(ps1)$Nnode))
el[, 1] = el_names[el0[, 1]]
el[, 2] = el_names[el0[, 2]]
unadj_p = treePValues(el, abund, sample_data(ps1)$ageBin)

We can now do our FDR calculations using the hierarchical testing procedure. The test results are guaranteed to control several variants of FDR, but at different levels; we defer details to (Benjamini and Yekutieli 2003; Benjamini and Bogomolov 2014; Sankaran and Holmes 2014).

► Task

Try the following code, including the interactive plotting command that will open a browser window:

hfdr_res = hFDR.adjust(unadj_p, el, 0.75)
summary(hfdr_res)
#plot(hfdr_res, height = 5000) # not run: opens in a browser

A screenshot of a subtree with many differentially abundant microbes, as determined by the hierarchical testing procedure. Currently the user is hovering over the node associated with microbe GCGAG.33; this causes the adjusted p-value (0.029) to appear. Figure 10.20: A screenshot of a subtree with many differentially abundant microbes, as determined by the hierarchical testing procedure. Currently the user is hovering over the node associated with microbe GCGAG.33; this causes the adjusted p-value (0.029) to appear.

The plot opens in a new browser – a static screenshot of a subtree is displayed in Figure 10.20. Nodes are shaded according to p-values, from blue to orange, representing the strongest to weakest associations. Grey nodes were never tested, to focus power on more promising subtrees. Scanning the full tree; it becomes clear that the association between age group and taxonomic abundances is present in only a few isolated taxonomic groups. It is quite strong in those groups. To give context to these results, we can retrieve the taxonomic identity of the rejected hypotheses.

library("dplyr")
options(digits = 3)
tax = tax_table(ps1)[, c("Family", "Genus")] %>% data.frame
tax$seq = rownames(abund)
hfdr_res@p.vals$seq = rownames(hfdr_res@p.vals)
tax %>%  left_join(hfdr_res@p.vals[,-3]) %>%
  arrange(adjp) %>% head(9) %>% dplyr::select(1,2,4,5)
##            Family            Genus   unadjp     adjp
## 1 Lachnospiraceae             <NA> 2.15e-82 4.30e-82
## 2 Lachnospiraceae        Roseburia 1.30e-75 2.60e-75
## 3 Lachnospiraceae Clostridium_XlVa 4.00e-59 8.00e-59
## 4 Lachnospiraceae             <NA> 3.83e-50 7.67e-50
## 5 Lachnospiraceae Clostridium_XlVa 1.21e-49 2.41e-49
## 6 Lachnospiraceae             <NA> 8.63e-49 1.73e-48
## 7 Ruminococcaceae     Ruminococcus 9.32e-47 1.86e-46
## 8 Lachnospiraceae Clostridium_XlVa 5.90e-43 1.18e-42
## 9 Lachnospiraceae        Roseburia 2.34e-36 4.68e-36

It seems that the most strongly associated bacteria all belong to family Lachnospiraceae.

10.6 Minimum spanning trees

A very simple and useful graph is the so-called minimum spanning tree (MST). Given distances between vertices; the MST is the tree that spans all the points and has the minimum total length (see Figure 10.21).

Greedy algorithms work well for computing the MST and there are many implementations in R: (mstree in ade4), mst in ape, spantree in vegan or mst in igraph.

A **spanning tree** is a tree that goes through all points at least once, the top graph with red edges is such a tree.A **spanning tree** is a tree that goes through all points at least once, the top graph with red edges is such a tree. Figure 10.21: A spanning tree is a tree that goes through all points at least once, the top graph with red edges is such a tree.

The minimum spanning tree is the spanning tree of minimum length; the lower blue graph in Figure 10.21 is the minimum spanning tree or MST for short. Here we are going to take the DNA sequence distances between strains of HIV from patients all over the world and construct their minimum spanning tree, the result is shown in Figure 10.22.

load("../data/dist2009c.RData")
country09 = attr(dist2009c, "Label")
mstree2009 = ape::mst(dist2009c)
gr09 = graph.adjacency(mstree2009, mode = "undirected")
gg = ggnetwork(gr09, arrow.gap = 0, layout = "fruchtermanreingold")
ggplot(gg, aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_edges(color = "black", alpha = 0.5, curvature = 0.1) +
  geom_nodes(aes(color = vertex.names), size = 2) +  theme_blank() +
  geom_nodetext(aes(label = vertex.names), color = "black", size = 2.5) +
  theme(plot.margin = unit(c(0, 1, 1, 6), "cm"))+
  guides(color = guide_legend(keyheight = 0.09, keywidth = 0.09,
      title = "Countries")) + theme(legend.position = c(0, 0.14),
      legend.background = element_blank(),
      legend.text = element_text(size = 7))

Figure 10.22: The minimum spanning tree computed from DNA distances between HIV sequences from samples taken in 2009 and whose country of origin was known, data as published in the HIVdb database (Rhee et al. 2003), .

The minimum spanning tree computed from DNA distances between HIV sequences from samples taken in 2009 and whose country of origin was known, data as published in the `HIVdb` database  [@HIVdb], .

► Question 10.16

Make the network plot again but replace geom_nodetext with a labels that repel each other to minimize the overlapping node labels.

► Solution

It could be preferable to use a graph layout that incorporates the known geographic coordinates. Thus, we might be able to see how the virus jumped large distances across the world through traveller mobility. We introduce approximate country coordinates which we then jitter slightly to avoid too much overlapping:

library("rworldmap")
mat = match(country09, countriesLow$NAME)
coords2009 = data.frame(
  lat = countriesLow$LAT[mat],
  lon = countriesLow$LON[mat],
  country = country09)
layoutCoordinates = cbind(
  x = jitter(coords2009$lon, amount = 15),
  y = jitter(coords2009$lat, amount = 8))
labc = names(table(country09)[which(table(country09) > 1)])
matc = match(labc, countriesLow$NAME)
dfc = data.frame(
  latc = countriesLow$LAT[matc],
  lonc = countriesLow$LON[matc],
  labc)
dfctrans = dfc
dfctrans[, 1] = (dfc[,1] + 31) / 93
dfctrans[, 2] = (dfc[,2] + 105) / 238
ggeo09 = ggnetwork(gr09, arrow.gap = 0, layout = layoutCoordinates)
ggplot(ggeo09, aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_edges(color = "black", alpha = 0.5, curvature = 0.1) +
  geom_nodes(aes(color = vertex.names), size = 2) +
  theme_blank() +
  geom_label(data = dfctrans, aes(x = lonc, xend = lonc, y = latc, yend = latc,
       label = labc, fill = labc), colour = "white", alpha = 0.5, size = 3) +
   theme(legend.position = "none")

Figure 10.23: A minimum spanning tree between HIV cases using a jitter of the geographic location of each case was made to reduce overlapping of points; the DNA distances between patient strains were used as the input to an undirected minimum spanning tree algorithm.

A minimum spanning tree between HIV cases using **a jitter** of the geographic location of each case was made to reduce overlapping of points; the DNA distances between patient strains were used as the input to an undirected minimum spanning tree algorithm.

The input to the minimum spanning tree algorithm is a distance matrix or a graph with a length edge attribute. Figure 10.23 is the minimum spanning tree between cases of HIV, for which strain information was made available through the HIVdb . The DNA distances was computed using the Jukes-Cantor mutation model.

► Question 10.17

The above analysis provided an undirected network of connections, in fact several implementations of the minimum spanning tree (ie for instance mstree in ade4) provide a directed path through the points, which can provide meaningful information on the (apparent) spread of disases. Make a directed network version of the above maps.

MST is a very useful component of a simple nonparametric test for detecting differences between factors that are mapped onto its vertices.

0.5cm

10.6.1 MST based testing: the Friedman–Rafsky test

Graph-based two-sample tests135 Tests that explore whether two samples aredrawn from the same distribution. were introduced by Friedman and Rafsky (Friedman and Rafsky 1979) as a generalization of the Wald-Wolfowitz runs test (see Figure 10.24). Our previous examples show graph vertices associated with covariates such as country of origin. Here we test whether the covariate is significantly associated to the graph structure.

The Friedman-Rafsky tests for two/multiple sample segregation on a minimum spanning tree. It was conceived as a generalization of the univariate Wald-Wolfowitz runs test. If we are comparing two samples, say men and women, whose coordinates represent a measurement of interest. We color the two groups blue and red as in Figure 10.24, the Wald-Wolfowitz test looks for long runs of the samecolor that would indicate that the two groups have different means.

Figure 10.24: Seeing the number of runs in a one-dimensional, two-sample, nonparametric Wald-Wolfowitz test can indicate whether the two groups have the same distributions.

Seeing the number of runs in a one-dimensional, two-sample, nonparametric Wald-Wolfowitz test can indicate whether the two groups have the same distributions.

Instead of looking for consecutive values of one type (‘runs’), we count the number of connected nodes of the same type.

Once the minimum spanning tree has been constructed, the vertices are assigned `colors’ according to the different levels of a categorical variable. We call pure edges those whose two nodes have the same level of the factor variable. We use \(S_O\), the number of pure edges as our test statistic. To evaluate whether our observed value could have occurred by chance when the groups have the same distributions, we permute the vertix labels (colors) randomly and recount how many pure edges there are. This label swapping is repeated many times, creating our null distribution for \(S\).

10.6.2 Example: Bacteria sharing between mice

Here we illustrate the idea on a collection of samples from mice whose stool were analyzed for their microbial content. We read in a data set with many mice and many taxa, we compute the Jaccard distance and then use the mst function from the igraph package. We annotate the graph with the relevant covariates as shown in the code below:

ps1  = readRDS("../data/ps1.rds")
sampledata = data.frame( sample_data(ps1))
d1 = as.matrix(phyloseq::distance(ps1, method="jaccard"))
gr = graph.adjacency(d1,  mode = "undirected", weighted = TRUE)
net = igraph::mst(gr)
V(net)$id = sampledata[names(V(net)), "host_subject_id"]
V(net)$litter = sampledata[names(V(net)), "family_relationship"]

We make a ggnetwork object from the resulting igraph generated minimum spanning tree and then plot it, as shown in Figure 10.25.

The minimum spanning tree based on Jaccard dissimilarity and annotated with the mice ID and litter factors Figure 10.25: The minimum spanning tree based on Jaccard dissimilarity and annotated with the mice ID and litter factors

gnet=ggnetwork(net)
ggplot(gnet, aes(x = x, y = y, xend = xend, yend = yend))+
  geom_edges(color = "darkgray") +
  geom_nodes(aes(color = id, shape = litter)) + theme_blank()+
  theme(legend.position="bottom")

Now we compute the null distribution and p-value for the test, this is implemented in the phyloseqGraphTest package:

library("phyloseqGraphTest")
gt = graph_perm_test(ps1, "host_subject_id", distance="jaccard",
                     type="mst",  nperm=1000)
gt$pval
## [1] 0.000999

We can take a look at the complete histogram of the null distribution generatedby permutation using:

The permutation histogram of the number of pure edges in the network obtained from the minimal spanning tree with Jaccard similarity. Figure 10.26: The permutation histogram of the number of pure edges in the network obtained from the minimal spanning tree with Jaccard similarity.

plot_permutations(gt)

Different choices for the skeleton graph

It is not necessary to use an MST for the skeleton graph that defines the edges. Graphs made by linking nearest neighbors (Schilling 1986) or distance thresholding work as well.

The Bioconductor package phyloseq has functionality for creating graphs based on thresholding a distance matrix through the function make_network. We create a network by creating an edge between samples whose Jaccard dissimilarity is less than a threshold, which we set below via the parameter max.dist. We use the ggnetwork package to add attributes to the vertices indicating which mouse the sample came from and which litter the mouse was in. We see that in the resulting network, shown in Figure 10.27, there is grouping of the samples by both mouse and litter.

net = make_network(ps1, max.dist = 0.35)
sampledata = data.frame(sample_data(ps1))
V(net)$id = sampledata[names(V(net)), "host_subject_id"]
V(net)$litter = sampledata[names(V(net)), "family_relationship"]
netg = ggnetwork(net)
ggplot(netg, aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_edges(color = "darkgray") +
  geom_nodes(aes(color = id, shape = litter)) + theme_blank()+
    theme(plot.margin = unit(c(0, 5, 2, 0), "cm"))+
    theme(legend.position = c(1.4, 0.3),legend.background = element_blank(),
          legend.margin=margin(0, 3, 0, 0, "cm"))+
         guides(color=guide_legend(ncol=2))

Figure 10.27: A co-occurrence network created by using a threshold on the Jaccard dissimilarity matrix. The colors represent which mouse the sample came from; the shape represents which litter the mouse was in.

A co-occurrence network created by using a threshold on the Jaccard dissimilarity matrix. The colors represent which mouse the sample came from; the shape represents which litter the mouse was in.

Note that no matter which graph we build between the samples, we can approximate a null distribution by permuting the labels of the nodes of the graph. However, sometimes it will preferable to adjust the permutation distribution to account for known structure between the covariates.

10.6.3 Friedman–Rafsy test with nested covariates

In the test above, we took a rather naïve approach and showed there was a significant difference between individual mice (the host_subject_id variable). Here we perform a slightly different permutation test to find out if we control for the difference between mice; is there a litter (the family_relationship variable) effect ? The setup of the test is similar, it is simply how the permutations are generated which differs. We maintain the nested structure of the two factors using the grouping argument We permute the family_relationship labels but keep the host_subject_id structure intact.

gt = graph_perm_test(ps1, "family_relationship",
        grouping = "host_subject_id",
        distance = "jaccard", type = "mst", nperm= 1000)
gt$pval
## [1] 0.005

This test has a small p-value, and we reject the null hypothesis that the two samples come from the same distribution. From the plot of the minimum spanning tree in Figure 10.26, we see by eye that the samples group by litter more than we would expect by chance.

The permutation histogram obtained from the minimal spanning tree with Jaccard similarity. Figure 10.28: The permutation histogram obtained from the minimal spanning tree with Jaccard similarity.

plot_permutations(gt)

► Question 10.18

The \(k\)-nearest neighbors graph is obtained by putting an edge between two samples whenever one of them is in the set of \(k\)-nearest neighbors of the other. Redo the test, defining the graph using nearest neighbors defined with the Jaccard distance. What would you conclude?

► Solution

Note: The dual graph
In the examples above we sought to show relationships between samples through their shared taxa. It can also be of interest to ask the question about taxa: do some of the taxa co-occur more often than one would expect? This approach can help study microbial `communities’ as they assemble in the microbiome. The methods we developed above all apply to this use-case, all one really does is transpose the data. It is always preferable with sparse data such as the microbiome to use Jaccard and not build correlation networks that can be appropriate in other settings.

10.7 Summary of this chapter

Annotated graphs

. In this chapter we have learnt how to store and plot data that have more structure than simple arrays: graphs have edges and nodes that can also be associated to extra annotations that can be displayed usefully.

Important examples of graphs and useful R packages

. We started by specific examples such as Markov chain graphs, phylogenetic trees and minimum spanning trees. We saw how to use the ggnetwork and igraph packages to visualize graphs and show as much information as possible by using specific graph layout algorithms.

Combining graphs with statistical data

. We then approached the problem of incorporating a known `skeleton’ graph into differential expression analyses. This enables use to pinpoint perturbation hotspots in a network. We saw how evolutionary models defined along rooted binary trees serve as the basis for phylogenetic tree estimation and how we can incorporate these trees as supplementary information in a differential abundance analysis using the R packages structSSI and phyloseq.

Linking co-occurrence to other variables

. Graph and network tools also enable the creation of networks from co-occurrence data and can be used to visualize and test the effect of factor covariates. We saw the Friedman-Rafsky test which provides an easy way of testing dependencies of a variable with the edge structure of a skeleton graph.

Context and intepretation aides

. This chapter illustrated ways of incorporating interactions of players in a network and we saw how useful it was to combine this with statistical scores. This often provides biological insight into analyses of complex biological systems.

Previous knowledge or outcome

. We saw that graphs can be both useful to encode our previous knowledge, metabolic network information, gene ontologies and phylogenetic trees of known bacteria are all available in standard databases. It is beneficial in a study to incorprate all known information and doing so by combining these skeleton networks with observed data enhances our understanding of experimental results in the context of what is already known.

On the other hand, the graph can be the outcome that we want to predict and we saw how to build graphs from data (phylogenetic trees, co-occurrence networks and minimum spanning trees).

10.8 Further reading

For complete developments and many important consequences of the evolutionary models used in phylogenetic trees, see the books by Li (1997; Li and Graur 1991). The book by Felsenstein (2004) is the classic text on estimating phylogenetic trees.

The book written by the author of the ape packages, Paradis (2011) contains many use-cases and details about manipulation of trees in R. A review of bootstrapping for phylogenetic trees can be found in Holmes (2003a).

We can use a tree as well as abundances in a contingency table data through an extension of PCoA-MDS called DPCoA (Double principal coordinate analysis). For microbiome data, the phylogenetic tree provides distances between taxa; these distances serve as the basis for the first PCoA. A second PCoA enables the projection of the weighted sample points. This has proved very effective in microbial ecology applications, see Purdom (2010) or Fukuyama et al. (2012) for details.

Graphs can be used to predict vertex covariates. There is a large field of applied statistics and machine learning that considers the edges in the graph as a response variable for which one can make predictions based on covariates or partial knowledge of the graph; these include ERGM’s (Exponential Random Graph Models, Robins et al. (2007)) and kernel methods for graphs (Schölkopf, Tsuda, and Vert 2004).

For theoretical properties of the Friedman-Rafsky test and more examples see Bhattacharya (2015).

A full list packages that deal with graphs and networks is available at: http://www.bioconductor.org/packages/release/BiocViews.html#___GraphAndNetwork.

10.9 Exercises

► Exercise 10.1

Create a function that plots a graph starting from an adjacency matrix. Show how it works on an example.

► Exercise 10.2

The relationships between gene functions is organized hierarchically into a graph called the Gene Ontology (GO) graph. The biological processes are organized at finer and finer scale. Take one of the databases providing the GO information for the organisms you are interested in. Choose a gene list and build the GO graph for that list.
Hint: Some examples can be found in the packages , , .

► Exercise 10.3

Markov chain graph of transitions between states of the vaginal microbiota: In DiGiulio et al. (2015) the authors use an igraph plot to represent the transitions rates between community state types CSTs using the markovchain package. Load the data and the transition rates and state names into an object of the special class markovchain and tailor the layout carefully to include the percentage of preterm birth as a covariate for the vertices (make the vertex size proportional to this variable). Include the size of transitions between states as the width of the arrows.

► Exercise 10.4

Protein interaction networks: Read the Wikipedia article about the http://en.wikipedia.org/wiki/STRING database (http://www.string-db.org).
The protein Cyclin B1 is encoded by the CCNB1 gene. You can read about it on wikipedia here: http://en.wikipedia.org/wiki/Cyclin_B1.
Use STRING to generate a text file (call it ccnb1datsmall.txt) of edges around the CCNB1 gene. Choose nodes that are connected by evidence of co-expression with a confidence higher than 0.9. Collect no more than 50 interactions and additional nodes that are two steps away from CCNB1 in the graph.

► Exercise 10.5

Read the txt file ccnb1datsmall.txt into R and make a plot of the graph using one of the graph visualization methods covered in this chapter.

► Exercise 10.6

Make a heatmap showing the adjacency matrix of the graph created in Exercises 10.5.

► Exercise 10.7

The visualization shows the strongest interactions in the two step neighborhood of CCNB1. Both the plotted graph and the heatmap image show the same data: there seems to be a cluster of proteins which are all similar to CCNB1 and there is also another cluster in the other proteins. Many of the proteins in the CCNB1 cluster are coexpressed at the same time as each other.
Why might this be the case?
Conversely, proteins which are coexpressed with a protein that is coexpressed with CCNB1 (two steps away) do not tend to be coexpressed with each other.
Is it easier for you to see this in one of the figures (the plot or the heatmap) than the other?

► Exercise 10.8

Compare the use of ape and phangorn in the analysis of HIV GAG data. Compute the Jukes Cantor distances between the sequences using both packages and compare them to the Hamming distances.

library("ape")
library("phangorn")
GAG=read.dna("../data/DNA_GAG_20.txt")
► Exercise 10.9

Perform the Friedman–Rafsy type test with a “two-nearest” neighbor-graph using the Bray-Curtis dissimilarity.

References

Beisser, Daniela, Gunnar W Klau, Thomas Dandekar, Tobias Müller, and Marcus T Dittrich. 2010. “BioNet: An R-Package for the Functional Analysis of Biological Networks.” Bioinformatics 26 (8). Oxford Univ Press: 1129–30.

Benjamini, Yoav, and Marina Bogomolov. 2014. “Selective Inference on Multiple Families of Hypotheses.” JRSSB 76 (1). Wiley Online Library: 297–318.

Benjamini, Yoav, and Daniel Yekutieli. 2003. “Hierarchical FDR Testing of Trees of Hypotheses.” Technical report, Department of Statistics; Operations Research, Tel Aviv University.

Bhattacharya, Bhaswar B. 2015. “Power of Graph-Based Two-Sample Tests.” arXiv Preprint arXiv:1508.07530.

Bouckaert, Remco, Joseph Heled, Denise Kühnert, Tim Vaughan, Chieh-Hsi Wu, Dong Xie, Marc A Suchard, Andrew Rambaut, and Alexei J Drummond. 2014. “BEAST 2: A Software Platform for Bayesian Evolutionary Analysis.” PLoS Computational Biology 10 (4). Public Library of Science: e1003537.

Callahan, Benjamin J, Paul J McMurdie, and Susan P Holmes. 2017. “Exact Sequence Variants Should Replace Operational Taxonomic Units in Marker Gene Data Analysis.” ISME Journal. Nature publishing, 1–5.

Chakerian, John, and Susan Holmes. 2012. “Computational Tools for Evaluating Phylogenetic and Hierarchical Clustering Trees.” Journal of Computational and Graphical Statistics 21 (3). Taylor & Francis Group: 581–99.

Cole, J.R., Q. Wang, E. Cardenas, J. Fish, B. Chai, R.J. Farris, A.S. Kulam-Syed-Mohideen, et al. 2009. “The Ribosomal Database Project: Improved Alignments and New Tools for rRNA Analysis.” Nucleic Acids Research 37 (Supplement 1): D141–D145.

DiGiulio, Daniel B., Benjamin J. Callahan, Paul J. McMurdie, Elizabeth K. Costello, Deirdre J. Lyelle, Anna Robaczewska, Christine L. Sun, et al. 2015. “Temporal and Spatial Variation of the Human Microbiota During Pregnancy.” PNAS.

Felsenstein, Joseph. 2004. Inferring Phylogenies. Boston: Sinauer.

Friedman, Jerome H, and Lawrence C Rafsky. 1979. “Multivariate Generalizations of the Wald-Wolfowitz and Smirnov Two-Sample Tests.” The Annals of Statistics, 697–717.

Fukuyama, Julia, Paul J McMurdie, Les Dethlefsen, David A Relman, and Susan Holmes. 2012. “Comparisons of Distance Methods for Combining Covariates and Abundances in Microbiome Studies.” In Pac Symp Biocomput. World Scientific.

Holmes, Susan. 1999. “Phylogenetic Trees: An Overview.” In Statistics and Genetics, 81–118. IMA 112. New York: Springer.

Holmes, Susan. 2003a. “Bootstrapping Phylogenetic Trees: Theory and Methods.” Statistical Science 18 (2): 241–55.

Holmes, Susan. 2003b. “Statistics for phylogenetic trees.” Theoretical Population Biology 63 (1). Elsevier: 17–32.

Ideker, Trey, Owen Ozier, Benno Schwikowski, and Andrew F Siegel. 2002. “Discovering Regulatory and Signalling Circuits in Molecular Interaction Networks.” Bioinformatics 18 Suppl 1 (January): S233–40. http://bioinformatics.oxfordjournals.org/cgi/reprint/18/suppl\_1/S233.

Li, Wen-Hsiung. 1997. Molecular Evolution. Sinauer Associates Incorporated.

Li, Wen-Hsiung, and Dan Graur. 1991. Fundamentals of Molecular Evolution. Vol. 48. Sinauer Associates Sunderland, MA.

Liberzon, Arthur, Aravind Subramanian, Reid Pinchback, Helga Thorvaldsdóttir, Pablo Tamayo, and Jill P Mesirov. 2011. “Molecular Signatures Database (MSigDB) 3.0.” Bioinformatics 27 (12): 1739–40.

Love, Michael I, Wolfgang Huber, and Simon Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for RNA-seq Data with DESeq2.” Gnome Biology 15 (12): 1–21.

McMurdie, Paul J, and Susan Holmes. 2014. “Waste Not, Want Not: Why Rarefying Microbiome Data Is Inadmissible.” PLoS Computational Biology 10 (4). Public Library of Science: e1003531.

McMurdie, Paul J, and Susan Holmes. 2015. “Shiny-Phyloseq: Web Application for Interactive Microbiome Analysis with Provenance Tracking.” Bioinformatics 31 (2). Oxford Univ Press: 282–83.

Mossel, Elchanan. 2003. “On the Impossibility of Reconstructing Ancestral Data and Phylogenies.” Journal of Computational Biology 10 (5). Mary Ann Liebert, Inc.: 669–76.

Nacu, Serban, Rebecca Critchley-Thorne, Peter Lee, and Susan Holmes. 2007. “Gene Expression Network Analysis and Applications to Immunology.” Bioinformatics 23 (7, 7): 850–8. https://doi.org/10.1093/bioinformatics/btm019.

Nolan, Daniel J, Michael Ginsberg, Edo Israely, Brisa Palikuqi, Michael G Poulos, Daylon James, Bi-Sen Ding, et al. 2013. “Molecular Signatures of Tissue-Specific Microvascular Endothelial Cell Heterogeneity in Organ Maintenance and Regeneration.” Developmental Cell 26 (2). Elsevier: 204–19.

Paradis, Emmanuel. 2011. Analysis of Phylogenetics and Evolution with R. Springer Science & Business Media.

Pounds, Stan, and Stephan W Morris. 2003. “Estimating the Occurrence of False Positives and False Negatives in Microarray Studies by Approximating and Partitioning the Empirical Distribution of P-Values.” Bioinformatics 19 (10). Oxford Univ Press: 1236–42.

Purdom, Elizabeth. 2010. “Analysis of a Data Matrix and a Graph: Metagenomic Data and the Phylogenetic Tree.” Annals of Applied Statistics, July.

Rhee, Soo-Yon, Matthew J Gonzales, Rami Kantor, Bradley J Betts, Jaideep Ravela, and Robert W Shafer. 2003. “Human Immunodeficiency Virus Reverse Transcriptase and Protease Sequence Database.” Nucleic Acids Research 31 (1). Oxford Univ Press: 298–303.

Robins, Garry, Tom Snijders, Peng Wang, Mark Handcock, and Philippa Pattison. 2007. “Recent Developments in Exponential Random Graph (P*) Models for Social Networks.” Social Networks 29 (2). Elsevier: 192–215.

Ronquist, Fredrik, Maxim Teslenko, Paul van der Mark, Daniel L Ayres, Aaron Darling, Sebastian Höhna, Bret Larget, Liang Liu, Marc A Suchard, and John P Huelsenbeck. 2012. “MrBayes 3.2: Efficient Bayesian Phylogenetic Inference and Model Choice Across a Large Model Space.” Systematic Biology 61 (3). Oxford University Press: 539–42.

Sankaran, Kris, and Susan Holmes. 2014. “structSSI: Simultaneous and Selective Inference for Grouped or Hierarchically Structured Data.” Journal of Statistical Software 59 (1): 1–21.

Schilling, Mark F. 1986. “Multivariate Two-Sample Tests Based on Nearest Neighbors.” Journal of the American Statistical Association 81 (395). Taylor & Francis: 799–806.

Schölkopf, Bernhard, Koji Tsuda, and Jean-Philippe Vert. 2004. Kernel Methods in Computational Biology. MIT press.

Wang, Q., G.M. Garrity, J.M. Tiedje, and J.R. Cole. 2007. “Naive Bayesian Classifier for Rapid Assignment of rRNA Sequences into the New Bacterial Taxonomy.” Applied and Environmental Microbiology 73 (16). Am Soc Microbiol: 5261.

Wertheim, Joel O, and Michael Worobey. 2009. “Dating the Age of the SIV Lineages That Gave Rise to HIV-1 and HIV-2.” PLoS Computational Biology 5 (5). Public Library of Science: e1000377.

Wright, Erik S. 2015. “DECIPHER: Harnessing Local Sequence Context to Improve Protein Multiple Sequence Alignment.” BMC Bioinformatics 16 (1). BioMed Central: 1.

Yu, Hongxiang, Diana L Simons, Ilana Segall, Valeria Carcamo-Cavazos, Erich J Schwartz, Ning Yan, Neta S Zuckerman, et al. 2012. “PRC2/EED-EZH2 Complex Is up-Regulated in Breast Cancer Lymph Node Metastasis Compared to Primary Tumor and Correlates with Tumor Proliferation in Situ.” PloS One 7 (12). Public Library of Science: e51239.

Page built: 2019-05-15

Support for maintaining the online version of this book is provided by de.NBI
For website support please contact msmith@embl.de