Networks and trees are often used to represent both biological data and knowledge about a system. Phylogenetic trees were used to represent family relationships between species even before Darwin’s famous notebook sketch. Networks are often used to schematize complex interactions between proteins involved in diseases as in Figure 10.1.

Figure 10.1: This graph represents genetic interactions that appeared to be modified in cancer patients where perturbations appeared in a chemokine signaling network.

We saw in Chapter 2 that we could model state transitions as a Markov chain and these can be usefully represented as directed graphs with weights on the edges. Metabolic pathways in which nodes are chemical metabolites and the edges represent chemical reactions. Mutation history trees are used in cancer genomics to infer lineages of mutations.

Transmission networks are important in studying the epidemiology of infectious diseases. As real networks can be very large, we will need special methods for representing and visualizing them. This chapter will be focused on ways of integrating graphs into a data analytic workflow.

In this chapter we will

Use the formal definition of a graph’s components: edges, vertices, layout to see how we can manipulate them in

**R**using both adjacency matrices and lists of edges.We will transform a graph object from

**igraph**into an object that can be visualized according to the layers approach in**ggplot2**using**ggnetwork**. We will experiment with covariates that we attach to graph edges and nodes.Graphs are useful ways of encoding prior knowledge about a system, and we will see how they enable us to go from simple gene set analyses to meaningful biological recommendations by mapping significance scores onto the network to detect perturbation

*hotspots*.We will build phylogenetic trees starting with from DNA sequences and then visualize these trees with the specifically designed R packages

**ape**and**ggtree**.We will combine a phylogenetic tree built from microbiome 16S rRNA data with covariates to show how the hierarchical relationship between taxa can increase the power in multiple hypothesis testing.

A special tree called a minimum spanning tree (MST) is very useful for testing the relations between a graph and other covariates. We’ll see how to implement different versions of what is known as the Friedman-Rafsky test. We’ll study both co-occurrence of bacteria in mice litters and strain similarities in HIV contagion networks.

A **graph** is formed by a set of nodes or vertices (often called \(V\)) and a set of edges between these vertices (\(E\)). Edges \(E\) are provided as unordered pairs of vertices in undirected graphs and ordered pairs for directed or oriented graphs.

An **adjacency matrix** \(\mathbf{A}\) is the matrix representation of \(E\). \(\mathbf{A}\) is a square matrix with as many rows as nodes in the graph. \(\mathbf{A}\) contains a non zero entry in the \(i\)th row and \(j\)th column if there is an edge between the \(i\)th and \(j\)th vertices.

► Question 10.1

For graphs with undirected edges what is special about the adjacency matrix \(A\)?

► Solution

Figure 10.3: A small undirected graph with numbered nodes.

```
library("igraph")
edges1 = matrix(c(1,3,2,3,3,4,4,5,4,6),byrow = TRUE, ncol = 2)
g1 = graph_from_edgelist(edges1, directed = FALSE)
plot(g1, vertex.size = 25, edge.width = 5, vertex.color = "coral")
```

► Question 10.2

Can you give an alternative way of reading the graph from an edge set dataframe?

► Solution

The nodes or vertices which are the colored circles with numbers in them in Figure 10.3.

Edges or connections, the segments that join the nodes and which can be directed or not.

Edge lengths, when not specified, we suppose they are all one and compute the distance between vertices on the graph as the number the edges traversed. On the other hand, in many situations we have meaningful edge lengths or strengths of connections between vertices that we can use both in the plots and analyses.

Edge and node attributes: optionally, each edge or each node can be mapped to further continuous or categorical variables.

We call a weighted, directed graph a **network**. Networks have adjacency matrices \({\bf A}\) which are \(n\) by \(n\) matrices of positive numbers corresponding to the edge lengths.

For large graphs, some authors like to summarize overall graph structure such as vertex degrees (the number of edges connected to the vertex), centrality or betweenness. We focus on combining graphs with many other sources of data. So, we will not delve deeply into these statistics, though they are available in the various packages (**network**, **igraph**) we use.

If the number of edges is similar to the number of nodes (written \(\#E\sim O(\#V)\)), we say that the graph is **sparse**. Some graphs have many nodes, for instance the package **ppiData** contains a predicted protein interaction (`ppipred`

) graph on about 2500 proteins with 20,000 edges. A complete adjacency matrix for this graph requires 6,250,000 elements. In fact a list of edges and `weights’ are sufficient: this only use 60,000 elements. So storing a large graph as an adjacency matrix can be wasteful. If a graph is *sparse* we can use sparse encodings similar to those implemented in the package **Matrix**.

On the other hand, if the number of edges is (approximately) a quadratic function of the nodes (written \(\#E\sim O(\#V^2)\)), our graph is **dense**. Memory space can be an issue for the storage of large graphs.

An enriched graph can contain: **Arrows and directed edges** In directed graphs we differentiate between in-degree and out-degree and can identify graphs that have cycles. **Annotation variables on nodes and edges** The strength of a link in a graph can be visualized by the width of the edge. Covariates can be added to nodes. A continuous covariate can be associated to the size of the node in the graphical representation. A categorical value associated to the color.

We will see several examples where the same graph is plotted in different ways, either for aesthetic or practical reasons. This is done through the choice of the **graph layout**.

When the edges have lengths representing distances the problem of a 2D representation of the graph is the same as the multidimensional scaling we saw in Chapter 9. It is often solved in a similar way by spreading out the vertex-points as much as possible. In the simple case of edges without lengths, the algorithms can choose different criteria, we’ll see that the method of Fruchterman and Reingold often appears as a default choice. It is based on a physical model where similar points attract and repel each other as if under the effect of physical forces.

► Task

Use the **igraph** package to do the following

Create a dense random graph with 12 nodes and more than 50 edges.

Experiment plotting the graph with different layouts: place the nodes on a circle, or represent the graph as symmetrically as possible, avoid any overlapping nodes or edges.

Usually data do not arrive in the form of graphs. Graphical or network representations are often the result of transforming other data types: **From distances:** Graphs can simplify distance matrices by rounding the smaller distances down to zero. This requires defining a threshold \(t\) and then making edges between points whose distances are smaller than the threshold \(t\); these are often called geometric graphs.

**Binary data** Some data arrive naturally as 0/1 matrices, for instance presence or absence of the different finches in the Galapagos archipelago can be encoded with a 0/1 table. We can build a special type of graph between the finch-vertices and the island-vertices that is equivalent to such a table. That graph is called a bipartite graph; there are only edges between a taxon and an island, and none between taxa, nor between islands. An example is shown in Figure 10.4.

Figure 10.4: This **bipartite** graph connects each taxon to the sites where it was observed.

► Question 10.3

Load the `finch`

data from the **networksis** package and experiment plotting the data to highlight that this is a bipartite network by modifying the following lines of code (Figure 10.5):

► Solution

► Question 10.4

Make a dataframe from the graph `g1`

using the **ggnetwork** package, then plot it using `ggplot`

and the provided geoms `geom_edges`

, `geom_nodes`

and `geom_nodetext`

.

► Solution

In Chapter 2 we saw how a Markov chain can summarize transitions between nucleotides (considered the states of the system). This is often schematized by a graph, the **igraph** package enables many choices for graph `decoration’:

```
library("markovchain")
statesNames = c("A", "C", "G","T")
T1MC = new("markovchain", states = statesNames, transitionMatrix =
matrix(c(0.2,0.1,0.4,0.3,0,1,0,0,0.1,0.2,0.2,0.5,0.1,0.1,0.8,0.0),
nrow = 4,byrow = TRUE, dimnames=list(statesNames,statesNames)))
plot(T1MC, edge.arrow.size = 0.4, vertex.color="purple",
edge.arrow.width = 2.2, edge.width = 5, edge.color = "blue",
edge.curved = TRUE, edge.label.cex = 2.5, vertex.size= 32,
vertex.label.cex = 3.5, edge.loop.angle = 3,
vertex.label.family="sans", vertex.label.color = "white")
```

Figure 10.7: A four state Markov chain with arrows representing possible transitions between states.

Markov chains are idealized models of dynamical systems and the states are represented by the nodes in the graph. The transition matrix gives us the weights on the **directed** edges (arrows) between the states.

► Question 10.5

Which state do you think this Markov chain will end up in ?

► Task

a) Try changing your `set.seed`

function input and see if it changes the plot.

b) Access the help for this particular `plot`

function.

c) Redo the graph and label the edges with the transition probabilities in green and vertices in brown.

We will see how to build a complete example of annotated state space Markov chain graph in Exercise 10.3.

Here is an example of plotting a graph downloaded from the STRING database with annotations at the vertices.

```
library("ggnetwork")
oldpar=par(mar=c(5.1,4.7,4.1,2.6))
datf = read.table("../data/string_graph.txt", header = TRUE)
grs = graph_from_data_frame(datf[, c("node1", "node2")], directed = FALSE)
E(grs)$weight = 1
V(grs)$size = centralization.degree(grs)$res
ggdf = ggnetwork(grs, layout = "fruchtermanreingold", cell.jitter = 0)
ggplot(ggdf, aes(x = x, y = y, xend = xend, yend = yend)) +
geom_edges(color = "black", curvature = 0.1, size = 0.95, alpha = 0.8)+
geom_nodes(aes(x = x, y = y), size = 3, alpha = 0.5, color = "orange") +
geom_nodelabel_repel(aes(label = vertex.names), size=4, color="#8856a7") +
# geom_nodetext(aes(label = vertex.names), size = 4, color = "#8856a7") +
theme_blank() + theme(legend.position = "none")
par(oldpar)
```

Figure 10.8 shows the full perturbed chemokine subnetwork discovered in the study of breast cancer metastasis using GXNA (Nacu et al. 2007) and reported by Yu et al. (2012).

A long unstructured laundry list of possibly differentially expressed genes can be daunting. In Chapter 8, we studied methods for finding a list of differentially expressed genes. Small sample sizes, coupled with efforts to maintain low FDRs, often result in low power to detect differential expression. Therefore, obtaining a long list of genes that can be confidently declared as differentially expressed is, initially, a triumph. However, understanding the underlying biology requires more than just a laundry list of significant players in a biological system.

One of the earliest approaches was to look for gene attributes that are **overrepresented** or **enriched** in the laundry list of significant genes. These gene classes are often based on Gene Ontology (GO) categories (for example, genes that are involved in organ growth, or genes that are involved in feeding behavior). The Gene Ontology (GO) is a collection of three ontologies that describe genes and gene products. These ontologies are restricted vocabularies that have the structure of directed acyclic graphs (DAGS). The most specific terms are the leaves of the graph. The GO graph consists of nodes (here, Gene Ontology terms) and edges from more specific terms (children) to less specific (parents), often these edges are directed. Nodes and edges can have multiple attributes that can be visualized. The main purpose of using GO annotations for a particular set of Genes designated as significant in an experiment is to look for the **enrichment** of a GO term in this list, we will give this term a statistical meaning below. Many other useful lists of important gene sets exist.

► Task

Find a useful database of gene sets.

For instance the `MsigDB`

Molecular Signature Database (Liberzon et al. 2011) contains many gene sets that can be accessed from within **R** using the Bioconductor package **GSEABase** command `getBroadSets`

as follows:

```
library("GSEABase")
## Not run, this requires a login to the website.
fl = "/path/to/msigdb_v5.1.xml"
gss = getBroadSets(fl) # read entire msigdb
organism(gss[[1]])
table(sapply(gss, organism))
```

Yello | w Blue | Red | |
---|---|---|---|

Significant | 25 | 25 | 25 |

Universe | 500 | 100 | 400 |

Here, we start by explaining a basic approach often called or **hypergeometric test**ing.

So-called 'exact' tests because they are nonparametric and based on exhaustive enumerations: **not** because we are sure of the answer – this is statistics after all.

Define a universe of candidate genes that may potentially be significant; say this universe is of size \(N\). We also have a record of the genes that actually **did** come out significant, of which we suppose there were \(m\).

We make a toy model involving balls in boxes, with a total of \(N\) balls corresponding to the genes identified in the gene universe. These genes are split into different functional categories, suppose there are \(N=1,000\) genes, of which 500 are yellow, 100 are blue and 400 are red. Then a subset of \(m=75\) genes are labeled as **significant**, suppose among these significantly interesting genes, there are 25 yellow, 25 red and 25 blue. Is the blue category enriched or overrepresented?

We use this hypergeometric two-way table testing to account for the fact that some categories are extremely numerous and others are rarer.

► Question 10.6

Run a Monte Carlo experiment with 20,000 simulations and compute the p-value of significance of having 25 blues under the null hypothesis that no category is over-represented in the significant set.

► Solution

In the general case, the gene universe is an urn with \(N\) balls, if we pick the \(m\) balls at random and there is a proportion of \(k/N\) blue balls, we expect to see \(km/N\) blue balls in a draw of size \(k\).

Here we show an attractive way of summarizing the connections between the gene functional categories and the significant gene set.

```
library("GOplot")
data("EC")
circ = circle_dat(EC$david, EC$genelist)
chord = chord_dat(circ, EC$genes, EC$process)
GOChord(chord, limit = c(0, 5))
```

In fact, the Gene Ontology graph does not necessarily capture meaningful *gene interactions* as genes from different processes often interact productively. A large amount of information remains unused, for example, all significant genes are usually given equal weight, despite the potentially large variations in their p-values.

We have at our disposal more than just the Gene Ontology. There are many different databases of gene networks from which we can choose a **known skeleton** graph onto which we project significance scores such as p-values from our differential expression experiment. We will follow an idea first suggested by Ideker et al. (2002). This is further developed in Nacu et al. (2007). A careful implementation with many improvements is available as the Bioconductor package **BioNet** (Beisser et al. 2010). These method all search for the subgraphs or modules of a scored-skeleton network that seem to be particularly **perturbed**.

Each gene-node in the network is assigned a score that can either be calculated from a t-statistic or a p-value. Often pathways contain both upregulated and downregulated genes; as pointed out in Ideker et al. (2002), this can be captured by taking absolute values of the test statistic or just incorporating scores computed from the p-values129 We’ll want something like \(-\log p\), so that small p-values give large scores. Beisser et al. (2010) model the pvalues of the genes as we did in Chapter 6: mixture of `non-perturbed`

genes whose pvalues will be uniformly distributed and non uniformly distributed pvalues from the perturbed genes. We model the signal in the data using a beta distribution for the p-values following Pounds and Morris (2003).

Given our node-scoring function; we search for connected hotspots in the graph, ie a subgraph of genes with high combined scores.

Finding the maximal scoring subgraph of a generic graph is known to be intractable in general (we say it is an NP-hard problem), so various approximate algorithms have been proposed. Ideker et al. (2002) suggested using simulated annealing, however this is slow and tends to produce large subgraphs that are difficult to interpret. Nacu et al. (2007) started with a seed vertex and gradually expand around it. Beisser et al. (2010) started the search with a so-called minimal spanning tree (MST), a graph we we will study later in this chapter.

To illustrate the method, we show data from the **BioNet** package. The `interactome`

data contains a connected component of the network comprising 2034 different gene products and 8399 interactions. This constitutes the skeleton graph with which we will work, see Beisser et al. (2010).

The `dataLym`

contains the relevant pvalues and \(t\) statistics for 3,583 genes, you can access them and do the analysis as follows:

```
library("BioNet")
library("DLBCL")
data("dataLym")
data("interactome")
interactome
```

```
## A graphNEL graph with undirected edges
## Number of Nodes = 9386
## Number of Edges = 36504
```

```
pval = dataLym$t.pval
names(pval) = dataLym$label
subnet = subNetwork(dataLym$label, interactome)
subnet = rmSelfLoops(subnet)
subnet
```

```
## A graphNEL graph with undirected edges
## Number of Nodes = 2559
## Number of Edges = 7788
```

The p-values are fit to the type of mixture we studied in Chapter 4, with a uniform component from the null with probability \(\pi_0\) and a beta distribution (proportional to \(a x^{a - 1}\)) for the p-values corresponding to the alternatives (Pounds and Morris 2003). \[f(x|a,\pi_0)= \pi_0 + (1-\pi_0) a x^{a - 1}\qquad \mbox{ for } 0 <x \leq 1; \; 0<a<1\] The package actually gives a different name to \(\pi_0\): it uses \(\lambda\) and calls it the mixing parameter. Running the model with an fdr of 0.001:

```
fb = fitBumModel(pval, plot = FALSE)
fb
```

```
## Beta-Uniform-Mixture (BUM) model
##
## 3583 pvalues fitted
##
## Mixture parameter (lambda): 0.482
## shape parameter (a): 0.180
## log-likelihood: 4471.8
```

`scores=scoreNodes(subnet, fb, fdr = 0.001)`

Figure 10.11: The qqplot shows the quality of the fit of beta-uniform mixture model to the data. The red points have the theoretical quantiles from the beta distribution as the x coordinates the observed quantiles and the y coordinates. The blue line shows that this model fits nicely.

Figure 10.12: A histogram of the mixture components for the p-values, the beta in red and the uniform in blue, \(\pi_0\) is the mixing proportion assigned to the null component whose distribution should be uniform.

Then we run a heuristic search for a high scoring subgraph using:

```
hotSub = runFastHeinz(subnet, scores)
hotSub
```

```
## A graphNEL graph with undirected edges
## Number of Nodes = 153
## Number of Edges = 243
```

```
logFC=dataLym$diff
names(logFC)=dataLym$label
```

Figure 10.14: As mathematical objects, the hierarchical clustering trees (studied in Chapter 5) are the same as phylogenetic trees. They are **rooted binary** trees with labels at the tips.

One really important use of graphs in biology is the construction of phylogenetic trees. Trees are graphs with no **cycles** (the official word for loops, whether self loops, or ones that go through several vertices). Phylogenetic trees are usually rooted binary trees that only have labels on the leaves corresponding to contemporary130 Because they are contemporary, the trees are often represented so that the leaves are a ll the same distance from the root. taxa at the tips. The inner nodes correspond to **ancestral** sequences which have to be inferred from the **contemporaneous** data on the tips. Many methods use aligned DNA sequences from the different species or populations to infer or estimate the tree. The tips of the tree are usually called **OTU**s (Operational Taxonomic Units). The statistical **parameter** of interest in these analyses is the rooted binary tree with OTU labels on its leaves (see Holmes (1999, 2003b) for details).

Figure 10.15: This phylogenetic tree describes the history of different HIV/SIV strains in Africa (Wertheim and Worobey 2009), [Figure from].

HIV is a virus that protects itself by evolving very fast (within months, several mutations can appear). Its evolution can thus be followed in real time; whereas the evolution of large organisms which has happened over millions of years. HIV trees are built for medical purposes such as the detection and understanding of drug resistance. They are estimated for individual genes. Different genes can show differences in their evolutionary histories and thus produce different **gene trees**. The phylogenetic tree in Figure 10.15 shows times when the virus switched from monkeys to humans (Wertheim and Worobey 2009).

- Most phylogenetic trees are shown rooted, the `root’ is usually found by including an outgroup in the tree tips, as we will see later.

- Characters that are derived from this common ancestry are called homologous (geneticists doing population studies replace the term homology by identity by descent (IBD)).

- Sisters on the tree defined by a common ancestor are called clades or monophyletic groups, they have more than just
*similarities*in common.

To infer what happened in the ancestral species from contemporary data collected on the tips of the tree, we have to make assumptions about how substitutions and deletions occur through time. The models we use are all Markovian and said to be time homogeneous: the mutation rate is constant across history.

This is called the **molecular clock** hypothesis, if we do not make this assumption we run into what is known as non-identifiability (ie we can't tell the difference between the many possibile mutational histories given the observed data).

We are going to use the Markov chain we saw in Figure 10.7 on the states A,C,G,T; however now we consider that the changes of state, i.e. mutations, occur at random times. The gaps between these mutational events will follow an exponential distribution. These continuous time Markov chains have the following properties:

**No Memory**. \(P(Y(u+t)=j\;|\;Y(t)=i)\) does not depend on times before t.**Time homogeneity**. The probability \(P(Y(h+t)=j\,|\,Y(t)=i)\) does not depend on \(t\), but on \(h\), the time between the events and on \(i\) and \(j\).**Linearity**. The instantaneous transition rate is of an approximately linear form

We use an error term written here as \(o(h)\), we read this little \(o\) of \(h\), which just means that this error terms grows much slower (i.e., sublinear) than \(h\).

\[\begin{align} P_{ij}(h)&=q_{ij}h+o(h), \quad\text{for } j\neq i \tag{10.1} \\ P_{ii}(h)&=1-q_i(h)+ o(h), \qquad\text{where } q_i=\sum_{j\neq i}q_{ij}. \tag{10.2} \end{align}\]

\(q_{ij}\) is known as the instantaneous transition rate. These rates define matrices as in Table 10.1.

**Exponential distribution**. Times between changes are supposed to be exponentially distributed.

Table 10.1: Two examples of rate matrices, on the left:the Jukes-Cantor (`JC69`

) model, on the right is shown the Kimura (`K80`

) two parameter model.

\(Q=\begin{array}{lcccc} & A &T &C & G\\A&-3\alpha & \alpha &\alpha&\alpha\\T&\alpha & -3\alpha&\alpha &\alpha\\C&\alpha & \alpha &-3\alpha&\alpha\\G&\alpha &\alpha &\alpha &-3\alpha\\\end{array}\) | \(Q=\begin{array}{lcccc} & A &T &C & G\\A&-\alpha-2\beta & \beta &\beta&\alpha\\T&\beta & -\alpha-2\beta &\alpha &\beta\\C&\beta & \alpha &-\alpha-2\beta&\beta\\G&\alpha &\beta &\beta &-\alpha-2\beta\\\end{array}\) |

the instantaneous change probability matrix called **the generator**. In the simplest possible model, called a Jukes-Cantor model; all the mutations are equally likely (see the left of Table 10.1). A slightly more flexible model, called the Kimura model is shown on the right in Table 10.1.

► Question 10.8

Why do we say the Kimura model is more flexible ?

► Solution

Vocabulary overload here! : Transitions in this context mean mutational changes within the purines (tt A<->G) or within the pyrimidines (tt C <-> T); whereas when we talked about Markov chains earlier our **transition** matrix contains all probabilities of any state changes.

The most flexible model is called the Generalized Time Reversible (GTR) model; it has 6 free parameters. We are going to show an example of data simulated according to these generative models from a known tree.

Suppose we already know our phylogenetic tree and want to simulate the evolution of the nucleotides down this tree. First we visualize the tree `tree1`

using `ggtree`

; loading the tree and the relevant packages with:

```
library("phangorn")
library("ggtree")
load("../data/tree1.RData")
```

► Task

Use the `ggtree`

function to plot `tree1`

; make the tips of the tree green triangles, the ancestral nodes, red circles.

Figure 10.16: This is the tree we use as our *true* parameter. We generate nucleotides one at a time from the root and `dropping’ them down the tree. With some probability proportional to the edge lengths, mutations occur down the branches.

```
ggtree(tree1, lwd = 2, color = "darkgreen", alpha = 0.8, right = TRUE) +
geom_tiplab(size = 7, angle = 90, offset = 0.05) +
geom_point(aes(shape = isTip, color = isTip), size = 5, alpha = 0.6)
```

Now we generate some sequences from our tree. Each sequence starts with a new nucleotide letter generated randomly at the root; mutations may occur as we go down the tree. You can see in Figure 10.17 that the colors are not equally represented, because the frequency at the root was chosen to be different from the uniform, see the following code.

```
seqs6 = simSeq(tree1, l = 60, type = "DNA", bf = c(1, 1, 3, 3)/8, rate = 0.1)
seqs6
```

```
## 6 sequences with 60 character and 23 different site patterns.
## The states are a c g t
```

```
mat6df = data.frame(as.character(seqs6))
p = ggtree(tree1, lwd = 1.2) + geom_tiplab(aes(x = branch), size = 5, vjust = 2)
gheatmap(p, mat6df[, 1:60], offset = 0.01, colnames = FALSE)
```

► Question 10.9

Experiment with the code above. Change the bf and rate arguments in the `simSeq`

function to make mutations more likely. Do you think sequences generated with a very high mutation rate would make it easier to infer the tree that generated them ?

► Solution

► Question 10.10

**Estimation bias: distance underestimation.**

1) If we only count the number of changes between two sequences using a simple Hamming distance, but there has been much evolutionary change between the two, why do we underestimate the distance between the sequences?

2) Will be the bias be larger for smaller evolutionary distances?

The standard Markovian models of evolution we saw above enable us to improve these estimates.

includegraphics[width=2cm][Magnify.png]{} “In solving a problem of this sort, the grand thing is to be able to reason backward. That is a very useful accomplishment, and a very easy one, but people do not practise it much. In the everyday affairs of life it is more useful to reason forward, and so the other comes to be neglected. There are fifty who can reason synthetically for one who can reason analytically”. **Sherlock Holmes**

When the true tree-parameter is known, the above-mentioned probabilistic generative models of evolution tells us what patterns to expect in the sequences. As we have seen in earlier chapters, statistics means going back from the data to reasonable estimates of the parameters; here the tree itself and the branch lengths, even the evolutionary rates can be considered to be the parameters.

Figure 10.18: A Steiner tree, the inner points are represented as squares. The method for creating the shortest tree thatpasses through all outer 1,2,5,6 is to create two inside (“ancester”) points 3 and 4.

There are several approaches to estimation: tree `building’ is no exception, here are the main ones:

**A nonparametric estimate: the parsimony tree** Parsimony is a nonparametric method that minimizes the number of changes necessary to explain the data, it’s solution is the same as that of the Steiner tree problem (see Figure 10.18).

**A parametric estimate: the maximum likelihood tree** In order to estimate the tree using a maximum likelihood or Bayesian approach one needs a model for molecular evolution that integrates mutation rates and branch edge lengths. ML estimation (e.g., `Phyml`

, `FastML`

, `RaxML`

) use efficient optimization algorithms to maximize the likelihood of a tree under the model assumptions.

**Bayesian posterior distributions for trees** Bayesian estimation, MrBayes (Ronquist et al. 2012) or BEAST (Bouckaert et al. 2014) both use MCMC to find posterior distributions of the phylogenies. Bayesian methods are not directly integrated into **R** and require the user to import the collections of trees generated by Monte Carlo methods in order to summarize them and make confidence statements see Chakerian and Holmes (2012) for simple examples.

**The semi-parametric approach: distance based methods** These methods, called Neighbor Joining and UPGMA, are quite similar to the hierachical clusering algorithms we already encountered in Chapter 5. However, the distance estimation steps uses the parametric evolutionary models of Table 10.1; the `parametric’ part of why we call the method semi-parametric.

The Neighbor-joining algorithm itself uses Steiner points as the summary of two combined points, and proceeds iteratively as in hierarchical clustering. It can be quite fast and is often used as a good starting point for the more time-consuming methods.

Let’s start by estimating the tree from the data `seqs6`

using the `nj`

(neighbor joining) on DNA distances based on the one-parameter Jukes-Cantor model, we make Figure 10.19 using the `ggtree`

function:

Figure 10.19: Trees built with a neighbor joining algorithm are very fast to compute and are often used as initial values for more expensive estimation procedures such as the maximum likelihood or parsimony.

```
tree.nj = nj(dist.ml(seqs6, "JC69"))
ggtree(tree.nj) + geom_tiplab(size = 7) + ggplot2::xlim(0, 0.8)
```

► Question 10.11

Generate the maximum likelihood scores of the tree1 given the `seqs6`

data and compare them to those of the neighbor joining tree.

► Solution

► Question 10.12

When we have aligned amino acids from which we want to infer a tree, we use (\(20 \times 20\)) transition matrices. The methods for estimating the phylogenetic are very similar. Try this in phangorn with an HIV amino acid sequence downloaded from https://www.hiv.lanl.gov/content/sequence/NEWALIGN/align.html.

The quality of the tree estimates depend on the number of sequences per taxa and the distance to the root. We can evaluate the quality of the estimates either by using parametric and nonparametric bootstraps or performing Bayesian tree estimation using MCMC. For examples of how to visualize and compare the sampling distribution of trees, see Chakerian and Holmes (2012).

In Chapter 5 we saw how to use a probabilistic clustering method to denoise 16S rRNA sequences. We can now reload these denoised sequences and preprocess them before building their phylogeny.

```
library("dada2")
seqtab = readRDS("../data/seqtab.rds")
seqs = getSequences(seqtab)
names(seqs) = seqs
```

131 In order to keep all the information and be able to compare sequences from different experiments, we use the sequences themselves as their label(Callahan, McMurdie, and Holmes 2017). One of the benefits of using well-studied marker loci such as the 16S rRNA gene is the ability to taxonomically classify the sequenced variants. **dada2** includes a naive Bayesian classifier method for this purpose (Wang et al. 2007). This classifier compares sequence variants to training sets of classified sequences. Here we use the RDP v16 training set (Cole et al. 2009)132 See the download link on the dada2 website:https://benjjneb.github.io/dada2/training.html. As an example the code, not run here, for brevity’s sake:

```
fastaRef = "../tmp/rdp_train_set_16.fa.gz"
taxtab = assignTaxonomy(seqtab, refFasta = fastaRef)
```

We obtain a table of taxonomic information, that we have stored and can reload using:

```
taxtab = readRDS(file= "../data/taxtab16.rds")
dim(taxtab)
```

`## [1] 268 6`

► Question 10.13

Write one line of code using the %>% operator that shows just the first 6 rows of the taxonomic information without the row names.

► Solution

► Question 10.14

What is the difference between taxonomic and phylogenetic information?

Note that as the `seqs`

data are randomly generated, they are “cleaner” than the real data we will have to handle.

In particular naturally occurring raw sequences have to be **aligned**. This is necessary as there are often extra nucleotides in some sequences, a consequence of what we call *indel* events133 A nucleotide is deleted or inserted and it is often hard to distinguish which took place.. Also mutations occur and appear as subsitutions of one nucleotide by another.

Here is an example of what the first few characters of **aligned** sequences looks like:

`readLines("../data/mal2.dna.txt") %>% head(12) %>% cat(sep="\n")`

```
## 11 1620
## Pre1 GTACTTGTTA GGCCTTATAA GAAAAAAGT- TATTAACTTA AGGAATTATA
## Pme2 GTATCTGTTA AGCCTTATAA AAAGATAGT- T-TAAATTAA AGGAATTATA
## Pma3 GTATTTGTTA AGCCTTATAA GAGAAAAGTA TATTAACTTA AGGA-TTATA
## Pfa4 GTATTTGTTA GGCCTTATAA GAAAAAAGT- TATTAACTTA AGGAATTATA
## Pbe5 GTATTTGTTA AGCCTTATAA GAAAAA--T- TTTTAATTAA AGGAATTATA
## Plo6 GTATTTGTTA AGCCTTATAA GAAAAAAGT- TACTAACTAA AGGAATTATA
## Pfr7 GTACTTGTTA AGCCTTATAA GAAAGAAGT- TATTAACTTA AGGAATTATA
## Pkn8 GTACTTGTTA AGCCTTATAA GAAAAGAGT- TATTAACTTA AGGAATTATA
## Pcy9 GTACTCGTTA AGCCTTTTAA GAAAAAAGT- TATTAACTTA AGGAATTATA
## Pvi10 GTACTTGTTA AGCCTTTTAA GAAAAAAGT- TATTAACTTA AGGAATTATA
## Pga11 GTATTTGTTA AGCCTTATAA GAAAAAAGT- TATTAATTTA AGGAATTATA
```

We will perform this multiple-alignment on our `seqs`

data using the **DECIPHER** package (Wright 2015):

```
library("DECIPHER")
alignment = AlignSeqs(DNAStringSet(seqs), anchor = NA, verbose = FALSE)
```

We use the **phangorn** package to build the MLE tree (under the GTR model), but will use the neighbor-joining tree as our starting point.

```
phangAlign = phangorn::phyDat(as(alignment, "matrix"), type = "DNA")
dm = phangorn::dist.ml(phangAlign)
treeNJ = phangorn::NJ(dm) # Note, tip order != sequence order
fit = phangorn::pml(treeNJ, data = phangAlign)
fitGTR = update(fit, k = 4, inv = 0.2)
fitGTR = phangorn::optim.pml(fitGTR, model = "GTR", optInv = TRUE,
optGamma = TRUE, rearrangement = "stochastic",
control = phangorn::pml.control(trace = 0))
```

We now need to combine the phylogenetic tree and the denoised read abundances with the complementary information provided about the samples from which the reads were gathered. This **sample** information is often provided as a spreadhseet (or `.csv`

), and (mis)named the *meta*-data. This data integration step is facilitated by the specialized containers and accessors that **phyloseq** provides.

We read in the sample data from a `csv`

file:

```
mimarksPathClean = "../data/MIMARKS_Data_clean.csv"
samdf = read.csv(mimarksPathClean, header = TRUE)
```

Then the full suite of data for this study – the sample-by-sequence feature table, the sample (meta)data, the sequence taxonomies, and the phylogenetic tree – are combined into a single object as follows:

```
library("phyloseq")
physeq = phyloseq(tax_table(taxtab), sample_data(samdf),
otu_table(seqtab, taxa_are_rows = FALSE), phy_tree(fitGTR$tree))
```

This block will only usually be executed once. Once created, we save the object and do all the manipulations starting from that container.

We have already encountered several cases of combining heterogeneous data into special data classes (in particular in Chapter 8, when we studied the `pasilla`

data).

► Task

Look at the detailed **phyloseq** documentation here. Try a few filtering operations; for instance make a subset of the data that contains the tree, taxa abundance table, the sample and taxa information for only the samples that have more than 5,000 reads.

This can be done in one line:

`ps1 = prune_samples(rowSums(otu_table(ps0)) > 5000, ps0)`

We can also make data transformations, while maintaining the integrity of the links between all the data components.

► Question 10.15

What do the following lines of code do ?

```
prev0 = apply(X = otu_table(ps0),
MARGIN = ifelse(taxa_are_rows(ps0), yes = 1, no = 2),
FUN = function(x) {sum(x > 0)})
prevdf = data.frame(Prevalence = prev0,
TotalAbundance = taxa_sums(ps0),
tax_table(ps1))
keepPhyla = table(prevdf$Phylum)[table(prevdf$Phylum) > 5]
prevdf1 = subset(prevdf, Phylum %in% names(keepPhyla))
ps2v = subset_taxa(ps1v, Phylum %in% names(keepPhyla))
```

Plotting the abundances for certain bacteria can easily be performed using barcharts. **ggplot2** expressions have been hardwired into one-line calls in the **phyloseq** package. This makes performing these types of exploratory plots painless. There is even an interactive Shiny-phyloseq browser based tool (McMurdie and Holmes 2015). We do not give full details here but refer the reader to the online vignettes.

Hypothesis testing can identify individual bacteria whose abundance relates to sample variables of interest. A standard approach is very similar to the approach we already visited in Chapter 6. Compute a test statistic for each taxa individually; then jointly adjust p-values to ensure a false discovery rate upper bound. However, this procedure does not exploit any structure among the tested hypotheses – for example, it is likely that if one Ruminococcus species is strongly associated with age, then others are as well. To integrate this information, Benjamini and Yekutieli (2003) and Benjamini and Bogomolov (2014) proposed a hierarchical testing procedure, where taxonomic groups are only tested if higher levels are found to be be associated. In the case where many related species have a slight signal, this pooling of information can increase power.

We apply this method to test the association between microbial abundance and age. Before computing the test statistic, it is appropriate to start by using the normalization protocols we discussed in Chapter 8 following Love, Huber, and Anders (2014)134 The fourth line in the below code chunk fixes the levels of the `ageBin`

factor, to make them admissible for **DESeq2**. for RNA-Seq data and McMurdie and Holmes (2014) for 16S rRNA generated count data and available in the **DESeq2** package:

```
library("DESeq2")
library("phyloseq")
ps1 = readRDS("../data/ps1.rds")
ps_dds = phyloseq_to_deseq2(ps1, design = ~ ageBin + family_relationship)
geometricmean = function(x)
if (all(x == 0)) { 0 } else { exp(mean(log(x[x != 0]))) }
geoMeans = apply(counts(ps_dds), 1, geometricmean)
ps_dds = estimateSizeFactors(ps_dds, geoMeans = geoMeans)
ps_dds = estimateDispersions(ps_dds)
abund = getVarianceStabilizedData(ps_dds)
```

We use the **structSSI** package to perform the hierarchical testing (Sankaran and Holmes 2014). For more convenient printing, we first shorten the names of each taxa:

`rownames(abund) = substr(rownames(abund), 1, 5) %>% make.names(unique = TRUE)`

Unlike standard multiple hypothesis testing, the hierarchical testing procedure needs univariate tests for each higher-level taxonomic group, not just every taxa. A helper function, `treePValues`

, is available for this; it expects an edgelist encoding parent-child relationships, with the first row specifying the root node.

```
library("structSSI")
el = phy_tree(ps1)$edge
el0 = el
el0 = el0[rev(seq_len(nrow(el))), ]
el_names = c(rownames(abund), seq_len(phy_tree(ps1)$Nnode))
el[, 1] = el_names[el0[, 1]]
el[, 2] = el_names[el0[, 2]]
unadj_p = treePValues(el, abund, sample_data(ps1)$ageBin)
```

We can now do our FDR calculations using the hierarchical testing procedure. The test results are guaranteed to control several variants of FDR, but at different levels; we defer details to (Benjamini and Yekutieli 2003; Benjamini and Bogomolov 2014; Sankaran and Holmes 2014).

► Task

Try the following code, including the interactive plotting command that will open a browser window:

```
hfdr_res = hFDR.adjust(unadj_p, el, 0.75)
summary(hfdr_res)
#plot(hfdr_res, height = 5000) # not run: opens in a browser
```

Figure 10.20: A screenshot of a subtree with many differentially abundant microbes, as determined by the hierarchical testing procedure. Currently the user is hovering over the node associated with microbe GCGAG.33; this causes the adjusted p-value (0.029) to appear.

The plot opens in a new browser – a static screenshot of a subtree is displayed in Figure 10.20. Nodes are shaded according to p-values, from blue to orange, representing the strongest to weakest associations. Grey nodes were never tested, to focus power on more promising subtrees. Scanning the full tree; it becomes clear that the association between age group and taxonomic abundances is present in only a few isolated taxonomic groups. It is quite strong in those groups. To give context to these results, we can retrieve the taxonomic identity of the rejected hypotheses.

```
library("dplyr")
options(digits = 3)
tax = tax_table(ps1)[, c("Family", "Genus")] %>% data.frame
tax$seq = rownames(abund)
hfdr_res@p.vals$seq = rownames(hfdr_res@p.vals)
tax %>% left_join(hfdr_res@p.vals[,-3]) %>%
arrange(adjp) %>% head(9) %>% dplyr::select(1,2,4,5)
```

```
## Family Genus unadjp adjp
## 1 Lachnospiraceae <NA> 2.15e-82 4.30e-82
## 2 Lachnospiraceae Roseburia 1.30e-75 2.60e-75
## 3 Lachnospiraceae Clostridium_XlVa 4.00e-59 8.00e-59
## 4 Lachnospiraceae <NA> 3.83e-50 7.67e-50
## 5 Lachnospiraceae Clostridium_XlVa 1.21e-49 2.41e-49
## 6 Lachnospiraceae <NA> 8.63e-49 1.73e-48
## 7 Ruminococcaceae Ruminococcus 9.32e-47 1.86e-46
## 8 Lachnospiraceae Clostridium_XlVa 5.90e-43 1.18e-42
## 9 Lachnospiraceae Roseburia 2.34e-36 4.68e-36
```

It seems that the most strongly associated bacteria all belong to family *Lachnospiraceae*.

A very simple and useful graph is the so-called **minimum spanning tree** (**MST**). Given distances between vertices; the MST is the tree that spans all the points and has the minimum total length (see Figure 10.21).

Greedy algorithms work well for computing the MST and there are many implementations in **R**: (`mstree`

in **ade4**), `mst`

in **ape**, `spantree`

in **vegan** or `mst`

in **igraph**.

Figure 10.21: A **spanning tree** is a tree that goes through all points at least once, the top graph with red edges is such a tree.

The **minimum spanning tree** is the spanning tree of minimum length; the lower blue graph in Figure 10.21 is the minimum spanning tree or MST for short. Here we are going to take the DNA sequence distances between strains of HIV from patients all over the world and construct their minimum spanning tree, the result is shown in Figure 10.22.

```
load("../data/dist2009c.RData")
country09 = attr(dist2009c, "Label")
mstree2009 = ape::mst(dist2009c)
gr09 = graph.adjacency(mstree2009, mode = "undirected")
gg = ggnetwork(gr09, arrow.gap = 0, layout = "fruchtermanreingold")
ggplot(gg, aes(x = x, y = y, xend = xend, yend = yend)) +
geom_edges(color = "black", alpha = 0.5, curvature = 0.1) +
geom_nodes(aes(color = vertex.names), size = 2) + theme_blank() +
geom_nodetext(aes(label = vertex.names), color = "black", size = 2.5) +
theme(plot.margin = unit(c(0, 1, 1, 6), "cm"))+
guides(color = guide_legend(keyheight = 0.09, keywidth = 0.09,
title = "Countries")) + theme(legend.position = c(0, 0.14),
legend.background = element_blank(),
legend.text = element_text(size = 7))
```

► Question 10.16

Make the network plot again but replace `geom_nodetext`

with a labels that repel each other to minimize the overlapping node labels.

► Solution

It could be preferable to use a **graph layout** that incorporates the known geographic coordinates. Thus, we might be able to see how the virus jumped large distances across the world through traveller mobility. We introduce approximate country coordinates which we then **jitter** slightly to avoid too much overlapping:

```
library("rworldmap")
mat = match(country09, countriesLow$NAME)
coords2009 = data.frame(
lat = countriesLow$LAT[mat],
lon = countriesLow$LON[mat],
country = country09)
layoutCoordinates = cbind(
x = jitter(coords2009$lon, amount = 15),
y = jitter(coords2009$lat, amount = 8))
labc = names(table(country09)[which(table(country09) > 1)])
matc = match(labc, countriesLow$NAME)
dfc = data.frame(
latc = countriesLow$LAT[matc],
lonc = countriesLow$LON[matc],
labc)
dfctrans = dfc
dfctrans[, 1] = (dfc[,1] + 31) / 93
dfctrans[, 2] = (dfc[,2] + 105) / 238
ggeo09 = ggnetwork(gr09, arrow.gap = 0, layout = layoutCoordinates)
ggplot(ggeo09, aes(x = x, y = y, xend = xend, yend = yend)) +
geom_edges(color = "black", alpha = 0.5, curvature = 0.1) +
geom_nodes(aes(color = vertex.names), size = 2) +
theme_blank() +
geom_label(data = dfctrans, aes(x = lonc, xend = lonc, y = latc, yend = latc,
label = labc, fill = labc), colour = "white", alpha = 0.5, size = 3) +
theme(legend.position = "none")
```

The input to the minimum spanning tree algorithm is a distance matrix or a graph with a length edge attribute. Figure 10.23 is the minimum spanning tree between cases of HIV, for which strain information was made available through the HIVdb database@HIVdb. The DNA distances was computed using the Jukes-Cantor mutation model.

► Question 10.17

The above analysis provided an **undirected** network of connections, in fact several implementations of the minimum spanning tree (ie for instance `mstree`

in **ade4**) provide a directed path through the points, which can provide meaningful information on the (apparent) spread of disases. Make a directed network version of the above maps.

MST is a very useful component of a simple nonparametric test for detecting differences between factors that are mapped onto its vertices.

0.5cm

Graph-based two-sample tests135 Tests that explore whether two samples aredrawn from the same distribution. were introduced by Friedman and Rafsky (Friedman and Rafsky 1979) as a generalization of the Wald-Wolfowitz runs test (see Figure 10.24). Our previous examples show graph vertices associated with covariates such as country of origin. Here we test whether the covariate is significantly **associated** to the graph structure.

The **Friedman-Rafsky** tests for two/multiple sample segregation on a minimum spanning tree. It was conceived as a generalization of the univariate Wald-Wolfowitz runs test. If we are comparing two samples, say men and women, whose coordinates represent a measurement of interest. We color the two groups blue and red as in Figure 10.24, the Wald-Wolfowitz test looks for long runs of the samecolor that would indicate that the two groups have different means.

Instead of looking for consecutive values of one type (‘runs’), we count the number of connected nodes of the same type.

Once the minimum spanning tree has been constructed, the vertices are assigned `colors’ according to the different levels of a categorical variable. We call **pure** edges those whose two nodes have the same level of the factor variable. We use \(S_O\), the number of **pure** edges as our test statistic. To evaluate whether our observed value could have occurred by chance when the groups have the same distributions, we permute the vertix labels (colors) randomly and recount how many pure edges there are. This label swapping is repeated many times, creating our null distribution for \(S\).

Here we illustrate the idea on a collection of samples from mice whose stool were analyzed for their microbial content. We read in a data set with many mice and many taxa, we compute the Jaccard distance and then use the `mst`

function from the **igraph** package. We annotate the graph with the relevant covariates as shown in the code below:

```
ps1 = readRDS("../data/ps1.rds")
sampledata = data.frame( sample_data(ps1))
d1 = as.matrix(phyloseq::distance(ps1, method="jaccard"))
gr = graph.adjacency(d1, mode = "undirected", weighted = TRUE)
net = igraph::mst(gr)
V(net)$id = sampledata[names(V(net)), "host_subject_id"]
V(net)$litter = sampledata[names(V(net)), "family_relationship"]
```

We make a `ggnetwork`

object from the resulting igraph generated minimum spanning tree and then plot it, as shown in Figure 10.25.

Figure 10.25: The minimum spanning tree based on Jaccard dissimilarity and annotated with the mice ID and litter factors

```
gnet=ggnetwork(net)
ggplot(gnet, aes(x = x, y = y, xend = xend, yend = yend))+
geom_edges(color = "darkgray") +
geom_nodes(aes(color = id, shape = litter)) + theme_blank()+
theme(legend.position="bottom")
```

Now we compute the null distribution and p-value for the test, this is implemented in the **phyloseqGraphTest** package:

```
library("phyloseqGraphTest")
gt = graph_perm_test(ps1, "host_subject_id", distance="jaccard",
type="mst", nperm=1000)
gt$pval
```

`## [1] 0.000999`

We can take a look at the complete histogram of the null distribution generatedby permutation using:

Figure 10.26: The permutation histogram of the number of pure edges in the network obtained from the minimal spanning tree with Jaccard similarity.

`plot_permutations(gt)`

It is not necessary to use an MST for the skeleton graph that defines the edges. Graphs made by linking nearest neighbors (Schilling 1986) or distance thresholding work as well.

The Bioconductor package **phyloseq** has functionality for creating graphs based on thresholding a distance matrix through the function `make_network`

. We create a network by creating an edge between samples whose Jaccard dissimilarity is less than a threshold, which we set below via the parameter `max.dist`

. We use the **ggnetwork** package to add attributes to the vertices indicating which mouse the sample came from and which litter the mouse was in. We see that in the resulting network, shown in Figure 10.27, there is grouping of the samples by both mouse and litter.

```
net = make_network(ps1, max.dist = 0.35)
sampledata = data.frame(sample_data(ps1))
V(net)$id = sampledata[names(V(net)), "host_subject_id"]
V(net)$litter = sampledata[names(V(net)), "family_relationship"]
netg = ggnetwork(net)
```

```
ggplot(netg, aes(x = x, y = y, xend = xend, yend = yend)) +
geom_edges(color = "darkgray") +
geom_nodes(aes(color = id, shape = litter)) + theme_blank()+
theme(plot.margin = unit(c(0, 5, 2, 0), "cm"))+
theme(legend.position = c(1.4, 0.3),legend.background = element_blank(),
legend.margin=margin(0, 3, 0, 0, "cm"))+
guides(color=guide_legend(ncol=2))
```

Note that no matter which graph we build between the samples, we can approximate a null distribution by permuting the labels of the nodes of the graph. However, sometimes it will preferable to adjust the permutation distribution to account for known structure between the covariates.

In the test above, we took a rather naïve approach and showed there was a significant difference between individual mice (the `host_subject_id`

variable). Here we perform a slightly different permutation test to find out if we control for the difference between mice; is there a litter (the `family_relationship`

variable) effect ? The setup of the test is similar, it is simply how the permutations are generated which differs. We maintain the nested structure of the two factors using the `grouping`

argument We permute the `family_relationship`

labels but keep the `host_subject_id`

structure intact.

```
gt = graph_perm_test(ps1, "family_relationship",
grouping = "host_subject_id",
distance = "jaccard", type = "mst", nperm= 1000)
gt$pval
```

`## [1] 0.005`

This test has a small p-value, and we reject the null hypothesis that the two samples come from the same distribution. From the plot of the minimum spanning tree in Figure 10.26, we see by eye that the samples group by litter more than we would expect by chance.

Figure 10.28: The permutation histogram obtained from the minimal spanning tree with Jaccard similarity.

`plot_permutations(gt)`

► Question 10.18

The \(k\)-nearest neighbors graph is obtained by putting an edge between two samples whenever one of them is in the set of \(k\)-nearest neighbors of the other. Redo the test, defining the graph using nearest neighbors defined with the Jaccard distance. What would you conclude?

► Solution

**Note: The dual graph**

In the examples above we sought to show relationships between samples through their shared taxa. It can also be of interest to ask the question about taxa: do some of the taxa co-occur more often than one would expect? This approach can help study microbial `communities’ as they assemble in the microbiome. The methods we developed above all apply to this use-case, all one really does is transpose the data. It is always preferable with sparse data such as the microbiome to use Jaccard and not build correlation networks that can be appropriate in other settings.

. In this chapter we have learnt how to store and plot data that have more structure than simple arrays: graphs have edges and nodes that can also be associated to extra annotations that can be displayed usefully.

. We started by specific examples such as Markov chain graphs, phylogenetic trees and minimum spanning trees. We saw how to use the **ggnetwork** and **igraph** packages to visualize graphs and show as much information as possible by using specific graph layout algorithms.

. We then approached the problem of incorporating a known `skeleton’ graph into differential expression analyses. This enables use to pinpoint perturbation hotspots in a network. We saw how evolutionary models defined along rooted binary trees serve as the basis for phylogenetic tree estimation and how we can incorporate these trees as supplementary information in a differential abundance analysis using the R packages **structSSI** and **phyloseq**.

. Graph and network tools also enable the creation of networks from co-occurrence data and can be used to visualize and test the effect of factor covariates. We saw the Friedman-Rafsky test which provides an easy way of testing dependencies of a variable with the edge structure of a skeleton graph.

. This chapter illustrated ways of incorporating interactions of players in a network and we saw how useful it was to combine this with statistical scores. This often provides biological insight into analyses of complex biological systems.

. We saw that graphs can be both useful to encode our previous knowledge, metabolic network information, gene ontologies and phylogenetic trees of known bacteria are all available in standard databases. It is beneficial in a study to incorprate all known information and doing so by combining these skeleton networks with observed data enhances our understanding of experimental results in the context of what is already known.

On the other hand, the graph can be the outcome that we want to predict and we saw how to build graphs from data (phylogenetic trees, co-occurrence networks and minimum spanning trees).

For complete developments and many important consequences of the evolutionary models used in phylogenetic trees, see the books by Li (1997; Li and Graur 1991). The book by Felsenstein (2004) is the classic text on estimating phylogenetic trees.

The book written by the author of the **ape** packages, Paradis (2011) contains many use-cases and details about manipulation of trees in **R**. A review of bootstrapping for phylogenetic trees can be found in Holmes (2003a).

We can use a tree as well as abundances in a contingency table data through an extension of PCoA-MDS called DPCoA (Double principal coordinate analysis). For microbiome data, the phylogenetic tree provides distances between taxa; these distances serve as the basis for the first PCoA. A second PCoA enables the projection of the weighted sample points. This has proved very effective in microbial ecology applications, see Purdom (2010) or Fukuyama et al. (2012) for details.

Graphs can be used to predict vertex covariates. There is a large field of applied statistics and machine learning that considers the edges in the graph as a response variable for which one can make predictions based on covariates or partial knowledge of the graph; these include **ERGM**’s (Exponential Random Graph Models, Robins et al. (2007)) and kernel methods for graphs (Schölkopf, Tsuda, and Vert 2004).

For theoretical properties of the Friedman-Rafsky test and more examples see Bhattacharya (2015).

A full list packages that deal with graphs and networks is available at: http://www.bioconductor.org/packages/release/BiocViews.html#___GraphAndNetwork.

Create a function that plots a graph starting from an adjacency matrix. Show how it works on an example.

▢

The relationships between gene functions is organized hierarchically into a graph called the Gene Ontology (GO) graph. The biological processes are organized at finer and finer scale. Take one of the databases providing the GO information for the organisms you are interested in. Choose a gene list and build the GO graph for that list.

Hint: Some examples can be found in the packages , , .

▢

*Markov chain graph of transitions between states of the vaginal microbiota*: In DiGiulio et al. (2015) the authors use an **igraph** plot to represent the transitions rates between community state types CSTs using the **markovchain** package. Load the data and the transition rates and state names into an object of the special class `markovchain`

and tailor the layout carefully to include the percentage of preterm birth as a covariate for the vertices (make the vertex size proportional to this variable). Include the size of transitions between states as the width of the arrows.

▢

**Protein interaction networks**: Read the Wikipedia article about the http://en.wikipedia.org/wiki/STRING database (http://www.string-db.org).

The protein Cyclin B1 is encoded by the CCNB1 gene. You can read about it on wikipedia here: http://en.wikipedia.org/wiki/Cyclin_B1.

Use STRING to generate a text file (call it ccnb1datsmall.txt) of edges around the CCNB1 gene. Choose nodes that are connected by evidence of co-expression with a confidence higher than 0.9. Collect no more than 50 interactions and additional nodes that are two steps away from CCNB1 in the graph.

▢

Read the txt file ccnb1datsmall.txt into **R** and make a plot of the graph using one of the graph visualization methods covered in this chapter.

▢

Make a heatmap showing the adjacency matrix of the graph created in Exercises 10.5.

▢

The visualization shows the strongest interactions in the two step neighborhood of CCNB1. Both the plotted graph and the heatmap image show the same data: there seems to be a cluster of proteins which are all similar to CCNB1 and there is also another cluster in the other proteins. Many of the proteins in the CCNB1 cluster are coexpressed at the same time as each other.

Why might this be the case?

Conversely, proteins which are coexpressed with a protein that is coexpressed with CCNB1 (two steps away) do not tend to be coexpressed with each other.

Is it easier for you to see this in one of the figures (the plot or the heatmap) than the other?

▢

Compare the use of **ape** and **phangorn** in the analysis of HIV GAG data. Compute the Jukes Cantor distances between the sequences using both packages and compare them to the Hamming distances.

```
library("ape")
library("phangorn")
GAG=read.dna("../data/DNA_GAG_20.txt")
```

▢

Perform the Friedman–Rafsy type test with a “two-nearest” neighbor-graph using the Bray-Curtis dissimilarity.

▢

Beisser, Daniela, Gunnar W Klau, Thomas Dandekar, Tobias Müller, and Marcus T Dittrich. 2010. “BioNet: An R-Package for the Functional Analysis of Biological Networks.” *Bioinformatics* 26 (8). Oxford Univ Press: 1129–30.

Benjamini, Yoav, and Marina Bogomolov. 2014. “Selective Inference on Multiple Families of Hypotheses.” *JRSSB* 76 (1). Wiley Online Library: 297–318.

Benjamini, Yoav, and Daniel Yekutieli. 2003. “Hierarchical FDR Testing of Trees of Hypotheses.” Technical report, Department of Statistics; Operations Research, Tel Aviv University.

Bhattacharya, Bhaswar B. 2015. “Power of Graph-Based Two-Sample Tests.” *arXiv Preprint arXiv:1508.07530*.

Bouckaert, Remco, Joseph Heled, Denise Kühnert, Tim Vaughan, Chieh-Hsi Wu, Dong Xie, Marc A Suchard, Andrew Rambaut, and Alexei J Drummond. 2014. “BEAST 2: A Software Platform for Bayesian Evolutionary Analysis.” *PLoS Computational Biology* 10 (4). Public Library of Science: e1003537.

Callahan, Benjamin J, Paul J McMurdie, and Susan P Holmes. 2017. “Exact Sequence Variants Should Replace Operational Taxonomic Units in Marker Gene Data Analysis.” *ISME Journal*. Nature publishing, 1–5.

Chakerian, John, and Susan Holmes. 2012. “Computational Tools for Evaluating Phylogenetic and Hierarchical Clustering Trees.” *Journal of Computational and Graphical Statistics* 21 (3). Taylor & Francis Group: 581–99.

Cole, J.R., Q. Wang, E. Cardenas, J. Fish, B. Chai, R.J. Farris, A.S. Kulam-Syed-Mohideen, et al. 2009. “The Ribosomal Database Project: Improved Alignments and New Tools for rRNA Analysis.” *Nucleic Acids Research* 37 (Supplement 1): D141–D145.

DiGiulio, Daniel B., Benjamin J. Callahan, Paul J. McMurdie, Elizabeth K. Costello, Deirdre J. Lyelle, Anna Robaczewska, Christine L. Sun, et al. 2015. “Temporal and Spatial Variation of the Human Microbiota During Pregnancy.” *PNAS*.

Felsenstein, Joseph. 2004. *Inferring Phylogenies*. Boston: Sinauer.

Friedman, Jerome H, and Lawrence C Rafsky. 1979. “Multivariate Generalizations of the Wald-Wolfowitz and Smirnov Two-Sample Tests.” *The Annals of Statistics*, 697–717.

Fukuyama, Julia, Paul J McMurdie, Les Dethlefsen, David A Relman, and Susan Holmes. 2012. “Comparisons of Distance Methods for Combining Covariates and Abundances in Microbiome Studies.” In *Pac Symp Biocomput*. World Scientific.

Holmes, Susan. 1999. “Phylogenetic Trees: An Overview.” In *Statistics and Genetics*, 81–118. IMA 112. New York: Springer.

Holmes, Susan. 2003a. “Bootstrapping Phylogenetic Trees: Theory and Methods.” *Statistical Science* 18 (2): 241–55.

Holmes, Susan. 2003b. “Statistics for phylogenetic trees.” *Theoretical Population Biology* 63 (1). Elsevier: 17–32.

Ideker, Trey, Owen Ozier, Benno Schwikowski, and Andrew F Siegel. 2002. “Discovering Regulatory and Signalling Circuits in Molecular Interaction Networks.” *Bioinformatics* 18 Suppl 1 (January): S233–40. http://bioinformatics.oxfordjournals.org/cgi/reprint/18/suppl\_1/S233.

Li, Wen-Hsiung. 1997. *Molecular Evolution.* Sinauer Associates Incorporated.

Li, Wen-Hsiung, and Dan Graur. 1991. *Fundamentals of Molecular Evolution*. Vol. 48. Sinauer Associates Sunderland, MA.

Liberzon, Arthur, Aravind Subramanian, Reid Pinchback, Helga Thorvaldsdóttir, Pablo Tamayo, and Jill P Mesirov. 2011. “Molecular Signatures Database (MSigDB) 3.0.” *Bioinformatics* 27 (12): 1739–40.

Love, Michael I, Wolfgang Huber, and Simon Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for RNA-seq Data with DESeq2.” *Gnome Biology* 15 (12): 1–21.

McMurdie, Paul J, and Susan Holmes. 2014. “Waste Not, Want Not: Why Rarefying Microbiome Data Is Inadmissible.” *PLoS Computational Biology* 10 (4). Public Library of Science: e1003531.

McMurdie, Paul J, and Susan Holmes. 2015. “Shiny-Phyloseq: Web Application for Interactive Microbiome Analysis with Provenance Tracking.” *Bioinformatics* 31 (2). Oxford Univ Press: 282–83.

Mossel, Elchanan. 2003. “On the Impossibility of Reconstructing Ancestral Data and Phylogenies.” *Journal of Computational Biology* 10 (5). Mary Ann Liebert, Inc.: 669–76.

Nacu, Serban, Rebecca Critchley-Thorne, Peter Lee, and Susan Holmes. 2007. “Gene Expression Network Analysis and Applications to Immunology.” *Bioinformatics* 23 (7, 7): 850–8. https://doi.org/10.1093/bioinformatics/btm019.

Nolan, Daniel J, Michael Ginsberg, Edo Israely, Brisa Palikuqi, Michael G Poulos, Daylon James, Bi-Sen Ding, et al. 2013. “Molecular Signatures of Tissue-Specific Microvascular Endothelial Cell Heterogeneity in Organ Maintenance and Regeneration.” *Developmental Cell* 26 (2). Elsevier: 204–19.

Paradis, Emmanuel. 2011. *Analysis of Phylogenetics and Evolution with R*. Springer Science & Business Media.

Pounds, Stan, and Stephan W Morris. 2003. “Estimating the Occurrence of False Positives and False Negatives in Microarray Studies by Approximating and Partitioning the Empirical Distribution of P-Values.” *Bioinformatics* 19 (10). Oxford Univ Press: 1236–42.

Purdom, Elizabeth. 2010. “Analysis of a Data Matrix and a Graph: Metagenomic Data and the Phylogenetic Tree.” *Annals of Applied Statistics*, July.

Rhee, Soo-Yon, Matthew J Gonzales, Rami Kantor, Bradley J Betts, Jaideep Ravela, and Robert W Shafer. 2003. “Human Immunodeficiency Virus Reverse Transcriptase and Protease Sequence Database.” *Nucleic Acids Research* 31 (1). Oxford Univ Press: 298–303.

Robins, Garry, Tom Snijders, Peng Wang, Mark Handcock, and Philippa Pattison. 2007. “Recent Developments in Exponential Random Graph (P*) Models for Social Networks.” *Social Networks* 29 (2). Elsevier: 192–215.

Ronquist, Fredrik, Maxim Teslenko, Paul van der Mark, Daniel L Ayres, Aaron Darling, Sebastian Höhna, Bret Larget, Liang Liu, Marc A Suchard, and John P Huelsenbeck. 2012. “MrBayes 3.2: Efficient Bayesian Phylogenetic Inference and Model Choice Across a Large Model Space.” *Systematic Biology* 61 (3). Oxford University Press: 539–42.

Sankaran, Kris, and Susan Holmes. 2014. “structSSI: Simultaneous and Selective Inference for Grouped or Hierarchically Structured Data.” *Journal of Statistical Software* 59 (1): 1–21.

Schilling, Mark F. 1986. “Multivariate Two-Sample Tests Based on Nearest Neighbors.” *Journal of the American Statistical Association* 81 (395). Taylor & Francis: 799–806.

Schölkopf, Bernhard, Koji Tsuda, and Jean-Philippe Vert. 2004. *Kernel Methods in Computational Biology*. MIT press.

Wang, Q., G.M. Garrity, J.M. Tiedje, and J.R. Cole. 2007. “Naive Bayesian Classifier for Rapid Assignment of rRNA Sequences into the New Bacterial Taxonomy.” *Applied and Environmental Microbiology* 73 (16). Am Soc Microbiol: 5261.

Wertheim, Joel O, and Michael Worobey. 2009. “Dating the Age of the SIV Lineages That Gave Rise to HIV-1 and HIV-2.” *PLoS Computational Biology* 5 (5). Public Library of Science: e1000377.

Wright, Erik S. 2015. “DECIPHER: Harnessing Local Sequence Context to Improve Protein Multiple Sequence Alignment.” *BMC Bioinformatics* 16 (1). BioMed Central: 1.

Yu, Hongxiang, Diana L Simons, Ilana Segall, Valeria Carcamo-Cavazos, Erich J Schwartz, Ning Yan, Neta S Zuckerman, et al. 2012. “PRC2/EED-EZH2 Complex Is up-Regulated in Breast Cancer Lymph Node Metastasis Compared to Primary Tumor and Correlates with Tumor Proliferation in Situ.” *PloS One* 7 (12). Public Library of Science: e51239.

Page built: 2019-05-15

Support for maintaining the online version of this book is provided by de.NBI

For website support please contact msmith@embl.de