```{r load_crabs_data , echo = TRUE}
data("crabs")
dim(crabs)
sample_n(crabs,10)
table(crabs$sex)
```

We can plot the rear width (RW, in mm) per sex and species to get an idea of whether it
differs between these groups.
```{r plot_crabs_per_age , echo = TRUE}
qplot(sp, RW, data = crabs, facets = ~ sex, color = sp, geom = "boxplot")
```
We will regard these data as providing five quantitative features (FL, RW, CL, CW,
BD) and a pair of class labels (sex, sp = species). We may regard this as a four-class
problem, or as two two-class problems.
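As a quick illustration of this class structure (not part of the question below), the four classes are just the crossing of the two two-level labels:
```{r class_structure, echo = TRUE}
table(crabs$sex, crabs$sp)                          # two two-class labels ...
table(interaction(crabs$sex, crabs$sp, sep = ":"))  # ... or one four-class label
```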
Our first problem does not involve any computations. If you want to write R code to solve
the problem, do so, but use prose first.
* **Question** On the basis of the boxplots in the figure above, comment on the prospects for
predicting species on the basis of RW. State a rule for computing the predictions.
Describe how to assess the performance of your rule.
## Species prediction via logistic regression
A simple approach to prediction is logistic regression, which models the logit
(log-odds) of the class membership probability as a linear function of the
predictors.
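In generic notation (not specific to the crabs fit below), writing $p(x) = P(Y = 1 \mid x)$ for the probability of the second class given a single predictor $x$, the logistic regression model is

$$\log\frac{p(x)}{1-p(x)} = \beta_0 + \beta_1 x .$$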
```{r log_reg, dependson="load_crabs_data"}
m1 = glm(sp~RW, data=crabs, family=binomial)
summary(m1)
```

The logistic regression fit gives, for each sample, the probability of belonging
to class O. Below, we classify a sample into the O class if this probability is
greater than 0.5.
* **Question** Write down the statistical model corresponding to the R expression
above. How can we derive a classifier from this model?
* **Question** Perform the following computations. Discuss their interpretation. What
are the estimated error rates of the two models? Is the second model, on the subset,
better?
```{r log_reg_mod, dependson="log_reg"}
qplot(crabs$sp, predict(m1, type = "response"), geom = "boxplot", color = crabs$sp)
table(predict(m1, type = "response") > .5, crabs$sp)      # confusion table for m1 on all crabs
crabs.F <- subset(crabs, sex == "F")                      # restrict to the female crabs
m2 = glm(sp ~ RW, data = crabs.F, family = binomial)
table(predict(m2, type = "response") > .5, crabs.F$sp)    # confusion table for m2 on the subset
qplot(crabs.F$sp, predict(m2, type = "response"), geom = "boxplot", color = crabs.F$sp)
```
What we computed above is commonly called a "confusion matrix": for a two--class
scenario, it gives the number of true positives / negatives and the errors,
represented by false positives and false negatives. This terminology stems
from medical disease testing; in our example there is no natural
"positive" or "negative" category, so we (arbitrarily) set "B" as the negative class.
```{r confMatrix, dependson="log_reg_mod"}
cMatrix = crossval::confusionMatrix(actual = crabs.F$sp,
                                    predicted = ifelse(predict(m2, type = "response") > .5, "O", "B"),
                                    negative = "B")
cMatrix
diagnosticErrors(cMatrix)
```
The function `diagnosticErrors` computes various diagnostic errors useful for
evaluating the performance of a diagnostic test or a classifier:
accuracy (acc), sensitivity (sens), specificity (spec), positive
predictive value (ppv), negative predictive value (npv), and
log-odds ratio (lor). The definitions are:
* acc = (TP+TN)/(FP+TN+TP+FN)
* sens = TP/(TP+FN)
* spec = TN/(FP+TN)
* ppv = TP/(FP+TP)
* npv = TN/(TN+FN)
* lor = $\log(TP \times TN / (FN \times FP))$
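For reference, these measures can be computed directly from the four cell counts of a confusion matrix; the counts below are made up purely for illustration and are not the crabs results:
```{r diag_errors_by_hand}
## hypothetical counts, purely for illustration
TP <- 40; TN <- 35; FP <- 15; FN <- 10
c(acc  = (TP + TN) / (TP + TN + FP + FN),
  sens = TP / (TP + FN),
  spec = TN / (FP + TN),
  ppv  = TP / (TP + FP),
  npv  = TN / (TN + FN),
  lor  = log(TP * TN / (FN * FP)))
```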
The first model is very bad: there is hardly any predictive power. If we condition
on the sex and use only the females, the predictions become better.
## The cross-validation concept
Cross--validation is a technique that is widely used for reducing bias in the estimation
of predictive accuracy. If no precautions are taken, bias can be caused by **overfitting** a
classification algorithm to a particular dataset; the algorithm learns the classification "by
heart", but performs poorly when asked to generalize it to new, unseen examples. Briefly, in
cross--validation the dataset is deterministically partitioned into a series of training and test
sets. The model is built for each training set and evaluated on the test set. The accuracy
measures are averaged over this series of fits. Leave--one--out cross--validation consists of $N$
fits, with $N$ training sets of size $N-1$ and $N$ test sets of size 1.
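To make the idea concrete, here is a minimal leave-one-out sketch for the RW-only logistic model on the female crabs, written out by hand (it assumes `crabs.F` from the chunk above; the MLearn machinery used below automates this kind of loop):
```{r loocv_by_hand, dependson="log_reg_mod"}
n <- nrow(crabs.F)
pred <- character(n)
for (i in seq_len(n)) {
  fit <- glm(sp ~ RW, data = crabs.F[-i, ], family = binomial)    # train on the other N-1 crabs
  p <- predict(fit, newdata = crabs.F[i, , drop = FALSE], type = "response")
  pred[i] <- ifelse(p > 0.5, "O", "B")                            # classify the held-out crab
}
mean(pred != crabs.F$sp)                                          # leave-one-out error estimate
```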
First let us use MLearn from the `r Biocpkg("MLInterfaces") ` package to fit a single logistic model.
MLearn requires you to specify an index set for training. We use ` c(1:30, 51:80) ` to choose
a training set of size 60, balanced between the two species (because we know the ordering of
records). This procedure also requires you to specify a probability threshold for classification.
We use a typical default of 0.5. If the predicted probability of being "O" exceeds 0.5,
we classify to "O", otherwise to "B".
```{r trainTest, dependson="log_reg_mod" }
ml1 = MLearn(sp ~ RW, crabs.F, glmI.logistic(thresh = .5), c(1:30, 51:80),
             family = binomial)
ml1
cMatrix.TT = crossval::confusionMatrix(actual = ml1@testOutcomes,
                                       predicted = ml1@testPredictions,
                                       negative = "B")
cMatrix.TT
diagnosticErrors(cMatrix.TT)
```
* **Question** What does the report on ml1 tell you about predictions with this model?
Can you reconcile this with the results in model m2? (Hint --- non--randomness of the
selection of the training set is a problem.)
Now we will illustrate cross-validation. We invoke MLearn again, this time specifying a
five--fold cross-validation; the `balKfold.xvspec(5)` specification forms five folds that
are approximately balanced with respect to the class frequencies.
```{r crossval5, dependson="trainTest"}
smx1 = MLearn(sp ~ RW, crabs.F, glmI.logistic(thresh = .5),
              xvalSpec(type = "LOG", niter = 5, balKfold.xvspec(5)),
              family = binomial)
cMatrix.cv = crossval::confusionMatrix(actual = smx1@testOutcomes,
                                       predicted = smx1@testPredictions,
                                       negative = "B")
cMatrix.cv
diagnosticErrors(cMatrix.cv)
```
We can now compare the cross-validation results to those obtained when the full data set
is used for both training and testing (m2), and when a single training--test split is used (ml1).
```{r compareTTandXVal}
rbind(noSplit  = diagnosticErrors(cMatrix),
      oneSplit = diagnosticErrors(cMatrix.TT),
      CV5      = diagnosticErrors(cMatrix.cv))
```
* **Question** Define clearly the difference between the models ml1 and smx1.
# Applying classification to Microarray data
Classification methods have been very popular for microarray data, for example in
order to identify signatures of cancer subtypes.
Traditional methods often yield unsatisfactory results or are
even inapplicable in the $p \gg n$ setting.
Hence, microarray studies have stimulated the development of
new approaches and motivated the adaptation of known traditional methods
to high-dimensional data.
One of the most commonly used algorithms for this purpose is nearest shrunken centroids
(NSC), also known as PAM (prediction analysis of microarrays). [See the paper](http://statweb.stanford.edu/~tibs/ftp/STS040.pdf).
Moreover, model selection and evaluation of prediction rules
proves to be highly difficult in this situation for several reasons.
Firstly, the hazard of overfitting, which is common to all prediction problems,
is increased by high dimensionality. Secondly, the usual evaluation scheme based
on the splitting into learning and test data sets often applies only partially in
the case of small samples.
Lastly, modern classification techniques rely on the proper choice of hyperparameters
whose optimization is highly computer-intensive, especially in the case of
high-dimensional data.
## Learning with expression arrays
Here we will concentrate on ALL: acute lymphocytic
leukemia, B-cell type.
### Phenotype reduction
We will identify expression patterns
that discriminate individuals with BCR/ABL fusion in
B-cell leukemia.
```{r setupALL}
data("ALL")
bALL = ALL[, substr(ALL$BT,1,1) == "B"]
fus = bALL[, bALL$mol.biol %in% c("BCR/ABL", "NEG")]
fus$mol.biol = factor(fus$mol.biol)
fus
sample_n(pData(fus), 10)
```
### Nonspecific filtering
We can nonspecifically filter to the 300 genes (to save computing time) with the
largest measure of robust variation (MAD) across all samples:
```{r selectGenes, dependson="setupALL"}
mads = apply(exprs(fus), 1, mad)                          # per-gene median absolute deviation
fusk = fus[mads >= sort(mads, decreasing = TRUE)[300], ]  # keep the 300 most variable genes
fcol = ifelse(fusk$mol.biol == "NEG", "green", "red")     # colours for later plots
```
## Constructing a classification rule
The second main issue is the construction of an appropriate prediction rule. In microarray
data analysis, we have to deal with the $n\ll p$ situation, i.e. the number of predictors
exceeds by far the number of observations. Some class prediction methods
only work when $n \gg p$, e.g. linear or quadratic discriminant analysis,
which are based on the inversion of a covariance matrix of size $p\times p$ that has
rank at most $n-1$ and is thus singular when $p > n$.
In the $n\ll p$ setting, the hazard of
overfitting is especially acute: perfect separation of the classes
for a given sample based on a high number of predictors is always
possible. However, the resulting classification rule may generalize
poorly on independent test data.
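A tiny simulated example (unrelated to the ALL data) illustrates the rank problem mentioned above: with fewer samples than features, the $p \times p$ sample covariance matrix is rank-deficient and cannot be inverted, so plain LDA/QDA breaks down.
```{r singular_covariance_demo}
set.seed(1)
X <- matrix(rnorm(10 * 50), nrow = 10)   # n = 10 samples, p = 50 features
S <- cov(X)                              # 50 x 50 sample covariance matrix
qr(S)$rank                               # at most n - 1 = 9, far below p, so solve(S) would fail
```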
There are basically three approaches to cope with the $n\ll p$ setting:
* variable selection using, e.g., univariate statistical tests
* regularization or shrinkage methods, such as the Support Vector Machine,
$\ell_2$- or $\ell_1$-penalized logistic regression, or Boosting, some of which
also perform variable selection
* dimension reduction or feature extraction, most prominently the Partial
Least Squares method
Most classification methods depend on a vector of hyperparameters ${\lambda}$
that has to be chosen appropriately. Together with variable selection, this is part of
the model selection and thus has to be performed separately for each learning sample
${L}_b$ to avoid bias.
An optimal vector of
hyperparameters ${\lambda}^{\text{opt}}$ is determined by defining a discrete set of
values whose performance is then measured by cross-validation. This involves
a further splitting step, which is sometimes called 'inner loop' or
'nested' cross-validation.
More precisely, each learning set ${L}_b$ is divided into
$l=1,\ldots,k$ parts such that the $l$-th part
forms the test sample for the $l$-th (inner) iteration.
This procedure can be used to derive an error estimator for each value on the grid
of candidate hyperparameter values. The optimal vector ${\lambda}^{\text{opt}}$
is chosen to minimize this error criterion. Note that there are often several
such minimizers.
Furthermore, the
minimizer found by this procedure can be relatively far away from the true minimizer,
depending on how fine the discrete grid has been chosen.
The choice of the inner cross-validation scheme is difficult. With a high $k$,
computation times soon become prohibitively high. With a low $k$, the size of
${L}_{b_l}, \; l=1,\ldots,k$ is strongly reduced compared to the complete
sample $S$, which may have an impact on the derived optimal parameter values.
Nevertheless, nested cross-validation is commonly used in this context.
Considering the computational effort required for hyperparameter optimization and
the small sample sizes, one may prefer class prediction methods
that do not depend on many hyperparameters and/or behave robustly against changes
of the hyperparameter values.
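To fix ideas, the sketch below writes out such a nested (inner/outer) cross-validation by hand on simulated data, using kNN from the `class` package as a stand-in for a classifier with a single hyperparameter; it only illustrates the scheme and is not the CMA-based tuning used in the next section.
```{r nested_cv_sketch, message=FALSE}
library(class)
set.seed(2)
X <- matrix(rnorm(100 * 5), ncol = 5)                          # toy data: 100 samples, 5 features
y <- factor(ifelse(X[, 1] + rnorm(100, sd = 0.5) > 0, "A", "B"))
ks <- c(1, 3, 5, 7, 9)                                         # grid of candidate hyperparameters
outer_folds <- sample(rep(1:5, length.out = nrow(X)))          # 5 outer folds
outer_err <- numeric(5)
for (b in 1:5) {
  learn <- outer_folds != b                                    # learning set L_b
  inner_folds <- sample(rep(1:5, length.out = sum(learn)))     # inner split of L_b only
  inner_err <- sapply(ks, function(k)
    mean(sapply(1:5, function(l) {
      tr <- which(learn)[inner_folds != l]
      te <- which(learn)[inner_folds == l]
      mean(knn(X[tr, ], X[te, ], y[tr], k = k) != y[te])
    })))
  k_opt <- ks[which.min(inner_err)]                            # hyperparameter chosen on L_b only
  outer_err[b] <- mean(knn(X[learn, ], X[!learn, ], y[learn], k = k_opt) != y[!learn])
}
mean(outer_err)                                                # outer CV error with nested tuning
```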
## Using logistic regression
We first generate the learning sets for CV and then tune
the parameters of the method. We use the method of
[Zhu and Hastie, 2004](http://dx.doi.org/10.1093/biostatistics/kxg046).
```{r logisticReg, dependson="setupALL"}
labs <- pData(fus)$mol.biol
## generate stratified five-fold CV learning sets (GenerateLearningsets is from CMA)
fiveCVdat <- GenerateLearningsets(y = labs,
                                  method = "CV", fold = 5, strat = TRUE)
fiveCVdat
#?plrCMA
## tune the penalty parameter of the penalized logistic regression
tuningstep <- CMA::tune(X = t(exprs(fus)), y = labs,
                        learningsets = fiveCVdat, classifier = "plrCMA")
tuningstep
plot(tuningstep, iter = 3)
unlist(CMA::best(tuningstep))
## run the classifier
class_fiveCV <- classification(X = t(exprs(fus)), y = labs, classifier = "plrCMA",
                               learningsets = fiveCVdat, lambda = 0.0625)
## evaluate
av_logR <- evaluation(class_fiveCV, measure = "average probability",
                      scheme = "classwise", y = labs)
show(av_logR)
#boxplot(av_logR)
#summary(av_logR)
```
## Using PAM
```{r PAM, dependson="setupALL"}
library(pamr)
pamTrain <- pamr.train(list(x = exprs(fus), y = labs))
pamTrain
#pam.results <- pamr.cv(pamTrain, list(x = exprs(fus), y = labs))
## predict on the training data; the shrinkage threshold would normally be chosen via pamr.cv
labs_new = pamr.predict(pamTrain, exprs(fus), threshold = 1)
pred.err = sum(labs_new != labs) / length(labs)
pred.err
## class-specific prediction error
tapply(labs_new != labs, labs, mean)
```
# Exploratory multivariate analysis
## Scatterplots
* **Question 7** Interpret the following code, whose result is shown in Figure 2. Modify
it to depict the pairwise configurations with different colors for the two sexes.
```{r pairs}
pairs(crabs[, -c(1:3)], col = ifelse(crabs$sp == "B", "blue", "orange"))   # columns 1:3 are sp, sex, index
```
## Principal components analysis (PCA)
Principal components analysis transforms the multivariate
data $X$ into a new coordinate system. If the original variables
are $X_1, \ldots, X_p$, then the variables in the new representation
are denoted $PC_1, \ldots, PC_p$. These new variables have
the properties that $PC_1$ is the linear combination of $X_1, \ldots, X_p$
having maximal variance, $PC_2$ is the variance-maximizing linear combination of the
residuals of $X$ after projection onto the hyperplane normal to $PC_1$, and so on.
If most of the variation in $X_{n \times p}$ can be captured in
a low dimensional linear subspace of the space spanned
by the columns of $X$, then the scatterplots of
the first few principal components give a good representation of
the structure in the data.
One can show that in such a scatterplot the data points (here, the samples) are projected
onto the 2D plane in which they spread out optimally.
Formally, we can compute the PCs using the singular value decomposition of
the (column-centered) data matrix $X$, written $X = UDV^t$, where $U_{n \times p}$ has
orthonormal columns, $V_{p \times p}$ is orthogonal, and $D$ is a diagonal matrix of $p$
nonnegative singular values. The principal components transformation is
$XV = UD$, and if $D$ is arranged so that $D_{ii} \geq D_{jj}$ whenever
$i < j$, then column $i$ of $XV$ is $PC_i$. Note also that $D_{ii} =
\sqrt{n-1}\;\mbox{sd}(PC_i)$.
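These relations can be checked directly in R on a small random matrix (an illustrative check only, independent of the crabs data):
```{r svd_vs_prcomp_check}
set.seed(3)
X <- matrix(rnorm(20 * 4), ncol = 4)
Xc <- scale(X, center = TRUE, scale = FALSE)             # prcomp centers the columns
sv <- svd(Xc)                                            # Xc = U D V^t
pc <- prcomp(X)
all.equal(abs(unname(pc$x)), abs(sv$u %*% diag(sv$d)))   # scores XV = UD (up to column signs)
all.equal(pc$sdev, sv$d / sqrt(nrow(X) - 1))             # D_ii = sqrt(n-1) * sd(PC_i)
```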
We use the function ` prcomp ` to compute the principal components. The
resulting principal component scores, i.e. the coordinates of the samples in
the space of the principal components can then be plotted.
For this we use the function `qplot` from `r CRANpkg("ggplot2") `. `r CRANpkg("ggplot2") `
is a comprehensive plotting package for R, a very good tutorial is [here](http://jofrhwld.github.io/avml2012/).
The function `qplot` is supposed to mimic the standard plotting function `plot`
as closely as possible, but additional changes can be made, e.g. lattice--type
graphics (splitting the plot by a factor of interest) can easily be generated.
Additional plot elements are added by using `+` and the corresponding commands.
In our PCA plot, we change the plotting colors used and the axis labels.
```{r PCA_computations, dependson= "log_reg_mod" }
PCA <- prcomp(dplyr::select(crabs, FL:BD))                 # PCA on the five morphological features
percentVar <- round(100 * PCA$sdev^2 / sum(PCA$sdev^2), 1) # % variance explained per PC
```
We now plot the first few principal component scores, using color for the species and the
plotting symbol for the sex.
```{r PCA_plots, fig.cap= "PCA plot", fig.width=7.5, fig.height=6, dependson="PCA_computations" }
dataGG = data.frame(PC1 = PCA$x[,1], PC2 = PCA$x[,2],
PC3 = PCA$x[,3], PC4 = PCA$x[,4],
species = crabs$sp,
sex = crabs$sex)
(qplot(PC1, PC2, data = dataGG, color = species, shape = sex,
main = "PC1 vs PC2", size = I(3))
+ labs(x = paste0("PC1, VarExp:", round(percentVar[1],4)),
y = paste0("PC2, VarExp:", round(percentVar[2],4)))
+ scale_colour_brewer(type="qual", palette=2)
)
(qplot(PC2, PC3, data = dataGG, color = species, shape = sex,
main = "PC2 vs PC3", size = I(3))
+ labs(x = paste0("PC2, VarExp:", round(percentVar[2],4)),
y = paste0("PC3, VarExp:", round(percentVar[3],4)))
+ scale_colour_brewer(type="qual", palette=2)
)
# (qplot(PC3, PC4, data = dataGG, color = species, shape = sex,
# main = "PC3 vs PC4", size = I(3))
# + labs(x = paste0("PC3, VarExp:", round(percentVar[3],4)),
# y = paste0("PC4, VarExp:", round(percentVar[4],4)))
# + scale_colour_brewer(type="qual", palette=2)
# )
```
## Clustering
A familiar technique for displaying multivariate data in
high--throughput biology is called the heatmap. In this display,
samples are clustered as columns, and features as rows. The
clustering technique used by default is R's `hclust`. This
procedure builds a clustering tree for the data as follows. Distances
are computed between each pair of feature vectors for all $N$ observations. The
closest pair is joined and regarded as a new object,
so there are $N-1$ objects (clusters) at this point. This process is
repeated until a single cluster is formed; the clustering tree shows the
process by which clusters are created via this agglomeration process.
The most crucial choice when applying this method is the initial choice of the distance
metric between the features.
Once clusters have been formed, there are several ways to measure distances
between them, based on the initial between-feature distances. Single-linkage clustering
takes the distance between two clusters
to be the shortest distance between any two
members of the different clusters;
average linkage averages all the distances
between members; complete linkage uses the
maximum distance between any two members of the different
clusters. Other methods are also available in `hclust`.
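Before turning to the heatmap, here is what the underlying clustering call looks like when applied directly to the crab measurements (a small sketch; Euclidean distance and complete linkage are also the defaults used by `heatmap.2` below):
```{r hclust_crabs_sketch, fig.height=5}
crab_features <- as.matrix(dplyr::select(crabs, FL:BD))
d_crabs <- dist(crab_features)                        # Euclidean distances between samples
hc_complete <- hclust(d_crabs, method = "complete")   # complete linkage (the hclust default)
hc_single <- hclust(d_crabs, method = "single")       # single linkage, for comparison
plot(hc_complete, labels = FALSE, main = "Crabs, complete linkage")
```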
We visualize cluster trees for samples and features in a heatmap, using the function
`heatmap.2` from the `r CRANpkg("gplots") ` package.
```{r heatmap, fig.cap= "clustering heatmap", fig.height = 20, fig.width = 20, eval=TRUE}
#distfun <- function(x) {dist(x, method = "euclidean")}
hmcol <- colorRampPalette(brewer.pal(n = 3, name = "PuOr"))(255)   # brewer.pal() is from RColorBrewer
heatmap.2(as.matrix(dplyr::select(crabs, FL:BD)), trace = "none", col = rev(hmcol),
          margin = c(13, 13), main = "Euclidean distance heatmap")
```
The heatmap shows that the features CL and CW seem to discriminate best
between the two main clusters, while RW does not show obvious differences
between the samples.
# Application to high--dimensional biological data
We will use two data sets now: one for supervised learning and one for
unsupervised learning.
For **Unsupervised learning**, we will use the NCI60 cancer cell
microarray data, which contains 6830 gene expression measurements for 64 cancer
cell lines. We dropped the subtypes containing only one sample (there were
about 6 of them) and created a SummarizedExperiment object.
The *SummarizedExperiment* class is a matrix-like container where rows
represent ranges of interest (as a GRanges or GRangesList-class) and columns
represent samples (with sample data summarized as a DataFrame-class). A
SummarizedExperiment contains one or more assays, each represented by a
matrix-like object of numeric or other mode.
![Summarized Experiment](SummarizedExperiment.png)
This object has been made available to you; you can simply read it in using:
```{r message=FALSE}
sefile <- system.file("extdata", "NCI60.Rda", package="LearnBioconductor")
load(sefile)
nci60data <- t(assay(NCI60))
ncilabels <- colData(NCI60)
```
The gene expression data are stored in the `assay` slot, whereas the labels are stored in
the `colData`.
For **Supervised learning**, we will use cervical count data from the
Bioconductor package `r Biocpkg("MLSeq")`. This data set contains
expression values of 714 miRNAs in human samples. There are 29 tumor and 29
non-tumor cervical samples. For learning purposes, we can treat these
as two separate groups and run various classification algorithms.
```{r message=FALSE}
filepath = system.file("extdata/cervical.txt", package = "MLSeq")
cervical = read.table(filepath, header = TRUE)
```
## Unsupervised Learning
Let us use PCA to visualize the NCI60 data set. We first perform PCA on the
data after scaling the variables (genes) to have standard deviation one and
plot the first few principal components in order to visualize the data just as
we did for the crabs data set.
```{r PCA_NCI}
pcaNCI <- prcomp(nci60data, scale = TRUE)
percentVarNCI <- round(100 * pcaNCI$sdev^2 / sum(pcaNCI$sdev^2), 1)   # % variance per PC for NCI60
dataGG = data.frame(PC1 = pcaNCI$x[, 1], PC2 = pcaNCI$x[, 2],
                    PC3 = pcaNCI$x[, 3], PC4 = pcaNCI$x[, 4],
                    cell.line = colData(NCI60)$info)
(qplot(PC1, PC2, data = dataGG, color = cell.line,
       main = "PC1 vs PC2", size = I(3))
 + labs(x = paste0("PC1, VarExp:", percentVarNCI[1]),
        y = paste0("PC2, VarExp:", percentVarNCI[2]))
 + scale_colour_brewer(type = "qual", palette = 6)
)
(qplot(PC2, PC3, data = dataGG, color = cell.line,
       main = "PC2 vs PC3", size = I(3))
 + labs(x = paste0("PC2, VarExp:", percentVarNCI[2]),
        y = paste0("PC3, VarExp:", percentVarNCI[3]))
 + scale_colour_brewer(type = "qual", palette = 6)
)
```
Overall, we can conclude that the cell lines corresponding to a single cancer
type tend to have similar values on the first few principal component score
vectors. This indicates that cell lines from the same cancer type tend to have
similar gene expression levels.
## Clustering observations
In hierarchical clustering, we start by defining some dissimilarity measure
between each pair of observations (such as the Euclidean distance). At the
bottom of the dendrogram, each of the $n$ observations is considered a
cluster; the two clusters that are most similar to each other are fused,
so that we then have $n-1$ clusters. Next, the two most similar of the remaining
clusters are fused, so that there are $n-2$ clusters. The algorithm proceeds iteratively
until all samples belong to one single cluster and the dendrogram is complete.
We define the dissimilarity between two clusters or two groups of observations
using a linkage. There are four common types of linkage: "average", "complete",
"single", and "centroid". Assuming we have two clusters A and B, then:
1. _Complete_ refers to recording the *largest* of all pairwise dissimilarities
between observations in cluster A and observations in Cluster B.
2. _Single_ refers to recording the *smallest* of all pairwise dissimilarities
between observations in cluster A and observations in Cluster B.
3. _Average_ refers to recording the *average* of all pairwise dissimilarities
between observations in cluster A and observations in Cluster B.
4. _Centroid_ refers to the dissimilarity between the centroid for cluster A and
the centroid for cluster B.
Usually, the dissimilarity measure is the Euclidean distance.
Let's cluster the observations using complete linkage, with the Euclidean distance
as the dissimilarity measure.
```{r NCIclusterAnalysis, fig.width=12 , message=FALSE}
labs <- as.character(colData(NCI60)$info)
cellColor <- function(vec)
{
uvec <- unique(vec)
cols = rainbow(length(uvec))
colvec <- cols[as.numeric(as.factor(vec))]
list(colvec=colvec, cols=cols, labels= uvec)
}
colres <- cellColor(labs)
sdata <- scale(nci60data)                        # standardize the genes
d <- dist(sdata)                                 # Euclidean distances between cell lines
comp_clust <- hclust(d)                          # complete linkage is the hclust default
dend <- as.dendrogram(comp_clust)
leaves <- labs[order.dendrogram(dend)]           # labels in the order they appear in the tree
labels_colors(dend) <- cellColor(leaves)$colvec  # labels_colors() is from the dendextend package
labels(dend) <- leaves
plot(dend, main ="Clustering using Complete Linkage")
```
### Exercise: Different linkage methods
1. Perform hierarchical clustering using average and single linkage.
2. Interpret the difference in the dendrograms.
3. Can you observe some patterns from these dendrograms? (hint: use cutree)
**Solutions:**
1. The plots can be made with the following code:
```{r fig.width=12, fig.height=6}
plot(hclust(d, method = "average"), labels = labs,
     main = "Clustering using Average Linkage", xlab = "", ylab = "")
plot(hclust(d, method = "single"), labels = labs,
     main = "Clustering using Single Linkage", xlab = "", ylab = "")
```
2. Briefly, complete linkage uses the maximal inter-cluster dissimilarity, single
linkage the minimal inter-cluster dissimilarity, and average linkage the
mean inter-cluster dissimilarity. We see that the results are affected by the
choice of the linkage. Single linkage tends to yield trailing clusters, while
complete and average linkage lead to more balanced and attractive clusters.
3. For our example, we see that the cell lines within a single cancer cell type do
not cluster together. But if we consider complete linkage and cut the tree at
height = 4 (we tried different heights to observe patterns), we observe some clear
patterns, e.g. the leukemia cell lines fall into cluster 2, while the breast cancer
cell lines are spread out.
## Supervised Learning
In supervised learning, along with the features $X_1, X_2, \ldots, X_p$, we
also have a response $Y$ measured on the same $n$ observations. The goal is then
to predict $Y$ using $X_1, X_2, \ldots, X_p$ for new observations.
### A simple example using knn
For the cervical data, we know that the first 29 are non-Tumor samples whereas
the last 29 are Tumor samples. We will code these as 0 and 1 respectively.
```{r}
class = data.frame(condition = factor(rep(c(0, 1), c(29, 29))))
```
Let's look at one of the most basic supervised learning techniques,
**k-Nearest Neighbors**, and see what goes into building a simple model with
it. For the sake of simplicity, we will use only 2 predictors (so that we can
represent the data in two-dimensional space).
```{r}
data <- t(cervical)       # samples in rows, miRNAs in columns
data <- data[, 1:2]       # keep only the first two miRNAs as predictors
df <- cbind(data, class)
colnames(df) <- c("x1", "x2", "y")
rownames(df) <- NULL
head(df)
```
This is how the data look:
```{r fig.width=12}
plot(df[,"x1"], df[,"x2"], xlab="x1", ylab="x2",
main="data representation for knn",
col=ifelse(as.character(df[,"y"])==1, "red","blue"))
```
Given an observation $x_0$ and a positive integer $K$, the KNN classifier first
identifies the $K$ points in the training data that are closest to $x_0$, represented
by $N_0$. It estimates the conditional probability of class $j$ as the fraction of
points in $N_0$ belonging to class $j$, and applies the Bayes rule to classify the test
observation to the class with the largest estimated probability.
More concretely, if $K=3$ and among the three nearest neighbours there are 2 observations
belonging to class 1 and 1 observation belonging to class 2, then the test observation is
assigned to class 1.
![knn_figure](our_figures/knn.png)
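This majority vote can be reproduced directly with `knn` from the `class` package; the sketch below holds out the first sample of the two-predictor data frame `df` defined above, purely for illustration:
```{r knn_vote_sketch, message=FALSE}
library(class)
## classify the held-out first sample from its 3 nearest neighbours among the remaining samples
knn(train = df[-1, c("x1", "x2")], test = df[1, c("x1", "x2")],
    cl = df$y[-1], k = 3, prob = TRUE)    # the "prob" attribute gives the winning vote fraction
```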
For all supervised experiments it's a good idea to set aside part of the data as
_training data_ and build a model with it. We can then test the model on the
held-out remainder (_test data_) to gain confidence in our model.
We will randomly sample 20% of our samples and use them as a test set. The
remaining 80% of the data will be used as training data.
```{r }
set.seed(9)
nTest = ceiling(ncol(cervical) * 0.2)            # hold out 20% of the samples as a test set
ind = sample(ncol(cervical), nTest, FALSE)
cervical.train = cervical[, -ind]
cervical.train = as.matrix(cervical.train + 1)   # add a pseudo-count of one
classtr = data.frame(condition = class[-ind, ])
cervical.test = cervical[, ind]
cervical.test = as.matrix(cervical.test + 1)
classts = data.frame(condition = class[ind, ])
```
The training set error is the proportion of mistakes made when we apply our model to
the training data, and the test set error is the proportion of mistakes made when
we apply our model to the test data.
For different numbers of neighbors, let us calculate the training error and the test
error using KNN.
```{r message=FALSE}
## knn.cv() and knn() come from the 'class' package (assumed to be loaded)
newknn <- function(testset, trainset, testclass, trainclass, k)
{
  pred.train <- knn.cv(trainset, trainclass, k = k)        # leave-one-out predictions on the training set
  pred.test  <- knn(trainset, testset, trainclass, k = k)  # predictions for the test set
  test_fit  <- length(which(mapply(identical, as.character(pred.test),
                                   testclass) == FALSE)) / length(testclass)
  train_fit <- length(which(mapply(identical, as.character(pred.train),
                                   trainclass) == FALSE)) / length(trainclass)
  c(train_fit = train_fit, test_fit = test_fit)
}
trainset <- t(cervical.train)
testset <- t(cervical.test)
testclass <- t(classts)
trainclass <- t(classtr)
klist <- 1:15
ans <- lapply(klist, function(x)
newknn(testset, trainset, testclass, trainclass,k =x))
resdf <- t(as.data.frame(ans))
rownames(resdf) <- NULL
plot(klist, resdf[,"train_fit"], col="blue", type="b",ylim=c(range(resdf)),
main="k Nearest Neighbors for Cervical Data", xlab="No of neighbors",
ylab ="Training and Test Error")
points(klist, resdf[,"test_fit"], col="red", type="b")
legend("bottomright", legend=c("Training error","Test error"),
text.col=c("blue","red"), bty="n")
```
Another important concept in machine learning is **cross-validation**. Since
samples are often scarce, it is often not possible to set aside a validation set
and then use it to assess the performance of our prediction model. So we use
cross-validation to train and assess the model. We start by dividing the training data
randomly into $n$ equal parts. The learning method is fit to $n-1$ parts of the
data, and the prediction error is computed on the remaining part. This is done
in turn for each of the $n$ parts of the data, and finally the $n$ prediction error
estimates are averaged.
For example, $K$-fold cross-validation with $K = 5$:
![cv_figure](our_figures/cross_validation.png)
As you can see, computation for this very simple learner can become quite
complicated.
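For instance, a single 5-fold cross-validation of the kNN classifier on the training samples, written out by hand, already looks like this (a sketch only; the `r Biocpkg("MLSeq")` function introduced below wraps this kind of loop for several classifiers):
```{r manual_cv_knn_sketch, message=FALSE}
library(class)
set.seed(7)
folds <- sample(rep(1:5, length.out = nrow(trainset)))   # assign each training sample to one of 5 folds
cv_err <- sapply(1:5, function(l) {
  pred <- knn(trainset[folds != l, ], trainset[folds == l, ],
              trainclass[folds != l], k = 3)             # fit on 4/5 of the data, predict the held-out fold
  mean(pred != trainclass[folds == l])
})
cv_err
mean(cv_err)                                             # 5-fold estimate of the misclassification rate
```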
### Fast classification using Bioconductor
`r Biocpkg("MLSeq")` aims to make this computation less complicated for the user and
allows one to learn a model using various classifiers with a single function.
The main function of this package is `classify`, which requires data in the form of
a DESeqDataSet instance. The DESeqDataSet is a subclass of SummarizedExperiment,
used to store the input values, intermediate calculations and results of an
analysis of differential expression.
So let's create a DESeqDataSet object for both the training and the test set, and run
DESeq on them.
```{r}
cervical.trainS4 = DESeqDataSetFromMatrix(countData = cervical.train,
                                          colData = classtr, design = formula(~condition))
cervical.trainS4 = DESeq(cervical.trainS4, fitType = "local")
cervical.testS4 = DESeqDataSetFromMatrix(countData = cervical.test,
                                         colData = classts, design = formula(~condition))
cervical.testS4 = DESeq(cervical.testS4, fitType = "local")
```
Classify using Support Vector Machines.
```{r}
svm = classify(data = cervical.trainS4, method = "svm", normalize = "deseq",
deseqTransform = "vst", cv = 5, rpt = 3, ref = "1")
svm
```
It returns an object of class 'MLseq' and we observe that it successfully
fitted a model with 97.8% accuracy. We can access the slots of this S4 object by
```{r}
getSlots("MLSeq")
```
And also, ask about the model trained.
```{r}
trained(svm)
```
We can predict the class labels of our test data using `predictClassify`:
```{r}
pred.svm = predictClassify(svm, cervical.testS4)
table(pred.svm, relevel(cervical.testS4$condition, 2))
```
The other classification methods available are 'randomforest', 'cart' and
'bagsvm'.
### Exercise:
Repeat the classification on the same training and test data using the random forest classifier.
**Solutions:**
```{r}
rf = classify(data = cervical.trainS4, method = "randomforest",
              normalize = "deseq", deseqTransform = "vst", cv = 5, rpt = 3, ref = "1")
trained(rf)
pred.rf = predictClassify(rf, cervical.testS4)
table(pred.rf, relevel(cervical.testS4$condition, 2))
```
# Session information
As the last part of this document, we call the function `sessionInfo`,
which reports the version numbers of R and of all the packages used in
this session. It is good practice to always keep such a record, as it
will help to trace down what has happened in case an R script
ceases to work because functions have been changed in a newer
version of a package. The session information should also **always**
be included in any emails to the
[Bioconductor support site](https://support.bioconductor.org) along
with all code used in the analysis.
```{r}
sessionInfo()
```