This lecture is based on the Bio-IT course “Introduction
to R”, which in turn is based on Data Carpentry materials.
The mouse data are from Winzell
and Ahrén (2004).
The first day is an introduction to data handling, exploration and plotting in R. Before we can think about useful statistical analysis of our data, we first have to get an overview how they look like. And for many scientific questions, the answer will already be apparent when looking at the right plots.
We start by loading the required packages.
ggplot2
is included in the
tidyverse
package.
library(tidyverse)
Data frames are the de facto data structure for most tabular data, and what we use for statistics and plotting.
A data frame can be created by hand, but most commonly they are
generated by the functions read_csv()
or
read_table()
; in other words, when importing spreadsheets
from your hard drive (or the web).
A data frame is the representation of data in the format of a table where the columns are vectors that all have the same length. Because columns are vectors, each column must contain a single type of data (e.g., characters, integers, factors). For example, here is a figure depicting a data frame comprising a numeric, a character, and a logical vector.
We load the
mice_pheno
data into R:
mice_pheno <- read_csv(file= url("https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/mice_pheno.csv"))
mice_pheno$Bodyweight <- as.numeric(mice_pheno$Bodyweight)
We can see this when inspecting the structure of a data frame
with the function str()
:
str(mice_pheno)
## spc_tbl_ [846 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Sex : chr [1:846] "F" "F" "F" "F" ...
## $ Diet : chr [1:846] "hf" "hf" "hf" "hf" ...
## $ Bodyweight: num [1:846] 31.9 32.5 22.8 19.9 32.2 ...
## - attr(*, "spec")=
## .. cols(
## .. Sex = col_character(),
## .. Diet = col_character(),
## .. Bodyweight = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
dim(mice_pheno)
- returns a vector with the number of
rows in the first element, and the number of columns as the second
element (the dimensions of the object)head(mice_pheno)
- shows the first 6 rowsnames(mice_pheno)
- returns the column names (synonym
of colnames()
for data.frame
objects)rownames(mice_pheno)
- returns the row namesData frames can be subset by calling indices (as shown previously), but also by calling their column names directly:
mice_pheno[, 3] # Result is a vector
mice_pheno$Bodyweight # Result is a vector
Exercise: Use the functions you learned above to
print out the column names of the PlantGrowth
data set.
Then print out the column with the weights
Factors represent categorical data. Once created, factors can only contain a pre-defined set of values, known as levels. By default, R always sorts levels in alphabetical order. For instance, if you have a factor with 2 levels:
sex <- factor(c("male", "female", "female", "male"))
sex
## [1] male female female male
## Levels: female male
There are a few basic functions for handling data frames. Also have a look at the cheat sheet for data transformation
Fo the following operations, we use the pipe operator, denoted by
%>%
, to pass a data frame to a function. The object on
the left of the pipe operator will be the first argument for the
function on the right of the pipe operator. As you will see below, a
series of pipe operations can be used to encode a series of data
manipulations.
Filter
Extract observations (i.e. rows) that fulfill certain criteria. We use logical operators to specify the criteria:
mice_pheno %>%
filter(Sex == "F" & Diet == "hf" & Bodyweight < 18)
## # A tibble: 5 × 3
## Sex Diet Bodyweight
## <chr> <chr> <dbl>
## 1 F hf 16.6
## 2 F hf 17.5
## 3 F hf 17.0
## 4 F hf 16.0
## 5 F hf 18.0
Select
Select one or more columns of interest. This operation can be preceded (or followed) by filtering.
mice_pheno %>%
filter(Sex == "F" & Diet == "hf" & Bodyweight < 18) %>%
select(Bodyweight, Diet)
## # A tibble: 5 × 2
## Bodyweight Diet
## <dbl> <chr>
## 1 16.6 hf
## 2 17.5 hf
## 3 17.0 hf
## 4 16.0 hf
## 5 18.0 hf
group_by
and
summarize
We can divide the data in a data frame into groups. Then, the following analysis (often a summarizing function) will be applied on each group separately.
For example, we could want to find out the mean weight of mice.
mean(mice_pheno$Bodyweight, na.rm=TRUE)
## [1] 28.84705
But actually, we are interested in the difference in means between the two diets. So we group the data into the two diet groups, before applying the mean:
mice_pheno %>%
group_by(Diet) %>%
summarize(group_mean = mean(Bodyweight, na.rm=TRUE))
## # A tibble: 2 × 2
## Diet group_mean
## <chr> <dbl>
## 1 chow 27.4
## 2 hf 30.5
We can also further group by sex:
mice_pheno %>%
group_by(Diet, Sex) %>%
summarize(group_mean = mean(Bodyweight,na.rm=TRUE), group_sd=sd(Bodyweight,na.rm=TRUE))
## # A tibble: 4 × 4
## # Groups: Diet [2]
## Diet Sex group_mean group_sd
## <chr> <chr> <dbl> <dbl>
## 1 chow F 23.9 3.42
## 2 chow M 31.0 4.43
## 3 hf F 26.3 5.08
## 4 hf M 34.8 5.59
Exercise:
PlantGrowth
data for experiments from the
control group.ggplot2
Have a look at the cheat sheet for data visualization!
To build a ggplot, we will use the following basic template that can be used for different types of plots:
ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) + <GEOM_FUNCTION>()
ggplot()
function and bind the plot to a
specific data frame using the data
argumentggplot(data = mice_pheno)
aes
) function),
by selecting the variables to be plotted and specifying how to present
them in the graph, e.g. as x/y positions or characteristics such as
size, shape, color, etc.ggplot(data = mice_pheno, mapping = aes(x = Diet, y = Bodyweight))
add ‘geoms’ – graphical representations of the data in the plot
(points, lines, bars). ggplot2
offers many
different geoms; we will use some common ones today, including:
geom_point()
for scatter plots, dot plots, etc.geom_boxplot()
for, well, boxplots!To add a geom to the plot use the +
operator. Because we
have two continuous variables, let’s use geom_point()
first:
weights_plot <- ggplot(data = mice_pheno, mapping = aes(x = Diet, y = Bodyweight))
weights_plot + geom_point()
The +
in the ggplot2
package is particularly useful because it allows you to modify existing
ggplot
objects.
Notes
ggplot()
function can be seen
by any geom layers that you add (i.e., these are universal plot
settings). This includes the x- and y-axis mapping you set up in
aes()
.ggplot()
function.+
sign used to add new layers must be placed at the
end of the line containing the previous layer. If, instead, the
+
sign is added at the beginning of the line containing the
new layer, ggplot2
will not add the new
layer and will return an error message.# This is the correct syntax for adding layers
weights_plot +
geom_point()
# This will not add the new layer and will return an error message
weights_plot
+ geom_point()
Exercise: Complete the code below to plot the plant weights against the groups.
PlantGrowth %>%
ggplot( )+
Categories are encoded as factors or character. In our mice data, the
diet and sex variable are categorical variables. In our plots, we want
to compare the categories of these variables to each other, for example
we want to compare the chow
and hf
diets to
each other, with respect to the resulting mouse weight.
A boxplot summarizes a few statistics on the data: The median, first and third quartile and outliers.
weights_plot +
geom_boxplot()
But even though this summary can be useful, some information on this original data points is lost. Therefore, it’s useful to overlay the data points on the boxes:
weights_plot +
geom_boxplot(outlier.shape =NA) +
geom_jitter(width=.3, alpha=0.2)
Because there are quite a few data points, I use transparency and jitter to avoid overplotting.
The violin plot is a nice tool to show the distributions in several categories in the same plot.
weights_plot +
geom_violin(aes(fill=Diet))
If we look at the mice_pheno
data set, we notice that we
so far ignored one variable:
str(mice_pheno)
## spc_tbl_ [846 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Sex : chr [1:846] "F" "F" "F" "F" ...
## $ Diet : chr [1:846] "hf" "hf" "hf" "hf" ...
## $ Bodyweight: num [1:846] 31.9 32.5 22.8 19.9 32.2 ...
## - attr(*, "spec")=
## .. cols(
## .. Sex = col_character(),
## .. Diet = col_character(),
## .. Bodyweight = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
We have several options:
* Keep ignoring the sex
variable
* Subset the data frame for plotting
* encode the Sex
variable in the plot using tools like
color or faceting
We are comparing the two diets only for female mice:
mice_pheno %>%
filter(Sex=="F") %>%
ggplot(aes(x=Diet, y = Bodyweight)) +
geom_violin(aes(fill=Diet))
Use the faceting option in ggplot2
to create one
sub-plot for each sex:
weights_plot +
geom_violin(aes(fill=Diet))+
facet_wrap(vars(Sex))
weights_plot +
geom_jitter(width=0.4,aes(colour=Sex), alpha=0.5)
To get a feeling for the data, it’s always recommended to look at individual points at least once. It’s also a matter of taste which graphics you find more enlightening. I personally find this plot useful, but also a bit messy, so I’d probably prefer faceting and violins.
In the testing block, we will see the rationale behind testing for weight differences between the two diets, or male and female mice.
Now, we are going to look at the palmerpenguins
data. These are measurements of on three different penguin species on
islands near the antarctic Palmer Station.
knitr::include_graphics("../images/penguins.png")
library(palmerpenguins)
names(penguins)
## [1] "species" "island" "bill_length_mm"
## [4] "bill_depth_mm" "flipper_length_mm" "body_mass_g"
## [7] "sex" "year"
We can start exploring this data set by looking at the dimensions:
dim(penguins)
## [1] 344 8
Visualize individual relationships in a scatter plot, for example bill length and flipper length:
penguins %>%
ggplot(aes(x=bill_length_mm, y=flipper_length_mm)) +
geom_point()+
geom_rug()
We see a few patterns here, so maybe it is useful to point out a few other variables How about species or sex?
penguins %>%
filter(!is.na(sex)) %>%
ggplot(aes(x=bill_length_mm, y=flipper_length_mm, colour=species, shape=sex)) +
geom_point()
We can subset to female Gentoo and look at individual relationships:
penguins %>%
filter(sex=="female", species =="Gentoo") %>%
ggplot( aes(x=bill_length_mm, y=bill_depth_mm)) +
geom_point()
Now it’s easy to get lost in the details and individual plots and
associations. Therefore, it’s always good to get an overview first. One
convenient tool for that is the GGally::ggpairs
function.
Let’s suppose we are interested in the Gentoo, so we filter for
them:
library(GGally)
penguins %>%
filter( species =="Gentoo") %>%
select(bill_length_mm, body_mass_g, flipper_length_mm) %>%
ggpairs()
This plot quickly tells us that bill length, flipper length and body
mass are all positively correlated with each other. One way to interpret
this is that they all somehow represent (or grow with) the size of the
penguin.
We can use this interpretation to reduce the dimensionality of the data
set for further analyses.
Exercise: Combine filtering and plotting: Filter for Adelie penguins first, then plot body mass against flipper length in a scatter plot.
library(palmerpenguins)
penguins %>%
filter() %>%
ggplot() +
This is a modified example from chapter 7 in the online book R4ds by Hadley Wickham and Garrett Grolemund.
We look at a data sets on diamonds. Data was collected on more than 50,000 diamonds’ price, carat, size, and some quality criteria. We will focus on the relation between carat and price, and we’ll also try to learn something about the distribution of the data.
str(diamonds)
## tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
## $ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
A good starting point for looking at the covariation of two variables is often a scatter plot:
diamonds %>%
ggplot(aes(x=carat, y=price))+
geom_point()
But we have a problem here: There are >50,000 data points, and they overlap heavily. From this plot, we can’t see where most of the data points lie. Anywhere in the black regions could be a huge, dense cloud of data points and anything else we see are just “outliers”. One thing to see more clearly is to make the points transparent:
diamonds %>%
ggplot(mapping = aes(x = carat, y = price)) +
geom_point(alpha = 1 / 100)
You notice that the a lot of points are cramped towards the origin. You could try the log-scale:
diamonds %>%
ggplot(mapping = aes(x = carat, y = price) )+
geom_point(alpha = 1 / 100)+
scale_y_log10()+
scale_x_log10()
Or bin them - this is kind of a 2D histogram, where the counts are represented as colors:
library(hexbin)
diamonds %>%
ggplot(aes(x = carat, y = price)) +
geom_hex()+
scale_y_log10()+
scale_x_log10()
By looking at the data in these different ways, we have gotten the idea, that a lot of the diamonds have below 1 carat, and the relation between carat and price looks linear on the log-log-scale, which could be useful for modeling it.