This lecture is based on the Bio-IT course “Introduction to R”, which in turn is based on Data Carpentry materials.
The mouse data are from Winzell and Ahrén (2004).

The first day is an introduction to data handling, exploration and plotting in R. Before we can think about useful statistical analysis of our data, we first have to get an overview how they look like. And for many scientific questions, the answer will already be apparent when looking at the right plots.

We start by loading the required packages. ggplot2 is included in the tidyverse package.

library(tidyverse)

Data frames

Data frames are the de facto data structure for most tabular data, and what we use for statistics and plotting.

A data frame can be created by hand, but most commonly they are generated by the functions read_csv() or read_table(); in other words, when importing spreadsheets from your hard drive (or the web).

A data frame is the representation of data in the format of a table where the columns are vectors that all have the same length. Because columns are vectors, each column must contain a single type of data (e.g., characters, integers, factors). For example, here is a figure depicting a data frame comprising a numeric, a character, and a logical vector.

We load the mice_pheno data into R:

mice_pheno <- read_csv(file= url("https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/mice_pheno.csv"))
mice_pheno$Bodyweight <- as.numeric(mice_pheno$Bodyweight)

We can see this when inspecting the structure of a data frame with the function str():

str(mice_pheno)

## spc_tbl_ [846 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Sex       : chr [1:846] "F" "F" "F" "F" ...
##  $ Diet      : chr [1:846] "hf" "hf" "hf" "hf" ...
##  $ Bodyweight: num [1:846] 31.9 32.5 22.8 19.9 32.2 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Sex = col_character(),
##   ..   Diet = col_character(),
##   ..   Bodyweight = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

Explore the data frame

Size:
- dim(mice_pheno) - returns a vector with the number of rows in the first element, and the number of columns as the second element (the dimensions of the object)
Content:
- head(mice_pheno) - shows the first 6 rows
Names:
- names(mice_pheno) - returns the column names (synonym of colnames() for data.frame objects)
- rownames(mice_pheno) - returns the row names

Data frames can be subset by calling indices (as shown previously), but also by calling their column names directly:

mice_pheno[, 3]     # Result is a vector
mice_pheno$Bodyweight          # Result is a vector

Exercise: Use the functions you learned above to print out the column names of the PlantGrowth data set. Then print out the column with the weights

Factors

Factors represent categorical data. Once created, factors can only contain a pre-defined set of values, known as levels. By default, R always sorts levels in alphabetical order. For instance, if you have a factor with 2 levels:

sex <- factor(c("male", "female", "female", "male"))
sex

## [1] male   female female male  
## Levels: female male

Manipulating data frames

There are a few basic functions for handling data frames. Also have a look at the cheat sheet for data transformation

The pipe operator

Fo the following operations, we use the pipe operator, denoted by %>%, to pass a data frame to a function. The object on the left of the pipe operator will be the first argument for the function on the right of the pipe operator. As you will see below, a series of pipe operations can be used to encode a series of data manipulations.

`Filter`

Extract observations (i.e. rows) that fulfill certain criteria. We use logical operators to specify the criteria:

mice_pheno %>%  
  filter(Sex == "F" & Diet == "hf" & Bodyweight < 18)

## # A tibble: 5 × 3
##   Sex   Diet  Bodyweight
##   <chr> <chr>      <dbl>
## 1 F     hf          16.6
## 2 F     hf          17.5
## 3 F     hf          17.0
## 4 F     hf          16.0
## 5 F     hf          18.0

`Select`

Select one or more columns of interest. This operation can be preceded (or followed) by filtering.

mice_pheno %>%  
  filter(Sex == "F" & Diet == "hf" & Bodyweight < 18) %>% 
  select(Bodyweight, Diet)

## # A tibble: 5 × 2
##   Bodyweight Diet 
##        <dbl> <chr>
## 1       16.6 hf   
## 2       17.5 hf   
## 3       17.0 hf   
## 4       16.0 hf   
## 5       18.0 hf

Grouped summaries: `group_by` and `summarize`

We can divide the data in a data frame into groups. Then, the following analysis (often a summarizing function) will be applied on each group separately.

For example, we could want to find out the mean weight of mice.

mean(mice_pheno$Bodyweight, na.rm=TRUE)

## [1] 28.84705

But actually, we are interested in the difference in means between the two diets. So we group the data into the two diet groups, before applying the mean:

mice_pheno %>% 
  group_by(Diet) %>% 
  summarize(group_mean = mean(Bodyweight, na.rm=TRUE))

## # A tibble: 2 × 2
##   Diet  group_mean
##   <chr>      <dbl>
## 1 chow        27.4
## 2 hf          30.5

We can also further group by sex:

mice_pheno %>% 
  group_by(Diet, Sex) %>% 
  summarize(group_mean = mean(Bodyweight,na.rm=TRUE), group_sd=sd(Bodyweight,na.rm=TRUE))

## # A tibble: 4 × 4
## # Groups:   Diet [2]
##   Diet  Sex   group_mean group_sd
##   <chr> <chr>      <dbl>    <dbl>
## 1 chow  F           23.9     3.42
## 2 chow  M           31.0     4.43
## 3 hf    F           26.3     5.08
## 4 hf    M           34.8     5.59

Exercise:

Filter the PlantGrowth data for experiments from the control group.
Additionally select for weight.

Plotting with `ggplot2`

Have a look at the cheat sheet for data visualization!

To build a ggplot, we will use the following basic template that can be used for different types of plots:

ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) +  <GEOM_FUNCTION>()

use the ggplot() function and bind the plot to a specific data frame using the data argument

ggplot(data = mice_pheno)

define a mapping (using the aesthetic (aes) function), by selecting the variables to be plotted and specifying how to present them in the graph, e.g. as x/y positions or characteristics such as size, shape, color, etc.

ggplot(data = mice_pheno, mapping = aes(x = Diet, y = Bodyweight))

add ‘geoms’ – graphical representations of the data in the plot (points, lines, bars). ggplot2 offers many different geoms; we will use some common ones today, including:
- geom_point() for scatter plots, dot plots, etc.
- geom_boxplot() for, well, boxplots!

To add a geom to the plot use the + operator. Because we have two continuous variables, let’s use geom_point() first:

weights_plot <- ggplot(data = mice_pheno, mapping = aes(x = Diet, y = Bodyweight))
weights_plot +  geom_point()

The + in the ggplot2 package is particularly useful because it allows you to modify existing ggplot objects.

Notes

Anything you put in the ggplot() function can be seen by any geom layers that you add (i.e., these are universal plot settings). This includes the x- and y-axis mapping you set up in aes().
You can also specify mappings for a given geom independently of the mappings defined globally in the ggplot() function.
The + sign used to add new layers must be placed at the end of the line containing the previous layer. If, instead, the + sign is added at the beginning of the line containing the new layer, ggplot2 will not add the new layer and will return an error message.

# This is the correct syntax for adding layers
weights_plot +
  geom_point()

# This will not add the new layer and will return an error message
weights_plot
  + geom_point()

Exercise: Complete the code below to plot the plant weights against the groups.

PlantGrowth %>% 
  ggplot(   )+

Plotting categorical data

Categories are encoded as factors or character. In our mice data, the diet and sex variable are categorical variables. In our plots, we want to compare the categories of these variables to each other, for example we want to compare the chow and hf diets to each other, with respect to the resulting mouse weight.

Boxplot

A boxplot summarizes a few statistics on the data: The median, first and third quartile and outliers.

weights_plot +
  geom_boxplot()

But even though this summary can be useful, some information on this original data points is lost. Therefore, it’s useful to overlay the data points on the boxes:

weights_plot +
  geom_boxplot(outlier.shape =NA) +
  geom_jitter(width=.3, alpha=0.2)

Because there are quite a few data points, I use transparency and jitter to avoid overplotting.

Violin plot

The violin plot is a nice tool to show the distributions in several categories in the same plot.

weights_plot +
  geom_violin(aes(fill=Diet))

More than two variables

If we look at the mice_pheno data set, we notice that we so far ignored one variable:

str(mice_pheno)

## spc_tbl_ [846 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Sex       : chr [1:846] "F" "F" "F" "F" ...
##  $ Diet      : chr [1:846] "hf" "hf" "hf" "hf" ...
##  $ Bodyweight: num [1:846] 31.9 32.5 22.8 19.9 32.2 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Sex = col_character(),
##   ..   Diet = col_character(),
##   ..   Bodyweight = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

We have several options:
* Keep ignoring the sex variable
* Subset the data frame for plotting
* encode the Sex variable in the plot using tools like color or faceting

Combine filtering and plotting

We are comparing the two diets only for female mice:

mice_pheno %>% 
  filter(Sex=="F") %>% 
  ggplot(aes(x=Diet, y = Bodyweight)) +
  geom_violin(aes(fill=Diet))

Faceting

Use the faceting option in ggplot2 to create one sub-plot for each sex:

weights_plot +
  geom_violin(aes(fill=Diet))+
  facet_wrap(vars(Sex))

Encode a variable by color

weights_plot +
  geom_jitter(width=0.4,aes(colour=Sex), alpha=0.5)

To get a feeling for the data, it’s always recommended to look at individual points at least once. It’s also a matter of taste which graphics you find more enlightening. I personally find this plot useful, but also a bit messy, so I’d probably prefer faceting and violins.

In the testing block, we will see the rationale behind testing for weight differences between the two diets, or male and female mice.

Continuous variables

Now, we are going to look at the palmerpenguins data. These are measurements of on three different penguin species on islands near the antarctic Palmer Station.

knitr::include_graphics("../images/penguins.png")

library(palmerpenguins)
names(penguins)

## [1] "species"           "island"            "bill_length_mm"   
## [4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
## [7] "sex"               "year"

We can start exploring this data set by looking at the dimensions:

dim(penguins)

## [1] 344   8

Visualize individual relationships in a scatter plot, for example bill length and flipper length:

penguins %>% 
  ggplot(aes(x=bill_length_mm, y=flipper_length_mm)) +
  geom_point()+
  geom_rug()

We see a few patterns here, so maybe it is useful to point out a few other variables How about species or sex?

penguins %>% 
  filter(!is.na(sex)) %>% 
  ggplot(aes(x=bill_length_mm, y=flipper_length_mm, colour=species, shape=sex)) +
  geom_point()

We can subset to female Gentoo and look at individual relationships:

penguins %>% 
  filter(sex=="female", species =="Gentoo") %>% 
  ggplot( aes(x=bill_length_mm, y=bill_depth_mm)) +
  geom_point()

Now it’s easy to get lost in the details and individual plots and associations. Therefore, it’s always good to get an overview first. One convenient tool for that is the GGally::ggpairs function. Let’s suppose we are interested in the Gentoo, so we filter for them:

library(GGally)
penguins %>% 
  filter( species =="Gentoo") %>% 
  select(bill_length_mm,  body_mass_g, flipper_length_mm) %>% 
  ggpairs()

This plot quickly tells us that bill length, flipper length and body mass are all positively correlated with each other. One way to interpret this is that they all somehow represent (or grow with) the size of the penguin.
We can use this interpretation to reduce the dimensionality of the data set for further analyses.

Exercise: Combine filtering and plotting: Filter for Adelie penguins first, then plot body mass against flipper length in a scatter plot.

library(palmerpenguins)
penguins %>% 
  filter() %>% 
  ggplot() +

Overplotting

This is a modified example from chapter 7 in the online book R4ds by Hadley Wickham and Garrett Grolemund.

We look at a data sets on diamonds. Data was collected on more than 50,000 diamonds’ price, carat, size, and some quality criteria. We will focus on the relation between carat and price, and we’ll also try to learn something about the distribution of the data.

str(diamonds)

## tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
##  $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

A good starting point for looking at the covariation of two variables is often a scatter plot:

diamonds %>% 
  ggplot(aes(x=carat, y=price))+
  geom_point()

But we have a problem here: There are >50,000 data points, and they overlap heavily. From this plot, we can’t see where most of the data points lie. Anywhere in the black regions could be a huge, dense cloud of data points and anything else we see are just “outliers”. One thing to see more clearly is to make the points transparent:

diamonds %>% 
  ggplot(mapping = aes(x = carat, y = price)) + 
  geom_point(alpha = 1 / 100)

You notice that the a lot of points are cramped towards the origin. You could try the log-scale:

diamonds %>% 
  ggplot(mapping = aes(x = carat, y = price) )+ 
  geom_point(alpha = 1 / 100)+
  scale_y_log10()+
  scale_x_log10()

Or bin them - this is kind of a 2D histogram, where the counts are represented as colors:

library(hexbin)
diamonds %>% 
  ggplot(aes(x = carat, y = price)) +
  geom_hex()+
  scale_y_log10()+
  scale_x_log10()

By looking at the data in these different ways, we have gotten the idea, that a lot of the diamonds have below 1 carat, and the relation between carat and price looks linear on the log-log-scale, which could be useful for modeling it.

Data exploration in R

Sarah Kaspar

2023-03-20

Data frames

Explore the data frame

Factors

Manipulating data frames

The pipe operator

`Filter`

`Select`

Grouped summaries: `group_by` and `summarize`

Plotting with `ggplot2`

Plotting categorical data

Boxplot

Violin plot

More than two variables

Combine filtering and plotting

Faceting

Encode a variable by color

Continuous variables

Overplotting

Data exploration in R

Sarah Kaspar

2023-03-20

Data frames

Explore the data frame

Factors

Manipulating data frames

The pipe operator

Filter

Select

Grouped summaries: group_by and summarize

Plotting with ggplot2

Plotting categorical data

Boxplot

Violin plot

More than two variables

Combine filtering and plotting

Faceting

Encode a variable by color

Continuous variables

Overplotting

`Filter`

`Select`

Grouped summaries: `group_by` and `summarize`

Plotting with `ggplot2`