```
library("TeachingDemos")
library("openxlsx")
library("multtest")
library("Biobase")
library("tidyverse")
library("cowplot")
```

R is a software language for carrying out complicated (and simple) statistical analyses. It includes routines for data summary and exploration, graphical presentation and data modeling. The aim of this lab is to provide you with a basic fluency in the language. When you work in R you create objects that are stored in the current workspace. Each object created remains in the image unless you explicitly delete it. At the end of the session the workspace will be lost unless you save it.

You can get your current working directory via `getwd()` and set it with `setwd()`. By default, it is usually your home directory.

Commands written in R are saved in memory throughout the session. You can scroll back to previous commands typed by using the “up” arrow key (and “down” to scroll back again). You finish an R session by typing `q()`, at which point you will also be prompted as to whether or not you want to save the current workspace into your working directory. If you do not want to, it will be lost. Remember the ways to get help:

- Just ask!
- `help.start()` and the HTML help button in the Windows GUI.
- `help` and `?`: `help("data.frame")` or `?help`.
- `help.search()`, `apropos()`
- `browseVignettes("package")`
- rseek.org
- use tab-completion in RStudio; this will also display help snippets

In this tutorial we will make use of packages from the tidyverse and the tutorial itself was written using rmarkdown. The tidyverse is a set of R packages that try to make your life easier when working with data in R. They improve the basic R experience tremendously and are designed to foster the human understanding of programming code.

The elementary unit in R is an object and the simplest objects are scalars, vectors and matrices. R is designed with interactivity in mind, so you can get started by simply typing:

`4 + 6`

` [1] 10`

What does R do? It sums up the two numbers and returns the scalar value 10. In fact, R returns a vector of length 1 - hence the [1] denoting first element of the vector. We can assign objects values for subsequent use. For example:

```
x <- 6
y <- 4
z <- x + y
z
```

` [1] 10`

does the same calculation as above, storing the result in an object called `z`. We can look at the contents of the object by simply typing its name. At any time we can list the objects which we have created:

`ls()`

` [1] "x" "y" "z"`

Notice that `ls` is actually an object itself. Typing `ls` would result in a display of the contents of this object, in this case, the commands of the function. The use of parentheses, `ls()`, ensures that the function is executed and its result (in this case, a list of the objects in the current environment) is displayed. More commonly, a function will operate on an object, for example

`sqrt(16)`

` [1] 4`

calculates the square root of 16. Objects can be removed from the current workspace with the function `rm()`. There are many standard functions available in R, and it is also possible to create new ones.

Vectors can be created in R in a number of ways. For example, we can list all of the elements:

`z <- c(5, 9, 1, 0)`

Note the use of the function `c` to concatenate or “glue together” individual elements. This function can be used much more widely, for example

```
x <- c(5, 9)
y <- c(1, 0)
z <- c(x, y)
```

would lead to the same result by gluing together two vectors to create a single vector. Sequences can be generated as follows:

`seq(1, 9, by = 2)`

` [1] 1 3 5 7 9`

`seq(8, 20, length = 6)`

` [1] 8.0 10.4 12.8 15.2 17.6 20.0`

These examples illustrate that many functions in R have optional arguments, in this case, either the step length or the total length of the sequence (it doesn’t make sense to use both). If you leave out both of these options, R will make its own default choice, in this case assuming a step length of 1. So, for example,

`x <- seq(1, 10)`

also generates a vector of integers from 1 to 10. At this point it’s worth mentioning the help facility again. If you don’t know how to use a function, or don’t know what the options or default values are, type `help(functionname)` or simply `?functionname`, where `functionname` is the name of the function you are interested in. This will usually help and will often include examples to make things even clearer. Another useful function for building vectors is the `rep` command for repeating things: the first command below repeats the vector 1, 2, 3 six times, while the second one repeats each element six times.

`rep(1:3, 6)`

` [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3`

`rep(1:3, times = c(6, 6, 6))`

` [1] 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3`
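As a small aside, the second call can be written more compactly with the `each` argument of `rep`. The sketch below checks that both spellings give the same vector and shows that `each` and `times` can be combined.

```r
# Repeat each element six times, written in two equivalent ways
a <- rep(1:3, times = c(6, 6, 6))
b <- rep(1:3, each = 6)
identical(a, b)

# `each` is applied first, then the whole pattern is repeated `times` times
rep(1:3, each = 2, times = 3)
```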

R will often adapt to the objects it is asked to work on. An example is the vectorized arithmetic used in R:

```
x <- 1:5
y <- 5:1
x + y
```

` [1] 6 6 6 6 6`

`x^2`

` [1] 1 4 9 16 25`

`x * y`

` [1] 5 8 9 8 5`

showing that R uses component-wise arithmetic on vectors. R will also try to make sense of a statement if objects are mixed. For example:

```
x <- c(6, 8, 9)
x + 2
```

` [1] 8 10 11`
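What happens in `x + 2` is usually called recycling: the shorter operand is repeated until it matches the length of the longer one. A minimal sketch, where the vectors `long` and `short` are made up for illustration:

```r
long  <- c(1, 2, 3, 4)
short <- c(10, 20)

# `short` is recycled to c(10, 20, 10, 20) before the addition
long + short
# [1] 11 22 13 24

# Adding a scalar is the same mechanism with a vector of length 1
identical(long + 2, long + c(2, 2, 2, 2))
```

If the longer length is not a multiple of the shorter one, R still recycles but issues a warning, which is usually a sign of a mistake.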

Two particularly useful functions worth remembering are `length`, which returns the length of a vector (i.e. the number of elements it contains), and `sum`, which calculates the sum of the elements of a vector. R also has basic calculator capabilities:

- `a+b`, `a-b`, `a*b`, `a**b` (a to the power of b)
- additionally: `sqrt(a)`, `sin(a)` …

and some simple statistics:

- `mean(a)`
- `summary(a)`
- `var(a)`
- `min(a,b)`, `max(a,b)`
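To see these functions in action, here is a short sketch on a made-up vector `a` and scalar `b` (both are illustrative values, not part of the lab data):

```r
a <- c(2, 4, 6)
b <- 3

a + b        # element-wise: 5 7 9
a ** 2       # same as a^2: 4 16 36
length(a)    # 3
sum(a)       # 12
mean(a)      # 4
var(a)       # sample variance: 4
min(a, b)    # smallest value across both arguments: 2
max(a, b)    # largest value across both arguments: 6
```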

**Exercise: Simple R operations**

Define

- x <- c(4, 2, 6)

and

- y <- c(1, 0, -1)

Decide what the result will be of the following:

- length(x)
- sum(x)
- sum(x^2)
- x + y
- x * y
- x - 2
- x^2

Use R to check your answers.

Decide what the following sequences are and use R to check your answers:

- 7:11
- seq(2, 9)
- seq(4, 10, by=2)
- seq(3, 30, length=10)
- seq(6, -4, by=-2)

Determine what the result will be of the following R expressions, and then use R to check whether you are right:

- rep(2, 4)
- rep(c(1, 2), 4)
- rep(c(1, 2), c(4, 4))
- rep(1:4, 4)
- rep(1:4, rep(3, 4))

Use the `rep` function to define the following vectors in R in a simple way:

- (6, 6, 6, 6, 6, 6)
- (5, 8, 5, 8, 5, 8, 5, 8)
- (5, 5, 5, 5, 8, 8, 8, 8)

**Exercise: R as a calculator**

Calculate the following expression, where `x` and `y` have values `-0.25` and `2` respectively. Then store the result in a new variable and print its content.

`x + cos(pi/y)`

Let’s suppose we’ve collected some data from an experiment and stored them in an object `x`. Some simple summary statistics of these data can be produced:

```
x <- c(7.5, 8.2, 3.1, 5.6, 8.2, 9.3, 6.5, 7.0, 9.3, 1.2, 14.5, 6.2)
mean(x)
```

` [1] 7.22`

`var(x)`

` [1] 11`

`summary(x)`

```
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.20 6.05 7.25 7.22 8.47 14.50
```

It may be, however, that we subsequently learn that the first six data points correspond to measurements made in one experiment, and the last six to another experiment. This might suggest summarizing the two sets of data separately, so we would need to extract from `x` the two relevant subvectors. This is achieved by subscripting:

`x[1:6]`

` [1] 7.5 8.2 3.1 5.6 8.2 9.3`

`x[7:12]`

` [1] 6.5 7.0 9.3 1.2 14.5 6.2`

`summary(x[1:6])`

```
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.10 6.07 7.85 6.98 8.20 9.30
```

`summary(x[7:12])`

```
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.20 6.28 6.75 7.45 8.73 14.50
```

You simply put the indexes of the elements you want to access in square brackets. Note that R starts counting from 1 onwards. Other subsets can be created in the obvious way. Putting a minus sign in front excludes the elements:

`x[c(2, 4, 9)]`

` [1] 8.2 5.6 9.3`

`x[-(1:6)]`

` [1] 6.5 7.0 9.3 1.2 14.5 6.2`

`head(x)`

` [1] 7.5 8.2 3.1 5.6 8.2 9.3`

The function `head` provides a preview of the vector. There are also useful functions to order and sort vectors:

- `sort`: sorts in increasing order
- `order`: orders the indexes in such a way that the elements of the vector are sorted, i.e. `sort(v) = v[order(v)]`
- `rank`: gives the ranks of the elements of a vector; different options for handling *ties* are available.

```
x <- c(1.3, 3.5, 2.7, 6.3, 6.3)
sort(x)
```

` [1] 1.3 2.7 3.5 6.3 6.3`

`order(x)`

` [1] 1 3 2 4 5`

`x[order(x)]`

` [1] 1.3 2.7 3.5 6.3 6.3`

`rank(x)`

` [1] 1.0 3.0 2.0 4.5 4.5`
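The tie handling mentioned for `rank` is controlled by its `ties.method` argument. A short sketch using the same vector, plus sorting in decreasing order:

```r
x <- c(1.3, 3.5, 2.7, 6.3, 6.3)

# Default: tied values share the average of their ranks (4.5 and 4.5)
rank(x)

# Alternative: tied values all receive the smallest rank in the tie
rank(x, ties.method = "min")
# [1] 1 3 2 4 4

# Sorting from largest to smallest
sort(x, decreasing = TRUE)
# [1] 6.3 6.3 3.5 2.7 1.3
```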

**Exercise: Milk sales and summaries**

- Define x <- c(5, 9, 2, 3, 4, 6, 7, 0, 8, 12, 2, 9)

Decide what the result will be of the following:

- x[2]
- x[2:4]
- x[c(2, 3, 6)]
- x[c(1:5, 10:12)]
- x[-(10:12)]

Use R to check your answers.

- The vector y <- c(33, 44, 29, 16, 25, 45, 33, 19, 54, 22, 21, 49, 11, 24, 56) contains sales of milk in liters for 5 days in three different shops (the first 3 values are for shops 1, 2 and 3 on Monday, etc.). Produce a statistical summary of the sales for each day of the week and also for each shop.

R is an object-oriented language, so every data item is an object in R. As in other programming languages, objects are instances of “blueprints” called classes. There are the following elementary types (or “modes”):

- numeric: real number
- character: chain of characters, text
- factor: categorical data that takes a fixed set of values
- logical: TRUE, FALSE
- special values: NA (missing value), NULL (“empty object”), Inf, -Inf (infinity), NaN (not a number)

We haven’t met factors yet: They are designed to represent categorical data that can take a fixed set of possible values. Factors are built on top of integers, and have a levels attribute:

```
x <- factor(c("wt", "wt", "mut", "mut"), levels = c("wt", "mut"))
x
```

```
[1] wt wt mut mut
Levels: wt mut
```
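To make the “built on top of integers” remark concrete, the following sketch inspects the levels and the underlying integer codes of the factor defined above, and shows what happens when the `levels` argument is left out:

```r
x <- factor(c("wt", "wt", "mut", "mut"), levels = c("wt", "mut"))

levels(x)       # the fixed set of allowed values: "wt" "mut"
as.integer(x)   # the underlying integer codes: 1 1 2 2
table(x)        # counts per level, in level order

# Without `levels`, the levels are sorted alphabetically, so "mut" comes first
levels(factor(c("wt", "wt", "mut", "mut")))
```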

Data storage types include matrices, lists and data frames (tibbles), which will be introduced in the next section. Certain types can have different subtypes, e.g. numeric can be further subdivided into the integer and double types. Types can be checked by the `is.*` functions and changed (“cast”) by the `as.*` functions. Furthermore, the function `str` is very useful in order to obtain an overview of a (possibly complex) object at hand. The following examples will make this clear. We first assign the value `9` to an object and then perform various operations on it.

```
a <- 9
# is a a string?
is.character(a)
```

` [1] FALSE`

```
# is a a number?
is.numeric(a)
```

` [1] TRUE`

```
# What's its type?
typeof(a)
```

` [1] "double"`

```
# now turn it into a factor
a <- as.factor(a)
# Is it a factor?
is.factor(a)
```

` [1] TRUE`

```
# assign a string to a:
a <- "NAME"
# what's a?
class(a)
```

` [1] "character"`

`str(a) `

` chr "NAME"`
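A few more conversions with the `as.*` functions, as a sketch of what to expect when casting between types. The values are arbitrary examples.

```r
# character -> numeric
as.numeric("3.14")            # 3.14

# numeric -> character
as.character(9)               # "9"

# double -> integer drops the fractional part
as.integer(9.7)               # 9

# text that is not a number becomes NA (R also emits a warning)
suppressWarnings(as.numeric("abc"))

# caution with factors: convert via character to recover the original numbers
f <- factor(c("10", "20"))
as.numeric(f)                 # integer codes: 1 2
as.numeric(as.character(f))   # original values: 10 20
```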

Matrices are two-dimensional vectors and can be created in R in a variety of ways. Perhaps the simplest is to create the columns and then glue them together with the command `cbind`. For example:

```
x <- c(5, 7, 9)
y <- c(6, 3, 4)
z <- cbind(x, y)
z
```

```
x y
[1,] 5 6
[2,] 7 3
[3,] 9 4
```

`dim(z)`

` [1] 3 2`

We can also use the function `matrix()` directly to create a matrix.

`z <- matrix(c(5, 7, 9, 6, 3, 4), nrow = 3)`

There is a similar command, `rbind`, for building matrices by gluing rows together. The functions `cbind` and `rbind` can also be applied to matrices themselves (provided the dimensions match) to form larger matrices.
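A brief sketch of `rbind`, reusing the two vectors from the `cbind` example; the object name `r` is chosen only for this illustration:

```r
x <- c(5, 7, 9)
y <- c(6, 3, 4)

# Each vector becomes one row
r <- rbind(x, y)
r
dim(r)   # 2 rows, 3 columns

# Stacking two matrices with the same number of columns
dim(rbind(r, r))   # 4 rows, 3 columns
```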

Notice that the dimension of the matrix created with `matrix()` is determined by the size of the vector and the requirement that the number of rows is 3, as specified by the argument `nrow = 3`. As an alternative we could have specified the number of columns with the argument `ncol = 2` (obviously, it is unnecessary to give both). Notice that the matrix is “filled up” column-wise. If instead you wish to fill up row-wise, add the option `byrow=TRUE`.

```
z <- matrix(c(5, 7, 9, 6, 3, 4), nrow = 3, byrow = TRUE)
z
```

```
[,1] [,2]
[1,] 5 7
[2,] 9 6
[3,] 3 4
```

R will try to interpret operations on matrices in a natural way. For example, with z as above, and y defined below we get:

```
y <- matrix(c(1, 3, 0, 9, 5, -1), nrow = 3, byrow = TRUE)
y
```

```
[,1] [,2]
[1,] 1 3
[2,] 0 9
[3,] 5 -1
```

`y + z`

```
[,1] [,2]
[1,] 6 10
[2,] 9 15
[3,] 8 3
```

`y * z`

```
[,1] [,2]
[1,] 5 21
[2,] 0 54
[3,] 15 -4
```

Notice that multiplication here is component-wise. As with vectors it is useful to be able to extract sub-components of matrices. In this case, we may wish to pick out individual elements, rows or columns. As before, the `[ ]` notation is used to subscript. The following examples illustrate this:

`z[1, 1]`

` [1] 5`

`z[, 2]`

` [1] 7 6 4`

`z[1:2, ]`

```
[,1] [,2]
[1,] 5 7
[2,] 9 6
```

`z[-1, ]`

```
[,1] [,2]
[1,] 9 6
[2,] 3 4
```

`z[-c(1, 2), ]`

` [1] 3 4`

So, in particular, it is necessary to specify which rows and columns are required, whilst omitting the index for either dimension implies that every element in that dimension is selected.
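The last example returned a plain vector rather than a one-row matrix, because R drops dimensions of length one by default. A minimal sketch of how to keep the matrix shape with the `drop` argument of `[`:

```r
z <- matrix(c(5, 7, 9, 6, 3, 4), nrow = 3, byrow = TRUE)

# Default: a single remaining row collapses to a vector
v <- z[-c(1, 2), ]
is.matrix(v)    # FALSE
dim(v)          # NULL

# drop = FALSE keeps the result as a 1 x 2 matrix
m <- z[-c(1, 2), , drop = FALSE]
is.matrix(m)    # TRUE
dim(m)          # 1 2
```

This matters mostly in code that expects a matrix regardless of how many rows or columns happen to be selected.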

A data frame is like a matrix in which the columns can have different data types. As such, it is usually used to represent a whole data set, where the rows represent the samples and the columns the variables. Essentially, you can think of a data frame as an Excel table.

Here, we will meet the first tidyverse member, namely the *tibble* package, which improves the conventional R `data.frame` class. A tibble is a `data.frame` with a lot of tweaks and more sensible defaults that make your life easier, so that you never have to use a standard data frame anymore. For details on the tweaks, see the help on tibble: `?tibble`.

Let’s illustrate this with a small data set, `patients`, saved in comma-separated format (csv). We load it from a website using the function `read_csv`, which imports a data file in *comma-separated format (csv)* into R. In a .csv file the data are stored row-wise, and the entries in each row are separated by commas.

The function `read_csv` is from the *readr* package and will give us a tibble as the result. The function `glimpse()` gives a nice summary of a tibble.

`pat <- read_csv("http://www-huber.embl.de/users/klaus/BasicR/Patients.csv")`

```
Parsed with column specification:
cols(
PatientId = col_character(),
Height = col_double(),
Weight = col_double(),
Gender = col_character()
)
```

`pat`

```
# A tibble: 3 × 4
PatientId Height Weight Gender
<chr> <dbl> <dbl> <chr>
1 P1 1.65 75 f
2 P2 1.90 NA m
3 P3 1.60 50 f
```

`glimpse(pat)`

```
Observations: 3
Variables: 4
$ PatientId <chr> "P1", "P2", "P3"
$ Height <dbl> 1.65, 1.90, 1.60
$ Weight <dbl> 75, NA, 50
$ Gender <chr> "f", "m", "f"
```

It contains the weight, height and gender of three people. We can also use the function `read.xlsx` from the *openxlsx* package to import the data from an Excel sheet. Here, we have to use the function `as_tibble` to turn the data.frame into an equivalent tibble.

```
pat_xls <- as_tibble(read.xlsx("Patients.xlsx"))
pat_xls
```

```
# A tibble: 3 × 4
PatientId Height Weight Gender
* <chr> <chr> <chr> <chr>
1 P1 1.65 75.0 f
2 P2 1.90 <NA> m
3 P3 1.60 50.0 f
```

`str(pat_xls)`

```
Classes 'tbl_df', 'tbl' and 'data.frame': 3 obs. of 4 variables:
$ PatientId: chr "P1" "P2" "P3"
$ Height : chr "1.65" "1.90" "1.60"
$ Weight : chr "75.0" NA "50.0"
$ Gender : chr "f" "m" "f"
```
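Note that in the output above `Height` and `Weight` came back as character columns. One possible way to turn them into numbers is sketched below on a hand-typed copy of the table. The object `pat_chr` is made up for this example; the real `pat_xls` could be converted the same way.

```r
library(tibble)
library(dplyr)

# Hand-typed stand-in for the imported sheet, with numbers stored as text
pat_chr <- tibble(
  PatientId = c("P1", "P2", "P3"),
  Height    = c("1.65", "1.90", "1.60"),
  Weight    = c("75.0", NA, "50.0"),
  Gender    = c("f", "m", "f")
)

# Convert the two measurement columns; NA stays NA
pat_num <- mutate(pat_chr,
                  Height = as.numeric(Height),
                  Weight = as.numeric(Weight))
glimpse(pat_num)
```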

Now that we have imported the small data set, you might be wondering how to actually access the data. For this the functions `filter` and `select` from the *dplyr* package of the *tidyverse* are useful. `filter` will select certain rows (observations), while `select` will subset the columns (variables of the data). In the following command, we get all the patients that are less than 1.7 tall and select their Height and Gender as well as their Id:

```
pat_tiny <- filter(pat, Height < 1.7)
select(pat_tiny, PatientId, Height, Gender)
```

```
# A tibble: 2 × 3
PatientId Height Gender
<chr> <dbl> <chr>
1 P1 1.65 f
2 P3 1.60 f
```
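For comparison, the same selection can be written with base R subscripting. The sketch below retypes the three patients from the printed table so that it runs without the download; `pat_demo` is a name chosen for this example.

```r
library(tibble)
library(dplyr)

pat_demo <- tibble(
  PatientId = c("P1", "P2", "P3"),
  Height    = c(1.65, 1.90, 1.60),
  Weight    = c(75, NA, 50),
  Gender    = c("f", "m", "f")
)

# dplyr: rows first, then columns
via_dplyr <- select(filter(pat_demo, Height < 1.7), PatientId, Height, Gender)
via_dplyr

# base R: logical row index and a character vector of column names
via_base <- pat_demo[pat_demo$Height < 1.7, c("PatientId", "Height", "Gender")]
via_base
```

Both calls keep patients P1 and P3. The dplyr version reads closer to a sentence, which is one reason this tutorial prefers it.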

There are a couple of operators useful for comparisons:

- `Variable == value`: equal
- `Variable != value`: unequal
- `Variable < value`: less
- `Variable > value`: greater
- `&`: and
- `|`: or
- `!`: negation
- `%in%`: is element?

The function `filter` allows us to combine multiple conditions easily: if you specify several of them, they will automatically be combined via a `&`. For example, we can easily get small and female patients via:

`filter(pat, Height < 1.5, Gender == "f")`

```
# A tibble: 0 × 4
# ... with 4 variables: PatientId <chr>, Height <dbl>, Weight <dbl>,
# Gender <chr>
```

We can also retrieve small OR female patients via

`filter(pat, (Height < 1.5) | (Gender == "f"))`

```
# A tibble: 2 × 4
PatientId Height Weight Gender
<chr> <dbl> <dbl> <chr>
1 P1 1.65 75 f
2 P3 1.60 50 f
```
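Two further points are worth a quick sketch: the `%in%` and `!` operators, and the fact that `filter` drops rows for which the condition evaluates to `NA`. The tibble `pat_demo` is a hand-typed copy of the printed patients table, introduced only for this example.

```r
library(tibble)
library(dplyr)

pat_demo <- tibble(
  PatientId = c("P1", "P2", "P3"),
  Height    = c(1.65, 1.90, 1.60),
  Weight    = c(75, NA, 50),
  Gender    = c("f", "m", "f")
)

# %in%: keep patients whose id is in a given set
filter(pat_demo, PatientId %in% c("P1", "P3"))

# !: negate a condition (everyone who is not female)
filter(pat_demo, !(Gender == "f"))

# P2 has a missing Weight, so `Weight > 60` is NA for that row and it is dropped
filter(pat_demo, Weight > 60)
```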

Lists can be viewed as vectors that contain not only elementary objects such as numbers or strings but potentially arbitrary objects. The following example will make this clear. The list that we create contains a number, two vectors and a string that is itself part of a list.

```
L <- list(one = 1, two = c(1, 2), five = seq(1, 4, length = 5),
list(string = "Hello World"))
L
```

```
$one
[1] 1
$two
[1] 1 2
$five
[1] 1.00 1.75 2.50 3.25 4.00
[[4]]
[[4]]$string
[1] "Hello World"
```

Lists are the most general data type in R. In fact, data frames (tibbles) are lists with elements of equal lengths. List elements can be accessed either by their name using the dollar sign `$` or by their position using the double bracket operator `[[]]`.

`names(L)`

` [1] "one" "two" "five" ""`

`L$five + 10`

` [1] 11.0 11.8 12.5 13.2 14.0`

`L[[3]] + 10`

` [1] 11.0 11.8 12.5 13.2 14.0`

Using only a single bracket (`[]`) will extract a sublist, so the result will always be a list, while the dollar sign `$` or the double bracket operator `[[]]` removes a level of the list hierarchy. Thus, in order to access the string, we would first have to extract the sublist containing the string from `L` and then get the actual string from the sublist. In short, `[[` drills down into the list while `[` returns a new, smaller list.

`L[[4]]$string`

` [1] "Hello World"`

`L[2]`

```
$two
[1] 1 2
```
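The difference between the two bracket operators becomes visible when checking the class of the result. A small sketch, rebuilding the list `L` from above so that it runs on its own:

```r
L <- list(one = 1, two = c(1, 2), five = seq(1, 4, length.out = 5),
          list(string = "Hello World"))

class(L[2])     # "list": still a list, of length 1
class(L[[2]])   # "numeric": the vector itself

# Arithmetic works on the extracted vector ...
L[[2]] * 2
# ... but not on the sublist: L[2] * 2 would raise an error

# Single brackets can select several elements at once
names(L[c("one", "two")])
```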

Since data frames are just a special kind of list, they can actually be accessed in the same way.

`pat$Height`

` [1] 1.65 1.90 1.60`

`pat[[2]]`

` [1] 1.65 1.90 1.60`

`pat[["Gender"]]`

` [1] "f" "m" "f"`

More on lists can be found in the respective chapter of “R for Data Science”.

We prepare a simple vector to illustrate the access options again:

```
sample_vector <- c("Alice" = 5.4, "Bob" = 3.7, "Claire" = 8.8)
sample_vector
```

```
Alice Bob Claire
5.4 3.7 8.8
```

The simplest way to access the elements in a vector is via their indices. Specifically, you provide a vector of indices to say which elements from the vector you want to retrieve. A minus sign excludes the respective positions:

`sample_vector[1:2]`

```
Alice Bob
5.4 3.7
```

`sample_vector[-(1:2)]`

```
Claire
8.8
```

If you generate a boolean vector the same size as your actual vector, you can use the positions of the true values to pull out certain positions from the full set. You can also use smaller boolean vectors and they will be recycled to match all of the positions in the vector, but this is less common.

`sample_vector[c(TRUE, FALSE, TRUE)]`

```
Alice Claire
5.4 8.8
```

This can also be used in conjunction with logical tests which generate a boolean result. Boolean vectors can be combined with logical operators to create more complex filters.

`sample_vector[sample_vector < 6]`

```
Alice Bob
5.4 3.7
```
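A short sketch of combined logical filters and of a recycled logical index, using the same `sample_vector`:

```r
sample_vector <- c("Alice" = 5.4, "Bob" = 3.7, "Claire" = 8.8)

# Combine two tests with & (and): values between 4 and 6
sample_vector[sample_vector > 4 & sample_vector < 6]

# Combine with | (or): values below 4 or above 8
sample_vector[sample_vector < 4 | sample_vector > 8]

# A shorter logical index is recycled: c(TRUE, FALSE) acts like c(TRUE, FALSE, TRUE)
sample_vector[c(TRUE, FALSE)]

# which() turns the logical vector into positions
which(sample_vector < 6)
```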

If there are names present, such as column names (note that row names are not preserved in the tidyverse), you can access by name as well:

`sample_vector[c("Alice", "Claire")]`

```
Alice Claire
5.4 8.8
```

R encourages the use of functions for programming. Instead of e.g. looping through a vector or data frame, you can use specialized functions that apply another function to each element of your data. These kinds of functions are called apply functions. Here, we will use the `map` family of functions from the *purrr* package instead of the base R functions. An apply / map call applies a function to a vector or list and returns the result in another vector/list. Thus, each step consists of “mapping” a list value to a result.

We will introduce the `map` functions by looking at a typical data set in a tabular format, where the rows represent the samples and the columns the variables measured. The data set `bodyfat` contains various body measures for 252 men. We turn it into a tibble by using the function `as_tibble()`.

Let’s inspect it a bit. The first thing we notice is that tibbles print only the first 10 rows by default. Tibbles are designed so that you don’t accidentally overwhelm your console when you print large data frames. Additionally, we get a nice summary of the variables available in our data set.

```
load(url("http://www-huber.embl.de/users/klaus/BasicR/bodyfat.rda"))
bodyfat <- as_tibble(bodyfat)
bodyfat
```

```
# A tibble: 252 × 15
density percent.fat age weight height neck.circum chest.circum
<dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 1.07 12.3 23 154 67.8 36.2 93.1
2 1.09 6.1 22 173 72.2 38.5 93.6
3 1.04 25.3 22 154 66.2 34.0 95.8
4 1.08 10.4 26 185 72.2 37.4 101.8
5 1.03 28.7 24 184 71.2 34.4 97.3
6 1.05 20.9 24 210 74.8 39.0 104.5
7 1.05 19.2 26 181 69.8 36.4 105.1
8 1.07 12.4 25 176 72.5 37.8 99.6
9 1.09 4.1 25 191 74.0 38.1 100.9
10 1.07 11.7 23 198 73.5 42.1 99.6
# ... with 242 more rows, and 8 more variables: abdomen.circum <dbl>,
# hip.circum <dbl>, thigh.circum <dbl>, knee.circum <dbl>,
# ankle.circum <dbl>, bicep.circum <dbl>, forearm.circum <dbl>,
# wrist.circum <dbl>
```

As data frames are just a special kind of list, namely a list that is composed of vectors of equal length, we can use a map function to compute the mean value for every variable in our data set.

`head(map_dbl(bodyfat, mean))`

```
density percent.fat age weight height neck.circum
1.06 19.15 44.88 178.92 70.15 37.99
```

Here we use `map_dbl` to ensure that we get a double value back. There are specialized mapping functions for many data types, but you can always use the default `map()` function as a fallback when there is no specialized equivalent available.
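The difference between `map()` and its typed variants can be seen on a small made-up list (`scores` is an illustrative name, not part of the lab data):

```r
library(purrr)

scores <- list(a = c(1, 2, 3), b = c(4, 5, 6))

# map() always returns a list
map(scores, mean)

# map_dbl() returns a named double vector
map_dbl(scores, mean)

# map_int() and map_chr() do the same for integers and strings
map_int(scores, length)
map_chr(scores, class)

# base R counterpart of map_dbl() with an explicit result type
vapply(scores, mean, numeric(1))
```

The typed variants stop with an error if the function returns something of the wrong type, which helps to catch mistakes early.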

The map functions are really useful for applying your custom functions. For example, we can compute a robust z-score by subtracting the median and dividing by the median absolute deviation (computed by `mad`) for each variable.

This will bring all the variables in the data set to a common scale and make them directly comparable. These kinds of transformations are often performed before clustering or dimensionality reduction.
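To make the computation concrete, here is a step-by-step sketch on a made-up vector `v` with one outlier. It assumes R's `mad()` function with its default settings, which scales the median absolute deviation by a constant of 1.4826 so that it is comparable to the standard deviation for normally distributed data.

```r
v <- c(1, 2, 3, 4, 100)

med <- median(v)                          # 3
abs_dev <- abs(v - med)                   # 2 1 0 1 97
mad_by_hand <- 1.4826 * median(abs_dev)   # 1.4826 * 1

# mad() performs the same computation
all.equal(mad_by_hand, mad(v))

# Robust z-score versus the classical z-score
(v - med) / mad(v)
(v - mean(v)) / sd(v)
```

The outlier barely influences the median and the MAD, whereas it inflates both the mean and the standard deviation used by the classical z-score.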

You can create your own functions very easily by adhering to the following template:

```
function_name <- function(argument_1, argument_2,
                          optional_argument = default_value) {
  return(...)
}
```

As you can see, the source code of the function has to be in curly brackets, while the arguments are defined in the parentheses. Arguments without a default value are mandatory, and default values are specified with an equals sign.

By default R returns the result of the last computation performed within the curly brackets (often, this will be the last line of the function). However, you can always specify the return value directly with `return()`. If you want to return multiple values, you can return a list.
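As an illustration of the template, the sketch below defines a small helper with one mandatory argument, one optional argument, and a list as return value. The name `centre_and_spread` and its arguments are invented for this example.

```r
centre_and_spread <- function(x, na.rm = FALSE) {
  # `x` is mandatory; `na.rm` is optional and defaults to FALSE
  result <- list(
    centre = median(x, na.rm = na.rm),
    spread = mad(x, na.rm = na.rm)
  )
  return(result)
}

# Using the default: the NA propagates into the result
centre_and_spread(c(1, 2, NA, 4))

# Overriding the default by name
res <- centre_and_spread(c(1, 2, NA, 4), na.rm = TRUE)
res$centre   # 2
```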

We can now easily define our function and apply it to the data set.

```
robust_z <- function(x){
(x - median(x)) / mad(x)
}
head(map_df(bodyfat, robust_z), 3)
```

```
# A tibble: 3 × 15
density percent.fat age weight height neck.circum chest.circum
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.763 -0.745 -1.69 -0.775 -0.759 -0.759 -0.782
2 1.459 -1.414 -1.77 -0.113 0.759 0.211 -0.722
3 -0.648 0.658 -1.77 -0.783 -1.265 -1.686 -0.460
# ... with 8 more variables: abdomen.circum <dbl>, hip.circum <dbl>,
# thigh.circum <dbl>, knee.circum <dbl>, ankle.circum <dbl>,
# bicep.circum <dbl>, forearm.circum <dbl>, wrist.circum <dbl>
```

Here, we used the function `map_df` to make sure that we get a data frame back. There is an even simpler way to achieve the same goal. Using a tilde (~) to create an R formula, the map functions allow you to define anonymous functions with a default argument `.x`.

With this, we do not need to define our robust z–score function explicitly.

`head(map_df(bodyfat, ~ (.x - median(.x)) / mad(.x)), 3)`

```
# A tibble: 3 × 15
density percent.fat age weight height neck.circum chest.circum
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.763 -0.745 -1.69 -0.775 -0.759 -0.759 -0.782
2 1.459 -1.414 -1.77 -0.113 0.759 0.211 -0.722
3 -0.648 0.658 -1.77 -0.783 -1.265 -1.686 -0.460
# ... with 8 more variables: abdomen.circum <dbl>, hip.circum <dbl>,
# thigh.circum <dbl>, knee.circum <dbl>, ankle.circum <dbl>,
# bicep.circum <dbl>, forearm.circum <dbl>, wrist.circum <dbl>
```
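On a toy vector, the equivalence between the formula shorthand and an explicit anonymous function can be checked directly. The vector `v` and the expression are arbitrary examples.

```r
library(purrr)

v <- c(1, 2, 3)

# Formula shorthand: .x stands for the current element
with_formula <- map_dbl(v, ~ .x^2 + 1)

# The same thing with an explicit anonymous function
with_function <- map_dbl(v, function(x) x^2 + 1)

identical(with_formula, with_function)
with_formula
# [1]  2  5 10
```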

Often, we want to use variables stored in our data set to compute derived quantities. For example, we might be interested in the weight in kilograms instead of pounds and the height in meters instead of inches. The function `mutate` allows us to do this.

```
pb_to_kg <- 1/2.2046
inch_to_m <- 0.0254
bodyfat <- mutate(bodyfat, height_m = height * inch_to_m,
weight_kg = weight * pb_to_kg)
select(bodyfat, height, height_m, weight, weight_kg)
```

```
# A tibble: 252 × 4
height height_m weight weight_kg
<dbl> <dbl> <dbl> <dbl>
1 67.8 1.72 154 70.0
2 72.2 1.84 173 78.6
3 66.2 1.68 154 69.9
4 72.2 1.84 185 83.8
5 71.2 1.81 184 83.6
6 74.8 1.90 210 95.4
7 69.8 1.77 181 82.1
8 72.5 1.84 176 79.8
9 74.0 1.88 191 86.6
10 73.5 1.87 198 89.9
# ... with 242 more rows
```
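`mutate` evaluates its arguments in order, so a new column can already be used later in the same call. The sketch below relies on this to add a body mass index. The tibble `d` and the column `bmi` are made up for illustration and are not part of the `bodyfat` data.

```r
library(tibble)
library(dplyr)

# Two made-up subjects in the original units (inches, pounds)
d <- tibble(height = c(70, 65), weight = c(180, 150))

d <- mutate(d,
            height_m  = height * 0.0254,
            weight_kg = weight / 2.2046,
            bmi       = weight_kg / height_m^2)  # uses the two columns defined just above
d
```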

We often want to apply our function only to variables in the data set that are of a specific type, e.g. numeric. To do so, we can use simple predicate functions that return `TRUE` or `FALSE` in combination with `discard` or `keep` to perform appropriate selections. For example, we can exclude the id column of the patients data set before computing the variable-wise means.

`keep(pat, is_double)`

```
# A tibble: 3 × 2
Height Weight
<dbl> <dbl>
1 1.65 75
2 1.90 NA
3 1.60 50
```

`map_dbl(discard(pat, is_character), mean, na.rm = TRUE)`

```
Height Weight
1.72 62.50
```

Note that we specified `na.rm = TRUE` as an additional argument to the map function. This will be directly passed on to the argument list of `mean` and excludes any missing values before computing the means.
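The effect of `na.rm` is easiest to see on a single vector. The sketch below uses the three `Weight` values from the printed patients table:

```r
w <- c(75, NA, 50)   # the Weight column from the patients table

mean(w)                 # NA: one missing value makes the mean unknown
mean(w, na.rm = TRUE)   # 62.5: computed from the two observed values

# The same idea by hand: drop the NAs first
mean(w[!is.na(w)])
```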

**Exercise: Handling a small data set**

- Import the data set `Patients.csv` from the website http://www-huber.embl.de/users/klaus/BasicR/Patients.csv

- Which variables are stored in the data frame and what are their