Overview
Teaching: 90 min
Exercises: 0 minQuestions
What is R and when should I use it?
How do I use the RStudio graphical user interface?
How do I assign variables?
How do I read tabular data into R?
What is a data frame?
How do I access subsets of a data frame?
How do I calculate simple statistics like mean and median?
Where can I get help?
How do I install a package?
How can I plot my data?
Objectives
Get introduced to R and RStudio.
Learn basic R syntax.
Read tabular data from a file into R.
Assign values to variables.
Select individual values and subsections from data.
Perform operations on a data frame of data.
Be able to install a package from CRAN.
Display simple graphs using ggplot2.
R is a statistical computing environment that has become a widely adopted standard for data analysis various fields - notably bioinformatics.
Consider using R to get access to packages that implement solution to a given problems like
With Python, we have the framework to write great software for data analysis - with R, that software is, for better or worse, often already written
While R can be used directly in the shell, it is much nicer with a graphical interface. RStudio is by far the best one - let’s learn how to use it.
There are two main ways of interacting with R: using the console or by using script files (plain text files that contain your code).
The console window (in RStudio, the bottom left panel) is the place where R is
waiting for you to tell it what to do, and where it will show the results of a
command. You can type commands directly into the console, but they will be
forgotten when you close the session. It is better to enter the commands in the
script editor, and save the script. This way, you have a complete record of what
you did, you can easily show others how you did it and you can do it again later
on if needed. You can copy-paste into the R console, but the Rstudio script
editor allows you to ‘send’ the current line or the currently selected text to
the R console using the Ctrl-Enter
shortcut.
At some point in your analysis you may want to check the content of variable or
the structure of an object, without necessarily keep a record of it in your
script. You can type these commands directly in the console. RStudio provides
the Ctrl-1
and Ctrl-2
shortcuts allow you to jump between the script and the
console windows.
If R is ready to accept commands, the R console shows a >
prompt. If it
receives a command (by typing, copy-pasting or sent from the script editor using
Ctrl-Enter
), R will try to execute it, and when ready, show the results and
come back with a new >
-prompt to wait for new commands.
If R is still waiting for you to enter more data because it isn’t complete yet,
the console will show a +
prompt. It means that you haven’t finished entering
a complete command. This is because you have not ‘closed’ a parenthesis or
quotation. If you’re in RStudio and this happens, click inside the console
window and press Esc
; this should help you out of trouble.
### R as a calculator and variable assignment
Just like with python, we can perform simple operations using the R console and assign the output to variables
1 + 1
[1] 2
<-
is the assignment operator. It assigns values on the right to objects on
the left. So, after executing x <- 3
, the value of x
is 3
. The arrow can
be read as 3 goes into x
. You can also use =
for assignments but not in
all contexts so it is good practice to use <-
for assignments. =
should only
be used to specify the values of arguments in functions, see below.
Save some keystrokes..
In RStudio, typing
Alt + -
(pushAlt
, the key next to your space bar at the same time as the-
key) will write ` <- ` in a single keystroke.
foo <- 1 + 1
foo + 1
[1] 3
Commenting
We can add comments to our code using the
#
character. It is useful to document our code in this way so that others (and us the next time we read it) have an easier time following what the code is doing.
We can create some random numbers from the normal distribution and calculate the mean by using two built-in functions
randomNumbers <- rnorm(10)
mean(randomNumbers)
[1] 0.101893
All those built-ins..
In contrast to Python, R has many functions available without loading any extra packages (2383 to be exact). These are mostly functions and becoming proficient with R is a lot about learning how these work.
All R functions are documented and you can read about them using RStudio documentation pane, or typing ?object
, eg
?mean
### R data types
R has three main data types that we need to know about, the two main ones are numeric
which is both integers and floats, character
, and factor
which is like integer but with a character label attached to number (or level
).
is.numeric(1)
[1] TRUE
is.character("foo")
[1] TRUE
is.factor(factor(c("a", "b", "c", "c")))
[1] TRUE
There are numerous data structures and classes but mainly three we need to care about.
Vectors Scalars act like vectors of length 1.
foo <- c(1, 2, 3)
foo[2]
[1] 2
bar <- 1
bar[1]
[1] 1
Indexing starts at 1!
Unlike python (and indeed most other programming languages) indexing starts at 1.
bar[0]
is however still syntactic correct, it just selects the null-vector from vectorbar
thus always a vector of length 0.
Matrices work like one would expect, values in two dimensions.
(foo <- matrix(c(1, 2, 3, 4, 5, 6), nrow=2))
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
Lists are widely used and work like vectors except that each element can be any data structure - so you can have lists of matrices, lists of lists etc.
(foo <- list(bar=c(1, 2, 3), baz=matrix(c(1, 2, 3, 4, 5, 6), nrow=2)))
$bar
[1] 1 2 3
$baz
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
foo[[2]]
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
foo$baz
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
Finally, data frames but they are so important they deserve a small section on their own.
Now that we know how to assign things to variables and use functions, let’s read some yeast OD growth data into R using read.table
and briefly examine the dataset.
growth <- read.table(file = "data/yeast-growth.csv", header = TRUE, sep = ",")
Loading Data Without Headers
What happens if you put
header = FALSE
? The default value isheader = TRUE
?. What do you expect will happen if you leave the default value? Before you run any code, think about what will happen to the first few rows of your data frame, and its overall size. Then run the following code and see if your expectations agree:head(read.table(file = "data/yeast-growth.csv", header = FALSE, sep = ","))
Where is that file? Or, what is my working directory?
R is always running inside a specific directory, the working directory. Paths can be given relative to that directory so with
data/yeast-growth.csv
we mean ‘the fileyeast-growth.csv
in thedata
directory that is right at the working directory. Set the working directory using RStudioSession
>Set Working Directory..
orsetwd()
Now that our data is loaded in memory, we can start doing things with it.
First, let’s ask what type of thing growth
is:
head(growth)
V3 strain timepoint od medium
1 1e-02 strain-a 1 0.017 low
2 3e-02 strain-b 1 0.017 low
3 1e+00 strain-c 1 0.018 medium
4 3e+00 strain-d 1 0.017 medium
5 3e+01 strain-e 1 0.017 medium
6 1e+02 strain-f 1 0.016 high
str(growth) # what data types are the different columns?
'data.frame': 455 obs. of 5 variables:
$ V3 : num 1e-02 3e-02 1e+00 3e+00 3e+01 1e+02 3e+02 1e-02 3e-02 1e+00 ...
$ strain : Factor w/ 7 levels "strain-a","strain-b",..: 1 2 3 4 5 6 7 1 2 3 ...
$ timepoint: int 1 1 1 1 1 1 1 2 2 2 ...
$ od : num 0.017 0.017 0.018 0.017 0.017 0.016 0.015 0.015 0.018 0.021 ...
$ medium : Factor w/ 3 levels "high","low","medium": 2 2 3 3 3 1 1 2 2 3 ...
class(growth)
[1] "data.frame"
The output tells us that is a data frame. Think of this structure as a spreadsheet in MS Excel that many of us are familiar with. Data frames are very useful for storing data and you will find them elsewhere when programming in R. A typical data frame of experimental data contains individual observations in rows and variables in columns.
We can see the shape, or dimensions, of the data frame with the function dim
:
dim(growth)
[1] 455 5
This tells us that our data frame, growth
, has 455 rows and 5 columns.
If we want to get a single value from the data frame, we can provide an index in square brackets, just as we do in math:
# first value in dat
growth[1, 1]
[1] 0.01
# middle value in dat
growth[30, 2]
[1] strain-b
7 Levels: strain-a strain-b strain-c strain-d strain-e ... strain-g
An index like [30, 2]
selects a single element of a data frame, but we can select whole sections as well.
For example, we can select the first ten days (columns) of values for the first four patients (rows) like this:
growth[1:4, 1:2]
V3 strain
1 0.01 strain-a
2 0.03 strain-b
3 1.00 strain-c
4 3.00 strain-d
The slice 1:4
means, “Values from 1 to 4.”
We can use the function c
, which stands for concatenate, to select non-contiguous values:
growth[c(3, 8, 37, 56), c(1, 3)]
V3 timepoint
3 1e+00 1
8 1e-02 2
37 3e-02 6
56 3e+02 8
We also don’t have to provide a slice for either the rows or the columns.
If we don’t include a slice for the rows, R returns all the rows; if we don’t include a slice for the columns, R returns all the columns.
If we don’t provide a slice for either rows or columns, e.g. growth[, ]
, R returns the full data frame.
growth[5, ]
V3 strain timepoint od medium
5 30 strain-e 1 0.017 medium
Addressing Columns by Name (the better way)
Columns can also be addressed by name, with either the
$
operator (ie.growth$medium
) or square brackets (ie.growth[,"medium"]
). You can learn more about subsetting by column name in this supplementary lesson.
Particularly useful is also to user other vectors as filters and only return the rows that evaluate to TRUE
. Here, growth$strain == "strain-e"
gives a vector with TRUE
or FALSE
for every element in growth$strain
that is equal to "strain-e"
.
head(growth[growth$strain == "strain-e",])
V3 strain timepoint od medium
5 30 strain-e 1 0.017 medium
12 30 strain-e 2 0.019 medium
19 30 strain-e 3 0.016 medium
26 30 strain-e 4 0.022 medium
33 30 strain-e 5 0.022 medium
40 30 strain-e 6 0.022 medium
Now let’s perform some common mathematical operations to learn about our growth curves.
max(growth[growth$strain == "strain-e", "od"])
[1] 0.065
R also has functions for other common calculations, e.g. finding the minimum, mean, and standard deviation of the data:
min(growth[growth$strain == "strain-e", "od"])
[1] 0.016
mean(growth[growth$strain == "strain-e", "od"])
[1] 0.04347692
sd(growth[growth$strain == "strain-e", "od"])
[1] 0.01259974
We may want to compare the different strains and for that we can use the split-apply approach which is very common in R. One (out of many) functions in R that implements this is tapply
tapply(growth$od, growth$strain, max)
strain-a strain-b strain-c strain-d strain-e strain-f strain-g
0.237 0.233 0.231 0.221 0.065 0.040 0.034
There are many more apply
style functions among which lapply
for applying functions of to elements of lists, apply
for applying functions to rows or columns of a matrix.
Other forms of apply
split
is a function that can turn an array to a list based on another variable that indicates the groups, e.g. our ‘strains’. Read a bit aboutsplit
then use it together withlapply
to achieve the same result astapply(growth$od, growth$strain, max)
Next we will plot our data but since the default graphics system in R is quite cumbersome to use, and there is a very popular both easy to use and the same time very flexible package for making beautiful plots, we will use that instead - ggplot2
. Since ggplot2
isn’t included with R by default, we first have to install it.
Install in RStudio by the menu system or type
install.packages("ggplot2")
This will also install a number of packages that ggplot2
depends on. Once done, load the package from the library by
library(ggplot2)
ggplot2
works well with data frames, particularly when formatted in the ‘long’ format that our growth data is already in. Plots are initialized with the ggplot()
function and then we add layers to it to represent the data. Let’s first make a simple scatter plot.
ggplot(growth, aes(x=timepoint, y=od)) +
geom_point()
Let’s add another layer, a line this time.
ggplot(growth, aes(x=timepoint, y=od)) +
geom_point() +
geom_line()
Oops, that looks funny. Why? Because we haven’t informed ggplot about the strains that each should make up a trajectory in our plot. We can do that by simply adding strain as another aesthetic.
ggplot(growth, aes(x=timepoint, y=od, color=strain)) +
geom_point() +
geom_line()
Plotting each line in separat facet would have been another option
ggplot(growth, aes(x=timepoint, y=od)) +
geom_point() +
geom_line() +
facet_wrap(~strain)
ggplot2
can present data in a large number of ways, explore the
online documentation or the
R graph gallery
for inspiration.
Key Points
R is a strong statistical computing environment
Thousands of packages for R
Use
variable <- value
to assign a value to a variable in order to record it in memory.Objects are created on demand whenever a value is assigned to them.
Use
read.table
andwrite.table
to import / export data.The function
str
describes the data frame.Use
object[x, y]
to select a single element from a data frame.Use
from:to
to specify a sequence that includes the indices fromfrom
toto
.All the indexing and slicing that works on data frames also works on vectors.
Use
#
to add comments to programs.Use
mean
,max
,min
andsd
to calculate simple statistics.Use
tapply
to calculate statistics across the groups in a data frame.Use
ggplot
to create both simple and advanced visualizations.