Overview
Teaching: 30 min
Exercises: 0 minQuestions
How do I get an overview of an RNA-seq dataset?
How do I calculate differentially expressed genes?
How do I find enriched biological processes?
Objectives
Get introduced to using a specific R-package.
Learn how to perform basic gene expression analysis.
von Wulffen et al has deposited a RNA-seq expression dataset from studying the effects on E. coli transitioning from anaerobic conditions to aerobic conditions. Three biological replicate cultures where grown in anaerobic conditions, sampled, then subjected to aeration at 1 l/min and new samples were taken after 0.5, 1, 2, 5 and 10 min. Total RNA was extracted from the samples, ribo-depleted and sequenced on Illumina HISeq. Reads were aligned to K12 reference genome and counted for each gene.
That data has been downloaded here and we will here use it to provide an example of how to perform a introductory analysis using the edgeR package. We will
We will make use of the Bioconductor edgeR
package as well as the org.EcK12.eg.db
package so we start by downloading and installing those.
source("https://bioconductor.org/biocLite.R")
biocLite(c("edgeR", "org.EcK12.eg.db"))
install.packages("locfit")
Then load the required packages.
library(edgeR)
library(ggplot2)
library(org.EcK12.eg.db)
edgeR
comes with very good user manual. You can access it by
edgeRUsersGuide()
Vignettes often provide great introduction to packages
edgeR
provides a special function to open the vignette, other packages use thevignette(topic, package)
function. See which vignettes are available for e.g. ggplot2!
The read-counts data is simply a table and we already know how to read those.
wulffenTable <- read.table("data/GSE71562.csv", header = TRUE, row.names = 1, sep = ",")
head(wulffenTable)
E14R012a01 E14R012a02 E14R012a03 E14R012a04 E14R012a05 E14R012a06
aaeA 100 56 44 94 32 38
aaeB 116 47 54 80 37 43
aaeR 316 253 249 396 181 176
aaeX 77 53 53 86 46 37
aas 407 286 283 375 188 169
aat 243 169 163 252 104 169
E14R012b01 E14R012b02 E14R012b03 E14R012b04 E14R012b05 E14R012b06
aaeA 55 41 74 89 88 101
aaeB 54 31 72 75 69 123
aaeR 220 164 277 363 400 333
aaeX 68 35 70 85 96 91
aas 265 189 380 362 399 427
aat 104 116 219 263 232 300
E14R012c01 E14R012c02 E14R012c03 E14R012c04 E14R012c05 E14R012c06
aaeA 29 109 132 90 50 66
aaeB 22 101 104 88 58 81
aaeR 117 381 521 393 194 213
aaeX 52 76 134 132 45 58
aas 128 471 614 470 208 283
aat 68 233 348 237 123 202
Genes in rows, samples in columns.
We also need to know which sample is which is which and there is a different file that contains that information.
samples <- read.table("data/pheno.csv", header = TRUE, row.names = 1, sep = ",")
samples
replicate time
E14R012a01 a t0
E14R012a02 a t0.5
E14R012a03 a t1
E14R012a04 a t2
E14R012a05 a t5
E14R012a06 a t10
E14R012b01 b t0
E14R012b02 b t0.5
E14R012b03 b t1
E14R012b04 b t2
E14R012b05 b t5
E14R012b06 b t10
E14R012c01 c t0
E14R012c02 c t0.5
E14R012c03 c t1
E14R012c04 c t2
E14R012c05 c t5
E14R012c06 c t10
We then create a DGEList
which is a class used by edgeR
and calculate normalization factor for each library (to make sure we don’t overestimate expression of genes that come from samples that were sequenced deeper).
wulffen <- DGEList(counts=wulffenTable, genes=rownames(wulffenTable),
samples=samples)
wulffen <- calcNormFactors(wulffen)
An often very useful way to explore large datasets is to perform a PCA and plot the samples in 2D that maximally capture the variation in the dataset. This must be done on a statistic for each gene that is independent on the length of the gene so for this purpose we calculate get the ‘counts per million’ matrix.
wulffenCpm <- cpm(wulffen)
Then we perform PCA using the built-in R function prcomp
.
scores <- prcomp(log2(t(wulffenCpm) + 0.25))$x
What did the
t
do? Why+ 0.25
?
prcomp
requires thevariables
in this case the genes, to come in the rows so we usedt
to transpose the data matrix. Since we know gene expression values tend to follow log-normal distributions, we uselog2
to transform the data. Why did we add the magic value0.25
? Try removing it and see what you get.
To get a nice data frame that we can use for plotting we simply use merge
with the samples data fram.
pcaDf <- merge(scores, samples, by=0)
Then we can plot the data using ggplot2
ggplot(pcaDf, aes(PC1, PC2, label=time, color=replicate)) +
geom_text()
The time-series can easily be recognized which is a good sign that experiment was successful.
From our PCA we could, as expected, see that the las timepoint is the most dissimilar from the the anaerobic condition. Let’s make a comparison between the anaerobic and 10 min anearobic samples and see which genes are differentially expressed between those.
With edgeR
we will fit a simple generalized linear model to get estimates for differential expression and for that we first need to create a design matrix that accurately describes the comparison we are after.
wulffenShort <- wulffen[, wulffen$samples$time %in% c("t0", "t10")]
design <- model.matrix(~as.character(time), data=wulffenShort$samples)
colnames(design) <- c("(Intercept)", "t10")
design
(Intercept) t10
E14R012a01 1 0
E14R012a06 1 1
E14R012b01 1 0
E14R012b06 1 1
E14R012c01 1 0
E14R012c06 1 1
attr(,"assign")
[1] 0 1
attr(,"contrasts")
attr(,"contrasts")$`as.character(time)`
[1] "contr.treatment"
The matrix we just created indicates which samples should be used to calculate the intercept (all samples) and then the effect of 10 min aeration (the t10 samples). With these objects we can now perform our differential expression analysis.
wulffenShort <- estimateDisp(wulffenShort, design)
fit <- glmFit(wulffenShort, design)
lrt <- glmLRT(fit)
topTags(lrt)
Coefficient: t10
genes logFC logCPM LR PValue FDR
1464 mqo 5.983158 8.274567 501.8363 3.788200e-111 1.636124e-107
411 cyoA 7.245625 9.872659 457.2736 1.884543e-101 4.069670e-98
2240 sodA 7.363860 9.419456 444.0240 1.441070e-98 2.074661e-95
1735 phoH 5.249415 6.500320 409.3635 5.042966e-91 5.445142e-88
413 cyoC 6.132766 8.098590 401.6896 2.361164e-89 2.039573e-86
415 cyoE 5.368196 8.313842 400.1644 5.071640e-89 3.650736e-86
1867 puuD 4.451530 6.908350 383.1221 2.601531e-85 1.605145e-82
674 fhuF 6.378675 7.996506 374.5202 1.940986e-83 1.047890e-80
1186 iscR 4.453203 9.513508 369.7437 2.128146e-82 1.021274e-79
227 betT 4.818173 8.357982 361.3547 1.427579e-80 6.165712e-78
What did we just do? The estimateDisp
function is needed to estimate variance components robustly, glmFit
fits the model we are after that essentially has one overall mean of expression and another mean for the t10 samples. glmLRT performes a log-likelihood ratio test against the null-hypothesis that t10 has the same average as all the samples together. Then with topTags
we extract a table with the 10 most differentially expressed genes.
Write the expression estimates to a file
It is often useful to export the data for use in other programs and sharing with colleagues. Use the
write.table
function to export a comma separated file with the output oftopTags
for all genes.
Extra challenge: identifier mappings
Use the
merge
function and the mapping to bnumbers indata/ecoli.csv
to also add a column with bnumbers as identifiers
We want to examine if the top differentially expressed genes have any particular biological processes in common. We will do this using the function goana
from the limma
package. The input to goana
must be Entrez identifiers so we first need to map our gene symbols to Entrez. Bioconductor conveniently provides this mapping so all we need to do is to load the right annotation package and map our identifiers. We also define an object universe
which holds all the genes which were present in our dataset, and which could be mapped to Entrez identifiers - that is simply all our mapped genes except the missing values (na = not available).
library(org.EcK12.eg.db)
symbol2entrez <- mapIds(org.EcK12.eg.db, rownames(lrt), "ENTREZID", keytype="SYMBOL")
'select()' returned 1:many mapping between keys and columns
universe <- na.omit(symbol2entrez)
Then, we get a list of differentially expresssed genes, which we define as having a false discovery rate below 0.05.
fdr <- p.adjust(lrt$table[,"PValue"], "fdr")
allSymbols <- rownames(lrt$table)
deSymbols <- allSymbols[fdr < 0.05 & !is.na(symbol2entrez)]
deEntrez <- symbol2entrez[deSymbols]
We perform the GO over-representation analyis using goana
. We define the species to enable goana
figure out the mapping between Entrez identifiers go GO terms.
goTable <- goana(deEntrez, universe=universe, species="EcK12")
head(goTable[order(goTable$P.DE),])
Term Ont N DE
GO:0006091 generation of precursor metabolites and energy BP 196 140
GO:0015980 energy derivation by oxidation of organic compounds BP 185 133
GO:0045333 cellular respiration BP 168 121
GO:0055114 oxidation-reduction process BP 213 140
GO:0009061 anaerobic respiration BP 140 100
GO:0009060 aerobic respiration BP 65 55
P.DE
GO:0006091 2.367795e-45
GO:0015980 1.912235e-43
GO:0045333 2.009595e-39
GO:0055114 1.290043e-38
GO:0009061 8.139077e-32
GO:0009060 6.101281e-24
Key Points
Use vignettes to get introduced to a new package.
Gene expression analysis edgeR can be quite straight-forward.