A count table represents the number of reads mapped per gene/transcript in an RNA-seq experiment. It is the entry point of many downstream supervised or unsupervised analyses.
There are many different ways to generate a count table, and many tools can be used.
Usually, generating such a table requires two main steps:
The mapping step aims at positioning the sequencing reads on your reference genome. Tools such as HISAT2 [1] or STAR [2] are commonly used. In theory, if well configured, these tools should give close results, although their mapping strategies and computational requirements differ. Of note, recent methods based on pseudo-mapping approaches, such as Salmon [3], Kallisto [4] or RapMap [5], can also be used to quantify gene expression directly from raw RNA-seq data (see Bray et al. 2016 [6]).
Once the data are mapped to the genome, several tools can be used to count and assign reads to a given gene (i.e. its exons).
Among the most popular tools, HTSeq-count [7] and featureCounts [8] are frequently used. Note that for this step, it is crucial to know the details of the protocol used to generate the samples, and especially whether it was stranded or not.
This step also requires gene annotations. Databases such as Ensembl, RefSeq or GENCODE can be used. They all contain the most common coding genes, but each has its own specificities.
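As an illustration, the counting step can also be run directly from R with the Rsubread implementation of featureCounts. This is only a minimal sketch under assumptions: the BAM and GTF file names below are hypothetical, and the strandedness setting must match your protocol.
library(Rsubread)
## Count reads per gene from aligned BAM files (hypothetical file names)
## strandSpecific: 0 = unstranded, 1 = stranded, 2 = reversely stranded
fc <- featureCounts(files=c("sample1.bam", "sample2.bam"),
                    annot.ext="gencode.annotation.gtf",
                    isGTFAnnotationFile=TRUE,
                    GTF.featureType="exon",
                    GTF.attrType="gene_id",
                    strandSpecific=0)
head(fc$counts)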
To wrap up, here is an example of a typical RNA-seq workflow for gene expression profiling.
Please be sure to have a recent version of R (>3.3) with the following packages from CRAN:
From Bioconductor:
In order to install these packages, use the following commands:
## For CRAN packages
install.packages("knitr")
## For BioC packages
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("DESeq2")
Google is your best friend! But a lot of help is already available within R …
## Example of help in R
## Use "?" followed by a function name
?read.csv
R works in a vectorized way. A vector is a one-dimensional object accessible through its indices.
Examples
## 'a' is a vector from 10 to 20
## It could also be noted as
## a <- c(10, 11, 12, 14, 15, 16, 17, 18, 19, 20)
a <- 10:20
a
## [1] 10 11 12 13 14 15 16 17 18 19 20
## Access to a value
a[3]
## [1] 12
## Sum
sum(a)
## [1] 165
## Operations on vectors
a+1
## [1] 11 12 13 14 15 16 17 18 19 20 21
## Which elements of my vector are greater than 5 (returns their indices)
which(a>5)
## [1] 1 2 3 4 5 6 7 8 9 10 11
## Therefore, what are the values at these indices
a[which(a>5)]
## [1] 10 11 12 13 14 15 16 17 18 19 20
A table or matrix is an n x m array of a single data type (character or numeric). It is therefore accessed through two coordinates [x, y], where x represents the rows and y the columns. A matrix is characterized by its values, but also by its colnames and rownames.
Examples
## Let's create a simple matrix in R
## filled column by column with the values 1 to 50
d <- matrix(1:50, ncol=5, nrow=10)
d
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 11 21 31 41
## [2,] 2 12 22 32 42
## [3,] 3 13 23 33 43
## [4,] 4 14 24 34 44
## [5,] 5 15 25 35 45
## [6,] 6 16 26 36 46
## [7,] 7 17 27 37 47
## [8,] 8 18 28 38 48
## [9,] 9 19 29 39 49
## [10,] 10 20 30 40 50
## Access to the first column
d[,1]
## [1] 1 2 3 4 5 6 7 8 9 10
## Access to the first row
d[1,]
## [1] 1 11 21 31 41
## Access the value at row 3, column 4
d[3,4]
## [1] 33
## Sum over each row
rowSums(d)
## [1] 105 110 115 120 125 130 135 140 145 150
## Sum over each column
colSums(d)
## [1] 55 155 255 355 455
## How many values in the matrix are greater than 1
length(which(d>1))
## [1] 49
## The matrix we just created does not have any col/rownames
rownames(d)
## NULL
colnames(d)
## NULL
## Let's add some colnames
colnames(d) <- c("A", "B", "C", "D", "E")
d
## A B C D E
## [1,] 1 11 21 31 41
## [2,] 2 12 22 32 42
## [3,] 3 13 23 33 43
## [4,] 4 14 24 34 44
## [5,] 5 15 25 35 45
## [6,] 6 16 26 36 46
## [7,] 7 17 27 37 47
## [8,] 8 18 28 38 48
## [9,] 9 19 29 39 49
## [10,] 10 20 30 40 50
colnames(d)
## [1] "A" "B" "C" "D" "E"
A dataframe is used to store data tables. It is a list of vectors of equal length. A dataframe is more general than a matrix, in the sense that different columns can have different modes (numeric, character, factor, etc.).
Examples
df <- data.frame(mycol=letters[1:nrow(d)], d)
df
mycol | A | B | C | D | E |
---|---|---|---|---|---|
a | 1 | 11 | 21 | 31 | 41 |
b | 2 | 12 | 22 | 32 | 42 |
c | 3 | 13 | 23 | 33 | 43 |
d | 4 | 14 | 24 | 34 | 44 |
e | 5 | 15 | 25 | 35 | 45 |
f | 6 | 16 | 26 | 36 | 46 |
g | 7 | 17 | 27 | 37 | 47 |
h | 8 | 18 | 28 | 38 | 48 |
i | 9 | 19 | 29 | 39 | 49 |
j | 10 | 20 | 30 | 40 | 50 |
class(df)
## [1] "data.frame"
## The as.matrix (resp. as.data.frame) function converts an object into a matrix (resp. dataframe)
## But note that in this case, all values will be converted into characters
as.matrix(df)
## mycol A B C D E
## [1,] "a" " 1" "11" "21" "31" "41"
## [2,] "b" " 2" "12" "22" "32" "42"
## [3,] "c" " 3" "13" "23" "33" "43"
## [4,] "d" " 4" "14" "24" "34" "44"
## [5,] "e" " 5" "15" "25" "35" "45"
## [6,] "f" " 6" "16" "26" "36" "46"
## [7,] "g" " 7" "17" "27" "37" "47"
## [8,] "h" " 8" "18" "28" "38" "48"
## [9,] "i" " 9" "19" "29" "39" "49"
## [10,] "j" "10" "20" "30" "40" "50"
## Removing the first column leaves a single data type, so the values are back to numeric
as.matrix(df[,-1])
## A B C D E
## [1,] 1 11 21 31 41
## [2,] 2 12 22 32 42
## [3,] 3 13 23 33 43
## [4,] 4 14 24 34 44
## [5,] 5 15 25 35 45
## [6,] 6 16 26 36 46
## [7,] 7 17 27 37 47
## [8,] 8 18 28 38 48
## [9,] 9 19 29 39 49
## [10,] 10 20 30 40 50
The class and typeof functions give information about the nature of an object and its data type.
Examples
class(d)
## [1] "matrix"
is.matrix(d)
## [1] TRUE
is.data.frame(d)
## [1] FALSE
As in any programming language, R allows the user to write functions. A function is defined by its name, its arguments, its body, and a return value if needed. Using functions is strongly advised, to factorize code as much as possible and to limit coding errors.
Examples
myfunc <- function(a, b=1){
## do something
res <- a^2 + b
## return the results
return(res)
}
myfunc(10)
## [1] 101
myfunc(a=10, b=5)
## [1] 105
The for loop system is common to many languages. It allows a variable to be updated in an iterative way.
In the following example, the variable i will successively take the values 1, 2, 3, 4, ..., 10.
Examples
for (i in c(1:10)){
myfunc(i)
}
However, this looping system is usually not recommended in R, as it is quite slow … Instead, R offers the apply() family of functions.
The apply() family is a set of functions to manipulate and loop over data structures such as matrices, dataframes, lists, etc. In practice, the family is made up of apply(), lapply(), sapply(), vapply(), mapply(), rapply() and tapply().
For now, we will just focus on the apply() function, which can be called on matrices and dataframes: apply(X, MARGIN, FUN) applies the function FUN over the rows (MARGIN=1) or the columns (MARGIN=2) of X.
Examples
## Loop over all columns of 'd' and run the 'sum' function
apply(d, 2, sum)
## A B C D E
## 55 155 255 355 455
## Loop over all rows of 'd' and run the 'sum' function
apply(d, 1, sum)
## [1] 105 110 115 120 125 130 135 140 145 150
One of the most common operations in R is matching two vectors.
The match function returns a vector of the positions of (first) matches of its first argument in its second.
%in% is built as a binary operator, and indicates whether there is a match or not between the two arguments.
Examples
a <- c(-100, -50, -64, 20, 65, 126)
## Get indices of values > 0
which(a>0)
## [1] 4 5 6
## Get values > 0
a[which(a>0)]
## [1] 20 65 126
a <- c("A","B","C","D","E","F")
b <- c("B", "A", "E")
## match 'b' values into the 'a' vector - return the position
match(b, a)
## [1] 2 1 5
## %in% is an alternative when you only need to know whether a match exists
a %in% b
## [1] TRUE TRUE FALSE FALSE TRUE FALSE
It is usually recommended to follow good programming practices when writing your own code.
Among them, pay attention to adding sessionInfo() at the end of each script.
If you want to go further, have a look at the renv package:
library(renv)
renv::init() ## Init a new project locally
## Install your packages
renv::snapshot() ## Save the state of the project library in a lockfile
renv::restore() ## Restore your local environment defined in the lockfile
As a toy dataset, we will use the data published by Horvath et al. [9], available on GEO (GSE52194).
This dataset is composed of 20 breast samples: 6 TNBC, 6 non-TNBC, 5 HER2-positive and 3 normal breast samples (NBS).
All these samples were processed using the pipeline presented above to generate the count table. Data are available on the cluster in the folder /shared/projects/dubii2020/data/rnaseq/countsdata.
Usually, the easiest way is to start the analysis from a raw count table. If you do not have access to such a table but have the STAR gene counts outputs, the following function can be used to build the raw count matrix.
starload <- function(input_path, col.nb){
    message("loading STAR gene counts ...")
    exprs.in <- list.files(path=input_path,
                           pattern="ReadsPerGene.out.tab",
                           full.names=TRUE, recursive=TRUE)
    counts.exprs <- lapply(exprs.in, read.csv, sep="\t", header=FALSE,
                           row.names=1, check.names=FALSE)
    counts.exprs <- data.frame(lapply(counts.exprs, "[", col.nb))
    colnames(counts.exprs) <- basename(exprs.in)
    ## remove the first 4 lines (N_unmapped, N_multimapping, N_noFeature, N_ambiguous)
    counts.exprs <- counts.exprs[5:nrow(counts.exprs), , drop=FALSE]
    counts.exprs
}
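As a usage sketch (the path below is hypothetical): in STAR's ReadsPerGene.out.tab files, once the gene IDs are taken as rownames, the remaining columns correspond to the unstranded and the two stranded counting modes; choose col.nb according to your library preparation.
## Build the raw count matrix from all STAR outputs found under this folder
## col.nb=1: unstranded counts; 2 or 3: the two stranded protocols
d <- starload("/path/to/star/outputs", col.nb=1)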
R can easily load .csv files. The CSV (Comma Separated Values) format is fully compatible with Excel. It is a plain text format where columns are separated by a comma.
## Path to the count data (cluster path, overridden here by the local path)
proj <- "/shared/projects/dubii2021/trainers/module5/seance5/data/counts/"
proj <- "../data/counts"
## Load a raw count table from a csv file
d <- read.csv(file.path(proj, "tablecounts_raw.csv"), row.names = 1)
d <- as.matrix(d)
## Load TPM normalized count table from a csv file
d.tpm <- read.csv(file.path(proj, "tablecounts_tpm.csv"), row.names = 1)
d.tpm <- as.matrix(d.tpm)
## Loading sample plan
splan <- read.csv(file.path(proj, "SAMPLE_PLAN"), row.names=1, header=FALSE)
colnames(splan) <- c("sname","subtype")
splan
 | sname | subtype |
---|---|---|
SRR1027171 | TNBC1 | TNBC |
SRR1027172 | TNBC2 | TNBC |
SRR1027173 | TNBC3 | TNBC |
SRR1027174 | TNBC4 | TNBC |
SRR1027175 | TNBC5 | TNBC |
SRR1027176 | TNBC6 | TNBC |
SRR1027177 | NonTNBC1 | NonTNBC |
SRR1027178 | NonTNBC2 | NonTNBC |
SRR1027179 | NonTNBC3 | NonTNBC |
SRR1027180 | NonTNBC4 | NonTNBC |
SRR1027181 | NonTNBC5 | NonTNBC |
SRR1027182 | NonTNBC6 | NonTNBC |
SRR1027183 | HER2_1 | HER2 |
SRR1027184 | HER2_2 | HER2 |
SRR1027185 | HER2_3 | HER2 |
SRR1027186 | HER2_4 | HER2 |
SRR1027187 | HER2_5 | HER2 |
SRR1027188 | NBS1 | Control |
SRR1027189 | NBS2 | Control |
SRR1027190 | NBS3 | Control |
## Update colnames of my count tables
colnames(d) <- as.character(splan[colnames(d),"sname"])
colnames(d.tpm) <- as.character(splan[colnames(d.tpm),"sname"])
d[1:5,1:5]
## TNBC1 TNBC2 TNBC3 TNBC4 TNBC5
## ENSG00000223972.4|DDX11L1 0 0 0 0 1
## ENSG00000227232.4|WASH7P 16 5 16 26 26
## ENSG00000243485.2|MIR1302-11 1 0 0 0 0
## ENSG00000237613.2|FAM138A 0 0 0 0 0
## ENSG00000268020.2|OR4G4P 0 0 1 0 0
R is extremely powerful and a perfect tool to explore your data. Here are a few examples of questions you may want to address:
## How many samples/genes ?
dim(d)
## [1] 57820 20
## How many reads do I have per sample
colSums(d)
## TNBC1 TNBC2 TNBC3 TNBC4 TNBC5 TNBC6 NonTNBC1 NonTNBC2
## 10536204 3014694 7963296 9233649 8820837 9245006 21299507 11956159
## NonTNBC3 NonTNBC4 NonTNBC5 NonTNBC6 HER2_1 HER2_2 HER2_3 HER2_4
## 12774570 12045260 13350516 9897833 11164046 15483794 11256553 12441938
## HER2_5 NBS1 NBS2 NBS3
## 5746612 23629818 24802379 29812185
## How many genes have zero counts in all samples
rs <- rowSums(d)
nbgenes_at_zeros <- length(which(rs==0))
nbgenes_at_zeros
## [1] 3689
## For each sample, how many genes have more than one count ?
number_expressed <- function(x, mincounts=1){
nb <- length(which(x>mincounts))
return(nb)
}
nbgenes_per_sample <- apply(d, 2, number_expressed)
nbgenes_per_sample
## TNBC1 TNBC2 TNBC3 TNBC4 TNBC5 TNBC6 NonTNBC1 NonTNBC2
## 42316 19338 43913 29067 31222 32807 33237 34484
## NonTNBC3 NonTNBC4 NonTNBC5 NonTNBC6 HER2_1 HER2_2 HER2_3 HER2_4
## 35270 35839 35819 29013 28711 28852 27723 26966
## HER2_5 NBS1 NBS2 NBS3
## 22389 25490 24736 24107
## Distribution of raw counts (log2)
boxplot(log2(1+d),
las=2, ylab="raw counts (log2)", col="gray50", pch=16)
## If you do not know the exact gene ID, you can use grep
## grep("ERBB2$", rownames(d.tpm))
erbb2_pos <- which(rownames(d.tpm)=="ENSG00000141736.9|ERBB2")
## Looking at the expression level of my favorite gene
barplot(d.tpm[erbb2_pos, ], ylab="TPM", las=2, col="gray50", border="white")
getExpressedNumber <- function(x){
n <- length(which(x>1))
return(n)
}
nb_expressed_genes <- apply(d.tpm, 2, getExpressedNumber)
nb_expressed_genes
## TNBC1 TNBC2 TNBC3 TNBC4 TNBC5 TNBC6 NonTNBC1 NonTNBC2
## 35439 18249 42477 20245 22206 24195 16873 21891
## NonTNBC3 NonTNBC4 NonTNBC5 NonTNBC6 HER2_1 HER2_2 HER2_3 HER2_4
## 22015 23389 21604 19178 19896 20099 19404 16174
## HER2_5 NBS1 NBS2 NBS3
## 17528 18963 19101 18067
tnbc_mean <- rowMeans(d.tpm[,c("TNBC1","TNBC2", "TNBC3", "TNBC4", "TNBC5", "TNBC6")])
nontnbc_mean <- rowMeans(d.tpm[,c("NonTNBC1","NonTNBC2", "NonTNBC3", "NonTNBC4", "NonTNBC5", "NonTNBC6")])
## Calculate logFC
fc <- log2(1 + tnbc_mean) - log2(1 + nontnbc_mean)
## Order by Fold Changes
fc.ord <- fc[order(fc, decreasing=TRUE)]
## Top 10 genes with higher fold change
head(fc.ord, 10)
## ENSG00000232216.1|IGHV3-43 ENSG00000181617.5|FDCSP
## 9.221682 7.781491
## ENSG00000240382.1|IGKV1-17 ENSG00000211967.2|IGHV3-53
## 7.420509 7.301344
## ENSG00000211955.2|IGHV3-33 ENSG00000171209.3|CSN3
## 7.058050 6.772671
## ENSG00000211663.2|IGLV3-19 ENSG00000211972.2|IGHV3-66
## 6.210281 6.118904
## ENSG00000211666.2|IGLV2-14 ENSG00000167754.8|KLK5
## 6.036707 6.002069
## Scatter plot
plot(x=log2(nontnbc_mean + 1), y=log2(tnbc_mean + 1), col=ifelse(abs(fc)>3, "red", "gray"),
pch=16, cex=.7,
xlab="mean of NonTNBC (log2)", ylab="mean of TNBC (log2)")
od <- order(tnbc_mean, decreasing=TRUE)
names(tnbc_mean[od[1:10]])
## [1] "ENSG00000210082.2|MT-RNR2" "ENSG00000265150.1|RN7SL2"
## [3] "ENSG00000258486.2|RN7SL1" "ENSG00000202198.1|RN7SK"
## [5] "ENSG00000259001.2|RPPH1" "ENSG00000269900.2|RMRP"
## [7] "ENSG00000239776.2|AC079949.1" "ENSG00000211459.2|MT-RNR1"
## [9] "ENSG00000241781.2|AL161626.1" "ENSG00000211896.2|IGHG1"
Exploratory analysis assesses the overall similarity between samples without any prior knowledge: which samples are similar to each other? Does this fit the experimental design? Are there any outlier samples?
TPM and RPKM are units that can be used to look at transcript abundance. However, they do not perform robust cross-sample normalization. Methods implemented in the limma, DESeq2 or edgeR packages normalize for sequencing depth by calculating scaling factors with more sophisticated approaches.
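As a reminder of what the TPM unit does, here is a minimal sketch of its computation from the raw count matrix d. The gene.length vector (in bp) is a hypothetical input that would, in practice, be derived from the annotation used for counting.
## TPM: normalize counts by gene length, then scale each sample to one million
rate <- d / gene.length
d.tpm.manual <- t(t(rate) * 1e6 / colSums(rate))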
DESeq2 defines a virtual reference sample by taking the median of each gene's values across samples, and then computes size factors as the median of the ratios of each sample to this reference. For details about how DESeq2 calculates its size factors, see the DESeq2 documentation.
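To demystify this, here is a small by-hand sketch of the median-of-ratios computation on the raw count matrix d; DESeq2's estimateSizeFactors() used below does this for you.
## The log geometric mean per gene defines the virtual reference sample
loggeomeans <- rowMeans(log(d))
## Size factor = median ratio of each sample to the reference (expressed genes only)
sizefac <- apply(d, 2, function(cnts)
    exp(median((log(cnts) - loggeomeans)[is.finite(loggeomeans) & cnts > 0])))
sizefac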
To avoid the distance measure between samples being dominated by a few highly variable genes, and to have a roughly equal contribution from all genes, it is recommended to use a normalization approach that stabilizes the variance across expression levels, leading to near-homoskedastic data (i.e. the variance of the gene expression does not depend on the mean).
In RNA-seq data, the variance grows with the mean. For example, if we run a PCA directly on a matrix of counts, the result will mainly depend on the few most strongly expressed genes, because they show the largest absolute differences between samples. A simple strategy to avoid this is to take the log of the counts. However, the genes with low counts then tend to dominate the results, because they show the strongest relative differences between samples.
Therefore, transformations that stabilize the variance over the mean are advised.
Here, we will use the rlog method from the DESeq2 package.
library(DESeq2)
## Load data
dds <- DESeqDataSetFromMatrix(countData=d, DataFrame(condition=splan$subtype), ~ condition)
## Estimate size factors
dds <- estimateSizeFactors(dds)
## Remove lines with only zeros
dds <- dds[ rowSums(counts(dds)) > 0, ]
## Run the rlog normalization
rld <- rlog(dds, blind=TRUE)
par(mfrow=c(1,3))
plot(counts(dds, normalized=TRUE)[,3:4],
pch=16, cex=0.3, xlim=c(0,20e3), ylim=c(0,20e3), main="normalized counts")
plot(log2(counts(dds, normalized=TRUE)[,3:4] + 1),
pch=16, cex=0.3, main="log2 normalized counts")
plot(assay(rld)[,3:4],
pch=16, cex=0.3, main="rlog normalized counts")
A first way to visualize distances between samples is to project the samples onto a two-dimensional plane such that they spread out optimally. This can be achieved using a principal component analysis (PCA). Several PCA packages are available in R. Here, we simply use the function provided by DESeq2.
## short and easy version
DESeq2::plotPCA(rld, intgroup = c("condition"), ntop=1000)
Other (nicer) visualization functions are available with FactoMineR …
library(FactoMineR) ## for PCA
library(factoextra) ## for visualisation functions
## extract rlog matrix
d.rlog <- assay(rld)
## Select most variable genes
gvar <- apply(d.rlog, 1, var)
mostvargenes <- order(gvar, decreasing=TRUE)[1:1000]
## Run PCA
res_pca <- PCA(t(d.rlog[mostvargenes,]), ncp=3, graph=FALSE)
## Make beautiful plots
fviz_eig(res_pca, addlabels = TRUE, ylim = c(0, 50))
fviz_pca_ind(res_pca, label="none", habillage=splan$subtype)
fviz_pca_ind(res_pca, axes=c(2,3), label="ind", habillage=splan$subtype, repel = TRUE)
## Look at the variables that contribute the most to PC2
best.contrib <- names(sort(res_pca$var$contrib[,"Dim.2"], decreasing=TRUE))
## Let's check the best contributors
barplot(d.tpm[best.contrib[5],], col="gray50",
border="white", las=2, main=best.contrib[5])
barplot(d.tpm[best.contrib[7],], col="gray50",
border="white", las=2, main=best.contrib[7])
In addition to PCA, hierarchical clustering is another way to represent the similarity between samples using different metrics. Hierarchical clustering requires a distance between samples and a method to agglomerate clusters.
require(pheatmap)
require(RColorBrewer)
## matrix of rlog normalized data
d.rlog <- assay(rld)
## Clustering of samples on all genes
sampleDist <- dist(t(d.rlog))
hc <- hclust(sampleDist, method="ward.D2")
plot(hc)
## Clustering of samples on the most variable 100 genes
mostvargenes100 <- order(gvar, decreasing=TRUE)[1:100]
annot <- data.frame(splan$subtype, row.names=splan$sname)
pheatmap(d.rlog[mostvargenes100,],
clustering_distance_rows = "correlation",
clustering_distance_cols = "euclidean",
clustering_method = "ward.D2",
annotation_col = annot,
show_rownames = FALSE,
scale = "row")
The first steps of the analysis presented above are dedicated to count data manipulation and exploration. The main goal is to validate the overall quality of the dataset, and to see how similar the different samples are. If outliers are detected, they will most likely increase the overall variability and thus decrease the statistical power later, when testing for differential expression. It is therefore advised to discard such samples. Importantly, these analyses rely on normalization methods that stabilize the variance over the mean expression levels, and are therefore well suited for exploratory analysis.
The next step of the analysis is usually to perform a differential expression analysis between groups of samples. In this context, different statistical methods can be used, mainly based on the negative binomial model. In RNA-seq count data, the variance is usually greater than the mean, requiring the estimation of an overdispersion parameter for each gene. Packages such as DESeq2 or edgeR propose methods to normalize and detect differentially expressed genes based on these assumptions.
In addition, the limma package proposes an alternative approach based on a Gaussian model, through the voom transformation. This computes additional precision weights to account for the change in the mean-variance relationship between small and large counts [10].
Gene filtering is usually required before differential analysis. The idea is simply to restrict the analysis to expressed genes, and to remove any non-relevant information.
Here, we define the set of expressed genes as those with a TPM (transcripts per million) normalized count >= 1 in at least one sample.
## Expressed genes
nbexpr <- apply(d.tpm, 1, function(x){length(which(x>=1))})
isexpr <- which(nbexpr>=1)
d.f <- d[isexpr,]
hist(log2(1+rowSums(d.tpm)), breaks=100, main="Gene Expression across all samples (TPM)", col="gray50", border="white")
hist(log2(1+rowSums(d.tpm[isexpr,])), breaks=100, add=TRUE, col="red", border="white")
In the following example, we will use the trimmed mean of M-values (TMM) method from the edgeR package [11], which calculates a normalization factor applied to the library size of each sample. Together with the DESeq method, TMM has been found to perform well in comparative studies.
It is important to keep in mind that such normalization is based on the assumption that most genes are invariant.
Of note, the DESeq and edgeR methods do not rely on the same type of normalization factor. Each normalization method should therefore be used with its appropriate statistical modeling; the methods are not interchangeable.
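As a quick sketch of this difference, reusing objects defined above (dds from the rlog section, d.f from the filtering step): DESeq2 size factors divide the counts, whereas edgeR TMM factors multiply the library sizes and are centered around 1.
## DESeq2 size factors vs edgeR TMM normalization factors
require(edgeR)
sf.deseq <- sizeFactors(dds)
nf.tmm <- calcNormFactors(DGEList(counts=d.f), method="TMM")$samples$norm.factors
head(cbind(sf.deseq, nf.tmm))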
## Scaling factor normalization based on the TMM method
require(edgeR)
require(limma)
y <- DGEList(counts=d.f)
y <- calcNormFactors(y, method="TMM")
d.norm <- cpm(y, log=TRUE)
Once our data are normalized, we can now test for differential expression between groups.
Here again, several approaches such as DESeq, edgeR or limma can be used. Broadly speaking, the three methods should give close results, although in our experience limma is a bit more conservative.
When the library sizes are quite variable between samples, limma proposes an additional transformation (the voom approach) applied to the normalized and filtered DGEList object.
y <- DGEList(counts=d.f)
y <- calcNormFactors(y, method="TMM")
## Voom transformation of normalized data to apply the limma statistical framework
design <- model.matrix(~ 0 + subtype, data=splan)
v <- voom(y, design, plot=FALSE)
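Setting plot=TRUE in the call above displays the mean-variance trend estimated by voom, a useful sanity check of the transformation:
## Same call as above, but showing the fitted mean-variance trend
v <- voom(y, design, plot=TRUE)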
After this, the usual limma pipeline can be applied to compare groups of samples (here, the TNBC vs NonTNBC samples).
Each gene is tested and has its own p-value. It is therefore important to correct for multiple testing, in order to control the False Discovery Rate (FDR).
## Differential analysis based on 'limma'
fit <- lmFit(v, design)
## Compare TNBC vs NonTNBC
contrast <- makeContrasts(subtypeTNBC - subtypeNonTNBC, levels=design)
## Run test
fit2 <- contrasts.fit(fit, contrast)
fit2 <- eBayes(fit2)
## Extract the results
res <- topTable(fit2, number=1e6, adjust.method="BH")
## Pvalue distribution
hist(res$P.Value, main="Pvalue histogram", col="grey50", border="white")
## Extract list of DEG
idx.sign <- which(res$adj.P.Val < 0.05 &
abs(res$logFC) > 1)
deg <- rownames(res[idx.sign,])
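As a sanity check, note that the adj.P.Val column returned by topTable() with adjust.method="BH" is simply the Benjamini-Hochberg adjustment of the raw p-values, which can be reproduced with p.adjust():
## BH adjustment by hand; should match res$adj.P.Val
all.equal(p.adjust(res$P.Value, method="BH"), res$adj.P.Val)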
After differential analysis, it is highly recommended to make simple plots to double-check the results, such as a barplot of a top gene, a heatmap of the top differentially expressed genes, an MA plot and a volcano plot:
mygene <- rownames(res)[1]
barplot(d.norm[mygene,], las=2, main=mygene, ylab="Normalized counts",
col="grey50", border="white")
idx.sub <- which(splan$subtype=="TNBC" | splan$subtype=="NonTNBC")
data.sub <- d.norm[deg[1:50],idx.sub]
pheatmap(data.sub,
cutree_cols = 2,
show_rownames=FALSE)
plot(res$AveExpr, res$logFC, xlab="A - Mean Expression", ylab="M - logFC",
col=ifelse(res$adj.P.Val<0.05, "red", "black"), pch=16, cex=.5)
volcanoplot(fit2, highlight=100)
Here is a short example of a Gene Ontology analysis in R with the clusterProfiler package. Many other packages are available for functional analysis. You can also use online tools such as PANTHER or DAVID.
library(org.Hs.eg.db)
library(clusterProfiler)
## convert to Entrez Id
symbol.sign <- sapply(strsplit(deg, "\\|"), "[", 2)
entrez.sign <- bitr(symbol.sign, fromType = "SYMBOL",
toType = c("ENTREZID"), OrgDb = org.Hs.eg.db)
symbol.all <- sapply(strsplit(rownames(res), "\\|"), "[", 2)
universe <- bitr(symbol.all, fromType = "SYMBOL",
toType = c("ENTREZID"), OrgDb = org.Hs.eg.db)
## Enrich GO
ego.BP <- enrichGO(gene=entrez.sign$ENTREZID, universe=universe$ENTREZID, OrgDb=org.Hs.eg.db, ont="BP", pAdjustMethod="BH", pvalueCutoff=1, qvalueCutoff=0.25, readable=TRUE)
dotplot(dropGO(ego.BP, level=c(1:3)), showCategory=20)
The previous volcano plot is not very nice, and writing your own function can be useful to make beautiful plots! As an exercise, you can write your own function, based on the ggplot2 package, to make a much better volcano plot.
Solution
LimmaVolcano <- function(res, main="", fct=1.5, pt=0.05){
stopifnot(require(ggplot2))
res$sign <- 0
res$sign[which(res$adj.P.Val < pt & abs(res$logFC) > fct)] <- 1
p <- ggplot(data=res, aes(x=logFC, y=-log10(adj.P.Val), colour=as.factor(sign))) +
theme_classic() + geom_point(alpha=0.4, size=2) +
scale_color_manual(name="", values=c("1"="red", "0"="black")) +
ggtitle(paste0("Volcano Plot - Limma ", main)) + theme(legend.position = "none") +
xlab("log2 fold change") + ylab("-log10 adj pvalue") +
geom_vline(xintercept=c(-fct, fct), linetype=2) +
geom_hline(yintercept=-log10(pt), linetype=2)
p
}
LimmaVolcano(res, fct=1, pt=0.05, main=" - TNBC vs NonTNBC")
There are many ways to deal with annotations in R. Several packages include databases that can be queried to convert one gene ID into another (see BioMart). You can look at this online tutorial for examples and details.
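For instance, here is a hedged sketch with the biomaRt package. It assumes the Ensembl servers are reachable; with older biomaRt versions, useMart("ensembl", ...) may be needed instead of useEnsembl().
library(biomaRt)
## Connect to the Ensembl human gene annotation
mart <- useEnsembl(biomart="genes", dataset="hsapiens_gene_ensembl")
## Convert a couple of (unversioned) ENSEMBL gene IDs into gene symbols
getBM(attributes=c("ensembl_gene_id", "hgnc_symbol"),
      filters="ensembl_gene_id",
      values=c("ENSG00000141736", "ENSG00000141510"),
      mart=mart)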
Here, our dataset is based on ENSEMBL annotation. Our goal is thus to convert the ENSEMBL IDs into SYMBOLs, which are already available from the annotation file (gtf) we used during the data processing.
Solution
## ensembl2symbol
## x: matrix with ENSEMBL IDs as rownames
## gtf.in: path to the gtf file
## return: x with SYMBOL annotation as rownames
ensembl2symbol <- function(x, gtf.in){
    stopifnot(require(rtracklayer))
    message("Loading gtf file ...")
    dgtf <- rtracklayer::import(gtf.in)
    message("Subset only genes ...")
    my_genes <- dgtf[dgtf$type=="gene"]
    mcols(my_genes) <- mcols(my_genes)[c("gene_id", "gene_type", "gene_name")]
    message("Convert ENSEMBL to SYMBOL")
    m <- match(rownames(x), my_genes$gene_id)
    idx <- which(!is.na(m))
    message("Losing ", length(which(is.na(m))), " genes ...")
    ## keep only the rows of x that matched, and rename them with the gene symbol
    x <- x[idx,]
    rownames(x) <- my_genes$gene_name[m[idx]]
    return(x)
}
mygtf <- file.path(proj, "gencode.v19.annotation.gtf.gz")
d.ensembl <- read.csv(file.path(proj, "tablecounts_raw_ensembl.csv"),
row.names = 1)
d.ensembl <- as.matrix(d.ensembl)
d.annot <- ensembl2symbol(d.ensembl, mygtf)
sessionInfo()
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
##
## locale:
## [1] LC_CTYPE=fr_FR.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=fr_FR.UTF-8 LC_COLLATE=fr_FR.UTF-8
## [5] LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=fr_FR.UTF-8
## [7] LC_PAPER=fr_FR.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] rtracklayer_1.43.4 clusterProfiler_3.12.0
## [3] org.Hs.eg.db_3.8.2 AnnotationDbi_1.45.1
## [5] edgeR_3.25.9 limma_3.39.19
## [7] RColorBrewer_1.1-2 pheatmap_1.0.12
## [9] factoextra_1.0.6 ggplot2_3.1.1
## [11] FactoMineR_1.42 DESeq2_1.23.10
## [13] SummarizedExperiment_1.13.0 DelayedArray_0.8.0
## [15] BiocParallel_1.17.19 matrixStats_0.58.0
## [17] Biobase_2.43.1 GenomicRanges_1.35.1
## [19] GenomeInfoDb_1.19.3 IRanges_2.17.5
## [21] S4Vectors_0.21.24 BiocGenerics_0.29.2
##
## loaded via a namespace (and not attached):
## [1] fgsea_1.10.0 colorspace_1.4-1
## [3] ggridges_0.5.1 qvalue_2.16.0
## [5] htmlTable_1.13.1 XVector_0.23.2
## [7] base64enc_0.1-3 rstudioapi_0.10
## [9] ggpubr_0.2 farver_1.1.0
## [11] urltools_1.7.3 ggrepel_0.8.0
## [13] bit64_0.9-7 xml2_1.2.0
## [15] splines_3.6.3 leaps_3.0
## [17] GOSemSim_2.10.0 geneplotter_1.60.0
## [19] knitr_1.22 polyclip_1.10-0
## [21] jsonlite_1.6 Formula_1.2-3
## [23] Rsamtools_1.99.6 annotate_1.60.1
## [25] cluster_2.1.0 GO.db_3.8.2
## [27] ggforce_0.2.2 httr_1.4.0
## [29] compiler_3.6.3 rvcheck_0.1.3
## [31] backports_1.1.4 assertthat_0.2.1
## [33] Matrix_1.3-2 lazyeval_0.2.2
## [35] tweenr_1.0.1 prettyunits_1.0.2
## [37] acepack_1.4.1 htmltools_0.3.6
## [39] tools_3.6.3 igraph_1.2.4.1
## [41] gtable_0.3.0 glue_1.3.1
## [43] GenomeInfoDbData_1.2.1 reshape2_1.4.3
## [45] DO.db_2.9 dplyr_0.8.0.1
## [47] fastmatch_1.1-0 Rcpp_1.0.6
## [49] enrichplot_1.4.0 Biostrings_2.51.5
## [51] ggraph_1.0.2 xfun_0.6
## [53] stringr_1.4.0 XML_3.98-1.19
## [55] DOSE_3.10.2 europepmc_0.3
## [57] zlibbioc_1.29.0 MASS_7.3-53.1
## [59] scales_1.0.0 hms_0.4.2
## [61] yaml_2.2.0 memoise_1.1.0
## [63] gridExtra_2.3 UpSetR_1.4.0
## [65] triebeard_0.3.0 rpart_4.1-15
## [67] latticeExtra_0.6-28 stringi_1.4.3
## [69] RSQLite_2.1.1 highr_0.8
## [71] genefilter_1.65.0 checkmate_1.9.1
## [73] rlang_0.4.10 pkgconfig_2.0.2
## [75] bitops_1.0-6 evaluate_0.13
## [77] lattice_0.20-41 purrr_0.3.2
## [79] GenomicAlignments_1.19.1 htmlwidgets_1.5.3
## [81] labeling_0.3 cowplot_0.9.4
## [83] bit_4.0.4 tidyselect_0.2.5
## [85] plyr_1.8.4 magrittr_1.5
## [87] R6_2.4.0 Hmisc_4.2-0
## [89] DBI_1.0.0 pillar_1.3.1
## [91] foreign_0.8-76 withr_2.1.2
## [93] survival_3.2-7 scatterplot3d_0.3-41
## [95] RCurl_1.95-4.12 nnet_7.3-15
## [97] tibble_2.1.1 crayon_1.3.4
## [99] rmarkdown_1.12 progress_1.2.0
## [101] viridis_0.5.1 locfit_1.5-9.1
## [103] grid_3.6.3 data.table_1.12.2
## [105] blob_1.1.1 digest_0.6.18
## [107] flashClust_1.01-2 xtable_1.8-4
## [109] tidyr_0.8.3 gridGraphics_0.4-1
## [111] munsell_0.5.0 ggplotify_0.0.3
## [113] viridisLite_0.3.0
This course material has been prepared with the support of the Cancéropôle Île-de-France.
[1] Kim D., Langmead B. and Salzberg S.L. (2015) HISAT: a fast spliced aligner with low memory requirements. Nature Methods.
[2] Dobin A., Davis C.A., Schlesinger F. et al. (2013) STAR: ultrafast universal RNA-seq aligner. Bioinformatics, 29(1):15–21.
[3] Patro R., Duggal G., Love M.I. et al. (2017) Salmon provides fast and bias-aware quantification of transcript expression. Nature Methods.
[4] Bray N.L., Pimentel H., Melsted P. et al. (2016) Near-optimal probabilistic RNA-seq quantification. Nature Biotechnology, 34:525–527.
[5] Srivastava A., Sarkar H., Gupta N. et al. (2016) RapMap: a rapid, sensitive and accurate tool for mapping RNA-seq reads to transcriptomes. Bioinformatics, 32(12).
[6] Bray N.L. et al. (2016) Near-optimal probabilistic RNA-seq quantification. Nature Biotechnology, 34(5):525–527.
[7] Anders S., Pyl T.P. and Huber W. (2015) HTSeq - a Python framework to work with high-throughput sequencing data. Bioinformatics, 31(2):166-169.
[8] Liao Y., Smyth G.K. and Shi W. (2014) featureCounts: an efficient general-purpose program for assigning sequence reads to genomic features. Bioinformatics, 30(7):923-930.
[9] Horvath A. et al. (2013) Novel insights into breast cancer genetic variance through RNA sequencing. Scientific Reports.
[10] Law C.W., Chen Y., Shi W. and Smyth G.K. (2014) voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biology, 15(2):R29.
[11] Robinson M.D. and Oshlack A. (2010) A scaling normalization method for differential analysis of RNA-seq data. Genome Biology, 11(3):R25.