During this practical session, we will cover the following items.
The GTF (General Transfer Format) file format is extensively used to provide easily readable genomics annotations while being very handy with a computer.
Text files,
The GTF format is described on the following websites:
For this session we will use the E. Coli GTF annotation file available here.
Exercise: create a working directory named workDir
in your home folder and go inside it.
# Define the working directory
workDir <- "~/practicals/GTF_exploration_results"
# Create the working directory
dir.create(workDir, recursive = TRUE, showWarnings = FALSE)
# Go to the working directory
setwd(workDir)
getwd() ## Check your current location
list.files() ## List files (should be empty if just created)
Exercise: download the GTF file in the working directory (optionally, adapt the command to load a GTF of your interest). Before downloading the file we check if it is already present in the working directory. If yes, we skip the download.
Tip: use the commands file.exists()
, download.file()
.
For the sake of readability, we can define an URL by concatenating its different components, separated by a slash character /
. For this, we use the paste()
function.
## Define the URL of the file to download
ensemblbacteria.site <-
"ftp://ftp.ensemblgenomes.org/pub/bacteria"
## Define the URL of the GTF file
gtf.url <- paste(
sep = "/", # separator
ensemblbacteria.site,
"release-37",
"gtf",
"bacteria_0_collection",
"escherichia_coli_str_k_12_substr_mg1655",
"Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.37.gtf.gz")
print(gtf.url)
[1] "ftp://ftp.ensemblgenomes.org/pub/bacteria/release-37/gtf/bacteria_0_collection/escherichia_coli_str_k_12_substr_mg1655/Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.37.gtf.gz"
# create a directory to store the file
dir.create("data", showWarnings = FALSE)
# Create a local filename
destfile <- paste0("data/", basename(gtf.url))
print(destfile)
[1] "data/Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.37.gtf.gz"
After a file transfer, it is safe to check the content of the taregt folder with the list.files()
command.
[1] "data"
[2] "exprs8.txt"
[3] "figures"
[4] "gtf_exploration_files"
[5] "gtf_exploration.html"
[6] "gtf_exploration.pdf"
[7] "gtf_exploration.Rmd"
[8] "images"
[9] "module-3-Stat-R_presentation.html"
[10] "module-3-Stat-R_presentation.Rmd"
[11] "R_intro.html"
[12] "R_intro.pdf"
[13] "R_intro.Rmd"
[14] "README.md"
[15] "TP_bacterial_regulation.html"
[16] "TP_bacterial_regulation.pdf"
[17] "TP_bacterial_regulation.Rmd"
[1] "annotation.csv"
[2] "Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.37.gtf.gz"
[3] "expression.txt"
Commands: read.table
, read.delim
, read.csv
.
R includes several types of tabular structures (matrix, data.frame, table). The most widely used is data.frame()
, which consists in a table of values with a type (strings, integer, …) attached to each column, and names associated to rows and columns.
The function read.table()
enables to read a text file containing tabular data, and to store its content in a variable.
Several functions derived from read.table()
facilitate the loading of different formats.
read.delim()
for files where a particular charcater is used as column separator (by default the tab character ").
read.csv()
for “comma-searated values”.
Load the GTF file in a variable named featureTable
.
Tip: command read.delim
.
## Load GTF file in a data.frame
featureTable <- read.delim(destfile, comment.char = "#", sep = "\t",
header = FALSE, row.names = NULL)
## The GTF format has no header, but we can define it based on the specification
names(featureTable) <- c("seqname", "source", "feature", "start", "end",
"score", "strand", "frame", "attribute")
Immediately after having loaded a data table, check its dimensions.
[1] 25979 9
[1] 25979
[1] 9
Displaying the full annotation table would not be very convenient, since it contains tens of thousands of rows.
We can display the first rows of the file with the function head()
, and the last rows with tail()
.
If you are using the RStudio environment, you can display the table in a dynamic viewer pane with the function View()
.
The View()
function is interactive, so it should not be used in a script because it would perturbate its execution.
The last column of GTF files is particularly heavy, it contains a lof of semi-structured information.
We can select the 8 first columns and display the 5 first rows of this sub-table.
seqname source feature start end score strand frame
1 Chromosome ena gene 190 255 . + .
2 Chromosome ena transcript 190 255 . + .
3 Chromosome ena exon 190 255 . + .
4 Chromosome ena CDS 190 252 . + 0
5 Chromosome ena start_codon 190 192 . + 0
seqname source feature start end score strand frame
1 Chromosome ena gene 190 255 . + .
2 Chromosome ena transcript 190 255 . + .
3 Chromosome ena exon 190 255 . + .
4 Chromosome ena CDS 190 252 . + 0
5 Chromosome ena start_codon 190 192 . + 0
Exercise: the column feature of the GTF indicates the feature table.
Tip: commands unique
, table
and sort
.
[1] gene transcript exon CDS start_codon stop_codon
Levels: CDS exon gene start_codon stop_codon transcript
exon gene transcript CDS start_codon stop_codon
4564 4497 4497 4141 4140 4140
The table()
function allows to count the frequency of each value in a qualitative variable:
Chromosome
25979
- +
13246 12733
CDS exon gene start_codon stop_codon transcript
4141 4564 4497 4140 4140 4497
We can compute the number of combinations between two qualitatives variables:
CDS exon gene start_codon stop_codon transcript
- 2129 2307 2277 2128 2128 2277
+ 2012 2257 2220 2012 2012 2220
feature
strand CDS exon gene start_codon stop_codon transcript
- 2129 2307 2277 2128 2128 2277
+ 2012 2257 2220 2012 2012 2220
Note about feature length computation (explain why) :
\[L = \text{end} - \text{start} + 1\]
The function subset()
enables to select a subset of rows based on a filter applied to the content of one or several columns.
We can use it to select the subset of features corresponding to genes.
genes
.Tip: commands subset
, summary
.
## Select subset of features having "CDS" as "feature" attribute
genes <- subset(featureTable, feature == "gene")
## Print a message with the number of genes
message("Number of genes: ", nrow(genes))
## Compute basic statistics on genes lengths
summary(genes$length)
Min. 1st Qu. Median Mean 3rd Qu. Max.
14.0 462.0 813.0 929.4 1221.0 21837.0
Other types of plots allow to explore the distribution of some data. In particular, boxplots display the median, the first and third quartiles and outlier values.
boxplot(length ~ strand, data = genes, col="palegreen", horizontal=TRUE,
las=1, xlab="Gene length", ylab="Strand")