Exploration of a genomic annotations table (GTF)

DUBii 2019

Jacques van Helden, Hugo Varet and Frédéric Guyon

2020-02-17

Goal of the practical session

During this practical session, we will cover the following items.

  1. Manipulate a table containing genomic data (E. coli genome annotations).
  2. Select a subset of the data/rows according to a given criterion.
  3. Generate different graphical representations of these data.
  4. Compute statistics describing the different types of annotations.

The GTF file format

The GTF (Gene Transfer Format) file format is widely used to store genomic annotations in a form that is both readable by humans and easy to parse with a computer.

A GTF file is a tab-delimited text file with

  • one row per genomic “object” (gene, transcript, exon, intron, CDS, …)
  • one column per attribute (name, source, object type, genomic coordinates, description).

The GTF format is described on the following websites:

Find the GTF file of your favorite organism (on Ensembl)

  1. Visit http://ensemblgenomes.org/.
  2. Click on the link Bacteria.
  3. Click on Download.
  4. In the Filter box, type Escherichia coli. The list of proposed organisms is updated as you type.

For this session we will use the E. coli GTF annotation file available here.

EnsemblGenomes home page

http://ensemblgenomes.org/

EnsemblGenomes Bacteria

http://bacteria.ensembl.org/

EnsemblGenomes Fungi

http://fungi.ensembl.org/

EnsemblGenomes Fungi Download page

http://fungi.ensembl.org/info/website/ftp/

Define and create your working directory

Exercise: create a working directory named workDir in your home folder and go inside it.
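A minimal sketch of one possible solution, assuming that the home folder is writable. The variable name workDir is taken from the exercise; everything else is an illustrative choice.

```r
# Build the path to the working directory inside the home folder
workDir <- file.path(path.expand("~"), "workDir")

# Create the directory (no warning if it already exists)
dir.create(workDir, showWarnings = FALSE, recursive = TRUE)

# Move into the working directory and check where we are
setwd(workDir)
getwd()
```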

Downloading the GTF file

Exercise: download the GTF file into the working directory (optionally, adapt the command to load a GTF of your interest). Before downloading the file, check whether it is already present in the working directory; if so, skip the download.

Tip: use the functions file.exists() and download.file().

Defining the URL by concatenating strings

For the sake of readability, we can define a URL by concatenating its different components, separated by a slash character /. For this, we use the paste() function with the argument sep = "/".
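One possible way to write this concatenation is sketched here. The way the URL is split into four components, and the variable name gtfURL, are illustrative assumptions; only the resulting string comes from this document.

```r
# Concatenate the URL components, separated by slashes
gtfURL <- paste(
  sep = "/",
  "ftp://ftp.ensemblgenomes.org/pub/bacteria/release-37/gtf",
  "bacteria_0_collection",
  "escherichia_coli_str_k_12_substr_mg1655",
  "Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.37.gtf.gz"
)

# Display the resulting URL
print(gtfURL)
```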

[1] "ftp://ftp.ensemblgenomes.org/pub/bacteria/release-37/gtf/bacteria_0_collection/escherichia_coli_str_k_12_substr_mg1655/Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.37.gtf.gz"

Downloading the GTF file: solution
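A possible solution can be sketched as a small helper function. The name downloadIfMissing and its arguments are hypothetical (they are not part of the original session); the function only relies on the base R functions file.exists() and download.file() mentioned in the tip.

```r
# Hypothetical helper: download a file only if it is not yet in destDir.
# Returns the path of the local file.
downloadIfMissing <- function(url, destDir = "data") {
  # Create the destination folder if needed
  dir.create(destDir, showWarnings = FALSE, recursive = TRUE)

  # The local file keeps the name found at the end of the URL
  destFile <- file.path(destDir, basename(url))

  if (file.exists(destFile)) {
    message("File already present, skipping download: ", destFile)
  } else {
    # mode = "wb" avoids corrupting compressed (binary) files
    download.file(url = url, destfile = destFile, mode = "wb")
  }
  return(destFile)
}

# Usage (not run here, requires a network connection):
# gtfFile <- downloadIfMissing(gtfURL, destDir = "data")
# print(gtfFile)
```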

[1] "data/Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.37.gtf.gz"

Listing the files

After a file transfer, it is safe to check the content of the target folder with the list.files() command.
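A short sketch of this check, assuming that the downloaded file was stored in a sub-folder named data of the working directory:

```r
# List the content of the current working directory
list.files()

# List the content of the "data" sub-folder, if it exists
if (dir.exists("data")) {
  print(list.files(path = "data"))
}
```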

 [1] "data"                             
 [2] "exprs8.txt"                       
 [3] "figures"                          
 [4] "gtf_exploration_files"            
 [5] "gtf_exploration.html"             
 [6] "gtf_exploration.pdf"              
 [7] "gtf_exploration.Rmd"              
 [8] "images"                           
 [9] "module-3-Stat-R_presentation.html"
[10] "module-3-Stat-R_presentation.Rmd" 
[11] "R_intro.html"                     
[12] "R_intro.pdf"                      
[13] "R_intro.Rmd"                      
[14] "README.md"                        
[15] "TP_bacterial_regulation.html"     
[16] "TP_bacterial_regulation.pdf"      
[17] "TP_bacterial_regulation.Rmd"      
[1] "annotation.csv"                                            
[2] "Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.37.gtf.gz"
[3] "expression.txt"                                            

Loading a data table in R

Commands: read.table, read.delim, read.csv.

R includes several types of tabular structures (matrix, data.frame, table). The most widely used is the data.frame, which consists of a table of values with a type (string, integer, …) attached to each column, and names associated with rows and columns.

The function read.table() reads a text file containing tabular data and stores its content in a variable.

Several functions derived from read.table() facilitate the loading of different formats.

  • read.delim() for files where a particular character is used as column separator (by default the tab character "\t").

  • read.csv() for “comma-separated values”.
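The relation between these functions can be illustrated with a small sketch. The file content below is a made-up toy table (hypothetical gene names and coordinates) written to a temporary file, not real annotation data.

```r
# Write a tiny tab-separated table to a temporary file (made-up values)
tmpFile <- tempfile(fileext = ".tsv")
writeLines(c("name\tstart\tend",
             "geneA\t101\t400",
             "geneB\t501\t1700"), tmpFile)

# read.delim() assumes a header line and tab-separated columns
toyTable <- read.delim(tmpFile)
str(toyTable)

# The same result with read.table(), where these options are set explicitly
toyTable2 <- read.table(tmpFile, header = TRUE, sep = "\t")
print(toyTable2)
```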

Loading the GTF file

Load the GTF file into a variable named featureTable.

Tip: command read.delim.
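A possible solution is sketched below as a hypothetical helper named loadGTF. It assumes that the comment lines at the top of the file start with #, that the file has no header line, and that the nine columns are named as in the table displayed further down in this document (plus a ninth column called attribute here). read.delim() can read gzip-compressed files directly.

```r
# Column names of a GTF file ("attribute" is the name chosen here for column 9)
gtfColumns <- c("seqname", "source", "feature", "start", "end",
                "score", "strand", "frame", "attribute")

# Hypothetical helper: load a (possibly gzipped) GTF file in a data.frame
loadGTF <- function(gtfFile) {
  read.delim(gtfFile,
             comment.char = "#",   # skip the comment lines of the header
             header = FALSE,       # GTF files have no column header line
             col.names = gtfColumns,
             quote = "")           # keep the quotes of the last column as is
}

# Usage (not run here, requires the downloaded file):
# featureTable <- loadGTF(
#   "data/Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.37.gtf.gz")
```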

Exploring the content of a data table

Immediately after having loaded a data table, check its dimensions.
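The commands can be sketched as follows. To keep the example self-contained, featureTable is replaced here by a toy table built from the five rows displayed further down in this document, so the dimensions differ from those of the full GTF.

```r
# Toy table restricted to five rows and eight columns
featureTable <- data.frame(
  seqname = "Chromosome",
  source  = "ena",
  feature = c("gene", "transcript", "exon", "CDS", "start_codon"),
  start   = c(190, 190, 190, 190, 190),
  end     = c(255, 255, 255, 252, 192),
  score   = ".",
  strand  = "+",
  frame   = c(".", ".", ".", "0", "0")
)

dim(featureTable)   # number of rows and columns
nrow(featureTable)  # number of rows
ncol(featureTable)  # number of columns
```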

[1] 25979     9
[1] 25979
[1] 9

Checking heads and tails

Displaying the full annotation table would not be very convenient, since it contains tens of thousands of rows.

We can display the first rows of the file with the function head(), and the last rows with tail().
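A minimal sketch of both functions, using a toy table built from the five rows displayed further down in this document (the real table has many more rows, and n defaults to 6):

```r
# Toy table with a few columns only
featureTable <- data.frame(
  feature = c("gene", "transcript", "exon", "CDS", "start_codon"),
  start   = c(190, 190, 190, 190, 190),
  end     = c(255, 255, 255, 252, 192)
)

# First two rows
head(featureTable, n = 2)

# Last two rows
tail(featureTable, n = 2)
```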

Viewing a table

If you are using the RStudio environment, you can display the table in a dynamic viewer pane with the function View().

The View() function is interactive, so it should not be used in a script, because it would disrupt its execution.
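One way to keep a call to View() in a script without disturbing its execution is to guard it with interactive(). This is only a suggestion, shown here on a made-up two-row table:

```r
# Made-up toy table
toyTable <- data.frame(feature = c("gene", "CDS"),
                       start = c(190, 190),
                       end = c(255, 252))

if (interactive()) {
  # Interactive session (e.g. RStudio): open the viewer pane
  View(toyTable)
} else {
  # Script execution: print the first rows instead
  print(head(toyTable))
}
```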

Selecting columns

The last column of GTF files is particularly heavy: it contains a lot of semi-structured information.

We can select the first 8 columns and display the first 5 rows of this sub-table.
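Two equivalent ways to do this are sketched here. The toy table reuses the five rows displayed in this document; the content of its ninth column is a made-up placeholder, not the real attribute string.

```r
# Toy table with nine columns (the attribute value is a placeholder)
featureTable <- data.frame(
  seqname = "Chromosome",
  source  = "ena",
  feature = c("gene", "transcript", "exon", "CDS", "start_codon"),
  start   = c(190, 190, 190, 190, 190),
  end     = c(255, 255, 255, 252, 192),
  score   = ".",
  strand  = "+",
  frame   = c(".", ".", ".", "0", "0"),
  attribute = "placeholder"
)

# Select rows 1 to 5 and columns 1 to 8 by their indices
firstColumns <- featureTable[1:5, 1:8]
print(firstColumns)

# Same result: select the columns, then take the first 5 rows
head(featureTable[, 1:8], n = 5)
```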

     seqname source     feature start end score strand frame
1 Chromosome    ena        gene   190 255     .      +     .
2 Chromosome    ena  transcript   190 255     .      +     .
3 Chromosome    ena        exon   190 255     .      +     .
4 Chromosome    ena         CDS   190 252     .      +     0
5 Chromosome    ena start_codon   190 192     .      +     0

Feature types

Exercise: the column feature of the GTF indicates the feature type.

  • List the feature types found in the GTF
  • Count the number of features per type, and sort them by decreasing values.

Tip: commands unique, table and sort.
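A sketch of a possible solution on a made-up toy table (the counts therefore differ from those of the full GTF):

```r
# Made-up toy table with a single column
featureTable <- data.frame(
  feature = c("gene", "transcript", "exon", "CDS", "gene", "exon", "exon")
)

# List the distinct feature types
unique(featureTable$feature)

# Count the features per type and sort the counts by decreasing values
featureCounts <- sort(table(featureTable$feature), decreasing = TRUE)
print(featureCounts)
```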

[1] gene        transcript  exon        CDS         start_codon stop_codon 
Levels: CDS exon gene start_codon stop_codon transcript

       exon        gene  transcript         CDS start_codon  stop_codon 
       4564        4497        4497        4141        4140        4140 

Counts per value

The table() function counts the occurrences of each value of a qualitative variable:
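For illustration, the same kind of counts can be computed on a made-up toy table (values chosen arbitrarily, so the counts differ from those obtained with the full GTF):

```r
# Made-up toy table
featureTable <- data.frame(
  seqname = "Chromosome",
  feature = c("gene", "CDS", "gene", "CDS", "gene"),
  strand  = c("+", "+", "-", "-", "+")
)

# Number of features per sequence, per strand and per feature type
table(featureTable$seqname)
strandCounts <- table(featureTable$strand)
print(strandCounts)
table(featureTable$feature)
```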


Chromosome 
     25979 

    -     + 
13246 12733 

        CDS        exon        gene start_codon  stop_codon  transcript 
       4141        4564        4497        4140        4140        4497 

Contingency table

We can count the occurrences of each combination of values between two qualitative variables:
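Two ways to obtain such a contingency table are sketched below on a made-up toy table. Passing two vectors gives a table without dimension names, whereas passing a two-column data.frame labels the dimensions with the column names.

```r
# Made-up toy table
featureTable <- data.frame(
  feature = c("gene", "CDS", "gene", "CDS", "gene"),
  strand  = c("+", "+", "-", "-", "+")
)

# Contingency table from two vectors (rows: strand, columns: feature)
table(featureTable$strand, featureTable$feature)

# Same counts, with dimensions labelled "strand" and "feature"
strandFeature <- table(featureTable[, c("strand", "feature")])
print(strandFeature)
```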

   
     CDS exon gene start_codon stop_codon transcript
  - 2129 2307 2277        2128       2128       2277
  + 2012 2257 2220        2012       2012       2220
      feature
strand  CDS exon gene start_codon stop_codon transcript
     - 2129 2307 2277        2128       2128       2277
     + 2012 2257 2220        2012       2012       2220

Computing feature lengths

  • Add a column with feature lengths.

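Note about feature length computation: GTF coordinates are 1-based, and both the start and the end positions belong to the feature, so 1 must be added to the difference between end and start: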

\[L = \text{end} - \text{start} + 1\]
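As a check of this formula, here is a sketch using the coordinates of the five rows displayed earlier in this document. The start codon spans positions 190 to 192, which gives 192 - 190 + 1 = 3 nucleotides, as expected for a codon.

```r
# Toy table built from the five rows displayed earlier
featureTable <- data.frame(
  feature = c("gene", "transcript", "exon", "CDS", "start_codon"),
  start   = c(190, 190, 190, 190, 190),
  end     = c(255, 255, 255, 252, 192)
)

# Add a column with the feature lengths
featureTable$length <- featureTable$end - featureTable$start + 1
print(featureTable)
```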

Filtering rows based on a column content

The function subset() selects a subset of rows based on a filter applied to the content of one or several columns.

We can use it to select the subset of features corresponding to genes.

Selecting genes from the GTF table

  • Select the genes from the GTF table and store them in a separate variable named genes.
  • Compute summary statistics about gene lengths.

Tip: commands subset, summary.
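A sketch of a possible solution, run on a made-up toy table (arbitrary coordinates, so the statistics differ from those of the real annotations):

```r
# Made-up toy table with three genes and two CDS
featureTable <- data.frame(
  feature = c("gene", "CDS", "gene", "CDS", "gene"),
  start   = c(101, 101, 501, 501, 2001),
  end     = c(400, 397, 1700, 1697, 2900)
)
featureTable$length <- featureTable$end - featureTable$start + 1

# Keep only the rows whose feature type is "gene"
genes <- subset(featureTable, feature == "gene")

# Summary statistics about gene lengths
summary(genes$length)
```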

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   14.0   462.0   813.0   929.4  1221.0 21837.0 

Exercises

  1. Draw a histogram of the gene length distribution. Choose a relevant number of breaks to display an informative histogram.
  2. Draw a boxplot of gene lengths per strand. Are genes longer on the minus or on the plus strand?

Gene length histogram

Setting a relevant number of breaks
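The effect of the breaks argument can be sketched as follows. The gene lengths are simulated here (random values, not the real E. coli annotations), only to make the example self-contained. Note that a single number given to breaks is treated by hist() as a suggestion, so the actual number of bins may differ slightly.

```r
# Simulated gene lengths (hypothetical values for illustration only)
set.seed(1)
geneLengths <- round(rlnorm(4000, meanlog = log(800), sdlog = 0.6))

# Discard graphics when run as a script (no file is written)
if (!interactive()) pdf(NULL)

# Default number of breaks: a rather coarse histogram
hDefault <- hist(geneLengths)

# Asking for about 100 breaks gives a more informative picture
h100 <- hist(geneLengths, breaks = 100)
```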

Gene length distribution – improving the output
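A sketch of a few possible improvements: explicit bin width, title, axis labels and colors. The gene lengths are simulated (hypothetical values), and the bin width of 100 bp as well as the colors are arbitrary choices.

```r
# Simulated gene lengths (hypothetical values for illustration only)
set.seed(1)
geneLengths <- round(rlnorm(4000, meanlog = log(800), sdlog = 0.6))

# Discard graphics when run as a script (no file is written)
if (!interactive()) pdf(NULL)

# Define the class boundaries explicitly, with a bin width of 100 bp
binWidth <- 100
classBreaks <- seq(from = 0,
                   to = ceiling(max(geneLengths) / binWidth) * binWidth,
                   by = binWidth)

h <- hist(geneLengths, breaks = classBreaks,
          main = "Distribution of gene lengths",
          xlab = "Gene length (bp)", ylab = "Number of genes",
          col = "lightblue", border = "white", las = 1)

# Mark the median length with a vertical line
abline(v = median(geneLengths), col = "darkred", lwd = 2)
```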

Distribution of gene lengths for E. coli.

Gene length box plot

Other types of plots can be used to explore the distribution of the data. In particular, boxplots display the median, the first and third quartiles, and outlier values.
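A boxplot of lengths per strand can be sketched with the formula interface of boxplot(). The table genes is simulated here (random strands and lengths, not the real annotations), so no biological conclusion should be drawn from this plot.

```r
# Simulated genes (hypothetical values for illustration only)
set.seed(1)
genes <- data.frame(
  strand = sample(c("+", "-"), size = 400, replace = TRUE),
  length = round(rlnorm(400, meanlog = log(800), sdlog = 0.6))
)

# Discard graphics when run as a script (no file is written)
if (!interactive()) pdf(NULL)

# One box per strand; horizontal boxes make the lengths easier to read
bp <- boxplot(length ~ strand, data = genes,
              horizontal = TRUE, las = 1, col = "lightblue",
              main = "Gene lengths per strand",
              xlab = "Gene length (bp)", ylab = "Strand")

# Median length per strand, for a numerical comparison
tapply(genes$length, genes$strand, median)
```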

Boxplot of gene lengths per chromosome