Reminders

  1. How to connect to the ifb core cluster?
ssh -X login@core.cluster.france-bioinformatique.fr
  1. How to load a tool?
module avail |grep mytool
module load mytool
  1. How to display help of a command or a tool?
man mytool
mytool --help
mytool -h
mytool
mytool -help
No convention established: Read The (F***) Manual!
  1. How to submit jobs? IFB documentation
srun [options] mycommand
sbatch [options] myscript
sbatch [options] --wrap="mycommand"

1 Preparation of your working directory

  1. Go to your home directory
  2. Create a directory called M5S1 (i.e Module5 session 1) and move in
  3. Create the following directory structure and check with tree


.
├── CLEANING
├── FASTQ
├── MAPPING
└── QC

4 directories, 0 files
Step by step correction

mkdir, cd and tree are needed.

mkdir -p ~/M5S1/FASTQ  # -p: no error if existing, make parent directories as needed
mkdir -p ~/M5S1/CLEANING
mkdir -p ~/M5S1/MAPPING
mkdir -p ~/M5S1/QC
cd ~/M5S1
tree ~/M5S1 # list contents of directories in a tree-like format.

2 Retrieve raw data (FASTQ)

  1. Find the run identifier of the the raw data (Illumina) associated with this article
  2. Download FASTQ files in FASTQ directory with sra-tools [1]
  3. Compress them with gzip

These files must be present in the FASTQ directory

ls -ltrh ~/M5S1/FASTQ/
total 236M
-rw-rw-r-- 1 orue orue 127M  6 mars  12:32 SRR8082143_2.fastq.gz
-rw-rw-r-- 1 orue orue 109M  6 mars  12:32 SRR8082143_1.fastq.gz
Step by step correction
  1. Find the section describing sequencing data
Data availability
The annotated chromosome has been deposited in NCBI GenBank under the accession number CP031214. Illumina and PacBio reads are available under the accession numbers SRX4909245 and SRX4908799, respectively, in the Sequence Read Archive (SRA).

SRX4909245 is the Illumina experience, SRR8082143 is the run identifier.
  1. Download with fasterq-dump subcommand of sra-tools
module load  sra-tools
fasterq-dump -h
srun --cpus-per-task=6 fasterq-dump --split-files -p SRR8082143 --outdir FASTQ
  1. Then compress the files with gzip:
cd ~/M5S1/FASTQ
srun gzip *.fastq

2.1 Read a FASTQ file

Never uncompress a FASTQ file unless it is absolutely necessary.
  1. Display the 8 first lines of SRR8082143_1.fastq.gz
  2. How many reads are present in this file?
Step by step correction
  1. zcat or zless are used to display compressed FASTQ files. head displays first lines.
cd ~/M5S1/FASTQ
zcat SRR8082143_1.fastq.gz | head -n 8
# or
gzip -cd SRR8082143_1.fastq.gz |head -n 8
  1. Informations about one read are stored in 4 consecutives lines. So to count the number of reads in a FASTQ file, a simple way is to count the number of lines and to divide it by 4.
cd ~/M5S1/FASTQ
# With bash
echo $(zcat SRR8082143_1.fastq.gz|wc -l)/4|bc
# or
zcat SRR8082143_1.fastq.gz| echo $((`wc -l`/4))
# or
expr $(zcat SRR8082143_1.fastq.gz | wc -l) / 4
# or, with awk
zcat SRR8082143_1.fastq.gz |awk '{s++}END{print s/4}' 
# ...

2.2 Go further

FASTQ files can be directly downloaded with wget from the ftp server.

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR808/003/SRR8082143/SRR8082143_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR808/003/SRR8082143/SRR8082143_2.fastq.gz
This method shows its limits as soon as the number of FASTQs grows up.

A very smart utility is SRA-explorer. It gives you several ways to download the files you need.

  1. Write in a file called raw_data_statistics.txt the name of the FASTQ file and the number of reads in the same line, for all FASTQ files in a directory
for i in *.fastq.gz ; do echo $i $(zcat $i |echo $((`wc -l`/4))) ; done >> raw_data_statistics.txt
  1. Which command of seqkit [2] tool does the same (better)?
#module avail |grep seqkit
module load seqkit
srun seqkit stats --threads 1 *.fastq.gz
file                   format  type  num_seqs      sum_len  min_len  avg_len  max_len
SRR8082143_1.fastq.gz  FASTQ   DNA    718,021  169,774,551       35    236.4      251
SRR8082143_2.fastq.gz  FASTQ   DNA    718,021  169,928,313       35    236.7      251
  1. Check with expr if R1 and R2 files have the same number of reads
expr $(expr $(zcat SRR8082143_1.fastq.gz | wc -l) / 4) = $(expr $(zcat SRR8082143_2.fastq.gz | wc -l) / 4)
  1. Why this command gives you (often) a wrong answer?


echo $(zgrep -cv "^@" SRR8082143_1.fastq.gz)/4|bc
zgrep  "^@" SRR8082143_1.fastq.gz |grep -v SRR | head

3 Quality control

  1. Launch FastQC [3] on the paired-end FastQ files of the sample you previously downloaded and write results in QC directory (use 8 threads)
  2. Explore the results and interpret the graphics

You have to obtain these files:

ls -ltrh ~/M5S1/QC
total 1,9M
-rw-rw-r-- 1 orue orue 321K  6 mars  13:23 SRR8082143_1_fastqc.zip
-rw-rw-r-- 1 orue orue 642K  6 mars  13:23 SRR8082143_1_fastqc.html
-rw-rw-r-- 1 orue orue 333K  6 mars  13:23 SRR8082143_2_fastqc.zip
-rw-rw-r-- 1 orue orue 642K  6 mars  13:23 SRR8082143_2_fastqc.html
Step by step correction
  1. Load fastqc
cd ~/M5S1 # if you went somewhere else
module load fastqc
  1. Run fastqc
srun --cpus-per-task 8 fastqc FASTQ/SRR8082143_1.fastq.gz -o QC/ -t 8
srun --cpus-per-task 8 fastqc FASTQ/SRR8082143_2.fastq.gz -o QC/ -t 8
  1. Download the reports on your local desktop with scp
# On your local desktop
scp orue@core.cluster.france-bioinformatique.fr:~/M5S1/QC/*.html .
  1. Open the HTML files with Firefox, Chrome, Safari
# On your local desktop
firefox *.html &

3.1 Go further

  1. Is the length distribution of reads expected for Illumina raw data? Why
Some cleaning has be performed. Usually REALLY RAW reads have the same lengths.
  1. FastQC can be run with R (see this tutorial) or with Python (see this package).

4 Reads cleaning with fastp

  1. Launch fastp [4] on the paired-end FastQ files of the sample you previously downloaded
  • Detect and Remove the classical Illumina adapters
  • Filter reads with :
    • mean quality >= 30 on a sliding window of 8 from 3’ extremity to 5’ extremity
    • length of the read >= 100
    • keep only pairs
  1. Inspect the results
  • How many reads are remaining ?


ls -ltrh ~/M5S1/CLEANING/
-rw-rw-r-- 1 orue orue 117M  1 mars  16:07 SRR8082143_2.cleaned_filtered.fastq.gz
-rw-rw-r-- 1 orue orue 107M  1 mars  16:07 SRR8082143_1.cleaned_filtered.fastq.gz
-rw-rw-r-- 1 orue orue 531K  1 mars  16:07 fastp.html
Step by step correction
  1. Load fastp, write and submit the appropriate command
cd ~/M5S1
module load fastp

fastp --version
fastp 0.20.0

fastp --help
srun --cpus-per-task 8 fastp --in1 FASTQ/SRR8082143_1.fastq.gz --in2 FASTQ/SRR8082143_2.fastq.gz --out1 CLEANING/SRR8082143_1.cleaned_filtered.fastq.gz --out2 CLEANING/SRR8082143_2.cleaned_filtered.fastq.gz --html CLEANING/fastp.html --thread 8 --cut_mean_quality 30 --cut_window_size 8 --length_required 100 --cut_tail --json CLEANING/fastp.json
  1. How many reads are remaining?
seqkit stats CLEANING/*.fastq.gz
file                                    format  type  num_seqs      sum_len  min_len  avg_len  max_len
SRR8082143_1.cleaned_filtered.fastq.gz  FASTQ   DNA    677,649  159,276,371      100      235      251
SRR8082143_2.cleaned_filtered.fastq.gz  FASTQ   DNA    677,649  150,859,154      100    222.6      251

677,649 out of 718,021 reads are remaining (~94%)

4.1 Go further

  1. How to avoid to write fastp.json?
Useless files can be written in /dev/null. It can be compared to a black hole in space.
srun --cpus-per-task 8 fastp --in1 FASTQ/SRR8082143_1.fastq.gz --in2 FASTQ/SRR8082143_2.fastq.gz --out1 CLEANING/SRR8082143_1.cleaned_filtered.fastq.gz --out2 CLEANING/SRR8082143_2.cleaned_filtered.fastq.gz --html CLEANING/fastp.html --thread 8 --cut_mean_quality 30 --cut_window_size 8 --length_required 100 --cut_tail --json /dev/null
  1. How to redirect the informations given by fastp in a file called fastp.log?
srun --cpus-per-task 8 fastp --in1 FASTQ/SRR8082143_1.fastq.gz --in2 FASTQ/SRR8082143_2.fastq.gz --out1 CLEANING/SRR8082143_1.cleaned_filtered.fastq.gz --out2 CLEANING/SRR8082143_2.cleaned_filtered.fastq.gz --html CLEANING/fastp.html --thread 8 --cut_mean_quality 30 --cut_window_size 8 --length_required 100 --cut_tail  &> CLEANING/fastp.log

5 Mapping with bwa

  1. Copy the file /shared/projects/dubii2021/trainers/module5/seance1/CP031214.1.fasta in the directory ~/M5S1/MAPPING/

  2. Index it with bwa [5].

  3. Map the reads to this reference with default parameters and write alignments in a file called SRR8082143_on_CP031214.1.sam

  4. Convert the SAM file into a BAM file with samtools [6]

  5. Sort the BAM file with samtools sort

  6. Index the sorted BAM file with samtools index

  7. Remove unnecessary files

    ls -ltrh ~/M5S1/MAPPING/ total 249M -rw-rw-r– 1 orue orue 249M 6 mars 13:01 SRR8082143.bam

Step by step correction
  1. Copy the file with cp
cd ~/M5S1/MAPPING
cp /shared/projects/dubii2021/trainers/module5/seance1/CP031214.1.fasta .
  1. Load bwa and Index FASTA file with bwa index
module load bwa
bwa
# Version: 0.7.17-r1188
bwa index
srun bwa index CP031214.1.fasta
  1. Map reads with bwa mem
bwa mem
srun --cpus-per-task=32 bwa mem CP031214.1.fasta ../CLEANING/SRR8082143_1.cleaned_filtered.fastq.gz ../CLEANING/SRR8082143_2.cleaned_filtered.fastq.gz -t 32 > SRR8082143_on_CP031214.1.sam
  1. Load samtools and convert SAM file to BAM file with samtools view
module load samtools
samtools --version
# samtools 1.10
srun --cpus-per-task=8 samtools view --threads 8 SRR8082143_on_CP031214.1.sam -b > SRR8082143_on_CP031214.1.bam
  1. Sort the BAM file with samtools sort
srun samtools sort SRR8082143_on_CP031214.1.bam -o SRR8082143_on_CP031214.1.sort.bam
  1. Index the BAM file with samtools index
srun samtools index SRR8082143_on_CP031214.1.sort.bam
  1. Remove useless files with rm
du -hcs SRR8082143_on_CP031214.1.bam SRR8082143_on_CP031214.1.sam
226M    SRR8082143_on_CP031214.1.bam
732M    SRR8082143_on_CP031214.1.sam
958M    total

rm -f SRR8082143_on_CP031214.1.bam SRR8082143_on_CP031214.1.sam # -f: never prompt, be sure of what you do!

5.1 Mapping statistics

  1. Run samtools idxstats and samtools flagstat on your BAM file and write stdout to .idxstats and .flagstat files
cd ~/M5S1/MAPPING
srun samtools idxstats SRR8082143_on_CP031214.1.sort.bam > SRR8082143_on_CP031214.1.sort.bam.idxstats
srun samtools flagstat SRR8082143_on_CP031214.1.sort.bam > SRR8082143_on_CP031214.1.sort.bam.flagstat

5.2 Go further

  1. Can you write a command to map, convert to BAM and sort in only one command to avoid temporary files?
srun --cpus-per-task=34 bwa mem CP031214.1.fasta ../CLEANING/SRR8082143_1.cleaned_filtered.fastq.gz ../CLEANING/SRR8082143_2.cleaned_filtered.fastq.gz -t 32 | srun samtools view -b - | srun samtools sort - > SRR8082143.bam
Each command must be prefixed with srun and the number --cpus-per-task is the sum of all cpus used by each command
  1. What is the size of the FASTA file?
cd ~/M5S1/MAPPING
seqkit stats CP031214.1.fasta
file                      format  type  num_seqs    sum_len    min_len    avg_len    max_len
MAPPING/CP031214.1.fasta  FASTA   DNA          1  4,599,824  4,599,824  4,599,824  4,599,824
  1. How to count the number of sequences in the FASTA file?
grep -c '>' CP031214.1.fasta

6 Synthesis of all steps done with MultiQC

  1. Run MultiQC [7] to obtain a report all tools run in your directory.
  2. Visualize the HTML report on your local desktop
Step by step correction
  1. Run multiqc
cd ~/M5S1
module load multiqc
srun multiqc -d . -o .
  1. Download multiqc report and open with your web browser
scp orue@core.cluster.france-bioinformatique.fr:~/M5S1/multiqc_report.html .
firefox multiqc_report.html &

7 Summary

tree
.
├── CLEANING
│   ├── fastp.html
│   ├── fastp.json
│   ├── SRR8082143_1.cleaned_filtered.fastq.gz
│   └── SRR8082143_2.cleaned_filtered.fastq.gz
├── FASTQ
│   ├── SRR8082143_1.fastq.gz
│   └── SRR8082143_2.fastq.gz
├── MAPPING
│   ├── CP031214.1.fasta
│   ├── CP031214.1.fasta.amb
│   ├── CP031214.1.fasta.ann
│   ├── CP031214.1.fasta.bwt
│   ├── CP031214.1.fasta.pac
│   ├── CP031214.1.fasta.sa
│   ├── SRR8082143_on_CP031214.1.bam
│   ├── SRR8082143_on_CP031214.1.sam
│   ├── SRR8082143_on_CP031214.1.sort.bam
│   ├── SRR8082143_on_CP031214.1.sort.bam.bai
│   ├── SRR8082143_on_CP031214.1.sort.bam.flagstat
│   └── SRR8082143_on_CP031214.1.sort.bam.idxstats
├── multiqc_report.html
└── QC
    ├── SRR8082143_1_fastqc.html
    ├── SRR8082143_1_fastqc.zip
    ├── SRR8082143_2_fastqc.html
    └── SRR8082143_2_fastqc.zip

4 directories, 23 files

8 Visualization

  1. Create the FASTA index with samtools faidx
cd ~/M5S1/MAPPING/
samtools faidx CP031214.1.fasta
  1. Download the following files on your local desktop
  • ~/M5S1/MAPPING/CP031214.1.fasta
  • ~/M5S1/MAPPING/CP031214.1.fasta.fai
  • ~/M5S1/MAPPING/SRR8082143_on_CP031214.1.sort.bam
  • ~/M5S1/MAPPING/SRR8082143_on_CP031214.1.sort.bam.bai
scp orue@core.cluster.france-bioinformatique.fr:~/M5S1/MAPPING/CP031214.1.fasta .
scp orue@core.cluster.france-bioinformatique.fr:~/M5S1/MAPPING/CP031214.1.fasta.fai .
scp orue@core.cluster.france-bioinformatique.fr:~/M5S1/MAPPING/SRR8082143_on_CP031214.1.sort.bam .
scp orue@core.cluster.france-bioinformatique.fr:~/M5S1/MAPPING/SRR8082143_on_CP031214.1.sort.bam.bai .
  1. Open you web browser and go to https://igv.org/app/

  2. Load the reference

  3. Load the alignments

Step by step correction

References

1. toolkit NS. NCBI sra toolkit. NCBI, GitHub repository. 2019.

2. Shen W, Le S, Li Y, Hu F. SeqKit: A cross-platform and ultrafast toolkit for fasta/q file manipulation. PloS one. 2016;11:e0163962.

3. Andrews S. FastQC a quality control tool for high throughput sequence data. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/.

4. Zhou Y, Chen Y, Chen S, Gu J. Fastp: An ultra-fast all-in-one fastq preprocessor. Bioinformatics. 2018;34:i884–90. doi:10.1093/bioinformatics/bty560.

5. Li H. Aligning sequence reads, clone sequences and assembly contigs with bwa-mem. arXiv preprint arXiv:13033997. 2013.

6. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The sequence alignment/map format and samtools. Bioinformatics. 2009;25:2078–9.

7. Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: Summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016;32:3047–8.

 

A work by Migale Bioinformatics Facility

https://migale.inrae.fr

Our two affiliations to cite us:

Université Paris-Saclay, INRAE, MaIAGE, 78350, Jouy-en-Josas, France

Université Paris-Saclay, INRAE, BioinfOmics, MIGALE bioinformatics facility, 78350, Jouy-en-Josas, France