Reminders

How to connect to the ifb core cluster?

ssh -X login@core.cluster.france-bioinformatique.fr

How to load a tool?

module avail |grep mytool
module load mytool

How to display help of a command or a tool?

man mytool
mytool --help
mytool -h
mytool
mytool -help

No convention established: Read The (F***) Manual!

How to submit jobs? IFB documentation

srun [options] mycommand
sbatch [options] myscript
sbatch [options] --wrap="mycommand"

1 Preparation of your working directory

Go to your home directory
Create a directory called M5S1 (i.e Module5 session 1) and move in
Create the following directory structure and check with tree

.
├── CLEANING
├── FASTQ
├── MAPPING
└── QC

4 directories, 0 files

Step by step correction

mkdir, cd and tree are needed.

mkdir -p ~/M5S1/FASTQ  # -p: no error if existing, make parent directories as needed
mkdir -p ~/M5S1/CLEANING
mkdir -p ~/M5S1/MAPPING
mkdir -p ~/M5S1/QC
cd ~/M5S1
tree ~/M5S1 # list contents of directories in a tree-like format.

2 Retrieve raw data (FASTQ)

Find the run identifier of the the raw data (Illumina) associated with this article
Download FASTQ files in FASTQ directory with sra-tools [1]
Compress them with gzip

These files must be present in the FASTQ directory

ls -ltrh ~/M5S1/FASTQ/
total 236M
-rw-rw-r-- 1 orue orue 127M  6 mars  12:32 SRR8082143_2.fastq.gz
-rw-rw-r-- 1 orue orue 109M  6 mars  12:32 SRR8082143_1.fastq.gz

Step by step correction

Find the section describing sequencing data

Data availability
The annotated chromosome has been deposited in NCBI GenBank under the accession number CP031214. Illumina and PacBio reads are available under the accession numbers SRX4909245 and SRX4908799, respectively, in the Sequence Read Archive (SRA).

SRX4909245 is the Illumina experience, SRR8082143 is the run identifier.

Download with fasterq-dump subcommand of sra-tools

module load  sra-tools
fasterq-dump -h
srun --cpus-per-task=6 fasterq-dump --split-files -p SRR8082143 --outdir FASTQ

Then compress the files with gzip:

cd ~/M5S1/FASTQ
srun gzip *.fastq

2.1 Read a FASTQ file

Never uncompress a FASTQ file unless it is absolutely necessary.

Display the 8 first lines of SRR8082143_1.fastq.gz
How many reads are present in this file?

Step by step correction

zcat or zless are used to display compressed FASTQ files. head displays first lines.

cd ~/M5S1/FASTQ
zcat SRR8082143_1.fastq.gz | head -n 8
# or
gzip -cd SRR8082143_1.fastq.gz |head -n 8

Informations about one read are stored in 4 consecutives lines. So to count the number of reads in a FASTQ file, a simple way is to count the number of lines and to divide it by 4.

cd ~/M5S1/FASTQ
# With bash
echo $(zcat SRR8082143_1.fastq.gz|wc -l)/4|bc
# or
zcat SRR8082143_1.fastq.gz| echo $((`wc -l`/4))
# or
expr $(zcat SRR8082143_1.fastq.gz | wc -l) / 4
# or, with awk
zcat SRR8082143_1.fastq.gz |awk '{s++}END{print s/4}' 
# ...

2.2 Go further

FASTQ files can be directly downloaded with wget from the ftp server.

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR808/003/SRR8082143/SRR8082143_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR808/003/SRR8082143/SRR8082143_2.fastq.gz

This method shows its limits as soon as the number of FASTQs grows up.

A very smart utility is SRA-explorer. It gives you several ways to download the files you need.

Write in a file called raw_data_statistics.txt the name of the FASTQ file and the number of reads in the same line, for all FASTQ files in a directory

for i in *.fastq.gz ; do echo $i $(zcat $i |echo $((`wc -l`/4))) ; done >> raw_data_statistics.txt

Which command of seqkit [2] tool does the same (better)?

#module avail |grep seqkit
module load seqkit
srun seqkit stats --threads 1 *.fastq.gz
file                   format  type  num_seqs      sum_len  min_len  avg_len  max_len
SRR8082143_1.fastq.gz  FASTQ   DNA    718,021  169,774,551       35    236.4      251
SRR8082143_2.fastq.gz  FASTQ   DNA    718,021  169,928,313       35    236.7      251

Check with expr if R1 and R2 files have the same number of reads

expr $(expr $(zcat SRR8082143_1.fastq.gz | wc -l) / 4) = $(expr $(zcat SRR8082143_2.fastq.gz | wc -l) / 4)

Why this command gives you (often) a wrong answer?

echo $(zgrep -cv "^@" SRR8082143_1.fastq.gz)/4|bc

zgrep  "^@" SRR8082143_1.fastq.gz |grep -v SRR | head

3 Quality control

Launch FastQC [3] on the paired-end FastQ files of the sample you previously downloaded and write results in QC directory (use 8 threads)
Explore the results and interpret the graphics

You have to obtain these files:

ls -ltrh ~/M5S1/QC
total 1,9M
-rw-rw-r-- 1 orue orue 321K  6 mars  13:23 SRR8082143_1_fastqc.zip
-rw-rw-r-- 1 orue orue 642K  6 mars  13:23 SRR8082143_1_fastqc.html
-rw-rw-r-- 1 orue orue 333K  6 mars  13:23 SRR8082143_2_fastqc.zip
-rw-rw-r-- 1 orue orue 642K  6 mars  13:23 SRR8082143_2_fastqc.html

Step by step correction

Load fastqc

cd ~/M5S1 # if you went somewhere else
module load fastqc

Run fastqc

srun --cpus-per-task 8 fastqc FASTQ/SRR8082143_1.fastq.gz -o QC/ -t 8
srun --cpus-per-task 8 fastqc FASTQ/SRR8082143_2.fastq.gz -o QC/ -t 8

Download the reports on your local desktop with scp

# On your local desktop
scp orue@core.cluster.france-bioinformatique.fr:~/M5S1/QC/*.html .

Open the HTML files with Firefox, Chrome, Safari…

# On your local desktop
firefox *.html &

3.1 Go further

Is the length distribution of reads expected for Illumina raw data? Why

Some cleaning has be performed. Usually REALLY RAW reads have the same lengths.

FastQC can be run with R (see this tutorial) or with Python (see this package).

4 Reads cleaning with fastp

Launch fastp [4] on the paired-end FastQ files of the sample you previously downloaded

Detect and Remove the classical Illumina adapters
Filter reads with :
- mean quality >= 30 on a sliding window of 8 from 3’ extremity to 5’ extremity
- length of the read >= 100
- keep only pairs

Inspect the results

How many reads are remaining ?

ls -ltrh ~/M5S1/CLEANING/
-rw-rw-r-- 1 orue orue 117M  1 mars  16:07 SRR8082143_2.cleaned_filtered.fastq.gz
-rw-rw-r-- 1 orue orue 107M  1 mars  16:07 SRR8082143_1.cleaned_filtered.fastq.gz
-rw-rw-r-- 1 orue orue 531K  1 mars  16:07 fastp.html

Step by step correction

Load fastp, write and submit the appropriate command

cd ~/M5S1
module load fastp

fastp --version
fastp 0.20.0

fastp --help

srun --cpus-per-task 8 fastp --in1 FASTQ/SRR8082143_1.fastq.gz --in2 FASTQ/SRR8082143_2.fastq.gz --out1 CLEANING/SRR8082143_1.cleaned_filtered.fastq.gz --out2 CLEANING/SRR8082143_2.cleaned_filtered.fastq.gz --html CLEANING/fastp.html --thread 8 --cut_mean_quality 30 --cut_window_size 8 --length_required 100 --cut_tail --json CLEANING/fastp.json

How many reads are remaining?

seqkit stats CLEANING/*.fastq.gz
file                                    format  type  num_seqs      sum_len  min_len  avg_len  max_len
SRR8082143_1.cleaned_filtered.fastq.gz  FASTQ   DNA    677,649  159,276,371      100      235      251
SRR8082143_2.cleaned_filtered.fastq.gz  FASTQ   DNA    677,649  150,859,154      100    222.6      251

677,649 out of 718,021 reads are remaining (~94%)

fastp report

4.1 Go further

How to avoid to write fastp.json?

Useless files can be written in /dev/null. It can be compared to a black hole in space.

srun --cpus-per-task 8 fastp --in1 FASTQ/SRR8082143_1.fastq.gz --in2 FASTQ/SRR8082143_2.fastq.gz --out1 CLEANING/SRR8082143_1.cleaned_filtered.fastq.gz --out2 CLEANING/SRR8082143_2.cleaned_filtered.fastq.gz --html CLEANING/fastp.html --thread 8 --cut_mean_quality 30 --cut_window_size 8 --length_required 100 --cut_tail --json /dev/null

How to redirect the informations given by fastp in a file called fastp.log?

srun --cpus-per-task 8 fastp --in1 FASTQ/SRR8082143_1.fastq.gz --in2 FASTQ/SRR8082143_2.fastq.gz --out1 CLEANING/SRR8082143_1.cleaned_filtered.fastq.gz --out2 CLEANING/SRR8082143_2.cleaned_filtered.fastq.gz --html CLEANING/fastp.html --thread 8 --cut_mean_quality 30 --cut_window_size 8 --length_required 100 --cut_tail  &> CLEANING/fastp.log

5 Mapping with bwa

Copy the file /shared/projects/dubii2021/trainers/module5/seance1/CP031214.1.fasta in the directory ~/M5S1/MAPPING/
Index it with bwa [5].
Map the reads to this reference with default parameters and write alignments in a file called SRR8082143_on_CP031214.1.sam
Convert the SAM file into a BAM file with samtools [6]
Sort the BAM file with samtools sort
Index the sorted BAM file with samtools index
Remove unnecessary files

ls -ltrh ~/M5S1/MAPPING/ total 249M -rw-rw-r– 1 orue orue 249M 6 mars 13:01 SRR8082143.bam

Step by step correction

Copy the file with cp

cd ~/M5S1/MAPPING
cp /shared/projects/dubii2021/trainers/module5/seance1/CP031214.1.fasta .

Load bwa and Index FASTA file with bwa index

module load bwa
bwa
# Version: 0.7.17-r1188
bwa index

srun bwa index CP031214.1.fasta

Map reads with bwa mem

bwa mem
srun --cpus-per-task=32 bwa mem CP031214.1.fasta ../CLEANING/SRR8082143_1.cleaned_filtered.fastq.gz ../CLEANING/SRR8082143_2.cleaned_filtered.fastq.gz -t 32 > SRR8082143_on_CP031214.1.sam

Load samtools and convert SAM file to BAM file with samtools view

module load samtools
samtools --version
# samtools 1.10

srun --cpus-per-task=8 samtools view --threads 8 SRR8082143_on_CP031214.1.sam -b > SRR8082143_on_CP031214.1.bam

Sort the BAM file with samtools sort

srun samtools sort SRR8082143_on_CP031214.1.bam -o SRR8082143_on_CP031214.1.sort.bam

Index the BAM file with samtools index

srun samtools index SRR8082143_on_CP031214.1.sort.bam

Remove useless files with rm

du -hcs SRR8082143_on_CP031214.1.bam SRR8082143_on_CP031214.1.sam
226M    SRR8082143_on_CP031214.1.bam
732M    SRR8082143_on_CP031214.1.sam
958M    total

rm -f SRR8082143_on_CP031214.1.bam SRR8082143_on_CP031214.1.sam # -f: never prompt, be sure of what you do!

5.1 Mapping statistics

Run samtools idxstats and samtools flagstat on your BAM file and write stdout to .idxstats and .flagstat files

cd ~/M5S1/MAPPING
srun samtools idxstats SRR8082143_on_CP031214.1.sort.bam > SRR8082143_on_CP031214.1.sort.bam.idxstats
srun samtools flagstat SRR8082143_on_CP031214.1.sort.bam > SRR8082143_on_CP031214.1.sort.bam.flagstat

5.2 Go further

Can you write a command to map, convert to BAM and sort in only one command to avoid temporary files?

srun --cpus-per-task=34 bwa mem CP031214.1.fasta ../CLEANING/SRR8082143_1.cleaned_filtered.fastq.gz ../CLEANING/SRR8082143_2.cleaned_filtered.fastq.gz -t 32 | srun samtools view -b - | srun samtools sort - > SRR8082143.bam

Each command must be prefixed with srun and the number --cpus-per-task is the sum of all cpus used by each command

What is the size of the FASTA file?

cd ~/M5S1/MAPPING
seqkit stats CP031214.1.fasta
file                      format  type  num_seqs    sum_len    min_len    avg_len    max_len
MAPPING/CP031214.1.fasta  FASTA   DNA          1  4,599,824  4,599,824  4,599,824  4,599,824

How to count the number of sequences in the FASTA file?

grep -c '>' CP031214.1.fasta

6 Synthesis of all steps done with MultiQC

Run MultiQC [7] to obtain a report all tools run in your directory.
Visualize the HTML report on your local desktop

Step by step correction

Run multiqc

cd ~/M5S1
module load multiqc
srun multiqc -d . -o .

Download multiqc report and open with your web browser

scp orue@core.cluster.france-bioinformatique.fr:~/M5S1/multiqc_report.html .
firefox multiqc_report.html &

multiqc_report

7 Summary

tree
.
├── CLEANING
│   ├── fastp.html
│   ├── fastp.json
│   ├── SRR8082143_1.cleaned_filtered.fastq.gz
│   └── SRR8082143_2.cleaned_filtered.fastq.gz
├── FASTQ
│   ├── SRR8082143_1.fastq.gz
│   └── SRR8082143_2.fastq.gz
├── MAPPING
│   ├── CP031214.1.fasta
│   ├── CP031214.1.fasta.amb
│   ├── CP031214.1.fasta.ann
│   ├── CP031214.1.fasta.bwt
│   ├── CP031214.1.fasta.pac
│   ├── CP031214.1.fasta.sa
│   ├── SRR8082143_on_CP031214.1.bam
│   ├── SRR8082143_on_CP031214.1.sam
│   ├── SRR8082143_on_CP031214.1.sort.bam
│   ├── SRR8082143_on_CP031214.1.sort.bam.bai
│   ├── SRR8082143_on_CP031214.1.sort.bam.flagstat
│   └── SRR8082143_on_CP031214.1.sort.bam.idxstats
├── multiqc_report.html
└── QC
    ├── SRR8082143_1_fastqc.html
    ├── SRR8082143_1_fastqc.zip
    ├── SRR8082143_2_fastqc.html
    └── SRR8082143_2_fastqc.zip

4 directories, 23 files

8 Visualization

Create the FASTA index with samtools faidx

cd ~/M5S1/MAPPING/
samtools faidx CP031214.1.fasta

Download the following files on your local desktop

~/M5S1/MAPPING/CP031214.1.fasta
~/M5S1/MAPPING/CP031214.1.fasta.fai
~/M5S1/MAPPING/SRR8082143_on_CP031214.1.sort.bam
~/M5S1/MAPPING/SRR8082143_on_CP031214.1.sort.bam.bai

scp orue@core.cluster.france-bioinformatique.fr:~/M5S1/MAPPING/CP031214.1.fasta .
scp orue@core.cluster.france-bioinformatique.fr:~/M5S1/MAPPING/CP031214.1.fasta.fai .
scp orue@core.cluster.france-bioinformatique.fr:~/M5S1/MAPPING/SRR8082143_on_CP031214.1.sort.bam .
scp orue@core.cluster.france-bioinformatique.fr:~/M5S1/MAPPING/SRR8082143_on_CP031214.1.sort.bam.bai .

Open you web browser and go to https://igv.org/app/
Load the reference
Load the alignments

Step by step correction

References

1. toolkit NS. NCBI sra toolkit. NCBI, GitHub repository. 2019.

2. Shen W, Le S, Li Y, Hu F. SeqKit: A cross-platform and ultrafast toolkit for fasta/q file manipulation. PloS one. 2016;11:e0163962.

3. Andrews S. FastQC a quality control tool for high throughput sequence data. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/.

4. Zhou Y, Chen Y, Chen S, Gu J. Fastp: An ultra-fast all-in-one fastq preprocessor. Bioinformatics. 2018;34:i884–90. doi:10.1093/bioinformatics/bty560.

5. Li H. Aligning sequence reads, clone sequences and assembly contigs with bwa-mem. arXiv preprint arXiv:13033997. 2013.

6. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The sequence alignment/map format and samtools. Bioinformatics. 2009;25:2078–9.

7. Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: Summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016;32:3047–8.

A work by Migale Bioinformatics Facility

https://migale.inrae.fr

Our two affiliations to cite us:

Université Paris-Saclay, INRAE, MaIAGE, 78350, Jouy-en-Josas, France

Université Paris-Saclay, INRAE, BioinfOmics, MIGALE bioinformatics facility, 78350, Jouy-en-Josas, France

First steps with NGS data

DUBii - Module 5 - Session 1

Valentin Loux - Olivier Rué

2021/03/08

Reminders

1 Preparation of your working directory

2 Retrieve raw data (FASTQ)

2.1 Read a FASTQ file

2.2 Go further

3 Quality control

3.1 Go further

4 Reads cleaning with fastp

4.1 Go further

5 Mapping with bwa

5.1 Mapping statistics

5.2 Go further

6 Synthesis of all steps done with MultiQC

7 Summary

8 Visualization

References