ssh -X login@core.cluster.france-bioinformatique.fr
module avail |grep mytool
module load mytool
man mytool
mytool --help
mytool -h
mytool
mytool -help
srun [options] mycommand
sbatch [options] myscript
sbatch [options] --wrap="mycommand"
M5S1
(i.e Module5 session 1) and move intree
.
├── CLEANING
├── FASTQ
├── MAPPING
└── QC
4 directories, 0 files
mkdir
, cd
and tree
are needed.
mkdir -p ~/M5S1/FASTQ # -p: no error if existing, make parent directories as needed
mkdir -p ~/M5S1/CLEANING
mkdir -p ~/M5S1/MAPPING
mkdir -p ~/M5S1/QC
cd ~/M5S1
tree ~/M5S1 # list contents of directories in a tree-like format.
FASTQ
directory with sra-tools [1]gzip
These files must be present in the FASTQ
directory
ls -ltrh ~/M5S1/FASTQ/
total 236M
-rw-rw-r-- 1 orue orue 127M 6 mars 12:32 SRR8082143_2.fastq.gz
-rw-rw-r-- 1 orue orue 109M 6 mars 12:32 SRR8082143_1.fastq.gz
Data availability
The annotated chromosome has been deposited in NCBI GenBank under the accession number CP031214. Illumina and PacBio reads are available under the accession numbers SRX4909245 and SRX4908799, respectively, in the Sequence Read Archive (SRA).
SRX4909245 is the Illumina experience, SRR8082143 is the run identifier.
fasterq-dump
subcommand of sra-tools
module load sra-tools
fasterq-dump -h
srun --cpus-per-task=6 fasterq-dump --split-files -p SRR8082143 --outdir FASTQ
gzip
:cd ~/M5S1/FASTQ
srun gzip *.fastq
SRR8082143_1.fastq.gz
zcat
or zless
are used to display compressed FASTQ files. head
displays first lines.cd ~/M5S1/FASTQ
zcat SRR8082143_1.fastq.gz | head -n 8
# or
gzip -cd SRR8082143_1.fastq.gz |head -n 8
cd ~/M5S1/FASTQ
# With bash
echo $(zcat SRR8082143_1.fastq.gz|wc -l)/4|bc
# or
zcat SRR8082143_1.fastq.gz| echo $((`wc -l`/4))
# or
expr $(zcat SRR8082143_1.fastq.gz | wc -l) / 4
# or, with awk
zcat SRR8082143_1.fastq.gz |awk '{s++}END{print s/4}'
# ...
FASTQ files can be directly downloaded with wget
from the ftp server.
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR808/003/SRR8082143/SRR8082143_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR808/003/SRR8082143/SRR8082143_2.fastq.gz
A very smart utility is SRA-explorer. It gives you several ways to download the files you need.
raw_data_statistics.txt
the name of the FASTQ file and the number of reads in the same line, for all FASTQ files in a directoryfor i in *.fastq.gz ; do echo $i $(zcat $i |echo $((`wc -l`/4))) ; done >> raw_data_statistics.txt
#module avail |grep seqkit
module load seqkit
srun seqkit stats --threads 1 *.fastq.gz
file format type num_seqs sum_len min_len avg_len max_len
SRR8082143_1.fastq.gz FASTQ DNA 718,021 169,774,551 35 236.4 251
SRR8082143_2.fastq.gz FASTQ DNA 718,021 169,928,313 35 236.7 251
expr
if R1 and R2 files have the same number of readsexpr $(expr $(zcat SRR8082143_1.fastq.gz | wc -l) / 4) = $(expr $(zcat SRR8082143_2.fastq.gz | wc -l) / 4)
echo $(zgrep -cv "^@" SRR8082143_1.fastq.gz)/4|bc
zgrep "^@" SRR8082143_1.fastq.gz |grep -v SRR | head
QC
directory (use 8 threads)You have to obtain these files:
ls -ltrh ~/M5S1/QC
total 1,9M
-rw-rw-r-- 1 orue orue 321K 6 mars 13:23 SRR8082143_1_fastqc.zip
-rw-rw-r-- 1 orue orue 642K 6 mars 13:23 SRR8082143_1_fastqc.html
-rw-rw-r-- 1 orue orue 333K 6 mars 13:23 SRR8082143_2_fastqc.zip
-rw-rw-r-- 1 orue orue 642K 6 mars 13:23 SRR8082143_2_fastqc.html
cd ~/M5S1 # if you went somewhere else
module load fastqc
srun --cpus-per-task 8 fastqc FASTQ/SRR8082143_1.fastq.gz -o QC/ -t 8
srun --cpus-per-task 8 fastqc FASTQ/SRR8082143_2.fastq.gz -o QC/ -t 8
scp
# On your local desktop
scp orue@core.cluster.france-bioinformatique.fr:~/M5S1/QC/*.html .
HTML
files with Firefox
, Chrome
, Safari
…# On your local desktop
firefox *.html &
Some cleaning has be performed. Usually REALLY RAW reads have the same lengths.
ls -ltrh ~/M5S1/CLEANING/
-rw-rw-r-- 1 orue orue 117M 1 mars 16:07 SRR8082143_2.cleaned_filtered.fastq.gz
-rw-rw-r-- 1 orue orue 107M 1 mars 16:07 SRR8082143_1.cleaned_filtered.fastq.gz
-rw-rw-r-- 1 orue orue 531K 1 mars 16:07 fastp.html
cd ~/M5S1
module load fastp
fastp --version
fastp 0.20.0
fastp --help
srun --cpus-per-task 8 fastp --in1 FASTQ/SRR8082143_1.fastq.gz --in2 FASTQ/SRR8082143_2.fastq.gz --out1 CLEANING/SRR8082143_1.cleaned_filtered.fastq.gz --out2 CLEANING/SRR8082143_2.cleaned_filtered.fastq.gz --html CLEANING/fastp.html --thread 8 --cut_mean_quality 30 --cut_window_size 8 --length_required 100 --cut_tail --json CLEANING/fastp.json
seqkit stats CLEANING/*.fastq.gz
file format type num_seqs sum_len min_len avg_len max_len
SRR8082143_1.cleaned_filtered.fastq.gz FASTQ DNA 677,649 159,276,371 100 235 251
SRR8082143_2.cleaned_filtered.fastq.gz FASTQ DNA 677,649 150,859,154 100 222.6 251
677,649 out of 718,021 reads are remaining (~94%)
fastp.json
?/dev/null
. It can be compared to a black hole in space.
srun --cpus-per-task 8 fastp --in1 FASTQ/SRR8082143_1.fastq.gz --in2 FASTQ/SRR8082143_2.fastq.gz --out1 CLEANING/SRR8082143_1.cleaned_filtered.fastq.gz --out2 CLEANING/SRR8082143_2.cleaned_filtered.fastq.gz --html CLEANING/fastp.html --thread 8 --cut_mean_quality 30 --cut_window_size 8 --length_required 100 --cut_tail --json /dev/null
fastp
in a file called fastp.log
?srun --cpus-per-task 8 fastp --in1 FASTQ/SRR8082143_1.fastq.gz --in2 FASTQ/SRR8082143_2.fastq.gz --out1 CLEANING/SRR8082143_1.cleaned_filtered.fastq.gz --out2 CLEANING/SRR8082143_2.cleaned_filtered.fastq.gz --html CLEANING/fastp.html --thread 8 --cut_mean_quality 30 --cut_window_size 8 --length_required 100 --cut_tail &> CLEANING/fastp.log
Copy the file /shared/projects/dubii2021/trainers/module5/seance1/CP031214.1.fasta
in the directory ~/M5S1/MAPPING/
Index it with bwa [5].
Map the reads to this reference with default parameters and write alignments in a file called SRR8082143_on_CP031214.1.sam
Convert the SAM
file into a BAM
file with samtools [6]
Sort the BAM
file with samtools sort
Index the sorted BAM
file with samtools index
Remove unnecessary files
ls -ltrh ~/M5S1/MAPPING/ total 249M -rw-rw-r– 1 orue orue 249M 6 mars 13:01 SRR8082143.bam
cp
cd ~/M5S1/MAPPING
cp /shared/projects/dubii2021/trainers/module5/seance1/CP031214.1.fasta .
bwa index
module load bwa
bwa
# Version: 0.7.17-r1188
bwa index
srun bwa index CP031214.1.fasta
bwa mem
bwa mem
srun --cpus-per-task=32 bwa mem CP031214.1.fasta ../CLEANING/SRR8082143_1.cleaned_filtered.fastq.gz ../CLEANING/SRR8082143_2.cleaned_filtered.fastq.gz -t 32 > SRR8082143_on_CP031214.1.sam
samtools
and convert SAM
file to BAM
file with samtools view
module load samtools
samtools --version
# samtools 1.10
srun --cpus-per-task=8 samtools view --threads 8 SRR8082143_on_CP031214.1.sam -b > SRR8082143_on_CP031214.1.bam
BAM
file with samtools sort
srun samtools sort SRR8082143_on_CP031214.1.bam -o SRR8082143_on_CP031214.1.sort.bam
BAM
file with samtools index
srun samtools index SRR8082143_on_CP031214.1.sort.bam
rm
du -hcs SRR8082143_on_CP031214.1.bam SRR8082143_on_CP031214.1.sam
226M SRR8082143_on_CP031214.1.bam
732M SRR8082143_on_CP031214.1.sam
958M total
rm -f SRR8082143_on_CP031214.1.bam SRR8082143_on_CP031214.1.sam # -f: never prompt, be sure of what you do!
samtools idxstats
and samtools flagstat
on your BAM
file and write stdout to .idxstats
and .flagstat
filescd ~/M5S1/MAPPING
srun samtools idxstats SRR8082143_on_CP031214.1.sort.bam > SRR8082143_on_CP031214.1.sort.bam.idxstats
srun samtools flagstat SRR8082143_on_CP031214.1.sort.bam > SRR8082143_on_CP031214.1.sort.bam.flagstat
srun --cpus-per-task=34 bwa mem CP031214.1.fasta ../CLEANING/SRR8082143_1.cleaned_filtered.fastq.gz ../CLEANING/SRR8082143_2.cleaned_filtered.fastq.gz -t 32 | srun samtools view -b - | srun samtools sort - > SRR8082143.bam
srun
and the number --cpus-per-task
is the sum of all cpus used by each command
FASTA
file?cd ~/M5S1/MAPPING
seqkit stats CP031214.1.fasta
file format type num_seqs sum_len min_len avg_len max_len
MAPPING/CP031214.1.fasta FASTA DNA 1 4,599,824 4,599,824 4,599,824 4,599,824
FASTA
file?grep -c '>' CP031214.1.fasta
HTML
report on your local desktopcd ~/M5S1
module load multiqc
srun multiqc -d . -o .
scp orue@core.cluster.france-bioinformatique.fr:~/M5S1/multiqc_report.html .
firefox multiqc_report.html &
tree
.
├── CLEANING
│ ├── fastp.html
│ ├── fastp.json
│ ├── SRR8082143_1.cleaned_filtered.fastq.gz
│ └── SRR8082143_2.cleaned_filtered.fastq.gz
├── FASTQ
│ ├── SRR8082143_1.fastq.gz
│ └── SRR8082143_2.fastq.gz
├── MAPPING
│ ├── CP031214.1.fasta
│ ├── CP031214.1.fasta.amb
│ ├── CP031214.1.fasta.ann
│ ├── CP031214.1.fasta.bwt
│ ├── CP031214.1.fasta.pac
│ ├── CP031214.1.fasta.sa
│ ├── SRR8082143_on_CP031214.1.bam
│ ├── SRR8082143_on_CP031214.1.sam
│ ├── SRR8082143_on_CP031214.1.sort.bam
│ ├── SRR8082143_on_CP031214.1.sort.bam.bai
│ ├── SRR8082143_on_CP031214.1.sort.bam.flagstat
│ └── SRR8082143_on_CP031214.1.sort.bam.idxstats
├── multiqc_report.html
└── QC
├── SRR8082143_1_fastqc.html
├── SRR8082143_1_fastqc.zip
├── SRR8082143_2_fastqc.html
└── SRR8082143_2_fastqc.zip
4 directories, 23 files
FASTA
index with samtools faidx
cd ~/M5S1/MAPPING/
samtools faidx CP031214.1.fasta
scp orue@core.cluster.france-bioinformatique.fr:~/M5S1/MAPPING/CP031214.1.fasta .
scp orue@core.cluster.france-bioinformatique.fr:~/M5S1/MAPPING/CP031214.1.fasta.fai .
scp orue@core.cluster.france-bioinformatique.fr:~/M5S1/MAPPING/SRR8082143_on_CP031214.1.sort.bam .
scp orue@core.cluster.france-bioinformatique.fr:~/M5S1/MAPPING/SRR8082143_on_CP031214.1.sort.bam.bai .
Open you web browser and go to https://igv.org/app/
Load the reference
Load the alignments
1. toolkit NS. NCBI sra toolkit. NCBI, GitHub repository. 2019.
2. Shen W, Le S, Li Y, Hu F. SeqKit: A cross-platform and ultrafast toolkit for fasta/q file manipulation. PloS one. 2016;11:e0163962.
3. Andrews S. FastQC a quality control tool for high throughput sequence data. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/.
4. Zhou Y, Chen Y, Chen S, Gu J. Fastp: An ultra-fast all-in-one fastq preprocessor. Bioinformatics. 2018;34:i884–90. doi:10.1093/bioinformatics/bty560.
5. Li H. Aligning sequence reads, clone sequences and assembly contigs with bwa-mem. arXiv preprint arXiv:13033997. 2013.
6. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The sequence alignment/map format and samtools. Bioinformatics. 2009;25:2078–9.
7. Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: Summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016;32:3047–8.
A work by Migale Bioinformatics Facility
https://migale.inrae.fr
Our two affiliations to cite us:
Université Paris-Saclay, INRAE, MaIAGE, 78350, Jouy-en-Josas, France
Université Paris-Saclay, INRAE, BioinfOmics, MIGALE bioinformatics facility, 78350, Jouy-en-Josas, France