First steps with NGS data

# First steps with NGS data
## DUBii - Module 5
### Valentin Loux - Olivier Rué
### 2021-03-08

---

# Program

- Introduction
- Get data from public resources
- FASTQ format
- Quality control
- Cleaning of reads
- Mapping of reads
- FASTA format
- SAM format
- Visualization

---

class: tp
<img src="images/TP.png" class="handson">
# Hands-on: Preparation of your working directory

## Instruction

- Switch to <a href="TP.html#1_Preparation_of_your_working_directory">TP document</a>
- Preparation of your working directory

---
class: heading-slide, middle, center
# The Data

---

# What is data

## Definition

- `Data` is <i>a symbolic representation of information</i>
- `Data` is stored in files whose format allows an easy way to access and manipulate
- `Data` represents the knowledge at a given time.

## Properties

- The same information may be represented in different formats
- The content depends on technologies

<div class="alert comment"><svg viewBox="0 0 512 512" xmlns="http://www.w3.org/2000/svg" style="height:1em;fill:currentColor;position:relative;display:inline-block;top:.1em;">  [ comment ]  <path d="M504 256c0 136.997-111.043 248-248 248S8 392.997 8 256C8 119.083 119.043 8 256 8s248 111.083 248 248zm-248 50c-25.405 0-46 20.595-46 46s20.595 46 46 46 46-20.595 46-46-20.595-46-46-46zm-43.673-165.346l7.418 136c.347 6.364 5.609 11.346 11.982 11.346h48.546c6.373 0 11.635-4.982 11.982-11.346l7.418-136c.375-6.874-5.098-12.654-11.982-12.654h-63.383c-6.884 0-12.356 5.78-11.981 12.654z"></path></svg> Understanding data formats, what information is encoded in each, and when it is appropriate to use one format over another is an essential skill of a bioinformatician.</div>

---

# Some NGS file formats

Format | Who generates it? | Who reads it?
--- | --- | ---
*FASTQ* | `sequencers`, `simulation tools` | `mapping tools`, `QC tools`, `cleaning tools`, `taxonomic assignation tools`
*FASTA* | `assembly tools`, `gene prediction tools` ... | `visualization tools`, almost all
*SAM/BAM* | `mapping tools`, `samtools` | `visualization tools`, `variant discovery tools`, `counting tools`
*BED* | `annotation tools`, `bedtools` | `visualization tools`, `variant discovery tools`, `peak calling tools`, `counting tools`
*GFF* | `annotation tools` | `visualization tools`, `variant discovery tools`, `peak calling tools`, `RNAseq tools`
*VCF* | `variant discovery tools` | `vcftools`,  `visualization tools`, `variant discovery tools`

* [synthesis](https://ressources.france-bioinformatique.fr/sites/default/files/formats.pdf)

---

# Genomics sequences resources

The International Nucleotide Sequence Database Collaboration (INSDC) is a long-standing foundational initiative that operates between DDBJ, EMBL-EBI and NCBI. INSDC covers the spectrum of data raw reads, through alignments and assemblies to functional annotation, enriched with contextual information relating to samples and experimental configurations.

<div class="figure" style="text-align: center">
<img src="images/public_resources.png" alt="INDSC resources" width="70%" />
<p class="caption">INDSC resources</p>
</div>

---
# International Nucleotide Sequence Database Collaboration

The member organizations of this collaboration are:
- NCBI: National Center for Biotechnology Information
- EMBL: European Molecular Biology Laboratory
- DDBJ: DNA Data Bank of Japan

The INSDC has set up rules on the types of data that will be mirrored. The most important
of these from a bioinformatician’s perspective are:
- GenBank/Ebi ENA contains all annotated and identified DNA sequence information
- SRA [NCBI Sequence Reads Archive](https://trace.ncbi.nlm.nih.gov/Traces/sra/) / ENA [European Nucleotide Archive](https://www.ebi.ac.uk/ena/browser/search): Short Read Archive contains measurements from high throughput sequencing
experiments (raw data)

Deposit of sequencing (raw) and processed (analyzed) datas are (most of the time) a prerequiste for publication.

---
# Other sequence resources

## NAR Database Issue

Once a year the journal Nucleic Acids Research publishes its so-called “database issue”. Each
article of this issue of the journal will provide an overview of generic and specific
databases written by the maintainers of that resource.
- The 2021 Nucleic Acids Research database issue and the online molecular biology database collection

<div class="figure" style="text-align: center">
<img src="images/NAR_db.png" alt="NAR 2019 database issue overview" width="50%" />
<p class="caption">NAR 2019 database issue overview</p>
</div>

---

---
# Retrieving NGS data

- Very easy when it concerns only a few files, can be done directly from the website
- Much more tricky for tens, hundreds, thousands...

## Sequencing data

- Specialized Tools or API are offered by the public repository to easily get data locally
- ENA: enaBrowserTools (command line, python, R)
- NCBI: sra-toolkit (command line, python, R)

Common command lines (wget) are most of the time also available.

---
class: tp
<img src="images/TP.png" class="handson">
# Hands-on: Retrieving raw data

## Instruction

Get the raw shot read data (Illumina) associated with this article <a name=cite-Allue-Guardiae01052-18></a>([Allué-Guardia, Nyong, Koenig, Vargas, Bono, and Eppinger, 2019](https://mra.asm.org/content/8/2/e01052-18)).

- Switch to <a href="TP.html#2_Retrieve_raw_data_(FASTQ)">TP document</a>
- Retrieve raw data (FASTQ)

---

# Sequencing  - Vocabulary

- **read**: a single sequence produced from a sequencer. Think: a sequencing machine read a molecule and this is what it thinks it is.

- **library**: a collection of DNA fragments that have been prepared for sequencing. This is generally talking about individual samples.

- **flowcell**: a chip on which DNA is loaded and provided to the sequencer.

- **lane**: one portion of a flowcell. Usually used for technical replicates or different samples.

- **run**: an entire sequencing reaction from start to finish.

---
# Sequencing  - Vocabulary

**DNA fragment** = 1 or more reads depending on whether the sequencing is single end or paird-end

**Insert** = Fragment size

**Depth** = `$N*L/G$` 
N= number of reads, L = size, G : genome size

**Coverage** = % of genome covered
]
.pull-right[
<img src="images/se-pe.png" width="80%" style="display: block; margin: auto;" />

<div class="figure" style="text-align: center">
<img src="images/depth-breadth.png" alt="Single-End , Paired-End" width="80%" />
<p class="caption">Single-End , Paired-End</p>
</div>

]
---
class: heading-slide, middle, center
# FASTQ format

---

# FASTQ syntax

The FASTQ format is the de facto standard by which all sequencing instruments represent
data. It may be thought of as a variant of the FASTA format that allows it to associate a
quality measure to each sequence base:   **FASTA with QUALITIES**.

The FASTQ format consists of 4 sections:
1. A FASTA-like header, but instead of the <code>></code> symbol it uses the <code>@</code> symbol. This is followed
by an ID and more optional text, similar to the FASTA headers.
2. The second section contains the measured sequence (typically on a single line), but it
may be wrapped until the <code>+</code> sign starts the next section.
3. The third section is marked by the <code>+</code> sign and may be optionally followed by the same
sequence id and header as the first section
4. The last line encodes the quality values for the sequence in section 2, and must be of
the same length as section 2.

<i>Example</i>

```bash
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
```

---

# FASTQ quality

The weird characters in the 4th section are the so called “encoded” numerical values.
In a nutshell, each character represents a numerical value: a so-called Phred score,
encoded via a single letter encoding.

```bash
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHI
|    |    |    |    |    |    |    |    |
0....5...10...15...20...25...30...35...40
|    |    |    |    |    |    |    |    |
worst................................best
```

The quality values of the FASTQ files are on top. The numbers in the middle of the scale
from 0 to 40 are called Phred scores. The numbers represent the error probabilities  via the formula:

Error=10ˆ(-P/10) 
It is basically summarized as:

- P=0 means 1/1 (100% probability of error)
- P=10 means 1/10 (10% probability of error)
- P=20 means 1/100 (1% probability of error)
- P=30 means 1/1000 (0.1% probability of error)
- P=40 means 1/10000 (0.01% probability of error)

---

# FASTQ quality encoding specificities

There was a time when instrumentation makers could not decide at what
character to start the scale. The **current standard** shown above is the so-called Sanger (+33)
format where the ASCII codes are shifted by 33. There is the so-called +64 format that
starts close to where the other scale ends.

<div class="figure" style="text-align: center">
<img src="images/qualityscore.png" alt="FASTQ encoding values" width="80%" />
<p class="caption">FASTQ encoding values</p>
</div>

---

# FASTQ Header informations

Information is often encoded in the "free" text section of a FASTQ file.

<code>@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG</code> contains the following information:

- <code>EAS139</code>: the unique instrument name
- <code>136</code>: the run id
- <code>FC706VJ</code>: the flowcell id
- <code>2</code>: flowcell lane
- <code>2104</code>: tile number within the flowcell lane
- <code>15343</code>: ‘x’-coordinate of the cluster within the tile
- <code>197393</code>: ‘y’-coordinate of the cluster within the tile
- <code>1</code>: the member of a pair, 1 or 2 (paired-end or mate-pair reads only)
- <code>Y</code>: Y if the read is filtered, N otherwise
- <code>18</code>: 0 when none of the control bits are on, otherwise it is an even number
- <code>ATCACG</code>: index sequence

This information is specific to a particular instrument/vendor and may change with different
versions or releases of that instrument.

---

# Quality control

---

## Why QC'ing your reads ?

**Try to answer to (not always) simple questions :**
--

- Are the generated sequences conform to the expected level of performance?
  - Size
  - Number of reads
  - Quality
- Residual presence of adapters or indexes ?
- Are there (un)expected technical biases?
- Are there (un)expected biological biases?

<div class="alert comment"><svg viewBox="0 0 512 512" xmlns="http://www.w3.org/2000/svg" style="height:1em;fill:currentColor;position:relative;display:inline-block;top:.1em;">  [ comment ]  <path d="M504 256c0 136.997-111.043 248-248 248S8 392.997 8 256C8 119.083 119.043 8 256 8s248 111.083 248 248zm-248 50c-25.405 0-46 20.595-46 46s20.595 46 46 46 46-20.595 46-46-20.595-46-46-46zm-43.673-165.346l7.418 136c.347 6.364 5.609 11.346 11.982 11.346h48.546c6.373 0 11.635-4.982 11.982-11.346l7.418-136c.375-6.874-5.098-12.654-11.982-12.654h-63.383c-6.884 0-12.356 5.78-11.981 12.654z"></path></svg> Quality control without context leads to misinterpretation</div>

---
# Quality control for FASTQ files

- FastQC <a name=cite-fastqc></a>([Andrews, 2010](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/))
  - QC for (Illumina) FastQ files
  - Command line fastqc or graphical interface
  - Complete HTML report to spot problem originating from sequencer, library preparation, contamination
  - Summary graphs and tables to quickly assess your data

<div class="figure" style="text-align: center">
<img src="images/fastqc.png" alt="FastQC software" width="40%" />
<p class="caption">FastQC software</p>
</div>

---
class: tp
<img src="images/TP.png" class="handson">
# Hands-on : Quality control

- Switch to <a href="TP.html#3_Quality_control">TP document</a>
- Quality control

---
class: heading-slide, middle, center
# Reads cleaning
---

## Objectives

- Detect and remove sequencing adapters (still) present in the FastQ files
- Filter / trim reads according to quality (as plotted in FastQC)

## Tools

- Simple & fast : Sickle <a name=cite-sickle></a>([Joshi and Fass, 2011](#bib-sickle)) (quality), cutadapt <a name=cite-cutadapt></a>([Martin, 2011](#bib-cutadapt)) (adpater removal)
- Ultra-configurable : Trimmomatic 
- All in one & ultra-fast : fastp <a name=cite-fastp></a>([Zhou, Chen, Chen, and Gu, 2018](https://dx.doi.org/10.1093/bioinformatics/bty560))

<div class="figure" style="text-align: center">
<img src="images/fastp_wkwf.png" alt="FASTQ encoding values" width="55%" />
<p class="caption">FASTQ encoding values</p>
</div>

---
class: tp
<img src="images/TP.png" class="handson">
#  Hands-on : reads cleaning with fastp

- Switch to <a href="TP.html#4_Reads_cleaning_with_fastp">TP document</a>
- Reads cleaning with fastp

---
class: heading-slide, middle, center
# Mapping

---
# Mapping

- Map short reads to a reference genome is predict the locus where a read comes from.
- The result of a mapping is the list of the most probable regions with an associated probability.

---
# Reference

It can be everything containing DNA information:
- Complete genome
- Assembly
- Set of contigs
- Set of sequences
- Genes, non-coding RNA...

For mapping, references have to be stored in a <code>FASTA</code> file.

---
class: heading-slide, middle, center
# FASTA format

---
# Informations inside

The FASTA format is used to represent sequence information. The format is very simple:
- A <code>></code> symbol on the FASTA header line indicates a fasta record start.
- A string of letters called the sequence id may follow the <code>></code> symbol.
- The header line may contain an arbitrary amount of text (including spaces) on the
same line.
- Subsequent lines contain the sequence.

<i>Example</i>

```bash
>foo
ATGCC
>bar other optional text could go here
CCGTA
>bidou
ACTGCAGT
TTCGN
>repeatmasker
ATGTGTcggggggATTTT
>prot2; my_favourite_prot
MTSRRSVKSGPREVPRDEYEDLYYTPSSGMASP
```

---

# FASTA syntax

The lack of a definition of the FASTA format and its apparent simplicity can be a source of
some of the most confounding errors in bioinformatics. Since the format appears so exceed-
ingly straightforward, software developers have been tacitly assuming that the properties
they are accustomed to are required by some standard - whereas no such thing exists.

## Common problems

- Some tools need 60 characters per line
- Some tools ignore anything following the first space in the header line
- Some tools are very restrictive on the alphabet used
- Some tools require uppercase letters
- seqkit <a name=cite-shen2016seqkit></a>([Shen, Le, Li, and Hu, 2016](#bib-shen2016seqkit)) saves your life

---

# FASTA formating

## Good practices

The sequence lines should always wrap at the same width (with the exception of the
last line). Some tools will fail to operate correctly and may not even warn the users if
this condition is not satisfied. The following is technically a valid FASTA but it may
cause various subtle problems.

```bash
>foo
ATGCATGCATGCATGCATGC
ATGCATGCA
TGATGCATGCATGCATGCA
```

should be reformated to

```bash
>foo
ATGCATGCATGCATGCATGC
ATGCATGCATGATGCATGCA
TGCATGCA
```

<i>Can be easily to with seqkit ([Shen, Le, Li, et al., 2016](#bib-shen2016seqkit))</i>

```bash
seqkit seq -w 60 seqs.fa > seqs2.fa
```

---

# FASTA Header

Some data repositories will format FASTA headers to include structured information. Tools
may operate differently when this information is present in the FASTA header. Below is a list of the
recognized FASTA header formats.

<div class="figure" style="text-align: center">
<img src="images/FASTA_headers.png" alt="FASTA header examples" width="50%" />
<p class="caption">FASTA header examples</p>
</div>

---
class: heading-slide, middle, center
# Alignment

---
# Alignment strategies

```bash
GAAGCTCTAGGATTACGATCTTGATCGCCGGGAAATTATGATCCTGACCTGAGTTTAAGGCATGGACCCATAA
                 ATCTTGATCGCCGAC----ATT              # GLOBAL
                 ATCTTGATCGCCGACATT                  # LOCAL, with soft clipping
```

## Global alignment

Global alignments, which attempt to align every residue in every sequence, are most useful when the sequences in the query set are similar and of roughly equal size. (This does not mean global alignments cannot start and/or end in gaps.) A general global alignment technique is the <code>Needleman–Wunsch algorithm</code>, which is based on dynamic programming.

## Local alignment

Local alignments are more useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context. The <code>Smith–Waterman algorithm</code> is a general local alignment method based on the same dynamic programming scheme but with additional choices to start and end at any place.

---
# Seed-and-extend especially adapted to NGS data

Seed-and-extend mappers are a class of read mappers that break down each read sequence into seeds (i.e., smaller segments) to ﬁnd locations in the reference genome that closely match the read

.pull-left[
1. First, the mapper obtains a read
2. Second, the mapper selects smaller DNA segments from the read to
serve as seeds
3. Third, the mapper indexes a data structure with each seed to
obtain a list of possible locations within the reference genome that could result in
a match
4. Fourth, for each possible location in the list, the mapper obtains the
corresponding DNA sequence from the reference genome
5. Fifth, the mapper aligns the read sequence to the reference sequence, using an expensive sequence
alignment (i.e., veriﬁcation) algorithm to determine the similarity between the read
sequence and the reference sequence.
]
.pull-right[
<img src="images/seed_and_extend.png" width="90%" style="display: block; margin: auto;" />
]
---
# Mapping tools

- Short reads: BWA <a name=cite-bwa></a>([Li, 2013](#bib-bwa))/ BOWTIE <a name=cite-langmead2012fast></a>([Langmead and Salzberg, 2012](#bib-langmead2012fast)) -> for `DNAseq`!

---
class: tp
<img src="images/TP.png" class="handson">
# Hands-on: mapping with bwa

- Switch to <a href="TP.html#5_Mapping_with_bwa">TP document</a>
- Mapping with bwa

---
class: heading-slide, middle, center
# Sequence Alignment Format (SAM)

---
# SAM / BAM formats

The SAM/BAM formats are so-called Sequence Alignment Maps. These files typically represent
the results of aligning a FASTQ file to a reference FASTA file and describe the individual,
pairwise alignments that were found. Different algorithms may create different alignments
(and hence BAM files)

<img src="images/SAM_format.jpg" width="70%" style="display: block; margin: auto;" />
---
# SAM FLAG

[FLAGS](https://broadinstitute.github.io/picard/explain-flags.html) contain a lot of informations.

---
# SAM CIGAR

---
# SAM toolbox

## Samtools & Picard tools

Samtools <a name=cite-samtools></a>([Li, Handsaker, Wysoker, Fennell, Ruan, Homer, Marth, Abecasis, and Durbin, 2009](#bib-samtools)) and Picard tools <a name=cite-picardtools></a>([Broad Institute, 2018](#bib-picardtools)) are Swiss-knifes for operating of SAM/BAM format

- Visualize
- Filter
- Stats
- Index
- Merge
- ...

---
# MultiQC: a tool to synthesize results

* MultiqQC <a name=cite-multiqc></a>([Ewels, Magnusson, Lundin, and Käller, 2016](#bib-multiqc)) allow the aggregation of individual reports from FastQC, Fastp, Trimmomactic, Cutadapt and much more
* [97 supported tools](https://multiqc.info/#supported-tools)

---
class: tp
<img src="images/TP.png" class="handson">
# Hands-on: mapping with bwa

- Switch to <a href="TP.html#6_Synthesis_of_all_steps_done_with_MultiQC">TP document</a>
- Synthesis of all steps done with MultiQC

---
class: heading-slide, middle, center
# Visualization

---
# Visualization

- Genome browsers allow to display data visually, in context with other informations (coding genes and other features postions, functionnal annotations, … )

- A long list of softwares to choose from, with numerous features :
    - "optimized" for model organisms (mouse, human,…), allow non-eukaryotic and private genomes
    - allow multiple "tracks" with different information (read mapping, variant calling, …)
    - configurable
    - data integration
    - allow genome edition and annotation

The tendance is to move from desktop software to "full feature" software in your browser. 
---
# Some software

- **IGV** ( Integrative Genomics Viewer ) : open source, desktop or webapp, memory efficient. **A reference**.

- **IGB** (Integrated Genome Browser) : Open Source, desktop

- **Artemis** : Open Source, desktop, developped since 1999, fitted for prokaryotic genome. Allow genome annotation.

- **Jbrowse** : Open Source, web , highly configurable.

You will find in [this document](https://wikis.univ-lille.fr/bilille/genome_browser) a partial comparison between 7 popular genome browsers.

For the hands-on, we will use **IGV** on the web.

---
# What about Long Reads ?

As global quality and error profiles ar different, ,algorithms and tools are different for long reads.

The raw read format is also different

- PacBio :
  - internal read correction
  - built in software for QC / correction
  - QC : nanoPlot <a name=cite-101093bioinformaticsbty149></a>([De Coster, D’Hert, Schultz, Cruts, and Van Broeckhoven, 2018](https://doi.org/10.1093/bioinformatics/bty149))
  - Correction (hybrid) : LorDec <a name=cite-salmela2014lordec></a>([Salmela and Rivals, 2014](#bib-salmela2014lordec))
  - Alignment = minimap2 <a name=cite-li2018minimap2></a>([Li, 2018](#bib-li2018minimap2)), BLASR <a name=cite-chaisson2012mapping></a>([Chaisson and Tesler, 2012](#bib-chaisson2012mapping))
- NanoPore :
  - Caution to basecaller / chemistry version !
  - QC : nanoPlot
  - Correction : Canu <a name=cite-koren2017canu></a>([Koren, Walenz, Berlin, Miller, Bergman, and Phillippy, 2017](#bib-koren2017canu)), MECAT <a name=cite-xiao2017mecat></a>([Xiao, Chen, Xie, Chen, Wang, Han, Luo, and Xie, 2017](#bib-xiao2017mecat))
  - Alignment : minimap2 ([Li, 2018](#bib-li2018minimap2))

---
# Bioinformaticians best friends

* [labworm](https://labworm.com/category/for-the-developer)
* [biostats blog](https://www.biostars.org/)
* [biostars books](https://biostar.myshopify.com/)
* [bionfo-fr.net](https://bioinfo-fr.net/)
* [seqanswers](http://seqanswers.com/)
* [IFB community](https://community.france-bioinformatique.fr/)

---

# References

<a name=bib-Allue-Guardiae01052-18></a>[Allué-Guardia, A., E. C. Nyong,
S. S. K. Koenig, et al.](#cite-Allue-Guardiae01052-18) (2019). "Closed
Genome Sequence of Escherichia coli K-12 Group Strain C600". In:
_Microbiology Resource Announcements_ 8.2. Ed. by J. A. Maresca. DOI:
[10.1128/MRA.01052-18](https://doi.org/10.1128%2FMRA.01052-18). eprint:
https://mra.asm.org/content/8/2/e01052-18.full.pdf. URL:
[https://mra.asm.org/content/8/2/e01052-18](https://mra.asm.org/content/8/2/e01052-18).

<a name=bib-fastqc></a>[Andrews, S.](#cite-fastqc) (2010). _FastQC A
Quality Control tool for High Throughput Sequence Data_. URL:
[http://www.bioinformatics.babraham.ac.uk/projects/fastqc/](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/).

<a name=bib-picardtools></a>[Broad Institute](#cite-picardtools)
(2018). _Picard Tools_. <URL: http://broadinstitute.github.io/picard/>.

<a name=bib-chaisson2012mapping></a>[Chaisson, M. J. and G.
Tesler](#cite-chaisson2012mapping) (2012). "Mapping single molecule
sequencing reads using basic local alignment with successive refinement
(BLASR): application and theory". In: _BMC bioinformatics_ 13.1, p.
238.

<a name=bib-101093bioinformaticsbty149></a>[De Coster, W., S. D’Hert,
D. T. Schultz, et al.](#cite-101093bioinformaticsbty149) (2018).
"NanoPack: visualizing and processing long-read sequencing data". In:
_Bioinformatics_ 34.15, pp. 2666-2669. ISSN: 1367-4803. DOI:
[10.1093/bioinformatics/bty149](https://doi.org/10.1093%2Fbioinformatics%2Fbty149).
eprint:
https://academic.oup.com/bioinformatics/article-pdf/34/15/2666/25230836/bty149.pdf.
URL:
[https://doi.org/10.1093/bioinformatics/bty149](https://doi.org/10.1093/bioinformatics/bty149).

<a name=bib-multiqc></a>[Ewels, P., M. Magnusson, S. Lundin, et
al.](#cite-multiqc) (2016). "MultiQC: summarize analysis results for
multiple tools and samples in a single report". In: _Bioinformatics_
32.19, pp. 3047-3048.

---
# References
<a name=bib-sickle></a>[Joshi, N. and J. Fass](#cite-sickle) (2011).
_Sickle: a sliding-window, adaptive, quality-based trimming tool for
FastQ files_.

<a name=bib-koren2017canu></a>[Koren, S., B. P. Walenz, K. Berlin, et
al.](#cite-koren2017canu) (2017). "Canu: scalable and accurate
long-read assembly via adaptive k-mer weighting and repeat separation".
In: _Genome research_ 27.5, pp. 722-736.

<a name=bib-langmead2012fast></a>[Langmead, B. and S.
Salzberg](#cite-langmead2012fast) (2012). _Fast gapped-read alignment
with bowtie 2 Nat Methods 9 (4): 357-359. pmid: 22388286 View Article
PubMed_.

<a name=bib-bwa></a>[Li, H.](#cite-bwa) (2013). "Aligning sequence
reads, clone sequences and assembly contigs with BWA-MEM". In: _arXiv
preprint arXiv:1303.3997_.

<a name=bib-li2018minimap2></a>[Li, H.](#cite-li2018minimap2) (2018).
"Minimap2: pairwise alignment for nucleotide sequences". In:
_Bioinformatics_ 34.18, pp. 3094-3100.

<a name=bib-samtools></a>[Li, H., B. Handsaker, A. Wysoker, et
al.](#cite-samtools) (2009). "The sequence alignment/map format and
SAMtools". In: _Bioinformatics_ 25.16, pp. 2078-2079.

<a name=bib-cutadapt></a>[Martin, M.](#cite-cutadapt) (2011). "Cutadapt
removes adapter sequences from high-throughput sequencing reads". In:
_EMBnet. journal_ 17.1, pp. 10-12.

<a name=bib-salmela2014lordec></a>[Salmela, L. and E.
Rivals](#cite-salmela2014lordec) (2014). "LoRDEC: accurate and
efficient long read error correction". In: _Bioinformatics_ 30.24, pp.
3506-3514.

---
# References

```
## Warning in `[[.BibEntry`(x, ind): subscript out of bounds
```

<a name=bib-shen2016seqkit></a>[Shen, W., S. Le, Y. Li, et
al.](#cite-shen2016seqkit) (2016). "SeqKit: a cross-platform and
ultrafast toolkit for FASTA/Q file manipulation". In: _PloS one_ 11.10.

<a name=bib-xiao2017mecat></a>[Xiao, C., Y. Chen, S. Xie, et
al.](#cite-xiao2017mecat) (2017). "MECAT: fast mapping, error
correction, and de novo assembly for single-molecule sequencing reads".
In: _nature methods_ 14.11, p. 1072.

<a name=bib-fastp></a>[Zhou, Y., Y. Chen, S. Chen, et al.](#cite-fastp)
(2018). "fastp: an ultra-fast all-in-one FASTQ preprocessor". In:
_Bioinformatics_ 34.17, pp. i884-i890. ISSN: 1367-4803. DOI:
[10.1093/bioinformatics/bty560](https://doi.org/10.1093%2Fbioinformatics%2Fbty560).
eprint:
http://academic.oup.com/bioinformatics/article-pdf/34/17/i884/25702346/bty560.pdf.
URL:
[https://dx.doi.org/10.1093/bioinformatics/bty560](https://dx.doi.org/10.1093/bioinformatics/bty560).