Analysing metabarcoding data using dada2

Nicolas Henry

Introduction

Metabarcoding at the lab

Sequencing generations


Paired-end sequencing


Next steps: from reads to an ASV table

FASTQ records

Overview

Two files (R1 and R2) per sequencing run or per sample (first 100 nucleotides shown):

R1:

@M01522:260:000000000-C3YFC:1:2114:21886:9214 1:N:0:CTTGTA
CCTACGGGGGGCAGCAGTAGGGAATTTTGCGCAATGGGCGAAAGCCTGACGCAGCAACGCCGCGTGATCGATGAAGCTTCTAGGAGCGTAAAGATCTGTC
+
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEGGFGGGGGGGGGGGGGGFGGGFGGGGGGGGGGGGGGGGGGGGGGGGG

R2:

@M01522:260:000000000-C3YFC:1:2114:21886:9214 2:N:0:CTTGTA
GACTACCAGGGTATCTAATCCTGTTCGCTCCCCACGCTTTCGTCCCTCAGCGTCAGTTTTAGGCCAGAAAGTTGCCTTCGCCATTGGTGTTCCTTCTGAT
+
CCCCCFGGGGGGGGGGGGGGGGGGGCEGGGGGGGGGGEEEFGG>GGGGGGGGGGGGFGGGG?FGGGGGFGFGGFGGFGGCGEGGGGGGGG>FGGGGGGGG

Per record: identifier, sequence, quality

Identifier

Sequence


  • The tag tells you which sample the read comes from

  • The primer used for the amplification

  • The targeted sequence (metabarcode)

Quality

Phred quality score (Q) encoded using ASCII characters:



\(Q\) is related to the base calling error probability (\(P\)):

\[Q = -10\log_{10}P\]

\[P = 10^{-Q/10}\]

Quality: \(P\) as a function of \(Q\)
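
As a rough stand-in for the plot, the relationship can be reproduced in a few lines of base R (the range of Q values is arbitrary):

# Base calling error probability P for a range of Phred scores Q
Q <- 0:40
P <- 10^(-Q / 10)          # e.g. Q = 20 -> P = 0.01, Q = 30 -> P = 0.001
plot(Q, P, type = "l", log = "y",
     xlab = "Phred quality score (Q)",
     ylab = "Base calling error probability (P)")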

Bioinformatic pipelines

Main strategies

Some tools


  • Genetic clustering based approaches producing OTUs:
  • Denoising approaches producing ESVs:

Many tools

Workflow manager

Nextflow and Snakemake are the most popular workflow managers.

DADA2, what is it?


DADA2, the full workflow

  • Quality assessment → plotQualityProfile()
  • Length trimming → filterAndTrim()
  • Quality filtering → filterAndTrim()
  • Denoising → dada()
  • Merging paired-end reads → mergePairs()
  • Chimera identification → removeBimeraDenovo()
  • Taxonomic assignment → assignTaxonomy()

Before using DADA2

Tags and primers

Tags are used to encode the sample provenance of the reads. Reads need to be grouped by sample of origin (demultiplexing).

Primer sequences will bias the error model built by DADA2 and need to be removed (primer trimming).

Both tasks can be achieved using Cutadapt, a command-line tool that finds and removes adapters from reads in an error-tolerant way (Martin 2011).

Demultiplexing using cutadapt

If your tags are in a fasta file with the corresponding sample names as headers, you can use this command line:

cutadapt \
    -g file:${BARCODES} \
    -o {name}_R1.fastq.gz \
    -p {name}_R2.fastq.gz \
    ${R1_FILE} \
    ${R2_FILE}
# -g file:${BARCODES}: tags to look for at the beginning (5') of R1 reads; ${BARCODES} is a fasta file containing the tags
# -o: demultiplexed R1 output files; {name} will be replaced by the name of the matching tag
# -p: same as above but for R2 files
# ${R1_FILE}: input R1 file
# ${R2_FILE}: input R2 file

You get as many R1 and R2 file pairs as you have samples.

Help! My reads are in mixed orientation

Run cutadapt a second time, looking for tags in R2.

Keep the outputs of the two rounds separated for the rest of the workflow.

Primer removal using cutadapt

To remove forward and reverse primer sequences from paired-end read files:

cutadapt \
    -e 0.1 \
    -g ${PRIMER_FORWARD} \
    -G ${PRIMER_REVERSE} \
    --report=minimal \
    --discard-untrimmed \
    --minimum-length ${MIN_LENGTH} \
    --no-indels \
    -o ${FORWARD} \
    -p ${REVERSE} \
    ${R1_FILE} \
    ${R2_FILE} 1> ${LOG}
# -e 0.1: error tolerance (default value)
# -g: forward primer
# -G: reverse primer
# --report=minimal: print primer-trimming statistics
# --discard-untrimmed: reads not containing the primers are discarded
# --minimum-length: reads shorter than this threshold after trimming are discarded
# --no-indels: no indels allowed when matching primers to reads
# -o: R1 output
# -p: R2 output
# ${R1_FILE}: R1 input
# ${R2_FILE}: R2 input; 1> ${LOG} writes the report to the file ${LOG}

As for demultiplexing, if reads are in mixed orientation, run cutadapt twice.

DADA2 workflow: reads preparation

Check reads quality

Check the overall quality of a sequencing run using plotQualityProfile()

Outside of DADA2, you can also use FastQC.

If the overall quality is too low, you will probably have to resequence your samples

A quality drop is often observed at the end of the reads.
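
A minimal sketch in R, assuming the demultiplexed, primer-trimmed fastq files sit in a "trimmed/" directory (path and file-name pattern are hypothetical):

library(dada2)

fnFs <- sort(list.files("trimmed", pattern = "_R1.fastq.gz", full.names = TRUE))
fnRs <- sort(list.files("trimmed", pattern = "_R2.fastq.gz", full.names = TRUE))

# aggregate = TRUE overlays all samples in a single quality profile
plotQualityProfile(fnFs, aggregate = TRUE)
plotQualityProfile(fnRs, aggregate = TRUE)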

Trimming and filtering

Trimming the reads at a given length will improve the overall read quality.

Danger zone

After trimming, make sure that forward and reverse reads are still long enough to overlap

Reads of low quality and/or with ambiguous nucleotides (N) after trimming are filtered out.

Both length trimming and quality filtering are achieved using the function filterAndTrim()
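
A minimal sketch, continuing from the fnFs/fnRs file lists above; truncation lengths and filtering thresholds are example values to adapt to your own run:

filtFs <- file.path("filtered", basename(fnFs))
filtRs <- file.path("filtered", basename(fnRs))

out <- filterAndTrim(
  fnFs, filtFs, fnRs, filtRs,
  truncLen = c(220, 180),  # trim forward/reverse reads (keep enough overlap!)
  maxN = 0,                # discard reads with ambiguous nucleotides (N)
  maxEE = c(2, 2),         # maximum expected errors per read
  truncQ = 2,              # truncate at the first base with quality <= 2
  compress = TRUE,
  multithread = TRUE
)
head(out)                  # reads in / reads out per sample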

DADA2 workflow: denoising approach

Denoising

Is sequence \(i\) generated by sequence \(j\) because of sequencing errors?

In order to decide whether \(i\) is an error of \(j\) and perform denoising using DADA2, we need to compute:

  • the error rate \(\lambda_{ji}\)
  • the abundance p-value \(p_A(j \rightarrow i)\)

Error rate

The rate at which an amplicon read with sequence \(i\) is produced from sample sequence \(j\) reduces to the product, over the \(L\) aligned nucleotides, of the transition probabilities:

\[\lambda_{ji} = \prod_{l=0}^L p(j(l) \rightarrow i(l),q_i(l))\]
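
As a toy illustration in R (all numbers made up, and ignoring that in DADA2 each transition probability actually depends on the observed quality score \(q_i(l)\)):

# 100 aligned positions: 98 agree between i and j, 2 differ
p_match   <- 0.999   # probability of reading the correct base at a position
p_error   <- 1e-3    # probability of the specific observed substitution
lambda_ji <- p_match^98 * p_error^2
lambda_ji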

The abundance p-value

The abundance p-value (\(p_A\)) is the probability that \(a_i\) or more reads of sequence \(i\) are produced from \(j\) during amplification and sequencing.

\[p_A(j \rightarrow i) = \frac{1}{1- \rho_{\mathrm{pois}}(n_j\lambda_{ji},0)} \sum_{a=a_i}^\infty \rho_{\mathrm{pois}}(n_j\lambda_{ji},a)\]

A low p-value indicates that it is unlikely that \(i\) is merely noise produced by amplifying and sequencing \(j\).
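
The formula can be reproduced with base R's Poisson functions; all numbers below are made up, just to illustrate:

n_j       <- 5000    # reads attributed to the candidate parent sequence j
lambda_ji <- 1e-4    # error rate at which j produces i
a_i       <- 3       # observed abundance of sequence i

mu  <- n_j * lambda_ji
# P(X >= a_i) under a Poisson(mu) model, conditioned on X >= 1
p_A <- ppois(a_i - 1, mu, lower.tail = FALSE) / (1 - dpois(0, mu))
p_A  # the smaller this value, the less likely i is mere noise from j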

The divisive partitioning algorithm: step 1

The divisive partitioning algorithm: step 2

The divisive partitioning algorithm: step 3

The divisive partitioning algorithm: step 4

The divisive partitioning algorithm: step 5

DADA2 workflow: denoising in practice

Learn the error model

How do we compute \(\lambda_{ji}\) if we don’t know the error rate for each possible transition?

The error rates will be learned from the data using learnErrors()

The learnErrors method learns this error model from the data, by alternating estimation of the error rates and inference of sample composition until they converge on a jointly consistent solution

Visualise the error model

You can visualise the estimated error rates using the function plotErrors()
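
A minimal sketch, continuing from the filtered files produced by filterAndTrim() above:

errF <- learnErrors(filtFs, multithread = TRUE)
errR <- learnErrors(filtRs, multithread = TRUE)

# observed error rates (points) vs. the fitted error model (black line)
# and the rates expected from the nominal Q scores (red line)
plotErrors(errF, nominalQ = TRUE)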

Run the DADA2 algorithm

  • After dereplicating your sequences (derepFastq()), denoise using the function dada()

  • By default sample inference is performed on each sample individually (pool = FALSE).

  • If you are interested in rare variants present in several samples, use pool = TRUE

  • When working on big data, pool = "pseudo" is an interesting alternative to pool = TRUE
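
A minimal sketch, continuing from the filtered files and the error models above:

derepFs <- derepFastq(filtFs)
derepRs <- derepFastq(filtRs)

dadaFs <- dada(derepFs, err = errF, multithread = TRUE)  # per-sample inference (pool = FALSE)
dadaRs <- dada(derepRs, err = errR, multithread = TRUE)

# alternatives:
# dada(derepFs, err = errF, pool = TRUE)      # pool all samples (better for rare variants)
# dada(derepFs, err = errF, pool = "pseudo")  # pseudo-pooling, cheaper on big datasets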

DADA2 workflow: build the ASV table

Merge paired reads

Merge forward and reverse reads using mergePairs()

  • minOverlap: minimum size of overlap
  • maxMismatch: maximum number of mismatches
  • justConcatenate: in case your reads don’t overlap
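
A minimal sketch, continuing from the denoised objects above; the thresholds shown are DADA2's defaults made explicit:

mergers <- mergePairs(
  dadaFs, derepFs, dadaRs, derepRs,
  minOverlap  = 12,   # minimum size of the overlap
  maxMismatch = 0,    # maximum number of mismatches in the overlap
  verbose     = TRUE
)

# ASV table: one row per sample, one column per sequence variant
seqtab <- makeSequenceTable(mergers)
dim(seqtab)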

Chimeras

Chimeric sequences are identified if they can be exactly reconstructed by combining a left-segment and a right-segment from two more abundant “parent” sequences.
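
A minimal sketch on the ASV table built above:

seqtab.nochim <- removeBimeraDenovo(
  seqtab,
  method      = "consensus",  # detect per sample, then take a consensus call
  multithread = TRUE,
  verbose     = TRUE
)
sum(seqtab.nochim) / sum(seqtab)  # fraction of reads kept after chimera removal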

Assign taxonomy

Reference sequence database

Makes the link between nucleotide sequences and taxonomy

Broad reference database: GenBank

BLAST-like approaches

Looking for the closest reference sequences to inherit their taxonomy

Best-hit or LCA (lowest common ancestor) approaches

Global alignment (the VSEARCH usearch_global command) against a specialised reference database is recommended.

vsearch \
  --usearch_global ${INPUT} \
  --db ${REFDB} \
  --id 0.80 \
  --maxaccepts 0 \
  --top_hits_only \
  --userout ${TMP} \
  --userfields "query+id1+target"
# --usearch_global ${INPUT}: fasta file of the sequences to assign
# --db ${REFDB}: reference database
# --id 0.80: minimum similarity threshold
# --maxaccepts 0: maximum number of hits (0 = unlimited)
# --top_hits_only: keep only the best hits
# --userout ${TMP}: output file
# --userfields: fields to export

Classifiers (machine learning)

  • Sequence taxonomic classification based on k-mers composition.
  • No sequence similarity threshold but a confidence score instead.
  • The RDP classifier (a naïve Bayes method) is very popular and gives good results, at least for 16S/18S
  • IDTAXA, a more recent method, tends to overclassify less
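
In DADA2, a naïve Bayes (RDP-style) classifier is available through assignTaxonomy(). A minimal sketch; the reference file name is a placeholder for a DADA2-formatted training set (e.g. a SILVA or PR2 release downloaded separately):

taxa <- assignTaxonomy(
  seqtab.nochim,
  refFasta    = "reference_trainset.fa.gz",  # placeholder file name
  minBoot     = 80,                          # bootstrap confidence threshold
  multithread = TRUE
)
head(taxa)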

Phylogenetic placement

Extra steps

Post-clustering with LULU

Abundance-aware clustering:
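
A minimal sketch, assuming an ASV table and a match list of pairwise ASV similarities (typically produced beforehand with vsearch or BLAST); the file names are placeholders:

library(lulu)

asv_table  <- read.csv("asv_table.csv", row.names = 1)  # ASVs as rows, samples as columns
match_list <- read.table("match_list.txt",
                         col.names = c("OTUid", "hit", "match"))

curated <- lulu(asv_table, match_list)
curated$curated_table  # ASV table after merging likely artefacts into their parents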

Decontamination

Frequency-based contaminant identification with the R package decontam:
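
A minimal sketch of the frequency-based method; dna_conc is a hypothetical vector of per-sample DNA concentrations matching the rows of the ASV table:

library(decontam)

contam <- isContaminant(seqtab.nochim, conc = dna_conc, method = "frequency")
table(contam$contaminant)                        # how many ASVs are flagged
seqtab.clean <- seqtab.nochim[, !contam$contaminant]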

Now it is your turn!

References

Antich, Adrià, Creu Palacin, Owen S Wangensteen, and Xavier Turon. 2021. “To Denoise or to Cluster, That Is Not the Question: Optimizing Pipelines for COI Metabarcoding and Metaphylogeography.” BMC Bioinformatics 22: 1–24.
Callahan, Benjamin J, Paul J McMurdie, Michael J Rosen, Andrew W Han, Amy Jo A Johnson, and Susan P Holmes. 2016. “DADA2: High-Resolution Sample Inference from Illumina Amplicon Data.” Nature Methods 13 (7): 581–83.
Frøslev, Tobias Guldberg, Rasmus Kjøller, Hans Henrik Bruun, Rasmus Ejrnæs, Ane Kirstine Brunbjerg, Carlotta Pietroni, and Anders Johannes Hansen. 2017. “Algorithm for Post-Clustering Curation of DNA Amplicon Data Yields Reliable Biodiversity Estimates.” Nature Communications 8 (1): 1188.
Hakimzadeh, Ali, Alejandro Abdala Asbun, Davide Albanese, Maria Bernard, Dominik Buchner, Benjamin Callahan, J Gregory Caporaso, et al. 2023. “A Pile of Pipelines: An Overview of the Bioinformatics Software for Metabarcoding Data Analyses.” Molecular Ecology Resources.
Martin, Marcel. 2011. “Cutadapt Removes Adapter Sequences from High-Throughput Sequencing Reads.” EMBnet.journal 17 (1): 10–12.
Rosen, Michael J, Benjamin J Callahan, Daniel S Fisher, and Susan P Holmes. 2012. “Denoising PCR-Amplified Metagenome Data.” BMC Bioinformatics 13 (1): 1–16.