Fabrice Armougom, MIO
Marc Garel, MIO
Nicolas Henry, ABiMS
September 1, 2023
To begin with, download the course repository on your computer or virtual machine. To do so, visit the ANF-metabarcoding repository on GitHub and download it as a zip file. You can also download the repository directly from the terminal using this command:

Once on your machine, unzip the file and place the resulting folder in the most convenient location (~/Documents/ for example).
Save as a variable the path to the folder where you will place the reference databases:

```r
refdb_folder <- here::here("data", "refdb")
refdb_folder
```

```
[1] "/home/nhenry/Documents/anf-metabarcoding/data/refdb"
```
In this course, we use the :: operator as much as possible. It allows calling a function from a given package without having to load the entire package using library(package). This way, it is clear where each function comes from. In this case, the function here() comes from the package here.
The reason why we use here::here() is that when you render an Rmarkdown file, the working directory is the directory containing the Rmarkdown file, whereas here::here() points to the root of the R project.
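You can see the difference for yourself (the paths below are illustrative):

```r
getwd()       # inside a rendered Rmd: the folder containing the .Rmd file
here::here()  # always the root of the R project, wherever the code runs
```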
Now, let’s create the folder directly from R:
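A minimal way to do it, using the refdb_folder variable defined above:

```r
# recursive = TRUE also creates missing parent folders;
# showWarnings = FALSE silences the warning if the folder already exists
dir.create(refdb_folder, recursive = TRUE, showWarnings = FALSE)
```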
You can also create the folder from RStudio in the Files window.
Tip: You can access the documentation of any R function using ? in the console. If you want to know everything about the function dir.create(), simply run ?dir.create().
The SILVA reference database, commonly used to assign 16S metabarcoding data, will be used in this practical. In case you are working with 18S sequences, you will get better assignments using PR2 or EukRibo, especially if you are interested in protists.

The following code downloads the DADA2-formatted SILVA reference databases. If you are not comfortable with it, you can simply download the reference database files from your web browser here and here.
```r
# check the current download timeout (in seconds):
# the default is often too short for these large files
getOption("timeout")
```

```
[1] 60
```

```r
# so we change timeout to be 20 minutes
options(timeout = 1200)

# we save in variables the paths to the refdb files in the working space
silva_train_set <- file.path(refdb_folder,
                             "silva_nr99_v138.1_train_set.fa.gz")

silva_species_assignment <- file.path(refdb_folder,
                                      "silva_species_assignment_v138.1.fa.gz")

# then we download the files if they don't already exist
if (!file.exists(silva_train_set)) {
  download.file(
    "https://zenodo.org/record/4587955/files/silva_nr99_v138.1_train_set.fa.gz",
    silva_train_set,
    quiet = TRUE
  )
}

if (!file.exists(silva_species_assignment)) {
  download.file(
    "https://zenodo.org/record/4587955/files/silva_species_assignment_v138.1.fa.gz",
    silva_species_assignment,
    quiet = TRUE
  )
}
```
In this practical, we will use R functions especially written for this course. The “classic” way to import functions is to use source() with the name of the R script to source. Instead, we use devtools::load_all(). This function sources all the scripts from the folder R/ along with the documentation in man/:
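```r
# load_all() defaults to the current project folder
devtools::load_all()
```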
Save the path to the directory containing your raw data (paired-end sequencing fastq files) in a variable named path_to_fastqs:
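For example, assuming the fastq files live in data/raw (adapt to your own layout):

```r
path_to_fastqs <- here::here("data", "raw")
```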
The gzipped (compressed) FASTQ formatted “forward” (R1) and “reverse” (R2) files are named as follows:

- ${SAMPLENAME}_R1.fastq.gz for the forward files
- ${SAMPLENAME}_R2.fastq.gz for the reverse files

We list the forward files using the function list.files(). The argument pattern gives you the possibility to select only the file names matching a regular expression. In our case, we select all file names ending with _R1.fastq.gz.
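A sketch of that call (full.names = TRUE keeps the full paths):

```r
fnFs <- sort(list.files(path_to_fastqs,
                        pattern = "_R1.fastq.gz",
                        full.names = TRUE))
```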
We do the same for the reverse files.
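```r
fnRs <- sort(list.files(path_to_fastqs,
                        pattern = "_R2.fastq.gz",
                        full.names = TRUE))
```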
To understand: what do the fnFs and fnRs variables contain?
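Next, we extract the sample names from the file names. A sketch, assuming the ${SAMPLENAME}_RX.fastq.gz pattern described above (the functions used are detailed just below):

```r
sample_names <- basename(fnFs) |>
  strsplit(split = "_") |>
  sapply(head, 1)
```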
To understand:

- basename(): removes the path to keep only the file name.
- |>: the R “pipe”. It allows you to chain functions, avoiding intermediate variables and nested parentheses. It basically transfers the output of the left expression to the input of the right expression. You need R ≥ 4.1 to use this pipe; otherwise use %>% from magrittr.
- strsplit(): splits a character string according to a defined pattern. Run ?strsplit for documentation.
- sapply(): applies a function to each element of a list or vector. The output is simplified to a vector.
Let’s go step by step. First, list the R1 file names:
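```r
# keep only the file names (first six shown)
basename(fnFs) |> head()
```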
[1] "S11B_R1.fastq.gz" "S1B_R1.fastq.gz" "S2B_R1.fastq.gz" "S2S_R1.fastq.gz"
[5] "S3B_R1.fastq.gz" "S3S_R1.fastq.gz"
We can see that the sample name is everything before the first _. With strsplit(), we can split each file name into a two-element vector. The result is a list of two-element vectors:
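```r
basename(fnFs) |>
  head() |>
  strsplit(split = "_")
```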
```
[[1]]
[1] "S11B" "R1.fastq.gz"

[[2]]
[1] "S1B" "R1.fastq.gz"

[[3]]
[1] "S2B" "R1.fastq.gz"

[[4]]
[1] "S2S" "R1.fastq.gz"

[[5]]
[1] "S3B" "R1.fastq.gz"

[[6]]
[1] "S3S" "R1.fastq.gz"
```
Now, we just have to extract the first element for each file name:
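```r
basename(fnFs) |>
  head() |>
  strsplit(split = "_") |>
  sapply(head, 1)
```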
[1] "S11B" "S1B" "S2B" "S2S" "S3B" "S3S"
Tip: you can achieve the same thing using regular expressions:
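A possible regex-based version (a sketch; it strips everything from the first underscore onwards):

```r
sample_names <- sub("_.*$", "", basename(fnFs))
```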
Regular expressions are extremely useful. If you are keen to learn how to use them, have a look here.
We use a custom function, qualityprofile(), implemented in R/preprocessing.R, to check the quality of the raw sequences.

Run ?qualityprofile to know more about this function.
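A hypothetical call (the exact arguments of this course-specific function may differ; check ?qualityprofile):

```r
# writes per-sample quality profiles of the forward and reverse reads to a pdf
qualityprofile(fnFs, fnRs, "quality_plots.pdf")
```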
Open the pdf file generated by qualityprofile().
We first create a folder in which to save the reads once they are trimmed:
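A sketch, assuming the trimmed reads go to data/trimmed:

```r
path_to_trimmed_reads <- here::here("data", "trimmed")
dir.create(path_to_trimmed_reads, recursive = TRUE, showWarnings = FALSE)
```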
The data you are working with correspond to the V3-V4 region, amplified using the primers Pro341F (CCTACGGGNBGCASCAG) and Pro805R (GACTACNVGGGTATCTAAT). Save the forward and reverse primers into variables:
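```r
primer_fwd <- "CCTACGGGNBGCASCAG"
primer_rev <- "GACTACNVGGGTATCTAAT"
```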
Let’s have a closer look at the sequences and find the primers. The forward reads should contain CCTACGGGNBGCASCAG:
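One way to inspect the first ten reads, assuming the Biostrings package is installed (the course may use a different call):

```r
Biostrings::readDNAStringSet(fnFs[1], format = "fastq", nrec = 10)
```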
```
DNAStringSet object of length 10:
     width seq                                               names
 [1]   293 CCTACGGGGGGCAGCAGTAGGGA...ACATCGGCTTAACCGATGAAGT M01522:260:000000...
 [2]   293 CCTACGGGTGGCACCAGTAGGGA...CGGGGCTTAACCTCGGAACTGC M01522:260:000000...
 [3]   292 CCTACGGGGCGCAGCAGGCGCGA...GGGACCGGGAGAGGTGTGAGGT M01522:260:000000...
 [4]   293 CCTACGGGGTGCAGCAGTAGGGA...TCAAAACTCCCAGTCTAGAGTT M01522:260:000000...
 [5]   291 CCTACGGGTGGCAGCAGTGGGGA...GCAGTGGAAACTGTTGGGCTTG M01522:260:000000...
 [6]   293 CCTACGGGATGCAGCAGGCGCGA...GGGACCGGGAGAGGTGTGGGGG M01522:260:000000...
 [7]   292 CCTACGGGATGCAGCAGTGGGGA...TTTAATCCTGATGAGCTAGAAA M01522:260:000000...
 [8]   293 CCTACGGGGCGCAGCAGTAGGGA...TTAAAACTTTTGTTCTGGAATT M01522:260:000000...
 [9]   292 CCTACGGGTTGCAGCAGTGGGGA...ATTAAAACTTTTCAGCTAGAGT M01522:260:000000...
[10]   293 CCTACGGGAGGCAGCAGTGGGGA...CCCGGGCTCAACCTGGGAACGG M01522:260:000000...
```
And the reverse reads should contain GACTACNVGGGTATCTAAT:
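```r
Biostrings::readDNAStringSet(fnRs[1], format = "fastq", nrec = 10)
```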
```
DNAStringSet object of length 10:
     width seq                                               names
 [1]   301 GACTACCAGGGTATCTAATCCTG...GGCTGCTGGCACGAAGTTCGCC M01522:260:000000...
 [2]   301 GACTACCGGGGTATCTAATCCTG...GGCTGCTGGCACGGAGTTAGCC M01522:260:000000...
 [3]   300 AATCCGGTTCGTGCCCCTAGGCT...TCTTTCCCAGCCCTTATTCCAA M01522:260:000000...
 [4]   301 GACTACCGGGGTATCTAATCCTG...GGCTGCTGGCACGGAGTTAGCC M01522:260:000000...
 [5]   301 GACTACCGGGGTATCTAATCCCT...GGCTGCTGGCCCGGAATTAGCC M01522:260:000000...
 [6]   301 GGTATCTAATCCGGTTCGTGCCC...CACCGTCCTTACCCCCCCCTTT M01522:260:000000...
 [7]   301 GGTATCTAATCTTGTTTGCTCCC...CCCGACGTTAGCCGGGGCTTCT M01522:260:000000...
 [8]   301 GACTACGAGGGTATCTAATCCCG...GGCTGCTGGCACGGAATTAGCC M01522:260:000000...
 [9]   301 GGTATCTAATCCTCTTCGCTACC...CACGAAGTTAGCCGGACCTTCT M01522:260:000000...
[10]   301 GACTACGGGGGTATCTAATCCTG...GGCTGCCGGCACGGGGTTAGCC M01522:260:000000...
```
We use a custom function, primer_trim(), implemented in R/preprocessing.R, to remove the primers using cutadapt. To work, primer_trim() needs cutadapt to be installed on your computer.

Run ?primer_trim to know more about this function.
```r
(primer_log <- primer_trim(
  forward_files = fnFs,
  reverse_files = fnRs,
  primer_fwd = primer_fwd,
  primer_rev = primer_rev,
  output_dir = path_to_trimmed_reads,
  min_size = 200
))
```
```
   sample status in_reads   in_bp too_short too_long too_many_n out_reads
1    S11B     OK     2000 1186767         0        0          0      1863
2     S1B     OK     2000 1186613         1        0          0      1855
3     S2B     OK     2000 1186942         0        0          0      1839
4     S2S     OK     2000 1186868         0        0          0      1833
5     S3B     OK     2000 1186650         0        0          0      1860
6     S3S     OK     2000 1186475         1        0          0      1880
7     S4B     OK     2000 1186331         2        0          0      1867
8     S4S     OK     2000 1186681         0        0          0      1872
9     S5B     OK     2000 1186386         1        0          0      1841
10    S5S     OK     2000 1186501         1        0          0      1861
11    S6B     OK     2000 1186261         2        0          0      1839
12    S6S     OK     2000 1187078         1        0          0      1835
13    S7B     OK     2000 1186888         0        0          0      1825
14    S7S     OK     2000 1186299         3        0          0      1845
15    S8B     OK     2000 1186354         3        0          0      1840
16    S8S     OK     2000 1186610         1        0          0      1848
17    S9B     OK     2000 1187038         0        0          0      1834
18    S9S     OK     2000 1186867         0        0          0      1835
   w/adapters qualtrim_bp out_bp w/adapters2 qualtrim2_bp out2_bp
1        1986           0 513149        1876            0  528595
2        1975           0 511096        1877            0  525893
3        1987           0 506659        1850            0  521371
4        1989           0 504998        1843            0  519979
5        1989           0 512326        1870            0  527518
6        1989           0 517598        1891            0  532758
7        1980           0 514342        1884            0  529379
8        1987           0 515511        1884            0  530555
9        1984           0 506972        1856            0  522013
10       1991           0 512539        1869            0  527592
11       1981           0 506577        1857            0  521787
12       1982           0 505929        1851            0  520562
13       1987           0 503033        1836            0  517931
14       1987           0 508524        1857            0  523039
15       1993           0 507178        1847            0  522137
16       1982           0 509177        1865            0  524085
17       1983           0 505424        1851            0  520706
18       1979           0 505519        1853            0  520103
```
If you have to deal with mixed-orientation paired-end reads, you may find inspiration here.
Same as before, we create a folder and list the paths:
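A sketch: list the primer-trimmed reads produced by primer_trim() and define matching output paths for the filtered reads (folder names are assumptions):

```r
nopFw <- sort(list.files(path_to_trimmed_reads, pattern = "R1", full.names = TRUE))
nopRv <- sort(list.files(path_to_trimmed_reads, pattern = "R2", full.names = TRUE))

path_to_filtered_reads <- here::here("data", "filtered")
dir.create(path_to_filtered_reads, recursive = TRUE, showWarnings = FALSE)

filtFs <- file.path(path_to_filtered_reads, basename(nopFw))
filtRs <- file.path(path_to_filtered_reads, basename(nopRv))
```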
To make the link between files and sample names, simply name the vectors of file names using the sample names:
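```r
# sample_names as extracted earlier from the file names
names(filtFs) <- sample_names
names(filtRs) <- sample_names
```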
dada2::filterAndTrim()
Let’s have a look at what the function is doing. To do so, type ?dada2::filterAndTrim in the console.
Let’s run the function.
```r
(out <- dada2::filterAndTrim(
  fwd = nopFw,
  filt = filtFs,
  rev = nopRv,
  filt.rev = filtRs,
  minLen = 150,
  matchIDs = TRUE,
  maxN = 0,
  maxEE = c(3, 3),
  truncQ = 2
))
```
```
                 reads.in reads.out
S11B_R1.fastq.gz     1863      1200
S1B_R1.fastq.gz      1855      1251
S2B_R1.fastq.gz      1839      1255
S2S_R1.fastq.gz      1833      1244
S3B_R1.fastq.gz      1860      1244
S3S_R1.fastq.gz      1880      1312
S4B_R1.fastq.gz      1867      1262
S4S_R1.fastq.gz      1872      1328
S5B_R1.fastq.gz      1841      1255
S5S_R1.fastq.gz      1861      1244
S6B_R1.fastq.gz      1839      1251
S6S_R1.fastq.gz      1835      1239
S7B_R1.fastq.gz      1825      1203
S7S_R1.fastq.gz      1845      1182
S8B_R1.fastq.gz      1840      1169
S8S_R1.fastq.gz      1848      1267
S9B_R1.fastq.gz      1834      1195
S9S_R1.fastq.gz      1835      1249
```
What happened?
Details about the function arguments:

- fwd = nopFw: input; path to the forward reads without primers.
- filt = filtFs: output; path where the filtered forward reads are written.
- rev = nopRv and filt.rev = filtRs: same as above, but with the reverse reads.
- truncLen: truncate reads after truncLen bases; reads shorter than that are discarded (truncLen = c(200, 150) means forward and reverse reads are truncated at 200 bp and 150 bp, respectively).
- trimLeft: number of nucleotides to remove from the start of each read.
- trimRight: number of nucleotides to remove from the end of each read.
- maxN: maximum number of ambiguous bases accepted.
- maxEE: expected errors (EE) threshold. The EE of a read is the sum of the error probabilities of its bases. Increase that value to accept more low-quality reads. The first value refers to the forward reads and the second to the reverse reads.
- truncQ = 2: truncate reads at the first instance of a quality score less than or equal to truncQ.
To be able to denoise your data, you need an error model. The error model tells you at which rate a nucleotide is replaced by another for a given quality score. For example, for a quality score Q of 30, what is the probability of an A being wrongly read as a T?

This error model can be learnt directly from the data with the function dada2::learnErrors(). You can come back to the course for more details about the maths behind it.
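A sketch of the two calls (errF and errR are assumed names; multithread is optional):

```r
errF <- dada2::learnErrors(filtFs, multithread = TRUE)
errR <- dada2::learnErrors(filtRs, multithread = TRUE)
```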
```
6157072 total bases in 22350 reads from 18 samples will be used for learning the error rates.
6337638 total bases in 22350 reads from 18 samples will be used for learning the error rates.
```
You can visualise the resulting error model using the function dada2::plotErrors():
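```r
# nominalQ = TRUE adds the error rates expected from the nominal quality scores
dada2::plotErrors(errF, nominalQ = TRUE)
```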
Before denoising, we need to dereplicate the sequences: for each unique sequence, we count the number of reads in which it occurs. The dereplication is achieved using the function dada2::derepFastq():
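A sketch (derepFs and derepRs are assumed names):

```r
derepFs <- dada2::derepFastq(filtFs)
derepRs <- dada2::derepFastq(filtRs)
```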
Now we are ready to run the denoising algorithm with dada2::dada(). As input, we need the error model and the dereplicated sequences:
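Using the objects defined in the sketches above:

```r
dadaFs <- dada2::dada(derepFs, err = errF)
dadaRs <- dada2::dada(derepRs, err = errR)
```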
```
Sample 1 - 1200 reads in 754 unique sequences.
Sample 2 - 1251 reads in 779 unique sequences.
Sample 3 - 1255 reads in 789 unique sequences.
Sample 4 - 1244 reads in 762 unique sequences.
Sample 5 - 1244 reads in 772 unique sequences.
Sample 6 - 1312 reads in 763 unique sequences.
Sample 7 - 1262 reads in 738 unique sequences.
Sample 8 - 1328 reads in 638 unique sequences.
Sample 9 - 1255 reads in 782 unique sequences.
Sample 10 - 1244 reads in 663 unique sequences.
Sample 11 - 1251 reads in 696 unique sequences.
Sample 12 - 1239 reads in 657 unique sequences.
Sample 13 - 1203 reads in 691 unique sequences.
Sample 14 - 1182 reads in 675 unique sequences.
Sample 15 - 1169 reads in 697 unique sequences.
Sample 16 - 1267 reads in 714 unique sequences.
Sample 17 - 1195 reads in 685 unique sequences.
Sample 18 - 1249 reads in 677 unique sequences.
Sample 1 - 1200 reads in 928 unique sequences.
Sample 2 - 1251 reads in 948 unique sequences.
Sample 3 - 1255 reads in 968 unique sequences.
Sample 4 - 1244 reads in 925 unique sequences.
Sample 5 - 1244 reads in 948 unique sequences.
Sample 6 - 1312 reads in 967 unique sequences.
Sample 7 - 1262 reads in 953 unique sequences.
Sample 8 - 1328 reads in 904 unique sequences.
Sample 9 - 1255 reads in 975 unique sequences.
Sample 10 - 1244 reads in 887 unique sequences.
Sample 11 - 1251 reads in 914 unique sequences.
Sample 12 - 1239 reads in 846 unique sequences.
Sample 13 - 1203 reads in 881 unique sequences.
Sample 14 - 1182 reads in 874 unique sequences.
Sample 15 - 1169 reads in 879 unique sequences.
Sample 16 - 1267 reads in 967 unique sequences.
Sample 17 - 1195 reads in 892 unique sequences.
Sample 18 - 1249 reads in 911 unique sequences.
```
Once the forward and reverse reads have been denoised, we can merge them with dada2::mergePairs():
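A sketch, reusing the denoised and dereplicated objects from above:

```r
mergers <- dada2::mergePairs(
  dadaF = dadaFs,
  derepF = derepFs,
  dadaR = dadaRs,
  derepR = derepRs
)
```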
At this point we have ASVs and we know their number of reads in each sample. We have enough information to build an ASV table:
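A minimal sketch (seqtab is an assumed name):

```r
# rows are samples, columns are ASV sequences
seqtab <- dada2::makeSequenceTable(mergers)
```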
Chimeras are artifact sequences formed by two or more biological sequences incorrectly joined together. We find and remove bimeras (two-parent chimeras) using the function dada2::removeBimeraDenovo():
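```r
# method = "consensus" is an assumption; see ?dada2::removeBimeraDenovo
seqtab_nochim <- dada2::removeBimeraDenovo(seqtab, method = "consensus")
```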
The ASV table is ready. But without a clue about the taxonomic identity of the ASVs, we won’t go far in our ecological interpretations. We can get an idea of the taxonomic identity of each ASV by comparing its sequence to reference databases such as SILVA.

The taxonomic assignment is done in two steps.

First, each ASV is assigned to a taxonomy using the RDP Naive Bayesian Classifier algorithm described in Wang et al. 2007, called by the function dada2::assignTaxonomy().
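A sketch, using the SILVA training set downloaded earlier (multithread is optional):

```r
taxonomy <- dada2::assignTaxonomy(
  seqs = seqtab_nochim,
  refFasta = silva_train_set,
  multithread = TRUE
)
```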
The method is robust; however, it often fails to assign at the species level.

If you consider that an ASV 100% similar to a reference sequence belongs to the same species, then you can use dada2::addSpecies(). This function assigns to the species level those ASVs that are identical to a reference sequence.
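A sketch, using the species assignment file downloaded earlier:

```r
taxonomy <- dada2::addSpecies(taxonomy, refFasta = silva_species_assignment)
```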
All the preprocessing is done. Now we export our results.
The results can be exported as R objects, one object for the ASV table and another one for the taxonomy:
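A sketch (the output folder and file names are assumptions):

```r
dir.create(here::here("outputs"), showWarnings = FALSE)
saveRDS(seqtab_nochim, here::here("outputs", "seqtab_nochim.rds"))
saveRDS(taxonomy, here::here("outputs", "taxonomy.rds"))
```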
We recommend exporting your results as text files: they are then reusable by other programs and languages. But before that, we need to format the data a little bit.
First we create a new variable to collect the ASV sequences:
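A sketch (asv_seq is an assumed name):

```r
# in a dada2 sequence table, the column names are the ASV sequences themselves
asv_seq <- colnames(seqtab_nochim)
```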
We create unique ids for each ASV. The sequence itself is a unique id, but we would like to have something shorter:
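For example, zero-padded sequential ids:

```r
asv_id <- sprintf("ASV_%04d", seq_along(asv_seq))
```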
and rename the different variables with the new ids:
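One way to do it in a single chained assignment:

```r
# the same ids become the row names of the taxonomy, the column names
# of the ASV table and the names of the sequence vector
row.names(taxonomy) <- colnames(seqtab_nochim) <- names(asv_seq) <- asv_id
```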
Before exporting the data frames (taxonomy and seqtab_nochim) as text files, we convert their row names (ASV ids) into a new column named asv. This is achieved using the custom function df_export().
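The exact signature of the course's df_export() may differ (check ?df_export), but the operation amounts to something like this (taxonomy_export and seqtab_export are assumed names):

```r
taxonomy_export <- data.frame(asv = asv_id, taxonomy, row.names = NULL)

# transpose the ASV table so that rows are ASVs before adding the asv column
seqtab_export <- data.frame(asv = asv_id, t(seqtab_nochim), row.names = NULL)
```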
Finally, we can export the taxonomy, the ASV table, and the sequences as a fasta file.
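A sketch of the three exports (paths and file names are assumptions):

```r
write.table(taxonomy_export,
            here::here("outputs", "asv_taxonomy.tsv"),
            quote = FALSE, sep = "\t", row.names = FALSE)

write.table(seqtab_export,
            here::here("outputs", "asv_table.tsv"),
            quote = FALSE, sep = "\t", row.names = FALSE)

cat(paste0(">", asv_id, "\n", asv_seq),
    sep = "\n",
    file = here::here("outputs", "asv_sequences.fasta"))
```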
Statistics about each preprocessing step can also be exported. First, this table needs to be assembled:
```r
getN <- function(x) sum(dada2::getUniques(x))

log_table <- data.frame(
  input = primer_log$in_reads,
  with_fwd_primer = primer_log$`w/adapters`,
  with_rev_primer = primer_log$`w/adapters2`,
  with_both_primers = out[, 1],
  filtered = out[, 2],
  denoisedF = sapply(dadaFs, getN),
  denoisedR = sapply(dadaRs, getN),
  merged = sapply(mergers, getN),
  nonchim = rowSums(seqtab_nochim),
  perc_retained = rowSums(seqtab_nochim) / out[, 1] * 100
)

rownames(log_table) <- sample_names
```
Then it can be exported:
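A sketch (the output path is an assumption):

```r
write.table(data.frame(sample = row.names(log_table), log_table, row.names = NULL),
            here::here("outputs", "preprocessing_stats.tsv"),
            quote = FALSE, sep = "\t", row.names = FALSE)
```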