Fabrice Armougom, MIO
Marc Garel, MIO
Nicolas Henry, ABiMS
September 1, 2023
To begin with, download the course repository on your computer or virtual machine. To do so, visit the ANF-metabarcoding repository on GitHub and download it as a zip file. You can also download the repository directly from the terminal using this command:

Once on your machine, unzip the file and place the resulting folder in the most convenient location (~/Documents/ for example).
Save as a variable the path to the folder where you will place the reference databases:

```r
refdb_folder <- here::here("data", "refdb")
refdb_folder
```

```
[1] "/home/nhenry/Documents/anf-metabarcoding/data/refdb"
```
In this course, we use the :: operator as much as possible. It allows calling a function from a given package without having to load the entire package using library(package). This way, it is clear where each function comes from. In this case, the function here() comes from the package here.
The reason why we use here::here() is that when you render an Rmarkdown file, the working directory is the directory containing the Rmarkdown file, whereas here::here() points to the root of the R project.
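You can see the difference for yourself (the paths below are illustrative):

```r
getwd()       # inside a rendered Rmd: the folder containing the .Rmd file
here::here()  # always the root of the R project, wherever the code runs
```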
Now, let’s create the folder directly from R:
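A minimal way to do it, using the refdb_folder variable defined above:

```r
# recursive = TRUE also creates missing parent folders;
# showWarnings = FALSE silences the warning if the folder already exists
dir.create(refdb_folder, recursive = TRUE, showWarnings = FALSE)
```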
You can also create the folder from RStudio in the Files window.
Tip: You can access the documentation of any R function using ? in the console. If you want to know everything about the function dir.create(), simply run ?dir.create().
The SILVA reference database, commonly used to assign 16S metabarcoding data, will be used in this practical. In case you are working with 18S sequences, you will get better assignments using PR2 or EukRibo, especially if you are interested in protists.

The following code downloads the DADA2-formatted SILVA reference databases. If you are not comfortable with it, you can simply download the reference database files from your web browser here and here.
```r
# check the current download timeout (in seconds):
# the default is often too short for these large files
getOption("timeout")
```

```
[1] 60
```

```r
# so we change timeout to be 20 minutes
options(timeout = 1200)

# we save in variables the paths to the refdb files in the working space
silva_train_set <- file.path(refdb_folder,
                             "silva_nr99_v138.1_train_set.fa.gz")

silva_species_assignment <- file.path(refdb_folder,
                                      "silva_species_assignment_v138.1.fa.gz")

# then we download the files if they don't already exist
if (!file.exists(silva_train_set)) {
  download.file(
    "https://zenodo.org/record/4587955/files/silva_nr99_v138.1_train_set.fa.gz",
    silva_train_set,
    quiet = TRUE
  )
}

if (!file.exists(silva_species_assignment)) {
  download.file(
    "https://zenodo.org/record/4587955/files/silva_species_assignment_v138.1.fa.gz",
    silva_species_assignment,
    quiet = TRUE
  )
}
```
In this practical, we will use R functions especially written for this course. The “classic” way to import functions is to use source() with the name of the R script to source. Instead, we use devtools::load_all(). This function sources all the scripts from the folder R/ along with the documentation in man/:
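```r
# load_all() defaults to the current project folder
devtools::load_all()
```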
Save the path to the directory containing your raw data (paired-end sequencing fastq files) in a variable named path_to_fastqs:
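For example, assuming the fastq files live in data/raw (adapt to your own layout):

```r
path_to_fastqs <- here::here("data", "raw")
```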
The gzipped (compressed) FASTQ formatted “forward” (R1) and “reverse” (R2) files are named as follows:

- ${SAMPLENAME}_R1.fastq.gz for the forward files
- ${SAMPLENAME}_R2.fastq.gz for the reverse files

We list the forward files using the function list.files(). The argument pattern gives you the possibility to select only the file names matching a regular expression. In our case, we select all file names ending with _R1.fastq.gz.
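A sketch of that call (full.names = TRUE keeps the full paths):

```r
fnFs <- sort(list.files(path_to_fastqs,
                        pattern = "_R1.fastq.gz",
                        full.names = TRUE))
```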
We do the same for the reverse files.
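```r
fnRs <- sort(list.files(path_to_fastqs,
                        pattern = "_R2.fastq.gz",
                        full.names = TRUE))
```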
To understand: what do the fnFs and fnRs variables contain?
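Next, we extract the sample names from the file names. A sketch, assuming the ${SAMPLENAME}_RX.fastq.gz pattern described above (the functions used are detailed just below):

```r
sample_names <- basename(fnFs) |>
  strsplit(split = "_") |>
  sapply(head, 1)
```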
To understand:

- basename(): removes the path to keep only the file name.
- |>: the R “pipe”. It allows you to chain functions, avoiding intermediate variables and nested parentheses. It basically transfers the output of the left expression to the input of the right expression. You need R ≥ 4.1 to use this pipe; otherwise use %>% from magrittr.
- strsplit(): splits a character string according to a defined pattern. Run ?strsplit for documentation.
- sapply(): applies a function to each element of a list or vector. The output is simplified to a vector.
Let’s go step by step. First, list the R1 file names:
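```r
# keep only the file names (first six shown)
basename(fnFs) |> head()
```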
[1] "S11B_R1.fastq.gz" "S1B_R1.fastq.gz" "S2B_R1.fastq.gz" "S2S_R1.fastq.gz"
[5] "S3B_R1.fastq.gz" "S3S_R1.fastq.gz"
We can see that the sample name is everything before the first _. With strsplit(), we can split each file name into a two-element vector. The result is a list of two-element vectors:
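```r
basename(fnFs) |>
  head() |>
  strsplit(split = "_")
```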
```
[[1]]
[1] "S11B" "R1.fastq.gz"

[[2]]
[1] "S1B" "R1.fastq.gz"

[[3]]
[1] "S2B" "R1.fastq.gz"

[[4]]
[1] "S2S" "R1.fastq.gz"

[[5]]
[1] "S3B" "R1.fastq.gz"

[[6]]
[1] "S3S" "R1.fastq.gz"
```
Now, we just have to extract the first element for each file name:
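```r
basename(fnFs) |>
  head() |>
  strsplit(split = "_") |>
  sapply(head, 1)
```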
[1] "S11B" "S1B" "S2B" "S2S" "S3B" "S3S"
Tip: you can achieve the same thing using regular expressions:
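A possible regex-based version (a sketch; it strips everything from the first underscore onwards):

```r
sample_names <- sub("_.*$", "", basename(fnFs))
```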
Regular expressions are extremely useful. If you are keen to learn how to use them, have a look here.
We use a custom function, qualityprofile(), implemented in R/preprocessing.R, to check the quality of the raw sequences.

Run ?qualityprofile to know more about this function.
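A hypothetical call (the exact arguments of this course-specific function may differ; check ?qualityprofile):

```r
# writes per-sample quality profiles of the forward and reverse reads to a pdf
qualityprofile(fnFs, fnRs, "quality_plots.pdf")
```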
Open the pdf file generated by qualityprofile().
We first create a folder in which to save the reads once they are trimmed:
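A sketch, assuming the trimmed reads go to data/trimmed:

```r
path_to_trimmed_reads <- here::here("data", "trimmed")
dir.create(path_to_trimmed_reads, recursive = TRUE, showWarnings = FALSE)
```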
The data you are working with correspond to the V3-V4 region, amplified using the primers Pro341F (CCTACGGGNBGCASCAG) and Pro805R (GACTACNVGGGTATCTAAT). Save the forward and reverse primers into variables:
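```r
primer_fwd <- "CCTACGGGNBGCASCAG"
primer_rev <- "GACTACNVGGGTATCTAAT"
```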
Let’s have a closer look at the sequences and find the primers. The forward reads should contain CCTACGGGNBGCASCAG:
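One way to inspect the first ten reads, assuming the Biostrings package is installed (the course may use a different call):

```r
Biostrings::readDNAStringSet(fnFs[1], format = "fastq", nrec = 10)
```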
```
DNAStringSet object of length 10:
     width seq                                               names
 [1]   293 CCTACGGGGGGCAGCAGTAGGGA...ACATCGGCTTAACCGATGAAGT M01522:260:000000...
 [2]   293 CCTACGGGTGGCACCAGTAGGGA...CGGGGCTTAACCTCGGAACTGC M01522:260:000000...
 [3]   292 CCTACGGGGCGCAGCAGGCGCGA...GGGACCGGGAGAGGTGTGAGGT M01522:260:000000...
 [4]   293 CCTACGGGGTGCAGCAGTAGGGA...TCAAAACTCCCAGTCTAGAGTT M01522:260:000000...
 [5]   291 CCTACGGGTGGCAGCAGTGGGGA...GCAGTGGAAACTGTTGGGCTTG M01522:260:000000...
 [6]   293 CCTACGGGATGCAGCAGGCGCGA...GGGACCGGGAGAGGTGTGGGGG M01522:260:000000...
 [7]   292 CCTACGGGATGCAGCAGTGGGGA...TTTAATCCTGATGAGCTAGAAA M01522:260:000000...
 [8]   293 CCTACGGGGCGCAGCAGTAGGGA...TTAAAACTTTTGTTCTGGAATT M01522:260:000000...
 [9]   292 CCTACGGGTTGCAGCAGTGGGGA...ATTAAAACTTTTCAGCTAGAGT M01522:260:000000...
[10]   293 CCTACGGGAGGCAGCAGTGGGGA...CCCGGGCTCAACCTGGGAACGG M01522:260:000000...
```
And the reverse reads should contain GACTACNVGGGTATCTAAT:
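```r
Biostrings::readDNAStringSet(fnRs[1], format = "fastq", nrec = 10)
```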
```
DNAStringSet object of length 10:
     width seq                                               names
 [1]   301 GACTACCAGGGTATCTAATCCTG...GGCTGCTGGCACGAAGTTCGCC M01522:260:000000...
 [2]   301 GACTACCGGGGTATCTAATCCTG...GGCTGCTGGCACGGAGTTAGCC M01522:260:000000...
 [3]   300 AATCCGGTTCGTGCCCCTAGGCT...TCTTTCCCAGCCCTTATTCCAA M01522:260:000000...
 [4]   301 GACTACCGGGGTATCTAATCCTG...GGCTGCTGGCACGGAGTTAGCC M01522:260:000000...
 [5]   301 GACTACCGGGGTATCTAATCCCT...GGCTGCTGGCCCGGAATTAGCC M01522:260:000000...
 [6]   301 GGTATCTAATCCGGTTCGTGCCC...CACCGTCCTTACCCCCCCCTTT M01522:260:000000...
 [7]   301 GGTATCTAATCTTGTTTGCTCCC...CCCGACGTTAGCCGGGGCTTCT M01522:260:000000...
 [8]   301 GACTACGAGGGTATCTAATCCCG...GGCTGCTGGCACGGAATTAGCC M01522:260:000000...
 [9]   301 GGTATCTAATCCTCTTCGCTACC...CACGAAGTTAGCCGGACCTTCT M01522:260:000000...
[10]   301 GACTACGGGGGTATCTAATCCTG...GGCTGCCGGCACGGGGTTAGCC M01522:260:000000...
```
We use a custom function, primer_trim(), implemented in R/preprocessing.R, to remove the primers using cutadapt. To work, primer_trim() needs cutadapt to be installed on your computer.

Run ?primer_trim to know more about this function.
```r
(primer_log <- primer_trim(
  forward_files = fnFs,
  reverse_files = fnRs,
  primer_fwd = primer_fwd,
  primer_rev = primer_rev,
  output_dir = path_to_trimmed_reads,
  min_size = 200
))
```
```
   sample status in_reads   in_bp too_short too_long too_many_n out_reads
1    S11B     OK     2000 1186767         0        0          0      1863
2     S1B     OK     2000 1186613         1        0          0      1855
3     S2B     OK     2000 1186942         0        0          0      1839
4     S2S     OK     2000 1186868         0        0          0      1833
5     S3B     OK     2000 1186650         0        0          0      1860
6     S3S     OK     2000 1186475         1        0          0      1880
7     S4B     OK     2000 1186331         2        0          0      1867
8     S4S     OK     2000 1186681         0        0          0      1872
9     S5B     OK     2000 1186386         1        0          0      1841
10    S5S     OK     2000 1186501         1        0          0      1861
11    S6B     OK     2000 1186261         2        0          0      1839
12    S6S     OK     2000 1187078         1        0          0      1835
13    S7B     OK     2000 1186888         0        0          0      1825
14    S7S     OK     2000 1186299         3        0          0      1845
15    S8B     OK     2000 1186354         3        0          0      1840
16    S8S     OK     2000 1186610         1        0          0      1848
17    S9B     OK     2000 1187038         0        0          0      1834
18    S9S     OK     2000 1186867         0        0          0      1835
   w/adapters qualtrim_bp out_bp w/adapters2 qualtrim2_bp out2_bp
1        1986           0 513149        1876            0  528595
2        1975           0 511096        1877            0  525893
3        1987           0 506659        1850            0  521371
4        1989           0 504998        1843            0  519979
5        1989           0 512326        1870            0  527518
6        1989           0 517598        1891            0  532758
7        1980           0 514342        1884            0  529379
8        1987           0 515511        1884            0  530555
9        1984           0 506972        1856            0  522013
10       1991           0 512539        1869            0  527592
11       1981           0 506577        1857            0  521787
12       1982           0 505929        1851            0  520562
13       1987           0 503033        1836            0  517931
14       1987           0 508524        1857            0  523039
15       1993           0 507178        1847            0  522137
16       1982           0 509177        1865            0  524085
17       1983           0 505424        1851            0  520706
18       1979           0 505519        1853            0  520103
```
If you have to deal with mixed-orientation paired-end reads, you may find inspiration here.
Same as before, we create a folder and list the paths:
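A sketch: list the primer-trimmed reads produced by primer_trim() and define matching output paths for the filtered reads (folder names are assumptions):

```r
nopFw <- sort(list.files(path_to_trimmed_reads, pattern = "R1", full.names = TRUE))
nopRv <- sort(list.files(path_to_trimmed_reads, pattern = "R2", full.names = TRUE))

path_to_filtered_reads <- here::here("data", "filtered")
dir.create(path_to_filtered_reads, recursive = TRUE, showWarnings = FALSE)

filtFs <- file.path(path_to_filtered_reads, basename(nopFw))
filtRs <- file.path(path_to_filtered_reads, basename(nopRv))
```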
To make the link between files and sample names, simply name the vectors of file names using the sample names:
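```r
# sample_names as extracted earlier from the file names
names(filtFs) <- sample_names
names(filtRs) <- sample_names
```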
dada2::filterAndTrim()
Let’s have a look at what the function is doing. To do so, type ?dada2::filterAndTrim in the console.
Let’s run the function.
```r
(out <- dada2::filterAndTrim(
  fwd = nopFw,
  filt = filtFs,
  rev = nopRv,
  filt.rev = filtRs,
  minLen = 150,
  matchIDs = TRUE,
  maxN = 0,
  maxEE = c(3, 3),
  truncQ = 2
))
```
```
                 reads.in reads.out
S11B_R1.fastq.gz     1863      1200
S1B_R1.fastq.gz      1855      1251
S2B_R1.fastq.gz      1839      1255
S2S_R1.fastq.gz      1833      1244
S3B_R1.fastq.gz      1860      1244
S3S_R1.fastq.gz      1880      1312
S4B_R1.fastq.gz      1867      1262
S4S_R1.fastq.gz      1872      1328
S5B_R1.fastq.gz      1841      1255
S5S_R1.fastq.gz      1861      1244
S6B_R1.fastq.gz      1839      1251
S6S_R1.fastq.gz      1835      1239
S7B_R1.fastq.gz      1825      1203
S7S_R1.fastq.gz      1845      1182
S8B_R1.fastq.gz      1840      1169
S8S_R1.fastq.gz      1848      1267
S9B_R1.fastq.gz      1834      1195
S9S_R1.fastq.gz      1835      1249
```
What happened?
Details about the function arguments:

- fwd = nopFw: input; path to the forward reads without primers.
- filt = filtFs: output; path where the filtered forward reads are written.
- rev = nopRv and filt.rev = filtRs: same as above, but with the reverse reads.
- truncLen: truncate reads after truncLen bases; reads shorter than that are discarded (truncLen = c(200, 150) means forward and reverse reads are truncated at 200 bp and 150 bp, respectively).
- trimLeft: number of nucleotides to remove from the start of each read.
- trimRight: number of nucleotides to remove from the end of each read.
- maxN: maximum number of ambiguous bases accepted.
- maxEE: expected errors (EE) threshold. The EE of a read is the sum of the error probabilities of its bases. Increase that value to accept more low-quality reads. The first value refers to the forward reads and the second to the reverse reads.
- truncQ = 2: truncate reads at the first instance of a quality score less than or equal to truncQ.
To be able to denoise your data, you need an error model. The error model tells you at which rate a nucleotide is replaced by another for a given quality score. For example, for a quality score Q of 30, what is the probability of an A being wrongly read as a T?

This error model can be learnt directly from the data with the function dada2::learnErrors(). You can come back to the course for more details about the maths behind it.
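A sketch of the two calls (errF and errR are assumed names; multithread is optional):

```r
errF <- dada2::learnErrors(filtFs, multithread = TRUE)
errR <- dada2::learnErrors(filtRs, multithread = TRUE)
```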
```
6157072 total bases in 22350 reads from 18 samples will be used for learning the error rates.
6337638 total bases in 22350 reads from 18 samples will be used for learning the error rates.
```
You can visualise the resulting error model using the function dada2::plotErrors():
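```r
# nominalQ = TRUE adds the error rates expected from the nominal quality scores
dada2::plotErrors(errF, nominalQ = TRUE)
```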
Before denoising, we need to dereplicate the sequences: for each unique sequence, we count the number of reads in which it occurs. The dereplication is achieved using the function dada2::derepFastq():
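A sketch (derepFs and derepRs are assumed names):

```r
derepFs <- dada2::derepFastq(filtFs)
derepRs <- dada2::derepFastq(filtRs)
```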
Now we are ready to run the denoising algorithm with dada2::dada(). As input, we need the error model and the dereplicated sequences:
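Using the objects defined in the sketches above:

```r
dadaFs <- dada2::dada(derepFs, err = errF)
dadaRs <- dada2::dada(derepRs, err = errR)
```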
```
Sample 1 - 1200 reads in 754 unique sequences.
Sample 2 - 1251 reads in 779 unique sequences.
Sample 3 - 1255 reads in 789 unique sequences.
Sample 4 - 1244 reads in 762 unique sequences.
Sample 5 - 1244 reads in 772 unique sequences.
Sample 6 - 1312 reads in 763 unique sequences.
Sample 7 - 1262 reads in 738 unique sequences.
Sample 8 - 1328 reads in 638 unique sequences.
Sample 9 - 1255 reads in 782 unique sequences.
Sample 10 - 1244 reads in 663 unique sequences.
Sample 11 - 1251 reads in 696 unique sequences.
Sample 12 - 1239 reads in 657 unique sequences.
Sample 13 - 1203 reads in 691 unique sequences.
Sample 14 - 1182 reads in 675 unique sequences.
Sample 15 - 1169 reads in 697 unique sequences.
Sample 16 - 1267 reads in 714 unique sequences.
Sample 17 - 1195 reads in 685 unique sequences.
Sample 18 - 1249 reads in 677 unique sequences.
Sample 1 - 1200 reads in 928 unique sequences.
Sample 2 - 1251 reads in 948 unique sequences.
Sample 3 - 1255 reads in 968 unique sequences.
Sample 4 - 1244 reads in 925 unique sequences.
Sample 5 - 1244 reads in 948 unique sequences.
Sample 6 - 1312 reads in 967 unique sequences.
Sample 7 - 1262 reads in 953 unique sequences.
Sample 8 - 1328 reads in 904 unique sequences.
Sample 9 - 1255 reads in 975 unique sequences.
Sample 10 - 1244 reads in 887 unique sequences.
Sample 11 - 1251 reads in 914 unique sequences.
Sample 12 - 1239 reads in 846 unique sequences.
Sample 13 - 1203 reads in 881 unique sequences.
Sample 14 - 1182 reads in 874 unique sequences.
Sample 15 - 1169 reads in 879 unique sequences.
Sample 16 - 1267 reads in 967 unique sequences.
Sample 17 - 1195 reads in 892 unique sequences.
Sample 18 - 1249 reads in 911 unique sequences.
```
Once the forward and reverse reads have been denoised, we can merge them with dada2::mergePairs():
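A sketch, reusing the denoised and dereplicated objects from above:

```r
mergers <- dada2::mergePairs(
  dadaF = dadaFs,
  derepF = derepFs,
  dadaR = dadaRs,
  derepR = derepRs
)
```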
At this point we have ASVs and we know their number of reads in each sample. We have enough information to build an ASV table:
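A minimal sketch (seqtab is an assumed name):

```r
# rows are samples, columns are ASV sequences
seqtab <- dada2::makeSequenceTable(mergers)
```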
Chimeras are artifact sequences formed by two or more biological sequences incorrectly joined together. We find and remove bimeras (two-parent chimeras) using the function dada2::removeBimeraDenovo():
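```r
# method = "consensus" is an assumption; see ?dada2::removeBimeraDenovo
seqtab_nochim <- dada2::removeBimeraDenovo(seqtab, method = "consensus")
```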
The ASV table is ready. But without a clue about the taxonomic identity of the ASVs, we won’t go far in our ecological interpretations. We can get an idea of the taxonomic identity of each ASV by comparing its sequence to reference databases such as SILVA.

The taxonomic assignment is done in two steps.

First, each ASV is assigned to a taxonomy using the RDP Naive Bayesian Classifier algorithm described in Wang et al. 2007, called by the function dada2::assignTaxonomy().
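A sketch, using the SILVA training set downloaded earlier (multithread is optional):

```r
taxonomy <- dada2::assignTaxonomy(
  seqs = seqtab_nochim,
  refFasta = silva_train_set,
  multithread = TRUE
)
```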
The method is robust; however, it often fails to assign at the species level.

If you consider that an ASV 100% similar to a reference sequence belongs to the same species, then you can use dada2::addSpecies(). This function assigns to the species level those ASVs that are identical to a reference sequence.
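A sketch, using the species assignment file downloaded earlier:

```r
taxonomy <- dada2::addSpecies(taxonomy, refFasta = silva_species_assignment)
```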
All the preprocessing is done. Now we export our results.
The results can be exported as R objects, one object for the ASV table and another one for the taxonomy:
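A sketch (the output folder and file names are assumptions):

```r
dir.create(here::here("outputs"), showWarnings = FALSE)
saveRDS(seqtab_nochim, here::here("outputs", "seqtab_nochim.rds"))
saveRDS(taxonomy, here::here("outputs", "taxonomy.rds"))
```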
We recommend exporting your results as text files: they are then reusable by other programs and languages. But before that, we need to format the data a little bit.
First we create a new variable to collect the ASV sequences:
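A sketch (asv_seq is an assumed name):

```r
# in a dada2 sequence table, the column names are the ASV sequences themselves
asv_seq <- colnames(seqtab_nochim)
```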
We create unique ids for each ASV. The sequence itself is a unique id, but we would like to have something shorter:
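For example, zero-padded sequential ids:

```r
asv_id <- sprintf("ASV_%04d", seq_along(asv_seq))
```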
and rename the different variables with the new ids:
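One way to do it in a single chained assignment:

```r
# the same ids become the row names of the taxonomy, the column names
# of the ASV table and the names of the sequence vector
row.names(taxonomy) <- colnames(seqtab_nochim) <- names(asv_seq) <- asv_id
```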
Before exporting the data frames (taxonomy and seqtab_nochim) as text files, we convert their row names (ASV ids) into a new column named asv. This is achieved using the custom function df_export().
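The exact signature of the course's df_export() may differ (check ?df_export), but the operation amounts to something like this (taxonomy_export and seqtab_export are assumed names):

```r
taxonomy_export <- data.frame(asv = asv_id, taxonomy, row.names = NULL)

# transpose the ASV table so that rows are ASVs before adding the asv column
seqtab_export <- data.frame(asv = asv_id, t(seqtab_nochim), row.names = NULL)
```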
Finally, we can export the taxonomy, the ASV table, and the sequences as a fasta file.
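A sketch of the three exports (paths and file names are assumptions):

```r
write.table(taxonomy_export,
            here::here("outputs", "asv_taxonomy.tsv"),
            quote = FALSE, sep = "\t", row.names = FALSE)

write.table(seqtab_export,
            here::here("outputs", "asv_table.tsv"),
            quote = FALSE, sep = "\t", row.names = FALSE)

cat(paste0(">", asv_id, "\n", asv_seq),
    sep = "\n",
    file = here::here("outputs", "asv_sequences.fasta"))
```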
Statistics about each preprocessing step can also be exported. First, this table needs to be assembled:
```r
getN <- function(x) sum(dada2::getUniques(x))

log_table <- data.frame(
  input = primer_log$in_reads,
  with_fwd_primer = primer_log$`w/adapters`,
  with_rev_primer = primer_log$`w/adapters2`,
  with_both_primers = out[, 1],
  filtered = out[, 2],
  denoisedF = sapply(dadaFs, getN),
  denoisedR = sapply(dadaRs, getN),
  merged = sapply(mergers, getN),
  nonchim = rowSums(seqtab_nochim),
  perc_retained = rowSums(seqtab_nochim) / out[, 1] * 100
)

rownames(log_table) <- sample_names
```
Then it can be exported:
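A sketch (the output path is an assumption):

```r
write.table(data.frame(sample = row.names(log_table), log_table, row.names = NULL),
            here::here("outputs", "preprocessing_stats.tsv"),
            quote = FALSE, sep = "\t", row.names = FALSE)
```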