September 30, 2025
Metabarcoding is a molecular technique used to identify and quantify the diversity of organisms in environmental samples by amplifying and sequencing specific genetic markers (barcodes) from environmental DNA (eDNA).
Environmental genomics
nf-core pipelines
Standardized Workflows:
nf-core provides standardized, community-curated workflows for analyzing data, ensuring consistency and reproducibility.
Nextflow Integration:
Built on Nextflow, a powerful workflow management system, allowing for scalable and portable data analysis.
nf-core/ampliseq?
* Pre-configured bioinformatics analysis workflow for amplicon sequencing, supporting multiple input formats.
* Developed and maintained by a global community of researchers and bioinformaticians, ensuring continuous improvement and updates.
* Comprehensive documentation and tutorials make it accessible for users of all skill levels.
Requirements to launch ampliseq on the BioPipes VM
ampliseq_vm.config (or your own nextflow.config) to define parameters specific to the compute environment of the BioPipes VM for nf-core pipelines (such as resourceLimits and partitions).
Files needed:
* .fastq.gz read files.
* .tsv, .csv or .yml file listing samples and paths to FastQ files.
* .yml file containing pipeline-specific configurations (genomic references, tools, etc.).
* .sh script to submit the ampliseq pipeline to a SLURM cluster.
The sample sheet file can be tab-separated (.tsv), comma-separated (.csv), or in YAML format (.yml/.yaml) and can have two to four columns/entries with the following headers:
sampleID | forwardReads | reverseReads | run |
---|---|---|---|
required | required | optional | optional |
Unique sample identifiers | Paths to (forward) reads zipped FastQ files | Paths to reverse reads zipped FastQ files, required if the data is paired-end | If the data was produced by multiple sequencing runs |
samplesheet.tsv
sampleID | run | forwardReads | reverseReads |
---|---|---|---|
S11B | run_01 | data/raw/S11B_R1.fastq.gz | data/raw/S11B_R2.fastq.gz |
S1B | run_01 | data/raw/S1B_R1.fastq.gz | data/raw/S1B_R2.fastq.gz |
S2B | run_01 | data/raw/S2B_R1.fastq.gz | data/raw/S2B_R2.fastq.gz |
S2S | run_01 | data/raw/S2S_R1.fastq.gz | data/raw/S2S_R2.fastq.gz |
… | … | … | … |
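The sample sheet rules above can be checked before launching the pipeline. The following is a minimal sketch, not part of nf-core/ampliseq: the function name `check_samplesheet` is hypothetical, it only covers the tab-separated form, and accepting `.fq.gz` alongside `.fastq.gz` is an assumption. The pipeline runs its own validation in any case.

```python
import csv
import io

REQUIRED = {"sampleID", "forwardReads"}
OPTIONAL = {"reverseReads", "run"}
# Assumption: .fq.gz is treated as equivalent to .fastq.gz.
FASTQ_SUFFIXES = (".fastq.gz", ".fq.gz")


def check_samplesheet(text):
    """Return a list of problems found in a tab-separated sample sheet."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    headers = set(reader.fieldnames or [])
    problems = []

    missing = REQUIRED - headers
    if missing:
        problems.append(f"missing required column(s): {sorted(missing)}")
    unknown = headers - REQUIRED - OPTIONAL
    if unknown:
        problems.append(f"unexpected column(s): {sorted(unknown)}")
    if missing:
        return problems  # row checks need the required columns

    seen = set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        sample = row["sampleID"]
        if sample in seen:
            problems.append(f"line {line_no}: duplicate sampleID {sample!r}")
        seen.add(sample)
        if not row["forwardReads"]:
            problems.append(f"line {line_no}: forwardReads is empty")
        for column in ("forwardReads", "reverseReads"):
            path = row.get(column)
            if path and not path.endswith(FASTQ_SUFFIXES):
                problems.append(f"line {line_no}: {column} is not a gzipped FastQ file: {path!r}")
    return problems


sheet = (
    "sampleID\trun\tforwardReads\treverseReads\n"
    "S11B\trun_01\tdata/raw/S11B_R1.fastq.gz\tdata/raw/S11B_R2.fastq.gz\n"
    "S1B\trun_01\tdata/raw/S1B_R1.fastq.gz\tdata/raw/S1B_R2.fastq.gz\n"
)
print(check_samplesheet(sheet))  # an empty list means no problem was found
```

The check does not test whether the files exist, since that depends on where the pipeline is launched from.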
params_ampliseq.yaml
```yaml
input: "input/samplesheet.tsv"
FW_primer: "GTGYCAGCMGCCGCGGTAA"
RV_primer: "CCGYCAATTYMTTTRAGTTT"
outdir: "output/ampliseq"

# Read trimming and quality filtering
max_ee: 2
trunclenf: 200
trunclenr: 200
trunc_qmin: 2

# Amplicon Sequence Variants (ASV) calculation
sample_inference: "pooled"

# Taxonomic db
dada_ref_taxonomy: "pr2=5.0.0"

# ASV filtering
min_frequency: 2
min_samples: 3

# Generic options
# Max job request
max_cpus: 32
max_memory: "100.GB"
max_time: "720.h"
```
Using bash script: ampliseq.sh
```bash
#!/bin/bash
nextflow run nf-core/ampliseq \
    -bg \
    -r 2.14.0 \
    -profile docker \
    -c "practicals/ampliseq_files/ampliseq_vm.config" \
    -params-file practicals/ampliseq_files/ampliseq_parameters_simple.yaml

# Path to files:
# ├── ampliseq.sh
# ├── practicals
# │   └── ampliseq_files
# │       ├── samplesheet.tsv
# │       ├── ampliseq_vm.config
# │       └── ampliseq_parameters_simple.yaml
```
The run takes approximately 30 minutes on the ANF training data, the longest step being the taxonomic assignment!
summary_report.html: pipeline summary report as a standalone HTML file that can be viewed in your web browser. Example here.
ASV_tax.*.tsv: taxonomic classification for each ASV sequence (ASV_tax.pr2_5_0_0 in our example).
ASV_ID | Domain | Supergroup | Division | Subdivision | Class | Order |
---|---|---|---|---|---|---|
b7db | Bacteria | PANNAM | Proteobacteria | Proteobacteria_X | Gammaproteobacteria | Alteromonadales |
e01e | Bacteria | FCB | Bacteroidetes | Bacteroidetes_X | Bacteroidia | Flavobacteriales |
Family | Genus | Species | Confidence | Sequence |
---|---|---|---|---|
Marinobacteraceae | Marinobacter | Marinobacter_sp. | 1 | TACGGAGGGTGCAA… |
Flavobacteriaceae | Flavobacterium | Flavobacterium_sp. | 0.99 | TACGGAGGATCCAAGCG… |
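A common first look at the taxonomy table is to count ASVs per taxon at a chosen rank. Here is a minimal sketch using only the standard library, assuming a tab-separated file with the column names shown above. The function name `count_by_rank` and the `min_confidence` cutoff are illustrative, and the sample text keeps only the columns the example needs.

```python
import csv
import io
from collections import Counter


def count_by_rank(tsv_text, rank, min_confidence=0.0):
    """Count ASVs per taxon at `rank`, skipping low-confidence assignments."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        if float(row["Confidence"]) >= min_confidence:
            counts[row[rank]] += 1
    return counts


# The two rows shown above, restricted to the columns used here.
taxonomy = (
    "ASV_ID\tDomain\tClass\tConfidence\n"
    "b7db\tBacteria\tGammaproteobacteria\t1\n"
    "e01e\tBacteria\tBacteroidia\t0.99\n"
)
print(count_by_rank(taxonomy, "Domain"))
print(count_by_rank(taxonomy, "Class", min_confidence=0.995))
```

With a real file, pass the content of the ASV_tax file read from the output directory instead of the inline text. Rows with a missing confidence value would need extra handling.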
ASV_table.tsv: summary of the Amplicon Sequence Variants (ASVs) detected in each of your samples.
ASV_ID | A120_A | A120_B | A141_A | A141_B | ... |
---|---|---|---|---|---|
b7db | 2841 | 1118 | 614 | 442 | … |
bab5 | 9163 | 7036 | 19075 | 13799 | … |
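The `min_frequency` and `min_samples` settings in the parameter file act on a table of this shape. The sketch below shows the idea under one assumption: as in QIIME2 feature filtering, `min_frequency` is read as a minimum total count across samples and `min_samples` as a minimum number of samples in which the ASV is observed. The function name is hypothetical and the "rare" row is made up for illustration.

```python
def filter_asvs(table, min_frequency=2, min_samples=3):
    """Keep ASVs with enough total reads that are present in enough samples.

    `table` maps ASV_ID -> list of per-sample counts.
    """
    kept = {}
    for asv_id, counts in table.items():
        total = sum(counts)
        prevalence = sum(1 for c in counts if c > 0)
        if total >= min_frequency and prevalence >= min_samples:
            kept[asv_id] = counts
    return kept


# The first two rows come from the table above; "rare" is a made-up ASV.
table = {
    "b7db": [2841, 1118, 614, 442],
    "bab5": [9163, 7036, 19075, 13799],
    "rare": [0, 5, 0, 0],
}
print(sorted(filter_asvs(table)))
```

The made-up ASV is dropped because it appears in a single sample, even though its total count is above the frequency threshold.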
After preprocessing (primer trimming, QC) and ASV inference with DADA2, the QIIME2 analysis goals are:
The abundance tables are the final data for further downstream analysis and visualisations.
Output | Description |
---|---|
Absolute abundance table | Raw ASV counts per sample |
Relative abundance table | Normalized counts (e.g. proportions per sample) |
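The step from absolute to relative abundance is a per-sample division by the total read count. This is a minimal sketch with a hypothetical function name; the counts are two ASVs taken from the table shown earlier, so the resulting proportions are illustrative only.

```python
def relative_abundance(table):
    """Convert per-sample counts to proportions that sum to 1 in each sample.

    `table` maps sample name -> {ASV_ID: count}.
    """
    result = {}
    for sample, counts in table.items():
        total = sum(counts.values())
        result[sample] = {
            asv: (count / total if total else 0.0) for asv, count in counts.items()
        }
    return result


counts = {
    "A120_A": {"b7db": 2841, "bab5": 9163},
    "A120_B": {"b7db": 1118, "bab5": 7036},
}
rel = relative_abundance(counts)
for sample, proportions in rel.items():
    print(sample, {asv: round(p, 3) for asv, p in proportions.items()})
```

Proportions make samples with different sequencing depths comparable, at the cost of losing the information on how many reads each sample had.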
Alpha diversity measures the species diversity within samples.
Beta diversity measures the differences in species community composition between samples.
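To give both notions a concrete form, the sketch below computes one alpha diversity index (Shannon) for a single sample and one beta diversity measure (Bray-Curtis dissimilarity) between two samples. These two are common examples, not necessarily the only metrics reported by the pipeline, and the counts are toy values.

```python
import math


def shannon(counts):
    """Shannon index H = -sum(p * ln p) over ASVs with non-zero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two samples.

    0 means identical counts, 1 means no shared ASVs.
    Both lists must use the same ASV order and must not both be empty.
    """
    return sum(abs(x - y) for x, y in zip(a, b)) / (sum(a) + sum(b))


# Toy counts for three ASVs in two samples.
sample_a = [50, 30, 20]
sample_b = [10, 0, 90]

print("Shannon A:", round(shannon(sample_a), 3))
print("Shannon B:", round(shannon(sample_b), 3))
print("Bray-Curtis A vs B:", round(bray_curtis(sample_a, sample_b), 3))
```

Sample B is dominated by one ASV, so its Shannon index is lower than that of sample A, whose reads are spread more evenly.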
Note
There are many other output files; I'll let you investigate them!
Complete list here
There’s no shame in getting help!
Thank you for your attention!