Preprocessing with nf-core/ampliseq
1 Prepare workspace
Connect to the BioPipes VM using the ifb.m4.2xlarge flavour (8 vCPU, 32 GB RAM, 200 GB local disk) and then:
wget https://github.com/ANF-MetaBioDiv/course-material/archive/refs/tags/v2.tar.gz
tar -xzf v2.tar.gz
cd course-material-2/
2 List files to analyse
First of all, you need a table listing all the samples you want to analyse and their corresponding fastq files (forward and reverse in the case of paired-end sequencing). If your samples come from multiple runs, the run ID is indicated in a dedicated run column. Below is a bash command generating such a file for the practical:
ls data/raw/ | \
paste - - | \
awk '
BEGIN {OFS = FS = "\t"; print "sampleID","run","forwardReads","reverseReads"}
{
sample = $1; sub(/_.+$/,"",sample)
run = "run_01"
fw_file = "data/raw/" $1
rv_file = "data/raw/" $2
print sample, run, fw_file, rv_file
}
' > practicals/ampliseq_files/samplesheet.tsv
And here is an overview of the generated file:
sampleID | run | forwardReads | reverseReads |
---|---|---|---|
S11B | run_01 | data/raw/S11B_R1.fastq.gz | data/raw/S11B_R2.fastq.gz |
S1B | run_01 | data/raw/S1B_R1.fastq.gz | data/raw/S1B_R2.fastq.gz |
S2B | run_01 | data/raw/S2B_R1.fastq.gz | data/raw/S2B_R2.fastq.gz |
S2S | run_01 | data/raw/S2S_R1.fastq.gz | data/raw/S2S_R2.fastq.gz |
S3B | run_01 | data/raw/S3B_R1.fastq.gz | data/raw/S3B_R2.fastq.gz |
S3S | run_01 | data/raw/S3S_R1.fastq.gz | data/raw/S3S_R2.fastq.gz |
S4B | run_01 | data/raw/S4B_R1.fastq.gz | data/raw/S4B_R2.fastq.gz |
S4S | run_01 | data/raw/S4S_R1.fastq.gz | data/raw/S4S_R2.fastq.gz |
S5B | run_01 | data/raw/S5B_R1.fastq.gz | data/raw/S5B_R2.fastq.gz |
S5S | run_01 | data/raw/S5S_R1.fastq.gz | data/raw/S5S_R2.fastq.gz |
S6B | run_01 | data/raw/S6B_R1.fastq.gz | data/raw/S6B_R2.fastq.gz |
S6S | run_01 | data/raw/S6S_R1.fastq.gz | data/raw/S6S_R2.fastq.gz |
S7B | run_01 | data/raw/S7B_R1.fastq.gz | data/raw/S7B_R2.fastq.gz |
S7S | run_01 | data/raw/S7S_R1.fastq.gz | data/raw/S7S_R2.fastq.gz |
S8B | run_01 | data/raw/S8B_R1.fastq.gz | data/raw/S8B_R2.fastq.gz |
S8S | run_01 | data/raw/S8S_R1.fastq.gz | data/raw/S8S_R2.fastq.gz |
S9B | run_01 | data/raw/S9B_R1.fastq.gz | data/raw/S9B_R2.fastq.gz |
S9S | run_01 | data/raw/S9S_R1.fastq.gz | data/raw/S9S_R2.fastq.gz |
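A quick sanity check on the samplesheet never hurts: every row should have exactly four tab-separated fields. The snippet below is a minimal sketch using a toy two-row sheet so it runs anywhere; in the practical, point `sheet` at `practicals/ampliseq_files/samplesheet.tsv` instead.

```shell
# Minimal samplesheet check: every row must have exactly 4 tab-separated
# fields. A toy sheet is created here so the snippet is self-contained;
# in the practical, set sheet=practicals/ampliseq_files/samplesheet.tsv
sheet=$(mktemp)
printf 'sampleID\trun\tforwardReads\treverseReads\n' > "$sheet"
printf 'S11B\trun_01\tdata/raw/S11B_R1.fastq.gz\tdata/raw/S11B_R2.fastq.gz\n' >> "$sheet"
bad=$(awk -F'\t' 'NF != 4 { n++ } END { print n + 0 }' "$sheet")
echo "malformed lines: $bad"
rm -f "$sheet"
```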
3 Make nextflow available
If this is your first time connecting to the VM:
conda init
Then, each time you open a terminal:
conda activate nextflow
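To confirm that the activation worked, you can check that the nextflow launcher is on your PATH. The check below is safe to run even when it is not:

```shell
# Check whether nextflow is on the PATH after activating the environment;
# prints a hint if it is not (e.g. the environment was not activated).
if command -v nextflow > /dev/null; then
    status_msg="nextflow found at $(command -v nextflow)"
else
    status_msg="nextflow not found: did you run 'conda activate nextflow'?"
fi
echo "$status_msg"
```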
4 Run nf-core/ampliseq
This is an example of a command to launch nf-core/ampliseq:
nextflow run nf-core/ampliseq \
-bg \
-r 2.14.0 \
-profile docker \
-c "practicals/ampliseq_files/ampliseq_vm.config" \
-params-file practicals/ampliseq_files/ampliseq_parameters_simple.yaml
With `run` you specify the workflow you want to use. The option `-bg` runs the workflow as a background job: you can close your ssh connection and reconnect later, and the workflow will keep running as long as the VM is up. Remove this option if you want to see the progress bars. With `-r` you pin a specific version of the workflow, which is highly recommended for reproducibility purposes. With `-profile` you select lists of preset parameters; here we use the `docker` profile, telling Nextflow to download and use Docker images of the tools needed by the workflow. With `-c` you provide a workflow execution configuration file; here `practicals/ampliseq_files/ampliseq_vm.config` specifies the maximum amount of resources allocated to the workflow. Finally, `-params-file` points to the parameters file `practicals/ampliseq_files/ampliseq_parameters_simple.yaml`. This is where you specify, for example, the primers used for the amplification. Let’s have a closer look at this file:
# Main arguments
input: "practicals/ampliseq_files/samplesheet.tsv"
FW_primer: "CCTACGGGNBGCASCAG"
RV_primer: "GACTACNVGGGTATCTAAT"
outdir: "outputs/ampliseq"
These are the required parameters you need to specify in order to run nf-core/ampliseq. The parameter `input` is the path to the table listing all the files you want to analyse. Primers are specified with `FW_primer` and `RV_primer`, and the output directory with `outdir`. But as you will see later, there are many more parameters to fine-tune your workflow; for a complete list, have a look here.
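Before launching, it can be worth checking that the four required keys are actually present in your params file. A hedged sketch, shown on an inline copy of the YAML above so it is self-contained; in practice, point `params` at `practicals/ampliseq_files/ampliseq_parameters_simple.yaml`:

```shell
# Check that the four parameters required by nf-core/ampliseq are present
# in a params file. An inline copy of the YAML is used here so the snippet
# is self-contained; in practice, set
#   params=practicals/ampliseq_files/ampliseq_parameters_simple.yaml
params=$(mktemp)
cat > "$params" <<'EOF'
input: "practicals/ampliseq_files/samplesheet.tsv"
FW_primer: "CCTACGGGNBGCASCAG"
RV_primer: "GACTACNVGGGTATCTAAT"
outdir: "outputs/ampliseq"
EOF
missing=0
for key in input FW_primer RV_primer outdir; do
    grep -q "^${key}:" "$params" || { echo "missing required key: $key"; missing=1; }
done
echo "missing=$missing"
rm -f "$params"
```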
5 Main outputs
Here is a simplified tree structure illustrating the key output files and directories produced by the nf-core/ampliseq
pipeline:
outputs/ampliseq/
├── summary_report/
│ └── summary_report.html # Pipeline summary report: overview of read counts, stats, and key results
├── multiqc/
│ └── multiqc_report.html # Aggregated QC report for all samples
├── qiime2/
│ ├── abundance_tables/ # Exported ASV abundance tables (e.g. BIOM, TSV formats)
│ └── input/
│ ├── table.qza # QIIME 2 feature table (ASV counts per sample)
│ └── taxonomy.qza # QIIME 2 artifact with taxonomy assignment per ASV
├── barrnap/
│ └── summary.tsv # Summary of rRNA detection (e.g. 16S) by Barrnap for each sample
├── cutadapt/
│ └── cutadapt_summary.tsv # Adapter trimming statistics: read counts before/after trimming
├── dada2/
│ ├── ASV_seqs.fasta # Representative ASV sequences in FASTA format
│ ├── ASV_table.tsv # Tab-separated ASV abundance table (samples × ASVs)
│ ├── ASV_tax.silva_138_2.tsv # ASVs annotated with taxonomy from SILVA 138.2 database
│ └── DADA2_stats.tsv # Read filtering and denoising stats per sample
├── fastqc/ # Contains raw and trimmed read quality reports (per-sample)
├── input/
│ └── samplesheet.tsv # Metadata table listing input samples and file paths
├── phyloseq/
│ └── dada2_phyloseq.rds # RDS object containing full phyloseq object (ASVs, taxonomy, metadata)
└── pipeline_info/ # Nextflow logs, execution reports, and trace info
.nextflow.log # Nextflow log file (useful for debugging)
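Once the pipeline has finished, a few one-liners go a long way for a first look at these outputs. For instance, the number of ASVs is simply the number of FASTA headers in the representative-sequence file; a toy FASTA stands in for the real output below so the snippet runs anywhere:

```shell
# Count ASVs: each '>' header in the representative-sequence FASTA is one ASV.
# A toy FASTA stands in for outputs/ampliseq/dada2/ASV_seqs.fasta here.
asv_fasta=$(mktemp)
printf '>ASV_1\nACGTACGT\n>ASV_2\nGGCCGGCC\n' > "$asv_fasta"
n_asv=$(grep -c '^>' "$asv_fasta")
echo "number of ASVs: $n_asv"
rm -f "$asv_fasta"
```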
6 Tune and rerun
You can decide later to rerun your workflow with different parameters. An example of a more detailed parameter file:
# Main arguments
input: "practicals/ampliseq_files/samplesheet.tsv"
FW_primer: "CCTACGGGNBGCASCAG"
RV_primer: "GACTACNVGGGTATCTAAT"
outdir: "outputs/ampliseq"
# Read trimming and quality filtering
trunclenf: 250
trunclenr: 220
maxee: 3
# Amplicon Sequence Variants (ASV) calculation
sample_inference: "pooled"
# ASV post processing
min_len_asv: 200
# Taxonomic database
dada_ref_taxonomy: "greengenes2=2024.09"
# ASV filtering
exclude_taxa: mitochondria,chloroplast # default values
min_frequency: 10
min_samples: 2
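Truncation lengths (`trunclenf`/`trunclenr`) are usually chosen from the quality profiles in the FastQC/MultiQC reports, but it can also help to look at the raw read lengths directly. A toy FASTQ record is used below so the snippet runs anywhere; the commented `zcat` line is the real-data variant:

```shell
# Read length of each FASTQ record (the sequence is every 2nd of 4 lines).
# On real data you would run, for example:
#   zcat data/raw/S11B_R1.fastq.gz | awk 'NR % 4 == 2 { print length($0) }' | sort -n | uniq -c
len=$(printf '@read_1\nACGTACGTAC\n+\nIIIIIIIIII\n' | awk 'NR % 4 == 2 { print length($0) }')
echo "read length: $len"
```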
Add the option `-resume` to avoid redoing analyses already completed during the previous run:
nextflow run nf-core/ampliseq \
-resume \
-bg \
-r 2.14.0 \
-profile docker \
-c "practicals/ampliseq_files/ampliseq_vm.config" \
-params-file practicals/ampliseq_files/ampliseq_parameters_tuned.yaml
7 Add metadata for postprocessing
nf-core/ampliseq allows you to run some alpha and beta diversity analyses on your ASV table. To do so, you need to provide a table with the metadata. This table contains the sample IDs in a column named `ID`, plus as many descriptor columns as you want. These descriptors have to be categorical: you can’t, for example, add continuous values such as temperature measurements.
So let’s create such a table:
cat data/context/mapfileFA.txt | \
tr '\r' '\n' | \
awk '
BEGIN {FS=OFS="\t"; print "ID", "geo", "depth"}
NR > 1 {
sample = $1
geo = $2
depth = $1; sub(/S[0-9]+/, "", depth)
print sample, geo, depth
}
' > practicals/ampliseq_files/ampliseq_metadata.tsv
ID | geo | depth |
---|---|---|
S1B | North | B |
S2B | North | B |
S2S | North | S |
S3B | North | B |
S3S | North | S |
S4B | North | B |
S4S | North | S |
S5B | North | B |
S5S | North | S |
S6B | South | B |
S6S | South | S |
S7B | South | B |
S7S | South | S |
S8B | South | B |
S8S | South | S |
S9B | South | B |
S9S | South | S |
S11B | South | B |
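Since ampliseq joins the metadata to the samples by ID, it is worth checking that every sample in the samplesheet has a matching metadata row. A sketch on toy files (replace them with `practicals/ampliseq_files/samplesheet.tsv` and `practicals/ampliseq_files/ampliseq_metadata.tsv` in the practical); note that the process substitution requires bash:

```shell
# List samplesheet IDs that have no matching metadata row. Toy files stand
# in for practicals/ampliseq_files/{samplesheet,ampliseq_metadata}.tsv.
sheet=$(mktemp); meta=$(mktemp)
printf 'sampleID\trun\tfw\trv\nS1B\trun_01\tf1\tr1\nS2B\trun_01\tf2\tr2\n' > "$sheet"
printf 'ID\tgeo\tdepth\nS1B\tNorth\tB\n' > "$meta"   # S2B deliberately missing
missing_ids=$(comm -23 \
    <(awk -F'\t' 'NR > 1 { print $1 }' "$sheet" | sort) \
    <(awk -F'\t' 'NR > 1 { print $1 }' "$meta"  | sort))
echo "samples without metadata: ${missing_ids:-none}"
rm -f "$sheet" "$meta"
```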
Now that your metadata table has been created, you can refer to it in the parameter file and rerun ampliseq with the `-resume` option:
nextflow run nf-core/ampliseq \
-resume \
-bg \
-r 2.14.0 \
-profile docker \
-c "practicals/ampliseq_files/ampliseq_vm.config" \
-params-file practicals/ampliseq_files/ampliseq_parameters_tuned_with_metadata.yaml