Preprocessing with nf-core/ampliseq

1 Prepare workspace

Connect to the BioPipes VM using ifb.m4.2xlarge (8 vCPU, 32 GB RAM, 200 GB local disk), then run:

wget https://github.com/ANF-MetaBioDiv/course-material/archive/refs/tags/v2.tar.gz
tar -xzf v2.tar.gz
cd course-material-2/

2 List files to analyse

First of all, you need a table listing all the samples you want to analyse and their corresponding fastq files (forward and reverse in case of paired-end sequencing). If your samples come from multiple sequencing runs, the run id is given in an additional column named run. Below is a bash command that generates such a file for the practical:

ls data/raw/ | \
  paste - - | \
  awk '
    BEGIN {OFS = FS = "\t"; print "sampleID","run","forwardReads","reverseReads"}
    {
      sample = $1; sub(/_.+$/,"",sample)
      run = "run_01"
      fw_file = "data/raw/" $1
      rv_file = "data/raw/" $2
      print sample, run, fw_file, rv_file
    }
  ' > practicals/ampliseq_files/samplesheet.tsv

Here is an overview of the generated file:

sampleID run forwardReads reverseReads
S11B run_01 data/raw/S11B_R1.fastq.gz data/raw/S11B_R2.fastq.gz
S1B run_01 data/raw/S1B_R1.fastq.gz data/raw/S1B_R2.fastq.gz
S2B run_01 data/raw/S2B_R1.fastq.gz data/raw/S2B_R2.fastq.gz
S2S run_01 data/raw/S2S_R1.fastq.gz data/raw/S2S_R2.fastq.gz
S3B run_01 data/raw/S3B_R1.fastq.gz data/raw/S3B_R2.fastq.gz
S3S run_01 data/raw/S3S_R1.fastq.gz data/raw/S3S_R2.fastq.gz
S4B run_01 data/raw/S4B_R1.fastq.gz data/raw/S4B_R2.fastq.gz
S4S run_01 data/raw/S4S_R1.fastq.gz data/raw/S4S_R2.fastq.gz
S5B run_01 data/raw/S5B_R1.fastq.gz data/raw/S5B_R2.fastq.gz
S5S run_01 data/raw/S5S_R1.fastq.gz data/raw/S5S_R2.fastq.gz
S6B run_01 data/raw/S6B_R1.fastq.gz data/raw/S6B_R2.fastq.gz
S6S run_01 data/raw/S6S_R1.fastq.gz data/raw/S6S_R2.fastq.gz
S7B run_01 data/raw/S7B_R1.fastq.gz data/raw/S7B_R2.fastq.gz
S7S run_01 data/raw/S7S_R1.fastq.gz data/raw/S7S_R2.fastq.gz
S8B run_01 data/raw/S8B_R1.fastq.gz data/raw/S8B_R2.fastq.gz
S8S run_01 data/raw/S8S_R1.fastq.gz data/raw/S8S_R2.fastq.gz
S9B run_01 data/raw/S9B_R1.fastq.gz data/raw/S9B_R2.fastq.gz
S9S run_01 data/raw/S9S_R1.fastq.gz data/raw/S9S_R2.fastq.gz
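
Before launching the workflow, it can be worth checking that every file listed in the samplesheet really exists. The function below is a minimal sketch of such a check, not part of nf-core/ampliseq: the name check_samplesheet is hypothetical, and it assumes the four-column, tab-separated layout shown above.

```shell
# Hypothetical helper (not part of nf-core/ampliseq): report every FASTQ
# file listed in a samplesheet that does not exist on disk.
# Assumes tab-separated columns: sampleID, run, forwardReads, reverseReads.
check_samplesheet() {
  local sheet="$1" status=0 header sample run fw rv f
  local tab
  tab=$(printf '\t')
  {
    read -r header                      # skip the header line
    while IFS="$tab" read -r sample run fw rv; do
      for f in "$fw" "$rv"; do
        if [ ! -f "$f" ]; then
          echo "missing file for ${sample}: ${f}"
          status=1
        fi
      done
    done
  } < "$sheet"
  return "$status"                      # 0 if all files exist, 1 otherwise
}

# Usage, from the course-material-2/ directory:
#   check_samplesheet practicals/ampliseq_files/samplesheet.tsv
```

The function prints nothing and returns 0 when all files are found, so it can be chained with the launch command using &&.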

3 Make nextflow available

If this is the first time you connect to the ampliseq_vm, run:

conda init

Then, each time you open a terminal, run:

conda activate nextflow

4 Run nf-core/ampliseq

This is an example of a command to launch nf-core/ampliseq:

nextflow run nf-core/ampliseq \
  -bg \
  -r 2.14.0 \
  -profile docker \
  -c "practicals/ampliseq_files/ampliseq_vm.config" \
  -params-file practicals/ampliseq_files/ampliseq_parameters_simple.yaml

With run you specify the workflow you want to use. The option -bg runs the workflow as a background job: you can close your ssh connection and reconnect to your VM later, and the workflow keeps running as long as the VM is running. Remove this option if you want to see the progress bars. With -r you select a specific version of the workflow, which is highly recommended for reproducibility. With -profile you select sets of preset parameters. Here we use the profile docker, which tells nextflow to download and use docker images of the tools needed by the workflow. With -c you provide a file with workflow execution configurations. Here we use practicals/ampliseq_files/ampliseq_vm.config to specify the maximum amount of resources allocated to our workflow. Finally, -params-file points to the parameter file practicals/ampliseq_files/ampliseq_parameters_simple.yaml. This is where you specify, for example, the primers used for the amplification. Let’s have a closer look at this file:

# Main arguments

input: "practicals/ampliseq_files/samplesheet.tsv"
FW_primer: "CCTACGGGNBGCASCAG"
RV_primer: "GACTACNVGGGTATCTAAT"
outdir: "outputs/ampliseq"

These are the parameters you need to specify in order to run nf-core/ampliseq. The parameter input is the path to the table listing all the files you want to analyse. Primers are specified with FW_primer and RV_primer, and the output directory with outdir. As you will see later, there are many more parameters to fine-tune your workflow; for a complete list, have a look at the nf-core/ampliseq parameter documentation.
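
To catch a forgotten parameter before the workflow starts, you could check the parameter file for the four keys described above. This is only a sketch: the function name check_required_params is hypothetical, and it simply looks for lines starting with each key rather than parsing YAML properly.

```shell
# Hypothetical helper: report which of the four keys described above are
# missing from a parameter file. This is a plain text search for top-level
# "key:" lines, not a real YAML parser.
check_required_params() {
  local params_file="$1" key status=0
  for key in input FW_primer RV_primer outdir; do
    if ! grep -q "^${key}:" "$params_file"; then
      echo "missing required parameter: ${key}"
      status=1
    fi
  done
  return "$status"                      # 0 if all four keys were found
}

# Usage:
#   check_required_params practicals/ampliseq_files/ampliseq_parameters_simple.yaml
```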

5 Main outputs

Here is a simplified tree structure illustrating the key output files and directories produced by the nf-core/ampliseq pipeline:

outputs/ampliseq/
├── summary_report/
│   └── summary_report.html         # Pipeline summary report: overview of read counts, stats, and key results
├── multiqc/
│   └── multiqc_report.html         # Aggregated QC report for all samples
├── qiime2/
│   ├── abundance_tables/           # Exported ASV abundance tables (e.g. BIOM, TSV formats)
│   └── input/
│       ├── table.qza               # QIIME 2 feature table (ASV counts per sample)
│       └── taxonomy.qza            # QIIME 2 artifact with taxonomy assignment per ASV
├── barrnap/
│   └── summary.tsv                 # Summary of rRNA detection (e.g. 16S) by Barrnap for each sample
├── cutadapt/
│   └── cutadapt_summary.tsv        # Adapter trimming statistics: read counts before/after trimming
├── dada2/
│   ├── ASV_seqs.fasta              # Representative ASV sequences in FASTA format
│   ├── ASV_table.tsv               # Tab-separated ASV abundance table (samples × ASVs)
│   ├── ASV_tax.silva_138_2.tsv     # ASVs annotated with taxonomy from SILVA 138.2 database
│   └── DADA2_stats.tsv             # Read filtering and denoising stats per sample
├── fastqc/                         # Contains raw and trimmed read quality reports (per-sample)
├── input/
│   └── samplesheet.tsv             # Metadata table listing input samples and file paths
├── phyloseq/
│   └── dada2_phyloseq.rds          # RDS object containing full phyloseq object (ASVs, taxonomy, metadata)
└── pipeline_info/                  # Nextflow logs, execution reports, and trace info

.nextflow.log                       # Nextflow log file (useful for debugging)
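
Once the run has finished, a quick first look at the ASV table tells you how many ASVs and samples it holds. The sketch below assumes that dada2/ASV_table.tsv has one header line, one leading ID column, and one row per ASV; check this against your own file, as the layout is an assumption here. The function name asv_table_dims is hypothetical.

```shell
# Hypothetical helper: print the number of ASVs and samples in an ASV table.
# Assumed layout (check your file): a header line, a leading ID column,
# then one column per sample and one row per ASV.
asv_table_dims() {
  awk -F'\t' '
    NR == 1 { samples = NF - 1; next }   # header: ID column + sample columns
    NF > 0  { asvs++ }                   # each non-empty line is one ASV
    END     { printf "%d ASVs x %d samples\n", asvs, samples }
  ' "$1"
}

# Usage:
#   asv_table_dims outputs/ampliseq/dada2/ASV_table.tsv
```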

6 Tune and rerun

You can decide later to rerun your workflow with different parameters. Here is an example of a more detailed parameter file:

# Main arguments

input: "practicals/ampliseq_files/samplesheet.tsv"
FW_primer: "CCTACGGGNBGCASCAG"
RV_primer: "GACTACNVGGGTATCTAAT"
outdir: "outputs/ampliseq"

# Read trimming and quality filtering

trunclenf: 250
trunclenr: 220
maxee: 3

# Amplicon Sequence Variants (ASV) calculation

sample_inference: "pooled"

# ASV post processing

min_len_asv: 200

# Taxonomic database

dada_ref_taxonomy: "greengenes2=2024.09"

# ASV filtering

exclude_taxa: mitochondria,chloroplast # default values
min_frequency: 10
min_samples: 2

Add the option -resume to avoid redoing analyses already done during the previous run:

nextflow run nf-core/ampliseq \
  -resume \
  -bg \
  -r 2.14.0 \
  -profile docker \
  -c "practicals/ampliseq_files/ampliseq_vm.config" \
  -params-file practicals/ampliseq_files/ampliseq_parameters_tuned.yaml
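
Because -bg detaches the run from your terminal, it can be useful to confirm that the previous run has finished before relaunching. The sketch below relies on an assumption: that nextflow, when started with -bg, records its process id in a file named .nextflow.pid in the launch directory. Check that this file exists on your VM before relying on it. The function name nextflow_bg_running is hypothetical.

```shell
# Hypothetical helper: return 0 if the process recorded in a PID file is
# still alive. The default file name .nextflow.pid is an assumption about
# where a background nextflow run stores its process id.
nextflow_bg_running() {
  local pid_file="${1:-.nextflow.pid}" pid
  [ -f "$pid_file" ] || return 1        # no PID file: nothing to check
  pid=$(cat "$pid_file")
  # kill -0 sends no signal; it only tests whether the process exists
  kill -0 "$pid" 2>/dev/null
}

# Usage, from the directory where the workflow was launched:
#   if nextflow_bg_running; then echo "still running"; else echo "not running"; fi
```

Note that a process id can be reused by the system after a reboot, so treat the answer as a hint and look at .nextflow.log if in doubt.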

7 Add metadata for postprocessing

nf-core/ampliseq allows you to run some alpha and beta diversity analyses on your ASV table. To do so, you need to provide a table with the metadata. This table contains the sample ids in the column ID and as many columns as descriptors you want to specify. These descriptors have to be categorical: you cannot, for example, add continuous values such as temperature measurements.

So let’s create such a table:

cat data/context/mapfileFA.txt | \
  tr '\r' '\n' | \
  awk '
    BEGIN {FS=OFS="\t"; print "ID", "geo", "depth"}
    NR > 1 {
      sample = $1
      geo = $2
      depth = $1; sub(/S[0-9]+/, "", depth)
      print sample, geo, depth
    }
  ' > practicals/ampliseq_files/ampliseq_metadata.tsv

Here is an overview of the generated file:

ID geo depth
S1B North B
S2B North B
S2S North S
S3B North B
S3S North S
S4B North B
S4S North S
S5B North B
S5S North S
S6B South B
S6S South S
S7B South B
S7S South S
S8B South B
S8S South S
S9B South B
S9S South S
S11B South B
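
The ids in the metadata table need to match the sample ids in the samplesheet. The sketch below lists any id present in only one of the two files; it assumes both files are tab-separated with a header line and the id in the first column, as in the examples above. The function name compare_sample_ids is hypothetical.

```shell
# Hypothetical helper: list ids found in only one of the two tables.
# Assumes tab-separated files with a header line and the id in column 1.
compare_sample_ids() {
  local samplesheet="$1" metadata="$2" tmp_dir
  tmp_dir=$(mktemp -d)
  # Drop the header, keep the first column, and sort (comm needs sorted input)
  tail -n +2 "$samplesheet" | cut -f1 | sort > "$tmp_dir/sheet_ids"
  tail -n +2 "$metadata"    | cut -f1 | sort > "$tmp_dir/meta_ids"
  # comm -3 hides ids common to both files:
  #   unindented lines   = only in the samplesheet
  #   tab-indented lines = only in the metadata table
  comm -3 "$tmp_dir/sheet_ids" "$tmp_dir/meta_ids"
  rm -r "$tmp_dir"
}

# Usage:
#   compare_sample_ids practicals/ampliseq_files/samplesheet.tsv \
#     practicals/ampliseq_files/ampliseq_metadata.tsv
```

An empty output means both tables list the same ids.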

Now that your metadata table has been created, you can refer to it in the parameter file and rerun ampliseq with the -resume option:

nextflow run nf-core/ampliseq \
  -resume \
  -bg \
  -r 2.14.0 \
  -profile docker \
  -c "practicals/ampliseq_files/ampliseq_vm.config" \
  -params-file practicals/ampliseq_files/ampliseq_parameters_tuned_with_metadata.yaml