Pipeline

Introduction

This page describes my pipeline for processing and analyzing data from the 16S rRNA amplicon sequencing. Obviously, the pipeline will not be suitable for everyone, so I highly recommend that you first go through the checklist below and make sure you are able to check all the boxes before proceeding.

  • [ ] Sequencing data are paired-end reads.

  • [ ] Paired reads have “enough” overlaps.

  • [ ] I have access to a grid computing cluster such as Sun Grid Engine (SGE).

Throughout the page, you may see many Dokdo and QIIME 2 actions that you are not familiar with. Even though I will try to provide a brief explanation for each of the actions, if you want more detailed explanation, please refer to the appropriate page(s) in the Read the Docs.

Directory Structure

Below figure displays my typical directory structure after the pipeline is run. Note that for this toy example, I only have two sequencing runs, but you can easily modify the structure to accommodate a scenario where there are more than two runs.

Microbiome
├── QIIME2
│   └── Classifiers
│       └── silva-138-99-nb-classifier.qza
│
├── Project1
│   ├── FASTQ
│   │   ├── Run1
│   │   │   ├── sample_S1_R1_001.fastq.gz
│   │   │   ├── sample_S1_R2_001.fastq.gz
│   │   │   ...
│   │   │
│   │   └── Run2
│   │       ├── sample_S1_R1_001.fastq.gz
│   │       ├── sample_S1_R2_001.fastq.gz
│   │       ...
│   │
│   ├── Demux_Data
│   │   ├── manifest-file-run1.tsv
│   │   ├── manifest-file-run2.tsv
│   │   ├── paired-end-demux-run1.qza
│   │   └── paired-end-demux-run2.qza
│   │
│   ├── Pipeline
│   │   ├── table-run1.qza
│   │   ├── table-run2.qza
│   │   ├── rep-seqs-run1.qza
│   │   ├── rep-seqs-run2.qza
│   │   ├── denoising-stats-run1.qza
│   │   ├── denoising-stats-run2.qza
│   │   ├── demux-run1.qzv
│   │   ├── demux-run2.qzv
│   │   ├── table-merged.qza
│   │   ├── rep-seqs-merged.qza
│   │   ├── sample-metadata.tsv
│   │   ├── alpha-rarefaction-max.qzv
│   │   ├── alpha-rarefaction-min.qzv
│   │   ├── aligned-rep-seqs.qza
│   │   ├── masked-aligned-rep-seqs.qza
│   │   ├── unrooted-tree.qza
│   │   ├── rooted-tree.qza
│   │   ├── taxonomy.qza
│   │   ├── taxa-bar-plots.qzv
│   │   └── core-metrics-results
│   │       ├── jaccard_distance_matrix.qza
│   │       ├── jaccard_pcoa_results.qza
│   │       ├── jaccard_emperor.qzv
│   │       ├── bray_curtis_distance_matrix.qza
│   │       ├── bray_curtis_pcoa_results.qza
│   │       ├── bray_curtis_emperor.qzv
│   │       ├── unweighted_unifrac_distance_matrix.qza
│   │       ├── unweighted_unifrac_pcoa_results.qza
│   │       ├── unweighted_unifrac_emperor.qzv
│   │       ├── weighted_unifrac_distance_matrix.qza
│   │       ├── weighted_unifrac_pcoa_results.qza
│   │       ├── weighted_unifrac_emperor.qzv
│   │       ├── evenness_vector.qza
│   │       ├── evenness_group-significance.qzv
│   │       ├── shannon_vector.qza
│   │       ├── shannon_group-significance.qzv
│   │       ├── faith_pd_vector.qza
│   │       ├── faith_pd_group-significance.qzv
│   │       ├── observed_features_vector.qza
│   │       └── observed_features_group-significance.qzv
│   │
│   └── Analysis
│       ├── Read_Quality
│       ├── Rarefaction_Curve
│       ├── Alpha_Diversity
│       ├── Beta_Diversity
│       ├── Taxa_Barplots
│       ...
│
├── Project2
...

1. Demultiplexing

I use Illumina’s bcl2fastq software to demultiplex sequence reads. For example, if sequencing data was generated by the MiSeq platform with 2x300 bp reads, I would use a command line similar to the following:

$ bcl2fastq \
    --output-dir $OUTPUT_DIR \
    --sample-sheet $SAMPLE_SHEET \
    --runfolder-dir $RUNFOLDER_DIR \
    --interop-dir $INTEROP_DIR \
    --stats-dir $STATS_DIR \
    --reports-dir $REPORTS_DIR \
    --no-lane-splitting \
    --tiles s_1 \
    --use-bases-mask Y301,I8,I8,Y301 \
    --barcode-mismatches 0 \
    --processing-threads 10

After demultiplexing is finished, the end product should be a directory containing two FASTQ files per sample (i.e. forward and reverse reads). Do not store FASTQ files from different sequencing runs in the same directory. Skip this step if you already have demultiplexed FASTQ files.

2. Import Sequences to QIIME 2

$ dokdo make-manifest \
    -i $FASTA_DIR \
    -o manifest_file.tsv

$ qiime tools import \
    --type 'SampleData[PairedEndSequencesWithQuality]' \
    --input-path manifest_file.tsv \
    --input-format PairedEndFastqManifestPhred33V2 \
    --output-path paired-end-demux.qza

$ qiime demux summarize \
    --i-data paired-end-demux.qza \
    --o-visualization demux.qzv

See also:

3. Identify ASVs

We can identify ASVs by denoising the sequence reads with DADA2.

$ qsub -S /bin/sh -cwd -l h=$NODE_NAME -V -pe pePAC 45 qsubme-denoise-paired.sh

The qsubme-denoise-paired.sh file looks like:

#!/bin/bash

export LC_ALL=en_US.utf-8
export LANG=en_US.utf-8

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs paired-end-demux.qza \
  --p-trunc-len-f 245 \
  --p-trunc-len-r 240 \
  --p-trim-left-f 17 \
  --p-trim-left-r 21 \
  --p-n-threads 40 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats denoising-stats.qza

See also:

4. Merge Multiple Sequencing Runs

Note: This step can be skipped if there was only one sequencing run.

We can merge multiple feature tables with the following:

$ qiime feature-table merge \
    --i-tables table-run1.qza \
    --i-tables table-run2.qza \
    --i-tables table-run3.qza \
    --o-merged-table table-merged.qza

We can also merge multiple representative sequences with the following:

$ qiime feature-table merge-seqs \
    --i-data rep-seqs-run1.qza \
    --i-data rep-seqs-run2.qza \
    --i-data rep-seqs-run3.qza \
    --o-merged-data rep-seqs-merged.qza

See also:

5. Classify Taxonomy

We assign taxonomy to the representative sequences.

$ qsub -S /bin/sh -cwd -l h=$NODE_NAME -V -pe pePAC 45 qsubme-classify-sklearn.sh

The qsubme-classify-sklearn.sh file looks like:

#!/bin/bash

export LC_ALL=en_US.utf-8
export LANG=en_US.utf-8

qiime feature-classifier classify-sklearn \
  --i-classifier $TAXONOMY_CLASSIFIER \
  --i-reads rep-seqs.qza \
  --p-n-jobs 40 \
  --o-classification taxonomy.qza

See also:

6. Summarize and Filter ASV Table

Note: In this step, the filtering part can be skipped when it’s justifiable.

At this point, your (merged) feature table will probably contain lots of false positive signals from contaminants, especially if the input DNA is from low-biomass samples.

Summarize Original Table

We need to find the range of ASV frequency.

$ dokdo summarize -i table.qza
Number of samples: 338
Number of features: 15935
Total frequency: 24019367.0
Frequency per sample:
0.00      7318.00
0.25     38498.00
0.50     59694.00
0.75     84168.75
1.00    441278.00
Frequency per feature:
0.00          1.0
0.25         20.0
0.50         56.0
0.75        172.0
1.00    6576141.0

Contingency-Based Filtering

We filter out ASVs that are present only in a single sample.

$ qiime feature-table filter-features \
    --i-table table.qza \
    --p-min-samples 2 \
    --o-filtered-table table-s2.qza

Total-Frequency-Based Filtering

We filter out ASVs with a total abundance (summed across all samples) of less than 10.

$ qiime feature-table filter-features \
    --i-table table-s2.qza \
    --p-min-frequency 10 \
    --o-filtered-table table-s2-f10.qza

Taxonomy-Based Filtering

We can filter out ASVs that are annotated as either mitochondria or chloroplast.

$ qiime taxa filter-table \
    --i-table table-s2-f10.qza \
    --i-taxonomy taxonomy-c0.qza \
    --p-exclude mitochondria,chloroplast \
    --o-filtered-table filtered-table.qza

Summarize Filtered Table

We need to find the range of ASV frequency.

$ dokdo summarize -i filtered-table.qza
Number of samples: 338
Number of features: 2583
Total frequency: 21026677.0
Frequency per sample:
0.00      2902.00
0.25     34001.25
0.50     53299.00
0.75     72941.00
1.00    367878.00
Frequency per feature:
0.00         10.0
0.25        124.0
0.50        387.0
0.75       1163.5
1.00    6576141.0

7. Build Phylogenetic Tree

We build a rooted phylogenetic tree from the representative sequences.

$ qsub -S /bin/sh -cwd -l h=$NODE_NAME -V -pe pePAC 45 qsubme-build-tree.sh

The qsubme-build-tree.sh file looks like:

#!/bin/bash

export LC_ALL=en_US.utf-8
export LANG=en_US.utf-8

qiime alignment mafft \
  --i-sequences rep-seqs.qza \
  --p-n-threads 40 \
  --o-alignment aligned-rep-seqs.qza

qiime alignment mask \
  --i-alignment aligned-rep-seqs.qza \
  --o-masked-alignment masked-aligned-rep-seqs.qza

qiime phylogeny fasttree \
  --i-alignment masked-aligned-rep-seqs.qza \
  --p-n-threads 40 \
  --o-tree unrooted-tree.qza

qiime phylogeny midpoint-root \
  --i-tree unrooted-tree.qza \
  --o-rooted-tree rooted-tree.qza

See also:

8. Create Rarefaction Curves

I usually perform two separate rarefactions, one with the minimum sample frequency and another with the maximum. The former is useful for checking whether all samples are sufficiently sequenced (i.e. whether the lines hit the plateau), while the latter is useful for seeing the global trend of sample depth.

$ qiime diversity alpha-rarefaction \
    --i-table filtered-table.qza \
    --i-phylogeny rooted-tree.qza \
    --p-max-depth 2902 \
    --p-steps 20 \
    --m-metadata-file sample-metadata.tsv \
    --o-visualization alpha-rarefaction-min.qzv
$ qiime diversity alpha-rarefaction \
    --i-table filtered-table.qza \
    --i-phylogeny rooted-tree.qza \
    --p-max-depth 367878 \
    --p-iterations 3 \
    --p-steps 100 \
    --m-metadata-file sample-metadata.tsv \
    --o-visualization alpha-rarefaction-max.qzv

See also:

9. Compute Core Metrics

$ qiime diversity core-metrics-phylogenetic \
    --i-table filtered-table.qza \
    --i-phylogeny rooted-tree.qza \
    --p-sampling-depth 2902 \
    --m-metadata-file sample-metadata.tsv \
    --output-dir core-metrics-results

See also:

10. Create Taxonomy Barplot

$ qiime taxa barplot \
    --i-table filtered-table.qza \
    --i-taxonomy taxonomy.qza \
    --m-metadata-file sample-metadata.tsv \
    --o-visualization taxa-bar-plots.qzv

See also: