Non-biological synthetic spike-in controls and the AMPtk software pipeline improve mycobiome data

Jonathan M. Palmer; Michelle A. Jusino; Mark T. Banik; Daniel L. Lindner

doi:10.1101/213470

Abstract

High-throughput amplicon sequencing (HTAS) of conserved DNA regions is a powerful technique to characterize microbial communities. Recently, spike-in mock communities have been used to measure the accuracy of sequencing platforms and data analysis pipelines. To assess the performance of sequencing platforms and data processing pipelines on fungal ITS amplicons, we created two ITS spike-in control mock communities composed of cloned DNA in plasmids: a biological mock community (BioMock), consisting of ITS sequences from fungal taxa, and a synthetic mock community (SynMock), consisting of non-biological ITS-like sequences. Using these spike-in controls we show that: 1) a non-biological synthetic control (e.g., SynMock) is the best solution for parameterizing bioinformatics pipelines, 2) pre-clustering steps for variable length amplicons are critically important, and 3) a major source of bias is attributable to the initial PCR, and thus HTAS read abundances are typically not representative of starting values. We developed AMPtk, a versatile software solution equipped to deal with variable length amplicons and to quality-filter HTAS data based on spike-in controls. While we describe herein a non-biological synthetic mock community for ITS sequences, the concept and the AMPtk software can be widely applied to any HTAS dataset to improve data quality.

Availability and Implementation - AMPtk is publicly available at https://github.com/nextgenusfs/amptk. All primary data and data analyses from this manuscript are available via the Open Science Framework (https://osf.io/4xd9r/). The SynMock sequences and the script to produce them are available in the OSF repository (https://osf.io/4xd9r/) and are also packaged into AMPtk distributions.

The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.