PT - JOURNAL ARTICLE
AU - C. Andújar
AU - T. J. Creedy
AU - P. Arribas
AU - H. López
AU - A. Salces-Castellano
AU - A. Pérez-Delgado
AU - A. P. Vogler
AU - B. C. Emerson
TI - NUMT dumping: validated removal of nuclear pseudogenes from mitochondrial metabarcode data
AID - 10.1101/2020.06.17.157347
DP - 2020 Jan 01
TA - bioRxiv
PG - 2020.06.17.157347
4099 - http://biorxiv.org/content/early/2020/06/18/2020.06.17.157347.short
4100 - http://biorxiv.org/content/early/2020/06/18/2020.06.17.157347.full
AB - Metabarcoding of Metazoa using mitochondrial genes is confounded by the co-amplification of mitochondrial pseudogenes (NUMTs). Current denoising protocols have been designed to remove PCR and sequencing artefacts, but pseudogenes are not usually recognised by these procedures. Authentic mitochondrial amplicon sequence variants (ASVs), which represent the majority of reads, can be distinguished from PCR-derived errors, sequencing errors and NUMTs (non-authentic ASVs) due to their lower abundances. However, the use of simple read abundance thresholds is complicated by the highly variable DNA contribution of individuals in a metabarcoding sample. We show how ASVs that survive standard denoising, but are identified as non-authentic, are consistent with expectations for NUMTs with regard to patterns of phylogenetic relatedness, read-abundance, and library co-occurrence. We then propose and demonstrate a new self-validating framework, named NUMT dumping, which allows NUMT filtering strategies to be evaluated by quantifying (i) the prevalence of non-authentic ASVs (NUMT and erroneous sequences) and (ii) the collateral effects on the removal of authentic ASVs (mtDNA haplotypes) in filtered data.
We propose several filtering strategies within the NUMT dumping framework, based on the application of read-abundance thresholds, structured with regard to sequence library and phylogeny. The framework was validated using mock and natural communities, both of which showed opposing trends for the removal of authentic and non-authentic ASVs when the minimum-abundance thresholds used to filter out sequences were increased. Filtering can be optimized to retain less than 5% of non-authentic ASVs while retaining more than 89% of authentic mitochondrial ASVs, or to remove non-authentic ASVs completely while retaining 77% of authentic mitochondrial ASVs. We provide a program, NUMTdumper, that can be used to evaluate and decide upon the most adequate metabarcoding filtering strategy for specific research objectives, providing a measure of the expected prevalence of non-authentic ASVs in metabarcoding datasets. In addition, this evaluation allows the user to quantify the effects of taxonomic inflation when ASVs are clustered into OTUs. It improves the reliability of intraspecific genetic information derived from metabarcode data, opening the door for community-level genetic analyses requiring haplotype-level resolution.
Competing Interest Statement: The authors have declared no competing interest.
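The trade-off described in the abstract, where raising a minimum read-abundance threshold removes more non-authentic ASVs but also loses some authentic ones, can be illustrated with a minimal Python sketch. This is not NUMTdumper's implementation or interface. The function names, the input layout, the toy read counts and the authentic/non-authentic labels are all hypothetical, and only the per-library structuring is shown (the phylogenetic structuring mentioned in the abstract is omitted).

```python
from typing import Dict, Set, Tuple

# Hypothetical input layout: ASV id -> {library id -> read count}.
Counts = Dict[str, Dict[str, int]]


def library_totals(counts: Counts) -> Dict[str, int]:
    """Sum the reads of all ASVs within each library."""
    totals: Dict[str, int] = {}
    for per_library in counts.values():
        for library, reads in per_library.items():
            totals[library] = totals.get(library, 0) + reads
    return totals


def filter_asvs(counts: Counts, min_fraction: float) -> Set[str]:
    """Keep an ASV if, in at least one library where it occurs, its reads
    make up at least `min_fraction` of that library's total reads."""
    totals = library_totals(counts)
    kept: Set[str] = set()
    for asv, per_library in counts.items():
        for library, reads in per_library.items():
            if reads > 0 and reads / totals[library] >= min_fraction:
                kept.add(asv)
                break
    return kept


def retention(kept: Set[str], authentic: Set[str],
              non_authentic: Set[str]) -> Tuple[float, float]:
    """Return the fractions of authentic and of non-authentic ASVs that
    survive filtering. Both reference sets are assumed to be non-empty."""
    kept_authentic = len(kept & authentic) / len(authentic)
    kept_non_authentic = len(kept & non_authentic) / len(non_authentic)
    return kept_authentic, kept_non_authentic


# Toy data, invented for illustration only (not taken from the paper).
TOY_COUNTS: Counts = {
    "hapA": {"lib1": 900, "lib2": 50},
    "hapB": {"lib1": 40, "lib2": 800},
    "hapC": {"lib1": 10, "lib2": 0},
    "numt1": {"lib1": 30, "lib2": 100},
    "numt2": {"lib1": 20, "lib2": 50},
}
# Assumed to be known in advance, e.g. from a mock community.
AUTHENTIC = {"hapA", "hapB", "hapC"}
NON_AUTHENTIC = {"numt1", "numt2"}

# Scan several thresholds and report both retention fractions.
for threshold in (0.0, 0.02, 0.06, 0.2):
    kept = filter_asvs(TOY_COUNTS, threshold)
    auth, non_auth = retention(kept, AUTHENTIC, NON_AUTHENTIC)
    print(f"threshold={threshold:.2f} authentic={auth:.2f} "
          f"non_authentic={non_auth:.2f}")
```

Scanning thresholds in this way shows how a user might pick a value that suits a given research objective, for example by accepting a small residual share of non-authentic ASVs in exchange for keeping more authentic haplotypes.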