Abstract
Whole genome sequencing (WGS) now plays a central role in the control and study of tuberculosis. Factors that hinder the interpretation of genomic data can impair our understanding of the pathogen's biology and lead to incorrect clinical predictions. Analyzing an extensive Mycobacterium tuberculosis dataset comprising more than 1,500 sequencing samples from different published works, together with additional samples sequenced in our laboratory, we find that contamination with non-target DNA is a common phenomenon among WGS studies. Using these data and in silico simulations, we show that even subtle contamination can produce dozens of false variants and large miscalculations of allele frequencies, often leading to errors that are very hard to detect and that propagate through the analysis. In our dataset, 94% of the polymorphic positions were incorrectly identified due to contamination. We illustrate the consequences of these errors in the context of clinical predictions for all the studies analyzed, and demonstrate that unexpected contamination constitutes a major pitfall in WGS studies. In addition, we present an approach based on the removal of contaminant reads that shows outstanding performance when analyzing both clean and contaminated data. Based on our findings, which are applicable to most organisms, we urge the implementation of contamination-aware analysis pipelines.