%0 Journal Article
%A Andrew H. Buultjens
%A Kyra Y. L. Chua
%A Sarah L. Baines
%A Jason Kwong
%A Wei Gao
%A Zoe Cutcher
%A Stuart Adcock
%A Susan Ballard
%A Mark B. Schultz
%A Takehiro Tomita
%A Nela Subasinghe
%A Glen P. Carter
%A Sacha J. Pidot
%A Lucinda Franklin
%A Torsten Seemann
%A Anders Gonçalves Da Silva
%A Benjamin P. Howden
%A Timothy P. Stinear
%T A Supervised Statistical Learning Approach For Accurate Legionella pneumophila Source Attribution During Outbreaks
%D 2017
%R 10.1101/133033
%J bioRxiv
%P 133033
%X Public health agencies are increasingly relying on genomics during Legionnaires’ disease investigations. However, the causative bacterium (Legionella pneumophila) has an unusual population structure with extreme temporal and spatial genome sequence conservation. Furthermore, Legionnaires’ disease outbreaks can be caused by multiple L. pneumophila genotypes in a single source. These factors can confound cluster identification using standard phylogenomic methods. Here, we show that a statistical learning approach based on L. pneumophila core genome single nucleotide polymorphism (SNP) comparisons eliminates ambiguity for defining outbreak clusters and accurately predicts exposure sources for clinical cases. We illustrate the performance of our method by genome comparisons of 234 L. pneumophila isolates obtained from patients and cooling towers in Melbourne, Australia between 1994 and 2014. This collection included one of the largest reported Legionnaires’ disease outbreaks, involving 125 cases at an aquarium. Using only sequence data from L. pneumophila cooling tower isolates and including all core genome variation, we built a multivariate model using discriminant analysis of principal components (DAPC) to find cooling tower-specific genomic signatures, and then used it to predict the origin of clinical isolates.
Model assignments were 93% congruent with epidemiological data, including the aquarium Legionnaires’ outbreak and three other unrelated outbreak investigations. We applied the same approach to a recently described investigation of Legionnaires’ disease within a UK hospital and observed model predictive ability of 86%. We have developed a robust means to breach L. pneumophila genetic diversity extremes and provide objective source attribution data for outbreak investigations. Importance: Microbial outbreak investigation is moving to a paradigm where phylogenomic trees and core genome multilocus typing schemes are sufficient to identify infection sources with high certainty. We show, by studying 234 Legionella pneumophila genomes collected over 21 years, that it is critically important to have a detailed understanding of local bacterial population diversity or risk misidentifying an outbreak source. We propose statistical learning approaches that can accommodate all core genome variation and link clinical L. pneumophila isolates back to environmental sources, at both the inter- and intra-institutional levels, eliminating the ambiguity of inferring transmission from phylogenies. This information is critical in outbreak investigations, particularly for L. pneumophila, which spreads via aerosols, causing Legionnaires’ disease.
%U https://www.biorxiv.org/content/biorxiv/early/2017/07/09/133033.full.pdf