RT Journal Article SR Electronic T1 SeroBA: rapid high-throughput serotyping of Streptococcus pneumoniae from whole genome sequence data JF bioRxiv FD Cold Spring Harbor Laboratory SP 179465 DO 10.1101/179465 A1 Lennard Epping A1 Martin Hunt A1 Andries J. van Tonder A1 Rebecca A. Gladstone A1 The Global Pneumococcal Sequencing consortium A1 Stephen D. Bentley A1 Andrew J. Page A1 Jacqueline A. Keane YR 2017 UL http://biorxiv.org/content/early/2017/08/22/179465.abstract AB Streptococcus pneumoniae is responsible for 240,000 - 460,000 deaths in children under 5 years of age each year. Accurate identification of pneumococcal serotypes is important for tracking the distribution and evolution of serotypes following the introduction of effective vaccines. Recent efforts have been made to infer serotypes directly from genomic data but current software approaches are limited and do not scale well. Here, we introduce a novel method, SeroBA, which uses a hybrid assembly and mapping approach. We compared SeroBA against real and simulated data and present results on the concordance and computational performance against a validation dataset, the robustness and scalability when analysing a large dataset, and the impact of varying the depth of coverage in the cps locus region on sequence-based serotyping. SeroBA can predict serotypes, by identifying the cps locus, directly from raw whole genome sequencing read data with 98% concordance using a k-mer based method, can process 10,000 samples in just over 1 day using a standard server and can call serotypes at a coverage as low as 10x. 
SeroBA is implemented in Python3 and is freely available under an open source GPLv3 license from: https://github.com/sanger-pathogens/seroba DATA SUMMARY The reference genome Streptococcus pneumoniae ATCC 700669 is available from National Center for Biotechnology Information (NCBI) with the accession number: FM211187. Simulated paired end reads for experiment 2 have been deposited in FigShare: https://doi.org/10.6084/m9.figshare.5086054.v1 Accession numbers for all other experiments are listed in Supplementary Table S1 and Supplementary Table S2. I/We confirm all supporting data, code and protocols have been provided within the article or through supplementary data files. IMPACT STATEMENT This article describes SeroBA, a k-mer based method for predicting the serotypes of Streptococcus pneumoniae from Whole Genome Sequencing (WGS) data. SeroBA can identify 92 serotypes and 2 subtypes with constant memory usage and low computational costs. We showed that SeroBA is able to reliably predict serotypes at a depth of coverage as low as 10x and is scalable to large datasets.