PT - JOURNAL ARTICLE
AU - Jody Phelan
AU - Wouter Deelder
AU - Daniel Ward
AU - Susana Campino
AU - Martin L. Hibberd
AU - Taane G Clark
TI - Controlling the SARS-CoV-2 outbreak, insights from large scale whole genome sequences generated across the world
AID - 10.1101/2020.04.28.066977
DP - 2020 Jan 01
TA - bioRxiv
PG - 2020.04.28.066977
4099 - http://biorxiv.org/content/early/2020/05/26/2020.04.28.066977.short
4100 - http://biorxiv.org/content/early/2020/05/26/2020.04.28.066977.full
AB - Background SARS-CoV-2 most likely evolved from a bat beta-coronavirus and started infecting humans in December 2019. Since then it has rapidly infected people around the world, with more than 4.5 million confirmed cases by the middle of May 2020. Early genome sequencing of the virus has enabled the development of molecular diagnostics and the commencement of therapy and vaccine development. The analysis of the early sequences showed relatively few evolutionary selection pressures. However, with the rapid worldwide expansion into diverse human populations, significant genetic variations are becoming increasingly likely. The current limitations on social movement between countries also offer the opportunity for these viral variants to become distinct strains, with potential implications for diagnostics, therapies and vaccines. Methods We used the current sequencing archives (NCBI and GISAID) to investigate 15,487 whole genomes, looking for evidence of strain diversification and selective pressure. Results We used 6,294 SNPs to build a phylogenetic tree of SARS-CoV-2 diversity and noted strong evidence for the existence of two major clades and six sub-clades, unevenly distributed across the world. We also noted that convergent evolution has potentially occurred at several locations in the genome, indicating selection pressure, including on the spike glycoprotein, where we noted a potentially critical mutation that could affect its binding to the ACE2 receptor.
We also report on mutations that could prevent current molecular diagnostics from detecting some of the sub-clades. Conclusion The worldwide whole genome sequencing effort is revealing the challenge of developing SARS-CoV-2 containment tools suitable for everyone and the need for data to be continually evaluated to ensure accuracy in outbreak estimations. Competing Interest Statement The authors have declared no competing interest.