ABSTRACT
Phylogenetic analyses are crucial for understanding microbial evolution and infectious disease transmission. Bacterial phylogenies are often inferred from single nucleotide polymorphism (SNP) alignments, in which SNPs provide the fundamental signal. SNP alignments can be reduced to a ‘strict core’ by removing every site that lacks data in one or more samples. However, as sample size and genome diversity increase, a strict core can shrink markedly, discarding potentially informative data. Here, we propose and provide evidence to support the use of a ‘soft core’ that tolerates some missing data, preserving more information for phylogenetic analysis. Using large datasets of Neisseria gonorrhoeae and Salmonella enterica serovar Typhi, we assess different core thresholds. Our results show that strict cores can drastically reduce the number of informative sites compared to soft cores. In a 10,000-genome alignment of Salmonella enterica serovar Typhi, a 95% soft core yielded 10 times more informative sites than a 100% strict core. Similar patterns were observed in Neisseria gonorrhoeae. We further evaluated the accuracy of phylogenies built from strict-core and soft-core alignments using datasets with strong temporal signals. Soft-core alignments generally outperformed strict-core alignments in producing trees displaying clock-like behaviour; for instance, the Neisseria gonorrhoeae 95% soft-core phylogeny had a root-to-tip regression R² of 0.50 compared to 0.21 for the strict-core phylogeny. This study suggests that soft-core strategies are preferable for large, diverse microbial datasets. To facilitate this, we developed Core-SNP-filter (github.com/rrwick/Core-SNP-filter), an open-source software tool for generating soft-core alignments from whole-genome alignments based on user-defined thresholds.
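The difference between a strict core and a soft core can be illustrated with a minimal Python sketch. It is not the Core-SNP-filter implementation; the function names, the toy alignment and the rule that only A, C, G and T count as present data are assumptions made for illustration. A site is kept when it is variable among the observed bases and when the fraction of samples with data meets the chosen threshold (1.0 for a strict core, lower values for a soft core).

```python
from typing import Dict, List

# Assumption for this sketch: only unambiguous bases count as "data present";
# gaps, N and other ambiguity codes are treated as missing.
VALID_BASES = set("ACGT")


def core_site_indices(alignment: Dict[str, str], threshold: float) -> List[int]:
    """Return indices of variant sites with data in >= threshold of samples."""
    seqs = [seq.upper() for seq in alignment.values()]
    if not seqs:
        return []
    length = len(seqs[0])
    if any(len(seq) != length for seq in seqs):
        raise ValueError("all sequences must have the same length")

    kept = []
    for i in range(length):
        present = [seq[i] for seq in seqs if seq[i] in VALID_BASES]
        is_core = len(present) >= threshold * len(seqs)
        is_variant = len(set(present)) > 1  # more than one observed base
        if is_core and is_variant:
            kept.append(i)
    return kept


def filter_alignment(alignment: Dict[str, str], threshold: float) -> Dict[str, str]:
    """Reduce an alignment to its core variant sites at the given threshold."""
    keep = core_site_indices(alignment, threshold)
    return {name: "".join(seq[i] for i in keep) for name, seq in alignment.items()}


# Hypothetical four-sample alignment used only to exercise the functions.
toy = {
    "s1": "ACGTA",
    "s2": "ATGNA",
    "s3": "ACAN-",
    "s4": "A-ACA",
}

strict_sites = core_site_indices(toy, 1.0)   # data required in every sample
soft_sites = core_site_indices(toy, 0.75)    # data required in 75% of samples
soft_alignment = filter_alignment(toy, 0.75)
```

In this toy alignment the second column is variable but missing in one of the four samples, so it is lost under the strict core and retained under the 75% soft core, which is the effect described above at much larger scale.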
IMPACT STATEMENT
This study addresses a major limitation in modern bacterial genomics – the significant data loss observed in large datasets for phylogenetic analyses, often due to strict-core SNP alignment approaches. As bacterial genome sequence datasets grow and diversity increases, a strict-core approach can greatly reduce the number of informative sites, compromising phylogenetic resolution. Our research highlights the advantages of soft-core alignment methods which tolerate some missing data and retain more genetic information. To streamline the processing of alignments, we developed Core-SNP-filter (github.com/rrwick/Core-SNP-filter), a publicly available resource-efficient tool that filters alignments to informative and core sites.
DATA SUMMARY
All genomic sequence reads used in this study were already publicly available, and accessions can be found in Supplementary Dataset 1. Supplementary methods and all code can be found in the accompanying GitHub repository (github.com/mtaouk/Core-SNP-filter-methods).
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
The following new additions have been incorporated into the manuscript:
Figure 1: a visual schematic of the core filtering process
Supplementary Figure 1: the variance across the ten replicates in Figure 1
Supplementary Figure 2: counting variant sites in large S. Typhi alignments after recombination filtering
Supplementary Figure 4: extended the validation in the real infection cases to S. Typhi
Supplementary Figures 5-6: comparing the percentage of identical genome pairs based on pairwise SNP distances across studies of N. gonorrhoeae and S. Typhi
Supplementary Figure 8: a new analysis statistically comparing the topologies of the phylogenies at different core thresholds
Additionally, we have edited the discussion to include the implications of the new analyses, specifically the impact of the findings on real-world applications such as public health and outbreak investigations. We have also expanded the introduction and discussion to cover the impact of missing data on phylogenetic inference, based on the shared literature, and instances where using a low core SNP threshold may be inappropriate.