Abstract
Bacterial species with large sequence diversity enable studies focused on comparative genomics, population genetics and pan-genome evolution. In such analyses it is key to determine whether sequences (e.g. genes) from different strains are the same or different. This is often achieved by clustering orthologous genes based on sequence similarity. Importantly, one limitation of existing pan-genome clustering methods is that they do not assign a confidence score to the identified clusters. Given that clustering ground truth is unavailable when working with pan-genomes, the absence of confidence scores makes performance evaluation on real data an open challenge. Moreover, most pan-genome clustering solutions do not accommodate cluster augmentation, which is the addition of new sequences to an already clustered set of sequences. Finally, the pan-genome size of many organisms prevents direct application of powerful clustering techniques that do not scale to large datasets. Here, we present Boundary-Forest Clustering (BFClust), a method that addresses these challenges in three main steps: 1) The approximate-nearest-neighbor retrieval method Boundary-Forest is used as a representative selection step; 2) Downstream clustering of the representatives is performed using Markov Clustering (MCL); 3) Consensus clustering is applied across the Boundary-Forest, improving clustering accuracy and enabling confidence score calculation. First, MCL is favorably benchmarked against 6 powerful clustering methods. To explore the strengths of the entire BFClust approach, it is applied to 4 different datasets of the bacterial pathogen Streptococcus pneumoniae and compared against 4 other pan-genome clustering tools. Unlike existing approaches, BFClust is fast, accurate and robust to noise, and it allows augmentation. Moreover, BFClust uniquely identifies low-confidence clusters in each dataset, which can negatively impact downstream analyses and interpretation of pan-genomes.
Being the first tool that outputs confidence scores both when clustering de novo and during cluster augmentation, BFClust offers a way of automatically evaluating and eliminating ambiguity in pan-genomes.
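To illustrate step 2, the following is a minimal sketch of the standard MCL iteration (expansion followed by inflation) applied to a small similarity matrix. It is not the BFClust implementation: the function names, the inflation value of 2.0, the self-loop weight, the convergence threshold and the toy similarity matrix are all assumptions made for illustration.

```python
def normalize_columns(m):
    """Scale each column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain square-matrix product, sufficient for a toy example."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mcl(similarity, inflation=2.0, max_iter=100, tol=1e-9):
    """Illustrative Markov Clustering on a symmetric similarity matrix.

    Returns a sorted list of clusters, each a sorted list of item indices.
    All parameter defaults are assumptions, not values from the paper.
    """
    n = len(similarity)
    m = [[float(similarity[i][j]) for j in range(n)] for i in range(n)]
    # Self-loops keep every column sum positive and stabilise the iteration.
    for i in range(n):
        m[i][i] = max(m[i][i], 1.0)
    m = normalize_columns(m)

    for _ in range(max_iter):
        expanded = matmul(m, m)  # expansion: two-step random walk
        inflated = normalize_columns(  # inflation: boost strong flows
            [[v ** inflation for v in row] for row in expanded]
        )
        delta = max(abs(inflated[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = inflated
        if delta < tol:
            break

    # Each row that retains mass is an attractor; the columns it attracts
    # form one cluster. Duplicate rows collapse through the set.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Toy input: items 0-2 are mutually similar, items 3-4 are mutually similar,
# and there is no similarity between the two groups.
toy_similarity = [
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
]
print(mcl(toy_similarity))  # [[0, 1, 2], [3, 4]]
```

In this sketch a larger inflation value would tend to produce more, smaller clusters. In the approach described above, MCL is applied to the representatives selected by the Boundary-Forest rather than to the full set of sequences.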
Author Summary
Clustering of biological sequences is a critical step in studying bacterial species with large sequence diversity. Existing clustering approaches group sequences together based on similarity. However, these approaches do not offer a way of evaluating the confidence of their output. This makes it impossible to determine whether the clustering output reflects biologically relevant clusters. Most existing methods also do not allow cluster augmentation, which is the quick incorporation and clustering of newly available sequences with an already clustered set. We present Boundary-Forest Clustering (BFClust) as a method that can generate cluster confidence scores, as well as allow cluster augmentation. In addition to offering these key functionalities and scaling to large datasets, BFClust matches or outperforms state-of-the-art software in terms of accuracy, robustness to noise, and speed. We show on 4 Streptococcus pneumoniae datasets that the confidence scores uniquely generated by BFClust can indeed be used to identify ambiguous sequence clusters. These scores thereby serve as a quality control step before further analysis of the clustering output commences. BFClust is currently the only biological sequence clustering tool that allows augmentation and outputs confidence scores, which should benefit most pan-genome studies.