Boundary-Forest Clustering: Large-Scale Consensus Clustering of Biological Sequences

Bacterial species with large sequence diversity enable studies focused on comparative genomics, population genetics and pan-genome evolution. In such analyses it is key to determine whether sequences (e.g. genes) from different strains are the same or different. This is often achieved by clustering orthologous genes based on sequence similarity. Importantly, one limitation of existing pan-genome clustering methods is that they do not assign a confidence score to the identified clusters. Given that clustering ground truth is unavailable when working with pan-genomes, the absence of confidence scores makes performance evaluation on real data an open challenge. Moreover, most pan-genome clustering solutions do not accommodate cluster augmentation, which is the addition of new sequences to an already clustered set of sequences. Finally, the pan-genome size of many organisms prevents direct application of powerful clustering techniques that do not scale to large datasets. Here, we present Boundary-Forest Clustering (BFClust), a method that addresses these challenges in three main steps: 1) The approximate-nearest-neighbor retrieval method Boundary-Forest is used as a representative selection step; 2) Downstream clustering of the representatives is performed using Markov Clustering (MCL); 3) Consensus clustering is applied across the Boundary-Forest, improving clustering accuracy and enabling confidence score calculation. First, MCL is favorably benchmarked against 6 powerful clustering methods. To explore the strengths of the entire BFClust approach, it is applied to 4 different datasets of the bacterial pathogen Streptococcus pneumoniae, and compared against 4 other pan-genome clustering tools. Unlike existing approaches, BFClust is fast, accurate, robust to noise and allows augmentation. Moreover, BFClust uniquely identifies low-confidence clusters in each dataset, which can negatively impact downstream analyses and interpretation of pan-genomes.
Being the first tool that outputs confidence scores both when clustering de novo and during cluster augmentation, BFClust offers a way of automatically evaluating and eliminating ambiguity in pan-genomes.

Author Summary

Clustering of biological sequences is a critical step in studying bacterial species with large sequence diversity. Existing clustering approaches group sequences together based on similarity. However, these approaches do not offer a way of evaluating the confidence of their output. This makes it impossible to determine whether the clustering output reflects biologically relevant clusters. Most existing methods also do not allow cluster augmentation, which is the quick incorporation and clustering of newly available sequences with an already clustered set. We present Boundary-Forest Clustering (BFClust) as a method that can generate cluster confidence scores, as well as allow cluster augmentation. In addition to having these additional key functionalities and being scalable to large dataset sizes, BFClust matches or outperforms state-of-the-art software in terms of accuracy, robustness to noise and speed. We show on 4 Streptococcus pneumoniae datasets that the confidence scores uniquely generated by BFClust can indeed be used to identify ambiguous sequence clusters. These scores thereby serve as a quality control step before further analysis on the clustering output commences. BFClust is currently the only biological sequence clustering tool that allows augmentation and outputs confidence scores, which should benefit most pan-genome studies.


Introduction
Since constructing the similarity network requires all-against-all comparisons, these methods also do not scale out-of-the-box. To overcome this challenge, the two software solutions for pan-genome clustering, PanX (18) and Roary (19), first use a representative selection step, which reduces the redundancy in, and the size of, the dataset by grouping extremely similar sequences together. For each group, a representative sequence is picked, and the representatives are then clustered using MCL. The cluster membership for the representatives is then extrapolated to all sequences.

There are multiple strategies for representative selection. For instance, PanX separates consecutive input sequences into groups, then performs clustering within each one of these groups, and finally selects one representative from each cluster from each group. Alternatively, Roary uses CD-HIT as its representative selection method (19). In either case, only a single set of representatives is selected, and there is no guarantee that this set best represents the whole dataset, which is a critical limitation.

Two additional challenges for pan-genome clustering are a lack of cluster augmentation and a lack of confidence scores on the clustering output. Currently, CD-HIT is the only clustering software that enables cluster augmentation, and no software produces confidence scores, which are critical in evaluating the ambiguity in the clustering results.

To overcome these challenges, we developed BFClust. BFClust uses a Boundary-Forest as a representative selection step, resulting in multiple sets of representatives that are stored. Each set of representatives is then clustered using MCL, yielding a clustering ensemble. A final consensus clustering step yields a single clustering solution from the ensemble.
This approach has 2 main advantages: 1) multiple sets of representatives and consensus clustering enable calculation of confidence scores; 2) storing the Boundary-Forest enables quick cluster augmentation.

In this work, we evaluate the performance of 7 clustering methods (including hierarchical, K-means, spectral and MCL), and show that network-based methods such as MCL outperform the others. BFClust using MCL is then compared to UCLUST, CD-HIT, PanX and Roary, which highlights that BFClust and PanX have high accuracy and robustness to noise when evaluated on a synthetic dataset generated in silico with known cluster assignments. In real pan-genome datasets, BFClust identifies clusters with low confidence scores, even in the core genome. Since such clusters most likely do not represent real orthologues, the confidence score can thus serve as a means to filter clustering results, only retaining unambiguous clusters. To the best of our knowledge, BFClust is the only clustering solution that produces the critical confidence scores, offers automatic cluster augmentation, and updates confidence scores during cluster augmentation. BFClust is thereby a unique method that enables quality control on the clusters it produces, while matching current gold-standard tools in terms of speed and performance.

Algorithm Overview

Clustering of sequences using BFClust has three major steps: 1) representative selection, i.e. reducing redundancy in the input data using a Boundary-Forest; 2) clustering of each set of representatives associated with each Boundary-Tree into an ensemble of clustering solutions; and 3) deriving a consensus clustering from this ensemble of solutions (Figure 1). Once a consensus clustering is obtained, each cluster is assigned a cluster confidence score, and each amino acid sequence is given an item consensus score, based on the agreement of the clustering produced using the different Boundary-Trees.

A naïve way to cluster all sequences from many bacterial genomes would be to look at all-vs-all pairwise sequence comparisons. Since all-vs-all pairwise comparisons require a computational effort that scales quadratically (O(N^2) comparisons) with the number of sequences (N), it is beneficial to apply a representative selection scheme such that a group of extremely similar sequences is represented by a single sequence. We achieve this by constructing a Boundary-Forest (see Supplementary Notes for pseudocode). In a Boundary-Forest, n Boundary-Trees are constructed, with n = 10 as the default size of the forest. Before constructing each Boundary-Tree, the order of sequences is randomized. The Boundary-Tree is constructed by placing the first sequence as the root, and the second sequence as its child. Then, each subsequent sequence is compared to the root node and its children. If the Smith-Waterman distance (20) between the incoming sequence and a node in the Boundary-Tree is smaller than a pre-set threshold t, the incoming sequence is represented by this node, and omitted from the tree.
If the incoming sequence is not within the threshold of the root node or its children, we select the node (among the parent and children being compared to the incoming sequence) with the smallest distance to the incoming sequence. If the newly selected node also has children, we repeat the comparison, moving down the tree until a representative is found that is sufficiently close to the incoming sequence. If such a node is found, we assign this node as the representative of the incoming sequence, and start processing the next incoming sequence. If no node within distance t is found on the tree, the new sequence is added as a child of a leaf in the tree. To control the breadth of the tree, we also limit the maximum number of children allowed for each node (with the parameter "MaxChild"). Note that below, we explore the sensitivity of the approach to these parameters. Since any Boundary-Tree that is constructed is sensitive to the order in which the sequences are read, a single tree is not guaranteed to capture a set of representatives that leads to highly accurate downstream clustering. Therefore, multiple Boundary-Trees (the Boundary-Forest) are used, which can be thought of as multiple 'opinions' on what representative sequences should be chosen. Once the sequence set is reduced to n sets of representatives, stored as a forest of n trees, the pairwise distances are computed within each set of representatives, and well-established clustering algorithms are applied.
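The insertion procedure described above can be sketched as follows. This is an illustrative reimplementation, not BFClust's actual code: `Node`, `BoundaryTree` and the toy numeric `distance` are hypothetical stand-ins (BFClust uses Smith-Waterman distance on amino acid sequences), and the handling of the MaxChild breadth limit is simplified.

```python
class Node:
    """A tree node holding one representative sequence."""
    def __init__(self, seq):
        self.seq = seq
        self.children = []

class BoundaryTree:
    def __init__(self, distance, t=0.1, max_child=10):
        self.distance = distance    # pairwise distance function (stand-in)
        self.t = t                  # similarity threshold
        self.max_child = max_child  # breadth limit per node ("MaxChild")
        self.root = None

    def insert(self, seq):
        """Return the node that represents `seq`, adding a new node if needed."""
        if self.root is None:
            self.root = Node(seq)
            return self.root
        node = self.root
        while True:
            # Compare to the current node and all its children; keep the closest.
            candidates = [node] + node.children
            closest = min(candidates, key=lambda n: self.distance(seq, n.seq))
            if self.distance(seq, closest.seq) <= self.t:
                return closest  # within threshold t: represented, not added
            if closest is not node and closest.children:
                node = closest  # descend and repeat the comparison
                continue
            # No sufficiently close node found: attach as a child of the
            # closest node (simplified treatment of the breadth limit).
            parent = closest
            if len(parent.children) >= self.max_child:
                parent = node
            child = Node(seq)
            parent.children.append(child)
            return child

# Toy usage with numbers and absolute difference standing in for sequences
# and Smith-Waterman distance:
tree = BoundaryTree(distance=lambda a, b: abs(a - b), t=1.0)
reps = [tree.insert(s) for s in [0.0, 10.0, 0.5, 20.0]]
# 0.5 is represented by the root (0.0); 10.0 and 20.0 become new nodes.
```

Note how only 3 of the 4 inputs end up on the tree: the near-duplicate 0.5 is absorbed by its representative, which is exactly the redundancy reduction the text describes.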

After clustering the representatives, the cluster assignments of the representatives are extended to the full dataset. This is a necessary step for comparing the clustering output to the ground truth, for comparing two clustering outputs to each other, and for consensus clustering, as these actions are performed on the full dataset, and not on the representatives. During the construction of each Boundary-Tree, each sequence is assigned a representative (or is itself a representative) based on sequence similarity. Cluster extension from the representatives to the full dataset is done by assigning each sequence the cluster of its representative.
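The extension step amounts to a single lookup per sequence, sketched below. The dictionaries `rep_of` and `rep_cluster` are hypothetical names used for illustration, not part of BFClust's interface.

```python
# Extending cluster labels from representatives to all sequences.
# rep_of: maps each sequence ID to its representative's ID
#         (representatives map to themselves).
# rep_cluster: cluster label assigned to each representative.

def extend_clusters(rep_of, rep_cluster):
    """Assign every sequence the cluster of its representative."""
    return {seq: rep_cluster[rep] for seq, rep in rep_of.items()}

# Example: two representatives r1, r2 placed in clusters 0 and 1.
rep_of = {"r1": "r1", "s1": "r1", "s2": "r1", "r2": "r2", "s3": "r2"}
rep_cluster = {"r1": 0, "r2": 1}
labels = extend_clusters(rep_of, rep_cluster)
# labels == {"r1": 0, "s1": 0, "s2": 0, "r2": 1, "s3": 1}
```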

The representatives of each Boundary-Tree are used to produce one clustering output, the whole Boundary-Forest thus leading to an ensemble of possible clustering outputs. Consensus clustering across the clustering ensemble is then applied, combining the clustering output obtained from each tree, to improve accuracy. Finally, BFClust compares how the different clustering outputs in the ensemble contribute to the consensus clustering, and using the differences in these contributions it assigns an item confidence score to the membership of each sequence to its consensus cluster, and a cluster confidence score to the existence of each cluster.

Boundary-Forest reduces redundancy in the sequence set

In order to evaluate whether Boundary-Forest effectively reduces an input dataset into a small set of representatives by removing redundant sequences, we studied how much this step reduces the size of the dataset, how this reduction depends on the algorithm's parameters, and how, in turn, this affects downstream clustering accuracy. We generated a small test dataset ('minigenomes'), with 500 sequences of varying length (ranging from 65 to 1170 amino acids). This dataset has 50 noisy copies of 10 genes, and therefore 10 inherent sequence clusters. The noise consists of independent, random changes in the nucleotide sequence with probability 0.01 per nucleotide. Since BFClust uses amino acid sequences by default, the perturbed nucleotide sequences were then translated into perturbed amino acid sequences in silico. As the changes introduced to the sequences are random, 10 replicate sequence sets with the same mutation probability were generated. Figure 2A shows how the size of the Boundary-Tree constructed from this dataset is robust to two parameters that are crucial in constructing the Boundary-Tree: MaxChild and the sequence similarity threshold t.
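A minimal sketch of how such noisy copies can be generated, assuming independent per-base substitution with a fixed probability; the in-silico translation to amino acids is omitted, and the gene sequence and function names are illustrative, not the dataset actually used in the paper.

```python
import random

def mutate(seq, p, rng):
    """Return a noisy copy of `seq`: each base is substituted with prob `p`."""
    bases = "ACGT"
    out = []
    for b in seq:
        if rng.random() < p:
            # Substitute with one of the three other bases.
            out.append(rng.choice([x for x in bases if x != b]))
        else:
            out.append(b)
    return "".join(out)

rng = random.Random(0)          # seeded for reproducibility
gene = "ATGGCTAAAGGT" * 10      # toy 120-nt gene
copies = [mutate(gene, 0.01, rng) for _ in range(50)]  # 50 noisy copies
```

Repeating this for 10 genes yields the 500-sequence minigenome set described above, with 10 inherent clusters of 50 noisy copies each.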

A detailed description of all parameters used in BFClust is provided in S1 Appendix. While a drastically small threshold value (t = 0.01) results in larger trees (which is intuitive, since with a smaller similarity threshold, fewer sequences can be represented by the same node), the size of the tree is otherwise robust to these parameter choices.

There are alternatives to representative selection that involve simpler algorithms than Boundary-Forest; two approaches we discuss here are a random sampling and a naïve sampling strategy.

BT comparisons: expected number of comparisons that will be made during cluster augmentation using Boundary-Trees (note this value is the same as the tree depth multiplied by the number of children allowed, which is 10 by default).
MCL outperforms other methods during clustering of the representative sequences

We tested 4 commonly used clustering methods (a total of 7 variants of the 4 methods). For real datasets, it might not be the case that SSE is minimized to 0 when the correct K is selected.

Therefore, the elbow method described above is preferable to selecting the K at which the SSE curve attains its minimum.

The selected elbow on the SSE curve for each tree is often at the correct K, but not always. To improve the accuracy of the clustering, and to be able to report confidence scores, we add a consensus clustering step to BFClust. For this, we take the clustering assignments from the clustering obtained from each tree as a vector, and use a scalable clustering method (e.g. K-medoids) to cluster these vectors (Supplementary Figure 3). The median of the elbow K values from each tree is used as the final number of clusters to be generated by the consensus. Figure 3C shows that after the consensus step, the errors against the known cluster assignments are reduced to 0 for most methods, despite the clustering from individual trees having higher error prior to consensus.
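One plausible reading of this consensus step can be sketched as follows: each sequence is described by the vector of cluster labels it received across the trees, the Hamming distance between two such vectors counts the trees in which the two sequences were not co-clustered, and a naive k-medoids pass groups the vectors. This is an illustrative sketch under those assumptions, not BFClust's implementation (which relies on an established, scalable K-medoids).

```python
def hamming(u, v):
    """Number of trees in which two sequences were not co-clustered."""
    return sum(a != b for a, b in zip(u, v))

def k_medoids(vectors, k, iters=20):
    """Naive k-medoids on label vectors under Hamming distance."""
    # Deterministic init: the first k distinct vectors seed the medoids.
    medoids, seen = [], set()
    for i, v in enumerate(vectors):
        if v not in seen:
            seen.add(v)
            medoids.append(i)
        if len(medoids) == k:
            break
    for _ in range(iters):
        # Assign each point to its closest medoid.
        assign = [min(medoids, key=lambda m: hamming(v, vectors[m]))
                  for v in vectors]
        # Recompute each medoid as the member minimizing total distance
        # within its cluster.
        new_medoids = []
        for m in medoids:
            members = [i for i, a in enumerate(assign) if a == m]
            best = min(members, key=lambda i: sum(
                hamming(vectors[i], vectors[j]) for j in members))
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return assign

# Labels of 6 sequences across 3 trees: the first three sequences are
# always co-clustered, and so are the last three.
vectors = [(0, 1, 0), (0, 1, 0), (0, 1, 0), (1, 0, 2), (1, 0, 2), (1, 0, 2)]
labels = k_medoids(vectors, 2)  # -> [0, 0, 0, 3, 3, 3]
```

Note that the raw label values differ between trees; only agreement within each tree's coordinate matters, which is why Hamming distance on these vectors captures co-clustering.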

BFClust can compute cluster confidence scores

The consensus clustering step across the Boundary-Forest not only reduces error, but it also allows confidence estimation for the existence of each cluster, and for the membership of each sequence in its consensus cluster. By comparing the clustering done using the representatives on each Boundary-Tree, it is possible to measure how frequently a cluster has the same members, and use this value as an estimate of cluster confidence. We define a cluster confidence score for each cluster based on this agreement across trees.

In Figures 3D and 3E, the cluster and item confidence scores are presented as a means to evaluate the self-consistency of each method. However, when clustering real pan-genome datasets, these confidence scores computed for each cluster or sequence serve as a confidence measure of the clustering output. This means that clusters with lower scores are those with higher uncertainty, and should be used with caution in further analysis and interpretation.
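A score of this kind can be sketched as follows (an illustrative assumption, not necessarily BFClust's exact definition): for each consensus cluster, count the fraction of ensemble clusterings in which exactly the same membership appears as a cluster.

```python
def cluster_confidence(consensus, ensemble):
    """Fraction of ensemble clusterings reproducing each consensus cluster.

    consensus: dict cluster_id -> frozenset of sequence IDs.
    ensemble: list of clusterings, each a set of frozensets (one clustering
              per Boundary-Tree).
    """
    scores = {}
    for cid, members in consensus.items():
        hits = sum(members in clustering for clustering in ensemble)
        scores[cid] = hits / len(ensemble)
    return scores

consensus = {0: frozenset({"a", "b"}), 1: frozenset({"c"})}
ensemble = [
    {frozenset({"a", "b"}), frozenset({"c"})},              # agrees on both
    {frozenset({"a"}), frozenset({"b"}), frozenset({"c"})}, # splits cluster 0
]
scores = cluster_confidence(consensus, ensemble)
# scores == {0: 0.5, 1: 1.0}
```

Cluster 0 is reproduced by only one of the two trees, so it gets a lower score, flagging it as the kind of ambiguous cluster discussed above.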

Cluster augmentation: addition of new sequences to an existing clustering

A major advantage of the BFClust algorithm is that it stores the Boundary-Forest containing representatives from all previously processed input sequences. This allows BFClust to add new sequences to an existing clustering/partition while being able to update the confidence scores without much computational work. To achieve this, we implement a cluster augmentation method (see Figure 4A for a schematic overview). A set of incoming input sequences can either be clustered de novo or added to an existing clustering.

The runtime of de novo clustering and cluster augmentation scales tractably with increasing number of data points (Figure 4B). For de novo clustering, increasing numbers of replicates of the 500-sequence dataset (with different random mutations) were included. For cluster augmentation, the last 500 sequences from the dataset were left out, the remaining N - 500 sequences were clustered de novo, and the left-out 500 sequences were then added using the procedure described above. Augmentation is much faster than clustering sequences de novo, even without the initial representative selection step (Figure 4B, green curve). When the optional representative selection step is included, there is an additional 5-fold increase in the runtime (Figure 4B, orange curve). The error on the final clustering, both for de novo clustering and for cluster augmentation, was 0.
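The augmentation step can be sketched as a majority vote across the stored trees: a new sequence is pushed through each Boundary-Tree, inherits the cluster label of the representative it lands on, and the vote margin doubles as an updated confidence value. All names here (`find_representative`, `rep_labels`, the prefix-based toy lookup) are hypothetical stand-ins for illustration, not BFClust's API.

```python
from collections import Counter

def augment(seq, trees, rep_labels, find_representative):
    """Assign `seq` a consensus cluster using the stored forest.

    trees: the stored Boundary-Trees.
    rep_labels: one dict per tree, mapping representative -> cluster label.
    find_representative(tree, seq): representative of `seq` in `tree`
                                    (stands in for the tree traversal).
    """
    votes = Counter(
        labels[find_representative(tree, seq)]
        for tree, labels in zip(trees, rep_labels)
    )
    label, count = votes.most_common(1)[0]
    confidence = count / len(trees)   # agreement across the forest
    return label, confidence

# Toy forest of 3 "trees" whose representatives are just 3-letter prefixes:
trees = ["t1", "t2", "t3"]
rep_labels = [{"AAA": 0, "CCC": 1}, {"AAA": 0, "CCC": 1}, {"AAA": 1, "CCC": 1}]
find_rep = lambda tree, seq: seq[:3]  # stand-in for the actual traversal
label, conf = augment("AAATTT", trees, rep_labels, find_rep)
# label == 0 (2 of 3 trees vote for cluster 0), conf == 2/3
```

Because only the stored representatives are consulted, the incoming sequence is compared against a small fraction of the dataset, which is what makes augmentation so much faster than de novo clustering.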

Comparison and benchmarking with existing methods

We selected four existing sequence clustering methods to compare BFClust against. The first is Usearch UCLUST (7), a very fast and scalable algorithm that is also the basis of several other pan-genome phylogenetic analysis pipelines such as SaturnV (27), PanPhlAn (28) and BPGA (10).

Second, we consider CD-HIT, another scalable software tool that has been used directly in pan-genome analysis (8, 9) and for representative selection in other pipelines such as Roary (19). Roary is also included, as it was one of the first popular software tools that allowed the pan-genome analysis of several hundred genomes at once, and has therefore been utilized in recent studies (29, 30). Finally, PanX (18) is included, another recent software tool that can be used with hundreds of genomes.

First, the runtime of the four software tools was compared to BFClust (Figure 5A). The direct-threshold methods UCLUST and CD-HIT are orders of magnitude faster than the other methods.

In order to compare the sensitivity to noise of our approach to existing methods for biological sequence clustering, we generated an extended set of test sequences. Here, we made 10 replicates of the 500-sequence minigenome set (with 10 known gene clusters) at given mutation rates, ranging from 0 to 0.4. We then computed the error against known clusters for each mutation rate.

Both Roary and its representative selection method CD-HIT are very sensitive to low levels of noise in the sequence data (Figure 5B). In contrast, BFClust and PanX have 0 error when the mutation rate is less than or equal to 0.1 (Figure 5B, Table 2). Based on this, we recommend using PanX or BFClust when a high amount of variation is expected in the sequence data, either due to genetic variation, or noise from error-prone sequencing technologies such as Oxford Nanopore (31). The error of BFClust when using other clustering methods is also low at mutation rates ≤ 0.1; however, none of these methods outperform MCL (Supplementary Figure 4).

We evaluated whether the core and accessory genome profiles detected by each method are consistent. A reasonable expectation for a given tool is that it produces similar core and pan-genome size estimates for the 3 larger datasets (MA, Maela, Nijmegen). This expectation is met by all methods but Roary, which shows a big discrepancy in the core and pan-genome size across these datasets (Figure 6A, B). Relative to the other methods, it appears that Roary underestimates the core genome size, and overestimates the pan-genome size (Figure 6A, B). In comparison,

BFClust and PanX both find a larger core genome and a smaller accessory genome compared to the other methods, whereas UCLUST and CD-HIT find a similarly sized core genome, but a larger accessory genome compared to BFClust and PanX (Figure 6). In these analyses, BFClust is used with MCL. The alternative clustering methods were also tested within BFClust.

The full list of isolates used for clustering can be found in Supplementary Table 1.

Within BFClust, a large sequence dataset is reduced to a set of representative sequences using Boundary-Forests. For each input dataset, 10 randomized read orders are generated. The sequences are read in these orders and 10 different Boundary-Trees are constructed as described in (21).

Briefly, the first sequence that is read is placed as the root node, and the second as its child. Each subsequent sequence that is read is compared to the root node and all its children using the Smith-Waterman distance (20). If the sequence being processed is within a pre-determined similarity threshold t of a node already on the tree, then this node on the tree becomes its representative. This means that the sequence being processed is marked with the identity of the representative, and is not included in the tree. Otherwise, the sequence is compared to the current node and all its children, and added to the tree as a child of the node that it is closest to. Most of the input sequences are not included in the tree and are simply associated with a representative on the tree. Boundary-Trees contain ~2% of the original input sequences, making the clustering of large numbers of (e.g. ~1 million) sequences possible. By default, the sequence similarity threshold is 0.1 and each node is allowed up to 10 children. We found that the clustering results on the minigenomes dataset were not altered when these parameters were changed.

Error and selection of best number of clusters

In cases where the ground truth is not known, we use the sum of squared errors (SSE) as a measure of clustering quality. SSE is defined as follows:

SSE = Σ_{i=1}^{K} Σ_{j=1}^{|C_i|} d(m_i, x_{i,j})²

where K is the total number of clusters, C_i is the i'th cluster, |C_i| is the number of elements in C_i, m_i is the medoid (the sequence that has the smallest total distance to all other points within the cluster), x_{i,j} is the j'th element in C_i, and d(·,·) is the pairwise distance. We compute SSE for a user-defined range of K values. The most appropriate number of clusters is determined to be the elbow point, or the point of maximal curvature, of the SSE vs. K curve.
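The SSE and elbow computations can be sketched as follows. Approximating maximal curvature by the largest discrete second difference of the SSE curve is a simplification assumed here for illustration, not necessarily BFClust's exact criterion.

```python
def sse(clusters, dist):
    """SSE as defined above: squared distances from each cluster's medoid
    to every member, summed over all clusters."""
    total = 0.0
    for members in clusters:
        # Medoid: member with smallest total distance to all other members.
        medoid = min(members, key=lambda m: sum(dist(m, x) for x in members))
        total += sum(dist(medoid, x) ** 2 for x in members)
    return total

def elbow_k(ks, sses):
    """Pick the elbow of the SSE-vs-K curve, approximating maximal
    curvature by the largest discrete second difference."""
    best_k, best_curv = ks[0], float("-inf")
    for i in range(1, len(ks) - 1):
        curv = sses[i - 1] - 2 * sses[i] + sses[i + 1]
        if curv > best_curv:
            best_k, best_curv = ks[i], curv
    return best_k

# A typical curve: SSE drops steeply until K = 3, then flattens.
ks = [1, 2, 3, 4, 5, 6]
sses = [100.0, 60.0, 15.0, 12.0, 10.0, 9.0]
best = elbow_k(ks, sses)  # -> 3
```

On such a curve the minimum SSE is at the largest K tested, illustrating why the elbow, rather than the minimum, is the appropriate choice.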