## Abstract

**Motivation** Clustering is a fundamental task in the analysis of nucleotide sequences. Despite the exponential increase in the size of sequence databases of homologous genes, few methods exist to cluster divergent sequences. Traditional clustering methods have mostly focused on optimizing high speed clustering of highly similar sequences. We develop a phylogenetic clustering method which infers ancestral sequences for a set of initial clusters and then uses a greedy algorithm to cluster sequences.

**Results** We describe a clustering program *AncestralClust*, which is developed for clustering divergent sequences. We compare this method with other state-of-the-art clustering methods using datasets of homologous sequences from different species. We show that, in divergent datasets, AncestralClust has higher accuracy and more even cluster sizes than current popular methods.

**Availability and implementation** AncestralClust is an Open Source program available at https://github.com/lpipes/ancestralclust

**Contact** lpipes{at}berkeley.edu

**Supplementary information** Supplementary figures and table are available online.

## 1 Introduction

Traditional clustering methods such as UCLUST (Edgar, 2010), CD-HIT (Fu *et al.*, 2012), and DNACLUST (Ghodsi *et al.*, 2011) use hierarchical or greedy algorithms that rely on user input of a sequence identity threshold. These methods were developed for high-speed clustering of large numbers of highly similar sequences (Ghodsi *et al.*, 2011; Li *et al.*, 2001; Edgar, 2010). They are generally considered unreliable for identity thresholds <75%, either because of the poor quality of alignments at low identities (Zou *et al.*, 2018) or because the performance of the threshold used to count short words drops dramatically at low identities (Huang *et al.*, 2010). At low identities, these methods produce uneven clusters in which the majority of sequences are contained in only a few clusters (Chen *et al.*, 2018), and the high variance in cluster sizes reduces the utility of the clustering step for many practical purposes. Clustering of divergent sequences is a fundamental step in genomics analysis because it allows for an early divide-and-conquer strategy that significantly increases the speed of downstream analyses (Zheng *et al.*, 2018), and it is a frequent request of users of at least one clustering method (Huang *et al.*, 2010). Currently, there are no clustering methods that can accurately cluster large taxonomically divergent metabarcoding reference databases, such as the Barcode of Life database (Ratnasingham and Hebert, 2007), into relatively even clusters. Only a few other methods, such as SpClust (Matar *et al.*, 2019) and TreeCluster (Balaban *et al.*, 2019), exist for clustering potentially divergent sequences. SpClust creates clusters using Laplacian Eigenmaps and a Gaussian Mixture Model based on a similarity matrix calculated on all input sequences. While this approach is highly accurate, the calculation of an all-to-all similarity matrix is a computationally expensive step.
TreeCluster uses user-specified constraints for splitting a phylogenetic tree into clusters. However, TreeCluster requires an input tree and thus can also be prohibitively slow for large numbers of sequences, where a phylogenetic tree is difficult to estimate reliably. With the increasing size of reference databases (Schoch *et al.*, 2020), there is a need for new computationally efficient methods that can cluster divergent sequences. Here we present AncestralClust, which was specifically developed for clustering divergent metabarcoding reference sequences into clusters of relatively even size.

## 2 Methods

To cluster divergent sequences, we developed AncestralClust, which is written in C (Figure 1). First, *k* random sequences are chosen and aligned pairwise using the wavefront algorithm (Marco-Sola *et al.*, 2020). A Jukes-Cantor distance matrix is constructed from the alignments, and a neighbor-joining phylogenetic tree is built from this matrix. The Jukes-Cantor model is chosen for computational speed; more complex models could in principle be used to increase accuracy, at the cost of increased computational time. The *C* − 1 longest branches in the tree are then cut to yield *C* clusters. These subtrees comprise the initial starting clusters. The sequences in each starting cluster are aligned in a multiple sequence alignment using kalign3 (Lassmann, 2020). The ancestral sequence at the root of the tree of each cluster is estimated using the maximum of the posterior probability of each nucleotide using standard programming algorithms from phylogenetics (see e.g., Yang, 2014). The ancestral sequences are used as the representative sequences of the clusters. Next, each of the remaining sequences is assigned to the cluster whose ancestral sequence has the shortest nucleotide distance to it, as measured from the wavefront alignment between the sequence and each of the *C* ancestral sequences. If the shortest distance to any of the *C* ancestral sequences is larger than the average distance between clusters, the sequence is saved for the next iteration. We iterate this process until all sequences are assigned to a cluster. In each iteration after the first, a branch in the phylogenetic tree is cut if it is longer than the average length of the branches cut in the first iteration. In practice, only one or two iterations are needed for most data sets if *k* is defined to be sufficiently large.
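The distance calculation and the greedy assignment step described above can be illustrated with a minimal Python sketch. This is not the AncestralClust implementation (which is written in C and uses wavefront alignments): the function names are hypothetical, the mismatch fraction between equal-length strings stands in for an alignment-derived distance, and "average distance between clusters" is assumed here to mean the mean pairwise distance between the ancestral sequences.

```python
import math
from itertools import combinations


def jukes_cantor(p):
    """Jukes-Cantor distance from the proportion p of mismatched sites."""
    if p >= 0.75:
        return math.inf  # the correction is undefined at saturation
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def mismatch_fraction(a, b):
    """Toy stand-in for a pairwise alignment: compares equal-length strings."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance(a, b):
    return jukes_cantor(mismatch_fraction(a, b))


def assign(sequences, ancestors):
    """Greedy step: attach each sequence to its nearest ancestral sequence,
    or defer it when even the nearest one is farther than the threshold.
    Requires at least two ancestral sequences."""
    pairs = list(combinations(ancestors, 2))
    # Assumption: the "average distance between clusters" is taken to be the
    # mean pairwise distance between the ancestral sequences.
    threshold = sum(distance(a, b) for a, b in pairs) / len(pairs)
    clusters = {i: [] for i in range(len(ancestors))}
    deferred = []
    for seq in sequences:
        dists = [distance(seq, anc) for anc in ancestors]
        best = min(range(len(ancestors)), key=dists.__getitem__)
        if dists[best] > threshold:
            deferred.append(seq)  # saved for the next iteration
        else:
            clusters[best].append(seq)
    return clusters, deferred


ancestors = ["AAAAAAAAAA", "AAAAACCCCC"]
reads = ["AAAAAAAAAT", "AAAAACCCCG", "GGGGGGGGGG"]
clusters, deferred = assign(reads, ancestors)
```

In this toy input, the first two reads differ from one ancestral sequence at a single site and are assigned to it, while the third read is saturated with respect to both ancestral sequences and is deferred.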

We compared AncestralClust to five other state-of-the-art clustering methods: UCLUST (Edgar, 2010), meshclust2 (James and Girgis, 2018), DNACLUST (Ghodsi *et al.*, 2011), CD-HIT (Fu *et al.*, 2012), and SpClust (Matar *et al.*, 2019). We used a variety of measurements to assess the accuracy and evenness of the clustering. We calculated two traditional measures of accuracy, purity and normalized mutual information (NMI), used in Bonder *et al.* (2012). The purity of clusters is calculated as:

$$\text{purity}(\Omega, C) = \frac{1}{N} \sum_{k} \max_{j} |w_k \cap c_j|$$

where $\Omega = \{w_1, w_2, \ldots, w_k\}$ is the set of clusters, $C = \{c_1, c_2, \ldots, c_j\}$ is the set of taxonomic classes, and $N$ is the total number of sequences. NMI is calculated as:

$$\text{NMI}(\Omega, C) = \frac{I(\Omega, C)}{[H(\Omega) + H(C)]/2}$$

where mutual information gain is $I(\Omega, C)$ and $H$ is the entropy function. To measure the evenness of the clusters, we used the coefficient of variation, which is calculated as:

$$\text{CV} = \frac{1}{m} \sqrt{\frac{1}{j} \sum_{i=1}^{j} (n_i - m)^2}$$

where $n_i$ is the number of sequences in cluster $i$, $j$ is the total number of clusters, and $m$ is the mean size of the clusters. We also used a taxonomic incompatibility measure to assess the accuracy of the clusters. Let $a, b$ be a pair of species found in cluster $i$. Incompatibility at a given taxonomic rank is calculated by first identifying the number of times $a$ and $b$ exist in clusters other than cluster $i$. The total incompatibility is calculated by summing over all pairs of sequences ($a, b$) and all $i$.
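
As a rough illustration of the purity, NMI, and coefficient of variation measures, the following Python sketch computes them from two parallel label lists (one cluster label and one taxonomic class label per sequence). It assumes the common definitions: NMI normalized by the mean of the two entropies, and the coefficient of variation computed with the population standard deviation. The function names are illustrative and are not part of AncestralClust.

```python
import math
from collections import Counter


def purity(clusters, classes):
    """Fraction of sequences that fall in the majority class of their cluster."""
    by_cluster = {}
    for w, c in zip(clusters, classes):
        by_cluster.setdefault(w, Counter())[c] += 1
    return sum(max(cnt.values()) for cnt in by_cluster.values()) / len(clusters)


def entropy(labels):
    """Shannon entropy (natural log) of a label list."""
    n = len(labels)
    return -sum((k / n) * math.log(k / n) for k in Counter(labels).values())


def mutual_information(clusters, classes):
    """Mutual information between cluster labels and class labels."""
    n = len(clusters)
    joint = Counter(zip(clusters, classes))
    size_w = Counter(clusters)
    size_c = Counter(classes)
    return sum(
        (k / n) * math.log(n * k / (size_w[w] * size_c[c]))
        for (w, c), k in joint.items()
    )


def nmi(clusters, classes):
    """Assumption: normalization by the arithmetic mean of the entropies."""
    denom = (entropy(clusters) + entropy(classes)) / 2.0
    return mutual_information(clusters, classes) / denom if denom > 0 else 0.0


def coefficient_of_variation(clusters):
    """Assumption: population standard deviation of cluster sizes over the mean."""
    sizes = list(Counter(clusters).values())
    m = sum(sizes) / len(sizes)
    var = sum((n_i - m) ** 2 for n_i in sizes) / len(sizes)
    return math.sqrt(var) / m
```

For example, `purity([0, 0, 0, 1], ["a", "a", "b", "b"])` gives 0.75, because three of the four sequences belong to the majority class of their cluster, and the cluster sizes 3 and 1 give a coefficient of variation of 0.5.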

Both NMI and taxonomic incompatibility are very sensitive to the number of clusters and to the unevenness of cluster sizes. To allow a fair comparison when the number of clusters and the evenness of cluster sizes vary, we therefore calculate the *relative NMI* and *relative incompatibility*. These measures are calculated by scaling the raw values relative to their expected values under random assignment, given the number of clusters and the cluster sizes. We estimated *relative NMI* by dividing the raw NMI score by the average NMI of 10 clusterings in which sequences were assigned at random with equal probability to clusters, such that the cluster sizes are the same as those produced in the original clustering. The same procedure was used to convert the taxonomic incompatibility measure into relative incompatibility.
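
The scaling procedure can be sketched in Python as follows. This is an illustrative sketch rather than the code used in the study: random assignment with fixed cluster sizes is implemented here by shuffling the cluster labels, NMI is assumed to be normalized by the mean of the two entropies, and the names `relative_nmi`, `n_random`, and `seed` are hypothetical.

```python
import math
import random
from collections import Counter


def entropy(labels):
    n = len(labels)
    return -sum((k / n) * math.log(k / n) for k in Counter(labels).values())


def nmi(clusters, classes):
    """NMI, assuming normalization by the mean of the two entropies."""
    n = len(clusters)
    size_w, size_c = Counter(clusters), Counter(classes)
    info = sum(
        (k / n) * math.log(n * k / (size_w[w] * size_c[c]))
        for (w, c), k in Counter(zip(clusters, classes)).items()
    )
    denom = (entropy(clusters) + entropy(classes)) / 2.0
    return info / denom if denom > 0 else 0.0


def relative_nmi(clusters, classes, n_random=10, seed=0):
    """Raw NMI divided by the mean NMI of size-preserving random clusterings."""
    rng = random.Random(seed)
    raw = nmi(clusters, classes)
    shuffled = list(clusters)
    total = 0.0
    for _ in range(n_random):
        # Shuffling the labels reassigns sequences at random while keeping
        # every cluster at its original size.
        rng.shuffle(shuffled)
        total += nmi(shuffled, classes)
    baseline = total / n_random
    return raw / baseline if baseline > 0 else math.inf


labels = [i // 10 for i in range(40)]  # four clusters of ten sequences
score = relative_nmi(labels, labels)   # clustering identical to the classes
```

A relative NMI well above 1 indicates that the clustering agrees with the taxonomic classes far more than a random clustering with the same cluster sizes would.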

## 3 Results

To first assess the performance of clustering methods on divergent nucleotide sequences, we used 100 random samples of 10,000 sequences from three metabarcode reference databases (16S, 18S, and Cytochrome Oxidase I (COI)) from the CALeDNA project (Meyer *et al.*, 2019). We chose to compare our method against UCLUST on this dataset because it is the most widely used clustering program and it performs better than CD-HIT at low identity thresholds (Chen *et al.*, 2018).

We first compared AncestralClust against UCLUST using *relative NMI* and the coefficient of variation (Figure 2). We used *k* = 300 random initial sequences, which is 3% of the total number of sequences in each sample, and *C* = 16 cuts in the initial phylogenetic tree. The relative NMI tends to be higher, and the coefficient of variation lower, for AncestralClust across all barcodes. This suggests that, for these divergent eDNA sequences, AncestralClust provides clusterings that are more even in size and more consistent with conventional taxonomic assignment. As a second measure of accuracy, we measured *relative incompatibility* and the coefficient of variation for AncestralClust and UCLUST on the same datasets under the same running conditions. In Figure 3, AncestralClust tends to create balanced clusters with lower relative taxonomic incompatibilities than UCLUST at all taxonomic levels. Similar results are seen for metabarcode 18S (Fig S1). However, for metabarcode 16S (Fig S2), AncestralClust performs noticeably better than UCLUST at the species, genus, and family levels, but at the order, class, and phylum levels it performs either the same or worse. Also, at the species, genus, and family levels, the *relative incompatibility* of the UCLUST clusters increases dramatically as their coefficient of variation decreases.

Next, we analyzed two datasets with different properties: one dataset of diverse species from the same gene and another dataset of homologous genes from species of the same phyla. In the first dataset, we expect the sequences to cluster according to species. In the second dataset, we expect the sequences to cluster according to gene. We compared AncestralClust to four commonly used clustering programs (UCLUST, meshclust2, CD-HIT, and DNACLUST) and one clustering program designed for divergent sequences, SpClust. The first dataset contained 13,043 sequences from the COI CALeDNA database from 11 divergent species belonging to 7 different phyla and 11 different classes, and the second dataset contained sequences from 6 different genes from taxonomically similar species. First, we compared all methods using the 13,043 COI sequences from the 11 different species (Table 1). We expect these sequences to form 11 different clusters, each including all the sequences from one species. We chose identity thresholds to enforce the expected number of clusters for each method. We were unable to form 11 clusters using CD-HIT because the program does not allow clustering of sequences with identity thresholds < 80% at default parameters. For SpClust, we used the three precision modes available for the method. In this analysis, AncestralClust achieved a perfect clustering (the purity was 1 and *relative incompatibility* was 0), although it was the second slowest method and had the second lowest memory requirements. UCLUST was one of the fastest methods and used the least amount of memory, but it had the second lowest purity and the third highest *relative NMI*. meshclust2 had no incompatibilities and the second highest purity and *relative NMI* values but was the third slowest method. DNACLUST had the most uneven clusters and the second lowest *relative NMI* value, with the highest *relative incompatibility*.
SpClust only identified one cluster, with a computational time of ~2 days. In comparison, AncestralClust took ~5 minutes and UCLUST used < 1 second.

Next, we analyzed ‘genomic set 1’ from Matar *et al.* (2019), which consists of 39 sequences from 6 homologous genes (FCER1G, S100A1, S100A6, S100A8, S100A12, and SH3BGRL3 in Table 2). We expect these sequences to form 6 clusters. We varied the identity thresholds for UCLUST and meshclust2 using thresholds of 0.4, 0.6, and 0.8. For CD-HIT, we used the lowest identity threshold available with default parameters, which is 0.8. We were unable to use DNACLUST for this analysis because it cannot handle sequences longer than 4500bp (the average sequence length was 2,387.9bp and the longest sequence was 5,363bp). Since this dataset contained 6 different genes, we calculated *relative NMI* using genes as the classes and did not use incompatibility as an accuracy measure. Only AncestralClust, UCLUST, and meshclust2 produced the expected number of clusters, and among these methods AncestralClust had the highest purity value. AncestralClust was the second slowest method and had the highest memory requirements. This is due to the wavefront alignment algorithm, which requires $O(s^2)$ memory, where *s* is the alignment score. Since alignments were performed between sequences from 6 different genes that were longer than 1.5kb, this resulted in a high value of *s*. SpClust had the highest *relative NMI* in all precision modes and the same purity as AncestralClust in its moderate and maximum precision modes, but it failed to produce the expected number of clusters.

## 4 Conclusions

We developed a phylogenetic-based clustering method, AncestralClust, specifically to cluster divergent metabarcode sequences. We performed a comparative study between AncestralClust and widely used clustering programs such as UCLUST, CD-HIT, DNACLUST, and meshclust2, as well as SpClust, which is designed for divergent sequences. UCLUST and DNACLUST are substantially faster than AncestralClust and should be the preferred methods if computational speed is the main concern. However, for the relatively divergent sequences analyzed here, AncestralClust tends to form clusters of more even size with lower taxonomic incompatibility and higher NMI than other methods. We recommend the use of AncestralClust when sequences are divergent, especially if a relatively even clustering is also desirable, for example in divide-and-conquer approaches where the computational cost of downstream analyses increases faster than linearly with cluster size.

## Acknowledgements

This work used the Extreme Science and Engineering Discovery Environment (XSEDE) Bridges system at the Pittsburgh Supercomputing Center through allocation BIO180028.