Clustering analysis of single nucleotide polymorphism data 1 reveals population structure of SARS-CoV-2 worldwide

.


Introduction
biological entities based on their genetic closeness (9, 10). The distances of one 48 sequence from the other sequences indicate the degree of relationships. As population 49 genomic datasets grow in size, grouping the entities into clusters helps to simplify the 50 analysis. A cluster is a subset of these entities such that the similarity among the entities 51 in the subset is generally higher than the similarity among the objects in the full set. 52 Traditionally, using the distance matrix and the branches of the phylogenetic tree, 53 entities can be grouped into clusters. However, when the number of entities becomes 54 large, it is not easy to directly and accurately partition the clades in the phylogenetic 55 tree. 56 In order to identify the best way to effectively group the entities, clustering methods 57 emerge as more productive and robust solutions. Clustering is the process of 58 partitioning biological entities into meaningful clusters, based on distances in their 59 phenotypic and genetic characteristics. The objective of clustering is automatically 60 minimizing intra-cluster distances and maximizing inter-cluster distances (11).

61
Accurate clustering helps to better understand the inner relationships between data and 62 inform downstream analysis. Clustering methods have been widely used as a good 63 supplementary tool in phylogenetic analysis, including phylogenetic tree construction 64 (12-14), ancestral relationship identification (15), evolutionary rate estimation (16, 17), update) until the centroids become stabilized or the defined number of iterations has 114 been achieved. The model was implemented using the Python package sklearn with the 115 KMeans function. 116 Another unsupervised method, hierarchical clustering does not require the user to 117 pre-specify the number of clusters to be generated as K-means does. Hierarchical 118 clustering starts by treating each observation as a separate cluster. It then iteratively 119 executes the following two steps: (a) identify the two clusters that are closest together, the clustering features are organized in a CF tree and each leaf node is a subcluster. 130 Each new sample descends the tree by following its closest CF until a leaf node is 131 reached and merges into the leaf node. The distance between the incoming sample and 132 the existing subclusters is limited by the parameter threshold. If the radius of the 133 subcluster obtained by merging the new sample and the nearest subcluster is greater 134 than the square of the threshold, the two farthest subclusters are taken and the 135 subclusters are separated into two clusters on the basis of the distance between these 136 subclusters. After the CF tree is built, a selected clustering algorithm is applied to 137 cluster these subclusters into global clusters (the global clustering step). The selected 138 clustering algorithm is adapted to work with a set of subclusters, rather than a set of 139 data points. The model was implemented using the Python package sklearn with the 140 Birch function.  To trace the evolutionary history within a cluster, we calculated the pairwise 179 mutation dependency scores for polymorphic mutations. For two selected mutations X 180 and Y, the scores S(X|Y) and S(Y|X) can be calculated using the following functions: an objective that is a combination of cluster compactness and cluster separation (35).

234
To determine which method is best, we aligned the partitions of the six clusters 235 against the phylogenetic tree for the three methods ( Figure 1C). The differences 236 between the hierarchical clustering results and the two other clustering results were 237 mainly on the boundary of the clusters. Of the three methods, strains clusters by K-238 means and BIRCH were more compact in the phylogenetic tree than those by 239 hierarchical clustering. For example, the strains in both K-means cluster D and BIRCH 240 cluster D were split into two clusters by hierarchical clustering. However, such split 241 was not supported by the phylogenetic tree ( Figure 1C). The objective of clustering is 242 minimizing intra-cluster distances and maximizing inter-cluster distances. Therefore,

243
we calculated the intra-cluster pairwise genetic distances. The average number of 244 genetic distances for K-means, hierarchical clustering and BIRCH were 4.886, 5.062 245 and 4.900, respectively. The average number of genetic distances of K-means was 246 significantly lower than that of hierarchical clustering (P-value < 0.001, Wilcoxon rank-247 sum test) and BIRCH (P-value < 0.001, Wilcoxon rank-sum test). In general, K-means 248 exhibited the best performance compared to the other methods.

249
In the meantime, we used complementary approaches to validate the K-means 250 clustering results. First, we compared the pairwise genetic distances between intra-251 cluster and inter-cluster. In all six clusters, the average number of intra-cluster genetic 252 distances was significantly lower (P-value < 0.001, Wilcoxon rank-sum test, Figure 1D) 253 than inter-cluster. Second, we applied T-distributed Stochastic Neighbor Embedding (t-254 SNE) to visualize the K-means clustering results. In the t-SNE plot, the strains were 255 well isolated between clusters ( Figure 1E).  considering the sampling bias of the SARS-CoV-2 ( Figure Table S1. It is noteworthy that the clusters in the USA also display geographic 274 preferences ( Figure S6). Strains in cluster D are mainly on the west coast and strains in 275 cluster E are mainly on the east coast. in cluster F, suggesting that the diversity of cluster F was higher than the other clusters.

291
The nucleotide diversities per site for the six clusters were 0.0196%, 0.0222%, 0.0171%, 292 0.0256%, 0.0131% and 0.0132%. The high mutation rates but low nucleotide diversity 293 in cluster E and cluster F suggests that these two clusters may have more fixed 294 mutations than the other clusters. We then calculated the nucleotide diversity of each 295 gene across all clusters ( Figure 3B-G). Except for some short genes that are unlikely to 296 be informative, the diversity of most genes was close to the diversity of their genome-297 wide variants.

298
To test whether these clusters were subject to positive selection or purifying suggesting positive selection in these two clusters ( Figure S8). In contrast, the pairwise 306 dN/dS ratios in the other four clusters were significantly lower than 1 (P-value < 0.001,

307
Wilcoxon rank-sum test), suggesting that the nonsynonymous mutations tended to be 308 selected against. In particular, the pairwise dN/dS in cluster D was much lower than 309 those in the other clusters. indicating that "L" type was more prevalent and transmissible than "S" type. The "S" Europe (75%), North America (65%), Oceania (55%) and Asia (32%). The earliest time 368 when sequences carrying mutation A23403G were collected was in late January 2020.

369
About a month later, these mutations were discovered throughout the world. Cluster E 370 is grouped as an independent cluster by K-means as strains in this group that accumulate 371 two mutations: C1059T and G25563T. These two mutations were first collected on 21

372
February 2020. Compared with cluster C and cluster E, cluster F has three unique fixed 373 mutations: the trinucleotide mutation from position 28881 to position 28883, which was 374 initially discovered on 24 February 2020.

375
In contrast to cluster C, cluster E and cluster F, there are no common fixed 376 mutations between the other three clusters. Cluster D has two fixed mutations (C8782T, 377 T28144C). Mutations C8782T and T28144C were two of the earliest mutations that 378 were collected in early January. Cluster A has three fixed mutations: G11083T, collected in late January, and the third one was discovered about two weeks later. In 381 addition, we found that mutation G14805T may be a recurrent mutation. The earliest 382 sequence carrying both C2558T and G14805T mutations was discovered on 9 February 383 2020, but the earliest sequence carrying only G14805T was discovered about ten days 384 later. Furthermore, the standardized disequilibrium coefficients (D') of mutation 385 G14805T with two genetically linked mutations C8782T and T28144C are very low 386 ( Figure S10), which is consistent with our assumption that G14805T is a recurrent 387 mutation.

388
Of the six clusters, cluster B is the only cluster that has no fixed mutations,

389
suggesting that the population structure in cluster B is complex. We calculated the T. F. Gonzalez, Clustering to minimize the maximum intercluster distance.