Covariance predicts conserved protein residue interactions important to the emergence and continued evolution of SARS-CoV-2 as a human pathogen

SARS-CoV-2 is one of three recognized coronaviruses (CoVs) that have caused epidemics or pandemics in the 21st century and that likely emerged from animal reservoirs. Differences in nucleotide and protein sequence composition within related β-coronaviruses are often used to better understand CoV evolution, host adaptation, and their emergence as human pathogens. Here we report the comprehensive analysis of amino acid residue changes that have occurred in lineage B β-coronaviruses that show covariance with each other. This analysis revealed patterns of covariance within conserved viral proteins that potentially define conserved interactions within and between core proteins encoded by SARS-CoV-2 related β-coranaviruses. We identified not only individual pairs but also networks of amino acid residues that exhibited statistically high frequencies of covariance with each other using an independent pair model followed by a tandem model approach. Using 149 different CoV genomes that vary in their relatedness, we identified networks of unique combinations of alleles that can be incrementally traced genome by genome within different phylogenic lineages. Remarkably, covariant residues and their respective regions most abundantly represented are implicated in the emergence of SARS-CoV-2 are also enriched in dominant SARS-CoV-2 variants.


Introduction 21
The prior emergence of SARS-CoV and MERS-CoV as human pathogens is attributed to zoonotic 22 viruses that transferred from bats to civets and camels, respectively, while SARS-CoV-2 is most similar to 23 viruses isolated from both bats and pangolins 1-6 . The ~30kb genome size of all SARS-related CoVs renders 24 sequence alignment and pairwise distance methods effective for phylogenic studies and determining genetic 25 events that correlate with their adaption to the human host. While nucleic acid sequence-based phylogenies 26 are informative, they clearly have limitations as not all single nucleotide polymorphisms are equal. For 27 example, single-strand RNA viruses possess significant base-pairing in regions of their genomes that can 28 result in different fitness costs even for synonymous mutations because higher-ordered RNA structures and 29 non-coding regions can impact replication, transcription, and recognition by the host immune system 7,8 . 30 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted February 7, 2022. ; https://doi.org/10.1101/2022.01.13.476204 doi: bioRxiv preprint protease. Others include genes for S (Spike), the two viroporins ORF3a and E (envelope), M (membrane), 127 and N (nucleocapsid). Our goal was to align the AA sequences of these proteins to identity covariant pairs 128 and residues within and between core non-structural and structural viral proteins. The alignment resulted in 129 a 9458 AA consensus sequence with only 1.5% of sites being gaps with low coverage and 2% with residue 130 conservation of less than 80%. 131 When the AA sequences for each protein are aligned and compared, the degree of conservation 132 varies between each protein and also within individual and discrete protein domains. Genes that encode 133 nonstructural proteins (nsp)s with roles in RNA metabolism and genome replication (i.e., nsp12-14) are 134 among the most highly conserved in AA identity. Others that encode nsp2, nsp3, and nsp4 are highly 135 variable. The NTD and RBD of Spike show the most significant variability in both residue identity and 136 length. In contrast, the sequence of much of the CTD of the S1 and the entire S2 subunit of Spike is highly 137 conserved. The channel-forming E (envelope) viroporin protein is one of the most highly conserved proteins 138 in contrast to the viroporin ORF3a that exhibits high variability in its NTD and other CTD subdomains. We 139 propose that this is evidence of evolutionary pressure on residues in CoV proteins and that certain protein 140 subdomains may be under higher selective pressure than others. 141 In addition to greater AA sequence variability in some genes, individual residues and continuous 142 sections of AA are either uniquely present or absent in a portion of CoVs. One key aim of our study was to 143 identify and evaluate the covariance of all residues in our selected proteins among these 149 CoVs without 144 bias. We predicted that his analysis might reveal the existence of critical amino acid residue conservation 145 or changes that would possibly correlate with changes in the virus-host range or biological properties. In 146 this regard, SARS-CoV-2 is recognized to possess unique sequences not present in other closely related 147 CoVs including those that define and enhance a furin cleavage site (FCS) of the Spike protein 25,50 . After 148 binding to the ACE-2 receptor, cleavage of S by furin and or other proteases is critical to conformational 149 changes that allow viral envelope-host membrane fusion and subsequent viral RNA entry into the host 150 cytosol 25 . Other changes in S are less understood. For example, certain residues in the NTD of SARS-151 CoV-2 Spike are present in other unrelated CoVs but are notably absent in the corresponding regions of 152 SARS-CoV 51 . Conversely, there are residues in many CoVs with no positional equivalent in SARS-CoV-153 2. To address gaps in the alignment, we have indicated such residues with a "Z" designation to 154 accommodate possible covariance between residues and such deletion occurrences. 155 156 157 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted February 7, 2022. ; https://doi.org/10.1101/2022.01.13.476204 doi: bioRxiv preprint 6

Extracted networks of covariant residues are informative about evolutionary relationships in CoVs 158
Covariance is a quantitative measurement of how often the identity of one AA is correlated to the 159 identity of another AA or AAs in either the same protein or in a completely different protein 52-54 . Because 160 covariant AA residues change in concert with each other, they can define critical AA-AA residue 161 interactions within a protein or in homologous or heterologous protein-protein pairs or instead indicate 162 phylogenetic relatedness based on their co-existence 37,38 . These putative residue interactions may provide 163 new insights into the evolution and relatedness of the CoV family of viruses 55 . To survey the frequency of 164 covariance among a reference collection CoVs, we identified correlative pairs and also assembled groups 165 of three or more (here designated as 'clusters') of covarying amino acid residues using a correlating tandem 166 model 56 . These clusters are not typically generated using other typical pairwise algorithms tailored for 167 determining protein structure or docking interfaces. We chose the FastCoV approach for its distinct quality 168 in identifying larger networks of putative compensatory mutations generated by selection and adaptation 169 which seems well-suited for studying the emergence of viruses similar to SARS-CoV and SAR-CoV-2 56 . 170 This differs from DCA-based and other covariance approaches used to predict co-evolving residue 171 interactions that may use corrected and weighted correlative data that can also be coupled with other various 172 predictive secondary structure motifs to assist in protein structure and interaction predictions 57 . Our 173 approach simply provides a raw covariance purity and percentage score for pairs and larger networks of 174 covariant residues with no goal for structure-based predictions. We selected a purity threshold (0.7) based 175 on our small sampling size and extracted 973,649 unique pairs and 741 clusters. In this preliminary analysis, 176 we identified a collection of gaps and also unique sequences selectively conserved in some CoVs and we 177 concluded that such deletions or insertions, like residues, may also covary with AA sequences in proteins. 178 Deletions were temporarily substituted as rare alternate amino acids in the alignment and covariance was 179 analyzed to reveal putative covariance between all residues and also deletions. This expanded the total 180 number of unique correlating residue pairs (1,089,836) and clusters (769) (Supplemental File S2). We 181 identified many deletions that correlated with AA residues and also other deletions. 182 All CoVs genomes, clusters, and residues were graphed using a force mapping algorithm 183 (Supplementary File S7, shown in Figure 1A). This interactive graph facilitates the extraction of clusters 184 and respective residues and deletions uniquely present to different groups and subsets of CoVs. 185 Remarkably, the spatial organization of graphed CoVs is highly consistent with our phylogenic estimate 186 based on nucleotide alignment ( Figure 1B). CoVs most closely related to SARS-CoV, SARS-CoV-2, and 187 HKU-3 are spatially positioned close to one another in each group solely based on shared covariant residues. 188 Other CoVs that vary regarding relatedness are distributed in between these three indicated groups. Because 189 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted February 7, 2022. ; https://doi.org/10.1101/2022.01.13.476204 doi: bioRxiv preprint some covariant pairs and clusters are entirely inclusive to single groupings or instead shared between certain 190 CoVs, we conclude that covariant residues can be enriched through a common evolutionary history such as 191 ancestry or can be selected by adaption to a specific host(s). 192 To provide information about the phylogenic distribution of any cluster that may be due to ancestry, 193 an average taxonomic distribution score (ATDS) was calculated for each cluster based on the number of 194 CoVs present in a given cluster and their average distribution based on branch lengths estimated in the ML 195 tree (Supplemental File S3). Though this score is relative and also determined by the relatedness of all 196 CoVs in the phylogenic reconstructions, clusters and their respective alleles with a larger ATDS are more 197 broadly represented in the evolutionary record within the scope of these 149 β-coronaviruses analyzed. A 198 small ATDS value indicates these covariant residues in a given cluster are restricted to CoVs that are very 199 similar or almost identical. We predict this class of clusters may be biologically informative about covariant 200 residues specifically enriched in SARS-CoV and SARS-CoV-2 and their respective relatives. Conversely, 201 clusters with large ATDS values are those clusters with residues that are present in more evolutionary 202 disjunctively distributed single or groups of CoVs. These may be the result of divergent or independent 203 selective events or are instead conserved covariant residues that have persisted during the evolution of CoVs 204 and are possibly ancestral or even essential to the lineage of these viruses. 205 Of the 1,089,836 unique pairs with varying degrees of residue identity at each two positions, we 206 identified 522,336 correlative AA residue pair positions and also calculated the number of unique amino 207 acid identities that can exist for each position in the pair (tabulated in Supplemental File S2). This degree 208 of residue representation of each pair varied between a minimum of two (481,024) and a maximum of seven 209 (2). Only ~8% (41,312) of all pairs are represented by three or more unique residue identities and this 210 representation drops significantly stepwise for each unique identity between three and seven. We 211 hypothesized that the increased number of independent residue pairs represented at any two correlative 212 positions in the evolutionary record increases the probability that there is a true interacting relationship 213 between such residues. 214 The position of every covariant residue in the alignment was mapped to the respective residue 215 position in SARS-CoV-2. For an overwhelming majority of residues that show high conservation and are 216 present among the 149 CoVs, this translational numbering assignment based on the sequence alignments is 217 straightforward. For residues in less conserved regions such as those that exist as gaps or insertions in some 218 CoVs including SARS-CoV-2, this created residues positions that are represented by gaps for sites missing 219 in SARS-CoV-2 and duplicate numbering. For example, several CoVs possess between two and eight 220 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted February 7, 2022. ; https://doi.org/10.1101/2022.01.13.476204 doi: bioRxiv preprint additional residues between the aligned numbered positions of AA 7 and 8 of SARS-CoV-2 Spike. If any 221 of these residues are covariant with other residues or gaps, the use of SARS-CoV-2 residue numbering 222 necessitates either no assignment or the number of the residue that flanks the missing residues in SARS-223 CoV-2 to preserve the information position of gapped-residue covariance. We chose the latter to indicate 224 the position, thus some residues may appear to be duplicated or even covariant with themselves when 225 SARS-CoV-2 numbering is employed. We have provided tables that indicates these positions to identify 226 such occurrences (Supplemental File S2). 227 The presence of more variable covariant residues in other proteins varies significantly. For pairs 228 with at least five independent identities (319), nearly half (154) of these are located in the Spike protein. Various structures of Spike trimer are elucidated and well-studied due to the roles in receptor recognition, 230 cell entry, and interactions with monoclonal antibodies 23,58-61 . Using available high-resolution PDB 231 structures, we screened for predicted interacting residues and then referenced our identified covariant pairs 232 to establish a correlation between the number of unique identities in pairs. As observed in the alignment, 233 Spike sequences and high order structures vary between CoVs, and we adjusted scoring both for directly 234 interacting residues and those directly adjacent by one residue position (Supplemental File S5). Residue 235 pairs with increased representation are more likely to interact or be in close proximity to one another in the 236 Spike trimer protein (~24%) than those represented by only two identities (~5%). This provided confidence 237 that residues with increased representation are more likely to have direct interactions with their identified 238 cognate pairs. 239

AA covariance in the CoVs closely related to SARS-CoV-2 is enriched in Spike 240
We examined the identity and distribution of covariant residues within the lineages of CoVs most 241 closely related to SARS-CoV-2 identified in this work designated as Groups 1-5 (Figure 2A). We reasoned 242 that the selective pressure imposed on the AA identity of truly covariant residues should be different than 243 for all other residues. Thus the collection of putative covariant AAs in SARS-CoV-2 and other CoVs that 244 share a common ancestor provides a new perspective about the evolutionary relationships between these 245

viruses. 246
Group 5 exhibits the most numerous and distributed covariant residues identified in distinct clusters 247 ( Figure 2C). This is not surprising based on the apparent evolutionary divergence within these CoVs when 248 compared to Groups 1-4 ( Figure 2B). Both Group 2, which is entirely represented by Pangolin CoVs, and 249 Group 3, all bat CoVs, similarly exhibit a greater number of covariant residues when compared to Group 250 1, also likely due to differences in the overall relatedness of CoVs. The CoVs represented in Group 1 exhibit 251 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted February 7, 2022. ; https://doi.org/10.1101/2022.01.13.476204 doi: bioRxiv preprint 9 high conservation and some members are nearly clonal. For these, nearly all covariant residues in this group 252 are restricted to Spike and ORF3a, except for two residues in Nsp3 (AA 149 and 175) and a single residue 253 in Nsp15 (115). 254 Because CoVs in Groups 1-3 are most closely related to SARS-CoV-2 based on their nucleotide 255 identity, we focused on these and examined covariant residues in Spike (Figures 3A-C). Residues in Spike 256 are recognized to be among the most relevant in the emergence and persistence of SARS-CoV-2 and also 257 implicated in host adaption in both SARS-CoV and SARS-CoV-2 43,62 . We vetted residues in Groups 1-3 258 because we hypothesized these covariant alleles might be important for selection, adaptation, and viral 259 fitness for the most closely SARS-CoV-2-related viruses in human, pangolin, and bat hosts. Furthermore, 260 we identified covariant residues co-present in two or in all three of these groups. By definition, the AA 261 identity of individual covariant residues is not highly conserved in CoVs, but instead, their conserved 262 identity varies with other residues. Thus any covariant pair or cluster of residues may indicate a direct or 263 indirect conserved interaction between AAs important during the adaption of a CoV. We find a common 264 subset of conserved covariant residues between both bat and pangolin CoVs with those closely related to 265 SARS-CoV-2 in Group 1. These may indicate specific interactions between residues and residue identities 266 especially relevant to the biology of SARS-CoV-2 and related CoVs. 267 A majority of the Spike-specific covariant residues common to clusters in Groups 1-3 are located 268 within discrete domains primarily in the S1 domain ( Figures 3B and 3C to be less mutable than other noncovariant residues with low conservation. 279

Clinical and pan covariant residues are similarly represented 280
The resolution and extent of our pan-CoV covariance analysis are in part defined by the number of 281 distinct genomes and also their relatedness. Roles for all identified covariant residues in all proteins cannot 282 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted February 7, 2022. ; https://doi.org/10.1101/2022.01.13.476204 doi: bioRxiv preprint be readily ascertained, but this generated data may be further validated as more SARS-CoV-2 protein 283 structures become available. Because of the vast scope and magnitude of SARS-CoV-2 infections during 284 the ongoing pandemic and availability of whole genomes sequenced, we supposed covariant residues may 285 also be apparent in the millions of sampled clinical strains over 18 months. Genes enriched residues with 286 covariant relationships that appear to be co-present in both the pan and clinical covariant analysis are more 287 likely to be of special interest to SARS-CoV-2 biology. We extracted and stringently selected whole-288 genome sequences that were nearly or entirely complete to avoid artifacts that may bias our analysis and Due to the near clonality of SARS-CoV-2 sequences, we were unsurprised to find only 1.2% of the 292 total covariant residues when compared to those identified in pan-CoV analysis. 13,041 uniquely 293 represented pairs of AA residues can be reduced to 6,137 correlative pairs for all proteins. As observed in 294 the pan-analysis, the distribution is exceedingly skewed toward several genes encoding proteins including 295 Spike. When the distribution of every single residue identified in both the pan and clinical analysis is 296 compared gene-by-gene, regardless of the positions of correlative partner, we discovered the contributed 297 representation for each encoded protein follows a similar trend (Figure 4A and 4B). Genes encoding nsp5 298 (3CL-pro), nsp7-nsp16, Envelope, and Membrane proteins are sparse in coverage of co-identified single 299 residues. In contrast, covariant residues are most abundant in frequency in genes encoding nsp1-3, Spike, 300 and ORF3a. Remarkably, in either category, there are proteins and specific regions of proteins similarly 301 enriched in the distribution of covariant residues for both analyses. When the co-occurrence for each residue 302 is quantified for the entire protein, the observed overlap between clinical and pan covariance is found to be 303 statistically significant for nsp2-4, nsp13, nsp16, Spike, and nucleocapsid. For proteins with overlap 304 measured to be above the threshold of significance, such as nsp1, envelope, and ORF3a, this graphing 305 allows us to observe both similar patterns and frequencies of covariant residues across the protein. Conversely, for nsp6, nsp10, and membrane proteins we see no significant similarities in residue 307 distribution or patterns. 308

Mapping of identified conserved pairs in both analyses 309
We accounted for the distribution of intra-and inter-protein residue pairs identified in both analyses 310 ( Figure 5A). We reasoned these pairs are more informative about conserved residue interactions than the 311 distribution of single residues mapped per protein. As with single residues shown in Figure 4, the 312 distribution of linked residue pairs is not uniform and the density varies significantly by protein and within 313 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted February 7, 2022. ; https://doi.org/10.1101/2022.01.13.476204 doi: bioRxiv preprint protein regions. The majority of linked pairs are mapped to genes encoding nsp1, nsp2, nsp3, Spike, ORF3a, 314 and nucleocapsid. For genes encoding nsp4 and nsp16, the occurrence is sparse or absent in residue pair 315 representation. Of 538 total pairs (Supplemental File S6), 90% (485) are represented by residues within 316 the same protein (Figure 5B), are frequently proximal or adjacent to each other by position, and are not 317 random in distribution. We expect this bias as most interacting residues should be present within the same 318 protein. The remaining 10% (53) intra-protein pairs are similarly clustered and nonrandom in their 319 positional enrichment (Figure 5C). Intra-protein residues in nsp3, Spike, and nucleocapsid are linked with 320 the most diverse partner proteins ( Figure 5F). In contrast, nsp12, nsp13, nsp15, nsp16, and envelope 321 possess only one or two intraprotein pairs. 322

Evidence for interactions within and between Spike and ORF3a linked to viral emergence and 323
adaption 324 The enrichment of residue pairs in the subdomains of Spike and ORF3a were of special interest 325 ( Figure 5D). First, the distribution and abundance of these residues are similar to the covariant residues we 326 identified within the CoVs most related to SARS-CoV-2 ( Figure 2C). Furthermore, of these 224 residues, (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted February 7, 2022. ; https://doi.org/10.1101/2022.01.13.476204 doi: bioRxiv preprint below). The stand-alone identification of these in both pan and clinical covariance analyses and co-345 occurrence in dominant circulating variants indicate these are often under significant selective pressure to 346 either mutate or become absent by consequence of in-frame deletion, often in the context of distal residues. 347 Spike protein remarkably accounts for 58% of residues in our identified 538 coincident pairs and 348 also is recognized to possess the majority of mutations in dominant variants. We examined all 538 residue 349 pairs and then cross-listed the occurrence of each residue in dominant variants to identify those in all 350 proteins. 119 (30%) of 394 total represented residues are found in 15 dominant variants including the two 351 current Omicron sublineages 31 . When both residue pairs are present in a dominant variant (49), we find 352 76% of these are remarkably present together in the same variant lineage (Figure 7). This observation is 353 suggestive of covariance pressure operative in the emergence of variants during the ongoing pandemic. 354

Discussion 355
For both nucleotide and amino acid identity-based approaches, the sequence conservation in either 356 complete or partial regions of CoV genomes continues to be applied to understand the relatedness between 357 β-coronaviruses and SARS-CoV-2. This extends to the emergence of SARS-CoV-2 as a human pathogen 358 responsible for a global pandemic and its continued adaptation. In this work, we examined the conservation 359 of correlative covariant pairs and even clusters of amino acids that appear to change in concert with one 360 another across the entire genome. We acknowledge apparent covariant mutations can also be a simple 361 consequence of spontaneously emerged mutations and other common concurrent mutations at less 362 conserved sites. Conversely, these could be a result of spontaneous mutations that imposes a selection for 363 one or even more compensatory mutations at other sites to maintain or even increase viral fitness. Both 364 instances are certain to be present in this dataset. An applied hypergeometric probability distribution 365 predicts the overlap of 538 covariant pairs between the pan and clinical datasets is exceedingly significant 366 (Figure 7). We acknowledge that the conservation and essentiality of amino acid residues vary significantly 367 in CoVs based on the scope of evolutionary relatedness. This should also influence the probability of true 368 coupling because not all residues are equally conserved or mutable. We propose future efforts should 369 examine such covariance in the context of residue mutability. Recent efforts that applied DCA to predict 370 epistatic interactions in the context of SARS-CoV-2 residue mutability concluded coupling also played a 371 minor role in emerging mutations in variants. The authors surmised that a restricted number of unique 372 genomes and a broad scope of evolutionary divergence among all coronaviruses also limited the analysis 373 performance for epistatis-mutation comparisons in SARS-CoV-2 proteins 63 . We chose to limit our pan-374 analysis to only lineage B β-coronaviruses for our work based on this very principle. 375 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made We find evidence of true covariance in this work when we compare the total number of independent 376 changes present in each pair with a known structure. In Spike, the probability of either a direct or possible 377 indirect interaction by one flanking residue increased from 5% to 24% when the minimum number of unique 378 AA changes in any given residue pair of two identities were compared to those with five. Furthermore, for 379 covariant residues with increased independent residue representation in the pan covariance analysis, these 380 were also more likely to be identified in the clinical covariance analysis. Notably, when residues identified 381 in both analyses are compared, 90% are found within the same gene and enriched in certain genes with 382 important virus-host interactions such as Spike. This observation is consistent with an expected model 383 where intraprotein covariance is predicted to be more abundant than that for interprotein, including proteins 384 that form homotrimers such as Spike. definition reveal co-present residues common to variants, the independence occurrence of these in the Pan-392 CoV covariance is intriguing. We note these clinical sequences were collected through August 2021, over 393 three months prior to the first emergence of Omicron, and yet we identify some Omicron-specific residues 394 and pairs in this work. We propose all covariant residues identified in Spike and other conserved CoV 395 proteins in this work serve as one reference for possible future single and multiple mutations that might 396 arise in dominant variants. These could inform about epitopes and antigenic regions that are possibly 397 vulnerable to enriched mutations also in part due to covariance such as specific regions in the Spike NTD. 398 Furthermore, these may reveal key residues that contribute to yet undiscovered interactions between viral 399 proteins of SARS-CoV-2 including Spike and ORF3a as discussed in an earlier description of our initial 400 results 55 .  Table S4) 71 . We expect some deletions and insertions are due to sequencing and assembly 418 errors but only co-varying deletions should become apparent during covariance analysis. The clinical 419 alignment spans both known and predicted genes between nsp1 and Orf9c, but only covariance between 420 the proteins also studied in pan covariance set are analyzed and compared in this work. 421

Phylogeny and ATDS calculation 422
We inferred phylogeny by reconstructing a maximum likelihood (ML) tree with IQTree after first 423 testing and comparing 286 DNA models by creating initial parsimony trees scored according to Bayesian For all genomes that belong to each cluster, the sum of branch lengths between every possible pair was 432 extracted from the tree file and averaged to calculate the Average Taxonomic Distribution Score (ATDS). 433 This relative score is provided as additional metadata in the Gephi Force mapping file. 434

Covariance analysis and force mapping 435
Pairwise and multiple residue covariance and scores were calculated using FastCov 56 . Alignment 436 files for both pan and clinical CoVs were substituted to provide a "W' in place of absent/deleted residues. 437 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted February 7, 2022. ; https://doi.org/10.1101/2022.01.13.476204 doi: bioRxiv preprint Using the known position of true "W", residues, all "W" deletions were replaced as "Z" to indicate absence 438 following analysis. We set a purity score (0.7) for stringency cutoffs in both the pan-CoV and clinical-CoV 439 sequence alignment. A raw table of predicted covariant pairs is provided as Supplemental Table S2. This 440 allowed the binning of clusters and respective strains for Force Mapping in Gephi using the Multigravity 441 ForceAtlas 2 setting and comparison of covariant residues based on clusters and strains 73 . All clusters and 442 residues and their respective occurrence in CoVs for both analyses are tabulated in Supplemental Table  443 S3. Genomes, clusters, residues were mapped in Gephi using the MultiGravity ForceAtlas 2 algorithm. 73 . 444 This data is provided in Supplemental File S6 for interactive application using Gephi Software. 445

Prediction of interacting residues and mapping of residues in Spike trimer structure 446
The Arpeggio program was used to calculate inter and intramolecular interactions between residues 447 in the 7JJI.PDB file (Supplemental Table S5 unique residues identified as covariant in each independent analyses and the total pairs co-present (overlap) 463 was examined as above by applying a hypergeometric probability formula. 464

Plotting 465
Circular graphing of key collections of residues was graphically plotted using Circos 78 . 466 467 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted February 7, 2022. ; https://doi.org/10.1101/2022.01.13.476204 doi: bioRxiv preprint

Ethics approval and consent to participate 470
This study includes sequence and metadata of 252,102 CoV virus strains from a publically available 471 database (GISAID) and though patient age and sex has been approved to be publically available in this 472 database, only the locations and date of virus isolation are noted in this work. No IRB approval is needed 473 for this data or and the all acknowledged sources and authors for every sequence in this source data 474 tabulated from the public GISAID are provided in Supplemental Table S4. 475

Availability of data and materials 476
All data was generated using publically deposited and available genome and protein sequences and the 477 identity and accession numbers are provided. All generated and analyzed raw output data for which this 478 study is based is provided in this published article and can be referenced in the tabulated supplemental 479 spreadsheet files. 480

Competing interests 481
We declare there are no financial or non-financial competing interests with the published work. (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted February 7, 2022. ; https://doi.org/10.1101/2022.01.13.476204 doi: bioRxiv preprint . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted February 7, 2022. ; https://doi.org/10.1101/2022.01.13.476204 doi: bioRxiv preprint . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted February 7, 2022.  The respective host for each CoV is indicated by color and the average taxonomic distribution score (ATDS) for each cluster is indicated by cluster circle size. All CoV, clusters, and residues can be accessed and extracted from the interactive Gephi file and supplemental tables. CoVs most closely related to SARS-CoV, SARS-CoV-2, and HKU3 based on phylogeny are circles and labeled. (B) Overview of ML tree of 169 CoVs that spans nts of genes ORF1a/b through N (Nucleocapsid). CoVs are colored by host. SARS-CoV, SARS-CoV-2, and HKU3-related CoVs are circled to match groups in Figure 1A.     . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made