Missense variants in health and disease affect distinct functional pathways and proteomics features

Missense variants are present amongst the healthy population, but some of them are causative of human diseases. Therefore, a classification of variants associated with “healthy” or “diseased” states is not always straightforward. A deeper understanding of the nature of missense variants in health and disease, the cellular processes they may affect, and the general molecular principles which underlie these differences, is essential to better distinguish pathogenic from population variants. Here we quantify variant enrichment across full-length proteins, their domains and 3D-structure defined regions. We integrate this with available transcriptomic and proteomic (protein half-life, thermal stability, abundance) data. Using this approach we have mined a rich set of molecular features which enable us to understand the differences underlying pathogenic and population variants: pathogenic variants mainly affect proteins involved in cell proliferation and nucleotide processing, localise to protein cores and interaction interfaces, and are enriched in more abundant proteins. In terms of their molecular properties, we find that common population variants and pathogenic variants show the greatest contrast. Additionally, in contrary to other studies, we find that rare population variants display features closer to common than pathogenic variants. This study provides molecular details into how different proteins exhibit resilience and/or sensitivity towards missense variants. Such details could be harnessed to predict variant deleteriousness, and prioritise variant-enriched proteins and protein domains for therapeutic targeting and development. The ZoomVar database, which we created for this study, is available at http://fraternalilab.kcl.ac.uk/ZoomVar. It allows users to programmatically annotate a large number of missense variants with protein structural information, and to calculate variant enrichment in different protein structural regions. Significance Statement One of the greatest challenges in understanding the genetic basis of diseases is to discriminate between likely harmless and potentially disease-causing sequence variants. To better evaluate the pathogenic potential of missense variants, we developed a strategy to quantitatively measure the enrichment of both disease and non disease-related variants within a protein based on its structural and domain organisation. By integrating available transcriptomics and proteomics data, our approach distinguishes pathogenic from population variants far more clearly than previously possible, and reveals hitherto unknown details of how different proteins exhibit resilience and/or sensitivity towards genetic variants. Our results will help to prioritise variant-enriched proteins for therapeutic targeting; we have created the ZoomVar database, accessible at http://fraternalilab.kcl.ac.uk/ZoomVar, for programmatic mapping of user-defined variants to protein structural and domain information.

The genomic revolution has brought about large advances in the identification of disease-associated variants. However, despite the recent explosion of genetic data, the problem of "missing heritability" still persists [1], where the genetic component of a phenotype remains poorly identified. It is often the case that for some variants, a causal link to the disease in question is difficult to establish. Prime examples include variants with low penetrance, and/or those with higher penetrance which are unique to single/few individuals, such as de novo variants implicated in developmental disorders [2]. Compensated pathogenic mutations represent another such case, where mutations are present as the wild-type in other species, but their pathogenic effects are negated by another variant [3]. Difficult cases also arise in the analysis of somatic cancer variants, where driver mutations can be challenging to segregate from passenger mutations; moreover this classification may vary from case to case [4]. These variants pose challenges to the detection of disease association using existing statistical methods. In head-to-head comparison against large-scale saturation mutagenesis screens, where mutational impact could be measured in vitro, current predictive methods were shown to be limited in accuracy [5,6].
Moreover, variant impact prediction has focused primarily on detecting differences between disease-associated and common variants, neglecting the distinction between disease-associated and rare variants; thus it has been suggested that these do not perform so well when distinguishing rare neutral variants from those which are pathogenic [7]. The boundary separating disease-causing from neutral variants can be fluid: for example, a number of missense variants thought to lead to severe Mendelian childhood disease were identified in nominally healthy individuals in the ExAC database [8]. Moreover, whether common population variants have more functional impact than rare variants is hotly debated [9,10]. A better understanding of the molecular principles which underlie differences between disease-associated and population variants is necessary to improve variant classification, and define more clearly the boundaries which distinguish variants in health and disease.
Here we present a detailed analysis of different classes of missense variants, including germline disease variants, somatic cancer variants (both "driver" and "passenger" variants with varying effects on tumour progression), as well as population variants of different frequencies, in the quest to extract the governing principles of variant pathogenicity. We rely on the synergy between utilising two types of data: first, we place emphasis on mapping the localisation of variants on protein structures, taking into account their positions in the protein fold, as well as their proximity to functional sites (e.g. post-translational modifications, or PTMs) [11,12,13,14,15,16]. Such protein structural information has been shown to be effective in uncovering the impact of variants at the molecular level [17]. In the field of cancer research, protein structure-based methods have been used to successfully predict cancer driver genes [18,19], as validated by a recent large-scale study by Bailey and colleagues [20]. Despite such success, 3D structure-based evaluation does not appear to have been applied systematically to other classes of variants (i.e. population and Mendelian disease-associated variants).
Second, we also make use of recently available large-scale proteomic measurements, including protein half-life [21], abundance [22], thermal stability [23] and transcriptomics data [24], to uncover biophysical and biochemical principles governing the impact of variants. Our analyses highlight a striking difference in the enrichment of pathogenic and population variants, which depends upon their localisation to protein domain and structural features. Using these features, we demonstrate that rare population variants display characteristics which are more similar to common population variants than to disease-associated variants, reinforcing the boundary between variants in health and disease. This integrative analysis provides molecular details into how resiliance and sensitivity to missense variants are manifested in different proteins and functional pathways. We have created the ZoomVar database (http://fraternalilab.kcl.ac.uk/ZoomVar), which holds the data generated in this analysis. ZoomVar is designed for large-scale programmatic structural annotation of missense variants, and calculation of the enrichment of missense variants in different protein structural regions. Comprehensive mapping of structural localisation of variants could inform the development of therapeutic interventions, e.g. structure-based drug design and/or drug repurposing [25]. More generally, the wealth of features that separates missense variants in health and disease could contribute to building and training next-generation predictors, which hold the promise of improving the accuracy of variant impact prediction.

Materials and Methods
See Supplementary Methods in SI Appendix for a more detailed account.

ZoomVar Database
ZoomVar was constructed by mapping human protein sequences to resolved structures/homologues from the PDB using BLAST [40]. Protein domains were defined by scanning UniProt sequences against the PFAM seed library [41] using HMMER [42]. Per-residue mappings were performed by the alignment softwares T-COFFEE [43] or Stretcher [44] (which was used to map UniProt and Ensembl sequences which were not of the same length, and were too long to align using T-COFFEE). These generated correspondences between PDB structures and those proteins/domains with structural coverage. Interaction complexes were inferred from homologues (defined using BLAST). As an example, if protein A and B are annotated as interacting in UniPPIN, and their structure homologues A and B are located in a resolved structural complex (and at least one residue from each protein is involved in a shared interface), residues from A and B are mapped onto A and B to infer their interaction interface. The partner-specific regression formula from HomPPI [45] was used to assign a score and zone to each interaction interface inferred in this way.

Definition of regions
Structural regions. We partitioned protein/domain into surface, core and interface regions. Interface regions were considered to be composed of residues which bind to at least one protein interaction partner. The interfaces were assigned using POPSCOMP [46]. Residues with a change in solvent accessible surface area [SASA] > 15 A 2 were annotated as interface residues [47]. For surface and core regions, these were classified by considering the quotient SASA [Q(SASA)] per residue, which was computed using POPS [48]. Core residues were defined as those with a Q(SASA) < 0.15 [47]. Surface residues were defined as those with a Q(SASA) ≥ 0.15 which do not take part in protein-protein interaction interfaces.
Order and disorder. Disordered protein regions were predicted using DISOPRED3 [49]. We overlaid these definitions of ordered and disordered regions with Pfam domain boundaries, and partitioned protein sequences into intra-domain ordered, intra-domain disordered and inter-domain disordered regions.
Functional sites. Post-translational modification (PTM) sites, specifically ubiquitination and phosphorylation sites, were obtained from PhosphoSitePlus [50]. Regions close to phosphorylation and ubiquitination sites were defined as those within 8Å in Euclidean distance.

Mapping of variant data
Variants in each dataset were annotated according to protein region localisation using the ZoomVar database.   Table 1: Numbers of missense variants which localise to different levels and regions of protein anatomy. Data are listed for each of gnomAD, COSMIC and ClinVar datasets. Here "common" and "rare" variants are subset of gnomAD defined using the Minor Allele Frequency (MAF) cut-off of 0.01, below which variants are classified as "rare". This definition is used throughout this work except for the analysis on varying the "rarity" of variants (see Figure 8). Note that the full length domain-type statistics are omitted here, as by definition they will be identical to the "full-length domain" row. Figure 2B illustrates the definition of regions listed here in the first column.
In the exploration of variant enrichment in different structural regions, the COSMIC data was divided into "driver" and "non-driver" subsets, taking drivers as variants which map to all proteins from both tier 1 and tier 2 of the Cancer Gene Census (CGC) (COSMIC v84). The non-driver subset contains all other variants.

The protein anatomy
In this study missense variant enrichment was quantified across the "protein anatomy", in which we partition the human proteome in different ways (see Figure 2A-B). We first define a list of levels of the anatomy, namely: (i) individual proteins; (ii) specific domains of proteins, and; (iii) instances of a domain-type across the human proteome. See Results for detailed examples. Here, missense variant enrichment quantification was considered in both of the following scenarios: (i) for a given full-length instance of a level, relative to all other instances at the given level (e.g. for epidermal growth factor receptor [EGFR] relative to all other proteins in the human proteome), or (ii) for a given region, relative to all regions defined under a given criteria at a relevant level (e.g. for the protein core relative to all protein structural regions). For the sake of clarity, in this Methods section the instance of interest is hereafter referred to as the entity of interest; its use is clarified in the next paragraph, and illustrated in Figure 1.

Calculation of missense variant enrichment
The binomial cumulative distributive function (Equation 1, also illustrated in Figure 1A) was used to assess the missense variant enrichment of a given entity, and the two-tailed binomial test was used to assess the significance of enrichment/depletion. Briefly, the number of missense variants in the given entity is modelled by a Binomial variable X, parameterised by: (i) k, the number of observed missense variants which localise to the given entity, (ii) n, the total number of missense variants which localise to all relevant entities, and (iii) p, the ratio of the size of the entity of interest (in terms of number of residues) to the total size of all relevant entities: Here, P (X entity ) ≤ k) is the cumulative probability of observing k missense variants in the chosen entity. Under this scheme, for example if the missense variant enrichment of a particular protein W in dataset D is to be calculated, we took the total number of missense variants in dataset D as n, the number of missense variants which localise to W as k, and the ratio between the size of protein W to the sum of sizes of all proteins in the proteome as p. Note that this framework of variant enrichment quantification, in contrary to others [52,53,54], is not designed to detect mutational "hotspots" clustered in sequence or structure space. Instead, it quantifies the extent to which missense variants are found in the entity concerned, evaluating whether the number of such variants are more or less than expected.
The overall missense variant enrichment for each dataset was also calculated using a density-based metric ω (see Equation 2).
ω(X entity ) = X entity /S entity X all entities /S all entities (2) where S refers to the size in terms of the number of amino acids.
Here 95 % confidence intervals were estimated via bootstrapping (10,000 iterations). The 2-tailed significance of enrichment/depletion was estimated by simulation of the null background. 10,000 simulations were carried out for each dataset, in which the number of variants which localise to a given entity was kept constant, but their location within the entity randomised. The density of variants was calculated for each simulation and compared to the actual value in order to derive a p-value. Simulations were performed in this way, keeping the observed number of missense variants fixed, in order to overcome bias which stems from the assumption that variants are uniformly distributed throughout the proteome.

Gene set enrichment
Gene enrichment analyses were performed using Gene Set Enrichment Analysis (GSEA), using the implementation provided by the R fgsea package [55]. Given an enrichment statistic for each query gene, the GSEA algorithm outputs a score per gene set, which quantifies the enrichment of query genes in the sets examined. This is then normalised by the size of the gene set, to give a normalised enrichment score (NES) [56]. We utilise the centred VES enrichment statistic, i.e. substrating 0.5 from the VES, as input into the GSEA algorithm.
Thus, proteins with the number of missense variants observed fewer than expected would have a negative score.

Definition of pathway clusters
The pathway normalised enrichment scores (NESs), calculated at the whole protein level for each dataset, were used to perform K-means clustering of KEGG pathways [57]. The R package NbClust [58] was used to determine the optimum number of clusters.

Analysis of expression, abundance, and stability data
Spearman correlations of protein-wise and region missense variant enrichments with expression levels (RPKM), abundance (ppm), half-life (hours), thermal stability (Tm, in o C), and density (mean contacts of core Cαs) were calculated. Additionally, gene set enrichment analysis was performed as detailed above, except that the mean value for each quantity of interest was subtracted to obtain values centred around 0, allowing both pathway enrichment and depletion to be assessed (see SI Appendix).

Statistics and data visualisation
The majority of data analyses were performed in the R statistical programming environment. All corrections for multiple testing have been done using the Benjamini-Hochberg method in R (p.adjust function). Bootstrapping was performed using the boot package [59]. Spearman correlations were performed using the SpearmanRho function of the DescTools package [60]. Heatmaps were produced with either the heatmap.2 function in the gplots package [61] or the ComplexHeatmap package [62], in which clustering, wherever shown, was performed with hierarchical clustering (hclust function) using default parameters unless otherwise stated. Circos plots were generated with the Circos package [63]. Additionally, binomial cumulative distributive functions were calculated and two-tailed binomial tests performed using the NumPy package in Python [64].

A detailed protein-centric anatomy of variant enrichment across scales
We present a multifactorial analysis of missense variants observed in the general population (gnomAD database) [28], in comparison to somatic cancer-associated missense variants from the COSMIC database [27] and diseaseassociated missense variants from the ClinVar database [26]. Throughout this analysis we further divide the gnomAD data by their minor allele frequencies (MAF) into common and rare variants, to investigate whether there are differences between these two subsets. A summary of the numbers of missense variants investigated in each dataset is given in Table 1, and a more detailed breakdown is given in the SI Appendix S4.1.
We compare pathogenic and population variants in terms of their associations with specific features across the molecular scale, in a framework we call "protein anatomy", in which we partition the human proteome in different ways. This includes the consideration of individual proteins (for example, investigating the enrichment of missense variants in the epidermal growth factor receptor [EGFR] protein), specific constituent domains of proteins (e.g. the EGFR tyrosine kinase domain), or generally for all instances of a domain-type (e.g. all tyrosine kinase domains) found in the human proteome. These are referred to as the levels of protein anatomy ( Figure 2A). For each level, we analyse variant enrichment in full-length entities (i.e. protein/domain/domaintype, dependent on the level of interest), and also various constituent regions defined using different criteria ( Figure 2B-C). These include protein structural information (partitioning into surface, core and interacting interfaces), protein disorder predictions (segregating into ordered and disordered regions, within or outside of Pfam domains) and vicinity to functional sites such as phosphorylation and ubiquitination sites. To quantify missense variant enrichment, we employ a similar approach to that used in the prediction of cancer driver genes [65]: variant enrichment has been modelled using a binomial distribution (Methods Equation 1), which allowed for quantification of a Variant Enrichment Score (VES), ranging from 0 to 1, and normalised by the size of the region/protein/domain in question ( Figure 2D, also see Methods). We also quantified the robustness of such quantification of variant enrichment: the significance of the enrichment/depletion of missense variants, in terms of their density, is assessed by comparison to simulated null distributions, in which the number of missense variants is kept identical to that observed in the data, but their positions within the protein are randomised.
This goes beyond similar studies (e.g. [14,15,16]) and addresses biases which could result from the selective focus in molecular studies of disease-related proteins. We use this method to quantify and compare pathogenic and population variants in terms of "microscopic", atomistic protein structural features, and examined the association between these enrichment statistics and large-scale, "macroscopic" features like functional pathways, as well as the various proteomics features collated (see Methods and below).

Disease-associated and population variants affect different functional pathways
We first investigated whether variants from each dataset localise to proteins which are involved in distinct functional pathways. Gene Set Enrichment Analysis (GSEA) was performed on a pre-ranked list of proteins using their whole-protein VESs computed for each dataset. The pathway enrichment scores were then subjected to clustering and Principal Component Analysis (PCA) (see Materials and Methods). As shown in Figure 3A, variant enrichment segregates pathways into three clusters. Strikingly, each pathway cluster appears to have distinct characteristics (see Figure 3B-D for the pathway terms belonging to each cluster). The cluster visualised in orange is primarily composed of terms associated with cancer, growth and proliferation, whereas that coloured in pink contains pathways associated with splicing, transcription, translation and metabolic terms. Pathways associated with sensory perception and the immune response are found in the "green" cluster. A handful of metabolic pathways also localise to this cluster, however, these appear to be more associated with environmental response and adaptation than those pathways found in the "pink" cluster; for example, pathways associated with the metabolism of drugs and xenobiotics are found here. For brevity, the "orange", "pink" and "green" clusters will be termed the "proliferation", "nucleotide processing" and "response" clusters respectively, for the remainder of this text. A list of pathways assigned to each cluster is given in the SI Appendix Section S4.2.
Strikingly, this visualisation reveals that population variant datasets (gnomAD rare and common) are clearly ClinVar variants is visible for specific "nucleotide processing" functions, e.g. ribosomes. Across all pathways, the trends of functional distinction can be further visualised in the Circos [63] plot ( Figure 4A), whereby the extent of shared enrichment of pathways is visualised by arcs across the different coloured segments. Identical to Figure 3D, pathways in the "response" cluster are enriched in missense variants across all four datasets; enrichment in the "proliferation" cluster is shown only in the disease-associated variants and the "nucleotide processing" cluster is unique to ClinVar variants. This analysis clearly distinguishes missense variants in health from those which are found in diseased individuals; moreover their different tendencies to perturb specific functional pathways are also highlighted.
We went on to extend this analysis to across protein structural regions (Figures 4 and SI Appendix Figure  S4). Here we find that proteins enriched in gnomAD variants at the surface ( Figure 4B) are significantly enriched in pathways belonging to the "proliferation" cluster. Moreover, this enrichment is shared between common and rare variants (albeit not significant for common variants in individual pathways after false discovery rate [FDR] correction). Proteins with surfaces enriched in disease-associated variants (from COSMIC and ClinVar) are, contrastingly, not enriched in "proliferation" cluster pathways. Pathways in the "proliferation" cluster show either depletion (for rare variants) or no patterns at all (for common variants) for population variants, when the protein core ( Figure 4C) and interface (SI Appendix Figure S4B) are concerned. This could indicate that population variants avoid the core of proliferation-related proteins. Interestingly, the "nucleotide processing" cluster does not show such a marked enrichment of variants which localise to the surface in the gnomAD  dataset, a possible indication that proteins in these pathways are more robust to disruption of their structural fold, compared to those in the "proliferation" cluster. These data show that there is clearly an interplay between variant localisation at macroscopic (functional pathways) and microscopic (structural regions) protein features.

Population and disease-associated variants localise to different protein regions
We then zoomed in to view trends in the enrichment of variants in different regions of the protein anatomy, defined using structural information, order and disorder, and the vicinity (distance ≤ 8Å) of the variant positions to post-translational modifications (see above). The following findings are highlighted: Different structural localisation of pathogenic vs population missense variants. In agreement with previous research [13,14,15,16], we find ClinVar variants to be enriched in both protein cores and interfaces, but depleted on protein surfaces ( Figure 5A-C and SI Appendix Figure S4). This reflects the potential disruption, caused by such mutations, of structurally and functionally important sites. The enrichment of ClinVar variants in structurally important sites is further demonstrated by their tendency to affect residues which are highly connected when considering network representations of protein structures (SI Appendix Section S2.1). GnomAD variants (both common and rare) and somatic non-driver variants display the opposite trend, as variants tend to localise preferentially to protein surfaces, and are therefore less likely to impact on protein structure and function than either core or interface mutations. Somatic driver variants follow trends closer to ClinVar variants, with slight, but significant, depletion on the surface, but enrichment in the core. Protein interfaces are enriched in disease-associated variants but depleted of gnomAD rare variants. GnomAD common variants appear neither significantly enriched nor depleted, however this may result from the relative sparsity and high dispersion of the data (fewer variants are shared between many individuals; see Table 1). Interestingly, COSMIC non-driver variants appear depleted in interacting interfaces. However, it becomes clear that they are actually significantly enriched when compared to simulated null distributions (see SI Appendix Figure S5), and that this enrichment is due to a small subset of proteins which harbour a large number of variants at interface regions. Genes in which these variants reside may be putative driver genes (see SI Appendix Section S4.3), as a number of known driver genes are enriched in variants in protein interface regions [14,20,65], and this phenomenon has been exploited by by Porta-Pardo and colleagues [65] to identify cancer driver genes.
Protein structural information distinguishes oncogenes and TSGs. We analysed, in greater granularity, missense variant enrichment in the COSMIC dataset. By examining a curated list of oncogenes and tumour suppressor genes (TSGs) [66], we found, in agreement with others [13,14,67], that these proteins could be TSGs as two separate groups, the GSEA results confirm a similar trend of structural localisation (SI Appendix Figure S6); moreover, it can also be seen that the disease-associated datasets (ClinVar and COSMIC) show opposite patterns of enrichment in comparison to the gnomAD data (SI Appendix Figure S6) [13,14,67]. These results go beyond similar studies, and show that by analysing the structural localisation of variants, cancer driver genes can be separated into the distinct groups of oncogenes and TSGs.
Pathogenic variants tend to localise to ordered regions within domains. Pathogenic variants are close to phosphorylation sites. When proximity to PTMs is considered (SI Appendix Figure S5), ClinVar variants appear enriched in terms of the density of missense variants close to phosphorylation sites, but not significantly so in comparison to the simulated null background; this may be again due to data sparsity, as suggested by large bootstrapped confidence intervals (SI Appendix Figure S5). COSMIC driver variants are also close to phosphorylation sites; however, COSMIC non-driver variants, which appear depleted in terms of variant density, are also significantly enriched close to phosphorylation sites in comparison to simulated null distributions (SI Appendix Figure S5). This indicates that, in agreement with a number of other studies [68,69], the disruption of phosphorylation sites may play a particularly important role in cancer. In contrast to phosphorylation sites, all datasets appear depleted of variants close to ubiquitination sites (SI Appendix Figure S5).   Figure S7. Figure   6 shows  Figure 6A). This could suggest that these domains take part in functions for which it is desirable to maintain diversity within a population; however, little is known about either domain-type [72,73]. Thereby this further highlights the bias in the number of studies targeting domains associated with disease, rather than those enriched in population variants. It also becomes apparent that the global trends in variant localisation to the core, surface and interface regions observed above are recapitulated here ( Figure 6B) for those domain-types with structural coverage. The majority of domains are enriched in gnomAD (rare and common) variants at the surface but ClinVar variants at the core. For COSMIC the patterns of localisation are more mixed, but it is clear that in comparison to the gnomAD sets, a larger proportion of domain-types are enriched in COSMIC variants at the core or interface.
These include domain-types with known cancer driver associations, such as the P53 and VHL domains [74].
Case studies on CATH architectures [75] and DNA-binding domains further highlight our observed patterns of variant enrichment (see SI Appendix Sections S2.3 and S2.2).
We also explored how the targeting of domains by drugs and small molecules mapped to the observed landscape of variant enrichment. Using DrugBank [76] data we observe that the targeting of domain-types by existing drugs is highly biased towards a small number of domain-types ( Figure 6C), such as GPCRs and tyrosine kinase, as already extensively pointed out [77]. Indeed, we observe a large number of drugs targeting proteins containing 7tm (GPCR) domains. These domains are enriched in variants from the gnomAD and COSMIC database, but are devoid of disease-associated ClinVar variants ( Figure 6A)

Proteomics and transcriptomics features associate with variant localisation
Proteins, of course, do not function in isolation but in the crowded environment of the cell [78]. Therefore, the properties of proteins in cells, including their quantities, turnover rates and thermal stability, can crucially affect the fitness of a protein to perform its function. Here we ask if variant enrichments are associated with these proteomics features. We have made use of large-scale proteomics data, including protein abundance data for various organs from PaxDb [22], proteomics surveys of protein half-lives and thermal stability [23,21], together with transcriptomics data (GTEx database [24]), to explore relationships between these features and variant localisation.
We first compared the thermal stability and abundance of proteins enriched in each class of variants. This comparison demonstrates that for proteins affected by ClinVar variants, their wild-types tend to be more stable and abundant in comparison to those proteins enriched with gnomAD variants (SI Appendix Figure S8).
Extending to the entire proteome, the protein-wise Variant Enrichment Scores of disease-associated variants displays positive correlations with protein abundance, expression, half-life and thermal stability, whereas population variants exhibit the opposite trend ( Figure 7 and SI Appendix Figures S10-S11). However, zooming into the enrichment of variants in the core of protein structures, we found that in comparison to all regions of proteins with resolved structure, proteins more enriched in ClinVar variants in the core tend to be less abundant and less stable, whereas the contrary is true for rare population variants (Figure 7). Thus our results indicate two competing trends for disease-associated variants: (i) disease-associated variants tend to localise to more abundant and stable proteins, which may suggest that these proteins are more sensitive to perturbation by variants; (ii) disease-associated variants in protein cores tend to localise to less stable proteins, which is consistent with the idea that such proteins might be more easily destabilised to a degree at which function is deleteriously impacted (see Discussion). gnomAD common data also show negative correlations with protein stability, for variants occurring at the core; this could potentially support the argument presented by Mahlich and colleagues [9] that common variants could affect molecular function more than rare variants. However, we believe this is more likely to be due to the fact that very few common variants localise to protein cores, as shown by Figure   5B, resulting in sparse statistics (i.e. the correlation is calculated over Variant Enrichment Scores which are already very low). Analogous correlations for variant enrichments at protein surfaces display opposite trends to those observed at the protein cores (see SI Appendix Figure S9). Due to the relative sparsity of variants which map to protein interfaces, we believe it is difficult to draw robust conclusions from any trends observed Figure 7: The protein-wise enrichment of missense variants in comparison to protein abundance, expression and stability. Spearman correlations for missense variant enrichment (quantified as VESs) at the full-length protein and the core with (A) protein stability in terms of melting temperature (Tm, o C) and (B) protein abundance (ppm) are depicted here. For (A), the Tm data was taken from [21], in which two measurements of Tm in the absence of any drug treatment were available; both measurements are considered, and are denoted datasets "1" and "2" in the plot. For (B), only data from selected tissue types are listed. See SI Appendix Figure S9 for the complete list. (C) Functional enrichment of proteins in KEGG pathways according to Tm. The Normalised Enrichment Score (NES) is shown on the vertical axis. KEGG Pathways are listed on the horizontal axis, and grouped to the 3 clusters as defined in Figure 3. See SI Appendix Figure S13 for a complete list of pathways depicted here.
for correlations of proteomics data with variant enrichment at protein-protein interaction sites.
One might expect that mutations would be less easily accommodated in cores of densely packed proteins, i.e. those likely to have higher thermal stability. To assess this we calculate the mean number of Cα contacts within 8Å of core residues, as a proxy for protein density. We observe weak but significant correlation between this metric and protein thermal stability (Tm measurements from two replicates reported by Franken and colleagues [21]: replicate 1 [ρ = 0.168, q = 1.464e-12] and replicate 2 [ρ = 0.185, q = 1.529e -13]). This metric of core density (see SI Appendix Section S1.4 for details) is negatively correlated with the core Variant Enrichment Score for the gnomAD common dataset. No other datasets show significant correlations with core density, however a clear trend emerges in which correlations with the core density become progressively more positive in the order of gnomAD common, gnomAD rare, COSMIC driver, COSMIC non-driver and ClinVar (see SI Appendix Figure   S12). This suggests variants may be more disruptive if they localise to a densely packed protein core.
We highlight two more observations in terms of the interplay between proteomics features. First, the various proteomics features examined here are inter-dependent. Protein abundance and thermal stability are significantly correlated with one another (see SI Appendix Section S4.6), in agreement with the work of Leuenberger and colleagues [79]. Moreover, core packing and thermal stability are correlated, albeit with a low correlation coefficient. The correlation values displayed in Figure 7 are also typically of a weak effect. Therefore, the interplay between variant enrichment and proteomics features appear multifaceted and complex. Secondly, in the analysis of protein abundance, the trends observed with variant enrichment at both full-length proteins and specifically the protein core, are less pronounced for cell line data and break down for extracellular fluids (saliva and urine, Figure 7B). The correlation is most evident for tissues containing long-lived cell-types, such as the brain, ovary and testis. Transcriptomics data (SI Appendix Figure S10) again reinforces this picture, albeit with less contrast between datasets (particularly at the protein core). This brings finer granularity into assessing the impact of variants in different organs and contexts.
We finally ask whether correlations with these proteomic and transcriptomic features could be associated with the specific functional roles of the involved proteins. For the majority of proteomic and transcriptomic features, no clear associations with the functional clusters identified in Figure 3 can be detected (see SI Appendix Figures S11-S13). An exception to this is protein thermal stability: pathways which belong to the "proliferation" cluster are clearly enriched in proteins of lower stability than the other two clusters ( Figure 7C). This suggests that proliferation-related proteins may be vulnerable to disruption by mutations which localise to their already unstable cores. Moreover, this agrees with the idea proposed above (Figure 4), that "proliferation" cluster proteins may be less robust to disruption. Taken together, these analyses provide fine molecular details into defining both the resilience towards variants, and the sensitivity towards variants, for a given protein (see Discussion). Moreover, the association of variant enrichment with features such as abundance and stability is indicative of the condition (disease/health) associated with the variants.

Rare variants are similar to common variants
Throughout the majority of analyses, the greatest segregation of data can be seen between common and diseaseassociated variants (Figures 3-5). Here we vary the criteria with which to define rarity of variants in the gnomAD set, to examine whether extremely rare variants would show characteristics akin to disease-associated variants. Figure 8 demonstrates that rare variants are more similar to common variants, both in terms of the functional pathways they affect, and in terms of the protein regions they localise to (core, surface and interface, order and disorder). If more stringent minor allele frequency (MAF) thresholds are used to define rare variants, their properties move towards those of disease-associated variants, but still remain closest to those of common variants ( Figure 8 and SI Appendix Figure S14). A visible separation between common and rare variants, especially in the pathway analysis, can only be seen if an extreme MAF cutoff (<0.00001) is used. This reinforces the boundary between population and disease-associated variants, and supports the distinction in terms of molecular characteristics associated with rare population variants and disease-associated variants. features and functional pathways) and microscopic (protein structural localisation) perspectives. Additionally, we find that the properties of rare variants remain close to those of common variants. These findings contrast with other observations [9], which suggest that common variants have more impact on molecular function than rare variants. In this study, only for a few proteomics properties, such as the thermal stability and abundance of the affected proteins, common variants appear closer in character to disease-associated variants than to rare variants. However, for these few properties, the results might not be robust due to the sparsity of the data. Rare genetic variations are abundant across individuals [80,81], with some predicted to confer a regulatory impact [82] or loss of function [83]. Alhuzimi and colleagues [10] suggest that the properties of genes enriched in rare population variants are similar to those enriched in disease-associated variants, and are thus good candidates for harbouring unknown disease associations. Instead, we show that proteins enriched in rare variants are, based on the associated functional pathways, most similar to those enriched in common variants ( Figure 8). Moreover, our results show that population variants implicate functions mainly associated with environmental response (Figure 3), in agreement with results from evolutionary studies reviewed in [84].
We have dissected the extent of variant enrichment in diverse datasets and across different protein regions ( Figure 2). Whereas protein structural information has been utilised to annotate genetic variants and prioritise impactful variants for further investigations, many of the published methods focus on 3D-structural "hotspots", prioritising variants which cluster together in three-dimensional space (e.g. in [52,53,54]). Here we have adopted an alternative approach, and quantified enrichment of missense variants without the pre-condition of spatial clustering. This provides an unbiased resource to map missense variants to protein structural data.
The calculation of variant enrichment, as an additional layer of annotation, provides a unique link between cataloguing sequence variants and understanding both their mechanistic and functional effects. This supplies invaluable information to researchers studying specific proteins or domains, or focusing on proteins involved in a particular function (e.g. DNA binding; SI Appendix Figure S3). By analysing the enrichment of variants in protein regions (core, surface, interface, disorder and order, PTM vicinity), we recapitulate trends observed by previous studies (e.g. in the comparison of oncogenes and TSGs; Figure 5D) [14,16,15,13,67], but also shed light on the debate as to whether somatic cancer variants are enriched in interface regions, by simulating null distributions of variants. These simulations show that it is essential to consider that variants from different datasets are not uniformly and randomly distributed throughout the proteome. Through density-based metrics we find that somatic cancer variants appear at first sight not enriched in protein interfaces, however by comparison to a simulated null background we do find an enrichment (SI Appendix Figure S5). A similar simulation-based approach was taken by Gress and colleagues [13], but they found no significant enrichment for COSMIC variants in interface regions. Whilst they analysed a filtered set of mutations likely to play a driver role, we investigated all somatic variants and addressed separately mutations that localise to defined driver and non-driver genes. Throughout this analysis, we have of course been limited by the number of proteins with available structural data, despite enrichment with homologous structures. We are also still limited by the structural coverage of protein interactions; although enough data exists to uncover broad trends, our analyses of protein-protein interaction sites generally lacked statistical power. Moreover, it is likely that a more detailed picture will emerge if different classes of protein interactions (e.g. transient vs permanent interactions) could be probed systematically. We envisage that the recent advances in cryo-electron microscopy [85], and the integration of structural data derived by a variety of techniques [86], will further increase the structural coverage of the protein-protein interaction network, enabling such finer-grained analyses in the future.
Our analysis of probing the associations between missense variant enrichment and proteomic features, is, Figure 9: Summary of analyses. Here we have explored patterns of structural localisation, abundance and stability for proteins enriched in disease-associated and population variants respectively. These molecular attributes determine the resilience and sensitivity of the proteins towards missense variants (see Discussion).
to the best of our knowledge, unprecedented, and has only been made possible due to the recent release of large-scale proteomics data [21,22,23,79]. We observe correlations which suggest an interplay between variant enrichment, protein abundance and thermal stability ( Figure 8). These associations are indicative of the relative resilience (tolerance) and sensitivity of proteins towards missense variants. Our results thus illustrate a set of features, based on which different parts of the proteome could be assessed for their tendency to be enriched in disease-associated or population variants ( Figure 9). Population variants tend to be enriched on protein surfaces but depleted in core and interacting sites, and tend to be found in less abundant, less stable proteins.
These features could potentially contribute to limit the functional impact of missense variants found in the general population. On the other hand, disease-associated variants localise preferentially to proteins which are highly expressed and abundant ( Figure 9). However, when selectively looking at variants mapping to protein cores, which presumably could bring about the most dramatic impact on fitness, disease-associated variants are actually associated with cores of less stable, less abundant proteins (Figure 7). Such proteins are conceivably easier to inflict damage at the core (although it is also possible that some of these proteins are not globular, but are instead more extended in conformation, see below). The combination of these molecular features could also suggest the likely selection pressure a protein could experience under different contexts. For example, certain proteins, possibly further to the left of the spectrum presented in Figure 9, could show a more extreme combination of molecular features compared to those proteins discussed here to be enriched in disease-associated variants. These proteins are likely to be highly sensitive towards variants, such that any of such variants would be lethal ( Figure 9) and be eliminated via selection; these lethal variants suffer from under-sampling in the data analysed here. On the other hand, variants, if localised to sensitive proteins, may bring benefits to cell viability; these variants could ascertain a role in driving cancers ( Figure 5).
Our work highlights a set of rules in predicting the impact of variants. For instance, one could be fairly confident that a variant can be disruptive if it localises to the core of an abundant, stable protein. This type of variant annotation could be valuable to clinicians in interpreting variants observable in any given patient.
The detailed set of features we provide could also be harnessed for more systematic improvement of variant impact prediction and interpretation. Firstly, the analysis concerning protein stability suggests it is important to consider the base-line stability of the protein in question when assessing the impact a variant could bring.
A number of algorithms have used the estimated change in protein stability upon mutation (∆∆G) as a proxy for variant impact. From their analysis of the ProTherm database, Serahijos and colleagues [87] found that mutations in more stable proteins generally led to greater destabilisation (∆∆G variation). They interpret this as suggesting that proteins which have evolved to become more stable are in a state closer to their peak stability, where any changes will result in strong destabilisation. Similarly, Pucci and Rooman [88] used temperaturedependent statistical potentials to investigate the thermal stability of the structurome (all proteins with resolved structure), and concluded that mutations in proteins which are highly thermally stable lead to a larger decrease in thermal stability, compared with those in less thermally stable proteins. We believe that our results point to the fact that, even under a scenario in which mutations in proteins with higher stability result in a greater change in stability, a mutation in an already unstable protein is more likely to result in complete/partial unfolding under physiological conditions. This is likely to be relevant to globular proteins, whereas for other types of proteins, e.g. intrinsically disordered proteins, function will be related less to the fold and the density of the protein core.
These factors should be brought into consideration when the impact of missense variants.
Secondly, we show that greater insight into the properties of variants in health and disease can be obtained by combining protein structural and functional pathway information. For example, it can be clearly seen that population variants are most enriched on the surface of proteins which take part in pathways we have defined as belonging to the "proliferation" cluster ( Figure 4B). Moreover, pathways belonging to this cluster also appear to be enriched in proteins with less thermal stability ( Figure 7C), suggesting a possible mechanistic basis underlying the localisation of variants (variants tend to localise to the surface and avoid disrupting the core of these already unstable proteins). This indicates that the combined use of such features may aid in both improving the prediction of variant impact, and in assessing the underlying molecular mechanisms.
Thirdly, our analysis highlights the tissue specificity of variant impact, in terms of the stability and abundance of the altered protein. Our association analysis (Figure 7) of variant enrichment with proteomics features complements a body of research which concludes that the rate of protein evolution correlates negatively with protein expression and abundance [89], the extent of which has been found to be tissue-specific; those tissues with a high neuron density demonstrating the highest anti-correlation [90]. Consistent with this, we found the largest negative correlation for the protein-wise enrichment of rare variants, from the gnomAD dataset, with protein abundance in the brain, and, interestingly also in the ovary and testis, which both harbour longlived germline progenitor cells ( Figure 7B; SI Appendix Figure S10). Purportedly the lifespan of long-lived cells renders them more sensitive to (and therefore necessitate protective strategies [91] against) the toxicity of misfolded proteins. Our analysis highlights the importance of considering the underlying context, specific to the affected organ alongside with the abundance and stability levels of the affected proteins, in assessing the potential impact a missense variant could pose.
Fourthly, the wealth of data presented here could have implications in the development of therapeutic strategies. Rare population variants are known to be abundant in known drug targets, potentially modulating disease risk and drug response [92]. Here we envisage that our domain-centric landscape of variant enrichment ( Figure   6), which includes the mapping of targeted drugs, besides providing another feature for the characterisation of variants, will allow for more informed decisions in optimising therapeutic strategies. For example, targets with few population variants could be selected, to minimise differential drug response due to genetic differences between individuals. Interestingly it has recently been shown that genetic variants in such domains (GPCRs), identified in the general population, may be associated with differential drug response between individuals [93]. By viewing variant enrichment and drug availability together, such a domain-centric landscape of variant localisation has implications useful for both understanding variant impact and motivating therapeutic design.
In conclusion, our work highlights the complex interplay between different factors which may determine variant pathogenicity, from atomistic protein structural features ("microscopic") to large-scale ("macroscopic") functional pathways and proteomics features. We believe that these insights will prove important in the prioritisation of likely disease-associated variants, and the prediction of variants which drive disease phenotypes. Moreover, the ZoomVar database, which we have made available at http://fraternalilab.kcl.ac.uk/ZoomVar, will facilitate users in the structural analysis of variants. A script is downloadable from the site to allow largescale programmatic access to the webpage for the structural annotation of user-input variant data; we also provide in the webpage precomputed data underlying all analyses presented here. Further advancement in the structural coverage of the proteome, and the exploitation of high throughput proteomics technologies, such as those analysed here [23,79], will ultimately offer a finer-grained picture of features which segregate variants in

Acknowledgement
We thank Dr. Jens Kleinjung for his critical reading of and comments on this manuscript, and members of the Fraternali group for valuable discussion. This research was supported by the British Heart Foundation (RE/13/2/30182 to FF and AL), Croucher Foundation Hong Kong (to JCN) and the Medical Research Council (MR/L01257X/1 to FF).