Burden analysis of missense variants in 1,330 disease-associated genes on 3D provides insights into the mutation effects

Interpretation of the colossal number of genetic variants identified from sequencing applications is one of the major bottlenecks in clinical genetics, with the inference of the effect of amino acid-substituting missense variants on protein structure and function being especially challenging. Here we evaluated the burden of amino acids affected in pathogenic variants (n=32,923) compared to the variants (n=164,915) from the general population in 1,330 disease-associated genes on forty protein features using over 14,000 experimentally-solved 3D structures. By analyzing the whole gene/variant set jointly, we identified 18 features associated with 3D mutational hotspots that are generally important for protein fitness and stability. Individual analyses performed for twenty-four protein functional classes further revealed 240 characteristics of mutational hotspots in total, including new associations recapitulating the sheer diversity across proteins essential structural regions. We demonstrated that the function-specific features of variants correspond to the readouts of mutagenesis experiments and positively correlate with clinically-interpreted pathogenic and benign missense variants. Finally, we made our results available through a web server to foster accessibility and downstream research. Our findings represent a crucial step towards translational genetics, from highlighting the impact of mutations on protein structure to rationalizing the pathogenicity of variants in terms of the perturbed molecular mechanisms.


Background
Recent advances in technologies have ushered in an era of rapid and cost-effective genome and exome sequencing. Genetic screening is now frequently applied in clinical practice, especially for the diagnosis of rare diseases and cancer, producing a plethora of benign and disease-associated genetic variants1,2. A large fraction of these clinically-derived variants has been captured in databases such as OMIM3, HGMD4, and ClinVar5. Simultaneously, millions of likely-benign or mildly disease-influencing variants from the general population have been cataloged and are accessible through the Exome Aggregation Consortium (ExAC)6 and in the genome Aggregation Database (gnomAD)7. Missense variation causes an amino acid substitution upon a single nucleotide variation in the protein-coding region of the genome, which can cause a Mendelian phenotype. However, missense variants can be either benign or disease-causing, and both types coexist in almost every human disease gene6. This underscores that functional interpretation of missense variants is particularly difficult, and some specific amino acid residues and their positions are essential for protein function and thus are vulnerable to substitutions. Exome sequencing studies can identify such protein-altering missense variants; however, the missing knowledge about the consequence of missense variants on protein structure and function severely limits the interpretation of clinical genetic screening, and our understanding of the implication of missense variants in disease phenotype.
Current algorithms for variant interpretation incorporate numerous annotations to assess the pathogenicity of missense variants, including the conservation of sequences through evolution8-10, human population variation information11, and protein structural information12,13. With the latest explosion of genetic variant data, machine learning algorithms can learn and predict variant pathogenicity with reasonable accuracy14,15. However, variant pathogenicity predictions, combined from multiple in silico tools, represent only one piece of supporting evidence (referred to as PP316) in the variant interpretation guidelines as proposed by American College of Medical Genetics and Genomics (ACMG). This is primarily because of the similar underlying biases in these algorithms' prediction outputs16, resulting from the use of similar variant sets. Further, the prediction scores do not provide biologically interpretable annotations that can be translated into possible variant-induced changes of underlying protein function.
Studying the effect of disease-associated missense variation on the protein structure and function represents a growing research area. The damaging consequence of amino acid substitution has been found to be associated with the structural property and localization of the amino acid on protein structure17-19. Cluster analysis of patient variants and structural conservation analysis have shown that functionally essential amino acids and regions are not randomly distributed20- 22. In discrete studies, missense variants have been found to disrupt protein function by perturbing protein-protein interactions23, modifying functional residues24, destabilizing the entire protein fold25,26, or by substituting amino acid residues in protein cores27,28 and in protein-protein interfaces29-31. However, characterization of missense variants using a multiple line of evidence from protein structure, including physicochemical, functional, and 3D-structure based features, has not been performed using a common, large-scale dataset and single, unified pipeline. Such a thorough characterization of disease-causing and benign missense variants on 3D can serve as a valuable resource for translational genetics, providing many structural and functional insights into the impact of an amino acid substitution. Moreover, functionally essential protein regions and associated features are shared among proteins performing similar molecular function as well as can be diverse across different protein classes. And, there is no study available that provides a comparative overview of protein features associated with pathogenic and benign missense variants for different protein functional classes. Developing an inclusive annotation of structural, chemical, and functional features for missense variants for different protein classes thus can be a powerful resource for hypothesizing function-dependent variant pathogenicity.
In this study, we investigate the features of amino acid residues altered by missense variants on protein structure at scale. First, we built an automated pipeline to map variants from genomic location to protein 3D structure location. Using the framework, we mapped >500,000 missense variants in 5,850 genes onto >43,000 protein 3D structures from the protein data bank (PDB)32. The variant set was comprised of population missense variants from the gnomAD database7 and pathogenic missense variants from the ClinVar5 (pathogenic and likely-pathogenic) and HGMD4 (disease mutation) databases. Second, we annotated the amino acid positions of these genes with a set of seven protein features having forty feature subtypes. Third, we evaluated the burden of population and pathogenic missense variants in 1,330 diseaseassociated genes on these forty features. To do so relative to protein function, we carried out separate analyses with variants in genes grouped into 24 protein functional classes33. We then validated the effectiveness of identified pathogenic and population missense variant-associated features on independent variant sets, ascertained clinically and functionally.

Protein class-specific analysis identifies function-dependent vulnerable features
Proteins are heterogenous in function and structure. We next explored if the features of pathogenic and population variants obtained from the full DAGS1330 dataset were consistent across different protein classes, having heterogenous molecular functions (see Methods for protein class ascertainment, Supplementary Table 2 for protein class-specific genes and variant counts, Supplementary Fig. 2 -8 for protein class-specific Fisher's Exact test outputs, and Table   2 for the summary of the results). Through the protein functional class-specific analyses, we aim to capture both shared and unique function-specific features of pathogenic and population variants on protein structure.
We observed that pathogenic variants are enriched in β-strand/sheet residues for 20 out was consistent for the majority of the protein classes (Supplementary Fig. 4). Amino acid residues in the interior of the 3D structure with < 25% relative accessible surface area (RSA) were enriched with pathogenic variants (Supplementary Fig. 4a-b), whereas the residues with ≥ 25% RSA were enriched with population variants (Supplementary Fig. 4c-e). Among physiochemical property-based groups of amino acids, pathogenic variants were observed significantly more frequent on aromatic (tryptophan, tyrosine, phenylalanine) amino acids in 23 out of 24 protein classes, as well as on special amino acids (cysteine, proline, glycine) in all 24 protein classes (Supplementary Fig. 5c and 5h). Specifically, the cell adhesion molecules, transcription factors, phosphatases, and signaling molecule proteins carried greater than 2-fold enrichments of pathogenic variants changing an aromatic amino acid. At positions of special amino acids, calcium-binding and extracellular matrix proteins showed enrichments of pathogenic variants with OR > 3.0 (p < 1.0e-100).
The inter-chain disulfide bond had the strongest enrichment in the joint analysis (OR = 21.75, p < 1.0e-100, Fig. 2). While such an association has been demonstrated 27, our protein class-specific analysis revealed that the enrichment was primarily directed by pathogenic variants from three protein classes alone (Supplementary Fig. 6a). In contrast, the non-bonded van der Waals interaction sites were enriched with pathogenic variants from all protein classes except phosphatases and nucleic acid binding proteins (Supplementary Fig. 6d). The pathogenic variants in kinases, nucleic acid binding and cytoskeletal proteins were found more frequent on salt-bridge and hydrogen bond sites ( Supplementary Fig. 6c-d), which are protein class-specific features as these sites were found depleted of pathogenic variants for the full set ( Fig. 2).
Heterogeneous patterns of association between missense variants and post-translational modification (PTM) types were observed across different protein classes (Supplementary Fig.   7). The amino acid residues near phosphorylation sites were found to be enriched with pathogenic variants in joint analysis (Fig. 2); however, a differential pattern was observed for hydrolases (OR = 0.80, p = 7.29e-11) and cell adhesion molecules (OR = 0.58, p = 3.81e-15) with significant depletion of pathogenic variants around phosphorylation sites (Supplementary Fig. 7d). The cell junction proteins pathogenic variants showed an enrichment around methylation sites (OR = 9.10, p = 9.29e-11, Supplementary Fig. 7b), whereas there was no significant association observed for the full dataset. For nine out of twenty-four protein classes, we observed significant enrichment of population variants around ubiquitination sites; however, the trend was different for transcription factors (OR = 1.44, p = 1.77e-06) and extracellular matrix proteins (OR = 2.05, p = 2.51e-05), with an enrichment of pathogenic variants around ubiquitination sites (Supplementary Fig. 7f).
majority of the protein classes, twenty-three and twenty-two, respectively, out of twenty-four classes (Supplementary Fig. 8). The pathogenic variants of transporters, kinases, transcription factors, and nucleic acid binding proteins were observed to be associated with functional sites, binding regions and sequence motifs. The protein class-specific analysis also showed that only the transfer/carrier protein (OR = 8.90, p = 3.68e-09) and enzyme modulator (OR = 3.09, p = 6.20e-10) carry enrichment of pathogenic variants on protein regions involved in molecular processing and signaling (Supplementary Fig. 8d); however, we observed no association with this feature collectively for the full variant set (Fig. 2).

Validation of 3D features associated to pathogenic and population missense variants on an independent set of ClinVar variants
Having characterized the pathogenic and population missense variants using 3D features (reported in Table 2 and Fig. 2), we then carried out a comparison with an independent set of variants from ClinVar (see Methods for details) to validate the potential of our identified features for clinical interpretation of missense variants.
In order to quantify how deleterious an amino acid substitution is, we derived an index per residue based on the difference in the pathogenic and population variant-associated 3D features of the reference (altered) amino acid. The score is referred to as pathogenic 3D feature index (P3DFi) (see P3DFi computation in Methods). We expect that the amino acid residues located in vulnerable 3D sites will have a higher number of pathogenic variant-associated features (P3DFi > 0). Conversely, amino acids We then binned the variants based on their P3DFi values (from less than -2 to greater than +2) and, as expected, the pathogenic and benign variants showed contrasting distributions across different P3DFiDAGS1330 values (Fig. 3). Overall, the pathogenic variants were 3.6-fold enriched (p < 1.0e-300) in positive P3DFiDAGS1330 values. Note that the most positive (P3DFi > 2) and negative (P3DFi < -2) score bins represent the 3D sites with highest and lowest difference between pathogenic-and population-associated 3D features (identified in this study). Remarkably, 82.7% (972 out of 1,175) of all variants in the highest score bin (P3DFiDAGS1330 > 2, Fig. 3   Our statistical analyses confirmed established association signatures between missense variants and protein features that have been found on a smaller scale and also illuminated novel protein class-specific signatures. For example, we observed an enrichment of pathogenic variants in β-strands/sheets conformation (Fig. 2) for all genes, thereby confirming the previous result that β-strands/sheets are more variant-intolerant and vulnerable to stabilization compared to helices upon sequence changes41. However, we found novel protein class-specific associations, such as pathogenic variants being significantly enriched on α-helices for five protein classes ( Supplementary Fig. 3d). In cell junction proteins (OR = 10.54), transfer/carrier proteins (OR = 3.58), and proteases (OR = 2.71), pathogenic variants were more likely to mutate π-helix residues. The π-helical regions have been shown to undergo selection and often provide a functional advantage to proteins42,43; therefore, pathogenesis upon altering π-helices may be associated with perturbation of the protein functions. The protein core (Relative Solvent Area, RSA < 5%) is found to be a uniform descriptor of variant pathogenicity for all protein classes ( Supplementary Fig. 4a), which is in line with established literature as well27. At the same time, for twenty-one out of twenty-four protein classes, distributions of RSA values for pathogenic missense variants showed higher maximum RSA values and accumulate a higher number of outliers (Supplementary Fig. 10). This result indicates the presence of pathogenic variants on the protein-protein surface31 that may perturb essential interaction sites by altering exposed residues.
On protein structure, the special amino acids cysteine (Cys, OR = 3.84, p < 1.00e-100) and glycine (Gly, OR = 1.76, p < 2.19e-154) were found to be predominantly mutated by pathogenic missense variation (Supplementary Fig. 11). The enrichment for Cys residues correlates well with the cogent association between variant pathogenicity and perturbation of covalent disulfide bond (a structure-based pathogenic variant-associated feature), as a Cys mutation is likely to impair such a bond27. However, there is a protein class-specific bias in the available disulfide bond annotations35 (Supplementary Fig. 6a). Specifically, 93.8% (3,642/3,879) of the disulfide bonds affected by a pathogenic variant were found in genes that encode for proteins transducing signals between cells. The other special amino acid, Gly, is the smallest amino acid having no side chain and is usually flexible. Greater than 67% of the Gly residues in our variant set is pliable coils and their substitutions by a larger amino acid is likely to have an impact on structure due to steric clashes23. All aromatic residues were found enriched for pathogenic variants (Fig. 2), with tryptophan (Trp) having the strongest association (OR = 3.27, p = 2.88e-126). Trp is the largest amino acid and it has also been shown that Trp mutations lead an increase in free energy of protein folding, making them more likely to destabilize the structure of protein27. Arginine (Arg) was also found to have a burden of pathogenic variants31 ( Supplementary Fig. 11), while the other positively-charged amino acids (lysine and histidine) were found to be associated with population variants.
Pathogenic variants were, in general, found to spatially cluster near different posttranslational modification (PTM) sites44; however, we identified several protein class-specific novel signatures. For example, the pathogenic variants in transcription factors, extracellular matrix proteins, and structural proteins were found to be enriched near ubiquitination sites ( Supplementary Fig. 8f), which were found depleted of pathogenic variants by joint analysis (Fig.   2). Kinases alone carried 15% and 56% of the pathogenic variants near phosphorylation and SUMOylation sites. Expanding the six UniProt-based functional features into twenty-five constituent features (details in Supplementary Note), we identified twenty-one pathogenic variant-associated features. While some of these features, such as modular domains45, motifs, and binding sites44, are known to be associated with pathogenic variants, our study revealed new associations. A burden of pathogenic variants (OR = 3.9, p < 1.0e-100, Supplementary Fig. 12) on residues annotated by at least one functional feature versus no feature was also notable, showing putative association between variant pathogenicity and protein function.
It is important to note that we characterized the protein positions of missense variants that were mappable onto at least one protein 3D structure available in PDB32, which includes structures for only one-third of all human proteins. Many of these structures have partial coverage of the protein. Therefore, our study outcomes are not exclusive of the influence that may come from accounting for the structured part of the full protein sequence. Also, we could transfer a higher proportion of pathogenic variants (~60%) compared to the population variants (~33%) onto protein structure. Out of the 1,330 disease-associated genes (>80%, n = 1,077 annotated in OMIM), >55% of the genes had at least one structure with >50% of the sequence covered in the structure, and >75% of the ClinVar and HGMD pathogenic variants could be mapped onto the structure for those genes. This observation is consistent with the positive correlation observed between the number of known variants in proteins and the number of proteins being molecularly solved and reported in the literature46.
While performing the protein class-specific analyses, we also observed issues due to having smaller variant sets that could be mapped on the 3D structures. For one (lyase) and four (chaperons, cytoskeletal protein, defense/immunity proteins and membrane-traffic proteins) protein classes, we found three or fewer pathogenic and population variant-associated features ( Table 2), respectively, after applying the stringent significance cut-off. In addition, the power of the P3DFi is affected by unavailability of 3D structure information of a reference amino acid, as, in that case, the quantitative score is computed based on the physicochemical and functional features only. In future work, we plan to include high-fidelity homology models and computational models of protein structures to scale up the generated resource to the full human proteome.
Currently, a variety of in silico tools aid in the interpretation of sequence variants, primarily providing a prediction score generated using different algorithms, training cohorts, and features16.
The three most commonly used missense variant interpretation tools16 include PolyPhen28, SIFT10 and Mutation Taster47. While these tools can stratify pathogenic variants from benign with a reasonable accuracy (65 -80%)48, progress has not been made to provide users with biologically-relevant features of the location and context of missense variants within 3D structures.
This is a problem in particular for molecular scientists who need a pre-selection of variants to study protein function. In this work, besides generating a quantitative score to rationalize missense variant pathogenicity, we also provide the relevant features to support the candidate variant selection by molecular biologists. Moreover, the gnomAD cohort has been leveraged for the first time in our work to assess genetic burden on protein features. The full data overview is given in Table 1. The pipeline for mapping variants onto the protein 3D structures is shown in Fig. 1.

Protein class annotation
The PANTHER (Protein ANalysis THrough Evolutionary Relationships) database33 was curated to classify genes by their molecular function, biological process, and protein class. The PANTHER Protein Class ontology was adapted from the PANTHER/X molecular function ontology.
We obtained the protein class annotations for 1,330 genes in the DAGS1330 dataset with population (gnomAD) and pathogenic (ClinVar and HGMD) variants mapped on the protein 3D structures.
Initially, the genes were classified into 178 groups based on similar molecular function, 234 groups based on their involvement in similar biological processes, and into 206 different protein class groups. Finally, guided by the relationships defined among different protein classes (source: http://data.pantherdb.org/PANTHER13.1/ontology/Protein_Class_13.0), the genes were grouped into 24 major protein classes. After automatic annotation, we observed that 624 genes were not assigned to any specific protein class and so as to any protein major class. We then downloaded the Ensembl family description of all human genes (version 93) using BioMart (https://www.ensembl.org/biomart). After that, we manually annotated the 624 genes into those 24 major classes based on (i) the Ensemble family description, (ii) molecular function/biological process annotation previously collected from PANTHER or (iii) molecular function/biological process annotation available in UniProt as defined by the Gene Ontology consortium. Supplementary Table 2 lists the 24 major classes and the number of genes, the number of protein 3D structures, and the count of population/pathogenic variants mapped on protein structures for each protein class. Note that each gene may not be assigned into a unique protein class group; therefore, the genes in each group are not mutually exclusive.

Feature set mining and annotation
We annotated the amino acid residues of 1,330 genes with seven different features comprising of 40 feature subtypes. In the following, we introduce the features and we discuss how we curated these features and annotated the amino acid positions in the Supplementary Note.

Statistical analysis
For each of the 40 subtypes of seven protein features, we carried out the two-tailed Fisher Exact test on the 2x2 contingency matrix, populated with the total number of pathogenic and population variants of one specific protein feature subtype (helix within 3-class secondary structure) and the rest within a protein structure feature-group ('not helix' equals to 'β-strand/sheet and coil' within 3-class secondary structure).
From this test, we obtained an estimate of fold enrichment, referred to as odds ratio (OR), along with the 95% confidence interval (CI) that shows whether the population/patient variants are relatively enriched or depleted in residues of any specific feature type on 3D structure. A value of OR equal to 1 indicates that there is no association between a specific type of variant with a particular feature, whereas a value of OR greater than 1 (or less than 1) indicates that the pathogenic variants are enriched (or depleted) in a particular type of feature subtype. Further, we calculated p-values showing the significance of association that were compared to the pcut-off. This analysis was also performed taking variants from all proteins and variants from the 24 different protein classes separately, and the output gives the features that are most likely to be associated to the pathogenic and population variants in general and also separately in different protein types. We defined pcut-off according to the Bonferroni correction for multiple testing in statistical analysis, thus pcut-off was set to 5.00e-05 (0.05/(40 * 25)) as we have 40 features on which we performed the association analysis taking all proteins and separately for 24 protein classes.