Insights into protein structural, physicochemical, and functional consequences of missense variants in 1,330 disease-associated human genes

Inference of the structural and functional consequences of amino acid-altering missense variants is challenging and not yet scalable. Clinical and research applications of the colossal number of identified missense variants is thus limited. Here we describe the aggregation and analysis of large-scale genomic variation and structural biology data for 1,330 disease-associated genes. Comparing the burden of 40 structural, physicochemical, and functional protein features of altered amino acids with 3-dimensional coordinates, we found 18 and 14 features that are associated with pathogenic and population missense variants, respectively. Separate analyses of variants from 24 protein functional classes revealed novel function-dependent vulnerable features. We then devised a quantitative spectrum, identifying variants with higher pathogenic variant-associated features. Finally, we developed a web resource (MISCAST; http://miscast.broadinstitute.org/) for interactive analysis of variants on linear and tertiary protein structures. The biological impact of missense variants available through the webtool will assist researchers in hypothesizing variant pathogenicity and disease trajectories.

Recent advances in technologies have ushered in an era of rapid and cost-effective genome and exome sequencing. Genetic screening is now frequently applied in clinical practice, especially for the diagnosis of rare diseases and cancer, producing a plethora of benign and disease-associated genetic variants 1,2 . A large fraction of these clinically-derived variants has been captured in databases such as OMIM 3 , HGMD 4 , and ClinVar 5 . Simultaneously, millions of likely-benign or mildly disease-influencing variants from the general population have been cataloged and are accessible through the Exome Aggregation Consortium (ExAC) 6 and in the genome Aggregation Database (gnomAD) 7 . Missense variation causes an amino acid substitution upon a single nucleotide variation in the protein-coding region of the genome.
Missense variants can be either benign or disease-causing, and both types coexist in almost every human disease gene 6 . This underscores that functional interpretation of missense variants is particularly difficult, and some specific amino acid residues and their positions are essential for protein function and thus are vulnerable to substitutions. Exome sequencing studies can identify such protein-altering missense variants; however, the missing knowledge about the consequence of missense variants on protein function severely limits the interpretation of clinical genetic screening, and our understanding of how a missense variant is implicated in a disease phenotype.
Current algorithms for variant interpretation incorporate numerous annotations to assess the pathogenicity of missense variants, including the conservation of sequences through evolution [8][9][10] , human population variation information 11 , and protein structural information 12,13 .
With the latest explosion of genetic variant data, machine learning algorithms can learn and predict variant pathogenicity with reasonable accuracy 14,15 . However, variant pathogenicity predictions, combined from multiple in silico tools, represent only one piece of supporting evidence (referred to as PP3 16

) in the variant interpretation guidelines as proposed by American
College of Medical Genetics and Genomics (ACMG). This is primarily because of the similar underlying biases in these algorithms' prediction outputs 16 , resulting from the use of similar variant sets. Further, the prediction scores do not provide biologically interpretable annotations that can be translated into possible variant-induced changes of underlying protein function.
Studying the effect of disease-associated missense variation on the protein structure and function represents a growing research area. A damaging consequence of amino acid substitution has been found to be associated with the structural property and localization of the amino acid on protein structure [17][18][19] . Cluster analyses of patient variants and structural conservation analyses have shown that functionally essential amino acids and regions are not randomly distributed [20][21][22] . In separate research studies, missense variants were found to disrupt protein function by perturbing protein-protein interactions 23 , modifying functional residues 24 , destabilizing the entire protein fold 25,26 , or by substituting amino acid residues in protein cores 27,28 and in protein-protein interfaces [29][30][31] . However, characterization of variants using a comprehensive set of protein features including physicochemical, functional, and 3D-structural features, for example, secondary structure types and protein-protein interactions, has not been performed using a common dataset and single, unified pipeline. Such a thorough characterization of disease-causing and benign missense variants would serve as a valuable resource for translational genetics, providing many structural and functional insights into the impact of an amino acid substitution.
Functionally essential protein regions are shared among proteins performing similar molecular function. There is no study that provides a quantitative comparison of pathogenic and benign missense-variant-associated protein features across different protein functional classes.
The current ACMG guidelines list the presence of variants in a mutational hot spot in less wellcharacterized protein regions or in a critical functional domain that is depleted in population variants as a moderate criterion (referred to as PM1) 16 for variant pathogenicity assessment.
However, domain information may not be the only protein-based feature that can be used as a determinant of variant pathogenicity. Developing an inclusive annotation of structural, chemical, and functional features for missense variants would be a powerful resource for the ACMG community.
In this study, we investigate the features of amino acid residues altered by missense variants on protein structure at scale. First, we built an automated pipeline to map variants from genomic location to protein 3D structure location. Using the framework, we mapped >500,000 missense variants in 5,850 genes onto >43,000 protein 3D structures from the protein data bank (PDB) 32 . The variant set was comprised of population missense variants from the gnomAD database 7 and pathogenic missense variants from the ClinVar 5 (pathogenic and likelypathogenic) and HGMD 4 (disease mutation) databases. Second, we annotated the amino acid positions of these genes with a comprehensive set of seven protein features with forty feature subtypes. Third, we evaluated the burden of population and pathogenic missense variants in 1,330 disease-associated genes on these forty features. To do so relative to protein function, we carried out separate analyses with variants in genes grouped into 24 protein classes based on their molecular function 33

Description of dataset
In this study, we characterized missense variants with 3D coordinates in at least one protein structure using seven protein features of amino acid residues: (1) 3-class secondary structure, (2) 8-class secondary structure, (3) residue exposure level, (4) physicochemical property, (5) protein-protein interaction, (6) post-translational modification, and (7) Table   1), which were grouped into 24 protein classes (Supplementary Table 2, protein class ascertainment in Methods) to perform protein class-specific analysis. This missense variant and gene dataset are referred to as Disease-Associated Genes with Structure (DAGS1330) in this study. The workflow of this study is briefly presented in Fig. 1 and the overview of the dataset is shown in Table 1.

Feature patterns of population and pathogenic missense variants on protein 3D
structure across all protein classes.
Our understanding of which protein features are more likely to be disease relevant is still in its infancy. To identify structural, physiochemical, and functional protein features that are associated with pathogenic and population missense variants, we performed burden analysis with seven features including in total 40 subtypes (ascertainment and definition of features in Methods and Supplementary Note) across DAGS1330 genes. Fisher's Exact test was performed for individual feature subtypes of the population and pathogenic variants, quantifying the fold-enrichment and significance of the association. Notably, out of the seven features, physiochemical properties of amino acid and UniProt-based functional features could be curated from the protein sequence only. However, our dataset comprised of the missense variants that could be mapped on the protein 3D structure only, thus, the identified features depict the characteristics of variants on protein structure. All seven protein features showed a deviating distribution of population and pathogenic missense variants for at least one feature subtype.
The direction of the association was heterogeneous across feature subtypes (Fig. 2). In total, 18 out of 40 (45%) protein feature subtypes were significantly enriched for pathogenic variants. The UniProt-based functional protein features had the highest number of feature subtypes (5/6) that were enriched for pathogenic variants (Fig. 2). In contrast, 14 Fig. 1 Table 2 for the summary of the results). Through the protein functional class-specific analyses, we aim to capture both shared and unique function-specific features of pathogenic and population variants on protein structure.
We observed that pathogenic variants are enriched in β-strand/sheet residues for 20 out 24 of protein classes (Supplementary Fig. 2a and 3b). This observation is consistent with the joint analysis with full DAGS1330 dataset (Fig. 2). In contrast, cell junction proteins displayed a deviating characteristic with enrichment of population variants in β-strand/sheet (OR = 0.37, p = 1.40e-13, Supplementary Fig. 2a), specifically, β-sheet (OR = 0.35, p = 3.91e-14, in Supplementary Fig. 3b). Moreover, pathogenic variants of cell adhesion molecules were significantly frequent in both stable β-sheet residues (OR = 1.72, p = 7.52e-25, Supplementary   Fig. 3b) and flexible coils, specifically, random loops (OR = 1.28, p = 3.67e-06, Supplementary   Fig. 3f), and bends (OR = 1.35, p = 2.19e-05, Supplementary Fig. 3g). The enrichment of pathogenic variants followed a negative gradient of exposure levels of amino acids (Fig. 2) and was consistent for the majority of the protein classes (Supplementary Fig. 4). Amino acid residues in the interior of the 3D structure with < 25% relative accessible surface area (RSA) were enriched with pathogenic variants (Supplementary Fig. 4a-b), whereas the residues with ≥ 25% RSA were enriched with population variants (Supplementary Fig. 4c-e). Among physiochemical property-based groups of amino acids, pathogenic variants were observed significantly more frequent on aromatic (tryptophan, tyrosine, phenylalanine) amino acids in 23 out of 24 protein classes, as well as on special amino acids (cysteine, proline, glycine) in all 24 protein classes (Supplementary Fig. 5c and 5h). Specifically, the cell adhesion molecules, transcription factors, phosphatases, and signaling molecule proteins carried greater than 2-fold enrichments of pathogenic variants changing an aromatic amino acid. At positions of special amino acids, calcium-binding and extracellular matrix proteins showed enrichments of pathogenic variants with OR > 3.0 (p < 1.0e-100).
The inter-chain disulfide bond had the strongest enrichment in the joint analysis (OR = 21.75, p < 1.0e-100, Fig. 2). While such an association has been demonstrated 27 , our protein class-specific analysis revealed that the enrichment was primarily directed by pathogenic variants from three protein classes alone (Supplementary Fig. 6a). In contrast, the non-bonded van der Waals interaction sites were enriched with pathogenic variants from all protein classes except phosphatases and nucleic acid binding proteins (Supplementary Fig. 6d). The pathogenic variants in kinases, nucleic acid binding and cytoskeletal proteins were found more frequent on salt-bridge and hydrogen bond sites (Supplementary Fig. 6c-d), which are protein class-specific features as these sites were found depleted of pathogenic variants for the full set ( Fig. 2). Heterogeneous patterns of association between missense variants and posttranslational modification (PTM) types were observed across different protein classes ( Supplementary Fig. 7). The amino acid residues near phosphorylation sites were found to be enriched with pathogenic variants in joint analysis (Fig. 2); however, a differential pattern was observed for hydrolases (OR = 0.80, p = 7.29e-11) and cell adhesion molecules (OR = 0.58, p = 3.81e-15) with significant depletion of pathogenic variants around phosphorylation sites ( Supplementary Fig. 7d). The cell junction proteins pathogenic variants showed an enrichment around methylation sites (OR = 9.10, p = 9.29e-11, Supplementary Fig. 7b), whereas there was no significant association observed for the full dataset. For nine out of twenty-four protein classes, we observed significant enrichment of population variants around ubiquitination sites; however, the trend was different for transcription factors (OR = 1.44, p = 1.77e-06) and extracellular matrix proteins (OR = 2.05, p = 2.51e-05), with an enrichment of pathogenic variants around ubiquitination sites (Supplementary Fig. 7f). Modular domains and modified residues were identified as shared functional features for a majority of the protein classes, twenty-three and twenty-two, respectively, out of twenty-four classes (Supplementary Fig. 8).
The pathogenic variants of transporters, kinases, transcription factors, and nucleic acid binding proteins were observed to be associated with functional sites, binding regions and sequence motifs. The protein class-specific analysis also showed that only the transfer/carrier protein (OR = 8.90, p = 3.68e-09) and enzyme modulator (OR = 3.09, p = 6.20e-10) carry enrichment of pathogenic variants on protein regions involved in molecular processing and signaling ( Supplementary Fig. 8d); however, we observed no association with this feature collectively for the full variant set (Fig. 2).

Validation of MISCAST annotation
We developed an online resource for exploration of the generated data named Missense variant to proteIn StruCture Analysis web SuiTe (MISCAST; http://miscast.broadinstitute.org/).
MISCAST includes interactive visualization of features and variants on linear and 3D protein structure (Fig. 3a), amino-acid-wise feature annotations (Fig. 3b), and protein class-specific features of pathogenic and population variants (Fig. 3d). In addition, through MISCAST, the users can obtain amino-acid-wise variant summary reports that can assist in rationalizing the pathogenicity of variants according to features of the reference amino acid. We expect that show the utility of our newly generated protein feature annotation, we evaluated readouts from high-throughput mutagenesis for BRCA1 34 and PTEN 35 (details in Supplementary Note).
Missense variant to proteIn StruCture Analysis web SuiTe (MISCAST) features to hypothesize the molecular consequence of the Tyr262 substitution.  Fig. 3).
Our statistical analyses confirmed established association signatures between missense variants and protein features that have been found on a smaller scale and also illuminated novel protein class-specific signatures. For example, we observed an enrichment of pathogenic variants in β-strands/sheets conformation (Fig. 2) for all genes, thereby confirming the previous result that β-strands/sheets are more variant-intolerant and vulnerable to stabilization compared to helices upon sequence changes 39 . However, we found novel protein class-specific associations, such as pathogenic variants being significantly enriched on α-helices for five protein classes (Supplementary Fig. 3d). In cell junction proteins (OR = 10.54), transfer/carrier proteins (OR = 3.58), and proteases (OR = 2.71), pathogenic variants were more likely to mutate π-helix residues. The π-helical regions have been shown to undergo selection and often provide a functional advantage to proteins 40,41 ; therefore, pathogenesis upon altering π-helices may be associated with perturbation of the protein functions. The protein core (Relative Solvent Area, RSA < 5%) is found to be a uniform descriptor of variant pathogenicity for all protein classes (Supplementary Fig. 4a), which is in line with established literature as well 27 . At the same time, for twenty-one out of twenty-four protein classes, distributions of RSA values for pathogenic missense variants showed higher maximum RSA values and accumulate a higher number of outliers (Supplementary Fig. 10). This result indicates the presence of pathogenic variants on the protein-protein surface 31 that may perturb essential interaction sites by altering exposed residues.
On protein structure, the special amino acids cysteine (Cys, OR = 3.84, p < 1.00e-100) and glycine (Gly, OR = 1.76, p < 2.19e-154) were found to be predominantly mutated by pathogenic missense variation (Supplementary Fig. 11). The enrichment for Cys residues correlates well with the cogent association between variant pathogenicity and perturbation of covalent disulfide bond (a structure-based pathogenic-variant-associated feature), as a Cys mutation is likely to impair such a bond 27 . However, there is a protein class-specific bias in the available disulfide bond annotations 42 (Supplementary Fig. 6a). Specifically, 93.8% (3,642/3,879) of the disulfide bonds affected by a pathogenic variant were found in genes that encode for proteins transducing signals between cells. The other special amino acid, Gly, is the smallest amino acid having no side chain and is usually flexible. Greater than 67% of the Gly residues in our variant set is pliable coils and their substitutions by a larger amino acid is likely to have an impact on structure due to steric clashes 23 . All aromatic residues were found enriched for pathogenic variants (Fig. 2), with tryptophan (Trp) having the strongest association (OR = 3.27, p = 2.88e-126). Trp is the largest amino acid and it has also been shown that Trp mutations lead an increase in free energy of protein folding, making them more likely to destabilize the structure of protein 27 . Arginine (Arg) was also found to have a burden of pathogenic variants 31 (Supplementary Fig. 11), while the other positively-charged amino acids (lysine and histidine) were found to be associated with population variants.
Pathogenic variants were, in general, found to spatially cluster near different posttranslational modification (PTM) sites 43 ; however, we identified several protein class-specific novel signatures. For example, the pathogenic variants in transcription factors, extracellular matrix proteins, and structural proteins were found to be enriched near ubiquitination sites ( Supplementary Fig. 8f), which were found depleted of pathogenic variants by joint analysis (Fig. 2). Kinases alone carried 15% and 56% of the pathogenic variants near phosphorylation and SUMOylation sites. Expanding the six UniProt-based functional features into twenty-five constituent features (details in Supplementary Note), we identified twenty-one pathogenic variant-associated features. While some of these features, such as modular domains 44 , motifs, and binding sites 43 , are known to be associated with pathogenic variants, our study revealed new associations. A burden of pathogenic variants (OR = 3.9, p < 1.0e-100, Supplementary   Fig. 12) on residues annotated by at least one functional feature versus no feature was also notable, showing putative association between variant pathogenicity and protein function.
It is important to note that we characterized the protein positions of missense variants that were mappable onto at least one protein 3D structure available in PDB 32 , which includes structures for only one-third of all human proteins. Many of these structures have partial coverage of the protein. Therefore, our study outcomes are not exclusive of the influence that may come from accounting for the structured part of the full protein sequence. Also, we could transfer a higher proportion of pathogenic variants (~60%) compared to the population variants (~33%) onto protein structure. Out of the 1,330 disease-associated genes (>80%, n = 1,077 annotated in OMIM), >55% of the genes had at least one structure with >50% of the sequence covered in the structure, and >75% of the ClinVar and HGMD pathogenic variants could be mapped onto the structure for those genes. This observation is consistent with the positive correlation observed between the number of known variants in proteins and the number of proteins being molecularly solved and reported in the literature 45 .
While performing the protein class-specific analyses, we also observed issues due to having smaller variant sets that could be mapped on the 3D structures. For one (lyase) and four (chaperons, cytoskeletal protein, defense/immunity proteins and membrane-traffic proteins) protein classes, we found three or fewer pathogenic-and population-variant-associated features ( Table 2), respectively, after applying the stringent significance cut-off. In addition, the power of the MISCAST-fdspec metric is affected by unavailability of 3D structure information of a reference amino acid, as, in that case, the quantitative score is computed based on the physicochemical and functional features only. In future work, we plan to include high-fidelity homology models and computational models of protein structures to scale up the generated resource to the full human proteome.
Currently, a variety of in silico tools aid in the interpretation of sequence variants, primarily providing a prediction score generated using different algorithms, training cohorts, and features 16 . The three most commonly used missense variant interpretation tools 16 include PolyPhen2 8 , SIFT 10 and Mutation Taster 46 . While these tools can stratify pathogenic variants from benign with a reasonable accuracy (65 -80%) 47 , progress has not been made to provide users with biologically-relevant features of the location and context of missense variants within 3D structures. This is a problem in particular for molecular scientists who need a pre-selection of variants to study protein function. In this work, besides generating a quantitative score to rationalize missense variant pathogenicity, we also provide the relevant features to support the candidate variant selection by molecular biologists (Fig. 3). Moreover, the gnomAD cohort has been leveraged for the first time in our work to assess genetic burden on protein features.
MISCAST is also the only resource to date that gives protein functional class-specific characterization of missense variants, providing many function-specific structural insights into the effect of missense variants at the molecular level. Thus, MISCAST can serve as a powerful resource for the translation of personal genomics to precision medicine by translating genetic variant into 3D protein context: it can help to delineate variant pathogenicity, select candidate variants for functional assay, and aid in generating hypothesis for targeted drug development.

Online Methods
Protein structure and sequence collection.
The Protein Data Bank (PDB) 32 was mined on September 08, 2017 to collect all solved protein threedimensional (3D) structures that were fully or partly (chimeric) indexed as homo sapiens. Our initial curation resulted in 43,805 protein structures, both single-chain and multiple-chain, including structures solved by six different experimental methods. Around 85% of the structures in this collection were solved by X-ray crystallography, of which over 50% of the structures in our collection are of resolution greater than 2.0 Å. The next highest proportion of the structures (~10%) were solved using nuclear magnetic resonance (NMR).
From UniProt 48 , the corresponding protein identifiers and gene name for the 43,805 protein structures were collected, which resulted in 5,870 unique protein identifiers and 5,850 unique gene names. Mapping of coordinates of amino acid residues in 3D structures to linear protein sequence was derived from the SIFTS database 49 .

Protein class annotation
The PANTHER (Protein ANalysis THrough Evolutionary Relationships) database 33 was curated to classify genes by their molecular function, biological process, and protein class. The PANTHER Protein Class ontology was adapted from the PANTHER/X molecular function ontology.
We obtained the protein class annotations for 1,330 genes in the DAGS1330 dataset with population (gnomAD) and pathogenic (ClinVar and HGMD) variants mapped on the protein 3D structures.
Initially, the genes were classified into 178 groups based on similar molecular function, 234 groups based on their involvement in similar biological processes, and into 206 different protein class groups. Finally, guided by the relationships defined among different protein classes (source: http://data.pantherdb.org/PANTHER13.1/ontology/Protein_Class_13.0), the genes were grouped into 24 major protein classes. After automatic annotation, we observed that 624 genes were not assigned to any specific protein class and so as to any protein major class. We then downloaded the Ensembl family description of all human genes (version 93) using BioMart (https://www.ensembl.org/biomart). After that, we manually annotated the 624 genes into those 24 major classes based on (i) the Ensemble family description, (ii) molecular function/biological process annotation previously collected from PANTHER or (iii) molecular function/biological process annotation available in UniProt as defined by the Gene Ontology consortium. Supplementary Table 2 lists the 24 major classes and the number of genes, the number of protein 3D structures, and the count of population/pathogenic variants mapped on protein structures for each protein class. Note that each gene may not be assigned into a unique protein class group; therefore, the genes in each group are not mutually exclusive.

Feature set mining and annotation
We annotated the amino acid residues of 1,330 genes with seven different features comprising of 40 feature subtypes. In the following, we introduce the features and we discuss how we curated these features and annotated the amino acid positions in the Supplementary Note. From this test, we obtained an estimate of fold enrichment, referred to as odds ratio (OR), along with the 95% confidence interval (CI) that shows whether the population/patient variants are relatively enriched or depleted in residues of any specific feature type on 3D structure. A value of OR equal to 1 indicates that there is no association between a specific type of variant with a particular feature, whereas a value of OR greater than 1 (or less than 1) indicates that the pathogenic variants are enriched (or depleted) in a particular type of feature subtype. Further, we calculated p-values showing the significance of association that were compared to the pcut-off. This analysis was also performed taking variants from all proteins and variants from the 24 different protein classes separately, and the output gives the features that are most likely to be associated to the pathogenic and population variants in general and also separately in different protein types. We defined pcut-off according to the Bonferroni correction for multiple testing in statistical analysis, thus pcut-off was set to 5.00e-05 (0.05/(40 * 25)) as we have 40 features on which we performed the association analysis taking all proteins and separately for 24 protein classes.

Implementation of the online resource, MISCAST
The Missense variant to proteIn StruCture Analysis SuiTe, MISCAST has been made publicly available through (http://miscast.broadinstitute.org/). MISCAST was developed with the Shiny framework of R studio 52 . All the two-dimensional plots shown by the online tool are based on the ggplot2 53     The squares (brown) show the OR and the bars (blue) show the 95% confidence interval on the x-axis.
OR=1 indicates the neutral value (no enrichment or depletion) while the OR > 1.0 (and < 1.0) indicates enrichment of pathogenic variants (and population variants). If the association is significant (p-value < pcutoff = 5.0e-05), the corresponding p-value is followed by (*), otherwise (-). For protein-protein interaction types, the feature counts correspond to all bond-annotations available for the amino acid residues with pathogenic and population variants. For the rest of the features, the counts correspond to the number of amino acid residues.