Abstract
Summary Missense mutations that change protein stability are strongly associated with human inherited genetic disease. With the recent availability of predicted structures for all human proteins generated using the AlphaFold2 prediction model, genome-wide assessment of the stability effects of genetic variation can, for the first time, be easily performed. This facilitates the interrogation of personal genetic variation for potentially pathogenic effects through the application of stability metrics. Here, we present a novel algorithm to prioritise variants predicted to strongly destabilise essential proteins, available as both a standalone software package and a web-based tool. We demonstrate the utility of this tool by showing that at values of the StabilitySort Z-score above 1.6, pathogenic, protein-destabilising variants from ClinVar are detected with a 58% enrichment, over and above the destabilising (but presumably non-pathogenic) variation already present in the HapMap NA12878 genome.
Availability and Implementation StabilitySort is available both as a web service (http://130.56.244.113/StabilitySort/) and as a deployable standalone system (https://gitlab.com/baaron/StabilitySort).
Contact Dan.Andrews{at}anu.edu.au
1 Introduction
The ease with which individual genomes can be sequenced belies the complexities we presently face when interpreting this information (MacArthur et al., 2014; Rehm, 2017; Tarailo-Graovac et al., 2017; Whiffin et al., 2017). Currently, interpretation of personal genome data is clinically relevant for only a small proportion of the variants identified (Manolio et al., 2019), and for the remainder, we lack a deep understanding of the relationship between missense variation and potential functional effects. Despite its importance, the interpretive power of the instability effects of mutations on proteins has, to date, been limited by incomplete structural information covering the observed variation (only 16% of human proteins match a structure in the Protein Data Bank with greater than 95% identity (Porta-Pardo et al., 2021)). However, DeepMind’s AlphaFold 2 has generated structural predictions whose accuracy approaches that of experimentally determined structures (Jumper et al., 2021). This, together with the release of near-exhaustive sets of protein models from human and other species (Tunyasuvunakool et al., 2021), means that it is now feasible to produce structure-based estimates of the stability effects for the vast bulk of genomic variation encountered in a personal genome or exome sequence. While it is not feasible to use the AlphaFold algorithm for de novo prediction of mutant protein structures, we are able to use predicted wild type structures as a template for assessing the potential structural consequences of mutations (Akdel et al., 2021; Pak et al., 2021; Pak & Ivankov, 2021). StabilitySort uses the AlphaFold predictions of 3D structures as templates to conduct genome-wide assessment of protein stability changes due to genetic variation, in the absence of a similar, complete resource of experimentally obtained structures.
While the AlphaFold dataset of 3D predictions is variable in quality (Thornton et al., 2021), the almost genome-wide coverage of the resource (Porta-Pardo et al., 2021) presents an opportunity for routine interpretation of the protein stability effects of personal genome variation information.
2 Analysis Workflow
StabilitySort uses internal data from exhaustive pre-computation of the predicted ΔΔG of each possible human missense variant, genome-wide, using AlphaFold-predicted structures (Tunyasuvunakool et al., 2021) with Maestro (Laimer et al., 2015). Maestro was chosen based on benchmarking results and the additional feature that it can simultaneously predict stability changes due to multiple amino acid substitutions (Marabotti et al., 2021). StabilitySort initially calculates stability values for single amino acid substitutions in a single protein because, in most individuals, almost all proteins harbour at most a single variant. However, where a given protein carries multiple variants, the user is also given the option to pool all missense variation across that protein and re-compute the predicted change in stability due to the combined substitutions.
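The choice between using a pre-computed single-substitution value and re-computing a pooled value can be pictured as a simple grouping step. The following Python sketch is illustrative only: the function name, the tuple layout and the example identifiers are hypothetical and are not taken from the StabilitySort code base, and the Maestro re-computation itself is not shown.

```python
from collections import defaultdict


def split_by_variant_count(variants):
    """Group (protein_id, substitution) pairs by protein.

    Returns two dicts: proteins carrying exactly one missense variant
    (candidates for a pre-computed lookup) and proteins carrying several
    (candidates for pooled re-computation). Hypothetical helper.
    """
    by_protein = defaultdict(list)
    for protein_id, substitution in variants:
        by_protein[protein_id].append(substitution)

    single = {p: subs[0] for p, subs in by_protein.items() if len(subs) == 1}
    pooled = {p: subs for p, subs in by_protein.items() if len(subs) > 1}
    return single, pooled


# Made-up identifiers, for illustration only.
observed = [
    ("PROT_A", "R175H"),
    ("PROT_B", "G12D"),
    ("PROT_B", "A59T"),
]
single, pooled = split_by_variant_count(observed)
```

In this toy input, `PROT_A` would be served from the pre-computed index, while `PROT_B` would be passed on for a combined prediction if the user selects the pooling option.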
The StabilitySort methodology seeks to identify missense variants that introduce either highly destabilising or stabilising changes into the encoded protein. We measure this effect by viewing the change with reference to the missense variation in the same protein from the GnomAD database (Karczewski et al., 2020). For each missense variant observed in an input genome, a StabilitySort Z-score test statistic is calculated that assesses the predicted ΔΔG with respect to all other non-disease-associated variation identified in this protein. The StabilitySort Z-score measures how unusual a particular amino acid substitution is, in a particular protein, compared to the distribution of observed GnomAD substitutions in that same protein. Higher Z-score values indicate a higher chance that the amino acid substitution has a functional impact on the protein.
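As a minimal sketch of this kind of per-protein comparison, the Python below computes a conventional Z-score, (x − mean) / standard deviation, for one variant against a background set of ΔΔG values from the same protein. The exact estimator used by StabilitySort is not specified here, so the use of the sample mean and sample standard deviation, the function name, and the convention that larger positive values are more destabilising are all assumptions made for illustration.

```python
from statistics import mean, stdev


def stability_z_score(variant_ddg, background_ddgs):
    """Z-score of one predicted ddG against a per-protein background.

    background_ddgs: predicted ddG values for population variants in the
    same protein (assumed input). Returns None when the background is too
    small or has zero spread, since no Z-score can be formed.
    """
    if len(background_ddgs) < 2:
        return None
    mu = mean(background_ddgs)
    sigma = stdev(background_ddgs)  # sample standard deviation (assumption)
    if sigma == 0:
        return None
    return (variant_ddg - mu) / sigma


# Toy values, not real predictions.
background = [0.1, 0.3, -0.2, 0.5, 0.0, 0.4]
z = stability_z_score(2.0, background)
print(f"Z-score: {z:.2f}")
```

Returning `None` for proteins with fewer than two background variants is a design choice of this sketch; it keeps undefined scores from being mistaken for unremarkable ones.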
3 Results
To demonstrate the effectiveness of StabilitySort in prioritising disease-causing missense variants, we compared 1028 randomly sampled ClinVar pathogenic variants (annotated by ClinVar as CLNSIG=Pathogenic; (Landrum et al., 2020)) with the missense variation present in HapMap individual NA12878 (from the International Genome Sample Resource (Fairley et al., 2020)). The StabilitySort Z-score metric identified an enrichment of a subset of ClinVar variants that were unusually destabilising, given the population variation in these proteins (Supplementary Figure 1a). At Z-score values greater than 1.6, there are 58% more ClinVar pathogenic mutations than at the same Z-score cutoff in the NA12878 genome. This enrichment was asymmetric and was only observed for destabilising amino acid substitutions. Furthermore, the predicted ΔΔG values alone did not show an excess of destabilising amino acid substitutions for the ClinVar pathogenic variants compared to the NA12878 variation (Supplementary Figure 1b). Interpretation of predicted ΔΔG in the context of gene importance is informative, as the range of stability effects observed increases as proteins become more tolerant to loss-of-function mutations (redundant and/or non-essential genes), though the median ΔΔG does not significantly vary (Supplementary Figure 1c; see notches in bar-plot). StabilitySort did not identify candidate pathogenic variation with strong stability effects in the genome of NA12878.
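One way to express an enrichment of this kind is as the relative excess of variants above a Z-score cutoff in one set compared with another. The Python sketch below assumes the comparison is made on proportions of each variant set; this is an interpretation, not a description of the analysis actually performed, and the function names and toy Z-scores are hypothetical.

```python
def fraction_above(z_scores, cutoff):
    """Proportion of Z-scores strictly greater than the cutoff."""
    if not z_scores:
        raise ValueError("z_scores must not be empty")
    return sum(1 for z in z_scores if z > cutoff) / len(z_scores)


def relative_enrichment(case_z, control_z, cutoff=1.6):
    """Relative excess of case variants above the cutoff.

    A return value of 0.58 would correspond to a 58% enrichment.
    Returns None if no control variant exceeds the cutoff.
    """
    control_frac = fraction_above(control_z, cutoff)
    if control_frac == 0:
        return None
    return fraction_above(case_z, cutoff) / control_frac - 1.0


# Toy Z-scores, not real data: 4 of 10 vs 2 of 10 exceed the cutoff.
pathogenic_z = [2.0, 1.7, 0.5, -0.3, 3.1, 0.9, 1.8, 0.2, 0.0, -1.0]
personal_z = [1.9, 0.1, -0.2, 0.4, 2.2, 0.3, -0.5, 0.8, 0.6, 0.0]
enrichment = relative_enrichment(pathogenic_z, personal_z)
```

Because the two variant sets differ greatly in size in practice, comparing proportions rather than raw counts keeps the measure independent of how many variants each set contains.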
4 Conclusion
We present StabilitySort, a genetic variation prioritisation tool for the genome-wide detection of protein stability effects that may contribute to disease. This system is available as both a web service and as standalone software. With this methodology, it is now possible to scan for the presence of likely pathogenic stabilising or destabilising protein mutations in a high-throughput manner.
Figures
StabilitySort allows prioritisation of the full missense variant set of a given genome through an automated workflow that annotates amino acid substitutions with their predicted stability effects. Input to the workflow begins with a user-supplied VCF file of variation. An index of exhaustively predicted ΔΔG values, predicted with the Maestro algorithm using AlphaFold2-predicted human protein structures, is used to compare the value of the variant in question with the variation observed in the same protein in the GnomAD database. This comparison is quantified with a Z-score, and this metric, along with other annotated values, can be used to prioritise and order potentially pathogenic missense variation at a genome-wide scale.
Comparison of ClinVar pathogenic missense variants with the personal missense variation identified in the NA12878 HapMap genome. a) Comparison of the distributions of change-of-stability Z-scores, and b) Maestro-predicted ΔΔG values between missense variant sets. c) Distributions of predicted ΔΔG values for all missense variants from the NA12878 genome, separated into bins by LOEUF decile values. Overlap of notches in the bar-plot indicates similarity of medians between bins at a 95% confidence level. For each bar coloured in gold, the bold mid-line indicates the median predicted ΔΔG, the upper and lower ends of the bar indicate the second and third quartiles, and the whiskers indicate the first and fourth quartiles. The outlier values are indicated with dots. The ten bins separated by LOEUF decile values are numbered (0-9) according to increasing tolerance of loss-of-function mutations (see (Karczewski et al., 2020)). Bin 0 contains the genes least tolerant of loss-of-function mutations (designated as essential genes) and bin 9 contains the most tolerant (designated redundant or non-essential genes).