Individual human genomes frequently contain variants that have epistatic interactions

The availability of thousands of individual genomes provides many opportunities to understand genetic variation and the relationship to phenotype, particularly disease. However, this remains challenging as it is often difficult to identify if a non-synonymous variant alters protein structure and function. Many computational methods have been developed but they typically interpret individual variants in isolation, despite the possibility of variant-variant interactions. Here, we combine the genetic variation data present in the 1000 genome project with protein structural data to identify variant-variant interactions within individual human genomes. We find more than 4,000 combinations of variants that located close in 3D dimensional structure and more than 1,200 in protein-protein interfaces. Many variant combinations include amino acid changes that are compensatory such as maintaining charges or functional groups, thus supporting that these are coevolutionary events. This highlights the need for variant interpretation and precision medicine to consider the gestalt effects of variants.


Introduction
Proteins structures are delicately balanced between functional and non-functional. The effect of a single amino acid change can be the difference between health and disease. In recent years, due to advances in sequencing technologies, extensive human genetic variation data set have become available (Sherry 2001;Auton et al. 2015;Lek et al. 2016;Pagani et al. 2016;Telenti et al. 2016;Karczewski et al. 2017;Maretty et al. 2017;Landrum et al. 2018). Distinguishing between variants that are benign and those that cause disease is crucial, but experimental characterisation of the effects of each variant is impractical, and so a large number of computational tools have been developed to tackle this problem (Pappalardo & Wass 2014;Adzhubei et al. 2010;González-Pérez & López-Bigas 2011;Kumar et al. 2009;Kircher et al. 2014;Reva et al. 2011;Ng & Henikoff 2003). These tools primarily rely on first principles: using information about protein function, structure, evolutionary conservation, and the physiochemical properties of the wild type and variant amino acids. For simple cases, this approach has had some success, with many tools able to distinguish damaging from neutral variants with reasonable accuracy (Daneshjou et al. 2017;Hoskins et al. 2017).
These tools typically consider variants in isolation without considering the context in which they occur, i.e. the effect of a variant may be altered depending on other variants present within the same protein or in interaction partners. This context dependence has long been observed between species, as Compensated Pathogenic Deviations (CPDs) (Kimura 1985), where missense variants that cause disease in humans are wild type in other species. In other species, these changes typically occur with other amino acid changes that compensate for each other (Kimura 1985). For example, in human -haemoglobin p.Val20Glu (rs33918474) causes erythrocytosis, whereas p.Glu20 is wild type in horse haemoglobin. A second difference at position 69 is likely to compensate for this, with the horse sequence containing His and the human Gly (Kondrashov et al. 2002). In horse haemoglobin Glu20 and His69 form a hydrogen bond, while Val20 and Gly69 have van der Waals interactions in human -haemoglobin.
There a many possible mechanisms of such compensations, but the outcome is the samea variant in one individual may have different effects compared to the same variant in another individual. It has been estimated that between 3-12% of disease associated variants in HumVar (Adzhubei et al. 2013) and ClinVar (Landrum et al. 2018) are observed in orthologs in vertebrates representing possible CPDs (Jordan et al. 2015). Experimental analysis of the potential CPDs identified in the genes BBS4, RPGRIP1L, and BTG2 showed that CPDs were able to rescue pathogenic variants and restore wild type function (Jordan et al. 2015). Other studies identified mutually compensatory variants in human p53 that preserve protein stability (Mateu & Fersht 1999), and compensatory variants that preserve dimerization stability of the HIV capsid protein (Del Álamo & Mateu 2005). In these examples, there are combinations of variants where the destabilising or damaging effects of individual variants are non-additive and the full variant combination better preserves the properties of the wild type protein.
The concept of CPDs has also been applied between individuals of the same species, and is often referred to as "genetic buffering" or "genetic compensation". A recent study of a large population identified individuals carrying variants for severe childhood Mendelian diseases who have reached adulthood with no clinical manifestation of the disease (Chen et al. 2016). Other studies have shown that variants thought to cause specific Mendelian diseases are far more common in human populations than would be expected based on the prevalence of the disease (Minikel et al. 2016;Xue et al. 2012;Piton et al. 2013;Bell et al. 2011). This suggests that some individuals with these variants are protected from the diseases by other variants, or that these variants have been previously misclassified as disease causing.
Multiple studies have investigated the prevalence of CPDs between species (Kondrashov et al. 2002;Gao & Zhang 2003;Rhesus Macaque Genome Sequencing and Analysis Consortium 2007;Azevedo et al. 2009;Jordan et al. 2015), and incorporation of CPD information algorithms used to predict variant effects has been suggested as a way of improving the predictive power of such tools (Azevedo et al. 2017). While some work has been done to predict the combined effects of multiple variants (Hopf et al. 2017), no study to our knowledge has looked at the extent of possible compensation events within individual human genomes.
Here we quantify the extent of possible epistatic variant interactions within protein coding regions of the human genome. We utilise data from the 1,000 Genomes Project, generating individual variant sets for each of the 2,504 genomes. We analyse the extent of cooccurring variants within individual proteins, how co-occurring variants are distributed within the 3-dimensional structures of the proteins, and how co-occurring variants affect proteinprotein interface sites. The analysis of the combined effects of variant combinations will be essential in the future better to understand complex genotypes, and deliver the promise of precision medicine.

Protein Sequence Data
The UniProt human proteome set (UP000005640) was used, version last modified on 22/11/2015 (Bateman et al. 2015), and for human Ensembl protein sequences, Ensembl version 84 was used (Yates et al. 2016).
If the canonical isoform was unknown, isoforms present in the Interactome3D data set were chosen first, otherwise the isoform was chosen numerically by its ensembl protein identifier.
ANNOVAR was used to add functional annotation to variants. Corresponding UniProt and ensembl protein sequences were aligned using NEEDLE, version 6.6.0 (Rice et al. 2000).
All protein sequence positions given in this study are in reference to the UniProt sequence, and variants that could not be mapped to UniProt sequence positions were excluded.
ANNOVAR annotates each variant individually and does not consider multinucleotide variants. For each genome, we considered multiple variants in the same codon as a single change instead of two individual changes.
The gnomAD exome VCF dataset was downloaded on 26/04/2017, and the gnomAD allele frequencies of all variants from the 1,000 Genomes Project genomes were extracted (Lek et al. 2016;Karczewski et al. 2017).

Protein Structure Data
All experimental protein structures were downloaded from the Protein Data Bank (PDB), and the Structure Integration with Function, Taxonomy and Sequence (SIFTS) database was used to map the residues in the protein structures to the residues in the protein sequences in the UniProt human proteome set (Berman et al. 2000;Velankar et al. 2013).
Corresponding UniProt proteins and PDB structures were then aligned using MUSCLE (Edgar 2004). Structures were selected for each protein, first prioritising resolution, and if two structures for overlapping sequence regions both had resolutions <3.5Å the structure with the largest sequence coverage was selected. After this step, all sequence positions covered by an experimental structure have been assigned to the highest quality structure available. For the remaining proteins without experimental structures, and for proteins with partial experimental structures where there are contiguous regions of ≥50 residues without a corresponding experimental structure, structural modelling was performed.
For structural modelling, template structures were identified from a 70% identity representative set of the PDB using hhblits, selecting only high-confidence templates with an hhblits probability ≥80 (Berman et al. 2000;Remmert et al. 2012). Models were selected to cover as many sequence positions as possible in a non-redundant manner, starting from the highest confidence model and selecting additional models that cover ≥50 sequence positions not already covered. Models were then generated for each sequence using the identified template and a protocol based on Phyre2 (Kelly et al. 2015), with pulchra used to add and optimise the side chains (Rotkiewicz & Skolnick 2008).
For the set of 20,791 proteins analysed, 16,653 had >0% coverage either by experimental structures, models, or a combination of both. The remaining 4,138 proteins had no experimental structures available and no structural models were identified of adequate quality ( Figure S1). There are 1,609 proteins for which experimental structures cover the entire protein with no sequence gaps ≥50 contiguous residues (average coverage 88.97% with 1.1 structures per protein -multiple structures correspond to different regions of the protein and are not overlapping). There are 2,777 proteins with experimental structures, but with gaps ≥50 contiguous residues (mean coverage 38.79%, with 1.19 structures per protein). Of these, high-quality templates were identified for 1,653 proteins, and structural modelling was used to increase structure coverage for these proteins (mean coverage 56.43%, with 2.91 structures per protein). For the remaining 16,405 proteins without experimental structures, a high-quality template for structural modelling was identified for 12,267 proteins (mean coverage 56.33%, with 1.51 structures per protein). Overall, combining multiple non-overlapping structures per protein and combining structural modelling with experimental structures greatly increased protein sequence coverage ( Figure S1).

Identification of Variant Combinations Within Proteins
Combinations of variants were identified within proteins for each of the 2,504 individual genomes, with each sample having between zero and two variant combinations per protein (one combination per allele). These variant combinations therefore consist of all of the variants that each sample has per copy of the protein, and they can be distributed anywhere within the 3-dimensional structure of the protein -termed Global Combinations

Generation of Random Proximal Combinations
The positions of variants mapped to structures from Global Combinations were randomly assigned to different positions in the proteins, with the total number of variants in each protein kept the same, and Proximal Combinations were then redefined. For both nonsynonymous and synonymous Global Combinations, 1,000 iterations were performed, creating 1,000 sets of Proximal Combinations for variants assigned to random sequence positions.

Variant Property Compensations
Amino acid properties were assigned using the definitions of Innis et al., shown in Table S1 ( Innis et al. 2004). The sum changes in amino acid properties within variant combinations were then calculated to identify compensation of amino acid, charge, or functional group.
Charge and functional group compensations were not counted if they were caused by a direct amino acid compensation, e.g. a direct compensation of an arginine residue is also a positive charge compensation, but would be counted only as a direct amino acid compensation. Foldx (Schymkowitz et al. 2005) was used to predict protein structure stability changes for variant combinations. The Foldx RepairPDB function was run for each protein structure with default parameters, and stability predictions were made using the BuildModel command on the repaired structures with default parameters.

Interactome3D Data Processing
Protein-protein interaction data was taken from the Interactome3D database, Homo sapiens representative dataset version 2017_01 (Mosca et al. 2013). Experimental structures and global templates were included, with domain-domain models excluded. The final set contained 9,642 distinct complexes (2,987 homomeric interactions, 6,655 heteromeric interactions, and 5,500 distinct proteins).
For each protein-protein complex, any residue in one partner within 5Å of a residue in the other partner of the complex was classed as being an interface residue (Lensink et al. 2016). Positions in structures were mapped to positions in UniProt sequences using MUSCLE (Edgar 2004).

Identification of Interface Variant Combinations
Using the processed Interactome3D data, variants were classified into three different groups: interface, non-interface, and variants without interaction data. When searching for variants that may exhibit epistatic interactions or coevolve, proximity in 3-dimensional protein structure can be used to identify candidate combinations. This has been demonstrated by protein contact and structure prediction methods that use coevolution to identify residues that are close in space (Buslje et al. 2009  The Proximal Combinations from the random iterations of non-synonymous variants also contained fewer variants on average compared to those from the sample data (mean 2.2 variants per combination vs 3.4). The maximum sizes of combinations from random iterations were also smaller, with an average value of 7.9, almost half the size of the maximum combination size from the variant data (14).

Variant Combinations Per Protein
The unique variant combinations for each of the 20,791 proteins in the UniProt human proteome were considered (see Methods). Each protein has a distinct pattern of Global and Proximal Combinations, with some proteins having a large number of different combinations, and some having none or very few (Figure 1.E). The number of Combinations per protein is correlated with the length of the protein (r=0.54, Pearson's product-moment correlation coefficient), but there is no correlation between Proximal Combinations per protein and protein length (r=-0.01), structural coverage (r=0.01), or numbers of residues within 5Å (r=0.01; Figure S2).
For non-synonymous variants, the average number of Global Combinations per protein was 26.07, and 20.17 for synonymous variants (Table 1)     Combinations. Note -x-axes differ in their scales between the subplots.

Proteins have a Predominant Proximal Combination
Protein thousands of combinations where >90% of the total occurrences are from that super population, but the AFR super population clearly has many more of these combinations than the other four super populations ( Figure S5.A).  Overall, a potential functional compensation involving direct compensation of one or more amino acids, compensation of one or more charge types, or compensation of one or more functional groups was identified for 1,875 Proximal Combinations across 542 proteins. This corresponds to potential compensatory effects of variants in 42.96% of the 4,365 total

Coevolution occurs when a subsequent variant arises that compensates for a variant that
Proximal Combinations identified (Figure 4.A).

Full Variant Combinations Are More Stable Than Sub-Combinations
Foldx (Schymkowitz et al. 2005) was used to analyse the effect of variant combinations on protein stability as compensation may occur to maintain protein stability (see methods).
Overall, the majority (2,684; 61.5%) of full combinations were predicted to be closer to the wild type stability of the protein than one or more sub-combination of variants, rising to 86.9% for Proximal Combinations containing more than two variants (Figure 4.E). Therefore

Compensating Combinations Also Occur in Protein-Protein Interfaces
Protein-Protein interactions are essential for cellular function, a key component of cellular complexity, and their interface sites are enriched in disease associated variants (David et al. 2012;Wang et al. 2012). We therefore expanded our analysis to identify variant combinations occurring within interface sites. Using interaction data for 9,642 proteinprotein interactions with residue-level structural characterisation from Interactome3D

Variant Combinations Present in Individual Genomes
The  Nevertheless, this study provides an important first, to our knowledge, analysis of coevolution that has occurred within the human genome. Our findings highlight the necessity for programs that predict the effect of single nucleotide variants (Adzhubei et al. 2010;González-Pérez & López-Bigas 2011;Kumar et al. 2009;Kircher et al. 2014;Reva et al. 2011;Ng & Henikoff 2003) to consider the context in which these variants occur.
Individually programs may predict such variants to be deleterious, however given that we observe many compensations for possible functional effects, such predictions would be incorrect. Therefore, it is essential that the next generation of methods consider variant effect prediction in a much broader sense, considering other variants in the genome, rather than each one individually. Combinations of variants in individual human genomes will determine resilience or susceptibility to a host of selection pressures, from inherited disease, to drug response, to pathogenic infection. The interpretation of gestalt effects of variant combinations will be a key challenge in the future of precision medicine.