Assessing computational variant effect predictors via a prospective human

Computational predictors can help interpret pathogenicity of human genetic variants, especially for the majority of variants where no experimental data are available. However, because we lack a high-quality unbiased test set, identifying the best-performing predictors remains a challenge. To address this issue, we evaluated missense variant effect predictors using genotypes and traits from a prospective cohort. We considered 139 gene-trait combinations with rare-variant burden association based on at least one of four systematic studies using phenotypes and whole-exome sequences from ~200K UK Biobank participants. Using an evaluation set of 35,525 rare missense variants and the relevant associated traits, we assessed the correlation of participants’ traits with scores derived from 20 computational variant effect predictors. We found that two predictors—VARITY and REVEL—outperformed all others according to multiple performance measures. We expect that this study will help in selecting variant effect predictors, for both research and clinical purposes, while providing an unbiased benchmarking strategy that can be applied to additional cohorts and predictors.


Introduction
Rapidly increasing availability and use of sequencing in research and clinical genetic diagnostics has yielded millions of rare human genetic variants. Of particular interest are missense variants, which alter the coding sequence of human proteins, potentially altering protein functions 1 and thus contributing to human diseases 2 .
Where genetic disease is suspected, clinical variant interpretation commonly uses the American College of Medical Genetics and Genomics/Association for Molecular Pathology (ACMG/AMP) framework, in which variants are classified as benign, likely benign, likely pathogenic, pathogenic, or of uncertain significance 3 . Of the 4.6 million missense variants in gnomAD 4 , only 2% have been clinically interpreted. The majority of these have been classified as a variant of uncertain significance (VUS) 5 , which, according to the ACMG/AMP guidelines 3 , should not be included in clinical decision making. Moreover, most VUS variants are missense variants, highlighting an unmet need for reliable evidence to support unambiguous clinical interpretation and thereby guide the diagnosis and treatment of disease 6,7 .
Well-established functional assays can provide strong evidence for classification of a variant as either pathogenic or benign 3,8 , but such results are typically unavailable for rare missense variants. Although "variant effect mapping" technologies have been established to proactively determine the functional effects of many variants in parallel [9][10][11] , a variant effect map is available for only ~1% of human disease-associated proteins 12 .
A less resource-intensive means of proactively assembling evidence for all possible variants is computational variant effect prediction. Widely used computational variant effect predictors developed over the last two decades include PROVEAN 13 , PolyPhen-2 14 and SIFT 15 . More recently, a new wave of variant effect predictors, e.g.
DeepSequence 16 and PrimateAI 17 , has benefited from advances in deep learning. Some 'meta-predictors', e.g. REVEL 18 and VARITY 19 , have benefited in large part by combining the results of many evidence sources, including the results of other prediction algorithms.
Current ACMG/AMP guidelines consider computational prediction methods to (at best) provide weak evidence for clinical interpretation 3 . However, as predictors improve, objective evaluation of evidentiary value may justify increased reliance on computational prediction and thus ultimately enable improved clinical interpretation.
Although there have been several benchmarking studies [20][21][22] , it has been difficult to address certain challenges inherent to the assessment of computational predictors.
Chief among these is the establishment of "ground truth" test sets that are independent of the data used in training the predictors being assessed. Where test data have previously been used in training, performance estimates for a computational model may be artificially inflated 23 .
To avoid this circularity issue, Livesey and Marsh 20 used variants measured by variant effect mapping experiments, but could only assess variant predictions for a few dozen proteins for which variant effect maps were available. Furthermore, they had limited ability to evaluate predictors (e.g., DeepSequence) that had been optimized using variant effect map data. Although Livesey and Marsh found that variant effect maps were typically more accurate predictors of pathogenicity than any computational method, using variant effect maps as a source of ground truth does have the caveat that functional assays carried out in cultured cells may not perfectly capture variant impacts on human traits.
Population-based cohorts with genotyped and phenotyped participants have the potential to provide independent data that have not previously been used in training predictors. For example, the UK Biobank 24 has assembled in-depth genetic and trait information for a prospective cohort of 500,000 participants. Whole-exome sequences for >200,000 participants have been released widely to researchers, and information on >7,000 traits is available for a large fraction of participants. Because no variant effect predictors have yet been trained on these data, using the UK Biobank dataset as a source of "ground truth" human trait data sidesteps the risk of performance inflation due to circularity.
Here, as shown in Figure 1, we examined 139 gene-trait combinations (involving 35,525 rare missense variants in 99 genes, and 56 traits) for which it has been reported that the burden of rare missense variation in the gene depends on the trait [25][26][27][28] . For each gene-trait combination, we assessed the correspondence between variant scores and human phenotypes for 20 computational predictors. Results were then compiled to identify computational methods that were top performers for the greatest number of gene-trait combinations.

Gene-trait combinations
We assembled gene-trait combinations for which a significant burden-of-variation association had been reported by at least one of four systematic 'burden scan' studies 25-28 of the UK Biobank cohort. From the initial set of 162 gene trait combinations, we excluded combinations for which the trait had been ascertained in fewer than 10 participants or for which the gene IDs are currently unrecognized or not linked to any proteins in Ensembl database version 104 29 . We also excluded the TTN gene as non-representative given its extreme size and enormous number of reported variants.

Human Variants
This study was conducted with whole-exome sequencing data from 200,619 participants in the UK Biobank cohort, for which variants were retrieved from the OQFE version 30 of the whole-exome VCF files (field ID: 23156). The canonical isoform of each gene product we examined was defined according to the Ensembl database (GRCh38) 29 , with coding exonic regions defined according to the CCDS database 31 . Coding variants corresponding to these coding regions were extracted from raw VCF files. Adapting the filtering criteria used by the UK Biobank 32 , we examined only coding single-nucleotide variants (SNVs) having a Phred quality score > 20, individual missingness < 10%, minimum read coverage depth of 7, and carried by at least one participant passed the allele balance threshold (i.e. the proportion of reads covering a variant's location that support the variant) > 0. 15. Because evidence to support clinical interpretation is typically more abundant for common variants (e.g., from genome-wide association studies), the most critical context for computational predictors to perform well is for rare variation. Therefore, we further restricted our analysis to rare variants, as defined here by having a global MAF < 0.1% in gnomAD 33 . If a variant was not found in the gnomAD database, we assumed it to be rare (MAF < 0.1%).

Variant effect predictions
We considered 20 computational variant effect predictors (Table S1). Scores for predictors were obtained either from a pre-existing repository 34 , or by running the predictor code directly. If a predictor did not provide predictions for at least 10 rare missense variants of the gene for a given gene-trait combination, it was not included in the comparison for that combination.

Predictors comparison
We first split the 139 gene-trait combinations into two categories, depending on whether the associated traits are binary or quantitative (i.e. where measurements are continuous).
Gene-trait combinations where the trait was binary were subjected to precision vs. recall analysis, where precision at a given score threshold is defined as fraction of variants that were correctly assigned as having come from an individual with the trait, and recall is the fraction of variants from individuals with the trait that were identified as such.
Because precision depends on the prior probability, i.e., the prevalence of the trait, we used balanced precision (the precision expected if the test set were to have been balanced) and calculated the area under the balanced precision-recall curve (AUBPRC) 19 .
For each variant effect predictor and gene-trait combination, we rescaled predictor scores to range from 0 to 1 (with 0 corresponding to neutral variants and 1 corresponding to damaging variants). To reduce the impact of outliers, we set the lowest and highest 5% of scores to 0 and 1, respectively.
We subsequently computed a person-centric variant score: the sum of all variant scores for a given participant. For each predictor, this allowed us to rank participants by the participant-centric variant score, plot a balanced precision-recall curve, and calculate AUBPRC for each gene-trait combination. To better understand uncertainty in the calculated AUBPRC values, we used 1000-iteration bootstrapping (random sampling of variants with replacement) to compute a distribution of AUBPRCs for each predictor with each gene-trait combination. We then empirically determined the mean AUBPRC and the 95% confidence interval of the resampled distribution. We compared variant effect predictors in terms of mean AUBPRCs, and calculated an empirical p-value for each pair of computational predictors, i.e., the fraction of pairs of resampled AUBPRC distributions, one from each predictor, in which one predictor achieved a higher AUBPRC than the other predictor. From the p-values, false discovery rates (FDRs) were subsequently calculated to account for multiple hypotheses testing 35 . For each gene-trait combination, predictors were ranked by performance and any predictor that was not significantly different from the numerically best-performing predictor (using an FDR threshold of 10%) was considered tied for best. We then ranked predictors by the number of gene-trait combinations for which the predictor was tied for best.
In contrast, for gene-trait combinations with quantitative traits, we employed the Pearson correlation coefficient (PCC) to compare predictor performance. Where multiple participants carried the same variant, we averaged their quantitative trait values. Each variant effect predictor was examined individually. For each predictor and a given gene-trait combination, we examined variants for which the trait was measured and a variant effect score was available. To better understand uncertainty in the calculated PCC values, we used 1000-iteration bootstrapping (i.e. random sampling of variants with replacement) to compute a distribution of PCCs for each predictor with each gene-trait combination. We then empirically determined the mean PCC and the 95% confidence interval of the resampled distribution. To eliminate negative values, we used PCC 2 instead of PCC, and calculated an empirical p-value (i.e., the fraction of pairs of resampled PCC distributions, one from each predictor, in which one predictor achieved a higher PCC 2 than the other predictor) for each pair of computational predictors. To account for multiple hypothesis testing, we next derived FDRs from these empirical p-values 35 . For each gene-trait combination, predictors were ranked by performance and any predictor that was not significantly different from the numerically best-performing predictor (FDR < 10%) was considered tied for best. We then ranked predictors by the number of gene-trait combinations in which the predictor was tied for best.
To assess the statistical significance of performance differences between methods considering all gene-trait combinations, we carried out a two-tailed Wilcoxon signed-rank test comparing the arrays of performance measures (AUBPRC or PCC 2 ) for each pair of predictors. FDR values for each comparison were derived as above.

Extracting rare missense variants from the UK Biobank cohort for trait-associated genes
To select gene-trait combinations for which rare (MAF < 0.1%) variation is associated with traits, we compiled a set of 139 gene-trait combinations that were collectively identified by four systematic burden testing studies performed using data from the UK Biobank cohort [25][26][27][28] . Table S2

Assessing the performance of computational variant effect predictors
For each rare human missense variant, we obtained variant effect scores from 20 computational variant effect predictors (see Table S1 for a complete list of predictors compared in this study). Some predictors (e.g. PROVEAN) assign low scores to predicted-damaging variants, while others (e.g. PolyPhen-2) assign high scores to such variants. To better compare rankings from different predictors, we negated scores of the former type so that the highest scores always corresponded to the most predicted-damaging variants.
For 12 (60%) of the 20 predictors we examined, scores were available for every missense variant in our evaluation set. For over 90% of the genes of interest, all predictors provided scores for at least 10 variants. Table S3 shows the prediction coverage for the 99 genes included in this study. To reduce the effect of extreme values on this performance evaluation, we applied a floor and ceiling at 5 th and 95 th percentile predictor scores, respectively, and applied a transformation to give all variant effect predictors the same 0-1 range of score values.
Next, we deployed two approaches to assess the performance of variant effect predictors, depending on whether the trait was binary or quantitative.
For gene-trait combinations with binary traits (including categorical traits that could be simplified and considered binary), we applied an AUBPRC approach. Because binary trait measurements were made on the participant level, i.e. one measurement per participant in the UK Biobank cohort, we wished to obtain a participant level summary of predicted variant effects. On average, about 1% of participants had more than one variant in a given gene ( Figure S1). Therefore, we summed the total predictor score for all missense variants observed in each gene of interest in each participant. We note that this approach models variant effects as additive, e.g., two mildly damaging variants (score = 0.5) combined will show a more damaging effect (total score = 1.0). To illustrate this approach with the computational predictor VARITY, Figure 2 shows that UK Biobank participants taking cholesterol-lowering medication (FID: 6153-1) are three times more likely to have a damaging (≥1) total missense variant impact than those that are not.
For every gene-trait combination and each predictor, we analyzed the tradeoff between precision (i.e. fraction of participants above a given total variant impact score threshold that had the trait value associated with heightened rare-variant burden) and recall (i.e. fraction of all participants with the rare-variant-burden-associated trait value that were detected at a given total variant impact score threshold). More specifically, because precision depends on the prior probability of the trait value, we evaluated balanced precision, which represents precision where the prior probability of the trait value is 50% 19 . This enabled us to evaluate, for each combination of gene and binary trait, the AUBPRC for each computational predictor.
Because AUBPRC analysis is only appropriate for binary traits, we examined quantitative traits using a PCC-based approach, by which we assessed the correspondence between variant impact score and trait value. Where multiple participants carried the same variant, we averaged the quantitative trait values. Variant predictors were considered better performing if they had a higher PCC value.
Thus, for each gene-trait combination, we obtained either a PCC or AUBPRC measure of performance. To estimate uncertainty in each performance measure, we carried out bootstrap resampling in which variants of a given gene were resampled with replacement and PCC or AUBPRC values were re-calculated for each sample. The performance among the remaining 18 predictors ranged from 66 to 88%, with mean 71%. The best variant predictor according to this ranking was VARITY 19 , followed by REVEL 18 and Eigen 38 . Because VARITY or REVEL were statistically indistinguishable (FDR = 0.32; Wilcoxon signed-rank test), we considered both VARITY and REVEL to be the best performing in this evaluation. Similar analysis ( Figure S2) showed that both VARITY and REVEL each significantly (FDR < 0.1) out-performed all other methods.

Discussion
We used a large UK Biobank cohort to assess the performance of 20 computational predictors of missense variant effects. Combining four systematic burden-test studies, collectively based on whole-exome sequences of the 50K and 200K participants in the UK Biobank cohort, yielded a large set of gene-trait combinations (139) to enable a robust comparison of predictors.
Because none of the computational predictors we examined had been trained using UK Biobank data, our evaluation approach has the marked advantage of independence and avoidance of the performance inflation that can arise when predictors are assessed using training data. Although every predictor was a top predictor for at least one gene-trait combination, counting the number of gene-trait combinations in which a predictor was best enabled an overall ranking.
It is interesting that the top three predictors overall -VARITY 19 , REVEL 18 , and Eigen 38are all meta-predictors, i.e., they combine prediction scores from other variant effect predictors. For meta-predictors, it can be especially difficult to establish ground truth sets of variants that had not been used for training any of the input predictors. That said, VARITY only exploited scores from other predictors that were unsupervised, i.e., made no direct use of variant pathogenicity annotations. The fourth-ranked predictor, MPC 39 , exploited observations of depleted missense variation within particular sub-genic regions. That none of the top-three-ranked predictors exploited this kind of information suggests the possibility that combining these approaches could yield still-better results.
One limitation of our study was that we did not evaluate all published computational variant effect predictors. This was in part due to the vast number of such predictors but many predictors were excluded due to non-functional websites. We were particularly interested in evaluating EVmutation 40 given its excellent reported performance.
However, EVmutation provided a score for only 7,021 (20%) of the 35,525 missense variants we examined, perhaps because the EVmutation method requires a deep multiple sequence alignment. However, we were able to examine variant effect predictors that are 1) widely used to predict variant effects (e.g. PolyPhen2, PROVEAN) and 2) novel and claimed better-performing than conventional predictors (e.g. PrimateAI, REVEL). We suggest that this analysis should be repeated periodically to benchmark and test evaluate predictors as they emerge.
Another limitation of this study is that we did not consider correlations between traits.
For example, multiple gene-trait combinations involved LDLR, and these traits were correlated with one another, so our analysis was influenced by some genes and traits more than others. That said, body mass index (BMI; FID: 21001) was the most recurrent trait (appearing in 22 combinations) and LDLR was the most recurrent gene (appearing in 5 gene-trait combinations). Thus, no one gene or trait dominated the collection of 139 gene-trait combinations examined here.
We also did not adjust trait values to account for dependencies on other participant variables. For example, LDL cholesterol levels are known to correlate with both age and sex 41 . In future analyses, LDL cholesterol measures adjusted for age and sex, or more precise variables such as apolipoprotein B, could have a variation that is more attributable to genetic variation and therefore show greater correlation with predictor scores. However, because the same adjusted LDL cholesterol values would be used to evaluate all predictors, such an adjustment would be expected to have limited impact on the relative rankings amongst predictors.
One possible criticism of our study is that the UK Biobank dataset we used contains variants that may have been used in training predictors. For this reason, we considered excluding variants that had been previously reported as pathogenic or benign, e.g. in ClinVar 42 , or as disease-associated or disease-causing by the Human Gene Mutation Database (HGMD) 43 , as many predictors will have trained on variants from these resources. However, vanishingly few of the variants used in predictor training sets will have been deposited in HGMD or ClinVar on the basis of analysis of the UK Biobank data, especially given that we have excluded common variants from our analysis. More obviously, neither the selection of UK Biobank participants, nor their traits or genotypes, was determined by variant annotations in ClinVar, HGMD or elsewhere.
Our analysis should in future be expanded to evaluate additional burden-test associated gene-trait combinations beyond the 139 examined here, as they emerge. Moreover, release of whole-exome sequences for an additional 250,000 UK Biobank participants is expected in 2022, and it will be important to revisit these comparisons with the expanded dataset.
In conclusion, this study provides an independent assessment of several computational variant effect predictors based on their correspondence with human traits in a large prospective cohort. Given the critical need for improving performance of computational predictors for both clinical and research applications, our benchmarking method is likely to be applicable to future human cohorts and methods for inferring the pathogenic impact of human genetic variants.   Performance comparisons used mean AUBPRC for binary traits and mean PCC values for quantitative traits, respectively.
Error bars indicate the 95% confidence intervals of performance measures.

Supplemental data
Supplemental data includes three tables and two figures.

Declaration of interests
F.P.R.is a scientific advisor and shareholder for Constantiam Biosciences and BioSymetrics, and a Ranomics shareholder. The authors declare no other competing interests.