Abstract
Despite the frequent implication of aberrant gene expression in diseases, algorithms predicting aberrantly expressed genes of an individual are lacking. To address this need, we compiled an aberrant expression prediction benchmark covering 8.2 million rare variants from 633 individuals across 48 tissues. While not geared toward aberrant expression, the deleteriousness score CADD and the loss-of-function predictor LOFTEE showed mild predictive ability (1-1.5% average precision). Leveraging these and further variant annotations, we next trained AbExp, a model that yielded 10% average precision by combining in a tissue-specific fashion expression variability with variant effects on isoforms and on aberrant splicing. Integrating expression measurements from clinically accessible tissues led to another two-fold improvement. Furthermore, we show on UK Biobank blood traits that performing rare variant association testing using the continuous and tissue-specific AbExp variant scores instead of LOFTEE variant burden increases gene discovery sensitivity and enables improved phenotype predictions.
Main
Aberrant gene expression, gene expression levels outside the physiological range, is a frequent cause of diseases. Aberrant underexpression of tumor suppressor genes and aberrant overexpression of oncogenes are hallmarks of oncogenesis1,2. Moreover, aberrant gene expression is a frequent cause of rare inheritable disorders3–8 and contributes to risks for common disease-associated traits9.
Statistical methods to call expression outliers from RNA-seq data10–13 applied to large cohorts have enabled investigating the genetic basis of aberrant expression. Rare variants have been found to be associated with expression outliers14. Specifically, rare variants including rare structural variants and rare variants likely triggering nonsense mediated decay such as premature stop codons, frameshift, and splice-disrupting variants are enriched among underexpression outliers6,14,15. Moreover, rare structural variants, notably gene duplication, were found enriched among overexpression outliers14,15.
Building on these findings, algorithms have been developed to prioritize which variants may be the genetic cause of an expression outlier identified in an RNA-seq sample given the matched genome14,15. However, there is no algorithm predicting aberrant expression using only DNA sequence as input. A sequence-based predictor that can predict aberrant expression in multiple tissues and generalize to unseen variants could improve our ability to identify high-impact rare variants in large genomic cohorts, which in turn would aid in identifying disease-associated genes and pinpointing disease-causal variants.
To address this unmet need, we here establish a benchmark for genotype-based prediction of aberrant gene expression across human tissues (Fig. 1). While underexpression outliers can be assumed to have strongly impaired or entirely lost function, the functional consequence of an overexpression outlier is less clear, as it could as well result in a gain of function. We therefore focussed in this study on underexpression outliers and developed AbExp, a machine learning model predicting aberrant underexpression in multiple tissues using DNA sequence as input. By integrating various variant annotations with tissue-specific isoform proportions and expression variability, AbExp significantly outperforms other loss-of-function annotators that were not explicitly developed for aberrant underexpression prediction. We demonstrate how AbExp can be used in rare variant gene association testing as well as phenotype prediction on the UK Biobank dataset using 40 blood traits. Finally, we show that when gene expression measurements from clinically accessible tissues are available, AbExp scores can be integrated to improve prediction of aberrant expression in non-accessible tissues.
Results
A benchmark dataset of underexpression outliers across 48 human tissues
We first set out to predict which protein-coding genes are aberrantly underexpressed in which tissues given an individual’s genome. To this end, we created a benchmark dataset using the aberrant expression caller OUTRIDER11 on 11,096 RNA-seq samples with paired whole-genome sequencing data of the Genotype-Tissue Expression dataset16 (GTEx v8), spanning 48 tissues and 633 individuals. For the whole-genome sequencing data, we selected version 7 of the GTEx data since, unlike for the version 8 release, structural variant calls including transcript ablation were available. Although the choice against the version 8 genome release reduced the number of underexpression outliers by 20%, it allowed us to consider structural variants, which are important determinants of aberrant expression15. In each tissue, we restricted the analysis to all protein-coding genes with an average read-pair count of at least 450, an estimated minimal coverage required to detect 50%-reduction outliers6. Overall, 18,152 protein-coding genes out of 18,563 (97.8%) showed sufficient average coverage in at least one tissue. We defined a gene to be aberrantly underexpressed if OUTRIDER reported a false discovery rate (FDR) lower than 0.2 and the expression was lower than expected (Methods). We further removed samples with more than 50 outliers since such a high number of outliers could reflect poor sample quality, an unreliable fit of OUTRIDER, or trans-regulatory effects affecting a large number of genes in the sample. Across all analyzed 11,096 RNA-seq samples, the resulting benchmark dataset consisted of 17,637 underexpression outliers occurring in 5,428 unique genes and amounting to 1.6 outliers per sample on average. A detailed overview of how many samples and genes remained after each filtering step is given in suppl. table T1.
Integrating rare variant annotations to predict underexpression outlier across tissues
We considered predicting aberrant expression for all 48 tissues for any protein-coding gene from rare variants, here defined as variants occurring in at most two GTEx individuals and with a gnomAD minor allele frequency of less than 0.1%. We reasoned that above 0.1% minor allele frequency, variants are unlikely to cause an expression outlier since the average frequency across samples of underexpression outliers among genes with sufficient RNA-seq coverage is about 0.01%. Moreover, we focussed on cis-regulatory variants by considering variants located within the gene and up to 5,000 bp around the gene to fully cover promoter and transcription termination regions.
We first investigated whether existing variant annotation tools that were not developed for predicting aberrant underexpression showed informative signals. Among the variant consequences annotated by Ensembl VEP, we found frameshifts, variants affecting start and stop codons, and splicing variants to be strongly enriched among underexpression outliers, consistent with previous reports6,14,15 (Fig. 2a). Variants that introduce premature stop codons, including frameshifts, splice variants, and stop-gains, are known to be strongly enriched among gene underexpression outliers as they often trigger nonsense-mediated decay (NMD) of the transcript14. LOFTEE is a tool that predicts a high-confidence subset of loss-of-function associated variants, notably variants likely to trigger NMD, by implementing filters such as removing stop-gained and frameshift variants that are within 50 bp of the end of the transcript as these usually escape NMD, or variants that affect splicing only in UTRs17. In GTEx, more than 17% of the aberrantly underexpressed genes had a LOFTEE variant, compared to non-outliers that had a LOFTEE variant in less than 1% of the cases (Fig. 2a).
Moreover, we hypothesized that the deleteriousness score CADD18 could also be predictive of aberrant gene expression. The advantage of CADD, which was trained to distinguish between simulated de novo variants and variants that have arisen and become fixed in human populations, is that it provides a score for any variant. We found that CADD scores of rare GTEx variants in underexpression outlier genes were in median ∼13 times higher than in non-outliers (Fig. 2b).
Despite these enrichments, LOFTEE and CADD by themselves showed limited predictive value. Using LOFTEE-positive variants alone recalled 16.7% of the underexpression outliers at a precision of 9.7% (Fig. 2c). While CADD scores showed some predictive value, no single score cutoff reached the same precision and the same recall as LOFTEE (Fig. 2c). Next, we trained a non-linear model that integrated all the above-mentioned features to quantitatively predict the OUTRIDER z-score (Methods). Predicting the underlying quantitative z-scores led to better models than predicting the binary classes of outliers and non-outliers. Ranking based on the predicted z-scores uniformly outperformed ranking based on CADD scores on held-out data (Fig. 2c). Moreover, the integrative model reached the same precision at the same recall as filtering for LOFTEE variants, with the added value of providing a continuous score, thereby allowing more stringent cutoffs to be applied to yield a higher precision (up to 13%, Fig. 2c). Lastly, the advantage of the integrative model over CADD and LOFTEE was observed in aggregate (Fig. 2c) as well as across individual tissues according to the average precision (measured here and elsewhere by the area under the precision-recall curve, AUPRC, Fig. 2d).
Accounting for tissue-specific isoform expression improves predictions
By construction, the predictions of this first model were independent of the tissue as neither the variant annotations nor the considered transcript isoforms were tissue-specific. However, since the transcript isoforms of a gene are often expressed at different proportions across tissues, variants can have tissue-dependent effects19. For example, ENST00000358514, the canonical transcript of PSMB10 according to the MANE annotation20, was estimated to generate only about 4% of PSMB10 total gene expression in putamen (Methods). The vast majority (91%) of PSMB10 gene expression in putamen was attributed to another transcript, ENST00000570985. Conversely, in fibroblasts, the canonical transcript contributed to nearly 48% of the total gene expression. Exon 4 is not included in the transcript ENST00000570985 but is included in the canonical transcript ENST00000358514, explaining why a frameshift variant in exon 4 was associated with a high impact on gene expression in cultured fibroblasts but showed a limited effect in putamen (Fig. 3a,b).
Generally, we found that only 30% of the canonical transcripts contributed more than 90% of their gene’s total expression and that as many as 18% of the canonical transcripts contributed less than 10% of their gene’s total expression (using MANE-select as the canonical transcript, suppl. Fig. S1). Therefore, relevant information is lost when considering the variant consequence assigned to a single isoform, even if it is annotated as the canonical one. To address this issue, we calculated the isoform composition in every tissue and weighted the VEP consequences and LOFTEE classification of each variant by the proportion of affected transcripts per gene and tissue (Methods). Training the model using these tissue-specific weighted annotations increased the average precision by 55% to reach 2.7% in median across tissues (Fig. 3c).
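As an illustration of this weighting, the following Python sketch sums the tissue-specific expression proportions of the transcripts hit by a variant consequence. The function name `weighted_consequence` and the dictionary layout are hypothetical and are not taken from the AbExp code base. Only the PSMB10 proportions quoted above are used; isoforms whose proportions are not reported in the text are left out.

```python
def weighted_consequence(isoform_proportions, affected_transcripts):
    """Sum the expression shares of the transcripts affected by a variant.

    isoform_proportions: {transcript_id: share of the gene's expression in one tissue}
    affected_transcripts: set of transcript ids carrying the consequence
    """
    return sum(
        share
        for transcript, share in isoform_proportions.items()
        if transcript in affected_transcripts
    )


# PSMB10 proportions quoted in the text (other isoforms are not listed there).
putamen = {"ENST00000358514": 0.04, "ENST00000570985": 0.91}
fibroblasts = {"ENST00000358514": 0.48}

# A frameshift in exon 4 only hits the canonical transcript ENST00000358514.
affected = {"ENST00000358514"}

putamen_weight = weighted_consequence(putamen, affected)
fibroblast_weight = weighted_consequence(fibroblasts, affected)
```

Under these assumptions, the same frameshift receives a weight of about 0.04 in putamen and about 0.48 in fibroblasts, mirroring the tissue-dependent effect described for PSMB10.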
Incorporating the tissue-specific gene expression variability further improves predictions
Similar to other statistical models for RNA-seq data, OUTRIDER includes a measure of gene expression variability called the biological coefficient of variation11,23. In our benchmark dataset, the biological coefficient of variation captures the expression variability of genes per tissue across the GTEx population. We reasoned that the same expression fold-change could cause an expression outlier for a gene with low expression variability but not for a gene with high expression variability. Indeed, we observed that the minimal fold-change among expression outliers decreased with the biological coefficient of variation (Fig. 3d). Therefore, a given relative reduction in gene expression can lead to aberrant expression in one gene or tissue but not necessarily in another. For instance, a 30% reduction in the expression of the gene LTBP3 in tibial artery resulted in an outlier. In contrast, a 30% expression reduction of OR2W3 in blood would not lead to an outlier as OR2W3 shows a very large gene expression variation in blood, ranging between 10% and 230% (Fig. 3e). It is not surprising that OR2W3, one of the over 800 human olfactory receptor genes22 for which dysfunction is likely benign, exhibits more expression variability than LTBP3, a gene whose dysfunction is associated with dental anomalies and short stature21. Generally, we found that genes with lower expression variability were more genetically constrained in the human population (i.e. harbored fewer loss-of-function variants, Fig. 3f), in agreement with a previous report on primates24.
Next, we aimed to improve our underexpression outlier predictor by adjusting for expression variability. We first considered modeling expression fold-changes from variants and then converting the predicted fold-changes into a z-score, under the assumption that variants affect gene expression fold-changes independently of the expression variability. Using variants likely triggering NMD to test this assumption, we noticed however that the same class of variants was associated with milder fold-changes among genes with lower expression variability (Fig. 3g, suppl. Fig. S2), perhaps because genes with low expression variability are subject to regulatory buffering mechanisms25,26. Therefore, we opted for a more general modeling approach in which the biological coefficient of variation is provided as an input feature along with variant annotations to a non-linear model predicting the z-score. This model increased the performance by more than 50% to 4.0% average precision (median across tissues, Fig. 3c). Altogether, these results show that gene expression variability must be taken into account when predicting aberrantly expressed genes, and that predicting the z-score rather than the fold-change is more relevant for variant interpretation.
Contribution of aberrant splicing prediction and transcript ablation
Aberrant splicing isoforms often contain a premature termination codon and, as a consequence, are often though not always degraded by NMD27. Our model already included splice site information, as it was based on LOFTEE, CADD, and tissue-specific isoform weighting of VEP annotations. However, these features ignored splice sites that are not part of the genome annotations upon which these tools are built. We have recently developed a model called AbSplice that predicts aberrant splicing across tissues using a more comprehensive map of splice sites, including unannotated weak splice sites, and their tissue-specific usage28. We found that AbSplice scores were in median 10 times higher among underexpression outliers than among non-outliers (suppl. Fig. S3). Integrating AbSplice scores significantly increased the performance to 4.9% average precision in median across tissues (Fig. 3c).
We next considered transcript ablations, which could be inferred in GTEx from the structural variant deletion calls. Overall, 33 out of the 43 GTEx individuals harboring transcript ablations also showed an expression outlier in at least one tissue. Including transcript ablation variants into the model resulted in a large gain in precision among the top-ranked predictions and increased the average precision to 9.1% in median across tissues (Fig. 3c). In the following, we refer to this model which integrates all features mentioned so far as AbExp. AbExp takes as input a set of variants within 5,000 bp of any annotated transcript of a protein-coding gene and returns a predicted z-score for each of the 48 tissues. For user convenience, we furthermore suggest a high-confidence cutoff (AbExp < -3.4) corresponding to 50% precision and 8.5% recall on our benchmark data, and a low-confidence cutoff (AbExp < -1.3) corresponding to 20% precision and 19.1% recall.
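The two suggested cutoffs could be applied to predicted scores as in the following minimal Python sketch. The function names and the tissue labels in the usage are hypothetical; only the cutoff values and their benchmark precision and recall come from the text.

```python
HIGH_CONFIDENCE_CUTOFF = -3.4  # about 50% precision, 8.5% recall on the benchmark
LOW_CONFIDENCE_CUTOFF = -1.3   # about 20% precision, 19.1% recall on the benchmark


def confidence_call(abexp_score):
    """Map a predicted AbExp z-score to a confidence label (strict inequalities)."""
    if abexp_score < HIGH_CONFIDENCE_CUTOFF:
        return "high"
    if abexp_score < LOW_CONFIDENCE_CUTOFF:
        return "low"
    return "none"


def flag_tissues(scores_by_tissue):
    """Keep only the tissues whose score passes at least the low-confidence cutoff."""
    calls = {tissue: confidence_call(s) for tissue, s in scores_by_tissue.items()}
    return {tissue: call for tissue, call in calls.items() if call != "none"}
```

For example, `flag_tissues({"Liver": -4.0, "Whole_Blood": -0.5})` keeps only the liver prediction, labeled as high confidence.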
AbExp replicates in independent datasets
We next assessed how AbExp performance replicated on two independent datasets. The first dataset consisted of 299 individuals suspected to be affected by a mitochondrial disorder6 with whole-exome sequencing data paired with RNA-seq from fibroblasts. The second dataset consisted of 290 whole-genome sequencing samples with RNA-seq from iPSC-derived motor neurons from the AnswerALS research project29. Structural variant calls, and thus transcript ablation calls, were not available on either dataset. Moreover, we observed that the recall for all methods was about half as high on the ALS dataset as in GTEx and in the mitochondrial disorder dataset (suppl. Fig. S4), perhaps because of poorer expression outlier calls, a stronger role of epigenetic and trans-regulatory effects, or combinations thereof. Taking these differences into account, our results on those two independent datasets were in agreement with the evaluation on GTEx. We found that AbExp without transcript ablation annotation significantly outperformed LOFTEE and CADD, with a two- to three-fold higher average precision (suppl. Fig. S4). Here too, AbExp without transcript ablation annotation allowed for slightly better precision at the same recall than LOFTEE filtering, while offering a continuous score with which much higher precisions can be reached.
AbExp improves rare variant association testing and phenotype prediction
The rise of large exome-sequencing and genome-sequencing cohorts empowers rare variant association testing (RVAT), which helps pinpoint causal genes for traits30,31 and enables improved phenotype predictions, particularly among individuals showing extreme phenotypes32. RVAT consists of identifying genes within which the occurrence of likely high-impact variants associates with a trait33,34. To this end, accurate prediction of high-impact variants can be advantageous, suggesting that AbExp may have the potential to improve RVAT. To test this hypothesis, we considered 40 continuous blood traits including high-density lipoprotein cholesterol, glucose, and urate levels (suppl. table T2) from the UK Biobank 200k exome release35. We chose blood traits for this proof-of-concept investigation since they are well-studied and frequently measured to diagnose and monitor chronic disease conditions. Moreover, RVAT methods are typically better calibrated on continuous traits as opposed to binary traits36,37.
To ease comparisons between variant annotations, we used linear regression as a common framework for rare variant association testing. As a realistic baseline, we considered RVAT based on LOFTEE variants, similar to the Genebass study31. For this baseline model, gene-trait association was tested by regressing the trait against the number of LOFTEE variants. The second model was designed to leverage that AbExp is both quantitative and tissue-specific. To this end, we considered the lowest AbExp z-score across all rare variants for each of the 48 tissues. The gene-trait association was tested by regressing the trait against the resulting 48 values. Two further models were considered by regressing against the minimum and against the median of the 48 values. To adjust for other relevant factors and effects due to common genetic variation, all four models included as covariates sex, age, the first 20 genetic principal components, a polygenic risk score predicting the trait, and common variants reported to be associated with the trait and located 250,000 bp around the gene (Methods). We fitted every model on two thirds of the dataset for gene-trait association discovery. Phenotype permutation analysis indicated that all models were calibrated (suppl. Fig. S5).
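The per-gene predictors entering these regressions could be assembled as in the following Python sketch. All function names, tissue labels, and toy values are hypothetical. How individuals without any rare variant in a gene are scored is not specified in this section, so the sketch simply assumes at least one rare variant per call.

```python
from statistics import median


def per_tissue_minimum(variant_scores):
    """Lowest AbExp z-score across an individual's rare variants, per tissue.

    variant_scores: non-empty list with one {tissue: z-score} dict per rare variant.
    """
    tissues = sorted({tissue for scores in variant_scores for tissue in scores})
    return {
        tissue: min(scores[tissue] for scores in variant_scores if tissue in scores)
        for tissue in tissues
    }


def aggregate_across_tissues(per_tissue):
    """Minimum and median of the per-tissue values (the two aggregated models)."""
    values = list(per_tissue.values())
    return {"min": min(values), "median": median(values)}


def loftee_burden(loftee_flags):
    """Baseline predictor: number of LOFTEE variants carried in the gene."""
    return sum(1 for flag in loftee_flags if flag)


# Toy scores for two rare variants in one gene of one individual (made up).
toy_variants = [
    {"Liver": -3.0, "Whole_Blood": -0.2},
    {"Liver": -0.1, "Whole_Blood": -0.5},
]
per_tissue = per_tissue_minimum(toy_variants)
aggregated = aggregate_across_tissues(per_tissue)
burden = loftee_burden([True, False])
```

In the tissue-specific model, the dictionary returned by `per_tissue_minimum` would supply the 48 regressors of a gene, whereas the aggregated models and the LOFTEE baseline each use a single regressor per gene.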
Using AbExp predictions for the 48 tissues, we identified in total 30% more gene-trait associations compared to the LOFTEE-based model (Fig. 4a), showing that AbExp can improve RVAT-based gene discovery. Notably, association testing using the tissue-specific predictions outperformed aggregated forms of the AbExp score in most cases by finding more gene-trait associations (suppl. Fig. S6). In some instances, we could rationalize which tissues showed the most significant associations. This was the case for the gene encoding Apolipoprotein B, a major constituent of triglyceride-rich lipoproteins synthesized in the liver and whose predicted aberrant underexpression in the liver was found to be negatively associated with blood triglyceride levels (suppl. Table T3). However, the high correlation between the AbExp scores per gene across tissues can make the estimation of each individual tissue-specific coefficient unstable, and, therefore, their interpretation should be done with caution.
Having shown that AbExp can improve the gene-discovery sensitivity of RVAT, we next assessed its utility in phenotype prediction. To this end, we used the remaining third of the dataset which was not used for gene-trait association discovery. Specifically, we fitted gradient boosted trees models predicting the traits given the AbExp scores on the one hand or the number of LOFTEE variants on the other hand, of the genes discovered on the first two thirds of the data. These models were controlled for sex, age, the first 20 genetic principal components, and a polygenic risk score predicting the trait (Methods). The predictions rarely differed between the common variant-based model and the model further including AbExp scores of rare variants, as exemplified for the Alanine aminotransferase blood levels (Fig. 4b). For this trait, predictions for less than 0.3% of all individuals differed between the two models by more than 1 standard deviation of the population trait distribution (Fig. 4b). Remarkably, the trait values of those differing individuals tended to deviate largely from the population average, suggesting that the model integrating AbExp scores especially improves the predictions of individuals with extreme phenotypes that common variants cannot explain (Fig. 4c).
On held-out data, the phenotype prediction model based on AbExp scores significantly increased the amount of explained variation ($R^2$) over the model based on LOFTEE in 50% of the traits and never significantly decreased $R^2$ (Fig. 4d). Considering the number of individuals differing by more than 1 standard deviation from the common-variant-based model, the AbExp-based model improved the prediction of 784 individuals across the 40 blood traits, while the LOFTEE-based model only improved the prediction of 259 individuals (Fig. 4e). Moreover, the advantage of using AbExp scores was similarly observed when predicting phenotypes with regularized linear regression instead of gradient boosted trees (suppl. Fig. S7a,b), although this yielded less accurate phenotype predictions (suppl. Fig. S7c).
Altogether, these results show that AbExp provides useful variant annotation for gene-trait association discovery by rare variant association testing and for building improved genetic risk scores.
Incorporating RNA-seq from clinically accessible tissues (CATs) boosts prediction performance
RNA sequencing is becoming increasingly popular for rare disease diagnostics as a complementary assay to genome or exome sequencing as it allows the direct measurement of aberrant gene regulation4–6,38–40. However, many rare disorders are suspected to originate from tissues that can only be very invasively sampled such as the brain or the heart. We and others6,41 have shown that clinically accessible tissues (CATs), notably skin fibroblasts and to a lesser extent whole blood, share a substantial fraction of expressed genes with non-CATs and, therefore, are likely to capture aberrant expression occurring in non-CATs. The GTEx dataset, a dataset of post-mortem samples, offers a unique opportunity to test the validity of this assumption as it provides matched samples for CAT and non-CAT tissues. We found that the mere ranking of genes according to their OUTRIDER z-score in fibroblast RNA-seq samples led to an average precision in predicting underexpression outliers in non-CATs of 17.7% (median across tissues), significantly larger than the genome-based predictor AbExp (8.8%, $P = 4.2 \times 10^{-7}$, Fig. 5a). Next, we developed a model taking as input AbExp, whether the gene is expressed in a CAT and, if so, its OUTRIDER z-score. Using RNA-seq from skin fibroblasts to predict aberrant underexpression in all other tissues, this model reached an average precision of 19.5% in median across tissues (Fig. 5a). Consistent with previous work based on shared expressed genes6,41 and our work on aberrant splicing prediction28, fibroblasts turned out to be more informative than whole blood (median average precision 9.1% using RNA-seq only and 16.1% when integrating AbExp, Fig. 5b), owing to blood expressing fewer genes. Altogether, we have established a method integrating direct measurements of aberrant expression from RNA-seq data in a CAT along with genomic variant annotations to predict aberrant underexpression in other tissues. Doing so, we showed that integrating RNA-seq from fibroblasts yields a substantial improvement as it doubles the average precision over using genomic variants alone.
Discussion
Altogether, we established a benchmark dataset for aberrant gene underexpression prediction in 48 human tissues, addressing an unmet need in the area of high-impact variant effect prediction. We developed AbExp, a machine learning model predicting aberrant underexpression across tissues by integrating existing variant annotations with tissue-specific gene expression variability and transcript isoform composition. AbExp outperformed existing variant annotation tools by up to 7-fold in average precision. Using UK Biobank dataset blood traits, we demonstrated that the continuous and tissue-specific AbExp scores provide added information over the state-of-the-art putative loss of function classifier LOFTEE for rare variant gene association testing as well as for phenotype prediction. Finally, we showed that AbExp scores can be combined with gene expression measurements from clinically accessible tissues to predict aberrant expression in other tissues yielding an increased prediction performance by 2-fold over AbExp.
Refining the predictor, while our primary objective, also shed light on the biology of underexpression outliers. We found that gene expression variability plays a dual role in this context. On the one hand, altering expression of a gene is more likely to result in an outlier if the expression of the gene varies little in the population than if it varies largely. On the other hand, each variant category was associated with milder fold-changes for genes with lower expression variability, indicating the involvement of regulatory mechanisms that confer robustness to genetic perturbations. We modeled the outcome of these two counteracting phenomena with a non-linear model trained from the data. Future biophysical investigations unraveling these buffering mechanisms could help improve the predictions and, more generally, variant interpretation. Our work also confirmed the importance of nonsense-mediated decay, which underpins a substantial proportion of the outliers, and the need to take tissue-specific transcript isoforms into account when interpreting splice-affecting and nonsense variants, as pioneered by Cummings and colleagues19.
This study has limitations. A basic assumption of AbExp is that an underexpression outlier is caused by a rare variant. However, it is possible that a rare combination of frequent genetic variants causes an expression outlier. Also, damage caused by one variant might be recovered by another variant, e.g. a second frameshift variant recovering the frame after a first frameshift variant. AbExp does not evaluate combinations of variants, which would require more complex modeling, in particular by taking phasing into account. Moreover, AbExp covers the variants up to 5 kb away from transcript boundaries, missing middle-range and long-range enhancers. Future work could expand to such variants, for instance if sequence-based models of gene expression improved at this task42. Also, this work was focused on cis-acting regulation by considering only variants located within or near the genes. The effect of trans-acting gene regulation, which would need a very different modeling paradigm than the one investigated here in order to capture regulatory networks, remains to be addressed.
Predicting gene expression from sequence is a long-standing goal of computational biology that is still far from completion. While existing sequence-based models of cis-regulation are trained across the whole range of expression levels43–45, we have proposed here to focus on extreme expression variations. Extremes may not be well captured by models trained to globally predict gene expression. Also, the biological mechanisms underpinning extreme expression variations may differ from those governing moderate expression variations. However, the relevance of extreme expression for clinical diagnostics and research is high. We hope that the benchmark and algorithms we have developed will foster further research in this direction and aid in the development and validation of methods predicting the impact of large-effect variants on the human transcriptome.
Methods
Underexpression outlier benchmark dataset
GTEx dataset
We downloaded the GTEx RNA-seq read alignment files in the BAM format from dbGaP (phs000424.v8.p2). We excluded tissues with fewer than 50 RNA-seq samples due to insufficient statistical power11. This filter discarded the kidney cortex, leaving 48 tissues.
We obtained SNPs and small indels from the GTEx hg19 variant calls from the file GTEx_Analysis_2016-01-15_v7_WholeGenomeSeq_635Ind_PASS_AB02_GQ20_HETX_MISS15_PLINKQC.vcf.gz from the dbGaP entry phg000830.p1. Structural variants were obtained from Ferraro and colleagues15.
Expression outliers
Gene expression outlier analysis was performed following the aberrant expression module of DROP v1.1.046 based on OUTRIDER11. To this end, we used as reference genome the GRCh37 primary assembly release 29 of the GENCODE project47. A fragment (read pair) was assigned to a gene if and only if both reads were entirely aligned within the gene, allowing for fragments to be assigned to more than one gene. On each tissue separately, genes with an FPKM less than 1 in 95% or more of the samples were considered to be not sufficiently expressed in the tissue and filtered out, as previously described11.
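The expression filter can be sketched in Python as follows. This is a simplified illustration rather than the DROP or OUTRIDER implementation: the function names are hypothetical, and which gene length and which fragment total enter the FPKM computation are assumptions of the sketch.

```python
def fpkm(fragment_count, gene_length_bp, total_fragments):
    """Fragments per kilobase of gene per million counted fragments."""
    return fragment_count / (gene_length_bp / 1e3) / (total_fragments / 1e6)


def is_sufficiently_expressed(fpkm_values, min_fpkm=1.0, max_low_fraction=0.95):
    """Filter out a gene when FPKM < min_fpkm in max_low_fraction or more of the samples.

    fpkm_values: FPKM of one gene across all samples of one tissue.
    """
    n_low = sum(1 for value in fpkm_values if value < min_fpkm)
    return n_low / len(fpkm_values) < max_low_fraction
```

With 20 samples, a gene with an FPKM below 1 in 19 of them (95%) is filtered out, whereas a gene with an FPKM below 1 in 18 of them (90%) is kept.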
OUTRIDER is an expression outlier caller that uses an autoencoder to model RNA-seq fragment count expectations with a negative binomial distribution. Specifically, OUTRIDER models the probability of the observed fragment count $x_{s,g}$ for every gene $g$ in a sample $s$ as:
$$P(x_{s,g}) = \mathrm{NB}\left(x_{s,g} \mid \mu_{s,g}, \theta_{t(s),g}\right) \tag{1}$$
where:
- $\mu_{s,g}$ is the expected fragment count
- $\theta_{t(s),g}$ is the dispersion parameter for the gene $g$ in the tissue $t(s)$ of sample $s$
OUTRIDER further outputs:
- the biological coefficient of variation: $1/\sqrt{\theta_{t(s),g}}$
- the log2-transformed fold-change of the observed fragment count compared to the expected fragment count: $\log_2\left(x_{s,g} / \mu_{s,g}\right)$
- the nominal p-value
- the False Discovery Rate using the Benjamini-Yekutieli method48
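For reference, the Benjamini-Yekutieli adjustment can be written in a few lines of Python. This is a generic sketch of the standard procedure, not OUTRIDER's own code, and the function name is hypothetical. Which set of tests is corrected together (for example, all genes within one sample) is an assumption left to the caller.

```python
def benjamini_yekutieli(p_values):
    """Benjamini-Yekutieli adjusted p-values, returned in the input order."""
    m = len(p_values)
    if m == 0:
        return []
    # Harmonic correction factor c(m) = 1 + 1/2 + ... + 1/m.
    c_m = sum(1.0 / i for i in range(1, m + 1))
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value to enforce monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m * c_m / rank)
        adjusted[index] = running_min
    return adjusted
```

The adjusted values are capped at 1 and are never smaller than the nominal p-values.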
The resulting table was subsetted to individuals with available whole-genome sequencing data and to protein-coding genes. Some of the data points could not be detected as outliers due to lack of statistical power. To reduce the proportion of these insufficiently powered data points in our benchmark, we discarded observations with an expected fragment count $\mu_{s,g}$ less than 450, a minimal value that was empirically estimated to allow recovering half of the two-fold reduction outliers transcriptome-wide at an FDR cutoff of 5%6. We labeled as gene expression outliers all observations with an FDR less than 20%. This relaxed FDR cutoff of 20% led to more robust evaluations and models. Lastly, RNA-seq samples that contained more than 50 outliers were discarded because samples with numerous outliers may be samples for which OUTRIDER could not adequately fit the data or for which gene expression is globally affected, resulting in widespread expression aberrations throughout the genome that cannot be predicted from local sequence variation.
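These filtering and labeling steps could be expressed as in the following Python sketch. The record layout and the function name are hypothetical. The sketch further assumes that outliers are counted per sample after the coverage filter and that both over- and underexpression outliers count toward the limit of 50; neither detail is stated above.

```python
from collections import Counter

MIN_EXPECTED_COUNT = 450        # minimal expected fragment count
FDR_CUTOFF = 0.2                # relaxed FDR cutoff for calling outliers
MAX_OUTLIERS_PER_SAMPLE = 50    # samples with more outliers are discarded


def build_benchmark(observations):
    """Label outliers and drop underpowered observations and noisy samples.

    observations: list of dicts with the keys sample, gene, expected_count,
    fdr and log2fc (one dict per gene and RNA-seq sample).
    """
    powered = [o for o in observations if o["expected_count"] >= MIN_EXPECTED_COUNT]
    labeled = []
    for o in powered:
        is_outlier = o["fdr"] < FDR_CUTOFF
        labeled.append({
            **o,
            "is_outlier": is_outlier,
            # Underexpression: expression lower than expected.
            "is_underexpression_outlier": is_outlier and o["log2fc"] < 0,
        })
    outliers_per_sample = Counter(o["sample"] for o in labeled if o["is_outlier"])
    return [
        o for o in labeled
        if outliers_per_sample[o["sample"]] <= MAX_OUTLIERS_PER_SAMPLE
    ]
```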
OUTRIDER z-score computation
We quantile-mapped the OUTRIDER-fitted negative binomial distributions (Eq. 1) to the standard normal distribution as follows:
$$z_{s,g} = \Phi^{-1}\left(F_{\mathrm{NB}}(x_{s,g} \mid \mu_{s,g}, \theta_{t(s),g})\right)$$
where $\Phi^{-1}$ is the inverse cumulative distribution function of the standard normal distribution and $F_{\mathrm{NB}}$ is the negative-binomial cumulative distribution function.
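A minimal sketch of such a quantile mapping, using only the Python standard library, is given below. It assumes a mean and size parameterization of the negative binomial (variance μ + μ²/θ) and clips the cumulative probability away from 0 and 1 so that the z-score stays finite. Both choices are assumptions made for illustration and may differ from the actual implementation.

```python
import math
from statistics import NormalDist


def nb_logpmf(k, mu, theta):
    """Log-probability of count k under a negative binomial with mean mu and size theta."""
    return (
        math.lgamma(k + theta) - math.lgamma(theta) - math.lgamma(k + 1)
        + theta * math.log(theta / (theta + mu))
        + k * math.log(mu / (theta + mu))
    )


def nb_cdf(x, mu, theta):
    """P(X <= x), obtained by summing the probability mass function from 0 to x."""
    total = sum(math.exp(nb_logpmf(k, mu, theta)) for k in range(int(x) + 1))
    return min(1.0, total)


def quantile_mapped_zscore(x, mu, theta, eps=1e-12):
    """Map the negative binomial quantile of x to a standard normal z-score."""
    p = nb_cdf(x, mu, theta)
    p = min(max(p, eps), 1.0 - eps)  # keep the inverse normal CDF finite
    return NormalDist().inv_cdf(p)


# Toy example: expected count 1000, dispersion parameter 50 (made-up values).
for observed in (500, 1000, 2000):
    print(observed, round(quantile_mapped_zscore(observed, mu=1000, theta=50), 2))
```

An observed count of half the expectation maps to a strongly negative z-score, whereas a count close to the expectation maps to a z-score near zero.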
Precision-recall
We evaluated models using precision-recall curves due to the small proportion of outliers and summarized them with the area under the precision-recall curve49 (AUPRC):
$$\mathrm{AUPRC} = \sum_n (R_n - R_{n-1})\, P_n$$
where $P_n$ and $R_n$ are the precision and recall at the nth top prediction. The AUPRC is thus the average of the precision at each threshold, weighted by the corresponding increase in recall.
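The recall-weighted average precision can be computed directly from ranked predictions, as in the following sketch. It treats every rank as its own threshold, which is a simplification when scores are tied; the function name and toy data are illustrative only.

```python
def average_precision(labels, scores):
    """Sum over ranks n of (R_n - R_{n-1}) * P_n, with predictions sorted by decreasing score."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    true_positives = 0
    prev_recall = 0.0
    ap = 0.0
    for n, (_, label) in enumerate(ranked, start=1):
        true_positives += label
        precision = true_positives / n
        recall = true_positives / n_pos
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


# Toy example: two outliers (label 1) among five predictions.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.1]
print(average_precision(labels, scores))  # 0.5 * 1 + 0.5 * (2/3)
```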
The average precision was computed both on all held-out data together and separately on the held-out data for each of the 26 GTEx tissue types. The GTEx tissue types group together highly similar tissues, notably many regions of the brain. Grouping by tissue type avoids reporting inflated performance driven by a set of highly similar tissues.
Variant filtering, annotation, and gene-level aggregation
Filtering for rare high-quality variants
We considered a variant to be rare if it had a minor allele frequency in the general population ≤ 0.001 based on the Genome Aggregation Database (gnomAD v2.1.1) and was found in at most 2 individuals within GTEx. Variants had to be supported by at least 10 reads and had to pass the conservative genotype-quality filter of GQ > 30. For structural variants, we filtered only on the number of occurrences in the GTEx dataset (in fewer than 2 individuals).
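For small variants, these criteria amount to a simple predicate, sketched below. The dictionary keys are hypothetical field names, and the handling of variants absent from gnomAD (here assumed to carry an allele frequency of 0) is an assumption.

```python
def is_rare_high_quality(variant, max_af=0.001, max_carriers=2, min_depth=10, min_gq=30):
    """variant: dict with hypothetical keys gnomad_af, n_gtex_carriers, depth, gq."""
    return (
        variant["gnomad_af"] <= max_af                 # rare in the general population
        and variant["n_gtex_carriers"] <= max_carriers # found in at most 2 GTEx individuals
        and variant["depth"] >= min_depth              # supported by at least 10 reads
        and variant["gq"] > min_gq                     # genotype quality strictly above 30
    )


# Toy variants (made-up values).
passing = {"gnomad_af": 0.0002, "n_gtex_carriers": 1, "depth": 25, "gq": 60}
too_common = {"gnomad_af": 0.01, "n_gtex_carriers": 1, "depth": 25, "gq": 60}
low_quality = {"gnomad_af": 0.0, "n_gtex_carriers": 1, "depth": 25, "gq": 30}
print([is_rare_high_quality(v) for v in (passing, too_common, low_quality)])
```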
Variant annotation with VEP and AbSplice
For all rare variants, we used Ensembl VEP50 v108 to calculate consequences, LOFTEE loss-of-function annotation, as well as CADD v1.6 scores. Tissue-specific aberrant splicing predictions were generated with AbSplice28.
Definition of gene-level features
For each combination of gene, individual, and tissue, we required a set of features to predict the underexpression outlier label. Therefore, we constructed a set of features starting from the annotations of the underlying rare variants. CADD predictions were max-aggregated per gene across the rare variants. For AbSplice, we kept the maximum absolute score in the corresponding gene and tissue.
Isoform-specific variant annotations, namely LOFTEE and VEP consequences, were aggregated differently depending on whether isoform proportions should be taken into account. When disregarding isoform proportions (i.e. the model “LOFTEE+CADD+consequences”), we only incorporated variants affecting the canonical transcript (canonical according to VEP). The gene was then assumed to have an annotation (LOFTEE classification and each VEP consequence, e.g. stop-gained) when any of the variants in the gene had this annotation. Since the VEP canonical transcript does not differ between tissues, all tissues end up with the same set of gene-level features for these annotations.
For all models incorporating isoform proportions, isoform-specific variant annotations were first weighted by the total proportion of isoforms $i$ in tissue $t$ that they affect:
$$w_{v,t} = \sum_{i \in g} \delta_{v\ \text{affects}\ i}\; p_{i,t}$$
where $\delta_{v\ \text{affects}\ i}$ is 1 if the variant $v$ affects the isoform $i$ and 0 otherwise. The proportion $p_{i,t}$ of an isoform $i$ in a tissue $t$ was estimated as the median TPM proportion across individuals $s$ among all isoforms of the same gene $g$:
$$p_{i,t} = \operatorname{median}_{s}\left(\frac{\mathrm{tpm}(i, t, s)}{\sum_{i' \in g} \mathrm{tpm}(i', t, s)}\right)$$
where $\mathrm{tpm}(i, t, s)$ is the transcript-level TPM obtained from GTEx v8 (dbGaP Accession phs000424.v8.p2). All resulting variant annotations were then max-aggregated per gene and tissue across variants.
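The isoform-proportion weighting and the subsequent max-aggregation can be sketched for a single gene, tissue, and annotation as follows. All function names and the data layout are hypothetical, and skipping individuals with zero total TPM for the gene is an assumption.

```python
from statistics import median


def isoform_proportions(tpm):
    """tpm: {isoform: [TPM per individual]} for one gene in one tissue.

    Returns, per isoform, the median across individuals of its share of the
    gene's total TPM.
    """
    isoforms = list(tpm)
    n_individuals = len(next(iter(tpm.values())))
    shares = {i: [] for i in isoforms}
    for s in range(n_individuals):
        total = sum(tpm[i][s] for i in isoforms)
        if total == 0:
            continue  # assumption: skip individuals in which the gene is not expressed
        for i in isoforms:
            shares[i].append(tpm[i][s] / total)
    return {i: (median(v) if v else 0.0) for i, v in shares.items()}


def weighted_annotation(affected_isoforms, proportions):
    """Total proportion of the isoforms affected by one variant carrying the annotation."""
    return sum(p for i, p in proportions.items() if i in affected_isoforms)


def gene_level_feature(variants, proportions):
    """Max-aggregate the weighted annotation across the rare variants of the gene."""
    return max((weighted_annotation(a, proportions) for a in variants.values()), default=0.0)


# Toy gene with two isoforms measured in three individuals (made-up TPMs).
tpm = {"iso1": [8.0, 6.0, 9.0], "iso2": [2.0, 4.0, 1.0]}
props = isoform_proportions(tpm)
# Two rare variants carrying the same annotation (e.g. stop-gained) on different isoforms.
variants = {"v1": {"iso2"}, "v2": {"iso1"}}
print(props, gene_level_feature(variants, props))
```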
Model training
All 633 individuals were split into six cross-validation groups with approximately equal numbers of underexpression outliers and subtissues.
The DNA-based models were trained to predict the OUTRIDER z-scores. To this end, we used gradient-boosted trees51 from the LightGBM52 framework with default parameters: boosting_type: gbdt, learning_rate: 0.1, max_depth: -1, min_child_samples: 20, min_child_weight: 0.001, min_split_gain: 0, n_estimators: 100, num_leaves: 31, reg_alpha: 0, reg_lambda: 0, subsample: 1, subsample_for_bin: 200000, subsample_freq: 0.
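The listed settings can be written as a parameter dictionary, as in the sketch below. Using the scikit-learn style `LGBMRegressor` wrapper is an assumption (the text states only that gradient-boosted trees from LightGBM were trained to predict z-scores), and `fit_zscore_model` is a hypothetical helper name.

```python
# Default parameters as listed in the text.
LIGHTGBM_PARAMS = {
    "boosting_type": "gbdt",
    "learning_rate": 0.1,
    "max_depth": -1,
    "min_child_samples": 20,
    "min_child_weight": 0.001,
    "min_split_gain": 0,
    "n_estimators": 100,
    "num_leaves": 31,
    "reg_alpha": 0,
    "reg_lambda": 0,
    "subsample": 1,
    "subsample_for_bin": 200000,
    "subsample_freq": 0,
}


def fit_zscore_model(X_train, y_train):
    """Fit a gradient-boosted tree regressor on gene-level features and OUTRIDER z-scores."""
    # Imported lazily so that the parameter dictionary is usable without lightgbm installed.
    from lightgbm import LGBMRegressor

    model = LGBMRegressor(**LIGHTGBM_PARAMS)
    model.fit(X_train, y_train)
    return model
```

Under a cross-validation scheme, such a helper would be called once per training split, with predictions made on the corresponding held-out group.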
For training the model integrating CAT RNA-seq data, the same cross-validation scheme was used as for the DNA-based models, while excluding the CAT and tissues highly related to it from the predicted tissues. When using fibroblasts as CAT, we excluded the non-sun-exposed suprapubic skin, sun-exposed lower leg skin, and cultured fibroblasts. When using whole blood as CAT, we excluded whole blood and EBV-transformed lymphocytes from the predicted tissues. Unlike the DNA-based models, this model was trained to predict underexpression outliers using a logistic regression taking as input i) a binary variable indicating whether the gene is expressed in the CAT, ii) the OUTRIDER z-score of the gene in the CAT, iii) the AbExp prediction for the target tissue, and all three pairwise interaction terms between those three variables.
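The input of this logistic regression can be sketched as a six-dimensional feature vector. Encoding the interaction terms as pairwise products is an assumption, as is the function name; how the z-score is encoded for genes not expressed in the CAT is not specified in the text and is left to the caller.

```python
from itertools import combinations


def cat_model_features(expressed_in_cat, cat_zscore, abexp_score):
    """Return the three main effects followed by their three pairwise products."""
    main = [float(expressed_in_cat), cat_zscore, abexp_score]
    interactions = [a * b for a, b in combinations(main, 2)]
    return main + interactions


# Toy example: gene expressed in the CAT, CAT z-score of -3.0, AbExp score of -1.5.
print(cat_model_features(True, -3.0, -1.5))
```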
Replication in independent datasets
In the two replication datasets, variants were filtered for genotype quality ≥ 30 and read depth ≥ 10 reads. Moreover, rare variants were selected as those with a minor allele frequency ≤ 0.001 in the gnomAD population. These high-quality rare variants were used as candidates for outlier prediction. Gene expression outliers were obtained with OUTRIDER and filtered for a sufficiently large expected number of fragments (μ > 450). Samples with more than 50 outliers were removed.
The mitochondrial disease dataset6 consisted of 311 whole-exome sequencing samples paired with RNA-seq from fibroblasts. After filtering, this dataset contained 501 underexpression outliers across 299 samples. A detailed overview of how many samples, genes, etc. remained after each filtering step can be seen in suppl. table T4.
For the amyotrophic lateral sclerosis (ALS) dataset, we downloaded 244 transcriptomes with matched whole-genome sequencing data from dataportal.answerals.org29. The data consisted of 205 cases diagnosed with amyotrophic lateral sclerosis and 39 control samples. RNA-seq measurements were obtained from iPSC-derived spinal motor neurons. After filtering, the dataset contained 739 underexpression outliers across 244 samples. The corresponding overview of how many samples, genes, etc. remained after each filtering step can be seen in suppl. table T5.
UK Biobank rare variant association testing and phenotype prediction
We analyzed data from 200,593 unrelated Caucasian individuals in the UK Biobank (fields 22006 and 22011), all of whom had genotypes available from exome sequencing and microarrays as well as blood and urine measurements. A detailed list of the phenotypes used can be found in suppl. table T2. For every trait, trait values were inverse-rank-normal transformed, as in the Genebass study31.
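A rank-based inverse normal transformation can be sketched with the standard library as follows. The offset `c` (0.5 here) and the use of average ranks for ties are assumptions; the exact variant used in the Genebass study is not stated in the text.

```python
from statistics import NormalDist


def inverse_rank_normal(values, c=0.5):
    """Map values to normal quantiles of their ranks, (rank - c) / (n - 2c + 1)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    pos = 0
    while pos < n:
        end = pos
        # Extend the block while the next sorted value is tied with the current one.
        while end + 1 < n and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1  # 1-based average rank of the tied block
        for j in range(pos, end + 1):
            ranks[order[j]] = avg_rank
        pos = end + 1
    normal = NormalDist()
    return [normal.inv_cdf((r - c) / (n - 2 * c + 1)) for r in ranks]


# Toy trait values with one tie (made-up numbers).
trait = [3.2, 1.1, 7.5, 1.1, 4.0]
transformed = inverse_rank_normal(trait)
print([round(v, 3) for v in transformed])
```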
Identification of lead trait-associated variants
To control for common variants in the vicinity of each gene, trait-associated variants were obtained from the PanUKBB study53. Using plink v1.954,55, variants with a p-value ≤ 0.0001 were clumped in a 250 kbp window with an LD-threshold of r2 < 0.5 to identify independent lead variants for every trait. The imputed genotypes were then subsetted for these lead variants in a 250 kbp window around each gene.
Application of polygenic risk scores
Polygenic risk scores were selected from the study by Privé et al.56 if available and otherwise from the study by Tanigawa et al.57 (suppl. Table T2). Score files were obtained from the PGS catalog database58 and applied to the imputed genotypes using plink v2.054,59.
Calculation of AbExp and LOFTEE scores
The UKBB whole-exome sequencing data was subsetted for variants with a minor allele frequency ≤0.001 based on gnomAD v3.1.1 and filtered for genotype quality ≥ 30 and read depth ≥ 10 reads. The remaining variants were then annotated using Ensembl VEP50 v108 with the LOFTEE plugin17 and AbExp.
Rare variant association testing
Gene association was tested using a likelihood ratio test between a restricted linear regression model containing only covariates and four linear regression models with additional predictor variables as described in Table 1.
The following covariates were used for all models: sex, age, age², age times sex, age² times sex, the first 20 genetic principal components (field 22009), lead trait-associated variants in a 250 kbp window around the gene of interest, and a polygenic risk score predicting the trait. The polygenic risk scores and lead variants were based on the whole UK Biobank dataset. This mild data leakage may have led to model overfitting. However, since these features were used as covariates in both restricted and full models, the comparison between models remained valid. P-value calibration of the models was assessed by permuting the phenotype once. Identification of significantly trait-associated genes was performed on two thirds of the dataset.
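A likelihood ratio test between nested linear regression models can be sketched as below, assuming Gaussian errors so that the test statistic is n·ln(RSS_restricted / RSS_full), compared against a chi-squared distribution with as many degrees of freedom as added predictors. The function name and the simulated data are illustrative and do not reproduce the models of Table 1.

```python
import numpy as np
from scipy.stats import chi2


def gaussian_lrt(y, X_restricted, X_full):
    """Likelihood ratio test between nested ordinary least squares models.

    Both design matrices must contain the intercept column, and X_full must
    contain every column of X_restricted plus the additional predictors.
    """
    def rss(X):
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        residuals = y - X @ beta
        return float(residuals @ residuals)

    n = len(y)
    statistic = n * np.log(rss(X_restricted) / rss(X_full))
    df = X_full.shape[1] - X_restricted.shape[1]
    return statistic, chi2.sf(statistic, df)


# Simulated example: one covariate and one gene-level score with a true effect.
rng = np.random.default_rng(0)
n = 500
covariate = rng.normal(size=n)
gene_score = rng.normal(size=n)
y = 0.5 * covariate + 1.0 * gene_score + rng.normal(size=n)

X_restricted = np.column_stack([np.ones(n), covariate])
X_full = np.column_stack([X_restricted, gene_score])
statistic, p_value = gaussian_lrt(y, X_restricted, X_full)
print(statistic, p_value)
```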
Phenotype prediction
For the common-variant-based phenotype prediction model, the following features were used: sex, age, age², age times sex, age² times sex, the first 20 genetic principal components (field 22009), and a polygenic risk score predicting the trait. In contrast to the rare variant association testing, we did not include lead variants of the associated genes, as these would lead to too large a number of predictor variables.
We compared two phenotype prediction models integrating rare and common variants. Both are nonlinear models, for which we used gradient-boosted trees51 from the LightGBM52 framework with default parameters: boosting_type: gbdt, learning_rate: 0.1, max_depth: -1, min_child_samples: 20, min_child_weight: 0.001, min_split_gain: 0, n_estimators: 100, num_leaves: 31, reg_alpha: 0, reg_lambda: 0, subsample: 1, subsample_for_bin: 200000, subsample_freq: 0. Both models include the same features as the common-variant-based prediction model. The AbExp phenotype prediction model further included the AbExp scores of significantly trait-associated genes in 48 tissues, while the LOFTEE phenotype prediction model included the number of LOFTEE pLoF variants of significantly trait-associated genes. For each trait, the two models can have a different set of significantly trait-associated genes, as identified by their corresponding RVATs. The phenotype prediction models were trained with 5-fold cross-validation on the remaining third of the dataset. All evaluations were performed on the held-out folds.
Declarations
Ethics approval and consent to participate
No new data were generated for this study. The respective ethics approvals can be found in the corresponding publications (GTEx60, AnswerALS29, mitochondrial disease6). The UK Biobank was approved by the North West Multi-centre Research Ethics Committee (21/NW/0157). Our reference number approved by the UK Biobank is 25214. All UK Biobank study participants gave written informed consent.
The research conformed to the principles of the Declaration of Helsinki.
Consent for publication
All individuals included or their legal guardians provided written consent to share pseudonymized patient data and analysis data, as described in the original publications.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
J.G. conceptualized the project. F.R.H., F.P.C. and J.G. designed the methodology. F.R.H., N.W. and V.A.Y. curated the data. F.R.H. and J.L. performed investigation and provided the software. F.R.H., J.L. and V.A.Y. performed visualizations. F.R.H., V.A.Y. and J.G. wrote the original draft of the manuscript. All authors reviewed and edited the manuscript. J.G. supervised the project with the help of F.P.C. and V.A.Y.
Funding
The German Bundesministerium für Bildung und Forschung (BMBF) supported the study through the Model Exchange for Regulatory Genomics project (MERGE; grant no. 031L0174A to F.R.H. and J.G.); the VALE (Entdeckung und Vorhersage der Wirkung von genetischen Varianten durch Artifizielle Intelligenz für Leukämie Diagnose und Subtypidentifizierung) project (031L0203B to F.R.H. and J.G.); and the ERA PerMed project PerMiM (01KU2016B to J.L., V.A.Y. and J.G.). N.W. is supported by the Helmholtz Association under the joint research school ‘Munich School for Data Science – MUDS’. F.P.C. was funded by the Free State of Bavaria’s Hightech Agenda through the Institute of AI for Health (AIH).
The Genotype-Tissue Expression (GTEx) project was supported by the Common Fund of the Office of the Director of the National Institutes of Health, and by the NCI, NHGRI, NHLBI, NIDA, NIMH and NINDS. This study was supported by data provided by the Answer ALS Consortium, administered by the Robert Packard Center for ALS at Johns Hopkins. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript. UK Biobank was established by the Wellcome Trust medical charity, Medical Research Council, Department of Health, Scottish Government and the Northwest Regional Development Agency. It has also had funding from the Welsh Government, British Heart Foundation, Cancer Research UK and Diabetes UK. UK Biobank is supported by the National Health Service (NHS).
Availability of data and materials
Data availability
We provide the aberrant expression benchmark dataset, isoform proportions and the expected gene expression in GTEx v8 as open-access in the Zenodo repository61 (doi: 10.5281/zenodo.8427312).
Code availability
A Snakemake pipeline to calculate AbExp predictions can be found at https://github.com/gagneurlab/abexp.
The source code for the UK Biobank rare-variant association study and phenotype prediction can be found in the following repositories:
- Main analysis pipeline: https://github.com/gagneurlab/abexp-ukbb-trait-analysis
- Variant clumping: https://github.com/gagneurlab/abexp-ukbb-variant-clumping
- Polygenic risk score calculation: https://github.com/gagneurlab/abexp-ukbb-prs
Acknowledgements
We thank Felix Brechtmann, Alexander Karollus, and Holger Prokisch for fruitful discussions and their valuable feedback on the manuscript. Figure 1 was created with BioRender.com.