Metabolic predictors of phenotypic traits can replace and complement measured clinical variables in transcriptome-wide association studies

Transcriptome-wide association studies (TWAS) can provide valuable insights into biological and disease-underlying mechanisms. For studying clinical effects, availability of (confounding) phenotypic traits is essential. The (re)use of RNA-seq or other omics data can be limited by missing, incomplete, or inaccurate phenotypic information. A possible solution are molecular predictors inferring clinical or behavioral phenotypic traits. Such predictors have been developed based on different omics data types and are being applied in various studies. In this study, we applied 17 metabolic predictors to infer various traits, including diabetes status or exposure to lipid medication. We evaluated whether these metabolic surrogates can be used as an alternative to reported information for studying the respective phenotypes using TWAS. Our results revealed that in most cases, the use of metabolic surrogates yields similar results compared to using reported information, making them suitable substitutes for such studies. The application of metabolomics-derived surrogate outcomes opens new possibilities for reuse of multi-omics data sets, especially in situations where availability of clinical metadata is limited. Missing or incomplete information can be complemented by these surrogates, thereby increasing the size of available data sets. Studies would likely also benefit from the use of such surrogates to correct for potential biological confounding. This should be further investigated. Author summary Transcriptome-wide association studies (TWAS) can be used to associate gene expression levels with phenotypic traits. These associations can provide insights into biological mechanisms including those that underlie diseases. Such studies require molecular profiling data from a large number of individuals as well as information on the phenotypic trait of interest. Biobanks that collect samples and corresponding molecular data, also collect information on phenotypic traits or clinical information. However, this information can be heterogeneous and/or incomplete, or a certain piece of information could be missing entirely. In this study, we apply metabolic predictors to infer various traits, including diabetes status or exposure to lipid medication. We evaluate whether these metabolic surrogates can be used as an alternative to reported information for studying the respective phenotypes using TWAS. Our results reveal that in many cases, the use of metabolic surrogates yields similar results compared to using reported information, making them suitable substitutes for such studies. The possibility of using these surrogate outcomes can thus increase the size of data sets for studies where phenotypic information is incomplete or missing.

Genome-wide association studies (GWAS) have proven to be valuable in uncovering 2 links between genes and a wide range of phenotypic traits. Such findings have lead to 3 the discovery of new disease-related biomarkers and are often the basis for gaining a 4 better understanding of biological processes or disease-mechanisms. Since the 5 introduction of the first GWAS powered by the availability of genome-wide 6 single-nucleotide polymorphism (SNP) profiling, numerous studies have identified 7 thousands of SNP-trait associations [1]. Technological advancements allowing 8 high-throughput profiling of other molecular features, such as transcripts, or DNA 9 methylation sites, also enabled corresponding transcriptome-(TWAS), epigenome- 10 (EWAS), and other ome-wide association studies. 11 Such studies are susceptible to confounding by biological and technical factors that 12 can influence omics profiles and phenotypic traits of interest. However, measured values 13 to correct for such confounding are often not available. As a solution, differences in cell 14 type composition are commonly accounted for using information contained in the DNA 15 methylation profiles themselves, by either reference-based imputation [2] or 16 reference-free methods such as surrogate variable analysis (methods reviewed in [3] 17 and [4]). Other well-known examples of inferring values for possible confounding factors 18 from DNA methylation profiles include sex [5] and smoking status prediction. Bollepalli 19 et al. [6] trained a smoking status classifier using multinomial LASSO regression. 20 Machine learning approaches have also been applied to other omics data types to 21 predict environmental exposures [7]. Recently, Bizzarri et al. [8] trained predictors on 22 proton NMR-based metabolomics (Nightingale Health) data. The authors applied 23 logistic regression using elastic net regularization to train models for various clinical 24 variables, including physiological measures, environmental exposures, and clinical 25 endpoints. They demonstrated the use of their predicted values as metabolic surrogates 26 to complement missing phenotypic data when performing metabolome-wide association 27 studies (MWAS). 28 The value of such predictors is not only evident when complementing missing data 29 to account for technical or biological confounding, but also for using them as outcome 30 variables. These molecular surrogates can be used in association studies in order to link 31 molecular features to clinical phenotypes or exposures. Since identified associations in 32 ome-wide association studies often have only moderate effect sizes, a common approach 33 to detect relevant features are cross-cohort meta-analyses [9]. However, the applicability 34 of meta-analyses can be limited by availability of the respective outcomes of interest. 35 Specific clinical, environmental, or phenotypic traits might not be recorded in every 36 cohort, or the data collection might be based on different protocols, making the 37 reported values for these traits not directly comparable. 38 January 28, 2022 2/19 As more and more multi-omics data sets become available, it becomes possible to 39 make use of molecular predictors to infer phenotypic traits from specific omics layers. 40 We propose their use in subsequent analyses of other omics layers, not only as covariates 41 to account for confounding factors, but also as outcome variables. LifeLines (LL) [10], Leiden Longevity Study (LLS) [11], Netherlands Twin Register 58 (NTR) [12], and Rotterdam Study (RS) [13]. An overview of the study workflow is 59 shown in Fig 1. For the different outcomes (Table 1), we compared TWAS results based 60 on metabolic surrogate outcomes to those based on reported outcomes whenever 61 possible. Additionally, results were compared to literature findings.  white blood cell composition) confounding factors were included (formulas available in 82 S1 Table).

83
For an initial assessment of the performance of each model, we compared the 84 numbers of significant associations and the effect sizes between outcomes and outcome 85 variable types. Additionally, test statistic (t-statistic) bias and inflation were estimated 86 as parameters (mean and standard deviation) of the empirical null distribution using a 87 Bayesian method implemented in the R package bacon [26] (Fig 2). Numbers of 88 identified significant associations (Fig 2A)  and alcohol consumption, no or only few significant gene-trait associations were found. 92 For outcomes where TWAS results based on surrogate outcomes could be compared to 93 results for reported variables, the numbers of identified significant associations averaged 94 across all cohorts were higher for the metabolic surrogate outcomes in three cases, and 95 lower in nine cases. However, the variation across the four cohorts was generally higher 96 than the difference between models employing either reported or surrogate outcome 97 variables. Similarly, high variation across cohorts was observed for the other parameters 98 assessed to evaluate the performance of the models. Absolute effect size averaged across 99 all genes ( Fig 2B) were generally small, with the highest values observed for the 100 outcomes triglycerides and sex. In 10 cases, the mean absolute effect size averaged 101 across cohorts was lower when using metabolic surrogate outcomes instead of reported 102 variables; in two cases, it was higher. We observed relatively low test statistic bias 103 ( Fig 2C) across all outcomes and types of outcome variables. The bias, i.e., the deviation 104 of the empirical null distribution's mean from zero, averaged across cohorts decreased in 105 four cases, increased in three cases, and remained similar in five cases when employing 106 metabolic surrogate outcomes instead of reported variables. Bias in the RS cohort was 107 often higher than in the other cohorts. This may be explained with the differences in 108 population structure. The RS cohort has a higher average age than the other three 109 cohorts [27], indicating higher bias for the studied clinical variables in older populations. 110 Inflation (deviation of the empirical null distribution's standard deviation from one, 111 Fig 2D) was highest for the outcome sex. In most cases, inflation averaged across 112 cohorts remained stable when using different outcome variable types. For the outcome 113 total cholesterol, it slightly decreased when using metabolic surrogate outcomes instead 114 of reported variables; for the outcomes lipid medication and hsCRP, it slightly increased. 115 regression coefficients from gene-wise fitted linear models between two different types of 120 outcome variables. Correlation coefficients were generally high for surrogate outcomes 121 based on best-performing metabolic predictors. For outcomes based on predictors with 122 reported AUC > 0.9 (triglycerides, LDL cholesterol, total cholesterol, HDL cholesterol, 123 and sex) [8], the correlation coefficients averaged across cohorts ranged between 0.71 124 and 0.96. We observed a modest trend for decreasing correlations with decreasing 125 performance of predictors. However, there were exceptions, with sex having highest 126 correlation values, although the predictor's AUC was reported to be lower than those 127 for triglycerides or cholesterol. Lowest similarity of TWAS results with average absolute 128 Pearson r < 0.5 were observed for the outcomes age, white blood cells, smoking, and 129 hemoglobin, the latter having lowest correlation values. We often observed that 130 correlations were lower for the NTR cohort. This could be explained by a technical 131 difference in the metabolic profiles, with NTR missing glutamine [8].

Meta-analyses and replication studies 133
When comparing two alternative outcome variables, a lower or higher number of found 134 significant associations does not necessarily imply that results are better or worse, since 135 the values alone do not indicate if this is due to a reduction or increase of false positive 136 (noise) or true positive findings, respectively. We observed that the TWAS results differ 137 when surrogate values differ from reported values. However, we do not know which set 138 of results is correct, as reported values could contain inaccuracies. Under the 139 assumption that true positive findings, as opposed to false postive results, can be 140 replicated in different cohorts (validating the results), we performed replication studies 141 to determine which outcome variable type is more consistently reflected in the RNA-seq 142 data. We performed leave-one-cohort-out meta-analyses and replication studies for all comparisons (except for hsCRP where only two cohorts were available) using the 144 approach described by van Rooij et al. [27]. For each comparison, four meta-analyses 145 were performed leaving one cohort out each time, and using the left out cohort for a 146 replication analysis. Fig 4 shows the numbers of significant associations found in each 147 meta-analysis (number of meta-analyzed genes) and the respective percentage of 148 replicated genes. Overall, we did not find substantial differences in the numbers of 149 meta-analyzed genes (Fig 4A) except for the outcomes lipid medication, (high) age, and 150 current smoking. While more genes were meta-analyzed when using the metabolic 151 surrogate for lipid medication, the reported variable yielded more meta-analyzed genes 152 for age and smoking. For the latter two outcomes, results based on metabolic surrogates 153 could not be replicated (Fig 4B) while on average 11-15% of results based on reported 154 outcomes could be replicated. For these two outcomes we had also observed highest 155 differences between number of significations associations (Fig 2A). For other outcome 156 variables, the percentage of replicated genes was quite similar between outcome variable 157 types, but the cohort which was left out for the meta-analysis had a strong impact on 158 the results. Replication rates were generally low with the highest average replaction rate 159 observed for triglycerides with just under 30%. For a number of outcomes, associations 160 could hardly be replicated: LDL-associated cholesterol, hemoglobin, and alcohol 161 consumption. This is in line with the fact that almost no significant associations were 162 found for these outcomes in the individual cohorts (Fig 2). cohorts. Meta-analyzed results had an overlap of 38-69% and the order of significantly 174 enriched pathways was highly comparable (see S1 Appendix). The results for high age, 175 current smoking, hemoglobin, and LDL cholesterol demonstrated a lower overlap than 176 other outcomes and showed higher variation across the four cohorts compared to other 177 outcomes. This is partially in line with the comparison of gene-wise linear models from 178 TWAS (Fig 3), which showed that results for the outcome hemoglobin based on 179 reported and inferred values were not correlated, and high age and current smoking 180 were only moderately correlated. Since hardly any significant associations were found 181 for hemoglobin (see Fig 2A), the observed signal for this outcome was generally very low 182 in the studied cohorts, independent of the type of outcome variable. It is surprising to 183 observe that almost no significantly enriched pathways were observed for the outcome 184 sex, even though many gene-trait associations were found and could be replicated in the 185 meta-analysis and replication approach. found when meta-analyzing multiple cohorts (compare Fig 2 and Fig 4). Top-ranked 198 positively enriched Reactome pathways from gene-set enrichment analysis (S1 199 Appendix), including, e.g., innate immune system, signal transduction, and infectious 200 disease, have been linked previously to chronic alcohol drinking [34]. "Formation of a pool of free 40S subunits" which participate in the Eukaryotic translation 214 initiation [29] and were previously shown to be enriched in a high-cholesterol and 215 high-fat diet induced hypercholesterolemic rat model [35]. The metabolic predictors for 216 these outcomes are directly related to metabolic markers measured on the Nightingale 217 platform and had shown high performance (AUC ⩾ 0.95) [8]. It is expected that results 218 based on predicted outcomes will depend on the accuracy of the prediction. Accordingly, 219 we observed a slightly lower correlation between reported outcomes and metabolic 220 surrogates for molecular predictors that were known to have lower accuracy (Fig 3). In 221 the TWAS for lipid medication and BMI/obesity, the molecular predictors yielded even 222 more significant gene-trait associations than the reported outcomes (Fig 2A) and a 223 higher number of significantly enriched pathways were obtained when using the 224 metabolic surrogates (Fig 5). For lipid medication, this may be related to inaccurate 225 recording of this trait in the questionnaires used. For BMI, this may be explained by 226 the more direct capturing of metabolic processes that are associated with obesity as a 227 combined measure of BMI and waist circumference, because BMI alone is not a perfect 228 indicator of metabolic health [36]. Similarity of results based on surrogate and reported 229 value for sex was high across all assessment parameters, but GSEA did not yield 230 significantly enriched pathways in most cases. It is possible that the genes significantly 231 associated with sex belong to too many different pathways and/or that some genes 232 within a pathway have a positive association while others have a negative association 233 resulting in a failure to identify positively or negatively enriched pathways. For the 234 outcomes total cholesterol and hsCRP, regression coefficients were moderately correlated 235 and GSEA results were very similar. For the outcome white blood cells overlap of 236 significantly enriched pathways in GSEA was smaller. However, the order of top-ranked 237 pathways was similar (see S1 Appendix). The application of metabolic surrogates for 238 outcomes that were not reported in the data and the comparison of TWAS and GSEA 239 results with literature show that these surrogate outcomes allow to transfer information 240 from one data set (the training data) to another, thereby facilitating to study 241 phenotypes or exposures in data sets which would otherwise not be possible. performance of the metabolic predictors for these outcomes, or a lower biological signal 251 for the respective clinical outcomes and thus increased noise in the studied data. It is 252 known that aging is reflected in transcriptomics data [37], but the metabolic predictor 253 for high age trained on binarized data (⩾ 65 y.o.) might not be an ideal surrogate to phenomenon is known from epigenetic clocks whose age predictions can differ from 259 chronological age, and different clocks can reflect different aspects of biological age [38]). 260 While many of the top-ranked pathways (S1 Appendix) for smoking were found by both 261 outcome variable types, some of the pathways were solely found by using either reported 262 smoking status ("smoking current") or metabolic surrogate("s current smoking").

263
Several pathways related to translation initiation ("Formation of the ternary complex, 264 and subsequently, the 43S complex", "Translation initiation complex formation", 265 "Ribosomal scanning and start codon recognition") were only significantly enriched when 266 using the reported variable as outcome. Translation of mRNA is known to be 267 dysregulated in cancers [39]. Pathways only enriched when using the metabolic 268 surrogate include "Platelet activation, signaling and aggregation" and "Hemostasis".

269
Increased platelet aggregation has been reported in smokers [40] and platelet-dependent 270 thrombin levels were shown to be increased in smokers and following smoking [41]. This 271 possibly indicates that the reported smoking behavior captures effects of long-term information captured by the predictors. It is also possible that different omics types 279 capture different effects better, e.g., short-term and long-term effects. In that case, 280 combining reported outcome variables, and/or molecular surrogates from different omics 281 layers could be very useful, not only to study the effect of a certain exposure, but also 282 to better adjust for confounding factors. reported clinical (meta)data. This allows for inclusion of more cohorts in meta-analyses, 290 even when outcomes of interest were not reported for all cohorts. This approach also 291 increases possibilities to study clinical outcomes by allowing to infer important 292 confounding factors. Especially investigations that rely on reuse of existing data, .e.g. in 293 the case of rare disease studies which often also suffer from low sample sizes, will benefit 294 from this approach.

302
RNA-seq data was generated by the BBMRI-NL Biobank-based Integrative Omics 303 Study (BIOS) Consortium at the Human Genotyping facility (HugeF) of ErasmusMC, 304 the Netherlands. RNA sample processing and sequencing is described in detail by 305 Zhernakova et al. [15]. Briefly, total RNA was extracted from whole blood, depleted of 306 globin transcripts, and paired-end sequencing of 2x50-bp reads was conducted using the 307 Illumina HiSeq 2000 platform. Read alignment to reference genome hg19 was performed 308 using STAR (v2.3.0). We used the "Freeze2 unrelated data sets", which contain 309 maximum sets of unrelated individuals and are available within the BIOS workspace at 310 the SURF Research Cloud via the R package BBMRIomics v3.4.2 [16]. procedures performed in studies involving human participants were in accordance with 324 the ethical standards of the institutional and/or national research committee and with 325 the 1964 Helsinki declaration and its later amendments or comparable ethical 326 standards [11]. The Netherlands Twin Register [12]  variables, they were binarized as follows. For smoking status, "current smoker" was 354 coded as 1, and "former-smoker" and "non-smoker" were coded as 0; for sex, "male" 355 and "female" were coded as 1 and 0, respectively; for lipid medication, "statins" were 356 coded as 1, and "no" and "yes, but no statins" were coded as 0. In order to be able to 357 compare effect sizes in TWAS, all clinical variables were standardized to zero-mean and 358 unit-variance (z-score normalization).   Leave-one-cohort-out meta-analyses and replication studies in left out cohort were 404 performed as described in [27].