The genes controlling normal function of citrate and spermine secretion are lost in aggressive prostate cancer and prostate model systems

Background: Secretion of the metabolites citrate and spermine into the prostate lumen is a unique hallmark of normal prostate epithelial cells. However, the identity of the genes controlling citrate and spermine secretion remains mostly unknown despite their obvious relevance for progression to aggressive prostate cancer.

Materials & Methods: In this study, we have correlated simultaneous measurements of citrate/spermine and transcriptomics data. We have refined these gene correlations in 12 prostate cancer cohorts containing 2915 tissue samples to create a novel gene signature of 150 genes connected with citrate and spermine secretion. We further explored the signature in public data, interrogating over 18 000 samples from various tissues and model systems, including 3826 samples from prostate and prostate cancer.

Results: In prostate cancer, the expression of this gene signature is gradually lost in tissue from normal epithelial cells through PIN, low grade (Gleason <= 7), high grade cancer (Gleason >= 8) and metastatic lesions. The accuracy of the signature is validated by its unique enrichment in prostate compared to other tissues, and its strong enrichment in epithelial tissue compartments compared to stroma. Several zinc-binding proteins that have not previously been investigated in the prostate are present in the gene signature, suggesting new mechanisms for controlling zinc homeostasis in citrate/spermine secretion. However, the absence of the gene signature in all common prostate normal and cancer cell-lines, as well as prostate organoids, underlines the challenge of studying the role of these genes during prostate cancer progression in model systems.

Conclusions: A large collection of transcriptomics data integrated with metabolomics identifies the genes related to citrate and spermine secretion in the prostate, and shows that the expression of these genes gradually decreases on the path towards aggressive prostate cancer. In addition, the study questions the relevance of currently available model systems for studying metabolism in prostate cancer development.


Introduction
A unique hallmark of normal prostate epithelial cells is their ability to secrete large amounts of the metabolite citrate. While cells from other human organs convert citrate to isocitrate for oxidation and energy production in the TCA cycle, citrate conversion in the prostate epithelium is partly blocked, leading to accumulation of citrate [1]. It has been reported that citrate/isocitrate ratios in the prostate are 40/1, compared to 10/1 in other normal organs [2]. In the prostate, excess citrate is secreted out of the cell into the lumen of the prostatic glands, where it is a major constituent of the prostatic fluid and essential for normal prostate organ function.

Almost all types of prostate cancer originate from prostate epithelial cells, and nearly all prostate cancer cells lose their ability to secrete citrate during the development of malignant phenotypes [1]. A reduced level of citrate has therefore been suggested as a biomarker, both for cancer detection and for identification of cancers with more aggressive phenotypes [3][4][5]. Despite this demonstrated importance for normal prostate function and prostate cancer progression, research in this field has been markedly neglected, and the molecular mechanisms controlling these functions remain mostly unknown. Nevertheless, a few dedicated research groups have managed to gain some insights. First, high mitochondrial levels of zinc in the prostate block citrate conversion, and the regulation of zinc transporter genes, in particular ZIP1 (SLC39A1), has been suggested to play a role in regulating prostate zinc levels [6,7]. Second, the metabolite aspartate is a possible source of increased citrate in prostate cells [8,9]. Aspartate is a precursor for oxaloacetate, which in turn is a precursor in citrate synthesis and works as a carrier of the carbon groups oxidized stepwise in the mitochondria during the TCA cycle. Increased levels of aspartate could thus lead to increased levels of citrate when the citrate/isocitrate conversion is blocked. Third, metabolites synthesized in the polyamine pathway are regulated differently in the prostate [10][11][12]. Some interest has been directed to the polyamine spermine, due to its extremely high correlation with citrate [13]. Spermine and citrate have opposite polarity, and it has been speculated that they may form a complex which is secreted into the prostate lumen [13,14].

These mechanisms, though intriguing, represent only small steps towards a wider perspective on this unique prostate function, which is not completely understood. This includes the genes and proteins involved, as well as the connections between them. Most mechanisms suggested so far have been studied in cell-lines and animal models, which are not able to fully capture the complex interplay within human prostate tissue. However, a large amount of genome-wide -omics data on prostate tissue, as well as its model systems, has accumulated since the introduction of the genomic era more than 20 years ago. These data resources have yet to be fully exploited for research on citrate and spermine secretion in the prostate. In this study we address this hallmark property of the prostate by performing extensive bioinformatic analysis on data resources from the public domain. We do this by first creating a robust gene signature representing citrate and spermine secretion in the prostate. We then explore and validate this gene signature in multiple datasets, cohorts and public resources. In total we analyzed 32 datasets with 18020 samples, of which 3826 were prostate samples including normal/normal-adjacent (epithelium and stroma) tissue, high/low-grade cancer tissue, metastatic samples and model systems.
Our study identifies genes strongly associated with citrate and spermine secretion in the prostate and explores their behavior during cancer progression and in model systems. Our results validate previously suggested mechanisms, but also reveal new central genes and mechanisms related to citrate and spermine with potential importance for this unique and intriguing prostate hallmark.

Results

All datasets used in this study are listed in Supplementary File 1. It includes the dataset ID and abbreviation used in the main text, sample description, dataset accession and references.

Concentrations of the metabolites citrate and spermine are highly correlated across tissue samples

Our research group has previously generated a dataset with 129 normal and cancer tissue samples from the prostate, where concentrations of 23 metabolites and the expression of 14161 unique genes were measured on the exact same samples (Bertilsson, dataset ID 1). Similar to a previous report [12], we observed a strong correlation (r=0.95) between concentrations of the metabolites citrate and spermine across the samples (Figure 1A and B). To simplify calculations, we use the average concentration profile for citrate and spermine throughout the rest of this study (Figure 1C). We use the abbreviation CS (Citrate and Spermine) to denote all our results based on this average concentration profile.

Figure 1: Correlations between metabolites citrate and spermine. A) Correlations between all 23

[...] levels in the prostate by calculating the Pearson correlations between the CS concentration profile and each of the 14161 unique genes measured in Bertilsson (dataset ID 1). The correlation analyses were performed on the 95 cancer samples in the dataset. We selected the 150 genes with the highest positive correlation, which we termed the initial gene module. We further assessed the integrity of this gene module by creating a Correlation Module Score (CMS, see Methods), where a high CMS means that strong intra-correlation is present between the genes in the module, resulting in a high module integrity. Conversely, a low CMS indicates that there is low correlation between the genes in the module, and that the module integrity is lost. Moreover, if the gene module has a significantly high CMS in a dataset, it indicates that the biological process(es) which this module represents are functionally active for the samples in this dataset.

To test for significance, we generated 100 random CMSs by shuffling the CS metabolite levels between the respective samples in Bertilsson (dataset ID 1). The resulting CMS was statistically significant when compared to CMSs from random gene modules (CMS=0.34, p=0.007, lognormal distribution) (Figure 2A).

We then tested whether the integrity of the initial gene module was significantly preserved in 11 additional datasets (dataset ID 2-12), including a total of 2638 tissue samples from normal prostate and prostate cancer (Supplementary File 1). We calculated and compared the CMSs for the initial module, as well as the 100 random modules, in all 11 datasets. The CMS from the initial gene module was validated in all 11 cancer datasets, and in 7 out of 8 datasets with normal samples (lognormal test) (Figure 2B, Supplementary Table 1 and 2). We thus defined the initial module as our initial gene signature associated with CS concentrations in the prostate. We used this initial CS signature as a starting point for further refinement.

We also performed the above procedure using the 40 normal samples in Bertilsson; however, the resulting gene module was not significant (CMS=0.37, p=0.14, lognormal distribution), and could not be validated in any of the additional datasets (Supplementary Figure 1, Supplementary Table 3 and 4). This is probably due to an insufficient number of normal samples in Bertilsson to perform a robust correlation analysis.

Refinement across 12 datasets produces a CS gene signature with improved integrity

Corresponding gene expression and metabolite measurements are available only for Bertilsson (dataset ID 1). We thus implemented a bioinformatics strategy to identify a refined CS gene signature with improved integrity across all 12 datasets (2794 samples). We used the initial CS signature from Bertilsson as a starting point, and then used results from cancer samples across the other 11 datasets to nominate better gene candidates to replace genes from the initial CS signature (see Methods). When evaluating the new top 150 candidate genes, we found that 74 of the 150 genes in the initial CS signature had been replaced by new genes to improve integrity across all datasets (Figure 2C). Of note, the refined CS signature improved CMS in both cancer and normal samples in all 11 additional datasets. This consistent improvement in normal samples (including normal samples from Bertilsson) affirms the relevance of the signature, since these samples were not used to generate or refine the CS signature. In addition, only a marginal reduction in CMS was observed for the cancer samples in Bertilsson (Figure 2D). We thus conclude that the refined CS signature (referred to as the CS signature from now on) represents a more robust signature with improved integrity compared to the initial CS signature.

The CS signature displays unique enrichment in prostate samples and negative correlation to stromal tissue compartments

Figure 2: Integrity of initial and refined citrate-spermine (CS) gene signatures. A) CMS (Correlation
Having established a CS signature of 150 genes associated with citrate and spermine secretion in the previous sections, we expanded our validation to additional datasets and sample types. For this analysis we used single-sample Gene Set Enrichment Analysis (ssGSEA) [43], a method used to assess whether a specific gene signature is enriched or depleted in a single sample. Since citrate and spermine secretion is a highly specific function associated with prostate epithelium, higher ssGSEA scores in tissue samples from prostate compared to samples from other tissue types would confirm its functional relevance. Further, since prostate cancer samples are a heterogeneous tissue mixture of normal prostate epithelium, cancer and stroma, ssGSEA scores should be inversely related to the content of stroma tissue in the samples, since stromal cells are unable to secrete citrate or spermine.

We first compared ssGSEA scores for cancer and normal samples in the prostate-specific TCGA dataset (dataset ID 5) to all cancer and normal samples profiled in the TCGA-complete resource (11093 samples from 33 cancer types, dataset ID 21). We observed strongly elevated ssGSEA scores for both prostate cancer and normal samples compared to cancer and normal samples from other tissues (Figure 3A). We also tested the CS signature on average expression profiles for 53 normal tissues from the GTEx portal (dataset ID 22) (Figure 3B) and 1829 CAGE expression profiles from cell-lines and tissues in the FANTOM Consortium (dataset ID 23) (Figure 3C), where one tissue profile was from adult normal prostate. In both datasets, normal prostate tissue showed the highest ssGSEA scores.

We next correlated the CS signature ssGSEA scores to a previously identified gene signature for stroma content in prostate tissue [44] in datasets 1-12. We observed a strong inverse correlation between CS and stroma signatures for normal samples in all datasets (Figure 3D), with the exception of Mortensen (dataset ID 9), which contained laser dissected epithelium/cancer with minimal amounts of stroma. The correlation was weaker in cancer samples. This is expected, since prostate cancers vary in their ability to secrete citrate and spermine. The negative association between CS signature and stroma was also confirmed using spatial transcriptomics data from Berglund (dataset ID 20), where pixels with high and low ssGSEA scores overlaid regions with epithelium/lumen and stroma, respectively (Figure 3E, Supplementary Figure 2).

In summary, the CS signature shows strong associations with expected properties of the prostate epithelium, which strengthens the biological validity of the genes in the signature.

Figure 3: Validation of citrate-spermine (CS) gene signature. A) CS [...] probably lack significant amounts of stroma. E) CS signature ssGSEA scores on spatial transcriptomics data from tissue slice 3.2 in Berglund (dataset ID 20). The colorbar indicates enrichment of the ssGSEA score in each pixel. The corresponding pathological tissue image can be found for image 3.2 in Supplementary Information - Supplementary Figure 1b from Berglund et al. [33], and comparison shows a strong overlap of high and low CS signature scores with epithelial/lumen and stroma tissue compartments, respectively. Results from all 12 images in Berglund are shown in Supplementary Figure 2.

The CS gene signature is gradually lost from normal samples through low grade, high grade and metastatic lesions

Having established the relevance of the CS signature to prostate citrate and spermine secretion, we next investigated how this signature associates with the various stages of prostate cancer aggressiveness. Low citrate concentrations have previously been associated with high-grade prostate cancer [4]. We identified nine datasets (1779 prostate cancer samples) where cancer samples had been classified as either high-grade or low-grade, or assigned a Gleason score, the standard form of prostate cancer grading. We found significant changes in CS signature ssGSEA scores between high- and low-grade cancers in all eight datasets where samples with Gleason score 4+3 were classified as high-grade (613 high- and 737 low-grade samples), and in six out of eight datasets where samples with Gleason score 4+3 were classified as low-grade (578 high- and 1258 low-grade samples) (Figure 4A, Table 1). The reduction in CS signature scores was not due to increased tumor fraction in high-grade samples (shown in Table 1). These results support the previous findings and suggest that the changes previously observed for citrate at the metabolite level are accompanied by changes in gene expression. A further significant reduction in CS signature scores was observed in metastatic compared to cancer and normal samples in 8 datasets (157 metastatic samples) (Figure 4B, Table 2). Note that the lower CS signature scores in normal samples compared to cancer samples are due to the high level of stromal tissue in normal prostate samples [8, 44] (Supplementary Figure 3). Overall, our results indicate that the genes associated with citrate and spermine secretion are upregulated in the normal epithelium, and then gradually downregulated with tumor progression from early low-grade lesions, through high-grade cancers and finally metastatic tumors. This trajectory is nicely illustrated by the results from Tomlins (dataset ID 18) (Figure 4C), where laser dissection has been used to purify stroma, normal epithelium, Prostatic Intraepithelial Neoplasia (PIN, an early pre-cancerous prostate lesion), cancer and metastatic tissues. In line with this trajectory, the PIN lesions show an average CS signature score between the levels of normal epithelium and cancer. Though the ability of cells to secrete citrate and spermine seems to be lost in metastatic samples, metastatic tissue from prostate cancer still contains traces of the CS gene profile compared to metastatic tissue originating from other organs (Hsu, dataset ID 19) (Figure 4D).
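The single-sample scores compared in this section are rank-based enrichment scores. The sketch below is a simplified illustration in the spirit of ssGSEA [43], not the implementation used in the study: genes are ranked by expression within one sample, and the score sums the running difference between a rank-weighted cumulative distribution of signature genes and the cumulative distribution of all other genes. The function name `ssgsea_score`, the dictionary layout and the weighting exponent `alpha=0.25` are assumptions made for illustration, and any normalization applied by the actual tools is omitted.

```python
def ssgsea_score(sample, gene_set, alpha=0.25):
    """Simplified single-sample enrichment score (illustrative only).

    sample:   dict mapping gene name -> expression value in one sample
    gene_set: set of gene names forming the signature
    alpha:    exponent applied to the rank weights (assumed value)
    """
    # Order genes from highest to lowest expression within the sample.
    ordered = sorted(sample, key=sample.get, reverse=True)
    n = len(ordered)
    in_set = [gene in gene_set for gene in ordered]
    n_hits = sum(in_set)
    if n_hits == 0 or n_hits == n:
        raise ValueError("gene set must cover some, but not all, measured genes")

    # Signature genes are weighted by their rank (n for the top gene, 1 for the last).
    weights = [(n - i) ** alpha if hit else 0.0 for i, hit in enumerate(in_set)]
    total_weight = sum(weights)
    miss_step = 1.0 / (n - n_hits)

    p_hit = p_miss = score = 0.0
    for weight, hit in zip(weights, in_set):
        if hit:
            p_hit += weight / total_weight   # weighted ECDF of signature genes
        else:
            p_miss += miss_step              # ECDF of the remaining genes
        score += p_hit - p_miss              # accumulate the running difference
    return score


# Toy samples (hypothetical values): the signature {a, b} is highly expressed
# in the first sample and lowly expressed in the second.
signature = {"a", "b"}
enriched = ssgsea_score({"a": 4.0, "b": 3.0, "c": 2.0, "d": 1.0}, signature)
depleted = ssgsea_score({"a": 1.0, "b": 2.0, "c": 4.0, "d": 3.0}, signature)
```

A sample in which the signature genes sit near the top of the ranking receives a positive score, and one in which they sit near the bottom receives a negative score, which is the behavior the grade and metastasis comparisons rely on.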

Figure 4: Citrate-spermine (CS) signature, tumor grade and metastasis. A) CS signature ssGSEA scores for high-grade (Gleason higher than or equal to 4+3) and low-grade (Gleason less than or equal to 3+4) cancers in 9 datasets. The samples compared correspond to the HG1 and LG1 groups in Table 1. The scores were centered in each cohort before plotting to better visualize similarities between datasets. B) CS signature ssGSEA scores for metastatic, cancer and normal samples in 7 datasets. The scores were centered and normalized to range 0-1 before plotting to better visualize similarities between datasets. The low scores in normal samples are due to the higher content of stroma in normal samples (Supplementary Figure 3). C) CS signature ssGSEA scores from five laser dissected prostate and prostate cancer tissue types from Tomlins (dataset ID 18

[...] and MT2A), one zinc-transporter (SLC39A10) and the two genes AZGP1 (alpha-2-glycoprotein 1, zinc-binding) and ANPEP (alanyl aminopeptidase, membrane), the latter containing a consensus sequence known from zinc-binding metalloproteinases.

We then used DAVID [45,46] and Enrichr [47,48] to perform Gene Ontology (GO) analysis on the CS signature (Supplementary Figure 4). For both GO tools the most significant terms were related to zinc/metal-ion binding and branched chain amino acid (leucine, isoleucine and valine) degradation. When inspecting the CS signature genes at NCBI, we noticed that many genes did not have any associated functional description, while at the same time having high prostate-specific expression. A potential challenge for our GO analysis is that the prostate-specific function of many of the genes in our CS signature is not known. We therefore applied GAPGOM, an alternative GO analysis tool where each gene is associated with functions based on consensus ontology terms for genes it is correlated with [49] (Methods, Supplementary File 4). This analysis revealed strong gene ontology terms related to mitochondria, ER (endoplasmic reticulum), Golgi, lysosome and exosome cellular components, which would fit well with a secretory path of citrate. When we performed Principal Component Analysis (PCA) to group genes according to their GO terms (see Methods), the six metallothioneins formed a distinct cluster strongly related to zinc/metal-ion binding (Figure 5A, Supplementary File 4). In addition, the metallothioneins also showed strong associations with the mitochondrial respiratory complex and electron transport chain.

We also performed network analysis on the 150 genes in the CS signature (Supplementary Figure 5), where the genes ALOX15B, RAB27A, ENDOD1, SLC45A3, NCAPD3, EHHADH, ACAD8, AFF3, NANS and YIPF1 were identified as the top 10 hubs in the network. These hub genes were all ranked among the top 20 in the CS signature and had significant GO terms associated with branched chain fatty acid catabolism/degradation. Finally, we compared CS signature ssGSEA scores with cell-type-specific gene signatures from a single-cell RNA-Seq study on normal prostate tissue [50]. The CS signature scores showed high correlation with prostate luminal cells (the cells mainly responsible for secretion), but not basal cells, and a negative correlation to stromal fibroblasts and smooth muscle cells (Figure 5B). Moreover, the luminal gene signature also included several of the metallothionein genes, corroborating the potential importance of these genes.

The CS signature is depleted in prostate model systems

Revealing the detailed function of the genes highlighted in the previous section will require further wet-lab experiments. We therefore wanted to identify suitable model systems where the possible role of these genes and their mechanisms could be further studied. Unfortunately, we found the identification of model systems to be a challenge.
We tested the CS signature on publicly available gene expression datasets from various model systems for prostate cancer using ssGSEA. In Prensner (dataset ID 10), 58 expression profiles from cell-lines were analyzed together with tissue and metastatic samples, including the four most common prostate cancer cell-lines PC3, DU145, LNCaP and VCaP, and the normal prostate cell-lines RWPE and PrEC. This unique collection of sample types makes the Prensner dataset an excellent reference for comparisons. CS signature ssGSEA scores from the Prensner cell-lines were consistently at the low end compared to both tissue and metastatic samples (Figure 6A), indicating that cell-lines derived from prostate do not secrete citrate and spermine. This observation was confirmed in Taylor [...]

To enable the comparison of ssGSEA scores between datasets, we implemented a method to adapt ssGSEA scores from different datasets to the Prensner dataset (see Methods). To demonstrate the utility of this implementation, we first adapted the CS ssGSEA scores from Taylor to Prensner, since both datasets included tissue samples (normal and cancer), metastatic samples as well as cell-lines. The CS ssGSEA scores were highly concordant down to the level of different cell-types between the two cohorts after adaptation (Supplementary Figure 9). We then used the adaptation strategy to compare datasets from organoids and mouse models (dataset IDs 27-31) to the Prensner dataset to investigate whether they would resemble tissue or cell-lines. None of the model systems analyzed seemed to produce CS ssGSEA scores at levels similar to prostate epithelial tissue or low-grade primary cancer tissue. Instead, their CS scoring range is typically comparable to that of the cell-lines from Prensner (Figure 6D). Overall, these results call attention to the challenge for current prostate model systems to recapitulate the function of normal prostate epithelium in terms of citrate and spermine secretion.

Figure 6: Model systems. A) CS signature ssGSEA scores for all sample types in Prensner (dataset ID [...] Figures 6-9).
Discussion

Expanding the view of zinc-binding proteins and zinc-transporters in the prostate
Six metallothioneins were identified as part of the CS gene signature in this study and constituted a functionally distinct cluster particularly associated with zinc-binding. Metallothioneins are low molecular weight, metal-binding proteins localized to the Golgi, and have different expression in normal prostate and prostate cancer [51,52]. In addition, expression levels can change in response to zinc stimuli in prostate cell-lines [53]. It has also been speculated that metallothioneins can affect mitochondrial function, since they are small enough to enter the mitochondrial membrane bilayer carrying zinc [54]. This suggestion fits with the GO associations of metallothioneins to mitochondria, respiratory complex and the electron transport chain. The other potential zinc-binding genes discovered in this study have not been studied previously in the context of citrate accumulation and secretion in the prostate.

The zinc-transporter SLC39A1 (ZIP1) was not a part of the CS signature, though this gene has previously been shown to be important for zinc homeostasis in prostate cells. There could be several reasons for this discrepancy. First, this study only measures genes at the transcript level, while it is possible that SLC39A1 is regulated at the protein level [6]. However, expression levels from genes displayed in the NCBI-Gene resource show that SLC39A1 is not prostate specific, but expressed in most tissues. Though SLC39A1 has the ability to transport zinc within prostate tissue, it may not be the main determinant of prostate specific zinc regulation in vivo. In this context it should also be noted that none of the metallothioneins displayed prostate specific expression in the NCBI-Gene database. Of the zinc transporters, the most prostate specific expression is observed for SLC39A2, SLC39A6, SLC39A7 and SLC39A10, where SLC39A10 was part of the CS signature. SLC39A6 and SLC39A7 (though not part of the signature) were also positively correlated with citrate in our analysis, while SLC39A1 and SLC39A2 showed no correlation. Of the potential zinc-binders, AZGP1 shows the most prostate-biased expression, but the function of this gene is unknown. Overall, more targeted experiments are needed to identify the genes that control zinc levels in the prostate, and how they perform their function.

Previous studies have shown that prostate epithelial cells can utilize glucose and aspartate as carbon sources for citrate production. The gene GOT2 (glutamic-oxaloacetic transaminase 2, also named mAAT, mitochondrial aspartate aminotransferase, in previous reports) was shown to be responsible for the conversion of aspartate to oxaloacetate and citrate in the mitochondria. GOT2 was a part of the CS signature, in addition to SMS (spermine synthase), and both GOT2 and SMS showed high tissue expression in prostate according to NCBI-Gene. For these two genes our results fit with previous mechanistic knowledge [9,55].

Very little is known about the role of branched chain amino acid catabolism in normal prostate function. All three amino acids leucine, isoleucine and valine are upregulated in prostate cancer [12], indicating their relevance for cancer transformation, and this fits with our finding that the degradation of these amino acids is important to maintain normal prostate function. Branched chain amino acids can possibly work as precursors for citrate production [56], and leucine can act as a sensor for mTOR-pathway activation [57], which is generally regarded as an important pathway in prostate cancer development [58].

Expression of CS signature genes are with prostate cancer progression
There has been a discussion on whether the reduced citrate levels observed in prostate cancer are merely a result of a reduction in luminal glands leading to reduced amounts of prostatic fluid [59]. However, our results clearly show that changes in citrate and spermine levels are accompanied by changes at the gene expression level, and this study supports the hypothesis that changes in genes responsible for citrate and spermine secretion are important for the cell transformations leading to cancer. Further research into these mechanisms can lead to new discoveries and treatment targets in the management of prostate cancer.

The challenge of finding a model system for normal prostate

We could not find CS signature enrichment comparable to tissue samples in any of the cell-lines and model systems we investigated. Androgen responsive cell-lines seem to maintain some traces of prostate specific functions compared to other model systems, but their CS signature enrichments are far below that of in vivo tissue, indicating that most of the prostate-lineage specificity is lost. Thus, these observations question the relevance of these model systems for studying normal prostate function and transitions from normal cells to prostate cancer cells. The secretory function can possibly be triggered by proper stimuli like dihydrotestosterone (DHT), testosterone or prolactin [60][61][62]. For example, LNCaP (but not PC3) cells were able to secrete citrate when stimulated by DHT [61]. However, it was also observed that the rate of citrate consumption by the TCA cycle increased proportionally in the same experiment [61]. Since these experiments were performed before the era of transcriptomics, it would be interesting to repeat them with accompanying transcriptomics analysis to identify the genes that change during such stimuli.
Signature refinement produces more robust gene signatures with improved integrity

In this manuscript, we have argued that improved gene signature integrity (assessed by CMS) can be achieved by integrating data from multiple prostate cancer datasets. The aim was to remove noise and produce better and more robust gene signatures in terms of biological interpretation. We found several lines of evidence that more robust signatures were achieved by integration of datasets. First, we observed that the refined signature improved CMS compared to the initial signature in independent normal prostate samples (Figure 2D). Second, when we used the initial gene signature for ssGSEA in the Hsu dataset, we observed no clear separation between prostate and other metastatic samples (Figure 4D, Supplementary Figure 10). Third, in the FANTOM dataset, the one Prostate Adult Tissue sample was ranked as the highest scoring sample only after signature refinement, but not when the initial signature was used (Figure 3D, Supplementary Figure 11).

The refinement procedure generated improved CS signature ssGSEA scores, and the improvement was particularly pronounced for prostate samples. We also tested the effect of refinement on gene expression data from other tissue types in the TCGA-complete dataset (33 cancer and 22 normal tissue types). Other cancer types showed both elevated and decreased GSEA scores after refinement, and elevations were consistently lower than in prostate samples (Supplementary Figure 12).

The best signature ssGSEA score improvement after refinement was also observed specifically for the metabolites citrate and spermine. For this test, we repeated the procedure to create initial and refined signatures for all metabolites and lipids in the metabolite data from Bertilsson (23 metabolites and 17 lipid signals). We then calculated ssGSEA scores in all cancer and normal tissue types in the TCGA-complete dataset for each metabolite/lipid signature (Supplementary Figures 13-16). The most elevated ssGSEA scores were for citrate and spermine in prostate compared to other tissue types, while other metabolites showed both elevated and decreased ssGSEA scores for other tissue types compared to prostate. This observation was similar for both cancer and normal tissue samples. Thus, the effect of refinement was highest for citrate and spermine specifically in the prostate. Interestingly, we noted that other metabolite/lipid signatures were elevated in prostate, particularly putrescine (a precursor for spermine in the polyamine pathway), a few lipid signals and glucose, indicating prostate specific regulation of other metabolites than just citrate and spermine. These observations were confirmed in the FANTOM dataset (dataset ID 23) (Supplementary Figure 17).

In summary, we conclude that the identification and subsequent refinement procedure used in this study created a robust and unbiased gene signature that is biologically relevant for in vivo spermine and citrate secretion in the prostate. In a wider perspective, the module approach may represent a general strategy to find robust gene sets related to other -omics data (for example metabolomics, proteomics or lipidomics), as long as these -omics data are accompanied by gene expression data.
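As a rough illustration of the identification step behind this module approach, the sketch below ranks genes by their Pearson correlation with a measured profile (for example the averaged CS concentrations) and keeps the top k as the initial module. It is a minimal sketch under stated assumptions rather than the authors' implementation: the function names `pearson` and `top_correlated_genes`, the dictionary layout and the toy values are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # Raises ZeroDivisionError if either input is constant (correlation undefined).
    return cov / sqrt(var_x * var_y)


def top_correlated_genes(expression, profile, k):
    """Return the k genes whose expression correlates most positively with `profile`.

    expression: dict mapping gene name -> list of values, one per sample
    profile:    list with one measured value per sample, in the same sample order
    """
    ranked = sorted(
        expression,
        key=lambda gene: pearson(expression[gene], profile),
        reverse=True,
    )
    return ranked[:k]


# Toy data (hypothetical): four genes measured in five samples.
profile = [1.0, 2.0, 3.0, 4.0, 5.0]
expression = {
    "geneA": [2.0, 4.1, 6.0, 8.2, 9.9],  # tracks the profile closely
    "geneB": [5.0, 4.0, 3.0, 2.0, 1.0],  # anti-correlated with the profile
    "geneC": [1.0, 3.0, 2.0, 5.0, 4.0],  # moderately correlated (r = 0.8)
    "geneD": [3.0, 1.0, 4.0, 1.0, 5.0],  # weakly correlated
}
module = top_correlated_genes(expression, profile, k=2)
```

In the setting described in the Results, k would be 150 and the profile would be the averaged citrate and spermine concentrations over the cancer samples. Shuffling `profile` between samples before the call yields a random module of the kind used there as a null reference.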

Conclusion
The genes and mechanisms governing normal prostate function, in particular the ability to accumulate and secrete large amounts of citrate and spermine, are mostly unknown. We have used bioinformatics on an extensive collection of prostate datasets to discover genes relevant for secretion of citrate and spermine in the prostate. Based on this we have created a 150-gene signature which can be used to assess the degree of citrate and spermine secretion in a single sample. The bioinformatics approach enabled us to validate and explore the gene signature in a large number of contexts based on public data, including normal vs cancer enrichment, spatial colocalization on tissue, tumor grade, metastasis and different model systems. The employed approach is generalizable to other types of data when measured together with gene expression. The signature showed an inverse association with cancer progression from normal through low and high grade to metastasis. The genes in the signature were enriched for zinc-binding, mitochondrial function, respiratory complex, secretion through the ER-Golgi-lysosome-exosome pathway and branched chain amino acid catabolism, pointing to known, suggested and novel mechanisms for normal prostate function. The lack of suitable model systems is a challenge, and such systems need to be established to study these findings further.

Methods
Datasets
A list of all datasets, with descriptions and references, is given in Supplementary File 1.

Module refinement procedure
A schematic representation of the procedure with stages 1-3 is shown in Figure 7A. The procedure makes use of the following abbreviations: M=Module, G=several genes, g=single gene, D=Dataset, P=gene expression Profile, TB=Table, av=average, in=initial, rf=refined, c=candidate, m=missing,
To assess how well the correlations between genes in a module (or geneset) are preserved for all (or a subset of) samples in a particular dataset, we created the Correlation Module Score (abbreviated CMS). The CMS for a module in a dataset is calculated by first creating a table of intra-correlations between expression profiles for all genes in the module for that dataset (Figure 7B). Here the expression profile for a gene is the vector consisting of gene expression values over all samples in that dataset. For a 150-gene module, this will create a table with 150*150 = 22500 correlation values (a table for a hypothetical 4-gene module is shown in Figure 7B). The CMS is then calculated as the average over all correlations included in the table (but excluding self-correlations). A high CMS represents a high integrity of the gene module in the dataset, indicating the preservation of strong correlations between the module genes in the dataset. Likewise, a low CMS would indicate weak relations between the genes in the module, and the loss of gene module integrity. We further assume that a high CMS for a gene module in a dataset indicates that the biological process the module represents is important or active for samples in this dataset. In this study the datasets are patient cohorts with normal and cancer tissue samples from the prostate.
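As an illustration of the CMS definition above, the following minimal sketch computes the score for a toy module. It assumes Pearson correlation as the correlation measure (the text here does not state which measure is used for the CMS), and the function names, gene names and expression values are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_module_score(profiles):
    """Average correlation over all gene pairs in the module.

    `profiles` maps gene name -> expression values over all samples.
    The correlation table is symmetric, so averaging over unordered
    pairs equals averaging over all off-diagonal table entries.
    """
    pairs = list(combinations(profiles, 2))
    total = sum(pearson(profiles[a], profiles[b]) for a, b in pairs)
    return total / len(pairs)


# Hypothetical 4-gene module measured in 5 samples (toy values).
toy_module = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.0, 4.0, 6.0, 8.0, 10.0],
    "geneC": [1.5, 2.5, 3.5, 4.5, 5.5],
    "geneD": [5.0, 4.0, 3.0, 2.0, 1.0],
}
print(correlation_module_score(toy_module))
```

In this toy module three genes are perfectly co-expressed and the fourth is anti-correlated with them, so the positive and negative pairs cancel and the CMS is close to zero; dropping the deviating gene raises it to 1.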

Refinement Procedure - Stage 1
The purpose of Stage 1 in the module refinement procedure is to find the contribution of each gene in the initial CS module from Bertilsson to the overall CMS for the module. To represent this contribution we introduce the Module Gene Contribution Score (MGCS), which is the average over all correlations between that gene and each of the other genes in the module (or, the overall contribution of that gene to the overall CMS for the module). In Figure 7B this will be the average of the rows (excluding the diagonal), where each average value represents the MGCS for a gene. A high MGCS indicates that the expression profile of the gene fits well with the expression profiles of other genes in the module, while a low MGCS indicates a gene expression profile that deviates from other genes in the module.
The purpose of Stage 2 is to calculate, for all candidate genes c-g present in any one of the 12 datasets D, a score which describes how well each candidate gene expression profile fits with the expression profiles of genes in the initial CS module. This is done by comparing each candidate gene expression profile c-g-P-D to the average expression profile, in-M-G-P-D-av, of the genes in the initial CS module for each dataset D. We regard any gene present in any of the 12 datasets D as a candidate gene.
The purpose of Stage 3 is to identify the genes that correlate best with genes in the initial CS module over all 12 datasets, but that were not a part of the initial CS module to begin with. From this analysis we select the 150 best candidates, and compare their MGCS to the MGCS for the genes in the initial CS module from Stage 1. If the MGCS for a new candidate gene is higher than the MGCS for an initial CS module gene, the initial gene is replaced by the candidate gene. The final output is a refined CS module, where the robustness of gene correlations across all 12 prostate cancer datasets is taken into consideration in the selection of genes for the module. This refined module is defined as the CS gene signature.
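The per-gene contribution score introduced in Stage 1 can be sketched as the row averages of the correlation table, excluding the diagonal. The example below is only an illustration under the assumption that Pearson correlation is used; the function names and toy profiles are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def module_gene_contribution_scores(profiles):
    """Per gene: mean correlation to every other gene in the module."""
    scores = {}
    for gene, profile in profiles.items():
        others = [
            pearson(profile, other_profile)
            for other, other_profile in profiles.items()
            if other != gene  # skip the self-correlation on the diagonal
        ]
        scores[gene] = sum(others) / len(others)
    return scores


# Hypothetical 4-gene module measured in 5 samples (toy values).
toy_module = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.0, 4.0, 6.0, 8.0, 10.0],
    "geneC": [1.5, 2.5, 3.5, 4.5, 5.5],
    "geneD": [5.0, 4.0, 3.0, 2.0, 1.0],
}
scores = module_gene_contribution_scores(toy_module)
print(scores)
```

Here the deviating gene receives the lowest score and would be the first to be replaced if a better-fitting candidate were available. The mean of the per-gene scores equals the score for the module as a whole.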

Procedure:
Exclude initial CS module genes in-M-G from the correlation table cr-TB from Stage 2.
Calculate the average correlation, c-g-cr-av, for each of the remaining candidate genes over all 12 datasets in cr-TB.
Sort genes in correlation table cr-TB by average correlation c-g-cr-av.
Select the 150 candidate genes, c-g-150, with the highest average correlation, c-g-cr-av.
For each dataset D:
  For each of the top 150 candidate genes, c-g-150:
    Calculate the average correlation of gene c-g-150 to all genes from the initial CS module, in-M-G-D, and use the average correlation as the candidate gene MGCS, c-g-MGSC-D. This is a measure of the fit for the candidate gene, c-g-150, to the initial CS module in dataset D.
    For candidate genes not present in dataset D, the MGCS is set to the cohort-specific missing gene constant in m-MGSC from Stage 1.
Combine the candidate MGCS table c-g-MGSC-D with the initial MGCS table in-g-MGSC-D from Stage 1. The combined table, comb-TB, will contain 300 genes, 150 from the initial CS module and 150 new candidates.
Calculate the average MGCS, g-MGSC-av, for each gene over all 12 datasets for the combined table comb-TB.
Sort genes by average MGCS, g-MGSC-av.
Select the 150 genes with the highest average MGCS, g-MGSC-av, as genes for the refined CS module, rf-G-M.
The number of genes from the initial CS module present in each of the 12 datasets is shown in Supplementary
pixel. This was necessary due to many pixels with a high number of unexpressed genes, and/or very few genes with abundant expression.
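The final selection steps of the Stage 3 procedure (combine the two score tables, average over datasets, keep the top-ranked genes) can be sketched as follows. This is a simplified illustration: the per-dataset scores and the missing-gene constants are taken as given inputs, and all names and values are hypothetical toy data with a module size of 2 instead of 150.

```python
def refine_module(initial_scores, candidate_scores, missing_constant,
                  datasets, size):
    """Select the `size` genes with the highest average score.

    `initial_scores` and `candidate_scores` map gene -> {dataset: score}.
    A gene absent from a dataset gets that dataset's missing-gene
    constant, which is assumed to be computed beforehand (Stage 1).
    """
    # Candidates exclude initial module genes, so the keys do not overlap.
    combined = {**initial_scores, **candidate_scores}
    averages = {}
    for gene, per_dataset in combined.items():
        values = [per_dataset.get(d, missing_constant[d]) for d in datasets]
        averages[gene] = sum(values) / len(values)
    ranked = sorted(averages, key=averages.get, reverse=True)
    return ranked[:size]


# Toy input: two datasets, two initial genes, two candidate genes.
datasets = ["D1", "D2"]
initial_scores = {
    "g1": {"D1": 0.6, "D2": 0.5},
    "g2": {"D1": 0.2, "D2": 0.1},
}
candidate_scores = {
    "c1": {"D1": 0.4},            # not measured in D2
    "c2": {"D1": 0.1, "D2": 0.0},
}
missing_constant = {"D1": 0.3, "D2": 0.3}

refined = refine_module(initial_scores, candidate_scores,
                        missing_constant, datasets, size=2)
print(refined)
```

In this toy case the weaker initial gene is replaced by the first candidate, whose average (using the missing-gene constant for the second dataset) is higher.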

Adapted and normalized ssGSEA
The magnitude of GSEA scores depends on both the size of the gene set used and the total number of genes in the ranked list input to the GSEA calculations. To be able to compare ssGSEA scores from Prensner to other datasets, we adapted the ssGSEA by only using the genes shared between the datasets in each comparison. In this way, both the number and identity of genes, as well as the size of the geneset, will be identical for both datasets. For the comparison of several datasets in Figure 4D, we also normalized the GSEA scores in Prensner to a 0-1 scale for each comparison. The adapted and normalized scores calculated in each comparison were highly correlated (average 0.99) and had a small standard deviation of the mean (0.004), so we conclude that they are comparable between the datasets.
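The two adaptations described above, restricting both datasets and the gene set to shared genes and rescaling scores to a 0-1 range, can be sketched as below. The ssGSEA scoring itself is not implemented here; min-max scaling is an assumption about how the 0-1 normalization was done, and all names and values are hypothetical.

```python
def restrict_to_shared(expression_a, expression_b, gene_set):
    """Keep only genes measured in both datasets.

    `expression_a` and `expression_b` map gene -> expression value for
    one sample from each dataset. Returns the two restricted samples and
    the gene set limited to the shared genes.
    """
    shared = set(expression_a) & set(expression_b)
    sub_a = {g: v for g, v in expression_a.items() if g in shared}
    sub_b = {g: v for g, v in expression_b.items() if g in shared}
    sub_set = sorted(set(gene_set) & shared)
    return sub_a, sub_b, sub_set


def min_max_normalize(scores):
    """Rescale a list of enrichment scores linearly to the 0-1 range."""
    low, high = min(scores), max(scores)
    if high == low:
        raise ValueError("cannot rescale constant scores")
    return [(s - low) / (high - low) for s in scores]


# Toy samples from two hypothetical datasets with partly different genes.
sample_a = {"g1": 5.0, "g2": 3.0, "g3": 1.0}
sample_b = {"g2": 2.0, "g3": 4.0, "g4": 6.0}
sub_a, sub_b, sub_set = restrict_to_shared(sample_a, sample_b, ["g1", "g2"])
print(sub_a, sub_b, sub_set)
print(min_max_normalize([2.0, 4.0, 6.0]))
```

After restriction, both samples contain the same genes and the gene set has the same size in both, which removes the two sources of score magnitude differences named in the text.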

GO analysis
The GO analysis was based on gene annotation generated with the GAPGOM tool [49] [PMID: 30567492, GAPGOM in Bioconductor]. GAPGOM predicts gene annotation for a target gene by identifying well-annotated genes showing correlated expression patterns with the target gene across several experiments, and then estimating a predicted annotation for the target gene as a consensus over the co-expressed genes, based on the hypothesis that co-expressed genes may be involved in similar or related processes. This prediction of annotation terms may facilitate a richer functional annotation of genes, in particular for genes where there is a lack of experimental annotation data.
Since GAPGOM only allows annotation prediction for one gene at a time, GAPGOM was used in conjunction with Snakemake [https://snakemake.readthedocs.io/en/stable/project_info/citations.html] to create a software pipeline. This pipeline was used to predict GO term annotation for all 150 genes in the CS signature on each of datasets 1-12. Since GO annotation consists of three different ontologies (MF, BP, CC), the prediction is done separately for each ontology. This then outputs a list for each gene in each dataset for each GO ontology, in total 1800 lists. Each list contains one or multiple GO-term predictions, with the following information for each prediction: GO-ID, Ontology, P-value, FDR/q-value (Bonferroni normalized P-value), the description of the GO term, and the correlation method used for the prediction. For each list, only GO terms with a q-value < 0.05 were selected. This generated a total of 21 080 GO terms for all lists, where 5354 GO terms were unique. A table was created with the genes as rows and the unique GO-terms as columns. For each gene and GO-term, we summed -log10 q-values from all datasets, producing an overall score for each GO-term and gene. We made three tables, one for all 150 genes, one for the six Metallothioneins, and one for the 10 network Hub-genes.
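The construction of the gene by GO-term score table can be sketched as follows. The sketch assumes that terms are retained when their q-value is below the 0.05 cutoff (the usual significance convention) and that all q-values are strictly positive; the record layout, gene names and q-values are hypothetical toy data and do not reflect GAPGOM's output format.

```python
from collections import defaultdict
from math import log10


def aggregate_go_scores(predictions, q_cutoff=0.05):
    """Sum -log10(q) per gene and GO term over all datasets.

    `predictions` is an iterable of (gene, dataset, go_id, q_value)
    records. Records at or above the cutoff are ignored. Assumes q > 0.
    """
    table = defaultdict(lambda: defaultdict(float))
    for gene, _dataset, go_id, q_value in predictions:
        if q_value < q_cutoff:
            table[gene][go_id] += -log10(q_value)
    return table


# Toy predictions for one hypothetical gene in two datasets.
toy_predictions = [
    ("geneA", "D1", "GO:0008270", 0.001),
    ("geneA", "D2", "GO:0008270", 0.01),
    ("geneA", "D2", "GO:0005739", 0.2),   # not significant, dropped
]
go_table = aggregate_go_scores(toy_predictions)
print(dict(go_table["geneA"]))
```

A term that is predicted with low q-values in many datasets accumulates a high score, so the table rewards both significance and consistency across cohorts.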

Network analysis
Correlation-based networks of the CS signature genes for each dataset are created as follows: The 150 genes are represented as nodes in the network and the pairwise Pearson correlations between genes represent the interactions between nodes. To reduce the complexity and highlight the strongest interactions, only the 20 strongest outgoing links (in absolute Pearson correlation) from each node are kept. In this way, the most central nodes will have 20 outgoing links and up to 130 ingoing links from other nodes, while some of the non-central nodes will only have 20 links. We then calculate the node degree (number of links to other nodes) for each node; genes with a high node degree reflect central genes for driving the biological processes [63]. This network construction is performed on cancer samples for all 12 datasets, and the 10 genes with the highest mean degree across the datasets are considered as the top 10 hub genes.
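The network construction for a single dataset can be sketched as below, using a toy correlation table and keeping one outgoing link per node instead of 20. The sketch assumes that node degree counts distinct neighbours irrespective of link direction, which is one possible reading of the description; function names and correlation values are hypothetical.

```python
from collections import defaultdict


def top_k_links(corr, k):
    """Keep, for every gene, its k strongest outgoing links by |r|."""
    links = set()
    for gene, row in corr.items():
        partners = sorted(
            (other for other in row if other != gene),
            key=lambda other: abs(row[other]),
            reverse=True,
        )
        links.update((gene, partner) for partner in partners[:k])
    return links


def node_degrees(links):
    """Number of distinct neighbours per gene, ignoring link direction."""
    neighbours = defaultdict(set)
    for source, target in links:
        neighbours[source].add(target)
        neighbours[target].add(source)
    return {gene: len(linked) for gene, linked in neighbours.items()}


# Toy symmetric correlation table for four hypothetical genes.
pairs = {("A", "B"): 0.9, ("A", "C"): 0.8, ("A", "D"): -0.7,
         ("B", "C"): 0.1, ("B", "D"): 0.2, ("C", "D"): 0.3}
corr = defaultdict(dict)
for (a, b), r in pairs.items():
    corr[a][b] = r
    corr[b][a] = r

degrees = node_degrees(top_k_links(corr, k=1))
print(degrees)
```

Gene A is the strongest partner of every other gene (including D, through a negative correlation), so it collects ingoing links from all of them and emerges as the hub. Averaging such degrees over datasets and ranking the genes would then give the hub gene list.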

Ethical statement
The use of human tissue material and clinical data from the Bertilsson cohort was approved by the Regional