Master regulators of signaling pathways coordinate key processes of embryonic development in breast cancer

Signal transduction pathways allow cells to respond to environmental cues and can induce intracellular changes. In some contexts, like embryonic development, signal transduction plays crucial roles in cell fate determination and differentiation, while in developed organisms some of this processes contribute in the maintenance of the structural integrity of tissues. Tumor cells are recognized as having deregulated signaling which leads to a series of abnormal behaviors known as the hallmarks of cancer. Although gene regulation is often viewed as the last step in signal transduction, transcriptional regulation of the components of a pathway may impact in the long term deregulation observed in tumors. The study of gene regulatory networks centered around genes of the signal transduction pathways allows the identification of transcriptional regulators with the greatest influence over the signal transduction gene signature, also denominated Master Regulators. In this work we identify, the master regulators that regulate the expression of genes of 25 relevant pathways grouped in KEGG within the category of signal transduction in a breast cancer dataset. For this purpose we implemented a modified MARINa algorithm that identifies, from a network of regulons, those that possess more differentially expressed genes related to the process to be studied. We identified CLOCK, TSHZ2, HOXA2, MEIS2, HOXA3, HAND2, HOXA5, TBX18, PEG3 and GLI2 as the top 10 master regulators of signaling pathways in breast cancer. Nine of them are recognized for taking part in embryonic development associated processes. Individual enrichment GO biological function for each TMR regulons showed to be significantly enriched in embryonic development related processes. Hedgehog signaling pathway was shown as enriched and also highly deregulated. The genes of the HOXA family are shared among most of the TMRs. Overall, this suggests the importance of the aberrant reprogramming of mechanisms present during embryonic development, being coopted in favor of tumor development.

Breast cancers are illnesses that originate from healthy cells that are somehow 2 reprogrammed to acquire unlimited proliferation and self-renewal capacity. Eventually these transformed cells are able to migrate and invade other tissues in the body [1]. In 4 this context, cancerous cells do not create brand new cell signaling pathways, but 5 through a variety of mechanisms, existing pathways are aberrantly activated [2]. On 6 them it is notable that many pathways associated to tumor development also play 7 important roles during embryonic development. Pathways such as WNT, Hedgehog or 8 VEGF are essential for differentiation, migration and pattern formation. Abnormal 9 expression of components of these pathways have been reported in tumors [3] [4] [5]. 10 Theories exist regarding the origin of cancerous cells and its relationship to 11 embryonic development pathways and on how these contribute in dedifferentiation and 12 later specialization on tumor development. Nevertheless, all of them agree on the 13 presence of deregulation of signal transduction pathways (STP) controlling these 14 processes [6] [7]. Accounting for gene expression as one possible mechanism in the 15 modulation of signal transduction pathways, the determination of transcription 16 regulators may help us understand this phenomenon. 17 A significant level of regulation of signaling is achieved through the action of 18 transcription factors (TFs) that modulate the transcription of groups of genes encoding 19 proteins that participate in these pathways [8][9][10]. Given their capacity to modulate 20 cellular pathways, TFs are of great interest in the study of complex 21 diseases [11] [12] [13]. Moreover, it has been recognized that some TFs exert a decisive 22 influence in the transition between phenotypes. These TFs, called Transcriptional 23 Master Regulators (TMRs) [14] are expressed at the early onset of the development of a 24 particular phenotype, consequently regulating multiple target genes either directly or 25 indirectly by means of transcription cascades resulting in significant gene expression 26 changes and hence phenotype variation. 27 Given the fact that multiple signal transduction factors are simultaneously 28 deregulated in the cancerous phenotype, an integrative approach is valuable in order to 29 understand the biology underlying this disease. MARINa (Master Regulator Inference phenotype [14,15]. In this work, we used a modified version of this algorithm to find the 33 most important transcription factors focused in the regulation of KEGG's 25 signal 34 transduction pathways in breast cancer. We also identified a TMR subset that regulates 35 genes belonging to specific signal transduction pathways in breast cancer. 36 Materials and methods 37 Obtaining and preprocessing data 38 A Gene Expression matrix was obtained from Espinal-Enriquez et al. [16]. 39 Corresponding to The Cancer Genome Atlas (TCGA) level 3 available data of the 40 Illumina HiSeq RNA-Seq platform, and consisting of 881 samples of which 780 41 correspond to breast cancer tissue and 101 to adjacent healthy mammary tissue. 42 Quality control and batch effect removal were performed with N OISeq [17] and 43 EDASeq [18] R libraries respectively [16]. 44 The Master Regulator Inference Algorithm 45 TMRs were inferred using the Master Regulator Inference Algorithm (MARINa) [15]. 46 MARINa identifies TMRs through an enrichment of TF regulons (a TF with its targets) 47 with differentially expressed genes between two phenotypes (breast cancer vs adjacent 48 healthy mammary tissue). TMR inference with MARINa requires as input a network of 49 regulons, a gene expression molecular signature and a null model [15] (Fig. 1). The 50 construction of these elements is described below.  RNAseq data from TCGA's 780 invasive mammary carcinomas and  101 adjacent tissue samples was processed to obtain an expression matrix (orange  cylinder). The expression matrix and a list of transcription factors from the TFCheckpoint database (pink cylinder) served as input to infer a transcriptional regulatory network with ARACNe. A regulon network was obtained associating the expression level of the targets of all transcription factors using the aracne2regulon function from viper (left side). For the generation of the molecular signature, we considered genes in the expression matrix in KEGG's "signal transduction" category (blue cylinder). Finally, a null model was generated by permuting sample labels and recalculating the molecular signature (right). These three elements are the input to MARINa for the inference of the transcriptional master regulators (TMR) of the signal transduction pathways.
Generation of the regulons network 52 The network of regulons is a directed network (TF→Target) of all the transcription 53 factors and their targets. To obtain it, we used the expression matrix and the mutual 54 information based transcriptional regulation network built with ARACNe [19]. For this 55 network, we considered transcription factors in the TFCheckpoint curated database [20] 56 that possessed experimental evidence for TF activity. 771 of these TFs were found 57 within the expression matrix S1 File.

58
This network contains the relationships between transcription factors and the rest of 59 the genes, measured by the mutual information (MI) function [19,21]. For this network 60 interactions were kept if its p value was below 0.005. Given that mutual information 61 can detect both indirect and direct relationships, ARACNe limits the number of indirect 62 interactions applying the Data Processing Inequality theorem (DPI), which considers 63 that, in a triangle of interactions, the weakest one has a greater probability of being 64 indirect if its difference is large with respect to the other two [22]. We applied a DPI 65 value of 0.2 as recommended in Margolin et al. 2006 [19], which means that the weakest 66 interactions of the triangles in the network were eliminated without introducing an 67 excessive number of false positives.

68
The type of association (activation or repression) of the transcription factors is 69 determined from Spearman correlation of the TF with the levels of expression of all its 70 targets [15]  results of this test were Z -score normalized to allow comparability [15]. 84 Null model generation 85 To estimate the probability that a Gene Enrichment Score depends on the biological 86 context and thus is not merely random, a null model was generated by random 87 permutation of samples between cases and controls and recalculation of differential 88 expression [15]. 89 Inferring the Master Regulators of signal transduction pathways 90 With the molecular signature, the regulon network and the null model, MARINa 91 estimated the top regulons that enrich the most differentially expressed genes in the 92 molecular signature through a Gene Set Enrichment Analysis [25]. An additional 93 constraint was to consider only TFs with 20 or more targets in the molecular 94 signature [15]. A p-value for each regulon was estimated by evaluating the Enrichment 95 Score (ES) with reference to the distribution of scores of the null model [15]. For TMR 96 inference we used Bioconductor's viper package [23].

97
Regulon enrichment of KEGG pathways 98 An over-represented pathway is defined as one for which we found significantly more 99 genes within a test set than the number expected from a random sampling [26], hence, 100 we say this set is enriched with genes of the pathway, this may in turn suggest biological 101 relevance. The statistical significance of an enrichment can be assesed by means of an 102 hypergeometric test. In order to know if the combined regulons of our most important 103 Trascription Master Regulators are enriched for biological pahways, an 104 Overrepresentation Enrichment Analysis (ORA) was performed using the WebGestalt 105 tool [27] with KEGG as the functional reference database [24]. Statistical significance 106 threshold was set to p ≤ 0.05 after FDR correction.

107
Pathway deregulation analysis 108 To determine which signal transduction pathways are the most deregulated in the breast 109 cancer phenotype, we estimated the degree of deregulation of KEGG Signal 110 Transduction pathways by using the Pathifier algorithm [28]. Pathifier assigns a score, 111 named Pathway Deregulation Score (PDS) for each pathway in a sample from the 112 expression status of the genes in the pathway in reference to its expression in normal 113 tissues of the same origin. In brief, for a given pathway, a multidimensional space is 114 defined where each dimension represents the expression level of a gene. All samples are 115 positioned in this space according to the expression levels of all the genes in the 116 pathway. Then, a principal curve (a smoothed curve of minimal distance to all points) 117 is calculated and all samples are projected into it. The score corresponds to the distance 118 of the sample projection measured over the principal curve respect of the projection of 119 the normal tissue samples [28]. To enable comparisons between pathways a Z -score was 120 calculated for each PDS and the median value for each pathway was taken [29]. To gain insight on how our TMRs may contribute to this phenotype, we performed an 123 ORA for each one of their regulons against Gene Ontology (GO) [30] biological 124 processes. Enrichments were calculated via WebGestalt [27]. Statistical significance 125 threshold was set to p ≤ 0.05 after FDR correction. With the exception of CLOCK, this regulators are commonly described within the context of embryonic development, and all of them have been reported in association with cancer. The total number of genes controlled by these regulons is 412, representing almost one third of the total genes in the molecular signature. In this figure, the p-value is shown on the left for each of the Master regulators whose symbols are on the right. The "Act" column indicates the activity of the master regulator on its targets, the red color represents the overexpression and the blue color represents the subexpression with respect to normal tissue. "Exp" shows the expression value for each master regulator.
With the exception of CLOCK, the activity of these transcription factors over their 137 targets is repression. However, the expression values of this regulators remain without 138 significant change in breast cancer respect to normal mamary tissue (Fig. 2). In cancer, 139 it has been described that some transcription factors can increase the transcription of Although it maintains activation interactions with some of its targets (red links), the majority of its interactions are inhibitory. CLOCK is the only TMR that maintains a greater proportion of activation interactions (image generated with cytoscape [44]).

Regulon enrichment of KEGG pathways 155
In order to know which molecular pathways are enriched in the top 10 regulons, an 156 enrichment analysis was made with Web Gestalt using KEGG as a reference database. 157 The pathway with the most statistically significant enrichment was Pathways in cancer 158 (hsa05200) with a coincidence of 121 genes, which reinforces the idea that the analysis 159 does recover information from the phenotype studied. 160 Other pathways such as Cell cycle (hsa04110) and Focal adhesion (hsa04510) follow 161 in the the top three enrichments. Also enriched are signaling pathways present within 162 our molecular signature and that are known to be important in the development of 163 cancer such as PI3K-AKT signaling pathway (hsa04151), Phospholipase D signaling 164 pathway (hsa04072) and Hedgehog signaling pathway (hsa04340) ( During embryonic developement signals such as morphogens and growth factors 173 present in cell's environment activate Signal transduction pathways that in turn induce 174 changes within the cell [45]. In the context of cancer, pathways have been reported as 175 permanently activated and to gain independence of the activating ligands [32].

176
Many of our TMRs are usually described in the context of embryonic development 177 processes [35][36][37][38][39][40][41][42][43]. It is interesting to note that our TMRs and their regulons are self-renewal of stem cells [46]. Both pathways have been previously described in 181 cancer [3,46]. Within the TMRs that have the enriched Hedgehog pathway, it has been 182 described that TSHZ2 forms a complex with GLI1 which functions in a coordinated 183 manner with GLI2 and GLI3 within the Hedgehog pathway [47]. Knockout experiments 184 of TBX18 showed a marked decrease in the Hedgehog pathway genes [48].

185
GLI2 regulon in the context of the top 10 regulon network. GLI2 is the only TMR 186 that shows multiple interactions with other TMRs (six in total Figure 3). In the regulon 187 network, genes are initially associated by means of MI but during the conversion to 188 differentiation processes during embryonic development [42,50]. Therefore, this TMR 193 may be interesting in the context of the master effector of the Hedgehog pathway which 194 is one of the most represented here.

195
Another interesting result arises from the observation that the PI3K-AKT and

196
Hedgehog signaling pathways have been reported in association with stemness and cell 197 differentiation processes. Both pathways play a role during embryonic development and 198 in the maintenance of adult tissues. Hedgehog plays a role in epithelium maintenance, 199 and is necessary to regulate the presence and number of stem cells [3], while activation 200 of the PI3K-AKT pathway promotes survival growth and proliferation [51].

202
The most significantly enriched processes of each TMR regulon are presented in Table 3. 203 It is interesting that, for enriched GO biological processes obtained from the molecular 204 signature of the signal tranduction pathways , the top places are occupied by embryonic 205 development related processes. These results are in line with the hypothesis of tumors 206 are described as aberrations of growth, differentiation, and organization of cell 207 populations. These are basic processes that are tightly coordinated and controlled 208 during embryogenesis as well as in adult tissues [6]. The oncogerminative theory of 209 cancer development (OTCD) [6] suggests that cancer arises due to aberrant expression 210 of developmental genes. According to this theory, tumor formation is a dynamic 211 self-organizing process that mimics the process of early embryo development. The 212 malignant transformation of somatic cells, which is a result of gene mutations combined 213 with epigenetic dysregulation, ultimately results in somatic cells being reprogrammed 214 into immortal cells that mimic germline cells. These mimics are termed "cancer stem 215 cells" or "oncogerminative cells" [6,52].
216 Table 3. First significant enrichments of Gene Ontology biological processes per regulon. The first ten regulons enrich more biological processes related to embryonic development (ten out of fifteen, in purple), in blue are the processes related to cell cycle and proliferation (three enriched processes) and in orange those referring to organization of the extracellular matrix (two breast cancer cells expressing wild-type p53 led to apoptosis while those lacking the p53 227 gene did not [54,55]. Furthermore, the HOXA5 promoter region was methylated in 80 % 228 of p53-negative breast cancer specimens. [54]. This aberrant regulation of HOX genes in 229 cancer indicates that HOX transcriptional mechanisms are integral to a network of 230 regulatory mechanisms involved in normal adult tissue homeostasis. [52]. Our results

231
show that HOXA members are included in all of our 10 TMR regulons Table 4.  associated regulons pointed out to the PI3K-AKT pathway, which is associated to cell 237 survival and proliferation, and to the AMPk pathway which is involved in the cellular 238 energetic balance as the most deregulated.