Abstract
Background Differential co-expression signifies change in degree of co-expression of a set of genes among different biological conditions. It has been used to identify differential co-expression networks or interactomes. Many algorithms have been developed for single-factor differential co-expression analysis and applied in a variety of studies. However, in many studies, the samples are characterized by multiple factors such as genetic markers, clinical variables and treatments. No algorithm or methodology is available for multi-factor analysis of differential co-expression.
Results We developed a novel formulation and a computationally efficient greedy search algorithm called MultiDCoX to perform multi-factor differential co-expression analysis of transcriptomic data. Simulated data analysis demonstrates that the algorithm can effectively elicit differentially co-expressed (DCX) gene sets and quantify the influence of each factor on co-expression. MultiDCoX analysis of a breast cancer dataset identified interesting biologically meaningful differentially coexpressed (DCX) gene sets along with genetic and clinical factors that influenced the respective differential co-expression.
Conclusions MultiDCoX is a space and time efficient procedure to identify differentially co-expressed gene sets and successfully identify influence of individual factors on differential co-expression.
Software R function will be available upon request.
Background
Differential co-expression of a set of genes is the change in their degree of coexpression among two or more relevant biological conditions [12]. As illustrated in Figure 1, differentially co-expressed genes demonstrate a strong co-expression pattern among normal samples and no co-expression among disease samples. These genes may not be differentially expressed. Differential co-expression signifies loss of control of factor(s) over the respective downstream genes in a set of samples compared to the samples in which the gene set is co-expressed or variable influence of a factor in one set of samples over the other. This could also be due to a latent factor which had a significant influence on gene expression in a particular condition [44]. Since the proposal by Kostka & Spang [12], many algorithms have been proposed to identify differentially co-expressed (referred as DCX throughout the paper) gene sets and quantify differential co-expression. The algorithms can be classified based on two criteria: (1) method of identification of DCX gene sets (targeted, semi-targeted and untargeted); and (2) scoring method of differential co-expression (gene set scoring and gene-pair scoring).
Based on the method of identification, similar to the one described by Tesson et al. [2], the algorithms can be classified into targeted, semi-targeted and untargeted algorithms. The Targeted algorithms [9] perform differential co-expression analysis on predefined sets of genes. The candidate gene sets may be obtained from public databases such as GO categories and KEGG pathways. They do not find novel sets of DCX gene sets. Another disadvantage of targeted methods is their reduced sensitivity if only a subset of the given gene set is differentially co-expressed as the DCX signal is diluted and the DCX geneset may not be identified. In addition, the DCX gene sets that are composed of genes of multiple biological processes or functions [44] may not be identified at all. The semi-targeted algorithms [2, 17, 24] work on the observation that the genes are co-expressed in one group of samples. Hence they perform clustering of genes in one set of samples, identify gene sets tightly co-expressed and test for their differential co-expression using the remaining group of samples. Although semi-targeted algorithms can identify novel gene sets, their applicability is limited to the co-expressed sets identified by the clustering algorithm. On the other hand, the untargeted algorithms [10, 12, 29] assume no prior candidate sets of genes and instead find the gene sets in an exploratory manner and therefore have a high potential to identify novel gene sets. The major drawback of untargeted approach is potentially high false discovery rate and large computational requirements.
The second aspect of DCX gene set identification algorithms is the methodology employed in scoring differential co-expression of a given gene set: (1) gene set scoring or set-wise method, and (2) gene pair scoring. In gene set scoring, all genes are considered in the scoring at once such as in the linear modelling used in Kostka&Spang [12] and Prieto et al. [10]. On the other hand, gene-pair scoring, as used in DiffFNs [29] and DCoX [2], computes differential correlation of each pair of genes in the gene set and summarizes them to obtain DCX score for the gene set. Gene pair scoring is intuitive and amenable to network like visualization and interpretation in single factor analysis settings. However, gene set scoring can be thought of as gene network of multiple cliques and cliques connected via common genes among all pairs of gene sets. The first few methods (e.g. Kostka&Spang [12] and Prieto et al. [10]) are untargeted set-wise methods, while DiffFNs [29] is an untargeted gene-pair scoring method. However, many later methods, including an early method (DCA [17]) are predominantly targeted or semi-targeted algorithms using gene pair scoring. Differential co-expression has been used in various disease studies and identified many interesting changed interactomes of genes among different disease conditions. DiffFNs [29], Differential co-expression [20], TSPG [39], and Topology-based cancer classification [22] were applied for the classification of tumor samples using interactome features identified using differential co-expression and shown good results over using individual gene features. The application of Ray and Zhang’s coexpression network using PCC and topological overlap on Alzheimer’s data helped identify gene sets whose co-expression changes in Alzheimer’s patients [27]. The multi-group time-course study on ageing [19] has identified gene sets whose coexpression is modulated by ageing. Application on data of Shewanella oneidens identified a network of transcriptional regulatory relationships between chemotaxis and electron transfer pathways [32]. Many other studies have also shown the significant utility of application of differential co-expression analysis [11, 14, 33, 36]. However, none of the existing algorithms allow direct multi-factor analysis of differential co-expression, i.e. deconvolving and quantifying the influence of different biological, environmental and clinical factors of relevance on the change in coexpression of gene sets. This is important as some phenotypes or biological outcomes are governed by multiple factors. In such a case, many single-factor differential coexpression analyses suffer from the same disadvantages of similar approach in differential expression analysis: it leads to multitude of tests, the interpretation of the identified gene sets may be cumbersome and misleading.
Multivariate differential co-expression analysis is important in many practical settings since each sample is characterized by many factors (a.k.a. co-factors) such as environmental variables, genetic markers, genotypes, phenotypes and treatments. For example, a lung cancer sample may be characterized by EGFR expression, smoking status of the patient, KRAS mutation and age [37]. Similarly, ageing of skin may depend on age, exposure to sun, race and sex [13]. Different environmental, genetic and clinical factors may modulate co-expression of a set of genes. Deconvolving and quantifying the effects of these factors on gene set’s co-expression and eliciting relevant regulatory pathways is an important task towards understanding the change in the cellular state and the underlying biology of interest.
Hence, we propose a very first methodology for such purpose called Multi-Factorial Analysis of Differential Co-eXpression or MultiDCoX, a gene set scoring based untargeted method. MultiDCoX performs greedy search for gene sets that maximize absolute coefficients of cofactors (as suggested in our earlier work [3]) while minimizing residuals for each geneset. Though the model used in MultiDCoX is restricted to ordinal and categorical valued factors, it is not a major limitation while dealing with real valued cofactors which can be discretized into reasonably small number of levels and be treated as ordinal variables. The analysis of several simulated datasets demonstrate that the algorithm can be used to reliably identify DCX gene sets and deconvolve and quantify the influence of multiple cofactors on the co-expression of a DCX geneset in the background of large set of non-DCX gene sets. The algorithm performs well even for genesets with weak signal-to-noise ratio. The analysis of a breast cancer gene expression dataset revealed interesting biologically meaningful DCX gene sets and their relationship with the relevant cofactors.
Furthermore, we have shown that the co-expression of CXCL13 is not only due to the Grade of the tumor as identified in [18], but also could be influenced by ER status. Similarly, MMP1 appears to play role in two different contexts defined by more than one co-factor. These together demonstrate the importance of multi-factor analysis.
Methods
MultiDCoX Formulation and Algorithm
MultiDCoXprocedure consists of two major steps: (1) identifying DCX gene sets and obtaining respective DCX profiles; and (2) identifying covariates that influence differential co-expression of each DCX gene set. The formulation essential to carry out these two steps is as follows.
Let Eim denote expression of gene gi in sample Sm. The cofactor vector characterizing Sm is denoted by Bm= (Bm1, Bm2, Bm3, …, Bmk) where Bmk is the value of kth factor for Sm which is either a binary or an ordinal variable. A categorical variable can be converted into as many binary variables as one less the number of categories of the factor.
We define a new variable Amn(I) to summarize co-expression of gene set I among sample pair Sm and Sn for which Bm = Bn, as
Amn(I) measures square of mean change of expression of all genes in I from Smto Sn. Most of Amn(I)’s are expected to be non-zero among a group of samples in which I is co-expressed. On the other hand, if genes in I are not co-expressed in a group of samples then Amn(I)’s tend to be zero. This is illustrated in Figure 2.
Once all Amn(I)s and Bmns are computed, we quantify the influence of the cofactors by fitting a linear model between Amn(I)s and Bmns. In other words, Amn(I)s are the instances of the response variable, Bmns form design matrix (B) and factors in the Bmns are explanatory variables or cofactors (F) i.e.
Where A(I) is the vector of Amn(I)s, B is matrix of Bmns and F is the vector of factors (attributes or covariates or cofactors) represented in Bmns. The coefficient vector obtained from the above modelling is called differential co-expression profile of the gene set I, denoted by F(I).
The linear modelling problem can be solved by standard functions (such as lm()) available in R-package. But, identifying DCX gene sets is major computational task. The MultiDCoX algorithm identifies DCX gene sets by iteratively optimizing coefficient of a cofactor using the following procedure, see the flowchart in Figure 3 for the algorithm: (1) identifying significance threshold for cofactor coefficients; (2) choosing seed pairs of genes that demonstrate significant coefficient for the cofactor under consideration, i.e. the gene pairs may be differentially co-expressed for the cofactor; (3) expanding each chosen seed gene pair into a conservative multi-gene set by optimizing the respective coefficient; (4) augmenting the gene set to increase sensitivity or reduce false negatives while keeping the respective factor coefficient significant; and, (5) filtering out weak contributing genes from each gene set to increase specificity or reduce false positives. Each of these steps is explained in detail below.
1. Identifying threshold of significance for cofactor coefficients
We generate the distribution of coefficients of the cofactors in F by random sampling of gene pairs: randomly sample large number of gene pairs, fit the linear model in Eq2 for each pair and obtain the coefficients in the linear models. Pool absolute values of coefficients of all factors of all gene pairs, and set half of the mth (m=10 in our experiments) highest value as absolute threshold for all cofactors. In other words, whereFk(IƖ) is coefficient on gene set (a pair of genes in this case) IƖ for kth factor.
Toi is the threshold for cofactor ‘i’, derived from CT as follows
The division by 2 is necessary to avoid damagingly strict threshold and lay wider net at the beginning of the algorithm. m> 1 is required as some of the sampled gene pairs could belong to DCX gene sets which may overestimate the threshold and reduce sensitivity of the algorithm.
2. Identifying DCX seed gene pairs
For each gene, search is performed throughout the dataset to find its partner gene whose pair can result in a linear model (Eq2) with at least one significant cofactor. A cofactor is considered to be significant if its linear model F-test p-value is < 0.01 and its coefficient is outside the range [-CT, CT]. If no partner gene could be found, then the gene will be filtered out from the dataset to improve the computational speed at later stages of the algorithm. We have implemented this step using the procedure: (a) batch application of qr.coef() in R-package which computes only linear model coefficients using one QR decomposition, (b) filter out gene pairs whose linear model coefficients are in the range [-CT, CT], (c) apply lm() on the gene pairs remaining after step ‘b’ to compute F-test p-values, and (d) further filter out gene pairs which do not meet requirements for the coefficient p-value. The batch application of qr.coef() is many folds faster than lm(). We use similar strategy in the steps 3.A-3.C below to reduce computational requirements compared to the direct application of lm().
3. Identifying DCXgene sets
We optimize coefficients of each significant cofactor for each gene pair in the direction, in positive or negative direction, depending on the sign of the coefficient i.e. if the coefficient is negative (positive) its minimized (maximized). To do so, for each factor, the steps 3.A-3.C are iterated until all seed pairs for which the factor is significant are exhausted from the seed pairs obtained in the step 2.
3. A. Expanding top gene pair to a multi-gene set
We choose the gene pair whose constituent genes are not part of any of the multi-gene sets identified and whose linear model fit resulted in the highest coefficient for the factor of interest. It will be expanded to multi-gene set by adding genes that improve the coefficient of that factor in the direction of its coefficient for the gene pair. A sequential search is performed from first gene in the data to the last gene in the data (the order of the genes will be randomized before this sequential search). A gene is added to the set if it improved the coefficient of the factor under consideration i.e. the threshold to add a gene thereby the stringency increases as the search proceeds. The final set obtained at the end of this step is denoted by J. This step results in a most conservative DCX gene set. Factor profile FP(J) of J is defined as set of (fi,hi) pairs as follows: Where fi is factor ‘i’ and hi denotes whether it is positively (hi=1) or negatively (hi=- 1) significant or insignificant (hi=0): Fi(J) is coefficient of factor fi for gene set J P-vali(J) is p-value of Fi(J)
3. B. Augmenting gene set J
As we tried to improve the coefficient of the factor for each addition of a gene in the expansion step (3.A), we may have missed many true positives which are not as strong constituents of J, but could be significant contributors. Therefore, we perform augmentation step to elicit some of the potential not-so strong constituents of J while preserving the factor profile of J. As the gene set identified in step (3.A) is most conservative, we set a new threshold Tni(J) or simply Tni for the coefficient Fi(J) of each fi as Tni(J) will be as stringent as Toi and at most equal to Fi(J) which is the coefficient obtained at the end of step (3.A). Moreover, we define centroid EC (J) ={Ecm(J)} of J as EC(J) is treated as a representative gene expression profile of J and find a gene sub set K such that each gene in K, gk, the pair Kk= (gk, EC(J)) satisfies the condition
Then the augmented set L = J&∪K as new DCX gene set
3. C. Filtering gene set L
The set L obtained after the step (3.B) may contain false positives which can be filtered out as follows: As in the augmentation step, we compute Ecm(L) and evaluate each gene pair Qk ∈ {(gk, ECmn(L)) | gk ∈ L} for F(Qk). gk is removed from the set if |Fi(Qk)|<Fi(L) for all |hi|=1. Then the final gene set R = {gk |gk∈L and F(Qk) is better than F(L)}. R is the final set output for the run.
4. Identifying cofactors significantly influencing DCX of each gene set
It is important to identify the factors influencing the DCX of a gene set to interpret the geneset and elicit underlying biology, i.e. FP(R). The F-test p-value obtained for each cofactor by the linear model fit (in Eq2) in the above procedure need to be further examined owing to the dependencies among the gene sets explored. Therefore, we mark a cofactor to be influential (|hi| =1) on co-expression of R if it satisfies the following two criteria:
Effect size criterion: We pool coefficients of all factors on all gene sets identified (denoted as CR) and examine their distribution. The valleys close to zero on either side of the central peak are chosen as the significance threshold Tf+ and Tf−., see Figure 4. Fi(R) is considered to be significant if it is >Tf+ or < Tf−. The underlying assumption is that not all factors influence all gene sets and the coefficients of the factors with no or little influence on certain gene sets will be suggestive of the distribution of the coefficients under null hypothesis.
Permutation p-value criterion: We permute the factor values of a DCX gene set, i.e. permute columns of Bmk matrix and fit the linear model in Eq2 for each gene set R. We repeat this procedure for a predefined number of iterations. A factor is said to be non-influential on the co-expression of the gene set under consideration if a minimum predefined fraction of permutations (0.01 in this paper) resulted in a fit in which the coefficient is better than Fi(R) and its F-test p-value is better than the F-test p-value of the coefficient without permutation or 0.01 whichever is lower.
Finally, the gene sets with at least one significant cofactor and of predefined size (i.e. at least 6 genes in the set) will be output as DCX gene sets along with their factor profiles.
Reducing computational and space requirements: Computational and space requirements can be further reduced using the following strategies: (1) Filter out genes with no detectable signals among almost all samples and genes that demonstrate very little variance across the samples. This can result in modest reduction in space requirement and substantial reductions in computational requirement as the search procedure is at least of quadratic complexity if 50% of the genes are filtered out from the analysis; (2) Further reduction in computational time can be achieved in the step to identify seed gene pairs. Randomly split the genes into two halves and search for possible pairs where one belongs to one half and the other belongs to the other half, instead of all possible gene pairs. As many DCX gene sets are expected to be sufficiently large, >10 genes, the sampled set is expected to contain >2 genes from that set. This reduces computational time to find seed gene pairs by 2 fold. (3) Another possibility is to consider only a subset of sample pairs by randomly sampling a small fraction of (m,n)s for the linear model, it could be as small as 10% of all (m,n)s. We demonstrate the performance of the MultiDCoX even with such a minimal sampling. These three strategies put together with the optimization described in the step 2 of MultiDCoX can massively reduce the space and computational requirement by several folds and make the algorithm practically feasible.
Results
Simulation Results
To evaluate efficacy of MultiDCoX, we analysed simulated datasets of varying degrees of signal-to-noise ratio and sample size. Each simulated dataset consists of 50,000 probes as in a typical microarray and three factors of 12 stratums, sample sizes were chosen to be either 60 or 120 or 240 i.e. 5, 10 and 20 samples per stratum respectively. Two factors B1 and B2 were binary valued taking values from {-1, 1} and the other (B3) is an ordinal variable taking values from {-1, 0, 1}. Sample labels were randomly chosen for each factor and gene expression (Eim) was simulated as described below: B1im = B1m~ N(0,1) if Sm is in co-expressed group of B1 and gi is in DCX gene set for the factor B1, 0 otherwise. Similar interpretation holds for the remaining factors, B2 and B3, too. Oim= Om~ N(0,1) indicates co-expression over all samples if gi belongs to set of genes co-expressed across all samples irrespective of the factor values. eim~ Ν(0,σ2) is noise term and σ2 is the extant of noise in the data.
We simulated 20 genes which show co-expression for B1m=1 and B2m=1, 20 genes co-expressed for B1m= −1 only, and another 20 genes with Oi= 1 only. With this we have two sets of negative controls: large number of genes with no co-expression and a set of genes co-expressed across all samples. Ideally, a DCX gene set identification algorithm should be able to discriminate the first two sets of genes from the two control (negative) sets. Furthermore, we have tested our MultiDCoX for three different values of σ ∈ {0.2, 0.5, 0.8} i.e. from low noise to the noise comparable to the signal. We carried out 10 simulations for each choice of σ.
The simulation results are summarized in the panel of plots in Figure 5: plots of average numbers of false positives (FPs) and false negatives (FNs) over 10 independent runs for each choice of σ and sample size for both DCX gene sets along with the globally co-expressed gene set. MultiDCoX performed well in terms of both false positives and false negatives for low to medium values of σ. Moreover, the algorithm exhibited reasonable performance even at the noise (σ) comparable to the signal (i.e. σ = 0.8). The simulation results also demonstrate that MultiDCoX is sensitive even at small sample size for low to medium noise level. The failure rate of identifying gene sets and their profiles are dependent not only on the sample size and noise level, but also on the type of set identified. A single factor influenced set has better chance of identifying the right profile but poor chance of being identified at low sample size and higher noise level. On the other hand, the set influenced by 2 factors has higher chance of being identified, but poorer chance of being identified with correct profile at low sample size and higher noise level. The effect of noise on FNR also depended on the number of factors influencing the DCX gene set. However, FDR is less dependent on both noise level and the number of factors influencing coexpression. The number of simulations that identified false gene sets increased with increased noise and reduced sample size. It is the lowest for 5 samples/stratum and high noise (σ = 0.8). The computational time for MultiDCoX analysis, to optimize each cofactor in both directions (maximization and minimization), was ~12-15 hours for one simulated data of 240 samples using 1 node of a typical HPC cluster.
MultiDCoX Analysis of Breast Cancer Data
We analyzed a breast cancer gene expression data published by Miller et al. [25]. It contains expression profiles of 258 breast cancer patients on U133A and U133B Affymetrix arrays i.e. ~44,000 probes. The tumors were annotated for their oestrogen receptor (ER) status (1 for recognizable level of ER or ER+, −1 otherwise or ER-), p53 mutational status (1 for mutation or p53+, and −1 for wild type or p53-) and grade of tumor (-1 for grade 1, 0 for grade 2 and 1 for grade 3). ER and p53 status are the important markers used to guide treatment and prognosis of breast cancer patients. Hence it is important to study the gene sets regulated and thereby co-expressed by these factors while accounting for the effect of the tumor status as indicated by its grade and strong association between these three factors. For example, p53-mutant tumors are typically of higher grade (grades 2 or 3) tumors with correlation ~ 63% [40] and ER-positive tumors are typically low grade (grade 1) tumors [41]. In the presence of these correlations among the covariates, it is important to identify and quantify their effects on co-expression of gene sets. We have applied MultiDCoX on this dataset using ER status, p53 mutational status and tumor grade as cofactors. We discuss a few DCX sets here and the remaining DCX gene sets are given in the Additional File 1 (UppsalaBCResults.xlsx).
Co-expression of ER pathway and the genes associated with relevant processes is modulated in p53 mutated tumors
A DCX gene set and the linear model fit is shown in Table 1A. The set is co-expressed only in p53 mutant tumors. The co-expression plot of p53 mutant tumors is shown in Figure 6.
The set includes ESR1 (which encodes ERα) and its co-factor GATA3 and pioneering factor FOXA1 [23] along with ER downstream targets CA12, SPDEF and AGR2. We retrieved a total of 1349 p53 binding sites’ associated genes data from Botcheva K et al. [1], and Wei CL et al. [5]. p53 binding sites are reported to be close to the promoters of ESR1 [30] as well as GATA3. Furthermore, GATA3 appears to bind to FOXA1 which is a pioneering factor of ER [35]. Our finding reinforces the observations made by Rastiet al. [30] that different p53 mutations may have varying effect on the expression of ESR1 gene, it’s co-factor GATA3, pioneering factor FOXA1 and SAM-dependent Mythyltransferase & p53 interacting GAMT which could have resulted in the differential co-expression of the ER pathway. In addition, comodulation of chromatin structure alternating &ER promoter stimulating TOX3 and Protein transfer associated REEP6 appears to be required to modulate ER pathway by p53.
Genes co-expressed with BRCA2 in ER-negative tumors are associated with Her2-neu status
Another gene set of interest is co-expressed in ER-negative tumors only and its details are given in Tables 2A and 2B. The co-expression plot of the gene set in ER-negative tumors is shown in Figure 7. The gene set includes tumor suppressor gene BRCA2. We have investigated ER binding sites published by Carroll et al. [4] and Lin et al. [5] for ER binding sites close (within ±35Kb from TSS) to these genes. The ~4800 binding sites mapped to ~1500 genes. Significantly, 10 of the 21 genes in this DCX gene set have ER binding sites mapped to them which is statistically significant at F-test p-value < 0.01. Interestingly, most of these genes have not been identified to be ER regulated in the earlier studies using differential expression methodologies, possibly owing to the complexity of regulatory mechanisms. However, many of these genes are down regulated in ER-negative tumors. Testing for association of expression of this set with Her2-neu status revealed that higher expression in ER-negative tumors is associated with Her2-neu positivity which must have led to co-expression in ER-tumors. Odds ratio of such an association is 18 which is much higher than that of ER positive tumors (OR = 4).
DCX of CXCL13 is modulated by Grade and ER status too
Analysis of Grade1 and Grade3 tumors using GGMs [18] helped identify CXCL13 in breast cancer as hub gene. It emerged as one of the hub genes in our analysis too, contributing to multiple DCX gene sets (see Additional File 1, sheet: maxGrade). Although they are significant for Grade, they are significant for ER status too. It shows that CXCL13’s differential co-expression appears to be influenced by ER status, in addition to Grade. This couldn’t be identified in the previous study as it was restricted to univariate (Grade) analysis.
DCX of MMP1 is modulated by factor subspace associated with poor survival
MMP1 is another gene we have examined whose family of genes are associated with poor survival [43]. MMP1 is co-expressed among tumors which are P53+ (mutant) and ER-negative or hi-grade tumors which are ER-postive (see Additional File 1, sheets:maxP53, maxGrade and minER). Both these categories are known to be associated with poor survival of patients. This couldn’t have been revealed in a single factor analyses.
Functional analysis of DCX profiles
To elucidate the biological function of different DCX profiles (ER+, ER-& p53+, etc.), we pooled all genes from gene sets of same DCX profile and used DAVID functional annotation tool [16] to identify GO terms, protein domains, tissues of expression and pathways enriched, the results are tabulated in Table 3. It shows a clear distinction of GO functional categories, pathways enriched and tissues of expression between different DCX profiles. For example there is distinctive protein domains between ER-positive and ER-negative DCX profiles, whereby ER-positive’s protein domain involves more in Immunoglobulin/major histocompatibility complex while ER-negative involves in epidermal growth factor (EGF) extracellular domain of membrane-bound proteins. Also, both ER-negative and ER-positive covariates are associated with different pathway categories: ER-positive’s pathway involves more in synaptic transmission, neuroactive ligand-receptor interaction while ER-negative is associated with hormone (steroid, androgen and estrogen) metabolism, drug, starch and sucrose metabolism. The same phenomena can be observed for p53-mutant versus p53-negative associated genes.
Discussion
MultiDCoX is a space and time efficient algorithm which successfully elicits quantitative influence of cofactors on co-expression of gene sets. It required only 12 hours of computation on a typical HPC node to identify DCX gene sets for each factor for a dataset of 240 samples and ~44000 probes. The simulation results demonstrated that MultiDCoX has tolerable false discovery rates even at 5 samples/stratum and noise (σ) of 0.8. However, false negative rate (FNR) was affected by both sample size and noise level. As expected, FNR is very low for large sample size (20 samples per stratum) and low noise level (σ = 0.2). Interestingly, both FDR and FNR did not greatly depended on the type of the gene set to be discovered, or whether it is influenced by single factor or multi-factors. The discovery of a gene set whose DCX is driven by two cofactors is less affected by noise and sample size than the gene sets influenced by a single cofactor. On the other hand, the set influenced by 2-cofactors has higher likelihood of arriving at the wrong profile compared to that of a 1-cofactor driven DCX. Occurrence of false DCX sets increased with increasing noise level, it is pronounced more for small sample size case. This is a major issue to be addressed in the future improvements over the current version of MultiDCoX. Moreover, the performance of the algorithm needs to be studied for varying parameters’ settings and further reductions in computational time. It is possible to reduce the computational time by 2 fold by filtering out 50% of probes of low variance in expression. Though we have not used this strategy as we needed to study its impact on the discovery and profiling of DCX gene sets, the current implementation could complete the analysis within half a day of computing for each factor. The massive parallel processing allows us to complete all analyses within a day.
By MultiDCoX formulation, we identify DCX gene sets exhibiting B-type coexpression only [3]. The other two types of differential co-expression may be identified using multivariate differential expression analysis followed by clustering. MultiDCoX algorithm can be applied to different clinical data to quantify the influence of multiple cofactors on the co-expression and its associated phenotypes. Multiple aspects of the formulation and the algorithm need to be studied in our future improvements: Robustness of Amn(I) to outliers is an important aspects of the performance of the algorithm and impact of the thresholds used in the algorithm also to be studied. However, without tuning, the choice of parameters appears to be effective enough for both simulated and real data sets.
The application of MultiDCoX on a breast cancer data has revealed interesting sets of DCX genes: the set of ESR1, its cofactors along with downstream genes of ESR1 and genes associated with relevant ESR1 dependent transcriptional regulation; the set of genes containing ER binding site in their cis region. Furthermore, we have shown that the co-expression of gene sets that contain CXCL13 and the gene sets that contain MMP1 is affected by ER status too in addition to tumor grade which couldn’t have been elicited in a typical univariate DCX analysis.
Declarations
Abbreviations
- DCX
- Differential Co-expression/Differentially Co-expressed
- DE
- Differential Expression
- GO
- Gene Ontology
- KEGG
- Kyoto Encyclopaedia of Genes and Genomes
- FPs
- False Positives
- FNs
- False Negatives
- FNR
- False Negative Rate
- FDR
- False Discovery Rate
- FPR
- False Positive Rate
- OR
- Odds Ratio
- HPC
- Hi Performance Computing
- ER
- Oestrogen Receptor
Ethical Approval and Consent to participate
Not applicable
Consent for publication
All authors read and approved the manuscript for submission.
Availability of supporting data
The breast cancer dataset is available from the publication Miller et al. [25].
Analysis results are available as Additional File 1:UppsalaBCResults.xlsx
Competing interests
None
Funding
This research was supported by resources and technical expertise from the Genome Institute of Singapore, Agency for Science Technology and Research (A-STAR), 60 Biopolis, Singapore; and The Jackson Laboratory, 10 Discovery Dr, Farmington, CT 06032, USA.
Authors' contributions
RKMK conceived the project. RKMK and JCR guided HL to code the algorithm and carry out analysis. HL coded the algorithm. Both RKMK and HL drafted the manuscript and carried out all analyses. JCR revised the manuscript. All authors read and approved the final manuscript.
Authors' information
HL: Genome Institute of Singapore, Singapore (lianyh@gis.a-star.edu.sg).
JCR: Nanyang Technological University, Singapore (asjagath@ntu.edu.sg).
RKMK: The Jackson Laboratory, USA (krish.karuturi@jax.org).
Additional files
Additional File 1: UppsalaBCResults.xlsx
Format: XLSX
Title of Data: Results of Analysis of Breast Cancer Data
Description: Contains all differentially co-expressed genesets with respective differential co-expression model fit (F-test p-value, coefficient value), gene counts, and permutation results over three factors (ER, p53 and Grade) in breast cancer data. Remarks: Grade+ indicates higher grade tumor i.e. 2 and 3, while Grade– indicates lower grade tumour i.e. 1.
Acknowledgements
The authors would like to thank DrJoshy George, Dr. Foo Jia Nee and Dr. Astrid Irwanto for their helpful comments on the manuscript. We thank Swarna for helping in the early discussion of the project, DrsJuntao, Sigrid and Huaien for comments and discussion.
Footnotes
Email addresses: HL: e0146315{at}u.nus.edu JCR: as.jagath{at}ntu.edu.sg RKMK: krish.karuturi{at}jax.org