Abstract
In the field of gene expression analysis, methods of integrating multiple gene expression profiles are still being developed and the existing methods have scope for improvement. The previously proposed tensor decomposition-based unsupervised feature extraction method was improved by introducing standard deviation optimization. The improved method was applied to perform an integrated analysis of three tissue-specific gene expression profiles (namely, adipose, muscle, and liver) for diabetes mellitus, and the results showed that it can detect diseases that are associated with diabetes (e.g., neurodegenerative diseases) but that cannot be predicted by individual tissue expression analyses using state-of-the-art methods. Although the selected genes differed from those identified by the individual tissue analyses, the selected genes are known to be expressed in all three tissues. Thus, compared with individual tissue analyses, an integrated analysis can provide more in-depth data and identify additional factors, namely, the association with other diseases.
1. Introduction
Gene expression analysis is an important step for investigating diseases and identifying genes that can be used as therapeutic targets or biomarkers or genes that are causes of disease. Although the development of high throughput sequencing technology (HST) has led to continuous increases in the amount of gene expression profile data, methods of integrating multiple gene expression profiles are still being developed. Tensor decomposition (TD) is a promising candidate method for integrating multiple gene expression profiles. Using this method, gene expression profiles from multiple tissues of individuals can be stored as a tensor xijk ∈ ℝN×M×K, which represents the gene expression of the ith gene in the jth individual of the kth tissue. TD provides a method of decomposing a tensor into a series expansion of the product of singular value vectors, each of which represents a gene assigned to a specific individual or tissue. For example, by applying the higher-order singular value decomposition (HOSVD) method to xijk, we can obtain the following:
where G ∈ ℝN×M×K is a core tensor,
are singular value matrices and orthogonal matrices. We previously proposed a TD-based unsupervised feature extraction (FE) method [1] and applied it to a wide range of genomic sciences. Recently, this method was improved by the introduction of standard deviation (SD) optimization and applied to gene expression [2], DNA methylation [3], and histone modification analyses [4]. Nevertheless, because the updated method was only previously applied to gene expression measured by HST, whether it is also applicable to gene expression profiles retrieved by microarray technology remains to be clarified. In this paper, an integrated analysis was performed by applying the recently proposed TD-based unsupervised FE method with SD optimization to microarray-measured gene expression data for diabetes mellitus from multiple tissues. We found that applying the TD-based unsupervised FE with SD optimization to gene expression profiles from individual tissues can identify diseases associated with diabetes that cannot be identified by the other state-of-the-art methods.
There are multiple benefits in using TD to identify DEGs. First, since it is not a supervised method, it can select DEGs that are biologically more plausible than those selected using supervised methods. This can be explained using the following example wherein the aim is to identify DEGs that are distinct between two classes, e.g., patients and healthy controls. Supervised methods attempt to identify DEGs associated with a smaller divergence within individual classes, whereas TD allows to select DEGs with within-class divergence to some extent (since TD tries to identify the representative state of distinction between two classes). If the representative state is associated with within-class divergence that has biological origins, e.g., age and sex, this divergence should not be penalized. However, supervised methods often do so whereas the unsupervised method allows biological within-class divergence. Second, TD can select more stable DEGs; i.e., those independent of specific sets of samples considered in the analysis. This is because TD attempts to identify DEGs coincident with those of the representative state, which should be robust. Since sub-sampling does not change the representative state drastically, the gene set selected by TD is not altered drastically either. Third, TD can deal with multiple conditions. For example, if gene expression is measured in various tissues of several people, it is natural to format them as gene × person × tissue, which results in a tensor form. We have listed only a few important advantages here. Readers interested in acquiring information on other advantages of TD can refer to our recent book [1].
2. Materials and Methods
2.1 Gene expression
Gene expression profiles (GSE13268, GSE13269, and GSE13270 [5]) were retrieved from the Gene Expression Omnibus (GEO), and they were obtained from a study of the progression of diabetes biomarker diseases in the rat liver, gastrocnemius muscle, and adipose tissue. Each of these profiles is composed of gene expression profiles from five individuals seen in two strains, Goto-Kakizaki and WistarKyoto, and they include data for three tissues (adipose, muscle, and liver) obtained at five time points after treatment. Three files named GSE13268_series_matrix.txt.gz, GSE13269_series_matrix.txt.gz, and GSE13270_series_matrix.txt.gz were downloaded from the Supplementary Files in GEO.
Gene expression profiles were formatted as a tensor, with xijkmst ∈ ℝ31099×5×5×2×2×3, representing the expression of the ith probe in the tth tissue (t = 1: adipose, t = 2:muscle, t = 3:liver) at the jth time point for the kth replicate and mth treatment at sth strain. These values are normalized as follows:
2.2 Methods
Figure 1 shows the analysis pipeline. Methodological details can be found in supplementary Information.
Overall flowchart of the analysis pipeline.
3. Results
To validate the selected genes, 2,281 gene symbols are uploaded to Enrichr [6] (For the full list of selected probes, genes and enrichment analyses, check the supplementary materials). Table 1 shows the results of the “KEGG 2021 Human” category in Enrichr. Since none of the terms are related to diabetes except for the top term, i.e., “diabetic cardiomyopathy”, the process initially appears to be a failure. Nevertheless, a number of the identified diseases are deeply related to diabetes mellitus. For example, many neurodegenerative diseases are listed, and diabetes mellitus is widely known to be a risk factor for neurodegenerative diseases [7–11]. Moreover, diabetes mellitus is known to be associated with thermogenesis [12], oxidative phosphorylation [13], and the PPAR signaling pathway [14]. Thus, the proposed method is successful in contrast to the first impression and can identify many diseases associated with diabetes mellitus.
Top 10 “KEGG 2021 Human” category terms in Enrichr.
Table 2 shows the top 10 terms in the category “ARCHS4 tissues” in Enrichr. Remarkably, gene expression is measured for three of the top four tissues. Similar results are found for the “Mouse Gene Atlas” category in Enrichr (Table 3). In conclusion, the proposed method is successful.
Top 10 terms in the “ARCHS4 Tissues” category in Enrichr.
Top 10 terms in the “Mouse Gene Atlas” category in Enrichr.
4. Discussion
Although the proposed method successfully integrated gene expression data measured in three tissues and identified diseases associated with diabetes mellitus, the identified genes also included genes expressed in all three tissues. If other methods that do not require an integrated analysis can perform similarly, then complicated methods, such as the proposed method, will not be required. To determine whether methods without integration can achieve similar performance, we tested three methods: t test, SAM [15] and limma [16]. Since the t test and SAM methods cannot simultaneously consider the distinction between the control and treatment as well as the dependent on time, we attempted to identify genes that presented expression differences between the control and treatment (no consideration of time dependence). For more details on how to perform these three methods, check the sample R source code in supplementary materials.
Table 4 shows the number of probes selected by the other methods. These methods select fewer probes than the proposed method (2,542 probes), and the number selected in muscle is relatively low. According to the limma method, only two probes could be selected for muscle; thus, the method was not successful. The integrated analysis likely helped identify more probes, which resulted in more significant enrichment.
Number of probes selected by other methods.
To further validate the genes selected by other methods, we converted probe IDs to gene symbols and uploaded them to Enrichr. Table 5 presents the results for the other methods on the “Mouse Gene Atlas” category in Enrichr. For muscle, neither SAM nor t test could select muscle as top ranked tissues whereas limma could identify only two probes as muscle-specific genes (see Table 4). Thus, the other methods are not better than the proposed method that could identify muscle specificity correctly (Table 3). Figure 2 shows the Venn diagrams between selected genes. Since the proposed method selects different genes from those specifically selected in individual tissues, an integrated analysis is a valuable method.
Top three terms by other methods in the “Mouse Gene Atlas” category in Enrichr
Venn diagrams between genes selected by various methods. Upper: t test, lower: SAM.
Finally, based on the genes associated with probes shown in Table 4, we found that the “KEGG 2021 Human” category in Enrichr does not include neurodegenerative diseases (see the supplementary materials). Thus, the association between neurodegenerative diseases and diabetes mellitus can be found only when an integrated analysis, such as the proposed method, is employed. In this sense, an integrated analysis is more than a simple union of individual analysis and can identify factors that cannot be identified by individual analyses, such as potentially associated diseases. Thus, an integrated analysis of gene expression profiles in individual tissues provide more in-depth information than individual analyses, at least for certain cases. Thus, integrated analyses of gene expression profiles in individual tissues should be encouraged.
It may be plausible for other integrated methods to perform similarly. If this is true, the advanced methods that we have proposed here are not required. To rule out this possibility, we apply ComBat [17] to remove the batch effect between the three tissue typed, since we selected genes whose expression is independent of tissues as can be seen in Fig. ??; Table 4 shows the results. It is seldom reported to be successful. Limma failed to select any DEGs, and the number of genes selected by the t test and SAM is markedly different from each other in contrast to the identification of tissue-specific DEGs, whose numbers are more coincident across the three methods (Table 4).
Biological validation is also worse; Table 5 shows the result of the “Mouse Gene Atlas”. None of tissues used in the experiments are listed whereas the proposed method is (Table 3). In addition to this, based on the genes associated with probes shown in Table 4, we found that the “KEGG 2021 Human” category in Enrichr does not include neurodegenerative diseases (see the supplementary materials) that were detected using the proposed method (Table 1). In conclusion, integrated analysis using ComBat is inferior to the proposed method.
One might wonder why an integrated analysis of three tissues from patients with diabetes mellitus can identify associations with neurodegenerative diseases. The PCA and TD-based unsupervised FE methods are frequently able to detect disease associations. We previously identified an association between cancer and amyotrophic lateral sclerosis [18] without investigating cancer gene expression and an association between heart diseases and posttraumatic stress disorder [19] without investigating brain gene expression. Therefore, we were not surprised that the integrated analysis using the proposed method was able to identify disease associations. To our knowledge, few studies have attempted to predict the association between diseases using gene expression, although many studies have focused on the associations between genes and disease [20–22] and between drugs and disease association [23–25]. Our proposed strategy would be useful for such studies.
5. Conclusions
In this study, we applied the proposed TD-based unsupervised FE with SD optimization method to perform an integrated analysis of gene expression measured in three distinct tissues using microarray architecture; moreover, the proposed method has not been applied to such data in previous studies. The results show that the proposed method can identify more genes than individual analyses. The selected genes are known to be expressed in all three tissues, and they are also enriched in many neurodegenerative diseases that have a known association with diabetes mellitus but cannot be identified by individual analysis. In this sense, integrated analyses might have the ability to identify additional factors relative to individual analyses.
Author Contributions
Y.-h.T. planned the research and performed the analyses. Y.-h.T. and T.T. evaluated the results, discussions, and outcomes and wrote and reviewed the manuscript. All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported by KAKENHI (Grant Numbers 20H04848, and 20K12067) to Y.-h.T.
Institutional Review Board Statement
Not applicable
Informed Consent Statement
Not applicable
Data Availability Statement
All of the data used in this study are available in GEO ID GSE13268, GSE13269, and GSE13270.
Conflicts of Interest
The authors declare no conflict of interest.
Footnotes
tturki{at}kau.edu.sa
Revision based upon reviewers' comments