XGSEA: CROSS-species Gene Set Enrichment Analysis via domain adaptation

Gene set enrichment analysis (GSEA) has been widely used to identify gene sets with statistically significant difference between cases and controls against a large gene set. GSEA needs both phenotype labels and expression of genes. However, gene expression are assessed more often for model organisms than minor species. More importantly, gene expression could not be measured under specific conditions for human, due to high healthy risk of direct experiments, such as non-approved treatment or gene knockout, and then often substituted by mouse. Thus predicting enrichment significance (on a phenotype) of a given gene set of a species (target, say human), by using gene expression measured under the same phenotype of the other species (source, say mouse) is a vital and challenging problem, which we call CROSS-species Gene Set Enrichment Problem (XGSEP). For XGSEP, we propose XGSEA (Cross-species Gene Set Enrichment Analysis), with three steps of: 1) running GSEA for a source species to obtain enrichment scores and p-values of source gene sets; 2) representing the relation between source and target gene sets by domain adaptation; and 3) using regression to predict p-values of target gene sets, based on the representation in 2). We extensively validated XGSEA by using four real data sets under various settings, proving that XGSEA significantly outperformed three baseline methods. A case study of identifying important human pathways for T cell dysfunction and reprogramming from mouse ATAC-Seq data further confirmed the reliability of XGSEA. Source code is available through https://github.com/LiminLi-xjtu/XGSEA Author summary Gene set enrichment analysis (GSEA) is a powerful tool in the gene sets differential analysis given a ranked gene list. GSEA requires complete data, gene expression with phenotype labels. However, gene expression could not be measured under specific conditions for human, due to high risk of direct experiments, such as non-approved treatment or gene knockout, and then often substituted by mouse. Thus no availability of gene expression leads to more challenging problem, CROSS-species Gene Set Enrichment Problem (XGSEP), in which enrichment significance (on a phenotype) of a given gene set of a species (target, say human) is predicted by using gene expression measured under the same phenotype of the other species (source, say mouse). In this work, we propose XGSEA (Cross-species Gene Set Enrichment Analysis) for XGSEP, with three steps of: 1) GSEA; 2) domain adaptation; and 3) regression. The results of four real data sets and a case study indicate that XGSEA significantly outperformed three baseline methods and confirmed the reliability of XGSEA.

makes bioinformatics tools more promising in retrieving biological knowledge from data. For 5 example, gene set enrichment analysis (GSEA) [1] has been well used in biology and related 6 areas, which can rank gene set(s) most relevant (precisely, statistically significant) to 7 binary-labeled gene expression measurement. However, GSEA needs gene expression data 8 labeled binary, such as control and case, and is heavily affected by missing data. 9 Indeed gene expression are now measured by more speedy and precise techniques like 10 RNA-Seq than cDNA microarray, while measuring gene expression is still costly both on money 11 and time. Existing expression data often has strong bias in measured organisms or species. 12 Model organisms, such as Mus musculus, Caenorhabditis elegans, Arabidopsis thaliana, etc., 13 are well measured, while data on minor species are relatively insufficient. Additionally, human 14 gene expression data are unable to be measured under some specific conditions, due to high risk 15 of direct experiments on human, such as non-approved treatment or gene knockout. On the other 16 hand, mouse is usually used to study human disease [2, 3] because of lower cost, lower risk and 17 relatively strong homology relationship with human [4]. However, there exists essential human, could be represented as a binary annotation vector with dimension being the number of 29 all genes in the expression data, representing whether the corresponding gene is in the gene set. 30 The enrichment significance (such as p-values) of a source gene set S with an annotation vector 31 x s can be computed by traditional GSEA. The goal of XGSEP is to predict the enrichment 32 significance for a target gene set T with an annotation vector x t , which might have a different 33 dimension from x s since the number the total genes for target (human) and source (mouse) are 34 different. Note that the sequence homology between target genes and source genes is assumed to 35 be represented by binary matrix M , which should be important information for the prediction.

36
A naive idea for XGSEP would be to first find a source gene set x s , most homologous to 37 genes in a particular target gene set x t , by using M . Then GSEA is run over source expression 38 data and x s . The resultant p-value for x s is considered as a prediction of the enrichment p-value 39 for x t . The method is simple and fast, but the homology relationship between source and target 40 is often complex, and thus homologous source gene set x s cannot be clearly defined. Also using 41 M directly would be not robust.

42
Our idea for XGSEP is, rather than focusing on only one gene set, to consider many gene . . , x n t . Note that X s (training data) and 48 X t (test data) are different in size of rows (number of genes), and thus it is difficult to compare 49 the two matrices directly, meaning that a regular machine learning model such as a classifier 50 generated by X s cannot be run directly over test data X t . Thus a further idea is to transform 51 both the target and source species into a common space so that the target and source genes can 52 be compared. However this idea cannot be realized by regular machine learning models by the 53 above problem of difference in size between training and test data. We solve this problem by 54 domain adaptation, transfer learning between two domains: target and source. In general domain 55 adaption, a machine learning model, trained by a larger amount of labeled samples from a source 56 domain, is applied to a target domain with very few or no labeled samples [11]. This is exactly 57 the same situation of XGSEP. A common way of domain adaptation methods is to train a model 58 so that the model can reduce the probability gap between two domains. A possible measure for 59 the probability gap, i.e. the difference of two data distributions, is maximum mean discrepancy 60 (MMD) [12][13][14][15]. We will borrow the idea of domain adaptation and MMD to solve XGSEP. 61 We propose a method, XGSEA, standing for Cross-species Gene Set Enrichment Analysis.

62
XGSEA solves XGSEP by three steps: 1) We run GSEA over the source gene sets to obtain gene 63 enrichment scores E s and gene enrichment significance v s . 2) We first define pairwise 64 similarities among gene sets based on M , and then propose a MMD-based domain adaptation 65 method to project X s and X t into a latent common space with affine mappings P s and P t to 66 obtain Z s and Z t , respectively, so that i) the probability gap between Z s and Z t in the latent 67 space is minimized and ii) P s and P t are smooth over the connection M between source and 68 target gene sets.  Domain adaptation model (Equation (4)) Regression model  To the best of our knowledge, there are no existing work for XGSEP. A similar problem setting 90 might be cross-species gene set analysis (XGSA) [16]. The goal of the XGSA is different with 91 our XGSEP. XGSA aims to compare a gene set from one species with a gene set from another 92 species. That is, XGSA directly examines if two gene sets (from two different species) are 93 significantly different or not, only through the homology between genes in given two gene sets. 94 On the other hand, XGSEP estimates enrichment scores through expression data sets obtained 95 under the same phenotype (see Fig 1, though the target expression is assumed to be missing).

96
Thus XGSA is totally different from XGSEP.

97
Problem definition 98 We have two species, source and target. Let A = {a 1 , · · · , a p } be a source (say mouse) gene 99 set, and B = {b 1 , · · · , b q } be a target (say human) gene set. Let M ∈ R p×q be a binary matrix 100 of sequence homology, where the (i, j)-element M (i, j) is 1 if source gene a i is homologous to 101 target gene b j ; otherwise zero. Suppose that we have gene expression matrix G s with phenotype 102 vector y s for source genes only, meaning that we can run GSEA over G s and y s to compute 103 gene set enrichment significance for an arbitrary source gene set.

104
Suppose further that we have multiple gene sets for both source and target. Let 105 S = {S 1 , · · · , S m } be m source gene sets and T = {T 1 , · · · , T n } be n target gene sets. Thus 106 we have a binary matrix (which we call annotation matrix) between A (for rows) and S (for 107 columns), where 1 means that the corresponding gene is in a gene set; otherwise zero. This can 108 be also for the target side. Let X s = [x 1 s , · · · , x m s ] ∈ R p×m be the annotation matrix for source 109 gene sets S 1 , · · · , S m , where the i-th element of x j s is 1 if gene a i is in gene set S j ; otherwise 110 zero. Similarly, let X t = [x 1 t , · · · , x n t ] ∈ R q×n be the annotation matrix for target, where the 111 i-th element of x j t is 1 if gene b i is in gene set T j ; otherwise zero. Then the problem, XGSEP 112 standing for CROSS-species Geneset Enrichment Problem, is, given G s , y s , X s , X t and M , to 113 estimate the enrichment p-value of each gene set in T with respect to the same phenotype of y s . 114 We propose our method XGSEA, standing for CROSS-species Gene Set Enrichment Analysis, 115 to solve XGSEP by using three steps. Fig 2 shows a schematic picture of the three-step 116 procedure of XGSEA. Below we will explain each of these three steps in detail.

117
Step 1: Gene set enrichment analysis for source 118 Since gene expression G s and phenotype y s are both available for the source side, we can 119 directly use regular GSEA to obtain p-values, v s,1 , · · · , v s,m for S 1 , · · · , S m , respectively. In

131
For source gene set S i , we can compute B+1 enrichment scores E 0 s,i , · · · , E B s,i in 1a and 1b 132 to compute p-value v s,i in 1c. Similarly for target gene set T j , we can first predict B+1 133 enrichment scores E 0 t,j , · · · , E B t,j for target gene set T j and then p-value v t,j in 1c.

134
Step 2: Domain adaptation for source and target gene sets 135 We project the target and source genes into a common space, to maximally use the information 136 from the source gene side for the target gene sets.

137
Formulating the objective function 138 We project X s and X t to a common subspace in R d by using affine mappings P s ∈ R p×d and 139 P t ∈ R q×d , respectively, such that Z s = [z 1 s , · · · , z m s ] = P T s X s and In this process, we can set the following two reasonable objectives:

142
(1). Probability divergence between Z s and Z t should be small.

143
(2). Pairwise distances among the gene sets in Z s and Z t should be preserved.

144
For the first objective, we use maximum mean discrepancy (MMD) [12,14]. to measure the 145 divergence. An empirical estimate of MMD can be defined as follows: where φ(·) is a mapping to reproducible kernel Hilbert space H, k(·, ·) = (φ(·), φ(·)) is the 147 kernel associated to this mapping, and (2) For the second objective, we can first define the pairwise homologous similarity between 151 source gene sets S 1 , · · · , S m and target gene sets T 1 , · · · , T n from given data directly as 152 follows, where |A| is the number of genes in set A, S i = φ M (S i ) ⊂ T is the set with the target genes 154 homologous to the source genes in S i , and T j = φ M (T i ) ⊂ S is the set with the source genes 155 homologous to the target genes in T j . The projection should be smooth over homologous Thus entirely divergence D in (1) should be minimized, being regularized by the smoothness (of the projection) over similarity matrix W . Overall the objective function can be given as follows: Optimization on Grassman manifold 158 We can use Then the first 160 term in (4) can be written as 163 Also the regularization term in (4) can be written as The constraints can be changed from P T s P s = P T t P t = I to P T s P s + P T t P t = I which 167 can avoid that all samples collapse to the origin. Finally (4) can be transformed into an easily 168 understandable form: Let f (P ) = trace(K P L) + λtrace(P T GP ). The optimization problem min P T P =I f (P ) can be solved on the Grassmann manifold, with all linear d-dimensional subspaces in R p , since optimizing f (P ) is not affected by any orthogonal transformation of P . We use the conjugate gradient (CG) algorithm on the Grassmann manifold [17] to solve the optimization problem min P T P =I f (P ). The key step is to compute partial derivative ∂ P f (P ), which is used for computing gradient ∇ P f (P ) of f on the manifold at the current estimate P by The search direction is determined at each step by combining the previous search direction with ∇ P f (P ), and in this direction, a line search along the geodesic at the current estimated P is performed. Note that partial derivative ∂ P f (P ) at the current P can be obtained as follows The computational complexity of the above algorithm for solving the optimization problem (5) 172 is O(N 2 + r 2 + N rd). The total number of either human or mouse genes is large, leading to 173 r(= p + q) N (= m + n). This (large r) problem can be a bottleneck for our algorithm, and 174 thus we need to reduce the r-related part of this complexity. For this purpose we propose an 175 approach, which uses QR decomposition, by which the computational complexity is reduced to 176 O(N 2 ). Below, we will explain more detailed manner of our approach. 177 We first use QR decomposition: X = QR, where Q ∈ R r×N is an orthonormal matrix and 178 R ∈ R N ×N is an upper diagonal matrix. By introducingP = Q T P ∈ R N ×d , the objective 179 function in (5) can be transformed as follows: whereG = RF R T , and KP can be obtained using R since Z =P T R = P T X. Thus we can 181 first solve a small-scale optimization problem ofP , i.e.
and then obtain the projections Z = P T X =P T R. Note that now solving (6) by the above 183 algorithm on the Grassmann manifold needs the computational complexity of only O(N 2 ).

184
Step 3 definition. This means that we have one more step to reach p-values from the enrichment scores 194 E t . In detail, the second idea to predict p-value for target gene set T j is that we first predict the 195 enrichment scores B + 1 enrichment scores E 0 t,j , · · · , E 0 t,j for target gene set T j and then 196 estimate p-value v t,j by step 1c in the section of Step1. Based on this idea, we propose to use 197 two regression methods (XGSEA-E and XGSEA-E±). We explain the three methods as follow. 198 • XGSEA-D: Logistic regression on p-values 199 We first train the regression parameters α by source gene sets in the following logistic for i = 1, · · · , m, and then predict p-values for the target gene sets by Finally we can obtain the p-values of the target species by the following, , for j = 1, · · · , n. for i = 1, · · · , m, and then predict the enrichment scores for target gene sets by Finally, we can compute the enrichment p-values for target gene sets by using step 1c in 210 the section of Step1.
for i = 1, · · · , m. The parameters γ + and γ − are learnt by training the source gene sets, 216 and then used to predict enrichment scores for target gene sets by where z + t and z − t are the centers for Z s with positive and negative enrichment scores 218 respectively, and j = 1, · · · , n, b = 0, · · · , B.

220
Comparison methods

221
Since there are no existing methods for XGSEP, we compared XGSEA with three simpler  To evaluate the performance of XGSEA, we need target expression data, so that we can compute 234 ground-truth enrichment p-values. We collected four gene expression data sets as below, where 235 each data set consists of human (target) and another species (source: mouse or zebrafish) which 236 share the same phenotype. Table 2 shows the statistics of the four data sets. control samples. These datasets were also used in another cross-species study [19], while 250 this study is also not on GSEA at all.

251
• Ovarian Cancer (human and mouse): The two Microarray gene expression datasets were 252 downloaded from GEO with accession number GSE6008 and GSE5987, respectively. The 253 human dataset has 21,188 genes with thirteen mucinous ovarian tumors and four control 254 samples, while the mouse dataset has 45,101 genes with seven disease and four control 255 samples. These datasets were also used in the cross-species study [19].

256
• Melanomas (human and zebrafish): The Microarray gene expression datasets of the two 257 species were downloaded from GEO with accession number GSE83343 and GSE83399, 258 respectively. The human dataset has 42, 346 genes with eight disease and four control 259 samples, while the zebrafish dataset has 13, 620 genes with five disease and three control 260 samples. These datasets were collected from two different studies [20,21]. 261 We then accessed Ensembl BioMart through http://www.ensembl.org/ [22] to retrieve homology 262 relationships between 19,404 human and 19,614 mouse genes, and also 16,070 human and In our experiments, we take human species as target species, and take mouse or zebrafish as the 272 target species. We apply our XGSEA approach to predict the enrichment p-values for the 674 273 human pathway gene sets T = {T 1 , · · · , T n }(n = 674), for embryonic development and brain, 274 ovarian and melanomas, respectively. For the target gene sets T = {T 1 , · · · , T n }, we take the 275 training source gene sets S = {S 1 , · · · , S n } in the XGSEA, where S i corresponds to T i , 276 meaning that each gene in S i is homologous to one or more genes in T i .

277
To sufficiently evaluate our XGESA method, we predict enrichment p-values for target gene 278 sets with three experimental settings. Note that the homology between two genes can be 279 classified into four types: one-to-one, many-to-one, one-to-many, and many-to-many, where 280 one-to-one means only one gene in one side is homologous to only one gene in the other side.  T (2) (medium): each set in T (2) has one-to-one or many-to-one genes. That is, target gene 295 g ∈ T (2) i has always only one homologous source gene s, which has one or more homologous 296 target genes including g.  We changed the cutoff for p-values: {5e-1, 1e-1, 5e-2, 2.5e-2, 1e-2, 5e-3, 2.5e-3, 1e-3}, showed better AUCs than the three baseline methods (green, yellow and light blue). 1 1e-1 5e-2 2.5e-2 1e-2 5e-3 2.5e-3 1e- -1 1e-1 5e-2 2.5e-2 1e-2 5e-3 2.5e-3 Table 3. AUCs of six competing methods on four data sets and three target gene sets. The best and second best in each row are in bold and underlined, respectively. The p-value by t-test between the best and each corresponding naive method is shown in brackets.  This table shows that XGSEA significantly outperformed the baseline methods. For example, 342 XGSEA-E± achieved the best in nine out of all 12 cases, followed by XGSEA-E of three cases. 343 Any naive method could neither be the best nor the second best in all cases, the difference from 344 the best being statistically significant in t-test over 20 trials. Also the AUC of T (1) was not 345 necessarily higher than T (2) (also T (3) ), since each one-to-one homologous gene pair between 346 two species is not necessarily the same gene, which would be prediction-wise harder than the 347 case that the target and source gene sets share the same gene.

348
Robustness against parameter value change 349 We examined the performance robustness of XGSEA, regarding parameter (λ) variation. Fig 6 350 shows AUCs of XGSEA-E under three gene sets (T (1) (red), T (2) (blue) and T (3) (black)) on 351 embryonic development and melanomas, when λ is one of {1e-4, 1e-3, 1e-2, 1e-1}. This figure 352 shows that AUC of XGSEA-E was rather stable within the given range, implying that the 353 advantage over the baseline methods will be kept constantly.  We applied these four variants to embryonic development data with target gene set T (3) .
365 Table 4 shows AUCs obtained with the cutoff (for p-values) of 0.01. From Table 4, MMD+WB 366 (i.e. original (4)) achieved the best result for XGSEA-E and XGSEA-E±, and MMD was worst 367 for them. This result implies that all gene set similarity contribute to the performance 368 improvement. 369 We then evaluated the effect of sequence homology on predictive performance, by removing 370 a certain amount of part in sequence homology matrix M : being motivated by that less 371 homology connectivity between two species would cause poorer performance. In more detail, we first randomly chose a certain number of genes from the source and target 373 gene sets, respectively, and kept only the part corresponding to these genes in M . Practically we 374 used 50, 500 and 5,000 for this number of selecting genes, resulting in three matrices: M 50 , 375 M 500 and M 5000 , respectively. Using each of the four sequence homology matrices (including 376 original M ), we ran XGSEA over embryonic development data under gene set T (3) to predict 377 enrichment p-values. Table 4 shows the performance results (AUC) of this experiment. The results show that the 379 AUC was reduced by decreasing the number of randomly selected genes, while if the selected 380 number is 5,000, the performance was almost consistent with that of using the original M , 381 implying that interestingly 5,000 genes might be good enough.

382
Case study: Identifying human pathways for T cell dysfunction and 383 reprogramming from mouse ATAC-Seq 384 It is important for cancer immunotherapy to study the epigenetic regulation of T cell dysfunction 385 and therapeutic reprogrammability: a plastic dysfunctional state from which T cells can be 386 rescued, and a fixed dysfunctional state in which cells are resistant to reprogramming [25].

387
Identifying two (plastic or fixed) dysfunctional chromatin states, through which T cells in 388 tumours differentiate, would be very important to predict, for example, if a patient will respond 389 to a therapy. Using GSE89308 of GEO on ATAC-Seq data of mouse, with 22 samples and the 390 two chromatin states [25], we ran XGSEA-E (B = 100,000, λ=0.01 and d=5) to identify human 391 pathways out of 1,960 Reactome pathways (downloaded from 392 https://reactome.org/download-data). 393 Table 5 shows 11 human pathways identified by XGSEA-E at the cutoff of 0.05, where the 394 top, "gene expression (transcription)", and the fourth "immune system" are large pathways with 395 1367 and 2296 genes, respectively. Obviously due to important chromatin roles in transcription, 396 "gene expression (transcription)" is tightly related to the chromatin states. Also "immune system" 397 definitely plays important roles in T cell dysfunction and reprogramming through a number of 398 membrane proteins, such as CD38, CD101, CD30L, CD5, TCF1, IRF4, BCL2, CD44, PD1, 399 LAG3 and CD62L [25].  This Notch pathway regulates CD8+ T cells by directly upregulating mRNA expression of 405 granzyme B and perforin to maintain memory T cells [27]. Furthermore, the Notch pathway 406 suppression. Ezh2, a suppressor of the Notch pathway, regulates effector T cell polyfunctionality 409 and survival by targeting the Notch signaling pathway [28]. Down regulation of Ezh2 could 410 elicit poor antitumor immunity. Besides, Delta-like 1-mediated Notch signaling enhances the 411 conversion of human memory CD4 T cells into FOXP3-expressing regulatory T cells [29].

412
These facts support the reliability of the pathways identified by XGSEA.

413
On the other hand, we ran a naive approach, HM A , over the same data, under the cutoff of 414 0.05, resulting in 20 pathways showed in Table 6. Although the number of pathways is larger 415 than Table 5, these 20 pathways were diverse and less connected to the chromatin states, such as 416 only two being related to Notch signaling pathways. This result implies that XGSEA-E would 417 be more convincing than HM A .

419
We have defined XGSEP for promoting GSEA on species with scarce expression data, and 420 proposed XGSEA with three steps, which can be simply: 1) GSEA, 2) domain adaptation, and 3) 421 regression. Our empirical supervised validation over four real data sets revealed that XGSEA 422 outperformed three naive approaches in AUC under various settings, particularly the advantage 423 being proved statistically by bootstrapping and t-test. In the case study, mouse ATAC-Seq 424 expression data is used to identify significant human pathways for T cell dysfunction and adaptation would be interesting future work for Step 2. Reasonably in Step 3, we can consider 432 more sophisticated regression models. The most key point of XGSEA is Step 2, i.e. domain 433 adaptation, which would be useful for other problems between two species, such as genome 434 wide association studies between a well-and the other less-sequenced species. This direction of 435 applying domain adaptation to various problems would be also promising future work. On the 436 statistical side, we could also further consider the problem of multiple testing and controlling the 437 false discovery rate or family-wise error rate, which have been well studied in regular GSEA.