Abstract
Following the increase in available sequenced genomes, tissue-specific transcriptomes are being determined for a rapidly growing number of highly diverse species. Traditionally, only the transcriptomes of related species with equivalent tissues have been compared. Such an analysis is much more challenging over larger evolutionary distances when complementary tissues cannot readily be defined. Here, we present a method for the cross-species mapping of tissue-specific and developmental gene expression patterns across a wide range of animals, including many non-model species. Our approach maps gene expression patterns between species without requiring the definition of homologous tissues. With the help of this mapping, gene expression patterns can be compared even across distantly related species. In our survey of 36 datasets across 27 species, we detected conserved expression programs on all taxonomic levels, both within animals and between the animals and their closest unicellular relatives, the choanoflagellates. We found that the rate of change in tissue expression patterns is a property of gene families. Our findings open new avenues of study for the comparison and transfer of knowledge between different species.
Introduction
Gene functions have traditionally been determined using molecular and cellular approaches involving forward or reverse genetics. Functional annotations that were directly determined through these approaches are, however, not available at all for most species, and incomplete even for model species [1]. For non-model species, often only data transferred from other organisms is available. In this case, the degree of conservation of functions is uncertain, especially when a gene is duplicated in a non-model species, but not in the model species where its function was originally studied. Previously, gene coexpression data has been used to find conserved coexpressed modules [2, 3] and to uncover functional similarities between genes from different species [4]. However, the latter approach requires that both species be well studied in terms of gene expression and functional annotation, and it suffers from incomplete and biased annotations [1].
Tissue expression data is available for many species, as tissues can be gathered even from non-model species where genetic tools such as transgenesis or RNAi are not available. Developmental gene expression profiles between closely related species can be compared to find functional links between genes and to detect differences between orthologs [5, 6, 7]. For closely related species, homologous tissues can easily be identified [8], and cross-species correlations between equivalent tissues of closely related species have previously been investigated [9, 10, 11]. Existing approaches require that expression datasets have been obtained under comparable conditions for the respective species. Across larger evolutionary distances, only a few clearly homologous tissues can be determined. Even between closely related species, the relative amounts of cell types within tissues may change. This reliance on homologous tissues is, therefore, a severe limitation for functional mapping between many species: it is not possible to correlate gene expression patterns across species using the traditional methods. If it were possible to compare expression patterns across large phylogenetic distances, we could substantially improve the annotation of non-model-species genomes, fill annotation gaps in model species, and in particular address the problem of functional conservation after gene duplications.
To ameliorate this situation, we have developed a method to map tissue expression patterns of genes from one species to another, without defining equivalent tissues between the two species. Our hypothesis is that groups of functionally related genes will be coexpressed in very different tissues and species due to the re-use of ancestral functional modules. For example, it is possible to identify deep homologies among tissues [12], like homologous structures in the nervous systems of vertebrates and annelids [13,14]. Other organs show functional convergence, e.g. mammalian liver and brown fat in flies, which both carry out xenobiotic clearance functions [15]. For each gene of the source species, our approach predicted a virtual tissue expression pattern in the destination species. The correlation between these virtual expression patterns and the actually observed expression could then be used to score how well the expression of a gene in a target species can be predicted from the expression patterns of its orthologs. Importantly, this scheme can be used to determine the extent to which the transcriptional regulation of sets of genes is conserved across large phylogenetic distances. We then illustrate the potential of our modeling approach with two applications: determining the degree of conservation of tissue-specific gene expression patterns, and comparing the speed of functional divergence between independently evolving members of protein families.
Results
To analyze tissue expression across the entire metazoan kingdom, we gathered genome and tissue expression data from 36 datasets covering 27 different species (Table 1, Table S1). The datasets contained both developmental time courses (e.g. embryonic stages) and static measurements of different tissues (like adult organs; see Supplementary File 1 for a complete list). For the sake of brevity, we refer to all of these samples as “tissues.” Datasets were imported and normalized per gene (see Methods), i.e. we quantified only relative expression changes of the same gene between tissues, instead of comparing expression differences of genes within the same tissue. (Therefore, housekeeping genes and other genes which are globally expressed did not skew our analysis.) When we applied the concept of looking for correlations between orthologs across species to an existing dataset [10], we found that many of the reported lineage-specific expression shifts only changed the absolute expression levels, while the relative expression patterns remained conserved (Fig. S1). Normalizing each gene’s expression individually also avoided technical concerns regarding the comparability of absolute expression values between genes. However, this gene-wise normalization means that the normalized values are influenced by the complement of tissues that have been measured. For this reason, we only included datasets that survey a whole organism or a wide range of developmental time points. The datasets excluded during quality control (see next section) had between five and ten data points. Therefore, six diverse tissues seemed to be a lower limit for the number of data points.
Quality Control
The available datasets differ in their suitability for cross-species mapping. Existing measures for the quality of expression datasets rely on conserved features, e.g. conserved coexpression [42]. Because of the large biological diversity of species in our dataset, relying on conservation of features was not appropriate. We therefore devised a simple measure of dataset quality that only relied on the features of the given dataset. For each normalized dataset, we performed Principal Component Analysis and determined the proportion of variance represented by each eigenvector. We then calculated v50, the fraction of leading components needed to represent at least half of the total variance. For example, for the C. elegans dataset, the first four out of forty principal components explain just above 50% of the variance (hence, v50 = 0.1). Based on the observed correlation between v50 and median mapping quality (Fig. S2), we chose v50 > 0.25 as the criterion for removing the five worst datasets from our analysis: Hydra vulgaris, Amphimedon queenslandica, Bombyx mori, Brugia malayi and Ascaris suum.
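The v50 measure can be illustrated with a short sketch. It assumes a genes × tissues matrix of normalized values and uses NumPy's singular value decomposition to obtain the variance per principal component; the function name `v50` and the column-centering step are choices made for this illustration, not details taken from the original analysis.

```python
import numpy as np

def v50(expr):
    """Fraction of leading principal components needed to reach 50% of variance.

    expr: 2-D array, rows = genes, columns = tissues (normalized values).
    """
    expr = np.asarray(expr, dtype=float)
    centered = expr - expr.mean(axis=0)           # center each tissue column
    singular = np.linalg.svd(centered, compute_uv=False)
    variances = singular ** 2                     # proportional to PCA variances
    cumulative = np.cumsum(variances) / variances.sum()
    # index of the first component at which the cumulative share reaches 0.5
    n_needed = int(np.searchsorted(cumulative, 0.5)) + 1
    return n_needed / len(variances)
```

Under this definition, a dataset in which 4 of 40 components reach half of the variance gives 0.1, matching the C. elegans example, and a dataset would be excluded if the returned value exceeds 0.25.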
Mapping Gene Expression Between Species
In order to compare expression patterns across distant species, we first need to map the patterns. Our concept rests on the notion that the expression of a gene in a specific tissue of a target species can be predicted using the expression pattern of that gene across the tissues in the source species. For example, a gene specifically expressed in insect neurons is likely to be expressed also in the mouse brain. Here we show that this concept holds even if these “matching” tissues are not known. Consider the example of mapping gene expression patterns from fly to mouse (Fig. 1). We model $m_{g,t}$, the relative expression of gene $g$ in mouse tissue $t$, as a linear combination of the relative expression levels $f_{\hat{g},s}$ in all fly tissues $s$:

$$m_{g,t} = \beta_{0,t} + \sum_{s} \beta_{s,t}\, f_{\hat{g},s} + \epsilon_{g,t}$$

where $\hat{g}$ is the fly ortholog of gene $g$ and $\epsilon_{g,t}$ is the residual error. The regression coefficients $\beta_{s,t}$ and the intercept $\beta_{0,t}$ are fitted using all 1:1 orthologs between mouse and fly. Subsequently, this model can be applied to all fly genes to predict the expression in mouse tissue $t$. We used linear models in this first description of the method as they are simple, transparent and efficient, and relatively robust to over-fitting. Of course, other methods may be used as well. For example, Random Forest regression [43] can deal with non-linearity, while the lasso [44] could be used to deal with redundancy between source tissues.
Expression Distances Between Genes
After mapping expression patterns between species, we quantified how well a gene can be predicted by correlating its predicted expression across all tissues with the respective measurements of the target species using Pearson’s correlation coefficient (see Methods). These pairwise correlations between genes could be calculated for different sets of genes: phylogenetically unrelated genes, orthologs and 1:1 orthologs. Of these, 1:1 orthologs had the highest correlations (Fig. 2A). However, the overall distribution of correlations differed between dataset pairs, e.g. due to the varying number of tissues or variable data quality. Therefore, we computed an expression distance based on the quantiles $q_{x,y}$ of the matrix of correlation coefficients $R = [r_{x,y}]$. We found that lineage-specific genes (i.e. those without homologs between the two species under consideration) tended to have lower correlations than genes with homologs. Therefore, to calculate quantiles, we computed the matrix $R$ only for genes with homologs between the two species. We first analyzed correlations between 1:1 orthologs and checked if they depended on different properties of the genes (Fig. S3). We found that when the target gene had many coexpressed genes in the same species, the cross-species expression correlation tended to be higher (Fig. S3). To correct for this effect, we considered only target genes with similar numbers of coexpressed genes when computing the expression distance (Fig. 2 and Fig. 9 in Methods section). By design, the expression distance of background gene pairs had a uniform distribution. When making inferences about 1:1 orthologs, we used linear models based on a 10-fold cross-validation.
Benchmarks
In order to establish the biological relevance of our expression distance measure, we applied benchmarks at three levels, namely sequence, structure, and function. On the sequence level, we found that expression distances could be used as a signal to decide which of the top two BLAST hits for a query protein is the true 1:1 ortholog of the query protein in the target species (Fig. 3A and Fig. S4). On the structural level [45], expression distance and the number of proteins belonging to a structural fold were correlated (Fig. 3B and Fig. S5). That is, structural folds with fewer members, and hence lower functional diversity, were more similar in their expression patterns across species.
Lastly, on the functional level, we applied the phenolog concept [46] to find equivalent phenotypic annotations across species. We found that expression distances could be used to better predict which member of a protein family has been annotated with a matching phenotype (Fig. 3C and Fig. S6).
Conservation of Gene Expression Programs
At all taxonomic levels, we determined the conservation of the expression patterns of 1:1 orthologs. This data then allowed us to estimate the degree of conservation of tissue-specific expression patterns, even between groups of species that do not have readily identifiable homologous organs. For each pair of datasets, we first computed the median expression distance of 1:1 orthologs (Fig. 4). We then tested for each pair of datasets whether the distribution of expression distances between 1:1 orthologs was shifted towards lower values, i.e. if the median is below 0.5. Using the Wilcoxon signed-rank test and controlling for multiple hypothesis testing with the Benjamini-Hochberg method [47], we found that all dataset pairs had significant shifts to lower expression distances (q < 0.05 and median < 0.5). This analysis revealed both an expected enrichment for closely related species and unexpectedly high enrichments between very distant species, such as between chordates and insects. In general, developmental datasets mapped less well to other species than datasets of adult tissues.
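The multiple-testing step can be sketched as follows. The sketch takes one p-value per dataset pair as given (for instance from a one-sided Wilcoxon signed-rank test against a median of 0.5, which is not reimplemented here) and converts them to Benjamini-Hochberg q-values; the function name is a placeholder chosen for this illustration.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        qvalues[idx] = running_min
    return qvalues
```

A dataset pair would then count as significantly shifted if its q-value is below 0.05 and its median expression distance is below 0.5.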
To summarize the data shown in Fig. 4, we computed median expression distances for 1:1 orthologs across all internal nodes of the phylogenetic tree (e.g. for vertebrates, we compared expression patterns between fish and tetrapods). As the median expression distances vary greatly between dataset pairs, we also computed the distribution of expression distances and the number of well-conserved OGs for the best dataset pair across each internal node (Fig. S8 and Fig. S9). Using a Wilcoxon signed rank test, we then tested if the distribution of median expression distances is shifted towards lower values, i.e. if the median of the distribution is lower than 0.5. This was the case for all internal nodes, with the highest p-value (5e-49, median value: 0.39) observed when mapping from the ctenophore Pleurobrachia to other animals. This confirmed that our approach could predict expression patterns over large evolutionary distances (Fig. 5 and 6). For some clades, the available data was very uneven on the two sides of the internal node. For example, at the level of eumetazoa, only one species with few tissues was available for cnidarians, whereas most bilaterian species had many tissues measured. Thus, expression distances were higher when mapping from cnidarians to bilaterians than the other way round. Interestingly, the median divergence between animals and the outgroup choanoflagellates was comparable to the median divergence between major animal clades, e.g. bilateria. Thus, mapping tissue-specific gene expression revealed expression programs conserved for 1 billion years.
Correlations Between Expression Changes of Homologs
Next, we asked whether conservation of expression programs depends on the functions of genes, i.e. whether certain gene functions generally imply a stronger conservation of expression programs than other functions. To this end, we compared the expression distances of gene families in different clades under the assumption that functional constraints would lead to expression conservation in independent clades. If the rate of expression divergence is a property of the gene family, we expect a correlation between the expression similarities for each family in different clades. In other words, a gene family that has a conserved expression pattern in one clade should also have a conserved expression pattern in another clade. For each internal node with two or more species on either side of the split, we calculated the median expression distance per gene family within each of the two clades. Out of four internal nodes with more than one species on both sides, we found significant Spearman correlations (rs) of median expression similarities for three splits (Fig. 7A): between tetrapods and fishes (rs=0.18, #12 in Fig. 5), between protostomes and deuterostomes (rs=0.15, #4), and between nematodes and insects (rs=0.06, #7).
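As an illustration of the comparison between clades, the following sketch computes a Spearman rank correlation between two lists of per-family median expression distances, one list per clade and aligned by gene family. It is a plain-Python version written for this example (average ranks for ties, then Pearson correlation of the ranks); the input lists are hypothetical.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # mean rank of the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(a, b):
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def spearman(a, b):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    return pearson(average_ranks(a), average_ranks(b))
```

For example, `spearman(medians_clade_a, medians_clade_b)` would be positive if families that are well conserved in one clade also tend to be well conserved in the other.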
The previous analysis was only possible for a subset of the taxonomic splits in our body of data, due to the requirement of having more than one species on either side of the split. We therefore also analyzed the fate of duplicated genes. In this case, we tested whether duplication products are more similar if the non-duplicated members of the gene family have low expression distances across the species outside the duplication event. Indeed, we found significant negative correlations between the median expression distance among the non-duplicated genes and the intra-species correlation of the duplicated genes (Fig. 7B). For example, duplicated genes in fish were more similar (i.e. had a higher correlation) when the corresponding tetrapod genes had more similar expression patterns (i.e. had a low expression distance): rs=−0.11 for 2045 pairs of duplicated genes, corresponding to a p-value of 3e-7. Taken together, these two observations implied that for a significant fraction of genes, the rates of change in gene expression patterns were correlated between independently evolving clades.
Evolution of the Beta Catenin Protein Family
We selected the beta catenin protein family [48] as an example to illustrate the implications of our work. Beta catenin proteins are involved in regulating cell adhesion and gene transcription through the Wnt signaling pathway. Ancestrally, there was a single beta catenin protein, which duplicated independently in the nematode and vertebrate lineages [49]. Hence, Drosophila, Anopheles and Schistosoma only have one beta catenin, armadillo. We found this protein to be similar in its expression patterns to both the vertebrate and nematode beta catenins (Fig. 8), which is indicative of their functional similarities [50]. In vertebrates, two forms exist: beta catenin and plakoglobin. These two proteins have largely overlapping functions [51] and consequently, their observed expression distance was very low. In nematodes, the outcome of the repeated gene duplications [52, 53, 54] is very different: three of the duplication products (hmp-2, wrm-1, and sys-1) are very similar to each other in their expression patterns, which can be explained by their cooperation in the non-canonical Wnt signaling pathway and the SYS pathway [55]. These three proteins had high expression distances to bar-1. In contrast to them, bar-1 is part of a canonical Wnt signaling pathway [55]. We also observed that bar-1 had a low expression distance to the vertebrate plakoglobin, while hmp-2, wrm-1, and sys-1 had high expression distances. Among the nematode genes, vertebrate beta catenin had the lowest expression distance with hmp-2. This example illustrates that our method is able to uncover patterns of functional similarity and divergence both between closely related species and across large evolutionary distances.
Discussion
The presented analysis established and benchmarked a new method, and provided two examples of biological conclusions that can be reached with our method: there is widespread conservation of expression regulation across very large evolutionary distances, and the expression programs of different gene families evolve at distinct rates. Presumably, the latter observation is explained by variable functional constraints between gene families.
In particular, we have shown that tissue-specific gene expression can be predicted across large evolutionary distances, even in the absence of apparent similarities between the species’ tissues. Our approach can be rationalized as follows: we assume that evolution conserves the coexpression of functionally related genes, both on the level of homologous cell types and on the level of functional modules that occur in unrelated tissues. Our analysis demonstrated that the expression patterns of such conserved gene modules can be predicted across species using 1:1 orthologs as “anchors.” This approach worked despite the fact that the tissues themselves are only conserved within smaller clades. Control of gene expression by transcription factors, miRNAs and other factors is known to turn over rather quickly [56, 57, 58]. Most probably, functional dependencies between genes lead to shared expression patterns over large evolutionary distances. Further research will be needed to reveal which expression similarities between tissues are caused by homology and which are caused by convergent evolution.
Methods
Detection of Orthologous Proteins
To determine orthology relations between genes, we assembled groups of orthologs (OGs) using the eggNOG pipeline [59] on the genomes of the choanoflagellate Salpingoeca rosetta and 67 animals. We then computed gene trees for all OGs using GIGA [60], which we then analyzed to extract 1:1 orthologs and duplication events.
Expression Data Pre-processing
Datasets were obtained either from repositories like ArrayExpress and GEO, from supplementary materials or the respective websites of the resources. Expression profiles were then mapped to our set of genes by one of the following methods (see Table S1): If possible, genes were mapped by given identifiers, such as Affymetrix, Ensembl or WormBase identifiers. If identifiers could not be used for microarrays, we mapped probe sequences to transcripts using exonerate [61], allowing for up to three mismatches and discarding probes that mapped to multiple genes. In the case of RNA-seq data without matching identifiers, we trimmed adapters and mapped reads to annotated transcripts using tophat2 and cufflinks 2.1.1 [62, 63] and used the resulting FPKM counts.
In initial small-scale tests, we tested several normalization methods [11, 64], and settled on a z-like normalization of expression vectors $x$, which corresponds to the Euclidean normalization of $x$ minus its median value $\tilde{x}$:

$$x' = \frac{x - \tilde{x}}{\lVert x - \tilde{x} \rVert_2}$$
RNA-seq data, e.g. the Drosophila modENCODE dataset, contained zeros, which were of course not suitable for logarithmic analysis. For these datasets, we determined the expression value of the 1/1000th quantile of all genes with non-zero expression. All expression values were incremented by this value.
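The pre-processing steps described above can be sketched as follows. Several details are assumptions made for this illustration and are not stated in the text: the logarithm base (2), the way the 1/1000th quantile is indexed, and the handling of completely flat profiles. All function names are placeholders.

```python
import math
from statistics import median

def pseudocount(values, quantile=0.001):
    """Value at the given quantile of all non-zero expression values."""
    nonzero = sorted(v for v in values if v > 0)
    return nonzero[int(quantile * (len(nonzero) - 1))]  # assumed indexing

def normalize_gene(x):
    """Subtract the median, then scale the vector to unit Euclidean length."""
    med = median(x)
    centered = [v - med for v in x]
    norm = math.sqrt(sum(c * c for c in centered))
    if norm == 0:
        return [0.0] * len(x)          # flat profile: no relative changes
    return [c / norm for c in centered]

def preprocess(matrix):
    """matrix: list of per-gene expression lists (e.g. FPKM values)."""
    eps = pseudocount(v for row in matrix for v in row)
    logged = [[math.log2(v + eps) for v in row] for row in matrix]
    return [normalize_gene(row) for row in logged]
```

Because every gene is scaled separately, only the relative pattern across tissues survives, which is the property the mapping relies on.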
Mapping of Tissue Expression Patterns
For each pair of datasets, individual linear models were fitted for each tissue of the target species, using the tissues of the source species as input. (Note that due to the normalization, one tissue is redundant and therefore left out. This also implies that the coefficients of the linear model are not directly interpretable.) The set of 1:1 orthologs between the two species was used to fit the linear models. When there were multiple probes per gene, all combinations of probes were added to the tissue expression matrix. When there were many tissues in the source species, but few 1:1 orthologs, there was the danger of over-fitting. We therefore allowed only one predictor (i.e. one tissue from the source species) per 15 samples (i.e. 1:1 orthologs) [65]. For each pair of species, the safe number of predictors was calculated. If there were too many tissues, we combined tissues using k-means clustering and used the centers of the clusters as predictors. This situation only occurred for six out of 1260 dataset pairs. The fitted models were then applied to all genes of the source species, yielding corresponding predicted expression patterns in the target species. Since 1:1 orthologs are used for training, we used predictions from a 10-fold cross-validation for these genes.
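The over-fitting safeguard and the cross-validation split can be illustrated with a small sketch. The rule of one predictor per 15 samples is taken from the text; everything else is an assumption for illustration, in particular that the budget is computed by integer division and that all probe combinations of the same ortholog pair are kept in the same fold. The k-means step itself is not shown.

```python
def max_predictors(n_orthologs, samples_per_predictor=15):
    """Largest number of source tissues allowed as predictors."""
    return n_orthologs // samples_per_predictor

def needs_clustering(n_source_tissues, n_orthologs):
    """True if source tissues would have to be merged (e.g. by k-means)."""
    return n_source_tissues > max_predictors(n_orthologs)

def assign_folds(pair_ids, k=10):
    """Assign each row to one of k folds, keeping rows of a pair together.

    pair_ids: one ortholog-pair identifier per row of the training matrix.
    """
    fold_of_pair = {}
    for pid in pair_ids:
        if pid not in fold_of_pair:
            fold_of_pair[pid] = len(fold_of_pair) % k
    return [fold_of_pair[pid] for pid in pair_ids]
```

With the 3120 fly-mouse 1:1 orthologs, `max_predictors(3120)` evaluates to 208, so the 26 fly tissues can all be used without clustering.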
Mathematical Description of Expression Mapping
To illustrate our approach, we describe it for a specific pair of datasets, namely mapping expression values from fly to mouse (dataset “MOE”). The same procedure can be applied to all pairs of species. To predict tissue expression patterns of the 51 mouse tissues based on the 26 fly tissues, we fitted 51 separate linear models, one for each mouse tissue, based on 1:1 orthologs.
A given dataset of gene expression values across many tissues of a species can be treated as an expression matrix: rows correspond to genes and columns to tissues. Hence, it is possible to look at gene expression vectors that correspond to a single row, and tissue expression vectors that correspond to a single column. Consider the matrices of normalized expression values for fly $F_0$ and mouse $M_0$. $F_0$ contained 13,264 rows corresponding to 12,225 genes and 26 columns. $M_0$ contained 23,624 rows for 14,307 genes and 51 columns. From the 3120 1:1 orthologs, sub-matrices $F$ and $M$ were constructed such that the same row in the two matrices corresponds to a given pair of 1:1 orthologs. When multiple expression measurements per gene were available, the matrices contained all possible combinations of measurements. (E.g. if there were three probes corresponding to one gene, and two for the ortholog, a total of six rows were dedicated to this pair of orthologs.) Due to these combinations, $F$ and $M$ each had 4447 rows. A single linear model to fit expression values in mouse tissue $t$ for genes $g$ was thus found by minimizing the errors $\epsilon_{g,t}$ in:

$$M_{g,t} = \beta_{0,t} + \sum_{s=1}^{25} \beta_{s,t}\, F_{g,s} + \epsilon_{g,t}$$
Only 25 parameters were needed in the sum, because the normalization produced a matrix with equal row sums. Therefore one variable was redundant. This approach can also be formulated as a matrix multiplication, using $B = [\beta_{s,t}]$ as parameter matrix, $B_0 = [\beta_{0,t}]$ for the offsets, and $E = [\epsilon_{g,t}]$ as error matrix:

$$M = F B + B_0 + E$$

Here the row of offsets $B_0$ is added to every row of $F B$.
Once $B$ and $B_0$ have been determined, they can be applied to the full expression matrix $F_0$ to create a matrix $V$ of virtual expression values for fly genes in mouse tissues:

$$V = F_0 B + B_0$$

with the offsets $B_0$ added to every row.
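The fitting and prediction steps can be sketched with NumPy's least-squares solver. This is a minimal illustration, not the original implementation: the function names are placeholders, the last source tissue is arbitrarily chosen as the redundant column, and cross-validation and probe handling are omitted.

```python
import numpy as np

def fit_mapping(F, M):
    """Fit one linear model per target tissue by ordinary least squares.

    F: rows = 1:1 ortholog pairs, columns = source tissues.
    M: same rows, columns = target tissues.
    Returns (B0, B): offsets per target tissue and the coefficient matrix.
    """
    predictors = F[:, :-1]                        # drop one redundant tissue
    design = np.hstack([np.ones((F.shape[0], 1)), predictors])
    coef, *_ = np.linalg.lstsq(design, M, rcond=None)
    return coef[0], coef[1:]

def predict_virtual(F0, B0, B):
    """Virtual expression of all source genes in the target tissues."""
    return F0[:, :-1] @ B + B0                    # B0 broadcasts over rows
```

For the fly-to-mouse example, `F` would have 26 columns (25 used), `M` 51 columns, and `predict_virtual` applied to the full fly matrix would return one virtual 51-tissue profile per fly row.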
Cross-species Correlations Between Expression Patterns
For each fly gene $x$, an expression vector based on mouse tissues had been predicted from its fly expression vector: $X = (V_{x,1}, V_{x,2}, \ldots, V_{x,51})$.
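Thus, for any mouse gene $y$ with expression vector $Y = (Y_1, Y_2, \ldots, Y_{51})$, the weighted sample Pearson correlation coefficient $r_{x,y}$ could be calculated (Fig. 2A):

$$r_{x,y} = \frac{\sum_{t} w_t\,(X_t - \bar{X}_w)(Y_t - \bar{Y}_w)}{\sqrt{\sum_{t} w_t\,(X_t - \bar{X}_w)^2 \; \sum_{t} w_t\,(Y_t - \bar{Y}_w)^2}}$$

where $w_t$ is the weight of tissue $t$, and $\bar{X}_w = \sum_t w_t X_t / \sum_t w_t$ and $\bar{Y}_w = \sum_t w_t Y_t / \sum_t w_t$ are the weighted means.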
Weights on the tissues were calculated using the Gerstein-Sonnhammer-Chothia (GSC) weighting scheme to reduce the effect of uneven coverage of different anatomical regions [66]. For example, in the mouse tissue dataset, there were many different brain tissues with highly correlated expression patterns. Hence, a gene that was well predicted in one brain tissue was likely to be well predicted in other brain tissues. When multiple measurements were available for the source or target gene, we reported the maximum of all pairwise correlations.
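A weighted Pearson correlation of this kind can be sketched in a few lines. The tissue weights are taken as given here (computing GSC weights requires a tree of tissues and is not shown), and the function names are placeholders for this illustration.

```python
import math

def weighted_pearson(x, y, w):
    """Weighted sample Pearson correlation of x and y with tissue weights w."""
    total = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
    cov = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    var_x = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    var_y = sum(wi * (yi - mean_y) ** 2 for wi, yi in zip(w, y))
    return cov / math.sqrt(var_x * var_y)

def best_correlation(predicted_profiles, observed_profiles, w):
    """Maximum over all pairs of measurements (e.g. several probes per gene)."""
    return max(
        weighted_pearson(p, o, w)
        for p in predicted_profiles
        for o in observed_profiles
    )
```

Down-weighting groups of similar tissues, such as many brain regions, keeps a single well-predicted anatomical region from dominating the correlation.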
Computation of Expression Distances
For each pair of datasets, we computed a matrix of predicted expression patterns of all genes from the source species. We observed a strong correlation between the cross-species expression correlation and the number of coexpressed genes in the target species (Fig. S3). This strong correlation indicated that predictions were biased towards the average target gene (i.e. the average expression profile of all genes considered in the target species), which in turn was similar to many target genes. As a consequence, these “close-to-average” target genes had higher correlations with mapped source genes, and thus seemed more conserved. To counter this effect, target genes were split into ten bins according to the number of coexpressed genes in the target species (Fig. 9). For each bin, we separately determined the distribution of cross-species expression correlations between all genes. Given this distribution, we determined a conversion function from the cross-species expression correlation to the corresponding quantile.
Thus, there exist ten conversion functions from weighted Pearson correlation to an uncorrected expression distance. For a given pair of genes, the final expression distance is interpolated from the two adjacent bins. We determined the number of coexpressed genes for each target gene as follows: we first computed all pairwise correlations among the target genes of the training set. Then, we determined the correlation cutoff corresponding to the top 10%, and counted for each gene how many other target genes were among the global top 10% correlations. For technical reasons, we sampled one million pairs of background genes, such that the lowest possible expression distance is 1e-6.
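The conversion from correlation to expression distance can be sketched as follows. The sketch assumes that the distance is the fraction of background pairs with a correlation at least as high as the observed one (so that the smallest possible value with one million background pairs is 1e-6), and that the interpolation between adjacent bins is linear in the number of coexpressed genes. Both points, and all names, are assumptions made for this illustration.

```python
from bisect import bisect_left

def make_converter(background_correlations):
    """Build a correlation -> distance function for one bin."""
    background = sorted(background_correlations)
    n = len(background)

    def to_distance(r):
        at_least = n - bisect_left(background, r)   # background pairs with corr >= r
        return max(at_least, 1) / n                 # never below 1/n
    return to_distance

def expression_distance(r, n_coexpressed, bin_centers, converters):
    """Interpolate between the converters of the two adjacent bins.

    bin_centers: ascending, one representative coexpression count per bin.
    """
    if n_coexpressed <= bin_centers[0]:
        return converters[0](r)
    if n_coexpressed >= bin_centers[-1]:
        return converters[-1](r)
    hi = bisect_left(bin_centers, n_coexpressed)
    lo = hi - 1
    t = (n_coexpressed - bin_centers[lo]) / (bin_centers[hi] - bin_centers[lo])
    return (1 - t) * converters[lo](r) + t * converters[hi](r)
```

Because the distance is a quantile of the background distribution, background gene pairs receive approximately uniformly distributed distances, while well-conserved pairs fall near zero.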
Data Access
Protein sequences, normalized datasets, assignments and expression distances of 1:1 orthologs have been deposited at http://dx.doi.org/10.6084/m9.figshare.1362211. Separately, all pairwise mappings of expression patterns between datasets are available at http://dx.doi.org/10.6084/m9.figshare.1362240
Funding
MK is funded by the Deutsche Forschungsgemeinschaft (DFG KU 2796/2-1). AB receives funding from the Deutsche Forschungsgemeinschaft (DFG CRC 680).
Author Contributions
AB and MK conceived the study, planned the analyses and wrote the paper. MK conducted all analyses.
Competing Interests
The authors declare that there are no competing interests.
Acknowledgements
The authors thank Anthony A. Hyman and Vineeth Surendranath for helpful discussions.
Footnotes
** andreas.beyer@uni-koeln.de (Phone: +49 221-478 84429, Fax: +49 221-478 84045)
References
- [1].
- [2].
- [3].
- [4].
- [5].
- [6].
- [7].
- [8].
- [9].
- [10].
- [11].
- [12].
- [13].
- [14].
- [15].
- [16].
- [17].
- [18].
- [19].
- [20].
- [21].
- [22].
- [23].
- [24].
- [25].
- [26].
- [27].
- [28].
- [29].
- [30].
- [31].
- [32].
- [33].
- [34].
- [35].
- [36].
- [37].
- [38].
- [39].
- [40].
- [41].
- [42].
- [43].
- [44].
- [45].
- [46].
- [47].
- [48].
- [49].
- [50].
- [51].
- [52].
- [53].
- [54].
- [55].
- [56].
- [57].
- [58].
- [59].
- [60].
- [61].
- [62].
- [63].
- [64].
- [65].
- [66].
- [67].
- [68].
- [69].
- [70].