Coexpression reveals conserved mechanisms of transcriptional cell identity

What makes a mouse a mouse, and not a hamster? The answer lies in the genome, and more specifically, in differences in gene regulation between the two organisms: where and when each gene is expressed. To quantify differences, a typical study will either compare functional genomics data from homologous tissues, limiting the approach to closely related species; or compare gene repertoires, limiting the resolution of the analysis to gross correlations between phenotypes and gene family size. As an alternative, gene coexpression networks provide a basis for studying the evolution of gene regulation without these constraints. By incorporating data from hundreds of independent experiments, meta-analytic coexpression networks reflect the convergent output of species-specific transcriptional regulation. In this work, we develop a measure of regulatory evolution based on gene coexpression. Comparing data from 14 species, we quantify the conservation of coexpression patterns 1) as a function of evolutionary time, 2) across orthology prediction algorithms, and 3) with reference to cell- and tissue-specificity. Strikingly, we uncover deeply conserved patterns of gradient-like expression across cell types from both the animal and plant kingdoms. These results suggest that ancient genes contribute to transcriptional cell identity through mechanisms that are independent of duplication and divergence.


INTRODUCTION
Understanding how the genome changes as species diverge is a central question in evolution. With access to sequence data, comparative genomics research initially focused on the association between gene family conservation and the phenotypes that emerge in particular lineages [1][2][3][4] . While this approach continues to shed light on genome evolution 5 , it provides at best an incomplete picture, omitting phenotypic differences that can be driven by changes in gene regulation 6,7 . There has been growing interest in using functional genomics data to find regulatory differences by comparing homologous samples, often focusing on gene expression as the output of changing regulatory architecture between species [8][9][10][11][12] . Yet because of its dependence on anatomical homology, this approach is necessarily limited to more closely related species. How then, can we compare regulatory conservation across the tree of life? The answer is coexpression.
In a coexpression network, genes are nodes and the edges represent expression similarity between genes, often a correlation coefficient. Functionally related genes are often adjacent in the network as their expression is coordinated across biological conditions, allowing for the inference of gene regulatory modules through clustering. As early as 2003, comparative coexpression approaches were used to demonstrate similarity in gene regulation between species as distant as plants and mammals [13][14][15][16][17][18] . However, these studies were limited by both the data and methods available, often relying on a small number of microarray datasets from a handful of model organisms, and ortholog predictions from BLAST 19,20 .
We breathe new life into this area by taking advantage of high-powered coexpression networks from animals, plants and yeast RNA-sequencing (RNA-seq) data 21 , as well as modern orthology prediction algorithms [22][23][24] , and measuring the conservation of coexpression relationships. As expected, we find that coexpression conservation tracks with evolutionary distances 25 , and is significantly higher for genes expressed in all cell types. However, by linking these results to single-cell expression data 26-30 , we find a fascinating possible role for "ubiquitous" genes in transcriptional cell identity. In both mouse and Arabidopsis, we find constitutively expressed genes with gradient-like expression distributions across cell types. In combination with their high degree of coexpression conservation, this suggests that these genes have tightly maintained functions that contribute to continuous aspects of cell identity. Coexpression conservation provides a data-driven estimate of gene functional divergence across species, and we have made all of our data, methods and results available to facilitate its use.

Establishing meta-analytic coexpression networks as a tool for comparative genomics
Reliable estimates of species-specific gene coexpression patterns are a necessary backbone for comparative analysis of regulatory divergence. We recently published a set of networks with our coexpression webserver, CoCoCoNet 21 , which contain RNA-seq data from 14 species across 895 datasets, and more than thirty-nine thousand individual samples ( Figure 1A). Here, we establish the power and robustness of these networks by 5 measuring the connectivity of genes with the same Gene Ontology (GO) annotations 31 , and by evaluating the stability of results after bootstrapping the network building process.
As a first validation, we find that species-specific networks built from multiple datasets ("aggregate" networks) have strong connections between genes from the same GO group, and that these connections are significantly stronger than those found in networks from individual datasets ( Figure 1B These results indicate that aggregate networks are statistically robust and biologically meaningful, with strong connections between functionally related genes.

Similarity of coexpression neighborhoods quantifies ortholog conservation
Having established the robustness of our networks, we next explore the degree to which they can be used for cross-species comparisons. Our analysis focuses on characterizing the degree to which pairs of orthologs have retained similar coexpression patterns, or "neighborhoods" in the networks.
To measure the similarity of ortholog neighborhoods between two species, we first subset networks to include only one-to-one orthologs between that species pair (e.g., pig and yeast as shown in the schematic, Figure 2A). Next, each gene's neighborhood is defined 7 by ranking all edges associated with it, and then the ranks of the gene's top coexpressed gene pairs are compared across species (see Methods for details). This is expressed as an AUROC and so it ranges from 0-1, with 1 meaning perfect coexpression conservation, 0.5 consistent with random re-ordering of neighborhoods, and 0 meaning that coexpression partners have inverted from being the top ranked to bottom ranked across species.
As a first validation of our approach, we find that coexpression conservation is strongly negatively associated with phylogenetic distances between species, demonstrated in Figure 2B with respect to distance from human (Spearman correlation coefficient= -0.95).
We also find that coexpression conservation is sensitive to the amount of underlying data as expected, with strong performance achieved with the inclusion of twenty datasets and scores plateauing beyond this point ( Figure 2C, mean individual networks=0.55+/-0.04, mean 20-dataset aggregate=0.68+/-0.08, Wilcoxon p<0.002, n=7 species).
One-to-one orthologs are frequently used for comparative genomics analyses 32 , however, as species grow more distant to one another, there are fewer one-to-one orthologs for comparison, particularly in the plant kingdom, where genome duplication events are common 33 . To explore more distant and complex relationships, we generalized our method to be able to compare groups of orthologs, i.e., all genes descended from a single gene in a common ancestor, including lineage-specific duplicates (see Supplementary   Figure 1 for a schematic). As validation, we assessed the conservation of groups of orthologs descended from the last common ancestor of all eukaryotes. We find that 8 coexpression conservation scores for many-to-many (aka N-to-M) orthologs are strongly associated with phylogenetic distances between species ( Figure 2D). We next evaluated all N-to-M orthologs defined in the last common ancestor between each pair of species, finding that one-to-one orthologs have higher conservation than N-to-M orthologs on average ( Figure 2E, mean 1-to-1 = 0.79 +/-0.14, mean N-to-M = 0.59 +/-0.14, Wilcoxon p<10 -16 ). Notably, we also find cases where N-to-M orthologs are strongly differentially conserved, with ~7% of all N-to-M groups containing orthologs with scores >0.7 and <0.5. Figure 2F. Here, we see that mouse Eif4e shares a large fraction of its coexpression neighborhood with human EIF4E, but that it is quite distinct from human EIF4E1B.

An example of this is shown in
These results indicate that coexpression neighborhoods can be usefully compared across species to provide a measure of ortholog conservation. Distinguishing orthologs by differential coexpression conservation may provide a route forward for discovering genes with "the same" function across species (a.k.a. "functional analogs").

Genes expressed in all cell types have conserved coexpression relationships and contribute to cell identity
In the previous section we established that coexpression conservation tracks with phylogeny as expected, and that one-to-one orthologs are more likely to have similar coexpression than those that have duplicated. We also find evidence of strong divergence within orthogroups, which has long been postulated to be an evolutionary mechanism for morphological innovation 34 . Earlier work to probe this hypothesis showed that older genes tend to be more ubiquitously expressed across cell types and tissues, and have greater conservation across species 9,12 , and this has typically been interpreted to mean that younger genes are required for phenotypic novelty.
By leveraging coexpression relationships we can 1) extend these observations to any species pair of interest without requiring knowledge of homologous tissues or cell types, and 2) identify conserved relationships between genes, reflecting conserved regulation and function. Importantly, we can investigate the role of conserved coexpression in cell phenotypes by taking advantage of single-cell RNA-seq data. We use comprehensive single-cell RNA-seq data from the Tabula Muris project for mouse 26 , and four single-cell RNA-seq datasets from A. thaliana root 27-30 as references for cell-type specific expression ( Figure 3A, see Methods for details), and we use data from the Genotype-Tissue Expression Project (GTEx) 35 as a reference for human tissue-specific expression.
Consistent with previous research, we find that genes expressed ubiquitously across tissues have higher coexpression conservation than those with tissue-specific expression Because variability occurs in the absence of cell type variation in yeast (it may reflect temporal or state variation), we wondered whether we could find a similar effect in multicellular organisms when holding cell type constant. Our expectation is that cells of the same type may vary by cell cycle phase, nascency, or activation state, for example. Strikingly, after calculating expression variation within each cell type in Tabula Muris, we find that the most variable genes have the most strongly conserved coexpression patterns ( Figure 3C). In other words, while their coexpression partners remain consistent, their expression levels vary dramatically within and across cells. These properties suggest that these genes could contribute to continuous aspects of cell identity that require tightly coordinated signaling, like differences in size or metabolic activity. Cells of different types would likely have broad but distinct expression distributions, forming a gradient of expression when viewed across types. Such "continuous axes" of variation would be more likely to generalize across kingdoms than poorly conserved marker genes ( Figure 3D).

One example of a gene that might contribute to continuous aspects of cell identity is
Rpl12, a component of the large ribosomal subunit 60S in mouse. This gene is among the most variably expressed genes within and across cell types (mean standardized rank within = 0.83, across = 0.97) but its expression distribution across cell types is continuous ( Figure 3E). Given Rpl12's molecular function, we speculate that protein synthesis rate could be a continuous axis of variation from which all cells sample. Notably, we find a similar pattern of continuous expression when we look at an Arabidopsis ortholog of Rpl12 ( Figure 3E). This pattern is in stark contrast to more typical marker genes, which tend to have high variance across cell types, and lower variance within cell types because they are "switch-like" in their expression patterns. Two examples are shown in Figure 3E: Cd19, which is exclusively expressed by mouse B-cells and is conserved only among mammals (mean standarized rank within = 0.55, across = 0.97); and EXPA7, which is exclusive to Arabidopsis root trichoblasts and is conserved only among eudicots (mean standardized rank within = 0.77, across = 0.99).
In summary, our combined analysis of single-cell RNA-seq and yeast expression variation shows that while cell type-or tissue-specific marker genes are generally not well conserved, genes with high expression variability are deeply conserved, and may contribute to continuous aspects of cell transcriptional identity, in both single-celled and multicellular organisms.

Due to their high coexpression conservation, high/high genes may have conserved functions vis-àvis cell identity. E -Examples of continuous (top) vs. marker-like (bottom) expression in mouse (left) and Arabidopsis (right). MSC = mesenchymal stem cell, QC = quiescent center. The conserved gene with high within-and across-cell type variance is more continuous across cell types, whereas
the markers (high/low) are either on or off.

Coexpression conservation is associated with ortholog confidence and can predict functional analogs
Our coexpression conservation analysis is made possible by the use of OrthoDB to define orthologous genes. However, orthology prediction is an active area of research in genomics 23,43 , and relying on a single algorithm has known limitations [44][45][46] . How would our results hold up if we had used a different reference? In the following, we take advantage of orthology information from the Alliance of Genome Resources 24 which has predictions from 12 independent sources 47-58 for humans and 6 common model organisms. We assess coexpression conservation across algorithms, and we explore how coexpression conservation can be applied to predict functional analogs.
The majority of algorithms within the Alliance of Genome Resources database (9/12) have high concordance across their orthology predictions (Figure 4A, mean Jaccard=0.7), with exceptions attributable to differences in species coverage. In keeping with their overall similarity, we find that average scores for these 9 algorithms are close to tied (mean = 0.74 +/-0.009, Supplementary Table 1) and although OrthoDB is fairly distinct in its predictions (mean Jaccard index=0.18), it performs very close to this average (0.73 +/-0.13). Remarkably, even though the algorithms are tied on average, we find that where they agree, coexpression conservation is preferentially high: for almost all pairs of species, coexpression conservation is correlated with the number of algorithms that predict the relationship (human-worm shown as an example in Figure 4B, all correlations in Figure 4C). The only exceptions to this rule are among the three mammals, where coexpression conservation is almost uniformly high.
A primary application of orthology prediction is to infer shared function across species, but benchmarks recurrently find a precision-recall trade-off across algorithms 46,59 , with little evidence that any one approach outperforms another 43 . By incorporating functional information directly, coexpression conservation scores may improve sequence-based inference. For example, recent work has found that human genes with similar coexpression patterns can compensate for the loss of their yeast orthologs in complementation screens 60,61 . Here we find that coexpression conservation scores from our independent data and analysis are predictive of this effect ( Figure 4D). However, we also find that certain pairs of complementing orthologs have very low coexpression conservation scores, with the two lowest scoring pairs excluded from the Alliance of Genome Resources database (Supplementary Table 2).
Altogether, these results highlight that confidence in sequence-based orthology is reflected by similarity in gene coexpression relationships, and suggest that a wisdom-ofthe-crowds approach may be beneficial for ortholog prediction. Since our earlier results rely only on OrthoDB, they likely represent a lower limit for coexpression conservation which would only improve with the use of meta-orthology methods 62 . Moreover, our approach makes it possible to compare data from any species with sufficient RNA-seq data, not just popular model systems like yeast which have been the focus of many studies due to the greater availability of experimental resources.

Figure 4 -Coexpression conservation is associated with ortholog concordance across algorithms and can predict human-yeast functional analogs
A -Heatmap of algorithm concordance. The majority of algorithms (9/12)

DISCUSSION
By combining sequence-based orthology predictions with robust estimates of gene coexpression neighborhoods, we have developed a measure of gene conservation that can be calculated between any pair of species. To our knowledge our study is the largest of its kind to date, making use of data from hundreds of individual studies, and measuring coexpression conservation across very long diverged species. We find that coexpression conservation is associated with phylogenetic distances, expectations of conservation based on gene family size, and can even predict concordance across orthology prediction algorithms. Moreover, taking advantage of the recent explosion in single-cell data, we identify commonalities between the forces that drive conservation in both single-celled and multi-cellular organisms. The genes that vary most -both within cells of the same type and across cells of different types or states -show deeply conserved patterns of coexpression, suggesting their fundamental role in eukaryotic cell function and identity.
Previous research to investigate the changing gene expression landscape across species focused on the relative lack of conservation among tissue and cell type-specific genes 9,12 , which has been interpreted as evidence for Ohno's theory that gene duplication and divergence is critical for phenotypic evolution 34 , such as the creation of novel cell types.
With our method, we confirm these previous findings, but also extend them, presenting a challenge to Ohno's theory. Genes with cell type specific expression have more poorly conserved coexpression patterns than those that are expressed in cells of all types, as previously described. Yet genes expressed in all cell types not only have strongly conserved coexpression patterns, but they also have gradient-like expression across cell types. Variation in activity across cell types is likely what drives their strong coexpression within networks.
What does this mean for cellular evolution and multicellularity? We hypothesize that these genes work in tightly coordinated modules to tune non-negotiable aspects of cellular identity, like cell size, or metabolic rate, generating diversity that allows cells to respond to varying environments. With these diverse populations established, evolution could then use novel genomic variation to mark cell types and refine their organismal roles. Further work to explore this hypothesis is necessary, including additional analyses of single-cell data from long diverged species. However, we note that this will require targeted investigation, as typical single-cell analyses are designed to identify genes that are strongly variable across cell types (i.e., markers) rather than subtler continuous signals.
Cell identity is known to be broadly distributed across the transcriptome, and this allows low-depth single-cell RNA-seq to find expected cell clusters even when individual marker genes are not sampled [63][64][65] . Determining the relative contributions of switch-like versus continuous expression to cell identity will be informative for updating empirical and mechanistic models cell type.
This work is at the intersection of transcriptomics and orthology prediction, and improvements in both will allow for increasing precision of coexpression conservation.
Based on our current estimates of available data, we can readily extend our analyses to an additional thirty-eight species. The ability to move beyond that will require efforts to profile the transcriptomes of a greater diversity of organisms, similar to the goals of the Genome10K project for genome sequences 66 . However, a necessary consequence of using bulk RNA-seq data from heterogenous samples is that networks are better powered for genes that are expressed in all cells, and our results make it clear that tissue-specific networks 67 cannot overcome this as genes expressed in all cell types will continue to dominate. Instead, future work to develop cell-type specific coexpression networks 68 , including methods to map cell identities across distantly related species 69 , and/or to selectively sample the transcriptome for genes with lower expression 70 , could be routes forward. Regarding orthology, one of our most important results is that our conservation measure is correlated with the number of algorithms that predict the orthology relationship. Orthology algorithms are notoriously difficult to benchmark given the lack of gold-standard data (i.e., we lack genomes for the last common ancestors of extant species). Our results suggest that a wisdom-of-the-crowds approach, and incorporation of functional genomics data, could improve on baseline predictions 71 .
Defining the regulatory mechanisms that allow for evolutionary divergence is a central goal in biology. Our method and data now provide a clear route forward for investigating these mechanisms with both breadth and depth.

Analysis of public gene expression data
All analyses were performed in R. Results are reported as means +/-standard deviations unless otherwise specified. calculated for each cell type in each tissue separately, then averaged for each tissue, and averaged across tissues. Across-cell type variance was calculated for each tissue separately using the mean log expression for each cell type, then values were averaged across tissues. For Figure 3D, within and across-cell type variance were ranked and standardized between 0-1 such that the highest variance genes would have a score of 1.
The limma package 76 was used to find marker genes with a log fold-change threshold >4 between each cell type and any other within its tissue.
Arabidopsis thaliana root single-cell RNA-seq expression matrices and sample metadata were downloaded from the Gene Expression Omnibus 77 or provided by the authors. Cell type specificity was calculated for each dataset separately, then averaged, excluding NAs. Within-cell type variance was calculated for each cell type in each dataset separately, then averaged across datasets. Across-cell type variance was calculated for each dataset using the mean log expression in each cell type, then averaged across datasets. To visualize cells from all studies, we first used MetaNeighbor 63 to find replicable clusters, and subset individual datasets to clusters replicating in at least one other study using an AUROC cut-off of 0.7. We used the multiBatchNorm function from the batchelor package for initial batch correction 78 . Then, we selected variable genes using the get_variable_genes function from MetaNeighbor and used fMNN in batchelor on batch corrected data, subset to variable genes. This provided principle components which were used for the UMAP projection of all cells 79 (20 components were used).
Yeast microarray data were downloaded from the Saccharomyces Genome Database 42 .
We included studies with >10 samples. After rank normalizing expression for each sample, we calculated the variance of expression for each gene that was measured in at 23 least 80% of the datasets. Variance was calculated for each dataset separately, then averaged. For the analysis in Figure 3C, we considered all gene pairs with coexpression conservation >0.9 to be true positives and used the ranked average variance to predict these for each species with the auroc_analytic function in EGAD 72 .

Gene annotations and orthology
Gene function annotations were sourced from the Gene Ontology 31  OrthoDB 22 was used for orthology mapping. For each pair of species, we search for the most recent phylogenetic split, then obtain inferred orthology groups for all genes descended from the common ancestor. These were either filtered to include only one-toone relationships, or to include all N-to-M orthologous pairs. We also downloaded orthology information from the Alliance of Genome Resources 24 for assessment with our N-to-M coexpression conservation method, detailed below. We de-duplicated the information so that each gene pair appeared only once, rather than having a direction 24 from a source-target species. Pairs from all algorithms were considered the "universe" of possible orthologs. Species divergence times were sourced from TimeTree 25 .

Coexpression conservation
For each pair of species to be compared, we filter aggregate coexpression networks to include known orthologous genes, then we compare each gene's top coexpression partners across species to quantify gene similarity. We treat this as a supervised learning task, using the ranks of the coexpression strengths from one species to predict the top coexpression partners from the second species, and then repeating this task in the opposite direction, finally averaging the scores. We refer to this as a measure of "coexpression conservation" and note that it is formally equivalent to the average area under the receiver operator characteristic curve (AUROC).
To generalize this to the case of N-to-M orthologs, we describe the analysis in greater detail: Consider a gene A1 in species 1 that has two orthologs in species 2 -genes B1 and B2. First, the top ten genes exhibiting the highest coexpression with A1 are chosen.
All possible orthologs in species 2 for the set of top 10s of A1 are shortlisted as the "translated top 10s". Note that since each gene in species 1 can now have one or more orthologs in species 2, the translated top 10s can vary in length. Additionally, some of the top ten coexpressed genes in species 1 may map to the same orthologs in species 2, but we only consider a unique list of orthologs in the translated top 10s for each gene in species 1. This task is repeated in the opposite direction, where the ranks of coexpression strengths of genes in species 2 are used to predict the top coexpression partners of genes 25 from species 1. Scores from both directions are averaged, thereby providing a measure of overall coexpression conservation.

DATA AVAILABILITY
Data and code to reproduce the coexpression conservation analysis are available through CoCoCoNet (https://milton.cshl.edu/CoCoCoNet/). contributing to algorithmic design, and J.L. contributing webserver integration. All authors read and approved the final manuscript.

COMPETING INTERESTS
There are no competing interests to declare.

MATERIALS & CORRESPONDENCE
Correspondence and material requests should be addressed to Jesse Gillis.