Gene expression phylogenies and ancestral transcriptome reconstruction resolves major transitions in the origins of pregnancy

Structural and physiological changes in the female reproductive system underlie the origins of pregnancy in multiple vertebrate lineages. In mammals, the glandular portion of the lower reproductive tract has transformed into a structure specialized for supporting fetal development. These specializations range from relatively simple maternal nutrient provisioning in egg-laying monotremes to an elaborate suite of traits that support intimate maternal-fetal interactions in Eutherians. Among these traits are the maternal decidua and fetal component of the placenta, but there is considerable uncertainty about how these structures evolved. Previously, we showed that changes in uterine gene expression contributes to several evolutionary innovations during the origins of pregnancy (Mika et al., 2021b). Here, we reconstruct the evolution of entire transcriptomes (‘ancestral transcriptome reconstruction’) and show that maternal gene expression profiles are correlated with degree of placental invasion. These results indicate that an epitheliochorial-like placenta evolved early in the mammalian stem-lineage and that the ancestor of Eutherians had a hemochorial placenta, and suggest maternal control of placental invasiveness. These data resolve major transitions in the evolution of pregnancy and indicate that ancestral transcriptome reconstruction can be used to study the function of ancestral cell, tissue, and organ systems.


Introduction
Studies of the fossil record have revealed in fine detail major stages in the origin and diversification of evolutionary novelties (structures) and innovations (functions) such as the vertebrate limb and skull (Abzhanov, 2015;Hirasawa and Kuratani, 2015), the turtle shell (Lyson and Bever, 2020), feathers (Chen et al., 2015), and flowers (Chanderbali et al., 2016). Many features of soft tissues and macromolecular structures, however, are lost during the fossilization process and thus leave little to no trace in the fossil record. To reconstruct the history of these characters, including DNA and amino acid sequences, morphology, physiology, and behavior, among others, evolutionary studies have traditionally relied on comparative (phylogenetic) methods such as parsimony or model-based maximum likelihood or Bayesian inference (Lewis and Olmstead, 2001). These methods infer ancestral characters from the distribution of character states among extant species, the latter of which also allows increased model complexity including unequal character state transitions (Lewis and Olmstead, 2001;Pagel, 1999), variable rates among sites and branches (Galtier, 2001;Yang, 1994Yang, , 1993, and character dependent diversification (Maddison et al., 2007).
Extant mammals span crucial transitions in the origins of pregnancy ( Figure 1A) and are an excellent system in which to explore the origins of evolutionary novelties. The platypus and echidna (monotremes), for example, are oviparous (egg-laying) but retain the egg in the glandular portion of uterus for about two weeks. During this period the developing embryo is nourished by uterine secretions (matrotrophy) delivered through a simple yolk sac placenta (Hughes and Hall, 1998;Renfree and Shaw, 2013). Viviparity (live-birth) evolved in the stem-lineage of therian mammals, but marsupials and eutherians have very different reproductive strategies, particularly in the ontogenetic origins of the definitive placenta and the arrangement of the maternal-fetal interface (Freyer and Renfree, 2009;Renfree and Shaw, 2013;Renfree, 2010). In most marsupials, the embryonic portion of the placenta is derived from the yolk sac, which in may come into direct contact but does not invade the maternal endometrium. While the yolk sac has essential functions during early pregnancy in eutherians, the definitive placenta is derived from chorion and allantois (chorioallantois) and varies in its degree of endometrial invasion (Swanson and Skinner, 2018) (Figure 1B). Thus, matrotrophy and a yolk-sac derived placenta were present in the mammalian stem-lineage and transitioned to a chorioallantoic placenta in eutherians.
Unfortunately, most characters related to pregnancy leave little to no trace in the fossil record. Thus, studies exploring the evolution of pregnancy have relied on comparing morphological differences between extant mammals to reconstruct the steps in the origins of pregnancy. These comparative analyses have used multiple methods to reconstruct the arrangement of the mammalian maternal-fetal interface and reached contradictory conclusions about the degree of placental invasiveness in the eutherian ancestor (Table 1). Here, we use comparative transcriptomics and maximum likelihood to infer ancestral gene expression states (ancestral transcriptome reconstruction) from 23 amniotes with different parity modes and degrees of placental invasion to identify the evolutionary history of the maternal-fetal interface.
We found strong evidence that the last common ancestor of eutherian mammals had an invasive hemochorial placenta, as well as convergence in gene expression profiles during the independent evolution of non-invasive epitheliochorial placentas in marsupials and some eutherian lineages.
These data indicate that the degree of placental invasion can be inferred from endometrial gene expression profiles, suggesting that placental invasiveness is regulated by gene expression profiles in the maternal endometrium rather than the fetal portion of the placenta.

Endometrial Gene Expression Profiling
We previously assembled a collection of transcriptomes from the pregnant or gravid uterine endometrium of therian mammals with varying degrees of placental invasiveness, as well as a monotreme (platypus), a bird, as well as viviparous, oviparous, and reproductively bi-modal lizards . The complete dataset includes expression information for 21,750 genes from 23 species (Figure 2- into discrete character states -Genes with TPM ≥ 2.0 were coded as expressed (state=1), genes with TPM < 2.0 were coded as not expressed (state=0), and genes without data in specific species coded as missing (?) Mika et al., 2021). In contrast to PCA based on TPM values, PCA of the binary encoded endometrial transcriptome dataset grouped species by phylogenetic relatedness (Figure 2B), indicating significant noise reduction in the binary encoded dataset.

Phylogenetic analyses of endometrial transcriptomes
Next, we used IQ-TREE to infer the best fitting model of character evolution (GTR2+FO+R3), the maximum-likelihood (ML) phylogeny, and branch support metrics for the binary encoded endometrial gene expression dataset. The ML phylogeny ( Figure 3A) generally followed taxonomic relationships ( Figure 3B), however, four discordant relationships within therian mammals were inferred with high support: 1) Rather than grouping marsupials into a monophyletic sister-clade to the eutherians, opossum and wallaby were placed as sister-species within the Boreoeutheria; 2) armadillo groups within the Euarchontoglires rather than as sister to all other eutherians; 3) bat groups within the Euarchontoglires rather than within Laurasiatheria; and 4) dog groups within the Euarchontoglires rather than with Laurasiatheria ( Figure 3A/B).
Remarkably while these relationships are incorrect with respect to the species phylogeny, they are correct with respect to placenta-type, i.e., species form well-supported clades based on their degree of placental invasiveness. Wallaby and opossum, for example, have epitheliochorial placentas similar to ungulates, whereas dog has an invasive endotheliochorial placenta similar to the invasive hemochorial placenta of Euarchontoglires. We also used multiple non-parametric topology tests to directly compare the inferred ML tree to alternative trees with the correct phylogenetic placement of discordant lineages, all of which rejected alternative trees in favor of the ML tree ( Figure 3C).
These data show significant phylogenetic support for species uterine transcriptomes grouping by parity mode and degree of placental invasiveness rather than phylogenetic relationships. This discordance is particularly striking for opossum and wallaby, which are deeply nested within eutherians with epitheliochorial placentas based on uterine transcriptome data, because the eutherian and marsupial placenta is derived from different extra-embryonic tissues, the chorion and allantois in the former and the yolk-sac in the latter. Thus, there must be significant convergence in endometrial gene expression profiles between eutherians with chorioallantoisderived epitheliochorial placentas and marsupials with yolk-sac epitheliochorial placentas.

Fuzzy C-Means transcriptome clustering
We also used IQ-TREE and the species phylogeny ( Figure 3B) to reconstruct ancestral gene expression states for each gene (ancestral transcriptome reconstruction). To explore the similarity of extant and reconstructed transcriptomes we used Fuzzy C-Means (FCM) clustering, a "soft" clustering method that allows each sample to have membership in multiple clusters and assigns samples to clusters based on their degree of cluster membership. FCM with two to four clusters (K=2-4) had a clear biological interpretation (Figure 4)  Chalcides maternal-fetal interface has been described as either endotheliochorial or epitheliochorial with extensive vascularization and interdigitating folds of hypertrophied uterine and chorioallantoic tissue (Blackburn, 1993;Blackburn and Callard, 1997;Corso et al., 2000).
These data suggest that species with endotheliochorial placentas have a gene expression profile that is intermediate between epitheliochorial or hemochorial placentas, yet is also distinct, and that gene expression at the Chalcides maternal-fetal interface are converging on a therian-like endotheliochorial pattern.

Maternal control of placental invasion in mammals
Our observation that endometrial gene expression patterns are correlated with degree of placental invasiveness might seem surprising, however, the endometrium controls trophoblast invasion (Cui et al., 2012;Graham and Lala, 1991). For example, the trophoblast of mammals with hemochorial placentas, such as humans and rodents, is only permissive to invasion when the "window of implantation" is opened by the endometrium. Similarly, while the trophoblast of mammals with non-invasive endotheliochorial and epitheliochorial placentas, such as cats, dogs, horses, cows, pigs, and sheep cannot invade into the endometrium, they can invade into ectopic sites (Reviewed in (Corpa, 2006)). These data indicate that trophoblast invasiveness is ancestral in eutherian mammals. While similar data on ectopic pregnancy is lacking for marsupials, some marsupial lineages, including dunnart, have evolved invasive placentation. In contrast, there is no invasion of maternal tissues during ectopic pregnancy in viviparous reptiles with epitheliochorial placentation such as Pseudemoia entrecasteauxii (Griffith et al., 2013). Thus, the invasive ability of trophoblasts most likely either evolved in the stem-lineage of therian mammals or multiple times in therians-in the stem-lineage of eutherians and some lineages of marsupials. Regardless, ancestral maternal control of placental invasion likely allows us to infer ancestral placental invasiveness from ancestral endometrial transcripomes. These data also suggest that maternal control of trophoblast invasion may be mechanistically related to resistance to metastasis in some eutherian lineages with non-invasive placentas (Boddy et al., 2020;Kshitiz et al., 2019;Wagner et al., 2020).

Caveats and limitations
A limitation of this study is that we have only sampled a small number of species. For example, we lack pregnant endometrial samples from most mammals, particularly those with endotheliochorial placentas, as well as a diversity of oviparous and viviparous squamates (there are at least 115 origins of viviparity in squamates (Blackburn, 2015;Blackburn and Brandley, 2015;Blackburn and Starck, 2015). Thus, our inferences from phylogenetic, ancestral reconstruction, and clustering analyses may be biased by small sample sizes and non-random sampling. We also assume that models of evolution designed for phylogenetic inference and ancestral reconstruction of morphological and molecular data are appropriate for gene expression data or binary encoded gene expression data, which may affect our results. Similarly, while Fuzzy C-Means clustering is conceptually similar to topic ("grade of membership") models used in population genetics, its underlying assumptions may be violated for gene expression and binary encoded gene expression data. More detailed studies will be necessary to determine if our results are robust to potential sources of error such as model mis-specification, small sample sizes, and non-random taxon sampling.

Conclusions
Previous critiques of statistical methods to infer ancestral states, particularly in the context of parity mode evolution in squamates, have suggested that ancestral state reconstructions of morphological characters must be supported by additional kinds of biological support such as anatomical, physiological, and ecological evidence, to be persuasive (Griffith et al., 2015). Here we explored the evolution of parity mode and placental invasiveness in amniotes utilizing comparative gene expression data. While our study also relies on statistical methods to infer ancestral (gene expression) states, this approach is orthogonal to traditional methods that infer ancestral states from morphological characters among extant species. Indeed, gene expression ultimately underlies the development, evolution, and function of anatomical systems. Thus, by reconstructing the evolution of entire transcriptomes we may be able to infer function of ancestral cell, tissue, and organ systems. Our results resolve several evolutionary transformations during the origins of pregnancy, including the early evolution of an epitheliochorial-like placenta in the mammalian stem-lineage, a hemochorial placenta in the ancestor of eutherians, multiple reversions to non-invasive epitheliochorial placentas within some eutherian lineages, convergent evolution of gene expression profiles among species with different ontogenetic origins of epitheliochorial placentas, and maternal control of placental invasiveness.

Endometrial Gene Expression Profiling
We previously published a dataset of uterine endometrial transcriptomes, which is also used in this study. Interested readers are referred to Mika, Marinic, and Lynch (2021) for specific details. Briefly, we searched the NCBI BioSample, Short Read Archive (SRA), and Gene Expression Omnibus (GEO) databases using the search terms "uterus", "endometrium", "decidua", "oviduct", and "shell gland". These anatomical terms refer to the glandular portion of the female reproductive tract, which is specialized for maternal-fetal interactions or shell formation. We then manually curated transcriptomes and excluded those that did not indicate whether tissue samples were from pregnant or gravid tissues and datasets composed of pooled tissues. Gene expression data were analyzed with Kallisto (Bray et al., 2016) version 0.42.4 to pseudo-align the raw RNA-Seq reads to reference transcriptomes and to generate transcript abundance estimates (see Figure 3-source data 1 for accession numbers and reference genome assemblies); Kallisto was run using default parameters, bias correction, and 100 bootstrap replicates.

Gene expression phylogeny and ancestral transcriptome reconstruction
We used the binary encoded endometrial transcriptome dataset for phylogenetic analyses and to reconstruct ancestral gene expression states. Gene expression phylogenies were inferred with IQ-TREE (Nguyen et al., 2015) using the best-fitting model of character evolution determined by ModelFinder (Kalyaanamoorthy et al., 2017). The best fitting model was inferred to be the General Time Reversible model for binary data (GTR2), with character state frequencies optimized by maximum-likelihood (FO), and a FreeRate model of among site rate heterogeneity with three categories (R3) (Soubrier et al., 2012). Ancestral gene expression states for each gene were inferred using the empirical Bayesian method implemented in IQ-TREE, the GTR2+FO+R3 model of character evolution, and the species phylogeny as a constraint tree ( Figure 3B).
The bootstrap and single branch tests assess the robustness of individual branch bipartitions and cannot directly compare complex alternate tree topologies. Therefore we used non-parametric topology tests to directly compare the inferred ML tree to alternative trees with the correct phylogenetic placement of armadillo, dog, marsupials (opossum and wallaby), and bat, as well as the correct species phylogeny ( Figure 2B); tests included the BP-RELL, KH-test (Kishino et al., 1990;Kishino and Hasegawa, 1989), SH-test (Anisimova et al., 2011;Guindon et al., 2010), c-ELW (Strimmer and Rambaut, 2002), weighted KH-and SH-tests, and the AU-test (Shimodaira, 2002). We note that the KH-test compares two a priori defined trees rather than the ML and alternative trees (Goldman et al., 2000) and does not correct for multiple hypothesis tests, it is included solely for comparison to other methods and previous studies. The SH-test can be used to compare the ML tree to multiple alternative trees selected a priori (i.e., is dataset independent) and corrects for multiple hypothesis tests, but is too conservative when many trees are tested. The AU-test, in contrast, resolves the conservative nature of the SH-test and thus is the preferred test.

Clustering methods
We evaluated multiple methods to summarize and visualize the binary encoded extant and ancestral reconstructed transcriptomes, including: 1) Logistic Principal Component Analysis conducted in R after removing columns (genes) with missing data (coded as ?/NA) or that were invariant (all 0 or all 1). LPCA was performed using the logisticPCA R package (Landgraf and Lee, 2015), which implements three methods: exponential family PCA applied to Bernoulli data, logisitic PCA, and the convex relaxation of logistic PCA. For each of the methods, we fit the parameters assuming two-dimensional representation, returning four principal components (ks=4), and selecting the best m value to approximate the saturated model for cross validation.
MDS was performed using the vegan R package (Oksanen et al., 2008) with four reduced dimensions. UMAP was performed using the umap R package. tSNE was performed using the Rtsne R package.
To explore the data in greater detail we focused on FCM clustering because the results were qualitatively similar to the other methods, and it has several desirable properties including providing a statistically sound way to identify clusters rather than an ad hoc approach that might be applied to the other methods. FCM also allows each sample to have membership in multiple clusters and is conceptually similar to topic ("grade of membership") models used in population genetics to visualize private and shared genetic structure across populations. FCM membership coefficients can thereby account for multiple sources of similarity including noise, phylogenetic signal, and convergence of gene expression. FCM was performed in R using the cluster R package, using Manhattan distances (cluster membership was not altered by using other distance metrics), and an estimated fuzzifier (m=1.034978). FCM clustering requires a priori knowledge of the number of clusters (K) to include, therefore we evaluated FCM with K=2-9 following the suggestions given in https://www.r-bloggers.com/2019/01/10-tips-for-choosing-the-optimalnumber-of-clusters/. First, we used the "elbow" method, in which the sum of squares of each cluster number is calculated and graphed and the optimal number of clusters estimated by a change of slope from steep to shallow (the elbow). We also assessed the optimal number of clusters using the clustree R package, which assess the optimal number of clusters by considering how samples change groupings as the number of clusters increases; clustree is useful for estimating which clusters are distinct and which are unstable but cannot determine the optimal number of clusters (K).    8.6% 6.9% 6.1% 5.8% 5.6% 5% 4.8% 4.3% Plus sign denotes the 95% confidence sets. Minus sign denotes significant exclusion.
All tests performed 100000 resamplings using the RELL method.   Mouse  32  36  35  34  31  39  38  41  40  37  30  AncEutheria  43  42  AncTheria  AncMammalia  47  26  48  46  45   Transparent arrows (the incoming node proportion), show how samples from move between lower and higher cluster numbers and can be used as an indicator of cluster instability. In this graph samples switch clusters between K=1-5, are stable between K=6-9, and unstable with more than 9 clusters. These data indicate that 6 clusters is optimal for resolution, i.e., more than 6 clusters no longer provides informative cluster memberships, while more than K≥9 is over-clustered.