Abstract
The impact of spliceosomal introns on genome and organismal evolution remains puzzling. Here, we investigated the correlative associations among genome-wide features of introns from protein-coding genes (e.g., size, density, genome-content, repeats), genome size and multicellular complexity on 461 eukaryotes. Thus, we formally distinguished simple from complex multicellular organisms (CMOs), and developed the program GenomeContent to systematically estimate genomic traits. We performed robust phylogenetic controlled analyses, by taking into account significant uncertainties in the tree of eukaryotes and variation in genome size estimates. We found that changes in the variation of some intron features (such as size and repeat composition) are only weakly, while other features measuring intron abundance (within and across genes) are not, scaling with changes in genome size at the broadest phylogenetic scale. Accordingly, the strength of these associations fluctuates at the lineage-specific level, and changes in the length and abundance of introns within a genome are found to be largely evolving independently throughout Eukarya. Thereby, our findings are in disagreement with previous estimations claiming a concerted evolution between genome size and introns across eukaryotes. We also observe that intron features vary homogeneously (with low repetitive composition) within fungi, plants and stramenophiles; but they vary dramatically (with higher repetitive composition) within holozoans, chlorophytes, alveolates and amoebozoans. We also found that CMOs and their closest ancestral relatives are characterized by high intron-richness, regardless their genome size. These patterns contrast the narrow distribution of exon features found across eukaryotes. Collectively, our findings unveil spliceosomal introns as a dynamically evolving non-coding DNA class and strongly argue against both, a particular intron feature as key determinant of eukaryotic gene architecture, as well as a major mechanism (adaptive or non-adaptive) behind the evolutionary dynamics of introns over a large phylogenetic scale. We hypothesize that intron-richness is a pre-condition to evolve complex multicellularity.
INTRODUCTION
Spliceosomal introns are not only germane to eukaryote origins, they also represent an evolutionary innovation on the way in which protein-coding genes have been stored, expressed and inherited throughout Life’s history on Earth. Spliceosomal introns (hereafter “introns”) form a class of non-coding DNA (ncDNA) sequences that interrupt exons within a gene. Thus, they have to be removed from the primary transcript by the splicing machinery to form a mature messenger RNA (mRNA), and hence, a functional RNA or protein molecule. Although introns are ubiquitous sequences along eukaryotes, their genome-wide features (such as length and abundance within and across genes) differ among species, and their the origins are still under intense debate. Yet, some large scale patterns of intron evolution appear to be reaching a consensus. For instance, the high conservation of intron-positions found in orthologous genes throughout eukaryotes suggests the existence of intron-rich ancestors [1, 2]. Also, intron loss has been found to be more frequent than intron gain in most lineages [1, 3–7], although episodes of rapid and extensive intron gain are also observed across eukaryotes [8–13]. Other evolutionary and functional aspects of introns remain amongst the longest-abiding puzzles, such as the phenotypic consequences of harboring intron-rich genes and their evolutionary relationship with genomic and multicellular complexity.
Because introns are less evolutionarily constrained than coding sequences, they usually evolve at high rates as do 4-fold degenerate sites and other non-coding regions [14, 15]. Nevertheless, a variable proportion of intron sites has been found to be under selective constraints in mammals [15–17], some invertebrates [18, 19], fungi[20], algae and plants [21–23]. Also, the energetic and time costs to transcribe and splice introns can be significant enough to influence the organism’s phenotype [24–27]. This is expected, in part, because eukaryotic genomes are pervasively transcribed [28, 29] and intronic RNAs constitute a major fraction of the transcribed non-coding sequences [30]. For instance, the transcription of a large gene, as the one encoding human dystrophin (2.3 Mbs), can still take up to 10 hrs. at an in vivo RNA pol-II elongation rate of 3.8 kb/min [31] because 99% of its length is intronic [32]. A substantial delay of gene expression owing to transcription and splicing of long and/or numerous introns, a phenomenon termed “intron delay” [33, 34], has turned out to be essential for cells with short mitotic cycles and for timing mechanisms during early body segmentation [35, 36]. Also, levels of gene expression (either housekeeping or tissue specific) are often associated to particular intron features [37]. For instance, some highly expressed genes are found under strong selection to remain intron-poor for transcriptional efficiency [25, 38, 39], whereas other genes are found to have longer and numerous introns to enhance expression [40–44].
Although co-transcriptional splicing depends on many parameters [45, 46], the exon-intron structure of genes is also found to have an impact on the mode of splice-site recognition and the efficiency of splicing [45, 47–51]. For instance, it was recently found that intron-containing genes and intronrich genomes are best protected against R-loop accumulation, and subsequent transcription-associated genetic instability, by favoring spliceosome recruitment [52]. Studies across phylogenetically distant eukaryotes have also found that the length of introns and exons exerts an important influence on the likelihood of an exon to be constitutively or alternatively spliced [47–51, 53–55]. For instance, canonical splicing errors produced by splice-site recognition across small introns are more likely to result in “Intron Retention” or unspliced mRNA [54]. Whereas “Exon Skipping” or inclusion of alternative exons in the mRNA is thought to comply best with splice-site recognition across small exons [54]. Remarkably, these canonical and other noncanonical splicing errors can account for the major portion of the alternative splicing (AS) events in some eukaryotes and cell types [53, 56–58]. The expansion of AS events is thought to be key in the emergence of multicellular complexity, by creating proteome diversity and by regulating gene expression post-transcriptionally through RNA surveillance pathways [2, 59]. Accordingly, species with more tissues and cell types tend to have more alternatively spliced genes [53, 60, 61].
It remains to be fully understood, however, to what extent intron-richness (and potentially increased AS events) is coupled to the evolution of complex multicellularity (as defined in Appendix 1), genome size and of other ncDNA classes across eukaryotes. In earlier studies, a number of strong positive correlations over large evolutionary scales was found among genome size and particular ncDNA classes [62–64], including the average size, total number and nucleotide content of introns in the genome [62, 64–66]. These results have fueled the suggestion that changes in the genomic features of any ncDNA class are scaling uniformly with changes in genome size, leading to the premise that larger genomes tend to harbor more and longer introns, and vice versa for smaller genomes. However, correlative associations over large evolutionary scales are prone to result in biases due to low phylogenetic diversity and the lack of both phylogeny-controlled statistics [67, 68] and systematically obtained datasets. Challenging these results are also the studies showing that the average number of introns per gene (i.e., intron density, see Appendix 2) appears to inversely correlate with generation time [26, 69] and gene expression levels (as reviewed in [34, 37]). Accordingly, genome-wide changes of intron density are found to vary widely across eukaryotic lineages [2, 7, 70], with no clear association to genome size or another intron feature [6, 7]. Furthermore, it has been largely presumed that repeats –in particular, transposable elements (TEs)– are strongly driving the evolution of some intron features [66, 71, 72]. However, the strong contribution of repeats to intron size, for instance, is supported by studies on a few model species, particular clades or repeat families [73–76].
Some of the findings and correlative associations described previously have been interpreted as evidence for either adaptive or non-adaptive forces being the major determinant of the intron-richness complexity observed across eukaryotes. And to argue, consequently, on whether the functionality of introns can be mainly explained by the effect for which it was selected for (i.e., selected-effects) or by the effect of causal-role activities [29, 71, 72, 77]. For instance, the “genomic design” model postulates that the length and number of introns is determined by selection for gene function and the necessity to preserve conserved intronic elements for complex regulation [78]. Likewise, the “selection for economy” model proposes that decreases in genic size are the results of selected mutations (mainly on the length and number of introns) to reduce the time and energetic costs of transcription [25, 38, 39]. By contrast, the “mutational bias” model states that the abundance and length of introns in certain chromosomal regions is driven by different recombination rates and/or transcription-associated mutational biases [38, 70, 79]. Alternatively, the “mutation-hazard” model suggests that variation in the hazardous accumulation of introns –along with other ncDNA sequences– is primarily the outcome of increased genetic drift when effective population sizes (Ne) remain small for an extended period of time [62, 63, 80]. So that, for instance, larger genomes would have more and larger introns owing to insufficient purifying selection to remove them in species with lower values of Ne, a condition expected to occur in multicellular organisms particularly. The strong correlations reported among genome size, Neµ (as a proxy of Ne) and some ncDNA classes –including some genome-wide features of introns– appear to support this hypothesis [62, 63, 65, 66, 68, 81]. Several controversies have emerged, however, from the contradicting evidence and arguments supporting all previous hypotheses, as discussed in [37, 67].
The discrepancy between the current observations and the evolutionary models have raised a conundrum: are the genome wide features of introns within protein-coding genes (such as their length, abundance and repetitive composition) evolving throughout Eukarya in either a concerted or an independent way among themselves, with genome size and multicellular complexity? Our study contributes to clarify this conundrum by investigating the correlative associations among these organism traits over 461 eukaryotes. To that end, we formally distinguish simple from complex multicellular organisms (CMOs) (see Appendix 1), and developed the program GenomeContent to systematically estimate genomic traits (see Appendix 2). We then estimated correlations under phylogenetic controlled analyses, taking into account significant uncertainties in the tree of eukaryotes and variation in genome size estimates. We found that intron features are weakly correlated among themselves and with genome size at the broadest phylogenetic scale, revealing different associations between those features estimating intron abundance across genes and those measuring intron length and repeat composition. We also found that CMOs and their closest ancestral relatives are characterized by high intron-richness, regardless their genome size. These patterns contrast the narrow distribution of exon features found across eukaryotes. Our findings are thus in disagreement with previous estimations claiming a concerted evolution between genome size and introns over a large phylogenetic scale. They also argue strongly against both a particular intron feature as key determinant of eukaryotic gene architecture, as well as a major mechanism (adaptive or non-adaptive) behind the evolutionary dynamics of introns over a large phylogenetic scale. Here, spliceo-somal introns are unveiled as a dynamically evolving ncDNA class, whose relationships with genome and organismal complexity are better explained by the influence of numerous life-history factors and evolutionary forces. We argue why intron-rich lineages are more likely to evolve complex multicellularity.
RESULTS
Phylogenetic signal is strong in genome-based traits and robust to uncertainties in the tree of eukaryotes
Phylogenetic signal is the “tendency of related species to resemble each other more than species drawn at random from the same tree” [82, 83]. Because the genome traits of the 461 eukaryotes analyzed here share an evolutionary history, we first evaluated the strength of their phylogenetic signal (i.e., statistical dependence) with four alternative “species trees”, each of which has been extensively used in the literature. Figure 1 shows the comparison of the tree topologies: a tree based on literature consensus from sequence-based phylogenies (Figure 1a); two NCBI taxonomy-based trees, one with no polytomies (Figure 1b), while another one with polytomies (Figure 1c); and a protein domain-based tree corrected for protein content biases derived from differences in genome size and lifestyles (Figure 1d). The literature consensus-based tree was selected as the “reference tree for eukaryotes” (Figure 1a) to present the results throughout the article. However, we do not attempt to single out any particular tree topology as the best or the correct species tree of eukaryotes. Instead, the goal is to test the robustness of the comparative analyses to alternative phylogenetic assumptions [67, 82].
(a) A consensus literature-based tree was manually created according to sequence-based phylogenies and supertrees. Two NCBI taxonomy-based trees were obtained in two versions: (b) not allowing polytomies and (c) allowing polytomies. (d) A protein domain content-based tree was created and corrected for protein content biases owing to differences in genome size and lifestyles. All tree topologies include: 131 fungal species (in dark blue), 78 species from Viridiplantae (in light green), 186 from Metazoa (in pink), 20 from Alveolata (in orange), 17 from Stramenopiles (in red), 7 from Excavata (in black), 7 from Amoebozoa (in yellow), 4 from Rhodophyta (in dark green), 4 from Choanozoa (in purple), 3 from Rhizaria, Cryptophyta and Haptophyta (in cyan), Fonticulidae and Apusozoa (in light blue). Phylogenetic uncertainties and disagreements among the trees are also summarized in the table (down). Numbers highlighted in bold correspond to the absolute symmetric differences (RF), the number of RF partitions per species is coloured in blue, and the Align mismatches scored in the best alignment of the branches are coloured in red.
Accordingly, dissimilarity metrics for the four tree topologies can be observed in Figure 1. The symmetric difference (RF) and the tree aligment (Align) metrics measure the number of clades not shared between two trees from either the total number of their partitions or the best alignment of their branches, respectively. Notably, the highest phylogenetic inconsistencies are observed in the protein domain-based tree: between 1.28 and 1.48 partitions per species when compared to the other trees (see table on Figure 1). This is because the protein domain-based phylogeny exhibits a long branch attraction (LBA) problem of several species that had undergone massive protein-domain loss, in spite of implementing a correcting factor for protein-domain content. Examples of LBA on Figure 1d include the myxosporean Thelohanellus kitauei, the trematode Schistosoma mansoni, the bdelloid rotifer Adineta vaga and the green algae Helicosporidium sp. In contrast, we observe a similar magnitude of branch dissimilarities between the literature-based tree and the NCBI-based trees, even with different polytomy resolutions: between 0.66 and 0.86 partitions per species. On closer inspection, however, the phylogenetic resolution at the level of species is not only different across the four trees, but also known conflicting hypotheses are observed for the phylogenetic positions of Rhodophyta, Rhizaria, Excavata, Stramenopiles, Alveolata, among others [84, 85]. As summarized above, the four alternative eukaryotic trees exhibit significant phylogenetic inconsistencies from one another. This allow us to incorporate adequate phylogenetic uncertainty into our comparative analyses to evaluate their sensitivity.
Table 1 presents estimates of strong phylogenetic signal for all 30 sequence-based genome traits analyzed here, as indicated by their Pagel’s λ values close to 1.0 and significantly > 0. Notably, the λ values are significantly robust to the phylogenetic disagreements shown by the alternative tree topologies. Likewise, λ values are significantly robust to different estimations of genome sizes and genome contents based on two sources: genome assemblies and experimental estimations (see Supplementary Table S4). The latter robustness is expected because, as also reported by Elliott and Gregory [64], the correlation between the assembled and estimated genome sizes is strong at the broadest phylogenetic scale: r = 0.958 (see Table 3). Consistent with previous studies [67, 76, 81, 86], these results indicate that genome traits are not statisticallly independent when compared among species. Therefore, correction for phylogenetic signal is accounted for any comparative analyses in this study.
Genome size correlates weakly with genome-wide intron features at the broadest phylogenetic scale
Table 2 shows the estimated log Bayes Factors (log BF) and coefficients of determination (r2) used here as the criterion to assess both the “strength of the evidence” and the “explanatory power” of the correlative associations between two traits (X and Y), respectively. The “explanatory power” of the r2 values should be understood as means of a statistical range to associate the variation observed between X and Y, with no implication as to the evolutionary mechanism that might cause (or not) such associations nor the primary trait (X or Y) subject to this action.
Unvariate measures of phylogenetic signal (with the λ parameter) for the genome traits analysed in this study, by using four alternate trees for the 461 eukaryotic species (see Figure 1).
Correlative associations among several features measuring genome complexity and intron-richness from 461 eukaryotes under non-phylogenetic (OLS) and phylogenetic (PIC, PGLS) models.
Summary statistics and PGLS regressions performed between genome size and intron features for different species datasets.
When the data is analyzed as phylogenetically independent with the OLS model, strong associations are observed on Table 2 among genome size and most intron features (r2 between 0.6 and 0.9). In particular, the regression between genome and intron sizes (r2 = 0.76) is consistent with previous estimations performed over a broad evolutionary range by Vinogradov (r2 = 0.792, n = 27) [65] and by Lynch and Conery (r2 = 0.641, n = 30) [62]. However, the strength of such associations substantially dropped after phylogenetic corrected regressions were performed with both PGLS and PICs models (see Table 2). For instance, the correlations among genome size and the features estimating intron length and genomic content have robust evidence for a positive association but a weak explanatory power at the broadest phylogenetic scale: r2 = 0.382 (logBF = 80.1) for intron size, r2 = 0.447 (logBF = 95.1) for intron content, and r2 = 0.485 (logBF = 119.6) for the repetitive-intronic content of the genome. On the other hand, none or no simple associations were found among genome size and those features measuring the abundance of introns within and across genes: r2 = 0.068 (logBF = 1.3) for intron density, r2 = 0.022 (logBF = 1.7) for the fraction (and total number) of intron-containing genes per genome, and r2 = 0.184 (logBF = 7.8) for the total number of introns per genome. Therefore, only reduced and differential fractions of the variation observed among intron features (from 2% to 45%) can be associated (directly or indirectly) to the ∼2,200-fold variation of the genome size observed over the 461 eukaryotes analyzed here.
The strenght of these correlations, either through log BF or r2 values, is not only similar regardless which genome trait (e.g., genome size) is associate to the X and Y variables, but it is also consistent with both PGLS and PIC models (see Supplementary Tables S5-S10). Nonetheless, the PGLS model appears to fit the data significantly better than the OLS and PIC models do, as shown by its consistent lower AIC values and higher ML values (see Table 2). We further investigated potential discrepancies in our correlations owing to the differences to estimate tree topologies, genome traits, species number and diversity (see Supplementary Tables S5-S10). As shown in Table 2, the weak associations found among genome size and intron features also proved to be robust to the phylogenetic inconsistencies of the alternate trees. A slighter strenght should be notice in the correlations performed with the protein-domain content tree, which exhibits the highest topological and branch length dissimilarities in comparison to the other trees. We also found no major discrepancies in the PGLS correlations performed with estimates of intron features based on (and against) the two sources of genome size. The PGLS correlations described in Table 3 further show that the weak associations found at the broadest phylogenetic scale, significant r2 values from 0.02 to 0.5, are robust to randomly reduced datasets of 231 and 116 species sampled from the original 461 species (see also Supplementary Table S11). These results support the robustness of the phylogenetic diversity of our species dataset. By contrast, we found overestimated correlations among intron features and genome size when smaller and less diverse datasets are used to cover large phylogenetic scales. This is observed in Table 3 for two further randomly reduced datasets of 58 and 29 species, as well as for the species datasets from Lynch and Conery [62] (n = 26) and from Wu and Hurst [68] (n = 30) (see also Supplementary Table S12). These contrasting results show that phylogenetic diversity and phylogenetic corrected correlations over large evolutionary scales are strongly affected by very small and biased datasets.
Under the “replicated co-distribution” approach [87], we also tested the decoupled association between intron features and genome size across multiple independent clades. Table 3 describes the PGLS correlations performed over 18 lineage-specific datasets compiled from the original dataset of 461 eukaryotes (see Methods and Supplementary Table S11). According to our previous results, the r2 values show that the strength of the associations among genome size and intron features is indeed different at the local phylogenetic scale. For instance, the genomeintron size relationship varies from r2 = 0.003 in stramenopiles up to r2 = 0.856 in teleosts, whereas the association between intron density and genome size varies from r2 = 0.077 in deuterostomes up to r2 = 0.807 in chlorophytes. Likewise, a differing correlative association among intron features and genome size is observed in Hymenoptera, Aves, Monocots and Ascomy-cota when compared to the correlations observed in their corresponding close relatives.
Nevertheless, some of the correlations on Table 3 also highlight the impact that differences in the estimations of genome features have at the local evolutionary scale. For instance, the correlation obtained for genome size and intron density in Ascomycota is consistent with Kelkar and Ochman [81], regardless both the tree topology or the source for the genome size estimates used to perform the regressions. This is expected because the assembled and estimated genome sizes are highly correlated in this group (r2 = 0.964, p < 0.001, Table 3). By contrast, our correlations for genome and intron size in amniotes are not consistent with those reported by Zhang and Edwards [76] when the estimated genome sizes are used to perform the regressions (r2 = 0.184, p < 0.001, Table 3), but they are when the regressions are performed with the assembled genome sizes (r2 = 0.339, p < 0.001, Supplementary Table S11). Such discrepancy might be the consequence of the low correlation found between the two sources of genome size estimates in amniotes: r2 = 0.567 (p < 0.001, see Table 3). On the other hand, some of the regressions obtained across specific lineages (such as monocots and aves) or for particular intron features (such as intron density and the fraction intron-containing genes) do not reach statistical significance. While most of these correlations reveal none or no simple associations among intron features and genome size, caution should be taken in the correlations obtained for those lineages with few sequenced genomes (n < 15). With few exceptions, the PGLS correlations reported previously, and in the following sections, do not pose significant changes when they are performed with different tree topologies and assembled genome sizes (see Supplementary Tables S5-S13).
Changes of intronic content across lineages are not strongly associated to one particular genome-wide intron feature
As observed on Figure 2 and Table 3, the net nucleotide coverage of introns in the eukaryotic genome, i.e. “intron content”, represents on average 25.8% of the genome size in Choanozoa, 21.9% in Metazoa, 11.8% in Viridiplantae, 11.1% in Alveolata, 10.4% in Amoebozoa, 7.5% in Stramenopiles, 7.0% in Fungi, and 0.9% in Excavata. Our results show that the variation of intron content observed at the broadest phylogenetic scale is positively, yet weakly correlated with genome size (r2 = 0.447, logBF = 95.1), in contrasts to the stronger associations observed with the repetitive (r2 = 0.723, logBF = 1273.3), non-repetitive ncDNA (r2 = 0.859, logBF = 449.2) and protein-coding (r2 = 0.617, logBF = 98.5) contents of the genome (see Table 2). As a consequence, low intron-contents are observed in some large and highly repetitive genomes, such as Pinus taeda (1.69% in 22.5 Gbs) and Locusta migratoria (13.61% in 6.5 Gbs). And vice versa, high non-repetitive intron contents can be observed in species with smaller genome sizes, either unicellular or multicellular, when compared to their corresponding close relatives. Some remarkable examples include Chlorophytes (29.2%, 70.7 Mbs), Aves (31.5%, 1.23 Gbs), teleosts (31.5%, 871.6 Mbs), bees (25.4%, 278.6 Mbs), buterflies (23.9%, 402.0 Mbs), Bigelowiella natans (32.05%, 94.7 Mbs), Ectocarpus siliculosus (36.04%, 214 Mbs), and Utricularia gibba (18.15%, 88 Mbs) (see Figures 2-4 and Supplementary Tables S14-S15). Consistent with this, we further found that the strength of the association between intron content and genome size is indeed different across lineages (Table 3). For instance, it is strong in Alveolata, Chlorophyta and Teleostei, while weak or absent in Pezizomycotina, Aves, Mammalia, Monocots, Hymenoptera and Lepidoptera. In agreement with a previous study [64], these findings show that intron content cannot fully account for the large variations of eukaryotic genome sizes.
The species are displayed according to the “reference tree for eukaryotes” (in Figure 1a) and coloured according to the supergroup they belong to. The fraction of genomic content (repetitive and non-repetitive) from the total assembled genome size –as calculated with the GenomeContent program–, for introns, exons and intergenic regions is coloured in red, blue and gray scales, respectively. The fraction de placeholders (sequences of Ns) is represented in yellow. Noteworthy, the fraction of non-repetitive intergenic regions might be smaller due to the presence of several repetitive pseudogenes and non-coding RNA families (e.g., ribosomal RNAs, tRNAs) that are not fully annotated in the genome projects nor in the present study. The paired symbols for each species indicate whether the nucleotide overlap of repeats within intronic (circles) and exonic sequences (squares) is less (TRUE: filled) or more (FALSE: not filled) significant than expected by chance, according to the p < 0.05 estimated over 1,000 permutation tests on the Jaccard index for each feature and genome (see Methods and Supplementary Table S16). The assembled genome size for every species is shown in vertical bars. Data calculated from estimated genome sizes are available in Supplementary Table S3.
Our findings also show that intron content in eukaryotes is not strongly associated to any other particular intron feature at the broadest phylogenetic scale. As observed in Table 2, changes of intron size (r2 = 0.336), intron density (r2 = 0.392), the fraction of genes harboring introns (r2 = 0.441) and the repetitive-genome content (r2 = 0.314) are similarly associated to the variation of intron content observed across eukaryotes. To a larger extent, intron content is associated with changes in the total number of introns per genome (r2 = 0.713) and the repetitive-intron content (r2 = 0.738). It is important to clarify, however, that the strength of the association among intron content and other genome-wide intron features is different across lineages (see Table 4 and Supplementary Table S13), as we report next.
PGLS regressions performed among intron features for different taxa datasets.
Genome-wide intron features are decoupled among themselves at the local and broadest phylogenetic scales
The repeated decoupling of intron features from genome size evolution is also supported by two additional observations. First, the evolution of intron features appears to be differentially constrained by phylogeny rather than genome size throughout Eukarya. As shown in Figures 2-4 and also Supplementary Tables S14-S15, a homogeneous variation can be observed in the phylogenetic patterns of intron features across land plants, fungi and stramenophiles. This contrast the dramatic variation –at a much faster evolutionary pace– observed in the phylogenetic patterns of intron features within holozoans, chlorophytes, alveolates and amoebozoans. Second, the PGLS correlations in Tables 2 and 4 show that intron features associate weakly among themselves at the local and broadest phylogenetic scales (see also Supplementary Table S6-10 and S13). These associative correlations further support differing patterns between those features estimating intron abundance within and across genes (i.e., intron density and the fraction of intron-containing genes, respectively) and those features measuring intron length and repeat composition. For instance, variation in intron size is weak but to a larger extent associated with the repetitive content within introns (r2 = 0.366, logBF = 588.2), rather than with changes in intron density (r2 = 0.025, logBF = 0.2), the number of introns per genome (r2 = 0.003, logBF = 1.6), or the fraction of intron-containing genes (r2 = 0.002, logBF = 1.6). Likewise, variation on intron density is weakly but to a larger extent associated with the number of introns per genome (r2 = 0.437, logBF = 161.5) and the fraction of intron-containing genes (r2 = 0.199, logBF = 152.0), rather than with changes in intron size, the repetitive content within introns (r2 = 0.210, logBF = 0.5) and the total number of CDS (r2 = 0.024, logBF = 0.074). As also observed in Table 4, the strenght of these associations fluctuates at local phylogenetic scales.
The evolutionary decoupling of intron features is also observed at the lineage and species levels (see summary statistics in Table 3 and Figures 3-4). For instance, Pezizomycotina and Eudicots show similar fractions of intron-containing genes on average (∼75%), although their corresponding avg. intron size (107.2 nts and 495.6 nts) and intron density (2.3 and 5) are different. Likewise, the small genomes of bees (288.7 Mbs) harbor shorter but more abundant introns within (6.2) and across (94.7%) genes than the larger genomes of butterflies (402.0 Mbs, 5.6 and 86.5% introns within and across genes, respectively). Strikingly, introns in Aves are slightly more abundant within (10.07) and across (92.9%) genes in comparison to mammals (9.3 and 86.8% introns within and across genes, respectively) (see Table 3 and Figure 4a). Even despite the fact that birds have undergone a reduction of their average intron (3,342 nts) and genome (1.2 Gbs) sizes, as observed here and elsewhere [76, 88, 89]. Consistent with this, birds and mammals show similar fractions of AS genes and have among the highest rates of AS events per gene [60]. Within fungi, we observe that introns within Basid-iomycota are shorter (92.2 nts) but more abundant within (4.7) and across (81.8%) genes when compared to the larger (141.4 nts) but less abundant introns within (1.9) and across (53.4%) genes in Ascomycota (see Table 3 and Figure 3a). Indeed, an increase in the average rate of alternative splicing from 6% in Ascomycota to 8.6% in Basidiomycota has been previously reported [61]. Noteworthy, the large genome of the locust L. migratoria (6.5 Gbs) exhibits the lastest intron sizes of Eukarya [90]: ∼75% of its introns are between 5,000 and 50,000 nts. Surprisingly, the abundance of its introns within (5.73) and across (81.68%) genes is not higher with respect to other protostome clades. Additional examples are described in Supplementary Tables S14-S15.
Panel A. Distribution of intron content, density and fraction of protein-coding genes containing introns across species. Intron contents with unique and repetitive intronic sequences (based on estimated genome sizes) are shown in grey and black bars, respectively. Estimations based on the assembled genome sizes are plotted in Figures S7a-S10a. The information represented by the dots is two fold: 1) the fraction (%) of genes with introns is represented by the coordinate with respect to the top scale; 2) intron density is depicted by the size and the color of the dot, so that, a bigger dot with an intensified red color implays the presence of more introns per genes. Panel B. Intron size distribution within a genome is represented by the fraction (%) of introns from the total population binned in the following ranges: 15-50 nts (brown), 51-100 nts (yellow), <250 nts: introns presumably spliced by intron definition (green), >251 nts: introns presumably dismissed by exon definition. Panel C. Grey box-plots show the descriptors (quartiles, means and outlier-thresholds) for the intron length distribution in every genome; from right to left: Q1, median, Q3, upper-fence (line), standard average size (blue dot), and weigthed-average size (red dot). Panel D. Distribution of exon features. On the left side, bars show the exon content (bottom scale) with unique and repetitive exonic sequences in grey and black, respectively. Genome size (log Mbs) is represented by the coordinate with respect to the logarithmic top scale. On the right side, the exon size distribution within the genome is shown as described for introns in Panel C. Some upper-fences were cut to avoid a higher compactation of the data. The images for model species were kindly provided by silhouettesfree.com and ClipArt.com.
For panel description see Figure 3.
Repeats differentially contribute to intron size and content across eukaryotes
We also investigated how significant is the contribution of repeats to the size and content of both introns and genomes. In partial agreement with previous studies [63, 64], we found that genome size is strongly associated at the broadest phylogenetic scale with its repetitive content (r2 = 0.723, logBF = 1273.3), but to a much lesser extent with the repetitive-intronic content (r2 = 0.485, logBF = 119.6) (see Table 2). As observed on Figure 2 and Table 3, repeats cover on average 13.6% of genome size in fungi, 41.6% in Viridiplantae, 26.5% in Metazoa, 17.7% in Choanozoa, 26.9% in Amoebozoa, 25.2% in Excavata, 26.7% in Stramenopiles, and 12.1% in Alveolata. When still observed at the supergroup level, the fraction of repeats covering the intronic genome content appears to mirror the previous trends: 8.8% in Fungi, 24.2% in Metazoa, 20.2% in Viridiplantae, 19.7% in Choanozoa, 29.6% in Amoebozoa, 13.0% in Excavata, 19.8% in Stramenopiles, and 14.6% in Alveolata. At the broadest phylogenetic scale, however, the repetitive content of the genome does not strongly correlate with intron size (r2 = 0.292, logBF = 56.4) or intron content (r2 = 0.314, logBF = 65.7), and it does not significantly associate either with intron density (r2 = 0.032, logBF = 3.1) or the fraction of intron-containing genes genes harboring introns (r2 = 0.018, logBF = 2.9) (see Table 2). These results show that the repetitive composition of introns is not strongly scaling with changes in genome size or the repetitive content of the genome at the broadest phylogenetic scale. Yet, fluctuations in the associations (from absent to strong) among intron features and the repetitive genome content are expected across lineages.
Figure 2 also shows whether or not the contribution of repeats to intron size is significant in every genome analyzed, according to the p < 0.001 obtained for the permutation tests performed on the Jaccard index (see Methods and Supplementary Table S16). In summary, we found no significant degree of nucleotide overlap between repeats and intronic sequences in most of the genomes analyzed in Fungi (71.0%), Viridiplantae (82.1%, with exceptions such as Prasinophytes), Aves, Oomycetes, Rhodophyta, and over half of the genomes within Mammalia. By contrast, the degree of overlap between repeats and intronic sequences was found statistically significant in most of the genomes analyzed within Metazoa (with notable exceptions), Amoebozoa (71.4%), and Alveolata (80%) (see Supplementary Table S16). Examples of species and clades with considerable high repetitive-intron contents (25-50%) and significant nucleotide overlaps between repeats and introns include: the haptophyta Emiliania huxleyi, the sea sponge Amphimedon queenslandica, Dictyostelia, Prasinophytes, Cnidaria, protostomates such as Lophotrochozoa, Diptera and Lepidoptera, as well as vertebrates such as Teleosts, Anura, Turtles, Crocodylia and Mammalia. With exception of Prasinophytes, high repetitivegenome contents (25-40%) are also observed in the previous lineages. Conversely, an apparent concerted reduction in the contribution of repeats to genome size (6-15%) and intron content (3-14%) –also supported by a non-significant overlap between repeats and intronic sequences– is observed in most species within Ascomycota, Nematoda, Hymenoptera, Aves, and in Trichoplax adhaerens. This particular pattern requires further research. Additional examples with statistic estimators are revisited in Supplementary Tables S14-15, S18.
Genome-wide exon features are also weakly associated with genome size, but evolving steadier than intron features
We further estimated the phylogenetic patterns of exons features across eukaryotes to contrast those observed among intron features. As shown in Figures 2-4 and Table 5, around half of the genome size is covered by exonic sequences in Fungi (44.9%), Amoebozoa (53.8%), Alveolata (51.9%), Prasinophytes (81.5%) and Excavata (53.1%). In contrast, low fractions of exon content are observed in the genomes of land plants (13.8%) and metazoans (5.2%), with pinus and mammalian genomes harboring barely 1% of protein-coding nucleotides. As with intron features, we found that the variation estimated for exon features is weakly associated to the variation observed in eukaryotic genome sizes at the broadest phylogenetic scale. As observed in Table 2 and Supplementary Tables S6-S10, genome size is positive but to a lesser extent associated with: exon content (r2 = 0.262, logBF = 6.2), the total number of both exons (r2 = 0.316, logBF = 11.3) and CDS (r2 = 0.328, logBF = 20.3), as well as the average length of CDS (r2 = 0.263, logBF = 43.8). These estimations are in agreement with previous studies [64, 66]. However, none or no simple associations were found for genome size against exon density (r2 = 0.054, logBF = 0.05), exon size (r2 = 0.096, slope = -0.114, logBF = 0.2), and the repetitiveexon content (r2 = 0.135, logBF = 0.9).
Summary statistics, mean value and (coefficient of variation), of several protein-coding features for different species datasets.
Contrasting the intron size’s patterns, the distribution of exon size is tighter across eukaryotes (see Tables 3 and 4). We found that between 50% and 75% of the exon population within a genome has a narrow and small length below 250 nts across plants, green algae, Choanozoa, basiodiomycetes, and Metazoa. Accordingly, we observed that exon length distributions are less-skewed in most eukaryotic clades (Figures 3D and 4D) when compared to intron length distributions (Figures 3C and 4C). Nevertheless, large exon lengths are observed in Rhizaria (407.2 nts), Amoebozoa (600.7 nts), Stramenopiles (687.5 nts), Alveolata (880.7), Excavata (1,357.3 nts), Prasinophytes (985.9 nts), Rhodophyta (870.9), Ustilaginomycotina (1,107.1 nts) and Ascomycota (721.6 nts), particularly in Saccharomycetes (1,167.23 nts) (see Figures 3D-4D and Supplementary Figures S7-S10). As observed in Table 2, furthermore, two different correlative associations were found for exon size at the broadest phylogenetic scale. On the one hand, exon size significantly decreases as the density and total number of exons in the genome increases (r2 = 0.731, slope = -0.880, logBF = 97.5 and r2 = 0.517, slope = -1.305, logBF = 10.5, respectively). On the other hand, none or no simple associations were found between exon size and the number (r2 = 0.097, logBF = 0.6) or length of CDS (r2 = 0.050, logBF = -0.01). These results are in overall accordance with previous estimations [66, 91, 92].
As observed in Table 5 and Figure 2, the fraction of repeats covering the exonic genome content across eukaryotes is considerably small (between 2% and 8%), in comparison to introns. Although larger fractions are observed in some species within land plants (23.05%), Lophotrochozoa (15.2%), Cnidaria (26.3%), Basiodiomycota (10.4%), Excavata (17.5%) Amoebozoa (16.2%) and Stramenophiles (13.0%). In agreement with these observations, the average length of exons is not significantly associated to its repeat content (r2 = 0.034, slope = -0.564, logBF = 2.3) or the genome repeat content (r2 = 0.094, slope = -1.269, logBF = 0.1). Based on the results from the Jaccard index’s permutation tests (see Figure 2), we found no significant degree of overlap between repeats and exonic sequences across most of the eukaryotic genomes analyzed here: Amoebozoa (100%), Choanozoa (100%), Metazoa (97.1%), Fungi (92.4%), Excavata (92.4%), Viridiplantae (86.5%), Stramenophiles (82.4%) and Alveolata (80%). The noteworthy exceptions can be observed in Figure 2 and Supplementary Table S16.
Intron-richness is robustly associated to complex multicellular organisms and their closest ancestral relatives
We further investigated the relationship between genome-wide intron features and multicellular complexity. As described in Appendix 1, we developed four criteria and three definitions to distinguish the species in our dataset as: complex multicellular (CMOs: 288), simple multicellular (SMOs: 96) or unicellular (77) organisms (see Supplementary Table S2). Accordingly, CMOs are defined here as those organisms exhibiting an irreversible transition in individuality produced by tissue-based body plans, through the developmental commitment of multiple and different cell types originated from a common cell-line ancestor. We then used Principal Component Analyses (PCAs) with direct comparative data (compPCA) and phylogenetically independent contrasts (phyloPCA) to investigate how seven intron features are covarying among themselves (Figure 5b), with other eight genome features (Figure 5e), with complex multicellularity (Figure 5a,d), and with genome size (Figure 5c,f). Noteworthy, the findings that are going to be described next, also hold for both the com-PCA analysis and the phyloPCAs performed with different tree topologies and assembled genome sizes (see Supplementary Figures S11-S15 and Tables S18-S20). As observed in Figure 5, high intron-richness does segregate CMOs (in red) from both SMOs (in blue) and unicellular organisms (in yellow), by clustering through genome-wide intron features exclusively and in conjunction with other genome features. The first two principal components (PC1 and PC2) from both phyloPCA analyses capture most of the variances in the data: 87.16% with 7 intron features and 64.6% with 15 variables. However, the association of the variables on each principal component is different in both phyloPCAs (see Table 6), as described next.
Contribution (%) of 7 intron features, 15 genomic traits and cellular complexity to the first seven principal components as estimated with the phylogenetic Principal Component Analyses (phyloPCA).
The phyloPCAs were performed with the reference tree topology (Figure 1a). The biplots depict the two first components (PC1 and PC2) inferred from the phyloPCA of 7 intron features (a-c), 14 genomic traits (d-f) and the organism complexity (a and d) estimated for 457 sequenced eukaryotes. In plots a) and d), the species are color-coded according to their organismal complexity (see Appendix 1): unicellular (yellow), simple multicellular (blue), complex multicellular (red). In plots b) and e), species are color-coded by major eukaryotic supergroups: Fungi (blue), Metazoa (pink), Viridiplantae (green), and “protists” (brown). In plots c) and f), species are color-coded by their log10-transformed genome sizes. Species are clustered in the phyloPCAs according to their dispersion along PC1 and PC2, with a confidence limit of 0.95. The dark lines radiating from (0,0) represent each variable included in the analysis; the direction of a line represents the highest correlation coefficient between the scores of the principal components and the variable, while its length is proportional to the strength of this correlation. Noteworthy, the Cronbach’s alpha coefficients obtained for the sets of 7 and 15 variables (0.88 and 0.95, respectively) indicate high internal consistency among the variables to measure the same underlying concept through phyloPCA analyses (see Methods). PhyloPCA analyses performed with alternative tree topologies and assembled genome sizes are provided in Supplementary Tables S16-S20 and Figures S11-S15.
By analyzing the phyloPCA of intron features, we first observe that high intron-richness clusters CMOs together (Figure 5a), regardless of the lineage they belong to (Figure 5b) or the wide dispersion of their genome sizes (Figure 5c). We further observe that the content, repeat composition and number of introns are mainly contributing to PC1 (71.82%), while PC2 has a major contribution (73.24%) from the fraction of introncontaining genes, intron size and density (see Table 6). In the broarder phyloPCA analysis that includes additional genome features, we also found that intron features are leading the clustering of CMOs, while other genome features are clearly setting apart the major eukaryotic supergroups: metazoans (in pink), plants (in green), fungi (in blue) and protists (in brown). Consequently, the association between high intron-richness and CMOs is found to be in part a byproduct of other genomic features constraining the development of multicellularity in every supergroup. For instance, PC1 clearly captures the ncDNA complexity of the genomes, while PC2 captures their protein-coding complexity. Accordingly, PC1 has a major contribution (51.5%) from intron content, unique intron content, unique ncDNA, genome size and genome repeat content (see Table 6). Notably, the scaling distribution of genome sizes along PC1 in Figure 5f endorses the strong correlative association found between ncDNA and genome size at the broadest phylogenetic scale (r2 = 0.859, see Table 2). Conversely, exon features (size, content, repeat composition), the number of CDS, the fraction of intron-containing genes and intron density contribute mostly to PC2 (89.8%). Hence, intron-richness in land plants is mainly associated with the high contribution of repeats to the genome, while intron-richness in metazoans is mostly related to the non-repetitive ncDNA fraction of the genome. By contrast, most unicellular and SM species (in fungi and protists) are strongly associated with several exon features, particularly with exon size and the number of CDS.
The previous patterns strongly suggest that high intron-richness is a robust genomic fingerprint of both CMOs and their closest ancestral relatives. For instance, ancestral relatives clustered among CMOs include: the choanoflagellates Salpingoeca rosetta and Sphaeroforma arctica, the sponge A. queenslandica, the slime mold Fonticula alba, the green alga Volvox carteri, and other chlorophytes (see Figure 5a,d). Nevertheless, there are some exceptions to this trend. First, there are also few exceptional intron-rich unicellular organisms clustered among CMOs, with no evidence of multicellularity (either simple or complex): the dinoflagellate Symbiodinium minutum, the chlorarachniophyte alga B. natans, the green alga Chlamydomonas reinhardtii and the apicomplexans Toxoplasma gondii, Neospora caninum and Hammondia hammondi. The high intron-richness of these unicellular species –comparable to the one observed in vertebrates– has been already acknowledged, and is suggested to have a role in the development of their complex life cycles [93–96]. Second, we observed a very few intron-poor CMOs, such as the red algae Chondrus crispus and the highly derived cnidarian parasite T. kitauei, that have undergone massive loss of introns and genome reduction due to extreme lifestyle conditions [44, 97]. Until more genomes of multicellular red algae are sequenced, however, it would be unclear to know whether the relative intron-poorness observed in C. crispus is an ancestral constraint or a derived (exceptional) trait [44]. Despite the presence of few intronpoor CMOs, the statistical robustness of the association between complex multicellularity (CM) and intron-richness is provided by most of the 288 intron-rich species classified as CMOs in this study, rather than by the five so far (out of the six) independent instances where CM has evolved. This is because intron loss overcomes intron gain across eukaryotes [1, 3, 4, 7] and intron-richness does not necessarily depend on genome size (this study). Therefore, there are no reasons to expect that high intron-richness has been maintained from intron-rich ancestors nor that it is only present in CMOs with large genome sizes. Yet, the major inconsistency found on both phyloPCA analyses (Figure 5) is the absence of a clear separation between some fungal SMOs and CMOs. This can be explained by the difficulty to determine the precise multicellular lifestyle of several fungi as simple or complex, owing to the lack of detailed life-cycle descriptions and the presence of “fuzzy” fruiting body development that challenge the evaluation of the criteria 3 and 4 in our definitions (see Appendix 1). This is an issue not only faced here, but also discussed elsewhere [98–101], and thus require further research.
DISCUSSION
Spliceosomal introns form a dynamically evolving ncDNA class, most likely under the influence of diverse life-history factors and evolutionary forces
Consistent with previous results [45, 66, 92] and with some note-worthy exceptions, our findings show that exons within protein-coding genes have remained within a narrow average size between 150 and 300 nts in Metazoa, Choanozoa, Viridiplantae and Basidiomycota. Hence, we observe that significant modifications in the structure of protein-coding genes in these lineages are basically a consequence of changes in the length and abundance of introns. Unexpectedly, these and other genome-wide features of introns are found to be repeatedly decoupled among themselves and from genome size evolution throughout Eukarya. Three findings support this observation. First, the strength of the associations among genome size and genome-wide intron features is different at the lineage-specific level, as consistent with other studies [76, 81, 102, 103]. Second, the features estimating the length and abundance of introns in a genome are weakly associated among themselves at the local and broadest phylogenetic scales. This explains the heterogeneous and contrasting patterns of intron evolution reported in the literature [2, 7, 55, 70, 76, 81]. As a consequence of the previous findings, changes of intron content cannot fully account for the large variations of eukaryotic genome sizes (in agreement with [64]), nor be strongly associated to the variation of one particular intron feature. Third, the repetitive composition of introns is not necessarily scaling with changes in genome size or the repeat content of the genome. Indeed, introns are found to be far from representing repetitive sequences in several lineages. As argued below, these results do not contradict –and even endorse in several cases– the contribution that different repeat classes have to either the origins [9, 11, 12] or length extension [73–76, 103] of introns at particular lineages. Therefore, our findings collectively unveil spliceosomal introns as a dynamically evolving ncDNA class.
Can a major mechanism (adaptive and non-adaptive) offer a unifying explanation to the highly heterogeneous patterns of intron evolution and genome complexity? Our findings suggest that this is highly unlikely. None of our correlative analyses imply causation nor offer evidence for the evolutionary mechanisms explaning the phylogenetic patterns reported here. Yet, strong (linear) correlations among several measures of genome complexity haven been often provided as evidence, by some evolutionary models [62, 65], to imply a concerted evolution among genome size, ncDNA content and intron-richness across eukaryotes (also discussed in [37, 67]). We observe here that genome size, its overall ncDNA and repetitive contents are indeed strongly associated at the large evolutionary scale (see Table 2). However, we also demonstrate that changes in the variation of some intron features (such as size and repeat composition) are only weakly, while other features measuring intron abundance are not, scaling with changes in genome size at the broadest phylogenetic scale. Our findings are thus in clear disagreement with previous estimations claiming the opposite [62, 63, 65, 66, 71]. Moreover, our results show that the genome-wide features determining length and abundance of introns across protein-coding genes are largely evolving independently throughout Eukarya. Our results are thus inconsistent with both a particular intron feature as key determinant of eukaryotic gene architecture, as well as a major mechanism (adaptive or non-adaptive) underlying a concerted effect between genome size and intron-richness over a large phylogenetic scale. Instead, the repeated decoupling among intron features themselves and with genome size strongly suggests that the major genome-wide features of introns from coding genes are evolving under the influence (direct or indirect) of either different or several life-history factors and evolutionary forces.
For instance, non-adaptive mechanisms such as the longterm evolutionary dynamics of repeats –which depend on factors like methylation propensity, RNAi-mediated interference and mating system– are found to determine concerted changes of genome and intron size in certain lineages [67, 73]. Examples observed here and elsewhere include some species within red algae and plants [44, 74, 103–105], insects [90, 106], fish [75] and birds [76, 88, 89]. Other studies suggest that intron density is determined by mechanistic factors such as nonhomologous end-joining (NHEJ) of DNA segments, reverse transcriptase and transposition activities [7, 70]. Also, introner-like elements greatly contribute to episodic intron gains in some algae [9, 12] and fungi [11]. And recently, rates of spontaneous introncreating and -deleting mutations were found to shape the intron-exon structures of several distantly related species [107]. Under adaptive forces, variations of intron size and density across several eukaryotic lineages have been associated with the action of natural selection: (a) to conserve regulatory binding sites [17, 20, 21, 23] and regulatory ncRNAs [108–110]; (b) to promote the creation of new exons in vertebrates [49, 111]; (c) to reduce splice error rates [68] and protect against transcription-associated genetic instability [52]; (d) to favor co-transcriptional splicing and nucleocytoplasmic export of highly expressed and rapidly regulated cell-cycle genes [34, 37, 112]; (e) to reduce the metabolic costs associated with either powered flight in birds [76, 88] or environmental changes of habitats in teleosts [113].
Consequently, our findings also endorse concerns [67, 114, 115] regarding how much of the content, variation and complexity of intron-richness (along with other ncDNA classes) in Eukarya can be explained by the strong action of Ne and genetic drift over a large evolutionary scale, as the “mutational-hazard” (MH) model states [62, 63, 80, 116]. In addition to the inconsistencies discussed previously, other studies [67, 68, 86] were not able to find statistically significant associations among Neµ and several intron and genome features after removing the phylogenetic signal from the dataset of Lynch and Conery [62]. Another studies have also shown that Ne cannot exclusively or even largely explain major changes of intron and genome sizes in lineages within amniotes [76], insects [117], ascomycetes [81], and plants [67, 86]. Furthermore, reliable estimations of Ne (such as Neµ and Ka/Ks) are still a matter of debate [67, 118], since they do not correlate well [68, 119] and are affected by several life-history traits [114, 120]. It is important to note that our findings do not dismiss the impact that Ne and genetic drift has on the accumulation of ncDNA in eukaryotes. Rather, they argue that the population genetic settings suggested by the MH model are most likely to be dominant at the local phylogenetic scale, or in particular intron features from coding genes, or during recent founder events, or over introns located at non-coding regions.
The robustness of systematic and phylogenetically controlled analyses
The results summarized above are based on a phylogenetic controlled framework over the largest and most diverse dataset of eukaryotic complete genomes to date. A major concern is, however, that phylogenetic uncertainty might affect considerably any phylogenetically controlled analysis [67, 82]. Here, we recapitulated no significant changes in our results after taking into account significant and numerous phylogenetic disagreements from four tree topologies estimated for the 461 species analyzed in this study. We could argue that the literature-based tree might reflect better the community consensus about the evolutionary history of these eukaryotes, since it is consistent with the Open Tree of Life [85] and with the species phylogenies reported with the complete genomes. However, the backbone of the tree of eukaryotes is still subject to deep rearrangements and competing hypothesis [84, 85]. Therefore, we can never exclude the possibility that a suggested tree is free of errors nor the existence of a better phylogenetic representation. Here, we have demonstrated that our phylogenetic controlled analyses are strongly robust to: (a) uncertainties about ancestral branches, such as Parahoxozoa in Metazoa [121, 122] or Excavata in Eukarya [84]; (b) discrepancies about the phylogenetic position of particular clades among their “peers”, for instance, Microsporia within Fungi [50] and Rhizaria within SAR [84]; (c) uncertainties in poorly resolved branches, such as Arthropoda and Rhizaria [123, 124]; as well as (d) common tree reconstruction problems, such as the presence of hard polytomies and long branch attraction of species.
In fact, the most significant discrepancies of our regressions from previous estimations over a large evolutionary scale are caused by the absence of correction for phylogenetic signal, rather than by the accuracy of the topological information used to account for it. As also shown by Whitney et al. [67] and Wu and Hurst [68], uncorrected phylogenetic dependencies among species (which assumes a star polytomy) lead to a much stronger correlation signal, as those strong correlations reported by previous studies [62, 65, 66]. We further showed that the correlation signal can also be affected considerably under controlled phylogenetic analyses, as some of those correlative associations estimated recently [64, 68]. Some of the factors analyzed here that lead to such biases include: low phylogenetic diversity, very small species datasets (<100 species) attempting to represent the current diversity of the sequenced eukaryotes, and the lack of systematic estimations of genome traits. These problems are particularly found at the clade-specific level (as also reported in [76, 81]), and in biased estimations of intron features (e.g., density and genomic content) that already incorporate changes of genome size or on the number of protein-coding genes (see Appendix 2). We also demonstrated that our results are robust, not only to a reduced number of species (as long as the phylogenetic diversity of the dataset is maintained), but also to different sources of genome size estimates.
Yet, two additional factors might challenge the findings of this study. First, a limited access to both larger genomes (>5 Gbps) and genomes from deep branches of the eukaryotic tree precludes us from evaluating eventual biases on phylogenetic and “C-value” diversity so far. Second, substantial errors in genome assembly and/or annotation of protein-coding genes might considerably affect estimations of genome-based features. For instance, under- and over-estimation of genomic features were recently found on the genome projects of Branchiostoma floridae and Hydra magnipapillata, respectively, after RNA-seq-based re-annotations were performed in seven holozoans [66]. It is unclear, however, which genome features are the most affected by these biases, because the method employed in such study cannot distinguish introns and exons of coding genes from non-coding genes, such as pseudogenes and ncRNAs. This distinction is relevant in our study, since coding genes have a different copy number and experience different selective pressure than non-coding genes [125–127]. Moreover, annotation of protein-coding genes is becoming increasingly reliable owing to the incorporation of unbiased RNA-seq data, as most of our annotations are supported by. Through the filtering approaches of GenomeContent, we further found that very small introns and exons (≤15 nts) represent < 1% from the total number located in the coding genes in all genomes. Likewise, annotations of coding genes exhibiting intron sizes with an excess or deficit modulo 3 (i.e., coding regions probably mistaken as introns and vice versa, respectively) are also unlikely to occur in our dataset (see Supplementary Figure S1 and Table S3). We acknowledge that particular phylogenetic patterns might be challenged by the completeness of genome assemblies in those lineages with very few sequenced genomes [66] or that will undergo substantial genome size corrections [128]. However, we also demonstrated to a considerable extent that possible biases in the genome assemblies and annotations analyzed here do not significantly impact our correlations and overall findings, after testing the genome completeness of 461 projects over 200 randomly reduced datasets. In disagreement with [66], thus, we show that the genomic differences obtained from most genome projects are robust enough to evaluate biological and evolutionary large-scale patterns of genome features across eukaryotes.
Intron-richness is a suitable pre-condition to evolve complex multicellularity
Despite the many origins of multicellularity on Earth, complex multicellularity (CM) evolved only a few times in Eukarya [129–131]. We developed here a conceptual framework to define CM beyond the number of unique cell types (UCTs) (see Appendix 1). Our definition follows West and colleagues [132, 133] in acknowledging that contingent irreversibility from clonal-unitary development is key to evolve obligately multicellularity, but differs in arguing that a reinforced irreversibility of developmental commitment from multiple cell types is the main determinant in clonal multicellularity for a major transition to occur in individuality. By formally differentiating simple and CM, we thus hypothesize that CM is the outcome of major evolutionary transitions [133, 134] involving the presence of innovatory changes in genome structure and expression due to the differential evolution of particular ncDNA classes along the eukaryotic lineages [131]. We found here that complex multicellular organisms (CMOs) are characterized by high intron-richness; even those CMOs that have undergone strong selection to reduce several classes of ncDNA and genome size, such as carnivorous plants [135] and birds [76, 88, 89]. We show that the association between CMOs and intron-richness does not depends on changes of genome size, which in is agreement with the study of Niklas [136] indicating that increases in the number of UCTs fail to keep pace with increases in genome size. Our findings also suggest that CM origins were most likely preceded by high intron-richness, since the latter is also found on the closest unicellular and simple multicellular relatives of CM lineages. This is consistent with episodes of rapid and extensive intron gain found on the basal lineages of opisthokonts, holozoans and plants [1, 13, 137]. Furthermore, the diversity of intron-richness observed here among CMOs is not random nor homogeneous. Instead, it appears to be constrained by different factors that demand further research, including shared phylogenetic history, widely divergent selective regimes, lifestyles and generation times.
It is becoming clearer that intron-richness has important phenotypic consequences on eukaryotes, but which of these consequences can be considered indispensable to promote CM convergently? As summarized earlier, the functions –either causal roles or selected-effects– of introns promoting the emergence and evolution of CM can be very diverse. Most of these functions are, however, neither exclusive of CMOs nor universal across eukaryotes, but rather the outcome of exaptations originated on independent occasions [138]. This is partially because, as shown in this study, introns possess different characteristics throughout the major supergroups. Also, the rates of intron conservation and the molecular mechanisms responsible for intron processing vary considerably across eukaryotes [2]. Most importantly, introns appear to affect virtually every step of mRNA maturation, as described previously and reviewed in [138]. Yet, the role of exon skipping (ES) has been highlighted as the main promoter of multicellular complexity by expansion of proteome diversity [2, 59] through selection of, for instance, new transcription factor families, cell adhesion and signal transduction proteins [98, 139, 140]. However, most ES events and isoforms (mainly in low abundance) are found to be mainly the outcome of stochastic splicing errors [141, 142]. Also, transcriptome analyses from diverse tissues and cell lines reveal that most genes express one and the same dominant transcript in multiple tissues in human [143], mouse [144] and fly [145] (but see [146, 147]). It remains thus to be fully understood to what extent ES events are actually contributing to the suggested protein diversity of CMOs [59, 131].
The phenotypic diversity of CMOs largely relies on the expression of ancestral and species-specific genes coordinated in a particular spatiotemporal manner. We argue here that intronrichness has facilitated this process to a great extent by tuning the transcriptomes of an organism through intron-mediated mechanisms (IMMs) that alter the timing or kinetics of transcript expression. For instance, intron retention coupled to components of the RNA surveillance machinery can modulate gene expression post-transcriptionally by slowing splicing kinetics of those intron-containing transcripts stored at the nucleus in response to a variety of cellular signals [138, 148, 149]. Diverse modes of transcript regulation by intron retention have been found during cell type differentiation, cellular stress, circadian rhythm and early embryogenesis in plants [147, 150–152] and mammals [58, 153–156]. Alternatively, intron delay can coordinate the expression of genes that are sensitive to changes in their transcript length during particular stages of the metazoan development [33, 34]. The transcriptional delay caused by the presence of introns in the Hes7 locus, for instance, controls the proper oscillatory expression of the genes involved in body segmentation during early vertebrate embryogenesis [36, 157–159]. Notably, the intron lengths of Hes7 and of other genes involved in developmental patterning across mammals are highly conserved and even coevolving among coexpressed genes [160, 161]. These findings are consistent with studies showing that some species with long-complex life cycles and slowly regulated cell-cycle genes appear to be enriched in intron-rich gene-structures and patterning processes [33, 34, 148]. On the other hand, short life cycles and rapidly regulated cell-cycle genes (at different stages of development) tend to constrain gene-structures toward short and intron-poor genes for efficient expression [7, 26, 34, 35, 37, 69, 162, 163].
Rather than a selective (ultimate) effect, a major influence of IMMs to differentiate functional from nonsense transcripts in CMOs is expected if we consider that: (i) intronic RNAs constitute a considerable fraction of the transcriptomes in the CMOs analyzed [28–30]; (ii) canonical and non-canonical splicing errors (from which these IMMs emerge) appear to be more frequent when intron-richness increases [49, 58, 61, 141, 142]; and (iii) the spatiotemporal patterns of transcript expression derived from IMMs do have significant ecological and evolutionary consequences for cell cycle control and body plan formation [33, 34, 148, 149]. Nevertheless, some of the phenotypic consequences of intron-richness are also expected in certain life histories that did not evolve CM due to different evolutionary conditions. For instance, IMMs appear to influence the development of complex life cycles in intron-rich unicellular and simple multicellular species (as defined in Appendix 1) such as Apicomplexan parasites, which often involve multiple hosts and/or differentiation stages [96, 164]. Similar findings are expected to be found in the upcoming complete genomes of other CMOs within the red algae and coenocytes such as C. taxi-folia [165]. Ultimately, major evolutionary transitions require extreme conditions for certain factors to become consistently important [133]. We argue that high intron-richness (through ES and IMMs processes) has laid the foundations for the emergence of novel mechanisms of transcriptome timing in Eukarya, which under exceptional conditions might have convergently co-opted for fate specification and commitment of cell types. It remains to be known whether common life-history traits [166, 167], molecular mechanisms [168, 169] and evolutionary forces have parallelly shaped the evolution of both intron-richness and tissue-based body plans in Eukarya.
METHODS AND MATERIALS
Genome-based data collections
We compiled the complete-sequenced nuclear genomes, protein-coding genes and gene annotation files for a total of 461 eukaryotic organisms from publicly available databases: 131 fungi, 78 species from Archaeplastida, 186 from Metazoa, 20 from Alveolata, 17 from Stramenopiles, 7 from Excavata, 7 from Amoebozoa, 4 from Choanozoa, 3 from Rhizaria, Fonticula alba (Fonticulidae), Guillardia theta (Cryptophyta), Emiliania huxleyi (Haptophyta), Thecamonas trahens (Apusozoa). A manual filter was applied to avoid redundant species (i.e., same genus with similar genome sizes), sequenced genomes with <70% of the estimated genome size, and gene annotations without support from transcript data. To account for significant under and overestimations of genome contents, we also corrected our calculations with the “estimated” genome sizes based on experimental approaches collected from databases and literature. References and details of these datasets are provided in Supplementary Table S1.
Estimating intron features with GenomeContent.pl
As depicted in Appendix 2-Figure 1, GenomeContent was written in Perl to calculate global statistics and sequence-based estimators of genome features through six major steps: (1) identifying coordinates from protein-coding gene (CDS); (2) checking the quality of intron annotations; (3) calculating statistic descriptors for genome-wide features from “reference gene sets” (derived from 1 and 2), such as size, density and number; (4) estimating genome-feature contents from the overlapping projection of “reference gene sets” onto the genome sequence; (5) calculating statistic descriptors for genome-feature contents; (6) plotting of figures and retrieving of sequences (fasta format) and statistics (text format). According to the definitions described in Appendix 2, we measured 30 genome features with GenomeContent across 461 eukaryotes, including 10 associated to the intron-richness of a genome: intron size (average and “weighted”, see equation 1 in Appendix 2), total number, absolute density, genomic content, number and fraction of genes containing introns. As observed in Appendix 2-Table 1, “(absolute) intron density” was measured as the average number of introns per intron-bearing gene because other estimations based on the average number of introns either per sequence region or from the total number of protein-coding genes are vulnerable to both genome size and considerable fluctuations of gene models, respectively.
Determining the repeat content of genome features and statistical tests
We focused on identifying de novo repeats along the genome, rather than on classifying them in specific families. The Repeatscout algorithm v1.0.5 [170] was employed to compare a genome sequence against itself and in the two reading directions in order to identify de novo repetitive sequences with the minimal k-mer length of 15 nts. To gain major repeat coverage, the de novo Repeatscout libraries were merged with the Repbase libraries version 20.03 (http://www.girinst.org/repbase/). Then, these merged repeat libraries were used to map the coordinates of the repeats (interspersed and short repeats, and low complexity sequences) across the complete genomes with the RepeatMasker program v4.0.5 (www.repeatmasker.org). The placeholders were also taken into account as an independent category of “potential undefined repeats” within each genome feature (see Figure 2). However, the proportion of such regions in a genome is not collected in an individual genome feature for statistical analysis, given that they usually include the highly repetitive heterochromatic sequences that could not be unambiguously sequenced. Based on the genome coordinates obtained from this process, a nucleotide was classified as repetitive if it was covered by a repetitive sequence on either strand. Accordingly, the repeat content within introns, exons and the genome was determined as the total number (or percentage) of nucleotides in the genome feature that were classified as “repeat”; while the unique content of a particular genome feature is thus estimated from the non-repetitive nucleotides.
With a custom Perl script, we then calculated the Jaccard index, as J(R, GF) = |R ∩ GF|/|R ∪ GF|, to estimate the nucleotide overlap between the repeat coordinates (R) and the genome features (GFs: exons and introns) by using the “feature content coordinates” of every genome as observed in the Appendix 2-Figure 1. We tested the significant degree of observed overlap between repeats and GFs for each genome with the GenometriCorr package [171] from the R program v3.1.2 (www.r-project.org). For each genome, 1,000 permutations were allowed to shuffle the repeat coordinates along the genome sequence. Exon and intron positions were preserved as reference, while the repeat coordinates and the random sets were provided as the query. This setting provides the correct assessment of correlations (p-value) for the relative distance under the Kolmogorov-Smirnov criteria and for the permutation test on the Jaccard index [171, 172], which indicate whether the overlap is less (TRUE) or more (FALSE) significant than expected by chance. The statistical reports are available as Supplementary Data.
Construction of phylogenetic trees
We constructed four different tree topologies, each of which has been extensively used in literature, for the 461 eukaryotes analyzed in this study, in order to: (a) evaluate phylogenetic signal, (b) perform phylogenetic controlled analyses, and (c) test the robustness of our comparative analyses against some of the phylogenetic uncertainty surrounding the tree of eukaryotes, owing to phylogeny construction errors and the absence of the true but unknown species tree.
To test for sensitivity to topological and branch length errors, we constructed a phylogeny determined by protein domain content as described in [173] (Figure 1d). Accordingly, hmmscan from HMMER v3.1b1 [174] was used to search for the protein domain models of the Pfam-A families from the PFAM database v30.0 [175] against the 461 proteomes. We used the gathering threshold (–cut-ga) for filtering out false positives. Custom Perl scripts were used to obtain a presence/absence matrix for all pfam domains detected and to calculate a pairwise distance matrix for all analyzed species. A weighting factor was included to correct the distance between two genomes (owing to the great differences in genome size, gene content, and lifestyles), and according to the following relationship: D = A′/(A′ + AB), where A′ is the number of unique pfam domains in one of two genomes compared: A and B, and AB is the number of pfam domains they share. Thus, “the two tendencies are acknowledged by setting the evolutionary distance equal to the ratio of the unique domains in the smaller genome (A′) to its total number of domains (A′ + AB)” [173]. The phylogeny construction was performed with the neighbor-joining method and bootstrapping with the program neighbor from PHYLIP v3.68 (http://evolution.genetics.washington.edu/phylip.html). We used Trypanosoma brucei as an outgroup, based on the supported basal phylogenetic position of Eozoa (Excavata and Euglenozoa) within Eukarya [176, 177].
Additionally, two NCBI taxonomy-based trees with no branch lengths were obtained with the species IDs collected from the “Taxonomy Browser” of NCBI (https://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/, last accessed July 14, 2016, see Supplementary Table S1) and with a combination of the phyloT (www.phylot.biobyte.de) and iTOL (www.itol.embl.de/itol.cgi) tools to allow or not the use of polytomies on each tree. A fourth tree topology (unrooted and with no branch lengths) was obtained by manually “correcting” the NCBI taxonomy-based tree (not polytomies allowed) with the TreeGraph 2 v2.13.0-748 beta program [178] to fix the resolution at the genus and species level according to: (1) clade-specific phylogenies based on candidate orthologous sequences reported on literature, and (2) supertrees reconstruction for those few species that have not been incorporated into a sequence-based phylogeny yet. Polytomies were introduced in cases where phylogenetic uncertainty is not solved according to different studies. The 110 references employed to construct this consensus tree, some of which include the phylogenies reported along with the complete genome projects, are available on Supplementary Table S1. The literature consensus-based tree was selected as the “reference eukaryote tree” (Figure 1a) to present the results throughout the article. All tree topologies are available as Supplementary Data.
We quantitatively estimated the dissimilarity among the four trees with two measures reported to performed best among topology-only metrics [179]. The symmetric difference of Robinson and Foulds (RF) measures the number of different partitions (or clades not shared) between two trees [180], whereas the tree aligment metric (Align) of [181] scores the mismatches in the best alignment of the similar (and same) branches between two trees. Absolute RF distances were also divided by the total number of species in the tree to estimate the number of partitions per species. The Align and RF scores were calculated with the python scripts implemented in [179] and available at http://datadryad.org/resource/doi:10.5061/dryad.g9089.
Phylogenetically corrected analyses
We first evaluated the strength of phylogenetic signal (i.e., their statistical non-independence) exhibited by the 30 genome-based features analyzed in this study with the tree topologies described previously. Thus, we calculated the Pagel’s lambda (λ) transformation for all genome features analyzed with the caper R package (pgls) [182]. In comparison to other indices, Pagel’s λ is very robust to both incompletely resolved phylogenies and suboptimal branch-length information [83, 183]. We then calculated coefficients of determination (r2) to estimate the strength of the correlation to associate the variations observed between two traits (X and Y) with three linear regression models: Ordinary Least Squares (OLS) (stats R package: lm), Phylogenetically Independent Contrasts (PICs) (ape R package: pic) [184], and Phylogenetic Generalized Least Squares (PGLS) (caper R package: pgls). We also calculated log Bayes Factors (LogBF) to estimate the significance of evidence for the correlation between X and Y with the Markov chain Monte Carlo (MCMC) method and the PIC and PGLS models. log Bayes Factors were calculated as:
with 100 stones and 10,000 iterations per stone to estimate the marginal likelihood, as implemented in the BayesTraits program v3 (http://www.evolution.rdg.ac.uk/BayesTraitsV3/BayesTraitsV3.html).
We tested the robustness of the phylogenetically controlled regressions to discrepancies in tree topologies, phylogenetic diversity and estimations of genome features. First, OLS, PGLS and PIC regressions were performed with two sources for genome size: genome assemblies and experimental estimations. Also, the PGLS and PIC regressions were performed with the four tree topologies described previously. For the protein domain-based phylogeny, we also performed PGLS regressions with both equivalent branch lengths (all = 1) and lengths derived from the distance matrix. To test the influence of not fully resolved trees owing to the presence of hard polytomies, we generated three additional topologies for the NCBI-taxonomy tree (with polytomies allowed), two of them with randomly resolved polytomies using the procedure multi2di=TRUE, and one tree with a non-random procedure multi2di=FALSE, with the R package ape (library picante). Furthermore, we employed the “replicated co-distribution” approach [87] to test whether the association between X and Y is replicated across multiple independent clades. Thus, PGLS regressions were performed over 20 different sets compiled from the original dataset of 461 eukaryotes: 18 datasets correspond to divergent lineages, and two datasets (with 100 replicates each) were created through the random selection of 231 and 116 species, respectively. To further test the impact of phylogenetic diversity on the sensitivity of our phylogenetically controlled correlations, we performed PGLS regressions with four additional datasets (see Supplementary Table S12): two further randomly reduced datasets of 58 and 29 species, another dataset with 26 out of the 30 eukaryotes used in the study by Lynch and Conery [62], and the dataset of 30 eukaryotic genomes from the study of Wu and Hurst [68]. All genome-feature values were log10-transformed prior to analysis, except for the few genome-feature values == 0 that estimated the absence of repeats within introns in some extreme intron-poor genomes such as Debaryomyces hansenii, Encephalitozoon cuniculi, Giardia intestinalis, Spironucleus salmonicida. These particular values were discarded from the corresponding analyses.
We computed the Cronbach’s reliability coefficient alpha to measure the internal consistency (inter-relatedness) of the variables employed to test their relationship with multicellular complexity with the R package psych (alpha) [185]. Comparative and phylogenetic comparative Principal Component Analyses (PCA) were performed using the R packages stats (princomp) and phytools (phyl.pca), respectively. Branch transformations (all = 1) for the phylogenetically controlled PCAs and PGLS regressions were performed with ape (compute.brlen). Remaining statistical tests were calculated with custom R scripts. The plots were prepared with the R packages ggplot, ggbiplot, phenotypicForest v0.2 (http://chrisladroue.com/phorest/) and the software Inkscape (https://inkscape.org/en/). The results displayed throughout the paper are based on the estimated genome sizes and the “reference eukaryote tree” (Figure 1a). The results obtained with alternative tree topologies and assembled genome sizes are available in Supplementary Material.
AVAILABILITY
The GenomeContent program and data are available upon request during peer-review, and will be openly available after publication.
SUPPLEMENTARY DATA
Supplementary Data, Text, Figures and Tables are only available for peer-review at the moment, but will be openly available after publication.
ACKNOWLEDGMENTS
We gratefully acknowledge the work and policies of the consortia and research groups for making publicly available the current wealth of genome datasets. We thank Jens Steuck and Ronald Kriemann for their assistance with the IZBI and MPI clusters required for computational work, A. Nabor Lozada-Chávez for his assistance with the R package, and Cristina Bojórquez-Espinosa for proofreading. We deeply appreciate the feedback received from previous anonymous reviewers.
FUNDING
National Council for Science and Technology in Mexico (CONA-CyT: 185993) and Max Planck Institute for Mathematics in the Sciences to I.L-C.
CONFLICT OF INTEREST STATEMENT
No competing interests declared.
Appendix
APPENDIX 1
Defining simple versus complex multicellularity
Multicellularity refers to the phenotype characterized by the self-organization of cells that undergo a transition in individuality to perform cooperative consumption of energy, survival, and ultimately, reproduction. Multicellularity has arisen multiple times during the evolution of life on Earth [129], and it can even be induced in experimental settings [186, 187]. However, multicellularity is hypothesized to unfold into two different transitions [130, 131, 167]: simple or complex. Complex multicellularity (CM) is restricted to Eukarya and has evolved independently a few times: florideophyte red algae, laminarian brown algae, viridiplantae, eumetazoan animals, basidiomycota and ascomycota fungi. Typically, the number of unique cell types (UCTs) is used as the defining feature of multicellular complexity. However, accurate estimates of UCTs are only available for a small fraction of the species, and they also fail to appropriately capture the complexity of multicellular species [136]. Thus, we have created three “working definitions” embracing four criteria that distinguish few distinctive aspects of cellular development and life cycle to recognize species in our dataset, first as unicellular or multicellular (criterion 1), and then as simple multicellular (SMO) or complex multicellular (CMO) (criteria 2-4):
Criterion 1
Whether there are one or several differentiated state cells at once across a life cycle. Some single-celled organisms may have several differentiated state cells but at different times during the life cycle, such as the yeast Saccharomyces cerevisiae (three UCTs) [188], or may develop a pseudo-hypha with very few spores during a transient stage of a life cycle to reproduce [189]. Also some naturally living unicellular organisms, such as S. cerevisiae and C. reinhardtii, may develop filamentous growth under particular experimental settings that induce stress, and as long as the selective pressure is mantained [186, 187].
Criterion 2
Whether the organization of the cells in a multicellular organism falls into one of the following states: a) differentiated cell types; b) undifferentiated cell types; or c) “syncytium/coenocyte” sensu amplio, i.e., multiple nuclei (either genetically identical or distinct) distributed within one common cytoplasm, which might or not be partially separated by cell membranes. Examples of the latter include: coenocytic Dictyostelium discoideum [190], siphonous algae and non-septate fungi [191]. Some coenocytic organisms, such as Caulerpa taxifolia [165], have morphological structures equivalent to a multicellular organ but not comprised in tissues or cells (i.e., pseudo-organs). While SMOs undergo a transition in individuality through any of the three cellular states, CMOs only undergo a major transition in individuality through differentiated cell types.
Criterion 3
Whether the transition in individuality, as defined in criterion 2, is facultatively or obligately replicated across generations. Facultatively multicellular species are able to complete their life cycle as unicells and only become multicellular under certain environmental conditions [132, 133]. For example, formation of fruiting bodies in some organisms, such as Dictyostelium, is observed in particular generations that undergo critical conditions to increase dispersal success [192, 193]. By contrast, obligately multicellular species can only complete their life cycle as multicellular organisms, owing mainly to the high genetic relatedness of cells originated through clonal-unitary development [129, 132, 133]. For instance, the development of tissue-based fruiting bodies (“basidia” and “ascocarp”) used for sexual reproduction in fungal CMOs is replicable on every generation. While SM is either facultatively or obligately replicated across generations, CM is only replicated as a whole. This criterion takes into account the temporal “unicellular transition” that all multicellular organisms, either simple or complex, undergo by means of reproductive processes through the life cycle [194].
Criterion 4
Whether or not the transition in individuality, as defined in criterion 2, is produced by irreversible tissue-based body plans. CMOs have tissue-based body plans that are developmentally irreversible, so that the within-group conflicts produced by “mutant-selfish” cell lineages (defectors) are negligible enough to avoid reversible differentiation of the whole organism [195–197]. Such irreversibility is consequence of active developmental commitment of multiple cell types that undergo fate specification and determination at particular stages of an organism life cycle. Cell type commitment is observed during the formation of: i) germ layers in metazoans [168] and eumetazoans such as cnidarians[198–200], ii) meristems in plants [201, 202] and in the CMOs within brown and red algae [203–205], and iii) in the primordium of fungal CMOs, although some cell types are still able to revert to vegetative growth in vitro [100, 206]. By contrast, SMOs do not develop true tissue-based body plans, owing in part to the lack of cell types with fixed identities and lineage commitment, so that dedifferentiation or transdifferentiation of cell types at any stage of development is common under the influence of certain factors. For instance, an absence of true tissue-based body plans, fate determination and stability of key cell types is observed in the sea sponge A. queenslandica (∼11 UCTs) [207, 208], and the sea placozoan T. adhaerens (∼5 UTCs) [209, 210] (but see [211, 212]). Likewise, the green algae V. carteri also lacks of a tissue-based body plan, since it only forms a colony of ∼2,000 cells with two UCTs [213].
According to these four criteria, we distinguish:
Unicelullar: is an organism exhibiting a single differentiated state cell at once across its life cycle. Single-celled organisms developing a transient pseudo-hypha or experimentally-driven filamentous growth are also included in this category.
Simple multicellular: is an organism exhibiting a facultatively or obligately transition in individuality through the organization of either several cells (with none or only few differentiated cell types) or a syncytium/coenocyte sensu amplio originated from one or more cell-line ancestors. Coenocytic and siphonous organisms structurated in pseudo-organs are also included in this category. Reversion to unicellularity may occur.
Complex multicellular: is an organism exhibiting an irreversible transition in individuality produced by tissue-based body plans, through the developmental commitment of multiple and different cell types originated from a common cell-line ancestor. Reversion to unicellularity or to a simple multicellularity lifestyle does not occur.
The lifestyle and body plan development of all species in our dataset were compiled from literature to evaluate the four criteria of our definitions, such information is available in Supplementary Table S2. We classified 77 species as unicellular, 96 species as SMOs, and 288 species as CMOs. This approach was useful to define the cellular state of some controversial model organisms. However, it still represents a challenge to distinguish between SM and CM in species within Fungi and Parazoa, whose multicellular body plans are generated from a few UCTs or are not well documented yet to evaluate the criteria described previously. Since this is a pioneer attempt to formally define CM and SM, we acknowledge that our definitions and the classification for the most controversial cases in this study are not free of future improvements and corrections.
A flowchart of the program is displayed on the left and described throughout Appendix 2. On the right, examples of exploratory figures showing some statistic descriptors for the genome of V. carteri, as obtained with the program.
Appendix
APPENDIX 2
A. Estimating intron features with GenomeContent.pl
GenomeContent was written in Perl to calculate global statistics and sequence-based estimators of genome features in six major steps, as shown in Appendix 2-Figure 1. First, the processing of gene annotations focuses on identifying coordinates from protein-coding gene (CDS), while the filtering process focuses on checking the quality of intron annotations. As described in next sections, the coordinates derived from both proceses are taken as the “reference gene sets” for introns, exons and intergenic regions to directly estimate several statistic descriptors, such as size, density and number. Then, the “reference gene sets” are projected onto the genome sequence in both strands, so that the nucleotide contents of each genome-feature are calculated according to the definitions described in a section below. Finally, all statistic descriptors obtained with the program are provided as text files, fasta formats and exploratory figures (see also Supplementary Figure S1). GenomeContent runs on an entire genome in few minutes or hours, depending on genome size and the number of annotated genes. GenomeContent is available upon request during peer-review, and will be openly available after publication.
B. Filtering of gene annotations
The filtering process of GenomeContent involves: a) identification of CDS, b) removal of small sizes, c) treatment of isoforms, and d) estimation of systematic errors in CDS. First, only genome coordinates from CDS were extracted, but their corresponding untranslated regions (UTRs) are not included because these are not fully annotated in most genome projects [214, 215]. Second, we excluded introns and exons smaller than 15 nucleotides (nts), which represent < 1% from the total number of introns and exons located within the coding genes of all genomes analyzed (as observed in Supplementary Figure S2 and Table S3). Third, alternative splice variants were kept in the data. To avoid redundant/overestimated data, however, exons with partial or full matching boundaries to exons of other transcripts were overlapped; the same rule was applied to introns. In both cases, their coordinates were joined or replaced accordingly; thus, every exon and intron is only counted once. We call this filtered set of protein-coding gene coordinates for every genome as the “reference gene set”.
Comparison of intron density estimations: absolute (aID) and normalized (nID).
Most of the collected gene annotations are based on transcript evidence. Nevertheless, we implemented the approach proposed by Roy and Penny [216] in GenomeContent to further estimate systematic errors in CDS annotations by identifying the excess/deficit of the intron-length distributions modulo 3. Since introns are not expected to respect the coding frame, intron lengths 3n, 3n + 1, and 3n + 2 should appear in similar fractions p3n ≈ p3n+1 ≈ p3n+2. As stated in [216], large values of “3n excess”, E3 = p3n - (p3n+1 + p3n+2)/2, suggest that a considerable fraction of internal exons may have been incorrectly predicted as introns or that there are several “intron retention” events. On the other hand, a deficit of 3n introns, i.e., E3 ≪ 0, suggests that a considerable fraction of 3n introns –lacking of stop codons– may have been mistaken for exons. Most gene annotations included in this study shown values of the 3n excess close to 0. Very few genomes (such as parasites and endosymbiots) were initially excluded from the present study because they exhibited high 3n excess (0.4 - 0.7) (see Supplementary Figure S2 and Table S3).
C. Estimation of average sizes and density of introns
Several statistical estimators for every genome feature were obtained with GenomeContent. Genome size is defined as the net length of nucleotides and placeholders of all sequences conforming the nuclear genome. The average feature size (e.g., intron size and exon size) of CDSs per genome was calculated in two ways. The straight average size (Af eature) is calculated as the total length of all feature sequences (exons or introns) in a genome (L f eature) divided by the total number of all feature sequences (exons or introns, respectively) in a genome (Nf eature): Af eature = L f eature/Nf eature. The straight average depends on the number of data points from the whole sample (i.e., gene models with introns), which equally contribute to the final average regardless of which gene they belong to.
If we now consider a f eature to be the average length of the respective feature (exon or intron) within one single gene: a f eature = l f eature/n f eature, then the weighted average size, ā f eature, is calculated as the mean of the a f eature values of the respective feature (exons or introns) in a genome, according to:
where n represents the total number of CDSs in a genome when calculating āexon, or the total number of intron-containing CDSs in a genome when calculating āintron [217]. The weighted average depends on the gene-structure of the genome, and thus it samples more broadly the data points that contribute, in the case of introns, to the well known skewed length distribution.
GenomeContent also estimates the abundance of introns within a genome with two different estimates to detect small changes of intron-richness and to buffer dramatic changes among updated genome releases and gene annotations. On the one hand, the abundance of introns across CDSs was estimated as the fraction of intron-containing CDSs from the total number of CDSs (% CDS), which might reflect complete intron loss from CDS structures at the genome level. On the other hand, the abundance of introns and exons within CDSs (“absolute density”) was estimated as the mean number per genome of exons in CDS (exon density), or introns per intron-containing CDSs (intron density), respectively. We employ the “absolute density” because, as observed on Appendix 2-Table 1, the average number of introns either per sequence region or from the total number of genes (“normalized density”) depends on both genome size and the number of CDSs, respectively.
For instance, the four species listed in Appendix 2-Table 1 exhibit around five introns per CDS when a “normalized” density is estimated from the total number of CDSs. However, the “absolute intron density” clearly shows that, for instance, X. laevis, T. chinensis and G. theta have indeed more introns per CDS on average than the other species, despite of having a smaller fraction of intron-containing CDSs (70.8%, 72.5% and 78.3%, respectively), and lower number of introns and CDS in some cases. Clearly, the bias observed in the “normalized intron density” is produced by larger numbers of total CDS in the genome.
D. Estimation of genome contents
GenomeContent also estimates the “feature content” of a given genome, i.e., the proportion of nucleotides of the respective genome features (intron, exon, or intergenic) that contributes to genome size. Since most genome annotations only contain protein-coding regions rather than full transcript models, we count only coding exons and introns delimited by a pair of coding exons. As shown in Appendix 2-Figure 1, the program projects the sets of genomic intervals for all exons and introns from coding genes located in the plus strand (set A), the minus strand (set B), and of the isoforms (set C). Since a given nucleotide may be classified differently for different isoforms, we used the following domincance rule in order to obtain a unique classification at the genome level:
It reflects the idea that a genomic position is exonic whenever it appears in a coding exon of at least one transcript. Thus, the exon content of a genome is calculated as the total number of nucleotides in the genome sequence that are classified as coding exon with respect to at least one isoform. Analogously, a position is classified as ‘intronic’ if it appears inside the boundaries of annotated coding exons, but it does not overlap with any coding sequence. Thus, intron content was determined as the total number of nucleotides of a genome that were classified as intronic. The CDS content of a genome is calculated as the total number of nucleotides covered by the intronic and exonic positions within coding genes. Finally, the non-coding DNA content was computed analogously as the total number of nucleotides in a genome that are not covered by exonic and intronic positions from coding genes. Genome-feature contents are reported as: total size in Megabases (Mb), fraction (%) from the total genome size, and as genomic coordinates.
REFERENCES
- [1].↵
- [2].↵
- [3].↵
- [4].↵
- [5].
- [6].↵
- [7].↵
- [8].↵
- [9].↵
- [10].
- [11].↵
- [12].↵
- [13].↵
- [14].↵
- [15].↵
- [16].
- [17].↵
- [18].↵
- [19].↵
- [20].↵
- [21].↵
- [22].
- [23].↵
- [24].↵
- [25].↵
- [26].↵
- [27].↵
- [28].↵
- [29].↵
- [30].↵
- [31].↵
- [32].↵
- [33].↵
- [34].↵
- [35].↵
- [36].↵
- [37].↵
- [38].↵
- [39].↵
- [40].↵
- [41].
- [42].
- [43].
- [44].↵
- [45].↵
- [46].↵
- [47].↵
- [48].
- [49].↵
- [50].↵
- [51].↵
- [52].↵
- [53].↵
- [54].↵
- [55].↵
- [56].↵
- [57].
- [58].↵
- [59].↵
- [60].↵
- [61].↵
- [62].↵
- [63].↵
- [64].↵
- [65].↵
- [66].↵
- [67].↵
- [68].↵
- [69].↵
- [70].↵
- [71].↵
- [72].↵
- [73].↵
- [74].↵
- [75].↵
- [76].↵
- [77].↵
- [78].↵
- [79].↵
- [80].↵
- [81].↵
- [82].↵
- [83].↵
- [84].↵
- [85].↵
- [86].↵
- [87].↵
- [88].↵
- [89].↵
- [90].↵
- [91].↵
- [92].↵
- [93].↵
- [94].
- [95].
- [96].↵
- [97].↵
- [98].↵
- [99].
- [100].↵
- [101].↵
- [102].↵
- [103].↵
- [104].
- [105].↵
- [106].↵
- [107].↵
- [108].↵
- [109].
- [110].↵
- [111].↵
- [112].↵
- [113].↵
- [114].↵
- [115].↵
- [116].↵
- [117].↵
- [118].↵
- [119].↵
- [120].↵
- [121].↵
- [122].↵
- [123].↵
- [124].↵
- [125].↵
- [126].
- [127].↵
- [128].↵
- [129].↵
- [130].↵
- [131].↵
- [132].↵
- [133].↵
- [134].↵
- [135].↵
- [136].↵
- [137].↵
- [138].↵
- [139].↵
- [140].↵
- [141].↵
- [142].↵
- [143].↵
- [144].↵
- [145].↵
- [146].↵
- [147].↵
- [148].↵
- [149].↵
- [150].↵
- [151].
- [152].↵
- [153].↵
- [154].
- [155].
- [156].↵
- [157].↵
- [158].
- [159].↵
- [160].↵
- [161].↵
- [162].↵
- [163].↵
- [164].↵
- [165].↵
- [166].↵
- [167].↵
- [168].↵
- [169].↵
- [170].↵
- [171].↵
- [172].↵
- [173].↵
- [174].↵
- [175].↵
- [176].↵
- [177].↵
- [178].↵
- [179].↵
- [180].↵
- [181].↵
- [182].↵
- [183].↵
- [184].↵
- [185].↵
- [186].↵
- [187].↵
- [188].↵
- [189].↵
- [190].↵
- [191].↵
- [192].↵
- [193].↵
- [194].↵
- [195].↵
- [196].
- [197].↵
- [198].↵
- [199].
- [200].↵
- [201].↵
- [202].↵
- [203].↵
- [204].
- [205].↵
- [206].↵
- [207].↵
- [208].↵
- [209].↵
- [210].↵
- [211].↵
- [212].↵
- [213].↵
- [214].↵
- [215].↵
- [216].↵
- [217].↵