Abstract
Molecular evolution studies, such as phylogenomic studies and genome-wide surveys of positive selection, often rely on gene families of single-copy orthologs (SC-OGs). In contrast, large gene families with multiple homologs in one or more species—a phenomenon observed among several important families of genes such as transporters and transcription factors—are often ignored because identifying and retrieving SC-OGs nested within them is challenging. To address this issue and increase the number of markers used in molecular evolution studies, we developed orthoSNAP, a software that uses a phylogenetic framework to simultaneously split gene families into SC-OGs and prune species-specific inparalogs. We term SC-OGs identified by orthoSNAP as SNAP-OGs because they are identified using a splitting and pruning procedure. From 46,645 orthologous groups of genes inferred using graph-based clustering of sequence similarity scores across four separate eukaryotic datasets, we identified 6,634 SC-OGs; using orthoSNAP on the remaining 40,011 orthologous groups of genes, we identified an additional 6,630 SNAP-OGs. Comparison of SNAP-OGs and SC-OGs revealed that their phylogenetic information content was similar. orthoSNAP is useful for increasing the number of markers used in molecular evolution data matrices, a critical step for robustly inferring and exploring the tree of life.
Introduction
Molecular evolution studies, such as species tree inference, genome-wide surveys of positive selection, evolutionary rate estimation, measures of gene-gene coevolution, and others typically rely on single-copy orthologs (SC-OGs), a group of homologous genes that originated via speciation and are present in single-copy among species of interest (Rokas et al., 2003; Jeffares et al., 2015; Li et al., 2017; Wu et al., 2017; Dong et al., 2019; Steenwyk et al., 2021c). In contrast, paralogs, homologous genes that originated via duplication and are often members of large gene families, are typically absent from these studies (Fig 1). Gene families of orthologs and paralogs often encode functionally significant proteins such as transcription factors, transporters, and olfactory receptors (Ozcan and Johnston, 1999; Malnic et al., 2004; Wingender et al., 2013; Niimura et al., 2014). The exclusion of SC-OGs from gene families has not only hindered our understanding of their evolution and phylogenetic informativeness but is also artificially reducing the number of gene markers available for molecular evolution studies. Furthermore, as the number of species and / or their evolutionary divergence increases in a data set, the number of SC-OGs decreases (Emms and Kelly, 2018; Thomas et al., 2020); case in point, no SC-OGs were identified in a dataset of 42 plants (Emms and Kelly, 2018). As the number of available genomes across the tree of life continues to increase, our ability to identify SC-OGs present in many taxa will become more challenging.
(A) Paralogs refer to related genes that have originated via gene duplication, such as genes M, N, and O.
(B) Outparalogs and inparalogs refer to paralogs that are related to one another via a duplication event that took place prior to or after a speciation event, respectively. With respect to the speciation event that led to the split of taxa A, B, and C from D, genes M, N, and O are outparalogs because they arose prior to the speciation event; genes O1 and O2 in taxa A, B, and C are inparalogs because they arose after the speciation event. Species-specific inparalogs are paralogous genes observed only in one taxon in a dataset, such as gene N1 and N2 in taxon A. Species-specific inparalogs N1 and N2 in taxon A are also coorthologs of gene N in taxa B, C, and D; the same is true for inparalogs O1 and O2 from taxon A, which are coorthologs of gene O from taxon D.
In light of these issues, several methods have been developed to account for paralogs in specific types of molecular evolution studies—for example, in species tree reconstruction (Smith and Hahn, 2021). Methods such as SpeciesRax, STAG, ASTRAL-PRO, and DISCO can be used to infer a species tree from a set of SC-OGs and gene families composed of orthologs and paralogs (Emms and Kelly, 2018; Zhang et al., 2020; Morel et al., 2021; Willson et al., 2021). Other methods such as PHYLDOG (Boussau et al., 2013) and guenomu (de Oliveira Martins and Posada, 2017) jointly infer the species and gene trees, but require abundant computational resources, which has hindered their use for large datasets. Other software, such as PhyloTreePruner (Kocot et al., 2013), can conduct species-specific inparalog trimming, whereas Agalma (Dunn et al., 2013), as part of a larger automated phylogenomic workflow, can prune gene trees into maximally inclusive subtrees wherein each taxon is represented by one sequence. Although these methods have expanded the numbers of gene markers used in species tree reconstruction, they were not designed to facilitate the retrieval of as broad a set of SC-OGs as possible for downstream molecular evolution studies such as surveys of positive selection. Furthermore, the phylogenetic information content of these gene families remains unknown.
To address this need, we developed orthoSNAP, a novel tree traversal algorithm that conducts tree splitting and species-specific inparalog pruning to identify SC-OGs nested within larger gene families. We term SC-OGs identified by orthoSNAP as SNAP-OGs because they were retrieved using a splitting and pruning procedure. orthoSNAP takes as input a gene family phylogeny and associated FASTA file and will output individual FASTA files populated with sequences from SNAP-OGs (Fig 2). During tree traversal, tree uncertainty can be accounted for by orthoSNAP via collapsing poorly supported branches. In a set of four eukaryotic datasets that contained 6,634 SC-OGs, we used orthoSNAP to identify an additional 6,630 SNAP-OGs. Using a combination of multivariate statistics and phylogenetic measures, we demonstrate that SNAP-OGs and SC-OGs have similar phylogenetic information content in all four datasets. We also observed similar patterns of support among SNAP-OGs and SC-OGs in a contentious branch in the tree of life. Taken together, these results suggest that orthoSNAP is helpful for expanding the set of gene markers available for molecular evolutionary studies.
(A) orthoSNAP takes as input two files: a FASTA file of a gene family with multiple homologs observed in one or more species and the associated gene family tree. The outputted file(s) will be individual FASTA files of SNAP-OGs. (B) A cartoon phylogenetic tree that depicts the evolutionary history of a gene family and five SNAP-OGs therein. While identifying SNAP-OGs, orthoSNAP also identifies and prunes species-specific inparalogs (e.g., species2|gene2-copy_0 and species2|gene2-copy_1), retaining only the inparalog with the longest sequence, a practice common in transcriptomics. Note that orthoSNAP requires sequence names to be identical in both input files and to follow the convention in which a taxon identifier and a gene identifier are separated by a pipe (or vertical bar; “|”) character.
Results
SC-OGs and SNAP-OGs have similar information content
To compare SC-OGs and SNAP-OGs, we first independently inferred orthologous groups of genes in three eukaryotic datasets of 24 budding yeasts, 36 filamentous fungi, and 26 mammals (S1 Fig; Table S1). There was variation in the number of SC-OGs and SNAP-OGs in each lineage (S2 Fig; Table S2). Interestingly, the ratio of SNAP-OGs : SC-OGs among budding yeasts, filamentous fungi, and mammals was 0.83, 0.46, and 5.53, respectively, indicating SNAP-OGs can substantially increase the number of gene markers in certain lineages. The number of SNAP-OGs identified in a gene family with multiple homologs in one or more species also varied (S3 Fig).
Similar taxon occupancy and best fitting models of substitutions were observed among SC-OGs and SNAP-OGs (S4 Fig; Table S3), raising the question of whether SC-OGs and SNAP-OGs have similar information content. To answer this, we calculated nine properties of phylogenetic information content from multiple sequence alignments and phylogenetic trees from SC-OGs and SNAP-OGs (S5 Fig; Table S4) and compared them using multivariate analysis and statistics as well as information theory-based phylogenetic measures. Principal component analysis enabled qualitative comparisons between SC-OGs and SNAP-OGs in reduced dimensional space and revealed a striking similarity (Fig 3, S6 Fig). Multivariate statistics, namely multi-factor analysis of variance, facilitated a quantitative comparison of SC-OGs and SNAP-OGs and revealed no difference between SC-OGs and SNAP-OGs (p = 0.63, F = 0.23, df = 1; Table S5) and no interaction between the nine properties and SC-OGs and SNAP-OGs (p = 0.16, F = 1.46, df = 8). Similarly, multi-factor analysis of variance using an additive model, which assumes each factor is independent and there are no interactions, also revealed no differences between SC-OGs and SNAP-OGs (p = 0.65, F = 0.21, df = 1). Next, we calculated tree certainty, an information theory-based measure of tree congruence from a set of gene trees, and found similar levels of congruence among phylogenetic trees inferred from SC-OGs and SNAP-OGs (Table S6). Taken together, these analyses demonstrate that SC-OGs and SNAP-OGs have similar phylogenetic information content.
To evaluate similarities and differences between SC-OGs (orange dots) and SNAP-OGs (blue dots), we examined each gene’s phylogenetic information content by measuring nine properties of multiple-sequence alignments and phylogenetic trees. We performed these analyses on 12,764 gene families from three datasets: 24 budding yeasts (1,668 SC-OGs and 1,392 SNAP-OGs) (A), 36 filamentous fungi (4,393 SC-OGs and 2,035 SNAP-OGs) (B), and 26 mammals (321 SC-OGs and 1,775 SNAP-OGs) (C). Principal component analysis revealed striking similarities between SC-OGs and SNAP-OGs in all three datasets. For example, the centroids (i.e., the mean across all metrics and genes) for SC-OGs and SNAP-OGs, which are depicted as larger, opaque dots, are very close to one another in reduced dimensional space. Supporting this observation, multi-factor analysis of variance with interaction effects of the 6,630 SNAP-OGs and 6,634 SC-OGs revealed no difference between SC-OGs and SNAP-OGs (p = 0.63, F = 0.23, df = 1) and no interaction between the nine properties and SC-OGs and SNAP-OGs (p = 0.16, F = 1.46, df = 8). Multi-factor analysis of variance using an additive model yielded similar results wherein SC-OGs and SNAP-OGs do not differ (p = 0.65, F = 0.21, df = 1). There are also very few outliers among individual SC-OGs and SNAP-OGs, which are represented as translucent dots, in all three panels. For example, SNAP-OG outliers at the top of panel C are driven by high treeness/RCV values, which are associated with a high signal-to-noise ratio and/or low composition bias (Phillips and Penny, 2003); SNAP-OG outliers at the right of panel C are driven by high average bootstrap support values, which are associated with greater tree certainty (Salichos and Rokas, 2013); and the single SC-OG outlier observed in the bottom right of panel C is driven by a SC-OG with a high degree of violation of a molecular clock (Song et al., 2012), which is associated with lower tree certainty (Doyle et al., 2015).
Multiple-sequence alignment and phylogenetic tree properties used in principal component analysis and abbreviations thereof are as follows: average bootstrap support (ABS), degree of violation of the molecular clock (DVMC), relative composition variability (RCV), Robinson-Foulds distance (RF distance), alignment length (Aln. len.), the number of parsimony informative sites (PI sites), saturation, treeness (tness), and treeness/RCV (tness/RCV).
SC-OGs and SNAP-OGs have similar patterns of support in a contentious branch in the tree of life
To further compare SC-OGs and SNAP-OGs, we investigated patterns of support in a difficult-to-resolve branch in the tree of life. More specifically, we evaluated the support between two leading hypotheses concerning deep evolutionary relationships among placental mammals: (1) Xenarthra (placental mammals from the Americas) and Afrotheria (placental mammals from Africa) are sister to all other Eutheria (Hallström et al., 2007; Wildman et al., 2007) or (2) Afrotheria are sister to all other Eutheria (Murphy, 2001; Murphy et al., 2001) (Fig 4A). Resolution of this conflict has important implications for understanding the historical biogeography of these organisms. To do so, we first obtained protein-coding gene sequences from six Afrotheria, two Xenarthra, 12 other Eutheria, and eight outgroup taxa from NCBI (S7 Fig; Table S7), which represent all annotated and publicly available genome assemblies at the time of this study (Table S8). Using the protein translations of these gene sequences as input to OrthoFinder, we identified 252 SC-OGs shared across taxa; application of orthoSNAP identified an additional 1,428 SNAP-OGs, which represents a greater than five-fold increase in the number of gene markers for this dataset (Table S8). There was variation in the number of SNAP-OGs identified per orthologous group of genes (S8 Fig). The highest number of SNAP-OGs identified in an orthologous group of genes was 10, in a gene family of olfactory receptors, which are known to have expanded in the evolutionary history of placental mammals (Niimura et al., 2014). The best fitting substitution models were similar between SC-OGs and SNAP-OGs (S9 Fig).
(A) Two leading hypotheses for the evolutionary relationships among Eutheria, which have implications for the evolution and biogeography of the clade, are that Afrotheria and Xenarthra are sister to all other Eutheria (hypothesis one; blue) and that Afrotheria are sister to all other Eutheria (hypothesis two; pink). (B) Comparison of gene support frequency (GSF) values for hypothesis one, hypothesis two, and a third hypothesis (Xenarthra as sister to all other Eutheria; yellow) among 252 SC-OGs and 1,428 SNAP-OGs using an α level of 0.01 revealed no differences in support (p = 0.26, Fisher’s exact test with Benjamini-Hochberg multi-test correction). Comparison after accounting for gene tree uncertainty by collapsing bipartitions with lower than 75 ultrafast bootstrap approximation support (SC-OGs collapsed vs. SNAP-OGs collapsed) also revealed no differences (p = 0.05; Fisher’s exact test with Benjamini-Hochberg multi-test correction). (C) Examination of the distribution of gene-wise log-likelihood scores (ΔGLS) revealed no difference between SNAP-OGs and SC-OGs (p = 0.39; Wilcoxon rank sum test). ΔGLS values greater than zero are indicative of genes with greater support for hypothesis one; values less than zero are indicative of genes with greater support for hypothesis two.
Two independent tests examining support between alternative hypotheses of deep evolutionary relationships among placental mammals revealed similar patterns of support between SC-OGs and SNAP-OGs. More specifically, no differences were observed in gene support frequencies (the number of genes that support one of three possible hypotheses at a given branch in a phylogeny) without or with accounting for single-gene tree uncertainty by collapsing branches with low support values (p = 0.26 and p = 0.05, respectively; Fisher’s exact test with Benjamini-Hochberg multi-test correction; Fig 4B; Table S9). We next conducted a second test of single-gene support for hypothesis one or hypothesis two by measuring gene-wise log-likelihood scores (ΔGLS), which is the difference in the log-likelihood score of a single gene when constrained to the topologies of the two hypotheses. In this case, positive ΔGLS values reflect greater support for hypothesis one and negative ΔGLS values reflect greater support for hypothesis two. No difference was observed in the distribution of ΔGLS values (p = 0.39; Wilcoxon rank sum test). Examination of patterns of support in a contentious branch in the tree of life using two independent tests revealed that SC-OGs and SNAP-OGs are similar, further supporting the observation that they contain similar phylogenetic information.
In summary, 46,645 orthologous groups of genes across four datasets contained 6,634 SC-OGs; application of orthoSNAP identified an additional 6,630 SNAP-OGs, doubling the number of gene markers. Comprehensive comparison of SC-OGs and SNAP-OGs revealed no differences in phylogenetic information content. Strikingly, this observation held true when conducting hypothesis testing in a difficult-to-resolve branch in the tree of life. These findings suggest that SNAP-OGs may be useful for molecular evolution analyses such as genome-wide surveys of positive selection, phylogenomics, and gene-gene coevolution analysis, among others.
Discussion
Molecular evolution studies typically rely on SC-OGs. Recently developed methods can integrate gene families of orthologs and paralogs into species tree inference but are not designed to broadly facilitate the retrieval of gene markers for molecular evolution analyses. Furthermore, the phylogenetic information content of gene families of orthologs and paralogs remains unknown. This observation underscores the need for algorithms that can identify SC-OGs nested within larger gene families, which can in turn be incorporated into diverse molecular evolution analyses, and for a comprehensive assessment of their phylogenetic properties.
To address this need, we developed orthoSNAP, a tree splitting and pruning algorithm that identifies SNAP-OGs, which refers to SC-OGs nested within larger gene families wherein species-specific inparalogs have also been pruned. Comprehensive examination of the phylogenetic information content of SNAP-OGs and SC-OGs from four empirical datasets of diverse eukaryotic species revealed striking similarities. In certain datasets, SNAP-OGs were five times more prevalent than SC-OGs, indicating SNAP-OGs can substantially increase the size of molecular evolution datasets. We note that our results are qualitatively similar to those reported recently by Smith et al. (2021), who retrieved SC-OGs nested within larger families from 26 primates and examined their performance in gene tree and species tree inference. Three noteworthy differences are that we also conduct species-specific inparalog trimming, provide a user-friendly command-line software for SNAP-OG identification, and evaluate the phylogenetic information content of SNAP-OGs and SC-OGs across four diverse datasets. We also note that our algorithm can account for diverse types of paralogy (outparalogs, inparalogs, and species-specific inparalogs), whereas other software like PhyloTreePruner, which conducts species-specific inparalog trimming (Kocot et al., 2013), and Agalma, which identifies single-copy outparalogs and inparalogs (Dunn et al., 2013), can account for some, but not all, types of paralogs. Our results, together with other studies, demonstrate the utility of SC-OGs that are nested within larger families (van der Heijden et al., 2007; Dunn et al., 2013; Smith et al., 2021; Willson et al., 2021).
Next, we discuss some practical considerations when using orthoSNAP. In the present study, we inferred orthology information using OrthoFinder (Emms and Kelly, 2019), but several other approaches can be used upstream of orthoSNAP. For example, other graph-based algorithms such as OrthoMCL (Li et al., 2003) or sequence similarity-based algorithms such as orthofisher (Steenwyk and Rokas, 2021), can be used to infer gene families. Similarly, sequence similarity search algorithms like BLAST+ (Camacho et al., 2009), USEARCH (Edgar, 2010), and HMMER (Eddy, 2011), can be used to retrieve homologous sets of sequences that are used as input for orthoSNAP.
We suggest employing “best practices” when inferring groups of putatively orthologous genes, including SNAP-OGs. Specifically, orthology information can be further scrutinized using phylogenetic methods. Orthology inference errors may occur upstream of orthoSNAP; for example, SNAP-OGs may be susceptible to erroneous inference of orthology during upstream clustering of putatively orthologous genes. One method to identify putatively spurious orthology inference is by identifying long terminal branches (Shen et al., 2018). Terminal branches of outlier length can be identified using the “spurious_sequence” function in PhyKIT (Steenwyk et al., 2021b). Other tools, such as PhyloFisher, UPhO, and other orthology inference pipelines employ similar strategies to refine orthology inference (Yang and Smith, 2014; Ballesteros and Hormiga, 2016; Tice et al., 2021).
Taken together, we suggest that orthoSNAP is useful for retrieving single-copy orthologous groups of genes from gene family data and that the identified SNAP-OGs have similar phylogenetic information content compared to SC-OGs. In combination with other phylogenomic toolkits, orthoSNAP may be helpful for reconstructing the tree of life and expanding our understanding of the tempo and mode of evolution therein.
Methods
orthoSNAP availability and documentation
orthoSNAP is a command-line software written in the Python programming language (https://www.python.org/) and requires Biopython (Cock et al., 2009) and NumPy (Van Der Walt et al., 2011). orthoSNAP is available under the MIT license from GitHub (https://github.com/JLSteenwyk/orthosnap), PyPI (https://pypi.org/project/orthosnap), and the Anaconda cloud (https://anaconda.org/JLSteenwyk/orthosnap). The documentation describes the orthoSNAP algorithm and its parameters and provides user tutorials (https://jlsteenwyk.com/orthosnap).
orthoSNAP algorithm description and usage
We next describe how orthoSNAP identifies SNAP-OGs. orthoSNAP requires two files as input: one is a FASTA file that contains two or more homologous sequences in one or more species and the other is the corresponding gene family phylogeny in Newick format. In both the FASTA and Newick files, users must follow a naming scheme, wherein taxon identifiers and gene sequence identifiers are separated by a vertical bar (also known as a pipe character or “|”), which allows orthoSNAP to determine which sequences were encoded in the genome of each taxon. After initiating orthoSNAP, the gene family phylogeny is first mid-point rooted and then SNAP-OGs are identified using a tree-traversal algorithm. To do so, orthoSNAP will loop through the internal branches in the gene family phylogeny and evaluate the number of distinct taxon identifiers among descendant terminal branches. If the number of unique taxon identifiers is greater than or equal to the taxon occupancy threshold (default: 50% of total taxa in the inputted phylogeny; users can specify an integer threshold), then all descendant branches and termini are examined further; otherwise, orthoSNAP will examine the next internal branch. Next, orthoSNAP will collapse branches with low support (default: 80, which is motivated by using ultrafast bootstrap approximations (Hoang et al., 2018) to evaluate bipartition support; users can specify an integer threshold) and conduct species-specific inparalog trimming wherein the longest sequence is maintained, a practice common in transcriptomics. Species-specific inparalogs are defined as sequences from the same taxon that are sister to one another or belong to the same polytomy (Kocot et al., 2013). The resulting set of taxa and sequences is examined to determine whether each taxon is represented by one sequence and to ensure these sequences have not yet been assigned to a SNAP-OG. If so, they are considered a SNAP-OG; if not, orthoSNAP will examine the next internal branch.
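As an illustration only, the traversal described above can be sketched in a few lines of Python. This is a simplified sketch and not the orthoSNAP implementation: it assumes the tree is already rooted and is given as nested tuples of “taxon|gene” labels, it omits mid-point rooting and the collapsing of poorly supported branches, and it assumes a root-first visiting order and a round-up rule for the default 50% occupancy threshold. All function names are hypothetical.

```python
import math
from typing import Dict, Iterator, List, Optional, Set, Tuple, Union

# A tree is either a tip label ("taxon|gene") or a tuple of child trees.
Tree = Union[str, Tuple["Tree", ...]]


def taxon_of(label: str) -> str:
    """Return the taxon identifier, i.e. the text before the first pipe."""
    return label.split("|", 1)[0]


def tips(node: Tree) -> List[str]:
    """List all tip labels below a node."""
    if isinstance(node, str):
        return [node]
    return [tip for child in node for tip in tips(child)]


def internal_nodes(node: Tree) -> Iterator[Tree]:
    """Yield internal nodes from the root downwards (order is an assumption)."""
    if not isinstance(node, str):
        yield node
        for child in node:
            yield from internal_nodes(child)


def trim_inparalogs(node: Tree, seq_len: Dict[str, int]) -> Tree:
    """Among sister tips from the same taxon, keep only the longest sequence."""
    if isinstance(node, str):
        return node
    longest: Dict[str, str] = {}
    subtrees: List[Tree] = []
    for child in (trim_inparalogs(c, seq_len) for c in node):
        if not isinstance(child, str):
            subtrees.append(child)
            continue
        taxon = taxon_of(child)
        if taxon not in longest or seq_len[child] > seq_len[longest[taxon]]:
            longest[taxon] = child
    kept = list(longest.values()) + subtrees
    return kept[0] if len(kept) == 1 else tuple(kept)


def find_snap_ogs(tree: Tree, seq_len: Dict[str, int],
                  occupancy: Optional[int] = None) -> List[List[str]]:
    """Return groups of tips in which every taxon is present exactly once."""
    if occupancy is None:
        n_taxa = len({taxon_of(t) for t in tips(tree)})
        occupancy = math.ceil(0.5 * n_taxa)  # rounding rule is an assumption
    assigned: Set[str] = set()
    groups: List[List[str]] = []
    for node in internal_nodes(tree):
        if len({taxon_of(t) for t in tips(node)}) < occupancy:
            continue  # too few taxa: examine next internal branch
        kept = tips(trim_inparalogs(node, seq_len))
        taxa = [taxon_of(t) for t in kept]
        if len(taxa) == len(set(taxa)) and assigned.isdisjoint(kept):
            groups.append(sorted(kept))
            assigned.update(kept)
    return groups


# Hypothetical gene family: an ancient duplication (g1 vs. g2) plus a
# species-specific duplication of g1 in taxon A.
clade1 = ((("A|g1a", "A|g1b"), "B|g1"), ("C|g1", "D|g1"))
clade2 = (("A|g2", "B|g2"), ("C|g2", "D|g2"))
lengths = {label: 300 for label in tips((clade1, clade2))}
lengths["A|g1b"] = 280  # shorter inparalog, expected to be pruned
groups = find_snap_ogs((clade1, clade2), lengths)
```

In this toy example the full family is not single-copy, so the sketch descends into the two subclades and reports each as its own group, with the shorter of the two taxon A inparalogs pruned from the first.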
The orthoSNAP algorithm is also described using the following pseudocode:
FOR internal branch in midpoint rooted gene family phylogeny:
> IF taxon occupancy among children termini is greater than or equal to taxon occupancy threshold;
>> Collapse poorly supported bipartitions and trim species-specific inparalogs;
>> IF each taxon among the trimmed set of taxa is represented by only one sequence and no sequences being examined have been assigned to a SNAP-OG yet;
>>> Sequences represent a SNAP-OG and are outputted to a FASTA file
>> ELSE
>>> examine next internal branch
> ELSE
>> examine next internal branch
ENDFOR
To enhance the user experience, arguments or default values are printed to the standard output, a progress bar informs the user of how much of the analysis has been completed, and the number of SNAP-OGs identified as well as the names of the outputted FASTA files are printed to the standard output.
Development practices and design principles to ensure long-term software stability
Archival instabilities among software threaten the reproducibility of bioinformatics research (Mangul et al., 2019). To ensure long-term stability of orthoSNAP, we implemented previously established rigorous development practices and design principles (Steenwyk et al., 2020, 2021b, 2021a; Steenwyk and Rokas, 2021). For example, orthoSNAP features a refactored codebase, which facilitates debugging, testing, and future development. We also implemented a continuous integration pipeline to automatically build, package, and install orthoSNAP across Python versions 3.8 and 3.9. The continuous integration pipeline also conducts 29 unit and integration tests, which span 95.92% of the codebase and ensure faithful function of orthoSNAP.
Dataset generation
To generate a dataset for identifying SNAP-OGs and comparing them to SC-OGs, we first identified putative groups of orthologous genes across four empirical datasets. To do so, we first downloaded proteomes for each dataset, which were obtained from publicly available repositories on NCBI (S1 and S7 Fig; Table S1 and S7) or figshare (Shen et al., 2018). Each dataset varied in its sampling of sequence diversity and in the evolutionary divergence of the sampled taxa. The dataset of 24 budding yeasts spans approximately 275 million years of evolution (Shen et al., 2018); the dataset of 36 filamentous fungi spans approximately 94 million years of evolution (Steenwyk et al., 2019); the dataset of 26 mammals spans approximately 160 million years of evolution (Tarver et al., 2016); and the dataset of 28 placental mammals—which was used to study the contentious deep evolutionary relationships among placental mammals—concerns an ancient divergence that occurred approximately 160 million years ago (Luo et al., 2011). Putatively orthologous groups of genes were identified using OrthoFinder, v2.3.8 (Emms and Kelly, 2019), with default parameters, which resulted in 46,645 orthologous groups of genes with at least 50% taxon occupancy (Table S8).
To infer the evolutionary history of each orthologous group of genes, we first individually aligned and trimmed each group of sequences using MAFFT, v7.402 (Katoh and Standley, 2013), with the “auto” parameter and ClipKIT, v1.1.3 (Steenwyk et al., 2020), with the “smart-gap” parameter, respectively. Thereafter, we inferred the best-fitting substitution model using Bayesian information criterion and evolutionary history of each orthologous group of genes using IQ-TREE2, v2.0.6 (Minh et al., 2020). Bipartition support was examined using 1,000 ultrafast bootstrap approximations (Hoang et al., 2018).
To identify SNAP-OGs, the FASTA file and associated phylogenetic tree for each gene family with multiple homologs in one or more species were used as input for orthoSNAP, v0.0.1 (this study). Across 40,011 gene families with multiple homologs in one or more species in all datasets, we identified 6,630 SNAP-OGs with at least 50% taxon occupancy (S2 Fig; Table S8). Unaligned sequences of SNAP-OGs were then individually aligned and trimmed using the same strategy as described above. To determine which gene families were SC-OGs, we identified orthologous groups of genes with at least 50% taxon occupancy in which each taxon was represented by only one sequence; 6,634 orthologous groups of genes were SC-OGs.
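The SC-OG criterion stated above (at least 50% taxon occupancy and one sequence per taxon) can be expressed as a short predicate. The following is a minimal sketch, assuming sequence labels follow the “taxon|gene” convention; the function name `is_sc_og` and the example labels are hypothetical and are not taken from OrthoFinder or orthoSNAP.

```python
from collections import Counter
from typing import Iterable


def is_sc_og(labels: Iterable[str], n_taxa_total: int,
             min_occupancy: float = 0.5) -> bool:
    """True if enough taxa are present and none has more than one sequence."""
    # Count sequences per taxon; the taxon is the text before the first pipe.
    copies = Counter(label.split("|", 1)[0] for label in labels)
    if not copies:
        return False
    occupancy = len(copies) / n_taxa_total
    return occupancy >= min_occupancy and max(copies.values()) == 1


# Hypothetical orthologous groups from a dataset of four taxa.
single_copy = is_sc_og(["A|g1", "B|g1", "C|g1"], n_taxa_total=4)
multi_copy = is_sc_og(["A|g1", "A|g2", "B|g1"], n_taxa_total=4)
low_occupancy = is_sc_og(["A|g1"], n_taxa_total=4)
```

Groups that fail only the single-copy condition, like the second example, are the ones that would be passed on to orthoSNAP.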
Measuring and comparing information content among SC-OGs and SNAP-OGs
To compare the information content of SC-OGs and SNAP-OGs, we calculated nine properties of multiple sequence alignments and phylogenetic trees associated with robust phylogenetic signal in the budding yeasts, filamentous fungi, and mammalian datasets (Table S4). More specifically, we calculated information content from phylogenetic trees such as measures of tree certainty (average bootstrap support), accuracy (Robinson-Foulds distance (Robinson and Foulds, 1981)), signal-to-noise ratios (treeness (Phillips and Penny, 2003)), and violation of clock-like evolution (degree of violation of a molecular clock or DVMC (Liu et al., 2017)). Information content was also measured among multiple sequence alignments by examining alignment length and the number of parsimony-informative sites, which are associated with robust and accurate inferences of evolutionary histories (Shen et al., 2016) as well as biases in sequence composition (RCV (Phillips and Penny, 2003)). Lastly, information content was also evaluated using metrics that consider characteristics of phylogenetic trees and multiple sequence alignments such as the degree of saturation, which refers to multiple substitutions in multiple sequence alignments that underestimate the distance between two taxa (Philippe et al., 2011), and treeness / RCV, a measure of signal-to-noise ratios in phylogenetic trees and sequence composition biases (Phillips and Penny, 2003). For tree accuracy, phylogenetic trees were compared to species trees reported in previous studies (Tarver et al., 2016; Shen et al., 2018; Steenwyk et al., 2019). All properties were calculated using functions in PhyKIT, v1.1.2 (Steenwyk et al., 2021b). The function used to calculate each metric and additional information are described in Table S4.
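To make two of the alignment-based properties concrete, the sketch below computes alignment length and the number of parsimony-informative sites, using the standard definition that a site is parsimony informative when at least two character states each occur in at least two sequences. It is an illustrative sketch rather than the PhyKIT implementation; the function names are hypothetical, and treating only “-” and “?” as missing data is an assumption.

```python
from collections import Counter
from typing import Dict


def alignment_length(alignment: Dict[str, str]) -> int:
    """Return the number of columns, checking that all rows are equally long."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("sequences must be aligned to the same length")
    return lengths.pop()


def parsimony_informative_sites(alignment: Dict[str, str],
                                missing: str = "-?") -> int:
    """Count columns with >= 2 states that each occur in >= 2 sequences."""
    alignment_length(alignment)  # validates the input
    n_informative = 0
    for column in zip(*alignment.values()):
        states = Counter(c.upper() for c in column if c not in missing)
        if sum(1 for n in states.values() if n >= 2) >= 2:
            n_informative += 1
    return n_informative


# Hypothetical four-taxon protein alignment.
aln = {
    "A|g1": "MKVLA",
    "B|g1": "MKVLS",
    "C|g1": "MRVIS",
    "D|g1": "MRVIS",
}
aln_len = alignment_length(aln)
pi_sites = parsimony_informative_sites(aln)
```

Here the second and fourth columns are informative (two states, each seen twice), whereas the last column is variable but uninformative because one state occurs only once.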
Principal component analysis across the nine properties that summarize phylogenetic information content was used to qualitatively compare SC-OGs and SNAP-OGs in reduced dimensional space. Principal component analysis, visualization, and determination of property contribution to each principal component were conducted using factoextra, v1.0.7 (Kassambara and Mundt, 2017), and FactoMineR, v2.4 (Lê et al., 2008), in the R, v4.0.2 (https://cran.r-project.org/), programming environment. A multi-factor ANOVA was used to quantitatively compare SC-OGs and SNAP-OGs using the aov() function in R.
Information theory-based approaches were used to evaluate incongruence among SC-OG and SNAP-OG phylogenetic trees. More specifically, we calculated tree certainty and tree certainty-all (Salichos and Rokas, 2013; Salichos et al., 2014; Kobert et al., 2016), which are conceptually similar to entropy values and are derived from examining support among a set of gene trees and the two most supported topologies or all topologies that occur with a frequency of ≥5%, respectively. More simply, tree certainty values range from 0 to 1, in which low values are indicative of low congruence among gene trees and high values are indicative of high congruence among gene trees. Tree certainty and tree certainty-all values were calculated using RAxML, v8.2.10 (Stamatakis, 2014).
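For intuition, the entropy-like quantity underlying tree certainty can be sketched as follows. For a single internode, internode certainty compares the frequency of the most prevalent bipartition with that of its most prevalent conflicting bipartition, IC = log2(2) + p1·log2(p1) + p2·log2(p2), and a relative tree certainty can be obtained by averaging IC across internodes. This is a simplified illustration, not the RAxML implementation: it handles only the two-bipartition case, ignores the sign convention applied when the conflicting bipartition is the more frequent one, and uses hypothetical function names and frequencies.

```python
import math
from typing import Iterable, Tuple


def internode_certainty(freq_main: float, freq_conflict: float) -> float:
    """IC for one internode from the two most prevalent bipartitions."""
    total = freq_main + freq_conflict
    if total <= 0:
        raise ValueError("at least one bipartition must be observed")
    ic = 1.0  # log2(2), the maximum entropy for two alternatives
    for freq in (freq_main, freq_conflict):
        p = freq / total
        if p > 0:  # by convention, 0 * log2(0) contributes nothing
            ic += p * math.log2(p)
    return ic


def relative_tree_certainty(pairs: Iterable[Tuple[float, float]]) -> float:
    """Mean IC across internodes, bounded between 0 and 1 in this sketch."""
    values = [internode_certainty(main, conflict) for main, conflict in pairs]
    return sum(values) / len(values)


# Hypothetical counts of gene trees supporting each bipartition.
ic_unanimous = internode_certainty(100, 0)
ic_split = internode_certainty(50, 50)
ic_partial = internode_certainty(75, 25)
rtc = relative_tree_certainty([(100, 0), (50, 50)])
```

An internode supported by all gene trees yields a value of 1, whereas an internode with evenly split support yields 0, which matches the interpretation of low values as low congruence.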
To examine patterns of support in a contentious branch concerning deep evolutionary relationships among placental mammals, we calculated gene support frequencies and ΔGLS. Gene support frequencies were calculated using the “polytomy_test” function in PhyKIT, v1.1.2 (Steenwyk et al., 2021b). To account for uncertainty in gene tree topology, we also examined patterns of gene support frequencies after collapsing bipartitions with less than 75 ultrafast bootstrap approximation support using the “collapse” function in PhyKIT. To calculate ΔGLS, partition log-likelihoods were calculated using the “wpl” parameter in IQ-TREE2 (Minh et al., 2020), which required as input a phylogeny in Newick format that represented either hypothesis one or hypothesis two (Fig 4A) and a concatenated alignment of SC-OGs and SNAP-OGs with partition information. Thereafter, gene-wise log-likelihood scores associated with hypothesis two were subtracted from gene-wise log-likelihood scores associated with hypothesis one. The resulting score is ΔGLS wherein values greater than zero support hypothesis one and values less than zero support hypothesis two.
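The subtraction that yields ΔGLS can be illustrated with a short sketch. It assumes the per-gene log-likelihoods under each constrained topology have already been parsed into dictionaries keyed by gene name; the function names, gene names, and values below are hypothetical and for illustration only.

```python
from typing import Dict


def delta_gls(lnl_h1: Dict[str, float],
              lnl_h2: Dict[str, float]) -> Dict[str, float]:
    """Per-gene lnL under hypothesis one minus lnL under hypothesis two."""
    shared = lnl_h1.keys() & lnl_h2.keys()
    return {gene: lnl_h1[gene] - lnl_h2[gene] for gene in sorted(shared)}


def favored_hypothesis(value: float) -> str:
    """Interpret the sign of a single ΔGLS value."""
    if value > 0:
        return "hypothesis one"
    if value < 0:
        return "hypothesis two"
    return "neither"


# Hypothetical gene-wise log-likelihood scores (not real data).
lnl_one = {"geneA": -1200.5, "geneB": -980.25, "geneC": -1500.0}
lnl_two = {"geneA": -1203.0, "geneB": -979.0, "geneC": -1500.0}

scores = delta_gls(lnl_one, lnl_two)
calls = {gene: favored_hypothesis(v) for gene, v in scores.items()}
```

The resulting distributions of ΔGLS values for SC-OGs and SNAP-OGs can then be compared with a rank-based test such as the Wilcoxon rank sum test.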
Data Availability
Results and data are available from figshare (doi: 10.6084/m9.figshare.16875904) – the link will become accessible upon publication.
Conflict of Interest
A.R. is a scientific consultant for LifeMine Therapeutics, Inc.
Acknowledgements
We thank the Rokas lab for helpful discussion and feedback. J.L.S. and A.R. were funded by the Howard Hughes Medical Institute through the James H. Gilliam Fellowships for Advanced Study program.
Research in A.R.’s lab is supported by grants from the National Science Foundation (DEB-1442113 and DEB-2110404), the National Institutes of Health/National Institute of Allergy and Infectious Diseases (R56 AI146096), and the Burroughs Wellcome Fund.