Abstract
The nearly-neutral theory predicts specific relations between effective population size (Ne) and patterns of divergence and polymorphism, which depend on the shape of the distribution of fitness effects (DFE) of new mutations. However, testing these relations is not straightforward, owing to the difficulty in estimating Ne. Here, we introduce an integrative framework allowing for an explicit reconstruction of the phylogenetic history of Ne, thus leading to a quantitative test of the nearly-neutral theory and an estimation of the allometric scaling of the ratios of non-synonymous over synonymous polymorphism (πN /πS) and divergence (dN/dS) with respect to Ne. As an illustration, we applied our method to primates, for which the nearly-neutral predictions were mostly verified. Under a purely nearly-neutral model with a constant DFE across species, we find that the variation in πN /πS and dN/dS as a function of Ne is too large to be compatible with current estimates of the DFE based on site frequency spectra. The reconstructed history of Ne shows a ten-fold variation across primates. The mutation rate per generation u, also reconstructed over the tree by the method, varies over a three-fold range and is negatively correlated with Ne. As a result of these opposing trends for Ne and u, variation in πS is intermediate, primarily driven by Ne but substantially influenced by u. Altogether, our integrative framework provides a quantitative assessment of the role of Ne and u in modulating patterns of genetic variation, while giving a synthetic picture of their history over the clade.
Significance statement Natural selection tends to increase the frequency of mutants of higher fitness and to eliminate less fit genetic variants. However, chance events over the life of the individuals in the population are susceptible to introduce deviations from these trends, which are expected to have a stronger impact in smaller populations. In the long-term, these fluctuations, called random drift, can lead to the accumulation of mildly deleterious mutations in the genomes of living species, and for that reason, the effective population size (usually denoted Ne, and which captures the relative strength of drift, relative to selection) has been proposed as a major determinant of the evolution of genome architecture and content. A proper quantitative test of this hypothesis, however, is hampered by the fact that Ne is difficult to estimate in practice. Here, we propose a Bayesian integrative approach for reconstructing the broad-scale variation in Ne across an entire phylogeny, which in turns allows for quantifying how Ne correlates with life history traits and with various measures of genetic diversity and selection strength, between and within species. We apply this approach to the phylogeny of primates, and observe that selection is indeed less efficient in primates characterized by smaller effective population sizes.
Introduction
Effective population size (Ne) is a central parameter in population genetics and in molecular evolution, impacting both genetic diversity and the strength of selection (Charlesworth, 2009; Leffler et al., 2012). The influence of Ne on diversity reflects the fact that larger populations can store more genetic variation, while the second aspect, efficacy of selection, is driven by the link between Ne and genetic drift: the lower the Ne, the more genetic evolution is influenced by the random sampling of individuals over generations. As a result, long-term trends in Ne are expected to have an important impact on genome evolution (Lynch et al., 2011) and, more generally, on the relative contribution of adaptive and non-adaptive forces in shaping macro-evolutionary patterns.
The nearly-neutral theory proposes a simple conceptual framework for formalizing the role of selection and drift on genetic sequences. According to this theory, genetic sequences are mostly under purifying selection; deleterious mutation are eliminated by selection, whereas neutral and nearly-neutral mutations are subject to genetic drift and can therefore segregate and reach fixation. The inverse of Ne defines the selection threshold under which genetic drift dominates. This results in specific quantitative relations between Ne and key molecular parameters (Ohta, 1995). In particular, species with small Ne are expected to have a higher ratio of nonsynonymous (dN) to synonymous (dS) substitution rates and a higher ratio of nonsynonymous (πN) to synonymous (πS) nucleotide diversity. Under certain assumptions, these two ratios are linked to Ne through allometric functions in which the scaling coefficient is directly related to the shape of the distribution of fitness effects (DFE) (Kimura, 1979; Welch et al., 2008).
The empirical test of these predictions raises the problem that Ne is difficult to measure directly in practice. In principle, Ne could be estimated through demographic and census data. However, the relation between census and effective population size is far from straightforward. Consequently, many studies which have tried to test nearly-neutral theory have used proxies indirectly linked to Ne. In particular, life history traits (LHT, essentially body mass or maximum longevity) are expected to correlate negatively with Ne (Waples et al., 2013). As a result, dN /dS or πN /πS are predicted to correlate positively with LHT. This has been tested, leading to various outcomes, with both positive and negative results (Eyre-Walker et al., 2002; Popadin et al., 2007; Nikolaev et al., 2007; Lartillot, 2013; Nabholz et al., 2013; Romiguier et al., 2014; Figuet et al., 2016).
More direct estimations of Ne can be obtained from πS since, in accordance with coalescent theory, πS = 4Neu (with u referring to the mutation rate per site per generation). Thus, one would predict a negative correlation of dN /dS, πN /πS and LHT with πS. Such predictions have been tested, and generally verified, in several previous studies (Piganeau & Eyre-Walker, 2009; Romiguier et al., 2014; Figuet et al., 2016; Galtier, 2016; Chen et al., 2017; James et al., 2017). However, these more specific tests of the nearly-neutral theory are only qualitative, at least in their current form, in which Ne is indirectly accessed through πS without any attempt to correct for the confounding effect of the mutation rate u and its variation across species.
New Approaches
In this study, we aim to solve this problem by using a Bayesian integrative approach, in which the joint evolutionary history of a set of molecular and phenotypic traits is explicitely reconstructed along a phylogeny. This method has previously been used to test the predictions of the nearly-neutral theory via indirect proxies of Ne (Lartillot, 2013; Nabholz et al., 2013). Here, we propose an elaboration on this approach, in which the variation in the mutation rate per generation u is globally reconstructed over the phylogeny by combining the relaxed molecular clock of the model with data about generation times. This in turns allows us to tease out Ne and u from the πS estimates obtained in extant species, thus leading to a complete reconstruction of the phylogenetic history of Ne and of its scaling relations with others traits such as dN /dS or πN /πS. Using this reconstruction, we can conduct a proper quantitative test of some of the predictions of the nearly-neutral theory and then compare our findings with independent knowledge previously derived from the analysis of site frequency spectra. The approach requires a multiple sequence alignment across a group of species, together with polymorphism data, ideally averaged over many loci to stabilize the estimates, as well as data about life-history traits in extant species and fossil calibrations. Here, we apply it to previously published phylogenetic and transcriptome data in primates (Perelman et al., 2011; Perry et al., 2012).
Results
Empirical phylogenetic correlation analysis
We first conducted a general correlation analysis between dS, dN/dS, πS, πN /πS and life-history traits (LHTs). Our analysis relies on a multiple sequence alignment of 54 coding genes in 61 primates (adapted from Perelman et al., 2011), combined with data about LHTs taken from the literature (de Magalhaes & Costa, 2009; Besenbacher et al., 2019) and polymorphism data from 10 primates species (obtained from Romiguier et al., 2014; Figuet et al., 2016; Perry et al., 2012). Of note, these estimates of πS and πN /πS are based on the global transcriptome data, thus not restricted to the 54 coding genes represented in the multiple sequence alignment (see methods for details, Data subsection).
In a first step, these data were analysed using a previously introduced Bayesian integrative model (Coevol, Lartillot & Poujol, 2011). This model is an adaptation of the classical comparative method based on the principle of phylogenetically independent contrasts (Felsenstein, 1985) to the problem of estimating the correlation between quantitative traits (such as LHT, πS and πN /πS) and substitution rates (such as dS and dN/dS). The joint evolutionary process followed by all these variables is assumed to be multivariate Brownian:
where Ck(t) are the instant values of the 6 traits (body mass, age at sexual maturity, longevity, generation time πS, πN /πS). This process is parameterized by a 8 × 8 variance-covariance matrix Σ, which captures the correlation structure between rates and traits. The model is conditioned on sequence and trait data, using Markov Chain Monte-Carlo (MCMC). Then, the marginal posterior distribution on Σ is used to derive measures of the strength and the slopes of the correlations between rates and traits, using standard covariance analysis and in a way that automatically accounts for phylogenetic inertia. In the following, this Brownian model is referred to as the phenomenological model, because it does not entail any specific assumption about the population-genetic relations that might exist between dN/dS, πN /πS and Ne.
Based on this first analysis, neither πN /πS nor dN/dS appear to correlate with LHTs (Table 1), except for a positive correlation between dN /dS and longevity (correlation coefficient r = 0.49). On the other hand, the correlations among molecular quantities are globally in agreement with the nearly-neutral predictions, although with rather unequal statistical support. Most notably, πN /πS shows a clear negative correlation with πS (r = −0.73). As for dN/dS, it also shows a negative correlation with πS (r = −0.50), although with very marginal support (posterior probability of a negative correlation pp < 0.95). The two variables, dN/dS and πN /πS are also positively correlated with each other (r = 0.42), but again, with marginal support. The weaker correlations observed for dN/dS, compared to the more robust correlation between πN /πS and πS, could be due either to the presence of a minor fraction of adaptive substitutions or, alternatively, to a discrepancy between the short-term demographic effects reflected in both πS and πN /πS and long-term trends captured by dN/dS.
Correlation coefficients between dS, dN/dS, πS and πN /πS, life-history traits and Ne.
Although not conclusive, the correlation patterns between πS, πN /πS and dN/dS are compatible with the hypothesis that the nearly-neutral model is essentially valid for primates. The overall lack of correlation with LHT, on the other hand, suggest that there is no clear correlation between effective population size and body size or other related life-history traits in this group. Possibly, the phylogenetic scale might be too small to show sufficient variation in LHT that would be interpretable in terms of variation in Ne. Alternatively, Ne might be driven by other life-history characters (in particular, the mating systems), which may not directly correlate with body size. Of note, in those cases where the estimated correlation of dN/dS or πN /πS with LHTs were in agreement with the predictions of the nearly-neutral theory (Eyre-Walker et al., 2002; Popadin et al., 2007; Nikolaev et al., 2007; Lartillot, 2013; Nabholz et al., 2013; Romiguier et al., 2014; Figuet et al., 2016), the reported correlation strengths were often weak, weaker than the correlations found directly between πS and πN /πS and dN/dS (Piganeau & Eyre-Walker, 2009; Romiguier et al., 2014; Figuet et al., 2016; Galtier, 2016; Chen et al., 2017; James et al., 2017).
Teasing apart divergence times, mutation rates and effective population size
The correlation patterns shown by the three molecular quantities πS, dN /dS and πN /πS suggest that Ne plays a non-negligible role in their interspecific variation. However, in its current form, this correlation analysis does not give any quantitative insight about the scaling of dN /dS and πN /πS as a function of Ne and, more generally, about the quantitative impact of Ne on the evolution of coding sequences. In order to achieve this, an explicit estimate of the key parameter Ne, and of its variation across species, is first necessary. In this direction, a first simple but fundamental equation relates πS with Ne:
In order to estimate Ne from equation 4, an estimation of u is also required. Here, it can be obtained by noting that:
where r is the mutation rate per site and per year and τ the generation time. Assuming that synonymous mutations are neutral, we can identify the mutation rate with the synonymous substitution rate dS, thus leading to:
Finally, combining equations 4 and 6 and taking the logarithm gives:
This expression suggests to apply the linear transformation given by equation 7 to the three variables ln πS, ln dS and ln τ, all of which are jointly reconstructed across the tree by the phenomenological model introduced above, which then gives a global phylogenetic reconstruction of ln Ne. A similar argument, based on equation 6, gives:
which can be used to obtain a global reconstruction of the mutation rate per generation u. Finally, since the two transformations given by equations 7 and 8 are linear, the correlation patterns between ln Ne, ln u and the other variables included in the analysis can be recovered by applying elementary matrix algebra to the covariance matrix estimated under the initial parameterization (see Materials and Methods, Ex-post log linear transformation).
The results of this linearly-transformed correlation analysis are gathered in Table 1 (last two columns). First, in accordance with the predictions of the nearly-neutral theory, πN /πS and dN /dS show a negative correlation with Ne (r = −0.73 for πN /πS and −0.58 for dN/dS). Second, u is inferred to covary negatively with Ne (r = −0.89), suggesting that species with large Ne tend to have a lower u (Lynch et al., 2011). This correlation should be interpreted with caution, however, because these two variables are both estimated partly based on πS = 4Neu, such that estimation errors on πS will systematically affect them in opposite directions. On the other hand, and perhaps more convincingly, u is negatively correlated with πS (r = −0.67, table 1). Unlike Ne and u, πS and u are empirically independent in the present case (i.e. estimated based on different data sources). This latter correlation thus further strengthens the case that Ne and u have opposite trends. It also suggests that the variation in u is more moderate than the variation in Ne, such that πS remains in the end positively correlated with Ne.
Quantitative scaling of πN /πS and dN /dS as a function of Ne
Once an explicit estimate of Ne and of its variation is available, the scaling behavior of πN /πS and dN /dS as a function of Ne can be quantified. Mathematically, the Brownian process followed by rates and traits, such as given by equation 1, implies the following log-linear relations:
where ϵi(t), for i = 1, 2, are Brownian motions. These last two terms are mathematically equivalent to the residual errors of the linear regression between the independent contrasts of ln πN /πS and ln dN/dS against log Ne. Equivalently:
where κi(t) = eϵi(t), for i = 1, 2. In other words, the slopes of the log-linear relations, β1 and β2, are just the scaling coefficients of πN /πS and dN /dS as a function of Ne. These two scaling coefficients can be directly obtained based on the covariance matrix estimated above (see methods, Correlations and slopes). Of note, equations 9 to 12 are just a reformulation of the output of the correlation analysis conducted previously. As such, they do not entail any specific hypothesis about the population-genetic relations between πN /πS, dN/dS and Ne. A population-genetic interpretation of these equations is considered in the next subsection.
In the present case, the estimates of β1 and β2 are of similar magnitude, with point estimates of 0.17 and 0.10, respectively (Table 2). It is worth comparing these estimates with those that would be obtained if we were using πS directly as a proxy of Ne (i.e. without correcting for u). The slopes of the log-linear scaling of πN /πS and dN/dS as a function of πS are steeper than those obtained as a function of Ne, with point estimates of 0.29 and 0.13 (Table 2), suggesting that the confounding effects of u are not negligible on the evolutionary scale of a mammalian order such as primates and, as such, can substantially distort the scaling relations if not properly taken into account.
Scaling coefficient of dN/dS and πN /πS as functions of Ne and πS, compared with estimates of the shape parameter β of the distribution of fitness effects.
Relation with the shape of the distribution of fitness effects
Mechanistically, the slope of ln πN /πS and ln dN /dS as a function of ln Ne can be interpreted in the light of an explicit mathematical model of the nearly-neutral regime. Such mathematical models, which are routinely used in modern Mac-Donald Kreitman tests (Charlesworth & Eyre-Walker, 2008; Eyre-Walker & Keightley, 2009; Halligan et al., 2010; Galtier, 2016), formalize how mutation, selection and drift modulate the detailed patterns of polymorphism and divergence. In turn, these modulations depend on the shape of the distribution of fitness effects (DFE) over non-synonymous mutations (Eyre-Walker & Keightley, 2007). Mathematically, the DFE is often modelled as a gamma distribution. The shape parameter of this distribution (usually denoted as β) is classically estimated based on empirical synonymous and non-synonymous site frequency spectra. Typical estimates of the shape parameter are of the order of 0.15 to 0.20 in humans (Boyko et al., 2008; Eyre-Walker et al., 2006), thus suggesting a strongly leptokurtic distribution, with the majority of mutations having either very small or very large fitness effects.
When the shape parameter β is small, both πN /πS and dN /dS are theoretically predicted to scale as a function of Ne as a power-law, with a scaling exponent equal to β (Kimura, 1979; Welch et al., 2008), i.e.:
The relation given by equation 13 was used previously for analyzing the impact of the variation in Ne along the genome in Drosophila (Castellano et al., 2018). Assuming that (1) the sequences follow a purely nearly-neutral regime, thus without positive selection; (2) the DFE is constant across primate species; (3) variation in Ne is sufficiently slow, compared to the fixation time of non-synonymous substitutions, and (4) short-term Ne and long-term Ne are identical, equations 13 and 14 also predict the interspecific variation in πN /πS and dN/dS as a function of Ne. More specifically, by identifying these two equations with equations 11 and 12, the present argument predicts that the interspecific allometric scaling coefficients β1 and β2 estimated above should both be equal to the shape parameter β of the DFE. In the present case, the two estimates β1 and β2 are indeed congruent with each other, with overlapping credible intervals (Table 2). They are also compatible with previously reported independent estimates of the shape parameter of the DFE, such as obtained from site frequency spectra in Humans and in other great apes (also reported in Table 2).
A mechanistic nearly-neutral phylogenetic codon model
Since all of the results presented thus far are compatible with a nearly-neutral regime, we decided to construct a mechanistic version of the model directly from first principles. Thus far, a phenomenological approach was adopted, in which the whole set of variables of interest (dS, dN/dS, πS, πN /πS and LHT) were jointly reconstructed along the phylogeny, as a multivariate log-normal Brownian process with 8 degrees of freedom. Only in a second step were Ne and u extracted from this multivariate process, using log-linear relations. As a result, dN/dS, πS and πN /πS are correlated with Ne and u, but they are not deterministic functions of these fundamental variables, such as suggested by equations 6, 7, 13 and 14.
In the mechanistic model now introduced, these deterministic relations are enforced. The Brownian process now has 6 degrees of freedom (πS, πN /πS and the LHT), corresponding to the variables that are directly observed in extant species. Then, equations 6, 7, 13 and 14 are inverted, so as to express r = dS, dN/dS, Ne and u as time-dependent deterministic functions of this Brownian process (see Materials and Methods, Mechanistic model). In these equations, the shape parameter β, along with κ1 and κ2 in equations 13 and 14, are structural parameters of the model, representing the DFE, itself assumed to be constant across species. These parameters are estimated along with the covariance matrix, divergence times and the realization of the process along the phylogeny. The model is conditioned on the same data (sequence alignment, traits, polymorphism) and using the same fossil calibration scheme as for the phenomenological model.
This mechanistic model was implemented in two alternative versions. A first naive version assumes that genetic divergence takes place exactly at the splitting times defined by the underlying species phylogeny. Identifying genetic divergence with species divergence, however, amounts to ignoring the time it takes for coalescence to occur in the ancestral populations. For that reason, an alternative model was considered, based on the argument that the mean coalescence time in the ancestral population at a given ancestral node of the phylogeny is equal to δt = 2Neτ, where Ne is the effective population size and τ the generation time prevailing at or around that node of the species tree. Since Ne and τ are both reconstructed along the entire species phylogeny, it is therefore possible to use the local information about the current value of Ne and τ to account for the extra amount of divergence δt induced by coalescence in the ancestral population when calculating the sequence likelihood (see methods). In the following, these two alternative versions of the mechanistic model are called the ‘naive-phylogenetic’ and ‘mean-coalescent’ versions. Of note, the mean-coalescent version is solely invoked to correct mutation rate and time estimates, not dN/dS. In reality, ancestral coalescence is expected to lead to non-trivial patterns of non-synonymous and synonymous divergence (Mugal et al., 2020), which are ignored here.
Fitting the model on the data returns an estimate of the structural parameters of the DFE as well as a dated tree, annotated with the complete history of the variation in mutation rate and in effective population size along its branches. These different aspects of the output of the analysis are now considered in turn.
Mechanistic estimate of the shape parameter of the DFE
The mechanistic model yields an estimate of β (Table 2) that is somewhat higher than the slopes β1 and β2 estimated previously with the phenomenological approach, with a posterior median at 0.27 and a credible interval equal to (0.22, 0.33). The credible interval is smaller than those for β1 and β2, reflecting the fact that the mechanistic model is more constrained. Our estimate of β also appears to be higher than independent estimates obtained from site frequency spectra. Of note, there is some discrepancy among those DFE-based estimates, whose credible intervals do not overlap. However, our estimate is significantly higher than the more recent estimate obtained by jointly analyzing the site frequency spectra across great apes (Castellano et al., 2019), which is probably the most relevant one in the present context. This observation suggests a potential violation of the mechanistic model (see Discussion).
Estimated mutation rates
The dated phylogeny, together with the reconstructed history of the mutation rate per year r, is shown in Figure 1. Here, the mechanistic model works like a standard molecular dating method, using fossil calibration and a relaxed clock model to tease out times and rates from raw synonymous sequence divergence. Generation time, also reconstructed along the tree by the model, is then used to convert the mutation rate per year r into a mutation rate per generation u (Figure 2). Point estimates (medians) and 95% credible intervals of both r and u for some extant species of interest and a few key ancestors are shown in Table 3.
Reconstructed phylogenetic history of the mutation rate per year r (posterior median estimate) under the mechanistic model. Species for which transcriptome-wide polymorphism data were used are indicated in bold face.
Reconstructed phylogenetic history of the mutation rate per generation u (posterior median estimate) under the mechanistic model. Species for which transcriptome-wide polymorphism data were used are indicated in bold face.
Estimates of mutation rate per year r and per generation u (posterior median and 95% credible interval), for several extant and ancestral species.
The naive-phylogenetic model tends to return higher mutation rates, compared to the mean-coalescent approach, in particular for extant taxa and, more generally, for recent ancestors. For instance, in humans, the mutation rate per year is estimated at 0.80 × 10−9 when coalescence in the ancestral population is ignored, versus 0.67 × 10−9 when it is accounted for in the model. This reflects a bias induced by ignoring ancestral coalescence. For instance, the Human-Chimpanzee split is constrained at around 5 to 7 My, but coalescence in the ancestral population can easily go back to 9 My, which, if dated at 7 My, automatically induces an overestimation of the mutation rate along the branch leading to Humans (Besenbacher et al., 2019).
The estimates obtained here under the mean-coalescent model are intermediate, lower than previously reported phylogenetic estimates but still higher than pedigree-based estimates. For instance, typical phylogenetic estimates for the mutation rate in humans are typically of the order of 10−9 per year, or 3 × 10−8 per generation, whereas pedigree-based estimates are generally about half these values. Concerning other primates, our estimates are also higher than pedigree based estimates previously reported for macaques (Wang et al., 2020) and baboons (Wu et al., 2020). On the other hand, they are congruent for Gorilla, Pongo (Besenbacher et al., 2019) and Aotus (Thomas et al., 2018). In the following, only coalescent-aware estimates are further considered.
Across primates, the mutation rate per year r shows a 5-fold variation from 0.6 × 10−9 to 3.0 × 10−9 point mutations per year and per nucleotide site (Figure 1). The rate of mutation is relatively high in the ancestor of primates (~ 2 × 10−9). On the side of Strepsirrhini, it remains high in Lorisiformes (~ 2 × 10−9) but is lower in Lemuriformes (~ 10−9). Concerning Haplorrhini, the rate undergoes a net slowdown in Catarrhini (~ 10−9), further accentuated in apes, which are among the slowest evolving primates (~ 0.6 × 10−9). These observations are globally in accordance with previous observations, in particular, emphasizing that the slowdown occurring in apes (Steiper et al., 2004) is in fact in the continuity of a broader process of deceleration more generally across catarrhine primates (Perelman et al., 2011).
Compared to the mutation rate per year, the mutation rate per generation u varies over a more moderate range, showing a 3-fold variation across primates. Moderately high ancestrally (1.5 × 10−8), it shows a convergent decrease in Lorisiformes, Lemuriformes, and Cercopithecidae (reaching below 10−8 in several species in these three clades), but otherwise, remains more in the range of 1.5 to 2 × 10−8. At first sight, the mutation rates per year r and per generation u tend to show opposite patterns: slow-evolving lineages, with a low mutation rate per year, tend to have a higher mutation rate per generation (high u). This is particularly apparent for apes (low r, high u) or for Lorisiformes (high r, low u). However, there are some exceptions, of lineages that have both a high r and a high u, most notably Platyrrhini (new world monkeys). More generally, the correlation analysis (table 1) suggests that u and r are positively, not negatively, correlated on average across primates.
Phylogenetic Reconstruction of Ne
The marginal reconstruction of Ne along the phylogeny of primates returned by the mechanistic model is shown in Figure 3 (Supplementary Figure 1 for the reconstruction returned by the phenomenological model). More detailed information, with credible intervals, is given in Table 4 for several species of interest and key ancestors along the phylogeny.
Reconstructed phylogenetic history of Ne (posterior median estimate) under the mechanistic model. Species for which transcriptome-wide polymorphism data were used are indicated in bold face.
Estimates of effective population size (×10−3, posterior median and 95% credible interval) for several extant taxa and ancestors.
The Ne estimates for the four extant hominids (Homo, Pan, Gorilla and Pongo) are globally congruent with independent estimates based on other coalescent-based approaches (Prado-Martinez et al., 2013). In particular, for Humans, Ne is estimated to be between 13 000 and 24 000. For other hominids, they tend to be somewhat higher than coalescent-based estimates. Concerning the successive last common ancestors along the hominid subtree, our estimation is also consistent with the independent-locus multi-species coalescent (Rannala & Yang, 2003).
The history of Ne shows a clear large-scale structure over the primate phylogeny. Starting with a point estimate at around 100 000 in the last common ancestor of primates, Ne then goes down in Haplorrhini, stabilizing at around 65 000 in Cercopithecidae (old-world monkeys), 50 000 in Hominoidea (going further down more specifically in Humans), and 40 000 in Platyrrhini (new-world monkeys). Conversely, in Strepsirrhini, Ne tends to show higher values, staying at 100 000 in Lemuriformes and going up to 160 000 to 200 000 in Lorisiformes. Finally, rather large effective sizes are estimated for the two isolated species Tarsius and Daubentonia – although the credible intervals are very large (Table 4). These estimates may not be so reliable, owing to the very long branches leading to these two species.
The reconstruction of Ne shown in Figure 3 mirrors the patterns of dN/dS estimated over the tree (Supplementary Figure 2). This is partially expected given the fact that the model relies on the scaling relation between dN/dS and Ne given by equation 14. However, the model integrates other sources of information, in particular, from πS and πN /πS. In order to get some insight about how each of these variables informs the reconstructed history of Ne, two additional models were considered.
First, an alternative version of the phenomenological model was used (Supplementary Figure 3), in which all time-dependent variables are assumed to evolve independently. Under this uncoupled model, Ne is not informed by dN/dS or πN /πS, but only by πS = 4Neu and by the estimates of u implied by the relaxed clock and data about generation times. Compared to the mechanistic version just presented, this uncoupled model gives lower Ne estimates most notably for the last common ancestor of primates (30 000, versus 100 000 under the mechanistic model), but also for the two species Daubentonia and Tarsius (which have a low dN/dS). Conversely, it returns higher Ne estimates in Lemuriformes (which have a high dN/dS compared to Lorisidae).
Interestingly, under the uncoupled model, there is a substantial uncertainty about the estimation of Ne across the tree: the 95% credible intervals span one order of magnitude on average. This uncertainty is reduced under the reconstructions relying on the additional information contributed by dN/dS and πN /πS, quantitatively, by 30% under the phenomenological, and by 50% under the mechanistic covariant models (Table 4). In the end, there is thus on average a factor 5 between the lower and the upper bound of the 95% credible intervals on Ne estimates under the most constrained (mechanistic) model. Concerning the deep branches of the tree, most of this reduction in uncertainty is primarily contributed by dN/dS – which thus gives an idea of how much information is extracted from multiple sequence alignments by these models about ancient population genetic regimes.
Second, as mentioned above, the mechanistic model infers a value of 0.27 for the shape parameter β, which is higher than recent SFS-based estimates, of the order of 0.16. Fixing β = 0.16 in the mechanistic model returns a reconstruction of Ne over the phylogeny (Supplementary Figure 4) qualitatively similar to that returned by the unconstrained version of the model (Figure 3), although covering a broader range (about 100-fold) than under either the mechanistic or the uncoupled models (about 10- to 30-fold, Figure 3 and Supplementary Figure 3). This illustrates the key role of β in calibrating the transfer of information from dN/dS to Ne. According to the scaling relation given by equation 14, when β is small, a large variation in Ne results in a small shift in dN/dS. Thus, conversely, with a smaller β, the empirically observed variation in dN/dS over the tree implies a broader range of Ne variation across primates. In the present case, this argument also suggests an explanation of why the small value of β inferred by SFS is rejected by the mechanistic model – because the variation in Ne implied by the history of dN/dS under such a small value of β would exceed the one implied by the range of πS observed in extant species.
Of note, the Ne estimates are lower under the naive-phylogenetic (Supplementary Figure 5) than under the mean-coalescent approach (Figure 3). This difference between the two models is due to the fact that, given πS, any bias in the estimation of u has to be compensated for by an opposite bias of the same magnitude in the estimation of Ne. These discrepancies are relatively minor, however, and the global patterns of the history of Ne along the tree are very similar in both cases.
Finally, the phylogenetic history of the mutation rate per generation u mirrors that of Ne, such that species with smaller Ne tend to have a higher value for u (compare Figures 1 and 2). These opposite patterns of variation, combined with the fact that Ne shows a greater amplitude in its variation across primates compared to u, results in πS being mostly driven by Ne, although with a partial dampening of its overall variation. This joint pattern for Ne and u explains why the regression slopes of ln πN /πS and ln dN/dS against πS are steeper than those against Ne (Table 2).
Discussion
A Bayesian integrative framework for comparative population genomics
The question of the role of Ne in the evolution of coding sequences has motivated much work over the years. One main problem that has attracted particular attention is to understand to what extent Ne modulates the ratio of non-synonymous over synonymous polymorphism (πN /πS) or divergence (dN /dS). Often, πS has been used as a proxy for Ne. However, πS also depends on u, the mutation rate per generation, which differs between species.
In this context, the main contribution of the present work is to propose a Bayesian integrative phylogenetic framework for conducting such comparative analyses in a way that allows for direct quantitative estimation of Ne and of its impact on molecular evolution across a clade of interest. Relying on an integrated relaxed clock model to calibrate mutation rates, the program leverages an estimate of Ne based on πS, correcting for u. Simultaneously, it conducts a regression analysis, returning an estimate of the scaling exponent of molecular quantities such as dN/dS and πN /πS, but also potentially other variables or quantitative traits, directly as a function of Ne. As a byproduct, the approach also returns a global reconstruction of the history of effective population size and mutation rate across the phylogeny.
As can be seen from Table 2, correcting for variation in mutation rate between species (for u), as opposed to regressing directly against πS, does have an impact on the estimated scaling relations. In the present case, the slopes as a function of πS tend to be steeper than as a function of Ne, a pattern that is more generally expected if species with large Ne also tend to have lower mutation rates per generation, such as previously suggested (Lynch et al., 2011) and confirmed by our correlation analysis (Table 1). The approach introduced here should therefore represent a useful methodological contribution in the context of the current discussions on the role played by those scaling coefficients in molecular evolution (James et al., 2017; Castellano et al., 2018, 2019; Galtier & Rousselle, 2020).
Estimating mutation rates
Conceptually, our approach for extracting Ne is merely a reformulation, in a Bayesian integrative framework, of the classical idea of estimating Ne from πS by factoring out the mutation rate u, itself estimated based on a molecular clock argument. The integrative approach presents several advantages, however. First, it imposes the same assumptions about the molecular clock, relying on the same sequence data and the same global set of fossil constraints, uniformly for all species included in the analysis. Second, it automatically propagates the uncertainty about estimates of u, which themselves incorporate the uncertainty about divergence times, onto the credible intervals eventually reported for Ne or for the slopes of the regressions. Third, these slopes are also automatically corrected for phylogenetic non-independence. Finally, on purely practical grounds, the application of the method is straightforward, just requiring as its input a multiple sequence alignment for the clade of interest, estimates of πS and πN /πS for some or all of the extant species and fossil calibrations.
An alternative to the phylogenetic estimation of u is to rely on high-throughput sequencing of pedigrees. As of yet, such estimates are available only for 7 primates (Chintalapati & Moorjani, 2020), but this is likely to change in the future. For these 7 species, the phylogenetic estimates obtained here are higher than pedigree-based estimates, thus in line with previous observations. The reasons for this discrepancy are not yet well-understood (Scally & Durbin, 2012; Ségurel et al., 2014; Chintalapati & Moorjani, 2020). Interestingly, accounting for coalescence in the ancestral populations contributes a lot to making phylogenetic estimates closer to those obtained in the same species by sequencing pedigrees, although not entirely.
In principle, pedigree-based estimates of u could be included in the framework presented here, as additional constraints at the tips of the phylogeny to inform the reconstruction of Ne. However, given the still unexplained mismatch between pedigrees and phylogenies, it is perhaps more meaningful to compare them after the fact, as was done here (Table 3), and then further investigate the various entry points in the model, at the level of the relaxed clock, the prior on divergence times, the fossil constraints, that could be responsible for this discrepancy.
Mechanistic models of coding sequence evolution
Our phylogenetic approach was implemented in two alternative versions, using either a phenomenological or a mechanistic modeling strategy. The phenomenological model implements the idea of conducting comparative regression analyses directly against Ne, such as discussed above. In itself, this approach is agnostic about the underlying selective regimes over proteins and, in particular, is not inherently committed to a nearly-neutral interpretation.
The mechanistic model, on the other hand, makes more aggressive assumptions about the underlying selective regime. It is fundamentally a Bayesian phylogenetic implementation of the nearly-neutral theory. Accordingly, it assumes that protein-coding sequences are exclusively under purifying selection. Another key assumption of the model, not necessarily implied by the nearly-neutral theory, is that the DFE is constant across species. These assumptions give more constraint to the analysis and return more focussed estimates. However, the estimate of the shape parameter of the DFE obtained under this model turns out to be significantly higher than some of the estimates based on SFS obtained in Humans or in great apes (Table 2), suggesting that one of these two assumptions might not be strictly valid.
The question of whether the DFE is constant across species has recently motivated both methodological work for jointly analyzing the site frequency spectra of multiple species (Tataru & Bataillon, 2019) and empirical investigations (Castellano et al., 2019; Galtier & Rousselle, 2020). These empirical analyses suggest that the shape parameter is rather stable across great apes (Castellano et al., 2019) and more broadly across primates Galtier & Rousselle (2020). The mean of the distribution, on the other hand (usually denoted ), was found to be potentially variable across great apes (Castellano et al., 2019), such that species with larger Ne values also tend to have more strongly deleterious non-synonymous mutations. The rather steep variation of the population scaled mean
as a function of πS observed across metazoans Galtier & Rousselle (2020) might also be interpreted as reflecting an underlying positive covariation of
with Ne. Importantly, since dN/dS and πN /πS scale as
, this specific pattern of covariation between the unscaled mean of the DFE and Ne predicts that the regression slopes should be steeper than β. This could explain the high estimate of β obtained here under the mechanistic model. Of note, the scaling of πN /πS and dN/dS with respect to Ne returned by the phenomenological model are compatible with SFS-based estimates, but this might just be a consequence of the rather large credible intervals obtained in their case.
To further investigate this question, conducting a broader analysis with polymorphism data obtained for a larger number of primate species – and using the same set of coding genes for estimating πS and πN /πS, on one hand, and dS and dN/dS, on the other hand – would certainly be an important direction to pursue, as it would consolidate the results presented here, which are still preliminary in many respects. In particular, it would yield more precise estimates of the scaling coefficients under the phenomenological model. If this confirms the discrepancy between the interspecific scaling coefficients and SFS-based estimates, then, the assumption of a constant DFE under the mechanistic model could be relaxed, although this would then require incorporating information, not just about mean diversity (πS and πN /πS), but also about site frequency spectra in extant species, in order to constrain the estimation.
Concerning the other assumption of the mechanistic model, of an exclusively purifying selection regime, the dN/dS ratio may in fact contain a fraction of adaptive substitutions, susceptible to distort the relation between dN/dS and Ne. The relative importance of adaptive versus nearly-neutral substitutions in primates is still debated (Eyre-Walker & Keightley, 2009; Galtier, 2016; Zhen et al., 2021). In any case, adaptive substitutions might certainly represent an important issue when applying the method to other phylogenetic groups. Here also, the model could be further elaborated, by explicitly including an adaptive component to the total dN/dS. Quite interestingly, the resulting model could then be seen as an integrative multi-species version of the Mac-Donald Kreitman test, returning an estimate of the history of the adaptive substitution rate over the phylogeny – which could then be compared with independent estimates based on pairs of sister species (Charlesworth & Eyre-Walker, 2008; Eyre-Walker & Keightley, 2009; Halligan et al., 2010; Galtier, 2016).
Finally, another potential issue, which concerns both the mechanistic and the phenomenological model, is that short-term Ne (such as reflected by πS) may be strongly dependent on recent demographic events (Charlesworth, 2009) and may thus not be identical with long-term Ne (such as reflected by dN /dS). This might be one of the reasons why dN /dS shows a weaker correlation with πS than πN /πS. A possible improvement of our model in this direction would consist in allowing for an additional level of variability at the leaves, representing the mismatch between long- and short-term Ne. Other sources of variance in extant diversity estimates could also be modeled, in particular, the additional stochasticity contributed by the random genealogy or by the low counts of SNPs. These last two points are probably a minor issue for nuclear exome-wide polymorphism data, such as explored here. In contrast, they could be quite relevant in the case of the small and non-recombining mitochondrial genome, for which the question of the inter-specific scaling behavior of dN/dS and πN /πS as a function of Ne is also of interest (James et al., 2017).
Evolution of mutation rates, Ne and life history across primates
The global phylogenetic history of Ne (Figure 3) and mutation rates (Figures 1 and 2) obtained here offers interesting insights into the macro-evolutionary trends in primates, making connections between life-history and molecular evolution. Previous analyses have repeatedly pointed out a slowdown of the molecular clock in apes (Steiper et al., 2004), more broadly in catarrhine primates (Perelman et al., 2011), or even more globally throughout the evolutionary history of the entire order (Steiper & Seiffert, 2012), suggesting a trend towards increasing body size and longer generation times in this group. Our reconstruction confirms this global picture, adding another feature, in the form of a global decrease in effective population size, although more specifically in simians (Figure 3). A global picture only in terms of evolutionary trends along a small-versus-large body size axis, however, would be an oversimplification. In particular, dS and dN/dS appear to respond differently to LHT, dS being negatively correlated with body size (Table 1) as previously reported (Steiper & Seiffert, 2012), whereas dN/dS correlates only with longevity but not with body size, a pattern also observed across mammals (Nikolaev et al., 2007; Lartillot, 2013).
The trend in decreasing Ne observed here is primarily driven by the underlying variation in dN/dS. As such, it provides another illustration of the more general result that molecular evolutionary patterns inferred from genetic sequences using phylogenetic methods can be informative about life-history evolution (Lartillot & Delsuc, 2012; Romiguier et al., 2013; Figuet et al., 2014; Wu et al., 2017). Compared to previous work, however, an important new contribution of the present work is a quantitative reconstruction, over the phylogeny, directly in terms of the canonical parameters of population genetics, the mutation rate u and the effective population size Ne. Such broad-scale reconstructions, as opposed to focussed estimates in isolated extant species, are potentially useful in several respects. First, they provide a basis for further testing some of the key ideas about the role of mutation rate or genetic drift in genome evolution (Lynch et al., 2011; Lefébure et al., 2017). Second, the integrative framework could be augmented with trait-dependent diversification models (Fitzjohn, 2010), so as to examine the role of Ne or u in speciation and extinction patterns.
Materials & Methods
Coding sequence data, phylogenetic tree and fossil calibration
The coding sequences were taken from Perelman et al. (2011) and modified. It consists in a modified subset, codon compliant, based on 54 nuclear autosomal genes in 61 species of primates, and of a total length 15.9 kb. We used the tree topology published by Perelman et al. (itself based on a maximum likelihood analysis), as well as the eight fossil calibrations that were used in this previous study to estimate divergence times. These calibrations were encoded as hard constraints on the molecular dating analysis.
Life History Traits
We used four life history traits (LHT) in this study. Adult body mass (as a proxy for body mass, 16 missing values), maximum recorded lifespan (ML, as a proxy for longevity, 19 missing values) and female age of sexual maturity (ASM, 26 missing values) were obtained from the An Age database (de Magalhaes & Costa, 2009). Estimates about generation time were calculated from maximum longevity and age at maturity following a method detailed by UICN (Pacifici et al., 2013):
In the case of great apes, and most notably for Humans, these estimates appear to be too high (48.5 years for Homo, 24.7 for Pan, 23.8 for Gorilla and 24.1 for Pongo). For these four species, direct estimates of the generation time (29, 24, 19 and 25 years, respectively) were taken from Besenbacher et al. (2019).
Estimation of Polymorphism (πS and πN /πS)
The estimates of the synonymous nucleotide diversity πS and the ratio of non-synonymous over synonymous diversity πN /πS and πS of 10 primate species were calculated on the sequence data from Perry et al. (2012), such as re-analysed by Romiguier et al. (2014) and Figuet et al. (2016). We matched these polymorphism data for the three species Pan troglodytes, Propithecus vereauxi coquereli and Eulemur mongoz, to Pan paniscus, Propithecus verreauxi and Eulemur rufus, respectively, from the Perelman et al multiple sequence alignment.
A first series of analyses were conducted using the estimates of πN /πS and πS reported by Romiguier et al. (2014). Alternatively, and in order to avoid the artifactual correlations induced between πS and πN /πS by shared data sampling error, the method of Romiguier et al. (2014) was adapted so as to estimate πS and πN /πS on different subset of the sites, using the hypergeometric method (James et al., 2017). Specifically, starting from the original fasta file containing the 8 variants for each contig of a given species, coding sites were randomly partitioned into two subsets, with equal probability independently for each site, and the πS and πN /πS statistics were computed on each subset using the dNdSpiNpiS software program (available at https://kimura.univ-montp2.fr/calcul/softwares.html), with the same options as those used in Romiguier et al. (2014). Finally, the πS estimate from the first half was combined with the πN /πS estimate obtained on the second half. The results presented in the main document are all based on this hypergeometric method. They are essentially identical to those obtained using the original polymorphism estimates (Supplementary Table 2).
Models
The phenomenological model
The phenomenological model is essentially the one introduced in Lartillot & Poujol (2011), with some minor modifications. The exact structure of the model is given in the manual of Coevol, version 1.5b.
Correlations and slopes
Given a covariance matrix Σ describing the correlation structure of a Brownian process X, the strength of the correlation coefficient between two entries of X, k and l, is given by:
The slope of the regression of trait l agains trait k is given by:
In both cases, the equation is applied successively on each point obtained from the posterior distribution by MCMC, yielding a collection of values, for either the correlation coefficient or the slope, from which a median point estimate and a credible interval are then computed.
Ex-post log-linear transformation of the correlation analysis
In the following, the multivariate process X is specified such that the entries are in the same order as in table 1 (dS, dN/dS, maturity, mass, longevity, πS, πN /πS and generation time τ). Based on this vector X, an extended vector Y can be defined, as:
Since ln u and ln Ne are given by the following log-linear relations:
Y can be expressed as Y = AX + K. Here, K is a vector of constants (depending on the time scale), and A is an 8 × 10 matrix:
where I8 is the 8 × 8 identity matrix. Since X is Brownian, parameterized by a covariance matrix ΣX, Y is also Brownian, with (degenerate) covariance structure given by ΣY = A × ΣX × A′. In practice, we added a new method in Coevol to read the output and apply the linear transformation (from X and ΣX to Y and ΣY) on each sample from the posterior distribution. This gives a reconstruction of Ne (posterior median, credible intervals) and of the correlation matrix ΣY.
Mechanistic Nearly-Neutral Model
This alternative model uses the original Coevol framework but introduces additional constraints, such that some of the parameters are deduced through deterministic relations implying other Brownian dependent parameters. Specifically, the Brownian free variables are now:
Then, using equations 6, 7, 11 12, the other variables of interest can be expressed as deterministic functions:
This model has three structural free parameters, β, κ1 and κ2, which were each endowed with a normal prior, of mean 0 and variance 1.
The naive-phylogenetic version of the model uses the default approach used in Coevol, i.e. assumes a single dated tree T, with branch lengths measured in time. The Brownian multivariate process X runs along this tree, splitting into two independent processes whenever a speciation node is encountered along the phylogeny. Conditional on T and X, the sequences then evolve along that same tree T, using a codon-model in which dS and dN/dS are modulated across branches, such as implied by X (Lartillot & Poujol, 2011). For the mean-coalescent version of the model, on the other hand, conditional on T and X, the sequence evolutionary process is assumed to run on a dated tree T′ different from T. Compared to T, the node ages of T′ are all shifted further back into the past by an amount δt = Max(2Neτ, δt0), where Ne and τ are the instant values of effective population size and generation time implied by the process X at the corresponding node in the original tree T, and δt0 is the difference between the age of the focal node on T and the age of its oldest daughter node in T′. This additional constraint is meant to ensure that a node should always be older than its daughter nodes in T′. Owing to the variation in Ne and τ between successive nodes, this constraint may not be automatically realized, in particular in regions of the tree in which successive splitting times are within coalescence time from each other – precisely those splits that are potentially under a regime of incomplete lineage sorting. In practice, this problem is rarely encountered (less than one node of the whole tree on average under the posterior distribution).
Uncoupled Model
The uncoupled model, already implemented in Coevol, is similar to the phenomenological version of the model, except that the variables of interest (dS, dN/dS, πS, πN /πS and τ) are modelled as independent Brownian processes along the tree. Equivalently, we use a multivariate Brownian model with a diagonal covariance matrix (see Lartillot and Poujol, 2011, for details).
Markov Chain Monte-Carlo (MCMC) and post-analysis
Two independent chains were run under each model configuration. Convergence of the chains was first checked visually (a burnin of approximately 1000 points, out of 6000, was taken for the phenomenological model, and of 100 out of 3100 for the mechanistic model) and quantified using Coevol’s program Tracecomp (effective sample size greater than 500 and maximum discrepancy smaller than 0.10 across all pairs of runs and across all statistics, except for the age of the root of the tree, which typically has a lower effective sample size). We used the posterior median as the point estimate. The statistical support for correlations is assessed in terms of the posterior probability of a positive or a negative correlation. The slope is estimated for each covariance matrix sampled from the distribution, which then gives a sample from the marginal posterior distribution over the slope.
Software and data availability
Coevol (Lartillot & Poujol, 2011) is an open source program available on github: https://github.com/bayesiancook/coevol. All models and data used here, along with scripts to re-run the entire analysis, are accessible from this repo. Data and scripts for computing πS and πN/πS are available at https://github.com/bayesiancook/polyprim
Authors contribution
The modifications of Coevol and the new models presented in this study are primarily attributable to MB, who also gathered and formatted the data and conducted all analyses, in the context of an internship (master Biosciences of École Normale Supérieure de Lyon). NL contributed additional analyses and further elaborations for the revised version. MB and NL both contributed to the writing of the manuscript.
Competing interests
The Authors have no competing interests.
Funding
French National Research Agency, Grant ANR-15-CE12-0010-01 / DASIRE. Phylogenetic analyses were conducted using the computing facilities of the CC LBBE/PRABI.
Acknowledgements
We wish to thank Emeric Figuet, Jonathan Romiguier and Nicolas Galtier for sharing polymorphism data (PopPhyl project) and for their help in re-running the scripts for analysing them, Emmanuel Douzery and Frederic Delsuc for editing the multiple sequence alignment of Perelman et al to make it codon-compliant, and Nicolas Galtier, Laurent Duret and Thibault Latrille for their input on this work and their comments on the manuscript.