Evolutionary dynamics of neutral phenotypes under DNA substitution models

We study the evolution of quantitative molecular traits in the absence of selection. Using a simple theory based on Felsenstein’s 1981 DNA substitution model, we predict a linear restoring force on the mean of an additive phenotype. Remarkably, the mean dynamics are independent of the effect sizes and genotype and are similar to the widely-used OU model for stabilizing selection. We confirm the predictions empirically using additive molecular phenotypes calculated from ancestral reconstructions of putatively unconstrained DNA sequences in primate genomes. We show that the OU model is favoured by inference software even when applied to GC content of unconstrained sequences or simulations of DNA evolution. We predict and confirm empirically that the dynamics of the variance are more complicated than those predicted by the OU model, and show that our results for the restoring force of mutation hold even for non-additive phenotypes, such as number of transcription factor binding sites, longest encoded peptide and folding propensity of the encoded peptide. Our results have implications for efforts to infer selection based on quantitative phenotype dynamics as well as to understand long-term trends in evolution of quantitative molecular traits.


Introduction
With the increasing availability of high-throughput data about molecular and cellular function from many species, a key challenge is detect selection on these quantitative molecular traits [1]- [9]. The models used for selection in the comparative phylogenetic approach have rich theoretical grounding [10], [11] and software packages are available for data analysis under a variety of inference frameworks [12]- [14]. However, despite evidence that molecular phenotypes may evolve with little selective constraint [15]- [17] and classical work in the area [18], relatively little attention has been paid to improving the realism of the "neutral" or "null hypotheses" for phenotype evolution, as compared to protein coding sequences [19], [20] where the null hypothesis can explicitly formulated based on empirical estimates from neutral sites [19] or, more recently, biophysics of protein folding (e.g., [21]).
Brownian Motion (also referred to as a "random walk" [22]) is an appealing null hypothesis for phenotype evolution in comparative analysis of quantitative molecular traits [8]. It is mathematically convenient [23] and is referred to as a model of "pure drift" [1], [13] because it matches quantitative trait evolution in theory under genetic drift alone [23]. A key prediction of the Brownian Motion model is that changes from the ancestral phenotype are symmetric, and therefore, on average, phenotypes are not expected to change over time [24]. On the other hand, it has long been appreciated that in even in the absence of selection, mutation is expected to be (at least) a weak directional force on phenotype evolution [22], [25] and this has been observed empirically in at least one recent mutation accumulation study [26]. This view is captured in the so-called house-of-cards model [27], which was meant to capture the idea that the force of random mutation would be expected to break down what had been built up by selection.
With the increasing interest in studying evolution of molecular and cellular traits over long evolutionary time-scales, several theoretical studies have begun to take a closer look at the expected effects of mutation on the phenotype, finding that the equilibrium phenotype is the result of a trade-off between the directional effects of mutation and selection [28]- [30], and that biased mutation can drive the phenotype away from the optimum when selection is not strong [31]. To our knowledge, these predictions have not yet been compared directly with data, but global gene expression variation in mutation accumulation studies appears more consistent with the house-of-cards prediction [32]. Perhaps most intriguingly, a recent empirical study that estimated the effects of mutations on gene expression levels for 10 genes found asymmetric distributions (inconstant with Brownian Motion) and used simulations to predict approximately linear change over time in the absence of selection for most genes [33].
Here we aimed to develop a neutral model of phenotype evolution that could be compared directly to DNA-based molecular phenotypes. We work out the dynamics of additive phenotypes in the weak-mutation regime [34], the regime of most relevance for molecular evolutionary studies. Under Felsenstein's 1981 model of DNA evolution [35], we predict that mutation acts as a linear restoring force on the mean phenotype, similar to the OU models currently used for stabilizing selection [10] and to the house-of-cards model for phenotype evolution [22]. We find remarkable agreement with this prediction in observations of molecular phenotypes computed from reconstructed ancestors of putatively neutral sequences in primate genomes, even when the phenotypes are not strictly additive. Our results are inconsistent with the Brownian Motion null hypothesis used in phylogenetic comparative analysis, and consistent with this, an inference software package rejects this null hypothesis for our neutral phenotype data.

Neutral dynamics of the mean of an additive phenotype under a DNA substitution model
Our goal is to compare a neutral model of phenotype evolution to observations of molecular phenotypes obtained from closely related extant species. We therefore sought to derive the dynamics of the moments of the phenotype distribution under a standard model of DNA substitution. Let Z(X) be the phenotype as a function of the genotype X, and be the contribution of the m-th allele at the n-th locus. For example, for a DNA-based phenotype with n=10 contributing loci, a represents a 4 x 10 matrix. In order to compare with empirical phenotype data, we allow arbitrary values for , thus making no assumptions about the distribution of effect sizes. We assume the haploid loci are in linkage disequilibrium, and use a so-called "one-hot" encoding to represent the genotype, so that 1 if the haploid genotype is the m-th allele at the n-the locus and zero otherwise. Thus, for n=10 loci, the genotype, X, is also a 4 x 10 matrix. If the phenotype is additive, we have: is a dot product between vectors of length 4 for DNA. Given a starting phenotype , because evolution is a random process, we aim to calculate the mean phenotype over "replicate" populations as a function of time, t.
Where P(x) is the probability density of the random variable x and E[x] denotes the expectation (or mean), ≝ ∑ , | , is the probability of the initial genotype X mutating into another genotype Y after evolutionary time t, and is the probability of observing the initial genotype. In assuming that an initial genotype mutates into another genotype we are neglecting the possibility of polymorphism, competition and recombination between genotypes and other population genetic processes. This is the so-called "weak mutation" assumption commonly used to model molecular evolution at the timescale of inter species divergence [34].
To make progress, we first consider the dynamics of the mean phenotype given an initial genotype X: • Where the last equality is achieved using the linearity of the expectation and the assumption of the independence of genotype evolution at each locus. Under Felsenstein's 1981 model (F81) of DNA evolution [35] the substitution probabilities are, where ≝ ∑ is the DNA substitution rate, scaled to that evolutionary distance, ut, is measured in substitutions per site and π is the stationary distribution that parameterizes the continuous time markov process and can be interpreted as the long-term probabilities of each allele [35]. For notational convenience we define ≝ • to represent the (scalar) long-term probability of the allele that is found at the n-th locus. Note that this model assumes that the loci evolve independently and identically, so there is no dependence on the locus for . Using straightforward algebra (see Appendix), we can show that | 1 Where ≝ ∑ • and can be interpreted as the phenotype expected after a long-time of neutral evolution, or the phenotype at "mutational equilibrium." Remarkably, our formula for the dynamics of the mean of an additive phenotype is independent of the starting genotype, X, and depends only on the starting phenotype . In other words, the dynamics of the phenotype are the same for all genotypes that encode that phenotype. We conjecture that the F81 model is the most realistic DNA substitution model for which this can be true (for example, it is not true of models with transition-transversion rate bias [36], see Discussion). This removes the need to average over all ℎ ℎ , leading to our main theoretical result, which we refer to as the F81 model for mean phenotypes:

| |
Thus, under the F81 DNA substitution model, the dynamics of the mean phenotype depend only on the initial phenotype, the mutational equilibrium phenotype and the evolutionary rate. The dynamics of the mean phenotype are independent of the number of alleles, loci, and the distribution of allelic effect sizes, . The dynamics of the mean phenotype are similar to those expected under the (not DNA-based) house-of-cards mutation model [22], if the mean of the phenotype effect distribution in the house-of-cards model is chosen to be the "mutational equilibrium" phenotype, , defined above.
Since we measure evolutionary distance in units of substitutions per site, for evolutionary times of interest in the primate phylogeny considered below, ≪ 1, so 1 . Hence

≪ 1 |
This says that the mean phenotype is driven towards the equilibrium with a force simply proportional to the distance of the initial phenotype from the mean. Near the equilibrium, mutation can be neglected, but the further the phenotype is from equilibrium, the stronger the directional force of mutation becomes. The appearance of a directional force on the mean phenotype in the absence of selection is contradictory to the Brownian Motion null hypothesis for phenotype evolution [23], which argues that in the absence of selection, the mean phenotype is not expected on average to change in one direction or the other [23].
Another remarkable feature of our result is that the dynamics of the mean phenotype match exactly to the mean of the well-studied OU process, a stochastic model that is widely used to capture the restoring force of stabilizing selection in phylogenetic comparative analysis [10], [11], [13]. If our prediction of OU-like mean dynamics in the absence of selection is correct, naïve use of the OU process to model selection may be misleading (see discussion). The OU process is a Gaussian process, and therefore the dynamics are fully specified by the mean and variance. The parameters of the OU process are the restoring force (which must be based on the form of the mean dynamics we predict above) and the fluctuation size ( ), which doesn't affect the mean dynamics. If the F81 neutral phenotype dynamics matched an OU process exactly, the variance would be independent of the starting phenotype, and at short evolutionary distance would increase proportional to time This means that, if the phenotypic dynamics in the absence of selection follow an OU process, the variance of the phenotype will increase proportionally to evolutionary distance, at the rate of twice the long-time or equilibrium variance. Since this prediction is the same as the prediction of the Brownian-Motion model [37], the variance dynamics at short times cannot be used to distinguish the Brownian-Motion and OU models.

Molecular phenotypes from ERVs in primate genomes
Since our goal is to test empirically the predictions of the model, we considered phenotypes that could be computed directly from DNA sequences: these represent so-called quantitative molecular traits [5] with known genetic architecture and no measurement noise. This experimental set-up is attractive for several reasons: we can be sure that the phenotypes are truly "additive" (therefore not violating the assumptions of the model), we can rule out that stochastic effects we observe are intrinsic to the evolutionary process and not due to measurement or sampling errors [38], [39], and we can compare the observations from real DNA sequences to comparable observations from forward simulations under the DNA substitution models to determine the effects (if any) of mis-specification of the substitution model.
To obtain neutral phenotypes, we analyzed alignments of endogenous retroviruses (ERVs, see Methods), which are well-annotated ancient pseudogenized copies of retroviruses that are no longer replicating, but are easily identifiable and distributed in large number over the human genome [40]. We treat these sequences as independent "replicates" of the evolutionary process and compute the mean and variance of quantitative molecular phenotypes derived from these sequences. We use reconstructed ancestral DNA sequences (based on alignments of closely related primates) along the lineage leading to human, and infer the ancestral phenotypes from these. This allows us to directly study the forward evolution of phenotypes in an evolutionary ensemble without relying on model assumptions for parameter inference.

Dynamics of GC content and TATA box strength confirm the directional force of mutation
We first considered the "simplest" possible DNA-based phenotype: GC content, a phenotype that has been studied using comparative phylogenetic methods, e.g., [41]. We computed GC content in 100 nucleotide segments extracted from ERVs (see Methods) binned by the GC content of the inferred ancestral sequence. As expected for a simple restoring force of mutation, qualitatively, sequences with relatively high GC content (Figure 1a, triangles) show a decrease over evolutionary time, while sequences that start with a low GC content show an increase in GC content over time (figure 1b, filled squares). As predicted, the decrease at these short evolutionary distances appears linear.
To compare quantitatively, we express GC content in the notation of our theory. The GC content of the ancestral sequences in the bin represents and we used , , , 0.32,0. 18 We found good agreement between the theory based on the F81 DNA substitution model (dashed lines in Figure 1a,b) and the observed changes in GC content. We also used simple linear regression to infer the change in GC content over time as an estimate of the evolutionary force (see Methods). Once again we found good agreement between the theory and the observations, ( Figure 1c) with some slight deviations for sequences with ancestral GC content much different than equilibrium. We believe these deviations are due to the mis-specification of our DNA substitution model which does not include CpG bias in the mutation rate or transitiontransversion rate bias. Finally, we note good agreement between the variance predicted by an OU process and the observed increase in variance for GC content. To rule out possible circularity due to the use of probabilistic models of evolution in ancestral genome reconstruction [42] we repeated this analysis using ancestral sequences reconstructed using maximum-parsimony, which makes no assumptions about the relative likelihood of substitutions between different nucleotides. We found qualitatively similar results.
We next repeated these analyses for the strength of matches to the TATA box ( [43], see Methods) in the first 10 residues of each ERV segment. Because positions in transcription factor binding sites contribute approximately linearly to affinity [44], this represents a simple biochemical phenotype whose evolution is well studied [4]. For clarity we note that in our case none of these sequences are expected to have functioning TATA boxes: these are sequences with similarity to the TATA box that arise by chance [45]. The strength of matches to the TATA box is computed using a 4 x 10 weight matrix model, which corresponds to a in our model (see Methods). We bin the sequences by the average strength of TATA box in the inferred ancestor, , and again we used 0.32,0.18,0.18,0.32 which implies 1.37. The other parameters are 9.223 and ∞ 24.6, which are obtained directly from the matrix model (see Methods). We found good agreement with the predictions of the model (Figure 2a-c). Once again the linear, constant increase in variance seems to agree reasonably well with the prediction that neutral phenotypes evolve according to an OU process.

The BM null hypothesis is rejected for GC content evolution in the absence of selection
To test whether the observed deviations from the Brownian Motion null hypothesis were strong enough to be detected in a typical analysis, we applied the widely used OUCH package for comparative phylogenetic analysis [39]. We obtained six well-aligned fragments from an alignment of 39 mammals (see Methods) for an unusually ancient ERV that is thought to evolve in the absence of selection [46]. We found that for the five of six segments where the inference algorithm converged, the OU model was strongly supported over the Brownian Motion null hypothesis (all differences in corrected AIC >50, Table 1), which in the standard interpretation in comparative phylogenetic analysis would be considered strong evidence for stabilizing selection (Figure 3a). We note that these results cannot be explained by noise [38]in our phenotypic measurements, as GC content is computed exactly from sequences.
Although this ERV is thought to evolve in the absence of selection [46], we wanted to ensure that the results were not due to cryptic selection on GC content. We therefore simulated molecular evolution of these ERV sequences under a simple neutral model of DNA substitutions (see Methods) and found the same results using OUCH for inference: in the five cases where the algorithm converged, the data strongly supported the rejection of the BM model in favour of the OU model usually associated with stabilizing selection (all differences in corrected AUC >50, Figure 3b). Qualitatively, the strength of support (as measured by AUC in bootstrap resamplings) was similar to (albeit stronger than) that obtained with the real data, consistent with the idea that ERV sequences are likely evolving in the absence of selection on GC content. Taken together, these results are consistent with the idea that the force of mutation on quantitative molecular traits is sufficient to reject the Brownian Motion null hypothesis in a phylogenetic comparative analysis (see Discussion)

Dynamics of ATGs show the predicted mutational force, but no longer match an OU process
Although our predictions of the mean phenotype dynamics make no assumptions about the underlying distribution of the phenotype, this assumption is made in the Brownian Motion and OU process predictions; both GC content and strength of TATA boxes are expected to be approximately Gaussian (GC content is an estimate of a binomial distribution parameter and the strength of a TATA box is the sum of 10 different numbers). Therefore, we next considered a sequence-based trait that we expected to be far from Gaussian: the number of ATG start codons in 100 bp segments. Since the number of ATGs cannot be negative, sequences that start with 0 ATGs will certainly violate the predictions of Brownian Motion and OU process models. Once again, we emphasize that we do not expect these ERV sequences to encode functional proteins, so these are simply ATGs that occur by chance or may have functioned in the ancestral viral proteins.
As for GC content and strength of TATA boxes, we found that the number of ATGs in neutral sequences shows a clear mean-reverting force. For example, in ancestral sequences that have exactly 4 ATGs, we see the number of ATGs decreasing over time (Figure 4a, triangles). This is intuitive as we expect random mutation to destroy these "informative" signals over time. Less intuitive is the analogous result for ancestral sequences that start with no ATGs: in these sequences we see (on average) the accumulation of ATGs over evolutionary time (figure 4b filled squares). Evolution appears to be creating start codons, such that overall 142 ATGs are found in human ERV segments that had none in the primate ancestor. Again, we emphasize that the creation of ATGs is not the result of any selection or biological function, rather, in these sequences ATGs are simply being created by random mutations faster than they are being destroyed. We note that the temptation to create adaptive explanations for the appearance of ATGs in these sequences is very strong (see Discussion), but we confirmed that the dynamics of ATGs are similar in simulations where we can be certain there is no selection or function.
To predict the dynamics of the number of exact matches to a short sequence, which is not strictly an additive phenotype, we treat the DNA sequences as a series of overlapping w-mers, which we assume are independent; in the case of ATGs, w = 3. This means that we imagine 4 possible alleles at each locus, and 1 for the short sequence of interest, and 0 ortherwise. If we assume that each DNA letter still evolves independently and at the same per-site rate, the effective evolutionary rate between these alleles will be ∑ . Furthermore, with 4 ≫ 1, as long as the mutation process is not too biased, → 0, so → 1 and the mutational restoring force is simply w. As before, the number of ATGs in the ancestral sequences in the bin represents . The equilibrium phenotype in this model can be computed exactly As with GC content and strength of TATA boxes, we find very good agreement between the theory and the mean dynamics (dashed lines in figures 4a,b,c) and clear evidence for the linear restoring force with strength simply equal to 3 start codons per substitution per 100nt sites ( Figure 4c). We note that the agreement between the observations and theory must be highly approximate in this case, because the exact dynamics of the number of ATGs is strongly genotype dependent: initial sequences with many "one-off" sequences (e.g., ATC, ATT, ATA, AAG, etc.) are much more likely to increase their number of ATGs than sequences with few of these "one-off" sequences. We believe that because we are measuring the average over a large number of initial sequences, these effects are averaged out (see Discussion).
Since the number of exact matches is approximately binomial, the long-time variance will be approximately Unlike for GC content and strength of TATA boxes, we found that the variance in number of ATGs was not independent of the starting phenotype , and therefore inconsistent with the OU prediction (dashed line in figure 5a). For number of ATGs we found that sequences with more ATGs than the equilibrium (4 ancestral ATGs, triangles in figure 5a) showed a faster increase in variance than sequences with fewer (0 ancestral ATGs, filled squares in figure 5a). We therefore looked at the ancestral bin that was closest to the mutational equilibrium (2 ancestral ATGs, x's in figure 5a) and found that the variance increased in very good agreement with the OU prediction. Taken together, these results for the variance suggest that the true dynamics for molecular phenotypes do not match an OU process when the phenotype is strongly non-Gaussian and/or sufficiently far from the mutational equilibrium.
Variance dynamics of neutral phenotypes are more complicated than OU or BM model predicts We next explored the dynamics of the variance. Under the assumptions above (weak-mutation regime, full linkage disequilibrium, haploid), the dynamics of the variance of an additive phenotype given an initial genotype, X, is given by

| •
Where is the vector of effects of alleles at position n, and is the variance-covariance matrix for the current genotype at the n-th locus with entries given by , and the multiplication, indicated by • is an inner product between the two matrices. Because the genotype at each locus is a categorical random variable, we know the form of , which works out to: Where k and m index two different alleles. In general, under the F81 model, the variance appears to depend on the initial genotype, X, and is non-monotonic.
For the simplest phenotype, we can show that the F81 neutral dynamics of the variance are independent of the initial genotype, X, but they are still more complicated that the OU model predicts. Consider the additive phenotype where only one allele contributes at each locus, and the effect size is the same at each locus, say 1 unit. The phenotype is proportional to the counts of the m-th allele over all the loci, and corresponds to a phenotype considered in recent theoretical work [28]. In our notation, 1 for the m-th allele at each position, and 0 otherwise. For each of L identical loci, if the m-th allele is listed first, The variance is simply | ∑ . Since for this simple phenotype [28], substituting the formula above gives The first term is a single exponential decay to the long-term variance, 1 . The second term is proportional to the squared difference of the starting genotype from the probability of the m-th allele at equilibrium. However, it can be simplified into a genotype independent form in this case: we get exactly ∑ terms of 1 and exactly of , so Thus, the variance of the simplest additive phenotype is independent of the genotype, but shows non-monotonic, initial phenotype dependent dynamics. This also appears to be consistent with analysis of a simple phenotype under the house-of-cards mutation model, which also predicts non-monotonic dynamics that depend on the starting phenotype [22]. Only in the case where is the variance dynamics independent of the initial phenotype (as predicted by the OU dynamics), which corresponds to a case where exactly ½ of the sites have the m-th allele at equilibrium. However, even in this situation the dynamics deviate from the OU prediction.
To reconcile these theoretical results with our observations of additive phenotypes showing nearly OU variance dynamics, we consider the case where approaches . After algebra, we which is exactly the OU dynamics. Hence, at least in this case, the variance shows OU dynamics when the initial phenotype is at equilibrium. We conjecture that more generally, the variance dynamics approach OU dynamics when the phenotype approaches equilibrium. However, for only slightly more complicated phenotypes (such as GC content as defined above) the dynamics are not exactly OU, even if the phenotype is at the mutational equilibrium, owing to covariance between the alleles at each locus (see Appendix). We note that although the prediction matches the data for GC content and TATA boxes in ERVs, the case where the initial phenotype, , approaches the mutational equilibrium, , is not a realistic case for a phenotype that has been under selection, where the phenotype is likely very far from that expected under mutation alone (see Discussion). Because our approximate model for the number of ATGs described above corresponds to this simple counting phenotype, we compared these predictions to that data. At short evolutionary distance, and since for ATGs in 100 residues, ≪ , we obtain a simple approximation for the variance of number of ATGs under the F81 model: To our knowledge this simple prediction for the rate of increase of variance based on the initial phenotype has not been reported under the house-of-cards model [22]. Comparing this to our observations for the variance of ATGs (with u=3) gives remarkably good agreement, and better predicts the linear increase in variance than the OU model (Figure 5b,c).

Neutral dynamics of more complex molecular phenotypes also show a linear restoring force
Many of the molecular phenotypes of interest (e.g., gene expression level) are much more complicated than GC content or strength of binding and are hard to express as additive phenotypes. Nevertheless, a recent study of the mutational effects on gene expression used simulations parameterized by empirical measurements to show simple linear dynamics of gene expression phenotypes [47] even though the traits are unlikely to be linear. To test the generality of our finding of a mutational force on the mean phenotype, we next empirically studied the neutral dynamics of non-additive phenotypes that can be computed from sequence. We chose three phenotypes that are of more biological interest: the number of TATA boxes in a 100 bp sequence, the length of the longest encoded peptide, and the intrinsic disorder in the longest encoded peptide. The first of these relates to how non-coding sequences evolve so-called homotypic clusters of binding sites [48], [49], and the latter two of these phenotypes are related to the emergence of new protein-encoding genes from random DNA [50], [51]. To obtain these, we computed the lengths of six-frame translations from our 100-basepair ERV segments, and the propensity of the peptide to fold ( [52], see methods). We emphasize that these phenotypes are highly non-linear in the DNA sequence genotype, and therefore strongly violate the assumptions used to derive our results above.
Remarkably, we found that these phenotypes also showed simple linear mean dynamics, albeit with stronger evolutionary forces pushing the phenotypes to their mutational equilibrium ( Figure  6a) than the additive phenotypes considered above. The force appears to be different for each phenotype, but we have no theory to predict it. To a large extent, the evolutionary force appears simply proportional to the distance of the ancestral sequence from the equilibrium (Figure 6b) and the quantitative strength of the force is less than 10 units per 100 bp DNA sequence. Thus, non-additive phenotypes also appear to show a simple force of mutation proportional to the distance of the ancestral phenotype from the mutational equilibrium.
We also repeated the OUCH analysis on the longest ORF in ancient mammalian ERV segments to test if the mutational force would be strong enough to reject the BM null hypothesis. Indeed, we find significant support for the OU process model for the longest ORF (all 4 of 6 segments where the inference converged show corrected AIC differences > 50, Table 1). We note that under the current interpretation of this as evidence for stabilizing selection, this would lead to the (interesting) inference that ancient ERVs are under selection to retain coding capacity. However, we can find the similar evidence for the OU process in simulated versions of these segments (see methods) where we know there is no selection for coding capacity (all 3 of 6 segments where the inference converged show corrected AIC differences > 20, Table 1), consistent with the idea that mutation alone (or model mis-specification due to mutation, see Discussion) appears sufficient to create a detectable evolutionary restoring force on molecular phenotypes of current research interest.

Discussion
Our results show that mutation is expected to aid the creation of molecular phenotypes de novo and relate to several areas of molecular evolution. For example, in agreement with simulation results [53], we find that mutation alone is sufficient to create transcription factor binding sites in DNA sequences, and we quantify this effect: 100 nt sequences that have no strong matches to the TATA box are expected to accumulate them at a rate of approximately 0.5 TATA box matches per substitution per site. Similarly, in random 100 nt sequences with no open reading frames, mutation will tend to create open reading frames at a rate of >50 codons per substitution per site. Even if the encoded peptides start out as strongly disordered [51], the force of mutation is expected on average to increase their tendency for folding [50]. Remarkably, but consistent with a recent report [33], even though all of these molecular phenotypes are non-additive, they show simple linear change over time, following our predictions for additive phenotypes. We believe that this is because our ERV sequences are relatively near the mutational equilibrium for these phenotypes, and speculate that the evolutionary dynamics are approximately linear near the mutational equilibrium. It will be of great interest to extend the theoretical work to explain these observations.
Unlike previous theoretical work which focused on quantifying the force of mutation on phenotype evolution by including mutational bias [29], [31], here we used the F81 substitution model, which is formulated in terms of the (possibly non-uniform) equilibrium frequencies for the DNA bases (alleles). Our results suggest that the mutational equilibrium phenotype, rather than the mutation rate bias plays the key role in determining the neutral dynamics of phenotypes, consistent with recent theoretical results [28], [31]. Although the F81 model does have mutation bias (the rate bias from allele m to allele k is π m / π k ; there is no transition-transversion bias or CpG hypermutation), we also considered the Jukes-Cantor model, which is a specific instance of the F81 model where all π m =1/4 ( =4/3) for DNA, and therefore shows no mutation bias. Importantly, even under this simplification, we still predict a restoring force proportional to the distance from the mutational equilibrium, although some simplification of the dynamics is obtained (e.g., the neutral variance dynamics of GC content is no longer genotype dependent). Even simplified bi-allelic (non-DNA based) models with no mutation bias show a restoring force of mutation when far from mutational equilibrium (Appendix), strongly suggesting that the force of mutation in phenotype evolution is not due to mutation rate bias, but is a more fundamental result of the mapping between the discrete genotype space and the continuous (or ordinal) phenotype. We did not pursue these simplified models further because the predictions of the dynamics were not close to our observations from ERV sequences, presumably because the predicted mutational equilibrium is too far from the real data (e.g., equilibrium GC content of 50%, rather than 36%). This highlights the need to develop simpler (approximate) theory that can still be quantitatively compared to data.
Interestingly, for GC content we found agreement with the predictions of the variance based on the OU process, even though we can show that this is only approximate (See Appendix). We believe this is because GC content in ERVs is close enough to mutational equilibrium, where we find the dynamics to be well approximated by the OU process. However, more generally, our F81 phenotype evolution model is meant to be used as a null hypothesis for phenotypes that may be under selection (such as gene expression levels), and these phenotypes are unlikely to be close to mutational equilibrium. This is consistent with observations of directional evolution of phenotypes in mutation accumulation studies [26], where selection is removed, but phenotypes are far from their mutational equilibriums.
We note that in practice even neutral evolution of GC content may still be problematic if analyzed with the standard OU process models [41] because the rate of increase of variance depends on the number of loci (See Appendix). Thus, although for a fixed sequence size GC content appears to evolve according to an OU process, for real sequences (that change in length over the phylogeny) although the variance in GC content is expected to increase approximately linearly for reasonably short evolutionary times, it will increase at a slower rate for longer sequences. This reduction in variance could be mis-interpreted in the OU framework as increased selection intensity.
More generally, the similarity of the dynamics of neutral phenotypes to an OU process suggests that at time-scales considered here, similarity of phenotype evolution to OU process should not be used as evidence for purifying selection (as suggested in [1]). Indeed, a widely used software implementation [13], [39] infers strong evidence for selection on GC content and length of longest ORF in our ERV sequences and in DNA-based simulations with no selection. Also, our results complicate whether statistical evidence for multi-optimum OU models represents evidence for adaptive shifts, which is a widely adopted assumption in comparative phylogenetic analysis [10]. Shifts in optima could be expected even in the absence of selection, due to changes in the mutational equilibrium over phylogenies (for example, simply because of changes in GC content.) Further, since even neutral phenotypes show more complicated variance dynamics (genotype-dependent and non-monotonic) than the OU model predicts, our results add the possibility of misspecification of the underlying models [54] to the previously reported challenges with phylogenetic inference of OU models [38], [39], [55]. Developing tests based only on the mean (which does seem to match the OU process assumptions remarkably well) is an area of promise, though it remains unclear how much information about mean dynamics is contained in the observations at the tips of the tree [55].
On the positive side, our results are consistent with the idea that mutation and selection act in a simple additive way on the evolution of mean phentoypes [11], [28], both leading to OU-like dynamics. Further, our results limit the quantitative range of restoring force expected due to mutation alone: if the restoring force estimated in an OU model exceeds the predicted value, we believe this can be interpreted as stabilizing selection. Similarly, large changes in the optima in multi-optimum OU models [10] are unlikely to result from mutation alone.
Finally, our approach of using reconstructed ancestral sequences to study the forward evolution of phenotypes is applicable whenever the genetic architecture of a trait is known. With the increasing power of association and other high-throughput studies to determine the loci and their effect sizes for many phenotypes [56], [57], additive models of traits can be inferred. Furthermore, machine learning methods trained on genome-scale data and massively parallel assays may be able to predict non-additive molecular phenotypes from sequences [58], [59]. Once a genotype to phenotype model is defined, observed phenotype evolution can be compared directly to neutral phenotypes obtained from neutral sites (such as ERVs [40]) or simulations at the sequence level [60]. The simulation approach has been applied successfully to molecular traits that can be computed directly from intrinsically disordered protein sequences [61], [62]. If applied more generally to quantitative characters, this could be viewed as an extension to practice of inferring phylogenetic trees from genotypes, even when studying phenotypes [18], [24]: the null hypotheses for phenotype dynamics can also be obtained using our understanding of genotype evolution, further cementing the inevitable merger of systematics and evolutionary genetics [18].

Unconstrained sequences in primate genomes
We obtained 6362 coordinates for predicted ERVs in the human genome from the gEVE database [63], corresponding to a subset of ERV segments that span more than 100 nucleotide residues and contain no ambiguous letters. We obtained alignments of these along with their reconstructed ancestors [42] from the 100 vertebrate alignment tree from Ensembl, we extracted the species in the aligned segments using the Ensembl compara REST API [64]. As a control for possible circularity due to probabilistic models of sequence evolution used in ancestral genome reconstruction [42], we also performed analysis on ancestral sequences reconstructed maximum parsimony as implemented in the phangorn package [65].
For each "descendent" sequence, we computed the evolutionary distance in substitutions per site from the ancestor using the Juke-Cantor model and used the average of these as the evolutionary distance for that phenotype bin. The common ancestor was always assigned a distance of zero.
To estimate the mutational restoring force and rate of variance increase, we used simple linear regression to estimate the slope of the regression of the mean phenotype and variance of the phenotype on time (in substitutions per site). Because the ancestral value was determined by the binning process, and the evolutionary time for the ancestor was zero by construction, to avoid potentially biasing the estimate of the slope, we excluded the common ancestor from the regression and used the remaining 5 datapoints. R 2 for these was typically >0.9.

Strength and number of TATA boxes
We obtained the matrix model of TBP from the Jaspar database [43]. We assumed a position independent background model, as above, 0.32,0.18,0.18,0.32 . For the strength of TATA box analysis, we used the strength of the match in the first 10 bp of each ERV alignment and used the standard log-likelihood ratio score, S, as a measure of binding strength [66]. To compute the other parameters for the strength of TATA boxes, which have width w=10, we used is the probability of observing the mth allele at the nth position in the transcription factor binding probability matrix. For the expected number of TATA boxes in the entire sequence, we used the sum of the posterior probability of each position being a match to the motif [66], ∑ , where each represents the standard likelihood ratio score [66] for the subsequence of length w starting at position n, , and R is the (position independent) log odds of the prior probabilities of observing a match to the motif or not at each position.
log . We used 1/100 for this prior, so log ..

Alignment, phylogeny and simulation of ancient mammalian ERVs
we retrieved the human ERV sequence [46] and used BLAST on the Ensembl database [67] to obtain an alignment with all orthologous mammalian sequences therein. After filtering the results for quality, we were left with orthologous matches in a total of 39 other mammalian species. We manually extracted 6 segments with relatively few gaps and most of the species present (supplementary data).
For each of the 6 segments, we used PAML (v4.9) [68]to infer the branch lengths of the evolutionary tree and reconstruct the corresponding ancestral sequence of the 40 mammalian species included in the study. Then, we employed a DNA substitution simulator [60] to evolve the reconstructed sequences along the mammalian phylogeny. For each set of real or simulated sequences we computed the quantitative molecular phenotypes and analyzed them with OUCH [39]. Tree was drawn using plot function from the APE package for R [69] Alignments, scripts and evolutionary trees are available at https://github.com/shz9/neutralphenotype-model.

Foldedness of longest peptides
We considered all 6-frame translations within the 100bp sequences starting with ATG, and selected the longest. Using these peptides, we measured the propensity to fold using the method described in [52]. In this scale, positive values indicate folded, while disordered regions obtain negative values.       Each cell represents the corrected AIC estimated by OUCH [39] comparing Brownian Motion to a single optimum OU model. When the algorithm reported errors in convergence, this is indicated by "Convergence error" in the table.

Figure Legends
Supplementary figures Figure S1 -as in figure 1a-c, but with ancestral reconstruction done using maximum parsimony Figure S2 -as in figure 4 and 5, but with ancestral reconstruction done using maximum parsimony