Abstract
Understanding the consequences of mutation for molecular fitness and function is a fundamental problem in biology. Recently, generative probabilistic models have emerged as a powerful tool for estimating fitness from evolutionary sequence data, with accuracy sufficient to predict both laboratory measurements of function and disease risk in humans, and to design novel functional proteins. Existing techniques rest on an assumed relationship between density estimation and fitness estimation, a relationship that we interrogate in this article. We prove that fitness is not identifiable from observational sequence data alone, placing fundamental limits on our ability to disentangle fitness landscapes from phylogenetic history. We show on real datasets that perfect density estimation in the limit of infinite data would, with high confidence, result in poor fitness estimation; current models perform accurate fitness estimation because of, not despite, misspecification. Our results challenge the conventional wisdom that bigger models trained on bigger datasets will inevitably lead to better fitness estimation, and suggest novel estimation strategies going forward.
1 Introduction
The past decades have witnessed a tremendous increase in the scale of genome sequence data available from across life. Recently, methods for estimating molecular fitness using generative sequence models have seen widespread success at translating this evolutionary data into predictions of the functional consequences of mutation. Such models have been shown to accurately predict the outcomes of experimental assays of protein function [Hopf et al., 2017, Riesselman et al., 2018, Meier et al., 2021], and have been applied to infer 3D structures of RNA and protein [Marks et al., 2011, Weinreb et al., 2016] and to design novel proteins [Shin et al., 2021, Russ et al., 2020, Madani et al., 2020]. The models have also been used to predict whether human mutations are pathogenic, directly informing the diagnosis of genetic disease [Frazer et al., 2021]. In this paper, we investigate how and why generative sequence models fit to evolutionary sequence data are successful at estimating molecular fitness, and how they might be improved and generalized going forward.
Existing approaches to fitness estimation with generative sequence models rest on an assumed relationship between density estimation and fitness estimation. Given a dataset of sequences X1, …, XN, assumed to be drawn i.i.d. from some underlying distribution p0, fitness models proceed by (1) fitting a probabilistic model qθ to X1:N and (2) using the inferred density as an estimate of the fitness f(x) of a sequence x; this estimate in turn is used to predict other covariates such as whether the mutated sequence is pathogenic [Hopf et al., 2017, Riesselman et al., 2018, Frazer et al., 2021]. Innovation in fitness models has come out of a trend of building increasingly flexible models fit to increasing amounts of data: simple models that treat each column of a sequence alignment independently were improved by energy-based models that accounted for epistasis [Hopf et al., 2017], which in turn were improved by deep variational autoencoders [Riesselman et al., 2018], which in turn were improved by deep autoregressive alignment-free models [Shin et al., 2021, Madani et al., 2020, Meier et al., 2021]. Naively, one might assume that these improvements have come from obtaining better and better estimates of the data distribution p0, and improvements will continue with bigger models and bigger datasets. In this article, we argue that this presumption is incorrect.
Technical summary
First, we show that the true data distribution p0 may not reflect fitness, and argue instead that we should be focused on estimating another distribution that does, p∞ (the “stationary distribution”, to be defined below). In particular, we demonstrate that phylogenetic effects – i.e. the history of how current sequences evolved over time – can “distort” the observed data, leading to a situation where p0 ≠ p∞ (Sec. 2). Second, we show in this situation that p∞ and fitness f are non-identifiable: even with infinite data, there always exists some alternative fitness function that explains the same data just as well as f. This sets fundamental limits on what we can learn about fitness from evolutionary data (Sec. 3). Third, although exact estimation of p∞ is impossible, we show that it is still possible to get closer to p∞ than p0, that is, to find a better estimator of fitness than the true data density p0. This can be done by fitting to data a parametric generative sequence model ℳ = {qθ : θ ∈ Θ} that is (approximately) well-specified with respect to p∞ (i.e. p∞ ∈ ℳ) but misspecified with respect to the data distribution p0 (i.e. p0 ∉ ℳ), thus illustrating the potential blessings of misspecification (Sec. 4). Fourth, we construct a hypothesis test to determine whether these blessings of misspecification occur on real data, with existing fitness estimation models; here, we rely on a Bayesian nonparametric sequence model to construct a credible set for p0 (Sec. 6). Fifth, we apply our test to over 100 separate sequence datasets and fitness estimation tasks, to conclude that existing fitness estimation models systematically outperform the true data distribution p0 at estimating fitness (Sec. 7). The takeaway is that better fitness estimation (i.e. better p∞ estimation) will not come from better density estimation (i.e. better p0 estimation); bigger models and bigger datasets are not enough. Instead, better fitness estimation can come from developing models that describe p∞ better but the data density p0 worse.
2 Models of Fitness and Phylogeny
In this section we show how p0 may not accurately reflect the true fitness landscape, by developing a generative model of sequence evolution that takes into account both fitness and phylogeny. The model is general: it allows for arbitrarily complex epistatic fitness landscapes, and recovers standard generative phylogenetic and fitness models as special cases. Our concerns about the effects of phylogeny on fitness estimation are motivated by the widespread use – and trust – of phylogenetic models for evolutionary sequence data (phylogenetic models are far more widely applied than fitness models) [Hadfield et al., 2018, David and Alm, 2011, Felsenstein, 1985, 2004]. Although often inferred from the very same datasets, standard fitness models and standard phylogeny models make conflicting assumptions, which our general framework makes explicit.
Joint fitness and phylogeny models
We define “joint fitness and phylogeny models (JFPMs)” using two elements: a description of how individual species (or populations or individuals) change over time, which depends on fitness f, and a description of the species’ relationship to one another, a phylogeny H. To describe the dynamics of individual species, let Pτ (x, x0) denote the probability of sequence x0 evolving into sequence x after time τ; in particular, Pτ (x, x0) is assumed to be the transition probability of an irreducible continuous-time Markov chain defined over sequence space 𝒳. For example, under neutral evolution (i.e. without selection based on fitness), Pτ (x, x0) may follow a Jukes-Cantor model [Felsenstein, 2004]. With selection, for simple population genetics models (e.g. Moran or Wright processes), Sella and Hirsh [2005] demonstrate under general conditions that for any x0,

p∞(x) := limτ→∞ Pτ (x, x0) = exp(βf(x))/Z,    (1)

where f(x) is the log fitness of the sequence x, Z is a normalizing constant, and β > 0 is a constant (Appx. A). The implication of Eqn. 1 is that the stationary distribution of the evolutionary dynamics follows a Boltzmann distribution, with energy proportional to the log fitness of the sequence. Estimating p∞ is of interest because it provides a direct estimate of log fitness, up to a linear transform, since f(x) = β−1(log p∞(x) + log Z). (N.b. in the remainder of the paper, when we say “estimate fitness” we mean, implicitly, “estimate log fitness up to a linear transform”.)
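To make the relationship in Eqn. 1 concrete, the following minimal sketch (with a hypothetical fitness function and an arbitrary β, both illustrative assumptions) verifies numerically that log fitness can be read off from the stationary density up to a linear transform:

```python
import math

# Toy sequence space: length-2 binary "sequences", with a hypothetical log
# fitness function f and an arbitrary inverse temperature beta.
beta = 2.0
seqs = ["00", "01", "10", "11"]
f = {"00": 0.0, "01": -1.0, "10": -1.0, "11": -0.5}

# Boltzmann stationary distribution of Eqn. 1: p_inf(x) = exp(beta f(x)) / Z.
Z = sum(math.exp(beta * f[x]) for x in seqs)
p_inf = {x: math.exp(beta * f[x]) / Z for x in seqs}

# Invert: f(x) = beta^{-1} (log p_inf(x) + log Z), i.e. log fitness is
# recoverable from the stationary density up to a linear transform.
f_hat = {x: (math.log(p_inf[x]) + math.log(Z)) / beta for x in seqs}
assert all(abs(f_hat[x] - f[x]) < 1e-12 for x in seqs)
```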
The sequences we observe, however, do not necessarily come from the stationary distribution. Instead, they are correlated with one another according to their evolutionary history. This is described by a phylogeny H = (V, E, T) consisting of a directed and rooted full binary tree with edges E and nodes V, along with time labels for the nodes, T : V → ℝ+ (Fig. 1A). Each node v1 is associated with a sequence Xv1, drawn as Xv1 ∼ PΔt(·, Xv0), where Xv0 is the sequence of the parent node v0, v1 is the child node, and Δt = T(v0) − T(v1) is the length of the edge between them (Fig. 1B). The root sequence is drawn from p∞. The observed datapoints X1, …, XN correspond to the leaf nodes. In general we will assume all leaves are observed at effectively the same time, the present day T = 0.
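This generative process can be made explicit in a minimal simulation sketch. Here a complete binary tree with equal branch lengths stands in for H, a crude Jukes-Cantor-style kernel stands in for PΔt, and the root draw from p∞ is replaced by a uniform random sequence; all three are simplifying assumptions made for illustration only:

```python
import math, random

random.seed(0)
ALPHABET = "ACGT"
SEQ_LEN = 20

def mutate(seq, dt, rate=0.3):
    # Crude Jukes-Cantor-style stand-in for P_tau: each site resamples
    # uniformly with probability 1 - exp(-rate * dt).
    p = 1 - math.exp(-rate * dt)
    return "".join(random.choice(ALPHABET) if random.random() < p else c
                   for c in seq)

def sample_leaves(n_leaves, depth=3.0):
    # Hypothetical phylogeny H: a binary tree grown breadth-first with
    # equal splits; the root draw from p_inf is replaced by a uniform
    # random sequence for simplicity.
    root = "".join(random.choice(ALPHABET) for _ in range(SEQ_LEN))
    nodes = [(root, depth)]  # (sequence, remaining time to the present)
    while len(nodes) < n_leaves:
        seq, rem = nodes.pop(0)
        dt = rem / 2
        for _ in range(2):  # binary split: two children per node
            nodes.append((mutate(seq, dt), rem - dt))
    return [s for s, _ in nodes]

leaves = sample_leaves(8)
assert len(leaves) == 8 and all(len(s) == SEQ_LEN for s in leaves)
```

Because siblings inherit a recent common ancestor, the resulting leaves are correlated rather than i.i.d. draws from the stationary distribution.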
Simplifying assumptions
Standard probabilistic phylogenetic models ignore fitness and assume
(Pure phylogeny models (PMs)). There is no difference in fitness among sequences, i.e. f (x) = C.
Example models that fit this form include most of those used in BEAST [Drummond and Rambaut, 2007], MrBayes [Huelsenbeck and Ronquist, 2001], RAxML [Stamatakis, 2006], etc. Standard probabilistic fitness models, on the other hand, ignore phylogenetic history and assume that the stationary distribution has been reached,
(Pure fitness models (FMs)). Let τi be the distance in time between observed sequence Xi and its parent node. Take τi → ∞ for all i, which implies that X1, …, XN ∼iid p∞.
The key implication of this assumption is that density estimation and fitness estimation are linked: the data follows X1, …, XN ∼iid p0 = p∞, and so if we can estimate p0 we can estimate the fitness. Example models include EVMutation [Hopf et al., 2017], DeepSequence [Riesselman et al., 2018], EVE [Frazer et al., 2021], etc. Note that although Assumptions 2.1 and 2.2 do not conflict directly, conclusions made based on them conflict in practice: PMs typically infer finite and different lengths for branches (i.e. τi < ∞), while FMs typically infer differences in fitness (i.e. f(x) ≠ C), even when applied to the same dataset.
1D Example If Asm. 2.2 does not hold, then there is no reason for the distribution of observed sequences X1, X2, … to follow p∞. We illustrate this with the most widely used example of a JFPM that does not use Assumptions 2.1 or 2.2: an Ornstein-Uhlenbeck tree (OUT) model [Felsenstein, 2004, Butler and King, 2004]. In this model, X is continuous, i.e. X ∈ ℝ, and evolves on a quadratic fitness landscape of the form f(x) ∝ −(x − μ)² + C according to Ornstein-Uhlenbeck dynamics, dXt = −(Xt − μ)dt + √(2σ²)dBt. The stationary distribution p∞ is Normal(μ, σ²). One can show (Appx. B.1) that for any phylogeny H,
(OUT observations). The distribution of observed genotypes X1:N is a multivariate normal distribution with mean μ1 and covariance Σ, where Σij = σ² exp(−tij(H)) and tij(H) is the total time of the shortest path between leaves i and j along the phylogeny H.
We drew samples from the OUT with a Kingman coalescent prior on H (Bertoin [2010], Def. 2.1) and plotted their density (Fig. 2A). Even as N → ∞, the distribution of samples does not follow p∞. Moreover, rerunning the process with a new sample from the prior yields a very different distribution of samples (Fig. 2B).
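The multivariate normal characterization above can be checked numerically. The sketch below assumes the covariance form Σij = σ² exp(−tij) and a hand-picked three-leaf tree (both stated in or consistent with the proposition above, but the specific numbers are illustrative), and confirms that sibling leaves remain strongly correlated, so the leaves are not an i.i.d. sample from p∞ = Normal(μ, σ²):

```python
import math, random

random.seed(0)
mu, sigma = 0.0, 1.0

# A small fixed phylogeny with 3 leaves: leaves 0 and 1 are siblings, leaf 2
# is an outgroup. t[i][j] is the total path time between leaves i and j.
t = [[0.0, 0.2, 2.0],
     [0.2, 0.0, 2.0],
     [2.0, 2.0, 0.0]]

# OU-tree covariance: Sigma_ij = sigma^2 * exp(-t_ij).
Sigma = [[sigma ** 2 * math.exp(-tij) for tij in row] for row in t]

def cholesky(A):
    # Standard Cholesky factorization A = L L^T for sampling.
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(A[i][i] - s) if i == j else (A[i][j] - s) / L[j][j]
    return L

L = cholesky(Sigma)

def draw():
    z = [random.gauss(0, 1) for _ in range(3)]
    return [mu + sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(3)]

samples = [draw() for _ in range(50_000)]

def corr(i, j):
    n = len(samples)
    mi = sum(s[i] for s in samples) / n
    mj = sum(s[j] for s in samples) / n
    cov = sum((s[i] - mi) * (s[j] - mj) for s in samples) / n
    vi = sum((s[i] - mi) ** 2 for s in samples) / n
    vj = sum((s[j] - mj) ** 2 for s in samples) / n
    return cov / math.sqrt(vi * vj)

# Sibling leaves stay strongly correlated; an i.i.d. sample from p_inf
# would have (near-)zero correlation between all pairs.
assert corr(0, 1) > 0.7 and corr(0, 2) < 0.3
```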
3 Non-identifiability
In this section we investigate whether, given infinite sequence data, it is possible to infer fitness f without Asm. 2.2, and conversely, whether it is possible to infer phylogeny H without Asm. 2.1. That is, we are interested in whether fitness and phylogeny are identifiable in JFPMs. We conclude they are not: given infinite data generated with any f and H, there exists some alternative f̃ and H̃, where H̃ satisfies Asm. 2.2, that explains the data equally well.
Naively, this result may be surprising: in FMs, each sequence is drawn independently, i.e. Xi ╨ Xj|H, f, while in JFPMs and PMs there is (in general) correlation between sequences, i.e. Xi and Xj are not conditionally independent given H and f. One might then hope that examining correlations between sequences would enable us to infer whether Asm. 2.2 holds. However, we can show that these correlations are uninformative due to a symmetry in phylogenetic models, exchangeability.
(Exchangeability). Let m(X1, X2, …) denote the marginal probability of an infinite set of sequences X1, X2, … integrating over all phylogenies, i.e. m(X1, X2, …) = ∫ p(X1, X2, … |H)p(H)dH. Then, for any permutation π of the integers, m(X1, X2, …) = m(Xπ(1), Xπ(2), …).
Exchangeability says that if we had observed the sequences in a different order, it would not change their probability. In general, models of sequences observed at the same time (i.e. the present day, T = 0) satisfy exchangeability; for instance, models with a Kingman coalescent prior are exchangeable [Bertoin, 2010, Drummond and Rambaut, 2007]. Exchangeability implies that fitness and phylogeny are not identifiable. In particular, even if X1, X2, … are generated from a JFPM with a finite branch length phylogeny H, we can describe the same data just as well using a model with an infinite branch length phylogeny (an FM):
(Non-identifiability). Assume X1, X2, … satisfy Assumption 3.1. Then with probability 1 there exists some function f̃ such that X1, X2, … ∼iid p0, where p0(x) ∝ exp(βf̃(x)). Proof. Applying de Finetti’s Theorem (Kallenberg [2002], Thm. 11.10), there almost surely exists a random measure G such that Xi | G ∼iid G for each i. Let pG(x) be the pmf of G (we assume x is a finite discrete sequence; we can also work with continuous genotypes assuming the pdf pG(x) exists). Set f̃(x) = β−1 log pG(x).
This result says that the observed sequences from an exchangeable JFPM, X1, X2, …, are precisely i.i.d. samples from some p0. Although in the standard tree representation the sequences are correlated given H and f, there must be some alternative description of the same process under which they are conditionally i.i.d. Fitness and phylogeny are thus non-identifiable: data generated from a JFPM with fitness f and phylogeny H can be described just as well using f̃ and H̃, and vice versa.
The biological intuition behind Thm. 3.2 is that if two sequences are similar to each other and distant from a third, they may be similar either because they are closely related (i.e. the distance τ to the most recent common ancestor is small) or because they are in a local maximum of the fitness landscape. Without further assumptions, we cannot tell the difference between these two explanations. The machine learning intuition is that evolution, as described by a JFPM, is in effect a Markov chain Monte Carlo process whose stationary distribution gives the fitness. However, the samples we observe may not be fully independent: each pair of samples was initialized from the same point (the most recent common ancestor), and the burn-in since that point may not be sufficiently long. Without independent samples, our estimate of the stationary distribution will be biased.
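The MCMC intuition can be demonstrated with a toy two-state Markov chain (an illustrative stand-in for sequence evolution, not from the analysis above): "leaves" initialized at a shared ancestral state and given a short burn-in stay biased toward that state, while a long burn-in recovers the stationary distribution:

```python
import random

random.seed(1)

# Toy two-state chain standing in for evolution on a fitness landscape;
# its stationary distribution pi = (0.8, 0.2) plays the role of p_inf.
P_stay = {0: 0.9, 1: 0.6}  # detailed balance: 0.8 * 0.1 = 0.2 * 0.4

def evolve(x, steps):
    for _ in range(steps):
        if random.random() > P_stay[x]:
            x = 1 - x
    return x

def leaf_freq_of_1(burnin, n=20_000):
    # All "leaves" descend from a common ancestor fixed at state 1;
    # burnin plays the role of the branch length tau.
    return sum(evolve(1, burnin) for _ in range(n)) / n

short, long_ = leaf_freq_of_1(2), leaf_freq_of_1(200)
assert short > 0.3              # short branches: biased toward the ancestor
assert abs(long_ - 0.2) < 0.02  # long branches: stationary freq. recovered
```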
Fitness inference as hyperparameter inference
While general, Thm. 3.2 is not constructive, and does not tell us what the distribution p0 actually is, or how exactly it differs from p∞. Thm. 3.2 leaves unclear how much we need to know to learn the fitness landscape: could we infer fitness f if we knew the parametric form of p∞, i.e. if we had some model ℳ and knew that p∞ ∈ ℳ? What if we also knew the underlying phylogeny H? In the long branch limit (Asm. 2.2), fitness is identifiable if H is known; if ℳ is also known, learning fitness is a matter of inferring model parameters. In the limit where all the branch lengths in the phylogeny are zero, the distribution of observations from a JFPM reduces to X1 ∼ p∞ and X1 = X2 = X3 = …. Here fitness is non-identifiable even if H and ℳ are known; learning fitness is a matter of learning from a single sample. In the realistic intermediate branch length case, if H and ℳ are known, we will show that learning fitness is essentially a matter of hyperparameter rather than parameter inference.
We demonstrate this last claim by approximating OUTs as Gaussian process latent variable models (GPLVMs), finding that fitness only appears as a hyperparameter of the GP. The GPLVMs have latent variables Z1, Z2, … that lie on the hyperbolic plane ℍ, and use the Gaussian process kernel k(·,·) = exp(−d(·, ·)), where d(·,·) is a distance metric over ℍ. Let 𝒲1(·,·) be the Wasserstein metric for distributions over infinite matrices, i.e. over ℝ∞×∞, using the sup norm on matrices.
(GPLVM approximation of OUT). Assume a prior over phylogenies H that is exchangeable in its leaves and where the minimum time between any pair of nodes is greater than η > 0 with probability 1. Define the leaf distance matrix s by sij = tij(H). For any ϵ > 0, there exists a.s. a GPLVM of the form Zi | G ∼iid G, X1:N | Z1:N ∼ Normal(μ1, σ²K) with Kij = k(Zi, Zj), where G is a random measure over ℍ, such that 𝒲1(p(Σ), p(σ²K)) ≤ ϵ, where Σij = σ² exp(−sij).
In the limit ϵ → 0, the OUT and GPLVM produce identical distributions over X1, X2, … a.e.
The proof is in Appx. B.2, and uses the embedding of Sarkar [2012]. This result says that, by embedding phylogenies H in a metric space, we can approximate an OUT arbitrarily well with a GPLVM; as the Wasserstein bound gets smaller, the distribution of covariance matrices of the two models gets closer. In the GPLVM, the observations are conditionally independent, Xi ╨ Xj|s, G, in line with Thm. 3.2. The phylogeny H enters the GPLVM only through the latent space embedding Z1, Z2,…. Learning phylogeny, given the fitness landscape, is thus essentially a matter of inferring latent variables [Riesselman et al., 2018, Ding et al., 2019]. The fitness landscape enters the GPLVM only through the prior on the Gaussian process (i.e. through μ and σ). Inferring fitness given phylogeny is thus essentially a matter of inferring hyperparameters. This is both good and bad news for fitness inference. On the one hand, hyperparameters are often learned in practice, and doing so can yield substantially better predictions, so we should be able to learn something about μ and σ given data (Williams and Rasmussen [2006], Chap. 5). On the other hand, hyperparameters are in general (though not always) non-identifiable, and therefore so is fitness [Mardia and Marshall, 1984]. Ho and Ané [2013] describe non-identifiability conditions for the OUT in particular. We conclude that even when H and ℳ are known, fitness inference in JFPMs is fundamentally challenging.
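For concreteness, here is a sketch of the kernel ingredients. The Poincaré disk is one standard model of the hyperbolic plane ℍ (the choice of model and the example embeddings below are illustrative assumptions, not taken from the proof):

```python
import math

def poincare_distance(u, v):
    # Geodesic distance on the Poincare disk model of the hyperbolic plane.
    du = 1 - (u[0] ** 2 + u[1] ** 2)
    dv = 1 - (v[0] ** 2 + v[1] ** 2)
    duv = (u[0] - v[0]) ** 2 + (u[1] - v[1]) ** 2
    return math.acosh(1 + 2 * duv / (du * dv))

def kernel(u, v):
    # The GP kernel of the GPLVM: k(., .) = exp(-d(., .)).
    return math.exp(-poincare_distance(u, v))

# Hypothetical latent embeddings of three leaves: z1 and z2 close together
# (recently diverged), z3 far away.
z1, z2, z3 = (0.1, 0.0), (0.15, 0.05), (-0.8, 0.0)
assert kernel(z1, z2) > kernel(z1, z3)     # closer leaves covary more
assert abs(kernel(z1, z1) - 1.0) < 1e-12   # zero distance: unit kernel
```

The kernel decays with hyperbolic distance, mirroring how the OUT covariance decays with path time along the tree.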
4 Blessings of misspecification
We have demonstrated that phylogenetic effects can produce a data distribution p0 that is not equal to the stationary distribution p∞, and exact inference of p∞ is in general impossible even with infinite data. Nonetheless, the practical success of fitness estimation methods suggests it is possible to at least approximate p∞ from observational sequence data. Recall that existing methods proceed by fitting a probabilistic model qθ ∈ ℳ = {qθ : θ ∈ Θ} to data X1:N, typically via maximum likelihood estimation or approximate Bayesian inference, and then using the predicted log density as an estimate of the fitness of a sequence x. Why does this approach provide empirically successful estimates of p∞? In this section we consider two hypotheses, either of which may hold true in theory. In Secs. 6-7 we develop and apply tests to evaluate them on real data.
Hypothesis #1 (informal). Fitness estimation methods succeed by finding qθ̂ ≈ p0, since for all practical purposes on real data, p0 = p∞.
This hypothesis would make sense if Asm. 2.2 held, i.e. branch lengths were long enough in real datasets for the distribution of observed sequences to be close to its stationary distribution. Under this explanation, better density estimators have been, and will continue to be, better fitness estimators. We should focus on developing models ℳ that are well-specified with respect to the data, i.e. p0 ∈ ℳ (Fig. 3A).
Hypothesis #2 (informal). Fitness estimation methods succeed by using models ℳ that are misspecified with respect to p0, i.e. p0 ∉ ℳ. The inferred model is then closer to p∞ than p0 itself.
To show this hypothesis is plausible, we prove that it is guaranteed to hold under general conditions. We study the projection of p0 onto ℳ via the Kullback-Leibler (KL) divergence, qθ* = argminθ∈Θ KL(p0 ‖ qθ). The KL projection is relevant because maximum likelihood estimation minimizes the approximate KL divergence between the data and the model, and the posterior in Bayesian inference asymptotically concentrates around the maximum likelihood estimator [Miller, 2021]. We thus expect the fit model to be close to qθ*, and get closer with N. Assume that ℳ is “log-convex”, meaning that for any θ, θ′ ∈ Θ and 0 < r < 1, there exists some θ″ such that qθ″(x) = qθ(x)^r qθ′(x)^(1−r) / Σx′ qθ(x′)^r qθ′(x′)^(1−r); examples of log-convex models include the Potts model, as well as all other exponential family models.
(Blessings of misspecification). Assume that the model ℳ is log-convex and well-specified with respect to the stationary distribution, i.e. p∞ ∈ ℳ. Assume qθ* exists and is unique. Then, if the model is misspecified with respect to the data distribution, i.e. p0 ∉ ℳ, we have KL(qθ* ‖ p∞) < KL(p0 ‖ p∞).
But if the model is well-specified, i.e. p0 ∈ ℳ, we have KL(qθ* ‖ p∞) = KL(p0 ‖ p∞). Proof. For part 1, apply Thm. 1 from Csiszar and Matus [2003]. For part 2, note that qθ* = p0 when p0 ∈ ℳ.
In words, the model projection qθ* is closer to p∞ than p0 so long as the model ℳ is misspecified with respect to p0 (Fig. 3B). To understand the biological intuition behind this result, consider a situation where two neutral mutations with no effect on fitness occur successively at different sites (Fig. 3C). Due to phylogenetic correlation, there is no observed sequence x* in which the second mutation is present but not the first, so an accurate density estimator will find p0(x*) ≈ 0. However, if we can guess correctly that the fitness landscape is independent across sites, then fitting a site-wise independent model ℳ will imply the mutation is allowed, qθ* (x*) > 0, correctly inferring p∞(x*) > 0.
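The two-site example can be worked through numerically. In the sketch below, p∞ is site-wise independent (here uniform), while phylogenetic correlation has deleted from p0 the sequence in which the second mutation appears without the first; the particular probabilities are illustrative assumptions. The KL projection of p0 onto the site-wise independent family (the product of its marginals, which minimizes KL(p0 ‖ q) over product distributions) "fills in" the missing sequence and lands closer to p∞:

```python
import math

def kl(p, q):
    # KL divergence, with the convention 0 * log(0/q) = 0.
    return sum(p[x] * math.log(p[x] / q[x]) for x in p if p[x] > 0)

X = ["00", "01", "10", "11"]

# True stationary distribution p_inf: site-wise independent (uniform),
# as in a fitness landscape with no epistasis.
p_inf = {x: 0.25 for x in X}

# Observed distribution p0: phylogenetic correlation means "01" (second
# mutation without the first) is never observed. Probabilities are
# illustrative assumptions.
p0 = {"00": 0.4, "01": 0.0, "10": 0.3, "11": 0.3}

# KL projection of p0 onto the site-wise independent family: the product
# of marginals.
m1 = p0["10"] + p0["11"]   # P(site 1 mutated)
m2 = p0["01"] + p0["11"]   # P(site 2 mutated)
q = {"00": (1 - m1) * (1 - m2), "01": (1 - m1) * m2,
     "10": m1 * (1 - m2), "11": m1 * m2}

assert q["01"] > 0                    # misspecified model allows the mutation
assert kl(q, p_inf) < kl(p0, p_inf)   # blessing of misspecification (Thm. 4.1)
```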
Under Hypothesis 2, progress in the field of fitness estimation has not come from building better density estimators (Hypothesis 1), but rather from an iterative process of (1) hypothesizing, based partly on biophysical knowledge, models that are (approximately) well-specified with respect to p∞ but not too flexible, such that p0 ∉ ℳ and then (2) comparing their density estimates against experimental fitness measurements. We will show that on real data, Hypothesis 1 can often be rejected in favor of Hypothesis 2.
5 Related Work
Efforts to account for the effects of phylogeny in fitness estimation have a long history [Lapedes et al., 1999]. Practical generative sequence models that explicitly account for both epistatic fitness landscapes and phylogeny have long been sought, but stymied primarily by computational challenges [Ingraham, 2018, Rodriguez Horta et al., 2019]. In their place, a variety of non-generative (and often heuristic) methods for correcting for phylogeny have been proposed, including data reweighting schemes [Marks et al., 2011, Rodriguez Horta et al., 2019], data segmentation schemes [Colavin et al., 2022], post-inference parameter adjustments [Dunn et al., 2008], covariance matrix denoising methods [Qin and Colwell, 2018], simulation based statistical testing [Rivas et al., 2017], and more. In this article, we show that deconvolving fitness and phylogeny is not just computationally hard, but also in general statistically impossible: fitness and phylogeny are non-identifiable. We further show that use of a misspecified parametric model can on its own (without further corrections) partially adjust for phylogenetic effects.
Our results also intersect with the literature on robust statistics: we can think of the observed data distribution p0 as a “distorted” version of the true distribution of interest p∞. However, in typical robust inference frameworks (e.g. Huber’s epsilon contamination model), the observed distribution differs from the true distribution by the addition of outliers [Huber, 1992, Steinhardt, 2018]. In our setup, on the other hand, inliers are deleted, as phylogenetic correlations can mean the effective support of p0 is smaller than that of p∞ (Fig. 2).
6 Diagnostic Method
In this section, we develop diagnostic methods to discriminate between Hypothesis 1 and Hypothesis 2 (Sec. 4) based on observational sequence data and experimental fitness measurements, and validate these diagnostics in simulation. Recall that under Hypothesis 2, the estimate from a parametric fitness model is a better estimate of fitness than the true data density p0, while under Hypothesis 1, p0 is better. Discriminating these two hypotheses on real data is nontrivial because we do not have access to p0. Ideally, then, a diagnostic test would evaluate the probability that the true density p0 outperforms at predicting fitness, taking into account uncertainty in what p0 could actually be, given the data. To accomplish this, we compute a posterior over p0 using a Bayesian nonparametric sequence model. In particular, we apply the Bayesian embedded autoregressive (BEAR) model, which can be scaled to terabytes of data and satisfies posterior consistency (Amin et al. [2021], Thm. 35):
(Summary of BEAR posterior consistency). Assume p0 is subexponential, i.e. for some t > 0, Ep0[exp(t|X|)] < ∞, where |X| is the length of sequence X. Assume the conditions on the prior detailed in Amin et al. [2021]. If X1, X2, … ∼ p0 i.i.d., then for M > 0 sufficiently large and ϵ ∈ (0, 1/2) sufficiently small, ΠBEAR(B(p0, M N^(−ϵ)) | X1:N) → 1 in probability, where B(p, r) is a Hellinger ball of radius r centered at p, and ΠBEAR(·|X1:N) is the BEAR posterior.
Crucially, this result implies that the BEAR posterior will converge to effectively any value of p0, no matter what p0 is (unlike a parametric model’s posterior). Moreover, BEAR quantifies uncertainty in its estimates, giving the range of possible values of p0 that are consistent with the evidence.
We construct our diagnostic test by comparing the fitness estimation performance of the fit parametric model qθ̂ to the range of possible performances of p0 estimated by BEAR. Let 𝒮f (p) be a scalar score evaluating how accurately a density p predicts fitness f. In practice, 𝒮f will be based on experimental and clinical measurements of quantities directly related to fitness.
Diagnostic test (Test Hypothesis 1 vs. Hypothesis 2.) Hypothesis 1: 𝒮f (p0) ≥ 𝒮f (qθ̂). Hypothesis 2: 𝒮f (p0) < 𝒮f (qθ̂). Accept Hypothesis 2 at significance level α > 0 if ΠBEAR(𝒮f (p0) < 𝒮f (qθ̂) | X1:N) > 1 − α. Accept Hypothesis 1 at significance level α if ΠBEAR(𝒮f (p0) ≥ 𝒮f (qθ̂) | X1:N) > 1 − α. So long as 𝒮f (p) is a well-behaved function of p (in particular, so long as 𝒮f is continuous in a neighborhood of p0 with respect to the topology of convergence in total variation), Thm. 6.1 implies that this diagnostic test will be asymptotically consistent, in the sense that it converges to the correct hypothesis in probability.
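A sketch of how such a test might be implemented, given posterior samples of 𝒮f (p0) (e.g. scores computed under draws from the BEAR posterior) and the parametric model's score; the function name and the toy posterior below are hypothetical:

```python
def diagnostic_test(score_posterior_samples, score_model, alpha=0.025):
    # Posterior probability that the parametric model beats the true
    # density p0 at fitness prediction, estimated from (hypothetical)
    # posterior samples of S_f(p0).
    n = len(score_posterior_samples)
    p_h2 = sum(s < score_model for s in score_posterior_samples) / n
    if p_h2 > 1 - alpha:
        return "Hypothesis 2"   # model outperforms p0
    if 1 - p_h2 > 1 - alpha:
        return "Hypothesis 1"   # p0 outperforms the model
    return "inconclusive"

# Toy usage: a posterior over S_f(p0) concentrated on [0.40, 0.50).
posterior = [0.40 + 0.001 * i for i in range(100)]
assert diagnostic_test(posterior, score_model=0.62) == "Hypothesis 2"
assert diagnostic_test(posterior, score_model=0.30) == "Hypothesis 1"
```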
Simulations
We next evaluate the performance of our diagnostic test on simulated data. We considered two scenarios, the first in which Hypothesis 1 holds, and the second in which Hypothesis 2 holds. In both, we let ℳ be a site-wise independent (SWI) model, in which each position of the sequence is drawn independently, i.e. Xl ∼ Categorical(vl) for l ∈ {1, …, |X|}. The parameter vl is in the simplex ΔB, where B + 1 is the alphabet size. (Further details in Appx. C.) In Scenario 1, the true data are generated according to a Potts model and p0 = p∞. In this scenario, the SWI model is misspecified, and misspecification is bad: using a more flexible model will produce an asymptotically more accurate estimate of p∞. We find that our diagnostic test asymptotically correctly accepts Hypothesis 1, in line with Thm. 6.1 (Figs. 4A and S3A). In Scenario 2, the true data are generated according to a JFPM with finite branch lengths, and p∞ ∈ ℳ while p0 ∉ ℳ. The mutational dynamics Pτ follow the Sella and Hirsh [2005] process. The phylogeny H is drawn from a Kingman coalescent. In this scenario, the SWI model is again misspecified, but misspecification is good: while the nonparametric BEAR model can achieve better density estimates than the SWI model (Fig. 4C), the SWI model outperforms BEAR at fitness estimation (Figs. 4D and S4). We find that our diagnostic test correctly accepts Hypothesis 2 (Figs. 4B and S3B).
A possible point of concern is that the test is poorly calibrated from a frequentist perspective, and in the low N regime accepts Hypothesis 2 in Scenario 1 more than 100α% of the time when the data is resampled from p0 (Fig. S5A). This behavior is common in nonparametric Bayesian tests, and not necessarily a problem: the test is still valid from a purely Bayesian perspective. Nevertheless, on real data we will check that we are close to the large N regime by (1) checking that the BEAR posterior predictive is at least as close to p0 as qθ̂ is (as measured by perplexity on held out data; Figs. 4C and S5B) and (2) examining the plot of the BEAR posterior over 𝒮f (p) as a function of N (as in Fig. 4AB), to check that it has converged.
7 Empirical Results
We now evaluate whether existing fitness estimation methods outperform the true data density p0, i.e. whether we can reject Hypothesis 1 in favor of Hypothesis 2 on real data.
Tasks
We consider two key prediction tasks where fitness models are applied in practice. The first task is to predict whether variants of a protein are functional, according to an experimental assay of protein function; the metric 𝒮f (·) is the Spearman correlation between p(x) and the assay result [Hopf et al., 2017]. There are typically ∼1000s of measurements per assay. The second task is to predict whether a variant of a protein observed in humans causes disease, according to clinical annotations; the metric 𝒮f (·) is the area under the ROC curve when p(x) is used to predict whether or not a variant is pathogenic [Frazer et al., 2021]. There are typically only a handful of labels for each gene. For the first task, we considered 37 different assays across 32 different protein families, and for the second task, 97 genes across 87 protein families; for each protein family, we assembled datasets of evolutionarily related sequences, following previous work. Note that across the 37 assays and 97 genes, the data used for 𝒮f comes from different experiments and different clinical evidence, often collected by different laboratories or doctors. As a consequence, our overall conclusions should be robust to the choice of 𝒮f.
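The two metrics can be implemented directly. The sketch below gives plain-Python versions of Spearman correlation (Pearson correlation of average ranks) and AUROC (pairwise win rate with half-credit for ties); it is an illustration of the scoring functions, not the evaluation code used in the paper:

```python
def rank(xs):
    # Average ranks, handling ties.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(a, b):
    # Spearman correlation = Pearson correlation of the ranks.
    ra, rb = rank(a), rank(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra) ** 0.5
    vb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (va * vb)

def auroc(scores, labels):
    # AUC = P(score of a random positive > score of a random negative),
    # counting ties as 1/2.
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Perfectly monotone predictions give Spearman 1 and AUC 1.
assert abs(spearman([0.1, 0.5, 0.9, 1.3], [1.0, 2.0, 3.0, 4.0]) - 1.0) < 1e-9
assert auroc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]) == 1.0
```

Both metrics are invariant to monotone transforms of p(x), which is why fitness need only be estimated up to a linear transform (Sec. 2).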
Models
We considered three existing fitness estimation models: a site-wise independent model (SWI), a Bayesian variational autoencoder (EVE [Frazer et al., 2021], which is similar to DeepSequence [Riesselman et al., 2018]), and a deep autoregressive model (Wavenet) [Shin et al., 2021]. Note that SWI and EVE, unlike Wavenet, require aligned sequences as training data. Details in Appx. D.
Results
Applied to the first prediction task, our diagnostic test accepts Hypothesis 2 at significance level α = 0.025 in 35/37 assays (95%) for SWI, 33/37 assays (89%) for EVE, and 36/37 assays (97%) for Wavenet (Fig. 5A). Applied to the second prediction task, our diagnostic test accepts Hypothesis 2 at significance level α = 0.025 in 31/97 genes (32%) for SWI and 46/97 genes (47%) for EVE (Fig. 5B). Thus, fitness estimation models are capable of outperforming the true data distribution p0. We found evidence for Hypothesis 1 in only a handful of examples: on the first task, Hypothesis 1 was accepted at significance level α = 0.025 in 0/37 assays for SWI, 3/37 assays (8%) for EVE, and 0/37 assays for Wavenet, while on the second task, Hypothesis 1 was accepted for 5/97 genes (5%) for SWI and 4/97 genes (4%) for EVE. We confirmed that the diagnostic test was in the large N regime: BEAR outperformed Wavenet at density estimation, providing better predictive performance on 27/37 assays (73%) and similar performance on the remaining 10 assays (Fig. S6). Example plots of the BEAR posterior’s convergence with N on the first prediction task showed convergence to values of 𝒮f well below that for parametric fitness estimation models (Figs. 5C and S7-S8). Overall, we conclude that there is strong evidence that existing fitness estimation methods reliably outperform the true data distribution p0 across a range of datasets and tasks.
To study the tradeoffs between density estimation and fitness estimation in more depth, we smoothly and nonparametrically relaxed a parametric autoregressive (AR) model (Appx. D.4). We embedded the AR model (a convolutional neural network) into a BEAR model, and fit the BEAR model with empirical Bayes. We found evidence that the AR model was misspecified on every dataset, following the methodology of Amin et al. [2021]: the optimal h selected by empirical Bayes was on the order of 1 − 10 in each dataset. Now, in the limit as the hyperparameter h → 0, the BEAR model collapses to its embedded AR model; so by scanning h from low to high values we can interpolate between the parametric and nonparametric regime. We find a smooth tradeoff between 𝒮f (p) and the likelihood of the data under the BEAR model, with higher h corresponding to better density estimation but worse fitness estimation (Fig. 5EF and S9). This relationship held across many datasets: the diagnostic test, evaluated against the AR model (the h → 0 limit), accepts Hypothesis 2 in 28/37 assays (76%), but Hypothesis 1 in only 6/37 (16%) (Fig. S10). These results confirm that making a model well-specified (relaxing from a parametric to a nonparametric model) can bring improved density estimation at the cost of worse fitness estimation.
8 Discussion
In this article, we have argued that better density estimation does not necessarily lead to better fitness estimation. Our results change the outlook for the future of fitness estimation: the common narrative that progress is inevitable through ever bigger models trained on ever bigger datasets appears to be false. Instead, progress will likely demand more fundamental methodological advances.
One future direction is to improve the current strategy of fitting misspecified models. For instance, it may be worthwhile to explore models that are less flexible than existing models and worse at density estimation, since they can increase the gap between kl(qθ*‖p∞) and kl(p0‖p∞) (Thm. 4.1). Another option is to improve the geometry of the model: while exponential family models are guaranteed to be log-convex (and thus can satisfy Thm. 4.1), we have no such guarantee for variational autoencoders or other neural network methods. Finally, uncertainty quantification is crucial for applications such as those in clinical genetics, but challenging in misspecified models [Szpiro et al., 2010, Miller and Dunson, 2019, Huggins and Miller, 2020]. Another future direction is to construct scalable JFPM models and carefully handle non-identifiability. Recent progress on amortized variational inference for phylogenetic models is promising [Vikram et al., 2019]. Non-identifiability is more challenging, and may require new assumptions and/or new methods of sensitivity analysis to infer the full set of fitness landscapes consistent with the data.
Finally, although this article has focused on technological applications of fitness models in solving prediction problems, fitness models also have implications for our fundamental understanding of evolution. Pure phylogeny models and pure fitness models present very different pictures of the past history of life: in PMs, similarities and differences among genetic sequences are determined primarily by history and ancestry (Asm. 2.1), while in FMs they are primarily determined by functional constraints (Asm. 2.2). PMs and FMs also present very different implications for the future of life: in PMs, the diversity of sequences seen in nature will likely expand dramatically going forward, while in FMs, the landscape of functional sequences has already been well-explored. Our results emphasize that where and to what extent each model offers an accurate picture of reality remains an open question.
A Evolutionary dynamics models
Application of the Sella and Hirsh [2005] model (Eqn. 1) in JFPMs rests on a number of assumptions; we briefly review the most relevant here.
When applying Eqn. 1 to amino acid sequences, as is typical for fitness estimation models, we ignore biases that come from the genetic code, which can modify the steady state probability of amino acids (in the absence of fitness effects) away from a uniform distribution. This is justified practically by the small effect sizes: if at steady state an amino acid has probability 1/64 instead of 1/20, the total difference in log probability is log(1/20) − log(1/64) ≈ 1, which is small compared to (for instance) the log probability differences relevant for disease risk prediction with fitness models, which are ≈ 10 (Frazer et al. [2021], Extended data Fig. 3). Moreover, this bias only contributes an overall shift in amino acid probabilities, independent of position, and so does not change our main theoretical results. We ignore biases caused by asymmetric mutation rates for analogous reasons (though note they are often included in PMs in practice) [Sella and Hirsh, 2005].
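The arithmetic behind this "small effect size" claim can be checked directly:

```python
import math

# Log-probability shift from genetic-code bias: a uniform distribution over
# 64 codons versus a uniform distribution over 20 amino acids.
shift = math.log(1 / 20) - math.log(1 / 64)  # = log(64/20), about 1.16
```

This is an order of magnitude smaller than the ≈ 10 log-probability differences relevant for disease risk prediction.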
The constant β depends on the effective population size, as well as the underlying population genetics model (Moran or Wright) and organismal ploidy (Sella and Hirsh [2005], Table 1). Following standard practice, we treat β as fixed for simplicity, though in reality it may vary over time and across lineages. Taking into account these possible changes clearly would not contradict our main theoretical result, that fitness and phylogeny are non-identifiable.
B Proofs
B.1 Proof of Proposition 2.3
N.b. this result is known in the literature (Ho and Ané [2013], Eqn. 1) but we are unaware of a proof, so we provide one here for completeness.
Proof. For notational convenience, we will work with a standardized OUT, with μ = 0 and σ = 1. The final result can be obtained by translating and scaling the distribution of leaves. The transition distribution from point x′ at time t′ to point X at time t under the Ornstein-Uhlenbeck (OU) process is X | x′ ∼ Normal(x′ e^{−(t−t′)}, 1 − e^{−2(t−t′)}). This distribution can be reparameterized in location-scale form as X = x′ e^{−(t−t′)} + √(1 − e^{−2(t−t′)}) ϵ with ϵ ∼ Normal(0, 1). As t → ∞ we reach the stationary distribution Normal(0, 1). Let b ∈ {1, …, B} index the branches of the tree, let λb be the length of branch b, and let j ∈ {1, …, N} index the leaves (observed species or sequences); see Fig. S1. We have assumed that the most recent common ancestor of the observed sequences was sampled from p∞; this can be represented by adding a single branch (indexed b = 1) to the root with length λ1 = ∞. Let ϵb be the noise describing the OU diffusion over each branch. Let ξj,b be the total time from leaf j to the nearest vertex on branch b, so long as branch b is on the path from leaf j to the root; otherwise, set ξj,b = ∞. For instance, in the diagram in Figure S1, we have ξ1,4 = 0, ξ1,2 = λ4, ξ1,1 = λ4 + λ2, and ξ1,5 = ξ1,6 = ξ1,7 = ξ1,3 = ∞. We can now write the leaf position as Xj = Σb e^{−ξj,b} √(1 − e^{−2λb}) ϵb, where the convention e^{−∞} = 0 removes the contribution of branches off the path from leaf j to the root.
Define the matrix Mj,b = e^{−ξj,b} √(1 − e^{−2λb}), such that Xj = Σb Mj,b ϵb. We can now describe the complete leaf distribution as (X1, …, XN)⊤ = M(ϵ1, …, ϵB)⊤ with (ϵ1, …, ϵB)⊤ ∼ Normal(0, IB), where IB is the B-dimensional identity matrix. Thus, according to the location-scale representation of the multivariate normal, (X1, …, XN)⊤ ∼ Normal(0, MM⊤) (Eqn. 13). We can simplify the covariance matrix Σ := MM⊤. First, the diagonal entries telescope to one: Σj,j = Σb e^{−2ξj,b}(1 − e^{−2λb}) = Σb (e^{−2ξj,b} − e^{−2(ξj,b+λb)}) = 1, since ξj,b + λb = ξj,b′ for the next branch b′ on the path to the root, and the final term vanishes as λ1 = ∞. Before introducing the notation required to derive the general result, it's helpful to get a sense of how the derivation works; in the example tree (Figure S1), Σ1,2 = e^{−(ξ1,2+ξ2,2)}(1 − e^{−2λ2}) + e^{−(ξ1,1+ξ2,1)}(1 − e^{−2λ1}) = e^{−(ξ1,2+ξ2,2)}. The sum over b telescopes, leaving only the initial term, which corresponds to the total time between leaf node 1 and leaf node 2. To construct the general result, define b̂(j, j′) as the branch whose later node is the most recent common ancestor of leaves j and j′. In the example in Figure S2, b̂ = 4. Let R be an ordered list of branches from b̂(j, j′) to b = 1, the earliest branch. In the example in Figure S2, R = [4, 2, 1]. Finally, let tjj′ be the length of the shortest path from leaf j to leaf j′, i.e. the time from the most recent common ancestor to j plus the time to j′. In the example in Figure S2, t2,4 = λ5 + λ6 + λ8. We now have Σj,j′ = Σb∈R e^{−(ξj,b+ξj′,b)}(1 − e^{−2λb}). Breaking down the telescoping sum, and using the fact that the final element of R is b = 1 with λ1 = ∞, Σj,j′ = e^{−(ξj,b̂+ξj′,b̂)}. So we have the simple result that the covariance matrix depends just on the divergence times between leaves, Σj,j′ = e^{−tjj′}. Translating the distribution Eqn. 13 by μ and scaling by σ yields the result.
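The telescoping argument can be checked numerically on a small tree: construct M from the branch lengths λb and leaf-to-branch times ξj,b, and compare MM⊤ to the matrix of e^{−tjj′}. The four-leaf tree below is an arbitrary example of our own, not the tree of Fig. S1:

```python
import numpy as np

# Branch lengths for an arbitrary 4-leaf example tree: branch 1 is the
# infinite root branch; branches 2,3 are internal; leaf branches 4,5 hang
# off branch 2 (leaves 1,2) and leaf branches 6,7 off branch 3 (leaves 3,4).
lam = {2: 0.3, 3: 0.5, 4: 0.2, 5: 0.4, 6: 0.1, 7: 0.6}
# Path from each leaf to the root: (branch b, time xi_{j,b} to reach it).
paths = {
    1: [(4, 0.0), (2, 0.2), (1, 0.5)],
    2: [(5, 0.0), (2, 0.4), (1, 0.7)],
    3: [(6, 0.0), (3, 0.1), (1, 0.6)],
    4: [(7, 0.0), (3, 0.6), (1, 1.1)],
}

branches = [1, 2, 3, 4, 5, 6, 7]
M = np.zeros((4, len(branches)))
for j, path in paths.items():
    for b, xi in path:
        # M_{j,b} = exp(-xi_{j,b}) sqrt(1 - exp(-2 lam_b)); lam_1 = infinity,
        # so the root-branch factor is exactly 1.
        scale = 1.0 if b == 1 else np.sqrt(1 - np.exp(-2 * lam[b]))
        M[j - 1, branches.index(b)] = np.exp(-xi) * scale

Sigma = M @ M.T

# Pairwise divergence times t_{jj'} (leaf-to-leaf path lengths).
t = np.array([[0.0, 0.6, 1.1, 1.6],
              [0.6, 0.0, 1.3, 1.8],
              [1.1, 1.3, 0.0, 0.7],
              [1.6, 1.8, 0.7, 0.0]])
```

Both the unit diagonal and the off-diagonal identity Σjj′ = e^{−tjj′} hold to machine precision.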
B.2 Proof of Theorem 3.3
Before proving the result, we briefly clarify a definition in the statement of the theorem:
Definition B.1 (Exchangeable in leaves). Let H be a tree with countably infinite leaves and let Hπ be a permutation of a phylogeny in its leaves, i.e. the same tree H with the leaves observed in a different order, according to a permutation π. A distribution over phylogenies is exchangeable in its leaves if p(H) = p(Hπ) for any permutation π.
Proof. Outline: First, using the results from Sarkar [2012], we construct an embedding for each tree into the hyperbolic plane, being careful that the embedding preserves exchangeability. Second, we apply de Finetti’s Theorem to obtain the conditionally independent representation of the joint distribution of Z1, Z2,…. Third, we use the distortion bound from Sarkar [2012] to bound the Wasserstein distance between p(ν) and .
First we describe the Sarkar [2012] (1 + ϵ) distortion embedding algorithm setup. Vertices in phylogenetic trees have maximum degree three, and, by assumption, the minimum edge length in a tree H is greater than η > 0 with probability one. For any ϵ′ > 0, choose a ρ < π/3 and a scale factor λ (depending on ϵ′, ρ, η, and the Gaussian curvature k of the hyperbolic plane ℍ; for most hyperbolic geometry models, and in particular the Lorentz manifold, k = −1). Then, let h1(H), h2(H), … be the positions of the leaves in the embedding of H produced by the (1 + ϵ) distortion embedding algorithm in Sarkar [2012], using edge scale factor λ and ρ-separated cones with cone angle 2π/3 − 2ρ. Taking the last line of the proof of Theorem 6 in Sarkar [2012], we are guaranteed that, even for a countably infinite number of leaves, the embedded pairwise distances match the scaled tree distances up to a (1 + ϵ′) distortion factor (Eqn. 16), where i, i′ ∈ ℕ := {1, 2, …} index leaves and dℍ is the hyperbolic distance function.
Next we will modify the embedding function h to ensure that the distribution of embedded leaves is exchangeable. Let [H] be the set of phylogenetic trees that are equivalent to H up to reordering of the vertices. For each equivalence class [H] we choose one ordering of the vertices to be the canonical tree Ĥ ([H]), and for any tree H let πc(H) be the leaf permutation such that the reordered tree . Now define the modified leaf embedding function where π(H) is the inverse permutation of πc(H). Since by assumption the prior p(H) on the phylogenetic tree is exchangeable, we can rewrite p(H) using the induced distribution over equivalence classes p([H]) as where Permutation is the uniform distribution over all permutations of ℕ := {1, 2…}. We now define the distribution over leaf embeddings as which we can rewrite as The distribution p(Z1, Z2, …) is therefore exchangeable. Applying de Finetti’s Theorem [Kallenberg, 2002] we have a.s. where G is a random measure distributed according to a prior 𝒢. Moreover, the embedding distortion bounds (Eqn. 16) are preserved for each H, since and by the same logic We will now construct the Wasserstein bound. Define the joint distribution over ν and , where we have chosen . Note that the marginal distribution of ν matches its definition in the statement of the theorem, and that, applying Eqn. 17 and Eqn. 18, the marginal distribution of also matches its definition. Using the fact that log is a monotonically increasing function, Eqn. 19 gives and similarly using the bound from Eqn. 20, . Thus, with probability 1 under p(H), Recall that the Wasserstein distance between the distribution of two random variables ν and can be written as where 𝒥 is the set of joint distributions with marginals corresponding to the distributions of ν and (Dudley [2002], Chap. 11.8). Using the joint distribution in Eqn. 21, the Wasserstein distance is bounded by Now consider the case where . (N.b. 
in this case, we do not need to assume that the minimum time between nodes in H is greater than η > 0.) Since the Wasserstein metric is a metric on the space of probability distributions (Dudley [2002] Lemma 11.8.3), a.e.. Using the standard properties of Gaussian processes (Williams and Rasmussen [2006], Chap. 2), the GPLVM model (Eqn. 5) can be written as which is equivalent to the OUT distribution, So the distribution p(X1:∞) produced by the GPLVM is equivalent to the distribution produced by the OUT model a.e..
C Simulation Details
In both scenarios, we generated sequences of fixed length |X| = 30, with an alphabet size of B + 1 = 4 (corresponding to nucleotides).
Scenario 1 We simulated from a Potts model, p∞(x) ∝ exp(f(x)) with f(x) = Σl,b hl,b xl,b + Σl<l′ Σb,b′ el,l′,b,b′ xl,b xl′,b′, where h is the sitewise energies, e is the pairwise energies, x is a one-hot sequence encoding, l indexes sequence positions and b indexes letters. Following the simulations in Ingraham and Marks [2017], which were intended to roughly match the statistics of typical real protein Potts models, we drew hlb ∼ InvGamma(2, 0.8) and drew the pairwise energies e according to the corresponding distribution in Ingraham and Marks [2017]. The energies h and e were drawn once, and the same values used across independent simulations. We sampled from the model using a Gibbs sampler with 100 steps of burn-in and 10 parallel chains using the code from Ingraham and Marks [2017] (https://github.com/debbiemarkslab/persistent-vi). We shuffled the resulting samples to remove autocorrelation.
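A minimal Gibbs sampler for a Potts model of this general form is sketched below; the energy draws here are simple placeholders of our own, not the Ingraham and Marks [2017] settings, and the sequential site-update loop is for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
L, B = 30, 4                                  # sequence length, alphabet size
h = rng.gamma(2.0, 0.5, size=(L, B))          # placeholder sitewise energies
e = 0.1 * rng.standard_normal((L, L, B, B))   # placeholder pairwise energies
e = (e + e.transpose(1, 0, 3, 2)) / 2         # symmetrize couplings
for l in range(L):
    e[l, l] = 0.0                             # no self-couplings

def gibbs_sample(n_steps=100, x=None):
    """Resample each site from its conditional p(x_l | x_{-l}) ∝ exp(energy)."""
    if x is None:
        x = rng.integers(0, B, size=L)
    for _ in range(n_steps):
        for l in range(L):
            # Conditional logits: sitewise term plus couplings to other sites.
            logits = h[l] + sum(e[l, m, :, x[m]] for m in range(L) if m != l)
            p = np.exp(logits - logits.max())
            x[l] = rng.choice(B, p=p / p.sum())
    return x

x = gibbs_sample()
```

As in the text, one would run multiple chains, discard burn-in, and shuffle samples before use.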
Scenario 2 We used a site-wise independent fitness function, f(x) = Σl hl · xl, with site-wise residue biases hl, where xl is a one-hot encoding of the letter at the l-th position of x. To generate phylogenetically correlated sequences, we sampled phylogenetic trees from a Kingman Coalescent (Bertoin [2010], Def. 2.1) with rate 1. Starting from a random sequence drawn from the steady state distribution at the root, we evolved the sequence simulating a Wright process in a haploid population (Sella and Hirsh [2005], Eqn. 3) according to the tree and fitness function. In particular, for sequences x0, x that are one mutation away, the mutation rate is given by the fixation probability formula of Sella and Hirsh [2005, Eqn. 3], where we set the effective population size to Neff = 10000. This stochastic process has the steady state given in Sella and Hirsh [2005, Eqn. 7].
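Sampling from the Kingman coalescent is straightforward: with k lineages remaining, wait an Exponential(k(k−1)/2) time and merge a uniformly chosen pair. A minimal sketch (function and variable names are ours):

```python
import random

def sample_kingman_tree(n, seed=0):
    """Sample a Kingman coalescent genealogy on n leaves with rate 1.

    Lineage pairs coalesce at total rate k(k-1)/2 when k lineages remain.
    Returns the list of merge events (time, left_subtree, right_subtree);
    the last event is the root.
    """
    rng = random.Random(seed)
    lineages = list(range(n))   # leaves are labeled 0..n-1
    t, events = 0.0, []
    while len(lineages) > 1:
        k = len(lineages)
        t += rng.expovariate(k * (k - 1) / 2)   # exponential waiting time
        i, j = sorted(rng.sample(range(k), 2), reverse=True)
        a, b = lineages.pop(i), lineages.pop(j)
        lineages.append((t, a, b))
        events.append((t, a, b))
    return events

events = sample_kingman_tree(5)
```

Sequences would then be evolved down the sampled tree under the mutation process described above.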
SWI model We fit the SWI model with maximum likelihood estimation.
BEAR model In these simulations, we used a vanilla BEAR model with a uniform embedded AR model (i.e. a Bayesian Markov model) for simplicity. We set the Dirichlet prior concentration to the constant α = 0.5. Based on the theoretical analysis in Amin et al. [2021] (Thm. 35), we used a prior on lags whose form depends on the alphabet size B (4 for nucleotides). We inferred the prior via empirical Bayes, marginalizing over the transition probabilities following the protocol in Amin et al. [2021]. Conditional on lag L, sampling from the posterior over the BEAR model is straightforward thanks to Dirichlet-Categorical conjugacy.
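The conjugacy argument is concrete: with a Dirichlet(α) prior on each k-mer context's transition probabilities and observed transition counts n, the posterior is Dirichlet(α + n), so posterior samples are direct draws. A minimal sketch of this Bayesian Markov model (names and toy data are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
ALPHABET = "ACGT"

def transition_counts(seqs, lag):
    """Count next-letter occurrences for every observed lag-mer context."""
    counts = {}
    for s in seqs:
        for i in range(len(s) - lag):
            ctx, nxt = s[i:i + lag], s[i + lag]
            counts.setdefault(ctx, np.zeros(4))[ALPHABET.index(nxt)] += 1
    return counts

def sample_posterior(counts, alpha=0.5):
    """Draw one Markov model from the Dirichlet-Categorical posterior:
    transition probs for context ctx ~ Dirichlet(alpha + counts[ctx])."""
    return {ctx: rng.dirichlet(alpha + n) for ctx, n in counts.items()}

seqs = ["ACGTACGT", "ACGTTGCA"]   # toy data
model = sample_posterior(transition_counts(seqs, lag=2))
```

Repeated posterior draws of this kind are what produce the credible intervals over 𝒮f (p) used throughout.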
Evaluation We defined 𝒮f following standard protocols for fitness estimation models. In particular, we let 𝒮f (p) be the Spearman correlation between p(x) and f(x) for x ∈ Λ, where Λ consists of all possible single point mutations (i.e. single letter changes) of an initial (“wild-type”) sequence. The wild-type sequence was chosen as the most likely sequence under p∞, computed exactly for Scenario 2 and estimated based on the 10⁶ samples for Scenario 1.
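Computing 𝒮f reduces to enumerating single point mutants of the wild type and rank-correlating model scores with fitness. A sketch (the sitewise model here is a placeholder; the rank-correlation helper ignores ties, which is adequate for continuous scores):

```python
import numpy as np
from itertools import product

ALPHABET = "ACGT"

def single_mutants(wt):
    """All sequences exactly one letter change away from the wild type."""
    return [wt[:l] + b + wt[l + 1:]
            for l, b in product(range(len(wt)), ALPHABET) if b != wt[l]]

def spearman(a, b):
    """Rank correlation via Pearson correlation of ranks (no tie handling)."""
    rank = lambda v: np.argsort(np.argsort(v))
    return np.corrcoef(rank(a), rank(b))[0, 1]

def S_f(wt, log_p, f):
    """Spearman correlation between model scores and fitness over all
    single point mutants of the wild-type sequence."""
    muts = single_mutants(wt)
    return spearman([log_p(x) for x in muts], [f(x) for x in muts])

# Placeholder sitewise scores, for illustration only: when the model's
# log-probability equals the fitness, the correlation is exactly 1.
h = np.random.default_rng(0).standard_normal((8, len(ALPHABET)))
log_p = lambda x: sum(h[l, ALPHABET.index(b)] for l, b in enumerate(x))
rho = S_f("ACGTACGT", log_p, log_p)
```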
To estimate model perplexity (Fig. 4C and S5B), we used N = 10,000 independent sequences from p0 and computed the per-residue perplexity, exp(−Σn log p(Xn) / Σn |Xn|), where |Xn| is the sequence length and p(Xn) is the probability of the sequence under the model.
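Assuming the standard per-residue perplexity — exp of the negative total log-probability divided by the total number of residues — a minimal implementation:

```python
import math

def per_residue_perplexity(log_probs, lengths):
    """perplexity = exp(-sum_n log p(X_n) / sum_n |X_n|)."""
    return math.exp(-sum(log_probs) / sum(lengths))

# Sanity check: i.i.d. uniform letters over a 4-letter alphabet give
# per-residue perplexity exactly 4, regardless of sequence lengths.
lp = [10 * math.log(0.25), 7 * math.log(0.25)]   # two sequences, lengths 10 and 7
ppl = per_residue_perplexity(lp, [10, 7])
```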
To estimate the KL to the fitness distribution in Scenario 2 (Fig. 4D), we sampled N = 10,000 independent sequences from p∞, {X1, …, XN}, and estimated kl(p∞‖p) ≈ −H(p∞) − (1/N) Σn log p(Xn), where H(p∞) is the entropy of p∞, which can be computed analytically. For BEAR, we plotted the KL to the posterior predictive, which, using Jensen's inequality, can also be seen to lower bound 𝔼p∼ΠBEAR(p|Xtrain)[kl(p∞‖p)], where ΠBEAR(p|Xtrain) is the BEAR posterior learned from the training dataset.
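This Monte Carlo estimator follows from kl(p∞‖p) = −H(p∞) − 𝔼p∞[log p(X)]; a sketch validated against an exact two-outcome KL (the distributions here are toy choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_estimate(log_p_model, samples, entropy_p_inf):
    """Monte Carlo estimate of kl(p_inf || p_model) from samples X_n ~ p_inf:
    kl ≈ -H(p_inf) - (1/N) sum_n log p_model(X_n)."""
    return -entropy_p_inf - np.mean([log_p_model(x) for x in samples])

# Sanity check against the exact KL between two biased coins.
p, q = np.array([0.7, 0.3]), np.array([0.6, 0.4])
exact = np.sum(p * np.log(p / q))
H_p = -np.sum(p * np.log(p))
samples = rng.choice(2, size=50000, p=p)
est = kl_estimate(lambda x: np.log(q[x]), samples, H_p)
```

With 50,000 samples the estimate agrees with the exact value to well within 0.01 nats.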
D Empirical Results Details
D.1 Data
Prediction task #1 (functional effect)
Following standard practice, we report the absolute value of the Spearman correlation as 𝒮f (p), since in some assays a negative change in the measured quantity corresponds to larger fitness (note that in all cases the predicted directionality of the effect under each model was correct). We focused on single amino acid substitutions, taking only those for which EVE was able to make a prediction (EVE is limited by its reliance on a multiple sequence alignment). We used the same data as in Shin et al. [2021], Table 1, taking the 37 experiments performed on the following 32 proteins: UBC9_HUMAN, UBE4B_MOUSE, P84126_THETH, HIS7_YEAST, BLAT_ECOLX, IF1_ECOLI, PTEN_HUMAN, B3VI55_LIPST, GAL4_YEAST, POLG_HCVJF, PABP_YEAST, CALM1_HUMAN, AMIE_PSEAE, TRPC_THEMA, RASH_HUMAN, YAP1_HUMAN, TRPC_SULSO, DLG4_RAT, BG_STRSQ, KKA2_KLEPN, HSP82_YEAST, B3VI55_LIPST (stabilized), MK01_HUMAN, HIV_BF520 env, SUMO1_HUMAN, RL401_YEAST, PA_FLU, HG_FLU, TPMT_HUMAN, HIV_BG505 env, TPK1_HUMAN, and MTH3_HAEAE (stabilized).
Prediction task #2 (pathogenicity)
We used the pathogenicity labels of single amino acid substitutions curated from ClinVar [Landrum et al., 2018] in Frazer et al. [2021]. We considered labels for 87 human proteins less than 250 amino acids in length: AICDA, AQP2, ATPF2, B9D2, CAH5A, CAV3, CD40L, CF410, CHC10, CIA30, CLD16, CLN8, COQ4, CRBB2, CRGD, CTRC, CXB1, CXB2, CXB3, CXB4, CXB6, CY24A, DERM, DGUOK, DHDDS, EDAD, EFTS, ELNE, ETFB, ETHE1, EXOS3, FGF10, FGF23, FOXE3, FRDA, GP1BB, HBB, HEM4, HSPB1, HSPB8, IFM5, IFT27, JAGN1, KAD2, KCNE1, KCNE2, KITM, LITAF, MMAB, MMAC, MPU1, MYPR, NDP, NDUS8, NFU1, NKX25, NMNA1, OPA3, PAHX, PDYN, PMM2, PMP22, PNPH, PNPO, PROP1, PSPC, PTPS, RASH, RNH2A, S5A2, SAP3, SBDS, SCO1, SDHB, SDHF2, SIX1, SIX3, SOMA, TMM70, TNNT2, TPK1, TPM2, TR13B, TWST1, VHL, XLRS1, ZC4H2.
Training data
All models were trained on datasets of protein sequences gathered as described in Shin et al. [2021] for the functional effect prediction task and as described in Frazer et al. [2021] for the pathogenicity prediction task. SWI and EVE were trained on the multiple sequence alignment, while Wavenet and BEAR were trained on the raw sequences as described in Shin et al. [2021]. All datasets were uniformly subsampled to produce a 75%/25% train/test split.
D.2 Models and code
The SWI model was trained via maximum likelihood.
The Wavenet model was trained via maximum likelihood with the default architecture, hyperparameters and training protocol described in Shin et al. [2021], for 100,000 steps. Code is from https://github.com/debbiemarkslab/SeqDesign. We did not apply the Wavenet model to the second prediction task, as it has only previously been developed for the first task.
The EVE model was trained via variational inference, using the same architecture, hyperparameters, and training protocol described in Frazer et al. [2021]. Code is from https://github.com/debbiemarkslab/EVE. To match the protocol of the original paper, EVE was – unlike SWI, Wavenet and BEAR – (a) trained on the full dataset rather than the training set alone, and (b) used a sequence reweighting heuristic.
The BEAR model used an embedded convolutional neural network (the same architecture as used in Amin et al. [2021], with layer 1 width of 16, filter width of 5 and 30 filters total) and a uniform prior over lags 2, 3, 5, 7, and 9. Code is from https://github.com/debbiemarkslab/BEAR. The model was trained using empirical Bayes, as described in Amin et al. [2021], for 500 steps with a batch size of 500000 kmers. To construct posterior credible intervals, we used 41 samples from the posterior for prediction task #1, and 1000 samples for prediction task #2.
We computed the heldout perplexity (Eqn. 26) for the BEAR posterior predictive and for Wavenet to produce Fig. S6.
D.3 Convergence experiments
To plot the convergence of the posterior over p0 as a function of N (Fig. 5CD, S7 and S8), we used a vanilla BEAR model, a nonparametric Bayesian Markov model. Note that here we fixed the embedded AR model, rather than refitting with larger N, so that we could analyze the convergence behavior with reference to the asymptotic results of Thm. 35 in Amin et al. [2021], which does not take into account empirical Bayes. We set the Dirichlet concentration to 10 and used a prior over lags as in Eqn. 25.
D.4 Interpolation experiments
We fit a BEAR model using the architecture and training protocol described in Sec. D.2, optimizing both the parameters of the AR model and h via empirical Bayes. We then varied h from its optimized value, and recalculated the total marginal likelihood and the posterior distribution over 𝒮f (p) (Fig. 5EF and S9). We also computed the value of 𝒮f (p) for the fit BEAR model in the h → 0 limit (Fig. S10).
E Supplementary code
The supplementary code provides a Jupyter notebook (example.ipynb) illustrating the application of our BEAR diagnostic test on simulated data.
Footnotes
5. Contact: eweinstein{at}g.harvard.edu, alanamin{at}g.harvard.edu, debbie{at}hms.harvard.edu
1. Note that we cannot do this comparison for SWI or EVE since they are alignment-based [Weinstein and Marks, 2021].