Bayesian Inference of Dependent Population Dynamics in Coalescent Models

The coalescent is a powerful statistical framework that allows us to infer past population dynamics leveraging the ancestral relationships reconstructed from sampled molecular sequence data. In many biomedical applications, such as in the study of infectious diseases, cell development, and tumorgenesis, several distinct populations share evolutionary history and therefore become dependent. The inference of such dependence is a highly important, yet a challenging problem. With advances in sequencing technologies, we are well positioned to exploit the wealth of high-resolution biological data for tackling this problem. Here, we present a novel probabilistic model that relies on jointly distributed Markov random fields. We use this model to estimate past population dynamics of dependent populations and to quantify their degree of dependence. An essential feature of our approach is the ability to track the time-varying association between the populations while making minimal assumptions on their functional shapes via Markov random field priors. We provide nonparametric estimators, extensions of our base model that integrate multiple data sources, and fast scalable inference algorithms. We test our method using simulated data under various dependent population histories and demonstrate the utility of our model in shedding light on evolutionary histories of different variants of SARS-CoV-2.


Introduction
Bayesian inference of past population sizes from genetic data is an important task in molecular epidemiology of infectious diseases and other biomedical disciplines (Volz et al., 2009;Bouckaert et al., 2019;Stadler et al., 2021). While many computational and methodological advances have been developed in the last 20 years, there are still many challenges in using these models on real data applications (See Cappello et al. (2021) for a recent review).
An important limitation of current methods is their lack of flexibility in modeling dependent populations. Current models typically assume either single population size dynamics or a structured population undergoing migration. In particular, when modeling structured populations, these models resort to simplistic assumptions on the population size dynamics in order to gain computational tractability and parameter identifiability (Kühnert et al., 2016;Müller et al., 2017). However, often there are situations where the population is neither one large unit nor completely divided into isolated subpopulations. Different subpopulations may share the same environment and ancestry, and therefore their population dynamics are dependent. For example, in the study of infectious diseases, a new variant can emerge and become dominant over the preexisting variants in the population. In this case, all variants share the same environment and local non-pharmaceutical interventions. In tumorgenesis, the cancer cell population within an individual undergoes clonal expansions in the tumor microenvironment, often resulting in a mixture of genotypically and phenotypically heterogenous cell subpopulations. Identifying and quantifying such expansions is crucial for timely detection and personalized oncology for cancer (Caswell-Jin et al., 2021). Despite its importance, to the best of our knowledge, no realistic methods exist for jointly modeling and studying the evolutionary trajectories of dependent populations.
In this work, we propose a nonparametric method for inferring past population dynamics of dependent populations and for estimating the relative difference in their evolutionary trajectories over time. The proposed method bypasses the problems inherent in modeling complex dependent population dynamics by a priori modeling the dependency of population sizes via Gaussian Markov random fields. Our approach incorporates other types of data informative of the parameter of interest such as temporal sampling information of molecular sequences.
Our contributions can be summarized as follows.
• We propose a nonparametric Bayesian framework for inferring dependent population dynamics from genetic data in the coalescent framework. The model makes minimal assumptions on the functional form of the population trajectories and their dependency. Despite its flexibility, we prove that model parameters are identifiable.
• We extend our framework to jointly model the ancestral and sampling processes incorporating sampling times as an additional source of information. We empirically validate the performance of our methods and show the ability of the sampling-aware methods in reducing bias and improving estimation.
• We demonstrate the utility of our methods in providing new insights into the evolutionary dynamics of SARS-CoV-2 novel variants.

Coalescent Model
Example of a Genealogy with Sequential Sampling. s 1 , . . . , s 4 and t 8 , . . . ,t 2 indicate sampling times (red dotted lines) and coalescent times (blue dotted lines), respectively. The time increases backwards in time starting with s 1 = t 9 = 0 as the present time. T MRCA indicates the time to the most recent common ancestor at the root.
The coalescent (Kingman, 1982) is a popular prior model on genealogies. The genealogies are timed and rooted binary trees that represent the ancestry of a sample of n individuals from a population. We assume that (n ℓ ) ℓ=1:m samples are sequentially collected at m sampling times denoted by s = (s ℓ ) ℓ=1:m , with s 1 = 0, s j−1 < s j for j = 2, . . . , m, and n = ∑ m j=1 n j is the total number of samples. In the genealogy, pairs of lineages merge backwards in time into a common ancestor at coalescent times denoted by t = (t n , ...,t 2 ) ( Figure 1). The rate at which pairs of lineages coalesce depends on the number of lineages and the effective population size (EPS) denoted by (N e (t)) t≥0 := N e . Under this model, the density of a genealogy g is: where is the coalescent factor-a combinatorial factor of the number of extant lineages The EPS is generally interpreted as a relative measure of genetic diversity (Wakeley & Sargsyan, 2009).

Bayesian Inference of EPS
We start by assuming that a given genealogy is available to us. Bayesian inference of the EPS then targets P(N e , τ τ τ | g) ∝ P(g | N e )P(N e | τ τ τ)P(τ τ τ), where P(N e | τ τ τ) is a prior distribution on N e that depends on a vector of hyperparameters τ τ τ. A common choice for a prior on N e is to assume an underlying finite-dimensional parametric structure. In particular, this choice makes the calculation of the integral in Eq. (1) computationally tractable. A popular strategy is to use a regular grid of M + 1 points (k i ) 1:M+1 and assume that N e is well approximated by a piece-wise constant function with M change points. In Eq.
(3), we model N e in log-scale, however this is not strictly necessary. Many approaches place a Gaussian Markov random field (GMRF) prior on θ θ θ , for example Gill et al. (2013); Volz & Didelot 2/11 (2018); Faulkner et al. (2020). An alternative to the piece-wise constant assumption is to use Gaussian process priors. However, the posterior distribution becomes doubly intractable because the likelihood function depends on an infinite-dimensional integral over Gaussian processes. In Palacios & Minin (2013), the authors augmented the posterior distribution with auxiliary variables via thinning of Poisson processes (Adams et al., 2009) in order to gain tractability. In applications, the use of GMRFs has shown a numerical performance comparable to GPs. We conjecture that the reason of this comparable performance is the flexibility in the choice of the number of parameters M. This is the case because neither the grid cell boundaries (k i ) 1:M+1 nor M depend on the data, g = (g, t, s, n) or the model parameters N e (t). We refer to Faulkner et al. (2020) for guidance on the choice of M.
There are exact and approximate algorithms to sample from Eq. (2). Standard software packages (Suchard et al., 2018;Bouckaert et al., 2019) employ Markov chain Monte Carlo (MCMC) methods with carefully designed transition kernels. Recent algorithmic advances employ Hamiltonian Monte Carlo (HMC). Faulkner et al. (2020) implemented an algorithm to sample from Eq. (2) under several Markov random fields priors on STAN (Carpenter et al., 2017), and Lan et al. (2015) employed split HMC (Shahbaba et al., 2014). Among the approximate methods, inference can be efficiently done with Integrated Nested Laplace Approximation (INLA) (Rue et al., 2009). Palacios & Minin (2012) use INLA under a GMRF prior on N e , showing that the approximate posterior is remarkably similar to that obtained by MCMC-based algorithms.

Preferential Sampling
The standard coalescent model implicitly assumes the sampling times are either fixed or functionally independent of the underlying population dynamics. However, in many applications, such as in infectious diseases, the sampling frequency is often highly correlated with EPS: more samples are sequenced when the EPS is larger. This situation, known as preferential sampling in spatial statistics (Diggle et al., 2010), allows us to model sampling frequency information in order to improve inference about EPS, reducing estimation bias and improving the accuracy of model parameter inference (Volz & Frost, 2014;Karcher et al., 2016Karcher et al., , 2020Parag et al., 2020;Cappello & Palacios, 2021).
The probabilistic dependency of sampling time distribution on population dynamics can be modeled as an inhomogeneous Poisson point process (iPPP) with a rate λ (t) that depends on EPS. We adopt the adaptive preferential sampling framework (Cappello & Palacios, 2021) that employs a flexible approach for modeling the time-varying dependency between N e (t) and and N e (t) are unknown continuous functions with GMRF priors.

Methods
Here, we propose a flexible and scalable framework for modeling the genealogies of two samples (not necessarily of the same size) from two populations with dependent population dynamics. The goals are (i) to estimate the EPSs of the two populations, (ii) to account for the dependency between them, (iii) and to quantify estimation uncertainty.

Coalescent for Dependent Evolutionary Histories
Let g A = (g A , t A , s A , n A ) and g B = (g B , t B , s B , n B ) be the genealogies of samples collected from populations A and B respectively, and let N A e and N B e denote their corresponding EPSs. Here, (s A , s B ) are vectors of sampling times, and (n A , n B ) are number of samples collected. Standard coalescent-based inference methodologies approximate posterior distributions P(N A e |g A ) and P(N B e |g B ) ignoring any association between the underlying population processes of the two populations. We model the association between the two population size trajectories, which can change over time, with a time-varying parameter linking N A e and N B e : Here, γ := (γ(t)) t≥0 is the time-varying coefficient that describes how the association between the two population processes changes over time, leading to N A e = N e and N B e = γN e . The interpretation of γ provides information on the association between two populations. For example, a growing trend in γ signals the existence of an association between two EPSs: N B e is growing faster than N A e . It, however, does not necessarily imply a positive association because, for example, if N B e were growing, N A e could be either growing at a slower rate than N B e or be decreasing and still have an increasing γ. The number of parameters of the two GMRFs tunes how free the dependence is allowed to vary between the two populations. Throughout the section, we employ GMRF priors on N e and γ; however, the framework is flexible to any prior distribution. It is possible to go fully nonparametric employing a GP (Palacios & Minin, 2013), or a different kind of MRFs, for example, the Horseshoe MRFs (Faulkner & Minin, 2018).

3/11
Our model in Eq. (4) is "asymmetric", in the sense that the baseline population EPS is tilted by the time-varying coefficient γ to define the EPS of a new population. The choice is motivated by the actual scientific question we are examining, where a new population develops from an existing one. An alternative model is to allow every EPS to be a function of a shared base population size trajectory. This second formulation is more symmetrical, although the parameters lose interpretability and are not identifiable.
We also considered a simpler parametric model suggested in Cappello et al. (2021). Here, the population dependence is modeled through two time-independent parameters α and β : However, the strict parametric dependence enforced in the model increases the risk of model misspecification by not allowing changes in the association of the two population processes. For example, the model support excludes a time shift when N B e (t) = N A e (t + s) for s > 0. A consequence of this potential model misspecification is biased estimation of N B e , α, and β ; we will provide numerical proofs of this claim in Section 4.

Preferential Sampling and Dependent Evolutionary Histories
A common approach for studying the relative growth between two population dynamics is to model how the sampling frequency of molecular sequences changes over time in the two populations. This is frequently done by fitting logistic growth models to the sampling dates only (Volz et al., 2021;Davies et al., 2021). The probabilistic models discussed in Section 3.1 take a different stance and employ molecular data to reconstruct the genealogies, which in turn are used to infer the population processes jointly. Here, we extend the probabilistic models of Section 3.1, either model Eq. (4) or (5), and model the genealogies jointly with the observed sampling frequencies from both populations as follows: Eq. (6) builds on the preferential sampling framework described in Section 2.3: the sampling process is an iPPP whose rate is a function of both, the EPS and a time-varying parameter ζ . Here, the sampling process of population A and B will have distinct rates, λ A = ζ N A e and λ B = ζ N B e . Although we assumed a shared ζ function, our implementation considers the possibility of one ζ function per population.

Inference
We start describing the inference procedure for the parameters of the model in Eq. (4). We employ the same discretization described in Section 2.2: given a regular grid (k i ) 1:M+1 , we assume that N e is governed by parameters θ θ θ through the map given in Eq. (3); γ is governed by parameters Let τ τ τ be the vector of precision hyperparameters of the GMRFs. Based on Palacios & Minin (2012), we use INLA for obtaining marginal posterior medians and marginal 95% Bayesian credible intervals (BCI).

Identifiability
Parameter identifiability is an essential property of models used in statistical learning. Roughly speaking, it refers to the theoretical possibility of uniquely estimating a parameter vector if an infinite amount of data is available (Rothenberg, 1971;Bishop, 2006;Watanabe, 2009). Note that this is a property of the generative model, not of the estimator used.
Definition 1 can be rewritten stating that a parameter is identifiable if the map θ → P θ is one-to-one. A model is nonidentifiable if two or more parameter vectors are observationally equivalent. An example of nonidentifiable parameters arises if Y ∼ Poisson(λ 1 λ 2 ): for a pair (λ ′ 1 , λ ′ 2 ), any combination ( 1 c λ ′ 1 , cλ ′ 2 ) with c > 0 will be observationally equivalent. Parameter redundancy is another common violation (Catchpole & Morgan, 1997).
The parameters of models in Eqs. (4) and (5) have a scientific interpretation. The lack of identifiability could hinder the validity of the scientific insights gained from using our methodology. There is a large literature showing that many models in evolutionary biology and ecology are not identifiable (Little et al., 2010;Cappello et al., 2021). Under the assumption that N e and γ are piecewise-constant, i.e. N e = (N e,i ) 1:M and γ = (γ i ) 1:M , we prove that the models introduced in this section possess this important property.   (2017) for a review. We employ results in Rothenberg (1971), which consists in showing that the expected Fisher information (EFI) is non-singular. The trees represent large genealogies of many sequences from three different populations labeled A, B and C at the tips of the trees. Each lineage represents subtrees of individuals whose rate of coalescence is dictated by the color of the branch. (A) Nested. The blue branch indicates coalescent events happen at rate that depends on N A e = N e , green branch indicates a coalescent rate that depends on N B e = γ 1 N e and pink branch indicates a coalescent rate that depends on N C e = γ 1 γ 2 N e . (B) Radial. The blue branch indicates a coalescent rate that depends on N A e = N e , the green branch indicates a coalescent rate that depends on N B e = γ 1 N e and pink branch indicates a colescent rate that depends on N C e = γ 2 N e .

Extension to multiple populations
Our work is centered on a two-population model because our main application target is the estimation of a relative selective advantage of a newly emerging viral variant over an existing variant. We further assume that the new variant originated from the standing variant. However, the framework can easily be extended to include multiple populations viewing the two-population model as a building block for a general genealogical model with multiple populations. In this generalization, the EPS of a child population is a function of the EPS of its parental population. This gives the two types of hierarchical structures displayed in Figure 2: • Nested populations. Each effective population is a function of its immediate preceding one (Figure 2(A)). The baseline population "A" with EPS N A e = N e evolves into a second population "B" with EPS N B e = γ 1 N e , which then in turn evolves into a population "C" with EPS N C e = γ 1 γ 2 N e . • "Radial" populations. Multiple populations evolve from the baseline population "A" with EPS N A e = N e (Figure 2(B)). Then, population "B" will have EPS N B e = γ 1 N e , and population "C" N C e = γ 2 N e .

5/11
The hierarchical structure for EPSs of multiple populations can be constructed iteratively as combinations of the two base structures of Figure 2. This construction maintains parameter identifiability. The inference framework for the multiplepopulations extension still follows Section 3.3: GMRF priors on the base EPS N e and the γ i parameters and the parameter inference with INLA approximation.
We note that, while the above extension can model more general evolutionary scenarios, it still cannot accommodate all possible evolutionary histories. For example, it cannot model ancestral populations that are not observed in the samples.

Experiments
We show the effectiveness of our methodology by applying it to synthetic and real-world data. In the synthetic data section, we offer evidence of its numerical accuracy. The real-data section illustrates the scientific insights that can be obtained by applying our methodology to SARS-CoV-2 data. The code to reproduce the following studies is available at https://github.com/lorenzocapp/adasel.   Figure 3 depicts six pairs (N A e , N B e ) used to simulate data. The trajectories mimic challenging realistic scenarios typically encountered in applications. The scenarios include several types of dependence between N A e and N B e , ranging from perfect association (Scenario 6) to no association (Scenario 2; here, our model is misspecified). We simulated 100 datasets per scenario.

6/11
For a fixed pair of EPSs, we sampled (s A , s B ), (n A , n B ), and (t A , t B ) with n = 200. Specifics of (N A e , N B e ) and the data-generating mechanism are in the Supplementary Material.
We compare our hierarchical approach with and without preferential sampling-referred as "adaPop" and "adaPop+Pref", respectively-to the parametric method, referred here as "parPop". We also include a neutral estimator "noPop", which ignores the association between populations and estimates N A e and N B e . We compare how accurately the four methodologies estimate γ, N A e and N B e . Note that, while parPop and noPop do not approximate γ, the posterior P(γ|s A s B , t A , t B ) can be empirically approximated by taking samples from P(N A e , N B e |s A s B , t A , t B ) and summarizing γ = N B e /N A e . Let f be either γ, N A e or N B e . We evaluate the performance of the methods using three metrics (listed below) computed on a regular grid of time points (v i ) 1:K . Here, the grid is defined with K = 100 on the interval [0, 0.6 T MRCA ], where T MRCA denotes the time to the most recent common ancestor at the root.
It is a measure of bias.
are respectively the 97.5% and 2.5% quantiles of the posterior distribution of f (v i ). It describes average width of the credible region.
It is a measure of 95% credible intervals coverage. Table 1 reports the average value of each statistic pooling together all datasets, all scenarios, and all grid points. Hence, each entry should represent the average performance of a method across the variety of challenging scenarios considered. We also average the performance metrics of N A e and N B e . A more granular view of the performance of each method by scenario is given in the Supplementary Material. adaPop+Pref stands out as the best performing method, exhibiting the lowest bias (DEV) and the narrowest credible regions (RWD). The coverage (ENV) is slightly worse than noPop; however, noPop achieves the slightly higher coverage with much wider credible regions. adaPop has a very similar performance to adaPop+Pref.
Sampling times were sampled at uniform and not proportionally to N A e and N B e . This is the case where preferential sampling is less informative. Remarkably, adaPop+Pref still has narrower credible regions than adaPop. The reason is that the time-varying coefficient ζ is able too capture a variety of sampling protocols, including uniform sampling. We interpret this as empirical evidence of the adaptivity of the model.
adaPop's higher DEVN e is mostly attributed to poorer performance in estimating N B e in Scenario 2 (see Supplementary Material). If we exclude Scenario 2, adaPop is superior to noPop and parPop across all metrics. Similarly, parPop is very competitive in the scenarios where the model is correctly specified (Scenarios 1, 2, and 6). The average performance deteriorates due to poorer performance in the remaining scenarios.
Lastly, an essential feature of adaPop and adaPop+Pref is that they are the most stable methodologies. This can be seen from the standard deviations of the statistics in the bracket. The Supplementary Material includes further analyses where the robustness and performance of the models are evaluated.

Real Data
Since its introduction, SARS-CoV-2 has undergone rapid evolution resulting in novel variants, some of which possess transmissibility, pathogenicity, or antigenicity advantages over the preexisting resident variants (Harvey et al., 2021;Cevik et al., 2021). A variant of recent interest is the delta variant (Pango lineage B.1.617.2 and AY lineages (Rambaut et al., 2020)). We compare this variant to other SARS-CoV-2 variants in two countries, South Korea and Italy.
We analyzed high-coverage complete sequences publicly available in GISAID (Shu & McCauley, 2017) collected from South Korea and Italy during 2021-03-01 to 2021-09-30. For each country, we subsampled two sets of 150 sequences: one with the delta variant and the other without the delta variant. We then estimated the maximum credibility clade (MCC) trees-the tree in the posterior sample with the maximum sum of the posterior clade probabilities-of samples from each variant group of each country independently with BEAST2 (Bouckaert et al., 2019). In our study, we set the population of the delta variant sequences as population A, and of the non-delta variant as population B. Here, we discuss the results using the parPop and adaPop+Pref models. The additional results with other methods and further details of the sequence analysis pipeline, together with the inferred MCC trees, appear in the Supplementary Materials.
The orange-shaded heat-maps in Figure 4 depict the number of samples collected over time of the two populations in South Korea and Italy (as represented in our sub-sampled data sets). The observed pattern is consistent in the two countries: delta viral samples predominantly collected in the summer of 2021, while non-delta samples collected in the spring and early summer of 2021. This is consistent with the general observation of rapid spread of the delta variant that has progressively replaced the preexisting non-delta variants (such as alpha) since its introduction (Mlcochova et al., 2021;del Rio et al., 2021). Figure 4(B), (C), (F) and (G) depict the estimated posterior distribution of EPS of the two viral populations in the two countries obtained with adaPref+Pop (noPop estimates are qualitatively similar, see Supplementary Material). In both countries, the delta population has experienced prolonged growth since its inception, halting its growth in the last month. On the other hand, the non-delta population EPS grew until approximately the end of May (the growth looks more pronounced in Italy), then it roughly plateaued.
We next examine the quantification of the dependence between the two populations: β for parPop and γ for adaPop+Pref. In Figure 4(A), the posterior density of β estimated under the parPop model has mean 0.36 with 95% credible interval (0.19, 0.51), well below 1 (red line), suggesting EPS growth is more pronounced among sequences having the delta variant in South Korea. On the other hand, the posterior density of β has mean 1.62 with credible interval (1.25, 1.95) (Figure 4(E)) indicating the EPS of the non-delta variant grows faster than the delta variant in Italy under the parPop model. The result suggests that the growth of the non-delta population dominated that of the delta population which contradicts the general consensus (Mlcochova et al., 2021;del Rio et al., 2021). Under the adaPop+Pref model, the monotonic decrease in the growth of the non-delta variant EPS compared to that of the delta variant EPS in South Korea is apparent in γ-trajectory (Figure 4(D)); this is consistent with parPop β . However, the γ-trajectory of Italy (Figure 4(H)) shows that, compared to the growth of the delta variant EPS, the non-delta variant underwent an initial phase of faster growth and then transitioned to slower growth around mid-March of 2021. The inability of the parPop model (Figure 4(E)) to capture the two-phase population dynamics in Italy, that are evident in the adaPop+Pref model (Figure 4(H)), suggests that more flexible approaches proposed by our work are needed for accommodating the broad range of population dynamics scenarios encountered in real applications.

Discussion
We have developed a coalescent-based Bayesian methodology for inferring dependent population size trajectories and quantifying such dependency. We make minimal assumptions on the functional form of population size trajectories and allow the 8/11 dependence between the two populations to vary over time. We also present a sampling-aware model for leveraging additional information contained in sampling times for reduced bias and improved inference accuracy. Although the proposed models have an increased number of parameter, we prove that the models are identifiable. We have shown that our adaPop+Pref outperforms other methods in synthetic data with known ground truth and that our adaptive method can detect changes in population size dynamics that are otherwise undetected with other models. We make this point more precise in our SARS-CoV-2 analyses. We have decided to infer parameters via a numerically approximated method that relies on Laplace approximations (INLA) for computational speed. However, our proposed models can, in principle, be implemented in any of the MCMC standard approaches. Implementing our models in an MCMC approach would allow to infer population size trajectories from molecular sequence and sequencing time information directly and it is a subject of future development. This extension would account for uncertainty in the genealogy, an important component that is missing in the analyses presented.
We see a growing number of applications of the coalescent requiring the modeling of complex demographic histories. In the introduction, we mentioned a few, such as viral epidemiology and cancer evolution. There is extensive literature on models incorporating more and more realistic features, for example, a detailed description of migration histories. The quest for realism and scientifically meaningful parameters comes at the cost of computational tractability and leads, sometimes, to issues of model identifiability. Our work is somewhat motivated by these problems. Our method is equipped with high performance and accuracy, due to its scalability, interpretability, and parameter identifiability; such properties are lacking in many complex models in biology and epidemiology. We see our proposal as a "hybrid approach" that allows scientists to quantify the relative advantage of one population over another while still retaining a fairly parsimonious model. Such approach will be invaluable across many biomedical disciplines for studying complex time-varying dependent evolutionary dynamics of populations.