The infinite alleles model revisited: a Gibbs sampling approach

The SARS-CoV-2 outbreak started in late 2019 in the Hubei province in China, and the first viral sequence was made available to the scientific community in early January 2020. From there, viral genomes from all over the world have followed at an outstanding rate, already exceeding 10^5 by early May 2020 and 10^6 by early March 2021. Phylodynamics methods have been designed in recent years to process such datasets and infer past population dynamics and sampling intensities. However, the unprecedented scale of the SARS-CoV-2 dataset now calls for new methodological developments, relying e.g. on simplifying assumptions of the mutation process. In this article, I build on the infinite alleles model stemming from the field of population genetics to develop a new Bayesian statistical method allowing the joint reconstruction of the outbreak's effective population sizes and sampling intensities through time. This relies on prior conjugacy properties that prove useful both to develop a Gibbs sampler and to gain intuition on the way different parameters of the model are linked and inferred. I finally illustrate the use of this method on SARS-CoV-2 genomes sequenced during the first wave of the outbreak in four distinct European countries, thus offering a new perspective on the evolution of the sampling intensity through time in these countries, from genetic data only.


Introduction
The concept of descent with modification is central in modern biology, where biological entities evolving across various spatial and temporal scales (e.g. cells, individuals, species) can be seen as atomic particles carrying molecular sequences that are passed on to their descendants while accumulating small gradual changes. As a result, the patterns of genetic differentiation observed in a sample of particles depend on the underlying population dynamics, and can be analysed to retrieve information on these unobserved population dynamics. This aim is shared by two related fields called population genetics and phylodynamics.
Population genetics and phylodynamics Molecular sequences are nowadays routinely collected and analyzed throughout the tree of life to address a wealth of biological questions, across fields such as ecology, anthropology, macroevolution, developmental biology, and epidemiology. In this manuscript, I focus on methods designed to investigate the population dynamics of a system through the analysis of genetic polymorphism. These methods have been applied across many temporal and geographical scales, e.g. in ecology to study the population size trajectory of species (Parag et al. 2021), in epidemiology to estimate the prevalence of an infectious disease from sequences of pathogens sampled during an outbreak, or in paleontology to study the species diversity trajectory of a clade over macroevolutionary time-scales (Morlon et al. 2011). While both fields address similar questions and may seem intertwined, their methodologies have remained quite distinct, giving rise to two branches in the literature.
Population genetics approaches primarily aim at studying genetic variation within populations through time, based on genetic data. The recognition of the central influence of demography on genetic variation fostered the development of statistical methods aiming at inferring past demography from observed genetic polymorphism. The field has been very active since the beginning of the 1970s, and most of the early theory is now digested in textbooks presenting the coalescent, with or without demographic complications (Tavaré 2004; Hein et al. 2004; Durrett 2008).
This early work relied on simplifying assumptions such as the infinite alleles or infinite sites models, for genetic data sampled at a single point in time. Elegant analytical developments of the probability distribution of summary statistics were derived, allowing one to investigate, e.g., population growth (Kuhner et al. 1998), population structure (Beerli and Felsenstein 1999), selection (McDonald and Kreitman 1991), or the presence of recombination (Hudson 1983). Contemporary empirical applications usually deal with more complicated demographic scenarios and samples taken at multiple points in time, and have thus adopted two different strategies. First, some studies rely on Principal Component Analysis or summary statistics that have been previously derived in very simple settings (Novembre et al. 2008). This approach is extremely fast and appropriate for an initial exploration of the dataset, but it lacks a quantitative aspect. Second, the rise of computational power fostered the development of Approximate Bayesian Computation to fit parameter-rich models using computationally intensive procedures (Skoglund et al. 2014; Kim et al. 2017).
Phylodynamics approaches stem from the field of phylogenetics, which aims at reconstructing the ancestral relationships between individuals, together with their evolutionary parameters, based on genetic data. In this field as well, researchers have acknowledged the key role of demography in shaping the phylogenetic tree and hence the observed molecular patterns. This in turn promoted the rise of a subfield called phylodynamics, aiming at inferring the demography using molecular sequences, by integrating over precise phylogenetic relationships. The two main demography frameworks used in the field are (i) the coalescent, borrowed from Kingman (1982)'s work in population genetics, and (ii) birth-death processes, relying on seminal results by Kendall (1948) and Nee et al. (1994). Compared to population genetics methods, there has been a cultural change towards more precise estimation relying on computationally intensive Bayesian inference methods. These methods rely on many superimposed model layers, among which e.g. models of clock evolution (Lepage et al. 2007), models of across-locus variation (Lartillot and Philippe 2004), and models of molecular substitution (Lanave et al. 1984). Moreover, phylodynamic methods have been developed to take into account serially sampled molecular data (Stadler 2010). Population dynamics has been modeled either in a coalescent framework using e.g. time-varying population size (Pybus and Harvey 2000; Drummond et al. 2002; Pybus et al. 2003), or in a birth-death framework, where the population is already free to fluctuate with constant birth and death rates, but larger variations can be allowed using time-varying parameters (Morlon et al. 2011; Stadler et al. 2013). As an alternative to time-dependent processes, some studies have attempted to introduce diversity-dependence processes in either a coalescent framework (Volz et al. 2009), or a birth-death framework (Etienne et al. 2012; Leventhal et al. 2013).
Population structure can be modeled in a coalescent framework with discrete demes exchanging genes through migration (Ewing et al. 2004; Vaughan et al. 2014; Müller et al. 2017). In a birth-death process, structure is modeled using so-called multi-type birth-death processes, where different types are associated with different birth and death parameters, and individuals from a given type can either give birth to other types or directly change type (Maddison et al. 2007; Beaulieu and O'Meara 2016; Maliet et al. 2019; Barido-Sottani et al. 2020). Finally, methods have been developed to jointly consider occurrence and molecular data. In a coalescent framework, occurrences are assumed to be the result of a Poisson sampling process among the total population (Rasmussen et al. 2011; Parag et al. 2020). In a birth-death process, an individual can be sampled and sequenced at a given rate, in which case it appears in the tree, or sampled without being sequenced at another rate, in which case it is a simple occurrence (Vaughan et al. 2019; Gupta et al. 2020; Manceau et al. 2021).
Motivating example In this paper, I focus on inferring population dynamics for biological systems where (i) genetic polymorphism is sampled through time, and (ii) an ever-increasing amount of sequences is being collected, challenging state-of-the-art methods for phylodynamic analysis.
The current SARS-CoV-2 pandemic provides the archetypal dataset that I propose to model. The outbreak survey started in late 2019, and the first viral genome was already published and made available for research on the 10th of January 2020. New sequences followed at an outstanding rate. By early May 2020, more than 10^5 viral sequences were already available from across the world, and the 10^6 mark was reached by early March 2021. Developing statistical tools capable of keeping up with the pace of data acquisition thus represents a methodological challenge. SARS-CoV-2 genomes have already been used to address a number of epidemiology-related questions, among which assessing the number and origins of introductions in a given locality (Gonzalez-Reiche et al. 2020; Lemey et al. 2020), the magnitude of super-spreading events (Li et al. 2020), or estimating the reproductive numbers of local outbreaks. Yet, phylogenetics/phylodynamics approaches do not scale well to large numbers of sequences, and empirical applications typically require subsampling the original datasets.
The virus genome is approximately 3 × 10^4 nucleotides long, and its mutation rate, quite heterogeneous across the genome, has been estimated at around 22 mutations per year per genome (Hadfield et al. 2018). As a result, new alleles and polymorphic sites of the genome have accumulated in the data at a slow pace. Together with the outstanding number of sequences, this rather slow mutation rate advocates for the use of simplifying assumptions of the mutation process.

The infinite alleles model Phylodynamic analyses generally assume a very realistic mutation process. Sequences have a finite number of sites, and each mutation hits a randomly chosen nucleotide, with a substitution process ranging from the Jukes-Cantor to the Generalized Time Reversible model. Selection might even be modeled, and nucleotides might have different mutation rates along the sequence. While these realistic models are very well designed to study fine-grain processes or processes happening over long timescales, they do not appear to be the best option to process large numbers of similar sequences. In this manuscript, I take a step back and aim at bringing back into fashion a simplifying assumption that was traditionally considered in the early days of the neighbouring field of population genetics, namely the infinite alleles model. Under this model, each mutation hitting a sequence always creates a new allele, never observed before. If we imagine that each sequence is a ball and an allele is a colour, genetic data thus simplifies into a sampling record of coloured balls through time, as illustrated in Figure 1 (Durrett 2008).
Analytical tractability is the main reason for using the infinite alleles model nowadays. Following the past history of one genetic sequence backward in time, it can either (i) coalesce with another lineage that belongs to the same allele; or (ii) if it is the only representative of its allele, find the mutation that gave rise to it. Once this original mutation is found, everything else in the past is forgotten. The infinite alleles model was studied extensively during the golden age of population genetics, in combination with the coalescent model and for sequences sampled at a unique point in time. A closed-form analytical characterization of the probability distribution of the allele frequency spectrum in this setting exists, called Ewens' sampling formula (Ewens 1972).
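As a concrete illustration of this tractability, Ewens' sampling formula can be evaluated directly. The sketch below (an illustrative helper, not part of the code released with this article) computes the probability of an allele frequency spectrum a, where a[j-1] is the number of alleles carried by exactly j individuals, given a scaled mutation rate θ:

```python
from math import factorial, prod

def ewens_probability(a, theta):
    """Ewens' sampling formula: probability of the allele frequency
    spectrum a = (a_1, ..., a_n), where a_j counts the alleles carried
    by exactly j individuals, for scaled mutation rate theta."""
    n = sum(j * a_j for j, a_j in enumerate(a, start=1))
    rising = prod(theta + k for k in range(n))  # rising factorial theta^(n)
    return (factorial(n) / rising) * prod(
        (theta / j) ** a_j / factorial(a_j) for j, a_j in enumerate(a, start=1)
    )
```

For a sample of two genes, for instance, the two possible spectra (two singletons, or one doubleton) have probabilities θ/(1 + θ) and 1/(1 + θ), which sum to one.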
The dynamics of the colour assemblage through time is informative of the underlying population dynamics that we are interested in inferring. I propose to work under a Bayesian framework, and to rely on population dynamics and sampling process assumptions similar to what has recently been used in phylodynamics. To ensure fast convergence of the Markov Chain Monte Carlo (MCMC) method used for inference, the model (i) is carefully built such that data augmentation can be performed efficiently, and (ii) relies on prior conjugacy properties and Gibbs sampling moves. This approach has been successfully applied to other phylogenetics methods before, showing much faster convergence of MCMC methods relying on Gibbs sampling moves as compared to Metropolis-Hastings moves (Lartillot 2006). Further, prior conjugacy properties allow one to build a better intuition on the interactions between different parameters, which proves particularly convenient for the choice of prior distributions. As compared to state-of-the-art phylodynamics methods, I aim at integrating over the unknown ancestral relationships more efficiently, with the hope of enabling dataset analysis on a larger scale.
Manuscript outline In Section 2, I introduce in more detail the model assumptions, before turning to the inference strategy in Section 3. I then present in Section 4 some sanity checks and validate the inference method on simulated data. An empirical application in Section 5 illustrates the use of the method on SARS-CoV-2 sequences sampled during the first wave hitting Europe in 2020. Finally, I discuss in Section 6 the results of this paper as well as the future research challenges it opens. This manuscript is released along with the code implementing the method; details on the implementation and use of the code are provided in Supp. Mat. C.
2 Model and notation

I build here on work by Parag et al. (2021) and Karcher et al. (2020), who both consider a sampling process on top of a coalescent model with piecewise-constant effective population sizes. The coalescent process is very conveniently described backward in time; time will thus be, throughout the manuscript, the calendar time before present, in units of days for empirical applications, with t = 0 at present and t → ∞ in the past.

Model parameters
The model is built around the following four key parameters.
First, the past effective population size N_t is piecewise-constant on a partition of (0, +∞) into p successive disjoint intervals (Δ_j^(N))_{j=0}^{p−1}, delimited by 0, ∞, and the p − 1 times (t_j^(N))_{j=1}^{p−1}, i.e. N_t = N_j for all t ∈ Δ_j^(N). Second, following Parag et al. (2021), the past sampling intensity S_t is also a piecewise-constant function, on a possibly different partition (Δ_j^(S))_{j=0}^{p′−1} of (0, ∞), delimited by 0, ∞, and the p′ − 1 times (t_j^(S))_{j=1}^{p′−1}, i.e. S_t = S_j for all t ∈ Δ_j^(S). Last, the mutation rate µ and generation time g are constant through time. In a Bayesian framework, these parameters are random variables which are assigned prior distributions. The effective population sizes (N_j)_{j=0}^{p−1} are assumed to be a priori distributed according to a Generalized Inverse Gaussian distribution, while the sampling intensities (S_j)_{j=0}^{p′−1} are assumed to be a priori distributed according to a Gamma distribution. Finally, the mutation rate µ and generation time g are respectively a priori distributed according to a Gamma and an Inverse-Gamma distribution. The choice of these prior distributions is explained in Section 3, when discussing the posterior inference of these variables.
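Evaluating such a piecewise-constant trajectory at an arbitrary backward time t is a simple lookup. The following sketch (illustrative names only, not the released implementation) returns the value N_t or S_t given the interior breakpoints of the partition:

```python
import bisect

def piecewise_constant(t, breakpoints, values):
    """Evaluate a piecewise-constant function of backward time t.
    breakpoints: sorted interior times t_1 < ... < t_{p-1};
    values: the p values taken on successive intervals, values[j]
    applying on [t_j, t_{j+1}) with t_0 = 0 and t_p = +infinity."""
    return values[bisect.bisect_right(breakpoints, t)]
```

With breakpoints [7, 14] and values [100, 250, 50], for instance, times in [0, 7) map to 100, times in [7, 14) to 250, and later times to 50.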

Past sampling and coalescent history
We assume that the sampling history is given by a Poisson Point Process (PPP) with a time-varying rate governed by the sampling intensity S_t, generating the set of ordered sampling times B = (b_i)_{i=0}^{B−1} of all individuals. The total number of sampling events is denoted B, and individuals are numbered from 0 to B − 1 in reverse birth time order. We will also call these sampling times the birth times of lineages when considering the history backward in time (hence the name B).
Lineages begin their backward-in-time journey as singletons {i} where i corresponds to the individual's number.
The past history of these lineages is further assumed to follow a standard coalescent with effective population size N_t, generation time g, and differentiation under an infinite alleles model with mutation rate µ. That is, while there are k_t lineages alive in the process, the next coalescent event happens with rate k_t(k_t − 1)/(2 N_t g), and the next differentiation event happens with rate k_t µ. When there is a coalescent event, two lineages L_i and L_j, uniformly sampled among the k_t living lineages at that time, are merged into a unique lineage L_i ∪ L_j. When there is a differentiation event, one of the k_t living lineages is uniformly chosen to be killed. Forward in time, a coalescence corresponds to an individual giving birth to another individual, whereas a differentiation event corresponds to the acquisition of an original mutation responsible for the creation of an allele.
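For intuition, these backward dynamics can be sketched as a small Gillespie-style simulation in the simplest case of isochronous sampling with constant N (a toy illustration under these assumptions, not the paper's released implementation):

```python
import random

def simulate_allele_partition(n, N, g, mu, seed=0):
    """Run the coalescent with killing backward in time for n genes
    sampled at a single time point, with constant effective size N,
    generation time g and mutation rate mu. Returns the allele
    partition: a list of sets of sample indices."""
    rng = random.Random(seed)
    lineages = [{i} for i in range(n)]  # each sample starts as a singleton
    alleles = []
    while lineages:
        k = len(lineages)
        coal = k * (k - 1) / (2 * N * g)  # total coalescence rate
        mut = k * mu                      # total differentiation rate
        # only the event type matters for the resulting partition
        if rng.random() < mut / (coal + mut):
            # differentiation: the chosen lineage finds its founding mutation
            alleles.append(lineages.pop(rng.randrange(k)))
        else:
            i, j = sorted(rng.sample(range(k), 2))
            lineages[i] |= lineages.pop(j)  # merge the two chosen lineages
    return alleles
```

Since the last remaining lineage always ends with a differentiation event, the simulation terminates with every sample assigned to exactly one allele.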
The past coalescent history thus generates in particular a partition of individuals into an allele partition A, corresponding to the collection of all lineages killed by a mutation. It also generates the times at which differentiation and coalescent events, jointly referred to as death events, happened in history. In order to record these, we take the following approach. Lineages are initially numbered with the same number as the individual each carries. Each coalescence involving two lineages numbered j < i at time t is considered to kill lineage i, which keeps living in lineage j (see arrows on Fig. 1). By a slight abuse of language, we call such an event the death of individual i, and the time at which a mutation is found is the death time of the very first individual of the allele (see crosses on Fig. 1). The resulting joint density of the sampling history B and coalescent history H, given in Equation (2.4), belongs to the exponential family, and thus lends itself well to inference via a Gibbs sampling strategy, with priors that belong to other exponential family distributions. In the next Section, I turn to the description of this inference method.

3.2 Prior conjugacy properties for parameters

Eective population size
Recall that N_j is a priori independent of N_{−j}, S, µ, g, and is distributed according to a Generalized Inverse Gaussian distribution denoted GIG(λ, χ, ψ). The following will justify this choice of prior. The GIG distribution belongs to the exponential family and is characterized by its density, usually parameterized, up to a normalizing constant, as f(x) ∝ x^{λ−1} e^{−(χ/x + ψx)/2} for x > 0. The posterior of N_j is obtained by substituting the density of B, H using Equation (2.4), before dropping out all terms which do not depend on N_j. This proves that the prior and posterior are conjugate distributions: the posterior is again a GIG distribution, whose updated parameters involve the number of coalescent events and the number of sampling events happening over the interval Δ_j^(N). The choice of a conjugate prior will help (i) simplify the Gibbs sampling process, and (ii) provide a better intuitive understanding of the factors that influence the distribution of N.
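The conjugacy can be checked numerically at the level of unnormalized kernels. In the sketch below, the likelihood in N_j over one interval is assumed, purely for illustration, to contribute a factor x^{−c} e^{−A/x} from c coalescent events and x^{s} e^{−Bx} from s sampling events; the exact exponents and exposures are those defined by Equation (2.4), so the counts and constants here are placeholders:

```python
from math import log

def gig_log_kernel(x, lam, chi, psi):
    """Log of the unnormalized GIG(lam, chi, psi) density."""
    return (lam - 1) * log(x) - 0.5 * (chi / x + psi * x)

def log_likelihood_in_N(x, c, s, A, B):
    """Hypothetical likelihood terms in N over one interval: c coalescent
    events (rates prop. to 1/x, cumulative exposure A/x) and s sampling
    events (rates prop. to x, cumulative exposure B*x)."""
    return -c * log(x) - A / x + s * log(x) - B * x

def gig_posterior_params(lam, chi, psi, c, s, A, B):
    """Conjugate update: prior GIG kernel times the likelihood above
    is again a GIG kernel, with shifted parameters."""
    return lam + s - c, chi + 2 * A, psi + 2 * B
```

Evaluating prior-plus-likelihood and the updated kernel at any two points shows they differ by a constant in x, i.e. they define the same distribution.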

Sampling intensity
Recall that S_j is a priori independent of N, S_{−j}, µ, g, and that S_j ∼ Γ(α, β). Its posterior is obtained again by substituting the density of B, H using Equation (2.4), before dropping out all terms which do not depend on S_j. This shows that the prior and posterior are conjugate distributions: the posterior is again a Gamma distribution, with updated parameters given in Equation (3.6).
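This Gamma update has the familiar Poisson-process form. As a hedged sketch (the exact event counts and exposures are those entering Equation (3.6); the names below are illustrative), observing n events of a Poisson process with rate S over a total exposure E updates a Γ(α, β) prior as follows:

```python
def gamma_posterior(alpha, beta, n_events, exposure):
    """Conjugate Gamma update for a Poisson-process rate: with prior
    Gamma(alpha, beta) and likelihood rate**n_events * exp(-rate * exposure),
    the posterior is Gamma(alpha + n_events, beta + exposure)."""
    return alpha + n_events, beta + exposure
```

The same generic update applies, mutatis mutandis, to the Gamma prior placed on the mutation rate µ below.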

Mutation rate
Recall that, a priori, µ ∼ Γ(α, β). Following the same strategy as above, its posterior is derived by dropping out all terms which do not depend on µ. We conclude that the prior and posterior of µ are conjugate distributions, with updated parameters given in Equation (3.7).

Generation time
Last, recall that g is a priori distributed according to an Inverse-Gamma distribution denoted Γ^{−1}(α, β). Its posterior is derived following the same strategy, showing that the posterior is again an Inverse-Gamma distribution. This ends the description of the four conjugate priors used for N, S, µ, g. They will be used in the final Gibbs sampler to quickly update the posterior of these variables of interest, provided B, H are observed.

Data augmentation with the past coalescent history
Under the assumption of an infinite alleles model with constant effective population size N and constant mutation rate µ, the distribution of the past history of a sample of n genes taken at a single point in time can be efficiently sampled using the equivalence with Hoppe's urn process (Durrett 2008). Yet, when sequences are heterochronous, i.e. have been sampled through time instead of at a single point in time, and when the population size is not constant anymore, an alternative strategy is needed.
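Hoppe's urn gives a simple forward construction of the allele partition in this isochronous, constant-parameter case: the urn starts with a single black ball of mass θ; drawing the black ball founds a new allele, while drawing a coloured ball copies the allele of a uniformly chosen existing gene (under the rates above, θ = 2Nµg would be the natural scaling, though this identification is an assumption of this sketch). A minimal version:

```python
import random

def hoppe_urn(n, theta, seed=0):
    """Hoppe's urn: generate allele counts for a sample of n genes.
    Returns a list of allele sizes, in order of allele appearance."""
    rng = random.Random(seed)
    counts = []  # counts[a] = number of genes carrying allele a
    for k in range(n):
        # black ball has mass theta; the k coloured balls have mass 1 each
        if rng.random() * (theta + k) < theta:
            counts.append(1)  # black ball drawn: a brand-new allele
        else:
            # copy the allele of a uniformly drawn existing gene
            r = rng.randrange(k)
            acc = 0
            for a, c in enumerate(counts):
                acc += c
                if r < acc:
                    counts[a] += 1
                    break
    return counts
```

The first draw always founds a new allele, and each subsequent gene either founds one (probability θ/(θ + k)) or joins an existing one.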
Recall that H_i ∈ R_+ is the death time of a focal individual i, and that O_i ∈ {0, 1, . . . , i} is the death output. Figure 2 illustrates the Gibbs sampling approach for simulating the past history of a sample given the allele partition: each leaf is considered in turn (panels A-F), and its time of extinction is sampled given that the rest of the coalescent history is fixed. I take the following alternative approach to avoid rejecting too many simulations. On all intervals Δ_l = (t_l, t_{l+1}) where the total number of lineages k_l remains constant and where t_l ≥ m_i, the probability that H_i falls within the interval, conditioned on everything else, is computed (Equation (3.9)), where k_a^{<i} denotes the number of living lineages to the left of i belonging to the same allele a as i.
The above formula is used to recursively compute, from bottom to top, the weights w_l associated to all intervals Δ_l above m_i. Once these have been computed, we have access to, and can sample from, the distribution of the interval in which the death event happens. Once this interval has been drawn, it remains to draw from P(H_i ∈ dt | H_i ∈ Δ_l), which corresponds to an exponential distribution with rate (µ + k_l (N g)^{−1}) conditioned on happening on Δ_l. The update of one leaf thus proceeds as follows:

1. Compute the bounds m_i and M_i of the time window in which the death event can occur.

2. Explore intervals, from time m_i up until M_i, and for each interval Δ_l: (a) record k_l, the total number of lineages alive on the interval; (b) record the list L_a^{<i} of all lineages j < i from the same allele a as lineage i alive in the interval; (c) record the weight w_l using Equation (3.9).

3. Draw the interval in which the event occurs, and then the specifics of the event: (a) draw the interval Δ_l according to the weights computed at the previous step; (b) draw H_i from an exponential distribution with rate µ + k_l (N g)^{−1}, conditioned on happening on Δ_l; (c) draw O_i, a uniformly chosen lineage in the list L_a^{<i}.
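Step 3(b) amounts to drawing from an exponential distribution conditioned to land in a given interval, which can be done exactly by inverting the CDF restricted to that interval. A sketch, with illustrative names:

```python
import math
import random

def truncated_exponential(rate, t_lo, t_hi, rng=None):
    """Draw from Exp(rate), conditioned on falling in (t_lo, t_hi),
    by inverse-CDF sampling restricted to the interval."""
    rng = rng or random.Random()
    u = rng.random()
    f_lo = -math.expm1(-rate * t_lo)  # CDF at t_lo
    f_hi = -math.expm1(-rate * t_hi)  # CDF at t_hi
    # map u uniformly onto (F(t_lo), F(t_hi)), then invert the CDF
    return -math.log1p(-(f_lo + u * (f_hi - f_lo))) / rate
```

Here `rate` would be µ + k_l (N g)^{−1} and (t_lo, t_hi) the endpoints of the drawn interval Δ_l, measured from the reference time of the exponential clock.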

Summary of the Gibbs sampler
The final Gibbs sampler is a simple MCMC that is initialized using the priors of N, S, µ, g, before relying on repeated updates of the variables N, S, µ, g and H, using their conditional probabilities. Algorithm 2 summarizes these steps.
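The overall control flow is a plain systematic-scan Gibbs loop. The scaffold below sketches only this structure (the update functions stand in for the conjugate draws and leaf updates described above, and the names are illustrative):

```python
import random

def gibbs_sampler(update_fns, init_state, n_steps, burn_in, thin, seed=0):
    """Systematic-scan Gibbs scaffold: cycle through full-conditional
    updates, discard a burn-in, then record one state every `thin` steps."""
    rng = random.Random(seed)
    state = dict(init_state)
    samples = []
    for step in range(n_steps):
        for name, update in update_fns.items():
            # each update draws from the full conditional of one variable
            state[name] = update(state, rng)
        if step >= burn_in and (step - burn_in) % thin == 0:
            samples.append(dict(state))
    return samples
```

With 10^4 steps, a burn-in of 10^3 and a thinning of 100, for instance, this records 90 posterior states.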
In the next Section, I validate the behaviour of this MCMC on simulated datasets before applying it to empirical datasets in Section 5.

4 Numerical validation of the method

4.1 Past coalescent history

I first aim to validate the procedure described in Section 3.3 and the implementation of Algorithm 1. To do so, I wrapped Algorithm 1 in a minimalist Gibbs sampler aiming at sampling from P(H | A, B, N, S, µ, g). I fix B, A as well as all parameters of the model N, S, µ, g. I then use a simplified version of the MCMC described as Algorithm 2, where all updates concerning N, S, µ, g are skipped. To draw one observation of H, I performed n = 50 complete cycles of leaf updates, and kept only the last state reached by the MCMC.
I compare the distribution obtained using this procedure with the distribution of H obtained by a naive rejection algorithm, consisting in simulating the coalescent process backwards in time while rejecting outcomes that do not satisfy A. The distributions are compared based on summary statistics computed on 10^4 samples drawn from both distributions: (i) the proportion of samples having a mutation in successive death events, and (ii) the distribution of death times. Note that the dataset must be small, for the rejection algorithm very quickly becomes computationally too intensive to be used. Figure 3 illustrates the perfect agreement between both distributions, on a toy dataset with N = 10, g = 0.1, µ = 1.5, and two alleles, each joining individuals sampled at distinct times.

Simulation-based calibration
The MCMC implementation of Algorithm 2 is further validated against simulated data using the Simulation-Based Calibration (SBC) method described by Talts et al. (2018) and summarized hereafter. First, prior distributions are fixed for the parameters N, S, µ, g. Second, 10^4 parameter sets are sampled from these distributions and, for each parameter set, the sampling history B as well as the allele partition A are sampled according to the model.
Third, for each simulated dataset, the posterior distribution of N, S, µ, g conditioned on A, B is sampled using the Gibbs sampler described as Algorithm 2, using the same priors that were used for the simulation. I ran the MCMC for a total of 10^4 steps, discarded a burn-in of 10^3 steps at the beginning, and recorded one state every 100 steps over the remaining steps.
Finally, Figure 4A shows the proportion of datasets p_α for which a credible interval with level α of the posterior distribution contains the true simulated value, for 9 values of α evenly spaced on (0, 1). The good match between p_α and α indicates that the MCMC correctly samples the posterior distribution. This is further confirmed in Figure 4B, showing the histogram of the rank statistic associated with the 10^4 experiments. Here, the rank statistic associated with one experiment refers to the number of samples from the posterior that are less than the true value. Under the null hypothesis that we are sampling the true posterior distribution, the histogram of rank statistics should be uniform, as illustrated in Figure 4B.
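The rank statistic underlying Figure 4B is simply the position of the true simulated value within the posterior sample:

```python
def rank_statistic(posterior_samples, true_value):
    """Number of posterior samples strictly below the true simulated value.
    Under correct posterior sampling, this is uniformly distributed on
    {0, ..., len(posterior_samples)}."""
    return sum(s < true_value for s in posterior_samples)
```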

Running time assessment
I now turn to an assessment of the running time of the method on realistic datasets. I simulated datasets using a timeline cut into p = p′ = 5 intervals with varying N and S values, while fixing the same hyperparameters on all intervals, so as to get datasets with total sample size B regularly spaced on a log scale between 10 and 2500. On each of these, I ran 10^4 steps of the Gibbs sampler, discarded the first 10^3 steps, and recorded the running time and posterior samples.
Let us focus first on the running time of each of the different updates. Since, for each of the B sequences, the update of the death event requires computing the weights associated to a number of intervals of the order of B, the update of the coalescent history is expected to scale in O(B^2) and to be the bottleneck of the MCMC. Figure 5A presents estimates of the running time depending on B, using the code released along with this article on a laptop. It confirms that the update of the coalescent history, and thus a step of the MCMC, scales in O(B^2).
Running an MCMC moreover requires performing these updates repeatedly during a certain number of steps, in order to (i) escape the burn-in phase of the MCMC; and (ii) collect enough samples from the posterior to characterize it. When the samples collected through time are highly correlated, the chain is said to be mixing poorly, and more steps are typically required. In order to provide a rough idea of the expected mixing behaviour on simulated datasets, I computed estimates of the effective sample size of (N_j)_{j=0}^{p−1}, (S_j)_{j=0}^{p′−1}, µ, g. The results are shown in Figure 5B. Combined with the running time assessment, this conveys the idea that the current implementation, without further approximation or numerical optimization, will likely not be useful to process more than ∼10^4 sequences.
Having in mind the rough behaviour of the method, I illustrate its use in the next Section on empirical datasets encompassing ∼10^3 sequences.

MCMC specifications
The timeline of the analysis is fixed for N and S. It extends from the 1st of June 2020 back to the 13th of January 2020, cut into successive intervals of 4 weeks each. I assume that µ and g are known from other studies and are not the focus of the inference. The mutation rate µ is fixed to 0.065 mutations per genome per day, corresponding to 8 × 10^−4 mutations per nucleotide per year, and the generation time g is fixed to 5 days.
Finally, hyperparameter values for N and S are chosen so as not to be too informative, using a quick back-of-the-envelope reasoning around Equations (3.5) and (3.6). We imagine what could happen over a time period with few data, as happens at the beginning of the dataset. If the order of magnitude of N is approximately 10^4, and we roughly believe that one out of 5 × 10^3 individuals is sequenced in reality, then, because each individual lives for 5 days in the model, this corresponds to sequencing one out of 2.5 × 10^4 individuals per day, i.e. S ∼ 4 × 10^−5.
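The back-of-the-envelope computation above reads, in code (numbers taken directly from the text):

```python
# One out of 5,000 infected individuals is assumed sequenced in reality.
sequenced_fraction = 1 / 5e3
# Each individual lives g = 5 days in the model, so the sequencing
# probability is spread over that lifetime to give a per-day intensity.
lifetime_days = 5
S = sequenced_fraction / lifetime_days
# i.e. S is about 4e-5: one out of 2.5e4 individuals sequenced per day
```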
I ran the Gibbs sampler for 10^4 steps, discarded the first 10^3 steps as burn-in, and used the remaining 9 × 10^3 steps for posterior inference. The code provided along with this article generates the traces and auto-correlation functions of all parameters. These were inspected visually, and the ESS values of all parameters are higher than 100.

Results
The output of the Gibbs sampler on each country is a posterior sample of (N_j)_{j=0}^{p−1} and (S_j)_{j=0}^{p′−1} values through time over the fixed timeline. Figure 6 illustrates the input data (to the left) together with the output posteriors of N and S (to the right). In particular, the trajectories of N and S tell a similar story in the four countries. The outbreak slowly started in early 2020 and reached its peak around March-April 2020, before quickly decreasing in May 2020.

Opening future modeling opportunities with allele data Overall, the results on empirical data illustrate the use and applicability of the method on large real-world datasets. Yet, the current model formulation still lacks a few realistic ingredients before it can be used to learn new aspects of an epidemic.
First and foremost, it does not take into account population structure, a feature that is likely to be present in the empirical dataset and violates the model assumptions. Further work is needed to properly incorporate different demes characterized by different population sizes and sampling intensities, exchanging sequences through migrations among demes. This will likely be the subject of future work aiming to infer population structure and population dynamics using the allele partition. Integrating over the hidden coalescent history will be complicated a bit by migrations between demes, but could be envisioned within a closely related inference framework.
Second, it could be interesting to use smoothing priors for N and S to ensure that these two functions of time do not show huge steps from one time period to the next. This will as well be the focus of future developments of the method, which could build on related work in phylodynamics (Karcher et al. 2020; Parag et al. 2021). It would seem especially interesting for N to root smoothing priors in mechanistic assumptions of population dynamics, such as considered e.g. in other epidemiological models (Cori et al. 2013).
Last, the boundaries of the intervals on which N and S remain constant could be unknown, with or without prior distribution. This would allow the timeline to be informed by the data, and could lead as well, in case it is stochastic and averaged over the posterior, to smoother N and S trajectories. This could build on related work in a phylodynamic context under a birth-death model (Stadler 2011).
Further theoretical work is also required before this kind of model can be applied to larger, already available datasets. Indeed, while the initial hope of bringing back into fashion the infinite alleles model was to be able to process very large genetic datasets, it is in the end not realistic to use the current implementation of the method with more than ∼10^4 sequences. At least four directions can be envisioned to improve on this objective in the future, starting with trying to optimize the sampling step of the coalescent history, which is the current bottleneck of the computation.

When is the use of a simplified mutation process pertinent? A natural question arises when thinking about the difference between this method and currently used coalescent-based methods in phylodynamics, namely: how do these compare in terms of statistical power? Or, to rephrase it in more technical terms: what signal do we lose by forgetting about the coalescent history above the first mutation?
The answer will likely depend on the mutation rate and on the temporal scale that one is interested in studying.
In the limit of a very high mutation rate compared to the temporal scale under study, only singletons are observed, bearing no useful signal; in the limit of a very low mutation rate, only one allele is observed, again bearing no useful signal to infer N and S.
In between lies a setting with an intermediate mutation rate, such that alleles extend for some time across the focal time frame, providing signal to reconstruct N and S. However, even in this optimal setting for the method, data is discarded, so that all signal carried by the internal branches linking different alleles is lost compared to coalescent-based methods integrating over the full unknown tree. When does the trade-off between computation time and precision turn in favor of using a simpler infinite alleles model? Quantifying this more precisely on simulations would be a valuable contribution.
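The three regimes above can be given a rough quantitative flavour using the expected number of distinct alleles in a sample of size n under the neutral infinite alleles model at equilibrium (a by-product of the Ewens sampling formula), E[K] = Σ_{i=0}^{n-1} θ/(θ+i). This is only an intuition pump: the formula assumes a constant population size, unlike the piecewise-constant N considered here.

```python
def expected_num_alleles(n, theta):
    """E[K] under the Ewens sampling formula, for sample size n and
    scaled mutation rate theta (constant population size assumed)."""
    return sum(theta / (theta + i) for i in range(n))

n = 100
low = expected_num_alleles(n, 0.01)    # ~1 allele: almost no signal
high = expected_num_alleles(n, 1000.0) # ~n alleles, mostly singletons
mid = expected_num_alleles(n, 5.0)     # intermediate: informative partition
assert low < 1.1 and high > 90 and 1.1 < mid < 90
```

Only in the intermediate regime does the allele partition itself carry substantial information about the underlying dynamics.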
Moreover, when one is not interested in estimating µ, the allele partition could be chosen so as to tend to the optimal setting described above. In principle, the allele partition of the set of sequences could be obtained by applying any other equivalence relation, e.g. being similar only on a given subsequence, or having a similar amino-acid sequence. These could be used to decrease the number of alleles in the dataset.
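A coarser allele partition of this kind can be obtained by grouping sequences through any key function, since equality of keys defines an equivalence relation. A hypothetical sketch (the function name and toy data are mine, not part of the released code):

```python
from collections import defaultdict

def partition_by(sequences, key):
    """Group sequences into alleles: two sequences belong to the same
    allele iff they map to the same key under the given key function."""
    alleles = defaultdict(list)
    for seq_id, seq in sequences.items():
        alleles[key(seq)].append(seq_id)
    return list(alleles.values())

seqs = {"s1": "ACGTAA", "s2": "ACGTTT", "s3": "ACCCAA"}
# Example equivalence relation: identical on the first four positions only.
parts = partition_by(seqs, key=lambda s: s[:4])
assert sorted(map(sorted, parts)) == [["s1", "s2"], ["s3"]]
```

Choosing a coarser key merges alleles and thus mimics a lower effective mutation rate.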
In between the two above-mentioned extremes of (i) using an infinite alleles model or (ii) using a finite sites model with a substitution model also lies the opportunity to revive another assumption from population genetics, namely the infinite sites model. Under this model, each mutation hits a new site along the sequence, so that a more precise phylogenetic history between sequences can be reconstructed. An importance sampling algorithm has been proposed by Stephens and Donnelly (2000) for simulating the past history under a coalescent with an infinite sites model. A more thorough comparison of this inference framework against another inference method relying on the infinite sites model could also be a relevant contribution to the field.
Finally, this manuscript also opens the way to developing better approximations aiming at taking into account more sequences in scenarios with high numbers of duplicates (Boskova and Stadler 2020). This line of research could benefit from the joint use of different mutation models, clearly distinguished in distinct parts of the evolutionary tree of sequences, while still relying on a unique underlying population dynamics model, such as e.g. a coalescent model with discrete population size shifts.
Conclusion
Bringing back into fashion old population genetics simplifications of the mutation process and incorporating them into modern statistical frameworks could play a key role in better surveying and understanding population demographics and structure from molecular data. I hope that this work will participate in a current trend towards adapting computer-intensive phylodynamics methods for use with datasets characterized by low genetic diversity, such as the current SARS-CoV-2 outbreak.

A Details on the numerical method

Pipeline for creating the allele partition
Creating the allele partition is a central step in the method, for this will be the raw data taken as input by the Gibbs sampler. To build it, I relied on the following pipeline:
1. cut the master reference sequence from position 250 to 29700, and put this one as the first element of a list of reference sequences.
2. initialize an empty list of alleles.
3. then iterate, for each other sequence, over the following steps:
(a) look for a pattern close to the first 30 nucleotides of the master reference within the first 500 nucleotides of each sequence, to get the beginning of the window, and discard the very few sequences where the beginning could not be found.
(b) keep as the focal sequence the 29450 nucleotides following the beginning of the window.
(c) record the collection of SNPs differing in the focal sequence compared to the reference sequence:
• if it is smaller than 100 SNPs long, consider it to be the compressed representation of the sequence against this reference;
• if not, compare to the next reference sequence in the list.
(d) if no reference sequence is fewer than 100 SNPs away from the sequence, it is likely to have a feature of its own (typically, a gap); add this sequence to the list of reference sequences.
(e) look for the same compressed representation of the sequence in the list of alleles and add the sampling date of this sequence to the list of sampling dates of the allele.
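The compression steps (c)–(d) of the pipeline above can be sketched as follows. This is a simplified Python rendering, not the released OCaml code (raw_to_datasets.ml), and it assumes that the sequences have already been aligned to the same window:

```python
def snps_against(seq, ref):
    """List of (position, nucleotide) pairs where seq differs from ref."""
    return [(i, b) for i, (a, b) in enumerate(zip(ref, seq)) if a != b]

def compress(seq, references, max_snps=100):
    """Return (ref_index, snp_list) against the first reference differing by
    fewer than max_snps SNPs; otherwise append seq as a new reference."""
    for j, ref in enumerate(references):
        snps = snps_against(seq, ref)
        if len(snps) < max_snps:
            return j, snps
    references.append(seq)
    return len(references) - 1, []

refs = ["ACGTACGT"]
idx, snps = compress("ACGAACGT", refs)
assert idx == 0 and snps == [(3, "A")]
```

Two sequences then belong to the same allele exactly when they share the same compressed representation.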
This pipeline is implemented in the script raw_to_datasets.ml, available as part of the code associated with this article.
MCMC output analysis
MCMC samples output by the OCaml code are then analyzed using scripts written in the R programming language.
Post-processing steps rely in particular on the following packages:
• the very versatile ggplot2 and cowplot, to produce figures,
• LaplacesDemon, to compute ESS values from traces of scalar values,
• forecast, to compute ACF or PACF from traces of scalar values.
The R post-processing scripts are also available as part of the code released along with this article.

Generation of random variables
Random variables distributed according to a GIG distribution are sampled using a personal implementation of Hörmann and Leydold (2014)'s algorithm in the OCaml programming language. It is a naive translation of the very handy R package GIGrvg by the same authors.

Note here that the sequencing fraction plotted below corresponds to the number of sequences divided by the sum of the number of sequences and the number of samples. Indeed, the number of sequences in the early time period is sometimes higher than the number of samples, and I thus considered that the sequences were not included in the case count.
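For reference, the density targeted by this sampler, written in the standard (λ, χ, ψ) parameterization of the generalized inverse Gaussian distribution (the parameterization used in the article's own derivations may differ):

```latex
f(x \mid \lambda, \chi, \psi)
  = \frac{(\psi/\chi)^{\lambda/2}}{2\,K_\lambda\!\big(\sqrt{\chi\psi}\big)}\,
    x^{\lambda-1}
    \exp\!\Big(-\tfrac{1}{2}\big(\tfrac{\chi}{x} + \psi x\big)\Big),
  \qquad x > 0,
```

where K_λ denotes the modified Bessel function of the second kind.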

C Gibbs sampler code documentation
This section aims at making the code released along with this article more easily comprehensible, by shortly presenting in plain text the strategy used to store different data, together with key functions to manipulate the data.

Format of key quantities
By default, all lists are ordered in time-ascending order (that is, the first element, at the top of the list, is at present time 0).
an allele or a lineage: an ordered list of (b_i) belonging to the same allele/lineage, if the individuals are not numbered, or an ordered list of (n_i) if the individuals have been numbered.
alleles or lineages: a list of ordered lists of (b_i), or a list of ordered lists of (n_i), depending again on whether individuals are numbered or not.
all_events: a list of all events ordered in time-ascending order, with many other interesting quantities attached to the interval between one event and the next, such as: the number of lineages alive, the current value of N, the current value of S, the total rate at which an event happens on this interval for a given individual, etc.
intervals: the list, ordered in DESCENDING order, of tuples (t_s, t_e, k, k_a^{<i}, τ, w) where t_s is the start time, t_e is the end time, k is the total number of lineages alive on the interval, k_a^{<i} is the number of individuals from allele a, to the left of individual i, alive on the interval, τ is the total rate at which death happens, and w is the weight of the interval, i.e. the probability that the death event of the focal individual falls in this interval, conditioned on A. Note that the reason it is in descending order is that it is built by reading the list of all_events in ascending order, and the next step, consisting in drawing the interval, does not require any specific order.
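The intervals structure and its weights can be sketched as follows. This is an illustrative Python rendering with hypothetical field names, not the OCaml types of the released code, and it computes the weight of each interval under a piecewise-constant death rate, leaving out the conditioning on A:

```python
import math
from dataclasses import dataclass

@dataclass
class Interval:
    t_start: float  # t_s: start time of the interval
    t_end: float    # t_e: end time of the interval
    k: int          # total number of lineages alive on the interval
    rate: float     # tau: total rate at which death happens

def weights(intervals):
    """Probability that a death clock with piecewise-constant rate
    first rings in each interval (intervals given in time order)."""
    w, survival = [], 1.0
    for iv in intervals:
        length = iv.t_end - iv.t_start
        w.append(survival * (1.0 - math.exp(-iv.rate * length)))
        survival *= math.exp(-iv.rate * length)
    return w

ivs = [Interval(0.0, 1.0, 3, 0.5), Interval(1.0, 2.0, 2, 0.5)]
# The weights sum to the probability of dying before the last interval ends:
assert abs(sum(weights(ivs)) - (1.0 - math.exp(-1.0))) < 1e-12
```

Any leftover mass corresponds to the clock not ringing within the listed intervals, which is where the conditioning in the actual sampler comes in.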

Roles of some key functions
Here is now an overview of some of the key operations we need to perform for the Gibbs sampling steps.
simulation: if needed, first get samp_history using sim_sampling_events, and then get alleles using sim_coal.
pre-processing: build the array_individuals and array_alleles from alleles and samp_history.
Gibbs sampling: we need to consider in turn the following operations:
• updating the death time of individual number i using update_past_coal: this requires first building the weighted intervals with get_intervals, before drawing an interval with find_back_interval, simulating the death time within that interval, and finally erasing the old time and inserting the new one at the correct place in all_events using replace_coal;
• updating the (N_j) values using update_listN, which explores the list of all_events and modifies the N_j values in place;
• updating the (S_j) values using update_listS, which works similarly.
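The draw performed by find_back_interval, followed by the simulation of the death time within the chosen interval, can be sketched as below. This is an illustrative Python analogue, not the actual OCaml implementation: an inverse-CDF draw of an interval index proportional to the weights, then an inverse-CDF draw of a truncated-exponential death time within that interval.

```python
import bisect
import math
import random

def draw_interval(weights, u=None):
    """Draw an interval index with probability proportional to its weight
    (the role played by find_back_interval)."""
    total = sum(weights)
    cum, acc = [], 0.0
    for w in weights:
        acc += w
        cum.append(acc)
    u = (u if u is not None else random.random()) * total
    return bisect.bisect_left(cum, u)

def draw_death_time(t_s, t_e, rate, u=None):
    """Inverse-CDF draw of the death time within (t_s, t_e), i.e. an
    exponential clock of the given rate truncated to the interval."""
    u = u if u is not None else random.random()
    z = 1.0 - math.exp(-rate * (t_e - t_s))
    return t_s - math.log(1.0 - u * z) / rate

assert draw_interval([0.2, 0.5, 0.3], u=0.5) == 1  # cumulative 0.2, 0.7, 1.0
```

Passing u explicitly makes both draws deterministic, which is convenient for testing.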
D Acknowledgements for data collection
I am grateful to the following list of authors, who have contributed to the collection of SARS-CoV-2 sequences that I downloaded from GISAID. In the four tables below, you will find the authors that took part in the collection effort in, respectively, Switzerland, Germany, France and Italy, over the time period I have been interested in.
We gratefully acknowledge the following Authors from the Originating laboratories responsible for obtaining the specimens, as well as the Submitting laboratories where the genome data were generated and shared via GISAID, on which this research is based.
All Submitters of data may be contacted directly via www.gisaid.org. Authors are sorted alphabetically.