Bayesian inference of relative fitness on high-throughput pooled competition assays

The tracking of lineage frequencies via DNA barcode sequencing enables the quantification of microbial fitness. However, experimental noise from biotic and abiotic sources complicates reliable inference. We present a Bayesian pipeline to infer relative microbial fitness from high-throughput lineage tracking assays. Our model accounts for multiple sources of noise and propagates uncertainties throughout all parameters in a systematic way. Furthermore, using modern variational inference methods based on automatic differentiation, we are able to scale the inference to a large number of unique barcodes. We extend this core model to analyze multi-environment assays, replicate experiments, and barcodes linked to genotypes. On simulations, our method recovers known parameters within posterior credible intervals. This work provides a generalizable Bayesian framework to analyze lineage tracking experiments. The accompanying open-source software library enables the adoption of principled statistical methods in experimental evolution.


Primer on Variational Inference
In this section, we briefly introduce the idea behind variational inference. Recall that any Bayesian inference problem deals with the joint distribution between observations x and unobserved latent variables θ. This joint distribution can be written as the product of the distribution of the observations x conditioned on θ and the marginal distribution of the latent variables, i.e.,

π(x, θ) = π(x | θ) π(θ).    (S1)

A Bayesian inference pipeline's objective is to compute the posterior probability of the latent variables given a set of observations. This computation is equivalent to updating our prior beliefs about the values of the latent variables after taking in new data. We write this via Bayes' theorem as

π(θ | x) = π(x | θ) π(θ) / π(x).    (S2)

The main technical challenge of working with Equation S2 comes from computing the denominator, also known as the evidence or the marginalized likelihood. Computing this term is challenging because it involves a (potentially) high-dimensional integral of the form

π(x) = ∫⋯∫ π(x | θ) π(θ) d^K θ,    (S3)

where K is the dimensionality of the θ vector. Here, the integrals are taken over the support (the set of valid values) of π(θ). Only a few selected distributions admit a closed analytical form for this integral; thus, in most cases Equation S3 must be solved numerically.
Integration in high-dimensional spaces can be computationally extremely challenging. For a naive numerical quadrature procedure, integrating over a grid of values for each dimension of θ comes with an exponential explosion in the number of required grid-point evaluations, most of which do not contribute significantly to the integral. To gain visual intuition about this challenge, imagine integrating the function depicted in Figure S1. If the location of the high-density region (dark peak) is unknown, numerical quadrature requires many grid points to ensure we capture this peak. However, most of the evaluations of the function on the grid do not contribute significantly to the integral, so our computational resources are wasted on insignificant evaluations. This only gets worse as the number of dimensions increases, since the number of grid-point evaluations scales exponentially, becoming prohibitively slow for the number of dimensions our present inference problem involves. Thus, there is a need for scalable methods to tackle the inference problem in Equation S2.
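The scaling argument above can be made concrete with a short numerical sketch (in Python with NumPy; the grid sizes and the significance threshold are illustrative choices, not from the paper). It integrates a standard Gaussian density on a regular grid and counts how many grid points contribute non-negligibly to the integral:

```python
import numpy as np

def grid_integrate_gaussian(dim, points_per_dim, half_width=5.0):
    """Integrate a standard Gaussian density on a regular grid and count
    how many grid points contribute non-negligibly to the integral."""
    axis = np.linspace(-half_width, half_width, points_per_dim)
    spacing = axis[1] - axis[0]
    grids = np.meshgrid(*([axis] * dim), indexing="ij")
    sq_norm = sum(g**2 for g in grids)
    density = np.exp(-0.5 * sq_norm) / (2 * np.pi) ** (dim / 2)
    integral = density.sum() * spacing**dim
    significant = np.count_nonzero(density > 1e-6 * density.max())
    return integral, density.size, significant

for dim in (1, 2, 3):
    integral, n_evals, n_significant = grid_integrate_gaussian(dim, 51)
    print(dim, n_evals, n_significant / n_evals, round(integral, 4))
```

The total number of evaluations grows as 51^K, while the fraction of grid points that actually matter shrinks with dimension, which is precisely the waste described above.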
Variational inference circumvents these technical challenges by proposing an approximate solution to the problem. Instead of working with the posterior distribution in its full glory, π(θ | x), we propose an approximate posterior distribution q_φ that belongs to a distribution family fully parametrized by φ. For example, say that q_φ belongs to the family of multivariate normal distributions, such that φ = (µ, Σ), where µ is the vector of means and Σ is the covariance matrix. If we replace π by q_φ, we want q_φ to resemble the original posterior as closely as possible. Mathematically, this can be expressed as minimizing a "distance metric" between the distributions, such as the Kullback-Leibler (KL) divergence. Note that we use quotation marks because, formally, the KL divergence is not a distance metric since it is not symmetric. Nevertheless, the variational objective is to find a distribution q*_φ such that

q*_φ = arg min_φ D_KL(q_φ(θ) || π(θ | x)),    (S4)

where D_KL is the KL divergence. Furthermore, we highlight that the KL divergence is non-negative, i.e.,

D_KL(q_φ(θ) || π(θ | x)) ≥ 0,    (S5)

as this property will become important later on.
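Since the KL divergence plays a central role here, a small numerical check (Python with NumPy; the means and standard deviations are arbitrary illustrative values) confirms the two properties invoked above, non-negativity and asymmetry, using the closed form for univariate normal distributions:

```python
import numpy as np

def kl_normal(mu1, s1, mu2, s2):
    """Closed-form KL divergence D_KL(N(mu1, s1^2) || N(mu2, s2^2))."""
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2) ** 2) / (2 * s2**2) - 0.5

# KL is non-negative, and swapping the arguments changes its value,
# which is why it is not a true distance metric.
d_pq = kl_normal(0.0, 1.0, 1.0, 2.0)
d_qp = kl_normal(1.0, 2.0, 0.0, 1.0)
print(d_pq, d_qp)
```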
At first sight, Equation S4 does not improve the situation but only introduces further technical complications. After all, the definition of the KL divergence,

D_KL(q_φ(θ) || π(θ | x)) = ∫ d^K θ q_φ(θ) ln [ q_φ(θ) / π(θ | x) ],    (S6)

includes the posterior distribution π(θ | x) we are trying to get around. However, let us manipulate Equation S6 to beat it into a more reasonable form. First, we can use the properties of logarithms to write

D_KL = ∫ d^K θ q_φ(θ) ln q_φ(θ) − ∫ d^K θ q_φ(θ) ln π(θ | x),    (S7)

where, for convenience, we write a single integration sign (d^K θ still represents a multidimensional differential). For the second term in Equation S7, we can substitute the term inside the logarithm using Equation S2. This results in

D_KL = ∫ d^K θ q_φ(θ) ln q_φ(θ) − ∫ d^K θ q_φ(θ) ln [ π(x | θ) π(θ) / π(x) ].    (S8)

Again, using the properties of logarithms, we can split Equation S8, obtaining

D_KL = ∫ d^K θ q_φ(θ) ln q_φ(θ) − ∫ d^K θ q_φ(θ) ln π(x | θ) − ∫ d^K θ q_φ(θ) ln π(θ) + ∫ d^K θ q_φ(θ) ln π(x).    (S9)

It is convenient to rewrite Equation S9 as

D_KL = ∫ d^K θ q_φ(θ) ln [ q_φ(θ) / π(θ) ] − ∫ d^K θ q_φ(θ) ln π(x | θ) + ln π(x) ∫ d^K θ q_φ(θ),    (S10)

where, for the last term, we can take ln π(x) out of the integral since it does not depend on θ. Lastly, we utilize two properties:

1. The proposed approximate distribution must be normalized, i.e.,

∫ d^K θ q_φ(θ) = 1.    (S11)

2. The law of the unconscious statistician (LOTUS) establishes that, for any function g of the random variable,

∫ d^K θ q_φ(θ) g(θ) = ⟨g(θ)⟩_{q_φ},    (S12)

where ⟨·⟩_{q_φ} is the expected value over the q_φ distribution.
Using these two properties, the non-negativity of the KL divergence in Equation S5, and the definition of the KL divergence in Equation S6, we can rewrite Equation S10 as

0 ≤ D_KL(q_φ(θ) || π(θ | x)) = D_KL(q_φ(θ) || π(θ)) − ⟨ln π(x | θ)⟩_{q_φ} + ln π(x).    (S13)

Rearranging terms yields the functional form of the so-called evidence lower bound (ELBO) [Kingma and Welling, 9]:

ln π(x) ≥ ⟨ln π(x | θ)⟩_{q_φ} − D_KL(q_φ(θ) || π(θ)).    (S14)

Let us recapitulate where we are. We started by presenting the challenge of working with Bayes' theorem, as it requires a high-dimensional integral of the form in Equation S3. As an alternative, variational inference proposes to approximate the posterior distribution π(θ | x) with a parametric distribution q_φ(θ). By minimizing the KL divergence between these distributions, we arrive at the result in Equation S14, where the left-hand side, the log marginalized likelihood or log evidence, cannot be computed for technical and computational reasons. However, the right-hand side is composed of terms we can easily evaluate: the expected log-likelihood ⟨ln π(x | θ)⟩_{q_φ} and the KL divergence between our proposed approximate distribution q_φ(θ) and the prior distribution π(θ). Moreover, we can compute the gradients of these terms with respect to the parameters of our proposed distribution. This implies that we can change the parameters of the proposed distribution to maximize the ELBO. And although we cannot compute the left-hand side of Equation S14, we know that however large we make the ELBO, it will always remain smaller than (or equal to) the log marginal likelihood. Therefore, the larger we make the ELBO by modifying the parameters φ, the closer it gets to the log marginal likelihood and, as a consequence, the closer our proposed distribution q_φ(θ) gets to the true posterior distribution π(θ | x).
In this sense, variational inference turns an intractable numerical integration problem into an optimization routine, for which several algorithms are available.
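To make the optimization view concrete, the following sketch (Python with NumPy; the toy model, step sizes, and iteration counts are ours, not the paper's) maximizes the ELBO of Equation S14 for a conjugate toy model: a standard normal prior on θ, a unit-variance Gaussian likelihood, and a single observation x = 2, for which the exact posterior N(1, 1/2) and the log evidence are known. Gradient ascent on the variational parameters drives the ELBO up toward the log evidence:

```python
import numpy as np

# Toy conjugate model: theta ~ N(0, 1), x | theta ~ N(theta, 1), observed x = 2.
# The exact posterior is N(1, 1/2), and the log evidence is ln N(x | 0, 2).
x = 2.0
log_evidence = -0.5 * np.log(2 * np.pi * 2.0) - x**2 / (2 * 2.0)

def elbo(m, s):
    """Closed-form ELBO for q(theta) = N(m, s^2) under the model above."""
    expected_loglike = -0.5 * np.log(2 * np.pi) - 0.5 * ((x - m) ** 2 + s**2)
    kl_to_prior = -np.log(s) + (s**2 + m**2) / 2 - 0.5
    return expected_loglike - kl_to_prior

# Gradient ascent on (m, ln s) using the analytic gradients of the ELBO.
m, log_s = 0.0, 0.0
for _ in range(500):
    s = np.exp(log_s)
    m += 0.05 * ((x - m) - m)           # dELBO/dm
    log_s += 0.05 * (1.0 - 2.0 * s**2)  # dELBO/d(ln s)

print(m, np.exp(log_s), elbo(m, np.exp(log_s)), log_evidence)
```

At the optimum, q equals the exact posterior and the ELBO meets the log evidence; at every intermediate step, the ELBO stays below it, as the bound in Equation S14 guarantees.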

ADVI algorithm
To maximize the right-hand side of Equation S14, the Automatic Differentiation Variational Inference (ADVI) algorithm developed in [7] takes advantage of advances in probabilistic programming languages to perform this optimization robustly. Without going into the details of the algorithm's implementation, for our purposes it suffices to say that we define our joint distribution π(θ, x) as the product in Equation S1. ADVI then proposes an approximate variational distribution q_φ that can either be a multivariate normal distribution with a diagonal covariance matrix, i.e.,

q_φ(θ) = N(θ | µ, diag(σ²)),    (S15)

where diag(σ²) is the diagonal matrix whose elements are given by the vector of variances σ² for each variable, or a full-rank multivariate normal distribution

q_φ(θ) = N(θ | µ, Σ).    (S16)

The parameters are then initialized at some value φ₀ and iteratively updated by computing the gradient of the ELBO (the right-hand side of Equation S14), hereafter denoted L, with respect to the parameters, and then computing

φ_{n+1} = φ_n + η ∇_φ L(φ_n),    (S17)

where η defines the step size. This short explanation of the ADVI algorithm is intended only to give intuition for how the optimal variational distribution q_φ is computed. There are many nuances in the implementation of the ADVI algorithm, and we invite the reader to consult the original reference for further details.
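A minimal caricature of an ADVI-style mean-field update (Python with NumPy; the model is the same toy conjugate example as above, and the learning rate, batch size, and iteration count are illustrative assumptions, not ADVI's defaults) draws θ = µ + σε with ε ∼ N(0, 1), so that Monte Carlo estimates of the ELBO gradient can flow through the sample (the reparameterization trick), and then takes stochastic gradient ascent steps on (µ, ln σ):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: theta ~ N(0, 1), x | theta ~ N(theta, 1), x = 2; posterior N(1, 1/2).
x = 2.0

def grad_log_joint(theta):
    """Gradient of ln pi(x, theta) with respect to theta."""
    return (x - theta) - theta

# ADVI-style updates: theta = mu + sigma * eps with eps ~ N(0, 1), so Monte
# Carlo gradients of the ELBO flow through the sample.
mu, log_sigma = 0.0, 0.0
for step in range(2000):
    sigma = np.exp(log_sigma)
    eps = rng.standard_normal(32)
    theta = mu + sigma * eps
    g = grad_log_joint(theta)
    grad_mu = g.mean()
    grad_log_sigma = (g * sigma * eps).mean() + 1.0  # +1 from the Gaussian entropy
    mu += 0.01 * grad_mu
    log_sigma += 0.01 * grad_log_sigma

print(mu, np.exp(log_sigma) ** 2)  # should approach the posterior N(1, 1/2)
```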

Defining the Bayesian model
In the main text (Equation 18), we specify the inference problem we must solve. Here, we briefly define the missing nuisance parameters. Let

s̄ = (s̄_1, s̄_2, …, s̄_{T−1})⊤    (S19)

be the vector containing the T − 1 population mean fitness values we compute from the T time points where measurements were taken. We have T − 1 values since computing any s̄_t requires cycle numbers t and t + 1. Furthermore, let F be a T × B matrix containing all frequency values. As with Equation 12 in the main text, we can split F into two matrices of the form

F = [F^N, F^M]    (S20)

to separate the corresponding neutral (F^N) and non-neutral (F^M) barcode frequencies.
Let us now define each of the terms in Equation 18 of the main text.
The following sections will specify the functional form each of these terms takes.

Frequency uncertainty π(F | R)
We begin with the probability of the frequency values given the raw barcode reads. The first assumption is that the inference of the frequency values at time t is independent of all other time points. Therefore, we can write the joint probability distribution as a product of independent distributions of the form

π(F | R) = ∏_{t=1}^T π(f_t | r_t),    (S21)

where f_t and r_t are the t-th rows of the matrices F and R, containing all the measurements for time t.
We imagine that when the barcode reads are obtained via sequencing, the quantified number of reads is a Poisson sample from the "true" underlying number of barcodes within the pool.
This translates to assuming that the number of reads for each barcode b at any time point t is distributed as

r_t^(b) ∼ Poiss(λ_t^(b)),    (S22)

where the symbol "∼" is read "is distributed as." Furthermore, for a Poisson distribution, we have that

⟨r_t^(b)⟩ = λ_t^(b) = Var(r_t^(b)),    (S23)

where ⟨·⟩ is the expected value. In other words, the Poisson parameter equals both the mean and the variance of the distribution. The Poisson distribution has the convenient property that for two Poisson-distributed random variables X ∼ Poiss(λ_x) and Y ∼ Poiss(λ_y), we have that

X + Y ∼ Poiss(λ_x + λ_y).    (S24)

This additivity allows us to write the total number of reads at time t, n_t, also as a Poisson-distributed random variable of the form

n_t ∼ Poiss( ∑_{b=1}^B λ_t^(b) ),    (S25)

where the sum is taken over all B barcodes.
If the total number of reads is given by Equation S25, the array with the number of reads for each barcode at time t, r_t, is then distributed as

r_t ∼ Multinomial(n_t, f_t),    (S26)

where each of the B entries of the frequency vector f_t is a function of the λ_t vector, given by

f_t^(b) = λ_t^(b) / ∑_{b'=1}^B λ_t^(b').    (S27)

In other words, we can think of the B barcode counts as independent Poisson samples or, equivalently, as a single multinomial draw with a random total number of draws, n_t, and the frequency vector f_t we are interested in. Notice that Equation S27 is a deterministic function that connects the Poisson parameters to the frequencies. Therefore, we have the equivalence

π(f_t | r_t) = π(λ_t | r_t),    (S28)

meaning that the uncertainty comes from the λ_t vector. By Bayes' theorem, we therefore write

π(λ_t | r_t, n_t) ∝ π(r_t, n_t | λ_t) π(λ_t),    (S29)

where we explicitly include the dependence on n_t. This does not affect the distribution or introduce more uncertainty, because r_t already contains all the information needed to compute n_t, since

n_t = ∑_{b=1}^B r_t^(b).    (S30)

But adding the variable allows us to factorize Equation S29 as

π(λ_t | r_t, n_t) ∝ π(r_t | n_t, λ_t) π(n_t | λ_t) π(λ_t).    (S31)

We then have

r_t | n_t, λ_t ∼ Multinomial(n_t, f_t(λ_t)),    (S32)

and, furthermore,

n_t | λ_t ∼ Poiss( ∑_{b=1}^B λ_t^(b) ).    (S33)

Finally, for our prior π(λ_t), we first assume each parameter is independent. A reasonable prior for each λ_t^(b), representing the expected number of reads for barcode b, should span several orders of magnitude. Furthermore, we assume that no barcode in the dataset ever goes extinct. Thus, no frequency can equal zero, facilitating the computation of the log frequency ratios needed to infer the relative fitness. The log-normal distribution satisfies these constraints; therefore, for the prior, we assume

λ_t^(b) ∼ LogNormal(µ_λ, σ_λ),    (S34)

with µ_λ and σ_λ as the user-defined parameters that characterize the prior distribution.
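The equivalence between independent Poisson draws and a multinomial draw conditioned on the total can be checked by simulation (Python with NumPy; the rates, the conditioning total of 100, and the number of trials are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical Poisson rates for B = 3 barcodes; the implied frequencies are
# f = lam / lam.sum(), as in Equation S27.
lam = np.array([50.0, 30.0, 20.0])
freq = lam / lam.sum()

# Draw independent Poisson reads many times, then keep only the draws whose
# total equals a fixed n; those conditioned draws should look multinomial.
reads = rng.poisson(lam, size=(200_000, 3))
totals = reads.sum(axis=1)
conditioned = reads[totals == 100]

empirical_freq = conditioned.mean(axis=0) / 100
print(empirical_freq, freq)
```

Beyond matching the mean frequencies, the conditioned counts also show the smaller multinomial variance n·f·(1 − f) rather than the raw Poisson variance, which is the fingerprint of the conditioning.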

Summary
Putting all the pieces developed in this section together gives a term for our inference of the form

π(F | R) ∝ ∏_{t=1}^T π(r_t | n_t, λ_t) π(n_t | λ_t) π(λ_t),    (S35)

where

r_t | n_t, λ_t ∼ Multinomial(n_t, f_t(λ_t)),  n_t | λ_t ∼ Poiss( ∑_{b=1}^B λ_t^(b) ),    (S36)

and

λ_t^(b) ∼ LogNormal(µ_λ, σ_λ) for each barcode b.    (S37)

Population mean fitness uncertainty π(s̄ | F, R)
Next, we turn our attention to the problem of determining the population mean fitness values s̄.
First, we notice that our fitness model in Equation 3 does not include the value of the raw reads. They enter the calculation indirectly through the inference of the frequency values developed in the previous section. This means that we can remove the conditioning of the value of s̄ on the number of reads, obtaining the simpler probability function

π(s̄ | F, R) = π(s̄ | F).    (S38)

Moreover, our fitness model does not directly describe how the population mean fitness evolves over time. In other words, our model cannot explicitly compute the population mean fitness at time t + 1 from the information we have about time t. Given this model limitation, we are led to assume that we must infer each s̄_t independently. Expressing this for our inference results in

π(s̄ | F) = ∏_{t=1}^{T−1} π(s̄_t | f_t, f_{t+1}),    (S39)

where we split our matrix F by time point and kept only the conditioning on the frequencies relevant to computing the mean fitness at time t.
Although our fitness model in Equation 3 also includes the relative fitness s^(m), to infer the population mean fitness we only utilize data from the neutral lineages that, by definition, have relative fitness s^(n) = 0. Therefore, the conditioning in Equation S39 can be further simplified by keeping only the frequencies of the neutral lineages, i.e.,

π(s̄_t | f_t, f_{t+1}) = π(s̄_t | f_t^N, f_{t+1}^N).    (S40)

Recall that we emphasized earlier that the frequencies f do not represent the true frequency of a particular lineage in the population but rather a "normalized number of cells." Therefore, it is safe to assume that each of the N neutral lineages' frequencies changes independently; the correlation by which increasing the frequency of one lineage decreases the frequency of others is already captured by the model presented above. Thus, we write

π(f_t^N, f_{t+1}^N | s̄_t) = ∏_{n=1}^N π(f_t^(n), f_{t+1}^(n) | s̄_t).    (S41)

Now, we can focus on one of the terms on the right-hand side of Equation S41. Writing Bayes' theorem results in

π(s̄_t | f_t^(n), f_{t+1}^(n)) ∝ π(f_t^(n), f_{t+1}^(n) | s̄_t) π(s̄_t).    (S42)

Notice that the likelihood defines the joint distribution of the neutral barcode frequencies conditioned on the population mean fitness. However, rewriting our fitness model in Equation 3 for a neutral lineage, leaving frequencies on one side and fitness on the other, results in

f_{t+1}^(n) / f_t^(n) = e^{−s̄_t τ}.    (S43)

Equation S43 implies that our fitness model only relates the ratio of frequencies, not the individual values. To get around this complication, we define

γ_t^(b) ≡ f_{t+1}^(b) / f_t^(b)    (S44)

as the ratio of frequencies between two adjacent time points for any barcode b. This allows us to rewrite the joint distribution as

π(f_t^(n), f_{t+1}^(n) | s̄_t) = π(f_t^(n), γ_t^(n) | s̄_t).    (S45)

Let us rephrase this subtle but necessary change of variables, since it is a key part of the inference problem: our series of independence assumptions led us to Equation S42, which relates the value of the population mean fitness s̄_t to the frequency of a neutral barcode at times t and t + 1. However, as shown in Equation S43, our model functionally relates the ratio of frequencies, which we defined as γ_t^(n), rather than the individual frequencies, to the mean fitness. Therefore, instead of writing for the likelihood the
joint distribution of the frequency values at times t and t + 1 conditioned on the mean fitness, we write the joint distribution of the barcode frequency at time t and the ratio of the frequencies. These must be equivalent joint distributions, since there is a one-to-one mapping between γ_t^(n) and f_{t+1}^(n) for a given value of f_t^(n). Another way to phrase this is that knowing the frequencies at times t and t + 1 provides the same amount of information as knowing the frequency at time t and the ratio of the frequencies: to recover f_{t+1}^(n), we simply compute

f_{t+1}^(n) = γ_t^(n) f_t^(n).    (S46)

The real advantage of rewriting the joint distribution as in Equation S45 comes from splitting it as a product of conditional distributions of the form

π(f_t^(n), γ_t^(n) | s̄_t) = π(f_t^(n) | γ_t^(n), s̄_t) π(γ_t^(n) | s̄_t).    (S47)

Written in this form, we can finally propose a probabilistic model for how the mean fitness relates to the frequency ratios we determine in our experiments. The second term on the right-hand side of Equation S47 describes how the determined frequency ratio γ_t^(n) relates to the mean fitness s̄_t. From Equation S43 and Equation S44, we can write

ln γ_t^(n) = −s̄_t + ε_t^(n),    (S48)

where, for simplicity, we set τ = 1. Note that we added an extra term, ε_t^(n), characterizing the deviations of the measurements from the theoretical model. We assume these errors are normally distributed with mean zero and some standard deviation σ_t, implying that

ln γ_t^(n) | s̄_t, σ_t ∼ N(−s̄_t, σ_t),    (S49)

where we include the nuisance parameter σ_t to be determined. If we assume the log frequency ratio is normally distributed, the frequency ratio itself is distributed log-normally.
This means that

γ_t^(n) | s̄_t, σ_t ∼ LogNormal(−s̄_t, σ_t).    (S50)

Having added the nuisance parameter σ_t implies that we must update Equation S42 to

π(s̄_t, σ_t | f_t^(n), γ_t^(n)) ∝ π(f_t^(n), γ_t^(n) | s̄_t, σ_t) π(s̄_t) π(σ_t),    (S51)

where we assume the prior for each parameter is independent, i.e.,

π(s̄_t, σ_t) = π(s̄_t) π(σ_t),    (S52)

and so on. We can then write the joint distribution on the right-hand side of Equation S66 as a product of conditional distributions via the chain rule,

π(f_1^(m), γ_1^(m), …, γ_{T−1}^(m) | s^(m), s̄, σ) = π(f_1^(m) | γ_1^(m), …, γ_{T−1}^(m), s^(m), s̄, σ) ∏_{t=1}^{T−1} π(γ_t^(m) | γ_1^(m), …, γ_{t−1}^(m), s^(m), s̄, σ).    (S69)

Writing the fitness model in Equation 3 as

ln γ_t^(m) = (s^(m) − s̄_t) τ

reveals that the value of each of the ratios γ_t^(m) only depends on the corresponding mean fitness value s̄_t and the relative fitness s^(m). Therefore, we can remove most of the conditioning on the right-hand side of Equation S69, resulting in the much simpler joint distribution

π(f_1^(m), γ_1^(m), …, γ_{T−1}^(m) | s^(m), s̄, σ) = π(f_1^(m)) ∏_{t=1}^{T−1} π(γ_t^(m) | s^(m), s̄_t, σ_t),    (S70)

where for the first term on the right-hand side of Equation S70 we apply the same logic as in Equation S53 to remove all other dependencies. We emphasize that although Equation S70 looks like a series of independent inferences, the value of the relative fitness s^(m) is shared among all of them. This means that the parameter is not inferred individually for each time point, which would result in different estimates; instead, each time point contributes independently to the inference of a single estimate of s^(m).
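The point that every time point contributes to a single shared estimate of s^(m) can be illustrated with a small sketch (Python with NumPy; the fitness values, mean-fitness series, and noise level are hypothetical). Under the Gaussian model for log frequency ratios, with known s̄_t and τ = 1, the pooled maximum likelihood estimate is simply the average of ln γ_t^(m) + s̄_t over cycles:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical ground truth: one mutant with relative fitness s_m, and a known
# population mean fitness per cycle. Log frequency ratios follow the model
# ln(gamma_t) = s_m - sbar_t + eps_t (tau = 1) with Gaussian noise.
s_m = 0.35
sbar = np.array([0.05, 0.12, 0.20, 0.26])
log_gamma = s_m - sbar + rng.normal(0.0, 0.02, size=sbar.size)

# Each time point contributes to a single pooled estimate of s_m: under the
# Gaussian model with equal noise, the MLE is the average over cycles.
s_m_hat = (log_gamma + sbar).mean()
print(s_m_hat)
```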
Using equivalent arguments to those above, we assume

θ̃^(m)[j] ∼ N(0, 1),

which allows deviations from the hyper-fitness value to be either positive or negative, and

τ^(m)[j] ∼ LogNormal(µ_τ^(m)[j], σ_τ^(m)[j]),

where τ^(m)[j] is a strictly positive random variable that characterizes the deviation of the local fitness value from the global hyper-fitness, and µ_τ^(m)[j] and σ_τ^(m)[j] are user-defined parameters capturing the expected magnitude of the batch effects.
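A sketch of sampling from this hierarchical prior (Python with NumPy; the number of batches, the hyper-fitness value, and the log-normal hyperparameters are invented for illustration and are not the paper's defaults) shows how the pieces combine, with the strictly positive scale preserving the sign of the standard-normal deviation:

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative non-centered hierarchical prior: a global hyper-fitness theta,
# standard-normal deviations theta_tilde per batch, and strictly positive
# scales tau drawn from a log-normal prior. All values are hypothetical.
n_batches = 3
theta = 0.3                                                 # hyper-fitness
theta_tilde = rng.standard_normal(n_batches)                # signed deviations
tau = rng.lognormal(mean=-3.0, sigma=0.5, size=n_batches)   # positive scales

# Local (per-batch) fitness values: the deviation enters multiplicatively,
# so tau > 0 sets the magnitude and theta_tilde sets the sign.
s_local = theta + theta_tilde * tau
print(s_local)
```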

Defining prior probabilities
One aspect commonly associated, in both positive and negative ways, with Bayesian analysis is the definition of prior probabilities. On the one hand, the naive textbook version of Bayesian analysis defines the prior as encoding the information we have about the inference in question before acquiring any data. This is the "ideal" use of priors that, whenever possible, should be implemented. On the other hand, for most practitioners of Bayesian statistics in the age of big data, the definition of the prior becomes a tool to ensure the convergence of sampling algorithms such as MCMC [25]. However, for our particular problem, although we deal with large amounts of data (inferences can be made for > 10K barcodes over multiple time points, resulting in > 100K parameters), each barcode has very little data, as it is measured only once per time point over < 10 growth-dilution cycles. Furthermore, it is incredibly challenging to understand the noise sources related to culturing conditions, DNA extraction, library preparation, etc., and encode them into reasonable prior distributions.
Empirically, our approach for this work defined the priors based solely on the neutral lineage data, as they represent the only repeated measurements of a single genotype in our experimental design. We acknowledge that defining the priors after observing the data might be considered an incoherent inference. However, as expressed by Gelman et al. [25]: "Incoherence is an unavoidable aspect of much real-world data analysis; and, indeed, one might argue that as scientists we learn the most from the anomalies and reassessments associated with episodes of incoherence."
With this in mind, we leave it to the reader to judge the selection of priors. Furthermore, the software package associated with this work, BarBay.jl, is written so that users can experiment with different prior selection criteria that fit their needs. We strongly advocate that statistics should not be done with a black-box, one-size-fits-all mindset but rather as a formal way to encode the assumptions behind the analysis, subject to constructive criticism. With this philosophical baggage behind us, let us now focus on how the priors used for this work were selected.
A simple exponential growth model of the form

dn_i/dt = λ_i n_i,

where n_i is the number of cells of strain i and λ_i is the corresponding growth rate, is not enough. Instead, we will assume that the cells follow a logistic growth equation of the form

dn_i/dt = λ_i n_i (1 − ∑_{j=1}^N n_j / κ),    (S105)

where κ is the carrying capacity and N is the total number of strains in the culture.
The inference method is based on a model that assumes that, with the time passed between dilutions being τ ≈ 8 generations, the change in frequency for a mutant barcode from cycle t to the next cycle t + 1 can be approximated as

f_{t+1}^(m) / f_t^(m) ≈ e^{(s^(m) − s̄_t) τ},

where s^(m) is the relative fitness of the mutant compared to the ancestral strain and s̄_t is the mean fitness of the population at cycle t. To test this assumption, we implemented a numerical experiment following the logistic growth model described in Equation S105. To simulate multiple growth-dilution cycles, we take the population composition at the final time point and use it to initialize a new logistic growth simulation. Figure S3 shows the resulting number of cells at the last time point of a cycle over multiple growth-dilution cycles for the genotypes in Figure S2. We can see that the adaptive lineages (blue curves) increase in abundance, while detrimental lineages (red curves) decrease.
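The numerical experiment described here can be sketched as follows (Python with NumPy; the growth rates, carrying capacity, dilution factor, and Euler integrator are illustrative assumptions rather than the paper's exact implementation):

```python
import numpy as np

def grow(n0, lam, kappa, dt=0.01, steps=1000):
    """Integrate coupled logistic growth dn_i/dt = lam_i*n_i*(1 - sum(n)/kappa)
    with a simple Euler scheme (a sketch, not the paper's exact integrator)."""
    n = n0.astype(float).copy()
    for _ in range(steps):
        n += dt * lam * n * (1.0 - n.sum() / kappa)
    return n

# Illustrative parameters: a neutral lineage and a faster-growing adaptive one.
lam = np.array([1.0, 1.1])
kappa = 1e6
n = np.array([1e4, 1e4])

# Growth-dilution cycles: grow to saturation, then dilute 1:100 to seed the
# next cycle, as in the numerical experiment described above.
frequencies = []
for cycle in range(5):
    n = grow(n, lam, kappa)
    frequencies.append(n / n.sum())
    n = n / 100.0

print(frequencies[-1])  # the adaptive lineage's frequency increases each cycle
```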
In the preceding sections, we derive the functional form used to infer the relative fitness of each lineage as

ln( f_{t+1}^(m) / f_t^(m) ) = (s^(m) − s̄_t) τ.

Figure S4 shows the corresponding log frequency ratio curves for the logistic growth simulation. The displacement of these curves with respect to the neutral lineages determines the ground truth relative fitness value for these simulations.
To simulate the experimental noise, we add two types of noise:

1. Poisson noise between dilutions. We take the final point of the logistic growth simulation and sample a random Poisson number based on this last point to set the initial condition for the next cycle.

2. Gaussian noise when performing the measurements. When translating the underlying population composition into a number of reads, we add a custom amount of Gaussian noise.
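The two noise sources can be sketched as follows (Python with NumPy; the cell counts, dilution factor, sequencing depth, and 5% Gaussian noise level are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical final cell counts from one growth cycle.
final_counts = np.array([5.0e5, 3.0e5, 2.0e5])

# 1. Poisson noise at the dilution bottleneck: the next cycle is seeded with a
#    Poisson sample around the diluted counts.
seed_counts = rng.poisson(final_counts / 100.0)

# 2. Gaussian noise at measurement: reads are the underlying frequencies times
#    the total depth, perturbed multiplicatively by Gaussian noise.
depth = 100_000
freq = final_counts / final_counts.sum()
reads = freq * depth * (1.0 + rng.normal(0.0, 0.05, size=freq.size))
print(seed_counts, reads.round())
```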
Figure S5 shows the frequency trajectories (left panels) and log frequency ratios (right panels) for a noiseless simulation (upper panels) and a simulation with added noise (lower panels).
The noiseless simulation is used to determine the relative fitness for each of the lineages, which serves as the ground truth to be compared with the resulting inference.

Figure S1. High-dimensional numerical quadrature does not scale with dimensionality. Schematic depiction of the problem with naive numerical quadrature to integrate over an unknown density. While the density is concentrated on the dark peak, most of the evaluations over the x₁-x₂ grid do not contribute to the value of the integral. Modern Markov chain Monte Carlo algorithms, such as Hamiltonian Monte Carlo, can efficiently perform this high-dimensional integration by utilizing gradient information from the target density [Betancourt, 6]. Nevertheless, these sampling-based methods become prohibitively slow as the number of dimensions grows.

Figure S2 shows an example of the deterministic trajectories for 50 labeled neutral lineages and 1000 lineages of interest. The upper red curve that dominates the culture represents the unlabeled ancestral strain included in the experimental design described above.

Figure S3.Growth-dilution cycles for logistic growth simulation.Each point represents the final number of cells after a growth cycle for each lineage.Colors are the same as in Figure S2.

Figure S5. Logistic growth-dilution simulations with and without noise.