Estimating the timing of HIV infection from unmutated sequences

For HIV, the time since infection can be estimated from sequence data for acutely infected samples. One popular approach relies on the star-like nature of phylogenies generated under exponential population growth, and the resulting Poisson distribution of mutations away from the founding variant. However, real-world complications, such as APOBEC hypermutation and multiple-founder transmission, present a challenge to this approach, requiring data curation to remove these signals before reasonable timing estimates may be obtained. Here we suggest a simple alternative approach that derives the timing estimate not from the entire mutational spectrum but from the proportion of sequences that have no mutations. This can be approximated quickly and is robust to phenomena such as multiple founder transmission and APOBEC hypermutation. Our approach is Bayesian, and we adopt a conjugate prior to obtain closed form posterior distributions at negligible computational expense. Using real data and simulations, we show that this approach provides accurate timing estimates and credible intervals without the inconvenience of data curation and is robust to complicating phenomena that can mislead existing approaches or cause them to fail entirely. For immediate use we provide an implementation via Google Sheets, which offers bulk analysis of multiple datasets, as well as more detailed individual-donor analyses. For inclusion in data processing pipelines we provide implementations in three languages: Julia, R, and Python.


Introduction
HIV prevention efficacy trials can leverage accurate inference of time since infection (hereafter referred to as "infection time") in order to identify correlates of protection. 1 While clinical diagnostic staging can inform timing estimates, there is particular interest in using viral sequence data from early in infection for this purpose. For HIV, roughly 80% of infections are homogeneous, initiated by one distinct founding strain [2][3][4] , suggesting a strong transmission bottleneck. While the exact mechanism of the transmission bottleneck is unknown, there is evidence that route of transmission is associated with increased odds of observing multiple founder infections. 3,5,6 These estimates were based on relatively shallow sequencing, and it is possible that the deployment of higher throughput sequencing will reveal that the transmission of multiple founder variants occurs even more frequently. Following the establishment of the initiating founder strain(s), HIV typically grows rapidly and exponentially 7 . This leaves an imprint on the resulting phylogeny: the tree is largely "star-like" with all lineages coalescing near the most recent common ancestor (MRCA). 8 It has been observed that acute HIV infections with a homogeneous strain typically follow this pattern. 2 The diversity of this acute-infection virus population has been found to increase roughly linearly with time, motivating the development of several methods for diversity-based infection time estimation. [9][10][11][12][13][14] One popular method is Poisson-Fitter, 15 a maximum likelihood approach which estimates infection time from the distribution of pairwise Hamming distances between aligned sequences, under the assumption that the number of mutations from the founder to each observed sequence follows a Poisson distribution. Violations of this assumption have been attributed to APOBEC mediated hypermutation, the transmission of multiple variants, the onset of immune selection, or stochastic early mutations. 15,16 Hypermuation via host APOBECs (apolipoprotein B mRNAediting catalytic polypeptides) introduces G to A mutations into the HIV genome by cytidine deamination of the negative strand cDNA. [17][18][19] While typically understood as an "all-or-nothing" phenomenon associated with defective viral Vif, 20,21 several studies have demonstrated that more subtle, sub-lethal levels of can occur in in vitro experiments. 22,23 Additionally, studies using Poisson-Fitter timing estimates have noted that removal of APOBEC-targeted sites restores a Poisson distributed mutational spectrum. 2,24 For multiple founder infections, prior to the onset of immune selection (Fiebig stage I and II) and recombination, a theoretical population resembles a collection of star-like phylogenies -one for each founder -each with the same root-to-tip distances. If the founder strains can be discriminated, and all sequences assigned to the correct founder, then each founder can be modelled separately to obtain independent estimates of λ. In practice, the difficulty of such splitting can range from trivial, to difficult, to nearly impossible. Multiple variant transmission from acute donors is known to occur, and may be more prevalent than previously appreciated. 25 It is particularly unclear in this scenario how to distinguish closely-related founder variants that diverged in the donor population from stochastic early mutations separating variants that diverged in the recipient population. A variety of founder identification methods of varying complexity exist, 26 tween sequences and cause violations of the Poisson assumptions that Poisson-Fitter relies upon. These can be remedied by curation, removing APOBEC hypermutated sequences or sites and grouping sequences by founder before attempting to estimate the time since infection. These curation steps are non-trivial however, and both introduce a substantial effort burden on the user and potentially multiple user-level decisions for each infection time estimate, which may be statistically problematic, depending on how these estimates are used. 29 Here we describe ZFitter (for "Zero-Fitter") which, like Poisson-Fitter, aims to estimate infection time from HIV sequence datasets, but is designed to be more robust to the Poisson violations that are routinely observed in HIV sequence datasets.

Methods
Inference. ZFitter begins with the observation that, for most datasets sampled from acute infection, there are multiple sequences that are identical to each other, typically representing variants that have not mutated away from the founder. Like Poisson-Fitter, ZFitter assumes that mutations are Poisson distributed, with parameter λ, which we can subsequently relate to "time" through a known mutation rate. Under this Poisson assumption, the number of sequences with no mutations, s, among a fixed total of N sampled sequences, is Binomial distributed with probability mass function ZFitter is Bayesian in nature, and we introduce a twoparameter family of conjugate prior distributions for the above likelihood, denoted Z(α, β). The density functions are defined for all α, β > 0, and all λ > 0, as Assuming a Z(α, β) prior over λ, the posterior distribution is Z(α + s, β + N − s). Figure 1A shows Poisson distributions (and the expected proportion of sequences with no mutations) for varying λ. And using our standard prior of α = 0.3, β = 1 (which we use throughout), figure 1B displays example posterior distributions corresponding to a range of N and s counts. We consider cases where s = 0 to be "inestimable" by ZFitter, where all we can do is provide a lowerbound on the infection time.
Our parameterisation of the family is motivated by the observation that λ ∼ Z(α, β) if and only if e −λ ∼ Beta(α, β). Using this, the quantile function for a Z(α, β)-distributed random variable, F −1 (q), can be expressed in terms of the standard quantile function for a Beta(α, β)-distributed variable, All 95% Bayesian "Credible Intervals" (CIs) presented here are F −1 (0.025) to F −1 (0.975). This approach has a number of computational and statistical consequences. Computationally, we can approximate s relatively well with the number of identical sequences, which is trivial to compute from sequence datasets. A multiple sequence alignment is not even required for this. Secondly, given chosen prior parameters and s and N counts, the conjugacy above provides closed-form expressions for posterior medians, and any required posterior credible intervals. Statistically, ZFitter involves a trade-off. On the one hand, not all the information in the sequences is exploited, as the full mutational spectrum is ignored, attending to only the zero-valued mass. Under perfect Poisson assumptions, ZFitter should thus provide less precise and less confident estimates of time since infection. On the other hand, this renders ZFitter far more robust to particular assumption violations that frequently occur in real-world data. Key among these are APOBEC-mediated hypermutation, which dramatically inflates the mutation rate (real or apparent) of a small number of sequences, and multiple founder infections. Both can cause a dramatic shift in the distribution of pairwise distances, but will only minimally affect the proportion of completely unmutated sequences. We demonstrate this robustness through simulation.
Implementation. The approach described here is so computationally trivial that it can be implemented in a simple spreadsheet. Indeed, we offer a public "Google Sheets" implementation: https://bit.ly/3pOsa2a. This offers two kinds of functionality: i) a bulk processing option, where counts of sequences (total, and non-singletons) are input for a large number of datasets, and infection estimates and CIs are provided for each, and ii) a single-dataset processing option, where either counts are entered, or sequences are pasted in directly, and both the estimates, and a plot of the prior and posterior distributions are displayed. For convenient incorporation into computational pipelines, simple implementations in Julia, Python and R are included in Supplemental S1-3.

Simulations.
To generate simulated sequences from a founder env sequence, an HKY85 30 substitution rate matrix, Q, was used with a transition to transversion ratio of 4.5. Equilibrium frequencies were calculated from an HIV dataset derived from a single donor 31 : π A = 0.34, π C = 0.17, π G = 0.23, π T = 0.24. The Q matrix was scaled to yield an average rate of 1.19e-5 substitutions per site per day; the daily internal rate used by Poisson-Fitter. 28 All individual simulations, unless otherwise noted, consisted of 100 sequence observations and time, t, was varied from 5 to 80 days. Mutations are sampled from P = e Qt .
Star-like, single founder. Infections were modelled as star trees with all observed sequences equidistant from the root. Branch lengths were set equal to the time (in days) since infection.
Star-like, dual founder. Two separate founder sequences were sampled from the "source" population, derived from one timepoint of a longitudinally-sampled donor from a primary infection cohort. 31 Founder frequencies were drawn from a Multinomial distribution, which itself was drawn from a Dirichlet distribution with a concentration parameter of 1. Sequences from each founder were then mutated as in the single-founder case.
Context-dependent APOBEC effects were introduced proportional to their empirical occurrence across trinucleotide contexts estimated from a control dataset. Fig ED1 shows the frequencies of APOBEC mutations per sequence introduced by this scheme.

Results
Simulated. Figure 2 summarises the performance of ZFitter and Poisson-Fitter on simulated sequence datasets. For single founder infections, where Poisson assumptions are clearly satisfied, estimates from both methods closely tracked the simulated infection time (Figure 2A). For single founder infections with APOBEC mediated hypermutation, ZFitter performed similarly to post-curation Poisson-Fitter with all APOBEC positions scrubbed from the alignment, and was significantly more accurate than Poisson-Fitter without any curation ( Figure 2B). In comparison to true infection time, both curated Poisson-Fitter and Zfitter estimates for the lowest range of infection times were biased slightly upward. For ZFitter, this is because a high percentage of APOBECmutated sequences presents a non-negligible influence on the number of completely unmutated sequences for smaller t, but matters relatively less when t is large. It is not yet clear why post-curation Poisson-Fitter exhibits this behavior as well.
For both dual founder simulations, ZFitter was able to infer reasonable estimates of infection time from all combined sequences, closely matching curated Poisson-Fitter, which, here, relied on perfect knowledge of how to group the sequences by founder, and was estimated for the largest lineage only ( Figure 2C,D).
Empirical. ZFitter estimates were obtained from 130 Sanger SGA datasets from two published studies on acute HIV infection. 2,6 For an in-depth description of the results of each dataset and associated metadata, see Table S1. Figure 3 displays the correlation between ZFitter and Poisson-Fitter for  Figure 4 displays ZFitter estimates for all 130 datasets grouped by Fiebig stages. 32 Timing estimates from infections with Fiebig stages II and III differed significantly from infections in Fiebig stage V, but all other comparisons were not significant (Mann-Whitney U Test). The relatively fast clinical progression through stages II, III, and IV could explain part of why their median ZFitter estimates are so similar. Stages II, III and IV have typical duration times of five, three, and six days, respectively. 32 Stage V is longer, with a median duration of 70 days.

Discussion
Accurately estimating the date of HIV infection is a critical parameter for on-going clinical trials aimed at HIV prevention, where the knowledge of the titer or concentration of inhibitors at the time of infection is required to determine the correlates of protection. 1 In a recent study, Poisson-Fitter, a maximum likelihood approach which models the distribution of nucleotide mutations in a acute infection, yielded more accurate, more precise and unbiased estimates for the time of infection than did coalescent phylogenetic models implemented in BEAST. 24,33 However, violations of the Poisson-Fitter model assumptions of star-like phylogeny and a Poisson-distributed accumulation of mutations by var-ious phenomena, including APOBEC-mediated hypermutation, immune mediated selection, recombination, or multiplefounder transmission can mislead Poisson-Fitter's estimates, requiring manual data curation and iterative Poisson fitting.
Here, we developed a method which is more robust to these complications by only considering if a sequence has mutated away from the founding strain. This approach does not leverage all the information in the sequences and will therefore be less precise when mutations are truly Poisson distributed. However, this is a potentially beneficial trade-off, as many assumption violations will dramatically skew the distribution of pairwise distances but not substantially alter the proportion of unmutated sequences. The performance of Zfitter on simulated datasets (Figure 2) demonstrates its comparable performance to Poisson-Fitter when infections are homogeneous, as well as ZFitter's robustness to both APOBEC mediated hypermutation and multiple founder infection, which require additional curation for reasonable Poisson-Fitter estimates. The simulated level of APOBEC only induces small bias in estimated infection time for low t. This is in contrast to Poisson-Fitter, which is relatively sensitive to these sequences as they induce strong perturbations to the distribution of pairwise Hamming distances and skew λ upwards. And even when APOBEC sites are removed, Poisson-Fitter appears to have the same bias at low t as ZFitter.
To investigate ZFitter's behavior on real data from acute HIV infection, we processed 130 Sanger SGA datasets from published studies with Poisson-Fitter estimates of infection time. 2,6 ZFitter and Poisson-Fitter estimates were largely consistent when there was a good Poisson model fit, albeit with larger uncertainty than observed in our simulations (Figure 3). This was at least partly due to lower sampling depth in available datasets (median=28). In a similar manner to the dual-founder simulations, in instances of multiple founder infection ZFitter produced timing estimates more consistent with acute infection when run on all available sequences. There are several caveats to our approach which are important to discuss. As all of our signal comes from the number of sequences which cluster together, estimates made on any dataset where all sequences are unique or the ratio of s to N  Figure 3.
is very small should be treated with caution. One biological process that can fragment the founding virus lineage is early T cell mediated selection. This is predominantly positive/directional selection by CD8+ T cells narrowly focused on a small number of epitopes and is known to occur relatively early in infection associated with initial control of plasma viral load. [34][35][36][37] We have not investigated the effect of such a process on our inference since we do not know how to appropriately simulate it, but we note that for real, single-founder datasets, our estimates did not appear to be biased either way compared to Poisson-Fitter. Therefore, at the very least we are not especially affected by this process.
Another potential source of error is recombination between distinct founder strains. Recombinant sequences are frequently detected in acute infection when initiated by two or more founding strains. 2,5,6 When they contain unique breakpoints, these sequences inflate the number of unique sequences in the data. We do not explicitly filter for recombinant sequences currently in ZFitter, as most standard methods for recombination detection either require specification of founding lineages in advance (RAPR) 38 or require an amount of phylogenetic signal which may be lacking in instances of multiple founder infection from a low diversity source (RDP4). 39 However, we note that the effect of recombination on s is similar to that of APOBEC: even extreme APOBEC hypermutation of a sequence will reduce s by one, just as would that sequence harbouring a single mutation, or a sequence being the recombined offspring of two other sequences. This is to say that, as long as the per-sequence probability of at least one mutation is substantially higher than the probability of recombination or hypermutation, ZFitter's estimate should remain relatively robust to such processes.
ZFitter may open the door for new strategies for identifying founder variants and grouping sequences into founder clades. For low-diversity multivariant founder infections, the founder identification problem suffers from a chicken/egg issue. It can be difficult to split a dataset into founders without knowing the amount of post-infection divergence. But estimating the post-infection divergence, especially with methods such as Poisson-Fitter, require that datasets are already split into founding clades. By providing robust estimates of postinfection divergence without any curation, simply relating the number of unmutated sequences to the expected divergence, ZFitter's λ estimates may be useful as inputs to founder clustering algorithms, which we will explore in future work.
Here we have shown that the performance of ZFitter on real and simulated data supports its further investigation as a timing estimator for sequences from acute HIV infection. The method is trivial to implement and is designed to require no sequence curation to obtain timing estimates. Where extensive sequence curation is feasible, ZFitter should provide a valuable supplementary method to existing approaches such as Poisson-Fitter, allowing a consistency check by comparing the uncurated ZFitter estimate to the curated Poisson-Fitter estimate. Where curation is infeasible, ZFitter provides a useful standalone approach to estimating acute viral infection times.