Abstract
An early analysis of SARS-CoV-2 deep-sequencing data that combined epidemiological and genetic data to characterize the transmission dynamics of the virus in and beyond Austria concluded that the size of the virus’s transmission bottleneck was large – on the order of 1000 virions. We performed new computational analyses using these deep-sequenced samples from Austria. Our analyses included characterization of transmission bottleneck sizes across a range of variant calling thresholds and examination of patterns of shared low-frequency variants between transmission pairs in cases where de novo genetic variation was present in the recipient. From these analyses, among others, we found that SARS-CoV-2 transmission bottlenecks are instead likely to be very tight, on the order of 1-3 virions. These findings have important consequences for understanding how SARS-CoV-2 evolves between hosts and the processes shaping genetic variation observed at the population level.
In their recent research article (1), Popa, Genger et al. combined epidemiological and viral genetic data to characterize the transmission dynamics of SARS-CoV-2 in Austria between February and April 2020. The genetic data they analyzed comprised >500 deep-sequenced virus samples. Beyond using consensus-level SARS-CoV-2 sequences to infer transmission clusters within Austria and to examine the role that Austria played in seeding regional epidemics elsewhere in Europe, the authors used their sequenced samples to characterize mutational dynamics within hosts and along short transmission chains. While we believe that the findings from their consensus-level genetic analysis are robust, we here revisit their analyses of mutational dynamics at the below-the-consensus level. From our reanalysis, we conclude that transmission bottleneck sizes are not on the order of 1000 virions as concluded by the authors, but instead much smaller.
Our decision to revisit Popa, Genger et al.’s conclusions on transmission bottleneck sizes stems from curious patterns present in some of their figures. First, inferred bottleneck size estimates using a 3% variant calling threshold were bimodal, with 14 of the 39 transmission pairs having an inferred bottleneck size (Nb) of <10 and the remaining 25 pairs having Nb estimates of 115-5000 (their Figure S4G). Further, when a 1% variant calling threshold was used, only a single transmission pair retained an Nb estimate of <10 (their Figure 5B). In an attempt to understand these patterns, we first reanalyzed their deep-sequencing data and re-called variants using their pipeline (Supplementary Methods). In the analyses presented below, we use these re-called variant frequencies, which appear to be highly similar to those presented in Popa, Genger et al. based on the “tv plots” published as part of their article (10.5281/zenodo.4247401).
As expected, re-estimation of transmission bottleneck sizes at variant calling thresholds of 1% and 3% yielded similar results to those shown in (1) (Figure S1A,B). During this analysis, we noticed that bottleneck size estimates dropped, sometimes precipitously, when going from a 1% cutoff to a 3% cutoff for every one of the 13 transmission pairs that had donors with a maximum iSNV frequency of >6% (Figure 1A; p = 0.004 using a paired t-test). Since increasing the variant calling threshold removes low-frequency iSNVs from the analysis, these consistent decreases in Nb estimates could come about if low-frequency donor iSNVs indicated that bottleneck sizes were large while high-frequency donor iSNVs instead indicated that bottleneck sizes were small. Examination of low-frequency iSNVs across donor-recipient pairs indeed indicates high levels of congruence between their frequencies (Figure 1B inset; Figure S2), which would suggest wide transmission bottlenecks. In contrast, high-frequency donor iSNVs rarely appeared to be transmitted to their corresponding recipient (Figure S2), suggesting narrow transmission bottlenecks.
To come to terms with these conflicting patterns, we considered genetic variation that appeared de novo in recipient hosts. This genetic variation appears in the “tv plots” as iSNVs absent from a donor but present in a corresponding recipient. When a de novo variant is observed as fixed in a recipient sample, we should not observe any shared iSNVs between a donor and a recipient that are present in the recipient at subclonal (i.e., not fixed) frequencies unless within-host recombination occurred extremely rapidly or the fixed de novo variant arose multiple times in different genetic backgrounds. This is because fixation implies that the sampled viral population descends from the single genome in which the variant arose, so any transmitted variant should itself be either fixed or lost. However, in the transmission pairs analyzed in Popa, Genger et al., shared subclonal iSNVs, at extremely similar frequencies, are observed in several transmission pairs where there is also a fixed de novo variant present in the recipient. The transmission pair CoV_162 → CoV_161 provides an example (Figure 1B). This means that the low-frequency iSNVs shared between CoV_162 and CoV_161 either are spurious or arose independently in the recipient (that is, they are homoplasies). In either case, these shared low-frequency iSNVs are highly unlikely to constitute transmitted genetic variation, and as such would need to be excluded from a transmission bottleneck analysis involving this transmission pair.
On its own, this argument lets us conclude only that the low-frequency shared iSNVs in transmission pair CoV_162 → CoV_161 are almost certainly not shared between donor and recipient as a result of transmission. However, transmission pairs with de novo fixed variants in the recipient (here, defined as >94% in frequency), transmission pairs with de novo high-frequency (6-94%) variants in the recipient, and transmission pairs with only low-frequency variants (<6%) in the recipient exhibit highly similar distributions of low-frequency (1-6%) shared iSNVs (Figure 1C). The similarity between these distributions indicates that these iSNVs may be subject to the same interpretation as for CoV_162 → CoV_161. Indeed, when we calculate the probability that a low-frequency donor iSNV is observed in a corresponding recipient (at ≥1%) versus observed in an epidemiologically unlinked recipient, we find that the distributions of these probabilities are highly similar (Figure 1D). It is thus highly unlikely that these shared low-frequency iSNVs are transmitted to their corresponding recipient; if this were the case, we would expect the probability of shared variants to be higher for the corresponding recipient than for an epidemiologically unlinked one.
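The linked-versus-unlinked comparison described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function name `fraction_shared`, the dictionary representation of iSNVs, and the toy frequencies are all hypothetical, and the 1-6% window and 1% calling threshold are simply taken from the text.

```python
from typing import Dict, Tuple

# Hypothetical representation: (genome position, alternate base) -> frequency in [0, 1].
ISNVs = Dict[Tuple[int, str], float]


def fraction_shared(donor: ISNVs, recipient: ISNVs,
                    low: float = 0.01, high: float = 0.06,
                    call_threshold: float = 0.01) -> float:
    """Fraction of low-frequency donor iSNVs that are also called in `recipient`."""
    # Keep only donor iSNVs in the low-frequency window [low, high).
    donor_low = [site for site, freq in donor.items() if low <= freq < high]
    if not donor_low:
        return float("nan")  # no low-frequency donor iSNVs to evaluate
    shared = sum(1 for site in donor_low
                 if recipient.get(site, 0.0) >= call_threshold)
    return shared / len(donor_low)


# Made-up frequencies, only to show the call pattern (not real data).
donor = {(100, "T"): 0.015, (200, "A"): 0.020, (300, "G"): 0.300}
linked_recipient = {(100, "T"): 0.016}
unlinked_recipient = {(100, "T"): 0.014, (200, "A"): 0.021}

p_linked = fraction_shared(donor, linked_recipient)
p_unlinked = fraction_shared(donor, unlinked_recipient)
```

If low-frequency iSNVs were genuinely transmitted, one would expect `p_linked` to exceed `p_unlinked` across real transmission pairs; similar values in both groups are what argue against transmission.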
Given these findings that cast doubt on low-frequency iSNVs constituting transmitted genetic variation, we decided to quantify the extent to which particular iSNVs were present across the samples used in the transmission pair analyses. We found that 5 iSNVs were present in 40 or more of the 43 samples analyzed, at frequencies that fell into a very narrow range (1%-2.2%) (Figure 1E). Many other iSNVs were also present across numerous samples (Figure 1E; Figure S3, Figure S4), with the frequencies of any particular iSNV being highly similar across the samples that it appears in. This similarity in iSNV frequencies argues against these low-frequency iSNVs being homoplasies, leaving spurious variant calls as the more likely explanation.
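Counting how many samples carry each iSNV is straightforward; the sketch below shows one way to do it in Python. The helper names `isnv_prevalence` and `recurrent_sites`, the data layout, and the toy values are hypothetical and are not taken from the authors' code.

```python
from collections import Counter
from typing import Dict, List, Tuple

Site = Tuple[int, str]  # hypothetical key: (genome position, alternate base)


def isnv_prevalence(samples: List[Dict[Site, float]],
                    call_threshold: float = 0.01) -> Counter:
    """Count, for each iSNV, the number of samples in which it is called."""
    counts: Counter = Counter()
    for isnvs in samples:
        counts.update(site for site, freq in isnvs.items()
                      if freq >= call_threshold)
    return counts


def recurrent_sites(samples: List[Dict[Site, float]], min_samples: int,
                    call_threshold: float = 0.01) -> List[Site]:
    """Sites called in at least `min_samples` of the samples."""
    counts = isnv_prevalence(samples, call_threshold)
    return sorted(site for site, n in counts.items() if n >= min_samples)


# Made-up samples, only to show the call pattern (not real data).
toy_samples = [
    {(100, "T"): 0.012, (200, "A"): 0.300},
    {(100, "T"): 0.015},
    {(100, "T"): 0.011, (400, "C"): 0.020},
]
widespread = recurrent_sites(toy_samples, min_samples=3)
```

Applied to real data, a site that recurs in nearly every sample at a nearly constant low frequency would be flagged by a call such as `recurrent_sites(samples, min_samples=40)`.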
Finally, a comparison of observed patterns of iSNV frequencies between donors and recipients with those expected under large transmission bottleneck sizes as inferred in Popa, Genger et al. further argues against the transmission of the low-frequency shared iSNVs. Specifically, observed iSNV frequencies from transmission pairs with inferred bottleneck sizes of Nb ≥ 1000 show that iSNVs are present in both donor and recipient at highly similar frequencies or are observed exclusively in the donor or recipient (Figure 1F). On this figure, we overlay simulated iSNV frequencies under the assumption of a bottleneck size of Nb = 1000 (Supplemental Methods). Juxtaposition of the observed versus theoretically predicted iSNV frequencies highlights an inconsistency: at Nb values of ~1000, we should expect almost all (at least 96.1%) of the iSNVs present in the donor at ≥2% to be transmitted and also observed above the variant calling threshold of 1% in the recipient. However, only 77.5% of donor iSNVs within the 2-6% frequency range are observed in the corresponding recipients at ≥1% frequency. This inconsistency indicates that the low-frequency iSNVs themselves show patterns that cannot be parsimoniously explained by large transmission bottleneck sizes.
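The intuition behind this expectation can be sketched with a simplified calculation. The Python snippet below assumes that the Nb founding virions are drawn binomially from the donor population and asks how often the variant's frequency among the founders is at or above the calling threshold. This is only an assumption-laden illustration: it ignores within-host growth and sequencing noise, it is not the simulation described in the Supplemental Methods, and it is not expected to reproduce the 96.1% figure exactly. The function names are hypothetical.

```python
import math


def log_binom_pmf(k: int, n: int, p: float) -> float:
    """Log of the Binomial(n, p) probability mass at k, for 0 < p < 1."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))


def prob_founder_freq_at_least(donor_freq: float, nb: int,
                               call_threshold: float = 0.01) -> float:
    """P(variant frequency among nb founders >= call_threshold).

    Assumes pure binomial sampling of founders; call_threshold must lie in (0, 1].
    """
    # Smallest founder count whose frequency reaches the threshold.
    k_min = math.ceil(call_threshold * nb - 1e-9)
    below = sum(math.exp(log_binom_pmf(k, nb, donor_freq))
                for k in range(k_min))
    return 1.0 - below


p_wide = prob_founder_freq_at_least(0.02, nb=1000)   # wide bottleneck
p_single = prob_founder_freq_at_least(0.02, nb=1)    # single founding virion
```

Under this simplification a donor iSNV at 2% is almost always retained above 1% when Nb = 1000, whereas with a single founder it is passed on only with probability equal to its donor frequency.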
Given these findings, we re-estimated transmission bottleneck sizes using the beta-binomial method (2) at a conservative variant calling threshold of 6% (Figure 1A; Figure S1C). Increasing the variant calling threshold does not bias bottleneck size estimates, but it does increase statistical uncertainty in the estimated values. At this 6% cutoff, only 13 transmission pairs had one or more donor iSNVs remaining, such that bottleneck sizes could only be estimated for these pairs. The maximum likelihood estimate for Nb was 1 for 12 out of these 13 transmission pairs; for the remaining transmission pair (CoV_198 → CoV_230), the maximum likelihood estimate was Nb = 143 virions. This transmission pair was the only one where a donor iSNV (at a frequency of ~22%) was transmitted to a recipient but remained subclonal (at a frequency of ~17%). Since the confidence intervals around these maximum likelihood estimates were large, we also estimated an overall transmission bottleneck size using the data from these 13 transmission pairs (Supplemental Methods). We arrived at an estimate of a mean bottleneck size of 1.21, such that 99% of successful transmissions are expected to result from 3 or fewer virions (Figure 2).
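To illustrate how a mean bottleneck size translates into a statement about percentiles, the sketch below assumes that bottleneck sizes across successful transmissions follow a zero-truncated Poisson distribution. The main text does not state the distributional form (it is deferred to the Supplemental Methods), so this choice, and the function names, are assumptions made purely for illustration.

```python
import math


def ztp_mean(lam: float) -> float:
    """Mean of a zero-truncated Poisson with rate lam > 0."""
    return lam / -math.expm1(-lam)


def solve_lambda(target_mean: float, lo: float = 1e-9, hi: float = 50.0,
                 iters: int = 200) -> float:
    """Bisection for lam such that ztp_mean(lam) == target_mean (requires target_mean > 1)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if ztp_mean(mid) < target_mean:
            lo = mid  # mean increases with lam, so move the lower bound up
        else:
            hi = mid
    return 0.5 * (lo + hi)


def ztp_cdf(n: int, lam: float) -> float:
    """P(1 <= N <= n) under the zero-truncated Poisson."""
    mass = sum(math.exp(-lam) * lam ** k / math.factorial(k)
               for k in range(1, n + 1))
    return mass / -math.expm1(-lam)


def ztp_quantile(q: float, lam: float) -> int:
    """Smallest n with P(N <= n) >= q."""
    n = 1
    while ztp_cdf(n, lam) < q:
        n += 1
    return n


lam = solve_lambda(1.21)        # rate matching the reported mean of 1.21
n99 = ztp_quantile(0.99, lam)   # bottleneck size covering 99% of transmissions
```

Under this assumed distribution, a mean of 1.21 corresponds to a 99th percentile of 3 virions, which agrees with the statement in the text.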
Our finding of a very tight transmission bottleneck from a reanalysis of the viral deep-sequencing data from Popa, Genger et al. is consistent with conclusions from other (as yet not peer-reviewed) studies that have quantified SARS-CoV-2 transmission bottleneck sizes in humans (3) and other mammals (4). These results indicate that SARS-CoV-2 has a narrow transmission bottleneck, similar in size to that of influenza A viruses (5). Small bottleneck sizes also mean that infections generally start off with very little – if any – viral genetic diversity, such that acute infections will likely be characterized by low levels of viral diversity except in instances of superinfection, consistent with other recent (as yet not peer-reviewed) studies (6, 7). Our reanalysis thus parsimoniously adds to a growing understanding of SARS-CoV-2 evolution between and within infected individuals.
Acknowledgments
We thank Andreas Bergthaler and his group for providing clarifications on the SARS-CoV-2 deep-sequencing data submitted as part of their research article. The research reported in this technical comment was supported by National Institute of Allergy and Infectious Diseases Centers of Excellence for Influenza Research and Surveillance (CEIRS) grant HHSN272201400004C.