Differing impacts of global and regional responses on SARS-CoV-2 transmission cluster dynamics

Although the global response to COVID-19 has not been entirely unified, the opportunity arises to assess the impact of regional public health interventions and to classify strategies according to their outcome. Analysis of genetic sequence data gathered over the course of the pandemic allows us to link the dynamics associated with networks of connected individuals with specific interventions. In this study, clusters of transmission were inferred from a phylogenetic tree representing the relationships of patient sequences sampled from December 30, 2019 to April 17, 2020. Metadata comprising sampling time and location were used to define the global behavior of transmission over this earlier sampling period, but also the involvement of individual regions in transmission cluster dynamics. Results demonstrate a positive impact of international travel restrictions and nationwide lockdowns on global cluster dynamics. However, residual, localized clusters displayed a wide range of estimated initial secondary infection rates, for which uniform public health interventions are unlikely to have sustainable effects. Our findings highlight the presence of so-called “super-spreaders”, with the propensity to infect a larger-than-average number of people, in countries, such as the USA, for which additional mitigation efforts targeting events surrounding this type of spread are urgently needed to curb further dissemination of SARS-CoV-2.

throughout the pandemic, understanding the impact of these efforts on the global and regional transmission dynamics of the virus are imperative in defining a strategy to assess the current situation and prevent similar scenarios in the future.
Genomic data sampled from viral epidemics offer a unique opportunity to evaluate not only the evolution of the virus over time but also the changing viral population dynamics imprinted in the evolutionary history. These population dynamics, though often reflective of the neutral processes of evolution (5), can at times be traced to significant ecological and epidemiological events (6). Sample collection dates are critical in these connections, as when combined with the assumption of relatively stable mutation rates over time, allow for genetic differences to be rescaled as differences in time. The timing of population changes, inferred from genetic changes, can then be compared with external, contextual data (e.g., (7)) to investigate potential links between public health interventions (or other relevant events) and viral population growth and spread. Global analyses of viral spread using genomic data can reveal important information regarding the emergence of a novel virus, such as the phylogenetic analysis of H1N1 (8), which informed the community of the adaptive process of H1N1 from its original host (swine) to humans and its subsequent challenges in escape from the human immune system. Since then, large projects aimed at facilitating and optimizing the sharing of data and results for real-time projections of viral spread have helped in tracking SARS-CoV-2 global dissemination (e.g., (9, 10)). Yet, given that viruses evolve at a relatively rapid rate, considering separate isolated geographic areas as separate epidemics may also be warranted when attempting to understand how regional efforts drive viral evolutionary and population dynamic patterns. A virus from patristic distance (branch length separating sequences within the tree) of <0.009% (see Sup-plementary Materials) in a maximum likelihood (ML) phylogenetic tree inferred from 11,069 SARS-CoV-2 genomic sequences. This distance, representing the median genetic difference between two sampled individuals is less than the median difference observed within a single individual (0.014%) in the study by Shen et al. (2020) (15) on patients admitted with COVID-19 pneumonia. By only considering individuals that share a genetic distance less than would be expected during the evolution of the virus within a single host, we have greater confidence in infections separated by a short amount of time and thus an epidemiological linkage or connection; increasing genetic distance leads to increased uncertainty as to the relationship of individuals within a cluster and a greater likelihood of the inclusion of multiple risk factors. The majority of clusters identified using this method included < 25 individuals, with one large outlier cluster of 185 individuals (184 from the USA and 1 from Denmark) ( Figure 1A). In terms of country representation within transmission clusters, the majority of the identified clusters included viral sequences isolated from patients in 1 − 5 countries, with the majority country in clusters comprised of only two countries representing 50-99% of the sequences ( Figure 1B). Clusters including 6-10 countries were more evenly distributed, while the majority country represented 70% of the sequences in the only cluster with strains from 11 countries. The results collectively indicate a significant role for travel in connecting several countries through putative direct transmission events, rather than isolated epidemics seeded by single introductions, consistent with previous studies (16, 17).
As with conventional epidemiological analyses, sampling bias can impact results and interpretation and has been inherent to SARS-CoV-2 sample collection throughout the pandemic (18). In our analysis, the number of clusters involving each of the countries with available sequence data as of April 24, 2020, were distributed similarly to the number of available genomes indicating, unsurprisingly, that the number of clusters detected for a particular country is limited 5 by the number of available samples from that locale (Figure 2A). Hence, no direct comparison could be made regarding the number of clusters involving different countries. This was not the case, however, for the average size (number of persons) of clusters associated with each country, since large clusters were not always detected in countries with more samples available, such as the United Kingdom (UK) and United States of America (USA) ( Figure 2B). In other words, despite potential lack of information from countries with reduced datasets, a global perspective on transmission cluster patterns related to cluster size might nevertheless be investigated. It is important to keep in mind that low sequence presence is not necessarily an indicator of low infection rates, as the fraction of total infected individuals that were sampled may, in fact, be high for these countries, resulting in a different type of regional sampling bias. For example, although as of April 24, over 3,000 sequences were submitted for the UK, this comprised < 25% of total infections, whereas Hong Kong reportedly submitted at least one genome for every confirmed case ( Figure 2C). Given that individuals linked through transmission in a small amount of time share a small genetic difference, on which phylogenetic cluster inference relies, missed sampling can prevent the inclusion of individuals within a cluster. Hence, we anticipated that sampled individuals from countries with sequencing more representative of the infected population (i.e., genome per confirmed case value closer to 1) would be more likely to cluster, resulting in a larger fraction of clustered individuals. However, there was no clear relationship (linear regression R 2 < 0.0084) between the percentage of individuals within a country that are included in clusters and the number of retrieved genomes per confirmed cases ( Figure 2C). This finding indicates that country-specific contributions to clustering were not biased toward countries with larger fractions of sampled individuals from the infected population. Therefore, sub-sampling as an effort to mitigate the effects of sampling bias at the country level, though often performed for phylogenetic analyses of viral geographical spread (19), was not deemed necessary for our study. On the other hand, it is important to notice that due to the lack of information on sam-6 pling strategies used to gather available sequence data, we could not rule out possible effects of selection bias (i.e., preferential sampling). For example, in certain countries for which representative sequencing was low but clustering rate was high, we cannot exclude that sampling efforts were focused on presumed contact networks in order to locate and quarantine infections.
Cluster origin and epidemiology In order to derive epidemiological information for each detected cluster, branch lengths within the tree were scaled in time by enforcing a molecular clock, which assumes the accumulation of substitutions has occurred at a constant rate over time. The time to the most recent common ancestor (TMRCA) of all sequence data was estimated to be For the majority of the transmission clusters detected in the tree, TMRCAs (i.e., a cluster's temporal origin) dated back prior to Feb 20, 2020, with a peak observed between the first week of February until the first week of March ( Figure 3A). These results were robust to genetic distance thresholds of 0.006% and 0.013% ( Figure S2), the latter value representing the 2.5th percentile of tree-wide distances and maximum diversity observed in Shen et al. (2020) (15), as described above. Following this time, a sharp decline in number of newly formed clusters was evident. Furthermore, despite differing patterns in the number and size of clusters across individual countries, the global peak in clusters size coincided with the peak in number of new clusters, around the end of February, as did the number of countries represented in each cluster ( Figure 3A). The overlapping peaks in cluster number and size, as well as number of countries/cluster in time, suggest a common underlying factor responsible for the decrease in the rate of cluster formation, growth, and connectivity at the global level. Thus, it appeared reasonable to hypothesize that efforts to reduce international travel at the beginning of the epidemic 7 played a role.
Data on reported travel restrictions (Tables S1 and S2) were obtained from various sources (see Supplementary Materials), and the cumulative number of reports involving all international travel, as well as those specifically involving China, were plotted over time to provide context into travel-related events that would potentially result in the global cluster behavior described above. The early, elevated rate of accumulation of travel bans, largely comprised of restriction on immigration from China began to slow on February 15 (slowed accumulation for Chinaspecific bans on the 6th) ( Figure 3B). Accumulation in the number of restrictions remained low until the onset of nationwide lockdowns (first reported March 08), at which time the steep decline in overall cluster TMRCA, size, and geographical range began. These results strongly suggest that reduced international travel slowed the formation, growth, and connectivity of transmission clusters, whereas more local interventions were necessary, and more effective, in preventing local virus spread.
The median lifespan, or duration, of a transmission cluster was estimated to be approximately 2 weeks, with the largest cluster extending for over six weeks ( Figure 4A). Such a short duration time (on the order of the longer end of the incubation period (21)), combined with average cluster sizes of up to 25 sampled individuals, is indicative of SARS-CoV-2 rapid transmission (22). The expected number of secondary cases directly generated by a primary case in the population, otherwise known as the basic reproductive number (R 0 ), was calculated as a function of early changes in the estimated viral effective population size (23) and a normally distributed infectious period of approximately 2−8 days). R 0 for the entire pandemic was es- (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted November 6, 2020. ; estimate the transmission potential for individual clusters, which we have already shown can vary in size and composition over time. In the majority of clusters, R 0 ranged from < 1to2.32, though R 0 values of up to 12 were reported ( Figure 4B), indicative of clusters formed through "super-spreading" events (SSEs), or cases of larger-than-average transmissibility (25). While it is important to note that accuracy of R 0 estimation is reduced for outbreaks characterized by true R 0 ≥ 5 (23), clusters representing increased secondary infection rates as compared to the majority population are of importance in the control of infection. As R 0 > 1 is indicative of sustainable transmission and vice versa, we investigated the timing and duration of both low(< 1)and high(> 1)-R 0 clusters. High-R 0 clusters were more frequently observed between February and early March ( Figure 4C), consistent with the temporal peak in global cluster size. There was no relationship between R 0 and duration (R 2 < 3.69E − 06), suggesting clusters with R 0 < 1 were still sustained for various lengths of time. Although seemingly counter-intuitive, this finding can be explained by dynamic transmission patterns, such as a change in the contact network, or even a later introduction of unsampled super-spreaders. As the R 0 calculation is derived from early estimates of the viral effective population size, this value does not depict the full transmission potential of the group of individuals within the cluster.
Our results point to a drastic reduction in cluster formation, growth, and connectivity following the first week of March. Yet, clusters with R 0 > 2 (earlier estimates of SARS-CoV-2 respectively) ( Figure 5A). In line with global reduction in connectivity, clusters formed after March 08 consisted of 2 countries or less Figure 5B. The USA formed two clusters comprised of only US individuals, both with R 0 > 2, one of which was estimated as the highest R 0 value (> 10, CI: 7.57-11.94) of all clusters at this time. The three clusters following in ranking comprised either Belgium alone (2) or Belgium and neighboring Luxembourg (2), with R0 > 4. All of these clusters exhibited evidence of sustained transmission, with a duration of greater than 2 weeks, though it is important to note that the addition of more up-to-date sequences may extend duration times for clusters initiated during this time period. The UK formed three separate clusters -one isolated, one with primarily UK individuals, and one with primarily Australian individuals. The two clusters comprised of majority UK sequences were both estimated to have R0 > 2, whereas the cluster with primarily Australian sequences was characterized as R0 < 1.
Similarly, the second Australian cluster (Australian sequences only) was also characterized as having a relatively low R 0 of < 1.3 (CI:1.20-1.32). Results suggest that the virus was already spreading rapidly in the USA, Belgium and Luxembourg, and the UK at the time of implementation of regional mobility restriction efforts and that, despite efforts to restrict international travel, the UK and Australia maintained travel sufficient to sustain inter-regional transmission.
Recent evidence implicating a mutation in residue 614 of the spike protein of SARS-CoV-2 in increased infectivity (26) and higher mortality (27) offered a possible explanation for increased transmission potential of certain clusters identified in this study. However, despite the clear advantages at the cellular level of glycine in place of aspartic acid at this site, there was no relationship between prevalence of glycine (% of individual sequences) within individual transmission clusters and R 0 (linear regression R 2 < 0.00011) or cluster size (R 2 < 0.0083) ( Figure S3), though generation time was not explored.

Discussion
Unlike conventional epidemiological surveillance data, viral genetic data can be used to link infected individuals involved in direct transmission events even when the contact structure is unknown. Identification of these putative transmission clusters, coupled with phylogenetic inference of cluster dynamics and relevant epidemiological parameters, has provided evidence of a global pattern in cluster formation, growth, connectivity, as well as transmission potential over time. The slow rise and subsequent rapid fall in the number and size of forming clusters over the period of December 30 to April 20, 2020, may have been the results of a change in global transmission patterns and/or a rate of sampling that did not adequately capture, as recently reported (28), the increased rate in the number of infected individuals. According to the latter scenario, missed sampling of individuals involved in a transmission cluster can prevent the detection of links via genetic data, or even conventional surveillance data. However, the similar temporal pattern observed for the number of countries connected by individual clusters also suggests a relationship between cluster formation, growth, and connectivity and that an outside force was responsible for changing global transmission patterns, rather than problematic sampling over time.
When placed in the context of the timing and accumulation of public health interventions, the data provide evidence supporting the benefits of both global and regional response efforts.
Given an incubation period extending to up to 18 days within an individual (29), it is plausible that specific restriction on international travel involving China, peaking on February 06, could have resulted in the slowed rate of cluster activity beginning the first week of February. It is also possible that the rapid increase in overall international travel restrictions, peaking around February 15, was partially, if not equally, responsible. Whereas additional modeling using recorded travel data would be highly beneficial in teasing apart the effects of international and 11 China-specific travel restrictions on transmission characteristics, the drastic halt in cluster activity beginning the first week of March directly coincides with the date of onset of nationwide lockdowns. This particular finding speaks to the effectiveness of additional non-pharmaceutical measures taken at the regional level to curb the spread of infection, consistent with previous reports (30, 31).
Whereas a global reduction in cluster activity in response to efforts to reduce mobility might be expected given the known role of travel in pathogen spread (32), we anticipated cluster-and region-specific variation in transmission characteristics, consistent with known difficulties in controlling regional epidemics. Using viral sequence data and the phylogenetic relationships among sampled individuals, particularly for an exponentially growing epidemic of limited data availability (23), we can model relevant epidemiological parameters of interest used in determining the transmission potential for clusters involving individual countries. It is important to keep in mind that the results of any phylogenetic study that are dependent on a single tree assume that that a tree best describes the underlying phylogenetic relationships. While it is often best to summarize results across a sample of similarly plausible trees using Bayesian methods (33, 34), Bayesian tree reconstruction methods are highly parametric and have difficulty converging on a reliable distribution of trees and related evolutionary parameters for datasets as large as that of SARS-CoV-2. For this reason, we only focused this study on portions of the maximum likelihood tree that were considered to be well supported. Similarly, invaluable methods exist for the detection and epidemiological characterization of transmission clusters within the Bayesian framework, such as the multi-state birth death model (bdmm) (35); however, even terventions. As our study is based on hypotheses regarding not only cluster characteristics but also cluster size distribution, a skew towards detection of larger clusters as a result of this loss of information was undesirable. Post hoc transmission characterization of individual clusters using a Bayesian framework for population dynamics estimates that has demonstrated accuracy for low-signal sequence data (i.e., small clusters) (23) was thus ideal for an in-depth investigation of transmission cluster dynamics at the global and regional scales. Using this method, four countries (the USA, Belgium and neighboring Luxembourg, and the UK) were identified as harboring isolated clusters with elevated transmission potential following the global initiation of nationwide lockdowns, potentially fostered by super-spreaders. Federal policies regarding lockdowns were put into place for Belgium, Luxembourg, and the UK beginning mid-March, which were likely necessary in absolving highly active clusters such as those observed in this study; though federal guidelines were issued at a similar time in the USA (March 16th), mandatory US policy regarding lockdowns was not put into motion, allowing individual states to adopt their own policies at different times. Depending on the location of transmission clusters within the USA, delayed lockdowns could have resulted in continued rapid transmission of the virus.
It is also important to keep in mind that at the time of this study, a dramatic rebound was being observed in the number of detected daily cases in the USA, whereas the UK, Belgium and Luxembourg were demonstrating a consistent decline with occasional peaks (9). Moreover, the epidemic at the time was exponentially spreading in Brazil and India, as well as steadily in Russia -all countries that were not captured by our analysis based on sequence data up to the last week in April, 2020. In this context, our results emphasize the importance of additional country-specific transmission cluster analysis for data collected more recently than April 20.
It would be reasonable to suspect that the identification of the USA, Belgium, Luxembourg, and the UK as problematic countries was aided by the availability of sequence data from these locations (i.e., attributed to a sufficient number of individuals and genetic diversity to classify 13 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted November 6, 2020. and characterize corresponding clusters), which at first points to a potential problem with sampling bias. Although these countries were indeed at the high end of the spectrum in terms of available sequence data, our analyses relied on averaged values across clusters, which we show is unrelated to genome availability, unlike number of clusters per country. These countries exhibited a wide range of clustering frequency (20-60% of individuals) with no relation to sampling representation (fraction of the infected population sampled), which varied from 1-30%, indicating with a high degree of confidence, that they were not identified as a result of sampling bias. Moreover, countries (Iceland, Australia, and the Netherlands) identified as having low transmission potential after March 08 (R 0 less than or approximately 1) had a comparable number of available genomes within the distribution, lending support to the conclusion. This is not to say that selection bias, a form of sampling bias, may not be associated with the results.
For example, Iceland's low R 0 during this time may not be surprising, given that the country's genetic powerhouse, deCODE, began screening high-risk (symptomatic) individuals and those returning, or in contact with an individual, from high-risk countries as early as January 31 (37). Therefore, while in our study, concerns for the sampling representation of the overall infected population is negligible, more sophisticated quantitative measures to assess the impact of sampling biases, specifically selection bias, will be necessary in future investigations.
In summary, we propose that phylogenetic identification and characterization of transmission clusters using the vast resources of viral genomic data currently available can provide both global and regional perspectives on viral spread. When collected early in the course of an epidemic, as was the the case for the SARS-CoV-2, this approach may help to pinpoint locations for which increased efforts at the level of local government might be necessary to mitigate growth on a pandemic scale. At the time of the submission of these results, relaxation of these efforts was on the rise, particularly in the USA. The detection of isolated transmission clusters with elevated transmission potential in March, despite the rapid decline in global patterns of clus-14 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted November 6, 2020. ; ter activity, points to an important role played by super-spreaders in the current pandemic that likely pose a threat to relaxation (38) unless measures are taken to quickly recognize and predict these events. A better understanding of the underlying risk factors associated with related super-spreading events, including host, environmental, and behavioral factors (39), is necessary for targeting efforts aimed in avoiding recurrent rebounds in epidemic waves. Given the increased efforts in testing and sampling in the USA, as well as other countries, transmission cluster dynamic inference can help to identify these events and underlying risk networks for more precise intervention strategies.
Author contributions statement M.S., M.P., and B.R.M. conceived of the analyses, S.M. and C.M. retrieved the data, A.Z. and F.L. used their expertise (with guidance from O.G.) in sequence data quality control and tree reconstruction to produce the trees, M.P. used his expertise in Phylopart to identify transmission clusters, B.R.M. analyzed the clusters and prepared the manuscript, A.R. monitored and gathered data regarding region-specific public health interventions, and all authors helped to craft the discussion and review the the manuscript. The authors report no competing interests at the time of manuscript preparation and submission. All data is available in the manuscript or the supplementary materials.

15
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted November 6, 2020.

19
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

20
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted November 6, 2020. ; https://www.biorxiv.org/content/early/2020/01/06/2020.01.06. 895532.full.pdf. (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

22
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted November 6, 2020. ;

23
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted November 6, 2020. ; https://doi.org/10.1101/2020.11.06.370999 doi: bioRxiv preprint A B Figure

24
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted November 6, 2020. ;

25
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted November 6, 2020. ; https://doi.org/10.1101/2020.11.06.370999 doi: bioRxiv preprint A B Figure 5: Summarized transmission patterns for countries involved in clusters formed after March 08, 2020. (A) Mean R 0 for each country was calculated by weighting each cluster-specific R 0 according to the fraction of sequences belonging to the country.(B) Estimated mean R 0 and 95% confidence intervals for each cluster (dot). For clusters with more than one country, the fraction of sequences belonging to each country is depicted in the corresponding pie chart. Clusters are also colored according to duration. Dashed line depicts R 0 = 2 26 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted November 6, 2020. ; https://doi.org/10.1101/2020.11.06.370999 doi: bioRxiv preprint

Materials and Methods
Sequence data, metadata, and phylogenetic reconstruction A total of 11,262 sequences and associated metadata (time and country of sampling) were downloaded from GISAID (GI-SAID) on April 25, 2020. Sequence IDs can be found in Table S3 The number of confirmed cases for each corresponding country was retrieved from European Centre for Disease Pre- Sequences were quality filtered and aligned, as described in (40), keeping one representative per identical sequence group and excluding sequences without precise dates (month and day), resulting in 11, 316 sequences (29, 726 nucleotides in length). We reconstructed a maximum likelihood tree with RAxML-NG (41) (GTR+I+G6 model, with all parameters optimized) starting from a distance tree (reconstructed using mimimum evolution in FASTME (42)), rooted it at the most recent common ancestor of the likely first-generation strains from (1), and put back the identical sequences (as zero-branch polytomies with the corresponding representative sequence tip). We then collapsed the branches without phylogenetic signal (i.e., of length ≤ 1/2 mutation per genome). Support for branching events within the tree were calculated using the the 27 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted November 6, 2020. ; Shimodaira-Hasegawa approximate likelihood ratio test (43), performed in IQ-TREE v2 (44) on the final, fixed RAxML-NG tree topology.
Least squares dating in LSD2 (45) was used to date internal nodes of the tree (given sampling dates of taxa) and to identify, and remove, outlier sequences. Outlier sequences were defined as taxa whose mutation rate (estimated as the distance to the root divided by time since the first sequence) was larger than 3 standard deviations from the median. A strict molecular clock was assumed, resulting in an estimated mutation rate of 2. Transmission cluster identification Clades comprised of ≥ 5 distinct sequences and a reliability of ≥ 90% were recognized as potential transmission clusters when the median pairwise patristic distance (i.e., branch length separating sequences) within the clade was below a pre-specified percentile threshold of the whole-tree patristic distance distribution. A range of percentile thresholds spanning 0.0005% − 25% of the whole-tree distance distribution was used to choose an optimal threshold point and to verify robustness of cluster composition. The minimum percentile threshold that maximized the number of clusters was chosen as the optimal threshold by performing multiple clustering runs on randomly sampled patristic distance distributions (1 million for each run) in Phylopart v2 (46). Clusters and corresponding information (including sequence IDs) can be found in Table S4.
Potential clusters with more than two time points were used for the calculation of the basic reproductive number (R 0 ) from the viral effective population size (N e ) estimated using the skygrowth (23) package in R (47). Viral effective population size (N e ) estimates were allowed to vary weekly, and the default smoothing parameter (tau) of 0.1 was used. As described in Volz Didelot (2018) (23), the mean effective reproductive number (R e ) was calculated as a function of the change in N e and 2-8 day (mean=5.475, SE=1.825) infectious period (psi). R 0 28 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted November 6, 2020. ; https://doi.org/10.1101/2020.11.06.370999 doi: bioRxiv preprint was defined as the first R e value in time. Clusters with outlying R 0 values (> 3.4 standard deviations above or below the mean over all clusters) were discarded as unreliable. Figure S1: Maximum likelihood tree for 11,069 SARS-CoV-2 sequences scaled in time using least squares dating (45). Purple bar represents the 95% credible interval for estimate of the time to the most recent common ancestor for all sequences. Outlier sequences have been pruned and were excluded from downstream analyses (see Materials and Methods).

29
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted November 6, 2020. ; https://doi.org/10.1101/2020.11.06.370999 doi: bioRxiv preprint A B Figure

30
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted November 6, 2020. ; 45 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted November 6, 2020. ; https://doi.org/10.1101/2020.11.06.370999 doi: bioRxiv preprint CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted November 6, 2020. ; https://doi.org/10.1101/2020.11.06.370999 doi: bioRxiv preprint CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted November 6, 2020. ; https://doi.org/10.1101/2020.11.06.370999 doi: bioRxiv preprint CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted November 6, 2020. ; https://doi.org/10.1101/2020.11.06.370999 doi: bioRxiv preprint CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted November 6, 2020. ; https://doi.org/10.1101/2020.11.06.370999 doi: bioRxiv preprint CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted November 6, 2020. ; https://doi.org/10.1101/2020.11.06.370999 doi: bioRxiv preprint CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted November 6, 2020. ; https://doi.org/10.1101/2020.11.06.370999 doi: bioRxiv preprint CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted November 6, 2020. ; https://doi.org/10.1101/2020.11.06.370999 doi: bioRxiv preprint CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted November 6, 2020. ; https://doi.org/10.1101/2020.11.06.370999 doi: bioRxiv preprint CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted November 6, 2020. ; https://doi.org/10.1101/2020.11.06.370999 doi: bioRxiv preprint CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted November 6, 2020. ; https://doi.org/10.1101/2020.11.06.370999 doi: bioRxiv preprint CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted November 6, 2020. ; https://doi.org/10.1101/2020.11.06.370999 doi: bioRxiv preprint CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted November 6, 2020. ; https://doi.org/10.1101/2020.11.06.370999 doi: bioRxiv preprint CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted November 6, 2020. ; https://doi.org/10.1101/2020.11.06.370999 doi: bioRxiv preprint CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted November 6, 2020. ; https://doi.org/10.1101/2020.11.06.370999 doi: bioRxiv preprint CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted November 6, 2020. ; https://doi.org/10.1101/2020.11.06.370999 doi: bioRxiv preprint CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted November 6, 2020. ; https://doi.org/10.1101/2020.11.06.370999 doi: bioRxiv preprint CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted November 6, 2020. ; https://doi.org/10.1101/2020.11.06.370999 doi: bioRxiv preprint CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted November 6, 2020. ; https://doi.org/10.1101/2020.11.06.370999 doi: bioRxiv preprint CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted November 6, 2020. ; https://doi.org/10.1101/2020.11.06.370999 doi: bioRxiv preprint CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted November 6, 2020. ; https://doi.org/10.1101/2020.11.06.370999 doi: bioRxiv preprint CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted November 6, 2020. ; https://doi.org/10.1101/2020.11.06.370999 doi: bioRxiv preprint CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted November 6, 2020. ; https://doi.org/10.1101/2020.11.06.370999 doi: bioRxiv preprint CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted November 6, 2020. ; https://doi.org/10.1101/2020.11.06.370999 doi: bioRxiv preprint CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted November 6, 2020. ; https://doi.org/10.1101/2020.11.06.370999 doi: bioRxiv preprint CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted November 6, 2020. ; https://doi.org/10.1101/2020.11.06.370999 doi: bioRxiv preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted November 6, 2020. ; Table S4: Clusters identified using a genetic distance threshold within 1% of the distribution of patristic distances within the entire tree. The minimum percentile threshold that maximized the number of clusters was chosen as the optimal threshold by performing multiple clustering runs on randomly sampled patristic distance distributions (1 million for each run) in Phylopart v2 (46) Table S4: Clusters identified using a genetic distance threshold within 1% of the distribution of patristic distances within the entire tree. The minimum percentile threshold that maximized the number of clusters was chosen as the optimal threshold by performing multiple clustering runs on randomly sampled patristic distance distributions (1 million for each run) in Phylopart v2 (46).  Table S4: Clusters identified using a genetic distance threshold within 1% of the distribution of patristic distances within the entire tree. The minimum percentile threshold that maximized the number of clusters was chosen as the optimal threshold by performing multiple clustering runs on randomly sampled patristic distance distributions (1 million for each run) in Phylopart v2 (46).  Table S4: Clusters identified using a genetic distance threshold within 1% of the distribution of patristic distances within the entire tree. The minimum percentile threshold that maximized the number of clusters was chosen as the optimal threshold by performing multiple clustering runs on randomly sampled patristic distance distributions (1 million for each run) in Phylopart v2 (46).  Table S4: Clusters identified using a genetic distance threshold within 1% of the distribution of patristic distances within the entire tree. The minimum percentile threshold that maximized the number of clusters was chosen as the optimal threshold by performing multiple clustering runs on randomly sampled patristic distance distributions (1 million for each run) in Phylopart v2 (46).  Table S4: Clusters identified using a genetic distance threshold within 1% of the distribution of patristic distances within the entire tree. The minimum percentile threshold that maximized the number of clusters was chosen as the optimal threshold by performing multiple clustering runs on randomly sampled patristic distance distributions (1 million for each run) in Phylopart v2 (46).