## Abstract

The 2019 novel coronavirus (2019-nCoV) has emerged from Wuhan, China. Studying the epidemic dynamics is crucial for further surveillance and control of the outbreak. We employed a Bayesian framework to infer the time-calibrated phylogeny and the epidemic dynamics, represented by the effective reproductive number (*R _{e}*) changing over time, from the genomic sequences available from GISAID. The origin time is estimated to be December 17, 2019 (95% CI: December 5, 2019 – December 23, 2019). The median estimate of *R _{e}* ranges from 0.2 to 2.2 and changes drastically over time. This study provides an early insight into the 2019-nCoV epidemic.

## Introduction

An outbreak of a novel coronavirus (2019-nCoV) was reported in Wuhan, a city in central China (WHO). Coronaviruses cause diseases ranging from the common cold to severe pneumonia. Two fatal coronavirus epidemics over the last two decades were severe acute respiratory syndrome (SARS) in 2003 and Middle East respiratory syndrome (MERS) in 2012 (WHO). Human-to-human transmission has been confirmed for this new type of coronavirus (Wang et al. 2020), and more than 1,000 cases have been reported as of January 25, 2020 (https://bnonews.com/index.php/2020/01/the-latest-coronavirus-cases).

Studying the epidemic dynamics of the virus is crucial for further surveillance and control of the outbreak. The phylogeny of the viruses is a proxy for the transmission chain. Early studies of 2019-nCoV have focused on its molecular features and its phylogenetic relationship with close relatives (Zhou et al. 2020; Chan et al. 2020). These phylogenetic analyses typically ignored the sampling times and measured branch lengths by the expected number of substitutions per site. They also lacked a stochastic process to model the epidemic dynamics over time. However, both the timing and the dynamics are critical to understanding the early outbreak of 2019-nCoV. Furthermore, various sources of information and uncertainty are hard to integrate into the analyses without employing a Bayesian approach.

In this study, we used the birth-death skyline serial (BDSS) model (Stadler et al. 2013) to infer the phylogeny, divergence times and epidemic dynamics of 2019-nCoV. This approach takes the genomic sequences and sampling times of the viruses as input, and co-estimates the phylogeny and key epidemic parameters in a Bayesian framework. To our knowledge, this is the first study to perform such estimation on 2019-nCoV.

## Results and Discussion

The phylogeny in Figure 1 shows the divergence times and relationships of the 24 BetaCoV viruses. The samples from Wuhan form a paraphyletic group, while the rest of the samples form a monophyletic clade. These two groups have not diverged much from each other, as their sequences are quite similar, which indicates that the outbreak is still at an early stage. The patients from Guangdong and Zhejiang had traveled from Wuhan and had been infected before traveling (Chan et al. 2020). Note that this phylogeny is a maximum clade credibility (MCC) tree summarized from the posterior samples, which represents a best estimate of the topology. The posterior probabilities of most clades are lower than 0.5, so these clades would collapse into polytomies if summarized as a 50% majority-rule consensus tree (GISAID).

The time of origin is estimated with median 31.9 days and 95% highest posterior density (HPD) interval from 26.3 to 43.8 days before the date of the latest sample (January 18, 2020), that is, December 17, 2019 (December 5, 2019 – December 23, 2019) (Table 1, ten intervals). This is in agreement with the symptom onset reported by WHO, showing again an early phase of the epidemic.
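The conversion from "days before the latest sample" to calendar dates can be sketched as follows. This is an illustration only: rounding each estimate to the nearest whole day is an assumption made here, and the helper name `days_before_to_date` is hypothetical. Under that assumption the arithmetic gives the dates quoted above.

```python
from datetime import date, timedelta

# Date of the latest sample, as stated in the text.
LATEST_SAMPLE = date(2020, 1, 18)


def days_before_to_date(days_before, latest=LATEST_SAMPLE):
    """Convert an age in days before the latest sample to a calendar date.

    Assumption: the age is rounded to the nearest whole day before
    subtracting; other conventions can shift a date by one day.
    """
    return latest - timedelta(days=round(days_before))


estimates = {"median": 31.9, "HPD lower": 26.3, "HPD upper": 43.8}
for label, days in estimates.items():
    print(f"{label}: {days_before_to_date(days).isoformat()}")
```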

We investigate the epidemic dynamics of 2019-nCoV by estimating *R _{e}* changing over time in the BDSS model. *R _{e}* > 1.0 means that the number of cases is increasing and the epidemic is growing, whereas *R _{e}* < 1.0 means that the epidemic is declining and will die out. Interestingly, the epidemic has an early boost with *R _{e}* at around 2.2, then decreases dramatically to 0.2, and increases again to about 1.7 during the last phase (Table 1). In general, this is in agreement with some other studies reporting *R _{e}* ranging from 1.4 to 5.5 (Read et al. 2020; Zhao et al. 2020; Riou and Althaus 2020). In comparison, the estimated *R _{e}* was 2.7 to 3.6 for SARS during the precontrol phase in Hong Kong (Riley et al. 2003; Wallinga and Teunis 2004). Dividing *R _{e}* into ten intervals rather than three gives us a better resolution (Figure 2). This drastic change could reflect the epidemic to some extent but is likely sensitive to the virus sampling. Keep in mind that we used only 24 samples in our analysis, which is less than 2% of the reported number of infected patients, so one needs to be cautious when interpreting this result. With more viruses sequenced, we would expect more reliable estimates, which would provide better insights into the epidemic of 2019-nCoV.

Overall, this study provides an early insight into the 2019-nCoV epidemic by inferring key epidemiological parameters from the virus sequences. Such estimates would help public health officials to coordinate effectively to control the outbreak.

## Methods

The original data downloaded from GISAID (https://www.gisaid.org) consist of 26 sequences of 2019-nCoV (as of January 24, 2020). We excluded two outliers: the virus from Kanagawa, Japan (EPI_ISL_402126), which contained only a small segment of 369 bp, and a virus from Wuhan, China (EPI_ISL_403928), which contains suspiciously many mutations and has a genetic distance about ten times longer than the rest of the sequences (GISAID). Sequence alignment was done using MUSCLE (Edgar 2004), resulting in a total length of 29,904 bp for the whole genome. The sampling times of the viruses ranged from December 24, 2019 (EPI_ISL_402123) to January 18, 2020 (EPI_ISL_403937), and they were used as fixed ages (in units of years) in subsequent analyses.
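As a rough sketch of how sampling dates translate into fixed tip ages in years, the snippet below measures each date backward from the latest sample. The function name `tip_age_years` and the 365.25 days-per-year convention are assumptions for illustration; the conversion actually used in the analysis is not stated in the text.

```python
from datetime import date

# Assumption: average year length used to convert days to years.
DAYS_PER_YEAR = 365.25


def tip_age_years(sample_date, latest_date):
    """Age of a sample in years before the latest sampled sequence."""
    return (latest_date - sample_date).days / DAYS_PER_YEAR


latest = date(2020, 1, 18)     # EPI_ISL_403937
earliest = date(2019, 12, 24)  # EPI_ISL_402123

print(f"earliest sample age: {tip_age_years(earliest, latest):.4f} years")
print(f"latest sample age:   {tip_age_years(latest, latest):.4f} years")
```

The latest sample has age zero by construction, so all other ages are positive and the sampling window spans 25 days.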

We used the BDSS model (Stadler et al. 2013) implemented in the BDSKY package for BEAST 2 (Bouckaert et al. 2019) to infer the phylogeny, divergence times and epidemic dynamics of 2019-nCoV. The model has an important epidemiological parameter, the effective reproductive number *R _{e}*, defined as the expected number of secondary infections caused by an infected individual during the epidemic. The model allows *R _{e}* to change over time, making it feasible to estimate its dynamics (Stadler et al. 2013). The BDSS process starts from the origin time *t* _{0}, which was assigned a lognormal(–2, 1.5) prior with median 0.135 (years before the latest sampling time). Time from the origin to the latest sample was divided into either three or ten equally spaced intervals, and *R _{e}* was estimated individually in each interval. The prior for *R _{e}* was a lognormal(0, 1.25) distribution with median 1.0 and 5% and 95% quantiles of 0.13 and 7.82. The other two parameters are the becoming-noninfectious rate *δ* and the sampling proportion *p*, which were assumed constant over time. *δ* was given a lognormal(2, 1.25) prior with median 7.39 and mean 16.1, expecting the infectious period of an individual (1/*δ*) to be about a month. The sampling proportion of infected individuals *p* was given a beta(1, 9) prior with mean 0.1.
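The summary statistics quoted for these priors follow from standard lognormal and beta formulas. A minimal sketch of the check is given below, assuming that lognormal(μ, σ) is parameterized by the mean and standard deviation in log space; the helper names are hypothetical. Under this parameterization, the bounds 0.13 and 7.82 for *R _{e}* match the 5% and 95% quantiles.

```python
import math
from statistics import NormalDist


def lognormal_summary(mu, sigma):
    """Median, mean, and 5%/95% quantiles of a lognormal(mu, sigma) prior.

    Assumption: mu and sigma are the mean and standard deviation of the
    variable's natural logarithm.
    """
    z05 = NormalDist().inv_cdf(0.05)  # standard normal 5% quantile
    return {
        "median": math.exp(mu),
        "mean": math.exp(mu + sigma ** 2 / 2),
        "q05": math.exp(mu + sigma * z05),
        "q95": math.exp(mu - sigma * z05),
    }


def beta_mean(a, b):
    """Mean of a beta(a, b) distribution."""
    return a / (a + b)


origin = lognormal_summary(-2, 1.5)  # origin time prior
r_e = lognormal_summary(0, 1.25)     # reproductive number prior
delta = lognormal_summary(2, 1.25)   # becoming-noninfectious rate prior

print(f"origin median: {origin['median']:.3f} years")
print(f"R_e median: {r_e['median']:.2f}, 5%: {r_e['q05']:.2f}, 95%: {r_e['q95']:.2f}")
print(f"delta median: {delta['median']:.2f}, mean: {delta['mean']:.1f}")
print(f"sampling proportion mean: {beta_mean(1, 9):.1f}")
```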

We used the lognormal independent relaxed clock (Drummond et al. 2006; Rannala and Yang 2007) to model evolutionary rate variation along the branches. The mean clock rate *r* was assigned a gamma(2, 0.0005) prior with a mean of 0.001 substitutions per site per year, and the standard deviation *σ* was assigned a gamma(0.54, 0.38) prior with mean 0.2 and median 0.1 by default. The substitution model used was HKY+Γ_{4} (Hasegawa et al. 1985; Yang 1994), in which the transition-transversion rate ratio *κ* was assigned a lognormal(1, 1.25) prior and the gamma shape parameter *α* was assigned an exponential(1) prior.
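The stated means imply that the gamma priors are written as gamma(shape, scale), since the mean of such a distribution is shape × scale. The sketch below checks this reading; it is an illustration under that assumption, with hypothetical helper names, and it approximates the median of the *σ* prior by seeded simulation rather than by an exact quantile function.

```python
import random
from statistics import median


def gamma_mean(shape, scale):
    """Mean of a gamma distribution in the shape/scale parameterization."""
    return shape * scale


def gamma_median_mc(shape, scale, n=100_000, seed=1):
    """Monte Carlo approximation of the gamma median (shape/scale)."""
    rng = random.Random(seed)
    # random.gammavariate takes (shape, scale) as its two arguments.
    return median(rng.gammavariate(shape, scale) for _ in range(n))


print(f"clock rate prior mean: {gamma_mean(2, 0.0005):.4f}")
print(f"sigma prior mean:      {gamma_mean(0.54, 0.38):.2f}")
print(f"sigma prior median:    {gamma_median_mc(0.54, 0.38):.2f} (simulated)")
```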

The analysis was performed in the BEAST 2 platform (Bouckaert et al. 2019). We ran 150 million MCMC iterations and sampled every 5000 iterations. The first 30% of the samples were discarded as burn-in. Convergence was diagnosed in Tracer (Rambaut et al. 2018) to confirm that independent runs gave consistent results and that all parameters had an effective sample size (ESS) larger than 100. The remaining 70% of the samples were used to build the maximum clade credibility (MCC) tree and to summarize the parameter estimates. The common ancestor heights were used to annotate the clade ages in the MCC tree.
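The number of posterior samples implied by these settings is simple arithmetic, sketched below. The function name is hypothetical, and the count ignores any initial state that a logger may also record, so the figures are approximate by at most one sample.

```python
def mcmc_sample_counts(chain_length, sample_every, burn_in_percent):
    """Return (total, discarded, retained) sample counts for one chain.

    Assumption: one sample is logged every `sample_every` iterations and
    the initial state is not counted.
    """
    total = chain_length // sample_every
    discarded = total * burn_in_percent // 100
    return total, discarded, total - discarded


total, discarded, retained = mcmc_sample_counts(150_000_000, 5000, 30)
print(f"total samples: {total}")
print(f"burn-in:       {discarded}")
print(f"retained:      {retained}")
```

With these settings each chain yields 30,000 samples, of which 21,000 remain after burn-in.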

## Acknowledgments

CZ is supported by the 100 Young Talents Program of Chinese Academy of Sciences.