Single source of pangolin CoVs with a near identical Spike RBD to SARS-CoV-2

Multiple publications have independently described pangolin CoV genomes from the same batch of smuggled pangolins confiscated in Guangdong province in March, 2019. We analyzed the three metagenomic datasets that sampled this batch of pangolins and found that the two complete pangolin CoV genomes, GD_1 by Xiao et al. Nature and MP789 by Liu et al. PLoS Pathogens, were both built primarily using the 2019 dataset first described by Liu et al. Viruses. Other publications, such as Zhang et al. Current Biology and Lam et al. Nature, have also relied on this same dataset by Liu et al. Viruses for their assembly of the Guangdong pangolin CoV sequences and comparisons to SARS-CoV-2. To our knowledge, all of the published pangolin CoV genome sequences that share a highly similar Spike receptor binding domain with SARS-CoV-2 originate from this singular batch of smuggled pangolins. This raises the question of whether pangolins are truly reservoirs or hosts of SARS-CoV-2-related coronaviruses in the wild, or whether the pangolins may have contracted the CoV from another host species during trafficking. Our observations highlight the importance of requiring authors to publish their complete genome assembly pipeline and all contributing raw sequence data, particularly those supporting epidemiological investigations, in order to empower peer review and independent analysis of the sequence data. This is necessary to ensure both the accuracy of the data and the conclusions presented by each publication.

Several studies describing pangolin CoV genomes have posited that pangolins are potentially important hosts in the emergence of novel coronaviruses. There has been particular emphasis on the discovery of a Guangdong pangolin CoV that shares all five critical residues and ~97% amino acid identity in the Spike receptor binding domain (RBD) with SARS-CoV-2. [1][2][3] This is due to the fact that the Spike RBD plays a major role in determining the host specificity of each coronavirus.
The next closest match to SARS-CoV-2 in terms of the Spike RBD is the bat CoV RaTG13, sharing 90.13% Spike RBD amino acid identity. RaTG13 shares the highest genome identity to SARS-CoV-2 (96%), whereas the Guangdong pangolin CoV shares ~90% genome identity with SARS-CoV-2. Nonetheless, based on Spike RBD similarity, many have hypothesized that  (Fig 1, Supplementary Fig 1). In their paper, Xiao  The sample lung08, first analyzed by Liu et al. Viruses, was a major source of metagenomic reads spanning the entire pangolin CoV genome (Fig 1, Supplementary Fig 1). The only sample from Xiao et al. that we could not match to those from the Liu et al. dataset, which contributed a considerable number of reads was pangolin_9 (sample M1). However, the majority of the pangolin_9/M1 reads map onto a single region in Orf1a (Fig 1A, Supplementary Fig 1). Indeed, when the reads from samples lung08 and pangolin_9 are combined, the resulting read coverage and alignment profile looks highly similar to the sample read profile in Xiao et al.'s Extended Data   Fig 4 (Supplementary Fig 1). Xiao et al. did not specify which of their seven samples with CoV sequence reads was described by this figure. 2 In addition, for the Spike RBD, the majority of the metagenomic sequence data are derived from lung08 (even so, the read depth was low; Fig 1B). Based on our analysis, we emphasize that there has only been one confirmed source of pangolin CoVs with a Spike RBD nearly identical to that of SARS-CoV-2: pangolins confiscated from smugglers in Guangdong province in March, 2019 (Fig 2). The  (Fig 2). Of the five critical residues for binding between the SARS-CoV Spike RBD and human ACE2 protein, the bat CoV RaTG13 and the Guangxi pangolin CoVs each share only one residue with SARS-CoV-2, whereas the Guangdong pangolin CoV shares five residues with SARS-CoV-2. 5 We would like to reiterate the observations from the literature that it remains to be determined whether the Guangdong and Guangxi pangolin CoVs are found in pangolins in the wild. [1][2][3]5 All of the CoV-positive pangolins were confiscated from smugglers, and the Guangdong pangolins were One scientist, who read our preprint, inquired whether the above suggestion means that SARSlike CoVs, most closely related to the pangolin CoVs, were circulating among humans before the outbreak of SARS-CoV-2. This is an important point that needs to be elaborated on, particularly for readers who are not specialists in mechanisms of wildlife pathogen spillover into humans.
Following the original SARS epidemic in 2002-2004, 13-40% of the tested asymptomatic animal handlers were positive for SARS-reactive antibodies; in comparison, the portion of the general population carrying SARS-reactive antibodies was significantly lower (only up to 0.2% in asymptomatic household contacts). 15,16 This has been widely interpreted to mean that animal handlers/traders (we include wildlife traffickers in this group) are frequently exposed to wild animals and their pathogens on a regular or prolonged basis. Therefore, it is not surprising that wildlife smugglers would be exposed to novel pathogens and serve as hosts of these pathogens from time to time, without necessarily requiring or resulting in the widespread transmission of these pathogens in the general human population.
In conclusion, our analysis points out that the two complete pangolin CoV genomes, GD_1 by

Read processing and alignment
Raw reads were downloaded from the NCBI SRA (see Data Availability). We analyzed 21 and removing poor-quality reads, etc.). 17 The clean reads were mapped to the genome sequence of GD_1 (GISAID: EPI_ISL_410721) using minimap2 version 2.17 on default settings. 18 Duplicate reads were marked and clean reads were coordinate-sorted using samtools version 1.10 (subcommands markdup and sort, respectively). 19

Read coverage statistics
We computed the following read coverage statistics using bedtools version 2.29.2 20 : (1) percentage breadth of coverage with respect to GD_1; (2) mean depth of coverage with all mapped reads; and (3) mean depth of coverage without duplicate reads. Also, the per-position read depth data across the GD_1 genome were exported for plotting in R (R Core Team 2020). We extracted the RBD sequences from GD_1 according to the genome coordinates 22,483-23,151; the receptor binding motif (RBM) sequence embedded within the RBD is located at the genome coordinates 22,837-23,052. The percentage sequence identities were calculated based on pairwise alignments while excluding gapped sites and sites with ambiguous characters. The pairwise sequence similarity matrix was plotted as a heatmap in R 21 .

Read alignment visualization
The libraries lung08 (from Liu et al.