The selection of reference genome and the search for the origin of SARS-CoV-2

The pandemic caused by SARS-CoV-2 has a great impact on the whole world. In a theory of the origin of SARS-CoV-2, pangolins were considered a potential intermediate host. To assemble the coronavirus found in pangolins, SARS-CoV-2 were used a reference genome in most of studies, assuming that pangolins CoV and SARS-CoV-2 are the closest neighbors in the evolution. However, this assumption may not be true. We investigated how the selection of reference genome affect the resulting CoV genome assembly. We explored various representative CoV as reference genome, and found significant differences in the resulting assemblies. The assembly obtained using RaTG13 as reference showed better statistics in total length and N50 than the assembly guided by SARS-CoV-2, indicating that RaTG13 maybe a better reference for assembling CoV in pangolin or other potential intermediate hosts.


Introduction
Recently, the outbreak of SARS-CoV-2 (COVID-19) has caused an ongoing global pandemic. As of July 21, 2020, the pandemic resulting in a total of 14,562,550 clinical cases and 607,781 deaths all over the world (www.who.int). With an effort of metagenomic RNA deep sequencing, the genome of a SARS-CoV-2 isolate, Wuhan-Hu-1, was published in [11]. The genome showed high nucleotide similarity (89.1%) to a group of SARS-like coronavirus that were identified in bats in China, which indicated the possibility of animal origin.
To identify potential direct or intermediate host of SARS-CoV-2, coronavirus in several animals were studied and compared with SARS-CoV-2. Bat and pangolin were the two mostly investigated species. A CoV in bat, RaTG13, showed 96% full-genome similarity with SARS-CoV-2 [14]. Other virus isolates from bat, ZXC21 and ZC45 also shared 85% of similarity to SARS-CoV-2. These results led to the hypothesis that the progenitor of SARS-CoV-2 originated in Bats and it spilled over to humans using another animal as intermediate host. Many researchers believed that pangolins are a potential intermediate host and they attempted to characterize coronavirus in pangolin. Liu, Chen, and Chen [8] constructed coronavirus contigs using de novo assembly method from organ samples of dead Malayan pangolins rescued at the Guangdong Wildlife Rescue Center. Of the 11 collected pangolins, coronavirus was detected in two. Zhang, Wu, and Zhang [13] re-analyzed the RNA-Seq reads from two pangolins carrying coronavirus using reference-guided de novo assembly method, with Wuhan-Hu-1 as the reference genome. The resulting draft genome shared 91.02 and 90.55% whole genome similarity with Wuhan-Hu-1 and RaTG13, respectively. Xiao et al. [12] obtained 21 samples from pangolins rescued by Guangdong Customs and conducted reference-guided genome assembly, with Wuhan-Hu-1 as the reference genome. The derived viral genome showed 80 and 98% whole genome sequence identity to SARS-CoV-2 and RaTG13, respectively. Lam et al. [3] collected 43 samples from 18 pangolins from Guangxi Medical University, China, and six samples contained coronavirus sequences. The viral genomes of six samples were de novo assembled. They also performed RNA sequencing in five archived pangolins samples from Guangdong, and assembled the genomes using WIV04, another SARS-CoV-2 genome from human, as reference genome. The resultant draft genomes have 85.5% to 92.4% identity to the SARS-CoV-2 genome. They suggested two sub-lineages of coronavirus existed in pangolins. In terms of all coding sites, coronavirus identified in pangolins from Guangdong is more closely related to WIV04 than Bat-CoV RaTG13, whereas coronavirus genome of pangolins in Guangxi showed lower genome similarity to WIV04 than Bat-CoV RaTG13. Liu et al. [8] took samples from three coronavirus positive pangolins rescued in Guangdong and performed deep sequencing.
Using de novo assembly method, they obtained viral genome that showed 90.32% and 90.24% of whole genome identify to Wuhan-Hu-1and Bat-CoV RaTG13, respectively.
All of the pangolins involved in published coronavirus analysis were from either the Guangdong collection or the Guangxi collection. Pangolins from the Guangdong collection were investigated in most studies [3,8,9,12]. The resulting viral genomes were derived from de novo assembly with or without a guided reference and further curated using blast annotation or PCR amplicon sequencing. All studies have showed that CoV in pangolin was highly related to Bat-CoV and SARS-CoV-2. In all of the studies that used reference-guided de novo assemblies, a SARS-CoV-2 genome (Wuhan-Hu-1 or WIV04) was chosen as reference [3,12,13]. This choice was based on the assumption that SARS-CoV-2 was the closest neighbor of the Pangolin CoV in the phylogeny tree. However, this assumption may not necessary be true. Therefore, choosing the SARS-CoV-2 genome as reference could inadvertently introduce bias in the genome assembly, leading to inaccurate or incomplete results.
In this study, we assembled the Pangolin CoV genome using several different genomes as reference. We investigated how the reference genome impact the resulting genome and its phylogenetic relationship with others. The results from this study will provide guidance for future studies on how to accurately construct CoV genomes from pangolin or other potential intermediate hosts.

Data selection
Two RNA-seq samples, lung07 and lung08, were downloaded from NCBI SRA under BioProject SRA: PRJNA573298. The two samples were originally published in [8] for viral metagenomics analysis.

Sequence preparation
Adaptor trimming and quality control were performed on the raw sequence reads using with the Trimmonmatic program (verson 0.39) [1]. To eliminate host contamination, the remaining reads were aligned to the Manis javanica genome (SRA: PRJNA256023) using BWAaln (version 0.7.17) [6] and reads mapped to the host genome were discarded. Reads unmapped to the host reference genome were used to construct genome in the subsequent de novo assembly.

Genome assembly and sequence analysis
Cleaned reads were used to assemble genome using reference-guided de novo assembly.
To investigate how the results were influenced by the choice of reference genome, we explored a few representative virus genomes on the phylogeny tree (Table 1) as the reference genome.
These genomes were selected based on previous studies [3,9,11,13]. Once a reference genome was picked, the cleaned reads were aligned to the reference genome using BWA-MEM [5], and the mapped reads were assembled de novo using MEGHIT (version 1.1.3) with meta-sensitive mode [4]. The resulting contigs were concatenated into an assembly by aligning them to the reference genome.

Phylogenetic and similarity analyses
Phylogenetic distance analysis was performed using MEGA X (version 10.1.8) [2].The whole genome was used in phylogenetic and distance analysis, and phylogenetic trees were constructed in the best-fit DNA/amino acid substitution mode with 1000 bootstrap replications.
The whole genome nucleotide identity analysis was performed in SimPlot 3.5.1 [10].

Result
A total of eight viral genomes were tested as the reference genome in reference-guided de novo assembling. Numbers of mapped reads ranged from 2 to 3,060, and length of the resulting draft assemblies varied from 5,969 to 22,419 bp (Table 2). Two reference genomes, MersCoV and Bat Hp-BetaCoV Zhejiang2013, failed in reads assembling due to the limited number of remaining reads. Less than 1,000 reads were mapped to three reference genomes, BJ01, BM48_31, and Longquan140, resulting in shorter assemblies. A total of 1,061 reads were mapped to ZC45 and subsequently assembled into a 21,819-bp assembly with 67.8% of coverage.
RaTG13, which is a bat CoV, had 1,287 reads mapped to it, and the resulting assembly has total length of 21,925 and N50 of 1,428. A total of 3,060 reads were mapped to Wuhan-Hu-1, a human SARS-CoV-2 strain, which was the highest number among all genomes we surveyed.
The assembly guided by RaTG13 and Wuhan-Hu-1 showed similar coverage at about eighty percent. However, the resulting assembly had shorter total length (21,819) and N50 (1,195) than those of the assembly that used RaTG13 as reference.
To understand this seemly contradiction, we investigated the reads coverage and depth on RaTG13 and Wuhan-Hu-1. The results (Figure 1) show that there were an excessive number of reads mapped to distal regions of Wuhan-Hu-1, which could indicate artifacts or contaminations during the sequencing. After removing these tail regions (with > 200X depth), 62 more unique reads were mapped to Wuhan-Hu-1. Figure 2 shows the overlap of between the unique reads mapped to RaTG13 and Wuhan-Hu-1. unique reads were used to construct Venn diagrams. All reads mapped to Italy strain were also mapped to Wuahn-Hu-1 genome (Figure 2). The additional 1516 unique reads were mapped to Wuhan-Hu-1. Most reads were mapped to both RaTG13 and Italy strain, but a total of 45 reads were aligned to either one. Bat-CoV, ZC45 and RaTG13, shared 999 reads in alignment, and RaTG13 genome guided over two hundred more reads into assembly.
The two assemblies that used RaTG13 and Wuhan-Hu-1 as reference genomes respectively, were aligned with the eight reference viral genomes in Table 1 in a multiple alignment. The multiple alignment result was used for similarity analysis and phylogenetic analysis. In terms of the whole genome nucleotide, the RaTG13-guided assembly showed 85.8% and 85.2% identity to RaTG13 and Wuhan-Hu-1, respectively, and the Wuhan-Hu-1-guided assembly showed 88.6 and 88.8% identity to RaTG13 and Wuhan-Hu-1 respectively. The difference between Wuhan-Hu-1-guided assembly and RaTG13-guided assembly was not always consistent with difference between RaTG13 and Wuhan-Hu-1 (Table S1). The differences between RaTG13-guided and Wuhan-Hu-1-guided assemblies in the regions of 18,431-18,601 bp were probably due to references between the reference genomes (Table S1). However, the differences in some regions including 4,761-5,021 bp and 10,121-11,321 bp, were not due to differences in the reference genomes.
We further investigated phylogenetic relationship between the two assemblies and eight reference viral genomes using MEGA X with 1000 Bootstrap tests. The two assemblies positioned between RaTG13 and ZC45 with strong statistical evidence (Figure 3). The SARS-CoV-2 genome, Wuhan-Hu-1, and Bat-CoV, RaTG13, clustered closely together. The two assemblies from this study consistently positioned between this cluster and other CoVs ( Figure   4). Among other CoVs, ZC45 is the closest neighbor to the assemblies.

Discussion
Currently, very few samples of coronavirus-positive pangolins have been sequenced.
Pangolins-CoV obtained from the Guangdong collection and the Guangxi collection represent two lineages of coronavirus [3]. After outbreak of SARS-CoV, thousands of bat samples were collected and sequenced to identify coronaviruses that bat may carry. We can expect that in near future more pangolin samples will be collected and studied for better understanding of coronavirus in pangolin.
In a reference-guided assembling, the resulting assembly may show bias towards the reference genome [7]. Successful decoding a complete viral genome usually require deep sequencing and further manual curations to fix the gaps. Inaccuracies in assembling the sequencing reads could mislead the subsequent curation step. The whole genome identity between Bat-CoV RaTG13 and SARS-CoV2 Wuhan-Hu-1 is about 96%, which corresponds to a total difference of about 1,500 nucleotides. Our results have shown that observable difference could be found in the resulting assemblies when these genomes were used as reference separately.
Particular, when RaTG13 was used as reference, the resulting assembly had a longer total length and higher N50 value than when Wuhan-Hu-1 was used. This points to the possibility that Pangolin-CoV is more closely related to RaTG13 than to Wuhan-Hu-1. Therefore, in order to decode the coronavirus sequence accurately, RaTG13, and possibly other SARS-CoV-2 isolates, should also be considered as reference in future studies of coronavirus in Pangolin or other potential intermediate hosts.
In addition to using one reference genome to guide the assembling, we also attempted to assemble all reads that mapped to either RaTG13 or Wuhan-Hu-1 genomes. The resulting twogenomes-guided assembly has a total length of 22,707 bp, which is slightly longer than that of the RaTG13-guided assembly. However, the N50, 1,388bp, is slightly shorter.

Conflicts
The authors have no conflict of interest.    Table S1. The whole genome nucleotide similarity from RaTG13, Wuhan-Hu-1, and resulting assemblies.