CovProfile: profiling the viral genome and gene expressions of SARS-COV-2

The SARS-CoV-2 virus has infected more than one million people worldwide to date. Knowing its genome and gene expressions is essential to understand the virus’ mechanism. Here, we propose a computational tool CovProfile to detect the viral genomic variations as well as viral gene expressions from the sequences obtained from Nanopore devices. We applied CovProfile to 11 samples, each from a terminally ill patient, and discovered that all the patients are infected by multiple viral strains, which might affect the reliability of phylogenetic analysis. Moreover, the expression of viral genes ORF1ab gene, S gene, M gene, and N gene are high among most of the samples. While performing the tests, we noticed a consistent abundance of transcript segments of MUC5B, presumably from the host, across all the samples.

This approach, however, has several problems. First, the long reads from Nanopore technology suffers 26 from high error rates. Second, the sub-sequences can be amplified unevenly. Third, reads in the gene 27 region amplified from the viral genome are indistinguishable from the gene transcripts. Finally, chimeric 28 reads are frequent. These inaccuracies severely impact the study of viral expressions. Furthermore, while 29 not inherent in the technology, we noticed that the problem could be further complicated by the 30 existence of multiple viral strains. 31 Here, we developed a computational tool, CovProfile, to detect viral genomic variations, and profile 32 viral gene expressions through data from multiplex RT-PCR amplicon sequencing on Nanopore MinION. 33 To detect the genomic variations, we mapped the reads to SARS-CoV-2 reference genome (MN908947.3). 34 To profile the viral gene expressions, we adopt a regularization method to find out the primer efficiencies. 35 The tool was tested on the data obtained from the sputum samples in eleven severe COVID-19 patients. 36 No structural variation was found, but all eleven patients appear to be infected by two strains. We found 37 that the expression of viral genes ORF1ab gene, S gene, M gene, and N gene are high among most of the 38 samples. While performing the tests, we noticed that MUC5B transcript segments to be consistently 39 more abundant than segments of other genes in the host. The function of this gene may be associated 40 with the severe symptom of abundant mucus in COVID-19 patients. 41 The discovery of multiple strains in all our samples has severe implications on the study of phylogeny, 42 which rely heavily on SNPs (Single Nucleotide Polymorphisms). We identify at least two problems. First, 43 coexistence of multiple strains might lead to inference of the wrong allele combinations, resulting in 44 incorrect phylogeny and interpretation. Second, SNV calling might, in fact, be incorrect when mutations 45 are located near the primer region. This problems are severe and requires more careful analysis, since 46 phylogeny is essential in studying the spreading of the pandemic and evolution of the virus [20][21][22][23]. 47

48
RNA reverse transcription and Nanopore sequencing 49 Sputum samples were collected from the 11 patients and inactivated under 56 • C with 30 minutes in 50 accordance with WHO and Chinese guidelines [24,25]. Then, the total RNA was extracted from the 51 samples according to the protocol of RNA isolation kit (RNAqueous R Total RNA isolation Kit, 52 Invitrogen, China), and RNA concentration was determined by Qubit (ThermoFisher Scientific, China). 53 Based on 2 pools of primers (98 pairs of primers in total) (Supplementary table1), the entire genomic 54 sequence of SARS-CoV-2 was amplified segmentally by reverse transcription. Then, the library was built 55 by Nanopore library construction kit (EXP-FLP002-XL, Flow Cell Priming Kit XL, YILIMART, China), 56 while the adapter and barcode sequences were also added to the samples. The samples were sequenced 57 on the Oxford Nanopore MinIon.

58
Data conversion and filtration 59 Fast5 format data was generated by the sequencer, and converted into fastq format with guppy 60 basecaller (version 3.0.3). By applying NanoFilt (version 1.7.0) [26], we performed data filtration on the 61 raw fastq data with the following criteria: read lengths should be longer than 100 bp after the removal of 62 the adapter sequences, with an overall quality higher than 10. The clean data was kept for subsequent 63 analysis. 64

2/17
Data composition 65 A total of 1,804,043 clean reads were generated, with an average of 164,004.00±90,848.28 (Mean±SD) 66 reads per sample (Fig 1). Aligning the reads to SARS-CoV-2 and human genomes respectively, the 67 alignment ratio ranged from 20.14% to 99.74% on SARS-CoV-2 genome, and ranged from 0.11% to 68 72.12% on the human genome. Furthermore, each sample contained 71.24±35.36M data on average, 69 while 61.29±42.13M and 8.87±9.34M data were aligned to SARS-CoV-2 and host genomes, respectively. 70  Remove false-positive chimeric reads 71 Chimeric Nanopore reads formed by random connection of multiplex RT-PCR amplicons were split into 72 multiple segments if the locations of their bilateral ends match corresponding locations of PCR primers. 73 These chimeric reads were able to be split into several segments with their bilateral ends exactly 74 matching the sites where the multiplex PCR primers are located, suggesting that they should be 75 processed accurately to avoid false positive identification of virus recombinations or host integrations. 76 We position the primers on both viral and host genome, and split the identified chimeric reads into 77 segments corresponding to PCR amplicons. This method allowed us to salvage large amount of 78 sequencing data, leading to more accurate alignment and higher coverage. Realignment were performed 79 on the separated segments before analysis.

80
Mutation detection 81 We aligned the clean data to SARS-CoV-2 genome (NCBI MN908947.3) with minimap2 [26] with default 82 parameters for Oxford Nanopore reads. The aligned PCR amplicons were separated according to 83 3/17 corresponding primer pool. With the separated alignment results, the genomic variations with average 84 quality larger than 10 were called with bcftools (version 1.8). Mutations with less than 10 supported 85 reads were filtered. To reduce the effects brought about by the PCR amplification, a variation is filtered 86 if it was located within 10 bp upstream or downstream of the primer region within corresponding primer 87 pool. The filtered mutations for different primer pools were then merged as the final mutations. Based 88 on the gene information in SARS-CoV-2 reference genome, the final mutations were annotated by 89 self-programming software.

90
Estimating viral load and primer efficiency 91 Assume that there are n pairs of primer (s i , t i ), with primer efficiency e i , 1 ≤ i ≤ n. Let the viral gene 92 boundaries be denoted (l g , r g ), where g indexes the genes. Denote the observed sequence depth captured 93 by the primer pair (s i , t i ) as d i . 94 We want to estimate the viral gene expression. Let the number of viral copies be denoted v, the 95 number of transcripts of gene g be denoted t g , and the sequencing rate of amplified segment be denoted 96 α. Then, we can have the following two approximations for d i .

97
If the primer i is fully covered by gene g (Fig1. A), then as the number of amplicons increases 98 exponentially with the number of replication cycles we have where k is the number of replication cycles.
If some reads share two ends, where one end is a primer and the other end is at some gene boundary 103 (Fig 1D, Fig 1E), it may indicate that these amplicons are the result of linear amplification. Denote the 104 number of reads r i,g with two endpoints as (t i , e g ), and the number of reads r g,i with two endpoints as 105 (l i , s g ), then If a gene g is within some primer target region (Fig2. G), then the sequenced fragment can be 107 expected to exactly cover the gene. Let the number of observed reads span exactly the whole gene g as 108 r g , then To find the expressions of the viral genes, we want to minimize the following objective function based 110 on the entities defined above where λ 1 and λ 2 are two regularization parameters, and k is a weight factor (we set k to 21 according to 112 our experiments). The viral load v, primer efficiencies e i , and expressions t g which minimize the 113 objective equation are our estimation for these parameters. To capture the gene products of the host, reads and primers were aligned to the hg38 transcript database 120 derived from UCSC refGene annotation. Primer alignment were performed with BLAT [27]. Regions sample, we deduced that the functions of the discovered SNPs need to be further verified according to 136 lab experiments. Surprisingly, we also found two SNPs (loci 8,782 and 28,144), which were important 137 SNP loci identified in recent phlogenetic analysis [20], that frequently demonstrates heterogeneous 138 signals, suggesting the existence of at least two different virus strains in each host. However, we did not 139 find creditable InDels (Insertions and Deletions), or structural variations, viral-host integration during 140 the mutation analysis.

141
Evaluations of the amplification efficiencies for the primers 142 Since 98 pairs of primers from two pools were designed for the enrichment of SARS-CoV-2 genome 143 before sequencing, we discovered that different parts of SARS-CoV-2 genome exhibited different coverage 144 depths. We then calculated the amplification efficiencies for the primers with our proposed method. We 145 discovered that the primers located at the 5' and 3' ends of the SARS-CoV-2 genome exhibited higher Due to the amplification discrepancy among the primers, the sequencing data was unevenly distributed 156 along the SARS-CoV-2 genome. After aligning the sequencing data to the SARS-CoV-2 genome, we 157 found that the coverage depths of some genomic regions reached over 8000x in some samples, but some 158 regions have coverage depths less than 10x ( Figure 5). In order to minimize the data imbalance induced 159 by PCR amplification, we performed data normalization based on the primer efficiency and their 160 locations on SARS-CoV-2 genome, so that the normalized data could reflect the actual virus 161 reproduction in the host. Based on the algorithm illustrated in the method, we obtained the normalized 162 sequencing depth of 11 sample data.

163
After data normalization, the overall coverage depth along the SARS-CoV-2 genome was relatively 164 even, and less than 300x. In addition, we discovered that the ORF1ab gene, S gene, M gene, and N gene 165 exhibited higher coverage depth in the samples, which might indicate higher expression ( Figure 5).

166
These genes are related to host cell recognition, amplification and virus particle assembly, so we 167 speculate that the results might explain the severe SARS-CoV-2 virus infection in the patients. Since 168 gene E, ORF6, ORF7a, ORF7b, ORF8 and ORF10 are less than 400bp in length, their transcripts might 169 not have been captured by the primers, and hence their expressions cannot be estimated.

170
Highly expressed genes from the host 171 We noticed the existence of human genome in the samples from an earlier study. We similarly analyze     the samples. More than 50% of the MUC5B gene region is covered ( Figure 5). However, the region with 178 high coverage in MUC5B gene is inconsistent among the samples, perhaps due to data scarcity.

180
In this study, we implemented a software tool, CovProfile, to analyze multiplex RT-PCR on Nanopore 181 sequencing data, and studied the sputum samples of COVID-19 patients. Our results show enrichment of 182 PCR amplicons which in total cover 99.7% of SARS-CoV-2 virus genome, confirming the reliability of 183 the multiplex RT-PCR method in identifying COVID-19 infection.

184
The data contained large amount of chimeric reads formed by unintended random connection of 185 multiplex PCR amplicons, making up 1.69% of total sequencing reads. Similar observation has been 186 reported that these chimeric reads can be formed during library preparation and sequencing, and lead to 187 the cross barcode assignment errors in multiplex samples [28,29]. Our analysis of the host genome the genotypes of the pair is detected in at least one strain within the dataset. In this case, the existence 216 of a large clique will be intractable to explain using phylogeny, as the chances become diminishingly 217 small with the size of the clique.

218
From the dataset, we were able to uncover a partial clique of 24 vertices and 138 edges. Two maximal 219 8-cliques and 14 maximal 7-cliques from the partial cliques are shown in Fig 7. These cliques suggest fast 220 mutations on their loci. On the other hand, genomic recombination between the strains is not detected 221 with the software package RDP4 [36]. Hence, we believe the cliques would be best explained by the  Figure 6. Normalized coverage depth on MUC5B gene in the 11 samples. In each plot, the X and Y coordinates respectively represent the length of MUC5B and the normalized coverage depth along the MUC5B gene.

231
In closing, we expect CovProfile to continue to help in detecting genomic variation on transcriptional 232 profile of SARS-CoV-2 infection from multiplex Nanopore sequencing data. The tool decomposes and 233 realign the chimeric reads that commonly exist in the multiplex Nanopore sequencing, and this will 234 greatly promote data usage. The method also provides robust estimation of primer efficiency through a 235 multi-parameter model. We hope the viral and host gene expressions we reported could help in the 236 development of diagnosis and therapy strategies of SARS-CoV-2 infection candidates.

238
The raw sequence data reported in this paper have been deposited in the Genome Sequence Archive [37] 239 in BIG Data Center [38], Beijing Institute of Genomics (BIG), Chinese Academy of Sciences, under 240 BioProject PRJCA002503 with accession ID CRA002522 that are publicly accessible at 241 https://bigd.big.ac.cn/gsa. All other data are available from the authors upon reasonable request.