Abstract
Many detection methods have been used or reported for the diagnosis and/or surveillance of COVID-19. Among them, reverse transcription polymerase chain reaction (RT-PCR) is the most commonly used because of its high sensitivity, typically claiming detection of about 5 copies of viruses. However, it has been reported that only 47-59% of the positive cases were identified by some RT-PCR methods, probably due to low viral load, timing of sampling, degradation of virus RNA in the sampling process, or possible mutations spanning the primer binding sites. Therefore, alternative and highly sensitive methods are imperative. With the goal of improving sensitivity and accommodating various application settings, we developed a multiplex-PCR-based method comprised of 343 pairs of specific primers, and demonstrated its efficiency to detect SARS-CoV-2 at low copy numbers. The assay produced clean characteristic target peaks of defined sizes, which allowed for direct identification of positives by electrophoresis. We further amplified the entire SARS-CoV-2 genome from 8 to half a million viral copies purified from 13 COVID-19 positive specimens, and detected mutations through next generation sequencing. Finally, we developed a multiplex-PCR-based metagenomic method in parallel, that required modest sequencing depth for uncovering SARS-CoV-2 mutational diversity and potentially novel or emerging isolates.
Introduction
A variety of methods for detecting SARS-CoV-2 have been reported and discussed1,2, including RT-PCR, serological testing3 and reverse transcription-loop-mediated isothermal amplification4,5. Currently, RT-PCR is considered the gold standard for diagnosing SARS-CoV-2 infections because of its ease of use and high sensitivity. RT-PCR has been reported to detect SARS-CoV-2 in saliva6, pharyngeal swab, blood, rectal swab7, urine, stool8, and sputum9. In laboratory conditions, the RT-PCR methodology has been shown to be capable of detecting 4-8 copies of virus or lower, through amplification of targets in the Orf1ab, E and N viral genes, at 95% confidence intervals10-12. However, only about 47-59% of the true positive cases were identified by RT-PCR, and 75% of RT-PCR negative results were actually later found to be positive with other assays, hence mandating repeated testing 8,13-15. In addition, there is evidence suggesting that heat inactivation of clinical samples causes loss of viral particles, thereby hindering the efficiency of downstream diagnosis16.
Therefore, it is necessary to develop robust, sensitive, specific and highly quantitative methods for reliable diagnostics17,18. The urgency to develop an effective surveillance method that can be easily used in a variety of laboratory settings is underlined by the wide and rapid spreading of SARS-CoV-219-21. In addition, such method should also distinguish SARS-CoV-2 from other respiratory pathogens such as influenza virus, parainfluenza virus, adenovirus, respiratory syncytial virus, rhinovirus, human metapneumovirus, SARS-CoV, etc., as well as Mycoplasma pneumoniae, Chlamydia pneumoniae and other causes of bacterial pneumonia22-25. Furthermore, obtaining full-length viral genome sequence through next generation sequencing (NGS) prove to be essential for the surveillance of SARS-CoV-2’s evolution and for the containment of community spread26-29. Indeed, SARS-CoV-2 phylogenetic studies through genome sequence analysis have provided better understanding of the transmission origin, time and routes, which has guided policy-making and management procedures27,28,30-33.
Here, we describe the development of a highly sensitive and robust detection assay incorporating the use of multiplex PCR technology to identify SARS-CoV-2 infections. Theoretically, the multiplex PCR strategy, by simultaneously targeting and amplifying hundreds of targets, has significantly higher sensitivity than RT-PCR and may even detect nucleotide fragments resulting from degraded viral genomes. Multiplex PCR has been shown to be an efficient and low-cost method to detect Hantaan orthohantavirus and Plasmodium falciparum infections34,35, with high coverage (median 99%), specificity (99.8%) and sensitivity. Moreover, this solution can be tailored to simultaneously address multiple questions of interest within various epidemiological settings35. Similar to a recently described metagenomic approach for SARS-CoV-2 identification36, we also establish a user-friendly multiplex-PCR-based metagenomic method that is capable of detecting SARS-CoV-2, and could also be applied for the identification of novel pathogens with a moderate sequencing depth of approximately 1 million reads.
Results
Mathematical model of RT-PCR
Several RT-PCR methods for detecting SARS-CoV-2 have been reported to date6,10-12. Among them, two groups reported the detection of 4-5 copies of the virus10,12. To investigate the opportunity for further improvement upon the sensitivity of RT-PCR, we built a mathematical model to estimate the limit of detection (LOD) for SARS-CoV-2. The reported RT-PCR amplicon lengths are around 78-158bp, and the SARS-CoV-2 genome is 29,903bp (NC_045512.2). Thus, we chose 100bp amplicon length and 30kb SARS-CoV-2 genome size for mathematical modeling. With the assumption of 99% RT-PCR efficiency11, we found that RT-PCR assays could only detect 4.8 copies of SARS-CoV-2 at 95% probability (Fig. 1A), which is consistent with the experimental results previously reported10. In this model, the predicted probability of RT-PCR assays to detect one copy of SARS-CoV-2 is only 26% (Supplemental Fig. 1). This finding may explain, at least in part, the reported 47-56% detection rates of SARS-CoV-2 with known positive samples by RT-PCR8,13. We further discovered that the LOD appears to be independent of the viral genome size. For genomes of 4 to 100kb, the detection limit remains 4.8 copies at 95% probability.
One way to elevate the sensitivity is to simultaneously target and amplify multiple regions of the viral genome in a multiplex PCR reaction, thereby increasing the frequency of occurrence in the mathematical model. Amplifying multiple targets has the advantage of potentially detecting fragments of degraded viral nucleotide fragments while tolerating genomic variations, thus allowing for the detection of new and ever-evolving viral strains. The amplification efficiency of multiplex PCR is critical for LOD. We estimated that the efficiency of our multiplex PCR technology is about 26% if using Unique Molecular Identifier (UMI)-labeled primers to count the amplified products after NGS sequencing (Supplemental Fig. 2 and Supplemental Table 1). However, the amplification efficiency could be lower, and amplicons would not be equally amplified if the template used is one single strand of cDNA. Thus, more amplicons are potentially required for multiplex PCR to detect limited copies of the virus.
Mathematical model of multiplex-PCR-based detection method
We designed a panel of 172 pairs of multiplex PCR primers in order to increase the sensitivity of detecting SARS-CoV-2 (Fig. 1B). The average amplicon length is 99bp. The amplicons span across the entire SARS-CoV-2 genome with an average 76bp gap (76±10bp) between adjacent amplicons. Since the observed efficiency of multiplex PCR is about 26% in amplifying the four DNA strands of a pair of human chromosomes, we assumed an efficiency of 6% in amplifying the single-strand of a cDNA molecule. In addition, it has already been reported that 79% of variants are recovered when directly amplifying 600 amplicons from a single cell using our technology37. Therefore, we assumed that 80% of targeted regions would be amplified successfully. Using the same mathematical model described above, we estimated that our SARS-CoV-2 panel can detect 1.15 copies of the virus at 95% probability (Fig. 1C). Again, the LOD is independent of virus genome size.
We also designed a second pool of 171 multiplex PCR primer pairs. The target regions of these primer pairs overlap with the gaps between target regions of the previous pool of 172 primer pairs (Fig. 1B). Together, these two overlapping pools of primers provide full coverage of the entire viral genome. Most importantly, using both pools in detection would lead to a calculated detection limit of 0.29 copies at 95% probability.
Detecting limited copies of SARS-CoV-2
The workflow was designed so that the multiplex PCR products are further amplified in a secondary PCR, during which sample indexes and NGS sequencing primers are added (Fig. 1D). The PCR products were first analyzed by electrophoresis to visualize potential positives. Since dozens of target regions could be amplified from a single copy of SARS-CoV-2, electrophoresis peaks with a defined distribution of peak sizes were expected. Multiplex PCR could potentially amplify not only SARS-CoV-2, but also other coronaviruses, due to shared sequence similarities despite the fact that we designed primers specific to the SARS-CoV-2 genome to avoid cross amplification. In that context, electrophoresis analysis provides a fast and sensitive indication of infection from at least that family of viruses. For specificity, the generated NGS sequencing library can be interrogated for definitive identification of the specific virus.
Two plasmids, containing the full sequence of the S and N genes of SARS-CoV-2, respectively, were used to validate our multiplex PCR method. A total of 28 targets are expected to be amplified within our 172-amplicon panel. To simulate the use of real clinical samples, these two plasmids were spiked into cDNA generated from human total RNA. The copy number of each plasmid was precisely determined by droplet-based digital PCR (ddPCR) with a QX200 system from Bio-Rad38. The two plasmids were diluted from approximately 9,000 copies to below one copy, and were amplified in multiplex PCR reactions. The library peaks of expected sizes were obtained from 8,900 to 2.8 copies of plasmids (Fig. 2A). Quantification of peaks demonstrated a wide dynamic range from 1 to about 1,000 copies of plasmids (Fig. 2B). The yield of the libraries started to saturate when the copy number reached 1,700. It is possible that the saturation point could be even lower when all of the 172 amplicons are amplified from positive clinical COVID-19 samples, and the library peaks could be observed with even fewer viral copies. In contrast, the detected quantities of a single target on S gene by RT-PCR rapidly dropped when using 2.85 copies (Fig. 2B and Supplemental Fig. 3).
Estimated from the mathematical model described above, employing 28 amplicons provides a 16% chance to detect one single copy of the virus. We tested this predicted probability using one copy of plasmid in multiplex PCR reactions. The theoretical calculation gives a 66% probability to sample 1.1 copies, and a 12% chance to detect them based on a multiplex PCR efficiency of 6%. In practice, we experimentally observed a significantly higher 56% probability to detect 1.1 copies (Fig. 2C). These results suggest that the efficiency of multiplex PCR is actually higher than the previously estimated 6% when a single-stranded cDNA molecule is amplified.
When the amplified products were sequenced, we found that the recovered reads were within a range of about 20-fold relative depth with about 1.4 to 2.8 plasmids, and were uniformly distributed across the GC range (Fig. 2D). When detecting down to 1.4 copies of plasmids, only the reads from one amplicon were about 100-fold lower than the average. Approximately 96% of the amplicons were recovered with 14 copies of plasmids, 77% with 2.8 copies, and 37% with 0.6 copies (Fig. 2E).
Detection of SARS-CoV-2 in COVID-19 specimens
The above two pools of primers were used to amplify the full SARS-CoV-2 genomes from a total of 13 nasopharyngeal swab specimens. These specimens were previously diagnosed to be SARS-CoV-2 positive by using RT-PCR. The viral load was found to be from 8 to 675,885 RNA copies/uL (Supplemental Table 2). These 13 viral genomes were successfully amplified and subsequently sequenced on the Illumina MiSeq. We found a correlation between genome coverage and viral copy number, as expected. While about 95% of the genome was covered at 100X for 8 copies of virus, 98-99% of the genomes were covered at 100X for 22 to 675,885 copies of virus at an average sequencing depth of 5,000 reads per amplicon (Fig. 3A). One genome, from the sample with 5,000 estimated viral copies, was covered at 96% for 100X. This coverage was lower than expected, and might have been caused by the poor sample quality resulting from processing or handling the viral RNA or library.
CleanPlex libraries are usually sequenced at about 1,000 paired-end reads while still generating sufficient data for detecting mutations. To confirm SARS-CoV-2 libraries could be sequenced at about 1,000 reads per amplicon, we sub-sampled the data of 5 genomes to 150,000 and 100,000 total reads per library, which were equivalent to 437 and 291 reads per amplicon, respectively. At least 93% of the genome was covered at 100X with 150,000 total reads, and 86% of the genome was covered at 100X with 100,000 total reads. Even at 50,000 total reads per library, at least 58% of the genome was covered at 100X (Fig. 3B). The high coverage was also manifested by the superior uniformity of the number of amplicons amplified in the multiplex PCR reaction and recovered in the sequencing (Fig. 3C), and by the log10 distance of the number of each amplicon to the mean number of all amplicons in the library (Fig. 3D).
The mutations in the SARS-CoV-2 genome from the 13 specimens were detected by two independently developed algorithms. Only those mutations that were detected by both methods were reported. Assuming all viral particles from a single patient contained identical mutations, mutations with frequencies (%AF) > 60% were considered to be empirically true. The majority of the mutations identified in these 13 SARS-CoV-2 genomes clustered around 7 loci, probably reflecting the collection of these specimens in close communities or the transmission of the virus (Fig. 3E). According to the similarity of these mutations, samples were categorized into three groups. The majority of the mutations in the first group showed >98% mutation frequency. Groups 1 and 2 shared a considerable number of identical mutations, while Group 3 was significantly different, suggesting that the origin of this isolate might be traced back to a distinct lineage (Supplemental Table 3). Of note, all 13 strains contained at least one mutation that has been reported to be associated with SARS-CoV-2 virulence39. The D614G mutation (a G-to-A base change at position 23,403 in the reference strain NC_045512.2), which began spreading in Europe in early February, was then transmitted to new geographic regions and became a dominant form40, was found in 11 of these 13 specimens.
We considered mutations with %AF ≤ 60% as a likely reflection of intra-host heterogeneity41. To eliminate noise from PCR amplification and sequencing, only mutations with %AF ≥ 20% were considered. The 20% cutoff was selected based on the mutation profile from sequencing the synthetic SARS-CoV-2 RNA controls from Twist Bioscience. Some true intra-host mutations with %AF < 20% might be missed. Since no UMI was used in the multiplex PCR amplification, intra-host mutations might still be contaminated with noise originating from PCR amplification, even though a 20% cutoff was applied. The occurrence of such noise may be exacerbated by low viral copy inputs in the multiplex PCR, or low read depth per amplicon in sequencing. In groups 1 and 3, where both viral copy numbers and read depth per amplicon were high, only one intra-host mutation per genome was found in some of the specimens. In contrast, group 2 had low copy numbers, and low read depth per amplicon, and the numbers of apparent intra-host mutations were considerably higher (Fig. 3E and Supplemental Table 3). Some of these intra-host mutations occurred at the aforementioned 7 loci, or were found at other loci and were identical among the 4 specimens in group 2, while the remaining ones appeared random. Our findings suggested that these recurring mutations might not be true intra-host mutations. Indeed, it is possible that the low copy number inputs, as well as sequencing depth, caused a reduction in the %AF, and additionally introduced false intra-host mutations. Therefore, copy number and sequencing depth should be cautiously considered when a mutation is found to have low %AF.
Metagenomic method design for novel pathogens
In order to characterize highly mutated viruses that would otherwise not be amplified by the pre-designed primer pairs, and to discover unknown pathogens, we subsequently developed a user-friendly multiplex-PCR-based metagenomic method. In this method, random hexamer-adapters were used to amplify DNA or cDNA targets in a multiplex PCR reaction. The large amounts of non-specific amplification products were removed by using Paragon Genomics’ background removal reagent, thus resolving a library suitable for sequencing. For RNA samples, Paragon Genomics’ reverse transcription reagents were used to convert RNA into cDNA, resulting in significantly reduced amount of human ribosomal RNA species.
We sequenced a library made with 4,500 copies of N and S gene-containing plasmids spiked into 10 ng of human gDNA, which roughly represents 3,300 haploid genomes. Even though the molar ratios of viral targets and human haploid genomes were comparable, the N and S genes, which encompass about 4kb of targets, were a negligible fraction of the 3 billion base pairs of a human genome. If every region of the human genome were amplified and sequenced at 0.6 million reads per sample, only one read of viral target would be recovered. In fact, our results showed that 16% of the recovered bases, or 13% of the recovered reads, were within the viral N and S genes (Fig. 4A and Supplemental Table 4). 80% of SARS-CoV-2 and 78% of mitochondrial targets were covered (Fig. 4B), and their base coverage was significantly higher than for human targets (Fig. 4C). In contrast, only 0.08% of human chromosomal regions were amplified. Furthermore, the human exonic regions were preferentially amplified (Fig. 4D). This suggested that the random hexamers deselected a large portion of the human genome, while favorably amplifying regions that were more “random” in base composition. Indeed, long gaps and lack of coverage in very large repetitive regions were observed in human chromosomes (Fig. 4E). On the contrary, the gaps in SARS-CoV-2 and mitochondrial regions were significantly shorter (Fig. 4F). We further optimized this method so that 96% of the SARS-CoV-2 genome was recovered (Fig. 4G). The depth of the recovered bases was within a 10-fold range on average. This 10-fold difference in coverage has been routinely observed with our multiplex PCR technology (Supplemental Fig. 4 and Supplemental Table 5). Therefore, increasing sequencing depth alone might not improve the coverage of the targeted regions further.
Discussion
This study provides a highly sensitive and robust multiplex PCR method for the detection of SARS-CoV-2. By amplifying hundreds of targets simultaneously, our multiplex PCR method is more sensitive than RT-PCR, and tolerates the presence of mutations in SARS-CoV-2. For the purpose of diagnosis, only one of two primer pools could be used, or the two pools could be alternatively used in adjacent samples to prevent cross contamination. While the amplification products from positive samples are mainly viral amplicons, low quantities of primer dimers are produced in the negative samples. Therefore, a simple measurement of the dsDNA concentration by fluorometry or spectrophotometry would not be sufficient. High-resolution electrophoresis is required to resolve the length of the amplification products in order to differentiate the target amplicons from the primer-dimers. Alternatively, a low-depth sequencing in the range of 50K reads per sample would provide definitive diagnostic results.
When both primer pools are used, the entire genome of SARS-CoV-2 can be enriched, sequenced and interrogated for the presence of any mutations. We demonstrated that mutations were detected from samples with viral loads ranging from 8 to half a million copies. For accurate sequencing and phylogenetic studies, a high-depth sequencing in the range of 300K reads per sample, along with an input of high viral load (>100 copies), are deemed necessary. We caution the interpretation of intra-host mutations obtained with a low input number of viral particles and in low sequencing depth data.
The current SARS-CoV-2 panel was demonstrated to specifically amplify the entire SARS-CoV-2 genome, and sequencing data obtained from the 13 COVID-19 RT-PCR positive samples clearly differentiate SARS-CoV-2 from other human coronaviruses, such as MERS-CoV, CoV 229E, CoV OC43, CoV NL63, CoV HKU1. 0% of the obtained sequencing reads from each of the 13 samples were aligned to the genome sequence of any of the above viral species. This is in contrast to what was reported for a similar SARS-CoV-2 panel42. Such high specificity argues strongly that this panel could be further expanded to include simultaneous detection of other respiratory viruses including influenza virus.
Metagenomic method is a powerful technology that can theoretically detect any sequences in the specimens. However, metagenomic methods usually require very high sequencing depth in order to find the target sequences, and hence are economically prohibitive as a diagnostic assay. To overcome this constraint, we developed a multiplex-PCR-based metagenomic method that achieved >96% coverage of the S and N genes of SARS-CoV-2 in the contest of human gDNA, while only required ∼0.6M of total reads per library. This coverage was superior given the recommended 50% threshold of coverage for drafting a genome43. The results were obtained with no additional means of host depletion to remove human gDNA and rRNA. The viral bases were 16% of the total recovered bases in the sequencing. Yet it still necessary to verify and validate the detection of SARS-CoV-2 and the other coronaviruses and respiratory viruses by this metagenomic method.
Materials and Methods
Ethics statement
Clinical SARS-CoV-2 samples were collected at Children’s Hospital of Los Angeles. Samples and ancillary clinical and epidemiological data were de-identified before analysis, and are thus considered exempt from human subject regulations, with a waiver of informed consent according to 45 CFR 46.101(b) of the US Department of Health and Human Services. Analysis of the nasopharyngeal swab samples from patients with COVID-19 disease was approved by the Ministry of Health in the US. Patients in the 2020 COVID-19 outbreak from 1 January 2020 to 30 August 2020 provided oral consent for study enrolment and the collection and analysis of their nasopharyngeal swab. Consent was obtained at the homes of patients or in hospital isolation wards by a team that included staff members of Children’s Hospital of Los Angeles. SARS-CoV-2 viruses were purified from the clinical samples by using QIAamp Viral RNA Mini Kit (Qiagen, Cat. No. 52906). The preparations were analyzed by real-time RT–PCR testing for the determination of viral titers of SARS-CoV-2 by standard curve analysis. The full genomes of SARS-CoV-2 viruses were amplified by using the CleanPlex SARS-CoV-2 panel (Paragon Genomics, SKU 918011) and sequenced on an Illumina MiSeq at Children’s Hospital of Los Angeles.
Materials
The Universal Human Reference RNA was from Agilent Technologies, Inc. (Cat#74000). The plasmids containing either S or N gene of SARS-CoV-2 (pUC-S and pUC-N, respectively) were purchased from Sangon Biotech, Shanghai, China. The PCR primers used in ddPCR and RT-PCR reactions for S gene are 5’-TGTACTTGGACAATCAAAAAGAGTTGAT and 5’-AGGAGCAGTTGTGAAGTTCTTTTC; for N gene are 5’-GGGGAACTTCTCCTGCTAGAAT and 5’-CAGACATTTTGCTCTCAAGCTG, respectively.
Multiplex PCR panel design
Panel design is based on the SARS-CoV-2 sequence NC_045512.2 (https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.2/). In total, 343 primer pairs, distributed into two separate pools, were selected by a proprietary panel design pipeline to cover the whole viral genome except for 92 bases at its ends. Primers were optimized to preferentially amplify the SARS-CoV-2 cDNA versus background human cDNA or genomic DNA. They were also optimized to amplify the covered genome uniformly.
Reverse transcription
50ng of Universal Human Reference RNA was converted into cDNA using random primers and SuperScript IV Reverse Transcriptase, following the supplier recommended method (Thermo Fisher Scientific, Cat# 18090050). After reverse transcription, cDNA was purified with 2.4X volume of magnetic beads, and washed twice with 70% ethanol. Finally, the purified cDNA was dissolved in 1X TE buffer and used per multiplex PCR reaction.
RT-PCR
Plasmids pUC-S and pUC-N, in combination with human cDNA, were used in each reaction. Paragon Genomics’ CleanPlex secondary PCR mix was used with 100nM of each PCR primers in 10ul reactions. The PCR thermal cycling protocol used was 95°C for 10min, then 98°C for 15sec, 60°C for 30sec for 45 cycles.
ddPCR
ddPCR was performed on QX200 from Bio-Rad. Plasmids pUC-S and pUC-N at the estimated copy numbers 1 (6 repeats), 2 (3 repeats), and 100 (3 repeats) were tested. In each reaction, the ddPCR thermal cycling protocol used was 95°C for 5min, then 95°C for 30sec, 60°C for 1min with 60 cycles, 4°C for 5min and 90°C for 5min, 4°C hold. The resulting data were analyzed by following the supplier recommended method.
Multiplex PCR
Paragon Genomics’ CleanPlex multiplex PCR reagents and protocol were used. Briefly, a 10µl multiplex PCR reaction was made by combining 5X mPCR mix, 10X Pool 1 of the panel, water and viral template cDNA. The reaction was run in a thermal cycler (95°C for 10min, then 98°C for 15sec, 60°C for 5min for 10 cycles), then terminated by the addition of 2µl of stop buffer. The reaction was then purified by 29µl of magnetic beads, followed by a secondary PCR with a pair of primers for 25 cycles. The secondary PCR added sample indexes and sequencing adapters, allowing for sequencing of the resulting products by high throughput sequencing. A final bead purification was performed after the secondary PCR, followed by library interrogation using a Bioanalyzer 2100 instrument with Agilent High Sensitivity DNA Kit (Agilent Technologies, Inc. Part# 5067-4626).
Mathematical Modeling
A cumulative Poisson probability was used to build the mathematical model. In Microsoft Excel, the following function was used:
P = 1 - POISSON.DIST(1, λ, TRUE)
For multiplex PCR, λ = f × 80% × n × m, where f = frequency of target(s) per genome. f = l/L, where l = cumulative length of amplicon, L = length of genome. For a panel of 172 amplicons, l = 172 × average length of amplicon. n = number of virus genomes in the sample (n = 1,2,3…); m = amplification efficiency. 80% of targets were assumed to be successfully amplified in multiplex PCR. For RT-PCR, λ = f × n × m. q = number of detected virus genomes. q is used to plot the graph reported in the paper, e.g., the copies of virus against the matching probabilities. For multiplex PCR, q = n/a, a = the number of amplicons used in the multiplex PCR. For RT-PCR, q = n × f.
Multiplex-PCR-based metagenomic method
Paragon Genomics’ CleanPlex metagenomic reagents and protocol were used. Briefly, a 10µl multiplex PCR reaction was made by combining 5X mPCR mix, 10X random hexamer-adapters, water and the viral template cDNA. The PCR thermal cycling protocol used was 95°C for 10min, then 98°C for 15sec, 25°C for 2min, 60°C for 5min for 10 cycles. The reaction was then terminated by the addition of 2µl of stop buffer, and purified by 29µl of magnetic beads. The resulting solution was treated with 2µl of CleanPlex reagent at 37°C for 10min to remove non-specific amplification products. After a magnetic bead purification, the product was further amplified in a secondary PCR with a pair of primers for 25 cycles to produce the metagenomic library. This metagenomic library was further purified by magnetic beads before sequencing.
High throughput NGS sequencing and data analysis
High throughput NGS sequencing was performed using Illumina iSeq 100, MiSeq and MGI sequencers (DNBSEQ-G400 and its research-grade CoolMPS sequencing kits). Detailed information for the samples sequenced and used in this manuscript can be found in Supplemental Table 4. Raw sequencing data were trimmed for adaptors using cutadapt version 1.14. The sequences obtained were mapped to the SARS-CoV-2 genome (NC_045512.2) with bwa-mem using Sentieon version 201808.01. Duplicate read marking was skipped. Base-quality recalibration, re-alignment of indels and quality metrics was accomplished with Sentieon. The resulting BAM files were then used to calculate depth and coverage metrics using Samtools version 1.3.1. Algorithms developed independently at Children’s Hospital of Los Angeles and paragon Genomics were sued to detect the mutations in the genome of SARS-CoV-2.
Data availability
All sequencing data used in this publication are available for downloading at NCBI’s Sequence Read Archive (https://www.ncbi.nlm.nih.gov/Traces/study/?acc=PRJNA614546).
Author contributions
C.L., D.N.D., Z.L. conceived the study and drafted the manuscript. C.L., D.N.D., J.S., V.K., L.Y.L., L.L., R.F., G.B., J.L., A.O., G.L., Z.L. performed experiments, analysis and revised the manuscript. U. P. performed the initial diagnosis and quantification of clinical COVID-19 samples. D. O and D. R. performed CleanPlex library preparation and NGS sequencing of the positive COVID-19 clinical samples with MiSeq. M. B., D. M., A. R., and L. S. analyzed the data from the clinical samples. X. G. and J. D. B. supervised the study of clinical samples and revised the manuscript. B.Z., A.E.U. performed ddPCR experiments and analysis. A.B., H.T. performed MGI sequencing and analysis.
Conflict of Interest
The authors declare no competing interests.
Supplemental Information
Acknowledgements
The authors would like to thank Jian Xu of MGI, BGI-Shenzhen for technical support in MGI sequencing, Dr. Jin Billy Li of Stanford University for the coordination of academic support, Dr. Alexander E. Urban of Stanford University for suggestions in preparing the manuscript, and Dr. Zihuai He of Stanford University for discussions on a potential mathematical model of the metagenomic method.