Abstract
Background MGISEQ-2000, developed by MGI Tech Co. Ltd. (a subsidiary of the BGI Group), is a new competitor to next-generation sequencing platforms such as NovaSeq and HiSeq (Illumina). Its sequencing principle relies on the DNB and cPAS technologies also used in its predecessor, the BGISEQ-500, but the reagents for MGISEQ-2000 have been refined and the platform uses updated software. The cPAS technology evolved from cPAL, previously created by Complete Genomics.
Findings This article compares the results of whole-genome sequencing of a DNA sample from a Russian female donor performed on MGISEQ-2000 and Illumina HiSeq 2500 (both PE150). The two platforms were compared in terms of sequencing quality, number of errors and performance. Additionally, we performed variant calling using four different software packages: Samtools mpileup, Strelka2, Sentieon, and GATK.
Conclusions The accuracy of single nucleotide polymorphism (SNP) detection was similar between the data generated by MGISEQ-2000 and HiSeq 2500, which was used as a reference:
For Samtools mpileup software package: TPR (Sensitivity) = 99.30%, FPR = 0.000498%;
For Strelka2 software package: TPR (Sensitivity) = 99.51%, FPR = 0.000254%;
For Sentieon software package: TPR (Sensitivity) = 99.57%, FPR = 0.000285%;
For GATK software package: TPR (Sensitivity) = 98.70%, FPR = 0.000240%.
At the same time, a separate analysis of indels revealed similar FPR values and lower sensitivity:
For Samtools mpileup: TPR (Sensitivity) = 93.62%, FPR = 0.000698%;
For Strelka2: TPR (Sensitivity) = 98.84%, FPR = 0.000127%;
For Sentieon: TPR (Sensitivity) = 98.68%, FPR = 0.000285%;
For GATK: TPR (Sensitivity) = 98.70%, FPR = 0.000240%.
The method of statistical analysis we used does not allow us to conclusively establish which of the two instruments is more accurate. However, it can be said with confidence that the data generated by the two sequencing systems have a comparable magnitude of error and that MGISEQ-2000 can be used for a wide range of research tasks on a par with HiSeq 2500.
Background
The cPAL sequencing technology developed by Complete Genomics was first described in 2009 [1]. In 2013, Complete Genomics was acquired by BGI (the Beijing Genomics Institute), and the technology was subsequently refined [2]. In 2015, a new commercially available second-generation genome analyzer, BGISEQ-500, was announced [3]. Since then, the cPAL technology has undergone substantial modifications.
The cPAS method was an important milestone in the evolution of this technology. The method exploits fluorescently labeled terminated substrates. In cPAS, sequencing begins when DNA polymerase extends a primer (anchor) complementary to a single DNA strand [4]. DNA nanoballs (DNBs), which are used for signal amplification, are single-stranded DNA molecules 160,000 to 200,000 bp long that consist of head-to-tail concatenated copies of one of the original DNA library molecules. The copies are created by rolling circle amplification of the DNA circles constituting the library. Each DNB rests in a separate spot of the patterned flow cell, which is ensured by its non-covalent binding to a charged substrate. The flow cell is a silicon wafer coated with silicon dioxide, titanium, hexamethyldisilazane and a photoresist material. DNBs are added to the flow cell and selectively bind to positively charged aminosilanes in a highly ordered pattern, allowing a very high density of DNA nanoballs to be sequenced [1], [5].
The sequencing process itself consists of a few steps, including the addition of a fluorescently labeled terminated nucleotide (sequencing by synthesis), the cleavage of the terminator during the synthesis process and the detection of the produced fluorescent signal [6], [7], [8]. We would like to emphasize that we were unable to find a detailed description of cPAS-based sequencing in the literature, nor could we determine how it is implemented in MGISEQ-2000. However, there is a patent in the public domain that describes an application of the cPAS approach in which the sequencing process is carried out using fluorescently labeled monoclonal antibodies that recognize a unique chemical modification of one of four terminated dNTPs [9]. In any case, it is not currently possible to obtain full information about sequencing by MGISEQ-2000.
A couple of years ago, a paper was published demonstrating a similar accuracy of SNP detection and slightly lower accuracy of indel detection for the BGISEQ-500 platform, as compared to HiSeq 2500, using a reference genomic dataset from GIAB [3]. A few recent studies have compared the performance of these two platforms in sequencing ancient DNA [10], metagenome [11] and microRNA [4]. In general, the quality of data generated by BGISEQ-500 has proved to be good, although some of its characteristics are somewhat worse than those of Illumina HiSeq 2500.
The Genome in a Bottle Consortium provides reference genomes for benchmarking [12]. By comparing the obtained genomic variants to a reference sequence, one can assess the accuracy/sensitivity of a tested instrument and the corresponding bioinformatics pipeline for data analysis. In our study, we departed somewhat from the conventional methods of analysis because we were constrained in time and material resources. Our intention was to test how suitable the MGISEQ-2000 platform is for assessing the mutational variability of embryonic cells. We therefore took the genome of a Russian female egg donor and conducted a genome-wide analysis using two platforms: Illumina HiSeq 2500 and MGISEQ-2000. Since HiSeq 2500 is a well-characterized and popular platform for genomic research, we decided to evaluate the overall error rate in order to understand whether we can use MGISEQ-2000 for our practical tasks.
CRISPR-Cas9-based genome editing technologies are an effective tool for altering the nucleotide sequence of target regions. The application of genome-editing technologies to in vitro fertilization (IVF) at the zygote stage holds clinical promise and allows almost complete elimination of the original DNA sequence in embryonic cells [13]. However, PCR used to assess the efficacy of targeted genome editing provides no information about the nonspecific activity of CRISPR-Cas9 systems, which can potentially affect any part of the genome. In this case, whole-genome sequencing (WGS) of the embryonic cell is needed. We decided to compare the performance of two massively parallel sequencing (MPS) platforms, Illumina HiSeq 2500 and MGI MGISEQ-2000, using the biological samples provided by one of the egg donors for embryonic genome editing.
Materials and Methods
Ethics
The research was carried out according to the Code of Ethics of the World Medical Association (Declaration of Helsinki). Written informed consent was obtained from the patient, and the study was approved by the Ethical Committee of the National Medical Research Center for Obstetrics, Gynecology and Perinatology Named After Academician V.I. Kulakov, Moscow, Russia.
DNA preparation
A sample of genomic DNA was isolated from white blood cells (WBC) by phenol-chloroform extraction. Quality control was done with agarose gel electrophoresis (degradation level) and the Qubit dsDNA BR Assay Kit (concentration measurement). The donor was a female resident of the Russian Federation.
Library preparation for sequencing
MGISEQ-2000
The circularization procedure is essentially the denaturation and renaturation of a DNA library in the presence of an excess of a splint oligo, which is dephosphorylated at the 5’-end and consists of inverted complementary sequences of the adapters ligated to the library. During renaturation with the splint oligo, a circular molecule is formed that is double-stranded in the adapter region and contains a nick. The nick is sealed by DNA ligase. Linear DNA library molecules are removed at the digestion stage using a mixture of nucleases that cleave linear molecules. A helpful schematic of this process has been prepared by MGI’s team [28].
The isothermal synthesis of nanoballs is carried out using the rolling circle amplification (RCA) mechanism and is initiated by the splint oligo. As a result, RCA forms a linear single-stranded DNA consisting of 300-500 repeats. A nanoball is a molecule compactly packed into a coil-like form 200-220 nm in diameter.
The procedure of loading nanoballs onto the flow cell is simplified and automated: the flow cell has a patterned array structure that promotes efficient loading (85.5% in our case), and, unlike with unordered flow cells (for example, those of Illumina MiSeq or HiSeq 2500), loading does not depend on the accuracy of library dilution. The nanoballs are loaded using a DNB Loader, a device similar to cBot (Illumina); alternatively, the loading procedure can be carried out manually using a plastic DNB manual adapter loader. The instrument and the reagents are prepared for sequencing in a way similar to that used for Illumina instruments. With MGISEQ-2000, water and maintenance washes must be performed. The ready-to-use reagents are delivered in a cartridge that needs to be pre-thawed. A flow cell for MGISEQ-2000 has four separate lanes and one surface on which DNBs are immobilized.
1000 ng of genomic DNA was fragmented using a Covaris ultrasonicator to achieve a length distribution of 100-700 bp with a peak at 350 bp. Size selection was performed with Ampure XP (Beckman). Concentrations were measured using Qubit; the amount of DNA obtained was 289 ng (procedure efficiency 29%). Then, an aliquot of 50 ng of the fragmentation product was transferred to a separate tube for end-repair and A-tailing. For ligation, an equimolar mixture of Barcode Adapters 501-508 was used. The ligation product was washed with Ampure XP, then 7 PCR cycles were performed using primers complementary to the ligated adapters. After washing the library with Ampure XP, its concentration was measured by Qubit. Before annealing and circularization with the splint oligo, the library was normalized to an amount of 330 ng in a volume of 60 μL. After linear DNA was digested, the concentration of circularized DNA (0.997 ng/μL) was measured by Qubit using the ssDNA kit.
After RCA and formation of DNBs, the end product was measured by Qubit using the ssDNA kit. The typical range of nanoball concentrations suitable for loading is 8-40 ng/μL. In our case, the concentration was 20 ng/μL. Nanoball loading was assisted by a DNB manual loader.
Illumina HiSeq 2500
500 ng of genomic DNA was enzymatically fragmented by dsDNA Fragmentase (NEB). The library was prepared using the NEBNext Ultra II kit and indexes from the Dual Index Primers Set 2 (all New England Biolabs) according to the manufacturer’s instructions; amplification at the last sample preparation stage was done in 3 PCR cycles.
MPS was carried out on the Illumina HiSeq 2500 in the Rapid Run mode (paired-end 150 bp dual indexing) using the 500-cycle v2 reagent kit according to the manufacturer’s instructions.
Sequencing
Preparation of genomic libraries and sequencing on MGISEQ-2000 were carried out by our research group at the facilities of MGI Tech. in Shenzhen. Fastq files were generated as described previously using the zebracallV2 software provided by the manufacturer [3].
Library preparation and sequencing on HiSeq 2500 were carried out at the Center for Genome Technologies of Russian National Research Medical University. Fastq files were generated using the Basespace cloud software offered by the manufacturer (https://basespace.illumina.com/analyses/140691740/files/logs).
Raw Data
Fastq files with WGS data for the E704 sample obtained from HiSeq 2500 and MGISEQ-2000 are available in the SRA database (BioProject: PRJNA530191, direct link https://www.ncbi.nlm.nih.gov/Traces/study/?acc=PRJNA530191).
Data analysis
The detailed description of the sequencing process and the scripts are provided in Supplementary file 1.
Results
Sequencing data summary
In this research, we analyzed two whole-genome datasets yielded by the sequencing of a Russian female donor’s gDNA (hereinafter, the sample is referred to as E704). Her genome was sequenced using two platforms with similar performance characteristics: HiSeq 2500 by Illumina and the new MGISEQ-2000 by BGI Complete Genomics. In the case of MGISEQ-2000, the DNA was loaded onto a separate lane of the flow cell. Sequencing was performed in paired-end 150 bp mode. We noted the amount of data generated by MGISEQ-2000 and calculated the average coverage. After that, we sequenced the donor’s genome using Illumina HiSeq 2500 in order to obtain a similar amount of data. General sequencing characteristics are presented in Table 1. The detailed description of library preparation is provided in Materials and Methods. We would like to note that we used different methods of DNA fragmentation for library preparation: fragmentation by ultrasound (E704-M) and enzymatic fragmentation (dsDNA fragmentase; E704-I). This fact is important for the discussion of our results.
As shown in Table 1, the size of the obtained data, as well as the characteristics of sequencing quality, indicate that the datasets can be analyzed and compared. The use of different fragmentation methods is unlikely to skew the comparison of the two datasets [14].
The average coverage is an important characteristic of whole-genome sequencing, as are its distribution and variability. Figure 1 compares the average coverage distribution for MGISEQ-2000 and HiSeq 2500. The figure shows a slightly higher average coverage for MGISEQ-2000 (32.75X for MGISEQ-2000 vs 30.48X for HiSeq 2500). At the same time, the overall coverage distribution is highly uniform for both datasets (interquartile range, IQR = 6), suggesting good sequencing quality [15].
The data presented in Figure 1 were obtained at the read alignment stage, after the FastQC analysis had been carried out. We mention them at the beginning of the Results section to make clear that the input data were similar in terms of coverage distribution and total number of reads.
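The two coverage summaries quoted above, the mean and the interquartile range, can be derived from per-position depth values. The snippet below is a minimal sketch rather than the script used in this study: the function name and the toy depth values are hypothetical, and real input would be a per-position depth table exported from the aligned BAM file.

```python
from statistics import mean, quantiles

def coverage_summary(depths):
    """Return (mean depth, interquartile range) for per-position depths."""
    # quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, _median, q3 = quantiles(depths, n=4, method="inclusive")
    return mean(depths), q3 - q1

# Toy depths for illustration only (not data from this study).
toy_depths = [28, 30, 31, 32, 33, 34, 36, 40]
avg, iqr = coverage_summary(toy_depths)
print(f"mean depth = {avg:.2f}X, IQR = {iqr:.2f}")
```

A small IQR relative to the mean indicates that most positions are covered at a similar depth, which is the sense in which the distribution is described as uniform.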
FastQC analysis
The next step in the comparison of both datasets was to assess the quality of FastQ files by FastQC [18]. We also analyzed every individual FastQ file generated by paired-end sequencing with different barcodes (see Materials and Methods).
Data quality revealed by FastQC analysis of the source files was acceptable and comparable for both platforms. K-mers were found at the start of the reads in the fastq files generated by MGISEQ-2000-based sequencing and at the end of the reads in the files yielded by HiSeq 2500-based sequencing. In the HiSeq 2500 fastq files, a deviation from the normal GC content was observed at the start of the reads. The k-mers might be explained by the presence of unremoved adapter sequences in both cases. The abnormal GC content could be a result of enzymatic fragmentation, which apparently causes a deviation from the random distribution pattern. Bearing that in mind, we decided to remove 10 nucleotides from both ends of each read in both the MGISEQ-2000 and HiSeq 2500 fastq files. Further manipulations were carried out on the resulting 130-nucleotide-long trimmed reads. We also trimmed adapter and other technical sequences (data not shown), which allowed us to retain more data and work with a higher average read length. This, however, was not crucial for our purposes, so we proceeded to the next steps of the comparative analysis. We merged all obtained fastq library files containing different barcodes so that each platform was represented by a single pair of fastq files, with forward (R1) and reverse (R2) reads, respectively. After merging the fastq files, we repeated the quality assessment using FastQC and found that the total data generated by both platforms were of acceptable quality and could be safely compared.
Figure 2 shows the quality of the sequencing data assessed by FastQC [18]. Data quality was acceptable at each nucleotide position within a read for both MGISEQ-2000 and HiSeq 2500. However, the quality at each position in the MGISEQ-2000 fastq file was somewhat lower than in the HiSeq 2500 file and tended to deteriorate gradually towards the end of a read (though it was not lower than Q20). For HiSeq 2500-generated data, a decline in quality below Q20 was observed only towards the very end of a read. For each nucleotide, the quality of MGISEQ-2000-based sequencing gradually decreased after 50-60 cycles. In contrast, the total number of high-quality nucleotides was higher for HiSeq 2500 and was maintained through the last cycle. A similar picture is demonstrated by the graphs representing the distribution of read quality (Fig. 2c). With Illumina, the distribution is more uniform, meaning that the average quality is higher. The quality of reads yielded by MGISEQ-2000-based sequencing is acceptable since 95% of all reads were above Q30. The GC content is similar for both platforms (Fig. 2d); the distribution graphs closely follow each other.
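The fixed-length trimming described above (10 nucleotides removed from each end of a 150-nucleotide read, leaving 130) can be sketched as follows. This is an illustrative reimplementation under our own assumptions, not the tool used in the study; the function names `trim_read` and `trim_fastq_records` are hypothetical, and the input is assumed to be uncompressed four-line FASTQ records.

```python
def trim_read(seq, qual, n_trim=10):
    """Remove n_trim bases (and their quality symbols) from both ends."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    if len(seq) <= 2 * n_trim:
        return "", ""  # read too short to survive trimming
    end = len(seq) - n_trim
    return seq[n_trim:end], qual[n_trim:end]

def trim_fastq_records(lines, n_trim=10):
    """Yield (header, seq, plus, qual) tuples with trimmed seq and qual.

    Assumes strict four-line FASTQ records (no wrapped sequences).
    """
    it = iter(lines)
    for header in it:
        seq = next(it).rstrip("\n")
        plus = next(it).rstrip("\n")
        qual = next(it).rstrip("\n")
        tseq, tqual = trim_read(seq, qual, n_trim)
        yield header.rstrip("\n"), tseq, plus, tqual

# A 150-nt read becomes a 130-nt read after trimming 10 nt from each end.
seq, qual = trim_read("A" * 150, "I" * 150)
print(len(seq), len(qual))
```

Trimming the quality string together with the sequence keeps the two in register, which downstream aligners require.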
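For readers less familiar with the Q20 and Q30 thresholds used above, the Phred scale relates a quality score Q to the estimated probability of a base-calling error as P = 10^(-Q/10). The sketch below illustrates this conversion; the helper `fraction_at_least` and the toy quality values are hypothetical and are not taken from our data.

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Convert an error probability to a Phred quality score."""
    return -10 * math.log10(p)

def fraction_at_least(qualities, threshold=30):
    """Fraction of quality values that are >= threshold."""
    if not qualities:
        raise ValueError("no quality values supplied")
    return sum(q >= threshold for q in qualities) / len(qualities)

print(phred_to_error_prob(20))  # one expected error in 100 base calls
print(phred_to_error_prob(30))  # one expected error in 1000 base calls
print(fraction_at_least([35, 28, 31, 40], threshold=30))  # toy values
```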
Reads mapping/alignment and QC
The filtered and trimmed reads were aligned to the reference genome, which was necessary to convert fastq files to BAM files. This was done using Burrows-Wheeler Aligner (BWA-MEM) with default settings recommended for the analysis of genomes sequenced on Illumina systems [19]. The quality of read alignment was assessed using the SAMtools software package and the bamstats software module [20,21].
The quality of read alignment was acceptable for both platforms. The insert size for paired-end libraries corresponds to the theoretical size specified in the manufacturer’s protocol: 250 bp for Illumina HiSeq 2500 and 400 bp for MGISEQ-2000. The proportion of aligned reads was 99.9% for both BAM files.
Figure 3 presents the results of the analysis of read alignment to the reference genome. Importantly, the frequency of random sequencing errors is much higher for MGISEQ-2000 and increases with the number of sequencing cycles. Another distinctive feature of MGISEQ-2000 is a shift in the distribution of fragment lengths towards shorter fragments, suggesting excessive genome fragmentation, although the distribution of insert lengths in the library is much closer to normal. This shift, however, may be more affected by the process of sample preparation than by the selected sequencing technology (data not shown).
Variant calling and false positive/negative ratio estimation
In order to further assess the quality of sequencing by MGISEQ-2000, as well as to understand the aspects of its potential use, the generated data were subjected to variant calling. After the data were aligned to the reference genome using BWA-MEM [19], the BAM file was processed using four different pipelines: Samtools [20,21], Strelka2 [22], Sentieon [23], and GATK [24].
The mapping speed, coverage, and sequencing homogeneity were similar for both datasets (Figure 1). All software packages used to process the datasets generated by Illumina and MGI demonstrated similar performance in terms of computation speed, which is consistent with the results obtained for BGISEQ [25].
Alignment results are provided in Table 2; the table shows that both sequencing platforms performed similarly well. The duplication rate for E704-I was higher than for E704-M, amounting to 12.26%. This value, however, was calculated after the fastq files with different barcodes and from different lanes were merged. In each individual fastq file, the duplication rate did not exceed 5-6% for both devices (see Supplementary information). With Illumina HiSeq 2500, 16 separate fastq files (8 forward + 8 reverse) were generated. The number of fastq files for MGISEQ-2000 was also 16, but they represented a single flow cell, whereas Illumina’s files came from two different flow cells. Thus, the higher duplication rate for Illumina results from the use of two flow cells. Most likely, the probability of getting repeated reads from two independent flow cells is higher than from one flow cell. When the information contained in the fastq files is combined, this results in an additional 3-4% of duplicates for Illumina-generated data relative to MGISEQ-2000.
Since it was not possible to conduct standard benchmarking procedures and determine error values in the reference genomic dataset under this study, we calculated error rates (False Positive, False Negative, etc.) in the E704-M dataset using E704-I as a reference. This approach cannot be used to assess the accuracy of the MGISEQ technology, but it does allow us to conclude that the two compared technologies can be used interchangeably for similar tasks without significant loss of accuracy.
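The logic of treating one call set as the reference for the other can be illustrated with a simple set comparison. The sketch below is a simplification under stated assumptions and is not the procedure used in the study: it matches variants only by chromosome, position, reference allele and alternative allele, ignores genotypes and variant normalization, and uses hypothetical function names and toy VCF lines.

```python
def variant_keys(vcf_lines):
    """Collect (chrom, pos, ref, alt) keys from tab-separated VCF lines."""
    keys = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alts = fields[0], int(fields[1]), fields[3], fields[4]
        for alt in alts.split(","):  # split multi-allelic records
            keys.add((chrom, pos, ref, alt))
    return keys

def compare_calls(reference_keys, test_keys):
    """Return (TP, FP, FN) counts for test calls against reference calls."""
    tp = len(reference_keys & test_keys)   # called in both datasets
    fp = len(test_keys - reference_keys)   # called only in the test set
    fn = len(reference_keys - test_keys)   # called only in the reference set
    return tp, fp, fn

# Toy records for illustration only (not data from this study).
reference_vcf = ["#CHROM\tPOS\tID\tREF\tALT", "chr1\t100\t.\tA\tG", "chr1\t200\t.\tC\tT"]
test_vcf = ["#CHROM\tPOS\tID\tREF\tALT", "chr1\t100\t.\tA\tG", "chr1\t300\t.\tG\tA"]
print(compare_calls(variant_keys(reference_vcf), variant_keys(test_vcf)))
```

Because the reference set itself contains errors, counts obtained this way measure disagreement between the two platforms rather than the absolute accuracy of either one.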
Figure 4 shows error rates determined by different software packages. The best result was demonstrated by Strelka2 [22]; below we will use the figures yielded by this pipeline. Variant calling results are presented in Additional file 2. The magnitude of the total error (False Negative + False Positive) between E704-M and E704-I corresponded to the previously obtained results for BGISEQ-500 and Illumina [https://blog.dnanexus.com/2018-07-02-comparison-of-bgiseq-500-to-illumina-novaseq-data/].
In total, over 3.7 million SNPs were detected in the datasets by each of the tested platforms. The E704-M sample contained 3,730,684 SNPs; the number of detected SNPs in E704-I was comparable (3,719,768 SNPs). These data are shown in Table 3. In addition, a similar Ti/Tv ratio was detected, which may indirectly indicate the sequencing accuracy.
MGISEQ-2000 detected slightly more indels (803,736) than HiSeq 2500 (770,193; see Table 3). Generally, HiSeq 2500 performance was characterized by a slightly lower average coverage, which partly explains its indel detection rate. However, given that the dbSNP indel rate for HiSeq 2500 was slightly higher (92.1% for E704-I versus 90.86% for E704-M), this may indicate a lower accuracy of indel detection by the MGISEQ-2000 platform. These observations are consistent with the previous findings for BGISEQ-500 [3].
To assess the accuracy of detection of particular genomic variants, we chose the E704-I dataset as a reference for E704-M. We recognize that this approach is not accurate enough for benchmarking and cannot be used to directly measure error rates in detecting various mutations, as proposed by the Genome in a Bottle Consortium [12]. However, since such benchmarking studies have been carried out many times for HiSeq 2500 and the permissible error rate for that technology has already been established by the Consortium, we decided to determine the level of differences for a single genome. Sequencing the same sample on two different instruments allowed us to estimate their interchangeability/similarity, using the HiSeq 2500 data as a reference.
For all the SNPs detected, we estimated the magnitude of various errors and calculated the F1-metric using vcf-compare (vcftools [26]) and snpeff [27].
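The Ti/Tv ratio mentioned above is the number of transitions (purine-to-purine or pyrimidine-to-pyrimidine substitutions) divided by the number of transversions. The sketch below shows one way to compute it from a list of (REF, ALT) pairs; it is illustrative only, the function names are hypothetical, and records that are not single-base substitutions between A, C, G and T are simply skipped.

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")

def is_transition(ref, alt):
    """True for A<->G or C<->T substitutions."""
    pair = {ref.upper(), alt.upper()}
    return len(pair) == 2 and (pair <= PURINES or pair <= PYRIMIDINES)

def ti_tv_ratio(snps):
    """Compute the transition/transversion ratio from (ref, alt) pairs."""
    ti = tv = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        if len(ref) != 1 or len(alt) != 1 or ref == alt:
            continue  # not a single-nucleotide substitution
        if ref not in "ACGT" or alt not in "ACGT":
            continue  # ambiguous or symbolic allele
        if is_transition(ref, alt):
            ti += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions observed; ratio is undefined")
    return ti / tv

# Toy input: two transitions, one transversion, one deletion (skipped).
print(ti_tv_ratio([("A", "G"), ("C", "T"), ("A", "C"), ("AT", "A")]))
```

Since random errors would produce transversions about twice as often as transitions, a ratio that departs strongly from the value expected for the species is often taken as a sign of false positive calls, which is why a similar ratio on both platforms is reassuring.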
Table 4 compares the variants obtained through variant calling by Strelka2; the data generated by other software packages are listed in the Additional file 2.
As a result, using the “accessible genome” matrix, the sensitivity of SNP detection in E704-M relative to E704-I was 99.51%, with a false positive rate (FPR) of 0.000254% (F1 metric = 99.65%). For indels, the sensitivity was 98.84% (F1 metric = 98.81%). It is worth noting that, although we did not compare against a reference sequence, the concordance of genotypes between the two platforms, MGISEQ-2000 and Illumina HiSeq 2500, is high for both the accessible genome and the complete sequenced genome, and indicates a higher accuracy of MGISEQ-2000 sequencing relative to previously obtained data for BGISEQ-500 [3]. These data are shown in Table 4.
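The metrics reported above follow from the standard confusion-matrix definitions. The sketch below shows the arithmetic under one assumption that should be stated explicitly: true negatives are taken to be positions of the accessible genome at which neither dataset reports a variant, which would explain why the FPR values are so small. The function name and the toy counts are hypothetical and are not the counts from Table 4.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return sensitivity (TPR), FPR, precision and F1 from raw counts."""
    sensitivity = tp / (tp + fn)   # share of reference variants recovered
    fpr = fp / (fp + tn)           # share of non-variant positions miscalled
    precision = tp / (tp + fp)     # share of test calls that are confirmed
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "fpr": fpr,
        "precision": precision,
        "f1": f1,
    }

# Toy counts for illustration only (not data from this study).
metrics = classification_metrics(tp=90, fp=10, fn=10, tn=890)
for name, value in metrics.items():
    print(f"{name} = {100 * value:.4f}%")
```

Note that the F1 metric does not depend on the true negative count, so it is less sensitive than the FPR to how the set of non-variant positions is defined.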
Discussion
We have compared two genomic datasets generated by Illumina HiSeq 2500 and MGISEQ-2000-based sequencing. As part of our study, we aimed to understand whether MGISEQ-2000 could be used for the whole-genome sequencing of embryos, SNP detection and other tasks faced by our laboratory.
Our study has revealed that MGISEQ-2000 generates datasets with characteristics similar to those of the data yielded by the “gold standard” of NGS analysis, the Illumina platform. Given a comparable amount of output data (101.37 Gb for MGISEQ and 94.37 Gb for Illumina), the average coverage was comparable between the two sets: 32.75X for MGISEQ-2000 vs 30.48X for HiSeq 2500; the coverage distribution patterns were almost identical (Figure 1).
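The relationship between data yield and average coverage can be checked with a back-of-the-envelope estimate: coverage is approximately the number of sequenced bases divided by the genome size. The sketch below assumes that Gb denotes gigabases and uses an approximate human genome size of 3.1 billion bp, a value that is our assumption for illustration and is not stated elsewhere in this article; the reported coverage was derived from aligned data, so exact agreement is not expected.

```python
# Assumed approximate human genome size in bp (illustrative value).
GENOME_SIZE_BP = 3.1e9

def mean_coverage(total_bases, genome_size=GENOME_SIZE_BP):
    """Estimate average coverage as sequenced bases per genome position."""
    return total_bases / genome_size

yields_gb = {"MGISEQ-2000": 101.37, "HiSeq 2500": 94.37}
for platform, gigabases in yields_gb.items():
    coverage = mean_coverage(gigabases * 1e9)
    print(f"{platform}: ~{coverage:.1f}X")
```

Under this assumption the estimates fall close to the reported 32.75X and 30.48X, which suggests that the difference in coverage between the platforms is explained mainly by the difference in data yield.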
The analysis demonstrates that sequencing quality is similar for both instruments. The existing differences can be explained by the specifics of the preliminary steps of library preparation and not by the sequencing technique itself.
Four different pipelines were used to perform variant calling. The detection rate of genomic variants was similar between the datasets. The computational time required to process the obtained data was comparable between all software packages and all datasets used. The performance of Strelka2 was characterized by the lowest number of errors (Figure 4).
The quality of data obtained with MGISEQ-2000 is inferior in some respects to that generated by Illumina HiSeq 2500. Specifically, the frequency of random sequencing errors is lower, and the percentage of high-quality reads and the accuracy of indel detection are higher, for HiSeq 2500. However, the magnitude of those differences is small and insignificant for most research tasks. Last but not least, sequencing costs are an important factor for laboratories in developing countries, including the Russian Federation. To our knowledge, the MGISEQ-2000 platform is comparable to NovaSeq in terms of its costs, but advantageously requires a smaller number of samples per run.
List of abbreviations
- bp: base pair
- cPAS: combinatorial Probe-Anchor Synthesis
- dATP: deoxyadenosine triphosphate
- dTTP: deoxythymidine triphosphate
- DNBs: DNA nanoballs
- FNR: false negative rate
- FPR: false positive rate
- FN: false negative
- FP: false positive
- GIAB: Genome in a Bottle
- MPS: massively parallel sequencing
- PCR: polymerase chain reaction
- PE150: paired-end 150 bp
- SNPs: single nucleotide polymorphisms
- indels: insertions and deletions
- WGS: whole-genome sequencing
- WBC: white blood cell
Conflicts of interest
DKw is an employee of OOO “Helicon company”, a distributor of MGI Tech LLC in the Russian Federation.
Authors’ contributions
DR and DK designed the project. DKw and DK conducted sample preparation and sequencing library construction. VB, DKw and DK conducted sequencing. NK, VN and AG conducted data analysis. DK and AG wrote the manuscript.
DR - Denis Rebrikov
DK - Dmitriy Korostin
VB - Vera Belova
DKw - Dmitry Kwon
NK - Nikolay Kulemin
AG - Alexey Gorbachev
VN - Vladimir Naumov