Single Molecule Sequencing of Cell-free DNA from Maternal Plasma for Noninvasive Trisomy Detection

The demand of non-invasive prenatal testing for autosomal aneuploidy using cell-free fetal DNA (cffDNA) in maternal plasma is a highly sought-after diagnostic, with a rapidly growing market. Current approaches developed by next generation sequencing (NGS) need PCR amplifcation during sample preparation, which results in amplification bias in GC-rich areas of the human genome. With these approaches, the minimum fetal fraction in maternal plasma is 4% for the small differences in circulating cfDNA between trisomic and disomic pregnancies to be detectable. In this paper, we performed single molecule sequencing of cell-free DNA from maternal plasma for noninvasive trisomy 13, 18 and 21 detections using the GenoCare platform. We found that single molecule sequencing is sensitive enough to detect these chromosome abnormalities when the fetal DNA fraction is as low as 2%. Compared to the Hiseq2500 platform, no significant GC bias was observed. The improved sensitivity and unbiased GC readout make GenoCare a promising platform for autosomal aneuploidy detections, even in the very early stage of pregnancy.


Introduction
Trisomy 21 (Down syndrome), trisomy 18 (Edwards syndrome) and trisomy 13 (Patau syndrome) are the most common live-birth autosomal aneuploidies. Their incidence of live-birth is as high as 1/700, 1/6,000 and 1/10,000 respectively [1,2]. These diseases can induce serious morbidity and mortality [2]. Amniocentesis and chorionic villus sampling are currently the standard procedures for the diagnosis of fetal aneuploidy [3]. However, these invasive prenatal tests can lead to miscarriage, with rates of total pregnancy loss ranging from 0.6% to 1.9% [4]. To avoid this risk, there is an increasing demand for non-invasive prenatal testing (NIPT) for accurate and safe diagnosis in the early stages of pregnancy.
The discovery of cell-free fetal DNA (cffDNA) in maternal plasma was first reported in 1997 [5]. The cffDNA represents a constant relative proportion in maternal plasma [6], thus, sequencing cfDNA opens up a new window for testing fetal genetic conditions. In the first and second trimesters of pregnancy, the faction of cffDNA is 4-10% in total cell-free DNA (cfDNA); and this ratio rises to 10-20% in the third trimester pregnancies [7]. In 2008, the first studies were published showing NIPT of fetal chromosomal aneuploidy by next generation sequencing (NGS) in maternal plasma with Solexa/Illumina platform [9,10]. Cell-free DNA was fragmented in nature, which can be measured by NGS without additional shearing. NGS has been developed as highly sensitive and specific non-invasive screening tools for common fetal chromosome aneuploidies. Detection of fetal trisomy 21 has been reported, but detection of trisomy 13 and 18 seems to be less accurate with this approach [11,12]. The reduced accuracy was due to GC bias influencing sequencing data. A further drawback of NGS is the need for costly equipment and reagents. For example, a trisomy 21 detection test costs roughly $2000 [13].
Amplification-free single molecule sequencing technology can avoid the GC bias of genome. Quake et al. first reported a single-molecule-sequencing technology by a sequencing-by-synthesis method in 2003 [14]. Helicos technology, a PCR-free singlemolecule-sequencing platform, has been applied for high sensitivity and specificity noninvasive fetal aneuploidy detection without obvious GC bias [15,16]. In our previous study, we demonstrated a direct targeted sequencing of cancer related gene mutations by single molecule sequencing method [17]. Recently, we reported single molecule sequencing of the M13 virus genome and Escherichia coli genome by GenoCare platform developed by Direct Genomics. Here, we presented our study of accurately detecting of trisomy 21, 18 and 13 by single molecular DNA sequencing platform GenoCare.

Results and discussions 1. Sequencing results of trisomy chorinic villus samples
We first demonstrated detecting trisomy positive DNA samples respectively from chorionic villus samples of fetus with trisomy 13, trisomy 18 and trisomy 21 on GenoCare platform in Figure 1. The human genome chromosomes were uniformly segmented into bins containing 150k bases, and then the number of reads mapped back to each bin were counted.
The normalized bin depth was defined as the number of reads mapped to a bin divided by the average bin reads of the autosomes. The data is fitted by LOWESS regression as shown in red dots in Figure 1. From Figure 1, we observed that the normalized bin depths of the trisomy chromosomes are significantly higher than those of other normal chromosomes. The Z score is also known as standard score, which is the number of standard deviations between a certain observed value and the mean value of the measured data. Z score is widely used in NIPT. The Z score calculated for chromosome 13 corresponding to Figure 1a

Determining the sensitivity by mixing trisomy 21 samples with normal samples
In order to test the sensitivity of the GenoCare platform for NIPT detection, we sequenced samples which were mixtures of genomic DNA extracted from normal women peripheral blood and from chorionic villus samples with trisomy 21. Such mixtures were used to mimic the DNA samples from healthy mother with trisomy 21 fetus. Here we prepared samples containing 2%, 3% and 5% trisomy 21 positive DNA, and the rest part of the samples was normal DNA from healthy people.

GC bias
A direct comparison of GC bias between the single-molecule-sequencer GenoCare and the NGS platform Illumina Hiseq 2500 was made by sequencing the same DNA samples collected from healthy people on both platforms (Figure 5). The GC bias was calculated as ΔRGC 2 , which was defined as the effect of GC bias on the relative reads number [18]. The larger value of ΔRGC 2 represented stronger GC bias. The data recorded by GenoCare showed much less GC bias than Hiseq2500, which can significantly improve the sensitivity for chromosome aneuploidies detection.

Sequencing statistics
The raw reads number, mapped reads number and read length for each positive sample and a representative negative sample were listed in Table 1. The average mapped reads is 5.82 million and the average read length is 34.8bp; The length distribution of the mapped reads for the negative sample presented in Table 1 is in Figure 6. Unlike Illumina or Ion sequencer, GenoCare single molecule sequencing does not suffer from the dephasing issue. The read length is distributed from 28 to 50 bp when the sequencing chemistry was performed in 120 cycles. Longer reads length can be achieved by increasing the sequencing cycles.    Base calling: The base calling method applied in this study is exactly the same as that we reported in previous work [19].

Mapping method
For the mapping of single molecule sequencing reads, a new software named DirectAlignment has been developed in-house. DirectAlignment performs the mapping using a 3-step strategy.
First, global alignment is carried out to locate the candidate mapping sites on a reference genome for each reads. We adopted both the k-mers hashtable and Burrows-Wheeler-Transform (BWT) algorithms to build a reference in Pre-Alignment. To accelerate the mapping process, all reads were transformed into a binary sequence. With all candidate sites in hand, a fast bit-vector algorithm for approximate string matching based on dynamic programming [19] were applied for filtration. The candidate sites which fail to fulfill the maximum mapping error tolerance were discarded at this stage. After filtration, only a limited number of candidate sites are left. For further precise analysis of substitution, deletion and insertion errors, the Smith-Waterman local alignment algorithm or Needleman alignment algorithm were applied on reads and reference segment. Our aligner used SSE2 to accelerate the edit distance computation. In addition, in order to guarantee a good accuracy of mapping sites, a maximum error tolerance filter and minimum length filter were applied for final mapping results. A read that failed to meet the maximum error tolerance was cut from both two ends base by base until it meet the error tolerance and was accepted as a mapped read or until its length fell below the minimum length requirement and was discarded.

Calculation of normalized bin depth
The human genome chromosomes were uniformly segmented into bins containing 150k bases, and then the number of reads mapped back to each bin was counted. The bins containing fewer than 10k unknown reference sequences (marked as N) and with greater than 10 bin reads were kept for subsequent processing. The selected bins were corrected for their GC-content using LOWESS fitting method. The number of reads in each bin was divided by the average number of bin reads of the autosomes, and such result was defined as the normalized bin depth.

Z score
Z score is also known as standard score, which is widely used in NIPT. It is the number of standard deviations between a certain observed value and the mean value of the measured data. Z score is calculated as z = − , in which x is the observed value, µ is the mean value of the observation, and σ is the standard deviation of the observation.

GC bias
Here the GC bias is evaluated using ΔRGC 2 , which is defined as the effect of GC bias on the relative reads number. ΔRGC 2 was calculated according to the equation and L R is the optimal prediction of the normalized bin depth obtained via a LOWESS regression fitting of the normalized bin depth against the GC content. Note that here the segmentation of bins on genome is exactly the same as described in previous sections, and each bin contains 150k bases.