Single tube bead-based DNA co-barcoding for cost effective and accurate sequencing, haplotyping, and assembly

Ou Wang; Robert Chin; Xiaofang Cheng; Michelle Ka Wu; Qing Mao; Jingbo Tang; Yuhui Sun; Han K. Lam; Dan Chen; Yujun Zhou; Linying Wang; Fei Fan; Yan Zou; Ellis Anderson; Yinlong Xie; Rebecca Yu Zhang; Snezana Drmanac; Darlene Nguyen; Chongjun Xu; Christian Villarosa; Scott Gablenz; Nina Barua; Staci Nguyen; Wenlan Tian; Jia Sophie Liu; Jingwan Wang; Xiao Liu; Xiaojuan Qi; Ao Chen; He Wang; Yuliang Dong; Wenwei Zhang; Andrei Alexeev; Huanming Yang; Jian Wang; Karsten Kristiansen; Xun Xu; Radoje Drmanac; Brock A. Peters

doi:10.1101/324392

Abstract

Single tube long fragment read (stLFR) technology enables efficient WGS, haplotyping, and contig scaffolding. It is based on adding the same barcode sequence to sub-fragments of the original DNA molecule (DNA co-barcoding). To achieve this, stLFR uses the surface of microbeads to create millions of miniaturized compartments in a single tube. Using a combinatorial process, over 1.8 billion unique barcode sequences were generated on beads, enabling practically non-redundant co-barcoding in reactions with 50 million barcodes. Using stLFR we demonstrate efficient unique co-barcoding of over 8 million 20-300 kb genomic DNA fragments with near perfect variant calling and phasing of the genome of NA12878 into contigs with an N50 of up to 23.4 Mb. stLFR represents a low-cost single library solution that can enable long sequence data.

To date the vast majority of individual whole genome sequences lack information regarding the order of single to multi-base variants transmitted as contiguous blocks on homologous chromosomes. Numerous technologies^1–11 have recently been developed to enable this. Most are based on the process of co-barcoding¹², that is, the addition of the same barcode to the sub-fragments of single long genomic DNA molecules. After sequencing the barcode information can be used to determine which reads are derived from the original long DNA molecule. This process was first described by Drmanac¹³ and implemented as a 384-well plate assay by Peters et al.⁶. However, these approaches are technically challenging to implement, expensive, have lower data quality, do not provide unique co-barcoding, or some combination of all four. In practice, most of these approaches require a separate whole genome sequence to be generated by standard methods to improve variant calling. This has resulted in the limited use of these methods as cost and ease of use are dominant factors in what technologies are used for WGS.

Results

stLFR library process

Here we describe implementation of stLFR technology¹⁴, an efficient approach for DNA co-barcoding with millions of barcodes enabled in a single tube. This is achieved by using the surface of a microbead as a replacement for a compartment (e.g., the well of a 384-well plate). Each bead carries many copies of a unique barcode sequence which is transferred to the sub-fragments of each long DNA molecule. These co-barcoded sub-fragments are then analyzed on common short read sequencing devices such as the BGISEQ-500 or equivalent. In our implementation of this approach we use a ligation-based combinatorial barcode generation strategy to create over 1.8 billion different barcodes in three ligation steps. For a single sample we use ~10-50 million of these barcoded beads to capture ~10-100 million long DNA molecules in a single tube. It is infrequent that two beads will share the same barcode because we sample 10-50 million beads from such a large library of total barcodes. Furthermore, in the case of using 50 million beads and 10 million long genomic DNA fragments, the vast majority of sub-fragments from each long DNA fragment are co-barcoded by a unique barcode. This is analogous to long-read single molecule sequencing and potentially enables powerful informatics approaches for de novo assembly. A similar but informatically limited and less efficient approach using only ~150,000 barcodes was recently described by Zhang et al.¹⁵. Importantly, stLFR is simple to perform and can be implemented with a relatively small investment in oligonucleotides to generate barcoded beads. Further, stLFR uses standard equipment found in almost all molecular biology laboratories and can be analyzed by almost any sequencing strategy. Finally, stLFR replaces standard NGS library preparation methods, requires only 1 ng of DNA, and does not add significantly to the cost of whole genome or whole exome analyses with a total cost per sample of less than 30 dollars (Table 1).
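The uniqueness argument above can be checked with a rough calculation. The sketch below assumes that beads draw their barcodes uniformly at random from the pool and that each long DNA molecule is captured by a uniformly random bead; neither assumption is stated in the text, and the function names are illustrative only.

```python
import math


def frac_beads_with_unique_barcode(n_beads, n_barcodes):
    """Expected fraction of beads whose barcode is carried by no other bead.

    Assumes every bead draws its barcode uniformly at random from the pool.
    """
    # Each of the other (n_beads - 1) beads misses a given barcode
    # with probability (1 - 1/n_barcodes).
    return math.exp((n_beads - 1) * math.log1p(-1.0 / n_barcodes))


def frac_molecules_alone_on_bead(n_molecules, n_beads):
    """Expected fraction of molecules that share their bead with no other molecule.

    Assumes every molecule is captured by a uniformly random bead.
    """
    return math.exp((n_molecules - 1) * math.log1p(-1.0 / n_beads))


if __name__ == "__main__":
    total_barcodes = 1.8e9  # barcode pool size quoted in the text
    beads = 50e6            # beads used per sample
    molecules = 10e6        # long DNA molecules per sample
    unique = frac_beads_with_unique_barcode(beads, total_barcodes)
    alone = frac_molecules_alone_on_bead(molecules, beads)
    print(f"beads with a unique barcode: {unique:.3f}")
    print(f"molecules alone on a bead:   {alone:.3f}")
```

Under these assumptions roughly 97% of beads carry a barcode shared with no other bead, and about 82% of molecules sit alone on their bead, which is consistent with the "vast majority" statement for 50 million beads and 10 million fragments.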

Table 1. stLFR equipment and reagent cost (USD)

The first step in stLFR is the insertion of a hybridization sequence at regular intervals along genomic DNA fragments. This is achieved through the incorporation of DNA sequences, by the Tn5 transposase, containing a single stranded region for hybridization and a double stranded sequence that is recognized by the enzyme and enables the transposition reaction (Figure 1a). Importantly, this step is done in solution, as opposed to having the insertion sequence linked directly to the bead¹⁵. This enables a very efficient incorporation of the hybridization sequence along the genomic DNA molecules. As previously observed¹⁰, the transposase enzyme has the property of remaining bound to genomic DNA after the transposition event, effectively leaving the transposon-integrated long genomic DNA molecule intact. After the DNA has been treated with Tn5 it is diluted in hybridization buffer and added to 50 million ~2.8 um clonally barcoded beads in hybridization buffer. Each bead contains approximately 400,000 capture adapters, each containing the same barcode sequence. A portion of the capture adapter contains uracil nucleotides to enable destruction of unused adaptors in a later step. The mix is incubated under optimized temperature and buffer conditions during which time the transposon inserted DNA is captured to beads via the hybridization sequence. It has been suggested that genomic DNA in solution forms balls with both tails sticking out¹⁶. This may enable the capture of long DNA fragments towards one end of the molecule followed by a rolling motion that wraps the genomic DNA molecule around the bead. Approximately every 7.8 nm on the surface of each bead there is a capture oligo. This enables a very uniform and high rate of subfragment capture. A 100 kb genomic fragment would wrap around a 2.8 um bead approximately 3 times. In our data, 300 kb is the longest fragment size captured suggesting larger beads may be necessary to capture longer DNA molecules. 
Beads are next collected and individual barcode sequences are transferred to each subfragment through ligation of the nick between the hybridization sequence and the capture adapter (Figure 1a). At this point the DNA/transposase complexes are disrupted producing sub-fragments less than 1 kb in size. Due to the large number of beads and high density of capture oligos per bead, the amount of excess adapter is four orders of magnitude greater than the amount of product. This huge unused adapter can overwhelm the following steps. In order to avoid this, we designed beads with capture oligos connected by the 5’ terminus. This enabled an exonuclease strategy to be developed that specifically degraded excess unused capture adapter.
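The bead geometry figures quoted above (adapter spacing and the number of wraps) can be reproduced approximately with a few lines of arithmetic. This is a sketch that assumes adapters tile the bead surface evenly and that the DNA is fully extended B-form duplex with a rise of 0.34 nm per base pair; both are simplifying assumptions, not statements from the text.

```python
import math

BEAD_DIAMETER_NM = 2800.0    # ~2.8 um bead, from the text
ADAPTERS_PER_BEAD = 400_000  # capture adapters per bead, from the text
RISE_PER_BP_NM = 0.34        # assumed B-form DNA rise per base pair


def adapter_spacing_nm(diameter_nm, n_adapters):
    """Mean spacing if adapters tile the sphere surface evenly."""
    surface_area = math.pi * diameter_nm ** 2  # 4*pi*r^2 == pi*d^2
    return math.sqrt(surface_area / n_adapters)


def wraps_around_bead(fragment_bp, diameter_nm, rise_nm=RISE_PER_BP_NM):
    """Number of great-circle turns a fully extended fragment could make."""
    contour_length = fragment_bp * rise_nm
    circumference = math.pi * diameter_nm
    return contour_length / circumference


if __name__ == "__main__":
    spacing = adapter_spacing_nm(BEAD_DIAMETER_NM, ADAPTERS_PER_BEAD)
    wraps = wraps_around_bead(100_000, BEAD_DIAMETER_NM)
    print(f"adapter spacing: {spacing:.1f} nm")
    print(f"wraps for a 100 kb fragment: {wraps:.1f}")
```

With these inputs the spacing comes out at about 7.8 nm, matching the text. A fully extended 100 kb fragment makes between three and four turns, the same order as the approximately three wraps quoted; the exact figure depends on how extended the DNA actually is.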

Figure 1. Overview of stLFR.

(A) The first step of stLFR involves inserting a hybridization sequence approximately every 200-1000 base pairs on long genomic DNA molecules. This is achieved using transposons. The transposon integrated DNA is then mixed with beads that each contain ~400,000 copies of an adapter sequence that contains a unique barcode shared by all adapters on the bead, a common PCR primer site, and a common capture sequence that is complementary to the sequence on the integrated transposons. After the genomic DNA is captured to the beads, the transposons are ligated to the barcode adapters. There are a few additional library processing steps and then the co-barcoded sub-fragments are sequenced on a BGISEQ-500 or equivalent sequencer. (B) Mapping read data by barcode results in clustering of reads within 10 to 350 kb regions of the genome. Total coverage and barcode coverage from 4 barcodes are shown for the 1 ng stLFR-1 library across a small region on Chr11. Most barcodes are associated with only one read cluster in the genome. (C) The number of original long DNA fragments per barcode are plotted for the 1 ng libraries stLFR-1 (blue) and stLFR-2 (orange) and the 10 ng stLFR libraries stLFR-3 (yellow) and stLFR-4 (grey). Over 80% of the fragments from the 1 ng stLFR libraries are co-barcoded by a single unique barcode. (D) The fraction of nonoverlapping sequence reads (blue) and captured sub-fragments (orange) covering each original long DNA fragment are plotted for the 1 ng stLFR-1 library.

In one approach to stLFR, two different transposons are used in the initial insertion step, allowing PCR to be performed after exonuclease treatment. However, this approach results in approximately 50% less coverage per long DNA molecule as it requires that two different transposons were inserted next to each other to generate a proper PCR product. To achieve the highest coverage per genomic DNA fragment we use a single transposon in the initial insertion step and add an additional adapter through ligation. This noncanonical ligation, termed 3’ branch ligation, involves the covalent joining of the 5’ phosphate from the blunt-end adapter to the recessed 3’ hydroxyl of the genomic DNA (Figure 1a). A detailed explanation of this process has previously been described by some of us (Wang et al., under review). Using this method, it is theoretically possible to amplify and sequence all sub-fragments of a captured genomic molecule. In addition, this ligation step enables a sample barcode to be placed adjacent to the genomic sequence for sampling multiplexing. This is useful as it does not require an additional sequencing primer to read this barcode. After this ligation step, PCR is performed and the library is ready to enter any standard next generation sequencing (NGS) workflow. In the case of BGISEQ-500, the library is circularized as previously described¹⁷. From single stranded circles DNA nanoballs are made and loaded onto patterned nanoarrays¹⁷. These nanoarrays are then subjected to combinatorial probe-anchor synthesis (cPAS) based sequencing on the BGISEQ-500^18–20. After sequencing, barcode sequences are extracted using a custom program (Supplementary Materials). Mapping the read data by unique barcode shows that most reads with the same barcode are clustered in a region of the genome corresponding to the length of DNA used during library preparation (Figure 1b). 
A detailed description of this method, as well as a protocol for making the beads can be found in the supplementary materials.
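The clustering of reads by barcode described above can be illustrated with a small sketch. It is not the custom program referred to in the text; the input tuple layout and the `max_gap` threshold are hypothetical choices made for illustration.

```python
from collections import defaultdict


def infer_fragments(reads, max_gap=50_000):
    """Group mapped reads into putative long fragments.

    reads:   iterable of (barcode, chrom, pos) tuples.
    max_gap: hypothetical distance (bp) above which two reads with the same
             barcode are assumed to come from different long molecules.
    Returns {barcode: [(chrom, start, end, n_reads), ...]}.
    """
    by_key = defaultdict(list)
    for barcode, chrom, pos in reads:
        by_key[(barcode, chrom)].append(pos)

    fragments = defaultdict(list)
    for (barcode, chrom), positions in sorted(by_key.items()):
        positions.sort()
        start = prev = positions[0]
        count = 1
        for pos in positions[1:]:
            if pos - prev > max_gap:
                # Gap too large: close the current fragment, start a new one.
                fragments[barcode].append((chrom, start, prev, count))
                start, count = pos, 0
            prev = pos
            count += 1
        fragments[barcode].append((chrom, start, prev, count))
    return dict(fragments)
```

Counting the entries per barcode in the returned dictionary gives the number of long fragments per barcode, the quantity plotted in Figure 1c.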

stLFR read coverage and variant calling

To demonstrate stLFR phasing and variant calling we generated four libraries using 1 ng (stLFR-1 and stLFR-2) and 10 ng (stLFR-3 and stLFR-4) of DNA from NA12878. The number of beads was varied, with 10 million (stLFR-3), 30 million (stLFR-4), and 50 million (stLFR-1 and stLFR-2) used. Finally, both the 3’ branch ligation (stLFR-1, stLFR-2, and stLFR-3) and two transposon (stLFR-4) methods were tested. Both stLFR-1 and stLFR-2 were sequenced deeply to 336 Gb and 660 Gb of total base coverage, respectively. We also analyzed these at downsampled coverages. stLFR-3 and stLFR-4 were sequenced to more modest levels of 117 Gb and 126 Gb, respectively. Co-barcoded reads were mapped to build 37 of the human reference genome using BWA-MEM²¹. Because stLFR does not require any preamplification steps, read coverage distribution across the genome was close to Poisson (Figure S1). The non-duplicate coverage ranged from 34-58X and the number of long DNA molecules per barcode ranged from 1.2-6.8 (Table 2 and Figure 1c). As expected, the stLFR libraries made from 50 million beads and 1 ng of genomic DNA had the highest single unique barcode co-barcoding rates of over 80% (Figure 1c). These libraries also showed the highest average non-overlapping read coverage per long DNA molecule of 10.7-12.1% and the highest average non-overlapping base coverage of captured subfragments per long DNA molecule of 17.9-18.4% (Figure 1d). This coverage is ~10 X higher than previously demonstrated using 3 ng of DNA and transposons attached to beads¹⁵. This suggests our solution-based transposition process is 3-fold more efficient at sub-fragment capture (40.7-47.4 sub-fragments per genomic fragment in 1 ng of genomic DNA versus 5 sub-fragments captured in 3 ng at similar read coverage as reported by Zhang et al.¹⁵, Table 2).

Table 2. Phasing and variant calling statistics

For each library variants were called using GATK²² using default settings. Comparing SNP and indel calls to Genome in a Bottle (GIAB)²³ allowed for the determination of false positive (FP) and false negative (FN) rates (Table 2). In addition, we performed variant calling using the same settings in GATK on a standard non-stLFR library made from ~1000 times more genomic DNA and also sequenced on a BGISEQ-500 (STD), and a Chromium library from 10X Genomics¹¹. We also compared precision and sensitivity rates against those reported in the bead haplotyping library study by Zhang et al.¹⁵. Our stLFR approach and that practiced by Zhang et al. demonstrated lower SNP and Indel FP rates than the Chromium library. stLFR had 2-fold higher FP and FN rates than the STD library and depending on the particular stLFR library and filtering criteria the FN rate was either higher or lower than the Chromium library. The higher FN rate in stLFR libraries compared to standard libraries is primarily due to the shorter average insert size (~200 bp versus 300 bp in a standard library). That said, stLFR had a much lower FN rate than Zhang et al. for SNPs and Indels and a much lower FN rate than the Chromium library for Indels (Table 2). Overall, most metrics for variant calling were better for our stLFR libraries than the published results from Zhang et al. or Chromium libraries, especially when nonoptimized mapping and variant calling processes were used (Table 2, “No Filter”).
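The comparison against a benchmark call set described above reduces to set arithmetic once both call sets are restricted to the same regions. The following is a minimal sketch, assuming variants are represented as (chrom, pos, ref, alt) tuples and ignoring genotype; it is not the comparison tool used in the study.

```python
def compare_to_truth(calls, truth):
    """Compare a call set to a truth set.

    calls, truth: sets of (chrom, pos, ref, alt) tuples, both assumed to be
    restricted to the benchmark's high-confidence regions beforehand.
    """
    tp = len(calls & truth)   # called and in the benchmark
    fp = len(calls - truth)   # called but absent from the benchmark
    fn = len(truth - calls)   # in the benchmark but not called
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    return {
        "TP": tp,
        "FP": fp,
        "FN": fn,
        "precision": precision,
        "sensitivity": sensitivity,
        "FN_rate": 1.0 - sensitivity if tp + fn else 0.0,
    }
```

Because genotype is ignored here, a variant called with the wrong zygosity would count as a true positive in this sketch; a stricter comparison would add the genotype to the tuple.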

One potential issue with using GIAB data to measure the FP rate is that we were unable to use the GIAB reference material (NIST RM 8398) due to the rather small fragment size of the isolated DNA. For this reason, we used the GM12878 cell line and isolated DNA using a dialysis-based method capable of yielding very high molecular weight DNA (see methods). However, it is possible that our isolate of the GM12878 cell line could have a number of unique somatic mutations compared to the GIAB reference material and thus cause the number of FPs to be inflated in our stLFR libraries. To examine this further we compared the overlap of single nucleotide FP variants between the 4 stLFR libraries and the two non-LFR libraries (Figure S2a). Overall, 544 FP variants were shared between the six libraries and 2,078 FPs were unique to the four stLFR libraries. We also compared stLFR FPs with the Chromium library and found that over half (1,194) of these shared FPs were also present in the Chromium library (Figure S2b). An examination of the read and barcode coverage of these shared variants showed they were more similar to that of TP variants (Figure S3-4). We also examined the distribution across the genome of these shared FP variants versus 2,078 randomly selected variants (Figure S5a). This analysis showed 219 variants that are found in clusters where two or more of these FPs are within 100 bp of each other. However, the majority (90%) of variants have distributions that appear indistinguishable from randomly selected variants. In addition, of those FPs shared between stLFR and Chromium libraries only 41 were found to be clustered (Figure S5a). Finally, 96 of these variants are called by GIAB but with a different zygosity than called in the stLFR libraries.

If we accept the evidence that these shared FP variants are largely real and not present in the GIAB reference material, the FP rate for stLFR could be up to 1,859 variants less than what is reported in Table 2 for SNP detection. This is still several thousand single nucleotide variants more than the standard BGISEQ-500 library. To further improve the FP rate in stLFR libraries we tested a number of different filtering strategies for removing errors. Ultimately, by applying a few filtering criteria based on reference and variant allele ratios and barcode counts (see Methods) we were able to remove 3,647-13,840 FP variants depending on the library and amount of coverage. Importantly, this was achieved while only increasing the FN rate by 0.10-0.29% in the stLFR libraries. After this filtering step we examined the shared FPs between the four stLFR libraries. Filtering removed only 340 shared FP variants, of which 147 were clustered within 100 base pairs of each other and likely not real (Figure S5b). This further suggests most of these shared FPs are real variants. Taking into account these variants and the reduced number of FP variants after filtering results in a similar FP rate and a 2-3 fold higher FN rate than the filtered STD library for SNP calling (Table S1). This increased FN rate is primarily due to increased non-unique mapping of mate-pairs with short insert sizes in stLFR libraries.
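A filter of the kind described above, combining allele ratio and barcode support, might look like the sketch below. The actual criteria are given in the Methods and are not reproduced here; the field names and both thresholds are hypothetical placeholders chosen only to show the shape of such a filter.

```python
from dataclasses import dataclass


@dataclass
class VariantEvidence:
    ref_reads: int     # reads supporting the reference allele
    alt_reads: int     # reads supporting the variant allele
    alt_barcodes: int  # distinct barcodes among the variant-supporting reads


def passes_filter(v, min_alt_fraction=0.2, min_alt_barcodes=3):
    """Keep a call only if the variant allele is well supported.

    Both thresholds are illustrative placeholders, not the published values.
    """
    total = v.ref_reads + v.alt_reads
    if total == 0:
        return False
    alt_fraction = v.alt_reads / total
    return alt_fraction >= min_alt_fraction and v.alt_barcodes >= min_alt_barcodes
```

Requiring support from several distinct barcodes uses information that a standard library does not have: reads carrying different barcodes come from different long molecules, so an error confined to one molecule is less likely to pass.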

stLFR phasing performance

To evaluate variant phasing performance high confidence variants from GIAB were phased using the publicly available software package HapCut2²⁴. Over 99% of all heterozygous SNPs were placed into contigs with N50s ranging from 0.6-15.1 Mb depending on the library type and the amount of sequence data (Table 2). The stLFR-1 library with 336 Gb of total read coverage (44X unique genome coverage) achieved the highest phasing performance with an N50 of 15.1 Mb. N50 length appeared to be mostly affected by length and coverage of long genomic fragments. This can be seen in the decreased N50 of stLFR-2 as the DNA used for this sample was slightly older and more fragmented than the material used for stLFR-1 (Table 2, average fragment length of 52.5 kb versus 62.2 kb) and the ~10-fold shorter N50 of the 10 ng libraries (stLFR-3 and 4). Comparison to GIAB data showed that short and long switch error rates were low and comparable to previous studies^11,15,25. stLFR performance was very similar to the Chromium library. As the Zhang et al. bead haplotyping method did not have read data available we could only compare our results to the results from their phasing algorithm written and optimized specifically for their data. This demonstrated that stLFR-1 and stLFR-2 libraries had a longer N50, a similar short switch error rate, but a higher long switch error rate. stLFR-3 and stLFR-4, which used more DNA, had an N50 similar to the Zhang et al. However, direct comparison is difficult due to differences in DNA input and coverage.
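Since N50 is the main contiguity metric used for the phased blocks above, a short reference implementation may help. This sketch uses the common definition (the block length at which the cumulative sum of lengths, taken from longest to shortest, first reaches half of the total); whether the phasing software computes it over block spans or phased bases is not specified here.

```python
def n50(lengths):
    """Return the length L such that blocks of length >= L hold at least half the total."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("lengths must be non-empty")
```

For example, `n50([2, 3, 4, 5, 6, 7, 8, 9, 10])` returns 8, because the three longest blocks (10 + 9 + 8 = 27) already reach half of the total of 54.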

It should be noted that this phasing result was achieved using a program that was not written for stLFR data. In order to see if this result could be improved we developed a phasing program, LongHap, and optimized it specifically for stLFR data. Using GIAB variants LongHap was able to phase over 99% of SNPs into contigs with an N50 of 18.1 Mb (Table 2). Importantly, these increased contigs lengths were achieved while decreasing the short and long switch errors (Table 2). LongHap is also able to phase indels. Applying LongHap to stLFR-1 using GIAB SNPs and indels results in a 23.4 Mb N50, but also results in increased switch error rates (Table S2).

Structural variation detection

Previous studies have shown that long fragment information can improve the detection of structural variations (SVs) and described large deletions (4-155 kb) in NA12878^11,15. To demonstrate the power of stLFR to detect SVs we examined barcode overlap data, as previously described¹⁵, for stLFR-1 and stLFR-4 libraries in these regions. In every case the deletion was observed in the stLFR-1 data, even at lower coverage (Figure 2a and Figure S6). Closer examination of the co-barcoded sequence reads covering a ~150 kb deletion in chromosome 8 demonstrated that the deletion was heterozygous and found in a single haplotype (Figure 2b-c). The 10 ng stLFR-4 library also detected most of the deletions, but the three smallest were difficult to identify due to the lower coverage per fragment (and thus less barcode overlap) of this library.

Figure 2. SV detection.

(A) Previously reported deletions in NA12878 were also found using stLFR data. Heat maps of barcode sharing for each deletion can be found in Figure S8. (B) A heat map of barcode sharing within windows of 2 kb for a region with a ~150 kb heterozygous deletion on chromosome 8 was plotted using a Jaccard Index as previously described¹⁵. Regions of high overlap are depicted in dark red. Those with no overlap are shown in beige. Arrows demonstrate how regions that are spatially distant from each other on chromosome 8 have increased overlap, marking the locations of the deletion. (C) Co-barcoded reads are separated by haplotype and plotted by unique barcode on the y axis and chromosome 8 position on the x axis. The heterozygous deletion is found in a single haplotype. (D) Heat maps were also plotted for overlapping barcodes between chromosomes 5 and 12 for a patient cell line with a known translocation²⁶ and (E) GM20759, a cell line with a known inversion in chromosome 2²⁷.

To evaluate stLFR performance for detecting other types of SVs we made libraries from a cell line from a patient with a known translocation between chromosomes 5 and 12²⁶ and GM20759, a cell line with a known inversion on chromosome 2²⁷. stLFR libraries were able to identify the inversion and the translocation in the respective cell lines (Figure 2d-e). Downsampling the number of reads per library showed that a strong signal of the translocation was detected even with as little as 5 Gb of read data (~1.7X total coverage, Figure S7a-h). Finally, examination of both SVs in the stLFR-1 library resulted in no obvious pattern (Figure S7i-l), suggesting the false positive rate for detection of these types of SVs is low.
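The barcode-sharing signal used for SV detection can be sketched as a Jaccard index between the barcode sets of genomic windows. The code below is an illustration under simple assumptions (reads given as (barcode, position) pairs on one chromosome, fixed 2 kb windows); it is not the analysis code used to produce Figure 2.

```python
from collections import defaultdict


def window_barcodes(reads, window=2000):
    """Map window index -> set of barcodes with at least one read in that window.

    reads: iterable of (barcode, pos) pairs on a single chromosome.
    """
    sets = defaultdict(set)
    for barcode, pos in reads:
        sets[pos // window].add(barcode)
    return dict(sets)


def jaccard(a, b):
    """Jaccard index |a & b| / |a | b| of two barcode sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

Windows that are far apart in the reference normally share almost no barcodes, so an unexpectedly high index between distant windows is the kind of off-diagonal signal that marks a deletion, inversion, or translocation breakpoint in the heat maps.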

Scaffolding contigs with stLFR

stLFR is a powerful method because it uses ~1.8 billion unique barcodes and enables co-barcoding that is specific to each individual long genomic DNA molecule. This type of data should be beneficial for de novo genome assembly and improved scaffolding. To demonstrate how stLFR can be used to improve genome assemblies we used reads from stLFR-1 and stLFR-4 libraries and SALSA²⁸, a program designed for chromatin conformation capture (Hi-C) data, to scaffold Single Molecule Real-Time (SMRT) read assemblies of NA12878²⁹. SALSA was not designed for stLFR data, making it necessary to alter the stLFR data to a structure similar to Hi-C. This was achieved by selecting pairs of reads sharing the same barcode and located towards the ends of the captured long DNA molecule. These were then labeled as read pairs for the SALSA program. Substituting stLFR data for Hi-C data resulted in excellent scaffolding. Using only 60 million stLFR reads enabled the linkage of 1,411 contigs into 597 scaffolds with an N50 of 44.7 Mb. These scaffolds covered 2.84 Gb of the genome. These metrics compared very favorably to those generated in the SALSA manuscript using the same contigs and 10-fold more (734 million) Hi-C read pairs generated from human embryonic stem cells³⁰ (Table 3). The quality of stLFR scaffolds was further analyzed by aligning them to build 37 of the human reference genome and comparing them with the program dnadiff³¹. In general, stLFR scaffolds agreed closely with the reference genome and the number of breakpoints, translocations, relocations, and inversions was similar to those of the scaffolds generated with Hi-C reads (Table 3). Alignment dot plots further demonstrate the high degree of continuity between stLFR scaffolds and the reference genome (Figure S8).
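The conversion of co-barcoded reads into Hi-C-like pairs described above can be sketched as follows. This is a simplification, not the procedure used in the study: it assumes each barcode tags a single long molecule, takes reads as hypothetical (barcode, contig, pos, read_id) tuples, and pairs the two outermost reads after sorting by contig and position.

```python
from collections import defaultdict


def pseudo_hic_pairs(reads):
    """Pair the two outermost reads of each barcode.

    reads: iterable of (barcode, contig, pos, read_id) tuples.
    Returns a list of (read_id_a, read_id_b) pairs, one per barcode
    that has at least two reads.
    """
    by_barcode = defaultdict(list)
    for barcode, contig, pos, read_id in reads:
        by_barcode[barcode].append((contig, pos, read_id))

    pairs = []
    for barcode in sorted(by_barcode):
        hits = sorted(by_barcode[barcode])
        if len(hits) < 2:
            continue  # a single read carries no linkage information
        first, last = hits[0], hits[-1]
        pairs.append((first[2], last[2]))
    return pairs
```

When all reads of a barcode fall on one contig the pair spans the ends of the captured molecule; when they fall on two contigs the pair links them, which is the information a scaffolder needs.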

Table 3. Scaffolding statistics

Discussion

Here we describe an efficient whole genome sequencing library preparation technology, stLFR, that enables the co-barcoding of sub-fragments of long genomic DNA molecules with a single unique clonal barcode in a single tube process. Using microbeads as miniaturized virtual compartments allows a practically unlimited number of clonal barcodes to be used per sample at a negligible cost. Our optimized hybridization-based capture of transposon inserted DNA on beads, combined with 3’-branch ligation and exonuclease degradation of the extreme excess of capture adapters, successfully barcodes up to ~20% of sub-fragments in DNA molecules as long as 300 kb in length. Importantly, this is achieved without DNA amplification of initial long DNA fragments and the representation bias that comes with it. In this way, stLFR solves the cost and limited co-barcoding capacity of emulsion-based methods.

The quality of variant calls using stLFR is very high and possibly, with further optimization, will approach that of standard WGS methods, but with the added benefit that co-barcoding enables advanced informatics applications. We demonstrate high quality, near complete phasing of the genome into long contigs with extremely low error rates, detection of SVs, and scaffolding of contigs to enable de novo assembly applications. All of this is achieved from a single library that does not require special equipment nor add significantly to the cost of library preparation.

As a result of efficient barcoding, we successfully used as little as 1 ng of human DNA (600 X genome coverage) to make stLFR libraries and achieved high quality WGS with most sub-fragments uniquely co-barcoded. Less DNA can be used, but stLFR does not use DNA amplification during co-barcoding and thus does not create overlapping subfragments from each individual long DNA molecule. For this reason overall genomic coverage suffers as the amount of DNA is lowered. In addition, a sampling problem is created as stLFR currently retains 10-20% of each original long DNA molecule followed by PCR amplification. This results in a relatively high duplication rate of reads and results in added sequencing cost, but improvements are possible. One potential solution is to remove the PCR step. This would eliminate sampling, but also it could substantially reduce the false positive and false negative error rates. In addition, improvements such as optimizing the distance of insertion between transposons and increasing the length of sequencing reads to paired-end 200 bases should be easy to enable and will increase the coverage and overall quality. For some applications, such as structural variation detection, using less DNA and less coverage may be desirable. As we demonstrate in this paper, as little as 5 Gb of sequence coverage can faithfully detect inter and intrachromosomal translocations and in these cases the duplication rate is negligible. Indeed, stLFR may represent a simple and cost-effective replacement for long mate pair libraries in a clinical setting.

In addition, we believe this type of data can enable full diploid phased de novo assembly from a single stLFR library without the need for long physical reads such as those generated by SMRT or nanopore technologies. One interesting feature of transposon insertion is that it creates a 9 base sequence overlap between adjacent subfragments. Frequently, these neighboring sub-fragments are captured and sequenced enabling reads to be synthetically doubled in length (e.g., for 200 base reads, two neighboring captured sub-fragments would create two 200 base reads with a 9 base overlap, or 391 bases). stLFR does not require special equipment like droplet based microfluidic methods and the cost per sample is minimal. In this paper we demonstrated using 50 million beads but using more is possible. This will enable many types of cost-effective analyses where 100s of millions of barcodes would be useful. We envision this type of cheap massive barcoding can be useful for RNA analyses such as full-length mRNA sequencing from 1,000s of cells by combination with single cell technologies or deep population sequencing of 16S RNA in microbial samples. Phased chromatin mapping by the Assay for Transposase-Accessible Chromatin (ATAC-seq)³² or methylation studies are all also possible with stLFR. Finally, in an effort to share what we believe to be a very important technology, we have made a detailed protocol freely available for academic use (see Supplementary Materials).
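The read-doubling arithmetic above (two 200 base reads sharing a 9 base overlap give 391 bases) can be illustrated with a minimal merge function. This sketch assumes both reads are on the same strand and that the overlap is an exact match; real data would need orientation handling and mismatch tolerance.

```python
def merge_adjacent(left, right, overlap=9):
    """Join two reads whose ends share an exact `overlap`-base sequence.

    Returns the merged sequence, or None if the ends do not match.
    """
    if len(left) < overlap or len(right) < overlap:
        raise ValueError("reads must be at least as long as the overlap")
    if left[-overlap:] != right[:overlap]:
        return None
    # Keep the shared bases once: total length = len(left) + len(right) - overlap.
    return left + right[overlap:]
```

The merged length is the sum of the two read lengths minus the overlap, so 200 + 200 - 9 = 391 bases for the example in the text.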

Author contributions

R.D. and B.A.P. conceived the study. O.W., R.C., X.C., M.K.W., H.K.L., D.C., L.W., F.F., Y.Z., S.D., D.N., A.A., X.X., R.D., and B.A.P. developed the molecular biology process of stLFR. R.Y.Z., S.D., S.G., N.B., and A.C. performed the sequencing. Q.M., J.T., Y.S., Y.Z., E.A., Y.X., C.V., S.N., W.T., J.W., X.L., X.Q., H.W., Y.D., and Z.L. developed algorithms for and performed analyses on stLFR data. O.W., C.X., J.S.L., W.Z., H.Y., J.W., K.K., X.X., R.D., and B.A.P. coordinated the study. O.W., R.D., and B.A.P. wrote the manuscript. All authors reviewed and edited the manuscript.

Competing interests

Employees of BGI and Complete Genomics have stock holdings in BGI.

Data and materials availability

All sequencing data reported in this paper have been deposited in the database of the European Nucleotide Archive under accession number@@@.

Acknowledgments

We would like to acknowledge the ongoing contributions and support of all Complete Genomics and BGI-Shenzhen employees, in particular the many highly skilled individuals that work in the libraries, reagents, and sequencing groups that make it possible to generate high quality whole genome data. We would also like to thank Z. Dong, Z. Yang, and W. Xie for providing cell lines for the translocation analysis. This work was supported in part by the Shenzhen Peacock Plan (NO.KQTD20150330171505310) and the National Key Research and Development Program of China (NO.2017YFC0906501). B.A.P. is a recipient of and this work was partially supported by the Research Fund for International Young Scientists, National Natural Science Foundation of China (31550110216).

References

↵
Zhang, K. et al. Long-range polony haplotyping of individual human chromosome molecules. Nat Genet 38, 382–387 (2006).
OpenUrl CrossRef PubMed Web of Science
Ma, L. et al. Direct determination of molecular haplotypes by chromosome microdissection. Nat Methods 7, 299–301 (2010).
OpenUrl CrossRef PubMed Web of Science
Kitzman, J. O. et al. Haplotype-resolved genome sequencing of a Gujarati Indian individual. Nat Biotechnol 29, 59–63 (2011).
OpenUrl CrossRef PubMed Web of Science
Suk, E. K. et al. A comprehensively molecular haplotype-resolved genome of a European individual. Genome Res 21, 1672–1685 (2011).
OpenUrl Abstract/FREE Full Text
Fan, H. C., Wang, J., Potanina, A. & Quake, S. R. Whole-genome molecular haplotyping of single cells. Nat Biotechnol 29, 51–57 (2011).
OpenUrl CrossRef PubMed Web of Science
↵
Peters, B. A. et al. Accurate whole-genome sequencing and haplotyping from 10 to 20 human cells. Nature 487, 190–195 (2012).
OpenUrl CrossRef PubMed Web of Science
Duitama, J. et al. Fosmid-based whole genome haplotyping of a HapMap trio child: evaluation of Single Individual Haplotyping techniques. Nucleic Acids Res 40, 2041–2053 (2012).
OpenUrl CrossRef PubMed Web of Science
Selvaraj, S., J, R. D., Bansal, V. & Ren, B. Whole-genome haplotype reconstruction using proximity-ligation and shotgun sequencing. Nat Biotechnol 31, 1111–1118 (2013).
OpenUrl CrossRef PubMed
Kuleshov, V. et al. Whole-genome haplotyping using long reads and statistical methods. Nat Biotechnol 32, 261–266 (2014).
OpenUrl CrossRef PubMed
↵
Amini, S. et al. Haplotype-resolved whole-genome sequencing by contiguity-preserving transposition and combinatorial indexing. Nat Genet 46, 1343–1349 (2014).
Zheng, G. X. et al. Haplotyping germline and cancer genomes with high-throughput linked-read sequencing. Nat Biotechnol (2016).
Peters, B. A., Liu, J. & Drmanac, R. Co-barcoded sequence reads from long DNA fragments: a cost-effective solution for “perfect genome” sequencing. Front Genet 5, 466 (2014).
Drmanac, R. Nucleic Acid Analysis by Random Mixtures of Non-Overlapping Fragments. WO 2006/138284 A2 (2006).
Drmanac, R., Peters, B. A. & Alexeev, A. Multiple tagging of individual DNA fragments. WO 2014/145820 A2 (2013).
Zhang, F. et al. Haplotype phasing of whole human genomes using bead-based barcode partitioning in a single tube. Nat Biotechnol 35, 852–857 (2017).
Jo, K., Chen, Y. L., de Pablo, J. J. & Schwartz, D. C. Elongation and migration of single DNA molecules in microchannels using oscillatory shear flows. Lab Chip 9, 2348–2355 (2009).
Drmanac, R. et al. Human genome sequencing using unchained base reads on self-assembling DNA nanoarrays. Science 327, 78–81 (2010).
Fehlmann, T. et al. cPAS-based sequencing on the BGISEQ-500 to explore small non-coding RNAs. Clin Epigenetics 8, 123 (2016).
Huang, J. et al. A reference human genome dataset of the BGISEQ-500 sequencer. Gigascience 6, 1–9 (2017).
Mak, S. S. T. et al. Comparative performance of the BGISEQ-500 vs Illumina HiSeq2500 sequencing platforms for palaeogenomic sequencing. Gigascience 6, 1–13 (2017).
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 20, 1297–1303 (2010).
Zook, J. M. et al. Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls. Nat Biotechnol 32, 246–251 (2014).
Edge, P., Bafna, V. & Bansal, V. HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies. Genome Res 27, 801–812 (2017).
Mao, Q. et al. The whole genome sequences and experimentally phased haplotypes of over 100 personal genomes. Gigascience 5, 1–9 (2016).
Dong, Z. et al. Low-pass whole-genome sequencing in clinical cytogenetics: a validated approach. Genet Med 18, 940–948 (2016).
Dong, Z. et al. Identification of balanced chromosomal rearrangements previously unknown among participants in the 1000 Genomes Project: implications for interpretation of structural variation in genomes and the future of clinical cytogenetics. Genet Med (2017).
Ghurye, J., Pop, M., Koren, S., Bickhart, D. & Chin, C. S. Scaffolding of long read assemblies using long range contact information. BMC Genomics 18, 527 (2017).
Pendleton, M. et al. Assembly and diploid architecture of an individual human genome via single-molecule technologies. Nat Methods 12, 780–786 (2015).
Dixon, J. R. et al. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature 485, 376–380 (2012).
Phillippy, A. M., Schatz, M. C. & Pop, M. Genome assembly forensics: finding the elusive mis-assembly. Genome Biol 9, R55 (2008).
Buenrostro, J. D., Giresi, P. G., Zaba, L. C., Chang, H. Y. & Greenleaf, W. J. Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nat Methods 10, 1213–1218 (2013).