Linear Assembly of a Human Y Centromere using Nanopore Long Reads

Miten Jain; Hugh E. Olsen; Daniel J. Turner; David Stoddart; Kira V. Bulazel; Benedict Paten; David Haussler; Huntington F. Willard; Mark Akeson; Karen H. Miga

doi:10.1101/170373

Abstract

The human genome reference sequence remains incomplete due to the challenge of assembling long tracts of near-identical tandem repeats, or satellite DNAs, that are highly enriched in centromeric regions^1. Efforts to resolve these regions capitalize on a small number of sparsely arranged sequence variants that offer unique markers to break the repeat monotony and ensure proper overlap-layout-consensus assembly^2–4. Identifying and spanning sequence variants that may be spaced hundreds of kilobases apart within a given array requires long and highly accurate sequence reads. Achieving this requires an advance over standard single-molecule sequencing, which to date has been error-prone and has offered low throughput of sufficiently long reads (100 kb+)^5,6. Here we present a strategy that generates long reads capable of spanning the complete sequence insert of bacterial artificial chromosomes (BACs) that are hundreds of kilobases in length (~100–300 kb). We demonstrate that these reads are sufficient to resolve the linear ordering of repeats within a single satellite array on the Y chromosome, allowing the first complete sequence characterization of a human centromere.

The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.