Combinatorial algorithms for structural variation detection in high-throughput sequenced genomes

  Fereydoun Hormozdiari (1,4), Can Alkan (2,3,4), Evan E. Eichler (2,3,5), and S. Cenk Sahinalp (1,5)

  1 School of Computing Science, Simon Fraser University, Burnaby, British Columbia, Canada V5A 1S6;
  2 Department of Genome Sciences, University of Washington, Seattle, Washington 98195, USA;
  3 Howard Hughes Medical Institute, Seattle, Washington 98195, USA
  4 These authors contributed equally.

    Abstract

    Recent studies show that along with single nucleotide polymorphisms and small indels, larger structural variants among human individuals are common. The Human Genome Structural Variation Project aims to identify and classify deletions, insertions, and inversions (>5 Kbp) in a small number of normal individuals with a fosmid-based paired-end sequencing approach using traditional sequencing technologies. The realization of new ultra-high-throughput sequencing platforms now makes it feasible to detect the full spectrum of genomic variation among many individual genomes, including cancer patients and others suffering from diseases of genomic origin. Unfortunately, existing algorithms for identifying structural variation (SV) among individuals have not been designed to handle the short read lengths and the errors implied by the “next-gen” sequencing (NGS) technologies. In this paper, we give combinatorial formulations for the SV detection between a reference genome sequence and a next-gen-based, paired-end, whole genome shotgun-sequenced individual. We describe efficient algorithms for each of the formulations we give, which all turn out to be fast and quite reliable; they are also applicable to all next-gen sequencing methods (Illumina, 454 Life Sciences [Roche], ABI SOLiD, etc.) and traditional capillary sequencing technology. We apply our algorithms to identify SV among individual genomes very recently sequenced by Illumina technology.

    Footnotes

    • 5 Corresponding authors.

      E-mail eee{at}gs.washington.edu; fax (206) 221-5795.

      E-mail cenk{at}cs.sfu.ca; fax (604) 291-4277.

    • 6 Matepairs refer to the two ends of a paired-end read.

    • 7 This is an arbitrary cutoff. Using a higher cutoff value makes the problem easier; however, we might miss some real structural variants.

    • 8 Referring to an insertion, deletion, and inversion, respectively. Note that our methods can be generalized to detect everted pairs and translocation events without much difficulty.

    • 9 Note that there are no range rules for transchromosomal events. Furthermore, although we do not focus on everted paired-end reads or the transchromosomal mappings in this study, our algorithms can be generalized to capture both tandem repeat events and transchromosomal events.

    • 10 Note that minimizing the number of SVs also implies that the average number of paired-end reads supporting an SV is maximized; the two goals are equivalent.

    • 11 The reader can easily verify this in the case that the probability of an SV is a linear function of the number of mappings supporting it.

    • 12 By insert size error we mean an error, introduced by the sequencing platform, in the distance between the two ends of a read pair. For certain platforms, such as Illumina, the probability of such an error is almost nil.

    • 13 Perhaps 1 − f, the probability of a potential read length error, should decrease exponentially with k and linearly with Len.

    • 14 In general, the mappings of different paired-end reads are dependent; however, to be able to approximate the values [formula not recovered], we assume independence between the mappings of different paired ends.

    • 15 ftp://ftp.ncbi.nih.gov/pub/TraceDB/ShortRead/SRA000271/.

    • 16 The reader should also note that our algorithm is compatible with any sequence mapping tool that can return multiple map locations, such as Mosaik (Hillier et al. 2008) or SHRiMP (Yanovsky et al. 2008).

    • 17 For example, in the Illumina platform (short insert library), the upstream end sequence is expected to map to the + strand, whereas the downstream end sequence is expected to map to the − strand; see the Supplemental material.

    • 18 Note that it is possible to use concordant clones for heterozygosity studies, etc.

    • [Supplemental material is available online at www.genome.org. The source code of the algorithm implementations and predicted structural variants are available at http://compbio.cs.sfu.ca/strvar.htm.]

    • Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.088633.108.

      • Received October 26, 2008.
      • Accepted April 10, 2009.
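
Footnotes 8, 9, 17, and 18 above refer to concordant mappings, to everted pairs, and to the insertion, deletion, and inversion signatures of a paired-end read. The following is a minimal sketch of how a single mapping might be labeled, not the authors' implementation. It assumes the commonly used paired-end mapping conventions: a +/− pair whose mapped span exceeds the expected insert-size range suggests a deletion, a span below the range suggests an insertion, and two ends on the same strand suggest an inversion. The names `PairedEndMapping`, `classify`, `min_span`, and `max_span` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairedEndMapping:
    """One candidate mapping of a paired-end read (hypothetical structure)."""
    left_pos: int      # reference coordinate of the leftmost mapped end
    right_pos: int     # reference coordinate of the rightmost mapped end
    left_strand: str   # '+' or '-'
    right_strand: str  # '+' or '-'


def classify(m: PairedEndMapping, min_span: int, max_span: int) -> str:
    """Label a same-chromosome mapping by orientation and mapped span.

    Assumes a short-insert library in which the upstream end maps to the
    + strand and the downstream end to the - strand (see footnote 17).
    min_span and max_span bound the expected insert size; they are
    assumed inputs, e.g. derived from the library's size distribution.
    """
    if m.left_pos > m.right_pos:
        raise ValueError("left_pos must not exceed right_pos")

    # Both ends on the same strand: one end lies in an inverted segment.
    if m.left_strand == m.right_strand:
        return "inversion"

    # -/+ orientation: everted pair (see footnotes 8 and 9).
    if m.left_strand == "-":
        return "everted"

    # +/- orientation: compare the mapped span with the expected range.
    span = m.right_pos - m.left_pos
    if span > max_span:
        return "deletion"    # ends map farther apart than the insert allows
    if span < min_span:
        return "insertion"   # ends map closer together than the insert allows
    return "concordant"
```

For instance, with an assumed expected range of 150 to 250 bp, `classify(PairedEndMapping(1000, 6000, "+", "-"), 150, 250)` returns `"deletion"`. A read with several candidate mappings would yield one label per mapping, which is the ambiguity the combinatorial formulations in the abstract are meant to resolve.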
