De Novo Repeat Classification and Fragment Assembly

  1. Paul A. Pevzner1,
  2. Haixu Tang1, and
  3. Glenn Tesler2,3
  1. 1 Department of Computer Science and Engineering, University of California, San Diego,La Jolla, California 92093, USA
  2. 2 Department of Mathematics, University of California, San Diego,La Jolla, California 92093, USA

Abstract

Repetitive sequences make up a significant fraction of almost any genome, and an important and still open question in bioinformatics is how to represent all repeats in DNA sequences. We propose a new approach to repeat classification that represents all repeats in a genome as a mosaic of sub-repeats. Our key algorithmic idea also leads to new approaches to multiple alignment and fragment assembly. In particular, we show that our FragmentGluer assembler improves on Phrap and ARACHNE in assembly of BACs and bacterial genomes.

Footnotes

  • [The following individuals kindly provided reagents, samples, or unpublished information as indicated in the paper: D. Kisman, M. Li, B. Ma, and to S.-P. Yang.]

  • Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.2395204.

  • 4 The concept of an “elementary repeat,” although important, was never rigorously defined in the recent papers on repeat analysis (Volfovsky et al. 2001; Bailey et al. 2002; Bao and Eddy 2002). Although it is well defined for perfectly conserved repeats (e.g., as maximal simple paths of multiplicity >1 in the de Bruijn graph), the imperfectly conserved repeats defy simple definitions. “Elementary repeats” are defined in this paper as maximal simple paths of multiplicity >1 in the A-Bruijn graph. This definition is not perfect either, but we believe that it fits the spirit of Bailey et al. (2002). However, in many biological applications, this concept needs to be adjusted (e.g., by introducing appropriate thresholds that remove low-multiplicity edges from A-Bruijn graphs) to account for high-multiplicity mobile elements, fractured repeats, and other biological complications.

  • 5 We are not criticizing other repeat classification programs here: They were designed with the primary goal of characterization of high-multiplicity mobile elements rather than the explicit representation of mosaic structure of sub-repeats. The goal of this example is to illustrate that the problem of mosaic repeat representation raised by Bailey et al. (2002) is not adequately addressed by the existing software tools. We also remark that our approach reveals mobile elements as well (as high-multiplicity edges/paths in A-Bruijn graphs).

  • 6 Pevzner et al. (2001) introduced an error-correction procedure that mimics de Bruijn graphs for nearly identical repeats (repeats that are 98%-99% similar), a small step toward a generalization of the original de Bruijn approach. However, extending this approach beyond nearly identical repeats remains an open problem. Our construction of repeat graphs (below) is very different and more general.

  • 7 Although the A-Bruijn graphs may be very complicated for highly repetitive genomes, they are often broken into simple paths if one removes all low-multiplicity edges. These paths (formed by high-multiplicity edges) typically correspond to mobile elements studied in Bao and Eddy (2002).

  • 8 For example, inconsistent pairwise alignments in Figure 6B can be made consistent by moving the gap in the alignment of at and acat from positions 1-2 to positions 2-3.

  • 9 Short tandem repeats deleted during whirl elimination can be easily added to the repeat graphs.

  • 10 RepeatGluer classifies repeat families in a single DNA strand. To account for both strands and for inverted repeats, one has to concatenate both DNA strands into a single sequence S.

  • 11 We emphasize that highly repetitive sequences correspond to a similarity matrix with a large number of 1s, which may be quadratic in the length of the sequence. Although it does not present a problem for genomes under study, it may present a problem for longer genomes. A possible approach to reducing the running time in this case is to limit the number of 1s for each column/row. Such filtering of 1s in the similarity matrix has to be done with caution to ensure that all copies of a mobile element are “glued together” with high probability (e.g., all copies of a perfect m-copy repeat can be glued with m-1 1s instead of m2 1s). Also, blocks of adjacent 1s are represented as position and length. Another approach (that is equivalent to the “repeat masking” procedure used in many WGS assemblers) is to simply glue together all predefined high-multiplicity repeats at the preprocessing stage.

  • 12 If one removes low-multiplicity edges from the graph in Figure 4C, the resulting path on four bold edges will correspond to the consensus Alu repeat. Other transposable elements can be extracted from the A-Bruijn graphs by applying different thresholds depending on the multiplicity of the transposable element (applications of RepeatGluer for finding transposable elements will be described elsewhere).

  • 13 Multiple alignment is not a focus of this paper, and the goal of this section is simply to introduce the construction that is used in the follow-up fragment assembly section (we view fragment assembly as multiple alignment of reads). The applications of RepeatGluer for multiple alignment of proteins with shuffled domains will be described elsewhere.

  • 14 The informed reader may notice parallels between our A-Bruijn graph approach to fragment assembly and two earlier approaches pioneered by Idury and Waterman (1995) and Myers (1995). Although the earlier algorithms look very different, both implicitly tried to develop the idea of the repeat graph. Myers tried doing this by collapsing the overlap graph at the level of read (∼500 bp) resolution, whereas Idury and Waterman tried simplifying the de Bruijn graph at the level of l-mer resolution. Our key contribution is the A-Bruijn graph construction that deals with fragment assembly at the level of single-nucleotide resolution. This increased granularity alleviates challenging algorithmic problems that we faced trying to design an efficient assembler for low-coverage regions and low-quality reads/read ends.

  • 15 Low-quality reads can be defined as reads with fewer than 100 positions with Phred quality values above 15.

  • 16 It may sound like a simple integration problem, but it turned out to be a very difficult task that required development of a new idea for fragment assembly (our FragmentGluer algorithm, implemented in EULER+). The difficulty is that it is unclear how the algorithmic ideas used in the recent WGS assemblers can be adjusted for working with low-coverage and low-quality sequencing data (like read ends).

  • 3 Corresponding author. E-MAIL gptesler{at}math.ucsd.edu; FAX (858) 534-7029.

    • Accepted June 29, 2004.
    • Received January 27, 2004.
| Table of Contents

Preprint Server