Estimating the Repeat Structure and Length of DNA Sequences Using ℓ-Tuples

  Xiaoman Li (1,3,4) and Michael S. Waterman (1,2)

  1 Department of Mathematics, University of Southern California, Los Angeles, California 90089, USA
  2 Celera Genomics, Rockville, Maryland 20850, USA

Abstract

In shotgun sequencing projects, the genome or BAC length is not always known. We approach estimating genome length by first estimating the repeat structure of the genome or BAC, sometimes of interest in its own right, on the basis of a set of random reads from a genome project. Moreover, we can find the consensus for repeat families before assembly. Our methods are based on the ℓ-tuple content of the reads.
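To make "ℓ-tuple content" concrete, the following is a minimal sketch of tallying every overlapping length-ℓ substring across a set of reads. It is an illustration only, not the authors' implementation: the function name `count_tuples`, the toy reads, and the choice ℓ = 4 are all assumptions introduced here, and the sketch stops at counting rather than estimating repeat structure or length.

```python
from collections import Counter


def count_tuples(reads, ell):
    """Count every overlapping length-ell substring (l-tuple) over all reads.

    Hypothetical helper for illustration; not taken from the paper.
    """
    counts = Counter()
    for read in reads:
        # A read of length n contributes n - ell + 1 tuples (none if n < ell).
        for i in range(len(read) - ell + 1):
            counts[read[i:i + ell]] += 1
    return counts


# Toy reads, invented for this example.
reads = ["ACGTACGT", "CGTACGTA", "TTTTACGT"]
counts = count_tuples(reads, 4)

# List tuples from most to least frequent.
for tup, n in counts.most_common():
    print(tup, n)
```

In a table like this, tuples that occur much more often than the rest are natural candidates for closer inspection, although how such counts translate into repeat families or a length estimate depends on the statistical model developed in the paper itself.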

Footnotes

  • [Supplemental material available online at www.genome.org.]

  • 5 The left ends of all reads form a homogeneous Poisson process with parameter c/L (Lander et al. 1988).

  • 6 GroupNum is the maximal number of groups we used.

  • Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.1251803.

  • 3 Present address: Department of Statistics, Harvard University, Cambridge, MA 02138, USA.

  • 4 Corresponding author. E-MAIL xiaomanl{at}yahoo.com; FAX (617) 496-8057.

    • Accepted June 4, 2003.
    • Received February 7, 2003.