Estimating the Repeat Structure and Length of DNA Sequences Using ℓ-Tuples
Abstract
In shotgun sequencing projects, the length of the genome or BAC is not always known. We estimate this length by first estimating the repeat structure of the genome or BAC, which is sometimes of interest in its own right, from a set of random reads from a genome project. Moreover, we can find the consensus sequence for repeat families before assembly. Our methods are based on the ℓ-tuple content of the reads.
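To make "ℓ-tuple content" concrete, the following is a minimal sketch of how the ℓ-tuples in a set of reads might be tallied. It is not the estimator developed in the paper: the function names `tuple_counts` and `multiplicity_histogram`, the toy reads, and the choice ℓ = 4 are all hypothetical, and reverse complements and sequencing errors are ignored. The idea it illustrates is only that ℓ-tuples from repeated regions tend to occur more often across reads than those from unique regions.

```python
from collections import Counter


def tuple_counts(reads, ell):
    """Count every length-ell substring (l-tuple) across all reads."""
    counts = Counter()
    for read in reads:
        # A read of length n contains n - ell + 1 overlapping l-tuples.
        for i in range(len(read) - ell + 1):
            counts[read[i:i + ell]] += 1
    return counts


def multiplicity_histogram(counts):
    """Map each occurrence count to the number of distinct l-tuples seen that often."""
    return Counter(counts.values())


# Toy reads, made up purely for illustration (assumption, not data from the paper).
reads = ["ACGTACGT", "CGTACGTA", "TTTTACGT"]
counts = tuple_counts(reads, ell=4)
hist = multiplicity_histogram(counts)
```

In this toy input, `counts` records how often each 4-tuple occurs across the three reads, and `hist` summarizes how many distinct 4-tuples share each occurrence count. A histogram of this kind is one plausible starting point for separating unique sequence from repeats, under the assumptions stated above.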
Footnotes
- [Supplemental material available online at www.genome.org.]
- 5 The left ends of all reads form a homogeneous Poisson process with parameter c/L (Lander et al. 1988).
- 6 GroupNum is the maximal number of groups we used.
- Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.1251803.
- 3 Present address: Department of Statistics, Harvard University, Cambridge, MA 02138, USA.
- 4 Corresponding author. E-MAIL xiaomanl{at}yahoo.com; FAX (617) 496-8057.
- Accepted June 4, 2003.
- Received February 7, 2003.
- Cold Spring Harbor Laboratory Press