Abundance and length of simple repeats in vertebrate genomes are determined by their structural properties

  1. Albino Bacolla [1,5],
  2. Jacquelynn E. Larson [1],
  3. Jack R. Collins [2],
  4. Jian Li [3],
  5. Aleksandar Milosavljevic [3],
  6. Peter D. Stenson [4],
  7. David N. Cooper [4], and
  8. Robert D. Wells [1]

  [1] Institute of Biosciences and Technology, Center for Genome Research, Texas A&M University Health Science Center, Houston, Texas 77030, USA;
  [2] Advanced Biomedical Computing Center, Advanced Technology Program, SAIC-Frederick, Inc., NCI-Frederick, Frederick, Maryland 21702, USA;
  [3] Department of Molecular and Human Genetics and Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA;
  [4] Institute of Medical Genetics, School of Medicine, Cardiff University, Cardiff CF14 4XN, United Kingdom

Abstract

Microsatellites are abundant in vertebrate genomes, but their sequence representation and length distributions vary greatly within each family of repeats (e.g., tetranucleotides). Biophysical studies of 82 synthetic single-stranded oligonucleotides comprising all tetra- and trinucleotide repeats revealed an inverse correlation between the stability of folded-back hairpin and quadruplex structures and the sequence representation for repeats ≥30 bp in length in nine vertebrate genomes. In contrast, the predicted energies of base-stacking interactions correlated directly with the longest length distributions in vertebrate genomes. Genome-wide analyses indicated that unstable sequences, such as CAG:CTG and CCG:CGG, were over-represented in coding regions and that micro/minisatellites were recruited to genes involved in transcription and signaling pathways, particularly in the nervous system. Microsatellite instability (MSI) is a hallmark of cancer, and length polymorphism within genes can confer susceptibility to inherited disease. Sequences that manifest the highest MSI values also displayed the strongest base-stacking interactions; analyses of 62 tri- and tetranucleotide repeat-containing genes associated with human genetic disease revealed enrichments similar to those noted for micro/minisatellite-containing genes. We conclude that DNA structure and base-stacking determined the number and length distributions of microsatellite repeats in vertebrate genomes over evolutionary time and that micro/minisatellites have been recruited to participate in both gene and protein function.

Footnotes

  • 5 Corresponding author. E-mail abacolla{at}ibt.tamhsc.edu; fax (713) 677-7689.

  • [Supplemental material is available online at www.genome.org.]

  • Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.078303.108.

  • 6 Abbreviations and nomenclature: We designate a double-stranded genomic trinucleotide repeat (triNR) or tetranucleotide repeat (tetraNR) by its unique sequence (Tables 1, 2), with no specification as to the reading frame or strand composition. Accordingly, AGC includes all genomic tracts composed of AGC:GCT, GCA:TGC, CAG:CTG, GCT:AGC, TGC:GCA, and CTG:CAG duplex DNA, where the colon separates the complementary strands. In contrast, we specify single-stranded DNA oligonucleotides and their reading frame explicitly, for example, d(AGC)n, where the subscript n indicates the number of repeating units. Hydrogen-bonded nucleotides are also indicated by a colon, for example, A:T.

  • Received March 10, 2008.

  • Accepted July 31, 2008.

  • Freely available online through the Genome Research Open Access option.
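
The strand- and frame-independent naming described in footnote 6 can be illustrated with a short sketch. It groups a motif with all of its rotations (reading frames) and the rotations of its reverse complement (the opposite strand), then picks one representative. Choosing the alphabetically smallest form as the representative is an assumption made here for illustration; it happens to yield AGC for the CAG:CTG family, but the article's tables may follow a different convention. The function names are hypothetical.

```python
# Illustrative sketch only: function names and the "alphabetically smallest"
# tie-break are assumptions, not taken from the article.

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def rotations(motif):
    """Return every cyclic rotation (reading frame) of a repeat motif."""
    return [motif[i:] + motif[:i] for i in range(len(motif))]


def equivalent_forms(motif):
    """All single-strand spellings of the same duplex repeat, sorted."""
    motif = motif.upper()
    forms = set(rotations(motif)) | set(rotations(reverse_complement(motif)))
    return sorted(forms)


def canonical_motif(motif):
    """Pick one representative name for the duplex repeat (assumed rule:
    the alphabetically smallest equivalent form)."""
    return equivalent_forms(motif)[0]


if __name__ == "__main__":
    print(equivalent_forms("CAG"))
    print(canonical_motif("CTG"))
```

Under this rule, CAG, CTG, GCA, and the other four spellings listed in footnote 6 all map to the single name AGC, which is why a genome-wide count of "AGC" tracts covers both strands and all reading frames.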
