Dissecting a hidden gene duplication: the Arabidopsis thaliana SEC10 locus

PLoS One. 2014 Apr 11;9(4):e94077. doi: 10.1371/journal.pone.0094077. eCollection 2014.

Abstract

Repetitive sequences present a challenge for genome sequence assembly, and highly similar segmental duplications may disappear from assembled genome sequences. Having found a surprising lack of observable phenotypic deviations and non-Mendelian segregation in Arabidopsis thaliana mutants in SEC10, a gene encoding a core subunit of the exocyst tethering complex, we examined whether this could be explained by a hidden gene duplication. Re-sequencing and manual assembly of the Arabidopsis thaliana SEC10 (At5g12370) locus revealed that this locus, comprising a single gene in the reference genome assembly, indeed contains two paralogous genes in tandem, SEC10a and SEC10b, and that a sequence segment of 7 kb in length is missing from the reference genome sequence. Differences between the two paralogs are concentrated in non-coding regions, while the predicted protein sequences exhibit 99% identity, differing only by substitution of five amino acid residues and an indel of four residues. Both SEC10 genes are expressed, although varying transcript levels suggest differential regulation. Homozygous T-DNA insertion mutants in either paralog exhibit a wild-type phenotype, consistent with proposed extensive functional redundancy of the two genes. By these observations we demonstrate that recently duplicated genes may remain hidden even in well-characterized genomes, such as that of A. thaliana. Moreover, we show that the use of the existing A. thaliana reference genome sequence as a guide for sequence assembly of new Arabidopsis accessions or related species has at least in some cases led to error propagation.
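The abstract reports 99% identity between the predicted SEC10a and SEC10b proteins, with five substitutions and a four-residue indel, but does not say how identity was calculated. The sketch below shows one common convention, assumed here: identical columns divided by all aligned columns, with gap columns counted. The function name `alignment_stats` and the toy sequences are hypothetical and are not the SEC10 sequences.

```python
def alignment_stats(aligned_a: str, aligned_b: str, gap: str = "-") -> dict:
    """Count matches, substitutions and gap columns in a pairwise alignment.

    Both inputs must already be aligned (equal length, gaps marked with `gap`).
    Identity is reported as matches / all informative columns, which is only
    one of several conventions in use.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    matches = substitutions = gap_columns = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap and y == gap:
            continue  # a column empty in both rows carries no information
        if x == gap or y == gap:
            gap_columns += 1  # residue present in only one sequence (indel)
        elif x == y:
            matches += 1
        else:
            substitutions += 1

    columns = matches + substitutions + gap_columns
    identity = 100.0 * matches / columns if columns else 0.0
    return {
        "matches": matches,
        "substitutions": substitutions,
        "gap_columns": gap_columns,
        "identity_percent": identity,
    }


# Toy alignment (hypothetical): one substitution and a two-residue indel.
toy_a = "MKTLLVAGSDE"
toy_b = "MKSLL--GSDE"
stats = alignment_stats(toy_a, toy_b)
print(stats)
```

Excluding gap columns from the denominator would give a higher percentage for the same alignment, so the convention should be stated whenever an identity value is reported.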

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Arabidopsis / genetics*
  • Arabidopsis Proteins / genetics*
  • DNA, Bacterial / genetics
  • Gene Duplication / genetics*
  • Mutagenesis, Insertional / genetics

Substances

  • Arabidopsis Proteins
  • DNA, Bacterial
  • T-DNA

Grants and funding

This work was supported by the Czech Science Foundation (projects GPP501/11/P853 and P305/11/1629), the Structural Funds of the European Union (project CZ.1.05/2.1.00/03.0M.100 IET), the Ministry of Education, Youth and Sports of the Czech Republic (grants KONTAKT ME10033 and MSM0021620858) and the project EU-ITN-PLANTORIGINS 238 640. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.