ABSTRACT
Long terminal repeat retrotransposons (LTR-RTs) are prevalent in plant genomes, and their identification is critical for achieving high-quality gene annotation. Multiple programs have been developed for de novo identification of LTR-RTs based on their well-conserved structure; however, these programs suffer from low specificity and a high false discovery rate (FDR). Here we report LTR_retriever, a multithreaded Perl program that identifies LTR-RTs and generates high-quality LTR libraries from genomic sequences. LTR_retriever demonstrated significant improvements over existing programs, achieving high levels of sensitivity (91.8%), specificity (94.7%), accuracy (94.3%), and precision (90.6%) in model plants. LTR_retriever is also compatible with long sequencing reads. Using 40k self-corrected PacBio reads, equivalent to 4.5X genome coverage of Arabidopsis, the constructed LTR library showed excellent sensitivity and specificity. In addition to canonical LTR-RTs with 5'-TG..CA-3' termini, LTR_retriever also identifies non-canonical LTR-RTs (non-TGCA), which have been largely ignored in genome-wide studies. We identified seven types of non-canonical LTRs in 42 of 50 plant genomes. The majority of non-canonical LTR-RTs are Copia elements whose LTRs are four times shorter than those of other Copia elements, a difference that may result from their target specificity. Strikingly, non-TGCA Copia elements are often located in genic regions and preferentially insert near or within genes, indicating their impact on gene evolution and their potential as mutagenesis tools.
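For readers less familiar with the performance metrics reported above, the following is a minimal sketch of their standard confusion-matrix definitions. It is not the evaluation code used in this study: the function name `classification_metrics`, the helper `_ratio`, and the example counts are hypothetical, and the unit being counted (for example elements or base pairs) is an assumption left to the user.

```python
def _ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Compute standard metrics from true/false positive/negative counts."""
    if min(tp, fp, tn, fn) < 0:
        raise ValueError("counts must be non-negative")
    return {
        # Fraction of true LTR sequence that was recovered.
        "sensitivity": _ratio(tp, tp + fn),
        # Fraction of non-LTR sequence correctly left unannotated.
        "specificity": _ratio(tn, tn + fp),
        # Fraction of all calls (positive and negative) that were correct.
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        # Fraction of predicted LTR sequence that is truly LTR.
        "precision": _ratio(tp, tp + fp),
        # False discovery rate, the complement of precision.
        "fdr": _ratio(fp, tp + fp),
    }


# Hypothetical counts for illustration only (not data from this study).
metrics = classification_metrics(tp=90, fp=10, tn=880, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because FDR is the complement of precision, a program with low precision necessarily has a high FDR, which is why the two are discussed together when comparing LTR-RT identification tools.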