ABSTRACT
Large regions of prokaryotic genomes currently lack any annotation, in part due to well-established limitations of annotation tools. We present StORF-Reporter, a tool that takes an annotated genome as input and returns missed CDS genes from its unannotated regions. StORF-Reporter consists of two parts. The first part begins with the extraction of unannotated regions from an annotated genome. Next, Stop-ORFs (StORFs), which are Open Reading Frames delimited by stop codons, are identified in these unannotated regions.
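The two steps described above (extracting unannotated regions, then scanning them for stop-to-stop ORFs) can be illustrated with a minimal Python sketch. This is not the StORF-Reporter implementation: the function names, the 0-based half-open coordinate convention, the choice to include both delimiting stop codons in the reported span, and the `min_len` threshold are all assumptions made for illustration, and only the three standard stop codons are considered.

```python
# Illustrative sketch only; names, coordinates and thresholds are assumptions,
# not StORF-Reporter's actual code or defaults.
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement, so the opposite strand can be scanned too."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def unannotated_regions(genome_length, annotated):
    """Return 0-based half-open intervals not covered by any annotated interval."""
    regions = []
    cursor = 0  # end of the furthest annotation seen so far
    for start, end in sorted(annotated):
        if start > cursor:
            regions.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < genome_length:
        regions.append((cursor, genome_length))
    return regions


def find_storfs(seq, min_len=30):
    """Find stop-to-stop ORFs in the three forward frames of seq.

    Each hit is (start, end, frame), where seq[start:end] runs from the first
    base of one in-frame stop codon to the last base of the next in-frame stop.
    """
    seq = seq.upper()
    storfs = []
    for frame in range(3):
        prev_stop = None
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                if prev_stop is not None and (i + 3 - prev_stop) >= min_len:
                    storfs.append((prev_stop, i + 3, frame))
                prev_stop = i
    return storfs


if __name__ == "__main__":
    # Hypothetical annotation intervals on a 100 bp genome.
    print(unannotated_regions(100, [(10, 20), (15, 40), (60, 100)]))
    # A toy region containing one stop-delimited ORF in frame 0.
    print(find_storfs("TAAATGAAACCCTAG", min_len=9))
```

To cover both strands under these assumptions, `find_storfs` would be called a second time on `reverse_complement(region)` and the resulting coordinates mapped back to the forward strand.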
We show that this methodology recovers complete coding sequences (with or without similarity to known genes) that were missing from both canonical and novel genome annotations. StORF-Reporter recovered sequences that exhibited high levels of sequence identity to proteins in the SwissProt database and to the proteomes of the genomes from which they were identified (gene duplicates). We inspected in detail the results from the genomes of six model organisms, the pangenome of Escherichia coli, and a further 6,223 annotated prokaryotic genomes of 179 genera from the Ensembl Bacteria database. StORF-Reporter extended the core, soft-core and accessory gene collections, identified novel gene families, and extended known families into additional genera in which the canonical annotations had not identified them. Many of the gene families these sequences form are routinely misreported or completely omitted by state-of-the-art annotation methods.
Competing Interest Statement
The authors have declared no competing interest.