ABSTRACT
Large regions of prokaryotic genomes currently lack any annotation, in part due to well-established limitations of annotation tools. We present StORF-Reporter, a tool that takes an annotated genome as input and returns CDS genes that are missing from its unannotated regions. StORF-Reporter consists of two parts. The first part begins with the extraction of unannotated regions from an annotated genome. Next, Stop-ORFs (StORFs) are identified in these unannotated regions. StORFs are open reading frames that are delimited by stop codons.
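The two steps described above (extracting unannotated regions, then finding stop-to-stop ORFs within them) can be illustrated with a minimal Python sketch. This is not StORF-Reporter's implementation: the function names, the 0-based half-open coordinates, the `min_len` thresholds and the use of the three standard stop codons (TAA, TAG, TGA) are all assumptions made for illustration.

```python
# Illustrative sketch only; names and thresholds are hypothetical,
# not taken from StORF-Reporter.
STOPS = {"TAA", "TAG", "TGA"}  # assumed: standard stop codons
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def unannotated_regions(genome_len, genes, min_len=30):
    """Return gaps between annotated genes.

    genes: list of (start, end) intervals, 0-based and half-open.
    Gaps shorter than min_len are dropped.
    """
    regions = []
    pos = 0  # end of the right-most annotation seen so far
    for start, end in sorted(genes):
        if start - pos >= min_len:
            regions.append((pos, start))
        pos = max(pos, end)
    if genome_len - pos >= min_len:
        regions.append((pos, genome_len))
    return regions


def find_storfs(seq, min_len=30):
    """Find stop-to-stop ORFs in all six reading frames.

    Returns (strand, start, end, sequence) tuples. Each sequence runs
    from one stop codon to the next in-frame stop codon, including both.
    Coordinates on the "-" strand refer to the reverse-complemented
    sequence, not to the original one.
    """
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            prev_stop = None
            for i in range(frame, len(s) - 2, 3):
                if s[i:i + 3] in STOPS:
                    if prev_stop is not None and i + 3 - prev_stop >= min_len:
                        results.append((strand, prev_stop, i + 3, s[prev_stop:i + 3]))
                    prev_stop = i
    return results
```

In this sketch, each gap returned by `unannotated_regions` would be sliced from the genome sequence and passed to `find_storfs`. Because a StORF is delimited by stop codons at both ends, no start codon needs to be found.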
We show that this methodology recovers complete coding sequences (with or without similarity to known genes) that were missing from both canonical and novel genome annotations. StORF-Reporter recovered sequences that exhibited high levels of sequence identity to proteins in the Swiss-Prot database and to the proteomes of the genomes from which they were identified (gene duplicates). We inspected in detail the results from the genomes of six model organisms, the pangenome of Escherichia coli, and a further 6,223 annotated prokaryotic genomes of 179 genera from the Ensembl Bacteria database. StORF-Reporter extended the core, soft-core and accessory gene collections, and identified novel gene families as well as families that were extended into additional genera in which they had not been identified by the canonical annotations. Many of the gene families that these sequences form are routinely misreported or completely omitted by state-of-the-art annotation methods.
StORF-Reporter has been specifically developed to be interoperable with the PROKKA genome annotation pipeline because its output format is used systematically in downstream studies.
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
This new version has been partly rewritten to reach the main points more quickly and clearly. The functional analysis has been removed because it detracted from the main story and there were not enough data to gain useful insight from it.