Bakta: Rapid & standardized annotation of bacterial genomes via alignment-free sequence identification

Command line annotation software tools have continuously gained popularity compared to centralized online services due to the worldwide increase of sequenced bacterial genomes. However, results of existing command line software pipelines heavily depend on taxon specific databases or sufficiently well annotated reference genomes. Here, we introduce Bakta, a new command line software tool for the robust, taxon-independent, thorough and nonetheless fast annotation of bacterial genomes. Bakta conducts a comprehensive annotation workflow including the detection of small proteins taking into account replicon metadata. The annotation of coding sequences is accelerated via an alignment-free sequence identification approach that in addition facilitates the precise assignment of public database cross references. Annotation results are exported in GFF3 and INSDC-compliant flat files as well as comprehensive JSON files facilitating automated downstream analysis. We compared Bakta to other rapid contemporary command line annotation software tools in both targeted and taxonomically broad benchmarks including isolates and metagenomic-assembled genomes. We demonstrated that Bakta outperforms other tools in terms of functional annotations, the assignment of functional categories and database cross-references whilst providing comparable wall clock runtimes. Bakta is implemented in Python 3 and runs on MacOS and Linux systems. It is freely available under a GPLv3 license at https://github.com/oschwengers/bakta. An accompanying web version is available at https://bakta.computational.bio.


Introduction
Regional and functional annotations have become a routine task in the analysis of bacterial whole-genome sequencing data. A thorough genome annotation is crucial to form a stable basis for many downstream analyses as both accuracy and comprehensiveness of the annotation have strong impacts on the outcome of related studies. Hence, various online services evolved to streamline the different steps that are involved in this task [1][2][3][4].
However, these services have become unsuitable for the timely annotation of high-throughput data which is needed to keep pace with the ever increasing speed at which bacterial genomes are sequenced today [5]. To meet these growing demands, annotations are required to be conducted either locally on standard consumer hardware or within high-performance or cloud computing infrastructures. Therefore, several command line software tools for the rapid annotation of bacterial genomes have recently been developed, e.g. Prokka [6] and DFAST [7].
These tools, however, trade annotation database sizes and workflow standardizations for runtime performance and flexibility regarding user-provided annotation data, respectively. In particular, requirements for taxon-specific databases are drawbacks for automated high-throughput annotations in situations where no or only limited taxonomic knowledge is available a priori, for instance as part of larger analysis pipelines [8][9][10][11]. Likewise, requirements for annotated reference genomes present an obstacle for the annotation of species that are underrepresented in public databases or for which no annotated reference genomes are available, e.g. metagenome-assembled genomes (MAGs). Depending on taxonomic groups [12], these are important issues often involved in low rates of functionally described and annotated genes. Furthermore, existing rapid offline annotation software tools leave room for improvements regarding the following issues: (i) despite the discovery of previously overlooked conserved short open reading frames (sORFs) two decades ago [13], they neither predict nor detect coding sequences (CDSs) of nowadays well-known small proteins shorter than 29 amino acids, due to technical gene length cutoffs implemented within underlying gene prediction tools to reduce the number of false de novo predictions [14,15]; (ii) they do not identify known protein sequences stored in public databases like RefSeq [16] and UniRef100 [17] and thus cannot assign database cross references (dbxrefs), i.e. stable public database identifiers facilitating the interconnection with further and more detailed databases; (iii) they do not take into account additional sequence information, i.e. completeness and topology, for the structural annotation of CDSs spanning artificial sequence edges.
Addressing these issues, here we introduce Bakta, a new command line tool for the automated and standardized annotation of bacterial genomes aiming at a well-balanced tradeoff between runtime performance and comprehensive annotations. It implements a comprehensive annotation workflow for coding and non-coding genes complemented by the prediction of CRISPR arrays, gaps, oriC and oriT features. In contrast to other lightweight annotation pipelines, Bakta is able to detect and annotate small proteins by a custom extraction and filter workflow for sORFs. The CDSs annotation workflow is accelerated by a hash-based alignment-free protein sequence identification approach considerably reducing the number of required computationally expensive sequence alignments. This new approach furthermore facilitates the annotation of CDSs with cross references to public databases via stable identifiers. We envision Bakta also as a suitable software tool for integration into larger pipelines.
Transfer-RNA and transfer-messenger-RNA genes are predicted and annotated by tRNAscan-SE [18] and Aragorn [19], respectively. Ribosomal genes and non-coding RNAs are predicted and annotated by Infernal [20] using Rfam [21] covariance models. It is worth noting that non-coding RNA genes and non-coding RNA cis-regulatory elements are predicted and annotated as distinct feature types, allowing for distinct annotations of regulatory region subtypes and adjusted feature overlap filters. CRISPR arrays are predicted by Piler-CR [22]. Origins of replication and origins of transfer are detected by BLAST+ [23] against sequences from DoriC [24] and MOB-suite [25], respectively.
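The hash-based, alignment-free identification of protein sequences introduced above can be illustrated with a minimal sketch. It is not Bakta's implementation: the choice of MD5, the in-memory dictionary standing in for the annotation database, the additional length check and the identifier EXAMPLE_0001 are all assumptions made for illustration only.

```python
import hashlib
from typing import Dict, Optional, Tuple

# Toy stand-in for an annotation database (assumption):
# hex digest -> (sequence length, stable identifier).
KNOWN_PROTEINS: Dict[str, Tuple[int, str]] = {}


def normalize(protein: str) -> str:
    """Upper-case the sequence and drop whitespace and a trailing stop symbol."""
    return protein.strip().upper().rstrip("*")


def digest(protein: str) -> str:
    """Hash the normalized sequence; MD5 is an illustrative choice only."""
    return hashlib.md5(normalize(protein).encode("ascii")).hexdigest()


def register(protein: str, identifier: str) -> None:
    """Store a known protein under its digest together with its length."""
    seq = normalize(protein)
    KNOWN_PROTEINS[digest(seq)] = (len(seq), identifier)


def identify(protein: str) -> Optional[str]:
    """Return the identifier of an exactly matching known protein, if any.

    No alignment is computed: the lookup is a single dictionary access.
    The length comparison is a cheap extra guard against hash collisions.
    """
    seq = normalize(protein)
    hit = KNOWN_PROTEINS.get(digest(seq))
    if hit is None:
        return None
    length, identifier = hit
    return identifier if length == len(seq) else None


register("MKTAYIAKQR", "EXAMPLE_0001")  # hypothetical sequence and identifier
print(identify("mktayiakqr*"))  # exact match after normalization
print(identify("MKTAYIAKQK"))   # a single substitution is not identified
```

In such a scheme only sequences that remain unidentified need to be passed on to alignment-based homology searches, which is where the runtime savings would come from.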
Coding sequences are predicted by Prodigal taking into account optionally provided metadata on sequence completeness and topology, enabling the prediction of CDSs spanning artificial replicon edges. To this end, predicted pairs of partial CDSs on complete replicons that run off the 5' and 3' edges on the same strand are merged by Bakta. sORFs of small proteins shorter than 30 amino acids are extracted with BioPython [26]. Publicly known spurious CDSs and sORFs are filtered out using HMMER [27] and AntiFam [28].
To further improve the annotation of special interest genes, additional expert annotation tools are incorporated into the workflow allowing for fine-grained annotation of closely related protein sequences that are indistinguishable by UniRef90 clusters alone. For instance, different alleles of antimicrobial resistance genes are annotated by AMRFinderPlus [33].
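The merging of partial CDSs that run off both edges of a complete, circular replicon can be sketched as follows. This is a simplified illustration under stated assumptions, not Bakta's code: the PartialCds and MergedCds classes, the function name and the reading-frame check (combined length divisible by three) are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PartialCds:
    start: int        # 1-based, inclusive
    stop: int         # 1-based, inclusive
    strand: str       # "+" or "-"
    open_left: bool   # truncated at the first base of the sequence
    open_right: bool  # truncated at the last base of the sequence


@dataclass
class MergedCds:
    start: int   # position before the sequence edge
    stop: int    # position after the edge; stop < start marks the wrap-around
    strand: str
    length: int  # total length in nucleotides


def merge_edge_cds(cdss: List[PartialCds], seq_len: int,
                   circular: bool, complete: bool) -> Optional[MergedCds]:
    """Merge one pair of partial CDSs spanning the artificial sequence edge."""
    if not (circular and complete):
        return None  # edges of linear or incomplete sequences are real ends
    right = [c for c in cdss if c.open_right and c.stop == seq_len]
    left = [c for c in cdss if c.open_left and c.start == 1]
    for right_cds in right:
        for left_cds in left:
            if right_cds.strand != left_cds.strand:
                continue
            length = (seq_len - right_cds.start + 1) + left_cds.stop
            if length % 3 == 0:  # assumed sanity check on the reading frame
                return MergedCds(right_cds.start, left_cds.stop,
                                 right_cds.strand, length)
    return None


cdss = [
    PartialCds(start=940, stop=1000, strand="+", open_left=False, open_right=True),
    PartialCds(start=1, stop=29, strand="+", open_left=True, open_right=False),
]
print(merge_edge_cds(cdss, seq_len=1000, circular=True, complete=True))
```

The topology and completeness flags correspond to the optional replicon metadata mentioned above; without them the two fragments would have to remain separate partial features.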
Furthermore, an integrated set of reference protein sequences with curated coverage and identity thresholds is used to refine annotations, thus allowing the standardized incorporation of external high-quality annotation resources, e.g. NCBI BlastRules and VFDB [16,34].
Finally, all gathered information is assessed to assign concluding annotations. CDS product names are amended and refined to follow protein nomenclature guidelines. CDSs without annotations are then (i) marked as hypothetical proteins; (ii) described by sequence-based characterizations, i.e. molecular weight and isoelectric point; (iii) screened for protein domains by HMMER using Pfam HMM profiles [27,35].
For the integration of high-quality annotation sources from external databases that are available at runtime, a general protein sequence-based expert annotation system is compiled.

Results
To this end, protein sequences, gene symbols, protein products, query and subject coverage thresholds, sequence identity thresholds and priority ranks are stored for protein sequences from VFDB [34] and NCBI BlastRules [16]. More information is provided in Supplemental Notes S2.
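How stored coverage thresholds, identity thresholds and priority ranks might be applied to alignment hits is sketched below. The data classes, the convention that a higher rank wins, the tie-break on identity and the gene names geneA and geneB are assumptions for illustration; the actual rules are described in Supplemental Notes S2.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ExpertRule:
    gene: str
    product: str
    min_query_cov: float    # fraction of the query that must be aligned
    min_subject_cov: float  # fraction of the reference that must be aligned
    min_identity: float     # minimum sequence identity (fraction)
    rank: int               # priority rank; higher wins here (assumption)


@dataclass(frozen=True)
class Hit:
    rule: ExpertRule
    query_cov: float
    subject_cov: float
    identity: float


def passes(hit: Hit) -> bool:
    """A hit is accepted only if it meets all thresholds of its own rule."""
    rule = hit.rule
    return (hit.query_cov >= rule.min_query_cov
            and hit.subject_cov >= rule.min_subject_cov
            and hit.identity >= rule.min_identity)


def best_annotation(hits: Iterable[Hit]) -> Optional[ExpertRule]:
    """Pick the accepted hit with the highest rank, then the highest identity."""
    valid = [h for h in hits if passes(h)]
    if not valid:
        return None
    return max(valid, key=lambda h: (h.rule.rank, h.identity)).rule


# Hypothetical rules and hits.
strict = ExpertRule("geneA", "example toxin A", 0.9, 0.9, 0.95, rank=80)
loose = ExpertRule("geneB", "example toxin family protein", 0.8, 0.8, 0.5, rank=50)
hits = [Hit(strict, 0.99, 0.98, 0.93), Hit(loose, 0.99, 0.97, 0.93)]
chosen = best_annotation(hits)
print(chosen.gene if chosen else None)  # geneA fails its identity threshold
```

Keeping the thresholds per reference sequence, rather than global, is what allows curated sources with different stringency to be combined in one system.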
The deeper analysis of hypothetical proteins is a distinct task in Bakta's annotation workflow.
To this end, Pfam [35] HMMs of types other than family are downloaded and included in the database for the detection of conserved sequence domains within these proteins of unknown function at runtime.

Comparison of annotated features
To illustrate and compare all aspects of Bakta's functionality we evaluated its performance and benchmarked it against other software tools. For these comparisons, we focused on state- as well as numbers of predicted and annotated further feature types is summarized in Table 1.
First, we compared the regional prediction of various features including coding, non-coding and further genomic features. Regarding tRNAs, tmRNAs, rRNAs and CRISPR arrays all tools predicted equal or comparable numbers of features. Prokka annotated the highest total number of ncRNAs whereas only PGAP and Bakta were able to distinguish between ncRNA genes and ncRNA regulatory regions. Taking this into account, Bakta predicted the highest number of ncRNA genes (n=223) and regulatory regions (n=66). Moreover, Bakta was the only tool predicting origins of replication (n=4). Regarding CDSs, Bakta (n=5,841) and PGAP (n=5,794) predicted more genes than Prokka (n=5,754) and DFAST (n=5,740) which we attribute largely to the detection of small proteins by Bakta (n=82) and PGAP (n=44) that are not predicted de novo by Prodigal [15] and MetaGeneAnnotator [14] used by Prokka and DFAST, respectively.
Second, we compared the identification and functional annotation of predicted and detected CDSs. In contrast to Prokka and DFAST, Bakta (n=5,738) and PGAP (n=5,550) were able to precisely identify publicly-known protein sequences and to assign stable database identifiers referring to RefSeq [4] and UniRef100 [17]. In terms of functional CDSs annotation, Bakta   acceleration was achieved via the AFSI approach that drastically reduced the number of required CDS alignments to 110 in this benchmark. Wall clock runtimes required to conduct homology searches for these remaining protein sequences are further reduced by running Diamond [29] in its new fast mode. Hence, even though Bakta provides a much larger and more comprehensive annotation database, it is able to annotate bacterial genomes within wall clock runtimes roughly comparable to Prokka and DFAST even on standard consumer hardware.
To assess both the vertical scalability of each tool and the effects of AFSIs on overall runtime performances, we conducted a second benchmark measuring wall clock runtimes using varying numbers of CPU cores. Therefore, we created a Bakta version with deactivated AFSI logic which is subsequently referred to as Bakta w/o AFSI. In this experiment, DFAST consistently provided the shortest runtimes within each bin of available CPU cores followed

Functional annotation performance benchmark
We envision Bakta as a suitable alternative to existing command line annotation software tools, e.g. Prokka and DFAST. Furthermore, we see great potential for integration into larger high-throughput analysis pipelines, e.g. Tormes [8], ASA³P [9], Bactopia [10] and Nullarbor [11], enabling taxonomically untargeted workflows. Hence, we compared the functional annotation performance of Bakta against the aforementioned tools over a broad taxonomic range of species. To this end, we counted the numbers of predicted CDSs and of those annotated as hypothetical protein, both in total and in a genome-wise manner. Moreover, we counted the numbers of identified protein sequences and detected small proteins by Bakta. In a first experiment we annotated 35 taxonomically diverse bacterial genomes from RefSeq [4]. This benchmark dataset comprises many bacterial pathogens, e. proteins by Bakta as well as to differences in the internal feature overlap filters of both tools.
Within the set of benchmarked tools, only Bakta was able to identify publicly known unique  To address the discussed limitations of the RefSeq benchmark dataset we ran a second experiment to assess the functional annotation performance on a large set of genomes that are not covered by those public databases that are used within the database build procedure.
To this end, we screened the GenBank database for genomes meeting the following criteria: (i) they have a strain designation to exclude metagenome-derived genomes; (ii) they have explicitly been excluded from RefSeq due to an undefined genus; (iii) they do not miss  Table S2) were annotated with Prokka, DFAST and Bakta without providing any taxonomic
As small proteins are known to play important roles in many processes, e.g. regulation [41], virulence [42,43] and sporulation [44], we investigated the functional descriptions of all detected small proteins from the RefSeq benchmark experiment in order to assess the relevance and impact of their annotation. Table 2 summarises the numbers of detected small proteins aggregated by key words contained in the proteins' product descriptions. These results indicate that the small proteins detected by Bakta in this benchmark are involved in a broad range of important processes of high relevance to pathogenicity as well as more general cellular house-keeping processes.  complemented the annotation workflow of Bakta with a fallback stage to further expand the recognizable sequence space. Protein sequences that can be identified neither by IPSs nor by PSCs are annotated by PSCCs, i.e. UniRef50 clusters. To assess the annotation performance of Bakta and to compare it against Prokka and DFAST, we compiled a benchmark set of high-quality MAGs. To this end, we screened 7,903 published MAGs [46], assembled from more than 1,500 public metagenomes, for genomes meeting the following criteria: (i) a CheckM [50] completeness score larger than or equal to 95.0; (ii) a CheckM contamination score smaller than or equal to 1.0; (iii) a taxonomic assignment within the bacterial GTDB lineage. Using this benchmark dataset comprising 198 MAGs (Supplementary Table S3) covering a diverse taxonomic range (Supplemental Fig. S2), Bakta achieved on average a total ratio of CDSs annotated as hypothetical protein as low as 24.2% (n=138,282) outperforming DFAST (n=232,516) and Prokka (n=279,352) which achieved total ratios of 41.3% and 49.0%, respectively. Figure 5 shows the distribution of genome-wise hypothetical protein ratios. For 46.5% (n=92) of all MAGs Bakta achieved the lowest genome-wise hypothetical protein ratio. Interestingly, even in this metagenomic setup, Bakta was able to precisely identify 38.6% (n=220,753) of all predicted CDSs (n=572,213) via AFSI.
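The total ratios reported for Bakta on the MAG benchmark follow directly from the stated counts, as the short calculation below shows. The helper names and the two per-genome entries (mag_a, mag_b) are hypothetical and only illustrate the difference between total and genome-wise ratios; only the Bakta counts are taken from the text above.

```python
from typing import Dict, Tuple


def percentage(part: int, total: int) -> float:
    """Return part/total as a percentage rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * part / total, 1)


def genome_wise_ratios(counts: Dict[str, Tuple[int, int]]) -> Dict[str, float]:
    """Map each genome to its own hypothetical protein percentage.

    counts maps a genome name to (hypothetical CDSs, all predicted CDSs).
    """
    return {name: percentage(hypo, total) for name, (hypo, total) in counts.items()}


# Counts reported for Bakta on the 198 MAGs.
TOTAL_CDS = 572_213
HYPOTHETICAL = 138_282
IDENTIFIED_VIA_AFSI = 220_753

print(percentage(HYPOTHETICAL, TOTAL_CDS))         # total hypothetical ratio
print(percentage(IDENTIFIED_VIA_AFSI, TOTAL_CDS))  # share identified via AFSI
print(percentage(92, 198))                         # MAGs where Bakta ranked best

# Toy per-genome counts (hypothetical) for the genome-wise view.
print(genome_wise_ratios({"mag_a": (500, 2_500), "mag_b": (900, 3_000)}))
```

The total ratio pools all CDSs across genomes, whereas the genome-wise ratios weight every genome equally regardless of its size, which is why both views are reported.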

INSDC-compliant annotation results
The INSDC is a long-standing initiative synchronizing the major public DNA sequence databases DDBJ, ENA and GenBank. The submission of annotated genomes to these databases is a prerequisite for the publication of genomic data in most scientific journals.  Table S1) using the Webin-CLI submission tool (version 4.0.0) provided by the ENA [51]. All tested files were successfully validated without errors or warnings. In addition, annotated genomes can be submitted to GenBank via NCBI's table2asn_GFF tool using Bakta's GFF3 and Fasta files.

Convenient and scalable web-based annotations
Command line software tools are essential for the timely analysis of large bacterial cohorts. This web application provides an interactive GUI wizard that supports the user in the upload of input data, the specification of related metadata as well as the configuration and submission of annotation jobs (Fig. 6). For instance, it automatically parses the uploaded genome in Fasta file format [52] and provides a replicon table widget that aids the user with the provision of precise metadata for each replicon sequence within the genome.
Furthermore, the configuration of annotation parameters is supported via a taxon autocompletion mechanism for genus and species information that takes advantage of the ENA Taxonomy REST API [51]. Finally, annotation results are provided in various manners.
Firstly, a set of aggregated feature counts provides a broad picture of the genome. Secondly, a searchable data
We would like to emphasize that this web application can also be used to visualize annotation results produced offline with the command line version. To this end, the web application provides an offline viewer accepting Bakta's JSON result files which are parsed and visualized locally within the browser without sending any data to the server.
In contrast to existing lightweight annotation software tools, Bakta also detects and annotates sORFs of small proteins. Two decades ago, the existence of many of these small proteins was experimentally verified, expanding the prokaryotic genomic repertoire. Existing lightweight command line annotation tools fail to detect these small proteins because they rely on contemporary de novo gene prediction tools [14,15] alone. To the best of our knowledge, Bakta is currently the only lightweight annotation software tool that is able to detect and annotate these small proteins. However, it must be stated that Bakta is not able to predict these small protein coding genes de novo either. Instead, it identifies known sORF protein sequences via AFSI and additionally conducts very strict homology searches to find and annotate these sequences. Thus, Bakta helps to shed light on these genomic blind spots. This approach, however, has an obvious drawback as it is not able to predict hitherto unknown sORFs. Hence, the integration of dedicated sORF prediction tools [54,55] into this workflow might help to improve on this issue.
Existing lightweight annotation software tools accelerate the execution of their workflow by using hierarchical or taxonomically targeted annotation databases. In contrast, Bakta provides a single taxonomically untargeted database. By doing so, it facilitates the integration into larger high-throughput analysis pipelines that might be executed in a taxon-independent manner.
Also, it allows the annotation of rare bacterial species for which no or only few high- and UniRef [17]. Often, these databases are in turn linked to other databases that additionally contribute to a more comprehensive and sophisticated picture of these genomic sequences.
Especially for protein sequences of unknown functions, i.e. proteins annotated as hypothetical protein, the interconnection of database records provides a helpful tool for further investigations.
An important aspect that must not be overlooked is the potential for hash collisions, which might lead to false identifications and hence wrong annotations. In its current version 1. high-performance compute clusters and cloud computing infrastructures. For these setups, we highly recommend using a local copy of the database.
The precise annotation of CDSs conducted by Bakta is based on alignment-free detections of IPSs complemented by alignment-based homology searches for PSC homologues. However, depending on taxonomic distributions and evolutionary selection pressures, sequence conservation of protein family members may vary significantly. Hence, the AFSI of certain protein sequences belonging to more heterogeneous protein families might not always be possible. Likewise, appropriately precise annotations of CDSs belonging to closely related but nevertheless distinct protein families might not be achievable via PSCs. To facilitate more precise annotations of these CDSs, Bakta complements its annotation workflow by taking advantage of so-called expert annotation systems. At the time of writing, two expert annotation systems are implemented: one to specifically target antimicrobial resistance genes and a general protein sequence-based system integrating multiple external high-quality annotation sources. The expansion of these expert systems is subject to future improvements.
Recent progress in metagenomics allows the sequencing of entire microbial communities and the in silico reconstruction of MAGs, thus providing access to hitherto unknown genomes of unculturable organisms. The annotation of these genomes is key to many downstream analyses, such as metabolic pathway predictions. However, the annotation of these genomes via reference genomes or taxonomically targeted databases becomes difficult or even impossible for rare or unknown species that are covered poorly or not at all by public databases. To improve the annotation of these genomes we implemented an additional annotation step. We demonstrated that Bakta is able to annotate large proportions of many MAGs' protein sequences and outperforms other annotation software tools.
In conclusion, we have developed the new command line software tool Bakta, and we demonstrated that it improves on existing rapid annotation tools for bacterial genomes in various ways: (i) Bakta outperforms existing tools in terms of functional annotation of CDSs over a broad taxonomic range of both known and unknown species; (ii) Bakta is able to detect and annotate small proteins which are not predicted by contemporary de novo gene prediction tools, such as Prodigal [15] and MetaGeneAnnotator [14]; (iii) Bakta precisely identifies publicly known protein sequences and assigns stable database identifiers from RefSeq [16] and UniProt [17]; (iv) Bakta's functional annotation workflow is accelerated by a new AFSI approach; (v) Bakta takes advantage of sequence metadata to improve the structural prediction of CDSs; (vi) Bakta provides equivalent or more comprehensive annotations of CDSs with functional categories, i.e. COG, EC numbers and GO terms. Therefore, we consider Bakta a useful and valuable novel tool for the comprehensive and timely annotation of bacterial genomes, even on standard consumer hardware. In addition, we have developed a user-friendly web version providing interactive visualizations taking advantage of a highly scalable cloud-based backend.

Author statements
Conflict of interest: none declared.