bioRxiv
New Results

Balrog: A universal protein model for prokaryotic gene prediction

Markus J. Sommer, Steven L. Salzberg
doi: https://doi.org/10.1101/2020.09.06.285304
Markus J. Sommer
1Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, United States of America
2Center for Computational Biology, Johns Hopkins University, Baltimore, MD, United States of America
  • For correspondence: markusjsommer@gmail.com
Steven L. Salzberg
1Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, United States of America
2Center for Computational Biology, Johns Hopkins University, Baltimore, MD, United States of America
3Departments of Computer Science and Biostatistics, Johns Hopkins University, Baltimore, MD, United States of America

Abstract

Low-cost, high-throughput sequencing has led to an enormous increase in the number of sequenced microbial genomes, with well over 100,000 genomes in public archives today. Automatic genome annotation tools are integral to understanding these organisms, yet older gene finding methods must be retrained on each new genome. We have developed a universal model of prokaryotic genes by fitting a temporal convolutional network to amino-acid sequences from a large, diverse set of microbial genomes. We incorporated the new model into a gene finding system, Balrog (Bacterial Annotation by Learned Representation Of Genes), which does not require genome-specific training and which matches or outperforms other state-of-the-art gene finding tools. Balrog is freely available under the MIT license at https://github.com/salzberg-lab/Balrog.

Author summary

Annotating the protein-coding genes in a newly sequenced prokaryotic genome is a critical part of describing their biological function. Relative to eukaryotic genomes, prokaryotic genomes are small and structurally simple, with 90% of their DNA typically devoted to protein-coding genes. Current computational gene finding tools are therefore able to achieve close to 99% sensitivity to known genes using species-specific gene models.

Though highly sensitive at finding known genes, all current prokaryotic gene finders also predict large numbers of additional genes, which are labelled as “hypothetical protein” in GenBank and other annotation databases. Many hypothetical gene predictions likely represent true protein-coding sequence, but it is not known how many of them represent false positives. Additionally, all current gene finding tools must be trained specifically for each genome as a preliminary step in order to achieve high sensitivity. This requirement limits their ability to detect genes in fragmented sequences commonly seen in metagenomic samples.

We took a data-driven approach to prokaryotic gene finding, relying on the large and diverse collection of already-sequenced genomes. By training a single, universal model of bacterial genes on protein sequences from many different species, we were able to match the sensitivity of current gene finders while reducing the overall number of gene predictions. Our model does not need to be refit on any new genome. Balrog (Bacterial Annotation by Learned Representation Of Genes) represents a fundamentally different yet effective method for prokaryotic gene finding.
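
As an illustration of what working at the protein level involves, the following minimal Python sketch enumerates candidate open reading frames on both strands of a DNA sequence and translates each one into an amino-acid string, which is the kind of input a protein-sequence model could then score. This is not Balrog's implementation, and the summary above does not describe its pipeline at this level of detail. The function names (`reverse_complement`, `translate`, `candidate_orfs`), the `min_aa` threshold, and the choice of ATG, GTG and TTG as start codons are all assumptions made for the example.

```python
from itertools import product

# Standard codon-to-amino-acid mapping, built in TCAG order; '*' marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Assumption: treat the three most common prokaryotic start codons as starts.
STARTS = {"ATG", "GTG", "TTG"}


def reverse_complement(dna):
    """Return the reverse complement of an uppercase A/C/G/T string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate codon by codon; a trailing partial codon is ignored.

    The first codon is translated literally (GTG -> V), a simplification.
    """
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def candidate_orfs(dna, min_aa=3):
    """Yield (strand, frame, protein) for each start-to-stop ORF.

    For each stop codon, only the most upstream in-frame start is used.
    Input is assumed to contain uppercase A, C, G, T only.
    """
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon in STARTS:
                    start = i
                elif CODON_TABLE[codon] == "*":
                    if start is not None:
                        protein = translate(seq[start:i])
                        if len(protein) >= min_aa:
                            yield strand, frame, protein
                    start = None


if __name__ == "__main__":
    for strand, frame, protein in candidate_orfs("CCATGAAATTTGGGTAACC"):
        print(strand, frame, protein)
```

A real gene finder would also need to handle ambiguous bases, alternative start sites, and genes truncated at contig edges. The sketch only shows the translation step that turns nucleotide sequence into protein sequence.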

Competing Interest Statement

The authors have declared no competing interest.

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license.
Posted September 08, 2020.