  • Protocol

Systematic prediction of functionally linked genes in bacterial and archaeal genomes

Abstract

Functionally linked genes in bacterial and archaeal genomes are often organized into operons. However, the composition and architecture of operons are highly variable and frequently differ even among closely related genomes. Therefore, to efficiently extract reliable functional predictions for uncharacterized genes from comparative analyses of the rapidly growing genomic databases, dedicated computational approaches are required. We developed a protocol to systematically and automatically identify genes that are likely to be functionally associated with a ‘bait’ gene or locus by using relevance metrics. Given a set of bait loci and a genomic database defined by the user, this protocol compares the genomic neighborhoods of the baits to identify genes that are likely to be functionally linked to the baits by calculating the abundance of a given gene within and outside the bait neighborhoods and the distance to the bait. We exemplify the performance of the protocol with three test cases, namely, genes linked to CRISPR–Cas systems using the ‘CRISPRicity’ metric, genes associated with archaeal proviruses and genes linked to Argonaute genes in halobacteria. The protocol can be run by users with basic computational skills. The computational cost depends on the sizes of the genomic dataset and the list of reference loci and can vary from one CPU-hour to hundreds of hours on a supercomputer.
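The abstract names three quantities per gene: its abundance within bait neighborhoods, its abundance outside them, and its distance to the bait. As a rough illustration only, the Python sketch below computes a simple version of each for every gene family. It is not the ICITY pipeline code. The names `Gene` and `score_families`, the `window` parameter, the use of gene-order distance, and the unweighted ratio are all assumptions made for this example; the published pipeline may weight or normalize counts differently.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Gene:
    contig: str   # contig or replicon identifier
    index: int    # ordinal position of the gene on the contig
    family: str   # gene family / protein cluster label (hypothetical)


def score_families(genes, baits, window=10):
    """Score each gene family by its association with bait loci.

    genes  : iterable of Gene
    baits  : set of (contig, index) pairs marking bait genes
    window : neighborhood half-width in genes (an assumed parameter)

    Returns {family: {"total", "icity", "median_distance"}}, where
    "icity" is the unweighted fraction of family members found within
    `window` genes of a bait.
    """
    # Group bait positions by contig for quick distance lookups.
    bait_positions = {}
    for contig, idx in baits:
        bait_positions.setdefault(contig, []).append(idx)

    stats = {}
    for g in genes:
        if (g.contig, g.index) in baits:
            continue  # the baits themselves are not scored
        rec = stats.setdefault(g.family, {"total": 0, "near": 0, "dists": []})
        rec["total"] += 1
        same_contig = bait_positions.get(g.contig, [])
        if same_contig:
            nearest = min(abs(g.index - b) for b in same_contig)
            if nearest <= window:
                rec["near"] += 1
                rec["dists"].append(nearest)

    return {
        fam: {
            "total": r["total"],
            "icity": r["near"] / r["total"],
            "median_distance": median(r["dists"]) if r["dists"] else None,
        }
        for fam, r in stats.items()
    }
```

In this simplified scheme, a family whose members occur mostly near baits, and close to them, scores high on the first measure and low on the second. A family that is common across genomes but only occasionally adjacent to a bait scores low on the first measure.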

Fig. 1: The pipeline for the identification of gene families associated with a set of baits.
Fig. 2: A detailed, step by step schematic of the protocol.
Fig. 3: The space of relevance metrics.
Fig. 4: Dissection of the space of relevance metrics.

Data and code availability

The source code of the Icity pipeline is freely available under the open-source NCBI license (https://github.com/ncbi/ICITY/blob/master/LICENSE.txt) at the NCBI GitHub page (https://github.com/ncbi/ICITY). Questions and comments can be addressed to the authors through the GitHub portal or by email. All example datasets and the results of their analysis presented in the paper are available at the NCBI FTP site (ftp://ftp.ncbi.nih.gov/pub/wolf/_suppl/icityNatProt/).

Acknowledgements

This research was funded through the Intramural Research Program of the National Institutes of Health of the USA, the RFBR (for research project 18-34-00012, S.A.S.), a systems biology fellowship funded by Philip Morris Sales and Marketing (to S.A.S.), the Ministry of Education and Science of the Russian Federation (subsidy agreement 14.606.21.0006; project identifier RFMEFI60617X0006; to S.A.S. and K.V.S.) and an NIH grant (R01 GM10407 to K.V.S.).

Author information

Contributions

S.A.S., Y.I.W. and E.V.K. designed the protocol; S.A.S. implemented the protocol with assistance from G.F.; S.A.S., G.F., K.S.M., Y.I.W. and K.V.S. analyzed the results; S.A.S., Y.I.W. and E.V.K. wrote the manuscript, which was read, edited and approved by all authors.

Corresponding author

Correspondence to Eugene V. Koonin.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature Protocols thanks Christine Pourcel and other anonymous reviewer(s) for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Related links

Key references using this protocol

Shmakov, S. A., Makarova, K. S., Wolf, Y. I., Severinov, K. V. & Koonin, E. V. Proc. Natl Acad. Sci. USA 115, E5307–E5316 (2018): https://doi.org/10.1073/pnas.1803440115

Shmakov, S. et al. Nat. Rev. Microbiol. 15, 169–182 (2017): https://doi.org/10.1038/nrmicro.2016.184

Shmakov, S. et al. Mol. Cell 60, 385–397 (2015): https://doi.org/10.1016/j.molcel.2015.10.008

Supplementary information

Supplementary Data

Step-by-step explanation of the RunClust.sh script used for protein clustering, and an alternative iterative clustering procedure.

Reporting Summary

Cite this article

Shmakov, S.A., Faure, G., Makarova, K.S. et al. Systematic prediction of functionally linked genes in bacterial and archaeal genomes. Nat Protoc 14, 3013–3031 (2019). https://doi.org/10.1038/s41596-019-0211-1
