RT Journal Article
SR Electronic
T1 Semantic search using protein large language models detects class II microcins in bacterial genomes
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 2023.11.15.567263
DO 10.1101/2023.11.15.567263
A1 Kulikova, Anastasiya V.
A1 Parker, Jennifer K.
A1 Davies, Bryan W.
A1 Wilke, Claus O.
YR 2023
UL http://biorxiv.org/content/early/2023/11/15/2023.11.15.567263.abstract
AB Class II microcins are antimicrobial peptides that have shown some potential as novel antibiotics. However, to date only ten class II microcins have been described, and discovery of novel microcins has been hampered by their short length and high sequence divergence. Here, we ask if we can use numerical embeddings generated by protein large language models to detect microcins in bacterial genome assemblies and whether this method can outperform sequence-based methods such as BLAST. We find that embeddings detect known class II microcins much more reliably than does BLAST and that any two microcins tend to have a small distance in embedding space even though they typically are highly diverged at the sequence level. In datasets of Escherichia coli, Klebsiella spp., and Enterobacter spp. genomes, we further find novel putative microcins that were previously missed by sequence-based search methods. Competing Interest Statement: The authors have declared no competing interest.