Computational exploration of the global microbiome for antibiotic discovery

Summary
Novel antibiotics are urgently needed to combat the antibiotic-resistance crisis. We present a machine learning-based approach to predict prokaryotic antimicrobial peptides (AMPs) by leveraging a vast dataset of 63,410 metagenomes and 87,920 microbial genomes. This led to the creation of AMPSphere, a comprehensive catalog comprising 863,498 non-redundant peptides, the majority of which were previously unknown. We observed that AMP production varies by habitat, with animal-associated samples displaying the highest proportion of AMPs compared to other habitats. Furthermore, within different human-associated microbiota, strain-level differences were evident. To validate our predictions, we synthesized and experimentally tested 50 AMPs, demonstrating their efficacy against clinically relevant drug-resistant pathogens both in vitro and in vivo. These AMPs exhibited antibacterial activity by targeting the bacterial membrane. Additionally, AMPSphere provides valuable insights into the evolutionary origins of peptides. In conclusion, our approach identified AMP sequences within prokaryotic microbiomes, opening up new avenues for the discovery of antibiotics.


Introduction
Antibiotic-resistant infections are becoming increasingly difficult to treat with conventional therapies. Indeed, such infections currently kill 1.27 million people per year 1 . Therefore, there is an urgent need to develop novel methods for antibiotic discovery. Computational approaches have been recently developed to accelerate our ability to identify novel antibiotics, including antimicrobial peptides (AMPs) 2-5 .
AMPs, found in all domains of life [6][7][8][9] , are short sequences (here operationally defined as 10-100 amino acid residues 10 ) capable of disturbing microbial growth 7,11 . AMPs can target proteins, RNA, DNA, and other intracellular molecules; they most commonly interfere with cell wall integrity and cause cell lysis 7 . Natural AMPs can originate by proteolysis 3,12 or by non-ribosomal synthesis 13 , or they can be encoded within the genome 14 . Recently, proteome mining approaches have been developed to identify antimicrobials in extinct organisms 15 .
Bacteria live in an intricate balance of antagonism and mutualism in natural habitats. AMPs play an important role in modulating such microbial interactions and can displace competitor strains, facilitating cooperation 16 . For instance, pathogens such as Shigella spp. 17 , Staphylococcus spp. 18 , Vibrio cholerae 19 , and Listeria spp. 20,21 produce AMPs that eliminate competitors (sometimes from the same species) and occupy their niche.
AMPs hold promise as potential therapeutics and have already been used clinically as antiviral drugs (e.g., enfuvirtide and telaprevir 22 ). AMPs that exhibit immunomodulatory properties are currently undergoing clinical trials 23 , as are AMPs that may be used to address yeast and bacterial infections 24 (e.g., pexiganan and LL-37). Although most AMPs display broad-spectrum activity, some have a narrow spectrum, acting only against closely related members of the same species or genus 25 . Such AMPs are more targeted agents than conventional broad-spectrum antibiotics 26,27 . For example, the FDA-approved small peptide nisin has been shown to restore microbiome homeostasis in Clostridioides difficile infections in mice 28 and in ex vivo models of the human gut 29 . Furthermore, contrary to conventional antibiotics, the evolution of resistance to many AMPs occurs at low rates and is not related to cross-resistance to other classes of widely used antibiotics 3,30,31 .
The application of metagenomic analyses to the study of AMPs has been limited due to technical constraints, primarily stemming from the challenge of distinguishing genuine protein-coding sequences from false positives 32 . As a consequence, the significance of small open reading frames (smORFs) has been historically overlooked in (meta)genomic analyses [33][34][35] . In recent years, significant progress has been made in metagenomic analyses of human-associated smORFs 5,36 . These advancements have incorporated machine learning (ML) techniques to identify smORFs encoding proteins belonging to specific functional categories [37][38][39][40] . Notably, a recent study uncovered approximately 2,000 AMPs from metagenomic samples of human gut microbiomes 5 .
Nevertheless, it is important to note that the human gut represents only a fraction of the overall microbial diversity, suggesting that there remains an immense potential for the discovery of AMPs from prokaryotes in the diverse range of habitats across the globe.
In this study, we employed ML to predict and catalog AMPs across the global microbiome. By computationally exploring 63,410 publicly available metagenomes and 87,920 high-quality microbial genomes 41 , we uncovered a vast array of AMP diversity. This resulted in the creation of AMPSphere, a collection of 863,498 non-redundant peptide sequences, encompassing candidate AMPs (c_AMPs) derived from (meta)genomic data. Remarkably, the majority of these identified c_AMP sequences had not been previously described. Our analysis revealed that these AMPs were specific to particular habitats and were predominantly not core genes in the pangenome.
Moreover, we synthesized 50 c_AMPs from AMPSphere and found that 54% of them exhibited antimicrobial activity in vitro against clinically significant ESKAPEE pathogens, which are recognized as public health concerns 42,43 . These peptides demonstrated their ability to target bacterial membranes and were prone to adopting α-helical and β-structures. Notably, the leading candidates displayed promising anti-infective activity in a preclinical animal model.

c_AMPs are rare and habitat specific
The AMPSphere spans 72 different habitats, which were classified into 8 high-level habitat groups, e.g., soil/plant (36.6% of c_AMPs in AMPSphere), aquatic (24.8%), human gut (13%) (Fig. 1A and Table SI2). Most of the habitats, except for the human gut, appear to be far from saturation in terms of newly discovered c_AMPs (Fig. 1C). In fact, most AMPs are rare (median number of detections is 99, or 0.17% of the dataset), with 83.97% being observed in <1% of samples (see Fig. SI2). Only 10.8% (93,280) c_AMPs were detected in more than one high-level habitat (henceforth, "multi-habitat c_AMPs"); this level is 7.25-fold less frequent than would be expected by a random assignment of habitats to samples (P_Permutation < 10⁻³⁰⁰; see Methods - Multi-habitat and rare c_AMPs). Even within high-level habitat groups, c_AMP contents overlap between habitats much less than expected by chance (2.4- to 192-fold less, P_Permutation ≤ 5.4·10⁻⁵⁰; see Methods - Significance of the overlap of c_AMP contents; Fig. 1D), indicating the existence of cross-habitat boundaries.
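The permutation logic described above (randomly reassigning habitats to samples and counting how many c_AMPs then appear in more than one habitat) can be illustrated with a minimal Python sketch. It assumes each c_AMP is stored as the set of samples in which it was detected and each sample carries a single high-level habitat label; the function names, the number of permutations, and the toy data are hypothetical and are not taken from the authors' Methods.

```python
import random


def count_multi_habitat(amp_samples, sample_habitat):
    """Count c_AMPs detected in samples from more than one habitat."""
    return sum(
        1
        for samples in amp_samples.values()
        if len({sample_habitat[s] for s in samples}) > 1
    )


def permutation_test(amp_samples, sample_habitat, n_perm=1000, seed=0):
    """Shuffle habitat labels across samples and compare with the observed count.

    Returns (observed count, mean permuted count, empirical one-sided P-value
    for seeing a count as low as, or lower than, the observed one).
    """
    rng = random.Random(seed)
    samples = sorted(sample_habitat)
    labels = [sample_habitat[s] for s in samples]
    observed = count_multi_habitat(amp_samples, sample_habitat)

    permuted_counts = []
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        permuted_counts.append(
            count_multi_habitat(amp_samples, dict(zip(samples, shuffled)))
        )

    as_low = sum(c <= observed for c in permuted_counts)
    # Add-one correction so the empirical P-value is never exactly zero.
    p_value = (as_low + 1) / (n_perm + 1)
    return observed, sum(permuted_counts) / n_perm, p_value


# Toy data (illustrative only): six samples, three habitats, four c_AMPs.
toy_habitat = {"s1": "soil", "s2": "soil", "s3": "gut",
               "s4": "gut", "s5": "aquatic", "s6": "aquatic"}
toy_amps = {"a": {"s1", "s2"}, "b": {"s3", "s4"},
            "c": {"s5", "s6"}, "d": {"s1", "s3"}}

if __name__ == "__main__":
    print(permutation_test(toy_amps, toy_habitat))
```

The ratio of the mean permuted count to the observed count plays the role of the fold difference reported in the text. An empirical P-value cannot fall below 1/(n_perm + 1), so reporting values as small as those above would require a different estimate of the tail than this sketch provides.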

Mutations in larger genes generate c_AMPs as independent genomic entities
Many AMPs are generated post-translationally by the fragmentation of larger proteins 12 . In parallel, encrypted peptides from protein sequences within the human proteome were also shown to be highly active. Those peptides have different physicochemical features compared to known AMPs 3,31 . However, AMPSphere only considered peptides encoded by dedicated genes. Nonetheless, we hypothesized that some of these have originated from larger proteins by fragmentation at the genomic level: about 7% (61,020) of AMPSphere c_AMPs are homologous to full-length proteins in GMGCv1 46 (Fig. 1B), with 27% of hits sharing the start codon with the full-length protein, which suggests early termination of full-length proteins as one mechanism for generating novel c_AMPs ( Fig. 2A and 2B).
To investigate the function of the full-length proteins homologous to AMPs, we mapped the matching proteins from GMGCv1 46 to their orthologous groups (OGs) from eggNOG 5.0 48 . We identified 3,792 (out of 43,789) OGs significantly enriched (P_Hypergeom < 0.05, after multiple hypothesis correction with the Holm-Sidak method) among the hits from AMPSphere. Although OGs of unknown function comprise 53.8% of all identified OGs, when considered individually, these OGs are, on average, smaller than OGs in other categories. Thus, despite each OG having a relatively small number of c_AMP hits, when compared to the background distribution of the OGs in GMGCv1, OGs of unknown function were the most enriched among the c_AMP hits, with an average enrichment of 10,857-fold (P_Mann ≤ 3.9·10⁻⁴; Fig. 2C; Table SI3).
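As an illustration of the enrichment analysis described above, the sketch below computes a hypergeometric upper-tail P-value for one OG, a fold enrichment, and Holm-Sidak adjusted P-values. It is a standard-library toy written under our own assumptions about how the counts are organized; the function names are hypothetical, and for catalog-scale counts a dedicated statistics library (for example SciPy's hypergeometric distribution and the statsmodels multiple-testing utilities) would be the practical choice.

```python
from math import comb


def hypergeom_sf(k, M, n, N):
    """P(X >= k) for N draws from a population of M containing n successes.

    Here M would be the number of background proteins, n the size of one OG,
    N the number of c_AMP hits, and k the hits falling in that OG.
    Exact integer arithmetic; only suitable for small toy counts.
    """
    upper = min(n, N)
    favourable = sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1))
    return favourable / comb(M, N)


def fold_enrichment(k, N, n, M):
    """Observed fraction of hits in the OG divided by its background fraction."""
    return (k / N) / (n / M)


def holm_sidak(p_values):
    """Holm-Sidak step-down adjusted P-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest P-value is corrected for m tests, the next for m - 1, ...
        adj = 1.0 - (1.0 - p_values[idx]) ** (m - rank)
        running_max = max(running_max, adj)  # enforce monotonicity
        adjusted[idx] = min(1.0, running_max)
    return adjusted


if __name__ == "__main__":
    # Toy numbers: 1,000 background proteins, an OG of 20, 50 hits, 5 in the OG.
    print(hypergeom_sf(5, 1000, 20, 50), fold_enrichment(5, 50, 20, 1000))
    print(holm_sidak([0.01, 0.04, 0.03]))
```

An OG would be called enriched when its adjusted P-value is below 0.05, matching the threshold quoted in the text.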

c_AMP genes may arise after gene duplication events
We next asked whether c_AMPs are predominantly present in specific genomic contexts. To investigate the functions of the neighboring genes of the c_AMPs, we mapped them against 169,484 high-quality genomes included in a study by del Río et al. 49 . Of the 55,191 c_AMPs with more than two homologs in different genomes in the database, 21,465 (38.9%) show phylogenetically conserved genomic context with genes of known function (see Methods - Genomic context conservation analysis). This proportion of c_AMPs in conserved genomic context is slightly higher than for other gene family clusters calculated on the genomes' gene sets. This difference becomes more pronounced when comparing the genomic context of c_AMPs and protein families composed of short (< 50 amino acids) peptides.
Despite being involved in similar processes, AMPs were significantly depleted from conserved genomic contexts involving known systems of antibiotic synthesis and resistance, even when they were compared to small protein families (0.6-fold, P_Permutation = 1.7·10⁻⁸, Fig. 3). Instead, we found that c_AMPs are encoded in conserved genomic contexts with ribosomal genes (24.1%) and ABC transporters (18%) at a higher frequency than other gene families (Fig. 3A and Tables SI4 and SI5).
Most of the c_AMPs (2,201 out of the 2,642) in conserved ribosomal genomic contexts are homologous to ribosomal proteins (Fig. 3D), congruent with the observation that, in some species, ribosomal proteins have antimicrobial properties 50 . Seventy-seven of the c_AMPs that were homologous to ribosomal proteins were homologous to a gene in their immediate vicinity (up to 1 gene up/downstream). This phenomenon is not exclusive to ribosomal proteins: 2,309 c_AMPs can be annotated to the same KEGG Orthologous Group (KO) as some of their conserved neighbors and may have originated from gene duplication events, the common annotation being interpreted, in this context, as evidence for a common evolutionary origin and not as a functional prediction for the c_AMPs. Interestingly, 1,707 (73.9%) of these c_AMPs are located downstream of the conserved neighbor with the same KO annotation. The livM family (branched-chain amino acid transport system permease protein, K01998), a transposase family (K07486), and a class of permeases (K03106) are the most common KOs assigned to c_AMPs and their neighbors (185, 128, and 67 c_AMPs, respectively) -see Table SI6.

Most c_AMPs are members of the accessory pangenome
We observed that only a small portion (5.9%, P_Permutation = 4.8·10⁻³, N_Species = 416) of c_AMP families present in ProGenomes2 41 show prevalence ≥95% in genomes from the same species (Fig. 4), here referred to as "core" 51 . This is consistent with previous work, in which AMP production was observed to be strain-specific 52 . In contrast, a high proportion (circa 68.8%) of full-length protein families are core in ProGenomes2 41 species. There is a 1.9-fold greater chance (P_Fisher = 2.2·10⁻⁹²) of finding a pair of genomes from the same species sharing at least one c_AMP when they belong to the same strain (99.5% ≤ ANI < 99.99%).
One example of this strain-specific behavior is AMP10.018_194, the only c_AMP found in Mycoplasma pneumoniae genomes. M. pneumoniae strains are traditionally classified into two groups based on their P1 adhesin gene 53 . Of the 76 M. pneumoniae genomes present in our study, 29 were of type-1, 29 were of type-2, and the remaining 18 were of the undetermined type in this classification system 54 (Methods -Determination of accessory AMPs). Twenty-six of the 29 type-2 genomes contained AMP10.018_194, as did 2 undetermined type genomes, but none of the type-1 genomes contained this AMP.
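The "core" criterion used in this section (prevalence of at least 95% among genomes of the same species) can be written as a short Python sketch. The data layout (a mapping from genome identifier to the families found in it) and the function name are assumptions made for illustration, not the authors' implementation.

```python
def core_families(genome_families, threshold=0.95):
    """Return the families present in at least `threshold` of the genomes.

    genome_families maps a genome identifier to an iterable of the c_AMP
    (or protein) families detected in that genome; all genomes are assumed
    to belong to the same species.
    """
    n_genomes = len(genome_families)
    counts = {}
    for families in genome_families.values():
        for family in set(families):  # count each family once per genome
            counts[family] = counts.get(family, 0) + 1
    return {fam for fam, c in counts.items() if c / n_genomes >= threshold}


if __name__ == "__main__":
    # Toy species with 20 genomes: famA in all, famB in 19, famC in 10.
    genomes = {f"g{i}": {"famA"} for i in range(20)}
    for i in range(19):
        genomes[f"g{i}"].add("famB")
    for i in range(10):
        genomes[f"g{i}"].add("famC")
    print(sorted(core_families(genomes)))
```

Applied to the example above, AMP10.018_194 occurs in 28 of the 76 M. pneumoniae genomes (26 type-2 plus 2 of undetermined type), a prevalence of roughly 37%, so under this criterion it would be classified as accessory rather than core.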

Bacterial strains from the human gut have more c_AMP genes than conspecific strains from other human body sites
We investigated the taxonomic composition of the AMPSphere by annotating contigs with the GTDB taxonomy 55,56 (see Methods - Differences in the c_AMP density in microbial species from different habitats), which resulted in 570,187 c_AMPs being annotated to a genus or species. The genera contributing the most c_AMPs to AMPSphere were Prevotella (18,593 c_AMPs), Bradyrhizobium (11,846 c_AMPs), Pelagibacter (6,675 c_AMPs), Faecalibacterium (5,917 c_AMPs), and CAG-110 (5,254 c_AMPs) (see Fig. 4). This distribution reflects the fact that these genera are among those that contribute the most assembled sequences in our dataset (all occupying percentiles above 99.75% among the assembled genera; see Table SI7). Therefore, we computed the c_AMP density (ρ_AMP) as the number of c_AMP genes found per megabase pair of assembled sequence. The densest genera were environmental microorganisms, such as Algorimicrobium (2.1 c_AMP genes per Mbp), as well as non-cultured taxa, e.g., TMED78 (1.6), SFJ001 (1.5), STGJ01 (1.4), and CAG-462 (1.4). However, when we considered phylum-level annotations, none of the above-mentioned genera belongs to Methylomirabilota, the phylum with the highest ρ_AMP, which is followed by Fusobacteriota. This absence is explained by contigs that are assigned to these phyla but not to a specific genus, likely because of a lack of representation in the database.
These environmental high c_AMP density genera are, however, low abundance taxa (see Methods - Differences in the c_AMP density in microbial species from different habitats) and the average ρ_AMP of animal-host-associated samples is 1.6-fold higher than in samples from non-host-associated habitats (P_Mann < 10⁻³³⁰; Fig. 5A and Fig. SI4). Differences between habitats reflect both changes in high-level taxonomic composition and subspecies variation. To disentangle these two effects, we analyzed the 3,930 microbial species that were present in at least 10 samples from two different habitats, comparing the AMP density in sequences from the same species in different habitats. This resulted in 1,531 species showing significant differences in at least one pair of habitats (FDR < 0.05, Holm-Sidak; see Methods - Differences in the c_AMP density in microbial species from different habitats). For example, Prevotella copri has a higher density when found in the cat and human gut compared to other mammalian hosts or in wastewater (P_Kruskal = 4.9·10⁻²⁷⁶, Fig. 5B and Table SI8). When comparing the human gut and oral cavity (the two body sites with the greatest overlap in species), in general, microbial strains have higher AMP density in the human gut (P_Mann = 1.2·10⁻⁴, N_Species = 37, Fig. 5C and Table SI8). We also checked non-animal hosts, such as plants, and observed that microbial species simultaneously observed in soil and plant-associated microbiome usually have higher AMP density in soil (P_Mann = 2.8·10⁻⁴, N_Species = 130, Fig. 5D and Table SI8).
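The density used throughout this section, the number of c_AMP genes per megabase pair (Mbp) of assembled sequence, can be sketched in a few lines of Python. The per-species, per-habitat aggregation shown here pools counts and assembled lengths before dividing; that pooling, the record layout, and the function names are illustrative assumptions, and the authors' exact procedure is described in their Methods.

```python
from collections import defaultdict


def amp_density(n_amp_genes, assembled_bp):
    """c_AMP genes per megabase pair (Mbp) of assembled sequence."""
    if assembled_bp <= 0:
        raise ValueError("assembled_bp must be positive")
    return n_amp_genes / (assembled_bp / 1_000_000)


def density_by_species_and_habitat(records):
    """Pool counts per (species, habitat) and return the pooled densities.

    records is an iterable of (species, habitat, n_amp_genes, assembled_bp)
    tuples, one per sample in which the species was assembled.
    """
    totals = defaultdict(lambda: [0, 0])
    for species, habitat, n_genes, bp in records:
        totals[(species, habitat)][0] += n_genes
        totals[(species, habitat)][1] += bp
    return {key: amp_density(n, bp) for key, (n, bp) in totals.items()}


if __name__ == "__main__":
    toy = [
        ("species_1", "human gut", 3, 2_000_000),
        ("species_1", "human gut", 1, 2_000_000),
        ("species_1", "human oral", 1, 4_000_000),
    ]
    print(density_by_species_and_habitat(toy))
```

Comparing the same species across habitats, as in the gut versus oral cavity contrast above, then amounts to comparing these per-habitat values with a suitable non-parametric test.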

Physicochemical features and secondary structure of AMPs
To investigate the properties and structure of the synthesized peptides, we first compared their amino acid composition to AMPs from available databases (DRAMP 3.0 44 , DBAASP 58 , and APD3 59 ).
Overall, the composition was similar, as was expected, given that Macrel's ML model was trained using known AMPs 40 . Notably, the AMPSphere sequences displayed a slightly higher abundance of aliphatic amino acid residues, specifically alanine and valine. However, these AMPSphere sequences consistently differed (Fig. 6A) from encrypted peptides (EPs), which are peptides previously identified within proteins, including those found in the human proteome 3 . The resemblances in amino acid composition between the identified c_AMPs and known AMPs suggested similar physicochemical characteristics and secondary structures, both of which are recognized for their influence on antimicrobial activity 60 . The c_AMPs exhibited comparable hydrophobicity, net charge, and amphiphilicity to AMPs sourced from databases (Fig. SI1). Furthermore, they displayed a slight propensity for disordered conformations (Fig. 6B) and had a lower positive charge compared to EPs ( Fig. 6A).
Subsequently, we conducted experimental assessments of the secondary structure of the active c_AMPs using circular dichroism (Fig. 6B and SI6). Similar to AMPs documented in databases, peptides derived from AMPSphere exhibited a pronounced propensity for adopting α-helical structures. Notably, they also displayed an unusually high content of β-antiparallel structure in both water and methanol/water mixtures (Fig. 6B), despite their amino acid composition similarities to AMPs and EPs. We attribute these findings to the slightly elevated occurrence of alanine and valine residues, which are known to favor β-like structures with a preference for the β-antiparallel conformation 61 .

Validation of c_AMPs as potent AMPs through in vitro assays
To evaluate the potential antimicrobial properties of c_AMPs, we selected and chemically synthesized 50 peptide sequences based on their abundance, predicted solubility, and taxonomic diversity (Methods - Selection of peptides to synthesis and activity testing). Next, we tested these peptides against 11 clinically relevant pathogenic strains. Our initial screening revealed that 27 AMPs (54% of the total synthesized) completely inhibited the growth of at least one of the pathogens tested (Fig. 6C). Remarkably, in some cases, the AMPs were active at concentrations as low as 1 μmol·L⁻¹. Several of the Gram-negative bacteria, i.e., A. baumannii, E. coli, and P. aeruginosa, as well as the Gram-positive strain of vancomycin-resistant E.

The growth of human gut commensals is impaired by c_AMPs
We screened the AMPs against eight of the most prevalent members of the human gut microbiota. We tested commensal bacteria belonging to four phyla (including Verrucomicrobiota and Bacteroidota). While it is commonly observed that known natural AMPs do not target microbiome strains 62 , our study found that 30 of the synthesized AMPs (60%) demonstrated inhibitory effects on at least one commensal strain at low concentrations (8-16 μmol·L⁻¹). Although this concentration range was higher than that required to inhibit pathogens (1-4 μmol·L⁻¹), it still falls within the highly active range of AMPs based on previous studies [63][64][65] (Fig. 6C). Interestingly, all the analyzed gut microbiome strains were susceptible to at least two c_AMPs, with strains of A. muciniphila, B. uniformis, P. vulgatus, C. aerofaciens, C. scindens, and P. distasonis exhibiting the highest susceptibility. In total, 36 AMPs (72% of the total synthesized peptides) demonstrated antimicrobial activity against pathogens and/or commensals.

Depolarization and permeabilization of the bacterial membrane by AMPs from AMPSphere
To gain insights into the mechanism of action responsible for the antimicrobial activity observed in the peptides derived from AMPSphere (Fig. 6C), we conducted experiments to assess their ability to permeabilize and depolarize the cytoplasmic and outer membranes of bacteria at their minimum inhibitory concentrations (MICs). Specifically, we investigated the effects of 22 peptides on A. baumannii (Fig. 6D-E and Fig. SI7A,D) and 4 peptides on P. aeruginosa (Fig. SI7B-C,E). For comparison, we used polymyxin B, a peptide antibiotic known for its membrane permeabilization and depolarization properties, as a control in these experiments 3 .
To investigate the potential permeabilization of the outer membranes of Gram-negative bacteria by active AMPs derived from AMPSphere, we conducted 1-(N-phenylamino)naphthalene (NPN) uptake assays (Fig. SI7A-B). NPN is a lipophilic fluorophore that exhibits increased fluorescence in the presence of lipids found in bacterial outer membranes. The uptake of NPN indicates membrane permeabilization and damage. Among the 22 peptides evaluated for activity against A. baumannii, 10 peptides caused significant permeabilization of the outer membrane, resulting in fluorescence levels at least 50% higher than that of polymyxin B (Fig. 6D). Only three peptides exhibited lower permeabilization than polymyxin B (Fig. 6D). In the case of P. aeruginosa cells, two out of the four tested peptides showed higher permeabilization than polymyxin B (Fig. SI7C).
To evaluate the potential membrane depolarization effect of the AMPs from AMPSphere, we utilized the fluorescent dye 3,3′-dipropylthiadicarbocyanine iodide [DiSC3-(5)] (Fig. SI7D-E). Among the peptides tested against A. baumannii and P. aeruginosa, marinobacticin-1 (AMP10.321_460) and cagicin-2 (AMP10.014_861), respectively, exhibited greater cytoplasmic membrane depolarization than polymyxin B (Fig. 6E; Fig. SI7E-F). Interestingly, all the tested AMPSphere peptides displayed a characteristic crescent-shaped depolarization pattern compared to polymyxin B, with lower levels of depolarization observed during the first 20 minutes of exposure, followed by an increase in depolarization over time (Fig. 6E and Fig. 7D-F). Taken together, these results indicate that the kinetics of cytoplasmic membrane depolarization are slower than the kinetics of outer membrane permeabilization, which occurs rapidly upon interaction with the bacterial cells.
Our findings indicate that the AMPs from AMPSphere primarily exert their effects by permeabilizing the outer membrane rather than depolarizing the cytoplasmic membrane, revealing a similar mechanism of action to that observed for classical AMPs and EPs.
The skin abscess infection was established by applying 20 μL of A. baumannii cells at 1·10⁶ CFU·mL⁻¹ onto the wounded area of the dorsal epidermis (Fig. 7A). A single dose of each peptide, at its respective MIC (Fig. 6A), was administered to the infected area. Two days post-infection, synechocin-1 and actynomycin-1 presented bacteriostatic activity, inhibiting the proliferation of A. baumannii cells, whereas lachnospirin-1 and enterococcin-1 presented bactericidal activity similar to that of the antibiotic polymyxin B (MIC = 0.25 μmol·L⁻¹), reducing colony-forming unit (CFU) counts by up to 3-4 orders of magnitude (Fig. 7B). Four days post-infection, neither the AMPs nor polymyxin B achieved a statistically significant reduction of A. baumannii growth at the infection site, although treatment with proteobactin-1, lachnospirin-1, enterococcin-1, and polymyxin B reduced the CFU counts by 1-2 orders of magnitude compared to the untreated control. These results are promising since the AMPSphere peptides were administered only once immediately after the establishment of the abscess, highlighting their anti-infective potential.
Mouse weight was monitored as a proxy for toxicity and no significant changes were observed (Fig. 7C), suggesting that the peptides tested are not toxic.

Discussion
Here, we use ML to identify thousands of novel candidate AMPs in the global microbiome.
Building on previous studies that focused specifically on the human gut microbiome 5,36,67 , we cataloged AMPs from the global microbiome across 63,410 publicly available metagenomes, as well as 87,920 high-quality microbial genomes from the ProGenomes v2 database 41 , leading to the creation of AMPSphere (https://ampsphere.big-data-biology.org/), a publicly available resource encompassing 863,498 non-redundant peptides and 6,499 high-quality AMP families from 72 different habitats, including marine and soil environments and the human gut. We show that most of the c_AMPs (91.5%) were previously unknown, lacking detectable homologs in other databases, and about one in five could be detected in independent sets of meta-transcriptomes or metaproteomes.
Two evolutionary mechanisms by which AMPs may be generated were explored. First, mutations in genes encoding longer proteins could generate gene fragments. Among the enriched ortholog groups of proteins from GMGCv1 46 homologous to c_AMPs, we observed that a majority of groups had unknown function (53.8%), similar to what was reported by Sberro et al. 36 for small proteins from the human gut microbiome. The second mechanism is that a gene duplication could be followed by mutation, which we observed in the case of ribosomal proteins. Ribosomal proteins can harbor antimicrobial activity 50 , possibly due to their amyloidogenic properties 68 . Nonetheless, the majority of identified AMPs do not have detectable homology to other sequences, highlighting their novelty. The lack of observed homology, however, may be due to limitations in our ability to robustly detect these homology relationships in small sequences, but there is also the possibility that small proteins, such as AMPs, may be more likely to be generated de novo and may have repeatedly evolved in various taxa 69 .
Four out of the five genera with the most c_AMPs present in AMPSphere share a host-associated lifestyle. Three of these, including Prevotella and Faecalibacterium, are common in animal hosts (Fig. 4). The greater density of c_AMP genes in genera from the Bacillota and Bacillota A phyla is consistent with the well-known diversity of ABC transporters dedicated to the translocation of AMPs found in that group, resulting in improved resistance to compounds that bind extracellular targets; for a review, see Gebhard 70 .
We observed that c_AMPs from AMPSphere are habitat-specific and mostly accessory members of microbial pangenomes. Moreover, species-specific density (ρ_AMP) shows that the habitat plays an important role in shaping c_AMP content. The ρ_AMP of strains from the same species can differ even across body sites. In particular, we observed higher ρ_AMP in the human gut compared to the human oral cavity, in agreement with a recent report of a strain-specific AMP (cutimycin), which is present in only some of the hair follicles in the same human host 71 .
Valles-Colomer et al. 57 , who recently analyzed a large collection of human-associated metagenomes, provide a species-specific index of transmissibility for the several transmission scenarios they study (e.g., mother to infant). Hypothesizing that AMP production may be related to transmission, we correlated the species-specific ρ_AMP calculated in AMPSphere with transmission scores. In both the human gut and oral microbiomes, species with higher ρ_AMP are less transmissible, possibly because AMPs confer protection against strain replacement. Taken together, our results and those of Valles-Colomer et al. 57 validate the AMP density concept and the applicability of AMPSphere resources to study mechanisms of microbial establishment and competition.
Finally, we experimentally validated predictions made by our ML model and found that 72% (36 out of the 50) of the synthesized AMPs displayed antimicrobial activity against either pathogens or commensals. Notably, four peptides (cagicin-1, cagicin-4, and enterococcin-1 against A. baumannii; and cagicin-1 and lachnospirin-1 against vancomycin-resistant E. faecium) presented MIC values as low as 1 μmol·L⁻¹, comparable to the MICs of some of the most potent peptides previously described in the literature 64,65 .
We show that AMPs from the AMPSphere tended to target clinically relevant Gram-negative pathogens and also showed activity against vancomycin-resistant E. faecium. Although conventional AMPs do not target microbiome bacteria 62 , AMPs from AMPSphere showed efficacy against these bacteria, suggesting potential ecological implications of peptides as protective agents for their producing organisms. Notably, the amino acid composition and physicochemical characteristics of the c_AMPs in AMPSphere differed from those recently identified in EPs 3 . Overall, our findings unveil a wide array of novel AMP sequences, highlighting the potential of machine learning in the identification of much-needed antimicrobials.
bioRxiv preprint doi: https://doi.org/10.1101/2023.08.31.555663; this version posted September 11, 2023. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.

Acknowledgments
We thank Marija Dmitrijeva (University of Zurich) for her helpful comments on a previous version of the manuscript. We thank members of the Coelho group and de la Fuente Lab for insightful discussions. Cesar de la Fuente-Nunez holds a Presidential Professorship at the University of

Declaration of interests
Cesar de la Fuente-Nunez provides consulting services to Invaio Sciences and is a member of the Scientific Advisory Boards of Nowture S.L. and Phare Bio.
This preprint is made available under a CC-BY 4.0 International license.

Lead contact
Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contacts: Cesar de la Fuente-Nunez (cfuente@upenn.edu) and Luis Pedro Coelho (luispedro@big-data-biology.org).

Materials availability
Peptides were obtained from AAPPTec and synthesized using solid-phase peptide synthesis.
AMPSphere is also available as a public online resource: https://ampsphere.big-data-biology.org/
Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.

Bacterial strains and growth conditions
The pathogenic strains Acinetobacter baumannii ATCC 19606, Escherichia coli ATCC 11775,

Skin abscess infection mouse model
To assess the anti-infective efficacy of the peptides against A. baumannii ATCC 19606 in a skin abscess infection mouse model, the bacteria were cultured in tryptic soy broth (TSB) medium until an OD600 of 0.5 was reached. Next, the cells were washed twice with sterile PBS (pH 7.4) and suspended to a final concentration of 5 × 10⁶ colony-forming units (CFU)·mL⁻¹. Six-week-old female CD-1 mice, after being anesthetized with isoflurane, were subjected to a superficial linear skin abrasion on their backs in an area that they could not touch with their mouth or limbs. An aliquot of 20 μL containing the bacterial load was then administered over the abraded area. [...] experimental groups consisted of 3 mice per group (n = 3), and each mouse was infected with an inoculum from a different colony to ensure variability.

Selection of metagenomes and high-quality microbial genomes
Selection of metagenomes and genomes to compose AMPSphere was similar to that adopted by Coelho et al. 46 . Only metagenomes that were public on 1 January 2020, were produced with Illumina instruments (except for MiSeq), and contained at least 2 million reads with an average length of at least 75 bp were downloaded from the European Nucleotide Archive (ENA). These samples met at least one of two criteria: (1) being tagged with taxonomy ID 408169 (for metagenome) or a descendant of it in the taxonomic tree; (2) coming from experiments with the library source listed as "METAGENOMIC". Samples were grouped by project, and all projects with at least 20 samples were considered. Additionally, studies from a whitelist, including metagenomes deposited by the Integrated Microbial Genomes System (IMG) that were missing from ENA, were also included. Metadata were manually curated from the describing literature and the BioSamples database 95 . The habitat classification took the metadata into account, creating groups based on the similarity of habitat conditions, such as air, anthropogenic, aquatic, host-associated, ph:alkaline, sediment, terrestrial, and others. The sample origins and other relevant information related to host species were assessed using the NCBI taxonomic identification number. High-quality microbial genomes were selected from the ProGenomes2 database 41 . The resulting 63,410 publicly available metagenomes and 87,920 high-quality microbial genomes are listed in Table SI1.

Reads trimming and assembly
Reads were processed using NGLess 72 , trimming positions with quality lower than 25 and discarding reads shorter than 60 bp after trimming. For metagenomes obtained from a host-associated microbiome, reads mapping to the host genome were filtered out when a host genome was available. Reads were assembled with MEGAHIT 1.2.9 89 , and the taxonomy of the 16,969,685,977 contigs assembled from more than 14.7 trillion base pairs of sequenced DNA was inferred as previously described 96 , using MMSeqs2 75 to map the sequences against the GTDB release 95 55,56 .

smORF and AMP prediction
Analogously to Sberro et al. 36 , we used a modified version of Prodigal 32 to predict smORFs (33 to 303 bp) from contigs. The 4,599,187,424 redundant smORFs, most of which (99.25%) originated in metagenomes, were then de-duplicated, yielding 2,724,621,233 non-redundant smORFs. Macrel 40 was run on the de-duplicated smORFs to predict c_AMPs. Singleton sequences (those appearing in a single sample or genome) were eliminated, except when they had a significant match (amino acid identity ≥ 75% and E-value ≤ 10⁻⁵) to a sequence from the Data Repository of Antimicrobial Peptides [...]

[...] respectively. The alignment score was then converted to an E-value, according to the model by Karlin and Altschul 101 , using the values of the κ (0.132539) and λ (0.313667) constants adjusted to search for a short input sequence as implemented in the BLAST algorithm 102,103 . Alignments were considered significant when their E-value was less than 10⁻⁵. We found that more than 95.3% of alignments produced in the first two levels (100% and 100-85% of identity) were significant, along with 77.1% of those from the third level (100-85-75% of identity); see Fig. SI3.
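The score-to-E-value conversion mentioned above can be illustrated with a minimal sketch of the Karlin-Altschul relation E = K·m·n·exp(−λS), using the two constants quoted in the text. The interpretation of m and n as query length and total database length, the function names, and the example values are assumptions made for illustration only; the length corrections applied by BLAST are not reproduced here.

```python
import math

KAPPA = 0.132539   # K constant quoted in the text
LAMBDA = 0.313667  # lambda constant quoted in the text

def karlin_altschul_evalue(raw_score, query_len, db_len, kappa=KAPPA, lam=LAMBDA):
    """E = K * m * n * exp(-lambda * S) for a raw alignment score S.

    query_len (m) and db_len (n) are assumed to be plain residue counts.
    """
    return kappa * query_len * db_len * math.exp(-lam * raw_score)

def is_significant(raw_score, query_len, db_len, threshold=1e-5):
    """Apply the E-value cutoff used in the text (E-value below 1e-5)."""
    return karlin_altschul_evalue(raw_score, query_len, db_len) < threshold
```

Under these assumptions, a higher raw score always yields a smaller E-value for a fixed search space, so the cutoff translates into a minimum score that grows with the logarithm of m·n.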

Quality control of c_AMPs
The genes of c_AMPs were subjected to 5 different quality tests to reduce the likelihood that the observed peptides were artifacts or fragments of larger proteins. Initially, the peptides were searched against AntiFam v.7.0 93 using HMMSearch 85 with the option "--cut_ga", and significant hits were classified as spurious. We observed that 99.9% of the c_AMPs in AMPSphere do not belong to AntiFam families.
A test for terminal positioning in contigs checked whether there were in-frame stop codons upstream of the smORF coding for a given c_AMP. When no stop codon is found, we cannot rule out the possibility that the smORF is part of a larger gene due to a fragmentary assembly. Most (68.4%) of the c_AMPs are encoded by at least one gene that is not terminally placed.
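The terminal-placement test described above can be sketched as a scan for in-frame stop codons upstream of the smORF. This is only an illustrative sketch: it assumes a forward-strand smORF, a 0-based start coordinate, and the three standard stop codons; the function name is hypothetical and reverse-strand handling is omitted.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumed standard stop codons

def has_upstream_inframe_stop(contig, orf_start):
    """Return True if a stop codon occurs upstream of orf_start in the same frame.

    contig: nucleotide string; orf_start: 0-based index of the first base of the smORF.
    A False result means the smORF may be a fragment of a longer gene.
    """
    pos = orf_start - 3
    while pos >= 0:
        if contig[pos:pos + 3].upper() in STOP_CODONS:
            return True
        pos -= 3  # step back one codon at a time to stay in frame
    return False
```

A smORF starting within the first few bases of a contig will typically return False, which is the situation in which a fragmentary assembly cannot be ruled out.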
The RNAcode program 104 predicts protein-coding regions based on evolutionary signatures typical for protein genes. This analysis depends on a set of homologous and non-identical genes. Therefore, AMP clusters containing at least 3 gene variants were aligned. Given that a large portion of the AMPSphere candidates, 53% (459,910 out of 863,498), is not part of such a cluster, they could not be tested. Of the tested c_AMPs, 53% (215,421 out of 403,588) were considered genes with evolutionary traits of protein-coding sequences.
To further support the c_AMPs with experimental evidence, we checked for evidence of transcription and/or translation using a set of 221 publicly available metatranscriptome sets, comprising human gut (142), peat (48), plant (13), and symbionts (17), and 109 publicly available metaproteomes from 37 habitats (Table SI9). Using bwa v.0.7.17 105 , AMP genes were mapped against the reads from the metatranscriptomes, and with NGLess 72 , we selected those genes with at least 1 read mapped across a minimum of two samples. Using regular-expression methods implemented in Python 3.8 76 , k-mers of all AMPSphere peptides (with length equal to at least half the length of the sequence) were checked against peptide sequences in metaproteomics data. A perfect match between a k-mer and a metaproteomic peptide for more than half the length of the sequence was considered additional evidence that the c_AMP is likely to be expressed, as described by Ma et al. 5 . Briefly, the number of mapped peptides against the set of samples was counted and those peptides with at least 1 match covering more than 50% of the peptide were marked as detected. c_AMPs with experimental evidence in metatranscriptomes and/or metaproteomes accounted for 1/5 of the AMPSphere.
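The metaproteome matching rule above can be sketched as follows. This is a simplified illustration that uses plain substring matching instead of the regular expressions mentioned in the text; the function name and the choice of the shortest k-mer length covering more than 50% of the peptide are assumptions.

```python
def is_detected(amp, metaproteome_peptides):
    """Return True if any metaproteomic peptide contains an exact stretch of the
    c_AMP that covers more than half of the c_AMP length."""
    k = len(amp) // 2 + 1  # shortest k-mer covering more than 50% of the peptide
    kmers = {amp[i:i + k] for i in range(len(amp) - k + 1)}
    return any(kmer in peptide
               for peptide in metaproteome_peptides
               for kmer in kmers)
```

Checking only k-mers of the minimum length is sufficient, because any longer exact match necessarily contains one of them.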
We separated AMPs passing all quality-control tests into those with experimental evidence of translation/transcription (17,115 c_AMPs, ~2% of AMPSphere) and those without it (63,098 c_AMPs, 7%). Quality filters for families consisted of keeping only those families with ≥ 75% of their c_AMPs passing all quality-control tests or having at least 1 c_AMP with experimental evidence of translation/transcription.

Sample-based c_AMPs accumulation curves
For each habitat and group of habitats, we computed the sample-based accumulation curves by randomly sampling metagenomes 32 times in steps of 10 metagenomes. At each step, the number of unique c_AMPs found was computed, and the curves were drawn with the average obtained across the permutations.
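A minimal sketch of this sample-based accumulation procedure is shown below. It assumes that each metagenome is represented as a set of c_AMP identifiers; the function name, the seed handling, and the data layout are illustrative assumptions rather than the implementation used for AMPSphere.

```python
import random

def accumulation_curve(samples, step=10, permutations=32, seed=0):
    """samples: list of sets of c_AMP identifiers, one set per metagenome.

    Returns a list of (number of metagenomes, mean number of unique c_AMPs),
    averaged over random orderings of the metagenomes.
    """
    rng = random.Random(seed)
    checkpoints = list(range(step, len(samples) + 1, step))
    totals = [0] * len(checkpoints)
    for _ in range(permutations):
        order = samples[:]
        rng.shuffle(order)
        seen = set()
        idx = 0
        for i, sample in enumerate(order, start=1):
            seen |= sample
            if idx < len(checkpoints) and i == checkpoints[idx]:
                totals[idx] += len(seen)  # unique c_AMPs after i metagenomes
                idx += 1
    return [(n, total / permutations) for n, total in zip(checkpoints, totals)]
```

A curve that keeps rising at the last checkpoint suggests that additional metagenomes from that habitat would still contribute previously unseen c_AMPs.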

Multi-habitat and rare c_AMPs
The c_AMPs present simultaneously in ≥2 habitat groups were computed. To test the significance of this number, we opted for an approach similar to that described in Coelho et al. 46 . The number of c_AMPs present in more than 1 habitat (high-level or general), here considered "multi-habitat", was determined by counting the hits for each c_AMP in each sample. After that, the habitat labels for each sample were shuffled 100 times and the number of randomly obtained multi-habitat c_AMPs was counted. The Shapiro-Wilk test was used to check that the data were normally distributed (general habitats, p = 0.49; high-level habitats, p = 0.1). The shuffling resulted in 676,489.7 ± 4,281.8 multi-habitat c_AMPs by chance for high-level habitat groups and 685,477.17 ± 4,369.6 multi-habitat c_AMPs by chance for general habitats. High-level habitat groups presented 93,280 multi-habitat c_AMPs, while general habitats presented 173,955 multi-habitat c_AMPs. These values were 136.21 and 117.1 standard deviations below the values expected by chance, respectively, which was significant in both cases (p < 10⁻³⁰⁰).

To determine the rarity of c_AMPs, the non-redundant genes in AMPSphere were mapped against the reads of metagenome samples using NGLess 72 . We considered only uniquely mapped reads. From the mapping, we computed the c_AMPs detected per sample and the number of detections per c_AMP, considering as "rare" the c_AMPs detected fewer times than the average of the entire AMPSphere (682 detections or 1% of all samples). This approach was adopted to overcome the high computational costs involving a competitive mapping procedure. We expect that our approach overestimates how prevalent the c_AMPs are and that, because of this, it is a robust way to estimate the rarity observed in c_AMPs.
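The label-shuffling test for multi-habitat c_AMPs can be sketched as below. The data layout (a mapping from sample to the set of c_AMPs detected in it and a mapping from sample to habitat), the function names, and the seed are assumptions made for illustration.

```python
import random
import statistics

def count_multihabitat(sample_amps, sample_habitat):
    """Count c_AMPs detected in samples from at least two different habitats."""
    habitats_per_amp = {}
    for sample, amps in sample_amps.items():
        for amp in amps:
            habitats_per_amp.setdefault(amp, set()).add(sample_habitat[sample])
    return sum(1 for habitats in habitats_per_amp.values() if len(habitats) >= 2)

def multihabitat_zscore(sample_amps, sample_habitat, n_shuffles=100, seed=0):
    """Compare the observed count with counts obtained after shuffling habitat labels."""
    rng = random.Random(seed)
    observed = count_multihabitat(sample_amps, sample_habitat)
    samples = list(sample_habitat)
    labels = [sample_habitat[s] for s in samples]
    null = []
    for _ in range(n_shuffles):
        rng.shuffle(labels)  # break the link between samples and habitats
        null.append(count_multihabitat(sample_amps, dict(zip(samples, labels))))
    mean, sd = statistics.mean(null), statistics.stdev(null)
    z = (observed - mean) / sd if sd > 0 else float("nan")
    return observed, mean, sd, z
```

A strongly negative Z-score, as reported above, indicates that far fewer c_AMPs are shared across habitats than would be expected if habitat labels were unrelated to c_AMP content.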

Significance of the overlap of c_AMP contents
Similar to the significance testing of multi-habitat c_AMPs, the number of overlapping c_AMPs was computed for each pair of habitats (general and high-level). We shuffled the sample labels 1,000 times, counting the number of randomly overlapping c_AMPs for each pair of habitats, and used the Shapiro-Wilk test to verify normal distributions. Then, we estimated the probability of observing the overlap by Chebyshev's inequality, which does not rely on normal distributions: $p \le 1/Z^2$, where Z stands for the Z-score computed from the average and standard deviations estimated by the shuffling procedure. The probabilities were adjusted using Holm-Sidak implemented in multipletests from the statsmodels package 106 , and those below 5% were considered significant.
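The Chebyshev bound and the multiple-testing adjustment can be sketched in a few lines. The block below is a pure-Python illustration of the step-down Holm-Sidak procedure written for this example, not a call to the statsmodels implementation cited above; the function names are hypothetical.

```python
def chebyshev_p(observed, null_mean, null_sd):
    """Upper bound on the probability of a deviation at least this large: p <= 1/Z^2."""
    z = (observed - null_mean) / null_sd
    if z == 0:
        return 1.0
    return min(1.0, 1.0 / (z * z))

def holm_sidak(pvalues):
    """Step-down Holm-Sidak adjustment; returns adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        adj = 1.0 - (1.0 - pvalues[i]) ** (m - rank)
        running_max = max(running_max, adj)  # keep adjusted values monotone
        adjusted[i] = min(1.0, running_max)
    return adjusted
```

Because the Chebyshev bound can never fall below 1/Z², it is conservative compared with a normal approximation, which is why it remains valid when the shuffled counts are not normally distributed.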

Differences in the c_AMP density in microbial species from different habitats
The c_AMP density was defined as $\rho_{AMP} = n_{c\_AMPs} / L$, where $n_{c\_AMPs}$ is the number of c_AMP redundant genes and $L$ is the number of assembled base pairs. It was computed per sample to verify the differences between animal host- and non-host-associated samples, including only habitats with ≥100 samples. The calculated densities were filtered using Tukey's fences calculated with k = 1.5 to eliminate outliers.
This density was also calculated at phylum, species, and genus levels, summing all assembled base pairs for contigs assigned to each one of those taxonomy levels in the samples used in AMPSphere. In that case, we assume, as an approximation, that in a large assembled segment, the start positions of AMP genes are independent and uniformly random. Thus, we calculated the standard sample proportion error with the formula $STD_{err} = \sqrt{\rho_{AMP} (1 - \rho_{AMP}) / L}$. The error was used to calculate the margin of error at a 95% confidence interval (Z = 1.96). Genera, phyla and species with a margin of error greater than 10% of the calculated $\rho_{AMP}$ were eliminated, along with outliers according to Tukey's fences (k = 1.5). To verify the effects of different habitats on the $\rho_{AMP}$ of species, we took the density calculated per species per sample from species present in ≥2 habitats in ≥10 samples per habitat and tested their medians using the Mann-Whitney U test implemented in the scipy package 80 .
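The density, margin-of-error, and outlier filters described in this section can be sketched as follows. The function names are hypothetical, and the quartile convention (the default of Python's `statistics.quantiles`) is an assumption that may differ slightly from the one used in the original analysis.

```python
import math
import statistics

def amp_density(n_amp_genes, assembled_bp):
    """rho_AMP = number of redundant c_AMP genes / assembled base pairs."""
    return n_amp_genes / assembled_bp

def margin_of_error(rho, assembled_bp, z=1.96):
    """Margin of error at a 95% confidence level from the proportion standard error."""
    return z * math.sqrt(rho * (1.0 - rho) / assembled_bp)

def passes_precision_filter(n_amp_genes, assembled_bp, max_relative_error=0.10):
    """Keep a taxon only if its margin of error is at most 10% of rho_AMP."""
    rho = amp_density(n_amp_genes, assembled_bp)
    if rho == 0:
        return False
    return margin_of_error(rho, assembled_bp) <= max_relative_error * rho

def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences; values outside them are treated as outliers."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr
```

For example, under this sketch a taxon with 10 c_AMP genes in 1 Mbp fails the precision filter, whereas one with 1,000 c_AMP genes in 1 Mbp passes, which illustrates why sparsely sampled taxa are excluded.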
We verified the species abundances in each sample using mOTUs 2 107 . None of the genera with the highest $\rho_{AMP}$ (Algorimicrobium, TMED78, SFJ001, STGJ01, and CAG-462) were detected in the mOTUs profiles, which indicates that they are not abundant microbes.
To evaluate the effects of potential fragments in our density analysis, all tests and comparisons were also performed restricting the set of c_AMPs to those with at least one stop codon upstream of the gene. The c_AMP density results observed for the whole of AMPSphere did not change when controlling for quality (Fig. SI5).
Differences between habitats for each species, genus, or sample groups were tested using Mann-Whitney U and Kruskal-Wallis tests implemented in the scipy package 80 . P-values were adjusted using the Holm-Sidak method implemented in multipletests from the statsmodels package 106 .

Determination of accessory AMPs
Core, shell, and accessory c_AMP clusters were determined using the subset of c_AMPs obtained from ProGenomes v2 41 because of their high-confidence assigned taxonomies and the genomically-defined species (specI). To increase confidence in our measures, only species containing ≥10 genomes were considered. c_AMPs and families (≥8 c_AMPs) present in fewer than 50% of the genomes from a microbial species were classified as accessory. Those c_AMPs and families present in 50-95% of the genomes in the cluster were classified as shell 108 , and those present in >95% of the genomes were classified as core genes 51 .
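The prevalence thresholds above can be written as a small classification rule. This is a minimal sketch under the stated cutoffs; the function name and the choice to return `None` for species with too few genomes are assumptions for illustration.

```python
def classify_prevalence(n_genomes_with_amp, n_genomes_in_species, min_genomes=10):
    """Classify a c_AMP (or family) within one species by its prevalence.

    Returns "core" (>95%), "shell" (50-95%), "accessory" (<50%),
    or None when the species has fewer than min_genomes genomes.
    """
    if n_genomes_in_species < min_genomes:
        return None
    prevalence = n_genomes_with_amp / n_genomes_in_species
    if prevalence > 0.95:
        return "core"
    if prevalence >= 0.50:
        return "shell"
    return "accessory"
```

Note that with exactly 10 genomes, a c_AMP must be present in all of them to be called core, since 9 out of 10 falls in the shell range.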
We used FastANI v.1.33 88 to cluster genomes in species in ProGenomes v2 41 , keeping one randomly selected representative for each clonal complex (ANI ≥ 99.99%) and inferring strains (99.5% ≤ ANI < 99.99%) as in Rodriguez et al. 109 [...] redundancy were kept. To verify the propensity of AMPs to be shared between genomes belonging to the same strain, we enumerated all possible pairs of genomes per species and counted the pairs sharing AMPs, testing the results with Fisher's exact test implemented in the scipy package 80 . We also extracted the predicted full-length proteins from the ENA database for each genome and hierarchically clustered them after alphabet reduction in a fashion similar to that described in the topic "AMP families", keeping those clusters with ≥8 sequences for each species. The prevalence of full-length protein families within a species was computed as described above, and the number of core families was compared to the number of c_AMP core families using a probability calculated as the number of species with a proportion of core full-length protein families less than or equal to that observed for c_AMPs, divided by the total number of assessed species.
To determine the genotype of Mycoplasma pneumoniae samples in ProGenomes2 41 , the gene coding for P1 adhesin 53 was mapped against the genomes using the reference gene NZ_LR214945.1:c568695-567307 with bwa 105 , and later extracted with SAMtools 110 and BEDtools 111 .
The extracted genes were aligned using Clustal Omega 112 , and a phylogenetic tree was built using nucleotide sequences and FastTree 2 87 with the restricted time-reversible substitution model and a bootstrapping procedure with 1,000 pseudo-replicates to determine node support. The tree was used to segregate and classify genomes taking the strain type of reference genomes from Diaz et al. 54 and was consistent with the previously established groups.

Annotation of AMPs using different datasets
Databases used in the annotation were the small protein sets in SmProt 2 45 , the bioactive peptides database starPepDB 45k 91 , the small proteins from the global data-driven census of Salmonella 113 , the global microbial gene catalog GMGCv1 46 , and a specific AMP database, DRAMP 3.0 44 . To restrict the analysis to sequences that were unlikely to be artifacts of assembly, only c_AMPs passing the terminal placement test were searched against GMGCv1 46 . The AMPs were annotated using MMseqs2 75 with the 'easy-search' method, retaining hits with a maximum E-value of 10⁻⁵. To normalize coordinates of hits to the full-length protein, we corrected for the elimination of the initial methionine performed by Macrel 40 , so that hits starting at the second amino acid were considered as if they matched the first one (as the peptide has had its initial methionine removed).
We used the hypergeometric test implemented in the scipy package 80 to model the association between c_AMPs and the background distribution of ortholog groups from GMGCv1 46 . To that end, the number of genes in the redundant GMGCv1 46 for each ortholog group was computed along with the counts for ortholog groups in the top hits to AMPSphere. The enrichment was given as the proportion of hits presenting a given ortholog group divided by the proportion of that ortholog group among the redundant sequences in GMGCv1 46 and was considered significant when p < 0.05 after a correction with the Holm-Sidak method implemented in multipletests from the statsmodels package 106 . As a robustness check, we filtered the OGs by the number of c_AMP hits and GMGCv1 46 hits associated with them, using a minimum of 10, 20, or even 100 proteins; the results remained similar to those obtained with all data.
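The enrichment ratio and its hypergeometric p-value can be illustrated with the sketch below. It computes the upper tail exactly with integer binomial coefficients instead of calling scipy, which is practical only for small counts; the function names and the framing of the test as P(X ≥ k) are assumptions for illustration.

```python
from math import comb

def hypergeom_sf(k, population, successes, draws):
    """P(X >= k) when drawing `draws` items without replacement from a population
    of size `population` that contains `successes` items of interest."""
    upper = min(successes, draws)
    total = comb(population, draws)
    tail = sum(comb(successes, i) * comb(population - successes, draws - i)
               for i in range(k, upper + 1))
    return tail / total

def enrichment(hits_with_og, total_hits, og_in_background, background_size):
    """Fold enrichment of an ortholog group among c_AMP hits and its p-value."""
    fold = (hits_with_og / total_hits) / (og_in_background / background_size)
    p_value = hypergeom_sf(hits_with_og, background_size, og_in_background, total_hits)
    return fold, p_value
```

For catalog-scale counts such as those in GMGCv1, a library routine working in log space would be the more practical choice; the exact sum is shown here only to make the definition explicit.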
To check for genomic entities generated after gene truncation, we screened for c_AMP homologs using the default settings of BLASTn 103 against the NCBI database 94 , keeping only significant hits with a maximum E-value of 10⁻⁵. We selected AMP10.271_016, predicted to be produced by Prevotella jejuni, which shares the start codon with the gene coding for a NAD(P)-dependent dehydrogenase (WP_089365220.1). Using Biopython 83 , we codon-aligned the fragments from metagenomic contigs assembled from samples SAMN09837386, SAMN09837387, and

Positive selection tests
The genes of c_AMPs belonging to 100 randomly sampled high-quality families were codon-aligned, excluding identical sequences and the stop codons. Selection tests were run using HyPhy [...]

Genomic context conservation analysis
We mapped the 863,498 AMP sequences against a collection of 169,632 reference genomes, MAGs and SAGs curated elsewhere 49 with DIAMOND 116 in blastp mode. Hits with identity > 50% (amino acid) and query and target coverage > 90% were considered significant. A total of 107,308 AMPs have homologs in at least one genome. We built gene families from the hits of each AMP detected in the prokaryotic genomes, and calculated a conservation score of the functional annotation of the neighboring genes in a window of 3 genes up and downstream. The vertical conservation score of each OG, KEGG pathway, KEGG orthology, and KEGG module (from the Kyoto Encyclopedia of Genes and Genomes, KEGG 117 ), PFAM 33.1 92,97 , and CARD 118 term at each position was calculated as the number of genes with a given functional annotation divided by the number of genes in the family.
AMPs with more than 2 hits and a vertical conservation score > 0.9 with any functional term were considered to have conserved genomic contexts.
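The vertical conservation score and the resulting conservation call can be sketched as below. The data layout (for each relative position, one set of functional terms per family member) and the function names are assumptions made for illustration.

```python
from collections import Counter

def vertical_conservation(neighbor_annotations):
    """neighbor_annotations: list with one set of functional terms per family member,
    all taken from the neighbor gene at the same relative position.

    Returns {term: fraction of family members whose neighbor carries the term}.
    """
    n = len(neighbor_annotations)
    counts = Counter(term for terms in neighbor_annotations for term in set(terms))
    return {term: count / n for term, count in counts.items()}

def has_conserved_context(hits_by_position, min_hits=3, threshold=0.9):
    """hits_by_position: {relative position (e.g. -3..+3): list of term sets}.

    An AMP is called conserved if it has more than 2 hits and any term at any
    position has a vertical conservation score above the threshold.
    """
    for annotations in hits_by_position.values():
        if len(annotations) < min_hits:
            continue
        scores = vertical_conservation(annotations)
        if any(score > threshold for score in scores.values()):
            return True
    return False
```

With three family members, a term must annotate the neighbor in all three genomes to exceed the 0.9 threshold, so small families are held to a strict standard.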
To test whether the number of AMPs with conserved genomic neighbors is higher than in other gene families within the 169,632 genomes curated by del Río et al. 49 , we used MMSeqs2 75 to build de novo gene families (establishing a minimum amino acid identity of 30%, coverage of the shorter sequence of at least 50%, and a maximum E-value of 10⁻³). We then took 10,000 random sets of 55,191 gene families with more than 2 members (the number of AMPs with more than two homologs in the genomes) composed of: (i) small (< 50 residues) proteins, and (ii) proteins without length restriction.
Then, we computed the number of gene families showing conserved genomic contexts with known functions within each set and confirmed their normal distribution using the Shapiro-Wilk test 80 . Later, the conservation values obtained for AMPSphere were compared to those sets using P-values computed from their respective Z-scores.
To verify the ortholog groups from c_AMPs and the gene neighborhood, the peptides were also annotated using EggNOG-mapper v2 84 using default settings and selecting the best hit for each c_AMP, by filtering the lowest E-value and highest best score. Their KEGG ortholog groups (KOs) were used as functional labels to cluster and verify the gene neighborhood in terms of functional similarity. It was possible to annotate 56.1% (60,173 out of 107,308) of c_AMPs with hits to the genome set tested using the EggNOG5 database 48 , with 9.1% of them missing COG categories, and about 18.1% of them belonging to translation-related functions (J), 14.4% belonging to unknown function proteins (S), and 9% of them belonging to replication, recombination, and repair (L).

c_AMPs and bacterial species transmissibility
We used the species taxonomy and transmissibility indexes calculated by Valles-Colomer et al. 57 to examine the effect of AMPs on the transmission of bacterial species from mother to children. Only those species overlapping AMPSphere and the datasets from Valles-Colomer et al. 57 were kept for this analysis, and their AMP densities were calculated separately for samples from the human gut and human oral cavity as mentioned in the section "Differences in the c_AMP density in microbial species from different habitats". The AMP density and the coefficient of transmissibility were correlated using Spearman's method implemented in the scipy package 80 to keep the comparisons robust, and were tested for different situations, e.g., following children's microbiome after 1, 3, and up to 18 years, as well as cohabitation and intra-datasets. Missing data were omitted from the calculations.
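The rank correlation with omission of missing data can be illustrated with a small pure-Python sketch. It is written for this example and is not the scipy routine cited above; the function names and the use of `None` to mark missing values are assumptions.

```python
import math

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(densities, transmissibilities):
    """Spearman correlation; species with a missing value (None) are omitted."""
    pairs = [(d, t) for d, t in zip(densities, transmissibilities)
             if d is not None and t is not None]
    rx = average_ranks([d for d, _ in pairs])
    ry = average_ranks([t for _, t in pairs])
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)
```

Working on ranks rather than raw values makes the coefficient insensitive to the skewed distributions that densities and transmissibility indexes can have.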

AMPSphere web resource
AMPSphere is available at https://ampsphere.big-data-biology.org/. The implementation is based on Python 76 and Vue JavaScript. The database was built with SQLite, and SQLAlchemy was used to map the database to Python objects. Internal and external APIs were built using FastAPI, with Gunicorn to serve them. In the front end, Vue 3 was used as the backbone and Quasar was used to build the layout. Plotly was used to generate interactive visualization plots, and Axios to render content seamlessly. LogoJS (https://logojs.wenglab.org/app/) is used to generate sequence logos for AMP families, while the helical wheel app (https://github.com/clemlab/helicalwheel) generates AMP helical wheels.

Selection of peptides for synthesis and activity testing
Only high-quality (see the topic "Quality control of c_AMPs") c_AMPs were considered for synthesis. They were filtered according to 6 criteria for solubility and 3 criteria for synthesis, as used before in PepFun 119 . A peptide approved for at least 6 of these criteria was then filtered by predicting AMP activity with 6 methods in addition to Macrel 40 : AMPScanner v2 120 , ampir 38 , amPEPpy 121 , APIN 122 , and AMPLify 123 . Peptides predicted to be AMPs by all methods were filtered by length, discarding sequences longer than 40 amino acid residues, for which conventional solid-phase peptide synthesis using Fmoc strategy has lower yields and many recoupling reactions [124][125][126][127] . AMPs were sorted by their abundance (the number of redundant genes), keeping the most abundant peptide per AMP family.
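The final selection steps (agreement of all predictors, the 40-residue length limit, and keeping the most abundant peptide per family) can be sketched as below. The record fields are hypothetical names chosen for this example, and the solubility and synthesis criteria are not reproduced here.

```python
def select_candidates(peptides, max_length=40):
    """peptides: iterable of dicts with the (hypothetical) keys
    'sequence', 'family', 'n_genes' (number of redundant genes) and
    'all_predictors_positive' (True if every AMP predictor agrees).

    Returns one peptide per family, sorted from most to least abundant.
    """
    best = {}
    for peptide in peptides:
        if not peptide["all_predictors_positive"]:
            continue
        if len(peptide["sequence"]) > max_length:
            continue  # longer peptides are discarded before synthesis
        current = best.get(peptide["family"])
        if current is None or peptide["n_genes"] > current["n_genes"]:
            best[peptide["family"]] = peptide
    return sorted(best.values(), key=lambda p: p["n_genes"], reverse=True)
```

Keeping a single representative per family favors sequence diversity among the synthesized peptides over repeated testing of close variants.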

Outer membrane permeabilization assays
Membrane permeability was analyzed using the 1-(N-phenylamino)naphthalene (NPN) uptake assay. NPN demonstrates weak fluorescence in an extracellular environment but displays strong fluorescence when in contact with lipids from the bacterial outer membrane. Thus, NPN will show increased fluorescence when the integrity of the outer membrane is compromised. A. baumannii ATCC 19606 and P. aeruginosa PAO1 were cultured until cell numbers reached an OD600 of 0.4, followed by centrifugation (10,000 rpm at 4 °C for 3 min), washing, and resuspension in buffer (

Cytoplasmic membrane depolarization assays
The ability of the peptides to depolarize the cytoplasmic membrane was assessed by measuring the fluorescence of the membrane potential-sensitive dye 3,3'-dipropylthiadicarbocyanine iodide [DiSC3-(5)]. This potentiometric fluorophore fluoresces upon release from the interior of the cytoplasmic membrane in response to an imbalance of its transmembrane potential. A. baumannii ATCC 19606 and P. aeruginosa PAO1 cells were grown with agitation at 37 °C until they reached mid-log phase (OD600 = 0.5). The cells were then centrifuged and washed twice with washing buffer (20 mmol·L⁻¹ glucose, 5 mmol·L⁻¹ HEPES, pH 7.2) and re-suspended to an OD600 of 0.05 in 20 mmol·L⁻¹ glucose, 5 mmol·L⁻¹ HEPES, 0. [...]

[...] c_AMPs were then hierarchically clustered in a reduced amino acids alphabet using 100%, 85%, and 75% identity cutoffs. We observed at 75% of identity 118,051 non-singleton clusters, and 8,788 of [...]

[...] trifluoroethanol in water, and 50% methanol in water. Secondary structure was calculated using BeStSel server 130 . (AC) Activity of c_AMPs assessed against ESKAPEE pathogens and human gut commensal strains. Briefly, 10⁶ CFU·mL⁻¹ was exposed to c_AMPs two-fold serially diluted ranging from 64 to 1 μmol·L⁻¹ in 96-well plates and incubated at 37 °C for one day. After the exposure period, the absorbance of each well was measured at 600 nm. Untreated solutions were used as controls and minimal concentration values for complete inhibition were presented as a heat map of antimicrobial activities (μmol·L⁻¹) against 11 pathogenic and eight human gut commensal bacterial strains. All the assays were performed in three independent replicates and the heatmap shows the mode obtained within the two-fold dilutions concentration range studied.
Figure SI2. c_AMPs quality and habitat distribution. (A) Quality assessment of AMPSphere reveals most of the peptides passing at least 1 of the tests. The RNAcode test depends on gene diversity, which is very low for AMPSphere, and therefore, determines a low rate of positives among our candidates. (B) c_AMPs homologous to databases of validated bioactive peptides also showed a higher average quality of these datasets. (C) The limited overlap of c_AMPs among habitats argues in favor of using habitat groups to gain resolution. Note that the group of habitats with the highest paired overlaps belong to human body sites and samples from human guts and non-human mammalian guts. Only habitats with at least 100 samples were shown. (D) It is also possible to observe the great proportion of rare genes in AMPSphere from different habitat groups, in which few genes are largely [...]

Figure SI4. Host-associated habitats are denser in c_AMPs. The c_AMP density (given as c_AMP genes per assembled Mbp) was computed per sample and plotted per habitat. Only habitats with at least 100 samples were used. To favor a good visualization, we sampled 2,000 dots to plot their distribution trends. Habitats were colored by their ontology, orange for the animal host-associated samples and purple for the non-animal-host-associated samples. Host-associated habitats cluster at the right portion of the graph with some anthropogenic habitats, which are closer to animal host-associated samples than habitat samples. Few exceptions, such as coral-associated and bird gut habitats, cluster with non-animal-host-associated habitats.
... to Fig. 4C. (B) Animal host-associated habitats are denser in c_AMPs (corresponding to Fig. SI4). (C) Animal host-associated habitats have a higher per-sample c_AMP density than non-host-associated habitats (corresponding to Fig. 5A). (D) Host is a factor in the variation of c_AMP density in Prevotella copri, which shows a higher ρAMP in cat and human guts than in the guts of pigs and dogs (corresponding to Fig. 5B). 106 randomly selected points are shown for each host. (E) The species-specific ρAMP of microbes from the human gut is higher than that of the same species found in the human oral cavity (corresponding to Fig. 5C). (F) For non-animal hosts, the species-specific ρAMP of microbes from the soil is higher than that of the same species found in plant-associated samples (corresponding to Fig. 5D). For panels E and F, significance is color-encoded using a Log10(PMann) scale.

... °C, and the circular dichroism spectra shown are an average of three accumulations obtained using a quartz cuvette with an optical path length of 1.0 mm, ranging from 260 to 190 nm at a rate of 50 nm·min⁻¹ and a bandwidth of 0.5 nm. All peptides were tested at a concentration of 50 μmol·L⁻¹, with respective baselines recorded prior to measurement. A Fourier transform filter was applied to minimize background effects.
... Fluorescence values relative to PMB (positive control) of 3,3′-dipropylthiadicarbocyanine iodide [DiSC3(5)], a hydrophobic fluorescent probe used to indicate cytoplasmic membrane depolarization of P. aeruginosa PAO1 cells. Data in (A), (B), (D), and (E) are the mean ± standard deviation.
Excel Tables

Table SI1. Metadata and description of (meta)genomes used in AMPSphere. Each sample is identified by its ENA accession code, and the habitat field gives the type of habitat from which the sample was retrieved. Other sequencing data, such as the number of raw inserts, the number of assembled base pairs (bp), and the N50, are also provided. The number of predicted complete large ORFs (>100 amino acids) and smORFs (10-100 amino acids) is shown (ORFs+smORFs), along with the number of smORFs alone and the number of predicted non-redundant c_AMPs.

... shown along with the number of genes encoding the non-redundant c_AMPs, the total number of c_AMP clusters, and the number of clusters containing ≥8 c_AMPs (c_AMP families).

... [S3]. Top hits were assessed, and the proportion of each OG from eggNOG 5 [S4] was computed from the number of c_AMPs affiliated with homologs of a given OG and the total number of OGs found among the homologs of c_AMPs (156,711) in the comparison to GMGCv1 [S3]. As a background measure, we used the count of a given OG in the redundant set of genes belonging to GMGCv1 [S3] and the total number of OGs found in the redundant GMGC catalog [S3] (9,180,087,363). Enrichment in the c_AMP set is given as the fold-change calculated for each OG relative to that expected from GMGCv1 [S3]. P-values were adjusted using the Holm-Sidak method, and only significant hits (P < 0.05) are shown.
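As a minimal sketch of the enrichment arithmetic described above, the fold-change of an OG and a Holm-Sidak step-down adjustment could be written as follows. The function names and the example counts and P-values are hypothetical. The two default totals are the figures quoted in the caption. The test that produces the raw P-values is not stated here, so the sketch takes them as given.

```python
def og_fold_change(camp_count, background_count,
                   camp_total=156_711, background_total=9_180_087_363):
    """Fold-change of one OG: its proportion among c_AMP homologs
    divided by its proportion in the redundant GMGCv1 background."""
    camp_proportion = camp_count / camp_total
    background_proportion = background_count / background_total
    return camp_proportion / background_proportion


def holm_sidak(pvalues):
    """Holm-Sidak step-down adjustment of a list of raw P-values.

    The i-th smallest P-value (i = 0, 1, ...) is adjusted to
    1 - (1 - p)^(m - i), and a running maximum keeps the adjusted
    values monotone in the order of the raw values.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        value = 1.0 - (1.0 - pvalues[idx]) ** (m - rank)
        running_max = max(running_max, value)
        adjusted[idx] = min(running_max, 1.0)
    return adjusted


# Hypothetical example: three OGs with made-up counts and raw P-values.
counts = {"OG_a": (120, 1_500_000), "OG_b": (15, 900_000), "OG_c": (3, 40_000)}
raw_p = [0.01, 0.04, 0.03]
adj_p = holm_sidak(raw_p)
for (og, (n_camp, n_bg)), p in zip(counts.items(), adj_p):
    flag = "significant" if p < 0.05 else "not significant"
    print(og, round(og_fold_change(n_camp, n_bg), 2), round(p, 4), flag)
```

Keeping the adjustment separate from the fold-change makes it easy to swap in a library routine for the correction without touching the proportion arithmetic.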

Table SI4. c_AMP genome context in comparison to families with proteins of different sizes.
The proportion of families of proteins of different sizes (all lengths and only ≤ 50 amino acids) is presented in comparison to the proportion of mapped AMPs (55,191) with genome contexts involving a given Kyoto Encyclopedia of Genes and Genomes (KEGG) ortholog pathway [S5], shown with its accession code and description.

... [S7] and those harboring KOs were included in this analysis. For each KO, the table gives a brief description of its activity, the total number of AMPs annotated, the number of those with conserved neighbors (conservation score > 0.9), and the number of AMPs assigned to a given KO that are inserted in a conserved neighborhood containing other genes annotated to that KO (the intersection between the two previous columns).

Table SI7. c_AMP density across different genera. The number of redundant c_AMP genes and their respective assembled base pairs are presented with the calculated AMP density. The error of the AMP density measure is also shown, evidencing that most genera present an error above 10% (our cutoff).

Table SI8. Differential species-specific c_AMP density across habitats. For each tested species, densities (as c_AMP genes per assembled gigabase pair) were compared across samples from two habitats (Habitats A and B), and the number of samples for each habitat was recorded after eliminating outliers using Tukey's fences with k = 1.5 (# Samples A and B). The average c_AMP density for each species in each habitat is given along with its standard deviation (Avg. c_AMP density and Std. c_AMP density). The Mann-Whitney U test was applied to each pair of habitats, and the resulting P-values were corrected using the Holm-Sidak method, shown as the 'Adjusted P-value'.
Comparisons considered species present in at least 10 samples for each habitat with ≥ 100 samples presenting c_AMPs. Only significant comparisons were shown.
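The density calculation and outlier removal described for Table SI8 can be illustrated with a short sketch. It assumes the density is simply gene count scaled to one gigabase pair, and it assumes linear interpolation for the quartiles, since the quartile method is not stated here. Function names and sample values are hypothetical.

```python
from statistics import quantiles


def c_amp_density(n_genes, assembled_bp, per_bp=1_000_000_000):
    """c_AMP genes per assembled gigabase pair (per_bp sets the scale)."""
    return n_genes * per_bp / assembled_bp


def tukey_filter(values, k=1.5):
    """Keep only values inside Tukey's fences [Q1 - k*IQR, Q3 + k*IQR].

    Quartiles use the 'inclusive' (linear interpolation) method, which
    is an assumption of this sketch.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


# Hypothetical per-sample densities of one species in one habitat.
habitat_a = [10, 12, 11, 13, 12, 100]
kept = tukey_filter(habitat_a)
print(len(kept), "of", len(habitat_a), "samples kept")
```

The filtered densities of the two habitats would then be compared with a Mann-Whitney U test (for instance with `scipy.stats.mannwhitneyu`), and the P-values across all species and habitat pairs corrected together.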

Table SI9. Metatranscriptomes and metaproteomes used in the verification for experimental signals of transcription and/or translation of c_AMP genes from AMPSphere.
Metatranscriptomes from EMBL-ENA were compared against the genes in AMPSphere to verify signals of transcription in ad hoc datasets. Datasets from the Proteomics Identification Database (PRIDE, EMBL-EBI) were likewise compared against c_AMPs to verify the presence of the same peptides in ad hoc datasets.