DeepGOMeta: Predicting functions for microbes

Analyzing microbial samples remains computationally challenging due to their diversity and complexity. The lack of robust de novo protein function prediction methods exacerbates the difficulty in deriving functional insights from these samples. Traditional prediction methods, dependent on homology and sequence similarity, often fail to predict functions for novel proteins and proteins without known homologs. Moreover, most of these methods have been trained on largely eukaryotic data, and have not been evaluated or applied to microbial datasets. This research introduces DeepGOMeta, a deep learning model designed for protein function prediction, as Gene Ontology (GO) terms, trained on a dataset relevant to microbes. The model is validated using novel evaluation strategies and applied to diverse microbial datasets. Data and code are available at https://github.com/bio-ontology-research-group/deepgometa


Introduction
Protein function prediction has evolved significantly over the past few years, transitioning from reliance on basic sequence alignment to approaches based on machine learning, natural language processing, or analysis of biological networks [1]. Despite these advances, few methods have been developed for and evaluated on metagenome or amplicon sequencing data, mainly because there is no "ground truth" unless the methods are applied to "mock communities", which are highly simplified versions of actual microbial communities and not representative of the complexities encountered in real-world cases.
Microbial communities are especially complex, mainly due to the diversity of organisms they contain, including many that have yet to be cultured. This gives rise to a phenomenon often termed metagenomic 'dark matter', where 50% to 80% of metagenomic proteins remain unannotated using current methods [2]. This complexity and diversity often render traditional annotation methods inadequate, particularly when presented with novel proteins.
Microbial genomic data primarily come in two forms: amplicon sequences and whole genome sequencing (WGS) reads. Amplicon sequences, like 16S rRNA, are key for bacterial taxonomic classification, but their utility in function prediction is limited. Tools like PICRUSt2 [3] and Tax4Fun2 [4] infer microbial community functions using homology-based algorithms by aligning to reference databases. However, the accuracy of these predictions is constrained by algorithm limitations and database completeness. WGS enables the reconstruction of complete microbial genomes, allowing for a more direct assessment of a microbial community's functional potential, traditionally done by aligning protein-coding sequences to known proteins using algorithms like BLAST [5].
Existing protein function prediction methods face significant limitations in microbial contexts. Even when enhanced with machine learning, these methods are limited by their training datasets. For example, the Critical Assessment of Function Annotation (CAFA) challenge [6] utilizes the SwissProt database, rich in eukaryotic proteins, overlooking the predominantly prokaryotic nature of metagenomes [7]. Moreover, most of these methods have not been validated on or applied to microbial data, largely due to the lack of robust evaluation strategies. These limitations highlight the need for models trained on relevant data and innovative evaluation strategies.
Deep learning has shown remarkable potential in analyzing biological data through its ability to detect intricate patterns in vast datasets [8]. DeepGOMeta incorporates ESM2 (Evolutionary Scale Modeling 2) [9], a deep learning framework that extracts meaningful features from protein sequences by learning from evolutionary data. By utilizing these learned features through ESM2, and training on a more representative dataset, DeepGOMeta can predict protein functions even in the absence of explicit sequence similarity or homology to known proteins. Moreover, we introduce novel evaluation strategies to assess the method's performance when applied to microbial data. Taken together, DeepGOMeta addresses the multifaceted challenges associated with protein function prediction for microbial data.

Materials and Data

UniProtKB/Swiss-Prot Dataset and Gene Ontology
We obtained all proteins that were manually curated and reviewed from the UniProtKB/Swiss-Prot Knowledgebase (v2023_03, released 28 June 2023) [10]. We further filtered to select for proteins that belong to prokaryotes, archaea and phages, and only kept proteins with experimental functional annotations using evidence codes EXP, IDA, IPI, IMP, IGI, IEP, TAS, IC, HTP, HDA, HMP, HGI, HEP. The dataset contains 10,107 reviewed and manually annotated proteins.
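The evidence-code filter described above can be sketched as follows. This is a minimal illustration, assuming annotations are already parsed into (GO ID, evidence code) pairs; the function names and toy protein IDs are hypothetical and not taken from the DeepGOMeta code base.

```python
# Experimental evidence codes listed in the text.
EXPERIMENTAL_CODES = {
    "EXP", "IDA", "IPI", "IMP", "IGI", "IEP", "TAS", "IC",
    "HTP", "HDA", "HMP", "HGI", "HEP",
}


def experimental_annotations(annotations):
    """Keep only (go_id, evidence_code) pairs with an experimental code."""
    return [(go, code) for go, code in annotations if code in EXPERIMENTAL_CODES]


def filter_proteins(proteins):
    """Drop proteins that are left without any experimental annotation.

    `proteins` maps a protein ID to a list of (go_id, evidence_code) pairs.
    """
    kept = {}
    for pid, annots in proteins.items():
        exp = experimental_annotations(annots)
        if exp:
            kept[pid] = exp
    return kept


# Toy records with hypothetical protein IDs.
toy = {
    "P1": [("GO:0003677", "IDA"), ("GO:0005524", "IEA")],
    "P2": [("GO:0016020", "IEA")],
}
filtered = filter_proteins(toy)
```

In this toy example, P2 is dropped because its only annotation is electronically inferred (IEA).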
Metagenomes contain many uncharacterized, novel proteins. In order to evaluate our models on novel proteins, we generated training, validation and testing splits based on sequence similarity. First, we grouped the proteins by their similarity using Diamond (v2.0.9) [11] (e-value 0.001) and split the groups into training, validation and testing sets (81/9/10%, respectively). This ensures that the validation and testing set proteins do not have any similar sequences in the training set.
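One way to realize such a split is to treat each Diamond hit as an edge, take connected components as similarity groups, and assign whole groups to a split. The sketch below assumes the hits are already available as protein ID pairs and that the 81/9/10 fractions are applied to groups rather than to individual proteins; both are assumptions made for illustration, and the function names are hypothetical.

```python
import random


def cluster_by_similarity(protein_ids, similar_pairs):
    """Group proteins into connected components of the similarity graph."""
    parent = {p: p for p in protein_ids}

    def find(p):
        # Union-find lookup with path halving.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in similar_pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    clusters = {}
    for p in protein_ids:
        clusters.setdefault(find(p), []).append(p)
    return list(clusters.values())


def split_clusters(clusters, fractions=(0.81, 0.09, 0.10), seed=0):
    """Assign whole clusters to train/valid/test so similar proteins stay together."""
    rng = random.Random(seed)
    clusters = list(clusters)
    rng.shuffle(clusters)
    n_train = int(len(clusters) * fractions[0])
    n_valid = int(len(clusters) * fractions[1])
    groups = (
        clusters[:n_train],
        clusters[n_train:n_train + n_valid],
        clusters[n_train + n_valid:],
    )
    return tuple([p for c in g for p in c] for g in groups)


# Toy example: 20 hypothetical proteins, three similarity hits.
ids = [f"p{i}" for i in range(20)]
pairs = [("p0", "p1"), ("p1", "p2"), ("p5", "p6")]
clusters = cluster_by_similarity(ids, pairs)
train, valid, test = split_clusters(clusters)
```

Because splitting happens at the group level, no pair of similar proteins can end up on opposite sides of the train/test boundary.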
We trained and evaluated a model for each of the GO sub-ontologies separately (r2023-01-01) [12]. Table 1 summarizes the datasets for each sub-ontology, showing the number of GO terms and the number of proteins in the training, validation, and testing sets for the UniProtKB/Swiss-Prot dataset.
To compare our model against other methods, we generated a test set by following the CAFA [6] challenge time-based approach. We downloaded UniProtKB/Swiss-Prot (v2023_05, released 8 November 2023) and extracted newly annotated proteins in this version.
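A time-based test set of this kind can be sketched as a comparison of two releases. The snippet assumes that "newly annotated" means a protein has experimental annotations in the newer release and none in the older one, which is the usual CAFA-style reading but is not spelled out in the text; the function name and IDs are hypothetical.

```python
def newly_annotated(old_release, new_release):
    """Return proteins with experimental GO terms in the new release only.

    Both arguments map a protein ID to its set of experimentally
    supported GO terms in that release.
    """
    return {
        pid: set(terms)
        for pid, terms in new_release.items()
        if terms and not old_release.get(pid)
    }


# Toy releases with hypothetical protein IDs.
old = {"P1": {"GO:0003677"}, "P2": set()}
new = {
    "P1": {"GO:0003677", "GO:0005524"},
    "P2": {"GO:0016020"},
    "P3": {"GO:0005737"},
}
test_set = newly_annotated(old, new)
```

Here P1 is excluded because it was already annotated in the older release, while P2 and P3 enter the test set.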

Protein-Protein Interactions data
For the 10,107 proteins in our dataset, we obtained protein-protein interaction (PPI) data from the STRING (v11.0) [13] database, which yielded 14,524 interactions. There were 7 different modes of interaction: binding, activation, reaction, catalysis, expression, inhibition, and ptmod.

MGnify dataset
To evaluate microbial protein annotations, we downloaded the MGnify protein database (r2023_02) [14] and its associated metadata. This database includes protein sequences from publicly available metagenomic assemblies within MGnify. We extracted 2,000 random proteins from this database, where half were 'aquatic' and half were 'terrestrial' (lineage:root:Environmental:Terrestrial/Aquatic).

Baseline and Comparison methods
For our evaluations, we used baseline methods that do not rely on predictions based on sequence similarity, as our aim is to test the predictors on challenging sequences. Therefore, we do not include methods that are primarily based on sequence similarity, such as BLAST, Diamond, or their combinations, as baselines. For the time-based dataset evaluation, we selected three state-of-the-art methods developed by other groups: TALE [17], SPROF-GO [18] and NetGO3 [19].

Naive approach
Due to the imbalance in GO class annotations and propagation based on the true-path rule, some classes have more annotations than others. Therefore, it is possible to obtain prediction results just by assigning the same GO classes to all proteins based on annotation frequencies. In order to test the performance obtained based on annotation frequencies, CAFA introduced a baseline approach called the "naive" classifier [6].
Here, each query protein $p$ is annotated with GO classes with prediction scores computed as:

$$S(p, f) = \frac{N_f}{N_{total}}$$

where $f$ is a GO class, $N_f$ is the number of training proteins annotated by GO class $f$, and $N_{total}$ is the total number of training proteins. We implemented the same method.
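The naive classifier reduces to a frequency table over the training set. The following sketch assumes the training annotations have already been propagated with the true-path rule; the function name and the toy GO identifiers are placeholders.

```python
from collections import Counter


def naive_scores(train_annotations):
    """Score each GO class by its annotation frequency, N_f / N_total.

    `train_annotations` maps a training protein ID to its set of GO classes.
    The same scores are assigned to every query protein.
    """
    n_total = len(train_annotations)
    counts = Counter(go for gos in train_annotations.values() for go in set(gos))
    return {go: count / n_total for go, count in counts.items()}


# Toy training set with placeholder identifiers.
train_toy = {
    "P1": {"GO:a", "GO:b"},
    "P2": {"GO:a"},
    "P3": {"GO:a", "GO:c"},
    "P4": {"GO:b"},
}
scores = naive_scores(train_toy)
```

Since the scores do not depend on the query sequence, any performance this baseline achieves reflects the class imbalance alone.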

MLP (ESM2)
The MLP baseline method predicts protein functions using a multi-layer perceptron (MLP) from a protein's ESM2 embedding [9]. We generated an embedding vector of size 5,120 using the ESM2 15B model and passed it to two MLP blocks, where the output of the second MLP block has a residual connection to the output of the first block. This representation is passed to the final classification layer with a sigmoid activation function.

One MLP block performs the following operations:

$$MLPBlock(x) = DropOut(BatchNorm(ReLU(Wx + b)))$$

The input vector $x$ of length 5,120 represents the ESM2 embedding and is reduced to 1,024 by the first MLPBlock:

$$h = MLPBlock(x)$$

This representation is passed to the second MLPBlock with input and output size of 1,024 and added to itself using a residual connection:

$$h = h + MLPBlock(h)$$

Finally, we passed this vector to a classification layer with a sigmoid activation function. The output size of this layer is equal to the number of classes in each sub-ontology:

$$y = \sigma(Wh + b)$$

We trained a different model for each sub-ontology in GO.

DeepGO-PLUS and DeepGOCNN
DeepGO-PLUS [20] predicts protein functions by combining DeepGOCNN, which predicts functions from the amino acid sequence of a protein using a 1-dimensional convolutional neural network (CNN), and the DiamondScore method. DeepGOCNN captures sequence motifs that are related to GO functions. Here, we only used the CNN-based predictions.

DeepGOZero
DeepGOZero [21] combines protein function prediction with a model-theoretic approach for embedding ontologies into a distributed space, ELEmbeddings [22]. ELEmbeddings represent classes as n-balls and relations as vectors to embed ontology semantics into a geometric model. DeepGOZero uses InterPro domain annotations represented as a binary vector as input and applies two layers of MLPBlock, as in our MLP baseline method, to generate an embedding of size 1,024 for a protein. It learns the embedding space for GO classes using the ELEmbeddings loss functions and optimizes it together with the protein function prediction loss. For a given protein $p$, DeepGOZero predicts annotations for a class $c$ using the following formula:

$$y'_c = \sigma\left(f_\eta(p)^T \left(f_\eta(hF) + f_\eta(c)\right) + r_\eta(c)\right)$$

where $f_\eta$ is an embedding function, $hF$ is the hasFunction relation, $r_\eta(c)$ is the radius of an n-ball for a class $c$ and $\sigma$ is a sigmoid activation function. It optimizes the binary cross-entropy loss between predictions and labels together with the ontology axiom losses from ELEmbeddings.

TALE
TALE [17] predicts functions using a transformer-based deep neural network model which incorporates hierarchical relations from the GO into the model's loss function. The deep neural network predictions are combined with predictions based on sequence similarity. We used the trained models provided by the authors to evaluate them on the time-based dataset.

SPROF-GO
The SPROF-GO [18] method uses the ProtT5-XL-U50 [23] protein language model to extract protein sequence embeddings and learns an attention-based neural network model. The model incorporates the hierarchical structure of GO into the neural network and predicts functions that are consistent with hierarchical relations of GO classes. Furthermore, SPROF-GO combines sequence similarity-based predictions using a homology-based label diffusion algorithm. We used the trained models provided by the authors to evaluate them on the time-based dataset.

Pathway prediction
PICRUSt2 [3] provides the potential functions of microbial communities using 16S rRNA data and a reference genome database. We used operational taxonomic unit (OTU) tables as the input for PICRUSt2 and focused on MetaCyc [24] pathways and their abundance scores. We performed Principal Component Analysis (PCA) and k-means clustering to discern patterns within the dataset based on these MetaCyc pathway features. The value of k was determined based on the number of categories within each phenotype. We measured clustering purity based on the true phenotype labels in the datasets (eq. 16).

Evaluation metrics
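We used four different measures to evaluate the performance of our models: three protein-centric measures, $F_{\max}$, $S_{\min}$ and AUPR, and one class-centric measure, AUC.

$F_{\max}$ is the maximum protein-centric F-measure computed over all prediction thresholds. First, we computed average precision and recall using the following formulas:

$$pr_i(t) = \frac{\sum_f I(f \in P_i(t) \land f \in T_i)}{\sum_f I(f \in P_i(t))}$$

$$rc_i(t) = \frac{\sum_f I(f \in P_i(t) \land f \in T_i)}{\sum_f I(f \in T_i)}$$

$$AvgPr(t) = \frac{1}{m(t)} \sum_{i=1}^{m(t)} pr_i(t)$$

$$AvgRc(t) = \frac{1}{n} \sum_{i=1}^{n} rc_i(t)$$

where $f$ is a GO class, $T_i$ is the set of true annotations, $P_i(t)$ is the set of predicted annotations for a protein $i$ and threshold $t$, $m(t)$ is the number of proteins for which we predict at least one class, $n$ is the total number of proteins and $I$ is an indicator function which returns 1 if the condition is true and 0 otherwise. Then, we compute $F_{\max}$ for prediction thresholds $t \in [0, 1]$ with a step size of 0.01. We count a class as a prediction if its prediction score is greater than or equal to $t$:

$$F_{\max} = \max_t \left\{ \frac{2 \cdot AvgPr(t) \cdot AvgRc(t)}{AvgPr(t) + AvgRc(t)} \right\}$$

$S_{\min}$ computes the semantic distance between real and predicted annotations based on the information content of the classes. The information content $IC(c)$ is computed based on the annotation probability of the class $c$:

$$IC(c) = -\log(Pr(c \mid P(c)))$$

where $P(c)$ is the set of parent classes of the class $c$. $S_{\min}$ is computed using the following formula:

$$S_{\min} = \min_t \sqrt{ru(t)^2 + mi(t)^2}$$

where $ru(t)$ is the average remaining uncertainty and $mi(t)$ is the average misinformation:

$$ru(t) = \frac{1}{n} \sum_{i=1}^{n} \sum_{c \in T_i - P_i(t)} IC(c)$$

$$mi(t) = \frac{1}{n} \sum_{i=1}^{n} \sum_{c \in P_i(t) - T_i} IC(c)$$

AUPR is the area under the average precision (AvgPr) and recall (AvgRc) curve.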
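The threshold sweep behind the two protein-centric measures can be sketched in plain Python. This is an illustrative implementation of the standard CAFA-style definitions, not the authors' evaluation code; it assumes information content values are supplied in a precomputed dictionary, and all names and toy values are hypothetical.

```python
import math


def fmax_smin(true_sets, pred_scores, ic, steps=100):
    """Sweep thresholds t = 0, 1/steps, ..., 1 and return (Fmax, Smin).

    true_sets:   protein ID -> set of true GO classes
    pred_scores: protein ID -> {GO class: prediction score}
    ic:          GO class -> information content
    """
    n = len(true_sets)
    best_f, best_s = 0.0, float("inf")
    for k in range(steps + 1):
        t = k / steps
        sum_pr = sum_rc = ru = mi = 0.0
        m = 0  # proteins with at least one predicted class at threshold t
        for pid, truth in true_sets.items():
            pred = {c for c, s in pred_scores.get(pid, {}).items() if s >= t}
            tp = len(pred & truth)
            if pred:
                m += 1
                sum_pr += tp / len(pred)
            if truth:
                sum_rc += tp / len(truth)
            ru += sum(ic[c] for c in truth - pred)  # remaining uncertainty
            mi += sum(ic[c] for c in pred - truth)  # misinformation
        avg_rc = sum_rc / n
        best_s = min(best_s, math.sqrt((ru / n) ** 2 + (mi / n) ** 2))
        if m > 0:
            avg_pr = sum_pr / m
            if avg_pr + avg_rc > 0:
                best_f = max(best_f, 2 * avg_pr * avg_rc / (avg_pr + avg_rc))
    return best_f, best_s


# Toy example with placeholder classes "a", "b", "c".
truth = {"p1": {"a", "b"}, "p2": {"a"}}
preds = {"p1": {"a": 0.9, "b": 0.4, "c": 0.2}, "p2": {"a": 0.8, "c": 0.6}}
ic_toy = {"a": 1.0, "b": 2.0, "c": 3.0}
f_max, s_min = fmax_smin(truth, preds, ic_toy)
```

Note that precision is averaged only over the $m(t)$ proteins with at least one prediction, while recall, remaining uncertainty and misinformation are averaged over all $n$ proteins.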
AUC is a class-centric measure where we computed AUC ROC per class and calculated the average.
Purity assesses the homogeneity of clusters formed by a k-means clustering algorithm. We clustered samples based on their predicted functions and used purity to evaluate whether the samples with the same phenotype are in the same cluster. The Weighted Average Clustering Purity (WACP) formula is given by: where $N$ is the total number of data points, $k$ is the number of clusters, $n_{ij}$ is the number of data points from cluster $j$ that are assigned to cluster $i$, $n_i$ is the total number of data points assigned to cluster $i$, and $w_j$ is the weight associated with cluster $j$.
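For orientation, the standard (unweighted) clustering purity can be computed as below: each cluster is credited with the count of its majority label, and the credits are summed and divided by the number of points. This is a sketch of the textbook definition and may differ from the weighted variant used in this work; the function name and toy labels are hypothetical.

```python
from collections import Counter, defaultdict


def clustering_purity(cluster_ids, labels):
    """Fraction of points that carry the majority label of their cluster."""
    by_cluster = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, labels):
        by_cluster[cluster][label] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(labels)


# Toy example: two clusters, two environment labels.
toy_clusters = [0, 0, 0, 1, 1, 1]
toy_labels = ["aquatic", "aquatic", "terrestrial",
              "terrestrial", "terrestrial", "aquatic"]
purity = clustering_purity(toy_clusters, toy_labels)
```

With two balanced labels, a purity of 0.5 corresponds to clusters that carry no information about the label, and 1.0 to a perfect separation.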
We calculated function abundance to provide a quantitative assessment of the functional potential within a microbial sample. The abundance of a function, $A(f)$, is the sum of the relative abundances of all taxa present in a sample that contain that function, given by:

$$A(f) = \sum_{i=1}^{n} R(t_i) \cdot I(f, t_i) \tag{17}$$

where $i$ is an index representing each taxon, $n$ is the total number of taxa in the sample, $R(t_i)$ is the relative abundance of the $i$-th taxon, and $I(f, t_i)$ is an indicator function that equals 1 if the $i$-th taxon contains function $f$, and 0 otherwise.
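The abundance formula translates directly into a loop over taxa. The sketch below assumes relative abundances and per-taxon function sets are available as dictionaries; the genus names are real, but the function identifiers and values are made up for illustration.

```python
def function_abundance(rel_abundance, taxon_functions):
    """Sum the relative abundance of every taxon that contains each function.

    rel_abundance:   taxon -> relative abundance in the sample
    taxon_functions: taxon -> set of predicted functions
    """
    abundance = {}
    for taxon, r in rel_abundance.items():
        for f in taxon_functions.get(taxon, set()):
            abundance[f] = abundance.get(f, 0.0) + r
    return abundance


# Toy sample with placeholder function identifiers.
sample = {"Bacteroides": 0.5, "Prevotella": 0.3, "Escherichia": 0.2}
functions = {
    "Bacteroides": {"GO:A", "GO:B"},
    "Prevotella": {"GO:A"},
    "Escherichia": {"GO:C"},
}
profile = function_abundance(sample, functions)

# Presence/absence view of the same profile.
binary_profile = {f: 1 for f in profile}
```

A function shared by abundant taxa receives a high weight, whereas the binary view treats all present functions equally.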

MGnify dataset
We annotated the 2,000 randomly selected proteins with DeepGOMeta, which annotates each protein with GO terms. We used two clustering approaches for the first evaluation. The first approach, sequence similarity clustering, involved calculating pairwise sequence similarities between the proteins using DIAMOND BLASTp (v2.1.8) [11], followed by dimensionality reduction using t-Distributed Stochastic Neighbor Embedding (t-SNE), and k-means clustering with k=2 based on the binary nature of the dataset's phenotypes. We calculated clustering purity using the known environment labels of the proteins.
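For the second approach, semantic similarity clustering, we filtered the GO annotations resulting from DeepGOMeta to retain the most specific terms for each protein. For measuring the semantic similarity between protein pairs, we utilized Resnik's similarity method [25], combined with the Best Match Average (BMA) strategy. Resnik's similarity measure is defined as the information content of the most informative common ancestor (MICA) of the compared classes in the ontology. First, we computed the information content (IC) for every class with the following formula:

$$IC(c) = -\log(p(c))$$

where $p(c)$ is the annotation probability of class $c$. Then, we found Resnik's similarity by:

$$Sim_{Resnik}(c_1, c_2) = IC(MICA(c_1, c_2))$$

We computed all possible pairwise similarities of two annotation sets $A$ and $B$ and combined them with:

$$Sim_{BMA}(A, B) = \frac{1}{2}\left(\frac{1}{|A|}\sum_{c_1 \in A} \max_{c_2 \in B} s(c_1, c_2) + \frac{1}{|B|}\sum_{c_2 \in B} \max_{c_1 \in A} s(c_1, c_2)\right)$$

where $s(c_1, c_2) = Sim_{Resnik}(c_1, c_2)$. We then performed a similar dimensionality reduction and clustering using t-SNE and k-means, followed by a purity calculation (eq. 16).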
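Resnik similarity with the BMA strategy can be illustrated on a toy ontology. The sketch assumes the ontology is given as a child-to-parents mapping and that annotation probabilities are estimated from the annotated protein set itself; the term names, helper functions and corpus are hypothetical.

```python
import math


def ancestors(term, parents):
    """Return the term together with all of its ancestors."""
    seen, stack = {term}, [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(annotations, parents):
    """IC(c) = -log(p(c)), with p(c) estimated from propagated annotations."""
    n = len(annotations)
    counts = {}
    for terms in annotations.values():
        closed = set()
        for t in terms:
            closed |= ancestors(t, parents)
        for t in closed:
            counts[t] = counts.get(t, 0) + 1
    return {t: -math.log(c / n) for t, c in counts.items()}


def resnik(c1, c2, ic, parents):
    """IC of the most informative common ancestor of c1 and c2."""
    common = ancestors(c1, parents) & ancestors(c2, parents)
    return max((ic.get(c, 0.0) for c in common), default=0.0)


def bma(a, b, ic, parents):
    """Best Match Average of the pairwise Resnik similarities of two sets."""
    if not a or not b:
        return 0.0
    rows = sum(max(resnik(x, y, ic, parents) for y in b) for x in a) / len(a)
    cols = sum(max(resnik(x, y, ic, parents) for x in a) for y in b) / len(b)
    return (rows + cols) / 2


# Toy ontology: root -> {A, B}, A -> A1.
toy_parents = {"A": ["root"], "B": ["root"], "A1": ["A"]}
toy_annots = {"p1": {"A1"}, "p2": {"A"}, "p3": {"B"}, "p4": {"B"}}
ic = information_content(toy_annots, toy_parents)
sim_related = bma(toy_annots["p1"], toy_annots["p2"], ic, toy_parents)
sim_unrelated = bma(toy_annots["p1"], toy_annots["p3"], ic, toy_parents)
```

Proteins annotated in the same branch share an informative ancestor and score above zero, while proteins that only share the root score zero.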
We further subsetted the 2,000-protein MGnify dataset to only keep proteins with existing Pfam annotations (n = 567) [26]. For these proteins, we used Pfam2GO to map Pfam annotations to GO terms. We calculated purity using the same semantic similarity clustering approach described earlier.

Paired dataset
We analyzed four diverse microbiome datasets, each containing paired 16S rRNA amplicon and WGS data. For the 16S data, we used a Nextflow pipeline, available on our GitHub repository, that employs the RDP classifier (v18) for processing and taxonomic classification. We sourced protein sequences corresponding to the identified bacteria in the RDP database from NCBI and annotated them with DeepGOMeta [27,28]. We then constructed functional profiles for each sample by aggregating the DeepGOMeta-derived functions of all bacteria present, weighting each function by the relative abundance of the genera in which it was present (eq. 17). We also constructed a binary matrix of all the samples and functions in the dataset, where the presence of a function in a sample is represented by 1 and the absence by 0.
For WGS data, we used fastp (v0.23.2) [29] for trimming (-q 30). For host-associated microbiome samples, we used Bowtie2 (v2.5.1) [30] to filter out reads mapping to the host's reference genome. We then assembled the reads with MEGAHIT (v1.2.9) [31], predicted protein sequences with Prodigal (v2.6.3) [32], and annotated the predicted proteins using DeepGOMeta. For each sample, we constructed a functional profile by aggregating the functions derived from DeepGOMeta annotations of all proteins present in the sample. We constructed a binary matrix for these results as described above.
For each dataset, we applied PCA and k-means clustering to the OTU table containing the relative abundance of bacterial genera. The choice of k in k-means clustering was determined by the number of phenotype categories present for each phenotype under investigation. We calculated clustering purity based on the known phenotype category labels provided in the metadata (eq. 16). We conducted this analysis for all categorical phenotypes across each dataset.

DeepGOMeta
Microbial samples are complex and contain many uncharacterized proteins. Previously, we developed DeepGO-SE [33], a method for protein function prediction using protein sequence embeddings generated by ESM2 [9] and approximate semantic entailment. We showed that DeepGO-SE can be applied to uncharacterized proteins; however, since it is trained on all experimentally annotated proteins from the UniProtKB/Swiss-Prot database, many of the functions it predicts are not relevant to microbiomes and exist only in eukaryotic genomes. Therefore, we trained DeepGOMeta, a specific version of DeepGO-SE, optimized to predict functions of organisms found in microbiomes. We created a dataset of prokaryotic, archaeal and viral proteins with experimental annotations from UniProtKB/Swiss-Prot and trained and evaluated three models for the three sub-ontologies of GO. In addition, we created a time-based benchmark dataset in order to compare with DeepGO-SE and other state-of-the-art function prediction methods.
Proteins do not function in isolation, and PPIs play a significant role in biological processes that take place in the environment. PPI networks also offer a means to reveal functional information for unknown proteins within microbial datasets. In order to test whether PPIs help to improve protein function predictions, we trained a model which incorporates PPIs from the STRING database [13] using Graph Attention Networks. We refer to this model as DeepGOMeta-PPI.
We developed novel evaluation strategies to test the performance of DeepGOMeta in annotating proteins derived from microbial data, and we used these strategies to test the method against sequence-similarity clustering and Pfam database annotations. We also developed two different workflows for functional characterization of microbial samples consisting of 16S amplicon and WGS reads. In the case of 16S amplicon reads, we use OTUs to predict functions by utilizing the reference genomes of the genera in the samples. We then aggregate all the functions that were annotated into a functional profile for that genus. In the case of WGS reads, we performed de novo metagenome assembly and predicted functions from metagenome assemblies. Figure 1 depicts these workflows. We applied DeepGOMeta to diverse microbial datasets, and compared functional profiles, pathways, and taxonomy-based methods to gain biological insights.

Evaluation on the similarity-based benchmark
We trained, validated and tested our models for the three sub-ontologies of GO using the UniProtKB/Swiss-Prot dataset split based on sequence similarity (see Methods section). We compared with four baseline methods: MLP(ESM2), DeepGraphGO, InterPro and Naive. We selected these methods because they do not rely on sequence similarity to predict functions.

In the MFO evaluation, DeepGOMeta performed best in all evaluation metrics. It performed slightly better than MLP(ESM2) in terms of Fmax and Smin; however, its AUPR and term-centric AUC were significantly better. Combining PPI network features into the model reduced its performance, but it was still better than the DeepGraphGO method, which is also based on PPIs. Table 3 provides the evaluation results for MFO classes.

In the BPO evaluation, our model achieved the best Fmax of 0.476, which was significantly better (Wilcoxon signed-rank test, p-value $8 \times 10^{-37}$) than the second-best MLP(ESM2) baseline. Combining PPI networks in DeepGOMeta did not improve the model and led to a slightly lower Fmax of 0.469. Interestingly, the Fmax of the InterPro baseline was close to 0. We believe that this might be due to the fact that not many InterPro annotations are linked to BPO classes. It also explains the low performance of the DeepGraphGO method, which uses InterPro annotations. Table 4 provides detailed evaluation results.

In the CCO evaluation, DeepGOMeta achieved the best Fmax of 0.739, followed by almost the same performance by the MLP(ESM2) baseline. Noticeably, the MLP(ESM2) method resulted in the best Smin. Similarly to the MFO and BPO evaluations, combining PPIs did not improve the predictions. The DeepGraphGO method resulted in an Fmax of 0.501, which is slightly better than the Naive classifier, and the performance of InterPro annotation-based prediction was close to zero. Table 5 provides the evaluation results.

By embedding proteins with ESM2 [9] and employing graph attention mechanisms, our model further enriched the protein features with contextual information present in the PPI network. However, the results indicated that incorporating PPIs as background information did not improve function prediction in our case. Upon scrutinizing the interaction data, we noticed that the interaction information was excessively sparse, failing to provide substantial support for function prediction and, instead, introducing additional noise. Our dataset included 10,107 proteins and 14,524 interactions, but only 1,935 proteins had interactions. Given the sub-optimal performance and sparse nature of the PPI data, we excluded the DeepGOMeta-PPI model from further evaluation.

Evaluation and comparison on the time-based split
We used a time-based split to evaluate DeepGOMeta as microbial data often contains an abundance of novel proteins. This is to ensure that our model is robust and effective in predicting the functions of these newly discovered proteins. We did this by comparing DeepGOMeta predictions on the newly annotated proteins with other state-of-the-art methods that predict functions based on protein language model embeddings and transformer-based deep learning models, including TALE [17], SPROF-GO [18] and DeepGO-SE [33]. We found that DeepGOMeta outperforms the DeepGO-SE method in all three sub-ontology evaluations and performs better than all the compared methods in the BPO and CCO evaluations in terms of Fmax and Smin. However, it resulted in lower performance than the SPROF-GO method in the MFO evaluation and in terms of AUC in the BPO evaluation. Table 6 shows the results of this evaluation.

Evaluation strategies on microbial proteins
Given the unique challenges presented by microbial data and the lack of robust evaluation strategies, it was necessary to develop new strategies to assess the performance of DeepGOMeta in annotating microbial proteins in comparison with current annotation methods. Our evaluation employs k-means clustering and clustering purity based on true phenotype labels as a key metric (eq. 16) using a two-fold strategy. First, we compared our method against traditional sequence similarity-based methods by clustering based on sequence similarity. Second, we compared our method against database annotations by clustering based on semantic similarity.
Sequence similarity is a well-established method often employed in homology-based function prediction, and we aim to illustrate how DeepGOMeta performs in comparison to this approach. We used DeepGOMeta to annotate 2,000 proteins derived from microbial data in the MGnify database, and we calculated pairwise sequence similarity for these proteins. We clustered the proteins based on their sequence similarity scores and calculated purity. To evaluate our predicted functions against this, we clustered the proteins based on their predicted functions using semantic similarity. Both methods yielded a clustering purity of 0.55 (Figure 2). This implies that DeepGOMeta is at least as effective as traditional sequence similarity-based approaches, based on the assumption that a similar degree of clustering purity based on the true phenotype labels indicates similar performance.
As most current function annotation methods rely on annotations in existing databases, we subsetted this dataset to only keep proteins with Pfam annotations to compare against DeepGOMeta annotations. We observed that only 567 proteins have existing annotations, highlighting the annotation limitation in these traditional databases. DeepGOMeta was capable of annotating all 2,000 proteins, demonstrating its comprehensive annotation coverage. When focusing on the subset of 567 proteins with Pfam annotations, sequence similarity clustering yields a clustering purity of 0.6. After mapping the Pfam annotations to GO terms using Pfam2GO, we found that the clustering purity using semantic similarity was also 0.6 for both Pfam and DeepGOMeta annotations (Figure 3). This parity in clustering purity might suggest that DeepGOMeta does not surpass sequence similarity methods in terms of predictive accuracy. However, it has the advantage that it can annotate all the proteins in the dataset.

Applications on amplicon and metagenome data
To demonstrate the utility of our method in function prediction for different types of microbial data, we used paired datasets of 16S amplicon reads and WGS reads of the same samples. Here, we employed our evaluation strategy where we used clustering to assess our method's efficacy in capturing functionally relevant information from microbial communities. Due to the absence of ground-truth data for microbial functions, we assume that protein functions found in microbial communities are more similar when the microbial communities are from the same environment or share identical phenotypes. Consequently, we used functional similarity, based on the functions predicted by DeepGOMeta, to cluster microbial samples. This clustering, based on functional similarity, serves as an unsupervised and ostensibly unbiased method to group microbial communities by their functions. This approach allowed us to explore the primary drivers of community composition, focusing on the application of DeepGOMeta for gaining biological insights.
We used DeepGOMeta to construct functional profiles for each sample using reads from both sequencing strategies and compared against taxonomy-based clustering (Table 7). For each dataset, based on DeepGOMeta results, we constructed a binary representation of functions which indicates presence or absence of a function. For 16S data, we also constructed an abundance-weighted matrix, in which each function is assigned a weight (eq. 17). In certain contexts, DeepGOMeta demonstrated superior performance over OTU-based clustering. Specifically, in 5 out of the 9 phenotypes we analyzed, employing 16S functions (abundance-weighted) proved to be either on par with or more effective than clustering by OTUs. This suggests that DeepGOMeta's functional profiles can be effective in capturing specific functional attributes that are unique to each phenotype. In some datasets, such as Mammalian Stool and Cameroon (Region, Ethnicity), the functional attributes were more defining than taxonomic composition, suggesting that these community compositions are driven by functions (in contrast to taxa).
Conversely, in 3 out of the 9 phenotypes studied, OTU-based clustering proved more effective. Specifically, in two datasets (Blueberry, India), the location phenotype was better explained by OTU composition than by functions. Interestingly, we found that using 16S functions in a binary format never outperformed the abundance-weighted approach, suggesting its limited efficacy. In the case of WGS functions, this method only took the lead in 1 out of 9 phenotypes, possibly indicating the necessity of weighting functions.
We also compared OTU-based clustering and DeepGOMeta-derived functional profiles with pathways generated by PICRUSt2 (detailed in the Methods section). PICRUSt2 provides functional insights into microbial communities through KEGG/MetaCyc pathways. In only one case, the Cameroon dataset (Diet), we find that the functional insights provided by PICRUSt2 exhibit a better capacity to separate the phenotype than taxonomy-based clustering. We also find that pathway predictions and taxonomic composition separate samples by location equally well in the India dataset. However, for the other datasets, there is no clear distinction between pathway-based and taxonomy-based clustering purity; neither shows a clear superiority in separating the samples between the phenotypes.
Compared to DeepGOMeta, PICRUSt2's pathway information is limited, as pathways constitute only a subset of the functions that DeepGOMeta can predict in the form of BPO classes. The results also indicate that PICRUSt2's functional information overall does not separate samples by phenotype better than DeepGOMeta's. This indicates either a lack of strong associations between pathways and phenotypes or limitations of the algorithm or database used by PICRUSt2. However, the experiment falls short of directly comparing the performance of the two function prediction methods.

Discussion
In this study we introduced DeepGOMeta, which aims to overcome the limitations of current methods in their lack of representative training sets and the lack of validation and applications on microbial data. Current function prediction methods are predominantly trained on eukaryotic data. We trained, tested, and evaluated three different models on UniProtKB/Swiss-Prot Knowledgebase proteins that belong to microbial species (prokaryotes, archaea, viruses), a set more representative of species prevalent in microbial datasets. DeepGOMeta provides function predictions in the form of GO terms, as each of the three models was trained on a distinct GO sub-ontology. DeepGOMeta demonstrates an improvement over the baseline methods on the similarity-based benchmark in most evaluation metrics across the three sub-ontologies. In the comparison using a time-based split, DeepGOMeta outperformed DeepGO-SE in all three sub-ontology evaluations, and outperformed TALE and SPROF-GO in the BPO and CCO evaluations in terms of Fmax and Smin. However, in the MFO evaluation, the model was outperformed by SPROF-GO.
To evaluate the method's predictions on microbial proteins, we designed a novel evaluation and benchmark strategy in which we use k-means clustering and clustering purity based on true phenotype labels in order to evaluate our method against sequence similarity-based methods and annotations in existing databases. For this, we use both sequence similarity-based clustering and semantic similarity-based clustering. We demonstrated that DeepGOMeta performs as well as traditional sequence similarity approaches in annotating 2,000 proteins from the MGnify protein database. This indicates the method's ability to group proteins by the environment in which they were found, based on their predicted functions. Notably, while only 567 proteins had existing Pfam annotations, DeepGOMeta successfully annotated all 2,000 proteins, showcasing its comprehensive annotation capabilities. While DeepGOMeta successfully annotated these proteins, extending the functional knowledge base, we recognize that further validation is necessary to ensure the specificity

Fig. 1. The figure provides an overview of the workflows used to generate functional profiles using DeepGOMeta for amplicon samples and WGS samples.

Fig. 2. Clustering of microbial proteins (n = 2000) from MGnify. (a) Clustering based on sequence similarity between all proteins. (b) Clustering based on semantic similarity between all proteins.

Fig. 3. Clustering of microbial proteins (n = 567) from MGnify that possess Pfam annotations. (a) Clustering based on sequence similarity between all proteins. (b) Clustering based on semantic similarity between all proteins.

Table 1. Summary of the UniProtKB/Swiss-Prot dataset

Table 2. Descriptions of the paired datasets used for evaluation

Table 3. Evaluation results for Molecular Function Ontology classes. This table shows protein-centric Fmax, Smin, and AUPR, and the class-centric average AUC.

Table 4. Evaluation results for Biological Process Ontology classes

Table 5. Evaluation results for Cellular Component Ontology classes. This table shows protein-centric Fmax, Smin, and AUPR, and the class-centric average AUC.

Table 6. Evaluation of DeepGOMeta on time-based split