Fast, lightweight, and accurate metagenomic functional profiling using FracMinHash sketches

Motivation Functional profiling of metagenomic samples is essential to decipher the functional capabilities of microbial communities. Traditional and more widely used functional profilers in the context of metagenomics rely on aligning reads against a known reference database. However, aligning sequencing reads against a large and fast-growing database is computationally expensive. In general, k-mer-based sketching techniques have been successfully used in metagenomics to address this bottleneck, notably in taxonomic profiling. In this work, we describe leveraging FracMinHash (implemented in sourmash, a publicly available software), a k-mer-sketching algorithm, to obtain functional profiles of metagenome samples. Results We show how pieces of the sourmash software (and the resulting FracMinHash sketches) can be put together in a pipeline to functionally profile a metagenomic sample. We named our pipeline fmh-funprofiler. We report that the functional profiles obtained using this pipeline demonstrate comparable completeness and better purity compared to the profiles obtained using other alignment-based methods when applied to simulated metagenomic data. We also report that fmh-funprofiler is 39-99x faster in wall-clock time, and consumes up to 40-55x less memory. Coupled with the KEGG database, this method not only replicates fundamental biological insights but also highlights novel signals from the Human Microbiome Project datasets. Reproducibility This fast and lightweight metagenomic functional profiler is freely available and can be accessed here: https://github.com/KoslickiLab/fmh-funprofiler. All scripts of the analyses we present in this manuscript can be found on GitHub.


Introduction
Metagenomic profiling, the process of evaluating and identifying the genomic information present within a given environmental sample, offers valuable insights into the genetic diversity, functional capacity, and ecological contributions of microorganisms.Specifically, metagenomic functional profiling (a.k.a.functional annotation), the computational process that involves identifying and quantifying functional components within metagenomic data, is essential to understanding microbial functionality, phenotype-genotype association, host interaction, disease progress, etc. [13,14,41,56,44].
A related and very important concept often explored in biology, and in gene studies in particular, is orthologous genes.If two genes in two different species are evolutionary descendants of the same gene in the least common ancestor of these species, then these two genes are called orthologs [60].The study of orthologous genes contributes crucially to comparative genomics and evolutionary studies, and often these studies make use of the well-known fact that orthologous genes often carry out equivalent or identical biological functions across species boundaries [18,36,61].For example, researchers have successfully identified functions of genes in newly sequenced genomes using orthology in model organisms [60,36,10].Previously, it was common practice to annotate the function of human genes using the orthologs in mice, as it is easier to experiment on the latter [23].Largely speaking, orthology traces genes to the same ancestral gene, and therefore, orthologous relationships are considered to be the most accurate way to capture differences and similarities in the composition of genomes from different species [16].Naturally, computational efforts have explored the area, and constructed databases of orthologous genes, such as AspGD [4], MBGD [67], OrthoDB [39], and KEGG [31,30].
Coming back to the study of genes in metagenomes: it is important to note that numerous computational techniques have been developed to identify genes that are present in an environmental sample.In a broader sense, these techniques fall into two categories: (a) gene prediction tools -which attempt to predict genes ab initio using only the sequences [48,59,77], and (b) gene classification/annotation tools -which aim to annotate a metagenome sample against a database of known genes.The latter category is more relevant for functional profiling since we need to have prior knowledge of the functions of known genes.The majority of the tools in this category rely on alignment-based approaches.For example, eggNOG-Mapper performs alignments with orthologs utilizing profile HMM models for query assignment [25].MG-RAST [35] uses BLAST [74] to search against the M5nr database [70].BlastKOALA [32], GhostKOALA [32], and KofamSCAN [3], which utilize the BLASTP [74], GHOSTX [63], and hmmsearch [29] algorithms respectively, are developed to search against Kyoto Encyclopedia of Genes and Genomes (KEGG) database [31].Perhaps the fastest (and most widely used) tool in the context is DIAMOND [8], which utilizes double indexing on both query and reference to enable fast alignment.DIAMOND's novel algorithm can match the accuracy of BLAST while having much better efficiency [8].
Despite the popular use of these tools, the primary use of alignment-based algorithms makes these a poor practical choice in terms of scalability.With the ever-growing volume of sequencing data, alignmentbased methods, despite their historical success, will eventually be overshadowed by more scalable techniques.Computational biologists, therefore, continue to turn to sketching-based methods, which are often faster and more lightweight; and theoretical guarantees of the sketching algorithms ensure their high accuracy.These more popular alignment-based tools also lack the use of orthology relationships of the genes.Using wellstudied orthologous groups has the potential to make functional profiling pipelines more efficient, both in time and memory.
In this paper, we present a sketching-based pipeline to functionally profile a metagenome in terms of orthologous gene groups.We used FracMinHash as our sketching technique to develop the pipeline -recent investigations have shown the successful use of FracMinHash (used in the software package sourmash [7]) in metagenomics analysis [47].We used KEGG as the database of orthologous genes.The orthologous groups in KEGG are called KOs.Our pipeline can take FracMinHash sketches of a given metagenome and the KOs, and progressively discovers what KOs are present in the metagenome using the algorithm sourmash gather [52].The pipeline can also annotate the relative abundances of the KOs.It is fast and lightweight because of using FracMinHash sketches, and is accurate when the sequencing depth is moderately high.We investigated the performance and resource usage of the pipeline against popular gene annotation tools DIAMOND and KofamScan using simulations.We also used the pipeline to functionally profile Human Gut Microbiota.Furthermore, the KO groups are very well studied in the KEGG database in the form of pathways -which allowed us to make insightful deductions about these microbiota.Finally, to demonstrate the versatile usefulness of the results, we showed how two KO profiles can be consolidated into a single metric using the pathways' hierarchy.The pipeline is freely available and can be accessed here: https://github.com/KoslickiLab/funprofiler.

KEGG and gene orthology
Selecting an appropriate reference database is crucial in metagenomic functional profiling.Orthologous genes, which share a common evolutionary ancestry, are commonly employed for inferring functions [12].Numerous efforts have been undertaken to construct databases for gene orthology, such as NCBI COG [17], AGNOSTOS-DB [69], OrthoDB [39], and Ensembl Compara [21].The KEGG database is a comprehensive and integrated resource for biological interpretation by providing gene annotations and mapping genes to manually created pathway maps [31,30].These KO groups serve as a foundation for functional annotation in metagenomics and can be further connected into pathway maps to illustrate the hierarchical interconnections of biological functions within living organisms.

FracMinHash and sketching-based method
With the large number of reference genomes and exponential surge in metagenomic data production, there is an imperative need for the creation of computational models that are both scalable and robust, ensuring precision in analysis.K-mer-based algorithms, particularly sketching-based methods, are gaining increasing popularity for metagenomic profiling implementations in this regard.A k-mer is a sequence of k consecutive nucleotides, extracted from a longer sequence.Algorithms designed to work with k-mers split the entire sample into k-mers, and analyze the number of shared/dissimilar k-mers among multiple samples.The number of all distinct k-mers in a sequencing sample can often be huge, so sketching-based methods take a fingerprint of the k-mers (called a sketch) and work with these much smaller sets, ensuring less consumption of computational resources.The most popular sketching method for many years has been MinHash, introduced in the context of document comparisons [6].Mash [50] was developed to apply MinHash to genomic data and has been very widely used.However, recent studies have shown that the error in comparing two MinHash sketches depends on the sketch size [53], and the size needs to grow quadratically to compensate for the error [50,Fig S1].More recent works have shown that when sets of very dissimilar sizes are compared, MinHash sketches perform relatively poorly [42,37].Researchers have proposed many adjustments to MinHash to tackle this issue; using a variable sketch size (instead of MinHash's fixed-size scheme), the recently introduced FracMinHash sketch [27,20] is one such example.Simply speaking, a FracMinHash sketch retains a fraction of the k-mers in the original sample.Formally, given a perfect hash function h : Ω → [0, H] for some H ∈ R and a scale factor s where 0 ≤ s ≤ 1, a FracMinHash sketch of a set A is defined as follows: The scale factor s is a tunable parameter that can modify the size of the sketch.For a fixed s, if the set A grows larger, the sketch FRAC s (A) grows proportionally.This sketching technique was first introduced in the software package sourmash [7,52], and has recently been theoretically analyzed to justify its use [20].sourmash has successfully used FracMinHash to obtain genome-wide comparisons of biological sequences.In the second round of Critical Assessment of Metagenomic Interpretation (CAMI) challenges [57,47], the tool sourmash gather [26], exhibited the highest completeness and purity in taxonomic profiling from metagenomic data for multiple datasets [47] across the genus and the species levels.
Figure 1: An overview of the pipeline used to identify genes/KOs present in a metagenome.The pipeline requires a list of KOs to calculate the relative abundance of these KOs in the sample.FracMinHash sketches of the KOs and the metagenome sample are computed using the command sourmash sketch, and then sourmash gather is invoked to identify the KOs using these sketches.

sourmash accurately identifies KOs using less computational resources
Our pipeline (shown in Figure 1) to identify KOs in a metagenome uses the KEGG database coupled with the sourmash gather command in the sourmash software package.The pipeline is detailed in the Methods section.We start presenting the results by showing the performance of this pipeline in identifying KOs in simulated metagenomes.Details of these simulations are elaborated in the Methods section.
We made comprehensive comparisons with the tools KofamScan and DIAMOND.We compared with DIAMOND because this is the fastest and most widely used protein alignment tool, being more than 20,000 times faster than BLASTX [8].We also compared our pipeline with KofamScan, primarily because this is a KO-identifying tool developed specifically for the KEGG database, and our use of the KEGG database naturally warrants this comparison.KofamScan uses hmmsearch to directly assign KO IDs to reads in a metagenome, whereas DIAMOND assigns gene id to a sequencing read.We summarized the outputs generated by DIAMOND using gene-to-KO mapping available in the KEGG database.
From our simulation experiments, we found that KofamScan fails to scale to metagenomes with millions of reads (taking more than seven days to complete on a simulated metagenome with 1M reads) -making it an impractical choice for this task.Nevertheless, because KofamScan was developed so closely with the KEGG database, we present the comparison in this manuscript.
The results for metagenomes with millions of reads are shown in Figure 2 (a and b).We see that sourmash gather slightly outperforms DIAMOND for these settings in terms of both completeness and purity as the sample is sequenced in more depth.We used two different k-mer sizes when running sourmash.In these experiments, we used a single active thread to run the sourmash gather program, and 64 threads to run DIAMOND to generate these results.The computational resources (total CPU time and memory) to generate these results are shown in Figure 2 (c and d).Clearly, sourmash gather outputs the list of KOs faster than DIAMOND.This is expected, given the computation-intensive alignments performed by DIAMOND  The simulated metagenomes were annotated to the truly present genes and KOs using brute force.Then, the outputs of the tools were matched against this ground truth.For a particular setting, ten different seed values were used to simulate a metagenome -the error bars indicate one standard deviation over these seeds.We found that sourmash shows high purity and completeness, just as DIAMOND -but sourmash consumes less computational resources(13.7-20.7%less memory, and 42-51x less CPU time).
to identify KOs.On the other hand, the use of lightweight sketches allows sourmash to avoid alignment altogether, and identify the list of all present KOs more accurately, using fewer computational resources.We also investigated the performance of DIAMOND, KofamScan, and our pipeline for much smaller-sized metagenomes (10-50K reads).These metagenomes were simulated from the same 1000 randomly selected bacterial genomes, and hence barely cover the diversified genes present in these genomes (with a low coverage).Therefore, these low-coverage samples would hardly be representative of real-world data.Nevertheless, we include these results to highlight the computationally intensive performance of KofamScan.
Our experiments revealed that when applied to low-coverage samples, sourmash has relatively high purity compared to the other two tools (shown in Figure 3b), i.e. almost all KOs that it identifies are truly in the sample.The completeness (shown in Figure 3b) is not so convincing compared to DIAMOND, i.e. sourmash cannot recover all KOs that are present when the sequencing is too shallow.Nonetheless, it does recover the most highly abundant KOs and is much faster and more lightweight compared to the other two tools (see running time and memory consumption in Figures 3c and 3d).We also found that KofamScan has exceptionally high resource requirements, and yet did not show promising performance.
We conclude this section with the following remarks: sourmash clearly is the better choice when highcoverage samples are available.If the sequencing data does not have high coverage, then sourmash can recover the most abundant orthologous genes with high precision and may miss some gene groups whose presence is not so dominant.In all cases, sourmash runs faster, and consumes less memory.

sourmash reveals functional insights of human metagenomic data
After demonstrating the accuracy and improved efficiency of functional profile via FracMinHash sketches in simulated data, we subsequently employ this approach on actual human metagenomic data from the Human Microbiome Project (HMP) [66].We use the KEGG (Kyoto Encyclopedia of Genes and Genomes) database [31], with KEGG Orthology (KOs) specifically, to profile metagenomic functions.The analysis encompasses a total of 1747 filtered samples, spanning 6 ethnic groups, 2 genders, and 3 study groups.
Functional profiles reveal the presence of housekeeping KOs and pathways consistently observed across the majority of samples, irrespective of conditions (Fig. 4).In addition to the core housekeeping KOs present in most samples, numerous other KOs (functional units) appear in just one or two conditions.The top hits of most observed KOs include K01872 for Aminoacyl-tRNA biosynthesis, K03701 for excinuclease ABC subunit A, and K02004 for (putative) ABC transport system permease protein.Most of the top hits are recognized as vital cellular functions, thus constituting a compilation of housekeeping functions.More universal functional insights of human microbiota can be obtained from the right tail of Figure 4b.We further checked the functional pathways at KEGG level 2 and level 3 and found that genes in Biosynthesis of amino acids (Ko01230), Carbon metabolism (Ko01200), and Amino sugar and nucleotide sugar metabolism (Ko00520) are abundant in most samples.These observations are consistent with the previous functional analysis in human gut microbiomes [79,40,65,51], and suggest that these, along with Purine metabolism (Ko00230), Amino sugar and nucleotide sugar metabolism (Ko00520), and Glycolysis / Gluconeogenesis (Ko00010) etc., are the core functions of the gut microbiota.
Next, we analyze the distinct functions among different conditions (Type 2 Diabetes, T2D; Healthy, HHS; and Inflammatory Bowel Disease, IBD).We conducted a LEfSe analyses [58] to unveil the key functional units/pathways that underlie the distinctions between the condition T2D vs. HHS and IBD vs. HHS.
Overall, our analysis identified a total of 17,225 KOs and 476 pathways across all samples.When applying the default parameters in LEfSe (log10 LDA > 2 and adjusted p-value < 0.05) to detect differentially abundant functional units, we identified 206 KOs and 71 pathways in the IBD group; 316 KOs and 83 pathways in the T2D group compared to the healthy control (HHS group).Figure 5 displays the top 10 results from the comparison between T2D and HSS.The most notable finding, map01100, indicates a strong link to Carbohydrate metabolism, a well-established association in the context of type 2 diabetes [64].Likewise, we've detected functions related to glycan degradation (map00511) and sphingolipids (map00600), both of which have substantial supporting evidence for their involvement in disease status or development [11,68,76,28,73,34,1,72,24].These observations substantiate the reliability of these functional profiles, thus signifying that the utilization of FracMinHash in conjunction with the KEGG database may serve as a fast and accurate hypothesis-generating methodology within the realm of functional analysis.
Digging into the KO-level analysis, the most significantly differentially abundant hit is NF-kappa-Bactivating protein (K25931), which may play a key role in the pathogenesis of vascular complications of diabetes [62].We also found starch-binding outer membrane protein (K21572), hydrophobic/amphiphilic exporter-1 (K03296), and carboxyl-terminal processing protease (K03797) etc.All of these discoveries may provide valuable insights for further investigating the functional roles of the gut microbiome.

sourmash-generated functional profiles can be used to compute Functional UniFrac distance
These FracMinHash-generated profiles can easily be coupled with other tools in further downstream functional analysis.In this section, we demonstrate this using two dimension reduction techniques: Principle Coordinate Analysis (PCoA) on pairwise Euclidean distances between profiles, and multidimensional scaling (MDS) using a recently developed method FunUniFrac [38].FunUniFrac stands for functional UniFrac.It is a dissimilarity metric for metagenomic functional profiles derived from the traditional UniFrac metric [45].The computation of such UniFrac-like metrics requires an underlying tree structure that reflects some sort of intrinsic hierarchy or distance between the components (in our case, the KOs) in the samples.FunUniFrac uses the "KEGG KO tree" (see Subsection 4.2.2) from the KEGG database with branch lengths assigned based on pairwise distances between KOs obtained using sourmash sketch.This underlying KEGG tree allows the dissimilarity between two functional profiles to be quantified and potentially the differentially abundant protein functions to be identified.
For this demonstration, we used the same HMP dataset as used in Section 3.2 which consists of samples from healthy individuals (HHS) as well as individuals with type 2 diabetes (T2D) and inflammatory bowel disease (IBD).Using the functional profiles as input, we computed the pairwise FunUniFrac distances for T2D vs. HHS and performed MDS on the resulting pairwise distance matrices for visualization.We also clustered the functional profiles based on their pairwise Euclidean distances and performed a PCoA analysis.The comparisons of the results of these two clustering methods are presented in Figure 6.From this figure, we can observe their similarity in the segregation of the samples from these two environments, which agrees with prior discoveries of a connection between gut microbiome and T2D [2,33,54,19,5].This demonstrates the versatility of these FracMinHash-generated functional profiles as inputs for downstream analyses in producing biologically meaningful results.

Overview of the pipeline
We begin with an overview of our KO identification pipeline, which differs from traditional gene identification tools (which mostly use alignment) by primarily utilizing sourmash gather, which breaks genes and reads into k-mers and approximately solves a set cover problem [26], producing a list of genes 'covering' observed k-mers in the metagenome.The problem with using sourmash gather with a list of genes is that genes are     tinier compared to the sequences sourmash gather is designed to work on and that the number of genes in a known database can be very large.In the KEGG database, we found about 8M genes in the bacterial genomes.This large number can make the approximation algorithm in sourmash gather very slow and impractical to use.In initial investigations, we found that for even very small metagenomes (about 30 genomes, 150K reads in the metagenomes, and approximately 100K genes), it was computationally infeasible (15+ hours), and therefore, we abandoned this direction.Indeed, sourmash is intended to sketch many large sequence sets, not massive amounts of small sequences.Therefore, instead of finding genes, we focus on finding the orthologous gene groups in a metagenome.The KEGG database groups genes into orthologous groups called 'KEGG Orthologs' (referred to as KOs henceforth).The number of KOs (a total of only 25K) is much smaller than the number of genes, and the number of k-mers in a KO is much larger than that of a single gene.Considering these factors, we designed our pipeline to invoke sourmash gather with a list of all KOs in the KEGG database, and then to output a list of KOs that 'cover' all observed k-mers in a given metagenome.

Simulated data
We used BBMap [9] to simulate a metagenome from 1000 randomly selected genomes from all 4498 bacterial genomes present in the KEGG database.The genomes were obtained from the KEGG database, as of 1 June, 2023.Genomes used in the simulations were selected uniformly, with each genome being equally likely to be included in the simulation.Tiny sequencing error was introduced in all simulated metagenomes, with insertion, deletion, and substitution rates of 0.001 (0.1%) each.BBMap was configured to store the chromosome coordinates of the sequencing reads when generating the metagenome file.We also collected the start and end positions of all 7956912 genes in these bacterial genomes.These start and end coordinates of the genes allowed us to construct a ground truth, i.e. which genes are actually present in the simulated metagenome.Using the gene-to-KO mapping available in the KEGG database, we then computed the ground truth at the KO level: which KOs are present in the simulated metagenome.This ground truth allowed us to measure the performance (completeness and purity) of the tools, presented in Figures 2 and 3.
We generated simulated metagenomes with the number of reads varying from 10K to 50K, and 2M to 10M.For each of these settings, we simulated 20 metagenomes with different seeds.

Reference data
The KEGG data, including KEGG Orthology (KO), gene sequences, and protein sequences), were downloaded from the KEGG FTP server, as of 1 June 2023.We used the KO identifiers that belong to the KEGG brite hierarchy Ko00001 which contains 25,413 KOs.After filtering out the KOs without protein sequences, we were left with a total of 25,347 KOs used for the downstream analysis.The hierarchical structure of Ko00001 was used as "KEGG KO tree".
We employed sourmash to establish FracMinHash sketches from amino acid sequences from those 25,347 KOs.These FracMinHash sketches were formulated using the command sourmash sketch with parameters -p protein, k=7, k=11, k=15, abund, scaled=1000 (that is to say, we use k-mers for amino acid sequences and generate FracMinHash sketches using a scaling factor of 1000 for three different k values: 7, 11, and 15.Besides, k-mer abundances are tracked for quantification purposes).Subsequently, a sequence bloom tree (SBT) was constructed from the sketched KO database to accelerate the profiling process, though this is not required.These reference data are freely available on Zenodo.

Real metagenomes
All human metagenomic samples were sourced from the HMP data portal [66], using specific search terms: "feces" for body sites, "fastq" for format, and "WGS raw seq set" for type.On January 2023, a total of 3550 files were retrieved, tagged under the manifest id "160bdc491e".However, for quality assurance, additional filters were applied: (1) 185 files from the "MOMS-PI" project remained private; (2) 6 files were eliminated due to broken links; (3) 691 files exhibited unmatched MD5 values compared to metadata; (4) 4 files were corrupt; and (5) 148 files were identified as assembled scaffolds.Consequently, a final selection of 2516 files (a total of 4.6TB) was used.In subsequent analysis, we found numerous samples with very low sequencing depth.Therefore, we excluded files with sizes less than 200MB and files with less than 1000 KOs identified.After all these filtered steps, we have 1747 high-quality data remaining for the downstream analysis, including 547 healthy samples, 274 type 2 diabetes samples, and 926 samples related to inflammatory bowel disease.

Functional profiles of HMP data
We used command sourmash sketch with parameters -p protein, k=11, abund, scaled=1000 for FracMin-Hash sketches of the HMP data (though it's suggested for the reference data to have multiple k values to fit various purposes, one proper k size is sufficient for functional profiles).Next, functional profiles of them were generated using sourmash gather with parameters -k 11 --protein --threshold-bp 500 (k = 11, which corresponds to 33 in DNA sequences, is a reasonable value for protein sequence comparison).It's worth mentioning that sourmash gather provides comprehensive information, among which, we utilized "f unique weighted" since this represents the relative abundance of each KO.

Differential analysis
LEfSe (Linear discriminant analysis Effect Size) was performed to determine enrichment in functional profiles of pairs of conditions.We adhered to the default settings for this analysis.Features (KO or pathway) with LDA score (log10) higher than 2 and adjusted p-value < 0.05 were recognized as significant.We used the Seaborn package in Python for plotting.For simplicity, we only showed partial results in the manuscript.

Conclusions
In this manuscript, we combined the FracMinHash sketching technique along with the KEGG orthology to construct a robust pipeline for generating functional profiles from metagenomic data.Our findings demonstrate that the functional profiles derived through this pipeline exhibit advantages in terms of both completeness and purity when compared to DIAMOND, a widely used alignment-based tool for functional profiling.Moreover, our findings highlight the versatility of k-mer sketches, not just in taxonomic profiles but also in large-scale functional profiling, delivering enhanced scalability and effectiveness.It is worth noting that the FracMinHash may not be the ideal choice for the analysis of many small reference sequences in practice (e.g.millions of genes in the KEGG database) due to its top-down approximation algorithm.Opting for KEGG orthology, a family of genes sharing similar functions, is more appropriate in this situation.
By using this pipeline, we conducted a comprehensive functional annotation analysis on all available human gut metagenomic data from the Human Microbiome Project [66], comprising a total of 2.5K samples.We successfully generated sample-specific functional signatures and performed differential analyses across different conditions.The top signatures we identified align with existing literature.Moreover, the KO-based functional profiles expanded our insights by introducing additional disease-related functional units.To further illustrate the benefits of using the KEGG hierarchical structure on these KO profiles, we showcased their integration with the recently developed FunUniFrac method, highlighting how they can facilitate hierarchybased sample comparisons.We believe this pipeline holds significant value in the realm of metagenomic functional profiles, especially when scalability and hypothesis-generating analyses are paramount.

Going beyond functional profiles
Gene Ontology (GO) enrichment analysis is widely used for understanding biological processes in experimental datasets.KEGG is such a comprehensive database that offers structured modules and pathways that delineate functional elements, associations with diseases, and interactions with drugs.It serves as an invaluable resource for exploration and hypothesis generation in diverse research endeavors.The potential of this direction can be fully exploited through the integration of functional profiles with meticulously curated biomedical knowledge graphs (e.g.Hetionet [22], RTX-KG2 [71]).This integration represents an exciting avenue for advancing our comprehension of intricate biological systems: metagenomic functional profiles provide insights into the functional potential of microbial communities; KEGG's rich resource of hierarchical modules and pathways offers a structured framework to interpret and contextualize these functions; and knowledge graphs, which link genes, proteins, diseases, drugs, and other biological entities, can facilitate the discovery of underlying relationships between different entities.By connecting functional profiles with biomedical knowledge graphs, researchers can explore the functional intersections of diseases [55,49], identify potential therapeutic targets [75,78,46], and predict novel associations between microbial functions and human health [15,43].Pursuing this integrated approach can not only aid in hypothesis generation but also pave the way for innovative research.
Total CPU time required to identify KOs.Memory usage to identify KOs.

Figure 2 :
Figure2: Average performances and computational resources of various tools in identifying KOs in simulated metagenomes.The simulated metagenomes were annotated to the truly present genes and KOs using brute force.Then, the outputs of the tools were matched against this ground truth.For a particular setting, ten different seed values were used to simulate a metagenome -the error bars indicate one standard deviation over these seeds.We found that sourmash shows high purity and completeness, just as DIAMOND -but sourmash consumes less computational resources(13.7-20.7%less memory, and 42-51x less CPU time).
Memory usage to identify KOs.

Figure 3 :
Figure3: Average performances and computational resources of various tools in identifying KOs in simulated metagenomes.The simulated metagenomes were annotated to the truly present genes and KOs using brute force.Then, the outputs of the tools were matched against this ground truth.For a particular setting, ten different seed values were used to simulate a metagenome -the error bars indicate one standard deviation over these seeds.
(a) Functional landscape of the gut microbiota.(b) Frequency of KOs detected in all samples (c) Frequency of KEGG pathways detected in all samples

Figure 4 :
Figure 4: Functional landscape of the gut microbiota.(a) Heatmap of KO occurrence in all gut metagenomic samples.Samples are clustered by KO profile and the left color bar indicates the metadata of the sample: subject, gender, race, and study group.The green color indicates the presence of a given KO and a vertical green "line" suggests that a KO shows up in most of the samples.(b) Frequency distribution of KO occurrence: the right side contains KOs identified in most of the samples, i.e. "housekeeping".(c) Frequency distribution of KEGG pathway occurrence.

Figure 5 :
Figure 5: Differential analysis for T2D vs HSS.(a) Top 10 differential KOs in T2D samples compared to healthy samples.(b) Top 10 differential KEGG pathways in T2D samples compared to healthy samples.

Figure 6 :
Figure 6: Functional profiles recapitualate previously ovserved patterns: a) PCoA plot of functional profiles of gut microbiomes based on Euclidean distances for healthy and type 2 diabetes patients.b) MDS plot of samples clustered based on FunUniFrac distances computed using the same functional profiles; showing correspondence in observable patterns.