Open Biomedical Network Benchmark A Python Toolkit for Benchmarking Datasets with Biomedical Networks

Over the past decades, network biology has been a major driver of computational methods developed to better understand the functional roles of each gene in the human genome in their cellular context. Following the application of traditional semi-supervised and supervised machine learning (ML) techniques, the next wave of advances in network biology will come from leveraging graph neural networks (GNN). However, to test new GNN-based approaches, a systematic and comprehensive benchmarking resource that spans a diverse selection of biomedical networks and gene classification tasks is lacking. Here, we present the Open Biomedical Network Benchmark (OBNB), a collection of benchmarking datasets derived using networks from 15 sources and tasks that include predicting genes associated with a wide range of functions, traits, and diseases. The accompanying Python package, obnb, contains reusable modules that enable researchers to download source data from public databases or archived versions and set up ML-ready datasets that are compatible with popular GNN frameworks such as PyG and DGL. Our work lays the foundation for novel GNN applications in network biology. obob will also help network biologists easily set-up custom benchmarking datasets for answering new questions of interest and collaboratively engage with graph ML practitioners to enhance our understanding of the human genome. OBNB is released under the MIT license and is freely available on GitHub: https://github.com/krishnanlab/obnb


Introduction
Life is orchestrated by remarkably complex interactions between biomolecules, such as genes and their products.Network biology [6,7,43,93] has demonstrated remarkable success over the past two decades in systematically uncovering genes' functions and their relations to human traits and diseases.Accurately identifying genes associated with a particular disease, for example, is a vital step towards understanding the biological mechanisms underlying the condition, which in turn could lead to novel and effective diagnostic and treatment strategies [96,51].Early work in network biology focused on network diffusion-type methods based on the guilt-by-association principle [69], which states that genes interacting with each other likely participate in the same biological functions or pathways.The performance of these methods has subsequently been improved by the application of supervised learning [67].In this trajectory, the next wave of improvements in network biology is likely to emanate from the surge of powerful graph machine learning (ML) techniques such as graph embeddings [27,14] and graph neural networks [100,106].These methods have shown promising results in many graph-structured tasks, such as social networks [34], and have started to attract researchers to apply them to biological tasks [5,104,68].To this end, accelerating the development and application of graph ML methods in network biology is of great importance.
However, there is a critical need for standardized benchmarks that allow reliable and reproducible assessment of the novel graph ML methods [81,34].Recent efforts such as MoleculeNet [99] and Therapeutics Data Commons [37] for molecular and therapeutics property predictions, and Benchmarking GNN [23] and Open Graph Benchmark [34,35] for more general graph benchmarks, have proven valuable in advancing the field of graph ML by providing carefully-constructed benchmarking datasets for applying specialized methods.Meanwhile, such comprehensive benchmarking datasets and systems are currently lacking for network biology.Furthermore, setting up ML-ready datasets for network biology is incredibly tedious.Some necessary steps include converting gene identifiers (IDs), setting up labeled data from annotated biomedical ontologies, filtering labeled data based on network gene coverage, and constructing realistic data splits mimicking real-world biologically meaningful scenarios.As a result, despite the remarkable amount of publicly available data for biomedical networks [36,44] and annotations [82], the only widely available ML-ready datasets for network biology are PPI dataset from OhmNet [108] and PPA from the Open Graph Benchmark [34].
Contributions In this work, we address this critical need.Our main contributions are as follows: 1. We present a Python package obnb that provides reusable modules for data downloading, processing, and split generation to set up node classification benchmarking datasets using publicly available biomedical networks and gene annotation data.The first release version contains interfaces to networks from 15 sources and annotations from three sources.2. We present a comprehensive benchmarking study on the OBNB node classification datasets with a wide range of graph ML approaches and tricks to set up the baseline for future comparisons.3. We analyze the benchmarking results and point out several exciting directions and the potential need for a special class of graph neural network to tackle the OBNB tasks.

Related work
Several existing Python packages share similar goals with obnb, primarily focusing on establishing biomedical network datasets and facilitating their analyses.In these networks, nodes typically represent genes or their products, such as proteins, while edges represent the functional relationships between them [36], such as physical interactions.PyGNA [24] offers a suite of tools for analyzing and visualizing single or multiple gene sets using biological networks.PyGenePlexus [63] and drug2ways [79] specialize in network-based predictions of genes and drugs.The OGB [34] platform houses a variety of graph benchmarking datasets, which includes a PPI dataset akin to the STRING-GOBP dataset in OBNB (Table S11).Nonetheless, all the aforementioned packages have a limited number of, if any, biomedical network and label data.obnb, on the other hand, provides an extensive number of biomedical networks and diverse gene set collections to facilitate the systematic evaluation of graph machine learning methods on diverse datasets.In a related domain, PyKEEN [3] provides a vast array of biomedical knowledge graph (KG) datasets and KG embedding methods.There, the main task of interest is link prediction, through which the missing knowledge link can be completed.Other notable works for constructing large-scale biomedical KG and setting up link-prediction benchmarks from them include BioCypher [57] and OpenBioLink [11].While the tasks associated with KG [95] are orthogonal to the node classification settings, it is possible to reformulate gene classification problems as KG completion and vice versa [5,104].Nevertheless, the advantages and drawbacks of these two approaches are yet to be comprehensively evaluated.

Systems description
Making the process of setting up ML-ready network biology benchmarking datasets from publicly available data as effortlessly as possible is the core mission of obnb.We implement and package a suite of graph (obnb.graph)and label (obnb.label)processing functionalities and couple them with the high-level data object obnb.data to provide a simple interface for users to download and process biological networks and label information.For example, calling obnb.data.BioGRID("datasets") and obnb.data.DisGeNET("datasets") will download, process, and save the BioGRID network and DisGeNET label data under the datasets/ directory, which can then be loaded directly next time the functions are called.Users can compose a dataset object using the network and the label objects, along with the split (Section 2.3), which can then be used by a model trainer to train and evaluate a particular graph learning method.Alternatively, the composed dataset object can be transformed into data objects for standard GNN framework, including PyTorch Geometric (PyG) [25] and Deep Graph Library (DGL) [94].Figure 1: Overview of obnb data processing and benchmarking dataset generation In the following subsections, we go into details about the processing steps for the biomedical networks (Section 2.1), creation of node labels from annotated ontologies (Section 2.2), and preparation of data splits (Section 2.3).Finally, as we are dedicated to creating a valuable resource for the community, we closely follow several open-source community standards, as elaborated in Appendix A.7.

Network
Downloading Currently, tens of genome-scale human gene interaction network databases are publicly available, each constructed and calibrated with different strategies and sources of interaction evidence [36].Unlike in many other domains, such as chemoinformatics, where there are only a few ways to construct the graph (e.g., molecules), gene interaction networks can be defined and created in a wide range of manners, all of which capture different aspects of the functional relationships between genes.Some broad gene interaction mining strategies include experimentally captured physical protein interactions [85], gene co-expression [45], genetic interactions [19], and text-mined gene interactions [86].We leverage the Network Data Exchange (NDEx) [78] to download the biological network data when possible (Figure 1a).The obtained CX stream format is then converted into a obnb.graphobject for further processing.

Gene ID conversion
In gene interaction networks, each node represents a gene or its gene product, such as a protein.There have been several standards in mapping gene identifiers, and different gene interaction networks might not use the same gene ID.For example, the STRING database [86] uses Ensembl protein ID [40], PCNet [36] uses HGNC [28], and BioGRID [85] uses Entrez gene ID [40].
Here, we use the NCBI Entrez gene ID for its advantages, such as supporting more species other than Human and being less ambiguous [33].To convert other types of gene IDs into the Entrez gene, we use the MyGene query service [98], which provides the most up-to-date gene ID mapping across tens of gene ID types.Following, we remove any gene where more than one gene ID is mapped to the same Entrez gene, which indicates ambiguity in annotating the gene identifier currently.The gene interaction network after gene ID conversion will contain equal or less genes, all of which are one-to-one mapped from the original gene IDs to the corresponding Entrez genes.

Connected component
In practice, a small fraction (typically within 1%) of genes may be disconnected from the largest connected components.The disconnectedness of the network is typically due to missing information about gene interactions from the measurement; the more information a network uses to define the interactions, the denser and less likely there will be disconnected genes.Thus, we extract the largest connected component of the gene interaction network by default as it is more natural to have a single component for the transductive node classification tasks we consider.However, a user can also choose to take the full network without extracting the largest connected component by specifying the largest _ comp option to False when initializing the data object.

Label
Many biological annotations are organized into hierarchies of abstract concepts, known as ontologies [8].Each ontology term (a node in the ontology graph) is associated with a set of genes provided by current curated knowledge.By the hierarchical nature of the ontologies, if a gene is associated with an ontology term, then it must also be associated with any more general ontology terms.Thus, we first propagate gene annotations over the ontology, resulting in highly redundant gene set collection.Then, we pick a non-redundant subset of the gene sets with other filtering criteria to set up the final labeled data in the form of Y ∈ {0, 1} N ×p with N number of genes and p number of labels (Table 1b).

Annotated ontology
An ontology is a directed acyclic graph H = (V H , E H ), where v ∈ V H is an ontology term.The directed edge (u, v) ∈ E H indicates v is a parent of u, that is, v is more abstract or general concept containing u.Consider B as the set of all genes we are interested in, and P(B) be the power set of B. Given a gene annotation data, represented as a set of pairs A = {(b, v) i }, where b ∈ B and v ∈ V H are gene and ontology term, we represent the raw annotation J 0 : V H → P(B) as a map from an ontology term to a set of genes, so that J 0 (v) = {b : (b, v) ∈ A}.We then propagate the raw annotation J 0 (•) over the ontology H into a propagated annotation J(•), where ), as depicted in Figure 1b.
Downloading All the ontology data is downloaded from the OBO Foundry [82] (Figure 1a), which actively maintains tens of large-scale biological ontologies.In the first release, we focus on three different annotations, Gene Ontology (GO) [4], DisGeNET [77], and DISEASES [30].We naturally split GO into three different sub-collections, namely biological process (GOBP), cellular component (GOCC), and molecular function (GOMF), resulting in a total number of five label collections.

Filtering
The propagated annotations contain highly redundant gene sets, which may bias the benchmarking evaluations toward genes sets that commonly appear.To address this, we adapt the non-redundant gene set selection scheme used in Liu et al. [54].In brief, we construct a graph of gene sets based on the redundancy between two gene sets, and recursively extract the most representative gene set in each connected component (Figure 1b).In addition, we provide several other filtering methods, such as filtering gene sets based on their sizes, number of occurrences, and their existence in the gene interaction network, to help further clean up the gene set collection to be used for final evaluation.Advanced users can alter these filtering functionalities to flexibly set up a custom gene set collection to better suit their biological interests besides using the default filtering steps provided by obnb.

Data splitting
A rigorous data splitting should closely mimic real-world scenarios to provide accurate and unbiased estimations of the models' performance.One stringent solution is temporal holdout, where the training input and label data are obtained before a specific time point, and the testing data is only available afterwards [18,54].In practice, this temporal setting is often too restrictive and leads to insufficient tasks for evaluation [54].Thus, by default, we consider a closely related but less strict strategy called study-bias holdout.
6/2/2 study-bias holdout We use top 60% of the most studied genes with at least one associated label for training.The extend to which a gene is studied is based on its number of associated PubMed publications retrieved from NCBI.The 20% least studied genes are used for testing, and the rest of the 20% genes are used for validation.Notice that by splitting up genes this way, some of the tasks may get very few positive examples in one of the splits.Hence, we remove any task whose minimum number of positive examples across splits falls below a threshold value (set to 10 by default).For completeness, we also provide functionalities in obnb to generate random splits.

Dataset construction
The network (Section 2.1) and label (Section 2.2) modules provide flexible solutions for processing and setting up datasets.Meanwhile, we provide a default dataset constructor that utilizes the above modules with reasonable settings to set up a dataset given particular selections of network and label.The default dataset construction workflow is as follows.

Experimental settings
To provide solid baselines for future references, we benchmark a wide range of graph ML methods on our OBNB datasets.Due to space limits, we only present a subset of benchmarking results in the main paper and refer readers to Appendix C for all other results.Specifically, we only include datasets generated using the BioGRID or HumanNet network, combined with the GOBP or DisGeNET labels, resulting in four primary datasets.The two networks were chosen for their distinct characteristics: BioGRID is an unweighted graph whose edges indicate protein interaction evidence from high throughput experiments [85], whereas HumanNet is a weighted graph whose edges indicate much more diverse types of interactions such as gene coexpression and associations derived via literature text-mining [42].Meanwhile, the two selected label collections cover two broad classes of gene classification problems, namely, gene function prediction (GOBP) and disease gene association prediction (DisGeNET).We report the performance as the average test precision over prior (APOP) score across five seeds (see Appendix A.5 for the mathematical definition and the motivation of choosing APOP as our main metric).The statistics about networks and task collections for the primary datasets are shown in Table 1 (see Table S10 and Table S11 for all dataset statistics).

Baselines
We consider three general types of methods: (1) label propagation [105] that directly propagates label information over the graph, (2) graph embedding that first extracts the vectorial representations of each node in the graph, which are then used to fit a simple classifier, such as logistic regression, and (3) GNN that learns the mapping from each node to its label space end-to-end.All implementation details, including technical descriptions of the featurization, the backbone graph neural network architecture, and hyperparameter settings, can be found in the Appendix A.

Graph embeddings
We include three distinct featurizations in the main result: using the rows of the adjacency matrix as the node features (Adj) [54], using the node2vec embedding (N2V) [31], or using the Laplacian EigenMap embedding (LapEigMap) [9].Extended results using SVD, LINE [87], and Walklets [75] are available in the Appendix (Table S6).Each of these features is coupled with an ℓ 2 regularized logistic regression model (LogReg) for downstream prediction.
GNN We include multiple variations of two GNN, GCN [48] and GAT [90], in the main results.Extended results for SAGE [32], GIN [102], and GatedGCN [12] can be found in the Appendix (Table S6).One major challenge in applying GNN to OBNB datasets is the lack of canonical node features.This is unlike networks in other domains, such as citation networks, where node features are naturally defined using the paper content, such as bag of words [34].To tackle this problem, we experiment with a diverse selection of node featurization strategies, including using the one hot encoded log degree of the node, using node2vec, and many others.We provide detailed descriptions for the 15 different feature construction strategies in Appendix A.3.1, with extended results for GCN and GAT in Table S5.
Bag of Tricks (BoT) In addition to optimizing the model architectures, many tricks in model training and feature augmentations have also shown to be key factors for practically good performance in existing benchmarks [34].Here, we further investigate the usefulness of several popular tricks, including reusing training labels as node features (LabelReuse) [97] and performing post-correction to the predictions at test time via correct and smooth (C&S) [39].
4 Results and discussions

Traditional ML methods perform comparably to GNNs overall
Table 2 shows the overall performance for the selected model on the four primary benchmarking datasets (full baseline results in Table S6).Surprisingly, even after a rather extensive hyperparameter search and GNN architecture tuning (Appendix A.4), logistic regression using graph-derived features still performs comparably to GNNs.For example, node2vec achieves the best performance for GOBP prediction when using both BioGRID and HumanNet.Similar results are obtained when using the other networks (Table S6) and using the more standard random splitting strategy (Table S8).The superior performance of traditional ML methods over GNNs is in stark contrast with results reported in recent benchmarking studies [81,23,34], highlighting the unique characteristics of biological networks and the challenges in modeling them.In addition, we demonstrate that the proposed study-bias holdout splitting provides more stringent evaluations than random splitting (Table S8).

Different models have their own strength
Despite the overall rankings of different methods indicating the superiority of their gene classification capabilities, no single method can achieve the best results across all tasks.We demonstrate this in Figure 2, which shows that the performance of most methods does not correlate with each other across different tasks on the BioGRID-DisGeNET dataset.For example, while LogReg+Adj performs better than GAT+LogDeg overall on the BioGRID-DisGeNET dataset, GAT+LogDeg achieves significantly better predictions (t-test p-value < 0.01) for a handful of tasks (Appendix S9), such as iris disorder and ocular vascular disorder.
Focusing on the optimal model for one or a few tasks of interest is utterly essential.This is because, in practice, experimental biologists often come with only one or a few gene sets and want to either obtain new related genes or reprioritize them based on their relevance to the whole gene set [62].Therefore, understanding the characteristics of the task that for which a particular model works well over others is the key to making better architectural decisions for a new task of interest and, ultimately, designing new specialized GNN for network biology.

Homophily is a driving factor for good predictions
Homophily describes the tendency of a node's neighborhood to have similar labels to itself.Such effects are prevalent in many real-world graphs, such as citation networks, and have been studied extensively recently in the GNN communities as a way to understand what information can or cannot be captured by GNN effectively [59,61].Intuitively, homophily aligns well with the guilt-byassociation principle in network biology, which similarly states that genes that interact with each other are likely functionally related.Despite this clear connection, existing homophily measures, such as the homophily ratio [61], do not straightforwardly apply to the OBNB datasets for two reasons.
1. Most established work on homophily consider multiclass classification task, where each node has exactly one class label, contrasting with the multilabel classification tasks in OBNB. 2. Label information in OBNB datasets is incredibly sparse.On average, there are only hundreds of positive genes per class out of the total number of 20K genes.
One attempt to resolve the first issue is to compute the average homophily ratio for each class independently.However, due to the label scarcity, the resulting metric can not be readily interpreted.
In particular, the highest homophily ratio observed in our main benchmarking study is about 0.2 (Figure 3).Datasets with this value are typically categorized into heterophily (not homophily), contradicting the guilt-by-association principle.To address this inconsistency, we propose a corrected version of the homophily ratio that takes label scarcity into account (see Appendix A.6).The main idea is to measure homophily as a relative measurement: how much more likely nodes labeled in class i are connected with nodes also labeled in class i than those not in class i?This measure is directly interpretable as it is the log fold change of the homophily between positive and negative examples.
Corrected homophily ratio characterizes prediction performance of graph ML methods As shown in the right two panels of Figure 3, the corrected homophily ratio is well correlated with the prediction performance across different networks and tasks.This positive correlation unveils the fundamental principle that graph ML methods rely on: the underlying graph must provide sufficient contrast between the neighborhoods of the positive and negative examples.Conversely, the left two panels of Figure 3 indicate that the standard homophily ratio fails to adequately account for the performance nuances in low homophily tasks As a result, OBNB opens up new research opportunities in understanding the effects of homophily in GNN by introducing a new type of homophily that differs significantly from those traditionally studied [61].

Graph based augmentation tricks offer most advantages on intermediate homophilic tasks
We next investigate the relationship between the corrected homophily ratio and the performance difference between a pair of methods by asking: is one method systematically better than another method on a particular range of homophily?The first and second rows of Figure S5 indicate that GNN augmented with different grpah-based tricks, such as C&S and node2vec features, provide most apparent advantage at intermediate homophily (corrected homophily ratio of ∼ 2.5).However, as the homophily increases further, the performance differences diminish, suggesting that all methods would perform equally perfectly as the graph provide more clear clues about the labels.Similar trends

Challenges for the OBNB benchmarks and potential future directions
Our systematic benchmarking study paves the way for numerous subsequent explorations.Below, we outline the primary challenges associated with OBNB and suggest potential areas for future research.
Challenge 1. Lack of canonical node features Traditional network biology studies solely rely on biological network structures to gain insights [7,6,49].Meanwhile, the presence of rich node features in many existing graph benchmarks is crucial for the success of GNNs [55], posing a challenge for GNNs in learning without meaningful node features.An exciting and promising future direction for obtaining meaningful node features is by leveraging the sequential or structural information of the gene product (e.g., protein) using large-scale biological pre-trained language models like ESM-2 [53].
Challenge 2. Dense and weighted graphs Many gene interaction networks are dense and weighted by construction, such as coexpression [45] or integrated functional interactions [86].In more extreme cases, the weighted networks can be fully connected [29].The networks used in the OBNB benchmark are orders of magnitude denser than citation networks [81] (Table 1a).Thus, scalably and effectively learning from densely weighted gene interaction networks is an area of research to be further explored in the future [56].One potential solution would be first sparsifying the original dense network before proceeding to train GNN models on them, either by straightforwardly applying an edge cutoff threshold or by using more intricated methods such as spectral sparsification [84].
Challenge 3. Scarce labeled data Curated biological annotations are scarce, posing the challenge of effectively training powerful and expressive models with limited labeled examples.Furthermore, a particular gene can be labeled with more than one biological annotation (multi-label setting), which is much more complex than the multi-class settings in popular benchmarking graph datasets such as Cora and CiteSeer [81].The data scarcity issue naturally invites the usage of self-supervised [101] graph learning techniques, such as DGI [91], and knowledge transfer via pre-training, such as TxGNN [38].However, these methods still face the challenge of missing node features (challenge 1).
Challenge 4. Low corrected homophily ratio tasks Our results in Figure 3 show that tasks with low corrected homophily ratios tend to be more challenging to predict, indicating that current tested methods are limited to local information of the underlying graph.This naturally opens up opportunities for designing models that (1) better capture long-range dependencies [22] and (2) exploit higher-order structural information [65].

Conclusion
We have developed obnb, an open-source Python package for rapidly setting up ML-ready benchmarking graph datasets using diverse biomedical networks and gene annotations from publicly available sources.obnb takes care of tedious data (pre-)processing steps so that network biologists can easily set up particular datasets with their desired settings, and graph ML researchers can directly use the ML-ready datasets for model development.We have established a comprehensive set of baseline performances on OBNB using a wide range of graph ML methods for future reference and pointed out potential improvements that could further enhance performances.Together, OBNB will help accelerate the development of advanced graph ML methods in network biology toward furthering our understanding of the complex genetic basis of human traits and diseases.

A Additional information A.1 Example code for obnb
The obnb library provides high-level APIs for users to set up benchmarking datasets from diverse selections of gene interaction networks and gene annotations to study the performance of various graph learning methods.Below, we provide a simple code snippet demonstrating how to set up a full dataset in a single function call.Further, we demonstrate how the constructed dataset can be directly used to evaluate the performance of a simple GNN model.We also invite readers to check out the README file of the obnb library for more comprehensive example usages: https: //github.com/krishnanlab/obnb/blob/main/README.md.

A.2 Benchmarking experiments implementation information
The obnbench benchmarking codebase utilizes PyTorch [70] and PyTorch Geometric (PyG) [25] for building deep (graph) neural network models.In addition, we leverage Weights & Bias [10] for tracking training results and Hydra [103] for managing experiment configurations.All benchmarking experiments are run using computational nodes with five CPUs, 45GB memory, and a Tesla V100 GPU (32GB).The basic GNN architecture we used contains the following components:

Raw
Feature encoder Since the genes do not come with canonical features, we need to derive initial node features for the GNN model.The default option we used is OneHotLogDeg with 32 bins (A.3.1).The raw features are first processed by a custom batch norm layer that is activated during testing in addition to training.Then, the processed features are projected into the hidden dimension (d = 128 by default) of the model via a linear layer, followed by batch norm and ReLU activation.

Message passing layers
We use five message passing layers in the GNN model by default.Each message passing layer contains a convolution block (A.3.2),followed by normalization, non-linear activation, and dropouts.We apply residual connection by summing up the input and the output of the graph convolution block.
Prediction head and post-processing Finally, we apply a linear layer to project the hidden dimension to the dimension matching the number of tasks and apply a sigmoid activation to turn the prediction into binary prediction probabilities.Optionally, we apply correct and smooth (C&S) [39] at test time to correct the predicted probabilities via a two-step label propagation (A.3.3).

A.3.1 Node feature design
In this section, we provide an overview of a diverse selection of node feature constructions we used in our benchmark.All node features are constructed using the whole network as input in a transductive setting, except for a few cases where the node features do not depend on the network structure, such as constant features.By default, any initial node feature will be 128-dimensional unless otherwise specified.We start by designing a collection of simple statistics that can be easily derived.
Constant uses a one-dimensional trivial feature for every node in the graph.
RandomNormal samples 32-dimensional features for each node independently via a standard normal distribution.
OneHotLogDeg (short for LogDeg) first computes the log degree of each node in the graph and then uniformly bins the nodes into one of 32 bins based on their log degree.The one-hot encoded node degrees approach has recently been shown to be a great structure encoder, whose utilization can sometimes result in performance superior to using the original node features associated with the graph [17,55].Meanwhile, the design choice of using log-uniform grids stems from the scale-free nature of biological networks [2].
RandomWalkDiag is the landing probability of a node back to itself after k hops, which is also commonly referred to as the random walk structural encodings (RWSE) [21].It has been used widely in many graph transformer models due to its expressiveness in capturing graph structure [52].Next, we consider several node feature options derived directly from the adjacency matrix.
Adj uses the rows of the adjacency matrix as the node feature.It has been shown previously [54] that logistic regression using the adjacency matrix produced better prediction performance than the commonly used label propagation algorithm for diverse gene classification problems.
RandProjGaussian and RandProjSparse are random projections of the adjacency matrix using two different but related approaches.We use the scikit-learn [71] implementations (GaussianRandomProjection and SparseRandomProjection) to compute these features.
SVD uses the left singular vectors of the adjacency matrix as the node features.
LapEigMap [9] uses the (ℓ 2 normalized) eigenvectors of the symmetric normalized graph Laplacian as the node features.
Node embeddings [68] are powerful approaches to extracting vectorial representations of each node in a graph and have shown promising results in many biomedical application [54,104,5].Thus, we include a few popular and good-performing node embedding options in our benchmark.
LINE1 and LINE2 [87] extracts first-and second-order proximity information from the graph to train the underlying embeddings.We use the GraPE [15] implementation to compute the LINE embeddings.
Node2vec [31] extracts node representation using word2vec [64] on node sequences sampled from the graph via biased random walks.The biased random walks, in contrast to an earlier work, DeepWalk [74], which uses an unbiased search, allow the searching strategy to be more flexible, mimicking either breadth-first search or depth-first search in the random walk phase.
Walklets [75] is similar to botch Node2vec and DeepWalk in that it runs word2vec using random walks sampled from the graph.However, wallets sample node pairs from the random walk with a specific number of hops, allowing a more explicit control of the multiscale relationships between nodes.
Finally, we experiment with options that let the model learn the node features freely.
Embedding lets the model learn the node features freely.
AdjEmbBag is similar to Embedding but with an additional aggregation step that sums up the raw embedding of each node from the central node's neighborhood.

A.3.2 Graph neural network summary
In this section, summarize the five tested GNN models under the message-passing framework [26].
Let h l i be the raw representation of vertex i at layer j, and hl i be the corresponding processed representation, e.g., after non-linear activation and normalization.The message-passing framework is written as where is the aggregation operator, f is the update function, ϕ n and ϕ s are message functions for neighbors and self.Different GNNs differ in the choices of , f , ϕ n and ϕ s they use.
GCN [48] uses a scaled linear transformation as the message function.The scaling factor is computed as the normalized edge weights with added self-loop.The aggregation is done by summing up the transformed message functions.
where di = d i + 1 = j∈Ni e i,j is the degree of vertex i with self-loop added.
SAGE [32] uses two separate linear transformations as the message functions for neighbors and self.The aggregation is done by taking the sum1 of the neighbors' messages and then adding it to the self-message.We additionally use an affine update function to transform the aggregated messages.
GIN [102] uses a multi-layer perceptron (MLP) to update the aggregated messages, which is done by summing up neighbors' representations and self-representation from the last layer.
GAT [90,13] uses an attention mechanism to distribute the weights from which the linear transformed messages are aggregated.
In particular, the original GAT paper [90] formulated the attention scores as follows However, it was pointed out in a more recent work, GATv2 [13], that the above formulation was fundamentally limited in the types of attention the model can learn and proposed a correction that showed stronger performances follow.
We also observe a slight but significant improvement when using GATv2 attention as opposed to the original GAT.Thus, all reported GAT results are based on v2 attention.
GatedGCN [12] uses gating mechanisms to process neighbors' messages, with linear transformations as the message functions.The aggregation is done by summing up the gated messages from neighbors and adding them to the self-message.
where ⊙ is the elementwise multiplication operator, and the gating coefficients η i,j are computed as

A.3.3 Label propagation and C&S
Label propagation iteratively propagates the label information from the source nodes outwards through the network neighborhoods [49,16,66].Let Y ∈ {0, 1} n×d be the (training) label matrix, and M ∈ R n×n be the diffusion operator.The propagated information at step t, denoted F t ∈ R n×d , is defined as where F 0 = F is the features to be propagated, which is the label matrix Y in the case of label propagation.The propagated feature is computed by repeating the propagation above until convergence: In practice, equation 11 is approximated by applying the propagation until the changes are small enough.
The original label propagation paper [105] uses the symmetric normalized adjacency matrix D − 1 /2 AD − 1 /2 as the diffusion operator, where A and D are the adjacency matrix and the corresponding diagonal degree matrix.Here, we use the column stochastic matrix D −1 A as the diffusion operator to resemble the random walk with restart (RWR) implementation that is more commonly used in the network biology literature [16].We use a propagation parameter α of 0.1 (equivalent to a restart parameter of 0.9) for label propagation.
C&S implements the idea of using label propagation 2 to (1) correct for the error made by the model and (2) smooth out the corrected predictions.Specifically, let E = Y train − Z be the prediction error matrix, where Z is the predicted probabilities resulting from the model.C&S can be summarized as follows.
2. Correct the original prediction with a fixed scale s: Z = Z + s Ẽ.

Smooth out the corrected predictions: Z C&S = PROPAGATE( Z)
We use α = 1.0 for the correction step, α = 0.8 for the smoothing step, and a scaling factor of s = 1.5.

A.4 Training and hyperparameter details
All hyperparameters are listed in the configuration files for each model in the benchmarking repository 3 .We summarize the primary default hyperparameters for GNN in Table S1.
For logistic regression baselines, we use the SGD optimizer with a constant learning rate of 200, weight decay of 1e-5, and momentum of 0.9.

A.4.1 Fully tuned GNN models
In addition to the baseline GNN experiments where we used the default settings elaborated above, we also tuned GCN and GAT specifically for the four primary benchmarks presented in the main paper with a bag of tricks (BoT).Specifically, we construct node features by concatenating Node2vec embeddings (d = 128), OneHotLogDeg encodings (d = 32), and LabelReuse.Finally, we apply correct and smooth (C&S) to the GNN predictions as a post-processing step.
We use the Bayesian search hyperparameter optimization strategy provided by Weights&Biases [10] and optimize for the validation APOP scores to tune the dataset-specific hyperparameters for GCN and GAT on primary results presented in Table 2.The search space is listed in Table S2 and the final hyperparameter settings are summarized in Table S3,S4.

A.5 The log2 fold change of average precision over the prior metric (APOP)
APOP is computed by taking log 2 of the ratio between the average precision and the prior.The prior is computed as the ratio between the number of positives and the total number of samples, which where R n and P n are the recall and precision and precision at the n th prediction score threshold.
The average precision is also related to the AUPRC 4 , which has been shown to be more suitable than the AUROC5 in the case of class imbalance [80,83].In our dataset, class imbalance is prevalent, where each class has only one or two hundred of positive examples but thousands of negative examples (Table 1b,S11).Moreover, AUROC is more tolerant of errors made on top predictions and generally only requires that the global distribution of the predictions made is consistent with the true labels.AUPRC and AP, on the other hand, penalize the top predictions' errors more.

Practical relevance
The OBNB benchmarks cover diverse gene prioritization tasks, such as pinpointing relevant genes for a particular disease.This formulation can also be straightforwardly extend to drug recommendation or drug repurposing tasks, by considering known drug targets (genes) as positive examples.For such biomedicine applications, in practice, only a few top predictions made by a predictive method can be experimentally verified by experimentalists due to the high experimental costs.Thus, APOP's emphasis on top predictions is well-aligned with this practical constraint and encourages methods to make highly accurate top predictions.

A.6 Corrected Homophily Ratio
Let G = (V, E, w) be a weighted graph (unweighted graphs can be treated as weighted graphs with identical edge weights), with node set V , edge set E, and the edge weight function w : V × V → R. Let y i : V → {0, 1} p be the label function for the i th class (or label), for i ∈ {1, . . ., p}.
We define the set of labeled nodes V labeled as the ones that are part of at least one label, i.e., V labeled = {v ∈ V | max i∈1,...,p y i (v) = 1}.Furthermore, we define the positively and negatively labeled nodes for the i th class as Definition 1 (Node homophily ratio).The node homophily ratio of the i th class for node v is defined as the ratio of its neighborhood that is positively labeled in the i th class.
and negative homophily ratio).The positive and negative homophily ratios of the i th class (or label) are defined as the ratio of a node's neighborhood that are positively labeled for the i th class averaged over the positively and negatively labeled nodes, respectively We refer to the positive homophily ratio as the homophily ratio for short.Note that definition 1 is slightly different from the typical definition of node homophily that is used in previous works targeting multiclass classification settings [72].Specifically, (1) we only count positive nodes from the neighborhood instead of nodes having the same class as the central node since we are dealing with (multilabel) binary classification, which will also come in handy later for defining the corrected homophily ratio; (2) we only average the node homophily ratios over the positive node sets since our datasets have a notable class imbalance, where most nodes are negatively labeled.This way, the homophily ratio is less skewed towards the majority of negatively labeled nodes.Definition 3 (Corrected homophily ratio).The corrected homophily ratio of the i th class is defined as the expected fold change of node homophily between positively and negatively labeled nodes for class i This corrected homophily ratio provides an answer to the following question: How much more likely are nodes labeled in class i to be connected with other nodes labeled in class i compared to nodes not labeled in class i?A positive value of the corrected homophily ratio indicates that positive nodes in class i have a higher likelihood of interacting with other positive nodes of class i than with negative nodes.

A.7 Community standards and maintenance plans
We follow several community standards to ensure sustainable and maintainable community-wide contributions, which is the key to the continuing improvement of a code base.First, we release the code on GitHub https://github.com/krishnanlab/obnbunder the permissive MIT license.Second, we use Sphinx to build the documentation of the code base and host it on ReadTheDocs.org.Several quick-start examples using the package are also provided on the GitHub README page.Third, code quality is ensured via various testing and code-linting automation using tox, pytest, pre-commit hooks, and GitHub actions.Fourth, we provide contribution guidelines and a code of conduct on the GitHub page.Finally, as a part of our commitment to the community, we also put in place a maintenance plan to address GitHub issues, merge pull requests, and release updates periodically to ensure the benchmarks remain adaptive to the evolving needs of the community.

B Data descriptions B.1 Networks
BioGRID [85] (MIT License) is a protein interaction network curated from primary experimental evidence from the biomedical literature, as well as evidence inferred from low-and high-throughput experiments.
BioPlex [41] (Creative Commons Attribution-ShareAlike 4.0 International License) is a protein interaction network whose interactions are measured by affinity purification-mass spectrometry (AP-MS) analysis shared across two cell lines (HEK293T and HCT116).This shared interaction network encodes core complexes involving many essential proteins that are vital for the cell's survival.
ComPPIHumanInt [92] (CC BY-NC 4.0 License) is a context-naive version of the cellular compartment-specific ComPPI [92] networks for humans, which are constructed by combining protein interactions from nine different PPI databases (including BioGRID).
ConsensusPathDB [46] (Free for academic use, see http://cpdb.molgen.mpg.de/ for more Licensing info) integrates protein interaction evidence (binary protein interaction, protein complex interaction, genetic interaction, metabolic, signaling, etc.) from 31 public databases, in addition to interactions curated from the literature.
FunCoup [76] (CC BY-SA 4.0 License) version 5 is a functional gene interaction network constructed by integrating a wide range of interaction evidence using a redundancy-weighted naive Bayes approach.
HIPPIE [1] (CC BY-NC 4.0 License) integrates experimentally detected protein interactions from several public databases such as BioGRID.
HumanBaseTopGlobal [29] (CC BY 4.0 License) is the tissue-naive version of the HumanBase tissue-specific gene interaction network collections, which are constructed by integrating hundreds of thousands of publicly available gene expression studies, protein-protein interactions and protein-DNA interactions via a Bayesian approach, calibrated against high-quality manually curated functional gene interactions.
HuMAP [20] (CC0 1.0 License) is a protein interaction network derived from over seven thousand protein complexes by integrating experimental evidence from public resources including AP-MS, large-scale biochemical fraction data, proximity labeling, and RNA hairpin pulldown data.
HumanNet [42] (CC BY-SA 4.0 License) is a functional gene interaction network originally designed for disease studies.It contains interaction evidence from gene co-citation from the literature, gene co-expression, pathway, domain profile, genetic interaction, gene neighborhood, phylogenic profile, and other protein interaction data.All these interaction evidence are integrated using a Bayesian statistical framework, resulting in a single value for each pair of genes that indicate the odds ratio for their functional interaction.
HuRI [60] (CC BY-4.0 License) is a binary protein interaction network constructed via the yeast two-hybrid (Y2H), covering about 90% of the protein-coding genes in the human genome.
PCNet [36] (CC BY-NC 4.0 License) is a protein interaction network constructed by requiring that an edge be supported by at least two out of the 21 selected protein interaction networks, such as BioGRID and ConsensusPathDB.
ProteomeHD [50] (CC BY-4.0 License) is the subnetwork of [50] containing top 0.5% strongest co-regulation signal between pairs of proteins.The co-regulation is measured by proteins' response to a total of 294 biological perturbations via isotope-labeling mass spectrometry.
SIGNOR [73] (CC BY-4.0 License) contains manually curated causal signaling relationships between proteins and other biochemical molecules, such as transcriptional activations and phosphorylation.
STRING [86] (CC BY-4.0 License) is a functional interaction network constructed by integrating seven types of gene interaction evidence via a probabilistic approach that calibrates against the KEGG [47] pathway database.The seven evidence types include conserved neighborhood, gene co-occurrence across species, gene fusion events, gene co-expression, other databases, and text-mined interactions.

B.2 Annotations and Ontologies
Gene Ontology [4] (CC BY-4.0 License) is a structured and standardized system that provides a comprehensive vocabulary to describe the molecular functions (GOMF), biological processes (GOBP), and cellular components (GOCC) associated with genes across different organisms.It aims to unify the representation of biological knowledge and enable effective analysis and interpretation of genomic data.
Mondo Disease Ontology [89] (CC BY-4.0 License) is a unifying resource that integrates disease, genotype, and phenotype knowledge across diverse resources, providing a standardized knowledge graph with controlled vocabularies for diseases and phenotypes.
DisGeNET disease gene annotations [77] (CC BY-NC 4.0 License) is a disease gene annotation database that contains a wide range of disease-gene association evidence, including curated annotations, high-throughput experiments and other inferred annotations, animal models, and literature text-mined annotation mostly from BEFREE.By default, we only use the curated and inferred annotations.
DISEASES disease gene annotations [30](CC BY License) is another disease gene annotation database that has a weekly update schedule for extracting disease gene annotations via text-mining from the fast-growing literature.The text-mined approach uses full text instead of only using the titles and abstracts.Other disease gene annotations from experimental data and other databases are also available.

B.3 Archived data
In addition to downloading and processing the data directly from the original sources, we also provide archived versions of the data preprocessed by use by running the default OBNB processing pipelines.The archived data is versioned with DOI's and can be found on Zenodo under the record  Each point in a box is a ranking of a particular node feature construction on a specific dataset.A lower rank indicates that the particular node feature achieved higher test performance than others.For both GCN and GAT, node2vec appears to be the top-ranked node feature overall.As shown in Table S8, random-split performance is higher than study-biased holdout in all cases.This indicates that, besides being more realistic, the study-biased holdout split is a much harder evaluation scheme and thus provides a more stringent evaluation of the tested gene classification method.

C Additional results
Table S9: Task-specific model prediction performance difference on the BioGRID-DisGeNET between GAT and LogReg+Adj.
gene sets (tasks) (a) obnb downloads network and label data from public data sources, then process and combine them into ML-ready benchmarking graph datasets.gene sets (tasks) (b) Generating label data from annotated ontology through annotation propagation and non-redundant gene set extraction.
L a b e lP r o p L o g R e g + A d j L o g R e g + L a p E ig M a p L o g R e g + N o d e 2 v e c

Figure 3 :
Figure 3: Relationship between the (corrected) homophily ratios of tasks and their performances.are observed for the comparison between GAT+BoT and baseline label propagation and logistic regression methods (Figure S5 third row), where GAT+BoT most significantly outperforms the baselines at intermediate homophily.

Figure S2 :
FigureS2: Boxplots representing the rankings of different feature construction when used by GNNs (lower the better).Each point in a box is a ranking of a particular node feature construction on a specific dataset.A lower rank indicates that the particular node feature achieved higher test performance than others.For both GCN and GAT, node2vec appears to be the top-ranked node feature overall.

Figure S3 :Figure S4 :
FigureS3: Boxplots representing the performance improvement of using different tricks with GNNs across datasets.Each point in a box is the test APOP difference of GNN with added tricks vs. plain GNN using OneHotLogDeg feature on a specific dataset.A positive value implies the added trick improves GNN performance.

Figure S5 :
FigureS5: Relationships between the corrected homophily ratio and performance difference between methods.Each panel summarizes tasks over three networks (BioGRID, HumanNet, ComPPIHumanInt) and three gene set collections (DISEASES, DisGeNET, GOBP), with a total of 1,672 tasks.In each panel, the x-axis represents the corrected homophily ratio and the y-axis represents the test APOP performance difference between the two methods.The first (A, B, C) and second rows (D, E, F) show the performance improvement to GCN and GAT when augmented with individual tricks or combined BoT.The third row (G, H, I) shows the performance difference between the best GNN method and the baseline methods, including label propagation and logistic regression.

Table 2 :
Performance of different methods on the four primary OBNB datasets evaluated in APOP ↑ aggregated over five seeds.Bold indicates the best-performing method within the method class (traditional ML or GNN).Green indicates best performing-method across both classes.See TableS7for extended baseline performance references.

Table S2 :
Hyperparameter search space for GCN and GAT on the primary datasets

Table S3 :
Tuned GCN model on the four primary benchmarks.

Table S4 :
Tuned GAT (v2) model on the four primary benchmarks.
corresponds to the probability that a randomly drawn sample is positive.Thus, APOP indicates how much better (> 0), or worse (< 0), a prediction is than a random guess, in two-fold change.

Table S5 :
Ablation study on different initial node feature constructions for GCN and GAT.Reported values are average test APOP scores aggregated over five random seeds.The best score within a group (network x label x model) is bolded.

Table S6 :
Baseline performance reference.Reported values are average test APOP scores aggregated over five random seeds.The best performance achieved by (1) GNNs or (2) logistic regression and label propagation are bolded for each dataset (network × label).The best performance across the two methods groups is additionally colored by green.

Table S8 :
Random splitting evaluation performance of different methods on the four primary OBNB datasets evaluated in APOP ↑ aggregated over five seeds.Bold indicates the best-performing method within the method class (traditional ML or GNN).Blue indicates the evaluated performance is higher than the study-biased holdout evaluation counterpart..020† recommended BoT, ‡ recommended BoT with fully tuned GNNs (see Appendix A.4.1 for tuning details)