Abstract
As an alternative to target-driven drug discovery, phenotype-driven approaches identify compounds that counteract the overall disease effects by analyzing phenotypic signatures. Our study introduces a novel approach to this field, aiming to expand the search space for new therapeutic agents. We introduce PDGrapher, a causally-inspired graph neural network model designed to predict arbitrary perturbagens – sets of therapeutic targets – capable of reversing disease effects. Unlike existing methods that learn responses to perturbations, PDGrapher solves the inverse problem, which is to infer the perturbagens necessary to achieve a specific response – i.e., directly predicting perturbagens by learning which perturbations elicit a desired response. Experiments across eight datasets of genetic and chemical perturbations show that PDGrapher successfully predicted effective perturbagens in up to 9% additional test samples and ranked therapeutic targets up to 35% higher than competing methods. A key innovation of PDGrapher is its direct prediction capability, which contrasts with the indirect, computationally intensive models traditionally used in phenotype-driven drug discovery that only predict changes in phenotypes due to perturbations. The direct approach enables PDGrapher to train up to 30 times faster, representing a significant leap in efficiency. Our results suggest that PDGrapher can advance phenotype-driven drug discovery, offering a fast and comprehensive approach to identifying therapeutically useful perturbations.
Main
Target-driven drug discovery, predominant since the 1990s, focuses on the design of highly specific compounds against disease-associated targets such as proteins or enzymes [1, 2]. A prime example of target-driven drug discovery is the development of small molecule kinase inhibitors like Imatinib. Imatinib halts the progression of chronic myeloid leukemia (CML) by inhibiting the BCR-ABL tyrosine kinase, a mutated protein that drives the uncontrolled proliferation of leukemic cells in CML patients [3]. Another notable example is monoclonal antibodies like Trastuzumab, which specifically targets the HER2 receptor, a protein overexpressed in certain types of breast cancer. Trastuzumab inhibits cell proliferation while engaging the body’s immune system to initiate an attack against the cancer [4]. These examples illustrate the success of target-driven drug discovery, yet the past decade has seen a revival of phenotype-driven approaches. This shift has been fueled by the observation that most of the first-in-class drugs approved by the US Food and Drug Administration (FDA) between 1999 and 2008 were discovered empirically without a drug target hypothesis [5]. Instead of the “one drug, one gene, one disease” model of target-driven approaches, phenotype-driven drug discovery aims to identify compounds or, more broadly, perturbagens – combinations of therapeutic targets – that reverse phenotypic disease effects as measured by high-throughput phenotypic assays [1, 6]. Recent successes include Ivacaftor, the first available therapy treating the underlying cause of cystic fibrosis [7–9], and Risdiplam, the first oral medicine approved to treat spinal muscular atrophy [10].
The resurgence of phenotype-driven drug discovery has been further bolstered by the advent of chemical and genetic libraries like the Connectivity Map (CMap [11]) and the Library of Integrated Network-based Cellular Signatures (LINCS [12]). CMap and LINCS contain gene expression profiles of dozens of cell lines treated with thousands of genetic and chemical perturbagens. CMap introduced the concept of “connectivity score” to quantify similarities between compound response and disease gene expression signatures. The hypothesis is that identifying compounds with gene expression signatures akin to known disease-treating drugs or counteracting disease signatures could unveil new therapeutic leads [13–16]. These strategies have been successful in identifying drugs with high efficacy in vitro [16] against new or under-studied diseases [17–19].
Recent advances in deep learning produced methods that predict gene expression responses to perturbagens or combinations thereof [20–23]. Such methods have advanced lead discovery by enabling predictions of responses to perturbagens that were not yet experimentally tested. 1) However, current approaches rely on predefined chemical and genetic libraries, meaning that they can select perturbagens only from predefined libraries instead of flexibly identifying perturbagens as combinations of therapeutic targets. 2) Existing approaches are predominantly perturbation response methods that predict changes in phenotypes upon perturbations. Thus, they only indirectly identify perturbagens by exhaustively predicting responses to all perturbations in the library and then searching for perturbagens with the desired response. 3) Unlike existing methods that learn responses to perturbations, phenotype-driven drug discovery needs to solve the inverse problem, which is to infer perturbagens necessary to achieve a specific response – i.e., directly predicting perturbagens by learning which perturbations elicit a desired response.
In causal discovery, the problem of identifying which elements of a system should be perturbed to achieve a desired state is referred to as optimal intervention design [24–26]. Leveraging insights from causal discovery and geometric deep learning, here we introduce PDGrapher, a novel approach for combinatorial prediction of therapeutic targets that can shift gene expression from an initial, diseased state to a desired treated state. PDGrapher is formulated using a causal model, where genes represent the nodes in a causal graph, and structural causal equations define their causal relationships. Given a genetic or chemical intervention dataset, PDGrapher pin-points a set of genes that a perturbagen should target to facilitate the transition of node states from diseased to treated. PDGrapher utilizes protein-protein interaction networks (PPI) and gene regulatory networks (GRN) as approximations of the causal graph, operating under the assumption of no unobserved confounders. PDGrapher tackles the optimal intervention design objective using representation learning, utilizing a graph neural network (GNN) to represent the structural equations.
Given pairs of diseased and treated samples, PDGrapher is trained to output a ranking of genes, with the top-predicted genes identified as candidate combinatorial therapeutic targets to shift gene expression phenotype from a diseased to a normal state in each sample. Once trained, PDGrapher takes a new diseased sample and returns a perturbagen – a combination or set of therapeutic targets – that are likely to counter disease effects in the given sample. We evaluate PDGrapher across eight datasets, comprising genetic and chemical interventions across two cancer types and proxy causal graphs, and consider diverse evaluation setups, including settings where held out folds contain novel samples and challenging settings where held out folds contain novel samples from a cancer type that PDGrapher has never encountered before. In held out folds that contain novel samples, PDGrapher ranks ground-truth therapeutic targets up to 34% higher in chemical intervention datasets and 16% higher in genetic intervention datasets than existing methods. Even in held-out folds containing novel samples from a previously unseen disease, PDGrapher maintains robust performance. A key innovative feature of PDGrapher is its direct prediction of perturbagens that can shift gene expression from diseased to treated states in contrast with existing methods that indirectly predict perturbagens through extensive computational modeling of cell responses. This PDGrapher’s feature enables model training up to 30 times faster than indirect prediction methods like scGen [22]. We find that in chemical intervention datasets, candidate therapeutic targets predicted by PDGrapher are on average up to 12% closer to ground-truth therapeutic targets in the gene-gene interaction network than what would be expected by chance. Additionally, PDGrapher can aid in elucidating the mechanism of action of chemical perturbagens (Figure 5), which we show in the case of Raloxifene, a selective estrogen receptor modulator used to treat postmenopausal osteoporosis and the risk reduction of invasive breast cancer, and Sertindole, an antipsychotic to treat schizophrenia. Our results underscore PDGrapher’s effectiveness in identifying perturbagens as combinatorial therapeutic targets that transition cells from a diseased to a treated state, highlighting PDGrapher’s potential to enhance phenotype-driven drug discovery.
Results
Overview of datasets and PDGrapher
We consider a total of eight datasets across two treatment types (genetic and chemical interventions), two cancer types (lung cancer cell line A549 and breast cancer cell line MCF7), and two proxy causal graphs (protein-protein interaction network (PPI), and gene regulatory networks (GRN)), which we denote as follows: Genetic-PPI-Lung, Genetic-PPI-Breast, Chemical-PPI-Lung, Chemical-PPI-Breast, Genetic-GRN-Lung, Genetic-GRN-Breast, Chemical-GRN-Lung, and Chemical-GRN-Breast. Genetic interventions are single-gene knockout experiments by CRISPR/Cas9-mediated gene knockouts, while chemical interventions are multiple-gene treatments induced using chemical compounds. We utilize a PPI network that has 10,716 nodes and 151,839 undirected edges. We additionally construct gene regulatory networks for each disease-treatment type pair using a gene regulatory network inference method (Online Methods). All GRNs have 10,716 nodes and 303,678 directed edges. Each dataset is made up of disease and treatment intervention data. Disease intervention data contains paired healthy and diseased gene expression samples and disease-associated genes. Treatment intervention data contains paired diseased and treated gene expression samples and genetic or chemical perturbagens. Supplementary Figure S1 summarizes the number of samples for each cell line and intervention dataset type.
Given a diseased cell state (gene expression), the goal of PDGrapher is to predict the genes that, if targeted by a perturbagen on the diseased cell, would shift the cell to a treated state (Figure 1A). Unlike methods for learning the response of cells to a given perturbation [22, 27– 29], PDGrapher focuses on the inverse problem by learning which perturbation elicits a desired response. PDGrapher predicts perturbagens to shift cell states under the assumption that an optimal perturbagen will be that which shifts the cell gene expression as close as possible to the desired gene expression. Our approach comprises two modules (Figure 1B). First, a perturbagen discovery module fp takes the initial and desired cell states and outputs a candidate perturbagen as a set of therapeutic targets 𝒰′. Then, a response prediction module fr takes the initial state and the predicted perturbagen 𝒰′ and predicts the cell response upon perturbing genes in 𝒰′. Our response prediction and perturbagen discovery modules are graph neural network (GNN) models that operate on a proxy causal graph, where edge mutilations represent the effects of interventions on the graph (Figure 1C).
PDGrapher is trained using an objective function with two components, one for each module, fr and fp. The response prediction module fr is trained using all available data on cell state transitions (that is, disease and treatment intervention data). Response prediction module fr is trained so cell states are close to the known perturbed cell states upon interventions. The perturbagen discovery module fp is trained using treatment intervention data; given a diseased cell state, fp predicts the set of therapeutic targets 𝒰′ that caused the corresponding treated cell state. The objective function for the perturbagen discovery module consists of two elements: (1) a cycle loss that optimizes the parameters of fp such that the response upon intervening on the predicted genes in 𝒰′, as measured by fr, is as close as possible to the real treated state and (2) a supervision loss on the therapeutic targets set 𝒰′ that directly pushes PDGrapher to predict the correct perturbagen. Both models are trained simultaneously using early stopping independently so that each model finishes training upon convergence.
When trained, PDGrapher predicts perturbagens (as a set of candidate target genes) to shift cells from diseased to treated. An example of PDGrapher’s predictions is depicted in Figure 1D-F, where we used SAFE [30] to visualize the spatial enrichment of ground-truth and predicted perturbagen targets for Raloxifene and Sertindole using the proxy causal graph in the context of oncology (Chemical-PPI-Lung). We observe consistent patterns between ground truth and predicted gene targets and their spatial enrichment distributions.
Given a pair of diseased and treated samples, PDGrapher directly predicts perturbagens by learning which perturbations elicit target responses. In contrast, existing approaches are perturbation response methods that predict changes in phenotype that occur upon perturbation, thus they can only indirectly predict perturbagens (Figure 2A). Given a disease-treated sample pair, a response prediction module (such as scGen [22]) is used to predict the response of the diseased sample to a library of perturbagens. The predicted perturbagen is that which produces a response as similar as possible to the treated sample. We evaluate PDGrapher’s performance in two separate settings (Figure 2BC): (1) a random splitting setting, where the samples are split randomly between training and test sets, and (2) a leave-cell-out setting, where PDGrapher is trained in one cell line, and its performance is evaluated in a cell line the model never encountered during training to test how well the model generalizes to a new disease.
PDGrapher efficiently predicts genetic and chemical perturbagens to shift cells from diseased to treated states
In the random splitting setting, we assess the ability of PDGrapher for combinatorial prediction of therapeutic targets across chemical PPI datasets (Chemical-PPI-Lung and Chemical-PPI-Breast). Specifically, we measure whether, given paired diseased-treated gene expression samples, PDGrapher can predict the set of therapeutic genes targeted by the chemical compound in the diseased sample to generate the treated sample. Given paired diseased-treated gene expression samples, PDGrapher ranks genes in the dataset according to their likelihood of being in the set of therapeutic targets. We quantify the ranking of ground-truth gene targets in the predicted ranked list such that a ranking equal to one signifies perfectly accurate predictions by the model (the top K-ranked gene targets correspond to the K ground-truth gene targets), and zero signifies the opposite (the bottom K ranked gene targets correspond to the K ground-truth gene targets). PDGrapher ranks ground-truth gene targets 27% and 34% higher in the list compared to the second best-performing baseline in Chemical-PPI-Lung and Chemical-PPI-Breast, respectively (Figure 3A). Because perturbagens target multiple genes, we measure the fraction of samples in the test set for which we obtain a partially accurate prediction, where at least one of the predicted gene targets corresponds to an actual gene target. PDGrapher consistently provides accurate predictions for more samples in the test set than baselines. Specifically, it accurately predicts up to 8% and 9% more samples than the second best-performing baseline (Figure 3C).
PDGrapher not only provides accurate predictions for a larger proportion of samples and consistently predicts ground-truth therapeutic targets close to the top of the ranked list, but it also predicts gene targets that are closer in the network to ground-truth targets compared to what would be expected by chance (Figure 3DE). Specifically, for Chemical-PPI-Lung, the median distance between predicted and ground-truth therapeutic targets is 3.0 for both PDGrapher and the Random baseline. However, the distributions exhibit a statistically significant difference, with a one-sided Mann-Whitney U test yielding a p-value < 0.001, an effect size (rank-biserial correlation) of 0.2432 (95% CI [0.2416, 0.2449]), and a U statistic of 1.50 x 1011. Similarly, for Chemical-PPI-Breast, the median distance is 3.0 for both groups, yet the distributions are significantly different (p-value < 0.001, effect size = 0.2515 (95% CI [0.2501, 0.2529]), U statistic = 3.74 x 1011). This finding is significant as it is supported by the local network hypothesis, which posits that genes in closer network proximity tend to exhibit greater similarity compared to those that are more distantly connected [31–33]. This result implies that PDGrapher not only identifies relevant gene targets but does so in a manner that reflects the underlying biological and network-based relationships [29], suggesting that its predictions are not random but are rooted in the inherent structure of the gene interaction network.
PDGrapher’s training is up to 25 times faster than baselines used for indirect prediction of perturbagens (Figure 3B), significantly reducing the computational cost of predicting perturbagens and highlighting another advantage of PDGrapher over existing methods. The enhanced efficiency of PDGrapher is attributed to its unique approach. Existing approaches are predominantly perturbation response methods that predict changes in the phenotype that occur upon perturbation. Thus, they can identify perturbagens only indirectly by exhaustively predicting responses to all perturbations in the library and then searching for the perturbagen with the desired response that reverses a disease state. Unlike existing methods that learn responses to perturbations, PDG-rapher addresses the inverse problem, which is to infer the perturbagen necessary to achieve a specific response – i.e., directly predicting perturbagens by learning which perturbations elicit a desired response.
PDGrapher exhibits comparatively modest performance across genetic datasets, specifically Genetic-PPI-Lung and Genetic-PPI-Breast (Supplementary Figure S2). This relative modesty could stem from the inherent characteristics of knockout interventions, which often produce a reduced phenotypic signal compared to small molecule interventions. Studies have shown that while gene knockouts are essential for understanding gene function, single-gene knockout studies sometimes offer limited insights into complex cellular processes due to compensatory mechanisms [34–36]. Despite the modest performance in genetic intervention datasets, PDGrapher outperforms competing methods in the combinatorial prediction of therapeutic targets. When using GRNs as proxy causal graphs, we see that PDGrapher is consistently the top-performing method across both chemical and genetic intervention datasets (Supplementary Figure S4).
PDGrapher generalizes to new (previously unseen) cell lines and learns optimal genetic and chemical perturbagens
We observe consistently strong performance of PDGrapher across chemical and genetic intervention datasets in the leave-cell-out setting (Figure 4, Supplementary Figure S3). PDGrapher successfully predicts perturbagens that describe the cellular dynamics and shift gene expression phenotypes from a diseased to a treated state in up to 9% and 1% additional test samples compared to the second-best baseline in chemical and genetic intervention datasets, respectively (Figure 4DE, Supplementary Figure S3AB). PDGrapher also ranks ground-truth gene targets up to 35% and 15% higher in the predicted ranked gene list, compared to the second-best baseline, in chemical and genetic intervention datasets, respectively (Figure 4A, Supplementary Figure S3AB). Additionally, combinations of therapeutic targets predicted by PDGrapher in chemical datasets are closer to ground-truth targets than expected by chance (Figure 4BC). Specifically, for Chemical-PPI-Lung, the median distance between predicted and ground-truth therapeutic targets is 3.0 for both PDGrapher and the Random baseline. However, the distributions exhibit a statistically significant difference, with a one-sided Mann-Whitney U test yielding a p-value < 0.001, an effect size (rank-biserial correlation) of 0.2154 (95% CI [0.2131, 0.2178]), and a U statistic of 5.05 x 1010. Similarly, for Chemical-PPI-Breast, the median distance is 3.0 for both groups, yet the distributions are significantly different (p-value < 0.001, effect size = 0.2402 (95% CI [0.2383, 0.2421]), U statistic = 1.35 x 1011).
PDGrapher illuminates mode of action of chemical perturbagens
To demonstrate PDG-rapher’s ability to illuminate the mechanism of action of therapeutic perturbagens, we analyze PDGrapher’s predictions for Raloxifene (Figure 5A) and Sertindole (Figure 5B) in Chemical-PPI-Lung in the random splitting setting. We visualize ground truth and predicted combinations of therapeutic targets together with their one-hop neighbors in the protein interaction network using Gephi [37] and the ForceAtlas graphical layout. We utilized the modularity algorithm in Gephi to identify distinct communities within the network. The nodes were subsequently colored based on their modularity class to represent these communities visually.
Raloxifene is a second-generation selective estrogen receptor modulator (SERM) with anti-estrogenic impacts on breast and uterine tissues and estrogenic effects on bone, lipid metabolism, and blood clotting [38]. It targets a combination of Estrogen Receptor 1 (ESR1), Estrogen Receptor 2 (ESR2), Trefoil Factor 1 (TFF1), and Serpin Family B Member 9 (SERPINB9). PDGrapher correctly predicted ESR1, and ESR2. Additionally, PDGrapher predicted Sex Hormone Binding Globulin (SHBG) and Phosphodiesterase 5A (PDE5A) genes as combinatorial therapeutic targets. Notably, Raloxifene treatment has been documented to raise SHBG levels in healthy middle-aged and older men [39, 40], and post-menopausal women [41]. Therefore, PDGrapher’s prediction of SHBG can be explained due to the strong connection between Raloxifene and downstream effects on SHBG. The prediction of PDE5A by PDGrapher is through its functional relationship with estrogen receptors. Estrogen facilitates vasodilation by engaging its receptors, increasing nitric oxide (NO) production. This NO production is pivotal as it stimulates the synthesis of cyclic guanosine monophosphate (cGMP), resulting in the relaxation of smooth muscle cells and the sub-sequent dilation of blood vessels [42]. PDE5A plays a crucial role in this mechanism by hydrolyzing cGMP, thereby modulating the vasodilation process to be both controlled and reversible [43]. A disruption in this intricate pathway might lead to changes in the expression or functionality of PDE5A. Such alterations potentially explain the observed link between Raloxifene, a selective estrogen receptor modulator, and the modulation of PDE5A activity.
Sertindole is a second-generation antipsychotic to treat schizophrenia. It acts through antagonistic mechanisms against Dopamine D2 Receptor (DRD2), Serotonin receptors HTR2A, HTR2C, and HTR6, and Alpha 1 Adrenergic receptors ADRA1A and ADRA1B [44]. PDGrapher accurately predicted DRD2, HTR2A, and HTR2C. It additionally predicted serotonin receptor HTR1A, B-Raf Proto-Oncogene (BRAF), and Homeobox C6 (HOXC6) genes. All Sertindole’s gene targets are G-protein coupled receptors (GPCRs), as is the predicted target HTR1A. The predictive involvement of BRAF in response to Sertindole’s targeting of GPCRs can be explained by its position in the downstream cascade of GPCR signaling pathways. GPCRs influence various intracellular signaling cascades, including the MAPK/Erk signaling pathway of which BRAF is a critical component [45]. Additionally, HOXC6 has been shown to promote cell proliferation and migration through the activation of the MAPK pathway [46]. This implies that changes in BRAF activity, potentially induced by altered GPCR signaling due to Sertindole, may modify the behavior of PDGrapher-predicted HOXC6 in the downstream pathway.
Training PDGrapher’s casually-inspired neural network models
We perform an ablation study to analyze components in PDGrapher’s objective function across the chemical datasets. We train PDGrapher using only the cycle loss (PDGrapher-Cycle), using only the supervision loss (PDGrapher-Super), and using both (PDGrapher-SuperCycle) in the random splitting setting on Chemical-PPI-Lung and Chemical-PPI-Breast datasets. We evaluate the relative position of ground-truth therapeutic targets in the predicted gene ranking (Figure 5C), the proportion of samples with a partially accurate prediction (Figure 5D), and the sample-wise R2 (Figure 5E) and perturbagen-wise R2 (Figure 5F) of the reconstruction of treated samples given diseased samples and predicted perturbagens.
We find that PDGrapher-Super has the highest performance when predicting correct perturbagens; however, it has the lowest performance in the reconstruction of treated samples. PDG-rapher-SuperCycle (our choice throughout this study) has comparative performance when predicting correct perturbagens and the highest performance in the reconstruction of treated samples. Conversely, PDGrapher-Cycle performs poorly in identifying correct perturbagens and performs better when asked to predict (reconstruct) held-out treated samples. PDGrapher-SuperCycle appears as the best compromise between accuracy in predicting therapeutic genes and reconstruction of treated samples from diseased samples upon intervening on the predicted genes. We chose PDGrapher-SuperCycle throughout this work because we seek a model that provides accurate gene target predictions and phenotypic-driven gene target predictions such that even incorrect predictions are guided by the changes they would generate in diseased samples.
Acknowledging the critical role of biological pathways in cellular functions, it becomes feasible for PDGrapher-SuperCycle to identify alternative gene targets within these pathways that may elicit similar phenotypic responses. The organization of genes in biological pathways, where each gene contributes to a specific biochemical process or signaling cascade, allows for the possibility that perturbations in different genes could lead to similar outcomes [47]. This pathway-based interconnectivity suggests that targeting different genes within a pathway may still achieve the desired therapeutic state, as these genes collectively influence certain cellular functions or pheno-types [48]. Consequently, while PDGrapher-SuperCycle may exhibit lower accuracy compared to PDGrapher-Super in pinpointing the precise original targets due to the interconnected and complex nature of biological pathways [49], it demonstrates enhanced efficacy in identifying a broader range of gene targets that can effectively transition cell states from diseased to treated conditions. This inductive bias aligns with the intricate organization of genes within biological pathways. It offers valuable insights into the diverse mechanisms of action of perturbagens, as substantiated by the results shown in the previous subsection. Such an approach underscores the potential of leveraging the organization of biological pathways to develop more comprehensive therapeutic interventions.
Discussion
We introduce a novel problem formulation for phenotype-driven lead discovery. Given a diseased sample, the goal is to find genes that a genetic or chemical perturbagen should target to reverse disease effects and shift the sample to a treated state distributionally equivalent to a healthy state. In practice, this problem translates to predicting a combination of gene targets; therefore, we refer to this formulation of phenotype-driven lead discovery as a combinatorial prediction of therapeutic targets. To address this problem, we introduce PDGrapher. Given a diseased cell state represented as a gene expression signature and a proxy causal graph of gene-gene interactions, PDGrapher predicts candidate target genes to shift the cells to a desired treated state.
PDGrapher efficiently predicts perturbagens to shift cell line gene expression from a diseased to a treated state across two evaluation settings and eight datasets of genetic and chemical interventions. Training PDGrapher models is up to 30 times faster than response prediction methods that use indirect prediction to nominate candidate perturbagens. PDGrapher can illuminate the mode of action of predicted perturbagens given that it predicts gene targets based on network proximity which governs similarity between genes. Though conceptually similar to drug target prediction, PDGrapher is uniquely different in that gene targets are optimized to shift cells from diseased to treated states; therefore, the goal of PDGrapher is the discovery of new therapeutic leads rather than the characterization of existing compounds, as is the case in drug target prediction [32, 33, 50–52].
The potential of PDGrapher is in enhancing the design of therapeutic leads and broadening the search space of perturbagens to include those above and beyond existing chemical and genetic libraries. PDGrapher effectively leverages large datasets of genetic and chemical interventions to find perturbagens as sets of candidate targets to shift cell line gene expressions from diseased to treated states. By flexibly selecting sets of therapeutic targets for intervention, rather than on a specific perturbagen, PDGrapher enhance the versatility of phenotype-driven lead discovery. PDGrapher’s approach in identifying therapeutic targets holds promise for personalized therapies, as it can enable tailoring treatments based on individual gene expression profiles, thereby addressing disease representations in each patient. PDGrapher’s capacity to simultaneously out-put multiple genes is highly relevant for diseases where dependencies involving multiple genes influence treatment efficacy and safety.
However, it is crucial to acknowledge the limitations of phenotype-driven drug discovery that lie ahead of PDGrapher. One limitation of our current approach is its reliance on transcriptomic data. While this offers broad applicability, there are other data modalities and experimental platforms, such as cell painting, that can be valuable for phenotype-driven drug discovery. Cell painting can enhance phenotype-driven drug discovery by visually capturing cellular responses through staining organelles and cytoskeletal components [53, 54]. This platform provides high-content image profiles, yielding signatures of cell changes that reveal nuanced effects of compounds. With the recent release of the JUMP Cell Painting dataset [55], this approach has the potential to synergize with databases like CMap and LINCS, providing a rich, image-based layer of phenotypic data that complements transcriptomic profiles. Integrating diverse data sources could offer a more comprehensive view of compound effects [56] and improve the predictive power and applicability of PDGrapher. Additionally, the journey from theoretical models to practical applications in drug discovery is complex and requires careful consideration of various biological, technological, and regulatory factors [57, 58].
Future developments of PDGrapher will address the assumptions of our current approach. Our work was developed under the assumption of no unobserved confounders. This assumption is stringent and difficult to empirically verify. Fruitful future directions include reevaluating and relaxing this assumption in the problem formulation, thereby refining the problem definition. Another limitation is the reliance on PPIs or GRNs as proxies for causal gene networks, as these networks are noisy and incomplete [59–61]. PDGrapher posits that leveraging representation learning can overcome incomplete causal graph approximations. A valuable research direction is to theoretically examine the impact of using such approximations, focusing on how they influence the accuracy and reliability of predicted likelihoods. Such an analysis would help reconcile structural causality and modern machine learning research towards causal representation learning [62] and discover high-level causal variables with therapeutic effects from low-level observations.
Data availability
Processed data used in this paper, including the cell line gene expression dataset, protein-protein interaction network, drug targets, and disease-associated genes, are available via the project website at https://zitniklab.hms.harvard.edu/projects/PDGrapher or directly at https://figshare.com/articles/dataset/Combinatorial prediction of therapeutic t argets using a causally-inspired neural network/24798855. The raw protein-protein interaction network data was obtained from https://downloads.thebiogrid.org/File/BioGRID/Release-Archive/BIOGRID-3.5.186/BIOGRID-MV-Physical-3.5.186.tab3.zip, https://www.science.org/doi/suppl/10.1126/science.1257601/suppl_file/datasets_s1-s4.zip, and http://www.interactome-atlas.org/data/HuRI.tsv. Raw gene expression datasets were obtained from https://clue.io/releases/data-dashboard. Disease-associated genes were obtained from COSMIC at https://cancer.sanger.ac.uk/cell_lines/archive-download#:~:text=Complete%20mutation%20data and https://cancer.sanger.ac.uk/cosmic/curation. Drug targets were extracted from DrugBank at https://go.drugbank.com/releases/5-1-9, and a list of cancer drugs was obtained from NCI at https://www.cancer.gov/about-cancer/treatment/types/targeted-therapies/approved-drug-list.
Code availability
Python implementation of PDGrapher is available at https://github.com/mims-harvard/PDGrapher.
Authors contribution
G.G. retrieved, processed, and analyzed gene expression data, protein-protein interaction network data, and disease gene datasets. I.H. retrieved, processed, and analyzed drug target data and processed gene regulatory networks. G.G. developed, implemented, and benchmarked PDGrapher and performed detailed analyses of PDGrapher ‘s algorithm. G.G., K.V., M.B., and M.Z. designed the study. G.G., I.H., and M.Z. wrote the manuscript.
Competing interests
G.G. is currently employed by Genentech, Inc. and F. Hoffmann-La Roche Ltd. I.H. is currently employed by Merck & Co., Inc.
Online Methods
Datasets
We compiled and processed six primary data sources and two additional repositories of biological information. Data sources include protein-protein interactions (PPI), healthy and diseased cell line gene expression, diseased cell line gene expression upon chemical or genetic interventions, phenotype-associated genes, drug targets, and drug indications. The following is a description of the data sources and preprocessing steps.
Human protein-protein interaction network
We built a PPI network by aggregating proteins and connections from BIOGRID [63] (accessed in March 2022), HuRI [64], and Menche et al. [48] In this graph, nodes represent human proteins, and edges exist between nodes if there is physical interaction between the proteins. We downloaded a gene ID mapping file from the HUGO Gene Nomenclature Committee. Using this file, we mapped proteins in BIOGRID and Menche et al. [48] from Entrez Gene ID [65] to HUGO Gene Nomenclature Committee ID [66], and proteins in HuRI from Ensembl Gene ID [67] to HUGO Gene Nomenclature Committee ID [66]. Our final PPI comprises the union of nodes and edges, resulting in a graph with 15,742 nodes and 222,498 undirected edges.
Gene expression data
We downloaded Library of Integrated Network-Based Cellular Signatures (LINCS [68]) level 3 gene expression data from https://clue.io/releases/data-dashboard (accessed in February 2022). Level 3 data consists of quantile-normalized samples across each plate and is the lowest level in the LINCS library that can be compared across plates. LINCS contains gene expression measurements for 12,327 genes upon genetic and chemical interventions. There are 387,317 samples upon CRISPR genetic interventions (treated samples), with 5,156 unique knocked-out genes across 27 unique cell lines. There is an average of 17.18 replicates per cell line-knocked-out gene pair. The number of unique genes knocked out in each cell line varies from 1 to 5,114, with an average of 2,042.14 unique genes knocked out per cell line.
Control data for CRISPR interventions, that is, diseased samples, are genetic interventions that either do not contain a gene-specific sequence or whose gene-specific sequence targets a gene not expressed in the human genome. There is a total of 47,781 diseased samples across 50 cell lines. The number of diseased samples for each cell line varies from 1 to 6,890, with an average of 955.62 diseased samples per cell line.
There are 1,313,292 samples upon chemical interventions (treated samples), with 31,234 unique compounds across 229 unique cell lines. There is an average of 7.96 replicates per cell line-compound pair. The number of compounds tested in each cell line varies from 1 to 19,509, with an average of 719.69 unique compounds tested per cell line. Drugs are administered at different doses and measured at varying time points after treatment. On average, there are 2.73 different doses per compound-cell line pair, with a minimum of 1 and a maximum of 26 different doses. On average, gene expression is measured at 1.25 time points per compound-cell line pair, with a minimum of 1 and a maximum of 13 different time points.
Control data for chemical interventions, that is, diseased samples, is treatment with vehicle (dimethyl sulfoxide). There is a total of 76,795 diseased samples across 226 cell lines. The number of diseased samples for each cell line varies from 1 to 7,336, with an average of 339.80 diseased samples per cell line. On average, gene expression of diseased samples is measured at 1.4 time points, with a minimum of 1 and a maximum of 5 different time points.
We filter cell lines to keep those treated with at least 4,000 unique genetic or chemical perturbagens, resulting in 10 selected cell lines for each genetic and chemical dataset. To find healthy cell line counterparts, we extracted all cell lines with the “Unknown” tumor phase in the downloaded LINCS dataset (N=145). Then, we filtered the cell lines by tissue type. To find the exact match to diseased cell lines, we performed a manual literature search to confirm their experimental use as healthy counterparts. We extracted healthy counterparts for three of the ten diseased cell lines: cell line NL20 as the healthy counterpart for A549, cell line MCF10A as the healthy counterpart for MCF7, and cell line RWPE1 as the healthy counterpart for PC3.
Genetic interventions correspond to gene experiment knockouts in which the gene expression of the knocked-out gene after the intervention is zero. Chemical interventions correspond to small molecule treatments, where each molecule targets one or more proteins. Chemical interventions were performed at different dose levels and measured at different time points. We included replicates measured at all time points and doses. For each cell line and condition (healthy, diseased, and treated), we log-normalized level 3 gene expression data. We applied a min-max normalization to transform gene expression values into the range [0, 1] following established practices in the field.
We match genes in LINCS to proteins in our PPI using the HUGO Gene Nomenclature Committee ID [66], resulting in 10,716 overlapping genes and 151,839 undirected edges. Furthermore, we excluded treated samples from our datasets whose targeted genes were not included in the PPI. We have healthy, diseased, and treated gene expression samples for each cell line treated with several genetic or chemical perturbagens (Table 1). For healthy counterparts, samples with the corresponding treatment (“vector” for genetic perturbagens, and “vehicle” for chemical perturbagens) are not available, therefore, we use the closest possible one (see “Sample category” in Table 1.
Gene regulatory networks
We computed one gene regulatory network (GRN) for each diseased cell line in each condition (genetic and chemical datasets), using the GENIE3 [69] algorithm on gene expression values of each diseased cell line. We filtered genes in our gene expression dataset (LINCS) to contain only those in the PPI before running the GRN algorithm for consistency between the PPI and GRNs. GENIE3, introduced in 2010, won the Dialogue for Reverse Engineering Assessments and Methods 4 (DREAM4) challenge [70], which evaluates the success of GRN inference algorithms on benchmarks of simulated data. GENIE3 was introduced in the open source software for bioinformatics Bioconductor [71] in 2017 and is still used as a gold-standard for GRN generation [72–75]. It is a model based on an ensemble of regression trees and requires as input a matrix of gene expression levels under various conditions. Notably, this expression data is multifactorial. This means that they represent expression levels resulting from a perturbation over a set of genes rather than from a targeted experiment. Multifactorial expression can be obtained as samples from different patients or other biological systems. Therefore, cell line diseased samples are the closest to the ideal input data for GENIE3. GENIE3 produces a directed graph representing gene-gene regulatory interactions. This is achieved by assigning weights to regulatory links and maximizing weights for more significant links. Then, a significance threshold is used to determine which links are substantial enough to be predicted as a regulatory link. We adapted the threshold to generate GRNs with the same network density as our PPI, which was achieved by keeping 303,678 directed edges.
Disease-associated genes
We extracted disease-associated genes from COSMIC [76] (Accessed in October 2022) in addition to expert-curated genes available at https://cancer.sanger.ac.uk/ cosmic/curation. Genes were represented using the HUGO Gene Nomenclature Committee ID. For each cell line in our dataset, we extracted cancer-causing mutations as the list of genes with “Verified” Mutation verification status in COSMIC and present in the list of genes curated by experts. Mapping the resulting genes to our list of genes in the PPI resulted in eight disease-associated genes for lung cancer cell line A549, eight disease-associated genes for breast cancer cell line MCF7, and one disease-associated gene for prostate cancer cell line PC3. Therefore, we filtered out the cell line PC3 and proceeded with only MCF7 and A549.
Drug targets
We downloaded drug-related data from DrugBank [38] (accessed in November 2022). We extracted drug names and synonyms, chemical identifiers, drug-gene targets, and all available synonyms for each gene target. We mapped drugs in DrugBank with chemical perturbagens in LINCS by InChI Key [77], resulting in 1,522 out of 31,234 unique LINCS compounds mapped to DrugBank with information of at least one target. We mapped drug targets to our PPI network using the HUGO Gene Nomenclature Committee ID, excluding any drug target that was not mapped. Chemical interventions target multiple genes, with a minimum of 1, a maximum of 300, and an average of 2.44 targets per compound.
List of cancer drugs for cancer targets baseline
We extracted the list of cancer drugs by cancer type from NCI (https://www.cancer.gov/about-cancer/treatment/types/targeted-therapies/approved-drug-list#targeted-therapy-approved-for-breast-cancer; Accessed in November 2022). We mapped drug names to DrugBank to obtain cancer drug-gene targets. In total, there are 24 drugs associated with breast cancer (cell line MCF7) and 30 drugs associated with lung cancer (cell line A549).
Disease intervention data
Disease intervention datasets consist of gene expression measurements of healthy cell lines, disease-associated genes, and gene expression measurements of diseased cell lines. Gene expression samples of healthy and diseased cell lines were retrieved from LINCS [68], and disease-associated genes were retrieved from COSMIC [76], as detailed previously. Each dataset 𝒯T = {T1, …, TM } is a collection of paired healthy-diseased cell lines where in each sample T =< xh, 𝒰, xd >, xh corresponds to gene expression values of the healthy cell line, set 𝒰U is comprised by a randomized subset of disease-associated genes, and xd corresponds to gene expression values of diseased cell lines (that is, upon mutations on genes in U). To select the randomized set of disease-associated genes, we first choose at random a proportion p ∈ {0.25, 0.50, 0.75, 1}, and then select N disease-associated genes at random where N is the proportion multiplied by the total number of disease-associated genes. Given that more diseased samples are available than healthy samples (see Table 1) when building the triplets, we select a random sample from the set of healthy samples and, therefore, have non-unique healthy samples during training. In total, we built two datasets of disease interventions: one comprised of gene expression of healthy cell line MCF10A, breast cancer mutations, and gene expression of breast cancer cell line MCF7; the second comprised of gene expression of healthy cell line NL20, lung cancer mutations, and gene expression of lung cancer cell line A549. Find more details on data compilation and processing in previous subsections.
Treatment intervention data - genetic
Genetic treatment intervention datasets consist of single-gene knockout experiments using CRISPR / Cas9-mediated gene knockout. Genetic treatment intervention data comprises gene expression measurements of diseased cell lines, single knocked-out genes, and gene expression measurements of treated cell lines. Gene expression samples of diseased and treated cell lines and knocked-out genes were retrieved from LINCS [68]. Each dataset T = {T1, …, TM } is a collection of paired diseased-treated cell lines where in each sample T =< xd, 𝒰′, xt >, xd corresponds to gene expression values of the diseased cell line, set U′ is comprised by the knocked-out gene, and xt corresponds to gene expression values of treated cell lines (that is, upon knocking-out the gene in 𝒰′). Given that more treated samples are available than diseased samples (see Table 1) when building the triplets, we select a random sample from the set of diseased samples and, therefore, have non-unique diseased samples during training. In total, we built two datasets of treatment interventions: one comprised of gene expression of diseased cell line MCF7, knocked-out genes, and gene expression of treated cell line MCF7; the second comprised of gene expression of diseased cell line A549, knocked-out genes, and gene expression of treated cell line A549. Find more details on data compilation and processing in previous subsections.
Treatment intervention data - chemical
Chemical treatment intervention datasets consist of chemical compound treatment experiments. Chemical treatment intervention data comprises gene expression measurements of diseased cell lines, chemical compound therapeutic targets, and gene expression measurements of treated cell lines. Gene expression samples of diseased and treated cell lines were retrieved from LINCS, and chemical compound targets were retrieved from DrugBank, as detailed previously. Each dataset 𝒯T = {T1, …, TM } is a collection of paired diseased-treated cell lines where in each sample T =< xd, 𝒰′, xt >, xd corresponds to gene expression values of the diseased cell line, set 𝒰′ is comprised by the chemical compound targets, and xt correspond to gene expression values of treated cell lines (that is, upon treated with the chemical perturbagen targeting genes in 𝒰′). Given that more treated samples are available than diseased samples (see Table 1) when building the triplets, we select a random sample from the set of diseased samples and, therefore, have non-unique diseased samples during training. In total, we built two datasets of treatment interventions: one comprised of gene expression of diseased cell line MCF7, chemical compound target genes, and gene expression of treated cell line MCF7; the second comprised of gene expression of diseased cell line A549, chemical compound target genes, and gene expression of treated cell line A549. Find more details on data compilation and processing in previous subsections.
Related work
Learning optimal interventions
The problem of learning interventions to achieve a desired state has gained interest in recent years. A few recent works formulate this problem as finding optimal interventions to optimize an associated outcome [78–81]. These works offer varied approaches. For example, Mueller et al. [78] aim to learn an intervention policy defined by a covariate transformation that produces the largest post-intervention improvement with high uncertainty. Pacchiano et al. [79] formalize the task as a bandit optimization problem in which each bandit’s arm corresponds to a covariate to intervene, and the goal is to recover an almost optimal arm in the least number of arm pulls possible. Mueller et al. [80] and Hie et al. [81] approach the problem of sequence-based data where each sequence is associated with an outcome, and the goal is to find mutations in the input sequence that increase a desired outcome. Other recent works formulate this problem as finding optimal interventions to shift the system to a desired state. Zhang et al. [82, 83] aimed to find an intervention that applied to a distribution helps match a desired distribution. Specifically, given a distribution P over X and a desired distribution Q over X, the goal is to find an optimal matching intervention I such that P I best matches Q under some metric. They address the special case of soft interventions (shift interventions) and use the expectation of distributions as the distance metric.
Neural networks and Structural Causal Models (SCMs)
Causal representation learning has been a growing trend in recent years [84]. It aims to combine the strength of traditional causal learning methods with the robust capabilities of deep learning in the face of large and noisy data. Bottlenecks of traditional causal learning methods include unstructured high-dimensional variables, combinatorial optimization problems, unknown intervention, unobserved confounders, selection bias, and estimation bias [84]. There are three areas in which deep learning helps to over-come these bottlenecks [84]. First, in learning causal variables from high-dimensional unstructured data. Second, in learning the causal structure between causal variables, called causal discovery within the causal inference literature. And third, in facilitating inference of interventional and counterfactual queries. Within the last branch, a promising approach aims to join SCMs and neural models to facilitate interventional and counterfactual querying. Parafita et al. put forward the requirements that any DL model should fulfill to approximate causal queries and introduced normalizing causal flows as a specific instantiation [85]. Pawlowski et al. followed a similar approach to introduce a model capable of computing counterfactual queries [86]. Xia et al. approached the problem differently, introducing a Neural Causal Model (NCM), a type of SCM with neural networks as structural equations [87]. Together with the NCM, they introduced an algorithm that provably performs identification and inference of interventional queries [87]. A follow-up work extended the NCM framework for identification and inference of counterfactual queries [88]. The concept of NCMs inspires our work by considering the graph in which we operate as a noisy version of a causal graph and our model operating on the graph as a proxy for the structural equations.
Interventions in Graph Neural Networks (GNNs)
GNNs are a type of neural model that falls under the umbrella term of geometric deep learning [89–91]. These models use graph-structured data to compute transformed representations useful for downstream predictive tasks. Their ability to operate over graphs makes them especially relevant to NCMs. A recent work by Zecevic et al. [92] explored this connection. It introduced interventional GNNs, a GNN in which interventions are represented through mutilations in the input graph, and interventional inference as GNN computations on the mutilated graph [93]. We borrow this concept in our work and extend the representational capabilities of GNNs by assigning learnable embeddings to input nodes.
Methods
Preliminaries
A calligraphic letter χX indicates a set, an italic uppercase letter X denotes a graph, uppercase X denotes a matrix, lowercase x denotes a vector, and a monospaced letter X indicates a tuple. Uppercase letter X indicates a random variable and lowercase letter x indicates its corresponding value; bold uppercase X denotes a set of random variables, and lowercase letter x indicates its corresponding values. We denote P (X) as a probability distribution over a set of random variables X and P (X = x) as the probability of X is equal to the value of x under the distribution P (X). For simplicity, P (X = x) is abbreviated as P (x).
Problem formulation - combinatorial prediction of therapeutic targets
Intuitively, given a diseased cell line sample, we would like to predict the set of therapeutic genes that need to be targeted to reverse the effects of disease, that is, the genes that need to be perturbed to shift the cell gene expression state as close as possible to the healthy state. Next, we formalize our problem formulation. Let M =< E, V, ℱ, P (E) > be an SCM associated with causal graph G, where E is a set of exogenous variables affecting the system, V are the system variables, ℱ are structural equations encoding causal relations between variables and P (E) is a probability distribution over exogenous variables. Let 𝒯T = {T1, …, Tm} be a dataset of paired healthy and diseased samples, where each element is a 3-tuple T =< vh, U, vd > with vh ∈ [0, 1]N being gene expression values of healthy cell line (variable states before perturbation), VU being the disease-causing perturbed variable (gene) set in V, and vd ∈ [0, 1]N being gene expression values of diseased cell line (variable states after perturbation). Our goal is to find, for each sample T =< vh, U, vd >, the variable set U′ with the highest likelihood of shifting variable states from diseased vd to healthy vh state. To increase generality, we refer to the desired variable states as treated (vt). Our goal can then be expressed as where represents the probability over graph G mutilated upon perturbations in variables in U. Under the assumption of no unobserved confounders, the above interventional probability can be expressed as a conditional probability on the mutilated graph : which under the causal Markov condition is: where represents parents of variable Vi according to graph (that is, the mutilated graph upon intervening on variables in U′). Here, state of a variable will be equal to an arbitrary value if Vj ∈ U′. Therefore, intervening on the variable set U′ modifies the graph used to obtain conditional probabilities and determine the state of variables in U′.
Problem formulation - representation-learning-based combinatorial prediction of therapeutic targets
In the previous section, we drew on the SCM framework to introduce a generic formulation for the task of combinatorial prediction of therapeutic targets. Instead of approaching the problem from a purely causal inference perspective, we draw upon representation learning to approximate the queries of interest to address the limiting real-world setting of a noisy and incomplete causal graph. Formulating our problem using the SCM framework allows for explicit modeling of interventions and formulation of interventional queries. Inspired by this principled problem formulation, we next introduce the problem formulation using a representation learning paradigm.
We let G = (𝒱, ε) denote a graph with |𝒱| = n nodes and |E| edges, which contains partial information on causal relationships between nodes in 𝒱and some noisy relationships. We refer to this graph as proxy causal graph. Let 𝒯T = {T1, …, TM} be a dataset with an individual sample being a 3-tuple T =< xh, 𝒰, xd > with xh ∈ [0, 1]n being the node states (attributes) of healthy cell sample (before perturbation), 𝒰U being the set of disease-causing perturbed nodes in 𝒱, and xd ∈ [0, 1]n being the node states (attributes) of diseased cell sample (after perturbation). We denote by G𝒰 = (𝒱, ε𝒰) the graph resulting from the mutilation of edges in G as a result of perturbing nodes in 𝒰 (one graph per perturbagen; we avoid using superindices for simplicity). Here again, we refer to the desired variable states as treated (xt). Our goal is then to learn a function: That, given the graph , the diseased xd and treated xt node states, predicts the combinatorial set of nodes 𝒰′ that if perturbed have the highest chance of shifting the node states to the treated state xt. We note here that represents probabilities over graph GU mutilated upon perturbations in nodes in 𝒰. Under Causal Markov Condition, we can factorize over graph : that is, the probability of each node depending only on its parents 𝒫𝒜i in graph .
We assume (i) real-valued node states, (ii) G is fixed and given, and (iii) atomic and nonatomic perturbagens (intervening on individual nodes or sets of nodes). Given that the value of each node should depend only on its parents on the graph , a message-passing framework appears especially suited to compute the factorized probabilities P.
Comment on problem formulation
In the SCM framework, the conditional probabilities in equation 3 are computed recursively on the graph, each being an expectation over exogenous variables E. Therefore, node states of the previous time point are not necessary. To translate this query into the representation learning realm, we discard the existence of noise variables and directly try to learn a function encoding the transition from an initial state to a desired state.
Intuitively, an exhaustive approach to solving Equation 5 would be to search the space of all potential sets of therapeutic targets 𝒰′ and score how effective each is in achieving the desired treated state. This is, indeed, how many cell response prediction approaches can be used for perturbagen discovery [22, 23, 94]. However, with moderately sized graphs, this is highly computationally expensive, if not intractable. Instead, we propose to search for potential perturbagens efficiently with a perturbagen discovery module (fp) and a way to score each potential perturbagen with a response prediction module (fr).
Relationship to conventional graph prediction tasks
Given that the prediction for each variable is dependent only on its parents in a graph, GNNs appear especially suited for this problem. We can formulate the query of interest under a graph representation learning paradigm as: Given a graph G = (𝒱, ε), and paired sets of node attributes χ = {X1, X2, …, Xm} and node labels 𝒴 = {Y1, Y2, …, Ym} where each Y = {y1, …, yn}, with yi ∈ [0, 1], we aim at training a neural message passing architecture that given node attributes Xi predicts the corresponding node labels Yi. There are, however, some major differences between our problem formulation and the conventional graph prediction tasks, namely, graph and node classification (summarized in Table 2).
In node classification, a single graph G is paired with node attributes X, and the task is to predict the node labels Y. Our formulation differs in that we have m paired sets of node attributes χ and labels 𝒴 instead of a single set, yet they are similar in that there is a single graph in which GNNs operate. In graph classification, a set of graphs 𝒢 = {G1, …, Gm} is paired with a set of node attributes χX = {X1, X2, …, Xm} and the task is to predict a label for each graph Y = {y1, …, ym}. Here, graphs have a varying structure, and both the topological information and node attributes predict graph labels. In our formulation, a single graph is combined with each node attribute Xi, and the goal is to predict a label for each node, not for the whole graph.
Proposed approach
We propose an approach for the combinatorial prediction of therapeutic targets composed of 2 modules. First, a perturbagen discovery module fp searches the space of potential gene sets to predict a suitable candidate U′. Next, a response prediction module fr checks the goodness of the predicted set 𝒰′, that is, how effective intervening on variables in 𝒰′ is to shift node states to the desired treated state xt.
Model optimization
We optimize our response prediction module fr using cross-entropy loss on known triplets < xh, 𝒰, xd > and < xd, 𝒰′, xt >: We optimize our intervention discovery module fp using a cycle loss such that the responseupon a predicted 𝒰′ is as close to the desired treated state as possible. In addition, we provide a supervisory signal for predicting U′ in the form of cross-entropy loss. We train fp and fr in parallel and implement early stopping separately (see Experimental setup for more details). Trained modules fp and fr are then used to predict, for each diseased cell sample, which nodes should be perturbed (𝒰′) to achieve a desired treated state (Figure 1A).
Response prediction module
Our response prediction module fr should learn to map preperturbagen node values to post-perturbagen node values through learning relationships between connected nodes (equivalent to learning structural equations in SCMs) and propagating the effects of perturbations downstream in the graph (analogous to the recursive nature of query computations in SCMs).
Given a triplet < xh, 𝒰, xd >, we propose a neural model operating on a mutilated graph, G𝒰 where the node attributes are the concatenation of xh and , predicting diseased node values xd. Each node i has a two-dimensional attribute vector , where the first element is its gene expression value , and the second is a perturbation flag: a binary label indicating whether a perturbation occurs at node i. In practice, we embed each node feature into a high-dimensional continuous space by assigning learnable embeddings to each node based on the value of each input feature dimension. Specifically, for each node, we use the binary perturbation flag to assign a d-dimensional learnable embedding, which is different between nodes but shared across samples for each node. To embed the gene expression value , we first calculate thresholds using quantiles to assign the gene expression value into one of the B bins. We use the bin index to assign a d-dimensional learnable embedding, which is different between nodes but shared across samples for each node. To increase our model’s representation power, we concatenate a d-dimensional positional embedding (d-dimensional vector initialized randomly following a normal distribution). Concatenating these three embeddings results in an input node representation of dimensionality 3.d
For each node i ∈ 𝒱, an embedding zi is computed using a graph neural network operating on the node’s neighbors’ attributes. The most general formulation of a GNN layer is: where represents the updated information of node i, and hi represents the information of node i in the previous layer, with embedded di being the input to the first layer. ψ is a MESSAGE function, ⊕ an AGGREGATE function (permutation-invariant), and ϕ is an UPDATE function. We obtain an embedding zi for node i by stacking K GNN layers. Node embedding zi ∈ ℝ is then passed to a multilayer feed-forward neural network to obtain an estimate of the post-perturbation node values xd.
Perturbation discovery module
Our perturbagen prediction module fp should learn the nodes in the graph that should be perturbed to shift node states (attributes) from diseased xd to a desired treated state xt.
Given a triplet < xd, 𝒰 ′, xt >, we propose a neural model operating on graph G𝒰′ with node features xd and xt that predicts a ranking for each node where the top P ranked nodes should be predicted as the nodes in U′. Each node i has a two-dimensional attribute vector: . In practice, we represent these binary features in a continuous space using the same approach as described for our response prediction module fr.
For each node i ∈ 𝒱, an embedding zi is computed using a graph neural network operating on the node’s neighbors’ attributes. We obtain an embedding zi for node i by stacking K GNN layers. Node embedding zi ∈ ℝ is then passed to a multilayer feed-forward neural network to predict a real-valued number for node i.
Network proximity between predicted and ground truth perturbagens
Let P be the set of predicted therapeutic targets, ℛ be the set of ground truth therapeutic targets, and spd(p, r) be the shortest-path distance between nodes in P and R. We measure the closest distance between P and R as:
Model implementation and training
We implement PDGrapher using PyTorch 1.10.1 [95] and the Torch Geometric 2.0.4 Library [96]. The implemented architecture yields a neural network with the following hyperparameters: number of GNN layers and number of prediction layers.
We set the number of prediction layers to two and performed a grid search over the number of GNN layers (1-3 layers). We train our model using a 5-fold cross-validation strategy and report PDGrapher’s performance resulting from the best-performing hyperparameter setting.
Further details on statistical analysis
We next outline the evaluation setup, baseline models, and statistical tests used to evaluate PDG-rapher.
Baselines
We evaluate the performance of PDGRAPHER against a set of baselines:
Random baseline: Given a sample T =< xd, 𝒰′, xt >, the random baseline returns N random genes as the prediction of genes in 𝒰′, where N is the number of genes in 𝒰 ′.
Cancer genes: Given a sample T =< xd, 𝒰 ′, xt >, the cancer genes baseline returns the top N genes from an ordered list where the first M genes are disease-associated genes (cancer-driver genes) and the remaining genes are ranked randomly, and where N is the number of genes in 𝒰 ′.
Cancer drug targets: Given a sample T =< xd, 𝒰′, xt >, the cancer genes baseline returns the top N genes from an ordered list where the first M genes are cancer drug targets and the remaining genes are ranked randomly, and where N is the number of genes in 𝒰 ′.
scGen [22]: scGen is a widely-used gold-standard latent variable model for response prediction [97–100]. Given a set of observed cell type in control and perturbed state, scGen predicts the response of a new cell type to the perturbagen seen in training. To utilize scGen as a baseline, we first fit it to our LINCS gene expression data for each dataset type to predict response to perturbagens, training one model per perturbagen (chemical or genetic). Then, given a sample of paired diseased-treated cell line states, T =< xd, 𝒰 ′, xt >, we compute the response of cell line with gene expression xd to all perturbagens. The predicted perturbagen is that whose predicted response is closest to xt in R2 score.
Dataset splits and evaluation settings
We evaluate PDGRAPHER and baseline methods on two different settings:
Random splits: Our dataset is split randomly into train and test sets to measure our model performance in an IID setting.
Leave-cell-line-out splits: To measure model performance on unseen cell lines, we train our model with random splits on one cell line and test on a new cell line.
Evaluation setup
For all dataset split settings, our model is trained using 5-fold cross-validation, and metrics are reported as the average on the test set. Within each fold, we further split the training set into training and validation sets (8:2) to perform early stopping: we train the model on the training set until the validation loss has not decreased at least 10−5 for 15 continuous epochs.
Evaluation metrics
We report average sample-wise R2 score, and average perturbagen-wise R2 score to measure performance in the prediction of xt. The sample-wise R2 score is computed as the square of Pearson correlation between the predicted sample and real sample xt ∈ ℝ N. The perturbagen-wise R2 score is adopted from scGen. It is computed as the square of Pearson correlation of a linear least-squares regression between a set of predicted treated samples and a set of real treated samples Xt ∈ ℝ N ×S for the same perturbagen. Higher values indicate better performance in predicting the treated sample xt given the diseased sample xd and predicted perturbagen.
We also report the average ranking of real therapeutic gene targets in the predicted ordered list of therapeutic targets to measure the ability of our model to rank targets correctly. We normalize the ranking to the range [0, 1] as 1−ranking/N where N is the total number of genes in our dataset. Higher values indicate better performance; that is, the model ranks ground truth therapeutic targets closer to the top of the predicted list. In addition, we report the proportion of test samples for which the predicted therapeutic targets set has at least one overlapping gene with the ground-truth therapeutic targets set.
Spatial enrichment analysis of PDGrapher’s predicted genes
We quantify the spatial enrichment for PDGrapher ‘s predicted therapeutic targets using SAFE [30], a systematic approach that identifies regions that are over-represented for a feature of interest (Figure 1 D-F). SAFE requires networks and annotations for each node as an input. We use the PPI network as input and label gene nodes based on PDGrapher’s predictions: nodes are labeled as 1 if they are predicted to belong to the therapeutic targets set, and 0 otherwise. We compute enrichment analyses for two chemical compounds in the lung cancer cell line A540 test set: Raloxifene and Sertindole. We apply SAFE with the recommended settings: neighborhoods are defined using the short-path weighted layout metric for node distance and neighborhood radius of 0.15, and p-values are computed using the hypergeometric test with multiple testing correction (1,000 iterations). We use the Python implementation of SAFE: https://github.com/baryshnikova-lab/safepy.
Acknowledgements
We would like to thank Domen Mohorcic for his help with the PDG-rapher codebase. We gratefully acknowledge the support of NIH R01-HD108794, US DoD FA8702-15-D-0001, awards from Harvard Data Science Initiative, Amazon Faculty Research, Google Research Scholar Program, AstraZeneca Research, Roche Alliance with Distinguished Scientists, Sanofi iDEA-iTECH Award, Pfizer Research, Chan Zuckerberg Initiative, and Kempner Institute for the Study of Natural and Artificial Intelligence at Harvard University. I.H. was supported, in part, by the Summer Institute in Biomedical Informatics at Harvard Medical School. M.B. and G.G. were supported by the ERC-Consolidator Grant No. 724228 (LEMAN). Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funders.
Footnotes
We have updated the declaration of interests and figures 3-5 to show individual datapoints
References
- 1.↵
- 2.↵
- 3.↵
- 4.↵
- 5.↵
- 6.↵
- 7.↵
- 8.
- 9.↵
- 10.↵
- 11.↵
- 12.↵
- 13.↵
- 14.
- 15.
- 16.↵
- 17.↵
- 18.
- 19.↵
- 20.↵
- 21.
- 22.↵
- 23.↵
- 24.↵
- 25.
- 26.↵
- 27.↵
- 28.
- 29.↵
- 30.↵
- 31.↵
- 32.↵
- 33.↵
- 34.↵
- 35.
- 36.↵
- 37.↵
- 38.↵
- 39.↵
- 40.↵
- 41.↵
- 42.↵
- 43.↵
- 44.↵
- 45.↵
- 46.↵
- 47.↵
- 48.↵
- 49.↵
- 50.↵
- 51.
- 52.↵
- 53.↵
- 54.↵
- 55.↵
- 56.↵
- 57.↵
- 58.↵
- 59.↵
- 60.
- 61.↵
- 62.↵
- 63.↵
- 64.↵
- 65.↵
- 66.↵
- 67.↵
- 68.↵
- 69.↵
- 70.↵
- 71.↵
- 72.↵
- 73.
- 74.
- 75.↵
- 76.↵
- 77.↵
- 78.↵
- 79.↵
- 80.↵
- 81.↵
- 82.↵
- 83.↵
- 84.↵
- 85.↵
- 86.↵
- 87.↵
- 88.↵
- 89.↵
- 90.
- 91.↵
- 92.↵
- 93.↵
- 94.↵
- 95.↵
- 96.↵
- 97.↵
- 98.
- 99.
- 100.↵