Graph Structured Neural Networks for Perturbation Biology

Computational modeling of perturbation biology identifies relationships between molecular elements and cellular response, and an accurate understanding of these systems will support the full realization of precision medicine. Traditional deep learning, while often accurate in predicting response, is unlikely to capture the true sequence of involved molecular interactions. Our work is motivated by two assumptions: 1) methods that encourage mechanistic prediction logic are likely to be more trustworthy, and 2) problem-specific algorithms are likely to outperform generic algorithms. We present an alternative to Graph Neural Networks (GNNs) termed Graph Structured Neural Networks (GSNN), which uses cell signaling knowledge, encoded as a graph data structure, to add inductive biases to deep learning. We apply our method to perturbation biology using the LINCS L1000 dataset and literature-curated molecular interactions. We demonstrate that GSNNs outperform baseline algorithms in several prediction tasks, including 1) perturbed expression, 2) cell viability of drug combinations, and 3) disease-specific drug prioritization. We also present a method called GSNNExplainer to explain GSNN predictions in a biologically interpretable form. This work has broad application in basic biological research and pre-clinical drug repurposing. Further refinement of these methods may produce trustworthy models of drug response suitable for use as clinical decision aids.

Availability and implementation: Our implementation of the GSNN method is available at https://github.com/nathanieljevans/GSNN. All data used in this work is publicly available.

In Table 9 we report the number of GSNN and NN parameters in the best-performing models of each experiment. Across all three experiments, the best GSNN models from each fold had more trainable parameters than the best NN model from that fold. This may indicate that the GSNN model is less prone to over-fitting. Another explanation is that, due to the GSNN biological graph structure, many function nodes are likely to be rarely involved in prediction logic or to impact only a few targets (i.e., only a few LINCS nodes are descendants), and therefore the functional set of parameters may not be well represented by the total number of trainable parameters. In other words, prior knowledge may leave some function nodes effectively spurious or underutilized, so the direct parameter comparison should be interpreted with caution.
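One rough way to probe this explanation is to count, for each function node, how many LINCS output nodes appear among its descendants in the biological graph; function nodes with few or no LINCS descendants carry parameters that can rarely influence predictions. Below is a minimal sketch of such a check using networkx; the toy graph and node names are illustrative, not taken from the actual experiment networks.

```python
import networkx as nx

def lincs_descendant_counts(g: nx.DiGraph, lincs_nodes: set) -> dict:
    """For each non-LINCS (function) node, count how many LINCS output
    nodes are reachable from it in the signaling graph."""
    counts = {}
    for node in g.nodes:
        if node in lincs_nodes:
            continue
        # nx.descendants() returns all nodes reachable from `node`
        counts[node] = len(nx.descendants(g, node) & lincs_nodes)
    return counts

# Hypothetical toy graph: DRUG -> EGFR -> ERK -> LINCS_A, plus a branch
# (STAT3) that reaches no LINCS node and is therefore "underutilized."
g = nx.DiGraph([("DRUG", "EGFR"), ("EGFR", "ERK"),
                ("ERK", "LINCS_A"), ("EGFR", "STAT3")])
print(lincs_descendant_counts(g, lincs_nodes={"LINCS_A"}))
# {'DRUG': 1, 'EGFR': 1, 'ERK': 1, 'STAT3': 0}
```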

Computational Complexity of the GSNN Method
The GSNN algorithm takes significantly longer to train because it is a particularly deep architecture and because of its use of sparse matrix operations.
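To illustrate where this cost comes from, the sketch below mimics the shape of a graph-constrained forward pass: each of the L layers multiplies the node-state vector by a sparse weight matrix whose nonzeros are restricted to permitted edges. This is a minimal PyTorch illustration of the general pattern, not the actual GSNN implementation; the graph, dimensions, and weights are arbitrary.

```python
import torch

# Hypothetical toy setup: 5 nodes, 20 sequential layers.
num_nodes, num_layers = 5, 20

# Sparse weight matrix: entry (i, j) is nonzero only if node j is
# permitted to signal node i in the prior-knowledge graph.
indices = torch.tensor([[1, 2, 3, 4],   # receiving nodes (rows)
                        [0, 1, 2, 3]])  # sending nodes (columns)
values = torch.randn(indices.shape[1])
W = torch.sparse_coo_tensor(indices, values, (num_nodes, num_nodes))

x = torch.randn(num_nodes, 1)  # initial node states (e.g., drug inputs)
for _ in range(num_layers):
    # One "hop" of signal propagation; sparse matmul is slower per-FLOP
    # than dense GEMM on most hardware, and it repeats L times.
    x = torch.relu(torch.sparse.mm(W, x))
```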

Effect of Layer Depth on GSNN Performance
The GSNN algorithm passes information through sequential layers, allowing information to diffuse through the network for up to L steps, where L is the number of layers in the model. Cell signaling often involves many entities in long chains of sequential interactions, as well as feedback loops that may alter behavior. Because of this, deeper networks may be more representative of the underlying biology and therefore more accurate. To test this, we compare the performance of GSNN models with different numbers of layers (L=10, 20). Figure 12 shows the results and suggests that 20-layer GSNNs offer a small improvement in performance over 10-layer GSNNs. Notably, training deeper networks also introduces more parameters, greater memory complexity, and longer training times. It is critical, therefore, that the choice of L balance these costs against the expected gains in predictive performance (see the reachability sketch below).

(a-c) The pathways used in each experiment to specify the proteins included in the GSNN input graph. Bold text indicates the initial pathway choice from which all other pathways were "linked." Pathway size refers to the number of proteins in each Reactome pathway and may not reflect the exact number of proteins included in the resulting biological network.
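The depth bound above has a simple graph interpretation: a perturbation applied at an input node can only influence nodes within L hops, so any drug-to-gene path longer than L is invisible to the model. The sketch below shows how the reachable set grows with L on a toy signaling chain; the graph is illustrative, not one of the experiment networks.

```python
import networkx as nx

# Toy chain 0 -> 1 -> ... -> 24 with one feedback edge, standing in
# for a long signaling cascade with a feedback loop.
g = nx.DiGraph((i, i + 1) for i in range(24))
g.add_edge(24, 10)  # feedback loop

for L in (10, 20):
    # Nodes a perturbation at node 0 can influence within L layers/hops.
    reach = nx.single_source_shortest_path_length(g, 0, cutoff=L)
    print(f"L={L}: signal reaches {len(reach)} of {g.number_of_nodes()} nodes")
# L=10 reaches 11 of 25 nodes; L=20 reaches 21.
```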

Figure 10: The overlapping elements of each experiment. Most of the targets, nodes, and drugs are shared across all three experiments; however, there are distinct protein subsets for each of the three experiments.

Figure 11: Representative training curves from experiment 1 (EGFR + ERBB2 signaling). Dark gray/blue curves are GSNN training curves; light blue curves are NN training curves.

Table 9: Number of trainable parameters of the GSNN and NN algorithms used in experiments 1-3 (median of best models from each MCCV fold). Percent change is calculated as $100 \times (P_{\text{GSNN}} - P_{\text{NN}}) / P_{\text{NN}}$, where $P$ denotes the number of trainable parameters.

Table 10: Average training time of each algorithm (reported in minutes). Note: GSNN and GNN were trained on GPUs, whereas the NNs were trained on CPU only.

Table 10 reports the average training times for each algorithm. The GSNN algorithm requires between 3 and 15 times as much training time as the alternative algorithms tested (NN, GNN). Of note, however, are the training curves shown in Figure 11, which compare the validation performance by epoch for representative GSNN and NN models: the GSNN validation performance increases markedly faster, reaching approximately the maximum NN performance within the first 20 epochs. This aspect of the training dynamics suggests that the GSNN algorithm could be trained for fewer epochs, which would markedly reduce the compute requirements.