Comparative analysis of molecular representations in prediction of drug combination effects

Application of machine and deep learning methods in drug discovery and cancer research has gained a considerable amount of attention in the past years. As the field grows, it becomes crucial to systematically evaluate the performance of novel computational solutions in relation to established techniques. To this end we compare rule-based and data-driven molecular representations in prediction of drug combination sensitivity and drug synergy scores using standardized results of 14 throughput screening studies, comprising 64 200 unique combinations of 4 153 molecules tested in 112 cancer cell lines. We evaluate the clustering performance of molecular representations and quantify their similarity by adapting the Centered Kernel Alignment metric. Our work demonstrates that to identify an optimal molecular representation type it is necessary to supplement quantitative benchmark results with qualitative considerations, such as model interpretability and robustness, which may vary between and throughout preclinical drug development projects.


Introduction
In the past years deep learning (DL) methods have been successfully applied to a variety of research topics in biomedicine and drug discovery [1][2][3]. Deep neural networks achieve state-of-the-art performance in medical computer vision tasks and protein structural modelling, enabling de novo generation of drug candidates and development of prognostic clinical models [4][5][6][7][8]. However, such performance of DL models is context-dependent [9][10][11][12]. While quantitative metrics are routinely and effectively used to compare various computational methods, overreliance on them is a well-known issue [13][14][15][16][17][18]. It is beneficial to supplement performance results on benchmark datasets with estimates of model uncertainty and robustness, as well as ability to generalize on unseen data [19][20][21]. These aspects are particularly important in the biomedical research, where in silico model predictions direct experimental design choices, as exhaustively testing all combinations of relevant factors is usually unfeasible due to the combinatorial explosion [22,23].
Advances in high throughput screening of bioactive compounds in cancer cell lines promote the development of personalized cancer treatments [24]. A major goal in such drug sensitivity and resistance testing studies is to prioritize promising combinatorial therapies that involve coadministration of multiple drugs [25]. By combining synergistic compounds, often with distinct mechanisms of action, it is possible to overcome single-drug resistance, produce sustained clinical remissions and diminish adverse reactions [26,27]. Drug synergy refers to a degree of drug-drug interaction quantified as the difference between expected and observed dose-response profiles measured by a biological endpoint, such as cell viability or cell toxicity [28]. While synergy characterizes how compounds modulate each other's biological activity, combination sensitivity score (CSS) quantifies drug combination efficacy [29]. In addition to the CSS, we use four synergy scores based on distinct null models, namely Bliss independence, Highest single agent, Loewe additivity, and Zero Interaction potency in the regression analysis of molecular fingerprints [30][31][32][33][34]. Predicting drug combination synergy and sensitivity is related to Quantitative Structure Activity Relationship (QSAR) modelling and Virtual Screening [35,36]. The QSAR captures mathematical associations between drug descriptors and assay endpoints based on the assumption that structurally similar compounds have similar bioactivity properties, while in the Virtual Screening studies candidate molecules are prioritized for subsequent experimental validation according to in silico prediction results [37]. Rule-based molecular fingerprints are commonly used as drug descriptors in QSAR/Virtual Screening, and MACCS structural keys based on molecular topology are arguably the most popular type of rule-based fingerprints [38][39][40][41]. Other types include circular topological fingerprints that describe combinations of non-hydrogen atom types and paths between them within a predefined atom neighborhood, and pharmacophore fingerprints that incorporate local features related to molecular recognition [42][43][44].
More recently, data-driven fingerprints generated by DL models have been shown to perform well in various research projects [45]. Majority of such DL fingerprints are based on the encoder-decoder architecture, whereby an approximate identity function is learned to translate high-dimensional input into a low-dimensional, fixed-size latent manifold, which is then used to reconstruct the original input [46]. When an encoder-decoder DL model is trained on chemical structures, its latent manifold is interpreted as a data-driven fingerprint. Examples of early DL fingerprinting models include a Convolutional Neural Network (CNN), Chemception, and a Recurrent Neural Network, SMILES2Vec, as well as a Variational Autoencoder model with a CNN encoder and a Gated Recurrent Unit-based decoder [47][48][49][50][51]. Development of attention methods for sequence modelling further contributed to the popularity of data-driven DL fingerprints, whereas evolution of generative models enabled de novo molecular design through latent space sampling [52][53][54][55][56]. These DL solutions operate on images of molecules or SMARTS/SMILES sequences to create drug structural representations [57][58][59]. Further, DL fingerprints may be enriched with numerical drug descriptors through multitask DL learning methods or simply by concatenating to latent space [60]. Unlike sequence-based versions, Geometric Deep Learning fingerprints are derived from molecular graphs, and in addition to global molecular descriptors enable position-aware encoding of individual atom and bond features [61][62][63][64][65][66][67].
There exist several extensive benchmark datasets for ranking DL models in chemoinformatics tasks, such as MoleculeNet, Open Graph Benchmark and Benchmarking GNNs [68][69][70]. Despite the widespread use of molecular fingerprints, there is a lack of systematic evaluation of data-driven DL and rule-based versions. To address the gap, we study 11 types of molecular representations, comprising seven DL and four rule-based variants, in prediction of cancer drug combination synergy and sensitivity, based on 17 271 848 data points from 14 cancer drug screening studies (Fig.1, experiment VS I). By comparing four synergy scores based on distinct null models we identify a preferred synergy formulation for use in cancer drug combinations research [71,72]. We measure the fingerprint similarity by adapting Centered Kernel Alignment as a distance metric (Fig.1, experiment VS II). Lastly, we explore the downstream performance of molecular representations by clustering compounds assigned to 10 Anatomical Therapeutic Chemical (ATC) classes in one-vs-all mode (Fig.1, experiment VS III). We believe that our work will contribute to the rational design of drug combinations, enable easier selection of molecular representations for in silico modelling, and promote further use of Deep Learning methods in biomedicine. Figure 1: Study workflow. Compounds found in combinations in the DrugComb database are represented using four rule-based (blue) and seven data-driven (yellow) fingerprint types. Rule-based fingerprints include topological, and 2D/3D extended connectivity variants. Data-driven fingerprints are generated using two Variational Autoencoders and two Transformer models trained on ChEMBL 26, Graph Autoencoder trained on DrugComb compounds, and a pre-trained Deep Graph Infomax model (Infomax). The fingerprints are compared in three tasks: predictions of drug combination sensitivity and four synergy scores (VS I); representation similarity based on Centered Kernel Alignment (VS II); one-vs-all fingerprint clustering based on ATC drug classes (VS III). VS I results are also used to identify the most predictive synergy model.

Data provenance
The DrugComb data portal, one of the largest public drug combination databases, is used to access combination sensitivity and synergy data [73]. Its October 2 019 release contains standardized and harmonized results of 14 drug sensitivity and resistance studies on 4 153 drug-like compounds screened in 112 cell lines for a total of 447 993 drug-drug-cell line tuples. Each pairwise drug combination is characterized by the combination sensitivity score (CSS) and four synergy scores, namely Bliss independence (Bliss), Highest single agent (HSA), Loewe additivity (Loewe), and Zero interaction potency (ZIP), further details are in the Supplementary Information. ChEMBL (release 26) is used to obtain SMILES strings, which are subsequently standardized by stripping salt residues and solvent molecules [74,75]. SMILES shorter than 8 and longer than 140 characters are filtered out. PubChem identifiers (CID) are used to cross-reference compounds between the databases when necessary. The final DL training dataset consists of 1 795 483 unique SMILES with a median length of 48 and a median absolute deviation of 10.

Molecular representations
Fingerprints are numeric arrays of n elements (bits) long, where n ranges between 16 and 1 024 depending on fingerprint type. Even though n values up to 16,384 have been tested in literature demonstrating a positive correlation between fingerprint size and downstream prediction performance, not all the studies support these findings [38,76]. Fingerprints used in the current work are classified into rule-based with binary values and Deep learning-based with continuous values. Rule-based models are further split into topological, 2D and 3D circular subtypes. Deep Learning fingerprints are split into sequence and graph subtypes. More detailed classification is found in Table 1.  [38,43]. E3FP is a 3D extension of 2D extended-connectivity models, it is generated following the no_Stereo variant [77].

Deep learning-based fingerprints
Seven data-driven molecular fingerprints of different lengths are generated using four types of unsupervised encoderdecoder DL models, namely a Graph Autoencoder (GAE), a Variational Autoencoder (VAE), a Transformer, and a pre-trained Deep Graph Infomax (Infomax). GAE fingerprints. 16 bits long GAE fingerprints are defined via a diagonal semidefinite matrix of singular values Σ, obtained through the Singular Value Decomposition of GAE embedding matrix [61,62]. Inspired by the Ky Fan matrix k-norm, equal to the sum of k largest singular values of the matrix, the main diagonal of Σ is used as a 16 bits long fingerprint [78]. If small molecules result in diagonal shorter than 16 bits, then zero-padding is applied. 64 bits long GAE fingerprints are generated by concatenating average, min-and max-pooled representations of the embedding matrix to 16 bits long GAE fingerprints. VAE fingerprints. VAE fingerprints are 16 and 256 bits long latent spaces of two independently trained VAE models [49]. Transformer fingerprints. 64 bits long Transformer fingerprints are constructed by concatenating average-and maxpooled latent embeddings of the 16 bits model with the first output of its last and second last recurrent layers. Similarly, the 1 024 bits variant is generated from the embedding space of the Transformer 256 bits model [52]. Infomax fingerprints. Infomax fingerprints are 300 bits long, generated using a pre-trained Deep Graph Infomax model that by design maximizes mutual information between local and global molecular graph features [79,80].

Deep Learning models used for fingerprint generation
Graph Autoencoder model. Graph Autoencoder model uses a graph G, with V nodes and E edges as input, where Vcorrespond to atoms and E to atomic bonds. Additional numeric features may be incorporated via node or bond feature matrices [81]. A graph G is represented with an adjacency matrix, A ℝ |V|×|V| , where |V| are node indices, such that non-zero A elements correspond to existing molecular bonds [82]. A is normalized to be symmetric and contain self-loops following [83]: where is an identity matrix equal in size to , is a diagonal node degree matrix such that its main diagonal represents bond counts of − . GAE model is initialized with a node matrix of 54 atom features, where each atom is represented by an array of one-hot encoded values denoting one of the 37 atoms types, six possible atom degrees, five atom charges, four variants of chiral configuration and an aromaticity indicator, all generated using RDKit. One-hot refers to encoding categorical variables as binary arrays. To make the GAE model compatible with previously unseen atoms a placeholder for an unknown atom type is added. GAE encoder consists of seven convolutional layers with sum pooling followed by ReLu activation [84]. The decoder part is a dot product of the embedding matrix with itself, followed by 0.1 dropout and sigmoid activation. Cross-entropy over is used as a loss function. Empty nodes in are initialized with zeros. Variational Autoencoder model. Two VAE models are trained with the embedding sizes of 16 or 256 bits. Both models have a 54 characters in vocabulary, consisting of 53 unique alphanumeric characters found in SMILES and an additional empty character for zero-padding. Input length is 140 characters, zero-padded if necessary. VAE encoder consists of three 1D convolutional layers of 9, 9 and 10 neurons, each followed by SELU activation [85]. The decoder consists of three GRU layers with a hidden dimension of 501, followed by softmax activation. Loss function is an equally weighted combination of binary cross-entropy and Kullback-Leibler divergence. Xavier uniform initialization is used to assign the starting weights of two VAE models [86]. Transformer model. Two Transformers are trained with the embedding sizes of 16 or 256 bits. The vocabulary size for both models is 58 characters including 53 unique SMILES characters and five tokens for end-of-string, mask, zero-pad, unknown-character and initialize-decoding. Maximum input length is 141 characters, zero-padded if necessary. Both models have four-headed attention and six transformer layers, with a dropout of 0.3 applied to the positional encoding output [54]. Loss function is negative log likelihood. Network weights are initialized with Xavier uniform. Deep Graph Infomax (Infomax) model. Infomax is pre-trained on 465 000 molecules from ChEMBL 20 and on 2 million molecules from ZINC 15 by Hu et.al. [80,87].
DL Model training. The GAE model is trained on 4 153 DrugComb drug-like compounds, while VAE and Transformer models are trained on 1 795 483 molecules from DrugComb and ChEMBL 26 databases. Five-fold cross-validation is used for training all the DL models. Transformer and VAE models are trained for 10 epochs on each fold, GAE is trained for 40 epochs on each fold. All models use Adam optimizer with a learning rate decay and an initial learning rate of 1 · 10 −3 , the training is halted once the learning rate reaches 1 · 10 −6 or loss reaches zero [70,[88][89][90]. GAE hyperparameters are optimized using tree-structured parzen estimators with a budget of 1 000 iterations, other DL models employ random search [91]. Further training details can be found in Supplementary Information.

Regression analysis of molecular fingerprints -VS I
Data input. One-hot encoded cell line labels and each of 11 drug fingerprints are used as inputs to regression models to predict drug combinations sensitivity and synergy. Combination fingerprints are generated by concatenating single molecular representations, topological fingerprints are bit-averaged [92]. Full dataset contains 362 635 cell line-drug combination tuples of 3 421 compounds, when filtered by the SMILES strings (SMILES-filtered), and 447 993 combination tuples of 4 153 molecules, when filtered by the CID (CID-filtered). For each cell line-drug combination tuple, four synergy scores and CSS sensitivity scores are obtained from DrugComb. If found, biological replicates are averaged, further, dose-dependent synergy scores are averaged inside cell line-drug combination tuple. Model selection. Model selection for the regression analysis of molecular fingerprints is split into three steps. In the first step, 13 different regression models are tested thrice in five-fold cross-validation on the 10% of the full dataset, sampled without replacement (Supplementary Information). The goal is to identify an optimal type of a regression model for prediction of four synergy scores and the CSS. The second step concerns hyperparameter tuning of the previously selected regression model on all available data in ten-fold cross-validation. Lastly, the model is trained in ten-fold repeated cross-validation on SMILES-filtered and CID-filtered datasets with 90:10 and 60:40 train:test splits [93]. Regression performance metrics. Pearson Correlation Coefficient (PCC) and Root Mean Squared Error (RMSE) are used to assess the regression performance. PCC and RMSE 95% confidence intervals are calculated via Student's t-distribution estimate of Fisher's z transformed PCC values, and via empirical bootstrap with 1 000 iterations and symmetric confidence intervals [94][95][96][97][98]. RMSE values are normalized by standard deviations. Shapiro-Wilk test is used to test the normality assumption [99]. Related work. PCC scores of regression models, used to predict single synergy scores in three recent studies, are in the Supplementary Information section.

Fingerprint Similarity -VS II and III
Fingerprint similarity metric -VS II. All 11 types of molecular representations vary in length and data types, making commonly-used metrics, such as Jaccard-Tanimoto or cosine distances poor choices for fingerprint comparison [100][101][102]. Jaccard-Tanimoto is suboptimal, as it is based on bits present in one fingerprint, absent in another, and shared by both [103]. Cosine distance between two vectors, defined as their inner product normalized by the corresponding L2 norms, only measures an angle between two vectors without accounting for differences in their ranges [104]. It may be possible to post-process DL fingerprints and define common distance metrics on both the binary and real-valued arrays [105]. However, we opted against it, as we are not aware of any studies that systematically assess DL fingerprint similarity or quantify downstream effects of such transformations. Recall that an inner product is an unnormalized measure of similarity allowing metrics based on the Canonical Correlation Analysis (CCA), singular vector CCA and projection-weighted CCA to be defined on any real-valued arrays [106][107][108]. All these methods underperform when the number of compounds is smaller than the dimensionality of feature space, i.e. n bits [109]. It is not intuitive to use unnormalized inner product as a similarity measure, as it is unbounded and requires original data to be referenced alongside the similarity scores. Since calculation of pairwise compound distances is not a prerequisite to quantify their similarity, we compare complete fingerprint matrices using Centered Kernel Alignment (CKA), a modification of Hilbert-Schmidt Independence Criterion (HSIC) originally proposed to assess nonlinear dependence of two sets of variables [110]. Fingerprint matrix -VS II and III. Let m compounds be represented with two fingerprint matrices and , where individual fingerprints and may be of different lengths x and y and data types: and are normalized by subtracting column means from the corresponding column values. Linear kernel k -VS II. Let be a kernel matrix, such that its entries correspond to scalar outputs of a linear kernel function k. Let k be an inner product, k = , where and are 1D vectors from two fingerprint matrices and corresponding to the same compound or feature. When and are column vectors, becomes a feature similarity matrix: , If and are row vectors, is a sample similarity matrix: Hilbert-Schmidt Independence Criterion -VS II. Hilbert-Schmidt Independence Criterion (HSIC) is a test statistic equal to 0 when and are independent [103]. Unnormalized HSIC is without an upper bound and equal to: where T is a dot product of feature vectors and || · || 2 F is a squared Frobenius norm and T and T are left Gram matrices. Notice that: Further, for centered and under linear dot product kernel: where ( , ) is a cross-covariance matrix of and [104]. Centered Kernel Alignment -VS II. HSIC is an empirical statistic that converges to its unbiased value at a rate of 1 [105]. Unbiased HSIC values are used to define Centered Kernel Alignment (CKA), a normalized version of HSIC that ranges from 0 to 1. CKA is used to quantify the difference between two fingerprint matrices and . When CKA is calculated via the feature similarity, it is defined as: CKA is a non-linear extension of the CCA and does not require any assumptions about noise distributions in the datasets [112]. CKA with linear kernel is equivalent to the RV coefficient and Tucker's Congruence coefficient [109,[113][114][115].
If the number of samples is higher than the number of features, CKA should be calculated using feature similarities. Conversely, sample space and use of Gram matrices is preferred. Fingerprint clustering -VS III. The Anatomical Therapeutic Chemical (ATC) Classification System is used to annotate drugs according to biological systems on which they act, as well as their therapeutic, pharmacological, and chemical properties [116]. The 2 228 DrugComb compounds found in the ATC database are assigned to 10 classes. All but GAE 16 bits and Morgan 1 024 bits models are then used to generate nine fingerprint matrices. The generated fingerprint matrices are preprocessed three-fold: by z-score normalization, z-score normalization followed by dimensionality reduction with PCA, and z-score normalization followed by dimensionality reduction with PLS. For the PCA preprocessing the number of loadings explaining >0.95 variance is used, PLS regression for dimensionality reduction is performed with ATC labels as targets. Linear Discriminant Analysis (LDA) is used for one-vs-all clustering with ATC class labels as response variables, averaged Silhouette score and Variance Ratio Criterion are clustering performance metrics [117,118].
Silhouette score -VS III. Silhouette for a single point is defined as: where a is the mean distance between point i and all points within its cluster and b is the smallest mean distance between point i and all points in a cluster ≠ . Variance Ratio Criterion -VS III. Variance Ratio Criterion (VRC) is a ratio of between-to within-cluster variation, adjusted by the number of clusters. VRC is closely related to the F-statistic in ANOVA [119]. Both scores are min-max scaled to be in [0, 1].

Computational facilities
All models are trained on Tesla P100 PCIe 16GB GPU. VM deployment is automated with Docker 19.03.9, pythonopenstack 3.14. 13

Prediction of drug combination synergy and sensitivity -VS I
Regression model selection. We identify Catboost Gradient Boosting on Decision Trees (GBDT) as an optimal regression model for the prediction of drug combination sensitivity and synergy after testing 13 algorithms on the 10% of the DrugComb dataset in three replicates ( Table 2). Three of the tested algorithms failed to generate any predictions and are omitted. With optimized hyperparameters GBDT models tend to reach the early stopping criterion in the last 20% of the training on all the fingerprint variants which indicates correctly tuned hyperparameters, further details are in Supplementary Information. There exist alternative dataset splitting modes that incorporate chemical similarity via Tanimoto distance or Murcko decomposition [68,120]. While they may better mimic current drug development practices and lead to a better correlation between in silico predictions and prospective experimental validation, we do not expect them to produce categorically different results. Regression performance. Among 11 fingerprinting models, Infomax 300 and VAE 256 achieved the highest PCC in prediction of Loewe synergy score using Catboost Gradient Boosting across all the test folds, cross-validation modes and duplicate filtering methods. As seen in Figure 2 and Table 3, for the 60 : 40 splits on the SMILES-filtered dataset Infomax reaches a PCC of 0.6842, while VAE 256 score is 0.6813. All tested fingerprints result in the CSS prediction scores above 0.85 PCC, with Infomax 300 and VAE 256 fingerprints still ranked on top. Infomax 300 and VAE 256 have overlapping 95% confidence intervals, as such they are considered to be equally performant. E3FP is the best rule-based fingerprint and is among the top three in most experimental runs. As seen in Figure 3 and  Figures S1-S6 and Tables S1-S6).
Experimental results indicate that if similarity-based clustering or identification of key molecular moieties are of interest, rule-based fingerprints should be considered. Their average performance in regression is compensated by the inbuilt interpretability and robust clustering performance [121]. On the other hand, neural fingerprint models are well-suited for regression tasks, as seen in the VS I experiment. It is important to note that the differences in regression performance between rule-and DL-based fingerprints do not exceed 0.05 PCC when predicting any synergy scores or the CSS. Consistently good performance of the DL models and E3FP fingerprints may be offset by their high computational costs during model training or fingerprint generation, respectively. GAE 64 fingerprints appear to be a reasonable compromise in terms of the downstream performance and relatively short model training times.

Fingerprint similarity -VS II and III
CKA distance (VS II). A heatmap of pairwise CKA similarities between 11 fingerprints, as seen in Figure 4, indicates that similar types of fingerprints cluster together. Rule-based fingerprints form two clusters corresponding to topological and circular subtypes. All the DL fingerprints generated by the trained models form the third cluster. Graph-based models appear to be far removed from all sequence and rule-based variants. GAE 64 is the most different from other trained DL fingerprints, while being co-clustered with them. Infomax 300 fingerprints, based on a pre-trained Deep Graph Infomax model, are not part of any cluster. Smaller sequence-based DL fingerprints, namely VAE 16 and Transformer 64 are at least as similar to each other, as they are to their longer in-type/subtype counterparts. We conclude that fingerprint type and subtype, as indicated in Table 1, contribute the most to the CKA similarity, followed by fingerprint pretraining status, size, and data format.   Table 5 and the overview of DrugComb compounds with the corresponding ATC classes is in Figure 5. The Infomax 300 bits model achieves the best clustering results on the z-score normalized fingerprint matrices, followed by three rule-based fingerprints. Dimensionality reduction following z-score normalization generally improves clustering performance of all rule-based fingerprints. It has the opposite effect on most DL fingerprints, with the largest reduction seen in the Infomax 300 and GAE 64 models. Longer DL sequence models, namely VAE 256 and Transformer 1 024, perform better after dimensionality reduction steps, albeit with a

Conclusion
Choosing an optimal fingerprint type to represent molecular features is an important step in computational drug discovery. To this end, we systematically compared 11 variants of such molecular representations in predicting drug combination sensitivity and synergy scores, and evaluated their relationships based on the clustering performance and CKA-based fingerprint similarity. We found that VAE 256 bits long and 3D circular E3FP 1 024 bits long fingerprints generated from SMILES strings, as well as Infomax 300 bits long fingerprints based on molecular graphs lead to the best regression performance. Out of the four tested synergy scores, we observe that Loewe synergy is the easiest to predict with best models reaching PCC 0.72. CSS, a measure of drug combination efficacy, can be predicted >0.85 PCC with any fingerprint type. We found that the rule-based fingerprinting methods underperform in regression tasks in comparison to the data-driven DL variants. However, the gap between the best and worst performing fingerprint models rarely exceeds PCC 0.05. Further, we adapted Centered Kernel Alignment to quantify the extent of similarity between fingerprint matrices and to demonstrate that similar types of fingerprints cluster together. An optimal similarity measure for the comparison of single rule-based and data-driven fingerprints remains an open question. Lastly, in one-vs-all compound clustering using ATC classes as labels rule-based fingerprints perform on par or better than the best DL representations.  We conclude that the quantitative performance differences between rule-based and Deep Learning-based fingerprints are likely to be insignificant in the context of preclinical studies of small molecule drugs [122,123]. In order to identify an optimal fingerprint type for a given project we advise enriching quantitative performance metrics with qualitative concerns, e.g. available chemometric and deep learning expertise, model interpretability requirements, opinions of project stakeholders and model performance on unseen data. Fingerprints generated using the E3FP 1 024, Infomax 300, Morgan 1 024 and VAE 256 bits models are suggested as good starting points based on our experimental results and distinct methodologies underlying their data generating methods [124]. We recommend the Loewe synergy score for use in drug combination screening due to its best performance among four tested synergy models tested on dose-response data from 14 DSRT studies.
This work focuses on the evaluation of single fingerprint types. However, it is worth exploring the impact of combining several fingerprints together. We expect a statistically significant regression performance increase when combining molecular representations with low CKA similarity, or using models trained on multimodal data and/or key biological databases, such as Gene Ontology, Protein Data Bank and UniRef [5,[125][126][127]. Another line of inquiry could address high computational costs of DL and E3FP models. To this end we suggest exploring alternative molecular representations and CPU-friendly generative models based on genetic algorithms, such as STONED on SELFIES [128]. Finally, we hope that in the future biomedical DL research will go beyond representation learning and will be used to derive novel biological knowledge by e.g. inferring synthetic and retrosynthetic chemical reactions, identifying novel diseaseassociated druggable proteins and clinically actionable biomarkers [129][130][131].

Key points
• To choose an optimal molecular fingerprint type it is advised to enrich quantitative metrics of model performance with qualitative concerns related to the nature of downstream tasks, model interpretability and robustness requirements, as well as available chemometric expertise.
• Data-driven fingerprints, namely VAE 256 bits long trained on SMILES and Infomax 300 bits long trained molecular graphs are well-suited for regression tasks. 1 024 bits long 2D and 3D circular fingerprints are flexible and well-performant rule-based models fit for clustering tasks. Graph Autoencoder 64 bits long model may be used in any analysis scenario as a baseline option.
• Loewe synergy scores enable the highest correlation between in silico predictions and subsequent experimental validation of drug combination synergy in cancer cell lines.
• Centered Kernel Alignment is an effective measure of molecular representation similarity applicable to any combination of rule-based and DL fingerprints.

Data and code availability
The data and code underlying this article are available in the article and in its online supplementary material.

Acknowledgements
We acknowledge the computational resources provided by the Finnish IT Center for Science.

Combination sensitivity and synergy scores
It is known that the choice of an appropriate null model of no interaction is crucial for an accurate assessment of drug synergy. Four such models are used in the current work. Bliss model is based on probabilistic independence of drug effects, such that single agents are competing, but independent perturbations each contributing to a total effect [30]. HSA assumes that an expected drug combination effect is the higher of the two single-agent effects at corresponding concentrations [31]. Degree of non-interaction according to Loewe is equal to the outcome of a sham experiment, that is combining a drug with itself [32,33]. Zero Interaction Potency (ZIP) assumes that non-interacting drugs minimally impact each other's dose-response curves [34]. Combination Sensitivity Score quantifies drug combination efficacy, which is defined as an average of areas under drug combination dose-response curves, whereby each curve is determined by fixing one of the drugs at its IC50 concentration [29].

Regression model performance
13 regression models, representing a wide spectrum of ML algorithms, are compared in prediction of drug combination synergy and sensitivity. All models are tested with default hyperparameters in five-fold cross-validation using 1 024 bits long Morgan and 300 bits long Infomax fingerprints together with one-hot encoded cell line labels on 10% of randomly sampled data in three replicates. 13 tested regression models are: Bayesian Ridge, Catboost Gradient Boosting, ElasticNet, Gaussian Process Regression with a sum of Dot Product and White Kernels, Histogram-based Gradient Boosting, Isotonic Regression, Lasso regression, LassoLars regression, Linear regression, Ridge regression, Random Forest, Support Vector Machines with a linear kernel and XGBoost Gradient Boosting. All trees-based models are limited to a depth of six. Neural networks are not included in the comparison, since on tabular data they tend to perform on par with the previously mentioned methods, while being less interpretable and more difficult to set up [132,133].
Top four identified regression models are bagging and boosting ensembles, followed by linear kernel Support Vector Regression and Bayesian Ridge. Gaussian Process Regression, Isotonic and Lassolars models failed to generate any predictions and their results are omitted from Table 1. Catboost implementation of Gradient Boosting on Decision Trees (GBDT) is selected for further experiments due to its efficient GPU utilization and two design choices aimed at reducing overfitting: out-of-the-box categorical encoding that translates classes into numeric representations, binning them based on the expected value of target statistic, and Ordered Boosting whereby training data is randomly permuted throughout tree growing process to limit unwanted target metric prediction shift, one of well-known GBDT disadvantages [134][135][136].
Tuned in ten-fold cross-validation best hyperparameters for Catboost GBDT are Poisson bootstrap with 0.66 subsampling ratio, L2 regularization of 9, tree depth of 10, learning rate of 0.15 with 5 000 boosting iterations and 50 early stopping rounds. Overall, we conclude that ensembles are the most powerful type of tested ML algorithms in prediction of drug combination synergy and sensitivity [137].

Neural Network Training
For all DL models the hyperparameter search consisted of testing various activation functions (GELU, ELU, LeakyReLU, ReLU, SELU, Sigmoid, Softmax, Swish), dropout ratios (0.1 to 0.5 with 0.1 step size), initial learning rates (1 · 10 −1 to 1 · 10 −5 , with a step size of 0.1), number of patience epochs (1 − 30 with a step size of 1) and learning rate decay factor (0.9 to 0.1 with a step size of 0.1). Both Transformer models were tested with 3 − 6 attention heads. VAE encoders were tested with up to 5 convolutional layers and convolutional kernel sizes up to 10; VAE decoder is tested with up to 3 recurrent GRU layers of sizes up to 600. It is likely that longer training or a more extensive hyperparameter optimization or use of alternative optimizers, such as SGD with a cyclic learning rate scheduling, may result in lower final loss values, however, we do not expect it to have a significant influence on downstream experiments [138]. It is interesting to note that optuna-based optimization of the GAE model resulted in the encoder architecture consisting of seven convolutional layers 54 − 46 − 40 − 34 − 28 − 22 − 16 neurons wide. Such a high number of convolutional layers is somewhat unexpected, as performance of Graph Neural Networks based on spectral convolutions is expected to deteriorate with the convolutional layer count above six, most likely due to excessive feature smoothing [139,140]. Such GAE encoder architecture may be explained by a positive correlation between the performance of Neural Networks and a number of trainable parameters, as a seven layer GAE model with dot-product based Decoder has circa 10k learnable parameters, whereas Transformer and VAE models have two orders of magnitude more [141].

Prior work in drug combination synergy predictions
In three independent studies listed below single synergy scores are predicted.