Facing the small and biased data dilemma in drug discovery with federated learning

Artificial intelligence (AI) models usually require large amounts of high-quality training data, which stands in striking contrast to the small and biased data faced by current drug discovery pipelines. Federated learning has been proposed to utilize distributed data from different sources without leaking sensitive information from these data. This emerging decentralized machine learning paradigm is expected to dramatically improve the success of AI-powered drug discovery. Here we simulate the federated learning process with 7 aqueous solubility datasets from different sources, among which there are overlapping molecules with high or low biases in the recorded values. Beyond the benefit of gaining more data, we also demonstrate that federated training has a regularization effect that makes it superior to centralized training on pooled datasets with high biases. Two further cases are studied to test the usability of federated learning in drug discovery. Our work not only demonstrates the application of federated learning in predicting drug-related properties, but also highlights its promising role in addressing the small-data and biased-data dilemma in drug discovery.


Introduction
Current artificial intelligence (AI) requires data of high quality and large quantity to achieve good predictive performance. Data acquisition difficulties and data biases in the measurement of scientific tests have significantly limited AI's power in drug discovery [1][2][3]. Data acquisition challenges come from the time-consuming and expensive processes of data generation and the consequent confidentiality, especially in the later stages of drug development, e.g. data about drug pharmacokinetics, safety, and efficacy profiles. Taking ADME/T (Absorption, Distribution, Metabolism, Excretion and Toxicity) properties as an example, such data are usually highly standardized and of good quality, and would contribute to better predictive models and generate larger added value when used for modeling. However, few of these properties are exposed to the latest deep learning models due to confidentiality 4,5 , which can be considered an enormous loss for drug development. Beyond the data acquisition difficulties resulting from confidentiality, data biases in the measurement of scientific tests also perplex AI for drug discovery 6 . It is common to see, for example, a specific molecular property showing large discrepancies in recorded values from different sources, even under the same measurement of the same scientific tests. The discrepancy in recorded values is usually considered to come from data biases, because a recorded value in each data source is obtained from repeated measurements and the effect of variance has been reduced to a minimum. In the conventional machine learning paradigm, the discrepancy across sources is usually unified by taking the mean, median or a majority vote, which might bring the value closer to the "ground truth". However, the practitioners who generated those data may only care about the recorded value in their own experimental settings for reference (i.e. whether a structure modification will lead to the optimization of a property), rather than the absolute "true" value. Therefore, a shared global model for all data sources might not give good instructions to practitioners. Federated learning emerges as a new machine learning paradigm that provides viable solutions to the data acquisition and data bias problems faced by AI-driven drug discovery, by keeping confidentiality and customizing models for users.
Federated learning represents a scenario where multiple clients can train a model collectively without sharing raw data [7][8][9][10] . The original idea dates back to 2016: in the context of the enactment of the GDPR (General Data Protection Regulation) in Europe, users gained more control over the use of personal data, which challenged many companies that rely heavily on selling ads based on users' personal data. McMahan et al. from Google proposed federated learning and a year later first applied it in Gboard, the keyboard on Android phones 7,8 . There, federated learning was adopted to train, for example, a next-word prediction model across many phone devices without uploading users' data to central servers, improving users' input experience while preserving users' privacy. Notably, aside from training a single global model collectively on clients' datasets, it becomes viable for each client to have a customized model in federated learning 9,10 . Therefore, the problem of discordant records in centralized machine learning turns into an intrinsic feature of federated learning 10 . Federated learning has attracted substantial attention and has found more and more applications in much broader areas [13][14][15][16][17] ; it is also a promising approach to satisfy the needs of drug discovery 18 but has yet to be investigated and tested there. Drug discovery has a similar need to protect confidential or IP-sensitive data and, at the same time, to extract the maximum information/knowledge present within such data by machine learning. Moreover, given the high biases in drug discovery related data, customizing a model for each client is appealing for personalized prediction, as done in Gboard 8 .
Here, we set up a general federated learning framework for drug discovery (Fig. 1) and tested it on FATE (Federated AI Technology Enabler) 19 , an open-source project aiming at providing a secure computing framework for federated learning. Different from the previously mentioned Gboard application, which trained across millions of phone devices (cross-device federated learning), federated learning for drug discovery is trained across data silos, which is termed cross-silo federated learning. In this setting, there are a coordinator server and several collaborators instrumented with the federated learning client program. These clients can be big pharmas, biotech startups or even academic labs having their own data silos.
Figure 1. The life cycle of a federated learning system for drug discovery. In federated training: 1) the coordinator server broadcasts the latest shared global model to each client; 2) the client locally computes the model updates; 3) encrypts and uploads the model updates; 4) finally, the coordinator server aggregates all the encrypted model updates securely and uses them to update the shared global model for the next round of training. After the training is done, the best model is selected for rollout and might be customized for users who have their own labeled data.
During each round of cross-silo federated training, 1) the coordinator server broadcasts the latest shared global model to each client; 2) each client locally computes the model updates by executing the training program; 3) encrypts and uploads the model updates under a secure aggregation protocol; 4) finally, the coordinator server aggregates all the encrypted model updates securely and uses them to update the shared global model. Figure 1 illustrates the life cycle of a federated learning system for drug discovery. Many rounds of training are required until the model converges or meets a stopping criterion, which might be a metric that does not improve within a given number of rounds on a shared dataset on the coordinator server or on a held-out validation dataset on each client. The best model is then selected for rollout. Users who want to use the model for prediction can use the selected shared model directly. Alternatively, users who own plenty of labeled data themselves can opt to instrument the federated training program and locally update the model (without uploading updates to the coordinator), thus obtaining a customized model. Model customization is a common application and should be practically very useful.
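The four steps above can be sketched in a minimal plaintext simulation. The snippet below trains a toy linear model with size-weighted federated averaging; all data and sizes are illustrative, and a real deployment would encrypt the uploaded updates under the Secure Aggregation protocol rather than average them in the clear:

```python
import numpy as np

def local_update(w, X, y, lr=0.01, epochs=1):
    """One client's local gradient steps on a linear model (squared loss)."""
    w = w.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)  # full-batch gradient
        w -= lr * grad
    return w

def fedavg_round(global_w, client_data):
    """One federated round: broadcast, local training, aggregation."""
    updates, sizes = [], []
    for X, y in client_data:                          # step 1: broadcast
        updates.append(local_update(global_w, X, y))  # step 2: local compute
        sizes.append(len(y))                          # step 3 would encrypt here
    # step 4: size-weighted average of the client models
    return np.average(updates, axis=0, weights=sizes)

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0])
clients = []
for _ in range(4):                                    # four clients, analogous to F1-F4
    X = rng.normal(size=(50, 2))
    clients.append((X, X @ true_w + rng.normal(scale=0.1, size=50)))

w = np.zeros(2)
for _ in range(400):                                  # rounds until convergence
    w = fedavg_round(w, clients)
print(np.round(w, 2))
```

With one local epoch per round this reduces to plain gradient averaging; increasing the local epochs (the hyperparameter E discussed in the Methods) trades communication rounds against client drift.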
In this work, we simulated cross-silo federated learning processes in three use cases: solubility prediction, kinase inhibitory activity prediction and hERG liability prediction. The datasets in these use cases vary in the chemical space of the compounds covered, measurement methods, experimental conditions, nonstandard representations and data size. These real-world drug property datasets from different sources represent non-identical data distributions at different clients, from which we investigate how drug discovery projects can benefit from federated learning. Tested with different network structures and federated aggregation algorithms, the federated models consistently outperform models built on individual datasets alone, so federated learning can be relied upon to build more predictive models whenever it is feasible.

Results
Facing non-IID data
*The numbers outside the parentheses are the counts of molecules shared between two datasets, and the numbers inside the parentheses are the mean absolute deviation (MAD) of the LogS values of these shared molecules (Table 1).
In conventional centralized machine learning applications for drug discovery, researchers collect data from different sources to include more data, and assume the data are independent and identically distributed (IID). IID sampling of the training data is important to ensure that the stochastic gradient is an unbiased estimate of the full gradient. However, this assumption is usually violated due to the high data biases introduced in the measurement of scientific tests, which are conducted by different people under different experimental settings. As shown in Table 1, we collected water solubility datasets from different sources, and some of the molecules shared between datasets have distinct recorded values. For example, datasets F1 and C2 have 4 shared molecules, whose values in the two datasets have a mean absolute deviation (MAD) of 1.52. A large MAD between the shared molecules may signify that the data distribution varies across datasets to some extent. In conventional machine learning, even a model that predicts perfectly on one dataset may not predict well on another due to the violation of IID. Aside from a shared global model for all datasets, federated learning makes it viable to customize a model for each dataset, which is practically useful for dealing with biased/non-IID data.
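As an illustration, the MAD between molecules shared by two sources can be computed by joining the records on a molecule identifier. The SMILES strings and LogS values below are made up for the example; in practice SMILES should be canonicalized (e.g. with RDKit) before matching:

```python
import pandas as pd

# Hypothetical toy records: (SMILES, LogS) pairs from two sources.
f1 = pd.DataFrame({"smiles": ["CCO", "c1ccccc1", "CC(=O)O"],
                   "logS":   [0.5,   -1.6,       1.2]})
c2 = pd.DataFrame({"smiles": ["CCO", "CC(=O)O", "CCN"],
                   "logS":   [0.9,   0.4,       1.0]})

# inner join keeps only the molecules present in both datasets
shared = f1.merge(c2, on="smiles", suffixes=("_f1", "_c2"))
mad = (shared["logS_f1"] - shared["logS_c2"]).abs().mean()
print(len(shared), round(mad, 2))
```

Applying the same join to every dataset pair yields the counts and MAD values tabulated in Table 1.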
Tuning model update frequency
*The reported performance is the MAE of LogS on the test sets over 5 independent runs. Clients F1-4 are the federated training participants and hold out 1/10 of their dataset as the test set (8/10 for federated training and 1/10 for validation), while clients C1-3 participate in customization only and hold out 1/5 of their dataset as the test set (3/5 for customization training and 1/5 for validation). The best performance on each client is highlighted in bold.
In this study, datasets F1-4 were used to simulate clients who participate in the training process of the federated learning models, and C1-3 to simulate clients who did not participate in training but want to customize the federated model with their own data. We compared federated modeling with individualized and centralized modeling baselines (Figure 2 and Supplementary Table 1) in terms of the MAE values on the test set of each client. Generally, the sub-models trained on individual datasets achieved higher performance on their own internal test sets (i.e., F1/611, F2/465, F3/260 and F4/212) but much lower performance on the other test sets, indicating that these sub-models do not generalize well. In contrast, the federated learning model and the Union model showed much improved predictive performance across client datasets. For clients F1-4, the federated learning model generally yielded lower MAE values than the corresponding sub-models trained locally, and the prediction capability was maintained on the test sets from the external clients C1-3. It is worth noting that the federated model performed even better than the Union model, in which data from different sources are simply pooled together for training in a non-privacy-preserving way. As this is counterintuitive, we examined their differences in learning by comparing the weight distributions of the fully connected layers in the Union model and the shared federated model (Figure 2C-E).
The weight distributions of the Union model remained basically unchanged after centralized training compared with the initialized weight distribution, while the weight distributions of the federated model varied significantly, with more weights concentrated at 0. In the same network architecture with the same set of parameters, more weights at 0 means the model is more regularized and simpler, which is likely to generalize better 20,21 . This regularization effect explains the side benefit of federated learning, beyond preserving data privacy, when training on datasets with different systematic biases. As shown in Table 1, dataset F3 showed a larger systematic bias: the compounds shared with datasets F1 and F4 have average MAD values of 0.35 and 0.24, respectively, which may cause the inferior performance of the Union model and the shared federated model when testing on dataset F3. However, when the shared federated model is further fine-tuned locally with a small learning rate and without uploading the updates, which is a form of customization for the local data, the performance of the federated model can be further improved, especially for the datasets with higher biases (i.e., F3 and C2). To investigate how different network architectures and federated learning aggregation algorithms influence the performance of federated learning, apart from the previous MLP architecture and the FedAvg aggregation algorithm, a residual fully connected neural network (RFCN) architecture 22 and the FedAMP 23 aggregation algorithm were also tested. The RFCN model in our experiment is composed of a fully connected layer with 1536 neurons followed by a ResNet of two 2-layer blocks and one single-layer block (Supplementary Figure 2B). The centralized RFCN model (Union + RFCN) outperforms the federated learning model with the MLP and FedAvg algorithm (FedAvg + MLP) on 6 out of 7 clients.
This means that the centralized MLP model (Union + MLP) tested in the previous section is not a strong baseline, and the FedAvg + MLP model outperforms it easily owing to the regularization effect of FedAvg. With a strong centralized baseline (Union + RFCN), however, federated learning usually cannot outperform the centralized Union model (Table 4 and Table 5).
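A simple way to quantify the regularization effect discussed above is to compare the fraction of near-zero weights between the two trained models. The sketch below uses synthetic weight arrays as stand-ins for the real Union and federated model parameters (the shapes, scales and sparsity level are illustrative, not measured values):

```python
import numpy as np

def near_zero_fraction(weight_arrays, tol=1e-2):
    """Fraction of parameters whose magnitude is below tol."""
    w = np.concatenate([np.ravel(a) for a in weight_arrays])
    return float(np.mean(np.abs(w) < tol))

rng = np.random.default_rng(0)
# stand-in for the Union model: weights stay close to the random init
union_w = [rng.normal(scale=0.1, size=(256, 128))]
# stand-in for the federated model: many weights driven to exactly 0
mask = rng.random((256, 128)) > 0.5
fed_w = [rng.normal(scale=0.1, size=(256, 128)) * mask]

print(near_zero_fraction(union_w), near_zero_fraction(fed_w))
```

Running the same diagnostic on the actual layer weights reproduces the qualitative picture in Figure 2C-E: the federated model concentrates far more mass near zero.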

Improved network architecture and aggregation algorithm
FedAMP (federated attentive message passing) 23 , a personalized federated learning aggregation algorithm, is applied here to see how the aggregation algorithm influences the outcome. FedAMP encourages clients with similar model parameters to collaborate more strongly, so the algorithm adaptively discovers the hidden collaboration relationships between clients and enhances their collaboration effectiveness by assigning different models to different clients. The combination of FedAMP and RFCN performs best on 4 out of 7 clients, even better than the Union + RFCN baseline. However, in terms of the size-weighted mean and the unweighted mean, the Union + RFCN model performs best. To demonstrate more use cases, we also simulated federated learning on hERG liability and kinase inhibition datasets. Many kinase inhibitors suffer from either high toxicity or resistance in tumors 24 . It is of great importance for kinase inhibitors to precisely modulate the wanted kinases while avoiding the unwanted ones 25,26 . But usually a biotech company only has inhibitory activity data for the specific kinases it is working on. Constructing a predictive model for inhibitory activity across multiple kinases would be helpful for selective inhibitor screening, and federated learning can help such companies collectively train a more powerful model across multiple kinases. We built a federated model for kinase pIC50 prediction across four datasets from different sources (Supplementary Table 2). As shown in Table 4, the FedAvg + RFCN model outperforms 3 of the individual models by a large margin but is worse than the individual model built on the BioMedX dataset. With a better federated aggregation algorithm, the FedAMP + RFCN model is better than all of the individual models. However, the best model in this case is the Union + RFCN model trained by mixing the datasets in a centralized way.

Case study on kinase inhibition and hERG liability prediction
Drug-induced hERG block is one of the main causes of cardiotoxicity 27 , and assessing hERG liability is required in early drug discovery programs. However, various experimental assays can be used to evaluate hERG liability 28,29 , which induces large biases in the recorded values. Previous studies focused on merging data from different sources and constructing a centralized model to fit the data [30][31][32][33] , which can result in a biased and overfitted model that may not generalize well. In our case study, a federated hERG classification model was constructed using hERG inhibitory data from different sources (Supplementary Table 3). As seen from Table 5, the FedAvg + MLP and FedAMP + MLP models usually outperform models built on individual datasets but are always inferior to the Union + MLP model.
These two use cases suggest that drug discovery can rely on federated learning for better predictive performance without sharing sensitive data, which can largely cut costs with the help of the "knowledge" contributed by each participant.

Discussion
In the bigger federated learning context, the framework we set up only simulates participants who share the same feature space (molecular ECFP fingerprints) as input, which corresponds to horizontal federated learning (Supplementary Figure 1A) 34 . There is also a vertical federated learning scheme that can cope with participants having different feature types as input (Supplementary Figure 1B). Moreover, a combination of horizontal and vertical federated learning, referred to as federated transfer learning 34 , can effectively handle participants who share some feature types and samples but also have their own proprietary feature types and samples. Federated transfer learning will further expand the available feature space and sample size by taking the union of the feature spaces and sample spaces of multiple participants. For example, to predict the clinical outcome of drug candidates, we need to integrate data with shared and proprietary features from multiple parties, including pharmaceutical companies, hospitals and patients; federated transfer learning may generate large added value for each party.
Federated learning still faces some security concerns and malicious or non-malicious failures, but it has attracted substantial attention and is improving and evolving quickly. This paradigm opens up the possibility of integrating confidential datasets through secure distributed training, which was previously considered impractical yet is highly attractive for drug discovery. Given that predictive models in drug discovery often work in very confined domains, the opportunity to leverage larger and more diverse data silos from multiple institutions will improve the generalizability of predictive models in drug discovery.

Conclusion
In this work, we set up a cross-silo federated learning framework for drug discovery based on FATE 19 and constructed baseline models using MLP and RFCN architectures. We collected 7 drug solubility datasets and simulated the whole process, including federated training, model selection, rollout and customization. Federated training can perform better than individual training on each dataset and, more surprisingly, better than centralized training on the pooled, highly biased datasets. Visualizing the weight distributions of the parameters in the neural networks, we found that federated training learned a simpler model with more zero weights than conventional centralized training, which means that federated learning intrinsically has a regularization effect that may contribute to better generalization on highly biased data. Beyond that, federated learning makes it feasible to customize the global model locally (without uploading the model updates) when new users have plenty of labeled data, and we demonstrated that users can benefit more from customizing the global model than from using it directly. Federated learning represents a new machine learning paradigm whose privacy-preserving feature will encourage more institutions to fully utilize their data and expose more data to the latest machine learning models, thus addressing the "small data" dilemma in drug discovery. The federated learning setting also makes it feasible to customize models for different users/clients, hence alleviating the problem of data bias, achieving better predictive performance and providing wiser guidance in real application scenarios.

Data curation and partitioning
The 7 aqueous solubility datasets were collected from 7 different sources, preprocessed and curated by Sorkun et al. in AqSolDB 35 . Dataset F1 was extracted from eChemPortal, an open-source chemical property database developed by the OECD (Organisation for Economic Co-operation and Development) 36 . Datasets F2 and F4 were obtained from the EPI Suite Data website and were generated by the Water Solubility Fragment program 37 and the WSKOWWIN program 38 , respectively. Dataset F3 was taken from the work of Raevsky et al. 39 . Dataset C1 was collected from the work of Huuskonen et al. 40 , dataset C2 from the work of Wang et al. 41 , and dataset C3 from the work of Delaney et al. 42 .
In our simulation, the owners of datasets F1-4 are the collaborating parties who participate in federated training, and the owners of datasets C1-3 are users who want to use the federated trained model. To prevent overfitting, 1/10 of the molecules in datasets F1-4 were held out as validation sets, and another 1/10 were held out as test sets for comparing different models. Because users C1-3 have their own data, it is feasible for them to customize their own models by fine-tuning the federated trained model; we set the proportions of the train, validation and test sets of C1-3 to 3:1:1.
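The partitioning described above can be reproduced with a simple seeded shuffle-and-split; the dataset sizes below are placeholders for the real client dataset sizes:

```python
import numpy as np

def split(n_samples, fractions, seed=0):
    """Shuffle indices, then split them by the cumulative fractions."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    cuts = np.cumsum([int(round(f * n_samples)) for f in fractions[:-1]])
    return np.split(idx, cuts)

# F1-4: 8/10 federated training, 1/10 validation, 1/10 test
train, val, test = split(1000, (0.8, 0.1, 0.1))
# C1-3: 3/5 customization training, 1/5 validation, 1/5 test
c_train, c_val, c_test = split(500, (0.6, 0.2, 0.2))
print(len(train), len(val), len(test), len(c_train))
```

Fixing the seed keeps the partitions reproducible across the 5 independent runs used for reporting MAE.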
Kinase inhibition datasets were curated by Merget

Federated Averaging and Secure Aggregation
In our setting, all clients have the same features (molecular fingerprints) as input for the prediction task, so all clients are deployed with the same neural network architecture and can be trained with Federated Averaging. To ensure the security of the data and the model, the model updates should also not be uploaded in plaintext. Therefore, a Secure Aggregation protocol is implemented together with Federated Averaging. Both Federated Averaging and the Secure Aggregation protocol were proposed by Google's team in separate works 7,62 .
As described in the pseudo-code of Algorithm 1, when the training starts, the coordinator initializes the model parameters w0, which are broadcast to each client.
In each round of federated training, each client downloads the current shared global model wt from the coordinator server and trains the model locally on its own data with an SGD optimizer. Every E (a hyperparameter) epochs of local training, all clients compute their updates and encrypt them under the Secure Aggregation protocol. Under this protocol, the local model update of each client is masked with a unique random value that is carefully generated in coordination with all the other participants, such that all the random masks add up to 0 and cancel out when the coordinator aggregates the local updates uploaded by all clients. Since the random masks cancel out, the coordinator obtains the true averaged model updates and uses them to update the federated model parameters, yielding the current shared model. The shared global model is then broadcast to all clients, starting a new round of training. As with conventional neural networks, the training process stops when the federated model converges or the training reaches a predefined maximum-round threshold. Note that the Secure Aggregation protocol is not as simple as described in Algorithm 1; it involves a four-round interaction between the coordinator and the clients, which makes the protocol robust to client dropouts and delays. Both Federated Averaging and the Secure Aggregation protocol are implemented on FATE 19 .
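The mask-cancellation idea at the heart of Secure Aggregation can be demonstrated in a few lines. This is a plaintext toy: the real protocol derives the pairwise masks via key agreement and adds the dropout-recovery machinery mentioned above, both of which this sketch omits:

```python
import numpy as np

def masked_updates(updates, seed=0):
    """Add pairwise-cancelling random masks to each client's update.

    For every pair (i, j) with i < j, client i adds a shared random mask
    and client j subtracts the same mask, so all masks sum to zero when
    the coordinator adds up the masked updates.
    """
    rng = np.random.default_rng(seed)
    masked = [u.astype(float).copy() for u in updates]
    n = len(updates)
    for i in range(n):
        for j in range(i + 1, n):
            m = rng.normal(size=updates[0].shape)
            masked[i] += m   # client i adds the pairwise mask
            masked[j] -= m   # client j subtracts the same mask
    return masked

updates = [np.array([1.0, 2.0]), np.array([3.0, 0.0]), np.array([-1.0, 1.0])]
masked = masked_updates(updates)
# each masked update alone looks random, but the sum is exact
print(np.allclose(sum(masked), sum(updates)))
```

The coordinator thus learns only the aggregate of all clients' updates, never any individual client's plaintext update.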

Federated attentive message passing
Many existing federated learning practices fail to achieve good performance because a single global model is used for all clients. Personalized federated learning allows training a personalized model for each client without leaking private data. FedAMP (federated attentive message passing) 23 , a personalized federated learning aggregation algorithm, had not been implemented on FATE, so we simulated the process.
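Our simulation follows the spirit of FedAMP's attentive aggregation: each client receives a personalized average of all clients' parameters, weighted more heavily toward clients with similar parameters. The sketch below uses a softmax over negative squared parameter distances as a simplified stand-in for FedAMP's attention function, not the exact formulation of the paper or of any FATE component:

```python
import numpy as np

def attentive_aggregate(client_weights, sigma=1.0):
    """Personalized aggregation: one weighted average per client,
    with attention weights favoring clients with similar parameters."""
    W = np.stack([w.ravel() for w in client_weights])
    # squared parameter distances between every pair of clients
    d2 = ((W[:, None, :] - W[None, :, :]) ** 2).sum(-1)
    logits = -d2 / sigma
    att = np.exp(logits - logits.max(axis=1, keepdims=True))
    att /= att.sum(axis=1, keepdims=True)        # row-wise softmax
    return [att[k] @ W for k in range(len(W))]   # one model per client

# three clients: two with similar parameters, one very different
ws = [np.array([1.0, 1.0]), np.array([1.1, 0.9]), np.array([5.0, -5.0])]
agg = attentive_aggregate(ws)
print(agg[0], agg[2])
```

Here the first client's personalized model mixes mostly with the similar second client, while the dissimilar third client is left essentially untouched, which is the collaboration-discovery behavior described in the Results.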