Abstract
Most machine learning algorithms require that the training data and the data to which the model is applied are identically distributed to ensure effective learning. In biological studies, however, even small variations in the experimental setup can lead to substantial deviations. Domain adaptation offers tools to deal with this problem. It is particularly useful for cases where only a small amount of training data is available in the domain of interest, while a large amount of training data is available in a different, but relevant domain.
We investigated to what extent domain adaptation was able to improve prediction accuracy for complex biological data. To that end, we used simulated data and time-lapse movies of differentiating blood stem cells in different cell cycle stages from multiple experiments and compared three commonly used domain adaptation approaches. EasyAdapt, a simple technique of structured pooling of related data sets, was able to improve accuracy when classifying the simulated data and cell cycle stages from microscopic images. Meanwhile, the technique proved robust to the potential negative impact on the classification accuracy that is common in other techniques that build models with heterogeneous data. Despite its implementation simplicity, EasyAdapt consistently produced more accurate predictions compared to conventional techniques.
Domain adaptation can therefore substantially reduce the work needed to annotate a large training set in the domain of interest, work that would otherwise be required whenever the domain changes even slightly. Such changes are common not only in biological experiments but in almost all data collection routines.
1 Introduction
Over the last decade, machine learning, especially supervised learning, has become increasingly important in biological and medical research. Example applications range from protein structure prediction [1,2] and the identification of new disease subgroups from gene expression data [3,4], to the identification of cell connectivity [5] and the prediction of phenotypes from time-lapse data [6] and high-throughput imaging [7]. With improving capabilities of data collection and growing computational resources, machine learning will play an even more important role in understanding the underlying biological processes.
One of the most well-known limitations of supervised learning, however, is the need for a large amount of annotated data. In biological and medical research, this requirement is often difficult to meet, as it necessitates expert knowledge and intensive manual work. With an increase in high-throughput data it becomes more and more unrealistic to annotate all observations. An appealing alternative is to combine already-annotated data from one or multiple sources in order to build a model for a new problem for which there is only little annotated data.
Another limitation of classic supervised learning techniques is their poor performance on data from multiple sources. Batch effects are a typical problem in biological research. They describe qualitative changes in measurements caused by experimental changes that are unrelated to the biological feature under investigation [8]. Typically, differences in the experimental setup, such as the use of different protocols, reagents or machine settings, can all lead to such effects. Conventional machine learning techniques are less effective on data with batch effects, due to differences in the underlying distributions. Even when an experiment is designed to be a replicate, a classifier trained with data from one experiment often tends to have lower predictive accuracy when applied to data from another replicate [9]. While it is possible to build a new model using only data from one experiment, this would mean wasting expert knowledge and would involve labor-intensive annotation for each separate experiment. Consequently, it is desirable to have a model that can achieve high performance with limited additional annotation work.
Domain adaptation describes the case where at least a part of the data used to train a model follows a different distribution from the data on which the model is finally applied [10]. It is closely related to the notions of transfer learning and multitask learning [10,11,12]. We follow Pan and Yang [11] and consider transfer learning as the more general term, with domain adaptation being one special form of transfer learning. Domain adaptation can be applied where a large amount of annotated data is available in one or more domains that are not of direct interest (the source domains), while only a limited amount of annotated data is available in the domain of interest (the target domain) (Fig. 1). The idea of domain adaptation is to transfer the knowledge from the source domains to improve the learning in the target domain. Technically, the pre-trained decision boundary can be understood as requiring only some ‘minor’ tuning with a smaller amount of data before it can be applied to the new domain. Domain adaptation techniques were originally developed to address text classification problems [13,14,15].
Domains in this context correspond to different types, styles or topics, e.g., a model trained with news articles can be adapted to classify a corpus containing fiction texts [14]. However, the concept is very broad and can be applied to any variable that is likely to lead to differences in the data distribution, e.g. different machines, protocols or reagents. Here, we consider domains representing different replicates of a biological experiment, where each replicate can be seen as a different domain.
2 Methodology
2.1 Definitions
We define a domain D as a feature space X with the marginal probability distribution P(X) and a label space Y. A function f(·) maps xi to yi, where xi ∈ X and yi ∈ Y. We consider problems with an arbitrary number of source domains Ds1,…, Dsm (m ≥ 1) and a single target domain Dt. For a multi-class classification problem, we convert to a set of binary classification problems in a one-vs-all manner, i.e. by training a single classifier per class, with the observations of that class as the positive examples and all other observations as negative examples. The aim of domain adaptation is to use the knowledge from the source domains and limited labeling information from the target domain to effectively learn the objective predictive function f(·) for the target domain.
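The one-vs-all conversion described above can be illustrated with a minimal sketch. The function name and the example class names are hypothetical and serve only as an illustration; they are not taken from the study.

```python
def one_vs_all_labels(labels):
    """Turn multi-class labels into one binary label vector per class.

    For each class, observations of that class become positives (1)
    and all other observations become negatives (0).
    """
    classes = sorted(set(labels))
    return {c: [1 if y == c else 0 for y in labels] for c in classes}


# Hypothetical class names, used only for illustration.
stages = ["G1", "S", "G2", "S", "M"]
binary_problems = one_vs_all_labels(stages)
print(binary_problems["S"])  # [0, 1, 0, 1, 0]
```

One binary classifier is then trained per entry of the returned dictionary.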
2.2 Learning techniques
We compare a particular domain adaptation algorithm, the EasyAdapt technique [16], with four more conventional techniques of building classifiers. We refer to these as the ‘Source’, ‘Target’, ‘Combined’ and ‘Domain’ techniques. In this study, all domains share the same feature space X. In general, the techniques require a common feature subspace across domains. The details of these techniques are outlined below and illustrated in Fig. 2. For all techniques, we assume that the number of observations in the source domains is sufficiently large to estimate a model that will generalize to unseen data from the same distribution.
In the Source technique, we only use labeled data from the source domains Ds1,…, Dsm to train the model. The model trained on the source domains is then evaluated on data from the target domain, giving an indirect measure of proximity between source and target domains.
In the Target technique, we only use labeled data from the target domain Dt to train the model, without considering the data from the source domains. Given enough training data in the target domain, this model should perform the best.
In the Combined technique, we use labeled data from both the source and the target domains without any reference to the domain membership when training the models (where every data point is weighted equally). This is arguably one of the most common approaches in practice [17,18,19], where a typical scenario consists of a relatively large amount of labeled data from the source domains and a limited amount of data from the target domain.
In the Domain technique, we slightly adapt the Combined approach. An additional set of binary variables encoding the domain membership, in the form of a one-hot encoding, is added to the existing feature set [20]. This is expected to enable the estimated function to have a different offset for each domain, while making use of all the other predictors from all domains to define the shape of the function in common.
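The training sets used by the four conventional techniques can be sketched as follows. This is a minimal illustration using only the Python standard library, not the implementation used in the study; the function names and the list-of-rows data layout are assumptions.

```python
def add_domain_indicator(rows, domain_index, n_domains):
    """Append a one-hot domain-membership vector to each feature row."""
    one_hot = [1.0 if d == domain_index else 0.0 for d in range(n_domains)]
    return [list(row) + one_hot for row in rows]


def build_training_sets(sources, target_train):
    """Return the training features used by each conventional technique.

    sources:      list of feature matrices (lists of rows), one per source domain
    target_train: feature matrix of the labeled target-domain observations
    """
    n_domains = len(sources) + 1  # m source domains plus one target domain
    pooled_sources = [list(row) for src in sources for row in src]
    target_rows = [list(row) for row in target_train]

    domain_rows = []
    for k, src in enumerate(sources):
        domain_rows += add_domain_indicator(src, k, n_domains)
    # The target domain takes the last position of the one-hot vector.
    domain_rows += add_domain_indicator(target_train, n_domains - 1, n_domains)

    return {
        "Source": pooled_sources,                   # source data only
        "Target": target_rows,                      # target data only
        "Combined": pooled_sources + target_rows,   # pooled, no domain information
        "Domain": domain_rows,                      # pooled plus one-hot membership
    }
```

The label vectors would be concatenated in the same order as the rows (all source domains first, then the target training data).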
The EasyAdapt domain adaptation technique [16,21] uses a simple transformation to create a representation for the general data structure common to source and target domains and a separate representation for each domain. The transformation maps the features of each domain to an augmented vector consisting of one shared block and one block per domain: an observation keeps a copy of its features in the shared block and in the block of its own domain, while the blocks of all other domains are filled with zeros. Here, 0_{D_d} denotes a matrix of zeros whose dimensions correspond to the dimensions of domain d. EasyAdapt can be applied to an arbitrary number m of source domains Ds1,…, Dsm and a single target domain Dt (see Fig. 2 for a visualization and a comparison with other techniques). Features only available in the target domain could also be incorporated by setting the relevant entries for the other domains to 0. The technique is simple and flexible and can be used with any supervised classifier. However, it is recommended that the number of features per domain is not too large, because the augmented feature space is ℝ^{(m+2)p}, with p being the dimension of the shared feature space.
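The feature augmentation can be sketched in a few lines. This is a minimal illustration following the description in [16], not the code used in the study; the function name, the block ordering (shared block first, target domain last) and the list-of-rows layout are assumptions.

```python
def easyadapt_transform(rows, domain_index, n_domains):
    """Map each p-dimensional row to (n_domains + 1) * p dimensions.

    Block 0 holds the shared copy of the features, block domain_index + 1
    holds the domain-specific copy, and all other blocks are zero.
    With m source domains and one target domain, n_domains = m + 1,
    so the output has (m + 2) * p dimensions.
    """
    augmented = []
    for row in rows:
        p = len(row)
        out = list(row) + [0.0] * (n_domains * p)
        start = (domain_index + 1) * p
        out[start:start + p] = row
        augmented.append(out)
    return augmented


# Two source domains (indices 0 and 1) and one target domain (index 2).
x = [[1.0, 2.0]]
print(easyadapt_transform(x, 0, 3))  # [[1.0, 2.0, 1.0, 2.0, 0.0, 0.0, 0.0, 0.0]]
print(easyadapt_transform(x, 2, 3))  # [[1.0, 2.0, 0.0, 0.0, 0.0, 0.0, 1.0, 2.0]]
```

The augmented rows of all domains are stacked and passed unchanged to any supervised classifier; test observations from the target domain are transformed with the target index.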
3 Results
3.1 Simulation study
In order to visualize how the different techniques work and to test their performance, we created a two-dimensional artificial data set with one source domain and one target domain (each with 200 data points), where the ground truth is known (see Fig. 3A). The data was created as follows. In the source domain, we simulated the positive class by sampling 200 data points uniformly around a central point with coordinates (1.0, 0.0). The distance from the centre was sampled from a uniform distribution with mean 0.5 and a range between 0.1 and 0.9. The radial angle was uniformly distributed between 0 and 360 degrees. For the negative class, 200 data points were sampled uniformly around the same central point, but the distance from the centre was sampled from a uniform distribution with mean 0.9 and a range between 0.5 and 1.3. Again, the radial angle was uniformly distributed between 0 and 360 degrees. In order to create the data for the target domain, we translated both classes in the source domain by y′ = y − 0.60, where y is the horizontal coordinate in the source domain and y′ is the horizontal coordinate in the target domain. 15% of the data in the target domain was used for training. The remainder of the data in the target domain was used for performance evaluation. A support vector machine (SVM) [22,23] with a radial basis function (RBF) kernel was chosen as the base classifier for all five learning techniques described in the previous section. Parameters were selected using a grid search with 5-fold cross-validation.
From both the contour lines (Fig. 3B-F) and the ROC curves (Fig. 3G) it is evident that the EasyAdapt technique captured the distribution of the target domain most accurately (AUC = 0.91), by leveraging information from both the source domain and the limited amount of training data from the target domain in building the classifier. Fig. 3B illustrates that, due to the limited amount of training data in the target domain, the Target technique (AUC = 0.86) learned a decision boundary that was much more complicated than the underlying distribution. The Source technique (AUC = 0.55, Fig. 3C) directly applied the decision boundary learned from the source to the target domain, leading to an evident discrepancy with respect to the target domain distribution. The Combined technique (AUC = 0.64, Fig. 3D) shifted towards the target domain when building the model. Due to the comparatively large amount of source domain data, however, the model was strongly biased towards the source distribution. The Domain technique (AUC = 0.89, Fig. 3E) learned a model that described the target domain quite well, especially in regions close to the centre. In regions that were farther away, however, the contour lines were clearly distorted by source domain information. Compared with these four techniques, the EasyAdapt technique (Fig. 3F) learned a model that described the target distribution the best, by successfully integrating the information from the two domains.
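The sampling procedure for the artificial data set can be sketched as follows. This is a minimal illustration with the standard library only, not the code used to produce Fig. 3. The function names and the random seed are assumptions, and the sketch takes the "horizontal coordinate" to be the first coordinate of each point.

```python
import math
import random


def sample_ring(n, centre, r_min, r_max, rng):
    """Sample n points with uniform radius in [r_min, r_max] and uniform angle."""
    points = []
    for _ in range(n):
        radius = rng.uniform(r_min, r_max)
        angle = rng.uniform(0.0, 2.0 * math.pi)
        points.append((centre[0] + radius * math.cos(angle),
                       centre[1] + radius * math.sin(angle)))
    return points


def simulate_domains(n_per_class=200, shift=0.60, seed=0):
    """Create source and target data as described in the text.

    The seed is an assumption made for reproducibility of this sketch.
    """
    rng = random.Random(seed)
    centre = (1.0, 0.0)
    positive = sample_ring(n_per_class, centre, 0.1, 0.9, rng)  # mean radius 0.5
    negative = sample_ring(n_per_class, centre, 0.5, 1.3, rng)  # mean radius 0.9
    source_X = positive + negative
    labels = [1] * n_per_class + [0] * n_per_class
    # Target domain: translate both classes by 0.60 along the horizontal axis.
    target_X = [(x - shift, y) for x, y in source_X]
    return source_X, target_X, labels
```

Because the radii of the two classes overlap between 0.5 and 0.9, the classes are not perfectly separable even within a single domain.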
3.2 Imaging data set
For a realistic evaluation case, we applied the techniques to a biological data set [25] consisting of 2888 cells with 186 cell texture and shape features from time-lapse microscopy experiments, where 8 different cell cycle stages have been manually annotated. The data comes from three experiments, with 1468, 726, and 694 cells, respectively. It is important to note that the experiments differ regarding the microscope objectives and the magnification factor used (10x for experiments 1 and 3, and 20x for experiment 2), and were conducted by different lab technicians [25]. The different techniques were trained and tested in a one-vs-all manner on the 8 cell cycle stages (where each stage is treated as a separate class). We always picked two experiments to represent the source domains and the remaining experiment as the target domain. We tested all three possible combinations of two source domains and one target domain. All data from the source domains together with the data from the target training set were centered and scaled to unit variance. Subsequently, we applied a principal component analysis (PCA) to the data and compared two feature selection methods: (i) keeping only factors explaining 98% of variance (reducing the number of features to roughly 20-30), or (ii) keeping only the 16 highest loaded principal components. We used 4-fold cross-validation and a grid search to select parameters and subsequently evaluated performance on a test set in the target domain. The procedure was repeated 50 times for different target training set sizes of 100, 120, 150, 200, 250, 300, and 400 samples in order to obtain robust estimates for variable performance, especially when using small training set sizes. Independent of the amount of data available in the target domain, we used a fixed-size test set with 240 samples for performance evaluation, which was randomly chosen for every iteration and for every new training set. In order to evaluate and compare performance of techniques, we chose the microaveraged AUC. 
Using this metric, class imbalances were taken into account by computing cumulative values for true positives, false negatives, true negatives and false positives for every label and then computing the performance measure from the aggregated values [24]. We compared three different base classifiers, namely a linear SVM [23], an RBF kernel SVM [22], and a random forest classifier [26].
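Microaveraging can be illustrated with a small standard-library sketch. It pools the one-vs-all decisions of all classes into a single ROC analysis, which corresponds to aggregating the per-label counts before computing the measure. This is an illustration under that assumption, not the evaluation code used in the study; the function names are hypothetical.

```python
def auc(scores, labels):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pairs = sorted(zip(scores, labels))
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs both positive and negative examples")
    # Assign average ranks to tied scores, then use the rank-sum formula.
    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(lab for _, lab in pairs[i:j])
        i = j
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def micro_averaged_auc(score_matrix, true_classes, classes):
    """Pool the per-class scores of all observations into one binary problem.

    score_matrix: one row per observation, one score per class (same order as classes)
    true_classes: the annotated class of each observation
    """
    pooled_scores, pooled_labels = [], []
    for row, truth in zip(score_matrix, true_classes):
        for score, cls in zip(row, classes):
            pooled_scores.append(score)
            pooled_labels.append(1 if truth == cls else 0)
    return auc(pooled_scores, pooled_labels)
```

Because every observation contributes one decision per class, frequent classes contribute more positive examples to the pooled problem than rare ones.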
We found that the EasyAdapt technique is particularly robust when working with a small set of training samples in the target domain and consistently performed among the top techniques in the regime of small training set sizes (Fig. 4). As expected, with increasing training set size the Target technique catches up, and for 400 training samples (the maximum training set size in the study) it was among the best performing techniques. In general, performance improved for all techniques with increasing training set size, with the exception of the Source technique, which was not trained with any of the target domain data. Results from all experiments are summarised in Table 1, showing the performances of the five learning techniques across three different base classifiers, two different feature selection methods and three different target domains (each combination of a base classifier, a feature selection method and a target domain is referred to as a ‘setting’ below).
To assess performance of the different techniques across training set sizes (Fig. 4), we measured the area under the curve of performance against training set size for each of the 50 iterations for a given setting. This renders an aggregated performance for each train/test split across the range of training set sizes we used and gives us an estimate of performance for small to medium training set sizes. In contrast to a simple average of the microaveraged AUC across training set sizes, this measure takes into account the fact that we tested a larger number of small training set sizes (in the range of 100-200 samples) and is a more conservative measure than simple averaging in our case. This is achieved by weighting performance according to the sampling frequency of the training set sizes. Additionally, we normalized performance, so that a perfect classifier would achieve a relative performance of 1, corresponding to an AUC of 1 for all training set sizes in the range from 100 to 400 samples. Fig. 5 shows the distribution of this performance measure for different techniques, classifiers and transfer directions. Across all settings, the EasyAdapt technique consistently showed superior performance over other techniques: among 18 different settings, EasyAdapt ranked best or tied for best 15 times and second best 3 times. This not only demonstrates the effectiveness of knowledge transfer with EasyAdapt, but also shows its generality with respect to base classifiers and feature selection methods under different transfer situations. The second best technique was the Domain technique, which ranked best or tied for best 8 times and second best 3 times. This indicates that in many cases the membership feature used by the Domain technique was also able to leverage some knowledge from related domains. The technique with the lowest performance was the Source technique, which ranked last in every setting.
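One plausible reading of this aggregated measure is a trapezoidal integration of the AUC over the training set sizes, divided by the width of the size range. The text does not specify the exact procedure, so the sketch below is an assumption rather than the authors' implementation, and the function name is hypothetical.

```python
def normalized_area(train_sizes, auc_values):
    """Trapezoidal area under the AUC-vs-training-size curve.

    The area is divided by the width of the size range, so that a
    constant AUC of 1 across all sizes yields a relative performance of 1.
    """
    if len(train_sizes) != len(auc_values) or len(train_sizes) < 2:
        raise ValueError("need at least two matching (size, AUC) pairs")
    area = 0.0
    for k in range(1, len(train_sizes)):
        width = train_sizes[k] - train_sizes[k - 1]
        area += width * (auc_values[k] + auc_values[k - 1]) / 2.0
    return area / (train_sizes[-1] - train_sizes[0])


# Training set sizes from the study; a perfect classifier scores 1.
sizes = [100, 120, 150, 200, 250, 300, 400]
print(normalized_area(sizes, [1.0] * len(sizes)))  # 1.0
```

Under this reading, closely spaced training set sizes receive proportionally narrow intervals, so the densely sampled range of 100-200 samples does not dominate the result as it would in a simple average.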
In practice, it is hard to predict whether pooling of data will actually improve prediction performance or lead to negative transfer, i.e. learning in the target domain might be negatively affected by the use of additional information, if domains are too different [11,27]. An example for such negative transfer is the case of experiment 2 as the target domain. Here, both the Combined and Domain techniques performed considerably worse compared to the Target technique (see Table 1). This can probably be explained by stronger differences in distributions between experiments 1 and 3 on the one hand, and experiment 2 on the other, as experiment 2 used a different magnification. This difference can also be seen from the extremely poor performance of the Source technique for experiment 2 as the target domain. It is worth noting that the negative transfer that affected the Combined and Domain techniques with experiment 2 as target domain appears stable across different training set sizes (Fig. 4). Importantly, we do not observe such negative transfer in the case of the EasyAdapt technique. Performance of EasyAdapt was comparable or even slightly better than the Target technique when looking at experiment 2 as the target domain.
4 Discussion
In the present study, we investigated whether accounting for experimental variation in biological data using domain adaptation techniques can help improve prediction performance and reduce the need for labeled data. We show that, given only limited training data, the EasyAdapt domain adaptation technique indeed boosts prediction performance both in a simulation study and on a data set of imaged single cells [25], and leads to more robust predictions in the presence of experimental variation.
Recently, there have been a number of approaches that try to improve the generalization of deep neural networks across multiple domains. This is important, as neural networks have been known to generalize relatively poorly [28]. Often, the approach is to learn transferable representations that both identify the factors driving variation within the data and match feature distributions across domains [29,30]. Recent work has used models that are able to adapt to different domains very quickly by using an efficient parametrization of deep neural networks and adapter residual modules [31,32]. There is also interesting work combining generative adversarial networks with domain adaptation [33,34,35]. It is worth noting that the approach described in this work is orthogonal to these models, and can be used with any type of supervised machine learning algorithm, including but not limited to deep neural networks.
Applications of domain adaptation techniques in biological research have so far been mostly restricted to genomic sequence analysis [36,37]. Widmer et al. [38,39] used a more general multi-task learning framework in conjunction with regularization based supervised learning methods, such as SVM and logistic regression for splice-site and binding site prediction and to transfer model parameters learned on 2D images to 3D images in order to enhance learning. In contrast to [39], we do not learn domain specific differences explicitly. In practice, this information is also often hard to quantify. Here, we rather focus on the effect of training set size and the pooling of heterogeneous data without quantitative knowledge about the relationship between domains. We compare performance of the EasyAdapt technique across three different machine learning algorithms. Furthermore, we consider a range of common ways of combining information from different domains, e.g. via explicit encoding of domain membership, a procedure that is often used in practice. We demonstrate that the EasyAdapt technique is relatively robust to negative effects of data pooling.
Our results have implications for dealing with biological batch effects in machine learning tasks and for improving learning in settings with limited training data, if additional source data is available. The EasyAdapt technique allows the reuse of existing data sets as source data and avoids cost-intensive manual labelling of training data. Results confirm the problem that is one major motivation of this work: a model trained using data from one biological experiment is likely to have much inferior performance when applied to a different experiment, despite the experiments sharing similar experimental setups. Importantly, the EasyAdapt technique is general in that it does not change the machine learning method used and can therefore be applied to a wide set of problems. Because the feature space grows linearly in the number of domains, the approach is not applicable in cases with very large feature spaces or a large number of domains.
In general, classification accuracy in the transfer learning setting will be an increasing function of both the number of training samples available and the homogeneity and level of relatedness of the training samples to the test set. Given a limited set of training samples and reasonable relatedness between training and test set, transfer learning can help to improve classification accuracy. However, in the case when the relatedness between training and test set is insufficient to enable transfer, there is potential for negative impact when adding additional data from a different domain (known as negative transfer). EasyAdapt strikes a balance between improving performance in cases when additional information is available and robustness to experimental variations. Compared with classic techniques such as the Domain and Combined techniques, the EasyAdapt technique is less affected by negative transfer and for small to medium training set sizes it can improve learning in the target domain.
The technique is limited by the necessity to identify domains, i.e. it is necessary to have domain knowledge about potential differences in experimental conditions and fundamental differences in feature distributions that define domains. Furthermore, it requires that the domains have a shared feature subspace and are distinct [16]. Both requirements are typically fulfilled in biological data. Further research will be necessary to develop empirical measures of domain relationships that help to identify cases where the use of domain adaptation in machine learning can be particularly helpful.
Footnotes
lichaowang{at}gmail.com,carsten.marr{at}helmholtz-muenchen.de
Slightly revised and accepted at the 28th International Conference on Artificial Neural Networks (ICANN) 2019 in Munich, Germany