Abstract
Profiling cell morphology is a powerful tool for inferring cell function. However, this technique retains a high barrier to entry. In particular, configuring image processing parameters for optimal cell profiling is susceptible to cognitive biases and dependent on user experience. Here, we present an interactive machine learning strategy that learns the optimum cell profiling configuration to maximise the quality of the cell profiling outcome. The process is guided by the user, from whom a rating of the quality of a cell profiling configuration is obtained. The machine learning algorithm uses this information to automatically recommend the next configuration to examine. We validated our interactive approach against the standard human trial-and-error scheme to optimise an object segmentation task on the standard software CellProfiler. Our approach enabled rapid optimisation of an object segmentation pipeline, which segmented objects more accurately than pipelines optimised through human trial-and-error. Users also attested to the ease of use and reduced cognitive load enabled by our machine learning strategy over the standard approach. We envision that our interactive machine learning strategy can enhance the quality and efficiency of pipeline optimisation to democratise image-based cell profiling.
Introduction
Image-based cell profiling is a powerful tool to capture the intricacies of cell phenotype. The resolution and rapidity of image-based cell profiling have enabled the study of the mechanisms of, and cellular responses to, disease [1], drugs [2], or materials [3]. Together with the explosion of automated and high-throughput microscopy techniques, image-based cell profiling is increasingly relied on as a biological toolkit. Central to image-based profiling are software tools that ease the burden of processing a large volume of images by automating detection, segmentation and feature extraction [4].
To optimise a cell profiling process or pipeline for a particular image set, users configure the optimal values for various image processing parameters (e.g. image correction, object segmentation and feature extraction) in a trial and error process. The standard toolbox CellProfiler already reduces this task by carefully curating the most pertinent and widely-used parameters in cell profiling [5]. Yet selecting an optimum set of cell profiling pipeline parameters (or ‘configuration’) from the available parameter space is still an onerous task and prone to biases. Optimising an image processing pipeline is biased against those with limited knowledge in biology, microscopy or image analysis. The high cognitive load of pipeline optimisation can inadvertently lead to decision-making bias that deteriorates the quality of the cell profiling result. Testing of pipelines on small datasets can also induce an availability bias, where positive results from small subsets are incorrectly assumed to generalise to the entire dataset. Furthermore, novice users may be susceptible to default bias, where default settings are selected over the true optimal ones. While incredibly informative and powerful for biology, cell profiling is hindered by the users’ capability to process images robustly and reproducibly.
Here, we present a new method that integrates user input with machine learning to optimise the configuration of a cell profiling pipeline. We obtain from the user the quality score (QS), a metric to describe the performance of a pipeline configuration. We use a Bayesian optimisation (BO) process to learn the optimal pipeline configuration by maximising QS in an iterative fashion. Effectively, we present a machine learning method that diverts the burden of pipeline optimisation from the user, and automates and accelerates pipeline optimisation. Through our interactive machine learning method, we reduce cognitive load and bias against new users and thus improve the rapidity and quality of cell profiling.
We created new modules on the standard biological toolbox CellProfiler (CP) to implement our interactive machine learning approach. These three new modules can be easily integrated within the existing CP software infrastructure. We created two types of modules: evaluation modules to obtain QS from users, and a BO module to define parameters that will be automatically optimised. Our approach to optimising pipeline configuration uses the evaluation and BO modules together to obtain QS and automatically change pipeline settings towards maximisation of the QS. We also tested our BO-based approach to optimise a pipeline configuration for object segmentation. Users with varying levels of expertise obtained higher QS of object segmentation using our BO approach compared to the conventional trial and error method. Users also attested to the ease of use of the BO approach, with a majority electing to incorporate the process into their own pipeline optimisation process.
The rest of the paper is organised as follows. First, we describe the conceptual framework behind our BO approach to pipeline optimisation. Next, we present the results of user experiments comparing our BO approach to the conventional method of pipeline optimisation. Finally, we discuss the implications of our work for scientifically reliable, high quality, and rapid image-based cell profiling for all.
Semi-automated pipeline optimisation using machine learning
We propose to utilise a semi-automated, machine learning approach to optimise a cell profiling configuration (Fig. 1). Critical to this approach is the explicit definition of the level of performance of each cell profiling configuration. We define the QS as a metric of the quality of a pipeline configuration. We also created a highly customisable Bayesian optimisation (BO) module that allows the user to define the image processing parameters to be optimised. The QS is then exploited by a BO algorithm to automatically change all user-specified image processing parameters simultaneously. The BO process uses the evaluation and BO modules together to iteratively obtain the QS and automatically change pipeline parameters with the goal of QS maximisation. Our concept has been implemented as a collection of stand-alone CP modules which can be used as plugins to the existing software: the ManualEvaluation, AutomatedEvaluation and BayesianOptimisation modules. The implementation, module plugins, CP pipelines, training and testing datasets, and results can be found at https://github.com/uofg-cellprofiler-modules/bayesopt4cellprofiler.
Evaluation modules
The evaluation modules were created to obtain three key pieces of information at each iteration: the target object requiring optimisation, the minimum acceptable QS required by the user (referred to as the ‘target QS’), and the QS from the latest pipeline configuration (referred to as the ‘current QS’). Definition of the target and current QS depends on whether the user will provide a QS at each iteration (manual) or set criteria that define robust processing of the target object (automatic). To provide a concrete example, we discuss the application of our evaluation modules for object segmentation, a common bottleneck in pipeline optimisation.
AutomatedEvaluation
The AutomatedEvaluation module automatically evaluates the quality of a pipeline configuration based on user-prescribed criteria characteristic of an optimally segmented object (the target QS) (Fig. S2). Thus, AutomatedEvaluation requires prior knowledge of the optimally segmented object. For instance, an optimally segmented nucleus rarely contains any concavities, allowing us to define the target QS from high measurements of solidity. At least one target object, with its characteristics (e.g. shape, texture, intensity) measured, needs to be placed before AutomatedEvaluation in the pipeline. When multiple measurements of a segmented object are used, an aggregate is calculated to obtain a target QS. At each iteration of BO, AutomatedEvaluation calculates the current QS of the segmented object using the same measurements defined in the target QS. If the current QS falls below the target QS, the BO process continues. When the current QS meets or exceeds the target QS, the BO process stops and the segmented object resulting from the optimised pipeline configuration is displayed. If the user deems segmentation to be poor, the user will be prompted to redefine the target QS.
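The comparison of current and target QS described above can be illustrated with a minimal Python sketch. It is not the module's code: the function names, the per-measurement acceptance ranges, and the choice of an unweighted mean as the aggregate are all assumptions made for illustration, since the aggregation rule is not specified here.

```python
from statistics import mean

def automated_qs(objects, criteria):
    """Return the share (0-100) of measurement checks that pass.

    objects  -- list of dicts mapping measurement name -> value, one per object
    criteria -- dict mapping measurement name -> (low, high) acceptable range
    """
    if not objects:
        return 0.0  # nothing was segmented, so no criterion can be met
    per_measurement = []
    for name, (low, high) in criteria.items():
        passed = [low <= obj[name] <= high for obj in objects]
        per_measurement.append(100.0 * sum(passed) / len(passed))
    # Aggregate across measurements (assumption: unweighted mean).
    return mean(per_measurement)

def should_continue(current_qs, target_qs):
    """BO keeps iterating while the current QS is below the target QS."""
    return current_qs < target_qs

# Hypothetical measurements for four segmented nuclei.
nuclei = [
    {"solidity": 0.97, "eccentricity": 0.40},
    {"solidity": 0.95, "eccentricity": 0.55},
    {"solidity": 0.78, "eccentricity": 0.62},
    {"solidity": 0.93, "eccentricity": 0.91},
]
criteria = {"solidity": (0.90, 1.00), "eccentricity": (0.00, 0.80)}

current = automated_qs(nuclei, criteria)
print(current, should_continue(current, target_qs=90.0))
```

In this toy case three of four nuclei satisfy each criterion, so the current QS falls short of a target of 90 and another BO iteration would be requested.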
ManualEvaluation module
The ManualEvaluation module relies on the user’s subjective rating of a segmented object (Fig. S3). First, the user is required to define the minimum acceptable segmentation quality or target QS on a scale of 1 (poor quality) to 10 (excellent quality). During pipeline execution, ManualEvaluation temporarily interrupts the pipeline to display the segmented object from the most recent pipeline configuration. The user is required to rate the quality of the segmented object using the same rating scale of 1 to 10 to provide the current QS. The BO process will continue to iterate until the target QS is met or exceeded. Both AutomatedEvaluation and ManualEvaluation allow the user to customise the objects and images to be displayed at each iteration of BO.
BayesianOptimisation module
The BayesianOptimisation module implements a Bayesian optimisation algorithm that automatically optimises the pipeline configuration by maximising the QS (Fig. S4). The module is designed to be highly customisable.
BayesianOptimisation requires at least one evaluation module placed upstream from which the current QS can be obtained. BayesianOptimisation allows the combination of the two evaluation modules, with the weighting of each contribution to the joint current QS explicitly defined by the user. The BayesianOptimisation module also provides full customisation of the image processing modules and settings to be optimised using the BO algorithm. Even settings within object identification modules, such as the threshold correction factor or adaptive window value in IdentifySecondaryObjects, can be optimised by the BO process. In principle, any parameters or settings with integer and float values in modules upstream of the BayesianOptimisation module can be optimised by the BO process. BayesianOptimisation also gives the user control of the BO process, including setting the maximum number of iterations of BO.
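As an illustration of the user-defined weighting between the two evaluation modules, the following Python sketch blends a manual rating with an automated score. The function names, the rescaling of the 1 to 10 rating onto a 0 to 100 scale, and the convex-combination form are assumptions for illustration; the exact formula used by the module is not stated here.

```python
def rescale_rating(rating, low=1, high=10):
    """Map a 1-10 rating to 0-100 so it is comparable with a percentage score."""
    return 100.0 * (rating - low) / (high - low)

def composite_qs(manual_qs, automated_qs, manual_weight):
    """Blend two scores that share a scale, using a weight in [0, 1]."""
    if not 0.0 <= manual_weight <= 1.0:
        raise ValueError("manual_weight must lie between 0 and 1")
    return manual_weight * manual_qs + (1.0 - manual_weight) * automated_qs

manual = rescale_rating(7)  # hypothetical: the user rated the result 7 out of 10
joint = composite_qs(manual, automated_qs=80.0, manual_weight=0.5)
print(round(joint, 2))
```

Putting both scores on a common scale before blending keeps the weight interpretable: a weight of 1 reproduces the manual score and a weight of 0 reproduces the automated one.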
Together, the evaluation and BayesianOptimisation modules aim to minimise the gap between the current QS and target QS by automatically changing pipeline configuration. A pop-up window shows the deviance of current from target QS at every iteration of the BO process (Fig. 2). The BO process iterates until the current QS matches the target QS (i.e. quality gap = 0) or the maximum number of iterations specified by the user has been attained.
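The stopping rule described above (quality gap of zero, or the iteration budget exhausted) can be sketched in Python as a loop over two callables. Here `propose` stands in for the BayesianOptimisation module and `evaluate` for an evaluation module; both names, the toy scoring function, and the random proposer are hypothetical placeholders rather than the plugin's interface.

```python
import random

def optimise(propose, evaluate, target_qs, max_iterations):
    """Iterate until the quality gap closes or the iteration budget runs out."""
    if max_iterations < 1:
        raise ValueError("max_iterations must be at least 1")
    history = []  # (configuration, QS) pairs seen so far
    for _ in range(max_iterations):
        config = propose(history)
        current_qs = evaluate(config)
        history.append((config, current_qs))
        gap = max(target_qs - current_qs, 0.0)  # 0 once the target is met
        if gap == 0.0:
            break
    best_config, best_qs = max(history, key=lambda pair: pair[1])
    return best_config, best_qs, len(history)

# Toy stand-ins: a random proposer and a score that peaks at a factor of 1.2.
rng = random.Random(0)

def random_proposal(history):
    return {"threshold_correction_factor": rng.uniform(0.5, 2.0)}

def toy_score(config):
    return 10.0 - 8.0 * abs(config["threshold_correction_factor"] - 1.2)

best, qs, used = optimise(random_proposal, toy_score, target_qs=8.0, max_iterations=30)
print(best, round(qs, 2), used)
```

Because the proposer receives the full history, a model-based strategy can be swapped in for `random_proposal` without changing the loop.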
Bayesian Optimisation algorithm
At the core of the BayesianOptimisation module is a custom version of a BO algorithm [6–8]. BO relies on a surrogate model that provides calibrated predictive distributions for the quality score (QS), y, of a given pipeline configuration, x. We define the surrogate model, f(x), mapping from configuration to QS, as a Bayesian regression model with a Gaussian likelihood, 𝒩(y | f(x), σ_n), and a Gaussian process (GP) prior on f such that f ∼ 𝒢𝒫(m(x), k(x, x′) | θ) [9]. The GP is defined by the mean function, m(x) = 0, and a chosen covariance function, k(x, x′), with the hyperparameters collected in θ = {σ_n, σ_f, σ_ℓ}. Given the GP and a training set, D = {(x_i, y_i)}, i = 1, …, N, in which each pipeline configuration is paired with its corresponding QS, the predictive distribution for any pipeline configuration, x*, is directly available as p(y* | x*, D, θ). This allows us to estimate both the expected QS and its uncertainty for all unseen configurations. For simplicity, we have defined the model without priors on the hyperparameters, and we estimate the hyperparameters by marginal likelihood optimisation (after an initial bootstrap phase). However, some BO algorithm hyperparameters and GP parameters (e.g. the length scale of the covariance function and the assumed noise level) can be customised in the BayesianOptimisation module.
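To make the predictive distribution concrete, the sketch below computes a GP posterior mean and standard deviation with NumPy. Several choices are assumptions rather than details from the text: the covariance function is taken to be squared-exponential, σ_n is treated as a noise standard deviation, the hyperparameters are fixed instead of being fitted by marginal likelihood, and the ratings are centred so that the zero prior mean is reasonable. The function names are hypothetical.

```python
import numpy as np

def sq_exp_kernel(a, b, sigma_f, length_scale):
    """Squared-exponential covariance between rows of a and rows of b."""
    sq_dist = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return sigma_f**2 * np.exp(-0.5 * sq_dist / length_scale**2)

def gp_predict(x_train, y_train, x_new, sigma_n=0.1, sigma_f=1.0, length_scale=1.0):
    """Posterior mean and standard deviation of f at x_new (zero prior mean)."""
    k_train = sq_exp_kernel(x_train, x_train, sigma_f, length_scale)
    k_train += sigma_n**2 * np.eye(len(x_train))  # observation noise on the diagonal
    k_cross = sq_exp_kernel(x_train, x_new, sigma_f, length_scale)
    chol = np.linalg.cholesky(k_train)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    mean = k_cross.T @ alpha
    v = np.linalg.solve(chol, k_cross)
    var = sigma_f**2 - np.sum(v**2, axis=0)
    # Adding sigma_n**2 to var would give the variance of a noisy rating y*.
    return mean, np.sqrt(np.maximum(var, 0.0))

# Hypothetical training data: one tunable setting and three user ratings.
x_train = np.array([[0.8], [1.0], [1.6]])
y_train = np.array([4.0, 7.0, 3.0])
y_mean = y_train.mean()

x_new = np.array([[1.0], [5.0]])  # one evaluated value, one far from the data
mean, std = gp_predict(x_train, y_train - y_mean, x_new, length_scale=0.3)
print(mean + y_mean, std)
```

The second query point lies far from every evaluated configuration, so its prediction reverts to the prior: the mean returns to the average rating and the uncertainty to σ_f. This contrast between explored and unexplored regions is what the acquisition step relies on.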
BO exploits the predictive distribution at any point in the optimisation process to sequentially choose the next set of image processing parameters (i.e. the configuration) to evaluate. It does so by trading off the desire to optimise the current QS against the implicit need to learn the surrogate model. To make this trade-off, we applied Expected Improvement [7, 8]. At the end of each iteration, the current QS from the newly chosen pipeline configuration is included in the training set and the model is re-estimated before the BO process repeats. A summary of the BO process is given in Fig. S1.
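For reference, the standard closed form of Expected Improvement for a maximisation problem with a Gaussian prediction can be written in a few lines of Python. This is a generic sketch, not the module's code: the function names are hypothetical, and any exploration offset or other variant used in the custom algorithm is not represented.

```python
import math

def expected_improvement(mean, std, best_qs):
    """Closed-form EI for maximisation under a Gaussian prediction."""
    if std <= 0.0:
        return max(mean - best_qs, 0.0)  # no uncertainty: improvement is known
    z = (mean - best_qs) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mean - best_qs) * cdf + std * pdf

def pick_next(candidates, best_qs):
    """candidates: list of (configuration, predicted mean, predicted std)."""
    return max(candidates, key=lambda c: expected_improvement(c[1], c[2], best_qs))[0]

# Hypothetical predictions: A matches the best QS with little uncertainty,
# B is predicted slightly worse but is far less certain.
candidates = [("A", 7.0, 0.1), ("B", 6.5, 2.0)]
print(pick_next(candidates, best_qs=7.0))
```

In this example the uncertain configuration B is preferred even though its predicted QS is lower, which illustrates how the criterion balances exploiting good configurations against exploring poorly understood ones.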
User experiments
Methods
User based experiments in pipeline optimisation for object segmentation were performed. These experiments were conducted to test our interactive machine learning approach against the trial and error (here referred to as ‘conventional’) method of optimising a pipeline configuration. Experiments involving human subjects were performed with approval from the Ethics committee of the College of Science and Engineering, University of Glasgow (case no. 300180170).
Participants were randomly assigned the objective of segmenting either cells or focal adhesions. Pipelines for both objectives were designed to have interdependent modules, where segmentation of cells and focal adhesions was dependent on nuclei and cell segmentation, respectively. Each participant was required to optimise 1 pipeline using the conventional approach, and 3 pipelines using our interactive machine learning approach. Participants were given 20 minutes to optimise each pipeline. In the conventional approach, participants were required to optimise settings across prescribed modules in a trial and error manner. Using the BO approach, participants were required to use BayesianOptimisation in conjunction with either AutomatedEvaluation, ManualEvaluation or both evaluation modules (called ‘Composite Evaluation’). A summary of the pipeline configuration automatically optimised by BayesianOptimisation is found in Table S1 and Table S2 for cell and focal adhesion segmentation, respectively.
Each pipeline optimisation task used identical image sets for training and testing. At the end of each task, 10 images were used to test the quality of the resulting pipeline configuration. Participants rated the QS of the test images. To provide a baseline measurement, participants also rated the QS of the test images segmented from a pipeline optimised by a CP expert. All tasks performed by each participant are summarised in Table 1. All tasks were conducted on the same computer running CellProfiler v3.1.8. A detailed description of methods (including cell preparation, image acquisition, and participant recruitment) is provided in Supporting Methods. CP pipelines and image sets used in both tasks are included in Supporting Information 2.
A Friedman test for rank-based analysis of paired samples, with Dunn’s post-hoc test for pairwise comparison, was used to test statistical significance in QS between different optimisation methods. For visualisation, we present QS from optimisation tasks normalised to QS from a CP expert’s pipeline (normalised QS). One test image of cell segmentation was excluded due to the absence of any segmented cell by the CP expert.
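The rank-based comparison can be illustrated with a short, dependency-free Python sketch of the Friedman statistic and the normalisation step. The ratings are made-up numbers, the function names are hypothetical, normalisation is assumed to be a simple ratio to the expert's QS, and the statistic omits the tie correction that statistical packages typically apply. In practice a library routine such as `scipy.stats.friedmanchisquare` would be used to obtain the statistic together with its p-value.

```python
def average_ranks(values):
    """Rank values from 1 (smallest), giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks

def friedman_statistic(rows):
    """Friedman chi-square for rows of paired scores (no tie correction)."""
    n, k = len(rows), len(rows[0])
    rank_sums = [0.0] * k
    for row in rows:
        for col, rank in enumerate(average_ranks(row)):
            rank_sums[col] += rank
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)

def normalise(qs, expert_qs):
    """Express a QS relative to the expert pipeline's QS (assumed to be a ratio)."""
    return qs / expert_qs

# Hypothetical ratings; columns: conventional, automated, manual, composite.
ratings = [
    [4, 5, 8, 7],
    [3, 6, 9, 8],
    [5, 4, 7, 8],
    [2, 5, 8, 6],
]
print(round(friedman_statistic(ratings), 2), normalise(6.0, 8.0))
```

Each row is ranked independently, so the test compares methods within the same rater or image rather than across them, which is what makes it suitable for paired samples.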
Results
Here, we tested the performance of our interactive machine learning approach. We compared the quality of the segmentation, ease of use, and speed of optimisation between our approach and the conventional method of pipeline optimisation. First, we showed that our approach significantly enhanced segmentation QS over the conventional method (Fig 3). In particular, providing user-based feedback in object segmentation (through the use of ManualEvaluation) significantly improved cell segmentation QS compared to the conventional approach. In contrast to cell segmentation, the interactive machine learning approach (regardless of the evaluation module used) outperformed the conventional approach in segmenting focal adhesions. We noted that the use of ManualEvaluation (by itself or compositely with AutomatedEvaluation) was advantageous for object segmentation. Indeed, despite having different characteristics, both cells and focal adhesions were accurately segmented when using ManualEvaluation. Presenting visual evidence (Fig 4) allows users to evaluate the conformity of outlines to the edges of target objects. This is a considerably simpler task than setting criteria that define optimal object segmentation, which may be unknown a priori, as required by AutomatedEvaluation.
Though it failed to show an advantage over the conventional method for cell segmentation, AutomatedEvaluation improved focal adhesion segmentation. Presumably, measurements in the ratio scale that easily define focal adhesions (e.g. ellipticity and solidity) were easier to intuit and exploit compared to measurements in the interval scale (e.g. cell area). Under certain circumstances or for users with some experience, AutomatedEvaluation presents advantages for pipeline optimisation.
Next, we assessed the ease of use of the interactive machine learning approach (Fig 5). When participants used AutomatedEvaluation, the number of users who found pipeline optimisation to be easy doubled. Feedback on ManualEvaluation was even more positive, as all participants considered pipeline optimisation to be easy when using this evaluation mode. Participants also overwhelmingly (15 out of 16 or 93.8%) elected to adopt our approach for future pipeline optimisation, indicating broad support for a semi-automated approach to cell profiling optimisation. The conventional method, aside from resulting in poor segmentation QS, was found easy to use for optimising pipeline configuration by only a minority (3 out of 16 or 18.8%) of users.
Finally, we demonstrated the efficiency of our approach over the conventional method for pipeline optimisation (Fig 5). Prior to user-based experiments, we tested our approach against a random selection of pipeline configuration (Fig S5). The random selection process approximated the conventional trial and error method. On average, our approach required fewer iterations to optimise nucleus, cell and focal adhesion segmentation compared to the conventional approach. User-based experiments supported these findings, where 10 out of 16 (62.5%) users required more than 20 minutes to sufficiently optimise a pipeline using the conventional method. A large number (13 out of 16 or 81.3%) of users found 20 minutes insufficient for pipeline optimisation using AutomatedEvaluation. Meanwhile, a majority of users (12 out of 16 or 75%), regardless of prior experience in cell profiling, found that 20 minutes was sufficient to optimise a pipeline using ManualEvaluation. We showed here that our method empowers robust and rapid cell profiling without compromising on ease of use, cognitive burden, or bias against novice users.
Discussion
Robust and reproducible image-based cell profiling depends on the optimal configuration of the image processing pipeline. The conventional method of optimising an image processing pipeline is effectively a trial and error process, and is thus time consuming, tedious and prohibitive to those with minimal experience in image analysis or biology. Here, we propose a semi-automated method that relies on minimal user intervention and machine learning to accelerate pipeline optimisation, and enhance the quality of cell profiling.
A key component in our proposed method is the iterative acquisition of the QS from the user. By obtaining a QS corresponding to a certain pipeline configuration, we were able to effectively incorporate learning into the process of pipeline optimisation. This was performed using a BO approach. Importantly, the BO algorithm is ideal for optimising broad parameter spaces such as in synthetic gene design [10], hyperparameter tuning [7] or crystal structure prediction [11]. Here, we also showed that the BO method optimised the broad combinatorial space for image processing parameters across multiple segmentation objectives. This is especially important for users with little to no experience in image analysis, where the BO method reduces default bias in pipeline optimisation.
The BO method is also an effective remedy to memory bias, which increases in propensity with longer and more complex pipelines. Because the conventional method relies on a user to remember outcomes corresponding to an image processing configuration, the process is highly susceptible to memory and cognitive biases. Not only do these biases severely narrow the setting space being tested, they also prevent users from obtaining the optimum processing pipeline that is crucial to accurate cell profiling. Diverting the user’s focus towards providing the QS is also an essential feature of our method that reduces cognitive load on users without compromising on the quality of pipeline outcomes.
Though BO was originally intended for completely autonomous optimisation [12], here we modified it to incorporate a human-in-the-loop [13, 14]. By relying on the user instead of absolute limits to determine QS, we have created a more generalised and flexible method to assess and optimise pipeline performance. Without predefined limits on quality (as is most apparent with the ManualEvaluation module), our method can optimise pipelines for segmentation of objects with complex geometric properties (e.g. mitochondria). We can even extend the pipeline optimisation process to tasks with undefined quality metrics (e.g. illumination and background correction [15] or curation of images for quality control [16]).
The flexibility of our method for pipeline optimisation also extends to the implemented modules, where users have control over: 1) the task; 2) the target QS; 3) modules and settings; 4) weighting between automatic and manual evaluation into a composite evaluation score; and 5) BO hyperparameters. The modularity of CP also permits multiple BO methods throughout a single pipeline to optimise various tasks. Complex tasks such as focal adhesion segmentation undoubtedly benefit from this scenario, where there are interdependencies between segmented objects.
The rapidity with which we collect data calls for fully or semi-automatic methods of cell profiling that are adaptable to different experimental designs, biological systems and imaging modalities. Many are developing machine and deep learning methods to eliminate human intervention in the data analysis process. However, it is difficult and often counterproductive to eliminate the user, who has the expertise to validate, configure and fine-tune parameters and to label data under novel conditions. Here, we show that integrating the user with machine learning improves both the automation and the quality of analysis. Our interactive machine learning approach presents a new paradigm wherein human decision-making and oversight are required for robust scientific discovery.
Supporting information
Supporting Information 1 File. Participant information sheet, raw image results and survey data from user experiments.
Supporting Information 2 File. Pipelines and raw images for the various tasks carried out in user experiments.
Supporting methods
Image acquisition
Cell culture
MC3T3 cells (passages 10-12, ATCC) were cultured using standard cell culture practice. Cells were grown in growth media comprising α-MEM with nucleosides and L-glutamine without ascorbic acid, supplemented with 10% FBS and 1% penicillin-streptomycin [17]. Cells were seeded at a density of 4000 cells/cm² on injection moulded and surface-texturised polycarbonate substrates [3, 18].
Immunofluorescence staining
MC3T3 cells were cultured for 2 days before fixation using 4% paraformaldehyde. Cells were then stained with DAPI and AlexaFluor-conjugated phalloidin (ThermoFisher, 1:200) to detect the nucleus and the actin cytoskeleton, respectively. On the same cells, focal adhesions were visualised using an anti-talin1 antibody (Abcam 71333, 1:200) and an appropriate AlexaFluor-conjugated secondary antibody (ThermoFisher, 1:500). Cells were then mounted on 0.17 µm thick glass coverslips before imaging.
Fluorescence microscopy
Images of fluorescently stained cells were obtained using an EVOS FL2 Auto system (ThermoFisher) with 40X magnification (numerical aperture = 1.3). Image sets of the nucleus, the cell (visualised using the actin cytoskeleton) and focal adhesions were used to test the newly developed CellProfiler modules.
Participant recruitment
Participants were informed that the purpose of the study was to evaluate the performance of our CP modules for object segmentation. Participants provided consent to participate by signing the informed consent forms. The participant information sheet and consent form are provided in Supporting Information. The participants were required to have a basic understanding of computer-aided image analysis. Participants were offered minor monetary compensation (10 GBP), and the possibility to win a larger amount (50 GBP) in a raffle. A total of 16 participants were recruited, with varying levels of experience in image analysis.
Supporting figures
Supporting tables
Acknowledgments
We thank the team at the Broad Institute for creating the open-source, image-based cell profiling platform CellProfiler.