Abstract
In this paper, a new biological modeling approach is proposed for predicting complex heterogeneous subcellular behaviors. Cell protrusion, which initiates cell migration, exhibits substantial subcellular heterogeneity at micrometer length scales and minute time scales. It is driven by actin polymerization, which pushes the plasma membrane forward, and is regulated by a multitude of actin regulators. While mathematical modeling is central to a system-level understanding of cell protrusion, most models are based on the ensemble average of actin regulator dynamics at the cellular or population level, which prevents them from capturing heterogeneous cellular activities. With this in mind, a systematic modeling framework is proposed for predicting the velocities of heterogeneous protrusion of migrating cells driven by multiple molecular mechanisms. The framework is developed by integrating multiple AutoRegressive eXogenous (ARX) models that employ probability density input variables. Unlike conventional ARX models, it provides an effective framework for modeling heterogeneous subcellular behaviors with the complex nonlinearities and uncertainties of dynamic systems. To train and validate the proposed model, numerous subcellular time series are extracted from time-lapse movies of migrating PtK1 cells acquired with a spinning disk confocal microscope: the current edge velocities and the fluorescence intensities of mDia1 and actin at the leading edge are used as inputs, while the future cell edge velocities are selected as the output. It is demonstrated that the proposed approach is highly effective in predicting the future trends of heterogeneous cell protrusion. In particular, by capturing the multiple activities present in the dataset, the approach is expected to improve the understanding of the molecular mechanisms underlying cellular and subcellular heterogeneity.
1. Introduction
Over the past decades, mathematical modeling has been proposed as a means of better understanding complex cellular systems (DiMilla et al. 1991). However, it is still very challenging to quantitatively predict system-level cellular behaviors because the relationships between the multiple inputs and multiple outputs are inherently nonlinear and time-varying (Karr et al. 2012). Moreover, cellular activities are often highly heterogeneous, meaning that cells exhibit multiple phenotypes in space and time and change their phenotypes depending on their environment (Karr et al. 2012; Wang et al. 2018). Therefore, building precise mathematical models of heterogeneous cellular behaviors, which can be driven by several quantitatively or qualitatively distinct underlying mechanisms, is very difficult for the following reasons: (1) there are a number of uncertainties in measuring and processing input signals of subcellular activities; (2) it is challenging to derive a mathematical model of highly heterogeneous nonlinear dynamic systems that include multiple modes of molecular mechanisms; and (3) the acquired datasets are generally incomplete and incoherent.
Time series models such as autoregressive (AR) models have attracted great attention in a variety of engineering fields. This approach has been successfully applied to areas such as weather forecasting and financial markets as well as cell biology (Jaqaman et al. 2006), because its data-driven nature allows researchers to model complex system behaviors without comprehensive information about the system (Chon et al. 2001; Kim et al. 2016). However, traditional AR models do not address complex nonlinear dynamic problems with numerous uncertainties, such as multiple cellular or subcellular activities, and therefore have a limited ability to model heterogeneous cellular activities. With this in mind, a discrete fuzzy modeling (DFM) framework is proposed to predict heterogeneous cell motions from the recruitment of actin assembly factors involved in the protrusion of the cell membrane. It predicts the future subcellular protrusion velocity from the current protrusion velocity together with uncertain inputs such as mDia1 and actin fluorescence intensities. The framework is created through the integration of multiple autoregressive exogenous input (ARX) models, fuzzy logic theory, data clustering schemes, and weighted least squares estimators, as shown in Figure 1, so that (1) the uncertainties of input variables are incorporated into the proposed model; (2) the heterogeneity of complex molecular interactions is considered in the proposed modeling framework; and (3) a well-established pre-processing procedure is used to address the incomplete and incoherent measurements.
2. Methods
2.1 Model architecture
The DFM model proposed in this paper is presented in Eq. (1), where Rj is the jth rule and its consequent part is the jth linear dynamic model. Each local dynamic system is described by a set of if-then rules, for example, “if the mDia1 intensity is large, the velocity increases” or “if the actin intensity is small, the velocity decreases.” Here, ui is the ith premise variable, pi,j is the associated parameter, and k is the integer time index. The output y(k), the input u(k), and the associated parameter vectors ai and bi are presented in Eq. (2) to Eq. (5), where m1 is the number of delay steps in the output; m2 is the number of delay steps in the input; p and q are the numbers of output and input variables, respectively; and ai and bi are the coefficient matrices to be estimated. In this study, p = 1 because the velocity is the only output signal. The multiple linear models at the specific operating points ui are integrated through a defuzzification approach (Kim et al. 2011; 2013; 2015; 2016). Therefore, Eq. (1) can be expressed in defuzzified form, where 0 ≤ αj ≤ 1 is the normalized value of the jth rule as expressed in Eq. (9), μi,j(ui) is the membership function (MF) of ui, and N is the number of local dynamic models. MFs are useful for handling complex nonlinear systems with uncertain parameters, and fuzzy sets are constructed from the MFs. For example, if the level of mDia1 is categorized into three stages (low, medium, and high), a fuzzy set can be constructed. The MF for each antecedent variable in the DFM model should be carefully determined; in particular, the mean values of the probabilistic MFs need to be optimized. In this paper, a clustering algorithm is used for this purpose.
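The rule-based structure described above can be illustrated with a minimal Python sketch that blends N local ARX models using normalized membership weights. The Gaussian MF shape, the product used to combine membership grades, the weighted-average defuzzification, and every name and number below (`rule_weights`, `dfm_predict`, `centers`, `sigmas`, `A`, `B`) are illustrative assumptions rather than the exact formulation of Eq. (1) to Eq. (9).

```python
import numpy as np

def rule_weights(u, centers, sigmas):
    """Normalized rule weights alpha_j for one premise vector u.

    u: (q,) premise variables; centers, sigmas: (N, q) Gaussian MF parameters
    (assumed MF shape).
    """
    grades = np.exp(-0.5 * ((u - centers) / sigmas) ** 2)  # (N, q) membership grades
    firing = grades.prod(axis=1)                            # combine grades per rule
    return firing / firing.sum()                            # 0 <= alpha_j <= 1, sum is 1

def dfm_predict(y_past, u_past, A, B, centers, sigmas):
    """One-step prediction as a weighted average of N local ARX models.

    y_past: (m1,) recent outputs, newest first
    u_past: (m2, q) recent inputs, newest first
    A: (N, m1) output coefficients; B: (N, m2, q) input coefficients
    """
    alpha = rule_weights(u_past[0], centers, sigmas)
    local = A @ y_past + np.einsum("nmq,mq->n", B, u_past)  # output of each local model
    return float(alpha @ local), alpha

# Hypothetical numbers: N = 2 rules, q = 2 inputs, m1 = m2 = 1.
centers = np.array([[0.2, 0.3], [0.8, 0.7]])
sigmas = np.full((2, 2), 0.25)
A = np.array([[0.9], [0.5]])
B = np.array([[[0.1, 0.0]], [[0.3, 0.2]]])
y_hat, alpha = dfm_predict(np.array([1.0]), np.array([[0.2, 0.3]]),
                           A, B, centers, sigmas)
```

Because the weights are normalized, the blended prediction always lies between the smallest and largest local model outputs, which is what lets a set of simple linear models follow a nonlinear response.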
2.2 Clustering algorithm
A cluster center is selected based on the highest density measure, where σi is the ith data point, nd is the total number of data points, and Ra is the range of the data neighborhood. Subsequently, the density measures of the selected cluster center and its neighboring data points are reduced using the subtraction procedure, where Dci is the ith density measure, σci is the ith cluster center, and Rb = ηs Ra is used to avoid closely spaced centers, with ηs a positive constant greater than 1. After subtraction, the next cluster center is selected based on three criteria: (1) the acceptance ratio, (2) the rejection ratio, and (3) a relative distance criterion. This procedure is repeated until a sufficient number of cluster centers are found in the input space. The parameter settings for the subtractive clustering algorithm are as follows: ηs = 1.25, Ra = 0.35, an acceptance ratio of 0.5, and a rejection ratio of 0.15. It is noted that many different clustering algorithms are available to estimate the antecedent parameters of the DFM (Babuska 1998; Lee et al. 2018). In this paper, the subtractive clustering algorithm is used to construct the antecedent parameters. For example, the cluster center information is used as the center value of Gaussian or triangular MFs, as shown in Figure 2. Once the premise part of the DFM model is determined by the clustering algorithm, the weighted least squares algorithm is used to search for optimum solutions of the consequent part parameters of the DFM model.
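The clustering procedure described above can be sketched in Python as follows. This is a minimal illustration that assumes the commonly used exponential density measure and subtraction step for subtractive clustering, with data rescaled to the unit hypercube; the function name `subtractive_clustering` and the synthetic two-blob data are hypothetical, and only the parameter values (Ra = 0.35, ηs = 1.25, acceptance ratio 0.5, rejection ratio 0.15) are taken from the text.

```python
import numpy as np

def subtractive_clustering(X, ra=0.35, eta_s=1.25, accept=0.5, reject=0.15):
    """Return cluster centers (rows of X) found by subtractive clustering."""
    X = np.asarray(X, dtype=float)
    # Rescale each dimension to [0, 1] so that ra is a fraction of the data range.
    lo, hi = X.min(axis=0), X.max(axis=0)
    Z = (X - lo) / np.where(hi > lo, hi - lo, 1.0)
    rb = eta_s * ra                                    # wider radius used for subtraction
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    density = np.exp(-4.0 * d2 / ra**2).sum(axis=1)    # assumed density measure
    first_peak = density.max()
    centers = []
    while True:
        c = int(np.argmax(density))
        peak = density[c]
        if peak > accept * first_peak:                 # clearly dense enough: accept
            ok = True
        elif peak < reject * first_peak:               # too sparse: stop searching
            break
        else:                                          # in between: relative distance test
            d_min = min(np.linalg.norm(Z[c] - Z[i]) for i in centers)
            ok = d_min / ra + peak / first_peak >= 1.0
        if ok:
            centers.append(c)
            # Reduce the density around the new center to avoid closely spaced centers.
            density = density - peak * np.exp(-4.0 * d2[c] / rb**2)
        else:
            density[c] = 0.0                           # discard this candidate only
    return X[centers]

# Synthetic example: two compact 5 x 5 grids of points around (0, 0) and (1, 1).
g = np.linspace(-0.05, 0.05, 5)
blob = np.array([[a, b] for a in g for b in g])
X = np.vstack([blob, blob + 1.0])
centers = subtractive_clustering(X)
```

The returned centers can then serve as the center values of the Gaussian or triangular MFs in the premise part.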
2.3 Weighted Least Squares
The consequent part parameters of the DFM model are determined using weighted linear least squares, where y(k) is the measured data and wj is the weighting factor.
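A weighted linear least squares estimate can be sketched in Python as below. The sketch assumes the standard cost, the weighted sum of squared residuals, and solves it by scaling each row with the square root of its weight; the function name `weighted_least_squares`, the regressor matrix `Phi`, and the toy data are hypothetical. In a fuzzy model, one such fit would typically be run per rule, with the rule's normalized membership values as weights, which is an assumption about how wj is used here.

```python
import numpy as np

def weighted_least_squares(Phi, y, w):
    """Minimize sum_k w[k] * (y[k] - Phi[k] @ theta)**2 over theta."""
    sw = np.sqrt(np.asarray(w, dtype=float))
    # Scaling rows by sqrt(w) turns the weighted problem into an ordinary one.
    theta, *_ = np.linalg.lstsq(Phi * sw[:, None], y * sw, rcond=None)
    return theta

# Toy data: y = 2*x - 1 with one corrupted sample that receives zero weight.
x = np.arange(6.0)
Phi = np.column_stack([x, np.ones_like(x)])
y = 2.0 * x - 1.0
y[5] = 100.0
w = np.array([1.0, 1.0, 1.0, 1.0, 1.0, 0.0])
theta = weighted_least_squares(Phi, y, w)
```

Samples with a small weight barely influence the fit, so each local model is shaped mainly by the data points that belong to its own operating region.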
2.4 Proposed DFM algorithm
The DFM algorithm proposed in this paper is as follows. The flowchart for the proposed algorithm is depicted in Figure 1.
Step 1: Images of the dynamic cell motions are collected, and signals such as the mDia1 and actin intensities and the edge velocities are extracted from the collected images.
Step 2: Correlation coefficients between each pair of input and output signals are calculated, and the signals with high coefficient values are retained as input signals. To further reduce the number of input signals, partial correlation coefficients are calculated to determine which signals can be removed from the datasets.
Step 3: The clustering algorithm is used to construct the premise part of the DFM model.
Step 4: Once the antecedent part is determined using the clustering algorithm, the consequent parameters are optimized using the weighted least squares algorithm.
Step 5: The performance of the DFM model is evaluated using several indices, including the percent error in peak, bias, mean square error, root mean square error, and coefficient of determination. If the modeling performance is not satisfactory (i.e., the errors are larger than the allowable error limits), the modeling process returns to Step 3, where, for example, the number of membership functions can be adjusted. Step 3 to Step 5 are repeated until the errors converge to the desired values. The target errors are determined by the user.
Step 6: When the model errors fall within the specified error bounds, the model is tested using other datasets that were not used in the training process. In this paper, the error bounds are assessed qualitatively by visual inspection of the time series as well as quantitatively by evaluation indices such as R². If the prediction is not satisfactory, the procedure returns to Step 3; otherwise, the algorithm stops.
Note that trial and error is required for Step 3 to Step 6. When it is difficult to develop an effective model from Step 3 to Step 6, it is recommended to return to Step 2, since a different combination of input-output signals can sometimes improve the modeling performance. It should also be noted that the computational cost of calculating the output increases significantly as the number of input variables grows, making a high-dimensional DFM model infeasible. Including a large number of input variables in a prediction model built for a restricted purpose is often counterproductive.
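The input screening in Step 2 can be sketched as follows. This Python example assumes Pearson correlation and a partial correlation computed by regressing out a single control signal; the function names `pearson` and `partial_corr` and the synthetic signals are hypothetical and are not the authors' implementation.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two 1-D signals."""
    return float(np.corrcoef(x, y)[0, 1])

def partial_corr(x, y, z):
    """Correlation between x and y after removing the linear effect of z."""
    Z = np.column_stack([np.ones_like(z), z])
    res_x = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]  # part of x not explained by z
    res_y = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]  # part of y not explained by z
    return pearson(res_x, res_y)

# Synthetic example: x and y are correlated only because both follow z.
rng = np.random.default_rng(0)
z = rng.normal(size=2000)
x = z + 0.3 * rng.normal(size=2000)
y = z + 0.3 * rng.normal(size=2000)
r_xy = pearson(x, y)
r_xy_given_z = partial_corr(x, y, z)
```

A candidate input whose correlation with the output largely vanishes once another input is controlled for adds little independent information and is a natural candidate for removal.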
2.5 Experimental data
Sample videos for the analysis were prepared by taking time-lapse movies of PtK1 cells expressing fluorescently tagged mDia1 and actin with a spinning disk confocal microscope for approximately 200 frames at 5 sec/frame, as shown in Figure 3. After segmenting the leading edge of each cell into multiple probing windows with an area of 1 μm², time series of velocities and fluorescence intensities of the tagged mDia1 and actin acquired from each probing window were quantified (Machacek et al. 2009; Lee et al. 2015). These time series datasets are used for the DFM.
The dataset used in this study contains a significant amount of heterogeneity. First, we recently identified six different protrusion phenotypes by deconvolving the subcellular protrusion heterogeneity using unsupervised learning (Wang et al. 2018). Moreover, the leading edge of the cells undergoes protrusion and retraction cycles. Protrusion and retraction are distinct processes driven by different molecular mechanisms: for example, protrusion is driven by actin assembly processes whereas retraction is driven by myosin contraction. In our DFM framework, model training was performed without knowing whether the time series were in the protrusion or retraction phase. Second, there are at least five different protrusion phenotypes based on our previous clustering analyses of the protrusion velocities (Wang et al. 2018).
3. Results
The inputs of the DFM are the fluorescence intensities of mDia1 and actin and the current velocities, while the future velocities are used as the output signal. To further evaluate the effectiveness of the trained DFM model, a variety of validation datasets that were not used in the training process are applied to the trained model. In this study, both qualitative (i.e., visual inspection) and quantitative analysis methods are used to evaluate the performance of the proposed modeling framework: the modeling errors are first presented visually and then quantified using several indices.
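The input-output arrangement described above can be sketched in Python as a regressor matrix built from one probing window. The one-step-ahead target, the lag layout, and the names `build_arx_dataset`, `v`, and `u` are assumptions made for illustration; the toy arrays are not measured data.

```python
import numpy as np

def build_arx_dataset(v, u, m1=1, m2=1):
    """Arrange one window's time series into ARX regressors and targets.

    v: (T,) edge velocity; u: (T, q) inputs, e.g. mDia1 and actin intensities.
    The row for time k holds v(k), ..., v(k-m1+1), then u(k), ..., u(k-m2+1);
    the target is the future velocity v(k+1).
    """
    v = np.asarray(v, dtype=float)
    u = np.asarray(u, dtype=float)
    m = max(m1, m2)
    rows, targets = [], []
    for k in range(m - 1, len(v) - 1):
        past_v = v[k - m1 + 1 : k + 1][::-1]          # newest velocity first
        past_u = u[k - m2 + 1 : k + 1][::-1].ravel()  # newest inputs first
        rows.append(np.concatenate([past_v, past_u]))
        targets.append(v[k + 1])
    return np.array(rows), np.array(targets)

# Toy series with T = 5 frames and q = 2 inputs (hypothetical values).
v = np.arange(5.0)
u = np.column_stack([10.0 * np.arange(5), 100.0 * np.arange(5)])
Phi, target = build_arx_dataset(v, u)
Phi2, target2 = build_arx_dataset(v, u, m1=2, m2=1)
```

Each row of `Phi` pairs the current velocity and intensities with the velocity one frame later, so the same matrix can feed both the clustering of the premise variables and the least squares fit of the consequent parameters.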
3.1 Qualitative analysis of numerical model
The performance of the DFM models can be judged first by visual inspection (i.e., viewing patterns in the data). Visual inspection makes it easy to detect under-modeled or non-modeled patterns and to capture the overall behavior of the model without conducting an extensive quantitative analysis. In many problems, simple visual inspection of models is sufficient (Bennett et al. 2013; Kim et al. 2015). In this study, time series prediction, residual, quantile-quantile (QQ), and normal probability density function plots are used. Figure 4 (a) and (b) compare the predicted velocities (Model) with the measured training and validation datasets, respectively. DATAt represents the measured data used for model training, while DATAv is the validation data. The solid red line is the model, while the dotted black lines are the datasets. As shown in the figure, the predicted values agree closely with the measurements. Figure 4 (c) and (d) show the residual error plots for the trained model and its validation results. The residual errors of both the training and validation models appear random, which suggests that there are no systematic errors in the models. For instance, no high density of positive or negative values is found in the plots, which indicates that the models do not tend to over- or under-estimate the measured values (Bennett et al. 2013). Figure 4 (e) presents the QQ plot of the trained model and the measured data, while Figure 4 (f) is the QQ plot of the proposed model and the validation data. If the model and data come from the same distribution, the QQ plot will be linear. In Figure 4 (e) and (f), all the QQ plots are close to linear, which indicates that the models and datasets come from the same distribution. These QQ plots correspond to the normal distribution functions (NDF) in Figure 4 (g) and (h). As shown in the figure, the model NDF agrees well with the measured data for both training and validation.
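The QQ comparison can also be summarized numerically, as in the following minimal Python sketch. It assumes that matched empirical quantiles of the model output and the data are compared; the function names `qq_points` and `qq_linearity`, the quantile range, and the synthetic series are hypothetical.

```python
import numpy as np

def qq_points(model, data, n_q=99):
    """Matched empirical quantiles of the data and the model output."""
    probs = np.linspace(0.01, 0.99, n_q)
    return np.quantile(data, probs), np.quantile(model, probs)

def qq_linearity(model, data):
    """Correlation between matched quantiles; values near 1 mean a near-linear QQ plot."""
    q_data, q_model = qq_points(model, data)
    return float(np.corrcoef(q_data, q_model)[0, 1])

# Synthetic check: a scaled and shifted copy of the data gives a straight QQ line.
rng = np.random.default_rng(1)
data = rng.normal(size=1000)
model = 2.0 * data + 1.0
linearity = qq_linearity(model, data)
```

A straight QQ line only shows that the two distributions agree up to location and scale; points that also fall on the identity line indicate matching distributions.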
The quantitative error analysis is presented in the next section.
3.2 Quantitative analysis of numerical model
To quantify the modeling error, several evaluation indices were used. The simulation results are shown in Table 1, which indicates that the proposed DFM model is effective in predicting the complex behavior of cell motion fluctuations. The trained model demonstrates good performance according to all the indices. The peak error (J1) of the trained model in forecasting cell motions is smaller than those of the validation models. The J1 metric in the validation process is negative because the DFM model slightly overestimates the overall data values by 0.17%. The validation error is slightly higher than the training error, as measured by J1.
This is because the highest peak error in the validated model is higher than the one in the training time history data. As seen in J2, the trained model slightly overestimates some of the validation data. However, positive and negative errors can cancel in J2 and yield a value close to zero, so indices J3 and J4 are used to account for this issue. As seen in J3, the trained models perform better than the validation models. The RMSE (J4) provides a normalized metric, with values between 0.14 and 0.64 across all models. The coefficient of determination (J5) of the proposed DFM models is 97% for the training data, indicating strong agreement, and ranges from 70% to 73% for the validation data.
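The five indices can be computed as in the following Python sketch. The exact definitions of J1 to J5 are not stated here, so the formulas below are assumptions based on common usage: J1 as the percent difference between the measured and predicted peak magnitudes (negative when the model overestimates), J2 as the mean of measured minus predicted values, J3 as the mean square error, J4 as its square root, and J5 as the coefficient of determination. The function name and the toy numbers are hypothetical.

```python
import numpy as np

def evaluation_indices(y, y_hat):
    """Assumed definitions of the indices J1 to J5 for measured y and predicted y_hat."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    err = y - y_hat
    peak_y, peak_hat = np.max(np.abs(y)), np.max(np.abs(y_hat))
    j3 = float(np.mean(err**2))
    return {
        "J1": float(100.0 * (peak_y - peak_hat) / peak_y),  # percent error in peak
        "J2": float(err.mean()),                            # bias (signs can cancel)
        "J3": j3,                                           # mean square error
        "J4": float(np.sqrt(j3)),                           # root mean square error
        "J5": float(1.0 - np.sum(err**2) / np.sum((y - y.mean())**2)),  # R^2
    }

# Toy example in which the model overestimates two of four samples.
J = evaluation_indices([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 3.0, 4.4])
```

Under these sign conventions an overestimating model produces negative J1 and J2, while J3 and J4 remain positive because squared errors cannot cancel.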
4. Discussion
In this paper, a time series model is proposed for predicting the heterogeneous subcellular protrusive motion of migrating cells. The prediction model is developed through the integration of multiple autoregressive models, fuzzy logic membership functions, data clustering algorithms, and weighted least squares algorithms. The discrete-type fuzzy model (DFM) was trained using the actin and mDia1 intensities and the current velocities as input signals and the future velocities as the output. The trained model was validated using different datasets collected from 9 different locations in the same cell. The experimental and numerical testing demonstrates that the proposed DFM is very effective in predicting the complex nonlinear behavior of cellular systems.
The model was observed to be effective in capturing subcellular protrusion heterogeneity when mDia1, actin, and edge velocity are used as training datasets. It is believed that mDia1 initiates the protrusion by nucleating actin filaments to which other actin nucleators and the Arp2/3 complex can bind (Isogai et al. 2015; Lee et al. 2015). Since mDia1 is not a major driver of actin polymerization for cell protrusion, it can be inferred that mDia1 may be an important player in generating the protrusion heterogeneity.
The proposed modeling framework is expected to improve the mechanistic understanding of heterogeneous cellular and subcellular behaviors extracted from live cell imaging data, and it can be applied to other forms of cellular and subcellular heterogeneity in cellular processes such as cell migration, cell division, cytoskeletal structures, and membrane-bound organelles.
Author Contributions
Y. K. and K. L. conceived and initiated the project. Y. K. applied the DFM algorithm to the subcellular datasets. H. C. designed and conducted the experiments and validated the results. Y. K. and K. L. coordinated the study and wrote the final version of the manuscript and supplement. All authors discussed the results of the study.
Competing Financial Interests
The authors declare no competing financial or non-financial interests.
Acknowledgments
This work was supported by the NIH (Grant Number: GM122012), CBU, and WPI.