Abstract
Single molecule time trajectories of biomolecules provide glimpses into complex folding landscapes that are difficult to visualize using conventional ensemble measurements. Recent experiments and theoretical analyses have highlighted dynamic disorder in certain classes of biomolecules, whose dynamic pattern of conformational transitions is modulated by the slower transition dynamics of internal states hidden in a low-dimensional projection. A systematic means to analyze such data is, however, currently not well developed. Here we report a new algorithm – Variational Bayes-double chain Markov model (VB-DCMM) – to analyze single molecule time trajectories that display dynamic disorder. The proposed analysis employing VB-DCMM allows us to detect the presence of dynamic disorder, if any, in each trajectory, to identify the number of internal states, and to estimate the transition rates between the internal states as well as the rates of conformational transition within each internal state. Applying the VB-DCMM algorithm to single molecule FRET data of H-DNA in 100 mM Na+ solution, followed by data clustering, we show that at least 6 kinetic paths linking 4 distinct internal states are required to correctly interpret the duplex-triplex transitions of H-DNA.
Author Summary
We have developed a new algorithm to better decode single molecule data with dynamic disorder. Our new algorithm, which represents a substantial improvement over other methodologies, can detect the presence of dynamic disorder in each trajectory and quantify the kinetic characteristics of the underlying energy landscape. As a model system, we applied our algorithm to single molecule FRET time traces of H-DNA. While duplex-triplex transitions of H-DNA are conventionally interpreted in terms of two-state kinetics, slowly varying dynamic patterns corresponding to hidden internal states can also be identified from the individual time traces. Our algorithm reveals that at least 4 distinct internal states are required to correctly interpret the data.
Introduction
Recent technological advances in single molecule experiments on biomolecules have provided an unprecedented chance to investigate the dynamics of proteins and nucleic acids at the single molecule (SM) level, which has previously been elusive in conventional experiments [1–7]. Folding/unfolding pathways gleaned from individual SM trajectories indicate rugged folding landscapes inherent to biomolecules [4, 8, 9]. Long time trajectories from SM measurements, which can now be extended to more than hundreds of seconds, allow us to address how a rugged conformational landscape is sampled over time [7, 10, 11]. One of the striking findings from these measurements is that even under the same folding condition, the conformational dynamics of individual molecules differ substantially from one another while the molecules still maintain their biological functions. Cofactor-induced conformational transitions of T. ribozymes [12], Holliday junctions [13], TPP-riboswitch [14], and preQ1-riboswitch [15] are recent seminal examples that exhibit molecular heterogeneity at equilibrium. The variation in the velocities of individual RecBCD helicase motors along dsDNA [16] is a good example of molecular heterogeneity out of equilibrium, driven by ATP hydrolysis. Together with other reports [17–28], these could be merely a subset of more widespread, yet unrecognized cases that exhibit dynamical heterogeneity in SM time traces.
The chance of conformational frustration increases with the system size (Nsys). For a given Nsys, the time for conformational sampling (τsample) is expected to scale as τsample ∼ eNsys [29]. Suppose that Tobs, which is in practice limited by several factors [30–32], is long enough to observe many (more than hundreds of) transitions along a trace generated from an SM measurement. Two distinct scenarios arise depending on the length of τsample relative to Tobs. (i) If the sampling time is shorter than Tobs (τsample ≪ Tobs), then the conformational space of the biomolecule is fully sampled. In this case, the ergodicity of the system is ensured, such that for any molecule α (or time trace α) the time average of an observable, O̅α ≡ (1/Tobs)∫₀^Tobs Oα(t)dt, is equivalent to the ensemble average of Oα(t) over all α’s (1 ≤ α ≤ Nens) at any moment t, i.e., O̅α = ⟨O(t)⟩; thus the thermodynamic properties of the system can be read out by analyzing a single time trace. (ii) In contrast, if τsample ≫ Tobs is satisfied due to the ruggedness of a conformational space characterized by a number of deep local basins of attraction, then each time trace can sample only a local region of the conformational space. In this case, the dynamic pattern of each time trace would look different, and a change in the dynamic pattern from one time interval to another would be observed only occasionally.
To be more precise about the second scenario (τsample ≫ Tobs), suppose that the average time scale for each local basin of attraction to be “sampled” by the conformational dynamics of the molecule is τconf, and that the time for the molecule to make transitions between different superbasins of attraction is τint (Fig. 1). In principle, the relaxation rates and energy barrier heights of biomolecules span continuous spectra, so a clear time scale separation may not always be guaranteed. However, to be able to grasp the presence of dynamic disorder, if any, in SM time traces straightforwardly, a separation between two distinct time scales is required, such that τconf ≪ τint (or τconf/τint ≪ 1). If τconf and τint were comparable (or the spectra of relaxation rates were uniform and continuous), the algorithm we propose here, as well as others, could hardly be of any help to conceive a concrete landscape model such as the one illustrated in Fig. 1. Therefore, here we consider τconf and τint as two disparate time scales, as illustrated in Fig. 1. τconf is the time at which the time average of an observable reaches its steady state value when τ > τconf, corresponding to the time scale required to fully sample the local basin of attraction. Alternatively, τconf is limited by the kinetic barrier with the greatest height within the local basin of attraction, so that τconf grows exponentially with that barrier height. On the other hand, τint is the transition time, which is expected to scale exponentially with the height of the kinetic barriers between the two superbasins. When measurements are conducted with a finite duration of observation time (Tobs), we can conceive two entirely different dynamic patterns depending on the relationship between τconf, τint, and Tobs:
τconf ≪ Tobs ≪ τint: The interconversion time between distinct basins of attraction is far longer than the observation time. The dynamic patterns of individual trajectories that sample distinct basins of attraction are expected to differ from each other. Since Tobs ≪ τint, there is little chance to observe an exchange of dynamic pattern within a single time trace, which corresponds to a case with quenched disorder, in which each SM time trace looks entirely different. Such cases are reported in Holliday junction [13], T. ribozyme [12], and RecBCD [16].
τconf ≪ τint ≲ Tobs: The interconversion time between basins of attraction is shorter than or comparable to the observation time. In this case, it is possible to observe a few rounds (∼Tobs/τint) of pattern exchange in a single time trace. Such SM time traces are said to exhibit dynamic disorder [15, 28, 33–36].
While the most interesting and physically relevant question to ask about the heterogeneity in single molecule time traces concerns its molecular origin, detection and quantification of such heterogeneity should precede that question. SM time traces with quenched disorder are relatively straightforward to analyze, as one can use the criterion of ergodicity and partition the time traces into dynamic subensembles [13]. It is, however, more challenging to analyze time traces with dynamic disorder.
In the ion-channel community, ion currents across a single ion channel measured with the patch-clamp technique often display time series that switch between multiple dynamic patterns, a phenomenon called ‘mode-switching’ [37] or ‘modal gating’ [38]. An algorithm developed by the ion-channel community to analyze time series exhibiting dynamic disorder (the aggregated Markov model, AMM) is in principle of use, but when applied to our synthetic data, we found that it tends to overpredict the transitions between hidden states (see Fig. S25 and the related discussion below). Thus, here we have developed a more reliable and systematic algorithm – the Variational Bayes-Double Chain Markov Model (VB-DCMM) – which combines the variational Bayes method with the Double Chain Markov Model (DCMM) [39–43], to analyze SM time traces with dynamic disorder, in which the dynamic pattern of conformational transitions changes on a much longer time scale than the apparent conformational fluctuations because of the slower transitions of a hidden variable.
We first explain the VB-DCMM algorithm, and next apply it to synthetic data as a blind test to show that our method can accurately identify the hidden internal states and determine the kinetic rate constants associated with the data. The results from our analysis using VB-DCMM are reliable as long as a clear separation in time scales exists between the apparent conformational transition time (τconf) and the interconversion time (τint).
As a prototypical example of single molecule time traces with dynamic disorder, data from H-DNA [44, 45], which undergoes duplex-to-triplex conformational transitions (Fig. 2A), are analyzed. A kinetic pattern of two-state-like conformational transitions between the duplex (low FRET ∼ 0.1) and triplex (high FRET ∼ 0.9) forms of H-DNA observed in one time interval changes to another pattern in the next time interval (Fig. 2B). DCMM models this peculiar dynamic pattern of H-DNA in Fig. 2B by assuming slowly varying dynamics of a hidden internal state. Fig. 2C illustrates how the dynamic pattern of the original time trace of the observable state, on(t) (gray traces in Fig. 2C), changes with the internal state x(t) at a given time t. The dynamic pattern of on(t), displaying multiple transitions, is slaved to the slowly changing value of x(t). DCMM implements this idea into an algorithm and allows us to extract the information of x(t) from on(t). Finally, we apply VB-DCMM to an ensemble of H-DNA time traces obtained from smFRET experiments and show that the dynamics of H-DNA at [Na+]=100 mM should be modeled using at least 4 large basins of attraction.
Algorithm
Here, we provide a general overview of the VB-DCMM algorithm, defining terms and parameters. More technical details of derivation and implementation of the algorithm are given in the Supplementary Information.
Modeling time series with dynamic disorder
The Markov chain approach is ubiquitously used in modeling biological systems. For example, reversible conformational transitions of biomolecules probed by single molecule fluorescence resonance energy transfer (smFRET) or force spectroscopy are often modeled as a homogeneous Markov process in which the transition rates between experimentally discernible conformational states are uniquely determined. To decipher time series with dynamic disorder, which change their dynamic pattern from one time interval to another, we assume that there are hidden “internal states”, each of which determines the rates of conformational transitions. A signature of a transition between internal states, which gives rise to dynamic disorder in the time series, is difficult to detect from the value of the FRET efficiency or end-to-end distance alone when the values observed along the time series remain indiscernible even though the internal state is altered. By assuming that the transitions between internal states are described by a homogeneous Markov process, and that the transitions between observables (in this study, FRET states) follow a non-homogeneous Markov process whose transition rates are slaved to the internal state at each time, we model time trajectories made of these two layers of Markov chains. This algorithm corresponds to the Double Chain Markov Model (DCMM) [39–43] (Fig. 2C,D).
DCMM is characterized by the following model parameters: (i) The transition matrix A for the homogeneous Markov chain, which describes the transition probability between the K distinct internal states along the time series (x = (x(1), x(2), …, x(t), …, x(T – 1))). Here K is the total number of internal states in the model, and x(t), specifying the internal state at time t, takes one of the values between 1 and K. T is the total observation time. The internal state at time t + 1 (x(t + 1)) is determined by the previous internal state at time t (x(t)), whose transition to x(t + 1) is governed by a K × K Markov transition matrix A as P(x(t + 1) = ν) = Σµ Aµν P(x(t) = µ), where P(x(t) = ν) denotes the probability of x(t) being in the ν-th internal state; (ii) K transition matrices B(µ) with µ ∈ {1, 2, …, K} for the non-homogeneous Markov chain describe the transition probabilities between the observable states along the time series (o = (o(1), o(2), …, o(T))). o(t) specifies the state of the observable among N possible states {1, 2, …, N} at time t. The transition from o(t) to o(t + 1) is determined by an N × N transition matrix Bx(t)(t), the matrix elements of which are slaved to the value of x(t)(= µ ∈ {1, 2, …, K}).
For example, if there are two (K = 2) internal states, and each internal state has three (N = 3) observables in a given time trace recorded with time resolution Δt, then two transition matrices for o with µ = 1, 2 can be considered (i.e., B(1) and B(2)), each being a 3 × 3 matrix with elements B(µ)ij = P(o(t + 1) = j|o(t) = i, x(t) = µ). Next, the transition matrix A for the interconversion between the two internal states is a 2 × 2 matrix with elements Aµν = P(x(t + 1) = ν|x(t) = µ). In the above matrices, the matrix elements must satisfy Σj B(µ)ij = 1 for each i = 1, 2, 3 in B(µ), and Σν Aµν = 1 in A. More detailed descriptions of DCMM are available in the original papers [39–43], particularly in ref. [39] (see also SI). A similar but more general version of DCMM, which can accommodate input variables as well as multiple internal state sequences, has been suggested by extending the factorial hidden Markov model [46, 47].
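As a concrete illustration of this parameterization, the snippet below builds two observable transition matrices B(1), B(2) and an internal-state transition matrix A as row-stochastic arrays; all numerical values are hypothetical per-frame probabilities chosen only for illustration.

```python
import numpy as np

# Hypothetical per-frame transition probabilities (Δt = 1 frame); each row sums to one.
B = {1: np.array([[0.90, 0.05, 0.05],   # B(1): observable kinetics while x(t) = 1
                  [0.10, 0.85, 0.05],
                  [0.05, 0.05, 0.90]]),
     2: np.array([[0.70, 0.20, 0.10],   # B(2): faster observable kinetics while x(t) = 2
                  [0.25, 0.60, 0.15],
                  [0.10, 0.20, 0.70]])}

A = np.array([[0.999, 0.001],           # A: slow interconversion between the two internal states
              [0.002, 0.998]])

# Row-normalization constraints: sum_j B(mu)_ij = 1 and sum_nu A_munu = 1.
assert np.allclose(A.sum(axis=1), 1.0)
assert all(np.allclose(Bmu.sum(axis=1), 1.0) for Bmu in B.values())
```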
Determining the number of internal states
DCMM can estimate the transition matrices A and B(µ) quantitatively, and hence determine the most probable sequence of internal states and the associated kinetic rates, k(µ)a→b and γ(µ)→(ν). However, the likelihood (the probability of observing the data for given model parameters) maximized by DCMM, P(o|π, A, B), where π ≡ (π1, π2, …, πK) with πµ = P(x(1) = µ|o, A, B), is prone to increase when a larger number of parameters is used in the model. DCMM can select the best set of parameters for a given model, but it is not suited to selecting the best model (i.e., it cannot determine the optimal number of internal states K for a given time trace). To overcome this limitation, the maximum evidence method is often used, where the evidence (P(o|K), also called the marginal likelihood) is defined as the conditional probability of observing the data (o) for a given model (K), P(o|K) = ∫ P(o|λ, K)P(λ|K)dλ, where λ ≡ (π, A, B) represents the parameter space. In this method, the penalty against model complexity is naturally incorporated during the calculation, allowing one to select the best model (see SI). By calculating the evidence for each different model (different K, the number of internal states in the data), one can select the best model with an optimal number of internal states that maximizes the evidence. The calculation of the evidence, however, involves a massive computational cost to explore the entire parameter space of a given model.
Variational Bayes Double Chain Markov Model
To alleviate the computational cost of the maximum evidence method in Eq. 1, we employ Variational Bayes [48], a method that effectively uses a mean-field approximation. The method has previously been used to determine the number of observable states (FRET states) from smFRET data [49–51], the number of diffusive states from single molecule tracking data [52], and the number of DNA-protein conformations from tethered particle motion data [53]. It has also been used inside the empirical Bayes method, which can analyze several smFRET time series simultaneously [54, 55]. In our study, the variational Bayes method combined with DCMM (VB-DCMM) was used to analyze single molecule time traces with dynamic disorder. The analytical expression for the lower bound of the evidence (F), offered by VB-DCMM, makes clear where the model penalty comes from, thus providing guidelines for choosing the prior parameters to incorporate prior knowledge of the data (see SI). Once prior parameters are selected, VB-DCMM iteratively increases the lower bound of the log (evidence)(= log P(o|K)) by identifying a better approximation to the true probability distribution, log P(o|K) = F[q] + DKL(q∥p), where F[q] = ∫ q(Z) log [P(o, Z|K)/q(Z)]dZ and DKL(q∥p) = −∫ q(Z) log [P(Z|o, K)/q(Z)]dZ; here q(Z) is an arbitrary probability distribution of a set of variables Z(≡ (x, λ)) consisting of the parameters and hidden variables of the model, and DKL(q∥p) is the Kullback-Leibler divergence of q(Z) from P(Z|o, K), which we want to minimize. Once the solution from the algorithm converges, the approximate value of log P(o|K*)(≃ F[q*]) and the (locally) best model parameters (a set of the best kinetic rates), π*, A*, and B*, which determine all the rate constants describing the given time traces (k(µ)a→b and γ(µ)→(ν)), are acquired from the approximated probability distribution (see SI for the mathematical details). The performance of VB-DCMM is quite robust over a wide variation of prior parameters (Fig. S21, S22).
Implementation of the algorithm
The observable sequence o is obtained by filtering the noise in the experimental data (on) using a Hidden Markov Model (HMM), following a procedure similar to previous studies [49, 56], with a custom code written based on the code from the Sagemath software [57]. Next, o is analyzed using VB-DCMM to select the best model and to estimate the best model parameters. The optimal sequence of internal states x is determined by using the Viterbi algorithm [39]. All implementations and data analyses are done using our custom code. VB-DCMM is freely available at https://github.com/TBiophysG/VBDCMM
Results and Discussion
Validation of VB-DCMM
To first validate the efficacy of VB-DCMM in identifying internal states in a given SM time trace, we applied the VB-DCMM algorithm to synthetic data that mimic an SM time trajectory with dynamic disorder (see Methods). To generate a synthetic SM time trajectory, we first produce a time trajectory specifying the value of the internal state from t = 1 to t = T – 1. The time trajectory of the internal state is represented with the symbol x = (x(1), x(2), …, x(t), …, x(T – 1)). When the total number of distinct internal states in the model is K, one of the values in {1, 2, …, K} is assigned to x(t). Thus, for K = 2 a typical time trajectory of the internal state x looks like (1, 1, 1, …, 1, 1, 2, 2, 2, …, 2, 2, 1, 1, …, 1, 1), (1, 1, 1, …, 2, 2, …, 2, 2), (2, 2, …, 2, 1, 1, …, 1, 1, 2, …, 2, 2, 2), etc. The time trajectory given in Fig. 3A-(i) is an example generated with K = 2 and T = 8801. Next, similar to the structure of x, the time trajectory of noiseless observables is represented using o ≡ (o(1), o(2), …, o(t), …, o(T)). In Fig. 3A-(ii), a trajectory of o is shown, also demonstrating the influence of x on o. Finally, Gaussian noise was added to o and the range of the signal was adjusted to produce the final trajectory on = (on(1), on(2), …, on(T)), which now resembles a time trajectory of an SM FRET signal (Fig. 3A-(iii)).
Deciphering the information of internal states from an observed time trace involves solving an inverse problem, i.e., decoding on to obtain x. To decode the trace of internal states from the synthetic data, we follow a 3-step procedure: (1) Filter the noise from on to obtain o using a Hidden Markov Model (HMM) [56] (Fig. 3A-(iv), blue line); (2) Analyze o by applying the VB-DCMM algorithm with different models K = 1, 2, … (again, K is the total number of internal states assumed in each model); (3) To select the best model, calculate the conditional probability of observing the data for a given model parameter K, P(o|K), which is often called the evidence or marginal likelihood in the machine learning community (Eq. 2) [48]. The calculation of P(o|K) is conducted using the Variational Bayes (VB) method, which gives the lower bound of log P(o|K), denoted by F(K). Details of the evidence function F(K) and the approximation procedure are provided in the Supplementary Information (SI). Finally, we select the best model K* that maximizes F(K), i.e., K* = arg max F(K).
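A minimal driver for this 3-step procedure is sketched below. The callables `denoise` and `fit_vb_dcmm` are hypothetical placeholders standing in for the HMM noise-filtering step and the VB-DCMM fit (they are not function names from the released code); the selection rule K* = arg max F(K) follows step (3).

```python
def select_model(o_noisy, denoise, fit_vb_dcmm, K_max=3):
    """Run the 3-step decoding procedure.

    denoise(o_noisy)   -> o             (step 1: HMM noise filtering)
    fit_vb_dcmm(o, K)  -> (F_K, params) (step 2: VB-DCMM fit; F_K is the lower bound of log P(o|K))
    """
    o = denoise(o_noisy)
    results = {}
    for K in range(1, K_max + 1):
        F_K, params = fit_vb_dcmm(o, K)
        results[K] = (F_K, params)
    # step 3: pick the model that maximizes the (approximate) evidence
    K_star = max(results, key=lambda K: results[K][0])
    return K_star, results[K_star]
```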
To be specific, in order to identify the best model parameter K for the time trace o(t) given in Fig. 3A-(iv), we varied K from 1 to 3. The most probable trace of internal states, x̂, was calculated for each model with K = 1 (red), K = 2 (orange), and K = 3 (blue) (see Fig. 3A-(v)). The evidence F(K) calculated using the VB method was maximized at K = K* = 2, and the resulting time trace of the internal states, x̂, most closely recovers the trajectory of x (black trace in Fig. 3A-(v)), except in the time intervals where the transitions of x(t) between 1 and 2 occur only transiently or at the boundaries of transitions (red arrows in Fig. 3A-(v)). This result shows that VB-DCMM can avoid the over-fitting problem with which other methods based on maximum likelihood are often fraught [48].
Conditions required for an accurate recovery of internal states
VB-DCMM detects a signature of a change in the internal state (x) from a given observable time trace (o) by evaluating the statistical difference in transition rates. Thus, in the absence of a sufficient number of transitions in the trace o, the algorithm becomes less reliable. For example, we obtained F(K = 2) ≈ F(K = 3), although F(2) ≫ F(3) is more desirable (Fig. 3B; see another example in Fig. S1). This is due to the lack of statistics of transition events in the particular test trace given in Fig. 3A: when only parts of the time trace are selected and analyzed using HMM, the estimated rates of transition from the high (H) to the low (L) FRET state differ between the two time intervals. Thus, in the (K = 3)-model the two time intervals, originally generated using the same kinetic parameter kH→L, are determined to be distinct from each other (blue trace in Fig. 3A-(v)). By contrast, in the (K = 2)-model, a single value of kH→L was estimated over these two time intervals. This type of statistical error is unavoidable for a small Tobs. A more systematic evaluation of the accuracy of the algorithm as a function of Tobs and the transition rate between distinct internal states will be discussed in the next section.
To assess the accuracy of the best model predicted by VB-DCMM against the solution x, the following overlap function can be used: χ ≡ (1/T) Σt δx(t),x̂(t), where δi,j is the Kronecker delta and T = Tobs/Δt is the total number of data points in the traces (Δt denotes the temporal resolution of the data). For 100 synthetic time traces, generated under parameters identical to those used for producing the time trace in Fig. 3A, we found that χ ≈ 0.9 on average (Fig. 3C). Note, however, that x(t) is only available for synthetic data. Thus, to assess the accuracy of our method against a real time trace from SM experiments, we devised other metrics.
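The overlap χ is straightforward to compute once a decoded internal-state sequence is available; the sketch below assumes the decoded labels have already been matched to the true labels (e.g., by taking the best label permutation), since the labels returned by the algorithm are arbitrary.

```python
import numpy as np

def overlap(x_true, x_hat):
    """Overlap chi = (1/T) * sum_t delta(x_true(t), x_hat(t)):
    the fraction of time points at which the decoded internal state
    coincides with the true one (available for synthetic data only)."""
    x_true = np.asarray(x_true)
    x_hat = np.asarray(x_hat)
    return float(np.mean(x_true == x_hat))
```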
For a given time trace with dynamic disorder, our algorithm quantifies the kinetic features of the time trace in terms of the transition rate between the observable states a and b within the µ-th internal state, k(µ)a→b, and the transition rate from the µ-th to the ν-th internal state, γ(µ)→(ν), where 1 ≤ µ, ν ≤ K and 1 ≤ a, b ≤ N. Here, µ is the index for the internal state, whereas a and b are indices for the observables (in FRET data displaying low/high two-state transitions, these correspond to the low and high FRET values); K is the total number of hidden internal states, and N denotes the total number of observables. To be able to extract the information on multiple internal states reliably from a time trace using VB-DCMM, two general conditions are required for the time trace being analyzed:
1. The intra-basin kinetics of distinct internal states should be discernibly different, i.e., k(µ)a→b and k(ν)a→b (µ ≠ ν) should be disparate.
2. There should be a clear time scale separation between intra-basin and inter-basin transitions (i.e., τconf and τint). More precisely, the intra-basin transition probability k(µ)a→bΔt should be much greater than the transition probability from the µ-th to any other internal state, γ(µ)→(ν)Δt.
To substantiate the above-mentioned conditions 1 and 2, we define two metrics, Dconf and Dint, which compute average Hamming-like distances between the distinct rate constants extracted from a given time trace using the VB-DCMM analysis: Dconf ≡ ⟨|log2 (k(µ)a→b/k(ν)a→b)|⟩, averaged over all pairs of distinct internal states (µ ≠ ν) and all observable transitions (a ≠ b), and Dint ≡ ⟨log2 (k(µ)a→bΔt/Σν≠µ γ(µ)→(ν)Δt)⟩, averaged over all internal states µ and observable transitions (a ≠ b).
Dconf measures the dissimilarity between distinct internal states in terms of the intra-basin transition rates. Two distinct internal states (µ, ν (µ ≠ ν)) can be better discerned if the intra-basin transition rate of one internal state (say, k(µ)a→b) differs greatly from that of the other internal state (k(ν)a→b), so that |log2 (k(µ)a→b/k(ν)a→b)| is maximized. Dint measures the average number of intra-basin transitions in each internal state using the ratio between the transition probabilities k(µ)a→bΔt and γ(µ)→(ν)Δt. A greater Dint ensures a large time scale separation between intra-basin and inter-basin dynamics, which improves the reliability of our method in decoding the internal states from a given time trace. In general, Dint and Dconf show a good correlation with 〈χ〉 (see below). Since, unlike 〈χ〉, the metrics Dint and Dconf can be estimated for real data (〈χ〉 can be calculated only against synthetic data), one can evaluate (Dint, Dconf), as an alternative to χ, to assess the reliability of a predicted result x̂.
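A plausible implementation of these two metrics, consistent with the description above, is sketched below; the plain averaging over state pairs and observable transitions is an assumption about the normalization, made only for illustration.

```python
import numpy as np
from itertools import combinations

def d_conf(k):
    """k[mu][a][b]: intra-basin rate a->b in internal state mu (a != b).
    Average |log2| ratio of corresponding rates over all pairs of internal states."""
    K, N = len(k), len(k[0])
    vals = [abs(np.log2(k[mu][a][b] / k[nu][a][b]))
            for mu, nu in combinations(range(K), 2)
            for a in range(N) for b in range(N) if a != b]
    return float(np.mean(vals))

def d_int(k, gamma, dt):
    """gamma[mu][nu]: interconversion rate mu->nu. Average log2 ratio of the
    intra-basin transition probability (k*dt) to the total probability of
    leaving the internal state (sum over nu of gamma*dt)."""
    K, N = len(k), len(k[0])
    vals = []
    for mu in range(K):
        g_out = sum(gamma[mu][nu] for nu in range(K) if nu != mu)
        vals += [np.log2((k[mu][a][b] * dt) / (g_out * dt))
                 for a in range(N) for b in range(N) if a != b]
    return float(np.mean(vals))
```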
To be more concrete, we applied the VB-DCMM algorithm to analyze synthetic data generated with N = 2 (transitioning between high and low FRET values) and K = 2 (two internal states; µ = 1 and 2) under various scenarios.
We fixed the transition rates in the state µ = 1 and varied the rates associated with the state µ = 2 over a broad range (Fig. 4A, left), with the interconversion probability between the two internal states held fixed. The accuracy of the model prediction (〈χ〉, Eq. (3)) is on average greater than 0.9 as long as the transition rates of the two internal states differ by more than a factor of 4. Note that in Fig. 4A (left), the value of 〈χ〉 is greater when the rates in the state µ = 2 are faster than those in µ = 1 than in the opposite case; this is because a statistically sufficient number of transitions makes the detection of internal states more reliable. In contrast, when the rates of the two internal states are comparable, i.e., when the kinetics inside the two internal states are essentially identical, it is difficult to discern the two internal states. In this case, K = 1 instead of K = 2 is effectively the correct number of internal states. Indeed, when K = 1 is assumed (i.e., assuming the true internal state x(t) = 1 for all t in Eq. (3)), the re-calculated 〈χ〉 is close to 1 (see Fig. S2).
To explore the effect of interconversion between distinct internal states on the performance of the algorithm, we generated synthetic data with fixed intra-basin transition rates, this time varying γ(1)→(2)Δt and γ(2)→(1)Δt over 0.00025 ∼ 0.005 (Fig. 4B, left). The results clearly show that smaller γ(µ)→(ν) results in higher 〈χ〉, which is expected because each internal state can exhibit a greater number of transitions in the trace o when the interconversion is slower (Fig. 4B). Re-plotting 〈χ〉 as a function of Dconf and Dint reveals a clear dependence of the accuracy on Dint (Fig. 4B, right). Similar trends are observed for other combinations of intra-basin transition rates (Fig. S3).
Analyses of synthetic data generated using the same input parameters as those in Fig. 4, but with a different number of data points in each trace, Tobs/Δt = 4400 and 2200 (Fig. S4), show a trend similar to that observed in Fig. 4 with Tobs/Δt = 8800, but with slightly smaller 〈χ〉 values.
Extension of the VB-DCMM algorithm to more complicated cases with K > 2 (Fig. S5, Fig. S14) or N > 2 (Fig. S6, Fig. S14) is straightforward. Application of VB-DCMM to a trajectory in which each internal state has a different N is also straightforward (Fig. S7). In the latter case, the data are analyzed by assuming that all internal states have the same number of possible observables, N; the analysis would then indicate that transitions associated with a vanishingly small transition rate are essentially disallowed. In all situations considered for various K and N, VB-DCMM can be used for the reliable recovery of the sequence of true internal states.
Analyses of synthetic traces show that the accuracy of the algorithm improves with both Dconf and Dint (Fig. 4 right panels and Fig. 5). Thus, these two metrics allow one to judge the reliability of the information on internal states extracted from a given time trace. Alternatively, a single parameter Dtot(= Dconf + αDint, with an empirically acquired coefficient α ≈ 0.8) can be used to judge the reliability of the extracted information. Note that 〈χ〉 remains similar as long as Dtot remains constant (Fig. 5B). Hence, when 〈χ〉 is plotted against Dtot, all synthetic data generated using different parameters approximately collapse onto a single universal curve (Fig. S8).
There are multiple ways of assessing the efficacy of VB-DCMM in decoding the internal states. In addition to 〈χ〉, Dconf, Dint, and Dtot as possible measures for the assessment, one can also use the statistical properties that the dwell times of a homogeneous Markov process should satisfy (see SI for details).
Application of VB-DCMM on H-DNA data
Now, to analyze the duplex-triplex transitions of H-DNA (Fig. 6), we obtain o by filtering the noise from the FRET signal (Fig. 6-(ii), blue line) and apply the VB-DCMM algorithm to decode the hidden internal states in the signals. Fig. 6-(iii) shows the time series of internal states, x̂, calculated from VB-DCMM by varying K from 1 to 5. It is of note that the number of actually observed internal states in x̂ for a given input parameter K does not change beyond some Kobs(≤ K) (Kobs = 2 (Fig. 6A), 2 (Fig. 6B), 2 (Fig. 6C), and 1 (Fig. 6D)). (See also other time traces of synthetic data and H-DNA analyzed in the SI: Fig. S1A (Kobs = 1), Fig. S9A (Kobs = 3), Fig. S9B (Kobs = 2), Fig. S9C (Kobs = 3), Fig. S9D (Kobs = 3), Fig. S10A (Kobs = 4), Fig. S10B (Kobs = 2), Fig. S10C (Kobs = 2), Fig. S10D (Kobs = 3), and Fig. S12 (Kobs = 3)). A similar behavior is also observed when analyzing data using the variational Bayes Gaussian mixture model [48].
To account for the contribution due to the degeneracy in labeling the internal states, a log K! term is conventionally included in formulating the evidence function F(K) (see SI for details); however, in our problem, the actual number of degenerate labelings of internal states should be K!/(K – Kobs)! instead of K!. Therefore, we replace the log K! term in F(K) with log [K!/(K – Kobs)!] and consider a modified evidence function, G(K) ≡ F(K) – log K! + log [K!/(K – Kobs)!], to identify the optimal K for a given time trace. G(K) shows a clear peak, allowing us to identify the optimal K(= K*) with ease (blue circles on the right side of Fig. 6, S9, and S10). Use of G(K) instead of F(K) in analyzing synthetic data does not alter K* (Fig. S12, Fig. S14B).
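The correction amounts to a simple combinatorial adjustment of the log-evidence; a minimal sketch is given below, assuming F(K) has already been computed with the conventional log K! term included.

```python
from math import lgamma

def log_factorial(n):
    """log(n!) computed via the log-gamma function."""
    return lgamma(n + 1)

def modified_evidence(F_K, K, K_obs):
    """G(K) = F(K) - log K! + log[K!/(K - K_obs)!]:
    the label-degeneracy term log K! in F(K) is replaced by the degeneracy
    of labelings that actually appear in the decoded trace."""
    return F_K - log_factorial(K) + (log_factorial(K) - log_factorial(K - K_obs))
```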
Among the time traces of H-DNA, traces with more than 3 interconversions between distinct internal states, which would enable us to estimate γ(µ)→(ν), are rare, especially at [Na+]=100 mM; thus it is not feasible to obtain a statistically meaningful scatter plot of (Dconf, Dint) (see Fig. S23D, E, F); however, for those displayed in Fig. S23D, E, and F, Dtot ≈ 7 suggests that χ ≳ 0.9 (from Fig. 5). Therefore, at least the intra-basin rate constants extracted from the H-DNA data using VB-DCMM are reliable. Time traces that have τint comparable to the experimental observation time (τint ≈ Tobs) would exhibit on average no or only a single transition event between distinct internal states. Indeed, we find that only a subset of the total number of internal states is sampled by individual time traces due to the limited observation time. For instance, at [NaCl]=100 mM, our analysis identified K* ≤ 2 in 265 out of 269 traces, and only 4 time traces display K* > 2 (Fig. S11). Therefore, in order to identify the internal states present in the transition dynamics of H-DNA, a clustering analysis is required against the whole ensemble of time trajectories. We provide the procedure of the clustering analysis and its results in detail in the following section.
Clustering H-DNA data
The VB-DCMM algorithm allows us to decompose individual H-DNA time traces with dynamic disorder into multiple “components”, each of which should satisfy the property of a homogeneous Markov chain. In order to understand the structure of the conformational space of H-DNA, the ensemble of components acquired from the VB-DCMM analysis should be clustered into groups of the same kind. To this end, we produce scatter plots of (kL→H, kH→L), representing the kinetic properties of the ensemble of time traces, using the transition rates estimated for individual time traces. The scatter plots of (kL→H, kH→L) were calculated for the ensemble of H-DNA time traces (i) before (Fig. 7A, left) and (ii) after decomposing the individual heterogeneous time traces retaining multiple components into homogeneous ones (Fig. 7A, right). The scatter plot of (kL→H, kH→L) after the decomposition has a greater dispersion, which is expected since a data point (kL→H, kH→L) for a time trace with dynamic disorder is a mixture of (k(µ)L→H, k(µ)H→L) with µ = 1, 2, …, K. In the presence of a clear distinction between internal states (µ ≠ ν), the clustering of (k(µ)L→H, k(µ)H→L) would be straightforward, which is indeed the case for the synthetic data (Fig. S13A). However, for the H-DNA data, even after the decomposition, the clustering of the data on the (kL→H, kH→L) plane (Fig. 7A) is not that clear.
To improve the quality of clustering, we extended the clustering of the kinetic data to a higher dimension by considering the kinetic information of internal states that are contiguous (kinetically linked) along time traces. To be specific, for a time trace exhibiting a transition from the µ-th to the ν-th internal state (µ ≠ ν), one can consider that the inter-basin transition has occurred from the time interval represented by its pair of kinetic rates (k^bf_L→H, k^bf_H→L) to the next time interval represented by (k^af_L→H, k^af_H→L), where the superscripts ‘bf’ and ‘af’ denote ‘before’ and ‘after’ the transition, respectively. Thus, instead of (kL→H, kH→L), a clustering at a higher dimension can be carried out by measuring the Euclidean distance between pairs of the four-dimensional (4-dim) arrays (k^bf_L→H, k^bf_H→L, k^af_L→H, k^af_H→L).
In order to cluster the 4-dim arrays we used the k-means clustering algorithm. Application of the algorithm to the H-DNA data at [Na+]=100 mM reveals that the average pairing distance, 𝒟(𝒦) (see Methods), is minimized when the number of clusters is 6 (𝒦 = 6); namely, the model with 6 clusters provides the best interpretation of the data (Fig. 7B). Although the model with 14 clusters shows a smaller 𝒟, we selected 𝒦 = 6 as the best solution, since for 𝒦 = 14 each of 12 clusters out of 14 has fewer than 10 data points, which makes the result of clustering statistically less significant (Fig. S16). These results remain qualitatively identical when the L1 distance (the so-called “city block” distance) is used instead of the squared Euclidean distance (Fig. S19). Furthermore, the clustering algorithm using “affinity propagation” [58], which considers all data points as possible exemplars (analogous to centroids in the k-means clustering method) and iteratively exchanges messages between them, also gives qualitatively identical results, confirming the robustness of the conclusion on H-DNA dynamics obtained from VB-DCMM and k-means clustering (see Fig. S20).
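The k-means step for a fixed number of clusters 𝒦 can be carried out with scikit-learn as sketched below; the placeholder data and the log-transformation of the rates are assumptions made only for illustration (the selection of 𝒦 via the pairing distance 𝒟(𝒦) is described in Methods).

```python
import numpy as np
from sklearn.cluster import KMeans

# Each row is one inter-basin transition ("kinetic arrow"):
# (k_LH^bf, k_HL^bf, k_LH^af, k_HL^af). Placeholder random data.
rng = np.random.default_rng(1)
arrows = rng.lognormal(mean=-1.0, sigma=1.0, size=(120, 4))
X = np.log10(arrows)                          # clustering log-rates is an assumption

km = KMeans(n_clusters=6, n_init=200, random_state=0).fit(X)   # many random restarts
labels, centroids = km.labels_, km.cluster_centers_
```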
We present the result of clustering either (i) by projecting it onto the two separate kinetic planes, (k^bf_L→H, k^af_L→H) and (k^bf_H→L, k^af_H→L), which visualize the inter-basin transitions in terms of the change of the L → H and H → L transition rates (see Fig. 7C), or (ii) by using “interconversion arrows” linking the kinetic rates of two internal states on the (kL→H, kH→L) plane (Fig. 7D). Note that in the scatter plot shown in the right panel of Fig. 7C, the distinction between different clusters is clear. Furthermore, for a system in equilibrium, or at least near equilibrium, the interconversion between two internal states, say µ and ν, should occur in both directions, i.e., µ → ν and ν → µ. In representation (i), a symmetry of the data under the exchange of ‘bf’ and ‘af’ is expected in both panels of Fig. 7C; and in representation (ii), the “arrows”, amounting to the kinetic connectivity between distinct internal states, should be bi-directional. The symmetry of the data plotted in Fig. 7C and the bidirectionality of the kinetic arrows confirm that the condition of detailed balance is satisfied in the system in equilibrium. Fig. 7D depicts 6 kinetic arrows (3 pairs of reversible kinetic arrows) connecting the centroids of the clustered data.
Application of the above clustering method to synthetic data with K = 3, N = 2 (Fig. S13) is straightforward. To check the efficacy of the clustering method for a more complicated case, we tested it with synthetic data generated with K = 4, N = 4, i.e., when there are as many as 4 observable states in each internal state (Fig. S14, Fig. S15). In the case with 4 observable states, a total of 12 possible intra-basin transitions are conceivable; thus, the dimension of the array associated with an inter-basin transition is 24. As long as there is a clear time scale separation, the pairing distance 𝒟(𝒦) is expected to show a minimum at 𝒦 = 12, since there are 12 connection paths between the 4 internal states. Indeed, 𝒟(𝒦) is minimized at 𝒦 = 12 (Fig. S15A).
Lastly, it is noteworthy that the clustering method presented here is not limited to data analysis for systems in equilibrium, but can be extended to systems in a nonequilibrium steady state [59], where the individual state-to-state kinetic transition rates are still well defined by a reversible Markov process although the condition of detailed balance is no longer anticipated [60, 61]. The symmetry of data points and the bidirectionality of kinetic arrows, as in Fig. 7C, D, are still of use to cluster the kinetic information generated from a system in a nonequilibrium steady state.
Folding energy landscape of H-DNA
We classified the “components” with similar kinetic patterns (kL→H, kH→L) obtained from VB-DCMM into a single cluster, each cluster representing a kinetic path linking two independent basins of attraction (or internal states). For example, the kinetic paths in Fig. 7D can be best understood by hypothesizing 4 internal states (four basins) linked by 6 kinetic paths. Thus, the conformational transition landscape of H-DNA at [Na+]=100 mM consists of 4 internal states connected by 3 reversible kinetic paths, as illustrated in Fig. 7E. At lower salt concentrations ([Na+] = 50 mM (Fig. S17) and [Na+] = 26 mM (Fig. S18)), the H-DNA transitions slow down and the dispersion of the data also increases; however, the overall structure of the conformational landscape of H-DNA remains unchanged from the picture suggested in Fig. 7E: there is a central superbasin to which three other superbasins are kinetically connected (Figs. S17 and S18).
Contributions of our work
In comparison to other pre-existing methods, the advantages of our VB-DCMM in decoding dynamic disorder from a given trajectory are highlighted as follows:
Dynamic disorder in single molecule time trajectories is modeled using DCMM by assuming the presence of hidden internal states. While the Aggregated Markov Model (AMM), which has been adopted in the ion-channel community for the analysis of time traces of varying current [62–76], can be employed to analyze our data with dynamic disorder, DCMM is better at correctly decoding dynamic disorder than AMM. We found that AMM is prone to overpredicting the transitions between kinetic patterns (Fig. S25). Our method is more suitable for data showing persistent dynamic patterns, as it suppresses such spurious frequent transitions between kinetic patterns. Detailed explanations of the connection between DCMM and AMM and their quantitative comparison are provided in the SI and Fig. S25.
In this paper, a Bayesian version of DCMM was developed by using the variational Bayes (VB) method, which enabled us to determine the number of internal states straightforwardly. Although a Bayesian version of DCMM using the Markov chain Monte Carlo (MCMC) method had previously been developed for credit portfolio modeling [77], the idea of Bayesian inference in ref. [77] was used only for the purpose of calculating a posterior distribution of model parameters. To determine the number of hidden states, corresponding to the internal states in this study, the authors of ref. [77] instead used an economic cycle fluctuation model. Our study combining VB with DCMM (i) can determine the number of internal states in a more objective fashion, (ii) offers an intuitive way to incorporate prior knowledge, and (iii) is computationally more efficient than MCMC (see SI for details).
We tested VB-DCMM under various conditions, by varying the kinetic rates, the number of observables, the number of hidden states, and the prior parameters. New metrics were also devised to quantify the performance of the algorithm systematically.
Finally, the connection paths (kinetic arrows) between internal states of H-DNA are clustered by using the kinetic components extracted from VB-DCMM and by applying the k-means clustering algorithm to high-dimensional arrays.
To recapitulate, our entire process of analyzing single molecule data is composed of three stages: (i) noise-filtering using HMM; (ii) decomposition of heterogeneous time traces into homogeneous components using VB-DCMM; (iii) clustering of the decomposed components into groups of the same kind.
In principle, this three-stage analysis can be made more systematic by combining the noise-filtering and clustering procedures with VB-DCMM. To be more specific, (1) the noise-filtering of the observable trace (on) is processed independently from the main VB-DCMM algorithm by using HMM, which has proved reliable for noise-filtering [56], and the maximum number of observables (N) is predetermined as an input parameter. The current version of the algorithm could be further automated by combining it with the Bayesian version of HMM [49], which can determine the number of observables while filtering the noise in the data (see Fig. 2D). The resulting model would have a structure similar to the modified factorial HMM [46, 47]. (2) The heterogeneous components identified from individual time traces are clustered separately from our main algorithm. It would also be desirable to unify the post-processing step (clustering) with VB-DCMM using the empirical Bayes method, which has recently been applied to analyze single molecule data [54, 55].
However, it should also be noted that a blind integration of the noise-filtering and clustering steps inevitably complicates the implementation of VB-DCMM, as a greater number of prior parameters ought to be decided by users. For example, a Bayesian implementation of HMM for noise filtering demands the manual determination of an additional N(N+5) prior parameters [49]. Compared to this, VB-DCMM currently requires users to pre-determine only one prior parameter, which characterizes the final transition rate matrix A (see the subsection “Selection of prior parameters” in the SI). Moreover, the integration of other methods would obscure the flow of analysis, making it difficult to identify an error-causing step. Keeping each step of the algorithm separate makes the integration of VB-DCMM into other applications more transparent (for example, if noise-filtering by HMM is unsuccessful, another advanced method can be employed [49]). We leave it as future work to develop an algorithm that integrates the above-mentioned three procedures (noise-filtering, VB-DCMM, and clustering) without increasing complexity or obscuring the flow of analysis.
In decoding SM FRET data, the most notable difference of our VB-DCMM from previous studies employing probabilistic models such as maximum likelihood and Bayesian statistics is that VB-DCMM explicitly considers the situation in which the transition rates can change from one time interval to another within individual time traces. The previous studies [49–51, 56, 78–80] assumed that the transition rates were constant within individual time traces. Also, VB-DCMM is currently applicable to window-averaged FRET trajectories. It will be of great interest to extend VB-DCMM to the analysis of time trajectories in which arrival times of individual photons are available. VB-DCMM is particularly powerful when there is a separation in time scales between τint and τconf.
Concluding Remarks
While the notion of dynamical heterogeneity or broken ergodicity seems better recognized in the research field of nucleic acids [81] than in proteins, which likely arises from the more homopolymer-like nature of the nucleotide building blocks [82], biomolecules in general can have a rugged folding landscape with many local basins of attraction and kinetic barriers of varying heights [83]. Conformational dynamics of biomolecules on rugged landscapes can be heterogeneous, which gives rise to static or dynamic disorder depending on the time scale of observation and the height distribution of the kinetic barriers. The presence of heterogeneity or disorder among individual molecules, unveiled by in vitro SM experiments, could be surprising at first sight; however, it is also important to note that the general hypotheses of conventional molecular biology centered on a single native state have been put forward based on observations from ensemble experiments, where heterogeneity, if any, is usually masked by the process of ensemble averaging. Given that the complexity of a molecular system increases with the system size (Nsys) as ∼ eNsys [29], it should not be too surprising to find such disorder in biomolecules. Cells are equipped with molecular chaperones that can tame misfolding-prone biomolecules with rugged landscapes [84–87]; thus the principle of optimization in biology, if it fails at the level of a molecule in isolation, can be extended further to the molecular system including its environmental factors.
It is not easy to elucidate the molecular origin of disorder in a conclusive manner; yet, it has recently been suspected that interactions of biomolecules with cofactors such as ATP and multivalent metal ions could be the microscopic causes for those molecules exhibiting dynamical heterogeneity [12, 13, 16, 88, 89]. Modulating the concentration of Mg2+ ions from high to low and again to high induced interconversions of dynamic patterns in the equilibrium conformational fluctuations of T. ribozyme [12] and Holliday junctions [13]. The distinct velocities of ATP-powered individual RecBCD helicase motors, which can move processively along dsDNA by unwinding it into two separate strands, can be reset by introducing a long pause that halts the supply of ATP. For time trajectories of biomolecules displaying quenched disorder, a method to analyze such data was proposed using a concept from glass physics [13]. Here, to deal with more general scenarios, we have developed a method to analyze single molecule time traces with dynamic disorder.
As demonstrated by testing the VB-DCMM algorithm on synthetic data, the algorithm is quite accurate in decoding dynamic disorder as long as the time trajectory of interest contains multiple time intervals, each of which displays a kinetic pattern distinct from the others. When a clear separation in time scale is present between two distinct kinetic patterns, large values of Dconf, Dint, and Dtot are obtained.
While we developed the VB-DCMM algorithm primarily to analyze dynamic disorder in duplex-triplex transitions of H-DNA, the method is applicable to any data in the form of a one-dimensional time series with multiple transitions. Together with further technical advances in SM measurements that eliminate experimental artifacts and extend the measurement time, the algorithm developed here will contribute to a better understanding of biomolecules that display heterogeneous dynamics.
Methods
Generation of synthetic data
The internal state sequence x was generated by using a Monte Carlo method with a constant transition matrix (homogeneous Markov chain model). The observable sequence o was generated by using the same method, but with the transition matrix defined at each time t based on the internal state x(t). Finally, Gaussian noise was added to o to produce on.
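A minimal sketch of this generation scheme is given below; the mapping of observable indices to FRET-like levels and the noise amplitude are illustrative assumptions.

```python
import numpy as np

def simulate_dcmm(A, B, T, noise=0.05, seed=0):
    """Generate x (internal states, length T-1), o (noiseless observables,
    length T) and o_n (observables with additive Gaussian noise), following
    the two-layer Markov scheme described above.
    A: (K, K) internal-state transition matrix; B: list of K (N, N) matrices."""
    rng = np.random.default_rng(seed)
    K, N = A.shape[0], B[0].shape[0]
    x = np.empty(T - 1, dtype=int)
    o = np.empty(T, dtype=int)
    x[0], o[0] = rng.integers(K), rng.integers(N)
    for t in range(T - 1):
        if t > 0:
            x[t] = rng.choice(K, p=A[x[t - 1]])       # homogeneous chain for x
        o[t + 1] = rng.choice(N, p=B[x[t]][o[t]])     # chain for o slaved to x(t)
    fret_levels = np.linspace(0.1, 0.9, N)            # map observable index to a FRET-like value
    o_n = fret_levels[o] + rng.normal(0.0, noise, size=T)
    return x, o, o_n
```

For instance, calling `simulate_dcmm` with a slow 2×2 matrix A and two 2×2 matrices B of disparate rates produces a trace with the same two-layer structure as the synthetic example of Fig. 3A.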
Single-molecule FRET measurements to monitor duplex-triplex transitions of H-DNA
We purchased triplex-forming oligonucleotides from Integrated DNA Technologies (Coralville, IA, USA). The oligonucleotides were dissolved in T50 buffer solution (10 mM Tris-HCl, 50 mM NaCl, pH 7.5), heated beyond the melting temperature of the DNA duplex (~ 90 °C), and slowly cooled down on a heat block to room temperature over 8 hours to properly hybridize them. The DNA prepared this way is called “H-DNA” here. The sequences of the triplex-forming strands (purine-rich and pyrimidine-rich) are: Purine-rich strand: 5’ AAG AAG AAG AAG AAG (Cy5) TGG CGA CGG CAG CGA (Biotin) 3’, Pyrimidine-rich strand: 5’ TCG CTG CCG TCG CCA CTT CTT CTT CTT CTT TTT TCT TCT TCT TCT TCT TC (Cy3) 3’. In the purine-rich strand, the biotin at the 3’ terminus is used to attach the H-DNA molecule to a neutravidin-coated cover glass. The Cy3 and Cy5 dyes in the H-DNA molecule correspond to the donor and acceptor for FRET measurements, respectively. In order to observe the transition between the folded triplex and unfolded DNA, we used a reaction buffer containing 50 mM HEPES (Sigma-Aldrich) and various concentrations of Na+ (26, 50, 100 mM). These buffer solutions also contained 2 mM Trolox, 10% glucose, and gloxy for single-molecule fluorescence experiments. We utilized a home-built TIRF (Total Internal Reflection Fluorescence) microscope to measure the FRET efficiency between the donor and acceptor dyes, which reveals the conformational state of the H-DNA molecule. A 532-nm laser (CrystaLaser DPSS, 10 mW) was used to excite donor molecules, and the fluorescence intensities of both dyes were measured by an EMCCD (Andor iXon DV887, Andor Technology). To observe the change of FRET efficiency in real time, we measured time-lapse FRET traces at a repetition rate of 10 Hz. To study the kinetic features of the conformational transition with dynamic disorder, we acquired the FRET time traces for a long period (> 100 sec).
Clustering at a higher dimension
For given N and K, a total of N(N – 1) intra-basin transition rates are defined in the µ-th basin (or µ-th internal state), and a total of K(K – 1) inter-basin transitions are conceivable. To cluster the kinetic information of the H-DNA data obtained from VB-DCMM, we consider the kinetic arrow, a 2N(N – 1)-dimensional array of data with the structure Ri = ({k^bf_a→b}, {k^af_a→b}) (a ≠ b), where the subscript i denotes an index referring to one of the K(K – 1) possible inter-basin transitions linking two internal states (µ ≠ ν). For a kinetic scheme made of a network of reversible transitions between K internal states, the transition between two internal states should be bidirectional; thus, for a given inter-basin transition path i, there should be a kinetic path j antiparallel to the path i, satisfying Rj ≈ R̄i, where R̄ ≡ ({k^af_a→b}, {k^bf_a→b}) denotes the array with its ‘bf’ and ‘af’ blocks interchanged. In our problem, the set of all the data generated as an outcome of VB-DCMM can in principle be clustered into the disjoint subsets of size 2 partitioning the 𝒦 transition paths, {𝒦|1 ≤ 𝒦 ≤ K(K – 1)}, and one realization of such disjoint subsets will minimize the pairwise sum of Euclidean distances between the arrays Rα and R̄β over all paired α and β; however, this method suffers from a high computational cost, as the possible number of clusters increases rapidly with N and K.
To alleviate the computational cost for large N and K, we modified the original method. We first searched for the best partitioning set of data, S*(𝒦), for a given 𝒦 that minimizes the sum of Euclidean distances between the pairs of centroids, 𝒟c(𝒦) = Σ(i,j)∈S(𝒦) ‖⟨R⟩i – ⟨R̄⟩j‖, where ⟨…⟩ denotes the centroid of the clustered data. To obtain the best clustering result for a given 𝒦, we conducted k-means clustering using the k_means function from the scikit-learn library [90] with 20,000 different random initial conditions in each analysis. The summation, Σ(i,j), signifies that the sum is taken over the disjoint subsets of size 2 partitioning the set {1, …, 𝒦} (with 𝒦 being an even number), and S*(𝒦) is the best partitioning set that minimizes the value of 𝒟c(𝒦) for a given 𝒦. For example, provided that there are 4 kinetic arrows made of centroids (i = 1, 2, 3, 4), and that 𝒟c is minimized when i = 1 is paired with i = 3 and i = 2 with i = 4, then S*(2) = {{1, 3}, {2, 4}}.
Next, in order to decide the optimal 𝒦, we calculated the pairing distance between the paired clusters in S*(𝒦) again, but this time using all the elements in each cluster. The total pairing score is 𝒟(𝒦) = Σ(i,j)∈S*(𝒦) d(i, j), where the average pairing distance between two clusters i and j is defined as d(i, j) = (1/MiMj) Σn Σm ‖R(i)n – R̄(j)m‖; here R(i)n is the n-th kinetic arrow belonging to the i-th cluster, n refers to an index for the elements in the i-th cluster and m to an index for the elements in the j-th cluster, and Mi is the total number of elements in the i-th cluster. Finally, the optimal 𝒦*, minimizing 𝒟(𝒦), is selected, i.e., 𝒦* = arg min 𝒟(𝒦), and the interpretation of the data is conducted for the best partitioning set S*(𝒦 = 𝒦*).
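The sketch below illustrates this two-stage pairing scheme under simplifying assumptions: the antiparallel partner of an array is taken to be the array with its ‘bf’ and ‘af’ halves swapped, clusters are paired greedily by centroid distance rather than by exhaustive search over all partitions, and 𝒦 is assumed to be even.

```python
import numpy as np
from sklearn.cluster import KMeans

def swap_halves(r):
    """Exchange the 'bf' and 'af' blocks of kinetic-arrow array(s)."""
    h = r.shape[-1] // 2
    return np.concatenate([r[..., h:], r[..., :h]], axis=-1)

def pairing_distance(X, n_clusters, n_init=100):
    """D(K): cluster the kinetic arrows, pair clusters with their (approximate)
    antiparallel partners, and return the summed average pairing distance."""
    km = KMeans(n_clusters=n_clusters, n_init=n_init, random_state=0).fit(X)
    labels, centers = km.labels_, km.cluster_centers_
    # distance between centroid i and the swapped centroid j
    cost = np.linalg.norm(centers[:, None, :] - swap_halves(centers)[None, :, :], axis=-1)
    np.fill_diagonal(cost, np.inf)
    unpaired, pairs = set(range(n_clusters)), []
    while unpaired:                                   # greedy antiparallel pairing
        i = min(unpaired)
        j = min((c for c in unpaired if c != i), key=lambda c: cost[i, c])
        pairs.append((i, j))
        unpaired -= {i, j}
    score = 0.0
    for i, j in pairs:                                # average element-wise distance per pair
        Xi, Xj = X[labels == i], swap_halves(X[labels == j])
        score += np.mean(np.linalg.norm(Xi[:, None, :] - Xj[None, :, :], axis=-1))
    return score
```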
For the H-DNA data at three different Na+ concentrations, the optimal 𝒦* is determined to be 𝒦* = 6 for [Na+]=100 mM (Fig. 7A), 𝒦* = 10 for [Na+]=50 mM (Fig. S17B), and 𝒦* = 12 for [Na+]=26 mM (Fig. S18B). This implies that the complexity of the conformational space of H-DNA increases at lower salt concentrations (see also the scatter plots of (kL→H, kH→L) in Fig. 7A, Fig. S17A, Fig. S18A).
The clustering results presented in this study remain robust regardless of the choice of distance metric. K-means clustering using the L1 (“city block”) distance with 20,000 different random initial conditions also led to qualitatively similar results (Fig. S19). Furthermore, as an alternative clustering algorithm, we also tested “affinity propagation” [58] on our data, and the results remain qualitatively identical (see Fig. S20). In the affinity propagation method, the negative squared Euclidean distance, s(i, k) = –‖xi – xk‖², was employed as the similarity metric, where xi denotes the coordinate of the i-th data point. The objective of the algorithm is to optimize the factorized probability distribution that approximates the net similarity 𝒮, defined as 𝒮 = Σi s(i, ci). Here, ci is the index of the exemplar of the i-th data point xi. For example, if ci = k, xk is an exemplar of xi and xi belongs to the cluster represented by xk. Multiple iterations of message passing are carried out until convergence is achieved, and the best result of clustering is acquired. For implementation, we used the AffinityPropagation class from the scikit-learn [90] library with varying “preference” as an input parameter, where the preference denotes the logarithm of the probability that the i-th data point xi selects itself as an exemplar. Further details of the algorithm are available in ref. [58].
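A minimal usage sketch of this alternative clustering step is shown below; the placeholder data and the particular preference and damping values are assumptions made for illustration only (scikit-learn’s AffinityPropagation uses the negative squared Euclidean distance as its default similarity).

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

# Placeholder kinetic-arrow arrays (n_arrows x dimension).
X = np.random.default_rng(0).normal(size=(200, 4))

ap = AffinityPropagation(preference=-50.0, damping=0.9, random_state=0).fit(X)
labels = ap.labels_                              # cluster assignment of each kinetic arrow
exemplars = X[ap.cluster_centers_indices_]       # data points selected as exemplars
```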
VB-DCMM: IMPLEMENTATION
Derivations
A factorized form, q(Z) = q(π)q(A)q(B)q(x), was assumed to find an approximate q(Z). As a result, F[q] (Eq. S5) can be expanded term by term; substituting Eq. (S1) into Eq. (S12) then gives the expanded expression for F[q].
Updating q(λ) (q(π), q(A), and q(B))
We first set q(x) = P(x|o, λ′) with given initial values of λ′ = (π′, A′, B′). Substitution of Eqs. (S9), (S10), (S11) into F[q] (Eq. (S13)) and integration over x lead to the expression for F[q(π)]. To derive it, we first use ∫q(A)dA = ∫q(B)dB = 1, and then replace ∫q(x) log πx(1)dx with Σx(1),x(2),…,x(T−1) P(x(1), x(2), …, x(T−1)|o, λ′) log πx(1) in going from the second to the third line. The normalization factor of P(π|K) (Eq. (S9)) is added as a constant term. Finally, by changing the sum of logarithms into the logarithm of the product of their arguments and combining all the integrands together, the final result is obtained. By a similar procedure, F[q(A)] and F[q(B)] can be written analogously. After combining the above three equations, F[q] takes the form of a negative Kullback-Leibler divergence, −DKL(⋅), up to a constant. Now, by setting q(π), q(A), and q(B) equal to Dirichlet distributions with new parameters W, we can increase F[q], since −DKL(⋅) ≤ 0. P(x(1) = µ|o, λ′) and P(x(t) = µ, x(t + 1) = ν|o, λ′) can be calculated efficiently by using the forward-backward algorithm [1].
Updating q(x)
Now we integrate F[q] over π, A, and B with fixed (and updated) q(π), q(A), and q(B) to optimize q(x). From Eq. (S12), F[q(π)] can be written by first using ∫q(A)dA = ∫q(B)dB = 1, as the integrand does not depend on A and B. As the remaining prior term does not depend on x, the result of its integration can be written as a constant (const.). By a similar procedure, F[q(A)] and F[q(B)] are written analogously. By combining Eqs. (S18)–(S20), we again obtain F[q] in the form −DKL(⋅) + const., with effective parameters π″, A″, and B″ defined through the digamma function; here, ψ(⋅) denotes the digamma function, ψ(x) = d lnΓ(x)/dx. Now that F[q] again has the form −DKL(⋅) + const., F[q] can be maximized by minimizing the DKL(⋅) term, which is achieved by setting q″(x) = P(o, x|π″, A″, B″)/Σx P(o, x|π″, A″, B″). Note that the numerator of the equation above is equal to P(o, x|π″, A″, B″), implying q″(x) = P(x|o, π″, A″, B″).
With the updated q(x), and by replacing λ′ = (π′, A′, B′) with λ″ = (π″, A″, B″), one can further update q(π), q(A), and q(B). These steps are iterated until the value of F[q] converges to the desired precision.
Finally, the converged F[q] is calculated by substituting the converged distribution q = q* and parameters π*, A*, B* into Eq. (S12). The first three terms, −DKL(⋅), correspond to penalties against model complexity. The fourth term corresponds to the likelihood, which generally increases with K. The final log K! term accounts for the label symmetry of the model [2], where K is the number of possible internal states in the model. The degeneracy arises from the freedom of permuting labels: for example, if two internal states x = 1, 2 are found by VB-DCMM, a relabeled model with x = 2, 4 and B(xnew=2) = B(x=1), B(xnew=4) = B(x=2) is an equally probable solution. The overall evidence should therefore be the sum over all labelings obtained by permuting the internal states; the corrected evidence is multiplied by K!, which introduces the additional factor log K! in the log evidence. In the analysis of real single-molecule data, the number of observed internal states Kobs is generally not identical to the parameter K. In this case, the actual degeneracy in labeling the internal states is KCKobs × Kobs! = K!/(K − Kobs)! instead of K!. To take this effect into account when calculating the evidence, we modified the original evidence function into the form given below. According to Eq. (S3), an increase in the lower bound F[q] accompanies a decrease in DKL(q∥p). Thus, after multiple iterations F[q] (or G[q]) is expected to converge to F[q*], which satisfies F[q] < F[q*] ≃ log P(o|K); this implies DKL(q*∥p) ≃ 0. From the factorization q(Z) = q(x)q(π)q(A)q(B), it follows that DKL(q*∥p) ≃ 0 implies q*(π) ≃ P(π|K), q*(A) ≃ P(A|K), q*(B) ≃ P(B|K), and Eq. S26 also implies q*(x) = P(x|o, π*, A*, B*) ≃ P(x|o, π, A, B). Finally, π*, A*, B*, which provide a set of rate constants, are interpreted as the estimated model parameters.
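The label-degeneracy correction can be evaluated directly; a minimal sketch (the function name is ours) is:

```python
from math import comb, factorial, log

def label_degeneracy_correction(K, K_obs):
    """log of the number of equivalent labelings: log[ C(K, K_obs) * K_obs! ].

    K     : number of internal states allowed by the model
    K_obs : number of internal states actually visited in the trace
    Reduces to log(K!) when K_obs == K.
    """
    return log(comb(K, K_obs) * factorial(K_obs))

# Example: a K = 3 model in which only 2 internal states are observed
# contributes log(3 * 2) = log 6 to the corrected log evidence.
print(label_degeneracy_correction(3, 2))
```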
Implementation
Selection of prior parameters
The likelihood term, log P(o|π*, A*, B*) in Eq. (S23), generally increases with K. The other terms, −DKL(⋅), are always negative, which imposes a penalty on models with higher K. Since the difference between two Dirichlet distributions vanishes when the posterior parameter W equals the prior parameter u, and is minimized when the ratios between the elements of W and those of u are identical, Eq. (S23) provides a natural guideline for selecting the prior parameters. We selected the prior parameters using the following rule (a schematic code sketch follows the list).
- For µ ≠ ν, set the off-diagonal prior parameters of the internal-state transition matrix A.
- Set the corresponding diagonal prior parameter to (the internal-state transition rate (with Δt = 1) estimated by visual inspection)−1.
- Perform a hidden Markov analysis assuming K = 1 to construct the transition matrix Bh of a homogeneous Markov process.
- Set the prior parameters of each B(µ) from Bh, for all µ.
For example, when roughly one internal-state transition is observed in a trace with Tobs/Δt = 2000, we set the corresponding prior parameter to ≈ 2000. If , then for all µ. The results do not depend critically on the choice of prior parameters as long as they lie within a reasonable range (Fig. S21, S22).
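A hedged sketch of how such a rule might be coded is given below. The exact formulas are not reproduced here; the specific choices (setting the diagonal prior of A to the inverse of the visually estimated per-frame internal transition rate, and copying Bh into the prior of every B(µ)) are our reading of the rule above, not the authors' implementation, and the function name is ours.

```python
import numpy as np

def make_prior_parameters(n_int_transitions_est, T, B_h, K, u_a_offdiag=1.0):
    """Hedged sketch of a prior-selection rule of the kind outlined above.

    n_int_transitions_est : visually estimated number of internal-state
                            transitions in the trace (assumption)
    T                     : trace length in units of Delta t
    B_h                   : transition matrix from a K = 1 (homogeneous) HMM fit
    K                     : number of internal states in the candidate model
    """
    # Diagonal prior of A set to the inverse of the estimated per-frame
    # internal transition rate, e.g. one transition in T = 2000 frames
    # gives a diagonal prior of ~2000 (assumption based on the example above).
    u_a_diag = T / max(n_int_transitions_est, 1)
    u_A = np.full((K, K), u_a_offdiag)
    np.fill_diagonal(u_A, u_a_diag)

    # Prior for every internal state's observable transition matrix B^(mu)
    # copied from the homogeneous-HMM estimate B_h (assumption).
    u_B = np.repeat(np.asarray(B_h, dtype=float)[None, :, :], K, axis=0)
    return u_A, u_B
```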
Avoiding local minima
To avoid local minima, the evidence was calculated 20 times for each model with random initial parameters, and the result with the largest evidence was selected. Initial values of the transition matrices were generated from Dirichlet distributions: A with parameters ua = 0.3, uad = 200; B with parameters ub = 1, ubd = 20, where uad and ubd are used to generate the diagonal elements of the transition matrices.
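A minimal sketch of this initialization step follows; the random seed and matrix sizes are illustrative, and the restart loop itself (keeping the run with the largest converged evidence) is only described in the comment.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_transition_matrix(n, u_offdiag, u_diag):
    """Sample an n x n row-stochastic matrix, row by row, from Dirichlet
    distributions whose diagonal concentration (u_diag) is much larger than
    the off-diagonal one (u_offdiag), so initial guesses favor self-transitions."""
    M = np.empty((n, n))
    for i in range(n):
        alpha = np.full(n, float(u_offdiag))
        alpha[i] = u_diag
        M[i] = rng.dirichlet(alpha)
    return M

# Initial guesses as described above: A with (u_a, u_ad) = (0.3, 200),
# B with (u_b, u_bd) = (1, 20); 20 such restarts are run and the one with
# the largest converged evidence is kept.
A0 = random_transition_matrix(3, u_offdiag=0.3, u_diag=200)
B0 = random_transition_matrix(2, u_offdiag=1.0, u_diag=20)
```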
Computation time
Computation time depends on the length of the data, the number of models to be tested, and the number of restarts (used to avoid local minima). For example, the analysis of one time trace with Tobs/Δt = 4400, K = 1, 2, and 3, and 20 restarts takes ~3 min, whereas the same test with Tobs/Δt = 8800 takes ~6 min on a MacBook Pro 13 (3 GHz Intel Core i7). A linear dependence of the analysis time on Tobs/Δt is expected because each iteration executes DCMM, whose running time scales linearly with the length of the data, involving a parameter-estimation procedure similar to that of HMM [1]. F usually converged after ~10 iterations under our test conditions, except when a poor guess for ua, uad, ub, ubd was used on purpose while testing the algorithm (Fig. S21, S22). All implementations of the algorithm and data analysis were carried out with our custom code written in Python, using the following libraries: Matplotlib [4], Numpy [5], Scipy [6], IPython [7], Scikit-learn [8], and Cython [9].
EFFICACY OF VB-DCMM ASSESSED BY THE LAW OF LARGE NUMBERS
To assess the efficacy of VB-DCMM in identifying dynamic disorder (hidden internal states) in a given time trace, we divided an ensemble of heterogeneous time traces into shorter homogeneous traces by using the information on internal states in xmodel(t), and calculated the distribution of φ20 = σ20/µ20 of the dwell times, where the subscript 20 indicates that 20 consecutive dwell times along the time trace are used in evaluating the standard deviation σ20 and the mean µ20.
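A minimal sketch of how φ20 can be evaluated along a sequence of dwell times is given below; the sliding-window choice and function name are ours.

```python
import numpy as np

def phi_n(dwell_times, n=20):
    """Ratio sigma_n / mu_n of the standard deviation to the mean,
    evaluated over sliding windows of n consecutive dwell times."""
    d = np.asarray(dwell_times, dtype=float)
    phis = []
    for i in range(len(d) - n + 1):
        w = d[i:i + n]
        phis.append(w.std() / w.mean())
    return np.array(phis)

# For exponentially distributed dwell times (homogeneous Markov kinetics),
# phi_20 scatters around 1; a window straddling two internal states with
# different rates pushes phi_20 above 1.
rng = np.random.default_rng(1)
homogeneous = rng.exponential(scale=1.0, size=200)
print(phi_n(homogeneous).mean())
```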
For time traces generated from a completely homogeneous Markov process, φ20 ≃ 1 is expected; however, φ20 > 1 when it is evaluated across boundaries where different internal states coexist. Thus the distribution of φ20 will be sharply peaked, P(φ20) ~ δ(φ20 − 1), if a heterogeneous trace is correctly decomposed into pieces of homogeneous traces, so that each piece contains only one internal state. Indeed, after decomposition of the original time traces the histogram of φ20 becomes narrower and more Gaussian-like (Fig. S23A-C). A test on synthetic data generated using K = 3 shows a similar trend (Fig. S24). Next, we analyzed H-DNA traces with more than 3 interconversion events between internal states (Fig. S23D-F); the (Dconf, Dint) values of these traces lie in the region where the synthetic traces display ⟨χ⟩ ~ 0.9.
In a Markov model, the transition probability from an observable state a to b is estimated from na→b, the actual number of a → b transitions observed in a given trace, and by the law of large numbers the ratio between the standard deviation and the mean of na→b scales as 1/√na→b. Thus we expect φka→b ~ φna→b ~ 1/√na→b. Since a ~4-fold difference between the conformational transition rates in different internal states µ and ν (µ ≠ ν) is sufficient for reliable detection of internal states (Fig. 3A, Fig. S2), VB-DCMM is expected to work for φka→b ~ φna→b ≲ 1/4, which leads to a requirement of time-scale separation between τint and τconf of τint/τconf ≳ 16 (or Dint ≳ 4, Eq. (5)). Indeed, when all synthetic data are plotted with the two metrics Dconf and Dint, all the data with Dint ≳ 4 show high ⟨χ⟩ for Dconf ≳ 2 (Eq. (4)) (Fig. 5). A large Dconf is important for the internal states to be discernible, whereas a large Dint is required for accurate estimation of k. The performance of the algorithm relies on these two factors.
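Assuming Poisson-like counting statistics for the number of observed transitions, the argument above can be restated compactly:

```latex
\[
\frac{\sigma(n_{a\to b})}{\langle n_{a\to b}\rangle} \sim \frac{1}{\sqrt{n_{a\to b}}}
\;\;\Rightarrow\;\;
\varphi_{k_{a\to b}} \sim \varphi_{n_{a\to b}} \lesssim \frac{1}{4}
\;\;\Rightarrow\;\;
n_{a\to b} \gtrsim 16
\;\;\Rightarrow\;\;
\frac{\tau_{\mathrm{int}}}{\tau_{\mathrm{conf}}} \gtrsim 16
\quad \bigl(\text{i.e., } D_{\mathrm{int}} \gtrsim 4,\ \text{Eq.~(5)}\bigr).
\]
```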
OTHER APPROACHES
Markov Chain Monte Carlo (MCMC) technique
As an alternative way of calculating the evidence, a Bayesian version of DCMM using the MCMC method was previously developed for credit-portfolio modeling [10]. The authors, however, used Bayesian inference to calculate the posterior distribution of model parameters rather than to select the model with the optimal number of internal states, and determined the number of internal states based on a well-accepted model of economic-cycle fluctuations. This approach is not applicable when solid knowledge of the internal states is unavailable. Moreover, they used a constant value for all prior parameters without investigating the effect of the priors on the analysis. Furthermore, unlike VB-DCMM, the MCMC method does not offer an analytical expression for the evidence, which makes it difficult to select prior parameters (or to incorporate prior information).
Comparison with Infinite Aggregated Markov Model (iAMM)
A method that can detect the presence of hidden states by analyzing the dynamic pattern of single ion-channel data has been developed using a (sticky) infinite aggregated Markov model (iAMM) with nonparametric Bayesian inference [11, 12]. VB-DCMM differs from iAMM in several ways and can sometimes be more advantageous: (1) iAMM uses a Markov chain Monte Carlo method, whereas VB-DCMM employs the variational Bayes method, which is computationally less expensive. (2) In iAMM, the aggregated Markov model (AMM) is the basic structure, in which only a single Markov chain exists; the model aims to detect distinct transition rates from a single layer of signal. In fact, one can map DCMM onto the structure of AMM by flattening the two layers of states in DCMM (internal and observable states) into a single sequence of states. For example, a DCMM with two internal states X1, X2 and two observables O1, O2 can be mapped onto four AMM states: Z1 = (X1, O1), Z2 = (X1, O2), Z3 = (X2, O1), and Z4 = (X2, O2) (Fig. S25A). While the transitions between dynamic patterns in DCMM are more strictly regulated, so that the transition rates kZ1→Z4, kZ4→Z1, kZ2→Z3, and kZ3→Z2 are practically zero because the transitions of observables are slaved to the internal state, AMM does not impose such a condition. Although AMM is more flexible in accommodating possible transitions and can accurately predict the sequence of internal states under very carefully selected prior parameters (Fig. S25C), we found that the results of iAMM analysis on our synthetic data were highly sensitive to the chosen prior parameters (Fig. S25D), and that for most choices of prior parameters the traces predicted by iAMM, ziAMM(t), contain spurious frequent transitions between states and do not match the synthetic data z(t), giving rise to a low χ value.
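To make the mapping concrete, the following sketch flattens a two-internal-state, two-observable DCMM into the corresponding 4-state AMM transition matrix. The rate values are arbitrary placeholders, and conditioning the observable step on the updated internal state is one common convention, not necessarily the exact one used in Refs. [11, 12].

```python
import numpy as np

# DCMM ingredients (placeholder numbers): internal-state chain A and one
# observable transition matrix B^(mu) per internal state.
A = np.array([[0.999, 0.001],          # X1 -> {X1, X2}
              [0.001, 0.999]])         # X2 -> {X1, X2}
B = np.array([[[0.95, 0.05],           # B^(X1): rows O1, O2 -> columns O1, O2
               [0.05, 0.95]],
              [[0.80, 0.20],           # B^(X2)
               [0.20, 0.80]]])

# Flattened AMM states: Z1=(X1,O1), Z2=(X1,O2), Z3=(X2,O1), Z4=(X2,O2).
# A joint step factorizes as P(x'|x) * P(o'|o, x'), so entries such as
# Z1 -> Z4 (simultaneous change of internal state and observable) are tiny
# when A is strongly diagonal, i.e. the observable dynamics is slaved to x.
K, N = B.shape[0], B.shape[1]
T_AMM = np.zeros((K * N, K * N))
for x in range(K):
    for o in range(N):
        for xp in range(K):
            for op in range(N):
                T_AMM[x * N + o, xp * N + op] = A[x, xp] * B[xp, o, op]

print(np.round(T_AMM, 4))   # each row sums to 1; cross terms like Z1->Z4 ~ 1e-4
```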
Persistent dynamic patterns, such as those observed in the dynamics of H-DNA and the preQ1-riboswitch [13], are better modeled with VB-DCMM, whose results are not sensitive to the choice of prior parameters (Fig. S21, S22).
Acknowledgements
We thank the KIAS Center for Advanced Computation for providing computing resources. This study was partly supported by the National Research Foundation of Korea (NRF-2015R1D1A1A01060376).
Footnotes
* hyeoncb{at}kias.re.kr
References
- [1].
- [2].
- [3].
- [4].
- [5].
- [6].
- [7].
- [8].
- [9].
- [10].
- [11].
- [12].
- [13].
- [14].
- [15].
- [16].
- [17].
- [18].
- [19].
- [20].
- [21].
- [22].
- [23].
- [24].
- [25].
- [26].
- [27].
- [28].
- [29].
- [30].
- [31].
- [32].
- [33].
- [34].
- [35].
- [36].
- [37].
- [38].
- [39].
- [40].
- [41].
- [42].
- [43].
- [44].
- [45].
- [46].
- [47].
- [48].
- [49].
- [50].
- [51].
- [52].
- [53].
- [54].
- [55].
- [56].
- [57].
- [58].
- [59].
- [60].
- [61].
- [62].
- [63].
- [64].
- [65].
- [66].
- [67].
- [68].
- [69].
- [70].
- [71].
- [72].
- [73].
- [74].
- [75].
- [76].
- [77].
- [78].
- [79].
- [80].
- [81].
- [82].
- [83].
- [84].
- [85].
- [86].
- [87].
- [88].
- [89].
- [90].