Perturbative formulation of general continuous-time Markov model of sequence evolution via insertions/deletions, Part II: Perturbation analyses

Background Insertions and deletions (indels) account for more nucleotide differences between two related DNA sequences than substitutions do, and thus it is imperative to develop a stochastic evolutionary model that enables us to reliably calculate the probability of the sequence evolution through indel processes. In a separate paper (Ezawa, Graur and Landan 2015a), we established a theoretical basis of our ab initio perturbative formulation of a genuine evolutionary model, more specifically, a continuous-time Markov model of the evolution of an entire sequence via insertions and deletions. And we showed that, under some conditions, the ab initio probability of an alignment can be factorized into the product of an overall factor and contributions from regions (or local alignments) separated by gapless columns. Results This paper describes how our ab initio perturbative formulation can be concretely used to approximately calculate the probabilities of all types of local pairwise alignments (PWAs) and some typical types of local multiple sequence alignments (MSAs). For each local alignment type, we calculated the fewest-indel contribution and the next-fewest-indel contribution to its probability, and we compared them under various conditions. We also derived a system of integral equations that can be numerically solved to give “exact solutions” for some common types of local PWAs. And we compared the obtained “exact solutions” with the fewest-indel contributions. The results indicated that even the fewest-indel terms alone can quite accurately approximate the probabilities of local alignments, as long as the segments and the branches in the tree are of modest lengths. Moreover, in the light of our formulation, we examined parameter regions where other indel models can safely approximate the correct evolutionary probabilities. The analyses also suggested some modifications necessary for these models to improve the accuracy of their probability estimations. Conclusions At least under modest conditions, our ab initio perturbative formulation can quite accurately calculate alignment probabilities under biologically realistic indel models. It also provides a sound reference point that other indel models can be compared to. [This paper and three other papers (Ezawa, Graur and Landan 2015a,b,c) describe a series of our efforts to develop, apply, and extend the ab initio perturbative formulation of a general continuous-time Markov model of indels.]


Results
This paper describes how our ab initio perturbative formulation can be concretely used to approximately calculate the probabilities of all types of local pairwise alignments (PWAs) and some typical types of local multiple sequence alignments (MSAs). For each local alignment type, we calculated the fewest-indel contribution and the next-fewest-indel contribution to its probability, and we compared them under various conditions. We also derived a system of integral equations that can be numerically solved to give "exact solutions" for some common types of local PWAs. And we compared the obtained "exact solutions" with the fewest-indel contributions. The results indicated that even the fewest-indel terms alone can quite accurately approximate the probabilities of local alignments, as long as the segments and the branches in the tree are of modest lengths. Moreover, in the light of our formulation, we examined parameter regions where other indel models can safely approximate the correct evolutionary probabilities. The analyses also suggested some modifications necessary for these models to improve the accuracy of their probability estimations.

Introduction
The evolution of DNA, RNA, and protein sequences is driven by mutations such as base substitutions, insertions and deletions (indels), recombination, and other genomic rearrangements (e.g., Graur and Li 2000;Gascuel 2005;Lynch 2007). Thus far, analyses on substitutions have predominated in the field of molecular evolutionary study, in particular using the probabilistic (or likelihood) theory of substitutions that is now widely accepted (e.g., Felsenstein 1981Felsenstein , 2004Yang 2006). However, some recent comparative genomic analyses have revealed that indels account for more base differences between the genomes of closely related species than substitutions (e.g., Britten 2002;Britten et al. 2003;Kent et al. 2003; The International Chimpanzee Chromosome 22 Consortium 2004; The Chimpanzee Sequencing and Analysis Consortium 2005). It is therefore imperative to develop a stochastic model that enables us to reliably calculate the probability of sequence evolution via mutations including insertions and deletions. Since the groundbreaking works by Bishop and Thompson (1986) and by Thorne, Kishino and Felsenstein (1991), many studies have been done to develop and apply methods to calculate the probabilities of pairwise alignments (PWAs) and multiple sequence alignments (MSAs) under the probabilistic models aiming to incorporate the effects of indels, and such methods have greatly improved in terms of the computational efficiency and the scope of application. See excellent reviews for details on this topic (e.g., Rivas 2005;Bradley and Holmes 2007;Miklós et al. 2009). A majority of these studies are based on hidden Markov models (HMMs) or transducer theories. Both of them calculate the indel component of an alignment probability as a product of inter-column transition probabilities or of block-wise contributions. Unfortunately, they have two fundamental problems, one regarding the theoretical grounds and the other regarding the biological realism. Regarding the theoretical grounds, it is unclear whether or not a HMM or a transducer is related with any genuine evolutionary model, which describes the evolution of an entire sequence via indels along the time axis. Regarding the biological realism, the standard HMMs or transducers can at best handle geometric distributions of indel lengths, whereas many empirical studies showed that the real indel lengths are distributed according to the power-law e.g., Gonnet et al. 1992;Benner et al. 1993;Gu and Li 1995;Kent et al. 2003;Zhang and Gerstein 2003;Chang and Benner 2004; The international Chimpanzee Chromosome 22 Consortium 2004; Yamane et al. 2006;Fan et al. 2007). Besides, it is very hard for the previous methods to incorporate the indel rate variation across regions, which were often observed empirically (e.g., Gu et al. 2008). See the "background" section in part I (Ezawa, Graur and Landan 2015a) for more details on these problems.
In part I of this series of study (Ezawa, Graur and Landan 2015a), we established an ab initio formulation of a genuine stochastic evolutionary model, that is, a general continuous-time Markov model of sequence evolution via indels. Our evolutionary model allows any indel rate parameters including length distributions, but it does not impose any unnatural restrictions on indels. Thus, the model is naturally devoid of the aforementioned two problems. Aided by some techniques of the perturbation theory in physics (Dirac 1958;Messiah 1961a,b), we formally expanded the probability of an alignment into a series of terms with different numbers of indels. This expansion theoretically underpinned the stochastic evolutionary simulation method of Gillespie (1977). And we also showed that, if the indel model parameters satisfy a certain set of conditions, the ab initio probability of an alignment is indeed factorable into the product of an overall factor and contributions from local alignments delimited by preserved ancestral sites (PASs). This result reconfirmed and generalized the conjecture that Miklós et al. (2004) made under their spatially and temporally homogeneous indel model.
In this paper, we turn our attention to more concrete problems. In other words, we focus on how to concretely calculate the contribution from each local alignment, assuming that the indel model satisfies the conditions for the factorability of alignment probabilities.
In Section 1 of Results & Discussion, we demonstrate how our ab initio perturbative formulation can be concretely used to approximately calculate the contributions to the alignment probabilities from regions separated by gapless columns (i.e., local alignments). We examine all types of local pairwise alignments (PWAs) and some typical types of local multiple sequence alignments (MSAs). For each local alignment type, we calculate the fewest-indel contribution and the nextfewest-indel contribution to its probability. In this section, we also derive a system of integral equations that can be numerically solved to give "exact solutions" for some common types of local PWAs. Then, by comparing the fewest-indel contribution with the next-fewest-indel contribution, or with the "exact" solution, we examine the parameter region in which the fewest-indel contribution can approximate the alignment probability quite accurately. In Section 2 of Results & Discussion, we examine some representative models that were used in the past studies, including the HMM of Kim and Sinha (2007), in the light of our ab initio perturbative formulation. We attempt to delimit parameter regions where these models can safely approximate the ab initio probabilities under the genuine evolutionary model. The analyses will also suggest some modifications to these models that would enhance their reliability. And Appendix describes details on some perturbation calculations.
This paper is part II of a series of our papers that documents our efforts to develop, apply, and extend the ab initio perturbative formulation of the general continuous-time Markov model of sequence evolution via indels. Part I (Ezawa, Graur and Landan 2015a) gives the theoretical basis of this entire study. Part II (this paper) describes concrete perturbation calculations and examines the applicable ranges of other probabilistic models of indels. Part III (Ezawa, Graur and Landan 2015b) describes our algorithm to calculate the first approximation of the probability of a given MSA and simulation analyses to validate the algorithm. Finally, part IV (Ezawa, Graur and Landan 2015c) discusses how our formulation can incorporate substitutions and other mutations, such as duplications and inversions. This paper basically uses the same conventions as used in part I (Ezawa, Graur and Landan 2015a). Briefly, a sequence state s (∈ S) is represented as an array of sites, each of which is either blank or equipped with some specific attributes. And each indel event is represented as an operator acting on the bra-vector, s , representing a sequence state. More specifically, the operator M I (x, l) denotes the insertion of l sites between the x th and (x +1) th sites, and the operator M D (x B , x E ) denotes the deletion of a sub-array between (and including) the x B th and the x E th sites. See Section 2 of part I for more details.
And, also as in part I, the following terminology is used. The term "an indel process" means a series of successive indel events with both the order and the specific timings specified, and the term "an indel history" means a series of successive indel events with only the order specified. And, throughout this paper, the union symbol, (See below for the definitions of the symbols for the arguments.) Thus, as long as we can concretely calculate these regional contributions, we can obtain the specific value of the probability of a given alignment. However, because each of these factors is in general a summation of contributions from an infinite number of local indel histories, we usually need to approximate it by a summation over a finite number of histories. In this section, we will do this, again based on the perturbation expansion as unfolded in Section 3 of part I.
For a PWA, α(s A , s D ) , we decompose Λ ID γ κ ; α(s A , s D ) ⎡ ⎣ ⎤ ⎦ , the set of local indel histories consistent with the local PWA confined in the region γ κ , as: Here Λ ID N κ ; γ κ ; α(s A , s D ) ⎡ ⎣ ⎤ ⎦ is the subset of PWA-consistent histories in γ κ , each of which consists of N κ indels, and N min γ κ ; α(s A , s D ) ⎡ ⎣ ⎤ ⎦ is the minimum number of indels required to give rise to the local PWA of α(s A , s D ) within γ κ . Then, the "perturbation expansion" of the multiplication factor, for the local PWA probability produced during time interval [t I , t F ] , is given by: Here is the multiplication factor contributed from all PWA-consistent N κ -event local indel histories. For a MSA, α[s 1 , s 2 ,..., s N X ] , we decompose Λ Ψ ID C Κ ; α[s 1 , s 2 ,..., s N X ]; T ⎡ ⎣ ⎤ ⎦ , the set of local indel histories along the tree ( T ) consistent with the local MSA confined in the region C Κ , as: ⎤ ⎦ is the subset consisting of all MSA-consistent N Κ -event local indel histories along the tree, and N min C Κ ; α[s 1 , s 2 ,..., s N X ]; T ⎡ ⎣ ⎤ ⎦ is the minimum possible number of indels that can produce the sub-MSA within C Κ . Using this, the "perturbation expansion" of the multiplication factor, is the multiplication factor contributed from all local histories in Thus, the problems were reduced to those of how to enumerate the local indel histories in the subset, for some typical gap-configurations of the local PWAs and MSAs, respectively, that are contributed from the parsimonious local indel histories. Then we compare them with the portions contributed from the nextparsimonious local indel histories. Moreover, for a few simplest but commonest types of local PWAs, we also derive integral equation systems that give the "exact solutions" for the total multiplication factors, Eq.(1.1.1a). Then, we see how well the contributions from parsimonious indel histories alone can approximate the actual occurrence frequencies of the gap-configurations of local MSAs.
In this section, we assume that we are working with a model whose alignment probabilities are factorable, and that we are focusing on calculating the multiplication factor that comes from a single local alignment flanked by a pair of gapless columns (i.e., a pair of PASs). We will mainly work in the state space S = S II (see Section 2 of part I for the definition of the space). This means that we will focus on calculating the probability of the homology structure of each local alignment (see, e.g., Lunter et al. 2005). Let ΔL(s) be the number of sites that a sequence s ∈ S II has between the pair of PASs. We will re-assign the site numbers so that the left-and right-flanking PASs are numbered 1 and ΔL(s) + 2 , respectively, and the sites in between them are numbered 2, ..., ΔL(s) +1 . This will make it easy to apply the theory formulated thus far to the current situation. We will re-assign the ancestries υ(1) = L and υ(ΔL(s) + 2) = R to the left-and right-flanking PASs, respectively. And we will usually re-assign the ancestries υ(x) = x −1 to the sites, x = 2,..., ΔL(s) +1 , of the ancestral sequence s = s A (or the root sequence s = s Root for a MSA) in between the PASs (see Figure 1 as an illustration).
In the following subsections, we will omit the symbol " [ ] LHS " that denoted a local history set (LHS) in part I. Because we will consider a PWA or MSA region that accommodates only one local history, the resulting LHS equivalence classes are trivial classes, each of which consists only of a single history. We will also omit the appended " C Κ [ ] " to indicate the MSA region when it is obvious.

Perturbation analyses on local PWA
In a PWA, α(s A , s D ) , a gap-configuration flanked by a pair of PASs corresponds to the portion of α(s A , s D ) confined in γ κ . If we are interested only in the homology relationships among sites (i.e., homology structures (Lunter et al. 2005)), there are four broad types of gap-configurations (see Figure 2; see also Subsection 3.3 of part I (Ezawa, Graur and Landan 2015a) for complexities concerning this issue). (i) Neither the ancestral nor the descendant sequence has even a single site in between the pair of PASs (panel A of Figure 2). (ii) The ancestor has one or more site(s) but the descendant has no site in between the PASs (panel B). (iii) The ancestor has no site but the descendant has one or more sites in between (panel C). And (iv) both the ancestor and the descendant have one or more site(s) in between, but no ancestral site is related to any of the descendant sites (panel D). We will consider these cases in turn.
In case (i), the sequence states could be represented as where no indel event takes place. Therefore, in this case, the portion of the multiplication factor contributed by the fewest-indel history is: In this case, there is no history consisting only of one indel event. And each nextfewest-indel history should be a two-event history of the form, M I (1, l),M D (2, l +1) ⎡ ⎣ ⎤ ⎦ with l = 1,..., L CO , where L CO ≡ min{L I CO , L D CO } . Thus, the portion contributed by the next-fewest-indel histories is: Let s[l] ≡ s AM I (1, l) be the state resulting from the action of an insertion of l sites on s A . Then, using Eq.(4.1.1b) and Eq.(3.1.8b) of part I, each summand is calculated as: is the increment of the exit rate, which in turn is given by Eq.(2.4.1b') of part I. The second equation of Eq.(1.2.2b) follows from δR X ID (s A , s A ,τ ) = δR X ID (s D , s A ,τ ) = 0 . We could at least numerically calculate Eq.(1.2.2b) once the specific functional forms of the indel rates and the exit rates are given. For example, in a space-time-homogeneous model, like Dawg's indel model (Cartwright 2005) --- for each l = 1,..., L CO . Applying this inequality to Eq.(1.2.2a) and using another inequality, 3) Empirically, the rate of indels ( λ I + λ D ) is estimated to be at most on the order of 1/10 of the substitution rate (Lunter 2007;Cartwright 2009). Eq.(1.2.3) indicates that, in case (i), even if the elapsed time ( t F − t I ) is such that the substitution process is nearly saturated, e.g., λ S (t F − t I ) ≈ 4 , where λ S is the total substitution rate per site, the total contribution from the next-fewest-indel histories, Eq.(1.2.2a), is at most on the order of 1/10 of the contribution by the fewest-indel history, Eq.(1.2.1). Thus, we expect that the fewest-indel contribution should approximate the entire multiplication factor (Eq.(4.1.8b) in part I) very well in case (i).
Incidentally, taking advantage of the result for case (i), we can calculate the multiplication factor for a gapless PWA segment consisting of L P (> 2) PASs. For clarity, let us consider an indel model as discussed in Subsection V-1. Then, in each of the L P −1 inter-PAS positions this segment contains, a null indel history could occur independently of those in other positions. Let υ 1 P , υ 2 P , ..., υ L P P be the ancestries assigned to the L P PASs in this segment (in spatial order). Focusing on the segment, we have s A = s D = υ 1 P , υ 2 P , ..., υ L P P ⎡ ⎣ ⎤ ⎦ . Then, excluding the contributions from both ends, the total multiplication factor for this segment is expressed as: total multiplication factor for case (i) with L = υ i P and R = υ i+1 P . Under a spacehomogeneous model, we can further simplify the above product as: Here we used the uniform multiplication factor, In case (ii), we assume that the ancestral state has ΔL A sites in between the flanking PASs. Thus, the ancestral and descendant states could be represented as s A = L, 1,..., ΔL A , R ⎡ ⎣ ⎤ ⎦ and s D = L, R [ ] , respectively. As long as ΔL A ≤ L D CO , N min γ κ ; α(s A , s D ) ⎡ ⎣ ⎤ ⎦ = 1 , and there is only one fewest-indel history, M D (2, ΔL A +1) ⎡ ⎣ ⎤ ⎦ , which consists of a single event that deletes the ancestral sites in between the PASs. Therefore, the contribution by the fewest-indel history is: ---Eq.(1.2.4) Each next-fewest indel history is composed of two indel events. There are two types. Thus, in this case, the total contribution by the next-fewest indel histories is given by: Here, (1.2.5b) is the sum of contributions from the histories of type (a). And (1.2.5c) is the sum of contributions from the histories of type (b). Let be the intermediate state in each type (a) history. Then, the history's contribution is calculated as: Then, the history's contribution is calculated as: ---Eq.(1.2.5e) Eq.(1.2.4) and Eqs.(1.2.5a-e) can indeed be calculated at least numerically once the specific functional forms of the indel rates and the exit rates are given. For example, under Dawg's indel model (Eqs.(2.4.4a,b) in part I) or the "long indel" model as noted in case (i), we have δR X ID (s D , s A ,τ ) = − (λ I + λ D )ΔL A , and Eq.(1.2.4) becomes: (1.2.5d,e) in Dawg's model (and also in the "long indel" model) are calculated, respectively, as: ---Eq.(1.2.5e') Substituting Eqs.(1.2.5d',e') into Eqs.(1.2.5a,b,c), we can concretely calculate the total contribution of the next-fewest-indel histories. Figure 3 A shows the ratio of the total next-fewest-indel contribution, Eq.(1.2.5a), to the fewest-indel contribution, Eq.(1.2.4), as a function of the number of deleted ancestral sites ( ΔL A ) and the expected number of indels per site ( (λ I + λ D )(t F − t I ) ). For the figure, we used the following parameters: f I (l) = f D (l) ∝ l −1.6 , L I CO = L D CO = 500 , and λ I / λ D = 1 . The power-law behavior of the indel length distributions broadly conforms to the past empirical observations e.g., Gonnet et al. 1992;Benner et al. 1993;Gu and Li 1995;Kent et al. 2003;Zhang and Gerstein 2003;Chang and Benner 2004;Yamane et al. 2006;Fan et al. 2007). Here, the exact overall indel rate ( λ I + λ D ) does not matter, because the probabilities are invariant under the simultaneous rescaling of the rate and the time interval ( t F − t I ) that keeps the expected number of indels per site (i.e., (λ I + λ D )(t F − t I ) ) unchanged. Moreover, we confirmed that the results remain almost the same even if we use L I CO = L D CO = 5000 . Now, as indicated by Figure 3A, each curve for a fixed (λ I + λ D )(t F − t I ) reaches an asymptotic value slightly above 1 when ΔL A is sufficiently large. Thus, to define a threshold within which the fewest-indel histories alone are likely to give a decent approximation of the probability, using the point at which the ratio is 1 (unity) is risky. Here, we tentatively define the threshold as the value of ΔL A at which the ratio is 0.5. With this definition, the tentative threshold, (ΔL A ) 0.5 ( NP−ii) , is around 128, 31, 12, 6 and 3 when (λ I + λ D )(t F − t I ) is 0.01, 0.04, 0.1, 0.2 and 0.4, respectively (Table 1). Hence, we have a rough inversely proportional relationship: , under the parameter setting used here.
In case (iii), we assume that the descendant state has ΔL D sites in between the flanking PASs. Thus, the ancestral and descendant states could be represented as The history consists of a single event that inserts the descendant sites in between the PASs. As in case (ii), each next-fewest indel history is composed of two indel events, and classified into two types.  Figure 3 A and with the same value of (λ I + λ D )(t F − t I ) , their ratio with ΔL D = L is identical to that in case (ii) with ΔL A = L , because of the symmetry of the probabilities under the time reversal. Thus, the same conclusions can be drawn from Figure 3 A also on this case.
In case (iv), we assume that the ancestral and the descendant states have ΔL A and ΔL D sites, respectively, in between the flanking PASs. Thus, the ancestral and descendant states could be represented as s A = L, 1,..., ΔL A , R ⎡ ⎣ ⎤ ⎦ and ⎦ . (f) An insertion immediately on the right of the ancestral sites to be deleted, followed by the deletion, And (g) an insertion immediately on the left of the ancestral sites to be deleted, followed by the deletion, In this case, each next-fewest-indel history is composed of three indel events, and classified into one of 6 broad types: (h) two successive deletions followed by an insertion; (i) a deletion, followed by an insertion, followed by a deletion; (j) an insertion followed by two successive deletions; (k) a deletion followed by two successive insertions; (l) an insertion, followed by a deletion, followed by an insertion; and (m) two successive insertions followed by a deletion. And these six broad types can be further sub-classified into 24 sub-types, as described in Appendix A1.2. The calculations of the sum of contributions from the fewest-indel local histories and that from the next-fewest-indel local histories are also detailed in Appendix A1.2. Table 2 shows their ratios calculated for some typical configurations under the same setting as for Figure 3 A. As the table indicates, the approximation by the fewest-indel histories alone is likely to be decent as long as the expected number of indels (i.e., (λ I + λ D )(t F − t I ) ) and the gap lengths (i.e., ΔL A and ΔL D ) are at most moderate.
It is difficult to calculate the summed contributions from local histories involving more indels, especially in case (iv). We could calculate the contribution from a single local history involving any number of indels if we use the algorithm for a "trajectory likelihood" given by Miklós et al. (2004). As we exemplified in Appendix A1.2, however, it is already quite hard to enumerate all possible local indel histories even up to the next-to-minimum number of involved indels. Nevertheless, if we consider only cases (i), (ii), and (iii) under a (locally) spatially homogeneous model, we can work out systems of "exact" integral equations that could in principle provide the numerical solutions for the total sum of contributions to a multiplication factor up to a desired level of accuracy (at some time and memory expenses). In the following, we derive a system of integral equations to give multiplication factors for cases (i) and (ii). Another system of integral equations, which gives multiplication factors for cases (i) and (iii), is derived in Appendix A1.3.
Here, we assume that the indel rates are locally homogeneous, which means that the rates do not depend on the exact positions that the indels hit, as long as they are confined in the region that accommodates the local history. Thus, we assume that the indel rate is locally homogeneous and the exit rate is locally an affine function of the (local) sequence length, but that they may be non-homogeneous globally. (In terms of equations, we locally assume Eqs.(5.1.1a,b) of part I for the indel rates and Eq.(5.2.4) of part I for the local exit rate, but we assume something like Eqs. We sandwich the fundamental integral equation with s A and s D , and expand the instantaneous mutation operator Q M ID (t) using its definition (i.e., Eq.(3.1.1c) of part I supplemented by Eqs.(2.4.2b',c') of part I). Because we know that the flanking PASs, which are assigned the ancestries L and R , have never been struck by any indels, we can ignore the effects of indels that hit the PASs. And, because we are now focusing on the local alignment, we will also ignore the indels completely outside of the region delimited by the PASs. Thus, we have: ---Eq.(1.2.6) In the present setting, the number of sites between the PASs, ΔL(s) , uniquely determines the local sequence state s . Thus, we let the local states denoted as for the probability that the local state with ΔL sites in between the PASs at time t became a state with Δ ′ L sites in between at time ′ t . Then, taking advantage of the independence of the indel rates, exit rates and ) on the position of each indel ( x ), we have: (1.2.7) gives the desired system of integral equations for the "exact" probabilities,

This equation holds
for every non-negative integer ΔL A and even if we replace the initial time t I with any time in the closed interval [t I , t F ] . Thus, the equations can be solved iteratively, starting with the "zero-event approximation" of the probability, ---Eq.(1.2.7') If N ID iteration steps are performed, the resulting probability, P N ID is the summation of the probabilities over all possibly responsible local histories consisting of up to (and including) N ID indel events. After the iteration is finished, the multiplication factor will be obtained by following its definition (i.e., Eq.(4.1.1b) in part I). We have: The accuracy of the numerical solutions will depend on how finely we partition the time interval [t I , t F ]. If the interval is partitioned into N P equal-sized sub-intervals, we could in principle achieve an accuracy of O (N P ) −4 ( ) under Simpson's rule (e.g., Press et al. 1992). However, as the number of sub-intervals increases, it would take longer to complete the calculation. A naïve implementation of the aforementioned numerical iteration would have the time complexity of O N ID (L CO ) 2 (N P ) 2 ( ) and the space complexity of O L CO N P ( ) , when we want to obtain the total probabilities of local histories composed of up to N ID indels each, with ΔL A = 0,1, 2,..., ΔL max A (≤ L CO ) .
Here we set L I CO = L D CO ≡ L CO . This becomes impractically slow when either L CO or N P is large, e.g., around 1000 or greater. It is likely that N P does not have to be this large, as it would be usually enough to set N P around 100 or smaller. However, L CO will often be around 1000 or greater, indeed making the naïve algorithm too slow to be practical. Fortunately, we can avoid this problem by rewriting the recursion equation, Eq.(1.2.7'), as: Here, the "auxiliary function, ---Eq.(1.2.9b) Consider the following "two-sub-step" algorithm. In the first sub-step (in each ) . This algorithm does finish in a practical amount of time (typically on the order of an hour or shorter when implemented in Perl). But it may still be too slow to perform each time we evaluate the probability of the gap configuration of a local MSA. Good news is that a single run of the iteration algorithm inevitably calculates the probabilities for all ΔL A = 0,1, 2,..., ΔL max A (≤ L CO ) at all temporal partitioning points, .., N P −1), as well as at t = t I and t = t F . Thus, once we calculate the probabilities with a fixed set of model parameters, we could use them to calculate the probabilities of various alignments (under various phylogenetic trees), as long as the model parameters remain unchanged. In any case, the time and space complexities might be further reduced without considerably compromising the accuracy by a clever beforehand discarding of terms that are unlikely to make significant contributions to the final probabilities. Panels B and C of Figure 3 show the ratios of the multiplication factors, Eq.(1.2.8) at N ID = 1, 2, 5, 10, 20 iteration steps, to that at N ID = 50 steps.
They were calculated with ΔL max A = 300 and (λ I + λ D )(t F − t I ) N P = 0.002 , and under the same setting as that used for Figure 3 A. (We confirmed that results remain almost the same even if L CO is greater than 1000, instead of L CO = 500 used for the figure.) When N ID ≥ 2 , we actually started from N ID = 2 , at which the probabilities were calculated using Eqs.(1.2.2a,b'), Eq.(1.2.4') and Eqs.(1.2.5a,b,c,d',e'), in stead of from N ID = 0 as mentioned above, in order to enhance the accuracy of the approximation. As indicated by the panels B and C, the accuracy of the probabilities improves in a step-wise manner as the number of iterations increases. Now that the "exact" probabilities are available in cases (i) and (ii), we can use them to define more precise threshold values of ΔL A within which the approximate probabilities are expected to be quite accurate. For example, let (ΔL A ) 0.5 (1−ii) be the value of ΔL A at which the approximation by the 1-event local indel history alone accounts for 50% of the total probability of the local PWA. As shown in Table 1 Thus, we expect that the tentative threshold ( (ΔL A ) 0.5 ( NP−ii) ) should work as a slightly conservative criterion for the decency of the approximation by the fewest-indel histories alone. We could also define the threshold, (ΔL A ) 0.5 ( N ID −ii) , at which the probability calculated up to (and including) N ID iterations, i.e., ( ) , is 1/2 (=0.5) of the "exact solution." As indicated by Table 1, . This implies that, even when the fewest-indel histories alone give a poor approximation of the probability, taking account of the next-fewest-indel histories, or of the histories involving yet more indels, will considerably expand the range of ΔL A where the approximate probability is quite accurate.
Following the similar procedures, this time starting from the integral equation, Eq.(III-1.2), we can also derive a system of integral equations for the multiplication factors for cases (i) and (iii), as described in Appendix A1.3. Thanks to the symmetry of the probabilities under the time reversal, panels B and C of Figure 3 can also be interpreted as the results of numerical calculations of this equation system, under the same setting as above except that ΔL max A = 300 is replaced by ΔL max D = 300 .

Perturbation analyses on local MSA
Compared to contributions to local PWAs, those to local MSAs are much more complex. In this subsection, we consider some simple but common patterns, under a tree T with three OTUs, corresponding to the external nodes, n 1 , n 2 and n 3 . Here, we regard its single internal node as the root node n Root for simplicity (panel A of Figure 4). Let b m ( m = 1, 2, 3 ) be the branch that connects the nodes n Root and n m . Let s m ∈ S ( m = 1, 2, 3 ) be the (local) sequence state at node n m . Then, we consider the gap-configurations of the MSAs of the three sequences, α[s 1 , s 2 , s 3 ] , as well as the consistent sequence states s Root ∈ S at the root node n Root . As in the previous subsections, we focus on the portion of MSAs delimited by a pair of PASs, whose ancestries are denoted as L and R . Here we consider four typical situations (see (II) s 1 and s 2 share the identical set of sites in between the PASs, but s 3 has no site in between (panel C). (III) s 1 has a set of sites in between the PASs, but neither s 2 nor s 3 has even a single site in between (panel D). And (IV) s 1 has a set of sites in between the PASs, but s 3 has no site in between, and s 2 lacks a run of some, but not all, of contiguous sites of s 1 in between the PASs (panel E). These situations are not restricted to the 3-OTU trees but widely applicable to each portion surrounding a trivalent node of any tree topology, although they never exhaust all gap configurations. The time at n Root will be represented as t I , and the time at node n m will be represented as t F:m . The indel parameters along branch b m will be indicated by the subscript ":m ." Case (I) is represented by the external sequence states In is composed only of a no-indel history: Here we used μ P s 0 Root , s 0 Root , n Root ; C Κ ⎡ ⎣ ⎤ ⎦ = 1 . No single-event local history can result in the gap configuration in this case. Each next-fewest-indel history consists of two indels, and it can be represented as: { } , and with s Root = s 0 Root again.
Thus, the total contribution to the multiplication factor by the next-fewest-indel histories can be calculated similarly to Eqs.(1.2.2a,b). We have: Each summand can be calculated from Eq.(1.2.2b), by replacing s A with s 0 Root and also replacing the time and rate parameters with those assigned to each branch. Especially, under Dawg's model, each summand is calculated as: ---Eq.(1.3.4b) If the three branches share the same time interval and the indel rate parameters, the summed contribution from the next-fewest-indel histories, Eq.(1.3.4a), is exactly three times Eq.(1.2.2a) for a PWA. Indeed, this total contribution on a general tree can be calculated by summing Eq.(1.2.2a) (with appropriate parameters) over all branches of the tree. Following the same line of reasoning as around Eq.(1.2.3), this total contribution is roughly proportional to the summation of the squared branch lengths over all branches. This means that a richer sampling of the homologous sequences will not significantly increase, or might rather slightly decrease, this total contribution, as long as the maximum evolutionary distance remains at a similar level. Incidentally, any root sequence state of the type s Root = L, 1,..., ΔL Root , R ⎡ ⎣ ⎤ ⎦ is also consistent with consists of a single element: As in case (ii) of PWAs, each next-fewest-indel history is composed of two indel events, and there are two types of such histories. One is based on type (a) in case (ii): ˆ with l = 1,..., min{L I:3 CO , L D:3 CO − ΔL D12 } , x = 1,..., ΔL D12 +1 , and also with s Root = s 0 Root again. Thus the summed contributions over the next-fewest-indel histories is given by:  Figure 3 can also be interpreted exactly as the comparison between the fewest-indel history's contribution and the next-fewest-indel histories' total contribution in the current case. Incidentally, root sequences with some additional ancestral sites in between the PASs of s 0 Root ≡ L, 1,..., ΔL D12 , R ⎡ ⎣ ⎤ ⎦ are also consistent with α[s 1 , s 2 , s 3 ] in this case. However, such root sequence states require at least three indels each to give rise to α[s 1 , s 2 , s 3 ] . This is because the additional ancestral sites need to be deleted independently along b 1 and b 2 , even if they are deleted simultaneously with the sites with the ancestries 1,..., ΔL D12 along b 3 . Thus, in general, the indel histories consistent with such root states are expected to contribute much less to the multiplication factor. Case (III) is represented by the external sequence states In this case, the phylogenetic correctness condition does not require the root state to have any site in between the PASs. Thus we have with s Root = s 0 Root . Again, as in case (II), the contribution by this local history turns out to be exactly the same as Eq.(A1.1.1) (in Appendix A1.1) for case (iii) of PWAs, with the parameters replaced with those assigned to the branch b 1 , and with ΔL D replaced with ΔL D1 . Especially, under Dawg's model, we have: ---Eq.(1.3.10) As in case (iii) of PWAs, each next-fewest-indel history is composed of two indel events. Unlike case (iii) of PWAs, however, there are three types of such histories; two of them are similar to those in case (iii), but the other one is totally new. Specifically, the first one is based on type (c) in case (iii): with l = 1,..., min{L D:1 CO , L I:1 CO − ΔL D1 } , x = 2,..., ΔL D1 + 2 , and also with s Root = s 0 Root again.
The third one involves events along b 2 and b 3 , instead of along b 1 : It should be noted that there is only one local history of the third type. In this case, therefore, the summed contribution of the next-fewest-indel local histories is given by: ---  Figure 3 can also be interpreted as the comparison between the total contribution of these two types of next-fewest-indel histories and the fewest-indel history's contribution.  we assume the uniform distribution of the root sequence length, we have μ P s Root = s 1 , s 0 Root , n Root ; C Κ ⎡ ⎣ ⎤ ⎦ = 1. Thus, Eq.(1.3.13) is reduced to: (λ I:m + λ D:m )ΔL D1 (t F:m − t I ) 's are sufficiently smaller than 1 for all m = 1, 2, 3 . In general, as ΔL D1 gets larger, the ratio is expected to decrease, because the relative frequencies of long indels ( f I:m (ΔL D1 ) and f D:m (ΔL D1 ) ) are small in general. The ratio is expected to be much smaller than 1 in general. However, it may become quite large when the relative frequency of deletions compared to insertions (i.e., the ratio λ D:m λ I:m ) is considerably larger than 1, or when the lengths of b 2 and b 3 are much larger than that of b 1 (i.e., t F:2 − t I , t F:3 − t I >> t F:1 − t I ). Such situations are similar to those causing the "Felsenstein zone" regarding a substitution model, where a nonparsimonious substitution history at a site is most likely to occur along a tree (see, e.g., Chapter 9 of Felsenstein 2004). Under the conditions used to draw Figure 5, an indel history of the 3rd type has a probability much smaller than that of the fewest-indel history. The former is less than 5% of the latter even when as much as 0.4 indels per site are expected to occur. This is probably because the history of the 3rd type requires an exquisite spatial coordination of deletions along different branches. And the result implies that the "Felsenstein zone" of indels should generally be quite narrow, consisting of the cases where a node is connected with branches with extremely unequal lengths. Case (IV) is represented by the external sequence states, s 1 = L, 1,..., ΔL D1 , R ⎡ ⎣ ⎤ ⎦ , , and is represented as: ---Eq.(1.3.14a) The other starts with the root state s Root = s 1 = L, 1,..., ΔL D1 , R ⎡ ⎣ ⎤ ⎦ , which differs from s 0 Root (= s 2 ) . It is represented as: The portion of the multiplication factor contributed from these two fewest-indel local histories, and that contributed from the next-fewest-indel local histories, are calculated in Appendix A2. Table 3 shows the ratio of the total contribution of the next-fewest-indel local histories to that of the fewest-indel local histories for typical configurations in this case. The table indicates that the fewest-indel local histories give a decent approximation of the local MSA probability, as long as the number of descendant sites and the expected number of indels per site are at most moderate.

Goodness of approximation by past indel models
In part I (Ezawa, Graur and Landan 2015a), as well as in Section 1 of this series of paper, we established a theoretical formulation that can calculate the probabilities of given alignments (PWAs or MSAs) under a general continuous-time Markov model, which is a genuine evolutionary model describing the stochastic sequence evolution along the time axis via insertions and deletions. The Markov model we dealt with can accommodate quite biologically realistic settings, such as overlapping indels, powerlaw indel length distributions, and indel rate variation across regions. As shown in Subsection 3.1.1 of part I, the probabilities calculated via our theoretical formulation also fulfill, up to a desired order of the perturbation expansion, the Chapman-Kolmogorov (CK) equation, which is a crucial consistency condition for a genuine evolutionary model. To the best of our knowledge, the CK equation has never been satisfied by any previous probabilistic theories allowing for indels of multiple residues. Moreover, the results in Section 5 of part I indicated that even the lowestorder approximation based on the fewest-indel histories alone can approximate the probabilities of the local gap configurations of the alignments considerably well, as long as the indel lengths and the branch lengths are at most moderate. Given these ideal properties, the ab initio perturbation theory that we formulated in this series of study can be used as a sound "reference point." More precisely, other probabilistic models can be compared to it in order to examine how well they can approximate the biologically realistic probabilities of the alignments, and the conditions under which they can. In this section we will examine the "goodness of approximation" by some representative models of indels in the light of our ab initio formulation.

Geometric indel length distributions
As mentioned in Introduction, a majority of indel probabilistic models that have been used thus far are based on HMMs or transducer theories that calculates the indel component of the probability of an alignment as a product of column-to-column transition probabilities. Such models can only accommodate geometric indel length distributions, or at most linear combinations of two or more geometric distributions. However, empirical analyses revealed that the lengths of inserted/deleted sequences are distributed according to power-laws e.g., Gonnet et al. 1992;Benner et al. 1993;Gu and Li 1995;Kent et al. 2003;Zhang and Gerstein 2003;Chang and Benner 2004; The international Chimpanzee Chromosome 22 Consortium 2004; Yamane et al. 2006;Fan et al. 2007). Therefore, a question naturally arises as to up to what length a geometric distribution can decently approximate a biologically realistic power-law distribution. Here we address this question. As a reference distribution, we used a power-law distribution: with an empirically typical exponent of a = 1.6 (panel A of Figure 6). Then we fitted a scaled geometric distribution: to the reference, Eq.(2.1.1), via a least-square fitting across the lengths l = 1, 2, ..., 100 . We found that the best-fitting parameters are for l = 1, 2, ..., 25 (panel B of Figure 6). As shown in the figure, R LS (l; 1.6) is less than 1/3 at l = 5 , less than 1/10 at l = 7 , and less than 1/1000 at l = 13 . This indicates that the probabilistic model of indels based on the geometric distribution is somewhat reliable for indels that are at most 4 bases long, which account for 70.5% of all indels if we assume this power-law distribution for all indel lengths. In other words, the geometric distribution substantially underestimates the probabilities of 30% of indels that actually occurred. Although a larger value of q could prevent a rapid damping of the estimated frequency, this then causes the mis-estimation of relative frequencies among commonly occurring indels with, e.g., l = 1, 2, 3 (the dotted line in panel A of Figure 6). Hence, this simple analysis clearly demonstrates how important it is to incorporate biologically realistic indel length distributions into the evolutionary model.

Models pursuing biological realism
Some existing probabilistic models of indels, such as the "long indel" model (Miklós et al. 2004) and the model of Kim and Sinha (2007), aim to pursue more biological realism by accommodating indel length distributions that are biologically more realistic than geometric distributions. As noted in Subsection 5.1 of part I (and proved in Appendix A6 of part I), as far as each LHS equivalence class is concerned, the probability given by the recipe of Miklós et al. (2004) equals the probability calculated by our formulation under the space-and time-homogeneous model parameters (given in Eqs.(2.4.5a,b) of part I). Thus, regarding the probability of a PWA, the two methods should give the identical sum of contributions from the fewest-indel histories. And, provided that both methods can correctly enumerate all possible local histories of any given number of indels consistent with each local gap configuration, they will also give the identical sum of contributions from the histories with up to a desired number of indels. Fulfilling this last condition, however, will be an important task left for future studies. (At present, we cannot tell whether or not the algorithm of Miklós et al. (2004) can correctly enumerate indel histories, because their paper has no explicit description on this topic. So, we will not discuss their method further in this section.) Here, we specifically examine the model of Kim and Sinha (2007) in the light of our theoretical formulation. Their model is a block-wise HMM, and calculates the probability of a PWA between the ancestral and descendant sequences along a branch as a product of block-wise probabilities. In their HMM, a block is either a column of a PAS, a run of gaps in the ancestor aligned with a run of residues in the descendant, or a run of gaps in the descendant aligned with a run of residues in the ancestor. Each PWA is actually a part of a MSA of given sequences at the external nodes and one of alternative sets of sequences at internal nodes. Ancestral gaps aligned with descendant gaps are removed before evaluating the probability of a PWA. Because their purpose is to find a most likely indel history and a resulting set of consistent ancestral sequence states at internal nodes, they are not interested in an indel event that begins and/or ends in the middle of a block. Thus, they only consider those events that insert/delete the entire blocks in single steps.
We now calculate the probabilities of the local PWAs that were considered in the cases (i)-(iv) in Subsection 1.2, via the model of Kim and Sinha (2007). And we compare the results to those via our theoretical formulation under Dawg's parameters (Eqs.(2.4.4a,b) of part I). In the following, the probabilities via Kim and Sinha (2007) will be calculated according to Eq.(2) and Figure 1C of their paper, and the probabilities via our formulation will be calculated according to the prescriptions in -25 -Subsection 1.2 (and in Appendix A1). We set t F − t I = | b | in the following calculations. (Here | b | denotes the length of branch b ). Via the model of Kim and Sinha (2007), the PWA probability in case (i) is calculated as: where p I and p D are the "transition probabilities of the 'Insertion' and 'Begin deletion' states," respectively. Via our formulation, the PWA probability is: Here, Δ Dawg is the abbreviation of the "universal" factor for the indel exit rate (i.e., , the summed contribution from the n -event local histories to the multiplication factor in case (i). Especially, μ P (2) [case (i)] is concretely expressed in Eq.(1.2.2a). Now, assuming that (λ I + λ D ) | b | is sufficiently small, we expand the expression in the square brackets into the power series in λ I | b | and λ D | b |, which will collectively be denoted as λ | b | when considering the order of magnitude. From Eqs.(1.2.2a,b'), we get terms. Thus, we have: Here Pr D (l) is the "probability distribution on the deletion length ( l )," which is assumed as shared among different branches. To facilitate the comparison, we consider the ratio of the probability in case (ii) to that in case (i), which yields: Meanwhile, the probability via our formulation is: ---Eq.(2.2.6'a) Here G D (ΔL A ) is defined as: ---Eq.(2.2.6'b) Figure 7 shows the ratio G D (ΔL A ) f D (ΔL A ) as a function of ΔL A . Similarly, according the HMM of Kim and Sinha (2007), the ratio of the PWA probability in case (iii) to that in case (i) is expressed as: Here Pr I (l) is the "probability distribution on the insertion length ( l )," which also is assumed as shared among different branches. The ratio via our formulation is obtained by the power-series expansion in λ | b | of Eq.(A1. ---Eq.(2.2.8a) Here G I (ΔL D ) is defined as: ) .
---Eq.(2.2.9d) Here E X [ ] denotes the expected value of the estimated parameter X , which is the average of estimated X over all indel processes under the given set of conditions (the tree and model parameters). And we also used the notation, However, it does not actually matter so much whether or not the estimated Kim-Sinha parameters ( c I , c D , Pr I (l) and Pr D (l) ) approximate Dawg's indel parameters ( λ I , λ D , f I (l) and f D (l) ) fairly well.
, respectively, become comparable to or greater than 1 (unity). Because G D (ΔL A ) and G I (ΔL D ) are mostly contributed from next-fewest-indel local histories containing overlapping indels, we can interpret the result as follows. "Overlapping indels start to make Kim and Sinha's method poorly approximate the alignment probabilities when the involved gap is long and the branch lengths show a large variation." If there is a good reason to believe that the Dawg indel parameters ( λ I , λ D , f I (l) and f D (l) ) are shared among all branches, one way to mitigate the aforementioned effects of overlapping indels may be to set: and to fit λ I , λ D , f I (l) and f D (l) according to these equations supplemented with Eqs. (2.2.6'b,8b). Now, as indicated by Figure 7, under the power-law indel length distributions, the ratios G D (ΔL A ) f D (ΔL A ) and G I (ΔL D ) f I (ΔL D ) are less than 4 in absolute value when the gap is 300 residues long or shorter. Therefore, the 2nd-order terms will begin to substantially influence the results when 1 is larger than, say, 0.1. Such a situation will be quite rare in practical sequence analyses. Even if we encounter such a rare case, then local histories with more than 2 indels will begin to account for a substantial fraction of the probability. Considering this way, we expect that the method of Kim and Sinha (2007) will pretty well approximate the probabilities of local PWAs belonging to cases (ii) and (iii) at least within the threshold (ΔL A ) 0.5 (2−ii) defined in Subsection 1.2.
Finally, we consider case (iv). Indel histories giving rise to the local sequence states in this category are shown, e.g., in Figure 5, panels F and G of Figure 3, and panel A of Figure 6 (all of them are in part I). In such a situation, an aligner will reconstruct a MSA that is like either panel B or C of Figure 5 of part I (if the reconstruction is correct), and Kim-Sinha's method assigns a probability according to the reconstructed MSA. Whether it is like panel B or panel C, the assigned probability is the same, and its ratio to the probability of case (i) is: Via our formulation, how to calculate the probability in this case was briefly described in the middle of Subsection 1.2, and detailed in Appendix A1.2. In this case, each fewest-indel history consists of two indels, and each next-fewest-indel history consists of three indels. Because there are as many as 24 types of next-fewest-indel histories, here we only consider the fewest-indel histories. Then, the lowest-order contribution of the multiplication factor, μ P (2) [case (iv)] , is given by Eq.(A1.2.1a), supplemented with Eqs. (A1.2.1b,c,d,e',f',g'). Expanding each term into a power series in λ | b | , we get the following expression for the ratio: ---Eq.(2.2.12) In case (iv), as opposed to in cases (ii) and (iii), the PWA probabilities via the HMM of Kim and Sinha (2007) differ considerably from that via our formulation even when | b | |b| << 1 and | b |<< 1. Under these conditions, p I Pr I (ΔL D ) and p D Pr D (ΔL A ) quite ---Eq.(2.2.13) Table 4 shows this ratio for typical cases. The second term on the right hand side of Eq.(2.2.13) is the effect of overlapping indels. When ΔL A = ΔL D = 1 , this term is expected to be quite small; for example, it is about 0.167 if f I (l) = f D (l)∝ l −1.6 . And it gets more and more influential when ΔL A and/or ΔL D gets larger, and it substantially exceeds 1 (unity) in some cases (Table 4). Actually, a similar effect was incorporated in the HMM of Knudsen and Miyamoto (2003). Their HMM could only accommodate geometric indel length distributions, and consequently the relevant term was independent of ΔL A and ΔL D . Coming back to Eq.(2.2.13), the first term on the right hand side, 3/2, differs from 1 (unity) because the HMM of Kim and Sinha (2007) does not correctly take account of the non-overlapping indel histories, either. This error is actually shared by most of the standard, or nearly standard, HMMs and transducers used thus far as probabilistic models of indels (see Background of part I).
Taking these results into consideration, another possible improvement on the model of Kim and Sinha (2007) would be to modify the HMM structure so that the probability of an insertion and an immediately adjacent deletion (or that of the opposite configuration) will be given by Eq.(2.2.12), or by its extension to include the terms of higher-orders in λ | b | .

Violation of Chapman-Kolmogorov equation
The Chapman-Kolmogorov (CK) equation (Eq.(3.1.1.1) in part I), is a crucial condition that must be satisfied by genuine evolutionary models based on any continuous-time Markov models. Thus far, the CK equation was violated by most HMMs and transducer theories, except those directly derived from evolutionary models, e.g., the TKF91 model (Thorne et al. 1991). As formally shown in Appendix A3 of part I (Ezawa, Graur and Landan 2015a), our theoretical formulation could satisfy the CK equation up to any desired order of the perturbation expansion. Therefore, our formulation enables us to analytically examine the effects of the violation of the CK equation. Here, we will only consider a simplest yet non-trivial example, i.e., case (ii) in Subsection 1.2. And we will examine the HMM of Kim and Sinha (2007) because of its generality; most HMMs could be obtained from their HMM by slightly modifying the parameters and by fixing the indel length distributions to geometric ones, possibly with time-dependent gap-extension probabilities. Via the HMM of Kim and Sinha (2007), the PWA probability in case (ii) was given by Eq. (VII-2.3), with the substitutions p I = c I | b | and p D = c D | b |: In contrast, the probability in the same case via our formulation was given by Eq.(2.2.5), which can be partially expanded into the power-series in λ | b | as: Here G D (ΔL A ) was given in Eq.(2.2.6'b). And exp −Δ Dawg | b | { } is a "universal factor" that occur in the probability of any indel processes on the entire sequence (along branch b ). Eq.(2.3.2) satisfies the CK equation up to (and including)   Figure 7 indicates, this condition will hold if ΔL A is within a "critical value," which decreases as (λ I + λ D ) | b | increases. Once ΔL A exceeds the critical value, the violation of the CK equation will considerably impact on the probability estimation, and therefore on the data analyses. Under the parameter setting used for Figure 7, the critical value is ΔL cr A ≈ 300 when (λ I + λ D ) | b | ≈ 0.6 . This means that, as far as case (ii)  ) .
However, we expect that their HMM will severely violate the CK equation when it is applied to case (iii) local PWAs, in view of the results in Subsection 2.2. Some past simulation analyses (e.g., Thorne et al. 1992;Knudsen and Miyamoto 2003;Metzler 2003) seem to have concluded that the violation of the CK equation, or its cause, i.e., the failure to accommodate overlapping indels, did not seriously impact on the results of data analyses. These results might be because they used geometric distributions of indel lengths. To see if this is indeed the case, let's assume f I (l) = f D (l) = (1− q)q l−1 . Substituting them into Eq.(2.2.6'b), we have: ---Eq.(2.3.3) Here we used q 2 min{L I CO , L D CO −ΔL A } << 1 to obtain the approximation. Now, with the approximations c I ≈ λ I , c D ≈ λ D and Pr D (ΔL A ) ≈ f D (ΔL A ) , we calculate the ratio: ---Eq.(2.3.6) If ΔL A is quite large, e.g., 10 or greater, the middle term on the right hand side will predominate, and the ratio could be roughly approximated as . Thus, the "critical value" is roughly given by q)) . If we set, e.g., λ D = 0.1 and q = q LS (1.6) = 0.3957 as given above Eq. Here (ΔL A −1) comes from the number of ways in which a run of gaps can be split into two segments. Thus, its ratio to the first-order term, . This is roughly 4q ΔL A times as large as the dominant term in Eq.(2.3.6). This ratio is much smaller than 1 (unity) when ΔL A is around or greater than the above critical values. Therefore, such histories with two contiguous deletions have only negligible effects on the present argument.] In contrast, if we incorporate biologically realistic indel length distributions into indel probabilistic models that are not genuine evolutionary models, such effects may become remarkable when dealing with long indels and/or long branches. The effects, however, might be somewhat alleviated by fitting indel rate parameters and length distributions that depend on the branch length as in Eqs.(2.2.10a,b).
In a previous paper (Ezawa, Graur and Landan 2015a), we approached the fundamental problems on the probability of indels in an absolutely orthodox manner. More specifically, we established the theoretical basis of an ab initio perturbative formulation of a general continuous-time Markov model, which is a genuine stochastic model describing the evolution of an entire sequence via indels along the time axis. Using the formulation, we proved that, if the indel model parameters satisfy a certain set of conditions, the probability of an alignment is indeed factorable into the product of an overall factor and contributions from local alignments delimited by preserved ancestral sites (PASs).
In this paper, we concretely calculated the probability contributions from individual local alignments using perturbation analyses. The results indicated that even the fewest-indel terms alone can approximate the probabilities pretty well as long as the branch lengths and the indel lengths are at most moderate. And we also clarified the parameter regions where the alignment probabilities are safely approximated by the hidden Markov models (HMMs) of indels (and by association the transducer theories) that were often used in the past. We showed that the approximation by the geometric indel length distributions grossly deteriorates as soon as the indels become longer than 4 sites. We also showed that these HMMs violate the Chapman-Kolmogorov (CK) equation because they lack most of the non-fewest-indel terms in the perturbation expansion. This finding enables to predict the regions where the violation of the CK equation starts to compromise the reliability of these models. The analyses also suggested possible modifications to these models in order to improve their accuracy.
To summarize, by depending purely on the first principle, our ab initio perturbation formulation provides a sound reference point to which other indel models can be compared in order to see when and how well they can approximate the true alignment probabilities.

Implementation of the formulas
Most of the formulas used for the perturbation analyses were implemented in Perl. They are available as a part of the package named LOLIPOG (log-likelihood for the pattern of gaps in MSA), which in turn is available at the FTP repository of the Bioinformatics Organization (Ezawa 2013).

Authors' contributions
KE conceived of and mathematically formulated the theoretical framework in this paper, implemented the key algorithms, participated in designing the study, performed all the mathematical analyses, and drafted the manuscript. DG and GL participated in designing the study, helped with the interpretation of the data, and helped with the drafting of the manuscript.

A1. Perturbation calculation of multiplication factors for PWAs between ancestral and descendant sequences
A1.1. For case (iii) local PWAs Here we detail the calculation of the sum of contributions from the fewest-indel events as well as that from the next-fewest-indel events in case (iii) considered in Section 1.2 of Results, that is, the case where the ancestor has no site but the descendant has one or more sites in between a pair of preserved ancestral sites (PASs) (Figure 2 C).
In case (iii), we assume that the descendant state has ΔL D sites in between the flanking PASs. Thus, the ancestral and descendant states could be represented as consists of a single event that inserts the descendant sites in between the PASs. Therefore, the sum of contributions from the fewest-indel history is: ---Eq.(A1.1.1) As in case (ii), each next-fewest indel history is composed of two indel events, and classified into two types: x = 2,..., ΔL D + 2 . Thus, in this case, the portion contributed by the next-fewest indel histories is given by: ) is the sum of contributions from the histories of type (c), and is the sum of contributions from the histories of type (d). As in case (i), let s[ΔL D − l] ≡ s AM I (1, ΔL D − l) be the intermediate state in each type (c) history. Then, the history's contribution is calculated as: ---Eq.(A1.1.2d) Similarly, each type (d) history's contribution is calculated as: ---Eq.(A1.1.2e) Let us now calculate Eqs.(A1.1.1,2a-e) under Dawg's indel model or the "long indel" model, as in the previous cases in Section 1.2 of Results. In this model, we have δR X ID (s D , s A ,τ ) = + (λ I + λ D )ΔL D , and Eq.(A1.1.1) becomes: - ---Eq.(A1.1.2e') Substituting Eqs.(A1.1.2d',e') into Eqs. (A1.1.2a,b,c), we can concretely calculate the total contribution from next-fewest-indels.

A1.2. For case (iv) local PWAs
Here we detail the calculation of the sum of contributions from the fewest-indel events as well as that from the next-fewest-indel events in case (iv) considered in Section 1.2 of Results, that is, the case where both the ancestor and the descendant have one or more site(s) in between a pair of PASs, but no ancestral site is related to any of the descendant sites (Figure 2 D).
In case (iv), we assume that the ancestral and the descendant states have ΔL A and ΔL D sites, respectively, in between the flanking PASs. Thus, the ancestral and descendant states could be represented as s A = L, 1,..., ΔL A , R ⎡ ⎣ ⎤ ⎦ and Thus, the sum of contributions by the fewestindel histories is expressed as: ---Eq.(A1.2.1a) Here, is the contribution from the single history of type (e), 1c) is the sum of contributions from the type (f) histories, and ) is the sum of contributions from the type (g) histories. To calculate the contribution from each of these histories, let s 0 ≡ [L, R] = s AM D (2, ΔL A +1) be the intermediate state of the type (e) history. And, as in case (ii), let s A ⋅[x, +l] ≡ s AM I (x, l) be the state type including the intermediate states of the histories of types (f) and (g). Then, the type (e) history's contribution is: ---Eq.(A1.2.1e) the contribution from each history of type (f) is: ---Eq.(A1.2.1f) and each type (g) history's contribution is: ---Eq.(A1.2.1g) Under Dawg's indel model (or, similarly, under the "long indel" model), we have . Using them, Eqs.(A1.2.1e,f,g) are calculated as: ---Eqs.(A1.2.1f',g') Here, we defined the function F e− x, t ( ) as: Meanwhile, each next-fewest-indel history is composed of three indel events, and classified into one of 6 broad types: (h) two successive deletions followed by an insertion; (i) a deletion, followed by an insertion, followed by a deletion; (j) an insertion followed by two successive deletions; (k) a deletion followed by two successive insertions; (l) an insertion, followed by a deletion, followed by an insertion; and (m) two successive insertions followed by a deletion. And these 6 broad types can be further sub-classified into 24 sub-types, as explained in the following. Type (h) does not need to be sub-classified, and each history belonging to this type can be expressed as: .., ΔL A −1 and x = 2,..., ΔL A − l + 2 . Type (i) is sub-classified into three subtypes: (i-1) the case where the insertion is immediately on the right of the sites to be deleted, represented as  ...,8 ) is the sum of contributions from the histories belonging to one of the 24 sub-types explained above. Each of them can be calculated by calculating the contribution from each constituent history according to the definition, Eq.(4.1.1b) of part I supplemented by Eq.(3.1.8b) of part I, and by summing the contributions over possible values of the length variables l and ′ l (or ′′ l ) and over possible positions x when necessary, all specified above. Under a spaceand time-homogeneous model, the indel rates become independent of the positions of the indels and of the exact state before the event, and the exit rate depends only on the sequence length (and possibly on the functional form of the deletion length distribution). This simplifies the calculation considerably. Especially, because each term is independent of the exact position of indels, a summation over the positions x is reduced to a simple multiplication by the number of possible positions. In addition, considerations on spatial symmetry reveal that some different sub-types actually give identical contributions to each other (detailed below). Moreover, under Dawg's indel model or the "long indel" model, the increment of the exit rate is proportional to the difference in the sequence lengths, which further simplifies the calculation. Here, we give the results of contributions by individual histories under Dawg's indel model. But the results apply immediately to, e.g., the "long indel" model as well. We first give the general formula for the contribution of a three-event local history, and then we apply the formula to the histories of different sub-types.
First we consider a general three-event history, M 1 ,M 2 ,M 3 ⎡ ⎣ ⎤ ⎦ , on the ancestral Here M ν (with ν = 1, 2, 3 ) is either an insertion (M I (x, l) ) or a deletion (M D (x B , x E ) ). Let δl v (with ν = 1, 2, 3 ) be the sequence length change caused by the event M ν ; δl v is positive if M ν is an insertion and negative if M ν is a deletion. They satisfy: δl 1 +δl 2 +δl 3 = ΔL D − ΔL A . And let r(δl) be the space-and time-homogeneous rate of the indel event that changes the sequence length by δl . Under Dawg's indel model, we have: Then, according to Eq.(4.1.1b) of part I supplemented by Eq.(3.1.8b) of part I, the contribution from this three-event local history is expressed as: = r(δl 1 )r(δl 2 )r(δl 3 ) I 3M δl 1 ,δl 2 ,δl 3 ; t F − t I ; λ I + λ D ( ) .
---Eq.(A1.2.6a) Here, IM +2M δl 1 ,{δl 2 ,δl 3 }; t F − t I ; λ I + λ D ( ) is a triple-time integral: It is calculated as: ) is a triple-time integral: It could be calculated by taking advantage of the following relationship: Now we can apply the general formulas, Eqs(A1.2.5a,b"), Eqs.(A1.2.6a,b'), and Eqs.(A1.2.7a,b'), to specific cases. First, the contribution from a type (h) history is: ---Eq.(A1.2.8a) Second, by spatial symmetry, the contribution from a type (i-1) history is identical to that from the corresponding type (i-2) history. They are given by: ---Eqs.(A1.2.8b,c) Third, the contribution from a type (i-3) history is: ---Eq.(A1.2.8d) Fourth, by spatial symmetry, the contribution from a type (j-1) history is identical to that from the corresponding type (j-2) history. They are given by: ---Eqs.(A1.2.8e,f) Fifth, we consider the joint contribution from a type (j-3) history and the corresponding type (j-4) history. By spatial symmetry, it is identical to the joint contribution from the corresponding type (j-5) and type (j-6) histories. They are given by: ---Eqs.(A1.2.8g,h) Sixth, consider the joint contribution from a type (j-7) history and the corresponding type (j-8) history. It is given by: ---Eq.(A1.2.8i) Seventh, the contribution from a type (k) history is: ---Eq.(A1.2.8j) Eighth, by spatial symmetry, the contribution from a type (l-1) history is identical to that from the corresponding type (l-2) history. They are given by: ---Eqs.(A1.2.8k,l) Ninth, the contribution from a type (l-3) history is: ---Eq.(A1.2.8m) Tenth, by spatial symmetry, the contribution from a type (m-1) history is identical to that from the corresponding type (m-2) history. They are given by: ⎤ ⎦ , respectively, which will be denoted as s A = 0 and s D = ΔL D under the setting of (local) spatial homogeneity. The starting point here is the fundamental integral equation, Eq.(3.1.2) of part I (Ezawa, Graur and Landan 2015a), for the stochastic evolution operator P ID (t I , t F ) , instead of Eq.(3.1.4) of part I for cases (i) and (ii). Similarly to above Eq.(1.2.6), we sandwich Eq.(3.1.2) of part I with s A and s D , expand Q M ID (t) using its definition (i.e., Eq.(3.1.1c) of part I supplemented with Eqs.(2.4.2b',c') of part I), and ignore the effects of indels that are irrelevant to the cases under consideration. Then, we get: ---Eq.(A1.3.1) Here, when considering the domains of the summations, we used the fact that the ketvectors M I (x, l) s D and M D (x, x + l −1) s D correspond uniquely to the states ΔL D − l and ΔL D + l , respectively. Next, we take advantage of this fact again, take account of the local homogeneity, and use the notation, ---Eq.(A1.3.2) This gives the system of integral equations for the "exact" probabilities, ) , with non-negative integers ΔL D = 0,1, 2,... . This system of equations can be solved iteratively, again starting with the "zero-event After a desired number (say, N ID ) of iteration steps, the multiplication factor will be obtained similarly, and we have: ) . ---Eq.(A1.3.3) It should be noted that the exponent on the right-most hand side is the time integral of the exit rate of the ancestral state but not that of the descendant state. This is just due to the definition of the multiplication factor (Eq.(4.1.1b) of part I). The consideration of the time-and space-complexities of a naïve implementation of the iteration algorithm goes just as below Eq.(1.2.8) of Results. Just as Eq.(1.2.7), Eq.(A1.3.2) could also be numerically solved via a system of "two-sub-step" recursion relations: - ( ) for all ΔL D = 0,1, 2,..., ΔL max D (≤ L CO ) and at all timepoints, t = t I + i t F −t I N P with i = 0,1,..., N P .

A2. Perturbation calculation of multiplication factors for MSAs: case (IV) in Section 1.3
Here we detail the calculation of the portion of the multiplication factor contributed by the fewest-indel events as well as that contributed by the next-fewest-indel events in case (IV). This case was considered in Section 1.3 of Results, and its situation is illustrated in Figure 4 E. In this case, the external sequence states were represented as: ---Eq.(A2.2b') and ---Eq.(A2.2c') The next-fewest-indel local histories consist of three indels each, and can be broadly  and ′ υ a ≠ ′ υ b for ∀ a ≠ b (∈ {1,..., l}) . These four broad types are further sub-classified into 10 sub-types in total.
First, type (G) local histories are sub-classified into four sub-types: (G-1) the case where two successive insertions occur along b 1 , represented as ; (G-2) the case where a "long" insertion and a subsequent deletion occur along b 1 , represented as .., min{L D:1 CO , L I:1 CO − j + i +1} and x = i + 2,..., j +1 ; (G-3) the case where two successive deletions occur along b 3 , represented as Here Μ P [(X)] (with X = G −1,..., J ) represents the summed contribution from the histories over one of the aforementioned 10 sub-types. Each such term can be calculated by summing the contributions from individual local histories belonging to each sub-type over the possible lengths and possible positions, if necessary. Under a space-homogeneous model, the summation over possible positions will be reduced to the multiplication by the number of possible positions (for each fixed set of lengths  Tables 1-4   Table 1. Various "threshold gap lengths" in case (ii) of local PWAs a The expected number of indels per site. b The number of ancestral sites in between the PASs, i.e., ΔL A , at which the total probability of the next-parsimonious local indel histories is 1/2 (=0.5) of that of the parsimonious local indel histories. c (ΔL A ) 0.5 ( N ID −ii) is the value of ΔL A at which the local histories involving up to (and including) N ID indels each account for 1/2 (=0.5) of the total probability of all local histories consistent with the configuration. We used the probability calculated up to (and including) N ID = 50 indel steps as a proxy of the total probability. d A rough (inversely proportional) relationship between each threshold gap length (Y ) and the expected number of indels per site ( X ). NOTE: Each cell shows the ratio of the total probability of the next-fewest indel histories to that of the fewest indel histories, when there are ΔL A ancestral sites and ΔL D descendant sites in between the PASs. Each column is labeled with the expected number of indels per site (i.e., (λ I + λ D )(t F − t I ) ). The parameters used for this analysis are: λ I = λ D = 0.1 , L I CO = L D CO = 500 , and f I (l) = f D (l) = l −1.6 k −1.6 k=1 500 ∑ ( ) .
Because of the symmetry of the probabilities under the time reversal, the ratio for (ΔL A , ΔL D ) = (L 1 , L 2 ) is identical to that for (ΔL A , ΔL D ) = (L 2 , L 1 ) . Thus we only showed the results for ΔL A ≥ ΔL D . The ratios that are less than 0.5 are shown in boldface. NOTE: Each cell shows the ratio of the total probability of the next-fewest indel histories to that of the fewest indel histories when sequence 1 and sequence 2 have ΔL D1 sites and ΔL D2 ≡ ΔL D1 − ( j − i −1) ( < ΔL D1 ) sites, respectively, in between the PASs. Each column is labeled with the expected number of indels per site (i.e., ) , for ∀ m = 1, 2, 3 . And we used the 3-OTU trees with t F:1 = t F: 2 = t F: 3 . The ratios that are less than 0.5 are shown in boldface. NOTE: Each cell in the middle column gives the ratio, Eq.(2.2.13), of the reference probability in the fewest-indel approximation to the corresponding probability given by Kim and Sinha's HMM (2007), for a local PWA with ΔL A ancestral sites and ΔL D descendant sites in between a pair of PASs. The parameters used for this analysis are:   (Ezawa, Graur and Landan 2015a), and re-assign the ancestries as shown in panel B. The ancestries in between the PASs (with the reassigned ancestries L and R ) are just an example and not always assigned in this way. C. Sequences extracted from the MSA. Shown above each site is the site number (i.e., spatial coordinate) assigned to it. And shown on the right of each sequence is the count of sites in between the PASs. In panels A and B, as in Figure 5 of part I, the boldface characters in the leftmost columns stand for the sequence IDs. (More precisely, the number ' i ' stands for the sequence s i , and the ' R ' stands for the root sequence ( s Root ).) The results shown in all panels of this figure apply to both case (ii) and case (iii) local PWAs. In all panels, the abscissa is ΔL A in case (ii) and ΔL D in case (iii). Panel A shows the ratio of the total probability of the next-fewest-indel histories to that of the fewest-indel history. The blue, magenta, cyan, purple and green curves show the ratios when the expected numbers of indels per site, i.e., the values of (λ I + λ D )(t F − t I ) , are 0.01, 0.04, 0.1, 0.2 and 0.4, respectively. Panels B and C show the ratios of the multiplication factors calculated via the iteration formulas (either Eq.(1.2.8) or Eq.(A1.3.3) in Appendix A1), including histories with up to N ID = 1 (blue), 2 (magenta), 5 (green), 10 (purple), and 20 (cyan) indels, to the factor including histories with up to N ID = 50 indels. Panel B is for (λ I + λ D )(t F − t I ) = 0.04 indels per site, and panel C is for (λ I + λ D )(t F − t I ) = 0.2 indels per site. The following parameters were used for all panels: f I (l) = f D (l) ∝ l −1.6 , L I CO = L D CO = 500 and λ I λ D = 1 . For panels B and C, we also set the following parameters: The graph shows the ratio of the probability of the history with two deletions giving rise to a case (III) local MSA (Eq.(1.3.13')) to that of the single-insertion history (Eq.(1.3.10)). The ratio is shown as a function of the length of the gapped segment ( ΔL D1 ) and the expected number of indels per site along each branch ( (λ I: m + λ D: m )(t F: m − t I ) , identical for all m = 1, 2, 3 ). The blue, magenta, cyan, purple, and green curves show the ratios when the values of (λ I: m + λ D: m )(t F: m − t I ) are 0.01, 0.04, 0.1, 0.2 and 0.4 indels/site, respectively. We used a 3-OTU tree whose three branches are equally long. For each branch, we used the same parameters as those used for Figure 3. , gives a rough estimate of the "threshold" value of (λ I + λ D ) | b | beyond which the violation of the Chapman-Kolmogorov equation (Eq.(3.1.1.1) in part I) may severely undermine the HMM of Kim and Sinha (2007).