## Abstract

A recent study (Louca and Pennell, 2020) spotlighted the issue of model congruence, or asymptotic unidentifiability, in time-dependent birth-death models used for reconstructing species diversification histories on phylogenetic trees. The phenomenon of model congruence implies that any given timetree can be explained equally well by various diversification scenarios. The present work investigates this issue in state-dependent speciation and extinction (SSE) models, which are widely employed to study trait-dependent diversification. We demonstrate that model unidentifiability is universal for SSEs, with every SSE belonging to an infinite class of congruent models. Importantly, we find that any trait-independent model is congruent with a trait-dependent model that possesses the same or fewer parameters, raising significant concerns for model selection and hypothesis testing. To address this challenge, we propose an analytical solution and discuss its implications for comparative phylogenetics and model development. Our results indicate that while accounting for SSE congruence is critical to avoiding unidentifiability in diversification studies, no fundamental change to phylogenetic methodologies is needed. Instead, congruence opens up new opportunities for modeling trait-dependent diversification.

The dramatic variation in the species richness of clades across the tree of life has led to widespread interest in the potential drivers of diversification. State-dependent speciation and extinction (SSE) models have emerged as powerful tools for testing hypotheses about these drivers, allowing researchers to explicitly model relationships between traits and diversification rates on phylogenies (1, 2).

However, the underlying complexity of the diversification process poses a challenge when drawing reliable inferences from SSE models. Any reasonably sized clade is likely to have factors influencing diversification that are not completely captured by the model or the hypothesized trait. These hidden factors have been shown to mislead inference using SSEs (3, 4). To address this challenge, SSEs have adopted hidden states to model the heterogeneity of diversification processes, which may or may not be associated with the focal trait (4–6). While such SSEs advance our ability to reconstruct biologically realistic processes, the limitations of “hidden expansion” in modeling trait-dependent diversification for empirical data, and the optimal model complexity required for this hidden dimension, remain uncertain.

Recently, Louca and Pennell [LP] (7) demonstrated the limitations in reconstructing diversification using time-dependent birth-death models, which are similar to SSEs but do not include trait-dependent variation in rates (1). The critical limitations identified by LP stem from asymptotic model unidentifiability (8), or congruence. Specifically, LP showed that any given timetree could be explained by an infinite number of alternative diversification scenarios, all yielding identical likelihood scores. Consequently, any timetree is consistent with an infinite set of congruent time-dependent models, indistinguishable regardless of the amount of observed data, thereby limiting our ability to accurately reconstruct the true diversification process. This model congruence persists under any sampling fraction of species (7), even if it varies over time (9).

This discovery has sparked considerable interest in the phylogenetic community, leading to novel methods for addressing the problem. One group of these methods proposes a thorough exploration of congruent diversification scenarios (10–12). Another group recommends imposing mild restrictions on extinction and speciation functions, which leads to identifiable birth-death models (13, 14). Furthermore, an elaborated version of the time-dependent diversification model, the fossilized birth-death model, has also been confirmed as identifiable (15).

While LP’s study focused on time-dependent models, they proposed that a similar unidentifiability issue may arise in SSEs. Responding to this concern, a recent study mathematically confirmed the identifiability of a specific subset of pure-birth SSE models (16). Meanwhile, another study argued that SSEs are generally identifiable (17).

Contrary to previous studies, we demonstrate that model congruence is a universal phenomenon across all SSEs. To illustrate this, we employ a general model, HiClaSSE, which comprises commonly used SSEs (e.g., BiSSE, MuSSE, HiSSE, MiSSE, BiSSE-ness and ClaSSE) as its special cases. We find that congruence manifests as a consequence of hidden expansion of models that adhere to the lumpability criteria of Markov chains (18), presenting a scenario where a model with or without hidden states is congruent with an infinite set of models that include hidden states. This infinite set forms a congruence class in which models may have drastically different diversification histories and parameter counts, ranging from at least two to infinity. For example, each state-independent model is congruent with a model with the same or fewer parameters but with trait-dependent diversification. Consequently, such congruence poses a problem for model selection (7) and for testing hypotheses of trait-dependent diversification.

To address this issue, we propose a solution that minimizes the number of hidden states in the optimal model, ensuring its uniqueness within a congruence class. Importantly, our solution is available analytically, which enables efficient model selection and does not significantly change existing phylogenetic practices. Below, we discuss its implications for hypothesis testing in particular and inference with SSE models in general. Contrary to expectations, we demonstrate that SSE congruence does not limit our modeling capabilities; rather, it opens up new avenues for investigating the behavior of SSEs and improving the inference of trait-dependent diversification and diversification shifts (6). Based on these results, we conjecture that SSEs are free from problematic congruence classes such as those identified for time-dependent models. Our findings are supported by simulations with the HiClaSSE model.

## Results

### HiClaSSE and hidden states

The HiClaSSE model extends ClaSSE (19) and BiSSE-ness (20) by including hidden states, as in HiSSE (5). In addition to speciation and extinction events (1), HiClaSSE allows modeling two types of state-change events: cladogenetic events, which denote state transitions occurring with lineage bifurcation during speciation, and anagenetic events, which denote state transitions without speciation. For simplicity, speciation events with and without state changes are termed coupled and decoupled speciations, respectively. HiClaSSE may include any number of states *k*.

Like other SSEs, HiClaSSE is a joint process that combines an observable trait and a diversification component. The trait represents a Continuous Time Markov Chain (CTMC) where discrete states change over time. The diversification component is a separate CTMC whose states are unobservable diversification regimes, each equipped with a specific speciation and extinction rate to control lineage birth and death. Thus, in an SSE, every state is effectively hidden, belonging simultaneously to the trait state and the diversification regime. Observable states are modeled at the tips of the tree using ambiguous character coding; they are collections of the hidden states.

For example, in a four-state HiSSE that characterizes dependent evolution between a binary trait and a diversification process with two regimes, each observable state comprises two hidden states. Conversely, a trait-free MiSSE (6), which models diversification shifts alone, maps all hidden states onto a single observable one; thus, the trait CTMC disappears in MiSSE, leaving only the CTMC for diversification regimes. This framework allows for the construction of various SSE models. Note that our interpretation of hidden states, adopted for biological clarity and mathematical precision, may differ from previous treatments (SI Appendix).

### HiClaSSE likelihood

As for other SSE models (1, 21), the likelihood is calculated by solving a system of ODEs backward in time using two terms: *D*_{i}(*t*), the probability of a lineage being in state *i* at time *t*, and *E*_{i}(*t*), the probability of a lineage in state *i* not surviving until the present time. Representing them as column vectors, ***D*** = [*D*_{1}(*t*), ..., *D*_{k}(*t*)] and ***E*** = [*E*_{1}(*t*), ..., *E*_{k}(*t*)], yields the ODEs of Eq. 1, expressed in array notation, where “○” is the element-wise product of two arrays, ***D*** ⊗ ***E***^{T} is the outer product of two vectors, and I_{a,b} indicates a square matrix in which all diagonal elements are *a*’s while the off-diagonals are *b*’s. The speciation tensor Λ = {Λ_{1}, ..., Λ_{k}} contains *k* × *k* matrices Λ_{i} with speciation rates (λ_{ijz}). For example, the notation λ_{111} indicates a decoupled speciation event, in which one lineage in state 1 speciates into two lineages with the same state. In contrast, the notation λ_{122} indicates a coupled speciation event, in which one lineage in state 1 produces two lineages, each in state 2 (19, 21).

The extinction rates of a lineage are denoted by the vector ***μ*** = [*μ*_{1}(*t*), ..., *μ*_{k}(*t*)]. The instantaneous rate matrix *Q* defines anagenetic transitions (*q*_{ij}) within a CTMC that governs the joint evolution of the trait states and diversification regimes (details in SI Appendix). HiClaSSE is a general model that includes various SSEs as submodels. For instance, HiClaSSE with two states and no cladogenetic events is the BiSSE model (1).
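For orientation, the two-state special case without cladogenetic events (BiSSE) has a well-known explicit form (1). The sketch below writes it in the notation used here, under the assumptions that λ_i stands for the decoupled rate λ_{iii}, that time *t* runs backward from the present, and that *i* ≠ *j* ∈ {1, 2}. It is the standard BiSSE system, shown only as an illustration, and not the general array form of Eq. 1.

```latex
\begin{aligned}
\frac{dD_i}{dt} &= -(\lambda_i + \mu_i + q_{ij})\,D_i(t) + q_{ij}\,D_j(t) + 2\lambda_i\,E_i(t)\,D_i(t),\\[4pt]
\frac{dE_i}{dt} &= \mu_i - (\lambda_i + \mu_i + q_{ij})\,E_i(t) + q_{ij}\,E_j(t) + \lambda_i\,E_i(t)^2 .
\end{aligned}
```

With complete sampling, the system is started at a tip with *D*_{i}(0) = 1 for the observed state (0 otherwise) and *E*_{i}(0) = 0.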

### Correlated evolution in SSE models

In SSEs, the parameters determine whether the diversification component and the trait are independent or correlated. The parameterizations corresponding to trait-dependent diversification were initially identified for simple models (e.g., BiSSE, ClaSSE). In these models, trait-dependence occurs when: (**C1**) there is variation in speciation and/or extinction rates between observable states, as in BiSSE (1); or (**C2**) a model includes cladogenetic events, as in ClaSSE and BiSSE-ness, indicating that speciation is coupled with a trait change (19, 20). However, this list is not exhaustive. It is often overlooked that SSEs with hidden states may have additional parameterizations implying trait-dependence. While these parameterizations were initially formulated for binary traits (22), they can be readily extended to all SSEs. Specifically, they are applicable when the condition (**C1**) is met, and they arise in the following situations: (**C3**) when the rate of change in the diversification regime depends on the trait; (**C4**) when changes in both the trait and regime are interdependent; and (**C5**) when changes in the trait and regime occur simultaneously, typically when dual state transitions are allowed (SI Appendix).

### Lumpability and Model Congruence

Our construction of congruent models is based on the model lumpability principle (18, 23), which defines when the number of states in an original SSE may be reduced by aggregating two or more states; if the aggregated process results in an SSE with fewer states, then the original SSE is referred to as “lumpable” for that state grouping. Note that lumpable aggregations, leading to an SSE with fewer states, depend on specific rate symmetries in the original SSE (SI Appendix); without these symmetries, state aggregation is impossible, meaning that an SSE with fewer states does not exist.

Model lumpability was initially formulated for standard CTMCs (18). In contrast, SSEs require distinct lumpability conditions because of the diversification component. We begin by deriving these conditions, and then show how they can be used to create congruent models. To derive them, we applied the CTMC lumpability principle (18) to the SSE’s likelihood function in Eq. 1 (SI Appendix).

Specifically, an SSE model is lumpable if the following three conditions are met simultaneously (SI Appendix): (**L1**) the rate matrix *Q* is lumpable, which implies that for any two aggregated states *ŝ*_{i} and *ŝ*_{j}, the total transition rate into *ŝ*_{j} is the same for every original state *s*_{m} in *ŝ*_{i}. These total rates become the transition rates in the lumped *Q* (18); (**L2**) all *μ*’s within each partition set must be equal, and the lumped extinction rate is that value; (**L3**) the speciation tensor presents a more complex situation. In this case, each lumped speciation rate is composed of the original rates λ_{ijz}, which can be grouped based on the parent branch state (i.e., *i*). So, the speciation tensor is lumpable if the sum of the original λ’s in each group is the same. The lumped speciation rate is then equal to this common sum.
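The three conditions can be checked mechanically for a given partition of states. The following is a minimal sketch in plain Python, not the authors' implementation: the function name `lump_sse`, the nested-list data layout, and all rate values are hypothetical. In particular, **L3** is read here as requiring that, for every pair of daughter blocks, the summed λ_{ijz} is the same for each parent state *i* in a block, which is one interpretation of the condition stated above.

```python
from itertools import product


def shared_value(values, tol=1e-12):
    """Return the common value if all entries agree within tol, else None."""
    values = list(values)
    first = values[0]
    return first if all(abs(v - first) <= tol for v in values) else None


def lump_sse(Q, mu, lam, partition, tol=1e-12):
    """Try to lump an SSE over `partition` (a list of lists of state indices).

    Q   : k x k anagenetic rates (diagonal entries are never read)
    mu  : length-k extinction rates
    lam : k x k x k speciation rates, lam[i][j][z] = parent i -> daughters j, z
    Returns (Q_hat, mu_hat, lam_hat) if L1-L3 hold, otherwise None.
    """
    n = len(partition)
    Q_hat = [[0.0] * n for _ in range(n)]
    mu_hat = [0.0] * n
    lam_hat = [[[0.0] * n for _ in range(n)] for _ in range(n)]

    for a, block in enumerate(partition):
        # L2: extinction rates must be equal within a block.
        m = shared_value((mu[s] for s in block), tol)
        if m is None:
            return None
        mu_hat[a] = m

        # L1: each state of the block has the same total rate into any other block.
        for b, target in enumerate(partition):
            if a == b:
                continue
            q = shared_value((sum(Q[s][t] for t in target) for s in block), tol)
            if q is None:
                return None
            Q_hat[a][b] = q

        # L3 (as read here): same summed speciation rate into each pair of blocks.
        for b, c in product(range(n), repeat=2):
            total = shared_value(
                (sum(lam[s][j][z] for j in partition[b] for z in partition[c])
                 for s in block), tol)
            if total is None:
                return None
            lam_hat[a][b][c] = total

    for a in range(n):
        Q_hat[a][a] = -sum(Q_hat[a][b] for b in range(n) if b != a)
    return Q_hat, mu_hat, lam_hat


# Hypothetical three-state model (A=0, B=1, C=2); all values are illustrative.
Q = [[0.0, 0.1, 0.1],
     [0.2, 0.0, 0.0],
     [0.2, 0.0, 0.0]]
mu = [0.01, 0.01, 0.01]
lam = [[[0.0] * 3 for _ in range(3)] for _ in range(3)]
lam[0][0][0], lam[1][1][1], lam[2][2][2] = 0.3, 0.1, 0.1

lumped = lump_sse(Q, mu, lam, [[0], [1, 2]])  # lumpable: a two-state model
Q[2][0] = 0.3                                  # C -> A now differs from B -> A
broken = lump_sse(Q, mu, lam, [[0], [1, 2]])  # violates L1: None
```

Returning the lumped arrays rather than a boolean makes the same function usable for reducing a model to fewer states when the conditions hold.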

Our method for creating congruent models involves aggregating hidden states linked to the same observable state in a lumpable SSE. This operation reduces the total number of states without altering the observable ones. As a result, both the original and aggregated SSE models become congruent (denoted by “≅”) and phylogenetically indistinguishable since they share the same observable states and likelihoods (SI Appendix). Typically, we apply this method in reverse: starting with an SSE model *S*_{1}, we add hidden states to generate a new SSE model *S*_{2}; when *S*_{2} is lumped, it yields *S*_{1}, making *S*_{1} and *S*_{2} congruent.

As an example, consider the trait-free MiSSE model (*M*_{1} in Eq. 2) with four parameters: two speciation rates *λ*_{AAA} and *λ*_{BBB}, a single transition rate *q* between the diversification regimes, and a constant extinction rate *μ*. This model is congruent with another model, *M*_{2} (Eq. 2), in which there are three diversification regimes, but the speciation rates for the regimes *B* and *C* are the same (i.e., λ_{BBB} = λ_{CCC}). Clearly, *M*_{2} is lumpable given the state partition {{*A*}, {*B, C*}}, as it meets the lumpability conditions (**L1-3**). Since all states in *M*_{2} are hidden, it collapses to *M*_{1} under the lumping.

In contrast, a slightly different model *M*_{3} (Eq. 2) is not congruent with either *M*_{1} or *M*_{2}, as its rate matrix violates the lumpability condition (**L1**) for the partition {{*A*}, {*B, C*}} because the transition rates *B → A* and *C → A* are not equal. Therefore, although all three models in Eq. 2 have the same parameter count, *M*_{3} stands as a distinct model in its own right. Note that the states of *M*_{3} cannot be reduced to a two-state model. While sharing the same speciation rates as *M*_{1}, *M*_{3} displays distinct diversification dynamics. Consequently, the pattern of speciation and extinction events on a phylogeny will be different too.
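The difference between a lumpable and a non-lumpable expansion can also be seen numerically. The sketch below integrates the *D* and *E* equations along a single branch for three trait-free models with decoupled speciation only, using a hand-written RK4 stepper. It is an illustration under stated assumptions: the rate values are hypothetical stand-ins rather than the entries of Eq. 2, and the equations are the standard multi-state form without cladogenetic state change.

```python
def rhs(y, lam, mu, Q):
    """Right-hand side of the D/E equations (decoupled speciation only)."""
    k = len(lam)
    D, E = y[:k], y[k:]
    dD, dE = [], []
    for i in range(k):
        others = [j for j in range(k) if j != i]
        leave = lam[i] + mu[i] + sum(Q[i][j] for j in others)
        dD.append(-leave * D[i]
                  + sum(Q[i][j] * D[j] for j in others)
                  + 2.0 * lam[i] * E[i] * D[i])
        dE.append(mu[i] - leave * E[i]
                  + sum(Q[i][j] * E[j] for j in others)
                  + lam[i] * E[i] ** 2)
    return dD + dE


def branch_D(lam, mu, Q, t_end=5.0, steps=2000):
    """Classical RK4 from a trait-free tip: D_i(0) = 1 and E_i(0) = 0."""
    k = len(lam)
    y = [1.0] * k + [0.0] * k
    h = t_end / steps
    for _ in range(steps):
        k1 = rhs(y, lam, mu, Q)
        k2 = rhs([a + 0.5 * h * b for a, b in zip(y, k1)], lam, mu, Q)
        k3 = rhs([a + 0.5 * h * b for a, b in zip(y, k2)], lam, mu, Q)
        k4 = rhs([a + h * b for a, b in zip(y, k3)], lam, mu, Q)
        y = [a + h / 6.0 * (p + 2 * q + 2 * r + s)
             for a, p, q, r, s in zip(y, k1, k2, k3, k4)]
    return y[:k]


# Hypothetical rates; states are ordered A, B (M1) or A, B, C (M2, M3).
M1 = dict(lam=[0.3, 0.1], mu=[0.01, 0.01],
          Q=[[0, 0.2], [0.2, 0]])
M2 = dict(lam=[0.3, 0.1, 0.1], mu=[0.01] * 3,           # B->A equals C->A
          Q=[[0, 0.1, 0.1], [0.2, 0, 0.2], [0.2, 0.2, 0]])
M3 = dict(lam=[0.3, 0.1, 0.1], mu=[0.01] * 3,           # B->A differs from C->A
          Q=[[0, 0.1, 0.1], [0.2, 0, 0.2], [0.1, 0.2, 0]])

d1, d2, d3 = (branch_D(**m) for m in (M1, M2, M3))
print("M1:", d1)
print("M2:", d2)   # D_A matches M1, and D_B equals D_C
print("M3:", d3)   # departs from M1 and M2
```

In this setup the lumpable three-state model reproduces the two-state branch probabilities up to floating-point error, whereas the model with unequal *B → A* and *C → A* rates does not.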

### Hidden Expansion (HE)

In the example above, we illustrated how adding hidden states to a model creates a congruent model through a process termed “hidden expansion” (HE). While our example above is a fairly trivial HE, this technique can be applied universally to any SSE to generate a congruent model, including those mapping to biologically distinct interpretations, which we will demonstrate in the next sections. For instance, *M*_{2} can be further expanded, maintaining lumpability, to construct models with any number of hidden states. Given the limitless possibilities for maintaining lumpability, countless implementations of HE exist. Therefore, every SSE model is part of an infinite congruence class, a phenomenon we define as HE-congruence. Also, this congruence holds under any sampling fraction, which specifies the estimated proportion of extant species that are sampled (SI Appendix).
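One simple way to realize a hidden expansion is to split a state into two hidden copies while preserving the rate symmetries that lumpability requires. The sketch below is a hypothetical helper (`split_state` is not a function from the paper or from any SSE package) and handles only models with one decoupled speciation rate per state. Rates entering the split state are divided between the copies, rates leaving it are duplicated, and the rate between the copies is free.

```python
def split_state(Q, lam, mu, s, w=0.5, swap=0.0):
    """Split state s into two hidden copies: s itself and a new last state.

    Incoming rates q_is are shared between the copies in proportions w and
    1 - w, outgoing rates and the speciation/extinction rates are duplicated,
    and `swap` is the (arbitrary) rate between the two copies.
    """
    k = len(Q)
    new = [[0.0] * (k + 1) for _ in range(k + 1)]
    for i in range(k):
        for j in range(k):
            if i != j:
                new[i][j] = Q[i][j]
    for i in range(k):
        if i != s:
            new[i][s] = w * Q[i][s]           # share kept by the original copy
            new[i][k] = (1.0 - w) * Q[i][s]   # share sent to the new copy
            new[k][i] = Q[s][i]               # the new copy leaves like s does
    new[s][k] = new[k][s] = swap
    for i in range(k + 1):
        new[i][i] = -sum(new[i][j] for j in range(k + 1) if j != i)
    return new, lam + [lam[s]], mu + [mu[s]]


# Hypothetical two-regime model (A=0, B=1), expanded twice.
Q1 = [[-0.2, 0.2], [0.2, -0.2]]
Q2, lam2, mu2 = split_state(Q1, [0.3, 0.1], [0.01, 0.01], s=1)
Q3, lam3, mu3 = split_state(Q2, lam2, mu2, s=1, w=0.25, swap=0.05)

# Lumping the copies of B back together recovers the original rates.
print(sum(Q3[0][1:]), [row[0] for row in Q3[1:]])
```

Because `w` and `swap` can take any admissible value and the split can be repeated, the same starting model yields arbitrarily many expanded models that all lump back to it.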

### Congruence between independent and dependent SSEs

An SSE with trait-independent diversification is always congruent with a model implying dependence. Take, for example, the CID4 model, which represents trait-independent diversification with two regimes (5). Assume the model has four parameters: two speciation rates, a single rate for state transitions, and a constant extinction rate (Table 1; Fig. 3; SI Appendix). Note that the numeral in the model name denotes the count of hidden states.

As with *M*_{1} (Eq. 2), we can apply HE to CID4 by adding extra states to the diversification regimes. Let us assume that this results in the CID8 model, which is congruent with CID4 and has an identical parameter count (Table 1; SI Appendix).

Now, we can begin modifying CID8 while keeping it lumpable to CID4. One way to achieve this is by enabling dual transitions between the trait and diversification regimes. This transformation results in a model with trait-correlated diversification, which we call COR8-C. Yet COR8-C remains congruent with CID4 (Table 1; Fig. 3C; SI Appendix). To illustrate that this congruence is solely due to lumpability, we can invalidate it by prohibiting certain (but not all) dual transitions. This creates a non-congruent model, COR8-NC, that maintains the correlation but displays distinct diversification dynamics; this model is not lumpable (Fig. 3D; Table S1).

A different congruent model can be created by disabling all decoupled speciation events (λ_{xxx}’s) and permitting coupled events that adhere to lumpability principles. This transformation results in CLA8-C, which is similar to COR8-C, but all speciation events coincide with trait changes (Table 1; Fig. 3E; SI Appendix). As before, this congruence occurs solely due to lumpability. If, instead, we allow “non-lumpable” cladogenetic events, we obtain a non-congruent model, such as CLA8-NC, which has distinct dynamics but a biological interpretation similar to that of CLA8-C (Table 1; SI Appendix).

HE can generate congruent and correlated models from any CID model; its parameterization determines the state count of a congruent model. The resulting congruence class is also infinite. For instance, we can expand CID4 to CID16 instead of CID8, and apply similar modifications.

### Equal Rate Hidden Expansion (EHE)

HE can be extended to the point where all permissible rates are equal, except extinctions. We refer to this transformation as Equal Rate Hidden Expansion (EHE).

Consider, for instance, a CID4 model with four parameter values: *λ*_{1} = 0.3, *λ*_{2} = 0.1, *q* = 0.2, and *μ* = 0.01. Applying the EHE to this model results in the EHE8-C, where the diversification is trait-dependent, and coupled speciation events are allowed (Table 1). The EHE8-C has only two parameters compared to the four in the CID4; and it is akin to CLA8-C but all its *λ*’s and *q*’s have the same value of 0.1 (SI Appendix), while extinction rate remains unchanged (*μ* = 0.01).

The EHE transformation typically includes coupled speciation events along with a trait-dependent *Q* matrix. It reduces the number of parameters to the number of extinction categories plus one. The EHE can be applied to any SSE model, regardless of the state count or rate values. This universality arises from the properties of rational numbers: any finite set of rational numbers can be expressed as products of integers *z*_{i} and a common rational multiplier *r*. If the rates are not rational, they can be approximated to the desired precision. As a result, the set of parameters (*θ*’s) in an original SSE, excluding extinction, expands as {*θ*_{1} = *z*_{1}*r, θ*_{2} = *z*_{2}*r*, … *θ*_{n} = *z*_{n}*r*}. The common factor *r* serves as the equal rate, while the integers *z*_{i} represent additional rate terms in the EHE. Therefore, an SSE model that is sufficiently large to embed these additional rates represents the EHE transformation. We present a general algorithm that can be used to create EHE models (SI Appendix).
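The factorization of the rates into a common multiplier and integer terms can be illustrated with exact rational arithmetic. This is a minimal sketch of that arithmetic step only, not the general EHE algorithm of the SI Appendix; the function name `common_rate` is hypothetical, and it returns the largest admissible *r*.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def common_rate(rates):
    """Return (r, z) with rates[i] == z[i] * r, z[i] integers, r as large as possible."""
    fracs = [Fraction(str(x)) for x in rates]          # exact decimal values
    den = reduce(lambda a, b: a * b // gcd(a, b),      # least common denominator
                 (f.denominator for f in fracs))
    nums = [int(f * den) for f in fracs]               # integer numerators over den
    g = reduce(gcd, nums)
    return Fraction(g, den), [n // g for n in nums]


# Rates of the CID4 example (lambda_1, lambda_2, q); extinction is excluded.
r_exact, z_exact = common_rate([0.3, 0.1, 0.2])

# Unrounded estimates need a much smaller r and much larger integer terms.
r_ml, z_ml = common_rate([0.336, 0.138, 0.229])

print(r_exact, z_exact)
print(r_ml, z_ml)
```

Any *r* divided by a positive integer is also a valid common multiplier, with correspondingly larger integers *z*_{i}.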

For a given SSE model, EHE generates an infinite congruence class because there is an infinite set of multipliers *r* to use for the same original model. The choice of *r* determines the state count in the EHE model.

### Semi-Congruent Behavior

A universal parametric form of EHE for a given SSE does not exist because the number of hidden states in EHE depends on the original rate values (SI Appendix). For example, varying parameter values in the CID4 from the previous section would result in different EHE models.

Consider fitting the CID4 model, from the previous example, to data that were originally generated by it. The maximum likelihood (ML) estimates might be: *λ*_{1} = 0.336, *λ*_{2} = 0.138, *q* = 0.229, and *μ* = 0.0139. Applying an EHE transformation to these estimates would require thousands of hidden states because of the many decimal places, which are present in all ML estimates. However, instead of using the true generating EHE model, let us fit EHE8-C (Table 1), which would be congruent if we rounded the estimates to one decimal place. While EHE8-C is expected to yield a worse likelihood, it might be preferred by the Akaike Information Criterion (AIC) due to having fewer parameters. This exemplifies semi-congruent behavior, which may occur when there is no exact match between the SSE and EHE models.

Our simulations show that semi-congruence can lead to flawed model selection, regardless of the amount of data. For example, in 55-64% of the simulations, a semi-congruent EHE8-C received positive support (ΔAIC > 2) over CID4, despite CID4 consistently having a better likelihood (Fig. 1).

For any SSE, a semi-congruent model can be constructed by reducing the respective EHE. The extent to which AIC favors a semi-congruent model over the original one depends on their parameter difference. The maximum possible AIC support (*AIC*_{m}) for the semi-congruent model is deterministic and is calculated as *AIC*_{m} = 2(*p*_{1} − *p*_{2}), where *p*_{1} and *p*_{2} represent the number of parameters in the compared models, with *p*_{1} ≥ *p*_{2}.
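The bound follows from the definition AIC = 2*p* − 2 ln *L*: the support for the smaller model is ΔAIC = 2(*p*_{1} − *p*_{2}) − 2(ln *L*_{1} − ln *L*_{2}), which cannot exceed 2(*p*_{1} − *p*_{2}) as long as the original model's likelihood is at least as high. A small sketch of this arithmetic follows; the log-likelihood values are made up for illustration and are not results from the simulations.

```python
def aic(log_lik, n_params):
    """Akaike Information Criterion: 2p - 2 ln L."""
    return 2 * n_params - 2 * log_lik


def delta_aic(log_lik_1, p1, log_lik_2, p2):
    """AIC support for model 2 over model 1 (positive values favour model 2)."""
    return aic(log_lik_1, p1) - aic(log_lik_2, p2)


def max_support(p1, p2):
    """Upper bound on delta_aic when model 1 has the higher likelihood."""
    return 2 * (p1 - p2)


# Hypothetical fit: a four-parameter original model against a two-parameter
# semi-congruent model whose log-likelihood is slightly worse.
support = delta_aic(log_lik_1=-100.0, p1=4, log_lik_2=-100.5, p2=2)
bound = max_support(4, 2)
print(support, bound)
```

With these made-up numbers the two-parameter model is preferred by more than 2 AIC units even though its likelihood is lower, and the support stays below the bound of 4.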

### The Structure of a Congruence Class

Each congruence class, generated by HE, consists of a unique non-lumpable model called the irreducible model (*M*_{0}) and an infinite set of lumpable models created by adding hidden states to *M*_{0} (Fig. 2). Thus, all other models in the class have more hidden states than *M*_{0}. For instance, CID4 and BiSSE are irreducible models with and without hidden states, respectively. In addition to the HE explained earlier, which leads to congruent models with the same or fewer parameters, it is straightforward to create models with more parameters.

We identified three main types of HE: sub-expansion (e.g., EHE), equivalent expansion (e.g., CID8, COR8-C, CLA8-C), and super-expansion. These types create models with fewer, the same, or more parameters in the off-diagonal blocks of the *Q* matrix, respectively. The off-diagonal parameters are essential for defining the parametric form (SI Appendix).

Sub-expansion models depend on specific parameter values in *M*_{0}. Each parameter setup corresponds to a distinct sub-expansion model, usually with fewer parameters than *M*_{0}. *M*_{0} is not directly congruent with individual sub-expansion models but rather with sets of these models (e.g., *M*_{0} ≅ [*EHE*]).

## Discussion

By using the general HiClaSSE model, we demonstrated that any SSE is a member of a congruence class, indicating that there is an infinite set of models which always produce identical likelihoods irrespective of the amount of data. Thus, the congruence reported by LP in time-dependent models (7) is also ubiquitous in SSEs. This finding contradicts the previous claim that SSEs are identifiable (17), as that study did not consider hidden expansion (HE). Nevertheless, HE is and will continue to be a critical component of the evolutionary biologist’s SSE modeling toolkit (3–5), necessitating a fuller understanding of the effects of congruence on biological inference and statistical practice.

We have identified HE as the primary cause of SSE congruence. This congruence arises in models with hidden states that meet specific lumpability conditions and renders these models congruent with other models with or without hidden states. HE congruence classes are akin to those found by LP, as both share a similar invariance in their rates. Namely, the pulled speciation and diversification rates are invariant within LP classes, just as the distribution of individual rates with respect to observable states is invariant within HE classes. Note that this congruence also holds in models for ancestral state reconstruction alone, as they are submodels of SSEs with no diversification.

We showed that congruent models can vary in parameter count from at least two (sub-expansions) to infinity (super-expansions), and imply alternative evolutionary scenarios. In particular, any trait-independent CID model is congruent with an infinite set of trait-dependent models with equal or fewer parameters. Similarly, since congruence is symmetric, there are some trait-dependent models that cannot be distinguished from a trait-independent generating model. This raises four important questions: does congruence mislead hypothesis testing of trait-dependent diversification? If so, can it be resolved? How significantly will it alter existing phylogenetic practices? And what insights can we gain from congruence to enhance the existing practices?

### Conventional Model Selection Fails

The standard practice for testing trait-dependent diversification involves fitting several SSE models corresponding to trait-independent and -dependent evolution and then assessing the fit of the alternative hypotheses using, for example, ΔAIC or AIC weights (4). Typically, a few default models provided by available software are used in practice. For instance, these models can vary from BiSSE to CID and HiSSE, with a variable number of diversification regimes.

Consider a scenario where the data were generated under the trait-independent CID4 model (Table 1, Fig. 3A-B), and the amount of data is sufficient for reliable inference. Based on AIC, the original CID4 would almost certainly have the best fit when alternative hypotheses are tested in accordance with the standard practice. However, we showed that there exists an infinite set of models congruent with CID4, which suggest a different plausible scenario – trait dependence. These models are not included in the standard set of SSEs, but they can be analytically derived from CID4 by applying the HE and lumpability principles (SI Appendix). They range from simple trait-dependent evolution as in COR8-C (Fig. 3C), to scenarios where speciation is coupled with state changes, as in EHE8-C and CLA8-C (Fig. 3E). Compared with CID4, they may have equal AIC due to the identical parameter count (COR8-C and CLA8-C) or even better AIC due to fewer parameters as in EHE8-C.

The EHE (Equal Rate Hidden Expansion) models are interesting as they may include thousands of hidden states, which makes them computationally infeasible in practice. However, our analytical approach allows their derivation without fitting them directly to data (SI Appendix). Therefore, any SSE can be represented by a congruent trait-dependent EHE with at most two parameters, which will result in a better fit than a CID if the CID has more parameters. Thus, AIC fails to select the true scenario from a set of congruent models.

Furthermore, there exist semi-congruent models (Fig. 1), which belong to a different congruence class but may exhibit a better AIC than CID4 because they have fewer parameters and a similar likelihood; such models can also be derived analytically from CID4. Semi-congruent models imply trait-dependent diversification too, with support (*AIC*_{m}) depending on data and parameter differences compared to the original process. When the original process has four or more parameters, it is possible to obtain a semi-congruent model with positive to substantial ΔAIC support. This phenomenon was common in our simulations: the semi-congruent model was favored in 55-64% of trials (Fig. 1). Consequently, AIC frequently fails to select the true scenario if models from a different congruence class are used.

Note, the aforementioned findings are equally applicable to other model selection approaches, such as the Bayesian Information Criterion and the Bayes factor.

### Solution to model selection

We propose applying the parsimony principle when selecting among congruent models, favoring a model that explains rate variation with the minimal number of hidden states. This leads to one unique solution for each congruence class, known as the irreducible model *M*_{0} (Fig. 2). It can be found analytically by lumping the states in a reducible model as described above. In the previous example, this model was CID4, and all other (semi-)congruent models can be lumped into it. Consequently, this parsimony principle selects the true model here.

The uniqueness of *M*_{0} within its congruence class renders model selection unambiguous. Thus, the proposed parsimony principle is merely a practical assumption for hypothesis testing with a minimum set of restrictions. It also implies that if the true process were evolving under a different model from the congruence class (not *M*_{0}), we would never be able to uniquely identify it. However, we admit that other models within a congruence class may more accurately represent the true biological process of diversification. In certain cases, it may be necessary to consider factors other than parsimony in order to construct biologically plausible scenarios, making reducible models within a congruence class more credible than *M*_{0}.

Our main recommendation is to always use irreducible models for model comparison in practice to avoid potential unidentifiability issues. When comparing models from different congruence classes, one needs to ensure that they are irreducible too, using the lumpability conditions. This is trivial if a model has no hidden states because all such models are irreducible.

Although various reducible models can be constructed in RevBayes (21, 24), and those with decoupled speciation in HiSSE (5), we did not identify any reducible models among the standard SSEs provided by these software packages. This suggests that, in general, previous results do not require reevaluation in light of the identified congruence. To avoid unidentifiability pitfalls, users should nevertheless be vigilant about congruence when using SSEs, especially non-standard ones, for hypothesis testing. This vigilance does not necessitate a significant shift in phylogenetic practices.

### Implications

Lumpability and congruence have proven useful in modeling trait evolution alone (23, 25), and they are equally important for analyzing the probabilistic behavior of SSEs. Lumpability helps in differentiating between SSEs that appear distinct yet are fundamentally identical (i.e., congruent) and those that are genuinely different (i.e., non-congruent). Therefore, the concept of lumpability is crucial for constructing distinct SSEs when testing trait-dependent diversification and reconstructing diversification shifts alone, especially if the number of hidden states varies within the set of models selected for an analysis.

Our MiSSE example shows how lumpability helps create various models with varying hidden states and diversification rates (e.g., *M*_{1} vs. *M*_{3} in Eq. 2), indicating that the range of MiSSEs is wider than originally proposed (6). These distinct MiSSEs exhibit different diversification dynamics because the models assume different sequences of diversification events across phylogenetic trees and, accordingly, different probabilities of character patterns at tips.

This insight extends to other SSEs, revealing that there is a wide range of trait-independent models that have not been explored. These models can be created by combining a variety of irreducible MiSSEs with different irreducible trait models into a joint, independent process. Similarly, unconventional trait-dependent models exist as well (e.g., Fig. 3D). Thus, lumpability broadens the range of possible SSEs, suggesting that the complex dynamics of diversification might not be adequately captured by the limited set of standard models. This allows for the customization of SSEs to accommodate the varying complexity of diversification processes, thereby improving statistical inference.
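A joint, independent process of this kind can be assembled with the standard construction for two independent CTMCs, the Kronecker sum of their rate matrices. The sketch below is an illustration under stated assumptions: the function name `kron_sum`, the state ordering, and the rate values are hypothetical, and only the anagenetic matrix *Q* and regime-specific speciation rates are built.

```python
def kron_sum(Q_trait, Q_regime):
    """Joint rate matrix of two independent CTMCs (Kronecker sum).

    The joint state (trait a, regime r) has index a * n_r + r. Only one
    component changes at a time, so dual transitions have rate zero.
    Both inputs are assumed to be proper rate matrices (rows sum to zero).
    """
    n_t, n_r = len(Q_trait), len(Q_regime)
    n = n_t * n_r
    Q = [[0.0] * n for _ in range(n)]
    for a in range(n_t):
        for r in range(n_r):
            i = a * n_r + r
            for b in range(n_t):
                if b != a:
                    Q[i][b * n_r + r] = Q_trait[a][b]    # trait changes, regime fixed
            for s in range(n_r):
                if s != r:
                    Q[i][a * n_r + s] = Q_regime[r][s]   # regime changes, trait fixed
            Q[i][i] = Q_trait[a][a] + Q_regime[r][r]
    return Q


# Hypothetical binary trait and two-regime model with illustrative rates.
Q_trait = [[-0.2, 0.2], [0.2, -0.2]]
Q_regime = [[-0.2, 0.2], [0.2, -0.2]]
lam_regime = [0.3, 0.1]

Q_joint = kron_sum(Q_trait, Q_regime)
# Speciation depends on the regime only, so it repeats across trait states.
lam_joint = [lam_regime[r] for a in range(2) for r in range(2)]

for row in Q_joint:
    print(row)
```

Swapping in a different regime matrix or a different trait matrix gives a different trait-independent joint model with the same construction.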

### SSE models and Type 1 error

Lumpability and congruence reveal the non-linear probabilistic relationships among SSEs. For example, models within the same congruence class do not necessarily nest parametrically, and can be both trait-independent and -dependent; moreover, a CID model may relate more to trait-dependent SSEs from other congruence classes than to other CIDs, as shown by the parametric similarities between irreducible trait-dependent models (e.g., COR8-NC, CLA8-NC, Table 1) and those in the CID4 congruence class (e.g., COR8-C, CLA8-C, Table 1).

In the context of model selection, these non-linear interrelationships indicate that trait-dependent and -independent SSEs do not form two separate model families; note they would if, for example, all independent SSEs were perfectly nested within dependent SSEs. Instead, the diverse universe of SSE models comprises an infinite set of different model families. A similar issue was identified for diversity-dependent birth-death models (26). The non-linear interrelationships challenge accurate model selection.

A fundamental assumption of the ML method is that the true stochastic process, which generated the data, must lie within the family of models used for inference (27); thus, selecting an inappropriate family does not guarantee a reliable ML inference. This phenomenon can explain the high false positive rates, or Type 1 error, in SSEs, when a true null hypothesis is incorrectly rejected (3). Specifically, not including an adequate independent model in the analysis can lead to situations where a dependent model, even if wrong, appears to better fit the observed data than the selected independent model.

Our findings indicate that there is a wide, unexplored range of SSEs, suggesting that testing different models, with different parametrizations, could increase the probability of hitting the correct model family and thereby reduce Type 1 error. Lumpability emerges as a valuable tool in this context, aiding in constructing diverse irreducible SSEs for hypothesis testing. However, new methods are necessary for efficiently exploring the SSE model space. Historically, time-dependent diversification models have employed speciation and extinction rates that vary exponentially through time. The discovery of alternative functional forms by LP has led to novel methods for investigating congruent diversification scenarios (10–12). Similarly, we need new methods tailored specifically to SSEs, in order to explore alternative models from different congruence classes [e.g., (28)].

## Conclusion

Every SSE belongs to an infinite congruence class, akin to the congruence in time-dependent diversification models. Congruence classes can be constructed through hidden expansion, that is, by adding hidden states to an SSE while maintaining model lumpability. Congruent SSEs are not identifiable based on observed data alone. This unidentifiability is essentially a mathematical coincidence due to lumpability. If it were possible to explicitly observe state changes throughout phylogenetic history, congruent models would be identifiable. The unidentifiability of SSEs poses challenges for hypothesis testing because every trait-independent model is congruent with a set of trait-dependent SSEs. As a solution that aligns with traditional statistical practices, we recommend using irreducible SSEs for comparing models within and between congruence classes; an irreducible SSE is a model with the minimal number of hidden states, unique to each congruence class. Users are advised to ensure their selected SSEs are irreducible before performing inference.

Besides the unidentifiability issue, the key significance of model lumpability and congruence lies in their role in analyzing the probabilistic behavior of SSEs. Lumpability explains the high false positive rates in detecting trait-dependent diversification and facilitates the construction of SSEs for hypothesis testing. Additionally, it may be useful in developing new SSE models.

Can we expect other types of congruence in SSEs? A recent study has mathematically proved the identifiability of pure-birth SSEs under specific conditions (16). Based on it and our results, we conjecture that problematic congruent classes in SSEs, that require new methods as in time-dependent diversification models, are improbable. Even if a new type of congruence is discovered, it is unlikely to alter existing practices. Thus, SSEs continue to be crucial tools in studying trait-dependent diversification and diversification shifts.

### HiClaSSE and simulations

The HiClaSSE model is implemented by modifying the functions from the diversitree R package (29); congruent models were constructed using rphenoscate (30).

To demonstrate the semi-congruent behaviour (Fig. 1), we simulated 100 datasets under the CID model (*λ*_{1} = 0.3, *λ*_{2} = 0.1, *q* = 0.2, and *μ* = 0.01) with a variable number of tips (100, 500, 1000) using the HiSSE package (5). Next, we fitted the original CID model and a semi-congruent model that was constructed from the original CID by EHE and has two parameters.

### Data, Materials, and Software Availability

The software, data, and code necessary for reproducing the results and the figures are available on GitHub at https://github.com/sergeitarasov/Congruent-SSE-CTMC.

We thank Matthew W. Pennell and the members of both the Tarasov and Uyeda Labs for valuable comments on this work.

ST was supported by the Research Council of Finland Grants (339576, 346294, 331631), and the three-year grant from University of Helsinki; ST partially conducted this work while a Postdoctoral Fellow at the National Institute for Mathematical and Biological Synthesis sponsored by the NSF-DBI-1300426. JCU was funded by NSF-DEB-1942717 and NSF-DBI-1661516.

## Supporting Information Text

### HiClaSSE and hidden states

In order to avoid a common misunderstanding about what hidden states are in an SSE model, we provide a theoretical background below to help understand this concept.

### The individual components of an SSE model

An SSE is a joint model that considers both observable trait evolution and diversification regimes simultaneously. To understand SSE models, including HiClaSSE, it is practical to start with the CID model, where trait and diversification are independent. The CID model is equivalent to two separate models: one for the diversification component and one for the trait component. In other words, CID is a separable model, whose likelihood is the product of the likelihoods of the individual models. Let us consider the simplest such model, CID4, with four states, describing the evolution of a binary trait and a binary diversification regime.

The trait component is a traditional CTMC, represented by a rate matrix *Q*_{t}, with each state being observable (Eq. 1). This model is traditionally used for ancestral state reconstruction of binary traits.

The diversification component is another CTMC characterized by a rate matrix *Q*_{r} (Eq. 1), where the regimes are discrete states (A and B) like in the trait’s CTMC. Each state is equipped with corresponding speciation (λ_{A}, *λ*_{B}) and extinction (*μ*_{A}, *μ*_{B}) rates, governing the birth and death of lineages. It is convenient to think of this CTMC from a data generation perspective. When this process generates a phylogenetic tree, it transitions between regimes, and depending on the regime, a lineage can split into two lineages or go extinct according to the speciation and extinction rates. The diversification component alone is known as a trait-free MiSSE model.

### Joint CID process

By combining the diversification and trait components, we obtain a four-state transition rate matrix *Q*_{j} for the joint process (Eq. 1). Each state in this matrix simultaneously belongs to a diversification regime and a trait state. Each state is associated with speciation rates (λ_{A}, *λ*_{B}, *λ*_{A}, *λ*_{B}) and extinction rates (*μ*_{A}, *μ*_{B}, *μ*_{A}, *μ*_{B}), following the order of states in *Q*_{j}. As speciation and extinction rates are independent, it implies that pairs of joint states 1A and 2A, as well as 1B and 2B, have the same speciation and extinction rates as well as the *Q* matrix as in Eq. 1.

### What are the hidden states?

All states in the joint model *Q*_{j} are essentially hidden because we are uncertain about which states our data at the tree tips correspond to. For instance, if we know that a tip is in state 1, we do not know whether it belongs to 1*A* or 1*B*. Therefore, the observable states in our model consist of sets of hidden states. For example, an observable state 1 consists of the hidden states 1*A* and 1*B*. It is helpful to conceptualize this model as having a two-layered structure: a hidden layer that represents the actual states of the model and an observable layer to which the hidden states are mapped. The evolution takes place exclusively at the hidden layer, and its results are subsequently re-mapped to the observable layer. This two-layered approach serves as a general framework for constructing various SSE models, accommodating a wide range of scenarios. In model formalism, these two layers are modeled using ambiguous coding at the tips of the tree.

For likelihood computation, the data at tips should be encoded as probability vectors rather than as integers, as in a character matrix. For instance, for a binary trait, data at the tips are encoded as vectors: (1, 0) or (0, 1), representing the observed states 0 and 1, respectively. In cases where the tip's state is uncertain, we use an ambiguous coding, such as (1, 1), which implies that both states are possible.

The same approach is adopted by SSEs. For example, the CID model (Eq. 1) requires the data to be encoded as (1,1,0,0) and (0,0,1,1) corresponding to the observable states 0 and 1 respectively. For the two-state MiSSE model (e.g., *Q*_{r} in Eq. 1), all tips are encoded as (1, 1), indicating that all states are hidden, which emphasizes our lack of knowledge about the diversification regimes at the tips, which the model aims to infer.
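The tip encoding described above can be sketched in a few lines of Python. This is only an illustration: the function name `encode_tip` and the assumed state ordering (all hidden states of observable state 0 first, then those of observable state 1) are choices made here to match the vectors quoted in the text, not part of any SSE package.

```python
import numpy as np

def encode_tip(observed, n_obs_states, n_hidden):
    """Return a 0/1 vector over the joint (hidden) states of a tip.

    Assumed ordering: hidden states are grouped by observable state,
    e.g. (A0, B0, A1, B1) for CID4. `observed=None` marks a fully
    ambiguous tip, for which every state is possible.
    """
    vec = np.zeros(n_obs_states * n_hidden)
    if observed is None:
        vec[:] = 1.0
    else:
        start = observed * n_hidden
        vec[start:start + n_hidden] = 1.0
    return vec

cid4_state0 = encode_tip(0, n_obs_states=2, n_hidden=2)    # (1, 1, 0, 0)
cid4_state1 = encode_tip(1, n_obs_states=2, n_hidden=2)    # (0, 0, 1, 1)
misse_tip = encode_tip(None, n_obs_states=1, n_hidden=2)   # (1, 1)
```

Every hidden state that maps to the observed state receives a 1, which is the ambiguous coding that links the hidden and observable layers.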

### Correlated SSE models

The SSE model displaying correlated evolution between trait and diversification can be created from CID by altering rate symmetries in *Q*_{j} or by rewiring the association between hidden states and speciation/extinction rates. See the section “Correlated Evolution in SSE Models”.

### BiSSE model and hidden states

Any SSE model, including those without hidden states like the BiSSE model, can be viewed as an instance of the general “hidden state” framework. By modifying the mapping of speciation and extinction rates in the CID model (Eq. 1), we can create a correlated process. For example, we can set the rates as follows: speciation (λ_{A}, *λ*_{A}, *λ*_{B}, *λ*_{B}) and extinction (*μ*_{A}, *μ*_{A}, *μ*_{B}, *μ*_{B}) while keeping *Q*_{j} the same. This modified model, which we will call COR4, is congruent to the BiSSE model because the rate matrix of COR4 (i.e., *Q*_{j} in Eq. 1) and the diversification rates are lumpable with respect to the state partition {{*A*0, *B*0}, {*A*1, *B*1}}. So, the corresponding BiSSE model is defined by *Q*_{t} (Eq. 1), and its diversification rates are (λ_{A}, *λ*_{B}) and (*μ*_{A}, *μ*_{B}). Although the BiSSE model does not formally have hidden states, it can be viewed as a simplified version of a more complex hidden state model.

### HiClaSSE likelihood

The HiClaSSE description follows the general framework for SSE models. We will use lowercase symbols for scalars (e.g., *a*), italic-bold symbols for column vectors (e.g., ***a***), capital letters for matrices (e.g., *A*), and bold capital letters for sets of matrices, called tensors (e.g., **A**). The element-wise product, denoted as ○, involves multiplying each element of two arrays with the same dimensionality. The notation ***u*** ⊗ ***v***^{T} signifies the outer product of two vectors, while *X* ***u*** represents the multiplication of matrix *X* by vector ***u***, and *XY* indicates the multiplication of matrix *X* by matrix *Y*.

We consider that HiClaSSE represents a lineage-varying state-dependent birth-death process with *k* discrete states and the following types of events: (1) speciation events, where a lineage splits into two; (2) extinction events, resulting in the death of a lineage; and (3) state-change events, where a lineage changes from one state to another.

### Speciation events

We denote by **Λ** the speciation array that represents a set of *k* matrices, **Λ** = {Λ_{1}, Λ_{2}, ..., Λ_{k}}, characterizing the rates for speciation events. Each matrix Λ_{i} has dimension *k* × *k*, with upper triangular entries populated with λ rates (others are zero). Matrix Λ_{i} has entries *λ*_{ijz}, which refer to the rates of speciation events in which a lineage in state *i* speciates and produces two daughter lineages, one in state *j* and another in state *z*:
All these events fall into two categories: those occurring simultaneously with state changes and those that do not. We refer to them as coupled (e.g., *λ*_{111}) and decoupled (e.g., *λ*_{122}) speciation events respectively.

Note that Λ_{i} is intentionally an upper triangular matrix to visualize unique rate entries and avoid rate duplication (as *λ*_{112} = *λ*_{121}). However, in some calculations below, we will require a symmetric speciation matrix where the zeros in Λ_{i} are filled with rates from the upper triangle. Such matrices can be obtained through the following algebraic operation:
where the superscript *T* denotes the matrix transpose, diag(Λ_{i}) extracts the diagonal of Λ_{i}, and Diag(diag(Λ_{i})) forms a diagonal matrix with the diagonal elements of Λ_{i}. These symmetric matrices also form an array.
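As a minimal numerical sketch of this symmetrization, one operation consistent with the description (the transpose and the Diag(diag(·)) term) is to add the matrix to its transpose and subtract the diagonal once, so that diagonal rates are not double-counted. The function name `symmetrize` and the example rates are illustrative assumptions, not values from the paper.

```python
import numpy as np

def symmetrize(lam_i):
    """Fill the lower triangle of an upper-triangular speciation matrix.

    Computes lam_i + lam_i^T - Diag(diag(lam_i)); subtracting the
    diagonal once keeps the diagonal rates from being counted twice.
    """
    lam_i = np.asarray(lam_i, dtype=float)
    return lam_i + lam_i.T - np.diag(np.diag(lam_i))

# Hypothetical upper-triangular matrix for a two-state model.
lam_1 = np.array([[0.30, 0.05],
                  [0.00, 0.10]])
lam_1_sym = symmetrize(lam_1)
```

The result keeps the diagonal unchanged and mirrors each off-diagonal rate into the lower triangle.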

We also need to define a matrix-wise operation on the matrix arrays (**Λ** and its symmetric counterpart). For example, the notation **Λ** ○ *X* indicates that each Λ_{i} in **Λ** should be multiplied via ○ by matrix *X*:
Similar matrix-wise operations apply to both arrays for vector multiplication (e.g., **1**^{T}**Λ1**). We will also use two auxiliary matrices in matrix-wise operations on arrays. The first, *I*_{2,1}, is a matrix where all diagonal elements are 2s, while off-diagonals are 1s; it multiplies the diagonal elements of each matrix in an array by two. The second is a matrix where all diagonal elements are 1s, while off-diagonals are 1/2s; it divides the off-diagonal elements of each matrix in an array by two. These matrices are necessary to control for the number of speciation and extinction events in calculations.

### Extinction events

The extinction rates for each state are given by the column vector ***μ***:

### State-change events

The instantaneous rate matrix *Q* defines anagenetic state changes and represents a traditional CTMC. Its entries are the transition rates *q*_{ij}.

### Probability Terms

The probability of observing a particular branch at time *t*, given that the branch is in state *i*, is denoted by the scalar *D*_{i}(*t*). The probability of lineage extinction at time *t*, given that the lineage is in state *i*, is the scalar *E*_{i}(*t*). These two quantities are equivalent to *D*_{N,i}(*t*) and *E*_{i}(*t*) from BiSSE. Note that we have dropped the index *N* from *D*_{N,i}(*t*) for brevity. The quantities *D*_{i}(*t*) and *E*_{i}(*t*) can be presented in vector form as:

### Differential Equations and Likelihood

The probability terms evolve over time and branch according to the following system of differential equations, which is written in array notation as:
This system is equivalent to the BiSSE-ness model extended to *k* states. Note that each of the two vector equations characterizes *k* differential equations.

At each ancestral node (*A*), the diversification probabilities (*D*_{A}) are calculated as:
given that this node has two daughter lineages (*M* and *N*) with the diversification probabilities *D*_{M} and *D*_{N} respectively. This formula is equivalent to how BiSSE-ness combines the probabilities at nodes.

The likelihood of the tree (ℒ) is calculated at the root as:
where *D*_{R} are the diversification probabilities at the root, and π is the initial vector of probabilities for each state. The term *w* (Eq. 10) conditions the overall likelihood on clade survival (1), indicating that neither of the two daughter lineages at the root is allowed to become extinct before the present. Note that both the root and likelihood equations are also equivalent to those of the BiSSE-ness model.

### Example: HiClaSSE2

Following Eq. (7), the HiClaSSE model with two states is defined by these ODEs:
where, ** f′** indicates .

These equations (Eq. 7) are equivalent to the BiSSE-ness model, and thereby ClaSSE, but have a different parametrization. The BiSSE-ness model has 10 parameters: (*λ*_{0}, *λ*_{1}, *μ*_{0}, *μ*_{1}, *q*_{01}, *q*_{10}, *p*_{0c}, *p*_{0a}, *p*_{1c}, *p*_{1a}). These parameters include two speciation rates (*λ*_{0}, *λ*_{1}) that are weighted by probabilities representing anagenetic or cladogenetic speciation events. The parameter *p*_{0c} is the probability of cladogenetic speciation when the ancestral lineage is in state 0, which can be asymmetric (*p*_{0a}) or symmetric (1 − *p*_{0a}). Similar logic applies to *p*_{1c} and *p*_{1a}. The mapping between BiSSE-ness parameters (RHS) and HiClaSSE2 (LHS) is as follows:

### Correlated Evolution in SSE Models

Mark Pagel proposed conditions to determine whether two binary traits should be considered correlated or independent. We can extend these conditions to SSE models. Let us take an SSE model with two speciation and extinction rates: (*λ*_{1}, *λ*_{2}, *λ*_{1}, *λ*_{2}), and (*μ*_{1}, *μ*_{2}, *μ*_{1}, *μ*_{2}), and the rate matrix:
For this model to be independent (i.e., CID), the following conditions should hold: *q*_{13} = *q*_{24}, *q*_{31} = *q*_{42}, *q*_{12} = *q*_{34}, *q*_{21} = *q*_{43}. If these conditions are not met, the model implies the following dependencies:
Consequently, if two different types of conditions from Eq. (14) are met simultaneously (e.g., *q*_{13} ≠ *q*_{24} and *q*_{12} ≠ *q*_{34}), then the changes in both the trait and diversification regime are interdependent.

It is worth noting that there are additional conditions suggesting correlation. If any of the left diagonal rates, indicating dual transitions (e.g., *A*0 → *B*1 is allowed), has a non-zero value, then a change in trait implies an immediate change in diversification regimes, and both are mutually dependent. Note that all these correlation conditions are only valid when we have at least two speciation or extinction rates (as in the CID model of this section). In the trivial case of equal speciation and extinction rates across all states, the diversification is naturally not affected by the trait since it is invariable.

Non-independence can be proven by the fact that an independent SSE is separable, thereby it can be constructed from two individual models, one for trait and another for diversification. If this is not possible, then the SSE is correlated. This construction can be achieved by manually deriving the joint probabilities. Alternatively, for constructing the joint rate matrix *Q*_{j}, there is also a convenient algebraic operation using the Kronecker sum (⊕) that combines individual matrices.
where *Q*_{r} and *Q*_{t} are the individual rate matrices for diversification regimes and traits, the identity matrices have the dimensions of the respective *Q*’s, and ⊕ denotes the Kronecker sum. Consequently, all joint matrices that cannot be constructed using Eq. (15) imply correlation.
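The Kronecker-sum construction can be illustrated with a short NumPy sketch. The rate values and the function names are hypothetical, and the operand order (trait as the outer index) is an assumption chosen here so that the resulting state order agrees with the four independence conditions stated above (*q*_{13} = *q*_{24}, *q*_{31} = *q*_{42}, *q*_{12} = *q*_{34}, *q*_{21} = *q*_{43}); the paper's own Eq. (15) may order the operands differently.

```python
import numpy as np

def kronecker_sum(q_outer, q_inner):
    """Kronecker sum: kron(q_outer, I) + kron(I, q_inner)."""
    n_outer, n_inner = q_outer.shape[0], q_inner.shape[0]
    return (np.kron(q_outer, np.eye(n_inner))
            + np.kron(np.eye(n_outer), q_inner))

def satisfies_cid_conditions(q):
    """Check the four equalities on a 4x4 joint matrix (0-based indices)."""
    return bool(np.isclose(q[0, 2], q[1, 3]) and np.isclose(q[2, 0], q[3, 1])
                and np.isclose(q[0, 1], q[2, 3]) and np.isclose(q[1, 0], q[3, 2]))

# Hypothetical symmetric rate matrices for the trait and the regimes.
q_t = np.array([[-0.2, 0.2], [0.2, -0.2]])
q_r = np.array([[-0.1, 0.1], [0.1, -0.1]])

q_j = kronecker_sum(q_t, q_r)       # 4x4 joint, independent process
q_dep = q_j.copy()
q_dep[0, 2], q_dep[0, 0] = 0.5, -0.6  # trait change now depends on regime
```

Here `q_j` passes the independence check, whereas `q_dep`, in which one trait-change rate differs between regimes, does not. Dual transitions (such as entry `[0, 3]`) are zero in any matrix built this way.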

As an example, consider the models: COR8-C and COR8-NC from Eq. (28) (Table S1). Their rate matrices cannot be produced by combining individual trait and diversification models using Eq. (15). Therefore, they are not separable and describe dependent evolution. COR8-NC is a distinct model, while COR8-C is congruent with CID4 (Table S1) and shares the same likelihood.

### Lumpability Conditions for SSE models

#### Lumpability of a basic CTMC

Let us begin by discussing lumpability in the context of a basic CTMC and then extend it to SSE models. Suppose, we have an initial CTMC that describes the evolution of a four-state trait {*s*_{1}, *s*_{2}, *s*_{3}, *s*_{4} } and is defined by the rate matrix *Q*:
The aggregation of states at the level of the rate matrix means constructing a smaller matrix with fewer states and transitions. Suppose that we wish to aggregate *Q* into a two-state process with states {*ŝ*_{1}, *ŝ*_{2}}, using the following partition of the initial states: *ŝ*_{1} = {*s*_{1}, *s*_{2}}, *ŝ*_{2} = {*s*_{3}, *s*_{4}}. This partition can be visualized on *Q* by splitting the rates into four partition blocks, shown with a horizontal and a vertical line within the matrix in Eq. (16). Each block includes four initial rates whose function should yield new rates in the aggregated matrix. So, the aggregated matrix should be:
The lumpability property ensures that the transition rates of the aggregated CTMC can be calculated from the original CTMC, allowing us to model them using the aggregated process. The lumpability property described in the main text can be restated in the context of partition blocks within the rate matrix. We refer to this restatement as the “row-wise sum rule” (RWR). Both interpretations are equivalent.

The RWR implies the following: *the original CTMC is lumpable with respect to a given partition of states when the row-wise sum of rates within one partition block in the original Q is the same for all rows within the given partition block, and this property holds for all blocks in Q. The rates in the aggregated matrix represent simply the row-wise sums of the original rates*.

For *Q* in Eq. (16), the RWR is maintained if the following two equalities hold:
These equalities imply lumpability of *Q* for the given state partition and represent the new rates in the lumped process shown in Eq. (17). Note that, to prove lumpability, it is enough to show that the RWR holds for the off-diagonal blocks because it implies that the main diagonal blocks maintain the RWR too. This property will be helpful later, in classifying hidden expansions. In the lumped process, the initial probability vector at the tree root (π) should be aggregated too, by adding up the state probabilities from the original process that belong to the same partition subset.

The lumpability property is general and applies to all types of CTMCs. Lumpability does not depend on the time over which a CTMC evolves nor the initial probability vector at the tree root; if a CTMC is lumpable, it is lumpable under any value of the initial vector. If the lumpability property does not hold, then the simple form for the rate matrix for the aggregated process does not exist since it is not Markovian.
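The row-wise sum rule lends itself to a direct numerical check. The sketch below is a minimal illustration under stated assumptions: the function name `lump_rate_matrix`, the tolerance, and the example rates are hypothetical, and only the off-diagonal blocks are tested, relying on the remark above that the diagonal blocks then follow.

```python
import numpy as np

def lump_rate_matrix(q, partition, tol=1e-12):
    """Return the aggregated rate matrix if `q` satisfies the RWR, else None.

    `partition` is a list of lists of 0-based state indices.
    """
    q = np.asarray(q, dtype=float)
    n_blocks = len(partition)
    q_hat = np.zeros((n_blocks, n_blocks))
    for a, rows in enumerate(partition):
        for b, cols in enumerate(partition):
            if a == b:
                continue  # diagonal blocks follow from the off-diagonal ones
            row_sums = q[np.ix_(rows, cols)].sum(axis=1)
            if row_sums.max() - row_sums.min() > tol:
                return None  # rows of this block disagree: not lumpable
            q_hat[a, b] = row_sums[0]
    # Rows of a rate matrix sum to zero, which fixes the diagonal.
    np.fill_diagonal(q_hat, -q_hat.sum(axis=1))
    return q_hat

# Hypothetical four-state matrix, lumpable for {{s1, s2}, {s3, s4}}.
q = np.array([[-0.50,  0.20,  0.10,  0.20],
              [ 0.40, -0.70,  0.30,  0.00],
              [ 0.05,  0.05, -0.30,  0.20],
              [ 0.10,  0.00,  0.60, -0.70]])
q_lumped = lump_rate_matrix(q, [[0, 1], [2, 3]])

# Changing one rate breaks the equal row sums in the upper-right block.
q_bad = q.copy()
q_bad[1, 3], q_bad[1, 1] = 0.10, -0.80
q_bad_lumped = lump_rate_matrix(q_bad, [[0, 1], [2, 3]])
```

For the first matrix, both rows of the upper-right block sum to 0.3 and both rows of the lower-left block sum to 0.1, so the lumped process has rates 0.3 and 0.1. The second matrix fails the rule, and the function returns `None`.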

#### Lumpability and backward ODEs

The previous example demonstrated lumpability for forward-in-time equations. However, in phylogenetics, likelihood is estimated using backward-in-time equations starting at *t* = 0. Let us first consider these equations for a basic CTMC from Eq. (16) to demonstrate our approach.

The system of backward-in-time equations for calculating likelihood over a tree branch can be written in matrix form as ***D′*** = *Q* ***D***, where ***D*** is:
This system expands as:
Now, we want to construct a smaller CTMC given the state partition *ŝ*_{1} = {*s*_{1}, *s*_{2}}, *ŝ*_{2} = {*s*_{3}, *s*_{4}} as above. Our approach involves first deriving a necessary condition for the smaller CTMC to be Markovian, then applying it to the system of the original ODEs, and finally deriving the relationships between the rates that are necessary and sufficient to maintain lumpability. For the smaller CTMC to be Markovian in the backward equation, it is necessary that we do not distinguish between the original states that belong to the same lumped state. Thus, the *D*_{i}(*t*)’s belonging to the same lumped state must be equiprobable for any time *t*. In our example, this means that *D*_{1}(*t*) = *D*_{2}(*t*) and *D*_{3}(*t*) = *D*_{4}(*t*). Putting these conditions into Eq. (20) gives, after some algebra:
Clearly, for this system to be lumpable, the same equalities as in Eq. (18) should hold. This allows us to rewrite these four equations using just two equations that characterize the lumped process from Eq. (17):
where *D*_{12}(*t*) = *D*_{1}(*t*) = *D*_{2}(*t*) and *D*_{34}(*t*) = *D*_{3}(*t*) = *D*_{4}(*t*); the dependencies between the rates are the same as in Eq. (18). We will employ this approach to derive lumpability conditions for SSE models.

#### Lumpability Conditions for SSE

The ODEs for SSE models can become quite extensive and involve numerous parameters depending on the number of states. For simplicity, let us illustrate the method discussed in the previous section by applying it to a HiClaSSE3 with states denoted as {0, 1, 2}. Our goal is to lump these states into two states given the partition {{0}, {1, 2}}. We assume that the original SSE model has the maximum number of parameters, which include the extinction rates ***μ*** = [*μ*_{0}, *μ*_{1}, *μ*_{2}], and the transition rate matrix *Q*_{3}:
The coupled and decoupled speciation rates are specified by the **Λ** tensor:
To lump the model, we set *D*_{1}(*t*) = *D*_{2}(*t*) and use the ODEs from Eq. (7), which yields:
where the *D*’s and *E*’s are, indeed, *D*(*t*)’s and *E*(*t*)’s, with *t* omitted for brevity. You may notice that the rates across these equations group together based on the same *D* and *E* terms. For example, for *D*_{1}′(*t*), there is a term *D*_{0}*E*_{1}(*λ*_{101} + *λ*_{102}), and for *D*_{2}′(*t*), there is *D*_{0}*E*_{1}(*λ*_{201} + *λ*_{202}). If *λ*_{101} + *λ*_{102} = *λ*_{201} + *λ*_{202}, then the two terms are equal. Thus, we can make the HiClaSSE3 model lumpable by ensuring that all such terms are equal. According to our state partition, HiClaSSE3 is lumpable if:
If all of these conditions are satisfied, we can rewrite the HiClaSSE3 model from Eq. (25) as a HiClaSSE2 model with two states, as shown in Eq. (11). The following relationships exist between the HiClaSSE3 (LHS) and HiClaSSE2 (RHS):

#### The lumpability algorithm

For an SSE to be lumpable, each of its components, **Λ**, ***μ***, and *Q*, should satisfy lumpability conditions. While the lumpability conditions for ***μ*** and *Q* are straightforward and have been discussed earlier in detail, the conditions for the **Λ** tensor may be cumbersome. If we think of **Λ** not as a set of matrices but as a 3D array, then its lumpability is analogous to the RWR, but in three dimensions.

To check if **Λ** is lumpable given a certain state partition ℬ, you can follow this algorithm: (1) split each Λ_{i} matrix in **Λ** into partition blocks, as done with *Q*, according to ℬ (e.g., Eq. 24); (2) then partition **Λ** itself by grouping individual matrices into subsets according to ℬ (e.g., as shown by the vertical bar after Λ_{0} in Eq. 24); (3) **Λ** is lumpable if, for each subset of matrices, the sum of rates across identical partition blocks is the same, and this condition holds for all subsets and blocks.
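The three steps above can be sketched numerically as follows. This is one reading of the algorithm rather than the authors' implementation: the function names, the tolerance, and the example rates are hypothetical. Because each Λ_{i} is stored as an upper-triangular matrix, the sketch assumes that the block for subsets (a, b) and its mirror (b, a) describe the same unordered pair of daughter states and therefore adds them together before comparing.

```python
import numpy as np

def block_sums(lam_i, partition):
    """Step 1: sum the rates of lam_i within each unordered pair of subsets."""
    s = np.array([[lam_i[np.ix_(a, b)].sum() for b in partition]
                  for a in partition])
    folded = np.triu(s + s.T)                        # merge (a, b) with (b, a)
    folded[np.diag_indices(len(partition))] = np.diag(s)
    return folded

def tensor_is_lumpable(lam, partition, tol=1e-12):
    """Steps 2-3: matrices in the same subset must share all block sums."""
    for subset in partition:
        sums = [block_sums(lam[i], partition) for i in subset]
        if any(np.abs(x - sums[0]).max() > tol for x in sums[1:]):
            return False
    return True

# Hypothetical three-state tensor, partition {{0}, {1, 2}}.
lam = np.zeros((3, 3, 3))
lam[0, 0, 0] = 0.4
lam[1, 0, 1], lam[1, 1, 1], lam[1, 2, 2] = 0.1, 0.2, 0.1   # parent state 1
lam[2, 0, 2], lam[2, 1, 2] = 0.1, 0.3                      # parent state 2
partition = [[0], [1, 2]]
ok = tensor_is_lumpable(lam, partition)

lam_bad = lam.copy()
lam_bad[2, 0, 2] = 0.2       # block sums of parents 1 and 2 now differ
bad = tensor_is_lumpable(lam_bad, partition)
```

In the first tensor, the matrices for parent states 1 and 2 distribute their rates differently but agree on every block sum (0.1 for one daughter in state 0 and one in {1, 2}; 0.3 for both daughters in {1, 2}), so the check passes. A full lumpability test for an SSE would combine this with the corresponding checks on ***μ*** and *Q*.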

### Lumpability and Model Congruence

If the original model is lumpable with respect to the hidden states, then the lumped model is congruent with the original one. This can be understood by examining *D*(*t*)’s and *E*(*t*)’s. When calculating likelihood over a tree branch using Eq. (7), *D*(*t*)’s and *E*(*t*)’s belonging to the same observable state are identical due to the lumpability assumption. This implies that they are also the same at nodes and the root (Eq. 8). Therefore, lumping by aggregating hidden states essentially means that *D*(*t*)’s and *E*(*t*)’s at the root between the two models only differ in the number of states, while the likelihood values at the root (Eq. 9) are identical. Accordingly, the lumpability holds regardless of the sampling fraction, since it simply alters the relative value of *D*(*t*)’s and *E*(*t*)’s.

### Congruence between independent and dependent SSEs

The characteristics of all the discussed models are provided in Table S1 (the top part) and Eq. (28). These models were derived from CID4 using HE. The expanded models maintain the same parameter count of four parameters and consist of eight states. In these models, the speciation rates are anagenetic, except for CLA8-C and CLA8-NC, which have mostly cladogenetic rates. The congruent models (CID8, COR8-C, CLA8-C) can be reduced to CID4 through the following state partitioning: {{*a*0, *b*0}, {*c*0, *d*0}, {*a*1, *b*1}, {*c*1, *d*1}}, resulting in the states {*A*0, *B*0, *A*1, *B*1}. The rate matrix of COR8-C is lumpable, but COR8-NC is not lumpable as it violates the RWR, making it non-congruent. Similarly, the speciation tensor **Λ** of CLA8-C is lumpable, while that of CLA8-NC is not lumpable, resulting in non-congruence for the latter model as well.

### Equal Rate Hidden Expansion (EHE)

The CID model used in the EHE transformation example consists of four parameters with specific parameter values. Its expansion results in EHE8-C, which has only two parameters and includes cladogenetic speciation events (Table S1, the bottom part; and Eq. 29).

Note that in the EHE example, the rates are exact multiples of 0.1 for demonstration purposes. This is done to simplify the visualization. If other values with many decimals were used, EHE transformations could result in rate matrices with thousands of states, making it impractical to visualize them.

### The EHE Algorithm

The EHE algorithm expands any CID model with decoupled speciation events to an EHE model. It can also be adopted for any SSE. This algorithm expands each initial state with an equal number of additional hidden states.

To calculate the minimal EHE model, follow these steps:

1. Pool together the rates from *Q* (only non-negative rates) and **Λ** into a single vector. Represent them as irreducible fractions of two integers and find their least common denominator (*lcd*).
2. Define the equal rate as *r* = 1/*lcd*.
3. Find the maximum rates in **Λ** and *Q*. Denote them as *λ*_{max} and *q*_{max} respectively.
4. Calculate the necessary number of hidden states per initial state, represented as *N*_{Λ} and *N*_{Q}, corresponding to **Λ** and *Q* respectively. For the rate matrix *Q*, this number is: For the speciation tensor **Λ**, it is: where ⌈ · ⌉ represents the ceiling function. The ceiling of a number *x* is the smallest integer that is greater than or equal to *x*. It is required since the number of hidden states is an integer. The formula from Eq. (30) derives from the number of different lambda rates per initial CID state in the expanded model, which is: setting , and solving Eq. (31) yields Eq. (30).
5. Select the maximum value to determine the final number of hidden states *N*_{h} needed for the observable one: *N*_{h} = max(*N*_{Q}, *N*_{Λ}).
6. Create the appropriate EHE model, corresponding to *N*_{h}, and distribute the rates to maintain lumpability.

For the CID4 from Eq. (29) and Table S1, the algorithm operates as follows. The three rates {0.1, 0.3, 0.2} can be represented as fractions: 1*/*10, 3*/*10, and 2*/*10; their *lcd* = 10, hence *r* = 1*/lcd* = 0.1. The maximum rates are: *q*_{max} = 0.2 and *λ*_{max} = 0.3. So, *N*_{Q} = 0.2*/*0.1 = 2, *N*_{Λ} = 2, which means that *N*_{h} = 2, resulting in EHE having two hidden states for each initial state, as illustrated in Eq. 29 and Table S1.
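The arithmetic of this worked example can be reproduced with exact fractions. The sketch below covers only the steps whose formulas are recoverable from the text: the least common denominator, the equal rate *r*, and *N*_{Q}. The helper name `equal_rate` is hypothetical, and taking *N*_{Q} as the ceiling of *q*_{max}/*r* is an assumption consistent with the quoted calculation (0.2/0.1 = 2); the formula for *N*_{Λ} is not reproduced here.

```python
from fractions import Fraction
from functools import reduce
from math import ceil, gcd

def equal_rate(rates):
    """Return (r, lcd) for rates given as decimal strings or numbers.

    Each rate is converted to an irreducible fraction; lcd is the least
    common multiple of the denominators, and r = 1 / lcd.
    """
    fracs = [Fraction(str(x)) for x in rates]  # str() avoids float artifacts
    lcd = reduce(lambda a, b: a * b // gcd(a, b),
                 (f.denominator for f in fracs), 1)
    return Fraction(1, lcd), lcd

# Rates of the CID4 example: lambda_2 = 0.1, lambda_1 = 0.3, q = 0.2.
r, lcd = equal_rate([0.1, 0.3, 0.2])

q_max = Fraction("0.2")
n_q = ceil(q_max / r)   # assumed form of N_Q, matching 0.2 / 0.1 = 2
```

With these inputs, `lcd` is 10, `r` is 1/10, and `n_q` is 2, in agreement with the values stated above. Exact fractions are used because binary floating point cannot represent 0.1 exactly, which would otherwise inflate the denominators.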

The algorithm outlined above constructs the EHE model with the minimal number of hidden states. However, if one selects a different common denominator (other than the *lcd*) and applies the same algorithm, it will generate an EHE model with a greater number of hidden states.

### The Structure of a Congruence Class

To derive parametric forms for congruent models, we classify HEs into three main types based on the number of parameters in a HE model compared to the original SSE. This classification focuses on counting parameters located in the off-diagonal blocks (referred to as off-diagonal parameters) of the *Q* matrix in the expanded model, while ignoring the total number of parameters in the model. This distinction is crucial because the parameters in the diagonal blocks are unidentifiable and can be set to any values, as they do not affect the RWR (see the section “Lumpability of a basic CTMC”). For example, setting the diagonal *q*’s in the *Q*_{CID8} from Eq. (28) to 0 or any other separate parameter will not impact lumpability.

The three main types of HEs are as follows: equivalent expansion, super-expansion, and sub-expansion. These types generate models with the same, greater, and fewer off-diagonal parameters, respectively.

We also define two additional categories of HE models that are crucial for analyzing SSE dynamics. The Waiting-Time Preserving (WTP) hidden expansion does not alter the expected waiting times in *Q* between the original model and HE. These waiting times are indicated by the negative main-diagonal entries in the rate matrix. Conversely, the opposite category is a non-Waiting-Time Preserving (nWTP) expansion, which modifies the expected waiting times in *Q* between the original and HE models. For instance, *Q*_{CID8} in Eq. (28) is nWTP, but if we change the *q*’s in the diagonal blocks to 0, it becomes WTP.

The parametric forms for equivalent and super-expansion models can be directly derived from *M*_{0} using the lumpability conditions. The total number of parameters in these models remains the same (e.g., WTP equivalent type models) or increases (e.g., nWTP equivalent or super-expansion models). Super-expansion models have the potential for an unlimited number of parameters since there is no upper bound to maintain lumpability.

On the other hand, the parametric forms for sub-expansion models depend on the specific parameter values of *M*_{0}. This implies that each point in the parameter space of *M*_{0} corresponds to a unique sub-expansion model. The total number of parameters in sub-expansion models can vary: (1) it can be fewer than in *M*_{0} (e.g., WTP models), ranging down to two parameters (EHE), indicating that these models are nested within *M*_{0}; or (2) it can be the same as or greater than in *M*_{0} (e.g., nWTP models).

## Footnotes

The authors declare no competing interests.


## References

- 1.