Abstract
Background Technological advances in genomic sequencing are facilitating the reconstruction of transmission histories during outbreaks in the fight against infectious diseases. However, accurate disease transmission inference using this data is hindered by a number of challenges due to within-host pathogen diversity and weak transmission bottlenecks, where multiple genetically-distinct pathogenic strains co-transmit.
Results We formulate a combinatorial optimization problem for transmission network inference under a weak bottleneck from a given timed phylogeny and establish hardness results. We present SharpTNI, a method to approximately count and almost uniformly sample from the solution space. Using simulated data, we show that SharpTNI accurately quantifies and uniformly samples from the solution space of parsimonious transmission networks, scaling to large datasets. We demonstrate that SharpTNI identifies co-transmissions during the 2014 Ebola outbreak that are corroborated by epidemiological information collected by previous studies.
Conclusions Accounting for weak transmission bottlenecks is crucial for accurate inference of transmission histories during outbreaks. SharpTNI is a parsimony-based method to reconstruct transmission networks for diseases with long incubation times and large inocula given timed phylogenies. The model and theoretical work of this paper pave the way for novel maximum likelihood methods to co-estimate timed phylogenies and transmission networks under a weak bottleneck.
Background
Accurate inference of the transmission history of an infectious disease outbreak is pivotal for real-time outbreak management, public health policies and devising disease control strategies for future outbreaks [1]. Traditional epidemiological approaches are fieldwork intensive and aim to uncover contact histories and exposure times of hosts to disease sources. With decreasing costs of genomic sequencing, molecular epidemiology has complemented these traditional approaches to effectively analyze and manage disease outbreaks.
Given genomic and epidemiological data, the key challenge is to infer the evolutionary history of the pathogen isolates and the transmission history of the hosts. Importantly, while the phylogeny of the pathogen isolates captures the evolutionary history of the outbreak, it does not necessarily match the transmission history of the outbreak [2]—this mutation-migration discordance also arises in metastatic cancers [3]. In particular, methods that assume that transmission events coincide with branching events in the phylogeny are only applicable in the context of pathogens with low mutation rates, short incubation times and acute infections [4–8]. By contrast, pathogens with high mutation rates and long incubation times lead to within-host diversity. This diversity is either the result of infection by multiple strains or arose after infection by a single strain. Most current methods assume the latter, an assumption known as a complete transmission bottleneck [9–14].
Under a weak transmission bottleneck multiple genetically-distinct strains of the pathogen are simultaneously transmitted from a donor to a recipient through a non-negligibly small inoculum. Large inoculum sizes have been observed in a number of diseases [15]. There are two recent methods that partially support a weak transmission bottleneck [16,17]. While SCOTTI allows a single host to be infected by multiple strains, it does not support the simultaneous transmission of these strains and considers each in isolation [16]. On the other hand, BadTrIP supports simultaneous transmission but does so only at single locus resolution rather than genome scale [17]. Supplementary Table S1 provides a summary of current methods.
Here, we formulate the Transmission Network Inference (TNI) problem under a weak bottleneck for a given timed phylogeny (Fig. 1). In this problem, we use the principle of parsimony to minimize the number of co-transmissions, each of which may comprise multiple transmitted strains. We prove hardness for the optimization and sampling versions of the problem. We introduce SharpTNI, a method to uniformly sample optimal solutions and quantify the size of the solution space. On simulated data, we show that SharpTNI accurately counts and samples parsimonious transmission networks, scaling to large datasets. We analyze a dataset from the 2014 Ebola outbreak [18], showing that SharpTNI outperforms SCOTTI and recapitulates previously documented co-transmissions.
Results
This section outlines the problem statement, the complexity results and the results obtained by applying our method SharpTNI to simulated and real datasets.
Problem Statement
Let T be a tree rooted at vertex r(T) with vertex set V(T), leaf set L(T) and edge set E(T). We denote the children of a vertex u by δT(u). Conversely, the unique parent of a non-root vertex u ≠ r(T) is denoted by πT(u). We write u ⪯T v if vertex u is ancestral to vertex v, i.e. vertex u is present on the unique path from r(T) to vertex v. Note that ⪯T is reflexive. We say that u and v are incomparable if neither u ⪯T v nor v ⪯T u holds. We omit the subscript T from ⪯T, δT and πT if it is clear from context. We denote the subtree of T rooted at vertex v by Tv.
Given a set Σ of hosts, the key objects in this paper are a timed phylogeny T and host labeling ℓ : V(T) → Σ, which are defined as follows.
A timed phylogeny is a rooted tree T whose vertices are labeled by time-stamps τ : V(T) → ℝ≥0 such that τ(u) < τ(v) for all pairs u, v of distinct vertices where u ⪯ v.
A host labeling of a timed phylogeny T is a function ℓ : V (T) → Σ, assigning a host ℓ(u) to each vertex u of T.
Intuitively, time moves forward when traversing down a timed phylogeny T starting from the root r(T). A leaf u of T corresponds to a strain that has been removed from the population at time τ (u), due to treatment or death of the corresponding host ℓ(u). On the other hand, an internal vertex u of T corresponds to a strain that infected host ℓ(u) at time τ (u).
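The time-stamp condition of a timed phylogeny can be checked mechanically. The following is a minimal sketch, assuming the tree is stored as a hypothetical `parent` dictionary (child to parent) and the time-stamps as a dictionary `tau`; neither name comes from the paper.

```python
from typing import Dict, Hashable


def is_timed_phylogeny(parent: Dict[Hashable, Hashable],
                       tau: Dict[Hashable, float]) -> bool:
    """Return True if tau is non-negative and strictly increases along every edge.

    `parent` maps each non-root vertex to its parent. Checking parent-child
    pairs suffices: the condition for all ancestor-descendant pairs then
    follows by transitivity.
    """
    if any(t < 0 for t in tau.values()):
        return False
    return all(tau[p] < tau[c] for c, p in parent.items())


# Toy tree (illustrative only): r -> a, r -> b, a -> c, a -> d.
parent = {"a": "r", "b": "r", "c": "a", "d": "a"}
tau = {"r": 0.0, "a": 1.0, "b": 3.0, "c": 2.0, "d": 3.0}
print(is_timed_phylogeny(parent, tau))
```

Only edges are inspected, so the check runs in time linear in the number of vertices.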
A timed phylogeny T combined with a host labeling ℓ constrains the set of allowed transmissions in the following three ways. First, an edge (u, v) of T is a transmission edge if ℓ(u) ≠ ℓ(v). Second, a transmission event Ψ is a subset of transmission edges between the same pair of hosts that have occurred simultaneously. Third, a transmission network N = {Ψ1, …, Ψ|N|} is a partition of transmission edges into disjoint transmission events. More formally, we have the following definitions.
Definition 3
Given a timed phylogeny T and host labeling ℓ, an edge (u, v) of T is a transmission edge if ℓ(u) ≠ ℓ(v).
Given a timed phylogeny T and host labeling ℓ, a transmission event Ψ is a subset of edges of T such that (i) each edge (u, v) ∈ Ψ is a transmission edge, (ii) each edge (u, v) ∈ Ψ has the same source host ℓ(u) = s and target host ℓ(v) = t and (iii) for all pairs (u, v), (u′, v′) ∈ Ψ it holds that [τ(u), τ(v)] ∩ [τ(u′), τ(v′)] ≠ ∅.
Given a timed phylogeny T and host labeling ℓ, a transmission network N is a partition of the transmission edges of (T, ℓ) into disjoint transmission events.
As suggested by the name, a transmission network N = {Ψ1, …, Ψ|N|} can be equivalently viewed as a graph. More specifically, N is a directed, edge-labeled multi-graph, where the vertex set V(N) equals the set of hosts Σ, the edge multi-set E(N) is composed of transmission edges of T incurred by the host labeling ℓ associated with N, and the edge labeling ψ : E(N) → {1, …, |N|} assigns each transmission edge (u, v) ∈ Ψi to transmission event ψ(ℓ(u), ℓ(v)) = i. We say that a transmission network N is consistent with timed phylogeny T and host labeling ℓ if the set of transmission edges of N equals the set of transmission edges in (T, ℓ).
We evaluate a transmission network N by two different quantities. First, the transmission number μ(N) equals the number of transmitted strains, i.e. μ(N) = ΣΨ∈N |Ψ|. Second, the co-transmission number γ(N) equals the number of transmission events, i.e. γ(N) = |N|. By definition, the transmission number is greater than or equal to the co-transmission number, i.e. μ(N) ≥ γ(N) for all transmission networks N.
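To make the two quantities concrete, the sketch below represents a transmission network as a list of events, each event being a list of edges (u, v). The function names and the dictionaries `host` and `tau` are illustrative assumptions, not part of the paper.

```python
def is_transmission_event(event, host, tau):
    """Check conditions (i)-(iii) of a transmission event for a list of edges."""
    edges = list(event)
    # (i) every edge must change host
    if any(host[u] == host[v] for u, v in edges):
        return False
    # (ii) all edges share the same (source host, target host) pair
    if len({(host[u], host[v]) for u, v in edges}) > 1:
        return False
    # (iii) the closed intervals [tau(u), tau(v)] overlap pairwise
    return all(max(tau[u], tau[a]) <= min(tau[v], tau[b])
               for u, v in edges for a, b in edges)


def transmission_number(network):
    """mu(N): total number of transmitted strains (edges) over all events."""
    return sum(len(event) for event in network)


def co_transmission_number(network):
    """gamma(N): number of transmission events."""
    return len(network)


# Toy example: two strains pass from host A to host B.
host = {"r": "A", "a": "B", "b": "B"}
tau = {"r": 0, "a": 2, "b": 3}
joint = [[("r", "a"), ("r", "b")]]    # one co-transmission of two strains
split = [[("r", "a")], [("r", "b")]]  # two separate transmissions
print(transmission_number(joint), co_transmission_number(joint))
print(transmission_number(split), co_transmission_number(split))
```

Both networks carry the same two transmitted strains, but the first groups them into a single event and is therefore the more parsimonious of the two.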
Note that all transmission networks that are consistent with (T, ℓ) have the same transmission number, but may have varying co-transmission numbers. Under the principle of parsimony, we may assume that transmissions and co-transmissions are rare, leading to the following optimization problem.
Problem 1 (ℓ-Transmission Network Inference (ℓ-TNI)) Given a timed phylogeny T with time-stamps τ and host labeling ℓ, find a transmission network N consistent with (T, ℓ) with minimum co-transmission number γ(N).
We consider the two criteria in lexicographical order, where the first criterion seeks to minimize the number of transmitted strains, whereas the second criterion seeks to minimize the number of transmission events. Thus, we assume that the transmission of additional strains is less likely than co-transmission events by an order of magnitude. We leave exploring the trade-off between the two criteria as future work. We note that the transmission number criterion was introduced previously by Slatkin and Maddison [19], while a time-invariant version of the co-transmission number has been applied to the analysis of migration in metastatic cancers [3,20]. Supplementary Table S2 provides nomenclature for topological features of transmission networks.
In practice, we do not observe a timed phylogeny T and host labeling ℓ. Rather, we obtain the genomic sequences of the strains present in individual hosts Σ. The set of extracted strains from each host forms the leaf set L(T) of an unknown timed phylogeny T. The function ℓ̂ : L(T) → Σ records the presence of strains in each host. As each host s ∈ Σ is removed from the population at time τr(s), we have identical time-stamps τr(s) for all strains u present in host s (i.e. τ(u) = τr(s) for all leaves u where ℓ̂(u) = s). In addition, based on epidemiological data, we have an entrance time τe(s) for each host s.
Fig. 1 shows an overview of the entities defined so far. Fig. 1a shows a timed phylogeny T with a leaf labeling ℓ̂ and three hosts with different entry and removal times. Figures 1b and 1c show two host labelings ℓb and ℓc respectively, both of which are consistent with the leaf labeling ℓ̂. Both host labelings ℓb and ℓc have the same transmission number μ = 2. Further, two transmission networks Nb and Nc are shown that are consistent with the host labelings ℓb and ℓc respectively. In this case, the transmission network Nc has a smaller co-transmission number γ = 2 and is therefore more parsimonious compared to Nb which has a co-transmission number of γ = 4.
The key challenge in phylodynamics is to infer a timed phylogeny T and host labeling ℓ given leaf set L(T), host-leaf labeling ℓ̂, entrance times τe and removal times τr. Various tools have been developed for the simpler task of inferring T given L(T) and τr [21–25]. Here, we focus on inferring a parsimonious transmission network N and host labeling ℓ given timed phylogeny T, host-leaf labeling ℓ̂, entrance times τe and removal times τr.
Problem 2 (Transmission Network Inference (TNI)) Given a timed phylogeny T with time-stamps τ, host-leaf labeling ℓ̂, entrance times τe and removal times τr, find a transmission network N and corresponding host labeling ℓ with minimum transmission number μ(N) = μ* and subsequently smallest co-transmission number γ(N) = γ* such that τ(u) ∈ [τe(s), τr(s)] for all hosts s and vertices u where ℓ(u) = s.
It is possible to define two counting versions of the above problem. The first counting problem seeks the number of transmission networks N with minimum transmission number μ(N) and subsequently smallest co-transmission number γ(N). The second counting problem seeks the number of host labelings ℓ that incur a transmission network N with minimum transmission number μ(N) and subsequently smallest co-transmission number γ(N). In this study, we restrict ourselves to the second version of the counting problem. Let ℒ be the set of host labelings that are solutions to the TNI problem (Problem 2). The counting problem, denoted as #TNI, is to find the cardinality of this set, denoted by |ℒ|. The corresponding sampling problem seeks to sample host labelings ℓ ∈ ℒ uniformly at random.
Complexity
The inclusion of the co-transmission number in the objective function renders the optimization and sampling versions of the TNI problem hard.
Complexity of the Optimization Problem
We have the following theorem.
TNI is NP-hard.
We prove this theorem by reduction from 3-SATISFIABILITY (3-SAT), which is NP-complete [26]. In 3-SAT, we are given a Boolean formula ϕ = ⋀i∈[k] (yi,1 ∨ yi,2 ∨ yi,3) with n variables {x1, …, xn} and k clauses in 3-conjunctive normal form (3-CNF). The task is to decide whether there exists a truth assignment θ : [n] → {0, 1} that satisfies all the clauses of ϕ. Without loss of generality, we may assume that each clause of ϕ consists of three distinct variables.
To relate literals to variables, we use the function ν : [k] × {1, 2, 3} → [n] such that ν(i, j) is the variable corresponding to literal yi,j. We define σ(i, j) to be 1 if yi,j is a positive literal (i.e. yi,j = xν(i,j)), otherwise σ(i, j) = 0 if yi,j is a negative literal (i.e. yi,j = ¬xν(i,j)). A truth assignment θ satisfies ϕ if for each clause i ∈ [k] there exists a j ∈ {1, 2, 3} such that σ(i, j) = θ(ν(i, j)).
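The satisfaction condition stated in terms of ν and σ translates directly into code. The sketch below stores ν and σ as dictionaries keyed by (i, j); this representation and the example formula are assumptions made for illustration.

```python
def satisfies(theta, nu, sigma, k):
    """True if, for every clause i, some literal j has sigma(i, j) == theta(nu(i, j))."""
    return all(
        any(sigma[i, j] == theta[nu[i, j]] for j in (1, 2, 3))
        for i in range(1, k + 1)
    )


# Example formula: (x1 v ~x2 v x3) ^ (~x1 v x2 v ~x3)
nu = {(1, 1): 1, (1, 2): 2, (1, 3): 3,
      (2, 1): 1, (2, 2): 2, (2, 3): 3}
sigma = {(1, 1): 1, (1, 2): 0, (1, 3): 1,
         (2, 1): 0, (2, 2): 1, (2, 3): 0}

print(satisfies({1: 1, 2: 1, 3: 0}, nu, sigma, k=2))
print(satisfies({1: 0, 2: 1, 3: 0}, nu, sigma, k=2))
```

The first assignment satisfies both clauses, whereas the second falsifies every literal of the first clause.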
Given ϕ, we construct a timed phylogeny T(ϕ) with leaf labeling ℓ̂ and time-stamps τ, τe, τr, as depicted in Fig. 2 and detailed below. We set Σ = {x1, …, xn, ¬x1, …, ¬xn, ⊥}. Let ε > 0 be a small positive constant. As for entry and removal time-stamps, we set τe(⊥) = τr(⊥) = 0, and τe(xi) = τe(¬xi) = ε and τr(xi) = τr(¬xi) = (2k + 3)ε for each variable i ∈ [n]. Timed phylogeny T(ϕ) is composed of k clause gadgets and n variable gadgets, each corresponding to a subtree that is directly attached to the root r(T(ϕ)). The root vertex has time-stamp τ(r(T(ϕ))) = 0. The leaves of T(ϕ) have identical time-stamps (2k + 3)ε. For each variable i ∈ [n], we have a subtree T[vari] whose root has time-stamp τ(r(T[vari])) = ε. The two children of r(T[vari]) have identical time-stamps 2ε, with one child leading to two leaves labeled by positive literal xi and the other child leading to two leaves labeled by negative literal ¬xi. Similarly, for each clause i ∈ [k], we have a subtree T[clausei]. The root of this subtree has time-stamp (2i + 1)ε and three children corresponding to the three literals of the clause. The three children have identical time-stamps (2i + 2)ε, each leading to two leaves labeled by the corresponding literal. Clearly, T(ϕ) can be obtained in polynomial time from ϕ. We refer to the supplement for the hardness proof (Supplementary Section 1.2). The supplement also shows how the reduction can be adapted to bifurcating timed phylogenies.
Complexity of Sampling
It would be desirable to sample solutions from ℒ, the set of host labelings ℓ with minimum transmission number and subsequently smallest co-transmission number, almost uniformly at random. Such a desirable algorithm is known as a fully polynomial almost uniform sampler (FPAUS). In general, an FPAUS for a sampling problem is a randomized algorithm that takes as input an instance x of the problem and a sampling tolerance δ > 0, and outputs a solution in time polynomial in |x| and log(1/δ) such that the difference of the probability distribution of solutions output by the algorithm and the uniform distribution on all solutions is at most δ [27].
Recall the complexity class RP (randomized polynomial), which is composed of decision problems that admit randomized polynomial time algorithms that return no if the correct answer is no and otherwise return yes with probability at least 1/2. Using our reduction from 3-SAT to TNI, the existence of an FPAUS to sample the solutions of TNI would imply an FPAUS for 3-SAT. This in turn would imply that RP=NP as 3-SAT is NP-complete.
There exists no FPAUS to sample solutions of TNI unless RP=NP.
Simulations
To show the efficiency of our method in sampling parsimonious transmission networks, we simulate outbreaks following the procedure described in [10]. We were unable to compare to existing methods, as our simulations consider timed phylogenies, which cannot be used as input for joint inference methods like SCOTTI [16], and have multiple samples per host, which are not supported by timed-phylogeny-based methods like TransPhylo [12]. However, to put the performance of our method in context, we use the naive sampling algorithm as a baseline method.
We employ a two-stage approach where we are given a number m of hosts, a transmission bottleneck size κ and additional epidemiological model parameters (Supplementary Section 1.5). First, we simulate a transmission process between m hosts using the SIR (Susceptible-Infectious-Recovered) epidemic model [28]. Under the SIR model, the outbreak begins with a single infected host and the remaining m − 1 individuals are infected from a unique host, each with at most κ co-transmitted strains. As such, the resulting transmission network N is a multi-tree. In the second phase, we simulate the evolution of the pathogens within each infected host using a simple coalescence model [29] with constant population size. Stitching together the resulting phylogenies according to N results in a single timed phylogeny T. We vary m ∈ {5, 10, 15, 20, 30} and κ ∈ {1, 2, 3}, with 5 instances for each combination, amounting to a total of 75 simulated instances. For each instance, we generate K = 11,000 samples using SharpTNI and the naive sampling algorithm.
To assess the counting and sampling accuracy of our method, we restrict our attention to a subset of simulated instances (where m ∈ {5, 10, 15, 20} and κ ∈ {1, 2}) that can be exhaustively enumerated using dynamic programming (Section Methods). We find that the approximate number of solutions inferred by SharpTNI is nearly identical to the actual number of solutions, with 69/75 instances having the correct number (Fig. 3a). Next, we compute for each solution ℓ in the solution set ℒ, the fraction of samples generated by SharpTNI that are identical to ℓ. Under uniformity, this relative frequency should be close to the expected sampling frequency of 1/|ℒ|. Indeed, Fig. 3b shows that the ratio between, respectively, the minimum and maximum relative frequency, and the expected sampling frequency is close to 1.
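The uniformity check described above can be sketched as follows, assuming each sampled host labeling is available as a hashable object (for instance a tuple of host labels) and that the exact number of solutions is known from enumeration. The function name and the treatment of never-sampled solutions (frequency 0) are illustrative choices, not taken from the paper.

```python
from collections import Counter


def uniformity_ratios(samples, num_solutions):
    """Return (min, max) relative sampling frequency divided by 1 / num_solutions."""
    counts = Counter(samples)
    expected = 1 / num_solutions
    freqs = [c / len(samples) for c in counts.values()]
    # A solution that was never drawn has relative frequency 0.
    lowest = 0.0 if len(counts) < num_solutions else min(freqs)
    return lowest / expected, max(freqs) / expected


print(uniformity_ratios(["a", "b", "a", "b"], num_solutions=2))
print(uniformity_ratios(["a", "a", "a", "b"], num_solutions=2))
```

Ratios close to 1 for both the minimum and the maximum indicate that no solution is noticeably over- or under-sampled.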
The ratio between the number of solutions to TNI and the number of solutions to the relaxed problem decreases exponentially with increasing number m of hosts and transmission bottleneck size κ, rendering the naive sampling algorithm impractical (Fig. 3c). Thus, we cannot expect the solutions obtained from the naive sampling algorithm to have the smallest co-transmission number γ. This in turn should lead to larger deviations from ground truth compared to SharpTNI. Indeed, defining recall as the fraction of labeled transmission edges in the ground truth host labeling ℓ* that are correctly inferred, we observe a large relative improvement in recall by SharpTNI compared to the naive sampling algorithm (Fig. 3d). We do not show precision, as it was identical to recall due to the ground truth transmission networks having minimum transmission number. Supplementary Fig. S3 shows the total wall time spent on an Intel Xeon 2.2 GHz processor, generating K = 11,000 samples for an instance with m = 30 and κ = 3 in under 10 hours with a single thread. Since the underlying SAT sampling problem is embarrassingly parallel, SharpTNI is able to leverage UniGen’s multi-threading capabilities to cut down this running time by a factor that is equal to the number of threads.
In summary, our simulations show that SharpTNI accurately and quickly counts and samples parsimonious transmission networks, outperforming the naive sampling algorithm.
Ebola 2014 Outbreak
To demonstrate the applicability of SharpTNI to real data, we infer parsimonious transmission networks among chiefdoms of Sierra Leone and Guinea during the 2014 Ebola outbreak [18]. The available data consist of 81 Ebola virus genomic sequences from 78 patients from Sierra Leone and 3 patients from Guinea, with metadata that include sampling date and the chiefdom where the sample was collected. There are a total of 14 Sierra Leonean chiefdoms in the data (with one chiefdom designated as unknown). Together with Guinea, this yields m = 15 possible host labels for each node in the timed phylogeny of the 81 genomic sequences.
Comparison to SCOTTI
We first run SCOTTI [16], which is a Bayesian approach to co-estimate a timed phylogeny and transmission network using Markov chain Monte Carlo (MCMC). We run SCOTTI for 5 × 10^6 MCMC iterations with a burn-in percentage of 10%. We draw 100 samples of host-labeled timed phylogenies from the resulting posterior distribution. To compare the host labelings inferred by SCOTTI to those inferred by SharpTNI, we set the entry time τe and removal time τr for each host equal to the time-stamps of the first and the last node labeled by the host in that SCOTTI tree. Fig. 4a shows that the transmission numbers μ of the host labelings inferred by SCOTTI and SharpTNI are comparable, but that the minimum co-transmission numbers incurred by the host labelings inferred by SharpTNI are significantly smaller than those obtained using SCOTTI. This shows that SharpTNI infers a more parsimonious transmission network compared to SCOTTI.
To further illustrate this point, we pick an instance where both methods inferred host labelings with the same transmission number μ = 24 but a co-transmission number of γ = 20 for SCOTTI and γ = 19 for SharpTNI. The transmission networks are nearly identical, except for the infection between Luawa and Jawie (Fig. 5). Notice that in both the networks, Luawa is infected by both Kissi Teng and Jawie. However, SCOTTI infers a re-infection from Luawa to Jawie whereas SharpTNI infers a transmission network with no re-infection event while keeping the transmission number the same. This leads to a simpler and more parsimonious transmission network.
Re-analysis using BEAST and SharpTNI
We now re-analyze the same data using BEAST [22] to infer a timed phylogeny followed by SharpTNI to infer a transmission history. Similarly to [18], we run BEAST (version 2) for 10^6 MCMC iterations with a burn-in percentage of 10%. Supplementary Fig. S5 shows the resulting Maximum Clade Credibility (MCC) consensus tree, which resembles the tree reported in [18]. We assume that a transmission from a chiefdom is possible from three weeks prior to the first sample until three weeks following the last sample collected from the chiefdom, which is in line with reported Ebola incubation periods [30]. In addition, we allow one unsampled host in our inference with an entry and removal time that covers the entire outbreak period. Since more than 70% of the patients diagnosed in Sierra Leone were sampled, the unsampled host is most likely from Guinea. Out of a total of 324 host labelings with minimum transmission number μ* = 26, SharpTNI identifies 9 transmission networks with minimum co-transmission number γ* = 19 (Supplementary Fig. S6).
Gire et al. [18] hypothesize that the Sierra Leone outbreak stemmed from the introduction of two genetically distinct viruses from Guinea around the same time. This is because the first 12 Ebola virus disease (EVD) patients in Sierra Leone were all believed to have attended a funeral of an EVD case from Guinea and the samples from these patients fell into two distinct clusters according to their analysis. SharpTNI corroborates this hypothesis, i.e. all 9 parsimonious transmission networks (with γ* = 19) contain a co-transmission of two strains from an unsampled host (most likely from Guinea as discussed above) to Kissi Tengi, a chiefdom located on the border of Sierra Leone and Guinea. By contrast, the majority (216/324) of host labelings that have minimum transmission number but not the smallest co-transmission number do not identify this co-transmission (Supplementary Fig. S7). This example highlights the utility of SharpTNI’s ability to analyze outbreaks under a weak bottleneck.
Discussion and Conclusions
This paper introduces the Transmission Network Inference (TNI) problem for estimating a parsimonious transmission network under a weak transmission bottleneck given a timed phylogeny. Weak transmission bottlenecks arise in phylogeographic analyses of disease outbreaks as well as phylodynamics analyses of pathogens with high mutation rates, long incubation times or chronic infections. After establishing hardness of the optimization and sampling versions of the TNI problem, we present SharpTNI, a novel method for counting and sampling the solution space. The hardness of the counting problem #TNI remains open, whereas the given reduction may be used to show #P-completeness when the co-transmission number is fixed. Our method leverages recent progress in approximate counting and sampling of SATISFIABILITY [31–34]. We envision that other previously considered counting [35–39] and sampling [31,32,40] problems in computational biology can benefit similarly.
In the future, we plan to extend the current framework to co-estimation of the timed phylogeny and the transmission network by formulating a maximum likelihood version of TNI. In such a likelihood-based model, we will consider the time of transmission relative to known characteristics of the pathogen (such as incubation time). Moreover, we may assign higher likelihood to reciprocal transmissions between the same pair of hosts. In addition, we will support additional constraints such as contact maps, bottleneck sizes and other epidemiological constraints. Finally, we wish to study the problem of deriving one or more consensus transmission networks from the solution space, akin to our recent work in cancer genomics [41].
Methods
We number the vertices of a timed phylogeny T from 1 to n, i.e. V (T) = {v1, …, vn}. Similarly, we number the hosts from 1 to m, i.e. Σ = {1, …, m}.
Polynomial Time Algorithm for ℓ-TNI
In the ℓ-TNI problem, we seek a transmission network N consistent with a given (T, ℓ) with minimum co-transmission number γ(N). Let Vs,t be a list of edges (u, v) of T where ℓ(u) = s and ℓ(v) = t sorted in ascending order by time-stamp τ (v) of the target vertex v (ties may be broken arbitrarily). In the following, we show that the ℓ-TNI problem can be reduced to vertex partitioning problems of an interval graph, each of which can be solved by a simple greedy algorithm in time linear in |Vs,t| [42].
For each pair (s, t) of distinct hosts (where s < t), we construct the interval graph Gs,t with vertex set Vs,t and an edge between (u, v) and (u′, v′) if the corresponding time intervals [τ(u), τ(v)] and [τ(u′), τ(v′)] overlap. By construction, a clique in Gs,t forms a set of transmission edges that can be part of the same transmission event. Thus, the minimum co-transmission number for the host pair (s, t) is given by the smallest number of cliques that cover all the nodes in the interval graph. Applying the algorithm described in Ref. [42], we compute such a minimum cardinality clique partition in O(|Vs,t|) time by greedily removing the maximal clique that contains the first available edge until the graph is empty (Fig. 6). Constructing the ordered sequences Vs,t requires O(n log n) time, which dominates the overall running time.
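The greedy clique partition can be sketched as a single sweep over the edges of each host pair, sorted by the time-stamp of the target vertex. The code below is an illustrative implementation under two assumptions: edges are grouped by ordered (source host, target host) pair, as required by the definition of a transmission event, and inputs are given as hypothetical dictionaries `host` and `tau`. It is not the authors' code.

```python
from collections import defaultdict


def min_co_transmission_number(edges, host, tau):
    """Greedily partition transmission edges into a minimum number of events.

    Returns (gamma, events), where each event is a list of edges whose
    intervals [tau(u), tau(v)] all contain a common time point.
    """
    by_pair = defaultdict(list)
    for u, v in edges:
        if host[u] != host[v]:                # keep transmission edges only
            by_pair[host[u], host[v]].append((u, v))

    events = []
    for pair_edges in by_pair.values():
        pair_edges.sort(key=lambda e: tau[e[1]])   # ascending tau(v)
        stab = None                                # earliest end of current event
        for u, v in pair_edges:
            if stab is None or tau[u] > stab:
                events.append([])                  # interval starts too late: new event
                stab = tau[v]
            events[-1].append((u, v))
    return len(events), events


# Toy instance: three transmissions from A to B at different times.
host = {"u1": "A", "v1": "B", "u2": "A", "v2": "B", "u3": "A", "v3": "B"}
tau = {"u1": 0, "v1": 2, "u2": 1, "v2": 3, "u3": 4, "v3": 5}
edges = [("u1", "v1"), ("u1", "u2"), ("u2", "v2"), ("u2", "u3"), ("u3", "v3")]
gamma, events = min_co_transmission_number(edges, host, tau)
print(gamma, events)
```

Every edge placed in an event contains the event's stabbing time point, so the edges of an event overlap pairwise. Sorting dominates the running time, and the sweep itself is linear.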
Relaxation of TNI
To obtain a randomized algorithm for TNI, we consider a relaxation where we are interested in all host labelings ℓ that admit transmission networks N with minimum transmission number μ(N) and any co-transmission number γ(N). While the TNI problem, where we additionally require γ(N) = γ*, is NP-hard, the relaxed problem can be solved in polynomial time using dynamic programming. In the following, we describe how to solve the optimization, enumeration, counting and sampling versions of this relaxed problem.
Optimization
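Let f[v, s] be the minimum transmission number of the subtree Tv rooted at vertex v that can be attained when labeling vertex v by host s, i.e. ℓ(v) = s. The following recurrence defines f[v, s].

$$f[v, s] = \begin{cases} \infty, & \text{if } \tau(v) \notin I(s),\\ \infty, & \text{if } v \in L(T) \text{ and } s \neq \hat{\ell}(v),\\ 0, & \text{if } v \in L(T) \text{ and } s = \hat{\ell}(v),\\ \sum_{u \in \delta(v)} \min_{t \in \Sigma} \{ c(s, t) + f[u, t] \}, & \text{otherwise,} \end{cases}$$

where I(s) = [τe(s), τr(s)], and c(s, t) = 0 if s = t and c(s, t) = 1 otherwise. The above recurrence is an adaptation of the recurrence used in the Sankoff algorithm for the small phylogeny maximum parsimony problem [43, 44]. We compute f bottom up from the leaves L(T) to the root vertex r(T) of T in O(nm) time (Supplementary Algorithm S1). The minimum transmission number μ* is given by

$$\mu^* = \min_{s \in \Sigma} f[r(T), s].$$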
Fig. 7a shows an example of the recurrence of f [v, s] on a timed phylogeny.
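A bottom-up computation of f can be sketched as follows. The sketch assumes a Sankoff-style recurrence in which f[v, s] is infinite when τ(v) lies outside I(s) or when a leaf is labeled by a host other than its observed host, zero for a correctly labeled leaf, and otherwise the sum over children of the cheapest child label, where changing host along an edge costs 1. All function and argument names are hypothetical.

```python
import math


def min_transmission_table(children, root, tau, leaf_host, entry, removal):
    """Return (f, mu_star) for a timed phylogeny given as a children dictionary."""
    hosts = list(entry)
    f = {}

    def visit(v):  # plain recursion; adequate for small trees
        kids = children.get(v, [])
        for u in kids:
            visit(u)
        # Cheapest label of each child, regardless of the parent's label.
        best = {u: min(f[u, t] for t in hosts) for u in kids}
        for s in hosts:
            if not entry[s] <= tau[v] <= removal[s]:
                f[v, s] = math.inf            # host s is not present at time tau(v)
            elif not kids:
                f[v, s] = 0 if leaf_host[v] == s else math.inf
            else:
                # Either keep host s along the edge (cost 0) or switch (cost 1).
                f[v, s] = sum(min(f[u, s], 1 + best[u]) for u in kids)

    visit(root)
    return f, min(f[root, s] for s in hosts)


# Toy instance: leaves x (host A), y and z (host B).
children = {"r": ["a", "x"], "a": ["y", "z"]}
tau = {"r": 0, "a": 2, "x": 4, "y": 4, "z": 4}
leaf_host = {"x": "A", "y": "B", "z": "B"}
entry = {"A": 0, "B": 0}
removal = {"A": 4, "B": 4}
f, mu_star = min_transmission_table(children, "r", tau, leaf_host, entry, removal)
print(mu_star)
```

Precomputing the cheapest label of each child keeps the work per vertex-host pair constant per child, which gives O(nm) time overall under the unit-cost assumption.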
Enumeration
We now identify vertex-host pairs (v, s) that are part of minimum transmission host labelings, indicated by g[v, s] = 1. We define g[v, s] as

$$g[v, s] = \begin{cases} 1, & \text{if } v = r(T) \text{ and } f[v, s] = \mu^*,\\ 1, & \text{if } v \neq r(T) \text{ and } s \in \Gamma((\pi(v), v), t) \text{ for some } t \in \Sigma \text{ with } g[\pi(v), t] = 1,\\ 0, & \text{otherwise,} \end{cases}$$

where Γ((u, v), s) is the set of host labels of vertex v that are part of minimum transmission host labelings ℓ where the parent vertex u is labeled by host s, i.e. Γ((u, v), s) = {t ∈ Σ | c(s, t) + f[v, t] = mint′∈Σ {c(s, t′) + f[v, t′]}}. We note that g can be computed in a top down fashion in O(nm) time (Supplementary Algorithm S2), whereas Γ can be computed in O(m) time. Using g and Γ, we enumerate all minimum transmission host labelings of T (Supplementary Algorithm S3 and S4).
Counting
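Next, we consider the counting version of this problem. The number of minimum transmission host labelings can also be computed using dynamic programming [45]. Let h[v, s] denote the number of minimum transmission labelings in the subtree Tv of T rooted at vertex v when ℓ(v) = s. We define h[v, s] recursively as

$$h[v, s] = \begin{cases} 0, & \text{if } f[v, s] = \infty,\\ 1, & \text{if } v \in L(T) \text{ and } f[v, s] < \infty,\\ \prod_{u \in \delta(v)} \sum_{t \in \Gamma((v, u), s)} h[u, t], & \text{otherwise.} \end{cases}$$

The total number of solutions is given by

$$\sum_{s \in \Sigma \,:\, g[r(T), s] = 1} h[r(T), s].$$

Directly translating the above recurrence into a recursive function results in an O(nm) time algorithm. Fig. 7b shows an example of the recurrence of h[v, s] on a timed phylogeny.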
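The counting recurrence can be sketched in the same style. The code assumes that the table f has already been computed (here it is written out by hand for a toy tree), that h[v, s] is 0 when f[v, s] is infinite and 1 for a feasible leaf, and that an internal vertex multiplies, over its children, the number of labelings reachable through the allowed child labels Γ. The names are illustrative, and this straightforward version spends O(m) time per Γ lookup rather than matching the stated O(nm) bound.

```python
import math

INF = math.inf


def count_labelings(children, root, hosts, f):
    """Return (h, total), the per-subtree counts and the number of optimal labelings."""
    h = {}

    def allowed(u, s):
        # Gamma((v, u), s): child labels that attain the minimum cost below label s.
        best = min((s != t) + f[u, t] for t in hosts)
        return [t for t in hosts if (s != t) + f[u, t] == best]

    def visit(v):
        kids = children.get(v, [])
        for u in kids:
            visit(u)
        for s in hosts:
            if math.isinf(f[v, s]):
                h[v, s] = 0
            elif not kids:
                h[v, s] = 1
            else:
                h[v, s] = math.prod(
                    sum(h[u, t] for t in allowed(u, s)) for u in kids)

    visit(root)
    mu_star = min(f[root, s] for s in hosts)
    total = sum(h[root, s] for s in hosts if f[root, s] == mu_star)
    return h, total


# Toy instance with hand-computed f: leaves x (host A), y and z (host B).
hosts = ["A", "B"]
children = {"r": ["a", "x"], "a": ["y", "z"]}
f = {("x", "A"): 0, ("x", "B"): INF, ("y", "A"): INF, ("y", "B"): 0,
     ("z", "A"): INF, ("z", "B"): 0, ("a", "A"): 2, ("a", "B"): 0,
     ("r", "A"): 1, ("r", "B"): 1}
h, total = count_labelings(children, "r", hosts, f)
print(total)
```

In this toy instance the two optimal labelings differ only in the host assigned to the root.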
Sampling
Using the count matrix h[u, s], we introduce a subroutine that takes a vertex u and host s as input, and uniformly samples a host labeling ℓu of subtree Tu rooted at u subject to the restriction that ℓu(u) = s (Supplementary Algorithm S5). Supplementary Section 1.3 gives a correctness proof of our algorithm.
Let Σ* = {s1, …, sk} be the set of hosts of the root vertex r(T) that are part of minimum transmission labelings, i.e. Σ* = {s ∈ Σ | g[r(T), s] = 1}. The fraction ps of minimum transmission host labelings ℓ where ℓ(r(T)) = s equals h[r(T), s] / Σs′∈Σ* h[r(T), s′]. Thus, to sample all minimum transmission host labelings uniformly at random, we draw a host s ∈ Σ* according to the categorical probability distribution defined by (ps1, …, psk). Supplementary Algorithm S6 is then used on T with ℓ(r(T)) = s to sample a minimum transmission host labeling ℓ of T uniformly at random. This takes O(nm) time per sample.
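The top-down sampling step can be sketched as follows, assuming the tables f and h are available (they are written out by hand for a toy tree below). The root host is drawn with probability proportional to h, and each child label is then drawn from the allowed set Γ with probability proportional to its subtree count. This is an illustrative sketch with hypothetical names, not the supplementary algorithm itself.

```python
import math
import random

INF = math.inf


def sample_labeling(children, root, hosts, f, h, rng):
    """Draw one minimum transmission host labeling uniformly at random."""
    mu_star = min(f[root, s] for s in hosts)
    root_hosts = [s for s in hosts if f[root, s] == mu_star]
    weights = [h[root, s] for s in root_hosts]
    label = {root: rng.choices(root_hosts, weights=weights)[0]}

    stack = [root]
    while stack:
        v = stack.pop()
        s = label[v]
        for u in children.get(v, []):
            # Gamma((v, u), s): child labels attaining the minimum cost below s.
            best = min((s != t) + f[u, t] for t in hosts)
            allowed = [t for t in hosts if (s != t) + f[u, t] == best]
            label[u] = rng.choices(allowed, weights=[h[u, t] for t in allowed])[0]
            stack.append(u)
    return label


# Toy instance with hand-computed tables: leaves x (host A), y and z (host B).
hosts = ["A", "B"]
children = {"r": ["a", "x"], "a": ["y", "z"]}
f = {("x", "A"): 0, ("x", "B"): INF, ("y", "A"): INF, ("y", "B"): 0,
     ("z", "A"): INF, ("z", "B"): 0, ("a", "A"): 2, ("a", "B"): 0,
     ("r", "A"): 1, ("r", "B"): 1}
h = {("x", "A"): 1, ("x", "B"): 0, ("y", "A"): 0, ("y", "B"): 1,
     ("z", "A"): 0, ("z", "B"): 1, ("a", "A"): 1, ("a", "B"): 1,
     ("r", "A"): 1, ("r", "B"): 1}
print(sample_labeling(children, "r", hosts, f, h, random.Random(0)))
```

Weighting each choice by the number of completions below it is what makes the overall draw uniform over complete labelings rather than uniform over local choices.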
Naive sampling algorithm
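To identify host labelings with minimum transmission number and subsequently smallest co-transmission number, we may repeatedly generate a uniformly random sample using the above algorithm and retain only those host labelings that have smallest co-transmission number. The success probability of this naive sampling algorithm is

$$1 - \left(1 - \frac{|\mathcal{L}|}{\sum_{s \in \Sigma^*} h[r(T), s]}\right)^{K},$$

where K is the number of repetitions, ℒ is the set of solutions to TNI and the denominator is the number of solutions to the relaxed problem.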
Solving TNI via SAT
We focus our attention on a decision version of the general TNI problem: is there a host labeling ℓ that admits a transmission network N with transmission number μ(N) = μ* and co-transmission number γ(N) = α, where α ∈ ℕ? Since γ* ∈ {|Σ| − 1, …, |E(T)|}, we may solve the optimization problem of finding N with minimum γ(N) = γ* by initially setting α = |Σ| − 1 = m − 1 and incrementing α until the decision problem has a yes-answer or α = |E(T)| = n − 1.
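The search over α is a simple linear scan. The sketch below assumes a hypothetical oracle `is_feasible(alpha)` that answers the decision problem for a given α, for instance by building the Boolean formula and calling a SAT solver; no particular solver interface is implied.

```python
from typing import Callable, Optional


def smallest_feasible_alpha(is_feasible: Callable[[int], bool],
                            m: int, n: int) -> Optional[int]:
    """Return the smallest alpha in {m - 1, ..., n - 1} accepted by the oracle."""
    for alpha in range(m - 1, n):   # alpha = m - 1, ..., n - 1
        if is_feasible(alpha):
            return alpha
    return None                      # no alpha in the range admits a solution


# Stand-in oracle for illustration: pretend every alpha >= 19 is feasible.
print(smallest_feasible_alpha(lambda alpha: alpha >= 19, m=15, n=161))
```

Each oracle call is a separate SAT instance, so the scan stops at the first yes-answer.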
In the following, we will show how to reduce a TNI instance to a Boolean formula ϕ. To facilitate almost uniform sampling and approximate counting, we require that there is a bijection between the solutions to TNI instance and the corresponding SAT instance ϕ. As such, we must introduce variables and constraints that encode (i) a host labeling ℓ, (ii) ℓ has minimum transmission number μ*, (iii) ℓ admits a transmission network N with co-transmission number γ(N) = α and (iv) uniqueness of N given ℓ.
For clarity, we will not present constraints in clause normal form (CNF). Rather, we refer the reader to Supplementary Section 1.4 for a CNF representation of ϕ with O(n2 + nm + nα) variables and O(nm2α2 + n2m2 + n2α2) clauses.
Host labeling
Variables x ∈ {0, 1}^(n×m) encode a host labeling. That is, xi,s = 1 if vertex vi is labeled by host ℓ(vi) = s, and xi,s = 0 otherwise. To encode a host labeling, we introduce the following constraints for all vertices vi ∈ V (T).
The function onehot(X) encodes that exactly one binary variable x ∈ X is true, which can be accomplished by the following constraint.
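As an illustration, the sketch below builds an exactly-one constraint with the standard pairwise encoding (one "at least one" clause plus one "at most one" clause per pair). This particular encoding and the variable numbering `x_var` are assumptions made for the example; the encoding used by SharpTNI is the one given in Supplementary Section 1.4.

```python
from itertools import combinations


def onehot(variables):
    """CNF clauses (lists of signed integers) stating that exactly one variable is true."""
    at_least_one = [list(variables)]
    at_most_one = [[-a, -b] for a, b in combinations(variables, 2)]
    return at_least_one + at_most_one


def x_var(i, s, m):
    """Hypothetical numbering of x_{i,s}: vertices i = 0..n-1, hosts s = 0..m-1."""
    return i * m + s + 1


def host_labeling_clauses(n, m):
    """One onehot constraint per vertex, so every vertex gets exactly one host."""
    clauses = []
    for i in range(n):
        clauses += onehot([x_var(i, s, m) for s in range(m)])
    return clauses


print(onehot([1, 2, 3]))
print(len(host_labeling_clauses(n=2, m=3)))
```

The pairwise encoding needs no auxiliary variables, which keeps the correspondence between labelings and satisfying assignments one-to-one.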
Minimum transmission number μ*
Next, we need to ensure that x encodes a host labeling with minimum transmission number μ*. To this end, we use the functions f, g and Γ defined in the previous section. First, we prevent labeling a vertex vi by a host s if this is not part of a minimum transmission host labeling (i.e., g[vi, s] = 0). That is, for all vertex-host pairs (vi, s) ∈ V (T) × [m] where g[vi, s] = 0, we have
Labeling a vertex vi by host s restricts the set of hosts for each child vj of vi to Γ((vi, vj), s). Thus, for all edges (vi, vj) ∈ E(T) and hosts s ∈ [m], we have
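Read as Boolean conditions over x, the two requirements above could be written as follows. This is a sketch of one natural reading of the description, not a transcription of the authors' equations; the exact CNF form is in Supplementary Section 1.4.

```latex
\begin{align*}
  &\neg x_{i,s}
    && \text{for all } (v_i, s) \in V(T) \times [m] \text{ with } g[v_i, s] = 0, \\
  &x_{i,s} \Rightarrow \bigvee_{t \in \Gamma((v_i, v_j), s)} x_{j,t}
    && \text{for all } (v_i, v_j) \in E(T),\ s \in [m].
\end{align*}
```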
Transmission network
We now need to encode that x admits a transmission network N with co-transmission number γ(N) = α. We order the edges E(T) = {e1, …, en−1} in ascending order by the time-stamp of the target vertex, breaking ties arbitrarily. We introduce a variable cij,kl for each pair (i, j), (k, l) of distinct edges where (i, j) < (k, l). We require cij,kl = 1 if and only if (i, j) and (k, l) are transmission edges between the same pair of hosts with overlapping time intervals. This is achieved by the following three sets of constraints. First, we have cij,kl = 0 for all edge pairs (i, j) < (k, l) that do not have overlapping time intervals, i.e. [τ(vi), τ(vj)] ∩ [τ(vk), τ(vl)] = ∅. Second, we have that cij,kl = 0 for all edges (i, j) < (k, l) where ℓ(vi) = ℓ(vj) or ℓ(vk) = ℓ(vl). That is, for all edge pairs (i, j) < (k, l) and hosts s ∈ [m], we have
Third, cij,kl = 1 if (i, j) and (k, l) are transmission edges between the same pair of hosts with overlapping time intervals. That is, for all pairs (i, j) < (k, l) of distinct edges with overlapping time intervals, i.e. [τ(vi), τ(vj)] ∩ [τ(vk), τ(vl)] ≠ ∅, and hosts s, t ∈ [m] where 1 ≤ s < t ≤ m, we have
We now introduce variables y ∈ {0, 1}^((n−1)×α) such that yij,p = 1 if and only if (i, j) is a transmission edge and assigned to transmission event p. We require that each transmission edge (i, j) is assigned to exactly one transmission event. That is, for all edges (i, j) and distinct hosts s < t, we have
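The two purely temporal ingredients of these constraints, the edge ordering and the interval overlap test, can be sketched as below. The dictionary `tau` of vertex time-stamps and the toy tree are hypothetical; intervals are treated as closed, matching the notation [τ(vi), τ(vj)].

```python
def order_edges(edges, tau):
    """Sort edges (u, v) by the time-stamp of the target vertex v.

    Python's sort is stable, so ties keep their input order (an arbitrary but fixed choice).
    """
    return sorted(edges, key=lambda edge: tau[edge[1]])


def intervals_overlap(e, f, tau):
    """True if the closed intervals [tau(u), tau(v)] and [tau(k), tau(l)] intersect."""
    (u, v), (k, l) = e, f
    return max(tau[u], tau[k]) <= min(tau[v], tau[l])


# Toy timed tree (hypothetical time-stamps).
tau = {"r": 0, "a": 2, "b": 1, "c": 3, "d": 4}
edges = [("r", "a"), ("r", "b"), ("a", "c"), ("b", "d")]
print(order_edges(edges, tau))
print(intervals_overlap(("r", "a"), ("b", "d"), tau))
print(intervals_overlap(("r", "b"), ("a", "c"), tau))
```

Only pairs that pass the overlap test can ever have cij,kl = 1, so this check can be done once, before any clause is generated.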
Next, if (i, j) is not a transmission edge then it must not be assigned to any transmission event p. That is, for all edges (i, j), hosts s and transmission events p ∈ [α], we have
Finally, edges (i, j) < (k, l) that are not time-overlapping transmission edges between the same pair of hosts (i.e. cij,kl = 0) must not be assigned to the same transmission event p ∈ [α]. That is, for all distinct edges (i, j) < (k, l) and transmission events p ∈ [α], we have
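The last two conditions, that non-transmission edges receive no event and that incompatible edges never share an event, could be expressed as the implications below. This is a sketch derived from the verbal description only and may differ in form from the authors' constraints.

```latex
\begin{align*}
  &(x_{i,s} \wedge x_{j,s}) \Rightarrow \neg y_{ij,p}
    && \text{for all } (i, j) \in E(T),\ s \in [m],\ p \in [\alpha], \\
  &\neg c_{ij,kl} \Rightarrow (\neg y_{ij,p} \vee \neg y_{kl,p})
    && \text{for all } (i, j) < (k, l),\ p \in [\alpha].
\end{align*}
```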
Uniqueness
To ensure bijectivity between the set of satisfying assignments of ϕ and the set of host labelings ℓ that admit a transmission network N with transmission number μ(N) = μ* and co-transmission number γ(N) = α, we require that each host labeling ℓ encodes a unique transmission network N. To that end, we introduce constraints that will pick the exact same transmission network N given ℓ as the greedy algorithm described in Section Methods. Specifically, each transmission edge (k, l) must be assigned to the same transmission event p as the first transmission edge (i, j) that overlaps in time and hosts with (k, l) (i.e. cij,kl = 1). That is, for all edges (i, j) < (k, l) and transmission events p ∈ [α], we have
While variables x uniquely determine variables c, they do not uniquely determine variables y as there exist α! permutations of the α transmission events. To break this symmetry, we use the edge ordering E(T) = {e1, …, en−1} to designate the smallest transmission edge of each transmission event p as its representative. We require that these representatives are assigned to transmission events according to the edge ordering. Specifically, we introduce variables z ∈ {0, 1}^(n−1) such that zij = 1 if and only if edge (i, j) is a representative transmission edge of some transmission event.
We impose the forward direction of the bi-implication by modeling the contrapositive using the following two sets of constraints. First, if edges (i, j) < (k, l) are assigned to the same transmission event p then edge (k, l) cannot be a representative. That is, for all distinct edges (i, j) < (k, l) and transmission events p ∈ [α], we have
Second, if an edge (i, j) is not a transmission edge then it cannot be a representative. That is, for all edges (i, j) and hosts s, we have
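In implication form, the two sets of constraints for the forward direction could read as follows. This is a sketch based on the description above, not a transcription of the authors' equations.

```latex
\begin{align*}
  &(y_{ij,p} \wedge y_{kl,p}) \Rightarrow \neg z_{kl}
    && \text{for all } (i, j) < (k, l),\ p \in [\alpha], \\
  &(x_{i,s} \wedge x_{j,s}) \Rightarrow \neg z_{ij}
    && \text{for all } (i, j) \in E(T),\ s \in [m].
\end{align*}
```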
To model the reverse direction, we have for all edges (i, j) < (k, l) and transmission events p ∈ [α]
Finally, we require that representatives are ordered correctly. For all representatives (i, j) < (k, l) where (i, j) is assigned to transmission event q, it cannot be that (k, l) is assigned to a transmission event p < q. That is, for all edges (i, j) < (k, l) and transmission events p < q, we have
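The effect of this symmetry breaking can be illustrated outside of SAT: among all relabelings of the transmission events, exactly one numbers the events by the position of their smallest edge. The function below computes that canonical numbering together with the representative flags z. The list-based input format is a hypothetical choice for this sketch.

```python
def canonical_events(assignment):
    """Renumber events by the order in which their smallest edge appears.

    assignment[e] is the event of the e-th edge in the edge ordering,
    or None if that edge is not a transmission edge.
    Returns the relabeled assignment and a list of representative flags.
    """
    relabel = {}
    canonical, representative = [], []
    for event in assignment:
        if event is None:
            canonical.append(None)
            representative.append(False)
            continue
        is_first = event not in relabel  # first edge seen for this event
        if is_first:
            relabel[event] = len(relabel)
        canonical.append(relabel[event])
        representative.append(is_first)
    return canonical, representative


# Two assignments that differ only by a permutation of event labels.
print(canonical_events([2, None, 0, 2, 1]))
print(canonical_events([1, None, 2, 1, 0]))
```

Both inputs map to the same canonical numbering, which is the property the ordering constraints on representatives are meant to enforce inside ϕ.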
Approximate counting and almost uniform sampling
Now that we have a SAT formula ϕ, we turn to the related problems of approximate counting and almost uniform sampling of its solution space [46]. We use ApproxMC [33, 34] to approximate the number of satisfying assignments of ϕ and UniGen [31, 32] to sample almost uniformly from them. We call the resulting method SharpTNI.
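Both tools read formulas in the DIMACS CNF format. The sketch below serializes a clause list to DIMACS and, for tiny formulas only, counts satisfying assignments exactly by brute force, which is the quantity ApproxMC approximates at scale. The helper names are hypothetical, and no ApproxMC or UniGen call is shown.

```python
from itertools import product


def to_dimacs(num_vars, clauses):
    """Serialize clauses (lists of signed integers) as a DIMACS CNF string."""
    lines = [f"p cnf {num_vars} {len(clauses)}"]
    lines += [" ".join(str(lit) for lit in clause) + " 0" for clause in clauses]
    return "\n".join(lines) + "\n"


def count_models(num_vars, clauses):
    """Exact model count by enumeration; exponential in num_vars, for illustration only."""
    count = 0
    for values in product([False, True], repeat=num_vars):
        if all(any((lit > 0) == values[abs(lit) - 1] for lit in clause) for clause in clauses):
            count += 1
    return count


# Exactly-one constraint over three variables: three satisfying assignments.
clauses = [[1, 2, 3], [-1, -2], [-1, -3], [-2, -3]]
print(to_dimacs(3, clauses))
print(count_models(3, clauses))
```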
List of Abbreviations
- TNI: Transmission Network Inference
- SAT: Satisfiability
- CNF: Conjunctive Normal Form
- FPAUS: Fully Polynomial Almost Uniform Sampler
- RP: Randomized Polynomial
- NP: Nondeterministic Polynomial
- SIR: Susceptible-Infectious-Recovered
- MCC: Maximum Clade Credibility
- EVD: Ebola Virus Disease
Declarations
Competing interests
The authors declare no competing interests.
Authors' contributions
M.E-K. conceived the project. P.S. developed the code and performed the experimental evaluation. P.S. and M.E-K. wrote the manuscript. All authors read and approved the final manuscript.
Funding
Publication was funded by the National Science Foundation (CCF 18-50502).
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable.
Availability of data and materials
Ebola and simulated data used in the results section are available at https://github.com/elkebir-group/SharpTNI/tree/master/data. Results generated using these data are available at https://doi.org/10.13012/B2IDB-9734610_V1.
Supplementary Materials
Background and Problem Statement — Fig. S1, Tables S1 and S2
Complexity Proof — Lemmas 1 to 4
Algorithms for Relaxed TNI Problem — Algorithms S1 to S6
CNF form of the SAT formulation — Eqs. 2 to 19, Table S3
Outbreak Simulation Details — Text
Simulated and Ebola Outbreak Analysis — Figs. S3 to S7, Table S4
Acknowledgements
M.E-K. was supported by the National Science Foundation (CCF 18-50502). Experiments were run on Blue Waters, which is a joint effort of the University of Illinois at Urbana-Champaign and the National Center for Supercomputing Applications. The authors thank the anonymous referees for insightful comments that have improved the manuscript.