## Abstract

**Motivation** The combination of genomic and epidemiological data hold the potential to enable accurate pathogen transmission history inference. However, the inference of outbreak transmission histories remains challenging due to various factors such as within-host pathogen diversity and multi-strain infections. Current computational methods ignore within-host diversity and/or multi-strain infections, often failing to accurately infer the transmission history. Thus, there is a need for efficient computational methods for transmission tree inference that accommodate the complexities of real data.

**Results** We formulate the Direct Transmission Inference (DTI) problem for inferring transmission trees that support multi-strain infections given a timed phylogeny and additional epidemiological data. We establish hardness for the decision and counting version of the DTI problem. We introduce TiTUS, a method that uses SATISFIABILITY to almost uniformly sample from the space of transmission trees. We introduce criteria that prioritizes parsimonious transmission trees that we subsequently summarize using a novel consensus tree approach. We demonstrate TiTUS’s ability to accurately reconstruct transmission trees on simulated data as well as a documented HIV transmission chain.

**Availability** https://github.com/elkebir-group/TiTUS

**Contact** melkebir{at}illinois.edu

**Supplementary information** Supplementary data are available at *Bioinformatics* online.

## 1 Introduction

With the advent of cheaper and more powerful sequencing methods, molecular epidemiology has become an indispensable tool for inference of transmission histories of infectious disease outbreaks. Genomic data of pathogen isolates collected from infected hosts is used to assist with the identification of unknown infection sources and transmission chains. Intensive field work generates crucial epidemiological data that provides addition information such as contact history between patients and exposure times of the patients to sources of infection. Methods that can efficiently use genomic and epidemiological data together for accurate inference of transmission history of outbreaks are the key to real-time outbreak management and devising public health policies and disease control strategies for future outbreaks (Dellicour *et al*., 2018).

There are several challenges that hinder the accurate inference of the transmission history of an outbreak. Phylogeny estimation of the pathogen isolates reveals the evolutionary history of the pathogen during the outbreak. However, due to within-host diversity of the pathogen, branching events in the phylogeny do not correspond to the transmission events during the outbreak (Romero-Severson *et al*., 2014). Phylogeny-based methods that assume that the transmission events coincide with the branching events in the phylogeny are therefore not applicable in the context of pathogens with low mutation rates, short incubation times and acute infections (Ypma *et al*., 2011; Harris *et al*., 2010; Leitner *et al*., 1996; Cottam *et al*., 2008).

Another factor that makes outbreak transmission history inference challenging is a *weak transmission bottleneck*, where multiple strains of the pathogen are transmitted from a donor to a recipient through a non-negligibly small inoculum. Due to this, the most recent common ancestor of lineages from the same host need not have arisen in that host. Although large inocula have been observed in a number of diseases (Leonard *et al*., 2017), most of the existing methods for transmission tree inference that account for the within-host diversity do not account for the co-transmission of pathogen strains (Ypma *et al*., 2013; Didelot *et al*., 2014; Hall *et al*., 2015; Didelot *et al*., 2017). That is, these methods assume a *strong transmission bottleneck* where a single strain of the pathogen is transmitted in an infection. A weak transmission bottleneck is considered in SCOTTI (De Maio *et al*., 2016) and BadTrIP (De Maio *et al*., 2018), however they make the simplifying assumption that all the transmissions are independent of each other. SharpTNI (Sashittal and El-Kebir, 2019) considers the weak transmission bottleneck without this assumption under a parsimony based framework for a known phylogeny. However, SharpTNI may yield transmission histories that cannot be represented by a tree due to multiple infections of a single host from distinct donors. Such super-infection are unlikely for pathogens where infected hosts acquire immunity towards further infections of the pathogen (Whittle *et al*., 1999; Wearing and Rohani, 2009), thus restricting the transmissions history to a tree.

The contributions of this paper are three-fold. First, we consider the problem of counting and sampling uniformly from the set of possible transmission trees for a known phylogeny and epidemiological data. In previous works, this problem is considered by Kenah *et al*. (2016) when the order of infections during the outbreak is completely known and by Hall and Colijn (2019) under the strong transmission bottleneck constraint. In this work, we relax both these constraints and propose a method TiTUS that approximately counts and almost uniformly samples the transmission trees under a weak transmission bottleneck for a given timed phylogeny (Fig. 1). We prove the hardness of the decision and counting versions of this problem and demonstrate the efficiency and accuracy of TiTUS on simulated data. Second, we present a robust criteria for ranking or prioritizing the uniformly sampled candidate transmission trees. In addition to the simulated data, we demonstrate the performance of the selection criteria on an HIV outbreak with a known transmission chain (Vrancken *et al*., 2014). Third, in practice, the underlying phylogeny has some uncertainty and there can be multiple candidates for the transmission tree for a given phylogeny. It is therefore desirable to have an efficient method to summarize the solution space of transmission trees that are consistent with the genetic and epidemiological data. To this end, we propose a consensus-based method that provides the mean transmission tree for a set of candidate solutions while accounting for the number of distinct strains transmitted in each infection event.

## 2 Preliminaries

To state the problems we consider in this manuscript, we start by introducing the required concepts and notation. Let *T* be a rooted tree with vertex set *V*(*T*) and edge set *E*(*T*). The set of leaves of the tree is given by *L*(*T*). The root of the tree is denoted by *r*(*T*). We denote the children of a vertex *u* by *δ*_{T} (*u*). We write *u* ⪯_{T} *v* if vertex *u* is ancestral to vertex *v, i*.*e*. vertex *u* is present on the unique path from *r*(*T*) to vertex *v*. Note that ⪯_{T} is reflexive, *i*.*e*. it holds that *u* ⪯_{T} *u* for all vertices *u*. We denote the set of *m* distinct hosts in the outbreak by Σ. In a phylogeographical setting, the set Σ corresponds to *m* distinct geographical locations.

The evolutionary of all strains of a pathogen in an outbreak is modeled by a timed phylogeny, which we define as follows.

A *timed phylogeny T* is a rooted tree whose vertices are labeled by time-stamps *τ* : *V* (*T*) → ℝ^{≥0} such that *τ* (*u*) ≤ *τ* (*v*) for all pairs *u, v* of vertices where *u* ⪯_{T} *v*.

Thus, as we can see in the above definition, time moves forward when traversing down a timed phylogeny *T* starting from the root *r*(*T*). It is important to note that the leaves of a timed phylogeny *T* may occur at distinct time-stamps, *i*.*e. T* is not necessarily ultrametric.

Each leaf of a timed phylogeny *T* corresponds to a strain of pathogen that was collected during the outbreak. As such, we know the host from which each individual strain was isolated. This is captured by a leaf labeling, *i*.*e*. a labeling of the leaves of *T* by hosts Σ.

A *leaf labeling* of a timed phylogeny *T* is a function , assigning a host to each leaf vertex *u* ∈ *L*(*T*).

While we know the host from which each individual leaf *u* of *T* was sampled, we do not know the hosts of the internal vertices, which correspond to unsampled, ancestral strains. Here, our goal is to determine the hosts in which these ancestral strains reside.Mathematically, we wish to construct a *vertex labeling ℓ* : *V* (*T*) → Σ, such for all leaves *u* ∈ *L*(*T*). Given a vertex labeling *ℓ*, each internal vertex *u* of *T* thus corresponds to a strain residing within host *ℓ*(*u*) at time *τ* (*u*).

In addition to the evolutionary history of all strains in the outbreak, a timed phylogeny *T* combined with a vertex labeling *ℓ* gives us information about the transmission history of the outbreak. Transmissions of strains from one host to another correspond to edges (*u, v*) of *T* labeled by distinct hosts *ℓ*(*u*) ≠ *ℓ*(*v*). Formally, we define a *transmission edge* as follows.

Given a timed phylogeny *T* and vertex labeling *ℓ*, an edge (*u, v*) of *T* is a *transmission edge* if *ℓ*(*u*) ≠ *ℓ*(*v*).

The vertex labeling that we construct for a given timed phylogeny *T* and leaf labeling , must follow certain constraints for a realistic reconstruction of the transmission history of the pathogen. We will now define these epidemiological constraints.

The first constraint that we introduce is called the *direct transmission* constraint, which imposes the following two restrictions. First, the outbreak begins with a single infected host. We call this initial host the *root host* and it labels the root node *r*(*T*) of the timed phylogeny. The *root host* is not infected by any other host and therefore if *s* is the root host, there cannot exist a transmission edge (*u, v*) such that *ℓ*(*u*) ≠ *s* and *ℓ*(*v*) = *s*.Second, the remaining hosts have a unique infector and are thus infected only once in the course of the outbreak. A possible explanation for this phenomenon is diseases where infected hosts acquire immunity towards further infections of the pathogen (Whittle *et al*., 1999; Wearing and Rohani, 2009). Consequently, there cannot exist two distinct transmission edges (*u, v*) and (*u*′, *v*′) such that *ℓ*(*v*) = *ℓ*(*v*′) and *ℓ*(*u*) ≠ *ℓ*(*u*′). However, an infection between any two hosts *s, t* ∈ Σ may involve the transmission of multiple strains at the same time. This is known as a *weak transmission bottleneck*. Since the transmission of strains must occur concurrently, the time intervals corresponding to any two transmission edges between the same pair (*s, t*) of hosts must have an non-empty intersection. More formally, we state the *direct transmission* constraint as follows,

For a timed phylogeny *T*, a vertex labeling *ℓ* satisfies the *direct transmission constraint* if (i) there does not exist a transmission edge (*u, v*) such that *ℓ*(*v*) = *ℓ*(*r*(*T*)) and (ii) we have [*τ* (*u*), *τ* (*v*)] ∩ [*τ* (*u*′), *τ* (*v*′)] ≠ ø for any two transmission edges (*u, v*) and (*u*′, *v*′) where *ℓ*(*u*) = *ℓ*(*u*′) and *ℓ*(*v*) = *ℓ*(*v*′).

Under the *direct transmission* constraint, the set of transmission edges induced by the vertex labeling *ℓ* uniquely determines the *transmission tree S*. More formally, the vertex set *V* (*S*) of a transmission tree *S* is the host set Σ, and there is a directed edge from *s* ∈ Σ to *t* ∈ Σ if and only if there exists at least one edge (*u, v*) ∈ *E*(*T*) such that (i) *s* ≠ *t*, (ii) *ℓ*(*u*) = *s* and (iii) *ℓ*(*v*) = *t*. Since every host except the *root host* has a unique infector, the directed edges necessarily form a tree. Each directed edge (*s, t*) ∈ *E*(*S*) is given a weight *w* : *E*(*S*) → ℕ such that *w*(*s, t*) equals the number of transmission edges in *T* from host *s* to *t*. If *w*(*s, t*) = 1 for all edges (*s, t*) ∈ *E*(*S*) then each host is infected due to the transmission of a single pathogen strain. This phenomenon is known as a *strong transmission bottleneck*.

Epidemiological data provide two additional types of information. First, for each host *s* we are given an interval [*τ*_{e}(*s*), *τ*_{r} (*s*)] during which the host was present in the outbreak and susceptible for infection. Specifically, *τ*_{e}(*s*) ∈ ℝ^{≥0} is the entry time at which host *s* became susceptible for infection, whereas *τ*_{r} (*s*) ∈ ℝ^{≥0} is the *removal time* at which the host was removed from the susceptible and infected populations and placed in treatment or recovering.

Second, there can also be documented geographical constraints that prevent transmissions between any given pair of hosts. We account for all such constraints using a *contact map*. A *contact map C* is a directed graph whose vertex set equals the set Σ of hosts. A directed edge (*s, t*) represents a possible infection event from host *s* to host *t*. If any two hosts are not connected in *C* then there can be no infection event between that pair of hosts. It can clearly be seen that (i) the contact map *C* is a subgraph of the interval graph induced by the intervals [*τ*_{e}(*s*), *τ*_{r} (*s*)], ∀*s* ∈ Σ and (ii) the transmission tree *S* is a spanning arborescence of the contact map *C*. Thus, even in the absence of documented contacts between hosts, a contact map is induced by the entry and removal times of the hosts.

## 3 Problem Statement

We focus on inferring the transmission history of an outbreak for a known pathogen phylogeny *T*. In addition, we are given epidemiological data, which include the contact map *C*, entry and removal times [*τ*_{e}(*s*), *τ*_{r} (*s*)] for each host *s* ∈ Σ and assume a direct transmission constraint under a weak transmission bottleneck. This leads to the following decision problem.

(Direct Transmission Inference (DTI)). Given a timed phylogeny *T* with time-stamps *τ* : *V* (*T*) → ℝ^{≥0}, a leaf labeling , a contact map *C* and entry *τ*_{e} : Σ → ℝ^{≥0} and removal times *τ*_{r} : Σ → ℝ^{≥0}, find a vertex labeling *ℓ* that induces a transmission tree *S* that is a spanning arborescence of *C* and *τ* (*u*) ∈ [*τ*_{e}(*s*), *τ*_{r} (*s*)] for all hosts *s* and vertices *u* where *ℓ*(*u*) = *s*.

An instance of the DTI problem is shown in Fig. 1a shows an instance of the DTI problem along a with a solution vertex labeling *ℓ* and induced transmission tree *S*, where the three hosts are inducated using three colors. Importantly, a DTI problem instance may admit multiple solutions, as shown in Fig. 1b and Fig. 1c. These solutions provide alternative reconstructions of the transmission history, and thus must be taken into consideration in any downstream analysis of the outbreak to devise policy to better manage/prevent future outbreaks. To quantify the number of alternative reconstructions, we formulate the following counting problem.

(# Direct Transmission Inference (#DTI)). Given a timed phylogeny *T* with time-stamps *τ* : *V* (*T*) → ℝ^{≥0}, a leaf labeling , a contact map *C* and entry *τ*_{e} : Σ → ℝ^{≥0} and removal times *τ*_{r} : Σ → ℝ^{≥0}, count the number of vertex labelings *ℓ* that induce a transmission tree *S* that is a spanning arborescence of *C* and *τ* (*u*) ∈ [*τ*_{e}(*s*), *τ*_{r} (*s*)] for all hosts *s* and vertices *u* where *ℓ*(*u*) = *s*.

Let ℒ be the set of all solutions to a given DTI problem instance. Ideally, we would exhaustively enumerate all solutions to the problem instance. However, worst case, the number of solutions scales exponentially with our input. Thus, to obtain a good overview of the solution space ℒ, we need to consider the sampling version of #DTI problem where we wish to uniformly sample the solution space.

In summary, we defined three versions of the DTI problem: a decision, counting and sampling version. In the following, we will consider a previously defined constrained version of the DTI problem as well as a generalization.

### 3.1 Related Transmission Tree Inference Problems

We start by considering a version of the DTI problem with one additional constraint. This additional constraint requires that only one pathogen strain is transmitted to a new host in a transmission event, and is known as a *strong transmission bottleneck*. We refer to this problem as Directed Transmission Inference under Strong Bottleneck (DTI-SB), and denote the space of solutions by ℒ*SB*. This problem was posed by Hall *et al*. (2015).In subsequent work, Hall and Colijn (2019) introduced a polynomial time algorithm to enumerate and uniformly sample from the set ℒ_{SB}. Since the DTI-SB only has one additional constraint over the original DTI problem, the solution space of DTI-SB is a proper subset of the solution space of DTI for the same timed phylogeny *T*, leaf labeling and epidemiological data. More formally, we have ℒ_{SB} ⊆ ℒ.

The second problem we consider is a relaxed version of DTI. Specifically, we relax the *direct transmission* constraint for a given instance of DTI. We refer to this problem as rel-DTI and the space of feasible solutions for a given instance by ℒ_{REL}. Section 5.2.1 introduces a polynomial time dynamic programming algorithm that enumerates, counts and uniformly samples from the set ℒ_{REL}. Since the rel-DTI problem is a relaxation of the DTI problem, we can use the algorithm introduced in Section 5.2.1 to uniformly sample from the solution space of the DTI problem (ℒ). Fig. 2 shows the relation between the solution spaces of the three transmission tree inference problems.

### 3.2 Consensus Tree Problem

For the DTI problem described in the previous section, we start with a given pathogen phylogeny *T*. However, in practice the phylogeny needs to be inferred from genomic sequences of the strains collected from individual hosts Σ. Several methods of phylogeny inference generate either multiple candidates for the phylogeny or a posterior on the solution phylogeny space (Bouckaert *et al*., 2019; Stamatakis, 2014). Moreover, for each given timed phylogeny, we can get multiple solutions to the DTI problem as shown for a representative instance in Fig. 1. Therefore, there is a need for an efficient method to summarize the candidate transmission trees that explain the disease outbreak.

A common method to summarize the solution space of transmission trees is to aggregate the information from the candidate transmission trees to generate a single graph where each edge is weighted by the number of candidate trees that support that edge (De Maio *et al*., 2016; Wymant *et al*., 2017; Didelot *et al*., 2014). This graph rarely represents a single coherent transmission tree among the set of all hosts in the dataset. For this reason, the resulting graph is called a *relationship graph* (Wymant *et al*., 2017) and does not provide crucial information about co-occurrence and mutual exclusivity among edges of the candidate transmission trees.

Another line of method summarizes the set of candidate solutions using one or more consensus trees that best represent the solution space (Jombart *et al*., 2017; Kendall *et al*., 2018). For instance, Jombart *et al*. (2017) apply pairwise distance metrics on the space 𝒮 of transmission trees, not taking into account the number *w*(*s, t*) of transmitted strains between pairs of host (*s, t*). The resulting distance matrix is subsequently embedded into lower dimensional space that the authors then cluster. Finally, each cluster is then assigned a single transmission tree in 𝒮 as its representative (Hall and Colijn, 2019). Kendall *et al*. (2018) follow a similar embedding approach, again not taking the number *w*(*s, t*) of transmission into account. Thus neither method supports a weak transmission bottleneck. To address this limitation, we define the weighted parent-child distance (WPCD) *d*(*S*_{1}, *S*_{2}) between any two transmission trees *S*_{1} and *S*_{2} as follows.

Let *S*_{1} = (Σ, *E*_{1}) with edge labeling *w*_{1} and *S*_{2} = (Σ, *E*_{2}) with edge labelings *w*_{2} be two transmission tree on the same vertex set Σ. The *weighted parent-child distance* between the two graphs denoted by *d*(*S*_{1}, *S*_{2}) is

In Appendix A.1.1 we show that this distance function induces a metric in the space 𝒮 of transmission trees. Note that transmission trees *S* and *S*′ that have the same topology but different edge weights *w* and *w*′ will have *d*(*S, S*′) > 0. As a result, WPCD can be used to produce a consensus transmission tree while taking an incomplete transmission bottleneck into account. Under the *strong transmission bottleneck* the *weighted parent-child distance* simplifies to the size of the symmetric difference between the edge sets of the two transmission trees, *i*.*e. d*(*S, S*′) = |*E*′\*E*|+|*E*\*E*′|. This distance is known as the parent-child distance, and has been used to compare tumor phylogenies (Aguse *et al*., 2019; Govek *et al*., 2018). Using WPCD, we define the following consensus tree problem.

(Single Consensus Transmission Tree (SCTT)). Given *k* distinct transmission trees 𝒮 = {*S*_{1}, …, *S*_{k}} with edge labelings {*w*_{1}, …, *w*_{k}} find a consensus transmission tree *R* that minimizes .

## 4 Complexity

This section establishes hardness results for the decision and counting versions of the DTI problem.

DTI is NP-complete.

We show the hardness of DTI by reduction from the 1-in-3SAT problem, which is a known NP-complete problem (Karp, 1972). Details are in Appendix A.2.

It is known that the #1-in-3SAT is a #P-complete problem (Creignou and Hermann, 1993). In order to show that the #DTI is also #P-complete, it suffices to show that there exists a polynomial-time reduction from #1-in-3SAT such that the number of solutions is preserved, which we do in Appendix A.2.

#DTI is #P-complete.

Since the decision problem DTI is NP-complete, there does not exist a fully polynomial randomized approximate scheme (FPRAS) for the counting version of DTI unless NP=RP (Jerrum, 2003; Miklós, 2019).

## 5 Methods

This sections describes the methods developed to solve the decision, counting and sampling versions of the DTI problem.

### 5.1 Decision Problem

Since the DTI is NP-complete, we propose to use SATISFIABILITY to solve the decision problem. As such, we construct a Boolean formula *ϕ* for a given DTI instance *T*, , such that there is a bijection between the solutions of the DTI instance and the corresponding SAT instance *ϕ*. Solving the SAT instance will then be equivalent to solving the corresponding DTI problem.

#### Vertex labeling

Decision variables **x** ∈ {0, 1}^{n×m} encode a vertex labeling, *i*.*e. x*_{i,s} = 1 if and only if the node *ℓ*(*v*_{i}) = *s* and *x*_{i,s} = 0 otherwise. We encode uniqueness of the label of each vertex with the following formula.

The function onehot(*X*) encodes that exactly one binary variable *x* ∈ *X* is true, which can be accomplished by the following constraint.

#### Transmission edges

We encode the transmission edges using variables *c*_{s,t} with *s, t* ∈ Σ and *s* ≠ *t*. We enforce that *c*_{s,t} = 1 if and only if the host *t* is infected by host *s* and *c*_{s,t} = 0 otherwise as follows.

#### Root host

To enforce that the host which labels *r*(*T*) is not infected by any other host, we have
where *v*_{i} = *r*(*T*).

#### Direct transmission constraint

We enforce that any host cannot be infected by more than one other host. For each host *s* ∈ Σ we have

We require that all transmission edges from host *s* to host *t* must have time intervals that overlap. For all edge pair (*v*_{i}, *v*_{j}), (*v*_{k}, *v*_{l}) that do not have overlapping time intervals, i.e. [*τ* (*v*_{i}), *τ* (*v*_{j})] ∩ [*τ* (*v*_{k}), *τ* (*v*_{l})] = ø, we impose

### 5.2 Counting and Sampling Problem

#### 5.2.1 Naive Rejection based Method

For a naive rejection sampling algorithm, we relax the *direct transmission constraint* and uniformly sample vertex labelings for the timed phylogeny *T* such that for all transmission edges (*u, v*) we have (*ℓ*(*u*), *ℓ*(*v*)) ∈ *E*(*C*). As described in Section 3.1, we refer to this as the rel-DTI problem. Let the set of such vertex labelings be ℒ_{REL}. Drawing a vertex labeling labeling *ℓ* ∈ ℒ_{REL} uniformly at random from the set ℒ_{REL} can be done in polynomial time, as we describe in Appendix A.3. The sampled vertex labeling labeling *ℓ* is rejected unless it satisfies the *direct transmission constraint*, which can be verified in polynomial time. The probability of success for this rejection based sampling algorithm is 1−(|ℒ|*/*|ℒ_{REL}|)^{K} after *K* repetitions.

#### 5.2.2 Approximate Counting and Sampling using SAT

Using the SAT formulation shown in Section 5.1, we may use ApproxMC (Chakraborty *et al*., 2013; Soos and Meel, 2019) to approximate |ℒ| and UniGen (Chakraborty *et al*., 2014, 2015) to sample almost uniformly from ℒ. We call the resulting method Transmission Tree Uniform Sampler (TiTUS).

### 5.3 Consensus Problem

This section introduces a polynomial time algorithm to solve the SCTT problem. The algorithm and the proof for correctness follow the work of (Govek *et al*., 2018). Let 𝒮 = {*S*_{1}, …, *S*_{k}} be a set of *k* transmission trees with edge weights {*w*_{1}, …, *w*_{k}}. Our goal is to find a consensus tree *R* that minimizes *d*(𝒮, *R*) where *d*(·, ·) is the *weighted parent-child distance*. For any given tree *S*_{i}, we define the function *q*_{i} : Σ × Σ → ℕ where

Observe that the parent-child distance between two transmission trees *S*_{i} and *S*_{j} can be re-written as

To get the optimal weights for the consensus tree, for any pair of hosts (*s, t*) ∈ Σ × Σ, we define

Clearly, *w**(*s, t*) for every pair of hosts (*s, t*) is given by max{med, 1} where med is the median of the set {*q*_{1}(*s, t*), …, *q*_{k}(*s, t*)}. Thus, we have the following observation.

**Observation 1.** Given a set 𝒮 = {*S*_{1}, …, *S*_{k}} of *k* transmission trees with edge weights *w*_{1}, …, *w*_{k}, optimal consensus trees *R* that include the edge (*s, t*) must assign this edge weight *w**(*s, t*).

We define the *weighted parent-child graph P* as a complete graph with nodes given by the set Σ and a weight function

Observe that the weights of the edges of *P* can be negative.

Given a set 𝒮 = {*S*, …, *S*} of *k* transmission trees with edge weights *w*_{1}, …, *w*_{k}, a minimum weight spanning arborescence of the corresponding weighted parent-child graph *P* defines a tree *R* that is a solution to the SCTT problem with the distance measure used is weighted parent-child distance.

Proof. Provided in Appendix A.4. □

Although edge weights *w** of *P* can be negative, the requirement of *R* to be a spanning arborescence of *G* means that we can solve this problem in polynomial time with standard minimum weight spanning arborescence algorithms.

## 6 Results

This section presents the results obtained by applying TiTUS to simulated as well as a real dataset.

### 6.1 Simulations

We employ a two-stage approach to simulate an outbreak, generalizing Didelot *et al*. (2014)’s simulation framework that uses a strong transmission bottleneck to support a weak transmission bottleneck. First, we simulate the transmission process between the *m* hosts using the SIR epidemic model (Allen, 2008). The epidemiological model takes the transmission bottleneck size *κ* and minimum number *n*_{s} of strains/leaves for each host *s* as input. Given this input, the model generates a transmission tree *S* with entry *τ*_{e}(*s*) and removal times *τ*_{r} (*s*) for each host *s* as well as the number of transmissions *w*(*s, t*) = *κ* between each pair (*s, t*) ∈ *E*(*S*) of hosts. Given *S* and *w*, we then simulate the evolution of the pathogens within each infected host using a simple coalescence model with constant population size (Kingman, 1982). This process yields a forest of timed phylogenies for each individual host *s*. We construct a single timed phylogeny of all hosts by stitching together individual timed phylogenies using the transmission tree *S*. For each combination of number *m* ∈ {5, 7, 10} of hosts and bottleneck size *κ* ∈ {1, 2, 3} we generate five instances, amounting to a total of 45 simulated instances. The cases with *κ* = 1 correspond to outbreaks with a strong transmission bottleneck. In order to mimic the uncertainty in epidemiological data seen in practice, we increase the length of the entry and removal time interval [*τ*_{e}(*s*) − Δ, *τ*_{r} (*s*) + Δ] for each host *s*, where Δ equals 10% of the total outbreak duration.

We find that increasing the number of hosts and bottleneck size in the simulations leads to an increase in the number of vertices *n* in the phylogenetic trees (Fig. S12a). This leads to a sharp increase in the number of feasible solutions to the rel-DTI (Fig. 3a). The number of solutions to DTI, on the other hand, stays relatively constant for increasing bottleneck size. As a consequence of this, the sampling efficiency of the naive rejection sampling method, defined by the ratio ℒ*/*|ℒ_{REL}|, precipitates with increasing number *m* of hosts and bottleneck size *κ* proving it unsuitable for any real applications.

For cases with simulated bottleneck size *κ* > 1, STraTUS fails to provide any solutions (Fig. 3a). This shows that when multi-strain infections occur, transmission history inference with a strong bottleneck assumption will fail to provide the true transmission tree topology. Finally, we assess the sampling accuracy of TiTUS by comparing the sampling frequency with 1*/*|ℒ| where |ℒ| is computed with sharpSAT (Thurley, 2006). For each unique solution that is sampled, the expected sampling frequency 1*/*|ℒ| is the same. Fig. 3c shows that the ratio between both the minimum and maximum values of the observed sampling frequencies with their expected values is close to 1.

In summary, our simulations show that methods that assume a strong transmission bottleneck cannot be applied to outbreaks with a weak bottleneck. Moreover, the exponentially increasing gap between the size of the solution space of rel-DTI compared to DTI renders the rejection-based sampling impractical. In contrast, TiTUS almost uniformly samples from the complex solution space of DTI.

#### 6.1.1 Criteria to Prioritize Candidate Transmission Trees

We propose several criteria for ranking the vertex labelings for a given timed phylogeny uniformly sampled by TiTUS. The first criterion is the *number of transmission edges* in the vertex labeling. Based on the parsimony principle, which has been used in previous works for both phylogeny inference (Sankoff, 1975) as well as transmission tree inference (Wymant *et al*., 2017; Snitkin *et al*., 2012; Sashittal and El-Kebir, 2019), we expect vertex labelings that have few transmission edges to be closer to the ground truth. The second criterion is the *number of unsampled lineages*, which is the number of transmission edges (*u, v*) for which there does not exist a descendant leaf *v*′ (*i*.*e. v* ⪯_{T} *v*′) labeled by *ℓ*(*v*). Unsampled lineages are a consequence of multi-strain infections and we expect to see fewer unsampled lineages when the within-host diversity of the infected hosts is adequately sampled. Fig. 5 illustrates this concept.

To assess these criteria, we compare the sampled transmission trees with the ground truth by computing the *infection recall*, defined as the fraction of transmission events between pairs of hosts that are correctly inferred. Fig. 4a shows the value of the *infection recall* for candidate solutions in different percentiles based on the number of transmission edges. Clearly, as we look at solutions with larger transmission numbers, the infection recalls decreases. Fig. 4b show a similar negative correlation between the infection recall and the number of unsampled lineages. We use both the transmission number and the number of unsampled lineages to prioritize the uniformly sampled candidate solutions. Specifically, for any given percentile threshold *α* we include all the vertex labelings whose percentile is at most *α* for both the transmission number and the number of unsampled lineages. (Thus, setting *α* = 1 will include all sampled vertex labelings.) The selected vertex labelings are then used to compute the consensus transmissions tree. Fig. 4c shows the infection recall of the consensus transmission trees for increasing value of the percentile threshold *α*. We see that a value of *α* that is either too small or too large results in a decrease in the *infection recall*. Based on the simulated data, we see that *α** = 0.01 yields accurate consensus transmission tree solutions. Hence, the two criteria enable accurate prioritization of sampled vertex labelings.

### 6.2 HIV Outbreak with a Known Transmission Chain

We apply our method TiTUS to infer the transmission history of an HIV-1 outbreak involving 11 patients with a known transmission chain (Vrancken *et al*., 2014; Lemey *et al*., 2005). The data consists of 212 samples collected over the span of 18 years from the 11 patients. The direction of transmissions and a relatively narrow time interval for each transmission event were inferred from epidemiological information obtained by patient interviews, clinical data and treatment histories of the patients.

The DTI problem for this HIV dataset is set up as follows. For the timed phylogeny, we use the Maximum Clade Credibility (MCC) tree obtained from the partially sequenced *env* regions presented by Vrancken *et al*. (2014) in their publication. Table 1 in Appendix A.6 shows the sampling times and transmission windows provided in the epidemiological data for each of the hosts. The transmission window of a host is the time interval inside of which the host is expected to have been infected. Transmission windows for host A and host D are incongruent with the given timed phylogeny. By this we mean there is no vertex labeling on the given MCC phylogeny that allows for the known transmissions to host A and host D. We exclude these time windows, while the transmission windows for the remaining hosts are used to constraint the possible vertex labelings of the MCC tree. We restrict the infection for each host to take place in within the transmission window provided in the epidemiological data. Appendix A.6 shows the details of the implementation of this constraint in the SAT formulation. Note that while using the time window constraints, we only restrict the time of infection and do not utilize information about the known infectors for each infected host. Finally, for each host the entry time is taken as the beginning of its time window of transmission and the removal time is the latest date of sampling (Table 1). We find that STraTUS fails to provide a solution on this dataset. Indeed, a weak transmission bottleneck needs to be considered in order to infer the transmission history.

For this DTI instance, using sharpSAT (Thurley, 2006) we find that there are exactly 30,901,500 feasible vertex labelings. We generate 100,000 samples from this solution space and compute the *infection recall* when compared to the known transmission chain. Fig. 6 shows the values the *infection recall* for solutions with different number of transmission edges and number of unsampled lineages. The infection recall is close to 1 for the solutions that have no unsampled lineages. The number of transmission edges also has a negative, albeit weaker correlation with the infection recall.

For any given percentile threshold *α* we include all vertex labelings whose percentile is at most *α* for both the transmission number and the number of unsampled lineages. Based on the simulations, we focus on percentile threshold *α** = 0.01. For this threshold value, Fig. 6 shows the consensus transmission tree inferred by TiTUS. The infection recall for this tree is 0.9, *i*.*e*. we correctly infer 9*/*10 transmission from the known transmission chain. We incorrectly infer the transmission B→F while the known transmission to F based on epidemiological data is A→F. Fig. S14 shows similar behavior of the infection recall as a function of *α* as observed in our simulations. Moreover, this figure shows that our method is robust around *α** = 0.01.

## 7 Discussion

In this paper, we formulated the Direct Transmission Inference (DTI) problem of inferring transmission trees for a given timed phylogeny and epidemiological data while supporting a weak transmission bottleneck. Weak transmission bottlenecks are common in the spread of diseases due to pathogens with large inoculum sizes, high mutation rates, long incubation times and chronic infections (Leonard *et al*., 2017). Previous studies of counting and sampling transmission trees for a given timed phylogeny assume a strong transmission bottleneck (Kenah *et al*., 2016; Hall and Colijn, 2019), and are not applicable to outbreaks of pathogens with a weak transmission bottleneck, often failing to return any solution.

We proved that the decision version of the DTI problem is NP-complete and the counting version #DTI is #P-complete. Leveraging recent advances made in approximate counting and sampling of solutions to SATISFIABILITY (Chakraborty *et al*., 2014, 2013, 2015; Soos *et al*., 2009), TiTUS, which uses a SATISFIABILITY oracle to almost uniformly sample from the solution space of DTI. In most cases, uniformly sampled candidate solutions from the transmission tree space will deviate considerably from the ground truth. To address this issue, we proposed two criteria that can be used to prioritize the uniformly sampled transmission trees. We demonstrated the performance and robustness of our selection criteria on both simulated data and a real dataset of an HIV outbreak (Vrancken *et al*., 2014).

Further, we also considered the problem of summarizing a given set of candidate transmission tree solutions of a disease outbreak. We defined a new distance metric *weighted parent-child distance* (WPCD) on the space of transmission multi-trees that capture the transmission of multiple strains between hosts during an outbreak. This distance is an extension of the parent-child distance which is used in previous works to summarize cancer phylogenies (Govek *et al*., 2018; Aguse *et al*., 2019). We presented a polynomial time algorithm for finding the consensus transmission tree with minimum total WPCD from the candidate solutions. The performance of the consensus transmission tree of recalling the transmissions that occurred during the outbreak is demonstrated both on simulated and real datasets.

There are several avenues for future research. First, the decision version of the DTI problem can be used to prioritize a posterior distribution of phylogenies, by checking if each phylogeny admits a vertex labeling that induces a transmission tree that is compatible with the given epidemiological data. A similar approach is employed by Sledzieski *et al*. (2019) where they prioritize statistically likely timed phylogenies that admit vertex labelings with fewer transmission edges. By including biological relevant constraints such as a contact map and direct transmission constraints, we expect to obtain high-fidelity phylogenetic and transmission history reconstructions. Second, one limitation of the proposed method is that it assumes that all the infected hosts in the outbreak are sampled. This assumption is only applicable for small outbreaks in regions with perfect surveillance and reporting system in place. An extension of this method to include unsampled hosts would be a useful. Third, akin to (Jombart *et al*., 2017), we plan to extend the SCTT to simultaneously cluster the set 𝒮 of transmission trees and infer a representative consensus transmission tree for each cluster. Finally, we plan to directly include the identified prioritization criteria as constraints in the DTI problem.

## Funding

M.E-K. was supported by the National Science Foundation (grant: CCF 18-50502).

## A.1 Background and Theory

In this section we provide the information we could not include in the main text. Fig. S7 shows all the feasible solutions to the representative DTI problem desribed in the Fig. 1.

### A.1.1 Transmission Tree Metric

In this section we show that WPCD is a distance metric. To show that WPCD is a distance metric, for any transmission tree *S*_{i}, we define the function *q*_{i} : Σ × Σ → ℕ as

Observe that, by construction, *q*_{i} uniquely determines the transmission tree *S*_{i} since for any edge (*s, t*) ∈ *E*(*S*_{i}) we have *w*_{i}(*s, t*) > 0. Further, the WPCD between any two transmission trees *S*_{1} and *S*_{2} can be alternatively written in terms of *q*_{1} and *q*_{2} as follows,

WPCD is a distance metric on the space of transmission trees 𝒯.

Proof. First, we show that for any two transmission trees *S*_{1} and *S*_{2}, *d*(*S*_{1}, *S*_{2}) = 0 if and only if *S*_{1} = *S*_{2}. Clearly when *S*_{1} = *S*_{2}, we have *d*(*S*_{1}, *S*_{2}) = 0. Now, let us consider the case *d*(*S*_{1}, *S*_{2}) = 0. For any (*s, t*) ∈ Σ × Σ, |*q*_{1}(*s, t*) − *q*_{2}(*s, t*)| ≥ 0. Therefore, if *d*(*S*_{1}, *S*_{2}) then for all (*s, t*) ∈ Σ × Σ we have *q*_{1}(*s, t*) = *q*_{2}(*s, t*) implying that *S*_{1} = *S*_{2}.

By definition, WPCD is always nonnegative and symmetric.

We only need to show the triangle inequality, *i*.*e*. given trees *S*_{1}, *S*_{2} and *S*_{3}, we must show

We show this as follows,

### A.1.2 Sampling Scenarios

The weak transmission bottleneck has some interesting implications for the sampling of the within-host diversity of the infected hosts. Fig. S8 gives an overview, with schematic representations, of 4 different scenarios that can occur for real outbreaks.

## A.2 Complexity

In this section we show the hardness of the decision and the counting versions of the DTI problem using reduction the one-in-three SAT (1-in-3 SAT).

(1-in-3SAT). Given a Boolean formula in 3-conjunctive normal form (3-CNF) with *n* variables and *k* clauses, decide whether there exists a truth assignment *θ* : [*n*] → {0, 1} so that each clause has *exactly* one true literal (and thus exactly two false literals).

### A.2.1 Decision Problem

To relate literals to variables, we use the function *ν* : [*k*] × {1, 2, 3} → [*n*] such that *ν*(*i, j*) is the variable corresponding to literal *y*_{i,j}. We define *σ*(*i, j*) to be 1 if *y*_{i,j} is a positive literal (*i*.*e. y*_{i,j} = *x*_{ν(i,j)}), otherwise *σ*(*i, j*) = 0 if *y*_{i,j} is a negative literal (*i*.*e. y*_{i,j} = ¬*x*_{ν(i,j)}). A truth assignment *θ* satisfies *ϕ* if for each clause *i* ∈ [*k*] there exists a *j* ∈ {1, 2, 3} such that *σ*(*i, j*) = *θ*(*ν*(*i, j*)).

Given *ϕ*, we construct a timed phylogeny *T* (*ϕ*) with leaf labeling , a contact map *C*(*ϕ*) and time-stamps *τ, τ*_{e}, *τ*_{r}, as depicted in Fig. S9 and detailed below. We set Σ = {⊥, *x*_{1}, …, *x*_{n}, ¬*x*_{1}, …, ¬*x*_{n} *c*_{1}, …, *c*_{k}}. Let *ε* > 0 be a small positive constant. As for entry and removal time-stamps, we set *τ*_{e}(⊥) = 0, *τ*_{r}(⊥) = *ε*, and *τ*_{e}(*x*_{i}) = *τ*_{e}(¬*x*_{i}) = *ε* and *τ*_{r}(*x*_{i}) = *τ*_{r}(¬*x*_{i}) = 3*ε* for each variable *i* ∈ [*n*]. For each clause *c*_{i}, *i* ∈ [*k*] we set *τ*_{e}(*c*_{i}) = *τ*_{r}(*c*_{i}) = 3*ε*. Timed phylogeny *T* (*ϕ*) is composed of 3*k* clause gadgets and *n* variable gadgets, each corresponding to a subtree that is directly attached to the root *r*(*T* (*ϕ*)). The root vertex has time-stamp *τ* (*r*(*T* (*ϕ*)) = 0. The leaves of *T* have identical time-stamps 3*ε*. For each variable *i* ∈ [*n*], we have a subtree *T* [*x*_{i}] whose root has time-stamp *τ* (*r*(*T* [*x*_{i}])) = 2*ε*. The two children of *r*(*T* [*x*_{i}]) have identical time-stamps 3*ε*, with one child leading to two leaves labeled by positive literal *x*_{i} and the other child leading to two leaves labeled by negative literals ¬*x*_{i}. Similarly, for each clause *c*_{i}, *i* ∈ [*k*], we have 3 subtrees *T* [*y*_{i,1}], *T* [*y*_{i,2}] and *T* [*y*_{i,3}]. The root of the subtree *T* [*y*_{i,j}] has time-stamp *ε* and two children, one of which is the leaf labeled by *x*_{ν(i,j)} if *y*_{i,j} = ¬*x*_{ν(i,j)} and ¬*x*_{ν(i,j)} if *y*_{i,j} = *x*_{ν(i,j)}. The other child node, denoted as *v*_{i,j}, has time-stamp *τ* (*v*_{i,j}) = 2*ε* and has only one child which is a leaf labeled by *c*_{i}. The contact map *C*(*ϕ*) is constructed as follows. The vertex set for the contact map is given by Σ. We have a directed edge from ⊥ to each of the variables {*x*_{1}, …, *x*_{n}, ¬*x*_{1}, …, ¬*x*_{n}}. For *i* ∈ [*n*], each variable *x*_{i} has an outgoing edge to ¬*x*_{i} and similarly variable ¬*x*_{i} has an outgoing edge to *x*_{i}. Finally, each clause *c*_{i} has three incoming edges, one from each of the literals that form the clause, i.e. *y*_{i,1}, *y*_{i,2} and *y*_{i,3}. For instance, if *c*_{1} := (*x*_{1} ∨ *x*_{2} ∨ ¬*x*_{3}), then we have the directed edges (*x*_{1}, *c*_{1}), (*x*_{2}, *c*_{1}) and ¬*x*_{3}, *C*_{1}. Clearly, *T* (*ϕ*) and *C*(*ϕ*) can be obtained in polynomial time from *ϕ*. An example of this reduction is shown in Fig. S11.

For any vertex labeling *ℓ* of *T* (*ϕ*), ⊥ is the *root host*.

Proof. Under the direct transmission constraint, *root host* is given by the host that labels the root node of the timed phylogeny. The time stamp of the root node of *T* (*ϕ*) is *τ* (*r*(*T* (*ϕ*))) = 0. The only host that has entry time before *τ*_{e} ≤ 0 is ⊥. Therefore, for any vertex labeling we have *ℓ*(*r*(*T* (*ϕ*))) = ⊥, which makes ⊥ the *root host*. □

For any variable *x*, either {(⊥, *x*), (*x*, ¬*x*)} ⊆ *E*(*S*) or {(⊥, ¬*x*), (¬*x, x*)} ⊆ *E*(*S*).

Proof. For any variable *x*, consider the subtree *T* [*x*]. By construction we have, *τ* (*r*(*T* [*x*])) = 2*ε* and the node only has two children labeled by *x* and ¬*x*. From the contact map we know that the only possible infectors for *x* has ⊥ and ¬*x* and similarly for ¬*x* are ⊥ and *x*. Given that *τ*_{r} (⊥) < *τ* (*r*(*T* [*x*])), the only remaining choices for *ℓ*(*r*(*T* [*x*])) are *x* and ¬*x*.

If *ℓ*(*r*(*T* [*x*])) = *x* then we have {(⊥, *x*), (*x*, ¬*x*)} ⊆ *E*(*S*) and if *ℓ*(*r*(*T* [*x*])) = ¬*x* we have {(⊥, ¬*x*), (¬*x, x*)} ⊆ *E*(*S*). □

For any clause *c*_{i} = (*y*_{i,1} ∨ *y*_{i,2} ∨ *y*_{i,3}), if (*y*_{i,j}, *c*_{i}) ∈ *E*(*S*) then *ℓ*(*r*(*T* [*y*_{i,j}])) = *y*_{i,j} and *ℓ*(*r*(*T* [*y*_{i,j′})) = ⊥ for *j*′ ≠ *j*. Proof. Consider the subtree *T* [*y*_{i,j}]. Let us denote the node that is child of *r*(*T* [*y*_{i,j}]) and parent of the leaf of *T* [*y*_{i,j}] labeled with *c*_{i} as *v*_{j}.

Since *S* is a spanning arborescence of *C*(*ϕ*) we have either (*y*_{i,1}, *c*_{i}), (*y*_{i,2}, *c*_{i}) or (*y*_{i,3}, *c*_{i}) in *E*(*S*). Without loss of generality, let us assume that (*y*_{i,1}, *c*_{i}) ∈ *E*(*S*).

The edges (*v*_{1}, *δ*_{T} (*v*_{1})), (*v*_{2}, *δ*_{T} (*v*_{2})) and (*v*_{3}, *δ*_{T} (*v*_{3})) need to be transmission edges since *τ* (*v*_{1}) = *τ* (*v*_{2}) = *τ* (*v*_{3}) < *τ*_{e}(*c*_{i}). Since (*y*_{i,1}, *c*_{i}) ∈ *E*(*S*), we require *ℓ*(*v*_{1}) = *ℓ*(*v*_{2}) = *ℓ*(*v*_{3}) = *y*_{i,1}. Looking at *r*(*T* [*y*_{i,2}]) and *r*(*T* [*y*_{i,3}]), since each clause consists of distinct variables, we can only have *ℓ*(*r*(*T* [*y*_{i,2}])) = *ℓ*(*r*(*T* [*y*_{i,3}])) = ⊥. Consequently, the transmission edges (*r*(*T* [*y*_{i,2}]), *v*_{2}) and (*r*(*T* [*y*_{i,3}]), *v*_{3}) results in a edge (⊥, *y*_{i,1}) in *E*(*S*). By Lemma 2, this also means (*y*_{i,1}, ¬*y*_{i,1}) ∈ *E*(*S*) and therefore *ℓ*(*r*(*T* [*y*_{i,1}])) = *y*_{i,1}. □

For any literal *y*_{i,j} in clause *c*_{i}, (⊥, *y*_{i,j}) ∈ *E*(*S*) if and only if (*y*_{i,j}, *c*_{i}) ∈ *E*(*S*).

Proof. Consider the subtree *T* [*y*_{i,j}]. Let us denote the node that is child of *r*(*T* [*y*_{i,j}]) and parent of the leaf of *T* [*y*_{i,j}] labeled with *c*_{i} as *v*.

(⇒) If (⊥, *y*_{i,j}) ∈ *E*(*S*), then by Lemma 2 we know that (*y*_{i,j}, ¬*y*_{i,j}) ∈ *E*(*S*). Therefore, *ℓ*(*r*(*T* [*y*_{i,j}])) = *y*_{i,j}. Given that *ℓ*(*r*(*T* [*y*_{i,j}])) = *y*_{i,j}, *ℓ*(*δ*_{T} (*v*)) = *c*_{i} and *τ* (*v*) = *ε*, the only feasible label for *v* is *y*_{i,j}. Therefore *ℓ*(*v*) = *y*_{i,j} and (*y*_{i,j}, *c*_{i}) ∈ *E*(*s*).

(⇐) If (*y*_{i,j}, *c*_{i}) ∈ *E*(*S*), then since *τ* (*v*) < *τ*_{e}(*c*_{i}), we have *ℓ*(*v*) = *y*_{i,j}. From Lemma 3 we know that *ℓ*(*r*(*T* [*y*_{i,j}])) is either ⊥ or *y*_{i,j}. If *ℓ*(*r*(*T* [*y*_{i,j}])) = ⊥, then we will have {(⊥, *y*_{i,j}), (⊥, ¬*y*_{i,j})} which is not possible due to Lemma 2. Therefore *ℓ*(*r*(*T* [*y*_{i,j}])) = *y*_{i,j} and consequently (⊥, *y*_{i,j}) ∈ *E*(*S*). □

There exists a vertex labeling *ℓ* of *T* (*ϕ*) under the direct transmission constraint such that the corresponding transmission tree *S*(*ℓ*) is a spanning arborescence of *C*(*ϕ*) if and only if *ϕ* is satisfiable with a truth assignment *θ* so that each clause has exactly one true literal.

Proof. (⇒) Let *ℓ* be a vertex labeling of *T* (*ϕ*) under the direct transmission constraint such that the corresponding transmission tree *S* is a spanning arborescence of *C*(*ϕ*). We construct the corresponding truth assignment *θ* for *ϕ* as follows. From Lemma 2 we know that for any variable *x*, either (⊥, *x*) ∈ *E*(*S*) or (⊥, ¬*x*) ∈ *E*(*S*). We set *θ*(*i*) = 1 if (⊥, *x*_{i}) ∈ *E*(*S*) and *θ*(*i*) = 0 if (⊥, ¬*x*_{i}) ∈ *E*(*S*). We claim that the this truth assignment satisfies *ϕ* with exactly one literal for each clause.

We need to show that, for any clause *c*_{i} = (*y*_{i,1} ∨ *y*_{i,2} ∨ *y*_{i,3}), exactly one of (⊥, *y*_{i,1}), (⊥, *y*_{i,2}) and (⊥, *y*_{i,3}) is in *E*(*S*). From Lemma 4 we know that (⊥, *y*_{i,j}) ∈ *E*(*S*) if and only if (*y*_{i,j}, *c*_{i}) ∈ *E*(*S*). Since *S* is a spanning arborescence, exactly one of (*y*(*i*, 1), *c*_{i}), (*y*_{i,2}, *c*_{i}) and (*y*_{i,3}, *c*_{i}) is in *E*(*S*). Therefore, exactly one of (⊥, *y*_{i,1}), (⊥, *y*_{i,2}) and (⊥, *y*_{i,3}) is in *E*(*S*) which renders the clause *c*_{i} satisfied with exactly one literal.

(⇐) Consider the truth assignment *θ* that satisfies *ϕ* with exactly one literal for each clause in *ϕ*. We build the vertex labeling *ℓ* for *T* (*ϕ*) as follows. From Lemma 1 it is clear that ⊥ is the root host and therefore *r*(*S*) = ⊥. We set *ℓ*(*T* [*x*_{i}]) = *x*_{i} if *θ*(*i*) = 1 and *ℓ*(*T* [*x*_{i}]) = ¬*x*_{i} if *θ*(*i*) = 0. For any clause *c*_{i} in *ϕ*, if *y*_{i,j} is true we set *ℓ*(*r*(*T* [*y*_{i,j}])) = *y*_{i,j} and if ¬*y*_{i,j} is true we set *ℓ*(*r*(*T* [*y*_{i,j}])) = ⊥. Finally, we set *ℓ*(*v*_{i,j}) = *y*_{i,j} for all *j* ∈ {1, 2, 3}. We need to show that constructed vertex labeling satisfies the direct transmission constraint and that the resulting transmission tree is a spanning arborescence of the contact map *C*(*ϕ*). We do this by first showing that (i) each variable has a unique infector and (ii) all transmission edges between the same pair of hosts have time intervals that overlap.

Consider all the variables that are assigned true by the truth assignment. The infector for all these variables is ⊥ since *ℓ*(*r*(*T* (*ϕ*))) = ⊥ and *ℓ*(*T* [*x*_{i}]) = *x*_{i} if *θ* = 1 and *ℓ*(*r*(*T* [*y*_{i,j}])) = ⊥ if ¬*y*_{i,j} is true. This agrees with *C*(*ϕ*). The time intervals of the outgoing edges from *r*(*T* (*ϕ*)) and *r*(*T* [*y*_{i,j}]), ∀*i* ∈ [*k*], *j* ∈ {1, 2, 3} contain *τ* = *ε*. Therefore, all possible transmission edges from ⊥ overlap at *τ* = *ε*.

Consider the variables that are assigned false by the truth assignment. From Lemma 2 we know that for any such variable *x*, they are infected by ¬*x*. This agrees with *C*(*ϕ*). Moreover, these variables do not label any of the interval vertices of the tree *T* and all the leaves of *T* are at the same time-stamp *τ* = 3*ε*. Therefore, all possible transmission edges to any such variable *x* overlap at *τ* = 3*ε*.

Finally, consider any clause *c*_{i}. All the internal vertices *v*_{i,j}, *j* ∈ {1, 2, 3} are labeled by the same variable *y*_{i,j} that renders the clause *c*_{i} satisfied. As a result, *y*_{i,j} is a unique infector of *c*_{i} and (*y*_{i,j}, *c*) exists in *E*(*C*(*ϕ*)) by construction. Also, time-stamp of all vertices *v*_{i,j} are the same *τ* = 2*ε* and therefore, the transmission edges overlap at *τ* = 2*ε*. □

### A.2.2 Counting Problem

This section proves the #P-completeness of the #DTI problem.

There exists a parsimonious reduction from #1-in-3SAT to #DTI.

Proof. Consider the reduction shown in Section 4. Here we show that this reduction is parsimonious, i.e. it preserves the number of solutions in the solution spaces of the two problems. We show a bijection between the solution space of a 1-in-3SAT and the solution space of the corresponding DTI instance.

Consider the Boolean formula *ϕ*. For a given truth assignment *θ* that satisfies each clause of *ϕ* with exactly one true literal, we construct the vertex labeling of *T* (*ϕ*) as following. We let *ℓ*(*T* [*x*_{i}]) = *x*_{i} if *θ*(*i*) = 1 and *ℓ*(*T* [*x*_{i}]) = ¬*x*_{i} if *θ*(*i*) = 0. We will show that this unique determines the labeling for the rest of the internal vertices of *T* (*ϕ*). Consider the clause *c*_{i} and the corresponding subtrees *T* [*y*_{i,1}], *T* [*y*_{i,2}] and *T* [*y*_{i,3}]. Since the truth assignment satisfies each clause with exactly one literal, without loss generality, assume that *y*_{i,1} is true. Then using Lemma 4, since (⊥, *y*_{i,j}) ∈ *E*(*S*), we have (*y*_{i,j}, *c*_{i}) ∈ *E*(*S*). For the nodes *v*_{i,j} we have *τ* (*v*_{i,j}) < *τ*_{e}(*c*_{i}) and therefore *ℓ*(*v*_{i,j}) = *y*_{i,j}, ∀*j* ∈ {1, 2, 3}. Finally, the vertex labels for the roots of the clause subtrees *ℓ*(*r*(*T* [*y*_{i,1}])) = *ℓ*(*r*(*T* [*y*_{i,2}])) = *ℓ*(*r*(*T* [*y*_{i,3}])) = *y*_{i,1} due to Lemma 3. Proof of Proposition 2 shows that this vertex labeling is a solution of the DTI problem.

From a given vertex labeling *ℓ*, we construct the truth assignment as follows. We set *θ*(*i*) = 1 if *ℓ*(*r*(*T* [*x*_{i}])) = *x*_{i} and *θ*(*i*) = 0 if *ℓ*(*r*(*T* [*x*_{i}])) = ¬*x*_{i}. Proof of Proposition 2 shows that this is a truth assignment that satisfies each clause with exactly one true literal.

The construction of *θ* from *ℓ* and *ℓ* from *θ* are inverses of each other. If we view these constructions as functions then they show a bijection in the solutions spaces of #1-in-3SAT and #DTI. This shows that the number of solutions is preserved. Obviously, the reduction can be performed in polynomial time. Therefore, the reduction is parsimonious. □

## A.3 Naive Rejection Sampling Algorithm

Here we descirbe the naive rejection sampling algorithm introduced in Section 5.2.1. Let *h*[*v, s*] denote the number of vertex labelings *ℓ* ∈ ℒ_{REL} in the subtree *T*_{v} of *T* rooted at vertex *v* when *ℓ*(*v*) = *s*. We define *h*[*v, s*] recursively as
where *I*(*s*) = [*τ*_{e}(*s*), *τ*_{r}(*s*)] and G_{C} (*s*) = {*s, δ*_{C} (*s*)}. Let Σ* = {*s*_{1}, …, *s*_{k}} be the set of possible labels for the root vertex *r*(*T*), *i*.*e*. Σ* = {*s* ∈ Σ | *τ* (*r*(*T*)) ∈ *I*(*s*)}. The number of vertex labelings |ℒ_{REL}| is given by Σ_{s′∈Σ*} *h*[*r*(*T*), *s*′].

Using the count matrix *h*[*u, s*], we introduce a subroutine that takes a vertex *v* and host *s* as input, and uniformly samples a vertex labeling *ℓ*_{u} of subtree *T*_{u} rooted at *u* subject to the restriction that *ℓ*_{u}(*u*) = *s* (Algorithm 3). The fraction *p*_{s} of the vertex labelings *ℓ* where *ℓ*(*r*(*T*)) = *s* equals *h*[*r*(*T*), *s*]/Σ_{s′∈Σ*} *h*[*r*(*T*), *s*′]. Thus, to sample *all* vertex labelings uniformly at random, we draw a *s* ∈ Σ* according to the categorical probability distribution defined by (*p*_{1}, …, *p*_{k}). Algorithm 4 is then used on *T* with *ℓ*(*r*(*T*)) = *s* to sample minimum transmission host labeling *ℓ* of *T* uniformly at random. This takes *O*(*nm*) time per sample.

For a given phylogeny and vertex labeling (*T, ℓ*), it is possible to find the minimum number of transmission events in polynomial time (Sashittal and El-Kebir, 2019). The *direct transmission constraint* is satisfied by the vertex labeling when the number of transmission events is *m* − 1, where each transmission event corresponds to an edge of the transmission tree. We can therefore draw vertex labelings from ℒ_{REL} and only retain the solutions that belong to ℒ in polynomial time. Since we are uniformly sampling from ℒ_{REL}, the retained solutions will also be uniformly sampled from ℒ. For the counting problem we estimate the number of vertex labelings in ℒ by the success rate of the sampling algorithm. Say after *K* draws of samples from ℒ_{REL}, we retain *K*′ vertex labelings that belongs to ℒ. In that case the estimate of the size of ℒ, denote by ⟨|ℒ|⟩, is given by

From the law of large numbers, as *K* → ∞ we have ⟨|ℒ|⟩ → |ℒ|. We now present the algorithms for naive rejection based sampling.

## A.4 Consensus Transmission Tree Algorithm Proof

Given a set 𝒮 = {*S*_{1}, …, *S*_{k}} of *k* transmission trees with edge weights , the minimum weight spanning arborescence of the corresponding weighted parent-child graph *P* defines a tree *R* that is a solution to the SCTT problem with the distance measure used is weighted parent-child distance.

Proof. Consider the *weighted parent-child graph P* for the set of transmission trees 𝒮. Since *P* is a complete graph, the optimal consensus tree *R* is necessarily a spanning arborescence of *P*. The weights of the edges in *R* are given by *w** due to Proposition 1. The total WPCD of *R* from the set of transmission trees 𝒮 is given by where

Consequently,
where the first term is a constant with respect to *R* and minimizing the second term is equivalent to finding the minimum weight spanning arborescence of *P*. □

## A.5 Additional Simulation Results

## A.6 Additional HIV Data Analysis and Implementation Details

## Footnotes

The title had a typo.