Abstract
The combination of bulk and single-cell DNA sequencing data of the same tumor enables the inference of high-fidelity phylogenies that form the input to many important downstream analyses in cancer genomics. While many studies simultaneously perform bulk and single-cell sequencing, some studies have analyzed initial bulk data to identify which mutations to target in a follow-up single-cell sequencing experiment, thereby decreasing cost. Bulk data provide an additional untapped source of valuable information, composed of candidate phylogenies and associated clonal prevalences. Here, we introduce PhyDOSE, a method that uses this information to strategically optimize the design of follow-up single cell experiments. Underpinning our method is the observation that only a small number of clones uniquely distinguish one candidate tree from all other trees. We incorporate distinguishing features into a probabilistic model that infers the number of cells to sequence so as to confidently reconstruct the phylogeny of the tumor. We validate PhyDOSE using simulations and a retrospective analysis of a leukemia patient, concluding that PhyDOSE’s computed number of cells resolves tree ambiguity even in the presence of typical single-cell sequencing errors. In a prospective analysis, we demonstrate that only a small number of cells suffice to disambiguate the solution space of trees in a recent lung cancer cohort. In summary, PhyDOSE proposes cost-efficient single-cell sequencing experiments that yield high-fidelity phylogenies, which will improve downstream analyses aimed at deepening our understanding of cancer biology.
1 Introduction
Tumorigenesis follows an evolutionary process during which cells gain and accumulate somatic mutations that lead to cancer (Nowell, 1976). The most natural expression of an evolutionary process is a phylogeny — a tree that describes the order and branching points of events in the history of a cellular population. Tumor phylogenies are critical to understanding and ultimately treating cancer, with recent studies using tumor phylogenies to identify mutations that drive cancer progression (Jamal-Hanjani et al., 2017; McGranahan et al., 2015), assess the interplay between the immune system and the clonal architecture of a tumor (Łuksza et al., 2017; Zhang et al., 2018), and identify common evolutionary patterns in tumorigenesis and metastasis (Turajlic et al., 2018a,b). These downstream analyses critically rely on accurate phylogenies that are inferred from sequencing data of a tumor.
The majority of current cancer genomics data consist of pairs of matched normal and tumor samples that have undergone bulk DNA sequencing. Bulk data is composed of sequences from cells with distinct genomes. More specifically, we observe frequencies f = [fi] for the set of somatic mutations in the tumor (Fig. 1a). Many deconvolution methods have been proposed for tumor phylogeny inference from such data (Deshwar et al., 2015; El-Kebir et al., 2015, 2016; Malikic et al., 2015; Popic et al., 2015; Yuan et al., 2015), typically inferring a set of equally plausible trees (Fig. 1b). These approaches are unsatisfactory, as candidate trees with different topologies may alter conclusions in downstream analyses. Single-cell sequencing (SCS), as opposed to bulk sequencing, enables us to observe specific clones present within the tumor. These clones correspond to the leaves of the true phylogeny, allowing phylogeny inference methods to reconstruct the tree itself once we observe all clones in the tumor (El-Kebir, 2018; Jahn et al., 2016; Ross and Markowetz, 2016; Zafar et al., 2017). However, the elevated error rates of SCS, as well as its high cost (Navin, 2014), make it prohibitive as a standalone method for phylogeny inference. As such, hybrid methods have been recently proposed to infer high-fidelity phylogenies from combined bulk and SCS data obtained from the same tumor (Malikic et al., 2019a,b).
Several hybrid datasets have been obtained by performing bulk and single-cell DNA sequencing simultaneously (Kuboki et al., 2019; Leung et al., 2017). However, there is merit in first performing bulk sequencing to guide follow-up SCS experiments. For instance, several studies first identified a subset of single-nucleotide variants from the bulk data to target in subsequent SCS experiments, thereby reducing costs compared to conventional whole-genome SCS approaches (Gawad et al., 2014; Kim et al., 2018; McPherson et al., 2016). Davis et al. (2019) recently introduced SCOPIT, a method to compute how many cells are needed to observe all clones of a tumor, given estimates of the smallest prevalence of a clone as well as the number of clones to detect. The authors provide no guidance on how to obtain these two quantities. Here, we build upon this work by directly incorporating knowledge encoded by the set $\mathcal{T}$ of trees inferred from the initial bulk sequencing data. Indeed, by using data from a SCS experiment we may eliminate trees from $\mathcal{T}$ that do not align with the observed clones (Fig. 1). In other words, if we observe all clones in a tumor, it is possible to determine the phylogeny of the tumor. However, is it possible to achieve the same goal by observing fewer clones? If so, how many cells are necessary for us to observe the required clones?
We introduce Phylogenetic Design Of Single-cell sequencing Experiments (PhyDOSE), a method to strategically design a follow-up SCS experiment aimed at inferring the true phylogeny (Fig. 1). Given a set of candidate trees inferred from initial bulk data, we describe how to distinguish a single tree T among the rest using features unique to T. In particular, if our SCS experiment results in observing cells corresponding to a distinguishing feature of T, we may conclude that T is in fact the true tree. This means that we can typically identify T using only a subset of the clones. To determine the number of cells to sequence, we introduce a probabilistic model that incorporates SCS errors and models successful SCS experiments as a tail probability of a multinomial distribution (Fig. 1c). Finally, we reconcile the sampled cells utilizing these distinguishing features to infer the true phylogeny (Fig. 1d). We validate PhyDOSE using both simulated data and a retrospective analysis of a leukemia patient that has undergone both bulk and SCS sequencing. We also demonstrate the utility of PhyDOSE by prospectively computing how many cells are needed to resolve the uncertainty in phylogenies of a recent lung cancer cohort. The cost-efficient SCS experiments enabled by PhyDOSE will yield high-fidelity phylogenies, improving downstream analyses aimed at understanding tumorigenesis and developing treatment plans.
2 Problem Statement
Let n be the number of single-nucleotide variants, or simply mutations, identified from initial bulk sequencing data of a matched normal and tumor biopsy sample. For each mutation i, we observe the variant allele frequency (VAF), i.e. the fraction of aligned reads that harbor the tumor allele at the locus of mutation i. Specialized methods exist that combine copy number information and VAFs to infer a cancer cell fraction fi for each mutation i, which is the proportion of cells in the tumor biopsy that contain at least one copy of the mutation (Bolli et al., 2014; Dentro et al., 2017; Jamal-Hanjani et al., 2017; Stephens et al., 2012). Here, we refer to cancer cell fractions as frequencies. Typically, phylogenies inferred by current methods from frequencies f = [fi] adhere to the infinite sites assumption. That is, each mutation i is introduced exactly once at vertex vi and never subsequently lost.
When we sequence a single cell from the same tumor biopsy, assuming no errors, we identify a clone of the tumor. In other words, we observe a set of mutations that must form a connected path in the unknown true phylogeny T*. By repeatedly sequencing single cells until we observe all clones in the tumor, we will have observed all root-to-vertex paths of T*, thus identifying tree T* itself. We assume that (i) the true unknown phylogeny T* is among the trees in the candidate set $\mathcal{T}$ and that (ii) mutations among single cells that we sample from the tumor biopsy follow the same distribution as f. The latter assumption particularly holds for liquid tumors with well-mixed tumor populations, and may be expected to hold for solid tumors if we isolate single cells at random from the same tissue that underwent bulk sequencing. This leads to the following question and problem statement. How many single cells do we need to identify T* with confidence level γ?
Problem (SCS Power Calculation (SCS-PC)). Given a set $\mathcal{T}$ of candidate phylogenies, frequencies f and confidence level γ, find the minimum number k* of single cells needed to determine the true phylogeny T* among $\mathcal{T}$ with probability at least γ.
Clearly, we do not know which phylogeny in $\mathcal{T}$ is the true underlying phylogeny T* of the tumor. Thus, we consider a slightly different problem: In the T-SCS-PC problem (defined formally at the end of the section), we are given an arbitrary phylogeny $T \in \mathcal{T}$ and want to perform a similar power calculation when conditioning on T being the true phylogeny. By solving the T-SCS-PC problem for all trees $T \in \mathcal{T}$, we obtain the number k(T) of single cells needed for each tree. As T* is in $\mathcal{T}$, the maximum among these numbers is an upper bound on the number of single cells required to identify T* with probability at least γ. To solve the T-SCS-PC problem, we need to determine for which SCS experiments we can conclude that T is the true phylogeny.
Observe that each tree T in $\mathcal{T}$ describes a unique set of clones, corresponding to the sets of mutations encountered in all root-to-vertex paths of T (Fig. 1). Thus, if we observe all clones of a phylogeny T in our SCS experiments, we may conclude that T is the true phylogeny. What is the probability of doing so? To answer this question, we must compute the prevalence of each clone in the tumor biopsy. For phylogenies that adhere to the infinite sites assumption, the prevalences u(T, f) = [ui] of the clones in the tumor biopsy are uniquely determined by the phylogeny T and frequencies f as
$$u_i = f_i - \sum_{j \in \delta_T(i)} f_j, \tag{1}$$
where δT(i) is the set of children of the node where mutation i was introduced (El-Kebir et al., 2015).
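As a minimal illustration of how clonal prevalences follow from a tree and its frequencies, the Python sketch below subtracts from each mutation's frequency the summed frequencies of its children, which is the relation attributed to El-Kebir et al. (2015). The function name `clonal_prevalences`, the `children` dictionary encoding and the toy numbers are assumptions made for this example and are not part of PhyDOSE.

```python
def clonal_prevalences(children, f):
    """Map each mutation i to u_i = f_i - sum of f_j over the children j of i."""
    return {i: f[i] - sum(f[j] for j in children.get(i, ())) for i in f}


# Hypothetical toy instance: mutation 1 is the root, with children 2 and 3.
children = {1: [2, 3]}
f = {1: 0.875, 2: 0.5, 3: 0.25}

u = clonal_prevalences(children, f)
u_normal = 1.0 - sum(u.values())  # remainder, assigned to the normal clone
print(u, u_normal)
```

In this toy instance the root clone and the normal clone each receive prevalence 0.125, while the two leaf clones keep their frequencies as prevalences.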
Tumor phylogeny inference methods guarantee that the inferred phylogenies from frequencies f have clonal prevalences u(T, f) = [ui] that are nonnegative and that $\sum_{i=1}^{n} u_i \le 1$, where the remainder $u_0 = 1 - \sum_{i=1}^{n} u_i$ is the prevalence of the normal clone. Thus, conditioning on a phylogeny T and frequencies f, sequencing one cell from the tumor will lead us to observe one of the n + 1 clones of T with probabilities (u0, …, un). In other words, the outcome of this SCS experiment with one cell is a draw from the categorical distribution Cat(u0, …, un). The possible outcomes of a SCS experiment composed of k cells thus follow a multinomial distribution Mult(k, u0, …, un). Consequently, the probability of observing all tumor clones of T in such a SCS experiment with k cells corresponds to the tail probability of the multinomial where each of the n tumor clones is observed at least once. The corresponding power calculation is to determine the smallest number k for which the tail probability is greater than or equal to the confidence level γ. Note that this power calculation for observing all clones was previously introduced by Davis et al. (2019).
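The power calculation for observing all clones can be sketched in a few lines of Python. The sketch below evaluates the multinomial tail probability by inclusion-exclusion over the clones that might be missed, which is exact but exponential in the number of clones; it is not the fast Poisson-based method used by SCOPIT. The function names and the `k_max` cap are assumptions made for this example.

```python
from itertools import combinations


def prob_all_clones_observed(prevalences, k):
    """Probability that every clone in `prevalences` is drawn at least once
    among k cells, by inclusion-exclusion over the sets of missed clones."""
    total = 0.0
    for r in range(len(prevalences) + 1):
        for missed in combinations(prevalences, r):
            total += (-1) ** r * (1.0 - sum(missed)) ** k
    return total


def min_cells_all_clones(prevalences, gamma, k_max=100000):
    """Smallest k whose tail probability is at least gamma (None if above k_max)."""
    for k in range(k_max + 1):
        if prob_all_clones_observed(prevalences, k) >= gamma:
            return k
    return None


# Hypothetical tumor with two equally prevalent clones and no normal cells.
print(min_cells_all_clones([0.5, 0.5], 0.95))
```

For two clones of prevalence 0.5 each, the tail probability is 1 − 2 · 0.5^k, so six cells are the fewest that reach a confidence level of 0.95.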
Importantly, in many cases we need not observe all clones of T to distinguish T from the remaining phylogenies (Fig. 2). This means that we may conclude that T is the true phylogeny with a SCS experiment with fewer cells. To formalize this notion, we start by defining a featurette.
A featurette τ is a subset of mutations.
We say that a featurette τ is present in a phylogeny T if the nodes/mutations of τ induce a path of T starting at the root node, otherwise we say that τ is absent in T. The same featurette, however, may be present in more than one phylogeny. Thus, multiple featurettes may be required to distinguish a phylogeny T from the remaining phylogenies $\mathcal{T} \setminus \{T\}$.
A set Π of featurettes is a distinguishing feature of T if (i) for all featurettes τ ∈ Π it holds that τ is present in T, and (ii) for each remaining phylogeny $T' \in \mathcal{T} \setminus \{T\}$ there exists a featurette τ′ ∈ Π where τ′ is absent in T′.
Thus, a SCS experiment where we observe one cell from each clone of a distinguishing feature Π of T enables us to conclude that phylogeny T is the true phylogeny. As discussed, every phylogeny T has a trivial distinguishing feature, which is composed of all featurettes present in T. Moreover, T may have multiple distinguishing features. Therefore, we must consider the complete set of all distinguishing features, which we call the distinguishing feature family.
The set $\Phi(T, \mathcal{T})$ composed of all distinguishing features of T with respect to $\mathcal{T}$ is a distinguishing feature family of T.
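The definitions of presence and of a distinguishing feature translate directly into a small check. The Python sketch below assumes each tree is encoded as a dictionary mapping a mutation to its parent (with `None` for the root); this encoding, the function names and the two toy trees are assumptions made for this example.

```python
def clones(parent):
    """Return the clones of a tree: the mutation sets of all root-to-vertex paths."""
    result = set()
    for v in parent:
        path = []
        while v is not None:
            path.append(v)
            v = parent[v]
        result.add(frozenset(path))
    return result


def is_present(featurette, parent):
    """A featurette is present if its mutations form a path starting at the root."""
    return frozenset(featurette) in clones(parent)


def is_distinguishing(feature, tree, others):
    """Check conditions (i) and (ii) for a set of featurettes."""
    in_tree = all(is_present(t, tree) for t in feature)
    rules_out = all(any(not is_present(t, o) for t in feature) for o in others)
    return in_tree and rules_out


# Hypothetical candidates on mutations a, b, c: a chain and a branching tree.
chain = {"a": None, "b": "a", "c": "b"}
branching = {"a": None, "b": "a", "c": "a"}

print(is_distinguishing([{"a", "b", "c"}], chain, [branching]))
print(is_distinguishing([{"a", "b"}], chain, [branching]))
```

In this toy example the single featurette {a, b, c} distinguishes the chain from the branching tree, whereas {a, b} does not because it is present in both trees.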
Let (c0, …, cn) be the outcome of a SCS experiment of k cells, where ci ≥ 0 is the number of cells observed of clone i and $\sum_{i=0}^{n} c_i = k$. This experiment is successful if, among the k sequenced cells, we observe the clones of at least one distinguishing feature, i.e. ci > 0 for all clones i in some distinguishing feature $\Pi \in \Phi(T, \mathcal{T})$. As discussed, conditioning on frequencies f and T being the true phylogeny, outcomes (c0, …, cn) of SCS experiments of k cells follow a multinomial distribution Mult(k, u0, …, un) where u(T, f) = [ui] is defined as in (1). Let Yk denote the event of a successful outcome. We are interested in computing the probability Pr(Yk | u(T, f)), which equals the sum of the probabilities of all successful outcomes. More specifically, we want to determine the smallest number k* of single cells to sequence such that Pr(Yk* | u(T, f)) is at least the prescribed confidence level γ (Fig. 2).
Problem (SCS Power Calculation for Phylogeny T (T-SCS-PC)). Given a set $\mathcal{T}$ of candidate phylogenies and a phylogeny $T \in \mathcal{T}$, frequencies f and confidence level γ, find the minimum number k* of single cells needed such that Pr(Yk* | u(T, f)) ≥ γ.
Section A.1 proves that the above problem is NP-hard.
T-SCS-PC is NP-hard.
3 Methods
We introduce Phylogenetic Design Of Single-cell sequencing Experiments (PhyDOSE), a method to determine the number of single cells to sequence to identify the true phylogeny given initial bulk sequencing data. PhyDOSE is implemented in C++/R and is available at https://github.com/elkebir-group/PhyDOSE. This section describes the various methodological components of PhyDOSE.
3.1 Multinomial Power Calculation
To solve the T-SCS-PC problem, it suffices to have an algorithm that computes Pr(Yk | u(T, f)), which is the probability of concluding that T is the true phylogeny. Using this algorithm we identify k* by starting from k = 0 and simply incrementing k until the corresponding probability Pr(Yk | u(T, f)) is at least the prescribed confidence level γ. In the following, we describe how to efficiently compute Pr(Yk | u(T, f)).
Recall that the outcome of a SCS experiment composed of k cells corresponds to a vector c = [ci], where ci ≥ 0 is the number of cells that we observe from clone i and $\sum_{i=0}^{n} c_i = k$. In a successful outcome c we observe at least one cell for each featurette in at least one distinguishing feature $\Pi \in \Phi(T, \mathcal{T})$, where $\Phi(T, \mathcal{T})$ is the distinguishing feature family. For brevity, we will write Φ rather than $\Phi(T, \mathcal{T})$. Let c(Π, k) denote the set of all outcomes where we observe at least one cell for each featurette in a distinguishing feature Π, i.e. $\sum_{i=0}^{n} c_i = k$, and for all i ∈ {0, …, n} it holds that ci > 0 if clone i is a featurette in Π and ci ≥ 0 otherwise. The set c(Φ, k) of successful outcomes is defined as the union ∪Π∈Φ c(Π, k). Any SCS outcome c = (c0, …, cn) is distributed according to Mult(k, u(T, f)). Since successful outcomes enable us to conclude that T is the true phylogeny, we have
$$\Pr(Y_k \mid u(T, f)) = \sum_{c \in c(\Phi, k)} \Pr(c \mid k, u(T, f)). \tag{2}$$
If there is only one distinguishing feature Π, i.e. Φ = {Π}, then the desired probability is a standard tail probability of the multinomial where we sum up the probabilities of outcomes c(Π, k) = [ci] such that ci > 0 if clone i is a featurette of Π and ci ≥ 0 otherwise. Davis et al. (2019) have developed a method titled SCOPIT that performs a fast calculation of this tail probability using a connection to the conditional probability of independent Poisson random variables described by Levin (1981). If there are multiple distinguishing features but they are pairwise disjoint, i.e. no two distinct distinguishing features share the same featurette, then we simply have
$$\Pr(Y_k \mid u(T, f)) = \sum_{\Pi \in \Phi} \Pr(c(\Pi, k) \mid u(T, f)) \tag{3}$$
and we can apply the fast computation (Davis et al., 2019) to obtain each independent tail probability. However, the equality in the above equation does not hold if the family Φ is composed of distinguishing features with overlapping featurettes. Incorrectly applying this equation will lead us to overestimate the value of k*. Since single-cell sequencing is expensive, overestimating the number of cells to sequence in a SCS experiment can be costly and unnecessary. One naive way would be to simply brute force all $(n + 1)^k$ SCS outcomes, but this will not scale. Instead, to calculate Pr(Yk | u(T, f)) exactly, we propose to use the inclusion-exclusion principle as follows.
$$\Pr(Y_k \mid u(T, f)) = \sum_{\emptyset \neq \Phi' \subseteq \Phi} (-1)^{|\Phi'| + 1} \Pr(c(I(\Phi'), k) \mid u(T, f)), \tag{4}$$
where I(Φ′) is the set of all featurettes in Φ′, i.e. I(Φ′) = ∪Π∈Φ′ Π (Fig. 3a).
Thus, we need to compute $2^{|\Phi|} - 1$ tail probabilities, each of which can be computed using the fast calculation in SCOPIT (Davis et al., 2019). In the worst case, Φ has $O(2^n)$ distinguishing features resulting in $O(2^n)$ tail probabilities. We now describe one final optimization that will significantly reduce the number of required computations. This is based on the following observation.
Observation 1. If Π is a distinguishing feature of T then for all featurettes τ present in T it holds that Π ∪ {τ} is a distinguishing feature of T.
This means that distinguishing features in Φ form a partially ordered set under the set inclusion relation. We call a distinguishing feature Π minimal if there does not exist another distinguishing feature Π′ ∈ Φ that is a proper subset of Π, i.e. Π′ ⊊ Π. A direct consequence of Observation 1 is that the outcome of an SCS experiment is successful when we observe all featurettes of a distinguishing feature Π, and remains so even if we observe additional featurettes τ′ ∉ Π. As such, successful outcomes w.r.t. Φ equal those w.r.t. the set Φ* of all minimal distinguishing features of T.
It holds that c(Φ*, k) = c(Φ, k).
Therefore, it suffices to restrict our attention to only Φ* rather than the complete family Φ when computing Pr(Yk | u(T, f)) using (4). Section A.2.1 describes how to find Φ* by reducing the problem to that of finding all minimal set covers, which we solve in an iterative fashion using integer linear programming.
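The overall computation can be sketched as follows in Python: apply inclusion-exclusion over non-empty subsets of the minimal distinguishing features, evaluate one tail probability per subset, and increment k until the confidence level is reached. This is an illustrative sketch under stated assumptions rather than the PhyDOSE implementation. In particular, `tail_prob` uses a simple exact inclusion-exclusion instead of SCOPIT's fast method, and the function names, the clone-identifier encoding of featurettes and the `k_max` cap are hypothetical.

```python
from itertools import combinations


def tail_prob(required, k):
    """Probability that every clone with a prevalence in `required` is
    observed at least once among k cells."""
    total = 0.0
    for r in range(len(required) + 1):
        for missed in combinations(required, r):
            total += (-1) ** r * (1.0 - sum(missed)) ** k
    return total


def success_prob(min_features, u, k):
    """Pr(Y_k): at least one distinguishing feature is fully observed.
    `min_features` is a list of sets of clone ids; `u` maps clone id to prevalence."""
    total = 0.0
    for r in range(1, len(min_features) + 1):
        for subset in combinations(min_features, r):
            featurettes = set().union(*subset)  # I(subset): all featurettes involved
            sign = (-1) ** (r + 1)
            total += sign * tail_prob([u[i] for i in featurettes], k)
    return total


def min_cells(min_features, u, gamma, k_max=10000):
    """Smallest k with Pr(Y_k) >= gamma (None if no such k up to k_max)."""
    for k in range(k_max + 1):
        if success_prob(min_features, u, k) >= gamma:
            return k
    return None


# Hypothetical instance: two overlapping distinguishing features sharing clone 1.
u = {1: 0.5, 2: 0.25, 3: 0.25}
features = [{1, 2}, {1, 3}]
print(success_prob(features, u, 2), min_cells(features, u, 0.95))
```

With two cells in this toy instance, each feature is observed with probability 0.25 and both cannot be observed together, so the success probability is 0.5.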
3.2 Consideration of SCS Error Rates
One current challenge with SCS is that the false negative rate per site is quite high with typical rates up to 0.4 for the commonly used multiple displacement amplification (MDA) method (Fu et al., 2015). On the other hand, current false positive rates are low and are typically less than 0.0005 for MDA-based whole-genome amplification (Fu et al., 2015). A false negative is defined as not observing a mutation that is present in the cell. A false positive occurs when we observe the presence of a mutation that did not occur in that cell.
Our method can easily be adjusted when the false negative rate β is known. The probability of a true positive corresponds to 1 − β. To observe a featurette/clone i that has ni mutations and a prevalence of ui, we thus need to have ni true positives. Thus, we derive new clonal prevalences u′ = [u′i] from u(T, f) = [ui]. For each i ∈ {1, …, n}, we set $u'_i = u_i (1 - \beta)^{n_i}$, where ni is the number of mutations in featurette/clone i. We set $u'_0$ to be equal to $1 - \sum_{i=1}^{n} u'_i$. This adjustment results in a reduction of the clonal prevalences and ultimately increases the value of k*. The issue of false positives is less serious as error rates are low enough to be negligible.
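The false negative adjustment can be sketched as below. The sketch assumes that the ni mutations of clone i must each be detected independently with probability 1 − β, so that the prevalence of clone i is scaled by (1 − β) raised to the power ni, with the remaining mass assigned to index 0. The function name, the dictionary encoding and the toy values are assumptions made for this example.

```python
def adjust_for_false_negatives(u, n_mut, beta):
    """Scale each tumor clone's prevalence by (1 - beta) ** n_mut[i] and
    assign the remaining probability mass to index 0."""
    adjusted = {i: u[i] * (1.0 - beta) ** n_mut[i] for i in u if i != 0}
    adjusted[0] = 1.0 - sum(adjusted.values())
    return adjusted


# Hypothetical prevalences (index 0 is the normal clone) and mutation counts.
u = {0: 0.1, 1: 0.5, 2: 0.4}
n_mut = {1: 1, 2: 2}

print(adjust_for_false_negatives(u, n_mut, beta=0.2))
```

Clones with more mutations are penalized more strongly, which is why features made of deep clones require more cells once false negatives are taken into account.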
3.3 Determining the True Phylogeny T*
The final step is to determine the true phylogeny T* after performing a SCS experiment with the number k* of cells computed by PhyDOSE. To this end, we compute the support of each tree $T \in \mathcal{T}$. Intuitively, support(T) is the number of cells that support the conclusion that T is the true phylogeny. Formally, we say that a distinguishing feature Π of a tree T is observed if each featurette of Π is observed in at least one cell. Using this, we define support(T) as the number of cells that correspond to featurettes of an observed distinguishing feature Π of T. Per Observation 1, it suffices to restrict our attention to the set Φ* of minimal distinguishing features. There are two outcomes of a SCS experiment with k* cells. Either there is no tree with non-zero support or there are one or more trees with non-zero support. In the former case, the SCS experiment has failed, which is expected to occur with probability 1 − γ. In the latter case, where several trees may have non-zero support in the presence of false negatives and false positives, we return the set of trees with maximum support.
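One possible reading of the support computation is sketched below in Python. It assumes that each sequenced cell is summarized by the set of mutations it carries, and that when several minimal distinguishing features of a tree are observed, support counts every cell matching a featurette of any of them; the text does not specify this tie-handling, so it is an assumption of the sketch, as are the function name and the toy data.

```python
from collections import Counter


def support(min_features, cells):
    """Count the cells matching a featurette of any observed minimal
    distinguishing feature. `min_features` is a list of features, each a
    list of mutation sets; `cells` is a list of mutation sets."""
    counts = Counter(frozenset(c) for c in cells)
    observed = [f for f in min_features
                if all(counts[frozenset(t)] > 0 for t in f)]
    used = {frozenset(t) for f in observed for t in f}
    return sum(counts[t] for t in used)


# Hypothetical experiment with four error-free cells.
cells = [{"a"}, {"a", "b", "c"}, {"a", "b", "c"}, {"a", "b"}]
chain_features = [[{"a", "b", "c"}]]   # minimal features of a chain a-b-c
branching_features = [[{"a", "c"}]]    # minimal features of a tree with a -> b, a -> c

print(support(chain_features, cells), support(branching_features, cells))
```

Here the chain tree receives support from the two cells carrying {a, b, c}, while the branching tree receives none, so the chain would be returned.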
4 Results
In this section, we demonstrate the application of PhyDOSE to simulated and real data. Section 4.1 provides results for simulated data. Section 4.2 provides retrospective results for a leukemia patient where both bulk and single-cell DNA sequencing have been performed (Gawad et al., 2014). Finally, Section 4.3 uses PhyDOSE to perform a prospective analysis to determine the required number of single cells to identify the true phylogeny in a lung cancer patient cohort (Jamal-Hanjani et al., 2017).
4.1 Simulated data
To assess the performance of PhyDOSE, we generated simulated data where the ground truth tree T* is known. Given a fixed number n of mutations, we first generated a ground truth tree T* with n vertices uniformly at random using Prüfer sequences (Prüfer, 1918). Next, we generated clonal prevalences u = [ui] by drawing from a symmetric (n + 1)-dimensional Dirichlet distribution with concentration parameter 0.2. We used rejection sampling to ensure that each clonal prevalence ui was at least 0.05. Let σ(i) be the set of clones that contain mutation i. We generated frequencies f = [fi] by setting fi = ∑j∈σ(i) uj for each mutation i ∈ {1, …, n}. We used the SPRUCE algorithm (El-Kebir et al., 2015) to enumerate the set $\mathcal{T}$ of trees given frequencies f. Finally, we considered varying false negative rates β ∈ {0, 0.1, 0.2}. To validate our method, we generated, for each simulation instance, 10000 single cells sampled from the clones of the ground truth tree T* according to the clonal prevalences u and under the specified false negative rate β. In total, we generated 100 simulation instances for n = 5 mutations and each value of β.
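Two steps of this simulation setup, sampling clonal prevalences and deriving frequencies from them, can be sketched with the Python standard library. The sketch draws a symmetric Dirichlet vector by normalizing independent Gamma variates and rejects draws with an entry below the floor. The function names, the parent-dictionary tree encoding, the seed and the toy tree are assumptions made for this example; the generation of the tree itself via Prüfer sequences is not shown.

```python
import random


def sample_prevalences(n, alpha=0.2, floor=0.05, rng=random):
    """Draw (u_0, ..., u_n) from a symmetric Dirichlet(alpha) by normalizing
    Gamma variates, rejecting draws with any entry below `floor`."""
    while True:
        g = [rng.gammavariate(alpha, 1.0) for _ in range(n + 1)]
        total = sum(g)
        if total > 0:
            u = [x / total for x in g]
            if min(u) >= floor:
                return u


def frequencies(parent, u):
    """f_i is the sum of u_j over all clones j that contain mutation i,
    i.e. over vertex i and its descendants. u[0] is the normal clone."""
    f = {i: 0.0 for i in parent}
    for j in parent:
        v = j
        while v is not None:   # clone j contains every mutation on its root path
            f[v] += u[j]
            v = parent[v]
    return f


# Hypothetical tree on n = 5 mutations, given as mutation -> parent.
parent = {1: None, 2: 1, 3: 1, 4: 2, 5: 2}
u = sample_prevalences(5, rng=random.Random(0))
f = frequencies(parent, u)
print(u, f)
```

By construction the root frequency equals one minus the prevalence of the normal clone, and each frequency is at least the sum of its children's frequencies.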
The number $|\mathcal{T}|$ of trees in our simulations ranged from 1 to 25 with a median of 11 trees per simulation instance (Fig. 4a). We ran PhyDOSE to identify the minimal distinguishing feature family Φ* for each tree in each simulation instance, which yielded a single minimal distinguishing feature in each case. Fig. 4b shows the number of featurettes in each minimal distinguishing feature identified by PhyDOSE for each tree in each simulation instance, ranging from 1 to 3 with a median of 2. Importantly, this number is smaller than the total number of 5 featurettes. As such, running PhyDOSE resulted in a median reduction of ~76% in the number of cells needed to identify the true tree T* compared to the naive approach of requiring all featurettes/clones to be observed (Fig. 3c). In particular, with a confidence level of γ = 0.95, PhyDOSE computed a median number of k* = 36 single cells to identify the true phylogeny T* (Fig. 4d) compared to a median number of k* = 127 single cells proposed by the naive method (Fig. S3). Upon performing 100 in silico SCS experiments with PhyDOSE’s number k* of cells for each simulation, we observed that a median of 96% of experiments uniquely identified T* (Fig. 4e). With increasing false negative rates β ∈ {0, 0.1, 0.2} we observed that (i) PhyDOSE continued to outperform the naive method (Fig. 4c), (ii) more cells were needed to identify T* (Fig. 4d), but (iii) the median fraction of successful in silico SCS experiments remained close to γ = 0.95 although variance increased (Fig. 4e).
In summary, our simulations demonstrate that PhyDOSE’s distinguishing feature analysis results in significantly fewer cells to sequence than the naive approach without a loss in power to identify the true phylogeny. Moreover, we find that PhyDOSE is robust to increasing false negative rates that are typical of real data.
4.2 Retrospective Analysis of a Leukemia Patient
We considered a cohort of six childhood acute lymphoblastic leukemia (ALL) patients whose blood was sequenced using bulk and targeted single-cell DNA sequencing (Gawad et al., 2014). The number of sequenced single cells per patient varied between 96 and 150. To validate our approach, we used PhyDOSE to calculate the number k(T*) of cells needed to identify the true phylogeny T* that is consistent with both data types, thereby retrospectively determining whether fewer single cells suffice to determine T*, decreasing the cost of replicate experiments. In addition, we assessed whether the calculated number k(T*) yielded T* using in silico SCS experiments.
Recall that PhyDOSE relies on two key assumptions, i.e. (i) a correspondence between mutation frequencies in the bulk and SCS data, and (ii) the presence of T* among the trees inferred from the bulk data under the infinite sites assumption. Only patient 2 satisfied both criteria (as detailed in Section A.3). Gawad et al. (2014) sequenced 16 autosomal mutations in 115 cells for this patient. Using the infinite sites assumption and assuming the absence of copy-number aberrations, we define the cancer cell fraction, or frequency fi of each mutation i in the bulk data as 2 VAF(i). We define the SCS mutation frequency as the fraction of single cells that harbor the mutation. Strikingly, there is a clear correlation between the bulk and SCS mutation frequencies, supporting PhyDOSE’s first assumption (Fig. 5a). We excluded mutation CMTM8 because of a notable discrepancy in frequencies (0.4 in bulk vs. 0.2 in SCS). Using SPRUCE (El-Kebir et al., 2015), we enumerated the set of trees from the bulk data, yielding over 2.5 million trees. This number is mainly driven by 3 mutations (ATRNL1, LINC00052 and TRRAP) with a VAF less than 0.05. Excluding these 3 mutations resulted in a more tractable number of 2576 trees. Fig. 5b shows the single tree that was consistent with the cleaned single-cell data, supporting PhyDOSE’s second assumption.
We ran PhyDOSE using varying confidence levels γ ∈ {0.75, 0.95} and an estimated false negative rate of β = 0.2. PhyDOSE calculated that k(T*) = 103 cells suffice to identify T* with confidence level γ = 0.95. Indeed, performing 100 in silico SCS experiments, by sampling k(T*) cells among the 115 sequenced cells without replacement, yielded a success rate of 99% (Fig. 5c). To reduce costs, we explored what would have happened retrospectively with a lower confidence level γ of 0.75. PhyDOSE calculated that k(T*) = 50 cells are needed for γ = 0.75, which is a significant cost savings over γ = 0.95. Performing 100 in silico SCS experiments yielded a success rate of uniquely identifying T* of 66%, which was lower than the expected rate of 75%. Furthermore, we noted that in an additional 26% of experiments the correct phylogeny T* was among the trees with the highest overall support (Fig. 5c). The number of trees in the tied set of successes varied from 2 to 6 (Fig. 5d), showing that although PhyDOSE did not uniquely identify the tree, it was able to significantly reduce the original set of 2576 trees (Figs. S4 and S5).
In summary, this retrospective analysis shows that the true tree for patient 2 could have been identified confidently with fewer cells than the 115 cells sequenced by Gawad et al. (2014). With a lower confidence level γ, PhyDOSE computes that far fewer cells are required, significantly reducing costs but at the expense of a lower success rate of uniquely identifying the true phylogeny. Nevertheless, the resulting SCS experiment will eliminate a large fraction of the original set of candidate phylogenies due to the incorporation of distinguishing features in the PhyDOSE power calculation.
4.3 Prospective Analysis of a Lung Cancer Cohort
Using PhyDOSE, we prospectively determined the number of cells needed to uniquely identify the true phylogeny for the 25 out of 100 patients in the TRACERx non-small-cell lung cancer cohort that have multiple candidate trees (Jamal-Hanjani et al., 2017). The authors previously identified the set of candidate trees for each patient using CITUP (Malikic et al., 2015) after clustering mutations with PyClone (Roth et al., 2014). Jamal-Hanjani et al. (2017) reported the cancer cell fraction of each mutation cluster in each bulk sample. The number of trees in the candidate set for each patient ranged from 2 to 17, with each containing mutation clusters with between 5 and 882 mutations (Table S6).
Unlike in the simulations and ALL patient 2, multiple bulk samples per patient were available for analysis. Therefore, we calculated k* for each sample independently for all 25 patients at varying confidence level γ ∈ {0.75, 0.95}. Mutation clusters alleviate the issue of false negatives, i.e. it suffices to only observe a single mutation to impute the presence of the other mutations in the same cluster. Here, with a typical SCS false negative rate of 0.2, the probability of all mutations in the smallest cluster (with size 5) dropping out thus equals $0.2^5 = 0.00032$, a probability that can be neglected. As such, we set β = 0. The reported k* value is the minimum k* over the set of available samples, which also identifies the best sample to use for the SCS experiment. PhyDOSE was able to return a finite value of k* for 23 out of the 25 patients. PhyDOSE returns ∞ when, for each sample of the patient, every distinguishing feature contains a featurette whose clonal prevalence is 0. For two of the 23 patients the calculated k* was over 400 due to featurettes in the distinguishing features with low clonal prevalences. For the remaining 21 patients, the median value of k* was 29 for γ = 0.95 and 14 for γ = 0.75 (Fig. S6). These strikingly low values of k* for the majority of the 25 patients with multiple candidate trees demonstrate the benefit of using PhyDOSE to strategically optimize the design of follow-up single-cell experiments.
5 Discussion
In this work, we showed that the mutation frequencies f and the set $\mathcal{T}$ of tumor phylogenies inferred from initial bulk data contain valuable information to provide guidance for follow-up SCS experiments. We introduced PhyDOSE, a method to calculate the number k* of single cells needed to infer the true phylogeny T* given f, $\mathcal{T}$ and a user-specified confidence level γ. Underpinning our method is the observation that often only a subset of clones suffices to distinguish one tree $T \in \mathcal{T}$ from the remaining trees $\mathcal{T} \setminus \{T\}$. Thus, by observing cells from these distinguishing clones in a follow-up SCS experiment (the probability of which we model as a tail probability of a multinomial distribution), we can definitively conclude that T is the true phylogeny. We validated PhyDOSE using simulations and a retrospective analysis of a leukemia patient, concluding that PhyDOSE’s computed number k* of cells resolves tree ambiguity, even in the presence of SCS errors. In a prospective analysis, we demonstrated that only a small number of cells suffice to disambiguate the solution space of trees in a recent lung cancer cohort. In summary, PhyDOSE proposes cost-efficient SCS experiments that will yield high-fidelity phylogenies, which will consequently improve downstream analyses in cancer genomics aimed at deepening our understanding of cancer biology.
There are several future research directions. First, in the case of multiple bulk samples, rather than selecting cells from a single sample, a better strategy would be to select cells across samples. To model this accurately, we must consider a multinomial mixture model. Second, to further reduce SCS costs, we might want to include a mutation selection step as part of our approach to perform targeted rather than whole-genome sequencing. Third, similar ideas can be used to design follow-up sequencing experiments using alternative sequencing technologies such as long read sequencing. Fourth, we plan to replace the integer linear program used for identifying minimal distinguishing features with a combinatorial algorithm. This will enable us to develop an R package that is easy to install and use, with a Shiny user interface. Fifth, to improve robustness in the presence of SCS errors, we plan to explore alternative definitions of successful SCS experiment outcomes, requiring that more than one cell is observed of each featurette of a distinguishing feature. Sixth, we plan to explore evolutionary models beyond the infinite sites model, such as the Dollo parsimony model where mutations might be lost (El-Kebir, 2018). Finally, the concept of distinguishing features may be useful to summarize diverse solution spaces in cancer phylogenetics (Aguse et al., 2019).
A.1 Complexity
Theorem. T-SCS-PC is NP-hard.
We prove the theorem using a polynomial-time reduction from the Set Cover problem, a known NP-hard problem (Karp, 1972).
Problem 3 (Set Cover). Given a family of subsets Sj over a universe U = {1, …, n}, find a cover C, i.e. a collection of these subsets such that ∪S∈C S = U, with minimum cardinality.
Specifically, we reduce a Set Cover instance to a T-SCS-PC instance as follows. The set of trees includes one tree Ti for each element i in the universe U and an additional tree T0. All trees have one vertex for each subset Sj and two additional mutations {⊤, ⊥}. Each tree includes the edge (⊤, ⊥). Additionally, if element i ∈ U is absent from subset Sj then there is an edge (⊤, Sj) in tree Ti; otherwise Ti includes an edge (⊥, Sj). Tree T0 includes edges (⊤, Sj) for all subsets Sj. As for the frequencies f, we set f⊤ = 1, f⊥ = 0.5 and the remaining frequencies to ε for all subsets Sj. Moreover, we set the confidence level γ to ε as well. In the corresponding T-SCS-PC instance, the tree of interest is T0. Fig. S1 shows an example.
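The construction above can be sketched in a few lines. This is an illustrative encoding, not the authors' code: trees are represented as sets of directed edges, the labels `TOP` and `BOT` stand in for ⊤ and ⊥, subsets are labeled `S1`, `S2`, …, and the small float `eps` is only a numerical stand-in for the infinitesimal ε. Universe elements are assumed to be positive integers so that key 0 can denote T0.

```python
def build_reduction(universe, subsets, eps=1e-6):
    """Build the trees T0, T1, ..., Tn and the frequencies f of the reduction.

    Returns (trees, freqs), where trees maps a tree index to its set of
    (parent, child) edges and freqs maps each mutation label to its frequency.
    """
    labels = [f"S{j}" for j in range(1, len(subsets) + 1)]
    base = {("TOP", "BOT")}  # every tree contains the edge (TOP, BOT)

    # T0 attaches every subset vertex directly below TOP.
    trees = {0: base | {("TOP", label) for label in labels}}

    # Ti attaches Sj below BOT if element i is in Sj, and below TOP otherwise.
    for i in universe:
        trees[i] = base | {
            ("BOT" if i in subset else "TOP", label)
            for label, subset in zip(labels, subsets)
        }

    freqs = {"TOP": 1.0, "BOT": 0.5, **{label: eps for label in labels}}
    return trees, freqs


trees, freqs = build_reduction({1, 2, 3}, [{1, 2}, {2, 3}])
print(sorted(trees[1]))
```

In this toy instance, element 1 belongs only to the first subset, so tree T1 places S1 below `BOT` and S2 below `TOP`.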
The key idea is that, as γ = ε > 0 is a small positive infinitesimal constant, this T-SCS-PC instance seeks the smallest number k* of cells such that Pr(Yk* | u(T0, f)) is non-zero. In particular, this number k* of cells will only be achieved if there is a distinguishing feature Π of the same size k*. By our reduction, there is a 1-1 correspondence between set covers of U and distinguishing features Π of T0 with respect to {T1, …, Tn}. Specifically, a set cover C of size k corresponds to a distinguishing feature Π(C) of the same size k, and vice versa. As such, we have the following lemma whose proof is in the supplement.
Lemma. Consider the T-SCS-PC instance corresponding to a given Set Cover instance. A minimum cover has size k* if and only if k* is the smallest integer such that Pr(Yk* | u(T0, f)) ≥ γ.
Proof. (⇒) Let C be a minimum cover of the Set Cover instance. By the premise, we have that |C| = k*. We start by showing that Pr(Yk* | u(T0, f)) ≥ γ by constructing a distinguishing feature Π(C) of T0 where |Π(C)| = k*. Observe that for each subset Sj in C we have that {⊤, Sj} is a featurette of T0. We define Π(C) to be composed of featurettes {⊤, Sj} for all subsets Sj ∈ C. Thus, |Π(C)| = k*. To show that Π(C) is a distinguishing feature of T0, it remains to show that at least one featurette τ ∈ Π(C) is absent in each tree in {T1, …, Tn}. Consider any tree Ti ≠ T0. Since C is a cover, the element i of the universe U corresponding to tree Ti must be covered by some subset Sj ∈ C. This means that tree Ti contains the edge (⊥, Sj), which means that the featurette {⊤, Sj} in Π(C) is absent from Ti. Hence, Π(C) is a distinguishing feature of T0.
We now must show that Pr(Yk* | u(T0, f)) ≥ γ. We do so by focusing on distinguishing feature Π(C). By construction of T0 and f, it follows from (1) that each featurette {⊤, Sj} in Π(C) has a clonal prevalence uj = ε. This means that a SCS experiment of k* cells where we only observe the k* featurettes/clones has a probability that is strictly greater than 0. Therefore, Pr(Yk* | u(T0, f)) > 0. Since ε is a small positive infinitesimal constant, we have that Pr(Yk* | u(T0, f)) ≥ γ = ε.
It remains to show that k* is the smallest integer where Pr(Yk* | u(T0, f)) ≥ ε. Assume for a contradiction that the smallest integer k′ where Pr(Yk′ | u(T0, f)) ≥ ε is strictly smaller than k*. This means that there exists a minimal distinguishing feature Π′ of size at most k′. By definition Π′ is composed of featurettes corresponding to root-to-vertex paths in T0. Since Π′ is minimal, it will not contain the featurette {⊤, ⊥}, as this featurette is present in all remaining trees {T1, …, Tn}. Thus, Π′ is composed of featurettes of the form {⊤, Sj} where Sj is one of the subsets of the Set Cover instance. Since Π′ is a distinguishing feature, no tree Ti ∈ {T1, …, Tn} contains all featurettes of Π′. By construction of {T1, …, Tn}, this means that the subsets encoded in Π′ form a cover of the universe U. Thus, there exists a cover with size strictly smaller than k*, contradicting the premise. Therefore, k* is indeed the smallest integer where Pr(Yk* | u(T0, f)) ≥ γ = ε.
(⇐) Let k* be the smallest integer such that Pr(Yk* | u(T0, f)) ≥ γ = ε. We start by showing that the size of a minimum distinguishing feature Π of T0 has to be exactly k*. Clearly, if |Π| > k* then Pr(Yk* | u(T0, f)) = 0 as there exists no successful SCS experiment with k* cells. On the other hand, if |Π| < k* then there exists a successful SCS experiment with |Π| cells. In other words, Pr(Y|Π| | u(T0, f)) ≥ ε. This contradicts that k* is the smallest integer where Pr(Yk* | u(T0, f)) ≥ ε. Hence, |Π| = k*.
Consider a minimum distinguishing feature Π of T0. By the previous argument, we know that |Π| = k*. We will show that Π encodes a cover C(Π) of U of size k*. Since Π is minimal, it will not contain the featurette {⊤, ⊥} of T0, as this featurette is present in all remaining trees {T1, …, Tn}. Thus, Π is composed of k* featurettes of the form {⊤, Sj} where Sj is one of the subsets of the Set Cover instance. Let C(Π) be defined as the collection of subsets Sj where {⊤, Sj} ∈ Π. Since Π is a distinguishing feature, no tree Ti ∈ {T1, …, Tn} contains all featurettes of Π. By construction of {T1, …, Tn}, this means that C(Π) is a cover of size k* of the universe U.
Finally, we must show that there exists no cover C′ of U with size |C′| strictly smaller than k*. Suppose for a contradiction that such a cover C′ exists. By construction, C′ encodes a distinguishing feature Π(C′) composed of featurettes {⊤, Sj} for all subsets Sj ∈ C′. Thus, |Π(C′)| = |C′|. To show that Π(C′) is a distinguishing feature of T0, we must show that (i) all featurettes τ ∈ Π(C′) are present in T0, and (ii) at least one featurette τ ∈ Π(C′) is absent in each tree in {T1, …, Tn}. Condition (i) holds by construction of Π(C′) and T0, i.e. for each subset Sj in C′ we have that {⊤, Sj} is a featurette of T0. As for condition (ii), consider any tree Ti ≠ T0. Since C′ is a cover, the element i of the universe U corresponding to tree Ti must be covered by some subset Sj ∈ C′. This means that tree Ti contains the edge (⊥, Sj), which means that the featurette {⊤, Sj} in Π(C′) is absent from Ti. Hence, Π(C′) is a distinguishing feature of T0. This in turn means that Pr(Y|Π(C′)| | u(T0, f)) > 0. In other words, Pr(Y|Π(C′)| | u(T0, f)) ≥ γ = ε with |Π(C′)| < k*, thus contradicting the premise. Hence, minimum set covers of the Set Cover instance have cardinality k*.
The theorem follows from the above lemma, as the reduction that obtains the T-SCS-PC instance from the Set Cover instance takes only polynomial time.
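The correspondence used in the proof can be checked by brute force on a toy instance. The sketch below is an illustration under assumptions, not part of the proof: featurettes are taken to be the mutation sets on root-to-vertex paths, `TOP` and `BOT` stand in for ⊤ and ⊥, and all function names are hypothetical. Both searches are exponential and only meant for very small inputs.

```python
from itertools import combinations


def clones(edges, root="TOP"):
    """Featurettes of a tree: the mutation set on each root-to-vertex path."""
    children = {}
    for parent, child in edges:
        children.setdefault(parent, []).append(child)
    result, stack = set(), [(root, frozenset([root]))]
    while stack:
        vertex, path = stack.pop()
        result.add(path)
        for child in children.get(vertex, []):
            stack.append((child, path | {child}))
    return result


def reduction_trees(universe, subsets):
    """Return T0 and the list [T1, ..., Tn] of the reduction as edge sets."""
    labels = [f"S{j}" for j in range(1, len(subsets) + 1)]
    base = {("TOP", "BOT")}
    t0 = base | {("TOP", label) for label in labels}
    others = [
        base | {("BOT" if i in s else "TOP", label) for label, s in zip(labels, subsets)}
        for i in sorted(universe)
    ]
    return t0, others


def min_distinguishing_feature_size(t0, others):
    """Size of a smallest featurette set of T0 not fully contained in any other tree."""
    candidates = sorted(clones(t0), key=sorted)
    other_clones = [clones(t) for t in others]
    for size in range(1, len(candidates) + 1):
        for feature in combinations(candidates, size):
            if all(any(f not in oc for f in feature) for oc in other_clones):
                return size
    return None


def min_cover_size(universe, subsets):
    """Size of a smallest collection of subsets whose union contains the universe."""
    for size in range(1, len(subsets) + 1):
        for cover in combinations(subsets, size):
            if set().union(*cover) >= set(universe):
                return size
    return None


universe = {1, 2, 3, 4}
subsets = [{1, 2}, {3, 4}, {2, 3}, {1}]
t0, others = reduction_trees(universe, subsets)
print(min_cover_size(universe, subsets), min_distinguishing_feature_size(t0, others))
```

For this instance no single subset covers the universe, while {1, 2} and {3, 4} together do, so both quantities agree, as the lemma predicts.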
A.2 Supplementary Methods
A.2.1 Finding the Minimal Distinguishing Feature Family Φ*
To perform the calculation in (4), it is necessary to first find the minimal distinguishing feature family Φ*. Using similar ideas as in our hardness proof (Section A.1), we consider the reverse reduction from the problem of finding a minimal distinguishing feature to that of finding a minimum size set cover (Problem 3).
We define the universe U = {1, …, m} to be the set of trees excluding the tree T for which we want to solve the T-SCS-PC problem. We define the family of subsets S1, …, Sn to correspond to the n featurettes present in T. Specifically, the subset Sj corresponding to featurette τj that is present in T is composed of elements i ∈ U corresponding to trees Ti where τj is absent. The key idea is that Sj indicates in which input trees featurette τj of T is absent. We note that this family is a multi-set, as distinct featurettes τj and τj′ may be absent in the same set of trees, thus leading to Sj = Sj′ (Fig. 3b). There is a bijection between set covers of this family and distinguishing features of T. That is, each distinguishing feature Π = {τ1, …, τ|Π|} corresponds to the same-sized cover C(Π) composed of subsets S1, …, S|Π|, and vice versa. In particular, a minimal distinguishing feature corresponds to a minimal set cover. Thus, we may use an integer linear program (ILP) to find a minimum set cover C and thus a corresponding minimum distinguishing feature Π(C), i.e. a minimal distinguishing feature of minimum size.
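For reference, a textbook ILP for minimum set cover over the universe U and subsets S1, …, Sn defined above is given below. The binary variable x_j, which indicates whether subset S_j is selected, is notation assumed here for illustration and may differ from the authors' formulation.

```latex
\begin{align*}
\min \quad & \sum_{j=1}^{n} x_j \\
\text{s.t.} \quad & \sum_{j \,:\, i \in S_j} x_j \ge 1 && \forall i \in U, \\
& x_j \in \{0, 1\} && \forall j \in \{1, \dots, n\}.
\end{align*}
```

The covering constraint for element i requires that at least one selected featurette of T is absent from tree Ti, which is exactly the condition for a distinguishing feature.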
In order to find the next minimum distinguishing feature Π(C′) that is not contained within Π(C), we add a constraint to the ILP that excludes the previously identified cover C. By repeatedly adding identified minimum set covers to the set of excluded covers until the ILP becomes infeasible, we identify all minimal set covers and thus all minimal distinguishing features Φ*. Fig. 3b shows an example. We use IBM ILOG CPLEX v12.9 to solve the ILP.
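The enumeration described above can be mimicked on small inputs without an ILP solver. The sketch below swaps in plain brute force for the ILP, so it is an illustration of the idea rather than the authors' procedure. It assumes each tree is given as a collection of hashable featurettes; the function names are hypothetical. Skipping any superset of an already found cover plays the role of the exclusion constraint, for which a common choice in ILP terms is a cut of the form Σ_{j : Sj ∈ C} x_j ≤ |C| − 1.

```python
from itertools import combinations


def featurette_subsets(featurettes_of_T, other_trees):
    """For each featurette tau_j of T, the set S_j of indices i of trees T_i lacking tau_j."""
    return [
        {i for i, tree in enumerate(other_trees, start=1) if tau not in tree}
        for tau in featurettes_of_T
    ]


def minimal_covers(universe, subsets):
    """All inclusion-minimal covers, as tuples of subset indices (brute force)."""
    universe = set(universe)
    found = []
    # Increasing size guarantees every smaller minimal cover is found first.
    for size in range(1, len(subsets) + 1):
        for idx in combinations(range(len(subsets)), size):
            if any(set(prev) <= set(idx) for prev in found):
                continue  # contains a cover found earlier, hence not minimal
            if set().union(*(subsets[j] for j in idx)) >= universe:
                found.append(idx)
    return found


# Toy example: T has featurettes a, b, c; three other candidate trees.
featurettes = ["a", "b", "c"]
other_trees = [{"a", "b"}, {"b", "c"}, {"a"}]
subsets = featurette_subsets(featurettes, other_trees)
covers = minimal_covers({1, 2, 3}, subsets)
print([[featurettes[j] for j in cover] for cover in covers])
```

In the toy example the only minimal cover selects featurettes a and c: the first and third trees lack c, and the second tree lacks a.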
A.3 Supplementary Results
Gawad et al. (2014) used an EM-based algorithm to cluster the sequenced cells into 2 to 7 clones for each patient. Based on the fact that false negatives occur more frequently than false positives, we designated an SNV as present if at least 30% of cells in the clone had the mutation. We then checked if the resulting binary, clone-by-SNV matrices adhered to the infinite sites assumption, which was the case for only patients 2 and 3. While the VAFs of all 16 SNVs in patient 2 are less than 0.5, patient 3 had 6 out of 49 SNVs with a VAF larger than 0.5, which is indicative of copy number aberrations. Since no copy number information was available to infer cancer cell fractions, we excluded patient 3 from our analysis, thus restricting our attention to patient 2.
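The two steps above, thresholding per clone and checking the infinite sites assumption, can be sketched as follows. The input layout (a 0/1 cell-by-SNV matrix plus one clone label per cell, with no missing entries) and the function names are assumptions made for illustration. The check uses the standard condition for a rooted two-state perfect phylogeny: for every pair of SNVs, the sets of clones carrying them must be nested or disjoint.

```python
from itertools import combinations


def call_clone_genotypes(cells, clone_of_cell, threshold=0.3):
    """Mark an SNV as present in a clone if at least `threshold` of its cells carry it."""
    by_clone = {}
    for row, clone in zip(cells, clone_of_cell):
        by_clone.setdefault(clone, []).append(row)
    return {
        clone: [int(sum(col) / len(rows) >= threshold) for col in zip(*rows)]
        for clone, rows in by_clone.items()
    }


def is_perfect_phylogeny(matrix):
    """True if every pair of columns has nested or disjoint sets of carrier rows."""
    columns = [
        {r for r, row in enumerate(matrix) if row[c]} for c in range(len(matrix[0]))
    ]
    for a, b in combinations(columns, 2):
        if a & b and not (a <= b or b <= a):
            return False
    return True


cells = [[1, 0], [0, 0], [0, 0], [1, 1]]
labels = ["A", "A", "A", "B"]
genotypes = call_clone_genotypes(cells, labels)
print(genotypes, is_perfect_phylogeny(list(genotypes.values())))
```

In this toy input, one of three cells in clone A carries the first SNV, which exceeds the 30% threshold, so the SNV is called present in clone A.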
Gawad et al. (2014) clustered the 115 cells of patient 2 into 5 clones. This patient has 16 SNVs, from which we excluded mutations CMTM8, ATRNL1, LINC00052 and TRRAP for reasons that we described in the main text. The majority voting rule described above yielded a binary clone-by-SNV matrix with 4 mutation clusters that each correspond to SNVs that co-occur in every clone (Fig. S2a), corresponding to a two-state perfect phylogeny TSCS on the mutation clusters (Fig. 5b). To obtain the set of candidate phylogenies, we considered the bulk data. Specifically, we merged mutations ZC3H3 and XPO7 as they had the same VAF in the bulk data and occurred in the same mutation cluster in the cleaned SCS data (Fig. 5a). Using SPRUCE (El-Kebir et al., 2015), we enumerated the candidate trees (Fig. S2b). Only one tree T* was consistent with TSCS, i.e. each mutation cluster of TSCS formed a connected path in T*, and subsequently collapsing these paths in T* yields TSCS. Comparing the cleaned single-cell data to the raw values, we computed a false negative rate β of 0.2 for the 14 mutations (Fig. S2a), which was in line with the value reported by Gawad et al. (2014).
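One plausible way to compute such a false negative rate is sketched below. It assumes that β is the fraction of entries called present in the cleaned cell-by-SNV matrix that are 0 in the raw matrix; this definition, the matrix layout, and the function name are assumptions and may not match the exact computation used for the reported value.

```python
def false_negative_rate(raw, cleaned):
    """Fraction of entries that are 1 in `cleaned` but 0 in `raw`.

    Both arguments are cell-by-SNV lists of 0/1 values with identical shape.
    """
    expected = missed = 0
    for raw_row, clean_row in zip(raw, cleaned):
        for r, c in zip(raw_row, clean_row):
            if c == 1:
                expected += 1        # mutation should be present in this cell
                missed += (r == 0)   # but the raw call did not detect it
    return missed / expected if expected else float("nan")


raw = [[1, 0, 0], [0, 1, 1]]
cleaned = [[1, 1, 0], [1, 1, 1]]
print(false_negative_rate(raw, cleaned))
```

Here five entries are expected to be present after cleaning and two of them are absent in the raw data.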
Acknowledgements
This work was supported by UIUC Center for Computational Biotechnology and Genomic Medicine (grant: CSN 1624790) and the National Science Foundation (grant: CCF 1850502).
Footnotes
* Shared first authorship
‡ Accepted at RECOMB-CCB 2020