Abstract
Phylogenetic methods are emerging as a useful tool to understand cancer evolutionary dynamics, including tumor structure, heterogeneity, and progression. Most currently used approaches utilize either bulk whole genome sequencing (WGS) or single-cell DNA sequencing (scDNA-seq) and are based on calling copy number alterations and single nucleotide variants (SNVs). Here we explore the potential of single-cell RNA sequencing (scRNA-seq) to reconstruct cancer evolutionary dynamics. scRNA-seq is commonly applied to explore differential gene expression of cancer cells throughout tumor progression. The method exacerbates the single-cell sequencing problem of low yield per cell with uneven expression levels. This accounts for low and uneven sequencing coverage and makes SNV detection and phylogenetic analysis challenging. In this paper, we demonstrate for the first time that scRNA-seq data contains sufficient evolutionary signal and can be utilized in phylogenetic analyses. We explore and compare results of such analyses based on both expression levels and SNVs called from our scRNA-seq data. Both techniques are shown to be useful for reconstructing phylogenetic relationships between cells, reflecting the clonal composition of a tumor. Without an explicit error model, standardized expression values appear to be more powerful and informative than the SNV values at a lower computational cost, due to being a by-product of standard expression analysis. Our results suggest that scRNA-seq can be a competitive alternative or useful addition to conventional scDNA-seq phylogenetic reconstruction. Our results open up a new direction of somatic phylogenetics based on scRNA-seq data. Further research is required to refine and improve these approaches to capture the full picture of somatic evolutionary dynamics in cancer.
Introduction
Phylogenetic analysis is an approach that relies on reconstructing evolutionary relationships between organisms to determine population genetics parameters such as population growth (Kingman 1982; Heled et al. 2008), structure (Müller et al. 2017a) or geographical distribution (Lemey et al. 2009; Lemey et al. 2010). Typically, the reconstructed phylogeny is not the end-goal. Using previously estimated trees, various evolutionary hypotheses can be explored, such as the evolutionary relationship of traits carried by individual taxa (Grafen et al. 1989; Pagel et al. 2004; Freckleton 2012).
Within-organism cancer evolution is increasingly being studied using population genetics approaches, including phylogenetics (Navin et al. 2011; Yuan et al. 2015; Alves et al. 2017; Schwartz et al. 2017; Caravagna et al. 2018; Singer et al. 2018; Alves et al. 2019; Caravagna et al. 2019; Detering et al. 2019; Malikic et al. 2019; Werner et al. 2019; Kuipers et al. 2020), to understand evolutionary dynamics of cancer cell populations. These approaches have shown promise to be developed into therapeutic applications in the personalized medicine framework (Gerlinger et al. 2012; Abbosh et al. 2017; Rao et al. 2020b). Specifically, the clonal composition of tumors, metastasis initiation, development, and timing can be reconstructed using phylogenetic methods (Yuan et al. 2015; Angelova et al. 2018; El-Kebir et al. 2018; Alves et al. 2019). Unlike other evolutionary processes prone to events such as hybridization or horizontal gene transfer, population dynamics of somatic cells is underpinned by a strictly bifurcating clonal process driven by cell division. This is in perfect agreement with theoretical assumptions routinely applied in stochastic phylogenetic models such as the coalescent (Kingman 1982; Hudson et al. 1990; Posada 2020) or birth-death processes (Aldous 1996; Aldous 2001; Komarova 2006).
From the methodological perspective, however, cancer is an evolutionary process with unique characteristics which are not modeled in conventional phylogenetic approaches. These include a high level of genomic instability with structural changes (gene losses and duplications) which accumulate along with point mutations during the course of growth and evolution (Beerenwinkel et al. 2015; Posada 2015).
Traditional Whole Genome Sequencing (WGS) methods have been instrumental in understanding cancer mutational profiles and oncogene detection (Mardis et al. 2009; Nakagawa et al. 2018). DNA from a tissue sample is isolated and sequenced “in bulk”. This increases the total amount of DNA, which improves coverage and reduces amplification errors. To establish the presence or absence of mutations, a variant allele frequency (VAF) is calculated and compared to a threshold, typically 10–20% (Strom 2016). This filters out rare mutations present only in a few reads that are likely to be false positives or sequencing errors (Petrackova et al. 2019). More recently, bulk sequencing has been used to study cancer evolution using phylogenetic methods, either by comparing VAF (Zhai et al. 2017; Zhao et al. 2016; Ling et al. 2015) or estimating copy number variants (CNV) (Desper et al. 1999; Demeulemeester et al. 2016; Tarabichi et al. 2021). However, the usage of bulk sequencing in this context is problematic. Bulk samples contain cells from multiple cell lineages including non-tumor cells, such as immune or blood vessel cells (Racle et al. 2017), and strong evidence also points to a constant migration of metastatic cells between tumors (Aguirre-Ghiso 2010; Cheung et al. 2016; Reiter et al. 2017; Casasent et al. 2018). High VAF thresholds ignore tumor heterogeneity, but by lowering the threshold, mutations in non-tumor cells or clonal lineages are retained instead. Sequences or mutational profiles derived from bulk samples thus have a chimeric origin (Alves et al. 2017).
A typical assumption in classical phylogenetics is that the sequences or mutational profiles represent individual taxonomic units, either individuals or populations of closely related individuals. If these methods are used on the data from bulk samples, the reconstructed trees are not phylogenies describing an evolutionary history, but evolutionarily meaningless sample similarity trees (Alves et al. 2017). To address this issue, phylogenetic trees are reconstructed by estimating the sequential order of somatic mutations using VAF from one or multiple tumor samples (Deshwar et al. 2015; El-Kebir et al. 2018; Miura et al. 2018). Given the tumor heterogeneity and insufficient read depth to reliably estimate VAF, this is not a simple problem and the performance of current methods is limited (Miura et al. 2020).
Single-cell DNA sequencing (scDNA-seq) does not suffer from the chimeric DNA origin of bulk-sequencing as each DNA segment is barcoded to guarantee its known cell of origin. Recent progress in WGS technology made sequencing individual cells cost-efficient (Gawad et al. 2016) and this approach is now regularly used for the phylogenetic reconstruction of metastatic cancer or the subclonal structure of a single tumor (Potter et al. 2013; Roth et al. 2016; Leung et al. 2017; Myers et al. 2019). However, this increased resolution comes with additional complications. Current methods are not sensitive enough to sequence DNA from a single cell and DNA amplification is required (Gawad et al. 2016). This process suffers from a random bias with different parts of the genome amplified in different quantities or not at all (Satas et al. 2018). In addition, polymerase does not replicate DNA without error, and this can have a significant impact if the replication errors occur early in the amplification process (Gawad et al. 2016). This not only increases the error rate for identified SNVs, but a large proportion of SNVs might also simply be missing (Hicks et al. 2018).
The advantages associated with scDNA-seq led to the development of novel approaches that tackle these challenges using an error model to correct for amplification errors and false-positive SNV calls (Zafar et al. 2016; Zafar et al. 2018; Luquette et al. 2019; Kozlov et al. 2020).
Similar technological development led to proliferation of single-cell RNA sequencing (scRNA-seq) which, compared to traditional bulk RNA sequencing, enabled detection of gene expression profiles for individual cells in the tissue sample (Müller et al. 2017b; Olsen et al. 2018; Jerby-Arnon et al. 2018; González-Silva et al. 2020). This makes it possible to understand tumor heterogeneity by identifying different cell populations (Andrews et al. 2018), estimating immune cell content within a tumor (Yu et al. 2019), or even identifying individual clones and subclones, as they can differ in their behavior (Fan et al. 2020). However, as the levels of RNA expression vary between genes and cells, the amplification problems of scDNA-seq that cause unequal expression and drop-out effect are more pronounced in scRNA-seq. There is an increased interest in SNV calling on scRNA-seq data using bulk-SNV callers (Chen et al. 2016; Poirion et al. 2018; Liu et al. 2019; Schnepp et al. 2019) and specialized CNV callers (Kuipers et al. 2020; Harmanci et al. 2020b; Harmanci et al. 2020a; Gao et al. 2021) as this allows for identification of mutations in actively expressed genes.
In this work, we test if expression values and SNVs inferred from scRNA-seq contain phylogenetic information to reconstruct a population history of cancer. We perform an experiment to guarantee a known population history, and then try to reconstruct this history using computational phylogenetics from both expression values and identified SNVs derived from the same scRNA-seq data set. We then compare phylogenies obtained from these methods against the known population history to evaluate the strength of the phylogenetic signal contained in the scRNA-seq data set.
Methods
Experimental design
To guarantee a known population history, immunosuppressed mice were injected with human breast cancer cells. The tumors that develop are derived from the same population and thus share a common ancestor, but evolved independently in each mouse. As each tumor was seeded by a population of cancer cells, a number of small sample-specific clades representing subclonal diversity of the population sample should be observed. To test for the presence of these sample-specific clades, as well as the strength of phylogenetic relationship between cells from each tumor, we employ phylogenetic clustering tests. If the phylogenetic tests confirm sample-specific clustering of cells, then the scRNA-seq data contains sufficient phylogenetic signal.
Sample preparation and scRNA sequencing
MDA-MB-231-LM2 (GFP+) (Minn et al. 2005) cells were injected into the R4 mammary fat pad of Nu/J mice (250,000 cells per mouse, 3 mice), and tumor growth was monitored for 8 weeks. Mice were euthanized when tumor size approached the endpoint (2 cm). Tumors were resected and dissociated into single cells. To extract circulating tumor cells (CTC), up to 1 ml of blood was drawn immediately post euthanasia using cardiac puncture. Red blood cells were removed using RBC lysis buffer. All cells (tumor derived and circulating tumor cells) were stained with DAPI and sorted for DAPI and GFP using a BD FACSAria cell sorter. Libraries were generated using the 10x Chromium single cell gene expression system immediately after cell sorting, and sequenced on an Illumina NextSeq platform together to eliminate batch effect.
Mapping and expression analysis
Reads were mapped with the Cellranger v5.0 software to the GRCh38 v15 from the Genome Reference Consortium using the analysis-ready assembly without alternative locus scaffolds (no_alt_analysis_set) and associated GTF annotation file.
The Cellranger software performs mapping, demultiplexing, cell detection, and gene quantification for the 10x Genomics scRNA-seq data.
Postprocessing expression data
Standardizing expression values
The filtered feature-barcode expression values from Cellranger were processed using the R Seurat v3.2.0 package (Stuart et al. 2019) according to Seurat’s standard pre-processing workflow. However, low-quality cells, such as cells with a small number of unique reads or a small number of represented genes, were not removed at this stage and no normalization was performed. The expression values for each gene were centered (µ = 0) and rescaled (σ² = 1).
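The centering and rescaling step can be illustrated with a minimal Python sketch. This is not the Seurat or phyloRNA code: the function name `standardize_genes` is hypothetical, the sample standard deviation (`ddof=1`) is an assumption, and unknown values are represented as NaN and skipped, in line with the treatment of unexpressed genes described in the Results.

```python
import numpy as np

def standardize_genes(counts):
    """Center each gene (row) to mean 0 and rescale it to variance 1.

    Hypothetical helper: NaN marks an unknown value and is ignored when
    the per-gene mean and standard deviation are computed.
    """
    x = np.asarray(counts, dtype=float)
    mean = np.nanmean(x, axis=1, keepdims=True)
    # Assumption: sample standard deviation (n - 1 in the denominator).
    sd = np.nanstd(x, axis=1, ddof=1, keepdims=True)
    # Genes without variation cannot be rescaled; mark them as unknown.
    sd[~(sd > 0)] = np.nan
    return (x - mean) / sd
```

Each row of the returned matrix has mean 0 and variance 1 over its known values, and unknown entries stay unknown.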
Discretizing expression values
The rescaled expression values were then categorized into a 5 level ordinal scale ranging from 1 (low level of expression) to 5 (high level of expression). The five-level scale was chosen to capture the data distribution of the rescaled expression values and represent a compromise between introducing data noise with too many levels or artificial similarity with only a few categories.
Interval ranges, according to which the values were categorized, were chosen according to the 60% and 90% High Density Intervals (HDI), the shortest intervals containing 60% or 90% of values, respectively. The values inside the 60% HDI were categorized as normal, values inside the 90% HDI but outside the 60% HDI as increased/decreased expression, and values outside the 90% HDI as an extremely increased/decreased expression.
Genes with the same categorized value in every cell were removed as phylogenetically irrelevant and the discretized values were then transformed into a fasta format.
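The five-level categorization can be sketched as follows. This is an illustration under stated assumptions, not the phyloRNA implementation: the function names `hdi` and `discretize` are hypothetical, the HDIs are estimated empirically from all known values pooled together (the text does not specify whether they are computed per gene or globally), and the 60% interval is assumed to lie inside the 90% interval.

```python
import numpy as np

def hdi(values, mass):
    """Empirical HDI: the shortest interval holding a fraction `mass` of values."""
    x = np.asarray(values, dtype=float).ravel()
    x = np.sort(x[~np.isnan(x)])
    # Number of points the interval must contain (guard against rounding).
    k = max(int(np.ceil(mass * x.size - 1e-9)), 1)
    widths = x[k - 1:] - x[:x.size - k + 1]
    start = int(np.argmin(widths))
    return x[start], x[start + k - 1]

def discretize(z, inner=0.6, outer=0.9):
    """Map rescaled values to levels 1 (extremely decreased) to 5 (extremely increased)."""
    z = np.asarray(z, dtype=float)
    lo_in, hi_in = hdi(z, inner)
    lo_out, hi_out = hdi(z, outer)
    levels = np.full(z.shape, 3.0)   # inside the inner HDI: normal
    levels[z < lo_in] = 2            # decreased
    levels[z > hi_in] = 4            # increased
    levels[z < lo_out] = 1           # extremely decreased
    levels[z > hi_out] = 5           # extremely increased
    levels[np.isnan(z)] = np.nan     # unknown values stay unknown
    return levels
```

Values lying exactly on an interval boundary are counted as inside that interval.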
SNV
Pre-processing reads for SNV detection
The BAM file from Cellranger was processed using the Broad Institute’s Genome Analysis ToolKit (GATK) v4.1.7.0 (Poplin et al. 2018) according to GATK best practices of somatic short variant discovery.
SNV detection and filtering
To obtain SNVs for individual cells of the scRNA-seq data, a list of SNVs was first obtained by running Mutect2 (Benjamin et al. 2019), treating the data set as a pseudo-bulk sample, and retaining only the SNVs that passed all filters.
SNVs for individual cells were then obtained by individually summarizing reads belonging to each single cell at the positions of the SNVs obtained beforehand using the pysam library, which is built on htslib (Li et al. 2009). The most common base for every cell and every position was retained; base heterogeneity and CNVs were ignored. This SNV table was then transformed into a fasta format.
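The per-cell summary can be sketched in plain Python. The example assumes the read bases have already been extracted (for instance with pysam) into `(barcode, position, base)` tuples. The function names, the input format, and the use of `N` for positions without reads are illustrative assumptions rather than the pipeline's actual code.

```python
from collections import Counter, defaultdict

def most_common_bases(observations, positions, barcodes, unknown="N"):
    """Build one sequence per cell from the most common base at each SNV position.

    observations: iterable of (barcode, position, base), one per read base.
    Ties are resolved in favour of the base seen first (Counter ordering).
    """
    counts = defaultdict(Counter)
    for barcode, position, base in observations:
        counts[(barcode, position)][base] += 1
    table = {}
    for barcode in barcodes:
        row = []
        for position in positions:
            seen = counts.get((barcode, position))
            # Positions without any read are recorded as unknown.
            row.append(seen.most_common(1)[0][0] if seen else unknown)
        table[barcode] = "".join(row)
    return table

def to_fasta(table):
    """Serialize the cell-by-position table as fasta text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in table.items())
```

Keeping only the majority base discards within-cell heterogeneity, which matches the simplification stated above.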
Phylogenetic analysis
To reconstruct phylogenetic trees from the categorized expression values and identified SNVs, we used IQ-TREE v2.0.3 (Minh et al. 2020) and BEAST2 v2.6.3 (Bouckaert et al. 2019).
For the expression data, IQ-TREE was used with an ordinal model and an ascertainment bias correction (-m ORDINAL+ASC). For the SNV values, we used a generalized time reversible model (GTR) (Tavaré 1986) as the most complex substitution model present in both IQ-TREE and BEAST2. Tree support was evaluated using the standard non-parametric bootstrap (Felsenstein 1985) with 100 replicates (-b 100).
The BEAST2 analysis was performed with a coalescent tree prior (Kingman 1982) and an exponential population growth (Kuhner et al. 1998) as these models most closely mimic the biological conditions of tumor growth. The substitution model of choice for the categorized expression values was the ordinal model available in the Morph-Models package and the GTR model for the SNVs as described above.
Phylogenetic clustering tests
To test if the phylogenetic methods were able to recover the expected population history, we employ Mean Pairwise Distance (MPD) (Webb 2000) and Mean Nearest Taxon Distance (MNTD) (Webb 2000) to investigate the relationship between cells from each sample. MPD is calculated as a mean distance between each pair of taxa from the same group, while MNTD is calculated as a mean distance to the nearest taxon from the same group.
For each sample, and for samples isolated from a single individual, MPD and MNTD are calculated and compared to a null distribution obtained by permuting sample labels on a tree and calculating MPD and MNTD for these permutations. The p-value is then calculated as a rank of the observed MPD/MNTD in the null distribution normalized by the number of permutations. The MPD and MNTD are calculated using the ses.mpd and ses.mntd functions implemented in the package picante (Kembel et al. 2010). For the Bayesian phylogenies, MPD and MNTD were calculated for a sample of 1000 trees from the posterior distribution and then summarized with mean and 95% credible interval.
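The logic of the two statistics and of the label-permutation test can be sketched in Python on a matrix of pairwise tip-to-tip distances. This is a simplified illustration, not the picante implementation: the function names are hypothetical, the distance matrix is assumed to be precomputed from the tree, and tied values are counted conservatively.

```python
import numpy as np

def mpd(dist, members):
    """Mean pairwise distance among the tips listed in `members`."""
    sub = np.asarray(dist, dtype=float)[np.ix_(members, members)]
    upper = np.triu_indices(len(members), k=1)
    return sub[upper].mean()

def mntd(dist, members):
    """Mean distance from each tip in `members` to its nearest tip in `members`."""
    sub = np.asarray(dist, dtype=float)[np.ix_(members, members)]
    np.fill_diagonal(sub, np.inf)  # a tip is not its own neighbour
    return sub.min(axis=1).mean()

def permutation_pvalue(dist, labels, group, statistic, n_perm=999, seed=0):
    """Rank of the observed statistic among label permutations, as a p-value.

    Small values indicate that tips of `group` are closer together than
    expected when sample labels are shuffled across the tree.
    """
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    observed = statistic(dist, np.flatnonzero(labels == group))
    null = np.array([
        statistic(dist, np.flatnonzero(rng.permutation(labels) == group))
        for _ in range(n_perm)
    ])
    # Assumption: ties count against the observed value (conservative rank).
    rank = 1 + np.sum(null <= observed)
    return rank / (n_perm + 1)
```

Because MNTD depends only on nearest neighbours, a group split into several tight pairs can score as low as a single compact clade, so MNTD reflects clustering near the tips while MPD reflects clustering across the whole tree.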
Code and data availability
Code required to replicate the data processing steps is available at https://github.com/bioDS/phyloRNAanalysis.
To aid in creating pipelines for phylogenetic analysis of scRNA-seq data, we have integrated a number of common tools in the R phyloRNA package, which is available at https://github.com/bioDS/phyloRNA.
All data will become available in the NCBI GEO under the accession number GSE163210 upon acceptance of this paper by a peer-reviewed journal.
Results
Phylogenetic expectation derived from experimental design
To test if scRNA-seq contains sufficient phylogenetic information to reconstruct a population history of cancer, immunosuppressed mice were injected into the mammary fat pad with human breast cancer cells. The tumors that develop are derived from the same population and thus share a common ancestor, but evolved independently in each mouse and should form separate clades on reconstructed phylogenetic trees when analysed together. As each tumor was seeded by a population of cancer cells, a number of smaller sample-specific clades representing subclonal diversity of the population sample should be observed. We would thus expect clustering of each tumor and CTC sample as well as clustering of tumor and CTC samples isolated from a single individual. Due to the lack of a specialized scRNA-seq caller or error model to account for the uncertainty in the data, some intermixing is possible, but heavy intermixing would demonstrate an insufficiency of scRNA-seq for phylogenetic analyses.
Sample overview
In total, five samples were used in this analysis, three tumor samples (T1, T2, T3) and two CTC samples (CTC1, CTC2). The number of cells isolated from the CTC3 sample was too small for scRNA sequencing and the sample was removed from the study. The number of detected cells in the tumor samples was generally smaller than in the CTC samples, but the reverse was true for the total number of detected unique molecular identifiers (UMIs) – the number of unique mRNA transcripts (see Table 1). In the T2 sample, a large number of cells but a small number of UMIs were detected in a similar pattern to the CTC samples.
Compared to the fluorescence-activated cell sorting (FACS), Cellranger detected fewer cells for tumor samples, but more cells for the CTC samples. Cellranger classifies barcodes as cells based on the number of UMIs detected to distinguish real cells from background noise (Lun et al. 2019). The large number of detected cells in the CTC samples is likely a result of lysed cells or cell-free RNA (Fleming et al. 2019). In all cases, the number of detected genes across data sets was relatively low, with the best sample T3 amounting to about 3% of expressed genes when summarized across all cells.
Recording unexpressed genes as unknown data
The amount of coverage in a standard bulk RNA-seq expression analysis is usually sufficient to conclude that genes for which no molecule was detected are not expressed (Lähnemann et al. 2020). In scRNA-seq, however, the sequencing coverage is very small and the drop-out effect is likely, so this assumption does not hold. This is especially a problem for non-UMI based technologies (Cao et al. 2021), but it is not entirely absent from the UMI-based technologies either, due to biological and technological processes (Townes et al. 2020; Hsiao et al. 2020).
According to the standard expression pipeline, these values are commonly treated as biological zeros, i.e., no detected expression of a particular gene, and have a significant impact on the data distribution during the normalization and rescaling steps (Hicks et al. 2018; Townes et al. 2020). Without an explicit model of drop out effect to account for technical or biological variation, these values are more accurately represented as unknown values rather than true biological zeros (Van den Berge et al. 2018). We have modified the Seurat code to treat these values as unknown values (NA in R) and included modified functions in the phyloRNA package.
We will further use data density to describe the proportion of known values in both expression and SNV data sets, with 100% representing a data set without unknown values. For example, the T3 sample has 1% data density after recoding zeros as unknown values.
SNV identification and data density
To identify SNVs in scRNA-seq data, we first identified a list of SNVs by treating the single-cell reads as a pseudo-bulk sample. A total of 120,310 SNVs that passed all quality filters were identified this way. When these SNVs were called for each individual cell, the resulting data set had a data density of less than 0.09%. The expression data is expected to have higher data density than the SNV data because, for expression quantification, the presence or absence of a molecule is sufficient, while for SNVs, knowledge of each position is required. This expectation is confirmed in Table 1, where data density of the expression data is summarized. About 16% of the 10,587 cells represented in this data set did not contain any known SNV (data density 0%); these were relatively equally distributed among the T2 (651), CTC1 (571) and CTC2 (475) samples. This represents a challenge from a data analysis perspective given the large sample size and its small data density.
Finding the most well-represented subset of data
When treating the potentially unexpressed genes as unknown values, only a small proportion of the expression count values was known, with the data set derived from SNV suffering from the same problem due to the low number of reads for each cell.
While model-based phylogenetic methods can process missing data by treating the missing data as phylogenetically neutral, this significantly flattens the likelihood space, which can cause artefacts, convergence problems or increase computational time (Wiens 2006; Jiang et al. 2014; Xi et al. 2016). It is however not only the proportion of the unknown values, which is about 99% for both data sets, but also the sheer size of the data set that is problematic. With over 10,000 cells and more than 54,000 genes or over 120,000 SNVs, the unfiltered data set would require substantial computational resources.
In comparison, for the published phylogenetic tools designed for single-cell DNA, data sets ranged from 47 cells and 40 SNVs (Jahn et al. 2016) to 370 cells and 50 SNVs (Singer et al. 2018) or in an extreme case 18 cells and 50,000 SNVs (Singer et al. 2018) with at most 58% of missing data across these data sets.
To help alleviate these issues, we have employed a stepwise filtering algorithm to find the densest subset of a data set. By iteratively cutting out cells and genes/SNVs with the smallest number of known values, we increase the data density until a local maximum or desired density is reached. This is equivalent to the gene/cell quality filtering during the scRNA-seq post-processing pipeline, such as the Seurat’s standard pre-processing workflow described in the methodology section. The advantage of this method is that a desired density can be reached with the least amount of data removed.
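The stepwise filtering can be sketched as a greedy loop. This is a minimal illustration rather than the phyloRNA implementation: the function name `densest_subset` is hypothetical, unknown values are encoded as NaN, and at every step the single row or column with the lowest fraction of known values is removed (rows first on ties), which is an assumption about details the text leaves open.

```python
import numpy as np

def densest_subset(data, target):
    """Greedily drop sparse rows/columns until `target` data density is reached.

    Returns the indices of retained rows (cells) and columns (genes or SNVs).
    """
    known = ~np.isnan(np.asarray(data, dtype=float))
    rows = np.arange(known.shape[0])
    cols = np.arange(known.shape[1])
    while rows.size > 1 and cols.size > 1:
        sub = known[np.ix_(rows, cols)]
        total = sub.sum()
        if total / sub.size >= target:
            break  # desired density reached
        row_counts = sub.sum(axis=1)
        col_counts = sub.sum(axis=0)
        # Local maximum: every row and column is equally dense,
        # so no single removal can increase the density.
        if (row_counts.min() * rows.size == total
                and col_counts.min() * cols.size == total):
            break
        if row_counts.min() / cols.size <= col_counts.min() / rows.size:
            rows = np.delete(rows, row_counts.argmin())
        else:
            cols = np.delete(cols, col_counts.argmin())
    return rows, cols
```

Removing the sparsest row or column never lowers the density, because its share of known values is at most the overall average; this is why the loop moves monotonically toward the target.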
To test the effect of unknown data, the categorized expression values were filtered using the above method to get categorized expression data sets containing 20%, 50%, and 90% of known values.
Filtering scRNA-seq expression and SNV data sets
The expression data set filtered to 20% density contained 1,627 cells and 16,187 genes. Cells were mainly represented by T1 and T3 samples, which form over 92% of the data set. In contrast, other samples (T2, CTC1, CTC2) were significantly underrepresented despite their larger number of cells in the unfiltered data set. In filtering to 50% density, the numbers decreased to 1,454 cells and 1,634 genes. The sample diversity also decreased, with T2 dropping out entirely. When filtered to 90% density, the data set was reduced to 593 cells and 528 genes. The data diversity further decreased to T1, T3 and CTC2, although the CTC2 sample was reduced to 2 cells.
When the SNV data set was filtered to 20% density, the numbers decreased significantly – to 1,498 cells and 1,297 SNVs. The sample composition was similar to the expression data set, with the vast majority of retained cells belonging to the T1 and T3 samples. However, the T2 sample was already missing. In subsequent filtering to both 50% and 90% data density, the CTC1 sample disappeared as well.
Despite the difference in starting dimensions between the expression and SNV data sets, the dimensions of the data sets filtered to the same data density are similar, with the expression data having higher density than SNVs. This suggests that in both cases, known values are concentrated in the same cell subsets and the filtering algorithm can successfully localize them.
Maximum likelihood phylogeny from expression data
We inferred Maximum likelihood trees of the expression data filtered to 20%, 50%, and 90% data density (Supplementary Figure 1). In the reconstructed phylogeny at the 20% density filtering (Figure 1), individual tumor samples did not form three separate clades, but a large number of smaller clades. These clades are distributed along the central spine of the unrooted maximum likelihood tree and have little internal structure. The T2, CTC1, and CTC2 samples form relatively compact clades, while the more represented T1 and T3 clades are generally intermixed. The MPD and MNTD tests confirm this (Table 2), with T2, CTC1, and CTC2 showing significant clustering signal.
When the data is filtered to 50% and 90% data density (Supplementary Figure 1), the intermixing between T1 and T3 is reduced, with many clades showing a sample-specific pattern in their internal structure, and statistically significant MPD and MNTD tests support this for the T3, but not the T1 sample. As data are filtered, CTC1 still forms a compact cluster, but the CTC2 cluster disappears as the cells are being removed. The phylogenetic position of the remaining CTC2 cells is however stable as sister cells to the T1 sample. In contrast to the data set with 20% data density, the clustering tests for T2, CTC1 and CTC2 are no longer significant. A possible explanation for this might be the small number of cells remaining for these samples when the data set is further filtered.
Bootstrap branch support scores are estimates of topology uncertainty for each branch of a tree, with 0 signifying no support and 100 signifying high support. Given a large number of branches in our trees, displaying them directly on a tree is challenging. For this reason, we show a mean and a standard deviation of bootstrap scores for each reconstructed Maximum likelihood tree (Table 3).
Overall, the tree support was very weak, with only 6% of branches being statistically supported (bootstrap > 70). These were usually branches close to the tips of the tree.
Maximum likelihood phylogeny from the identified SNVs
Similarly to the phylogeny reconstructed from the expression data, the tree reconstructed from the SNV data (Figure 2) consisted of a large number of clades. In contrast to the expression phylogeny, the clades in the SNV trees showed significantly more internal structure.
While T1 and T3 samples were still intermixed, sample-specific clades are present and this clustering tendency was confirmed by phylogenetic clustering tests (Table 4). The clustering pattern for T1 is different from the clustering pattern of T3 as only MNTD was significant for T1, and only MPD was significant for T3. The CTC1 cells formed a compact clade but placed on a very long branch, suggesting a long independent evolution of the CTC1 cells. CTC2 samples did not form a clade but were placed as sister cells to the T1 sample. The general structure of the phylogeny did not change when the data was further filtered to 50% and 90% data density (Supplementary Figure 2), except for the clustering pattern of T1, which was no longer significant.
The topology of the SNV tree had very low bootstrap support. In fact, 85% of branches were not found in any of the 100 bootstrap trees and thus had a zero bootstrap score. Despite this, when explored with the MPD and MNTD, the bootstrap trees show a similar pattern of phylogenetic clustering as the Maximum likelihood tree (Supplementary Table 1). Low bootstrap support in both expression and SNV Maximum likelihood analyses can be explained by a combination of weak phylogenetic signal and amplification errors, that is, bootstrap replicates simply oversample noise and undersample signal. This shows the importance of phylogenetic analyses beyond point estimates (when only one tree is reported). Indeed, as we have seen, the Maximum likelihood approach accompanied by the clustering analysis forms a powerful tool that identifies major evolutionary trends, despite the maximum likelihood tree being very uncertain.
Alternative filtering approach
The filtering used above was designed to remove the least amount of data possible. While this is generally a positive behavior, it might be problematic if the amount of missing data varies between individual samples as we are dependent on the presence of all samples for the assessment of the performance of the phylogenetic reconstruction.
To ameliorate this, we employ an alternative filtering strategy and select cells with the least amount of missing data as the best representation of each sample. In this reduced data set, genes that were not present in any of the cells, or were present only in a single cell, are removed. The data set is then filtered to a desired data density using the method described above, but to retain the same sample size, only genes were removed this way. Given the smaller size, the full data set and the data sets filtered to 50% and 90% data density were then analyzed using Maximum likelihood and Bayesian methods to further explore the topological uncertainty.
A total of 58 cells were retained: 20 cells for T1 and T3 samples and six cells for T2, CTC1, and CTC2 samples.
The full expression data set contained 30% of known data distributed across 7,520 genes. This was further reduced to 3,261 and 219 genes when filtered to 50% and 90% data density.
The full SNV data set contained 12% of known data distributed across 3,980 SNVs. When further filtered, this decreased to 433 SNVs at 50% data density and to 29 SNVs at 90% data density.
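The cell selection step of this alternative strategy can be sketched as follows. The function name `select_best_cells`, the cells-by-genes NaN-coded input, and the tie-breaking by original cell order are illustrative assumptions. The per-sample counts would be supplied by the caller, for example 20 for T1 and T3 and six for the remaining samples, as reported above.

```python
import numpy as np

def select_best_cells(data, samples, n_per_sample):
    """Keep the most complete cells of each sample, then drop uninformative genes.

    data: cells x genes array with NaN for unknown values.
    samples: sample label for every cell.
    n_per_sample: mapping from sample label to the number of cells to keep.
    """
    known = ~np.isnan(np.asarray(data, dtype=float))
    samples = np.asarray(samples)
    per_cell = known.sum(axis=1)
    keep = []
    for sample, n in n_per_sample.items():
        idx = np.flatnonzero(samples == sample)
        # Cells with the most known values first; stable sort keeps input order on ties.
        order = idx[np.argsort(-per_cell[idx], kind="stable")]
        keep.extend(order[:n].tolist())
    cells = np.array(sorted(keep), dtype=int)
    # Genes known in no cell or in a single cell carry no comparative information.
    genes = np.flatnonzero(known[cells].sum(axis=0) > 1)
    return cells, genes
```

The retained genes can then be passed through the density filtering described earlier while keeping the set of cells fixed.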
Phylogenetic reconstruction from expression data
The Maximum likelihood tree reconstructed from the reduced expression data set showed significant clustering of all samples (Figure 3). This is confirmed by the phylogenetic clustering tests where all but CTC2 cells had a significant MPD p-value (Table 5).
Four out of six CTC2 cells clustered together, but on the opposite side of the tree with phylogenetic proximity to the T1 cells. This close phylogenetic relationship suggests that T1 and CTC2 were isolated from a single individual. This pattern is further reinforced as T2 cells clustered in a single compact clade with phylogenetic proximity to the CTC1 sample. A similar pattern has been observed already in the density filtering method used above, although this was complicated as a majority of cells that did not belong to T1 or T3 samples were filtered out due to their low data density. Given the strong signal in the data, the simplest explanation is that samples were mislabeled and T2 and CTC1 come from the same individual (although this explanation cannot be ruled out, we were unable to isolate an event in the experiment when the mislabeling could have happened). When this relationship was tested with phylogenetic clustering methods, both MPD and MNTD confirmed the strong clustering signal between T2 and CTC1. The same tests were not significant for the T1-CTC2 grouping, likely due to the presence of two non-clustering CTC2 cells. This phylogenetic structure remains stable for subsequent filtering to 50% and 90% data density (Supplementary Figure 3).
The phylogenies reconstructed from the same data using the Bayesian inference show a similar pattern of clustering (Figure 4, Table 6), although neither CTC1 nor CTC2 formed a compact cluster. According to the MNTD, only a single relationship is significant, compared to four for the MPD. However, the MNTD’s 95% credible intervals are wider compared to the MPD’s. For example, the MNTD’s credible interval for the T1 clustering covers almost the whole range of possible p-values (0.001 0.931). This is likely caused by the higher sensitivity of MNTD to a clustering pattern closer to the tips of the tree and their unstable position in the posterior tree sample. With posterior clade support for major bifurcations being weak, only the phylogenetic relationship between T2 and CTC1 was well supported across expression trees build from different data density (Supplementary Table 2a).
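To make the two clustering statistics concrete, the following sketch computes the mean pairwise distance (MPD) and the mean nearest taxon distance (MNTD) for a group of tips and derives a one-sided permutation p-value. It assumes that a matrix of tip-to-tip distances has already been extracted from the tree, and the null model (random draws of equally sized groups of tips, 999 permutations) is an assumption for illustration rather than the exact test used in this work.

```python
import random
from itertools import combinations


def mpd(dist, group):
    """Mean pairwise distance between all pairs of taxa in `group`."""
    pairs = list(combinations(group, 2))
    return sum(dist[a][b] for a, b in pairs) / len(pairs)


def mntd(dist, group):
    """Mean distance from each taxon in `group` to its nearest group member."""
    return sum(
        min(dist[a][b] for b in group if b != a) for a in group
    ) / len(group)


def clustering_p_value(dist, group, statistic, n_perm=999, seed=1):
    """One-sided permutation p-value; small values indicate clustering."""
    rng = random.Random(seed)
    taxa = list(range(len(dist)))
    observed = statistic(dist, group)
    # Count random groups that are at least as tightly clustered as observed.
    hits = sum(
        statistic(dist, rng.sample(taxa, len(group))) <= observed
        for _ in range(n_perm)
    )
    return (hits + 1) / (n_perm + 1)


# Toy distances: nine tips placed on a line in three tight groups.
positions = [0, 1, 2, 10, 11, 12, 20, 21, 22]
dist = [[abs(a - b) for b in positions] for a in positions]
p_clustered = clustering_p_value(dist, [0, 1, 2], mpd)
p_dispersed = clustering_p_value(dist, [0, 3, 6], mpd)
```

Because MNTD depends only on each tip's nearest neighbour, moving one or two tips changes it more than it changes MPD, which averages over all pairs. This is in line with the wider intervals reported for MNTD.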
Phylogenetic reconstruction from the SNV data
The Maximum likelihood tree reconstructed from the reduced SNV data set (Figure 5) displayed many similar patterns to the previously analyzed SNV tree (Figure 2). As T2 cells were not filtered, they are now placed together with the CTC1 cells on a long branch suggesting a long shared evolutionary history. Alternatively, this could be a methodological artifact called Long Branch Attraction (Felsenstein 1978; Huelsenbeck 1997), where unrelated taxa with a large amount of accumulated changes are grouped into a single clade, although given the evidence from the expression data, this is unlikely. The phylogenetic clustering tests confirm the grouping of all samples except the CTC1 sample, which is dispersed around the tree. Clustering of samples from a single individual could not be confirmed. Instead, the alternative hypothesis about possible mislabeling between CTC1 and CTC2 samples is supported, with a strong signal for clustering of T1 with CTC2, and T2 with CTC1.
A similar, but substantially weaker, pattern of sample clustering can be observed on the Bayesian phylogeny reconstructed from the same data (Figure 6). T2 and CTC1 do not form well-supported clusters (Table 8), but there is strong support for a combined T2-CTC1 cluster. Additionally, while many clustering patterns are still supported by the MPD, they are not confirmed by the MNTD. However, the 95% intervals are wide and contain a number of trees that do support the respective clustering. This is likely a case of the higher sensitivity of MNTD to one or two taxa that break the pattern, especially when the number of taxa is low.
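One way to summarize a clustering test across a posterior sample of trees is sketched below. It assumes that a p-value has already been computed for every tree in the sample, and it uses an equal-tailed 95% interval from `statistics.quantiles`. How the intervals reported here were actually constructed is not restated in this section, so the interval type and the function name `summarize_p_values` are assumptions.

```python
from statistics import median, quantiles


def summarize_p_values(p_values, alpha=0.05):
    """Summarize per-tree p-values from a posterior tree sample."""
    # n=40 yields 39 cut points in 2.5% steps; the first and last
    # bound the central 95% of the values.
    cuts = quantiles(p_values, n=40)
    return {
        "median": median(p_values),
        "interval": (cuts[0], cuts[-1]),
        "share_significant": sum(p < alpha for p in p_values) / len(p_values),
    }


# Hypothetical posterior sample: 30 trees support clustering, 70 do not.
example = [0.001] * 30 + [0.5] * 70
summary = summarize_p_values(example)
```

A wide interval together with a non-trivial share of significant trees corresponds to the situation described above, where the summary is not significant but part of the posterior sample still supports the clustering.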
Biological zero or unknown value
To test whether zero expression values should be treated as unknown data rather than biological zeros, i.e., no expression of a particular gene, we reconstructed the phylogenies from the scRNA-seq expression data by treating the zeros in the dataset as biological zeros. Data were processed as per the standard methodology to get the alignments, but instead of treating the zeros as an unknown position, they were treated as a category 0 in addition to the five-level ordinal scale. Phylogenies were then reconstructed using both Maximum Likelihood and Bayesian methods, with sample clustering explored using the phylogenetic clustering tests.
For the full dataset, the pattern of clustering calculated on the Maximum Likelihood trees reconstructed from the real-zero data was similar to, if not stronger than, the pattern obtained when treating zeros as unknown data (Supplementary Table 3a). When the phylogenies were reconstructed from the reduced sample of 58 cells, the clustering pattern was different, with the clustering of T1 and T3 no longer supported and the clustering of CTC2 supported instead (Supplementary Table 3b). The clustering patterns changed when the dataset was further filtered, i.e., when the number of zeros was reduced. This suggests that the clustering is driven by the similarity of non-expression rather than by the expression levels themselves. However, the clustering calculated on the phylogenies reconstructed using the Bayesian method does not support this change of pattern. While the clustering pattern is markedly different when zeros are treated as biological zeros rather than unknown data (Supplementary Table 3c), the clustering pattern does not change when the data are further filtered.
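The difference between the two codings can be shown with a small sketch. It assumes that expression has already been discretized into levels 1 to 5, that 0 marks a gene with no observed expression, and that unknown positions are written with the gap character `-`. The function name `encode_cell` and the character choices are illustrative assumptions.

```python
UNKNOWN = "-"  # assumed symbol for an unknown alignment position


def encode_cell(levels, zero_as_unknown=True):
    """Encode one cell's discretized expression levels as an alignment row.

    `levels` holds integers 0..5: 0 means no observed expression and
    1..5 are the ordinal expression categories.
    """
    row = []
    for level in levels:
        if not 0 <= level <= 5:
            raise ValueError(f"unexpected level: {level}")
        if level == 0 and zero_as_unknown:
            row.append(UNKNOWN)      # zero treated as unknown data
        else:
            row.append(str(level))   # zero kept as its own category "0"
    return "".join(row)


unknown_coding = encode_cell([0, 3, 5, 0, 1], zero_as_unknown=True)
real_zero_coding = encode_cell([0, 3, 5, 0, 1], zero_as_unknown=False)
```

Under the first coding the two zero positions carry no information for the likelihood, while under the second they become a sixth state that cells can share.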
These results do not provide a conclusive answer on which assumption should be preferred. Assuming all zeros to be biological zeros will bias the model as many of those might be technical zeros instead. At the same time, the pattern of expression and non-expression seems to carry information. This information is lost when all zeros are assumed to be technical zeros and thus unknown data.
Application to other datasets
We have reconstructed the phylogenetic relationships of cancer cells from the expression data of two additional scRNA-seq datasets: a UMI-based dataset of small intestinal neuroendocrine cancer (Rao et al. 2020a) and a non-UMI based dataset of gastric cancer (Wang et al. 2021).
Intestinal neuroendocrine cancer
The small intestinal neuroendocrine cancer dataset from Rao et al. (2020a) consisted of a primary tumor and a paired liver metastatic sample. Both samples contained a mixture of cancerous and non-cancerous cells (fibroblasts, endothelial cells and immune cells). The expression values for both samples from Rao et al. (2020a) were processed as per the methodology section, with zeros recoded as unknown data. To obtain the SNVs, the raw reads were mapped using Cellranger v5 and processed as per the methodology section. Cells were labelled according to their sample of origin (primary tumor and metastasis) and their cell type, which was determined by replicating the analysis from Rao et al. (2020a). Two subsets for both data types were then derived, a subset with all cell types and a subset with only cancer cells. To reduce the computational burden, 1000 cells with the least amount of missing data were selected, 500 from the primary tumor and 500 from the metastatic sample. To derive subsets from the SNVs, the cells from the expression subsets were used. However, not all cells found in the expression subsets were found in the SNV data. This is likely due to a different version of the Cellranger software used in this work compared to Rao et al. (2020a). Maximum likelihood trees were then reconstructed and the relationships between cells of the same type and sample of origin were tested using the phylogenetic clustering test. In both derived subsets from the expression data, metastatic cells showed a strong clustering tendency (p = 0.001) into several large clades (Figure 7).
This suggests a strong phylogenetic relationship with several well-preserved lineages. In addition, in the derived subset containing all cell types, the cancer cells showed a significant clustering (MPD p = 0.016 and MNTD p = 0.001), while other cell types showed the opposite tendency (Table 5). This is consistent with our expectation as only the cancer cells evolve through the process of clonal evolution that is assumed by the phylogenetic model. A similar albeit significantly weaker pattern of cancer cell clustering can be observed on the trees derived from the SNV data (Figure 7, Table 5). In both subsets derived from the SNV data, the primary and metastatic cells clustered together, but in the subset with all cells, cancer cells no longer formed a compact cluster.
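The cell subsetting used for this dataset (the 500 cells with the least missing data from each sample) can be sketched as follows. The input format, a list of `(cell_id, sample_label, n_missing)` tuples, and the function name `select_densest_cells` are assumptions for illustration.

```python
def select_densest_cells(cells, per_sample=500):
    """Pick, within each sample, the cells with the fewest missing values.

    `cells` is a list of (cell_id, sample_label, n_missing) tuples.
    Returns a dict mapping each sample label to the selected cell ids.
    """
    by_sample = {}
    for cell_id, sample, n_missing in cells:
        by_sample.setdefault(sample, []).append((n_missing, cell_id))
    selected = {}
    for sample, entries in by_sample.items():
        entries.sort()  # fewest missing values first; ties broken by cell id
        selected[sample] = [cell_id for _, cell_id in entries[:per_sample]]
    return selected


# Toy example with two cells kept per sample.
toy_cells = [
    ("p1", "primary", 40), ("p2", "primary", 10), ("p3", "primary", 25),
    ("m1", "metastasis", 5), ("m2", "metastasis", 50), ("m3", "metastasis", 7),
]
subset = select_densest_cells(toy_cells, per_sample=2)
```

Selecting per sample rather than across the pooled cells keeps the two samples balanced even when one of them has systematically lower coverage.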
Gastric cancer
The gastric cancer dataset from Wang et al. (2021) consisted of 94 cells from a primary tumor and a lymph node of three patients (GC1, GC2 and GC3). We would expect that for each patient, the lymph node cells would form a monophyletic lineage derived from a primary tumour cell, but due to the small number of cells, clustering of the primary tumour cells is also interpreted as a success.
The expression values were split into patient-specific datasets and analysed separately as per the methodology section, and the discretized expression values were analysed using the Maximum Likelihood and the Bayesian phylogenetic methods. To obtain the SNV values, the raw reads were mapped using STAR v2.7.9a (Dobin et al. 2013) and the mapped reads were then processed as per the methodology section. The clustering of primary and lymph node cells was then explored using the phylogenetic clustering test. For the expression data, only a single patient showed significant clustering of lymph node cells (Figure 9).
In the datasets derived from the SNV data, the phylogenetic signal was stronger and the primary tumour cells clustered in two patients (Figure 9). Poor separation of primary and lymph node cells based on expression levels was pointed out in the original study (Wang et al. 2021). Additionally, non-UMI based methods suffer from an increased error rate through zero-count inflation (Cao et al. 2021) and amplification variability (Townes et al. 2020). In the absence of a strong phylogenetic signal shared by a large percentage of genes, this additional noise makes phylogenetic reconstruction difficult, if not impossible. At the same time, the typically higher coverage in non-UMI based sequencing compared to UMI-based sequencing improves the identification of SNVs and decreases the misspecification error. This might suggest that different strategies for phylogenetic reconstruction should be applied to UMI and non-UMI based sequencing.
Discussion
Phylogenetic methods using scDNA-seq data are becoming increasingly common in tumor evolution studies. scRNA-seq is currently used for studying expression profiles of cancer cells and their behavior. However, while clustering approaches to identify cells with similar expression profiles are common and frequently used, scRNA-seq data are yet to be used in phylogenetic analyses to reconstruct the population history of somatic cells. To test whether scRNA-seq data contain a phylogenetic signal sufficient to reliably reconstruct the population history of cancer, we have performed an experiment to produce a known history by infecting immunosuppressed mice with human cancer cells derived from the same population. Then, using two different forms of scRNA-seq data, expression values and SNVs, we have reconstructed phylogenies using Maximum likelihood and Bayesian phylogenetic methods. By comparing the reconstructed trees to the known population history, we have been able to confirm that scRNA-seq contains sufficient phylogenetic signal to reconstruct the population history of cancer. Without an explicit error model to account for the increased uncertainty in the data (Hicks et al. 2018), the phylogeny from the expression values describes the expected population history better than the one reconstructed from SNVs, while also requiring lower computational cost to reconstruct. This highlights that scRNA-seq can be utilized to explore both the physiological behavior of cancer cells and their population history using a single source of data.
Without any specialized phylogenetic or error models for the scRNA-seq data, conventional methods and software tools developed for systematic biology are able to reconstruct population history from this data at low computational cost. This implies that more accurate inference will be possible when and if specialized models and software are developed, and serious computational resources are employed. For example, computationally more intensive standard non-parametric bootstrap or Bayesian methods on the unfiltered data sets are certainly within the reach of modern computing clusters. This is a future direction for research.
In this work, we tested for phylogenetic signal on three data sets: a new data set consisting of five tumor samples seeded using a population sample, and two previously published data sets consisting of a primary tumor with a paired lymph node or metastatic sample. The nature of the experiment and the amount of uncertainty in the scRNA-seq data barred us from a more detailed exploration of the tree topology, as only broad patterns, namely the phylogenetic clustering of cells according to sample and individual of origin, could be considered. Our clustering analyses show that the phylogenetic trees conform broadly to the expected shapes under different experimental conditions, and thus that expression data and SNV data can both be used to infer phylogenetic trees from scRNA-seq data. Nevertheless, our results also demonstrate that all such trees contain significant uncertainty, so new datasets and methods will be required to extend this work.
The degree to which low and uneven gene expression plays a role in scRNA-seq requires special attention, especially for non-UMI based data sets, as this causes not only a large proportion of missing data, but also burdens the known values with a significant error rate. Research should aim at trying to quantify this expression-specific error rate and build specialized models to include the uncertainty about the observed data in the phylogenetic reconstruction itself. This could potentially include removing a large proportion of low-coverage data in favor of robust analysis and proper uncertainty estimation of the inferred topology.
The estimation of topological uncertainty, be it the bootstrap branch support or the Bayesian posterior clade probabilities, is a staple of phylogenetic analyses. Existing methods for the phylogenetic analysis of scDNA-seq, such as SCITE (Jahn et al. 2016), SiFit (Zafar et al. 2017), or SCIΦ (Singer et al. 2018), do not provide this uncertainty estimate. This makes interpretation of the estimated topology difficult because a single topology may be only marginally more accurate than a number of alternative topologies. Of the packages we are aware of, only CellPhy, through its integration in the phylogenetic software RAxML-NG (Kozlov et al. 2019), provides an estimate of topological uncertainty through the bootstrap method. Bayesian methods could be a solution as they provide an uncertainty estimate through the posterior distribution. However, they are significantly more computationally intensive than Maximum likelihood methods. Instead, as the size of single-cell data sets will only increase, bootstrap approximations optimized for a large amount of missing data need to be developed to provide a fast and accurate estimate of topological uncertainty.
An aspect of scRNA-seq expression data that was not considered here is correlated gene expression (Wang et al. 2004; Bageritz et al. 2019). A single somatic mutation could thus induce a change in the expression of multiple genes. This might be problematic given that phylogenetic methods assume that individual sites are independent, and such correlation would cause an overestimation of the mutation rate. However, phylogenetic methods are generally rather robust to a wide range of model violations (Huelsenbeck 1995a; Huelsenbeck 1995b; Song et al. 2010; Philippe et al. 2011). In addition, by randomly sampling sites, the bootstrap analysis does explore solutions that would arise from this model violation. An investigation of the effect of correlated gene expression on the estimated phylogeny provides an interesting direction for further research.
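For readers less familiar with the bootstrap, the site resampling mentioned above can be sketched as below. This is the generic non-parametric bootstrap of alignment columns and not a reproduction of any particular software; the alignment representation (a dict of equal-length strings) and the function name `bootstrap_alignment` are assumptions.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns (sites) with replacement.

    `alignment` maps cell names to equal-length sequences of states.
    The replicate has the same cells and the same number of sites.
    """
    n_sites = len(next(iter(alignment.values())))
    # Draw column indices with replacement; the same columns are used
    # for every cell so that each resampled site stays intact.
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {
        name: "".join(sequence[j] for j in columns)
        for name, sequence in alignment.items()
    }


toy_alignment = {"cell_a": "1-35", "cell_b": "12-5", "cell_c": "4435"}
replicate = bootstrap_alignment(toy_alignment, random.Random(42))
```

Each replicate would then be passed to the same tree inference as the original alignment, and branch support is the share of replicate trees that contain a given branch. Sites are drawn independently, so groups of correlated genes are included in some replicates and left out of others.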
Multiomic approaches are increasingly popular as they integrate information from multiple biological layers (Bock et al. 2016; Hasin et al. 2017; Nam et al. 2020). While CNVs were ignored in this paper, it is possible to detect large-scale CNVs from scRNA-seq data (Müller et al. 2018; Kuipers et al. 2020; Harmanci et al. 2020b; Harmanci et al. 2020a; Gao et al. 2021). Combined with the SNVs and expression data as analyzed in this paper, this enables a multiomic approach using just a single scRNA-seq data source, without the additional cost of DNA sequencing.
ACKNOWLEDGEMENT
We thank Dr. Jon Preall and the Genomics Technology Development Core (CSHL) for scRNA-seq library preparation, and Pamela Moody and the Flow Cytometry Facility (CSHL) for support with single-cell sorting. We acknowledge Suzanne Russo for technical assistance with animal experiments.
AG and JCM acknowledge support from the Royal Society Te Apārangi through a Rutherford Discovery Fellowship (RDF-U001702), AG, RL, and SDD acknowledge support from an Endeavour Smart Ideas grant (U00X1912), AG acknowledges support from a Data Science Programmes grant (UOAX1932), SDD acknowledges support from a Rutherford Discovery Fellowship (RDF-U001802) and the NIH/NCI grant (1K99CA215362-01), and DLS acknowledges support from the NCI grant (5P01CA013106-Project 3). We would also like to acknowledge the CSHL Next-Gen Sequencing Core (NCI-2P30CA45508).
Footnotes
↵E-mail addresses: jiri.moravec{at}otago.ac.nz, rob.lanfear{at}anu.edu.au, spector{at}cshl.edu, sarah.diermeier{at}otago.ac.nz
* alex{at}biods.org.