SUMMARY
Chromosome copy number variations (CNVs) are a near-universal feature of cancer; however, their specific effects on cellular function are poorly understood. Single-cell RNA sequencing (scRNA-seq) can reveal cellular gene expression but cannot directly link this to CNVs. Here we report scRNA-seq normalization methods that improve gene expression alignment between cells, increasing the sensitivity of scRNA-seq for CNV detection. We also report sciCNV, a tool for inferring CNVs from scRNA-seq. Together, these tools enable dual profiling of DNA and RNA in single cells. We apply these techniques to multiple myeloma (MM) and examine the cellular effects of cancer CNVs +8q23-24 and +1q21-44. Primary MM cells with +8q23-24 upregulate MYC, MYC-target genes, mRNA processing and protein synthesis, but also upregulate DEPTOR and have smaller transcriptomes. MM cells with +1q21-44 instead reconfigure translation and suppress unfolded protein stress whilst increasing proliferation, oxidative phosphorylation and MCL1. Overall, we provide tools that can enhance the analysis of scRNA-seq and help reveal the effects of cancer CNVs on cellular reprogramming.
Genomic CNVs are a pervasive feature of cancer. Copy number gains on chromosome arms 8q, 1q, 3q and 5p are amongst the most common karyotype abnormalities in human cancer, yet the action of these and other CNVs on the molecular processes within cancer cells remains poorly understood1,2.
ScRNA-seq can reveal the transcription state of single cells; however, it cannot directly relate this to DNA lesions. Although physical sequencing of both DNA and RNA within single cells has been reported3-5, and should enable pairing of CNVs with their transcriptional outcomes, existing techniques provide profiling of only a few cells and thus afford only a limited view of the genomic and transcriptional heterogeneity within any cancer. Furthermore, while CNVs and gene expression can be profiled in separate populations of cells and computationally integrated6, this may not recapitulate the biological state of individual cells.
DNA CNVs can be inferred from scRNA-seq, which could thus be leveraged to provide both layers of omics information within individual cells. However, previously reported approaches7-9 reveal constraints imposed by the sparsity of single-cell data. In particular, inconsistencies in the detection of lowly-expressed genes within single cells cause stochastic noise that influences transcriptome distribution and interferes with RNA-based CNV detection. Normalization is thus critical for accurate scRNA-seq interpretation10-14 and for secondary CNV detection.
Here we report scRNA-seq normalization methods that reduce the influence of noise from lowly-expressed genes on single-cell transcriptome scale. These methods improve gene expression comparisons between cells and thus enhance the sensitivity of scRNA-seq for the detection of small expression changes arising from gene copy number differences. We also report sciCNV, a new tool for inferring CNVs in single cells from scRNA-seq. Together, these methods enable high-throughput profiling of both DNA copy number and RNA in the same cell, facilitating direct examination of the effects of cancer CNVs on gene expression programs at a cellular level.
RESULTS
Enhanced single-cell RNA-seq normalization methods: RTAM1 and -2
Single-cell RNA-seq enables gene expression comparisons between cells. However, the accuracy of such comparisons depends critically upon data normalization. As the best methods for normalizing scRNA-seq remain controversial, we developed RTAM1 and -2 (described in the online methods and supplementary figures S1-3) and compared the RTAM methods with other normalization strategies currently in use.
To compare the methods for their control of systemic and stochastic variations between cells due to size or sequence depth we generated scRNA-seq data for cells belonging uniformly to the B cell lineage (n>15,000) (figure 1a). We examined a single lineage in order to minimize confounding biological variation between cells due to their ancestry. However, we deliberately generated data from a mix of both small quiescent B cells and large transformed plasma cells to ensure that the normalization methods would be challenged by cells embodying a full spectrum of sizes and transcriptional activities. The cells were isolated from MM patient bone marrow samples by FACS and were profiled using the 10X Genomics single cell RNA-seq library kit. Cell- and gene-specific transcripts were enumerated using barcoded unique molecular identifiers (UMI).
The raw scRNA-seq data from one of three initial test samples is depicted in figure 1b. As shown, the distributions of transcript counts per gene varied significantly from cell to cell, reflecting differences in their cellular transcriptome sizes and demonstrating a clear need for normalization. The samples were next normalized using either TPM15, SCRAN11, SCONE12 or Seurat’s SCTransform function16 (figure 1b and supplementary figures S4-S20). To compare the alignments of the normalized transcriptomes, we examined the mean and median expression in each cell of a curated list of housekeeping genes (HKG) known to be broadly expressed with low variation9. We also examined the average expression in each cell of all of the ubiquitously-expressed genes (UEG) detected in >95% of the cells in the sample. For each sample tested, the UEG represent the largest possible set of genes that are commonly expressed across the test cells. Whilst the expression of any individual gene is expected to vary between cells for both biological and technical reasons, the average expression per cell of a large set of ubiquitous genes should be similar, particularly amongst cells of the same lineage, and its variance between cells provides a metric of normalization effectiveness.
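The alignment metric described above can be illustrated with a minimal sketch. It assumes a small dense cells-by-genes matrix held as nested lists, and the function names (`ubiquitous_genes`, `alignment_cv`) are hypothetical helpers introduced here for illustration; they are not part of any published package. The sketch selects genes detected in more than 95% of cells and reports the coefficient of variation, across cells, of their per-cell mean expression.

```python
from statistics import mean, pstdev


def ubiquitous_genes(counts, min_fraction=0.95):
    """Return indices of genes detected (value > 0) in more than min_fraction of cells.

    counts[c][g] is the expression of gene g in cell c.
    """
    n_cells = len(counts)
    n_genes = len(counts[0])
    selected = []
    for g in range(n_genes):
        detected = sum(1 for cell in counts if cell[g] > 0)
        if detected / n_cells > min_fraction:
            selected.append(g)
    return selected


def alignment_cv(counts, gene_idx):
    """Coefficient of variation, across cells, of the per-cell mean over gene_idx."""
    per_cell = [mean(cell[g] for g in gene_idx) for cell in counts]
    return pstdev(per_cell) / mean(per_cell)


# Toy matrix: 4 cells x 3 genes; the last gene drops out in one cell.
toy = [
    [10, 4, 0],
    [20, 8, 2],
    [10, 4, 1],
    [20, 8, 3],
]
ueg = ubiquitous_genes(toy)   # genes 0 and 1 only
cv = alignment_cv(toy, ueg)   # lower values indicate better alignment
```

In this toy example the two deeper cells have twice the average expression of the shallower ones, so the CV is large; a normalization that aligns the cells would drive it towards zero.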
As shown, TPM, which normalizes cellular transcriptomes primarily by their total transcript count, produced a very large variance in the average expression of HKG or UEG between cells, suggesting significant limitations for scRNA-seq application. By comparison, SCRAN and SCONE produced superior alignments of gene expression averages across cells. However, SCONE, which produced the better alignment, achieved this only by implementing quantile normalization – exchanging the actual distribution of transcript counts in each cell for a standardized distribution – which caused a loss of inter-cellular variation, particularly in highly-expressed genes. The expression of IGH or IGL genes, for example, a critical feature of plasma cells, was reduced by SCONE’s quantile normalization into a virtual constant across cells (supplementary figure S21).
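The flattening effect of quantile normalization can be seen in a small generic example. The sketch below is a textbook quantile normalization written for illustration, not SCONE's implementation; the function name and the toy values are assumptions. Each cell's values are replaced, rank by rank, with a shared reference distribution, so the top-ranked gene receives the same value in every cell.

```python
def quantile_normalize(cells):
    """Replace each cell's values with a shared reference distribution, by rank.

    cells: list of equal-length lists of expression values (one list per cell).
    Ties are broken by gene order, which is a simplification.
    """
    n_genes = len(cells[0])
    sorted_cells = [sorted(cell) for cell in cells]
    # Reference distribution: mean of the k-th smallest value across cells.
    reference = [sum(s[k] for s in sorted_cells) / len(cells) for k in range(n_genes)]
    normalized = []
    for cell in cells:
        order = sorted(range(n_genes), key=lambda g: cell[g])
        out = [0.0] * n_genes
        for rank, g in enumerate(order):
            out[g] = reference[rank]
        normalized.append(out)
    return normalized


# Gene 2 is very highly expressed but varies three-fold between cells.
cells = [
    [1, 5, 900],
    [2, 6, 300],
    [1, 4, 600],
]
qn = quantile_normalize(cells)
top_gene_values = [cell[2] for cell in qn]  # identical in every cell after normalization
```

The three-fold variation in the most highly expressed gene is lost entirely, which mirrors the behaviour reported above for IGH and IGL expression.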
As each of these scRNA-seq normalization methods has limitations, we developed RTAM1 and -2. The RTAM approach originates from a consideration of the strengths and weaknesses of scRNA-seq. Whereas lowly expressed genes are detected within single cells with low resolution (due to low integer transcript counts) and show significant stochastic variation, highly expressed genes are robustly detected and show finer quantisation of variation relative to intensity. RTAM thus utilizes highly-expressed genes, whose expression is resolved with greater accuracy, to align cellular transcriptomes. Genes are ranked in each cell by their expression and the summed intensities of the top-ranked genes are standardized in log-space using unique non-linear cell- and gene-specific adjustments of gene expression determined either by cellular gene expression rank (RTAM1) or by gene expression intensity (RTAM2) (see methods).
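The core idea of aligning cells on their top-ranked genes can be sketched in a deliberately simplified form. The code below is not RTAM1 or RTAM2: it applies a single multiplicative factor per cell in log-space, whereas the methods use non-linear cell- and gene-specific adjustments. The function name `top_sum_align`, the `n_top` parameter and the choice of the mean top-gene sum as the common target are all assumptions made for illustration.

```python
import math


def top_sum_align(cells, n_top=2):
    """Scale each cell's log2(count + 1) values so the top-n_top sums match.

    Simplification: one factor per cell, not the non-linear, gene-specific
    adjustments of RTAM. Assumes every cell has a non-zero top-gene sum.
    """
    logged = [[math.log2(c + 1) for c in cell] for cell in cells]
    top_sums = [sum(sorted(cell, reverse=True)[:n_top]) for cell in logged]
    target = sum(top_sums) / len(top_sums)  # common target: mean top-gene sum
    return [[v * target / s for v in cell] for cell, s in zip(logged, top_sums)]


cells = [
    [1023, 255, 3, 0],  # deeply sequenced cell
    [255, 63, 1, 0],    # shallow cell
]
aligned = top_sum_align(cells)
aligned_top_sums = [sum(sorted(cell, reverse=True)[:2]) for cell in aligned]
```

After alignment both cells have the same summed log-intensity over their two highest-ranked genes, while undetected genes remain at zero.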
Importantly, compared to TPM, SCTransform or SCRAN, both RTAM1 and RTAM2 reduce the cell-to-cell variance in the average (median or mean) expression of HKG and UEG sets (figure 1c and supplementary figures S4-S20). The coefficients of variation (CV) produced by each normalization method for the “average” expression of HKG or UEG in individual cells are shown in figure 1d and supplemental figure S21a, for 3 independent patient samples. As shown, RTAM1 (red) and RTAM2 (blue) reduce variations in the average gene expression of single cells, even when this average expression is calculated by 3 different methods. By design, the RTAM methods also standardize the average expression of highly-expressed genes, and thus overall these methods produce superior alignments of cellular transcriptomes and of gene expression between cells. At the same time, both RTAM1 and RTAM2 maintain the original variability observed between cells in the expression of individual highly-transcribed genes, unlike the quantile normalization implemented by SCONE (supplementary figure S21b-d). Overall, therefore, the RTAM methods represent useful new strategies for normalizing scRNA-seq data that can enhance the accuracy of gene expression comparisons between cells.
Single-cell inferred chromosomal copy number variation: sciCNV
We next sought to develop a method for detecting single-cell chromosomal CNV from scRNA-seq, leveraging the enhanced normalization provided by RTAM to increase the sensitivity of single-cell transcriptomics for CNV detection. To optimize DNA copy number estimates from gene expression, and to mitigate against data sparsity in single cells, we developed a two-pronged approach, called sciCNV (described in the supplemental methods). Briefly, RTAM-normalized gene expression data from single cells was aligned with matching data from pooled control cells to develop expression disparity scores, which were averaged in a moving window defined by genomic location. Gene expression in the control cells was weighted according to the probability of gene detection, enhancing the comparison with single cell data, where signal dropout was common for many genes. In a parallel method, the expression disparity values were exchanged for binary values, which were summed cumulatively as a function of genomic location; the gradient of this function yielded a second estimate of CNV that was sensitive to small concordant expression variations in contiguous genes and that was insensitive to large single-gene variations. The CNV estimates of the two methods were combined by their geometric mean.
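The two-pronged estimate described above can be sketched as follows. This is an illustrative simplification, not the sciCNV implementation: it omits the detection-probability weighting of control cells, encodes the "binary" values as signs (+1, -1, or 0 for no difference), and combines the two estimates with a sign-preserving geometric mean that is set to zero where they disagree in sign. All function names, the window size and the toy disparity values are assumptions.

```python
import math


def moving_average(values, window):
    """Centered moving average over genes ordered by genomic location.

    The window is truncated at the ends of the gene list.
    """
    half = window // 2
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def sign_gradient(values, window):
    """Local slope of the cumulative sum of signs of the disparity values."""
    signs = [(v > 0) - (v < 0) for v in values]
    cumulative = [0]
    for s in signs:
        cumulative.append(cumulative[-1] + s)
    half = window // 2
    out = []
    for i in range(len(signs)):
        lo, hi = max(0, i - half), min(len(signs), i + half + 1)
        out.append((cumulative[hi] - cumulative[lo]) / (hi - lo))
    return out


def combine(a, b):
    """Sign-preserving geometric mean; zero where the two estimates disagree."""
    if a * b <= 0:
        return 0.0
    return math.copysign(math.sqrt(a * b), a)


# Toy disparity scores (cell minus control): noise, then a gained region.
disparity = [-0.2, 0.2, -0.2, 0.2, 0.5, 0.5, 0.5, 0.5]
by_magnitude = moving_average(disparity, window=3)
by_sign = sign_gradient(disparity, window=3)
cnv_estimate = [combine(a, b) for a, b in zip(by_magnitude, by_sign)]
```

The sign-based track responds only to concordant changes in contiguous genes, so a single gene with a very large disparity cannot dominate it, whereas the magnitude-based track retains the size of the shift.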
Figure 2 shows sciCNV applied to scRNA-seq data from primary MM cells. Significantly, the CNV profile of a single cell, inferred from its RNA, closely resembles the average CNV profile of >10^4 tumor bulk cells, derived from whole exome DNA sequencing (WES) (R^2=0.72) (figure 2a-b). The CNV predictions produced from a single cell by sciCNV were also validated at key locations by FISH (figure 2c). Furthermore, examination of >1700 plasma cells from the same MM patient biopsy using sciCNV revealed that the tumor-specific CNV were robustly detected in all of the MMPC (figure 2d), despite biological and technical variations between the cells; and were not detected in normal plasma cells (NPC). Thus, sciCNV can utilize scRNA-seq to reveal CNVs in single cells. Moreover, it can distinguish cancer cells and normal cells on the basis of their CNV profile (figures 2e-f).
Identification of subclones and intra-clonal evolution using scRNA-seq
The detection of CNV within single cells from scRNA-seq data enables the identification of subclones and examination of intra-clonal evolution. Using scRNA-seq, RTAM2 and sciCNV we readily detected up to 7 subclones in primary MM samples comprising <4000 cells (figure 3a-b) and identified an average of 2-3 subclones per sample. Examination of the sciCNV profiles of the individual MM cells yielded evidence of both branching and linear intra-clonal evolution (figure 3c-d). In some tumors, marked divergence of two subclones from an inferred ancestral cell was evident, as in figure 3a, c; however, in the majority of MM samples examined the subclones diverged at only one or two loci.
Dissecting the effects of CNVs on gene expression: +8q23-24 in MM
Simultaneous profiling of both DNA copy number and RNA in the same cell should enable examination of the influence of CNVs on transcriptional programs. To explore this, we used sciCNV to screen MM patient bone marrow samples for tumor cells with +8q24. We sought to examine +8q24 as this is one of the most recurrent abnormalities in human cancer1,17 and is known to target MYC18, providing a benchmark for our analyses.
Using sciCNV, primary MM samples MM199 and MM244 were both found to contain subclonal gains of chromosome 8 encompassing 8q23-24 (figure 3e). Both samples also contained closely-related isogenic subclones without +8q. To facilitate gene set enrichment analyses (GSEA)19 of the intra-tumor subclone pairs, these subclones were next subsampled to yield cellular subpopulations with matching transcriptome depth (figure 3f). This prevented subclone biases in total cellular gene expression from influencing specific gene-set detection. The gene expression of the intra-clonal subpopulations, representing isogenic cells with and without +8q23-24, with matched transcriptome sizes, were then compared by GSEA using RTAM2-normalized data. From an analysis of 215 gene sets defined by chromosome location, +8q cells in both samples were strongly enriched for the gene-sets located at 8q23-24, with striking statistical confidence (p=0.000, q=0.000, FWER=0.000), compared to cells without +8q (supplemental figure S22). In contrast, no other genomic regions were significantly enriched. Thus, sciCNV accurately resolved single MM cells into intra-tumor subclones, isolating +8q23-24 as a unique variation distinguishing these.
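One way to obtain subpopulations with matching transcriptome depth is to pair cells from the two subclones by total transcript count. The text above does not specify the subsampling procedure, so the sketch below is only one plausible approach: a greedy pairing with a relative tolerance. The function name `match_by_depth`, the 5% tolerance and the toy depths are assumptions.

```python
def match_by_depth(depths_a, depths_b, tolerance=0.05):
    """Greedily pair cells from two subclones with similar total counts.

    Two cells are paired when their depths differ by at most `tolerance`
    relative to the larger depth. Each cell is used at most once.
    Returns a list of (index_in_a, index_in_b) pairs.
    """
    order_a = sorted(range(len(depths_a)), key=lambda i: depths_a[i])
    order_b = sorted(range(len(depths_b)), key=lambda j: depths_b[j])
    pairs = []
    i = j = 0
    while i < len(order_a) and j < len(order_b):
        a, b = depths_a[order_a[i]], depths_b[order_b[j]]
        if abs(a - b) <= tolerance * max(a, b):
            pairs.append((order_a[i], order_b[j]))
            i += 1
            j += 1
        elif a < b:
            i += 1  # no partner close enough for this shallower cell
        else:
            j += 1
    return pairs


# Total transcript counts per cell in two subclones (toy values).
with_gain = [5000, 12000, 8000, 30000]
without_gain = [8200, 5100, 11800, 20000, 7000]
pairs = match_by_depth(with_gain, without_gain)
```

Cells without a partner of similar depth are dropped from both subclones, so the retained subpopulations have closely matched depth distributions at the cost of fewer cells.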
We next used GSEA to explore the influence of +8q23-24 on cellular programming. As expected, +8q cells from both MM199 and MM244 samples showed increased MYC expression (p<0.05) compared to sibling cells without +8q (figure 3g). Surprisingly, however, only +8q cells from MM199 showed a broad increase in MYC target genes (p=0.000, q=0.000, FWER=0.000). Canonical MYC signature genes were not upregulated in MM244 +8q cells (p=0.767, FWER=1.0)(figure 3h). Nevertheless, from an analysis of 3303 curated gene sets, +8q cells from both MM199 and MM244 tumors showed similar upregulation of gene-sets encoding the machinery of mRNA translation and protein synthesis, including specifically genes involved in 3’UTR-mediated mRNA translation regulation (enrichment rank 5/3303 in MM199 and 9/3303 in MM244), ribosome biogenesis (enrichment rank 4/3303 in both) and peptide chain elongation (enrichment rank 3/3303 and 6/3303)(figure 3h, supplemental figure S22), potentially representing a more restricted MYC response. Conspicuously, these transcriptional effects of +8q23-24 in MM cells were remarkably close to those of +8q23-24 in breast cancer (FWER p=0.000, enrichment rank 1-2/3303 in both tumors), and this similarity was strong even when MYC hallmark genes were not increased (figure 3h, supplemental figure S22). Thus, +8q23-24 induces analogous gene expression changes across malignancies; and these analogous effects are not dependent on broadly-defined MYC-target genes but instead map to the specific upregulation of mRNA translation and protein synthesis.
The cellular re-programming induced by +8q23-24 might be expected to promote significant increases in gene expression and in cell mass. Notably, however, in the MM samples examined the mTOR-interacting gene, DEPTOR, located at 8q24, was also upregulated in +8q cells (figure 3g), and likely serves to counter increases in cell size, as previously reported20. Indeed, from our examination of +8q at a single cell level we uniquely observed that the transcriptome sizes of +8q cells were in fact mildly reduced, compared to sibling cells without the CNV (p<0.001)(figure 3i). Thus, from a single-cell analysis of +8q23-24 it appears that this CNV acts to boost protein synthesis capacity (ribosomes, translation) without increasing cellular transcriptome size. Ultimately this may lead to enhanced expression of MYC-target genes as proteins in some cancers, but may also serve more broadly to improve the dynamics of protein synthesis and reduce the lag-time required to respond to gene expression changes, potentially enhancing cellular adaptability.
The effects of +1q on MM cells
Like +8q23-24, gain of chromosome 1q is highly recurrent in human cancer and is present in >30% of clinical tumors1,17. Although rare in MM precursor disease, the prevalence of +1q increases significantly in symptomatic MM, more so than any other copy number gain18,21. In newly diagnosed MM, +1q is found in 35% of cases and is associated with poor prognosis22-29. Despite this, the effects of +1q on cancer cell biology remain poorly understood.
To examine the cellular effects of +1q, we screened MM patient bone marrows (n=30) by scRNA-seq and RTAM2/sciCNV, and identified ten tumors with +1q (figure 4a), including three (MM241, MM244 and MM379) containing synchronous subclones with and without the CNV (figure 4b). Although these tumor samples contained 2-6 subclones by sciCNV profiling, the subclones were only partially segregated by expression-based clustering (supplementary figure S23).
By GSEA, +1q cells in MM241 showed significant enrichment for all 10 chromosome position gene-sets located at 1q21-1q44 (p=0.000, FDR q<0.005, FWER p=0.000-0.058), while MM244 and MM379 +1q cells were correspondingly enriched for gene-sets located at 1q23-1q32 (p=0.000, q≤0.004, FWER≤0.019) or 1q22-1q42 (p≤0.004, q≤0.03, FWER≤0.024; 1q23 FWER=0.359)(Figure 5a-b and supplementary figures S24-26). No other genomic regions were significantly enriched, confirming that the intra-clonal +1q subpopulations identified by sciCNV were uniquely divergent at this locus alone.
We next examined the influence of +1q on transcriptional programs in MM241, MM244 and MM379. Remarkably, the +1q cells in all three tumors showed significant reductions in the unfolded protein response (UPR) compared to their sibling cells lacking +1q (p<0.003, FDR≤0.015, FWER≤0.028), suggesting that +1q acts consistently in MM to reduce endoplasmic reticulum (ER) stress (figure 5b and supplementary figures S27-29). This effect of +1q on the UPR has not previously been reported, though it is likely highly advantageous to MM cells, which are professional secretor cells burdened by high proteotoxic stress. In MM241, with the largest +1q CNV, UPR genes EIF4EBP1, EIF4A2, DDIT4, ATF4, ERN1, XBP1 and CEBPB were amongst the genes most downregulated in +1q cells (figure 5c). In contrast, ATF6, UAP1 and PSMD4 were incongruously upregulated, likely as a result of their location within the 1q gain. With respect to mechanism, we observed that the 1q24 gene EEF1AKNMT, which selectively enhances protein translation in a codon-specific manner30 to support oncogenic growth31, was increased in all three +1q subclones, as was TIPRL, which regulates the mTORC1 pathway by inhibiting PP2A and sustaining phosphorylation of EIF4EBP1 and RPS6KB1. In contrast, EIF4A1 or EIF4A2, which jointly promote EIF4E-dependent translation (ET), were reduced, as was the ET-repressor EIF4EBP1 (figure 5d). Thus +1q induces complex alterations of translation and of the mTORC pathway that likely influence misfolded protein load. Expression of UAP1 and/or COPA from 1q23 may further alleviate ER stress32,33.
Additional +1q effects were observed. Mitochondrial oxidative phosphorylation (OxPhos) and reactive-oxygen gene sets were enriched in MM241 +1q cells, likely driven by the increased expression of COX20, NDUFS2, SDHC, MRPS14 and MRPS21 from 1q21-44 (figure 5b-c). However, similar metabolic signatures were not observed in MM244 or MM379, perhaps because MRPS21 (1q21.2) falls outside of the +1q CNV in these latter samples, or because enhanced NF-κB signaling may also be required for OxPhos augmentation34 and was observed only in the MM241 subclone (supplementary figure S27), associated with TNFRSF13B over-expression (figure 5c).
Both MM244 and MM379 also showed significant enrichment of E2F, G2M and mitosis programs in +1q cells (p=0.000, FDR=0.000, FWER≤0.001) (figure 5b) and small increases in cycling cells in G2/M (figure 5e), consistent with increased proliferation. However, no increase in proliferation was observed in MM241 +1q cells, indicating that 1q-induced proliferation requires a permissive cellular context. Although CKS1B has been proposed to be mechanistic in +1q-induced proliferation22,35, we observed no increase in CKS1B in two of the three +1q subclones examined (figure 5f), indicating that alternative mechanisms likely drive cell cycling. Overexpression of EEF1AKNMT31, increased oxidative phosphorylation and reductions in the UPR, may instead contribute to the enhanced proliferation of +1q cells.
MCL1, a critical anti-apoptosis gene for MMPC36,37 located at 1q21.2, was also increased 1.45-fold (p<10^-9) in +1q cells from MM241 (figure 5f) in direct proportion to 1q copy number. MCL1 was not, however, upregulated in either MM244 or MM379, whose 1q gains narrowly excluded the MCL1 locus. An increase in MCL1 and in apoptotic threshold thus represents an additional function of +1q that may further increase cancer cell aggressiveness.
A summary of these cellular effects of +1q21-44 in MM is shown in figure 6a.
Comparison of intra-tumor and inter-tumor CNV studies
We next compared our intra-tumor studies (figure 6b) with a traditional inter-tumor study designed to identify the biological role of +1q (figure 6c). To perform the inter-tumor study, we examined microarray data from a large published series of MM tumor samples (n=532) characterized by +1q FISH22 (supplementary Figures S30-32). As expected, the MM samples with 1q21 gain by FISH showed enrichment by GSEA for chromosomal position gene sets located at 1q21-44. However, the same samples also showed enrichment for gene-sets located on chromosome 1p22, 13q22, 11q13, 11q22, 5q14, 8q24 and Xq28, compared to tumors without +1q, undermining the value of this cohort for isolating gene expression changes attributable to +1q (figure 6c). The samples defined by +1q FISH were also biased towards distinctive MM subtypes, as the +1q cohort included more tumors with t(4;14) while the control samples included more tumors with t(11;14) or hyperdiploidy. Consequently, the utility of these cohorts for the isolation of effects specifically attributable to +1q was undermined. GSEA of the cohorts yielded an overabundance of putative +1q-associations whose attribution to +1q or to confounding CNVs or biases in MM subtype was unclear (figure 6c).
Conspicuously, both intra- and inter-tumor studies identified the UPR as a significant +1q covariant in MM. Strikingly, however, the direction of association differed between the studies, suggesting an error in one of the approaches. Notably, whereas dual profiling of DNA and RNA in single cells enables direct matching of a CNV with its effects on gene expression (figure 6d), inter-tumor studies must instead infer associations between CNVs and gene expression from their correlation across unrelated tumors, which can lead to erroneous conclusions as demonstrated in figure 6e. Thus, single cell studies of intra-tumor heterogeneity can better isolate CNV-specific effects than traditional multi-tumor bulk profiling studies and may reveal the cellular effects of CNVs with greater accuracy.
DISCUSSION
CNVs are critical drivers of cancer biology yet their specific effects on cellular processes remain poorly understood. Here, we report the dual profiling of DNA copy number and RNA within the same cells, using scRNA-seq, and leverage this to explore the effect of CNVs on gene expression. To capture intra-tumor heterogeneity, we profile the RNA and CNVs of thousands of cells per sample. Using these new techniques, we examine the transcriptional effects of copy number gains of chromosome regions 8q23-24 and 1q21-44, representing two of the most common CNVs in human cancer. We show that these lesions induce critical reprogramming of cancer cells that can explain their influence on clinical disease.
Chromosome +1q is the most common adverse CNV in MM. We demonstrate that +1q causes multiple effects on MM cells including a reduction in the unfolded protein response, which likely results from 1q-associated reconfiguration of translation and from changes in the mTOR pathway. In addition, we demonstrate that primary MM cells with +1q show enhanced oncogenic growth, oxidative phosphorylation and MCL1 expression. Significantly, these specific reprogramming effects may explain the inferior disease control achieved by MM patients with tumors harboring this abnormality, following standard of care therapies22,26-28,35,38,39. Thus, the suppression of unfolded protein stress in +1q MM cells may counteract the activity of proteasome inhibitors26-28, which induce cytotoxicity via ER stress40,41. Similarly, the upregulation of MCL1 in cells with +1q21 may counteract treatment-induced apoptosis. And cellular proliferation, which may be induced by 1q-mediated upregulation of EEF1AKNMT, or by UPR reduction, may further contribute to early disease recurrence.
We demonstrate that the transcriptional effects of +8q23-24 are remarkably similar in MM and breast cancer (FWER p=0.000), irrespective of whether or not hallmark MYC target genes are increased (figure 3h). Although +8q23-24 can upregulate the expression of a broad spectrum of MYC target genes, we demonstrate that the transcriptomes of MM cells with +8q are in fact smaller than those of cells lacking +8q, at least in the samples examined by us. Significantly, we demonstrate that a consistent function of +8q23-24 is the upregulation of gene sets involved in mRNA translation, ribosomal biogenesis and peptide elongation. Thus +8q23-24 selectively enhances protein synthesis capacity, without increasing transcriptome size. We propose that this may improve the dynamics of proteome reconfiguration following gene expression changes; and that this may enhance the malleability of cancer cells to environmental challenges.
We show here that the study of CNVs via single-cell transcriptomics offers a number of advantages. As intra-clonal cells that diverge at a single CNV are virtually isogenic, any consistent divergence in their gene expression can be precisely matched to the subclonal CNV. Furthermore, as the test and control cells are present within the same sample, differences in gene expression due to the microenvironment, clinical factors or due to sample processing are minimized. Inter-tumor cohort studies instead rely upon the identification of correlations between CNVs and gene expression across unrelated samples, and suffer from the substantial additional genetic and clinical heterogeneity that exists between samples. As a result of these limitations, the effects of most cancer CNVs on gene expression remain poorly understood. Fortunately, the compelling benefits of intra-clonal studies suggest that a new era of cancer genomics is emerging in which the precise effects of all cancer CNVs on cellular programming can be determined at the single-cell level. This important knowledge is critical for understanding cancer and for advancing therapeutic strategies that seek to address the foundations of this disease.
SUPPLEMENTARY INFORMATION
Methods and supplementary figures can be found on-line.
Author Contribution
A.M-S performed research and analyzed data. N.E., C.L-H. and I.T. performed FISH, FACS and whole exome sequencing, respectively. P.N. provided essential reagents. R.E.T. designed research, analyzed data and wrote the paper.
Competing interests
The authors declare that they have no competing interests.
ACKNOWLEDGEMENTS
The authors thank the patients, their families and the physicians who made this study possible. They also thank N. Winegarden, N. Khuu and G. Basi in the Princess Margaret Genomics Facility and Z. Lu in the Princess Margaret Bioinformatics Core for technical assistance; and Drs. Gary Bader and Caleb Stein for independent critical review and comments. This work was supported by funding from The Princess Margaret Cancer Centre Foundation, the Terry Fox Foundation and the Canadian Cancer Society Research Institute.