Inferring copy number variation from gene expression data: methods, comparisons, and applications to oncology

Copy number variations (CNVs) are genomic events where the number of copies of a particular gene varies from cell to cell. Cancer cells are associated with somatic CNV changes resulting in gene amplifications and gene deletions. However, short of single-cell whole-genome sequencing, it is difficult to detect and quantify CNV events in single cells. In contrast, the rapid development of single-cell RNA sequencing (scRNA-seq) technologies has enabled easy acquisition of single-cell gene expression data. In this work, we employ three methods to infer CNV events from scRNA-seq data and provide a statistical comparison of the methods’ results. In addition, we combine the analysis of scRNA-seq and inferred CNV data to visualize and determine subpopulations and heterogeneity in tumor cell populations.


INTRODUCTION
Single-cell RNA-seq (scRNA-seq) is a powerful molecular profiling tool used to measure gene expression in individual cells [1]. The development of high-throughput sequencing technologies has made it possible to quantify gene expression in thousands of single cells for less than $1 USD per cell [2]. However, in addition to understanding the transcriptomic landscape, researchers are also interested in profiling the genomic landscape of tumors as well. Among genomic variants, copy number variants (CNVs) are an important genetic driver of cancer [3].
CNVs are genomic events in which the number of copies of a particular gene varies from individual to individual, or even from cell to cell.
Although there are tools for identifying CNVs using bulk whole-genome sequencing data [4], it has been more challenging to determine CNVs in single cells. Single-cell DNA sequencing has shown promise, but remains more challenging and is less ubiquitous than scRNA-seq. In addition, it remains technically challenging to sequence both the genome and the transcriptome from the same single cell [5]. With these challenges in mind, there has been interest in inferring CNVs from RNA-seq alone. However, this approach faces difficulties due to non-uniform coverage and dynamic variations in gene expression. Nevertheless, in recent years there have been several methods attempting this task.
In this work, we compare three methods that broadly represent the different approaches to inferring single-cell CNV profiles directly from scRNA-seq data: InferCNV [6], CaSpER [7], and CopyKAT [8]. InferCNV uses smoothed averages over gene windows; CaSpER uses a multi-scale signal processing framework; and CopyKAT uses a Bayesian segmentation approach. We compare these methods on a publicly available glioblastoma scRNA-seq dataset and measure their performance under varying conditions of the input gene expression data and algorithms' parameters. Lastly, we combine the analysis of scRNA-seq data and the CNV profiles inferred from the scRNA-seq data to determine subpopulations and visualize intratumor heterogeneity in glioblastoma.
In addition, 1026 normal brain tissue bulk RNA-seq profiles from the Genotype-Tissue Expression (GTEx) project were included as control non-malignant samples intended to mimic normal single cells [10]. The normal tissue profiles were from the cerebellum, caudate, cortex, basal ganglia, cerebellar hemisphere, frontal cortex, and the hippocampus. For the InferCNV and CopyKAT algorithms, the glioblastoma and GTEx data were both accessed in the file "glio.wGtexBrain.counts.matrix.gz" at https://github.com/broadinstitute/inferCNV_examples/tree/master/glioblastoma. For the CaSpER algorithm, the normalized glioblastoma expression data and B-allele frequencies were accessed in the file "scell_gbm.rda" at https://github.com/akdess/CaSpER/tree/master/data.

Workflows
InferCNV uses a corrected moving average of gene expression data to determine CNV profiles. Genes are sorted by absolute genomic position. That is, they are first ordered by chromosome, and then by genomic start position within the chromosome. The algorithm's authors reason that averaging out the expression of genomically adjacent genes removes genespecific expression variability and yields profiles that reflect chromosomal copy number variations. To further refine the CNV profile of tumor cells, InferCNV constructs the CNV profile of a known normal sample, and then for each gene and each cell, the normal sample is subtracted from the tumor sample to determine the final tumor CNV profile.
CaSpER uses a multiscale signal processing framework that combines two streams of information: gene expression and B-allele frequencies (BAF), both generated from aligned RNAseq reads (BAM file). CaSpER first smooths both the BAF and expression signals using recursive iterative median filtering. At each scale, a hidden Markov model is applied to the smoothed expression signal and outputs five CNV states: (1) homozygous deletion, (2) heterozygous deletion, (3) neutral, (4) one-copy amplification, and (5) multi-copy amplification.

Inputs, Outputs, and Technical Considerations
All three methods operate on the assumption that gene expression is correlated with copy number variation. That is, genes that are highly expressed are likely associated with copy number amplifications and genes that are lowly expressed are likely associated with copy number deletions. The key task is then to remove the confounding effect of natural fluctuations in gene expression levels, so that the excess or relative gene expression can be directly attributed to copy number variation.
Following this line of reasoning, all methods require a gene expression matrix (UMI counts) as input. InferCNV does not require labeled normal cell samples to work, but if none are provided it will simply use the average of all cells as the reference copy number profile.
CopyKAT can work with either labeled or unlabeled normal cells as reference. If normal cells are assumed present but not explicitly labeled, CopyKAT will attempt to predict which cells are normal using hierarchical clustering. CopyKAT assumes that tumor cells are highly aneuploid while normal cells are diploid. The CopyKAT authors acknowledge this presents a limitation that can lead to potential misclassifications if the data only has a few normal cells, or when the tumor cells are nearly diploid. CopyKAT also requires expression data from at least five genes on every chromosome, whereas InferCNV allows for incomplete expression data. In addition to a gene expression matrix, CaSpER requires aligned RNA-seq BAM files to compute the BAF signal (Table 1).
CaSpER outputs discrete CNV classes (deletion, neutral, or amplification) while InferCNV and CopyKAT output continuous CNV values. CopyKAT's CNV values are normalized between -1 and 1 for deletion and amplification, respectively, with no CNV change from the reference normal cells given as 0. Similarly, InferCNV's outputs are normalized between 0 to 2 with no CNV change from the reference normal cells given as 1.

Experiments
We compared the output of the three algorithms when varying the number of normal samples used as reference and whether those normal samples were labeled as normal or not. We ran InferCNV and CopyKAT using scRNA-seq data from all 430 glioblastoma tumor cells combined with 0, 5, 50, 500, or all 1,026 normal brain bulk RNA-seq samples. With CopyKAT, we ran each of those conditions twice: once when explicitly labeling which samples were normal and once allowing CopyKAT to predict which of the samples were normal. Since CaSpER does not require any normal samples, CaSpER was only run once using data from all tumor cells.

Comparison of Technical Performance
To compare the CNV profiles of each method as a function of varying normal reference data, we sorted the estimated copy numbers by absolute genomic position and averaged the data profiles. Interestingly, unlabeled CopyKAT appears to detect patient-specific CNV events but is unable to detect the population-wide amplification on chromosome 7 and deletion on chromosome 10.

Comparing gene expression and inferred CNV profiles
Tumor heterogeneity presents a major challenge in cancer diagnosis and treatment.
Different tumor subtypes will lead to variable and distinct treatment responses and outcomes. In addition to inter-tumor variability, different cells within the same tumor might also harbor different mutations and exhibit different responses. Therefore, it is important to profile and resolve different subpopulations within a particular cancer, and then again within a particular patient. Clustering algorithms are commonly applied scRNA-seq data to detect subpopulations.
From these clusters, one can apply differential gene expression analysis to determine biomarkers and cell types.
To this end, we wanted to compare cell clusters in the original gene expression data to those in the inferred CNV profiles. We applied principal component analysis and then generated uniform manifold approximation and projection (UMAP) plots for both the original gene expression data (Figure 3) and the corresponding inferred CNV profiles (Figure 4). Although we can clearly differentiate between bulk RNA-seq from normal brain samples and scRNA-seq from glioblastoma cells (Fig. 3A), the separation between different patients' glioblastoma cells is less pronounced (Fig. 3B). When analyzing the inferred CNV profiles, we only considered the glioblastoma cells and combined results from the four algorithm conditions shown in Figure 3. In the UMAP projections of these inferred CNV profiles, the cells cluster first by patient (Fig. 4B) and then, within each patient, to some extent by algorithm (Fig. 4A). Thus, the inferred CNV values show more variability between patients than between methods. As expected by the heatmap visualizations, InferCNV, CaSpER, and annotated CopyKAT cluster very tightly.

DISCUSSION AND CONCLUSION
In this work we compared the results from three methods for inferring CNV profiles from gene expression data. We found that given sufficient reference information (e.g., BAM files, normal samples with annotations to use as reference profiles) all three methods were quite consistent in their predicted CNV profiles. However, in the absence of one or multiple pieces of reference information, performance varied wildly. Each method had their own limitations and strengths: InferCNV was a robust method, provided that at least some (as few as 5) normal samples were provided. CopyKAT was unable to differentiate between normal samples and Nevertheless, despite these caveats, the inferred CNV profiles proved to be a valuable tool for visualizing and determining tumor subpopulations and heterogeneity. Patient-specific distinctions that were obscured when solely analyzing gene expression data were more readily apparent when visualizing inferred CNVs. Interestingly, we discovered that inter-patient tumor CNV variability exceeded inter-method CNV variability, providing reassurance that despite the technical challenges of CNV estimation, the current methods provide effective approaches to uncovering inter-tumor heterogeneity.
Overall, we believe this work provides the most thorough comparison to date of methods for CNV inference from gene expression data and demonstrates the unique potential of CNVs for discovery in cancer biology. Figure 1. Inter-Methods Correlation. Spearman correlation matrix for averaged CNV profiles generated as described in the Experiments section. On the x-axis, from left to right, the columns indicate the results from InferCNV (I), CaSpER (C), CopyKAT provided with labeled normal samples (w_K), and CopyKAT provided with unlabeled normal samples (wo_K) for varying numbers of normal sample references (the number of normal samples used is given after the underscore).