Abstract
Despite its popularity, characterization of subpopulations with transcript abundance is subject to a significant amount of noise. We propose to use effective and expressed nucleotide variations (eeSNVs) from scRNA-seq as alternative features for tumor subpopulation identification. We developed a linear modeling framework, SSrGE, to link eeSNVs with gene expression. In all the cancer datasets tested, eeSNVs identify subpopulations more accurately than gene expression and reveal more complex subpopulation structure. Previously validated cancer-relevant genes are also highly ranked, supporting the relevance of the method. Moreover, SSrGE is capable of analyzing coupled DNA-seq and RNA-seq data from the same single cells, demonstrating its applicability to cutting-edge single-cell genomics techniques. In summary, SNV features from scRNA-seq data have merits for both subpopulation identification and linkage of genotype-phenotype relationships. The SSrGE method is available at https://github.com/lanagarmire/SSrGE.
Background
Characterization of phenotypic diversity is a key challenge in the emerging field of single-cell RNA-sequencing (scRNA-seq). In scRNA-seq data, patterns of gene expression (GE) are conventionally used as features to explore the heterogeneity among single cells [1-3]. However, GE features are subject to a significant amount of noise [4]. One issue with GE is the batch effect, where results obtained from two different experimental runs may differ substantially [5], even when the input materials are identical. Additionally, the expression of certain genes varies with the cell cycle [6], increasing the heterogeneity observed in single cells. To cope with these sources of variation, normalization of GE is usually a mandatory step before downstream functional analysis (except for analyses done with Unique Molecular Identifiers). Even with these procedures, other sources of bias remain, e.g. those dependent on read depth, cell capture efficiency and experimental protocols.
Single nucleotide variations (SNVs) are small genetic alterations occurring in specific cells as compared to the population background. SNVs may affect gene expression through cis and/or trans effects [7,8]. Cancer evolution is thought to involve the disruption of genetic stability, e.g. an increasing number of new SNVs [9,10]. A cell may become the precursor of a subpopulation (clone) upon gaining a set of SNVs. Large heterogeneity exists not only between tumors but also within the same tumor [11,12]. Therefore, investigating the patterns of SNVs provides a means to understand tumor heterogeneity.
In single cells, SNVs are conventionally obtained from single-cell exome sequencing [13]. Previously, the resulting SNVs were used to infer cancer cell subpopulations [14,15]. In this study, we propose to obtain useful SNV-based genetic information from scRNA-seq data, in addition to the GE information. Rather than being considered the “by-products” of scRNA-seq, the SNVs not only have the potential to improve the accuracy of identifying subpopulations compared to GE, but also offer unique opportunities to study the genetic events (genotype) that are associated with gene expression (phenotype) [16,17]. Moreover, as coupled DNA- and RNA-based single-cell sequencing techniques mature, the computational methodology proposed in this report can be readily adopted for them as well.
Here we first built a computational pipeline to identify SNVs from scRNA-seq raw reads directly. We then constructed a linear modeling framework to obtain filtered, effective and expressed SNVs (eeSNVs) associated with gene expression profiles. In all the datasets tested, these eeSNVs show better accuracies at retrieving cell subpopulation identities, compared to those from gene expression (GE). Moreover, when combined with cell entities into bipartite graphs, they demonstrate improved visual representation of the cell subpopulations. We ranked eeSNVs and genes according to their overall significance in the linear models, and discovered that several top-ranked genes (e.g. HLA genes) appear commonly in all cancer scRNA-seq data. In summary, we emphasize the SNV approach that was previously understudied in scRNA-seq analysis, which can successfully identify subpopulation complexities and highlight genotype-phenotype relationships.
Results
eeSNV detection from scRNA-seq data
We implemented a pipeline to identify SNVs directly from FASTQ files of scRNA-seq data, following the SNV guideline of GATK (Suppl. Figure S1). We applied this pipeline to four scRNA-seq cancer datasets (Kim, Ting, Miyamoto and Patel, see Methods), and tested the efficiency of SNV features on retrieving single cell groups of interest. These datasets vary in tissue types, origins (Mouse or Human), read lengths and map-ability (Table 1). They all have pre-defined cell types (subclasses), providing good references to assess the performance of a variety of clustering methods used in this study.
Table 1. Summary of scRNA-seq datasets used in this study.
To link SNVs and GE, we developed a method called “Sparse SNV inference to reflect Gene Expression” (SSrGE), as detailed in Materials and Methods. This method uses SNVs as predictors to fit a linear model for gene expression, under LASSO regularization and feature selection [18]. The output is a subset of eeSNVs selected by LASSO, which serve as refined descriptive features for subsequent subpopulation identification (Suppl. Figure S2). To directly pinpoint the contributions of SNVs relevant to protein-coding genes, we used the SNVs residing between the transcription start and end sites of genes as the inputs. In SSrGE, the regularization parameter α is the only tuning variable; it controls the sparsity of the linear models and thus the number of eeSNVs.
eeSNVs are better features than gene expression to identify subpopulations
We measured the performance of SNVs and gene expression (GE) in the four datasets with five clustering approaches. These include two dimension reduction methods, namely Principal Component Analysis (PCA) [19] and Factor Analysis (FA) [20], each followed by either K-Means or the hierarchical agglomerative method (agglo) with WARD linkage [21]. We also used SIMLR, a recent algorithm specifically designed for scRNA-seq data clustering and visualization [60]. To evaluate the accuracy of the obtained subpopulations in each dataset, we used the Adjusted Mutual Information (AMI) over 30 bootstrap runs, computed at the optimal α parameters (Suppl. Table S1). Even though eeSNVs are far fewer than the original SNVs, they are still better features than GE for retrieving cancer cell subpopulations, independent of the clustering method used (Figure 1). Among the clustering algorithms, SIMLR is generally the better choice with eeSNV features. In addition, we computed the Adjusted Rand Index (ARI) [22] and V-measure [23], two other clustering agreement metrics, and obtained similar trends (Suppl. Figure S3).
Figure 1. (A) Bar plot comparing the clustering performance using eeSNV vs. gene expression (GE) as features, over four datasets and five different clustering strategies. Y-axis is the adjusted mutual information (AMI) obtained across 30 bootstrap runs (mean ± s.d.). *: P<0.05, **: P<0.01 and ***: P<0.001. (B) Heatmap of the rankings among different methods and datasets as shown in (A).
Visualization of subpopulations with bipartite graphs
Bipartite graphs are useful to represent binary relations between two different classes of objects. We next represented the binary eeSNV features and the single cells as bipartite graphs, laid out using the ForceAtlas2 algorithm [24]. We drew an edge (link) between a cell node and an eeSNV node whenever that eeSNV is detected in the cell. The results show that the bipartite graph is a robust and more discriminative alternative (Figure 2) compared to PCA plots (using GE and eeSNVs) as well as SIMLR (using GE). For the Kim dataset, the bipartite graph separates the three classes perfectly, whereas the gene-based visualization approaches using either PCA or SIMLR show misclassifications. For the Ting dataset, the eeSNV-cell bipartite graph gives a clear visualization of all six subgroups of single cells. The other three approaches exaggerate separations within the same mouse circulating tumor cell (CTC) subgroup MP (orange color), but mix some other subpopulations (e.g. GM, MP and TuGMP groups). The Miyamoto dataset is the most difficult to visualize among the four datasets, due to its high number (24) of reference classes and the heterogeneity among CTCs. The bipartite graph is not only able to condense the whole population, but also to separate subpopulations (e.g. the orange colored PC subpopulation) much better than the other three methods.
Figure 2. (A) Bipartite graphs using eeSNVs and cell representations. (B) Principal Component Analysis (PCA) results using gene expression. (C) PCA results using eeSNVs. (D) SIMLR results using gene expression.
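The cell-eeSNV graph construction described in this section can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, cell and eeSNV identifiers, and the toy matrix are hypothetical, and the layout step (e.g. ForceAtlas2 in a graph viewer) is not shown.

```python
from collections import defaultdict


def bipartite_edges(m_snv, cell_ids, snv_ids):
    """Return one (cell, eeSNV) edge for every eeSNV detected in a cell."""
    edges = []
    for cell, row in zip(cell_ids, m_snv):
        for snv, present in zip(snv_ids, row):
            if present:
                edges.append((cell, snv))
    return edges


def cells_per_snv(edges):
    """Degree of each eeSNV node, i.e. the number of cells sharing it."""
    degree = defaultdict(int)
    for _, snv in edges:
        degree[snv] += 1
    return dict(degree)


# Toy binary matrix (rows: cells, columns: eeSNVs); values are illustrative.
cells = ["cell_1", "cell_2", "cell_3"]
snvs = ["snv_a", "snv_b", "snv_c"]
m_snv = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
]
edges = bipartite_edges(m_snv, cells, snvs)
degree = cells_per_snv(edges)
```

The resulting edge list can be exported to an interactive graph tool for force-directed layout; cells sharing many eeSNVs are then pulled together by their common neighbors.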
Characteristics of eeSNVs
Since the selection of eeSNVs depends on the regularization parameter α, we next explored their relationship. For every dataset, increasing the value of α decreases the overall number of selected eeSNVs (Figure 3A), as well as the average number of eeSNVs associated with every expressed gene (Figure 3B). The optimal α depends on the clustering algorithm and the dataset used (Suppl. Table S1 and Suppl. Figure S4). Increasing the value of α increases the proportion of eeSNVs that have annotations in the human dbSNP138 database, suggesting that these eeSNVs are biologically valid (Figure 3C). Increasing α also generally increases the average number of cells sharing the same eeSNVs, supporting the hypothesis that cancer cells differentiate with a growing number of genetic mutations over time (Figure 3D). The slight drop in the average number of cells sharing the same eeSNVs in the Kim data when α > 0.6 is due to over-penalization (e.g. α = 0.8 yields only 34 eeSNVs). Finally, we also assessed the CIS effect of the eeSNVs, i.e. the ability of a given eeSNV to predict the expression of its own gene (see Methods). The vast majority of the eeSNVs, and all top-ranked eeSNVs, have a low CIS effect, indicating that eeSNVs are mostly used by the method as predictors of the expression of other genes (Suppl. Figure S5).
Figure 3. X-axis: values of the regularization parameter α. Y-axes: (A) Log10 transformation of the number of eeSNVs. (B) The average number of eeSNVs per gene. (C) The proportion of SNVs with dbSNP138 annotations (human datasets). (D) The average number of cells sharing eeSNVs. Inset: Patel dataset.
Cancer relevance of eeSNVs
To further explore the biological functions, we ranked the eeSNVs and the genes harboring them, using the eeSNVs’ coefficients from the SSrGE models (Suppl. Tables S2). We found that eeSNVs from multiple genes in the Human Leukocyte Antigen (HLA) complex, such as HLA-A, HLA-B, HLA-C and HLA-DRA, are top ranked in all three human datasets (Table 2 and Suppl. Tables S2). HLA is a gene family encoding the major histocompatibility complex (MHC) proteins in human. Beta-2-microglobulin (B2M), on the other hand, is ranked 7th and 45th in the Ting and Patel datasets, respectively (Table 2). Unlike HLA, which is present in human only, B2M encodes a serum protein involved in the MHC that is also present in mice. Other previously identified tumor driver genes are also ranked near the top by SSrGE, demonstrating the significance of mutations on cis-gene expression (Table 2 and Suppl. Tables S2). Notably, KRAS, previously linked to tumor heterogeneity by the original scRNA-seq study [45], is ranked 13th among all eeSNV-containing genes (Suppl. Tables S2). AR and KLK3, two genes reported to show genomic heterogeneity in tumor development in the original study [47], are ranked 6th and 19th, respectively. EGFR, the therapeutic target in the Patel study with an important oncogenic variant EGFRvIII [46], is ranked 88th out of 4,225 genes. Therefore, genes top-ranked by their eeSNVs are supported by previously reported evidence.
Table 2. Highly ranked genes of interest. Ranks with ‘*’ designate cancer driver genes reported in the original studies.
Next we conducted a more systematic investigation to identify KEGG pathways enriched in each dataset, using these genes as the input for the DAVID annotation tool [25] (Figure 4A). The pathway-gene bipartite graph illustrates the relationships between these genes and the enriched pathways (Figure 4B). As expected, “Antigen processing and presentation” stands out as the most enriched pathway, with a summed -log10(p-value) of 9.22 (Figure 4A). “Phagosome” is the second most enriched pathway across the four datasets, largely due to its members in the HLA families (Figure 4B). Additionally, pathways related to cell junctions and adhesion (focal adhesion, tight junction, cell adhesion molecules CAMs), protein processing (protein processing in endoplasmic reticulum and proteasome), and the PI3K-AKT signaling pathway are also highly enriched with eeSNVs (Figure 4A).
Figure 4. (A) KEGG pathways enriched with genes containing eeSNVs in the four datasets. Pathways are sorted by the sum of the -log10(p-value) of each dataset, in descending order. (B) Bipartite graph for KEGG pathways and genes enriched with eeSNVs. Pathways and genes in each dataset are colored as shown in the graph. The size of the gene nodes is proportional to the normalized gene scores, according to the eeSNV scores, and the size of the pathway nodes is proportional to the sum of the normalized gene scores.
Heterogeneity markers using eeSNVs
We exemplified the potential of eeSNVs as heterogeneity markers using the Kim dataset. First, we reconstructed the pseudo-time ordering of the single cells entirely from eeSNVs, rather than GE, by building a Minimum Spanning Tree, similarly to the Monocle algorithm [26]. The graph captures the continuity among cells, from the primary to the metastasized tumors (Figure 5A). Moreover, it highlights ramifications inside each of the subgroups, demonstrating intra-group heterogeneity. In contrast, pseudo-time reconstruction using GE showed much less branching complexity (Supplementary Figure S6). Next, we used our method to identify eeSNVs specific to each single-cell subgroup and ranked the genes according to these eeSNVs. We compared the characteristics of the metastatic cells to those of the primary tumor cells. Two top-ranked genes identified by the method, CD44 (1st) and LPP (2nd), are known to promote cancer cell dissemination and metastasis growth after genomic alteration [27–30] (Suppl. Tables S3). Other top-ranked genes related to metastasis are also identified, including LAMPC2 (7th), HSP90B1 (14th), MET (44th) and FN1 (52nd). As expected, “Pathways in Cancer” is the top-ranked pathway enriched with mutations (Figure 5B). Additionally, the “Focal Adhesion” and “Endocytosis” pathways are among the other significantly mutated pathways, providing new insights into the mechanistic differences between primary and metastasized RCC tumors (Figure 5B).
Figure 5. (A) Pseudo-time ordering reconstruction of the different subgroups. (B) Bipartite graph for KEGG pathways and genes enriched with eeSNVs. The size of the gene nodes is proportional to the gene scores, according to the eeSNV scores, and the size of the pathway nodes is proportional to the sum of the gene scores. Lighter green indicates genes with a lower rank.
Integrating DNA- and RNA-Seq data measured in the same single cells
Coupled DNA-Seq and RNA-Seq measurements from the same single cell are the new horizon of single-cell genomics. To demonstrate the power of SSrGE in integrating DNA and RNA data, we downloaded the only publicly accessible single-cell dataset that has DNA methylation and RNA-Seq records from the same hepatocellular carcinoma (HCC) single cells (Hou dataset) [31]. We then inferred SNVs from the aligned reduced representation bisulfite sequencing (RRBS) reads (see Methods). Using our method, we identified eeSNVs that perfectly separate normal hepatocellular cells from cancer cells and highlight the two cancer subtypes identified in the original study (Figure 6A). Pseudo-time ordering not only shows an early divergence between the two previously assumed subtypes, but also unveils significant ramifications within subtype II, indicating potential new subgroups (Figure 6B).
Figure 6. (A) Bipartite-graph representation of the single cells using eeSNVs from RRBS reads. (B) Pseudo-time ordering reconstruction of the HCC cells.
We postulated that a considerable part of the bisulfite reads aligned to methylation islands associated with gene promoter regions. We thus annotated eeSNVs within 1500bp upstream of the transcription start codon, and obtained genes whose eeSNVs are significantly prevalent in certain groups. When comparing HCC vs. normal control cells, two genes, PRMT2 and SULF2, show statistically significant mutations in HCC cells (P-values < 0.05). Downregulation of PRMT2 was previously associated with breast cancer [32], and SULF2 is known to be upregulated in HCC, where it promotes HCC growth [33]. When comparing HCC subgroup I vs. II, CTBP2 is significantly mutated (P-value = 0.01) in subgroup I. CTBP2 is a transcriptional co-repressor that promotes cancer cell migration and invasion by inhibiting tumor suppressor genes, and was previously associated with worse prognosis in HCC [34].
Discussion
Using GE to accurately analyze scRNA-seq data has many challenges, including technological biases such as the choice of the sequencing platforms, the experimental protocols and conditions. These biases may lead to various confounding factors in interpreting GE data [5]. SNVs, on the other hand, are less prone to these issues given their binary nature. In this report, we demonstrate that eeSNVs extracted from scRNA-seq data are ideal features to characterize cell subpopulations. Moreover, they provide a means to examine the relationship between eeSNVs and gene expression in the same scRNA-seq sample.
eeSNVs have improved accuracy on identifying tumor single-cell subpopulations
The process of selecting eeSNVs linked to GE allows us to identify representative genotype markers for cell subpopulations. We speculate that the following reasons contribute to the better accuracies of eeSNVs compared to GE. First, eeSNVs are binary features rather than continuous features like GE, and are thus more robust at separating subpopulations. We have noticed that SNVs are less affected by batch effects (Suppl. Figure S7). Secondly, LASSO penalization works as a feature selection method and minimizes the spurious SNVs (false positives) in the filtered set of eeSNVs. Thirdly, since eeSNVs are obtained from the same samples as the scRNA-seq data, they are more likely to have biological impact, which is supported by the observation that they have a high prevalence of dbSNP annotations.
A small number of eeSNVs can be used to discriminate distinct single-cell subpopulations, as compared to the thousands of genes that are normally used for scRNA-seq analyses. Taking advantage of the eeSNV-GE relationship, a very small number of top eeSNVs can still clearly separate the cell subpopulations of the different datasets (e.g. 8 eeSNV features have decent accuracy for the Kim dataset). Moreover, our SSrGE package processes each gene independently and can therefore be easily parallelized. It has the potential to scale up to very large datasets, and is well-poised for the new wave of scRNA-seq technologies that can generate thousands of cells at one time [35]. One can also easily rank the eeSNVs and the genes harboring them, for the purpose of identifying robust eeSNVs as genetic markers for a variety of cancers.
eeSNVs highlight genes linked to cancer phenotypes
SSrGE uses an accumulative ranking approach to select eeSNVs linked to the expression of a particular gene. In particular, HLA class I genes (HLA-A, HLA-B and HLA-C) are top-ranked in the three human datasets, and they contribute to “antigen processing and presentation”, the most enriched pathway across the four datasets. HLA genes are among the most polymorphic genes of the human genome [36], and somatic mutations of genes in this family were reported in the development and progression of various cancers [37,38]. HLA genes with eeSNVs could be used as fingerprints to characterize the cellular state of the cancer cells. B2M, another gene with top-scored eeSNVs in the Ting and Patel datasets, is also known to be a mutational hotspot [39]. It is directly linked to immune response as well as tumor cell proliferation [37,39]. Many other top-ranked genes, such as KRAS and SPARC, were reported to be driver genes in the original studies of the different datasets. Thus, it is reasonable to speculate that SSrGE is capable of identifying some driver genes. However, SSrGE may miss some driver mutations, since its primary goal is to identify a minimal set of eeSNV features by LASSO penalization, and LASSO may select only one among several highly correlated SNV features that correspond to GE.
eeSNVs reveal higher degree of single-cell heterogeneity than gene expression
We have shown with strong evidence that eeSNVs unveil inter- and intra-tumor cell heterogeneity better than gene expression count data obtained from the same RNA-Seq reads. Reconstructing the pseudo-time ordering of cancer cells from the same tumor (Kim dataset) displays branching even inside the primary tumor and metastasis subgroups, which gene expression data are unable to do. We identified genes enriched with SNVs specific to the metastasis, which were not reported in the original HCC single cell study [31]. Most interestingly, we showed that eeSNVs can also be retrieved from RRBS reads in a multi-omics single-cell HCC dataset, a twist from their original purpose of single-cell DNA methylation. Again, genes ranked by eeSNVs from RRBS reads not only differentiate normal from cancer cells but also the different cancer subtypes. We identified several genes that are significant in either HCC or an HCC subgroup, whose promoters are highly impacted by eeSNVs. Thus, we have demonstrated that our method is at the forefront of analyzing data generated by new single-cell technologies extracting multi-omics from the same cells [31,40].
Advantages of using bipartite graphs to represent scRNA-seq data
Bipartite graphs are a natural way to visualize eeSNV-cell relationships. We have used force-directed graph drawing algorithms, which involve spring-like attractive forces and electrical repulsions between nodes that are connected by edges. This approach has the advantage of revealing “outlier” single cells, with a small set of eeSNVs, compared to distance-based approaches. Moreover, the bipartite representation also directly reveals the relationship between single cells and the eeSNV features. Contrary to dimension reduction approaches such as PCA, which require a linear transformation of features into principal components, bipartite graphs preserve all the binary information between cells and eeSNVs. Graph analysis software such as Gephi [24] or Cytoscape [41] can be utilized to explore the bipartite relationships in an interactive manner.
Conclusion
We demonstrated the efficiency of using eeSNVs for cell subpopulation identification over multiple datasets. eeSNVs are excellent genetic markers for intra-tumor heterogeneity and may serve as genetic candidates for new treatment options. We have also developed SSrGE, a linear model framework that correlates genotype (eeSNV) and phenotype (GE) information in scRNA-seq data. Moreover, we have shown the capacity of SSrGE to analyze multi-omics data from the same single cells, obtained from the most cutting-edge genomics techniques [42,43]. Our method holds great promise as part of routine scRNA-seq analyses, as well as multi-omics single-cell integration projects.
Materials and Methods
scRNA-seq datasets
All four datasets were downloaded from the NCBI Gene Expression Omnibus (GEO) portal [44].
Kim dataset (accession GSE73121)
Contains three cell populations from matched primary and metastatic tumors from the same patient [45]. Patient-derived xenografts (PDX) were constructed using cells from the primary clear cell renal cell carcinoma tumor (PDX-pRCC) and from the lung metastatic tumor (PDX-mRCC). Metastatic cells from the patient (Pt-mRCC) were also sequenced.
Patel dataset (accession GSE57872)
Contains five glioblastoma cell populations isolated from five individual tumors from different patients (MGH26, MGH28, MGH29, MGH30 and MGH31) and two gliomasphere cell lines, CSC6 and CSC8, used as controls [46].
Miyamoto dataset (accession GSE67980)
Contains 122 CTCs from prostate cancer from 18 patients, 30 single cells derived from 4 different cancer cell lines (VCaP, LNCaP, PC3 and DU145), and 5 leukocyte cells from a healthy patient (HD1) [47]. A total of 23 classes (18 CTC classes + 4 cancer cell lines + 1 healthy leukocyte class) were obtained.
Ting dataset (subset of accession GSE51372)
Contains 75 CTCs from Pancreatic cancer from 5 different KPC mice (MP2, MP3, MP4, MP6, MP7), 18 CTCs from two GFP-lineage traced mice (GMP1 and GMP2), 20 single cells from one GFP-lineage traced mouse (TuGMP3), 12 single cells from a mouse embryonic fibroblast cell line (MEF), 12 single cells from mouse white blood (WBC) and 16 single cells from the nb508 mouse pancreatic cell line (nb508) [48]. KPC mice have uniform genetic cancer drivers (Tp53, Kras). Due to their shared genotype, we merged all the KPC CTCs into one single reference class. CTCs from GMP1 did not pass the QC test and were dismissed. CTCs from GMP2 mice were labeled as GMP. Finally, 6 reference classes were used: MP, nb508, GMP, TuGMP, MEF and WBC.
Hou dataset (accession GSE65364)
Contains 25 hepatocellular carcinoma single cells (Ca) extracted from the same patient and 6 normal liver cells (HepG2) obtained from the adjacent normal tissue of another HCC patient [31]. The 32 cells were sequenced using scTrio-seq in order to obtain reads from both RNA-seq and reduced representation bisulfite sequencing (RRBS). The authors highlighted that one of the Ca cells (Ca_26) was likely to be a normal cell, based on CNV measurements, and thus we discarded this cell. We used only the RRBS reads and as Gene expression measurements. As controls, we also used the bulk genome of all the RNA-Seq and RRBS reads of the HepG2 group.
SNV detection using scRNA-seq data
The SNV detection pipeline using scRNA-seq data follows the guidelines of GATK (http://gatkforums.broadinstitute.org/wdl/discussion/3891/calling-variants-in-rnaseq). It includes four steps: alignment of spliced transcripts to the reference genome (hg19 or mm10), BAM file preprocessing, read realignment and recalibration, and variant calling and filtering (Suppl. Figure S1) [49].
Specifically, FASTQ files were first aligned using STAR aligner [50], using mm10 and hg19 as reference genomes for mouse and human datasets, respectively. The BAM file quality check was done by FastQC [51], and samples with lower than 50% of unique sequences were removed (default of FastQC). Also, samples with more than 20% of the duplicated reads were removed by STAR. Finally, samples with insufficient reads were also removed, if their reads were below the mean minus two times the standard deviation of the entire single-cell population. Raw gene counts Xj were estimated using featureCounts [52], and normalized using the logarithmic transformation:
where Xj is the raw expression of gene j, R is the total number of reads and Gj is the length of the gene j. BAM files were pre-processed and reordered using Picard Tools (http://broadinstitute.github.io/picard/), before being subjected to realignment and recalibration using GATK tools [53]. SNVs were then called and filtered using GATK tools with default parameters.
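The logarithmic transformation above depends on the raw count Xj, the total number of reads R and the gene length Gj. A minimal sketch is given below, assuming an FPKM-like form log2(1 + 10^9 · Xj / (Gj · R)). The scaling constant, the log base and the function name are assumptions made for illustration and may differ from what was actually used.

```python
import math


def log_normalize(raw_count, gene_length, total_reads, scale=1e9):
    """Length- and depth-normalized log expression.

    Assumed form: log2(1 + scale * X_j / (G_j * R)).
    `scale` and the log base are illustrative assumptions.
    """
    return math.log2(1.0 + scale * raw_count / (gene_length * total_reads))


# Toy values: 100 reads on a 2,000 bp gene in a library of 1,000,000 reads.
value = log_normalize(raw_count=100, gene_length=2000, total_reads=1_000_000)
```

Adding 1 inside the logarithm keeps unexpressed genes at exactly 0 after transformation.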
SNV detection using RRBS data
We first aligned the RRBS reads on the hg19 reference genome using the Bismark software [54]. We then processed the bam files using all the preprocessing steps as described in “SNV detection using scRNA-seq data” section (i.e. Picard Preprocessing, Order reads, Split reads and Realignments), except the base recalibration step. Finally, we called the SNVs using the BS-SNPer software [55]. We only considered the SNVs with the status “PASS”.
SNV annotation
To annotate human SNV datasets, dbSNP138 from the NCBI Single Nucleotide Polymorphism database [56] and reference INDELs from 1000 genomes (1000_phase1 as Mills_and_1000G_gold_standard) [57] were used. To annotate the mouse SNV dataset, dbSNPv137 for SNPs and INDELs were downloaded from the Mouse Genomes Project of the Sanger Institute, using the following link: ftp://ftp-mouse.sanger.ac.uk/REL-1303-SNPs_Indels-GRCm38/ [58]. The mouse SNP databases were sorted using SortVcf command of Picard Tools, in order to be properly used by Picard Tools and GATK.
SSrGE package to calculate eeSNVs
For each dataset, we denote MSNV and MGE as the SNV and gene expression matrices, respectively. MSNV is binary and MSNV(c, s) ∈ {0,1} designates the presence/absence of SNV s in cell c. MGE(c, g) is the log transformed gene expression value of the gene g in cell c. A gene and its associated SNVs were only considered when the gene was expressed in at least one sample. Sparse linear regression using LASSO was then applied to identify Wg, the linear coefficients associated to the SNVs. The objective function (to minimize) is:
$$\min_{W_g} \; \frac{1}{2n} \left\| M_{GE}(\cdot, g) - M_{SNV} W_g \right\|_2^2 + \alpha \left\| W_g \right\|_1$$
where n is the number of cells and α is the regularization parameter.
An SNV was considered as eeSNV when Wg (s) ≠ 0. To derive sensible eeSNVs, the linear regression was only done on a particular gene, when at least 10 cells in the population expressed it.
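The per-gene sparse regression can be sketched as follows. This is a didactic example, not the SSrGE implementation: it assumes the LASSO objective (1/2n)·||y − Xw||² + α·||w||₁ without an intercept, and solves it with a small coordinate-descent loop. The function names and the toy data are hypothetical; in practice a library LASSO solver would normally be used.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the LASSO proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(x, y, alpha, n_iter=200):
    """Minimize (1/2n)||y - Xw||^2 + alpha * ||w||_1 (no intercept)."""
    n, p = len(x), len(x[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw, with w initialized to zero
    col_scale = [sum(x[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_scale[j] == 0.0:
                continue  # SNV absent from every cell: coefficient stays 0
            # Correlation of SNV j with the residual that excludes SNV j.
            rho = sum(x[i][j] * (residual[i] + x[i][j] * w[j]) for i in range(n)) / n
            new_w = soft_threshold(rho, alpha) / col_scale[j]
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= x[i][j] * delta
                w[j] = new_w
    return w


# Toy data: 6 cells, 3 SNVs; expression of the gene follows SNV 0 only.
m_snv = [
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
    [0, 0, 0],
    [0, 1, 0],
]
expression = [2.0, 2.0, 2.0, 0.0, 0.0, 0.0]
w = lasso_coordinate_descent(m_snv, expression, alpha=0.1)
# eeSNVs for this gene are the SNVs with a non-zero coefficient.
eesnvs = [j for j, coef in enumerate(w) if coef != 0.0]
```

On this toy input only the first SNV keeps a non-zero coefficient, shrunk from 2.0 to about 1.8 by the penalty. Increasing α would shrink it further and eventually remove it, which is how α controls the number of eeSNVs.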
Ranking of eeSNVs and genes
SSrGE generates coefficients of eeSNVs for each gene, as a metric for their contributions to the gene expression. The score of an eeSNV s is given by the sum of its absolute weights over all genes:
$$\mathrm{score}_s = \sum_{g} \left| W_g(s) \right|$$
Each gene also receives a score according to its associated eeSNVs:
In practice, we first obtained eeSNVs using a minimum filtering of α=0.1, before using these two scores above to rank eeSNVs and the genes.
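The two ranking scores can be sketched as below. The sketch assumes that an eeSNV score is the sum of the absolute values of its coefficients over all genes, and that a gene score is the sum of the scores of the eeSNVs residing in that gene. Both aggregation choices, the data layout and all names are assumptions made for illustration.

```python
def rank_eesnvs_and_genes(weights, snv_to_gene):
    """Score eeSNVs and genes from per-gene regression coefficients.

    weights: {predicted_gene: {snv: coefficient}}
    snv_to_gene: {snv: gene harboring the snv}
    """
    snv_scores = {}
    for coefs in weights.values():
        for snv, coef in coefs.items():
            if coef != 0.0:  # only eeSNVs (non-zero coefficients) count
                snv_scores[snv] = snv_scores.get(snv, 0.0) + abs(coef)
    gene_scores = {}
    for snv, score in snv_scores.items():
        gene = snv_to_gene[snv]
        gene_scores[gene] = gene_scores.get(gene, 0.0) + score
    return snv_scores, gene_scores


# Toy coefficients for two predicted genes; values are illustrative.
weights = {
    "GENE_A": {"snv_1": 0.5, "snv_2": 0.0},
    "GENE_B": {"snv_1": -0.25, "snv_3": 1.0},
}
snv_to_gene = {"snv_1": "GENE_A", "snv_2": "GENE_A", "snv_3": "GENE_B"}
snv_scores, gene_scores = rank_eesnvs_and_genes(weights, snv_to_gene)
snv_ranking = sorted(snv_scores, key=snv_scores.get, reverse=True)
```

Using absolute values means that an eeSNV associated with decreased expression contributes to the ranking as much as one associated with increased expression.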
Ranking of eeSNVs and genes for a subpopulation
For a given single-cell subpopulation p, an eeSNV is specific to p only if it is significantly more present in the cells of p. For each eeSNV, we took only the subset of cells expressing the gene g associated with the eeSNV. We then used Fisher’s exact test to compare the presence of the eeSNV between single cells inside and outside p. We considered an eeSNV significant when the p-value < 0.05. p’ designates the subset of cells from p expressing g. The score of an eeSNV for p is given by:
The score of a given gene g for p is thus given by:
To rank eeSNVs from the promoter regions of the RRBS reads in Hou dataset, we applied a similar methodology: we annotated the eeSNVs within 1500bp upstream of genes’ starting codon regions.
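The significance test used to call subpopulation-specific eeSNVs can be sketched with a one-sided Fisher’s exact test on a 2×2 table (cells inside vs. outside p, with vs. without the eeSNV). The one-sided alternative is an assumption based on the phrase “significantly more present”; the function name and the counts are hypothetical.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    a: cells in p carrying the eeSNV      b: cells in p without it
    c: cells outside p carrying it        d: cells outside p without it
    Returns P(X >= a) under the hypergeometric null distribution.
    """
    in_p = a + b
    carriers = a + c
    total = a + b + c + d
    denom = comb(total, in_p)
    p_value = 0.0
    for k in range(a, min(in_p, carriers) + 1):
        p_value += comb(carriers, k) * comb(total - carriers, in_p - k) / denom
    return p_value


# Toy counts, restricted to cells expressing the gene harboring the eeSNV.
p_specific = fisher_exact_greater(5, 0, 0, 5)  # eeSNV found only inside p
p_absent = fisher_exact_greater(0, 5, 5, 0)    # eeSNV found only outside p
is_significant = p_specific < 0.05
```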
CIS scores of eeSNVs
The CIS score of an eeSNV s is the fraction of its contribution to the expression of the gene g that it resides in among its total score over all genes:
$$\mathrm{CIS}_s = \frac{\left| W_g(s) \right|}{\sum_{g'} \left| W_{g'}(s) \right|}$$
Pseudo-time ordering reconstruction
To estimate the trajectory of cell evolvement, we adopted the following procedure, motivated by the method described earlier [26]. We first constructed the following distance matrix to reflect the correlation between each pair of cells:
Using this distance matrix as the adjacency matrix, we constructed a weighted undirected complete graph with each node representing a cell. We then found the minimum spanning tree of this complete graph. Finally, we plotted the graph and mapped the original labels as colors of the nodes.
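The tree construction can be sketched as follows. The distance is assumed here to be 1 minus the Pearson correlation between the eeSNV profiles of two cells, and Prim's algorithm is used to extract the minimum spanning tree. Both choices, the function names and the toy profiles are illustrative assumptions.

```python
import math


def pearson(u, v):
    """Pearson correlation; returns 0.0 if either profile is constant."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    sd_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    sd_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    if sd_u == 0.0 or sd_v == 0.0:
        return 0.0
    return cov / (sd_u * sd_v)


def distance_matrix(profiles):
    """Assumed distance: 1 - Pearson correlation between two cells."""
    n = len(profiles)
    return [
        [0.0 if i == j else 1.0 - pearson(profiles[i], profiles[j]) for j in range(n)]
        for i in range(n)
    ]


def minimum_spanning_tree(dist):
    """Prim's algorithm on a complete graph given as a distance matrix."""
    n = len(dist)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        best = None
        for i in sorted(in_tree):
            for j in range(n):
                if j not in in_tree and (best is None or dist[i][j] < best[0]):
                    best = (dist[i][j], i, j)
        edges.append((best[1], best[2]))
        in_tree.add(best[2])
    return edges


# Toy binary eeSNV profiles for four cells; values are illustrative.
profiles = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
]
tree = minimum_spanning_tree(distance_matrix(profiles))
```

Each tree edge connects a cell to its closest already-placed neighbor, so long paths in the tree can be read as putative trajectories and side branches as ramifications within a subgroup.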
Subpopulation clustering algorithms
We combined two dimension reduction algorithms: Principal Component Analysis (PCA) [19] and Factor Analysis (FA) [20] with two popular clustering approaches: the K-Means algorithm [59] and agglomerative hierarchical clustering (agglo) with WARD linkage [21]. We also used SIMLR, a recent algorithm specifically tailored to cluster and visualize scRNA-seq data, which learns the similarity matrix from subpopulations [60]. Similar to the original SIMLR study, we used the embedding of the cells produced by the algorithm to apply K-Means algorithm.
PCA and FA were performed using their corresponding implementations in Scikit-Learn (sklearn) [61]. For PCA, FA and SIMLR, we projected the data onto various numbers of dimensions D (2, 3, 5, 10, 15, 20, 25 and 30). To cluster the data with K-Means or the hierarchical agglomerative procedure, we varied the number of clusters N (from 2 to 80) to obtain the best clustering results for each dataset. We computed accuracy metrics for each (D, N) pair and chose the combination that gave the best overall score. K-Means was run with the sklearn implementation and its default parameters, and hierarchical clustering was done with the AgglomerativeClustering implementation of sklearn, using Ward linkage.
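The search over (D, N) pairs can be sketched as follows for the PCA and K-Means combination. This is an illustrative example on synthetic data, not the pipeline used in the study: the function name, the reduced grids, the fixed random seed and the explicit `n_init=10` are assumptions made to keep the example small and reproducible, and AMI is used as the single selection score.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_mutual_info_score


def grid_search_clustering(X, reference, dims, cluster_numbers, seed=0):
    """Return the (D, N, score) triple with the highest AMI against `reference`."""
    best = (None, None, float("-inf"))
    for d in dims:
        # Project the feature matrix onto d principal components.
        embedding = PCA(n_components=d, random_state=seed).fit_transform(X)
        for n in cluster_numbers:
            labels = KMeans(n_clusters=n, n_init=10, random_state=seed).fit_predict(embedding)
            score = adjusted_mutual_info_score(reference, labels)
            if score > best[2]:
                best = (d, n, score)
    return best


# Synthetic stand-in for a cells x features matrix with three known groups.
X, reference = make_blobs(
    n_samples=90, n_features=20, centers=3, cluster_std=1.0, random_state=0
)
best_d, best_n, best_score = grid_search_clustering(
    X, reference, dims=[2, 3, 5], cluster_numbers=range(2, 7)
)
```

Swapping `PCA` for `FactorAnalysis`, or `KMeans` for `AgglomerativeClustering(linkage="ward")`, gives the other combinations described above with the same loop structure.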
Validation metrics
To assess the accuracy of the obtained clusters, we used three metrics: Adjusted Mutual Information (AMI), Adjusted Rand Index (ARI) and V-measure [22,23]. These metrics compare the obtained clusters C to some reference classes K and generate scores that typically fall between 0 and 1 for AMI and V-measure, and between −1 and 1 for ARI. A score of 1 means a perfect match between the obtained clusters and the reference classes. For the chance-adjusted metrics (AMI and ARI), a score close to 0 indicates a random clustering, and a negative score indicates agreement worse than expected by chance.
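The Rand Index (RI) was computed as $RI = \frac{a + b}{\binom{n}{2}}$, where n is the number of samples, a is the number of sample pairs grouped together in both the obtained clusters C and the reference classes K, and b is the number of sample pairs separated in both C and K. As an improvement, ARI normalizes RI against random chance: $ARI = \frac{RI - E[RI]}{\max(RI) - E[RI]}$.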
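As a worked illustration of these two indices, the sketch below computes RI by direct pair counting and ARI with the usual contingency-table form of the chance correction. The function names are chosen for the example, and `sklearn.metrics.adjusted_rand_score` is the practical choice for real data.

```python
from collections import Counter
from itertools import combinations
from math import comb


def rand_index(c, k):
    """Fraction of sample pairs on which the two labelings agree."""
    pairs = list(combinations(range(len(c)), 2))
    agree = sum((c[i] == c[j]) == (k[i] == k[j]) for i, j in pairs)
    return agree / len(pairs)


def adjusted_rand_index(c, k):
    """Rand Index corrected for chance, computed from the contingency table."""
    n = len(c)
    sum_cells = sum(comb(v, 2) for v in Counter(zip(c, k)).values())
    sum_c = sum(comb(v, 2) for v in Counter(c).values())
    sum_k = sum(comb(v, 2) for v in Counter(k).values())
    expected = sum_c * sum_k / comb(n, 2)
    maximum = (sum_c + sum_k) / 2
    if maximum == expected:
        return 1.0  # degenerate case: both labelings are trivial
    return (sum_cells - expected) / (maximum - expected)


clusters = [0, 0, 1, 2]
classes = [0, 0, 1, 1]
ri = rand_index(clusters, classes)            # 5 of the 6 pairs agree
ari = adjusted_rand_index(clusters, classes)  # 4/7 after chance correction
```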
AMI, similarly to ARI, normalizes Mutual Information (MI) against chance [22]. The Mutual Information between two sets of classes C and K is equal to $MI(C, K) = \sum_{i} \sum_{j} P(i, j) \log \frac{P(i, j)}{P(i) P'(j)}$, where P(i) is the probability that an object from C belongs to class i, P'(j) is the probability that an object from K belongs to class j, and P(i, j) is the probability that an object belongs to both class i and class j. AMI is equal to $AMI = \frac{MI - E[MI]}{\max(H(C), H(K)) - E[MI]}$, where H(C) and H(K) designate the entropies of C and K.
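The chance correction in AMI can be made concrete with a small pure-Python sketch. The expected MI is computed under the standard hypergeometric model of random labelings with fixed cluster sizes. Normalizing by the larger of the two entropies is an assumption made for this example (recent sklearn versions default to the arithmetic mean instead), and the function names are hypothetical.

```python
from collections import Counter
from math import comb, log


def entropy(labels):
    n = len(labels)
    return -sum(v / n * log(v / n) for v in Counter(labels).values())


def mutual_information(c, k):
    n = len(c)
    size_c, size_k = Counter(c), Counter(k)
    return sum(
        nij / n * log(n * nij / (size_c[i] * size_k[j]))
        for (i, j), nij in Counter(zip(c, k)).items()
    )


def expected_mutual_information(c, k):
    """E[MI] under random labelings that keep the cluster sizes fixed."""
    n = len(c)
    total = 0.0
    for a in Counter(c).values():
        for b in Counter(k).values():
            for nij in range(max(1, a + b - n), min(a, b) + 1):
                # Hypergeometric probability of an overlap of size nij.
                prob = comb(a, nij) * comb(n - a, b - nij) / comb(n, b)
                total += prob * nij / n * log(n * nij / (a * b))
    return total


def adjusted_mutual_information(c, k):
    mi = mutual_information(c, k)
    emi = expected_mutual_information(c, k)
    denominator = max(entropy(c), entropy(k)) - emi  # assumed "max" normalizer
    if denominator == 0:
        return 1.0  # degenerate case: both labelings are trivial
    return (mi - emi) / denominator


ami_identical = adjusted_mutual_information([0, 0, 1, 1], [0, 0, 1, 1])
ami_unrelated = adjusted_mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

In this toy case the two unrelated labelings share no information, so the adjusted score falls below zero once the expected MI is subtracted.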
V-measure, similar to F-measure, calculates the harmonic mean between homogeneity and completeness. Homogeneity is defined as $h = 1 - \frac{H(C|K)}{H(C)}$, where H(C|K) is the conditional entropy of C given K. Completeness is the symmetric counterpart of homogeneity: $c = 1 - \frac{H(K|C)}{H(K)}$.
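A small sketch of the V-measure computation follows, using conditional entropies estimated from label counts. The function names are illustrative, and the convention of returning 1 when an entropy is zero is an assumption borrowed from common implementations. Because the harmonic mean is symmetric, the final score does not depend on which labeling is treated as the reference.

```python
from collections import Counter
from math import log


def entropy(labels):
    n = len(labels)
    return -sum(v / n * log(v / n) for v in Counter(labels).values())


def conditional_entropy(x, given):
    """H(x | given), estimated from the joint label counts."""
    n = len(x)
    size_given = Counter(given)
    return -sum(
        nij / n * log(nij / size_given[g])
        for (_, g), nij in Counter(zip(x, given)).items()
    )


def v_measure(c, k):
    """Harmonic mean of the two normalized conditional-entropy terms."""
    h_c, h_k = entropy(c), entropy(k)
    homogeneity = 1.0 if h_c == 0 else 1.0 - conditional_entropy(c, k) / h_c
    completeness = 1.0 if h_k == 0 else 1.0 - conditional_entropy(k, c) / h_k
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)


# One reference class is split across two clusters: one term is 2/3, the other 1.
score = v_measure([0, 0, 1, 2], [0, 0, 1, 1])
```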
Graph visualization
The different datasets were transformed into GraphML files with Python scripts using the igraph library [62]. Graphs were visualized using the Gephi software [24] and laid out using ForceAtlas2 [63], a graph layout algorithm implemented in Gephi.
Pathway enrichment analysis
We used the KEGG pathway database to identify pathways related to specific genes [64]. We selected the genes scored with significant eeSNVs for the metastasis cells from the Kim dataset and for the CTCs from the Ting dataset. We then used the DAVID 6.8 functional annotation tool to identify significant pathways among these genes [25], with the default significance threshold (adjusted p-value of 0.10). Significant pathways were then represented as a bipartite graph using Gephi: nodes are either genes or pathways, and the size of each node represents the score of the gene or, in the case of a pathway, the sum of the scores of the genes linked to that pathway. We used the same methodology to infer the significant pathways of cancer cells, compared to normal cells, from the Hou dataset. However, we used all the ranked genes rather than only the significant genes, since only a few genes were found to be significant for cancer cells.
Code availability
The SNV calling pipeline and SSrGE are available through the following GitHub project: https://github.com/lanagarmire/SSrGE.
Author contributions
LG envisioned this project. OP implemented the project and conducted the genomics analysis; XZ and TC helped with the implementation. OP and LG wrote the manuscript. All authors have read and approved the manuscript.
Competing financial interests
The authors declare no competing financial interests.
Supplemental Materials
Supplementary Figure S1: The SNV calling pipeline. It follows GATK’s “Best Practice” workflow for SNP and INDEL calling, with four steps. Step 1: alignment. Step 2: preprocessing of BAM files. Step 3: read realignment and recalibration. Step 4: variant calling.
Supplementary Figure S2: Sketch of the Sparse SNV inference to Reflect Gene Expression (SSrGE) linear models. The SNVs calculated from the SNV calling pipeline (Supplementary Figure S1) are transformed into a predictor matrix M_SNV. Gene expression is the response matrix M_GE. For each gene, a LASSO regression is fitted to identify the non-null coefficients of the matrix W. The output of the models is a set of filtered eeSNVs and a set of corresponding genes in which eeSNVs are found.
Supplementary Figure S3: Bar plot comparing the clustering performance using eeSNV vs. gene expression (GE) as features, over four datasets and five different clustering strategies. The metrics used are (A): Adjusted Rand Index (ARI), and (B): V-measure.
Supplementary Figure S4: Relationship between the best accuracy metrics and the LASSO regularization parameter α, over the four datasets and five different clustering approaches. The accuracy metrics are (A): Adjusted Mutual Information (AMI), (B): Adjusted Rand Index (ARI), and (C): V-measure.
Supplementary Figure S5: CIS score of the eeSNVs for the four datasets. (I) CIS effect for all the datasets and (II) CIS score for the top 100 eeSNVs.
Supplementary Figure S6: Pseudo-time reconstruction using the Monocle algorithm with gene expression from genes having eeSNVs as features.
Supplementary Figure S7: Comparison of the batch-effect on SNVs and gene expression, using scRNA-seq data from glioblastoma patient MGH26.
Supplementary Table S1: Regularization values (α) used for the clustering procedures along with the number of eeSNV features.
Supplementary Tables S2: Ranked eeSNVs and genes for each dataset (with minimum regularization filtering α = 0.1).
Supplementary Tables S3: Ranked genes for the metastasis single-cells from the Kim dataset (mRCC).
Acknowledgements
This research was supported by grants K01ES025434 awarded by NIEHS through funds provided by the trans-NIH Big Data to Knowledge (BD2K) initiative (http://www.bd2k.nih.gov), P20 COBRE GM103457 awarded by NIH/NIGMS, R01 LM012373 awarded by NLM, R01 HD084633 awarded by NICHD and Hawaii Community Foundation Medical Research Grant 14ADVC-64566 to L.X. Garmire. We acknowledge K. Chaudhary for manuscript proofreading.