SCReadCounts: Estimation of cell-level SNVs from scRNA-seq data

SCReadCounts is a method for a cell-level estimation of the sequencing read counts bearing a particular nucleotide at genomic positions of interest from barcoded scRNA-seq alignments. SCReadCounts generates an array of outputs, including cell-SNV matrices with the absolute variant-harboring read counts, as well as cell-SNV matrices with expressed Variant Allele Fraction (VAFRNA); we demonstrate its application to estimate cell level expression of somatic mutations and RNA-editing on cancer datasets. SCReadCounts is benchmarked against GATK and Samtools and is freely available as a 64-bit self-contained binary distribution (Linux), along with MacOS and Python installation. Availability https://github.com/HorvathLab/NGS/tree/master/SCReadCounts Supplementary Information SCReadCounts_Supplementary_Data.zip

To aid these types of studies, we have developed a tool -SCReadCounts -for cell-level estimation of reference and variant read counts (nvar and nref, respectively), from pooled barcoded scRNA-seq alignments. Provided a list of variant sites, SCReadCounts estimates nvar and nref, calculates expressed Variant Allele Fraction (VAFRNA = nvar / (nvar + nref)) and outputs cell-SNV matrices. The cell-SNV matrices can be used as inputs for a wide range of downstream analyses. We demonstrate the application of SCReadCounts to estimate cell level expression of somatic mutations and RNA-editing on cancer datasets from Adrenal Neuroblastoma (Dong et al., 2020). We also exemplify a downstream application to correlate VAFRNA to GE using scReQTL (Liu et al., 2020).
Results ScReadCounts requires two inputs: a barcoded scRNA-seq alignment, and a list of genomic positions of interest. In the exemplified workflow (Fig.1a), the barcodes and the Unique Molecular Identifiers (UMIs) are processed using UMItools, and the sequencing reads are aligned to the reference genome (GRCh38) using STAR (v.2.5.7b) (Smith et al., 2017;Dobin et al., 2013). The resulting pooled alignments can be filtered to correct for allele-mapping bias (WASP, Van De Geijn et al., 2015); this filtering utilizes the same list of positions to be used as input for scReadCounts. Examples of positions of interest include SNVs called in the corresponding alignments (i.e. using GATK, see Fig.1a), or user-specified lists of coordinates from external sources, such as sets of somatic mutations (COSMIC) or RNA-editing loci (Auwera Mauricio O. et al., 2002;Tate et al., 2019;Picardi et al., 2017).
SCReadCounts generates three main outputs in a tab-separated value format: (1) a table containing nvar and nref with quality and filtering metrics for each barcode, (2) a cell-SNV matrix with the absolute nvar and nref counts, and, (3) a cell-SNV matrix with the VAFRNA estimated at a user-defined threshold of minimum number of required sequencing reads (minR) (S_Fig.1).
Performance We compared the SCReadCounts estimations with the analogous modules of the mpileup utility of Samtools (Li et al., 2009) and the haplotype caller of GATK. SCReadCounts default options generate nearly identical values to mpileup and GATK ( Fig.1b and S_Fig.2). SCReadCounts uses, by default, a very simple read-filtering criteria, but it can also be readily configured to achieve scenario-specific, mpileup-consistent, or GATK-consistent results, with optional explicit output of the number of reads discarded by each filtering rule. In regard to efficiency, on our system (2x14 cores CPUs with 1.5TB RAM compute node) processing of a file containing ~5000 cells, ~150mln reads, and ~80K SNVs, requires approximately 4h for the estimation of nvar and nref, and up to 2 minutes for the generation of the cell-SNV matrices. The later enables the users to quickly generate VAFRNA matrices at various minR.
Applications We first tested the ability of SCReadCounts to assess the expression of known somatic mutations in the neuroblastoma scRNA-seq dataset. To do that we extracted from COSMIC 404,693 single nucleotide substitutions located in Cancer Census Genes and not overlapping with known germline SNV loci (S_Table 1). SCReadCounts estimated detectable expression of a number of COSMIC mutations in a low to moderate proportion of the individual cells. An example is COSV67805199 in the gene RHOH, with a variable VAFRNA (minR = 5) across a number of cells from sample SRR10156295 (Fig.1c).
Next, we sought to assess if SCReadCounts can detect cell-specific RNA-editing levels. We used the same set of neuroblastoma samples, this time assessing 101,713 single nucleotide editing events from the REDI database (S_Table_2) (Picardi et al., 2017). At minR = 5, SCReadCounts identified multiple loci with variable levels of editing, some with apparent clustering in the two-dimensional space of certain cell-types (cell-types are classified using Seurat and Sin-gleR  (Fig.1e, right).
Considerations When applying SCReadCounts, the following considerations are in place. First, as mentioned earlier, modeling sequencing errors in the UMI is essential. Second, estimation of nvar is sensitive to mapping, therefore WASP-correction of the alignments is recommended. Third, when estimating VAFRNA, the selection of minimal required number of reads is important; our results show that for the most of the current scRNA-seq datasets, minR=5 provides a reasonable balance between randomness of sampling (high minR) and inclusivity (low minR) (Prashant et al., 2020).
In summary, we believe that SCReadCounts supplies a fast and efficient solution for estimation of scRNA-seq genetic variance.