Abstract
As part of preparing scRNA-seq libraries, a diverse template is typically amplified by PCR. During amplification, spurious chimeric molecules can form between molecules originating in different cells. While several computational and experimental strategies have been suggested to mitigate the impact of chimeric molecules, chimeras have not been addressed in the context of scRNA-seq experiments. We demonstrate that chimeras become increasingly problematic as samples are sequenced more deeply and propose two computational solutions. The first is unsupervised and relies only on cell barcode and UMI information. The second is a supervised approach built on labeled data and a set of molecule-specific features. The classifier can accurately identify most of the contaminating molecules in a deeply sequenced species mixing dataset. Code is publicly available at https://github.com/asncd/schimera.
Context Specific Definitions
Cell barcode (CBC): A random nucleotide barcode present on the reverse transcription primer used to capture RNA that is unique for each single cell in an experiment.
Unique Molecular Identifier (UMI): also known as a Random Molecular Tag (RMT). This refers specifically to a random nucleotide barcode present on the reverse transcription primer. For a given cell (as identified by a CBC), every captured mRNA molecule should be uniquely labeled by a UMI. PCR amplification is exponential and its efficiency depends on initial concentration and various sequence features. A UMI should identify all amplicons that originated from the same transcript. Thus, UMIs can be used for normalization of transcript count and allow correction of PCR bias.
Chimera: also known as a recombinant molecule or hybrid molecule. A chimera is a PCR artifact generated when two template molecules anneal to each other rather than to a primer. This yields molecules that combine a cell barcode from one cell with an mRNA from another cell.
Cell barcode UMI tag (CBCUMI): A unique pair of cell barcode and UMI. This pair can be associated with more than one mRNA molecule, either because two distinct original molecules received the same UMI by random chance or because of chimera formation.
Unique Molecule (UM): A unique combination of a CBC, UMI, and an mRNA transcript.
Literature Summary
(Meyerhans et al. 1990) Demonstrated chimeric PCR amplicons between two distinct HIV tat gene sequences could be reduced from 5.4% to 2% of all molecules by increasing extension time. Screening phage plaques using gene specific probes identified recombinant sequences.
(Odelberg et al. 1995) Demonstrated that recombinant molecules can be produced in a single round of extension without subsequent denaturation. They suggest the polymerase or the nascent strand can switch templates during synthesis. While template switching had been proposed in the context of RNA viruses and certain DNA polymerases, this paper demonstrated that it could occur during PCR with the widely used Taq. They used a set of plasmids that contained a constant middle region flanked by 20bp unique sequences. After a single round of primer extension, they performed competitive PCR (whose purpose is the same as qPCR) to demonstrate the presence of recombinant DNA. They report a 4-fold reduction in recombinant DNA when physically separating the forward and reverse strands by streptavidin beads during extension.
(Thompson 2002) Discussed a strategy, based on capillary electrophoresis, for measuring heteroduplex formation during PCR of three highly related rDNA sequences from species in the genus Vibrio. They demonstrate that mixed-species molecules can constitute a large fraction of the final amplified pool after 30 cycles of PCR. Introducing a molar excess of primers or reintroducing primers at later stages of PCR mitigated the problem.
(Ashelford et al. 2005) Suggested that up to 5% of 16S rRNA sequences in public databases may be anomalous, most of them chimeras.
(Smyth et al. 2010) By terminating PCR before the exponential phase of amplification ends, they show a reduction in chimera formation. They also emphasize the importance of polymerase processivity.
(Haas et al. 2011) Compared several chimera identification tools for 16S sequences that rely on sequence similarity metrics, including a novel tool of their own, and showed a surprisingly large fraction of chimeric reads.
Results
Computational UMI collapsing can obscure chimeras when library construction is pooled
In several single-cell transcriptional profiling methods, library construction across many cells is performed in a single pool, using cell-level barcoding to deconvolve transcription profiles from the sequencing data (Hashimshony et al. 2016; Macosko et al. 2015; Klein et al. 2015; Fan et al. 2015; Zheng et al. 2016; Jaitin et al. 2014). While some rely on in vitro transcription to perform linear amplification, they all involve at least one step in which the library is amplified via PCR. Additionally, enrichment PCR is sometimes performed, as in (Fan et al. 2015), to selectively amplify a specific subset of molecules of interest from a complex RNA library.
We examine the abundance of PCR chimeras in scRNA-seq libraries (Figure 1) and the factors that contribute to their prevalence. We focus on species mixing experiments (also known as “barnyard experiments”), in which human and mouse cells are mixed together before being processed (Figure 2). These experiments give a direct measurement of the rate at which more than one cell is incorporated into a droplet due to Poisson loading (see footnote 1).
For this paper, we focus on cross-species molecules from a deeply sequenced species mixing experiment in the Drop-seq paper (Table 1). These molecules can come from several sources, including: 1) ambient RNA, 2) barcode collisions due to random chance, and 3) chimeric molecules formed during PCR (Figure 1).
We find that 1.7% of all unique molecules (UMs) in cells that can be unambiguously assigned to a species come from the opposite species. To obtain an upper bound, we assume that all of these molecules come from chimeric events. Then, including presumed human-human and mouse-mouse chimeras, up to 5.2% of all molecules considered in this experiment would be chimeric (based on the expected fraction of 33% of chimeras being cross-species products; Table 2).
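The arithmetic behind this upper bound can be checked with a short sketch. The function name is hypothetical; the two inputs are the values quoted above, and the calculation assumes that cross-species products are a fixed 33% share of all chimeras.

```python
def chimera_upper_bound(cross_species_fraction, expected_cross_share):
    """Scale the observed cross-species fraction up to an estimate of all chimeras.

    Assumes every cross-species molecule is a chimera and that cross-species
    products make up `expected_cross_share` of all chimeric molecules.
    """
    return cross_species_fraction / expected_cross_share


# Values quoted in the text: 1.7% observed, 33% expected cross-species share.
upper = chimera_upper_bound(0.017, 0.33)
print(f"{upper:.1%}")  # 5.2%
```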
We show that cross-species UMs increase in relative abundance as a function of sequencing depth; they are associated with a lower number of reads per molecule (Figure 2B, C).
Paradoxically, sequencing deeply can reduce the statistical power to detect differential expression.
Transcripts per Transcript as a Metric for PCR Chimeras in scRNA-seq
Depending on the diversity of the UMI barcodes used in library preparation, most pairs of cell barcodes and UMIs should be uniquely associated with a single gene. PCR chimeras could cause swapping between cell barcodes (Figure 2). However, these events are more likely to happen during later stages of PCR. As a result, chimeric molecules will have a lower number of reads per molecule, and another gene sharing the same cell barcode-UMI pair is likely to have a higher number of reads.
We introduce the notion of transcripts per transcript (TPT) as a measure of chimera formation (Figure 3). Specifically, for transcript i in a set of molecules that share a unique pair of CBC and UMI, u, TPT is defined as:
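One form consistent with the surrounding description, offered here as a reconstruction rather than the author's exact notation, treats TPT as the share of reads at tag u that support transcript i. The symbols are assumptions: r_{i,u} is the number of reads for transcript i carrying the CBC-UMI pair u, and G_u is the set of transcripts observed with that pair.

```latex
\mathrm{TPT}_{i,u} = \frac{r_{i,u}}{\sum_{j \in G_u} r_{j,u}}
```

Under this reading, a tag associated with a single transcript has a TPT of 1, while a low-read transcript that shares its tag with a highly sequenced transcript has a TPT close to 0.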
In the deeply sequenced Drop-seq library, we were able to increase the power to detect differential expression between human and mouse cells by filtering molecules with a TPT less than 0.02 (Figure 3C).
We found 10% of molecules have a TPT less than 0.01 and almost 30% have a TPT less than 0.05 (Figure 4A). We observed far more genes per CBC-UMI tag than we would expect based on the diversity of UMI barcodes present (Figure 4B).
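A rough sense of how many shared tags UMI diversity alone would produce can be obtained from a birthday-problem style calculation. The sketch below assumes UMIs are drawn uniformly at random and uses an 8 bp UMI as an illustrative length; the molecule counts per cell are hypothetical, and the function name is not from the source. It ignores the case where two molecules with the same UMI come from the same gene, so it slightly overstates multi-gene tags.

```python
def expected_shared_umi_fraction(n_molecules, umi_length):
    """Probability that a molecule shares its UMI with at least one other
    molecule captured in the same cell, assuming uniformly random UMIs."""
    n_umis = 4 ** umi_length  # number of possible UMI sequences
    return 1.0 - (1.0 - 1.0 / n_umis) ** (n_molecules - 1)


# Hypothetical numbers of captured molecules per cell.
for n in (1_000, 10_000, 50_000):
    print(n, round(expected_shared_umi_fraction(n, umi_length=8), 3))
```

Comparing such an expectation with the observed number of genes per CBC-UMI tag is one way to judge whether random collisions alone could explain the data.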
Theseus provides an unsupervised approach for chimera filtering by performing TPT normalization and removing molecules whose TPT is below a certain threshold.
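A minimal Python sketch of this kind of TPT filter follows. It is not the code from the repository: the function name, the input layout, and the toy data are invented for illustration, and it assumes TPT is the fraction of a tag's reads that support a given gene.

```python
from collections import defaultdict


def tpt_filter(read_counts, threshold=0.02):
    """Keep molecules whose TPT is at least `threshold`.

    read_counts maps (cbc, umi, gene) -> number of reads.
    """
    # Total reads per CBC-UMI tag, summed over every gene seen with that tag.
    tag_totals = defaultdict(int)
    for (cbc, umi, _gene), reads in read_counts.items():
        tag_totals[(cbc, umi)] += reads

    kept = {}
    for (cbc, umi, gene), reads in read_counts.items():
        tpt = reads / tag_totals[(cbc, umi)]
        if tpt >= threshold:
            kept[(cbc, umi, gene)] = reads
    return kept


# Toy data (invented): one tag dominated by GENE_A plus a single-read GENE_B.
toy = {
    ("AAAC", "GGTT", "GENE_A"): 99,
    ("AAAC", "GGTT", "GENE_B"): 1,
    ("CCCA", "TTGG", "GENE_C"): 5,
}
filtered = tpt_filter(toy, threshold=0.02)
```

In the toy data, GENE_B has a TPT of 0.01 and is removed, while the other two molecules are kept.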
Yorimasa, a Supervised Approach
If cell types that should be distinct can be labeled, along with a subset of genes that should be mutually exclusive between those cell types, then chimeras can be filtered with a supervised approach. For example, in the species mixing experiment, cross-species molecules can be clearly identified. Without species mixing, a set of highly specific cell-type markers can be used instead.
We train a random forest classifier on the species mixing experiment using the following features of each UM (Figure 4):
Log2(reads per UM); Unique molecules with a low number of reads are more likely to be chimeric events.
TPT; Molecules with a low TPT are more likely to be chimeras
Log2(total mRNA abundance)
Log2(total CBC abundance)
Log2(Gene Length); Longer genes are expected to be more likely to form chimeras
GC content; High GC content is also expected to be more likely to form chimeras
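The feature list above could be assembled per unique molecule roughly as follows. This is a sketch under stated assumptions, not the repository's implementation: the function name and input layout are hypothetical, TPT is taken to be the fraction of a tag's reads supporting the gene, and "total abundance" of an mRNA or CBC is counted in unique molecules (the text does not say whether reads or molecules are used).

```python
import math
from collections import defaultdict


def build_features(read_counts, gene_length, gene_gc):
    """Return {(cbc, umi, gene): [six features]} for each unique molecule.

    read_counts: (cbc, umi, gene) -> reads
    gene_length: gene -> length in bp
    gene_gc:     gene -> GC fraction between 0 and 1
    """
    tag_reads = defaultdict(int)  # reads per CBC-UMI tag (for TPT)
    gene_ums = defaultdict(int)   # unique molecules per gene (assumed abundance unit)
    cbc_ums = defaultdict(int)    # unique molecules per cell barcode
    for (cbc, umi, gene), reads in read_counts.items():
        tag_reads[(cbc, umi)] += reads
        gene_ums[gene] += 1
        cbc_ums[cbc] += 1

    features = {}
    for (cbc, umi, gene), reads in read_counts.items():
        features[(cbc, umi, gene)] = [
            math.log2(reads),               # Log2(reads per UM)
            reads / tag_reads[(cbc, umi)],  # TPT
            math.log2(gene_ums[gene]),      # Log2(total mRNA abundance)
            math.log2(cbc_ums[cbc]),        # Log2(total CBC abundance)
            math.log2(gene_length[gene]),   # Log2(gene length)
            gene_gc[gene],                  # GC content
        ]
    return features
```

The resulting rows, together with cross-species labels, could then be passed to any random forest implementation that reports out-of-bag estimates, for example scikit-learn's RandomForestClassifier with oob_score=True.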
The classifier's AUC is 0.93. However, we note that our performance evaluation is slightly contaminated by the human-human and mouse-mouse chimeras that should be present, which contribute an unfair number of events labeled as “false positives”. Specifically, the false positive rate is defined as:
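The standard definition, which is presumably the one intended here, is given below. FP denotes labeled negatives (molecules that are not cross-species) that the classifier calls chimeric, and TN denotes labeled negatives that it calls non-chimeric.

```latex
\mathrm{FPR} = \frac{\mathrm{FP}}{\mathrm{FP} + \mathrm{TN}}
```

Because within-species chimeras sit in the labeled negative set, each one that is correctly detected is counted in FP.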
There are 3,394,611 molecules that are not cross species chimeras (our labeled set of “negatives”). Of those, we estimate on the order of 120,000 molecules are actually within-species chimeras. As such, if all chimeras are detected, including the within-species chimeras, the FPR will be at least 0.034. A rough approximation of the max AUC is 0.98 (subtracting the triangle defined by an FPR of 0.034).
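The maximum AUC approximation can be written out explicitly. As a sketch, assume the best achievable ROC curve loses only a triangle with base equal to the minimum FPR (0.034, from the text) and height 1:

```latex
\mathrm{AUC}_{\max} \approx 1 - \tfrac{1}{2}\,\mathrm{FPR}_{\min}
                    = 1 - \tfrac{1}{2}(0.034)
                    = 0.983 \approx 0.98
```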
By setting a threshold on the out-of-bag probability estimates from the random forest classifier such that the acceptable FPR is 0.034, 4.55% of all molecules are filtered (compared with the estimated upper bound of 5.2% chimeras above).
In a more shallowly sequenced 10x Genomics sample, available at http://support.10xgenomics.com/single-cell/datasets/hgmm_1k and sequenced to a depth of 69,793 reads per cell with an approximate saturation of 24.4%, we still noted significant cross-species contamination that could be classified with decent accuracy (Figure 5A, B).
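Choosing a probability threshold for a target FPR can be sketched as follows. The function names and the toy scores are hypothetical; the scores stand in for out-of-bag chimera probabilities of the labeled negatives, and ties are resolved conservatively so the realized FPR never exceeds the target.

```python
import math


def threshold_for_fpr(negative_scores, target_fpr):
    """Return a threshold t such that flagging scores > t mislabels at most
    a `target_fpr` fraction of the labeled negatives."""
    ranked = sorted(negative_scores, reverse=True)
    # Number of negatives we are allowed to flag (small epsilon guards rounding).
    allowed = math.floor(target_fpr * len(ranked) + 1e-9)
    if allowed >= len(ranked):
        return float("-inf")  # every negative may be flagged
    # Only scores strictly above the (allowed + 1)-th highest negative are flagged.
    return ranked[allowed]


def flag_chimeras(scores, threshold):
    """Mark molecules whose predicted chimera probability exceeds the threshold."""
    return [score > threshold for score in scores]


# Toy out-of-bag probabilities for ten labeled negatives (invented values).
negatives = [0.05, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9]
cutoff = threshold_for_fpr(negatives, target_fpr=0.2)
flags = flag_chimeras(negatives, cutoff)
```

With these toy values the cutoff is 0.6, so two of the ten negatives are flagged, matching the target FPR of 0.2. The same cutoff would then be applied to all molecules to decide which ones to remove.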
Yorimasa is an approach that allows users to specify a set of cells and a set of genes that should not be present in any of the other cells. The labels are used to train a classifier. Molecules confidently predicted as chimeras are removed.
Discussion
The challenges of PCR chimeras in scRNA-seq include:
Difficulty in obtaining sensitive detection of medium to lowly expressed transcripts while retaining power to detect differential expression between more highly expressed transcripts (as sequencing depth increases to detect lowly expressed transcripts, chimeras amongst highly expressed transcripts are detected)
Artificially increased estimates of sequencing saturation if low read abundance chimeras are counted
When performing enrichment PCR to focus on a subset of the library, there will be an increased prevalence of the chimera problem due to higher sequence similarity between amplified molecules
Potential experimental solutions that would also help with the problem include:
Longer cell barcodes and UMIs
For disambiguating random barcode collisions from true chimeras
Droplet PCR (Figure 5C)
If each input molecule is amplified individually, chimera formation is not possible
The library should also be more uniform in terms of reads per molecule
Pipelines that process scRNA-seq should begin to incorporate some level of chimeric correction. While certainly not perfect, we hope that our two methods, Yorimasa and Theseus, will help tackle these mythically monstrous mixed molecules in scRNA-seq.
Software Accessibility
Example code using Yorimasa and Theseus is available at https://github.com/asncd/schimera
Acknowledgements
The author thanks all members of the Regev Lab for helpful discussions, encouragement, and support, especially Rebecca Herbst for her help with early brainstorming, Aviv Regev for mentorship, and Carl de Boer and Christoph Muus for their insightful feedback on the manuscript. AD was supported by an NDSEG fellowship.
Footnotes
1 Less commonly noted, the correlation between the number of reads mapping to human and mouse for doublets can indicate the extent to which bead/droplet quality is the major factor influencing capture efficiency. (Drop-seq datasets can have an R² greater than 0.6.)