Validation of scRNA-seq by scRT-ddPCR using the example of ErbB2 in MCF7 cells

Single-cell RNA sequencing (scRNA-seq) can unmask transcriptional heterogeneity facilitating the detection of rare subpopulations at unprecedented resolution. In response to challenges related to coverage and quantity of transcriptome analysis, the lack of unbiased and absolutely quantitative validation methods hampers further improvements. Digital PCR (dPCR) represents such a method as we could show that the inherent partitioning enhances molecular detections by increasing effective mRNA concentrations. We developed a scRT-ddPCR method and validated it using two breast cancer cell lines, MCF7 and BT-474, and bulk methods. ErbB2, a low-abundant transcript in MCF7 cells, suffers from dropouts in scRNA-seq and thus calculated fold changes are biased. Using our scRT-ddPCR, we could improve the detection of ErbB2 and based on the absolute counts obtained we could validate the scRNA-seq fold change. We think this workflow is a valuable addition to the single-cell transcriptomic research toolbox and could even become a new standard in fold change validation because of its reliability, ease of use and increased sensitivity.


Introduction 45
RNA-seq is the method of choice for gene expression analysis. Herein, differential expression 46 (DE) analysis between two conditions is pivotal to answer challenging questions in research 47 and clinical applications. In bulk RNA-seq, population heterogeneity remains covert, whereas 48 scRNA-seq can capture delicate differences between cells [1]. The development of new 49 platforms expresses the growing interest in scRNA-seq [2-6] and allows novel applications, 50 such as unmasking transcriptional heterogeneity in healthy and cancerous tissues by 51 functional clustering [7-9], discovering uncharacterized cell types [10], and identifying 52 phylogenetic relationships between cells [11]. scRNA-seq enables researchers to understand 53 underlying mechanisms of drug resistance development and relapse in disease treatment by 54 the detection of rare subpopulations at unprecedented resolution [5,8,12]. However, the 55 inherent low sample input in scRNA-seq introduces a significant amount of noise, which 56 increases the propensity for dropouts and artificially increases cell-to-cell variability [13,14]. 57 This is especially dramatic with respect to low-abundant transcripts, which are often referred 58 to as highly interesting but difficult to reliably analyze [15][16][17][18]. Furthermore, the tremendous 59 variety of platforms and bioinformatics tools has not yet solidified into a consistent pipeline 60 [13,19,20]. Additionally, the protocol impacts results, as plate-based Smart-seq2 [21,22] 61 proved to be more sensitive, especially regarding low-abundant transcripts compared to the 62 droplet-based Chromium system from 10X Genomics [23,24]. 63 Thus, DE analysis from scRNA-seq must be independently confirmed by single-cell PCR 64 [25]. Several scRT-qPCR workflows have been described [26][27][28][29][30] as well as a few 65 scRT-ddPCR workflows [31][32][33]. The majority of these workflows use fluorescence-activated 66 cell sorting (FACS) for single-cell isolation [27,28,30,31,34], while other studies use 67 microfluidic devices [32,33], micromanipulators [26] or manual cell picking [29]. Despite its 68 widespread use, single-cell isolation with FACS requires high sample input and the inherent 69 shear forces can damage the cells and impair RNA integrity [34,35]. Furthermore, qPCR is 70 less sensitive and more susceptible to inhibitors compared to dPCR [17,36], while the 71 detection mechanism of dPCR allows absolute quantification without reference [37]. 72 Particularly, the lower sensitivity of qPCR hampers its use for challenging, single-cell mRNA 73 quantification with a focus on low-abundant transcripts. 74 Therefore, we here propose a novel method for the validation of fold changes from scRNA-75 seq. Our scRT-ddPCR method combines gentle (ensuring high cell viability ~ 80 %) and 76 highly reliable (~ 90 % single cell isolation efficiency) single cell isolation using the 77 F.SIGHT™ single-cell dispenser (CYTENA GmbH, Freiburg) [38] and contact-free liquid 78 handling (I.DOT, Dispendix, Stuttgart) with highly sensitive dPCR [36]. The F.SIGHT™ 79 requires minimal sample input (down to 5000 cells in 5 µl) and its image-based analysis 80 ensures single-cell isolation and delivers an image proof of each dispensing event, which can 81 be unambiguously assigned to the addressed well of the microplate. Through partitioning, 82 dPCR can reliably detect single molecules and enables absolute quantification [17,32,36]. The 83 latter characteristic is key to inter-experimental comparisons as absolute counts are 84 independent of any standard and depict the ground truth. For the validation of our scRT-85 ddPCR, we used two breast cancer cell lines, MCF7 and BT-474, the latter overexpresses 86 ErbB2 [39][40][41]. We found high concordance between mRNA counts from scRT-ddPCR and 87 bulk RT-ddPCR methods. Interestingly, ErbB2 log2FCs were significantly different between 88 scRNA-seq and scRT-ddPCR. We assume that the inherent partitioning of dPCR increases 89 sensitivity and resolution, and thus allows us to confirm or reject fold changes from 90 scRNA Freiburg, Germany). The settings for MCF7 cells were 10 to 25 µm cells size (BT-474: 10 to 147 30 µm) and 0.5 to 1 roundness (same for BT-474) (Fig 1a, S3a  , and is additionally controlled by cell images unambiguously assigned to 152 each dispensation event (Fig 1a).  Fig 1b). 207 208 2.6.2 Library preparation using Nextera XT and sequencing 209 Prior to tagmentation cDNA was normalized to 0.2 ng/µl. Tagmentation  was assessed with FASTQC v0.11.9 (exemplary images: Fig S1). were directly imported. Cells with an alignment efficiency below 80 % were filtered out 245 (salmon and kallisto) (Fig 1d and Tab S1). Cells with less than 60 % of uniquely mapped 246 reads were filtered out (STAR) (Fig 1e and  of each initial expression array is randomly subsampled with replacement (same length as 279 initial array). The log2FCs of the means of these expression arrays are calculated per method. 280 Then the ratio r g of these log2FCs is determined (Eq 1). Subsampling and ratio calculation is 281 repeated 1,000 times. Finally, mean and 95 % confidence intervals (CI) of the new array r g is 282 determined. If the 95 % CI overlaps with 1, the methods are assumed to yield the same fold 283 change. 284 according to roundness and size from a heterogeneous population of particles (Fig 1a, only  302 colored dots are dispensed cells, the grey dots are either artefacts or cells that could not be 303 isolated) resulting in a homogeneous cell population (Fig 1a, boxplots at the edges). We 304 manually analyzed all images from putative single cells (84 cells per cell line) and found that 305 7 % (MCF7) and 4 % (BT-474) were doublets or empty droplets (Tab S1). The cells were 306 directly dispensed into the lysis buffer and processed by down-scaled SMART-Seq® and 307 down-scaled Nextera XT protocols (2.6.1 cDNA synthesis using SMART-Seq® Single Cell 308 Kit and 2.6.2 Library preparation using Nextera XT and sequencing). The average fragment 309 length for tagmented cDNA was 459 bp and 432 bp for MCF7 and BT-474 cells, respectively. 310 According to Jaeger et al.
[62], this is an indication for good quality, tagmented cDNA. 311 Representative electropherograms of cDNA and tagmented cDNA are shown in Fig 1b and  312 1c. FastQC analysis revealed an average Phred score of above 30 for both cell lines (Fig S1). 313 We analyzed sequencing data using three common aligners:  (Fig 1d and 1e and Tab S1). Further on, cells with less than 1E+5 reads were also 317 excluded from analysis (Fig 1f and Tab S1). All aligners yielded the same number of genes 318 per cell. 11676 genes per single MCF7 cell and 11682 genes per single BT-474 cell were 319 detected (Fig 1g). We also clustered our data with external data from Isakova et al.
[4], who 320 used 10-fold down-scaled Smart-seq2 protocol. The clusters of MCF7 cells exactly overlap, 321 while other cells formed independent clusters like the reference cell lines HEK293T and 322 fibroblasts (Fig 2a). We performed pseudo-bulk DE analysis with different input to DESeq2 323 (salmon, kallisto or STAR aligner count matrices) (Fig 2c, 2d and Tab S2). Interestingly, 324 salmon and kallisto predict the highly significant overexpression of OLFML3, RAMP3 and 325 VWA5A (only salmon), which we could not observe with STAR aligner (Fig 2c). To our 326 knowledge there is no supporting evidence for this overexpression in MCF7 cells in literature. 327 Furthermore, STAR aligner input to DESeq2 predicts clearly more DEGs in MCF7 than the 328 other two aligners (Fig 2d). We found that ErbB2 is significantly overexpressed in BT-474 329 cells as previously described [39], while ACTB as a housekeeping gene is not significantly 330 different between the cell lines (Fig 2c and Tab S2). These findings are consistent across all 331 aligners. Additionally, we evaluated the expression of two marker genes for MCF7 cells, 332 KRT8 and TFF1 [4]. TFF1 is overexpressed in MCF7 cells, while KRT8 shows differential 333 expression only with STAR aligner input to DESeq2 (Fig 2c and Tab S2). This underlines 334 furthermore the dissimilarity of the aligners used. 335 336

Validation of scRT-ddPCR using bulk methods 337
For scRT-ddPCR, we isolated cells on the same day and from the same culture as in the case 338 of down-scaled SMART-Seq® using F.SIGTH™ and I.DOT except that the cells were 339 dispensed into LBTW (Fig S3a and S3b). After lysis, the gene mRNA per cell counts were 340 determined directly form the lysate using digital PCR. First, we investigated varied volumes 341 of LBTW lysis buffer, as the carry-over of detergents may impair droplet formation or reverse 342 transcription and thus PCR efficiency [31,63]. The results indicate that as the volume of lysis 343 buffer increases, the number of formed droplets decreases (Fig 3a) due to increased areas of 344 coalescence (Fig S2a). The use of 0.5 µl LBTW produces no areas of coalescence and the 345 number of droplets is not significantly reduced, despite a drop of ~18 % in total droplet 346 number (Fig 3a). Similarly, we could not detected a significant difference between the ErbB2 347 and ACTB mRNA concentrations upon different volumes of lysis buffer (Fig S2b). with Bonferroni correction ; Fig 3b and Tab S5). The F.SIGHT™ records morphological 361 details of each dispensed cell but we could not detect any correlation between cell size and 362 number of mRNAs per singe cell (Fig S3c and S3d). These mRNA counts could be biased by 363 incomplete lysis of the single cell. To verify these mRNA counts, we checked the ability of 364 the lysis buffer (LBTW) to exert full dispersion of cell material prior to compartmentalization. 365 Thus, we used two commercially available methods for total RNA isolation, for which we 366 assume a 100 % isolation efficiency ('bulk'). The two methods differ in sample preparation 367 (DNase I digest vs. no digest and enzymatic lysate homogenization vs. mechanical lysate 368 homogenization), buffers and handling in general, but resulted in the same amount of ErbB2 369 or ACTB mRNAs per cell (Fig S2d). Of note, the RNA quality differs between the two 370 methods (Fig S2c and Tab S3). Similarly, a single-cell volume equivalent from a crude lysate 371 ('cl') cells was dispensed into the dPCR mix and ErbB2 and ACTB counts per single cell were 372 determined to validate that the lysis conditions have no effect on dPCR or the detection of the 373 transcripts. Comparing these extraction methods ('sc', 'bulk', 'cl') for BT-474 cells yields no 374 significant differences for both genes ErbB2 or ACTB (Fig 3b). Also ErbB2 counts in MCF7 375 cells are not significantly different between extraction methods. However, ACTB counts from 376 'bulk' are significantly higher than from 'sc' or 'cl' (p < 0.001, Mann-Whitney test with 377 Bonferroni correction). Although, ACTB is considered to be a housekeeping gene, its 378 variability due to possible uncontrolled conditions is already described [64]. In this given case 379 we assume, the variability might be related to the different passage numbers (Tab S4) or to a 380 different confluency state of the cell culture. Overall, the results of absolute gene mRNA 381 counts of total bulk RNA isolation methods and similarly the results of the crude lysates 382 confirm unbiased, quantitative transcript detection by our scRT-ddPCR method (Fig 3b) we observed a significantly different shape as we could not observe an accumulation of cells 407 in a bin of the histogram and the skewness is much lower (Tab S6). For high-abundant 408 transcripts such as ErbB2 in BT-474 cells, we could detect differences between the alignment 409 tools especially between salmon and kallisto, and STAR. This difference might be justified by 410 the missing normalization of raw counts from STAR aligner or by using the genome as 411 alignment reference. However, ACTB signal distributions show no such behavior, but data 412 from Isakova et al. are strikingly different compared to all our approaches (Fig 4b). Based on 413 the expression values (Fig 2b and Fig 3b), we calculated log2FCs (MCF7 vs. BT-474) (Fig  414  4c). Additionally, we bootstrapped and down-sampled the scRNA-seq groups to the same 415 sample size as the scRT-ddPCR group to eliminate subsampling errors (Fig 4c, shaded bars). 416 The so calculated log2FCs do not differ from the log2FCs calculated by DESeq2 (blue and 417 green arrows ; Fig 2c, 4c and Tab S2). All ACTB log2FCs from scRNA-seq and scRT-ddPCR 418 fluctuate within 0 ± 1, which is the null hypothesis of DESeq2 [56], meaning that there is no 419 differential expression between cell lines (Fig 4c). The fluctuation probably depicts statistical 420 noise. To compare log2FCs between methods, we bootstrapped their ratio and calculated 421 95 % CIs. If these 95 % CIs overlap with 1, the methods are assumed to determine the same 422 log2FC (Eq 1 and Fig S4). Log2FCs from both, scRNA-seq (with salmon, kallisto or STAR 423 aligner) and scRT-ddPCR, confirm the overexpression of ErbB2 in BT-474 cells, although to 424 a significantly different extent, while we could not detect any difference between the log2FCs 425 from the different aligners (Fig 4c). scRT-ddPCR predicts significantly stronger 426 overexpression of ErbB2 in BT-474 cells. This can potentially be explained by the biased 427 detection of ErbB2 expression in MCF7 cells by scRNA-seq (Fig 4a). 428

Discussion 429
In this study, we present a novel, orthogonal method, scRT-ddPCR, for the validation of 430 scRNA-seq fold changes. Durst et al. found that absolute quantification is the most reliable 431 approach for single-cell analysis [39], which is the key feature of dPCR and delivers a ground 432 truth that facilitates inter-experimental comparisons as it is detached from any standard. This 433 is achieved by the inherent partitioning of dPCR, which further allows spatially separated but 434 simultaneous reverse transcription of mRNAs. This potentially improves cDNA capture 435 through enrichment. Thus, for low-abundant transcripts, which are often referred to as highly 436 interesting but difficult to reliably analyze [15-18], dPCR might therefore be of great 437 advantage. 438 First, we aimed to enhance molecular detections for SMART-Seq® by down-scaling, which is 439 frequently applied to scRNA-seq protocols to increase throughput, reduce costs, and increase 440 sensitivity, while maintaining data quality [4,59-62]. We demonstrated that our down-scaled 441 SMART-Seq® protocol using F.SIGHT™ and I.DOT delivers high quality data (Fig 1, S1  442 and Tab S1). We validated our method by comparative UMAP-clustering against published, 443 down-scalded data [4] and found excellent conformity (Fig 2a) sensitive enough to detect low-abundant ErbB2 mRNA in MCF7 cells as we show (Fig 4a). 447 We relate these dropouts to the simultaneous reverse transcription of multiple 448 poly(A)-mRNAs into cDNAs. Thus, our data support previous findings of dropouts in 449 scRNA-seq [13,14]. 450 Secondly, we successfully validated our scRT-ddPCR method (Fig 3 and S2), which 451 underlines that scRT-ddPCR can serve as a ground truth. We could detect ErbB2 expression 452 in MCF7 cells without dropouts (Fig 4a) potentially because of the inherent partitioning step 453 in dPCR, which increases the effective mRNA concentration [36]. ErbB2 expression in BT-454 474 cells and ACTB expression in MCF7 and BT-474 cells could be similarly detected (Fig  455  3b, 4a and 4b). Finally, we compared log2FCs from scRNA-seq and scRT-ddPCR and found that ACTB 465 log2FCs from scRNA-seq were not different from scRT-ddPCR log2FCs (Fig 4c). In both 466 cell lines, ACTB expression has a good signal distribution for scRT-ddPCR and all aligners 467 used in scRNA-seq (Fig 4b and Tab S6). Strikingly, the signal distribution obtained from 468 published data shows a much stronger skewness (Fig 4b and Tab S6), which could be an 469 indication that our down-scaled protocol using F.SIGHT™ and I.DOT performs better than 470 existing down-scaled versions. While ErbB2 fold changes are consistent among the aligners 471 used, we found a significant difference between scRNA-seq and scRT-ddPCR (Fig 4c). We 472 hypothesize that these differences originate from the heavily skewed signal distributions 473 (skewness ≈ 3) of ErbB2 in MCF7 cells, which indicate dropouts (Fig 4a and Tab S6). 474 However, scRNA-seq and scRT-ddPCR use different priming strategies for cDNA synthesis 475 [68] and in scRNA-seq protocols more PCR steps are included (i.e. cDNA amplification, 476 tagmented library amplification, bridge amplification), which potentiate biases and promote 477 dropouts. Biases in scRNA-seq could also originate from the bioinformatics tools used as we 478 observe that some genes are predicted to be overexpressed with some aligners (Fig 2c and 2d) 479 and that overexpression is not consistent across the aligners (Tab S2).

480
In this study, we only evaluate the impact of dropouts on individual fold changes but we 481 assume that this has far-reaching implications on DE analyses and their conclusions, 482 especially since this is not an issue limited to scRNA-seq but also exists in conventional 483 RNA-seq [69-71]. However, it is pronounce in scRNA-seq because of low sample input 484 [13,14]. In concordance with this, we could show that the alignment tool has an impact on the 485 amount of DEGs and on the fold changes (Fig 2c, 2d, 4c