Preprocessing choices affect RNA velocity results for droplet scRNA-seq data

Experimental single-cell approaches are becoming widely used for many purposes, including investigation of the dynamic behaviour of developing biological systems. Consequently, a large number of computational methods for extracting dynamic information from such data have been developed. One example is RNA velocity analysis, in which spliced and unspliced RNA abundances are jointly modeled in order to infer a ‘direction of change’ and thereby a future state for each cell in the gene expression space. Naturally, the accuracy and interpretability of the inferred RNA velocities depend crucially on the correctness of the estimated abundances. Here, we systematically compare five widely used quantification tools, in total yielding thirteen different quantification approaches, in terms of their estimates of spliced and unspliced RNA abundances in five experimental droplet scRNA-seq data sets. We show that there are substantial differences between the quantifications obtained from different tools, and identify typical genes for which such discrepancies are observed. We further show that these abundance differences propagate to the downstream analysis, and can have a large effect on estimated velocities as well as the biological interpretation. Our results highlight that abundance quantification is a crucial aspect of the RNA velocity analysis workflow, and that both the definition of the genomic features of interest and the quantification algorithm itself require careful consideration.


Introduction
sequence of length L − 1 (where L is the RNA read length of the respective study) is added on each side of each intron to account for reads mapping across exon/intron boundaries. For comparison, we also 152 reimplemented the extraction of transcript and intron sequences (for both the 'separate' and 'collapse' 153 approaches) directly using functions from the GenomicFeatures and BSgenome R/Bioconductor packages 154 (Pagès 2019; Lawrence et al. 2013). Code used to extract the features has been included in the eisaR 155 package (https://github.com/fmicompbio/eisaR), from v0.9, for convenience. In each case, the ex-156 tracted transcript and intron sequences were written to a joint FASTA file, summarized in Table 1. Upon 157 comparison of the two implementations, we noticed that the current release version (v1.0.0) of BUSpaRse 158 returned erroneous feature sequences for multi-exonic transcripts on the negative strand (Fig. S2). For 159 this reason, the features extracted by BUSpaRse were not used for further analyses.

FASTA file name Sequence extraction Description
BUSpaRse-separate BUSpaRse Introns extracted separately from each transcript isoform BUSpaRse-collapse BUSpaRse Introns extracted after collapsing all transcripts of a gene eisaR-separate eisaR Introns extracted separately from each transcript isoform eisaR-collapse eisaR Introns extracted after collapsing all transcripts of a gene Table 1: Summary of reference FASTA files containing exonic and intronic sequences. Only files generated by eisaR were used for creation of quantification indices. 161 The combined transcript and intron FASTA files were used to build the following quantification indices 162 (summarized in Table 2 the aim to exclude reads coming from intergenic regions of the genome. Across data sets and reference 174 specifications, this excluded between 1 and 2.5% of the reads from the quantification. For the quantification based on spliced-only transcripts, in which the genome decoy would also capture unambiguously 176 intronic reads, this number was between 2 and 10%. Finally, we built an index for CellRanger (v3.0.2) 177 (Zheng et al. 2017) and one for STAR (v2.7.3a) (Dobin et al. 2013) based on the reference genome and 178 GTF file from Gencode. The splice junction database overhang in the STAR index was set to 150nt, which 179 is at least as long as the read length minus one for all data sets considered here.  In other words, the transcript IDs are provided as the -c argument to quantify the exonic abun-228 dances, and the intron IDs are provided to quantify the intronic abundances. In practice, this 229 means that reads in equivalence classes containing at least one transcript are retained for the ex-230 onic quantification, and reads in equivalence classes containing at least one intron are retained for 231 the intronic quantification. Hence, equivalence classes containing both exonic and intronic features 232 will be provided to both the exonic and intronic quantification steps.

233
• 'exclude', where the features that are not of interest for the quantification at hand are provided 234 to kallisto|bustools, and subsequently excluded. In other words, intron IDs are provided as the -c 235 argument to quantify the exonic abundances, and the transcript IDs are provided to quantify the 236 intronic abundances, and in addition the -x flag is used to indicate that the provided IDs represent 237 sequences to be excluded. In practice, this means that only reads in equivalence classes that don't 238 contain any introns will be retained for the exonic quantification, and only reads in equivalence 239 classes that don't contain any transcripts will be retained for the intronic quantification. Hence, 240 equivalence classes containing both exonic and intronic features will be excluded in both steps.

241
In addition to the manual application of kallisto|bustools as described above, we applied the kb- counts (below referred to as starsolo). Second, we used the 'Gene' count matrix as the exonic counts, and 252 the difference between the 'GeneFull' and 'Gene' counts as the intronic counts (below referred to as 253 starsolo_diff ). For genes where the 'Gene' counts were higher than the 'GeneFull' counts, the intronic 254 count was set to zero. This can happen, for example, for a gene located in the intron of another gene.

255
In the 'GeneFull' quantification, reads mapping to such a gene are considered ambiguous and therefore 256 discarded. However, they may be assigned in the 'Gene' quantification, if they are compatible with the 257 annotated gene model. An overview of the evaluated quantification approaches is provided in Table 3.   Table 3: Summary of quantification approaches. In addition to the strategies included in this table, we applied alevin_sep_gtr after using different flank length when defining the intronic reference sequences, alevin_sep_gtr in unstranded mode, and the kb-python wrapper around kallisto|bustools. The effects of these modifications are evaluated in Fig. S3.
parameters to the PCA output to obtain a two-dimensional representation that was used for visualization All velocity estimates were embedded into the same low-dimensional representation, calculated from the spliced-only abundance quantification by alevin as described above. The embedded velocity vector for cell i, calculated by scVelo, is given byṽ whereπ ij is the transition probability from cell i to cell j (derived from the cosine correlation π ij ), n is the number of cells, andδ ij =s j −s i s j −s i is the normalized difference of the coordinates of cells i and j in the low-dimensional embedding. The fact thatδ ij is normalized implies that the length ofṽ i indicates to what extent the cells to which cell i has high transition probabilities are all located in the same direction from cell i in the low-dimensional representation. It further implies that the embedded velocities are potentially more comparable across methods than the original velocity vectors, since the magnitudes of the latter depend on the normalized abundance levels of the genes, and since the velocity vectors will only be directly comparable between methods if they are based on the same set of input genes. In order to compare the velocity embeddings across methods, we calculate a concordance score for each cell. The score for cell i is defined as the ratio between the length of the sum of the embedded velocity vectors for cell i across all quantification methods, and the sum of the lengths of the individual embedded velocity vectors. In other words, the score for cell i is given by where the sum is taken over all methods m, andṽ representation, this ratio will be close to 1, while if there is less concordance between the different 324 quantification methods, the ratio will be lower than 1.

326
The total UMI count varies between methods

327
With the aim of characterizing global differences between the counting methods, we first directly com-328 pared the total UMI count assigned to exonic and intronic targets by each of the methods. We added up 329 the counts across all cells, either across all genes or within gene subsets stratified by sequence uniqueness 330 ( Fig. 1, Pancreas data, similar values were obtained for the other three data sets). There are consider-331 able differences in the assigned UMI counts between the methods. Moreover, these differences are not 332 confined to a small number of susceptible genes, but can be seen across a large fraction of the expressed 333 genes (Fig. S4).

334
Overall, starsolo_diff and the alevin-based quantification approaches based on transcript/intron an-335 notations gave the highest total UMI counts, mainly driven by higher counts for the exonic targets. As 336 shown in Fig. 1, this is predominantly due to the assignment of reads to genes with a low fraction of 337 unique k-mers. This is in contrast to kallisto|bustools, velocyto and starsolo, which by default exclude 338 ambiguous reads that map to multiple genes from the quantification.

339
Velocyto and starsolo_diff gave the highest UMI counts for genes whose transcripts are all shorter than   alevin_coll_decoy_gtr  alevin_coll_gtr  alevin_sep_decoy_gtr  alevin_sep_gtr  alevin_spliced_unspliced_gtr  kallisto|bus_coll_excl  kallisto|bus_coll_incl  kallisto|bus_sep_excl  kallisto|bus_sep_incl  starsolo  starsolo_diff  velocyto  alevin_coll_decoy_gtr  alevin_coll_gtr  alevin_sep_decoy_gtr  alevin_sep_gtr  alevin_spliced_unspliced_gtr  kallisto|bus_coll_excl  kallisto|bus_coll_incl  kallisto|bus_sep_excl  kallisto|bus_sep_incl  starsolo  starsolo_diff  velocyto  alevin_coll_decoy_gtr  alevin_coll_gtr  alevin_sep_decoy_gtr  alevin_sep_gtr  alevin_spliced_unspliced_gtr  kallisto|bus_coll_excl  kallisto|bus_coll_incl  kallisto|bus_sep_excl  kallisto|bus_sep_incl  starsolo  Total UMI count across genes in bin Pancreas, total count, stratified by overall uniqueness (collapse) Figure 1: Total UMI count across genes and cells. The bars correspond to the total UMI count, and the split of these into counts for exonic and intronic targets, for each quantification method in the Pancreas data set. Similar patterns were seen in the other three data sets. In addition to the overall count (top row), the figure shows the total count after stratifying genes by the overall fraction of unique k-mers (using the 'collapse' annotation), indicated in the vertical panel headers together with the number of genes in the category. The genes for which no uniqueness information could be calculated (the 'NA' category) are those for which all transcripts are shorter than the chosen k-mer length (which was set to the read length minus one; here 150nt).

Individual genes exemplify methodological differences
Next, we aimed to find individual genes whose count patterns exemplify the main methodological dif-368 ferences among the counting strategies. First, we restricted the set of genes to those that were selected as 369 highly variable by scVelo, and thus used for velocity estimation, for at least one counting method. Notably, 370 while a large fraction of these genes were selected across all quantification methods, there were non-371 negligible differences between the sets of selected genes (Fig. S5). In particular, alevin_spliced_unspliced 372 gave the largest number of unique highly variable genes, followed by the starsolo variants and velocyto 373 depending on the data set. For each of the retained genes, we calculated the fraction of the total counts 374 that were assigned to unspliced features (summarized across all cells). Next, we calculated the standard 375 deviation, across the quantification methods, of these intronic count fractions and selected the top 10% 376 of the genes based on this measure. These genes were partitioned into 10 clusters based on the Pearson 377 correlation dissimilarity ( 2(1 − ρ) where ρ is the Pearson correlation) between the fractions of intron-378 assigned counts across methods, using hierarchical clustering with complete linkage (Fig. 2). The gene exonic and intronic counts. Conversely, running kallisto|bustools with 'exclude' capture discards many 396 reads in ambiguous regions, since they will typically be assigned to equivalence classes containing both 397 exonic and intronic targets, and this counting strategy therefore often returns low counts for both types 398 of features. While these effects can be seen in the absolute counts, they do not necessarily affect the ratio 399 of spliced and unspliced counts (Fig. S6).  Genes overlapping (introns of) other genes (clusters 1, 2, 3, 6, 10, exemplified by Chkb, Gm21983,  strand are considered ambiguous, and are therefore discarded, by kallisto|bustools, velocyto and starsolo.

411
In contrast, alevin does not discard the reads, but instead distributes them between the two features.

412
Thus, alevin tends to give a higher fraction of unspliced counts, and in many cases also a higher total 413 count, for genes with other genes in their introns (cluster 6, exemplified by Cnot6 in Fig. S7). It is worth 414 observing that alevin will often also assign reads to the gene located within the intron (Gm12191 in 415 Fig. S7, which is the gene in the intron of Cnot6). As for the previous category of genes, alevin with 416 the decoy approach double-counts reads mapping equally well to an exon and an intron, regardless of 417 whether or not the exon and the intron belong to the same gene.

418
In cases of exonic overlap between genes on the same strand (clusters 1 and 2, exemplified by Chkb 419 and Gm21983 in Fig. S7), all methods except alevin consider the corresponding reads ambiguous and 420 discards them, leading to a large difference in the total counts (Fig. 2). The main difference between lines pointing in different directions in certain regions of the plots (Figs. S15-S17). These differences are not captured by the velocity confidence estimates returned by scVelo for an individual method, which are 491 often high for all the cells (Figs. S18-S20), suggesting that the differences between methods are systematic 492 rather than just the result of random fluctuations or uncertainty in the velocity estimation. The similarity 493 among the low-dimensional velocity embeddings based on different quantification methods increased 494 somewhat when they were derived from the set of shared genes (Fig. 3). However, considerable differ-495 ences were still seen, indicating that the quantification does not only influence the velocity interpretation 496 via the selection of genes.

497
The lack of unambiguous ground truth complicates a direct evaluation of the accuracy of velocity any neighboring cell in the data set, which is used here as a proxy for a lack of systematic dynamics.

528
While the maximal cosine correlation varied considerably among cell types (Fig. S21-S24), the value was 529 typically slightly lower for the PFC data set than for the data sets with known dynamics (Fig. 4). The  Maximal cosine correlation by cell impact on the genes that are selected for inclusion by scVelo, but differences affecting the interpretation 543 remain even when the same set of genes is used across all methods.

544
Given the relative immaturity of the RNA velocity field, and the lack of a generally accepted method 545 for generating realistic, simulated data with known ground truth for this application, it is challenging 546 to rank the quantification methods in terms of absolute performance. However, some clear themes a gene can obtain a nonzero 'intronic' count despite completely lacking annotated introns. Moreover, genes located within introns of other genes will often obtain a nominally negative intronic UMI count 561 since non-junction-spanning reads mapping to the former will be considered ambiguous and discarded 562 in the 'GeneFull' counting. Fourth, for 3' tag data such as the 10x Genomics data we have considered in 563 this study, quantifying the spliced and unspliced transcripts (rather than spliced transcripts and introns 564 only) implies that for a large fraction of the reads, it is difficult to resolve whether they stem from the 565 spliced or unspliced target. Thus, this type of reference may be more suitable for full-length scRNA-seq 566 protocols, where reads are sampled across the entire length of the transcript.

567
Among the counting strategies contrasted in this manuscript, alevin_sep_gtr, kallisto|bus_sep_excl, star-568 solo and alevin_coll_gtr provided velocity embeddings most in line with expectations in the three real data 569 sets. However, even among these methods, there are large differences in the assigned counts as well as in 570 the handling of ambiguous reads and genomic regions. Going forward, we expect that improvements in 571 counting strategies for scRNA-seq data, specifically tailored for RNA velocity preprocessing, will likely 572 come alongside an increased understanding of the read generation process and the biases underlying 573 specific scRNA-seq library preparation protocols, and that different counting strategies may be optimal 574 for different types of scRNA-seq data. An increased understanding of the read generation process will 575 also enable realistic simulation of sets of spliced and unspliced scRNA-seq reads, which in turn will 576 provide an improved platform for objective evaluation of the performance of counting strategies.

577
Data and code access 578 The code used for our analyses is available on https://github.com/csoneson/rna_velocity_quant.

579
The spermatogenesis data can be downloaded from GEO, accession number GSE109033. The pancreas 580 data set was downloaded from GEO, accession number GSE132188. The dentate gyrus data set was