Single cell RNA-seq by mostly-natural sequencing by synthesis

Massively parallel single cell RNA-seq (scRNA-seq) for diverse applications, from cell atlases to functional screens, is increasingly limited by sequencing costs, and large-scale low-cost sequencing can open many additional applications, including patient diagnostics and drug screens. Here, we adapted and systematically benchmarked a newly developed, mostly-natural sequencing by synthesis method for scRNA-seq. We demonstrate successful application in four scRNA-seq case studies of different technical and biological types, including 5’ and 3’ scRNA-seq, human peripheral blood mononuclear cells from a single individual and in multiplex, as well as Perturb-Seq. Our data show comparable results to existing technology, including compatibility with state-of-the-art scRNA-seq libraries independent of the sequencing technology used – thus providing an enhanced cost-effective path for large scale scRNA-seq.

3 Cellular profiling by single-cell RNA-seq (scRNA-seq) now enables studying and characterizing cellular states and pathways at ever-growing experimental scales, including the Human Cell Atlas 1 , cell atlases for tumors 2 and other diseases 3, 4 , and large-scale Perturb-seq screens of millions of cells under genetic 5,6 or drug 7 perturbations. As methods for capturing and processing single cell libraries have been radically scaled in the past few years [8][9][10][11] , sequencing technologies are 5 becoming a major barrier to the broad adoption of scRNA-seq in both basic research and the clinic.
Here, we describe the application of a new sequencing technology that has the potential to advance single cell genomics by significantly lowering the sequencing cost component of scRNA-seq.
Mostly-natural sequencing by synthesis (mnSBS) is a new sequencing chemistry that leverages a 10 low fraction of labeled nucleotides to combine the efficiency of non-terminating chemistry with the throughput and scalability of optical endpoint scanning to enable low-cost, high-throughput sequencing. To benchmark mnSBS with scRNA-seq, we performed experiments with four library types, sequenced in parallel on an Illumina sequencer and on an Ultima Genomics (Ultima) prototype sequencer implementing mnSBS (Fig. 1a). 15 To implement mnSBS for massively-parallel, droplet-based scRNA-seq, we converted a typical scRNA-seq workflow to be compatible with Ultima sequencing (Fig. 1b-d

; Online Methods).
Focusing on 10x Chromium scRNA-seq (Online Methods), a popular method, we first added adapters to cDNA libraries specific for Ultima sequencing (Fig. 1b). Next, we address the fact that 20 droplet-based scRNA-seq relies on pairing each cDNA read with a cell barcode (CBC) and a Unique Molecular Identifier (UMI) (Online Methods). With Illumina sequencing, the two ends of the library are sequenced separately by paired-end sequencing, but for single-end Ultima sequencing, we capture all the information in a single 200 to 250 base read (Fig. 1d), such that the CBC and UMI are read first and followed by the cDNA. For those reads derived from the transcript's 3' end, we sequence through poly(T) bases, which are due to the mRNA poly(A) tail, adjacent to the cDNA sequence. 5 To evaluate mnSBS with scRNA-seq, we carried out experiments with four libraries, spanning different technical and biological use cases, and sequenced each in parallel on both Ultima and Illumina sequencers (Online Methods). Three libraries were from peripheral blood mononuclear cells (PBMCs) of healthy human donors, spanning 3' scRNA-seq (~7,000 cells, 1 individual), 5' scRNA-Seq (~7,000 cells, 1 individual), and a library generated in multiplex by pooling cells from 10 eight donors (~24,000 cells, 8 individuals, 5' scRNA-seq). We chose PBMCs because they are primary human cells, include diverse cell types of various sizes and frequencies, and have been used for previous benchmarking 12,13 . The fourth library was from a Perturb-Seq 5, 6 experiment, where ~20,000 cells were profiled after CRISPR/Cas9 pooled genetic perturbation, followed by scRNA-Seq to detect both the cell's profile and associated guide RNA. Together, the four libraries 15 span three major use cases -individual patient atlas, multiplex patient profiling, and large-scale screens, and the two most commonly used library types for scRNA-seq.
We first tested the feasibility of mnSBS for scRNA-seq, with matched Ultima and 5' and 3' droplet-based scRNA-seq of PBMCs. Initial analysis (Online Methods) showed that the number 20 of UMIs generated at a given sequencing depth was comparable between Ultima and Illumina in the 5' libraries, while for the 3' libraries we obtained more UMIs with Illumina than Ultima (Fig.   2a), due to sequence quality differences. While Ultima and Illumina data for 5' libraries were similar, for the 3' data there was lower quality for Ultima in the bases flanking the poly(T) region -the 3' end of the UMI and the 5' end of the cDNA (Extended Data Fig. 1a). Indeed, filtering out reads that have bases with quality <10 in their UMI (the filter applied by the pre-processing pipeline we used, Cell Ranger 14 ), yields similar rarefaction curves for Illumina and Ultima (Extended Data Fig. 1b). Thus, much of the difference in the observed number of UMIs per 5 sequenced read for 3' libraries is explained by the lower sequence quality UMIs in the Ultima data due to the need to sequence through the poly(T) bases. To overcome this, for 3' libraries we trimmed 5 bases from the cDNA adjacent to the poly(T) bases and then explored how best to trim the UMI. As we shortened the UMIs, UMIs that differed only in the trimmed bases "collapsed" into a single UMI leading to decreases in the fraction of UMI/CBC pairs that occur in only one 10 gene at roughly the same rate in Illumina and Ultima data (Extended Data Fig. 1c). Shortening the UMIs for Illumina had a minimal effect at 9 bases or more (Extended Data Fig. 1d), suggesting that the challenges with Ultima reads were due to lower base quality and that trimming to 9 bases was reasonable. This led us to exclude the last 3 bases of each UMI in Ultima 3' data in subsequent downstream analysis (Online Methods). 15 Next, comparing the performance of these PBMC 3' and 5' matched libraries, we obtained similar overall performance for both sequencing technologies. First, to correct for differences in sequencing depths, which were higher in Ultima than Illumina, we randomly sampled Ultima reads, so that we used the same number of reads for each sequencing platform (Online Methods). 20 Both technologies identified nearly all the same CBCs (Fig. 2b, 7,916 cells (Ultima) vs 7,926 cells (Illumina) in the 3' data, and 7,875 cells (Ultima) vs 7,854 cells (Illumina) in the 5' data), with the same number of UMIs and genes per cell for 5' libraries and slightly lower numbers for 3' libraries with Ultima (as expected) (Fig. 2c,d). When we sampled reads to have the same number of UMIs (Online Methods), we obtained a similar number of genes per cell in Illumina and Ultima also for 3' libraries (Extended Data Fig. 2). Other metrics (Supplementary Table 1) also showed similar overall performance, with slightly higher genome mapping rates in Ultima but comparable transcriptome mapping rates. 5 The two technologies yielded highly correlated expression levels, albeit with some outlier genes and minor differences (Pearson's r=0.98 in all cases; Fig. 2e, Extended Data Fig. 2c). (As expected, when a single sequencing run was randomly split into two datasets, we see even higher correlation of expression levels (Extended Data Fig. 2d)). Specifically, there was a modest bias, particularly in the 3' libraries, towards genes with higher GC content having higher expression in 10 Illumina and the longest genes having higher expression in Ultima 3' libraries (Extended Data   Fig. 3a,b). Of the 166 genes with differences in expression for 3' PBMC between the two sequencing platforms, most (130 genes, 78.3%) differed in the fraction of reads that were assigned by Cell Ranger to the gene out of all the reads mapped to that gene region (Extended Data Fig.   3c). This is likely related to how Ultima and Illumina reads map to different locations relative to 15 the transcript, as expected from the difference in single-end vs. paired-end reads (Fig. 1d): in 5' data, Ultima reads map closer to the 5' end than Illumina reads, while in 3' data, Ultima reads map closer to the 3' end than Illumina reads (Extended Data Fig. 3d,e). Because Cell Ranger excludes reads that do not fully map within annotated gene boundaries, more Ultima reads are excluded from analysis as they are closer to gene ends (Extended Data Fig. 3d,e), as shown for example 20 for LILRA5 and HIST1H1D (Extended Data Fig. 3f,g). This difference in location can also lead to more multimapping or ambiguous reads (Extended Data Fig. 3h, Supplementary Table 2).
For example, four (ARF5, MIF, IFITM1, and TCIRG1) of the 20 genes with the largest log fold change (logFC) between Ultima and Illumina in the 3' data (labeled in Fig 2E) have higher expression in Illumina and a much higher rate of mapped ambiguous reads in the Ultima than the Illumina data (>50 vs. <10 ambiguous reads per non-ambiguous read for each gene, respectively) (Supplementary Table 2), possibly explaining the difference in their expression levels.
Shortening Ultima Read 2 to the same length as Illumina Read 2 had a small effect on the fraction of assigned reads (Extended Data Fig. 3h) and other metrics (Supplementary Table 1) -5 suggesting read length is not a major factor in the differences we observed.
To further explore the effects of gene annotation on Ultima and Illumina-based scRNA-seq, we extended the standard reference using RNA-seq data, as we have previously shown this can recover the expression of a gene with an alternative 3' end compared to the annotation 15 . We created a 10 pipeline that extends the annotated gene boundaries based on reads that overlap a gene but are not completely contained in any of its annotated exons (Online Methods). We generated three such references, extended with either (1) published bulk PBMC data 12 , (2) the Ultima 3' scRNA-seq data, or (3) the Ultima 5' scRNA-seq data (with Ultima and Illumina data sampled to the same number of reads). We compared the expression of genes in Ultima data processed with the 15 extended references to those in Illumina data either with or without the extended reference (Extended Data Fig. 4, Supplementary Tables 1 and 3). Analyzing the 5' PBMC data with the extended reference decreased the number of differentially expressed (DE) genes between Ultima and Illumina by 22 to 23% (absolute logFC > ln(2)) compared with the standard reference, while other overall metrics were largely unchanged. In the 3' data, there were a similar number of DE 20 genes in analyses with the extended and standard references, although the expression of some genes, e.g., LILRA5 and MT-CO2, agreed much more closely using the extended reference.
Comparing gene expression levels for the same sequencing dataset processed with the standard or an extended reference shows that most levels are very similar, though a sizeable number (23 to 83) are higher and a few (1 to 3) are lower (Extended Data Fig. 4c). Also. some of the top genes that differ between the extended and standard references are genes that differ between Ultima and Illumina with the standard reference, e.g., MT-CO2 and LILRA5 in the 3' data and HIST1H1D and HIST1H1E in the 5' data ( Fig. 2e, Extended Data Fig. 3f,g). This suggests that a data-driven extended reference might help recover expression in Ultima scRNA-seq data, particularly when 5 using 5' data. Alternatively, one could consider modifying the way Cell Ranger counts UMIs to better take advantage of reads that overlap genes but are not completely contained within them.
To compare the biological insights derived from scRNA-seq using the two technologies, we turned to analyze 5' scRNA-seq of PBMCs from eight individuals processed together and sequenced with 10 both Ultima and Illumina (Online Methods). Both methods have roughly the same number of UMIs in this dataset (<1% difference) and performed similarly (Supplementary Table 1 and Extended Data Fig. 5, using all reads). We also generated matched T cell Receptor (TCR) and B cell Receptor (BCR) Illumina sequencing data (Online Methods). Ultima sequencing was not used for this, because the 10x Chromium constructs specifically require paired-ends or much 15 longer single-end reads to cover the entirety of these genes.
In the 8 individuals PBMC dataset, the two sequencing platforms produced very similar results for the common tasks of genotype-based assignment, cell type labeling, and DE gene identification, and were well-embedded together. First, we used Vireo 16 , which finds genotype clusters in the 20 data without prior knowledge of the genotypes of individuals in the experiment, to assign reads to each individual in the mixture (Online Methods). Both Ultima and Illumina data returned highly concordant labels (Fig. 3a), with 92% agreement in label if we include those cells declared doublets or unassigned (χ 2 test for independence gives a p-value < 2.2 x 10 -16 and χ 2 =199127 with df=81), and >99.9% agreement if the cell is assigned singlet by both technologies (only 5 cells differ, χ 2 test for independence gives a p-value < 2.2 x 10 -16 and χ 2 =146879 with df=49). Next, we clustered the cells for each of the two datasets separately (Online Methods), and used Azimuth 17 to automatically label cell types in each (Online Methods). In both sequencing datasets, we 5 identify the major cell types expected for PBMCs, with the expected cell type markers (Extended Data Fig. 6a,b), and cells are comparably well-mixed among individuals (Fig. 3b), with low adjusted mutual information (AMI) between cell type and individual in both Ultima (0.026) and Illumina (0.025) (AMI = 0 corresponds to no relation between individual and cell type; AMI = 1 corresponds to the case of perfect agreement between the two labelings). The two sequencing 10 datasets also had high agreement on proportions of each cell type from each individual, both for the main cell type categories (Fig. 3c, 95% agreement in cell type labels between Ultima and Illumina, χ 2 test for independence gives a p-value < 2.2 x 10 -16 and χ 2 =123891 with df=49, AMI = 0.88) and for finer cell subsets, such as subclusters of T cells (Fig. 3d). They further agreed on differential expression between cell types ( Fig. 3e; r=0.93-0.95), such that 67.9% of genes that are 15 significantly DE in one cell type in one of the two datasets are significant in both for that cell type.
We found similar results with 5' and 3' PBMC datasets from a single individual (Extended Data   Fig. 7). Moreover, the two PBMC mixture datasets were co-embedded well together into a joint 2-dimensional space using Uniform Manifold Approximation and Projection (UMAP) after regressing out dataset of origin (Online Methods), with good mixing between datasets (AMI = 20 0.00068 between the joint clustering and dataset of origin), and good separation of cell types (Fig.   3f). Thus, data generated by the two sequencing technologies are compatible and can be combined easily in a single analysis. To explore finer signals, we compared the two data sets for continuous cell states -such as 15 activation status or the cell cycle -recovered by unsupervised non-negative matrix factorization (NMF). Each NMF factor can reflect a gene program, defined by a non-negative score for each gene (referred to as gene loadings) and a non-negative score for each cell (referred to as cell loadings). Because NMF runs are not identical even when re-run on the same data, to compare NMF models from Ultima and Illumina data, we fit NMF on Ultima data, Illumina data, and a null 20 of randomly permuted Illumina expression values (Online Methods), and then measured how well cell or gene loadings fit each dataset. Cell loadings from the model learned on Ultima data fit the Illumina data almost as well as cell loadings from the Illumina-learned model and vice versa, while loadings from the permuted (null) dataset led to much poorer fit (Extended Data Fig. 8a). For gene loadings, there was lower performance when fitting data from one sequencing technologies with loadings from a model learned on the data from the other technologies, each to a comparable extent, and both far better than random permutations (Extended Data Fig. 8b). Consensus NMF (cNMF) 19 (Online Methods), which reduces variability due to random sampling between NMF runs, showed high correlations of cell (Extended Data Fig. 8c) or gene (Extended Data Fig. 8d) 5 loadings between models learned on different runs. The correspondence was comparable to that observed between two independent cNMF runs on the same dataset (Extended Data Fig. 8e,f), and lower than when comparing a single run to itself (Extended Data Fig. 8g,h), as expected. It was also much stronger than comparing cNMF models of two different biological systems (5' PBMC mixture data and Perturb-seq, see below for details of this experiment) (Extended Data 10 Fig. 8i,j). Notably, the same cell subsets score highly for Ultima (Extended Data Fig. 8k) and Illumina (Extended Data Fig. 8l) data-derived programs on a joint UMAP embedding. For example, factor 13 in Illumina and factor 1 in Ultima scored in the same cells (Extended Data   Fig. 8k,l) and were correspondingly highly correlated on both cell (Extended Data Fig. 8c) and gene (Extended Data Fig. 8d) loadings, indicating that they correspond to the same program. 15 Moreover, other factors that differed between Ultima and Illumina were highly related-for example, factor 5 in the Ultima dataset was roughly decomposed into factors 5 and 11 in the Illumina dataset. Overall, we conclude that there is a high correspondence between cell states in Ultima and Illumina data. 20 As a final test, we evaluated performance with a Perturb-seq screen, where heavy sequencing requirements are particularly limiting for scale 5,6 , and used a design that also tested for CITE-Seq 20 and Cell Hashing 21 performance. Specifically, we used a library from a pilot screen of an ongoing genome-wide Perturb-seq study (PIT, KGS, CJF and AR, unpublished results) to identify regulators of MHC Class I in melanoma A375 cells (Fig. 4a). In this pilot, we introduced 6,127 guides targeting 1,902 transcription factors and chromatin modifiers (Supplementary Table 4) along with both intergenic and non-targeting control guides, enriched for cells with low HLA levels, and followed by scRNA-seq of 20,000 cells that included CITE-seq 20 and Cell Hashing 21, 22 (Online Methods). We sequenced the resulting scRNA-seq libraries with Illumina and Ultima, 5 but the dial-out libraries used for guide detection, CITE-seq, and Cell Hashing were only sequenced with Illumina (Extended Data Fig. 9a-c). Initial pre-processing of the Perturb-seq scRNA-seq data showed similar performance for Ultima and Illumina, after sampling reads to have the same number of UMIs in each dataset (Extended Data Fig. 5a,b, Supplementary Table 1), as before, as well as in terms of cell assignment to guides ( Fig. 4b and Extended Data Fig. 9c), 10 Cell Hashing barcodes (Fig. 4c), and cell clustering and marker gene expression (Extended Data

Fig. 9d-i).
Importantly, the Ultima and Illumina datasets identified similar relationships between perturbations and similar regulatory effects. For this analysis, we included the 335 cells in Illumina 15 and 336 cells in Ultima, coming from 11 perturbations and 10 control guides in this pilot screen that were assigned to a single perturbation that had more than 10 assigned cells (the same perturbations were found by Illumina and Ultima). We then fit a regularized linear model (with elastic net, similar to 6, 22 (Online Methods)) of the mean impact of each perturbation on each gene, selected genes with nominal p-values < 0.05 using a permutation-based approach (Online 20 Methods), and clustered the guides by these regulatory profiles. Ultima (Fig. 4d) and Illumina ( Fig. 4e) based analyses yielded very similar guide relationships, both between multiple guides to the same gene (e.g., STAT1 guides) and between guides to different, functionally-related genes (e.g., STAT1 and IRF1 or COP1_1 and CREBBP_3). Moreover, there was very high agreement in the effects on individual genes in both datasets, when comparing DE genes between each guide and an intergenic control (intergenic_1) in each dataset, in both significance and effect size (Fig.   4f, Extended Data Fig. 9j,k). Many such gene/guide pairs were significant in both the Ultima and Illumina datasets (Fig. 4f), and those significant only in one, had highly similar effect sizes, showing consistent signal. Moreover, the KEGG pathways enriched with DE genes between each 5 guide and an intergenic control were highly similar between the datasets (Fig. 4g).
In conclusion, the two sequencing platforms generally perform similarly for scRNA-seq, across two main protocols for droplet based scRNA-seq (3' and 5'), two different sample types (primary cells and a cell line), and multiple experimental designs (simplex and multiplex, Perturb-Seq, 10 CITE-Seq and Cell Hashing). One key explanation for the minor differences we observed is the position of reads relative to annotated gene boundaries (Extended Data Fig. 3d-g), as a consequence of Ultima's single-end reads being closer to gene ends. Additionally, we currently recommend 5' over 3' libraries, given the small penalty in lost reads in 3' libraries ( Fig. 2a) due to lower sequencing quality adjacent to the poly(T) sequence (Extended Data Fig. 1). 15 It is exciting to imagine what will be possible with Ultima's lower cost, estimated at over 5-fold reduction in cost per read, that enables sequencing more reads, cells, and/or samples in the context of large scale tissue atlasing projects, such as the Human Cell Atlas 1 , the BRAIN Initiative 23 , the Cancer Moonshot Human Tumor Atlas Network 2 as well as perturbation screens 5,6 . It should also 20 be possible to design droplet-based scRNA-seq reagents, and methods for other large scale single cell and spatial genomics 24-28 customized to Ultima sequencing to directly generate libraries and eliminate the need for library conversion (Fig. 1b)   processing and analyzing our sequence data will be freely available on June 1, 2022 from https://github.com/seanken/CompareSequence.

10x Chromium library conversion and Ultima Genomics (UG) sequencing
Our 10x Chromium libraries were converted using a library conversion PCR workflow (Fig. 1b) to enable sequencing on the UG platform. In brief, library concentration was verified using Qubit  After pooling, we seeded libraries, clonally amplified them on sequencing beads using a high-scale emulsion amplification tool, and sequenced them on a prototype UG Sequencer, which uses a sequence of additions of partially labeled non-terminating nucleotides, followed by imaging to generate single-end reads of length 200 to 250 bases (Fig. 1c, a detailed workflow description will 5 be published elsewhere). For sequencing of 10x Chromium 3' libraries, we used a modified sequencing protocol that accommodates the high consumption of dT nucleotides in the poly(dT) stretch of the cDNA. Specifically, we included additional T injections when sequencing cycles 28 to 32, which were predicted to include the poly(dT) stretch: (TGCA)27 (T10GCA)5 (TGCA)60.

Initial Ultima read processing
To enable standard scRNA-seq analysis of single-end reads, we first converted UG data to create paired-end data (Fig. 1d). To this end, we removed conversion adapters, and quality trimmed reads using Cutadapt v2.10 30 , using a threshold of 30. We discarded reads not containing at least 8 Ts

Extracting expression information from FASTQ files
We used Cell Ranger v5 14 to pre-process data for both Illumina and Ultima (for Ultima using simulated Read 1 and Read 2 as extracted above). For all datasets, we used the GRCh38 human reference from 10x Genomics unless otherwise stated, and set --expect-cells to the expected number of cells (7,000 cells for the single sample 3' and 5' PBMC data, 24,000 cells for the 5' mixture PBMC sample, and 20,000 cells for the Perturb-seq sample). To process 3' Ultima data, unless otherwise stated, we modified the last 3 bases of the UMI using awk by replacing them with A's and setting the last 3 quality values to be equal to I. 5 For the 5' mixture data, we demultiplexed it by first calculating SNP coverage data for SNPs in the 1000 Genomes Project 32 with cellsnp-lite v1.2.0 33 (using --minMAF 0.1 --minCOUNT 20) followed by Vireo v0.5.5 16 to get labels for the sample of origin.
For Perturb-seq data, we also passed Cell Ranger FASTQ files for dial-out data (using the CRISPR 10 Guide Capture keyword), hash tag oligo (HTO) data (using the Custom keyword), and antibody derived tag (ADT) reads for CITE-seq (using the Antibody Capture keyword), as well as feature barcode information for each. We further processed HTO data with DemuxEM v0. the total number of UMIs. We chose an initial unrefined proportion as the largest proportion that gave fewer UMIs in Ultima than were present in Illumina. We then performed a refinement step with 1% steps ranging from this unrefined proportion up to the initial unrefined proportion plus 5%, and chose the final refined proportion used for sampling from this range to be the largest proportion that gave fewer UMIs in Ultima than were present in Illumina. We did not sample the 5' mixture Ultima data because it had roughly the same number of UMIs as the 5' mixture Illumina 5 data. After sampling, we processed data in a similar fashion to non-downsampled data (see

Extracting FASTQ QC metrics
To extract base quality information from each FASTQ file, we randomly selected 1,000,000 reads 10 with seqtk sample using the parameter -s 100 to set a random seed. We then used the SeqIO.parse function from Biopython v1.79 36 to read the FASTQ into Python. We then extracted the quality information with the letter_annotations function and recorded the resulting information to a file with one line per read and one column for each base in that read. This was used for downstream visualizations. 15 To explore the effects of shortening UMIs on number of reads, we loaded the molecular information h5 file generated by Cell Ranger into R with DropletUtils and saved the resulting data frame. We then loaded this into Python, resulting in a table with one entry for each UMI counted by Cell Ranger, which included CBC, UMI, and Gene. We shortened the UMI to different lengths 20 and used the pandas 37 groupby function to count the number of UMIs that collapsed together after this shortening and the number of UMIs from different genes that collapsed together after this shortening.

Analysis of PBMC data
We loaded filtered PBMC count data from Cell Ranger (located in the outs/filtered_feature_bc_matrix subdirectory output) into R v4.0.3 38 using Seurat v4.0.0 39 . To avoid biases introduced by using slightly different sets of cells, we used only the intersection of the sets of cells found in Ultima and Illumina for each analysis. For the 5' mixture data, we also 5 uploaded the labels from Vireo and removed doublets and unassigned cells. We processed data were then processed through the standard Seurat pipeline, as followed. We scaled scaling data to transcripts per million (TPM) and log normalized with NormalizeData, finding variable genes with

Analysis of method specific biases
We calculated logFCs comparing expression levels between Ultima and Illumina by creating a pseudobulk profile for an entire dataset using count data, which was then normalized to TPM. We Classification of gene expression and read assignment bias was performed as follows. For each protein-coding gene with a total count >100 in at least one of the platforms, we calculated read 20 assignment rate (see below) differences for unfiltered (total reads, including reads from cell barcodes not in the filtered list) Illumina and downsampled Ultima reads, and normalized the foldchanges of counts by the median difference. We partitioned the genes into three categories: (1) >5fold higher in Illumina ("Higher count in Illumina"); (2) >5-fold higher in Ultima ("Higher count in Ultima"); and (3) all remaining genes ("Similar counts") (Extended Data Fig. 3c). A read assignment rate higher in Illumina ("Higher read assignment in Illumina") was defined if the ratio of Cell Ranger gene-assigned reads out of the total reads mapped to the annotated gene body plus a flanking 100bp was higher in Illumina (>3-fold higher ratio with p < 0.01, binomial test). The read assignment rate higher in Ultima ("Higher read assignment in Ultima") was defined 5 analogously.
To explore 3' and 5' bias, the GTF file used by Cell Ranger was processed as described above to collapse overlapping exons together. We then used pysam v0.15.3 (https://github.com/pysamdevelopers/pysam) to load each alignment (selecting 1% of alignments at random), excluding those 10 without an assigned gene, CBC, or UMI, as well as excluding multimappers. We then calculated the distance along the exonic regions of the gene (normalized by gene length) from the 5' end of the gene to the 3' and 5' ends of the read using the overlapping exon representation we generated earlier. We then recorded this information for each alignment. For plotting, this was divided into bins of length 1% labeled from 0 to 100 (including the bottom of each bin but not the top -meaning 15 that bin 100 was empty for 5' reads and bin 0 was empty for 3' reads), and normalized by the number of reads mapping to that gene to avoid highly expressed genes biasing the results.
To extract information about the number of reads falling into different categories (unmapped, ambiguous, etc.), we took the BAM file from Cell Ranger and applied FeatureCount v1.6.2 41 with 20 settings -t exon -g gene_name -fracOverlap 0. For 3' data we set -s 1 to denote sense reads, while for 5' data we used -s 2 to denote antisense reads. In addition, for gene level information about the number of reads in different categories, we reran FeatureCount once with the flag -M (for multimappers) and once with the flag -O (for ambiguous reads).
For IGV v2.3.80 42 plots, we used SAMtools v1.8 43 to extract a region around each gene of interest from the associated 10x BAM file. We then used grep to extract reads that were assigned to a given gene and those that were not. 5

Extension of the standard reference
We built a Nextflow-based 44 pipeline that takes in a preexisting reference GTF file and RNA-seq BAM file (from paired-end or single-end RNA-seq) and outputs a new GTF file that extends the old one using RNA-seq data. The first step in the pipeline annotates reads in the BAM file overlapping genes in the standard reference using FeatureCount with the parameters -t exon, -R 10 BAM, and -g gene_id, as well as using the -s 1 or -s 2 flag depending on the strandedness of the RNA-seq data. For paired-end data we also used the flag -p. We then used a pysam-based start and end coordinates being the start and end coordinates of that read, and with an extra field recording the assigned gene for the read. We excluded reads with large gaps (>10 bases labeled as N in the CIGAR string) and, for paired-end reads, only include properly paired reads. We then sorted the 15 BED file with bedops 45 , clustered the entries in this BED file using bedtools cluster with the -s flag, and used bedtools groupby to merge BED entries from the same cluster and gene. We then sorted the resulting BED file with bedops again, use a Python script to turn the BED file into a GTF with one exon per entry in the BED file. We combined this GTF file with the GTF for the standard reference and sorted the results with BEDTools, yielding the extended references. These 20 new GTFs were then used to generate references for Cell Ranger with cellranger mkfastq. We used this approach to create three references, one using published bulk data 12 (using the BAM file generated in that publication), and two using scRNA-seq Ultima PBMC data -one generated with 3' data and the other with 5' data. We then processed PBMC data with each of these references using cellranger count as described above and performed downstream analysis. We did not process the 3' data with the 5' reference or vice versa.

Gene program analysis by NMF
We calculated all NMF models with RcppML v0.3.7 46 using 15 factors and with the log TPM matrix as input including genes expressed in more than 1% of cells. NMF returns a cell loading matrix, with one row per cell and one column per factor, and a gene loading matrix, with one row per gene and one column per factor. To test how well NMF factors from one data type fit another, 15 we split our data into a training set (with 5,000 cells) and a test set (all other cells) to avoid data overfitting when testing the gene loading matrix. We then fit NMF models separately on the Ultima data, Illumina data, and a permuted version of the Illumina data (where the values of each gene were randomly scrambled between cells) using the 5,000 cell training set. To test the accuracy of gene loadings of each NMF model for each data type, we used the project function from RcppML 20 to generate a cell loading matrix on the training data and the mean squared error (MSE) was measured. For cell loadings, we used a similar approach, except that testing was performed on the test dataset. In all cases we repeated the analysis 10 times with different random seeds to account for variability in NMF solutions.

Analysis of Perturb-seq data
We processed Perturb-seq data through a similar pipeline to the PBMC data (see Analysis of PBMC data), except the HTO, ADT, and guide count matrices were also uploaded as additional assays, while the DemuxEM labels for Hash ID and the Cell Ranger labels for guide assignment were added to the metadata. After initial processing with Seurat, we removed cells assigned to 15 multiple hash tags or multiple guides, as were cells assigned to guides with 10 or fewer cells assigned to them.
We generated a guide similarity matrix with a slightly modified version of the MIMOSCA package 6 . We extracted a cell by gene log TPM expression matrix from the Seurat object, selecting the 20 cells filtered as described above (one hashtag and guide assignment, with guides with more than 10 assigned cells) and genes that were expressed in >5% of cells. We also extracted a covariate matrix consisting of the scaled number of UMIs per cell, as well as one-hot encoded versions of the perturbation assignments, Hash ID assignments, and cluster assignments (based on clustering all cells with Seurat's FindClusters function at a resolution of 0.2 with 20 PCs and otherwise default settings). We loaded data into Python using pandas and an elastic net model was fit modeling expression as a linear model of the covariates using sklearn.linear_model.ElasticNet 47 with parameters l1_ratio=0.5, alpha=0.0005, and max_iter=10000. The coefficient matrix from this model was saved. We randomly permuted guide labels 100 times (while preserving the number 5 of guides assigned to each hash barcode and vice versa) followed by the same elastic net-based analysis. We loaded the resulting gene by covariate coefficient matrices into R and discarded columns that did not correspond to guide labels, along with columns corresponding to nontargeting control guides (those labeled as NO_SITE in our feature data) and the Background control guide. For each gene, we calculated a p-value based on the resulting matrices by scoring 10 each gene by the maximum absolute value for that gene across all guides. We partitioned genes into 20 bins of equal size based on average expression. We calculated a p-value for each gene by comparing the score of that gene in the non-permuted data to the score of all genes in the same bin as it in all 100 permuted datasets. We retained all genes in the coefficient matrix with uncorrected p-value < 0.05 and calculated the Pearson correlation between guides based on this matrix.

Perturb-seq differential expression analysis of genes regulated by each guide
We performed DE between each guide's profiles and the intergenic guide profiles with Nebula  55 , and histograms that were produced with the base R hist function.    Table 1. Sequencing metrics.