ABSTRACT
Recent advancements in functional genomics have provided an unprecedented ability to measure diverse molecular modalities, but learning causal regulatory relationships from observational data remains challenging. Here, we leverage pooled genetic screens and single cell sequencing (i.e. Perturb-seq) to systematically identify the targets of signaling regulators in diverse biological contexts. We demonstrate how Perturb-seq is compatible with recent and commercially available advances in combinatorial indexing and next-generation sequencing, and perform more than 1,500 perturbations split across six cell lines and five biological signaling contexts. We introduce an improved computational framework (Mixscale) to address cellular variation in perturbation efficiency, alongside optimized statistical methods to learn differentially expressed gene lists and conserved molecular signatures. Finally, we demonstrate how our Perturb-seq derived gene lists can be used to precisely infer changes in signaling pathway activation for in-vivo and in-situ samples. Our work enhances our understanding of signaling regulators and their targets, and lays a computational framework towards the data-driven inference of an ‘atlas’ of perturbation signatures.
INTRODUCTION
Recent years have witnessed an explosion in the development of new technologies for functional genomics, including the ability to measure diverse molecular modalities that span the central dogma1–8. The broad application of these techniques has yielded a wealth of information associating changes in molecular state across individuals, environmental conditions, and disease states9–14. A key challenge for the next stage of genomic analysis is to move beyond associative and correlative findings towards a more causal understanding of biological systems.
Genome engineering tools such as CRISPR hold enormous promise towards identifying causal regulators that drive molecular and functional phenotypes15–19, particularly when applied as part of massively parallel screens. When combined with a single-cell RNA-seq readout (i.e. Perturb-seq)20–22, these technologies combine the observational but unsupervised nature of RNA-seq measurements with inherent causal inference enabled by genetic perturbations. As a result, technologies like Perturb-seq offer enormous promise for causal gene regulatory network reconstruction21–24. For example, the Genome-Wide Perturb-seq (GWPS) resource knocked down approximately 10,000 genes in resting human cells in order to create a large-scale genotype-phenotype map and to perform an in-depth dissection of gene function25. This represents a powerful tool for elucidating transcriptional signatures, but applications in resting cells may fail to accurately describe context-dependent gene function.
The genomics community places significant emphasis on identifying the downstream effectors of signaling regulators to quantify and compare levels of pathway activation across diverse samples. Comparative analysis workflows routinely test for gene set enrichment using pathway signature databases, which are often compiled from diverse data types and studies26–29. We propose that large scale single-cell perturbation screens represent an innovative approach for refining these databases, and in particular, offer a comprehensive and data-driven workflow to enumerate transcriptional signatures and to link them to causal regulators. However, profiling dynamic gene function requires the Perturb-seq assay to be repeatedly run across multiple biological contexts, creating challenges for cost and scalability.
Here, we introduce a highly scalable Perturb-seq workflow and apply it to perturb signaling regulators across 30 separate biological contexts. We demonstrate how Perturb-seq is compatible with recent and commercially available advances in combinatorial indexing as well as a new sequencing-by-synthesis platform, sequencing 2.6 million cells in total. We divide these profiles across six cell lines and five signaling pathways in order to explore both conserved and cell type-specific responses to each signaling regulator. We introduce an improved computational framework (Mixscale) to address cellular variation in perturbation efficiency that is inherent to Perturb-seq experiments, alongside optimized statistical methods to learn differentially expressed gene (DEG) lists and broader signaling programs. Finally, we demonstrate how our data-driven inference of molecular signatures can be used to precisely infer changes in signaling pathway activation for in-vivo and even in-situ samples, including for immune and intestinal disorders. Our work enhances our understanding of signaling regulators and their targets and lays a computational framework towards the systematic construction of an ‘atlas’ of perturbation signatures.
RESULTS
Scalable and flexible Perturb-seq across cell lines and conditions
We aimed to use Perturb-seq to build a database of molecular response signatures, focusing on targets of different signaling regulators, and to examine how these signatures change across various cell lines. Obtaining these datasets required performing scalable Perturb-seq in a large number of different contexts, which required us to address two experimental challenges. First, we needed to create a diverse set of biological samples reflecting a diversity of cellular and environmental contexts. Second, we needed to develop a workflow to perform massively scalable Perturb-seq experiments across these multiple sample types.
To study pathway activity across multiple biological contexts, we aimed to perform Perturb-seq experiments in six different cancer cell lines from different tissues of origin: A549 (lung), MCF7 (breast), HT29 (colon), HAP1 (bone marrow), BxPC3 (pancreas), and K562 (bone marrow). To facilitate multiplexed gene knockdown screens, we modified each of these lines to express a CRISPR interference (CRISPRi) dCas9-KRAB-MeCP2 cassette30,31 (Supplementary Methods). To explore different environmental signaling contexts, we exposed each cell line to five distinct stimuli representing well-established pathway regulators, each of which has been broadly implicated in cellular responses and disease pathogenesis. Our selected pathways included interferon-beta (IFNβ), interferon-gamma (IFNγ), transforming growth factor beta (TGFβ), tumor necrosis factor-alpha (TNFα), and insulin (INS). Together, our combination of six cell lines and five stimuli created a diverse matrix of biological samples, which we used as input to Perturb-seq (Figure 1a).
Figure 1. (a). (Top) Experimental workflow for perturbing and stimulating each pathway. (Bottom, left) Schematic showing how Perturb-seq can be used to identify pathway-specific target gene sets. (Bottom, right) Totals for biological samples, perturbations, and cells profiled in this study. (b). Diagram of transcriptome and sgRNA barcoding using Parse Biosciences combinatorial indexing. RNA transcripts and sgRNAs are captured, reverse transcribed, and barcoded by poly(dT) and random hexamer primers so that both modalities share a set of cell barcodes. (c). Eight example single-cell heatmaps (one or two perturbations per signaling pathway), showing the top up- and down-regulated differentially expressed genes.
Our goal was not to discover new pathway regulators, but instead, to characterize the molecular responses after the perturbation of known regulators. Therefore, for each pathway, we selected 44 to 61 genes based on a literature review of known regulators32–40. For each gene, we selected three independent single guide RNAs (sgRNA) from the Dolcetto genome-wide CRISPRi library41, as well as 14 non-targeting (NT) controls (Supplementary Table 1). We next created five pooled sgRNA libraries, one independently created for each pathway. For each pathway, we separately infected all six cell lines with the pathway-specific sgRNA library, and then stimulated infected cells with the corresponding cytokine for 24 hours to activate signaling (Figure 1a). This process resulted in a total of 30 distinct Perturb-seq samples.
To address logistical and scalability challenges, we adapted the Perturb-seq workflow to be compatible with the Parse Biosciences Evercode™ Whole Transcriptome Mega kit (Figure 1b, Supplementary Figures 1,2, Supplementary Methods). The Evercode kit is compatible with fixed samples, enabling us to perturb, stimulate, and fix cells with our five distinct pathway perturbation libraries at different times, and subsequently perform scRNA-seq simultaneously on all samples. This workflow also leverages three levels of combinatorial indexing to increase the scalability and cost-effectiveness of large-scale analysis42. Briefly, our modifications included the addition of a guide-specific primer to the cDNA amplification reaction, and the modification of the PCR reaction conditions to optimize guide recovery without adversely affecting the whole transcriptome amplification (Supplementary Figure 1,2).
We sequenced our transcriptomic libraries using a recently developed mostly natural sequencing-by-synthesis technology (Ultima Genomics)43, further validating a subset of samples with additional Illumina NovaSeq sequencing (Supplementary Methods). In total, we applied our combined workflow to sequence approximately 2.6 million cells across two experimental replicates. We used the set of combinatorial Parse barcodes to infer the type and the stimulation condition for each cell, and the gRNA barcode to infer its genetic perturbation. For each biological context and perturbation, our dataset enables the identification of up-regulated and down-regulated transcriptional signatures (eight representative examples in Figure 1c).
Weighted differential expression analysis for CRISPRi Perturb-seq data
We next developed a tailored computational pipeline to optimally process and interpret our Perturb-seq datasets, aiming to solve the following problems. First, we aimed to optimize the process of identifying DEGs resulting from each perturbation, which can be challenging due to technical and biological variation in scRNA-seq assays including Perturb-seq44,45. Second, while our dataset encompasses thousands of individual perturbations across biological contexts, grouping these results into coherent and reproducible gene programs represents a key challenge for data interpretation. Finally, we aimed to demonstrate how to leverage these programs to interpret in-vivo scRNA-seq datasets.
We and others have previously highlighted that Perturb-seq data exhibits extensive cellular heterogeneity, even for cells that receive the same gRNA31,46–48. This variation encompasses both technical and biological sources, including variability in the effectiveness of perturbation. Previously, we developed a binary classification method (called ‘Mixscape’) to identify and remove cells that ‘escape’ perturbation in standard CRISPR-Cas9 knock-out experiments46. However, when examining well-characterized regulators in our CRISPRi dataset, we found that cells exhibited a quantitative gradient of responses (Figure 2a), consistent with multiple factors that influence variable efficiency of CRISPRi knockdown49,50. In these cases, binary classification is an oversimplification that can lead to the incorrect classification of ‘non-perturbed’ cells. Moreover, the existing Mixscape framework assumes the presence of only a single cell type, which does not reflect our multi-context experimental study.
Figure 2. (a). Overview of the Mixscale scoring calculation procedure and its application in the weighted differential expression (DE) test. (b). Relationship between the expression levels of the perturbation targets (y-axis) and the inferred perturbation scores (x-axis) across individual cells. Expression is calculated as the pseudobulk expression within each of 20 bins, into which cells are placed after ordering by Mixscale score. (c). Single-cell heatmap for JAK1 perturbation in A549 cells. Cells are ordered by Mixscale score. Even in cells where the JAK1 transcript is not detected, the Mixscale score correlates with the effective perturbation strength. (d). Comparison of false positive rates for the Mixscale weighted DE method (wmvReg) and the unweighted DE test computed under null simulations (Supplementary Methods). (e). Replication rates of identified DE genes across scRNA-seq replicates, when applying different DE methods.
To address this, we extended the original Mixscape classifier46 to replace binarized classification with a continuous scalar value that reflects the strength of perturbation. Our improved approach, Mixscale, uses an initial estimate of DEG (based on a standard single cell differential expression test) from gRNA-classified cells to estimate a cell’s response (Figure 2a). As in our previous work46, Mixscale first estimates a cell-specific ‘perturbation vector’, representing a difference in expression between a perturbed cell and the most similar non-targeting control cells in the dataset. Instead of using this vector to infer a binarized response, Mixscale performs a scalar projection of each cell’s expression profile onto this perturbation vector, in order to quantitatively estimate each cell’s degree of perturbation (Supplementary Methods). We adapted the procedure to estimate the initial set of DEG when considering all cell lines simultaneously (to increase power), but calculated scalar projections independently for each cell line in order to robustly model potentially heterogeneous perturbation responses across biological contexts (Supplementary Methods). The output of this procedure represents a quantitative perturbation ‘score’ for each gRNA-receiving cell.
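The scalar projection step described above can be illustrated with a minimal sketch. This is not the Mixscale implementation: it assumes a single global perturbation vector (mean perturbed profile minus mean non-targeting profile over the initial DEG set) rather than the nearest-neighbor control matching described in the text, and the function name `perturbation_scores` and the toy matrices are hypothetical.

```python
import numpy as np

def perturbation_scores(perturbed, controls, deg_idx):
    """Score each gRNA-receiving cell by scalar projection.

    perturbed: (n_cells, n_genes) expression of gRNA-receiving cells
    controls:  (n_ctrl, n_genes) expression of non-targeting control cells
    deg_idx:   column indices of the initial DEG set
    Assumption: one global perturbation vector per perturbation, not the
    per-cell nearest-control comparison used in the paper.
    """
    p = np.asarray(perturbed, dtype=float)[:, deg_idx]
    c = np.asarray(controls, dtype=float)[:, deg_idx]
    ctrl_mean = c.mean(axis=0)
    # Perturbation vector: average shift away from the control profile.
    vec = p.mean(axis=0) - ctrl_mean
    norm = np.linalg.norm(vec)
    if norm == 0.0:
        # No detectable shift, so every cell gets a score of zero.
        return np.zeros(p.shape[0])
    # Scalar projection of each cell's deviation onto the vector direction.
    return (p - ctrl_mean) @ vec / norm

# Toy example: three perturbed cells with increasing shift in gene 0.
perturbed = np.array([[0.0, 5.0, 1.0], [2.0, 5.0, 1.0], [4.0, 5.0, 1.0]])
controls = np.array([[0.0, 5.0, 1.0], [0.0, 5.0, 1.0]])
scores = perturbation_scores(perturbed, controls, deg_idx=[0, 2])
```

In this toy case the scores increase with each cell's shift along the perturbation direction, so a cell that looks like a control scores zero. Computing the scores separately per cell line, as the text describes, would amount to calling the function once per cell line.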
We reasoned that our perturbation score for each cell should also reflect the degree of CRISPRi-driven knockdown of the regulator itself. While the level of a single target gene is difficult to quantify accurately in individual cells, we observed a gradual decline in target gene expression with increasing perturbation scores in cells that received the target gRNA (perturbed cells) (Figure 2b). Even in cells with no detectable expression of the CRISPRi target gene, we observed a quantitative gradient of perturbation scores that revealed gradual variation in the expression of downstream regulated genes (Figure 2c). We observed similar patterns even when restricting analysis to cells receiving the same gRNA (Supplementary Figure 3a,b), indicating such variation is not purely driven by the different efficiencies among gRNAs. We replicated our findings in an independent CRISPRi Perturb-seq dataset targeting essential genes51. In this case, when we observed heterogeneity in perturbation scores across multiple gRNAs targeting the same gene, our molecularly-derived scores exhibited strong correlations with functional gRNA activity scores, as measured in independent proliferation-based screens from the same dataset51 (Supplementary Figure 3c,d). We conclude that our inferred perturbation scores represent a robust single-cell quantification of perturbation strength.
Accounting for the degree of perturbation each cell received can improve the robustness of downstream analysis. In particular, cells that exhibit weaker perturbation effects, likely driven by a lesser degree of target gene knockdown, should contribute less to DEG identification. We therefore implemented in Mixscale a weighted multivariate regression (wmvReg) procedure that accounts for multiple variables, including a cell’s perturbation score, cell line identity, and sequencing depth of each cell, in order to identify genes whose expression is dependent on gRNA identity (Supplementary Methods). The use of wmvReg to identify a final DEG list represents an iterative procedure, since it leverages cell-specific weights which themselves were constructed from an unweighted standard test. We included a ‘leave-one-out’ procedure to address potential circularity resulting from this approach (Supplementary Methods), and found that this approach mitigated artificial correlations that could otherwise arise from inaccurate initial DEG estimation (Figure 2d, Supplementary Figure 4a-c).
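As a rough illustration of how cell-specific weights can enter a regression-based DE test, the sketch below fits a weighted least squares model for a single gene. It is an assumption-laden simplification, not the wmvReg procedure: here the perturbation score is used only as an observation weight, the covariates are passed as generic numeric columns, and the leave-one-out step is omitted. The exact formulation is given in the Supplementary Methods and may differ. The function name `weighted_de_coefficient` is hypothetical.

```python
import numpy as np

def weighted_de_coefficient(expr, is_perturbed, weights, covariates):
    """Weighted least squares for one gene.

    expr:         (n_cells,) expression values of the gene
    is_perturbed: (n_cells,) 1 for gRNA-receiving cells, 0 for controls
    weights:      (n_cells,) positive per-cell weights (assumed here to be
                  derived from perturbation scores; controls get weight 1)
    covariates:   (n_cells,) or (n_cells, k) nuisance covariates, such as
                  sequencing depth or cell line indicator columns
    Returns the fitted coefficient on the perturbation indicator.
    """
    expr = np.asarray(expr, dtype=float)
    n = expr.shape[0]
    X = np.column_stack([np.ones(n), is_perturbed, covariates])
    # Scaling rows by sqrt(weight) turns WLS into ordinary least squares.
    sw = np.sqrt(np.asarray(weights, dtype=float))
    beta, *_ = np.linalg.lstsq(X * sw[:, None], expr * sw, rcond=None)
    return beta[1]

# Toy example: expression depends on perturbation status and depth.
is_perturbed = np.array([0.0, 0.0, 1.0, 1.0, 0.0, 1.0])
depth = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
weights = np.array([1.0, 1.0, 0.2, 0.9, 1.0, 0.5])
expr = 1.0 - 3.0 * is_perturbed + 0.5 * depth
coef = weighted_de_coefficient(expr, is_perturbed, weights, depth)
```

Down-weighting weakly perturbed cells means that cells resembling controls have less influence on the estimated perturbation effect, which is the intuition stated in the text.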
We further tested the statistical power of Mixscale’s wmvReg on our dataset, and compared it to alternative strategies for DEG identification. Our weighted framework identified substantially more DEG per perturbation when compared to a Wilcoxon rank sum test (Supplementary Methods) on gRNA-derived labels (on average, 404 vs. 290 DEGs per perturbation). Importantly, when we repeated the entire process on both scRNA-seq replicates separately, the DEG identified by Mixscale exhibited higher rates of reproducibility (Figure 2e, Supplementary Figure 4d), even when considering genes that were exclusively found by our procedure (Supplementary Figure 4e,f). Taken together, we conclude that Mixscale represents a sensitive, robust, and reproducible procedure to identify DEG from large scale and complex CRISPRi Perturb-seq datasets.
As we sequenced a subset of our libraries with both the Illumina NovaSeq and the Ultima UG100 platforms (Supplementary Methods), we analyzed each sequencing run independently and compared results. Gene expression estimates between the two technologies were highly correlated, with the exception of a small number of outlier genes (Supplementary Figure 5a, Supplementary Table 2), as has been previously described52. These platform-dependent differences in baseline expression, however, were no longer apparent after comparing NT and perturbed cells sequenced within the same platform (Supplementary Figure 5b,c). Our findings demonstrate that alternative sequencing technologies are compatible with combinatorial indexing workflows and, consistent with previous work25, can be used for large-scale scRNA-seq and Perturb-seq studies.
Conserved and context-specific perturbation responses
Our entire dataset encompasses 1,626 multiplexed perturbation experiments, where each experiment corresponds to the knockdown of a given regulator (n ranges from 44 to 61 depending on the pathway), in a given cell line (n=6), under a given pathway stimulation (n=5). Of these, we discarded 30 perturbations for which Mixscale identified < 5 significant DEG across cell lines. Our remaining 1,596 gene lists provide an opportunity to ask two key questions. First, do multiple regulators within the same pathway target overlapping or distinct lists of downstream genes? Second, how do a regulator’s downstream targets change across different cell lines? More broadly, our dataset highlights the necessity of learning conserved gene modules that are repeatedly identified as differentially expressed across regulators and cell lines, in order to facilitate interpretation and exploration.
Initial analysis of our data revealed both conserved and context-specific perturbation responses (Figure 3a-d). For example, key upstream regulators of the IFNγ pathway, including the IFNGR1/2 receptors and the JAK1/2 kinases, all targeted highly overlapping groups of genes. Perturbation of IRF1 resulted in only a subset of these changes (for example, STAT1 levels are unchanged after IRF1 perturbation), consistent with IRF1’s role downstream of STAT1 in interferon signaling53. In addition, for both the IFNγ and IFNβ pathways, we observed extensive conservation of each regulator’s downstream target genes across all six cell lines (Figure 3a,b). In contrast, we observed clear cell-type specificity in the response to TGFβ and insulin signaling (Figure 3c,d). In particular, we found that some regulators had substantial perturbation effects in all cell lines (e.g. TSC1/TSC2), while other regulators (e.g. IRS1/GRB2) exhibited highly cell type-specific effects (Figure 3d).
Figure 3. (a-d). Comparison of z-scores for top differentially expressed genes (DEGs) (y-axis) across different perturbations (x-axis) and across six cell lines. Each dot represents a unique combination of a perturbation, a cell line, and a DEG. The size of the dot represents the magnitude of its z-score produced by the Mixscale weighted DE test. (e). Overview of the MultiCCA decomposition method that extracts correlated perturbations within and across cell lines (Supplementary Methods). (f). Overview of the main regulators in the IFNγ pathway. (g, h). The first two perturbation programs for the IFNγ pathway, returned by MultiCCA decomposition. Each column indicates a combination of either a positive or negative regulator (upper labels) and a cell line (bottom labels), and each row indicates a top DEG from the program signature gene list. Additional perturbation programs are shown in Supplementary Figure 6.
While the presence of 1,596 molecular signatures reflects the richness of our dataset, we sought to define consistent patterns across perturbations and cell lines, reasoning that these conserved signatures would increase robustness and interpretability. To identify shared DEG patterns across perturbations and cell lines, we adapted the MultiCCA-based54 decomposition approach DIALOGUE55 into a pseudo-bulk-level decomposition method (Supplementary Methods). Briefly, MultiCCA identifies linear combinations of features (i.e. perturbation responses) that are highly correlated across different matrices (i.e. Perturb-seq experiments) (Figure 3e). We applied this approach separately to each of our five pathways, learning responses that we deem ‘perturbation programs’.
Importantly, MultiCCA can return multiple perturbation programs for each pathway, which can depict the hierarchical relationship shared by different regulators across cell lines. For example, when applied to the IFNγ Perturb-seq datasets, the first perturbation program represents a set of hundreds of canonical targets of the IFNγ pathway (Figure 3f), and is tightly conserved across cell lines (Figure 3g, Supplementary Table 3). Our approach also directionally links the expression of this program to specific regulators. We linked program 1 to the canonical positive IFNγ upstream regulators (IFNGR1, IFNGR2, STAT1, JAK1, JAK2), confirming that this program represents a comprehensive description of the canonical IFNγ signaling pathway (Figure 3f,g). We obtained similar results when applying our approach to IFNβ datasets (Supplementary Figure 6a), where we also identified a negative regulator of the first canonical program (USP18), which is known to inhibit signaling via competitive binding to IFNAR256,57.
Figure 4. (a). Overlap between IFNβ program 1 genes identified by our Perturb-seq experiment and the IFNβ signatures curated by the MSigDB Hallmark collection (Supplementary Methods). (b). Gene set enrichments for a set of DEG from IFNβ-stimulated monocytes (Supplementary Methods), using either Perturb-seq or MSigDB signature lists. Dashed line represents the Bonferroni-corrected threshold for statistical significance. (c). IFNβ module score comparing unstimulated and stimulated monocytes from an external dataset (Kang et al. 2018 Nat. Biotech) using the Perturb-seq unique gene set (left panel) or the shared Perturb-seq and MSigDB gene set (right panel). (d). Overlap of the IFNβ, IFNγ, and TNFα pathway genes identified by our Perturb-seq experiment. (e). Same as (b), but for pathway-exclusive gene sets. Only the Perturb-seq gene lists correctly identify significance for IFNβ pathway lists. (f-h). Gene set module scores for IFNγ pathway genes, IRF1-associated genes, and IRF1-independent genes calculated in an external dataset that includes IRF1-deficient patients (Supplementary Methods). The IRF1-associated genes and IRF1-independent genes are identified using the IFNγ program 1 and 2 in our Perturb-seq data (Supplementary Methods).
The second perturbation program for IFNγ specifically represents genes that are targets of a subset of more downstream regulators (Figure 3h). These include the positive regulator IRF1, which is known to act downstream of JAK/STAT signaling53, and to affect only a subset of pathway targets. We identified IRF2 as a negative regulator of this program, consistent with its known role in inhibiting IRF1-mediated signaling58,59.
Applied across our datasets, we identified 31 different perturbation programs (Supplementary Figure 6a-c, Supplementary Table 3). We note that for the INS pathway, MultiCCA failed to return clear perturbation programs due to extensive heterogeneity and minimal conservation in cell-type specific responses. To identify robust programs in this case, and to complement our cross-cell line program discovery using MultiCCA, we also used hierarchical clustering60 to group together shared perturbation responses for multiple perturbations within a given cell line (Supplementary Methods). We applied this approach to learn an additional 133 gene signatures (Supplementary Table 4), which we linked to the perturbation of both positive and negative regulators within individual cell types.
Evaluating the performance of Perturb-seq pathway signatures
We first compared our learned signaling modules with those present in MSigDB, a comprehensive resource for pathway enrichment analysis29. We specifically considered the “Hallmark” datasets for IFNβ, IFNγ, TNFα, and TGFβ signaling. While these signatures are highly curated and leverage multiple independent publications, the underlying datasets used for creation have different origins and are collected with diverse technologies. By contrast, the inherently multiplexed design and uniformity of our Perturb-seq data collection minimizes batch effects and facilitates the inference of robust signatures across multiple systems. Moreover, our Perturb-seq signatures were inferred from functional genetic perturbation experiments, rather than from the observational data that is often utilized in the construction of gene signature databases26–29. We therefore compared our perturbation programs against each of the four hallmark gene sets.
For each pathway, we observed overlap between our learned pathway signatures and hallmark MSigDB signatures, though we did observe discrepancies as well (Figure 4a, Supplementary Figure 7a). For example, for targets of IFNβ signaling, the two databases shared 73 genes, MSigDB uniquely identified 24 genes, and Perturb-seq uniquely identified 227 genes (Figure 4a). We found that the unique Perturb-seq genes exhibited extensive reproducibility across multiple cell lines (Supplementary Figure 7b). Existing literature also suggests that many of these 227 genes are regulated by IFNβ (e.g., BRIP161, RNF21361,62, DDX60L63). Of the MSigDB-unique genes, the vast majority (70.8%, 17 out of 24) exhibited low expression levels in our Perturb-seq data, with counts per million (CPM) under 20 (Supplementary Figure 7c). In comparison, only 0.4% (1 out of 227) of the Perturb-seq-unique genes had a CPM below 20. Similar observations were also made for other pathways (Supplementary Figure 7c), suggesting that the MSigDB-unique genes are either not as reliable indicators of signaling pathway activation, or cannot be accurately quantified by the Parse Evercode technology. We observed between 200 and 300 unique Perturb-seq genes for all benchmarked pathways (Supplementary Figure 7a).
We next evaluated the performance of these gene signatures to interpret experimental data from new biological contexts, but where ground truth validation of signaling pathway activation was available. For example, we considered a publicly available scRNA-seq dataset of human PBMC comparing both resting and IFNβ-stimulated cells64. From this data, we extracted CD14+ monocytes, inferred DEG of stimulated vs. resting cells, and performed enrichment analysis against pathway signatures from Perturb-seq or MSigDB (Supplementary Methods). As expected, we observed an enrichment for IFNβ pathway genes in both signature sets (Figure 4b), with a stronger signal using the Perturb-seq signature list. To further validate the comprehensiveness of our gene list, we also performed module score analysis on both the 73 shared genes and the 227 unique Perturb-seq genes in IFNβ-stimulated CD14+ monocytes (Supplementary Methods). Our results demonstrated that both gene lists effectively distinguished stimulated cells from controls, demonstrating that the additionally identified genes from Perturb-seq analyses were reproducible hallmarks of IFNβ stimulation in a new, in-vivo biological context (Figure 4c). In comparison, the 24 MSigDB unique genes showed more limited power to strongly distinguish between stimulated and control cells, and accurate discrimination was only possible for myeloid cell types but not lymphoid cell types (Supplementary Figure 7d).
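For readers unfamiliar with module scores, the idea can be sketched as the average expression of a signature gene set in each cell, relative to a background gene set. The version below is a deliberately simplified assumption: the scoring procedure actually used is described in the Supplementary Methods, widely used implementations additionally match background genes by expression level, and the function name `simple_module_score` and the toy matrix are hypothetical.

```python
import numpy as np

def simple_module_score(expr, signature_idx, background_idx):
    """Per-cell module score.

    expr:           (n_cells, n_genes) normalized expression matrix
    signature_idx:  column indices of the signature genes
    background_idx: column indices of background (control) genes
    Returns mean signature expression minus mean background expression.
    """
    expr = np.asarray(expr, dtype=float)
    sig_mean = expr[:, signature_idx].mean(axis=1)
    bg_mean = expr[:, background_idx].mean(axis=1)
    return sig_mean - bg_mean

# Toy example: cell 0 is resting, cell 1 has elevated signature genes.
expr = np.array([[1.0, 1.0, 0.0, 0.0],
                 [3.0, 5.0, 1.0, 1.0]])
module_scores = simple_module_score(expr, [0, 1], [2, 3])
```

A signature that captures pathway activation should yield higher scores in stimulated cells than in resting cells, which is the comparison made in Figure 4c.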
We also observed significant enrichment for alternative gene sets, including the IFNγ and TNFα pathways, using both Perturb-seq and MSigDB. This is unsurprising, as the targets of the three signaling pathways are known to overlap (Figure 4d), but illustrates the challenge in correctly inferring specific pathway activity from gene set databases. Importantly, the Perturb-seq datasets not only exhibited substantially stronger enrichment than MSigDB for IFNβ (p-value = 3.8×10⁻⁷⁰ vs. p-value = 4.6×10⁻⁴⁰), but also more clearly distinguished IFNβ from IFNγ signaling (p-value = 1.3×10⁻⁴⁷) (Figure 4b).
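One common way to obtain gene set enrichment p-values of this kind is a hypergeometric over-representation test, with a Bonferroni correction across the gene sets tested. The sketch below assumes that formulation purely for illustration; the test actually used here is specified in the Supplementary Methods. The function names and the toy numbers are hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(universe_size, signature_size, deg_size, overlap):
    """Upper-tail hypergeometric p-value.

    Probability of seeing at least `overlap` signature genes when
    `deg_size` genes are drawn without replacement from a universe of
    `universe_size` genes that contains `signature_size` signature genes.
    """
    upper = min(signature_size, deg_size)
    total = comb(universe_size, deg_size)
    tail = sum(
        comb(signature_size, k)
        * comb(universe_size - signature_size, deg_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests

# Toy example: 10-gene universe, 3 signature genes, 3 DEGs, all overlapping.
p_value = hypergeom_enrichment_p(10, 3, 3, 3)
threshold = bonferroni_threshold(0.05, 4)
```

Because overlapping pathways share many targets, a test like this can report significance for several related signatures at once, which is the ambiguity discussed above.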
Given the improved breadth and reproducibility of our Perturb-seq gene signatures, we considered that we could leverage these data to distinguish enrichment between closely related pathways (Supplementary Methods). We repeated the enrichment test on IFNβ-stimulated cells using three sets of genes: 139 genes that were shared between the Perturb-seq IFNβ and IFNγ pathways, 289 that were unique to IFNβ, and 198 that were unique to IFNγ. We found that only shared and IFNβ-exclusive gene sets exhibited enrichment, and there was no enrichment for IFNγ-exclusive targets (Figure 4e). We obtained similar results when comparing shared and unique genes for the IFNβ and TNFα pathways, which are also highly overlapping (Figure 4d). By contrast, when utilizing MSigDB signatures, exclusive gene sets for both IFNβ and IFNγ remained statistically enriched (Figure 4e).
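The partition into shared and pathway-exclusive gene sets is a set operation, sketched below. The gene identifiers are placeholders (not real pathway members), sized only so that the partition reproduces the counts reported above (139 shared, 289 IFNβ-exclusive, 198 IFNγ-exclusive). The function name `split_shared_exclusive` is hypothetical.

```python
def split_shared_exclusive(set_a, set_b):
    """Partition two gene signatures into shared and exclusive subsets."""
    a, b = set(set_a), set(set_b)
    return {"shared": a & b, "a_only": a - b, "b_only": b - a}

# Placeholder identifiers: 428 genes in pathway A, 337 in pathway B,
# overlapping on 139 of them.
pathway_a = {f"g{i}" for i in range(428)}
pathway_b = {f"g{i}" for i in range(289, 626)}
parts = split_shared_exclusive(pathway_a, pathway_b)
sizes = {name: len(genes) for name, genes in parts.items()}
```

Running the enrichment test on each of the three subsets separately is what allows a response to be attributed to one pathway rather than to its close relative.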
We repeated this ground-truth validation with three additional external datasets performing IFNγ stimulation of human PBMCs65, TNFα stimulation of DU145 (prostate cancer cell line)66, and TGFβ stimulation of OVCA420 (ovarian cancer cell line)66 (Supplementary Figure 8a-c). In each case, we were able to leverage our Perturb-seq datasets to correctly identify the underlying stimulation pathway, and to generate exclusive gene sets that correctly excluded enrichment for closely related pathways (Supplementary Figure 8a-c). Notably, in all four of these ground-truth validations, the external datasets utilized biological systems that were not present in our Perturb-seq datasets.
Lastly, we asked whether our datasets could be used not only to identify broad pathway programs, but also to identify the specific sub-programs driving cellular responses. We considered a recent bulk RNA-seq dataset of mycobacteria-exposed human fibroblasts from patients with an inherited IRF1 deficiency67. When compared to control populations, IRF1-deficient fibroblasts exhibited deficient IFNγ signaling responses (Figure 4f). However, when considering exclusive gene sets to separate IFNγ response genes into IRF1-associated and IRF1-independent groups (Supplementary Methods), we correctly identified that only IRF1-associated genes exhibited impaired transcriptional activation, while independent genes were unaffected (Figure 4g,h). Strikingly, we were able to correctly infer IRF1-specific enrichment even in datasets from non-human species, including RNA-seq data from virally infected IRF1-deficient bat cells68 (Supplementary Figure 8d). We conclude that our Perturb-seq signatures outperform existing databases, and can be used to sensitively infer signaling pathway activation in diverse external datasets.
Inferring signaling pathway activation for in-vivo and in-situ datasets
Having demonstrated the ability to correctly infer causal regulatory events from in-vitro experiments with ground-truth, we next aimed to extend these analyses to infer pathway activity in new in-vivo datasets. For example, the activation of interferon signaling has been widely reported to affect multiple cell types after SARS-CoV-2 infection69–74. However, both type I (IFNα/β)69–71 and type II (IFNγ)72–74 interferons are widely reported to contribute to cellular responses to infection. While it is traditionally challenging to disentangle the specific effects of these pathways, even with scRNA-seq datasets from healthy and SARS-CoV-2-infected samples, we reasoned that our Perturb-seq pathway signatures could help determine whether individual cell types were responding exclusively to type I, type II, or both interferon pathways.
To address this, we leveraged a large-scale scRNA-seq dataset from COVID-19 patients. We utilized a dataset of circulating immune cells from the COvid-19 Multi-omics Blood ATlas (COMBAT)75, and determined cell-type specific DEG for myeloid and lymphoid cell types (Supplementary Methods). Using the full pathway signatures from Perturb-seq, we observed strong enrichment in many cell types for both IFNβ and IFNγ targets (Figure 5a). Alongside general heterogeneity in patient responses (Figure 5b-c), we confirmed a depletion of interferon responses within critically ill patients, and localized this deficiency most strongly to T cell subgroups (Figure 5a). When searching for enrichment of exclusive gene sets, we only identified enrichment in IFNβ-specific groups, with no enrichment for IFNγ-specific groups (Figure 5a,d,e). Consistent with this result, when ordering patients by their molecular response to disease (Supplementary Methods), we observed a clear gradient of expression for IFNβ-specific gene sets, but not IFNγ-specific gene sets (Figure 5d,e). These results highlight the specific role of IFNβ in the response of circulating immune cells to SARS-CoV-2 infection.
(a). Enrichment test for DEG across COVID-19 severity groups from an external dataset (COvid-19 Multi-omics Blood ATlas, COMBAT). Rows represent gene sets from our Perturb-seq data, columns show cell types yielding DEG between disease and healthy cells. Dot size denotes the odds ratio, color intensity indicates the adjusted P-value (Benjamini-Hochberg; * indicates p < 0.01). (b, c). Expression heatmap for the top 30 IFNγ and IFNβ pathway genes (including shared genes). Each column represents pseudobulk expression of CD14 monocytes within each individual (Supplementary Methods). Columns are ordered by increased expression of a combined gene list of IFNγ and IFNβ genes. (d, e). Same as (b, c), but for pathway-exclusive gene lists. Only IFNβ-exclusive gene sets are coordinately up-regulated, consistent with the enrichment analysis in (a).
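Because one enrichment test is run per gene set and cell type, the resulting P-values are adjusted for multiple testing before being displayed. The following is a minimal sketch of the standard Benjamini-Hochberg procedure and of the 0.01 flagging threshold mentioned in the caption. The function names are hypothetical, and how tests are grouped for correction in the study is an assumption not specified here.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that each adjusted
    # value is the minimum of p * m / rank over all equal or larger ranks.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def flag_significant(pvalues, threshold=0.01):
    """Mark tests whose adjusted p-value falls below the threshold."""
    return [q < threshold for q in benjamini_hochberg(pvalues)]


# Toy p-values, one per (gene set, cell type) test
raw = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(raw))
print(flag_significant(raw))
```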
In addition to identifying disease-relevant pathways, we reasoned that our Perturb-seq datasets could also help to prioritize and pinpoint specific cell types that may be driving disease state. We first considered a comprehensive dataset of immune and epithelial cells from patients with Crohn’s disease (CD)76, an immune-mediated inflammation disorder known to primarily affect the gastrointestinal tract, leading to symptoms such as abdominal pain, severe diarrhea, and malnutrition77. Notably, anti-TNFα therapy has been repeatedly demonstrated to be an effective treatment for CD patients, but it is unclear if there are specific cell types that are responding to treatment.
The original manuscript computed DEG between healthy and CD colons for 54 cell types, identifying broad enrichment for inflammatory pathways without observing specific enrichment for TNFα signaling pathways76. Re-analyzing these DEG sets, we identified clear and specific up-regulation of TNFα targets, with minimal contributions from interferon-signaling pathways. TNFα-enrichment is observed primarily in non-immune subsets, including enterocytes, epithelial cells, goblet cells, and fibroblasts (Supplementary Figure 9). Interestingly, we found that only specific subgroups of these cells exhibited enriched TNFα signaling. For example, of five fibroblast subgroups, only two exhibited activation of TNFα targets. We observed similar heterogeneity in enterocytes and goblet cells, identifying specific sub-clusters with TNFα target-enriched DEG, alongside additional subclusters with no enrichment. These findings highlight specific subsets of non-immune cells that likely reflect promising cellular targets of anti-TNFα.
Finally, while our previous analyses focused on scRNA-seq datasets, gene signatures can be applied to broad genomic data types including spatial analyses. To demonstrate this, we explored recently generated 10x Visium spatial transcriptomic maps of the mouse colon, taken during the course of dextran sodium sulfate (DSS) colitis, which represents acute colonic injury and triggers a wound healing response78. We reasoned that we could implement our perturbation programs in order to better understand specific pathways and tissue regions that were influenced by disease pathology. For each of our Perturb-seq signaling programs, we computed a ‘signaling induction’ score for each region of the colon, based on the differential expression of signaling genes between voxels in the Day 14 (7 days of DSS administration followed by 7 days of healing) and Day 0 (healthy) samples (Supplementary Methods).
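As a rough illustration of a region-level induction score, the sketch below averages, over the signature genes, the difference in mean log-normalized expression between Day 14 and Day 0 spots of one region. This is only one plausible simplification: the actual score definition and the way regions are matched across samples are given in the Supplementary Methods and are not reproduced here. The function name `induction_score` and the gene identifiers are hypothetical.

```python
from statistics import mean


def induction_score(expr_day14, expr_day0, signature):
    """Mean per-gene difference in average expression (Day 14 minus Day 0).

    expr_day14, expr_day0: dict mapping gene -> list of log-normalized
    values across the spots of one region in each sample.
    Signature genes missing from either sample are skipped.
    """
    genes = [g for g in signature if g in expr_day14 and g in expr_day0]
    if not genes:
        raise ValueError("no signature genes found in both samples")
    return mean(mean(expr_day14[g]) - mean(expr_day0[g]) for g in genes)


# Toy region with two signature genes and one unrelated gene
day14 = {"geneA": [2.0, 3.0], "geneB": [1.0, 1.0], "geneC": [5.0, 5.0]}
day0 = {"geneA": [1.0, 1.0], "geneB": [0.5, 0.5], "geneC": [5.0, 5.0]}
print(induction_score(day14, day0, ["geneA", "geneB", "not_measured"]))
```

A positive score indicates that the signature genes are, on average, more highly expressed in the Day 14 region than in the healthy reference.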
We observed a clear enrichment of our inferred TGFβ signaling program in discrete and specific regions of the mouse colon (Figure 6a-c). The region of highest activation (cluster 12) was located towards the center of the tissue, and exhibited striking overlap with a region annotated as exhibiting signs of ‘inflammation and hyperplasia’ based on a pathologist’s analysis of hematoxylin and eosin staining of the tissue section78 (Figure 6b). Following the original study’s methods for ‘digitally unrolling’ the colon78 (Supplementary Methods), we further demonstrated that the induced signaling response was strongest at the distal end of the proximal-distal axis (Figure 6d,f). In this region, we observed minimal up-regulation of other inflammatory signaling pathways (Figure 6c), and minimal signal when using an existing literature TGFβ gene set from the original study78,79 (Figure 6e). TGFβ represents a critical component of wound healing, and is particularly important for driving cellular proliferation and the generation of new connective tissue80–82. Consistent with this, the regions associated with activated TGFβ signaling were also identified in the original study as being enriched for proliferating epithelial stem cells, which help to coordinate the healing response78. We conclude that our Perturb-seq derived gene sets can be used to infer spatially restricted patterns of signaling activation.
(a). Unsupervised transcriptomic clustering of the mouse healing intestine Visium dataset (Parigi et al. 2022 Nat. Comm.). (b). Overlap between the clusters with elevated TGFβ activation and the anatomical regions annotated as exhibiting signs of ‘inflammation and hyperplasia’ based on the pathologist’s analysis in the original study. (c). Enrichment analysis for DEG identified for different clusters in the mouse healing intestine (Supplementary Methods). Rows represent gene sets from our Perturb-seq data, columns show cell types yielding DEGs between disease and healthy cells. Dot size denotes the odds ratio, color intensity indicates the adjusted P-value (Benjamini-Hochberg; * indicates p < 0.01). (d, e). The TGFβ activation scores in the mouse intestine before unrolling using our Perturb-seq TGFβ gene set (d) and the PROGENy TGFβ gene set (e). (f). The TGFβ activation scores in the digitally unrolled mouse intestine using our Perturb-seq TGFβ gene set. The Visium spots shown in (a) are digitally flattened into a proximal to distal direction from left to right on the x-axis (Supplementary Methods).
DISCUSSION
In this study, we adapted a commercially available combinatorial indexing workflow to perform scalable Perturb-seq experiments, and utilized this approach to learn perturbation gene expression signatures across a diverse range of signaling pathways and biological contexts. To address the challenge of heterogeneous gene expression knockdown in CRISPRi experiments, we introduce Mixscale, which infers the level of transcriptome-wide perturbation in individual cells and enables a weighted DE testing framework for boosting statistical power. We utilized these data to identify reproducible pathway (and sub-pathway) level signatures that were conserved across multiple cell types. We found that our signatures broadly extended existing and widely used gene sets, and that they could be used to accurately infer signaling activities in-vivo when examining bulk, single-cell, or spatial transcriptomic datasets.
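The intuition behind using per-cell perturbation levels in DE testing can be sketched with a deliberately simplified stand-in: an ordinary least-squares regression of one gene's expression on a continuous per-cell perturbation score, with non-targeting cells assigned a score of zero. This is not the Mixscale implementation or its API; the function name `score_regression` and the toy values are hypothetical.

```python
from math import sqrt


def score_regression(expression, scores):
    """OLS fit of expression on per-cell perturbation score.

    Returns (slope, standard_error). A slope far from zero relative to its
    standard error suggests the gene responds to the perturbation.
    """
    n = len(expression)
    if n != len(scores) or n < 3:
        raise ValueError("need matching inputs with at least 3 cells")
    mx = sum(scores) / n
    my = sum(expression) / n
    sxx = sum((x - mx) ** 2 for x in scores)
    if sxx == 0:
        raise ValueError("perturbation scores have zero variance")
    sxy = sum((x - mx) * (y - my) for x, y in zip(scores, expression))
    slope = sxy / sxx
    intercept = my - slope * mx
    # Residual variance with n - 2 degrees of freedom
    rss = sum((y - intercept - slope * x) ** 2
              for x, y in zip(scores, expression))
    return slope, sqrt(rss / (n - 2) / sxx)


# Toy data: two non-targeting cells (score 0) and three perturbed cells
scores = [0.0, 0.0, 1.0, 1.0, 2.0]
expr = [4.0, 4.2, 2.9, 3.1, 2.0]
print(score_regression(expr, scores))
```

Compared with a binary perturbed-versus-control label, a continuous score lets weakly perturbed cells contribute less to the estimated effect, which is the general motivation stated above.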
Our study was inspired by previous efforts, including Genome-Wide Perturb-seq (GWPS)25, to learn large sets of gene signatures in a data-driven way. However, GWPS was mainly conducted in a single cell type, and the scalability of the study was driven by the need to profile more than 10,000 genetic perturbations. Our study does not pursue a genome-wide perturbation approach, but instead, our scalability constraints were driven by the large number of biological contexts, necessitated by the goal of learning signaling pathway signatures (which could not be observed by perturbing regulators at steady state). We addressed this challenge using the Parse Biosciences and Ultima Genomics platforms, but note that there are a series of pioneering combinatorial methods that enable larger scale Perturb-seq experiments83,84. In addition to the inherent multiplexing of Perturb-seq, where perturbed and non-targeting control cells are simultaneously profiled in the same experiment, performing Perturb-seq on fixed samples enabled us to multiplex different cell types and signaling pathways together, further reducing batch effects.
Our efforts are complementary to alternative approaches to learn gene expression signatures. For example, direct stimulation of cells with signaling ligands can also be used to measure transcriptome-wide signaling responses79,85–89. While this approach does not require additional genetic perturbations, it may also trigger the downstream activation of secondary pathways (which would be measured together with the primary targets), and cannot be used to infer sub-pathway level signatures. Notably, both approaches benefit from multiplexed single-cell profiling. For example, the Immune Cell Dictionary utilized multiplexed single-cell analysis to profile responses of immune cell types to 86 cytokines89. Given the rapidly growing number of identified cell types, multiplied by the vast possibilities for single or combinatorial genetic and environmental perturbations, we expect that the generation of response signature dictionaries will represent a primary use case for new massively scalable single-cell sequencing techniques.
The experimental design of this study introduces several limitations to the conclusions we can draw with our molecular pathway signatures. First, our coverage of biological systems is not comprehensive and represents only an initial set of cell types and stimulation conditions. As other groups generate additional data, these new datasets could be integrated with our own to generate improved gene signatures. Second, our experimental design collects cells after stimulation at a single time point, which prohibits the study of temporal signaling dynamics and presents another opportunity for future Perturb-seq experiments. Third, while our use of a single profiling technology minimizes batch effects, our signatures may exclude genes that are not well-quantified by Parse Biosciences, even with large numbers of cells90. Finally, while our primary goal in this study was to identify conserved pathway genes across different cell lines, our datasets also generated 141 cell-line specific response lists (Supplementary Table 4), enabling future studies to more deeply explore cell-type-specific responses.
In future studies, the experimental and computational framework we developed can be applied in different biological contexts and to different cellular processes to identify additional gene signatures. While our study focused on transcriptional signatures, incorporating additional modalities, such as chromatin accessibility91–93 and protein levels94, could provide further insights. We could also employ combinatorial perturbation technology95 to explore interactions between multiple regulators within and across pathways. All these would greatly enrich our understanding of signal transduction at multiple steps of the central dogma.
AUTHOR CONTRIBUTIONS
L.J., C.D., E.P., and R.S. conceived the research. C.D., E.P., I.M., H.H.W., and H.Y. performed experimental work. L.J. performed the computational work and developed the software tool with guidance from R.S. N.I., G.L.Y., and D.L. performed the Ultima sequencing and generated the simulated paired-end fastq data. All authors participated in interpretation and in writing the manuscript.
CONFLICT OF INTEREST STATEMENT
In the past 3 years, R.S. has received compensation from Bristol-Myers Squibb, ImmunAI, Resolve Biosciences, Nanostring, 10x Genomics, Neptune Bio, and the NYC Pandemic Response Lab. R.S. is a co-founder and equity holder of Neptune Bio. N.I., G.L.Y., and D.L. are employees and shareholders of Ultima Genomics. E.P. has been an employee at Parse Biosciences since December 2021 and owns stock in the company. The other authors declare that they have no competing interests.
DATA AND CODE AVAILABILITY
Processed data is available at Zenodo96 (https://zenodo.org/records/10520190). Software implementing our approach is freely available as an open-source R package Mixscale (https://github.com/longmanz/Mixscale). A vignette demonstrating the application of Mixscale is also available as an online resource (https://longmanz.github.io/Mixscale/).
SUPPLEMENTARY FIGURES
Schematic diagram of sgRNA capture, barcoding, and library preparation compatible with the Parse Biosciences Evercode™ Whole Transcriptome Mega kit. Please see Supplementary File 1 for a full protocol.
(a). Schematic showing the location of possible guide additive primer binding sites. (b). Primer efficiency for cDNA amplification, measured by qPCR with annealing temperature 65°C. Guide additive with phosphorothioate (*) bonds was ultimately chosen. (c). Primer efficiency of chosen guide additive primer for cDNA amplification, measured by qPCR at different annealing temperatures. (d). Percentage of cells assigned to each guide classification when using the original Parse cDNA amplification annealing temperature (some cycles at 67°C and some cycles at 65°C) compared to our modified annealing temperature (all cycles at 65°C). (e). RNA UMI counts and genes per cell when using the original Parse annealing temperature compared to our modified annealing temperature. (f). Percentage of cells assigned each guide classification when using different guide additive primer concentrations. (g). Percentage of unique gRNAs recovered when using different guide additive primer concentrations. (h). Guide UMIs per cell, split by guide classification, when using the original cDNA amplification SPRI bead cleanup ratio (0.8X) compared to our modified ratio (1X). We chose a 1X cleanup to help retain shorter guide transcripts captured by random hexamer reverse transcription primers.
(a) Scatter plots illustrating the relationship between the expression level of the perturbation targets (y-axis) and the perturbation scores (x-axis) in each cell. This plot is analogous to Figure 2b but this time cells are stratified by their guide RNA identities instead. (b) Single-cell heatmap for STAT1 perturbation in three cell lines after IFNψ stimulation, split by gRNA identities. (c) Comparison of Mixscale score and target gene expression estimated in an external CRISPRi dataset (Jost et al. 2020 Nat. Biotech). The figure displays the Mixscale score (y-axis on the left) using black dots and the degree of knockdown of the target gene (y-axis on the right) marked by red triangles. The x-axis represents different gRNAs, including the perfectly matched gRNA (“_00”) and those with varying numbers of mismatched nucleotides. Plot shows that gRNA that result in more effective knockdown also result in cells with higher Mixscale scores. (d) Comparison of Mixscale score and relative activity of gRNAs. Similar plot as in (a), but instead the figure contrasts the Mixscale score (black dots) with the relative activity of the gRNA (y-axis on the right) marked by blue diamonds, a phenotypic measure of cellular growth defects measured from a viability screen (Jost et al., 2020, Nat. Biotech). Plot shows that gRNA with the highest phenotypic activity also yield cells with the highest Mixscale scores.
(a). Comparison of false positive rates (FPRs) for the Mixscale weighted DE test (wmvReg) and a standard unweighted DE test. FPRs are calculated based on alpha ≤ 0.05, 0.01, and 0.005. Genes were simulated as null by shuffling cell gRNA labels, and shuffling was performed after calculating Mixscale scores (Supplementary Methods). (b). FPR calculations when shuffling is performed prior to calculating Mixscale scores. This situation is a conservative control, as we force the assignment of mis-specified Mixscale scores, though in a real dataset the lack of an initially identified DEG set would abort the procedure (Supplementary Methods). (c). Boxplots for Mixscale DE test scores χ² (= Z-score²) of the genes used in the calculation of these mis-specified scores, comparing methods with and without a leave-one-out (LOO) strategy (Supplementary Methods). The red dashed line indicates the expected χ² under the null. (d). Replication rate of the DEGs across two scRNA-seq replicates using different DE methods. Same as Figure 2e but includes an approach where Wilcoxon rank-sum tests are applied to Mixscape-derived labels. (e, f). Replication rate of DE genes that were uniquely identified by any DE method, compared to others. For example, in the IFNβ pathway dataset (Replicate 1), wmvReg and Wilcoxon uniquely identified 4,415 and 1,025 DEG, respectively, across regulators, and the plot shows the percentage that reproduce in the second replicate.
(a). -log10(Count per gene) comparison between Illumina and Ultima Genomics sequencing platforms. Data reflects pseudobulk values (all cells) for a Perturb-seq library sequenced by both Illumina and Ultima platforms, followed by the same alignment and pre-processing steps. 20 outlier genes are highlighted in red. (b). -log10(DE P-values) comparison between NT control cells and gRNA targeted cells, for the same library sequenced by either Illumina or the Ultima Genomics sequencing platform. The datasets are independently processed with Mixscale, and we observe high consistency across the sequencing platforms. (c). Distribution of the DE Z-scores for the outlier genes in panel (a). The red dashed lines are the Z-score threshold after Bonferroni correction for multiple testing (adjusted p-value = 0.05). Plot shows that genes that are differentially detected between Illumina and Ultima data no longer show differential expression when comparing NT and targeted cells within the same platform.
The first and second perturbation programs for IFNβ, TGFβ, and TNFα pathways, identified by MultiCCA. Each panel, (a) IFNβ, (b) TGFβ, and (c) TNFα, shows a heatmap where columns represent correlated perturbations within and across cell lines, and rows list the program’s top 15 down-regulated genes and top 5 up-regulated genes. As in Figure 3g, the color gradient in the heatmap cells reflects the DE test Z-scores for each gene under each perturbation. See Supplementary Table 3 for a complete list of pathway programs and the corresponding program genes.
(a). Venn diagrams showing the overlap between the MultiCCA program 1 genes we identified and the MSigDB Hallmark gene lists for IFNγ, TGFβ, and TNFα pathways (Supplementary Methods). (b). Density plot for the distributions of DE Z-scores for the Perturb-seq unique genes. The density plots are generated using the perturbations from a key regulator in each pathway (labeled in the figure titles) using our Perturb-seq datasets. Plot shows that the uniquely identified Perturb-seq genes are repeatedly identified across multiple cell lines. (c). Density plot for the log10(count per million) of MSigDB-unique, Perturb-seq-unique, and shared genes for each pathway, calculated in our Perturb-seq data (pseudobulk for all cells). The red dashed line indicates log10(CPM=20). Plot shows that genes identified by MSigDB have a very different expression profile than those either uniquely identified by Perturb-seq or shared between the two databases. (d). IFNβ module score comparing unstimulated and stimulated cells using the MSigDB unique gene set. Plot shows that the MSigDB-unique gene sets effectively discriminate stimulated and control cells in only some cell types, in contrast with Perturb-seq gene sets in Figure 4c.
(a-c). Evaluating complete pathway and pathway-exclusive gene sets. Plots are as in Figure 4b, but run on datasets of cells that are stimulated with a single cytokine as a positive control. (a) shows the results from human CD14 monocytes stimulated with IFNγ, (b) shows results from the DU145 cell line stimulated with TNFα, and (c) shows results from the OVCA420 cells stimulated with TGFβ (Supplementary Methods). In each case, our Perturb-seq pathway lists show enrichment, but there is also enriched signal for alternative pathways since pathway gene sets include overlapping genes. Once we restrict the analysis to pathway exclusive gene sets, only the correct pathway exhibits evidence of enrichment. (d). The module score for IFNγ pathway genes, IRF1-associated genes, and IRF1-independent genes calculated in an IRF1-KO bat PakiT03 cell dataset (Supplementary Methods). The IRF1-associated genes and IRF1-independent genes are identified using the IFNγ program 1 and 2 in our Perturb-seq data (Supplementary Methods).
(a). The gene set enrichment test for DEG identified for patients with Crohn’s disease (CD) in an external dataset (Supplementary Methods). The analysis includes inflamed and non-inflamed tissues from CD patients. Each row indicates a gene set from our Perturb-seq data, and each column indicates a cell type from which the DEGs are obtained. The enrichment test odds ratio is represented by the size of the dot, and the enrichment test adjusted P-value (after Benjamini-Hochberg correction) is represented by the gradient of the color. Adjusted P-values less than 0.01 are labelled with an asterisk.
ACKNOWLEDGEMENTS
We thank all members of the Satija Lab at New York Genome Center for useful discussion. We acknowledge the authors of the external datasets used in this study for making their valuable resources publicly available. This work was supported by the Chan Zuckerberg Initiative (EOSS5-0000000381, HCA-A-1704-01895 to R.S.), and the National Institutes of Health (RM1HG011014-02 and 1OT2OD033760-01 to R.S.).