Abstract
Pooled CRISPRi-mediated silencing of >1,000 transcriptional regulators expressed in single colorectal adenocarcinoma cells, followed by single-cell RNA-seq profiling at two timepoints, 1 day and 4 days, allowed reverse engineering the underlying tumor context-specific, causal regulatory network. Furthermore, the availability of experimentally derived, highly multiplexed gene reporter assays for each regulator, as identified by this analysis, allowed accurate assessment of differential protein activity following silencing of each regulator, thus providing proof-of-concept for generating comprehensive, tissue-specific networks of transcriptional and post-translational interactions. Analysis of this causal network allowed elucidation of complex autoregulatory mechanisms that have eluded previous computational approaches and supported systematic elucidation of cooperative mechanisms, where one regulatory protein can modulate the activity of another regulatory protein, as well as transcriptional mimicry, where one regulatory protein can phenocopy others.
Introduction
Precise control of transcriptional cell state by transcription factors and co-factors is critically required across all aspects of mammalian cell pathophysiology: from determining cell identity during lineage differentiation to providing mechanistic control of cancer cell plasticity. Yet, while global regulatory mechanisms involved in lineage specification have been experimentally elucidated in a few model organisms, such as C. elegans (Murray 2018) and sea urchin (Davidson, Rast et al. 2002), those relevant to pathophysiology of mammalian cells are only sparsely understood, mostly as a result of small-scale, hypothesis-driven assays (Takahashi 2017), or from sparsely validated computational reverse engineering algorithms (Basso, Margolin et al. 2005, Huynh-Thu, Irrthum et al. 2010). A key limitation of existing gene regulatory network (GRN) models is their inability to resolve the directionality (i.e., causality) of transcriptional interactions, thus limiting our ability to elucidate complex autoregulatory logic controlling cellular homeostasis, in both physiologic and pathologic cell states. Homeostatic control of cell state represents one of the most critical functions of GRNs, which is directly responsible for the adaptive behavior necessary to maintaining cellular identity and function in normal cell physiology as well as for the cell adaptation mechanisms that allow disease-related cells to adapt and escape pharmacological treatment. Indeed, elucidation of autoregulatory loops has been the Achilles’ heel of virtually all proposed computational approaches and has required complex, multi-gene experimental assays for identification and validation, even on a relatively small scale (Rajbhandari, Lopez et al. 2018).
Moreover, we and others have shown that the availability of accurate and comprehensive GRN models is critical for the utilization of network-based algorithms for the study of cellular phenotypes and functions. These include methodologies for (a) accurately measuring protein activity from RNA-seq profiles (Alvarez, Shen et al. 2016), including at the single-cell level (Obradovic, Chowdhury et al. 2021), (b) elucidating key regulators of cellular programs (Boorsma, Lu et al. 2008, Alvarez, Shen et al. 2016, Aibar, Gonzalez-Blas et al. 2017), (c) elucidating Master Regulators representing mechanistic determinants of cell state in cancer and other diseases (Carro, Lim et al. 2010, Aytes, Mitrofanova et al. 2014, Rajbhandari, Lopez et al. 2018), (d) predicting proteins capable of reprogramming cell-state (Dutta, Le Magnen et al. 2016, Arumugam, Shin et al. 2020), (e) dissecting drug Mechanism of Action (MoA) (Woo, Shimoni et al. 2015).
Unfortunately, since most GRNs have been computationally dissected (Basso, Margolin et al. 2005, Pe’er 2005, Wang, Saito et al. 2009) and only sparsely validated, there are still lingering concerns about whether they may truly recapitulate the underlying regulatory logic of the cell. Finally, virtually all computationally dissected GRNs have been reconstructed from steady-state molecular profiles - i.e., representative of cells whose dynamics are slow compared to the half-life of gene products. These GRNs will thus, by definition, miss key elements of dynamic/time-dependent gene regulation, including the ability to model autoregulatory loop-mediated oscillatory behavior following exogenous and endogenous perturbations (Ma, Wagner et al. 2005).
Unfortunately, in contrast to the experimental elucidation of protein-protein interaction networks (Rual, Venkatesan et al. 2005), the development of genome-wide experimental technologies for the reverse engineering of GRNs is challenging and remains elusive. Ideally, one would perturb each regulatory protein and assess its effect on the expression and activity of all other genes and proteins they encode, respectively, at multiple timepoints. While this would have been essentially unfeasible, due to cost and effort, until now, we were able to combine several recently-developed technologies into a fully integrated experimental pipeline for the systematic, single-cell-based dissection of GRNs. This Transcriptional Regulator Knockdown (TReK) pipeline leverages highly-multiplexed, pooled CRISPRi-mediated silencing of transcriptional regulators in single cells, followed by single-cell RNA sequencing (scRNA-seq) at multiple timepoints, including a short timepoint (24h) to capture transient regulatory interactions and a longer timepoint (96h) to capture regulatory interactions as cells start to achieve steady-states. While still coarse, we expect this to provide an initial framework to start revealing time-dependent regulatory cascades controlling critical genetic programs (Figure 1A).
To achieve optimal CRISPRi-mediated silencing of target genes, we combined an enhanced CRISPRi repressor (Yeo, Chavez et al. 2018) with improved single guide RNA (sgRNA) predictions (Sanson, Hanna et al. 2018) into pooled, single-cell CRISPRi screens with transcriptomic readout using the recently developed CROP-seq technology (Datlinger, Rendeiro et al. 2017). In addition, we developed an inducible version of Yeo et al.’s enhanced CRISPR repressor (Yeo, Chavez et al. 2018), thus allowing precise timing of cell harvesting and scRNA-seq profiling as early as 12h after repressor induction and gene silencing.
As a proof of concept, we applied TReK pipeline to dissect colorectal cancer-specific GRNs, via single-cell perturbation and scRNA-seq profiling of HT-29 cells. Specifically, we generated single-cell profiles of >150,000 HT-29 cells at two timepoints (24h and 96h), following pooled, CRISPRi-mediated, single-cell silencing of 1,007 transcriptional regulators, thus measuring 6 billion unique molecules (UMI-gene pairs), while dramatically reducing labor, assay costs, and potential batch effects. The genes silenced in this study were selected to include virtually all actively transcribed transcriptional regulators (hereafter TRs) - including transcription factors (TFs) and co-factors (co-TFs), such as (de)acetylases, (de)methylases, and other chromatin remodeling enzymes - that were detected as either significantly expressed or transcriptionally active in this cell line. Colorectal adenocarcinoma (COAD) was selected as a relevant tumor context because of lack of effective therapy and recent elucidation of Master Regulator proteins defining 8 molecularly distinct subtypes, including 3 subtypes highly enriched in MSI-high tumors and an aggressive, genomically-stable subtype characterized by very poor outcome (Paull, Aytes et al. 2021). Thus, we expected that a high-quality experimentally dissected GRN would be a first step towards the elucidation of critical COAD tumorigenesis and progression mechanisms.
When integrated into existing computational reverse engineering methods, the TReK data was instrumental in allowing significant improvement in GRN model accuracy, as objected assessed based on the ability to recapitulate changes in the activity of the proteins encoded by silenced genes both at 24h and 96h. These studies allowed reconstructing many established transcriptional and post-translational mechanisms, refining our understanding of protein-protein interactions in complexes, and most critically, reconstructing fully integrated, causal GRNs that include both transcriptional and post-translational interactions (Figure 1B).
Results
Large-scale CRISPRi-mediated TR silencing in HT-29 cells at two timepoints
To generate the perturbational data needed for the systematic dissection of COAD-specific molecular interactions, leading to a full GRN, we used assays that combine CRISPRi (Gilbert, Horlbeck et al. 2014) and CROP-seq (Datlinger, Rendeiro et al. 2017) technologies (Figure S1; Methods). Specifically, we measured changes in gene expression, within individual HT-29 cells, as measured by Chromium scRNA-seq profiles, following pooled CRISPR/dCas9-mediated silencing of each significantly expressed or active transcriptional regulator protein (see Methods) (Figure 1C).
Using this pipeline, we thus generated scRNA-seq profiles following silencing of 1,007 genes in HT-29 cells, at two timepoints (T1 = 24h, T2 = 96h), using three sgRNAs per gene. To select target genes, we first leveraged a manually curated repertoire of 2,527 transcriptional regulators (hereafter TRs) annotated as either TFs or co-TFs in Gene Ontology (Alvarez, Shen et al. 2016) and then sub-selected those that were either significantly expressed or active in HT-29 cells (see Methods). Thirty scrambled sgRNAs (sgCtrls) were also included as non-targeting negative controls, resulting in a library comprising 3,051 sgRNAs (Table S2). To achieve robust CRISPRi-mediated gene silencing, we engineered lentiviral plasmids supporting either constitutive or doxycycline-inducible expression of the enhanced repressor dCas9-KRAB-MeCP2 (Yeo, Chavez et al. 2018) (Figures S2A and S2B; Methods). To assess the actual length of effective gene silencing, we assessed the time required for the CRISPRi repressor to be transcribed and translated, following its doxycycline-inducible expression (Figure S2B; Methods). This shows that cells collected at 24h following dCas9-KRAB-MeCP2 induction (T1), have effectively undergone ∼12h of sustained repression. For the second timepoint, we used cells transduced with a constitutive dCas9-KRAB-MeCP2 expression plasmid (Figure S2C), harvested at 96h following lentivirally-mediated transduction with the sgRNA library (T2). HT-29 cells were selected also because, in mid-log growth, they have high transcriptome complexity, with a median of 35,000-42,000 UMIs/cell, at about 30% sequencing saturation and >90% of reads in cells, thus supporting follow-up analyses.
At each timepoint, >70,000 single-cell transcriptomes were sequenced (Figure 1D), followed by removal of low-quality profiles (<5,000 UMIs) and of profiles with two or more detected sgRNAs. More than 60% of the single cells passed this strict QC filter, thus producing a dataset with a median of 14-15 high-quality single-cell transcriptomes for each sgRNA (Figure 1E), yielding a total of 44-47 cells, on average, in which each target gene had been silenced by one of the three corresponding sgRNAs.
Achieving efficient, systematic gene silencing across a large pooled gene library is still a rather elusive goal with first-generation CRISPRi. Thus, it was critical to validate the performance of our second-generation CRISPRi implementation (Sanson, Hanna et al. 2018, Yeo, Chavez et al. 2018). To accomplish this, we performed a careful quality-control analysis. We analyzed the efficacy of gene silencing compared to controls (Figure S4), measured the correlation between different gRNAs targeting the same gene (Figures S5A and S5B), and compared our scRNA-seq gene silencing to published CRISPR knockout data (Figures S5G and S5H). In addition, we analyzed the factors that affect successful silencing (Figures S5C-S5F; Methods). Overall, we found that for 70% of tested TRs, we could achieve 50% silencing, sufficient to reliably identify targets, at both timepoints, with the later timepoint providing critical information for proteins with a long half-life.
Reconstruction of GRNs inferred from computational, experimental, and integrative analysis
As previously discussed, computationally inferred GRNs are typically reverse engineered from steady-state data, thus potentially reflecting broad co-expression patterns that may not be representative of direct regulatory events. Moreover-with the exception of Dynamic Bayesian Networks (Smith, Yu et al. 2006), which have never been validated in a mammalian context-, these methodologies are unable to elucidate auto-regulatory control mechanisms and fail to adequately represent regulators proteins whose expression represents an inadequate proxy of transcriptional activity.
For instance, while the ARACNe algorithm (Margolin, Nemenman et al. 2006) effectively eliminates indirect interactions using the Data Processing Inequality theorem and has been extensively experimentally validated, it still suffers from the above-mentioned limitations. To objectively assess the ability of TReK assays to improve GRN reconstruction quality, we compared three networks, including (a) purely computationally-inferred ARACNe networks inferred from the TReK scRNA-seq profiles at T1 or T2, (b) a purely experimental network assembled from genes that are differentially expressed following silencing of each regulator’s silencing at T1 or T2, and (c) an integrated network that combines both computational and experimental analyses (see Methods). For the latter, we modified a two-branch generalized linear model (GLM) that was previously used for single-cell differential expression analysis (Finak, McDavid et al. 2015) (Figure 2A; Methods). The integrated network combines ARACNe p-values for individual regulator-target interactions as well as the differential expression of the target following silencing of the regulator in the TReK assays, thus accounting for both interaction directionality and strength. Following GLM analysis, 65% and 61% of the TR targets inferred at T1 and T2 were derived from TReK data, respectively, with the remaining 35% and 39% coming from ARACNe.
Integrative GRN reverse engineering outperforms both computational and experimental methodologies
To assess network quality, we combined two metrics including (a) detection of regulatory protein activity decrease in the subset of cells where they were effectively silenced, (b) consistency with DNase-seq data (Mei, Qin et al. 2017, Zheng, Wan et al. 2019). For the first metric, we assessed protein activity using the VIPER algorithm, as determined by the differential expression of its target genes as represented in each of the three networks (computational, experimental, and integrated). A highly accurate network would show gene expression increase of induced targets and decrease of repressed targets in cells where the regulator is silenced. Thus, the more accurate the network is, the more accurately the activity of a TR should be measured by the differential expression of its regulon genes.
For this analysis, we selected 250 TRs, whose CRISPRi-mediated silencing produced ≥ 50% silencing at both timepoints, in at least 15 cells/sgRNA (see Methods), as the gold standard set, based on the assumption that their transcriptional activity would be significantly affected by silencing and presented in a sufficient number of cells to support the analysis. We then assessed GRN accuracy by 5-fold cross-validation within each timepoint (see Methods), based on VIPER-based TR inactivation assessment based on the differential gene expression signature of cells where the TR was effectively silenced vs. negative control cells transduced with non-targeting sgRNAs. In addition, we also performed cross-timepoint validation—i.e., assessing how well GRNs inferred at one timepoint could improve protein activity inference accuracy at the other timepoint (see Methods). The analysis clearly shows that while the experimental GRN outperforms the computational one, especially at the shorter timepoint (24h) the integrated GRN significantly outperforms both (Figure 2B). These results are consistent with expectations that computational GRNs will perform better as cells achieve steady-state and will thus fail to identify transient regulatory events that may be compensated by autoregulatory loops. It also suggests that the pure experimental data may be noisy and thus introduce false positive and false negative targets that are then optimally refined by the computational analysis.
To understand the factors that most contributed to improved GRN performance, we first tested whether TRs inducing a larger signature of significantly differentially expressed genes would outperform those with smaller effect size. Indeed, inferred protein activity accuracy was significantly correlated with the number of differentially expressed genes (p < 2.7e-9) (Figure 2C). We also tested whether silencing the most likely Master Regulators of cell-state would induce stronger transcriptional signatures. Specifically, we first inferred the differential activity of all tested TRs in HT-29 against the average of all CCLE cell lines (Ghandi, Huang et al. 2019) using VIPER with the integrated network (see Methods). We then measured the correlation between their VIPER-inferred differential activity in HT-29 cells and their differential activity following CRISPRi-mediated silencing. Correlations were significant (p = 2.4e-3 at T1; p = 4.1e-9 at T2) suggesting that silencing critical regulators of cell-state would induce stronger transcriptional signatures (Figure 2D). Finally, across 4 cross-validations, 27 of the 38 TRs (71%) that were silenced in ≥60 cells were significantly inactivated based on VIPER analysis (Figures 2C and 2D), suggesting that TR targets inference—and thus activity measurement—could be further improved by increasing the number of cell profiles in TReK assays. Indeed, saturation analysis shows that the VIPER-measured protein activity of most TRs sampled in ≥ 60 cells converged to a narrow range as an increasing number of cells were included in the analysis (Figure S6).
For the second metric, we show that the consistency of target gene genomic loci with open chromatin, as assessed by DNase-seq in HT-29 cell lines (Mei, Qin et al. 2017, Zheng, Wan et al. 2019), significantly (p = 1.52e-8 at T1; p = 3.43e-7 at T2) improves in the integrated GRN at both timepoints, compared to the ARACNe computational networks (Figure 2E).
Based on these results, the integrated computational/experimental GRNs was selected as the optimal one in all subsequent analyses.
Integrated GRNs capture multi-layer molecular interactions
A key goal of cellular network reverse engineering has been the assembly of molecular interaction networks that integrate distinct mechanisms of gene-product interaction. GRNs capture transcriptional regulation effects, mediated by regulatory proteins binding the regulatory region of a target gene, either directly or via cognate binding partners. However, these networks fail to capture the post-transcriptional effect of protein X on protein Y. For instance, two transcription factors may bind the same DNA-binding motif (e.g., ARNT and MYC binding E-boxes) thus inducing a stoichiometric mediated dependency where over/under-expression of one (e.g., ARNT) may induce activity increase/reduction of the latter (i.e., MYC), respectively (Wang, Saito et al. 2009). Similarly, a chromatin remodeling enzyme may result in epigenetic silencing of the targets of another TR. Finally, a TR may regulate a ubiquitin ligase adapter protein, thus inducing proteasomally-mediated degradation of another TR (Chen, Alvarez et al. 2014). A unique advantage of the TReK data is that it supports dissection of both transcriptional and post-transcriptional interactions, based on the VIPER inferred activity of each TR, following silencing of another TR, allowing reverse engineering of the first integrated, multi-layer TRN (transcriptional regulatory network) and ARN (activity regulatory network) for a human cell.
To further improve sensitivity, we integrated the T1 and T2 ARN networks, by taking the union of the regulons inferred at each timepoint, while removing mode-inverted interactions at T2—i.e., interactions detected as positive at T1 and repressive at T2, or vice versa (see Methods). This produced an integrated ARN comprising 23,607 activity-regulating interactions (false discovery rate, FDR < 0.1) involving 949 of the 1007 TRs. Since protein activity can be regulated by many complementary mechanisms, we studied several TR interaction mechanisms in greater detail. Specifically, we considered four non-mutually exclusive mechanisms that may account for ARN interactions, including transcriptional regulation, direct or indirect relationships in the same pathway, physical protein-protein interactions, and co-regulation of the same target genes.
To assess transcriptionally mediated ARN interactions—i.e., interactions where silencing of TRx affects the activity of TRy because TRx transcriptionally regulates TRy, we assessed how much of the differential activity of each TR could be accounted for by the differential expression of its encoding gene. Overall, only 2.9% (n = 685) of inferred ARN interactions could be accounted for by this mechanism (Figures 3A and 3B). Second, an examination of each TR’s pathway membership showed that ∼44% (n = 10,347) of ARN interactions were detected between same-pathway TRs. Third, ∼7.0% (n = 1,650) of ARN interactions were between structural cognate binding partners, as reported in protein-protein interaction databases (see Methods). Finally, a small fraction (4.2%, n = 980) of the interactions were explained by significant overlap between regulons (see Methods). In addition to ARN interactions, the TRN comprised n = 121,414 interactions (FDR < 0.2) detected by differential expression analysis, including n = 9,812 TR-TR interactions.
Analysis of integrated TRN/ARN networks
Integration of TRN and ARN interactions into a multi-layer network (MLN = TRN + ARN) contributes to a more holistic understanding of biological mechanisms. For instance, we used the more comprehensive T2 network to analyze an MLN sub-network comprising MYC and six established MYC regulators in the Wnt and TGF-β pathways (Figures 3C-3E). Both pathways are known to play critical roles in COAD, contributing to increased activity of the MYC proto-oncogene, by repression of TGF-β signaling and hyperactivation of Wnt signaling (Cancer Genome Atlas 2012). Our analysis recapitulated key, well-established interactions, such as the reciprocal activation (positive feedback loop) between TCF7L2 and CTNNB1 (Brantjes, Barker et al. 2002), which is a crucial step in the activation of the Wnt signaling pathway. Notably, the essential proto-oncogenes (MYC, CTNNB1, and TCF7L2) form a highly interconnected module with multiple activating interactions (Figure 3E, red edges). In contrast, the non-essential tumor suppressors SMAD2 and SMAD3 are identified as repressors of the MYC-centric oncogenic module (Figure 3E, blue edges). While some of these interactions are well-established in the literature (Yagi, Furuhashi et al. 2002, Rennoll and Yochum 2015), our analysis identified several novel interactions, such as the MYC-mediated transcriptional activation of TCF7L2 and CTNNB1, and the negative feedback loops formed by SOX9 with MYC, CTNNB1, and TCF7L2, which further increase our understanding of TGF-β /Wnt-mediated MYC dysregulation in COAD. Not surprisingly, consistent with the fact that SMAD4 has a nonsense mutation in HT-29, according to the COSMIC database (Tate, Bamford et al. 2019), we did not detect SMAD4 interactions with other proteins.
Analysis of the full MLN identified a preponderance of ARN vs. TRN TR interactions (n = 23,607 vs. n = 9,812, respectively) (Figure S7A). To further delineate transcriptional and post-transcriptional regulation in the ARN, we removed ARN interactions that were explained by a corresponding TRN interaction, this process resulted in n = 22,922 post-transcriptional interactions, including activating vs. repressive TR interactions (n = 20,278 vs. 12,456, respectively) (Figure S7B). The resulting networks showed statistically significant overlap with both MultiNet (p = 2.94e-19) (Khurana, Fu et al. 2013) and HumanNet (p = 3.86e-60) (Hwang, Kim et al. 2019) (Table S3).
Auto-regulatory loops are significantly over-represented in the multi-layer network
Auto-regulatory feedback represents a ubiquitous regulatory mechanism in biological systems, and some associated feedback loop motifs are significantly overrepresented (Alon 2007). A unique aspect of the TReK assays is that they allow direct, causal dissection of autoregulatory loops because each TR is independently silenced, thus breaking the upstream but not the downstream logic of the loops in which it is involved.
2-Protein Loops
Systematic analysis of feedback loops comprising 2 proteins in the MLN identified 1,816 instances, comprising 3 different motifs: positive feedback (two positive interactions), negative feedback (one positive and one repressive interaction), and genetic switches (two repressive interactions) (Figure 3F). Compared to the null distribution obtained by randomizing MLN edges, while preserving its overall topology, we observed significant overrepresentation of all loop motifs (3.3x, 2.4x, 1.89x) (Figure 3G). Interestingly, compared to the total number of 2-protein loops, positive feedback loops were further over-represented, while negative feedback loops and genetic switches were slightly under-represented (p < 2.2e-16 by chi-square test).
3-Protein Loops
In addition, 10,260 3-protein loops were identified across 4 different motifs (Figure 3H), depending on the number of positive and negative interactions (Figure 3I). Similar to 2-protein loops, we observed overrepresentation of positive loops (p < 8.94e-12 by chi-square test).
Loops comprising proteins with greater differential activity in HT-29 cells were more likely to be detected. Indeed, our analysis showed that loops comprising proteins in the first quintile of all TRs, ranked based on their differential activity in HT-29 against the average of CCLE, formed almost twice as many loops compared to other proteins (Figure 3J). This suggests that the proteins responsible for implementing and maintaining cell-state homeostasis are more likely to participate in feedback loops.
Variable knockdown efficacy and time-course data reveal systematic use of autoregulatory feedback
While variable sgRNA silencing efficacy is normally seen as a potential limitation of the CRISPRi technology, in the context of GRN reverse engineering, it can actually help further elucidate the presence of regulatory feedback (auto-regulation). In the absence of auto-regulatory feedback, the more effective the silencing of a TR, the stronger the activation or repression of its targets. However, auto-regulatory loops may induce significant, time-dependent oscillations in regulatory targets expression or activity.
For instance, of the 3 sgRNAs targeting MYC, two induced moderate (60%) and weak (17%) silencing, respectively, while the third failed to produce any silencing. Gene Set Enrichment Analysis (GSEA) (Subramanian, Tamayo et al. 2005) confirmed that enrichment of MYC targets in genes under-expressed following its silencing tracked with silencing efficacy (Figures 4A-4C). Similar correlation between silencing efficacy and target gene expression in the associated pathways was observed for many other genes, such as CTNNB1/Wnt and MYBL2/cell cycle (Figure S8). More interestingly, however, for some TRs, we observed virtual inversion of gene expression signature behavior as a function of sgRNA efficacy, implicating the existence of strong autoregulatory mechanisms. E2F7, a transcriptional repressor of the E2F family (de Bruin, Maiti et al. 2003, Di Stefano, Jensen et al. 2003, Westendorp, Mokry et al. 2012), provides an interesting example of such behavior. Under normal circumstances, repressing E2F7 activity will cause upregulation of its target genes (Segeren, van Rijnberk et al. 2020). Indeed, three sgRNAs targeting E2F7 induced significant E2F target upregulation (p = 0.01, 0.037, and 0.23, respectively) at the earlier timepoint (T1) (Figures S9A-S9C). However, at the later timepoint (T2 = 96h) the guide with the highest efficacy induced complete inversion of target expression (mode inversion), with significant E2F target downregulation (p < 0.001) (Figures 4D, 4E, and S9D). Since E2F targets, which are involved in cell cycle progression, are known to undergo oscillatory behavior (Westendorp, Mokry et al. 2012), consistent with overcompensation and mode inversion following its most significant silencing. Similar to E2F7, we observed highest-efficacy vs. low-efficacy mode inversion for 57 additional TRs at T2, suggesting their participation in strong autoregulatory loops. For 10 (>17.5%) of those highest-efficacy sgRNAs we observed mode inversion between T1 and T2 (Table S4). This is 7-fold what would be expected in the null hypothesis (see Methods).
Regulation of Hallmark Cellular Processes
To elucidate regulators of fundamental cellular processes, as cell achieve stable state, we assessed enrichment of genes differentially expressed at 96h following silencing of each TR in genes comprising 50 “hallmark” cellular processes by GSEA analysis (Liberzon, Subramanian et al. 2011, Liberzon, Birger et al. 2015). Potential mode of regulation was categorized as activation if silencing a TR caused downregulation of a cellular process, repression if silencing a TR caused upregulation of a cellular process. To address the effect of potential auto-regulatory loops (see previous section), we included a third category called “mixed” if both downregulation and upregulation were observed after silencing a TR with different sgRNAs (see Methods). To minimize false discovery rates, we only considered TRs for which at least 2 sgRNAs produced concordant enrichment. While this undoubtedly prevents detection of bona fide hallmark process regulators, for which there was only 1 effective sgRNA, the analysis revealed many COAD-specific hallmark process regulators, including both established and previously uncharacterized ones. For example, considering 16 TRs in the Wnt/TGF pathway (Figure 4F), we recapitulated regulation of both mTORC1 signaling and DNA repair by MYC (Hironaka, Factor et al. 2003, Karlsson, Deb-Basu et al. 2003, Yue, Jiang et al. 2017), of spermatogenesis by ID2 (Sablitzky, Moore et al. 1998), and of epithelial-mesenchymal-transition (EMT) by β catenin (Vu and Datta 2017). Indeed, cursory search of the literature supported 24 of the 55 inferred hallmark regulatory interactions associated with the 16 TRs (Table S5). Thus, availability of this analysis to all 1,007 TRs provides a powerful new resource to study regulatory processes and mechanisms in this COAD (Table S6).
Identification of upstream regulators and co-factors
The TReK assays also support identification of regulatory hierarchies and co-transcriptional factors. For instance, as expected, CRISPRi-mediated silencing of MAX, an established co-factor and modulator of Myc activity (Dang 2012), significantly affected MYC target expression (p < 0.001) (Figures S10A-S10D). To further confirm enrichment of potential MYC modulators in TRs that affect expression of MYC tatgets, we first assessed all TRs whose silencing affected the targets of a second TR, using the CINDy algorithm (Giorgi, Lopez et al. 2014), CINDy is an improvement of the original MINDy algorithm (Wang, Saito et al. 2009), which uses the conditional mutual information I[TR;T|M] between a TR and its targets T, given the expression of a candidate modulator M to predict TR activity modulators. Experimental validation rates for these algorithms have been in the 70%-80% range and they have been shown to recapitulate 60% to 70% of known upstream pathway interactions in lung adenocarcinoma and lymphoma (Wang, Saito et al. 2009, Giorgi, Lopez et al. 2014).
We found significant overlap between CINDY and TReK-based predictions of candidate MYC-activity modulators (Figure S10E), with 115 of the 289 TReK-predicted ones also predicted by CINDY (p = 0.014 by Fisher’s exact test). MLN-based characterization of these 115 TRs in the MLN identified 5 as upstream MYC transcriptional regulators, 13 as upstream post-transcriptional regulators, and 27 as physical interactors, based on PrePPI protein-protein interactions (Zhang, Petrey et al. 2013) (Figure S10F).
Discussion
While experimental and computational methodologies for the dissection of protein-protein interaction (PPI) networks have achieved relative maturity (Zhang, Petrey et al. 2012, Rolland, Tasan et al. 2014), the inference of transcriptional and post-transcriptional interactions is still broadly debated, despite the role they have played in elucidating novel biological mechanisms, disease drivers, and phenotype (Chen, Alvarez et al. 2014, Alvarez, Shen et al. 2016, Rajbhandari, Lopez et al. 2018, Arumugam, Shin et al. 2020). Among others, current limitations include: (a) the relatively low value of DNA-binding motifs, given their ubiquitous presence in open chromatin regions, the fact that most TFs do not bind DNA via their canonical motif, and the uncertainty of whether binding induces functional regulation (b) the inability of computational methods to assess causality and directionality thus preventing detection of autoregulatory loops playing a key role in cell homeostasis and (c) the very sparse experimental validation of transcriptional targets. This is largely due to the fact that technologies for the systematic RNA-seq profiling of specific cellular contexts following perturbation of each transcriptional regulator (TR) have been lacking or have been excessively labor-intensive and costly.
By combining two technologies, including 2nd generation, inducible, CRISPRi-mediated gene silencing and CROP-Seq, supporting single-cell RNA-seq profiling with knowledge of the specific single guide RNA(s) in each cell, we have developed a Transcriptional Regulator Knockdown methodology (TReK) for the systematic, experimentally-based reconstruction of gene regulatory (GRN) and activity regulatory (ARN) networks. GRNs comprise molecular interactions where a specific TR affects the expression of another gene, while ARNs comprise molecular interactions where a specific TR affects the transcriptional activity of another TR.
While the current implementation of TReK provides a critical proof-of-concept of the technology, its application to a colorectal cell line HT-29 already overcomes critical limitations of current methodologies for GRN inference, while allowing construction of previously unavailable ARNs representing key, biologically relevant post-transcriptional and post-translational interactions. In particular, TReK provides key experimental evidence supporting TR target inference based on changes in their expression following CRISPRi-mediated TR silencing at two timepoints. The rationale for the two timepoints is multifold. First, this allows incorporating proteins with different half-lives in the analysis. Second, this allows detection of autoregulatory effects where the targets of a TR are inversely correlated in their expression at the two timepoints; critically, decoupling TRs from their upstream regulation, via direct biochemical perturbation, allows deconvoluting both interaction directionality (i.e., A→B vs. A←B) and autoregulatory loops (e.g., A→B→A), which has been challenging using computational approaches. The data presented here, which include analysis of >150,000 single-cell profiles representing CRISPRi-mediated silencing of 1,007 expressed TRs in HT-29 cells at 24h and 96h, not only confirms previous interactions, such as those between MYC and WNT/ β-catenin pathway proteins in colon cancer, but identifies thousands of novel interactions and experimentally assessed effects of individual TRs on cancer hallmarks, including >15,000 experimentally dissected 2- and 3-protein loops contributing to cellular homeostasis.
Critically, while still limited, this study provides key indications on the number of cells that will be necessary for future implementations of the TReK assays to ensure accurate representation of all transcriptional regulators. Specifically, saturation analysis (Figure S6) suggests that ≥ 60 cells per silenced TR would be sufficient to stabilize protein activity predictions based on TR targets, using the VIPER algorithm, thus opening the way to dramatic improvements in the assessment of mechanistic determinants of cellular phenotypes, not only in cancer but across all accessible cell states. In addition, similar studies targeting signaling proteins would also be possible, allowing the activity of TRs to be used as gene reporter assays for their upstream signaling modulators as well as for deconvoluting the complete logic of signal transduction, including autoregulatory loops. The cost of a TReK experiment is still relatively high; however, the results of this proof-of-concept experiment combined with the advent of additional technologies such as scifi-RNA-seq (Datlinger, Rendeiro et al. 2021) and the ever-decreasing cost of sequencing suggests that a full TReK experiment including all TRs and signaling proteins (∼6,500 proteins in total) at 4 timepoints and with an average of 60 cells per silenced protein, could be performed for less than $20,000, thus providing an entirely novel perspective on transcriptional and post-translational regulation. Furthermore, inclusion of proteins representing targets of FDA-approved or late-stage development small molecular compounds, combined with their genome-wide perturbational profiles (Woo, Shimoni et al. 2015), could provide a remarkable ability to deconvolute drug mechanism of action and to identify drugs capable of activating or inactivating any desired regulatory program, as previously shown using computationally based GRNs, only with dramatically improved accuracy and sensitivity.
Among the limitations of the current approach, one should consider that perturbations may be implemented in single cells in different stages of cell cycle. As a result, cell cycle-dependent regulatory events may require additional regression of cell cycle stage. However, with ≥ 60 cells per silenced TR, such regression would become eminently possible. Additionally, current sgRNAs present highly different gene silencing efficiency, with about 10% to 20% of the genes not silenced at all. However, this can be effectively addressed by performing a smaller-scale TReK experiment to first select sgRNAs that provide optimal silencing, thus saving significant effort and cost in the larger experiments by removing the need to sequence a large number of cells where the target gene is not effectively silenced.
Analysis of the GRN and ARN networks already produced key novel findings. However, much as it has happened for PPI networks, comprehensive assessment of the TReK GRN/ARN networks will only be possible as they are explored by biologists with precise, phenotypically driven questions.
Methods
Experimental procedures
Cloning and validation of the pLX-UCOE-SFFV-dCas9-KRAB-MeCP2-P2A-BSD plasmid
To clone the pLX-UCOE-SFFV-dCas9-KRAB-MeCP2-P2A-BSD plasmid, we prepared the vector backbone by incubating 5 ug of UCOE-SFFV-dCas9-BFP-KRAB (Addgene #85969) for 2 h at 37 °C with 2 ul of FastDigest BamHI (ThermoFisher cat. no. FD0054) and 2 ul of FastDigest SbfI (ThermoFisher cat. no. FD1194) in a total volume of 50 μl 1× ThermoFisher Green FastDigest Buffer. The digested product was run on a 0.8% agarose gel. The 13,792 bp fragment was cut under UV-free blue light and purified using the QIAquick Gel Extraction Kit (QIAGEN cat. no. 28704). The KRAB-MeCP2-P2A-BSD cassette was synthesized by Integrated DNA Technology as 1,634 bp dsDNA fragments with a BamHI restriction site on 5’ end and a SbfI restriction site on 3’ end. The synthesized product was digested by incubation for 1 h at 37 °C with 1 ul of FastDigest BamHI (ThermoFisher cat. no. FD0054) and 1 ul of FastDigest SbfI (ThermoFisher cat. no. FD1194) in a total volume of 20 μl 1× ThermoFisher Green FastDigest Buffer. The digested product was run on a 1.6% agarose gel. The 1,606 bp fragment was cut under UV-free blue light and purified using the QIAquick Gel Extraction Kit (QIAGEN cat. no. 28704). The vector backbone and KRAB-MeCP2-P2A-BSD cassette were ligated by the following T4 ligase reaction: 100 ng of vector backbone, 50 ng of KRAB-MeCP2-P2A-BSD cassette, 2 μl 10× Thermo Scientific T4 DNA Ligase Buffer, 2 μl 50% PEG 4000 solution, 1 μl T4 DNA Ligase (Thermo Scientific cat. no. EL0011) and water up to 20 μl, incubated at room temperature for 1 h. 5 μl of the ligation product were chemically transformed into Invitrogen™ One Shot™ Stbl3™ Chemically Competent E. coli (Invitrogen cat. no. C737303) following the manufacturer’s high-efficiency protocol. We screened for correctly assembled clones by colony PCR and further validated the assembly by restriction digestion with FspI, SbfI, AgeI and BamHI as well as Sanger sequencing.
Cloning and validation of the pLX-TRE-dCas9-KRAB-MeCP2-P2A-BSD plasmid
To clone the pLX-TRE-dCas9-KRAB-MeCP2-P2A-BSD plasmid, we first prepare the vector backbone by incubating 5 ug of PB-TRE-dCas9-VPR (Addgene #63800) for 2 h at 37°C with 2 ul of FastDigest FspAI (ThermoFisher cat. no. FD1664) and 2 ul of FastDigest PmeI (ThermoFisher cat. no. FD1344) in a total volume of 50 μl 1× ThermoFisher Green FastDigest Buffer. The digested product was run on a 0.8% agarose gel. The 10,992 bp was cut under UV-free blue light and purified using the QIAquick Gel Extraction Kit (QIAGEN cat. no. 28704). The truncated-dCas9-KRAB-MeCP2 cassette was synthesized by Integrated DNA Technology as 2,450 bp dsDNA fragments with a FspAI restriction site on 5’ end and a PmeI restriction site on 3’ end. The synthesized product was digested by incubation for 1 h at 37 °C with 1 ul of FastDigest FspAI (ThermoFisher cat. no. FD0054) and 1 ul of FastDigest PmeI (ThermoFisher cat. no. FD1194) in a total volume of 20 μl 1× ThermoFisher Green FastDigest Buffer. The digested product was run on a 1.6% agarose gel. The 2,432 bp fragment was cut under UV-free blue light and purified using the QIAquick Gel Extraction Kit (QIAGEN cat. no. 28704). The vector backbone and truncated-dCas9-KRAB-MeCP2 cassette were ligated by the following T4 ligase reaction: 100 ng of vector backbone, 100 ng of truncated-dCas9-KRAB-MeCP2 cassette, 2 μl 10× Thermo Scientific T4 DNA Ligase Buffer, 2 μl 50% PEG 4000 solution, 1 μl T4 DNA Ligase (Thermo Scientific cat. no. EL0011) and water up to 20 μl, incubated at room temperature for 1 h. 5 μl of the ligation product were chemically transformed into Invitrogen™ One Shot™ Stbl3™ Chemically Competent E. coli (Invitrogen cat. no. C737303) following the manufacturer’s high-efficiency protocol. We screened for correctly assembled clones by colony PCR and further validated the assembly by restriction digestion with SalI, FspAI and PmeI as well as Sanger sequencing.
To clone the pLX-TRE-dCas9-KRAB-MeCP2-P2A-BSD plasmid, the TRE-dCas9-KRAB-MeCP2-P2A cassette (7,500 bp) was amplified from PB-TRE-dCas9-KRAB-MeCP2 and purified using protocol described above. The pLX vector backbone was amplified as 3 fragments (BSD-WPRE-3’LTR-AmpR cassette as 3,066 bp fragment, ori-pSV40 cassette as 2,942 bp fragment, 5’LTR-RRE-cPPT/CTS cassette as 3,543 bp fragment) from pLX-UCOE-SFFV-dCas9-KRAB-MeCP2-P2A-BSD and purified using the protocol described above. Primers for amplification were designed to contain >15bp homology arm on each side with adjacent fragments. The 4 fragments were assembled using Gibson’s isothermal assembly: 500ng of total amplified fragments were combined with 10 μl of NEBuilder HiFi DNA assembly master mix (NEB cat. no. E2621L) and water to 20 μl. After 1 h of incubation at 50 °C, reactions were purified by AMPure XP bead clean-up (Beckman Coulter cat. no. A63880), and 5μl of the ligation product were chemically transformed into Invitrogen™ One Shot™ Stbl3™ Chemically Competent E. coli (Invitrogen cat. no. C737303) following the manufacturer’s high-efficiency protocol. We screened for correctly assembled clones by colony PCR and further validated the assembly by restriction digestion with AdeI, FspAI and PmeI as well as Sanger sequencing.
Cloning and validation of the CROPseq-CaptureSeq-Guide-Puro plasmid
To clone the CROPseq-CaptureSeq-Guide-Puro plasmid, we prepared the vector backbone by incubating 5 ug of CROPseq-Guide-Puro (Addgene #86708) for 2 h at 37 °C with 2 ul of FastDigest Mph1103I (ThermoFisher cat. no. FD0734) and 2 ul of FastDigest Eco105I (ThermoFisher cat. no. FD0404) in a total volume of 50 μl 1× ThermoFisher Green FastDigest Buffer. The digested product was run on a 0.8% agarose gel. The 9,766 bp fragment was cut under UV-free blue light and purified using the QIAquick Gel Extraction Kit (QIAGEN cat. no. 28704). The trcrRNA-CaptureSeq-5’LTR(truncated) cassette was synthesized by Integrated DNA Technology as 520 bp dsDNA fragments with 20 bp of homology on each end of the digested vector backbone, respectively.
The vector backbone and trcrRNA-CaptureSeq-5’LTR(truncated) cassette were assembled using Gibson’s isothermal assembly: 100ng of vector backbone and 20ng of trcrRNA-CaptureSeq-5’LTR(truncated) cassette were combined with 10 μl of NEBuilder HiFi DNA assembly master mix (NEB cat. no. E2621L) and water to 20 μl. After 1 h of incubation at 50 °C, reactions were purified by AMPure XP bead clean-up (Beckman Coulter cat. no. A63880), and 5 μl of the ligation product were chemically transformed into Invitrogen™ One Shot™ Stbl3™ Chemically Competent E. coli (Invitrogen cat. no. C737303) following the manufacturer’s high-efficiency protocol. We screened for correctly assembled clones by colony PCR and further validated the assembly by restriction digestion with Mph1103I, AdeI and BsmbI as well as Sanger sequencing.
Cloning of individual sgRNAs into the CROPseq-CaptureSeq-Guide-Puro plasmid
sgRNA cassettes were annealed from two oligonucleotides (top: 5’-CACCG(N)19-20-3’ bottom: 5’-AAAC(N)19-20-3’) by combining 1 μl of each 100 μM oligonucleotide with 1 μl of 10× T4 ligation buffer (NEB cat. no. B0202S), 6.5 μl of water, and 0.5 μl of T4 polynucleotide kinase (NEB cat. no. M0201S), incubating as follows: 37 °C for 30 min (oligonucleotide phosphorylation), 95 °C for 5 min, then ramping from 90 °C to 25 °C at 5 °C/min. Vector backbone was prepared by digesting and dephosphorylating 5 μg of CROPseq-CaptureSeq-Guide-Puro with 5 μl of FastDigest Esp3I (ThermoFisher cat. no. FD0454) and 2 μl of FastAP Thermosensitive Alkaline Phosphatase (ThermoFisher cat. no. EF0651) in a total volume of 50 μl 1× ThermoFisher Green FastDigest Buffer, incubating for 1 h at 37 °C. The digested product was run on a 1.6% agarose gel. The 8,361 bp fragment was cut under UV-free blue light and purified using the QIAquick Gel Extraction Kit (QIAGEN cat. no. 28704).
Ligation reactions were set up as follows: 60 μg of CROPseq-CaptureSeq-Guide-Puro backbone, 1 μl gRNA cassette (diluted 1:200 in water), 2 μl 10× Thermo Scientific T4 DNA DNA Ligase Buffer, 2 μl 50% PEG 4000 solution, 1 μl T4 DNA Ligase (Thermo Scientific cat. no. EL0011) and water up to 20 μl, incubated at room temperature for 1 h. The ligation reaction was chemically transformed into Invitrogen™ One Shot™ Stbl3™ Chemically Competent E. coli (Invitrogen cat. no. C737303) following the manufacturer’s high-efficiency protocol.
Cloning of pooled sgRNA libraries into the CROPseq-CaptureSeq-Guide-Puro plasmid
Vector backbone was prepared by digesting 5 μg of CROPseq-CaptureSeq-Guide-Puro with 5 μl of FastDigest Esp3I (Thermo Scientific cat. no. FD0454) in a total volume of 50 μl 1× Thermo Scientific FastDigest Green Buffer, incubating for 1 h at 37 °C. The digested product was run on a 1.6% agarose gel. The 8,361 bp fragment was cut under UV-free blue light and purified using the QIAquick Gel Extraction Kit (QIAGEN cat. no. 28704).
sgRNA insert fragments were synthesized by TWIST Bioscience as 74 nt oligonucleotides with 18 and 35 nt of homology to the hU6 promoter and guide RNA scaffold, respectively. Oligonucleotides were diluted to 1 ng/ul and pooled in equal amounts. The oligonucleotide pool was further amplified by mixing 12.5 μl KAPA HiFi HotStart 2× ReadyMix (KAPA Biosystems cat. no. KK2601), 0.75 μl of 10 μM forward primer, 0.75 μl of 10 μM reverse primer, 2 ng oligonucleotide pool and water up to 25 μl and incubating as follows: 95 °C for 3 min, 9× (98 °C for 20 s, 63 °C for 15 s, 72 °C for 15 s), 72 °C for 1 min, hold at 4 °C. PCR product was run on a 2% agarose gel. The 74 bp fragment was cut under UV-free blue light and purified using the QIAquick Gel Extraction Kit (QIAGEN cat. no. 28704).
sgRNA libraries were cloned by Gibson’s isothermal assembly: 22 fmoles (113.7 ng) of CROPseq-CaptureSeq-Guide-Puro backbone and 400 fmoles (18.31ng) of amplified dsDNA oligonucleotides were combined with 10 μl of NEBuilder HiFi DNA assembly master mix (NEB cat. no. E2621L) and water to 20 μl. After 1 h of incubation at 50 °C, reactions were purified by AMPure XP bead clean-up (Beckman Coulter cat. no. A63880), and 10 μl of the reaction was electroporated into 50 μl of Lucigen Endura E. coli cells (Lucigen cat. no. 60242-1) using prechilled 1 mm electroporation cuvettes (BioRad cat. no. 1652089) in a Lonza Amaxa™ Nucleofector™ II device set to Bacteria Program 4. Within seconds after the pulse, 1 ml of 37 °C Lucigen Recovery Medium was added and the bacteria were grown in a round-bottom tube for 1 h at 37 °C while shaking at 200 r.p.m. Then, 1 ml of the bacterial culture was plated on a 25 × 25 cm bioassay plate containing LB medium (Miller) with 100 μg/ml carbenicillin. Plates were incubated at 30 °C for 20 h, then LB medium was added and bacteria colonies were scraped off the plate. Bacterial cells were pelleted by 30 min of centrifugation at 3,000 RCF at 4 °C, and plasmid DNA was extracted with multiple ZymoPURE II Plasmid Midiprep columns (Zymo Research cat. no. D4200). Library coverage was estimated by counting the number of bacterial colonies on a 1:1,000 dilution plate and a 1:10,000 dilution plate. All the sgRNA libraries were cloned with at least 500× coverage.
Lentivirus production for single sgRNAs and pooled CROP-seq screens
HEK293T cells were plated into 100mm dishes at 2 million cells per dish in 12 ml of culture medium (DMEM (Gibco cat. no. 10569010), 10% Tetracycline-free FBS (Omega Scientific cat. no. FB-15), no antibiotics) and grown to reach 50% to 70% confluence. 6ml of culture medium was removed from the dishes and the cells were transfected with lipofectamine 3000 (Invitrogen cat. no. L3000015) using either 10.2 μg of CROPseq-CaptureSeq-Guide-Puro (containing single sgRNAs or libraries) or dCas9 expressing plasmid, and 4.5 μg each of the three packaging plasmids pMDLg/pRRE (Addgene #12251), pRSV-Rev (Addgene #12253), and pMD2.G (Addgene #12259). The medium was exchanged for fresh culture medium 16 h after the transfection. The supernatant containing viral particles was harvested at 30 h and passed through a 0.45 µm PVDF Millex-HV Syringe Filter (Millipore cat. no. SLHV033RS) to remove cells and debris. Viral particles were further concentrated by 40min centrifugation at 4,000 r.p.m. using Amicon Ultra-15 Ultracel-100 Centrifugal Filters (Millipore cat. no. UFC910024). The concentrated virus was aliquoted and stored at −80 °C.
Production of constitutive/inducible dCas9 expressing cell lines
For adherent cell lines HT-29, cells were seeded into 100mm dishes at 5 million cells per dish in 12 ml of complete culture medium (McCoy’s 5A Modified Medium for HT-29 (Gibco cat. no. 16-600-082) with 10% Tetracycline-free FBS (Omega Scientific cat. no. FB-15), no antibiotics). The cells were then transduced with 8 μg/ml polybrene (Millipore cat. no. TR-1003-G) and the proper amount of lentivirus containing dCas9 vector to ensure Multiplicity of Infection (MOI) equal to 1. An extra dish served as the untransduced control. After addition of the virus, cells were incubated overnight at 37 °C, 5% CO2. At 24 h post-transduction, selection with 20 μg/ml blasticidin (Research Product International cat. no. B12200-0.05) began. After 5 days of blasticidin selection, live cells were trypsinized and seeded into 100mm dishes at 2 million cells per dish in fresh complete culture medium containing 10 μg/ml blasticidin to allow for cell number amplification while renewing the medium (containing blasticidin) every 3 days.
Lentiviral transduction with sgRNA libraries or single sgRNAs
For adherent cell lines HT-29, cells were seeded into 100mm dishes at 5 million cells per dish in 12 ml of complete culture medium (McCoy’s 5A Modified Medium for HT-29 (Gibco cat. no. 16-600-082) with 10% Tetracycline-free FBS (Omega Scientific cat. no. FB-15), no antibiotics) containing 10 μg/ml blasticidin (Research Product International cat. no. B12200-0.05) to maintain selection for dCas9 expression. The cells were then transduced with 8 μg/ml l polybrene (Millipore cat. no. TR-1003-G) and the proper amount of lentivirus containing sgRNA libraries to ensure Multiplicity of Infection (MOI) equal to 0.3. An extra dish served as the untransduced control. After addition of the virus, cells were incubated overnight at 37 °C, 5% CO2. At 24 h post-transduction, selection with 8 μg/ml puromycin (AG Scientific cat. no. P-1033) began. After 3 days of puromycin selection, live cells were trypsinized and seeded into 100mm dishes at 2 million cells per dish in fresh complete culture medium containing blasticidin and 4 μg/ml puromycin. After overnight incubation at 37 °C, 5% CO2 to allow for reattaching, the cells are ready for doxycycline induction.
Lentivirus titration for single sgRNAs
Cells were seeded into 24-well plates at 50,000 cells per well in 500 μl of proper culture medium and grown overnight to reach 30% to 50% confluence. The next day, medium was exchanged for 450 μ per well of fresh culture medium containing 8 μg/ml polybrene (Millipore cat. no. TR-1003-G), which was also used to dilute the viral stock. Lentivirus aliquots were thawed from storage at −80 °C and titrated in a 1:2 dilution series ranging over six wells (1:2 to 1:32). Each dilution was tested in duplicate by adding 50 μl per well. At least two wells per dilution served as un-selected controls and at least two wells per plate served as untransduced controls. 24 h after the transduction, the medium was exchanged for 500 μl per well of selection medium prepared as described above, every 2–3 days. The blasticidin selection took 5 days and the puromycin selection took 3 days for HT-29. As soon as the selection was completed and all cells in the untransduced controls had died, cell viability was measured by CellTiter-Glo® Luminescent Cell Viability Assay (Promega cat. no. G7570). The virus titer (transducing units per ml) was calculated as follows: (initial number of cells × percentage of viable cells × dilution factor) × (1,000 μl /50 μl).
Quantification of dCas9 by Western Blot
To evaluate inducible dCas9 expression, HT-29 cells were transduced with pLX-TRE-dCas9-KRAB-MeCP2-P2A-BSD as described above. Transduced HT-29 were then plated in RPMI medium (Gibco cat. no. 21875-034) on 6-well plates at 0.5 million cells per well. Cells were recovered overnight and doxycycline (Fisher Scientific cat. no. BP26535) was added to culture medium at 1 μg/ml . At 6, 9, 12 and 24 h post doxycycline induction, respectively, cells were lysed in 300 μl RIPA buffer (Fisher Scientific cat. no. 50-843-016) with protease inhibitor (Fisher Scientific cat. no. PI78430). Protein concentration was measured by Pierce™ BCA Protein Assay Kit (Thermo Fisher cat. no. 23225). 20 ug of total protein from each lysate was loaded and separated on a NuPAGE 4%–12% Bis-Tris Gel (Invitrogen cat. no. NP0335BOX). Subsequently, the protein was transferred onto a nitrocellulose membrane (Fisher Scientific cat. no. IB23001) by iBlot 2 Dry Blotting System (ThermoFisher), which was then blocked by blocking buffer (Fisher Scientific cat. no. PI37536) for 1 h at room temperature. The blocked membrane was incubated overnight at 4 □ in primary antibodies: mouse polyclonal anti-Cas9 (Takara Bio, cat. no. 632607), mouse monoclonal anti-β-actin (8H10D10) (Cell Signaling Technology, cat. no. 3700S). After incubation, the membrane was washed 3 times with PBS-T buffer and then incubated at room temperature for 1 h with secondary antibodies: mouse monoclonal anti-rabbit IgG-HRP (Santa Cruz Biotechnology, cat. no. sc-2357), goat monoclonal anti-mouse IgG HRP (Santa Cruz Biotechnology, cat. no. sc-2031). The membrane was then washed 3 times with PBS-T buffer, activated by Pierce™ ECL Western Blotting Substrate (Fisher Scientific cat. no. PI32106), and imaged by X-Ray film. The film was then scanned and digital images were processed using ImageJ.
Computational procedures
Preprocessing of single-cell sequencing data
Single-cell sequencing data were aligned using CellRanger v3.0.2 to the human reference genome assembly (Ensembl GRCh38 release) with default parameters. CellRanger pipeline uses STAR for alignment. Briefly, reads aligned to exons are tagged with their respective gene names (annotated by Ensembl Gene ID). Then counts of unique molecular identifier (UMI)-deduplicated reads per gene within the same cell were counted to build a gene expression matrix comprising all cells with UMI counts. For downstream analysis, low expression genes–defined by being detected in less than 10% cells–are discarded. We found this step is critical for controlling noise. Furthermore, as quality control at single-cell level, cells with either low transcriptome complexity (<5000 UMI counts) or multiple sgRNAs were discarded. Cells with high mitochondrial reads fraction (>15%) were excluded in differential expression analysis and further downstream analyses.
Assignment of sgRNA to single-cell transcriptome
In the setting of this project, we enabled the detection of two types of sgRNA transcripts by incorporating Capture Sequence (Replogle, Norman et al. 2020) into the CROP-Seq sgRNA expressing vector CROP-Seq-GuidePuro (Datlinger, Rendeiro et al. 2017). Polyadenylated sgRNA transcripts synthesized by RNA polymerase II were captured by oligo-dT primers and subsequently sequenced as part of whole-transcriptome library. Non-polyadenylated sgRNA transcripts synthesized by RNA polymerase III were captured by CaptureSeq primers and were separately prepared as Feature Barcode libraries following 10X Genomics protocol (User Guide for Chromium Single Cell 3u Reagent Kits v3). Feature Barcode libraries were separately indexed and sequenced as spike-ins alongside the whole-transcriptome single-cell RNA-seq libraries. Final UMI and cell barcode assignments were made for each Feature Barcode read by alignment with CellRanger v3.0.2, as was done for the whole transcriptome libraries.
To computationally detect both types of sgRNA transcripts, we adapted and modified the sgRNA identification approach described in (Hill, McFaline-Figueroa et al. 2018). We allowed a maximal Hamming distance of 1 between the protospacer sequences extracted from the sequencing reads and the input sgRNA library. To account for background sgRNA reads resulting from low level of sgRNA transcripts released from lysed cells, we adopted a 2-Gaussian Mixture Model strategy described in CellRanger’s “CRISPR Algorithm” (https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/algorithms/crispr) and in (Replogle, Norman et al. 2020). We then made sgRNA assignments separately for whole-transcriptome library and Feature Barcode library. For each library, a cell was considered to express only one sgRNA if the most abundant sgRNA was five times more abundant than the second most abundant sgRNA; they were considered to express multiple sgRNAs if this ratio was smaller than five. To consolidate the sgRNA assignment from the two libraries, cells were considered to express unique sgRNA if 1) the same unique sgRNA was detected in the two libraries or 2) only one sgRNA is detected in one library and no sgRNA was detected in the other. Cells were considered to express multiple sgRNAs if 1) multiple sgRNAs were detected in either library or 2) different sgRNAs are detected in the two libraries.
This dual sgRNA detection strategy helps more thorough detection of sgRNA transcripts in single cells and effective removal of multiplets. We performed TReK experiments using MOI=0.3 to ensure that most of the single cells express only one sgRNA. As expected, for the majority (roughly 60% across two timepoints) of unique-sgRNA cells, the same sgRNA was identified from the two sgRNA-containing libraries. Our data also shows that while two libraries generated highly consistent sgRNA-cell association maps, each can detect sgRNAs the other strategy cannot detect in a small subset of cells (Figures S3E-S3G).
We observed that cells expressing only one sgRNA are enriched in the 10,000-50,000 UMI count range (Figures S3C and S3D). On the lower UMI count side, the majority of cells do not have any sgRNA detected due to lower transcriptome complexity. On the higher UMI count side, a significant proportion of cells have multiple sgRNAs detected, indicating that these transcriptomes likely resulted from droplets with multiple single cells, considering that the proportion of cells expressing multiple sgRNAs are theoretically minimal with low MOI transduction. The association between abnormally high transcriptome complexity and the detection of multiple sgRNAs indicates that with the dual sgRNA detection strategy, we were able to retrospectively remove potential doublets and triplets—a source of significant noise in single-cell sequencing--by simply removing cells with multiple sgRNAs.
Determination of on-target knockdown efficacy
To measure the extent to which each sgRNA reduces the expression of its target gene (knockdown efficacy), we compared cells expressing the corresponding sgRNA to cells expressing scrambled sgRNAs (sgCtrls). For each sgRNA, we estimated its target gene expression level by aggregating all cells that express the corresponding sgRNA. We also estimated the unperturbed expression level by aggregating the same number of randomly sampled cells that express sgCtrls. The sampling was done 100 times and the median value was taken. The knockdown efficacy of the sgRNA was then calculated as the fraction of target gene expression (normalized as count per million, or CPM) that was reduced due to the presence of the corresponding sgRNA.
Broadly, we observed acceptable silencing for most genes, with 1,877/3,016 sgRNAs (62%) achieving ≥ 50% knockdown and 1,275/3,016 sgRNAs (42%) achieving at least 80% knockdown at 96h (Figures S4A and S4B; Table S7; Methods). Since 3 sgRNAs per TR were used, most TRs were silenced at different silencing efficacies (Figures S4C and S4D). This turned out to be a useful feature, for instance, to dissect compensatory, dosage-dependent effects (Figure 4). When comparing silencing efficacy between timepoints, we generally observed a stronger knockdown at T2 = 96h vs. T1 = 24h (Figures S4A and Figure S4B). While this difference may be partly due to differences in the strengths of the constitutive and inducible promoters, we believe that much of the reduced efficacy at T1 is due to a combination of two factors: First, the dCas9-KRAB-MeCP2 abundance necessary for effective knockdown may vary from gene to gene. Second, mRNA degradation kinetics and the size of the extant mRNA pool can vary widely across genes.
Differential expression analysis
To perform differential expression analysis for each sgRNA perturbation, a pseudo-bulk sample was generated by aggregating the transcriptomes of all the single cells expressing that particular sgRNA. A pseudo-bulk gene expression signature was obtained by comparing the expression level of each gene against 200 reference pseudo-bulk samples as 1) z-scores generated by viperSignature function from viper R package and 2) log fold change. The reference pseudo-bulk samples were obtained by aggregating the transcriptomes of randomly sampled cells expressing other sgRNAs. To control for the variation resulted from cell number, each reference pseudo-bulk sample was aggregated from the same number of cells as the “tested” pseudo-bulk sample.
To perform differential expression analysis for each perturbed TR (as shown in Figures 2C, 3, S5G, and S5H), we first grouped cells expressing any of the 3 sgRNAs targeting that particular TR. These cells were then sorted by the normalized expression level of the perturbed TR. The top 50% of the cells are then selected to represent the TR-perturbed cell population. (If two cells have the same level of TR expression, the cell expressing more efficient sgRNA was selected. If two cells express the same sgRNA, the cell with a higher total UMI count was selected.) The differential expression analysis for the particular TR was then conducted in the same way as described in sgRNA-wise differential expression analysis.
Assessing the reproducibility of TReK differential expression
An unbiased approach to assess gene knockdown reproducibility is to compare different technologies. Searching the published literature, we found that RNA-sequencing of HT-29 cells following CRISPRi-mediated TCF7L2 silencing had been performed to investigate this gene’s role in colorectal cancer invasion and migration (Wenzel, Rose et al. 2020). Impressively, even when pooling only 17 or 21 cells (at T1 and T2, respectively), the overlap in terms of differentially expressed genes between our data and Wenzel et al. data is highly significant (empirical p < 0.001) (Figures S5G and S5H).
We were interested in the factors that affect the correlation between the differentially expressed gene signatures induced by different sgRNAs targeting the same gene. Not surprisingly, our analysis confirmed that the TRs with the most correlated gene signatures (Figures S5A and S5B) tended to be those with multiple high-efficiency sgRNAs (Figure S5C). Additionally, TRs with multiple, high-efficiency sgRNAs affected a greater fraction of downstream transcriptional targets (Figure S5D) and were characterized by slightly higher coverage (cells per sgRNA) (Figure S5E).
We were also interested in the effect of protein turnover on differential gene expression at early timepoints. For instance, one would expect that proteins with low turnover should affect a greater number of downstream target genes at T2 vs. T1 compared to proteins with high turnover. To investigate, we collected published data on protein stability as measured by mass-spectrometry (Cambridge, Gnad et al. 2011). Although we could find protein stability data for only a relatively small fraction of the TRs silenced in our assays, we observed that more stable proteins tend to have a weaker differential expression at the earlier timepoint T1 compared to T2 and thus a higher intra-gene sgRNA correlation at T2 vs. T1 (Figure S5F).
Gene Set Enrichment Analysis
One-set Gene Set Enrichment Analysis (GSEA) was implemented as described in (Subramanian, Tamayo et al. 2005). Briefly, the gene expression signature of interest was sorted and scanned by calculating an enrichment score (ES) starting with the most upregulated gene. If the encountered gene was present in the gene set, the running ES was increased. If the encountered gene was not present in the gene set, then the running ES was decreased. The final ES was determined by the maximum running ES and the leading edge subset was defined as the genes that were encountered before reaching maximum ES. The statistical significance of the ES was calculated by permuting differential expression-ordered genes 1,000 times, computing the ES to generate a null distribution, and comparing the unpermuted ES score to the null distribution of ES. Normalized enrichment score (NES) was calculated as the z-score transformed ES against the null distribution. An empirical p-value was calculated as the fraction of permuted ES that were more extreme than the actual ES.
For two-set GSEA (Figures S5G and S5H), the query gene set was divided into two subsets: a positive subset containing genes that were positively correlated with the gene set term, and a negative subset containing genes that were negatively correlated with the gene set term. The gene expression signature of interest was sorted and scanned as described in the one-set GSEA process. However, the ES for the positive and negative subset, respectively, were calculated separately and subsequently added together. The NES and empirical p-values were computed as described above.
Determination of mode of regulation (Figure 4F, Table S6)
To determine the mode of regulation of a TR within a specific pathway, GSEA analysis was conducted on the gene expression signature of all 3 sgRNAs targeting the particular TR. If the NES of at least two sgRNAs were statistically significant (p < 0.05) and have the same sign, the NES of the third sgRNA was taken into account. If the third sgRNA perturbation showed statistically significant NES with the same sign or 2) showed no statistically significant NES, the mode of regulation was determined by the first two sgRNAs. If the third sgRNA perturbation showed statistically significant NES with opposite sign, the mode of regulation was determined as “mixed”.
ARACNe network reconstruction
For each sgRNA, a pseudo-bulk sample was generated by aggregating transcriptomes of 15 single cells expressing that particular sgRNA. To maximize the transcriptome coverage, we included sgRNAs covered by more than 15 cells. This procedure generated 1,447 pseudo-bulk expression profiles for 14,773 transcriptionally informative genes (defined as being expressed by >10% samples) from T1 dataset and 1,310 pseudo-bulk expression profiles for 18,426 transcriptionally informative genes from T2 dataset. These expression profiles were used to generate ARACNe networks (“purely computational networks”) for transcription factors (TFs), coTFs and signaling proteins as previously described in (Basso, Margolin et al. 2005).
Hurdle model-based generalized linear model
Hurdle model was firstly described in (Cragg 1971) and applied in the context of single-cell RNA-seq in (Finak, McDavid et al. 2015) as MAST (Model-based Analysis of Single-cell Transcriptomics) approach. Here we modified MAST to be compatible with the single-cell CRISPRi perturbation setting. Normalized gene expression (count per million or CPM) was modeled as a generalized regression model (GLM) with two branches, one in binary space and the other in continuous space. For a specific gene, the probability of detecting non-zero CPM in a cell was modeled by binomial distribution, whereas the exact CPM|CPM>0 was modeled as Poisson distribution.
Besides the intercept unit in the GLM, we included two covariates in the regression. The first covariate represents the expression level of the TR of interest. When its expression level was unavailable due to low expression, protein activity inferred by VIPER was used to estimate the abundance of the TR. Secondly, as several publications found that the consideration of “cellular detection rate” (CDR), defined as the proportion of genes detected in a single cell, significantly improved the differential expression analysis in single-cell sequencing context (Finak, McDavid et al. 2015, Dixit, Parnas et al. 2016, Soneson and Robinson 2018), we included CDR as the second covariate.
Regression coefficients were estimated and regularized as described in (Finak, McDavid et al. 2015). For the binary model, coefficients were regularized using informative priors (Gelman et al., 2008) from which information from the ARACNe network can be incorporated. In practice, we assigned a 4x larger prior variance to any TR-target associations that were identified by the ARACNe network. For each TR, only a subset of the transcriptome was considered as candidate target genes. For the “integrated network”, the candidate target genes came from two sources: 1) target genes identified by ARACNe network and 2) the top 200 most differentially expressed genes after the TR perturbation. For the “purely experimental network”, the top 200 most differentially expressed genes after the TR perturbation were considered as candidate target genes.
Determination of regulon parameters
We define the “regulon” of a specific TR as all the transcriptional target genes of that TR. To generate regulons that can be used to infer protein activity of a TR, two sets of parameters have to be determined: tfmode and likelihood. As described in (Alvarez, Shen et al. 2016), the tfmode parameter represents the direction as well as strength of the TR-target regulation. The likelihood parameter denotes the statistical confidence of the TR-target regulation. For the “purely computational” ARACNe network, tfmode was calculated as the Pearson correlation of the corresponding regulator and target gene expression, and likelihood was calculated as scaled pairwise mutual information. For “purely experimental” networks and integrated networks, the linear model coefficient statistics were used to generate calculate the regulon parameters. We first determined the statistical significance of each coefficient assuming a Gaussian distribution. For each TR-target association in the regulon, the use of the continuous versus binary model was determined by greater coefficient significance. The tfmode parameter was then determined as the sign (1 or -1) of the coefficient. The likelihood parameter was determined as the significance (1 - p-value) of the corresponding coefficient.
5-fold cross-validation and cross-timepoint validation
Within the dataset of each timepoint, 5-fold cross-validation was used to assess the ability of a specific regulon to infer the protein activities of the corresponding TR. For each TR, a single-cell dataset was firstly assembled by including all the perturbed cells and cells expressing a scrambled sgRNA. The dataset was subsequently divided into 5 equally-sized sub-datasets. The regulon was reconstructed from 4 sub-datasets (training sets) and validated on the single left-out sub-dataset (test set). The protein activities of the TR in each perturbed cell were then calculated by VIPER based on computational network, experimental network or integrated network. For cross-timepoint validation, we simply reconstructed regulons based on one timepoint (training set) and validated the regulons on the other timepoint (test set).
For both within-timepoint and cross-timepoint cross-validations, we selected 340 sgRNAs that 1) had at least medium-efficacy (>50% knockdown) and 2) were covered by more than 15 cells at both timepoint. Out of the 250 TRs (thus 250 regulons) targeted by the 340 sgRNAs, 247 were successfully reconstructed by T1 dataset and 241 were reconstructed by T2 dataset. The unsuccessful regulon reconstruction was mainly due to low level of TR expression in single cells.
DNase-Seq validation
DNase-Seq data were accessed from Cistrome database (Liu, Ortiz et al. 2011) on Apr 29, 2020. Context-specific data were extracted by limiting the “Cell_line” to contain “HT29”. A peak was considered to be associated with a specific gene if it overlaps with the genomic region of that gene. Peaks that did not overlap with any gene were removed.
Protein activity inference of CCLE HT-29 and TR-perturbed HT-29
The gene expression profile of HT-29 was accessed from the Cancer Cell Line Encyclopedia (CCLE) database on Jul 10, 2020. The gene expression signature of HT-29 was then generated by z-score transformation implemented by viperSignature function from the VIPER R package, comparing the normalized expression level of HT-29 against all the other CCLE cell lines. The gene expression signatures of TR-perturbations were calculated as z-score as described in “Differential expression analysis”.
For all the gene expression signatures described above, the R package VIPER (Alvarez, Shen et al. 2016) was used to infer protein activity. Specifically, metaVIPER (Ding, Douglass et al. 2018) (as implement in the VIPER R package) was used to combine the information of T1 network and T2 network.
Characterization of activity regulating network (ARN) edges (Figure 3A)
The activity regulating network was first constructed by extracting TR-TR edges with statistically significant (FDR < 0.1) differential protein activity after perturbing one of the TRs. To consolidate the network constructed at T1 and T2, we took the union of the two networks. For edges with different modes of regulation (activation or repression) at T1 and T2, we kept the T1 edge and discarded the T2 edge.
An ARN edge was characterized by “transcriptional” if the differential protein activity had a corresponding differential gene expression (FDR < 0.2) of the same TR following the same perturbation. An interaction was characterized by “pathway” if the two TRs co-exist in a “pathway” gene set. ‘‘Pathway’’ gene sets were taken from a collection of the curated gene sets (c2) from MSigDB (Liberzon, Subramanian et al. 2011, Liberzon, Birger et al. 2015). Protein-protein interactions (PPI) were taken from PrePPI (Zhang, Petrey et al. 2012), BioPlex 3.0 Network for HCT116 (Huttlin, Bruckner et al. 2020), and CORUM (Ruepp, Waegele et al. 2010). Two TRs were characterized by “overlapping regulons” if the top 50 targets (ranked by likelihood) of the two regulons have statistically significant overlap (by Fisher’s exact test, FDR < 0.1).
Multi-layer regulatory network reconstruction
The multi-layer network (MLN) was reconstructed in the following 2 steps: 1) ARN and TRN were reconstructed separately by extracting TR-TR edges with statistically significant differential protein activity (FDR < 0.1) or different expression (FDR < 0.2) after perturbing one of the TRs; for edges with different modes of regulation (activation or repression) at T1 and T2, we kept the T1 edge and discarded the T2 edge; 2) ARN edges that had corresponding TRN edges were discarded.
Criteria for choosing TRs to include in TReK
Beginning with a list of 2,527 TRs (Table S1), we first removed TRs that were unlikely to be actively transcribed in HT-29 using a Gaussian mixture model. The remaining 1,826 genes were filtered to include candidate COAD regulators by PanCancer VIPER analysis (n = 30), MOMA-inferred checkpoint regulators (Paull, Aytes et al. 2021) (n = 69), the union of top VIPER scores by absolute value in TCGA stomach and esophageal carcinoma (STES) and COAD (n = 513), CHOPD TCGA PanCancer checkpoint regulators (n = 293), and gene expression (n = 95).
We began with a manually curated list of 2,527 Transcriptional Regulator genes (hereafter TRs) annotated by the Gene Ontology as either transcription factors or transcription cofactors (Table S1) (Alvarez, Shen et al. 2016). Candidate genes were selected by first removing TRs whose expression was not distinguishable from noise, and then by ranking remaining TRs based on their overall transcription levels and potential role as Master Regulators in COAD, based on MOMA analysis (Paull, Aytes et al. 2021) (see Methods). Each gene was targeted by the 3 best sgRNAs predicted by (Sanson, Hanna et al. 2018).
Acknowledgements
Funding sources that supported this work include The National Institutes of Health R35 CA197745 (Outstanding Investigator Award) to A. Califano.