Abstract
We developed a single-cell massively parallel reporter assay (scMPRA) to measure the activity of libraries of cis-regulatory sequences (CRSs) across multiple cell-types simultaneously. As a proof of concept, we assayed a library of core promoters in a mixture of HEK293 and K562 cells and showed that scMPRA is a reproducible, highly parallel, single-cell reporter gene assay. Our results show that housekeeping promoters and CpG island promoters have lower activity in K562 cells relative to HEK293, which likely reflects developmental differences between the cell lines. Within K562 cells, scMPRA identified a subset of developmental promoters that are upregulated in the CD34+/CD38− sub-state, confirming this state as more “stem-like.” Finally, we deconvolved the intrinsic and extrinsic components of promoter cell-to-cell variability and found that developmental promoters have a higher proportion of extrinsic noise compared to housekeeping promoters, which may reflect the responsiveness of developmental promoters to the cellular environment. We anticipate scMPRA will be widely applicable for studying the role of CRSs across diverse cell types.
Introduction
The majority of heritable variation for human diseases maps to the non-coding portions of the genome1–6. This observation has led to the hypothesis that genetic variation in the cis-regulatory sequences (CRSs) that control gene expression underlies a large fraction of disease burden 7–10. Because many CRSs function only in specific cell types11, there is intense interest in high-throughput assays that can measure the effects of cell-type-specific CRSs and their genetic variants.
Massively Parallel Reporter Assays (MPRAs) are one family of techniques that allow investigators to assay libraries of CRSs and their non-coding variants en masse12–18. In an MPRA experiment, every CRS drives a reporter gene carrying a unique DNA barcode in its 3’ UTR, which allows investigators to quantify the activity of each CRS by the ratio of its barcode abundances in the output RNA and input DNA. This approach allows investigators to identify new CRSs, assay the effects of non-coding variants, and discover general rules governing the functions of CRSs12,19–23. One limitation of MPRAs is that they are generally performed in monocultures, or as bulk assays across the cell types of a tissue. Performing cell-type specific MPRAs in tissues will require methods to simultaneously read out reporter gene activities and cell type information in heterogeneous pools of cells.
To address this problem, we developed scMPRA, a procedure that combines single-cell RNA sequencing with MPRA. scMPRA simultaneously measures the activities of reporter genes in single cells and the identities of those cells using their single-cell transcriptomes. The key component of scMPRA is a two-level barcoding scheme that allows us to measure the copy number of all reporter genes present in a single cell from mRNA alone. A specific barcode marks each CRS of interest (CRS barcode, “cBC”) and a second random barcode (rBC) acts as a proxy for DNA copy number of reporter genes in single cells (Fig. 1a). The critical aspect of the rBC is that it is complex enough to ensure that the probability of the same cBC-rBC appearing in the same cell more than once is vanishingly small. In this regime, the number of different cBC-rBC pairs in a single cell becomes an effective proxy for the copy number of a CRS in that cell. Even if a cell carries reporter genes for multiple different CRS, and each of those reporter genes is at a different copy number, it is still possible to normalize each reporter gene in each individual cell to its plasmid copy number. With this barcoding scheme, we can measure the activity of many CRSs with different input abundances in single cells.
Results
scMPRA enables single-cell measurement of CRS activity
As a proof of principle, we used scMPRA to test whether different classes of core promoters show different activities in different cell types. Core promoters are the non-coding sequences that surround transcription start sites, where general cofactors interact with RNA polymerase II24,25. Core promoters are divided into different classes by the functions of their host genes (housekeeping vs developmental), as well as by the sequence motifs they contain (TATA-box, downstream promoter element (DPE), and CpG islands). We selected 676 core promoters that we previously tested24 and cloned them into a double-barcoded MPRA library (Supplementary Table 1). In the first stage of library construction, each core promoter reporter gene was represented by 10 unique cBCs. We then added rBCs to the library by cloning a 25 nt random oligonucleotide (oligo) directly downstream of the cBCs. The library contains ~1.4×10⁷ unique cBC-rBC pairs (Methods, Fig. 1b). Using this complexity, we calculated that the probability of plasmids with the same cBC-rBC pair occurring in the same cell is less than 2×10⁻³ with our transfection protocols (Methods). Given this low likelihood, the number of distinct rBCs per cBC in a cell represents the copy number of a CRS in that cell. Knowing the copy number of CRSs in single cells allows us to normalize reporter gene expression from each CRS to its copy number in individual cells.
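The copy-number normalization described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the record layout (cell barcode, cBC, rBC, UMI), the function name `normalize_by_copy_number`, and the choice to report UMIs per plasmid copy are all assumptions made for the example.

```python
from collections import defaultdict


def normalize_by_copy_number(records):
    """Normalize reporter counts to plasmid copy number within each cell.

    records: iterable of (cell_barcode, cBC, rBC, umi) tuples (hypothetical layout).
    Returns {(cell_barcode, cBC): (copy_number, umi_count, umis_per_copy)}.
    """
    rbcs = defaultdict(set)  # distinct rBCs per (cell, cBC): proxy for plasmid copies
    umis = defaultdict(set)  # distinct (rBC, UMI) molecules per (cell, cBC)
    for cell, cbc, rbc, umi in records:
        rbcs[(cell, cbc)].add(rbc)
        umis[(cell, cbc)].add((rbc, umi))

    result = {}
    for key, rbc_set in rbcs.items():
        copies = len(rbc_set)
        n_umi = len(umis[key])
        result[key] = (copies, n_umi, n_umi / copies)
    return result


# Toy data: cBC1 is present on two plasmids (two rBCs) in cellA, cBC2 on one.
records = [
    ("cellA", "cBC1", "rBC_x", "umi1"),
    ("cellA", "cBC1", "rBC_x", "umi2"),
    ("cellA", "cBC1", "rBC_y", "umi3"),
    ("cellA", "cBC2", "rBC_z", "umi4"),
]
normalized = normalize_by_copy_number(records)
print(normalized[("cellA", "cBC1")])
```

Because a plasmid is only counted when at least one of its transcripts is captured, this sketch shares the limitation noted in the Conclusions: reporters that are silent in a cell contribute no rBCs and are therefore invisible.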
We performed a cell mixing experiment to test whether scMPRA could measure cell type specific expression of reporter genes. We transfected K562 and HEK293 cells (Methods), and performed scMPRA on a 1:1 mixture of those cell lines (Fig. 1c). We harvested cells and prepared them for sequencing using the 10X Chromium™ platform. The mRNA from single cells was captured, converted to cDNA, and pooled together. We then split the samples, with a quarter of the amplified cDNA library used for amplifying the cBC-rBC pairs and three-quarters used to amplify the transcriptome. The resulting reporter barcode abundances and transcriptome of each single cell are linked by their shared 10X cell barcode (Methods).
We recovered a total of 3112 cells (1524 in replicate 1 and 1588 in replicate 2) that are unambiguously assigned to one of the two cell types (Fig. 2a, Supplementary Figs S1 a,b). We determined the efficiency of our method by calculating the recovery rate of our input promoters. We then calculated the core promoter expression by taking the average of the cBC expression for the same promoter. We found that scMPRA recovered 99.5% (673 out of 676 core promoters) of the input library for K562 cells and 100% (676 out of 676 core promoters) for HEK293 cells, highlighting the efficiency of our method for recovering input elements.
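As a rough illustration of the two summaries above, the sketch below averages cBC-level expression into promoter-level expression and computes the fraction of library promoters recovered. The dictionary layout and the names `promoter_expression` and `recovery_rate` are hypothetical, and the text does not say whether averaging is done within each cell or across pooled cells, so the example simply averages the values it is given.

```python
from collections import defaultdict
from statistics import mean


def promoter_expression(cbc_expression, cbc_to_promoter):
    """Average the expression of all cBCs that tag the same promoter."""
    grouped = defaultdict(list)
    for cbc, value in cbc_expression.items():
        grouped[cbc_to_promoter[cbc]].append(value)
    return {promoter: mean(values) for promoter, values in grouped.items()}


def recovery_rate(measured_promoters, library_promoters):
    """Fraction of library promoters with at least one measurement."""
    library = set(library_promoters)
    return len(set(measured_promoters) & library) / len(library)


# Toy example with three library promoters, two of which are observed.
cbc_to_promoter = {"cBC1": "promA", "cBC2": "promA", "cBC3": "promB"}
cbc_expression = {"cBC1": 2.0, "cBC2": 4.0, "cBC3": 1.0}
expr = promoter_expression(cbc_expression, cbc_to_promoter)
recovery = recovery_rate(expr, ["promA", "promB", "promC"])
print(expr, recovery)
```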
We next calculated the number of individual cells in which each core promoter is measured. We found that the empirical distribution of the number of cells per core promoter is log-normal, with a median of 76 cells per core promoter for K562 cells and 287 cells per core promoter for HEK293 cells (Fig. 2b,c). Because the number of cBC-rBC pairs is effectively the number of plasmids per cell, we also calculated the number of plasmids per cell and found that fewer plasmids were incorporated into K562 cells than into HEK293 cells (median plasmid number in K562 cells: 164; median plasmid number in HEK293 cells: 341; Supplementary Fig. 1c,d). The difference in transfection efficiency between these cell types with the same input likely reflects global cellular differences between them, and is representative of the conditions expected when performing scMPRA in different cell types.
We calculated the biological reproducibility and found that scMPRA is highly reproducible in both cell types for measurements of mean expression (K562: Pearson R = 0.89, HEK293: Pearson R = 0.96) and cell-to-cell variance (K562: Pearson R = 0.78, HEK293: Pearson R = 0.94, Fig 2 d-g). To validate the measurements, we conducted bulk RNA-seq for the core promoter library in the two cell types separately, and found the bulk measurements correlate well with the aggregated single-cell measurements (Fig. 2 h,i, Supplementary Fig. 1e,f). This analysis shows that single-cell measurements of library members in as few as 70 individual cells still correlate well with bulk measurements, highlighting the sensitivity of our method.
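The replicate comparison above rests on the Pearson correlation of per-promoter statistics. A minimal sketch, assuming each replicate is summarized as a dictionary from promoter name to mean expression (a hypothetical layout) and that only promoters measured in both replicates are compared:

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Toy replicate summaries; promD and promE are each missing from one replicate.
rep1 = {"promA": 1.0, "promB": 2.0, "promC": 3.0, "promD": 4.0}
rep2 = {"promA": 1.1, "promB": 1.9, "promC": 3.2, "promE": 0.5}
shared = sorted(set(rep1) & set(rep2))
r = pearson_r([rep1[p] for p in shared], [rep2[p] for p in shared])
print(shared, round(r, 3))
```

The same function applies to per-promoter variances when assessing the reproducibility of cell-to-cell variability.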
scMPRA detects cell type specific CRS activity and non-coding variant effect
We asked whether the data allowed us to detect core promoters with differential activity between K562 and HEK293 cells. While different classes of core promoters had similar activities in both cell lines (Fig. 2j), our differential analysis using DESeq226 identified a small number of promoters (11 out of 669) that are upregulated in K562 cells, and 59 promoters that are downregulated in K562 cells (adjusted p < 0.01, log2 fold change > 0.3, Fig. 2k, Supplementary Table 2). Among the downregulated promoters, 48 out of 59 core promoters belong to housekeeping genes (p = 1.08×10⁻¹¹, Fig. 2l), and 46 out of 59 core promoters are CpG-island-containing core promoters (p = 2.18×10⁻⁶, Fig. 2m). This downregulation might be explained by the fact that the K562 cell line is a cancer-derived cell line, and a hallmark regulatory change in cancer cells is the hypermethylation of CpG promoters27. These results demonstrate the ability of scMPRA to detect CRSs with cell-type specific activities.
Another application of scMPRA is to detect cell type specific effects of non-coding variants. To test whether our method can detect the effects of mutations in a given CRS, we included an artificial core promoter, SCP128, along with mutated versions without a TATA box or DPE motif in our library (Fig. 2n). We first computed the total number of captured reporter gene transcripts, since it is the closest proxy to the bulk expression measurement. We found that deletions of the TATA motif or DPE motif both reduced expression (Fig. 2o), and we observed a similar trend in the bulk data (Supplementary Fig. 1g). When we directly calculated the mean of the single-cell expression distribution instead of the total number of captured reporter gene transcripts, we found that the deletion of the DPE motif has a stronger effect in K562 cells than in HEK293 cells (40% reduction vs 20% reduction) (Methods, Supplementary Fig. 1 h,i). We hypothesized that the differential expression of transcription factors between K562 and HEK293 cells leads to differential sensitivity to the TATA and DPE motifs. We examined the single-cell transcriptome and found that TAF9, which recognizes the DPE motif29, is more highly expressed in K562 cells than in HEK293 cells (Supplementary Fig. 1j, Wilcoxon p = 4.27×10⁻⁹⁴). This observation likely explains why the deletion of the DPE motif has a stronger effect in K562 cells. Our results demonstrate that scMPRA can identify and explain cell-type specific effects of non-coding variants.
scMPRA detects cell sub-state specific CRS activity
Single-cell studies have revealed heterogeneity in cell states even within isogenic cell types30–33. Therefore, we asked if scMPRA can identify CRSs with cell-state specific activity. We repeated scMPRA on K562 cells alone and obtained a total of 5141 cells from two biological replicates. Measurements of the mean and variance of each library member were again highly correlated between replicates and agree well with independent bulk measurement (Supplementary Fig. 2 a-d).
As the phases of the cell cycle represent distinct cell-states, we asked whether scMPRA could identify reporter genes with differential activity through the cell cycle. We assigned cell cycle phases to each cell using their single cell transcriptome data (Fig. 3a) and then calculated the mean expression of each reporter gene in different cell cycle phases. We found that most core promoters in our library are upregulated in the G1 phase of the cell cycle, and some housekeeping promoters are highly expressed through all cell cycle phases (Fig. 3b). We also identified core promoters with different expression dynamics through the cell cycle. For example, we found the core promoter for UBA52 remains highly expressed in the S phase, whereas the core promoter for CXCL10 is lowly expressed throughout (Supplementary Fig. 2e). This analysis illustrates the ability of scMPRA to identify CRSs whose expression naturally fluctuates with cellular dynamics. We then asked whether scMPRA could detect reporter genes with activities that were specific to other cell-states in K562 cells, after normalizing for cell cycle effects. We focused on two specific sub-states that have been reported and experimentally validated for high proliferation rates in K562 cells34,35. The first is the CD34+/CD38− sub-state that has been identified as a leukemia stem-cell subpopulation, and the second is the CD24+ sub-state that is linked to selective activation of proliferation genes by bromodomain transcription factors31,32. To identify these sub-states in our single-cell transcriptome data, we first regressed out the cell cycle effects and confirmed that the single cell transcriptome data no longer clustered by cell cycle phase (Supplementary Fig.2 f). We then identified clusters within K562 cells that have the CD34+/CD38− expression signature, or the CD24+ signature (Fig. 3 c,d). 
Although the CD34+/CD38− cells represent only 9.3% of the cells in a K562 culture, scMPRA revealed two distinct classes of core promoters that are upregulated and downregulated in these cells, respectively (Fig. 3e). Conversely, the expression patterns of promoters are similar between the CD24+ cluster and cells in the “differentiated” cluster (Fig. 3e, f). Motif analysis of the up- and downregulated classes of promoters in CD34+/CD38− cells showed that different core promoter motifs are enriched in each class, with the TATA box and Motif 5 being enriched in the upregulated class and the MTE and TCT motifs being enriched in the downregulated class (Fig. 3g, Methods). This result suggests that differences in core promoter usage might be driving the differences between the CD34+/CD38− cluster and the other clusters. Because the TATA box is mostly found in developmental core promoters, the CD34+/CD38− subpopulation likely reflects a more “stem-like” cellular environment in these cells. Our analysis highlights the ability of scMPRA to identify CRSs with differential activity in rare cell populations.
With the single-cell expression data, we asked how certain promoters achieve higher expression in the CD34+/CD38− state. We asked whether the single-cell expression distribution for the CD34+/CD38− state is shifted higher than for the other states, or if the range of expression is the same for each sub-state, with only the proportion of cells with high expression changing in each state. To answer this question, we calculated the proportion of cells in each sub-state belonging to the 90th percentile of the total single cell expression distribution. For the majority of promoters, the CD34+/CD38− cluster has a much higher proportion of cells in the 90th percentile (Supplementary Fig 3a). At the same time, there is no difference in the maximum expression of cells in different sub-states, and this maximum level is mainly set by the promoter identity (Supplementary Fig 3b). Even for the most differentially expressed promoter in the CD34+/CD38− subpopulation, TIA1, the expression distributions for cells in the three sub-states cover the same range, but the proportion of cells in the right-tail of the distribution is higher for CD34+/CD38− cells (Fig. 3h). This result suggests that the “stem-like” cellular environment of the CD34+/CD38− subpopulation increases the probability of certain promoters having higher expression, without shifting the maximum expression those promoters achieve. Taken together, these analyses highlight how the joint transcriptome and CRS measurements in scMPRA can be used to understand differences in behavior in cellular sub-states.
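The right-tail comparison described above can be sketched as follows. The example assumes that "belonging to the 90th percentile" means expression at or above the 90th percentile of the pooled distribution across all sub-states, and it uses a simple linear-interpolation percentile; both choices, along with the function names and toy values, are assumptions for illustration.

```python
def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a non-empty list."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def top_tail_fraction(expression_by_state, q=90):
    """Fraction of cells in each sub-state at or above the pooled q-th percentile."""
    pooled = [v for vals in expression_by_state.values() for v in vals]
    cutoff = percentile(pooled, q)
    return {
        state: sum(v >= cutoff for v in vals) / len(vals)
        for state, vals in expression_by_state.items()
    }


# Toy single-promoter data: both sub-states reach the same maximum (10),
# but a larger share of the first sub-state sits in the right tail.
expression_by_state = {
    "CD34+/CD38-": [1, 8, 9, 10, 10],
    "other": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
}
fractions = top_tail_fraction(expression_by_state)
print(fractions)
```

In the toy data the maximum is identical across sub-states while the tail fraction differs, which is the pattern the paragraph describes for promoters such as TIA1.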
scMPRA enables decomposition of intrinsic and extrinsic noise
Finally, we analyzed the cell-to-cell variability of reporter genes across K562 cells. Cell-to-cell variability, or expression noise, is the phenomenon where gene expression varies among the cells of an isogenic population. Expression noise has important roles in development36, rare-cell cancer resistance30,37, and its origin is a central question in single-cell biology. A common framework for studying expression noise is to decompose it into its intrinsic component, which arises from the thermal fluctuations of macromolecular interactions, and its extrinsic component, which results from fluctuations in the global cellular environment38–42. Intrinsic and extrinsic noise can be decomposed using dual-reporter experiments, where two identical reporter genes are measured across the same single-cells39. High covariance of the two reporter genes indicates high extrinsic noise and low intrinsic noise, while independent variation of the two reporters suggests high intrinsic noise and low extrinsic noise. In scMPRA, plasmids with the same CRS but different barcodes are sometimes incorporated into the same cells, effectively serving as a dual-reporter experiment. We extracted pair-wise expression for the same core promoter labeled with different cBCs from our scMPRA data, and computed intrinsic noise and extrinsic noise using a previously developed statistical framework43 (Methods). We found that different core promoters have distinct intrinsic and extrinsic noise profiles (Fig 4 a,b). Globally, we found that intrinsic noise correlates with mean expression levels (Pearson ρ = 0.455), while extrinsic noise is not correlated with mean expression (Pearson ρ = −0.172, Fig. 4 c,d). This result agrees with the notion that intrinsic noise arises from the thermodynamics of transcription at different promoters, whereas many sources for extrinsic noise are independent of the specific promoters. 
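To make the dual-reporter logic concrete, the sketch below implements the classic moment-based estimators of squared intrinsic and extrinsic noise for two reporters measured in the same cells. These are the textbook dual-reporter formulas; the statistical framework cited above may use different (for example, bias-corrected) estimators, so this should be read as an illustration of the idea rather than the exact computation used in this study. The inputs are assumed to be copy-number-normalized expression values of two cBCs for the same promoter across shared cells.

```python
def dual_reporter_noise(x, y):
    """Classic dual-reporter estimates of squared intrinsic and extrinsic noise.

    x, y: expression of two reporters for the same promoter, paired by cell.
    intrinsic^2 = <(x - y)^2> / (2 <x><y>)
    extrinsic^2 = (<x y> - <x><y>) / (<x><y>)
    """
    if len(x) != len(y) or not x:
        raise ValueError("x and y must be non-empty and paired by cell")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    norm = mean_x * mean_y
    intrinsic_sq = sum((a - b) ** 2 for a, b in zip(x, y)) / n / (2 * norm)
    extrinsic_sq = (sum(a * b for a, b in zip(x, y)) / n - norm) / norm
    return intrinsic_sq, extrinsic_sq


# Perfectly co-varying reporters: all noise is extrinsic.
int_cov, ext_cov = dual_reporter_noise([1, 2, 3, 4], [1, 2, 3, 4])
# Reporters that vary independently of each other: all noise is intrinsic.
int_ind, ext_ind = dual_reporter_noise([1, 3, 1, 3], [1, 1, 3, 3])
print(int_cov, ext_cov, int_ind, ext_ind)
```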
We also found that developmental promoters have a higher proportion of noise that is extrinsic, which may reflect their role in driving developmental genes that respond to extrinsic cues during development (Fig. 4 e,f). This analysis suggests that scMPRA could be a powerful tool to study the mechanistic origin of cell-to-cell variability in a high-throughput manner.
Conclusions
We have presented a method to measure the cell-type and cell-state specific effects of CRSs by devising a barcoding scheme to read out input copy number from mRNA. We demonstrated that scMPRA detects cell-type specific reporter gene activity in a mixed population of cells, and cell-state specific activity in an isogenic population. We also demonstrated that scMPRA can be a powerful tool to study how different CRSs control cell-to-cell variability. The assay is reproducible and reports accurate mean levels of reporter gene activity in as few as 70 cells. The primary limitation of scMPRA is that it relies on mRNA counts of the rBC to estimate plasmid DNA abundance, and therefore it cannot accurately measure CRSs that are truly silent in a given cell type. The inclusion of a separate constitutive promoter on each plasmid driving expression of the rBCs would allow us to quantify plasmid copy number independent of the expression of the reporter gene.
A future direction is to perform scMPRA in complex tissues to measure the cell type specific effects of genetic variation in CRSs. With the burgeoning of Adeno-associated viral delivery systems with distinct tropisms44–47, we anticipate that scMPRA will be widely used to study cis-regulatory effects in a variety of complex tissues.
Methods
Cell culture
K562 cells were cultured in a medium consisting of Iscove’s Modified Dulbecco’s Medium (IMDM) + 10% Fetal Bovine Serum (FBS) + 1% non-essential amino acids + 1% pen/strep at 37 °C with 5% CO2. HEK293 cells were cultured in a medium consisting of Eagle’s Minimum Essential Medium (EMEM) + 10% Fetal Bovine Serum (FBS) + 1% pen/strep at 37 °C with 5% CO2.
Cloning Strategy
We developed a two-level barcoding technology to enable single-cell normalization for plasmid copy number. We applied this strategy to a promoter library we previously tested in bulk assays24. The original library contains 676 core promoters with a length of 133bp. Each core promoter has 10 promoter barcodes to provide redundancy in the measurements. We then synthesized a single-stranded 90 bp DNA oligonucleotide containing a 25 bp random sequence, a restriction site, and 30 bp homology on each side of the barcode region.
We used Hifi Assembly™ to add the random barcodes to the plasmid library. 4 μg of the plasmid library were split into 4 reactions and digested with 2 μl of SalI for 1.5 hours at 37 °C. The digested products were run at 100 V for 2 hours on a 0.7% agarose gel. The band of the correct size was cut and purified with the Monarch Gel Extraction Kit (New England BioLabs T1020L). The single-stranded insert DNA was diluted in TE to a stock concentration of 100 μM and then further diluted to 1 μM with ddH2O. Three assembly reactions were pooled together, each reaction containing 100 ng of digested library backbone, 1 μM of insert DNA, 1 μl of NEBuffer 2, 10 μl of 2X Hifi assembly mix, and H2O up to 20 μl. The reaction was incubated at 50 °C for 1 hour. The assembled product was purified with the Monarch PCR&DNA Cleanup kit (New England BioLabs T1030L) and eluted in 12 μl of H2O.
The assembled plasmid was transformed by electroporation using the Gene Pulser Xcell Electroporation System (BIO-RAD 1652661): 50 μl of ElectroMax DH10B electrocompetent cells (Invitrogen 18290015) were electroporated with 1 μl of Hifi-assembled product at 2 kV, 2000 Ω, 25 nF, with a 1 mm gap. 950 μl of SOC medium (Invitrogen 15544034) was added to the cuvette, and the cells were then transferred to a 15 ml Falcon tube. Two transformations were performed, and each tube was incubated at 37 °C for 1 hour on a rotator at 300 rpm. The culture was then added to pre-warmed 150 μl LB/Amp medium and grown overnight at 37 °C. 1 μl of the culture was also diluted 1:100, and 50 μl of the diluted culture was plated on an LB agar plate to check the transformation efficiency. For the core promoter library, we obtained more than 4×10⁸ colonies, enough to cover a complex library.
Estimating Library Complexity
To estimate the library complexity, we sequenced the DNA library using a nested PCR-based Illumina library preparation protocol. Briefly, we first used Q5 polymerase (New England BioLabs M0515) to amplify the region containing the two barcodes with SCARED P17 (5’-GACGAGCTCTATAAGTAATCTAGA-3’) and SCARED P18 (5’-TTTTCTAGGTCTCTGGTCGA-3’). The total reaction volume was 50 μl, with 50 ng of plasmids and 2.5 μl of each 10 μM primer. The annealing temperature was 61 °C with an extension time of 10 s, and 25 cycles of amplification were done. The product was then purified with the Monarch PCR&DNA Cleanup kit (New England BioLabs T1030L) and eluted with 20 μl of ddH2O. For the second PCR (SCARED P19: 5’-GGACGAGCTCTATAAGTAATCTAGA-3’, SCARED P20: 5’-GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT-3’), a 25 μl reaction was set up with 0.25 μl of product from the previous step; the annealing temperature was 61 °C, the extension time was 10 s, and a total of 10 cycles were done. The PCR product was cleaned up using the Monarch PCR&DNA Cleanup kit. For the last PCR to add the P5 and P7 Illumina adapters (P5: 5’-AATGATACGGCGACCACCGAGATCTACACACCCGCACACTCTTTCCCTACACGACGCT-3’, P7: 5’-CAAGCAGAAGACGGCATACGAGATAAGTTGACAGTGACTGGAGTTCAGACGTG-3’), a reaction with a total volume of 25 μl was set up with 2 μl of cleaned product from PCR2, and a total of 10 cycles of PCR were done.
The constructed Illumina library was sequenced on an Illumina MiSeq. A total of 1,693,933 reads were generated for this library. We filtered the raw reads, removing reads that lacked a matching promoter barcode or carried a random barcode of the wrong length. We obtained a total of 1,359,176 reads (80% of the total reads) that contain a correct promoter barcode and a random barcode of the correct length.
The shallow sequencing of the input plasmid library enabled us to estimate the library complexity and the probability of two identical copies of a plasmid being transfected into the same cell. We first calculated that each random barcode is attached to 1.9 promoter barcodes on average. For a total of 6760 input promoter barcodes, this suggests that a given random barcode is being reused by 3200 different promoters. The reuse of random barcodes is the effective labeling complexity for the double-barcoding. For the Hifi assembly experiment, we used 300 ng of input backbone plasmids containing only the promoter barcode (4.5×10⁹ total copies and on average 6.65×10⁶ copies of plasmids per promoter barcode). Given the effective labeling complexity, the average copy number of plasmids containing the same promoter barcode-random barcode pair is at most 2.08×10³. For the transfection experiments done in this study, with 2 μg (6×10⁹ copies of plasmid) for the cell mixing experiment and 10 μg (3×10¹⁰ copies of plasmid) for the K562-alone experiment, the estimated average copy number for an identical plasmid is 4.4×10² and 2.2×10³, respectively.
After obtaining the average copy number for identical plasmids, we estimated the probability of identical plasmids being transfected into the same cell. We define this probability as the collision rate. Because the transfections of identical copies of different plasmids are independent, we need to calculate the collision rate for only one such plasmid. The calculation of the collision rate for a given library member can be formulated as follows: given the number of identical copies of a plasmid, what is the probability of two or more of the copies being transfected into the same cell? We first write the expectation: where n denotes the total number of cells, m denotes the total number of identical plasmids, k denotes the number of cells with no plasmid, q denotes the number of cells with exactly 1 plasmid, parentheses denote binomial coefficients, and brackets denote the partition function.
The above equation was simplified by substituting with the bivariate generating function, and the expected number is:
For a given transfection experiment, we can estimate the effective percentage of plasmid that is successfully transfected into the cell. Given that the estimated copy number for identical plasmids is 4.4×10² and 2.2×10³ for the mixed-cell experiment and the K562-alone experiment, respectively, the expected number of cells having more than 1 identical plasmid can be calculated with the aforementioned equation, and the probability of two copies of an identical plasmid appearing in the same cell is 0.0004 and 0.002, respectively. On a practical note, researchers have suggested that the effective number of plasmids that are incorporated into the nucleus is about 0.01 - 0.1 of the input amount48; hence, a library containing around 2.5×10⁵ different members transfected into 1 million cells has a theoretical collision rate of around 1%.
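A simple occupancy model gives a feel for these collision rates. The sketch below is not the generating-function derivation referred to above; it assumes that every identical copy independently enters one of the transfected cells uniformly at random, that all input copies enter cells, and that 1 million cells are transfected (the number stated in the Transfection section). Under those assumptions, the probability that a given copy shares its cell with another identical copy comes out close to the values reported above.

```python
def shared_cell_probability(copies, cells):
    """P(a given copy lands in a cell holding at least one other identical copy)."""
    return 1.0 - (1.0 - 1.0 / cells) ** (copies - 1)


def expected_multi_copy_cells(copies, cells):
    """Expected number of cells that receive two or more identical copies."""
    p = 1.0 / cells
    p_zero = (1.0 - p) ** copies                     # cell receives no copy
    p_one = copies * p * (1.0 - p) ** (copies - 1)   # cell receives exactly one copy
    return cells * (1.0 - p_zero - p_one)


cells = 1_000_000                                # assumed number of transfected cells
p_mixed = shared_cell_probability(440, cells)    # 4.4e2 identical copies
p_k562 = shared_cell_probability(2200, cells)    # 2.2e3 identical copies
n_multi_mixed = expected_multi_copy_cells(440, cells)
print(f"{p_mixed:.5f} {p_k562:.5f} {n_multi_mixed:.3f}")
```

Because only a fraction of input plasmid is thought to reach the nucleus, the uniform, full-uptake assumption makes this an upper-bound style estimate for a given copy number.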
Transfection
K562 cells were transfected using electroporation with the Neon transfection system (Invitrogen MPK5000). 1 million cells were transfected with 2 μg of plasmid DNA (mixed-cell experiment) or 10 μg of plasmid DNA (K562 sub-state experiment), with 3 pulses of 1450 V for 10 ms. The cells were then plated to pre-warmed K562 medium.
HEK293 cells were transfected using the Lipofectamine3000 protocol. 4 μl of p3000 reagent, 4μl of Lipofectamine, and OptiMEM were mixed with 2 μg of plasmid DNA to a volume of 250 μl. The lipofectamine reagents and plasmid were mixed and incubated at room temp for 15 minutes and then added dropwise to the cells.
Bulk RNA extraction and sequencing
We determined the optimal harvest time based on plasmid dilution and protein maturation and found the optimal harvest time is between 22 - 28 hours after transfection. The rationale behind the choice of time is to balance the transcription rate and the plasmid dilution during cell replication.
For both K562 cells and HEK293 cells, we harvested the cells 24 hours after transfection and extracted total mRNA with the Qiagen RNeasy kit for K562 cells and the Monarch Total RNA Miniprep kit for HEK293 cells. Reverse transcription was done with Superscript IV Reverse Transcriptase (Invitrogen 18090010). The final sequencing library was constructed using a nested PCR strategy. Briefly, we first used Q5 polymerase (New England BioLabs M0515) to amplify the region containing the 2 barcodes with SCARED P17 and SCARED P18. The total reaction volume was 50 μl, with 50 ng of backbone and 2.5 μl of each 10 μM primer. The annealing temperature was 61 °C with an extension time of 10 s, and 25 cycles of amplification were done. The product was then purified with the Monarch PCR&DNA Cleanup kit (New England BioLabs T1030L) and eluted with 20 μl of ddH2O. For the second PCR using primers SCARED P19 and SCARED P20, a 25 μl reaction was set up with 0.25 μl of product from the previous step; the annealing temperature was 61 °C, the extension time was 10 s, and a total of 10 cycles were done. The PCR product was cleaned up using the Monarch PCR&DNA Cleanup kit (New England BioLabs T1030L). For the last PCR to add the P5 and P7 Illumina adapters, a reaction with a total volume of 25 μl was set up with 2 μl of cleaned product from PCR2, and a total of 10 cycles of PCR were done. The sequencing library was sequenced on an Illumina MiSeq machine with other samples pooled in the same lane.
10X Experiment for scMPRA
We harvested both K562 and HEK293 cells 24 hours after transfection, then followed the cell preparation protocol of 10X genomics. We used the 10X V3.1 chromium kit for our single-cell RNA-seq protocol. All PCRs were performed on an Invitrogen PCR machine. We targeted 2000 cells per replicate for each experiment for the mixed cell experiment. We targeted 2500 cells per replicate for the K562 substrate experiment. We followed the 10X protocol (https://support.10xgenomics.com/single-cell-gene-expression/library-prep/doc/user-guide-chromium-single-cell-3-reagent-kits-user-guide-v31-chemistry) with 12 cycles of cDNA amplification. To amplify the Capture-Sequence captured reads. 0.25 μl of 100 uM SCARED P32 (5’-GTCAGATGTGTATAAGAGACAG-3’) was added to the cDNA amplification mix. For step 2.2, we modify the clean-up protocol by saving both the beads and supernatents and for the supernatents, we use a final concentration of 1.2X beads to pull down the DNA fragments. We then take 25% of both the 0.6X and 1.2X pull down products for the next step of PCR. To construct the illumina sequencing library, we used a 3-step nested PCR strategy. Briefly, we first used Q5 (New England BioLabs M0515) polymerase to amplify the region containing the 2 barcodes with SCARED P17 and SCARED P18. We pooled 8 PCR reactions, each with 50 μl of total volume, with 10 cycles to reduce possible jackpotting. The annealing temperature is 61°C with an extension time of 10s. The product was then purified with the Monarch PCR&DNA Cleanup kit (New England BioLabs T1030L), and eluted with 20 μl of ddH2O. 
For the second PCR using the following 3 primers (SCARED P21: 5’-GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTGGACGAGCTCTATAAGTAATCT-3’, CAS PC2: 5’-CGAGATCTACACTCGTCGGCAGCGTCAGATGTGTATAAGAGACAG-3’, CAS PP2: 5’-ATCTACACTCTTTCCCTACACGACGCTCTTC-3’), we pulled 8 PCR reactions, each with 50 μl of total volume, with 10 cycles to reduce possible jackpotting, the annealing temperature is 61°C, and the extension time is 10s, a total of 10 cycles was done. The PCR product was cleaned up using the Monarch PCR&DNA Cleanup kit (New England BioLabs T1030L). For the last PCR to add the P5 and P7 Illumina adapters (CAS P48: 5’-CAAGCAGAAGACGGCATACGAGATNNNNNNNN[index]GTGACTGGAGTTCAGAC-3’, CAS PP4: 5’-AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACA-3’, CAS PC4: 5’-AATGATACGGCGACCACCGAGATCTACACTCGTCG-3’), we pulled 8 PCR reactions, each with 50 μl of total volume, with 10 cycles to reduce possible jackpotting, a total of 10 cycles of PCR was done. The transcriptome is generated using the 10X Dual-Index Set TT expression kit (https://support.10xgenomics.com/single-cell-gene-expression/index/doc/technical-note-chromium-next-gem-single-cell-3-v31-dual-index-libraries).
Sequencing was performed on an Illumina NextSeq machine. The sequencing pool consisted of 40% barcode library, 40% balanced scRNA-seq transcriptome library, and 20% PhiX. Sequencing the barcode library together with the transcriptome and PhiX is crucial for reducing sequencing errors caused by the constant reporter sequence. On Read1, only the 28 bp containing the 10X cell barcode and UMI were sequenced, to avoid sequencing the constant poly(A) sequence; on Read2, 105 bp were sequenced. For the mixed cell experiment, we pooled reads from 2 NextSeq High Throughput sequencing runs, and for the K562 cells, we pooled reads from 3 NextSeq High Throughput runs.
scRNA-seq data processing
The single-cell RNA-seq data were processed using CellRanger 6.0.1 (https://github.com/10XGenomics/cellranger) and Scanpy 1.8.1 (https://github.com/theislab/scanpy)49 following the standard pipeline. Briefly, different sequencing runs from the same biological replicate were pooled together and processed with CellRanger 6.1.1; the final output expression matrix was then imported into Scanpy for further normalization. We first removed cells with fewer than 1000 detected genes, and genes that were present in fewer than three cells. We then removed cells with high counts for mitochondrial genes. Next, we normalized the UMI counts to the total UMI count of each cell. The normalized expression matrix was used for clustering and visualization with Scanpy. Clustering was done using the Leiden algorithm50.
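The filtering and normalization steps above can be illustrated with a minimal, library-free sketch. It is not the Scanpy pipeline itself: the function name `filter_and_normalize`, the mitochondrial cutoff `max_mito_fraction`, the `MT-` gene prefix, and the `scale` factor are assumptions introduced for illustration, since the text does not state the mitochondrial threshold or the normalization target.

```python
from typing import Dict

Counts = Dict[str, Dict[str, int]]  # cell barcode -> {gene name -> UMI count}


def filter_and_normalize(
    counts: Counts,
    min_genes: int = 1000,           # cells need at least this many detected genes (from the text)
    min_cells: int = 3,              # genes must be detected in at least this many cells (from the text)
    max_mito_fraction: float = 0.2,  # ASSUMED cutoff; the text only says "high counts"
    mito_prefix: str = "MT-",        # ASSUMED naming convention for mitochondrial genes
    scale: float = 1.0,              # ASSUMED target sum after normalization
) -> Dict[str, Dict[str, float]]:
    # Step 1: keep only detected genes, then drop cells with too few of them.
    kept = {cell: {g: n for g, n in genes.items() if n > 0}
            for cell, genes in counts.items()}
    kept = {cell: genes for cell, genes in kept.items() if len(genes) >= min_genes}

    # Step 2: drop genes detected in too few of the remaining cells.
    cells_per_gene: Dict[str, int] = {}
    for genes in kept.values():
        for g in genes:
            cells_per_gene[g] = cells_per_gene.get(g, 0) + 1
    kept = {cell: {g: n for g, n in genes.items() if cells_per_gene[g] >= min_cells}
            for cell, genes in kept.items()}

    normalized: Dict[str, Dict[str, float]] = {}
    for cell, genes in kept.items():
        total = sum(genes.values())
        if total == 0:
            continue
        # Step 3: drop cells dominated by mitochondrial transcripts.
        mito = sum(n for g, n in genes.items() if g.startswith(mito_prefix))
        if mito / total > max_mito_fraction:
            continue
        # Step 4: normalize each count to the total UMI count of the cell.
        normalized[cell] = {g: scale * n / total for g, n in genes.items()}
    return normalized
```

In practice the same steps would be run on the CellRanger output matrix with Scanpy; the sketch only makes the order of operations and the two stated thresholds explicit.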
scMPRA data processing
The scripts for processing single-cell MPRA reads can be found in a GitHub repository (https://github.com/szhao045/scMPRA). In the final scMPRA sequencing library, Read1 contains the cell and molecule information (cellBC and UMI), and Read2 contains the MPRA library information (cBC and rBC). First, we fuzzy-matched the constant sequences before and after both the promoter barcode and the random barcode. In this step, we filtered out reads without the correct promoter barcode length or random barcode length. To increase speed, we wrote a stand-alone program in Go (https://github.com/szhao045/scMPRA_parsingtools) that can be compiled for many operating systems. Second, we filtered cell barcodes against the barcode list output by CellRanger, with error correction allowing a maximum Hamming distance of 1. Third, to mitigate the effect of template switching during the PCR steps, we plotted the ranked read depth for each unique quad of 10X cell barcode, UMI, cBC, and rBC. We identified an elbow point with a minimum depth of 1 (mixed cell experiment) or 10 (K562 alone experiment), and kept any low-depth unique quad whose cBC-rBC pair was within a Hamming distance of 1 of a high-depth pair. Lastly, we removed cells with fewer than 100 scMPRA-associated UMIs, since the scMPRA reads from those cells were poorly sampled.
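As an illustration of the second and third steps, the sketch below shows one way to correct cell barcodes against a whitelist and to rescue low-depth quads. It is not the code from the repositories above. The helper names (`hamming`, `correct_cell_barcode`, `filter_quads`) are hypothetical, and three choices are assumptions: a barcode with more than one whitelist match at distance 1 is discarded, quads with depth at or above the elbow value count as high depth, and low-depth pairs are compared against high-depth pairs from all cells.

```python
from typing import Dict, Optional, Set, Tuple

Quad = Tuple[str, str, str, str]  # (cellBC, UMI, cBC, rBC)


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def correct_cell_barcode(barcode: str, whitelist: Set[str]) -> Optional[str]:
    """Return the whitelist barcode within Hamming distance 1, or None."""
    if barcode in whitelist:
        return barcode
    matches = [w for w in whitelist
               if len(w) == len(barcode) and hamming(w, barcode) <= 1]
    # ASSUMPTION: ambiguous barcodes (several candidates) are discarded.
    return matches[0] if len(matches) == 1 else None


def pair_distance(p: Tuple[str, str], q: Tuple[str, str]) -> Optional[int]:
    """Total mismatches across a cBC-rBC pair, or None if lengths differ."""
    if len(p[0]) != len(q[0]) or len(p[1]) != len(q[1]):
        return None
    return hamming(p[0], q[0]) + hamming(p[1], q[1])


def filter_quads(depth: Dict[Quad, int], min_depth: int) -> Dict[Quad, int]:
    """Keep high-depth quads plus low-depth quads near a high-depth pair."""
    # ASSUMPTION: depth >= min_depth defines a high-depth quad.
    high = {q: d for q, d in depth.items() if d >= min_depth}
    high_pairs = {(q[2], q[3]) for q in high}
    kept = dict(high)
    for quad, d in depth.items():
        if quad in high:
            continue
        pair = (quad[2], quad[3])
        for hp in high_pairs:
            dist = pair_distance(pair, hp)
            if dist is not None and dist <= 1:
                kept[quad] = d  # likely a sequencing error of a real pair
                break
    return kept
```

The whitelist scan is linear in the number of barcodes, which is acceptable for a sketch but would be replaced by a precomputed neighbor lookup at production scale.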
Cell cycle analysis
Cell cycle analysis for the scRNA-seq experiment was performed with Scanpy 1.8.1 using a list of cell cycle genes51. The expression profile of each cell was projected onto a PCA plot computed from this list of cell cycle genes using Scanpy.
Motif analysis
The core promoters were first clustered according to their expression levels in the different cell sub-state populations by hierarchical clustering. We categorized our data into up- and down-regulated clusters at the first branching point, aiming to preserve the large-scale structure. We then identified core promoter motifs in each promoter according to the parameters in Zabidi et al.52 using MAST (v4.10.0)53 and plotted the proportion of promoters containing each motif in each promoter class.
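The final tabulation step, the proportion of promoters in each class that contain each motif, can be sketched as follows. The function name `motif_proportions` and both input mappings are hypothetical; the class labels would come from the hierarchical clustering and the motif calls from MAST, neither of which is reproduced here.

```python
from collections import defaultdict
from typing import Dict, Set


def motif_proportions(
    promoter_class: Dict[str, str],        # promoter ID -> class label (e.g. "up" / "down")
    promoter_motifs: Dict[str, Set[str]],  # promoter ID -> motifs called in that promoter
) -> Dict[str, Dict[str, float]]:
    """Fraction of promoters in each class that contain each motif."""
    class_sizes: Dict[str, int] = defaultdict(int)
    hits: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for promoter, cls in promoter_class.items():
        class_sizes[cls] += 1
        # Promoters with no motif call contribute to the class size only.
        for motif in promoter_motifs.get(promoter, set()):
            hits[cls][motif] += 1
    all_motifs = sorted({m for motifs in promoter_motifs.values() for m in motifs})
    return {
        cls: {m: hits[cls][m] / size for m in all_motifs}
        for cls, size in class_sizes.items()
    }
```

Reporting every motif for every class, including zeros, keeps the resulting table rectangular, which is convenient for plotting the classes side by side.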
Estimating intrinsic and extrinsic noise
Intrinsic and extrinsic noise were estimated using the statistical framework developed for the dual-reporter experiment43. We first extracted the pairwise expression levels of cBCs that belong to the same promoter in every single cell. If more than two cBCs for a promoter were found in the same cell, all pairwise combinations among them were recorded. We then removed promoters with fewer than 100 paired single-cell expression measurements (593 out of 676 promoters passed this filtering step) and applied the statistical framework developed by Fu and Pachter43. The derivation is abbreviated here and can be found in the original publication. Briefly, let C denote the expression of the first cBC in the cell and let Y denote the expression of the second cBC in the cell. Let η_int denote the intrinsic noise, which is calculated from the paired measurements (C, Y) across cells, where n denotes the number of cells.
Similarly, let η_ext denote the extrinsic noise, which is calculated from the same paired measurements, where n again denotes the number of cells.
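For orientation, the sketch below implements the classic dual-reporter estimators that this framework builds on: η_int² = ⟨(C − Y)²⟩ / (2⟨C⟩⟨Y⟩) and η_ext² = (⟨CY⟩ − ⟨C⟩⟨Y⟩) / (⟨C⟩⟨Y⟩), where ⟨·⟩ denotes the average over the n paired measurements. Fu and Pachter also describe corrected variants of these estimators; which variant was applied in this study is not stated here, so the exact form used below is an assumption. The function names and the input layout are hypothetical.

```python
from itertools import combinations
from statistics import fmean
from typing import Dict, List, Tuple

Pair = Tuple[float, float]


def paired_measurements(cells: Dict[str, Dict[str, float]]) -> List[Pair]:
    """All within-cell pairs of cBC expression values for one promoter.

    `cells` maps a cell barcode to {cBC -> expression} for a single promoter.
    Cells with fewer than two cBCs contribute no pairs.
    """
    pairs: List[Pair] = []
    for expr_by_cbc in cells.values():
        # Sort by cBC so the "first" and "second" reporter are deterministic.
        values = [expr_by_cbc[cbc] for cbc in sorted(expr_by_cbc)]
        pairs.extend(combinations(values, 2))
    return pairs


def noise_decomposition(pairs: List[Pair]) -> Tuple[float, float]:
    """Return (eta_int squared, eta_ext squared) from paired measurements."""
    if not pairs:
        raise ValueError("at least one paired measurement is required")
    n = len(pairs)
    mean_c = fmean(c for c, _ in pairs)
    mean_y = fmean(y for _, y in pairs)
    denom = mean_c * mean_y
    if denom == 0:
        raise ValueError("mean expression of both reporters must be non-zero")
    # Intrinsic noise: uncorrelated differences between the two reporters.
    eta_int_sq = sum((c - y) ** 2 for c, y in pairs) / (2 * n * denom)
    # Extrinsic noise: normalized covariance shared by the two reporters.
    eta_ext_sq = (sum(c * y for c, y in pairs) / n - denom) / denom
    return eta_int_sq, eta_ext_sq
```

With these uncorrected estimators the extrinsic term can come out slightly negative for small n, which is one reason a minimum number of paired measurements per promoter is useful.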
Statistical Analyses
All statistical analyses were done using Python 3.9.6, NumPy (v1.12.1)54, SciPy 1.6.3, and R 4.0.2.
Data and Code Availability
Next-generation sequencing data that support the findings of the study are available in the Gene Expression Omnibus using accession code GSE188639.
The code that supports the findings of this study is available in a GitHub repository (https://github.com/szhao045/scMPRA).
Author Contributions
S.Z. and B.A.C. conceived and designed the project. S.Z. performed most of the experiments and analyses, with significant technical contributions from C.K.Y.H. and D.M.G. S.Z. and B.A.C. wrote the manuscript with input and feedback from all authors.
Competing Interests
S.Z. and B.A.C. are inventors on a pending patent filed by Washington University in St. Louis which may encompass the methods, reagents, and data disclosed in this manuscript. B.A.C. is on the scientific advisory board of Patch Biosciences.
Supplementary Figures
Acknowledgements
We thank the members of the Cohen laboratory for their critical feedback on the manuscript. We thank Jess Hoistington-Lopez and MariaLynn Crosby for assistance with high-throughput sequencing. This work is supported by grants to B.A.C. from the National Institutes of Health, R01 GM140711 and R01 GM092910.