ABSTRACT
Gene expression patterns depend on the interaction of diverse transcription factors with their target genes. While many factors have a restricted number of targets, some appear to affect transcription globally. An example of the latter is MeCP2; an abundant chromatin-associated protein that is mutated in the neurological disorder Rett Syndrome. To understand how MeCP2 affects transcription, we integrated mathematical modelling with quantitative experimental analysis of human neurons expressing graded levels of MeCP2. We first used a model of MeCP2-DNA binding to demonstrate that changes in gene expression reflect MeCP2 density downstream of transcription initiation. We then tested five biologically plausible hypotheses for the effect of MeCP2 on transcription. The only model compatible with the data involved slowing down of RNA polymerase II by MeCP2, causing reduced transcript output due to polymerase queueing. Our general approach may prove fruitful in deciphering the mechanisms by which other global regulators choreograph gene expression.
INTRODUCTION
Gene expression is a complex process affected by many factors. Traditionally, interactions between genes have been visualized as a gene regulatory network in which transcription factors are classified either as repressors or activators, each with a few well-defined target genes. This view becomes less informative when a factor interacts with the majority of genes. Such global regulators of transcription may either interact with the transcriptional machinery itself, for example by directly affecting the initiation rate of Pol II, or bind to abundant DNA features that are widespread in the genome, leading to altered transcriptional initiation or elongation. The phenomenon of global transcriptional regulation is well known in bacteria. In Escherichia coli, for example, seven global regulators directly modulate expression of half of all the genes which are responsible for cellular metabolism (Perrenoud and Sauer, 2005).
Many eukaryotic chromatin-binding factors can potentially act as global modulators, including “typical” transcription factors such as c-Myc, PLZF and SALL4 that bind thousands of sites in gene promoters (Lin et al., 2012; Lovelace et al., 2016). In principle, binding need not be restricted to promoters or even specific DNA sequences, as any protein that interacts broadly with DNA may modulate gene expression at a global level. For example, histones that form the core of nucleosomes in eukaryotes are known to directly block transcription (Li et al., 2007; Workman and Kingston, 1998). To facilitate transcription, DNA must thus be unwound from the histone core, and histone modifications such as acetylation and methylation affect this process in various ways (Berry et al., 2017; Ouararhni et al., 2006; Woo et al., 2017). Although the precise mechanisms by which global regulation is achieved are still unclear, most hypotheses rely on alteration of the initiation rate of transcription. For example, H3K27 histone methylation has been suggested to affect chromatin compaction near the promoter (Berry et al., 2017). Other studies have considered the presence of dynamic promoter-proximal obstacles (Hao et al., 2017), or interactions between RNA polymerases (Epshtein, 2003; Sneppen et al., 2005).
By comparison, global regulation via processes that occur in the bodies of genes rather than at promoters has received less attention. Potential examples include chromatin compaction mediated through histone modifications or other features. In particular, DNA methylation is known to affect gene expression. This has been well-studied in the context of promoter methylation, which represses transcription. However, DNA methylation also seems to affect transcription via gene bodies; RNA Pol II has been observed to slow down on exons that are more methylated than the surrounding introns (Jonkers and Lis, 2015). Since methylated DNA is targeted by a broad family of methyl-CpG binding proteins (Hendrich and Bird, 1998), the presence of these proteins in gene bodies may have a direct effect on transcription. In this work we focus on the putative global modulator, methyl CpG binding protein 2 (MeCP2), which is highly expressed in adult human brain, especially in neurons (Lewis et al., 1992b; Skene et al., 2010). Loss-of-function mutations in the X-linked MECP2 gene or excessive amounts of MeCP2 caused by duplication of the gene lead, respectively, to the severe neurological disorders Rett Syndrome (Amir et al., 1999) and MeCP2 duplication syndrome (Ramocki et al., 2010). Missense mutations responsible for Rett syndrome accumulate mainly in two regions of MeCP2, coincident with the Methyl-CpG Binding Domain (MBD), whose binding to DNA requires methylated cytosine (Ho et al., 2008; Nan et al., 1993; 1996), and a downstream NCoR interaction domain (NID), which binds the TBL1/TBLR1 subunits of the NCoR corepressor complex (Kruusvee et al., 2017; Lyst et al., 2013). In addition to mCpG, MeCP2 also binds to methylated CpA, predominantly in the trinucleotide mCAC, whose accumulation in the brain coincides with increased neuronal MeCP2 expression (Gabel et al., 2015; Guo et al., 2014; Lagger et al., 2017; Lister et al., 2013a).
Although the interactions between MeCP2 and DNA are well established, its precise role in gene expression is less clear. It is evident that MeCP2 can repress transcription in a DNA methylation-dependent manner (Guy et al., 2001; Nan et al., 1997) and that repression is reduced in MeCP2 mutants that cannot bind co-repressor complexes or by the addition of HDAC inhibitors (Bird et al., 1998; Lyst et al., 2013; Nan et al., 1997; 1998). In mouse brain it is reported that depletion of MeCP2 increases global histone acetylation (Shahbazian, 2002; Skene et al., 2010) and modestly elevates expression of repeats and retrotransposons (Skene et al., 2010). However, MeCP2 does not behave as a conventional transcription factor with a few well-defined targets, as its binding site occurs on average every ~100 base pairs. Previous RNA-seq data, predominantly from mouse brain regions over-expressing or deficient in MeCP2, indicate that the magnitude of effects on transcription are often small. However, correlations between MeCP2 occupancy across genes and the degree of transcriptional change when MeCP2 is removed or over-expressed have been reported, raising questions about the regulatory processes at work (Gabel et al, 2015; Kinde et al, 2016; Lagger et al, 2017).
Here we set out to resolve the mechanism of MeCP2-dependent transcriptional regulation. Because MeCP2 binding sites occur in the vast majority of genes, we reasoned that most are likely to be influenced to some extent by its presence. To confront the technical and analytical challenges posed by modest changes in expression of large numbers of genes, we adopted a quantitative approach that combined deep, high quality datasets obtained from a uniform population of Lund Human Mesencephalic (LUHMES)-derived human dopaminergic neurons (Scholz et al., 2011) with mathematical modelling. We created a spectrum of LUHMES cell lines expressing distinct levels of MeCP2. Using transposase-accessible chromatin sequencing (ATAC-seq) and chromatin immunoprecipitation (ChIP-seq) together with mathematical modelling, we detected a robust footprint of MeCP2 binding to mCG and mCA in vivo and determined the amount of MeCP2 bound to DNA. Quantification of mRNA abundance by RNA-seq revealed a previously unknown relationship between changes in transcription and the density of mCG on gene bodies. To explain this observation, we proposed and tested several distinct mechanistic models. The only model consistent with our experimental results is one in which MeCP2 leads to slowing down of RNA polymerase II progression through a transcription unit. Importantly, mutant MeCP2 that is unable to bind NCoR fails to repress efficiently, suggesting that repression depends upon this interaction.
RESULTS
Our approach is illustrated in Figures 1A and 1B. We first created cell lines with varying levels of MeCP2 (Figure 1A) and then used a range of techniques to determine: DNA methylation (whole-genome bisulfite sequencing – WGBS), total number of MeCP2 molecules in each cell line, MeCP2 chromatin occupancy (ChIP-seq, ATAC-seq), changes in transcription (RNA-seq) and chromatin accessibility (ATAC-seq). We next constructed mathematical models of MeCP2-chromatin binding; fitting the models to experimental data enabled us to precisely determine the density of MeCP2 on the chromatin. These results were subsequently used in mathematical models of MeCP2 transcriptional regulation. A comparison between the models and experiments enables us to falsify candidate models (Figure 1B).
Seven LUHMES-derived cell lines expressing a wide range of MeCP2 levels were created and assayed by western blotting (Figure 1C). Based on morphology, expression of neuronal markers and the temporal profile of MeCP2 protein during differentiation (Figures S1A-D), we chose to analyse neuronal preparations consistently on day 9. The following cell lines expressing varying MeCP2 levels were generated: zero MeCP2 (KO; purple) (Shah et al., 2016), low levels via two independent knock-downs (KDs; dark blue, 10-20% WT and lighter blue, 30-40% WT), wild-type (WT, green), and three over-expression levels (OE; yellow, 3-fold; orange, 4-fold; and red, 11-fold; see Table S1). The resulting MeCP2 levels corresponded to approximately 2.5 × 10’ MeCP2 molecules per nucleus in WT (Figure S1E) and 2.7 × 107 molecules in OE 11× (Figure 1D). For comparison, MeCP2 abundance in our highest expressing cell lines exceeds that in mature neurons derived from mouse brain (~1.5 x 107 molecules per nucleus; (Skene et al., 2010)). Immunostaining for MeCP2 showed uniform expression of MeCP2 in most cell lines, with the exception of lines KD2 and OE 3x, which displayed somewhat variable levels (Figure 1E; middle panel). All cell lines differentiated with similar kinetics (Figures S1J and S1M), expressing similar levels of cellular markers (Figures S1K and S1N) and acquiring similar neurite length (Figures S1L and S1O). Visualization of neurofilament (NF) staining and other neuron-specific markers indicated that the differing levels of MeCP2 do not significantly affect neuronal differentiation in this system (Figures 1E and S1N). These findings agree with previous observations that Mecp2-null neurons differentiate efficiently in vitro (Kishi and Macklis, 2004) and Mecp2-null mice show no overt phenotypic defects until ~4 weeks of age (Chen et al., 2001; Guy et al., 2001).
Methylated CG is the dominant DNA modification in LUHMES neurons
MeCP2 was initially detected as a m5C binding protein and subsequent work has confirmed this property (Baubec et al., 2013; Gabel et al., 2015; Lagger et al., 2017; Lewis et al., 1992a; Skene et al., 2010). As a reader of this epigenetic mark, MeCP2 is not expected to affect the level of DNA methylation. Accordingly, global levels of DNA methylation were consistent in all cell lines tested, with ~3.7% of all cytosines being methylated (Figure S1P). We determined the detailed pattern of DNA methylation and hydroxymethylation at single base pair resolution in WT cells using whole-genome bisulfite sequencing, TAB-seq and ox-BS. The analysis revealed that 70% of CGs in the genome are methylated, equivalent to ~40 million mCGs per haploid genome (Figure 1F). Notably, only 2.1% of 5mC is in the CA context, corresponding to ~1 million mCA moieties per haploid genome (Figures 1F and G), as compared with up to 50% 5mC in mCA in NeuN-positive adult human neurons (Lister et al., 2013b). Additionally, only ~2 million CGs are hydroxymethylated in the LUHMES haploid genome (Figure 1F). In summary, methylation of LUHMES neuronal genomes occurs predominantly in the mCG context, with much lower levels of hmCG and non-CG methylation.
MeCP2 binds predominantly methylated CG genome-wide
To map the binding of MeCP2 in these human neurons, we performed quantitative MeCP2 ChIP-seq for KO, WT, OE 4× and OE 11×, and simultaneously developed a computer model that simulates the ChIP-seq procedure and MeCP2 binding in vivo (Figure 2A). A constant amount of Drosophila chromatin was added as an internal standard (Figure S2A) and we confirmed that ChIP enrichment (human DNA relative to Drosophila DNA) was proportional to the level of MeCP2 in each cell line (Figures 2B and S2B). To determine the distribution of MeCP2 in the genome we accumulated the number of MeCP2 ChIP-seq reads on and around certain genomic features, including methylated/unmethylated dinucleotides, CpG islands and promoters. When ChIP enrichment profiles were centred at methylated CGs, a peak whose height increases with the level of MeCP2 was visible (Figure 2C). However, the peak height was not proportional to the amount of MeCP2 in the cell, and a peak was also present in KO cells indicating a bias unrelated to MeCP2. To interpret this peak and to extract MeCP2 occupancy (fraction of mCG occupied by MeCP2), we fitted the enrichment profiles obtained from the in silico generated data to the experimental profiles and extracted the absolute MeCP2 density. Based on good agreement between the simulated and experimental profiles (Figure 2C), we determined that <10% of mCG sites are occupied by MeCP2 in OE 11x genome (Figure S2E), suggesting that fewer than 8 million mCGs are bound by MeCP2 despite the ~25 million MeCP2 molecules present in the nucleus. The specificity of this approach was confirmed by deriving MeCP2 binding profiles over unmethylated CGs, resulting in a significant drop in enrichment (Figure 2D). The simulation also reproduces a lack of enrichment over GT dinucleotides (Figure S2F, upper panel), and, despite its relative rarity, an enrichment over mCA (Figure S2F, bottom panel). A model in which MeCP2 binds only to mCG (no mCA binding) does not reproduce the mCA peak (Figure S2G, bottom panel), further supporting specific binding in vivo to mCG and mCA. A weak peak found in MeCP2 KO ChIP-seq over mCG and mCA is also reproduced by the model in the absence of MeCP2 (Figure 2C, KO panel). Its MeCP2-independent origin is unknown, but may be due to biases in sonication, PCR amplification or sequencing.
To further assess the binding model, we checked if it could correctly predict other features of the ChIP-seq data. Figure 2E shows very good agreement between experimental and simulated ChIP enrichment plotted as a function of mCG density. This implies a linear relationship between MeCP2 density on the DNA and mCG density. Figure 2F shows that enrichment profiles in regions near the transcription start site (TSS) are also correctly reproduced by the model, confirming depletion of MeCP2 from promoters. In addition, a significant depletion of MeCP2 over unmethylated CGIs is visible by inspection of individual tracks (Figure S2H) and a dip in both the experimental and simulated data corresponds to reduced binding in CGIs (Figure S2I). Figure 2G summarizes model-based quantification of MeCP2 occupancy on various genomic features. The large difference in MeCP2 occupancy between methylated and unmethylated CGIs and the moderate binding in the regions of gene bodies, intergenic regions and promoters agrees with the average methylation status of each feature.
In vivo MeCP2 footprint is visible on methylated CG and CA
To derive an independent measure of absolute MeCP2 density on the DNA and to detect its molecular footprint with high resolution, we performed ATAC-seq (Buenrostro et al., 2013) in KO, WT, OE 4× and OE 11× neurons. Transposase Tn5 cuts the DNA in regions devoid of DNA-bound proteins, hence the number of Tn5 “insertions” obtained from ATAC-seq represents DNA accessibility within chromatin. To interpret the data, we developed a computer model of MeCP2-chromatin binding and the molecular consequences of the ATAC-seq procedure (Figure 3A). We first obtained Tn5 insertion profiles around the same genomic features as for ChIP-seq. The profiles showed multiple peaks and valleys surrounding the mCG dinucleotide (Figure S3A). As this structure also occurs in KO neurons we concluded that it is related to factors other than MeCP2, such as the insertion bias of the transposase (Figure S3B,C). The GC bias of sequencing (Figure S3D) and time of incubation with transposase (Figure S3E) also affect the profiles. To remove these biases, we calculated relative insertion profiles between all tested cell lines and KO1. This resulted in a clear signal whose amplitude increased with MeCP2 concentration (Figure 3B). The same analysis performed for non-methylated CG showed no equivalent profile (Figure 3C). We interpret this “molecular footprint” as a direct evidence of MeCP2 binding to its cognate recognition sequence. Excellent agreement between simulated and experimental footprints (Figure 3B) allowed us to draw several conclusions. First, the size of the footprint is consistent with a protein that obscures ~11bp, in agreement with in vitro MeCP2-DNA binding analysis (Nan et al., 1993; Nikitina et al., 2007). Second, the amplitude of the footprint is linear with respect to MeCP2 concentration (Figure 3D). Third, in agreement with the upper estimate of bound MeCP2 deduced from MeCP2 ChIP-seq (<10% for OE 11×), only 6.3% of mCG sites are actually occupied by MeCP2 in OE 11× neurons, with less than 1% occupancy in WT (Figure 3D). Considering the number of MeCP2 molecules in OE 11× neurons (~ 2.5 × 107) and the total number of mCG sites (~8 × 107/diploid genome), we calculate that only (p = 0.063) × (8 × 107)/(2.5 × 107) ≈ 20% of MeCP2 molecules are bound to DNA at any given time in OE 11×. The same methodology allowed us to detect a footprint of MeCP2 binding to mCA but no footprint at control GT (unmethylated) or TA dinucleotides (Figure S3F).
Overexpressing MeCP2 decreases accessibility of promoters and gene bodies
We used the ATAC-seq data to infer the effect of MeCP2 on chromatin accessibility. We first identified regions of high and low (“background”) ATAC-seq insertion probabilities and defined accessibility a as the total number of insertions in the peaks divided by the average insertion counts. The value of a is proportional to the fraction of cells in which a given region is accessible to Tn5. Figure 3E shows frequency distributions of accessibility a in gene bodies for different cell lines. A striking decrease in accessibility a is evident between the two OE cell lines compared with KO1, while two independent KO cell lines (KO1 and KO2) are indistinguishable from each other. A similar effect is also detected at promoters (Figure S3G). The mean accessibility of genomic DNA decreases monotonously as MeCP2 abundance increases (Figure 3F). Superficially, reduced accessibility for DNA binding proteins, such as transcription factors, might be expected to cause a negative constraint on transcription; we test this and other models below.
Global changes in transcription correlate with MeCP2 occupancy and mCG density in transcription units
To determine the influence of MeCP2 on transcription, we performed RNA-seq on all seven cell lines expressing varying levels of MeCP2 (Table S1), and validated the results by qPCR (Figures S4A,B). We also measured total RNA levels and found that it was modestly reduced in KO neurons (~8% less compared with control), as seen previously in cultured human and mouse neurons and mouse brain (Figure S4C) (Lagger et al., 2017; Li et al., 2013; Yazdani et al., 2012). For each gene represented in the RNA-seq data, we calculated the Log2-fold change relative to the appropriate control (Log2FC) expressed as transcript abundance (transcripts per million, TPM; Figure 4A). An initial analysis that was restricted to “significantly-changed” genes revealed a strong reciprocal relationship of Log2FC to mCG density ρmCG (Figure S4D). However, since RNA-seq only measures expression relative to some common average, rather than absolute expression, there is uncertainty in defining the basal level (Log2FC=0) and significance (p-values). In view of this limitation, we considered reliance on arbitrarily determined “significance” unsuitable for mathematical modelling and therefore adopted a different approach. We binned all expressed genes with sufficient coverage in WGBS, regardless of their Log2FC (~ 15,000 genes) according to ρmCG and calculated the average Log2FC in each bin. This procedure reduced noise and removed the effect of gene-gene interactions, enabling us to detect small changes in expression which might otherwise be obscured by other regulatory mechanisms and statistical noise. Figure 4B shows that most genes have mCG densities between 0.8 and 4 mCG per 100bp. These genes exhibit a strong, reciprocal relationship between methylation in gene bodies and Log2FC, dependent on whether MeCP2 was overexpressed or deficient (Figure 4C). In KO and KD neurons, Log2FC relative to WT increased with ρmCG in gene bodies (Figure 4C, upper panel), whereas in OE 3x, 4× and 11× neurons, increasing ρmCG correlated with lowered expression (Figure 4C, bottom panel). Previous transfection-based assays of MeCP2-mediated repression also detected an initial steep decrease in expression that plateaued at higher mCG densities (Nan et al., 1997), mimicking MeCP2-overexpressing neurons in this analysis.
To determine the relationship between the change in gene expression and the concentration of MeCP2 binding sites, we calculated the maximum slope (Methods) of the Log2FC in expression versus ρmCG in gene bodies (Figure 4C, black lines). The slope is strikingly proportional to MeCP2 density in gene bodies (Figure 4D). In contrast, when this analysis was repeated for promoter regions rather than gene bodies the slope was close to zero, indicating only a weak dependence on ρmCG in promoters (Figure 4E,F). Thus, although chromatin accessibility over promoters is constrained in MeCP2 OE cell lines (Figure S3G), promoter methylation appears to have little impact on the expression of genes that are actively transcribed. Since we previously established (ChIP-seq, ATAC-seq) that ρmCG is a strong predictor of MeCP2 density on the chromatin (Figures 4C,E), we infer that MeCP2 density in gene bodies has a strong effect on transcription, in agreement with a previous proposal (Kinde et al., 2016).
The results suggest that MeCP2 globally modulates transcription, however, they do not distinguish whether MeCP2 acts as a repressor or an activator of transcription as genes both increase and decrease their expression level in OE and KO/KD lines. Moreover, the apparent correlation between expression and mCG density does not prove causality, as Log2FC correlates with other quantities, including gene length, CG density regardless of methylation, and even GC content (Figures S4E-G).
Mathematical modelling excludes three plausible models of MeCP2 transcriptional regulation
To gain mechanistic insight into MeCP2 transcriptional regulation we proposed a range of models, derived predictions from each model, and compared these predictions with the experimental data to eliminate those models that are incompatible with the data. All five models are based on a commonly accepted paradigm of gene expression (Figure 5A), well supported in both eukaryotic and bacterial cells (Angel et al., 2015; Arjun Raj, 2008; Shahrezaei et al., 2008). A gene i switches between two states, ON or OFF, with rates kON,kOFF. Expression is negligible in the OFF state, whereas in the ON state it assumes rate Ji (transcripts/second). The rate J depends on the initiation rate α and elongation rate ν. Transcripts are degraded with rate di. We define fi as the fraction of cells that at any given time have gene i in the ON state. Thus, quite generally, the mean abundance of mRNA from gene i in the population of many cells, measured as transcripts per million (TPM) in our RNA-seq, can be expressed as TPMi ∝ jifi /di (Equation S3). Therefore, TPMs averaged over many genes with uncorrelated Ji,/fi,di; should be proportional to f, the average fraction of these genes in the ON state. Figure 5B shows that this is indeed true for 11× OE and KO when we take ATAC-seq-derived promoter accessibility a as a proxy for f. Based on the paradigm model, MeCP2 could in principle regulate transcription by affecting initiation, elongation or the fraction of cells with specific genes in the ON state (Figure 5A).
In our first model – the ON/OFF Condensation Model – MeCP2 affects the proportion f of genes in the ON state by condensing chromatin, for example by recruiting remodelling enzymes (Figure 5C). In accordance with our paradigm model we assume that f is proportional to the fraction of “open” (accessible) promoters as measured by accessibility a. Figure 5D shows that a decreases with increasing promoter mCG density, perhaps reflecting the known inhibitory effect of promoter methylation on transcription. Notably, accessibility of gene bodies also declines with increasing MeCP2 concentration but only a weak dependence on gene body methylation is discernible (Figure S5A). A possible interpretation of these results is that promoter accessibility (and thus the fraction fi of cells with gene i in the active state) depends on both the level of MeCP2 and the level of gene methylation. However, Log2FC of the expression ratio OE/KO predicted by promoter accessibility from the model (Figure S5B; black line) does not align with the Log2FC curve from the RNA-seq data (Figure S5B; red line). A similar plot focussed on the mCG density in gene bodies shows a clear Log2FC gene expression decay with increasing mCG density in the RNA-seq data, whereas the curve predicted from the model is almost flat (Figure 5E). We therefore conclude that the change in the proportion of ON/OFF genes caused by chromatin condensation cannot satisfactorily explain most of the observed correlation between expression and mCG density in gene bodies.
A second model – the Initiation Model – proposes that MeCP2 bound downstream of the promoter affects transcription initiation via mechanism such as interfering with binding of Pol IIand cofactors or Pol II pausing (Figure 5C) (Jonkers and Lis, 2015). In this model, we expect Log2FC to correlate better with mCG density near the promoter region than far away from the promoter. This does not agree with Figure 4 which shows the opposite effect. To confirm that this is not an artefact of differences in the accuracy with which mCG density can be determined in short promoter regions versus much longer gene bodies, we selected 2kbp-long regions around the TSS and compared them with same-length regions 40kb away (TSS+40kb), for ~3000 genes longer than 50kb. A potential confounding factor would be long-range correlations in mCG density which might cause Log2FC to correlate with both the promoter and gene body methylation even in the absence of a causal relationship. This correlation is very weak, however (r = 0.08, Figure S5C). Plotting Log2FC(OE 11x/KO) versus TSS+40±1kb (gene body) mCG density, showed the characteristic decrease in gene expression with increasing gene body methylation (Figure S5D). In contrast, the dependence on promoter methylation (TSS±1kb) was weak, offering little support for the Initiation Model. The non-zero slope in OE 11x may indicate a weak effect of MeCP2 on promoter activity, or it may reflect residual correlations between promoter and gene body methylation.
Since our analysis points to MeCP2 affecting transcription predominantly via gene body methylation, a reasonable expectation is that MeCP2 affects some aspect of the elongation phase of transcription. In our third model – the Detachment Model (Figure 5F) – MeCP2 causes Pol II to abort transcription either by directly blocking further polymerase progress or by indirectly causing the polymerase to detach from DNA, e.g., via chromatin modification. Log2FC in this model (Methods) is a function of the total number of methylated CGs (NmCG) in the gene: Log2FC = exp(–γMNmCG), where M is MeCP2 concentration relative to WT, and the parameter γ is proportional to the probability that Pol II aborts transcription when it collides with MeCP2. The dependence on the total number of mCGs in the gene is due to each “roadblock” increasing the probability of premature termination. We calculated the model’s only parameter γ by fitting the model to the Log2FC (KO/WT) data (Figure 5G, left panel). We found that the model failed to reproduce the Log2FC vs NmCG relationship for the OE 11X cell line (Figure 5G, right panel). The model also incorrectly predicts the observed relationship between Log2FC and mCG density in gene bodies (Figures S5E,F). We conclude that the Detachment Model can be rejected.
MeCP2 creates “slow sites” that impede transcriptional elongation
Finally, we considered a hypothesis that Pol II slows down when it encounters an obstruction in the form of MeCP2 or a reversible chemical modification of chromatin induced by MeCP2. This leads to Pol II queuing in front of the obstruction (Figure 6A), the density of which is equal to the density of MeCP2 on the DNA. Mathematical modelling (Methods) shows that, regardless of model details, the maximum slope of the Log2FC curves from Figure 4C should be proportional to the concentration of MeCP2, in agreement with the experimental data (Figure 4D). In the Slow Sites Congestion Model (our fourth model), which is based on exclusion processes (Blythe and Evans, 2007; Derrida, 1998; Sahoo et al., 2014; Turci et al., 2013), MeCP2 creates “slow sites” at methylated CGs. The parameters are: the fraction ρ of mCGs bound by MeCP2 (estimated from ATAC-seq), the speed ν of Pol II far from affected mCGs, the speed νs at affected mCGs, and (specific to each gene) the length L of the gene, the density ρ of methylated CGs, and the transcription initiation rate α. We took ν = 60bp/s, close to the maximum observed speed of Pol II in human cells (Fuchs et al., 2014; Veloso et al., 2014). In the biologically realistic limit kON,kOFF ≪ α, the transcription rate] predicted by the model is determined by three parameters: α,νs/ν, and pρ. Figure 6B shows the transcription rate for OE 11x predicted by the model as a function of initiation rate α, for different p and for νs/ν = 0.05. The assumed value of νs/ν = 0.05 anticipates a temporary MeCP2-dependent “roadblock” lasting on average 20s, which is compatible with the reported in vivo residence time of MeCP2 on chromatin (25-40s (Klose et al., 2005)). All curves in Figure 6B have a characteristic shape: a linear relationship J ≈ α for small α, followed by saturation at high initiation rates. Saturation occurs due to congestion: polymerases form queues in front of slow sites (SI Videos 1,2). A similar behaviour occurs in a fifth model (a modification of the Slow Sites Congestion Model) – the Dynamical Obstacles Congestion Model – in which MeCP2 molecules bind/unbind to mCG and directly block Pol II (Figure S6E,F, SI Video 3).
To estimate the rates α we fitted both models (fourth and fifth) to Log2FC(OE 11x/OE ctr) versus gene-body mCG density (Figures 6C and S6G). The predictions of both models agree well with the experimental data for OE 4× and KO (Figures 6C and S6G). The slopes of the Log2FC plots from the models also agree with the experimental slope for all seven cell lines (Figures 6D and S6H).
In summary, of the five models we considered, only the two Congestion Models (Slow Sites and Dynamical Obstacles) cannot be rejected based on the available data, and we consider the slowing down of Polymerase II to be a plausible mechanism by which MeCP2 affects transcription (Figure 6E).
MeCP2 binding to both DNA and NCoR are essential to slow down RNA Pol II
Under the Congestion Models, MeCP2 may impede RNA polymerase II progression directly by simple steric interference. Alternatively, it may induce an altered local chromatin structure that retards RNA polymerisation indirectly, for example by increasing the affinity of nucleosomes for DNA by deacetylation (Zentner and Henikoff, 2013). Aware that MeCP2 overexpression caused robust effects on transcription, we overexpressed mutant forms of MeCP2 (R111G and R306C) to a similar extent in order to distinguish these possibilities (Figures S7A-C). The Rett mutant R111G is unable to bind methylated DNA (Kudo et al., 2003), while Rett mutant R306C has lost the ability to recruit the histone deacetylase complex NCoR (Kruusvee et al., 2017; Lyst et al., 2013) (Figure 7A). The R111G mutant protein was 7-fold over-expressed and R306C was 11-fold over-expressed compared to WT control (Figures 7B and S7D,E). In the case of R111G, overexpression is not expected to affect transcription, regardless of whether MeCP2 interferes directly or indirectly with transcription elongation, as both effects would require binding to DNA. Accordingly, we observed no mCG-density dependent transcriptional changes due to overexpression of MeCP2-R111G (Figure 7C). The R306C mutant, on the other hand, should repress transcription if inhibition is directly due to MeCP2 binding to DNA. Again, we observed that 11-fold overexpression of MeCP2-R306C relative to WT MeCP2 gave a predominantly flat curve, indicating a significant loss of DNA methylation-dependent repression (Figure 7D). The presence of a weak slope suggests that minimal DNA methylation dependent repression persists, perhaps due to direct interference of DNA-bound MeCP2-R306C with transcription. The predominant conclusion is that MeCP2 binding to methylated cytosines is essential for slowing down elongation, with recruitment of the NCoR co-repressor complex playing an important role, as R306C, like R111G, does not fall on the line defining the linear relationship between gene repression and MeCP2 concentration (Figure 7E). On balance these findings favour an indirect mechanism of repression, whereby MeCP2 alters the chromatin state to impede the transcription process.
DISCUSSION
Accumulating evidence indicates that MeCP2 influences the expression of large numbers of genes in the brain. While the advent of highly replicated, well-controlled RNA-seq analyses has allowed the detection of the relationship between MeCP2 binding and restraint of transcription (Gabel et al., 2015; Kinde et al., 2016; Lagger et al., 2017), the small magnitude of differences in transcription between wildtype and mutant cells or tissues makes it difficult to distinguish the underlying mechanisms. To overcome this limitation, we have intimately combined experimental design and data capture with mathematical modelling to derive quantitative predictions with the aim of distinguishing competing hypotheses. To generate a consistent and relevant source of experimental data, we modified a genetically homogeneous line of cultured human neuronal progenitor cells to generate neuronal lines expressing widely disparate levels of MeCP2. Despite this difference in MeCP2 concentrations, each line retained the ability to rapidly differentiate into neurons.
Combined molecular characterisation and mathematical modelling revealed new features of MeCP2 binding to chromatin. Notably, we used ATAC-seq to visualise for the first time an in vivo footprint of MeCP2 binding to genomic mCG and mCA sites. Footprint intensity increased with MeCP2 concentration, providing direct evidence that the primary mode of binding to genomic DNA in vivo is dependent on methylation of cytosine in these short sequences. Mathematical modelling of quantitative ChIP-seq and ATAC-seq data provided independent estimates of the fraction of all potential binding sites that are occupied by MeCP2. Both methodologies agree that in vivo only ~6% of all mCGs are occupied in OE 11 x neurons, which is far from theoretical saturation. A potential contributory factor could be restricted accessibility to DNA due to nucleosomes, which may allow MeCP2 to associate exclusively with exposed inter-nucleosomal linker DNA. It was initially suggested that MeCP2 can bind to DNA on the surface of a nucleosome (Chandler et al., 1999), but more recent studies have emphasised its association with linkers and exclusion from nucleosomal DNA (Ishibashi et al., 2008; Nikitina et al., 2007; Thambirajah et al., 2011). Our results support the latter view.
Having developed a data-led mathematical model of MeCP2 binding to chromatin, we used this information to explore its effects on transcription in neurons expressing various levels of MeCP2, ranging from zero to 11x the WT amount. A strength of our approach is that we included all the transcription data, rather than focussing only on genes whose change in transcription was arbitrarily classified as significant. In general, increasing MeCP2 abundance caused progressively larger effects on gene expression, allowing us to clearly relate the degree of transcriptional disturbance to the density of mCG in gene bodies. This high-level relationship extends previous evidence derived from mouse brain (Kinde et al., 2016; Lagger et al., 2017), but does not shed light on the mechanisms at work. To address this issue, we evaluated RNA-seq data against a series of mechanistic models. Models whereby MeCP2 reduces initiation of transcription were not favoured, suggesting that elongation of RNA Pol II through the transcription unit may be affected. We rejected the possibility of premature transcription termination, and finally considered a model in which MeCP2 slows down the progress of RNA Pol II without displacing it. This last model reproduced the data very well.
Finally, we asked whether the inhibition of transcript elongation could be direct, for example due to interference with RNA Pol II progression by bound molecules of MeCP2, or if its effects are indirectly mediated by other mechanisms. To address this issue, we over-expressed Rett syndrome-causing mutant forms of MeCP2 that lacked either DNA binding capacity or the ability to interact with the NCoR co-repressor complexes. Under either model, loss of the interaction between methylated DNA and MeCP2 would abolish the relationship between MeCP2 binding sites and transcription, a scenario we indeed observed. Importantly, a missense mutation that greatly reduces the MeCP2-NCoR interaction also violated the established relationship between MeCP2 concentration and transcriptional inhibition. This result is compatible with the view that MeCP2 binding affects the chromatin template indirectly so as to impede transcription. A candidate mediator of this effect is histone modification, in particular histone acetylation. The NCoR1 and NCoR2 complexes both recruit histone deacetylases whose action would increase the affinity between DNA and nucleosomes and thus inhibit transcription (Zentner and Henikoff, 2013). In fact, cell transfection assays using MeCP2 and methylated reporters have demonstrated that repression depends upon histone deacetylase activity (Nan et al., 1997; 1998).
In summary, a close alliance between mathematical modelling and quantitative molecular characterisation of human neurons has allowed us to discriminate molecular mechanisms underlying the effects of MeCP2 on gene expression. We propose that our approach will illuminate the mechanisms by which other global regulators choreograph gene expression programmes.
Author contributions
Conceptualization, A.B., J.C.-W., R.S., B.W.; Methodology, J.C.-W., P.G., O.P., R.S., B.W.; Software, K.C., P.G., B.W., S.W.; Formal analysis, K.C., J.C.-W., B.R., B.W., S.W.; Investigation, K.C., J.C.-W., O.P., B.R., R.S., B.W., S.W., M.Y.; Visualization, K.C., J.C.-W., R.S., B.W., S.W.; Writing – Original draft, A.B., J.C.-W., R.S., B.W.; Writing – Review & Editing: all authors; Supervision, A.B. and B.W.; Funding acquisition, A.B.
METHODS
Experimental model and subject details
We used LUHMES-derived human neurons as a convenient cellular system characterized by rapid and effective differentiation (~5 days) and maturation times (~10 days) to an electrically active state (Figure S1A and (Scholz et al., 2011)). LUHMES cells are a sub-clone of the MESC.1 cell line, isolated from mesencephalon of an 8-week old embryo and immortalized with v-myc under the control of a TET-OFF system to maintain their proliferating capacity (Scholz et al., 2011). As immortalized neuronal precursors (day 0), they express Sox2 and are negative for mature neuronal markers such as MAP2 and TH (Figure S1B). After the addition of tetracycline in the culture medium to turn off v-myc expression, LUHMES cells differentiate and express neuronal markers such MAP2, NeuN (Fox-3) and dopaminergic markers such as TH, DRD2, and DAT (Figures S1B,C). MeCP2 is progressively upregulated during the course of differentiation and its maximal level is achieved around day 9 (Figure S1D).
MeCP2 level manipulation
To create two independent MECP2 knock-out lines, we used CRISPR-mediated gene disruption (Shah et al., 2016). Briefly, cells were nucleofected with two plasmids; one containing Cas9 and a guide RNA targeting exon 3 of MECP2 gene, and another encoding Puromycin resistance (Table S2, Figure S1F).
For generation of MeCP2 knock-downs, several shRNAs against MeCP2 were designed using Sigma-Aldrich Mission shRNA online software. Two shRNAs were chosen and cloned into pLKO.1 vector including scrambled shRNA as a control and lentiviruses were created (Table S2, Figure S1G). Knock-down cell lines were analysed as pool of cells.
To increase the level of MeCP2 we created lentiviruses expressing MeCP2 from two alternative promoters in the pLKO.1 vector: Synapsin and cytomegalovirus (CMV)(Figure S1H). The human Synapsin promoter drove expression of the MeCP2 E2 isoform, while the CMV promoter drove the MeCP2 E1 isoform. As a control, we expressed GFP under the control of the CMV promoter (Figures S1I). The Synapsin promoter becomes active on day 5 of LUHMES differentiation (Figure S1M – middle panels; see insets), while the CMV promoter is expressed from day 0 which allowed us to generate cell lines expressing different levels of MeCP2 overexpression, as follows. Cells overexpressing MeCP2 at 3-fold wildtype levels were obtained by infecting cells with lentiviral particles containing the Synapsin-MeCP2 construct. The resulting cells (OE 3x) were analysed as a pool. OE 4× and OE 11× was generated by infecting cells with lentiviruses containing the Synapsin-MeCP2 and CMV-MeCP2 constructs, respectively (Figure S1M – upper and middle panels; see insets for GFP expression). OE control cells were made by infecting cells with viruses containing the CMV-GFP construct (S1M – bottom panels; see insets). All three cell lines (OE 4×, OE 11x and OE ctr) were clonally selected by FACS into 96-well plates. OE R111G and R306C lines were generated by infecting cells with lentiviruses containing the appropriate CMV-mutated MeCP2 construct (Figure S7A) and selected as previously described. The mutations were introduced into plasmid using QuickChange II site-directed mutagenesis kit (Agilent Technologies; 200523).
Method Details
LUHMES culture and differentiation
The procedure for culture and differentiation of the LUHMES cell line was previously described (Scholz, Poltl et al. 2011). Briefly, proliferating LUHMES cells were cultured on Nunc plasticware coated with 40 μg/ml Poly-L-ornithine (Sigma-Aldrich; P-3655-100mg) and 1μg/ml Fibronectin (Sigma-Aldrich; F-1141-5mg) in medium for proliferating cells: Advanced DMEM (Life Technologies; 12634-010) plus N2 supplement (Life Technologies; 17502-048), 2mM L-Glutamine (Sigma-Aldrich; G7513), and 40 μg/ml bFGF (R&D Systems; 4114-TC-1 mg). For differentiation, LUHMES cells were seeded on 40 μg/ml Poly-L-Ornithine and 1 μg/ml Fibronectin coated plastic-ware at a density of 3-4 x 104 cells/cm2 in medium for proliferating cells. Next day after seeding, the medium was changed to the differentiation medium containing Advanced DMEM, 2mM L-Glutamine, N2 supplement, 1mM cAMP (Sigma-Aldrich; D0627), 1μg/ml Tetracycline (Sigma-Aldrich; T-7660) and 2 ng/ml GDNF (R&D Systems; 212-GD-50 μg). Two days later, cells were trypsinised and seeded at a density of 1.1×105 cells/cm2 in the differentiation medium for the final differentiation.
Lentiviral particle preparation and LUHMES infection
Lentiviruses were produced in HEK293FT cells (Life Technologies; R700-07) by retrograde transfection of vectors: 7.5 μg pLKO.1; 4.6 μg psPax2; 2.8 μg pMD2G using Lipofectamine 2000 (Life Technologies; 11668027). Briefly, for one transfection, plasmids in appropriate ratios were added to 1.5 ml OptiMEM (Life Technologies; 31985062) and 36 μl of Lipofectamine 2000 was mixed with 1.5 ml of OptiMEM. Both solutions were combined and incubated for 20 min at room temperature. Meanwhile, HEK293FT cells were trypsinised and concentration-adjusted to 1.2×106/ml. 5 ml of cell suspension was added to a 10 cm dish; the transfection mix was then added and the culture was left overnight. Cells were left for 72 hours, after which the supernatant was collected, and viral particles were precipitated using a PEG Virus Precipitation kit (BioVision) following the manufacturer’s protocol. Viruses were aliquoted and either used directly or frozen at −80°C. LUHMES cells were infected with lentiviral particles overnight. The next day, cells were selected with 0.5 μg/ml Puromycin (Life Technologies; A11138-03), and concentration of antibiotic was reduced the following day to 0.25 μg/ml. The population of resistant cells was either analysed as a pool or FACS-sorted into a 96-well plate in order to isolate single clones.
Western blot, qPCR, immunofluorescence, live cell imaging
For protein analysis cells were either lysed to give a whole cell extract or nuclei were isolated. For whole cell extracts cells were homogenized in the NE1 buffer (20 mM HEPES pH 7.9, 10 mM KCl, 1 mM MgCl2, 0.5 mM DTT, 0.1% Triton X, 20% glycerol, 2mM PMSF, protease inhibitor) and treated with Benzonase (Sigma) for 15 min at RT to remove DNA. Protein concentration was measured (Bradford, Bio-Rad) and loading buffer (5x, 250 mM Tris·HCl pH 6.8, 10% SDS, 30% [v/v] Glycerol, 10 mM DTT, 0.05% [w/v] Bromophenol Blue) was added. Samples were boiled for 5 min prior to loading 40 μg of protein extract onto a polyacrylamide gel.
To prepare nuclei, cells were lysed in hypotonic buffer (0.32 M Sucrose, 10 mM Tris-HCl pH 8, 3 mM CaCl2, 2 mM MgOAc, 0.1 mM EDTA, 0.5% NP-40). Nuclei were washed with NP-40-free buffer, counted, centrifuged, and resuspended in an appropriate volume of 20% glycerol in PBS. Nuclei were treated with Benzonase, loading buffer was added and samples were boiled for 5 min. Nuclear lysates (1.6×105 per lane) were run on the gradient 4-15% SDS-PAGE gels (Mini-Protean, Bio-Rad) in running buffer (25 mM Tris, 190 mM glycine, 0.1% SDS) at 200 V for ~40 min alongside a protein size marker (PageRuler, Thermo Scientific). Proteins were transferred onto a Nitrocellulose membrane using wet transfer method for 1 hour at 200V. The membrane was stained first with Ponceau S solution for quality check and then treated with blocking buffer (PBS, 1% PVP, 1% non-fat dried milk, 0.1% Tween 20, 0.01% NaN3) for 30 min at RT. The membrane was incubated for 1 hour with primary antibodies (see Table S3) at room temperature and then 1 hour with secondary antibodies conjugated with either IRDye 700DX or IRDye 800CW (LI-COR). The membrane was washed with PBS containing 0.1% Tween 20 after adding each antibody and was imaged using Odyssey CLx (LI-COR).
For qPCR analysis, total RNA was isolated using RNeasy kit (Qiagen) and treated with DNase I (DNA free kit, Ambion) to remove genomic DNA. The efficient removal of genomic DNA in the RNA samples was tested by performing PCR using primers against the GAPDH genomic locus. Synthesis of cDNA was performed using qScript cDNA Supermix (Quanta Biosciences) from 1 μg of total RNA according to the manufacturer’s protocol. cDNA was diluted 100x and 2.5 μl aliquots subjected to qPCR with SensiMix SYBR & Fluorescein Mastermix (Bioline) and appropriate primers (see Table S4) on a LightCycler 480 (Roche).
For immunofluorescence analysis, cells were seeded on coverslips and either fixed the next day for undifferentiated LUHMES cells or differentiated for defined time points and then fixed with 4% formaldehyde for 10 min at room temperature. Cells were permeabilised with 0.2% Triton X in PBS for 10 min and blocked with 10% fetal bovine serum in PBS for 30 min at RT. Primary antibody (see Table S3) incubation was performed in 1% FBS, 0.1% Tween 20 in PBS for 1 hour at RT and coverslips were washed three times with 0.1% Tween 20 in PBS. Secondary antibodies (Alexa Fluor 488, Alexa Fluor 555, Alexa Fluor 647, Thermo Scientific) were diluted 1000x in 1% FBS, 0.1% Tween 20 in PBS and applied for 1 hour at RT. Finally, cells were washed three times with 0.1% Tween 20 in PBS, stained with 5000-fold diluted DAPI in PBS for 10 min at RT, mounted using Prolong Gold or Diamond (Thermo Scientific) and dried overnight at RT. Cells were imaged on the Leica TCS Sp5 confocal microscope (Leica Microsystems) using either 40x or 63x oil immersion objectives. Image analysis was performed using Volocity (PerkinElmer). Live cell imaging of differentiating neurons was performed using the IncuCyte Zoom system (Essen BioScience) and neurite length analysis was performed using NeuroTrack package implemented into IncuCyte operating software.
Preparation of total nucleic acid for estimation of RNA versus DNA quantity
Cells were washed in PBS, lysed in lysis buffer (10mM Tris HCl [pH 7.4], 50 mM NaCl, 0.5% SDS, 100mM EDTA, 300 μg/ml proteinase K) and incubated at 50°C for 2 hours. Total nucleic acid was recovered from the completely lysed sample by ethanol precipitation in 2 volumes of 100% ethanol at room temperature (for 30 minutes), and pelleting by centrifugation. The pellet was washed once in 2 volumes of 70% ethanol, and the nucleic acid pellet was resuspended in hydrolysis buffer containing 1× DNase I buffer (NEB), 1mM zinc sulphate, DNase I (NEB) and Nuclease P1 (Sigma). After 4 hours, the sample was mixed thoroughly and digested for a further 8 hours. After 12 hours at 37°C, the sample was heated to 92°C for 3 minutes and cooled on ice. Two volumes of 30mM sodium acetate, 1 mM zinc sulphate [pH 5.2] were added plus additional Nuclease P1 and the nucleic acids were further digested to deoxyribonucleotide and ribonucleotide 5’ monophosphates for a further 24 hours at 37°C. The samples were then subjected to HPLC.
HPLC nucleotide quantifications
HPLC was performed on the 5 μm Apex ODS C18 column, with isocratic 50 mM ammonium phospate (monobasic) mobile phase. UV absorbance was recorded at 276 nm (dCMP, elution time 9.4 minutes), 282 nm (5mdCMP, elution time 17 minutes), 268 nm (dTMP, elution time 21.9 minutes), 260nm AMP and dAMP (elution times 27 minutes and 62.47 minutes) and 254 nm (GMP and dGMP, elution times 11.1 minutes and 29.7 minutes). Extinction coefficients used in nucleotide quantifications were dCMP, 8.86 x 10-3; 5mdCMP 9.0 x 10-3; dTMP, dGMP/GMP 12.16 x 10-3; dAMP/AMP 15.04 x 10-3. Relative amounts of all nucleotides were calculated from the area under each peak (Chromeleon software) using the respective extinction coefficients.
Whole genome bisulfite seq and TAB-seq
Methylome analysis of wildtype LUHMES-derived neurons at day 9 was performed by whole genome bisulfite sequencing using an Epignome Methyl-seq kit (Epicentre, Illumina). Briefly, genomic DNA was isolated using the DNeasy Blood and Tissue kit (Qiagen) according to the manufacturer’s protocol. The bisulfite reaction was performed using EZ DNA Methylation Gold kit on 100 ng of gDNA according to the manufacturer’s protocol. Genomic DNA was mixed with 0.5% of unmethylated lambda DNA (Promega) for assessment of efficiency of bisulfite reaction. Random primed DNA synthesis and tagging was performed on converted gDNA. Finally, DNA was PCR amplified to generate a library for high-throughput sequencing. To map hydroxymethylcytosine in LUHMES-derived neurons, TAB-seq and ox-BS technologies were used in parallel. For TAB-seq, the TAB reaction was performed according to the published protocol (Yu et al., 2012) and genomic DNA was sonicated for 30 cycles of 30 sec ON and 30 sec OFF on low power using a Bioruptor (Diagenode). DNA was end-repaired and the ends were 3’-adenylated in order to facilitate adapter ligation. Size selection was performed using Agencourt AMPure XP (Beckman Coulter) beads and, after adapter ligation and size selection, DNA was bisulfite treated using the EpiTect Bisulfite kit (Qiagen) and PCR amplified using custom primers. All libraries were sequenced as 100 bp long pair-end reads on HiSeq 2500 Illumina platforms.
RNA-seq
Total RNA was isolated from all generated cell lines (Table S1) at day 9 of differentiation using either the RNeasy Mini kit or the AllPrep DNA/RNA Mini kit (Qiagen). Genomic DNA contamination was removed with the DNA-free kit (Ambion) and remaining DNA-free RNA was tested for purity using PCR for the GAPDH genomic locus. Total RNA was tested on the 2100 Bioanalyzer (Agilent Technologies) to ensure a RIN quality higher than 9 and quantified using Nanodrop. Equal amounts of total RNA (1 μg to 2.5 μg) were taken forward for library preparation and ERCC RNA Spike-in control mixes (Ambion) were added according to the manufacturer’s guide. Ribosomal RNA was depleted using the Ribo-Zero Gold rRNA Removal module (Epicentre, Illumina) and isolated mRNA was tested for purity on a 2100 Bioanalyzer (Agilent Technologies). mRNA was quantified with Qubit and equal amounts used for cDNA synthesis and 3’ terminal tagging using ScriptSeq v2 RNA-seq library preparation kit (Epicentre, Illumina). The library was PCR amplified to add adaptors and barcodes. RNA-seq libraries were sequenced as 100 bp pair-end reads using HiSeq 2000 or HiSeq 2500 Illumina platforms.
ATAC-seq
Chromatin accessibility in four cell lines (KO, WT, OE 4× and OE 11×, see Table S1) was quantified by ATAC-seq which was performed as in (Buenrostro et al., 2015). Neurons were scraped from the plate and nuclei were isolated using hypotonic buffer (10 mM T ris-HCl pH7.4, 10 mM NaCl, 3mM MgCl2, 0.1% [v/v] Igepal CA-630) and counted. 50,000 nuclei were resuspended in 50 μl transposition reaction mix containing 2.5 μl Nextera Tn5 Transposase and 2x TD Nextera reaction buffer. The mix was incubated for 30 min at 37 °C. DNA was purified by either the MinElute PCR kit (Qiagen) or the Agencourt AMPure XP (Beckman Coulter) beads and PCR amplified with the NEBNext High Fidelity reaction mix (NEB) to generate a sequencing library. Libraries were sequenced as 75bp long pair-end reads on a HiSeq 2500 Illumina platform.
MeCP2 ChIP-seq
LUHMES-derived neurons at day 9 of differentiation with four levels of MeCP2: KO, WT, OE 4× and OE 11× (see Table S1) were crosslinked with 1% of Formaldehyde (Sigma) in the medium for 10 min at RT and quenched with 2.5 M Glycine (Sigma) for 2 min at RT. Cells were washed with PBS, scraped from the plate and centrifuged. Crosslinked nuclei were isolated using hypotonic buffer (10 mM Tris-HCl pH7.4, 10 mM NaCl, 3mM MgCl2, 0.1% [v/v] Igepal CA-630) and counted using a haemocytometer. Chromatin from ~4×106 nuclei was sonicated for 20 cycles (30 sec ON and 30 sec OFF) using Bioruptor (Diagenode) on the high power setting. Crosslinked and sonicated chromatin was mixed with 60 ng of sonicated Drosophila chromatin (Active Motif) as a spike-in, and the mix was incubated overnight at 4 °C with antibodies against MeCP2 (D4F3, Cell Signalling) plus spike-in antibody (Active Motif). After overnight incubation, magnetic Protein G coated beads (Thermo Scientific) were added and incubated for 4 hours at 4 °C. Beads were washed, and chromatin was reverse-crosslinked overnight at 65 °C. DNA was purified using the Agencourt AMPure XP (Beckman Coulter) beads. For ChIP-seq library preparation, IPs for each condition were pooled together to achieve 5 ng total DNA as a starting material. For example, 3-4 IPs were pooled together for the KO sample and 2 IPs were pooled for the WT sample. Libraries were prepared using the NEBNext Ultra II DNA library Prep kit (NEB) for both IPs and corresponding inputs. Libraries were sequenced as 75bp long pair-end reads on a HiSeq 2500 Illumina platform.
Quantification and statistical analysis
Statistical analysis
Calculation of standard deviation, standard error of mean and t tests for qPCR, Western blots, methylation and total RNA quantification using HPLC were performed using GraphPad Prism version 7.
Bisulfite sequencing
Trimmomatic version 0.32 (Bolger et al., 2014) was used to perform quality control on 94b paired-end reads to remove adapter sequence and poor quality bases at the ends of reads. We used Bismark version 0.10 (Krueger and Andrews, 2011) to further align and process the reads. Mapping was performed in bowtie2 mode to the human hg19 reference genome. Following alignment, reads were deduplicated and methylation values were extracted as bismark coverage and cytosine context files. For genome wide visualisations we calculated the methylation percentage at each cytosine position as (mC/C)x100 to generate bigWig files.
ChIP-seq data preparation
Trimmomatic version 0.32 (Bolger et al., 2014) was used to perform quality control on 75b paired-end reads to remove adapter sequence and poor quality bases at the ends of reads. We used bwa mem version 0.7.5 (Li, 2013) to map reads to the human hg19 reference genome. We filtered the alignments to remove reads that map to multiple locations in the genome and to blacklisted regions defined by the ENCODE project. We further removed duplicate reads with Picard version 1.107 MarkDuplicates (http://broadinstitute.github.io/picard/). To account for varying read depths we used deepTools version 2.5.1 (Ramirez et al., 2016) to create bigWig files normalised by RPKM (reads per kilobase per million reads).
Enrichment profiles on mCG and other genomic features
To quantify MeCP2 occupancy on the genomic features of interest (mCG, GmC, CGC, etc.), we processed ChIP-seq data as follows. The input data for the analysis are the start and end position (s, e) of the ChIP reads. We reject reads longerthan 1 kb regarding them as alignment artefacts. For each chromosome we create an array ni of the number of instances a particular locus (a single base pair at position i in the chromosome) is covered by ChIP-seq by incrementing it by one for each ChIP read for which s ≤ i ≤ e. When all reads in a given chromosome have been processed, accumulated counts around features of interest (e.g., mCG or mCA) are further analysed as follows. For feature x at position j we select a region of interest of length L around it (j ± L/2 bp) and accumulate the counts in an array (different for each feature x): , for all i such that . We proceed in a similar way for CG islands, TSS, and other regions (Figures 2F,G and S2I), in all cases positioning the midpoint j at the centre of the region of interest. Box 1 shows an example to illustrate the algorithm.
Box 1. Accumulation algorithm for ChIP-seq.
The following example demonstrates the procedure used to obtain ChIP-seq enrichment profiles. Let us take the following sequence as the reference genome with feature of interest (in this case CG) marked red:
Assume that ChIP-seq has generated the following reads:
Number of reads ni covering a specific site in the genome is
Raw counts over a fixed-length region (here +/-5bp, in the actual algorithm this would be +/- 5000bp) centred at the feature of interest are
Total accumulated counts versus position i relative to the feature of interest is therefore
In the real analysis, the above sequence would be a 10kbp-long array of integers.
The obtained accumulated counts for ChIP-seq are normalized (divided) by the accumulated counts for the corresponding input DNA-seq. This reduces the sequencing bias present in the data. We then calculate enrichment profiles (Figures 2C,D and S2F,G) as where Norm1(c)[i] normalizes the counts profiles such that their flanks have values close to one:
WGBS methylation data was used to determine whether a C was classified as methylated. If not stated otherwise, we assumed that at least 50% of the reads (coverage >0) must have been methylated to count a particular C as methylated.
Computer simulations of ChIP-seq
To understand how the position and the shape of the ChIP-seq enrichment profiles depend on the density of DNA-bound MeCP2, we performed computer simulations aimed at reproducing the experimental data.
We assume that MeCP2 occupies methylated cytosines with probability p times the probability of binding to a particular motif (Table S5). Binding probabilities for different motifs are based on binding affinities from (Sperlazza et al., 2017) and relative binding strengths from (Lagger et al., 2017).
We then create simulated ChIP fragments (Figure 2A). Our main assumption is that if a DNA fragment (s, e) (s=start position, e=end position) contains at least one MeCP2 bound to it, it will be present in the simulated ChIP-seq. On the other hand, if the fragment does not contain any MeCP2, it may still be present in the ChIP-seq data with probability pbg which accounts for “background” reads in ChIP-seq even in the absence of MeCP2. This is similar to previous models of ChIP-seq (Xu et al., 2010); even best ChIP-seq libraries can have a significant level of noise (pbg, close to 1) (Liang and Keles, 2012).
The input to the simulation are the *.bed files with ChIP or input (DNA) reads, and the two parameters p, pbg. The simulation uses the data files to preserve the distribution of the lengths of reads but not their positions. The positions are determined by the following algorithm:
For each input file, GC content- and length bias is first estimated by constructing a table of counts r[%GC][l] for the actual reads and g[%GC][l] for genomic sequences of the same length l, %GC, and randomly selected locations.
For each read from the experimental *.bed file we calculate its length l = e − s and then select a random start point s (within the same chromosome as the original read) and the end point e = s + l.
GC content and the number of mCs for the simulated read is calculated. We take a particular C to be methylated with probability proportional to the fraction of methylated reads containing this site in our methylation data (WGBS).
If there is at least one mC within the read, an auxiliary variable P is set to 1 with probability p times the relative binding affinity of the motif to which this C belongs. Otherwise P is set to pbg. This accounts for MeCP2 present/absent in this particular DNA fragment.
P is multiplied by r[%GC][l]/,g[%GC][l] to account for the GC content and read length bias.
The simulated read is accepted with probability P and saved to a file, or rejected (with probability 1 − P). If the read is rejected, another random position s is selected and the algorithm continues from (3).
Go to step 2 unless all reads from the *.bed file have been processed.
Simulated reads are then processed in the same way as the experimental ChIP data.
Since mCG is much more frequent than any other MeCP2-binding motif, in what follows we focus on enrichment on and around mCG. Figure S2C shows the enrichment profile on mCG obtained in simulations for pbg, = 0 and different p. Counterintuitively, the height of the central peak does not decrease with decreasing p, but it slowly increases. The height of the peak tends to a constant as p→ 0 (Figure S2D). This can be explained as follows. For large p, fragments centred at adjacent mCGs overlap, increasing the counts in the flanking regions (|i| ≫ 1). Normalization Norm(c)[i] lowers the apparent height of the peak since it divides the profile by the counts in the flanking regions. As p decreases, fragments from neighbouring mCGs overlap less; this decreases the counts in the flanks and increases the relative height of the central peak after normalization.
The overlap between fragments from neighbouring mCGs causes the shape of the peak to slightly vary with p. This can be used to approximately estimate p from the experimental data. To do this, for each ChIP-seq data set we fitted the simulated profile (parametrized by p,pbg) to the experimental enrichment profile by minimizing the sum of squared differences, with respect to p and pbg. The region +/-100bp was chosen because of the sensitivity of the profile shape to p,pbg, in this region. Minimization was performed as follows: we simulated profiles for a range of p and two values of pbg: 0 and 0.9; profiles for pbg, between these two values were obtained by linear interpolation (permitted due to the assumed additive model of signal and background reads). The minimum χ2 obtained in this way for 11x OE is plotted in Figure S2E. Any p ≤ 0.1 gives similar (low) values of χ2, indicating that p = 0.1 is the upper bound on mCG occupancy that can be obtained from ChIP-seq data.
Figure S2F shows predicted profiles on features other than mCG. We have used here the values of (p,pbg) obtained from the mCG profiles by minimizing χ2 with respect to pbg and assuming p = 0.1,0.05,0.02,0.002 for 11x OE, 4× OE, WT, and KO, respectively. These values are based on the best-fit p = 0.1 obtained for 11x OE, the other p’s being a fraction of this value approximately proportional to the relative abundance of MeCP2 in a given cell line versus 11x OE. We see in Figure S2F that strong peaks on mCA and trace peaks on GT (which MeCP2 does not bind to) are well reproduced. We stress that no additional adjustable parameters have been used here (no fitting).
We also simulated a model in which MeCP2 binds only to mCG. Figure S2G shows that while this model is equally good at reproducing peaks on mCG, it does not reproduce the peak on mCA.
ATAC-seq data preparation
Trimmomatic version 0.32 (Bolger et al., 2014) was used to perform quality control on 75b paired-end reads to remove adapter sequence and poor quality bases at the ends of reads. We used bwa mem version 0.7.5 (Li, 2013) to map reads to the human hg19 reference genome. We filtered the alignments to remove reads that map to multiple locations in the genome, to blacklisted regions defined by the ENCODE project or to mitochondrial DNA. To account for varying read depths we used deepTools version 2.5.1 (Ramírez et al., 2016) to create bigWig files normalised by RPKM (reads per kilobase per million reads).
Analysis of ATAC-seq data
ATAC-seq was analysed in a similar way to ChIP-seq, except that instead of taking the start and end positions of reads we used positions of insertion sites obtained from raw sequencing data. The positions of insertions were accumulated as in the ChIP-seq analysis to create insertion counts profiles centred at different genomic features (mCG, mCA, GT, TA). To calculate the relative insertion probability profiles showing molecular footprints of MeCP2, we calculated where and are the insertion counts profiles for a given cell line and KO1, respectively, and Norm2 normalizes the counts profiles such that their flanks have values close to one:
ATAC-seq computer simulations
In order to understand how different sources of bias introduced by the ATAC-seq protocol affect the observed footprint of MeCP2, we developed a computer model of ATAC-seq. We use the same binding model as in the ChIP simulations (binding to mC with motif-dependent affinity). We simulate binding to a short DNA sequence (first L=50Mbp of chromosome 1). We assume that MeCP2 occupies 11 bp [Nan et al., 1993] and that the protein is centred on a mC, thus the obscured genomic sequence is xxxxxmCxxxxx where x can be any nucleotide. We simulate the action of the Tn5 transposase by splitting the sequence into fragments in areas free of MeCP2. The following algorithm (Figure 3A) is repeated T times (larger T corresponds to longer digestion times):
A random location i is chosen as the position of the new cut. We assume the following convention: i refers to the position of the nucleotide immediately to the left from the centre of the cut. Hence, i + 1 denotes the position of the nucleotide to the right from the cut. Positions i,i + 1 are the positions of insertion sites (see the above Analysis of ATAC-seq).
The position is accepted with probability pi which depends on the nucleotide sequence i − w,…,i + w where w=10 bp. This is based on the known Tn5 sequence preference across 21 = 2w + 1 bp that it contacts (Buenrostro et al., 2013; Schep et al., 2015). The probability pi is calculated using the position weight matrix (PWM) obtained from ATAC-seq reads for KO1.
If the position is rejected, go back to 1.
If there is enough space for Tn5, a cut is made between i and i + 1. We assume that Tn5 occupies 21bp (i − w,…,i + w) and for a cut to be made there must not be any protein (MeCP2) overlapping with this region. There must not be any previous cuts made in this region too.
If the proposed cut is rejected, go to step 1 and try again.
A weight W = exp(b × GC) is assigned to each fragment where GC is the GC content (0…1) of the fragment and b is a constant. This simulates an additional GC bias that may be created by the sequencing procedure.
The simulation creates artificial fragments and their weights that we process in the same way as in the ATAC-seq analysis to obtain the simulated footprint of MeCP2. The shortest possible fragment generated by this algorithm will be thus 21bp which corresponds to two Tn5 cutting just next to each other. Since MeCP2 occupies ±5bp, the shortest MeCP2-containing fragment is 11 +21 =32bp. We take only fragments >35bp since only such fragments are present in our experimental data.
The input to the program is the DNA sequence (fixed), the density p of MeCP2 on mCxx, the average density of insertion (cut) sites t = T/L (cannot be larger than 1/21 bp), and the GC bias b.
To test the role of the parameters of the shape and depth of the footprint of MeCP2, we simulated the model for a range of p, t, b. We also performed simulations with/without Tn5 bias.
Figure S3B shows the counts profiles for p = 0 (no MeCP2) with and without Tn5 bias, compared to experimental profiles. It is evident that the insertion bias is required to reproduce the experimental insertion profiles. However, the bias cancels out when calculating the relative profile (footprint) ft. This is demonstrated in Figure S3C which shows the footprint obtained by dividing the counts profile for p = 0.05 by the profile for p = 0, for different t and b = 6 (Tn5 bias) orb = 0 (no bias). Increasing digestion time t increases the height of the side peaks surrounding a depression caused by MeCP2. However, the depth of the depression does not change noticeably.
Figure S3D, left shows that changing the GC bias b can does not change the depth of the footprint if the bias is the same in the simulated test and reference samples. However, the footprint can change if the GC bias is different in the two samples, see Figure S3D, right. Since such a mismatch in the GC bias would have a notable signature (rising/falling flanks >50bp away from mCG) which we do not see in our data, we conclude that the bias is very similar in all experimental samples.
Figure S3E shows that dividing the counts for p > 0 by the counts for p = 0 but with a different digestion time t changes the depth of the footprint slightly. Experimental variation causes t to be slightly different for different samples, and hence our results can have a small systematic error.
We also simulated the presence of nucleosomes by treating them as another protein of size 147bp that binds to the DNA, and distributing them randomly and uniformly on the DNA. We found no differences in the footprint compared to the case with no nucleosomes. However, the distribution of the number of fragments exhibited maxima at fragment lengths equal to runs of one, two, three etc. nucleosomes. This is in qualitative agreement with the ATAC-seq fragment length distribution (data not shown).
Our simulations cannot explain two subtle features of the experimental data: the presence of oscillations superimposed on the main profile, and a small central peak visible in WT/KO, CTRL/KO, and OE 4×/KO. We speculate that the first is caused by steric interactions between MeCP2 and Tn5, and the latter by interactions of proteins other than MeCP2 with mC or a small difference in the GC bias among the samples (Figure S3D, right).
Fitting simulated ATAC-seq profiles to data
The results from the previous section suggest that while some features of the footprint depends on how the samples have been prepared and sequenced (parameters t, b in our simulations), the depth of the footprint remains unaffected as long as we can guarantee that the test and control samples have been processed in a similar way.
We used the depth of the footprint to extract MeCP2 occupancy (fraction of occupied mCGs given by the parameter p). We simulated profiles for many pairs (p, t), for p = 0 … 0.1, t = 0.0.08, and a fixed b = 6.0 (the exact value is unimportant since GC bias cancels out when calculating fi). We then selected those (p, t) which minimized the distance between the simulated and experimental profiles and ,
This formula assigns a larger weight to the region which is least sensitive to variations in the ATAC-seq protocol (edges of the central dip) and reduces the systematic error caused by the (small) central peak of unknown origin (see previous section). Figure 3B shows the best-fit profiles centred on mCG. The best fit to 11x OE footprint yields p = 0.063, t = 0.04. Figure 3D shows p versus MeCP2 level M for all four cell lines. The relationship is linear, with the best-fit p = Mcell line/MWT.
Figure S3F shows profiles for OE 11x centred on other features (mCA, GT, TA). Although the fitting has been done to the mCG profile only, the footprint on mCA and the lack of it on GT, TA predicted by the simulation agree reasonably well with experimental data. This gives us confidence that the simulation captures the most important aspects of MeCP2-DNA binding and ATAC-seq. We note that the observed amplitude of the mCA footprint is slightly stronger than predicted by our simulations. A probable explanation is the reduced reliability of mCA detection in TAB-seq as compared to mCG.
ATAC-seq insertion peaks
We used the accumulated insertion counts ni to find the positions and heights of regions of frequent insertions (“insertion peaks”) as follows. For each gene, we calculated its mean insertion count and selected regions in which . We then calculated the total number of insertions in these regions. Accessibility was defined as the sum of all insertions in the peaks divided by the “background” :
RNA-seq data preparation
All paired-end sequencing reads were trimmed, and quality controlled using Trimmomatic version 0.33 (Bolger et al., 2014). The filtered reads were then mapped using STAR version 2.4.2 (Dobin et al., 2013) using hg19 human genome assembly and Ensembl 74 release for annotation. Additionally, TPM values for genomic features were quantified by quasi-mapping approach using Sailfish version 0.10.0 (Patro et al., 2014). Protein coding transcripts for Sailfish index generation were taken from Gencode release 19. Further detailed information on the parameters are in supplementary information. In order to assign reads to genomic features, featureCounts version 1.5.0 was used (Liao et al., 2014). Differential Expression analysis was performed using DESeq2 (Love et al., 2014). Comparisons between mutant vs wildtype and were performed within the same batch using the design formula ~differentiation+condition in order to account variance from the differentiation.
RNA-seq data analysis
All analysis was carried out using a subset of protein-coding genes characterized by the following features: sufficient methylation coverage (WGBS; at least 80% C detected as methylated, coverage 20 or higher), and gene bodies 1kb or longer. This resulted in 15382 genes out of the initial 17764 protein-coding genes (86%).
The variability in Log2FC for genes with similar methylation is quite high due to factors other than MeCP2. In order to get a meaningful signal representing the contribution from MeCP2 we binned the Log2FC of many genes according to their methylation. For plots of Log2FC versus methylation density (Figure 4C,E) we used bins of constant width 0.1bp/100bp. For other plots we varied the bin size to compromise between the number of plotted data points and the variability (represented by error bars in the plots) within a single bin.
To obtain the slope (Figure 4C,E) we fitted straight lines Log2FC = aρmCG + b to Log2FC(expression) to different ranges of ρmCG ∈ [ρ1, ρ1 + Δρ], for different pairs ρ1 ∈ [0.5,1] and Δρ ∈ [1,4] (units: 1/100bp). We then selected the fit with the largest absolute slope to error ratio (largest |a|/σa where σa is the standard error of a) and returned a of this fit as the largest computed slope of the Log2FC curve.
In all plots of Log2FC of differential gene expression (both experimental and theoretical) we also shifted the Log2FC values so that the average Log2FC in the range of mCG density ρmCG ∈ [1,6] bp/100bp was the same for all samples. This was motivated by a difficulty in determining the absolute levels of expression since we did not quantify total mRNA and the conversion from RNA-seq reads to TPMs removed any dependence on the total mRNA.
Mathematical Modelling
Generic model of gene expression
Below we list all the parameters and observables of the generic (“paradigm”) model of gene expression (Shahrezaei et al., 2008) from Figure 5A which is the basis for all other models. For each gene i we define
- ki,ON, ki,OFF – the switching rates OFF→ON and ON→OFF, respectively.
- Ji – transcription rate (in transcripts/second).
- αi – transcription initiation rate (1/second)
- vi – transcription elongation rate (bp/second).
- di – mRNA degradation rate (1/second).
- fi – the fraction of cells in the population that at any given time have gene ¿ in the ON state.
We assume that mRNA degrades according to 1st order kinetics (Rabani et al., 2011).The steady-state concentration ci of mRNA from gene i is then obtained by equating the influx (through transcription in the ON state) and degradation: from which it follows that
The number of transcripts per million bp (TPM) which we obtain from RNA-seq is thus where N is the (generally unknown) proportionality constant that depends on the sample and is different for different cell lines, replicates and conditions but is the same for all genes in a given sample.
Condensation model
This model considers a hypothesis (ultimately proven to be false, see the main text) that MeCP2 causes chromatin condensation which reduces the fraction of cells with genes in the active (ON) state (Figure 5C). We assume that the fraction fi of cells with gene i in the active state depends on promoter openness ai (measured by ATAC-seq) which in turn depends on the level M of MeCP2 and gene methylation ρi:
The model also assumes that transcription rate Ji and mRNA degradation rate di are not affected by MeCP2. The Log2FC of differential expression of gene i for cell lines X and Y is then
The average Log2FC of genes with the same methylation density ρ is therefore where 〈…〉 denotes averaging over genes with the same ρ. According to the above equation, Log2FCX/Y should yield the same curve (modulo a vertical shift due to different normalization factors NX and NY) as the logarithm of the ratio of accessibilities of X versus Y when plotted as a function of methylation density. Figures 5E and S5B shows that this is not the case. We can therefore reject the hypothesis that MeCP2 modulates gene expression primarily by altering the fraction of active genes.
Detachment model
In this hypothetical scenario we assume that RNA Pol II aborts transcription with some small probability λ when it collides with MeCP2 or encounters a chemical mark left by the interaction between MeCP2 and chromatin (Figure 5F). The probability that RNA Pol II reaches the end of the gene (transcription end site, TES) is thus where n is the number of “abort sites” on the gene, proportional to the number of MeCP2 molecules on the gene. ATAC-seq shows that the density of MeCP2 on a gene is proportional to its methylation density ρ and the total concentration M of MeCP2 in the nucleus, hence we can write that n = AMρL = AMNmCG where A is an unknown proportionality factor, L is the length of the gene, and NmCG is the total number of mCGs. Log2FC of the differential expression X versus Y can then be written as where γ = λAMY/log2 e is an unknown parameter identical for all cell lines, and is the relative difference between the level of MeCP2 in cell lines X and Y. For example, Δ MKO/WT = −1 and ΔM11×0E/WT = 10. Log2FCX/Y should therefore follow a straight line when plotted versus NmCG, and the slope of this line should be positive for KO/WT, and negative (and 10x more steep) for 11xOE/WT. Figure 5G shows that when the model is fitted to KO data to fix the unknown constant γ, it fails to reproduce the OE 11x data. We hence conclude that there is no evidence that MeCP2 cause premature termination of transcription.
Congestion model
General considerations
We consider a hypothesis that MeCP2 slows down the elongation step of transcription by creating queues of RNA Pol II in front of MeCP2 or chemical modifications left by it in gene bodies (Figure 6A). We assume that the density of obstacles, or “slow sites”, is equal to the density of molecules of MeCP2 bound to the DNA.
We show in the main text that (i) the density m of MeCP2 on the DNA is proportional to the concentration M of MeCP2 in the cell and (ii) that m is proportional to mCG density ρ. We can thus write that the transcription rate J averaged over many genes with the same mCG density ρ is
The second argument of J accounts for non-MeCP2 but mCG-density dependent modulation. Figure S5G show that gene expression (TPMs) versus gene body methylation for KO and 11x OE. The difference between the two cells lines is very small, and non-MeCP2 dependent component of J dominates the behaviour of J(ρ). We are thus permitted to rewrite J in the factorized form where K is a fast-changing function of ρ and the correction ϵ(m) due to MeCP2 is a slowly increasing function of m, and ϵ(0) = 0. This leads to the following expression for the Log2FC of cell lines X versus Y:
The latter approximation is valid since ϵ is supposed to be small. In particular, the Log2FC of any cell line X versus KO reads because ϵ(ρMKO) = ϵ(0) = 0. The Log2FC curves for different cell lines versus KO will therefore have the same shape when plotted in the variable ρMX. It follows from this that the maximum slope of these lines will be proportional to MX.
A similar argument holds when the reference is not the KO but a control line CTR with low but non-zero level of MeCP2, following a mild assumption that ϵ(m) is linear in m for small m and saturates for large m. We have
We can calculate ϵ(m) from this equation:
This enables us to express the Log2FC of any cell line X versus control CTR as
The maximum slope will occur for ρ = 0. In the vicinity of this point, where ϵ’(0) is the derivative of ϵ(m) at m = 0. The maximum slope is therefore proportional to MX/MCTR − 1. This is corroborated by experimental results presented in Figure 4D.
Specific examples of models that leads to congestion
We consider two example mechanisms by which MeCP2 could affect elongation.
Slow sites model
In this model, MeCP2 causes a chromatin modification that slows down the RNA polymerase II. To mathematically model this process we use totally asymmetric simple exclusion process (TASEP) with open boundaries (Blythe and Evans, 2007; Derrida, 1998; Krug, 1991). A gene is represented as a one-dimensional chain of L sites. Each site is either occupied by a particle, representing RNA Pol II, or is empty. Particles enter the chain at site i = 1 with rate α (transcription initiation rate), move along the chain and exit at site β = L with rate β. Sites can be either “fast” or “slow”. Slow sites represent mCGs affected by the interaction with MeCP2, whereas fast sites are all other sites (methylated or not). The speed of the particles is v on fast sites and vs on slow sites. Slow sites are randomly and uniformly distributed, and their density is ρs = ρp where ρ is the mCG density, and p is the probability that an mCG is occupied by MeCP2 (as in the ChIP-seq and ATAC-seq models). p is taken to be 0.063 for OE 11 x, and proportionally smaller for other cell lines (p = 0.0058Mcell/MWT, see Fitting simulated ATAC-seq profiles to data).
Since RNA Pol II occupies about 60bp on the DNA and moves with the speed of about 60-70 bp/s (Fuchs et al., 2014), it is convenient to equate a single site of the model with a 60bp-long stretch of the actual DNA, and set the RNA Pol II speed to v = 1 sites per second on fast sites. We also assume β = v = 1 sec−1 and that RNA Pol II is not blocked from exiting the chain at the end, so that always α < β.
We simulated this model for different chain lengths L = 166,500,1666,5000 corresponding to gene lengths between 10kb and 300kb, and a range of initiation rates α ∈ [0.001,1], densities of slow sites ρs ∈ [1/64,8] (mCG density between 0.026 and 13.3 per 100bp), and slow-site velocities vs = 0.01,0.02,0.05,0.1,0.2 (all rates are in 1/sec). For each set of (α,ρs, vs) we first let the model to reach steady state (“thermalization step”). We then measured the flow J of particles (equivalent to the rate of transcription) through the chain. The flow strongly depends on ρs and only weakly (logarithmically) on the length L (Figure S6A,B).
Therefore, in what follows we fix L = 5000 sites, which is equivalent to the gene length of 300kb. The flow J obtained from these simulations is presented in Figure 6B as a function of α, for different mCG densities ρ, and for p = 0.063 (OE 11 x). The flow is approximately linear in α until some critical αc which depends on ρs, and saturates at J = Jmax when α > αc.
To relate this model to the mRNA-seq differential expression data we must calculate Log2FC: where ρs,X = ρpX, ρs,Y = ρpY in which ρ is mCG density and pX,pY are MeCP2 occupation probabilities for cell lines X,Y. We take , and similarly for pY. In the above expression we know all quantities except the initiation rate α.
To obtain the initiation rate we bin genes according to their methylation as described, and then, for each bin, fit Log2FC(11xOE/WT) from the above equation to the experimental Log2FC. This gives us α(ρ) as a function of methylation density ρ (Figure S6C, average α = 0.027 s−1), as a calibration, to test our model prediction on the Log2FC data of the other cell lines. This approach, rather than fitting initiation rates of individual genes, removes correlations due to gene-gene interactions and produces a relatively smooth curve α(ρ). We can then use the fitted α(ρ) to predict Log2FCX/Y for other pairs of cells lines. The results are presented in Figure 6C,D and, as described in the main text, are in good agreement with experimental Log2FCs.
When the rates {α} are taken to be the same as for HeLa cells (1347 genes from (Fuchs et al., 2014) that were also present in our RNA-seq; average α = 0.05 s−1), the model still broadly agrees with the data, except for KO (Figure S6D) which is very sensitive to the exact values of α.
A model in which J is first evaluated for individual genes (defined by their ρ, L) and then Log2FC obtained by binning genes according to their mCG density ρ does not significantly improve the agreement (not shown).
Dynamical obstacles model
This model is very similar to the slow sites model with two exceptions: (i) polymerase always moves with the same speed v (no slow sites) as long as it is not blocked by other polymerases and MeCP2, (ii) MeCP2 binds and unbinds dynamically from the methylated sites (Figure S6E). We assume that unbinding occurs with rate ku per MeCP2 molecule, whereas binding occurs with rate kup per unoccupied mCG. MeCP2 does not bind if an mCG is occupied by MeCP2 or polymerase. The density of MeCP2 is thus ≈ pρ in the absence of transcription and is the same as in the slow-sites model. The parameters of the model are: α, v,L,α,p and, in addition, the unbinding constant ku.
The model behaves similarly to the slow-sites model. Figure S6F shows a plot of the flow as a function of α and ρ, for fixed ku = 0.04, p = 1. Figures S6G,H show that (after fitting the initiation rates as described for the slow-site model) the model is also able to reproduce the experimental data, but the apparent fraction p of occupied mCGs must be close to 1. This suggests that MeCP2 may slow down RNA Pol II by altering chromatin structure rather than by direct steric interference as indicated in the main text.
Acknowledgments
We thank Tanja Waldmann for introducing us to the LUHMES cell line, Beatrice Alexander-Howden for technical support, David Kelly for microscope assistance, Martin Waterfall for help with FACS and Jim Selfridge for help with preparing samples for HPLC. We thank Sabine Lagger and John Connelly for critical assessment of the manuscript. The work has made use of resources provided by the Edinburgh Compute and Data Facility (ECDF; www.ecdf.ed.ac.uk) and was supported by a Wellcome Trust Programme Grant and Investigator Award to AB. JCW was supported by a grant from the Rett Syndrome Research Trust. RS was supported by a Wellcome Trust 4 year PhD studentship. BW was supported by a Personal Research Fellowship from the Royal Society of Edinburgh.