The information about when and where each gene is to be expressed is mainly encoded in the DNA sequence of enhancers, sequence elements that comprise binding sites (motifs) for different transcription factors (TFs). Most of the research on enhancer sequences has been focused on TF motif presence, while the enhancer syntax, i.e. the flexibility of important motif positions and how the sequence context modulates the activity of TF motifs, remains poorly understood. Here, we explore the rules of enhancer syntax by a two-pronged approach in Drosophila melanogaster S2 cells: we (1) replace important motifs by an exhaustive set of all 65,536 possible eight-nucleotide-long sequences and (2) paste eight important TF motif types into 763 positions within 496 enhancers. These complementary strategies reveal that enhancers display constrained sequence flexibility and the context-specific modulation of motif function. Important motifs can be functionally replaced by hundreds of sequences constituting several distinct motif types, but only a fraction of all possible sequences and motif types restore enhancer activity. Moreover, TF motifs contribute with different intrinsic strengths that are strongly modulated by the enhancer sequence context (the flanking sequence, presence and diversity of other motif types, and distance between motifs), such that not all motif types can work in all positions. The context-specific modulation of motif function is also a hallmark of human enhancers and TF motifs, as we demonstrate experimentally. Overall, these two general principles of enhancer sequences are important to understand and predict enhancer function during development, evolution and in disease.
Transcriptional enhancers are DNA sequence elements that control gene expression by modulating the transcription of their target genes in specific cell types and conditions (Banerji et al. 1981; Levine 2010). These elements contain short sequence motifs bound by different transcription factors (TFs), and the combined regulatory cues of all bound TFs determine an enhancer’s activity (Spitz and Furlong 2012). Due to the critical role of enhancers in development, evolution and disease (Levine 2010; Rickels and Shilatifard 2018), understanding how enhancer sequences encode function is a major question in biology. Previous studies have highlighted the importance of sequence constraints within enhancers, such as the presence of TF motifs and features related to the motifs’ flanking sequences, affinities, and arrangements (their number, order, orientation and spacing), termed here ‘motif syntax’ (King et al. 2020; Jindal and Farley 2021; Ludwig et al. 2000; Kulkarni and Arnosti 2003; Zinzen et al. 2006; Panne 2008; Swanson et al. 2010; Liu and Posakony 2012; Erceg et al. 2014; Crocker et al. 2015; Farley et al. 2016, 2015; Fiore and Cohen 2016; Smith et al. 2013; Sharon et al. 2012; Thanos and Maniatis 1995; Hanes et al. 1994; Arnosti et al. 1996; Avsec et al. 2021; de Almeida et al. 2022). However, while mutations in enhancer sequences can change enhancer function and lead to morphological evolution and disease (Visel et al. 2009; Levine 2010; Gompel et al. 2005; Rickels and Shilatifard 2018), enhancers usually display only modest or no sequence conservation across species (Villar et al. 2015; May et al. 2012; Blow et al. 2010; Schmidt et al. 2010; Arnold et al. 2014; Fuqua et al. 2020; Ludwig et al. 1998) and even random DNA sequences can show enhancer activity (Galupa et al. 2022; de Boer et al. 2019). Therefore, the importance of sequence constraints and motif syntax within enhancers remains an outstanding question in gene regulation.
Two main models have been proposed to explain how enhancer sequence relates to function. The enhanceosome model assumes very strict syntax rules with invariant motif arrangements required for cooperative TF binding (Panne 2008). In contrast, the billboard model proposes that TFs bind independently without constraints on how motifs are arranged within the enhancer (Arnosti and Kulkarni 2005; Kulkarni and Arnosti 2003). Yet very few enhancers fit these models, having either invariant syntax or no constraints at all, and most enhancers fall in between these two extremes, with a flexible syntax yet high degree of dependency between enhancer features (Kulkarni and Arnosti 2003; Vockley et al. 2017; Jindal and Farley 2021). This complexity in enhancer sequence has prevented the generalization of sequence-rules derived from individual enhancers into unifying principles of the regulatory code, thus limiting our understanding of the sequence constraints related to motif syntax and TF activity in enhancers.
Although enhancer sequences evolve rapidly, their function can be conserved despite significant sequence changes (Rastegar et al. 2008; Ludwig et al. 1998, 2000; Swanson et al. 2011; Taher et al. 2011; Fisher et al. 2006; He et al. 2011; Wong et al. 2020; Weirauch and Hughes 2010; Farley et al. 2015; Blow et al. 2010; Schmidt et al. 2010; May et al. 2012; Villar et al. 2015; Arnold et al. 2014; Vaishnav et al. 2022). This suggests that there is considerable flexibility within enhancer sequences, and that the maintenance of function-defining features rather than overall sequence similarity is important for enhancer activity. This is illustrated most clearly by the maintenance of TF motifs at invariant positions or at different relative positions within orthologous enhancer sequences (Ludwig et al. 1998, 2000; Arnold et al. 2014; Wong et al. 2020; Rastegar et al. 2008). However, how flexible or constrained motif positions within enhancers are at both the DNA sequence and the TF motif level, i.e. how many different sequence variants or motif types might functionally replace the wildtype sequence at important motif positions, has remained unknown. Similarly, even though TF motifs have been observed to move between different enhancer positions over the course of evolution (presumably a consequence of motif decay and de novo formation), and despite position-independence being a key assumption of the billboard model, the influence of the position and sequence context on a motif’s contribution to enhancer function is not understood. These knowledge gaps restrict our understanding of the functional and evolutionary flexibility of enhancer sequences and of how many sequence variants, as they might arise by DNA mutagenesis, might lead to similar or different enhancer activities.
Here, we investigated how many defined DNA sequences might functionally replace the wildtype sequence in various motif and control positions by exhaustively testing all possible 8-nucleotide-long random sequence variants at these positions in two enhancers in Drosophila melanogaster S2 cells. At each position, hundreds of sequence variants corresponding to several different motif types could functionally replace the wildtype sequence (i.e. constitute solutions), suggesting that enhancer sequences display flexibility within and across motif types. However, at each position, these solutions constituted only a tiny fraction of the approximately 65,000 possible sequences, indicating that enhancer sequence flexibility is constrained. In addition, the solutions differed between positions and most TF motifs had highly context-dependent activities. Indeed, across hundreds of enhancer positions eight prominent motif types contributed to enhancer activity with different intrinsic strengths that were further modulated by the enhancer sequence context, namely the flanking sequence, the presence and diversity of other motif types, and the distance between motifs. The modulation of TF motif activity by the sequence context is a general enhancer feature, as we also demonstrate in human cells.
STARR-seq comprehensively assesses the activity of enhancer variants revealing constrained enhancer sequence flexibility
To systematically test which sequences function at a given enhancer position, we used an approach inspired by studies that tested the activity of fully randomized regulatory sequences (Farley et al. 2015; Galupa et al. 2022; Vaishnav et al. 2022; de Boer et al. 2019) or the local fitness landscape of the green-fluorescent protein (GFP; Sarkisyan et al. 2016; Somermeyer et al. 2022). We generated a comprehensive library of sequence variants by replacing a specific 8nt stretch of an enhancer with randomized nucleotides (N8) and assessed the enhancer activity of each variant by UMI-STARR-seq in Drosophila S2 cells (Fig 1A; see Methods; Arnold et al. 2013; Neumayr et al. 2019). We tested the power of this approach at the position of a GATA TF motif within the ced-6 enhancer (ced-6 position 241nt, or pos241) that is required for its activity. We recovered all possible 8nt variants (65,536) in the input library and obtained reliable enhancer activity measurements for each variant (Fig S1). These measurements showed that the vast majority of all variants drive low activity levels, while only 374 (<1%) achieve similar activity to wildtype (+/-10%) and 600 (1%) drive even higher activity, i.e. constitute valid solutions at this motif position (Fig 1B).
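The library design and the wildtype comparison described above can be illustrated with a minimal Python sketch. The function names are hypothetical, coordinates are assumed to be 0-based, and the +/-10% window is assumed to apply on a linear activity scale; the source does not specify the scale, so this is an illustration, not the authors' pipeline.

```python
from itertools import product

NUCLEOTIDES = "ACGT"


def all_kmers(k=8):
    """Return every DNA sequence of length k (4**k sequences in total)."""
    return ["".join(letters) for letters in product(NUCLEOTIDES, repeat=k)]


def replace_stretch(enhancer, start, kmer):
    """Replace the stretch starting at 0-based index `start` with `kmer`.

    The enhancer length is unchanged; only len(kmer) nucleotides are swapped.
    """
    return enhancer[:start] + kmer + enhancer[start + len(kmer):]


def classify_variant(activity, wt_activity, tolerance=0.10):
    """Label a variant relative to wildtype.

    Assumption: the tolerance window is applied to linear-scale activity.
    'similar'  -> within +/- tolerance of wildtype
    'stronger' -> above the window
    'weaker'   -> below the window
    """
    lower = wt_activity * (1 - tolerance)
    upper = wt_activity * (1 + tolerance)
    if activity > upper:
        return "stronger"
    if activity >= lower:
        return "similar"
    return "weaker"
```

Calling `all_kmers(8)` yields the 4^8 = 65,536 sequences that make up one exhaustive N8 library; counting the `similar` and `stronger` labels across measured variants corresponds to the two solution classes reported for pos241.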
A) Schematic of STARR-seq for the analysis of random variants in an enhancer position: (1) a comprehensive library of sequence variants was generated by replacing the 8nt stretch overlapping a GATA TF motif in the strong ced-6 enhancer with all 65,536 possible 8nt sequences; (2) the enhancer activity of each variant was measured by STARR-seq in Drosophila S2 cells; (3) expected outcomes include the wildtype sequence (wt, blue), inactive variants (grey), and variants that recover the wildtype activity (green) or are even stronger (purple). B) Most sequence variants exhibit low activity levels. The distribution of enhancer activity for each of the 62,012 enhancer variants with confident activity is shown. The wildtype (wt, red) sequence, the strongest GATA variant in each orientation (blue) and the strongest sequence variant are highlighted, together with the number of variants that achieve similar activity to wildtype (+/-10%) or drive even higher activity. C) Strong sequence variants are highly diverse. Logos with nucleotide frequency of the most-active variants in STARR-seq (1, 2, 5, 10, 50, 100, 1,000 and all; red). These were compared with the same logos after randomly sorting the variants (grey). D) Sum of information content within the most-active 8-mers in STARR-seq (red) compared with the same after randomly sorting the variants (grey), for different numbers of top sequences. E) Distribution of enhancer activity for all 62,012 enhancer variants (left) or variants creating each TF motif (right). The activity of the wildtype sequence (wt, red dot and dashed line) and the median of all variants (grey dashed line) are shown. The string of each TF motif used for the motif matching and the number of variants matching each motif are given on the x-axis in the format “motif string (TF motif name, number of variants)”. F) Number of variants among the 600 stronger than wildtype that match motifs enriched in S2 developmental enhancers (PWM p-value cutoff 1e-04).
Although only a few hundred sequences functioned at this position, these were highly diverse (Fig 1C,D) and included not only different variants of the GATA motif (Fig 1B – in blue, and 1E,F) but also other TF motifs, such as SREBP and AP-1 (Fig 1E,F, S2A,B, S3A). Indeed, most of the 600 variants stronger than wildtype (94%) created TF motifs overrepresented in S2 developmental enhancers (PWM p-value 1e-04; Fig 1F, S3B), showing that there is flexibility in the DNA sequences but also in the motif types they encode. However, different TF motifs rescued enhancer activity to different levels (Fig 1E, S3A). While AP-1 and SREBP achieved similar activity to the wildtype GATA motif, twist and ETS had lower activity at this enhancer position, despite being generally associated with strong enhancer activity in S2 cells (de Almeida et al. 2022). Therefore, the observed sequence flexibility is constrained to some TF motifs. In addition, even within each TF motif not all specific sequence variants functioned similarly, as apparent in the large differences between their activities (Fig 1E).
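The diversity of the top-ranked variants (Fig 1C,D) is summarized as the sum of information content across the eight positions. As a hedged sketch, the standard per-position measure is $2 - H$ bits, where $H$ is the Shannon entropy of the nucleotide frequencies. The function name is hypothetical, and no small-sample correction or pseudocount is applied, so it may differ in detail from the computation behind Fig 1D.

```python
import math
from collections import Counter


def information_content(sequences):
    """Sum of per-position information content (bits) for equal-length DNA sequences.

    Each position contributes 2 - H bits, where H is the Shannon entropy of the
    observed nucleotide frequencies (2 bits is the maximum for a 4-letter alphabet).
    Assumption: no small-sample correction or pseudocounts.
    """
    length = len(sequences[0])
    total = 0.0
    for i in range(length):
        counts = Counter(seq[i] for seq in sequences)
        n = sum(counts.values())
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        total += 2.0 - entropy
    return total
```

A set of identical 8-mers gives the maximum of 16 bits, whereas a set in which all four nucleotides are equally frequent at every position gives 0 bits; the red and grey curves in Fig 1D fall between these extremes.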
We also observed TF motif types that had neutral or repressive functions at the tested 8nt position: the Dref motif, previously shown to be important only for housekeeping enhancers (Zabidi et al. 2015; de Almeida et al. 2022), had no activity in this ced-6 developmental enhancer, while the Ttk motif created the most inactive 8nt variants, consistent with Ttk’s function as a repressor (Fig 1E, S2C; Xiong and Montell 1993). These results show that this approach can comprehensively assess the activity of all random variants in a specific region of the enhancer and identify activating, neutral and repressive sequences. Moreover, our findings indicate that developmental enhancers exhibit constrained flexibility, in that many variants, but still a strongly restricted number, can function at a given enhancer position. This constrained sequence flexibility applies not only to individual DNA sequences but also to TF motif types, in that several different motif types work, but far from all.
Activity of random variants in seven specific positions of two different enhancers
To evaluate if the same principles and the same specific solutions apply at different enhancer positions, we selected three additional positions of the ced-6 enhancer and three positions of a strong enhancer in the ZnT63C locus (Fig 2A). To probe enhancer sequence flexibility at important motif positions and non-important control positions, we used the deep learning model DeepSTARR (Fig 2A; de Almeida et al. 2022) and previous experimental enhancer mutations (Fig S4F) to choose positions that should (ced-6 pos110, pos241; ZnT63C pos142, pos180, pos210) or should not (ced-6 pos182, pos230) be important for enhancer activity. We generated exhaustive libraries of all 8nt sequence variants for each position and performed UMI-STARR-seq on the combined libraries of each enhancer (Fig S4A-E; see Methods). As observed for the GATA position in Fig 1 (pos241), only a restricted set of variants achieved wildtype activity at a second important GATA motif position in the same enhancer (pos110) or at the important motif positions in the ZnT63C enhancer (Fig 2B), confirming that important positions in enhancers show constrained flexibility. This contrasted with the non-important positions (pos182 and pos230 of the ced-6 enhancer), where most sequence variants were active at wildtype levels or above (Fig 2B). Thus, the importance of an enhancer position reflects its constraint, with non-important positions not being constrained (while they can still be modulated positively or negatively).
A) DeepSTARR-predicted nucleotide contribution scores for the ced-6 (top) and ZnT63C (bottom) selected enhancer sequences. Selected 8nt motif positions and non-important control positions are highlighted in yellow with the respective numerical position, TF motif identity and different colors. B) Distribution of enhancer activity for all enhancer variants detected in each enhancer position. The activity of the wildtype sequence of each enhancer (wt, red dashed line) and of inactive sequences (grey dashed line) is highlighted, together with the activity of example sequence variants that create different TF motifs (ETS, GATA and SREBP; dots and connected lines). The number of variants tested in each position is shown on the x-axis, while the number of variants with higher activity than wildtype is shown on the top (grey, *). C) Heatmap of z-scores of log2 enhancer activity of 21,235 variants across all seven enhancer positions. Only variants assessed in all positions and active (z-score > 1) in at least one are shown. Variants were clustered using hierarchical clustering and their activity is colored in shades of red (activating) and blue (repressing). D) Heatmap of average z-scores of log2 enhancer activity of variants creating each TF motif type (y-axis) across all enhancer positions (x-axis; sorted as in (C)). Motif activity is colored in shades of red (activating) and blue (repressing). E) Distribution of z-scores of log2 enhancer activity for variants creating each of four TF motifs (AP-1, GATA, ETS, ttk) in two selected enhancer positions (ced-6 pos241 and ZnT63C pos180).
The most active sequences at each enhancer position were highly diverse and exhibited distinct nucleotide preferences (Fig S5, S6). For example, two positions located either in the ced-6 (pos110) or the ZnT63C (pos210) enhancer showed distinct preferences among the strongest 100 variants, which preferentially match to an SREBP (GTCAC[flanked by GTC]) or an ETS motif (CCGGA[A]), respectively (Fig S5B). These results show that different enhancer positions require different motif types and thus are under different constraints.
Different TF motif types are active at different enhancer positions
Comparing the activity of the 8nt sequence variants between the enhancer positions (scaled to the average activity of variants to be comparable across positions; see Methods) revealed that they indeed functioned differently at different positions (Pearson correlation coefficients (PCCs) below 0.4 between positions; Fig 2C, S7A-C). Further consolidating the 8nt into 6nt variants to reduce the impact of the surrounding sequence of each position (averaged activity across the flanking nucleotides) still showed similar results (Fig S7A,D). The top variants and solutions of each position differed substantially, with each position revealing specific sequences with particularly high activity, matching known TF motifs (Fig 2C). For example, an ETS motif variant was amongst the strongest sequences at ced-6 pos110 but not at pos241, a GATA variant was very active at ced-6 pos182 but inactive at pos230, and an SREBP variant was active in all positions of the ZnT63C enhancer except at pos210 (Fig 2B).
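The comparison between positions involves three operations: scaling activities within each position, correlating the shared variants, and collapsing 8-mers onto 6-mers. The sketch below shows one way these could be written. It assumes per-position dictionaries of log2 activity keyed by 8-mer, z-score scaling as in the Fig 2C caption, and a central 6-mer with one flanking nucleotide on each side; the exact consolidation used in Methods is not stated here, and all function names are hypothetical.

```python
import math
from collections import defaultdict


def zscores(log2_activity):
    """Scale a {kmer: log2 activity} dict to z-scores within one position."""
    values = list(log2_activity.values())
    mu = sum(values) / len(values)
    sd = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))
    return {kmer: (v - mu) / sd for kmer, v in log2_activity.items()}


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def position_correlation(pos_a, pos_b):
    """PCC between two positions, using only variants measured in both."""
    shared = sorted(set(pos_a) & set(pos_b))
    return pearson([pos_a[k] for k in shared], [pos_b[k] for k in shared])


def collapse_to_core(activity, flank=1):
    """Average activity over all k-mers sharing the same central core.

    Assumption: the 6-mer is the centre of the 8-mer (flank=1 on each side).
    """
    grouped = defaultdict(list)
    for kmer, value in activity.items():
        grouped[kmer[flank:len(kmer) - flank]].append(value)
    return {core: sum(vals) / len(vals) for core, vals in grouped.items()}
```

Because the PCC is invariant to shifting and rescaling each input, z-scoring does not change the correlation itself; it matters for displaying and clustering activities from different positions on a common scale.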
We next compared the activity of motifs between the seven positions of the two different enhancers, by consolidating the activity of all 8nt variants (+/-4nt flanks) creating each motif (Fig 2D,E, S8; see Methods). Importantly, for each position the wildtype sequence as well as different variants of that motif were among the top variants. While the repressor ttk motif repressed in all positions and showed little specificity (similar to other known and novel repressor motifs; Fig S9), the activator motifs showed distinct profiles, such as motifs that are globally active in all positions (AP-1), motifs with low activity in all tested positions (STAT, CREB and Trl) and motifs with highly context-dependent activities (GATA, twist, ETS and SREBP) (Fig 2D,E). For example, GATA was active at the ced-6 pos110 but not at the ZnT63C pos180 position, whereas ETS motifs showed the opposite profile, with the strongest activity at ZnT63C pos180 (Fig 2E). Interestingly, for GATA motifs we observed strong activity in all positions except at ced-6 pos230 and ZnT63C pos180, which are positioned close to another GATA motif (Fig 2A). This observation is in line with the previously observed negative interaction of GATA/GATA motif pairs at short distances (de Almeida et al. 2022) and suggests that the different activities of TF motifs at different enhancer positions depend on their interaction with other TFs and the sequence context.
In summary, testing thousands of random variants in different enhancer positions revealed that enhancer sequences display constrained flexibility, in that only a specific but still diverse set of sequences and TF motifs can function at a given position. However, importantly, these constraints and solutions differed between enhancer positions, with different TF motifs active at different positions, suggesting that their activity is modulated by the sequence context.
Systematic motif pasting shows that motifs work differently at different enhancer positions
To systematically test if and how the enhancer sequence context modulates the function of TF motifs, we selected eight TF motifs that showed distinct position-dependent preferences (GATA, Trl, SREBP, AP-1, Atf2, twist, Stat92E and ETS) and pasted their optimal sequences into 763 positions in a total of 496 enhancers (Fig 3A; see Methods). These positions were selected to be TF motifs important for the activity of the respective enhancers, as assessed by motif mutagenesis, allowing the reliable measurement of the increase in enhancer activity after pasting each TF motif (here quantified as the log2 fold-change activity over the motif-mutated enhancer). UMI-STARR-seq experiments with these designed libraries produced highly reproducible and quantitative enhancer activity measurements (replicates PCC between 0.95 and 0.98; Fig S10A,B). Disrupting the selected enhancer positions by shuffling the wildtype sequences substantially reduced the activity of the respective wildtype enhancers by an average of more than 6-fold, and pasting the different TF motifs in these same positions rescued enhancer activity to different levels (Fig S10C). Since for each TF motif we pasted the same optimal sequence into all positions, the differences in activity can only be explained by their respective sequence context; the differences between TF motifs are also directly comparable, since we pasted them in the same set of positions.
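The three sequence operations behind this design (disrupting a motif by shuffling, pasting a fixed motif sequence, and scoring the rescue as a log2 fold-change over the motif-mutated enhancer) can be sketched as follows. Function names and 0-based coordinates are assumptions for illustration; the actual optimal motif sequences and the shuffling procedure are described in the authors' Methods and are not reproduced here.

```python
import math
import random


def paste_motif(enhancer, start, motif):
    """Overwrite the enhancer with `motif` starting at 0-based index `start`."""
    if start < 0 or start + len(motif) > len(enhancer):
        raise ValueError("motif does not fit inside the enhancer")
    return enhancer[:start] + motif + enhancer[start + len(motif):]


def shuffle_stretch(enhancer, start, length, seed=0):
    """Disrupt a motif by shuffling its nucleotides in place.

    Nucleotide composition is preserved. A real design would presumably also
    verify that the shuffled stretch differs from the original and does not
    create a new motif; that check is omitted in this sketch.
    """
    stretch = list(enhancer[start:start + length])
    random.Random(seed).shuffle(stretch)
    return enhancer[:start] + "".join(stretch) + enhancer[start + length:]


def log2_fold_change(pasted_activity, mutated_activity):
    """Motif activity as log2 fold-change over the motif-mutated enhancer."""
    return math.log2(pasted_activity / mutated_activity)
```

Because the same motif sequence is pasted at every position, differences in `log2_fold_change` for one motif across positions can be attributed to the surrounding sequence rather than to the motif variant.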
A) Schematic of systematic motif pasting in different enhancer positions. Eight TF motifs that showed distinct position-dependent preferences were selected and their optimal sequence was pasted in 763 positions distributed among 496 enhancers, representing different contexts. The enhancer activity of each variant was measured by STARR-seq in Drosophila S2 cells to quantify the activity of motifs at the different positions. B) Distribution of enhancer activity changes (log2 FC to mutated sequence) across all enhancer positions for each pasted TF motif. C) Bar plots with activity (log2) of variants of three different enhancers with a mutated sequence (grey), a GATA (blue) or an ETS (brown) motif pasted at the same position. D) Heatmap of enhancer activity changes (log2 FC to mutated sequence) after pasting each of the eight selected TF motifs in 721 enhancer positions (positions with data for at least six motifs). TF motifs and positions were clustered using hierarchical clustering and the activity is colored in shades of red (activating) and blue (repressing); missing values are colored in grey. E) GATA and ETS motifs work differently at different enhancer positions. Comparison between enhancer activity changes (log2 FC to mutated sequence) after pasting GATA (x-axis) or ETS (y-axis) across all enhancer positions. Positions with stronger activity of GATA or ETS (>= 2-fold with respect to the other motif) are colored in blue and brown, respectively. Enhancer positions shown in (C) are highlighted. PCC: Pearson correlation coefficient.
Across all positions, TF motifs had different median activities, which we interpret as different intrinsic strengths, with SREBP, ETS and AP-1 being the strongest and Trl the weakest motifs (Fig 3B, S10C). However, the enhancer position had a large effect on motif activity, which differed more than 100-fold for the same motif (Fig 3B). For example, pasting a GATA motif increased enhancer activity more than 20-fold at 33 positions but not at all at 72 other positions. This position-dependency was particularly strong for Trl, Stat92E and GATA motifs, and weaker for AP-1, SREBP and ETS (Fig S10D), which all had higher intrinsic strengths. Additionally, each TF motif showed differential activity across enhancer positions and activated in a unique set of positions. For example (Fig 3C), GATA motifs activated enh1-pos168 but not enh2-pos68, while ETS showed the opposite effect, and both motifs activated enh3-pos135. The different TF motifs showed different activity profiles across all positions, as revealed by global comparisons and hierarchical clustering (Fig 3D, S11). For example, GATA showed differential activity from ETS (PCC=0.15; Fig 3E) or twist (PCC=0.33; Fig S11B), while others such as AP-1 and Atf2 showed more similar positional preferences (PCC=0.68; Fig S11). These results highlight the complexity of enhancer syntax and the difficulty of predicting and interpreting individual sequence manipulations.
The distinct preferences observed between pasted motifs were largely independent of the identity of the replaced wildtype motif across all positions, as revealed by the weak interaction scores between the wildtype and the pasted motif identity in a multivariate linear regression analysis of all motif-pasting experiments (< 1% explained variance, Fig S12). In contrast, the pasted motif identity (irrespective of the identity of the replaced motif) explains the most variance (23%), while 65% of the variance remains unexplained and is likely due to surrounding enhancer sequence features affecting the motifs’ activities. Thus, systematic pasting of TF motifs across hundreds of enhancer contexts shows that motifs have different intrinsic strengths but work differently at different enhancers and positions, suggesting that the enhancer sequence context constrains the activity of TF motifs.
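To make "explained variance" concrete: for a linear model with a single categorical predictor, such as the pasted motif identity, the fraction of variance explained equals the between-group sum of squares divided by the total sum of squares. The sketch below implements this one-factor simplification with a hypothetical function name. It is not the multi-term regression with interaction terms used for Fig S12, which partitions variance among several predictors at once.

```python
from collections import defaultdict


def variance_explained_by_group(groups, values):
    """Fraction of total variance captured by group means.

    Equivalent to R^2 of a linear model with one categorical predictor:
    between-group sum of squares / total sum of squares.
    """
    grand_mean = sum(values) / len(values)
    total_ss = sum((v - grand_mean) ** 2 for v in values)
    by_group = defaultdict(list)
    for group, value in zip(groups, values):
        by_group[group].append(value)
    between_ss = sum(
        len(vals) * (sum(vals) / len(vals) - grand_mean) ** 2
        for vals in by_group.values()
    )
    return between_ss / total_ss
```

Here `groups` would hold the pasted motif name for each experiment and `values` the corresponding log2 fold-changes. A result near 1 would mean motif identity alone determines activity, whereas the reported 23% leaves most of the variance to other factors.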
TF motifs have different intrinsic strengths that are modulated by the enhancer sequence context
The observed differential activities of motifs in different enhancer positions (Fig 3D) suggest that the enhancer sequence context modulates the function of TF motifs. We found no significant differences when comparing the motif activity between pairs of positions in the same enhancer or in different enhancers, suggesting that the local context immediately surrounding the motif is as important as enhancer identity (Fig S13).
More globally, the sequence context for a motif can be related to its position within the enhancer, the motif flanking sequence and the presence and distance to other motifs. To characterize the importance of these features, we tested if they contribute to the performance of predicting enhancer activity following the pasting of a motif at different enhancer positions. We first built a baseline random forest model that only includes the importance of the wildtype motif and the identity of the wildtype and pasted motifs as features, thereby not taking any sequence context features into account. This model obtained a PCC of 0.59 in the whole dataset using 10-fold cross-validation and showed that the pasted motif and the wildtype motif importance are strong determinants for enhancer activity (Fig S14A). Training a second random forest model that also includes context features such as the motif position relative to the enhancer center, the motif flanking sequence, and the presence and distance to other TF motifs, improved this performance to a PCC of 0.69 (Fig S14B). This shows that the enhancer sequence context, particularly the closest flanking nucleotides as well as the presence of other motifs at specific distances (e.g. GATA or ETS) have an impact on the activity of TF motifs (Fig S14B).
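The evaluation scheme used here, 10-fold cross-validation scored by the PCC between held-out predictions and observations, can be sketched independently of the model. In the sketch below a per-motif mean predictor stands in for the random forest. This substitution is ours, made only to keep the example dependency-free, and all function names are hypothetical.

```python
import math
import random
from collections import defaultdict


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def kfold_indices(n, k=10, seed=0):
    """Split indices 0..n-1 into k shuffled, non-overlapping folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validated_predictions(records, k=10, seed=0):
    """Out-of-fold predictions for records of (pasted_motif, log2_fc).

    Stand-in model (NOT a random forest): predict the training-set mean
    log2 FC of the same pasted motif, or the global training mean if the
    motif is absent from the training folds.
    """
    predictions = [None] * len(records)
    for test_idx in kfold_indices(len(records), k, seed):
        held_out = set(test_idx)
        train = [r for i, r in enumerate(records) if i not in held_out]
        global_mean = sum(v for _, v in train) / len(train)
        by_motif = defaultdict(list)
        for motif, value in train:
            by_motif[motif].append(value)
        for i in test_idx:
            vals = by_motif.get(records[i][0])
            predictions[i] = sum(vals) / len(vals) if vals else global_mean
    return predictions
```

Every prediction comes from a model that never saw that record, so `pearson(predictions, observed)` over the pooled predictions estimates performance on unseen positions. Swapping the stand-in for a model that also receives context features is what would reveal a gain such as the reported increase from 0.59 to 0.69.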
To better characterize the importance of these sequence rules for each TF motif separately, we generated interpretable linear models based on these rules to predict the motif activities across all positions (Fig 4A). These models were able to predict the motif pasting results, with PCCs to experimentally assessed log2 fold-changes between 0.39 (ETS) and 0.64 (Stat92E) (Fig 4A, S15). The motif flanks and the presence of additional motifs explained on average 16.7% and 6.7% of the motif activities variance, respectively, while the motif position within the enhancer had lower importance (0.4%).
A) Motif syntax rules modulate TF motif function. For each TF motif type (rows), a linear model was built to predict its activity across all enhancer positions, using as covariates the number of instances, the wildtype TF motif importance and identity, and sequence context features such as the position within the enhancer, the flanking nucleotides, and the presence of all other TF motifs at close or distal distances. The PCC between predicted and observed motif activities is shown with the green color scale on the left. Heatmap shows the contribution of each feature (columns) for each model, colored by the FDR-corrected p-value (red or blue scale depending on positive or negative association, respectively). B,C) Syntax features associated with GATA (B) or ETS (C) activity. Left: bar plot showing the variance explained by the different types of features (color legend) for each of the linear models. Middle-left: motif activity according to the different bases at each flanking position, colored by nucleotide identity. Middle-right and right: enhancer activity changes (log2 FC to mutated sequence) after pasting each TF motif in positions with no additional GATA (middle-right) or ETS (right) in the enhancer, or with additional GATA or ETS at close (<= 25 bp) or distal (>25 bp) distances. The number of instances is shown.
The TF motif type-specific models revealed how the sequence context rules differ between TF motif types, explaining the motif-specific enhancer position preferences. For example, GATA activity was strongly dependent on the flanking nucleotides and was modulated by the presence of a second GATA at close distance (negative interaction) or ETS motifs (positive interaction) (Fig 4B). We saw different associations for ETS activity, as expected from the different GATA and ETS activity profiles across all positions (Fig 3E). ETS activity was only mildly influenced by the flanking nucleotides but strongly by neighboring motifs: it was stronger close to GATA motifs and weaker in enhancers with another ETS motif (Fig 4C). These sequence features, such as the negative GATA/GATA and the positive ETS/GATA interactions at close distances, were observed previously via computational models of wildtype S2 enhancer sequences (de Almeida et al. 2022).
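The context features entering these models can be illustrated with a short sketch that extracts flanking nucleotides and classifies a neighboring motif as absent, close or distal. The 25 bp cutoff follows the Fig 4 legend. The number of flanking nucleotides, the use of motif centers for distances, and the rule that a close neighbor takes precedence over a distal one are assumptions, and the function names are hypothetical.

```python
def flanking_nucleotides(enhancer, start, motif_length, n_flank=2):
    """Return (upstream, downstream) flanks of a motif at [start, start + motif_length).

    Assumption: n_flank nucleotides on each side; flanks are truncated at the
    enhancer ends.
    """
    upstream = enhancer[max(0, start - n_flank):start]
    end = start + motif_length
    downstream = enhancer[end:end + n_flank]
    return upstream, downstream


def neighbor_category(pasted_center, other_centers, close_cutoff=25):
    """Classify the nearest other motif of a given type relative to the pasted motif.

    'none'   -> no other instance in the enhancer
    'close'  -> nearest instance within close_cutoff bp (<= 25 bp by default)
    'distal' -> all instances farther than close_cutoff bp
    Assumption: distances are measured between motif centers.
    """
    if not other_centers:
        return "none"
    nearest = min(abs(center - pasted_center) for center in other_centers)
    return "close" if nearest <= close_cutoff else "distal"
```

One categorical feature per neighboring motif type (for example one for GATA and one for ETS), together with one feature per flanking position, gives a design matrix in which each coefficient of a linear model can be read as a syntax rule.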
In addition, analyzing the DeepSTARR-predicted importance of each nucleotide when pasting different TF motifs at the same position revealed their interaction with the sequence context (Fig S16): GATA but not ETS activated the chr3L enhancer in a position with additional distal GATA motifs (note the increased weights for surrounding twist and AP-1 motifs when pasting GATA; Fig S16A,B), while ETS but not GATA activated the chrX enhancer in a position with a GATA motif at close distance (note the increased weights for the downstream GATA only when pasting ETS; Fig S16C,D), and both activated the chr2L enhancer that contains multiple surrounding twist motifs (Fig S16E,F). Together, these results demonstrate how the sequence context (e.g. the flanking sequence, the presence and diversity of other motif types) modulates the function of TF motifs, constraining enhancer sequence flexibility.
Enhancer sequence context modulates the function of human TF motifs
To test whether TF motifs also work differently in different enhancer sequence contexts in other species, we performed the systematic motif pasting experiment in human HCT116 cells for eight previously characterized human TF motifs (P53, AP-1, ETS, CREB1, MAF, EGR1, E2F1 and MECP2; see Methods; (de Almeida et al. 2022)). Pasting of the motifs into 1,354 important positions in 753 different HCT116 enhancers revealed that human TF motifs also have different intrinsic strengths and work differently in different enhancers and positions (Fig 5A, S17). P53 was the strongest motif and the only one that showed globally strong activity across all enhancer positions, suggesting little dependence on the enhancer context, as has been suggested before (Verfaillie et al. 2016). AP-1, the second strongest motif, was strongly dependent on the enhancer positions, with activities ranging more than 50-fold across enhancer contexts. This position-dependence was also observed for the other motifs, even though their overall activity was lower (Fig 5A).
A) Distribution of enhancer activity changes (log2 FC to mutated sequence) across all enhancer positions for each pasted TF motif. B) Heatmap of enhancer activity changes (log2 FC to mutated sequence) after pasting each of the eight selected human TF motifs in 1,052 enhancer positions (positions with data for at least six motifs). TF motifs and positions were clustered using hierarchical clustering and the activity is colored in shades of red (activating) and blue (repressing); missing values are colored in grey. C) Human TF motifs work differently at different enhancer positions. Comparison between enhancer activity changes (log2 FC to mutated sequence) after pasting AP-1 (x-axis) and P53 (top) or ETS (bottom) (y-axis), across all enhancer positions. Positions with stronger activity of one motif (≥ 2-fold relative to the other motif in the scatter plot) are colored (P53: green, AP-1: purple, ETS: brown). PCC: Pearson correlation coefficient. D) TF motif activity as a function of wildtype and pasted motif identity. Left: Bar plot showing the amount of variance explained by the wildtype motif importance and identity, the pasted motif identity and the interaction between the wildtype and pasted motifs, using a linear model fit on all motif pasting results. Right: Scatter plots of predicted (linear model) vs. observed enhancer activity changes (log2 FC to mutated sequence) across all motif pasting experiments. Color reflects point density. E) Motif syntax rules modulate the function of human TF motifs. For each TF motif type (rows), we built a linear model to predict its activity across all enhancer positions, using as covariates the number of instances, the wildtype TF motif importance and identity, and sequence context features such as the position within the enhancer, the flanking nucleotides, and the presence at close or distal distances of all other TF motifs. The PCC between predicted and observed motif activities is shown with the green color scale on the left. Heatmap shows the contribution of each feature (columns) for each model, colored by the FDR-corrected p-value (red or blue scale depending on positive or negative association, respectively).
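The activity measure used throughout these panels (log2 FC to mutated sequence) and the ≥ 2-fold criterion of panel C can be illustrated with a minimal sketch. The function names and all activity values below are hypothetical and chosen only for illustration; they are not data from this study.

```python
import math

def log2_fc(pasted_activity, mutated_activity):
    """log2 fold change of the motif-pasted enhancer over its mutated control."""
    return math.log2(pasted_activity / mutated_activity)

def preferred_motif(log2_fc_a, log2_fc_b, min_fold=2.0):
    """Return 'a' or 'b' if that motif is at least `min_fold` stronger, else None.

    A fold difference in activity becomes a difference on the log2 scale,
    so ">= 2-fold stronger" means "log2 FC larger by at least 1".
    """
    threshold = math.log2(min_fold)
    diff = log2_fc_a - log2_fc_b
    if diff >= threshold:
        return "a"
    if -diff >= threshold:
        return "b"
    return None

# Hypothetical activities (arbitrary units) per enhancer position:
# (mutated control, motif a pasted, motif b pasted)
positions = {
    "pos1": (1.0, 8.0, 2.0),
    "pos2": (1.0, 2.0, 16.0),
    "pos3": (2.0, 6.0, 8.0),
}

calls = {}
for name, (mutated, with_a, with_b) in positions.items():
    calls[name] = preferred_motif(log2_fc(with_a, mutated), log2_fc(with_b, mutated))
```

In this toy example, `pos1` is assigned to motif a, `pos2` to motif b, and `pos3` to neither, because the two motifs differ by less than 2-fold at that position.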
TF motifs preferred different enhancer contexts, with four groups of motifs showing characteristically different preferences: (1 – P53) strong activity in all positions; (2 – CREB1, AP-1, MAF, EGR1) and (3 – ETS) highly context-dependent activities; (4 – MECP2, E2F1) only active in few and highly specific enhancer positions (Fig 5B,C, S18). These distinct preferences were independent of the identity of the replaced motif (Fig 5D, S19) but correlated with sequence context features. Similar to Drosophila TF motifs, motif context features such as motif flanks and the presence and distance to other TF motifs were important to predict the activities of human motifs across the different enhancer positions (Fig S20). TF-specific linear models based on such syntax features were able to predict the motif activities across all positions (PCCs between 0.46 and 0.51; Fig S21) and revealed the context preferences of each TF motif (Fig 5E).
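As a rough illustration of how such a TF-specific linear model could be fit and then scored by the PCC between predicted and observed activities, consider the sketch below. It is not the authors' implementation: the three binary features, the toy activity values and the helper names (`fit_ols`, `predict`, `pearson`) are assumptions made for this example, and the fit uses plain least squares via the normal equations.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def fit_ols(rows, y):
    """Ordinary least squares via the normal equations; intercept is coefs[0]."""
    X = [[1.0] + list(r) for r in rows]
    p = len(X[0])
    # Augmented matrix [X^T X | X^T y]
    A = [[sum(xi[i] * xi[j] for xi in X) for j in range(p)]
         + [sum(xi[i] * yi for xi, yi in zip(X, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]

def predict(coefs, rows):
    """Apply fitted coefficients to feature rows."""
    return [coefs[0] + sum(c * v for c, v in zip(coefs[1:], r)) for r in rows]

# Hypothetical context features per position:
# (ETS motif nearby, additional AP-1 instance, flank starts with G)
features = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 0), (1, 0, 1)]
# Hypothetical observed motif activities (log2 FC to mutated sequence)
observed = [1.1, 2.9, 1.4, 0.1, 3.6, 1.9]

coefs = fit_ols(features, observed)
pcc = pearson(predict(coefs, features), observed)
```

The sign of each fitted coefficient plays the role of the positive or negative association shown in the heatmap, while `pcc` summarizes how well the context features predict the motif activity. A real analysis would evaluate the model on held-out positions rather than on the training data, as done here for brevity.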
All motif activities were influenced by the flanking nucleotides, which explained on average 8.2% of the variance in motif activity, while the presence of additional motifs and their distance explained 8.5% (Fig 5E, S21, S22). As expected from the weak context specificity of P53 (group 1, Fig 5A), its activity was independent of the presence and distance of other TF motifs (Fig 5E, S22A). All the other motifs preferred contexts with an additional AP-1 instance (Fig 5E). The AP-1 motif itself, as well as MAF, CREB1 and EGR1 (group 2), all preferred positions close to an ETS motif, concordant with previous studies showing direct protein–protein interactions between ETS and other TFs (Li et al. 2000; Burda et al. 2010), while the ETS motif (group 3) had a negative interaction with a second close ETS motif (Fig 5E), as also observed in Drosophila enhancers (Fig 4A). These findings are also concordant with the motif syntax rules found in a previous study (de Almeida et al. 2022). Altogether, this establishes that TF motifs require specific enhancer sequence contexts in species as divergent as fly and human, suggesting that this is a general principle of regulatory enhancer sequences.
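To make the notion of "variance explained" by a context feature concrete, the following sketch computes the fraction of the total variance in motif activity that is attributable to a single categorical feature, such as the identity of a flanking nucleotide. This corresponds to the R² of a linear model with only that feature and is a deliberate simplification of the multi-covariate models used here. The function name and the toy values are assumptions for illustration.

```python
from collections import defaultdict

def variance_explained(groups, values):
    """Fraction of variance in `values` explained by a categorical feature.

    Computed as between-group sum of squares over total sum of squares,
    i.e. the R^2 of a one-way linear model with `groups` as the only covariate.
    """
    n = len(values)
    grand_mean = sum(values) / n
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    by_group = defaultdict(list)
    for g, v in zip(groups, values):
        by_group[g].append(v)
    ss_between = sum(
        len(vs) * (sum(vs) / len(vs) - grand_mean) ** 2
        for vs in by_group.values()
    )
    return ss_between / ss_total

# Hypothetical flanking nucleotide and motif activity (log2 FC) per position
flank = ["A", "A", "C", "C", "G", "G", "T", "T"]
activity = [1.0, 1.2, 0.8, 1.0, 2.0, 2.2, 0.9, 1.1]

fraction = variance_explained(flank, activity)
```

The result lies between 0 (the feature carries no information about activity) and 1 (the feature fully determines activity).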
Discussion
In this study, we used two complementary strategies to explore the flexibility of enhancer sequences with regard to nucleotide and motif identity at specific enhancer positions, as well as the position dependence of motif activity. Even though median enhancer activity drops significantly when randomizing an 8-nt stretch at important enhancer positions, many sequence variants, including both variants of the wildtype motif and other TF motifs, can still achieve strong enhancer activity. The diverse set of solutions found at each position shows that enhancers exhibit some degree of flexibility. However, as demonstrated by the fact that only a few hundred out of the 65,536 possible sequences work, the flexibility at any given position is constrained by the enhancer context and determined by syntax rules. Similarly, systematically pasting different motifs into hundreds of enhancer positions revealed that motif activity is strongly modulated by the enhancer sequence context: not every motif can functionally replace every other motif, nor can all motifs function at all positions. Therefore, constrained sequence flexibility and the modulation of motif function by the sequence context seem to be key features of enhancers.
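The size of the randomized library follows directly from the four possible nucleotides at each of eight positions (4^8 = 65,536). A minimal sketch of how such an exhaustive variant set could be enumerated for one enhancer position is given below; the enhancer sequence, the chosen start position and the function names are made up for illustration and do not correspond to any enhancer used in this study.

```python
from itertools import product

def all_kmers(k=8, alphabet="ACGT"):
    """Enumerate every possible k-mer over the given alphabet."""
    return ["".join(p) for p in product(alphabet, repeat=k)]

def replace_window(sequence, start, kmer):
    """Replace len(kmer) nucleotides of `sequence`, beginning at `start`."""
    if start < 0 or start + len(kmer) > len(sequence):
        raise ValueError("window does not fit inside the sequence")
    return sequence[:start] + kmer + sequence[start + len(kmer):]

# Made-up 34-nt sequence standing in for an enhancer
enhancer = "TTGACGTCAGATAAGCCTGAGTCATCGATCGGAT"
window_start = 10  # hypothetical start of the important 8-nt stretch

variants = [replace_window(enhancer, window_start, kmer) for kmer in all_kmers()]
```

The list contains one variant per possible 8-mer, including the wildtype sequence itself, so every variant's activity can be compared against the same fixed flanking context.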
The observation that both Drosophila and human TF motifs require specific enhancer sequence contexts suggests that this is a general principle of regulatory enhancer sequences. Even though motifs possess some intrinsic strength, this potential to activate transcription strongly depends on the enhancer context and follows certain syntax rules, including motif flanks, combinations and distances. While the motif flanking sequence can influence TF binding affinity via changes in DNA shape (Mathelier et al. 2016; Dror et al. 2015), inter-motif distances can impact the synergy between TFs at the level of DNA binding or after binding, such as cofactor recruitment and activation (Reiter et al. 2017). Although these rules are stricter for some TF motifs (e.g. GATA) and more relaxed for others (e.g. P53), motifs are not simply independent modules but interact with all enhancer features in a highly cooperative manner, which can modulate motif activity by more than 100-fold. This is an important result that supports a model where enhancer activity is encoded through a complex interdependence between motifs and context, rather than TF motifs acting independently and additively as the billboard model would suggest (Kulkarni and Arnosti 2003; Arnosti and Kulkarni 2005). While tissue- or cell-type-specificity can already be predicted by motif presence-absence patterns alone (Kvon et al. 2014; Janssens et al. 2022), the encoding of different enhancer strengths depends on more complex cis-regulatory syntax rules (de Almeida et al. 2022; Jindal and Farley 2021). Mutations in TF motifs and changes in the enhancer sequences can therefore only be understood in the context of these syntax features.
The motif syntax rules described here (Fig 4, 5E), such as the interaction between motifs and their distances, agree well with the ones learned by the DeepSTARR deep learning model trained on genome-wide enhancer activity data (de Almeida et al. 2022), showing that these rules are present and important in wildtype enhancer sequences. Indeed, DeepSTARR also predicted with good accuracy the activity of all randomized sequence variants and of motifs pasted in different enhancer contexts (Fig S23, S24). This supports the validity of computational models such as DeepSTARR and their use in in silico experiments (e.g. motif pasting experiments with a larger set of TF motifs across many more genomic positions) to improve our understanding of the regulatory information encoded in enhancer sequences and the impact of mutations.
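An in silico motif pasting experiment of this kind can be sketched as follows. The sketch assumes only that some sequence-to-activity predictor is available as a callable. The `toy_predict` function below is a placeholder that counts motif occurrences; it is not DeepSTARR and does not reflect its interface. The motif strings are illustrative consensus-like sequences, and the background sequence and neutral filler are made up.

```python
import math

def paste(sequence, start, insert):
    """Overwrite len(insert) nucleotides of `sequence` beginning at `start`."""
    if start < 0 or start + len(insert) > len(sequence):
        raise ValueError("insert does not fit inside the sequence")
    return sequence[:start] + insert + sequence[start + len(insert):]

def in_silico_pasting(sequence, positions, motifs, predict, neutral):
    """log2 FC of predicted activity, motif-pasted vs. neutral control.

    `predict` is any callable mapping a DNA string to a positive activity;
    `neutral` is a filler sequence used to build the mutated control.
    """
    results = {}
    for name, motif in motifs.items():
        for pos in positions:
            pasted = paste(sequence, pos, motif)
            control = paste(sequence, pos, neutral[:len(motif)])
            results[(name, pos)] = math.log2(predict(pasted) / predict(control))
    return results

def toy_predict(seq):
    """Placeholder scorer (an assumption, NOT a trained model)."""
    return 1.0 + 2.0 * seq.count("GATAAG") + 1.0 * seq.count("TGAGTCA")

background = "CCTT" * 8                          # made-up 32-nt sequence
motifs = {"GATA": "GATAAG", "AP-1": "TGAGTCA"}   # illustrative motif strings
results = in_silico_pasting(background, [4, 20], motifs, toy_predict, "CCTTCCTT")
```

Because the placeholder scorer ignores position and context, every position yields the same value for a given motif. A context-aware model substituted for `toy_predict` would be expected to return position-dependent values, which is the property such an experiment is designed to probe.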
Our study shows that enhancer sequences are flexible enough for enhancer strength to be achieved by a small yet diverse set of sequence variants, and that mutations in information-poor positions have little impact on the enhancer activity in a single cell type. This flexibility, which allows many different sequences to achieve similar enhancer activities in a single cell type, might be an important prerequisite for the evolution of developmental enhancers that operate under many additional constraints, e.g. regarding the precise spatio-temporal control of enhancer activities. Given that the activity in a given cell can be achieved by many solutions, the specific solutions that fulfill additional requirements can be explored during evolution. Indeed, previous studies that have analyzed expression changes of enhancer mutations across different cell types in vivo have observed that the cell type-specific expression patterns of enhancers can change upon (minimal) sequence perturbations (Farley et al. 2015; Galupa et al. 2022; Fuqua et al. 2020). The fact that enhancer strength in any given cell type and specificity across cell types and developmental time are subject to different sequence constraints highlights the complexity of the regulatory code and the challenges faced when trying to dissect it. We expect that the combination of quantitative enhancer-sequence-to-function models in individual cell types with qualitative predictions of enhancer activities across cell types will, over the coming years, provide unprecedented progress in our understanding of enhancer biology and our ability to read and write enhancer sequences.
