Abstract
Chromatin states, fundamental to gene regulation and cellular identity, are defined by a unique combination of histone post-translational modifications. Despite their importance, comprehensive patterns within chromatin state sequences, which could provide insights into key biological functions, remain largely unexplored. In this study, we introduce ChromBERT, a BERT-based model specifically designed to detect distinct patterns of chromatin state annotation data sequences. Notably, ChromBERT was pre-trained on promoter regions across a diverse range of epigenomes and subsequently fine-tuned using a dataset from multiple cell lines where RNA-seq data were available, highlighting the model’s ability to discern conserved chromatin state patterns within these regions. In addition to its predictive powers across tasks, evidenced by high AUC scores, ChromBERT provides further analysis through the incorporation of motif clustering using Dynamic Time Warping (DTW). This method enhances the model’s ability to dissect chromatin state sequence motifs, typically involving transcription and enhancer sites. The introduction of motif clustering with DTW into ChromBERT’s workflow is poised to facilitate the discovery of genomic regions linked to novel biological functions, deepening our understanding of chromatin state dynamics.
Background
Understanding chromatin organization is essential for revealing the complex mechanisms that govern gene regulation and function in the human genome [1], [2]. Chromatin states, defined by the combined features of histone modifications, including the involvement of DNA methylation and acetylation, are associated with specific functions such as gene activation, repression, or structural organization [3], [4]. For instance, the presence of active chromatin states, marked by histone modifications such as H3K4me3 (trimethylation of lysine 4 on histone H3) and H3K27ac (acetylation of lysine 27 on histone H3), is generally linked to gene activation and transcriptional initiation [5], [6]. A deeper understanding of chromatin states and their genomic distribution has been facilitated by chromatin state annotation tools such as ChromHMM [7] and Segway [8], coupled with the advent of next-generation sequencing (NGS) technologies [9], [10], [11]. Such annotation tools systematically learn and characterize combinatorial patterns of histone modifications as distinct chromatin states, such as “active promoters” and “genic enhancers.” These algorithms process genome-wide datasets generated from chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) experiments and infer the most likely chromatin state at each genomic locus.
Large databases of epigenome information have been generated by international consortia, such as ROADMAP [12], ENCODE [13], and IHEC [14]. For instance, the ROADMAP Epigenomics Project characterized and annotated the chromatin state of 127 distinct human cell and tissue types. The databases can be used to perform various downstream analyses to decipher the hidden intricacies of the chromatin state landscape. As one such attempt, ChromDiff [15] compared a summarized set of chromatin states across different epigenomes, offering new insights into tissue-specific or age-dependent chromatin alterations. More recently, CSREP [16], a framework for deriving representative chromatin state maps for groups of samples, was developed by Vu et al. Utilizing an ensemble of multi-class logistic regression classifiers, this method enables the summarization of chromatin states for different groups of epigenomes with high resolution. Additionally, Epilogos aims to visualize the conserved chromatin states across many epigenomes based on the surprisal score measured from the rarity of the chromatin state distribution [17]. A similar method, established with a different metric, the conservation-associated activity score, was suggested by Libbrecht et al [18] as a novel annotation strategy that has the advantage of detecting more biologically functional regions.
The availability of chromatin state annotations for a large dataset and the precedent related studies have paved the way for the next important step: identifying patterns of chromatin state motifs. A ‘motif’ refers to a recurring sequence that possesses a biological significance, often associated with specific functions such as binding sites in DNA sequences [19]. In the context of our study, we extend this concept to chromatin state sequences to identify distinct patterns of chromatin states that could be key to understanding gene regulation. The significance of chromatin state motifs extends to their potential role in dictating the structural and functional landscape of the genome, influencing the transcriptional output and cellular identity. By mapping these motifs, we can begin to decode the epigenetic language that modulates gene expression patterns across different cell types and developmental stages, offering insights into the molecular basis of diseases and identifying new targets for therapeutic intervention. However, due to their inherent complexity, these chromatin state patterns may not be fully captured by traditional computational methods, such as k-mer-based motif discovery algorithms [20], [21], which often rely on simplistic representations of DNA sequence motifs and do not consider the combinatorial nature of histone modifications and other chromatin-associated marks. This underscores the need for innovative strategies that can accurately decipher the motifs of chromatin states and reveal their functional implications in gene regulation and chromatin organization.
In this study, we introduce ChromBERT, which is specifically tailored for the discovery of chromatin state motifs using the Bidirectional Encoder Representations from Transformers (BERT) model [22]. BERT has proven highly effective in natural language processing tasks and provides an important way to analyze sequential patterns in biological data [23]. While DNABERT has been proposed for the analysis of DNA sequences [24], ChromBERT extends this concept to chromatin state-annotated human genome sequences and combines it with Dynamic Time Warping (DTW) [25]. Using the chromatin state annotation data for 127 distinct human cell and tissue types obtained from ROADMAP, ChromBERT uncovered previously unrecognized patterns of chromatin states. It has the potential to contribute to a deeper understanding of gene regulation and its implications for human health and disease. The source code of ChromBERT is available at https://github.com/caocao0525/ChromBERT.
Results
ChromBERT framework
Figure 1 provides an overview of the ChromBERT framework, illustrating the key components and workflow involved in the study. Initially, numerically annotated chromatin state labeling data from ROADMAP, based on a combination of core histone modifications, are converted into an alphabetic code ranging from A to O. This process is illustrated in Fig. 1(a), where chromatin state sequences are depicted with alphabetical labels after preprocessing by ChromBERT. In the training phase, as shown in Fig. 1(b), these chromatin state sequences are tokenized and fed into a BERT model for embedding. The comprehensive workflow of ChromBERT is captured in Fig. 1(c), starting with the preprocessing of chromatin state sequence data. ChromBERT is then pretrained to learn the general patterns of chromatin state placement. Following this, it is fine-tuned on binary classification data, for example to distinguish regions with complex chromatin state patterns, high gene expression, or promoter status from regions with less complex chromatin state patterns, low gene expression, or non-promoter status. The motifs identified from the targeted classes are then clustered using DTW to discern and interpret the representative motifs for specific regions of interest. See the Methods section for further technical details.
(a) Conversion of numerical chromatin state annotations (from a BED file) into alphabetical encoding, followed by concatenation for generating a sequence of chromatin states (upper). The alphabetical encoding for each chromatin state (lower). (b) Creation of input embeddings for the BERT model through the combination of token and positional embeddings. (c) The comprehensive process of ChromBERT, from pretraining on general data to fine-tuning for task-specific applications.
Categorization of complex genic regions and less-complex genic regions based on chromatin state switching frequency
To classify genic regions by the composition of their chromatin states, we first fine-tuned ChromBERT on two distinct genic region types: complex and less-complex. The complexity of a genic region was determined from the frequency of annotated chromatin state switching within the region (see Methods). Fine-tuning for this task was carried out after pretraining on the entire set of genic regions, concatenated across 10 different epigenomes. For pretraining, whereas DNABERT [24] used k-mers of length 3 to 6 to capture longer-range sequence patterns, we chose 4-mers for ChromBERT because the alphabet of chromatin states (15 characters) is substantially larger than that of DNA (4 characters; see Methods).
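The switching-frequency idea can be illustrated with a minimal sketch. The function name `count_switches` and the example strings are hypothetical, and the exact complexity criterion used in this study is the one given in Methods, not the one shown here.

```python
def count_switches(states: str) -> int:
    """Count positions where the chromatin state differs from the preceding 200-bp bin."""
    return sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)


# Hypothetical genic regions written in the A-O encoding.
complex_like = "AABBDDDEEFFEED"   # many state changes
quiescent_like = "OOOOOOOOOOOOOO"  # a single persistent state

print(count_switches(complex_like), count_switches(quiescent_like))
```

A threshold on this count, possibly normalized by region length, would then separate complex from less-complex regions; the threshold itself is not specified in this section.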
Figure 2(a) shows examples of complex and less-complex genic regions represented in chromatin state annotation data. Typically, over 50% of complex genic regions are annotated as either weak transcription (E), strong transcription (D), or Active Transcription Start Site (TSS) (A). Conversely, more than 50% of the genic regions categorized as less-complex are annotated as quiescent or having low signal (O) (Supplementary Fig. S1). Because longer genes can accommodate more chromatin states, there is a positive correlation between gene length and the number of chromatin state changes, as shown in Fig. 2(b) for a representative cell type. Note, however, that the complexity of chromatin states assigned to genes is negatively related to gene length (Supplementary Fig. S2). The outcomes of pretraining across all genic regions are detailed in Fig. 2(c). The perplexity decreases over pretraining iterations, eventually reaching approximately 1.13 for both the 3-mer and 4-mer cases, implying that the general placement of chromatin state annotations can be learned effectively. The relatively low initial value of perplexity is thought to stem from the persistent nature of chromatin states, which tend to remain consistent over extended regions of DNA, thereby simplifying the pretraining process.
(a) Examples of chromatin state sequences found in complex genic and less complex genic regions. (b) Correlation between the number of switches in chromatin state in genic regions and gene length. (c) Results of pretraining for genic regions using 10 different epigenomes. (d) Confusion matrix obtained from the binary classification of chromatin state patterns between complex genic regions and intergenic regions. (e) Example motifs found in a complex genic region compared to intergenic regions. The colored blocks represent the attention scores of the chromatin state, with motifs presented in bold and large font.
When we evaluated 1,000 random test instances—comprising 502 chromatin state segments from complex genic regions and 498 from less-complex genic regions, sourced from a sample cell type (E003, H1 cell line)—ChromBERT effectively predicted each case, achieving an F1 score of 0.98 (Fig. 2(d)). For this case, the attention score matrix exhibited relatively high scores towards the beginning, which is likely due to the location of the TSS in the initial segment of the genic region. This interpretation is further supported by similar trends observed in other classification tasks, such as those distinguishing genic regions from intergenic regions and complex genic regions from intergenic regions (Supplementary Fig. S3).
However, following the motif finding algorithm used in DNABERT [24] (see Methods), no distinct motifs in the complex genic regions were detected when compared to the less-complex regions. Upon relaxing the conditions for motif detection—defining high attention scores as values exceeding the third quartile without a minimum condition—only a series of active TSS states, represented as ‘AAA,’ emerged as a motif. This suggests no distinctive chromatin state patterns in complex genic regions compared to less-complex ones. When comparing complex genic regions with intergenic regions, several simple motifs, such as ‘AAAAB,’ ‘EEDDD,’ ‘EEEEE,’ and ‘DDDDD,’ were detected using the standard motif detection algorithm. Because ‘D’ and ‘E’ respectively represent strong and weak transcriptions, this outcome is consistent with the understanding that genic regions are defined as ‘transcribed’ regions. The example motifs found in complex genic regions, along with their attention scores, are shown in Fig. 2(e). Note that not all areas with high attention scores qualify as motifs. For an area to be considered a motif, it is required to appear at least three times (see Methods).
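The relaxed criterion described above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the `min_len` default, and the toy attention values are hypothetical, and the sketch omits the statistical filtering used in the full motif-finding procedure (see Methods). It marks positions whose attention score exceeds the third quartile, extracts contiguous runs, and keeps runs that occur at least three times.

```python
from collections import Counter
from statistics import quantiles


def high_attention_runs(states, scores, min_len=3):
    """Return substrings of `states` spanning contiguous positions with score above Q3."""
    q3 = quantiles(scores, n=4)[2]  # third quartile of this region's scores
    runs, start = [], None
    for i, score in enumerate(scores):
        if score > q3:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_len:
                runs.append(states[start:i])
            start = None
    if start is not None and len(scores) - start >= min_len:
        runs.append(states[start:])
    return runs


def candidate_motifs(regions, min_len=3, min_count=3):
    """Keep high-attention runs that appear at least `min_count` times across regions."""
    counts = Counter()
    for states, scores in regions:
        counts.update(high_attention_runs(states, scores, min_len))
    return {motif: n for motif, n in counts.items() if n >= min_count}


# Toy input: three regions whose attention peaks over a run of active TSS states.
toy_region = ("OOOAAAOOOOOO", [0.1] * 3 + [0.9] * 3 + [0.1] * 6)
print(candidate_motifs([toy_region] * 3))
```

Because the threshold is a per-region quartile, a run is only reported when it stands out from the rest of the same region, which is one reason that areas with high attention do not always become motifs.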
Classification of highly-expressed genic regions and non-expressed genic regions
In another task, we used ChromBERT to categorize genic regions by expression level. Because this task involved the same genic regions, we reused the parameters obtained from pretraining on all genic regions, which was originally conducted for distinguishing complex from less-complex genic regions. We defined genic regions with RPKM values exceeding 10, 20, 30, and 50 as ‘highly expressed’ and fine-tuned the model pairwise for each threshold against ‘expressed genic regions (RPKM>0)’ and ‘non-expressed genic regions (RPKM=0)’.
Figure 3(a) displays the prediction performance for each fine-tuning task, reported as the average accuracy, AUC (Area Under the Curve), and F1 score at the end of training. Although the differences among the classification tasks are slight, performance increased with the expression threshold. For example, the AUC averaged over the final ten data points was 0.77 for the classification between RPKM>10 and RPKM>0, whereas it was 0.895 for the classification between RPKM>50 and RPKM>0. Among all comparisons, the classification between RPKM>30 and RPKM=0 yielded the highest AUC averaged over the final ten data points, 0.904.
(a) A graph of average AUC, F1, and accuracy scores from the fine-tuning stage of various binary classifications of genic regions at different gene expression levels. (b) A partial attention score matrix representing a case of genic regions classification between RPKM>30 (‘highly expressed’) and RPKM>0 (‘expressed’), with rows indicating the number of each genic region segment and columns showing the position of the chromatin state (200-bps per state). (c-d) Word Cloud visualizations of motifs, with font size indicating frequency of occurrence, for the cases of RPKM>30 vs. RPKM>0 (c) and RPKM>30 vs. RPKM=0 (d).
In addition, similar to the previous task, the high attention scores appeared to be relatively concentrated near the start of the genic regions compared to other areas, as shown in Fig. 3(b). Such a pattern suggests that transcription start sites play a crucial role in distinguishing regions that are expressed. This is consistent with the understanding that the chromatin state around the transcription start sites is essential for estimating gene expression levels.
Figure 3(c) and (d) illustrate the representative motifs in expressed genic regions for RPKM>30 versus RPKM>0 and for RPKM>30 versus RPKM=0, respectively. In the Word Cloud visualization, motifs in larger fonts appear more frequently. For example, “FFFFB”, representing an 800-bp genic enhancer (“F”) followed by a 200-bp flanking active TSS (“B”), is the most frequent chromatin state motif in the first case. When the classifications at different gene expression levels are compared, the most frequent motifs at RPKM>10 and RPKM>20 predominantly consist of runs of weak transcription (“E”) and strong transcription (“D”), such as “EEEEE” or “DDDDE”. In contrast, at higher expression levels, such as RPKM>30 and RPKM>50, motifs composed primarily of genic enhancers (“F”) and enhancers (“G”) appear, such as “FFFFF” or “GGGBB” (Supplementary Table S1). This result suggests the importance of a large number of enhancer-like regions for high-level expression. Intriguingly, complex motifs involving several distinct chromatin states or repeated switching were observed in the classifications between RPKM>30 and RPKM>0 and between RPKM>30 and RPKM=0 (Supplementary Table S1). These classifications presented the motifs “DEFFF” and “GBBBG”, respectively. Here, “DEFFF” signifies a short weak transcription region (“E”) positioned between a strong transcription region (“D”) and a genic enhancer (“F”), while “GBBBG” indicates a flanking active TSS (“B”) bounded on both ends by enhancers (“G”). It should be noted that these motif patterns appear frequently in many genes from multiple epigenomes and are therefore unlikely to result from chance observation or from technical issues such as read depth or variance in the signal-to-noise ratio of the antibodies.
These observations are particularly noteworthy as they reveal distinct chromatin state patterns in highly expressed genic regions compared to those in non-expressed or minimally expressed genic regions.
Identification of strong promoters based on gene expression level
After the initial categorization task for genic regions, we further employed ChromBERT to identify chromatin state motifs around promoter regions. While various lengths can be defined for promoter regions, we chose to use data spanning from 20 kbps upstream to 40 kbps downstream of genes for the 57 cell lines that had available RNA-seq data in ROADMAP [12]. We made this choice because these regions are often crucial for gene regulation, containing cis-regulatory elements like enhancers, silencers, and insulators that influence gene expression [26]. Using criteria similar to those defined in the previous task, promoter regions adjacent to genic regions with RPKM values higher than 10, 20, 30, and 50 were labeled as ‘highly expressed’. Promoter regions adjacent to genes with RPKM > 0 were labeled ‘expressed’, and those with RPKM = 0 were termed ‘non-expressed’. For pretraining purposes, only the promoter regions from all target cell lines were utilized. From the pairwise fine-tuning of the above combinations, we observed a clear tendency: the promoter regions adjacent to genes with higher RPKM values exhibited higher AUC values in classification tasks, especially when compared to promoter regions near genes with low (RPKM > 0) or no expression (RPKM = 0), as shown in Figure 4(a).
(a) The fine-tuning results for promoter regions in terms of mean AUC, accuracy (ACC), and F1 score. (b) The composition of chromatin states within detected motifs across various tasks in promoter regions. (c) Representative motifs were identified for different tasks within promoter regions.
The motifs identified in the promoter regions adjacent to highly expressed genes predominantly consisted of active markers, such as active TSS (“A”), flanking active TSS (“B”), strong transcription regions (“D”), weak transcription regions (“E”), and genic enhancers (“F”), as shown in Figures 4(b) and (c). This aligns with the conventional biological understanding that a strong promoter encompasses an active TSS to activate gene expression. The most common motifs comprised one or two of the aforementioned chromatin states in sequential order, such as “ABBBB”, “BBBBB”, “AAABB”, “DDEEE”, or “EEEEE”. This pattern mirrors observations from previous tasks, underscoring that chromatin state sequences tend to persist rather than switch frequently. Motifs with a relatively complex composition were also identified, especially when comparing promoter regions adjacent to genes with RPKM > 30 and RPKM > 50 to those associated with non-expressed genes (RPKM=0). For instance, sequences such as “BABA”, “GBAA”, and “GBBBA” were distinctive in the promoter regions of these highly expressed genes. This suggests that these specific patterns in chromatin state sequences could be precursors or indicators of highly expressed genes.
Clustering the chromatin state motifs using Dynamic Time Warping
The previous results suggested that chromatin state sequences tend to be persistent. Additionally, the observed patterns (both in terms of characters and length) may vary slightly due to the fact that some chromatin states are correlated with each other (e.g., strong and weak transcription) and due to technical issues, such as different read depths and signal-to-noise ratios of the antibodies across samples. Therefore, chromatin state motif analysis requires ‘collapsing’ the similar patterns of chromatin states into a single representative motif. This approach helps to understand the overarching patterns of chromatin state motifs within specific genomic regions post-fine-tuning.
To accomplish this, we have introduced a clustering step utilizing DTW. Initially, ChromBERT identifies motifs without imposing constraints on the merging. This approach captures a comprehensive set of chromatin state patterns that potentially signify the regions of interest, with selection based on statistical significance (p-value). ChromBERT then uses DTW to align the similar motifs in the set. Originating from the field of speech recognition, DTW is adept at accommodating variations in the tempo of analogous spoken words [27], making it particularly suitable for our purposes of aligning the different lengths of continuous sequences. DTW computes sequence similarities not by conventional point-to-point Euclidean distances but by minimizing the cumulative distances between motifs. Finally, the agglomerative clustering algorithm [28] is applied to cluster similar patterns into categories, utilizing the hierarchical merging of data points based on their pairwise similarities.
The overall workflow of motif clustering is described in Figure 5(a). First, the pairwise DTW scores of motifs are calculated in both a forward-forward and a forward-reverse manner because transcription can proceed in either the forward or the reverse direction. Then, the DTW score matrix is generated using the lower of the two scores, which represents the higher similarity. Clustered motifs can be identified after determining the optimal number of clusters using a dendrogram. Figure 5(b) visualizes the clustered motifs of a representative case study on promoter regions associated with highly expressed genes. Chromatin state motifs were compiled from those identified across three distinct fine-tuning thresholds, comparing highly expressed genes (with RPKM values exceeding 50, 30, and 20, respectively) against non-expressed genes (RPKM=0). We used a window size of 12, a minimum motif length of 5, and a minimum occurrence frequency of 2. The x-axis denotes the position of the chromatin state within a motif, whereas the y-axis denotes the chromatin state symbols, represented alphabetically. After applying DTW followed by agglomerative clustering, the motifs were categorized into 11 clusters in this example. In this visualization, motifs categorized as similar share the same color and line style, facilitating easy identification of motifs that exhibit similar state transitions, prevalence, and patterns of stability and variability. Figure 5(c) shows the detected motif clusters plotted using UMAP [29]. ChromBERT includes a visualization function that summarizes the clustered chromatin state motifs, as shown in Figure 5(d). This feature helps users understand the composition and relative size of each cluster.
(a) Workflow of motif clustering using DTW. (b) Graph representation of motif clustering achieved through DTW. A total of 116 entries are visually differentiated through various colors and line styles. (c) UMAP representation of motifs clustered using agglomerative clustering. (d) Visualization of chromatin state motif clusters, with each cluster’s size proportional to its number of elements.
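The pairwise scoring step of this workflow can be sketched in plain Python. This is a minimal illustration, not the ChromBERT implementation: the integer encoding (A=0 through O=14), the absolute-difference local cost, and all function names are assumptions made for the example.

```python
def encode(motif):
    """Map chromatin state letters to integers (A=0 ... O=14); an assumed encoding."""
    return [ord(c) - ord("A") for c in motif]


def dtw_distance(a, b):
    """Classic dynamic-programming DTW with absolute difference as the local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def motif_distance(m1, m2):
    """Lower of the forward-forward and forward-reverse DTW scores."""
    x, y = encode(m1), encode(m2)
    return min(dtw_distance(x, y), dtw_distance(x, y[::-1]))


def distance_matrix(motifs):
    """Symmetric matrix of pairwise motif distances."""
    return [[motif_distance(p, q) for q in motifs] for p in motifs]


motifs = ["GBBBB", "GBBBA", "OAAA", "OOOAAA"]
for row in distance_matrix(motifs):
    print(row)
```

Because DTW allows one position to align with several, motifs that differ only in run length, such as “OAAA” and “OOOAAA”, receive a distance of zero in this sketch. The resulting matrix could then be passed to any hierarchical clustering routine that accepts precomputed distances.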
In our analysis to identify the optimal number of clusters, we integrated several evaluative measures, including insights from the dendrogram, to achieve a judicious balance between detail and overarching pattern recognition. We tested different numbers of clusters, such as 9 and 11, and found that DTW was effective in improving the accuracy of our clustering process. With 11 clusters, for instance, motifs with similar state transitions but different lengths, such as “GBBBB”, “GBBAA”, “GBAAAA”, and “GBBBA”, were clustered into a single category, while “GABBBB,” which has a different transition, was clearly segregated into a separate category. Such transitions like “G-B-A” could represent enhancer regions immediately preceding active TSS regions, potentially acting as a herald for transcription initiation. This suggests a dynamic regulatory setup where enhancer activation might directly influence the onset of transcription at proximal promoter sites. Additionally, sequences such as “OAAA” and “OOOAAA” in other clusters may reflect abrupt changes in chromatin states, possibly serving as precursors to promoter regions. Such patterns suggest the potential complexity of chromatin regulation captured by ChromBERT, and these motifs could provide interesting insights into the orchestrated events leading up to gene expression.
Discussion
In ChromBERT, we applied the BERT algorithm to accommodate the unique characteristics of chromatin state data. While several tools, such as CSREP [16] and EpiAlign [30], have been developed to analyze the same genomic regions of multiple samples, our approach aims to obtain significant chromatin state patterns within different regions of multiple samples. This approach has the potential to identify unexplored functional genomic regions and their epigenomic features.
Unlike the DNA sequences targeted by DNABERT, which comprise four nucleotides, the chromatin state data in this study comprise 15 states. This difference greatly expands the vocabulary and increases the complexity of k-merization, a critical step for tokenizing the data prior to pretraining. While DNABERT required vocabularies of only 68 (3-mers), 260 (4-mers), 1028 (5-mers), and 4101 (6-mers), ChromBERT required 3,379, 50,629, 759,379, and 11,390,629, respectively. Vocabularies of this size are computationally challenging and caused GPU memory overflows when we attempted to pretrain with 5-mers or longer, even with a single sample set. This effectively limited us to 4-mers for the pretraining step.
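The growth of the vocabulary with k can be checked with a short calculation. The ChromBERT figures quoted above equal 15^k + 4; the sketch below assumes that the additional four entries are special tokens, which is an interpretation rather than something stated in the text, and the function name is hypothetical.

```python
def kmer_vocab_size(alphabet_size: int, k: int, n_special: int = 0) -> int:
    """Number of possible k-mers over an alphabet, plus optional special tokens."""
    return alphabet_size ** k + n_special


for k in (3, 4, 5, 6):
    dna_kmers = kmer_vocab_size(4, k)                 # raw DNA k-mers
    chrom_vocab = kmer_vocab_size(15, k, n_special=4)  # assumed 4 special tokens
    print(f"k={k}: DNA k-mers={dna_kmers}, chromatin state vocabulary={chrom_vocab}")
```

Each increment of k multiplies the chromatin state vocabulary by 15 rather than 4, which is why the embedding table becomes the memory bottleneck much earlier than for DNA.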
Additionally, the perplexity of the model, a metric that quantifies how well a model predicts a sample, decreased from approximately 4 to approximately 1 over the course of pretraining. Interestingly, the perplexity was already relatively low (approximately 4) at the beginning of training, suggesting that the model predicted the masked tokens successfully even in the early phase. This initially low perplexity can be explained by several factors. The stable, long-lasting nature of chromatin states and their functional bias likely lead to more predictable sequence patterns, easing the model’s predictions. Additionally, our choice of 4-mer tokenization further simplifies the task for the model, as it only needs to predict patterns within a more limited context compared to larger k-mers.
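For intuition about these values, perplexity can be written as the exponential of the mean negative log-likelihood that the model assigns to the correct masked tokens. The sketch below uses this standard definition with made-up probabilities; it is not taken from the ChromBERT training code.

```python
import math


def perplexity(true_token_probs):
    """exp of the mean negative log-probability assigned to the correct tokens."""
    nll = [-math.log(p) for p in true_token_probs]
    return math.exp(sum(nll) / len(nll))


# A model that is as uncertain as a uniform choice among 4 tokens.
print(perplexity([0.25] * 8))
# A model that assigns about 0.885 to each correct token.
print(perplexity([1 / 1.13] * 8))
```

Under this definition, a perplexity near 4 corresponds to an effective choice among about four tokens out of a vocabulary of tens of thousands, and a perplexity near 1.13 corresponds to assigning roughly 0.88 probability to the correct token on average.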
Despite the computational constraints that led to the use of 4-mers, the pre-trained parameters have proved to be highly effective in fine-tuning the model for various downstream tasks. This includes binary classification of complex genes versus less-complex genes and highly expressed genes versus non-expressed genes. Furthermore, the model was able to identify several meaningful chromatin state motifs, suggesting its effectiveness in detecting unknown epigenomic patterns from large-scale data.
However, an intriguing aspect of our research emerged when ChromBERT was applied to broader genic regions. The model’s performance was notably better when pre-trained on a narrower set of 10 cell lines and then fine-tuned on data from 57 cell lines than when the inverse training scheme was employed. This performance discrepancy might arise from overfitting to the narrower set of 10 cell lines. It is conceivable that chromatin states in a larger set of cell lines, which may contain a certain amount of moderate-quality data, introduce complexities that challenge motif detection. In contrast, promoter regions, which are functionally critical for transcription initiation, may exhibit more conserved chromatin state patterns, enhancing ChromBERT’s accuracy in motif detection in such regions.
A current limitation of ChromBERT pertains to the detection of chromatin state pattern motifs, specifically in terms of their length. As we follow the data preprocessing steps outlined in DNABERT, which likely adheres to the sequence length constraints of the original BERT architecture, the input sequence length for both training and testing is capped at 510 tokens. This limitation may have affected the ability of ChromBERT to detect longer motifs, as the majority of identified motifs were approximately 5 or 6 states long.
Another potential reason for the shorter length of the identified motifs could be related to the k-merization process. Although the 4-mer tokenization used for the pretraining primarily helps to learn general patterns within the data and does not directly limit the length of the motifs, it could shape the initial context in which the model learns to predict chromatin states. This might make the model more sensitive to patterns of similar length to the 4-mers, even when a longer window size is used during the motif-finding stage.
Therefore, an important avenue for future research will be to devise methodologies that allow the capture of longer motifs, potentially those spanning extended segments of DNA. This might involve modifying the sequence length constraints or adjusting the tokenization process to better accommodate the unique properties of chromatin state annotation data. By doing so, we anticipate a further enhancement of ChromBERT’s utility in understanding the complex landscape of chromatin states.
Conclusion
In this study, we introduced ChromBERT, a model specifically designed to detect distinctive patterns within chromatin state annotation data sequences. By adapting the BERT algorithm as utilized in DNABERT, we pretrained the model on the complete set of genic regions using 4-mer tokenization. While ChromBERT demonstrated its proficiency in tasks such as differentiating between complex-genic regions characterized by recurrent chromatin state changes and regions with varying gene expression levels determined by RPKM thresholds, its most notable achievement was in the fine-tuning for promoter regions. This implies that promoter regions might encapsulate more conserved chromatin state patterns, enhancing their discernibility, particularly in more expansive datasets. Furthermore, ChromBERT identified chromatin state sequence motifs in these promoter-focused evaluations, often characterized by strong and weak transcription sites along with enhancer regions, predominantly spanning 5 states across 1000 bps with minimal internal variations. To provide deeper insights into these motifs, we added a motif clustering step using DTW, offering a nuanced view for estimating the motif patterns of chromatin state sequences. This addition enriches our analysis, setting the stage for ChromBERT’s potential in distinguishing specialized regions based on chromatin states, thereby aiding in the discovery of genomic regions associated with novel biological functions.
Methods
1. Data collection
The chromatin annotation BED files were downloaded from the ROADMAP [12] project, which includes 127 distinct epigenomes annotated with 15 different chromatin states at 200-bp intervals using ChromHMM [7]. The 15 chromatin states are: 1. Active Transcription Start Site (TSS), 2. Flanking Active TSS, 3. Transcription at Gene 5’ to 3’ Ends, 4. Strong Transcription, 5. Weak Transcription, 6. Genic Enhancers, 7. Enhancers, 8. Zinc Finger Protein (ZNF) Genes & Repeats, 9. Heterochromatin, 10. Bivalent/Poised TSS, 11. Flanking Bivalent TSS/Enhancer, 12. Bivalent Enhancer, 13. Repressed Polycomb, 14. Weak Repressed Polycomb, and 15. Quiescent and low signal. Each BED file consists of four columns: chromosome number, start position, end position, and the number corresponding to the annotated state.
2. Data preprocessing
To prepare the data for natural language processing (NLP) techniques, we first converted the chromatin-annotated data, originally represented by numbers 1 to 15, into corresponding alphabetic characters A to O. This transformation was necessary because k-mer tokenization operates on strings of single-character symbols, whereas numerical state labels such as 10 to 15 span two characters. After changing the chromatin state annotations to alphabetic symbols, we condensed the genomic positions by a factor of 200, since the original annotations were made at 200-bp intervals. Consequently, we generated a continuous string of chromatin states, with each letter (A to O) representing the corresponding chromatin state (1 to 15) for a 200-bp segment. Additionally, to focus our analysis on the relevant genomic regions and to exclude potential biases from telomeres, we removed the initial and final 10,000 bps from each chromosome, as these regions represent the extreme ends and may not accurately reflect the overall chromatin states. In this manner, all the chromatin annotation data for 127 epigenomes were processed chromosome-wise. Since we adapted DNABERT, the processed chromatin annotation data were randomly cut so that each piece was longer than 5 and no longer than 510, following the DNABERT data preparation method. Subsequently, the chromatin state sequences were tokenized into 3-mers, 4-mers, 5-mers, and 6-mers. Any sequence shorter than k was excluded from the corresponding k-mer dataset.
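The preprocessing steps described above can be sketched as follows. This is a minimal illustration, not the ChromBERT implementation: the function names, the in-memory `(start, end, state)` row format, and the use of overlapping k-mers with a stride of 1 (as in DNABERT) are assumptions made for the example. The random cutting into pieces of length 6 to 510 is omitted.

```python
import string

STATE_LETTERS = string.ascii_uppercase[:15]  # 'A'..'O' stand for states 1..15
BIN_SIZE = 200       # ChromHMM annotation resolution in bp
TRIM_BP = 10_000     # bp removed from each chromosome end


def bed_rows_to_string(rows):
    """Expand (start, end, state) rows of one chromosome into one letter per 200-bp bin."""
    pieces = []
    for start, end, state in rows:
        n_bins = (end - start) // BIN_SIZE
        pieces.append(STATE_LETTERS[state - 1] * n_bins)
    return "".join(pieces)


def trim_ends(seq):
    """Drop the first and last 10,000 bp, i.e. 50 bins on each side."""
    n = TRIM_BP // BIN_SIZE
    return seq[n:len(seq) - n]


def kmerize(seq, k):
    """Overlapping k-mers with stride 1 (assumed, following DNABERT)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


# Hypothetical toy chromosome: quiescent flanks around a short active region.
rows = [
    (0, 10_000, 15),
    (10_000, 10_400, 1),
    (10_400, 11_000, 2),
    (11_000, 11_200, 4),
    (11_200, 21_200, 15),
]
trimmed = trim_ends(bed_rows_to_string(rows))
print(trimmed)              # AABBBD
print(kmerize(trimmed, 4))  # ['AABB', 'ABBB', 'BBBD']
```

A sequence shorter than k yields an empty token list here, which mirrors the exclusion of such sequences from the dataset.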
3. Definition of complexity for complex genic regions and less-complex genic regions
To perform the binary classification between complex and less-complex genic regions, we examined all genic regions and assigned a complexity value to each data segment. To assess the complexity of the chromatin annotation data processed as explained in the previous section, we used two factors: the length of the data and the number of chromatin state switches within a single data segment. Complexity is defined as the proportion of state symbols that differ from the symbol immediately preceding them, with the first symbol of a segment always counted; this equals the number of runs of identical consecutive states divided by the segment length. For example, the data segment “AABBBCCAABBBCCC” contains 6 runs over 15 symbols, so its complexity is 6/15. For a sample epigenome, the mean complexity of the chromatin annotation data for genic regions is approximately 0.11. Using this mean as the threshold, we defined complex genic regions as those with complexity above it, which gave approximately 13,192 complex genic regions and approximately 30,068 less-complex genic regions (Supplementary Fig. S3).
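One reading of the complexity measure that reproduces the worked example (6/15) is to count the runs of identical consecutive states and divide by the segment length. The sketch below follows that reading; the function names and the default threshold argument are illustrative assumptions.

```python
from itertools import groupby


def complexity(segment):
    """Number of runs of identical consecutive states divided by segment length."""
    if not segment:
        raise ValueError("segment must not be empty")
    n_runs = sum(1 for _ in groupby(segment))
    return n_runs / len(segment)


def is_complex(segment, threshold=0.11):
    """Label a segment as complex when its complexity exceeds the threshold.

    The default of 0.11 is the approximate mean reported for a sample epigenome.
    """
    return complexity(segment) > threshold


print(complexity("AABBBCCAABBBCCC"))  # 0.4, i.e. 6/15
```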
4. Definition of ‘highly expressed’, ‘expressed’, and ‘non-expressed’ genic regions for selective comparisons across various gene expression levels
In the case of pairwise binary categorization across various gene expression levels, such as ‘highly expressed’, ‘expressed’, and ‘non-expressed’, we utilized the RNA-seq data from 57 different cell lines provided by the ROADMAP project. A genic region is defined as ‘expressed’ if the RPKM (Reads Per Kilobase per Million mapped reads) is greater than 0, while genic regions with an RPKM value of 0 are considered ‘non-expressed’. For ‘highly expressed’ genic regions, we established four different RPKM thresholds: 10, 20, 30, and 50. These thresholds were chosen based on our preliminary analysis, with an RPKM value of 10 serving as a reasonable starting point for high expression. We selected an upper limit of 50 due to the significantly reduced data quantity at this threshold, which was approximately a tenth of the data available at an RPKM value of 10.
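The category definitions above reduce to simple threshold checks on RPKM. The following sketch is illustrative only: the function name is hypothetical, and treating the ‘highly expressed’ cutoff as strictly greater than the threshold is an assumption, since the text does not state whether the boundary is inclusive.

```python
def expression_categories(rpkm, high_threshold=10.0):
    """Return the expression categories a genic region falls into.

    high_threshold is one of the tested cutoffs (10, 20, 30, or 50).
    """
    if rpkm < 0:
        raise ValueError("RPKM cannot be negative")
    if rpkm == 0:
        return ["non-expressed"]
    categories = ["expressed"]
    if rpkm > high_threshold:  # strict inequality is an assumption
        categories.append("highly expressed")
    return categories


print(expression_categories(0.0))                       # ['non-expressed']
print(expression_categories(3.2))                       # ['expressed']
print(expression_categories(42.0))                      # ['expressed', 'highly expressed']
print(expression_categories(42.0, high_threshold=50))   # ['expressed']
```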
5. Pretraining
The BERT architecture used in the DNABERT model was pretrained on chromatin annotation data, using multiple datasets acquired from different cell types, except for brain cells. The epigenomes used for the pretraining included H1 cell line (E003), H1 BMP4 derived mesendoderm cultured cells (E004), H1 derived mesenchymal stem cells (E006), hESC derived CD184+ endoderm cultured cells (E011), HUES64 cell line (E016), breast myoepithelial cells (E027), CD4 memory primary cells (E037), mobilized CD34 primary cells female (E050), adult liver (E066) and lung (E096). The chromatin annotation sequences for the genic regions of these ten epigenomes were preprocessed as described in the data preprocessing section and then concatenated, followed by tokenization using k-merization. Unlike DNABERT, we focused on testing 3-mers and 4-mers only, as 5-mers and 6-mers have a significantly larger vocabulary size (759,380 and 11,390,630, respectively), making them impractical to run. The vocabulary sizes for 3-mers and 4-mers were 3,380 and 50,630, respectively. Although both 3-mers and 4-mers demonstrated sufficiently low perplexity, we opted for 4-mers because they capture more complex patterns. The transformer architecture used for pretraining remained consistent with DNABERT, featuring an intermediate layer size of 3,072 and a hidden layer size of 768. To avoid out-of-memory errors during training, we set the train and evaluation batch sizes per GPU to 5 and 3, respectively. For other training conditions, we followed the same strategy as DNABERT, including a linear increase in the learning rate from 0 to 4e-4 during the warm-up phase. The pretraining process was conducted on a GPU machine equipped with two Nvidia GeForce RTX 2080Ti cards, and it took approximately three days to complete.
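The vocabulary sizes quoted for 4-mers, 5-mers, and 6-mers are consistent with 15^k possible k-mers plus five additional tokens. The sketch below makes that arithmetic explicit; interpreting the five extra entries as the usual BERT special tokens ([PAD], [UNK], [CLS], [SEP], [MASK]) is an assumption inferred from the numbers, not something stated in the text.

```python
N_STATES = 15          # chromatin states A..O
N_SPECIAL_TOKENS = 5   # assumed: [PAD], [UNK], [CLS], [SEP], [MASK]


def vocab_size(k):
    """All possible k-mers over 15 states plus the assumed special tokens."""
    return N_STATES ** k + N_SPECIAL_TOKENS


for k in (4, 5, 6):
    print(k, vocab_size(k))
# 4 50630
# 5 759380
# 6 11390630
```

Each step from k to k+1 multiplies the number of k-mers by 15, which is why the embedding table becomes impractical beyond 4-mers on the hardware described.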
6. Fine-tuning
We conducted fine-tuning for each downstream application based on the parameters acquired from the pretraining process. We followed the fine-tuning conditions used in DNABERT, using AdamW as the optimization method with a fixed weight decay of 0.01 and a dropout probability of 0.1 for the hidden layer. In each fine-tuning run, we prioritized results obtained from 4-mer sequences. This choice reflects our aim of recognizing longer patterns within chromatin state sequences and their respective motifs than the 3-mer tokenization, which was also evaluated during pretraining, can capture.
7. Motif finding
Similarly to the definition of DNA motifs in DNABERT, chromatin state motifs in our study are defined as frequently appearing patterns of consecutive chromatin states that are distinctively present in a certain category during binary classification. We followed the parameter conditions used in DNABERT to detect and filter these motifs. Firstly, we captured regions with high attention scores from the output attention matrix. A threshold was defined as a score higher than the mean attention scores of the matrix and higher than 10 times the minimum score to determine regions where motifs are likely included. These regions were registered as candidates, which were further filtered using a p-value of 0.05. This p-value determined whether the motifs were significantly enriched in one class compared to the other. Finally, motifs were identified as regions that are longer than four chromatin state sequences (equivalent to 800 bps) and appear at least three times exclusively in the target class. While in the case of DNA motifs, final motifs are typically compared with existing DNA motif libraries, we collected chromatin state motifs without such a comparison because there is currently no known library available for this task.
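The candidate-selection rule described above (score above the mean and above 10 times the minimum, then a minimum length) can be sketched as follows. This is a simplified illustration: it assumes the attention matrix has already been reduced to one score per position, the function name is hypothetical, and the p-value and occurrence filters are not shown.

```python
def high_attention_regions(scores, min_len=5):
    """Return (start, end) index pairs, end exclusive, of contiguous high-attention runs.

    A position qualifies when its score exceeds both the mean score and
    10 times the minimum score. Runs shorter than min_len are dropped;
    min_len=5 encodes "longer than four chromatin states".
    """
    mean_score = sum(scores) / len(scores)
    min_cutoff = 10 * min(scores)
    mask = [s > mean_score and s > min_cutoff for s in scores]

    regions = []
    start = None
    for i, flag in enumerate(mask + [False]):  # sentinel closes a trailing run
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_len:
                regions.append((start, i))
            start = None
    return regions


# Hypothetical per-position attention scores for one sequence.
scores = [0.01, 0.02, 0.30, 0.35, 0.40, 0.38, 0.33, 0.02, 0.25, 0.01]
print(high_attention_regions(scores))  # [(2, 7)]
```

In this toy input the isolated high score at index 8 passes the threshold but is discarded by the length filter.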
8. Motif clustering
To further estimate the representative motif chromatin sequence pattern, we utilized DTW and agglomerative clustering to analyze the chromatin state motifs. Our initial dataset included motifs identified after careful screening for p-values, minimum length, and minimum occurrences within predefined regions of interest, avoiding the merging step to allow a detailed examination of chromatin sequences, which are represented alphabetically. Given the potential for variability in these sequences due to factors such as data quality (e.g., the signal-to-noise ratio of the antibodies) or fluctuations in histone modification peaks, we opted for a more flexible approach to motif identification. DTW was selected for its capability to adjust for sequence length variations, accommodating the intrinsic dynamics of chromatin state sequences.
First, to implement DTW, we converted the chromatin state alphabets (“A” to “O”) into numerical values (1 to 15) to facilitate a standardized analysis. We provide two options for handling this numerical translation of chromatin states through a ‘categorical’ parameter, which can be set to ‘True’ or ‘False’. If users set it to ‘True’, the numbers are treated as categorical, meaning that states A to B and A to O are equally distant. If it is set to ‘False’, which is the default option, the numbers representing the states are treated numerically; thus, the distance from state A to C is twice the distance from state A to B. The result of motif clustering when the ‘categorical’ parameter is set to ‘True’ is shown in Supplementary Figure S5. Next, shorter sequences were padded with ‘NaN’ (Not a Number) to match the maximum length dictated by the window size, ensuring consistency across all entries. For DTW calculation, the NaN values in shorter sequences are, by default, converted to the number representing the nearest state according to the ‘ffill’ filling method. Because the DNA sequences involve both forward and reverse directions, we considered both the forward-forward and forward-reverse comparisons. Consequently, the DTW matrix was composed of the minimum score between these two comparisons. Using the ‘tslearn’ Python package, particularly the ‘dtw’ module, we calculated similarity scores between sequences by focusing on the minimum cumulative distances in both directions, allowing for a detailed comparison of motifs.
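The orientation-aware DTW comparison can be illustrated with a small, dependency-free sketch. It is a stand-in for the ‘tslearn’ routine rather than a reproduction of it: the local cost here is the absolute difference (or a 0/1 mismatch cost in categorical mode), all function names are hypothetical, and the NaN padding and ‘ffill’ steps are skipped because this plain DTW accepts sequences of unequal length directly.

```python
import math


def encode(motif):
    """Map state letters 'A'..'O' to the integers 1..15."""
    return [ord(c) - ord("A") + 1 for c in motif]


def dtw(a, b, categorical=False):
    """Classic dynamic-programming DTW returning the minimum cumulative cost."""
    def cost(x, y):
        if categorical:
            return 0.0 if x == y else 1.0  # every mismatch is equally distant
        return float(abs(x - y))           # numeric: A to C is twice A to B

    n, m = len(a), len(b)
    acc = [[math.inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i][j] = cost(a[i - 1], b[j - 1]) + min(
                acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1]
            )
    return acc[n][m]


def motif_distance(m1, m2, categorical=False):
    """Minimum of the forward-forward and forward-reverse DTW scores."""
    a, b = encode(m1), encode(m2)
    return min(dtw(a, b, categorical), dtw(a, b[::-1], categorical))


def distance_matrix(motifs, categorical=False):
    """Pairwise orientation-aware DTW distances between motifs."""
    return [[motif_distance(x, y, categorical) for y in motifs] for x in motifs]


print(motif_distance("AABBB", "AB"))                   # 0.0
print(motif_distance("ABC", "CBA"))                    # 0.0
print(motif_distance("A", "O"))                        # 14.0
print(motif_distance("A", "O", categorical=True))      # 1.0
```

The first two calls show the two properties the text relies on: warping absorbs differences in run length, and taking the minimum over both orientations makes a motif and its reverse equivalent.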
Following this similarity assessment, we employed agglomerative clustering via the ‘AgglomerativeClustering’ module from the ‘scikit-learn’ package, grouping the motifs into clusters based on their similarities. Additionally, we provide a function to generate a dendrogram, which displays the hierarchy and helps users determine the optimal number of clusters when the number of clusters is not predetermined. While the dendrogram was our primary tool for selecting the optimal number of clusters, users are encouraged to consider alternative methods for determining the cluster count. For visualization purposes, we provide two functions: the first displays a UMAP projection, which reduces the dimensionality of the data for easier interpretation and visualization of the clustering results; the second visualizes the final result of clustering with actual motif entries. This approach facilitates the identification of representative sequential patterns among chromatin states, thereby offering deeper insights into genomic regulatory landscapes.
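To show what agglomerative clustering does with a precomputed DTW distance matrix, the sketch below implements a naive version in plain Python rather than calling a library class such as ‘AgglomerativeClustering’. The function name and the choice of average linkage are assumptions made for illustration; the text does not specify the linkage criterion.

```python
def agglomerative_labels(dist, n_clusters):
    """Naive average-linkage agglomerative clustering on a square distance matrix.

    Starts with one cluster per item and repeatedly merges the two clusters
    with the smallest mean pairwise distance until n_clusters remain.
    """
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                mean = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or mean < best[0]:
                    best = (mean, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = [0] * len(dist)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Hypothetical DTW distances between four motifs: two tight pairs.
dist = [
    [0, 1, 9, 8],
    [1, 0, 8, 9],
    [9, 8, 0, 2],
    [8, 9, 2, 0],
]
print(agglomerative_labels(dist, n_clusters=2))  # [0, 0, 1, 1]
```

The sequence of merges performed in the loop is the hierarchy that a dendrogram draws, which is why the dendrogram can guide the choice of the number of clusters.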
Authors’ contributions
SL developed ChromBERT and performed all analyses in this study. RN conceived and designed the study. SL and RN contributed to the drafting of the manuscript. CL and CYC supervised the deep learning process and provided suggestions for improving the analysis and the manuscript.
Funding
This work was supported by a Grant-in-Aid for Scientific Research under grant number 23H02466, the Japan Agency for Medical Research and Development under grant number JP23gm6310012h0004, and the JST FOREST Program under grant number JPMJFR224Y.
Competing interests
The authors declare no competing interests.
Acknowledgements
We thank all the laboratory members for their helpful discussions.