The human genome’s vocabulary as proposed by the DNA language model GROVER

Large Language Models (LLMs) on natural language have achieved a level of performance that allows the generation of coherent and syntactically correct text. The DNA sequence of genomes follows rules similar to those of natural language, but a distinguishing factor is the absence of a concept analogous to words. We established byte-pair tokenization on the human genome and trained a foundation language model called GROVER ("Genome Rules Obtained Via Extracted Representations"), selecting the optimal vocabulary with a custom fine-tuning task of next-k-mer prediction. We thus defined a dictionary of words/tokens in the human genome that best carries the information content for DNA language models. Analyzing GROVER's learned representations, we observed that token embeddings primarily encode information related to token frequency, sequence content, and length. Some tokens are localized almost exclusively in repeats, while the vast majority distribute widely over the genome. The model also learns context and lexical ambiguity. Average embeddings of genomic regions relate to functional genomics annotation, indicating that GROVER has learned these structures purely from the contextual relationships of tokens. That functional annotations can be extracted from the genome purely from the sequence representations of the trained model highlights the extent of the information content encoded by the sequence. This is supported by fine-tuning tasks on genome biology, addressing promoter identity and protein-DNA binding. GROVER learns sequence context and a sense of the grammatical structures and language rules of the genome. This knowledge can be extracted and used to compose a grammar book for the code of life.

While genomes and DNA are very analogous to language, with sequence structure that resembles grammar, syntax, and semantics, they also differ. First, there is no clearly defined direction in DNA sequence, unless viewed relative to biological processes like transcription or replication. Second, there is no natural definition of words. We know of motifs, such as sequences with preferences to be bound by specific transcription factors, or the triplet code that encodes for protein. However, when looking at the genome as a whole, there is no straightforward concept of words. Several approaches exist to overcome these challenges for training transformer models on DNA, so-called DNA language models. Some aim to address specific tasks, such as the modelling of gene expression with Enformer 5, a model that combines convolutional layers with transformer blocks. With the convolutional layers, no definition of words is necessary. Foundation models, however, are not trained directly on a specific genome biology task, but are first pretrained on masked token prediction and subsequently fine-tuned. This strategy requires the definition of discrete tokens, i.e. building "words" from DNA. Available models include DNABERT 6 and Nucleotide Transformer 7, which use a Bidirectional Encoder Representations from Transformers (BERT) 8 architecture and apply different strategies for generating the vocabulary. Nucleotide Transformer uses as vocabulary 6mers, with single nucleotides at the edges of known sequence. DNABERT, on the other hand, uses k-mers of 3, 4, 5, and 6 nucleotides in length for four different models, of which the 6mer model performs best 6. The k-mers overlap, and training is designed such that the central nucleotide of a masked sequence does not overlap with any unmasked tokens. As a consequence, the model largely learns the sequence of the tokens rather than larger sequence context 9.

How to choose the right vocabulary for training a DNA language model has been challenging. On the one hand, the tokens should have an appropriate length so that they capture the language structure of the genome. However, if this length is chosen as a constant, token frequencies become very heterogeneous. For 6mers, for example, the frequency of each token ranges from ca. 10^4 to 10^7 occurrences in the human genome (hg19). Such a frequency imbalance can inhibit model training through the rare word problem. We therefore developed byte-pair tokenization on genomic sequence as a strategy to build multiple frequency-balanced vocabularies and select the vocabulary that carries the information content of the human genome in an optimal way. In combination with fine-tuning tasks and the inbuilt transparency of the model architecture, we can now start using the resulting foundation DNA language model, GROVER ("Genome Rules Obtained Via Extracted Representations"), to extract the different layers of the genetic code with unprecedented performance and detail.
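As an illustration of the 6mer frequency imbalance described above, the following is a minimal sketch of such a frequency census. The file name hg19.fa and the simple FASTA reader are assumptions for the example, not part of the GROVER pipeline.

```python
# Minimal sketch: census of overlapping 6-mer frequencies in a genome FASTA.
# The file name "hg19.fa" is an illustrative assumption; any FASTA reader works.
from collections import Counter

def kmer_census(sequences, k=6):
    """Count overlapping k-mer occurrences, skipping k-mers that contain N."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:
                counts[kmer] += 1
    return counts

def read_fasta(path):
    """Yield one concatenated sequence per FASTA record."""
    seq = []
    with open(path) as fh:
        for line in fh:
            if line.startswith(">"):
                if seq:
                    yield "".join(seq)
                    seq = []
            else:
                seq.append(line.strip())
    if seq:
        yield "".join(seq)

counts = kmer_census(read_fasta("hg19.fa"), k=6)
print(min(counts.values()), max(counts.values()))  # spans roughly 1e4 to 1e7 for hg19
```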

Results
Building a frequency-balanced vocabulary on the human genome

Foundation DNA language models require a vocabulary for training that is generated on the genome by grouping nucleotides into tokens. For a human DNA language model this is a difficult task, because sequence composition over the genome is very heterogeneous. While A- and T-rich sequences are relatively frequent, sequences with CG dinucleotides are depleted, due to their susceptibility to mutation 10. To achieve tokens of not too heterogeneous frequencies, tokens that contain rarer sequence content should be shorter for optimal contribution to model training, whereas tokens with frequent sequence content should be longer. In the case of CG dinucleotides this is of particular importance, given that through potential DNA methylation in the form of 5-methyl-cytosine 11, this dinucleotide fulfils a special biological role in gene regulation 12,13 and retrotransposon silencing 14. To tokenize the genome in a frequency-balanced fashion, we employed byte-pair tokenization 15, a method originally developed for text compression (Fig 1A). The algorithm prioritizes building larger tokens of more frequent sequence content by sequentially combining the most frequent token pairs into new tokens. Starting with the four nucleotides A, C, G, and T, in the first cycle of byte-pair tokenization two Ts are combined into a TT token, whose initial embedding does not retain the information on the sequence content but adopts a new token identity. This pairing can in principle be continued for many cycles, forming progressively larger tokens. As the dictionary grows, the new pair in every cycle becomes less and less frequent. We use these vocabularies to train a model with BERT architecture (Fig 1B) for masked token prediction with cross-entropy loss. This results in a model for each cycle of byte-pair tokenization, from which the optimal model, and therefore vocabulary, is selected.
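The core of one byte-pair cycle can be sketched in a few lines. The toy sequence, the tie-breaking between equally frequent pairs, and the omission of special tokens and N handling are simplifications relative to the actual GROVER pipeline.

```python
# Minimal sketch of byte-pair tokenization on a DNA sequence (toy example;
# the actual GROVER pipeline runs on the whole genome with special tokens).
from collections import Counter

def byte_pair_tokenize(sequence, n_cycles):
    tokens = list(sequence)          # start from single nucleotides A, C, G, T
    vocab = sorted(set(tokens))
    for _ in range(n_cycles):
        # count adjacent token pairs and merge the most frequent one
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (left, right), _count = pairs.most_common(1)[0]
        merged = left + right        # the new token gets its own identity/embedding
        vocab.append(merged)
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                out.append(merged)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens, vocab

tokens, vocab = byte_pair_tokenize("TTTTACGTTTTACG", n_cycles=3)
print(tokens, vocab[-3:])  # the first cycle merges the frequent T+T pair into "TT"
```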
Using next-k-mer prediction to select an optimized vocabulary for DNA language models

To select an optimal vocabulary and model, performance can in principle be assessed with two different strategies, intrinsic validation and extrinsic validation, where extrinsic validation would typically be performed on a task of specific genome biology. Since we aim to build a foundation model that is not biased towards a specific biological task, we consider intrinsic validation more suitable. However, metrics like perplexity and accuracy depend on the size of the dictionary and are consequently unsuitable for direct comparison of vocabularies. We therefore employed a fine-tuning task of next-k-mer prediction 9, which fulfils the criteria of a task that does not directly focus on a specific question of genome biology and is independent of the number of training parameters of the foundation model, the tokenization strategy, and the vocabulary size. The task is to predict the next fixed-size k-mer, using five fine-tuning models with k = [2,3,4,5,6], and therefore requires knowledge of context for good performance. We applied this task to the foundation models trained on 100 to 5,000 cycles of byte-pair-tokenized vocabulary and assessed performance through the accuracy of next-k-mer prediction (Fig 2A). While there are small differences between the different k-mer models, they all generally perform in an optimal range between cycles 400 and 800. We thus picked cycle 600 for our final model, GROVER. To compare GROVER with models of fixed-size vocabularies, we established foundation models of non-overlapping 4mers, 5mers, and 6mers to perform the equivalent task (Fig 2B). While they all showed inferior accuracy for next-k-mer prediction, it is noteworthy that foundation models with the k-mer size matched to the next-k-mer prediction task hardly performed better than foundation models fine-tuned with unmatched k-mer sizes. We also assessed performance for the GROVER foundation model training task of masked token prediction (Fig 2C), which achieves 21% accuracy. When allowing for the token to be among the up to 60 top predicted tokens, which represents 10% of the dictionary, accuracy rises to 75%. Perplexity was assessed in comparison to the fixed-size k-mer models, divided by the number of tokens in the dictionary, which to some extent compensates for the differences in dictionary sizes between the models. GROVER shows a perplexity of 12% of the vocabulary size, whereas the fixed-size k-mer models show perplexities of 25%, 21%, and 36% for 4mers, 5mers, and 6mers, respectively (Fig 2D). The selected byte-pair-tokenized vocabulary therefore outperforms the fixed-size k-mer models and is thus optimized to carry the information content of the genome with relevance for this type of model training. We next had a closer look at the vocabulary composition to ask what characteristics might allow such an improvement in performance.
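The two intrinsic metrics used here can be sketched as follows. The random logits and labels are toy stand-ins for the model's outputs at masked positions, and the tensor shapes are assumptions of the example.

```python
# Sketch of the intrinsic metrics reported above: top-k accuracy for masked
# tokens and perplexity expressed as a fraction of vocabulary size.
import torch
import torch.nn.functional as F

def topk_accuracy(logits, labels, k=60):
    """Fraction of masked positions whose true token is among the top-k predictions."""
    topk = logits.topk(k, dim=-1).indices                 # (n_masked, k)
    return (topk == labels.unsqueeze(-1)).any(-1).float().mean().item()

def perplexity_fraction(logits, labels, vocab_size):
    """exp(cross-entropy), divided by vocabulary size to compare across dictionaries."""
    ppl = torch.exp(F.cross_entropy(logits, labels))
    return (ppl / vocab_size).item()

logits = torch.randn(1000, 601)                           # toy example: 601 GROVER tokens
labels = torch.randint(0, 601, (1000,))
print(topk_accuracy(logits, labels, k=60), perplexity_fraction(logits, labels, 601))
```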
The vocabulary that carries optimized information content for DNA language training

Byte-pair tokenization started with the special tokens and the four nucleotides. It added one token for each of the 600 cycles. All single As, Cs, and Ts (that were not adjacent to Ns) were incorporated into larger tokens and were thus removed from the dictionary. As a result, the dictionary contains 601 GROVER tokens. First, we assessed their frequency in the human genome (Fig 3A). While the fixed-size k-mer models show bimodal distributions of token frequencies, the vast majority of tokens in the GROVER vocabulary occur more than 100,000 times in the genome, with a median of ca. 400,000. While the 4mer model also has most token frequencies above 100,000, the complexity of its dictionary is only 256 tokens, as compared to 601 for GROVER. Still, the majority of tokens in the tokenized genome are 4mers, and the average token length is 4.07 (Fig 3B), with a close to symmetric distribution of other token sizes. The length of the tokens ranges from one 1mer, a single guanine, to two 16mers, A16 and T16. In the formation of the vocabulary not all k-mers are generated, since the required smaller tokens may have been used up by other, more frequent combinations. This results in a heterogeneous representation of k-mers in the GROVER dictionary (Fig 3C). Most tokens in the dictionary are 5mers and 6mers, with 213 and 224 tokens respectively, and larger tokens become less frequent. The proportional representation of k-mers in the dictionary is therefore also heterogeneous (Fig 3D). With the G, 25% of the 1mers are represented, and 63% of the 2mers. A CG dinucleotide, for example, was never formed, and this sequence is either part of larger tokens or distributed between two tokens. Even though 4mers are the most frequent tokens in the vocabulary, only 32% of the possible 4mer sequences are represented. Still, in total the nucleotide representation within the dictionary is a reflection of the nucleotide composition of the genome (Fig 3E). There is, however, a substantial imbalance with respect to the first nucleotide of the tokens in the dictionary, with 97% of them starting with an A or a T, while the nucleotide composition of the genome would suggest an expected proportion of 60%. This is probably a side effect of the byte-pair tokenization algorithm, which initially prioritizes the more frequent As and Ts to generate new tokens and thus amplifies the nucleotide frequency imbalance of the genome in the representation of the tokens' first nucleotides. To characterize the tokens further, we assessed performance metrics per token length and discovered heterogeneous performance. 6mers perform worst on average for the Area Under the Curve (AUC) and accuracy, while both the shorter and especially the longer tokens show much better performance (Fig 3F,G). Importantly, every token lies above 50% and thus contributes to the non-random performance of GROVER. Interestingly, the tokens with the most accurate predictions (>99%) are a 9mer (ATTACAGGC) and a 12mer (TGTAATCCCAGC).

We use self-similarity to assess context (Fig 6A). The self-similarity of a token sequence is the cosine similarity between its embeddings across its different contexts. The more contextualized the representations are, the lower we expect self-similarity to be. Using hierarchical clustering on the Euclidean distance over self-similarities for each transformer layer, the clustering resembles the analogous results of the average embeddings. Self-similarity varies between the layers. Tokens that localize largely to repeats show the highest self-similarity almost throughout the layers. They are also distinct in their good per-token performance metrics and long token lengths. Otherwise, high self-similarity in layer 12 also highlights a special token group, the short tokens of high frequency, which likewise show good performance metrics. For the other tokens it can be concluded that, depending on their cluster, there is low self-similarity in different layers, which indicates contextualized learning.
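The self-similarity measure can be sketched as follows. How the per-occurrence hidden states are gathered from the model is left out, and the toy tensor is a stand-in for one token's embeddings at one layer.

```python
# Sketch of the self-similarity measure: mean pairwise cosine similarity of one
# token's embeddings across its different contexts, per transformer layer.
import torch
import torch.nn.functional as F

def self_similarity(embeddings):
    """embeddings: (n_occurrences, hidden_dim) for one token at one layer."""
    e = F.normalize(embeddings, dim=-1)       # unit vectors
    sim = e @ e.T                             # pairwise cosine similarities
    n = sim.size(0)
    off_diag = sim.sum() - sim.diagonal().sum()
    return (off_diag / (n * (n - 1))).item()  # mean over distinct occurrence pairs

# Lower values indicate more contextualized representations of the token.
occurrences = torch.randn(500, 768)           # toy stand-in for layer-12 hidden states
print(self_similarity(occurrences))
```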
Discussion

We have built GROVER ("Genome Rules Obtained Via Extracted Representations"), a foundation DNA language model with an optimized vocabulary for the human genome, selected by a task of next-k-mer prediction, a fine-tuning task that is independent of the structure of the foundation model and can thus handle different vocabulary sizes and tokenization strategies without directly selecting models for biological tasks. GROVER can grasp DNA language structure by learning both characteristics of the tokens and larger sequence contexts. It outperforms analogous models with fixed-size k-mers both for tasks of next-token prediction and for fine-tuning tasks of promoter identification and protein-DNA binding. Thus, we have identified the vocabulary that currently best defines the information of the human genome as it can be extracted by a BERT model. GROVER can be a basis to extract the information content of the genome, its grammatical and general language structures, via token embeddings or through extracting attention from the foundation model. In addition, token embeddings and attention can be obtained from specific fine-tuning tasks to interrogate specific genome biology. Such tasks could be genome annotation with functional data, genotype-phenotype predictions, or technical tasks, for example for data augmentation.

We have built GROVER exclusively on the human genome, which is a different strategy from other models, such as Nucleotide Transformer 7, where different genomes are combined in one model. While the number of parameters is limited by using just one genome, the models serve different purposes, given that different genomes follow different language rules. For example, the human genome consists of ca. 12% Alu retrotransposons, which are primate-specific genome elements. While the coding genome is relatively conserved over multiple species, the non-coding genome is more unique. It is these language rules we aim to learn with GROVER in a transparent way, specifically for biomedical questions, and we therefore decided not to mix the human genome with other species. Still, with this strategy we are not compromising performance on biological fine-tuning tasks, where a one-species approach seems to be sufficient. However, this approach may hit its limits with smaller genomes.

The different layers of the genetic code can now be approached through these models, and it can be extracted how DNA codes for protein and transcripts, for gene regulation, self-propagation, and stability. Therein lies not only the key to genotype-phenotype prediction and the information of what in the DNA makes us human, but also the information on predisposition to disease and treatment responses, which is to a large extent encoded in the patients' germline and somatic genomes. DNA language models like GROVER therefore have the potential to substantially push progress in personalized medicine.

Methods

Unless otherwise specified as being written in R, all code is written in Python.

The final source code and fine-tuned models will be made available at https://zenodo.org

Data

We used the Homo sapiens (human) genome assembly GRCh37 (hg19) and only took into account the sequences that contain A, C, G, and T. Each chromosome is split into windows varying between 20 and 510 tokens in size. Specifically, with a 50% probability, the size of a window is 510; otherwise, its size is a random integer between 20 and 510. 80% of the windows are taken as the training set and the remainder as the test set.
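A sketch of this windowing scheme, assuming the chromosome has already been tokenized; the seeding and the shuffling before the split are illustrative details not specified in the text.

```python
# Sketch of the windowing scheme described above: windows of 510 tokens with
# probability 0.5, otherwise a random size between 20 and 510; 80/20 split.
import random

def split_into_windows(tokens, min_size=20, max_size=510, seed=0):
    rng = random.Random(seed)
    windows, i = [], 0
    while i < len(tokens):
        size = max_size if rng.random() < 0.5 else rng.randint(min_size, max_size)
        windows.append(tokens[i:i + size])
        i += size
    return windows

def train_test_split(windows, train_fraction=0.8, seed=0):
    rng = random.Random(seed)
    shuffled = windows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```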

The optimal vocabulary was selected through performance assessment with next-k-mer prediction

Next-k-mer prediction

For model validation and selection of the optimal vocabulary we used a fine-tuning task of next-k-mer prediction that we previously developed 9. It allows the comparison of different foundation models that rely on context learning, independent of how their vocabulary was generated, the size of the vocabulary, or the learning parameters. The task does not depend on a specific biological question. The principle of next-k-mer prediction is to take the pre-trained language model and fine-tune it to predict the next k-mer, where k is 2, 3, 4, 5, or 6.

Chromosome 21 is split into sequences of 510 nucleotides, where we keep the first 56 nucleotides of each sequence. We randomly select 500,000 sequences, 80% for training and 20% for testing.

The samples are defined as the first 50 nucleotides of each sequence. For the labels, we take the k nucleotides that follow these 50 nucleotides. The next-k-mer model has 4^k different classes, i.e., 16, 64, 256, 1,024, and 4,096 for k = 2, 3, 4, 5, and 6, respectively.
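A sketch of this sample and label construction; the deterministic 510-nt stride, the uppercasing, and the skipping of windows containing N are assumptions of the example, and the random selection of 500,000 sequences described above is omitted for brevity.

```python
# Sketch of next-k-mer sample construction: 510-nt sequences from chromosome 21,
# first 50 nt as input, the following k nt as the class label (4**k classes).
from itertools import product

def make_next_kmer_dataset(chromosome, k):
    kmer_to_class = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    samples, labels = [], []
    for start in range(0, len(chromosome) - 510 + 1, 510):
        chunk = chromosome[start:start + 56].upper()  # keep the first 56 nt
        sample, label_kmer = chunk[:50], chunk[50:50 + k]
        if "N" in chunk[:50 + k]:
            continue                                  # skip windows with undefined bases
        samples.append(sample)
        labels.append(kmer_to_class[label_kmer])
    return samples, labels
```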

The same human promoter annotations as used in the Prom300 task were taken in 10 kb windows around the TSS. The sequence was divided into overlapping 1001-bp windows with a step size of 300 bp. Training was performed for classification of the presence of a TSS in a respective window. Only the central TSS was considered, even in the presence of more than one TSS.
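A sketch of this window construction for a single TSS; the coordinate conventions and the way the chromosome sequence is accessed are assumptions of the example.

```python
# Sketch of the TSS-window construction: a 10 kb region around each TSS is cut
# into overlapping 1001-bp windows (step 300 bp), each labelled by whether it
# contains the central TSS.
def tss_windows(chrom_seq, tss_position, flank=5000, window=1001, step=300):
    region_start = tss_position - flank
    samples = []
    for start in range(region_start, tss_position + flank - window + 1, step):
        end = start + window
        contains_tss = start <= tss_position < end   # only the central TSS is considered
        samples.append((chrom_seq[start:end], int(contains_tss)))
    return samples
```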