Abstract
Inspired by the success of large language models, we develop a long-context generative model for genomes. Our multiscale transformer model was pre-trained on unannotated bacteriophage genomes with byte-level tokenization. We demonstrate the foundational capabilities of our model, including the prediction of essential genes, genetic variant effects, regulatory element activity, and the taxonomy of unannotated sequences. Furthermore, the model generates de novo sequences of up to 96K base pairs that contain functional regulatory elements and novel proteins with phage-related functions.
Large pre-trained language models have drastically transformed the natural language processing (NLP) field1,2. Drawing on the similarity between natural language and genome sequences, genomic language models have been developed. Trained on large-scale genomic datasets, these models effectively predict regulatory elements, uncover co-regulation patterns in proteins, and identify genome-wide variant effects3–7. However, it remains an open question whether language models can be tailored to generate genome-scale sequences with functional elements while retaining the capacity to decipher the intricate relationships within DNA sequences.
Most current genomic language models use masked language modeling, as in Bidirectional Encoder Representations from Transformers (BERT)1. This approach is not ideal for tasks that require generating new content. In addition, these models face technical constraints such as short context sizes and the aggregation of nucleotides into k-mer tokens. These limitations hinder their ability to learn from genome-scale data at the resolution needed for designing functional elements.
In this work, we introduce megaDNA, a long-context language model that demonstrates foundational capabilities in understanding and generating genomic sequences. Our model draws inspiration from the Generative Pre-trained Transformer (GPT) models2, renowned for their proficiency in generating long, coherent texts. We utilized a multiscale transformer architecture8 that enables training on unannotated whole bacteriophage genomes at single-nucleotide resolution. Without further training, our model predicts gene essentiality across the phage genome in a zero-shot manner. The model embeddings are directly applicable to predicting functional properties of both regulatory elements and proteins. Moreover, the trained model generates sequences of up to 96K base pairs (bp) that share a similar genomic structure with natural bacteriophage genomes. We found functional promoters and ribosome binding sites (RBS) in the 5’ untranslated regions (5’UTRs) of the predicted genes, and the proteins encoded by the generated sequences are predicted to be structurally plausible. Our model is available on GitHub: https://github.com/lingxusb/megaDNA
To construct the training dataset, we collected high-confidence bacteriophage genomes from three sources: NCBI GenBank, the Metagenomic Gut Virus (MGV) catalogue9, and the Gut Phage Database (GPD)10 (Supplementary Fig. 1). After data cleaning, we obtained a dataset of 99.7K bacteriophage genomes for pre-training (Methods). The training data was byte-level tokenized, and we employed a multiscale transformer architecture with three layers and a long context window, as proposed by Yu et al.8
We hypothesize that our pre-trained language model captures the structural patterns of bacteriophage genomes, allowing the model loss to approximate the fitness of genome sequences. To test this hypothesis, we conducted an in silico mutagenesis analysis to predict essential genes in the lambda phage genome11 (Fig. 1b). Without any supervised training, we found that mutations within the coding sequences of essential genes result in higher model losses than mutations in non-essential genes (Fig. 1c). Consequently, the change in model loss can be used as a zero-shot predictor of gene essentiality (AUROC: 0.86, Fig. 1d). Similarly, mutations in the start and stop codons of essential genes lead to higher model losses than those of non-essential genes (Fig. 1d, Supplementary Fig. 2).
Sequence embeddings from language models capture rich contextual information, and we explored their utility for a range of predictive tasks. We first evaluated our model’s ability to predict the effects of sequence mutations on protein function (Fig. 1e). Mutated gene coding sequences were used as inputs, and a regressor was trained on the model embeddings to predict mutational effects. Our model’s performance closely matched that of the state-of-the-art model DeepSequence12 (Fig. 1f), including for a protein absent from the training dataset (Supplementary Fig. 3). Moreover, our model successfully predicted the impact of SNPs across the T7 bacteriophage genome13, even with limited training samples (Fig. 1g, Supplementary Fig. 4).
Expression of phage genes relies on the protein synthesis machinery of host bacterial cells. We investigated the potential of the model embeddings to predict regulatory element activity in bacteria (Fig. 1h). Our model effectively predicted the translation efficiency of 5’UTRs in non-model organisms such as K. oxytoca and P. protegens, as well as in high-throughput gene expression libraries in E. coli14 (Fig. 1i). The model performance is also robust to the training sample size (Supplementary Fig. 5).
Lastly, we extended our model to identify the taxonomy of unannotated sequences (Fig. 1j). We collected unannotated sequences from bacteriophages, bacteria, and archaea. The embeddings from our model separated these sequences in a low-dimensional space (Fig. 1k). By training a logistic regression classifier on the model embeddings, we achieved high classification accuracy (average AUROC of 0.98, Supplementary Fig. 6). This high level of accuracy was consistent across different layers of our model (Supplementary Fig. 7). Since the training data does not contain genome sequences from bacteria or archaea, these results demonstrate the broad applicability of our model.
Our approach enables de novo generation of bacteriophage genome sequences (Fig. 2a). We generated a total of 1,024 sequences longer than 1K bp and used geNomad15 for functional annotation. Among these, 607 sequences have a virus score greater than zero. Their average length is 43K bp, and the average number of predicted genes per sequence is 67, similar to the training dataset (mean length: 48K bp; average number of predicted genes: 68). The gene length distribution is close to that of the training dataset (Fig. 2b; average gene length: 558 bp vs 646 bp), while the predicted gene numbers show a wider spread (Supplementary Fig. 8). The median virus score of the generated sequences is 0.84 and the maximum is 0.97, comparable to the virus scores of natural bacteriophage genomes, which range from 0.70 to 0.98 (Fig. 2c). Of the 607 generated sequences, 223 (37%) are predicted by geNomad to be Caudoviricetes (Fig. 2d); for comparison, 98% of the genomes in the training dataset were classified as Caudoviricetes. Additionally, 388 sequences were predicted to have bacterial hosts with a probability greater than 0.95, as determined by the DeepHost model16 (Supplementary Fig. 9).
We then examined the 5’UTRs of the annotated genes in the generated sequences to determine whether they contain functional regulatory elements, such as promoters and RBS, to initiate transcription and translation. We chose generated sequence #87 for further analysis due to its high predicted virus score (0.96) and its relatively small size (28K bp). Using a machine learning tool (Promoter Calculator)17, we identified the -35 and -10 boxes of a promoter within the 5’UTR of the predicted phage stabilization protein. Notably, their sequences are close to the established consensus motifs TTGACA and TATAAT (Fig. 2f). Upstream of the start codon of the same gene, we observed a region enriched in adenine (A) and guanine (G) nucleotides, indicative of a functional ribosome binding site (Fig. 2f).
Analyzing all 5’UTRs of the predicted genes in this sequence, we found significantly higher promoter activity than in random sequences of the same length (Fig. 2g). Intriguingly, the proportion of A and G nucleotides peaked around 10 bp upstream of the start codon, aligning closely with the optimal position for an RBS to drive translation initiation (Fig. 2h). This A/G enrichment is consistent across all the generated sequences (Supplementary Fig. 10). In short, our generated sequences harbor functional regulatory elements that could enable expression of the predicted genes.
In our generated sequences, 343 annotated genes were predicted to match geNomad markers. These genes share very little homology with the training dataset (Supplementary Table 1). We employed ESMFold18 to predict their structures and calculated the average predicted local distance difference test (pLDDT) score, which reflects ESMFold’s confidence in the predicted structures. The median pLDDT score for these proteins is higher than that of random peptide sequences (28 vs 18). We further randomly sampled 10K annotated genes from all the generated sequences and also found high pLDDT scores (median of 36, Supplementary Fig. 11), suggesting that the generated proteins are more likely to adopt stable conformations. Functional annotation of all annotated proteins using DeepFRI19 reveals several large families with diverse functional roles, including transporter activity and structural molecule activity (Fig. 2j). Among these, several proteins were predicted to have DNA-binding activity, and their predicted structures resemble the canonical helix-turn-helix (HTH) domain of this protein family (Supplementary Fig. 12).
To the best of our knowledge, our work presents the first long-context generative model for genomic sequences. Our language model effectively learns the language of gene coding and regulatory sequences through a single stage of self-supervised training on unannotated whole genomes. The generated sequences match the length of natural bacteriophage genomes and display functional genomic architectures. It is worth noting that these sequences have not been optimized at the codon or gene level for efficient self-replication in bacteria. However, with further scaling and fine-tuning, we envision that generative genomic models could enable de novo design of whole genomes, offering opportunities for breakthroughs in medicine, agriculture, and environmental science. The field also faces ongoing challenges in ethics, biosafety, and regulation, which are critical to the responsible advancement of generative modeling in synthetic biology.
Methods
Model training
Our training dataset was curated from three sources. First, we downloaded all complete virus genomes from NCBI GenBank, retaining only those with “phage” in the organism name. Second, we downloaded the phage genomes from MGV, including only genomes with a completeness score above 95% that were classified under the order Caudovirales. Third, from GPD we kept all genomes with a completeness score above 0.95. Following this initial collection, we performed an additional round of filtering: we used geNomad to predict the taxonomy of these genomes and removed all genomes whose predicted host is not a unicellular organism. All genomes smaller than 96K bp were used to construct the final training dataset, as sketched below.
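As a minimal sketch, the final length filter could be implemented as follows with Biopython; the completeness and taxonomy filters depend on database-specific metadata and geNomad output, which are omitted here. The function name and FASTA-based workflow are illustrative assumptions, not the exact curation code.

```python
from Bio import SeqIO  # pip install biopython

MAX_LEN = 96_000  # model context window, in bp

def filter_by_length(fasta_in: str, fasta_out: str, max_len: int = MAX_LEN) -> int:
    """Keep only genomes that fit within the model's 96K bp context window."""
    kept = [rec for rec in SeqIO.parse(fasta_in, "fasta") if len(rec.seq) < max_len]
    SeqIO.write(kept, fasta_out, "fasta")
    return len(kept)
```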
Our megaDNA model utilized a three-layer transformer architecture8. Each layer had a depth of 8 and a progressively larger embedding dimension (local: 196, middle: 256, global: 512). The sequence lengths for the three layers were 16, 64, and 128, and the model contains 145M parameters in total. We assigned numerical tokens (1, 2, 3, and 4) to the nucleotides A, T, C, and G, respectively. For model training, we used a batch size of 1 and a learning rate of 0.0002, which was progressively increased during the initial 50,000 steps as part of a warmup schedule. We used the Adam optimizer and applied gradient clipping with a norm of 0.5 to prevent gradient explosion.
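A minimal sketch of the tokenization and training loop described above, assuming a `model` object that implements the multiscale transformer of Yu et al.8 and returns a language-modeling loss when called on a token tensor; the architecture itself is available in the megaDNA repository and is not reproduced here.

```python
import torch

TOKEN = {"A": 1, "T": 2, "C": 3, "G": 4}  # byte-level vocabulary; 0 reserved for padding

def tokenize(genome: str) -> torch.Tensor:
    """Map a nucleotide string to integer tokens at single-nucleotide resolution."""
    return torch.tensor([TOKEN[nt] for nt in genome.upper() if nt in TOKEN])

def train(model, genomes, base_lr=2e-4, warmup_steps=50_000):
    """Training loop with batch size 1, linear warmup, and gradient clipping."""
    optimizer = torch.optim.Adam(model.parameters(), lr=base_lr)
    for step, genome in enumerate(genomes):
        # linearly ramp the learning rate over the first 50,000 steps
        for group in optimizer.param_groups:
            group["lr"] = base_lr * min(1.0, (step + 1) / warmup_steps)
        loss = model(tokenize(genome).unsqueeze(0))  # batch size of 1
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)  # cap gradient norm at 0.5
        optimizer.step()
```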
In silico mutagenesis of phage genomes
Lambda phage genome sequences and annotations were downloaded from NCBI (Accession number: NC_001416.1). Essential genes were identified according to Piya et al.11. We conducted in silico mutagenesis using a 50 bp sliding window across the genome, in which each nucleotide was randomly mutated to A, T, C, or G with equal probability. The impact of mutations was assessed by computing the model loss of each mutated sequence and comparing it with that of the original sequence. For both essential and non-essential genes, we calculated the mean model loss over all windows within the gene coding region, and the Mann-Whitney U test was used to evaluate statistical differences between the two groups (scipy.stats.mannwhitneyu). Furthermore, we simulated mutations targeting the start and stop codons of all coding genes and generated a control set comprising an equivalent number of 3-nt mutations randomly distributed across the genome. The effect of these mutations on model loss was analyzed to infer their impact on fitness. The changes in model loss were used as a predictor of gene essentiality, and we computed the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUROC).
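The mutagenesis scan can be sketched as follows, with `model_loss` standing in for a call to the trained model that returns the loss on a tokenized sequence; gene coordinates are assumed to come from the NC_001416.1 annotation.

```python
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
WINDOW = 50  # bp

def mutate_window(genome: str, start: int) -> str:
    """Randomly replace every nucleotide in one 50 bp window with equal probability."""
    mutated = "".join(rng.choice(list("ATCG")) for _ in range(WINDOW))
    return genome[:start] + mutated + genome[start + WINDOW:]

def gene_score(genome: str, gene_start: int, gene_end: int, model_loss) -> float:
    """Mean loss increase over all 50 bp windows inside one coding region."""
    ref_loss = model_loss(genome)
    deltas = [model_loss(mutate_window(genome, s)) - ref_loss
              for s in range(gene_start, gene_end - WINDOW + 1, WINDOW)]
    return float(np.mean(deltas))

# Group comparison and zero-shot classification, given per-gene scores and labels:
# stat, p = mannwhitneyu(scores[is_essential], scores[~is_essential])
# auroc = roc_auc_score(is_essential, scores)
```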
Prediction of mutational effects on protein function
The DNA sequences of the mutated genes were used as model input. Embeddings from the three layers of the model (dim = 196, 256, and 512) were extracted and concatenated to form a 964-dimensional vector representing each gene coding sequence. We used 5-fold cross-validation to evaluate prediction performance: in each fold, one fifth of the data was held out as the test set while the remaining data were used for training. A ridge regression model was trained on the embeddings and fitness values of the training set with default parameters (sklearn.linear_model.RidgeCV), and its predictive performance was then evaluated on the test set. The infA gene dataset was obtained from Kelsic et al.20. For the T7 bacteriophage dataset13, the genome sequence and annotations were downloaded from NCBI (Accession number: V01146.1), and model performance was evaluated for each gene in the same manner.
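A sketch of this cross-validation procedure, assuming an `embeddings` matrix (n × 964) and a `fitness` vector; Spearman correlation is shown as an illustrative evaluation metric.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold

def ridge_cv_performance(embeddings: np.ndarray, fitness: np.ndarray) -> float:
    """5-fold cross-validation of a ridge regressor on concatenated embeddings."""
    correlations = []
    for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(embeddings):
        regressor = RidgeCV().fit(embeddings[train_idx], fitness[train_idx])
        predictions = regressor.predict(embeddings[test_idx])
        rho, _ = spearmanr(predictions, fitness[test_idx])
        correlations.append(rho)
    return float(np.mean(correlations))
```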
Prediction of translation efficiency
We assessed the translation efficiency (TE) of genes in Klebsiella oxytoca, Pseudomonas protegens Pf-5, and Escherichia coli DH10B by calculating the ratio of average ribosome density (RD) to mRNA expression. The ribosome density of each gene was calculated by averaging ribosome occupancy over the length of the gene, and the mRNA expression in FPKM (fragments per kilobase of transcript per million mapped reads) was calculated by averaging the height of the RNA-seq profile over the length of the gene. The ribosome profiling and RNA-seq datasets of K. oxytoca and P. protegens Pf-5 were obtained from the Sequence Read Archive under accession code PRJNA57976721, and the E. coli DH10B datasets were obtained from the NCBI GEO database under accession number GSE15266422. We used the DNA sequences spanning -160 to +160 bp relative to the start codon as the input to our model.
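A simplified sketch of the TE calculation, assuming per-nucleotide ribosome occupancy and RNA-seq coverage profiles for one gene; FPKM scaling of the RNA-seq profile is assumed to have been applied upstream.

```python
import numpy as np

def translation_efficiency(ribosome_profile: np.ndarray, rnaseq_profile: np.ndarray) -> float:
    """TE = average ribosome density over the gene / average mRNA expression (FPKM)."""
    ribosome_density = np.mean(ribosome_profile)  # RD, averaged over gene length
    mrna_expression = np.mean(rnaseq_profile)     # FPKM, averaged over gene length
    return ribosome_density / mrna_expression
```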
Model embeddings were extracted from the three layers (dim = 196, 256, and 512) and concatenated to form a 964-dimensional vector for each input sequence. To mitigate the influence of lowly expressed genes on TE calculations, we focused on the top 25% of expressed genes in K. oxytoca and E. coli DH10B and the top 20% in P. protegens Pf-5. We used 5-fold cross-validation to evaluate model performance: in each fold, a ridge regression model was trained on the embeddings and TE values of the training set with default parameters (sklearn.linear_model.RidgeCV) and then used to predict TE values in the test set. For the Evfratov et al. dataset14, 20 nt and 30 nt 5’UTR sequences were used as input, and model performance was evaluated as described above.
Taxonomic classification of unannotated sequences
We analyzed 10K bp sequences randomly sampled from bacteriophage, bacteria, and archaea genomes downloaded from NCBI GenBank (n = 5,000 each). For the resulting 15,000 sequences, embeddings were generated from the three layers of our model (dim = 196, 256, and 512). For embedding visualization, we used Uniform Manifold Approximation and Projection (UMAP)23 as implemented in the python package umap-learn. To classify the sequences, we trained a logistic regression model on the embeddings (sklearn.linear_model.LogisticRegression) and assessed its performance using stratified 5-fold cross-validation, which keeps the proportion of samples from each class in every fold the same as in the complete dataset. Within each fold, the model was trained on a subset of the data and used to predict class probabilities on the held-out subset. For each class, we computed the ROC curve and AUROC.
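A sketch of the classification step, assuming an `embeddings` matrix (15,000 × 964) and an integer `labels` vector for the three taxa; the commented UMAP call mirrors the visualization step.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def stratified_auroc(embeddings: np.ndarray, labels: np.ndarray) -> float:
    """Average one-vs-rest AUROC from stratified 5-fold cross-validation."""
    fold_scores = []
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    for train_idx, test_idx in skf.split(embeddings, labels):
        clf = LogisticRegression(max_iter=1000).fit(embeddings[train_idx], labels[train_idx])
        probabilities = clf.predict_proba(embeddings[test_idx])
        fold_scores.append(roc_auc_score(labels[test_idx], probabilities, multi_class="ovr"))
    return float(np.mean(fold_scores))

# 2-D visualization of the embeddings (requires the umap-learn package):
# import umap
# coordinates = umap.UMAP().fit_transform(embeddings)
```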
Model inference
We generated sequences from the trained model using a predefined set of parameters. Specifically, we set the temperature to 0.95 to balance variety and coherence in the generated sequences, and we kept the filter threshold at 0.0 to avoid limiting the range of token probabilities. For model training and inference, we used Nvidia A100 (40GB) and 3090 Ti (24GB) GPUs with PyTorch version 2.1.1.
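These decoding settings correspond to standard temperature sampling; a minimal sketch of one sampling step is shown below, with `logits` assumed to come from the model's output head over the nucleotide vocabulary.

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 0.95) -> int:
    """One step of temperature sampling; a filter threshold of 0.0 keeps all tokens."""
    probabilities = torch.softmax(logits / temperature, dim=-1)
    return int(torch.multinomial(probabilities, num_samples=1).item())
```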
Analysis of the generated sequence
geNomad15 was used for annotation of all generated sequences. The 100 bp region preceding the start codon of each predicted gene was designated as its 5’UTR. We employed the Promoter Calculator17 to identify promoters in these regions; only the promoter with the highest predicted activity was annotated. For protein structure prediction, we used the pretrained ESMFold model v118. The chunk size of the model was set to 64 for proteins longer than 700 amino acids (AA) and 128 for shorter proteins, and we limited structure prediction to proteins shorter than 1,000 AA. Function prediction for these proteins was carried out using the default DeepFRI model19, as available on GitHub (https://github.com/flatironinstitute/DeepFRI), with a score cutoff of 0.5, which was reported to be significant in the original publication. Bacterial host prediction was performed with the DeepHost species-level model16 with default parameters (https://github.com/deepomicslab/DeepHost), and predictions with a probability greater than 0.95 were used in the subsequent analysis.
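A sketch of the 5’UTR extraction and the positional A/G profile underlying Fig. 2h; gene coordinates are assumed to come from the geNomad annotation.

```python
import numpy as np

def extract_utr5(genome: str, gene_start: int, length: int = 100) -> str:
    """The 100 bp immediately upstream of a predicted start codon."""
    return genome[max(0, gene_start - length):gene_start]

def ag_profile(utrs: list[str]) -> np.ndarray:
    """Per-position A/G fraction across equal-length 5'UTRs; a peak ~10 bp
    upstream of the start codon indicates a ribosome binding site."""
    chars = np.array([list(u) for u in utrs])  # shape: (n_utrs, utr_length)
    return ((chars == "A") | (chars == "G")).mean(axis=0)
```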
Data availability
The bacteriophage genomes were downloaded from public databases including NCBI GenBank (ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/), MGV (https://portal.nersc.gov/MGV), and GPD (https://www.sanger.ac.uk/data/gut-phage-database/).
Code availability
Our trained model and model inference code are available on GitHub: https://github.com/lingxusb/megaDNA
Footnotes
This is a major update of the manuscript with a new figure 1 and 24 additional figure panels.