Abstract
The expanding catalog of genome-wide association studies (GWAS) provides biological insights across a variety of species, but identifying the causal variants behind these associations remains a significant challenge. Experimental validation is both labor-intensive and costly, highlighting the need for accurate, scalable computational methods to predict the effects of genetic variants across the entire genome. Inspired by recent progress in natural language processing, unsupervised pre-training on large protein sequence databases has proven successful in extracting complex information related to proteins. These models have been shown to learn variant effects in coding regions in a zero-shot manner. Expanding on this idea, we introduce the Genomic Pre-trained Network (GPN), a model designed to learn genome-wide variant effects through unsupervised pre-training on genomic DNA sequences. Our model also successfully learns gene structure and DNA motifs without any supervision. To demonstrate its utility, we train GPN on unaligned reference genomes of Arabidopsis thaliana and seven related species within the Brassicales order, and test its ability to predict the functional impact of genetic variants in Arabidopsis thaliana using allele frequencies from the 1001 Genomes Project and a comprehensive database of GWAS. Notably, GPN outperforms predictors based on popular conservation scores such as phyloP and phastCons. Our predictions for Arabidopsis thaliana can be visualized as sequence logos in the UCSC Genome Browser (https://genome.ucsc.edu/s/gbenegas/gpn-arabidopsis). We provide code (https://github.com/songlab-cal/gpn) to train GPN for any given species using its DNA sequence alone, enabling the zero-shot prediction of variant effects across the entire genome.
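As a rough illustration of what zero-shot variant scoring can look like, the sketch below assumes that a masked DNA language model returns a probability for each of the four nucleotides at a masked position, and scores a variant as the log-likelihood ratio of the alternate allele to the reference allele. This scoring rule is a common choice for such models rather than something stated in the abstract, and the function name `variant_llr` and the example probabilities are hypothetical; none of it is taken from the GPN codebase.

```python
import math

NUCLEOTIDES = ("A", "C", "G", "T")


def variant_llr(probs, ref, alt):
    """Return log(P(alt) / P(ref)) at a masked position.

    probs is a dict mapping each nucleotide to a predicted probability.
    It is a hypothetical input here; a real model would compute it from
    the sequence context around the masked position.
    A negative score means the model finds alt less plausible than ref.
    """
    for allele in (ref, alt):
        if allele not in NUCLEOTIDES:
            raise ValueError(f"unknown allele: {allele!r}")
    if ref == alt:
        raise ValueError("ref and alt must differ")

    total = sum(probs[n] for n in NUCLEOTIDES)
    if not math.isclose(total, 1.0, abs_tol=1e-6):
        raise ValueError("probabilities must sum to 1")

    eps = 1e-12  # guards against log(0) for alleles with zero probability
    return math.log(probs[alt] + eps) - math.log(probs[ref] + eps)


# Made-up probabilities for a position where the context strongly favors A.
example_probs = {"A": 0.90, "C": 0.04, "G": 0.03, "T": 0.03}
score = variant_llr(example_probs, ref="A", alt="G")
print(f"A>G score: {score:.3f}")
```

Under this assumed scoring rule, variants that replace a strongly predicted reference allele receive large negative scores, which is the sense in which a model trained on DNA sequence alone can rank variants without any labeled training data.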
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
We have updated the model, expanded the training set, and carried out additional evaluations. In particular, GPN is now trained on unaligned genomes of multiple species within the Brassicales order. A comprehensive database of GWAS hits for Arabidopsis thaliana is used to evaluate the model's ability to predict genome-wide variant effects. Our predictions can be visualized as sequence logos in the UCSC Genome Browser.