Abstract
In the field of artificial intelligence, a combination of scale in data and model capacity enabled by unsupervised learning has led to major advances in representation learning and statistical generation. In the life sciences, the anticipated growth of sequencing promises unprecedented data on natural sequence diversity. Evolutionary-scale language modeling is a logical step toward predictive and generative artificial intelligence for biology. To this end we use unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250 million protein sequences spanning evolutionary diversity. The resulting model contains information about biological properties in its representations. The representations are learned from sequence data alone; no information about the properties of the sequences is given through supervision or domain-specific features. The learned representation space has a multi-scale organization reflecting structure from the level of biochemical properties of amino acids to remote homology of proteins. Information about secondary and tertiary structure is encoded in the representations and can be identified by linear projections. Representation learning produces features that generalize across a range of applications, enabling state-of-the-art supervised prediction of mutational effects and secondary structure, and improving state-of-the-art features for long-range contact prediction.
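To make the linear-projection probing described above concrete, the following is a minimal sketch (not the authors' code) of testing whether per-residue structural information is linearly recoverable from a pretrained model's representations. The `embed_sequence` function and the toy three-class secondary-structure labels are hypothetical stand-ins for the pretrained language model and annotated data.

```python
# Minimal sketch of a linear probe over per-residue protein representations.
# Assumptions: embed_sequence() stands in for a pretrained language model that
# returns an (L, D) array of per-residue embeddings; labels are 3-class
# secondary-structure annotations (helix / strand / coil) of length L.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def embed_sequence(seq, dim=1280):
    """Placeholder for per-residue embeddings from a pretrained model."""
    return rng.normal(size=(len(seq), dim))

# Toy training data: (sequence, per-residue label) pairs.
train = [("MKTAYIAKQR", rng.integers(0, 3, size=10)),
         ("GSHMLEDPVA", rng.integers(0, 3, size=10))]

# Stack residues from all sequences into one supervised dataset.
X = np.concatenate([embed_sequence(seq) for seq, _ in train])
y = np.concatenate([labels for _, labels in train])

# Fit a purely linear classifier on the frozen embeddings.
probe = LogisticRegression(max_iter=1000)
probe.fit(X, y)
print("train accuracy:", probe.score(X, y))
```

Because the probe is linear and the embeddings are frozen, any above-chance accuracy indicates that the structural information is already linearly encoded in the representations rather than being computed by the probe itself.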
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
† Work performed while at Facebook AI Research
v3 (Aug 2020): Paper thoroughly revised, with improved results and updated methodology. Main changes: (a) improved pre-training setup and experimental results studying sequence diversity; (b) additional benchmarks on remote homology detection, secondary structure prediction, contact prediction, and mutational effects; (c) combination of pre-training features and classical features for secondary structure and contact prediction.