RT Journal Article
SR Electronic
T1 MSA Transformer
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 2021.02.12.430858
DO 10.1101/2021.02.12.430858
A1 Roshan Rao
A1 Jason Liu
A1 Robert Verkuil
A1 Joshua Meier
A1 John F. Canny
A1 Pieter Abbeel
A1 Tom Sercu
A1 Alexander Rives
YR 2021
UL http://biorxiv.org/content/early/2021/08/27/2021.02.12.430858.abstract
AB Unsupervised protein language models trained across millions of diverse sequences learn structure and function of proteins. Protein language models studied to date have been trained to perform inference from individual sequences. The longstanding approach in computational biology has been to make inferences from a family of evolutionarily related sequences by fitting a model to each family independently. In this work we combine the two paradigms. We introduce a protein language model which takes as input a set of sequences in the form of a multiple sequence alignment. The model interleaves row and column attention across the input sequences and is trained with a variant of the masked language modeling objective across many protein families. The performance of the model surpasses current state-of-the-art unsupervised structure learning methods by a wide margin, with far greater parameter efficiency than prior state-of-the-art protein language models. Competing Interest Statement: The authors have declared no competing interest.
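The abstract's key architectural idea is a transformer block that interleaves attention across the rows (sequences) and columns (alignment positions) of a multiple sequence alignment. The snippet below is a minimal sketch of that idea only, not the authors' released implementation; the module names, tensor shapes, and hyperparameters are illustrative assumptions, and the masked language modeling head described in the abstract is omitted.

```python
# Minimal sketch (assumed shapes and names) of interleaved row/column attention over an MSA.
import torch
import torch.nn as nn


class MSABlock(nn.Module):
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.row_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.col_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_row = nn.LayerNorm(dim)
        self.norm_col = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (rows, cols, dim) -- one MSA with `rows` aligned sequences of length `cols`.
        # Row attention: each sequence attends over its own alignment positions.
        h = self.norm_row(x)                              # rows act as the batch dimension
        x = x + self.row_attn(h, h, h, need_weights=False)[0]
        # Column attention: each alignment column attends across the sequences.
        h = self.norm_col(x).transpose(0, 1)              # (cols, rows, dim); columns as batch
        x = x + self.col_attn(h, h, h, need_weights=False)[0].transpose(0, 1)
        return x + self.ffn(x)


if __name__ == "__main__":
    msa = torch.randn(16, 128, 64)   # 16 sequences x 128 alignment columns x 64 features
    print(MSABlock()(msa).shape)     # torch.Size([16, 128, 64])
```

Alternating the two attention patterns lets information flow both along each sequence and across evolutionarily related sequences at the same column, which is what the abstract credits for the model's structure-learning performance; the full model and training code are available from the authors rather than reconstructed here.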