RT Journal Article
SR Electronic
T1 Clustering FunFams using sequence embeddings improves EC purity
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 2021.01.21.427551
DO 10.1101/2021.01.21.427551
A1 Maria Littmann
A1 Nicola Bordin
A1 Michael Heinzinger
A1 Christine Orengo
A1 Burkhard Rost
YR 2021
UL http://biorxiv.org/content/early/2021/01/21/2021.01.21.427551.abstract
AB Motivation: Classifying proteins into functional families can improve our understanding of a protein’s function and can allow transferring annotations within the same family. Toward this end, functional families need to be “pure”, i.e., contain only proteins with identical function. Functional Families (FunFams) cluster proteins within CATH superfamilies into such groups of proteins sharing function, based on differentially conserved residues. Of all FunFams, 11% (22,830 of 203,639) contain EC annotations, and of those, 7% (1,526 of 22,830) have at least two different EC annotations, i.e., inconsistent functional annotations. Results: We propose an approach to further cluster FunFams into smaller and functionally more consistent sub-families by encoding their sequences through embeddings. These embeddings originate from deep learned language models (LMs) transferring the knowledge gained from predicting missing amino acids in a sequence (ProtBERT) and have been further optimized to distinguish between proteins belonging to the same or a different CATH superfamily (PB-Tucker). Using distances between sequences in embedding space and DBSCAN to cluster FunFams and to identify outlier sequences yielded twice as many pure clusters per FunFam as random clustering. 52% of the impure FunFams were split into pure clusters, four times more than with random clustering. While functional consistency was mainly measured using EC annotations, we observed similar results for binding annotations. Thus, we expect increased purity for other definitions of function as well.
Our results can help generate FunFams; the resulting clusters with improved functional consistency can be used to infer annotations more reliably. We expect this approach to succeed equally for any other grouping of proteins by their phenotypes. Availability: The source code and PB-Tucker embeddings are available via GitHub: https://github.com/Rostlab/FunFamsClustering
Competing Interest Statement: The authors have declared no competing interest.
Abbreviations: DBSCAN, density-based spatial clustering of applications with noise; d, dimensions; EC, Enzyme Commission; FunFam, functional family; LM, language model; NLP, natural language processing
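The clustering idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (that is in the GitHub repository named above). It assumes toy 2-dimensional vectors in place of PB-Tucker embeddings, Euclidean distance, hypothetical values for `eps` and `min_pts`, and placeholder EC numbers. DBSCAN is written out in plain Python only to keep the example self-contained; a library implementation would normally be used instead.

```python
import math
from collections import defaultdict

NOISE = -1  # label assigned to outlier sequences


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE marks outliers."""
    n = len(points)
    # Each neighborhood includes the point itself.
    neighbors = [
        [j for j in range(n) if euclidean(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point joins the cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:  # core point: keep expanding
                queue.extend(neighbors[j])
    return labels


def pure_fraction(labels, ec_numbers):
    """Fraction of clusters (noise excluded) whose members share one EC number."""
    ecs = defaultdict(set)
    for label, ec in zip(labels, ec_numbers):
        if label != NOISE:
            ecs[label].add(ec)
    if not ecs:
        return 0.0
    return sum(len(s) == 1 for s in ecs.values()) / len(ecs)


# Toy "FunFam": seven sequences with made-up 2-d embeddings and EC numbers.
embeddings = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
              (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
              (20.0, 20.0)]
ec_numbers = ["1.1.1.1"] * 3 + ["2.7.11.1"] * 3 + ["3.4.21.4"]

labels = dbscan(embeddings, eps=0.5, min_pts=2)
print(labels)
print(pure_fraction(labels, ec_numbers))
```

In this toy case the unclustered family mixes three EC numbers, while each cluster found by DBSCAN holds a single EC number and the isolated sequence is flagged as an outlier.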