Uniclust databases of clustered and deeply annotated protein sequences and alignments

Milot Mirdita; Lars von den Driesch; Clovis Galiez; Maria J Martin; Johannes Söding; Martin Steinegger

doi:10.1093/nar/gkw1081

Nucleic Acids Res. 2017 Jan 4;45(D1):D170-D176. doi: 10.1093/nar/gkw1081. Epub 2016 Nov 28.

Authors

Milot Mirdita¹, Lars von den Driesch¹,², Clovis Galiez¹, Maria J Martin², Johannes Söding³, Martin Steinegger⁴,⁵,⁶

Affiliations

¹ Quantitative and Computational Biology Group, Max Planck Institute for Biophysical Chemistry, Göttingen, Germany.
² European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Cambridge, UK.
³ Quantitative and Computational Biology Group, Max Planck Institute for Biophysical Chemistry, Göttingen, Germany. soeding@mpibpc.mpg.de
⁴ Quantitative and Computational Biology Group, Max Planck Institute for Biophysical Chemistry, Göttingen, Germany. martin.steinegger@mpibpc.mpg.de
⁵ Department for Bioinformatics and Computational Biology, Technische Universität München, Munich, Germany.
⁶ Department of Chemistry, Seoul National University, Seoul, Korea.

Abstract

We present three clustered protein sequence databases, Uniclust90, Uniclust50 and Uniclust30, and three databases of multiple sequence alignments (MSAs), Uniboost10, Uniboost20 and Uniboost30, as a resource for protein sequence analysis, function prediction and sequence searches. The Uniclust databases cluster UniProtKB sequences at 90%, 50% and 30% pairwise sequence identity. Uniclust90 and Uniclust50 clusters showed better consistency of functional annotation than those of UniRef90 and UniRef50, owing to an optimised clustering pipeline that runs with our MMseqs2 software for fast and sensitive protein sequence searching and clustering. Uniclust sequences are annotated with matches to Pfam, SCOP domains, and proteins in the PDB, using our HHblits homology detection tool. Owing to the high sensitivity of HHblits, Uniclust contains 17% more Pfam domain annotations than UniProt. Uniboost MSAs of three diversities are built by enriching the Uniclust30 MSAs with local sequence matches from MMseqs2 profile searches through Uniclust30. All databases can be downloaded from the Uniclust server at uniclust.mmseqs.com. Users can search clusters by keywords and explore their MSAs, taxonomic representation, and annotations. Uniclust is updated every two months with the new UniProt release.
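To make the idea of clustering at a pairwise sequence identity threshold concrete, the following is a minimal toy sketch. It is not the MMseqs2 pipeline used to build Uniclust: it assumes the sequences are already aligned to equal length, uses a simple greedy strategy, and all names (`identity`, `greedy_cluster`, `SEQS`) and the example sequences are hypothetical.

```python
from typing import Dict, List


def identity(a: str, b: str) -> float:
    """Fraction of identical residues over columns where neither sequence has a gap.

    Assumption: `a` and `b` are already aligned to the same length ('-' marks a gap).
    """
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def greedy_cluster(seqs: Dict[str, str], min_identity: float) -> Dict[str, List[str]]:
    """Toy greedy clustering: each sequence joins the first representative it
    matches at >= min_identity, otherwise it becomes a new representative."""
    clusters: Dict[str, List[str]] = {}
    # Visit longer sequences (ignoring gaps) first; break ties by name.
    order = sorted(seqs, key=lambda k: (-len(seqs[k].replace("-", "")), k))
    for name in order:
        for rep in clusters:
            if identity(seqs[name], seqs[rep]) >= min_identity:
                clusters[rep].append(name)
                break
        else:
            clusters[name] = [name]
    return clusters


# Hypothetical pre-aligned toy sequences (not real UniProtKB entries).
SEQS = {
    "a": "MKTAYIAKQR",
    "b": "MKTAYIAKQK",  # 9/10 identical to "a"
    "c": "MKTAYLSRQK",  # 6/10 identical to "a"
    "d": "GSHWPLSRNE",  # 0/10 identical to "a"
}

for threshold in (0.9, 0.5):
    print(threshold, greedy_cluster(SEQS, threshold))
```

Lowering the threshold from 0.9 to 0.5 merges the more divergent sequence `c` into the cluster represented by `a`, which mirrors in miniature how a 50% database has fewer, larger clusters than a 90% database.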

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Cluster Analysis
Computational Biology / methods*
Databases, Nucleic Acid*
Gene Ontology
Molecular Sequence Annotation
Software*
Web Browser