Quaternary structure evaluation tool for protein assemblies

The Protein Data Bank (PDB) is the single worldwide archive of experimentally determined three-dimensional (3D) structures of proteins and nucleic acids. As of January 2017, the PDB housed more than 125,000 structures and was growing by more than 11,000 structures annually. Since the 3D structure of a protein is vital to understanding the mechanisms of biological processes and diseases and to drug design, correct oligomeric assembly information is of critical importance. For example, it matters whether a protein occurs in nature as a monomer, dimer, trimer, tetramer, or hexamer. Unfortunately, the biologically relevant oligomeric form of a 3D structure is not directly obtainable by X-ray crystallography. Instead, this information may be provided by the PDB Depositor as metadata coming from additional experiments, be inferred by sequence comparisons with similar proteins of known oligomeric state, or be predicted using software, such as PISA (Proteins, Interfaces, Structures and Assemblies) or EPPIC (Evolutionary Protein Protein Interface Classifier). Despite significant efforts by professional PDB Biocurators during data deposition, there remain a number of structures in the archive with incorrect quaternary structure descriptions (or annotations). Further investigation is, therefore, needed to evaluate the correctness of quaternary structure annotations. In this study, we aim to identify the most probable oligomeric states for proteins represented in the PDB. We evaluated the performance of four independent prediction methods: text mining of primary publications, inference from homologous protein structures, and two computational methods (PISA and EPPIC). Aggregating predictions to give consensus results outperformed all four of the independent prediction methods, yielding 86% correct, 9% incorrect, and 5% inconclusive predictions when tested with a well-curated benchmark dataset. We have developed a freely available web-based tool to make this approach accessible to researchers and PDB Biocurators (http://quatstruct.rcsb.org).


Introduction
The Protein Data Bank (PDB, pdb.org) [1] provides detailed information about the three-dimensional (3D) structures of biological macromolecules, including proteins and nucleic acids. The PDB was established in 1971 with only 7 X-ray crystal structures of proteins and now contains more than 125,000 structures (as of January 2017). Today, the PDB archive is managed by the international Worldwide Protein Data Bank (wwPDB, wwpdb.org) partnership [2], which includes the RCSB Protein Data Bank (RCSB PDB, rcsb.org) [1], the Protein Data Bank in Europe (PDBe, pdbe.org), Protein Data Bank Japan (PDBj, pdbj.org), and BioMagResBank (BMRB, bmrb.org). The majority (~90%) of PDB structures were determined by X-ray crystallography. This experimental method yields 3D atomic level structures of the so-called asymmetric unit (Fig 1A), which is the repeating unit that makes up the crystal (Fig 1B). Knowledge of the 3D structure of the asymmetric unit and of the intermolecular interactions among asymmetric units does not provide sufficient information to reveal conclusively the oligomeric structures of protein assemblies, because it is often not possible to distinguish biologically relevant intermolecular contacts from contacts that merely stabilize the crystal lattice. Many proteins form structurally well-characterized, thermodynamically stable multimeric complexes, which are important for biological function [e.g., hemoglobin occurs in nature as a heterotetramer (A2B2) with cyclic (C2) symmetry and dihedral (D2) pseudo-symmetry] [3]. Experimental methods, such as size exclusion chromatography or analytical ultracentrifugation, are sometimes required to ascertain the correct oligomerization state for a protein structure determined by X-ray crystallography.
Alternatively, correct oligomeric state information may be inferred by comparison with better characterized homologous proteins or be provided by the PDB Depositor as metadata. It can also be predicted using computational methods, such as PISA (Proteins, Interfaces, Structures, and Assemblies) [4] or EPPIC (Evolutionary Protein-Protein Interface Classifier) [5]. Since the PDB was established in 1971, oligomeric state information has been obtained from Depositors or predicted by PQS [6] and more recently with PISA. Although experimental evidence for oligomeric state was not a mandatory data item in legacy PDB deposition systems, collection of experimental evidence has been improved in the new wwPDB OneDep global deposition, biocuration, and validation system [7].
Quaternary structures of proteins can be characterized by two main descriptors that define their oligomeric states: stoichiometry and symmetry. Stoichiometry describes the composition of the assembly in terms of the number and type of subunits. There are several widely used methods for determining the stoichiometry of protein complexes, including size exclusion chromatography [8], analytical ultracentrifugation [9], and gel electrophoresis [10,11]. Protein assembly stoichiometry is described using a composition formula. Typically, an uppercase letter, such as A, B, C, etc., represents each type of different protein subunit in alphabetical order. (N.B.: These letters are not the same as the chain identifiers found in PDB archival entries.) The number of equivalent subunits is added as a coefficient next to each letter. For example, the stoichiometry of the two-component human hemoglobin heterotetramer is represented as A2B2 (a dimer of heterodimers composed of two distinct polypeptide chains).
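The composition formula described above can be read mechanically. The following is a minimal sketch, assuming formulas consist only of single uppercase letters, each optionally followed by an integer coefficient (a missing coefficient is taken to mean one copy). The function names parse_stoichiometry and total_subunits are hypothetical and are not part of the tool described in this work.

```python
import re


def parse_stoichiometry(formula):
    """Parse a composition formula such as 'A2B2' into {'A': 2, 'B': 2}.

    Assumption: each subunit type is one uppercase letter, optionally
    followed by an integer coefficient; no coefficient means one copy.
    """
    tokens = re.findall(r"([A-Z])(\d*)", formula)
    # Reject input that is empty or contains anything besides letter/count pairs.
    if not tokens or "".join(letter + count for letter, count in tokens) != formula:
        raise ValueError(f"not a valid composition formula: {formula!r}")
    return {letter: int(count) if count else 1 for letter, count in tokens}


def total_subunits(formula):
    """Return the total number of polypeptide chains in the assembly."""
    return sum(parse_stoichiometry(formula).values())
```

For the hemoglobin example, parse_stoichiometry("A2B2") gives two copies each of subunit types A and B, for four chains in total.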
Symmetry is another important feature of protein tertiary and quaternary structure [12] and plays a key role in understanding protein evolution and structure/function relationships [3], [12], [13], [14], [15], [16]. At the quaternary structure level, we characterize symmetry by the point group, a set of symmetry elements whose symmetry axes pass through a single point [12]. Most oligomeric protein structures (either homomeric or heteromeric) are symmetric, which is probably a simple consequence of how subunits associate in solution without aggregating indefinitely [17], and they can be classified using closed symmetry groups [18]. They are typically described as cyclic (e.g., C2, C3, C4, ...), dihedral (e.g., D2, D3, D4, ...), or cubic (tetrahedral, octahedral, icosahedral). Dihedral and cyclic symmetries are geometrically related: a structure with Dn symmetry can be constructed from n dimers with C2 symmetry or from two n-mers with Cn symmetry [19]. Helical symmetry is a common open symmetry encountered in protein structures.
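The relationship between cyclic and dihedral groups can be made concrete by counting symmetry-equivalent positions. The sketch below returns the order of a rotational point group (n for Cn, 2n for Dn, and 12, 24, and 60 for the tetrahedral, octahedral, and icosahedral groups). The symbols T, O, and I for the cubic groups and the function name point_group_order are assumptions made for illustration.

```python
import re

# Orders of the cubic rotational point groups (tetrahedral, octahedral, icosahedral).
CUBIC_ORDERS = {"T": 12, "O": 24, "I": 60}


def point_group_order(symbol):
    """Return the number of symmetry-equivalent positions for a point group.

    Cn has n positions; Dn has 2n, which matches building it from n dimers
    with C2 symmetry or from two n-mers with Cn symmetry.
    """
    if symbol in CUBIC_ORDERS:
        return CUBIC_ORDERS[symbol]
    match = re.fullmatch(r"([CD])(\d+)", symbol)
    if match is None or int(match.group(2)) < 1:
        raise ValueError(f"unrecognized point group symbol: {symbol!r}")
    n = int(match.group(2))
    return n if match.group(1) == "C" else 2 * n
```

For a homomer with one subunit per equivalent position, this order equals the number of subunits, so a D2 homomer is a tetramer.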
The PDB archive grows by more than 11,000 structures annually. However, because of incomplete data and errors made during data entry, the oligomeric state annotations provided by the PDB are not always correct and reliable [20]. In spite of great efforts to improve the quality of the PDB archive, it has been reported that there are a significant number of PDB entries with incorrect quaternary structure annotations (Fig 2). Levy put the error rate at ~14% [21], while more recently Baskaran et al. [22] reported a lower bound for the error rate of ~7%. Development of methods for accurate detection of incorrect annotations and assignment of the most probable oligomeric state is, therefore, a matter of some urgency. This study has two main objectives: (i) to enable identification of incorrect quaternary structure annotations in the PDB archive, and (ii) to enable assignment of the most probable quaternary structure for such cases. To accomplish these goals, we evaluated four different methods for assessing quaternary structure annotations in the PDB. First, we took an evolutionary approach by clustering proteins related by amino acid sequence and attributing to each member of a given cluster the oligomerization state found to be most prevalent within the cluster. Second, we took a text mining approach by searching through the primary citations of individual PDB entries and extracting information about oligomeric state and experimental evidence thereof. Third and fourth, we took two independent computational approaches using PISA and EPPIC, respectively, to predict oligomeric states. We aggregated results from these methods to generate a consensus prediction for the most probable oligomeric state for each protein structure in the PDB. We tested this combined approach using a well-curated benchmark dataset.
During the course of this effort, we developed an efficient approach to evaluate oligomeric states of protein structures in the PDB, which we have made freely available to both PDB Biocurators and researchers as a web-based tool.

Benchmark Dataset
We aggregated three previously published, manually curated, benchmark datasets to create a considerably larger combined benchmark dataset for this work. The

Sequence Clustering
Since many protein chains in the PDB are similar at the level of sequence, we use this information to cluster polypeptide chains on the basis of amino acid sequence identity and assign a representative oligomeric state for each cluster based on a consistency score. For this purpose, we first constructed sequence clusters at various identity thresholds, including 95%, 90%, 70% and 40%. Clusters were calculated using the BLASTClust algorithm [26], which detects pairwise matches with the blastp algorithm [27] and then places each sequence in a cluster if the sequence matches that of at least one cluster member (Fig 4). A cluster is defined as a set of protein chains that are at least k% identical in sequence to each other over 90% of their length, and is associated with two discrete random variables:
• S_t represents stoichiometry and consists of a list (a,b,c,…,m) giving the number of copies of each unique molecule (e.g., A: monomer, A2: homodimer, A2B2: heterotetramer).
• S_y represents symmetry and takes on values such as C2 (cyclic) or D4 (dihedral).
For a certain sequence identity threshold k, a consistency score for a given stoichiometry t and symmetry y can be estimated by the joint probability of these two events, P(S_t = t, S_y = y). After consistency score calculation, we applied a binary decision rule based on the majority probability to predict a representative quaternary structure for a certain cluster. The maximum consistency score in a cluster must be greater than 0.5 to satisfy the majority rule and to predict a representative quaternary structure; otherwise the result is deemed inconclusive.
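The consistency score and majority rule can be sketched as follows. This is an illustration under a stated assumption: the joint probability is estimated as the fraction of cluster members annotated with a given (stoichiometry, symmetry) pair. The function name representative_state and the tuple representation of a quaternary structure are hypothetical.

```python
from collections import Counter


def representative_state(members, threshold=0.5):
    """Pick a representative quaternary structure for one sequence cluster.

    members: list of (stoichiometry, symmetry) tuples, one per cluster member.
    Returns (state, score); state is None when the result is inconclusive.
    Assumption: the consistency score is the empirical joint frequency of
    the most common (stoichiometry, symmetry) pair in the cluster.
    """
    if not members:
        return None, 0.0
    state, count = Counter(members).most_common(1)[0]
    score = count / len(members)
    # Majority rule: the maximum score must be strictly greater than 0.5.
    return (state if score > threshold else None), score
```

For a cluster with three members annotated ("A2", "C2") and one annotated ("A", "C1"), the score is 0.75 and the homodimer is returned; an even split gives a score of 0.5 and an inconclusive result.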
For the consistency score to be statistically meaningful, there must be a minimum number of members in a cluster. We used our benchmark dataset to determine the minimum number of cluster members for different sequence cluster identity thresholds, including 40%, 70%, 90% and 95%. For this purpose, we varied the minimum cluster size (n=1,…,50) and predicted the most representative oligomeric state for each cluster using the consistency score described above. Then, we calculated the percentage of correct, incorrect, and inconclusive predictions for each minimum number of cluster members.

Text Mining
We first split the full-text article into sentences. Then, we identified sentences containing oligomeric state information using a keyword list (monomer, dimer, trimer, etc.; see Table S1 for the full keyword list). Some of these sentences proved misleading or irrelevant. For example, some sentences described the asymmetric unit rather than the quaternary structure, and some sentences referred to protein structures other than the one of interest. We, therefore, used a machine-learning approach to eliminate non-relevant sentences by classifying sentences as quaternary structure relevant (positive) or irrelevant (negative). Because traditional machine learning algorithms require numerical inputs, each sentence first had to be tokenized into words. Then, we converted each word to a numerical value using the term frequency-inverse document frequency (tf-idf) method [28] to create a numerical data matrix. This method reflects how important a word is to a document in a corpus using the following formula:
$$\text{tf-idf}(n, s) = tf(n, s) \times \log\left(\frac{N}{df(n)}\right)$$
where tf(n,s) represents the frequency of word n in sentence s, df(n) represents the number of sentences containing word n, and N is the total number of sentences. To improve the effectiveness of the tf-idf method, all words in each sentence were converted to lower case, with extra spaces and internal punctuation marks removed.
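The tf-idf weighting can be illustrated with a short sketch. It assumes the common form tf(n,s) × log(N/df(n)) with a natural logarithm; the exact variant used in this work may differ. The function names tokenize and tfidf_matrix are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lower-case a sentence and split it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def tfidf_matrix(sentences):
    """Return (vocabulary, matrix) with one row per sentence, one column per word.

    Assumption: weight = tf(n, s) * log(N / df(n)), where N is the number of
    sentences and df(n) the number of sentences containing word n.
    """
    term_counts = [Counter(tokenize(s)) for s in sentences]
    n_sentences = len(term_counts)
    doc_freq = Counter()
    for counts in term_counts:
        doc_freq.update(counts.keys())  # each sentence counts once per word
    vocabulary = sorted(doc_freq)
    matrix = [
        [counts[word] * math.log(n_sentences / doc_freq[word]) for word in vocabulary]
        for counts in term_counts
    ]
    return vocabulary, matrix
```

Under this form, a word that appears in every sentence receives a weight of zero, while a word such as "dimer" that appears in only a few sentences is weighted more heavily.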
Finally, we calculated the tf-idf score for each word in each sentence and created a data matrix for the training procedure (wherein each row represented a single sentence and each column represented a unique word). To avoid a high-dimensional data matrix, we mapped each column (i.e., feature) to a hash table using a hash function. In this study, we used the feature hashing approach proposed by Weinberger et al. [29] with the MurmurHash3 hash function. After applying the hashing function, we used two machine-learning algorithms, support vector machines (SVM) [30] and boosted logistic regression (BLR) [31][32][33][34], to classify each sentence in a paper as positive or negative.
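The idea of mapping features to a fixed-size hash table can be sketched as below. To stay within the Python standard library, this example substitutes zlib.crc32 for MurmurHash3, so the bucket assignments will not match those of the R FeatureHashing package. The function name hash_features, the bucket count, and the use of the top hash bit as a sign are all illustrative assumptions.

```python
import zlib


def hash_features(weights, n_buckets=16):
    """Project {word: weight} onto a fixed-length vector via the hashing trick.

    Assumption: zlib.crc32 stands in for MurmurHash3. The top bit of the
    hash selects a sign so that colliding features tend to cancel rather
    than accumulate.
    """
    vector = [0.0] * n_buckets
    for word, weight in weights.items():
        digest = zlib.crc32(word.encode("utf-8"))  # unsigned 32-bit integer
        index = digest % n_buckets
        sign = 1.0 if (digest >> 31) & 1 == 0 else -1.0
        vector[index] += sign * weight
    return vector
```

Whatever the vocabulary size, every sentence is represented by a vector of length n_buckets, at the cost of occasional collisions between unrelated words.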
A similar approach was used to search sentences for experimental evidence of oligomeric state. An experimental evidence keyword list was used for this task (see S2  To train and test our machine learning algorithms for the text mining approach, we created a dataset using PDB primary papers. For this task, first, we extracted sentences from the papers using the keyword list in the S1

PISA Prediction
Following successful crystallographic structure determination, efforts are made to identify biologically relevant intermolecular interactions within the crystal [35] and to distinguish them from intermolecular contacts that simply stabilize the crystal lattice. The PISA program, developed by Krissinel and Henrick [4], uses a quantitative approach to address this problem [35]. The stability of an oligomeric structure is a function of the free energy of formation, solvation energy gain, interface area, hydrogen bonds, salt bridges across the interface, and hydrophobic specificity [4]. PISA uses these properties to analyze protein structures and predict possible stable oligomeric states. Following successful evaluation (i.e., 90% accuracy [20]) using the Ponstingl et al. [23] benchmark data in 2007, PISA was deployed as a web server at the European Bioinformatics Institute (EBI) [35]. Soon thereafter, it was adopted as a quaternary structure validation and annotation tool for PDB archival depositions. BioJava is used to assign stoichiometry and symmetry.

EPPIC Prediction
To distinguish intermolecular contacts stabilizing biologically relevant oligomeric states from simple crystal contacts, an evolutionary-based classifier (EPPIC; http://www.eppic-web.org) was developed by Duarte et al. [5]. This method uses a geometric measure, number of interfacial residues, and evolutionary features to classify

Consensus Result Approach
Predictions from sequence clustering, text mining, PISA, and EPPIC were combined and a majority vote rule was applied to arrive at a consensus result. In the following cases, predictions from individual methods were excluded:
• Sequence clustering: insufficient homologous proteins comprising the cluster (at least 3 structures are required at 70% sequence identity).
• Text mining: no full-text publication available or no clear information regarding quaternary structure therein.
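The majority vote with exclusions can be sketched as follows. The tie-handling shown here is an assumption: a state is accepted only if it receives a strict majority of the votes that remain after exclusions, and anything else is reported as inconclusive. The function name consensus and the string encoding of states are hypothetical.

```python
from collections import Counter


def consensus(predictions):
    """Combine per-method predictions into one consensus state.

    predictions: dict mapping method name to a predicted state, or to None
    when that method is excluded or inconclusive for this entry.
    Assumption: a strict majority of the remaining votes is required.
    """
    votes = [state for state in predictions.values() if state is not None]
    if not votes:
        return None
    state, count = Counter(votes).most_common(1)[0]
    return state if count > len(votes) / 2 else None
```

For example, if text mining is excluded and sequence clustering and PISA both predict a C2 homodimer while EPPIC predicts a monomer, the consensus is the homodimer with two of three votes.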

Web-Tool Development
To make this approach accessible, we developed a user-friendly, easy-to-use web-based tool using R, JavaScript, jQuery, CSS, and HTML (Fig 8). The tool was predominantly constructed using R software [40]. Publications were downloaded in either PDF or XML format, and the XML [41], RCurl [42], tm [43], NLP [44], openNLP [45], and stringr [46] packages were used to convert PDF files to plain text files, to parse XML files, to extract sentences, and to split words. The FeatureHashing package [47] was used to map features to the hash table [29]. Machine learning algorithms were trained and tested using the caret package [48]. Data tables were built using the DT package [49], and the shiny package [50] was used to create an interactive web-based application.
ii. Generate a PISA oligomeric state prediction for a given structure, rebuild the quaternary structure, and assign stoichiometry and symmetry using BioJava.
iii. Generate an EPPIC oligomeric state prediction for a given structure, including stoichiometry and symmetry.
iv. Extract oligomeric state information with experimental evidence from any publication describing a crystal structure of a protein.
The tool has a simple user interface, requiring only four-character PDB IDs as input and a single mouse click to launch the calculation. Users can either enter a single PDB ID or, to process multiple PDB entries, upload a .txt file containing multiple PDB IDs. Users can also upload a PDF version of a paper within the text mining module.
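Reading a list of PDB IDs from an uploaded text file can be sketched as below. The accepted format is an assumption: four characters, starting with a digit and followed by three letters or digits, separated by whitespace, commas, or semicolons. The function name parse_pdb_ids is hypothetical, and the actual tool may validate input differently.

```python
import re

# Assumed format of a classic PDB ID: one digit followed by three alphanumerics.
PDB_ID_PATTERN = re.compile(r"[0-9][A-Za-z0-9]{3}")


def parse_pdb_ids(text):
    """Split file contents into unique, upper-cased PDB IDs and rejected tokens."""
    accepted, rejected = [], []
    for token in re.split(r"[\s,;]+", text.strip()):
        if not token:
            continue
        if PDB_ID_PATTERN.fullmatch(token):
            pdb_id = token.upper()
            if pdb_id not in accepted:  # keep first occurrence only
                accepted.append(pdb_id)
        else:
            rejected.append(token)
    return accepted, rejected
```

Returning the rejected tokens alongside the accepted IDs lets an interface report malformed input to the user instead of silently dropping it.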

Benchmark Dataset Results
We applied sequence clustering, text mining, PISA, and EPPIC to our 543-structure benchmark dataset to test the performance of each approach. Then, we aggregated all available results to arrive at a consensus result. S3 Table lists the individual predictions and S4 Table summarizes the results for individual methods and the consensus predictions. Accuracy rates for individual methods ranged between 46% and 81%. We achieved an 86% accuracy rate using the consensus approach (Fig 9).
Since text mining (TM) yields a high proportion of inconclusive results, we also provide a 3-method consensus based on sequence clustering (SC), PISA, and EPPIC results. After exclusion of TM, the accuracy rate decreased to 81%, while the incorrect rate increased to 11% and the inconclusive rate increased to 8%. Thus, despite its highly inconclusive nature, TM provides useful insights regarding correct quaternary structure information.
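The correct, incorrect, and inconclusive rates reported above can be computed from per-entry predictions as in the following sketch. It assumes an inconclusive prediction is represented by None (or by a missing entry) and that rates are taken over all benchmark entries. The function name evaluate and the example PDB IDs in the usage note are hypothetical.

```python
def evaluate(predicted, truth):
    """Return percentages of correct, incorrect, and inconclusive predictions.

    predicted: dict mapping PDB ID to predicted state, or None if inconclusive.
    truth: dict mapping PDB ID to the curated benchmark state.
    """
    if not truth:
        raise ValueError("benchmark is empty")
    counts = {"correct": 0, "incorrect": 0, "inconclusive": 0}
    for pdb_id, expected in truth.items():
        state = predicted.get(pdb_id)  # a missing entry counts as inconclusive
        if state is None:
            counts["inconclusive"] += 1
        elif state == expected:
            counts["correct"] += 1
        else:
            counts["incorrect"] += 1
    return {label: 100.0 * n / len(truth) for label, n in counts.items()}
```

Running the same function on the 4-method and 3-method consensus predictions would allow the two sets of rates to be compared on the same benchmark entries.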

Discussion
The oligomeric state (or states) of a protein represents the essential biological unit(s) that carry out a biological function in a living organism [4]. It is, therefore, crucial to determine with high reliability the biologically relevant oligomerization state(s) of a macromolecule. This problem has been extensively studied and the scientific literature abounds with various methods and approaches. An extensive analysis of the problem can be found in Capitani et al. [20], wherein the authors reviewed the main concepts of different approaches, including thermodynamic estimation of interface stability, evolutionary approaches, and interface co-occurrence across different crystal forms.
The difficulty of the protein interface classification problem has led to incomplete or ambiguous experimental data, which in turn has resulted in incorrect annotations of the oligomerization states of macromolecules in the PDB. Another source of error is simple mistakes made during the deposition process. For example, authors sometimes simply identify the quaternary structure of the asymmetric unit. To the best of our knowledge, there are only two studies in the literature that have investigated quaternary structure annotation errors in the PDB [21,22].
In this study, we focused on two main issues: (i) detection of the incorrectly papers. In addition, they do not provide all the publications through these services; instead they only provide a small subset of them. We found the CrossRef TDM service to be the most useful for text mining purposes. It provides full-text links from various publishers in a user-friendly and easy-to-access way. However, since some publishers share only a small part of their paper repositories, we were able to access only 30% of the primary articles of the PDB entries.
In addition to the 4-method consensus results, we also provided a 3-method consensus based on SC, PISA and EPPIC results by excluding TM. The 3-method consensus has the advantage that it can be applied to most structures in the PDB. On the other hand, TM results are useful to guide the user/biocurator to the specific papers/sentences that contain statements about the experimental evidence, as well as a resource to verify the author annotations. Therefore, we provided 3-method consensus as well as 4-method consensus in our web-tool.

Conclusion
Determination of a 3D macromolecular structure is crucial to understanding the fundamental mechanisms of biological processes, such as enzymatic reactions, ligand binding, or signalling. It is also important for revealing the underlying mechanisms of diseases, such as the effects of genetic variations. Furthermore, the 3D structures of macromolecules are vital for drug design and development studies, especially in structure-based drug design. For these reasons, correct quaternary structure information is of critical importance.
In this study, we developed a consensus approach by aggregating predictions from three or four different methods in order to detect incorrect quaternary structures in the PDB and assign the most likely quaternary structures for the possibly incorrect annotations. For this task, we first used homology to cluster similar PDB entries based on a certain sequence identity threshold and to predict a representative quaternary structure for the cluster through a consistency score calculation. inconclusive predictions, respectively. Therefore, our method provides more reliable evaluation than any single approach. Finally, we developed a web-based tool in order to make this approach usable for researchers in the field and PDB Biocurators.

Availability
The tool is freely available at http://quatstruct.rcsb.org. All source code is available in a GitHub repository at https://github.com/selcukorkmaz/BET. This tool will be updated regularly to include new quaternary structures in the PDB.