DisPhaseDB, an integrative database of diseases related variations in liquid-liquid phase separation proteins

Motivation: Proteins involved in liquid-liquid phase separation (LLPS) and membraneless organelles (MLOs) are recognized as decisive for many biological processes and as responsible for several diseases. Despite the recent explosion of research in the area, tools for analysis and for data integration among different repositories are still lacking. Currently, there is no comprehensive, dedicated database that collects all disease-related variations together with the protein location, the biological role in the MLO and all the metadata available for each protein and disease. Disease-related protein variants and additional features are dispersed, and the user has to navigate many databases that differ in focus and format and are often not user friendly.
Results: We present DisPhaseDB, a database dedicated to disease-related variants of proteins that undergo LLPS and/or are involved in MLOs. It integrates 10 databases and contains 5,741 proteins, 1,660,059 variants and 4,051 disease terms. It also offers intuitive navigation and an informative display. It constitutes a pivotal starting point for further analysis, encouraging the development of new computational tools.
Availability and Implementation: The database is freely available at http://disphasedb.leloir.org.ar.
Contact: jiserte@leloir.org.ar and cmb@leloir.org.ar

Introduction
disease related variations. We expect our database to be of interest to researchers studying MLOs, LLPS proteins, diseases, proteins for targeted therapies and specific MLO components in a disease, and also to computational groups developing methods to understand sequence-function relationships and mutational impact.

Selection of proteins involved in LLPS and associated with MLOs
Our starting point was an integrated set of MLO-associated proteins collected in a previous group effort (Orti et al., 2021). It consists of the entries of four databases of LLPS- and MLO-associated proteins, which were compiled, merged, completed and stored in a local database: PhaSePro (Mészáros et al., 2020), PhaSepDB (You et al., 2020), DrLLPS (Ning et al., 2020) and LLPSDB (Li et al., 2019). This set is periodically updated with the databases' new releases. The consolidated dataset is available at http://mlos.leloir.org.ar/ (Orti et al., 2021).
The role of each protein in the LLPS process and its association with MLOs are taken from the annotation of the source database. There are four types of protein role: Driver, Regulator, Client and Unassigned, the last being used when no database describes the protein's role. In addition, we grouped the experimental evidence supporting the roles into low throughput and high throughput, so that users can evaluate their confidence.
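The role assignment described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the DisPhaseDB implementation: the function name `consolidate_roles`, the input layout (one role string or `None` per source database) and the choice to keep every distinct role reported for a protein are hypothetical.

```python
from enum import Enum


class Role(Enum):
    """The four role types named in the text."""
    DRIVER = "Driver"
    REGULATOR = "Regulator"
    CLIENT = "Client"
    UNASSIGNED = "Unassigned"


def consolidate_roles(source_annotations):
    """Collect the roles reported for one protein across source databases.

    source_annotations maps a source database name to the role string it
    reports for the protein, or None when it reports no role.
    (Hypothetical input layout, chosen for illustration.)
    """
    roles = set()
    for role_name in source_annotations.values():
        if role_name is not None:
            roles.add(Role(role_name))
    # "Unassigned" applies only when no source database describes a role.
    return roles or {Role.UNASSIGNED}
```

For instance, `consolidate_roles({"PhaSePro": "Driver", "DrLLPS": None})` yields only the Driver role, while a protein for which every source reports `None` falls back to Unassigned.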

Mutation collection
We obtained human coding variants from ClinVar release 20200402 (Landrum et al., 2018), a public archive of human genetic variants and their interpretation with respect to a clinical condition or phenotype, along with the evidence supporting each association. DisGeNET (Piñero et al., 2020) offers several datasets based on gene-disease associations (GDAs) and variant-disease associations (VDAs). For our database we took mutations from the curated VDA dataset (October 2020), which in turn integrates variants from UniProt, ClinVar, GWASdb (Li et al., 2012) and the GWAS Catalog (Buniello et al., 2019). From UniProt (UniProt Consortium, 2021) we used the dataset of human variants that are manually annotated in UniProtKB/Swiss-Prot (release-2021_02). Lastly, COSMIC release v94 was used to obtain the coding point mutations in human cancers (Tate et al., 2019). In all cases, we mapped variants with genomic coordinates from the human genome assembly GRCh38 onto the canonical protein sequence.
Annotations of diseases and other altered phenotypic effects in ClinVar, COSMIC, DisGeNET and UniProt are consistent neither between databases nor within the same database. They are frequently cross-referenced to one or more ontologies that collect medical terms and/or diseases, such as the Disease Ontology (DO) (Schriml et al., 2019), the Human Phenotype Ontology (HPO) (Köhler et al., 2021), Medical Subject Headings (MeSH) (Nelson, 2009), Medical Genetics (MedGen, https://www.ncbi.nlm.nih.gov/medgen/), the Monarch Merged Disease Ontology (MONDO) (Mungall et al., 2017), the National Cancer Institute Thesaurus (NCI, https://ncim.nci.nih.gov/) and Online Mendelian Inheritance in Man (OMIM) (Amberger et al., 2018), among others. In some cases there is no reference to any ontology. A mutation can be associated with several diseases and vice versa. In this context, studying a variant, a protein or a disease is challenging.
For example, mutation R521C in the FUS protein is associated with different diseases in different ontologies: Melanoma of skin (SNOMEDCT_US: 93655004), amyotrophic lateral sclerosis ALS6 (MEDCIN: 315716 and MedGen: C1842675) and Gastric Carcinoma (NCI: C4911). In addition, many synonyms are annotated for the same disease within one ontology: "Cancer of Stomach", "Cancer of the Stomach", "Carcinoma of Stomach", "Gastric Cancer", etc., all refer to the same disease in NCI. Synonyms also occur across ontologies, for example Cutaneous Melanoma (MedGen: C0151779), Melanoma of skin (SNOMEDCT_US: 93655004) and "Melanoma, Cutaneous Malignant" (OMIM: 155600).
Finally, a disease can be described at different levels of specificity, each with its own term; for example, "Acanthoma" is a type of "Skin Neoplasms". Therefore, mapping all disease terminology onto a single ontology is not feasible. To help the user navigate this tangle of terms across dozens of ontologies when studying a variation or a protein, DisPhaseDB includes all available disease annotations and, when there is no reference to an ontology, a reference to the source mutation database.
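One way to picture the annotation policy described above (keep every ontology cross-reference, and fall back to the source mutation database when none exists) is the following minimal sketch. The class `DiseaseAnnotation`, its fields and the `PLACEHOLDER_SOURCE` value are hypothetical and do not reflect the actual DisPhaseDB schema; the ontology identifiers are those quoted above for FUS R521C, whereas the source database of each annotation is not stated in the text and is therefore left as a placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiseaseAnnotation:
    name: str          # disease term as written by the source
    source_db: str     # mutation database the annotation came from
    xrefs: tuple = ()  # (ontology, identifier) pairs; empty when none exist

    def references(self):
        """Return ontology cross-references, or fall back to the source database."""
        if self.xrefs:
            return [f"{ontology}: {identifier}" for ontology, identifier in self.xrefs]
        return [f"source: {self.source_db}"]


# Annotations quoted in the text for FUS R521C; source_db is a placeholder.
fus_r521c = [
    DiseaseAnnotation("Melanoma of skin", "PLACEHOLDER_SOURCE",
                      (("SNOMEDCT_US", "93655004"),)),
    DiseaseAnnotation("Amyotrophic lateral sclerosis ALS6", "PLACEHOLDER_SOURCE",
                      (("MEDCIN", "315716"), ("MedGen", "C1842675"))),
    DiseaseAnnotation("Gastric Carcinoma", "PLACEHOLDER_SOURCE",
                      (("NCI", "C4911"),)),
]

# Hypothetical annotation without any ontology cross-reference.
unmapped = DiseaseAnnotation("hypothetical unmapped term", "PLACEHOLDER_SOURCE")
```

Keeping the cross-references as a flat list of (ontology, identifier) pairs avoids forcing the terms into a single ontology, which the text argues is not feasible.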

Implementation
The server backend consists of an HTTP web server developed in Python 3.8+ using the Flask framework and MySQL. The client web application was developed with the AngularJS framework.

DisPhaseDB in numbers
We present DisPhaseDB, available at https://disphasedb.leloir.org.ar. DisPhaseDB contains 5,741 LLPS proteins, all of them with experimental evidence supporting their association with MLOs. For these proteins we collected human disease mutations from up-to-date databases: UniProt, ClinVar, DisGeNET and COSMIC. After merging the four databases, the total number of unique coding variants (protein mutations) is 1,660,059. COSMIC contributes 1,464,124, ClinVar 221,097, DisGeNET 56,813 and UniProt 22,965. Supplementary Figure 1 shows the overlap of the four protein variation resources and indicates that all of them are needed to obtain a better landscape of mutations in LLPS proteins. The most common type is missense mutations, followed by synonymous mutations (66.57% and 23.41%, respectively) (Supplementary Figure 2). On average, proteins in DisPhaseDB have around 200 mutations, although a few proteins are exceptionally highly mutated (Figure 1). For example, titin (20,552 mutations) is a key component of striated muscles, and mutations in this protein are related to different types of cardiomyopathies and muscular dystrophies (Hackman et al., 2002; Itoh-Satoh et al., 2002; Matsumoto et al., 2005). BRCA1, BRCA2 and APC (9,172, 12,063 and 9,237 mutations, respectively) are involved in DNA repair and tumor suppression (Kawasaki et al., 2007; Liu et al., 2010; Shukla et al., 2011). It is well known that mutations in these proteins increase the risk of different types of cancer, especially breast, ovarian and colorectal cancer (Easton et al., 2007; Mersch et al., 2015; Yamaguchi et al., 2016). Mutations are not evenly distributed across protein regions: intrinsically disordered regions (IDRs) and low-complexity regions (LCRs) have more mutations than the ordered portion of the protein (Supplementary Figure 3). Mutated proteins are also associated with one or more diseases. Figure 3 (upper panel) shows the number of mutated DisPhaseDB proteins associated with each of the MeSH ontology subheadings in the disease category.
These headings are nodes near the root of the ontology, but the annotations allow drilling down to more specific terms; for example, Supplementary Figure 5 shows the terms under the "neoplasms" subheading disaggregated by site. Since more than 80% of the mutations in DisPhaseDB are contributed by COSMIC (somatic mutations in cancer), Figure 3 (lower panel) shows the distribution of mutated proteins by disease after removing the mutations contributed by COSMIC. Even after removing the COSMIC mutations, proteins associated with neoplasms are still predominant.
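The merging of the four variant sources into a set of unique variants, and the overlap between sources, can be illustrated with a small sketch. It assumes that a variant is identified by a (protein accession, position, reference residue, variant residue) tuple; this key, the function names and the toy data are hypothetical and are not taken from the DisPhaseDB pipeline. The reported per-source counts sum to 1,764,999, more than the 1,660,059 unique variants, which is consistent with some variants being reported by more than one source.

```python
from itertools import combinations


def merge_variant_sources(sources):
    """Merge per-source variant sets and record which sources report each variant.

    sources maps a source name to a set of variant keys (assumed here to be
    (accession, position, reference residue, variant residue) tuples).
    """
    reported_by = {}
    for source_name, variants in sources.items():
        for variant in variants:
            reported_by.setdefault(variant, set()).add(source_name)
    return reported_by


def pairwise_overlap(sources):
    """Count the variants shared by each pair of sources."""
    return {
        (a, b): len(sources[a] & sources[b])
        for a, b in combinations(sorted(sources), 2)
    }


# Toy data with made-up accessions and variants, for illustration only.
toy_sources = {
    "COSMIC": {("PROT_A", 10, "R", "C"), ("PROT_A", 25, "G", "D"), ("PROT_B", 7, "A", "T")},
    "ClinVar": {("PROT_A", 10, "R", "C"), ("PROT_B", 99, "P", "L")},
    "UniProt": {("PROT_A", 10, "R", "C")},
}

merged = merge_variant_sources(toy_sources)
unique_variant_count = len(merged)
overlaps = pairwise_overlap(toy_sources)
```

Recording the reporting sources per variant, rather than only the union, makes it straightforward to exclude one source afterwards, as is done for COSMIC in the lower panel of Figure 3.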

Server usage
DisPhaseDB offers either a quick search by protein, MLO or disease, or an advanced search that applies one or several filters. The available filters are protein, role, MLO, disease name or keyword, evidence (low or high throughput experiments), protein disorder content and mutation type (missense, frameshift, nonsense, etc.). Filters can be combined so that users can customize the set of proteins according to their needs or interests.
As an example, a protein search for Synaptic functional regulator FMR1 (UniProt: Q06787) displays the following sections: I) a protein summary with general information and the FASTA sequence; II) the protein's MLO location; III) protein features mapped onto the sequence, such as regions, domains, disorder content and mutations (disaggregated by type), among others; IV) a mutation summary; and V) a downloadable mutation table. Figure 4 shows sections III and IV of the output. Other searches can retrieve all the proteins with a given role or associated with a particular disease. A search for a particular MLO returns the proteins related to that MLO regardless of their role and disease.
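The way combined filters narrow down a protein set can be sketched as follows. This is an illustrative in-memory version only: the server itself runs on Flask and MySQL, and the record layout, the function `advanced_search`, its parameter names and the toy records are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ProteinRecord:
    accession: str
    roles: set
    mlos: set
    diseases: set
    disorder_content: float  # fraction of disordered residues, from 0 to 1


def advanced_search(records, role=None, mlo=None, disease_keyword=None,
                    min_disorder=None):
    """Return the records that satisfy every filter that was supplied."""
    results = []
    for record in records:
        if role is not None and role not in record.roles:
            continue
        if mlo is not None and mlo not in record.mlos:
            continue
        if disease_keyword is not None and not any(
            disease_keyword.lower() in disease.lower()
            for disease in record.diseases
        ):
            continue
        if min_disorder is not None and record.disorder_content < min_disorder:
            continue
        results.append(record)
    return results


# Toy records with made-up accessions and values, for illustration only.
toy_records = [
    ProteinRecord("PROT_A", {"Driver"}, {"Stress granule"},
                  {"Amyotrophic lateral sclerosis"}, 0.6),
    ProteinRecord("PROT_B", {"Client"}, {"Stress granule", "Nucleolus"},
                  {"Gastric Carcinoma"}, 0.2),
    ProteinRecord("PROT_C", {"Driver"}, {"Nucleolus"}, set(), 0.4),
]
```

A filter left as `None` is simply skipped, so `advanced_search(toy_records, mlo="Stress granule")` behaves like the MLO search described above and returns proteins regardless of role and disease, while adding `role="Driver"` narrows the result further.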

Discussion
To the best of our knowledge, there is no integrated and comprehensive resource for mutations in MLO-associated proteins. For this reason, we integrated all state-of-the-art resources of proteins involved in LLPS and MLOs with four relevant disease databases that annotate medical terms and phenotypic effects. The selected variant databases are not redundant, showing very little overlap among them. Furthermore, the large number of mutation databases makes it difficult to cover the range of diseases or effects with a single one. They are often not user friendly, and they cross-reference different ontologies and many other databases. This highlights the need to unify these resources.
DisPhaseDB also provides mutation mapping over the sequence and metadata associated with the proteins, such as disordered, low-complexity and ordered regions and post-translational modifications, among other features. This resource will therefore be helpful for understanding sequence-function relationships and mutational impact.
We expect DisPhaseDB to help researchers better understand complex human diseases through the lens of phase separation.