A configurable model of the synaptic proteome reveals the molecular mechanisms of disease co-morbidity

Synapses contain highly complex proteomes which control synaptic transmission, cognition and behaviour. Genes encoding synaptic proteins are associated with neuronal disorders many of which show clinical co-morbidity. Our hypothesis is that there is mechanistic overlap that is emergent from the network properties of the molecular complex. To test this requires a detailed and comprehensive molecular network model. We integrated 57 published synaptic proteomic datasets obtained between 2000 and 2019 that describe over 7000 proteins. The complexity of the postsynaptic proteome is reaching an asymptote with a core set of ~3000 proteins, with less data on the presynaptic terminal, where each new study reveals new components in its landscape. To complete the network, we added direct protein-protein interaction data and functional metadata including disease association. The resulting amalgamated molecular interaction network model is embedded into a SQLite database. The database is highly flexible allowing the widest range of queries to derive custom network models based on meta-data including species, disease association, synaptic compartment, brain region, and method of extraction. This network model enables us to perform in-depth analyses that dissect molecular pathways of multiple diseases revealing shared and unique protein components. We can clearly identify common and unique molecular profiles for co-morbid neurological disorders such as Schizophrenia and Bipolar Disorder and even disease comorbidities which span biological systems such as the intersection of Alzheimer’s Disease with Hypertension.

measures such as degree and betweenness reveal influential signalling proteins within the network. The 'scale-free' property [20] found in many biological networks can be used to identify hub genes, which are often disease-related [19]. Clustering algorithms attempt to identify densely connected communities (or modules) of vertices within the network [21,22]. Synaptic gene-disease or gene-function association data can then be annotated onto these communities and tests made for functional/disease enrichment at the level of the cluster. This can be used to predict gene-sets associated with known synaptic diseases, which are useful for elucidating molecular mechanisms [22]. Using this network approach, we can then test where these gene-sets overlap across different disorders.

Collating the synaptic proteome.

We systematically curated synaptic proteomic datasets from the literature to produce a comprehensive index of the proteins (and their genes) reported at the synapse. To find proteomic studies we searched PubMed using the key words: "synaptic", "postsynaptic", "presynaptic", "synaptosome". Preference was given to studies focussing on mammalian brains in healthy/normal experimental conditions. The many disease-specific studies identified were not included. A total of 57 papers describing a landscape of 7814 synaptic genes (Table 1) were annotated with the following metadata: PubMed ID, species, method of extraction, number of identified proteins and brain region. Each study's respective protein list was extracted and mapped to stable identifiers (MGI, Entrez and UniProt) for the predicted set of orthologues for three species (human, mouse, rat). Additional functional information (e.g. GO function, disease association) was overlaid as metadata onto the vertices. Supplementary

The majority of PSP samples were collected from whole brain [3,10,29,32,33] and forebrain [1,2,4-6,23,27,34,35]. However, several studies focus on specific brain regions, e.g. hippocampus [12,24,36,37] and cerebellum [23,25]. Several studies that considered multiple distinct brain regions simultaneously were also included [11,38,39].

The discovery rate of new PSD proteins was analysed across the multiple studies as shown in Figure 1A, where the number of proteins is plotted against the frequency of identification. Figure 1B

High-throughput proteomic techniques are powerful, but they are noisy, and contamination is always a concern. A large number (2091) of PSP proteins have been observed just once. While single hits may be accounted for by a lack of sensitivity for low-abundance molecules, they could also indicate the presence of false-positive components brought in by experimental uncertainty.
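The frequency-of-identification tally described above can be sketched in a few lines. This is a minimal Python illustration, not the authors' pipeline (which was run in R); the study labels and gene lists are toy placeholders, and the function names are hypothetical.

```python
from collections import Counter

# Toy placeholder input: study label -> set of gene symbols reported by that study.
# These are NOT the curated datasets; they only illustrate the counting.
studies = {
    "study_A": {"DLG4", "GRIN1", "CAMK2A", "SHANK3"},
    "study_B": {"DLG4", "GRIN1", "SYNGAP1"},
    "study_C": {"DLG4", "CAMK2A", "HOMER1"},
}

def identification_frequency(studies):
    """Count in how many studies each gene was reported."""
    counts = Counter()
    for genes in studies.values():
        # Each study holds a set, so it contributes at most 1 per gene.
        counts.update(genes)
    return counts

def frequency_histogram(counts):
    """Map 'number of studies' -> 'number of genes seen that many times'."""
    return Counter(counts.values())

counts = identification_frequency(studies)
hist = frequency_histogram(counts)

# Genes reported by exactly one study ("single hits").
single_hits = sorted(gene for gene, n in counts.items() if n == 1)
```

Here `hist` gives the data behind a plot of protein number against frequency of identification, and `single_hits` is the set whose reliability the text questions.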

The rate of growth with respect to newly discovered proteins appears to be slowing (Figure 1B) and therefore there is now an opportunity to define a more reliable subset. Following the approach described in [1], we selected genes found in two or more independent studies to designate the "consensus" PSP. This resulted in 3441 genes, which is ~7 times larger than reported by [1], and describes a subset of synaptic proteins in which we have higher confidence. In this subset we observe that the increment of new genes per year decreases after 2008 and drops completely after 2014 (Figure 1C). Seeing that the accumulated number of consensus PSP genes is plateauing, we performed a non-linear fit to extrapolate a predicted total number of consensus PSP genes (Figure 1F, Methods). From the fit, we predict the total number of consensus PSP genes to be 3499 (P = 2.36E-11, residual standard error: 192.7 on 12 degrees of freedom) (Figure 1F)

The frequency of identification of the presynaptic proteins is shown in Figure 1D. From Figure 1E, we see that two jumps in newly discovered proteins correspond to studies performed in 2010 and 2014. Approximately half of the proteins in the presynaptic proteome (1064 genes) have been reported more than twice, which can be viewed as a rough estimate for the consensus presynapse; however, the recent trend in newly identified genes indicates that saturation has not yet been achieved (see

As before, most of the studies (9/11) could be classified as shotgun proteomic experiments where the entire synaptosome is analysed for its molecular components [12,52-58]. 
Two studies describe the molecular structure of specific protein

In addition to the proteomic, interactomic data and metadata mentioned above, the database also includes the GO function information for three species: mouse, rat and

We provide the data as simple flat tables for maximum flexibility and as a SQLite implementation. For anyone who wishes to explore the data we also provide a simple walkthrough of how to install the system and run the following common uses/queries from SQLite Studio or R (Supplementary Files). The following use cases illustrate how the database can be used to obtain information for specific protein(s) and build the PPI network for a subset of proteins (1,2), and how the PPI network can be used for further analysis (3,4

Here, FullGeneFullPaperFullRegion is a database view that combines the major information for all genes in a more convenient spreadsheet-style representation.
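As a rough illustration of how such a view might be queried, the sketch below builds a tiny in-memory SQLite database with Python's standard `sqlite3` module. Only the view name `FullGeneFullPaperFullRegion` comes from the text; every table name, column name and row here is a hypothetical placeholder, and the real schema in the Supplementary Files may differ.

```python
import sqlite3

# In-memory toy database. All tables, columns and rows are hypothetical
# placeholders; only the view name is taken from the text.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Gene (ID INTEGER PRIMARY KEY, HumanName TEXT);
CREATE TABLE Paper (PaperID INTEGER PRIMARY KEY, Year INTEGER, Localisation TEXT);
CREATE TABLE PaperGene (GeneID INTEGER, PaperID INTEGER, BrainRegion TEXT);

INSERT INTO Gene VALUES (1, 'DLG4'), (2, 'SYN1');
INSERT INTO Paper VALUES (101, 2004, 'Postsynaptic'), (102, 2010, 'Presynaptic');
INSERT INTO PaperGene VALUES (1, 101, 'Forebrain'),
                             (1, 102, 'Hippocampus'),
                             (2, 102, 'Hippocampus');

-- One spreadsheet-style row per (gene, paper, region) combination.
CREATE VIEW FullGeneFullPaperFullRegion AS
SELECT g.ID AS GeneID, g.HumanName, p.PaperID, p.Year,
       p.Localisation, pg.BrainRegion
FROM PaperGene AS pg
JOIN Gene AS g ON g.ID = pg.GeneID
JOIN Paper AS p ON p.PaperID = pg.PaperID;
""")

def studies_for_gene(conn, human_name):
    """Return every (paper, year, compartment, region) row for one gene."""
    cursor = conn.execute(
        "SELECT PaperID, Year, Localisation, BrainRegion "
        "FROM FullGeneFullPaperFullRegion "
        "WHERE HumanName = ? ORDER BY Year",
        (human_name,),  # parameter binding avoids string interpolation
    )
    return cursor.fetchall()

rows = studies_for_gene(conn, "DLG4")
```

The same `SELECT` statement could be pasted into a SQLite client once the column names are replaced with those of the released database.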

The database returns the results in the form of a table (Table 3 contains

Clustering is commonly used to identify substructures/communities within networks. Concrete clustering methods usually assign each protein to the single most likely community, despite many proteins actually being involved in multiple communities. To account for this we calculate the probability of each protein being involved in every cluster in the network. The more communities a specific protein "bridges" (has a probability of belonging to), the more likely it is to be involved in communicating between these communities across the network [68] (Methods). We hypothesise that these 'Bridging' proteins will also correlate with functional/disease importance at the biological level.

Figure 3. Bridging

The more a disease pair overlaps, the greater the evidence that the corresponding diseases share the same genes, and therefore participate in similar pathways. Since we find significant overlaps between AD-HTN and AD-PD, but not PD-HTN, this indicates a potential shared mechanistic pathway between AD and HTN, which is different to the pathways shared between AD and PD; Supplementary Figures 3A and 3C illustrate this difference by plotting the z-score calculated against the distribution of randomised separations for AD-HTN and PD-HTN after 10000 randomised models using the PSP network.

Discussion.

This dataset is the largest and the most complete to date and is freely available with lightweight tools to allow anyone to extract relevant subsets. By mirroring the methods used it would be straightforward for any user to add in their own datasets for comparison.

The case studies described provide useful examples of the analyses that can be readily extracted from these data. 
In addition to information for specific proteins, we can use the model to analyse the entire protein-protein interaction network to query topology-function relationships that provide insights into possible disease mechanisms. For example, since the Bridging proteins are spread out over the network, the correlated disease-disease pairs we find are not centred in a single cluster but shared across several communities. In turn, the co-occurrence of enrichment of specific synaptic functions with disease in the discovered communities may indicate that molecular complexes underlying the specific synaptic functions are also involved in disease mechanisms. When comparing across diseases we find evidence for shared molecular mechanisms that span common neurological disorders, such as Bipolar Disorder and Autism Spectrum Disorder, or Bipolar Disorder and Schizophrenia. Moreover, we see evidence for molecular mechanisms that span more diverse disease pairs, such as the existence of common molecular pathways linked to both Alzheimer's disease and Hypertension.

Methods.

Only proteins that were found more than once were taken into account to make the most confident "consensus" dataset for pre- and post-synaptic proteomes. We fitted the accumulation of new proteins against the year in which they were first identified in R, using linear (y ~ x) and non-linear (y ~ SSlogis(x, Asym, xmid, scal)) models. The goodness of fit was compared using Akaike's Information Criterion (AIC) function [78], where a lower value indicates a more parsimonious model.
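The model comparison above can be sketched without R. The following Python example is only an illustration under stated assumptions: the data are synthetic (generated from a logistic curve plus a small deterministic wobble), the logistic fit uses a crude grid search rather than the Gauss-Newton procedure of R's `nls`, and all function names are hypothetical. The logistic curve uses the same parameterisation as `SSlogis`, and the AIC counts the error variance as one extra parameter.

```python
import math

def sslogis(x, asym, xmid, scal):
    """Logistic curve, parameterised like R's SSlogis: Asym / (1 + exp((xmid - x) / scal))."""
    return asym / (1.0 + math.exp((xmid - x) / scal))

def aic_gaussian(residuals, n_coef):
    """AIC of a least-squares fit with Gaussian errors (variance counted as a parameter)."""
    n = len(residuals)
    rss = sum(r * r for r in residuals)
    log_lik = -0.5 * n * (math.log(2.0 * math.pi) + math.log(rss / n) + 1.0)
    return -2.0 * log_lik + 2.0 * (n_coef + 1)

def fit_linear(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

def fit_logistic(xs, ys):
    """Grid search over xmid and scal; Asym has a closed-form least-squares solution."""
    best = None
    lo, hi = min(xs), max(xs)
    for i in range(int((hi - lo) * 10) + 1):      # xmid in steps of 0.1 year
        xmid = lo + 0.1 * i
        for j in range(96):                       # scal from 0.5 to 10.0
            scal = 0.5 + 0.1 * j
            g = [1.0 / (1.0 + math.exp((xmid - x) / scal)) for x in xs]
            asym = sum(y * gi for y, gi in zip(ys, g)) / sum(gi * gi for gi in g)
            rss = sum((y - asym * gi) ** 2 for y, gi in zip(ys, g))
            if best is None or rss < best[0]:
                best = (rss, asym, xmid, scal)
    return best[1:]

# Synthetic cumulative counts per year (NOT the paper's data).
xs = list(range(2000, 2020))
ys = [sslogis(x, 3500.0, 2008.0, 3.0) + 20.0 * math.sin(x) for x in xs]

intercept, slope = fit_linear(xs, ys)
asym, xmid, scal = fit_logistic(xs, ys)

aic_linear = aic_gaussian([y - (intercept + slope * x) for x, y in zip(xs, ys)], 2)
aic_logistic = aic_gaussian([y - sslogis(x, asym, xmid, scal) for x, y in zip(xs, ys)], 3)
```

On saturating data like this the logistic model should have the lower AIC, and the fitted `asym` plays the role of the predicted total number of proteins.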

For the postsynaptic proteome the non-linear model is shown in Figure 1F and the predicted maximum size of the "consensus" PSP proteome is 3499, reached by roughly 2023. The AIC is 213.8955 for the linear fit and 205.0504 for the non-linear fit, which means the latter is more parsimonious.

For the presynaptic proteome the non-linear fit is shown in Supplementary Figure 1, predicting 1309 proteins in total, reached by the year 2035. However, by the AIC, the linear model for the presynaptic "consensus" proteome is better than the non-linear one (103.2001 and 107.2766, respectively), which likely means that the presynaptic proteome is not yet in its "saturation" phase.

To limit the analysis to high-confidence direct physical interactions, those annotated by the "association" term (MI:0914) and its offspring were preserved. This step excluded interactions of types: colocalization, functional association, genetic interaction and predicted interaction. Although the MI ontology offers a "direct interaction" term in the category of interaction type, according to the IMEx curation rules (https://github.com/IMEx-Consortium/IMEx-site/raw/gh-pages/static/files/imex_curation_rules.pdf) two-hybrid assays are categorised as non-direct interactions. Therefore, this classification was not used, in order to preserve interactions reported by two-hybrid assays. We further removed indirect interactions, often linked to methods likely to generate false positive hits (spoke-expansion). Interactions originating from experiments involving co-complexes (e.g. pull-down, affinity technology) were excluded from the analysis by filtering out a selected subset of terms denoting these methods. Since the IntAct interaction

Interactors that were annotated with a different taxon than Mouse, Rat or Human were removed. 
Non-human Entrez Gene IDs were mapped to human identifiers with the gene orthology mapping tables available on the NCBI FTP server (file address: https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_orthologs.gz). After application of the above criteria, 407443 interactions were obtained. Additionally, after manual curation, 200 interactions from the PDB database were added directly to the PPI list with Method "MI:0114" (X-ray crystallography) and Type "MI:0407" (direct interaction), which resulted in a total of 407643 interactions.

We investigated the overlap and separation of each disease-disease pair by measuring the mean shortest distance for each disease, using the shortest distance between each GDA and its next nearest GDA neighbour [72]. The overlap, or separation, of each disease-disease pair in the pre- and post-synaptic PPI networks could then be quantified using:

$$ s_{AB} = \langle d_{AB} \rangle - \frac{\langle d_{AA} \rangle + \langle d_{BB} \rangle}{2} $$

where $\langle d_{AA} \rangle$ and $\langle d_{BB} \rangle$ quantify the mean shortest network distance between genes associated with disease A (or B), and $\langle d_{AB} \rangle$ the mean shortest distance between diseases.
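A minimal Python sketch of this separation measure on a toy network is given below. It follows the nearest-neighbour definition used in [72] as understood here: within a disease, each gene's distance to its nearest other gene of the same disease; between diseases, each gene's distance to the nearest gene of the other disease (zero for shared genes). The graph, the gene sets and the function names are illustrative assumptions, and the sketch assumes a connected, unweighted network with at least two genes per disease.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest-path lengths from source to every reachable node (unweighted graph)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adj[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist

def sum_nearest(adj, sources, targets, exclude_self):
    """Sum, over sources, of the distance to the nearest target gene."""
    total = 0
    for s in sources:
        dist = bfs_distances(adj, s)
        total += min(dist[t] for t in targets
                     if t in dist and not (exclude_self and t == s))
    return total

def separation(adj, genes_a, genes_b):
    """s_AB = <d_AB> - (<d_AA> + <d_BB>) / 2."""
    d_aa = sum_nearest(adj, genes_a, genes_a, True) / len(genes_a)
    d_bb = sum_nearest(adj, genes_b, genes_b, True) / len(genes_b)
    # Between-disease term: nearest-neighbour distances taken in both directions.
    d_ab = ((sum_nearest(adj, genes_a, genes_b, False)
             + sum_nearest(adj, genes_b, genes_a, False))
            / (len(genes_a) + len(genes_b)))
    return d_ab - (d_aa + d_bb) / 2.0

# Toy path graph 0-1-2-3-4-5 standing in for a PPI network.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}

s_separated = separation(adj, {0, 1}, {4, 5})          # gene sets at opposite ends
s_overlapping = separation(adj, {1, 2, 3}, {2, 3, 4})  # gene sets sharing two genes
```

In this toy case the distant gene sets give a positive separation and the overlapping sets give a negative one, matching the interpretation of the sign in the text.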
The separation $s_{AB}$ is bound by the diameter of the network, i.e., $-d_{max} \le s_{AB} \le d_{max}$, where $d_{max}$ is 8, 7, 8 for the presynaptic, PSP and PSP consensus PPI networks respectively. The magnitude of $s_{AB}$ depends on the number of GDAs associated with each disease. Large positive values imply two well separated diseases, while large negative values indicate large (number of GDAs) diseases with a big overlap, often implying one disease is the variant or precursor to the other. Each disease-disease network separation pair ($s_{AB}$) was compared against a full randomised model: drawing the same number of GDAs (from the set of all network genes) for each disease at random, before computing its separation. For each disease-disease pair, we performed 10,000 iterations of the full randomised model using the ECDF distributed computing facility.

The difference between the observed and randomised disease pair separations was quantified using the z-score:

$$ z = \frac{s_{AB} - \langle s_{AB}^{rand} \rangle}{\sigma(s_{AB}^{rand})} \qquad (4) $$

where $\langle s_{AB}^{rand} \rangle$ and $\sigma(s_{AB}^{rand})$ are the mean and standard deviation obtained from the 10,000 iterations. Each disease-disease pair separation using the full randomised model, i.e., $s_{AB}^{rand}$, was found to follow a normal distribution. We therefore assessed the significance of each disease-disease pair's separation from P-values estimated from its z-score calculated in (4), where we take the negative of the absolute value of each disease-disease pair's z-score calculated in (4) and make use of R's pnorm function available in the 'stats' package (R version 3.4.2). The confidence in each disease-disease pair's P-value was tested for by calculating its