Harnessing natural diversity to identify key residues in Prolidase

Prolidase (PEPD) catalyses the cleavage of dipeptides with high affinity for proline at the C-terminal end. This function is required in almost all living organisms, and orthologues of PEPD were accordingly detected across a broad taxonomic range. To detect strongly conserved residues in PEPD, we analysed orthologous PEPD sequences identified in data sets of animals, plants, fungi, archaea, and bacteria. Residues conserved over very long evolutionary time are likely to be of functional relevance. Single amino acid mutations in PEPD cause an autosomal disorder called prolidase deficiency and have been associated with various cancer types. We provide new insights into 15 additional residues with putative roles in prolidase deficiency and cancer. Moreover, our results confirm previous reports identifying five residues involved in the binding of metal cofactors as highly conserved and enable the classification of several non-synonymous single nucleotide polymorphisms as likely pathogenic and seven as putative polymorphisms. In addition, more than 50 residues conserved across species, which were not previously described, were identified. The conservation degree per residue across the animal kingdom was mapped to the human PEPD 3D structure, revealing the strongest conservation close to the active site, accompanied by a higher functional implication and pathogenic potential. This validates the importance of a characteristic active site fold for prolidase identity.

Introduction

[...] been reported to influence the p53 pathway by direct protein-protein interaction [32] and acts as a ligand for EGFR and ErbB2 when released by injured cells [33,34].

Characterization of the enzymatic and structural properties of PEPD
The crystal structure of PEPD has been extensively investigated in several species, including bacteria [16,35], archaea [36], and eukaryotes [17]. Together with methionine aminopeptidase (MetAP; EC 3.4.11.18) and aminopeptidase P (APP; EC 3.4.11.9), PEPD belongs to the "pita-bread" family, whose members are able to hydrolyse amido-, imido-, and amidino-containing bonds [37,38]. This family is characterised by the highly conserved pita-bread fold in the catalytic C-terminal domain, which includes the metal centre and a well-defined substrate binding pocket [37,39]. The catalytic C-terminal domain comprises five highly conserved residues for the binding of the metal cofactors: D276, D287, H370, E412, and E452 (positions refer to the human sequence) [17].

The preferred substrate, optimal pH and temperature, and required metal ions (e.g. Mn²⁺, Zn²⁺, or Co²⁺) are species-dependent [2]. Although PEPD appears to be a (homo)dimer in most species including humans, it can also be active as a monomer or even as a tetramer in certain species [2]. The homodimeric human PEPD preferentially hydrolyses G-P, is adapted to a pH of 7.8 with a temperature optimum of 50°C, and shows long-term activity at 37°C [17,40]. In vitro studies based on recombinant PEPD produced in CHO cell lines and E. coli, as well as on endogenous PEPD of human fibroblasts, revealed G-P as the preferred substrate, followed by a lower specificity for A-P, M-P, F-P, V-P, and L-P dipeptides [40]. Moreover, in human PEPD the substrate specificity for dipeptides is determined by specific residues, such as R398 and T241, which prevent the binding of longer substrates [17].

Regulation of PEPD
PEPD is a phosphotyrosine and phosphothreonine/serine enzyme [41,42]. Phosphorylation increases PEPD activity and is mediated by the MAPK pathway for tyrosine residues and by NO/cGMP signalling for threonine/serine residues [41,42]. Phosphorylation-mediated up-regulation of PEPD activity was reported without increased gene expression, indicating the importance of post-translational modification in its regulation [41,42]. In silico analysis of human PEPD indicated post-translational modifications such as glycosylation. N-glycosylation was predicted for N13 and N172, while O-glycosylation was thought to affect T458 [22].

We anticipate that the detailed profiling of conserved residues in PEPD during evolution may help to identify and understand components essential for the PEPD functions and structure mentioned above. This increased knowledge could help explain the role of PEPD in diseases, especially prolidase deficiency. Taxon-specific conservation of residues provides additional insights, e.g. into post-translational modification in eukaryotes. This study identified orthologous sequences of PEPD in peptide sequence sets of several hundred organisms including bacteria, archaea, animal, fungi, and plant species to investigate the conservation of residues in PEPD across the tree of life. We further identified highly conserved residues, which are likely to play key functional roles.

[...] (H. sapiens) [11,17]. In general, our observations indicate that PEPD sequence length has changed during evolution. This length difference could be due to an increase in the complexity and functionality of PEPD in eukaryotes, where it is known as a multifunctional enzyme [2], or due to a loss of domains in prokaryotes. Observing longer versions in eukaryotes is not surprising, because eukaryotes are probably more likely than bacteria to tolerate larger proteins due to differences in the relative metabolic burden [43].

Analysis of previously described residues
Our broad taxonomic sampling captured vast natural diversity, which was harnessed to identify highly conserved residues. From the conservation of amino acid residues over billions of years of evolution, we infer functional relevance. A large diversity of species, and thus of sequences, is key to distinguishing relevant residues from the phylogenetic background. To ensure an accurate alignment of all analysed sequences, the alignment was performed with permutations of the input sequences and repeated with different alignment tools. The average difference per position in the resulting alignments is low (Supplementary Files 3 and 4).
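How the difference between two alignments of the same sequences was quantified is not specified here. The following is a minimal sketch of one possible measure, assuming that "difference per position" means the fraction of sequences whose residue paired with a given reference position changes between two alignments. The function names, the toy sequences, and the metric itself are illustrative assumptions, not the method used in this study.

```python
def residue_pairing(alignment, ref_id):
    """Map each reference position (1-based, ungapped) to
    {sequence id: 1-based index of the aligned residue, or None for a gap}."""
    ref = alignment[ref_id]
    consumed = {sid: 0 for sid in alignment}  # residues seen so far per sequence
    pairing = {}
    ref_pos = 0
    for col in range(len(ref)):
        aligned = {}
        for sid, seq in alignment.items():
            if seq[col] == "-":
                aligned[sid] = None
            else:
                consumed[sid] += 1
                aligned[sid] = consumed[sid]
        if ref[col] != "-":
            ref_pos += 1
            pairing[ref_pos] = {s: i for s, i in aligned.items() if s != ref_id}
    return pairing


def mean_difference(aln_a, aln_b, ref_id):
    """Average, over reference positions, of the fraction of sequences whose
    paired residue differs between the two alignments (assumed metric)."""
    pair_a = residue_pairing(aln_a, ref_id)
    pair_b = residue_pairing(aln_b, ref_id)
    per_position = []
    for pos, others in pair_a.items():
        changed = sum(1 for sid in others if others[sid] != pair_b[pos][sid])
        per_position.append(changed / len(others))
    return sum(per_position) / len(per_position)


# Toy alignments of the same three hypothetical sequences, e.g. produced from
# two different input orders or two different alignment tools.
aln_a = {"ref": "MKT-A", "s1": "MKTLA", "s2": "M-T-A"}
aln_b = {"ref": "MKT-A", "s1": "MKTLA", "s2": "MT--A"}

print(mean_difference(aln_a, aln_b, "ref"))
```

A value of 0 means both alignments pair every reference residue with the same residues in all other sequences. Both inputs must contain the same ungapped sequences under the same identifiers.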

Conservation of functionally and structurally relevant residues
Highly conserved residues are likely to have high functional and/or structural relevance. Aiming to extend the knowledge derived from the existing crystal structures, especially of human PEPD, we analysed the conservation degree of residues known to be relevant for the structure and function of PEPD [17]. Despite the high diversity of metal ions accepted by different species [2], the amino acids responsible for binding the metal ions (D276, D287, H370, E412, and E452) are highly conserved across species (Supplementary File 5). All residues reported to interact with metal ions were detected in over 90% of all sequences. Sequences lacking these particular residues are likely to be partial and thus do not cover the respective position, leading to a lower observed conservation value. When sequence gaps are excluded, an almost 100% match is reached for all five positions. Based on these results, we conclude that all selected sequences are bona fide prolidases. This finding marks the conservation of these five residues as one important structural and functional characteristic of PEPD (Figure 1) [...] illuminate the consequences of these substitutions, thus providing natural sequences to assess the contribution of each residue. Since D287N was reported before as a probably deleterious substitution [44], these prolidases may have lost their ability to cleave X-P dipeptides.

Another essential step in the enzymatic catalysis of prolidases is the binding of their dipeptide substrate (e.g. G-P) [17]. For example, H255 binds to the carboxylate group of the C-terminal proline residue of the substrate, and its side chain moves by about 6 Å upon substrate binding, narrowing the active site [17]. The importance of such substrate binding residues, like H255 and H377 [17], was validated through a conservation degree of at least 94% in all living organisms (Figure 1).
Interestingly, another residue involved in G-P binding in human PEPD, R398 [17], is highly conserved except in archaea (Figure 1). Besides its role in G-P binding, this residue is also important for the specificity of PEPD for dipeptides by determining the length of the ligand at the C-terminus through its [...]

To identify a common disulfide bond responsible for the common dimer formation of prolidases, previously reported cysteine residues [17] were analysed. In human PEPD, an intermolecular disulfide bridge was observed between C58 from chain A and C158 from chain B [17]. However, this bond was only present in the inactive (Mn²⁺-free) enzyme complex, while the substrate was bound in the active site [17]. These amino acids are weakly conserved in the animal kingdom (58% and 40%, respectively), but showed almost complete conservation among vertebrates, likely due to their relevance for dimer formation in this group. However, these cysteines might not be responsible for dimer formation in the active form of the enzyme, which occurs in most prolidases [8,17,45]. Therefore, we aimed to identify a better candidate for this common PEPD conformation. However, we could not identify a highly conserved cysteine across species, suggesting (I) the presence of different interactions for the stabilization of e.g. PEPD dimers or (II) frequent occurrence of PEPD as a monomer.
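The conservation percentages in this section, with and without sequence gaps, can be illustrated with a minimal sketch. It assumes a multiple sequence alignment stored as a dictionary of equal-length gapped strings and defines conservation as the share of sequences carrying the same residue as the reference at each reference position. The function name, the toy sequences, and this exact definition are assumptions for illustration and may differ from the implementation used in this study.

```python
def reference_conservation(alignment, ref_id, ignore_gaps=False):
    """Return {reference position (1-based): percent of sequences that carry
    the reference residue in the corresponding alignment column}."""
    ref = alignment[ref_id]
    conservation = {}
    position = 0
    for col, ref_residue in enumerate(ref):
        if ref_residue == "-":
            continue  # column has no counterpart in the reference sequence
        position += 1
        column = [seq[col] for seq in alignment.values()]
        if ignore_gaps:
            # Partial sequences that do not cover this column are not counted
            column = [res for res in column if res != "-"]
        matches = sum(1 for res in column if res == ref_residue)
        conservation[position] = 100.0 * matches / len(column)
    return conservation


# Toy alignment with hypothetical sequences; "ecoli" and "yeast" are partial.
toy = {
    "human": "MD-HE",
    "mouse": "MD-HE",
    "yeast": "MDAH-",
    "ecoli": "-N-HE",
}

with_gaps = reference_conservation(toy, "human")
without_gaps = reference_conservation(toy, "human", ignore_gaps=True)
for pos in with_gaps:
    print(pos, with_gaps[pos], without_gaps[pos])
```

Counting gaps lowers the value at positions that partial sequences do not cover, which is the effect described above for the metal binding residues. A true substitution (position 2 in the toy data) lowers the value in both modes.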

Analysis of residues known to be mutated in prolidase deficiency
The majority of amino acids that are hot spots causing PD (6/11: D276, G278, L368, E412, G448, E452) are localised near or in the active site of PEPD [22,46]. These amino acids are conserved across species, suggesting a negative correlation between the distance of a residue to the active site and its conservation in animals. As expected, highly conserved (>85%) residues are more likely to be located close to the active site (p-value = 3.76e-06, Mann-Whitney U test) [...] in an inactive PEPD or one with highly reduced enzyme activity, respectively [21]. Therefore, R184 might be essential for the functionality and structure of PEPD, which is supported by its high conservation across many species [22]. In this study, this finding was validated with a conservation degree of at least 92% across all sequences analysed. Moreover, D375 and D378 were identified as highly conserved across species. Interestingly, these residues were both recently reported to directly [...]

However, the remaining nsSNPs with a higher conservation degree in the animal kingdom may be important for the structure or function of PEPD in the animal kingdom, and substitutions of these residues have pathogenic potential [44]. This is especially the case for the consensus nsSNPs that were predicted as damage-associated by all tools and overlap with our results, stressing that these residues are highly conserved not only in the animal kingdom but also across species [44] (Table 1).

[...] in various cancer types to be distributed across the whole PEPD sequence (Supplementary File 7). As many SNPs were associated with a low frequency, we focused on a small set of more frequent ones. Surprisingly, the amino acid affected by the most frequent SNPs in various cancer types is A74, a residue located in the non-catalytic N-terminal domain.
While its general frequency in animals is low (38%), it displays strong conservation in mammals, suggesting a functional role. Other frequently affected residues are A122, H155, G257, R311, M329, and D378. All of them are conserved to different extents in the animal kingdom, while three (G257, M329, and D378) are also conserved in plants. However, D378 is the only amino acid conserved across all species. Given its proximity to the metal binding residue H370, the high conservation degree of D378 might be due to its role in forming a functional catalytic site. However, we could not identify a "cancer specific hot spot residue" in the animal kingdom. The SNPs in PEPD observed in various cancer types are therefore unlikely to be the driving force of a specific cancer type, and the identified SNPs might be polymorphisms.
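The comparison of residue distances to the active site between highly conserved (>85%) and other residues, reported earlier in this section with a Mann-Whitney U test, can be sketched as follows. This is a minimal illustration, assuming that the focus of the active site is the average position of the metal binding residues and using a pure-Python normal approximation of the test. All coordinates, residue names, and conservation values below are invented placeholders, and the statistics implementation actually used in this study is not specified here.

```python
import math


def centroid(points):
    """Average position of a list of (x, y, z) coordinates."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test using the normal approximation with tie
    correction and no continuity correction. Returns (U of x, p-value)."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n, n1, n2 = len(pooled), len(x), len(y)
    rank_sum_x, tie_term, i = 0.0, 0.0, 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the tied ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        tie_term += (j - i) ** 3 - (j - i)
        i = j
    u_x = rank_sum_x - n1 * (n1 + 1) / 2.0
    sigma = math.sqrt(n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1))))
    if sigma == 0:
        return u_x, 1.0  # all values identical, no evidence of a difference
    z = (u_x - n1 * n2 / 2.0) / sigma
    return u_x, math.erfc(abs(z) / math.sqrt(2))


# Placeholder coordinates (in Å) standing in for the metal binding residues
metal_site = [(10.0, 12.0, 8.0), (11.0, 13.0, 9.0), (9.0, 11.0, 8.5)]
focus = centroid(metal_site)

# Placeholder residues: name -> ((x, y, z), conservation in percent)
residues = {
    "res_a": ((10.5, 12.5, 8.0), 97.0),
    "res_b": ((12.0, 11.0, 9.0), 93.0),
    "res_c": ((8.0, 13.0, 7.0), 90.0),
    "res_d": ((25.0, 30.0, 2.0), 40.0),
    "res_e": ((30.0, 5.0, 20.0), 62.0),
    "res_f": ((2.0, 35.0, 15.0), 55.0),
}

# math.dist requires Python 3.8 or newer
distances = {name: math.dist(xyz, focus) for name, (xyz, _) in residues.items()}
conserved = [distances[n] for n, (_, c) in residues.items() if c > 85]
variable = [distances[n] for n, (_, c) in residues.items() if c <= 85]

u_stat, p_value = mann_whitney_u(conserved, variable)
print(u_stat, p_value)
```

For small samples or when exact p-values matter, an established implementation such as `scipy.stats.mannwhitneyu` would be the more robust choice than this approximation.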

Post-translational regulation of PEPD
Since there is experimental evidence of PEPD activity being regulated at the post-translational level through phosphorylation [41,42] [...] were found to be highly conserved among vertebrates. This situation could be explained by a more recently evolved function or a relaxed ancestral function in species without strong conservation. In silico prediction of new phosphorylation sites resulted in T90, S113, Y121, Y128, S202, S224, S138, S240, S247, and S460 as the best candidates. Conservation degrees generally support these predictions (Supplementary File 5), and the distribution across species suggests a more recently increased relevance of S113 and S138.

Identification of novel conserved residues
All structure-related observations and hypotheses are based on the crystal structure of human prolidase (PDB: 5M4G). As already validated through the correlation in the animal kingdom, highly conserved residues are located near or in the substrate binding site. Therefore, it was not surprising that residues near the metal binding residue E452 are highly conserved across species, especially R450:92% along with the previously reported G448:93%. The side chain of R450 is near the metal binding site, indicating that it might be essential for the formation of a functional metal ion binding site (Supplementary File 8 (A)). Another two conserved residues, T458 and G461, are located in the curve of a C-terminal loop near the binding site (Supplementary File 8 (B)). The small size of these amino acids might be necessary to form this structural feature. However, T458 could also be a putative phosphorylation site. Since it is located on the outer surface of the enzyme, it is accessible for modifications. Additionally, we observed a cluster of highly conserved residues (G406-V408), which are part of the pita-bread structure, stressing the importance of this fold for the function of PEPD as a metalloproteinase.

Again, highly conserved residues across species were identified near another known metal binding residue, E412: Y416:94%, P413:94%, and G414:93% are located near the active site and are therefore good candidates for generating a functional binding site. The glycine and proline seem to be important [...]

However, it is more likely that it has a stabilizing effect by forming a hydrogen bond with the NH group of R450:92% (Figure 3 (A)), thus stabilizing the anti-parallel β-strand. This anti-parallel β-strand seems to be highly important for PEPD functionality, since substitutions in the parallel β-strand, e.g. G447R or G448R, were reported to abolish PEPD activity [44]. The insertion of a bulky arginine side chain, which prevents the correct assembly of the β-sheet, could be the explanation [44]. Furthermore, F417:82% is highly conserved in every kingdom except archaea, expanding the number of conserved residues in this region (Figure 3 (A)).

The conserved G373 is located in a tight turn of the peptide chain, suggesting its interplay with the conserved residues D375, D378, and G381 to form a loop. As a result, the important dipeptide-binding residue H377 is placed near the catalytic site (Figure 3 (B)). Weak conservation of these residues in archaea supports the previously mentioned hypothesis that archaeal PEPD might be able to hydrolyse a broader substrate spectrum. Additionally, we identified the two conserved residues G369 and H366 near the metal binding residue H370 (Supplementary File 8 (C)). The side chain of H366 points into the active site, indicating that it narrows the active site and thereby contributes to substrate specificity. Interestingly, residues near H366, e.g. P365, G367, and L368, are highly conserved with the exception of archaea. This could explain the ability of archaeal prolidases to process tripeptides in addition to dipeptides.

The highly conserved residues T299, E302, Y306, A308, V309, L310, K321, P322, G323, V324, D328, H330, and L341 form two parallel helices located in the periphery of PEPD and are thus exposed to the solvent (Figure 3 (C)).
Based on their extremely high conservation, V309 and A333 are probably the most important for this structure (Figure 3 (D)). Whether this region could be responsible for some of the extracellular functions of PEPD, e.g. EGFR or ErbB2 binding [33,34], or might be a target for a regulatory protein needs to be investigated in the future.

Moreover, T299, F298, G296, and P293 are highly conserved across species except archaea. These residues might stabilize the pita-bread fold by strengthening a loop near the catalytic site [...]

Numerous PEPD orthologues were identified across all living organisms to pinpoint key residues in this protein. The selection of sequences from different groups is not balanced, and we do not attempt to assign evolutionary events to certain groups, which would be possible based on an even more comprehensive sample. A high natural diversity allowed us to distinguish between variable positions with low, if any, functional relevance and highly conserved residues, which are likely to play key catalytic, structural, or regulatory roles in PEPD. The results match previously reported residues and enabled us to identify additional residues, which should be subjected to in-depth investigation and will eventually shed light on the function and structure of PEPD. However, 264 (27%) of the screened data sets did not reveal a PEPD candidate based on our bait sequences. A majority of the species without PEPD candidates (175) were bacteria (Supplementary File 9). Since PEPD is a relevant enzyme at least in eukaryotes, it is unlikely to be missing in many species. Technical limitations like incomplete assemblies or annotations could be the reasons for the absence of PEPD from some data sets.
Therefore, we checked the completeness of all analysed data sets through the identification of suitable benchmarking genes that are assumed to be present in the respective species (Supplementary File 9) and discussed it in detail (Supplementary File 10). The identification of additional PEPD orthologues would facilitate further analyses, e.g. improve the differentiation between pathogenic substitutions and harmless polymorphisms. We used our observations to predict the functional impact of nsSNPs and expect that this approach will be useful in the future for similar applications. We anticipate that the use of in silico tools integrating evolutionary genetics and the available structural data will help to gain knowledge e.g. [...]

To determine the conservation degree in relation to the distance to the active site, the average position of the five metal binding residues was identified and used to calculate the distance of each residue to this focus of the catalytic site (Supplementary File 13). Information about the position of each residue was taken from the PDB file 5M4G of human PEPD [17]. The Python modules matplotlib [65] and seaborn (https://github.com/mwaskom/seaborn) were applied to construct a conservation heatmap. In addition, the conservation of all residues in animals was mapped to the 3D model of the human PEPD by assigning colours within a colour gradient to each amino acid representing its