Abstract
Prolidase (PEPD) catalyses the cleavage of dipeptides with high affinity for proline at the C-terminal end. This function is required in almost all living organisms. In order to detect strongly conserved residues in PEPD, we analysed PEPD orthologous sequences identified in data sets of animals, plants, fungi, archaea, and bacteria. Residues conserved over very long evolutionary time are likely to be of functional relevance. Single amino acid mutations in PEPD cause a disorder called prolidase deficiency and are associated with various cancer types. We provide new insights into 15 additional residues with putative roles in prolidase deficiency and cancer. Moreover, our results confirm previous reports identifying five residues involved in the binding of metal cofactors as highly conserved and enable the classification of several non-synonymous single nucleotide polymorphisms as likely pathogenic and seven as putative polymorphisms. In addition, more than 50 novel conserved residues across species were identified. The conservation degree per residue across the animal kingdom was mapped to the human PEPD 3D structure, revealing the strongest conservation close to the active site, accompanied by a higher functional implication and pathogenic potential, and validating the importance of a characteristic active site fold for prolidase identity.
Introduction
Human peptidase D (PEPD) or prolidase (EC 3.4.13.9) is a multifunctional manganese-requiring homodimeric iminodipeptidase. Its enzymatic activity was first reported in 1937 with the observation of glycyl-proline dipeptide degradation 1. PEPD belongs to the metalloproteinase M24 family. Its major function is the hydrolysis of peptide bonds of imidodipeptides with a C-terminal proline or hydroxyproline, thus liberating proline and hydroxyproline, respectively 2.
The biological significance of PEPD is indicated by the presence in the genomes of most animal species and its expression in several tissues 3–7. Moreover, PEPD has been identified in fungi 8,9, plants 10, archaea 11, and even bacteria 12–15. Especially the presence of PEPD in several mycoplasma species stresses its essential role in their metabolism and maintaining cellular functions, as these intracellular parasites display an otherwise extremely reduced gene set 16.
Physiological role of PEPD
PEPD is the only known metalloenzyme in eukaryotes catalysing the hydrolysis of X-P 17. Therefore, deleterious mutations in PEPD in humans lead to a rare autosomal disease called prolidase deficiency (PD), which is characterized by skin ulcerations due to defective wound healing, immunodeficiency, mental retardation, splenomegaly, recurrent respiratory infections and imidodipeptiduria 18–20. To date, 29 different pathogenic variants have been reported and associated with PD, resulting in a partial or complete enzyme inactivation 21. In addition to this autosomal disease, perturbations in PEPD expression, (serum) activity or serum levels have been associated with several (patho)physiological processes, including remodelling of the extracellular matrix, inflammation, carcinogenesis, angiogenesis, cell migration, and cell differentiation 22–27. Moreover, alterations of PEPD serum activity are associated with a spectrum of mental diseases, like post-traumatic stress disorder 28 and depression 29. Altered PEPD activity and serum levels have also been frequently described in different cancer types, suggesting an involvement of PEPD in cancer 2,23,24,48.
In bacteria and archaea, PEPD is assumed to be involved in the degradation of intracellular proteins and proline recycling 30. In animals, PEPD is involved in the degradation of proline-rich dietary proteins and seems to play an important role in proline recycling 2. Since collagen (a major component of the extracellular matrix) consists of 25% proline and hydroxyproline, PEPD activity is thought to be the rate limiting factor in collagen turnover 2,31. Interestingly, there is a growing body of evidence showing that PEPD may also have additional pleiotropic effects, independent of its enzymatic activity. Thus, PEPD has been reported to influence the p53 pathway by direct protein-protein interaction 32 and to act as a ligand for EGFR and ErbB2 when released by injured cells 33,34.
Characterization of the enzymatic and structural properties of PEPD
The crystal structure of PEPD has been extensively investigated in several species, including bacteria 16,35, archaea 36, and eukaryotes 17. Together with methionine aminopeptidase (MetAP; EC 3.4.11.18) and aminopeptidase P (APP; EC 3.4.11.9), PEPD belongs to the “pita-bread” family, which is able to hydrolyse amido-, imido-, and amidino-containing bonds 37,38. Characteristic of this family is the highly conserved pita-bread fold in the catalytic C-terminal domain, including the metal centre and a well-defined substrate binding pocket 37,39. The catalytic C-terminal domain comprises five highly conserved residues for the binding of the metal cofactors: D276, D287, H370, E412, and E452 (positions refer to the human sequence) 17.
The preferred substrate, optimal pH and temperature, and required metal ions (e.g. Mn2+, Zn2+ or Co2+) are species-dependent 2. Although PEPD appears to be a (homo)dimer in most species including humans, it can also be active as a monomer or even as a tetramer in certain species 2. The homodimeric human PEPD preferably hydrolyses G-P, is adapted to a pH value of 7.8 with a temperature optimum of 50°C, and shows long-term activity at 37°C 17,40. In vitro studies based on recombinant PEPD produced in CHO cell lines and E. coli, as well as endogenous PEPD of human fibroblasts, revealed G-P as the preferred substrate, followed by a lower substrate specificity for A-P, M-P, F-P, V-P, and L-P dipeptides 40. Moreover, in human PEPD the substrate specificity for dipeptides is determined through the presence of specific residues, like R398 and Y241, which prevent the binding of longer substrates 17.
Regulation of PEPD
PEPD is a phosphotyrosine and phosphothreonine/serine enzyme 41,42. Phosphorylation results in an increase of PEPD activity and is mediated by the MAPK pathway and NO/cGMP signalling for tyrosine and threonine/serine residues, respectively 41,42. Phosphorylation-mediated up-regulation of PEPD activity was reported without an increased gene expression, indicating the importance of post-translational modification in its regulation 41,42. In silico analysis of human PEPD indicated post-translational modifications like glycosylations. N-glycosylation was predicted for N13 and N172, while O-glycosylation was thought to affect T458 22.
We anticipate that the detailed profiling of conserved residues in PEPD during evolution may help to identify and understand essential components for the mentioned PEPD functions and structure. This increased knowledge could help explain the role of PEPD in diseases, especially prolidase deficiency. Taxon-specific conservation of residues provides additional insights, e.g. into post-translational modification in eukaryotes. This study identified orthologous sequences of PEPD in peptide sequence sets of several hundred organisms including bacteria, archaea, animal, fungi, and plant species to investigate the conservation of residues in PEPD across the tree of life. We further identified highly conserved residues, which are likely to play key functional roles.
Results and Discussion
Sequence lengths differentiate between high-level taxonomic groups
In total, 769 putative PEPD orthologues were identified in animals (440), plants (122), fungi (72), archaea (42), and bacteria (93) (Supplementary File 1). PEPD orthologues in animals revealed an average sequence length of 493 amino acids (aa), while plant and fungal orthologues had an average sequence length of 499 aa and 507 aa, respectively (Supplementary File 2). Compared to these three kingdoms, PEPD sequences of bacteria were slightly shorter with an average sequence length of 455 aa. PEPD orthologues identified in archaea showed the shortest average sequence length of all groups with 360 aa. These findings matched previous reports of 349 aa (P. furiosus) and 493 aa (H. sapiens) 11,17. In general, our observations indicate that PEPD sequence length has changed during evolution. This length difference could be due to an increase of complexity and functionality of PEPD in eukaryotes, where it is known as a multifunctional enzyme 2, or due to a loss of domains in prokaryotes. Observing longer versions in eukaryotes is not surprising, because eukaryotes are probably more likely to tolerate larger proteins than bacteria due to differences in the relative metabolic burden 43.
Analysis of previously described residues
Our broad taxonomic sampling captured vast natural diversity, which was harnessed to identify highly conserved residues. From conservation of amino acid residues over billions of years during evolution, we infer functional relevance. A huge diversity of different species and thus sequences is key to distinguish relevant residues from the phylogenetic background. To ensure an accurate alignment of all analysed sequences, the alignment was performed with permutations of the input sequences and repeated with different alignment tools. The average difference per position in the resulting alignments is low (Supplementary File 3 and 4).
Conservation of functionally and structurally relevant residues
Highly conserved residues are likely to have a high functional and/or structural relevance. Aiming to extend the knowledge gained from the existing crystal structures, especially that of human PEPD, we analysed the conservation degree of known residues relevant for the structure and function of PEPD 17. Despite the high diversity of metal ions accepted by different species 2, the amino acids responsible for the binding of the metal ions (D276, D287, H370, E412, and E452) are highly conserved across species (Supplementary File 5). All residues reported for the interaction with metal ions were detected in over 90% of all sequences. Sequences without these particular residues are likely to be partial and thus do not cover the respective position, which lowers the observed conservation value. When excluding sequence gaps, an almost 100% match is reached for all five positions. Based on these results, we conclude that all selected sequences are bona fide prolidases. This finding marks the conservation of these five residues as one important structural and functional characteristic of PEPD (Figure 1).
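The effect of partial sequences on the observed conservation value can be illustrated with a minimal Python sketch. It is not the script used in this study; the function name `column_conservation` and the toy alignment column are assumptions made for illustration only.

```python
from collections import Counter

def column_conservation(column, reference_residue, ignore_gaps=False):
    """Fraction of sequences carrying reference_residue in one alignment column.

    With ignore_gaps=True, gap characters ('-') are removed before the
    fraction is calculated, so partial sequences no longer lower the value.
    """
    residues = [aa for aa in column if not (ignore_gaps and aa == "-")]
    if not residues:
        return 0.0
    return Counter(residues)[reference_residue] / len(residues)

# Hypothetical column: eight sequences with D, one substitution, one gap
toy_column = list("DDDDDDDDN-")
with_gaps = column_conservation(toy_column, "D")
without_gaps = column_conservation(toy_column, "D", ignore_gaps=True)
print(f"including gaps: {with_gaps:.2f}, excluding gaps: {without_gaps:.2f}")
```

In this toy column the gap alone reduces the value from 8/9 to 8/10, which mirrors how partial sequences can mask an otherwise near-complete conservation.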
Additionally, strong conservation of T289 and T410 in proximity to the manganese ions supports previous reports and hypotheses of their functional relevance in PEPD 22.
Nevertheless, one plant- and three animal PEPD orthologues showed an amino acid substitution of one metal binding residue: Ancylostoma ceylanicum (H370V), Arachis duranensis (D287N), Oncorhynchus kisutch (E452K) and Tetraodon nigroviridis (E452R). Crystal structures and enzyme assays could illuminate the consequences of these substitutions thus providing natural sequences to assess the contribution of each residue. Since D287N was reported before as a probably deleterious substitution 44, these prolidases may have lost their ability to cleave X-P dipeptides.
Another essential step for the enzymatic catalysis of prolidases is the binding of their dipeptide substrate (e.g. G-P) 17. For example, H255 binds to the carboxylate group of the C-terminal proline residue of the substrate and its side chain moves upon substrate binding by about 6 Å, narrowing down the size of the active site 17. The importance of such substrate binding residues, like H255 and H377 17, was validated through a high conservation degree of at least 94% in all living organisms (Figure 1). Interestingly, another residue involved in G-P binding in human PEPD, R398 17, is highly conserved except in archaea (Figure 1). Besides its role in G-P binding, this residue is also important for the specificity of PEPD for dipeptides by determining the length of the ligand at the C-terminus through its large side chain 16,17. These results suggest that the majority of analysed archaeal prolidases might not be capable of G-P degradation and may have a broader substrate spectrum due to the missing R398. In line with this hypothesis, Ghosh et al. showed that PEPD purified from the archaeon P. furiosus revealed no substrate specificity for G-P, but for longer substrates like K-W-A-P and P-P-G-F-S-P, although this specificity was rather weak 11. However, the preferred substrates of this enzyme were the dipeptides M-P and L-P 11. Interestingly, P. furiosus still has a corresponding arginine residue at position 295 16. This R295 was reported to have dual functionality for cleaving di- and tripeptides due to the intermediate position of this arginine 16. These reports support the hypothesis that archaeal prolidases have a broader substrate spectrum compared to the prolidases of the other kingdoms. In turn, the strong conservation of R398 in eukaryotes may indicate an adaptation to the specific recognition of dipeptides. In line with this hypothesis, the bulky side chain of R398 was reported to prevent the acceptance of tripeptides 17.
Moreover, a strong conservation of W107, except in archaea, was identified (Figure 1). After G-P binding, W107 is shifted inwards to the active site, sealing the active site 17. The low conservation of W107 in archaea suggests that archaeal prolidases might use a different conformational change, probably due to their putative expanded substrate spectrum.
Furthermore, some residues were reported to be involved in the interaction with L-P, another potential prolidase substrate: Y241, I244, H255, and V376 17. H255 and I244 are highly conserved across species (Figure 1). V376 is less conserved in fungi and not conserved in plants. Y241 is not conserved in archaea. Since P. furiosus PEPD is capable of binding and degrading L-P, Y241 is probably not essential for this binding process in archaea. Another reason for the flexibility in archaea might be the putatively expanded substrate spectrum due to the absence of Y241, which is reported to close the active site on the side where the N-terminus of the substrate is placed 21. To the best of our knowledge, the effect of the absence of V376 in plants has not yet been investigated.
In order to identify a common disulfide bond responsible for the common dimer formation of prolidases, previously reported cysteine residues 17 were analysed. In human PEPD, an intermolecular disulfide bridge was observed between C58 from chain A and C158 from chain B 17. However, this bond was only present in the inactive (Mn2+ free) enzyme complex, while the substrate was bound in the active site 17. These amino acids are weakly conserved in the animal kingdom (58% and 40%, respectively), but showed an almost complete conservation among vertebrates, likely due to their relevance in the dimer formation in this group. However, these cysteines might not be responsible for the dimer formation in the active form of the enzyme, which occurs in most of the prolidases 8,17,45. Therefore, we aimed to identify a better candidate for this common PEPD conformation. However, we could not identify a highly conserved cysteine across species, suggesting (I) the presence of different interactions for stabilization of e.g. PEPD dimers or (II) frequent occurrence of PEPD as a monomer.
Analysis of residues known to be mutated in prolidase deficiency
The majority of amino acids that are hot spots causing PD (6/11: D276, G278, L368, E412, G448, E452) are localised near or in the active site of PEPD 22,46. These amino acids are conserved across species, thus suggesting a negative correlation between the distance of a residue to the active site and its conservation in animals. As expected, highly conserved (>85%) residues are more likely to be located close to the active site (p-value = 3.76e-06, Mann-Whitney U test) (Figure 2, Supplementary File 6).
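The comparison of distances between highly conserved and remaining residues can be sketched as follows. This is a minimal pure-Python illustration of a Mann-Whitney U test using the normal approximation without tie correction; it is not the code used for the reported p-value, and the distance values are hypothetical. In practice, a library routine such as `scipy.stats.mannwhitneyu` would be the usual choice.

```python
import math

def rank_with_ties(values):
    """Assign 1-based ranks; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def mann_whitney_u(sample_a, sample_b):
    """Return U of sample_a and a two-sided p-value (normal approximation)."""
    n_a, n_b = len(sample_a), len(sample_b)
    ranks = rank_with_ties(list(sample_a) + list(sample_b))
    u_a = sum(ranks[:n_a]) - n_a * (n_a + 1) / 2
    mean_u = n_a * n_b / 2
    sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (u_a - mean_u) / sd_u
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u_a, p_value

# Hypothetical distances (in Angstrom) to the centre of the active site
highly_conserved = [5.1, 6.3, 7.8, 9.0, 10.2]        # conservation > 85%
other_residues = [12.4, 15.0, 18.7, 21.3, 25.9, 30.1]
u_stat, p_value = mann_whitney_u(highly_conserved, other_residues)
print(u_stat, p_value)
```

The normal approximation is only adequate for reasonably large samples; for small groups like the toy data above, an exact test would be preferable.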
As mentioned previously, the metal binding residue E452 is highly conserved across species and its deletion surprisingly results in a preservation of the active site 21, likely because it can be replaced by neighbouring residues. However, the mutated protein shows less than 5% of the WT activity 47, supporting our findings. Additionally, our results are in line with the findings of Bhatnager and Dang, who identified D276N, G278D, E412K, and G448R as damaging substitutions 44, because we observed a strong conservation of all four residues. Recently, the structural basis of these and other PD mutations has been analysed in detail 21. Once again in accordance with our results, Wilk et al. reported that the D276N mutation results in an excessive reduction of the PEPD activity due to the loss of one of the catalytic metal ions, caused by the charge change introduced by the substitution 21. Similarly, in the G278D mutant the loss of one metal ion and additional enhanced disorder were observed 21. Interestingly, Y241, identified above as highly conserved, seems to have high functional relevance since its displacement in this mutant results in a destabilization of two metal binding residues (D276 and D287) 21. In addition, the highly conserved substrate coordinating residue H255 is completely absent from the active site of the G278D mutant 21, stressing its importance in maintaining PEPD functionality. H255 is also absent in the G448R mutant, contributing to a dysfunctional protein core 21. The substitution of the metal binding E412 by K results once again in the loss of one metal ion, which is replaced by an amino acid side chain, leading to PEPD inactivation 21.
R184 is defined by the shortest atom-to-atom distance to G-P in human PEPD and marks the end of the N-terminal chain of human PEPD 21. The deletion or mutation of R184 to Q in PD patients results in an inactive PEPD or one with highly reduced enzyme activity, respectively 21. Therefore, R184 might be essential for the functionality and structure of PEPD, which is supported by its high conservation across many species 22. In this study, this finding was validated with a minimum conservation degree of 92% of all sequences analysed. Moreover, D375 and D378 were identified as highly conserved across species. Interestingly, these residues were both recently reported to directly interact with R184 21. In the PD mutation variant R184Q, the interaction of R184 with D375 and D378 is lost, due to the replacement of the positively charged guanidinium group of R184 by the neutral amide group of Q 21. The resulting protein shows only residual activity, supporting the hypothesis that D375 and D378 are highly important for PEPD functionality.
Additional relevant residues in PD are not particularly conserved across different phyla. Among them, S202 (90%) and Y231 (89%) are highly conserved in animals. While the deletion of Y231 results in alterations in the dimer interface with remaining PEPD activity, the S202F substitution increases PEPD disorder, resulting in the inability to hydrolyse G-P 21. Y241 is affected by S202F, contributing to the loss of PEPD activity, since Y241 becomes disordered even though the metal binding residues are not affected 21. Since Y241 interacts in the WT human PEPD structure with the metal binding aspartates 21, its disorder might result in the loss of this interaction, thus destabilizing PEPD. However, A212 (45%) and R265 (35%) show a substantially smaller conservation degree compared to S202 and Y231. Strong conservation of A212 and R265 is limited to vertebrates, thus suggesting a pathogenic role limited to this branch. The phenotypes of S202F, A212P, and L368R are not distinguishable from each other, posing an example for relevant residues in PD without strong conservation 46.
Identification of polymorphisms in damage-associated SNPs in the human prolidase gene
Recently, Bhatnager and Dang (2018) identified damage-associated single-nucleotide polymorphisms (SNPs) in the human prolidase gene based on a comprehensive in silico analysis 44. We observed that some of their non-synonymous SNPs (nsSNPs) lead to substitutions at variable positions, thus qualifying as polymorphisms instead of pathogenic variants. One such SNP causes the substitution of V to I at position 305, while our analysis revealed V in 78% and I in 16% of all animal PEPD sequences. Six out of seven tools predicted this SNP as neutral, supporting our assumption 44. Similar ratios and even dominance of a different amino acid were observed for I45V, E227L, and L435F, indicating three additional polymorphisms. Additionally, we hypothesize that nsSNPs leading to T137M, V456M, and D125N are likely to be polymorphisms, as the conservation of the canonical amino acid is low.
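The reasoning applied to V305I can be expressed as a simple rule on alignment column frequencies. The following Python sketch is a hedged illustration, not the procedure used in this study: the function name, the returned labels, and the cutoff of 5% for a tolerated alternative residue are assumptions, while the 85% cutoff follows the definition of highly conserved residues used above. Only the V305 frequencies are taken from the text; the second column is hypothetical.

```python
def classify_substitution(column_freqs, ref_aa, alt_aa,
                          conserved_cutoff=0.85, tolerated_cutoff=0.05):
    """Label a substitution based on residue frequencies in one alignment column.

    column_freqs maps amino acids to their frequency (0..1) among orthologues.
    Both cutoffs are illustrative assumptions, not validated thresholds.
    """
    ref_freq = column_freqs.get(ref_aa, 0.0)
    alt_freq = column_freqs.get(alt_aa, 0.0)
    if alt_freq >= tolerated_cutoff:
        # the substituted residue occurs naturally in other species
        return "putative polymorphism"
    if ref_freq >= conserved_cutoff:
        # strongly conserved position and an alternative that is (almost) never seen
        return "likely pathogenic"
    return "uncertain"

# Frequencies reported for position 305 in animal PEPD sequences
v305_column = {"V": 0.78, "I": 0.16}
label_v305i = classify_substitution(v305_column, "V", "I")

# Hypothetical strongly conserved column
toy_conserved_column = {"D": 0.98, "E": 0.01}
label_toy = classify_substitution(toy_conserved_column, "D", "N")
print(label_v305i, label_toy)
```

Such a rule only captures the conservation argument; predictions from dedicated tools, as used by Bhatnager and Dang 44, remain necessary to support a classification.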
However, the remaining nsSNPs affect residues with a higher conservation degree in the animal kingdom, indicating that these residues may be important for the structure or function of PEPD in the animal kingdom and that their substitution has pathogenic potential 44. This is especially the case for the overlap between the identified consensus nsSNPs, which were predicted by all tools as damage-associated, and our results, stressing that these residues are highly conserved not only in the animal kingdom, but also across species 44 (Table 1).
PEPD in cancer
The investigation of curated SNPs in PEPD, which are associated with specific cancer types (BioMuta database 49), revealed missense mutations in various cancer types to be distributed across the whole PEPD sequence (Supplementary File 7). As many SNPs were associated with a low frequency, we focused on a small set of more frequent ones. Surprisingly, the amino acid affected by the most frequent SNPs in various cancer types is A74, a residue located in the non-catalytic N-terminal domain. While its general frequency in animals is low (38%), it displays a strong conservation in mammals, thus suggesting a functional role. Other frequently affected residues are A122, H155, G257, R311, M329, and D378. All of them are conserved to different extents in the animal kingdom, while three (G257, M329, and D378) are also conserved in plants. However, D378 is the only amino acid conserved across all species. Being in proximity to the metal binding residue H370, the high conservation degree of D378 might be due to its role in forming a functional catalytic site. However, we could not identify a “cancer specific hot spot residue” in the animal kingdom and thus the appearance of SNPs in PEPD in various cancer types is likely not to be the driving force of a specific cancer type and the identified SNPs might be polymorphisms.
Post-translational regulation of PEPD
Since there is experimental evidence of PEPD activity being regulated at the post-translational level through phosphorylation 41,42, we aimed to validate previously predicted post-translational modifications (PTMs) 50 in human PEPD. None of the examined sites were highly conserved across species (Supplementary File 5), which could be explained by differences in the PTM mechanisms between prokaryotes and eukaryotes 51,52. Nevertheless, some residues were conserved in the animal kingdom e.g. R196 (88%). The low conservation values could be due to differences in PTMs between different groups of eukaryotes 51. The lack of conservation for some of these residues (S8, K36, S113, T487, A490, K493) could be explained in three ways: (I) no strong functional relevance for PEPD, (II) false positive prediction, or (III) a human specific regulation system. Vice versa, three residues are highly conserved at least in the animal kingdom (T15:80%, Y128:78%, R196:88%) posing good candidates for a PTM site. Two of the three amino acids are predicted to be phosphorylated (T15 and Y128), while R196 is thought to be monomethylated 50.
Lupi et al. predicted putative PTMs at N13, N172 (NetNGly), and T458 (NetOGlyc) 22. These residues were found to be highly conserved among vertebrates. This situation could be explained by a more recently evolved function or a relaxed ancestral function in species without strong conservation. In silico prediction of new phosphorylation sites resulted in T90, S113, Y121, Y128, S202, S224, S138, S240, S247 and S460 as best candidates. Conservation degrees generally support these predictions (Supplementary File 5) and distribution across species suggests a more recently increased relevance of S113 and S138.
Identification of novel conserved residues
All structure-related observations and hypotheses are based on the human prolidase crystal structure (PDB: 5M4G). As we already validated through the correlation in the animal kingdom, highly conserved residues are located nearby or in the substrate binding site. Therefore, it was not surprising that residues near the metal binding residue E452 are highly conserved across species, especially R450:92% along with the previously reported G448:93%. The side chain of R450 is near the metal binding site, indicating that it might be essential for the formation of a functional metal ion binding site (Supplementary File 8 (a)). Another two conserved residues, T458 and G461, are located in the curve of a C-terminal loop near the binding site (Supplementary File 8 (b)). The small size of these amino acids might be necessary to form this structural feature. However, T458 could be a putative phosphorylation site. Since it is located on the outer surface of the enzyme, it is accessible for modifications. Additionally, we observed a cluster of highly conserved residues (G406-V408), which are part of the pita-bread structure, stressing the importance of this fold for the function of PEPD as a metalloproteinase.
Again, highly conserved residues across species were identified near another known metal binding residue, E412: Y416:94%, P413:94%, and G414:93% are located near the active site and are therefore good candidates for generating a functional binding site. The glycine and proline seem to be important to allow the proper arrangement of the metal binding residues by providing space between them. The side chain of Y416 is pointing into the active site, indicating it might have an additional functional role (Figure 3 (a)).
However, it is more likely that it has a stabilizing effect by building a hydrogen bond with the NH group of R450:92% (Figure 3 (a)), thus stabilizing the anti-parallel β-strand. This anti-parallel β-strand seems to be highly important for PEPD functionality, since substitutions in the parallel β-strand, e.g. G447R or G448R, were reported to abolish PEPD activity 44. The insertion of a bulky arginine side chain, which prevents the correct assembly of the β-sheet, could be the explanation 44. Furthermore, F417:82% is highly conserved in every kingdom except archaea, expanding the number of conserved residues in this conserved region (Figure 3 (a)).
The conserved G373 is located in a tight turn of the peptide chain, suggesting its interplay with the conserved residues D375, D378, and G381 to form a loop. As a result, the important dipeptide-binding residue H377 is placed near the catalytic site (Figure 3 (b)). Weak conservation of these residues in archaea supports the previously mentioned hypothesis that archaeal PEPD might be able to hydrolyse a broader substrate spectrum. Additionally, we identified the two conserved residues G369 and H366 near the metal binding residue H370 (Supplementary File 8 (c)). The side chain of H366 is pointing into the active site, indicating that it narrows down the active site and thereby contributes to substrate specificity. Interestingly, residues near H366, e.g. P365, G367, and L368, are highly conserved with the exception of the archaea kingdom. This could explain the ability of archaeal prolidases to process tripeptides in addition to dipeptides.
The highly conserved residues T299, E302, Y306, A308, V309, L310, K321, P322, G323, V324, D328, H330, and L341 form two parallel helices located in the periphery of PEPD and are thus exposed to the solvent (Figure 3 (c)). Based on their extremely high conservation, V309 and A333 are probably most important for this structure (Figure 3 (d)). Whether this region could be the cause of some of the extracellular functions of PEPD, e.g. EGFR or ErbB2 binding 33,34, or might be a target for a regulatory protein needs to be investigated in the future.
Moreover, T299, F298, G296, and P293 are highly conserved across species except archaea. These residues might stabilize the pita-bread fold by strengthening a loop near the catalytic site (Supplementary File 8 (d)). Additionally, near the metal binding residue D276, some amino acids display strong conservation including G278, G270, E280, and L274.
Interestingly, investigation of residues near the highly conserved H255 revealed an exclusive conservation of the region between L257 and A259 in animals and plants. It is located in a loop structure at the periphery of PEPD. This region and other similar observations, e.g. G385, V386, M236, G149, N151, T152, Q49, and G50, indicate that plant and animal prolidases might have distinct structural features compared to archaea, bacteria, and fungi. However, the flanking amino acids of H255 are highly conserved at a minimum of 94% in animals, plants and fungi, stressing its importance in eukaryotes.
Overall, we observe more conserved residues in the C-terminal catalytic region compared to the N-terminal region. Nevertheless, P98, L95, P80, G76, and F65 are examples of conserved residues in the N-terminal part. Their functions are yet to be determined.
Limitations and perspectives
Numerous PEPD orthologues were identified across all living organisms to pinpoint key residues in this protein. The selection of sequences from different groups is not balanced and we do not attempt to assign evolution events to certain groups, which would be possible based on an even more comprehensive sample. A high natural diversity allowed us to distinguish between variable positions with low if any functional relevance and highly conserved residues, which are likely to play key catalytic, structural, or regulatory roles in PEPD. The results match previously reported residues and enabled us to identify additional residues, which should be subjected to in-depth investigation and will eventually shed light on function and structure of PEPD. However, 264 (27%) of the screened data sets did not reveal a PEPD candidate based on our bait sequences. A majority of species without PEPD candidates (175) were bacteria (Supplementary File 9). Since PEPD is a relevant enzyme at least in eukaryotes, it is unlikely to be missing in many species. Technical limitations like incomplete assemblies or annotations could be the reasons for the absence of PEPD from some data sets. Therefore, we checked the completeness of all analysed data sets through the identification of suitable benchmarking genes that are assumed to be present in the respective species (Supplementary File 9) and discussed it in detail (Supplementary File 10). The identification of additional PEPD orthologues would facilitate further analyses e.g. improve the differentiation between pathogenic substitutions and harmless polymorphisms. We used our observations to predict the functional impact of nsSNPs and expect that this approach will be useful in the future for similar applications. We anticipate that the use of in silico tools integrating evolutionary genetics and structural data available will help to gain knowledge e.g. regarding the molecular characterization of PEPD, the identification of new regulatory residues, the extracellular role of PEPD, and new therapeutic strategies against prolidase deficiency and other PEPD associated disorders.
Material and methods
Data set collection
The peptide sequence sets of 475 animals, 122 plants, 72 fungi, 49 archaea, and 236 bacteria were retrieved from the NCBI. All sequences were pre-processed with a dedicated Python script to generate customized data files, mainly with adjusted sequence names, as long sequence names can pose a problem to some alignment tools (https://github.com/bpucker/PEPD). Next, peptide sequence sets were subjected to BUSCO v3 53 to assess their completeness based on the reference sequence sets ‘metazoa odb9’ (animals), ‘embryophyta odb9’ and ‘eukaryota odb9’ (plants), ‘eukaryota odb9’ (fungi), and ‘bacteria odb9’ (bacteria). Since there is no dedicated reference sequence set available for archaea, we used the eukaryota and bacteria sets. PEPD bait sequences (Supplementary File 11 and 12) were selected manually based on the literature and/or curated UniProt entries 8,36. Initial selection of related sequences was based on a pipeline combining previously published scripts and using their default parameters 54. Candidate sequences were identified in a sensitive similarity search by SWIPE v2.0.12 55 and filtered through iterative steps of phylogenetic analyses involving MAFFT v7.299b 56, phyx 57, and FastTree v2.1.10 58. Results were manually inspected and polished to identify bona fide orthologous genes with high confidence. As the average length of PEPD in animals and plants is around 500 amino acids, sequences outside the range of 200-700 amino acids were filtered out to avoid bias in downstream analyses through partial sequences or likely annotation artefacts.
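The name adjustment and the length filter described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions and not the script available in the repository: the function names, the rule of keeping only the first whitespace-delimited token of each header, and the toy sequences are hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA text into {short_name: sequence}.

    Headers are shortened to their first whitespace-delimited token
    (an assumed naming rule for illustration).
    """
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}

def filter_by_length(records, min_len=200, max_len=700):
    """Keep sequences whose length lies within [min_len, max_len]."""
    return {n: s for n, s in records.items() if min_len <= len(s) <= max_len}

# Hypothetical input: one partial, one plausible, and one overly long sequence
toy_fasta = "\n".join([
    ">seq1 partial candidate [Species a]", "M" * 150,
    ">seq2 plausible candidate [Species b]", "M" * 493,
    ">seq3 likely annotation artefact [Species c]", "M" * 800,
])
kept = filter_by_length(parse_fasta(toy_fasta))
print(sorted(kept))
```

Treating both bounds as inclusive is a design choice of this sketch; the text does not state how sequences of exactly 200 or 700 amino acids were handled.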
Identification and investigation of conserved residues
MAFFT v.7.299b 56 was applied to generate multiple sequence alignments. The resulting alignments were cleaned by removing all alignment columns with less than 30% occupancy. Conserved residues were identified and listed based on positions in the human PEPD sequence (UniProt ID: P12955) using the Python script ‘conservation_per_pos.py’ (Supplementary File 1). This analysis was repeated 50 times with randomly reshuffled sequences, as the order of sequences can heavily impact the alignment process 59. In addition, we compared the alignments generated by MAFFT v.7.299b to ClustalO v.1.2.4 60 and MUSCLE v.3.8.31 61 alignments of the same data sets. The alignment bias caused by the order of input sequences was quantified for all positions of the aligned Homo sapiens sequence. For the in silico prediction of phosphorylation sites, the H. sapiens PEPD sequence (UniProt ID: P12955) was submitted to NetPhos 3.1 62,63. Only the best prediction for each residue with a high confidence score of >0.8 was considered for further analyses.
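The occupancy-based cleaning and the per-position conservation can be sketched as follows. This is an illustration and not a reproduction of ‘conservation_per_pos.py’: the function names are hypothetical, and conservation is assumed here to be the fraction of all aligned sequences that share the residue of the reference sequence in a given column.

```python
from collections import Counter

GAP = "-"


def remove_low_occupancy_columns(alignment, min_occupancy=0.3):
    """Drop alignment columns in which fewer than min_occupancy of sequences have a residue.

    alignment: dict {name: aligned sequence}, all sequences of equal length.
    """
    names = list(alignment)
    n = len(names)
    length = len(alignment[names[0]])
    keep = [
        i for i in range(length)
        if sum(alignment[name][i] != GAP for name in names) / n >= min_occupancy
    ]
    return {name: "".join(alignment[name][i] for i in keep) for name in names}


def conservation_per_reference_position(alignment, ref_name):
    """Return (position, residue, conservation) for each residue of the reference.

    Positions are 1-based and counted along the ungapped reference sequence.
    Note: if a column holding a reference residue was removed during cleaning,
    the numbering shifts; this sketch does not handle that case.
    """
    ref = alignment[ref_name]
    n = len(alignment)
    result = []
    position = 0
    for i, residue in enumerate(ref):
        if residue == GAP:
            continue
        position += 1
        column = Counter(seq[i] for seq in alignment.values())
        result.append((position, residue, column[residue] / n))
    return result


# Toy alignment; sequence names and residues are invented.
toy = {"human": "MK-A", "a": "MR-A", "b": "M--A", "c": "MKG-"}
cleaned = remove_low_occupancy_columns(toy)
conservation = conservation_per_reference_position(cleaned, "human")
```

In the toy alignment the third column is occupied in only one of four sequences (25%) and is therefore removed before conservation is calculated.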
Sources of previously reported data
Previously reported residues with functional implications (Supplementary File 7) were checked for conservation. Additionally, the alignment was screened for highly conserved residues that, to the best of our knowledge, have not previously been reported with respect to the function or structure of PEPD. The results of the residue conservation analysis for the animal kingdom were mapped to a 3D structure of human PEPD (PDB: 5M4G). Putative post-translational modification sites were obtained from PhosphoSitePlus and the literature 22,50. Residues associated with PD were retrieved from the literature 22,46. Non-synonymous single-nucleotide polymorphisms (nsSNPs) 44 and details about observations were retrieved from the curated BioMuta database 49.
Correlation analysis of conservation degree and distance to the active site of PEPD
To relate the degree of conservation to the distance from the active site, the average position of the five metal-binding residues was determined and used to calculate the distance of each residue to this centre of the catalytic site (Supplementary File 13). Information about the position of each residue was taken from the PDB file 5M4G of human PEPD 17. The Python modules matplotlib 64 and seaborn (https://github.com/mwaskom/seaborn) were applied to construct a conservation heatmap. In addition, the conservation of all residues in animals was mapped to the 3D model of human PEPD by assigning each amino acid a colour from a colour gradient representing its conservation among animal sequences.
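The distance calculation can be illustrated with a short sketch. It assumes that each residue is represented by a single coordinate (for example its C-alpha atom, which the text does not specify) and that the distance is Euclidean; the function names, residue numbers, and coordinates below are invented and do not correspond to the actual structure.

```python
import math


def centroid(points):
    """Average position of a list of (x, y, z) coordinates."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(3))


def distances_to_active_site(coords, metal_binding):
    """Euclidean distance of every residue to the centroid of the metal-binding residues.

    coords: dict {residue_number: (x, y, z)}, one representative atom per residue (assumed).
    metal_binding: residue numbers of the metal-binding residues.
    """
    centre = centroid([coords[r] for r in metal_binding])
    return {res: math.dist(xyz, centre) for res, xyz in coords.items()}


# Toy coordinates: residues 1-5 stand in for the five metal-binding residues.
toy_coords = {
    1: (0.0, 0.0, 0.0),
    2: (2.0, 0.0, 0.0),
    3: (0.0, 2.0, 0.0),
    4: (2.0, 2.0, 0.0),
    5: (1.0, 1.0, 0.0),
    6: (1.0, 1.0, 3.0),
    7: (4.0, 5.0, 0.0),
}
toy_distances = distances_to_active_site(toy_coords, [1, 2, 3, 4, 5])
```

The resulting distances can then be paired with the per-residue conservation values, for instance to order the residues in a heatmap.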
Phylogenetic analysis
A phylogenetic tree was constructed via FastTree v.2.1.10 58 based on alignments generated via MAFFT v.7.299b 56 and trimmed via pxclsq 57 to a minimal occupancy of 60%. The conservation of different key residues was mapped to this tree for visualization. A Python script (https://github.com/bpucker/PEPD) was deployed to colour all leaves representing sequences with the conserved residue in red.
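How leaves could be assigned a colour for a given key residue is sketched below. This is an illustration, not the repository script: the function name, the default colour for leaves lacking the residue, and the toy alignment are assumptions, and the actual tree rendering is left out.

```python
def leaf_colours(alignment, ref_name, ref_position, hit="red", miss="black"):
    """Assign a colour to each sequence depending on whether it shares the
    reference residue at ref_position (1-based, ungapped reference numbering)."""
    ref = alignment[ref_name]
    column = None
    position = 0
    # Locate the alignment column that holds the requested reference residue.
    for i, residue in enumerate(ref):
        if residue != "-":
            position += 1
            if position == ref_position:
                column = i
                break
    if column is None:
        raise ValueError("reference position not found in alignment")
    target = ref[column]
    return {
        name: (hit if seq[column] == target else miss)
        for name, seq in alignment.items()
    }


# Toy alignment with invented names and residues.
toy_alignment = {"human": "M-KA", "x": "MGKA", "y": "M-RA"}
colours = leaf_colours(toy_alignment, "human", 2)
```

The returned mapping can be passed to any tree drawing tool that accepts per-leaf colours.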
Data Availability
All data generated or analysed during this study are included in this published article (and its Supplementary Information files).
Authors’ contributions
HMS and BP designed the experiments, performed bioinformatics analyses, interpreted the results, and wrote the manuscript. All authors revised the manuscript. We acknowledge support for the Article Processing Charge by the Open Access Publication Fund of Bielefeld University.
Competing interests
The authors declare no competing interests.
Acknowledgements
We thank Samuel F. Brockington, Nathanael Walker-Hale, and Kali Swichtenberg for critical reading of the manuscript and very helpful comments.