Quantitative profiling of protease specificity

Proteases are an important class of enzymes whose activity is central to many physiologic and pathologic processes. Detailed knowledge of protease specificity is key to understanding their function. Although many methods have been developed to profile protease specificity, few offer the sequence diversity and quantitative depth needed to fully define the specificity of a protease, both in terms of substrate numbers and their catalytic efficiencies. We have developed the concept of the “selectome”: the set of substrate amino acid sequences that uniquely represents the specificity of a protease. We applied it to two closely related members of the Matrixin family, MMP-2 and MMP-9, using substrate phage display coupled with Next Generation Sequencing and information theory-based data analysis. We have also derived a quantitative measure of substrate specificity that accounts for both the number of substrates and their relative catalytic efficiencies. These advances greatly facilitate elucidation of substrate selectivity between closely related members of a protease family. The study also provides insight into the degree to which the catalytic cleft defines substrate recognition, thus providing a basis for overcoming two of the major challenges in the field of proteolysis: 1) development of highly selective activity probes for studying proteases with overlapping specificities, and 2) distinguishing targeted proteolysis from bystander proteolytic events.


Introduction

[...] the study of proteolysis into the systems realm. Previously, the function and specificity of the catalytic cleft of many classes of proteases have been studied with: 1) synthetic peptide libraries [17,18], 2) covalent active site probes and suicide substrates [19,20], 3) substrate phage display [10,21] and 4) proteome-derived peptide libraries [22]. Until now, no study has taken advantage of new technology to generate the volumes of data that can inform a systems view of proteolysis. Three recent studies have incorporated NGS into substrate phage profiling of the catalytic clefts of proteases [23-25], but approaches for analyzing these large data sets to gain mechanistic insight beyond what was possible with a typical substrate phage display experiment are lacking. Here, we describe a new strategy for interrogating protease function that defines the entire landscape of peptide substrate recognition, uncovers for the first time unique sequences that are highly selective for closely related proteases, and estimates the contribution of the catalytic cleft to the specificity of recognition of protein substrates.

preferences. This means that the set of probes used for that purpose is assumed to have a uniform distribution of the amino acid patterns recognized by the catalytic cleft. This assumption, though fundamentally important, is not routinely tested in experiments using synthetic and proteome-derived peptide libraries.

Out of all existing approaches, only substrate phage display can meet the challenge of achieving maximum sequence diversity that can be quantitatively defined. The diversity afforded by substrate phage display, however, presents a technical challenge for determining the scissile bonds in millions of identified substrates. Combining Next Generation Sequencing (NGS) of substrate phage DNA with information theory-based data analysis makes it possible to obtain highly defined selectomes without the need for experimental identification of scissile bonds. Owing to the number of substrate sequences generated, this approach uniquely makes it possible to quantify the specificity, as well as the selectivity and redundancy, of proteases from the same phylogenetic families. In addition, the detailed selectivity profiles obtained using this approach can be compared with the cleavage profiles derived from protein substrates using N-terminomics and other experimental approaches to determine how closely the two agree. This, in turn, makes it possible to establish whether catalytic cleft specificity is the main driver of physiologic substrate recognition or whether other features, such as exosites or auxiliary domains, are the primary determinants or modifiers of specificity.

When setting out to define the selectome of a protease, it is important to estimate how well its complexity is matched by the amino acid sequence library used for that purpose. While it may theoretically be possible to assess the weighted contribution of individual sites across the catalytic cleft to the substrate specificity of a given protease with as few as 30 substrates of maximally variable composition [30], defining the selectome of that protease is a far larger combinatorial problem requiring a substrate library of appropriate diversity. Proteases vary widely in catalytic cleft specificity. An information entropy-based analysis of the specificity of proteases of all catalytic classes using the MEROPS database of protease substrates [11] demonstrates that while most serine, cysteine and aspartic proteases possess S1-centric specificity (Schechter and Berger nomenclature [31]), S3 and S1' are the most stringent selectivity determinants in the catalytic cleft of metalloproteinases. The role of the S3 and S1' sites in the selectivity of Matrix Metalloproteinases (MMPs) is well documented [29,32-34]. Since MMPs have two major selectivity determinants in their catalytic clefts that are located two subsites apart from each other, the selectomes of these enzymes will contain between 1 and 160,000 sequences spanning the P3-P1' positions relative to the scissile bond (Fig. 1A). The sequence space covered by the randomized hexapeptide library used in our phage display approach (Fig. 1B, theoretical maximum = $6.4 \times 10^{7}$) is adequate for interrogating selectome sizes of 160,000 and below. We therefore used two closely related members (MMP-2 and MMP-9) of the 24-member Matrixin family and a nearly completely diverse library of hexapeptides displayed on the gene 3 protein of M13 phage to explore and validate the concept of the selectome.
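The library-size figures quoted above follow from simple counting over the 20 natural amino acids. The short Python sketch below reproduces them; the function names are illustrative assumptions and are not part of the study's software.

```python
ALPHABET_SIZE = 20  # natural amino acids


def sequence_space(length, alphabet_size=ALPHABET_SIZE):
    """Number of distinct peptides of a given length."""
    return alphabet_size ** length


def max_cluster_size(peptide_len=6, core_len=4, alphabet_size=ALPHABET_SIZE):
    """Upper bound on the hexamers that contain one fixed tetramer.

    The tetramer can sit in (peptide_len - core_len + 1) frames, and the
    remaining flanking positions are free to vary.
    """
    frames = peptide_len - core_len + 1
    flanking_positions = peptide_len - core_len
    return frames * alphabet_size ** flanking_positions


hexamer_space = sequence_space(6)    # randomized hexapeptide library
tetramer_space = sequence_space(4)   # possible P3-P1' tetramers
cluster_max = max_cluster_size()     # hexamers per tetramer cluster
print(hexamer_space, tetramer_space, cluster_max)
```

The cluster size of 1200 is an upper bound: for repetitive tetramers the same hexamer can arise in more than one frame, so the number of unique hexamers is slightly smaller.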

A. Interactions between individual subsites in the catalytic cleft of MMPs and the corresponding positions in the peptide substrates determine substrate fitness. S3 and S1' are the most selective binding sites in the catalytic cleft of the MMPs (bold red lettering). Together with S2 and S1, they interact with P3-P1' tetramers in substrates. Changes in the repertoires of the P3 and P1' residues in substrates (bold green lettering) affect substrate fitness the most. The fitter a particular tetramer is as a substrate, the larger the population of hexamers containing that tetramer in the substrate set (the tetramer cluster) will be. The number of amino acid hexamers per tetramer cluster (1-1200) is a measure of the fitness of a particular tetramer as a substrate. The larger the number of tetramer clusters in the substrate population (1-160,000) of a given protease, the less defined the specificity of that enzyme is.

B. The phage display library used for specificity profiling of MMPs has a diversity matching the theoretical maximum. To interrogate the specificity of MMP-2 and 9, we used a library of randomized hexapeptides displayed on gene 3 of M13 phage. The theoretical maximum number of hexamer combinations is $6.4 \times 10^{7}$. The theoretical maximum number of hexapeptides in a tetramer cluster is 1200. There are 160,000 combinations of natural amino acid residues in random tetramers.

The S3 and S1' binding pockets along the catalytic cleft of MMPs, together with S2 and S1 between them, form a tetramer binding unit (Fig. 1A). The same tetramer combination of residues can occupy three different frames in a hexapeptide (Fig. 1B). When all the hexamers containing a given tetramer are present in the library, they form a 1200-member tetramer cluster (Fig. 1B). The number of unique hexamer peptides containing the same tetramer (Fig. 1A) [...] performed NGS analysis of the naïve (or initial) phage display library and also grouped the hexamer peptide sequences into tetramer clusters (Fig. 1B, S1 Table, S2 Table), but without eliminating redundancy (i.e. allowing the same hexamer to belong to more than one tetramer cluster), since no selective pressure that could influence the distribution of the tetramer clusters was applied in generating this set. Next, we compared the distributions of relative abundances of tetramer clusters in the naïve phage display library and the MMP-2 and 9 substrate selections. As can be seen in Fig. 2A [...]

$$H(T) = -\sum_{t \in T} P(t) \log_2 P(t) \qquad (1)$$

where $H(T)$ is the Shannon entropy of the distribution of tetramer clusters ($T$), each with probability $P(t)$. The probability of a given tetramer can be defined as:

$$P(t) = \frac{n_t}{\sum_{i \in T} n_i} \qquad (2)$$

where $P(t)$ is the probability of the $t$-th tetramer, calculated as the ratio between the number of hexamers in that tetramer cluster ($n_t$) and the number of hexamers in the entire set of tetramer clusters (160,000 tetramers). The naïve phage display library has a Shannon entropy value of 17.218 (S5 Table, S6 Table), which is not very far from that of a uniform distribution of tetramer clusters, 17.288 ($\log_2 160{,}000$). This is an important characteristic of the library we used for substrate selections that gives an idea of its [...] MMP-2 and 9, respectively (S5 Table, S6 Table), which are significantly lower than that of the naïve library, as expected based on the changes in the probability distributions between the naïve library and the substrate selections (Fig. 2A).
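As an illustration of the entropy calculation described above, the following minimal Python sketch derives tetramer probabilities from cluster sizes and computes the Shannon entropy in bits. The function names and the toy uniform library are assumptions made for illustration; they are not taken from the study's analysis pipeline.

```python
import math


def tetramer_probabilities(cluster_sizes):
    """Map each tetramer to n_t / sum(n_i), skipping empty clusters."""
    total = sum(cluster_sizes.values())
    return {t: n / total for t, n in cluster_sizes.items() if n > 0}


def shannon_entropy(probabilities):
    """Shannon entropy in bits of a tetramer probability distribution."""
    return -sum(p * math.log2(p) for p in probabilities.values() if p > 0)


# Hypothetical library in which all 160,000 tetramer clusters are equally populated.
uniform_library = {index: 1 for index in range(160_000)}
h_uniform = shannon_entropy(tetramer_probabilities(uniform_library))

# A library dominated by a single cluster has much lower entropy.
skewed_library = {"PLGL": 97, "AAAA": 1, "GGGG": 1, "SSSS": 1}
h_skewed = shannon_entropy(tetramer_probabilities(skewed_library))
```

For the uniform case the result equals $\log_2 160{,}000$, the upper limit against which the naïve library value of 17.218 is compared in the text.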

To characterize the distribution of probabilities of tetramer clusters in the substrate sets in terms of substrate fitness related to catalytic efficiency, we introduced the ratio between the probabilities (Relative Probability, RP) of a tetramer cluster in the substrate selection and in the naïve phage display library, which can be used as a measure of the substrate fitness of a tetramer relative to all others in a substrate set:

$$RP(t) = \frac{P_S(t)}{P_{NL}(t)} \qquad (3)$$

where $P_S(t)$ is the probability of the $t$-th tetramer cluster in the substrate selection and $P_{NL}(t)$ is the probability of that tetramer cluster in the naïve phage display library. Furthermore, the use of RP eliminates potential biases in the tetramer probability distributions of the substrate sets that arise from deviation from uniformity of the tetramer probability distribution in the naïve library and from potential differences in sequencing depth between the two sets.

The RP value for tetramer clusters in substrate sets has a theoretical range of maximum values between 1 (for a non-specific protease cleaving all tetramer substrates with equal efficiency: $\frac{1/160{,}000}{1/160{,}000} = 1$) [...]
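The ratio of probabilities described above can be sketched in a few lines of Python. The tetramer names and hexamer counts are invented for illustration, and clusters absent from the naïve library are simply skipped, which is an assumption of this sketch rather than a documented step of the study.

```python
def relative_probability(selection_counts, naive_counts):
    """RP(t) = P_S(t) / P_NL(t) for every tetramer present in the naive library."""
    selection_total = sum(selection_counts.values())
    naive_total = sum(naive_counts.values())
    rp = {}
    for tetramer, naive_n in naive_counts.items():
        if naive_n == 0:
            continue  # RP is undefined without a reference probability
        p_selection = selection_counts.get(tetramer, 0) / selection_total
        p_naive = naive_n / naive_total
        rp[tetramer] = p_selection / p_naive
    return rp


# Toy hexamer counts per tetramer cluster (hypothetical values).
naive = {"PLGL": 10, "AAAA": 10, "GGGG": 20}
selection = {"PLGL": 30, "AAAA": 10}
rp = relative_probability(selection, naive)

# Doubling every selection count mimics a deeper sequencing run.
deeper = {tetramer: 2 * n for tetramer, n in selection.items()}
rp_deeper = relative_probability(deeper, naive)
```

Because both sets are converted to probabilities before the ratio is taken, `rp_deeper` matches `rp`, which illustrates why RP is insensitive to differences in sequencing depth.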

Importantly, RP must correlate with relative substrate fitness across its range of values for a given protease. To validate this assumption, we used the data obtained for a published set of 1369 phage substrates with experimentally determined scissile bonds and $K_{(obs)}$ values (S7 Table) [10]. In this set, of all substrates containing non-redundant tetramers, only 1.2% and 1.9% had no matching tetramer clusters in the MMP-2 and MMP-9 substrate selections, respectively. This observation confirms the accuracy of the P3-P1' assignments in the tetramer clusters of the substrate selections. Next, we performed a standard statistical binary classification test (S8 Table) at increasing values of RP to determine whether RP is a good predictor of a phage-displayed hexamer peptide being a substrate. In this analysis, all P3-P1' tetramers with RP values above a certain threshold and a non-zero value of $K_{(obs)}$ were considered true positives (TP). All tetramers with RP values below that threshold and a $K_{(obs)}$ equal to 0 were considered true negatives (TN). If the RP value was above the threshold but $K_{(obs)}$ was equal to 0, the tetramer was classified as a false positive (FP). Finally, tetramers with RP values below the threshold but a non-zero $K_{(obs)}$ were classified as false negatives (FN). [...] The Matthews Correlation Coefficient (MCC) improved significantly when the RP threshold value increased from 0 to 1, mostly due to a decrease in the FP rate and an increase in the TN rate (S8 Table). A further increase in the RP threshold value did not result in a significant change in MCC, but the FP and the TN rates continued to decline. This analysis indicates that there is a positive correlation between $K_{(obs)}$ of individual phage substrates and RP of the matching P3-P1' tetramers.
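The classification scheme described above can be sketched as follows. This is a minimal Python illustration with invented (RP, K(obs)) pairs; treating an RP exactly equal to the threshold as "below" is an assumption, since the text does not specify that case.

```python
import math


def confusion_counts(records, rp_threshold):
    """Count TP, FP, TN, FN for (rp, k_obs) pairs at one RP threshold."""
    tp = fp = tn = fn = 0
    for rp, k_obs in records:
        predicted_substrate = rp > rp_threshold
        measured_substrate = k_obs > 0
        if predicted_substrate and measured_substrate:
            tp += 1
        elif predicted_substrate:
            fp += 1
        elif measured_substrate:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def matthews_cc(tp, fp, tn, fn):
    """Matthews Correlation Coefficient; 0.0 when a marginal total is empty."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical (RP, K_obs) pairs for six P3-P1' tetramers.
records = [(5.0, 1200.0), (2.0, 300.0), (0.5, 0.0),
           (0.2, 0.0), (1.5, 0.0), (0.8, 50.0)]
counts_at_1 = confusion_counts(records, rp_threshold=1.0)
mcc_at_0 = matthews_cc(*confusion_counts(records, rp_threshold=0.0))
mcc_at_1 = matthews_cc(*counts_at_1)
```

Scanning `rp_threshold` over a range of values and recording the MCC at each step gives the kind of threshold profile summarized in S8 Table.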

Next, we analyzed the correlation between RP of tetramer clusters and $K_{(obs)}$ of the hexamer substrates containing the matching P3-P1' tetramers. Only the averages of the catalytic efficiency constants of the hexamer substrates contributing to a particular tetramer cluster are expected to correlate with the RP value of that cluster, because residues outside the P3-P1' tetramer affect the catalytic efficiency to some degree. It is impractical to determine the $K_{(obs)}$ value of each hexamer substrate in each tetramer cluster to obtain the averages, so we used the following approach to make the correlation analysis feasible. First, we obtained the RP values for the tetramer clusters [...]

This limits the range of the $K_{(obs)}$ values that can be obtained to between 0 and 12,792 M$^{-1}$ s$^{-1}$. Consequently, all substrates with a true $K_{(obs)}$ above this value will nevertheless be assigned a $K_{(obs)}$ equal to the preset maximum. With this limitation in mind, we corroborated the results using synthetic peptides, thereby extending the correlation to the entire range of $k_{cat}/K_M$ values for each MMP. Sequences of the 100-peptide set used for the analysis were derived from phage substrate selections, and their $k_{cat}/K_M$ values were experimentally determined as described in [33] (S9 Table). The correlation analysis was carried out the same way as for [...] (S3 Table). Therefore, these probability distributions must be representative of the entire ranges of the respective catalytic cleft specificities. Since the majority of the tetramer clusters in the substrate sets of MMP-2 and 9 have probabilities lower than in the naïve library and constitute rare events, they must be relatively poor substrates and therefore contribute little, if at all, to the specificity of the two enzymes. We asked whether one could find an appropriate threshold to select the tetramer substrates with a statistically significant contribution to the specificity of the catalytic cleft. To that end, [...] distributions of tetramer clusters in substrate selections and the naïve phage display library, thus reflecting the specificity of the catalytic cleft:

$$D_{KL}(P_S \,\|\, P_{NL}) = \sum_{t \in T} P_S(t) \log_2 \frac{P_S(t)}{P_{NL}(t)} \qquad (4)$$

where $D_{KL}(P_S \,\|\, P_{NL})$ is the K-L divergence between the tetramer probability distributions in the selections, $P_S(t)$, and the naïve library, $P_{NL}(t)$, defined on the same probability space $T$.

The K-L divergence, or relative entropy, measures how one probability distribution differs from another, reference distribution. The larger the value of relative entropy, the more divergent the probability distributions of the test and reference sets are. K-L divergence values can range from 0 for a protease with no specificity to 17.288 for a perfectly specific protease with a single tetramer substrate, assuming a uniform probability distribution for the reference set. We performed K-L divergence analysis using the probability distributions of tetramer clusters in the MMP selections as the test sets and those in the naïve library as the reference set. The relative entropies are 3.173 and 3.495 (S5 Table and S6 Table) for MMP-2 and 9 tetramer clusters, respectively, indicating that MMP-9 has a narrower specificity than MMP-2, although not by much. The total number of tetramer clusters with non-zero probabilities $P(t)$, and thus non-zero contributions to the K-L divergence, is 78,757 and 76,696 for MMP-2 and 9, respectively. The plots of the cumulative sum of the individual components of the expected value calculated with equation 4, as a function of RP, have two distinct parts for MMP-2 and 9: one below and the other above the zero value of K-L divergence (Fig. 2B). While the former has no net contribution to K-L divergence, the latter is the sole contributor. The RP value at the intersection of the curve with the X-axis is a useful threshold for defining the set of tetramer clusters which, as a whole, is unique to a given protease and therefore represents its "selectome". These values are 4.5 and 4.7 for MMP-2 and 9, respectively (indicated by the red arrow in Fig. 2B). There are 7,921 and 6,094 tetramers above the RP threshold, belonging to the MMP-2 and 9 selectomes, respectively (S5, S6 Tables). They constitute 8-10% of all tetramers with a non-zero value of RP. Another useful threshold, at which the probabilities of finding a tetramer in the substrate selection and in the naïve library are the same, occurs at the RP value of 1. Tetramer clusters with RP values greater than 1 are considered optimal substrates (marked by the green downward arrow in Fig. 2B), since their individual contributions to the K-L divergence are positive. The numbers of tetramers with RP values above 1 are 16,395 in the MMP-2 and 15,581 in the MMP-9 substrate sets. Tetramer clusters with RP values less than 1 are considered suboptimal substrates, since they contribute negatively to the K-L divergence. Thus, out of the total of 78,757 MMP-2 and 76,696 MMP-9 non-zero tetramer clusters (i.e. containing at least one hexamer) found in the selection sets, ~20% are optimal substrates. Based on the statistically defined threshold introduced above, the selectome constitutes a unique subset of the substrates of a given protease.
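One way to read the threshold selection described above is sketched below in Python: per-tetramer K-L terms are accumulated in order of increasing RP, and the RP at which the running sum first becomes positive is taken as the selectome threshold. The accumulation order is our interpretation of the description of Fig. 2B, and the four-tetramer distributions are invented for illustration.

```python
import math


def kl_divergence(p_selection, p_naive):
    """K-L divergence in bits between selection and naive distributions."""
    return sum(p * math.log2(p / p_naive[t])
               for t, p in p_selection.items() if p > 0)


def selectome_threshold(p_selection, p_naive):
    """RP at which the cumulative K-L sum, ordered by RP, first turns positive."""
    rows = sorted((p / p_naive[t], p * math.log2(p / p_naive[t]))
                  for t, p in p_selection.items() if p > 0)
    running_sum = 0.0
    for rp, term in rows:
        running_sum += term
        if running_sum > 0:
            return rp
    return None  # the distributions do not diverge


# Hypothetical distributions over four tetramers (uniform naive library).
p_naive = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}
p_selection = {"a": 0.05, "b": 0.15, "c": 0.30, "d": 0.50}

divergence = kl_divergence(p_selection, p_naive)
threshold = selectome_threshold(p_selection, p_naive)
selectome = {t for t, p in p_selection.items() if p / p_naive[t] >= threshold}
```

In this toy case tetramer "c" has RP above 1 and therefore counts as an optimal substrate, but only "d" lies at or above the threshold, mirroring the distinction between optimal substrates and the selectome drawn in the text.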

To corroborate the findings of the K-L divergence analysis, we looked at the distributions of tetramer clusters across the RP range in 10% increments from highest to lowest (S1(A) Fig.). The number of tetramer clusters across the RP range increases slowly until it reaches the lowest 10%, where it increases dramatically. S1(B) Fig. shows [...] which is consistent with the percentages of tetramer clusters in the selectomes of MMP-2 and 9. To put these data in perspective, one must keep in mind that the tetramer clusters with a positive cumulative contribution to K-L divergence (the selectome) in the set of MMP-2 substrates contain $2.31 \times 10^{6}$ hexamer substrates, while those with zero cumulative contribution to K-L divergence (RP interval between 0 and 4.5) contain only $0.56 \times 10^{6}$. The corresponding numbers for MMP-9 are $1.64 \times 10^{6}$ and $0.54 \times 10^{6}$, respectively. So, 80% of the hexamer substrates of MMP-2 and 75% of those of MMP-9 belong to their respective selectomes. This observation provides a basis for the conclusion that the catalytic cleft specificity of MMP-2 and 9 is primarily defined by the S3-S1' subsites, as expected. The poorly populated tetramer clusters are represented by sequences that, as P3-P1' tetramers, are suboptimal substrates, whose fitness may be modulated by exosites outside S3-S1' and which are found in the minority (20-25%) of the hexamer substrates.

In this section of the results we have developed a concept of substrate specificity that we call the "selectome", which, though intuitive, is not easy to grasp. To the best of our knowledge, there have been no prior reports of an approach aimed at defining the set of substrates that fully captures the substrate specificity of a protease. In the following sections, we will substantiate this concept by applying it to analyses of catalytic cleft selectivity between MMP-2 and 9 and of the contribution of catalytic cleft specificity to protein substrate recognition.

We used a simple trapezoidal formula for numerical integration to calculate TSF.

A non-specific protease will have $1.6 \times 10^{5}$ tetramer clusters times the $RP/RP_{Max}$ value of 1, which yields a TSF value of $1.6 \times 10^{5}$ (Fig. 1B). A perfectly specific protease will have 1 tetramer cluster times the $RP/RP_{Max}$ value of 1, yielding a TSF value of 1. According to this calculation, the TSF values of the MMP-2 and 9 selectomes are 691 and 511, respectively (Fig. 4A). Thus, the catalytic cleft specificity of MMP-9 is narrower and constitutes 74% of that of MMP-2.
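The two limiting cases above can be reproduced with a short trapezoidal integration in Python. This is a sketch under stated assumptions: the curve is taken to be $RP/RP_{Max}$ plotted against tetramer cluster rank with unit spacing, and it is anchored at rank 0 so that a single cluster contributes an area of 1. Neither detail is confirmed by the text, and the function name `tsf` is ours.

```python
def tsf(rp_values):
    """Trapezoidal area under RP/RP_Max versus tetramer cluster rank."""
    if not rp_values:
        return 0.0
    rp_max = max(rp_values)
    heights = sorted((rp / rp_max for rp in rp_values), reverse=True)
    # Assumption: repeat the first height at rank 0 so the curve spans 0..n.
    heights = [heights[0]] + heights
    return sum((heights[i] + heights[i + 1]) / 2
               for i in range(len(heights) - 1))


non_specific = tsf([1.0] * 160_000)   # all tetramers equally fit
perfectly_specific = tsf([7.3])       # a single tetramer cluster
toy = tsf([4.0, 2.0])                 # heights 1.0 and 0.5
ratio = 511 / 691                     # MMP-9 relative to MMP-2, as reported
```

With these assumptions the function returns 160,000 for the non-specific case and 1 for the perfectly specific case, and the reported selectome values give a ratio of about 0.74.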

To analyze the composition of the MMP-2 and 9 substrates in respective selectomes, hexamers

In this section we have introduced, for the first time, a measure of protease specificity that reflects both substrate numbers and their relative catalytic efficiencies. We have also shown how substrate composition changes across the fitness range, which provides valuable insight into the correlation between catalytic efficiency and subsite specificity.

Comparative analysis of selectomes reveals distinctions between selectivity determinants of MMP-2 and 9.

One of the central obstacles to understanding protease biology is the functional redundancy and specificity overlap between proteases from the same phylogenetic groups [4]. The selectome profiling presented in this study makes it possible to determine how much overlap and distinction there is between the specificities of closely related proteases. The catalytic domains of human MMP-2 and 9 are 73% identical and 81% similar in their amino acid sequences. Direct comparison reveals that out of the total of 10,110 tetramers comprising the combined selectomes, 3,902 are shared by both, and 4,019 and 2,189 are found exclusively in the respective selectomes of MMP-2 and 9 (Fig. 5A, S10-S13 Tables). Thus, a pair of 73% identical proteases has only 39% of the combined selectomes in common, demonstrating a significant amount of S3-S1' distinction between the two MMPs. MMP-2 has the broader specificity of the pair, with 40% unique tetramers, while MMP-9 has only 22%, almost two-fold less than its closest relative in the MMP family.
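The overlap percentages quoted above follow directly from the reported counts, and the same comparison can be run on any pair of selectomes with plain set operations. The Python sketch below shows both; the toy tetramer sets are hypothetical and the function names are ours.

```python
def overlap_summary(selectome_a, selectome_b):
    """Fractions of the combined selectome that are shared or unique."""
    combined = len(selectome_a | selectome_b)
    return {
        "shared": len(selectome_a & selectome_b) / combined,
        "only_a": len(selectome_a - selectome_b) / combined,
        "only_b": len(selectome_b - selectome_a) / combined,
    }


def percent(part, whole):
    """Percentage rounded to the nearest integer."""
    return round(100 * part / whole)


# Counts reported in the text for the combined MMP-2 and MMP-9 selectomes.
reported = {"shared": 3902, "mmp2_only": 4019, "mmp9_only": 2189}
combined_total = sum(reported.values())
shares = {name: percent(n, combined_total) for name, n in reported.items()}

# Toy example with hypothetical P3-P1' tetramers.
toy = overlap_summary({"PLGL", "PAGL", "PSGL"}, {"PLGL", "PFGM"})
```

The reported counts sum to 10,110 and give shares of 39%, 40% and 22%, matching the figures in the paragraph above.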

By comparing the compositions of the subsites contributing to selectivity between MMP-2 and 9, one can account for the distinctions observed between the unique substrate sets (Fig. 6). SDPs at S4/S3 junction

S2, S1 and S1' binding pockets are shown on the surface representations of the three-dimensional structures of MMP-2 and 9 in colors matching the sequence alignments in A. See text for more details. The PyMOL molecular visualization system was used for display and analysis of 3D structures.

Another notable difference between the MMP-2 and 9 specificities is evident from the repertoire of residues at the P2 position of the selective substrates, as discussed above. The dominance of Ala, Gly and Ser at P2 of the MMP-2 selective substrates contrasts with the preponderance of bulky aromatic and, to some extent, aliphatic side chain residues in the MMP-9 selective substrates (Fig. 4C, S2 Fig.). This observation is consistent with the differences in composition of the S2 binding pocket, which is sandwiched between the S2/S3 Ala179 and S2 Glu210 in MMP-2 and between the S2/S3 Pro192 and S2 Asp235 in MMP-9 (Fig. 6). The distance between these residues is 5.5 Å in MMP-2 vs. 6.2 Å in MMP-9, in agreement with the observed differences in P2 composition of the respective unique substrates. Additionally, the bulkier Glu210 narrows the catalytic cleft in MMP-2 to 12 Å from 14.2 Å in MMP-9, which has a more compact Asp235 in that position (Fig. 6).

Notably, P1 makes no significant contribution to selectivity, as expected from the identical residues at the S1 SDPs of both enzymes (Leu163 in MMP-2 and Leu187 in MMP-9).

Differences in the P1' composition of the selective tetramers are more difficult to explain structurally owing to the complexity of the S1' binding site, formed by an allosteric hydrophobic tunnel preferentially occupied by Leu, Trp, Met and Ile residues in the selective substrates of MMP-2. In MMP-9 selective substrates, P1' Leu, the residue preferred by the S1' pocket of the entire MMP family, is virtually absent and becomes noticeable only in the lower (0.2-0.5) $RP/RP_{Max}$ range of the MMP-9 tetramer clusters. Out of the 18 residues comprising the S1' loop, 10 differ between MMP-2 and 9, with 5 non-conserved substitutions. The fact that the selective substrates of the two enzymes differ significantly in their repertoires of P1' residues is consistent with the significant differences in SDP composition of the S1' binding pocket between the two enzymes. Of note, one of the SDPs forming the opening of the S1' tunnel differs between the two enzymes (Ile212 and Met237) and could be a significant contributor to selectivity.

Importantly, the composition of the substrates comprising the overlapping set of 3,902 tetrameric clusters is consistent with the classic PxxL pattern common to most MMPs [10] (Fig. 5C). Interestingly, even though the combined specificity profiles of the tetrameric clusters shared by both enzymes are nearly identical, there are noticeable differences in the profiles of tetrameric clusters at different RP levels (S2 Fig.).

In this section, for the first time, we provided a quantitative measure of overlap and distinction between the specificities of closely related proteases that accounts for both substrate number and fitness. In [...] identifying SDPs responsible for selectivity between these closely related enzymes.

Telling a target from a bystander: contribution of the catalytic cleft specificity to protein substrate recognition.

One of the questions central to understanding protease function is how to distinguish between targeted and coincidental proteolytic events. It stands to reason that proteases and their physiologic substrates co-evolved to be integral parts of complex physiological processes [4,40,41] [...] physiologically relevant proteolytic event to be integrated into the larger context of underlying biology.

To assess the contribution of catalytic cleft specificity to physiologic substrate recognition by MMP-2 and 9, we used data on protein substrate hydrolysis obtained by us and those available in the literature. The data set published in [42] was taken as a benchmark for protein cleavage site identification owing to the rigor of its data analysis and independent verification (S14 Table). Based on the comparison of this data set with ours, 81 and 84% of all the cleavages in the MMP-2 and 9 substrate sets are optimal substrates. [...] These numbers are not very far from the probability (86%) of unambiguous identification of cut sites of a protease with known specificity (Glu C), which the authors used to validate the statistical model for cleavage site identification in their study [42]. The rest of the identified cleavages (19% for MMP-2 and 16% for MMP-9) are either suboptimal substrates (RP<1, 13.6% for MMP-2 and 10.5% for MMP-9) [...] there is a very good correlation between a cut site being part of the selectome of MMP-2 or 9 and also being a validated substrate of the same MMP.

In the publication we used as the benchmark [42], the criteria for cleavage site identification were set very stringently: the ratios between the iTRAQ reporter ion intensities in the MMP-treated samples and the untreated controls had to be ≥ 10 for N-terminally labeled peptides to meet the statistical threshold to be considered candidate cleavage sites. This was done to achieve a reasonable [...] values above the selectome thresholds for MMP-2 (RP>4.5) and MMP-9 (RP>4.7) are predominantly found in the intervals with σ > 1 above the average IE (Fig. 7). The further the IE values fall below σ = 1, the higher the proportion of N-termini with RP values below 1. Based on these observations, in our study, the IE cutoff for calling a labeled N-terminus a cleavage site resides one standard deviation above the [...] performed binary classification analysis of the data shown in Fig. 7. The results demonstrate that an RP value above the selectome threshold (RP=4.5 for MMP-2 and 4.7 for MMP-9) is the best predictor (MCC = 0.502 for MMP-2 and 0.435 for MMP-9) of an N-terminal peptide having an IE value above σ = 1 relative to the population mean (S16 Table). These data are highly consistent with what we observed using the benchmark data set discussed above and provide a basis for distinguishing between the true positive and [...]

Following hydrolysis with MMP-2 or 9, the secretome of HEK293 cells was labeled with TMT isobaric tags. Isotopic enrichment (IE) of the novel N-terminally labeled peptides in the MMP-treated samples [...] secretomes (see text and Materials and Methods for details). N-terminally labeled peptides were split into groups based on their IE, expressed as multiples of the standard deviation away from the population mean.

MEROPS is a rich source of data for specificity profiling of proteases [2,11]. It is therefore of interest to examine how well the information on MMP-2 and 9 cleavages compiled from a wide variety of experimental studies is matched by our criteria for specificity. As can be seen in S17 Table, 65 and 50% of all cleavages are optimal substrates (RP>1), out of which 55% and 36% belong to the selectomes of MMP-2 and MMP-9, respectively. Of the cleavages with RP values below 1, 23 and 32% constitute suboptimal substrates and 12 and 18% are not substrates of MMP-2 and 9 based on our criteria. Of the published "physiologic" substrates, 53 and 45% are optimal, out of which 44 and 31% belong to the selectomes of MMP-2 and MMP-9, respectively. In the "physiologic" substrate category, 31 and 35% of cut sites belong to the sets of suboptimal substrates and 16 and 20% are not substrates of MMP-2 and MMP-9.

In summary, based on our analysis, the catalytic cleft specificity is an important determinant of

This is a very valuable aspect of selectome-based substrate specificity profiling, as it provides a basis for distinguishing the specificities of closely related members of protease families in order to develop selective activity probes and inhibitors.

We developed methodology for quantification of substrate specificity based on both the number of

In relative terms, based on their analysis, the specificities of MMP-2 and 9 are $20^{8}/20^{7.386} = 6.29$-fold and $20^{8}/20^{7.078} = 15.83$-fold narrower than that of a protease with no specificity. Our data based on the [...] specificities are $2^{17.288}/2^{13.93} = 10.25$-fold and $2^{17.288}/2^{13.67} = 12.28$-fold narrower than that of a randomly specific protease. The two analyses are in reasonably good agreement on the overall specificities of the two enzymes. These numbers imply that approximately 10% of all peptide bonds in proteins available for cleavage are substrates of MMP-2 and 9, which makes every protein in the human proteome a potential target with at least one cleavage site. Based on the TSF quantitation of selectomes, MMP-2 and 9 are 160,000/691 = 232-fold and 160,000/511 = 313-fold more specific than a non-specific protease, respectively. This implies that if one knows the composition of the selectome of a protease and takes into account the substrate numbers and relative fitness to quantify specificity, both enzymes are much more specific than can be assessed just from residue frequencies relative to the scissile bond of all the cleavage sites derived from the substrate sequences.
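The fold-narrowing figures quoted above can be re-derived with a few lines of arithmetic. The following is a minimal sketch, not the authors' analysis code: the function name `fold_narrowing` and the variable names are hypothetical, and the exponents and denominators are simply the values quoted in the text.

```python
import math

def fold_narrowing(full_exponent, observed_exponent, base):
    """Return base**full / base**observed, i.e. how many times smaller the
    observed sequence space is than the unrestricted one."""
    return base ** (full_exponent - observed_exponent)

# Base-20 exponents quoted in the text for the residue-frequency analysis.
freq_mmp2 = fold_narrowing(8, 7.386, base=20)
freq_mmp9 = fold_narrowing(8, 7.078, base=20)

# Base-2 exponents (bits) quoted in the text for the second analysis.
bits_mmp2 = fold_narrowing(17.288, 13.93, base=2)
bits_mmp9 = fold_narrowing(17.288, 13.67, base=2)

# 160,000 equals 20**4, the number of possible tetramers; the denominators
# 691 and 511 are the values quoted in the text for MMP-2 and MMP-9.
TETRAMER_SPACE = 20 ** 4
tetramer_bits = math.log2(TETRAMER_SPACE)  # about 17.288 bits
tsf_mmp2 = TETRAMER_SPACE / 691
tsf_mmp9 = TETRAMER_SPACE / 511

for label, value in [
    ("frequency-based, MMP-2", freq_mmp2),
    ("frequency-based, MMP-9", freq_mmp9),
    ("bit-based, MMP-2", bits_mmp2),
    ("bit-based, MMP-9", bits_mmp9),
    ("TSF-based, MMP-2", tsf_mmp2),
    ("TSF-based, MMP-9", tsf_mmp9),
]:
    print(f"{label}: {value:.2f}-fold")
```

Note that the quoted exponent 17.288 coincides with log2 of 160,000, so the bit-based and TSF-based comparisons appear to share the same tetramer reference space, whereas the base-20 comparison uses an exponent of 8.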

Relevance to proteolysis of folded proteins.

Using the results of a published rigorous study [42] and our own data, we determined the relevance of our selectome-based approach to identification of cleavage sites in folded proteins. Our analysis of the published results shows that 70 to 80% of the protein substrates identified with high confidence belong to the selectomes of MMP-2 and 9. Our own analysis of cleavages in folded proteins, based on enrichment of novel N-termini following MMP treatment, demonstrates that an RP above the selectome threshold for the matching P3-P1' tetramers is the best predictor for enrichment of the corresponding N-termini greater than σ = 1 above the population mean (S16 Table).

Selectome-based quantification of substrate specificity shows that MMP-2 and 9 are ~200-300-fold more specific than a randomly specific protease, which implies that 1 out of 200-300 peptide bonds in proteins has the same probability of being cleaved by MMP-2 and 9 as by a non-specific protease.

Given the average length of a eukaryotic protein (472 residues [45]), about 2 peptide bonds per protein are potentially relevant cleavage sites of either MMP. Given that in order to be accessible to proteolysis [...] proteins is 20-30% [47], less than 1 peptide bond per average-size eukaryotic protein is potentially a cleavage site that belongs to the selectome of either MMP-2 or 9. This is consistent with the notion of substrate-protease co-evolution necessary for regulatory proteolysis to take place in a living organism [48], where protease activity is strictly limited to physiologically compatible levels by natural inhibition and pro-enzyme latency. A corollary to that is that the relatively poor substrates, constituting most of
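The per-protein estimate in this passage (about 2 candidate bonds per average protein, dropping below 1 once accessibility is considered) can be reproduced as a back-of-envelope calculation. The sketch below is illustrative only: the helper `expected_selectome_sites` is hypothetical, and it assumes that the 20-30% figure cited from [47] can be treated as the fraction of peptide bonds accessible to the protease, which is an interpretation rather than something the surviving text states explicitly.

```python
def expected_selectome_sites(protein_length, fold_specificity,
                             accessible_fraction=1.0):
    """Expected number of selectome-grade cleavage sites in one protein.

    protein_length      -- number of residues (472 is the average quoted in the text)
    fold_specificity    -- fold increase in specificity over a non-specific
                           protease (about 200-300 in the text)
    accessible_fraction -- ASSUMPTION: share of peptide bonds accessible to
                           the protease (0.2-0.3 is used here as an illustration)
    """
    peptide_bonds = protein_length - 1
    return peptide_bonds * accessible_fraction / fold_specificity

AVERAGE_LENGTH = 472  # residues, value quoted in the text

# Without any accessibility correction: roughly 1.5 to 2.4 bonds per protein.
for fold in (200, 232, 313):
    sites = expected_selectome_sites(AVERAGE_LENGTH, fold)
    print(f"{fold}-fold specificity: {sites:.2f} sites per protein")

# With the assumed accessible fraction: below 1 bond per protein.
for fraction in (0.2, 0.3):
    sites = expected_selectome_sites(AVERAGE_LENGTH, 200, fraction)
    print(f"accessible fraction {fraction}: {sites:.2f} sites per protein")
```

Even at the least stringent combination used here (200-fold specificity, 30% accessibility), the expected count stays below one site per average-size protein, in line with the conclusion drawn in the text.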

The work presented here establishes a new approach to studying the substrate specificity of proteases and possibly other enzymes involved in posttranslational modification of proteins [50]. It is based on statistically saturated data sets and a new way of applying information theory to quantitatively define the substrate specificity of proteases by employing a novel concept of "selectome". In practical terms, this approach can be invaluable for developing highly selective activity probes and inhibitors for closely related members of large protease families. By providing a measure of catalytic efficiency, our approach can also be used to help determine which cleavages in folded proteins represent physiologic and pathologic targets and which are bystander proteolytic events.

Expression and Purification of Recombinant Catalytic Domains and activity assays.

The recombinant catalytic domains of MMP-2 and -9 were expressed in HEK293 cells stably transfected with the respective constructs and purified from serum-free culture medium using Gelatin Sepharose 4B (GE

TFA and centrifuged to remove insoluble material. The supernatants were then desalted using Sep-Pak

Dried pooled sample was reconstituted in 20 mM ammonium formate pH ~10 and fractionated using a

Performance Liquid Chromatography (UPLC) system (Waters). Peptides were then separated using a 35-min gradient: 5% to 18% B in 3 min, 18% to 36% B in 20 min, 36% to 46% B in 2 min, 46% to 60% B in 5 min, and 60% to 70% B in 5 min (A = 20 mM ammonium formate, pH 10; B = 100% ACN). A total of 32 fractions were collected and pooled in a non-contiguous manner into 16 total fractions. Pooled fractions were dried to completeness in a SpeedVac concentrator prior to mass spectrometry analysis.

[...] information about: a) the amino acid sequence of tetramer cluster, b) rank of the tetramer cluster calculated using relative probability, c) number of hexamers in a cluster from MMP set and d) number [...]

[...] amino acid tetramer sequences for P3-P1' positions, c) rank of tetramer based on RP value, d) relative probability (RP), e) measured $k_{cat}/K_M$ ($\mathrm{M^{-1}\,s^{-1}}$), f-g) standard deviation and standard error for measured $k_{cat}/K_M$ ($\mathrm{M^{-1}\,s^{-1}}$), based on triplicate experiments for each experiment.
S10 Table. Analysis of combined selectomes of MMP-2 and 9. A. List of unique tetramer clusters and corresponding hexamer sequences aligned across P3-P1' positions of substrates belonging to MMP-2 selectome. Tetramers are ranked according to MMP-2 relative probability. The number of hexamers in a tetramer cluster depends on which MMP ranking was applied. Information about the tetramer amino acid sequence, rank, number of hexamers and relative probability of a cluster is provided in the header for each tetramer cluster.

S11 Table. Analysis of combined selectomes of MMP-2 and 9. List of unique tetramer clusters and corresponding hexamer sequences aligned across P3-P1' positions of substrates belonging to MMP-9 selectome. Tetramers are ranked according to MMP-9 relative probability. Information about the tetramer amino acid sequence, rank, number of hexamers and relative probability of a cluster is provided in the header for each tetramer cluster.

S12 Table. Analysis of combined selectomes of MMP-2 and 9. List of tetramer clusters common between the selectomes of MMP-2 and 9 together with the corresponding hexamer sequences aligned across P3-P1' positions of substrates. Tetramers are ranked according to MMP-2 relative probability. Information about the tetramer amino acid sequence, rank, number of hexamers and relative probability of a cluster is provided in the header of each tetramer cluster.

S13 Table. Analysis of combined selectomes of MMP-2 and 9. List of tetramer clusters common between the selectomes of MMP-2 and 9 together with the corresponding hexamer sequences aligned across P3-P1' positions of substrates. Tetramers are ranked according to MMP-9 relative probability. Information about the tetramer amino acid sequence, rank, number of hexamers and relative probability of a cluster is provided in the header of each tetramer cluster.