B-SIDER: Computational Algorithm for the Design of Complementary β-sheet Sequences

The β-sheet is an element of protein secondary structure, and intra-/inter-molecular β-sheet interactions play pivotal roles in biological regulatory processes including scaffolding, transporting, and oligomerization. In nature, a β-sheet formation is tightly regulated because dysregulated β-stacking often leads to severe diseases such as Alzheimer’s, Parkinson’s, systemic amyloidosis, or diabetes. Thus, the identification of intrinsic β-sheet forming propensities can provide valuable insight into protein designs for the development of novel therapeutics. However, structure-based design methods may not be generally applicable to such amyloidogenic peptides mainly owing to high structural plasticity and complexity. Therefore, an alternative design strategy based on complementary sequence information is of significant importance. Herein, we developed a database search method called B-SIDER for the design of complementary β-strands. This method makes use of the structural database information and generates query-specific score matrices. The discriminatory power of the B-SIDER score function was tested on representative amyloidogenic peptide substructures against a sequence-based score matrix (PASTA2.0) and two popular ab initio protein design score functions (Rosetta and FoldX). B-SIDER is able to distinguish wild-type amyloidogenic β-strands as favored interactions in a more consistent manner than other methods. B-SIDER was prospectively applied to the design of complementary β-strands for a splitGFP scaffold. Three variants were identified to have stronger interactions than the original sequence selected through a directed evolution, emitting higher fluorescence intensities. Our results indicate that B-SIDER can be applicable to the design of other β-strands, assisting in the development of therapeutics against disease-related amyloidogenic peptides.


Introduction 36
The β-sheet is one of the major units of protein structure 1 , and has a variety of functions in 37 transportation, recognition, scaffolding, and enzymatic processes 2 . The mechanism of a β-sheet 38 formation has recently received significant attention owing to its close relations with several 39 critical diseases such as Alzheimer's, Parkinson's, type 2 diabetes, and systemic amyloidosis 3,4 . 40 Such diseases are known to be linked to the precipitation of dysregulated β-stacking between 41 neighboring β-strands [5][6][7] . In this regard, information on the amino acid propensity of intrinsic β-42 sheet forming motifs and its use in the design of their complementary sequences are crucial for 43 understanding the mechanism of β-sheet formation and developing potential therapeutics 44 specifically targeting aggregation-prone regions 8 . 45 Whereas structure-based protein design approaches have shown notable successes in several 46 cases 9 , their application to de novo β-sheet designs still remains a challenge 2, 10 . Although 47 structure-based design approaches require a well-defined protein structure, amyloidogenic 48 peptides usually have highly disordered structures 11 . The structural identification of such peptides 49 has long been hindered by high degrees of structural plasticity, transiency, and complexity owing 50 to a self-oligomerization 12,13 . It is thus necessary to exploit the complementarity across 51 neighboring β-strand pairs using sequence information. 52 Intriguingly, significant conservation and covariations of residue pairs between neighboring β-53 strands have been identified in many different protein families 14 . For instance, pairs of β-branched 54 residues and cysteines are preferred at nonhydrogen-bonded positions. Aromatic residues tend to 55 be paired with valine or glycine 15 . Several computational algorithms have been developed to 56 predict aggregation-prone regions based on the internal β-sheet forming patterns. While differing 57 in detail, they make use of either statistical potentials such as Tango 16 , PASTA 17 , SALSA 18 , 58 BETASCAN 19 , and Waltz 20 or physicochemical properties of amino acids 21 . In addition, 59 consensus methods and machine-learning approaches have also been developed 22,23 , showing fine 60 agreement with the experiment results. 61 It has been reported that β-strand interactions can be stabilized by introducing β-sheet favored 62 pairs 24-28 and charge pairing between neighboring β-strands 29-31 . Recent studies have shown that 63 fragments derived from the amyloidogenic region can be used for β-stack modeling 6, 32 . While the 64 use of amino acid pairing information in a protein design has been attempted elsewhere, practical 65 applications of such patterns have been limited, mainly owing to the lack of a comprehensive 66 quantification for residue pairing and noisy patterns of β-sheet forming residue pairs 1, 33, 34 . The 67 β-sheet forming peptides appear to have poor sequence commonalities and imperfect repeats 19 . 68 Therefore, a careful curation of meaningful patterns is required for a practical protein design 69 strategy of complementary β-strands. 70 Herein, we present a database search method, B-SIDER (β-Stacking Interaction DEsign for 71 Reciprocity), for the design of complementary β-strands. The method generates a query-specific 72 score matrix from the structure database. To utilize the pairing information and overcome the 73 pattern noise, we hypothesized that significant complementary pairs can be amplified by 74 superposing a subset of sequence fragments. Moreover, the recent growth boom of β-sheet 75 structures 35 allows the solid statistics of β-sheet forming residue pairings 36 . Based on the 76 hypothesis and statistics of β-sheet forming residue pairings, we developed a fast and reliable 77 computational method for the design of complementary β-strand sequences. The methodology 78 augments β-sheet forming residue preferences by overlaying complementary fragment sequences 79 (Figure 1). We retrospectively validated our approach using a set of curated amyloidogenic targets 80 and compared it with two popularly used structure-based methods (Rosetta and FoldX) and a 81 sequence-based aggregation prediction algorithm (PASTA 2.0). Our algorithm was shown to 82 clearly distinguish favorable β-sheet forming sequences entirely based on the query sequence, 83 whereas the structure-based energy functions exhibited inconsistent results depending on the 84 targets. The utility and potential of our method were demonstrated by designing novel 85 complementary peptides for splitGFP. The designed sequences showed stronger interactions with 86 neighboring strands of the scaffold, and consequently higher fluorescence emissions, than the 87 original peptide selected through directed mutagenesis 37 . 88

Computational algorithm for the design of complementary sequences 99
Collection of β-strand information Non-redundant structures determined by high-resolution X-100 ray crystallography were collected from the PISCES 38 (10~90% sequence identity), < 3 Å 101 resolution, < 0.3 R-factor and sequence length from 40 to 10000. Given the query target sequence, 102 the non-redundant structure database was used to extract pairing information from matched 103 sequences with the same directionality. The target sequence is initially divided into linear moving-104 windows whose residues in length range from 3 to the entire target-sequence length. Any structures 105 with identical target sequence fragments as the split queries were collected, followed by further 106 filtering based on the definition of β-sheet secondary structure (the distance between the backbone 107 nitrogen-oxygen atom pairs < 5 Å). To remove any redundancy, protein structures that contain 108 matches were compared using TMalign 39 . If the TM-score > 0.7 and sequence identity > 90%, 109 one of the matched sequences is removed. 110 While the method is applicable to both parallel and anti-parallel β-sheets in theory, we 111 entirely focused on anti-parallel β-sheets in this study 27,28 and all the test cases are anti-parallel 112 because anti-parallel cases are more frequently observed compared to parallel cases 40 . Disease-113 related amyloidogenesis is also known to be initiated with anti-parallel β-sheets, and soluble 114 oligomeric amyloid species mainly exist as anti-parallel 41, 42 . 115

116
Complementary sequence score The β-sheet complementarity score function is derived from the 117 environment-specific substitution score 43 . We hypothesized that each position of a β-strand is 118 independent of one another, and their complementarities are determined through residue pairs from 119 neighboring strands. Given the query sequence, all identified neighboring sequences are pooled 120 together, as described in the previous section. The amino acid frequency at each complementary 121 position is counted as 122 where Ai,p is the frequency of a certain amino acid (i) at a specific complementary position (p), and 123 Oi,p is the total count of the amino acid at p. The background frequency of a certain amino acid Bi 124 is independent of the query and thus counted from a well defined structure database(HOMSTRAD 125 database 44 ). The frequency is calculated as 126 The complementary sequence score of the amino acid at the position Si,p is calculated as 127 It should be noted that the complementarity score is completely data-driven, i.e., if an 128 amino acid never appears at a certain position, a high penalty score is imposed. We only consider 129 complementary amino acids that are found at least once in the entire identified sequences. The 130 final score is the sum of the scores at all positions. 131 132

Protein expression and complementation assay 133
Gene construction The gene coding for splitGFP 37 consists of the first through tenth (GFP1-10) 134 and 11 th strand (GFP11) templates (Supporting Information, Table S1). They were cloned into 135 pET-28a (Novagen) vector between the Nde-I and Xho-I restriction sites. We introduced additional 136 mutations to GFP1-10 to inhibit aggregation and for a convenient expression 45 . The GFP11 strand 137 was fused with a P22 virus-like particle scaffolding protein 46 for a soluble and stable expression. 138 Mutations on GFP11 were introduced through PCR using mutagenic primers (Supporting 139 Information, Table S2), and the resulting genes were cloned into the pET-28a vector. Six histidine 140 residues were fused to the N-terminal of the GFP1-10 and GFP11 genes as an affinity purification 141 tag.

Overview of the design process 172
We hypothesized that repetitively observed amino acid pairing patterns indicate a strong 173 preference toward the β-sheet. It was also assumed that the sequence with the most frequent 174 patterns will directly form a β-sheet without considering other environmental contributions. 175 There are two major steps applied in the algorithm: 1) The extraction of complementarity 176 information of the β-sheet and 2) the construction of a scoring matrix (Figure 1). When a query 177 sequence is given, it is fragmented into several pieces of short peptides longer than three residues 178 in length, and the matching neighboring strands are collected. These fragmentation and overlaying 179 processes naturally impose weights on complementary-prone positions and amplify the pattern 180 signals. It should be noted that the subsequence search starts from the -2 nd position in order to 181 avoid underweighting teminal regions (Supporting Information, Figure S1). After the collection 182 of matched sequences, a position-specific complementarity scoring matrix is constructed. The 183 scoring matrix obtained is used to evaluate and design the complementarity of β-strand interactions. 184

Validation of the score function on retrospective cases 185
In an effort to validate the complementarity score, we manually curated a test set of 186 naturally occurring β-strand pairs whose environmental effects are minimal. It is known that β-187 strand pairing is in general significantly hindered by local environments 47 , whereas amyloidogenic 188 peptide segments are known to form natural β-sheets themselves 17 . We thus selected a set of 189 widely known amyloidogenic structures whose aggregation-prone regions have been identified 190 (Table1 and Supporting Information, Table S3). When multiple β-strands are present, we assumed 191 that the first strands may be the primary amyloid building unit and so only the first two strands as 192 a pair were extracted for the structure-based energy calculation. 193 To assess the complementarity of the native sequences, we compared their scores with 196 those of random sequences which were generated using all 20 amino acids in a position-197 independent manner. Natural amyloidogenic segments are known to be highly prone to 198 aggregation, and thus they are expected to be highly preferred, i.e., having fairly low scores in the 199 random sequence score distributions. Figure 2 shows that all native sequence scores are ranked 200 extremely low in all distributions. On average, the native sequences are within 5.2% of the 201 distributions (Figure 2). The results indicate that the scoring function is extremely useful in 202 detecting favorable β-strand counterparts. 203 We also compared the B-SIDER score with two structure-based all-atom energy functions 209 and a sequence-based score matrix. For the structure-based methods, we picked Rosetta (Talaris 210 13) 49, 50 and FoldX 51 , which have been popularly used in de novo protein designs 52, 53 . PASTA 211 2.0 54 is a method for predicting regions prone to aggregation using a scoring matrix derived from 212 residue pairing patterns of β-sheets. To avoid any biases, 1,000 random sequences were newly 213 prepared per target. The "FastRelax" protocol 55 from Rosetta (Ver. 3.7), the "BuildModel" Considering that the Rosetta relax protocol performs flexible backbone refinements, the 237 use of a fixed-backbone calculation seems to be better for an evaluation of the β-sheet 238 complementarity. It should be noted that the inconsistent results of Rosetta imply that FoldX 239 prediction will also be highly driven by the preparation of the structure, i.e., a design with an ill-240 defined model may not be generally successful. By contrast, B-SIDER and PASTA 2.0 do not 241 require any structure at all, and thus can be applied to general cases such as β-sheet interactions 242 with high structural plasticity and poor structural integrity, which are the common features of 243 amyloidogenic peptides. Furthermore, the process of collecting complementary motifs of B-244 SIDER also appears to be powerful, making it possible to distinguish favorable complementary 245 sequences not easily detected through one-to-one residue pairing. We also tested the robustness of 246 the B-SIDER approach by examining the algorithm at various sequence identity cutoff values of 247 the structure database. The results are fairly consistent until < 40% sequence identity cutoff 248 ( Figure 3B and Supporting Information, Figure S2). We observe that significant outliers start 249 appearing at < 30%. Despite such outliers, B-SIDER still successfully discriminates most test cases 250 as favorable. The length distributions of matched subsequences ( Figure S3) show that 251 complementarity information is mainly derived from subsequences with the minimal length. 252 Perhaps the consistency of the predictive power at low homology thresholds may come from the 253 subsequence overlapping. 254

Prospective appplication of the algorithm to splitGFP 255
As shown in the retrospective test, B-SIDER is extremely useful in discriminating naturally 256 β-strand forming sequences. As a proof of concept, we prospectively designed novel 257 complementary β-strands for splitGFP. SplitGFP is a fragmented protein pair derived from 258 superfolderGFP 37 , comprising a scaffold containing ten β-strands (GFP1-10) and their 259 complementary β-strand peptide (GFP11). GFP11 specifically interacts with GFP1-10, and the 260 strand tightly forms a stable β-sheet structure, which facilitates the chromophore maturation in an 261 irreversible manner 56 . This assembly process results in the emission of green fluorescence. 262 Because GFP11 is known to be disordered in a solution, its conformational transition from a 263 disordered to an induced β-sheet is similar to that of amyloidogenic peptides 57, 58 . This model 264 system thus efficiently assesses whether sequences designed using B-SIDER have favorable β-265 sheet interactions. 266 Table 2 The original GFP11 was designed using directed mutagenesis, and it shows a high intrinsic 270 propensity to form hydrogen bonds with the neighboring β-strands of GFP1-10 59 . In our case, the 271 queries are the neighboring strands of GFP1-10 ( Figure 4A). The total score was calculated as a 272 sum of the scores from both sides without specific weights. It is known that the inward pointing 273 residues (1, 3, 5, and 7 th positions) directly interact with the chromophore, and thus are not subject 274 to a mutation. It should be noted that we utilized the best structure data available (PDB < 90 % 275 sequence identity). B-SIDER identified 2,637 non-redundant sequences from the structure 276 database. The native sequence is ranked modestly among 1,000 possible randomly chosen 277 sequence variants (46 th percentile), indicating that there could be room for complementary 278 sequences with stronger interactions than the original (Figure 4B). We then selected sequences 279 with the lowest B-SIDER scores (top_vars; Table2). Amino acid compositions of the ten variants 280 are mostly hydrophobic or branched amino acids (Supporting Information, Figure S4). In addition, 281 one sequence with a high score (> 75 th percentile) was randomly selected as a negative control 282 (neg_var, 77 th ). 283 The selected variants were successfully expressed and purified (Supporting Information, 284 Figure S5) except for four clones (top_var6, 7, 8, and 10), which were observed to be insoluble, 285 perhaps due to aggregation. Among those expressed, three variants (top_var1, 2, and 9) showed 286 faster assembly patterns and higher signals compared to the original GFP11 (Figure 5). No 287 functional aberrance with excitations or emissions was observed (Supporting Information, Figure  288 S6). All successful variants, which emitted stronger fluorescence levels than the original, were 289 shown to have pairs of phenylalanine and threonine at positions 6 and 8, respectively. These results 290 demonstrate that the designed variants indeed formed complementary β-strands in a more 291 favorable fashion than the original peptide as predicted. The other variants showed slightly lower 292 signals than the original, but still gave rise to clear fluorescence signals (Figure 5). The negative 293 control (neg_var) barely emitted any signal, suggesting that the score indeed indicates the 294 complementarity of the β-stacking interactions. We also assigned scores to the GFP11 variants 295 using other scoring methods. As shown in the retrospective test set, Rosetta and FoldX were unable 296 to discriminate top_vars as favorable (Supporting Information, Figure S7). However, PASTA 2.0 297 was again fairly accurate in this case. 298 For understanding the aggregation mechanism of disease-related β-sheets and developing 305 potential therapeutics against them, β-sheet forming patterns are essential. Unlike α-helices, 306 however, there has been no established design principle for the complementarity of β-sheets. In 307 this study, we developed B-SIDER, a database search method for the design of complementary β-308 strands based on the intrinsic β-sheet forming propensities. Statistical patterns of interacting 309 residue pairs between neighboring β-strands enable the complementary interaction to be quantified. 310 We demonstrated that the statistical potential can be directly applied to the design of 311 complementary β-strand sequences. Using splitGFP as a model system, we successfully designed 312 fragment variants, which led to stronger fluorescence emissions than the native one originally 313 identified through a directed mutagenesis. The results clearly indicate that B-SIDER can be useful 314 for the detection and design of β-stacking interactions between unstructured fragments. Therefore, 315 our approach can find wide applications in protein designs where structure-based methods are 316 ineffective, including the development of protein binders specifically against intrinsically 317 disordered disease-related proteins.  (Table S1 and S2), identification 324 of query sequences of the retrospective test set (Table S3)