Extending the application of the SCA/Sectors method for the identification of domain boundaries and subtype specific residues in multi-domain biosynthetic proteins: Application to Polyketide Synthases

Polypeptides with multiple enzyme domains, such as type I polyketide synthases, produce chemically complex compounds that are difficult to produce via conventional chemical synthesis and are often pharmaceutically or otherwise commercially valuable. Engineering polyketide synthases, via domain swapping and/or site directed mutagenesis, in order to generate novel polyketides, has tended to produce either low yields of product or no product at all. The success of such experiments may be limited by our inability to predict the key functional residues and boundaries of protein domains. Computational tools to identify the boundaries and the residues determining the substrate specificity of domains could reduce the trial and error involved in engineering multi-domain proteins. In this study we use statistical coupling analysis to identify networks of co-evolving residues in type I polyketide synthases, thereby predicting domain boundaries. We extend the method to predicting key residues for enzyme substrate specificity. We introduce bootstrapping calculations to test the relationship between sequence length and the number of sequences needed for a robust analysis. Our results show no simple predictor of the number of sequences needed for an analysis, which can be as few as a hundred and as many as a few thousand. We find that polyketide synthases contain multiple networks of co-substituting residues: some are intradomain but most multiple domains. Some networks of coupled residues correlate with specific functions such as the substrate specificity of the acyl transferase domain, the stereo chemistry of the ketoreductase domain, or domain boundaries that are consistent with experimental data. Our extension of the method provides a ranking of the likely importance of these residues to enzyme substrate specificity, allowing us to propose residues for further mutagenesis work. We conclude that analysis of co-evolving networks of residues is likely to be an important tool for re-engineering multi-domain proteins. Author summary Many important compounds such as antibiotics or food flavourings are produced naturally by molecular factories within plant, fungal and bacterial cells. These molecular factories typically comprise a complex of multiple interacting enzymes, each enzyme being a stage in a molecular production line. Often the enzymes are connected together as subsections of the same amino acid chain, i.e. protein, with the amino acid chain folding into the separate functional enzymatic domains that comprise the production line. Polyketide synthases are such multi-domain proteins, and their products often have antibacterial, antifungal and antitumoric effects. Engineering polyketide synthases thus has the potential to produce novel drug candidates. We applied and developed statistical approaches to detect where in an amino acid sequence the boundaries are between different domains, potentially allowing these regions to be swapped around for the synthesis of novel compounds. We used the same approaches to identify parts of the amino acid chain important for the function of different types of domain, pointing to how they might be modified to make novel compounds. These analyses agree with published experimental data and allow us to make novel predictions, which we expect to help experimentalists produce novel compounds of commercial and pharmaceutical interest.

products often have antibacterial, antifungal and antitumoric effects. Engineering polyketide synthases thus has the potential to produce novel drug candidates. We applied and developed statistical approaches to detect where in an amino acid sequence the boundaries are between different domains, potentially allowing these regions to be swapped around for the synthesis of novel compounds. We used the same approaches to identify parts of the amino acid chain important for the function of different types of domain, pointing to how they might be modified to make novel compounds. These analyses agree with published experimental data and allow us to make novel predictions, which we expect to help experimentalists produce novel compounds of commercial and pharmaceutical interest. A minimal PKS includes the KS, AT and ACP domains, the AT loading an extender unit onto the ACP, which then presents it to the substrate bound to the KS, elongating that substrate and resulting in a keto group at the β-carbon. (C) One or more further rounds of elongation can occur, and the presence of modification domains (KR, DH, ER) reduces the keto group at the β-carbon, shown by a β on the figure, which results in a fully reduced carbon chain when KR, DH and ER domains are all present in the module. (D) The modular organization of the DEBS system, which consists of three polypeptides (DEBS1, DEBS2, DEBS3) and six elongation modules. The fragment of DEBS1 used as the basis of our analysis is marked in red and has a KS at the N-and C-termini of the fragment (KS1 and KS2, respectively). This sequence was chosen since we would expect it to form a structurally closed unit, since the PKSs are actually homodimers with KS1 of one PKS polypeptide binding to KS1 of another polypeptide chain and KS2 similarly binding another KS2, effectively bounding the structure of the module [3,4]. Choosing this sequence for study also fits the most prevalent definition of a PKS module, with the KS at the N-terminus as show in the figure, but also for the possibility that the more natural definition of a module might be with the KS at the C-terminus rather than the N-terminus [5,6]. Although every module of DEBS includes a KR domain, they are of different subtypes, A1, B2, C2, which are described later in the text.
( Fig 1D). Type A KR domains produce a hydroxyl group with L-configuration at the 41 beta-carbon, whereas type B KR domains produce a D-configured hyrdoxyl group. 42 Types A and B can be further classified by their ability to epimerise the alpha carbon. 43 A1/B1 leave a D-configured alpha carbon, thought to be a result of the stereochemistry 44 of the extender unit [8], whereas A2/B2 leave an L-configured alpha carbon, a result of 45 enzyme catalysed epimerisation [9]. Type C1 have no catalytic ability, whereas type C2 46 only epimerise the alpha carbon to be L-configured [9]. 47 New polyketide variants can be produced by modifying the extender unit specificity 48 of the AT domain, which will change the moiety added to the growing polyketide chain, 49 or by changing the KR domain subtype [7,[10][11][12][13], which can modify the extended 50 polyketide as described above. One common approach to changing AT specificity is to 51 swap the native domain with the corresponding domain from another PKS system or 52 module. The challenging points of domain swapping experiments are understanding the 53 compatibility for interaction between the inserted domain and the domains on the host 54 system, and correctly identifying the functional boundaries of the domain, both of 55 which are thought to be needed to maintain structural and functional integrity. 56 Although studies of the AT domain swapping have been going on for more than two 57 decades, successful swapping was achieved only recently [14][15][16][17][18][19]. Experimenting with 58 different domain boundaries, swapping the AT domain from EPOS module 4 into the 59 position of the AT in DEBS module 6 identified two functional variants from four 60 constructs [19]. The same work demonstrated the functionality of these domain 61 boundaries in other systems. On the other hand, no similar study has been published 62 regarding KR domain boundaries, to the best of our knowledge. KR swap experiments 63 have been published but their success varies with KR domain subtype [20][21][22][23]. 64 Another approach to engineering polyketide variants has been point mutation of 65 residues that are thought to be critical for the functional specificity of the domain. For 66 example, sequence analysis revealed that methylmalonyl specific AT domains bear a 67 YASH motif whereas the corresponding position in the sequence of malonyl specific AT 68 domains is a HAFH motif. However, switching extender unit specificity by mutating 69 only the YASH/HAFH motifs led to incorporation of both types of extender unit, 70 indicating that specificity is not solely defined by these motifs [24][25][26]. This motivated 71 the search for additional residues that are functional in extender unit binding. A favourable than the wild type interactions. Thus, although the challenge of loading an 79 AT with an alternative substrate seems to have been addressed, the challenge of 80 mutating AT specificity in the context of the PKS still needs more work. These 81 experiments showed that switching the extender unit specificity from one type to 82 another is possible only if all the key substrate binding residues are identified with 83 further work presumably needed to characterise the residues that couple an AT's 84 function to its cognate ACP. 85 Similarly, site-directed mutagenesis has been applied to KR domains to switch the 86 KR subtypes. In type-B KR domains, approximately 57 residues before the catalytic 87 tyrosine there is a highly conserved LDD motif (the second aspartic acid residue is 88 strictly conserved). This motif is absent in type-A KR domains, but these conserve a 89 tryptophan eight residues before the catalytic tyrosine. Although mutating these critical 90 residues has given some promising results, with type-A1 and type-B2 being mutated to 91 type-A2, switches to other types have not yet given high product yields [28,29]. 92 Overall, although these studies provided better understanding of the domains and 93 their subtypes, a complete conversion from one subtype to another, whilst maintaining 94 high product yields, has yet to be achieved in vivo. However, the current results make it 95 clear that identifying what constitutes a functional domain is critical for the success of 96 domain swapping experiments. Furthermore, although sequence analysis has provided 97 fingerprint motifs of the AT and KR subtypes, the low levels of success in experiments 98 that switch such motifs suggests a need to identify additional residues necessary for the 99 specific functions of these domains. Thus if residues that are closely functionally 100 coupled could be easily identified this would expedite the problem of engineering novel 101 polyketide synthase pathways. By closely functionally coupled we mean either residues 102 that work together structurally as an enzymic domain or as the group of key residues 103 responsible for subtype specificity. 104 Many computational approaches have been developed to detect co-substituting 105 residue pairs or groups [30][31][32][33] and thence to identify networks of residues that appear 106 to work together [33]. These methods are predicated on the process of evolution 107 leading to coupled amino acid changes. To understand how this might happen consider 108 the following: if a mutation occurs at a position in the three dimensional structure of a 109 protein it may lead to suboptimal protein function, possibly leading to the death of the 110 host organism and thus to the loss of the mutation from the gene pool, unless one or 111 more further mutations compensate for that first change. Amino acid substitution 112 patterns within multiple sequence alignments evidence covariance between positions in 113 the alignment, supporting this idea of coupled amino acid substitutions, and analysis of 114 this covariance can identify functionally and structurally coupled amino acids.

115
One approach to identifying networks of coevolving residues is sectors analysis 116 [33,34]. Sectors analysis is built on the statistical coupling analysis (SCA) of 117 Ranganathan and co-workers and depends on having the sequences of a family of related 118 proteins that are sufficiently divergent that a significant proportion of the possible 119 substitutions that can occur have occurred in at least one member of the sequence 120 family. SCA creates a matrix that represents the covariance of amino acids between the 121 different columns of a multiple sequence alignment. Each entry in the covariance matrix 122 is weighted by a measure of the amino acid distribution (the derivative of the 123 Kulback-Leibler divergence with respect to amino acid frequency) in the pair of sequence 124 positions under consideration [34]. This upweights the contribution of positions with 125 residue types that deviate from a precalculated average background frequency. 126 Conservation of rare residue types such as cysteine and tryptophan thus have a strong 127 upweighting effect, whereas the absence of a common residue type such as leucine will 128 also have an upweighting effect, as will their conservation, although this will not be as 129 great as the upweighting that a highly conserved tryptophan receives. The covariance matrix is then decomposed via eigenvalue spectral decomposition and independent 131 component (IC) analysis, to identify networks of residues that are particularly strongly 132 coupled with regard to their tendency to co-vary. Since they coevolved, residues in each 133 group (sector or IC) are expected to be functional for a specific purpose.

134
The idea of functionally independent sectors was first validated in the serine 135 protease enzyme family. Computational analysis identified three almost independently 136 co-varying sectors, two of which were confirmed with experiments in the laboratory.

137
Mutational analysis showed that one sector seemed to control the stability of the serine 138 protease fold but had little effect on catalytic efficacy, and another sector was important 139 for catalysis but had little effect on protein stability [33]. This approach has since been 140 applied to several protein families including G-protein coupled receptors, DHFR, 141 β-lactamase and the pancreatic-type ribonuclease superfamily [34,35]. Although sectors 142 analysis has been successfully applied on single-domain systems, there is no study to 143 date that has analysed a multidomain protein.

144
The polyketide synthases provide a good system for testing how sectors analysis 145 performs on a multidomain protein and sectors analysis has the potential to identify 146 networks of residues defining domain boundaries or critical residues for enzyme 147 specificity, and is likely to be complementary to existing methods (e.g. [36,37] others with subtypes of the PKS domains. Some ICs span multiple domains suggesting 156 that ICs can give further insight into the interaction pattern of the residues within and 157 between domains and inter-domain linkers. We perform bootstrapping analysis to test 158 the robustness of our IC predictions and to provide recommendations for the number of 159 sequences required for sectors analysis. To the best of our knowledge, this is the first 160 study that applies sectors analysis to multidomain proteins like the PKSs. analysis.

164
The literature suggests that 100 sequences are sufficient for a sectors analysis [34], 165 based on calculations from one example system. In contrast, other covariance methods 166 such as direct coupling analysis require thousands of sequences to give reliable results. 167 Moreover, the DEBS module that we wish to analyse has 1901 amino acids, which is 168 considerably larger than any sequence previously analysed by sectors analysis [34,35] 169 and which we thus anticipated might need a larger number of sequences to obtain 170 robust results. We, therefore, analysed several uni-domain proteins to see whether we 171 could detect a trend between the length of a protein and the number of sequences 172 adequate for a sectors analysis.

173
To test how the number of sequences in an MSA affected the ICs, varying numbers 174 of sequences were randomly selected from a source MSA, without replacement, and ICs 175 were determined. This was repeated three times for each different number of sequences 176 (n=3). To see the similarity between the ICs from subsampled MSAs and the ICs from 177 the source MSA a similarity score (ss) was calculated between the sets of ICs, ranging 178 from 0 (no similarity) to 1 (every IC calculated for the first MSA can be paired with an 179 effective sequence is the number of sequences in the alignment whose pairwise sequence 185 identity compared to every other sequence is 80% or less.

186
However, no trend was detected between the length of the protein and the effective 187 number of proteins in the MSA at which the ICs of the full MSA were recovered (the   averaged across all ICs. This indicates that the alignment may be too small to robustly 223 identify the ICs. Therefore, we included all the 2245 sequences we had obtained from the preliminary sequence search. Although all sequences had at least two KS domains 225 and an ACP, sequences were also included without AT and/or KR domains or with DH 226 and/or ER domains.

227
Although the inclusion of additional sequences was not adequate to obtain the 228 convergence of the results, the maximum similarity score is higher than the one where 229 only sequences with KS1 AT KR ACP KS2 domain composition were included in the 230 MSA (S6 Fig). As noted in detail later, the conclusions arising from our analysis were 231 tested to ensure that the additional sequences did not lead to misleading artefacts.

232
Analysis of the full sequence alignment gave 34 ICs with two or more residues. We 233 performed resampling by randomly drawing from the full MSA three lots each of 100, 234 500, 900, 1300, 1700 and 2100 sequences. Some ICs were detected even when there were 235 only a few sequences in the alignment, but some occurred three or fewer times, over all 236 samplings, and these were discarded as not being robustly defined. IC 25 was also 237 discarded as it consisted of only two residues. After discarding the aforementioned ICs 238 this left 22 ICs (Figs S7 Fig, S8 Fig and Fig 3).  Some independent components consist of residues that are predominantly from only one 241 domain but most ICs have signal from multiple domains and linkers (Fig 3). However, 242 even where an IC spans multiple domains, the strongest couplings are sometimes still 243 confined to one domain. Further insight into the coevolutionary relationships within a 244 protein can also be gained by looking at residues coupled across two ICs. Although an 245 IC is a grouping of residues that coevolve together, predominantly independently of 246 other ICs, there can nonetheless be couplings between residues in different ICs (S9 Fig). 247 For this reason, previous work has grouped together ICs with high inter-IC coupling, 248 each grouping termed a sector [34]. However, there is no standard way to group ICs 249 into sectors. Here, we therefore do not definitively define sectors but instead make 250 hierarchical clusterings of ICs based on the average interaction score between residues in 251 each pair of ICs as explained further in Methods section 4.6 (Fig 4). Such hierarchical 252 clustering shows that AT and KR domains are characterised by several ICs that 253 cross-couple (Fig 3) and which have residues predominantly from the same domain, thus 254 allowing us to define domain boundaries. Ketosynthase of module 2 (residue 1478 to 1901). Residue 1 here corresponds to residue 505 of the polypeptide chain of DEBS1 since we have ignored the loading domains of DEBS1, which acquire and activate the starter unit for biosynthesis residues from the ACP and the upstream and downstream KS domains, although all the 265 couplings in IC 16 are weak compared to those intra ICs 17, 10, 21 and 3. However, in 266 the cluster analysis, IC 3 clusters with ICs 2 and 5, which consist of residues from the 267 upstream and downstream KS and the ACP, but not with residues in the KRs domain, 268 thus somewhat mimicking what is seen in IC 16. 269 We were concerned that these results may be an artefact of some sequences in the 270 MSA not having a KR domain but highly similar domain boundaries were detected 271 when only sequences with a domain composition that includes a KR were analysed (S10 272 Fig IC 10). In that analysis KRc also still comes out as defined by one IC (labelled  Table). IC 3 of the full MSA is highly similar to IC 5 from the MSA with a KR in 280 every sequence (S1 Table). Similarly ICs 10, 17 and 21 of the full MSA, which define 281 the KR domain boundaries, were highly similar to ICs 10, 29 and 20, respectively, of the 282 shorter MSA. The analysis of the MSA with a KR in all sequences indicates boundaries 283 for the KR to be R886 to A1356 and for the KRc domain to be P1107 to L1320 (IC 5 of 284 that analysis, S10 Fig)). Considering the analysis of the full MSA, then ICs 10, 17 and 285 21 of that analysis (Fig 3) together define a boundary for the whole KR, based on 286 highly co-evolved residues, of L896 to R1357 and IC3 defines a boundary to the KRc of 287 G1109 to L1320 compared to the boundaries in the literature [38] of V890 to A1360,

291
IC 3, the IC that identifies the boundaries of KRc, consists of residues that are 292 highly conserved, as do IC 2 and IC 5 (S11 Fig), the ICs that cluster with IC 3 (Fig 3). 293 IC 3 includes the catalytic triad of the KR, K1219, S1243 and Y1256 and the NADPH 294 binding site TGGTGxLG (T1114 to G1121) [39]. Keatinge-Clay and Stroud identified 295 that the adenine ring of NADPH stacks with R1141 and forms hydrogen bonds with 296 D1169 and V1170, which are also detected in IC 3 [38]. Additionally, they showed that 297 the phosphate group of the adenine ribose forms a salt bridge with R1141 and a 298 hydrogen bond with S1142, which are detected in IC 3 and IC 21, respectively. An 299 alpha helix and a loop at its N-terminus, which are close to the active site and referred 300 to as the lid region, are thought to act together with the LDD motif of B-type KR 301 domains, or the conserved tryptophan of A-type KR domains, to determine the 302 stereochemistry of the KR products [9]. The highly conserved second aspartic acid in 303 the LDD motif (D1201) and the residues W1282, T1284, W1285 and G1303 of the lid 304 region are detected in IC 3; whereas the LD of the LDD motif, conserved tryptophan 305 and the rest of lid residues are detected in other ICs as discussed below.

306
The ICs coupled to IC 3 (ICs 2, 5, 7, 11 and 13) bear highly conserved residues, as 307 well (S12 Fig). IC 2 contains the GXDS motif that is highly conserved in ACPs. The 308 residues of the catalytic triad of the KS1 domain (C173, H308 and H346) are also 309 detected in IC 2. For the KS2 domain while the catalytic residues C1644 and H1819 are 310 detected in IC 2, H1779 is detected in IC 13, which contains highly conserved residues, 311 only from KS2. One intrigue here is how such highly conserved residues can have a Cross-coupling between residues in the robust ICs from the MSA of sequences similar to DEBS module 1. Residues are grouped along the axes according to the IC in which they occur, allowing the reader to see the cross-coupling between residues intra-IC and inter-ICs, but meaning that they are not in sequence order. The yellow blocks along the diagonal clearly show high intra-IC coupling, compared to inter-IC coupling, but there is clear evidence of inter-IC coupling, most notably between ICs 13, 5, 11, 2, 3 and 7 in the bottom right corner of the plot. Details of clustering analysis are explained in Methods. Some of the axes labels sit out of line with the other labels due to the very few members in that IC leading to too little space for the label. The colour key represents the scores from the SCA reconstructed without the first eigenvalue and eigenvector.
The AT together with the linker region immediately C-terminal to it (PAL) seems to 315 be an independently evolving unit. The analysis of the full MSA has the AT consisting 316 of three ICs, 8, 4 and 6, which have residues that cross-couple with each other, as shown 317 by the hierarchical clustering. There is negligible coupling of the residues in these three 318 ICs with residues from any other domains, except some residues from PAL1, which are 319 part of IC 4. (Fig 3). Although ICs 8 and 6 are coupled to residues from the ACP and 320 IC 6 also couples with residues from the KR structural domain, these couplings are very 321 weak compared to those between residues within the AT domain, with e.g. only five for an IC. Thus, it seems that many residues of PAL1 are tightly coupled to the AT 331 domain via IC 4, with residues from PAL1 and PAL2c forming another highly covarying 332 unit that seems to be cross-coupled with the ICs of the AT. few residues from other domains, these latter positions having extremely low coupling to 338 the IC as compared to the AT and PAL1 couplings (S15 Fig). The second IC that has a 339 dominant signal from the AT domain is cis-AT IC 5. Similar to cis-AT IC 2, there are 340 couplings to positions outside of the AT domain but they are weak (S15 FigC).

341
Parenthetically, cis-AT IC 11 has many residues from the AT domain that are 342 moderately coupled, but nonetheless very much more strongly coupled than from 343 residues in other domains, but these also are mostly too weak to pass the test to be 344 included in the IC. All the residues in IC 4 (from the full MSA) are found in 345 cis-AT IC 2, and 78% of the residues from IC 8 are also found in cis-AT IC 2 (S2 346   Table), suggesting that: (i) the residues in IC 4 are robust to sequence selection, and to 347 some degree also those in IC 8; (ii) the existence of IC 4 and IC 8 as separate ICs is 348 possibly an artefact of the sequence alignment, arising from some mis-alignment of some 349 trans-AT sequences. IC 6 and cis-AT IC 5 are also highly similar.

350
Analysis of the cis-AT-only alignment also finds a group of PAL1 residues that form 351 an IC, cis-AT IC 12 together with a few residues from the KAL (S15 FigA). The 352 residues from the KAL are weakly coupled to this IC, but pass the threshold for 353 inclusion in the IC, whereas there are many residues in the AT domain that couple to 354 the IC in the raw analysis (S15 FigB) but don't pass the threshold for inclusion in the 355 IC. This coupling to cis-AT IC 12 is nonetheless much stronger than from residues in any of the other domains. The PAL2c region couples with many residues in the AT, as 357 well as a few residues in the KRs and KR-ACP linker, as part of cis-AT IC 5, but those 358 coupling outside of the AT region seem few and weak (S15 Fig). However, cis-AT ICs 359 12, 5, 2 and 11 cluster together, again suggesting that the AT and PAL1 form a 360 functional unit, with their relation to PAL2c being less clearly defined.

361
Taking ICs 8, 4 and 6 of the full MSA together, they define an AT PAL1 unit of 362 residues V527 to S854, whereas cis-AT IC 2 defines the boundaries of the coevolving 363 unit as residues V527 to R863. The residues of cis-AT IC 11 and cis-AT IC 5 are 364 within the boundaries defined by cis-AT IC 2, except for a few residues of cis-AT IC 5 365 that are either in the KS, KR or PAL2c region but with low IC scores. Recent 366 experimental work has demonstrated the need for the PAL1, but not PAL2, if one is to 367 successfully replace the AT of DEBS module 6 with that of the equivalent residues from 368 EPOS module 4 [19], corresponding to residues Q524 to P865 in DEBS1. This is 369 consistent with our results here. However, they found that including the KAL region as 370 part of the replacement unit significantly improved yields via improved K M , which is 371 not evident in our ICs, and this KLA AT PAL1 also gave functional swaps in other 372 systems (transferrability of the AT PAL1 unit was only tested in one construct). 373 Many ICs indicate that the KAL has many weak couplings with the co-joined KS 374 domain, notably in IC 5, suggesting that it is somehow a unit with the KS. However,    Methylmalonyl-CoA and malonyl-CoA specific AT domains show distinct patterns in 463 IC 4 which suggests the residues of this IC should be functional in the extender unit 464 specificity of the AT domain (Fig 6A). The separation that the sequence-position map 465 suggests can also be seen by a dendrogram created by comparing only the residues that 466 contribute to IC 4 ( Fig 6B, S18 FigB). From the dendrogram, we can see a clear In order to identify the importance of the residues for specific sub-types, we 473 calculated the projection of the sequence covariance of the different subtypes onto the 474 ICs. Conservation weighted covariance matrices for malonly-and methylmalonyl-475 specific sequences,C m ij andC mm ij respectively, were projected onto the independent 476 component matrix, calculated from the whole MSA (Ṽ p 1···k * ).
To determine the residues that lead to the malonyl-and methlymalonyl sub-types 478 scoring differently along the different ICs, we calculated the difference between scores of 479 these projections onto the ICs. The sequence logos show that positions that scored highly for a subtype are mostly 487 conserved for that sub-type but not for the other sub-types. There is an exception at Leu-Tyr and Arg-Tyr suggesting that the interaction between these two positions may 495 be important and evolutionarily constrained (S19 FigC); whereas, a similar pattern is 496 not observed for malonyl-specific sequences (S19 FigD). The C β -C β distance between 497 residues 760 and 763 is 5.2Å. The Y and S residues of the YASH motif and H and F residues of the HAFH motif, 499 which provide the distinction between the fingerprint motifs, are detected in this IC 500 (positions 721 for Y/H and 723 for S/F, marked with stars in Fig 6). Site-directed 501 mutagenesis studies on the YASH and HAFH motifs, aiming to switch the extender unit 502 specificity, generally result in promiscuous domains which can accept both 503 methylmalonyl-CoA and malonyl-CoA as an extender unit. Further, kinetic analysis 504 shows that the cause of the promiscuity is not based on an increased affinity for a 505 non-native extender unit but rather a decrease in the capability of accepting the native 506 one. A recent study by Zhang et al. performed experiments on salinomycin polyketide 507 synthase, targeting additional residues on malonyl, methylmalonyl and ethylmalonyl 508 specific ATs besides the YASH/HAFH motifs [27]. After performing structural analysis 509 and molecular dynamics simulations, they determined that hydrophobic residues at 510 positions V592, I653, M662 and L775 (DEBS1 module 1 numbering) are critical for 511 substrate specificity. While mutating the residues of the YASH/HASH motifs did not 512 provide a switch from one specificity to another, additional mutations at the four newly 513 identified positions successfully switched between malonyl/methylmalonyl/ethylmalonyl 514 specificity. The SalAT2 H276Y/F278S/I177Q/M205L/V327 mutant showed a change in 515 substrate specificity from malonyl CoA to methylmalonyl CoA, and reverse mutations, 516 i.e. Y288H/S290F/Q189I/L217M/L341V, in SalAT8 changed its specificty from methyl 517 malonyl CoA to malonyl CoA. Consistent with these results, except the residue at 662 518 (which is in IC 6 and found highly conserved as Met in all specificities, 519 malonyl/methylmalonyl/ethylmalonyl, but Val in the ethylmalonyl specific AT of the 520 salinomycin pathway), the other positions were detected in the IC 4 ( Fig 7C). Thus, 521 residues important for specificity, which were identified by analysis of structural 522 dynamics, can be identified directly from the sectors analysis of sequence data.

523
As well as the YASH/HAFH motifs and residues identified by Zhang et al., IC 4 also 524 encompasses D593, V594, L595, and Q625, which have all been shown to affect the 525 extender unit specificity of ATs. In AT4 of the DEBS pathway, mutating RVDVL to 526 DTLYA allows the AT to accept a malonyl extender unit as well as methylmalony 527 [24,42]. D593, V594, and L595 are second, seventh and nineteenth from the left in 528 Fig 6D, which thus predicts them as particularly likely to affect specificity, and V592 is 529 very much in the middle of the plot. The V to T mutation is predicted by Fig 6D to   530 weakly favour malonly specificity, T being conserved in malonly specific ATs, whereas 531 DVL to LYA should disfavour methylmalonyl specificity. Q625 is adjacent to the 532 catalytic serine in the GHSXG motif, which was detected as a branched hydrophobic 533 residue (Val or Leu) in malonyl-specific ATs but as methionine or glutamine in other 534 AT types [42,43]. Q625, is 13 positions from the left side of Fig5D and very highly 535 conserved in methylmalonyl specifc ATs. R591, the first residue of the RVDVL motif, 536 occurs in IC 6, which doesn't discriminate substrate specificity, as discussed later, and is 537 conserved as either R or D. Thus IC 4, which is determined purely by sequence analysis, 538 identifies five residues that have previously been shown to be important for AT extender 539 unit specificity (Fig 7D) and highlights further residues that could be important.

540
Although it is not impossible for the method to have detected these key residues by 541 chance, it seems unlikely. For example, in the Zhang et al. [27] study there are six 542 residues that were experimentally shown as important for the extender unit specificity. 543 We found five of them in the same IC (IC 4). As we could detect five of the six residues, 544 the probability of detecting them by chance in an IC with a length of 123 amino acids 545 (length of IC 4) from a domain sequence whose length is 342 (AT domain + PAL1 , which 547 is equal to 0.022. This low probability indicates finding five of the six experimentally 548 determined functional residues in an IC with 123 residues by chance is quite small.

549
Historically, it has proved difficult to identify the residues that do provide substrate 550 specificity to the different ATs, indeed identifying residues important for catalytic 551 activity and residue specificity is a general problem in protein design. Directed 552 evolution has often demonstrated that residues far from the active site can have a large 553 effect on these properties and it is not understood why, which makes it difficult to 554 identify such residues. The residues identified here are also seen to be distributed close 555 to and far from the active site cleft (Fig 7), suggesting that this may be a way to 556 identify such residues. Some caution is required, since the residues that have been 557 identified are generally highly conserved within subtypes and the correlations seen 558 might be driven by unimportant changes in common ancestors, i.e. tied to phylogeny 559 but not function. However, support for interpreting the residues as functionally 560 important is provided by the number of residues in IC 4 that have already been 561 identified as important for functionality, as noted above. Thus, the residues identified at 562 the left and right sides of Fig 6D and at the left side of Fig 8C (discussed below) are 563 prime targets for mutagenesis work to change the substrate specificity of AT domains, 564 although the implication of the method here is that the correct network of residue types 565 needs to be incorporated, and not simply isolated mutations [44]. 566 Sequence-position mapping based on the AT extender unit specificity also clarifies 567 the ambiguity of the double peak in IC 8 arising from the domain composition mapping 568 (Fig 8A). The second peak is composed of only methylmalonyl specific ATs while the 569 first peak bears a mixture of the rest. The distinction between the methylmalonyl-(and 570 ethylmalonyl-) specific ATs and the others is also clear in pairwise comparison of the 571 IC 8 residues' sequence. (Fig 8B, S18 FigC).

572
The residues of IC8 are much more highly conserved in the methylmalonyl and ethyl 573 malonyl specific AT domains than within the malonyl specific ATs (Fig 8C). We are not 574 aware of any experimental work to support or disagree with a role for these residues in 575 methylmalonyl specificity, yet the separation of the methylmalonyl specific AT 576 sequences for the malonyl specific ones is much clearer along IC 8 than for any other IC, 577 which is consistent with the large values of the ∆Ṽ p between methylmalonyl and 578 malonyl specific ATs. It seems worth reminding the reader that the residues in IC 8 and 579 IC 6 were present together in cis-AT IC 2, from our analysis solely of cis-AT sequences, 580 consistent with their both having a similar role, which seems most likely to be substrate 581 specificity, and they cluster together in the hierarchical clustering, indicating that there 582 is some coupling between residues in the two ICs. The residues of IC 8 are shown on a 583 model structure in  Highly conserved residues in both malonyl-CoA and methylmalonyl-CoA specific 585 subtypes are detected in the IC 6 ( Fig 10B). Catalytic residues (S624, H724) are also 586 detected in this IC. This suggests that the residues detected in IC 6 are critical for the 587 proper function of the AT domain irrespective of the extender unit specificity.

2.4.3
Residues that distinguish trans-AT and cis-AT systems.

589
A large number, but not all, trans-AT sequences separate from the other sequences by 590 sequence position mapping along IC 1 (Fig 11A), which consists of residues from both 591 of the KS domains, KAL, ACP and KRc, but not the KRs, ACP linkers, AT or the 592 PAL1. These residues show very similar variation across the cis-AT sequences, but see 593 different residues at these positions in the trans-AT sequences, and slightly higher 594 conservation (Fig 11C).  compared to other KR types, as it also is to some degree in IC 10, and the distribution 602 of C type domains along IC 21 is slightly skewed to the right compared to other KR 603 types (S20 Fig). The residues covarying in IC 3 are from the KRc and KRs subdomains 604 and include the conserved catalytic triad, so it is congruent with this that the 605 non-reducing C type KR domains should have a different coevolutionary profile for this 606 set of residues compared to the reducing KR types. Similarly IC 10 and IC 21 consist of 607 covariance from residues in KRc and the KRs. In the hierarchical clustering ICs 3, 10 608 and 21 are closely related (Fig 3). In contrast IC 16, which also contains coupled 609 residues from the KRc, albeit more weakly coupled than in IC 3, does not show a skew 610 in the distribution of C type sequences compared to the other KR types.

611
Thus C type residues can be separated from other KR types by sectors analyis. This 612 might be anticipated since we need only detect the non-functional substitution of the 613 residues required for the reduction of the β-keto group. KR domains of the C2 type do 614 need to maintain the capability to bind the substrate and to epimerise the alpha carbon. 615 We did not attempt to separate the C1 and C2 types along the ICs, since we are not 616 aware of any existing sequence signature that we could use to characterise the two types 617 of sequence.

618
The results suggest that A and B type KR domains can also be separated along the 619 ICs, although there are too few examples of B2 type for any conclusions to be drawn for 620 these. IC 17 shows a skew to the right for A type and to the left for B1 type (Fig 12). 621 C type have no skew along this IC. A similar although weaker skew is seen when the 622 sequences are projected onto IC 10, which as noted above also shows a left hand skew 623 for C type KR domains (S20 Fig). IC 10 and 17 are next to each other in the 624 hierarchical cluster analysis and are two of the most closely coupled ICs (Fig 3). This 625 suggests that IC 17 and, to a lesser extent, IC 10 specify the residues that determine 626 whether the beta-hydroxyl is L or D configured.

627
B type KR domains have a sequence fingerprint of an LDD motif, of which KR1 of 628 the DEBS pathway, the one we use in our search query here, is an example. The LD of 629 this motif is present in IC 17, however the D following the L is not well conserved and is 630 not thought to be particularly important. The final D of the motif has been shown to 631 be important and is hypothesised to control the direction of entry of the substrate into 632 the active site, which is thought to control the direction from which the hydride attacks 633 the carbonyl group and thus the chirality of the resulting beta-hydroxyl [38]. The 634 mutant D1201A (according to the numbering scheme we use here, the last D of the 635 LDD motif) reduced the production of the natural product by 40%, and 636 D1201A/F1244G reduced production to 10%, also implicating F1244 as an important 637 residue [38]. D1201 is detected in IC 3, which contains the highly conserved positions 638 of KRc (S11 Fig) and F1244 is detected in IC 17.

639
Type A KR domains have no LDD domain but rather have a conserved tryptophan 640 and these two changes, as compared to B type, are thought to lead to the substrate 641 entering with the opposite face of its beta-carbonyl facing the NADPH, and thus to the 642 L stereo-configuration of the resulting beta-hydroxyl group. This key tryptophan, 643 located at position 1248, is in IC 10, which also includes residues from a loop-helix lid 644 region that closes the top of the active site cavity after substrate binding. Residues 645 A1286, G1287, A1291, from the loop, and F1299, R1300 and H1302, from the last turn 646 of the helix, are in IC 10. The helix of this region is thought to be also involved in 647 directing the substrate into the active site, with residues V1295 and R1298 (inferred 648 from the structure of first ketoreductase of the tylosin PKS, pdb ID 2Z5L) [9], which 649 are from two consecutive turns of the helix and most likely point from the helix into the 650 space where the substrate is thought pass in A type KR domains [9], being in IC 17.

651
Key conserved residues at the N-terminus of the loop and C-terminus of the helix are in 652 IC 3 (A1281, T1284, W1285 and G1303). For completeness, G1283 is in IC 16, S1288 is 653 in IC 1 and G1289 and M1290 are in IC 28.

654
Sequence position mapping of IC 17 distinguishes A1/A2 type KR domains from 655 B1/B2 type KR domains (Fig 12A). The similarity between the sequences when 656 considering only those residues contributing to IC 17 is shown in Fig 12C as a   657 dendrogram. Although the clustering pattern of IC 17 residues resembles the 658 dendrogram from clustering whole KR domain sequences (Fig 12B) differences in the 659 distribution of different KR types can be seen.

660
Similar to our analysis of AT domains, we identified the residues of IC 17 that 661 differentiate A1 from B1 ketoreductases, by calculating the difference of the scores of 662 the sequence subtypes projected onto the IC. Since the number of sequences of type B2 663 and type A2 are low, we applied this approach only on types B1 and A1.
The residues in the IC were sorted according to the ∆Ṽ p IC 17 scores (Fig 12D). The 665 LD residues of the LDD motif are close to the right end of the plot supporting the 666 importance of these residues for type B specification. Interestingly, we detect 12 residues 667 with higher scores than L1199 and D1200 residues, suggesting these 12 positions should 668 be taken into consideration for experimental studies to switch the KR type to B2. Residues L1199 and D1200 of the LDD motif, which is a finger print for type B KR domains, occur towards the right of the figure. W1248, the fingerprint residue for A type KR domains, is not present, being found in IC 10.
2.5 Effects of the method of alignment on results.

670
One methodological challenge is to determine which sequences to keep in the alignment, 671 as different alignments may lead to different IC sets. Although differences in the 672 alignments used here resulted in variations in the ICs, the interpretation of the results is 673 not qualitatively changed (Figs. S10 Fig, S15 Fig, S1 Table, S2 Table). Additionally, 674 there are some unexpected residues in certain regions of the MSA, possibly causing 675 some noise in the results. For example, there are some residues in the AT region of the 676 sequences labeled as trans-AT while no residue would be expected to be detected in that 677 region. These residues in the trans-AT sequences are thought to arise either from the  Table). 685 Although the overall similarities between the ICs are not very high, the ICs that define 686 the domain boundary and show sequence divergence in different domain sub-types have 687 high similarity with the ICs of the MSA generated based on domain composition (S4 688   Table), suggesting that even though the MSA varies, ICs that carry important 689 information can be detected. with experimental studies as well as residues strongly associated with subtype specificity. 707 This gives one explanation for the importance of domain boundary optimisation for the 708 successful swaps [19], since residues within the domain boundaries have clearly been 709 under selective pressure to function together. Furthermore, sequence-position mapping 710 analysis identified groups of residues that distinguish subtypes of domains, with the 711 analysis showing the inclusion of experimentally validated residues but also implicating 712 further residues as important. This is consistent with experimental results since 713 mutating only a few residues has not yet switched a domain from one subtype to 714 another.

715
The challenge, as regards re-engineering for novel purpose a multidomain protein 716 such as a polyketide synthase, is not to identify the functional domains; hidden Markov 717 models, BLAST and similar sequence searching tools can perform this job well. The 718 challenge is to know exactly where the functional boundaries of these domains lie, so 719 that domains can be spliced from one system into another to generate novel 720 functionality. In the case of polyketide synthases the KR domains were originally only 721 annoted for the KRc segment and the structural subdomain of the KR was completely 722 ignored [38], yet the coevolutionary analysis here marks the KRs and KRc as one 723 clearly co-evolving unit very much distinct from the rest of the module. Similarly, the 724 analysis also highlights the AT PAL as a distinct coevolving unit, which has been shown 725 in experiment to be critical for swapping an exogenous AT for the native AT of a 726 module. The coupling we see here between the residues in the C-ter of KS2 is intriguing 727 and it is unclear what it represents functionally, but it may provide some insight into 728 how to reengineer KS domains.  Here we have highlighted KR and AT boundaries and residues associated with 737 subtype specificity but an attentive reader may spot further associations in the data. type KR sequences along IC8 disproportionately associate with the methyl malonyl and 743 ethyl malonyl AT types. Similarly, projecting the sequences onto IC4 shows a leftward 744 skew in the distribution of A2 type KR sequences, which again correlates with methyl 745 malonyl and ethyl malonyl AT types. This assocation is presumably because there is no 746 need to epimerise the alpha carbon if it has two hydrogen atoms attached, as is the case 747 for malonyl extender units. Sectors analysis has been found to provide some insight into 748 allosteric regulation [45] and future work looking at sectors in PKSs may give clues as 749 to how they coordinate the biosynthetic process across multiple domains and modules. 750 Beyond the interpretation of results in PKS systems, and the potential to break a 751 bottle neck in the research in that field by identifying residues that correlate with 752 subtype specificity, we have also demonstrated the benefits of using bootstrapping to 753 test the robustness of results, which is likely to be important in the application of these 754 methods to other multi-domain protein families. There seems to be no obvious method 755 for a priori determining the number of sequences needed for a reliable sectors analysis, 756 but bootstrapping provides a way of having confidence in the output. We have also 757 shown how to adapt the existing method to predict the likely importance of different 758 residues in their contributions to subtype specificity.

759
The SCA/Sectors analysis applied here is an unsupervised method for discovering 760 structure within protein sequences, as are related techniques such as DCA [46,47] where protein sequences are few in number. Since the depth of the sequence alignment 772 was clearly important in the analysis presented in this paper, the incorporation of such 773 techniques into neural networks proposes exciting possibilities for the future. Beyond 774 this other methods for calculating groups of co-evolving residues, not investigated here, 775 already exist [55,56] and using unsupervised training of deep networks also seems to 776 reveal information about the structure of proteins [57]. However, the aforementioned 777 methods have yet to be tested against experiment in the way that SCA/sectors analysis 778 has. A particular interest would be whether they can recapitulate the success of 779 SCA/sectors in predicting pathways of allosteric connectivity through the protein, 780 which is of interest not only for studying protein function and evolution, but also 781 potentially for drug discovery [58]. sequences.

797
The MSA was preprocessed with the scaProcessMSA.py script from the SCA 798 analysis tool [34] and the reference sequence was set to the DEBS1 query sequence by 799 -refindex 0 command. The second and third steps of the analysis, which apply the SCA 800 method and determine the independent components, were performed with the 801 scaCore.py and scaSector.py scripts [34].  with LDD motifs were labelled as type B KR domains, the ones with a tryptophan eight 812 residues before the catalytic tyrosine were labelled as type A. Further classification was 813 made based on the three residues before the catalytic tyrosine. Since leucine, histidine, 814 and glutamine residues are conserved in B2 type, A2 type and A1-B1 types of KR  The similarity score (ss) was calculated as where δ(IC m , IC n ) is equal to one if the ratio of the number of mutual residues 826 between two ICs to the number of residues in the IC with fewer residues 827 (min(len(IC m ),len(IC n ))) is greater than a threshold (set at 0.7) and either IC m or 828 IC n is not matched with another IC; zero otherwise. In order to detect the coevolved residue groups, the first step is to calculate the 831 covariation of the amino acids along the MSA: where f a i is the frequency of the amino acid a at position i; f b j is the frequency of 833 amino acid b at position j. f ab ij is their joint probability, which measures the probability 834 of a and b being at positions i and j simultaneously. 835 Position-specific conservation is calculated by the Kullback-Leibler relative entropy 836 and the covariance matrix (C ab ij ) is weighted by the conservation score of each amino 837 acid at each position with respect to the amino acid frequency, which is denoted by φ a i 838 for amino acid a at position i and φ b j for amino acid b at position j. This gives the 839 conservation weighted correlation matrix,C ab ij , where: C ab ij , is a four-dimensional matrix (LxLx20x20, where L is the length of the sequence) 841 and it is compressed to a two-dimensional LxL matrix by taking the Frobenius norm.

842
The resulting matrixC ij (or statistical coupling analysis (SCA) matrix) is used to 843 determine the coevolved residue groups.

844
The matrixC ij gives the couplings between the positions of the amino acids but 845 doesn't define networks of co-evolving residues that might have specific subfunction 846 such as catalytic activity, ligand binding, or allosteric signalling. To do that, we need to 847 transformC ij in a way that can separate the residues as different groups and these 848 residue groups should be as independent as possible from each other for the proper 849 detection of a residue group-function relation. For this transformation, firstly, 850 eigenvalue (spectral) decomposition is applied to theC ij . This process decomposesC ij 851 into three matrices: whereṼ is an eigenvector matrix and∆ is an eigenvalue matrix, which is a diagonal components, which are maximally independent from each other.
where W is a transformation matrix. This mathematical transformation is known as 869 independent component analysis (ICA) [62,63] and resulting residue groups are called 870 independent components (ICs) [34]. 871 40 independent components are detected; however, only 34 of them have at least two 872 residues in the groups. Therefore, six of them were not taken into account for the 873 further analyses.

874
When the first eigenvalue is much larger than the remaining significant eigenvalues, 875 it indicates that the first eigenvector is the "coherent" mode and it describes the 876 contribution to all positions to the total correlation [33]. In our system, the first 877 eigenvalue is 822, whereas the second highest is only 143, indicating the dominant first 878 mode. As the first mode has the contributions from all positions, it was removed for the 879 SCA matrix visualization and hierarchical clustering for a clearer detection of the 880 covariation of the residue groups.  Hierarchical clustering was applied on the average-coupling score matrix by scipy.cluster 887 hierarchical clustering package where complete linkage calculations were performed on 888 the average-coupling matrix as distance matrix and the clusters were generated from the 889 calculated linkage matrix.  Basically, SVD is used to decompose any matrix into three matrices: Here, U carries the information about sequence similarities (rows-based analysis) and V 901 carries information about position similarities (columns-based analysis). λ is a diagonal 902 matrix that carries information about which parts of the U and V matrices provide the 903 "more important" contributions to the X matrix.

904
With a slight modification, the matrix U , which carries the sequence similarity 905 pattern, can be obtained from the alignment matrix, X, and the position correlation 906 matrix, V : This means that we can obtain sequence similarity patterns specific to coevolved 908 residue groups i.e. independent components. However, since we applied modifications 909 on the positional correlations of the X matrix to obtain the coupling matrix (C ij ), we 910 need to follow the same transformations.
wherex is a compressed alignment matrix (from MxLx20 to MxL),Ṽ and∆ are 912 eigenvectors and eigenvalues ofC ij , respectively. And as the final step, the ICA 913 transformation is applied: where W is the same transformation matrix that was used to transform the top 915 eigenvectors to independent components.

916
The finalŨ p matrix carries the information of sequence divergence patterns of the 917 coevolved residue groups (independent components). Therefore, this analysis allows us 918 to map positional correlations onto sequences and thus this protocol is referred to as 919 sequence-position mapping. For a more detailed explanation of the method please see 920 [34]. only if they pass the threshold test for inclusion in a given IC, with each IC assigned its 1013