Deep learning-enabled design of synthetic orthologs of a signaling protein

Evolution-based deep generative models represent an exciting direction in understanding and designing proteins. An open question is whether such models can represent the constraints underlying specialized functions that are necessary for organismal fitness in specific biological contexts. Here, we examine the ability of three different models to produce synthetic versions of SH3 domains that can support function in a yeast stress signaling pathway. Using a select-seq assay, we show that one form of a variational autoencoder (VAE) recapitulates the functional characteristics of natural SH3 domains and classifies fungal SH3 homologs hierarchically by function and phylogeny. Locality in the latent space of the model predicts and extends the function of natural orthologs and exposes amino acid constraints distributed near and far from the SH3 ligand-binding site. The ability of deep generative models to specify orthologous function in vivo opens new avenues for probing and engineering protein function in specific cellular environments.

Introduction siderable sequence divergence in the synthetic proteins, demonstrating access 38 to an enormous space of functional proteins consistent with the evolutionary 39 constraints. 40 The DCA model is relatively simple because it is inferred only from the 41 first-and second-order statistics of sequence alignments. Given this, it is 42 impressive that it can suffice to capture the design constraints for specifying 43 proteins that can fold and function in their natural cellular context. How- 44 ever, it is also true that the chorismate mutases largely represent a family of 45 orthologs -extant proteins that are descended by speciation events and are 46 expected to share the same function across species. Indeed, a large fraction of homologous chorismate mutases operate in E. coli in the specific experi- 48 mental conditions in which the design was carried out [5]. Such consistency 49 of function in a protein family likely represents a simpler problem for infer- 50 ence of generative models. A deeper and more general test of evolution-based 51 generative models would come from a study of a family of paralogs -proteins 52 that arose through gene duplication events and typically have diverged to 53 carry out distinct and specialized functions. Indeed, paralogs of a protein 54 family are thought to under strong selection to be functionally orthogonal 55 with respect to each other [28], a strategy to ensure specificity in signaling 56 [28,29] and metabolic [30] pathways. These observations raise the question 57 of whether it is even possible to make generative models for specific orthologs 58 given input data comprising the full spectrum of functional divergences in ing in the Sho1 pathway [28]. This exclusivity in vivo is recapitulated in 75 direct binding assays with the Pbs2 ligand, demonstrating that the speci-76 ficity is directly encoded in the Sho1 SH3 amino acid sequence. For all these 77 reasons, the SH3 domain provides a powerful system to test the generative 78 power of evolution-based models. 79 Here, we examine the ability of three modern machine-learning approaches 80 to design of "synthetic orthologs" of Sho1 SH3 starting from sequences com-81 prising the full SH3 family. By synthetic orthologs, we mean a designed 82 proteins that span the same diversity as natural Sho1 SH3 orthologs but that 83 are functionally indistinguishable, both in vitro and in vivo. We show that 84 one method (InfoVAE [33]) learns a low-dimensional "latent" space that hier-85 archically organizes SH3 homologs by function and phylogeny. Furthermore, 86 we show that locality in the latent space is both necessary and sufficient to 87 design synthetic Sho1 SH3 orthologs that bind Pbs2 and support osmosensing 88 in S. cerevisiae. Interestingly, constraints on orthology are spread both near 89 and far from the SH3 binding pocket, including many unconserved, solvent-90 exposed regions that would not be conventionally obvious. The capacity to 91 learn the rules for ortholog function from a functionally diverse protein fam-  is independent of sequence similarity within the SH3 domain. This MSA 104 comprises the input data to algorithms that compress the information con-105 tained within the natural sequences into a low-dimensional model (Fig. 1C). 106 If the compression captures the essential constraints on folding and binding 107 specificity, it should be possible to design diverse synthetic orthologs of SH3 108 domains (e.g. Sho1 SH3 ) that reproduce the activity and diversity of natural 109 orthologs (Fig. 1C).
110 Figure 1: Evolutionary-based deep generative models of SH3 domains in the context of the yeast high-osmolarity pathway. (A) A structure of the S. cerevisiae Sho1 SH3 domain (PDB 2VKN) in complex with the Pbs2 peptide ligand (yellow stick bonds). SH3 domains are protein interaction modules that bind to polyproline containing target ligands. (B) Binding between the Sho1 SH3 domain and its target sequence in the Pbs2 MAP kinase kinase mediates responses to fluctuations in external osmotic pressure by controlling the production of internal osmolytes. (C) Schematic of evolutionary-based data-driven generative models, consisting of a compression step (the encoder) that maps a sequence alignment of natural homologs to a low-dimensional Gaussian latent space (blue box), defined by vector z for each sequence, and a decoder which converts latent space coordinates to protein sequences. By definition a VAE is trained to reproduce its inputs; thus decoded sequences represent hypotheses for synthetic members of the protein family. (D) The three-dimensional latent space for the SH3 MSA; the Sho1 SH3 ortholog group is highlighted in red.
The first model we consider is the Boltzmann machine direct-coupling 111 analysis (bmDCA) [26]. The DCA approach assumes that the probability of 112 each natural amino acid sequence x = (x 1 , . . . , x L ) to occur is exponentially 113 related to an "energy" function parameterized by the intrinsic constraints 114 on each amino acid x i at each position i (h i (x i )) and the pairwise couplings 115 between amino acids (x i , x j ) at positions (i, j) (J ij (x i , x j )): The parameters (h, J) are trained to reproduce the empirical positional fre- vious work [5] to test whether the design of members of the ortholog family 127 studied in that work generalizes to a functionally diverse family of paralogs.

128
The second class of models we examined are DGMs known as a varia-129 tional autoencoders (VAEs) [34], consisting of two back-to-back deep neural space without the need for computationally expensive numerical simulations 145 [37,38,39,40]. 146 We implemented two forms of a VAE: (1) a generic, widely-used form 147 that we call the "vanilla-VAE", and (2) a variant known as an information 148 maximizing VAE (InfoVAE) [33]. While the generic algorithms have proven 149 useful for studying protein properties [41,42,43,44,45,25,37,39,38], 150 they can also lead to inaccurate latent inference and non-optimal decoder 151 performance [46,47]. The InfoVAE addresses these problems, incorporating 152 additional constraints during training models that encourages more accurate 153 decoding from the latent space for design [33]. We present data on both VAE 154 architectures in this work, but for brevity, we illustrate features of the latent 155 space representations in figures below using the infoVAE method.

156
The VAE latent space for the SH3 family which natural sequences are embedded (Fig. 1D). Interestingly, annotation 161 shows that phylogeny is not the primary organizing principle [25]. For ex- nearly the same organization as coloring by orthology (Fig. 2B, 3D). same probability) as the natural homologs (Fig. S5B) [5]. For the SH3 family, the result shows that no bmDCA designed sequences are capable of full 256 complementation of the Sho1 deletion phenotype, though a few sequences fall 257 into a partial rescue range (Fig. 4B). This result is particularly interesting strates that they globally sample the latent space in both models (Fig. S5C).

274
Experimental analysis with the select-seq assay shows that both models are

294
Experimental testing shows that indeed, local sampling produces a much 295 higher density of fully functional synthetic orthologs (Fig. 4D, 4F). An inter-296 esting observation is that natural Sho1 SH3 orthologs fall into phylogenetically 297 defined radially organized sub-regions within an overall space filled out by 298 functional synthetic sequences Fig. 5. Thus, locality in latent space corre-299 sponds to locality in function, even for models trained on sequence data alone 300 and no prior knowledge of function. 301 We selected five synthetic orthologs that show full function in vivo for 302 in-depth biochemical characterization. These proteins were expressed in Es-303 cherichia coli as His6-tagged fusions, purified to homogeneity, and assayed 304 for (1) binding to the S. cerevisiae Pbs2 target peptide using a standard 305 tryptophan fluorescence assay [49] and (2)   a radially extended wedge-like structure in the VAE latent space (Fig. 5).
To make this quantitative, we defined a minimal polygon in the latent space 331 (a so-called "convex hull") that bounds the natural sequences displaying full 332 function in the S. cerevisiae Sho1 pathway (Fig. 6A). The majority of Sho1 SH3 333 orthologs in the fungal kingdom (155/172) lie within the hull, and very few 334 sequences within the hull are not functional (Fig. 6B). Also, synthetic or-335 thologs embedding inside the hull show the same distribution of function as 336 their natural counterparts (Fig. 6C-D). Thus, the hull represents a bounding The finding that locality within the convex hull of the InfoVAE latent 357 space defines Sho1 SH3 function provides an opportunity to examine the pat-358 tern of amino acid constraints that specifically underlie orthologous function.

359
A simple approach is to compare the conservation of sequence positions in 360 sequences sampled globally from the VAE latent space with that from se-361 quences embedded within the convex hull (Fig. 7). In essence, this analysis 362 provides as first-order view of where the "extra" constraints to be a Sho1 SH3 that contribute the most to Sho1 function (Fig. 7B). The extra constraints 371 for Sho1 SH3 function arise both at known specificity determining sites in the 372 ligand binding pocket [50,51] and at a set of weakly-conserved and solvent-373 exposed positions distributed throughout the protein structure (Fig. 7C). to transit to new phenotypes through paths of single-step variations [54]. 435 Thus, an iterative search of the local environment is a productive approach 436 for discovery of novel functions [55]. The data presented here suggests an