Protein generation with evolutionary diffusion: sequence is all you need

Deep generative models are increasingly powerful tools for the in silico design of novel proteins. Recently, a family of generative models called diffusion models has demonstrated the ability to generate biologically plausible proteins that are dissimilar to any actual proteins seen in nature, enabling unprecedented capability and control in de novo protein design. However, current state-of-the-art models generate protein structures, which limits the scope of their training data and restricts generations to a small and biased subset of protein design space. Here, we introduce a general-purpose diffusion framework, EvoDiff, that combines evolutionary-scale data with the distinct conditioning capabilities of diffusion models for controllable protein generation in sequence space. EvoDiff generates high-fidelity, diverse, and structurally-plausible proteins that cover natural sequence and functional space. Critically, EvoDiff can generate proteins inaccessible to structure-based models, such as those with disordered regions, while maintaining the ability to design scaffolds for functional structural motifs, demonstrating the universality of our sequence-based formulation. We envision that EvoDiff will expand capabilities in protein engineering beyond the structure-function paradigm toward programmable, sequence-first design.

We combine evolutionary-scale datasets with diffusion models to develop a powerful new generative modeling framework, which we term EvoDiff, for controllable protein design from sequence data alone (Fig. 1). Given the natural framing of proteins as sequences of discrete tokens over an amino acid language, we use a discrete diffusion framework in which a forward process iteratively corrupts a protein sequence by changing its amino acid identities, and a learned reverse process, parameterized by a neural network, predicts the changes made at each iteration (Fig. 1B). The reverse process can then be used to generate new protein sequences starting from random noise (Fig. 1C). Importantly, EvoDiff's discrete diffusion formulation is mathematically distinct from continuous diffusion formulations previously used for protein structure design (7-15). Beyond evolutionary-scale datasets of single protein sequences, multiple sequence alignments (MSAs) inherently capture evolutionary relationships by revealing patterns of conservation and variation in the amino acid sequences of sets of related proteins.
We thus additionally build discrete diffusion models trained on MSAs to leverage this additional layer of evolutionary information to generate new single sequences (Fig. 1C-D).
We evaluate our sequence and MSA models, EvoDiff-Seq and EvoDiff-MSA respectively, across a range of generation tasks to demonstrate their power for controllable protein design (Fig. 1D). We first show that EvoDiff-Seq unconditionally generates high-quality, diverse proteins that capture the natural distribution of protein sequence, structural, and functional space.
Using EvoDiff-MSA, we achieve evolution-guided design of novel sequences conditioned on an alignment of evolutionarily-related, but distinct, proteins. Finally, by exploiting the conditioning capabilities of our diffusion-based modeling framework and its grounding in a universal design space, we demonstrate that EvoDiff can reliably generate proteins with IDRs, directly overcoming a key limitation of structure-based generative models, and generate scaffolds for functional structural motifs without any explicit structural information.

Discrete diffusion models of protein sequence

EvoDiff is the first generative diffusion model for protein design trained on evolutionary-scale protein sequence data. We investigated two types of forward processes for diffusion over discrete data modalities (24, 25) to determine which would be most effective (Fig. 1B). In order-agnostic autoregressive diffusion (EvoDiff-OADM, see Methods) (24), one amino acid is converted to a special mask token at each step in the forward process (Fig. 1B). After T = L steps, where L is the length of the sequence, the entire sequence is masked. We additionally designed discrete denoising diffusion probabilistic models (EvoDiff-D3PM, see Methods) (25) for protein sequences. In EvoDiff-D3PM, the forward process corrupts sequences by sampling mutations according to a transition matrix, such that after T steps the sequence is indistinguishable from a uniform sample over the amino acids (Fig. 1B). In the reverse process for both, a neural network model is trained to undo the previous corruption. The trained model can then generate new sequences starting from sequences of masked tokens or of uniformly-sampled amino acids for EvoDiff-OADM or EvoDiff-D3PM, respectively (Fig. 1C).
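The two forward corruption processes can be illustrated with a toy sketch. This is an illustration only, not the EvoDiff implementation: the mask symbol, mutation probability, and number of steps are simplifying assumptions.

```python
import random

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")
MASK = "#"  # stand-in mask symbol (the real model uses a dedicated mask token)

def oadm_forward(seq, rng):
    """OADM forward process: convert one amino acid to the mask token per
    step, in a random order, until the whole sequence is masked (T = L)."""
    seq = list(seq)
    trajectory = []
    for pos in rng.sample(range(len(seq)), len(seq)):
        seq[pos] = MASK
        trajectory.append("".join(seq))
    return trajectory

def d3pm_uniform_forward(seq, steps, mut_prob, rng):
    """D3PM forward process with a uniform transition matrix: at each step,
    every position mutates to a uniformly random amino acid with probability
    mut_prob, so after many steps the sequence is indistinguishable from a
    uniform sample over the alphabet."""
    seq = list(seq)
    for _ in range(steps):
        for i in range(len(seq)):
            if rng.random() < mut_prob:
                seq[i] = rng.choice(AMINO_ACIDS)
    return "".join(seq)

rng = random.Random(0)
masked_traj = oadm_forward("MKTAYIAKQR", rng)
noised = d3pm_uniform_forward("MKTAYIAKQR", steps=500, mut_prob=0.1, rng=rng)
```

After the OADM trajectory, every position is masked; after the D3PM corruption, the sequence retains no usable signal about the input.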
To facilitate direct and quantitative model comparisons, we trained all EvoDiff sequence models on 42M sequences from UniRef50 (26) using a dilated convolutional neural network architecture introduced in the CARP protein masked language model (27). We trained 38M-parameter and 640M-parameter versions for each forward corruption scheme to test the effect of model size on model performance. As a first evaluation of our EvoDiff sequence models, we calculated each model's test-set perplexity, which reflects its ability to capture the distribution of natural sequences and generalize to unseen sequences (see Methods). We observe that EvoDiff-OADM learns to reconstruct the test set more accurately than two tested EvoDiff-D3PM variants employing uniform and BLOSUM62-based transition matrices (Table S1; Fig. S1). Furthermore, EvoDiff-OADM is the only model variant whose performance scales with increased model size (Table S1; Fig. S1).

To explicitly leverage evolutionary information, we designed and trained EvoDiff MSA models using the MSA Transformer (28) architecture on the OpenFold dataset (29). To do so, we subsampled MSAs to a length of 512 residues per sequence and a depth of 64 sequences, either by randomly sampling the sequences ("Random") or by greedily maximizing for sequence diversity ("Max-Hamming").

Structural plausibility of generated sequences

We next investigated whether EvoDiff could generate new protein sequences that were individually valid and structurally plausible. To assess this, we developed a workflow that evaluates the foldability and self-consistency of sequences generated by EvoDiff (Fig. 2A). We generated 1000 sequences from each EvoDiff sequence model with lengths drawn from the empirical distribution of lengths in the training set. We compared EvoDiff's generations to sequences generated from a left-to-right autoregressive language model (LRAR) with the same architecture and training set as EvoDiff and to sequences generated from protein masked language models such as ESM-2 (30) (Figs. 2B-C, S3, S4; Table S3).
We assessed the foldability of individual sequences by predicting their corresponding structures using OmegaFold (31) and computing the average predicted local distance difference test (pLDDT) across the whole structure (Fig. 2B). pLDDT reflects OmegaFold's confidence in its structure prediction for each residue.

Figure 2: EvoDiff generates realistic and structurally-plausible protein sequences. (A) Workflow for evaluating the foldability and self-consistency of sequences generated by EvoDiff sequence models. (B-C) Distributions of foldability, measured by sequence pLDDT of predicted structures (B), and self-consistency, measured by scPerplexity (C), for sequences from the test set, EvoDiff models, and baselines (n=1000 sequences per model; box plots show median and interquartile range). (D) Sequence pLDDT versus scPerplexity for sequences from the test set (grey, n=1000) and the 640M-parameter OADM model EvoDiff-Seq (blue, n=1000). (E) Predicted structures and metrics for representative structurally plausible generations from EvoDiff-Seq, the 640M-parameter OADM model.

In addition to the average pLDDT across a whole protein, we observe that pLDDT scores can vary significantly across a protein sequence (Fig. S5). It is important to note that while pLDDT scores above 70 are often considered to indicate high prediction confidence, low pLDDT scores can be consistent with intrinsically disordered regions (IDRs) of proteins (32), which are found in many natural proteins. As an additional metric of structural plausibility, we computed a self-consistency perplexity (scPerplexity) by redesigning each predicted structure with the inverse folding algorithm ESM-IF (33) and computing the perplexity against the original generated sequence (Fig. 2A, C; Table S3). Given that ESM-IF and EvoDiff were both trained on UniRef50 data, it is possible that sequences from EvoDiff's validation set overlap with sequences in the ESM-IF train set; thus we performed the same self-consistency evaluations using ProteinMPNN (34), which is not trained on UniRef50, for inverse folding (Table S3). While no generative model approaches the test set values for foldability and self-consistency, EvoDiff-OADM outperforms EvoDiff-D3PM and improves with increasing model size (Table S3). We therefore selected the 640M-parameter EvoDiff-OADM model for downstream analysis and hereafter refer to it as EvoDiff-Seq. While a left-to-right autoregressive (LRAR) protein language model generates slightly more structurally-plausible sequences (Table S3), EvoDiff-Seq offers the advantage of direct, flexible conditional generation due to its order-agnostic decoding. Unconditional generation from masked language models produces less structurally-plausible sequences because of the mismatch between the training and generation tasks (Table S3).
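The scPerplexity computation reduces to an exponentiated mean negative log-probability over residues. A minimal sketch follows, with hypothetical per-residue probabilities standing in for the outputs of an inverse-folding model such as ESM-IF; the folding and inverse-folding calls themselves are assumed to happen elsewhere.

```python
import math

def sc_perplexity(orig_seq, per_residue_probs):
    """scPerplexity: exponential of the mean negative log-probability that
    the inverse-folding model assigns to the original residue at each
    position of the generated sequence."""
    assert len(orig_seq) == len(per_residue_probs)
    nll = [-math.log(p) for p in per_residue_probs]
    return math.exp(sum(nll) / len(nll))

# Hypothetical per-residue probabilities (stand-ins; in the paper these come
# from ESM-IF or ProteinMPNN applied to the OmegaFold-predicted structure).
probs = [0.9, 0.5, 0.7, 0.6, 0.8]
score = sc_perplexity("MKTAY", probs)
```

Lower scPerplexity means the inverse-folding model finds the original sequence likely given its own predicted structure, i.e. the sequence is self-consistent.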
Representative examples sampled from EvoDiff-Seq across four different sequence lengths illustrate both structural plausibility and novelty relative to sequences in the training set, demonstrating that EvoDiff generates protein sequences that are individually valid (Fig. 2E).
Biological properties of generated sequence distributions

Having shown that EvoDiff's generations are individually foldable and self-consistent, we next evaluated how well the distribution of designed protein sequences covered natural protein space. Ideally, generated sequences should capture the natural distribution of sequence, structural, and functional properties while still being diverse from each other and from natural sequences.
Previous work has shown that even without explicit supervision, protein language model embeddings contain information about both sequence and function as captured in GO annotations (35, 36). To evaluate coverage over the distribution of sequence and functional properties, we embedded each generated sequence using ProtT5 (37), a protein language model explicitly benchmarked for imputing GO annotations (35), and calculated the embedding space Fréchet distance between a set of generated sequences and the test set, where lower distance reflects better coverage. We refer to this metric as the Fréchet ProtT5 distance (FPD) and visualize these embeddings and the corresponding FPDs for sequences generated by EvoDiff-Seq and baseline models (Figs. 3A, S6, S7; Table S1). For RFdiffusion, we unconditionally generated 1000 structures with the same lengths as for EvoDiff-Seq and then used ESM-IF (33) to design their sequences. Both qualitatively and quantitatively, EvoDiff-Seq generates proteins that better recapitulate natural sequence and functional diversity than sampling from a state-of-the-art protein masked language model (ESM-2) or predicting sequences from structures generated by a state-of-the-art structure diffusion model (RFdiffusion) (Fig. 3A).
To evaluate the distribution of structural properties in generated sequences, we computed 3-state secondary structures (38) for each residue in generated and natural sequences and quantitatively compared the resulting distributions of structural properties to the distribution for the test set (Figs. 3B, S8). EvoDiff-Seq generates proportions of strands and disordered regions that are much more similar to those in natural sequences, while ESM-2 and RFdiffusion both generate proteins enriched in helices (Fig. 3B). To ensure our models were not memorizing training data, we calculated the Hamming distance between each generated sequence and all training sequences of the same length and reported the minimum Hamming distance, representing the closest match of any generated sequence to any sequence in the train set (Table S1). On average, a sequence generated from EvoDiff-Seq has a Hamming distance of 0.83 from the most similar training sequence of the same length. Together, these results demonstrate, via comparison to ESM-2 and RFdiffusion, that EvoDiff's diffusion objective and evolutionary-scale training data are both necessary to generate novel sequences that cover protein sequence, functional, and structural space.
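The novelty metric (minimum normalized Hamming distance to same-length training sequences) can be sketched as follows; the sequences below are toy stand-ins.

```python
def normalized_hamming(a, b):
    """Fraction of positions at which two equal-length sequences differ."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b)) / len(a)

def novelty(generated, train_set):
    """Minimum normalized Hamming distance from a generated sequence to any
    training sequence of the same length (0 = exact match in training data,
    1 = no position shared with any same-length training sequence)."""
    same_len = [s for s in train_set if len(s) == len(generated)]
    return min(normalized_hamming(generated, s) for s in same_len)

# Toy training set (hypothetical sequences for illustration).
train = ["MKTAY", "MKTAW", "AAAAA", "MKT"]
d = novelty("MKTAV", train)
```

Here `d` is 0.2: the closest same-length training sequence differs at one of five positions.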
Conditional sequence generation for controllable design

EvoDiff's OADM diffusion framework induces a natural method for conditional sequence generation by fixing some subsequences and inpainting the remainder. Because the model is trained to generate proteins with an arbitrary decoding order, this is easily accomplished by simply masking and decoding the desired portions. We applied EvoDiff's power for controllable protein design across three scenarios: conditioning on evolutionary information encoded in MSAs, inpainting functional domains, and scaffolding functional structural motifs (Fig. 1D).
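Inpainting with an order-agnostic model can be sketched as follows. The stub predictor is a hypothetical stand-in for the trained network; a real model would condition each prediction on the full sequence context.

```python
import random

MASK = "#"  # stand-in mask symbol

def inpaint(template, predict, rng):
    """Conditional generation by inpainting: positions marked MASK are
    decoded one at a time in a random order; fixed residues are never
    touched. `predict` maps (current sequence, position) -> residue and
    stands in for the trained order-agnostic denoiser."""
    seq = list(template)
    masked = [i for i, c in enumerate(seq) if c == MASK]
    rng.shuffle(masked)  # order-agnostic decoding order
    for pos in masked:
        seq[pos] = predict(seq, pos)
    return "".join(seq)

# Stub predictor for illustration only: always decodes alanine.
stub = lambda seq, pos: "A"
out = inpaint("MK##Y##W", stub, random.Random(0))
```

The fixed residues (`M`, `K`, `Y`, `W`) are preserved exactly, while the masked spans are filled in, which is precisely how conditioning is expressed in the OADM framework.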
Evolution-guided protein generation with EvoDiff-MSA

First, we tested the ability of EvoDiff-MSA to generate query sequences conditioned on the remainder of an MSA, thus generating new members of a protein family without needing to train family-specific generative models. We masked the query sequences from 250 randomly-chosen MSAs from the validation set and newly generated these sequences using EvoDiff-MSA. We then evaluated the quality of the resulting conditionally-generated query sequences via our foldability and self-consistency pipeline (Fig. 4A). We find that EvoDiff-MSA generates more foldable and self-consistent sequences than sampling from ESM-MSA (28) (Table S4). To evaluate sample diversity, we computed the aligned residue-wise sequence similarity between the generated query sequence and the most similar sequence in the original MSA. In contrast to sampling from a Potts model, generating from EvoDiff-MSA yields sequences that exhibit strikingly low similarity to those in the original MSA (Fig. 4D; Table S4) while still retaining structural integrity relative to the original query sequences (Fig. 4E-F). Representative examples show conditionally-generated query sequences that exhibit low sequence similarity to anything in the conditioning MSA (Fig. 4G). These results indicate that EvoDiff-MSA can conditionally generate novel, structurally plausible members of a protein family given guidance from evolutionary information and without further finetuning.
Generating intrinsically disordered regions

Because it generates directly in sequence space, we hypothesized that EvoDiff could natively generate intrinsically disordered regions (IDRs).
IDRs are regions within a protein that lack secondary or tertiary structure; up to 30% of eukaryotic proteins contain at least one IDR, and IDRs make up over 40% of the residues in eukaryotic proteomes (19). IDRs carry out important and diverse functional roles in the cell directly facilitated by their lack of structure, such as protein-protein interactions (40, 41) and signaling (42).
Altered abundance and mutations in IDRs have been implicated in human disease, including neurodegeneration and cancer (43-45). Despite their prevalence and critical roles in function and disease, IDRs do not fit neatly in the structure-function paradigm and remain outside the capabilities of structure-based protein design methods.
Having observed that unconditional generation using EvoDiff-Seq produced a similar fraction of residues predicted to lack secondary structure as that in natural sequences (Fig. 3B), we used inpainting with EvoDiff-Seq and EvoDiff-MSA to intentionally generate disordered regions via conditioning on their surrounding structured regions (Fig. 5A). To accomplish this, we leveraged a previously curated dataset of computationally predicted IDRs covering the human proteome (46). We selected this dataset because it also curates orthologs for these proteins, enabling construction of MSAs (46). After using EvoDiff to generate putative IDRs via inpainting, we then predicted disorder scores for each residue in the generated and natural sequences using DR-BERT (47) (Figs. 5A, S10). Over 100 generations, we observe that IDR regions inpainted by EvoDiff-Seq and EvoDiff-MSA result in distributions of disorder scores similar to those of the natural IDRs they replace. Generations from EvoDiff-MSA exhibit strong correlation in predicted disorder scores with those of true IDRs (Fig. S11). Although putative IDRs generated by EvoDiff-Seq are less similar to their original IDR than those from EvoDiff-MSA (Fig. 5C), both models generated disordered regions that preserve disorder scores over the entire protein sequence and still exhibit low sequence similarity to the original IDR (Fig. 5D-E). These results demonstrate that EvoDiff can robustly generate IDRs conditioned on sequence context from surrounding structured regions.
Scaffolding functional motifs with sequence information alone

Thus far, the primary application of deep generative models of protein structure in protein engineering is their ability to scaffold binding and catalytic motifs: given the 3D coordinates of a functional motif, these models can often generate a structural scaffold that holds the motif in precisely the 3D geometry needed for function (10, 14, 48). Given that the fixed functional motif includes the residue identities for the motif, we investigated whether a structural model is actually necessary for motif scaffolding.
We used conditional generation with EvoDiff to generate scaffolds for a diverse set of 17 motif-scaffolding problems (10) by fixing the functional motif, supplying only the motif's amino-acid sequence as conditioning information, and then decoding the remainder of the sequence (Fig. 1D). The problems include simple "inpainting", viral epitopes, receptor traps, small molecule binding sites, protein-binding interfaces, and enzyme active sites. Many of the motifs are not contiguous in sequence space. We compared the performance of EvoDiff, which uses only sequence information, to the state-of-the-art structure model RFdiffusion, and facilitated direct comparisons by using OmegaFold to predict structures for our generated sequences as well as for sequences inverse-folded from RFdiffusion structures. Notably, we use the same EvoDiff models for both unconditional and conditional generation, while the version  of RFdiffusion used for scaffolding is finetuned from that used for unconditional generation.
We evaluated the ability of each of EvoDiff-Seq, EvoDiff-MSA, and RFdiffusion to generate successful scaffolds (Fig. 6A-B), where we define a scaffold to be successful if the predicted motif coordinates have less than 1 Å RMSD from the desired motif coordinates. Despite operating entirely in sequence space, EvoDiff-Seq and EvoDiff-MSA generate successful scaffolds for 8 and 13 of the 17 problems, respectively (Tables S5, S6). EvoDiff-MSA has a higher success rate than EvoDiff-Seq for 10 problems and a higher success rate than RFdiffusion for 6 problems. EvoDiff-Seq has a higher success rate than RFdiffusion for 2 problems and a higher success rate than EvoDiff-MSA for 3 problems. There are two scaffolding problems (1YCR, 3IXT) where EvoDiff-MSA is outperformed by both EvoDiff-Seq and RFdiffusion. Interestingly, there is almost no correlation between the problem-specific success rates of EvoDiff and RFdiffusion, and there are very few problems for which both methods have high success rates, showing that EvoDiff may have orthogonal strengths to RFdiffusion (Fig. 6A-B). Due to its conditioning on evolutionary information, EvoDiff-MSA generates scaffolds that are more structurally similar to the native scaffold than EvoDiff-Seq (Fig. 6C). To ensure that EvoDiff is not finding trivial solutions, we show that it outperforms both random generation and the single-order LRAR model (which decodes unconditionally up to and after a motif) (Table S5). ESM-MSA performs similarly to EvoDiff-MSA on this task, as the motif scaffolding task is well-aligned with its training task, and it is trained on approximately 200x more MSAs than EvoDiff-MSA (Table S6).
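The scaffold success criterion can be sketched as a simple RMSD check. Coordinates below are toy values, and the structures are assumed to have been superposed (e.g. by a Kabsch alignment) beforehand; the paper's exact alignment procedure may differ.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two already-superposed coordinate
    sets, given as lists of (x, y, z) tuples."""
    assert len(coords_a) == len(coords_b)
    sq = sum((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
             for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b))
    return math.sqrt(sq / len(coords_a))

def scaffold_success(pred_motif, native_motif, threshold=1.0):
    """Success criterion used in the text: motif RMSD below 1 Angstrom."""
    return rmsd(pred_motif, native_motif) < threshold

# Toy motif coordinates (two atoms), roughly 0.2 Angstrom apart on average.
native = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)]
pred = [(0.2, 0.1, 0.0), (3.9, -0.1, 0.1)]
ok = scaffold_success(pred, native)
```

In practice the predicted motif coordinates come from folding the generated sequence (here with OmegaFold) and extracting the motif residues.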
We illustrate examples of successful scaffolds sampled from EvoDiff and note both the qualitative and quantitative quality of generated proteins and predicted structures across a range of functional motifs (Fig. 6D-G). These results demonstrate that EvoDiff can design functional scaffolds around structural motifs via conditional generation in sequence space alone.

Discussion
We present EvoDiff, a diffusion modeling framework capable of generating high-fidelity, diverse, and novel proteins with the option of conditioning according to sequence constraints.
Because it operates in the universal protein design space, EvoDiff can unconditionally sample diverse structurally-plausible proteins, generate intrinsically disordered regions, and scaffold structural motifs using only sequence information, challenging a paradigm in structure-based protein design.
EvoDiff is the first deep learning framework to demonstrate the power of diffusion generative modeling on evolutionary-scale protein sequence data. This generalized mathematical formulation yields empirical benefits, as EvoDiff-Seq produces sequences that better cover protein functional and structural space than sampling from state-of-the-art protein MLMs (Fig. 3). While an LRAR model learned to fit the evolutionary sequence distribution better, its fixed left-to-right decoding order limits conditional generation; EvoDiff overcomes this barrier by enabling different forms of conditioning, including evolution-guided generation (Fig. 4) as well as inpainting and scaffolding (Figs. 5-6). We report the first demonstrations of these programmable generation capabilities from deep generative models of protein sequence alone.
Future work may expand these capabilities to enable conditioning via guidance, in which generated sequences can be iteratively refined to fit desired properties. While we observe that OADM generally outperforms D3PM in unconditional generation, likely because the OADM denoising task is easier to learn than that of D3PM, conditioning via guidance intuitively fits into the EvoDiff-D3PM framework because the identity of each residue in a sequence can be edited at every decoding step. OADM and existing conditional LRAR models, such as ProGen (54), both fix the identity of each amino acid once it is decoded, limiting the effectiveness of guidance. Guidance-based conditioning of EvoDiff-D3PM should enable the generation of new protein sequences satisfying functional objectives, such as those specified by sequence-function classifiers.
Because EvoDiff only requires sequence data, it can readily be extended for diverse downstream applications, including those not reachable from a traditional structure-based paradigm.
As a first example, we have demonstrated EvoDiff's ability to generate IDRs, overcoming a prototypical failure mode of structure-based predictive and generative models, via inpainting without fine-tuning. Fine-tuning EvoDiff on application-specific datasets, such as those from display libraries or large-scale screens, may unlock new biological, therapeutic, or scientific design opportunities that would otherwise be inaccessible due to the cost of obtaining structures for large sequence datasets. Experimental structural data are much sparser than sequence data, and while structures for many sequences can be predicted using AlphaFold and similar algorithms, these methods do not work well on point mutants and can be overconfident on spurious proteins (59, 60).

While we demonstrated some coarse-grained strategies for conditioning generation through scaffolding and inpainting, to achieve even more fine-grained control over protein function, with future development EvoDiff may be conditioned on text, chemical information, or other modalities. For example, text-based conditioning (61) could be used to ensure that generated proteins are soluble, readily expressed, and non-immunogenic. Future use cases for this vision of controllable protein sequence design include programmable modulation of nucleic acids via conditionally-designed transcription factors or endonucleases, improved therapeutic windows via biologics optimized for in vivo delivery and trafficking, as well as newly-enabled catalysis via zero-shot tuning of enzyme substrate specificity.
In summary, we present an open-source suite of discrete diffusion models that provide a foundation for sequence-based protein engineering and design. EvoDiff models can be directly deployed for unconditional, evolution-guided, and conditional generation of protein sequences and may be extended for guided design based on structure or function. We envision that EvoDiff will enable new abilities in controllable protein design by reading and writing function directly in the language of proteins.

Methods
Diffusion models

Diffusion models are a class of generative models that learn to generate data from noise. They consist of a forward corruption process and a learned reverse denoising process. The forward process is a Markov chain of diffusion steps q(x_t \mid x_{t-1}) that corrupts an input x_0 over T timesteps such that x_T is indistinguishable from random noise. The learned reverse denoising process p_\theta(x_{t-1} \mid x_t) is parameterized by a model such as a neural network and generates new data from noise. Discrete diffusion models have previously been developed over binary random variables (3), developed over categorical random variables with uniform transition matrices (62, 63), linked to autoregressive models (24), and optimized for use with structured transition matrices (25).
This work presents models from two different discrete diffusion frameworks, order-agnostic autoregressive diffusion models (OADMs) and discrete denoising diffusion probabilistic models (D3PMs), on protein sequences and multiple sequence alignments (MSAs).
Discrete Denoising Diffusion Probabilistic Models (D3PMs)

Discrete denoising diffusion probabilistic models (D3PMs) operate by defining a transition matrix Q such that, over T timesteps, discrete inputs (i.e. protein amino-acid sequences for EvoDiff) are iteratively corrupted via a controlled Markov process until they constitute samples from a uniform stationary distribution at time T. This section describes the D3PM process and loss for a single categorical variable x in one-hot format. The forward corruption process is described by:

q(x_t \mid x_{t-1}) = \mathrm{Cat}(x_t; p = x_{t-1} Q_t)

This allows for efficient training via efficient computation of q(x_t \mid x_0) and q(x_{t-1} \mid x_t). The EvoDiff-D3PM models are trained via a hybrid loss function

L_\lambda = L_{vb} + \lambda L_{ce}

This loss combines a variational lower bound L_{vb} on the negative log likelihood and a cross-entropy loss L_{ce} on p_\theta(x_0 \mid x_t). Investigation of the impact of \lambda on model performance revealed minimal improvement to sample generation quality when \lambda > 0, consistent with the findings of the original D3PM paper (25). Thus \lambda = 0 and T = 500 were used in all D3PM experiments.
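The claim that the forward process converges to a uniform stationary distribution can be checked numerically. The sketch below uses an assumed 4-state alphabet and a uniform transition matrix (the real model operates over 20 amino acids plus special tokens, with its own schedule).

```python
def uniform_transition(k, beta):
    """D3PM uniform transition matrix: stay put with probability 1 - beta,
    otherwise jump to one of the k states uniformly at random."""
    return [[1 - beta + beta / k if i == j else beta / k
             for j in range(k)] for i in range(k)]

def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(a[i][m] * b[m][j] for m in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

# Cumulative product Q_bar_T = Q^T drives any one-hot x_0 toward uniform.
K, BETA, T = 4, 0.1, 200  # toy alphabet size and schedule (assumptions)
q = uniform_transition(K, BETA)
q_bar = q
for _ in range(T - 1):
    q_bar = matmul(q_bar, q)
row = q_bar[0]  # distribution of x_T given x_0 = state 0
```

After T = 200 steps, the row is uniform to within roughly (1 - beta)^T, so the corrupted variable carries essentially no information about x_0.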
L_{vb} has three terms. The first, L_T, measures whether the corruption reaches the stationary distribution p(x_T) at time T and does not depend on \theta. The remaining two terms, L_{t-1} and L_0, depend on \theta. Following the original D3PM paper, \tilde{p}_\theta(x_0 \mid x_t) is directly predicted by the neural network. To compute the loss at timesteps 0 < t < T, the terms q(x_{t-1} \mid x_t, x_0) and p_\theta(x_{t-1} \mid x_t) must be computed from x_t, x_0, and \tilde{p}_\theta(x_0 \mid x_t) using Markov properties:

q(x_{t-1} \mid x_t, x_0) = \mathrm{Cat}\!\left(x_{t-1};\; p = \frac{x_t Q_t^\top \odot x_0 \bar{Q}_{t-1}}{x_0 \bar{Q}_t x_t^\top}\right)

where \odot represents an element-wise product and \bar{Q}_t = Q_1 Q_2 \cdots Q_t. The rules of conditional probability and Markov properties are used to define q(x_{t-1}, x_t \mid x_0) in terms of x_t and x_0. Putting everything together, at each step of training, a corruption timestep t is sampled uniformly at random, the input is corrupted according to q(x_t \mid x_0), and the model is updated using the hybrid loss.

Order-Agnostic Autoregressive Diffusion Models (OADMs)

Order-agnostic autoregressive diffusion models (OADMs) generalize absorbing-state D3PM and left-to-right autoregressive models (LRARs) (24). This section describes the OADM process and loss for a sequence x of L categorical variables. In the case of EvoDiff, L is the sequence length.
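The categorical posterior that enters the variational loss can be computed directly from the transition matrices. A toy sketch with 3 states and a hypothetical transition matrix follows; indices rather than one-hot vectors are used for brevity.

```python
def d3pm_posterior(x0, xt, q_t, q_bar_prev):
    """Posterior q(x_{t-1} = j | x_t, x_0) for a single categorical
    variable: proportional to Q_t[j][x_t] * Q_bar_{t-1}[x0][j], then
    normalized over j. q_t and q_bar_prev are row-stochastic matrices."""
    k = len(q_t)
    unnorm = [q_t[j][xt] * q_bar_prev[x0][j] for j in range(k)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# Toy 3-state example; Q is a hypothetical sticky transition matrix.
Q = [[0.8, 0.1, 0.1],
     [0.1, 0.8, 0.1],
     [0.1, 0.1, 0.8]]
# At t = 2, Q_bar_{t-1} = Q_1 = Q.
post = d3pm_posterior(x0=0, xt=0, q_t=Q, q_bar_prev=Q)
```

As expected, when both x_0 and x_t sit in state 0 and transitions are sticky, the posterior puts most of its mass on x_{t-1} = 0.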
LRARs factorize a high-dimensional joint distribution p(x) into the product of L univariate distributions using the probability chain rule:

p(x) = \prod_{t=1}^{L} p(x_t \mid x_{<t})

where x_{<t} = x_1, x_2, \ldots, x_{t-1}. LRARs are typically parametrized using a triangular dependency structure, such as causal masking in a transformer or CNN, in order to allow parallelized computation of all the conditional distributions in the likelihood during training. LRARs learn to generate sequences in a pre-specified left-to-right decoding order, which may be non-obvious for modalities such as proteins and does not allow conditioning on arbitrary fixed subsequences.
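The chain-rule factorization can be sketched directly. The toy conditional below stands in for a trained LRAR model; a real model's conditional would depend on the prefix.

```python
import math

def lrar_log_likelihood(seq, cond_prob):
    """Left-to-right factorization: log p(x) = sum_t log p(x_t | x_{<t}).
    `cond_prob(prefix, token)` stands in for the trained model and returns
    the conditional probability of `token` given the decoded prefix."""
    return sum(math.log(cond_prob(seq[:t], seq[t])) for t in range(len(seq)))

# Hypothetical toy model over a two-letter alphabet: p('A') = 0.7
# independent of the prefix (illustration only).
toy = lambda prefix, tok: 0.7 if tok == "A" else 0.3
ll = lrar_log_likelihood("AAB", toy)
```

The log-likelihood of "AAB" under this toy model is simply log 0.7 + log 0.7 + log 0.3.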

LRARs can be expanded into a diffusion framework via two subtle changes. Following the exposition in Hoogeboom et al. (24), the first change is to allow order-agnostic decoding. In an order-agnostic autoregressive model, a decoding order \sigma is first sampled uniformly from the set S_L of all possible decoding orders. At time step t in the forward process, one additional token (the next position in the sampled order) is masked, so that the entire sequence is masked after L steps. The log-likelihood for an order-agnostic autoregressive model is lower-bounded using Jensen's inequality:

\log p(x) \geq \mathbb{E}_{\sigma \sim \mathcal{U}(S_L)} \sum_{t=1}^{L} \log p(x_{\sigma(t)} \mid x_{\sigma(<t)})

The next change involves an objective that optimizes over arbitrary decoding orders one timestep at a time in the style of modern diffusion models, without requiring a neural network that enforces a triangular or causal dependency structure. This is accomplished by replacing the summation over t with an expectation that is appropriately re-weighted.
The overall expected log likelihood can be decomposed into a series of per-timestep likelihood terms L_t:

L_t = \frac{1}{L - t + 1} \, \mathbb{E}_{\sigma \sim \mathcal{U}(S_L)} \sum_{k \in \sigma(\geq t)} \log p(x_k \mid x_{\sigma(<t)})

Thus, the overall expected log likelihood is lower bounded as:

\log p(x) \geq L \cdot \mathbb{E}_{t \sim \mathcal{U}(1, \ldots, L)} \left[ L_t \right]

A neural network can be efficiently trained to learn the reverse process p_\theta(x_{\sigma(t)} \mid x_{\sigma(<t)}) by randomly masking a set of tokens at each iteration and minimizing the reweighted loss, allowing the model to learn from predictions of all masked positions at each timestep. By learning one model over all possible decoding orders, OADM allows for conditioning by fixing arbitrary subsequences at generation time. Sequences were generated unconditionally from OADM models by beginning with an all-mask sequence as input, randomly sampling a decoding order, and sampling each token from the predicted probability distribution.
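One training step of the reweighted OADM objective can be sketched as follows. The stub model and mask symbol are illustrative assumptions; a real step would backpropagate through a neural network's predictions.

```python
import math
import random

def oadm_training_loss(seq, num_masked, predict_logp, rng):
    """One OADM training step: mask `num_masked` randomly chosen positions,
    score the model's log-probability of the true token at each masked
    position, average over the masked positions, and scale by the sequence
    length L as in the reweighted lower bound. `predict_logp(corrupted,
    pos, token)` stands in for the trained network."""
    L = len(seq)
    masked = rng.sample(range(L), num_masked)
    corrupted = ["#" if i in masked else c for i, c in enumerate(seq)]
    avg_logp = sum(predict_logp(corrupted, i, seq[i])
                   for i in masked) / num_masked
    return -L * avg_logp  # negative lower-bound term to minimize

# Hypothetical stub model assigning probability 0.5 to every token.
stub_logp = lambda corrupted, pos, token: math.log(0.5)
loss = oadm_training_loss("MKTAYIAK", 3, stub_logp, random.Random(0))
```

Because every masked position contributes to the averaged loss, a single corrupted sample trains the model on many positions at once, unlike a left-to-right step that scores only the next token.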
Left-to-right autoregressive and masked language models are diffusion models

The connection between autoregressive models and diffusion models has been described previously (24, 25). Left-to-right autoregressive (LRAR) models implement a masked modeling process that iteratively and deterministically masks all tokens to the right of the sampled token x_t, where the current diffusion timestep t is equivalent to the number of tokens masked over the entire sequence length, with all tokens masked at the final timestep. Likewise, masked language models (MLMs) are equivalent to learning only one step t of OADM: \log p(x_k \mid x_{\sigma(<t)}).
Thus, the OADM setup generalizes LRAR models by considering all possible decoding orders rather than left-to-right decoding, while the MLM learning task is equivalent to only training on one step of the OADM diffusion process.
Datasets

Sequence-only EvoDiff models were trained on UniRef50 (26).

38M parameter models were trained on 8 32GB NVIDIA V100 GPUs; 640M parameter models were trained on 32 (2x16) 32GB NVIDIA V100 GPUs. The maximum number of tokens per GPU in each batch was reduced from 40,000 to 6,000 to accommodate training the larger 640M parameter models. 38M parameter models were trained for approximately 2 weeks and saw ca. 3e14 tokens over 700,000 training steps. 640M parameter models were trained for as long as computationally feasible to achieve the best results possible.

Computation of test-set perplexities

Perplexity was calculated by uniformly sampling a timestep for each test sequence, corrupting the sequence according to each diffusion model, predicting the sequence x_0 at t = 0 by passing inputs once through each trained model, and then computing the perplexity. For D3PM models, the perplexity is computed over all positions:

\mathrm{perplexity} = \exp\!\left(-\frac{1}{L} \sum_{i=1}^{L} \log \tilde{p}_\theta(x_0^i \mid x_t)\right)

For OADMs, the perplexity is computed analogously, but averaged over only the masked positions. To enable model comparison, perplexities for MLMs (CARP, ESM-1b, ESM-2) were computed as if they were OADMs.
And for LRAR models, the perplexity is:

\mathrm{Perplexity} = \exp\!\left( -\frac{1}{L} \sum_{i=1}^{L} \log p_\theta(x_i \mid x_{<i}) \right)

Calculated D3PM perplexities were on average higher as t → T and lower as t → 1, and masked perplexities were similarly higher for a greater number of masked tokens per sequence, i.e., as t → L with all tokens masked (Fig. S1, S2). Lower perplexities indicated improved performance and generalization capacity.
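Under these definitions, each perplexity reduces to the exponentiated average negative log-likelihood over the scored tokens; a minimal sketch (helper names are illustrative):

```python
import math

def lrar_perplexity(log_probs):
    """Perplexity under a left-to-right model, given per-token
    log-probabilities log p(x_i | x_{<i}) for all L tokens."""
    return math.exp(-sum(log_probs) / len(log_probs))

def masked_perplexity(log_probs_masked):
    """OADM/MLM-style perplexity, computed over only the t masked
    tokens' log-probabilities."""
    return math.exp(-sum(log_probs_masked) / len(log_probs_masked))
```

For example, a model that is uniform over the 20 amino acids assigns each token log-probability log(1/20), giving a perplexity of exactly 20 under either helper.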
Evaluation of structural plausibility

The structural plausibility pipeline (Fig. 2A) evaluates both the foldability and self-consistency of a given sequence. Foldability was evaluated by averaging the per-residue confidence score, reported as pLDDT by OmegaFold, across the entire sequence. Sequence self-consistency, denoted scPerplexity, describes how likely the generated sequence is to correspond to the predicted structure. Self-consistency was measured by predicting a structure for each sequence with OmegaFold, passing that structure through ESM-IF, and calculating the perplexity between the ESM-IF-predicted sequence and the original generated sequence.
The novelty of generated sequences was evaluated relative to the training data seen by the model by computing the Hamming distance between each generated sequence and every training-set sequence of the same length. The minimum of these Hamming distances, representing the closest sequence seen by the model during training, was reported for each generated sequence.
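The novelty metric above can be sketched directly (function names are illustrative):

```python
def hamming(a, b):
    """Hamming distance between two equal-length sequences."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

def novelty(generated, training_set):
    """Minimum Hamming distance from a generated sequence to any
    training sequence of the same length (smaller = less novel)."""
    same_len = (s for s in training_set if len(s) == len(generated))
    return min(hamming(generated, s) for s in same_len)
```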
Computation of functional and structural features

To evaluate sequence coverage, ProtT5 embeddings were computed for each of 1,000 generated protein sequences and 10,000 sequences sampled from the test set using the Tools from Protein Prediction for Interpretation of Hallucinated Proteins (PPIHP) package (69). The resulting distributions of sequence embeddings (i.e., representing the corresponding distributions of sequences) were compared via the Fréchet ProtT5 distance (FPD):

\mathrm{FPD} = \lVert \mu_{\text{test}} - \mu_{\text{gen}} \rVert^2 + \mathrm{Tr}\!\left( C_{\text{test}} + C_{\text{gen}} - 2 \left( C_{\text{test}} C_{\text{gen}} \right)^{1/2} \right)

where, given the embedding-space feature vectors for the test and generated distributions, µ is the feature-wise mean for each set of sequences and C is the respective covariance matrix.

Evolution-guided generation with EvoDiff-MSA

Starting with either a random or Max-Hamming subsampled MSA, new query sequences were generated by sampling from an all-mask starting query sequence. The generated query sequence was evaluated relative to the corresponding original query sequence using the same tools and workflow described in Evaluation of structural plausibility. Each generated sequence was additionally evaluated for similarity relative to its reference MSA, which is comprised of a query sequence and its aligned sequences. The % similarity of each generated sequence relative to its parent MSA was computed as the maximum % similarity over all sequences in the original MSA. Specifically, for a pair of sequences, the % similarity was computed by counting shared residue identities (accounting for both amino-acid identity and position index in the sequence), and for a given generated sequence the maximum of these % similarities was reported. Across generated sequences, both the CDF and the mean of the maximum % similarity were reported. Generated sequences were additionally evaluated for structural similarity relative to their original query sequences. Structures were predicted for each of the generated query sequences and the original query sequences using OmegaFold.
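The Fréchet ProtT5 distance described above can be sketched with NumPy; here the trace of the matrix square root is computed from the eigenvalues of the covariance product, and the function name and inputs are illustrative (real usage would pass ProtT5 embedding matrices):

```python
import numpy as np

def fpd(emb_test, emb_gen):
    """Fréchet distance between two sets of embedding vectors
    (rows = sequences, columns = embedding feature dimensions)."""
    mu_t, mu_g = emb_test.mean(axis=0), emb_gen.mean(axis=0)
    c_t = np.cov(emb_test, rowvar=False)
    c_g = np.cov(emb_gen, rowvar=False)
    # Tr((C_t C_g)^{1/2}) via the eigenvalues of C_t @ C_g, which are
    # real and non-negative for positive semi-definite covariances.
    eigvals = np.linalg.eigvals(c_t @ c_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(((mu_t - mu_g) ** 2).sum() + np.trace(c_t + c_g) - 2.0 * tr_sqrt)
```

Two identical distributions give FPD ≈ 0; shifting every feature by a constant delta adds delta² per dimension to the distance.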
Structural similarity was measured via the template modeling score (TM-score) (70) for the two predicted structures following structural alignment:

\mathrm{TM\mbox{-}score} = \frac{1}{L_{\text{true}}} \sum_{i=1}^{L_{\text{common}}} \frac{1}{1 + \left( d_i / d_0(L_{\text{true}}) \right)^2}

where L_gen is the length of the generated query sequence; L_common is the number of shared residues; d_i is the distance between the i-th pair of residues; L_true is the length of the true query sequence; and d_0(L_true) = 1.24 \sqrt[3]{L_{\text{true}} - 15} - 1.8 is a distance scale for normalization.
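Given the aligned residue-pair distances, the TM-score reduces to a short computation; a sketch that assumes structural alignment has already been performed (function name is illustrative):

```python
def tm_score(distances, l_true):
    """TM-score from aligned residue-pair distances (in Angstroms).

    `distances` holds d_i for the L_common aligned residue pairs;
    `l_true` is the length of the true (reference) query sequence.
    """
    d0 = 1.24 * (l_true - 15) ** (1.0 / 3.0) - 1.8  # normalization scale
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / l_true
```

A perfect superposition (all distances zero, every residue aligned) yields a TM-score of 1; scores decay toward 0 as residue pairs drift beyond d_0.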
Generation of intrinsically disordered regions (IDRs)

For input to EvoDiff models, the full sequence of an IDR-containing human protein was treated as the query sequence, and a corresponding MSA was constructed by subsampling 63 other sequences from all the query's orthologs. All sequences were subsampled to 512 residues in length, with the following criteria maintained: the subsampled query sequence had to contain at least 1 IDR, and the total IDR region had to be less than half the total length of the subsampled sequence (L_IDR ≤ 256). For IDR generation from EvoDiff-Seq, the query sequence with the IDR region masked was provided as the only input to EvoDiff-Seq, which then generated new residues for the masked region (i.e., the region corresponding to the true IDR). For IDR generation from EvoDiff-MSA, the query sequence with the IDR region masked, aligned to the rest of the MSA, was provided as input to EvoDiff-MSA, which then generated new residues for the masked region.
The resulting generations, containing putative IDRs, were input to DR-BERT, a protein language model fine-tuned for disorder prediction (47), to obtain per-residue disorder scores ranging from 0 to 1 (less to more disordered). A single-sequence IDR predictor (DR-BERT) was used in place of MSA-based IDR scoring methods because of an observed bias towards higher disorder scores with MSA-based methods; e.g., random uniform sampling of residues in the masked query positions still resulted in a prediction of disorder given the presence of the orthologs in the alignment. Disorder scores for true IDRs, generated IDRs, scrambled IDRs, and randomly generated IDRs were computed to evaluate the performance of DR-BERT predictions. The randomly-sampled baseline was constructed by randomly sampling amino acids over an IDR region; the scrambled baseline was constructed by shuffling the existing amino acids over an IDR region into a scrambled permutation. In all cases (true IDRs, generated IDRs, scrambled and random baselines), the entire protein sequence was input to DR-BERT for scoring. Since DR-BERT operates on single sequences, for putative IDRs generated by EvoDiff-MSA, the entire query sequence was input to DR-BERT with <GAP> tokens removed to obtain per-residue disorder scores. Lastly, a direct comparison between the original IDR and the generated putative IDR was conducted by calculating the % sequence similarity, i.e., the fraction of shared residues between the two IDR regions.
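The two baselines can be sketched as follows (function names are illustrative):

```python
import random

AA = "ACDEFGHIKLMNPQRSTVWY"  # 20 canonical amino acids

def random_idr(length, rng=random):
    """Random baseline: uniformly sample amino acids over an IDR region."""
    return "".join(rng.choice(AA) for _ in range(length))

def scrambled_idr(idr, rng=random):
    """Scrambled baseline: permute the true IDR's own residues,
    preserving its amino-acid composition."""
    chars = list(idr)
    rng.shuffle(chars)
    return "".join(chars)
```

The scrambled baseline controls for composition (same residues, no order), while the random baseline controls for neither.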

Motif scaffolding
To generate a scaffold with EvoDiff-Seq, a scaffold length between 50-100 residues (exclusive of the motif) was sampled uniformly; the motif was placed randomly within the length; and scaffold residues were generated from EvoDiff-Seq conditioned on the provided motif residues.
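This sampling-and-placement step can be sketched as follows; `#` stands in for the mask token, the motif string is hypothetical, and the function name is illustrative:

```python
import random

MASK = "#"  # scaffold positions left for the model to generate

def place_motif(motif, rng=random):
    """Sample a scaffold length of 50-100 residues (exclusive of the
    motif), place the motif at a uniformly random offset, and return a
    template whose masked positions would then be generated by the
    model conditioned on the fixed motif residues."""
    scaffold_len = rng.randint(50, 100)    # total scaffold residues
    offset = rng.randint(0, scaffold_len)  # residues before the motif
    return MASK * offset + motif + MASK * (scaffold_len - offset)

template = place_motif("GSHSMRY")  # hypothetical motif sequence
```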
In this approach, protein sequences generated by EvoDiff-Seq (between 45 and 194 residues in length) were on average longer than those inverse-folded from structures generated by RFdiffusion, which ranged from 30 to 152 residues in total length, inclusive of the motif.
For scaffolding with EvoDiff-MSA, MSAs for each sequence corresponding to the original PDB structure were generated using the tools from AlphaFold (73). OmegaFold was used to predict structures corresponding to sequences generated by EvoDiff. A generation was counted as 'successful' if its predicted structure had a pLDDT ≥ 70 and a motifRMSD ≤ 1.0 Å relative to the original motif crystal structure. Note that these success criteria are cutoffs proposed by structure-based models (10) and adopted here to facilitate comparison. The motifRMSD was computed as the RMSD between the alpha-carbons of the motif in the original crystal structure and the predicted structure for the scaffolded motif.

Figure S1: Perplexity as a function of corruption step for EvoDiff sequence models. Test-set perplexities at sampled intervals of the degree of corruption: the diffusion timestep for D3PM models, the fraction of masked residues for OADM and masked language models, and the fraction of evaluated sequence for LRAR models. Intervals reflect evenly spaced windows of 50 timesteps for D3PM models or 10% masking for masked models.

Figure S2: Perplexity as a function of corruption step for EvoDiff MSA models. Test-set MSA perplexities at sampled intervals of the degree of corruption: the diffusion timestep for D3PM models and the fraction of masked residues for OADM and ESM models. The test set evaluated for each model was sampled using the same sampling scheme assigned during training. Intervals reflect evenly spaced windows of 50 timesteps for D3PM models or 10% masking for masked models.

Figure S5: Per-residue pLDDT for representative proteins generated by EvoDiff-Seq. pLDDT scores computed from OmegaFold-predicted structures for individual residues in representative high-fidelity generations from EvoDiff-Seq (Fig. 2E). Points are colored by pLDDT (0-100, red to blue).

Figure S7: Coverage of sequence and functional space for generated distributions from 640M parameter EvoDiff sequence models and baselines (LR-AR 640M, EvoDiff-D3PM-Uniform 640M, EvoDiff-D3PM-BLOSUM 640M, CARP 640M, ESM-1b 650M, FoldingDiff, random). UMAP of ProtT5 embeddings, annotated with FPD, of natural sequences from the test set (grey, n=1,000) and of generated sequences from EvoDiff 640M parameter models and baselines (various colors, n=1,000). A visualization of sequences from the validation set (dark grey, n=1,000) is included for reference. The visualization for the 640M OADM model is excluded due to its inclusion in Fig. 3A.