Transferable deep generative modeling of intrinsically disordered protein conformations

Intrinsically disordered proteins have dynamic structures through which they play key biological roles. The elucidation of their conformational ensembles is a challenging problem requiring an integrated use of computational and experimental methods. Molecular simulations are a valuable computational strategy for constructing structural ensembles of disordered proteins but are highly resource-intensive. Recently, machine learning approaches based on deep generative models that learn from simulation data have emerged as an efficient alternative for generating structural ensembles. However, such methods currently suffer from limited transferability when modeling sequences and conformations absent in the training data. Here, we develop a novel generative model that achieves high levels of transferability for intrinsically disordered protein ensembles. The approach, named idpSAM, is a latent diffusion model based on transformer neural networks. It combines an autoencoder to learn a representation of protein geometry and a diffusion model to sample novel conformations in the encoded space. IdpSAM was trained on a large dataset of simulations of disordered protein regions performed with the ABSINTH implicit solvent model. Thanks to the expressiveness of its neural networks and its training stability, idpSAM faithfully captures 3D structural ensembles of test sequences with no similarity in the training set. Our study also demonstrates the potential for generating full conformational ensembles from datasets with limited sampling and underscores the importance of training set size for generalization. We believe that idpSAM represents significant progress in transferable protein ensemble modeling through machine learning.


S2 Text. MCMC simulation protocol.
Here we describe the protocol used to run MCMC simulations via CAMPARI [7] in this study, which was adapted from a previous study on simulating charged IDRs [8].
MCMC runs at 298 K
Hamiltonian: simulations were performed using the OPLS-AA/L force field and the ABSINTH implicit solvent model. The CAMPARI parameter file abs3.1_opls.prm was used. Cutoffs for Lennard-Jones and electrostatic interactions were set at 10 and 14 Å, respectively.
Peptide modeling: peptides were capped with an acetyl group at the N-terminus and an N-methylamide group at the C-terminus. Histidine sidechains were protonated only at the ε position, making them neutral.
System: peptides were placed inside a spherical droplet. The radius of the droplet was set to 70 Å for peptides with $L \le 34$ residues ("short" peptides) and 100 Å for peptides with $L \ge 35$ residues ("long" peptides). Na+ and Cl− ions were added to neutralize the net peptide charge and to represent a 125 mM salt solution (resulting in 108 excess ion pairs for 70 Å droplets and 315 pairs for 100 Å droplets).
Sampling: Metropolis MCMC simulations were performed in the NVT ensemble at 298 K. The degrees of freedom were the backbone φ, ψ and ω torsion angles, the sidechain χ torsion angles, and the rigid-body coordinates of peptide molecules and ions.
Monte Carlo move set: the move set was based on one employed previously [8].
Simulations: independent simulations were performed using randomly generated initial conformations. For "short" peptides, we performed $1 \times 10^6$ equilibration and $2.5 \times 10^7$ production steps. For "long" peptides, we performed $2 \times 10^6$ equilibration and $5 \times 10^7$ production steps.
Output: snapshots were saved every 5,000 steps during the production phase, resulting in 5,000 and 10,000 snapshots for a "short" and a "long" peptide simulation, respectively.
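The length-dependent settings above can be summarized as follows. This is an illustrative Python sketch, not part of the CAMPARI input files; all names are our own.

```python
# Minimal sketch of the length-dependent MCMC protocol settings described
# above. Illustrative only; not the authors' code.

def mcmc_settings(n_residues: int) -> dict:
    """Return droplet radius and step counts for a peptide of given length."""
    is_short = n_residues <= 34
    return {
        "droplet_radius_A": 70.0 if is_short else 100.0,
        "excess_ion_pairs": 108 if is_short else 315,   # ~125 mM NaCl
        "equilibration_steps": int(1e6) if is_short else int(2e6),
        "production_steps": int(2.5e7) if is_short else int(5e7),
        "save_interval": 5000,  # snapshots saved every 5,000 production steps
    }

settings = mcmc_settings(28)
# 2.5e7 / 5000 = 5,000 snapshots for a "short" peptide, as stated above.
n_snapshots = settings["production_steps"] // settings["save_interval"]
```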

Replica exchange (RE) simulations
For four test set peptides (nls, protan, protac and drk_sh3), we additionally performed thermal RE MCMC simulations as implemented in CAMPARI. The RE strategy was inspired by one used previously to simulate polyampholytic peptides [9]. The temperature schedule comprised at least 12 temperatures: [298 K, 302 K, 306 K, 310 K, 315 K, 320 K, 330 K, 340 K, 350 K, 360 K, 370 K, 380 K]. For all simulations of nls and drk_sh3 and a small portion of the protan and protac simulations, we added 4 temperatures: [390 K, 400 K, 410 K, 420 K]. The simulation setup (Hamiltonian, system preparation and MCMC sampling) was the same as described above for the simulations at a constant 298 K. Each thermal replica was initiated from a randomly generated initial conformation. For most RE simulations, we performed $5 \times 10^5$ equilibration and $1.25 \times 10^7$ production steps. In a minority of cases, we performed $1 \times 10^6$ equilibration and $2.5 \times 10^7$ production steps. Swaps between neighboring replicas were attempted every 2,000 steps. Snapshots were saved every 5,000 steps.
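For reference, the temperature ladder can be written out as follows (an illustrative sketch only):

```python
# Illustrative reconstruction of the RE temperature ladder described above.
BASE_LADDER_K = [298, 302, 306, 310, 315, 320, 330, 340, 350, 360, 370, 380]
EXTRA_HIGH_T_K = [390, 400, 410, 420]   # added for nls, drk_sh3 and some
                                        # protan/protac simulations
SWAP_INTERVAL = 2000                    # MC steps between swap attempts
SAVE_INTERVAL = 5000                    # MC steps between saved snapshots

def temperature_schedule(extended: bool) -> list:
    """Return the list of replica temperatures (in K)."""
    return BASE_LADDER_K + (EXTRA_HIGH_T_K if extended else [])
```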

S3 Text. Neural networks of idpSAM.
This section describes the neural networks of idpSAM, with pseudocode for some key operations. The PyTorch code for all networks can be found at: https://github.com/giacomo-janson/idpsam.

Encoder
The encoder network $E$ of SAM is based on a transformer architecture (S15 Fig A and B). It has 912,888 trainable parameters.
Input. The input of $E$ consists of: (i) two types of geometrical features extracted from the coordinates $\mathbf{x} \in \mathbb{R}^{L \times 3}$ of the Cα atoms of a peptide ($L$ is the number of its residues); (ii) the amino acid sequence tokens of the peptide $\mathbf{a} \in \mathbb{R}^{L \times 1}$. The first type of geometrical features consists of the full distance matrix $\mathbf{d} \in \mathbb{R}^{L \times L}$ calculated from $\mathbf{x}$. This is processed by a radial basis function (RBF) expansion as in SchNet [10,11] with 320 equally spaced Gaussians from 0.0 to 30.0 Å. The RBF expansion is embedded to a dimension of $d_e = 192$ by a multilayer perceptron (MLP) with a GELU activation [12]:

$\mathbf{h}_d = \mathrm{Linear}(\mathrm{GELU}(\mathrm{Linear}(\mathrm{RBF}(\mathbf{d}))))$

The second type of geometrical features is a sequence $\mathbf{t} \in \mathbb{R}^{L \times 3}$ storing α torsion angle values (see main text). The first two channels of $\mathbf{t}$ store the cosine and sine values of the $L - 3$ α angles of a peptide, with the first position and the last two positions zero-padded. The third channel contains a mask with values of 0 for the three zero-padded positions and 1 for the rest. These features are embedded to a dimension of $d_t = 128$ by an MLP:

$\mathbf{h}_t = \mathrm{Linear}(\mathrm{GELU}(\mathrm{Linear}(\mathbf{t})))$

The amino acid sequence $\mathbf{a}$ is processed via a learnable embedding layer [1,13] to yield an amino acid embedding $\mathbf{h}_a \in \mathbb{R}^{L \times d_a}$ (with $d_a = 32$).
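A minimal sketch of the RBF distance featurization and its MLP embedding is given below. Hyperparameters follow the text; the layer names and the Gaussian width are illustrative choices, not the repository code.

```python
import torch
import torch.nn as nn

class RBFDistanceEmbedding(nn.Module):
    """SchNet-style RBF expansion of the Ca-Ca distance matrix, followed by
    an MLP, as described above. Sketch only; not the idpsam implementation."""

    def __init__(self, n_rbf=320, d_min=0.0, d_max=30.0, d_e=192):
        super().__init__()
        centers = torch.linspace(d_min, d_max, n_rbf)  # 320 equally spaced Gaussians
        self.register_buffer("centers", centers)
        # Gaussian width tied to center spacing: one common choice (assumption).
        self.gamma = 1.0 / (centers[1] - centers[0]) ** 2
        self.mlp = nn.Sequential(nn.Linear(n_rbf, d_e), nn.GELU(), nn.Linear(d_e, d_e))

    def forward(self, xyz):
        d = torch.cdist(xyz, xyz)                      # (L, L) pairwise distances
        rbf = torch.exp(-self.gamma * (d.unsqueeze(-1) - self.centers) ** 2)
        return self.mlp(rbf)                           # (L, L, d_e) pair embedding

h_pair = RBFDistanceEmbedding()(torch.randn(20, 3) * 5.0)  # (20, 20, 192)
```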
Transformer blocks. The network has $N_b = 5$ transformer blocks [14] with self-attention with $n_h = 8$ heads, a "pre" layer normalization configuration [15], and an MLP with a hidden dimension of 256, a GELU activation and no dropout. The input of the first block is $\mathbf{h}^{[1]} = \mathbf{h}_t$. At each block $b$, a linear layer prepares the input of the block by concatenating the available node and pairwise features (see S15 Fig B). The concatenated pairwise features are projected via a linear layer to obtain bias terms $\mathbf{bias}^{[b]} \in \mathbb{R}^{L \times L \times n_h}$, which are then added to the logit values of the attention heads in the layer. In summary, the transformer block performs the following update:

$\mathbf{h}^{[b+1]} = \mathrm{TransformerBlock}(\mathbf{h}^{[b]}, \mathbf{bias}^{[b]})$

Output. The output of the last block is processed via:

$\mathbf{z} = \mathrm{LayerNorm}(\mathrm{Linear}(\mathbf{h}^{[N_b+1]}))$

where $\mathbf{z} \in \mathbb{R}^{L \times d_z}$ is the encoded representation of the Cα coordinates of the peptide, with $d_z = 16$. The layer normalization operation does not use learnable elementwise affine parameters. It is used to rescale the output of the encoder to help downstream applications with other neural networks.
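Returning to the block update, the bias-augmented attention can be sketched as follows. This is a minimal, self-contained illustration of adding per-head biases derived from pairwise features to the attention logits; the query/key/value projections here are stand-ins, not the repository's layers.

```python
import torch

def biased_self_attention(h, pair_bias, n_heads):
    """h: (L, d) node embeddings; pair_bias: (L, L, n_heads) biases derived
    from pairwise features. Sketch only (projections omitted for brevity)."""
    L, d = h.shape
    d_head = d // n_heads
    # Hypothetical per-head views; the real model uses learned nn.Linear layers.
    q = h.view(L, n_heads, d_head)
    k = h.view(L, n_heads, d_head)
    v = h.view(L, n_heads, d_head)
    logits = torch.einsum("ihd,jhd->hij", q, k) / d_head ** 0.5
    logits = logits + pair_bias.permute(2, 0, 1)  # add per-head pairwise biases
    attn = torch.softmax(logits, dim=-1)
    return torch.einsum("hij,jhd->ihd", attn, v).reshape(L, d)

out = biased_self_attention(torch.randn(30, 128), torch.randn(30, 30, 8), 8)
```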

Scaling of interatomic distances in the AE loss
The AE loss used in this study includes a term for the reconstruction of Cα-Cα interatomic distances. Before calculating the loss, a distance $d_{ij} = \|\mathbf{x}_i - \mathbf{x}_j\|$ between the Cα atoms of residues $i$ and $j$ with sequence separation $s = |i - j|$ is first scaled by:

$\mathrm{scale}(d_{ij}, s) = \frac{d_{ij} - \mu_s}{\sigma_s}$

where $\mu_s$ and $\sigma_s$ are the mean and standard deviation in the AE training set of Cα-Cα distances with sequence separation $s$. We found this standardization procedure to be critical for efficiently training the AE model, since the scale of distances between atom pairs with different $s$ values can vary greatly.
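A minimal sketch of this standardization, assuming precomputed per-separation statistics (names illustrative):

```python
import torch

def scale_distances(d, seps, mu, sigma):
    """Separation-dependent standardization of Ca-Ca distances as described
    above. d: distances for residue pairs; seps: integer sequence separations
    s = |i - j|; mu, sigma: per-separation mean/std precomputed on the AE
    training set (tensors indexed by s; illustrative names)."""
    return (d - mu[seps]) / sigma[seps]

# Toy usage: two pairs with separations 2 and 3 (placeholder statistics).
mu, sigma = torch.zeros(50), torch.ones(50)
scaled = scale_distances(torch.tensor([7.1, 9.8]), torch.tensor([2, 3]), mu, sigma)
```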

Decoder
The decoder $D$ of SAM is similar to the encoder (S15 Fig C). It has 1,161,347 trainable parameters. In describing its components, we employ notations similar to those used for the encoder, for reasons of economy and clarity.
Input. The input of the decoder is an encoding $\mathbf{z}$. This is first projected to a dimension of $d_h = 128$ by an MLP:

$\mathbf{h}_z = \mathrm{Linear}(\mathrm{GELU}(\mathrm{Linear}(\mathbf{z})))$

which is the input to the first transformer block, so $\mathbf{h}^{[1]} = \mathbf{h}_z$.
Transformer blocks. The network has a stack of $N_b = 5$ modified transformer blocks. They share the architecture and hyperparameters of the blocks in $E$, with only one exception: instead of using scaled dot-product attention [14] with 8 heads, they use $n_h = 32$ heads with an attention mechanism inspired by the kernel self-attention in the Timewarp model [16] and the AF2 invariant point attention [13]. More specifically, the input $\mathbf{h}^{[b]}$ of a block $b$ is first mapped to query and key 3D coordinates:

$\mathbf{q}^{(i)} = \mathrm{Linear}(\mathbf{h}^{[b]}), \quad \mathbf{k}^{(i)} = \mathrm{Linear}(\mathbf{h}^{[b]})$

where $i$ is the index of the head (we omit the block index in the notations for $\mathbf{q}$ and $\mathbf{k}$). Value vectors are instead obtained with the usual mechanism used in dot-product attention. The logit values $\mathbf{l}^{(i)} \in \mathbb{R}^{L \times L}$ of the attention map are calculated from the pairwise distances between the query and key coordinates, where $\gamma^{(i)}$ is a learnable scaling parameter and $\epsilon$ is a small real number for numerical stability. Similar to the encoder, the decoder uses a layer for creating 2D relative positional embeddings with $d_{2d} = 64$. These are linearly projected to obtain bias values $\mathbf{bias}^{[b]} \in \mathbb{R}^{L \times L \times n_h}$, which are summed to the logits to obtain the final attention maps. As in dot-product attention, the maps are then used to update $\mathbf{h}^{[b]}$ by matrix multiplication with the value vectors. We found the use of this self-attention mechanism to slightly improve the reconstruction accuracy of the decoder. We hypothesize that the improvement could be caused by better inductive biases for modeling Cartesian coordinates.
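Since part of the original formula was not recoverable from the text, the following sketch shows one plausible form of such a distance-based attention logit; the exact expression is in the idpsam repository.

```python
import torch

def kernel_attention_logits(q_xyz, k_xyz, gamma, eps=1e-6):
    """One plausible reading of the distance-based attention described above
    (assumption, not the authors' exact formula): per-head query and key 3D
    coordinates attend according to negative squared pairwise distances,
    scaled by a learnable parameter gamma, with eps for numerical stability.
    q_xyz, k_xyz: (n_heads, L, 3); gamma: (n_heads,)."""
    d2 = torch.cdist(q_xyz, k_xyz) ** 2           # (n_heads, L, L)
    return -d2 / (gamma.abs().view(-1, 1, 1) + eps)

logits = kernel_attention_logits(torch.randn(32, 25, 3), torch.randn(32, 25, 3),
                                 torch.ones(32))
```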
Output. The output of the final transformer block is ultimately mapped to a tensor $\hat{\mathbf{x}} \in \mathbb{R}^{L \times 3}$ by an MLP:

$\hat{\mathbf{x}} = \mathrm{Linear}(\mathrm{GELU}(\mathrm{Linear}(\mathbf{h}^{[N_b+1]})))$

which represents the reconstructed Cα coordinates of an encoding $\mathbf{z}$.

Noise prediction network
The noise prediction network $\epsilon_\theta$ of SAM is also based on a transformer architecture (S16 Fig). It has 15,161,296 trainable parameters. In describing its components, we again employ notations similar to those used for the other networks.
Input. The input of $\epsilon_\theta$ consists of: (i) an encoded peptide conformation $\mathbf{z}_t \in \mathbb{R}^{L \times d_z}$ perturbed at an integer-valued timestep $1 \le t \le T$; (ii) the timestep $t$; (iii) the amino acid sequence $\mathbf{a}$ of the peptide. The input encoding is first projected to the node hidden dimension $d_h = 256$ by a linear layer:

$\mathbf{h}_z = \mathrm{Linear}(\mathbf{z}_t)$

which is the input to the first transformer block, so $\mathbf{h}^{[1]} = \mathbf{h}_z$. The input timestep $t$ is processed via a sinusoidal embedding and an MLP with a SiLU activation to project it to a dimension of $d_{time} = 256$, and is finally tiled to a sequence of $L$ tokens. Similar to the encoder, $\epsilon_\theta$ uses a layer for creating amino acid embeddings $\mathbf{h}_a \in \mathbb{R}^{L \times d_a}$ (with $d_a = 32$).
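A common implementation of this timestep conditioning (sinusoidal embedding, SiLU MLP, tiling) is sketched below; the exact variant used by idpSAM is in the repository.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_timestep_embedding(t, dim=256, max_period=10000.0):
    """Standard sinusoidal embedding of a diffusion timestep (a common
    implementation; an assumption, not necessarily the exact idpsam one)."""
    half = dim // 2
    freqs = torch.exp(-math.log(max_period) * torch.arange(half) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)  # (B, dim)

# MLP with SiLU, then tiling to a sequence of L residue tokens.
mlp = nn.Sequential(nn.Linear(256, 256), nn.SiLU(), nn.Linear(256, 256))
t_emb = mlp(sinusoidal_timestep_embedding(torch.tensor([42])))    # (1, 256)
L = 30
h_time = t_emb.expand(L, -1)  # same timestep embedding for every token
```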
Transformer blocks. The network has $N_b = 16$ modified transformer blocks with self-attention with $n_h = 16$ heads, a "pre" layer normalization configuration, and an MLP with a hidden dimension of 512, a GELU activation and no dropout. The modification consists in the way the timestep and amino acid embeddings are injected. We use the adaLN-Zero (LN: layer normalization) mechanism from the latent Diffusion Transformer [17]. The scale ($\alpha$), shift ($\beta$) and gate ($g$) values for adaLN-Zero are obtained from the timestep and amino acid embeddings, and are used to normalize the embedded sequence in the transformer block (we omit the block index in their notations). We found the use of the adaLN-Zero mechanism highly beneficial for the performance of SAM (see the main text), which is consistent with Peebles and Xie [17]. The injection mechanism of the adaLN-Zero we implement is illustrated in S16 Fig B, and the gate and modulate operations are given below (the ⊙ operator indicates element-wise multiplication, and the addition with 1 involves broadcasting). Note that the layer normalization operations preceding the adaLN-Zero operations do not use learnable element-wise affine parameters. Similar to both AE networks, $\epsilon_\theta$ also uses a layer for creating 2D relative positional embeddings with $d_{2d} = 64$. Again, these are linearly projected to bias values $\mathbf{bias}^{[b]} \in \mathbb{R}^{L \times L \times n_h}$, which are summed to the logits to obtain the attention maps of a regular multi-head self-attention mechanism. The modified transformer block update can be summarized as:

$\mathbf{h}^{[b+1]} = \mathrm{LatentDiffusionTransformer}(\mathbf{h} = \mathbf{h}^{[b]}, \mathbf{h}_{time} = \mathbf{h}_{time}^{[b]}, \mathbf{h}_{aa} = \mathbf{h}_a^{[b]})$

To complete the update, the input of the first transformer block is injected at every block:

$\mathbf{h}^{[b+1]} \leftarrow \mathbf{h}^{[b+1]} + \mathrm{Linear}(\mathbf{h}^{[1]})$

We found the injection of the input embedding at every block to be beneficial for SAM performance.
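The original pseudocode for these two operations was not recoverable here; the following is a minimal PyTorch sketch following Peebles and Xie [17], with illustrative shapes.

```python
import torch

def modulate(h, shift, scale):
    """adaLN-Zero modulate operation [17]. h: (L, d) layer-normalized node
    embeddings; shift, scale: conditioning values regressed from the timestep
    and amino acid embeddings (shapes here are illustrative)."""
    return h * (1 + scale) + shift   # the addition with 1 broadcasts over h

def gate(h, g, block_out):
    """adaLN-Zero gate operation: the block output is scaled by g (initialized
    at zero, so each block starts as the identity) before the residual add."""
    return h + g * block_out
```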
Output. The output of the last block is processed via a final output projection to obtain $\hat{\boldsymbol{\epsilon}}$, the predicted noise.

Standardization of encodings
To numerically aid DDPM training, each of the $d_z$ channels of a training set encoding is normalized via a standard scaler using the mean and standard deviation of that channel in the dataset. At inference time, the encodings generated by the DDPM are first transformed back by the inverse of the standard scaling function. They are then used as input for the decoder. Note that the encodings produced by $E$ are already normalized via a layer normalization operation in the output module (see above), but we found this additional standardization procedure to slightly improve the training stability of the DDPM.
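A minimal sketch of this standardization and its inverse (toy shapes, illustrative names):

```python
import torch

def fit_standard_scaler(z_train):
    """Per-channel mean/std over a training set of encodings, (N, L, d_z)."""
    return z_train.mean(dim=(0, 1)), z_train.std(dim=(0, 1))

def standardize(z, mean, std):
    return (z - mean) / std        # applied before DDPM training

def unstandardize(z, mean, std):
    return z * std + mean          # applied to DDPM samples before decoding

mean, std = fit_standard_scaler(torch.randn(100, 24, 16))  # toy data, d_z = 16
```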

S4 Text. Evaluation metrics.
For comparing a proposed structural ensemble (e.g., from SAM) of a peptide with a reference MCMC ensemble of the same peptide, we used different evaluation scores. In this work, they are calculated using ensembles with 10,000 conformations each.

MSE_c
The Cα-Cα contact probabilities for a peptide of length $L$ were evaluated with the following mean squared error (MSE) score:

$\mathrm{MSE\_c} = \frac{1}{N_p} \sum_{j > i+1} \left( c_{ij} - \hat{c}_{ij} \right)^2$

where $N_p = (L-1)(L-2)/2$ is the number of residue pairs with sequence separation > 1, and $c_{ij}$ and $\hat{c}_{ij}$ are the contact frequencies for residues $i$ and $j$ in the reference and generated ensembles, respectively. To avoid frequencies equal to zero, we use a pseudo-count value of 0.01. Contacts are defined with a Cα distance threshold of 8.0 Å.
In contrast to a similar score we used previously, this score only considers pairs of residues with a sequence separation of at least 2, since there is very little variability in the distance between two adjacent Cα atoms and predicting their contact frequencies would be trivial. This modification is also included in the remaining scores that consider pairs of residues (MSE_d, aKLD_d and aJSD_d).
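A sketch of the MSE_c computation under these definitions (NumPy; how the pseudo-count enters the score follows the authors' implementation, so the clipping below is an assumption):

```python
import numpy as np

def contact_frequencies(dist, cutoff=8.0, pseudocount=0.01):
    """Per-pair Ca-Ca contact frequencies over an ensemble of distance
    matrices, dist: (n_conf, L, L)."""
    freq = (dist < cutoff).mean(axis=0)
    return np.clip(freq, pseudocount, None)  # pseudo-count handling: assumed

def mse_c(dist_ref, dist_gen, min_sep=2):
    """MSE between contact frequencies over pairs with sequence separation
    >= min_sep; np.triu_indices(L, k=2) yields exactly (L-1)(L-2)/2 pairs."""
    c_ref = contact_frequencies(dist_ref)
    c_gen = contact_frequencies(dist_gen)
    i, j = np.triu_indices(c_ref.shape[0], k=min_sep)
    return np.mean((c_ref[i, j] - c_gen[i, j]) ** 2)

score = mse_c(np.random.rand(100, 20, 20) * 20, np.random.rand(100, 20, 20) * 20)
```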

MSE_d
To evaluate average Cα-Cα interatomic distances, we use the following MSE score:

$\mathrm{MSE\_d} = \frac{1}{N_p} \sum_{j > i+1} \left( \bar{d}_{ij} - \hat{\bar{d}}_{ij} \right)^2$

where $\bar{d}_{ij}$ and $\hat{\bar{d}}_{ij}$ are the average distances between the Cα atoms of residues $i$ and $j$ in the reference and generated ensembles.

KLD approximations
To compare mono-dimensional distributions of continuous features, we employ an approximation of the Kullback-Leibler divergence (KLD) obtained by binning the values of the features, similarly to previous work [3]. We first take the minimum and maximum values of a feature over the reference and generated ensembles and split the range into $N_B = 50$ equally spaced bins. We then approximate the KLD as:

$\mathrm{KLD} \approx \sum_{b=1}^{N_B} q_b \ln \frac{q_b}{p_b}$

where $b$ is the index of a bin and $p_b$ and $q_b$ are the frequencies calculated in bin $b$ (using a pseudo-count value of 0.001) for the generated and reference ensembles, respectively.
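A sketch of this binned approximation (NumPy; the direction of the divergence and the exact pseudo-count handling follow the authors' implementation, so both are assumptions here):

```python
import numpy as np

def approx_kld(x_gen, x_ref, n_bins=50, pseudocount=0.001):
    """Binned KLD approximation: histogram both samples on a shared range and
    compare bin frequencies. Frequencies are not renormalized after adding the
    pseudo-count in this sketch."""
    lo = min(x_gen.min(), x_ref.min())
    hi = max(x_gen.max(), x_ref.max())
    bins = np.linspace(lo, hi, n_bins + 1)
    p = np.histogram(x_gen, bins=bins)[0] / len(x_gen) + pseudocount
    q = np.histogram(x_ref, bins=bins)[0] / len(x_ref) + pseudocount
    return np.sum(q * np.log(q / p))   # KLD(reference || generated)
```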

aKLD_d
To compare pairwise Cα-Cα interatomic distance distributions, we use the aKLD_d score, which is computed as:

$\mathrm{aKLD\_d} = \frac{1}{N_p} \sum_{j > i+1} \mathrm{KLD}\!\left(p_{ij} \,\|\, \hat{p}_{ij}\right)$

where $p_{ij}$ and $\hat{p}_{ij}$ are the distributions of Cα-Cα distances between residues $i$ and $j$ in the reference and generated ensembles.

aKLD_t
To compare distributions of α torsion angles, we use the aKLD_t score, which is computed as:

$\mathrm{aKLD\_t} = \frac{1}{N_t} \sum_{i=1}^{N_t} \mathrm{KLD}\!\left(p_{i,i+1,i+2,i+3} \,\|\, \hat{p}_{i,i+1,i+2,i+3}\right)$

where $N_t = L - 3$ is the number of α angles in a peptide, and $p_{i,i+1,i+2,i+3}$ and $\hat{p}_{i,i+1,i+2,i+3}$ are the distributions of the α angle defined by residue $i$ and its next three residues in the reference and generated ensembles.

KLD_r
To compare Cα radius-of-gyration distributions, we use the KLD_r score:

$\mathrm{KLD\_r} = \mathrm{KLD}\!\left(p \,\|\, \hat{p}\right)$

where $p$ and $\hat{p}$ are the radius-of-gyration distributions of the reference and generated ensembles.

aJSD_d and aJSD_t
The KLD values used in the scores above are asymmetric, since swapping the reference and proposed distributions would result in a different KLD value. Therefore, KLD-based scores assume that one of the distributions is a reference. To compare pairs of distributions $p$ and $q$ in which we do not assume either to be a reference, we use the symmetric Jensen-Shannon (JS) divergence, which we compute via the KLD approximation above:

$\mathrm{JSD}(p, q) = \frac{1}{2}\mathrm{KLD}(p \,\|\, m) + \frac{1}{2}\mathrm{KLD}(q \,\|\, m)$

where $m$ is a mixture distribution for which we compute bin frequencies as:

$m_b = \frac{1}{2}\left(p_b + q_b\right)$

The aJSD_d and aJSD_t scores that we use to compare some ensembles in this work are calculated similarly to the aKLD_d and aKLD_t scores defined above, but instead use this JSD approximation.
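A sketch of this JSD approximation, reusing the binning scheme above:

```python
import numpy as np

def approx_jsd(x_a, x_b, n_bins=50, pseudocount=0.001):
    """Symmetric JS divergence from binned frequencies via the mixture m,
    as used by the aJSD_d and aJSD_t scores (sketch only)."""
    lo, hi = min(x_a.min(), x_b.min()), max(x_a.max(), x_b.max())
    bins = np.linspace(lo, hi, n_bins + 1)
    p = np.histogram(x_a, bins=bins)[0] / len(x_a) + pseudocount
    q = np.histogram(x_b, bins=bins)[0] / len(x_b) + pseudocount
    m = 0.5 * (p + q)                       # mixture distribution
    kld = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * kld(p, m) + 0.5 * kld(q, m)
```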

S4 Fig. Evaluating SAM and idpGAN. (A) Each subplot compares the evaluation scores of the two models for the 22 test set peptides. The SAM version evaluated here was trained with the same dataset as idpGAN. (B) Similar comparisons, but the SAM version evaluated here was trained with the entire training set presented in this study.

S5 Fig. Modeling the ensemble of Q2KXY0 with SAM and idpGAN. (A) PCA histograms for the structural ensembles of idpGAN, SAM-o and SAM-e. Here, SAM-o is a model trained with the original training set of idpGAN (1,089 IDRs), while SAM-e was trained with the expanded training set of this study (3,259 peptides) and is the model discussed in most of the main text. Frequency values in the colorbars are multiplied by $1 \times 10^3$. (B) Cα-Cα contact maps of the three methods. The MSE_c scores of the ensembles are reported in brackets. (C) Average Cα-Cα distance maps of the three methods, with their MSE_d scores reported in brackets. (D) Cα Rg histograms of the three methods, with KLD_r values in brackets.

S6 Fig. Cα-Cα distances and α torsion angle distributions in CGRP variants. (A) Histograms of the five Cα-Cα distance distributions with the highest JS divergence between the wild-type and mutant ensembles from MCMC simulations. Residue indices of the Cα atoms are reported on top of the subplots. (B) Histograms of the five α angle distributions with the highest JS divergence between the wild-type and mutant ensembles from MCMC. Residue indices of the Cα atoms defining the torsion angles are reported on top.

S7 Fig. Cα-Cα distance distributions in ak37 ensembles. The plots show data for the Cα-Cα distances that exhibit the highest absolute loading values for 10 of the top principal components (PCs) identified in a PCA on the MCMC ensemble (refer to the main text for PCA details). The distances associated with PCs 5 and 11 are not shown since they coincide with the one shown for PC 1. SAM-ak was trained on the full training set of SAM plus additional simulation data from three peptides with sequences similar to ak37.

S8 Fig. Properties of the 3,259 training and 22 test sequences. Left panel: lengths of the peptides. Central panel: net charge per residue. Right panel: average helicity in an ensemble of 10,000 conformations from MCMC simulations. Helicity is defined as the fraction of residues in an all-atom peptide conformation found in a helical state according to the DSSP algorithm [18]. The average helicity value of the ak37 peptide is highlighted in the plot.

S9 Fig. Overfitting on the structural ensemble of ak37. (A) to (E) Ensemble of ak37 modeled by a SAM version trained only on ak37 data. See Fig 2 in the main text for more details. (F) Cα-Cα distance distributions of the same ensemble. See S7 Fig for details.

S10 Fig. MCMC ensembles for the ak synthetic peptides. The ensembles consist of conformations randomly extracted from MCMC simulations. (A) PCA histograms for the four peptides. PCA was performed on each peptide independently and each row uses its own principal axes. (B) Cα-Cα contact maps. (C) Average Cα-Cα distances. (D) Cα Rg histograms. (E) α torsion angle histograms.

S11 Fig. Modeling of the DP03125r003 peptide (L = 15) with SAM and different levels of MCMC sampling. Refer to Fig 5 in the main text for a description of the panels. The ensemble from only 5 MCMC runs closely approximates the one from extensive sampling consisting of 73 MCMC runs (S2 Table).

S12 Fig. Modeling of the nls (L = 46), protac (L = 55) and protan (L = 55) peptides with SAM and different levels of MCMC sampling. Refer to Fig 5 in the main text for a description of the panels. (A) to (C): data for nls. (D) to (F): data for protac. (G) to (I): data for protan.

S13 Fig. Evaluating SAM and the MCMC-5 sampling strategy. Each subplot compares the evaluation scores of the two strategies for the 22 test set peptides. Markers are colored according to the length of the corresponding peptide.

S14 Fig. Validation losses observed during DDPM training for different idpSAM versions. All the models with the "SAM" label have the same architecture for the noise prediction network but different numbers of transformer blocks in it. The SAM model with 20 blocks corresponds to the SAM-G model discussed in the main text. The SAM-E model has a different architecture: it has 16 blocks and incorporates 4 FrameDiff edge update operations. For the SAM models with 4 and 16 blocks, there is a large difference in validation loss at the end of training, which translates into a large difference in ensemble modeling performance (S4 Table). SAM-E has a slightly better validation loss with respect to SAM versions with comparable numbers of layers, but it does not seem to significantly improve modeling performance (Table 1). For each model, we show validation loss curves from 5 different training runs.

S15 Fig. Encoder and decoder networks of SAM. (A) Outline of the encoder network. The elements of the network are colored in blue; the tensors being processed by it are colored in gray. The tensor r represents the numerical residue indices of the input peptide, which are used to generate a 2D relative positional embedding h2dpos. Embed: embedding layer. MLP: multilayer perceptron. RBF: radial basis function for embedding interatomic distances. LN: layer normalization. (B) Preparation of the input of a transformer block in the encoder. (C) Outline of the decoder network.

S16 Fig. Noise prediction network of SAM. (A) Outline of the entire noise prediction network. SinEmb: sinusoidal embedding. (B) Illustration of a transformer block of the network. FC: fully-connected module consisting of an activation and a linear layer. LN: layer normalization (colored in white if it does not have learnable elementwise affine parameters). Mod.: modulate operation of adaLN-Zero. Gate: gate operation of adaLN-Zero.

S2 Table. Properties of the 22 test set peptides.
a Positively charged residues are in blue, negatively charged in red.
b Number of residues in a peptide.
c Net charge per residue of a peptide.
d UniProt accession number of the sequence. Synthetic peptides have an empty value. Bold font corresponds to sequences covering their entire UniProt entry.
e Number of MCMC runs at 298 K.
f Number of replica exchange (RE) MCMC runs.
* Full sequence of the biologically-active peptide.

S4 Table. Evaluation scores of ensembles from modified versions of SAM.
Average scores are reported along with standard errors for the 22 test set peptides. Unless specified, all models were trained with the full training set of the default SAM version. Unless specified, all models use the same AE, with only the DDPM being re-trained.
a Default idpSAM version.
b IdpSAM version with a DDPM trained with the full training set and three ak synthetic peptides.
c IdpSAM version with 4 transformer blocks in the noise prediction network of the DDPM.
d A "linear with warm up" learning rate schedule is not used to train the DDPM.
e The adaLN-Zero mechanism is not used in the transformer blocks of the noise prediction network. Instead, timestep and amino acid embeddings are projected to the same dimension as the node embeddings and added to them before the first layer normalization operation of the block.
f The initial node embedding is not injected at every block of the noise prediction network (S16 Fig).
g IdpSAM version with encoding dimension $d_z = 4$. Both the AE and DDPM were re-trained.
* Asterisks denote a statistically significant difference (using a Wilcoxon signed-rank test with a significance level of 0.05) between the scores of a strategy and of the default idpSAM version.

S5 Table. Properties of the 25 validation set peptides.
For all sequences, 3 MCMC runs were performed at 298 K.
a UniProt accession number of the sequence, with the starting and ending residue positions in the full sequence shown in parentheses.
b Positively charged residues are in blue, negatively charged residues in red.
c Number of residues.

S6 Table. Details of SAM neural network training.
a Number of different peptides used in the training set.
b Number of MCMC snapshots extracted from the simulation data of a single peptide during a training epoch.
c With the exception of the initial learning rate, we kept all other Adam parameters at their default PyTorch values.
d On a single NVIDIA RTX 2080 Ti GPU, using PyTorch 1.11.0 with automatic mixed precision training.