Context-Aware Generative Models for Multi-Domain Proteins using Transformers

Being able to artificially design novel proteins of desired function is pivotal in many biological and biomedical applications. Generative statistical modeling has recently emerged as a new paradigm for designing amino acid sequences, including in particular models and embedding methods borrowed from Natural Language Processing (NLP). However, most approaches target single proteins or protein domains, and do not take into account any functional specificity or interaction with the context. To extend beyond current computational strategies, we develop a method for generating protein domain sequences intended to interact with another protein domain. Using data from natural multi-domain proteins, we cast the problem as a translation problem from a given interactor domain to the new domain to be generated, i.e. we generate artificial partner sequences conditional on an input sequence. Evaluating our model's quality using diverse metrics, in part related to distinct biological questions, we show that our method outperforms state-of-the-art shallow auto-regressive strategies.


Introduction
Generating novel protein sequences with desired properties is one of the key challenges of computational biology. Machine learning methods are likely to play an important role in this task and are already used for the generation of new enzymes, biological sensors and drug molecules [1]. A promising approach is to leverage deep generative models, which use neural networks for learning probability distributions from known, naturally occurring protein sequences [2,3,4,5,6]. Apart from other uses, like the prediction of mutational effects [7,8], these models can be used for protein design by selecting high-probability sequences (possibly under constraints) from the learned distribution.
Naturally occurring protein sequences are often composed of several domains, and domains can be classified into different families [9]. Models that work on the domain level usually use as training data a single multiple sequence alignment (MSA) [10], containing aligned sequences from the same domain family, and make the assumption that each sequence is constrained by the same fitness landscape. This modeling paradigm neglects the dependence of the sequence constraints on the specific context corresponding to each organism, including other proteins interacting with the sequence or other domains on the same protein. Together with the fact that most of the crystallographic structures deposited in the PDB database [11] are resolved only at the single-domain level [12], this poses interesting questions about the limitations of current approaches, for example when predicting the relative orientation of multi-domain proteins [1]. Another field where this issue arises is immunology, where monoclonal antibody experiments are typically performed on mouse models and only later tested in humans. This is related to the so-called humanization problem, i.e. how to graft a promising variable receptor region (CDR) from a murine to a human context [13].
Known families of interacting domains can be organized in a paired multiple sequence alignment (pMSA), where the aligned interaction partners are concatenated [14]. Given the evolutionary pressure for maintaining functional interactions between proteins, amino acid substitutions at interaction surfaces are not independent between the interaction partners. The interacting sequence therefore can be used as additional information when generating a novel sequence.
The current work addresses the task of generating domain sequences given an interacting domain sequence. Given that this task is similar to translation tasks in natural language processing (NLP), we explore the use of Transformers in this context. While there is some recent work using Transformers for translating between protein sequences [15] for specific applications, there is, to the best of our knowledge, no systematic exploration of this idea on the level of protein domain families on a diverse dataset. We explore different architectural choices, regularization schemes and compare our results with a recently published shallow auto-regressive method [16], which we use as a baseline.

Related Literature
Generative modelling for protein design has a wide range of applications and a considerable number of different models have been proposed in the literature, recently in particular deep neural network models [1]. These include autoregressive models based on convolutional architectures [2], generative adversarial networks [3], variational autoencoders [4], LSTM-based architectures [5] and self-attention based architectures [6]. The latter work allows for sequence generation conditioned on tags corresponding to molecular function or taxonomic information.
Transformer-based architectures [17], which we use in the present work for sequence-to-sequence prediction, have also been used, for example, for creating generic embeddings trained on almost all known protein sequences [18], the prediction of mutational effects [19], protein interaction prediction and protein family classification [20], MSA-based language modelling [21], protein contact prediction [22], inverse folding [23,24] and have been at the core of recent breakthroughs in protein structure prediction [25].
Recently, specific tasks have been cast as sequence-to-sequence translation problems using Transformers, similarly to our approach. This includes for example the generation of drug molecules given a protein sequence the molecule should interact with [26] and the generation of short signal-peptides guiding the secretion of industrial enzymes, given the amino acid sequence of the enzymes [15].
Finally, non-neural network models borrowed from statistical mechanics have been extensively used in the context of sequence generation, for example Generalized Potts Models, a particular form of Markov Random Field [27]. This type of model can be used for generating sequences using MCMC strategies, albeit with a significant computational cost. Relevant approximation strategies are for example the recently introduced auto-regressive (shallow) variants [16], which show a similar performance to Potts models but are computationally more efficient.

Dataset
Our data consists of 27 pMSAs containing domain sequence pairs that are part of the same multi-domain proteins, taken from [14]. The dataset contains only domain pairs which form a structural contact in at least one resolved PDB structure, making it likely that the two domains co-evolve in order to maintain compatibility. Each dataset consists of $M$ rows corresponding to $M$ sequence pairs, where $M$ depends on the dataset and ranges from a few hundred to more than 15,000, see Appendix Sec. A.1 for a summary of the datasets used. The sequences are already aligned using standard bioinformatics tools [28], which means that sequences belonging to the same domain family have the same length. Each row $l$ in a dataset represents a pair of domain sequences $B^l$ and $A^l$, which are part of the same protein. Every sequence consists of symbols denoting either one of 20 amino acids or an alignment gap, making the total size of the vocabulary equal to 21.
The first sequence $B^l = (b^l_1, \ldots, b^l_{N_{\mathrm{in}}})$ is called the source or input sequence and we denote its length by $N_{\mathrm{in}}$. It is used as an input to predict the second sequence, called the target or output sequence, $A^l = (a^l_1, \ldots, a^l_{N_{\mathrm{out}}})$, which is of length $N_{\mathrm{out}}$. All source sequences $\{B^l\}_{l=1}^{M}$ in the pMSA are members of the same domain family, and all target sequences $\{A^l\}_{l=1}^{M}$ in the pMSA are members of the same domain family. Each dataset was randomly split into a training set (70%) and a validation set (15%). The last 15% were kept as a testing set in order to be able to optimize hyperparameters for every domain, but we did not use it in the experiments shown in this work. Since sequences in an MSA are related to each other due to phylogeny, the validation set might contain sequences that are nearly identical to some sequences in the training set. We therefore further divided the validation set into two parts, one close to the training set and one far from it. This allows us to control for the effects of phylogeny on the performance metrics. This second splitting was made based on the median of the Hamming distance from the training set. The details for this subpartition are found in the Appendix Sec. A.1.
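As a concrete illustration, the distance-based split described above can be sketched as follows. This is a minimal pure-Python/NumPy sketch under the assumptions stated in the text (equal-length aligned sequences); the function names are illustrative, not from the paper's code.

```python
import numpy as np

def hamming(a: str, b: str) -> int:
    """Number of positions at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def distance_to_training(seq: str, train: list) -> int:
    """Smallest Hamming distance from `seq` to any training sequence."""
    return min(hamming(seq, t) for t in train)

def split_validation(val: list, train: list):
    """Split `val` into (close, far) parts around the median distance
    to the training set, as used for the phylogeny-controlled metrics."""
    dists = np.array([distance_to_training(s, train) for s in val])
    median = np.median(dists)
    close = [s for s, d in zip(val, dists) if d <= median]
    far = [s for s, d in zip(val, dists) if d > median]
    return close, far
```

In practice one would vectorize the Hamming distance over integer-encoded MSAs, but the logic is the same.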

Performance Metrics
We explored a series of metrics for model quality and performance. Given the interdisciplinary nature of this work, we used metrics both from NLP and computational biology. Using these metrics, we compare the performance of the Transformer models we propose in this work with the recently proposed arDCA [16], which we use as a baseline.

Log-Likelihood and Perplexity
An interesting property of autoregressive models, such as the Transformer or arDCA, is that they define a tractable probability distribution over the space of sequences. Contrary to, e.g., Potts models and other energy-based models, we do not have to evaluate a global normalizing constant over the complete space of possible sequences. We can therefore calculate the log-likelihood of a sequence $A$ given $B$ as

$\log P(A \mid B) = \sum_{i=1}^{N_{\mathrm{out}}} \log P(a_i \mid B, a_1, \ldots, a_{i-1}).$

This is related to the cross-entropy, which we use as a loss during training, averaged over batches. For assessing one aspect of the quality of our models, we use the closely related perplexity $\mathrm{PP}(A, B)$, which is a common quality metric for protein language models [29] and can be defined as

$\mathrm{PP}(A, B) = \exp\left(-\frac{1}{N_{\mathrm{out}}} \log P(A \mid B)\right).$

Below we show averages of the perplexity over the training and validation sets and use the notation $\mathrm{PP}_{\mathrm{train}}$ and $\mathrm{PP}_{\mathrm{val}}$ for these.
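Assuming the per-position conditional probabilities $P(a_i \mid B, a_{<i})$ are available as an array of shape (sequence length, 21), the log-likelihood and perplexity can be computed as in this short sketch. The perplexity follows the standard definition as the exponential of the per-position cross-entropy; function names are illustrative.

```python
import numpy as np

def sequence_log_likelihood(cond_probs: np.ndarray, target: np.ndarray) -> float:
    """log P(A|B) = sum_i log P(a_i | B, a_<i).
    cond_probs: (N_out, vocab) conditionals; target: (N_out,) symbol indices."""
    return float(np.sum(np.log(cond_probs[np.arange(len(target)), target])))

def perplexity(cond_probs: np.ndarray, target: np.ndarray) -> float:
    """Exponential of the per-position cross-entropy; 1 means a perfect model,
    the vocabulary size corresponds to a uniform (uninformative) model."""
    n = len(target)
    return float(np.exp(-sequence_log_likelihood(cond_probs, target) / n))
```

For a uniform model over the 21 symbols, the perplexity is exactly 21, which gives a useful reference point when reading the plots.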

Accuracy
While we use the perplexity as one metric for the quality of our model, it is not always easy to interpret: a high perplexity can result from a single wrong prediction made with a high level of confidence. We therefore also use the accuracy $\mathcal{A}(A, B)$ for assessing our models. This measure takes the same input as the cross-entropy (the conditional probability for every position) and counts the fraction of positions where the true amino acid is the one with the highest probability:

$\mathcal{A}(A, B) = \frac{1}{N_{\mathrm{out}}} \sum_{i=1}^{N_{\mathrm{out}}} \mathbb{I}\left[a_i = \arg\max_{v \in V} P(v \mid B, a_1, \ldots, a_{i-1})\right],$

where $V$ is the alphabet of symbols and $\mathbb{I}$ is an indicator function that is 1 if its argument is true, and 0 otherwise. We define $\mathcal{A}_{\mathrm{train}}$ and $\mathcal{A}_{\mathrm{val}}$ as the average of the accuracy on the training and validation set.
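The accuracy takes the same per-position conditionals as the perplexity; a minimal sketch (illustrative names, same array conventions as above):

```python
import numpy as np

def accuracy(cond_probs: np.ndarray, target: np.ndarray) -> float:
    """Fraction of positions where argmax_v P(v | B, a_<i) equals the true a_i.
    cond_probs: (N_out, vocab) conditionals; target: (N_out,) symbol indices."""
    return float(np.mean(np.argmax(cond_probs, axis=1) == target))
```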

Matching Specificity
We expect the interaction between two domains to affect the probability distribution of the target sequence only marginally, with much of the variability in the distribution being explainable by constraints internal to the target sequence. As a consequence, a good performance in the quality measures defined above might be due to the decoder being a good language model of the target protein, possibly ignoring the input sequence altogether. We therefore also evaluate the specificity of the predicted target sequence given the source sequence.
Specificity is also related to the task of matching pairs of protein sequences, which is an active area of research in bioinformatics [30,31,32]. We implement this task by separating the source and target sequences in the validation pMSA, resulting in two separate MSAs with the same number of rows, one containing the source sequences and one the target sequences. We then shuffle the rows in the target MSA randomly and attempt to use our models to find the permutation of the target sequences that restores the original order. In order to create a matching based on a model, we calculate the log-likelihood of every combination of source and target sequences in the shuffled validation set and create a matching between source and target sequences using the Hungarian algorithm [33].
We then use the ratio of correctly matched pairs as an additional metric for the performance of our model, formally defining it as

$R_{\mathrm{matched}} = \frac{1}{M_{\mathrm{val}}} \sum_{l=1}^{M_{\mathrm{val}}} \mathbb{I}\left[\hat{\pi}(l) = l\right],$

where $M_{\mathrm{val}}$ is the size of the validation set and $\hat{\pi}$ is the permutation returned by the matching. Note that the difficulty of this task increases with the size of the validation set, since the expected ratio of correctly matched pairs using a random matching is $1/M_{\mathrm{val}}$.
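The matching step can be sketched with SciPy's implementation of the Hungarian algorithm (`scipy.optimize.linear_sum_assignment`); whether the paper used this particular implementation is our assumption. The log-likelihood matrix `ll` is assumed to have been precomputed by the model for every source/target combination.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def matched_ratio(ll: np.ndarray, true_permutation: np.ndarray) -> float:
    """ll[i, j] holds log P(A_j | B_i); true_permutation[i] is the index of
    the true partner of source i in the shuffled target MSA. We maximize the
    total log-likelihood over one-to-one assignments (Hungarian algorithm)
    and return the fraction of correctly recovered pairs."""
    rows, cols = linear_sum_assignment(-ll)  # negate: the solver minimizes cost
    return float(np.mean(cols == true_permutation))
```

Note that computing `ll` is the expensive part: it requires $M_{\mathrm{val}}^2$ forward passes, one per source/target combination.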

Transformer and arDCA
We used two Transformer models with different sizes, calling one the shallow and one the large model. The shallow Transformer consists of 2 layers with a single attention head, has an embedding dimension of $d_{\mathrm{model}} = 55$ and a forward dimension of $d_{\mathrm{ff}} = 2048$. The large Transformer consists of 3 layers with a single attention head, has an embedding dimension of $d_{\mathrm{model}} = 2048$ and the same forward dimension as the shallow Transformer. Further details on their architectures can be found in Appendix Sec. A.2. We compare their performance to the recently introduced shallow auto-regressive model called arDCA [16]. Details on this method can be found in Appendix Sec. A.3. For the datasets used in this work, the training time of the Transformer models ranges from less than an hour to about 1.5 days for the large Transformer on the largest dataset. Training was done using a single Nvidia V100 GPU. When using entropic regularization, which we will introduce in a later section, the training time increases significantly. We provide a table with training times in Appendix Sec. A.14.

Performance Gain From Context Sequence
We first tested whether the input sequences had any effect on the perplexity of the target sequence. As already mentioned in Sec. 3.2.3, this is not self-evident, since the Transformer decoder itself could be a good model for the target sequence distribution without taking the input into account. We therefore trained two shallow Transformer models, one with the normal training set and one where we randomly shuffled the pairing between input and output sequences. We then evaluated the models on the normal validation sets, without shuffling. We expect that if the model trained on the normal training set exploits the information in the inputs when predicting the output, it should show a considerably lower perplexity than the model trained on a shuffled dataset.

Figure 1: Performance increase when taking the domain sequence in context into account. We plot the perplexity $\mathrm{PP}_{\mathrm{val}}$ for target sequences on the validation set, once for a model trained with the true pairings (Paired) and once for a model trained with shuffled pairs in the training set (Shuffled).
We show the results of these experiments in Fig. 1. As can be seen, the models trained on the normal dataset have a significantly lower perplexity than the models trained on a shuffled dataset. This corroborates and quantifies the idea that domain sequences that appear in the context of a second domain contain information that can be used for modelling the constraints on the sequence of the second domain. We note that the difference in the logarithm of the perplexity, which is equivalent to the cross-entropy, can be seen as a rough estimate of the mutual information between the output and the input, as detailed in Appendix Sec. A.8.

Results on Performance Metrics of the Shallow Transformer
We next compared the shallow Transformer models to the arDCA baseline. Above a certain training set size, shallow Transformers outperform arDCA by a large margin on nearly all datasets in terms of perplexity, accuracy and matching, as can be seen in Fig. 2. We note here that the Transformer models (both shallow and large) have fewer parameters than arDCA for every family size we tried, see Fig. 3. The number of parameters in the Transformer models is independent of the length of the input and target sequences, while the number of parameters in the arDCA models scales quadratically with the concatenated input length.
The best performance is achieved for the families with the largest training sets, indicating that the performance of the Transformer might further increase with increasing training set size. We repeated the calculation of the matching performance on subsampled versions of the validation set in order to estimate the dependence on the size of the validation set, see Appendix Sec. A.10.

Model Analysis
In this section, we show further results on the performance of the shallow Transformer on the family pair PF013171-PF14226. The results on the other pairs can be found in Appendix Sec. A.7.2 and in Appendix Sec. A.6.
We show the perplexity on target sequences in the validation set as a function of their distance from the training set in the right panel of Fig. 4, where the distance of a sequence to the training set is the smallest Hamming distance from the sequence to any training sequence. Interestingly, the performance advantage of Transformer models over arDCA seems to come mostly from sequences far away from the training set, indicating that Transformers generalize better in regions of sequence space far from the training set. While Transformers outperform arDCA in matching on sequences both close to and far from the training set, the distance from the training set does not seem to systematically influence the performance difference.
Next, we tested whether the target sequence distribution of trained Transformers integrates structural information. It is well known that co-evolution between pairs of positions in protein sequences, visible as pairs of columns with high covariance in an MSA, is a hallmark of structural contacts [34,35]. Methods like plmDCA [36] use this fact for extracting structural contacts from MSAs, an approach that can also be used for predicting contacts between different domains [37]. We therefore tested whether we can use the same method for extracting structural contacts when replacing the natural target sequences in the dataset with artificial sequences sampled from the Transformer. To this end, we sampled 8 target sequences for each input protein sequence of the training set, adding all sampled sequences together with the input sequences to a pMSA, which therefore contained natural sequences from the input domain concatenated to artificial sequences from the target domain. We then attempted to extract structural contacts between the two domains using this pMSA. The results can be seen in the left part of Fig. 4. While the performance in contact prediction is worse than when using the natural target sequences directly, the prediction based on artificial target sequences recovers 10 contacts before making an error, with a true positive fraction of 0.8 for the first 30 predictions. This indicates that the Transformer indeed captures signals that can be used for structure prediction.

Figure 4, right: Perplexity (lower is better) and accuracy (higher is better) for every sequence of the validation set as a function of the distance of the sequence from the training set. The distance of a sequence to the training set is the Hamming distance to the closest sequence in the training set.

Regularization
When experimenting with the large Transformer, we observed strong overfitting of the perplexity, especially when trained on smaller datasets. While this could be expected, we found that the matching performance was not following the same trend: While the perplexity started to degrade at some point during training, which is indicative of overfitting, the accuracy and the matching performance were still increasing, see Appendix Fig. A.2. While the shallow Transformer is less prone to overfitting, most likely due to its limited capacity, we found it necessary to introduce regularization for the large Transformer. We experimented with dropout and weight decay with limited success. While both schemes prevent overfitting in terms of the perplexity, the matching performance and the accuracy dropped significantly. We show this effect in Appendix Sec. A.4 for different training set sizes and regularization settings.

Figure 5: Diagram explaining training with entropic regularization: (A) corresponds to the training batch, with yellow being the input protein sequences and blue the output protein sequences. At (B), the batch is sent to the Transformer and the log-likelihood (Data LL) is computed. At (C), $N_s$ output protein sequences are sampled from the Transformer for every input protein sequence using the Gumbel-softmax operation. At (D), we evaluate the log-likelihood of the sequences sampled at (C) and call it Samples LL. At (E), we measure how well the log-likelihood separates the training sequences from the sampled sequences using a logarithmic softmax, creating an additional loss term. At (F), the losses calculated at (E) and (B) are combined.

Entropic Regularization
In order to find models with a good performance on perplexity, matching and accuracy at the same time, we explored other regularization approaches. In this section, we present an approach based on entropic regularization, where we enforce the probability of a target sequence A given a source sequence B to be similar to other sequences sampled from the model conditioned on B. This encourages the model to give similar weights to different possible interaction partners, even if there is only a single one present in the training set.
We therefore add a regularization term $\frac{1}{T}\sum_{l=1}^{T} R_{\mathrm{ent}}(A^l, B^l)$ to the loss, where $l$ indexes the input sequence $B^l$ and the target sequence $A^l$ in the batch and $T$ is the batch size. We sample $S$ different target sequences for $B^l$ from the model, denoting the $k$-th sampled sequence conditioned on $B^l$ as $A^{l,k}$. We sample using a Gumbel-Softmax distribution [38], which enables back-propagation through the sampling step. For computational efficiency, we sample every amino acid in $A^{l,k}$ conditioning on the preceding amino acids of the true $A^l$. We then evaluate the log-likelihood $R_{l,k}$ of each sampled target sequence $A^{l,k}$ given $B^l$ and the log-likelihood $R_l$ of the true pair,

$R_{l,k} = \log P(A^{l,k} \mid B^l), \qquad R_l = \log P(A^l \mid B^l).$

We then use these quantities as the input for a log-softmax operation, resulting in

$R_{\mathrm{ent}}(A^l, B^l) = -\log \frac{\exp(R_l)}{\exp(R_l) + \sum_{k=1}^{S} \exp(R_{l,k})}.$

This term is multiplied by a factor $\alpha > 0$ to regulate its strength and added to the loss function, meaning that we aim to minimize it. This enforces similar probabilities for the true target sequence $A^l$ and the sampled target sequences $A^{l,k}$, conditioned on $B^l$. A diagram summarizing the regularization approach can be found in Fig. 5. A closer look reveals that it is a form of entropic regularization, maximizing the conditional Rényi entropy of order 2, see Appendix Sec. A.11.
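Leaving aside the Gumbel-softmax sampling and back-propagation, the numerical form of the regularizer can be sketched as follows. This is a hypothetical NumPy illustration (the actual implementation operates on differentiable tensors); the log-softmax of the true-pair log-likelihood against the sampled log-likelihoods is computed with the usual log-sum-exp stabilization.

```python
import numpy as np

def entropic_reg(r_true: float, r_samples: np.ndarray) -> float:
    """Negative log-softmax of the true-pair log-likelihood R_l against
    the sampled log-likelihoods R_{l,k}."""
    all_r = np.concatenate(([r_true], r_samples))
    m = np.max(all_r)  # log-sum-exp trick for numerical stability
    log_norm = m + np.log(np.sum(np.exp(all_r - m)))
    return float(log_norm - r_true)

def batch_entropic_loss(r_true: np.ndarray, r_samples: np.ndarray,
                        alpha: float) -> float:
    """Alpha-weighted average of the regularizer over a batch.
    r_true: (T,) true-pair log-likelihoods; r_samples: (T, S) sampled ones."""
    terms = [entropic_reg(rt, rs) for rt, rs in zip(r_true, r_samples)]
    return alpha * float(np.mean(terms))
```

When the true sequence and all $S$ samples have equal log-likelihood, the term equals $\log(S+1)$, its value at the "maximally similar" point.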
We performed a set of experiments on the 27 datasets in order to see if this type of regularization improves the performance. We retrained the large Transformer with and without the entropic regularization, using $S = 5$ and $\alpha = 0.7$. The results can be seen in Fig. 6, where we plot the performance of the shallow Transformer against the performance of the large Transformer for different regularization schemes and arDCA. The details of the training, the models and the per-family performance can be found in Appendix Sec. A.12. The large Transformer outperforms the shallow Transformer in terms of accuracy and matching both with and without regularization, indicating that the large Transformer extracted more useful information from the training set. However, the large Transformer without regularization has a significantly higher perplexity on the validation set, indicating overfitting. Adding the entropic regularization leads to a good performance of the large Transformer in all metrics.
We also performed a systematic comparison of the entropic regularization scheme with standard weight decay, testing different weight decay values for the large Transformer. The details of the experiments and the results for every family can be found in Appendix Sec. A.13.1, along with Appendix Fig. A.16, where we average the performance over the families. While weight decay reduces overfitting in terms of the perplexity, it decreases the performance in terms of matching and accuracy. The entropic regularization, on the contrary, preserves the high performance with respect to these two measures while regularizing the perplexity. Moreover, the entropic regularization seems to consistently improve the matching on sequences far away from the training set, whereas weight decay decreases it.

Discussion
In this work, we explored the use of Transformers for generating protein domain sequences while taking into account other domain sequences that are part of the same multi-domain protein. We cast the problem as a translation task, which allowed us to directly use Transformers developed for translation between natural languages. We showed that this architecture is capable of outperforming state-of-the-art shallow autoregressive models in several metrics and explored a new regularization scheme optimized for our use case. Casting the task as a translation problem allowed us to use metrics like the matching performance for assessing the quality of the generative models.
Our work is placed at the intersection of two streams of research: There is a long history of building domain specific generative models on aligned sequences for tasks like drug design or mutational effect prediction. More recently, however, large models based on Transformer architectures trained on all or nearly all unaligned protein sequences available have shown remarkable capabilities for capturing complex patterns in the data. Our work, on the other hand, solves a very generic sequence-to-sequence prediction task using smaller Transformer architectures, specialized for a family pair and using aligned sequences, which allows for domain-specific models. One limitation of our work is that we consider only a single domain as the context when predicting the sequence of an interacting domain, disregarding additional domains that might be present in the same protein. Conceptually, however, it would not be hard to create a model that takes more than one other domain as context when generating a new sequence.
An interesting question for further research is whether we could solve the task equally well or even better using a single large model trained on all families at once. We note, however, that this is not straightforward, since a family might appear with more than one different family on the same protein, making the prediction task ambiguous. From the practical perspective, our method can be used for all tasks where a novel amino acid sequence needs to be generated conditioned on other amino acid sequences. This includes the case of general protein-protein interaction, which we did not analyze in the current work, but which can in principle be solved using the same models we presented.

A.2 Transformer
The translation model we use closely matches the original Transformer model from Ref. [17], featuring an encoder-decoder architecture. While we refer to this work for more details, we review here the key components. Sequences from the source family are encoded by the encoder and used as the input for the decoder. The source sequence is processed through alternating blocks of self-attention and linear layers. The same is done for the already translated part of the target sequence, while the part of the target sequence not yet decoded is masked. Typical vocabulary sizes in NLP are in the order of $10^4$ to $10^5$, while in our case the vocabulary $V$ is composed of 21 tokens, corresponding to 20 amino acids and an alignment gap symbol.
The input embedding is composed of two parts, one for the amino acid identity and one for the position in the sequence. We learn a dictionary $W$, mapping each of the 21 symbols to a vector of dimension $d_{\mathrm{model}}$. The sequence position is embedded as a vector $PE$, calculated as

$PE_{(i,\,2k)} = \sin\left(\frac{i}{10000^{2k/d_{\mathrm{model}}}}\right), \qquad PE_{(i,\,2k+1)} = \cos\left(\frac{i}{10000^{2k/d_{\mathrm{model}}}}\right),$

where $i$ is the position in the sequence and $k$ indexes the dimension in the embedding vector. The embedding of a sequence is then taken as the sum of the amino acid and positional embeddings. The embedded amino acid sequences are then passed to the encoder, mapping them to a latent representation $z = (z_1, \ldots, z_n)$. This latent representation is then passed to the decoder that predicts the interaction partner sequence.
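A minimal NumPy sketch of the sinusoidal positional embedding of Ref. [17]; note the handling of odd embedding dimensions (the shallow model uses $d_{\mathrm{model}} = 55$), which is an implementation detail we assume rather than take from the paper.

```python
import numpy as np

def positional_embedding(n_positions: int, d_model: int) -> np.ndarray:
    """PE[i, 2k] = sin(i / 10000^(2k/d)), PE[i, 2k+1] = cos(i / 10000^(2k/d))."""
    pe = np.zeros((n_positions, d_model))
    pos = np.arange(n_positions)[:, None]        # (n, 1) positions i
    k = np.arange(0, d_model, 2)[None, :]        # (1, ceil(d/2)) even dims 2k
    angle = pos / np.power(10000.0, k / d_model)
    pe[:, 0::2] = np.sin(angle)
    # for odd d_model there is one fewer cosine column than sine column
    pe[:, 1::2] = np.cos(angle[:, : d_model // 2])
    return pe
```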
The decoder implements an auto-regressive distribution $P(a_i \mid z, a_{<i})$, defining the probability of the $i$-th amino acid in the interaction partner sequence given the preceding amino acids $a_{<i}$ and the hidden representation $z$ of the input sequence $B$. During training we use the true amino acids for $a_{<i}$, while during sampling we generate the sequence $A$ sequentially.

A.2.1 Attention Mechanism
Both the encoder and decoder use self-attention mechanisms, and the decoder additionally uses a cross-attention mechanism.
Following [17], we define the attention operation as

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_{\mathrm{model}}}}\right) V,$

where $Q$ is the query matrix, $K$ is the key matrix, $V$ is the value matrix and $d_{\mathrm{model}}$ the dimension of the keys.
For each row of the query matrix $Q$, we compute its similarity with the rows of the key matrix $K$. This yields weights used to compute a weighted average of the values $V$.
The output of the $i$-th attention head, called $\mathrm{head}_i$, is calculated as

$\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V).$

We then linearly combine the different heads, resulting in

$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O.$

In self-attention layers, the keys, queries and values are calculated from the same input $X$, i.e. $Q = K = V = X$. In cross-attention, the keys and values are based on $z$ and the queries on the intermediate decoder representations. In the results presented in this paper we only used models with a single head.
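A minimal NumPy sketch of scaled dot-product attention as defined above (single head, no learned projections), to make the weighting explicit:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Row-wise softmax with max-subtraction for numerical stability."""
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """softmax(Q K^T / sqrt(d)) V, with d the key dimension."""
    d = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d))  # (n_queries, n_keys)
    return weights @ V
```

When the query-key similarities are sharp (nearly one-hot weights), the output simply selects the corresponding value rows.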

A.2.2 Transformer Architecture
The complete architecture is represented in Fig. A.1 and is based on stacking encoder and decoder blocks in the two parts. At the end of these blocks, linear layers with dimension $d_{\mathrm{ff}}$ and residual connections to the input of the blocks are added. The code is based on the PyTorch implementation of the Transformer, https://github.com/pytorch/pytorch.

A.3 arDCA Baseline
As a baseline we use the recently introduced arDCA [16], which is an efficient autoregressive model for protein sequences. For an amino acid sequence $A = (a_1, \ldots, a_N)$ of length $N$, arDCA defines the conditional probability $P(a_i \mid a_{i-1}, \ldots, a_1)$ as

$P(a_i \mid a_{i-1}, \ldots, a_1) = \frac{\exp\left\{h_i(a_i) + \sum_{j=1}^{i-1} J_{ij}(a_i, a_j)\right\}}{z_i(a_{i-1}, \ldots, a_1)},$

with $z_i(a_{i-1}, \ldots, a_1) = \sum_{a_i} \exp\left\{h_i(a_i) + \sum_{j=1}^{i-1} J_{ij}(a_i, a_j)\right\}$ being a normalization factor. The parameters $h$ depend on a single position and the amino acid found at that position, and the parameters $J$ on pairs of positions and the two amino acids found at that pair of positions.
The probability of a sequence can be computed using the decomposition $P(a_1, \ldots, a_L) = P(a_1) \cdot P(a_2 \mid a_1) \cdots P(a_L \mid a_{L-1}, \ldots, a_1)$, which is tractable. Training can be done using standard convex optimization methods.
For our purposes, we concatenate the source protein $B$ and the target protein $A$ into a single sequence during training. During evaluation, we just need the conditional probability $P(A \mid B)$, which we calculate using $P(a_1, \ldots, a_{N_{\mathrm{out}}} \mid B) = P(a_1 \mid B) \cdot P(a_2 \mid B, a_1) \cdots P(a_{N_{\mathrm{out}}} \mid B, a_{N_{\mathrm{out}}-1}, \ldots, a_1)$.
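A numeric sketch of one arDCA conditional with hypothetical small parameters (the field and coupling arrays below are illustrative, not trained values):

```python
import numpy as np

def ardca_conditional(h_i: np.ndarray, J_i: np.ndarray,
                      prefix: np.ndarray) -> np.ndarray:
    """P(a_i | a_<i) for one position i.
    h_i: (q,) fields for the q symbols at position i;
    J_i: (i-1, q, q) couplings of position i to each previous position j;
    prefix: indices of the previously decoded symbols a_1..a_{i-1}."""
    logits = h_i.copy()
    for j, a_j in enumerate(prefix):
        logits += J_i[j][:, a_j]  # add J_ij(a_i, a_j) for every candidate a_i
    p = np.exp(logits - np.max(logits))
    return p / p.sum()  # division by z_i, the normalization over symbols
```

Sampling a full sequence then just iterates this conditional from position 1 to $N$, which is why the decomposition is tractable.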
We also added an L2 regularization on the parameters $h$ and $J$. During our experiments we used the regularization parameters communicated by the authors ($\lambda_h = \lambda_J = 0.0001$).

A.4 Regularization Benchmark
In this section we show learning curves related to overfitting behaviour and regularization.

A.5 Sequence Logo and Loss
In Fig. A.4 we show the perplexity per position for family pair PF013171-PF14226, together with the sequence logo. It is evident that biologically conserved positions correspond to lower perplexities, which is to be expected.

A.7 Structural Information using DCA
Direct Coupling Analysis is a group of unsupervised methods for modelling aligned protein sequences, see Ref. [39]. Apart from other applications, it can be used for predicting structural contacts from MSAs.

A.7.1 plmDCA
plmDCA is a specific method of DCA based on a pseudolikelihood approximation for training the Potts model. In this paper we used the asymmetric version of the method from https://github.com/pagnani/ArDCA.jl with default hyperparameters. The sequences sampled from the Transformer were realigned using HMMer [28].

A.7.2 Results per Pairs
In this section, we show the contact prediction results obtained with plmDCA for the 27 families.

A.8 Measuring the Model Specificity
The difference in log-likelihood (which is proportional to the logarithm of the perplexity) of a target sequence with or without conditioning on the input sequence can be seen as a rough estimate of the mutual information.
When the input sequence is randomly chosen, there is no correlation between the input and the output, and the corresponding probability can be seen as the marginal probability of the output sequence. We can therefore write

$I(A; B) \approx \frac{1}{|V|} \sum_{(a,b) \in V} \left[\log P(a \mid b) - \log P_{\mathrm{shuffled}}(a \mid b)\right],$

where $a$ and $b$ are sequence pairs in the validation set $V$ and $P_{\mathrm{shuffled}}$ denotes the model trained on shuffled pairs, which approximates the marginal distribution of the target sequences.
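Given per-sequence log-likelihoods from the paired model and from the shuffled-trained model, the estimate reduces to a mean difference (array names are illustrative):

```python
import numpy as np

def mutual_information_estimate(ll_paired: np.ndarray,
                                ll_shuffled: np.ndarray) -> float:
    """Rough MI estimate in nats: mean over validation pairs of the
    log-likelihood difference between the paired model and the model
    trained on shuffled pairs (a proxy for the marginal P(A))."""
    return float(np.mean(ll_paired - ll_shuffled))
```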

A.9 Matching Performance for Different Distances from Training Set
Here we plot the ratio of correctly matched pairs in the validation set, separated into below-median and above-median distance from the training set.

A.10 Matching Curves
For each protein family we measure the fraction of correctly matched pairs when restricting the problem to the first n sequence pairs.

A.11 Entropic Formulation of The Regularization
There is a relation between the entropic regularization and the Rényi entropy: For a sufficiently large number of sampled sequences $S$, we can rewrite the regularization term as

$R_{\mathrm{ent}}(A^l, B^l) = -R_l + \log\left(\exp(R_l) + \sum_{k=1}^{S} \exp(R_{l,k})\right) \approx -R_l + \log S + \log \sum_{A} P(A \mid B^l)^2,$

where the last sum is over all possible sequences $A$, using the fact that for samples drawn from the model $\frac{1}{S}\sum_{k=1}^{S} \exp(R_{l,k}) \approx \mathbb{E}_{A \sim P(\cdot \mid B^l)}\left[P(A \mid B^l)\right] = \sum_A P(A \mid B^l)^2$. The first term in the last expression is equivalent to the standard loss and can be absorbed there. The second term is a constant and will not influence the gradient. The last term is the negative Rényi entropy of order 2, also called the collision entropy, of the distribution over target sequences conditioned on $B^l$; minimizing the regularizer therefore maximizes this conditional entropy.

A.12 Regularization Performance
This section presents the results of the large Transformer trained with $\alpha = 0.7$ and $S = 5$ in comparison with arDCA and the shallow Transformer in Fig. A.13, Fig. A.14 and Fig. A.15. The comparison with the large Transformer without regularization or with weight decay is presented in the following section, see Sec. A.13.1.

A.13 Entropic Regularization Compared with Weight Decay
In this section, we compare the entropic regularization and weight decay on different metrics, averaged over all 27 families.

A.14 Training Time
In this section we present a table with the training time of our models using a single Nvidia V100 GPU. The first column refers to the family pair, the second to the training time of the shallow Transformer, the third one to the large Transformer with entropic regularization and the last one to the large Transformer without entropic regularization.