DR-BERT: A Protein Language Model to Annotate Disordered Regions

Despite their lack of a rigid structure, intrinsically disordered regions in proteins play important roles in cellular functions, including mediating protein-protein interactions. Therefore, it is important to computationally annotate disordered regions of proteins with high accuracy. Most popular tools use evolutionary or biophysical features to make predictions of disordered regions. In this study, we present DR-BERT, a compact protein language model that is first pretrained on a large number of unannotated proteins before being trained to predict disordered regions. Although it does not use any explicit evolutionary or biophysical information, DR-BERT shows a statistically significant improvement when compared to several existing methods on a gold standard dataset. We show that this performance is due to the information learned during pretraining and DR-BERT's ability to use contextual information. A web application for using DR-BERT is available at https://huggingface.co/spaces/nambiar4/DR-BERT and the code to run the model is also publicly available.


Introduction
Over a century ago, the chemist Emil Fischer postulated the lock-and-key model for enzymatic reactions, giving rise to the theory that a protein's function depends on its unique and rigid three-dimensional structure (Fischer, 1894). Within this paradigm, two proteins can interact if they have complementary structures. This idea has contributed to several advances in the understanding of protein function, and it is undeniable that the structure of a protein affects its function. However, studies in the late 1990s and early 2000s recognized that a stable structure is often not necessary for function (Wright and Dyson, 1999; Uversky, 2002). Segments that lack a rigid structure, also known as intrinsically disordered regions (IDRs), have been found in many proteins and shown to actively participate in diverse functions (Van Der Lee et al., 2014). In fact, these disordered regions are critical for some proteins with central roles in cellular signaling and regulatory networks, allowing them to interact with different proteins (Wright and Dyson, 1999, 2015).
Given the functional importance of disordered regions, computational methods for predicting disordered regions have been studied for decades, and over a hundred methods, ranging from biophysical to machine learning-based models, have been developed (Zhao and Kurgan, 2022).
Recently, predictors that use deep learning have gained traction (Zhao and Kurgan, 2022). This was particularly evident in the Critical Assessment of protein Intrinsic Disorder (CAID) competitions, where deep learning-based models consistently delivered the best performance (Necci et al., 2021). Many existing deep learning methods to predict disordered regions utilize recurrent neural networks and convolutional neural networks, sometimes paired with an attention mechanism (Hanson et al., 2019; Tang et al., 2020, 2022). This success of deep learning-based methods for predicting disordered regions in proteins can be attributed to both the complex and non-linear nature of sequence-structure maps and the steady increase in data availability (Piovesan et al., 2016).
Protein language modeling has been a particularly fast-growing area of deep learning research for computational biology. Inspired by natural language processing, the core idea of protein language modeling is that the amino acids (or sometimes small groups of amino acids) that make up a protein are analogous to the words that make up a sentence (Rives et al., 2021; Nambiar et al., 2020). Like their natural language counterparts, protein language models leverage large amounts of unannotated amino acid sequence data to pretrain deep learning models before specializing them on much smaller amounts of annotated data. Usually, this pretraining step consists of training the model either to predict the context surrounding a particular residue (Asgari and Mofrad, 2015) or to predict the identity of a hidden residue given its surrounding context, i.e. the set of its nearby amino-acid residues (Rives et al., 2021). These models have then been successfully used to perform various downstream tasks, including protein family labeling (Asgari and Mofrad, 2015; Nambiar et al., 2020), prediction of protein interactions (Nambiar et al., 2020) and subcellular localization (Stärk et al., 2021), and the inference of evolutionary trajectories and phylogenetic relationships of proteins (Hie et al., 2022; Lupo et al., 2022). While most protein language models tend to be large and GPU intensive, there have been studies proposing small and computationally inexpensive protein language models (Nambiar et al., 2020).
In this paper, we present Disordered Region prediction using Bidirectional Encoder Representations from Transformers (DR-BERT), a small protein language model that is first pretrained on a large corpus of amino acid sequences and then finetuned to predict disordered regions in proteins.
We validate our model on both CAID 1 and CAID 2 evaluation data and benchmark it against some of the best performing models. We then investigate the impact of pretraining on the performance of DR-BERT. Finally, we dive into one particular biological case study involving RPB6, a subunit of RNA polymerase, to illustrate how DR-BERT arrives at its predictions and learns to use contextual information from the amino acid sequence.

Results
While many models for disordered region prediction depend on knowledge of biophysical properties of amino acids used as inputs, previous work has shown that pretraining a protein language model may allow it to learn these biophysical and functional properties in a self-supervised manner (Rives et al., 2021; Nambiar et al., 2020). Therefore, we chose to build our DR-BERT model using only the amino acid sequence of a protein as the input. This model is first pretrained on the masked language modeling task as shown in Figure 1 before it is finetuned to predict intrinsically disordered regions.
The model itself is a neural network with a Transformer encoder block composed of six stacked Transformer encoder layers (see Methods for details). The purpose of the encoder block is to create contextual latent representations of each residue. That is, each residue is represented by a vector that captures the context of the rest of the sequence. By stacking multiple Transformer encoder layers within the encoder block, the final latent representations can capture more complex higher-level information and relationships from the amino acid sequence. These vectors are then passed to a final linear layer that constructs a task-specific output.
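The overall shape of this architecture can be illustrated with a minimal sketch. Random matrices stand in for the trained weights, and the layers below omit the feed-forward sublayers, layer normalization, and multiple attention heads of the real encoder; only the stacking of contextual layers and the per-residue output head are shown.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def encoder_layer(h, Wq, Wk, Wv, Wo):
    """One simplified self-attention layer: each residue's vector is
    updated with a weighted mix of every residue's vector (its context)."""
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))  # (L, L) context weights
    return h + attn @ v @ Wo                        # residual connection

rng = np.random.default_rng(0)
L, d = 130, 768                  # sequence length, hidden size
h = rng.normal(size=(L, d))      # embedded amino-acid sequence
for _ in range(6):               # six stacked encoder layers
    Ws = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(4)]
    h = encoder_layer(h, *Ws)

W_cls = rng.normal(size=(d, 2))  # final linear layer: one output per residue
probs = softmax(h @ W_cls)       # per-residue class probabilities
print(probs.shape)               # (130, 2)
```

Stacking the layers is what lets later representations depend on increasingly long-range combinations of residues.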
In the pretraining task of masked language modeling, the neural network is asked to predict the identities of amino acids that have been masked in the input.In this study, we pretrained our model on 6,564,742 proteins randomly sampled from the UniRef90 dataset (Suzek et al., 2014).
Next, we finetuned DR-BERT by tasking it to classify residues in proteins as disordered or ordered using annotated data from the DisProt database. The performance of DR-BERT on this finetuning task is shown in Figure 1c alongside previous state-of-the-art methods.

Benchmarking DR-BERT's performance
When finetuning DR-BERT on the disordered region classification task, we split the DisProt data into train/validation/test sets with the aim of enabling a systematic and unbiased comparison against existing methods. In particular, proteins from the Critical Assessment of Protein Intrinsic Disorder Prediction (CAID) competitions were reserved as test data and were not available to the model during training. In addition, any proteins that shared more than 25% similarity to proteins from the test set were excluded from the train set. We ran our benchmarking on both CAID 1 and CAID 2. A notable pattern among the existing methods is that most of them rely on pre-computed biophysical and evolutionary features, which has two drawbacks. First, the performance of the model is reliant on its upstream dependencies. For example, if a model uses MSAs as input, one would expect its performance to deteriorate for proteins that do not have many known homologs. In addition, the presence of multiple third-party techniques in a prediction pipeline makes it more difficult to optimize computational efficiency. In contrast, DR-BERT is a fully self-contained model that does not rely on any additional information besides the amino acid sequence of a protein. Despite not requiring any additional information, the Receiver Operating Characteristic (ROC) curves (Figure 2) on both the CAID 1 and CAID 2 test sets demonstrate that DR-BERT outperforms all of the other methods in predicting disordered regions. For CAID 1, DR-BERT ranks first in terms of area under the ROC curve (AU-ROC) with a value of 0.82. The scores then incrementally decrease with flDPnn, RawMSA and SPOT-Disorder2. For CAID 2, DR-BERT is again the highest ranking method, followed by flDPnn and a three-way tie between RawMSA, SPOT-Disorder2 and DisoMine.
The ROC curves also show that DR-BERT offers particularly evident improvements in the lower range of false positive rates. However, as the disordered region dataset is imbalanced, with more ordered residues than disordered ones, the ROC curves may give an overly optimistic view of the classifiers (Davis and Goadrich, 2006). Therefore, we also calculate F1 scores and Matthews correlation coefficients (MCCs) for each model. These scores, along with the AU-ROC scores, are shown in Figure 3a for CAID 1 and Figure 3b for CAID 2. Again, DR-BERT scores the highest on both metrics, with an F1 of 0.55 and MCC of 0.43 for CAID 1 and an F1 of 0.56 and MCC of 0.43 for CAID 2. The precision-recall plots in Supplementary Figure 1 also show that DR-BERT performs better than the other methods in balancing the trade-off between precision and recall.
To determine the statistical significance of DR-BERT's improvement over the existing methods, we performed a resampling analysis. This advantage of DR-BERT over methods that use evolutionary and structural features suggests that these features can be successfully learned by the model either during pretraining or finetuning.
In fact, it has been previously shown that pretrained protein language models are able to extract structural information from amino acid sequences (Bhattacharya et al., 2020;Singh et al., 2022).
However, these results alone do not elucidate the contribution of pretraining to the success of DR-BERT.
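A resampling analysis of this kind can be sketched as follows. The synthetic scores, the rank-based AUC estimator, and the 1,000-replicate bootstrap below are illustrative stand-ins, not the paper's exact protocol: test residues are resampled with replacement, and the p-value is the fraction of replicates in which the weaker model matches or beats the stronger one.

```python
import numpy as np

def auc(labels, scores):
    """Rank-based AU-ROC: probability a random positive outscores a random negative."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores)); ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=500)           # synthetic ground-truth labels
s_a = y + rng.normal(0, 0.8, size=500)     # model A: less noisy scores
s_b = y + rng.normal(0, 1.2, size=500)     # model B: noisier scores

# Bootstrap: resample residues and record the AUC difference each time.
diffs = []
for _ in range(1000):
    idx = rng.integers(0, 500, size=500)
    diffs.append(auc(y[idx], s_a[idx]) - auc(y[idx], s_b[idx]))
p_value = np.mean(np.array(diffs) <= 0)    # one-sided bootstrap p-value
print(round(p_value, 3))
```

A small p-value indicates that model A's advantage is unlikely to be an artifact of the particular test sample.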

Pretraining and Finetuning
To better understand the role that pretraining plays in extracting the information relevant to disordered region prediction, we interrogated DR-BERT models at two stages: (a) after only pretraining and (b) after pretraining and finetuning. At both of these stages, we extracted the embeddings from the Encoder Block for each residue in the test set. Using t-SNE, we projected these embeddings down to two dimensions (Van der Maaten and Hinton, 2008). Then, we calculated kernel density estimates (KDEs) separately for ordered and disordered residues. These KDEs are shown in Figure 4a for the pretrained model and in Figure 4b for the model that was pretrained and finetuned. The plot for the pretrained model shows about 20 distinct clusters of ordered residues and 15 distinct clusters of disordered residues. Upon further investigation, we see that each cluster corresponds to an individual amino acid. There are a few exceptions to this for disordered residues. For instance, there is no clear disordered cluster for the amino acid tryptophan (W). This is because tryptophan is one of the most order-promoting amino acids and is rarely encountered inside intrinsically disordered regions (Campen et al., 2008). However, the clear overall pattern in Figure 4a is that most ordered clusters are accompanied by an adjacent disordered cluster for the same amino acid. This is in contrast to the null model, shown in Supplementary Figure 5, where the disorder/order residue labels are shuffled. On the other hand, the plot for the finetuned embeddings depicts a different story. While the embeddings are no longer clustered by amino acid, the disordered residues are all clustered together and are well-separated from the ordered residues.
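The projection-and-density procedure described above can be sketched roughly as below. Small synthetic arrays stand in for the 768-dimensional residue embeddings, and the t-SNE settings are illustrative.

```python
import numpy as np
from sklearn.manifold import TSNE
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Stand-ins for per-residue encoder embeddings (real ones are 768-D).
ordered    = rng.normal(0.0, 1.0, size=(60, 32))
disordered = rng.normal(0.5, 1.0, size=(60, 32))
emb = np.vstack([ordered, disordered])
labels = np.array([0] * 60 + [1] * 60)

# Project to 2-D, then fit a separate kernel density estimate per class.
xy = TSNE(n_components=2, perplexity=15, random_state=0).fit_transform(emb)
kde_ord = gaussian_kde(xy[labels == 0].T)   # gaussian_kde wants (dims, points)
kde_dis = gaussian_kde(xy[labels == 1].T)
print(xy.shape)                              # (120, 2)
```

Evaluating each KDE on a grid and contouring the two densities separately reproduces the style of overlay used in Figure 4.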
The difference between the pretrained and finetuned embeddings highlights that pretraining a protein language model is sufficient to extract some information regarding disordered regions in proteins. Finetuning the model allows it to then hone in on the differences between disordered and ordered residues to more efficiently separate them. This result gives credence to an observation we made in Nambiar et al. (2020), where we noted that pretraining a protein language model allows it to learn general but biologically relevant information from amino acid sequences, whereas finetuning gives the model more information about one characteristic but at the expense of generality.
In addition, we wanted to quantify the advantage of pretraining for predicting disordered regions. To do this, we trained a model with an architecture identical to DR-BERT to predict disordered regions without any pretraining. The results of this non-pretrained model evaluated on CAID 1, shown in Figure 5, show that pretraining gives DR-BERT a considerable advantage. In fact, in the absence of pretraining, our model lags behind the models from the CAID competition, shown in Figure 3 and Supplementary Figure 4. This showcases the advantage of pretraining for Transformer neural networks, especially in low data regimes.

A case study: the disordered region in RPB6 protein
Evaluating DR-BERT on a large annotated dataset gave us confidence in DR-BERT's ability to make accurate predictions regarding disordered regions. However, it is also useful to illustrate a potential use-case by focusing on the predictions of the model for an intrinsically disordered region within a single protein. Doing so gives us the opportunity to gain insight into how the context of a particular sequence is used by the attention heads of the model (see Methods) to make predictions for different residues in the same protein. We decided to illustrate this using the RPB6 protein as an example. RPB6 is a subunit of an RNA polymerase in fission yeast. It is known to bind to the general transcription factor, TFIIS (Ishiguro et al., 2000). This example allows us to test our disordered region prediction for a protein that is known to perform an important function. Figure 6b shows that DR-BERT predicts with high confidence that the N-terminal tail of RPB6 is in fact disordered. Indeed, NMR spectroscopy shows that not only does RPB6 have a flexible N-terminal tail, this tail is also used to bind to the p62 subunit of the TFIIH transcription factor (Okuda et al., 2021). Figure 6a shows DR-BERT's predictions overlaid on the NMR-determined structure of the complex between RPB6 and the TFIIH p62 PH domain (PDB: 7DTI). To analyze how DR-BERT uses sequence context to make its predictions, we extracted the self-attention heads for each of the six layers in DR-BERT's encoder as it processed the 130 amino acid long RPB6 sequence. Each attention map is represented as a 130 × 130 matrix whose entry (i, j) gives a numerical score for how much that particular attention head focuses on amino acid j when determining the context relevant for amino acid i. A sample of these attention maps for each layer in DR-BERT is displayed in Figure 6c (a complete table is shown in Supplementary Figure 6). We observed that the features learned by the attention maps of the initial layer do not have any clear high-level patterns. However, attention maps from layers two to four display some distinct patterns. For example, in layer 4, the attention map reveals that the relevant context for each residue includes a large window of surrounding residues in addition to several smaller windows at intervals on either side of the residue in question. By layer 5, at least one attention map divides the residues into two distinct groups.
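A single attention map of the kind described above can be illustrated as follows. Random matrices stand in for the trained query and key projections, and the head dimension is reduced; the point is the shape of the object: an L × L row-stochastic matrix whose row i scores how much residue i attends to each residue j.

```python
import numpy as np

def attention_map(h, Wq, Wk):
    """Row i of the map scores how much residue i attends to residue j."""
    q, k = h @ Wq, h @ Wk
    logits = q @ k.T / np.sqrt(k.shape[-1])
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
L, d = 130, 64                        # RPB6 length; reduced head dimension
h = rng.normal(size=(L, d))           # stand-in for residue representations
A = attention_map(h, rng.normal(size=(d, d)), rng.normal(size=(d, d)))
A_log = np.log(A + 1e-9)              # log transform, as in Figure 6c
print(A.shape)                        # (130, 130): one map per head per layer
```

In the real model, such maps are read directly out of each of the 12 heads in each of the 6 layers during a forward pass.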

Discussion
In this study we introduce DR-BERT, a protein language model for predicting disordered regions in proteins. DR-BERT is first pretrained on the masked language modeling task before it is finetuned to predict disordered regions.
This finetuned model is benchmarked using the CAID evaluation data and significantly surpasses the other models.This improvement over models that use biophysical and structural information supports the hypothesis that pretraining protein language models enables them to learn biologically relevant information in a self-supervised manner without any provided annotations.
This hypothesis is further validated as we show that the embeddings of the pretrained model are able to differentiate between disordered and ordered residues without access to any annotations during training. Furthermore, we showed that a model with an architecture identical to DR-BERT suffers a large loss in performance when the pretraining step is skipped.
Finally, we took a closer look at how DR-BERT makes predictions for RPB6. Through this exercise we saw that DR-BERT extracts patterns hierarchically, with higher-level features extracted by attention heads in deeper layers of the neural network. This is similar to behavior that has been observed in computer vision.
To verify that DR-BERT was not overfitting on the training data, we excluded from the training set any proteins that were clustered with proteins in the test set at 25% sequence similarity. The clustering in this process was performed using CD-HIT (Fu et al., 2012).
Given the high performance of DR-BERT on the disordered region prediction task, we also investigated its ability to perform related tasks from CAID 2. This included evaluating DR-BERT on a disordered region dataset where X-ray annotations were removed (disorder-noX) and a dataset where PDB residues were incorporated (disorder-PDB). The results shown on Supplementary Figure 3 show that DR-BERT performs well on the disorder-noX test set, placing first on the AU-ROC and area under precision-recall plot metrics and second to SPOT-Disorder2 on F1 score and MCC.
However, DR-BERT only shows middling performance on the disorder-PDB set. Given that DR-BERT was trained on the vanilla disordered region dataset from DisProt, it is not surprising that DR-BERT's performance dropped on some of these variants. In addition to variants of disordered region annotations, we also evaluated DR-BERT on predicting protein-binding regions. Protein-binding disordered regions are regions in disordered proteins that bind to structured partners and potentially allow the disordered protein to bind to multiple partners (Mészáros et al., 2009). DR-BERT achieved an AU-ROC of 0.75, an F1 score of 0.45 and an MCC of 0.32, beating the other protein-binding predictors from CAID 2.
The success of DR-BERT, in addition to the insight into how DR-BERT makes predictions, leads us to believe that protein language models could play an important role in the next generation of neural networks for predicting disordered regions. In fact, after completing our study, we found that a similar model to DR-BERT was presented in a recent preprint by Redl et al. (2023). However, there are significant differences in our studies, including our investigation of the effect of pretraining on the success of the protein language model and the insight into the features extracted by the attention layers. In addition, DR-BERT is significantly smaller than the model proposed by Redl et al. (2023) (with 15x fewer parameters), which may make DR-BERT more accessible to users without access to high-performance GPUs. An alternative approach to the one shown in our study would be to extract embeddings from a pretrained model and pass them to a downstream classifier without finetuning the embeddings. This approach, which is used by the SETH model, makes it more efficient to train models on downstream tasks using embeddings from a large pretrained language model.


Methods

Embedding Layers
The embedding block consists of two main component layers: a word embedding layer and a positional embedding layer. The word embedding layer takes the tokenized sequence of amino acids and maps each token to a 768 dimensional vector. In contrast, the positional embedding layer captures the spatial information of the tokens to preserve the notion of context within the sequence (Vaswani et al., 2017).
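A rough sketch of these two embedding layers follows, with random matrices standing in for the learned weight tables; the vocabulary size of 25 is an illustrative assumption (roughly the 20 standard amino acids plus special tokens).

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, d = 25, 1024, 768   # assumed vocab; hidden size 768

word_emb = rng.normal(size=(vocab_size, d))  # one learned vector per token type
pos_emb  = rng.normal(size=(max_len, d))     # one learned vector per position

tokens = rng.integers(0, vocab_size, size=130)   # a tokenized sequence
x = word_emb[tokens] + pos_emb[: len(tokens)]    # sum word and positional parts
print(x.shape)                                   # (130, 768)
```

The positional term is what lets the otherwise position-blind attention layers distinguish the same amino acid at different places in the sequence.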

Encoding Layers
After a dropout layer is applied to decrease the potential for overfitting (Srivastava et al., 2014), the embedding, consisting of a 768 dimensional vector for each amino acid token, is used by the Transformer encoder layers. The RoBERTa Transformer layer consists of a self-attention layer and a feed-forward network layer. The self-attention mechanism described in Vaswani et al. (2017) captures the relationship between different tokens in a sequence.

Disordered Region Prediction
To finetune DR-BERT, we applied a token classification training method. A classification layer is trained and applied to each positional embedding output. Then, a softmax function is applied to transform the embedding into probability space, taking the rounded result as the predicted label.
Then, cross-entropy loss is applied between the predicted labels and the ground truths. The classification training lasted 10 epochs, with the best-performing checkpoint on the validation dataset chosen as the final model. The learning rate was empirically chosen to be 2e-6 (against 2e-5 and 2e-7), using a cosine scheduler with hard restarts, as opposed to a linear scheduler. To compare against similar models, DR-BERT was tested on the CAID 1 and CAID 2 datasets, both of which we ensured to be disjoint from the training and validation datasets.
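The token classification step described above can be sketched as follows, with random values standing in for the trained encoder outputs and classification weights; only the forward computation of the predicted labels and the cross-entropy loss is shown, not the gradient updates.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
L, d = 130, 768
h = rng.normal(size=(L, d))                  # per-residue encoder outputs
W, b = rng.normal(size=(d, 2)) / np.sqrt(d), np.zeros(2)

probs = softmax(h @ W + b)                   # per-residue class probabilities
pred = probs.argmax(axis=-1)                 # "rounded result" = predicted label

truth = rng.integers(0, 2, size=L)           # ground-truth disorder labels
loss = -np.log(probs[np.arange(L), truth] + 1e-12).mean()  # cross-entropy
print(pred.shape)                            # (130,)
```

During training, this loss is minimized over the annotated DisProt sequences while the encoder weights are updated jointly with the classification layer.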

Evaluation Metrics
The primary evaluation metrics used for DR-BERT were the Area Under the Receiver Operating Characteristic Curve (AU-ROC), F1 score and the Matthews Correlation Coefficient (MCC). The receiver operating characteristic curve is obtained by plotting the true positive rate against the false positive rate as the probability decision threshold is varied. Therefore, the AU-ROC for a perfect classifier would be 1.0, and a random classifier would have an area of 0.5. F1 scores are computed over a flattened vector of all predicted disorder binary labels against their ground truth, and the score is given by F1 = 2TP / (2TP + FP + FN). The MCC score offers a metric that is stable on imbalanced datasets (Chicco and Jurman, 2020).
Because the MCC formula is symmetric in the two classes, the metric is invariant to which class is considered negative or positive. As the DisProt dataset has approximately 3 times as many ordered labels as disordered, the MCC is an appropriate metric to characterize the model's performance. MCC is defined as MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)). For a fair comparison between methods, when these evaluation metrics were run, only test sequences that successfully ran on all methods were used. In addition, we attempted to emulate the evaluation strategy of the CAID competitions. In particular, when reporting F1 and MCC, we use the binary labels reported by CAID whenever available for a method, since CAID identifies the threshold that maximizes the F1 score for a particular method (Necci et al., 2021). In the case of methods where a binary label was not provided by CAID, and for DR-BERT, we identify the threshold that maximizes the F1 score ourselves. The threshold for protein binding is calculated independently from the threshold for disordered region prediction (including disorder, disorder-PDB and disorder-noX).
However, the same variant of DR-BERT was used for both disordered region and protein binding prediction.
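The F1 and MCC formulas above, together with the F1-maximizing threshold search, can be sketched on synthetic scores as follows (the score distribution and the threshold grid are illustrative assumptions):

```python
import numpy as np

def f1_mcc(truth, pred):
    """F1 and MCC from the confusion-matrix counts, as defined above."""
    tp = np.sum((pred == 1) & (truth == 1))
    tn = np.sum((pred == 0) & (truth == 0))
    fp = np.sum((pred == 1) & (truth == 0))
    fn = np.sum((pred == 0) & (truth == 1))
    f1 = 2 * tp / (2 * tp + fp + fn)
    mcc = (tp * tn - fp * fn) / np.sqrt(
        float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return f1, mcc

rng = np.random.default_rng(0)
truth = rng.integers(0, 2, size=1000)
scores = truth * 0.4 + rng.random(1000) * 0.6   # synthetic model scores

# Sweep candidate thresholds and keep the one that maximizes F1.
best_f1, best_t = max(
    (f1_mcc(truth, (scores >= t).astype(int))[0], t)
    for t in np.linspace(0.05, 0.95, 19))
print(round(best_f1, 2), round(best_t, 2))
```

Because the selected threshold depends on the score distribution, it is computed separately per task (e.g. independently for protein binding and for disordered region prediction), as described above.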

Figure 1. The DR-BERT model is pretrained on the masked language modeling task and finetuned on predicting disordered regions in proteins. (a) A schematic of the DR-BERT model and the pretraining and finetuning procedures. (b) The statistics of data used in this study and (c) the CAID 1 and CAID 2 results of DR-BERT compared to some of the best-performing models from the CAID competitions (Necci et al., 2021). Cells are colored based on the performance of each model for a particular metric for CAID 1.

This left us with 1,408 examples in the train set, 156 sequences in validation and 652 in the test set for CAID 1, and 2,013 examples in the train set, 216 sequences in validation and 348 in the test set for CAID 2. By doing so, we were able to reproduce the results of some of the top-performing models from CAID (Necci et al., 2021). In particular, we benchmarked DR-BERT against flDPnn (Hu et al., 2021), RawMSA (Mirabello and Wallner, 2019), SPOT-Disorder2 (Hanson et al., 2019), DisoMine (Orlando et al., 2022), Espritz-D (Walsh et al., 2012), AUCpreD (Wang et al., 2016), IUPred2A/3 (Mészáros et al., 2018) and Predisorder (Deng et al., 2009). Of these methods, all but IUPred2A/3 are deep learning-based models based on feed-forward, recurrent, and convolutional neural network architectures. flDPnn is a feed-forward neural network that uses evolutionary and structural information in addition to disordered region predictions from simpler models; RawMSA uses convolutional neural networks (CNNs) on evolutionary information (in the form of MSAs); SPOT-Disorder2 uses a combination of CNNs and recurrent neural networks (RNNs) on input with evolutionary information; DisoMine uses RNNs on structural information; Espritz uses RNNs on evolutionary information; AUCpreD uses CNNs on sequence information (with optional evolutionary information); and Predisorder uses RNNs with structural, biophysical and evolutionary information.

Figure 2. The ROC curves of DR-BERT and other models on test sets from (a) CAID 1 and (b) CAID 2. The legends display the area under the curve (AUC) for each model. The models are ordered based on the AUC in CAID 1.

Figure 3. Comparing the results of DR-BERT with other top-performing methods on the CAID datasets. (a) The MCC, F1, and AU-ROC scores of DR-BERT and the top-performing methods from CAID 1, evaluated on the test split. (b) The MCC, F1, and AU-ROC scores for corresponding methods evaluated on the CAID 2 test data.

Figure 4. Plotting the embeddings of ordered and disordered residues. (a) A t-SNE projection of the pretrained embeddings of residues in the CAID 1 test set. The plot shows the kernel density estimates of ordered residues in blue and disordered residues in red. The labeled points indicate the mean position of each amino acid. This plot should be compared to the null model shown in Supplementary Figure 5. (b) A similar plot but with embeddings from a model finetuned to predict disordered regions. For both plots the two-sample Z-test is performed after reducing the dimensionality of the embedding to 1-D.

Figure 5. Comparison of the results of DR-BERT with its version without pretraining. (a) The ROC plots of DR-BERT and the non-pretrained model, evaluated on the CAID 1 test set. The area under each curve (AUC) is presented in the legend. (b) The MCC, F1 and AU-ROC scores of DR-BERT and the version without pretraining.

Figure 6. Application of DR-BERT to RPB6, a subunit of RNA polymerase. (a) The three-dimensional structure of RPB6 as it binds to the TFIIH p62 PH domain (PDB: 7DTI). The protein is colored by the DR-BERT score, which represents the probability that a given residue is disordered. (b) A plot of the DR-BERT scores for RPB6 shown for each position along the amino acid sequence. (c) A sample of DR-BERT's self-attention maps for each of the 6 layers in the model as it processes the RPB6 sequence. The attention maps have been log transformed; red cells indicate higher attention values, while blue cells indicate lower attention values.

Each attention layer consists of 12 heads, which can each capture different contextual information in parallel. The final output from the encoder layers is 1026 vectors, each of length 768, where the first corresponds to a standard summary [CLS] token and the last corresponds to a [SEP] separator token. Many of the hyperparameters used in this paper, including the hidden size of 768 and 12 attention heads, are based on our previous work in Nambiar et al. (2020).
Pretraining
Pretraining of DR-BERT used masked language modeling (MLM): in each example, the model is tasked with identifying hidden tokens. Following RoBERTa (Liu et al., 2019), the masks are set independently across epochs, and 15% of tokens are replaced with a [MASK] token for each example, with cross-entropy loss applied for every batch of proteins. Pretraining lasted for approximately 11 epochs, allowing the model to see 70 million examples. The batch size was set to 10 examples per device, and the model was trained on 2 NVIDIA V100s.
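The masking step of this procedure can be sketched as follows. The MASK token id and the ignore index of -100 are illustrative assumptions, and all selected tokens are replaced with the mask, matching the description above (the full RoBERTa recipe also occasionally substitutes random or unchanged tokens).

```python
import numpy as np

rng = np.random.default_rng(0)
MASK_ID, IGNORE = 24, -100       # assumed special-token id and ignore index

tokens = rng.integers(0, 20, size=1000)      # a batch of amino-acid token ids
n_mask = int(0.15 * len(tokens))             # mask 15% of tokens
masked_pos = rng.choice(len(tokens), size=n_mask, replace=False)

inputs = tokens.copy()
inputs[masked_pos] = MASK_ID                 # hide the chosen residues
labels = np.full(len(tokens), IGNORE)        # loss computed only where masked
labels[masked_pos] = tokens[masked_pos]      # model must recover these ids
print((inputs == MASK_ID).mean())            # fraction of masked tokens
```

Cross-entropy is then taken between the model's predictions at the masked positions and the stored labels, while the `IGNORE` positions contribute nothing to the loss.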