Abstract
Protein sequences can be viewed as a language; therefore, we can benefit from models initially developed for natural languages, such as transformers. ProtAlbert is one of the best pre-trained transformers on protein sequences, and its efficiency enables us to run the model on longer sequences with less computation power while achieving performance comparable to other pre-trained transformers. This paper includes two main parts: transformer analysis and profile prediction. In the first part, we propose five algorithms to assess the attention heads in different layers of ProtAlbert for five protein characteristics: nearest-neighbor interactions, type of amino acids, biochemical and biophysical properties of amino acids, protein secondary structure, and protein tertiary structure. These algorithms are run on 55 proteins extracted from CASP13 and three case study proteins whose sequences, experimental tertiary structures, and HSSP profiles are available. This assessment shows that although the model is only pre-trained on protein sequences, attention heads in the layers of ProtAlbert are representative of some protein family characteristics. This conclusion leads to the second part of our work. We propose an algorithm called PA_SPP for protein sequence profile prediction by pre-trained ProtAlbert using masked-language modeling. The PA_SPP algorithm can help researchers predict an HSSP profile when there are no sequences similar to a query sequence in the database for making the HSSP profile.
1. Introduction
Proteins consist of linear chains of twenty types of amino acids, each with different chemical properties. Proteins are the most versatile organic molecules in cells and living organisms and play critical roles in the body. The diversity of protein functions is generally related to their diverse structures. The sequence of amino acids determines a unique protein tertiary structure, which directly impacts its specific function1. New sequencing technologies have led to an explosion in the generation of biological data such as protein sequences over the past two decades. The UniProt2 Archive and Swiss-Prot3 databases contain most of the publicly available protein sequences globally, and these sequences grow exponentially every few years2. Despite the strong interest in protein structure determination, there is currently a massive gap between the number of known sequences and the number of experimentally determined structures deposited in the Protein Data Bank4 (PDB), highlighting the difficulties of structure elucidation5. Therefore, computationally predicting protein structure from a query sequence remains largely unsolved6,7,8. Homology modeling is a common approach for protein structure prediction. In this approach, homologous proteins of the query sequence are found by sequence comparison in a database. Then, a sequence profile is created to show the conservative and non-conservative regions in the homologous sequences9,10.
Profiles are used in many bioinformatics problems. For example, they are applied to model protein families11, predict protein domains12, detect protein homology13,14, design proteins15,16, and identify orthologous genes and proteins17. The Homology-derived Secondary Structure of Proteins (HSSP) database includes a sequence profile for each PDB protein. In HSSP, a Multiple Sequence Alignment (MSA) of putative homologs is prepared to construct a profile for each PDB protein. The list of homologous sequences is the result of an iterative database search in Swiss-Prot18. A well-defined profile can aggregate information from similar sequences over conserved regions, which helps us assign a query sequence to a family. This assignment is challenging when the query sequence is short and there is little similarity between this sequence and the sequences in the profile.
Protein structures are more conserved than protein sequences. Homologous proteins sharing a common evolutionary ancestor can have high sequence-level variations19, and when the protein sequence similarity is below 30% at the amino acid level, the alignment score usually falls into a twilight zone20,21. Therefore, simply comparing sequence similarities often fails to capture global structural and functional similarities of proteins.
Given the above discussion, improving profile prediction methods to extract more information about sequences and families is an active research area in bioinformatics. In this paper, our primary goal is to predict a profile for a query protein sequence using transformers.
In the following, we review transformer-based models that process protein sequences. Proteins, as linear chains of amino acids, can naturally be viewed as a language. Therefore, they can be modeled using Language Models (LMs) taken from Natural Language Processing (NLP). These LMs are used for biological sequence representation and as new prediction tools in various bioinformatics problems. The central concept behind this approach is to interpret protein sequences as sentences of characters (amino acids) and each character as a single word22,23,24. Recent research has shown that contextualized representations in NLP work well for contextual protein representation learning25,26. In the training phase, LMs learn to extract useful features from many samples and generate appropriate representations of these features27,28,29,30. In these papers, architectures inspired by NLP are employed for protein processing. Also, pre-training tasks such as Masked-Language Modeling (MLM) and autoregressive generation are utilized to investigate protein-specific pre-training tasks.
One of the latest architectures that has shown significant superiority over previous models is the transformer31. Devlin et al.32 introduced a transformer-based language representation model called Bidirectional Encoder Representations from Transformers (BERT). This model is designed to pre-train deep bidirectional representations from unlabeled text to create state-of-the-art models for a wide range of tasks. Bepler and Berger33 proposed a framework for mapping any protein sequence to a sequence of vector embeddings that encode structural information. They also defined a novel similarity measure between these arbitrary-length vectors to learn useful position-specific embeddings. Similarly, Alley et al.34 used a Recurrent Neural Network (RNN) named UniRep to learn statistical representations of proteins and demonstrated that such representations predict the stability of natural and de novo designed proteins, as well as the quantitative function of molecularly diverse mutants. Rao et al.35 introduced TAPE, a new benchmark consisting of five relevant semi-supervised tasks for assessing such protein representations.
Elnaggar et al.29 trained two auto-regressive language models (Transformer-XL, XLNet) and two auto-encoder models (BERT, ALBERT) on data extracted from the UniProt Reference Clusters (UniRef) datasets and the Big Fantastic Database (BFD). They showed the effects of these pre-trained models on the success of subsequent supervised training for predicting secondary structure, subcellular localization, and whether a protein is membrane-bound or water-soluble. Lu et al.36 applied the principle of mutual information maximization between local and global information as a self-supervised pre-training signal for protein embeddings, introducing a contrastive loss that trains an RNN to discriminate fragments from a source sequence versus randomly sampled fragments from other sequences. Min et al.37 introduced a novel pre-training scheme for protein sequence modeling called PLUS, consisting of masked language modeling and a complementary protein-specific pre-training task, namely same-family prediction. They showed the advantage of PLUS on six out of seven protein biology tasks. Sturmfels et al.38 introduced a new pre-training task for protein sequence models. They used profile hidden Markov models derived from MSAs as labels during pre-training for profile prediction. They evaluated the model on a set of five downstream tasks for protein modeling and demonstrated that it outperforms masked language modeling alone on all five tasks.
Although most previous studies on using transformer models for embedding protein sequences in different bioinformatics problems show acceptable results, they apply the model as a black box.
Here, we analyze heads in layers of a pre-trained transformer on protein sequences to find representative heads for some protein characteristics. The results of the analyses lead us to propose an algorithm for protein sequence profile prediction.
In the first step, we select pre-trained ProtAlbert because its efficiency enables us to run the model on longer sequences with less computation power while achieving performance comparable to other pre-trained transformers. Then, we propose five algorithms, called RLH_NNI, RH_SAA, RH_BBP, RH_PSS, and RH_PTS, to analyze five protein characteristics at the attention heads in the layers of ProtAlbert: nearest-neighbor interactions, type of amino acids, biochemical and biophysical properties of amino acids, protein secondary structure, and protein tertiary structure.
For this assessment, we make a dataset by extracting 55 proteins from CASP13 whose sequences, experimental tertiary structures, and HSSP profiles are available. In addition, we perform our analysis on three case study proteins to show that their results are consistent with the average results on CASP13.
After executing each of the transformer head analyzer algorithms, we reach the following results:
The RLH_NNI algorithm detects representative heads in the layers of the ProtAlbert model for interactions between amino acids located at distance k (1 ≤ k ≤ 5) on the protein sequence.
The RH_SAA algorithm finds specific heads for aspartic acid, glutamic acid, proline, tryptophan, and histidine.
The RH_BBP algorithm identifies representative heads for amino acids classified based on their R-group.
The RH_PSS algorithm identifies some heads which contain significant attention weights from helix to helix, coil to coil, and sheet to sheet.
The RH_PTS algorithm finds a representative head for the protein contact map, which is a simple tertiary structure representation.
Generally, these analyses show that the representative heads of ProtAlbert, pre-trained only on protein sequences, can detect protein family features. Therefore, we propose an algorithm called PA_SPP for sequence profile prediction by pre-trained ProtAlbert using MLM. Next, the predicted profiles are compared to the HSSP profiles. The results show high similarity between the predicted and HSSP profiles. The PA_SPP algorithm can help researchers predict a profile similar to the HSSP profile when there are no sequences similar to the query sequence in the database for making the HSSP profile.
2. Material and Method
This section first introduces the basic definitions needed to interpret ProtAlbert as a transformer model. Next, we propose five algorithms for assessing the layers and heads of ProtAlbert to identify some protein characteristics. Then, our approach is illustrated for the sequence profile prediction problem in more detail. In the end, we introduce the dataset used for evaluation.
2.1 Notation and Definition
The sequence of protein P with length n is represented by SP = s1s2…sn, si ∈ AA, where AA = {a1,…,a20} shows the set of amino acids. We define amino acid si+k as a k-neighbor of si in sequence SP. A positive (negative) value of k shows that position i attends from left to right (from right to left) along the sequence to find the neighboring amino acid at distance k.
In protein folding, sidechain-backbone nearest-neighbor interactions may restrict the accessible conformations of a protein chain39. Neighboring amino acids can be structurally categorized according to their separation in the primary sequence as proximal (1-4 positions apart) and otherwise distal40. For each protein P, the k-neighbor interaction is defined based on the interaction of each position i with position i + k on the sequence SP.
In addition to the effect of the nearest-neighbor amino acids on protein folding, each amino acid has different biochemical and biophysical properties that can effectively determine the protein structure. Amino acids are classified based on their R-group into five classes ℂ = {N, H, U, A, B} (see Table 1).
The experimental structures of proteins can be extracted from PDB. Therefore, the 3D coordinates of each atom of the amino acids in the protein sequence are available. Here, we represent the tertiary structure of protein P with length n by the contact map DP, defined as DP[i,j] = 1 if dis(cα(i), cα(j)) < θ and DP[i,j] = 0 otherwise (Eq. 1), where dis(·,·) is the Euclidean distance and cα(i) shows the 3D coordinate of the Cα atom of the amino acid at position i of the protein. The value of θ is set to 4.87 based on paper41. Each element DP[i,j] with value 1 indicates that the two amino acids si and sj are in contact.
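As a concrete illustration, this contact-map definition can be sketched in a few lines of NumPy; the function and variable names are ours, and only the 4.87 Å Cα-distance cutoff comes from the text:

```python
import numpy as np

# Distance cutoff (in angstroms) for two residues to be in contact,
# as cited in the text from paper 41.
THETA = 4.87

def contact_map(ca_coords):
    """Binary contact map from an (n, 3) array of C-alpha coordinates:
    D[i, j] = 1 iff the Euclidean distance between residues i and j
    is below THETA."""
    ca = np.asarray(ca_coords, dtype=float)
    # Pairwise Euclidean distances via broadcasting.
    diff = ca[:, None, :] - ca[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    return (dist < THETA).astype(int)
```

Note that the diagonal is trivially 1 (a residue is at distance zero from itself), which matters only if self-contacts are excluded downstream.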
The secondary structure of protein P is extracted from the tertiary structure using the DSSP software. This method provides eight classes: 3-helix, 4-helix, 5-helix, β-strand, β-bridge, turn, bend, and coil. Typically, the DSSP states are converted into three classes using the following convention: 3-helix, 4-helix, and 5-helix are considered helix (H); β-strand and β-bridge are displayed as sheet (E); the rest of the states are shown as coil (C). The secondary structure of protein P with length n is displayed as SSP = ss1ss2…ssn, where ssi ∈ {H, E, C}.
As mentioned in42, the secondary structure of each position in the protein sequence depends on its neighbors. The length of each type of regular secondary structure43 is about 6. We define a secondary structure matrix named HP on protein P with length n as HP[i,j] = 1 if ssi = ssj and |i − j| < 7, and HP[i,j] = 0 otherwise (Eq. 2), where HP[i,j] = 1 indicates the same secondary structure between the two amino acids si and sj at distance less than 7 in sequence SP.
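A minimal sketch of this matrix, assuming the three-state string representation and the strict |i − j| < 7 window described above (the function name is ours):

```python
import numpy as np

def ss_matrix(ss):
    """Secondary structure matrix H for a three-state string ss over
    {'H', 'E', 'C'}: H[i, j] = 1 iff positions i and j share the same
    state and are fewer than 7 positions apart in the sequence."""
    n = len(ss)
    H = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(n):
            if ss[i] == ss[j] and abs(i - j) < 7:
                H[i, j] = 1
    return H
```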
For each protein P with length n in the PDB database, a profile named RP is extracted from the HSSP database18. In this database, there is an MSA of all available homologous sequences properly aligned to the protein sequence SP. This MSA is constructed by searching the Swiss-Prot database considering the sequence family and structure. Each sequence of the MSA is more than 30% identical to SP. Using the MSA, the profile is generated such that RP[i,j] shows the probability of amino acid aj ∈ AA at position i of the MSA.
In the following, we assume that dataset Δ ={P1,…,Pt} includes proteins where their sequences, experimental tertiary structures, and HSSP profiles are available.
2.2 ProtAlbert as a pre-trained transformer model on protein sequences
As described earlier, protein sequences can be viewed as a language, and therefore, we can benefit from using the models initially developed for natural languages. One of the latest architectures that showed significant superiority over previous models is transformers.
As mentioned, BERT32 is a method of pre-training language representations: after training a general-purpose language understanding model on a large corpus of text, the model can be used on downstream tasks. BERT is an example of autoencoding language modeling trained using MLM. During training, 15% of the input is randomly masked, and the model is asked to predict the masked tokens. This process lets the model predict the masked tokens based on the other available tokens, showing that the model has a good grasp of the language and the context. This self-supervised pre-training method, meaning the labels are in the training corpus, achieved better results on many downstream tasks.
A year after BERT32, ALBERT44 was released by Google Research, improving state-of-the-art performance on 12 NLP tasks. The main idea in ALBERT was to allocate capacity more efficiently. Its authors made two design changes to BERT while keeping the MLM training process. First, while the input-level embeddings need to be context-independent representations, the hidden-state embeddings need to take context into account. This was addressed by splitting the embedding matrix between a low-dimension input-level embedding of length 128 and a higher-dimension hidden-layer embedding of size 4096. The second critical change was removing redundancy, thereby increasing the capacity of the model to learn. Previously, it was observed that the various layers of BERT, with different parameters, learned similar operations. This possible redundancy was eliminated in ALBERT by parameter sharing across layers. These two design changes resulted in a 90% parameter reduction compared to BERT with slightly decreased accuracy. However, this reduction allows scaling the hidden size from 768 in BERT to 4096 in ALBERT. It has been shown that bigger hidden-layer embeddings can capture and represent the context better44.
We base our experiments on ProtAlbert, a transformer-based model on ALBERT architecture from the ProtTrans project29. ProtAlbert is pre-trained on 216 million protein sequences from the UniRef100 dataset. In this paper, we do not train or fine-tune the model. In the ProtAlbert model, the protein sequences are tokenized using a single space between each amino acid (indicating words), and each sequence is stored in a separate line (indicating sentences). Also, all non-generic or unresolved amino acids (B,O,U,Z) are mapped to the unknown token X. This model can process sequences with lengths of up to 40K, although this length is bound by the hardware capacity. The details of the ProtAlbert model are available in Table 2.
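The described input convention can be reproduced with a small helper; the function name is ours, and ProtTrans ships its own tokenizer, so this only illustrates the preprocessing stated above (residues space-separated, B/O/U/Z mapped to X):

```python
import re

def preprocess(seq):
    """Format a raw amino-acid sequence for ProtAlbert-style input:
    map non-generic or unresolved residues (B, O, U, Z) to the unknown
    token X, then separate residues with single spaces."""
    seq = re.sub(r"[BOUZ]", "X", seq.upper())
    return " ".join(seq)
```

For example, `preprocess("MKBZ")` yields `"M K X X"`.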
Our work contains two main parts: transformer analysis and profile prediction. In the first part, a protein sequence is given as an input to the ProtAlbert transformer. Then, we analyze and interpret the attention weights at the attention heads in different layers. In the second part, the protein profile is predicted using ProtAlbert and masked token prediction. In other words, a protein sequence with some masked amino acids is fed to the model to predict the most likely amino acids at the masked positions.
We choose ProtAlbert29 because its efficiency enables us to run the model on longer sequences with less computation power while achieving performance comparable to ProtBert29, which is a great advantage. The ProtBert model is a pre-trained BERT-based language model with 420M parameters from the ProtTrans project that has been trained on the same dataset as the ProtAlbert model, which has 224M parameters.
2.3 Proposed algorithms for analyzing ProtAlbert transformer to identify protein characteristics
In this sub-section, we propose five algorithms to analyze the attention heads and layers of ProtAlbert for finding the specific properties of proteins (see Table 3). This analysis is essential because it shows that ProtAlbert transformer can learn some biological features from only protein sequences. It allows us not to apply the transformer as a black box but to select the ProtAlbert features specific to the bioinformatics problems.
The input and output of this assessment are defined as follows:
Input: Sequence of protein P.
Output: Attention matrices extracted from ProtAlbert for each head h in layer l to interpret the properties of protein P displayed in Table 3.
ProtAlbert includes 12 encoder layers, and each encoder has 64 attention heads. Each protein sequence is given to the model as an input, then it goes through the encoder layers, and the attention mechanism in each layer generates output to go to the next layer.
For the input sequence SP with length n, each attention head h (1 ≤ h ≤ 64) in layer l (1 ≤ l ≤ 12) produces a matrix of positive attention weights named AP,l,h. The value AP,l,h[i,j] shows the attention weight from amino acid si to sj, and the weights in each row sum to 1. So, each amino acid at head h in layer l can attend to all other amino acids in the sequence, but the level of attention is determined by AP,l,h[i,j].
Based on the attention matrix AP,l,h, the adjacency matrix MP,l,h is constructed as MP,l,h[i,j] = 1 if AP,l,h[i,j] ≥ θ, and MP,l,h[i,j] = 0 otherwise (Eq. 3), where the value of θ is determined by the application. We use the attention and adjacency matrices to introduce our approaches for quantifying representative heads in layers for some protein features (see Table 3).
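The binarization amounts to elementwise thresholding of the attention matrix; a minimal sketch, assuming a ≥ comparison against the application-dependent cutoff θ (names are ours):

```python
import numpy as np

def adjacency(attn, theta):
    """Binarize an attention matrix: M[i, j] = 1 iff the attention
    weight attn[i, j] is at least theta (sketch of the Eq. 3 rule;
    the >= direction is an assumption)."""
    return (np.asarray(attn) >= theta).astype(int)
```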
2.3.1 RLH_NNI algorithm to quantify the representative heads and layers of ProtAlbert for nearest-neighbor interaction
We propose the RLH_NNI algorithm to determine whether head h in layer l of the ProtAlbert model represents the interaction of k-neighbor amino acids in dataset Δ. The main steps of this algorithm are defined as follows:
For each protein P ∈ Δ = {P1,…,Pt},
The interaction of k-neighbor amino acids in the sequence SP (|SP| = n) is quantified as the number of positions i with MP,l,h[i, i+k] = 1, where adjacency matrix MP,l,h is generated based on Eq. 3 for each head h in layer l.
The normalized k-neighbor interaction NP,l,h[k] is defined by dividing this count by n − |k|, where |·| shows the absolute value function. For each head h in layer l, NP,l,h[k] indicates the percentage of positions in protein P which attend to their kth neighboring amino acid.
The weighted quantification of k-neighbor interaction, WP,l,h[k], is computed analogously from the attention matrix, using the attention weights AP,l,h[i, i+k] instead of the binary adjacency entries.
The average of the normalized k-neighbor interaction is computed over dataset Δ.
The maximum of this average over k at each head in each layer is computed to determine the nearest-neighbor radius kmax of interaction.
For each 1 ≤ l ≤ 12 and 1 ≤ h ≤ 64, if this maximum exceeds the cutoff,
Head h in layer l is announced representative for kmax-neighbor interaction on dataset Δ.
For head h in layer l, the average of the weighted quantification of kmax-neighbor interaction is computed on dataset Δ.
The average of the weighted quantification and the average of the normalized interaction are compared to show the effect of discretizing the attention weights in the adjacency matrix.
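The counting step above can be sketched as follows, assuming the normalized interaction is the fraction of valid positions i whose adjacency entry M[i, i+k] equals 1 (function and variable names are ours):

```python
import numpy as np

def k_neighbor_fraction(M, k):
    """Fraction of sequence positions i whose k-neighbor entry
    M[i, i + k] is 1, over all i for which i + k is a valid index;
    k may be negative (left neighbors)."""
    n = M.shape[0]
    idx = [(i, i + k) for i in range(n) if 0 <= i + k < n]
    hits = sum(M[i, j] for i, j in idx)
    return hits / len(idx)
```

The weighted variant would sum the raw attention weights A[i, i+k] over the same index set instead of the binary entries.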
2.3.2 RH_SAA algorithm to quantify representative heads of ProtAlbert for specific amino acids
Here, we introduce the RH_SAA algorithm to investigate if the head h of ProtAlbert attends significantly to a specific amino acid in the protein dataset Δ. To quantify the quality of attention head h for amino acid a ∈ AA, we apply the F-measure criterion to evaluate the occurrence rate of amino acid a versus the rest in this head. In the following, this algorithm is described in more detail:
For each protein P ∈ Δ = {P1,…,Pt},
The true positive count is defined as the number of occurrences of amino acid a in the protein sequence that are attended by at least one position of the sequence at head h in at least one layer, where adjacency matrix MP,l,h is generated based on Eq. 3 for head h in layer l.
The false positive count is the number of amino acids q ≠ a attended by at least one position of the sequence at head h in at least one layer.
The false negative count is the frequency of amino acid a ∈ AA in sequence SP minus the true positive count, i.e., the occurrences of a that are not attended.
The F-measure criterion is computed to quantify head h for amino acid a as the harmonic mean of the resulting precision and recall.
The relative occurrence of amino acid a of protein P in head h is computed.
The weighted occurrence of amino acid a of protein P is calculated from the attention weights attending to the occurrences of a.
The normalized weighted occurrence of amino acid a of protein P is calculated by normalizing the weighted occurrence.
The average of the F-measure is computed on dataset Δ.
The candidate representative head for amino acid a is selected as the head with the maximum average F-measure.
For each a ∈ AA, if the average F-measure at the candidate head exceeds the cutoff:
The candidate head is announced as a representative head for amino acid a.
At this head, the average of the normalized weighted occurrence of amino acid a is computed on dataset Δ, where the normalized weighted occurrence of amino acid a shows the effect of attention weights attending from each amino acid to a at this head.
The average of the relative occurrence of amino acid a at this head is computed on dataset Δ, where the relative occurrence of amino acid a shows the probability of detecting amino acid a at this head.
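A minimal sketch of the per-amino-acid F-measure, assuming the set of attended positions has already been collected across layers from the adjacency matrices (the set-based interface and names are ours):

```python
def f_measure(attended, seq, a):
    """F-measure of head coverage for amino acid `a`:
    true positives  = attended positions holding `a`,
    false positives = attended positions holding other amino acids,
    false negatives = unattended occurrences of `a` in `seq`.
    `attended` is a set of 0-based sequence positions."""
    tp = sum(1 for i in attended if seq[i] == a)
    fp = len(attended) - tp
    fn = seq.count(a) - tp
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

For instance, with sequence "AADA" and attended positions {0, 2}, one of three A's is attended and one attended position is a D, giving precision 1/2, recall 1/3, and F-measure 0.4.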
2.3.3 RH_BBP algorithm to quantify representative heads of ProtAlbert for biochemical and biophysical properties of amino acids
In this sub-section, we illustrate the RH_BBP algorithm to find representative heads of ProtAlbert for biochemical and biophysical properties using the classification ℂ = {N, H, U, A, B} of amino acids based on the R-group. Table 1 shows this classification. The algorithm is very similar to RH_SAA, which identifies specific heads for amino acids. In the following, the details of RH_BBP are available:
For each protein P ∈ Δ = {P1,…,Pt},
For class C ∈ ℂ, the true positive count is defined as the number of amino acids from class C in protein sequence SP (|SP| = n) attended by at least one position of the sequence at head h in at least one layer.
For class C ∈ ℂ, the false positive count is the number of amino acids from a class q ≠ C attended by at least one position of the sequence SP at head h in at least one layer.
For class C ∈ ℂ, the false negative count is the frequency of the amino acids from class C in sequence SP minus the true positive count.
For class C ∈ ℂ, the F-measure criterion is computed to quantify head h for this class.
The relative occurrence of class C for protein P in head h is computed.
The weighted occurrence of class C for protein P is calculated from the attention weights attending to the amino acids in class C.
The normalized weighted occurrence of class C for protein P is calculated by normalizing the weighted occurrence.
The average of the F-measure is computed on dataset Δ.
The candidate representative head for class C is selected as the head with the maximum average F-measure.
For each C ∈ ℂ, if the average F-measure at the candidate head exceeds the cutoff:
The candidate head is announced as a representative head for class C.
At this head, the average of the normalized weighted occurrence of class C is computed on dataset Δ, where the normalized weighted occurrence of class C shows the effect of attention weights attending from each amino acid to the amino acids in class C at this head.
The average of the relative occurrence of class C at this head is computed on dataset Δ, where the relative occurrence of class C shows the probability of detecting class C at this head.
2.3.4 RH_PSS algorithm to quantify representative heads of ProtAlbert for protein secondary structure
Although ProtAlbert has only been pre-trained on protein sequences, we propose the RH_PSS algorithm on dataset Δ to assess attention heads with respect to the protein secondary structure matrix (see Eq. 2). The details of this algorithm are as follows:
For each protein P ∈ Δ ={P1,…,Pt},
Predicting the secondary structure matrix ℋP,h of protein P with length n for each head h, where adjacency matrix MP,l,h is constructed based on Eq. 3 for each head h in layer l.
Making the natural secondary structure matrix HP for protein P based on Eq. 2.
Computing the cosine similarity cos(ℋP,h, HP) between ℋP,h and HP for each head h, 1 ≤ h ≤ 64.
The average of the cosine similarity is computed over dataset Δ for each head.
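The matrix comparison used here (and again in RH_PTS) is plain cosine similarity over the flattened matrices; a sketch, with the function name ours:

```python
import numpy as np

def matrix_cosine(A, B):
    """Cosine similarity between two equally-shaped matrices, treated
    as flat vectors; returns 0.0 if either matrix is all zeros."""
    a = np.asarray(A, dtype=float).ravel()
    b = np.asarray(B, dtype=float).ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0
```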
2.3.5 RH_PTS algorithm to quantify representative heads of ProtAlbert for protein tertiary structure
We propose the RH_PTS algorithm to compare the natural protein contact map to the predicted contact map from head h on dataset Δ. The main steps of this algorithm are as follows:
For each protein P ∈ Δ = {P1,…,Pt},
Making a matrix for each head h by combining the attention matrices over the layers, where AP,l,h represents the attention matrix of protein P in layer l and head h.
Normalizing the values of this matrix.
Predicting the contact map 𝒟P,h based on the normalized matrix.
Making the real contact map DP for protein P based on Eq. 1.
Computing the cosine similarity cos(𝒟P,h, DP) between 𝒟P,h and DP for each head h, 1 ≤ h ≤ 64.
Computing the average of the cosine similarity over dataset Δ.
Finding the head hmax with the maximum average similarity between the natural and predicted contact maps, where head hmax is known as the representative head for contact maps.
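The per-head pipeline can be sketched as below; averaging the head's attention over the layers, min-max normalization, and the 0.5 cutoff are our assumptions about the combination, normalization, and thresholding steps:

```python
import numpy as np

def predicted_contacts(attn_stack, theta=0.5):
    """Sketch of a per-head contact prediction: attn_stack holds this
    head's attention matrices, one per layer. Average over layers
    (assumption), min-max normalize to [0, 1] (assumption), then
    threshold at theta to obtain a binary predicted contact map."""
    B = np.mean(np.asarray(attn_stack, dtype=float), axis=0)
    B = (B - B.min()) / (B.max() - B.min())  # min-max normalization
    return (B >= theta).astype(int)
```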
2.4 Proposed algorithm for sequence profile prediction problem
In the second part of our work, we propose the PA_SPP algorithm for the sequence profile prediction problem. The input and output of this problem are defined as follows:
Input: Sequence of protein P.
Output: Predicted profile ℛP using pre-trained ProtAlbert.
To solve this problem, we apply pre-trained ProtAlbert to predict the masked token of an input sequence containing an unknown amino acid at one position of the sequence SP. The ProtAlbert model generates the most likely amino acids for that position. In other words, the model predicts the masked amino acid in the sequence based on the context of the other amino acids surrounding it. This process is called masked token prediction; it generates two vectors ϒP and ΠP, which represent the types of amino acids and the score of each amino acid replaced at the masked position in the sequence SP. For each amino acid in ϒP, the corresponding element of ΠP shows the score of substituting that amino acid at position i of sequence SP. Figure 1 illustrates the PA_SPP algorithm for solving the profile prediction problem. In the first step, the sequence SP is given as an input to the algorithm. In the second step, a zero matrix named ℛP is defined, where ℛP[i,j] is updated during the algorithm run by predicting the probability of the jth amino acid at the ith position of sequence SP. The third step selects each position i, 1 ≤ i ≤ n, in sequence SP for masking. In the fourth step, temporary memory T is defined to keep the sequence SP with masked position i. In the fifth step, sequence T is fed to the masking process of ProtAlbert. The model generates the two vectors ϒP and ΠP for position i. In the sixth step, we set the probability vector ΠP into the ith row of matrix ℛP according to the order of amino acids in ϒP. In the end, ℛP is called the predicted profile for protein P.
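The six steps can be sketched as a simple loop; `predict_masked` is a hypothetical stand-in for ProtAlbert's masked-token prediction (in practice it would run the pre-trained model on the sequence with position i masked), and all names here are ours:

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def predict_profile(seq, predict_masked):
    """Sketch of the PA_SPP loop: mask each position i in turn and fill
    row i of the profile with the model's probabilities over the 20
    amino acids. `predict_masked(seq, i)` must return a length-20
    probability vector ordered as in AA."""
    n = len(seq)
    R = np.zeros((n, len(AA)))
    for i in range(n):
        # Steps 3-6: mask position i, query the model, store row i.
        R[i] = predict_masked(seq, i)
    return R
```

Plugging in a real masked-language model only changes `predict_masked`; the profile-filling loop itself stays the same.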
2.5 Dataset
In this study, we use the CASP13 dataset, which includes 194 proteins. We select 55 proteins (see Supplementary 1) whose profiles are available in the HSSP database. We call the selected proteins from CASP13 dataset Δ, where |Δ| = 55. The tertiary structure and sequence of each protein P ∈ Δ are extracted from the PDB database. In addition, their HSSP profiles are downloaded from the XSSP site.
The distribution of the extracted target sequence lengths is shown in Figure 2. In addition, Figure 3 represents the frequency of amino acids in the sequences of dataset Δ.
In addition to dataset Δ, we select three essential proteins (see Table 4) in different organisms for case studies to show that our result is generally reliable. The details of these proteins are available in Supplementary 2.
3. Result and Discussion
In this section, we apply Δ ⊆ CASP13 and three case study proteins, LuxB, Mpro, and Taq, to analyze ProtAlbert as a pre-trained transformer on protein sequences. We find representative heads of ProtAlbert for five protein characteristics (Table 3). This part assures us that the heads contain the information required by a family of proteins. Then, we use this dataset for profile prediction. In the end, we compare the predicted profiles to the HSSP profiles.
3.1 Analyzing ProtAlbert as a pre-trained transformer on protein sequences
Here, we find representative heads in the layers of ProtAlbert for the five protein characteristics displayed in Table 3 using the algorithms RLH_NNI, RH_SAA, RH_BBP, RH_PSS, and RH_PTS. In these algorithms, we use cutoffs obtained by trial and error. Cutoffs are set high for sequence feature analysis because ProtAlbert has been pre-trained on protein sequences; for structural feature analysis, cutoffs are set low.
3.1.1 Assessment of nearest-neighbor interactions at heads in layers of ProtAlbert
As mentioned in40, k-neighbor interaction with −4 ≤ k ≤ 4 and k ≠ 0 is known as proximal interaction, which is effective in the first step of protein folding. Here, we apply the RLH_NNI algorithm on dataset Δ ⊆ CASP13 to find the representative heads in the layers of ProtAlbert for the nearest-neighbor radius of amino acid interactions.
In Eq. 4 and Eq. 5 of this algorithm, we consider threshold 0.5 to make an adjacency matrix from the attention matrix. In the fourth step of RLH_NNI, we select representative head h in layer l for the interaction of kmax-neighbor amino acids in dataset Δ if the average normalized interaction exceeds the cutoff. For each selected head h in layer l and protein P ∈ {LuxB, Mpro, Taq}, the normalized kmax-neighbor interaction is computed. Table 5 shows the representative heads in layers for the interaction of kmax-neighbor amino acids. The results show that the average of the normalized kmax-neighbor interaction on the dataset is close to the normalized kmax-neighbor interaction on each case study protein. Also, the average of the weighted quantification of kmax-neighbor interaction is calculated, and the weighted quantification for each case study protein P is available in this table. The W values are close to the N ones, which shows that the attention weights are high at the kmax-neighbor positions on the dataset and the case study proteins. The results show that
head 10 in layers 2–9, head 21 in layers 1–9, and heads 14 and 44 in layer 1 represent interactions at one position apart.
head 23 in layers 1–8 and head 33 in layer 1 are specific for interactions at two positions apart.
heads 3 and 51 in layer 1 indicate interactions between each amino acid and its third neighbor in the sequence.
head 51 in layers 2–8, head 53 in layers 1–8, and head 2 in layers 1–2 represent the interaction between each amino acid and its fourth neighbor in the sequence.
head 56 in layer 1 is specific for interactions at five positions apart.
In conclusion, we have identified the representative heads in different layers for proximal positions in proteins. According to 40, proximal positions are essential in the first step of protein folding.
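The adjacency-matrix thresholding and k-neighbor quantification described above can be sketched as follows. This is a minimal illustration in our own notation, not the paper's implementation; the function name and the per-position counting scheme are assumptions.

```python
import numpy as np

def kneighbor_interaction(attention, k, threshold=0.5):
    """Binarize an attention matrix at `threshold` (the Eq. 4/5 analogue)
    and return the fraction of positions whose k-th neighbor is attended.
    A sketch of the RHL_NNI idea; names and normalization are ours."""
    adj = (attention >= threshold).astype(int)  # adjacency matrix
    n = adj.shape[0]
    # count positions i for which the entry (i, i+k) survives thresholding
    hits = sum(adj[i, i + k] for i in range(max(0, -k), n - max(0, k)))
    return hits / (n - abs(k))

# toy attention matrix concentrating all weight one position to the right
att = np.zeros((5, 5))
for i in range(4):
    att[i, i + 1] = 0.9
print(kneighbor_interaction(att, 1))  # 1.0: every first neighbor attended
```

A head whose value stays high for some fixed k across the dataset would be a candidate representative head for that k-neighbor interaction.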
3.1.2 Assessment of the type of amino acids at heads of ProtAlbert
In this sub-section, we use the RH_SAA algorithm to find a representative head for each amino acid on dataset Δ ⊆ CASP13. In Eq.6 and Eq.7 of this algorithm, we use threshold 0.4 to make an adjacency matrix from the attention matrix. In the third step of RH_SAA, we select a candidate representative head for amino acid a. At the fourth step, this head is announced as a representative head for amino acid a if it satisfies the condition of that step. Meanwhile, we compute the F-measure criterion for amino acid a in each protein P ∈ {LuxB, Mpro, Taq} at the selected head. Table 6 shows that the average F-measure on the dataset is similar to the F-measure of each case study.
Moreover, this table reports the average relative occurrence of amino acid a in dataset Δ and in each case study protein P ∈ {LuxB, Mpro, Taq}, as well as the average normalized weighted occurrence of amino acid a in dataset Δ and in each case study protein P. As a result, we find that
the average F-measure on the dataset is close to the case-study ones,
heads 8 and 18 can represent the hydrophilic acidic amino acids, aspartic acid (D) and glutamic acid (E),
heads 13, 20, and 63 are specific for proline (P), tryptophan (W), and histidine (H), respectively.
To better understand the selected heads for specific amino acids, Figure 4 shows the weighted stacking of amino acids at attention heads 8, 13, 18, 20, and 63.
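The F-measure used to score how specifically a head attends to one amino-acid type can be illustrated as below: the positions a head attends to are treated as predictions, and the positions actually holding the amino acid as ground truth. This is our reconstruction of the criterion; the function name and the way attended positions are extracted are assumptions.

```python
def head_fmeasure(sequence, attended_positions, amino_acid):
    """F-measure of a head's attended positions against the positions of
    one amino-acid type (illustrative sketch; names are ours)."""
    truth = {i for i, a in enumerate(sequence) if a == amino_acid}
    attended = set(attended_positions)
    tp = len(attended & truth)  # attended positions that hold the amino acid
    if tp == 0:
        return 0.0
    precision = tp / len(attended)
    recall = tp / len(truth)
    return 2 * precision * recall / (precision + recall)

# toy sequence: head attends positions 0, 2, 4; aspartic acid D sits at 0, 2
print(head_fmeasure("DADPE", [0, 2, 4], "D"))  # 0.8
```

A representative head for amino acid a would keep this score high across the proteins of the dataset.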
3.1.3 Assessment of biochemical and biophysical properties of amino acids at heads of ProtAlbert
In the previous sub-section, we found representative heads 8 and 18 for amino acids D and E, which are hydrophilic acidic amino acids. In the following, we assess the heads in layers to find more biochemical and biophysical properties based on the R-group of amino acids. This classification, ℂ = {N, H, U, A, B}, is shown in Table 1. To do the assessment, we apply the RH_BBP algorithm on dataset Δ ⊆ CASP13. In Eq.8 and Eq.9 of this algorithm, we use threshold 0.4 to make an adjacency matrix from the attention matrix. In the third step of RH_BBP, the head with the maximum quantity for class C ∈ ℂ is selected. At the fourth step, we announce that this head is representative for class C if it satisfies the condition of that step. Meanwhile, we compute the F-measure for each protein P ∈ {LuxB, Mpro, Taq} at the selected head. Table 7 shows that the average F-measure for representative class C at this head is similar to the case-study ones.
Moreover, this table reports the average relative occurrence of class C for dataset Δ and for each case study protein P, as well as the average weighted occurrence of class C on the dataset and on each case study protein P at this head.
In conclusion, representative heads 8, 44, and 49 show hydrophilic acidic, hydrophobic aliphatic, and hydrophobic aromatic amino acids, respectively. Also, head 43 can represent both polar and basic amino acids. Figure 5 consists of the weighted stacking of amino acids at attention heads 8, 43, 44, and 49.
3.1.4 Assessment of the protein secondary structure at heads of ProtAlbert
ProtAlbert has been pre-trained only on protein sequences, but we use the RH_PSS algorithm on dataset Δ ⊆ CASP13 to show that some heads with high attention weights attend from helix to helix, sheet to sheet, and coil to coil. In Eq.10 of this algorithm, we use threshold 0.1 to make an adjacency matrix from the attention matrix. At step two of RH_PSS, the average cosine similarity between the predicted and natural secondary structure matrices is computed.
Figure 6 shows the heatmaps at each head h, 1 ≤ h ≤ 64, on dataset Δ ⊆ CASP13. In addition, the cosine similarity, cos(ℋP,h, HP,h), is computed for each case study protein P ∈ {Taq, Mpro, LuxB}. High similarity between the predicted and natural protein secondary structure matrices can be seen at heads 2, 3, 8, 9, 10, 13, 14, 18, 20, 21, 23, 32, 49, 51, 53, 56, and 63. Some of these heads are shared with the nearest-neighbor interaction heads. After removing the shared heads, we find that heads 8, 9, 13, 18, 20, 32, 49, and 63 are informative only about the secondary structure. These heads show more attention from the secondary structure of each amino acid to the same structure within a neighborhood of fewer than six amino acids.
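The comparison in RH_PSS amounts to a cosine similarity between two matrices flattened into vectors. The sketch below shows that step in isolation; the function name and the toy matrices are ours, not the paper's.

```python
import numpy as np

def matrix_cosine(A, B):
    """Cosine similarity between two matrices flattened to vectors,
    as used to compare predicted and natural secondary structure
    matrices (illustrative sketch)."""
    a, b = A.ravel(), B.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# identical helix-to-helix / sheet-to-sheet patterns give similarity 1.0
nat = np.array([[1.0, 0.0], [0.0, 1.0]])
print(matrix_cosine(nat, nat))  # 1.0
```

A head scoring near 1.0 on average over Δ attends within secondary-structure elements much as the natural structure would suggest.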
3.1.5 Assessment of the protein tertiary structure at heads of ProtAlbert
In this sub-section, we compute the similarity between the natural contact map (see Eq.1) and the predicted contact map of protein P using the RH_PTS algorithm on dataset Δ ⊆ CASP13. In this algorithm, threshold 0.1 is used in Eq.11 to discretize the predicted contact map.
Table 8 shows the average similarity between the predicted and natural contact maps at hmax = 10 on dataset Δ, obtained from step three of the algorithm, together with the value calculated based on the second step of RH_PTS. In addition, the cosine similarity is computed for each case study protein P ∈ {Taq, Mpro, LuxB}. It seems that head 10 carries appropriate information about contact maps.
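The two maps being compared can be built as sketched below: the predicted map by discretizing an attention matrix at 0.1 (the Eq.11 step), and the natural map from residue coordinates. The symmetrization, the 8 Å cutoff, and all names are our assumptions, not values taken from the paper.

```python
import numpy as np

def predicted_contact_map(attention, threshold=0.1):
    """Discretize an attention matrix into a symmetric 0/1 contact map
    (sketch of the Eq.11 step; symmetrization is an assumption)."""
    contacts = (attention >= threshold).astype(int)
    return contacts | contacts.T

def natural_contact_map(coords, cutoff=8.0):
    """Mark residue pairs whose coordinates (e.g. C-alpha atoms) lie
    within `cutoff` angstroms; Eq.1's exact cutoff is an assumption."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return (d <= cutoff).astype(int)

# three residues on a line: only the first two are in contact
coords = np.array([[0.0, 0.0, 0.0], [3.0, 0.0, 0.0], [20.0, 0.0, 0.0]])
print(natural_contact_map(coords))
```

Flattening both maps and applying cosine similarity, as in the previous sub-section, gives the per-head score reported in Table 8.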
3.2 Predicting profile using ProtAlbert
The above assessment shows that transformers can extract some protein features from the sequence to represent the protein family. These features can lead us to appropriate information about the homologous sequences of each protein sequence given as input to ProtAlbert. Therefore, the PA_SPP algorithm (see Figure 1) employs pre-trained ProtAlbert to predict a profile for a query sequence. Here, we compare the predicted profiles to real ones obtained from the homologous sequences (HSSP profiles).
For each protein P ∈ Δ ⊆ CASP13 and the three case study proteins, Taq, Mpro, and LuxB, the PA_SPP algorithm predicts profile ℛP. Then, we compare the predicted profile ℛP to the HSSP profile RP using cosine similarity. We want to show that the predicted profile is close to the HSSP profile. It should be noted that some HSSP profiles are more reliable than others because the number of sequences aligned to the query sequence differs. For example, some profiles are obtained from fewer than 100 aligned sequences, and some are built from more than 1000 aligned sequences. Therefore, an HSSP profile constructed from more aligned sequences is more reliable, and the average similarity between the predicted and HSSP profiles is weighted by the number of aligned sequences. Figure 7 shows that the predicted profiles are more similar to the HSSP profiles that have more aligned sequences.
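The weighted average described above is a straightforward reliability-weighted mean; a minimal sketch, with the function name and toy numbers being our own:

```python
def weighted_average_similarity(similarities, n_aligned):
    """Average of per-protein profile similarities, weighted by the number
    of sequences aligned to each query (sketch of the evaluation above)."""
    total = sum(n_aligned)
    return sum(s * n for s, n in zip(similarities, n_aligned)) / total

# a profile backed by 900 alignments counts far more than one backed by 100
print(weighted_average_similarity([0.9, 0.5], [900, 100]))  # 0.86
```

This weighting keeps unreliable HSSP profiles (few aligned sequences) from dominating the reported agreement.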
4. Conclusion
This paper contained two parts: ProtAlbert model analysis and profile prediction. Most previous studies used pre-trained transformer models as a black box to generate embeddings for protein sequences in different bioinformatics problems. Here, we set out to find the representative heads in layers for some protein characteristics. For this assessment, we used ProtAlbert because its efficiency enables us to run the model on longer sequences with less computation power while achieving performance similar to the other pre-trained transformers on proteins, which is a great advantage for us.
In this study, we did not train or fine-tune ProtAlbert. In other words, we used pre-trained ProtAlbert to determine the interaction of nearest-neighbor amino acids, type of amino acids, biochemical and biophysical properties of amino acids, protein secondary structures, and tertiary structures at attention heads in different layers. This analysis is crucial because it shows that ProtAlbert learns some protein family features from only sequences. It led us to propose an algorithm called PA_SPP for profile prediction from a query sequence using ProtAlbert. The results showed that the predicted profile is close to the profile obtained from the homologous sequences.
We believe that the proposed algorithm for profile prediction can help researchers make a profile for a query sequence when there are no similar sequences to the query sequence in the database. In the future, this predictor can be improved with new transformer models.
Footnotes
1 This article was submitted to “proteins-structure function and bioinformatics” journal : 18 September 2021
↵i Representative Heads in Layers of ProtAlbert for Nearest Neighbor Interactions
↵ii Representative Heads of ProtAlbert for Specific Amino Acids
↵iii Representative Head of ProtAlbert for Biochemical and Biophysical Properties of amino acids
↵iv Representative Heads of ProtAlbert for Protein Secondary Structure
↵v Representative Heads of ProtAlbert for Protein Tertiary Structure
↵vi Using ProtAlbert for Sequence Profile Prediction
↵vii https://microbenotes.com/amino-acids-properties-structure-classification-and-functions/
↵x https://www.predictioncenter.org/casp13/domains_summary.cgi