Abstract
The enzymes of the GCN5-related N-acetyltransferase (GNAT) superfamily count more than 870 000 members through all kingdoms of life and share the same structural fold. GNAT enzymes transfer an acyl moiety from acyl coenzyme A to a wide range of substrates including aminoglycosides, serotonin, glucosamine-6-phosphate, protein N-termini and lysine residues of histones and other proteins. The GNAT subtype of protein N-terminal acetyltransferases (NATs) alone targets a majority of all eukaryotic proteins stressing the omnipresence of the GNAT enzymes. Despite the highly conserved GNAT fold, sequence similarity is quite low between members of this superfamily even when substrates are similar. Furthermore, this superfamily is phylogenetically not well characterized. Thus functional annotation based on homology is unreliable and strongly hampered for thousands of GNAT members that remain biochemically uncharacterized. Here we used sequence similarity networks to map the sequence space and propose a new classification for eukaryotic GNAT acetyltransferases. Using the new classification, we built a phylogenetic tree, representing the entire GNAT acetyltransferase superfamily. Our results show that protein NATs have evolved more than once on the GNAT acetylation scaffold. We use our classification to predict the function of uncharacterized sequences and verify by in vitro protein assays that two fungi genes encode NAT enzymes targeting specific protein N-terminal sequences, showing that even slight changes on the GNAT fold can lead to change in substrate specificity. In addition to providing a new map of the relationship between eukaryotic acetyltransferases the classification proposed constitutes a tool to improve functional annotation of GNAT acetyltransferases.
Author Summary Enzymes of the GCN5-related N-acetyltransferase (GNAT) superfamily transfer an acetyl group from one molecule to another. This reaction is called acetylation and is one of the most common reactions inside the cell. The GNAT superfamily counts more than 870 000 members through all kingdoms of life. Despite sharing the same fold the GNAT superfamily is very diverse in terms of amino acid sequence and substrates. The eight N-terminal acetyltransferases (NatA, NatB, etc.. to NatH) are a GNAT subtype which acetylates the free amine group of polypeptide chains. This modification is called N-terminal acetylation and is one of the most abundant protein modifications in eukaryotic cells. This subtype is also characterized by a high sequence diversity even though they share the same substrate. In addition the phylogeny of the superfamily is not characterized. This hampers functional annotation based on homology, and discovery of novel NATs. In this work we set out to solve the problem of the classification of eukaryotic GCN5-related acetyltransferases and report the first classification framework of the superfamily. This framework can be used as a tool for annotation of all GCN5-related acetyltransferases. As an example of what can be achieved we report in this paper the computational prediction and in vitro verification of the function of two previously uncharacterized N-terminal acetyltransferases. We also report the first acetyltransferase phylogenetic tree of the GCN5 superfamily. It indicates that N-terminal acetyltransferases do not constitute one homogeneous protein family, but that the ability to bind and acetylate protein N-termini had evolved more than once on the same acetylation scaffold. We also show that even small changes in key positions can lead to altered enzyme specificity.
Introduction
Transfer of an acetyl group from one molecule to another is one of the most common reactions inside the cell. The rich and diverse, but structurally highly conserved, superfamily of GCN5-related acetyltransferases is one of the enzyme superfamilies able to catalyze the acetylation reaction (1–3). Members of the GCN5-related acetyltransferase superfamily are able to accommodate numerous types of substrates including lysine sidechains (4–6) and N-termini of proteins (7), serotonin (8), glucosamine 6-phosphate (9), polyamines (10) and others. N-terminal acetylation is one of the most abundant protein modifications in eukaryotic cells, with over 80% of proteins susceptible to acetylation in higher eukaryotes (11). The reaction entails transfer of an acetyl group from a substrate donor, most often acetyl coenzyme A, to a substrate acceptor, which is the N-terminus of the acetylated protein (12). The abundance of N-terminal acetylation implies numerous effects of this modification on normal cell functioning and, indeed, it has been shown that N-terminal acetylation affects protein synthesis indirectly (13), protein folding (14,15), protein half-life (16), protein-protein (17) and protein-lipid interactions (18) protein targeting (19), apoptosis (20,21), cancer (22), a variety of congenital anomalies and autism spectrum disorder (23–26). Despite the importance of N-terminal acetylation the number of N-terminal acetylating enzymes and cellular pathways remain unclear.
Thus far eight N-terminal acetyltransferases (NATs) have been discovered in eukaryotes with the last one identified in 2018 (27,28,37–42,29–36). NATs are named NatA-NatH, by convention, and their catalytic subunits, which are the focus of this work, are named NAA10-NAA80. Each of the catalytic subunits has the same fold, called the GNAT fold. GNAT is the acetylation scaffold in the entire GCN5-related acetyltransferase superfamily (2,3). It is an α-β-α layered structure with a characteristic V-shaped splay between the two core parallel β-strands (usually β4 and β5 strands) (Fig 1). Together with the core strands, two loops (usually α1-α2 and β6-β7 loops) are involved in catalysis and substrate binding. They are located on one side of the splay. On the other side an α-helix (usually α3) common to all acetyltransferases binds Ac-CoA (2,3) (Fig 1.).While the β4 and β5 strands and the loops α1-α2 and β6-β7, are structurally quite conserved, their amino acid sequence varies with ligand specificity (2,3,43–50). Consequently, the key determinants of the ligand specificity of an acetyltransferase are sequence motifs in the crucial positions on the GNAT fold.
The NATs can be more or less promiscuous when it comes to substrate specificity (7). Usually the first two residues of a substrate protein determine whether the protein can be acetylated (11). There is some overlap between in vitro specificities of NATs (11,51) and interestingly, it has been shown that some non-NAT acetyltransferases have the ability to N-terminally acetylate polypeptide chains. Glucosamine 6-phosphate acetyltransferases are one such example and were recently shown to in vitro acetylate N-terminal serine (52). NATs are referred to as a family of enzymes since they all acetylate the same type of substrate, namely protein N-termini, but the fact is that there is no deeper classification than at the superfamily level for all GNAT acetyltransferases.
Majority of all known types of acetyltransferases are members of the same Pfam (53) family (Acetyltransf_1, code: PF00583) which contains almost 50% of the entire acetyltransferase clan (Pfam code: CL0257). The Acetyltransf_1 Pfam family contains 120,379 sequences out of the 280,421 sequences of the acetyltransferase clan and consists of numerous types of acetyltransferases. PROSITE (54) does not differentiate between different types of acetyltransferases either and recognizes four types of GNAT fold: GNAT (PS51186), GNAT_ATAT (PS51730), GNAT_NAGS (PS51731) and GNAT_YJDJ (PS51729). The CATH database (55) offers a slightly better classification than Pfam or PROSITE, but CATH does not accurately differentiate between all known NAT sequences. As a result, and despite extensive efforts on the experimental front, the current classification of acetyltransferases is based on a collection of ligand specificity assays which can only sparsely cover the variety of enzymes in the superfamily.
Several studies have identified a large number of proteins that can be N-terminally acetylated (27,41,51,56–58). Much of the identified acetylated N-termini can be explained by currently known NATs (11,51). However, we do not know whether or not known NATs acetylate other exotic N-termini found to be N-terminally acetylated in cells, such as those with acetylated initial tyrosine (PCD23_HUMAN, KS6A5_HUMAN, etc) (51). N-terminal acetylation events following post-translational protease action are not well characterized either; known NATs except NatF, NatG and NatH sit on the ribosome and catalyze cotranslational acetylation (59). Therefore, there might be unidentified NATs in eukaryotes responsible for such events. The lack of a classification of acetyltransferases at the family level hinders functional annotation based on homology, and hence slows down the identification of new NATs.
In order to create a better classification framework for the eukaryotic acetyltransferase superfamily we used a combination of bioinformatics sequence analysis consisting in sequence similarity networks (SSNs), motif discovery and phylogenetic analysis. We showed that N-terminal acetyltransferases do not constitute one homogeneous family, even though they acetylate the same type of substrate. Our analyses all converge to the conclusion that NATs evolved more than once. Finally, we could predict and experimentally verify that two uncharacterized sequences from fungi closely related to two known NATs, NAA50 and NAA60, encode NAT enzymes targeting specific protein N-terminal sequences. This experimental validation gives us confidence that our classification will be a valuable tool for identification and annotation of new superfamily members.
Results
1. Sequence similarity networks (SSNs)
We collected from UniProt all eukaryotic sequences matching the GNAT signature defined by PROSITE. The collected sequences were then filtered at 70% identity to reduce the size of the dataset, using h-cd-hit (60), which resulted in a dataset of 14,396 sequences. We also collected a second dataset restricted to the sequence of the GNAT-domains. We generated SSNs for each of the datasets using EFI-EST (61). By adjusting the E-value and alignment score threshold for drawing SSN edges (Fig s1), we created an SSN with the highest probability of having isofunctional clusters. Both SSNs resulted in a sparse topology indicating a high sequence diversity in the acetyltransferase superfamily (Fig s2). The convergence ratio of the SSN built from the full-length sequences is 0,008 and it is equal to 0,009 for the network build from the GNAT domains only. This illustrates the high level of divergence between acetyltransferases.
We used the clusterONE algorithm (62) through Cytoscape (63) to determine the boundaries between each cluster in our SSNs. We identified 232 clusters in the full-length sequence SSN and 221 clusters in the GNAT domain SSN. Since the results for both networks are highly similar, we opted to use the full-sequence SSN for further analyses. When applying clusterONE to SSNs, we used the percentage of sequence identity as edge weight to make sure that clusters are identified based on a reliable measure of similarity. Dense regions thus correspond to closely related sequences. We also observe that, with few exceptions, known acetyltransferases of one particular function never appear in multiple clusters. We can thus reasonably assume that the clusters in our SSN are isofunctional.
In order to better visualize the relationships between clusters we represented the SSNs as simplified, “pivot”, networks. Each cluster of the original SSN is represented by a single node. An edge between nodes in the simplified network is drawn where there was at least one edge between any nodes of the two corresponding clusters in the original SSN (Fig 2 and Fig s3). The main topological characteristics of the SSNs are network sparsity, the resulting absence of SSN hubs, several connected components that contain a varying number of clusters, and a large number of isolated clusters (Fig 2). We identified 48 clusters with known acetyltransferases. A majority of proteins in our SSN are from fungi (Fig s3), but all eukaryotic kingdoms are represented. There is a total of 80 Homo sapiens proteins in the SSN, spread into 21 clusters. The observed clustering is not based on taxonomy of acetyltransferases from higher and lower eukaryotes, but instead correlates with ligand specificity (Figures s4-s8). Interestingly, acetyltransferases that acetylate the same type of substrate (e.g. either N-termini of proteins or histones) are not necessarily found within the same connected component but are scattered over the SSN. This is the case with NATs, which are found clustering together with other types of acetyltransferases rather than forming one homogeneous group. This is the first indicator that NATs do not constitute one homogeneous family but have, rather, evolved more than once on the same scaffold.
2. Identification of five NATs groups and their sequence motif fingerprints
Known NATs do not all inhabit the same connected components of the SSN (Fig 2), which indicates NATs are not one homogeneous family of acetyltransferases. We used MEME (64) from the MEME suite (65) to identify motifs in each of the SSN clusters. Sequence motifs of highly conserved residues were detectable for each of the clusters. Based on the similarity between motifs and on the clustering of the NATs in the SSN, we defined five different groups of NATs (Fig 2). We subsequently calculated sequence motifs for each of these groups. The motifs are shown in Fig 3 and Table 1.
Group 1 contains NAA10, NAA20 and NAA30. NAA10 and NAA20 are in the same connected component, while NAA30 is found in a single isolated cluster. The sequence motifs that are important for binding of substrate and acetylation in the group 1 NATs are localized on the α1-α2 loop, the β4 and β5 strands and the β6-β7 loop (45,46,48) (Fig s9) and this is true for groups 2 and 3 as well. Group 2 consists of NAA50 and NAA60. NAA50 and NAA60 do not cluster together in the SSN but the resemblance between their key sequence motifs (Fig s10) justifies placing them in the same group. We define Group 3 around Naa40. A striking characteristic of NAA40 is its long α0 helix and the position of its α1-α2 loop (49) which extends over and covering the binding site where the β6-β7 loop lies in other NATs. Group 4 is defined around NAA80 which is structurally different from the first three groups. Its surface shows a large cleft which is covered by loops in all other NAT structures available to date (50). The need for a larger ligand binding site is explained by the fact that NAA80 has evolved to catalyzes N-terminal acetylation of fully folded actin and harbors an extensive binding surface to actin (66). Finally, Group 5 contains NAA70 which is a chloroplast NAT discovered in Arabidopsis thaliana (29). NAA70 is closer to bacterial acetyltransferases than to the eukaryotic ones in Groups 1 to 4. A BLAST search against the NCBI non-redundant database (67) and excluding green plants, suggests that NAA70 is most similar to cyanobacterial proteins with the best hit being a protein from Gleocapsa sp (29,7 %id over 62% query cover). We also found that NAA70 shares a high percentage of sequence identity with Enterococcus faecalis acetyltransferase whose structure has been solved (PDB code 1U6M). Unfortunately, there is not enough reliable structure information on NAA70 to be able to map the position of the key sequence motifs onto the secondary structure elements.
While there are important differences between each of the groups in terms of sequence motifs, some similarities emerge (Fig 3). They are especially obvious between groups 1 and 2, where we can observe a well conserved tyrosine in the α1-α2 loop (Fig 3 and Fig s11) and most importantly, another conserved tyrosine in the β6-β7 loop (Fig 3 and Fig s11). This tyrosine is essential for function and is strictly conserved in all members of groups 1 and 2 (43–45,48,68). The tyrosine in the α1-α2 loop is conserved in all NATs of group 1 and group 2 (43,44,69) except for NAA20 where it is replaced by phenylalanine (45). Group 3 and group 4 motifs clearly differ from those of group 1 (Fig 3). Compared to the other groups, strands β4 and β5 stand out in groups 3 and 4 where they play a major role in substrate binding and catalysis. Interestingly their sequence motifs and key residue positions differ between the two groups (49,50).
3. Structure comparison
We compared structures of acetyltransferases to one another using the DALI server (70). Our dataset consists in structures of 38 catalytic subunits of acetyltransferases, all belonging to our SSN. The resulting dendrogram (Fig 4) shows a classification that overlaps with that of the SSN. In addition, it highlights that Groups 1 and 2 are more similar to one another than to the other NATs, and to the rest of the entire superfamily. NAA40 (Group 3) is the NAT closest to NAA80 (Group 4), to the histone acetyltransferase GCN5, to the dopamine N-acetyltransferase and to the arylalkylamine N-acetyltransferase. The proximity of NAA40 and NAA80 is only observed in the structure-based classification and was not observed in the sequence similarity network. NAA80 is also close to the histone acetyltransferase GCN5. Groups 1, 2, 3 and 4 of NATs are more similar to one another than to the rest of the superfamily when structures are compared, but this is not the case when sequences are compared. In-between these 4 groups one finds non-NAT acetyltransferases, namely the histone acetyltransferase GCN5 (cluster 1) and serotonin acetyltransferases (cluster 122).
It is important to mention that while we can see differences in structures of different acetyltransferases, they are still quite similar to one another. One indication of how small the differences are between structures is the fact that the structures of NAA40 with and without substrate are found to be very distant from one another in the dendrogram (blue branches on Fig 4). This is the result of the position of the β6-β7 loop which is opened without the substrate and closed with the substrate bound to the enzyme (49). We verified the proximity of the structures by building a network based on a structure similarity matrix. The resulting network is random with all nodes connected to all nodes when we use a Z-score higher than 2, which is considered to be significant as a threshold for an edge between two nodes (71). The Z-score threshold needs to be increased to at least 15 for cluster separation in the similarity network to appear (Fig s12).
4. Phylogeny
We used the clustering information obtained from the smallworld SSN (Fig.S13) to generate the dataset for phylogeny. We selected 3 random sequences per SSN cluster and created an MSA for the structural motifs A (β4 and α3) and B (β5 and α4) of the GNAT fold (See Fig 1 and Fig S15). They are the most conserved structural motifs across the superfamily (2) and their alignment yields a better MSA than a whole-sequence alignment would. Note that NAA70 was not included in the MSA because it is not found in the connected component of the SSN used to generate the phylogeny dataset.
The phylogenetic tree is shown in Figure 5. The branching in the tree clearly reflects the four different groups of NATs corresponding to Groups 1 to 4 in the SSN shown in Fig.2. NATs from Group 1 (NAA10, NAA20, NAA30) and Group 2 (NAA50 and NAA60) are closely related according to the tree (Fig 5) and according to the smallworld SSN (Fig s13). This is in agreement with evidence that NAA10 and NAA50 have evolved from the same archaeal ancestor (72). Groups 3 and 4 appear close to each other as well. NAA40 and NAA80 share a common ancestor. Several distinct branches of the tree carry a particular type of acetyltransferases (Fig 5), but even within some of these branches we see acetyltransferases acetylating different types of substrates. We have mapped the SSN clusters to the tree in order to observe evolutionary relationships between NATs and other identified acetyltransferases.
The tree shows several acetyltransferases, annotated as non-NAT enzymes, sharing a common ancestor with group 1 NATs (red and magenta branches). For example, a histone acetyltransferase (KAT14 – cluster 18) is found close to Group 1 of NATs (NAA10 and NAA20) and these sequences are the closest relatives according to the tree. An MSA of these acetyltransferases (Fig s16) reveals that KAT14 and sequences in Group 1 share sequence motifs. Indeed, the best conserved sequence motif found in Groups 1 and 2, located on the β6-β7 loop, is conserved in KAT14, as well. The β6-β7 loop motif contains a tyrosine present in all Group 1 and Group 2 N-terminal acetyltransferases (NAA10, NAA20, NAA30, NAA50 and NAA60). This tyrosine has been shown to be essential for substrate binding (43,48,68) and it has been suggested that the size and flexibility of the β6-β7 loop plays an important role in substrate recognition (2,73). Based on similarity between the β6-β7 loop of KAT14 and the NATs from Groups 1 and 2 and given the fact that the β6-β7 loop differs in size and primary sequence in other acetyltransferases, it is not excluded that KAT14 might be able to accommodate the same type of substrate as NATs and acetylate N-termini of proteins.
Looking now more specifically at the branches around Group 2 (green branches), we can see that clusters 49, 120, 128, 135, 136 and 212 are found close to NAA50 (cluster 9) and share a common ancestor (Fig 5). Clusters 83 and 209 are found close to NAA60 in the phylogenetic tree as well (Fig 5). Additionally, according to the tree, clusters 122, 78, 16 and 37 share a common ancestor with NAA60. Cluster 122 is a serotonin N-acetyltransferase (74) and forms a single cluster in the stringent SSN. There are similarities between serotonin N-acetyltransferase and NAA60. Like NAA60, serotonin acetyltransferase has a long β3-β4 loop unlike other NATs (44) (Fig s17). Catalytic residues are positioned similarly in both enzymes. Tyr97 in NAA60 and His 120 in serotonin acetyltransferase have equivalent positions in on the GNAT fold (Fig s17). The other catalytic residue of NAA60 (His138) and cluster 122 serotonin acetyltransferase (His122) are both located in the core of the GNAT fold (Fig s17) even if their positions are not equivalent. Cluster 16 is annotated as a polyamine acetyltransferase (75,76) and it establishes weak connections with, among few others, cluster 14 (diamine acetyltransferases) in the stringent SSN.
NAA50, NAA60 and their surrounding clusters share a common ancestor with cluster 161 and the MSA of NAA50, NAA60 and sequences in cluster 161 shows many conserved key residues (Fig s18). Cluster 161 contains only sequences of Caenorhabditis tropicalis and is highly similar to both Naa50 and Naa60. It might therefore acetylate substrates similar to those acetylated by NAA50 and NAA60.
In Group 3 clusters 104, 222, 13 and 47 are close to NAA40 (ochre branches) (Fig 5). Additionally, clusters 111 and 176 share the same common ancestor with NAA40 and surrounding clusters (Fig 5). Sequences in Cluster 176 are annotated as NAA40. It is unclear whether cluster 111 is also a NAA40 or if it has a different substrate specificity.
In Group 4 of NATs (dark blue branches), clusters 94, 106, 226 and 232 are close to NAA80 (cluster 63) (Fig 5), These clusters share the same common ancestor. This group of sequences shares a common ancestor with cluster 32. Another branch, branching from the NAA80 branch contains clusters 14 (Diamine acetyltransferases), 15 and 53 (Tyramine N-feruloyl transferase 4/11). In addition, on the same branch, but closer to clusters 14, 15 and 53 than to NAA80, lie uncharacterized clusters 29, 152 and 218. Clusters 29, 152 and 218 share a common ancestor, according to our tree, with cluster 68 (Histone acetyltransferase HPA2 (77)) and 194.
Histone acetyltransferase GCN5 (cluster 1) is found on the same branch as NAA40 on the phylogenetic tree and, also, close to NAA40 and NAA80 on the structure similarity dendrogram (Fig 4). The MSA between NAA40 and acetyltransferases from cluster 1 shows some conservation between these two types of acetyltransferases, but none of the functional key residues for NAA40 are conserved in sequences from cluster 1 (Fig s19). Judging by the branching of our tree, NAA40 and NAA80 have evolved from, or together with, histone acetyltransferases (Fig 5). Indeed, these two NATs do not share any of the characteristics of Group 1 and Group 2 NATs. Their separate branching is in agreement with the assumptions we made about N-terminal acetyltransferases evolving more than once, which was based on the topology of our SSNs and on the sequence motif composition.
5. Prediction of new acetyltransferases
5.1. Predictions based on SSN and sequence motifs
The initial SSN (Fig 2) shows that there are clusters containing uncharacterized sequences around clusters of known NATs. We focused on those clusters, calculated their sequence motifs and compared them to motifs of known NATs (see Methods section). We are interested in finding proteins with sequences that are in the vicinity of known NATs in the SSN, and that display sequence motifs close to the NATs motifs reported in this work (Fig 3).
Around group 1, no neighboring cluster showed sequence motifs close to those found in clusters 2 (NAA30), 8 (NAA20) and 10 (NAA10). Therefore, none of the connected clusters to either NAA10 or NAA20 were considered as potential new NATs.
In group 2, clusters 9 (NAA50) and 24 (NAA60) belong to connected components that contain uncharacterized clusters numbered 49, 83, 120, 128, 135, 136, 209 and 212 (Cf. Figure 3). We found that all of these clusters, in addition to an isolated cluster numbered 207, have sequence motifs highly similar to the fingerprint of group 2 (Fig s20). Yet the motifs are not identical to those of NAA50 and NAA60. In the 6 clusters around cluster 9 (NAA50) (Fig 2), NAA50 is the only confirmed acetyltransferase and we found sequences annotated as NAA50 both in clusters 9 and 135. There are X-ray structures available for both of these clusters but proteomics and biochemical experiments have shown that they may differ in their substrate specificity (47,78). We observe a difference in sequence motifs; a mutation-sensitive phenylalanine (43) in the α1-α2 loop of NAA50 (first Phe in the motif of group 2 shown on Fig.4) is replaced by a less bulky leucine in sequences from cluster 135 (Fig 6A and 6C). We observe the same differences in the α1-α2 loop between cluster 9 (NAA50) and uncharacterized clusters 49, 120, 128 and 212 (Fig 6A). Two residues downstream from the leucine/phenylalanine substitution, we observe a conserved isoleucine in cluster 120 instead of a highly conserved valine in NAA50. This valine forms van der Waals contacts with the substrate in NAA50 (43) and is, thus, important for substrate binding. Moreover, the α1-α2 loop in cluster 120 sequences contains two conserved prolines (Fig 6A) unlike NAA50 that contains only one (Fig 3).The characteristic β6-β7 motif of NAA50 (Cf Table 1 and Fig 3) is not present in cluster 135, which doesn’t have the conserved tyrosine in this loop (second tyrosine of the sequence motif on Fig 3). Structurally, the differences between cluster 9 (NAA50) and cluster 135 enzymes are precisely in the β6-β7 loop, which is longer in cluster 135 structure (Fig s21). Sequences from all other clusters, found clustering around cluster 9, carry the same β6-β7 loop motif as NAA50 (Fig 6A). Finally, there are differences in sequence motifs carried by the β4 strand; the methionine responsible for interacting with the substrate in NAA50 (third position in β4 motif on Fig 4) is substituted by a glutamine in clusters 49 and 128 and by a glutamate in clusters 135 and 136, while sequences in cluster 212 retain the conserved methionine. Based on the presented differences we predict that clusters 49, 120, 128, 136, and 212 have substrate specificities different from that of NAA50.
We predict that the position of clusters 83 and 209 around NAA60 (Fig 2) reflects different substrate specificities, as well. The main difference between clusters 83 and 209 and cluster 24 (NAA60) is in the α1-α2 loop. While the mutation-sensitive phenylalanine is present in clusters 83 and 209 (Fig 6A) there is a difference four residues downstream of it; where the NAA60 sequence contains a conserved acidic residue, sequences in clusters 83 and 209 have a conserved positively charged residue (Fig 6A). Given the importance of the α1-α2 loop (43–45,48) we predict that such a drastic change will result in proteins belonging to clusters 83 and 209 having a ligand specificity that differs from that of NAA60.
We applied the same strategy as above to predict the specificity of clusters surrounding the NAA40 cluster (cluster 20) (Fig 2). Some of the NatD residues have been shown to be essential for substrate binding. These residues (Y136 – in β4, Y138 – in β4, D127 – in β3 and E129 between β3 and β4 in human NatD) are involved in interaction with the first 4 residues of the NatD substrate (H4 and H2A histones) and their mutation greatly reduces the catalysis rate (49). We show that some of these essential residues are not conserved in sequences from clusters surrounding the NatD cluster (Fig s22). This raises the question of the type of substrate acetylated by enzymes from these clusters and whether these enzymes are pseudoenzymes, given that mutated residues have been shown to be essential for NAA40 substrate recognition and catalysis (49). Sequences in clusters 13, 104 and 222 do not have the conserved aspartate on β3, while clusters 13, 47 and 222 do not have a tryptophan in the α1-α2 loop (Fig s22). Both of the missing residues are crucial for substrate binding, which could mean that clusters 12, 97, 223 and 47 could bind and acetylate different substrates. We could not find any other clusters from this group that share NatD motifs.
Even though there are four clusters around NAA80 (cluster 63, group 4) (Fig 2), we did not find any variations in their key sequence motifs (Fig s23). The clustering in this case was likely based on taxonomical differences.
5.2. Experimental verification of clusters 49 and 120
To evaluate the accuracy of our predictions, we recombinantly expressed two candidates from the clusters 49 and 120, purified them and tested their ability to acetylate N termini from a selection of 24 amino acids-long synthetic peptides (Fig 7). One of the candidate enzymes was N1Q410 from the fungus Dothistroma septosporum. After expression and subsequent purification of N1Q410 (Fig s24A and s24B), we tested its ability to acetylate N-termini of different sequences in a DTNB-based spectrophotometric assay (Fig 7A). The first seven peptides represent typical substrates for the seven known NATs in higher eukaryotes (NatA, SESS; NatB, MDEL; NatC, MLPG; NatD, SGRG; NatE, MLGP; NatF, MLGP; NatH, DDDI) (7,30). The subsequent six peptides have been selected dependent on the initial results for both proteins, resembling amino acid combinations that are potential substrates. Although the overall activity of N1Q410 was relatively low, there was a clear preference for methionine starting peptides, especially MDEL (21.09 ± 4.03 µM) and MEEE (15.10 ± 0.25 µM) (Fig 7A). The putative NAT A0A194XTA9 from the fungus Phialocephala scopiformis (Fig s24C and Fig s24D) showed a higher activity in general as well as a broader substrate specificity (Fig 7B). Similar to N1Q410 only peptides starting with a methionine were Nt-acetylated by A0A194XTA9, with the peptides MAPL (50.92 ± 1.89 µM), MFGP (48.94 ± 2.50 µM) and MVEL (148.74 ± 2.25 µM) showing the highest activities.
Discussion
Using pairwise sequence comparisons and phylogenetic analyses we have mapped the sequence space of the eukaryotic N-terminal acetyltransferases superfamily and the evolutionary relationship between its members.
High diversity in the acetyltransferase superfamily
We first sketched the topology of the entire acetyltransferase superfamily in the form of a sequence similarity network (SSN). The size and topology of the network, with a large number of single isolated clusters, reveals the high diversity of the acetyltransferase superfamily. Numerous clusters contain only uncharacterized sequences, but many contain at least one defined and annotated sequence. Inside these clusters we transferred annotation from known to uncharacterized sequences and we could observe that N-terminal acetyltransferases are found in different parts of the network. Using information from the network topology together with the identification of sequence motifs for each of the known NATs, we could classify NATs into 5 different groups.
NAA10, NAA20, NAA30, NAA50 and NAA60 share a common ancestor
Group 1 of NATs contains NAA10, NAA20 and NAA30. Sequence motifs on α1-α2 loop, β4 and β5 strands and β6-β7 loop are characteristic to this group and represent its signature. Protein N-termini starting with a small residue, which is exposed after removing of initial methionine, and protein termini with the initial methionine can be acetylated by this group of NATs (7). Even though these enzymes are obviously closely related, they employ different solutions to bind and acetylate substrates. Slight changes are sufficient to shift the substrate specificity of the GNAT fold. The same GNAT elements in Group 2 of NATs, which contains NAA50 and NAA60, are important for substrate binding and catalysis (43,44). Group 2 NATs acetylate protein N-termini starting with a methionine (43,44). Large differences between Group 1 and Group 2 exist in the way the substrate binds to the enzyme and also the position of the catalytic residue on the fold. The difference in catalytic strategy between Group 1 and Group 2 enzymes can be illustrated by drawing a horizontal line through the middle of the V-shaped splay across the β4 and β5 strands (Cf Fig.1); in Group 1 the active site would be above the line, while it would be below the line in Group 2. Interestingly, catalytic residues of Group 2 are conserved in Group 1, but they are not catalytically active in Group 1 (48). The two groups share two conserved tyrosines in each of the β6-β7 and α1-α2 loop. Both are involved in substrate binding. In Group 1 of NATs, residues at positions upstream of the mentioned tyrosine (positions -2 and -4 with respect to the Tyr) are also involved in substrate binding, but the same positions in Group 2 do not seem to be as important for binding the substrate, even though position -4 shows mutational sensitivity (45). Our phylogenetic tree supports earlier suggestions that NATs from Groups 1 and 2 share a common ancestor. An archaeal N-terminal acetyltransferase, whose structure was solved by Liszczak and Marmorstein (72), can acetylate substrates of both NAA10 and NAA50. The archaeal enzyme employs catalytic strategies from both of these enzymes. It is most likely, as the authors suggested, that NAA10 and NAA50 evolved from this common ancestor. Common ancestry is also supported by the conserved sequence motifs which interestingly do not all necessarily retain a significant functional role in each of the NAT groups. A less stringent SSN also shows closer clustering of the Group 1 and Group 2 NATs.
Large evolutionary distance between NAA40 and NAA80 on one hand, and the other NATs on the other hand
Group 3 contains NAA40, the NAT with the specificity towards the histone H4 and H2A N-termini (sequence: SGRG) (49). The first major difference between NAA40 and Group 1 and 2 NATs is the fact that the β6-β7 loop in NAA40 has no role in determining substrate specificity (49) and does not carry an invariable tyrosine. While the α1-α2 loop of NAA40 plays a role in substrate binding (49), just like in groups 1 and 2 of NATs, it has a different position in the 3D structure and its sequence motif does not bear any resemblance with the conserved motifs of groups 1 and 2. Our SSNs and phylogenetic tree all show a large evolutionary distance between Group 3 and Groups 1 and 2.
Group 4 is defined by NAA80, the most recently discovered N-terminal acetyltransferase (30). The β6-β7 loop of NAA80 is not conserved and does not play an important role in acetylation. As in most other NATs the α1-α2 loop plays an important role in substrate binding and is well conserved (50). The α1-α2 loop sequence motif is different from those of other NATs. Moreover, NAA80 has a wider substrate binding groove between the α1-α2 and β6-β7 loops. This structural feature supports classifying NAA80 into a different NAT type. In addition NAA80 does not have the α1-α2 and β6-β7 tyrosines found in Groups 1 and 2. Our results confirm, over a larger set of sequences, an observation that has been reported earlier (50).
Different evolutionary paths
Each of the NAT groups have clear characteristics that distinguish them unequivocally from one another. This observation indicates different evolutionary paths for NATs, and not divergent evolution. Our results indicate that N-terminal acetyltransferases evolved more than once on the GNAT fold. The phylogenetic tree which informs on the position of the different NATs in the acetyltransferase superfamily and provides a useful perspective on the evolution of ligand specificities, confirms this. The relationships between enzymes revealed by the SSNs and the structural comparison are also in agreement with the phylogenetic tree. Interestingly Groups 1 and 2 are located on the same branches as acetyltransferases known to have other functions. The histone acetyltransferase KAT14 is close to Group 1 and serotonin acetyltransferases (AANAT) are close to Group 2. Those are all present in the human proteome. Histone acetyltransferases KAT2A and KAT2B and diamine acetyltransferases are on the same branch of the phylogenetic tree as NAA40 and NAA80. Glucosamine 6-phosphate acetyltransferases are quite close to NAA40 and NAA80 branches as well. Groups 3 and 4 are also found to share a common ancestor in the phylogenetic tree and they are found to be closely related based on structure similarity. Furthermore, serotonin acetyltransferase from cluster 122 lies on the same branch as NAA60 and is also shown to be close to the NAA60 structure in the structure similarity dendrogram (Fig 4). N-terminal acetyltransferases are not one homogenous, uniform, family of enzymes and the GNAT fold has evolved different specificities more than once.
Consequences for function and functional annotation of acetyltransferases
Because of this we cannot exclude that N-terminal acetyltransferases can acetylate other substrates than N-terminal amines. NAA10 and NAA60 are suspected to be able to acetylate lysine side chains in addition to protein N-termini (79,80), even if this has been debated (81). A related consequence is that other acetyltransferases might be able to acetylate N-terminal amino acids. One of the most recent findings is that glucosamine 6-phosphate acetyltransferases can acetylate protein N-termini (52). Moreover, our results indicate that serotonin acetyltransferases could have the ability to acetylate protein N-termini and have a biological role as N-terminal acetyltransferases, as well. This is relevant in the quest and characterization of yet-to-be discovered enzymes catalyzing N-terminal acetylation of particular groups of protein N-termini (for instance those resulting from post-translational protease action) or specific proteins (analog to NAA80 specifically acetylating actins). Indeed, the currently known NATs are not yet defined as responsible for all cellular N-terminal acetylation events though the major classes of co-translational acetylation have been accounted for using S. cerevisiae genetics and proteomics (11,35,41). In the human proteome we could not find uncharacterized sequences qualifying as NATs as per the characteristics we define in this study. It is therefore important to thoroughly inspect all close relatives to known NATs for the discovery of new enzymes.
The fact that there is not one single catalytic site and mechanism for acetylation even for the closest of NATs creates another conundrum. NAA10, for example, has a conserved glutamate in α1-α2 loop which is involved in catalysis, but in the case of NAA20, NAA10’s closest relative, the same conserved glutamate has no role in catalysis (45,48). This case became even more puzzling when the study of NAA20 revealed no obvious catalytic residue. Furthermore, NAA10 acetylates different substrate N-termini when in a monomeric form as compared to when it is complexed with its auxiliary subunit NAA15 (48,82). It can look as if as long as a substrate can bind properly to the GNAT, the chances are high it can be acetylated. It follows that the impossibility to strictly define what makes N-terminal acetyltransferases acetylate N-termini and no other substrates greatly limits our ability to predict NAT function from sequence. We are left to only comparing key sequence motifs in order to detect similarities and predict NAT function. Yet, subtle sequence changes might also affect substrate specificity. Despite those difficulties we were able, using this approach, to predict two new NATs and confirm their function by acetylation assays in vitro.
Using the classification for functional annotation of uncharacterized sequences
Representatives from the clusters 49 and 120 from Group 2 (Figure 2), N1Q410 from the fungus D. septosporum and A0A194XTA9 from the fungus P. scopiformis were expressed, purified and subjected to in vitro NAT assays. Group 2 also harbors clusters 9 and 24 containing known NATs NAA50 and NAA60, respectively. Thus, we would expect that proteins from other Group 2 clusters would express NAT-activity and further that these display a substrate specificity similar to what is observed for NAA50 and NAA60. NAA50 and NAA60 have overlapping substrate specificities in vitro, but in vivo substrates are not likely to overlap since NAA50 is nuclear/cytosolic and partly anchored to the ribosome via NAA15-NAA10 (83,84) while NAA60 acetylates transmembrane proteins via anchoring to the cytosolic side of the Golgi-membrane and other cellular membranes (27,85). Both enzymes may acetylate a variety of Met-starting N-termini, in particular Met-Leu, Met-Ala, Met-Val, Met-Lys, Met-Met (40,82). Both N1Q410 and A0A194XTA9 display clear N-terminal acetyltransferase activity confirming that these are true NATs (Fig 7). Furthermore, both enzymes prefer Met-starting N-termini among the peptides tested. A0A194XTA9 has a clear preference for the NatE/NatF (NAA50/NAA60) type of substrates strongly suggesting that this NAT is either a NAA50 or NAA60 type of enzyme in P. scopiformis. For N1Q410, we observe a preference for N-terminal peptides where Met is followed by an acidic residue at the second position, very similar to NAA20/NatB activity (41) despite the fact that it harbours sequence motifs highly similar to those of Group 2 NATs. This is an example of how sensitive N-acetyltransferase ligand specificity can be to subtle sequence changes. In this case they mainly consist in: (1) two substitutions in the α1-α2 loop where the highly conserved F and V in the of NAA50 motif are replaced by an L and an I, respectively, (2) NAA50 has only one proline in the α1-α2 loop while cluster 120 has 2 and (3) the highly conserved M in the β4 strand of NAA50 is at a different position in cluster 120 sequences. Thus, N1Q410 might be a NAA20 type enzyme which is clustered among NAA50/NAA60 type enzymes in Group 2, or there might be other factors skewing the substrate preference in vitro.
The superfamily has highly diverged in primary structure, but secondary and tertiary structures remain largely intact. The GNAT is the scaffold on which numerous types of molecules can get acetylated and it evolves different specificities by changes in sequence that do not affect the overall structure. Our work shows that it is possible, within the limits discussed in “Consequences for function and functional annotation of acetyltransferases”, to predict ligand specificity similarity or differences between GNAT-containing sequences if they are closely related and by comparing the key sequence motifs that we report here. Predicting the substrate specificity of an uncharacterized GNAT sequence which doesn’t have close relatives with known function is practically impossible in silico. In vitro assays are necessary to map function and specificity of uncharacterized parts of the acetyltransferase superfamily. It is important to note that large portions of the phylogenetic tree have exclusively uncharacterized sequences and it is impossible to say anything about their substrate specificity. There are no human proteins in the uncharacterized parts of the tree. While this work is restricted to eukaryotic GNAT-containing sequences and encompasses the majority of eukaryotic acetyltransferases it is important to mention that some non-GNAT acetyltransferases like FrBf (86) were discovered as recently as in 2011. Members of the MYST family (87) are also relevant non-GNAT acetyltransferases. New potential acetyltransferases could be found among those enzymes. Moreover recent studies have shown that most N-terminal acetyltransferases evolved before eukaryotic cells (46) so it might be that looking at bacterial and archaeal proteomes would provide valuable information.
In summary our work provides the first classification and phylogenetic analysis of the eukaryotic GNAT acetyltransferases superfamily. It reveals that NATs evolved more than once on the GNAT fold and that they do not form a homogenous family. We provide sequence motif signatures of known NATs that, together with this classification form a solid basis for functional annotation and discovery of new NATs.
Material and methods
Sequence similarity networks (SSN)
Collection of sequence dataset
All members of GCN5-related acetyltransferase superfamily contain the GNAT fold. As there is no finer classification to aid dataset creation, we retrieved all UniProt sequences that match the GNAT fold signature as defined by PROSITE (54,88). According to PROSITE there are four types of GNAT fold – GNAT (code: PS51186), GNAT_ATAT (code: PS51730), GNAT_NAGS (code: PS51731) and GNAT_YJDJ (code: PS51729). These PROSITE signatures match sequences from all domains of life (around 900 000 sequences in UniProt). We restricted our dataset to only eukaryotic entries (more than 50000 sequences) in agreement with the focus of this work. We kept all SwissProt (manually curated) sequences and Homo sapiens TrEMBL (not reviewed) sequences in the dataset. The remaining TrEMBL sequences were filtered to reduce the size of the dataset. Filtering of TrEMBL sequences was performed using h-cd-hit (60,89) in three steps – a first run performed at 90% identity, a second at 80% and a third at 70% identity. The threshold was set to be 70% sequence identity as this usually indicates shared function. We created two datasets using this strategy: the full-sequence dataset and the GNAT-domain dataset. We used the pfamscan tool from Pfam (90) together with HMMER3.2.1 (91) to locate the GNAT fold boundaries in the full-sequence dataset in order to generate the GNAT-domain dataset.
Generating the SSNs
The final, filtered, dataset (14396 sequences) was used to generate the SSN using EFI-EST (61) with the following parameters: E-value of 10−15 and alignment score of 30. The chosen values ensured that sequences clustering together were closely related (Cf. Fig.S1) with a minimal sequence identity equal to of 40% on average yielding isofunctional clusters. The shortest sequence kept in the network was 34 amino acids long. It is not known yet what the minimal functional part of the GNAT fold is. The resulting network was analyzed using Cytoscape (63). To visualize the network in Cytoscape we used γfiles organic algorithm by γWorks (https://www.yworks.com/). In addition to the network made from E-value thresholds equal to 10−15 and alignment score equal to 30 we created several other networks, mainly for the purpose of finding the best dataset for phylogenetic analyses (see Phylogeny section below for more details). Parameters for these networks were: for E-value of 10−5, alignment scores of 15, 20, 25, 30, 35 or 40; for E-value of 10−10, alignment scores of 15, 20, 25, 30, 35 or 40; for E-value of 10−15, alignment score 16, 25, 30, 35 or 40; for E-value equal to 10−20, alignment score equal to 20, 30, 35 or 40.
Identification of isofunctional clusters and their neighbours
In the resulting SSN (E-value 10−15 and alignment score 30) there were no clear boundaries between different clusters. In order to identify separate clusters, we applied the clusterONE algorithm (62) which is designed to recognize dense and overlapping regions in a graph. The search for dense regions in a network (clusters) was performed with the following parameters: minimum size of 10 sequences for a cluster to be considered, minimum density: auto, edge weights: percentage identity, and the remaining settings were taken as their default values. Next, we identified known NATs, and other non-NAT acetyltransferases, in their corresponding clusters (using annotation details added to the network) and we let these clusters be defined by experimentally confirmed enzymes (based on the assumption of cluster isofunctionality). Given the high percentage identity inside the identified clusters, we assumed cluster isofunctionality (i.e. similar ligand specificity) and transfered annotation from experimentally confirmed proteins to unknown ones within the same cluster. We also created a simplified network using the clusterONE results as input. We represented each cluster by defined ClusterONE as a single node. Nodes in the simplified network are connected by an edge if at least one edge exists between nodes of two given clusters in the original network. After adding all nodes and edges to the simplified network, we applied γFiles (https://www.yworks.com/) orthogonal algorithm to get the final view.
Network analyses
The topology of the simplified network was analyzed using Network Analyzer through Cytoscape. Mainly, we used node degree and betweenness centrality, where node degree tells how many neighbors a node has and betweenness centrality describes how important is a given node for interactions between different parts of a network. Network analyzer calculates betweenness centrality using algorithm by Brandes (92).
Motif discovery
We used the MEME tool (64) to find characteristic sequence motifs within clusters. Each motif search was performed on all sequences of a given cluster. Enriched motifs were discovered relative to a random model based on frequencies of letters in the supplied set of sequences. As we work with protein sequences zero to one occurrence of each motif per sequence was expected and searched for. A maximum of 25 unique motifs were searched for per sequence set, with 5 to 10 amino acid width. Only motifs with e-value below 1 were taken onto account.
Prediction of NATs among uncharacterized sequences
The prediction of NATs among uncharacterized sequences in the SSN started by the selection of the 29 clusters (cluster numbers: 227, 3, 121, 42, 62, 36, 82, 40, 114, 25, 223, 168, 135, 136, 212, 128, 49, 120, 83, 209, 106, 232, 104, 226, 104, 13, 222, 46, 47) neighboring the clusters containing known NATs, namely clusters 10 (NAA10), 8 (NAA20), 2 (NAA30), 20 (NAA40), 9 (NAA50) 24 (NAA60), 97 (NAA70) and 63 (NAA80). We searched for occurrences of key sequence motifs of known NATs (shown in Figure 4 of the Results section) in all sequences of the 29 selected clusters using MAST (93). When we found in a cluster sequence motif similar to that of a cluster of a known NAT, we generated a MSA using three random sequences from the identified cluster and three sequences from the cluster of known NAT.
Phylogeny
Choice of sequence dataset for phylogeny
Since there are no clear boundaries between different acetyltransferases, due to lack of detailed classification, we based our phylogeny analysis on our SSNs. We used the more stringent SSN (E-value = 10−15, alignment score = 30) and selected three representative sequences for each cluster. In order to create the dataset for phylogenetic analyses, we created several networks that allowed for more connections between nodes (and clusters) (see Table s1) and looked for the SSN with the largest single connected component (the largest group of clusters) exhibiting smallworld properties (94). We calculated smallworldness for each of the largest connected components using NetworkX Python library (95).
Sequence alignment for phylogeny
We selected three sequences per cluster to generate the multiple sequence alignment. If a cluster contained sequences from SwissProt, those sequences were used in the alignment. Otherwise, TrEMBL sequences were randomly selected as cluster representatives. As sequence divergence within the acetyltransferase superfamily is extremely high, we used only the highly conserved A and B motifs of the GNAT fold. The alignment was generated using Clustal Omega (96) and the full alignment was constructed step by step. Sequences from closely related clusters were aligned first and different alignments were then merged using MAFT (97). Merging two alignments using MAFT was always performed using “anchor” sequences and ensuring that both alignments had one set of five sequences (i.e. one cluster) in common (“anchor” sequences). That also ensured that corresponding secondary structure elements was kept intact after merging. Alignments generated for merging were manually edited, using acetyltransferases with known structures used as reference to increase the alignment precision.
Model of evolution
To select the right amino acid replacement model, which describes the probabilities of amino acid change in the sequence, we used ProtTest3 (98). As input, we used the previously generated multiple sequence alignment. Tested substitution model matrices were JTT (99), LG (100), DCMut (101), Dayhoff (102), WAG (103) and VT (104). All rate variations were included in the calculation (allowing proportion of invariable sites or +I (105), discrete gamma model or +G (106) (with 4 rate categories) and a combination of invariable sites and discrete gamma model or +I+G (107). Empirical amino acid frequencies were used. We calculated a maximum likelihood tree to be used as starting topology.
Construction and evaluation of the phylogenetic tree
Finally, a maximum likelihood tree was calculated using RAxML (108) based on the generated alignment. We used LG+G+F model of evolution since it provided the best fit according to prottest3 (98) calculation (with AIC, AICc and BIC models selection strategies). Ten searches for the best tree were conducted. The tree was not rooted. Once the best tree was calculated, its robustness was assessed using bootstrap. As stop criterion we used a frequency-based criterion, by calculating the Pearson’s correlation coefficient (109). After bootstrapping was complete, we used transfer bootstrap expectation (TBE) (110) which has been shown to be more informative than Felsenstein’s bootstrap method for larger trees built with less similar sequences.
Experimental
A detailed description of the material and methods is provided in Supplementary Information. In brief, the genes N1Q410 and A0A194XTA9 were cloned into pETM11 vectors. The encoded proteins were recombinantly expressed in E. coli BL21 Star™ (DE3) cells and purified using affinity and size exclusion chromatography. The purity of the proteins was determined by SDS-PAGE and protein concentrations were determined spectrometrically. The enzyme activity was determined via DTNB assay as described in (111).
Supplementary information captions
S1 Supplementary Information.supplementary_information.pdf
Supplementary Methods, Figures and Tables.
S1 File. SSN_dataset_(full_lenght_seqeunces).txt
Sequences used to calculate the full-length sequence SSN, given in the FASTA format.
S2 File. SSN_dataset_(only_GNAT_domain).txt
Sequences used to calculate SSNs for the GNAT domain portion of sequences. All sequences are provided in the FASTA format
S3 File. Phylogenetic_tree.txt
Phylogenetic tree of the GNAT acetyltransferase superfamily calculated using RAxML in newick format. Leaves of the tree are labeled with accession numbers of a given protein and a corresponding cluster number. Inner nodes of the tree are labeled with calculated support values for each node.
S4 File. MSA_for_phylogeny.txt
Multiple sequence alignment used for calculating the phylogenetic tree.
S5 File. Group_5_sequence_motifs.txt
Sequence motifs for Group 5 of NATs calculated using MEME tool from MEME Suite. The file contains
motif P-values;
block diagrams showing the position of the motifs on the relevant sequences;
PSSM;
position-specific probability matrix;
regular expression for the given motif.
S6 File. Group_5_motifs_position_on_seqeunce.txt
Positions of Group 5 sequence motifs on the representative sequence calculated using MAST from MEME Suite.
S7 File. Cluster_97_SEQ.txt
Sequences in FASTA format found in cluster 97 of our full-sequence SSN. These sequences belong to the NAA70 plastid N-terminal acetyltransferase and were used as the dataset for calculating Group 5 sequence motifs.
S8 File. cluster_numbers.xls
This is the table of all proteins from our SSN. The table contains accession numbers, Uniprot annotation status (SwissProt/TrEMBL), description and a corresponding cluster number for each of the proteins.
S9 File. full_sequence_SSN.xgmml.zip
SSN calculated based on the full-length sequence acetyltransferase dataset.
Acknowledgements
NR was supported by the Research Council of Norway (Projects 251247, 288008). T.A was supported by the Research Council of Norway (Project 249843), the Norwegian Health Authorities of Western Norway (Project 912176), and the Norwegian Cancer Society.
References
- 1.↵
- 2.↵
- 3.↵
- 4.↵
- 5.
- 6.↵
- 7.↵
- 8.↵
- 9.↵
- 10.↵
- 11.↵
- 12.↵
- 13.↵
- 14.↵
- 15.↵
- 16.↵
- 17.↵
- 18.↵
- 19.↵
- 20.↵
- 21.↵
- 22.↵
- 23.↵
- 24.
- 25.
- 26.↵
- 27.↵
- 28.↵
- 29.↵
- 30.↵
- 31.
- 32.
- 33.
- 34.
- 35.↵
- 36.↵
- 37.↵
- 38.
- 39.
- 40.↵
- 41.↵
- 42.↵
- 43.↵
- 44.↵
- 45.↵
- 46.↵
- 47.↵
- 48.↵
- 49.↵
- 50.↵
- 51.↵
- 52.↵
- 53.↵
- 54.↵
- 55.↵
- 56.↵
- 57.
- 58.↵
- 59.↵
- 60.↵
- 61.↵
- 62.↵
- 63.↵
- 64.↵
- 65.↵
- 66.↵
- 67.↵
- 68.↵
- 69.↵
- 70.↵
- 71.↵
- 72.↵
- 73.↵
- 74.↵
- 75.↵
- 76.↵
- 77.↵
- 78.↵
- 79.↵
- 80.↵
- 81.↵
- 82.↵
- 83.↵
- 84.↵
- 85.↵
- 86.↵
- 87.↵
- 88.↵
- 89.↵
- 90.↵
- 91.↵
- 92.↵
- 93.↵
- 94.↵
- 95.↵
- 96.↵
- 97.↵
- 98.↵
- 99.↵
- 100.↵
- 101.↵
- 102.↵
- 103.↵
- 104.↵
- 105.↵
- 106.↵
- 107.↵
- 108.↵
- 109.↵
- 110.↵
- 111.↵