Conservation motifs - a novel evolutionary-based classification of proteins

Cross-species protein conservation patterns, as directed by natural selection, are indicative of the interplay between protein function, protein-protein interaction and evolution. Since the beginning of the genomic era, proteins were characterized as either conserved or not conserved. This simple classification became archaic and cursory once data on protein orthologs became available for thousands of species. To enrich the language used to describe protein conservation patterns, and to understand their biological significance, we classified 20,294 human proteins against 1096 species. Analyses of the conservation patterns of human proteins in different eukaryotic clades yielded extremely variable and rich patterns that had never been characterized or studied before. Using mathematical classifications, we defined seven conservation motifs: Steps, Critical, Lately Developed, Plateau, Clade Loss, Trait Loss and Gain, which describe the evolution of human proteins. One type of motif, which we termed Gain, describes the human proteins that are highly conserved in a small number of organisms but are not found in most other species. Interestingly, this pattern predicts 73 possible instances of horizontal gene transfer in eukaryotes. Overall, our work offers novel terms for conservation patterns and defines a new language intended to classify proteins based on evolution, reveal aspects of protein evolution, and improve the understanding of protein functions.


Introduction
Creating a more accurate terminology and precise nomenclature can facilitate a deeper perspective on biological processes. One of the revolutions in system biology resulted from the definition and study of network motifs 1 . Network motifs are defined by a particular pattern of interactions between proteins that reflect functional properties. A contribution to the protein-protein interaction field was also presented, suggesting two types of networks (hubs): 'party' and 'date' 2 . These networks can better define protein dynamic connectivity in a single organism over time. Based on these definitions, new analyses have become possible, contributing greatly to our understanding of protein interaction and function. In a similar manner, numerous paths exist to characterize and classify proteins, which improved the ability to assign function to poorly characterized proteins.
These include classification of proteins into functional annotations [3][4][5] , subcellular localization and organismal levels [6][7][8] , and classification of domains and functional sites [9][10][11] . These characterizations generated a unified language now applied by the scientific community, allowing better communication and resulting in advances the relevant field of research. Although many aspects of protein characterization have been analyzed, no characterization uses information related to protein conservation and evolution. In this work, we offer a new layer to protein classification, based on evolutionary patterns, as revealed by phylogenetic profiling data.
Phylogenetic profiling is a mathematical representation that describes the evolution of a protein as a pattern of its conservation (i.e. presence or absence) in a set of species 12 . A recent approach with a continuous scale for protein conservation is the Normalized Phylogenetic Profiling (NPP) 13,14 . The NPP assigns, for each of the human proteins, a score representing its relative conservation (between 0 to 1) in each species. Over the past decade, NPP has been used to discover genetic interactions and novel genes in the RNA interference pathway, RNA methylation, and various human diseases including cancer [13][14][15][16][17][18][19][20] . Recently we demonstrated the importance of focusing on different taxonomic clades as it can improve the ability to identify novel factors in different pathways, specifically in the homologous recombination repair pathway 21 .
Despite an exponential accumulation of protein sequences from thousands of genomes, protein conservation is still defined essentially by Boolean terms (Fig. 1-A). This definition of conservation is very subjective, as "conserved" or "not conserved" may differ when comparing orthologs of two organisms or of multiple organisms. Furthermore, the comparison may be between closely related organisms (such as human and mouse) or between very distant ones (such as human and E. coli). Moreover, in nature, proteins have very complex evolutionary trajectories such that, within the same evolutionary timeframe, a protein can be fully lost, diverge dramatically, or stay almost without changes throughout the tree of life or in specific branches (clades) ( Fig. 1-B). However, no objective criteria exists that includes the above information. A demonstration of the signal lost by using partial information and simplistic terms for describing protein evolution. Dark blue high conservation score, light bluelow conservation. (A) The illustration presents two human proteins; protein 1 is conserved in a distant organism while protein 2 is not. Although this description seems adequate and correct, it is rather simplistic. (B) Representation of the complexity of protein conservation as observed from the LNPP data. When considering the continuous scores among multiple species, a new layer for the term 'conservation' can be added.
Phylogenic profiling analysis is commonly used to detect functionally related proteins based on correlated evolution. However, detecting protein conservation motifs can reveal much more than co-evolved proteins as it can shape the overall evolutionary concepts of proteins. Conservation motifs reflect protein function, interaction and selective pressure undergone throughout evolution. An intuitive example for a conservation pattern of a protein may be high conservation among all, or most, species. The evolutionary insight about this Critical motif is that the protein serves a crucial role in many species survival and may indicate a role for this protein as a "house-keeping" gene. Other proteins show conservation patterns that correlate with the evolutionary distance of the species. In contrast, some proteins show complex patterns such as massive loss or conservation variability within taxonomic clades.
Here we identify and define recurring conservation motifs of almost all human proteincoding genes across eukaryotes. We show that conservation motifs can define and frequently explain proteins changes along evolution, adding a new layer to traditional protein evolution terminologies. We show that conservation motifs encompass biological insights with regard to the proteins role based on their continuous evolutionary score.

Results
Our goal was to identify recurring evolutionary patterns, termed conservation motifs, throughout 1096 species in the tree of life. We searched for common conservation patterns, which we formulated using mathematical criteria. The mathematical criteria recapitulate significant information about protein evolution and the effect of selective pressures on proteins. The motifs were defined a-priori in order to capture a spectrum of evolutionary behaviors. The species include 8 taxonomy clades: Primates, Other Mammalians, Other Chordata, Other Metazoa, Alveolata, Fungi, Viridiplantae and Other Eukaryotes 21 . These clades were designed to capture a human-centric understanding of protein evolution.
The conservation motifs can be analyzed using two approaches: Global, which considers the full profile of the protein among all, or most, of the species, and Local, which considers specific regions of the profile within specific taxonomic groups. The motifs mathematical criteria were formulated to be suitable for a global and a local analysis, according to one's field of interest. However, some of the motifs are more global and some are more local by definition (Fig. 2). The local category includes three motifs: Clade loss, Trait loss and Gain. For each motif a scheme illustration of the profile is presented. The X-axis represents taxonomy clades ordered by a descending evolutionary distance from human. The Y-axis represents the length normalized score for the protein versus each organism as compared to human.

Global Motifs
Global motifs refer to the overall profile of a protein among all, or most, of the species.

Steps
The sequence similarity between two orthologs is a function of both the evolutionary distance between the species and the selective pressures that acted on those proteins.
We would expect that, on average, proteins would behave as a "molecular clock" and would be more conserved in taxonomic clades closer to human, creating a Step-like pattern. This is indeed the case and our analyses generated a global Steps appearance where the protein scores in distant taxonomic clades were lower than those in closer taxonomic clades (i.e. the average sequence similarity score in Mammalians is higher than in Fungi, whose score is higher than in Viridiplantae). The proteins related to that motif form a stepwise behavior. The Steps motifs can also be local; primate proteins are more similar to human proteins as compared to orthologs from other chordata. Similarly, orthologs from the Mammalia clade are more conserved than orthologs from the rest of the animal kingdom. Consequently, we wanted to identify which, and how many, proteins have global Steps behavior as a probable result of the molecular clock. To this end, we defined the sequence divergence of the human proteins in each organism and normalized it to the expected divergence. The expected divergence is the average difference between all the human proteins to their orthologs in each species. This means that the vector indeed gradually decreases when the species are ordered in a descending evolutionary distance from human, based on the NCBI Common Taxonomy Tree. We defined the Steps criterion to calculate the Euclidean distance of protein scores to the means vector (see Methods).
This score indicates how close the protein is to the 'average'. Overall, the Steps criterion distribution is very narrow, with an average distribution of 0.37 and a standard deviation of 0.23. In order to validate the significance of the results, we calculated the same Steps score on a shuffled version of the data (see Methods). The criterion distribution is very close to the shuffled distribution and to the "average protein" (Supplementary Fig. 1-A).
This supports our aim of finding the most general motif among the protein sets that behave according to the expected evolutionary trend. Nevertheless, the criterion distribution is skewed to the right, which implies additional protein sets that respond to very different selective pressure and show different evolutionary patterns. We then turned to characterizing those protein sets that showed less trivial evolutionary patterns.

Critical
Some proteins are essential for sustaining life. These proteins show very low sequence variability among species and mutations within them and almost always show a reduction in organism fitness. If the function of these proteins is critical to the survival of the vast majority of species, we expect to find a Critical pattern. Critical proteins show a high sequence similarity, much more so than expected, among all eukaryotes (global) or in clades (local). These proteins are almost never lost in species.
We defined the Critical criterion to calculate the average sequence similarity in each of the 8 taxonomy clades and then summarized the means. We calculated the same score to a shuffled version of the data (see Methods). We defined proteins with scores higher than the maximal score in the shuffled version as Critical (Supplementary Fig. 1 It is well established that histone H3 proteins are highly conserved across all eukaryotes 22,23 . In comparison, H1 is known to be less conserved than other histones, probably because it is a linker histone 24 . We show that H1 is indeed ranked low in the Critical criterion. H2A and H2B, which are involved in regulation, are known to have sequence and structural variations and to be post-translationally modified 25 . Therefore, some of them are Critical while others are less conserved and ranked lower in the criterion. No variants are known for H4 26 and they are highly ranked in the Critical criterion ( Supplementary Fig. 2).
The next very Critical proteins are the actins, which are known to be highly conserved and are involved in various modes of cell motility and in the maintenance of the cytoskeleton 27 . Followed by the Critical protein groups of the α-and βtubulins, which are also known to be highly conserved heterodimers, they act as structural components of microtubules, which are a cytoskeleton element of all eukaryotic cells 28 . An additional protein that emerged as Critical is the CALM1, Calmodulin Ca(2+)-sensing protein, which is known to be highly conserved in eukaryotes 29 .
The 'nonsense-mediated mRNA decay' GO annotation term is highly enriched within the Critical proteins (p-value: 1.17E-59) (Supplementary Table 2.1). 'Nonsense-mediated mRNA decay' is a crucial process regulating transcript quality and abundance and it is known to be conserved among eukaryotes [30][31][32] .
The 'translational initiation' GO annotation term is also highly enriched within the Critical proteins. 'Translational initiation' is a complex process that involves eukaryotic conserved initiation factors (such as EIF1B, EIF2S3, EIF4A2 and ribosomes) and regulation mechanisms that select the start codon on the mRNA 33 .
Interestingly, among the Critical proteins, some are not known to be highly conserved such as the ARL3 (ADP Ribosylation Factor Like GTPase 3) and the PRPF8 (Pre-MRNA Processing Factor 8), which are associated with Nonsyndromic Autosomal Dominant Retinitis Pigmentosa 34 .

Lately Developed
The evolution of species and taxonomic clades is frequently derived from new proteins that are unique to the clade. From the human perspective, most Lately developed proteins are absent among most eukaryotes and can be found specifically in primates and humans. These proteins are probably important for primate unique characteristics.
This protein set was identified in a complementary fashion to the Critical pattern (see Methods). Protein scores that are lower than the minimal score of the shuffled operation reflect the Lately Developed motif, which are proteins that are conserved less than expected randomly (Supplementary Fig. 1 The mucins protein evolution is driven by a "Red Queen" effect, in which hosts are constantly in a race to evade the more rapidly evolving pathogens that infect them 35,36 .
As the pathogens vary between different host species, the mucins proteins would vary as well.
Another enriched GO annotation is 'establishment of skin barrier' (p-value: 3.5e-05), which are proteins responsible for the epithelial barrier of the skin and for limiting its permeability. This term includes the FLG, HRNR and FLG2 proteins. Null mutations within the FLG gene encoding filaggrin have been identified in approximately 30% of atopic dermatitis patients, a common allergic skin disease 37 . HRNR differential expression levels have also been associated with atopic dermatitis susceptibility 38 . It is known that many aspects of the anatomy of the skin are unique to humans as compared to non-human mammals, and many drastic changes have occurred in the skin that allow thermoregulation 39 .
In addition, the NBPF protein family, which are part of the neuroblastoma breakpoint family, are ranked at the top of the Lately Developed criterion. Interestingly, this protein family has no discernible orthologues in rodent genomes and is the result of speciesspecific gene duplication events that occurred during primate evolution 40 . Some attempts have been made to connect the NBPF protein family to brain development 41 .

Plateau
Plateau proteins show a constant percentage of conservation at a specific value among most species. Such constant behavior is less expected as the conservation should, theoretically, change in correlation with the evolutionary distance from human. We defined the Plateau criterion to calculate the variance among the invertebrate clades and summarized the variances. In order to validate the significance of the results, we calculated the same score on permutations of the data (see Methods). For further analyses, we extracted the protein that scored lower than the minimal score in the shuffled operation as these proteins are "less random" than expected (Supplementary Fig. 1-D We hypothesized that Plateau proteins contain highly conserved regions (in variable sizes) that dictate the percentage of conservation. Therefore we examined whether proteins that share the same constant conservation score tend to contain an equal sized structural domain. We determined the 'Pfam coverage score' as the length of the largest Pfam domain 42 of a protein, divided by the protein length. We noticed that, in general, proteins Pfam-coverage-score-trend increased as the mean-conservation-score enlarged. In addition, the constant protein conservation value can be either unique or noisy. When comparing two Plateau proteins sets with different variance values, the Pfam coverage score among the low variance proteins was consistently larger than among the high variance proteins. This indicates that Plateau proteins with a higher variance include several smaller domains instead of a large conserved domain ( Supplementary Fig. 3).
The Plateau motif provides information on the structure of a protein, based on its phylogenetic profile alone. As the information for the construction of the profile is based on 1096 species, the prediction is less biased by the choice of species for multiple sequence alignment. Furthermore, the Plateau motif answers the questions "what characterizes proteins that contain a large conserved domain" and "which highly

Local motifs
The conservation and phylogenetic patterns of proteins can, and should, be described globally. Nevertheless, some "less trivial" evolutionary patterns have a more local nature, although fundamental to understanding protein function. Local motifs focus on proteins that show an unexpected profile in specific taxonomic groups. This includes loss or gain of the proteins at certain positions across the tree of life. Below, we describe the local motifs, which are: Clade loss, Trait Loss, and Gain (Fig. 2).

Loss
A loss event can occur at the common ancestor of a taxonomic clade causing a group of proteins to be absent from a monophyletic taxonomy clade. We termed this type of loss Clade Loss. Another possibility for a loss event is within species that share a common phenotype, which are part of a polyphyletic clade. We termed this type of loss Trait Loss.
The hypothesis for both motifs is that the analyzed species share common characteristics that may provide an explanation for a significant loss of the proteins.

Clade Loss
Clade Loss proteins are significantly diverged or lost within a taxonomic clade, while conserved in close clades. Clade Loss proteins can describe many different and distinct evolutionary phenomena. One example is a protein group that is lost in invertebrate animals 'Arthropoda' but exists in 'Viridiplantae'. We identified 102 proteins (Supplementary Table 1

Trait Loss
One of the major challenges in biology is to associate genotype with phenotype. We and others have previously shown that protein phylogenetic profiles can be used to predict organism traits 13,14,46,47 . Trait Loss proteins are unexpectedly diverged or lost within a polyphyletic clade, in correlation with certain traits. One example of the Trait Loss motif is metazoan parasites, which are a polyphyletic group of species. We used 69 metazoan human parasites in our analyses (Supplementary Table 1.2). We were interested in the proteins whose scores differed between these parasites and other metazoa (Fig. 4, see Methods).   [48][49][50] .
Interestingly, we found enrichment for peroxisome organization pathway proteins (pvalue: 2.14e-06). The peroxisome is responsible for oxidative metabolism of lipids and the detoxification of reactive oxygen species. Previous work identified the loss of peroxisomes in other free living related species 51 . Members of the peroxisomal proteins are studied as a drug target for the development of novel therapies against diseases caused by parasites [52][53][54][55] . We propose that the peroxisomal protein set, absent specifically in metazoan parasites, can be considered as targets for drugs against human parasites.
In addition, the three proteins ALAD, UROD and CPOX, revealed in the motif and related to heme biosynthetic process, are also known to differ significantly between the human

Gain
Some patterns have local, highly conserved, orthologs that are found only in few distant organisms, while in other species there are no significantly similar orthologs. We termed this sporadic appearance of an ortholog, within a select few species, as Gain. A gain event at a monophyletic taxonomic clade can appear only in the reference proteome's taxonomy clade, which in our case is the primate taxonomy.
The Lately Developed pattern is a specific case of that scenario. Therefore we focused only on gain in polyphyletic taxonomic clades. The rationale behind the Gain proteins is that their similarity within a few specific species does not correlate to the taxonomic distance of those species. Therefore we defined the Gain criterion to capture proteins whose conservation score, within a specific organism, is much higher than expected, considering the overall conservation scores of this organism's proteome and the overall conservation of other species in its taxonomic clade.
We calculated the Z-scores of the Length Normalized Phylogenetic Profiling scores 13 , which measures how much a protein is conserved as compared to its expected conservation (see Methods). We defined the Gain criterion to detect gaps within the We validated the presence of a possible SEPSECS ortholog specific to these species by blastp the human SEPSECS versus all the Smittium species, and all the Chytridiomycetes species using the nr database. Significant results appeared only in the species where significant Z-scores appeared in our data (except for S. megazygosporum where blastp revealed a hypothetical protein of 30%) (Fig. 5-A, Supplementary Table 1.4).

Figure 5: Examples of HGT events. (A)
Smoothed Z-score phylogenetic profiles of the SEPSECS protein among the fungi taxonomic clade. The X-axis represents the fungi Zoopagomycota and Chytridiomycetes clade (orange). The Y-axis represents the Z-score of the SEPSECS in each organism. A sudden peak (red) of SEPSECS is observed at S. culicis, S. simulii, S. angustum and G. prolifera and is therefore a candidate for an HGT event. (B) Taxonomy tree of the Zoopagomycota and Chytridiomycetes taxonomic clades along with the blastp scores of the SEPSECS protein in each organism. Annotaded in red are the species where SEPSECS was revealed as gained within them according to the gain criterion. (C) Smoothed Z-score phylogenetic profiles of the ENOSF1 protein among the fungi and plants taxonomic clade. The X-axis represents the Dothideomycetes and Agaricales fungi (orange) and Fabids plants (green) taxonomic clades. The Y-axis represents the Z-score of ENOSF1 in each organism. A sudden peak (red) of ENOSF1 is observed at Q. suber and its parasites and is therefore a candidate for an HGT event. (D) Taxonomy tree of the Fabids clades and Q. suber parasites along with the blastp scores of the ENOSF1 protein in each organism. The Q. suber and its parasites are annotated in red because ENOSF1 was revealed as gained within them according to the gain criterion.
Another protein that was revealed in the Gain criterion in plants is ENOSF1 (Mitochondrial enolase superfamily member 1), which showed a sudden peak in Q. suber, a cork oak part of the fabids taxonomy. We validated the presence of a possible ENOSF1 ortholog specific to Q. suber by blastp the human ENOSF1 versus candidate species in each of the fabids sub-clades. No significant score was obtained in any of the species except for Q. suber (Fig. 5

Discussion
Protein evolution is an extremely complex and variable process that is affected by different evolutionary forces, such as positive and purified selection, drift, gene flow and even HGT. These events vary in every species, affecting an organism's fitness and driving its evolution. Despite this complexity, the scientific community currently uses a relatively limited vocabulary, lacking in dimensionality, when describing protein evolution along different species. This definition of conserved, poorly conserved, etc., seems rather simplistic when aiming to describe and study a subject as complex as protein evolution across many species. By identifying and describing common conservation patterns, "conservation motifs", we were able to enrich the language for describing and studying protein evolution. Additionally, we were also able to identify proteins with unexpected patterns of conservation. This flexible terminology makes it possible to characterize and, for the first time, communicate protein conservation patterns simply and more accurately.
Here, we analyzed the evolutionary profiles of 20,294 human proteins against 1096 species and offered a novel terminology of the observed protein evolution. We identified seven conservation motifs: Steps, Critical, Lately Developed, Plateau, Clade Loss, Trait Loss and Gain, which present both expected and more complex evolutionary patterns.
We provide a richer but simpler language than the basic terminology of "conserved or not conserved". This study highlighted and defined common conservation patterns, associating them with protein evolution and function. Overall conservation motifs revealed biological and evolutionary insights about protein function and interaction with the environment.
Furthermore, we believe that this terminology is applicable for describing the evolution of most human proteins better than the terms currently in use. For example, TP53 (Tumor Protein P53), the most frequently mutated gene in human cancer, has a Steps motif and is not conserved in all the non-metazoan eukaryotes (Fig. 6-A). ACTA2 (Actin Alpha 2), a highly conserved protein, observes a Critical motif (Fig. 6-B  insights about proteins, such as structure and domain coverage, revealed from the Plateau motif. Moreover, conservation motifs can relate phenotype to genotype such as to a protein group absent exclusively among polyphyletic parasites (Trait Loss). The discovery of this behavior provides a clue as to the function of these proteins as well as to highlight them as candidate targets for drug development.
Conservation motifs also revealed a set of 73 candidate proteins for horizontal gene transfer events, along with a hint about the organisms where the transfer occurred.
Furthermore, conservation motifs explain evolutionary relations between close and distant species. One example is the absence of sterol proteins among insects, but which is compensated by their presence in plants and points to strong interactions between these two taxonomic groups. Another example is the ability to note the donor and acceptor organisms of horizontal gene transfer events and to predict the putative proteins.
Specifically, we presented a lateral transfer of ENOSF1 to Q. suber from its fungal parasites. Of 18 proteins revealed as horizontally transferred to Z. mays, seven of these proteins are paralogues. Further investigation is needed to better understand the lateral transfer process to Z. mays.
One current limitation of this analysis results from poor genome assembly that are incomplete. This causes false negatives since the absence of proteins is not fully proven. Therefore, when detecting protein absence, the biological hypotheses revealed from the patterns should be carefully considered in order to avoid mistaken insights. For example, the reason for the significant loss of proteins among the 'Aves' class is probably a result of a high GC content in bird genomes and not an actual signal of protein absence 65 .
As an annotation or language, we believe that this work will open a new and more relevant way to consider conservation patterns. Following the exponential growth in the number of sequenced genomes, the annotation can be improved and expanded to include other motifs. We hope that over time, the biological meaning of these motifs will be better understood. For example, we chose the Arthropoda and Viridiplantae taxonomic clades to demonstrate the Clade Loss motif, and the parasites species as an example of polyphyletic species (where the proteins are absent) to demonstrate the Trait Loss motif.
However, this analysis can and should be applied to any taxonomic group of interest.
In conclusion, there is an unmet need to describe protein evolution in simple terms. As far as we are aware, this work is the first to define a method for describing protein evolution terminology. During an era of exponential growth in the genomic data, we believe that this work offers a new language for proteins and can enhance our understanding of their evolution.

Data retrieval
The protein phylogenetic profiling data was generated as previously described 13,14,21 . 1154 proteomes were downloaded from both NCBI's RefSeq non-redundant protein database and Uniprot. The script 'update_blastdb.pl' was used for the download of the non-redundant protein database 66 (accessed at 25.12.2018, link -ftp://ftp.ncbi.nlm.nih.gov/blast/db/). In order to extract 1154 specific taxonomies from the compressed RefSeq downloaded database, we used the 'taxonkit' tool, which maps the sequence identifiers to the relevant species and extract the relevant fasta files. The proteomes from Uniprot were downloaded as FASTA files using the REST API (The UniProt Consortium 2019) (accessed at 28.12.2018, linkhttps://www.uniprot.org/uniprot/?query=proteome:Proteome_ID&format=fasta).
The reference human proteome was retrieved from UniProt as well (accessed at 28.12.18) and filtered such that each gene had a single representative protein when the largest protein for each gene was considered. We reached 20294 human reference proteins.

Length normalized score calculation
For each of the human proteins (20294) we calculated a continuous score (between 0 to 1) that reflects its quantitative conservation among the large set (1154) of eukaryotic proteomes. The continuous score was calculated by: = , ,ℎ ⁄ , = blastp bit score of the human query protein p aligned with its one-directional best hit in one of the 1154 subject proteomes s. ,ℎ = blastp bit score of the human query protein p aligned with its one-directional best hit in the human proteome h.
Since the bit score depends on the length of the protein, this score reveals a length normalized score of the protein conservation.
We executed blastp on the command-line 67 (version 2.7.1) with the argument '-max_target_seqs 1' to retain only the top hit per protein per species.
The resulting matrix is 20294 rows (proteins) x 1154 columns (species), where , = blastp score of protein i in organism j, divided by protein length. The matrix is called LNPP-Length Normalized Phylogenetic Profiling. Most of the conservation motifs criteria are based on the LNPP matrix.

Z-Score calculation
For the Gain motif we used the Z-Score Normalized Phylogenetic profiling data 13,14 . For each of the human proteins (20294) we calculated the Z-Score that reflects how much the protein is conserved in (1154) eukaryotic proteomes more than expected. The Z-Score was calculated based on the LNPP matrix: = , − µ , = length normalized phylogenetic profiling score of protein p in proteome s µ = the mean of all the proteins in proteome s = the standard deviation of all the proteins in this proteome s.

Species selection
Some of the LNPP columns, representing 1152 species, were excluded from the analysis in order to reduce noise. First, we selected only one representative organism for each taxonomy strain, avoiding duplicated columns that might bias the analysis. Second, we detected some species whose scores among all the proteins is very low. This is a result of small proteome size or bad proteome annotation. Therefore we calculated the mean of each organism among all the proteins and excluded from the analysis the bottom 5% species. We considered a final set of 1096 species for the motifs identification (Supplementary Table 1.5).

Taxonomy Clades reordering
The species taxonomy kingdoms were ordered in a descending evolutionary distance from homo sapiens. Furthermore, we reordered the internal species within the taxonomy clades, in order to place closer species next to each other. The information about the species order was downloaded from the NCBI Common Taxonomy Tree architecture (linkhttps://www.ncbi.nlm.nih.gov/Taxonomy/CommonTree/wwwcmt.cgi) of the 1096 species. The tree was then rerooted for homo sapiens using the command nw_reroot in the Unix shell, belong to the newick utilities 1.6 tool (link-http://cegg.unige.ch/newick_utils). The tree was then reordered using the R generic 'reorder' function (belongs to the 'stats' package) using the 'OLO' method, based on the Pearson correlation of the species LNPP scores.

Mathematical criteria:
For all the mathematical criteria that included the taxonomy clades in calculations, we considered 8 taxonomy clades: Primates, Other Mammalians, Other Chordata, Other Metazoa, Alveolata, Fungi, Viridiplantae and Others, so that = 8 consistently.
Steps criterion: Euclidean distance to the means vector.
For each protein p we calculated the score: We ordered the proteins in a descending order of the criterion scores. In addition, in order to validate the significance of the results, we applied the above criterion on a shuffled version of the data, i.e. shuffling the protein means among each clade i.

Critical criterion: sum of clades means.
For each protein p we calculated the score: criterion score = ∑ =1 = mean of length normalized score of protein p in clade i.
Since the 8 taxonomy clades sizes are variable, we summarized the means and all of the scores so that all clades would have the same power to affect the score. We ordered the proteins in a descending order of the criterion scores. In addition, in order to validate the significance of the results, we applied the above criterion on a shuffled version of the data by shuffling the protein means among each clade i. We extracted proteins that received a score higher than the maximal value in the shuffling operation after the shuffling was simulated 1000 times, and then the maximal value of all iterations was picked (maximal value: 5).

Lately Developed criterion: sum of clades means.
For each protein p we calculated the score: criterion score = ∑ =2 = mean of length normalized score of protein p in clade i.
We ordered the proteins in an ascending order of the criterion scores. We did not include the Primate taxonomy clade in this criterion, since we were interested in proteins that were absent among most eukaryotes except for primates. Therefore we want the score to be minimal in all eukaryotes except for primates (7 clades). In addition, in order to validate the significance of the results, we applied the above criterion on a shuffled version of the data, i.e. by shuffling the protein means among each clade i. We extracted proteins that received a score lower than the minimal value in the shuffling operation after the shuffling was simulated 1000 times, and then the minimal value of all iterations was picked (minimal value: 0.21).

Plateau criterion: distance of means from Plateau values.
For each protein p, with a mean of length normalized score greater than 0.3, we calculated the score: criterion score = ∑ =4 = variance of length normalized score of protein p in clade i.
Some of the proteins that had a constant percentage of conservation were a result of very low conservation and short sequence. In order to avoid their emergence, we limited the protein group in advance to proteins that had an average length normalized score greater than 0.3. We did not include in the criterion the Vertebrates taxonomic clades, since within them most of the proteins show pattern of Steps. In addition, in order to validate the significance of the results, we applied the above criterion to a shuffled version of the data, i.e. by shuffling the protein variance among each clade i. We extracted proteins that received a score lower than the minimal value in the shuffling operation, when the shuffling was simulated 1000 times, and then the minimal value of all iterations was picked (minimal value: 0.008). The protein group that emerged included some Critical proteins, which we did not include in the final set of the Plateau proteins.

Loss
The threshold used for these motifs to cut the proteins ranked in the criterion was 0.5%, thus limiting the motif size to ~100 ranked proteins.

Clade Loss criterion: distance of scores in two monophyletic clades.
For example, for Arthropoda and Viridiplantae clades, we calculated the score for each protein p: We ordered the proteins in a decreasing order of the criterion scores and picked the top threshold.

Trait Loss criterion: distance of scores in two polyphyletic clades.
For example for Metazoa and parasitic Metazoa, we calculated the score for each protein p: = mean score of protein p in a set of metazoan parasitic species.
ℎ _ = mean score of protein p in the rest of the species in the Metazoa clade.
We ordered the proteins in a decreasing order of the criterion scores and picked the top threshold.

Gain
Gain criterion: distance between two following z-scores.
For example: clade fungi. For each protein p we calculated the score: = ( _ ) = ( 1 , 2 , … , ) _ = ( 1 , 2 , … , ) = +1 − = the z-score of protein p in each of the fungi clade species. _ = the distance between two following z-scores. m = number of species in fungi clade We ordered the proteins in a decreasing order of the criterion scores and picked the top threshold. To reduce noise, we further limited the criterion for proteins where their maximal gap between the sorted z-scores appear at the top quartile of the values, i.e. the unexpected high conservation score exists in no more than 25% of the clade species.

Pfam domains retrieval
We downloaded the data about Pfam domains using the 'biomaRt' R package, using the 'hsapiens_gene_ensembl' dataset and 'ensembl' mart. Using the getBM function, we queried for the attributes: 'external_gene_name', 'pfam', 'pfam_start', 'pfam_end', for the filters: 'hgnc_symbol' where the values are all the NPP protein names.

Phylogenetic profiling smoothing visualization
In order to illustrate specific phylogenetic profiles of proteins, we generated a geometric line where the x-axis represents all eukaryotes available in the time of the analysis, annotated to 8 taxonomy clades, and the y-axis is the normalized score that the protein received versus each specie compared to human. In order to reduce noise in visualization, smoothing methods were applied on the geometric line using the 'smth' function in the 'smoother' R package, with a window of 0.05 and a 'Gaussian' method.

Statistical analyses
In order to calculate the significance of the presence of Histone, Ribosomes and Tubulins within the Criticals proteins, we used a hypergeometric test, using the R function 'phyper' belonging to the 'stats' package. For example, in order to calculate the significance of the presence of the histones proteins within the criterion proteins, the p-value is the probability of getting k or more histones in a sample of size, n, where the population contains m histones and RNPP-m other proteins.
K= the length of intersection of histones proteins and NPP proteins. N= number of Critical proteins, number of trials. M= number of histones proteins in NPP. RNPP= all the proteins in the NPP.

Enrichment analyses
We applied enrichment analysis on the conservation motifs protein in order to further assess the terms highly enriched within the protein groups. The enrichment was done using the function 'gprofiler' belonging to the 'gProfileR' R package. The parameters of the function are: organism = "hsapiens", src_filter="GO:BP", correction_method="fdr", max_set_size= 200, min_set_size=15, min_isect_size=3, and max_p_value=0.05. We observed that many of the terms contain the same proteins and that they basically represent the same biological process. Therefore, we filtered the terms by extracting those that had more than 90% overlapping proteins, by considering terms with better p-values. Figure S1: Distribution plots of the global conservation motif criteria. The X-axis represents the protein criteria scores. The Y-axis represents the density. The motif criteria (pink) are illustrated versus the same score when applied to shuffled data (green). The proteins with a score higher/lower than the maximal/minimal value of the criteria when applied to shuffled data were considered as 'conservation motif' proteins (blue).  Illustration of the different histones scores versus their rank in the critical criterion. The X-axis represents the histones scores in the Critical criterion, and the Y-axis represents the rank of the histones in the criterion. The different types of histones: H1, H2A, H2B, H3, and H4 are colored separately. Circles are histones that were defined as Criticals according to the Critical criterion. Triangles are histones that were not defined as Criticals according to the Critical criterion.