Abstract
Proteins are the building blocks for almost all the functions in cells. Understanding the molecular evolution of proteins and the forces that shape protein evolution is an essential step in understanding the basis of function and evolution. Previous studies have shown that adaptation occurs frequently at the protein surface, such as in genes involved in host-pathogen interactions. However, it remains unclear whether adaptive sites are distributed randomly or at regions that are associated with particular structural or functional characteristics across the genome, since many of the proteins lack structural or functional annotations. Here, we seek to tackle this question by combining large-scale bioinformatic prediction, structural analysis, phylogenetic inference, and population genomic analysis of Drosophila protein-coding genes. Although adaptation is more relevant to function-related rather than structure-related properties, we observed that physical interactions may play a role in the co-adaptation of fast-adaptive proteins. Importantly, protein-protein and protein-DNA interaction sites are hotspots for protein adaptive evolution, regardless of the levels of intrinsic structural disorder or relative solvent accessibility. We found that strongly differentiated amino acids across geographic regions in protein coding genes are mostly adaptive, which may contribute to the long-term adaptive evolution. This strongly indicates that a number of adaptive sites are repeatedly mutated and selected in evolution, in the past, present, and maybe future. Our results suggest important roles of intermolecular interactions and co-adaptation in the adaptive evolution of proteins both at the species and population levels.
Introduction
Natural selection plays an important role in the molecular evolution of protein sequences. Recent advances in genome sequencing and reliable inference methods at both phylogenetic and population levels have enabled fast and robust estimation of evolutionary rates and adaptation driven by natural selection. In addition, the increased availability of structural and functional data for proteins has made it possible to study how structural and functional constraints affect protein sequence evolution and adaptation. It is now well established that different proteins and different sites within a protein have varying rates of evolution and adaptation due to both structural and functional constraints (Echave et al., 2016; Kosiol et al., 2008; Lindblad-Toh et al., 2011; Zhang and Yang, 2015). For example, genes that are highly expressed or perform essential functions are under strong purifying selection and tend to evolve slowly (Drummond et al., 2005; Moutinho et al., 2019; Pál et al., 2001; Zhang and He, 2005; Zhang and Yang, 2015); genes involved in host-pathogen interactions, e.g., immune responses and antiviral responses, show exceptionally high rates of adaptive changes (Enard et al., 2016; Nielsen et al., 2005; Obbard et al., 2009; Palmer et al., 2018; Sackton et al., 2007; Sironi et al., 2015; Uricchio et al., 2019); and residues that are intrinsically disordered or at the protein surface are fast evolving and have been shown to be hotspots of adaptive evolution (Afanasyeva et al., 2018; Goldman et al., 1998; Lin et al., 2007; Moutinho et al., 2019; Ramsey et al., 2011). More recently, Slodkowicz and Goldman (2020) employed genome-scale integrated structural and phylogenetic evolutionary analysis in mammals and showed that positively selected residues are clustered near ligand binding sites, especially in proteins that are associated with immune responses and xenobiotic metabolism.
However, the vast majority of this work has focused on differences at the species level, and it remains unclear how much polymorphic changes within a species contribute to long-term evolution.
Although evidence has shown that adaptation is more likely to occur in intrinsically disordered regions and to cluster at the surface of proteins, the functional properties of adaptation at the genomic and population scales remain unclear. Moreover, because structural and functional information is lacking for many proteins in the genome, the underlying mechanisms derived from current studies might be incomplete. Here, we systematically investigated the evolution and adaptation of protein-coding genes in Drosophila melanogaster by comparing it to its closely related species and by comparing its own populations, in order to distinguish the main factors that impact evolution and adaptation at the protein-coding level. We applied large-scale bioinformatic and structural analysis to obtain structural and functional properties of proteins. We then classified residues into different structural and functional sites. By comparing rates of sequence evolution and adaptation between different proteins and different sites, we were able to locate hotspots of adaptation at the genome scale. Although adaptation is more sensitive to functional properties than to structural properties, we found that putative binding regions, including allosteric sites at the protein surface, show higher rates of adaptation than other sites. For proteins that are under fast-adaptive evolution, we showed that they tend to interact with each other more frequently than expected at random and are often associated with reproduction, immunity, and environmental information processing in D. melanogaster. In addition, we showed that interacting proteins in D. melanogaster might undergo co-adaptive evolution. Furthermore, we hypothesize that molecular or physical interactions might be an important mechanism that contributes to adaptive and co-adaptive evolution in the D. melanogaster genome.
Finally, we showed that many non-synonymous SNPs contributing to short-term adaptation overlap with SNPs contributing to long-term adaptive evolution, suggesting that a subset of SNPs in the genome is repeatedly used for adaptation.
Results
Putative molecular interaction sites are hotspots for protein adaptive evolution
To uncover the main factors that impact the evolutionary rates of genes, we analyzed 13,528 protein-coding genes in D. melanogaster using genome data from melanogaster subgroup species and D. melanogaster population genomics data from 205 inbred lines from the Drosophila Genetic Reference Panel, Freeze 2.0, DGRP2 (Huang et al., 2014). We applied a maximum likelihood method (Yang, 2007) to compute the dN/dS ratio (ω) using the protein-coding sequences of five closely related melanogaster subgroup species (D. melanogaster, D. simulans, D. sechellia, D. yakuba and D. erecta). We estimated the proportion of adaptive changes (α) in each gene by applying an extension of the MK test named asymptotic MK (Messer and Petrov, 2013; Uricchio et al., 2019) using D. simulans as outgroup. We then calculated the rate of adaptive changes (ωa) of each gene by multiplying ω by α (ωa = αω) (Moutinho et al., 2019) using D. yakuba as the outgroup species (see Methods). The rate of nonadaptive changes was then calculated as ωna = ω - ωa. In total, we assigned ω to 12,118 protein-coding genes and ωa and ωna to 7,192 genes. For each D. melanogaster gene that passed this pipeline, we further obtained 17 different structural or functional properties (see Methods and supplementary file S1). We calculated Pearson’s correlations of ω, ωa and ωna with all these properties (Table S1). Many of these genome-wide correlations were consistent with expectations from previous studies (Supplement Information, section Impact of gene properties on evolution of protein-coding genes in D. melanogaster, Table S1).
Interestingly, among these properties, we found that two previously unreported ones, the fractions of molecular-interaction sites (PPI-site ratio, the ratio of residues involved in protein-protein interactions, and DNA-site ratio, the ratio of residues involved in protein-DNA interactions), were strongly positively correlated with ω, ωa and ωna (Supplement Information, section Molecular interactions contribute to the variations of protein sequence evolution and adaptation, Table S1, Figure S1). These results indicate that molecular interactions might be an important factor driving protein adaptive evolution in the Drosophila genome.
We then investigated whether residues involved in molecular interactions are targets for adaptive evolution. To tackle this question, we predicted protein-protein interaction sites (PPI-sites) and DNA binding sites (DNA-sites) for each D. melanogaster protein sequence (see Methods). In addition, we characterized allosteric residues as surface and interior critical residues with the STRESS model (Clarke et al., 2016) for all the structural models. We also extracted putative binding sites from STRESS Monte Carlo (MC) simulations. We calculated ω, ωa and ωna for residues in each putative molecular interaction category. Strikingly, we observed that residues involved in protein-protein interactions, DNA binding and ligand binding exhibited higher rates of adaptive evolution than their corresponding null sites (Fig. 1A-C). In addition, allosteric residues at the protein surface showed higher adaptation rates than allosteric residues in the protein interior or residues that are not involved in ligand binding (Fig. 1C).
Since we observed significant positive intercorrelations of PPI and DNA binding with ISD (intrinsic structural disorder) and RSA (relative solvent accessibility) (Table S2), we next asked whether the increase of ωa in protein-protein interaction sites or DNA binding sites was caused by the increase in disorder or site exposure. We calculated and compared ω, ωa and ωna for putative PPI and DNA binding sites with different levels of ISD or RSA. Remarkably, we found that ωa of these binding sites remains similar across different levels of ISD or RSA (Fig. S5AC). These results suggest that PPI or DNA binding in proteins can result in elevated adaptation rates regardless of structural disorder or site exposure. For residues that are not associated with putative PPI or DNA binding, in contrast, we observed an increase in ωa with increasing ISD or RSA (Fig. S5BD), which could be the result of other, as yet unknown, mechanisms. It is also possible that binding sites in disordered regions are not well predicted. However, given that ISD does not show a strong impact on binding sites (Fig. S5AC), we think that inaccuracy in binding site prediction is unlikely to play a significant role.
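The rate decomposition and the correlation step described above can be illustrated with a minimal sketch. It uses only the relations stated in the text (ωa = αω and ωna = ω - ωa) and a plain Pearson correlation; the function names and the toy per-gene values are hypothetical and are not taken from the study's data or scripts.

```python
from math import sqrt

def adaptive_rates(omega, alpha):
    """Return (omega_a, omega_na) with omega_a = alpha * omega and omega_na = omega - omega_a."""
    omega_a = alpha * omega
    return omega_a, omega - omega_a

def pearson_r(xs, ys):
    """Pearson's correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical per-gene values: (omega, alpha, PPI-site ratio)
toy_genes = [(0.05, 0.10, 0.08), (0.12, 0.30, 0.15), (0.20, 0.45, 0.22), (0.35, 0.60, 0.31)]
omega_a = [adaptive_rates(w, a)[0] for w, a, _ in toy_genes]
ppi_ratio = [r for _, _, r in toy_genes]
print(pearson_r(omega_a, ppi_ratio))
```

In practice the same correlation would be computed for each of the 17 gene properties against ω, ωa and ωna in turn.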
To gain a better understanding of adaptation at molecular interaction sites, we further visualized positive selection associated with molecular interactions. We first investigated whether adaptive evolution is associated with particular protein structures or protein families. To do this, we looked into fast-adaptive proteins, defined as those in the top ∼15% of adaptation rates (ωa > 0.15), that are linked to high quality structural models. Interestingly, among these proteins, we found enrichment of 45 proteins with a trypsin-like cysteine/serine peptidase domain and 17 7TM chemoreceptors, suggesting widespread adaptive evolution acting on these protein families or protein domains in D. melanogaster (Table S3). Many of the 7TM chemoreceptors are olfactory and gustatory genes and show adaptive evolution in various species such as Drosophila and mosquito (Hill et al., 2002; Lawniczak and Begun, 2007; McBride, 2007; Wu et al., 2009). In addition to these two protein families, previous studies identified recurrent positive selection acting on other fast-adaptive proteins in Drosophila and mammals, and the possible mechanisms of adaptive evolution have been linked to exogenous ligand binding, for example in serine protease inhibitors (serpin), Toll-like receptor 4 (TLR-4), and cytochrome P450 (Jiggins and Kim, 2007; Slodkowicz and Goldman, 2020).
To visualize the link between adaptive evolution and molecular interactions in the two protein families with frequent adaptive evolution, we mapped significantly positively selected sites and molecular interactions in two representatives: CG10232 for the trypsin-like cysteine/serine peptidase domain and Or67a for the 7TM chemoreceptors. We observed that in both cases, positively selected sites highly overlapped with predicted or inferred binding pockets (Fig. 1D-E). Specifically, in CG10232, we found clusters of positively selected sites around NAG binding sites that are inferred from a crystal structure of serine protease (PDB code: 2XXL) (Fig. 1D), while in Or67a, positively selected sites extend around the putative odorant binding channel formed by helices S1-S6 in extracellular regions (Butterwick et al., 2018) (Fig. 1E).
In addition to these examples that are associated with exogenous ligand or exogenous peptide binding, we also identified two previously undescribed examples where adaptive evolution might be linked to endogenous protein binding: Spatzle (spz, Fig. 1F) and Cul6 (Fig. 1G). Spatzle can bind to Toll-like receptors (TLR) and trigger the humoral innate immune response. We built the missing loop of Spatzle in the crystal structure of the Toll/Spatzle complex (PDB code 4BV4) according to the dimeric crystal structure of Spatzle (PDB code 3E07). In this complex structural model, we observed several positively selected sites at the Toll-4/Spatzle interfaces (Fig. 1F). Cul6, the other example, is a member of the cullin family in D. melanogaster. Cullins are known as scaffold proteins that assemble multi-subunit Cullin-RING E3 ubiquitin ligases by forming an SCF complex with F box and RING-box (Rbx) proteins (Zheng et al., 2002). We constructed the putative Cul6-containing SCF complex by superimposition onto the crystal structure of the Cul1-Rbx1-Skp1-F boxSkp2 SCF ubiquitin ligase complex (Zheng et al., 2002). In the structural model, we observed positively selected sites in Cul6 clustered around the binding sites of RING-box protein, Rbx1, and F-box protein, Skp1 (Fig. 1G).
Frequent adaptive evolution and co-adaptive evolution in genes involved in reproduction, immune system, and environmental information processing
To find out whether specific biological functions were associated with fast-adaptive genes, we applied DAVID GO analysis to genes in the top ∼15% of adaptation rates (ωa > 0.15). The significant GO terms are frequently linked to serine-type endopeptidase activity, reproduction, proteolysis, chemosensation and other related biological functions (Table S4). As these fast-adaptive genes tend to be enriched in similar biological functions, we asked whether these genes evolve co-adaptively, i.e., whether these proteins interact with each other frequently. To test this possibility, we obtained PPIs of D. melanogaster from the STRING database (Szklarczyk et al., 2019) and analyzed protein-protein interactions among fast-adaptive proteins. We found that fast-adaptive proteins tend to interact with each other more frequently than expected (PPI enrichment p-value < 1.0e-16). In the PPI network of fast-adaptive proteins, we observed 7 strongly connected sub-clusters with at least 5 members (Fig. 2A, Table S5). Proteins in these sub-clusters are enriched in biological processes such as reproduction, immune response, defense response to bacterium and virus, RNA interference, and chitin metabolism, which is in line with the GO analysis of fast-adaptive genes (Table S6-S11).
We next asked whether co-adaptation plays a role in the adaptive evolution of interacting proteins to a broader extent, including both fast- and slow-adaptive proteins. To address this question, we analyzed and compared adaptation rates for all high-confidence D. melanogaster PPIs available in the STRING database, and we found that protein partners of fast-adaptive proteins (ωa>0.15) have significantly larger maximum/average ωa than partners of slow-adaptive proteins (Figure 3). We further analyzed and visualized adaptive evolutionary rates of proteins in PPI networks of 9 different biological pathways extracted from KEGG pathways, including immune system, xenobiotics biodegradation, response to environment, aging and development, genetic information processing, sensory system, transport and catabolism, cell growth and death, and metabolism. We observed that, in these PPI networks, proteins with relatively large ωa tend to interact with each other (Figure 4AB). We also noticed that, in pathways previously known as adaptation hotspots, e.g., immune system, fast-adaptive proteins can act as central nodes and co-adaptively evolve with other fast-adaptive proteins (Figure 4AC). In contrast, in pathways such as transport and catabolism, fast-adaptive proteins are mainly at the PPI periphery. In line with these findings, we found that ωa is larger in pathways that harbor fast-adaptive proteins as central nodes than in other pathways (Figure S6).
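As an illustration of what "interacting more frequently than expected" means, the following sketch counts interactions within a gene set and compares the count with randomly drawn gene sets of the same size. This is a simple permutation test chosen for clarity, not the enrichment test that STRING itself reports, and it ignores differences in node degree. The function names, the toy network, and the gene labels are all hypothetical.

```python
import random

def count_internal_edges(edges, members):
    """Count interactions whose two partners both belong to `members`."""
    members = set(members)
    return sum(1 for a, b in edges if a in members and b in members)

def permutation_enrichment(edges, all_genes, gene_set, n_perm=1000, seed=0):
    """Return (observed internal edges, empirical one-sided p-value)."""
    rng = random.Random(seed)
    observed = count_internal_edges(edges, gene_set)
    size = len(set(gene_set))
    pool = sorted(set(all_genes))
    hits = 0
    for _ in range(n_perm):
        # Draw a random gene set of the same size from the background.
        sample = rng.sample(pool, size)
        if count_internal_edges(edges, sample) >= observed:
            hits += 1
    # Add-one correction keeps the empirical p-value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical toy network: a, b, c form a fully connected triangle.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("d", "e")]
toy_genes = list("abcdefghij")
print(permutation_enrichment(toy_edges, toy_genes, ["a", "b", "c"]))
```

A degree-preserving null model would be a stricter alternative, since highly connected proteins inflate the expected number of interactions in any set that contains them.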
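The partner comparison described above can be sketched as follows: for each protein, collect the ωa values of its interaction partners, summarize them by maximum and mean, and group the summaries by whether the protein itself is fast-adaptive (ωa > 0.15). The data structures, function names and toy values are hypothetical, and the statistical test used to compare the two groups is not specified in the text, so none is shown.

```python
from collections import defaultdict
from statistics import mean

def partner_rates(edges, omega_a):
    """Map each protein to the list of omega_a values of its interaction partners."""
    partners = defaultdict(list)
    for a, b in edges:
        # Skip interactions where either partner has no omega_a estimate.
        if a in omega_a and b in omega_a:
            partners[a].append(omega_a[b])
            partners[b].append(omega_a[a])
    return partners

def summarize_partners(edges, omega_a, threshold=0.15):
    """Group (max, mean) partner omega_a by fast- vs slow-adaptive proteins."""
    groups = {"fast": [], "slow": []}
    for protein, rates in partner_rates(edges, omega_a).items():
        key = "fast" if omega_a[protein] > threshold else "slow"
        groups[key].append((max(rates), mean(rates)))
    return groups

# Hypothetical toy data
toy_omega_a = {"A": 0.30, "B": 0.20, "C": 0.05, "D": 0.01}
toy_edges = [("A", "B"), ("C", "D"), ("A", "C")]
print(summarize_partners(toy_edges, toy_omega_a))
```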
Physical interactions contribute to co-adaptation of fast-adaptive genes
Having established that molecular interactions contribute to the adaptive evolution of protein sequences, we then investigated whether these physical molecular interactions could drive protein-protein co-adaptation. To do this, we looked into interacting fast-adaptive protein pairs that are associated with known or inferred complex structural models. For inferred complex structural models, we superimposed the structural models of the pair of proteins onto their high resolution homologous complex structures. Here we illustrate co-adaptation at the PPI interface in two examples: Toll-4/Spatzle and Spn28Db/CG18563 (Fig. 2BC).
Toll-4/Spatzle
Toll-4 is a member of the Toll-like receptor family. Previous studies have shown strong evidence of adaptive evolution of Toll-4 in Drosophila and mammals (Levin and Malik, 2017; Slodkowicz and Goldman, 2020). Toll-4 can bind to Spatzle and trigger further innate immune responses (an interaction inferred with high confidence from the STRING database). In the previous section, we showed that several positively selected sites in Spatzle overlap with Toll-Spatzle interfaces (Fig. 1F). Here, we further showed that, in Toll-4, a considerable number of significantly positively selected sites were located at the interface with Spatzle (Fig. 2B), which is in line with a previous study of Toll-4 in D. willistoni (Levin and Malik, 2017).
Spn28Db/CG18563
Spn28Db is one of the serine protease inhibitors in D. melanogaster that are expressed in male accessory glands, while CG18563 belongs to the protein family with a trypsin-like cysteine/serine peptidase domain. The interaction between the two proteins was predicted with high confidence in the STRING database, and the molecular interactions can be inferred from an existing crystal structure of a serpin and bacteria protease complex (PDB code 1EZX). We observed many positively selected sites at the molecular interface between the two proteins (Fig. 2C), suggesting that physical interactions might play a role in the co-adaptation of the two proteins.
Most clinally differentiated non-synonymous SNPs in protein-coding genes are adaptive
To examine the relationship between short-term adaptation to local environments and long-term adaptive evolution, we extracted residues with significant FST SNPs from clinal variations (Svetec et al., 2016). We then computed evolutionary rates (ω), adaptation rates (ωa) and non-adaptation rates (ωna) of these residues as in the previous section. We observed that these residues have a much higher ratio of adaptation rate to non-adaptation rate than the genome-wide random expectation (Fig. 5A), suggesting that these residues have higher proportions of adaptive changes and that they can be hotspots for adaptive evolution. To find out whether these SNPs are related to even longer-term adaptive evolution, we inferred positively selected sites of each protein-coding gene from phylogenetic data (see Methods). We found that the non-synonymous FST SNPs are significantly enriched in sites under long-term positive selection (Table S12-S13). To further characterize structural and functional properties of short-term genetic variations, we mapped significant non-synonymous FST residues to different structural and functional characteristics, such as ISD, RSA, PPI-sites, DNA-sites and ligand-binding sites. We found that these non-synonymous SNPs were enriched in disordered regions and protein surfaces and were significantly more likely to be involved in protein-protein interactions and ligand binding than expected (Table S14-S18). To better visualize the characteristics of these SNPs, we used Toll-4 as an example. We mapped significant non-synonymous FST SNPs in Toll-4 onto its structural model. We showed that FST SNPs are either positively selected or very close to positively selected sites (Fig. 5BC). For example, the highly differentiated sites N279 (FDR 3e-7) and H431 (FDR 3e-6) were both predicted to be positively selected at a probability of p=0.9, while another highly differentiated site, D424, was close to three positively selected sites, S401 (p=0.8), H431 (p=0.95) and V448 (p=0.8).
We also noticed some differentiated sites that may be located within ligand binding sites, including F297 (FDR 3e-3), S311 (FDR 3e-3), H431 (FDR 3e-6) and H462 (FDR 1e-2).
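Enrichment statements of this kind reduce to a 2x2 contingency table, for example FST sites versus other sites, cross-classified by whether the site is positively selected. The text does not state which test was used for Tables S12-S18, so the sketch below assumes a one-sided Fisher's exact test as one conventional choice; the function names are hypothetical and the counts in the usage line are invented for illustration.

```python
from math import comb

def fisher_enrichment(a, b, c, d):
    """One-sided Fisher's exact test, P(X >= a), for the table [[a, b], [c, d]].

    a: FST sites that are positively selected   b: FST sites that are not
    c: other sites that are positively selected d: other sites that are not
    """
    row1 = a + b          # number of FST sites
    col1 = a + c          # number of positively selected sites
    n = a + b + c + d     # all sites
    upper = min(row1, col1)
    # Hypergeometric tail: sum over tables at least as extreme as the observed one.
    tail = sum(comb(col1, k) * comb(n - col1, row1 - k) for k in range(a, upper + 1))
    return tail / comb(n, row1)

def odds_ratio(a, b, c, d):
    """Sample odds ratio (a*d)/(b*c); undefined when b or c is zero."""
    return (a * d) / (b * c)

# Invented counts, for illustration only
print(fisher_enrichment(3, 1, 1, 3), odds_ratio(3, 1, 1, 3))
```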
Discussion
In this study, we systematically examined the impact of structure- and function-related gene properties on protein sequence evolution and adaptation in the D. melanogaster genome. We found that molecular interactions in proteins contribute to the variation in protein sequence adaptive evolution. A novel finding of this work is that molecular interaction sites, including protein-protein interaction sites and protein-DNA interaction sites, are hotspots for adaptive evolution. We revealed that fast-adaptive proteins tend to interact with each other frequently and that protein partners of these fast-adaptive proteins tend to have higher adaptation rates, suggesting that co-adaptive evolution might be common in D. melanogaster. By looking at interacting fast-adaptive proteins, we further demonstrated that physical interactions may contribute to the mechanisms of co-adaptive evolution of fast-adaptive proteins.
Although our results are in agreement with previous studies on the factors driving protein sequence evolution (Zhang and Yang, 2015), we found some complex correlations of ω, ωa and ωna with protein length and male specificity (Supplement information, section Complex correlations of protein length and male expression level with protein evolutionary rates, Fig. S2-S4, supplement file S2). These complex correlations suggest that caution is needed when interpreting protein length and gene expression levels. For example, gene expression level has been shown to be a major determinant of protein evolutionary rate (Zhang and Yang, 2015) through mechanisms such as the pressure for translational robustness, i.e., robustness to translational missense errors (Drummond et al., 2005). Previous studies have revealed that male biased or female biased genes can be fast evolving (Yang et al., 2016). On the other hand, many male biased genes can be highly expressed in testis, which results in a complex correlation between protein sequence evolutionary rate and male expression level or even mean expression level in D. melanogaster. The unique evolutionary property of these male biased or male specific genes could be caused by the unique transcriptional scanning mechanism in testis (Xia et al., 2020). We propose that tissue specificity might be a better quantity when considering the impact of gene expression profile on protein sequence evolution in D. melanogaster. In addition to male expression level, a similar complex correlation was observed for protein length. It has been suggested that short proteins tend to evolve faster than long proteins, which may be biologically relevant or a byproduct of other factors such as selection on buried and exposed sites (Moutinho et al., 2019). Here, we demonstrated that, in D. melanogaster, although protein length is strongly negatively correlated with protein sequence evolutionary rate, genes that have the slowest evolutionary rates tend to be relatively short.
This could be because genes under essential functional constraints undergo strong purifying selection, while essential genes such as secreted proteins are constrained to be smaller, so that essential genes could be shorter than other genes (Chen et al., 2020).
Protein surfaces and intrinsically disordered regions are frequent targets of adaptive evolution and contribute to the variation in protein sequence adaptive evolution (Afanasyeva et al., 2018; Moutinho et al., 2019); however, the detailed mechanisms underlying these observations remain unclear. One possible explanation is that these regions are frequently linked to intermolecular interactions (Afanasyeva et al., 2018; Moutinho et al., 2019). For example, Moutinho et al. hypothesized that molecular interactions involved in host-pathogen coevolution were the major driver of protein adaptation (Moutinho et al., 2019). Here, we further identified that the proportions of possible molecular interaction sites within proteins contribute to the variation in protein sequence adaptive evolution and that these molecular interaction sites or regulatory sites at the protein surface can be hotspots of protein adaptation. Indeed, some specific molecular interactions have been linked to adaptive evolution in several case studies (Bachtrog, 2008; Hughes and Nei, 1988; Levin and Malik, 2017; Schott et al., 2014) and in large-scale studies based on proteins with high quality structural models (Slodkowicz and Goldman, 2020). In the latter study, the authors showed that positively selected sites in mammals tend to cluster closer to binding sites of exogenous ligands than expected by chance (Slodkowicz and Goldman, 2020), suggesting an important role of functionally important regions in adaptive evolution. Here, we extend this conclusion to the D. melanogaster genome, including proteins with or without high resolution structural models. We also showed that, in addition to exogenous ligands, endogenous ligands might also contribute to adaptive evolution, which might explain why interacting proteins tend to evolve co-adaptively.
Notably, previous studies have revealed that multi-interface proteins tend to evolve more slowly than single-interface proteins (Kim et al., 2006), which seems to contradict our results that proteins with more interaction sites evolve faster and have higher adaptation rates. We argue that this difference arises because we used sequence profiles to predict molecular interaction sites in proteins at a genomic scale, rather than only looking into proteins with high resolution structures. In this way, we may capture many weak or transient interactions, which are thought to evolve faster than obligate and conserved interactions (Mintseris and Weng, 2005). Moreover, we did not exclude intrinsically disordered regions (IDR) or intrinsically disordered proteins (IDP) in our study, which are widespread in the D. melanogaster genome. It has been suggested that IDR/IDP tend to evolve fast due to a lack of structural restraints (Echave et al., 2016). From a functional perspective, IDR/IDP are thought to be promiscuous binders through multiple binding mechanisms, including forming static, semi-static, and fuzzy or dynamic complexes (Uversky, 2019), suggesting that the evolution of IDR/IDP cannot be explained merely by the lack of structural restraints. Indeed, IDP and IDR in the human genome were found to undergo extensive adaptive evolution (Afanasyeva et al., 2018). Finally, it has been recognized that, in addition to allosteric regulation, encounter complexes (Gabdoulline and Wade, 1999) might also play an important role in mediating intermolecular interactions, such as protein-protein association (Tang et al., 2006) and protein-ligand binding (Re et al., 2019). Since the encounter residues responsible for encounter complexes do not reside in conserved binding interfaces, these residues could be under relaxed purifying selection or even positive selection, which could be another, yet to be identified, mechanism that contributes to protein sequence adaptive evolution.
We showed that fast-adaptive proteins are enriched in molecular functions such as reproduction, immunity and environmental information processing (Begun and Lindfors, 2005; Begun and Whitley, 2000; Lazzaro et al., 2004). We further demonstrated that fast-adaptive proteins tend to interact with each other more frequently than expected at random, suggesting that co-adaptation might be common among fast-adaptive proteins. Mechanisms that contribute to this co-adaptation could be: (1) interacting fast-adaptive proteins are often enriched in similar molecular functions and are under similar selective pressure; (2) interacting fast-adaptive proteins undergo co-evolution through physical interactions. In this study we showed two examples in which adaptive evolution could occur at the protein-protein interface, which suggests that physical interactions could contribute to the co-adaptation of fast-adaptive proteins in D. melanogaster. Moreover, we showed that co-adaptation might exist to a broader extent rather than only among fast-adaptive proteins. Specifically, proteins that interact with fast-adaptive proteins tend to have higher adaptation rates. Since molecular interactions contribute to adaptive evolution, it is reasonable to hypothesize that co-adaptation at this broader extent could be mediated by these interactions. Indeed, it has been suggested that interacting proteins tend to have similar evolutionary rates and that the possible mechanism is the co-evolution of physical interactions (Pazos and Valencia, 2008).
In this study, we found that loci with significant genetic variance among populations harbor higher proportions of long-term adaptive changes and that these loci follow similar patterns to adaptive changes, i.e., they are enriched in disordered regions, protein surfaces, and functionally important regions. These results suggest that population differentiation of protein-coding genes can be an important basis for long-term adaptive evolution. In other words, many SNPs are repeatedly selected for adaptation during evolution. Importantly, our results indicate that most of the clinal amino-acid changes are adaptive, suggesting that non-selective forces play only a minor role in the SNPs that show strong geographic differences. Our results also support a large effect of spatially varying selection on protein sequence and structure (Storz and Kelly, 2008).
It should be noted that studies at the genomic scale that aim to uncover the function- or structure-related constraints imposed on protein sequence evolution and adaptation share a common limitation: for most proteins or residues, structural or functional information is incomplete or even missing. To overcome this, we used highly accurate neural-network based tools to predict molecular interactions, secondary structures, intrinsic structural disorder, and relative solvent accessibility for each protein. In this way we were able to identify key factors that impact protein sequence evolution and adaptation in a less precise but more systematic fashion. We hope that, as more curated structural and functional information and complex structural models of proteins become available in the near future, we will be able to uncover the precise role of molecular interactions in protein sequence adaptive evolution.
Material and Methods
dN/dS ratio (ω)
We used a maximum likelihood method to infer the dN/dS ratio (ω) of D. melanogaster protein-coding genes using the genome sequences of five species in the melanogaster subgroup (D. melanogaster, D. simulans, D. sechellia, D. yakuba, and D. erecta). The protein-coding sequences were extracted from the alignments of 26 insects, which were obtained from the UCSC Genome Browser (http://hgdownload.soe.ucsc.edu/downloads.html). The sequences were further processed by GeneWise (Birney et al., 2004) to remove possible insertions and deletions, using the longest isoforms of the corresponding D. melanogaster protein sequences as references (FlyBase version r6.15) (Thurmond et al., 2019). The processed sequences were then realigned with PRANK using the -codon option (Löytynoja, 2014). We used codeml in PAML (Yang, 2007) to compute gene-specific ω under the M0 model. We removed sequences in which more than 15% of nucleotides were not aligned (gaps) to the D. melanogaster gene in more than 2 species. To further avoid numeric errors and ensure reasonable estimates, we only retained sequences that were either (1) divergent, with dS larger than 0.3, or (2) less divergent, with dS larger than 0.1 and dN smaller than 0.001 (dS>>dN). In total, 12,118 genes, containing 6,538,872 amino acids, passed all the criteria and were assigned gene-specific ω. We also calculated site-specific ω by using likelihood ratio tests (LRT) comparing the M7 model against the M8 model (Yang et al., 2005).
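The two retention filters described above can be written as simple predicates. This is a sketch under stated assumptions: the divergence criteria (1) and (2) are read as alternatives (a gene passes if it meets either), the gap fractions are assumed to be computed per non-reference species beforehand, and the function names are hypothetical.

```python
def passes_alignment_filter(gap_fractions, max_gap=0.15, max_bad_species=2):
    """Keep a gene unless more than `max_bad_species` species exceed the gap limit.

    gap_fractions: fraction of nucleotides not aligned to the D. melanogaster
    gene, one value per non-reference species.
    """
    bad_species = sum(1 for g in gap_fractions if g > max_gap)
    return bad_species <= max_bad_species

def passes_divergence_filter(dn, ds):
    """Keep divergent genes (dS > 0.3) or less divergent ones with dS > 0.1 and dN < 0.001."""
    return ds > 0.3 or (ds > 0.1 and dn < 0.001)

# Hypothetical example: two species above the gap limit, moderately divergent gene
print(passes_alignment_filter([0.20, 0.18, 0.02, 0.00]), passes_divergence_filter(0.05, 0.5))
```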
Rate of adaptive and nonadaptive changes
We recalled all SNPs of 205 inbred lines from the Drosophila Genetic Reference Panel (DGRP), Freeze 2.0 (Huang et al., 2014) (http://dgrp2.gnets.ncsu.edu). We then generated 410 alternative genomes using all monoallelic and biallelic SNP data sets. We extracted the coding sequences of D. melanogaster genes from the generated alternative genomes and removed all possible insertions and deletions using GeneWise (Birney et al., 2004) as described above. We then aligned all the coding sequences to their corresponding aligned CDS sequences using PRANK with the -codon option (Löytynoja, 2014). We removed polymorphisms segregating at frequencies smaller than 5% to reduce the contribution of possible slightly deleterious mutations (Charlesworth and Eyre-Walker, 2008). To avoid possible effects of low divergence between D. simulans and D. melanogaster (Keightley and Eyre-Walker, 2012), we used D. yakuba as the outgroup to estimate nonsynonymous polymorphisms (Pn), synonymous polymorphisms (Ps), nonsynonymous substitutions (Dn), and synonymous substitutions (Ds) with MK.pl (Begun et al., 2007; Langley et al., 2012). Similar to Begun et al. (Begun et al., 2007), we only analyzed genes with at least six variants for each of substitutions, polymorphisms, nonsynonymous changes, and synonymous changes. We used an extension of the MK test, the asymptotic MK test (Messer and Petrov, 2013; Uricchio et al., 2019), to estimate the proportion of adaptive changes (α). The rate of adaptive changes (ωa) was then calculated as ωa = ωα and the rate of nonadaptive changes as ωna = ω - ωa. Details of the asymptotic MK test are as follows:
(1) Classical McDonald–Kreitman test. According to Smith and Eyre-Walker (Smith and Eyre-Walker, 2002), the proportion of adaptive changes for protein-coding genes can be calculated as follows:
α = 1 - (Ds × Pn)/(Dn × Ps)
Using this equation, we estimated the proportion of adaptive changes and carried out the classical MK test by applying Fisher’s exact test.
(2) Asymptotic estimation of α. A known problem of the classical estimation of α above is the accumulation of slightly deleterious mutations at low frequencies. We therefore used an extension of the MK test, the asymptotic MK (aMK) test (Messer and Petrov, 2013), to estimate the proportion of adaptive changes. As in the original aMK, we defined α(x) as a function of derived allele frequency (x): α(x) = 1 - (Ds/Dn) × (Pn(x)/Ps(x)), where Pn(x) and Ps(x) are the numbers of nonsynonymous and synonymous polymorphisms at frequency x, respectively. However, the original approach may suffer from numeric errors when there are very few polymorphic sites, which is common for many D. melanogaster genes. To make the estimations more robust while preserving the same asymptote, we further defined Pn(x) and Ps(x) as the total numbers of nonsynonymous and synonymous polymorphisms above frequency x, as described in Uricchio et al. (Uricchio et al., 2019). We fitted α(x) to an exponential curve of the form α(x) ≈ exp(-bx)+c using lmfit (Newville and Stensitzki, 2018) and took the asymptotic value of α as x approaches 1.0. We then estimated the rate of adaptive changes (ωa) as ωa = dNa/dS, where Na is the number of adaptive changes and dNa=Na/LN is the number of adaptive changes per nonsynonymous site. Finally, we calculated the rate of nonadaptive changes (ωna) as ωna=ω-ωa. The final dataset contains 7,192 protein-coding genes, with the smallest ωa being 0.00 and the largest being 1.29.
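The quantities in steps (1) and (2) can be illustrated with a short, standard-library-only sketch. It is not the authors' code: all function names are hypothetical, the inclusive threshold (frequency >= x) for the cumulative counts is an assumption, and the exponential fit that the authors performed with lmfit is not reproduced here.

```python
from math import comb


def mk_alpha(Dn, Ds, Pn, Ps):
    """Classical MK estimate: alpha = 1 - (Ds * Pn) / (Dn * Ps)."""
    return 1.0 - (Ds * Pn) / (Dn * Ps)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]],
    e.g. [[Dn, Ds], [Pn, Ps]]. Sums the probabilities of all tables that
    are no more likely than the observed one."""
    r1, r2, c1, n = a + b, c + d, a + c, a + b + c + d

    def prob(k):
        return comb(r1, k) * comb(r2, c1 - k) / comb(n, c1)

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    total = sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-7))
    return min(1.0, total)


def cumulative_alpha(Dn, Ds, nonsyn_freqs, syn_freqs, thresholds):
    """alpha(x) with Pn(x), Ps(x) counted cumulatively at or above x.

    Assumption: the threshold is inclusive; thresholds with Ps(x) == 0
    are skipped because alpha(x) is undefined there."""
    points = []
    for x in thresholds:
        pn = sum(1 for f in nonsyn_freqs if f >= x)
        ps = sum(1 for f in syn_freqs if f >= x)
        if ps == 0:
            continue
        points.append((x, 1.0 - (Ds / Dn) * (pn / ps)))
    return points


def adaptive_rates(omega, alpha):
    """omega_a = omega * alpha and omega_na = omega - omega_a."""
    omega_a = omega * alpha
    return omega_a, omega - omega_a
```

The list returned by `cumulative_alpha` is the input one would pass to a curve-fitting routine to obtain the asymptote; the choice of fitting library is left open in this sketch.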
Structure-/function-related properties of D. melanogaster proteins
We obtained the function-related properties mentioned in the main text as follows. We obtained D. melanogaster gene ages from previous studies (Kondo et al., 2017; Zhang et al., 2010) for genes that are specific to Drosophila, and from GenTree (Shao et al., 2019) for genes that are beyond the Drosophila clade. We then assigned a pseudo-age to each gene. Specifically, there are 11 age groups, from “cellular organisms”, assigned a pseudo-age value of 0, to “melanogaster”, assigned a pseudo-age value of 10. We downloaded D. melanogaster protein-protein interactions (PPI) from the STRING database (Szklarczyk et al., 2019). A cut-off of combined score larger than 0.7 was used to retain high-confidence PPI for further analysis. We then used BSpred (Mukherjee and Zhang, 2011) to predict PPI sites and DRNApred (Yan and Kurgan, 2017) to predict DNA binding sites. For each protein, we calculated the ratio of protein interaction residues (PPI-site ratio) and the ratio of DNA binding residues (DNA-site ratio) by dividing the total numbers of predicted protein interaction sites and DNA binding sites by protein length, respectively. For structure-related properties, we used DeepCNF (Wang et al., 2016) to predict, for each gene, three-state secondary structures (helix, sheet, and coil), structural disorder, and relative solvent accessibility (RSA). Further, we calculated the ratios of helix, sheet, helix+sheet, and coil residues of each gene from the predicted secondary structures. For each gene, we computed intrinsic structural disorder (ISD) and RSA as the protein-length-normalized sums of the probabilities of each residue being disordered and exposed, respectively.
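The per-protein summaries described above reduce to a few length-normalized counts and sums. The sketch below is illustrative only: the function names, the boolean per-residue site flags, and the H/E/C letters for helix, sheet, and coil are assumptions, not the output formats of BSpred, DRNApred, or DeepCNF.

```python
def site_ratio(site_flags):
    """Fraction of residues predicted as binding sites (PPI or DNA).

    site_flags: one boolean per residue (hypothetical input layout).
    """
    return sum(1 for flag in site_flags if flag) / len(site_flags)


def length_normalized_score(probabilities):
    """Protein-level ISD or RSA: sum of per-residue probabilities
    (of being disordered or exposed) divided by protein length."""
    return sum(probabilities) / len(probabilities)


def secondary_structure_ratios(ss):
    """Ratios of helix, sheet, helix+sheet, and coil residues.

    Assumption: `ss` is a string with one of 'H', 'E', 'C' per residue.
    """
    n = len(ss)
    helix, sheet, coil = ss.count("H") / n, ss.count("E") / n, ss.count("C") / n
    return {"helix": helix, "sheet": sheet, "helix+sheet": helix + sheet, "coil": coil}


# Example with a toy 10-residue protein
ratios = secondary_structure_ratios("HHHEECCCCC")
isd = length_normalized_score([0.2, 0.4, 0.9])
```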
Gene expression patterns
We downloaded gene expression profiles from FlyAtlas2 (Leader et al., 2018). We converted FPKM to TPM by normalizing the FPKM of each gene against the sum of all FPKMs as follows: TPMi = (FPKMi / Σj FPKMj) × 10^6. After TPM conversion, we only retained genes with expression levels larger than 0.1 TPM for further analysis. We treated male and female whole-body TPM as male and female expression levels, and calculated the mean expression level by averaging male and female TPM. We used a Z-score to describe the male specificity of D. melanogaster genes. We calculated the tissue specificity of genes using tau values (Yanai et al., 2005) based on the expression profiles of 27 different tissues.
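As a rough illustration of the two expression summaries named above, the following sketch converts FPKM to TPM within one sample and computes the tau tissue-specificity index of Yanai et al. (2005). Function names are hypothetical, and applying tau to untransformed expression values is an assumption; the text does not state whether values were log-transformed first.

```python
def fpkm_to_tpm(fpkm):
    """Rescale FPKM values of one sample so that they sum to one million.

    fpkm: dict mapping gene identifier to FPKM (hypothetical layout).
    """
    total = sum(fpkm.values())
    return {gene: value / total * 1e6 for gene, value in fpkm.items()}


def tau(expression):
    """Tissue specificity: tau = sum(1 - x_i / x_max) / (n - 1).

    Ranges from 0 (uniform expression) to 1 (expressed in one tissue).
    """
    x_max = max(expression)
    if x_max <= 0:
        return 0.0  # gene not expressed anywhere; tau is undefined, report 0
    n = len(expression)
    return sum(1.0 - x / x_max for x in expression) / (n - 1)


tpm = fpkm_to_tpm({"geneA": 1.0, "geneB": 3.0})
broad = tau([5.0, 5.0, 5.0])      # uniformly expressed
specific = tau([0.0, 0.0, 10.0])  # restricted to a single tissue
```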
High quality 3D structures of D. melanogaster proteins
We downloaded high-quality structures or structural models of D. melanogaster proteins from the Protein Data Bank (PDB) (Burley et al., 2019), SWISS-MODEL Repository (Bienert et al., 2017), and MODBASE (Pieper et al., 2011), in descending order of priority. For example, if 3D structures of the same protein or protein region were available in multiple databases, we first considered high-resolution structures from PDB; if no structures were found in PDB, we then considered SWISS-MODEL Repository, and lastly MODBASE. In addition, we used blastp (Camacho et al., 2009) to search for homologs of each D. melanogaster protein among all PDB sequences with an E-value threshold of 0.001. We further carried out comparative structural modeling using RosettaCM (Song et al., 2013) to build high-quality structural models of proteins or protein regions that were not available in PDB, SWISS-MODEL Repository, or MODBASE. For each RosettaCM simulation, we used no more than the 5 most significant hits from the blastp search. For proteins in complex forms, we only extracted monomers for further analysis. Finally, we obtained 14,543 high-quality structural models, corresponding to 11,284 genes. These structural models contain 2,691,913 unique amino acids, 41.2% of all the residues in genes that were assigned ω.
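The source-priority rule described above amounts to picking the best-ranked candidate per protein region. A minimal sketch, assuming each candidate is a dictionary with hypothetical `source` and `model_id` keys, and placing RosettaCM models last because they were built only for regions missing from the three databases:

```python
# Lower rank means higher priority; the labels are illustrative.
SOURCE_PRIORITY = {"PDB": 0, "SWISS-MODEL": 1, "MODBASE": 2, "RosettaCM": 3}


def pick_structure(candidates):
    """Return the candidate from the highest-priority source, or None."""
    if not candidates:
        return None
    return min(candidates, key=lambda c: SOURCE_PRIORITY[c["source"]])


chosen = pick_structure([
    {"source": "MODBASE", "model_id": "model_1"},
    {"source": "SWISS-MODEL", "model_id": "model_2"},
])
```

Ties within one source (for example, several PDB entries of differing resolution) would need an additional criterion that this sketch does not model.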
Evolutionary rates of different structural/functional sites
We classified amino acids into different classes of structural/functional properties. Specifically, we defined three classes for both ISD and RSA according to the probability of residues being disordered or exposed: ordered or buried (0.00 to 0.33), medium (0.33 to 0.67), and disordered or exposed (0.67 to 1.00). For both PPI and DNA binding, we defined two classes: PPI-site or DNA-site (binding sites), and None-PPI or None-DNA (the corresponding null sites for PPI or DNA binding). For residues with 3D structures, we used STRESS (Clarke et al., 2016) to predict putative ligand binding sites and allosteric sites from all the high-quality structures or structural models. The allosteric sites were further classified as surface critical or interior critical according to their locations. We then classified these residues into four groups: LIG (ligand binding sites), Surf. Crit. (surface critical sites), Interior Crit. (interior critical sites), and Others (other sites). For each site class, we randomly sampled 100 sequences, each containing 10,000 amino acids. We computed ω, ωa, and ωna for the randomly sampled sequences following the steps described in the above sections.
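The binning and resampling steps above can be sketched as follows. The function names are hypothetical, and two details are assumptions because the text does not specify them: how the boundaries at 0.33 and 0.67 are assigned, and whether sites are sampled with replacement (used here so that classes with fewer than 10,000 sites can still be sampled).

```python
import random


def three_state_class(p, labels=("low", "medium", "high")):
    """Bin a per-residue probability into three classes.

    For ISD the labels would read ordered/medium/disordered, for RSA
    buried/medium/exposed. Assumption: lower bounds are inclusive.
    """
    if p < 0.33:
        return labels[0]
    if p < 0.67:
        return labels[1]
    return labels[2]


def sample_site_sets(site_ids, n_samples=100, sample_size=10000, seed=0):
    """Draw `n_samples` pseudo-sequences of `sample_size` sites each.

    Assumption: sampling is with replacement; a fixed seed keeps the
    draws reproducible.
    """
    rng = random.Random(seed)
    return [rng.choices(site_ids, k=sample_size) for _ in range(n_samples)]


# Toy example: 3 pseudo-sequences of 20 sites drawn from 50 site identifiers
samples = sample_site_sets(list(range(50)), n_samples=3, sample_size=20)
```

Each sampled set would then be treated as one concatenated sequence for estimating ω, ωa, and ωna.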
Author contribution
J.P. and L.Z. conceived the study. J.P. performed the analysis with input from L.Z. J.P. and L.Z. wrote the manuscript.
Funding
The work was supported by NIH MIRA R35GM133780, the Robertson Foundation, a Monique Weill-Caulier Career Scientist Award, an Alfred P. Sloan Research Fellowship (FG-2018-10627), a Rita Allen Foundation Scholar Program, and a Vallee Scholar Program (VS-2020-35) to L.Z. J.P. is supported by a C. H. Li Memorial Scholar Fund Award at The Rockefeller University.
Declaration of interests
The authors declare no competing interests.
Acknowledgement
We thank members of the Zhao Lab for helpful discussions.