ABSTRACT
The mammalian gene Brain-Derived Neurotrophic Factor (BDNF) is an essential mediator of brain assembly, development, and maturation which has been implicated in a variety of brain disorders such as neurodevelopmental disorders (e.g. autism spectrum disorder), neuropsychiatric disorders (e.g. depression, PTSD, schizophrenia), and neurodegenerative disorders (e.g. Parkinson’s). Loss of BDNF during early development is embryonic lethal, and depletion of BDNF during adolescence or adulthood can result in disease-related neuropathology across a broad range of model organisms. In order to better understand the role of BDNF in disease, we seek to provide an evolutionary context to BDNF’s role within the brain by elucidating the molecular and genetic comparative history of BDNF across species. We conduct sequence alignment and phylogenetic reconstruction of the BDNF gene across a diverse selection of over 160 mammalian species spanning ∼177 million years of evolution. Selective evolutionary change was examined via several independent computational models of codon evolution including FEL, MEME, and BGM. We report strict purifying selection in the main functional domain of BDNF (NGF domain, essentially comprising the mature BDNF protein). Specifically, we discover 6 sites in our homologous alignment which are under episodic selection in the early regulatory region of BDNF (i.e. the prodomain) and 23 pairs of coevolving sites that are a part of complex spatial relationships that are distributed across the entire BDNF gene. Thus, we propose that our discovery of both local and distal sites of co-evolution within the pro- and mature-domains of BDNF that likely reflect the evolutionary fine-tuning of BDNF’s unique and complex regulatory capacities whilst also retaining it’s core yet diverse ontogenic functionality within the central nervous system. This discovery consequently supports the idea that the BDNF prodomain is more prone to change than the mature domain, however the fact that this region has also been subject to negative purifying selection also highlights genetic sensitivity and thus partially explains the prodomain’s disease relevance (e.g. Val66Met and other variants) to numerous neuropsychiatric disorders.
HIGHLIGHTS
We extracted coding sequences for Brain-Derived Neurotrophic Factor (BDNF) from over 160 mammalian genomes that span approximately ∼177 million years of evolution.
We observe strict purifying selection in the main functional domain (NGF) of the BDNF gene in mammals.
We observe novel results with 6 sites in our homologous alignment which are under episodic selection in the early regulatory region of BDNF (i.e. the prodomain).
We observe 23 pairs of coevolving sites within BDNF. Many of which are a part of complex spatial relationships and are distributed across the entire BDNF gene.
These data define exactly how “BDNF is highly conserved” by defining exactly where and how the mammalian BDNF has evolved, confirming the widespread belief that the BDNF prodomain is more prone to change than the mature BDNF protein.
INTRODUCTION
Brain-Derived Neurotrophic Factor (BDNF) is an ubiquitously studied molecule in modern neuroscience [1]. BDNF is a neurotrophin that binds with high affinity to its cognate tyrosine kinase receptor, TrkB [2], to elicit rapid induction of synaptic plasticity [3–5] and neuronal spine remodeling [6, 7]. Additionally, BDNF has been implicated in a variety of brain disorders [1], including depression [8–10], PTSD [11–14], schizophrenia [9, 15–17], Parkinson’s disease [18, 19], and autism spectrum disorders [20–22] amongst many more. BDNF has correspondingly been the primary target, or an ancillary factor, of many novel therapeutics including small molecule mimetics [23, 24] and existing drugs (e.g. antidepressants [25, 26]). Yet, nascent research has provided the humbling reminder that much remains to be discovered about BDNF. In recent years, new BDNF ligands have been discovered [27], new receptor interactions unveiled [27, 28], and mechanisms of behavioral function unlocked [7]. This is a timely reminder that while BDNF has remained a seminal molecule of interest across the broader neuroscience literature, much remains to be discovered about its origins, evolution, function, and disease relevance.
A Primer of the Molecular Biology of BDNF and its Functional Topology
BDNF is encoded by the BDNF gene [29], whose expression is regulated in humans by an antisense-gene (BDNF-AS) that can form RNA-duplexes to attenuate translation [30]. Thus, the natural antisense for BDNF is capable of directly downregulating endogenous expression on demand [31]. The BDNF gene in humans comprises 11 exons [30] and can produce at least 17 detectable transcript isoforms [29]. Different transcripts are induced in response to activity and/or cellular states, allowing the BDNF gene to adjust to environmental stimuli and potential selection pressures. However, all transcripts ultimately yield a singular preproBDNF protein that (prior to intracellular processing, cleavage, and transport) can be partitioned into three domains [11, 29]: a signal peptide, a prodomain, and the mature domain. The signal domain is only 18 amino acid residues long, possessing ambiguously defined functionality, and the majority of BDNFs functional outputs reflect sequence specificity to the prodomain and mature domain. The BDNF prodomain encodes binding sites for intracellular transport of both BDNF mRNA [32] and BDNF protein [33], and contains numerous post-translational modification sites [29]. The BDNF prodomain is also the resident location of a widely studied Single Nucleotide Polymorphism (SNP) in neuroscience (Val66Met, or rs6265) [1], and the Furin consensus sequence (Arg 125) for cleavage to its mature form (including by plasmin [34]). The prodomain is composed of 110 amino acids within the N-terminus, and must be processed via proteases to generate mature BDNF [5]. The mature domain of BDNF is composed of, almost exclusively, the Nerve growth factor (NGF) domain and is responsible for the canonical trophic actions associated with BDNF, e.g., long-term potentiation, rapid acting antidepressant effects. Following intracellular handling, processing, and transport, the preproBDNF isoform is cleaved to yield the mature BDNF peptide (which only contains the mature NGF domain). For many years the prodomain was thought to be cleaved following transport and thus destined for degradation. However, recent work has shown that the cleaved prodomain can be secreted and bind as a ligand to novel receptors (e.g., SorCS2) [27]. Thus, the BDNF prodomain can accordingly influence brain circuits as well as behavior [7]. For a comprehensive, detailed, analysis of the various intricacies of the BDNF gene, protein, and its regulation, more information is provided in [29].
The Conservation of BDNF & Neurotrophins
One of the interesting curiosities surrounding BDNF is its relationship to other neurotrophic (NT) growth factors, comprising NGF, NT-3, NT-4. Thus, neurotrophins retain some intercalated functionality. Each share some commonalities in structure (pre-, pro-, and mature-domains) [29], post-translational modification potential (e.g. glycosylation [35]), as well as catalytic processing, trafficking, and composition [36]. Specifically, neurotrophins share approximately 50% sequence homology [29], and a comparison of domains and motifs reveals that each comprises a prototypic NGF-domain as the principal component of the mature pro-growth peptide for each factor (see PFAM database [37]). While each neurotrophin elicits functionality via binding to cognate receptors, neurotrophins also exhibit cross-affinity amongst neurotrophin receptors [38] presumably due to their high rates of structural homology. Not surprisingly then, there is some redundancy in the pro-trophic effects of neurotrophins, yet each still maintains nuanced functionality which remains specific to each factor during central nervous system development [39]. Differences in the evolution and temporal dynamics of regulatory sequences, which target gene-products to specific destinations within cell-compartments (e.g. dendrites) [40] or to processing routes (e.g. the activity-dependent release pathway) which alter secretory dynamics and/or bioavailability [41], likely contribute to both similarities and differences between neurotrophins. However, almost nothing is known of how the BDNF prodomain has evolutionarily adapted to specifically regulate BDNF dynamics. While evolution has almost certainly shaped the sequences, structure, and function of BDNF, the modeling of such remains relatively unexplored but could provide important insight into the phylogenetic evolutionary history of BDNF, it’s selection pressure sensitivity across lineages, and quantitative metrics of evolutionary change across species.
Purpose of this Study
Here, we use computational methods to explore the comparative evolutionary genomics of the neurotrophic factor BDNF. By reconstructing phylogenetic trees of BDNF in mammals (Mammalia), we utilize sequence alignments of over 160 species to determine unique genomic attributes of BDNF. In specific we investigate which sites in BDNF are subject to pervasive (i.e., consistently across the entire phylogeny) diversifying selection (FEL) or pervasive/episodic (i.e., only on a single lineage or subset of lineages, diversifying selection (MEME). Likewise, utilizing multiple models for the inference of selective pressure and the evaluation of evolutionary change, we identify novel sites within the BDNF prodomain and mature peptide coding regions that are susceptible to synonymous and nonsynonymous changes. Additionally we investigate which sites in BDNF may be coevolving (BGM). Taken together, these computational evolutionary analyses provide an important context as to the origins and sensitivity of genetic changes within the BDNF gene, which may be important for providing insight into genetic risk factors linked to disease in humans.
RESULTS
We find that unique evolutionary pressures have shaped the BDNF gene across time. Mostly, these forces have operated through strict purifying selection. Of note, BDNF elicits tight regulation and specific functionality that can be separated from other neurotrophins, yet these growth factors remain closely related in their structure and sequence, especially in the conserved NGF domain.
Evolutionary History of Mammalian BDNF
Prior to conducting our primary evolutionary analysis, we ported our mammalian species into a platform (timetree.org, see [42, 43]) to examine the epoch events that may have influenced the analysis described here. This was an important pre-analysis step to frame the age of our genomes, and the broad-strokes evolutionary pressures that these species have been exposed to (which, in theory, could contribute to subsequent purifying selection and coevolution analyses). As expected, this revealed BDNF as an ancient gene that has been preserved throughout the mammalian lineage and has both survived and been shaped under all major evolutionary events of the past ∼177 million years (data not shown). For species within our data-set we identified several examples of species-level evolutionary epochs that cross-referenced with major earth events (e.g. bottleneck events) that have historically been believed to drive evolutionary adaptation. This included major geologic periods that are cross-referenced against earth impacts, oxygenation changes across time, atmospheric carbon dioxide concentrations, and solar luminosity. This indicates that even under extreme evolutionary pressures, the BDNF gene has exhibited (relatively speaking) very specific adaptation events (see results below) over millions of years in mammals. This tracks with the idea that “old genes” tend to be highly conserved, evolve more slowly, and therefore are more likely to exhibit both specific and selective changes as opposed to more dramatic permutations (e.g. gene duplications etc.).
Predominant Purifying Selection in BDNF
A common approach to gain an increased understanding of the evolutionary forces that have shaped proteins is to measure the omega ratio ω consisting of the non-synonymous (β or dN) and synonymous (α or dS) substitution rates with ω = β/α for each site in a particular gene of interest [47]. We define two major substitional changes for the amino acid being coded for at each site: synonymous changes, which keep the same amino acid coded for at a particular site and nonsynonymous changes, which change the amino acid coded for at that particular site. Non-synonymous changes can have strong influences on the structural, functional, and fitness measures of an organism. This is in contrast to synonymous changes which leave the amino acid at a particular site unchanged, but can confer weak fitness effects through the emergent properties of codon usage bias, mRNA structural stability, translation and tRNA availability. However, synonymous changes are typically understood to represent neutral selection acting on coding sequences, and provides a baseline rate against which non-synonymous evolutionary rates can be compared. The omega ratio ω of relative rates of non-synonymous and synonymous substitutions is a common measure in evolutionary biology of the selective pressure acting on protein coding sequences. These estimates provide increased information availability as to the type of selection (positive, with omega > 1 or negative, with omega < 1, or neutral with omega = 1) that has acted upon any given set of protein-coding sequences.
As FEL analysis is a sensitive measure of negative (purifying) selection, for our FEL analysis, we observe a predominant amount of purifying selection (over 66% of sites, 174 sites out of 261; Table S1) in our recombination free alignment for BDNF. The dN/dS estimates for the entire alignment were plotted including 95% lower- and upper-bound estimates (see Figure 1 or Table S1). Overwhelmingly, the mature NGF-domain of the BDNF exhibited evidence of greater pervasive negative purifying selection relative to the prodomain region of BDNF. Thus over the evolutionary history in mammalia, negative selection has predominantly occurred in the regions of BDNF that encode the functional mature protein that binds TrkB to elicit neurotrophic effects. The mature domain of BDNF has exhibited remarkable conservation across innumerous epochs defined by rapid evolutionary adaptation (see Figure 1) in other genes and species.
Specific Sites that are Evolving Non-Neutrally
To examine specific sites for episodic adaptive evolutionary selection, we utilized an algorithm known as MEME which is fundamentally similar to our FEL analysis (described above) except that it applies a more sensitive method for the detection of both pervasive (persistent) and episodic selection (transient selection occurring only on one or a subset of branches in the phylogenetic tree) as compared to only pervasive selection which occurs across all branches of the phylogenetic tree. Essentially, only a subset of the lineages (i.e. species) are affected allowing for a more granular/sensitive method of detecting selection (whereas FEL is better geared towards broad changes). This analysis revealed that for all sites, only 2.3% (6 of 261; see Table 2) exhibit evidence for episodic diversifying selection (i.e. positive selection) in at least one branch within the phylogeny. Spatially, these mutations occur outside of the NGF functional region of BDNF. Further, this result is essentially relevant as the MEME analysis is a sensitive measure of episodic selection. The sites we observe as statistically significant are as follows: 26, 27, 30, 38, 249, 254. For comparison, these specific sites were re-aligned to the respective human sites with indel (insertion/deletion) events accounting for any respective discrepancy in specific site numbers. When mapping these sites to the human BDNF coordinate system, they correspond to: 26, 27, 29, 36, 238, 240, respectively.
Evidence of Coevolutionary Forces
To examine the coevolution of sites, i.e. if one particular amino acid was evolving in-tandem with another, we subjected our protein-coding gene sequences to the BGM algorithm which leverages Bayesian graphical models [51]. The BGM algorithm infers substitution history through the use of maximum-likelihood analyses for ancestral sequences and maps these to the phylogenetic tree, which allows for the detection of correlated patterns of substitution [51]. For our BGM analysis, we find evidence for 23 pairs of coevolving sites. This suggests interaction dynamics in tertiary space of the 3D, folded, protein level (see relevant sites in Figure 2) BDNF protein structure. Or, otherwise, is evidence for coevolving sites due to other fitness consequences related to compensation for maladaptive changes in one part of the protein sequence that may have occurred. When we review these sites, we notice that several pairs (see Figure 2) occur within alignment sites which correspond to the Human BDNF coordinate system (Table 2). These include sites (89, 184), (94, 155), (103, 233), (135, 154). Of note, several other sites also display interesting geometric features including triangular relations: [(81, 93), (93, 98), (81, 98)], an acyclic graph network of site connections [(70, 74), (74, 94), (94, 155)] and [(25, 49), (49, 85), (49, 86)], and more complex double linked co-evolutionary sites: [(39, 103), (103, 233)], and triple linked co-evolutionary sites: [(30, 119), (33, 119), (91, 119)]. Additionally, three-dimensional reconstruction - here focusing on a specific heterodimer configuration of BDNF and NT-4 as an example of a spatial protein-protein interaction - highlights that coevolving sites as well as positively evolving sites are likely to have been “fine tuned” over time to help support BDNF’s cognate functionality (see Figure 3). Mapping our FEL purifying sites in a structural configuration was not shown due to the overwhelming nature of negative selection acting on BDNF within mammals.
DISCUSSION
In this study, we explore the evolutionary history of the BDNF gene in Mammalia. The BDNF gene is implicated in a number of human diseases including a variety of brain disorders such as neurodevelopmental disorders (e.g. autism spectrum disorder), neuropsychiatric disorders (e.g. depression, PTSD, schizophrenia), and some neurodegenerative disorders [1]. By using orthologous BDNF sequences within the Mammalia taxonomic group, our results indicate that within species, unique evolutionary pressures and site-specific changes within the BDNF gene have evolved across time. We performed a number of comparative evolutionary analyses to tease out signals from our orthologous gene collection in BDNF. Of note, the BDNF gene elicits tight regulation and specific functionality that can be separated from other neurotrophins, yet these growth factors remain closely related in their structure and sequence and conservation of the NGF functional domain. In the NGF domain, we observe a high degree of conservation (via purifying selection) across species, owing to the functional importance of this region in protein-protein interactions. This work additionally provides broad comparative insights into the evolutionary history of the BDNF gene family. Our MEME method identified novel substitutions (see Table 2) in regions of BDNF that may provide significant areas of interest for designing molecular therapeutic approaches, and their potential broader significance are outlined in further detail below.
Predominant purifying selection across BDNF in Mammalia
Over time, evolution drives the divergence of genetic sequences. What can we learn from the direct comparison of the sequences of the BDNF gene in Mammalia? By comparing the BDNF products of orthologous genes in different species we observe the accumulation of mutations at different sites with varying degrees of insight into both BDNF functionality (see [29] for site annotation) and potential disease [1]. These are summarized in-full within Table S1 and Table 2. Coding sequences with highly constrained structures are expected to fix non-synonymous mutations at a slower rate due to the maladaptive nature of changes such as what we observe with the FEL negatively selected sites across BDNF. Additionally, we observe a high degree of negative (purifying) selection across the main functional domain (NGF) of BDNF. While structures for the NGF domain in most species under analysis do not exist, based on our findings we expect a highly conserved tertiary structure. Based on the high degree of purifying selection observed across BDNF, we hypothesize that BDNF plays a critical role in the underlying network of genes governing homeostasis and normal organismal development. This may have happened because BDNF is particularly useful specifically for the phylogenetic branch in question (i.e. mammals). This interpretation also tracks with the observation that BDNF is essential for normative development and is basically lethal in non-conditional full knock-out mammalian models.
Non-Neutral Positive Diversifying Evolution Sites in the BDNF Gene
It has been described that BDNF particularly plays a role as a foundational gene for brain development [11]. Despite a significant level of purifying selection shaping the evolutionary history of BDNF (Figure 1), we observe several novel statistically significant sites under positive episodic diversifying selection across the BDNF gene (see Table 2). Traditionally, evolution of this variety consists of amino acid diversifying events that may promote phylogenetic adaptation and/or functionality. These results are entirely novel - they have not been previously reported (to the best of our knowledge) and MEME is an established and sensitive method for the analysis of episodic diversifying sites. Thus, the very specific and limited sites within BDNF to exhibit such patterns is a highly promising result from which to further disentangle BDNFs complex functionality and disease linkage. Presumably, we would encourage biologists to consider these sites as those that may contain important adaptive functions within the BDNF gene. However, where our results fall within the context of a core protein-protein interaction network of required genes for neural cellular diversity and development is yet to be determined. We do note that at least one identified site (238) overlaps with potential post-translational modifications to the human BDNF peptide (specifically, a disulfide bridge; see UniProt and [29]). This supports the idea that non-neutral positive diversifying sites within BDNF are not spurious and likely reflect specialized, regulatory, or functional capacities that may have yet to be annotated in-full. Given that this manuscript is devoted to analysis of BDNF’s evolution in mammals, we highlight the potential importance of these sites but emphasize that their importance remains a hypothesis that should be tested in well-defined experiments under controlled laboratory conditions.
Discovery of Proximal & Distal Coevolving Site-Pairs in the BDNF Gene
Another novel, and potentially important, series of findings in this manuscript was the presence of numerous sites that exhibit coevolution. In fact, we observe a significant number of coevolving sites within the BDNF gene (see Figure 2, Table 3), and these too reflect an entirely novel aspect of BDNF biology that has not previously been reported. Evidence of coevolving sites are not limited to a particular domain (e.g. prodomain vs mature) nor specific motifs. Instead, coevolving pairs seem to be distributed across the entire BDNF gene with, perhaps unsurprisingly, an increased density of interactions early in the pre-pro domain region. However, we also note that there are coevolving sites in the mature NGF domain which are “linked” to early domain sites. Importantly, these relationships may confer strong epistatic interactions shaping the continued evolution of this critically important gene. The new evidence for coevolution may point to the importance of these sites in shaping the early regulatory or main functional (NGF) domain of BDNF. These residues may form important interactions for the functional integrity of BDNF and, importantly, the highly specific pairs which span the BDNF prodomain and its mature region point to a new mechanism by which the BDNF prodomain may have co-regulated the mature domain (or vice versa). Alternatively, these coevolving pairs may be part of a network of residues occupying a shifted fitness landscape in order to accommodate new or species-specific functional requirements.
Potential Structural Implications of Evolving Sites
In considering our observation of both diversifying selection and coevolving sites in the BDNF gene, we considered the potential implications this may have at a protein structural level in three-dimensional space (see Figure 3). While protein structural impacts from evolution remain poorly understood and cannot be completely experimentally disentangled in a confirmatory sense, the implications fall upon our understanding of basic BDNF neurobiology. Here we note that our BGM and FEL analyses implicate the prodomain - the primary topological region of BDNF known for polymorphic variability (e.g. Val66Met, Gln75His) that is often linked to disease [1, 11, 29], and our 3D modeling suggests that two of our co-evolving sites appear to be associated with looping structures that could have important yet to be discovered functionality. In this regard, we predict that the evolutionary changes described here are likely to reflect some form of specialization and/or divergence in function and/or interaction partners at different points of BDNF’s evolutionary history in mammals. Thus, further work may unveil yet more novel sites that could provide further insight into the origins of BDNF’s diverse functionality and its role in disease.
Limitations of our Computational Evolutionary Analysis
This analysis focused on BDNF sequences contained in the taxonomic group Mammalia in lieu of examining a more inclusive dataset for BDNF containing sequences from all of Gnathostomata (jawed vertebrates) or extension into invertebrate clades which may contain BDNF or BDNF-like analogue genes. Our results are applicable to mammals, which are our intentional taxonomic group of study, but we nonetheless emphasize that our results do not capture the entirety of BDNF’s evolutionary history (e.g. there could be more to learn about BDNF from birds, lizards, fish, and higher order taxonomic groups which we do not evaluate here). In addition we do not explore the patterns or mutational processes occurring outside of coding-sequence evolution which include complex structure and dynamics of non-coding regions in the BDNF gene or across. Therefore, evolutionary temporality is important in the context and interpretation of our results because Mammalia represents only a portion of the long evolutionary history of BDNF. Although we failed to find evidence for recombination in our dataset, species where we may find evidence for recombination may have been precluded from our analysis due to our decision to focus on Mammalian BDNF gene evolution. Further, a limitation of the current analysis is owed to the presence of indel events, especially in the early region of the alignment but which also occur in other spatially distributed regions of the BDNF gene. These indel events are not currently modeled in existing codon substitution models but may represent an additional pathway of evolutionary change. Nonetheless, the prominence of indels in our observations indicate that several regions of BDNF may evolve significantly through indel events across species. Lastly, although there is a risk that the “gappy” nature of the early region of our multiple sequence alignment may be a computational artifact of the alignment procedure, based on all other outputs we believe that our results are reasonably interpreted and have subsequently tolerated these potential effects.
Future Directions: Understanding the Remainder of the Neurotrophin Family
We hypothesize that the similarities between neurotrophins reflects conserved evolutionary selection for motifs and domains for which support common functionality in neurotrophic factors between sites and lineages. While we note significant isotropy in mature peptide sequences for these factors, anisotropic pressures likely influenced the prodomain sequences of neurotrophins leading to alterations in processing, trafficking, regulation, and secretion. As such, we also predict differences in the evolutionary fate of other neurotrophins which also exhibit compartmentalized functionality due to similar alterations within their prodomains (i.e. similar results may be reasonably anticipated not just in BDNF, but also NGF, NT-3, and NT-4).
CONCLUSION
To sum up, our research modeled the natural evolutionary history of changes in the BDNF gene across >160 mammalian genomes. Conservatively, this analysis spans approximately ∼177 million years of evolution - and going deeper could yet reveal more information on the ontogenesis of BDNF and its topological structure (and, consequently, function). Notably, we observed strict purifying selection in the main functional domain of the BDNF gene in mammals and discovered 6 specific sites in our homologous alignment that are under episodic selection in the early regulatory region of BDNF (i.e. the prodomain) and in the terminal region of BDNF. We also make the case for spatial coevolution within this gene, with 23 pair-sites that have evolved together. In sum, these data go above and beyond the common trope that “BDNF is highly conserved” by defining exactly where and how the mammalian BDNF has evolved. Thus, we confirm the widespread belief that the BDNF prodomain is more prone to change than the mature BDNF protein, having important implications for how we think about and consider genetic variation in BDNF and its linkage to disease.
METHODS
Data Retrieval
For this study, we queried the NCBI Ortholog database via https://www.ncbi.nlm.nih.gov/kis/ortholog/627/?scope=7776. For the purpose of this study, as we are interested in mammalian BDNF evolution, we limited our search to only include species from this taxonomic group (mammals, Mammalia). This returned 162 full gene transcripts and protein sequences. We downloaded all available files: RefSeq protein sequences, RefSeq transcript sequences, Tabular data (CSV, metadata). In Table 1, we provide a table of the species included in this analysis but we also make this accessible via GitHub. Furthermore, we also make these species NCBI accessions (see also Table 1) available for download on GitHub:
AnalysisOfOrthologousCollections/BDNF_orthologs.csv at main · aglucaci/AnalysisOfOrthologousCollections · GitHub
Data Cleaning
We used the protein sequence and full gene transcripts to derive coding sequences (CDS) (via a custom script, scripts/codons.py). However, this process was met with errors in 20 “PREDICTED” protein sequences which had invalid characters such as sequences which have incorrect ‘X’, or unresolved amino acids and these sequences were subsequently exempt from analysis. This process removes low-quality protein sequences from analysis which may inflate rates of nonsynonymous change.
Analysis of Orthologous Collections (AOC): Alignment, Recombination Detection, Tree inference & Selection Algorithms
The Analysis of Orthologous Collections (AOC) application is designed for comprehensive protein-coding molecular sequence analysis (https://github.com/aglucaci/AnalysisOfOrthologousCollections). It accomplishes this through a series of comparative evolutionary methods. AOC allows for the inclusion of recombination detection, a powerful force in shaping gene evolution and interpreting analytic results. As well, it allows for lineage assignment and annotation. This feature (lineage assignment) allows between group comparisons of selective pressures. This application currently accepts two input files: a protein sequence unaligned fasta file, and a transcript sequence unaligned fasta file for the same gene. Typically, this can be retrieved from public databases such as NCBI Orthologs. Although other methods of data compilation are also acceptable. In addition, the application is easily modifiable to accept a single CDS input, if that data is available.
If protein and transcript files are provided, a custom script ‘scripts/codons.py’ is executed and returns coding sequences where available. Note that this script currently is set to use the standard genetic code, this will need to be modified for alternate codon tables. This script also removes “low-quality” sequences if no match is found, see the above Data cleaning section.
Step 1. Alignment. We used the HyPhy [44] codon-aware multiple sequence alignment procedure available at (https://github.com/veg/hyphy-analyses/tree/master/codon-msa).
This was performed with a Human BDNF coding sequence NM_001709.5 Homo sapiens brain derived neurotrophic factor (BDNF), transcript variant 4, mRNA as a reference based alignment. Our alignment procedure retained 126 unique in-frame sequences.
Step 2. Recombination detection. Performed manually via RDP v5 [45], see below, the “Recombination detection” section for additional details. A recombination free file is placed in the following folder: results/BDNF/Recombinants. For the purpose of this study, we did not detect recombination in our dataset.
Step 3. Tree inference and selection analyses. For the recombination free fasta file, we perform maximum likelihood phylogenetic inference via IQ-TREE [46]. Next, the recombination free alignment and unrooted phylogenetic tree is evaluated through a standard suite of molecular evolutionary methods. This set of selection analyses include the following but for the sake of brevity some of these results were not shown (essentially, most were not statistically significant or not meaningful as relevant to the evolutionary results presented here).
FEL: locates codon sites with evidence of pervasive positive diversifying or negative selection [47].
BUSTEDS: tests for gene-wide episodic selection [48].
MEME: locates codon sites with evidence of episodic positive diversifying selection, [49].
aBSREL: tests if positive selection has occurred on a proportion of branches, [50].
SLAC: performs substitution mapping, [47].
BGM: identifies groups of sites that are apparently co-evolving, [51].
RELAX: compare gene-wide selection pressure between the query clade and background sequences, [52].
CFEL: comparison site-by-site selection pressure between query and background sequences, [53].
FMM: examines model fit by permitting multiple instantaneous substitutions, [54].
Step 4A. Lineage assignment and tree annotation. For the unrooted phylogenetic tree, we perform lineage discovery, via NCBI and the python package ete3 toolkit. Assigning lineages to a K (by default, K = 20) number of taxonomic groups. Here, the aim is to have a broad representation of taxonomic groups, rather than the species being heavily clustered into a single group. As a reasonable approximation, we aim for <40% of species to be assigned to any one particular taxonomic group.
Step 4B. We perform tree labeling via the hyphy-analyses/Label-Trees (REF, link) method. Resulting in one annotated tree per lineage designation. For the purpose of this study, we will only consider the following five lineages for additional analyses (Artiodactyla, Carnivora, Chiroptera, Glires, Primates) as they are the most populated lineages.
Step 5. Selection analyses on lineages. Here, the recombination free fasta file and the set of annotated phylogenetic trees (where labeling was performed in Step 4) is provided for analysis with the RELAX and Contrast-FEL methods.
Recombination detection
Manually tested via RDPv5.5 with modified settings as follows:
We also included the following algorithms/analyses: RDP [55], GENECONV [56], Chimaera [57], MaxChi [58], BootScan [59] (Primary and Secondary Scan), SiScan [60] (Primary and Secondary Scan), 3Seq [61].
Recombination events are ‘accepted’ in cases where 3 or more methods are in agreement.
We slightly modified default parameters, such that
Require topological evidence.
Polish breakpoints.
Check alignment consistency.
Sequences are linear.
List events detected by >2 methods.
We manually recheck all of the events via “Recheck all identified events with all methods”.
We manually accept events detected by >2 methods.
The resulting alignment was saved as a distributed alignment (with recombinant regions separated).
Recombination was not detected within our Human reference based alignment. Therefore we used the single recombination free alignment for analyses.
Data & Software Availability Statement
The AOC application is freely available via a dedicated GitHub repository at: https://github.com/aglucaci/AnalysisOfOrthologousCollections
Raw data for this study is available on GitHub: https://github.com/aglucaci/AnalysisOfOrthologousCollections/tree/main/data/BDNF
Full results for this study include all HyPhy selection analyses JSON-formatted result files are available on GitHub: https://github.com/aglucaci/AnalysisOfOrthologousCollections/tree/main/results/BDNF
Conflicts of Interest Statement
The authors are without conflict.
Contributions & Acknowledgements
M.N. and A.L. contributed equally. M.N. was supported by a NHMRC CJ Martin Fellowship for stem cell training at Weill Cornell Medical College.
Supplementary Material
These results are also available at the following link: https://github.com/aglucaci/AnalysisOfOrthologousCollections/blob/main/tables/BDNF/BDNF_FEL_CI.csv