Abstract
The human brain differs from that of other primates, but the underlying genetic mechanisms remain unclear. Here we measured the evolutionary pressures acting on all human protein-coding genes (N=17,808) based on their divergence from early hominins such as Neanderthal, and non-human primates. We confirm that genes encoding brain-related proteins are among the most conserved of the human proteome. Conversely, several of the most divergent proteins in humans compared to other primates are associated with brain-associated diseases such as micro/macrocephaly, dyslexia, and autism. We identified specific expression profiles of a set of divergent genes in ciliated cells of the cerebellum, that might have contributed to the emergence of fine motor skills and social cognition in humans. This resource is available at http://neanderthal.pasteur.fr and can be used to estimate evolutionary constraints acting on a set of genes and to explore their relative contribution to human traits.
Introduction
Homo sapiens possesses advanced abilities to perform complex cognitive tasks and to communicate with peers1. Anatomical brain differences between humans and other primates are well documented (e.g. cortex size, prefrontal white matter thickness, lateralization), but the mode in which the human brain has evolved remains controversial2. For example, modern humans have large and globular brains that distinguish them from their extinct Homo relatives, but it remains unknown when and how brain globularity evolved and how it relates to evolutionary brain size increase. A recent study analyzing endocranial casts of Homo sapiens fossils indicates that 300,000 years ago, brain size in early Homo sapiens already fell within the range of present-day humans3. Brain shape, however, evolved gradually within the Homo sapiens lineage, reaching present-day human variation between about 100,000 and 35,000 years ago. It has also been suggested that the enlargement of the prefrontal cortex relative to motor cortex in humans was mirrored in the cerebellum with an enlargement of the regions of the cerebellum connected to the prefrontal cortex4. These anatomical processes of tandem evolution in the brain paralleled the emergence of motor and cognitive abilities such as bipedalism, planning, language, as well as social awareness that are especially well developed in humans.
Genetic diversity between primates has surely contributed to these brain and cognitive differences, but the contributing genes or variants remain largely unknown. Indeed, demonstrating that a genetic variant is adaptive requires strong evidence both at the genetic and functional levels. There are few human-specific genes, such as SRGAP2C5, ARHGAP11B6 or NOTCH2NL7, that recently appeared through gene duplication in the Homo lineage8 and, remarkably, their expression in the mouse brain increases cortical development6,7,9,10. In addition, several genes involved in brain function display accelerated evolution of their coding regions in humans; for example, FOXP2 has been associated with verbal apraxia and ASPM with microcephaly11,12. Functional studies also showed that mice carrying a “humanized” version of FOXP2 display qualitative changes in ultrasonic vocalization13. However, these reports targeting only specific genes sometimes provide contradictory results14, and other studies have reported a greater sequence conservation of the brain protein-coding genes compared to other tissues15–17, suggesting that the main substrate of evolution in the brain resides in regulatory changes of gene expression18–20 and splicing21. It was also shown that more genes underwent positive selection in chimpanzee evolution than in human evolution, contradicting the widespread brain-gene acceleration hypothesis of human cognition origins22. Several studies have recently explored the most constrained genes during primate evolution or in human populations to better estimate the pathogenicity of variants identified in patients with genetic disorders23,24. In contrast, few studies have systematically searched for divergent genes during primate evolution25,26.
Here, we provide an exhaustive screening of all protein-coding genes for conservation and divergence from our common primate ancestor, and take advantage of rich datasets of brain single-cell transcriptomics, proteomics and imaging to investigate their relations to brain structure, function, and diseases.
Results
High conservation of brain protein-coding genes
We first compared the sequences of modern humans, archaic humans, and other primates to their common ancestor, and extracted a measure of evolution for 17,808 protein-coding genes in the modern human genome (Fig. 1a, see also Supplementary Fig. 1 and 2)27. This resource is available online at http://neanderthal.pasteur.fr. Our measure is derived from one of the most popular and reliable measures of evolutionary pressures on protein-coding regions, the dN/dS ratio28, also called ω. This measure compares the rates of non-synonymous and synonymous mutations of coding sequences. If there are more non-synonymous mutations than expected, there is divergence; if fewer, there is conservation. First, we estimated dN and dS for all the genes that are 1:1 orthologous, since the evolutionary constraints are relaxed in duplicated genes29 (note: only the Y chromosome was excluded from the analyses). Then, we adjusted the dN/dS ratio for biases induced by variations of mutation rates with the GC content of codons. Finally, we renormalized across the genome the values obtained for each taxon. The final ωGC12 is a form of Z-score corrected for GC content that quantifies the unbiased relative divergence of genes compared to the ancestral primate genome.
Using the ωGC12 on all protein coding genes for each hominin —Homo sapiens, Denisovan, Neanderthal, and Pan troglodytes— we identified two distinct clusters (Fig. 1b and Supplementary Table 1): one with divergent proteins, enriched in olfactory genes (OR=1.48, p=8.4e-9), and one with conserved proteins, enriched in brain related biological functions (Fig. 1c and Supplementary Table 2). This second cluster reveals especially a strong conservation of proteins implicated in nervous system development (OR=1.2, p=2.4e-9) and synaptic transmission (OR=1.35, p=1.7e-8).
We then focused on the genes expressed in different body tissues by combining different RNAseq (Illumina Bodymap2 and GTEx), microarray and proteomics datasets. For expression data, we characterized the specificity of genes by normalizing their profile across tissues (Supplementary Fig. 3). The results confirmed with three different datasets that protein-coding genes expressed in the brain were more conserved (Wilcoxon rc=-0.1, p=4.1e-12) than genes related to other parts of the body, testis being the most divergent (Wilcoxon rc=0.3, p=7.8e-11; Fig. 1d, see also Supplementary Fig. 4 and 5).
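The specificity step mentioned above can be illustrated with a minimal sketch. The exact normalization is described only in Supplementary Fig. 3, so this example assumes a plain Z-score of each gene's expression across tissues and a +2SD cutoff (the threshold quoted later for brain-specific expression). The function names and the input layout are hypothetical.

```python
from statistics import mean, pstdev

def tissue_specificity(expression):
    """Z-score one gene's expression profile across tissues.

    expression: {tissue: expression value} for a single gene
    (hypothetical layout). Returns {tissue: z-score}.
    """
    values = list(expression.values())
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        # Flat profile: the gene is not specific to any tissue.
        return {tissue: 0.0 for tissue in expression}
    return {tissue: (v - mu) / sd for tissue, v in expression.items()}

def specific_tissues(expression, threshold=2.0):
    """Tissues in which the gene exceeds the assumed +2SD threshold."""
    z = tissue_specificity(expression)
    return [tissue for tissue, score in z.items() if score > threshold]
```

With this definition, a gene counts as specific to a tissue only when its expression there stands out from its own profile, regardless of its absolute expression level.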
Conservation of proteins related to nervous system substructure and neuronal functions
Within the central nervous system, three substructures appeared to evolve under the strongest degree of purifying selection at the protein sequence level, i.e. highly conserved: the cerebellum (Wilcoxon rc=-0.29, p=5.5e-6), the amygdala (Wilcoxon rc=-0.11, p=4.1e-6), and more surprisingly the prefrontal cortex (Wilcoxon rc=-0.1, p=5.7e-10; Fig. 2a, see also Supplementary Table 3). Indeed, the prefrontal cortex is proposed as one of the most divergent brain structures in human evolution30 and is associated with high cognitive functions31. Only one brain structure was more divergent than expected: the superior cervical ganglion (Wilcoxon rc=0.22, p=1e-6), which provides sympathetic innervation to many organs and is associated with the archaic functions of fight or flight response. Among the divergent genes expressed in the superior cervical ganglion, CARF is specifically divergent in the genus Homo. Interestingly, this calcium responsive transcription factor regulates neuronal activity-dependent expression of BDNF32 as well as a set of singing-induced genes in the song nuclei of the zebra finch, a vocal learning songbird33. The gene has a dN/dS of 2.44 and is among the most divergent proteins expressed in the human brain.
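The effect size rc reported alongside the Wilcoxon tests is not defined in the text. The sketch below assumes it is a rank-biserial correlation between the ωGC12 values of a gene set and those of the remaining genes, which would make negative values indicate conservation and positive values divergence. The function name and inputs are hypothetical.

```python
def rank_biserial(group, rest):
    """Rank-biserial correlation between two samples.

    Computed as (pairs where group > rest minus pairs where group < rest)
    divided by the total number of pairs; ties count for neither side.
    Ranges from -1 (group always lower) to +1 (group always higher).
    """
    greater = sum(1 for x in group for y in rest if x > y)
    less = sum(1 for x in group for y in rest if x < y)
    return (greater - less) / (len(group) * len(rest))
```

The pairwise form is quadratic in sample size; it is chosen here for readability, and a rank-based formula would be preferable at genome scale.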
Regarding the brain gene ontologies, all pathways are overall highly conserved, but genes involved in glutamatergic and GABAergic neurotransmission are on average more conserved (Wilcoxon rc=-0.25; p=9.8e-6) than those participating in dopamine and peptide neurotransmission and in intracellular trafficking (Fig. 2b, see also Supplementary Fig. 6 and Supplementary Table 3). Integrating the recently released ontology of the synapse provided by the SynGO consortium (http://syngoportal.org), we not only confirmed this global strong conservation of the synapse, but also revealed that it especially concerns the trans-synaptic signaling processes (Wilcoxon rc=-0.21, p=4.5e-5) as well as postsynaptic (Wilcoxon rc=-0.56, p=6.3e-8) and presynaptic membranes (Wilcoxon rc=-0.56, p=7e-8; Fig. 2c,d).
The divergent proteins and their correlation with brain expression and functional activity
To focus on the most divergent proteins, we took the proteins situated at the extremes of the ωGC12 distribution (>2SD; Fig. 3a; Supplementary Table 4) and which have been fixed in the modern Homo sapiens population (Neutrality Index<1). Among those 352 highly divergent proteins, 126 were related to the brain (Fisher exact OR=0.66, p=1e-4), i.e. listed as synaptic proteins34,35, specifically expressed in the brain (+2SD of specific expression) or related to a brain disease (listed in OMIM: https://www.omim.org or HPO: https://hpo.jax.org/app/). For comparison, we also extracted the 427 most conserved proteins, among which 290 were related to the brain categories listed above (enrichment for brain genes, Fisher exact OR=1.26, p=0.0032).
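As an illustration of the enrichment statistics quoted above, the following sketch implements a two-sided Fisher exact test on a 2x2 table (for instance, brain-related versus other genes, crossed with highly divergent versus other genes). It is a generic textbook implementation using only the Python standard library, not the code used for the analyses, and the function name is hypothetical.

```python
from math import comb

def fisher_exact(a, b, c, d):
    """Two-sided Fisher exact test on the 2x2 table [[a, b], [c, d]].

    Returns (sample odds ratio, p-value). The p-value sums the
    hypergeometric probabilities of all tables that are at most as
    likely as the observed one, with margins held fixed.
    """
    row1, col1, n = a + b, a + c, a + b + c + d

    def prob(x):
        # Probability of observing x in the top-left cell.
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    lo, hi = max(0, row1 - (n - col1)), min(row1, col1)
    p_obs = prob(a)
    # Small relative tolerance guards against floating-point ties.
    p_value = sum(prob(x) for x in range(lo, hi + 1)
                  if prob(x) <= p_obs * (1 + 1e-9))
    odds = (a * d) / (b * c) if b * c else float("inf")
    return odds, p_value
```

An odds ratio below 1, as reported for the divergent set, indicates depletion of brain-related genes, while a value above 1 indicates enrichment.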
Using these 352 highly divergent and 427 highly conserved genes, we first confirmed, using the Brainspan data available on the Specific Expression Analysis (SEA), that genes expressed in the cerebellum and the cortex were enriched in conserved genes (Supplementary Figure 7). Despite this conservation on average, we nevertheless identified several divergent genes, some even specifically divergent in Homo sapiens such as PBXIP1 (dN/dS=2.35), which has a role in cell migration and proliferation, but is also an astroglial progenitor cell marker during human embryonic development36. We investigated the pattern of expression of the highly divergent genes through the lens of the Allen Brain atlas37. We compared among the six donors how reproducible the difference of expression between highly divergent and conserved genes was, and found a cluster of brain regions (hypothalamus, cerebral nuclei, and cerebellum) expressing more specifically highly divergent genes (Fig. 4). Among the divergent genes, the most specifically expressed in those subcortical structures were two members of the NACA family (NACAD; dN/dS=1.31 and NACA; dN/dS=2.92), which play a role in protein transport and targeting, as well as KIAA1191 (dN/dS=1.31) and KIAA0319 (dN/dS=3.89), both involved in the regulation of neuronal survival, differentiation, and axonal outgrowth38,39. Using single-cell transcriptomics40,41, we finally found that ciliated cells in the cerebellum and choroid cells in the cortex were the two main cell types displaying specific divergence (Fig. 5a; Supplementary Figure 8). Combining our analyses across Homo sapiens, Neanderthal, Denisovan, and Pan troglodytes, we also found among proteins specifically expressed in those ciliated cells some which were specifically divergent in modern humans (Fig. 5b): C11orf16, EFCAB12, LCA5, and NSUN7.
Interestingly, LCA5 (dN/dS=1.6) is directly involved in centrosomal or ciliary function, and has been shown to interact with genes implicated in Joubert Syndrome42. Some proteins linked to brain morphology disorders also displayed a divergence specifically in the hominin lineage: TTLL6, DYNC2LI1, and TOP3A respectively implicated in Joubert syndrome, ciliopathy, and microcephaly. TOP3A is particularly interesting since it is strongly conserved in Pan troglodytes and two hominin specific variants (RNA NM_004618, R831Q and E959G) are located in the vicinity of pathogenic mutations causing microcephaly 43.
Finally, to test potential associations with brain functions, we extracted from NeuroVault44 19,244 brain imaging results from fMRI-BOLD studies (T and Z score maps) and examined how their spatial patterns correlate with those of gene expression in the Allen Brain atlas45. The correlation between brain activity and divergent gene expression was stronger in subcortical structures than in the cortex (Wilcoxon rc=0.14, p=2.5e-248). The brain activity maps that correlate with the expression pattern of the divergent genes (see Supplementary Table 6 for details) were enriched in social tasks (empathy, emotion recognition, theory of mind, language; Fisher exact p=2.9e-20, odds ratio=1.72; see Supplementary Figure 9 for illustration).
The divergent proteins and their connection with brain disorders
Our systematic analysis revealed that highly constrained proteins were more associated with brain diseases or traits compared to divergent proteins especially with intellectual disability and autism but also diseases associated with myelin or encephalopathy (Figure 3b). Those highly conserved proteins associated with brain diseases include tubulins (TUBA1A, TUBB3, TUBB4A), dynamin (DNM1), chromatin remodeling proteins (SMARCA4) and signaling molecules such as AKT1, DVL1, NOTCH1 and its ligand DLL1, all associated with different forms of neurodevelopmental disorders (Supplementary Table 4). We also found 31 highly divergent proteins that are associated with several human diseases or conditions such as micro/macrocephaly, autism or dyslexia.
When we compared humans and chimpanzees with our common primate ancestor, several proteins associated with micro/macrocephaly displayed different patterns of evolution in humans and chimpanzees (Fig. 6). Some proteins displayed a divergence specifically in the hominin lineage (AHI1, ASXL1, CSPP1, DAG1, FAM111A, GRIP1, NHEJ1, QDPR, RNF135, RNF168, SLX4, TCTN1, and TMEM70) or alternatively in the chimpanzee (ARHGAP31, ATRIP, CPT2, CTC1, HDAC6, HEXB, KIF2A, MKKS, MRPS22, RFT1, TBX6, and WWOX). The protein phosphatase PPP1R15B associated with microcephaly was divergent from the common primate ancestor in both taxa. None of the proteins related to micro/macrocephaly appeared divergent only in Homo sapiens (Fig. 6).
We also found divergent proteins associated with communication disorders (Fig. 3c), such as autism (CNTNAP4, AHI1, FAN1, SNTG2 and GRIP1) and dyslexia (KIAA0319). Interestingly, these proteins are divergent from the common primate ancestor only in the hominin lineage and highly conserved in all other taxa (Fig. 7). They all have roles related to neuronal connectivity (neuronal migration and synaptogenesis) and were within the human brain relatively more expressed in the cerebellum except GRIP1 that is almost only restricted to the cortex.
Among the genes associated with autism, CNTNAP4 is a member of the neurexin protein family that plays a role in proper neurotransmission in the dopaminergic and GABAergic systems46. SNTG2 is a cytoplasmic peripheral membrane protein that binds to NLGN3 and NLGN4X, two proteins associated with autism47 and several copy-number variants affecting SNTG2 were identified in patients with autism48. GRIP1 (Glutamate receptor interacting protein 1) is also associated with microcephaly and codes a synaptic scaffolding protein that interacts with glutamate receptors. Variants of this gene were recurrently associated with autism49.
Interestingly, we also detected that the dyslexia-susceptibility protein KIAA0319, involved in axon growth inhibition38,50, is among the most divergent brain proteins in humans compared to our common primate ancestor (dN/dS=3.9). The role of KIAA0319 in dyslexia is still debated, but its rapid evolution in the hominoid lineage warrants further genetic and functional studies.
Finally, several genes display very high divergence in Homo sapiens, but their functions or their association with diseases remain unknown. For example, the zinc finger protein ZNF491 (dN/dS=4.7) is highly expressed in the cerebellum and displays structural similarities with a chromatin remodeling factor, but its biological role remains to be characterized. Another example is CCP110, a centrosomal protein like ASPM, but not associated with a disease. Based on its function, this divergent protein CCP110 could be a compelling candidate for microcephaly in humans. A complete list of the most conserved and divergent proteins is available in Supplementary Table 4 and on the companion website.
Systematic analysis of Neanderthal introgression
Finally, we also looked at the recent evolutionary event of Neanderthal introgression. Combining the dataset of Sankararaman51 with our systematic screening of coding sequences, we derived for 16,020 genes the level of Neanderthal introgression (also available on the website). We conducted similar analyses as with ωGC12 (see Supplementary Table 7) and, despite no correlation between ωGC12 and introgression, we obtained opposite trends at the tissue level: genes expressed in the testes were the least introgressed across body tissues (Wilcoxon rc=-0.09, p=1.5e-10), while genes associated with the adult brain in RNAseq (Wilcoxon rc=0.16, p=1.2e-4) and Mass Spectrometry (Wilcoxon rc=0.26, p=2.6e-6) appeared more introgressed than expected (Supplementary Figure 10). At the nervous system level, the spinal cord (Wilcoxon rc=0.3, p=1.9e-5) and the retina (Wilcoxon rc=0.15, p=3e-9) revealed strong introgression, while the pineal gland had less introgression than expected (Wilcoxon, pineal during day: rc=-0.14, p=6.2e-15; pineal during night: rc=-0.12, p=6.5e-10; Supplementary Figure 11). At the functional level, the proteins associated with the glutamatergic (Wilcoxon rc=0.38, p=3e-5) and the serotonin (Wilcoxon rc=0.29, p=1.8e-3) pathways were also highly introgressed (Supplementary Figure 12).
Discussion
Divergent proteins and brain sizes in primates
Several proteins were previously proposed to play a major role in the expansion of the size of the human brain. Some of these proteins such as ARHGAP11B, SRGAP2C and NOTCH2NL7 were found to be specific to humans since their genes were recently duplicated52. Other studies proposed that high divergence in genes involved in micro/macrocephaly could have contributed to the dramatic change in brain size during primate evolution25,53. Several of these genes, such as ASPM54 and MCPH155, apparently displayed an accelerated evolution in humans. However, the adaptive evolution of these genes was previously questioned56 and in our analysis neither of these two proteins is on the list of the highly divergent proteins (their dN/dS values are indeed below 0.8).
Conversely, our systematic detection could identify the most divergent proteins in humans for micro/macrocephaly and the top 10 proteins were FAM111A, AHI1, CSPP1, TCTN1, DAG1, TMEM70, ASXL1, RNF168, NHEJ1, GRIP1. This list of divergent proteins associated with micro/macrocephaly in humans can be used to select the best candidate human specific gene/variants for further genetic and functional analyses in order to better estimate their contribution to the emergence of anatomical difference between humans and other primates.
Some of these proteins could have contributed to brain size and to other morphological features such as difference in skeleton development. For example, the divergent proteins FAM111A (dN/dS=2.99) and ASXL1 (dN/dS=1.83) are associated respectively with macrocephaly and microcephaly. Patients with dominant mutation of FAM111A are diagnosed with Kenny-Caffey syndrome (KCS). They display impaired skeletal development with small and dense bones, short stature, primary hypoparathyroidism with hypocalcemia and a prominent forehead57. While the function of FAM111A remains largely unknown, this protein appears to be crucial to a pathway that governs parathyroid hormone production, calcium homeostasis, and skeletal development and growth. On the other hand, patients with dominant mutations of ASXL1 are diagnosed with Bohring-Opitz syndrome, a malformation syndrome characterized by severe intrauterine growth retardation, intellectual disability, trigonocephaly, hirsutism, and flexion of the elbows and wrists with deviation of the wrists and metacarpophalangeal joints58. ASXL1 encodes a chromatin protein required to maintain both activation and silencing of homeotic genes.
Remarkably, three proteins AHI1, CSPP1 and TCTN1 are in the top 5 with dN/dS>2 and are required for both cortical and cerebellar development in humans and associated with Joubert syndrome. Children diagnosed with this recessive disease present with agenesis of the cerebellar vermis and difficulties coordinating movements. AHI1 is a positive modulator of classical WNT/Ciliary signaling. CSPP1 is involved in the cell-cycle-dependent microtubule organization and TCTN1 is a regulator of Hedgehog during development.
AHI1 was previously reported as a gene under positive evolutionary selection along the human lineage59,60, but to our knowledge, neither CSPP1 nor TCTN1 was previously detected as a divergent protein during primate evolution. Interestingly, it was proposed that the accelerated evolution of AHI1, required for ciliogenesis and axonal growth, could have played a role in the development of unique motor capabilities in humans, such as bipedalism53. Our findings provide further support for accelerated evolution of a set of genes associated with ciliogenesis. Indeed, we found that three additional proteins involved in Joubert syndrome, CSPP1, TTLL6, and TCTN1, are among the most divergent proteins during human evolution, and our single-cell analysis revealed that ciliated cells are the main category of cerebellar cells expressing divergent genes.
The possible link between a change in the cerebellum genetic makeup and the evolution of human cognition
Although the emergence of a large brain was probably an important step for human cognition, other parts of the brain such as the cerebellum might have also played an important role in both motricity and cognition. In this study, we showed that proteins expressed in the cerebellum were among the most conserved proteins in humans. Nevertheless, we also found a set of divergent proteins with relatively high expression in the cerebellum and/or affecting cerebellar function when mutated. As discussed above, several genes associated with Joubert syndrome such as AHI1, CSPP1, TTLL6, and TCTN1 display protein divergence in humans and are important for cerebellar development. In addition, among the most divergent proteins expressed in the brain, CNTNAP4, FAN1, SNTG2, and KIAA0319 also display high expression in the cerebellum and have been associated with communication disorders such as autism and dyslexia.
Within the cerebellum, the lateral zone, also known as the neocerebellum, is novel to mammals and has expanded during primate evolution61,62. Specifically, it was shown that this part of the cerebellum connected to the prefrontal cortex was enlarged, relative to the cerebellar lobules connected to the motor cortex4. Interestingly, the largest differences in the endocranial shapes of modern humans and Neanderthals are found in the cerebellum63, although other regions are also relatively larger in modern humans than in Neanderthals, including parts of the prefrontal cortex and the occipital and temporal lobes64.
In this line, a recent study detected Neanderthal alleles near two genes, UBR4 on chromosome 1 and PHLPP1 on chromosome 8, which could be linked to differences in endocranial globularity64. These genes are involved in neurogenesis and myelination, respectively, and UBR4 is most specifically expressed in the cerebellum. In our analysis, the coding regions of these genes are highly conserved (dN/dS<0.13), suggesting that regulatory variants rather than coding sequence variants might be involved.
In humans, the cerebellum is associated with higher cognitive functions such as visual-spatial skills, planning of complex movements, procedural learning, attention switching, and sensory discrimination65. It has a key role for temporal processing66 and for being able to anticipate and control behavior with both implicit and explicit mechanisms65. It is therefore expected that a change in the genetic makeup of the cerebellum might have been of great advantage for the emergence of specific human cognition.
Despite this possible link between the cerebellum and the emergence of human cognition, this part of the brain has been neglected compared to the cortex, where most of the functional studies investigating the role of human-specific genes/variants have been conducted. For example, while in humans the expression of SRGAP2C is almost restricted to the cerebellum, its ectopic expression was studied in the mouse cortex5,10 where it triggers human-like neuronal characteristics, such as increased dendritic spine length and density. Hence, we propose that exploring the contribution of human-specific genes/variants to the development and the function of the cerebellum might shed new light on the evolution of human cognition.
Perspectives
Our systematic analysis of protein sequence diversity confirms that proteins related to brain function are among the most conserved proteins of the human proteome. The set of divergent proteins that we identified could have played specific roles in the evolution of human cognition by modulating brain size, neuronal migration and/or synaptic physiology, but further genetic and functional studies are needed to clarify the role of these divergent proteins. Beyond the brain, this resource will also be helpful to estimate the level of evolutionary pressure acting on genes related to other biological pathways, especially those displaying signs of positive selection during primate evolution such as the reproductive and immune systems.
Methods
1. Genetic sequences
Alignments on the reference genome
We gathered and reconstructed aligned sequences on the reference human genome version hg19 (release 19, GRCh37.p13). For the primate common ancestor sequence, we took the Ensembl 6-way Enredo-Pecan-Ortheus (EPO)67 multiple alignments v71, related to human (hg19), chimpanzee (panTro4), gorilla (gorGor3), orangutan (ponAbe2), rhesus macaque (rheMac3), and marmoset (calJac3). For the two archaic hominins (Altai Neanderthal and Denisovan), we integrated variants detected by Castellano and colleagues68 onto the standard hg19 sequence (http://cdna.eva.mpg.de/neandertal/, date of access 2014-07-03). We finally took the whole genome alignment of all the primates used in the 6-way EPO from the UCSC website (http://hgdownload.soe.ucsc.edu/downloads.html, date of access 2015-08-13).
VCF annotation
We combined the VCF file from Castellano and colleagues68 with VCF files generated from the alignment of the ancestor and primate sequences. The annotation of the global VCF was done with ANNOVAR69 (version of June 2015) using the following databases: refGene, cytoBand, genomicSuperDups, esp6500siv2_all, 1000g2014oct_all, 1000g2014oct_afr, 1000g2014oct_eas, 1000g2014oct_eur, avsnp142, ljb26_all, gerp++elem, popfreq_max, exac03_all, exac03_afr, exac03_amr, exac03_eas, exac03_fin, exac03_nfe, exac03_oth, exac03_sas. We also used the ClinVar database (https://ncbi.nlm.nih.gov/clinvar/, date of access 2016-02-03).
2. ωGC12 calculus
Once all alignments were gathered, we extracted the consensus coding sequences (CCDS) of all protein-coding genes referenced in Ensembl BioMart GRCh37, following HGNC (date of access 2015-05-05) and the NCBI Consensus CDS protein set (date of access 2015-08-10). We then computed the number of non-synonymous mutations N, the number of synonymous mutations S, the number of non-synonymous mutations per non-synonymous site dN, the number of synonymous mutations per synonymous site dS, and their ratio dN/dS (also called ω) between all taxa and the ancestor, using the yn00 algorithm implemented in the PAML software70. To avoid infinite or null results, we computed a corrected version of dN/dS: if S was null, we set it to one to avoid a null denominator. Finally, we obtained our ωGC12 measure by correcting for the GC12 content of the genes using a general linear model and calculating a Z-score within each taxon27. Indeed, the GC content has been associated with biases in mutation rates, especially in primates71 and humans72. We kept only the 11667 genes with one-to-one orthologs in primates (extracted for GRCh37.p13 with Ensembl BioMart, date of access 2017-02-27).
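The correction steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the ratio is computed from raw counts and site numbers (yn00 additionally corrects for multiple substitutions and transition/transversion bias), and the "general linear model" is assumed to be a simple least-squares regression of ω on GC12, since its exact specification is not given here. Function names are hypothetical.

```python
from statistics import mean, pstdev

def corrected_omega(n_mut, s_mut, n_sites, s_sites):
    """Crude dN/dS with the zero-S guard described in the text.

    Simplification: no correction for multiple hits, unlike yn00.
    """
    if s_mut == 0:
        s_mut = 1  # S=0 is replaced by 1 to avoid a null denominator
    dn = n_mut / n_sites
    ds = s_mut / s_sites
    return dn / ds

def omega_gc12(omegas, gc12):
    """Regress omega on GC12 and Z-score the residuals within one taxon.

    Assumes a simple least-squares fit with an intercept.
    """
    mx, my = mean(gc12), mean(omegas)
    sxx = sum((x - mx) ** 2 for x in gc12)
    slope = sum((x - mx) * (y - my) for x, y in zip(gc12, omegas)) / sxx
    intercept = my - slope * mx
    residuals = [y - (intercept + slope * x) for x, y in zip(gc12, omegas)]
    mu, sd = mean(residuals), pstdev(residuals)
    return [(r - mu) / sd for r in residuals]
```

Calling `omega_gc12` once per taxon reproduces the "Z-score within each taxon" step, so that a value above 2 corresponds to the >2SD threshold used to define highly divergent proteins.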
3. Gene sets
We used different gene sets, starting at the tissue level and then focusing on the brain and key pathways. For body tissues, we took the Illumina Body Map 2.0 RNA-Seq data, which consists of 16 human tissue types, including adrenal, adipose, brain, breast, colon, heart, kidney, liver, lung, lymph, ovary, prostate, skeletal muscle, testes, thyroid, and white blood cells (for more information: https://personal.broadinstitute.org/mgarber/bodymap_schroth.pdf; data pre-processed with Cufflinks, accessed the 5th of May 2015 at http://cureffi.org). We also used the microarray dataset by Su and colleagues73 (Human U133A/GNF1H Gene Atlas, accessed the 4th of May 2015 at http://biogps.org). Finally, we also replicated our results on the recently available RNAseq data from the GTEx Consortium74 (https://www.gtexportal.org/home/).
For the brain, we used the Su and colleagues dataset and the sets of the Human Protein Atlas (accessed the 7th of November 2017 at https://www.proteinatlas.org). For the biological pathways associated with the brain we used KEGG (accessed the 25th of February 2015, at http://www.genome.jp/kegg/), synaptic genes curated by the group of Danielle Posthuma at Vrije Universiteit (accessed the 1st of September 2014, at https://ctg.cncr.nl/software/genesets), and mass spectrometry data from Loh and colleagues75. Finally, for the diseases associated with the brain we combined gene sets generated from Human Phenotype Ontology (accessed the 5th of April 2016, at http://human-phenotype-ontology.github.io) and OMIM (accessed the 5th of April 2016, at https://omim.org), and curated lists: the 65 risk genes proposed by Sanders and colleagues76 (TADA), the candidate genes for autism spectrum disorders from SFARI (accessed the 17th of July 2015 at https://gene.sfari.org), the Developmental Brain Disorder or DBD (accessed the 12th of July 2016 at https://geisingeradmi.org/care-innovation/studies/dbd-genes/), and the Cancer Census (accessed the 24th of November 2016 at cancer.sanger.ac.uk/census). Note that combining HPO and OMIM is the most inclusive choice and avoids missing potential candidate genes, but it comes at the cost of less specific associations.
SynGO was kindly provided by Matthijs Verhage (access date: 2019-01-11). This ontology is a consistent, evidence-based annotation of synaptic gene products developed by the SynGO consortium (2015-2017) in collaboration with the GO-consortium. It extends the existing Gene Ontology (GO) of the synapse and follows the same dichotomy between Biological Processes (BP) and Cellular Components (CC).
For single-cell transcriptomics datasets, we identified the genes specifically highly expressed in each cell type, following the same strategy used for the other RNAseq datasets. The data for the single-cell developing human cortex was kindly provided by Maximilian Haeussler (available at https://cells.ucsc.edu; access date: 2018-10-30). The data of a single-cell transcriptional atlas of the developing murine cerebellum40 was kindly provided by Robert A. Carter (access date: 2019-01-29). For each cell type, we combined expression values observed across all available replicates to guarantee a good signal-to-noise ratio. We then identified the corresponding genes in Homo sapiens following the orthology correspondence between humans and mice (Ensembl BioMart accessed on 2019-02-23).
4. Gene nomenclature
We extracted from Ensembl Biomart the Entrez IDs of all protein-coding genes for GRCh37. We used the HGNC database to recover their symbols. For the 46 unmapped genes, we manually searched NCBI for the official symbol.
5. Neanderthal introgression
The level of introgression from Neanderthal to modern humans was assessed by averaging the probabilities of Neanderthal ancestry calculated for European and Asian populations for each SNP of the 1000 Genomes Project dataset51,77.
6. McDonald-Kreitman test (MK) and Neutrality Index (NI)
To detect the fixation of variants in the Homo sapiens population, we first computed the relative rate of non-synonymous to synonymous polymorphism (pN/pS) from the 1000 Genomes VCF for three sets of SNPs: all SNPs, SNPs with maximum allele frequency (MAF) <1%, and SNPs with MAF >5%. SNPs were annotated with ANNOVAR across the 1000 Genomes Project (ALL+5 ethnicity groups), ESP6500 (ALL+2 ethnicity groups), ExAC (ALL+7 ethnicity groups), and CG46 (see http://annovar.openbioinformatics.org/en/latest/user-guide/filter/#popfreqmax-and-popfreqall-annotations for more details). We then applied the McDonald-Kreitman test by calculating the Neutrality Index (NI) as the ratio of the pN/pS and dN/dS values78. We considered the divergent genes as fixed in the population when NI < 1.
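The NI criterion described above can be sketched as follows. This is a minimal illustration, not the pipeline used for the analyses: the function names are hypothetical, the input ratios are made-up values, and returning `None` when dN/dS is zero is an assumption about how undefined cases might be handled.

```python
def neutrality_index(pn_ps, dn_ds):
    """Neutrality Index: ratio of polymorphism (pN/pS) to divergence (dN/dS).

    Returns None when dN/dS is zero, because the ratio is then undefined
    (this handling is an assumption, not taken from the text).
    """
    if dn_ds == 0:
        return None
    return pn_ps / dn_ds


def is_fixed(pn_ps, dn_ds, threshold=1.0):
    """Apply the NI < 1 criterion stated in the text."""
    ni = neutrality_index(pn_ps, dn_ds)
    return ni is not None and ni < threshold


# Illustrative (hypothetical) ratios for two genes.
print(neutrality_index(0.25, 0.5), is_fixed(0.25, 0.5))
print(neutrality_index(0.6, 0.3), is_fixed(0.6, 0.3))
```

With this convention, a gene with relatively more non-synonymous divergence than non-synonymous polymorphism yields NI < 1.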
7. Protein-Protein Interaction network
To plot Protein-Protein Interactions (PPI), we combined eight human interactomes: the Human Integrated Protein-Protein Interaction rEference (HIPPIE) (accessed the 10th of August 2017 at http://cbdm-01.zdv.uni-mainz.de/~mschaefer/hippie/), the Agile Protein Interactomes DataServer (APID) (accessed the 7th of September 2017 at http://cicblade.dep.usal.es:8080/APID/), CORUM, the comprehensive resource of mammalian protein complexes (accessed the 13th of July 2017 at http://mips.helmholtz-muenchen.de/corum/), and five PPI datasets from the Center for Cancer Systems Biology (CCSB) (accessed the 12th of July 2016). The CCSB datasets comprise four high-quality binary PPI datasets obtained with a systematic primary yeast two-hybrid assay (Y2H): HI-I-05 by Rual and colleagues79, Venkatesan-09 by Venkatesan and colleagues80, Yu-11 by Yu and colleagues81, and HI-II-14 by Rolland and colleagues82; plus one high-quality binary literature dataset, Lit-BM-13 by Rolland and colleagues82, comprising all PPI that are binary and supported by at least two traceable pieces of evidence (publications and/or methods).
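One simple way to combine several interactomes is to take the union of their undirected edges. The sketch below assumes this union strategy, which the text does not specify; the function name and protein identifiers are hypothetical, and discarding self-interactions is an additional assumption.

```python
def merge_interactomes(*edge_lists):
    """Union of undirected interactions from several edge lists.

    Each interaction is stored as a frozenset so that (A, B) and (B, A)
    count as the same edge. Self-interactions are dropped (assumption).
    """
    merged = set()
    for edges in edge_lists:
        for a, b in edges:
            if a != b:
                merged.add(frozenset((a, b)))
    return merged


# Hypothetical toy interactomes.
source_1 = [("P1", "P2"), ("P2", "P1")]
source_2 = [("P2", "P3"), ("P3", "P3")]
network = merge_interactomes(source_1, source_2)
print(len(network))
```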
8. NeuroVault analyses
We used the NeuroVault website44 to collect 19,244 brain imaging results from fMRI-BOLD studies (T and Z score maps) and their correlation with the gene expression of the Allen Brain atlas45. The gene expression data from the Allen Brain atlas were normalized and projected into the MNI152 stereotactic space used by NeuroVault using the spatial coordinates provided by the Allen Brain Institute. Because an inverse relationship between cortical and subcortical expression dominated the pattern of expression for many genes, the correlations were calculated for cortical and subcortical structures separately.
9. Allen Brain data
We downloaded the Allen Brain atlas microarray-based gene expression data from the Allen Brain website (accessed the 19th of January 2018 at http://www.brain-map.org). Microarray data were available for six adult brains; because the right hemisphere was missing for three donors, we considered only the left hemisphere in our analyses. For each donor, we averaged the probes targeting the same gene and falling in the same brain area. Then, we log-normalised the data and computed Z-scores in two ways: across the 20,787 genes for each brain region, to obtain expression levels; and across the 212 brain areas for each gene, to obtain expression specificity. For genes with more than one probe, we averaged the normalized values over all available probes. As a complementary dataset, we also used a mapping of the Allen Brain Atlas onto the 68 brain regions of the Freesurfer atlas83 (accessed the 4th of April 2017 at https://figshare.com/articles/A_FreeSurfer_view_of_the_cortical_transcriptome_generated_from_the_Allen_Human_Brain_Atlas/1439749).
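The two Z-score normalisations described above (per region across genes, and per gene across regions) can be illustrated with a small sketch. The function names are hypothetical, and the log base, the pseudo-count of 1, and the use of the population standard deviation are assumptions made for the example, since the text only states that the data were log-normalised.

```python
import math
from statistics import mean, pstdev


def zscores(values):
    """Z-score a list of numbers; constant lists map to zeros."""
    m = mean(values)
    s = pstdev(values)
    if s == 0:
        return [0.0 for _ in values]
    return [(v - m) / s for v in values]


def expression_zscores(matrix):
    """matrix[g][r] is the expression of gene g in brain region r.

    Returns (level, specificity):
      level[g][r]       Z-score across genes within region r
      specificity[g][r] Z-score across regions within gene g
    """
    # log2(x + 1) is an assumed transform, not taken from the text.
    logged = [[math.log2(v + 1.0) for v in row] for row in matrix]
    specificity = [zscores(row) for row in logged]
    columns = [zscores(list(col)) for col in zip(*logged)]
    level = [list(row) for row in zip(*columns)]
    return level, specificity


# Hypothetical toy matrix: 3 genes x 2 regions.
toy = [[1, 3], [3, 1], [7, 7]]
level, specificity = expression_zscores(toy)
print(specificity[0], level[2])
```

Keeping the two normalisations separate makes it possible to ask both "which genes are highly expressed in this region?" and "in which regions is this gene preferentially expressed?".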
10. Statistics
Enrichment analyses
We first computed a two-way hierarchical clustering on the normalized dN/dS values (ωGC) across the whole genome (see Fig1b; note: 11,667 genes were part of the analysis for a medium quality coverage for Homo sapiens, Neanderthal, Denisovan, and Pan troglodytes; see Supplementary table 2). According to 30 clustering indices84, the best partitioning in terms of evolutionary pressure was in two clusters of genes: constrained (N=4825; in HS, mean=-0.88 med=-0.80 sd=0.69) and divergent (N=6842; in HS, mean=0.60 med=0.48 sd=0.63). For each cluster, we calculated the enrichment in biological functions in Cytoscape85 with the BiNGO plugin86. We took as background all the 12,400 genes. To eliminate redundancy, we first filtered out all the statistically significant Gene Ontology (GO) terms with fewer than 10 or more than 1000 genes, and then combined the remaining ones with the EnrichmentMap plugin87. We used a P-value Cutoff = 0.005, FDR Q-value Cutoff = 0.05, and Jaccard Coefficient = 0.5.
For the Cell-type Specific Expression Analysis (CSEA)88, we used the online tool at http://genetics.wustl.edu/jdlab/csea-tool-2/. This method associates a gene list with brain expression profiles across cell types, regions, and time periods.
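The size filter and the Jaccard coefficient mentioned above can be illustrated as follows. This is a sketch of the two definitions only, not a reimplementation of BiNGO or EnrichmentMap; the function names, term names, and gene identifiers are hypothetical.

```python
def filter_terms_by_size(terms, min_genes=10, max_genes=1000):
    """Keep GO terms whose gene set has between min_genes and max_genes members."""
    return {
        name: genes
        for name, genes in terms.items()
        if min_genes <= len(genes) <= max_genes
    }


def jaccard(a, b):
    """Jaccard coefficient: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical GO terms with made-up gene identifiers.
terms = {
    "term_small": {"g1", "g2"},
    "term_kept": {"g%d" % i for i in range(10)},
}
kept = filter_terms_by_size(terms)
print(sorted(kept), jaccard({"A", "B", "C"}, {"B", "C", "D"}))
```

Two terms sharing two of four distinct genes reach a coefficient of 0.5, which is the threshold value quoted in the text.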
Wilcoxon and rank-biserial correlation
To measure whether each gene set was significantly more conserved or divergent than expected by chance, we applied a Wilcoxon signed-rank test to the normalized dN/dS values (ωGC) of the genes in the set against zero (the mean of the genome). To quantify the effect size, we used the matched-pairs rank-biserial correlation introduced by Kerby89. Following the calculation of the non-parametric Wilcoxon signed-rank test, the rank-biserial correlation is calculated as the difference between the sums of positive and negative ranks, divided by the total sum of ranks:
$$ r = \frac{\sum R_{+} - \sum R_{-}}{\sum R_{+} + \sum R_{-}} $$
It corresponds to the proportion of observations favourable to the hypothesis (f) minus the proportion of observations that are unfavourable (u), and thus represents an effect size. Like other correlational measures, its value ranges from minus one to plus one, with a value of zero indicating no relationship. In our case, a negative rank-biserial correlation corresponds to a gene set in which more genes have negative ωGC values than positive ones, thus revealing more conservation than the average of all genes (i.e. ωGC = 0). Conversely, a positive rank-biserial correlation corresponds to a gene set that is more divergent than expected. All statistics related to Figures 1d, 2a and 2b are summarized in Supplementary table 3.
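As an illustration of this effect size, the sketch below computes a matched-pairs rank-biserial correlation against zero from a list of ωGC values. The function names and example values are hypothetical; dropping zero values and assigning average ranks to ties follow the usual signed-rank conventions and are assumptions here.

```python
def signed_rank_sums(values):
    """Return (sum of positive ranks, sum of negative ranks).

    Values are ranked by absolute value; zeros are dropped and tied
    absolute values share their average rank (assumed conventions).
    """
    ordered = sorted((v for v in values if v != 0), key=abs)
    w_pos = 0.0
    w_neg = 0.0
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and abs(ordered[j]) == abs(ordered[i]):
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the ranks i+1 .. j
        for v in ordered[i:j]:
            if v > 0:
                w_pos += avg_rank
            else:
                w_neg += avg_rank
        i = j
    return w_pos, w_neg


def rank_biserial(values):
    """(positive rank sum - negative rank sum) / total rank sum."""
    w_pos, w_neg = signed_rank_sums(values)
    total = w_pos + w_neg
    if total == 0:
        raise ValueError("no non-zero values to rank")
    return (w_pos - w_neg) / total


# Hypothetical gene set, mostly below the genome mean of zero.
omega_gc = [-0.9, -0.5, 0.2, -1.3]
print(rank_biserial(omega_gc))
```

A value close to minus one indicates that almost all of the rank mass lies on the conserved side of zero.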
Validation by resampling technique
To correct for potential biases due to the length of the coding sequence or the global specificity of gene expression (Tau, see section A3), we also ran bootstraps. For each draw, we randomly took the same number of genes as in the sample from the total set of genes with non-missing dN/dS. We corrected for CCDS length and GC content by bootstrap resampling. We estimated the significance of the null hypothesis rejection by counting how many bootstrap draws (B0) fell below and above the observed measure (m). The related empirical p-value was calculated according to this formula:
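The resampling scheme can be sketched as follows. Several choices in this example are assumptions rather than details taken from the text: the estimator (B0 + 1)/(B + 1) is a common convention for empirical p-values and may differ from the formula used for the analyses, the mean is used as the measure, genes are drawn without replacement, and the matching on CCDS length and GC content is not reproduced. The function name and values are hypothetical.

```python
import random
from statistics import mean


def empirical_p_value(all_values, set_size, observed, n_draws=1000, seed=0):
    """One-sided empirical p-value of an observed gene-set mean.

    Draws random gene sets of the same size, counts how many draw means
    fall at or below and at or above the observed value, and uses the
    smaller count as B0 in (B0 + 1) / (B + 1) (assumed estimator).
    """
    rng = random.Random(seed)
    below = 0
    above = 0
    for _ in range(n_draws):
        draw_mean = mean(rng.sample(all_values, set_size))
        if draw_mean <= observed:
            below += 1
        if draw_mean >= observed:
            above += 1
    b0 = min(below, above)
    return (b0 + 1) / (n_draws + 1)


# Hypothetical genome in which every gene sits exactly at the mean (zero).
genome = [0.0] * 50
print(empirical_p_value(genome, set_size=10, observed=-1.0, n_draws=99))
```

Adding one to the numerator and the denominator keeps the estimate strictly above zero, whatever the number of draws.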
Author contributions
G.D. and T.B. devised the project and conceived the main conceptual ideas. G.D. developed the methods, performed the analyses, and designed the figures. G.D. and T.B. discussed the results and wrote the manuscript. S.M. developed the companion website.
Acknowledgements
We thank J-P. Changeux, L. Quintana-Murci, E. Patin, G. Laval, and B. Arcangioli for sharing their advice and comments, and the members of the Human Genetics and Cognitive Functions lab for helpful discussions. We are also grateful to S. Pääbo, C. Gorgolewski, R. Carter, L. Bally-Cuif, A. Chedotal, C. Berthelot, H. Roest Crollius, M. Haeussler, M. Verhage and the SynGO consortium for providing key datasets without which the current work would not have been possible. This work was supported by the Institut Pasteur; the Centre National de la Recherche Scientifique; the University Paris Diderot; the Fondation pour la Recherche Médicale [DBI20141231310]; the Human Brain Project; the Cognacq-Jay foundation; the Bettencourt-Schueller foundation; and the Agence Nationale de la Recherche (ANR) [SynPathy]. This research was supported by the Laboratory of Excellence GENMED (Medical Genomics) grant no. ANR-10-LABX-0013, Bio-Psy and by the INCEPTION program ANR-16-CONV-0005, all managed by the ANR as part of the Investment for the Future program. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.