Abstract
In recent years, efforts to elucidate the genetic and molecular mechanisms defining the phenotypic uniqueness of Modern Humans have made significant progress in illuminating the essential role of human-specific regulatory sequences (HSRS). The macromolecules comprising the essential building blocks of life at the cellular and organismal levels remain highly conserved during the evolution of humans and other Great Apes. Identification of nearly one hundred thousand candidate HSRS validates the idea that phenotypes unique to humans may result from human-specific changes to genomic regulatory sequences defined as “regulatory mutations” (King and Wilson, 1975). The exquisite degree of accuracy of the state-of-the-art molecular definition of HSRS is illustrated by the identification of 35,074 single nucleotide changes (SNCs) that are fixed in humans, distinct from other primates, and located in differentially-accessible (DA) chromatin regions during human brain development (Kanton et al., 2019). Annotation of SNCs derived and fixed in modern humans that overlap DA chromatin regions during brain development revealed that 99.8% of candidate regulatory SNCs are shared with archaic humans. This remarkable conservation on the human lineage of candidate regulatory SNCs associated with early stages of human brain development suggests that coding genes whose expression is regulated by human-specific SNCs may have a broad effect on human-specific traits beyond embryonic development. Gene set enrichment analyses of 8,405 genes linked with 35,074 human-specific SNCs revealed a staggering breadth of significant associations with morphological structures, physiological processes, and pathological conditions of Modern Humans, including more than 1,000 anatomically-distinct regions of the adult human brain, many types of human cells and tissues, more than 200 common human disorders, and more than 1,000 rare diseases.
Thousands of genes connected with human-specific regulatory SNCs represent essential genetic elements of the autosomal inheritance and survival of species phenotypes: a total of 1,494 genes linked with either autosomal dominant or recessive inheritance, as well as 2,273 genes associated with premature death, embryonic survival, and perinatal, neonatal, and postnatal lethality of both complete and incomplete penetrance, have been identified in this contribution. Therefore, thousands of heritable traits and critical genes affecting offspring survival appear to have been placed under human-specific regulatory control in the genomes of Modern Humans. These observations highlight the remarkable translational opportunities and potential clinical utility afforded by the discovery of genetic regulatory loci harboring human-specific SNCs in the ground-breaking fundamental study of great ape cerebral organoids.
Introduction
DNA sequences of coding genes defining the structure of macromolecules comprising the essential building blocks of life at the cellular and organismal levels remain highly conserved during the evolution of humans and other Great Apes (Chimpanzee Sequencing and Analysis Consortium, 2005; Kronenberg et al., 2018). In striking contrast, a compendium of nearly one hundred thousand candidate human-specific regulatory sequences (HSRS) has been assembled in recent years (Glinsky et al., 2015-2019; Kanton et al., 2019), thus validating the idea that phenotypes unique to humans may result from human-specific changes to genomic regulatory sequences defined as “regulatory mutations” (King and Wilson, 1975). The best evidence of the exquisite degree of accuracy of the contemporary molecular definition of human-specific regulatory sequences is the identification of 35,074 single nucleotide changes (SNCs) that are fixed in humans, distinct from other primates, and located within differentially-accessible (DA) chromatin regions during human brain development in cerebral organoids (Kanton et al., 2019). However, only a small fraction of the identified DA chromatin peaks (600 of 17,935 DA peaks; 3.3%) manifest associations with differential expression in the human versus chimpanzee cerebral organoid model of brain development, consistent with the hypothesis that the regulatory effects of these chromatin regions on gene expression are not restricted to the early stages of brain development. Annotation of SNCs derived and fixed in modern humans that overlap DA chromatin regions during brain development revealed that essentially all candidate regulatory human-specific SNCs are shared with archaic humans (35,010 SNCs; 99.8%) and only 64 SNCs are unique to modern humans (Kanton et al., 2019).
This remarkable conservation on the human lineage of human-specific SNCs associated with human brain development motivates an in-depth exploration of coding genes whose expression may be affected by genetic regulatory loci harboring human-specific SNCs. The GREAT algorithm (McLean et al., 2010, 2011) was utilized to identify 8,405 genes whose expression might be affected by 35,074 human-specific SNCs located in DA chromatin regions during brain development. Comprehensive gene set enrichment analyses of these genes revealed a staggering breadth of associations with physiological processes and pathological conditions of H. sapiens, including more than 1,000 anatomically-distinct regions of the adult human brain, many human tissues and cell types, more than 200 common human disorders, and 1,116 rare diseases.
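The proportions quoted in this paragraph can be re-derived from the reported counts. The following is a minimal sketch that uses only the numbers given in the text; the variable names are illustrative.

```python
# Counts as reported in the text (Kanton et al., 2019).
da_peaks_total = 17_935           # differentially-accessible chromatin peaks
da_peaks_with_de = 600            # peaks associated with differential expression
sncs_total = 35_074               # human-specific SNCs in DA regions
sncs_shared_with_archaic = 35_010 # SNCs shared with archaic humans


def percent(part, whole):
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


pct_peaks = percent(da_peaks_with_de, da_peaks_total)
pct_shared = percent(sncs_shared_with_archaic, sncs_total)
sncs_modern_only = sncs_total - sncs_shared_with_archaic

print(f"DA peaks linked to differential expression: {pct_peaks:.1f}%")
print(f"SNCs shared with archaic humans: {pct_shared:.1f}%")
print(f"SNCs unique to modern humans: {sncs_modern_only}")
```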
Results and discussion
Identification and characterization of putative genetic regulatory targets associated with human-specific single nucleotide changes (SNCs) in differentially accessible (DA) chromatin regions during brain development
To identify and characterize human genes associated with 35,074 human-specific single nucleotide changes (SNCs) in differentially accessible (DA) chromatin regions during human and chimpanzee neurogenesis in cerebral organoids (Kanton et al., 2019), the GREAT algorithm (McLean et al., 2011) was employed. These analyses identified 8,405 genes with putative regulatory connections to human-specific SNCs (Figure 1) and revealed a remarkable breadth of highly significant associations with a multitude of biological processes, molecular functions, genetic and metabolic pathways, cellular compartments, and gene expression perturbations (Supplemental Table Set S1).
Particularly striking numbers of significant associations were uncovered by the GREAT algorithm in the analyses of two databases:
The Human Phenotype Ontology containing over 13,000 terms describing clinical phenotypic abnormalities that have been observed in human diseases, including hereditary disorders (326 significant records with binomial FDR Q-Value < 0.05);
The MGI Expression Detected ontology referencing genes expressed in specific anatomical structures at specific developmental stages (Theiler stages) in the mouse (370 significant records with binomial FDR Q-Value < 0.05).
These observations support the hypothesis that biological functions of genes under the putative regulatory control of human-specific SNCs in DA chromatin regions during brain development are not limited to contributions to the early stages of neuro- and corticogenesis. Collectively, findings reported in the Supplemental Table Set S1 argue that genes whose expression is affected by human-specific SNCs may represent a genomic dominion of putative regulatory dependency on HSRS that is likely to play an important role in a broad spectrum of physiological processes and pathological conditions of Modern Humans.
Identification of genes whose expression distinguishes thousands of anatomically distinct areas of the adult human brain, various regions of the central nervous system, and many different cell types and tissues in the human body
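The binomial FDR Q-values cited above can be illustrated with a short sketch. It assumes the commonly described form of the GREAT region-based test (a binomial upper tail over input regions) followed by a Benjamini-Hochberg adjustment; the term names, hit counts, and genome fractions are hypothetical and are not taken from the Supplemental Tables.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted q-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


# Hypothetical toy input: 200 regions; for each term, the number of regions
# falling in its regulatory domains and the genome fraction those domains cover.
n_regions = 200
terms = {"term_A": (25, 0.05), "term_B": (12, 0.05), "term_C": (9, 0.05)}

names = list(terms)
p_vals = [binomial_upper_tail(terms[t][0], n_regions, terms[t][1]) for t in names]
q_vals = benjamini_hochberg(p_vals)

for name, p, q in zip(names, p_vals, q_vals):
    flag = "significant" if q < 0.05 else "not significant"
    print(f"{name}: p={p:.3g} q={q:.3g} ({flag})")
```

The toy input is kept small deliberately: the direct sum above is only practical for a few hundred regions, and an analysis of 35,074 SNCs would need a numerically stable implementation.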
To validate and extend these observations, comprehensive gene set enrichment analyses were next performed employing the web-based Enrichr API protocols (Chen et al., 2013; Kuleshov et al., 2016), which interrogated nearly 200,000 gene sets from more than 100 gene set libraries. The results of these analyses are summarized in Table 1 and reported in detail in the Supplemental Table Set S2. Genes whose expression was placed during evolution under the regulatory control of ∼ 35,000 human-specific SNCs demonstrate a staggering breadth of significant associations with a broad spectrum of anatomically distinct regions, various cell and tissue types, a multitude of physiological processes, and numerous pathological conditions of H. sapiens.
Of particular interest is the apparent significant enrichment of human-specific SNCs-associated genes among both up-regulated and down-regulated genes whose expression discriminates thousands of anatomically distinct areas of the adult human brain defined in the Allen Brain Atlas (Figure 2; Supplemental Table Set S2). Notably, various thalamus regions appear frequently among the top-scored anatomical areas of the human brain (Figure 2; Supplemental Table Set S2). These observations support the hypothesis that genetic loci harboring human-specific SNCs may exert regulatory effects on structural and functional features of the adult human brain, thus likely affecting the development and functions of the central nervous system in Modern Humans. Consistent with this idea, examination of the enrichment patterns of human-specific SNCs-associated genes in the ARCHS4 Human Tissues gene expression database revealed that the 10 most significantly enriched records, which overlap a majority of region-specific marker genes, correspond to various anatomically-distinct regions of the central nervous system (Figure 2; Supplemental Table Set S2). However, results of gene set enrichment analyses indicate that the inferred regulatory effects of genetic loci harboring human-specific SNCs are not restricted to the various regions of the central nervous system: they appear to affect gene expression profiles of many different cell types and tissues in the human body (Table 1; Supplemental Table Set S2).
Identification and characterization of genes whose expression is altered during aging of humans, rats, and mice
Genes whose altered expression is implicated in the aging of various tissues and organs of humans, rats, and mice are significantly enriched among 8,405 genes associated with human-specific regulatory SNCs (Figure 3; Supplemental Table Set S2). Aging of the hippocampus was implicated most frequently among genes manifesting increased expression with age, while among genes exhibiting aging-associated decreased expression the hippocampus and frontal cortex were identified repeatedly (Figure 3). Overall, twice as many significant association records were observed among aging-associated down-regulated genes compared to up-regulated genes (Table 1). Collectively, these observations indicate that genes whose changes in expression were associated with aging in mammals, in particular hippocampal and frontal cortex aging, represent important elements of a genomic dominion that was placed under regulatory control of genetic loci harboring human-specific SNCs.
Identification of genes implicated in development and manifestations of hundreds of physiological and pathological phenotypes and autosomal inheritance in Modern Humans
Interrogations of the Human Phenotype Ontology database (298 significantly enriched records identified), the Genome-Wide Association Study (GWAS) Catalog (241 significantly enriched records identified), and the database of Human Genotypes and Phenotypes (136 significantly enriched records identified) revealed several hundred physiological and pathological phenotypes and thousands of genes manifesting significant enrichment patterns defined at an adjusted p value < 0.05 (Figure 4; Table 1; Supplemental Table Set S2). Interestingly, 645 and 849 genes implicated in autosomal dominant (HP:0000006) and recessive (HP:0000007) inheritance, respectively, were identified amongst genes associated with human-specific regulatory SNCs (Figure 4; Supplemental Table Set S2). Notable pathological conditions among top-scored records identified in the database of Human Genotypes and Phenotypes are stroke, myocardial infarction, coronary artery disease, and heart failure (Figure 4).
A total of 241 significantly enriched records (Table 1) were documented by gene set enrichment analyses of the GWAS Catalog (2019), among which a highly diverse spectrum of pathological conditions linked to genes associated with human-specific regulatory SNCs was identified, including obesity, type 2 diabetes, amyotrophic lateral sclerosis, autism spectrum disorders, attention deficit hyperactivity disorder, bipolar disorder, major depressive disorder, schizophrenia, Alzheimer’s disease, malignant melanoma, diverticular disease, asthma, coronary artery disease, glaucoma, as well as breast, prostate, and colorectal cancers (Figure 4; Supplemental Table Set S2). These observations indicate that thousands of genes putatively associated with genetic regulatory loci harboring human-specific SNCs may affect the risk of developing numerous pathological conditions in Modern Humans.
Identification of genes whose expression is altered in several hundred common human disorders
Gene set enrichment analyses-guided interrogation of the Gene Expression Omnibus (GEO) database revealed a remarkably diverse spectrum of human diseases with etiologic origins in multiple organs and tissues and highly heterogeneous pathophysiological trajectories of their pathogenesis (Figure 5; Supplemental Table Set S2). Overlapping gene sets between disease-associated genes and human-specific SNCs-linked genes comprise hundreds of genes that were either up-regulated (204 significant disease records) or down-regulated (240 significant disease records) in specific pathological conditions, including schizophrenia, bipolar disorder, various types of malignant tumors, Crohn’s disease, ulcerative colitis, Down syndrome, Alzheimer’s disease, spinal muscular atrophy, multiple sclerosis, autism spectrum disorders, type 2 diabetes mellitus, morbid obesity, and cardiomyopathy (Figure 5; Supplemental Table Set S2). These observations demonstrate that thousands of genes whose expression is altered in a myriad of human diseases appear associated with genetic regulatory loci harboring human-specific SNCs.
Identification of genes implicated in more than 1,000 records classified as human rare diseases
The present analyses demonstrate that thousands of genes associated with human-specific regulatory SNCs have been previously identified as genetic elements affecting the likelihood of developing a multitude of common human disorders. Similarly, thousands of genes whose expression is altered during development and manifestation of multiple common human disorders appear linked to genetic regulatory loci harboring human-specific SNCs. Remarkably, interrogations of the Enrichr libraries of genes associated with rare diseases of Modern Humans identified 473, 603, 641, and 1,116 significantly enriched records of various rare disorders employing the Rare Diseases GeneRIF Gene Lists library, the Rare Diseases GeneRIF ARCHS4 Predictions library, the Rare Diseases AutoRIF ARCHS4 Predictions library, and the Rare Diseases AutoRIF Gene Lists library, respectively (Figure 6; Supplemental Table Set S2). Taken together, these observations demonstrate that thousands of genes associated with hundreds of human rare disorders appear linked with human-specific regulatory SNCs.
Gene ontology analyses of putative regulatory targets of genetic loci harboring human-specific SNCs
Gene Ontology (GO) analyses identified a constellation of biological processes (GO Biological Process: 308 significant records) supplemented with a multitude of molecular functions (GO Molecular Function: 81 significant records) that appear under the regulatory control of human-specific SNCs (Figure 7; Supplemental Table Set S2). Consistently, both databases frequently identified components of transcriptional regulation and protein kinase activities among the most significant records. Other significantly enriched records of interest are regulation of apoptosis, cell proliferation, migration, and various binding properties (cadherin binding; sequence-specific DNA binding; protein-kinase binding; amyloid-beta binding; actin binding; tubulin binding; microtubule binding; PDZ domain binding), which are often supplemented by references to the corresponding activity among the enriched records, for example, enriched records of both binding and activity of protein kinases.
Interrogation of the GO Cellular Component database identified 29 significantly enriched records, among which nuclear chromatin as well as various cytoskeleton and membrane components appear noteworthy (Figure 7). Both the GO Biological Process and GO Cellular Component databases identified significantly enriched records associated with central nervous system development and functions, such as axonogenesis and axon guidance; generation of neurons, neuron differentiation, and neuron projection morphogenesis; cellular components of dendrites and dendrite membrane; and ionotropic glutamate receptor complex. In several instances, biologically highly consistent enrichment records have been identified in different GO databases: cadherin binding (GO Molecular Function) and catenin complex (GO Cellular Component); actin binding (GO Molecular Function) and actin cytoskeleton, cortical actin cytoskeleton, actin-based cell projections (GO Cellular Component); microtubule motor activity, tubulin binding, microtubule binding (GO Molecular Function) and microtubule organizing center, microtubule cytoskeleton (GO Cellular Component).
Analyses of human and mouse databases of the Kyoto Encyclopedia of Genes and Genomes (KEGG; Figure 8) identified more than 100 significantly enriched records in each database (KEGG 2019 Human: 129 significant records; KEGG 2019 Mouse: 106 significant records). Genes associated with human-specific regulatory SNCs were implicated in a remarkably broad spectrum of signaling pathways ranging from pathways regulating the pluripotency of stem cells to cell type-specific morphogenesis and differentiation pathways, for example, melanogenesis and adrenergic signaling in cardiomyocytes (Figure 8). Genes under putative regulatory control of human-specific SNCs include hundreds of genes contributing to specific functions of specialized differentiated cells (gastric acid secretion; insulin secretion; aldosterone synthesis and secretion), multiple receptor/ligand-specific signaling pathways, as well as genetic constituents of pathways commonly deregulated in cancer and linked to organ-specific malignancies, for example, breast, colorectal, and small cell lung cancers (Figure 8). Other notable entries among the most significantly enriched records include axon guidance; dopaminergic, glutamatergic, and cholinergic synapses; neuroactive receptor-ligand interactions; and AGE-RAGE signaling pathway in diabetic complications (Figure 8; Supplemental Table Set S2).
Structurally, functionally, and evolutionarily distinct classes of HSRS share the relatively restricted elite set of common genetic targets
It has been suggested that unified activities of thousands of candidate HSRS comprising a coherent compendium of genomic regulatory elements markedly distinct in their structure, function, and evolutionary origin may have contributed to development and manifestation of human-specific phenotypic traits (Glinsky, 2019). It was of interest to determine whether genes previously linked to other classes of HSRS, which were identified without consideration of human-specific SNCs, overlap with genes associated in this contribution with genomic regulatory loci harboring human-specific SNCs. It was observed that the common gene set comprises 7,406 coding genes (88% of all human-specific SNCs-associated genes), indicating that structurally-diverse HSRS, the evolutionary origin of which has been driven by mechanistically-distinct processes, appear to favor regulatory alignment with the relatively restricted elite set of genetic targets (Figure 9).
Previous studies have identified stem cell-associated retroviral sequences (SCARS) encoded by human endogenous retroviruses LTR7/HERVH and LTR5_Hs/HERVK as one of the significant sources of the evolutionary origin of HSRS (Glinsky, 2015-2019), including human-specific transcription factor binding sites (TFBS) for NANOG, OCT4, and CTCF (Glinsky, 2015). Next, the common sets of genetic regulatory targets were identified for genes whose expression is regulated by SCARS and genes associated in this study with human-specific regulatory SNCs (Figure 9). It has been determined that each of the structurally-distinct families of SCARS appears to share a common set of genetic regulatory targets with human-specific SNCs (Figure 9). Overall, expression of nearly half (4,029 genes; 48%) of all genes identified as putative regulatory targets of human-specific SNCs is regulated by SCARS (Figure 9). Consistent with the idea that structurally-diverse HSRS may favor the relatively restricted elite set of genetic targets, the common gene set of regulatory targets for HSRS, SCARS, and SNCs comprises 7,833 coding genes, or 93% of all genes associated in this contribution with human-specific regulatory SNCs (Figure 9).
To gain insights into mechanisms of SCARS-mediated effects on expression of 4,029 genes linked to human-specific regulatory SNCs, the numbers of genes whose expression was either activated by SCARS (down-regulated following SCARS silencing) or inhibited by SCARS (up-regulated following SCARS silencing) have been determined. It was observed that SCARS exert a predominantly inhibitory effect on expression of genes associated with human-specific regulatory SNCs, which is exemplified by the up-regulation of as many as 87% of the affected genes following SCARS silencing (Figure 9).
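The inference of activating versus inhibitory effects from silencing experiments can be sketched as follows. This is a minimal illustration of the logic described above, not the pipeline used in the study; the gene names and log2 fold changes are hypothetical.

```python
def scars_effect(log2_fc_after_silencing):
    """Infer the effect of SCARS on a gene from its response to SCARS silencing."""
    if log2_fc_after_silencing > 0:
        # Expression rises when SCARS are silenced, so SCARS inhibit the gene.
        return "inhibited by SCARS"
    if log2_fc_after_silencing < 0:
        # Expression falls when SCARS are silenced, so SCARS activate the gene.
        return "activated by SCARS"
    return "unchanged"


# Hypothetical log2 fold changes after SCARS silencing (illustrative values only).
responses = {"GENE_A": 1.4, "GENE_B": 0.8, "GENE_C": -1.1, "GENE_D": 2.0}

effects = {gene: scars_effect(fc) for gene, fc in responses.items()}
affected = [effect for effect in effects.values() if effect != "unchanged"]
inhibited_fraction = affected.count("inhibited by SCARS") / len(affected)

print(effects)
print(f"Fraction of affected genes inhibited by SCARS: {inhibited_fraction:.0%}")
```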
Identification of 2,273 genes associated with human-specific SNCs and implicated in premature death and embryonic, perinatal, neonatal, and postnatal lethality phenotypes
Interrogation of MGI Mammalian Phenotype databases revealed several hundred mammalian phenotypes affected by thousands of genes associated with genomic regulatory regions harboring human-specific SNCs: the MGI Mammalian Phenotype (2017) database identified 749 significant enrichment records, while the MGI Mammalian Phenotype Level 4 (2019) database identified 407 significant enrichment records (Figure 10; Supplemental Table Set S2). Strikingly, the present analyses identified a total of 2,273 genes that are associated with premature death, embryonic survival, and perinatal, neonatal, and postnatal lethality phenotypes of both complete and incomplete penetrance (Figure 11) and appear under regulatory control of genetic loci harboring human-specific SNCs. A significant fraction of these 2,273 genes essential for offspring survival were implicated in autosomal dominant (389 genes) and recessive (426 genes) inheritance in Modern Humans (Figure 11). Based on these observations, it has been concluded that thousands of genes within the genomic dominion of putative regulatory dependency on human-specific SNCs represent the essential genetic elements of the survival of species phenotypes.
Methods
Data source and analytical protocols
Candidate human-specific regulatory sequences and African Apes-specific retroviral insertions
A total of 94,806 candidate HSRS, including 35,074 human-specific SNCs, were analyzed; detailed descriptions of these sequences and references to the primary original contributions are reported elsewhere (Glinsky et al., 2015-2019; Kanton et al., 2019). Solely publicly available datasets and resources were used in this contribution. The significance of the differences in the expected and observed numbers of events was calculated using the two-tailed Fisher’s exact test. Additional placement enrichment tests were performed for individual classes of HSRS, taking into account the size in bp of the corresponding genomic regions.
Data analysis
Categories of DNA sequence conservation
Sequences highly conserved in primates (pan-primate), primate-specific sequences, and human-specific sequences were identified as previously described (Glinsky, 2015-2019). In brief, all categories were defined by direct and reciprocal mapping using LiftOver. Specifically, the following categories of candidate regulatory sequences were distinguished:
– Highly conserved in primates: DNA sequences that have at least 95% of bases remapped during conversion from/to human (Homo sapiens, hg38), chimp (Pan troglodytes, v5), and bonobo (Pan paniscus, v2; in specifically designated instances, Pan paniscus, v1 was utilized for comparisons). Similarly, highly-conserved sequences were defined for hg38 and the latest releases of the genomes of Gorilla, Orangutan, Gibbon, and Rhesus.
– Primate-specific: DNA sequences that failed to map to the mouse genome (mm10).
– Human-specific: DNA sequences that failed to map at least 10% of bases from human to both chimpanzee and bonobo. All candidate HSRS identified based on sequence alignment failures to the genomes of both chimpanzee and bonobo were subjected to more stringent additional analyses requiring mapping failures to the genomes of Gorilla, Orangutan, Gibbon, and Rhesus. These loci were considered human-specific regulatory sequences (HSRS) created de novo.
To infer the putative evolutionary origins, each evolutionary classification was defined independently by running the corresponding analyses on all candidate HSRS representing the specific category. For example, human-rodent conversion identifies sequences that are absent in the mouse genome based on the sequence identity threshold of 10%. Additional comparisons were performed using the same methodology, exactly as stated in the manuscript text and described in detail below.
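The category definitions above can be summarized in a short sketch. It assumes that the fraction of bases remapped from hg38 to each genome has already been obtained (for example, from LiftOver output) and it reads the 10% criterion as "fewer than 10% of bases remapped"; that reading, the one-directional simplification of the reciprocal mapping, and all names and values in the code are assumptions of this illustration.

```python
# 0.95 is stated in the text; treating the 10% criterion as
# "fewer than 10% of bases remapped" is an assumption of this sketch.
CONSERVED_MIN = 0.95
FAILURE_MAX = 0.10

CLOSEST_APES = ("chimpanzee", "bonobo")
OTHER_PRIMATES = ("gorilla", "orangutan", "gibbon", "rhesus")


def classify(remapped):
    """Classify one human (hg38) sequence.

    `remapped` maps a genome name to the fraction of bases (0..1) that were
    successfully remapped from human to that genome. Categories are assigned
    independently, so a sequence may receive more than one label.
    """
    labels = []
    if all(remapped[g] >= CONSERVED_MIN for g in CLOSEST_APES):
        labels.append("highly conserved in primates")
    if remapped["mouse"] < FAILURE_MAX:
        labels.append("primate-specific")
    if all(remapped[g] < FAILURE_MAX for g in CLOSEST_APES):
        # The more stringent step also requires failures in the other primates.
        if all(remapped[g] < FAILURE_MAX for g in OTHER_PRIMATES):
            labels.append("human-specific (stringent)")
        else:
            labels.append("candidate human-specific")
    return labels


# Hypothetical remapping fractions for two illustrative sequences.
conserved_example = {"chimpanzee": 0.99, "bonobo": 0.98, "gorilla": 0.97,
                     "orangutan": 0.96, "gibbon": 0.95, "rhesus": 0.93,
                     "mouse": 0.02}
novel_example = {"chimpanzee": 0.03, "bonobo": 0.00, "gorilla": 0.01,
                 "orangutan": 0.00, "gibbon": 0.00, "rhesus": 0.00,
                 "mouse": 0.00}

print(classify(conserved_example))
print(classify(novel_example))
```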
Genome-wide proximity placement analysis
Genome-wide Proximity Placement Analysis (GPPA) of distinct genomic features co-localizing with HSRS was carried out as described previously (Glinsky, 2015-2019). Briefly, a typical example of the analytical protocol is described below. The significance of overlaps between hESC active enhancers and human-specific transcription factor binding sites (hsTFBS) was examined by first identifying all hsTFBS that overlap with any of the genomic regions tested in the ChIP-STARR-seq dataset (Barakat et al., 2018; Glinsky et al., 2018-2019). Then, the relative frequency of active enhancers overlapping with hsTFBS was calculated. To assess the significance of the observed overlap of genomic coordinates, the values recorded for hsTFBS were compared with the expected frequency of active and non-active enhancers that overlap with all TFBS for NANOG (15%) and OCT4 (25%) as previously determined (Barakat et al., 2018). The analyses demonstrated that more than 95% of hsTFBS co-localized with sequences in the tested regions of the hESC genome.
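The two steps of this example can be sketched as a coordinate-overlap calculation. This is one plausible reading of the protocol, shown on hypothetical coordinates; the interval convention (half-open) and all data values are assumptions, and only the 15% expected frequency for NANOG comes from the text.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


# Hypothetical tested regions: (chrom, start, end, is_active_enhancer).
tested_regions = [
    ("chr1", 100, 200, True),
    ("chr1", 300, 400, False),
    ("chr2", 50, 150, True),
    ("chr2", 500, 600, False),
]

# Hypothetical human-specific TFBS: (chrom, start, end).
hs_tfbs = [("chr1", 150, 160), ("chr1", 350, 360), ("chr2", 100, 120), ("chr3", 10, 20)]

# Step 1: keep hsTFBS that overlap any tested region.
in_tested = [t for t in hs_tfbs if any(overlaps(t, r[:3]) for r in tested_regions)]
coverage = len(in_tested) / len(hs_tfbs)

# Step 2: among those, the fraction overlapping an active enhancer.
active = [r[:3] for r in tested_regions if r[3]]
in_active = [t for t in in_tested if any(overlaps(t, r) for r in active)]
active_fraction = len(in_active) / len(in_tested)

EXPECTED_NANOG = 0.15  # expected frequency for all NANOG TFBS, from the text
print(f"hsTFBS within tested regions: {coverage:.0%}")
print(f"Observed active fraction: {active_fraction:.2f} (expected {EXPECTED_NANOG:.2f})")
```

The quadratic scan is adequate for a toy example; genome-scale coordinate sets would normally be sorted or indexed first.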
The Enrichr API (January 2018 through October 2019 releases) (Chen et al., 2013; Kuleshov et al., 2016) was used to test genes linked to HSRS of interest for significant enrichment in numerous functional categories. In all tables and plots (unless stated otherwise), the “combined score” calculated by Enrichr is reported, which is a product of the significance estimate and the magnitude of enrichment (combined score c = log(p) * z, where p is the Fisher’s exact test p-value and z is the z-score deviation from the expected rank). When technically feasible, larger sets of genes comprising several thousand entries were analyzed. Regulatory connectivity maps between HSRS and coding genes and additional functional enrichment analyses were performed with the GREAT algorithm (McLean et al., 2010; 2011) at default settings.
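The combined score defined above, c = log(p) * z, can be computed directly once p and z are available. The sketch below assumes the natural logarithm and uses hypothetical term names, p-values, and z-scores; it does not call the Enrichr service.

```python
import math


def combined_score(p_value, z_score):
    """Combined score c = log(p) * z (natural logarithm assumed)."""
    return math.log(p_value) * z_score


# Hypothetical enrichment records: (term, Fisher's exact p-value, z-score).
records = [
    ("term_A", 1e-8, -2.1),
    ("term_B", 0.03, -1.5),
    ("term_C", 0.4, -0.2),
]

# For p < 1 the logarithm is negative, so a negative z-score gives a positive
# score; larger scores then indicate stronger enrichment.
ranked = sorted(records, key=lambda r: combined_score(r[1], r[2]), reverse=True)

for term, p, z in ranked:
    print(f"{term}: combined score = {combined_score(p, z):.2f}")
```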
Statistical Analyses of the Publicly Available Datasets
All statistical analyses of the publicly available genomic datasets, including error rate estimates, background and technical noise measurements and filtering, feature peak calling, feature selection, assignments of genomic coordinates to the corresponding builds of the reference human genome, and data visualization, were performed exactly as reported in the original publications and associated references linked to the corresponding data visualization tracks (http://genome.ucsc.edu/). Any modifications or new elements of statistical analyses are described in the corresponding sections of the Results. Statistical significance of the Pearson correlation coefficients was determined using GraphPad Prism version 6.00 software. Both nominal and Bonferroni adjusted p values were estimated. The significance of the differences in the numbers of events between the groups was calculated using two-sided Fisher’s exact and Chi-square tests, and the significance of the overlap between the events was determined using the hypergeometric distribution test (Tavazoie et al., 1999).
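The hypergeometric overlap test mentioned above can be illustrated with a minimal sketch. The background size of 20,000 genes, the gene-set size, and the overlap count are hypothetical; only the 8,405 SNC-associated genes come from the text.

```python
from math import comb


def hypergeom_upper_tail(k, population, successes, draws):
    """P(X >= k) when `draws` items are sampled without replacement from a
    population containing `successes` marked items."""
    total = comb(population, draws)
    tail = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, min(successes, draws) + 1)
    )
    # Exact integer arithmetic until the final division.
    return tail / total


background_genes = 20_000   # hypothetical background size
snc_genes = 8_405           # genes linked to human-specific SNCs (from the text)
gene_set_size = 100         # hypothetical gene set
observed_overlap = 60       # hypothetical overlap

expected_overlap = gene_set_size * snc_genes / background_genes
p_value = hypergeom_upper_tail(observed_overlap, background_genes,
                               snc_genes, gene_set_size)

print(f"Expected overlap: {expected_overlap:.1f}")
print(f"P(overlap >= {observed_overlap}) = {p_value:.3g}")
```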
Supplemental Information
Supplemental information includes Supplemental Table Sets S1 and S2, Supplemental Text, and Supplemental Figures.
Author Contributions
This is a single author contribution. All elements of this work, including the conception of ideas, formulation, and development of concepts, execution of experiments, analysis of data, and writing of the paper, were performed by the author.
Acknowledgements
This work was made possible by the open public access policies of major grant funding agencies and international genomic databases and the willingness of many investigators worldwide to share their primary research data. I would like to thank my anonymous colleagues for their valuable critical contributions during the peer review process of this work.