Abstract
Transcription factor (TF) proteins play a critical role in the regulation of eukaryote gene expression by sequence-specific binding to genomic locations known as transcription factor binding sites.
Here we present the TFBSFootprinter tool, which has been created to combine transcription-relevant data from six large empirical datasets (Ensembl, JASPAR, FANTOM5, ENCODE, GTEx, and GTRD) to more accurately predict functional sites. A complete analysis integrating all experimental datasets can be performed on genes in the human genome, and a limited analysis can be done on a total of 125 vertebrate species.
As a use-case, we have used TFBSFootprinter to study sites of genomic variation between modern human and Neanderthal promoters. We found significant differences in binding affinity for 110 transcription factors, which are enriched for homeobox genes and for genes with brain-specific expression. Analysis of single-cell data shows that a subset of these (CUX1, CUX2, ESRRG, FOXP1, FOXP2, MEF2C, POU6F2, PRRX1 and RORA) co-occur as marker genes in L4 glutamatergic neurons.
Differential binding sites for these transcription factors were found in 74 target genes, the largest number of which were found in the bidirectional promoter of key mitochondrial-function genes FARS2 and LYRM4.
Introduction
North of Eden
Humans and their hominin relatives have been leaving Africa in waves for more than two million years. Various environmental and cultural pressures have impacted each diaspora in different ways, subsequently producing adaptations reflected in physiology, immunity, and brain size. In relatively recent history, the discovery and sequencing of DNA from remains of Neanderthal [1–4] and Denisovan [5, 6] has allowed direct comparison of DNA from modern and ancient hominids. Of the genomic variations observed between modern humans and Neanderthals, only a limited number occur in gene coding regions. Some of these are found in genes known to affect cognition and morphology (cranium, rib, dentition, shoulder joint) [1], pigmentation and behavioral traits [2], and brain development [7]. However, as has been noted before, there is a paucity of coding variations to explain the differences between related species; the genome of the Altai Neanderthal reveals just 96 fixed amino acid substitutions, occurring in 87 proteins [7]. Unsurprisingly, a much larger set of variants is observed in intergenic regions, owing not only to the fact that these are comparatively much larger regions but also to the expectation of lesser conservation in what until recently was often termed “junk DNA”. While variants in coding regions can directly affect protein structure, those found in intergenic regions may affect regulation of gene expression, through alternative binding of transcription factors in promoters and enhancers and expression of non-coding RNAs. What may surprise some is the cumulative effect of the numerous small (and large) changes to gene expression arising from these manifold intergenic changes, which may ultimately serve as the engine of speciation. Indeed, introgressed Neanderthal DNA has been found to be more depleted in regulatory regions than in those which code for proteins [8].
Using a new computational tool we have created, which incorporates numerous transcription-relevant genomic features, we sought to reveal how the comparative differences in these regions may affect regulatory differences between modern humans and Neanderthals. Specifically, our aim is to identify those gene-regulating transcription factors whose binding to DNA may vary between these species of hominid and thus drive the differences between them.
Transcription factors drive gene regulation
As early as 1963, Zuckerkandl and Pauling began addressing the apparent disparity in the fact that species with obvious differences could have proteins which look so similar. At the time they posited that this could be explained by the idea that “in some species certain products of structural genes inhibit certain sets of other genes for a longer time during development than in other species.” [9]. Since this early exposition, the idea was more formally proposed in a publication by King and Wilson [10], and now the importance of regulatory regions in evolutionary adaptation has been further explored and accepted [11, 12].
Transcription factor (TF) proteins play a critical role in the regulation of eukaryote gene expression by sequence-specific binding to genomic locations known as cis-regulatory elements (CREs) or, more simply, transcription factor binding sites (TFBSs). TFBSs can be found both proximal and distal to gene transcription start sites (TSSs), and multiple TFs often bind cooperatively towards promotion or inhibition of gene transcription, in what is known as a cis-regulatory module (CRM). Because of the role these proteins play in transcription, discovery of TFBSs greatly furthers understanding of many, if not all, biological processes. As a result, many tools have been created to identify TFBSs. Owing to the time and material requirements of individual mutation studies needed for experimental verification, many of these tools are computational. However, at issue in both the computational and experimental approaches are two distinct problems. First is identification of where TF proteins bind to DNA. Experimental tools like chromatin immunoprecipitation with massively parallel DNA sequencing (ChIP-Seq) are rapidly revealing the landscape of TFBSs for individual TFs in various cell types, and under various conditions. Computational tools often leverage this new data to build TFBS models which are increasingly accurate at making those predictions in silico. The second issue is how to determine whether an experimental site (e.g., a ChIP-Seq peak) or a putative computational prediction (statistically likely) is actually biologically relevant, such as by contributing to gene expression/repression, chromatin conformation, or otherwise. This second problem is by far the more demanding of the two, and it is here that we seek to leverage the inclusion of transcription-relevant data.
The law of large numbers
Computational prediction of TFBSs seeks to enlist experimental data in the quest for the best possible specificity and sensitivity. However, computational modeling has deficits rooted in the law of large numbers; that is, because of the large size of a target genome, any method of prediction is bound to produce a large number of spurious results and filtering or thresholding results often means lost true positives. This can be compounded by the fact that biologically relevant TFBSs can be weakly binding [13].
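To make the scale of this problem concrete, the following minimal sketch estimates how many matches to a single fixed k-mer would be expected by chance alone. It assumes a genome of roughly 3.1 billion base pairs, independent and uniform nucleotide frequencies, and scanning of both strands. The function name and these simplifications are illustrative assumptions and are not part of TFBSFootprinter.

```python
def expected_chance_matches(genome_length, motif_length, both_strands=True):
    """Expected number of exact matches to one fixed k-mer in random sequence.

    Assumes each nucleotide is drawn independently with probability 1/4,
    an illustrative simplification (real genomes are not uniform).
    """
    positions = genome_length - motif_length + 1
    if positions <= 0:
        return 0.0
    strands = 2 if both_strands else 1
    return strands * positions * 0.25 ** motif_length


if __name__ == "__main__":
    genome_length = 3_100_000_000  # approximate human genome size (assumption)
    for k in (6, 8, 10, 12, 14):
        hits = expected_chance_matches(genome_length, k)
        print(f"{k}-mer: ~{hits:,.0f} matches expected by chance")
```

Under these assumptions even an exact 12-mer is expected to occur a few hundred times by chance, and a PWM that tolerates mismatches will match far more often, which is why thresholding alone cannot separate functional from spurious sites.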
Depending on the approach, the extent of incorporation of relevant experimental data varies widely. From very early on, the position weight matrix (PWM) has been used to represent and predict the binding of proteins to DNA. PWMs use a single, but very relevant, type of experimental data: that derived from observed binding events [14]. To create a PWM, a count is made of each of the four nucleotides at each position in the set of experimentally determined binding sites, yielding a position frequency matrix (PFM). With the PFM, and using some contextual information about the target genome, a PWM probability model is generated which represents the binding preferences of a TF (detailed in Methods). The PWM can then be used to arrive at a likelihood score for a target DNA region, which thus represents the likelihood of a TF binding to that DNA sequence. The accuracy of this method can continue to improve solely due to the large, and increasing, amounts of TFBS sequencing data provided by newer experimental technologies, such as ChIP-Seq [15], high-throughput systematic evolution of ligands by exponential enrichment (HT-SELEX) [16, 17], and protein binding microarrays (PBMs) [18].
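The PFM-to-PWM steps described above can be sketched as follows. This is a generic log-odds formulation, assuming a uniform background frequency of 0.25 per nucleotide and a pseudocount of 0.8 split according to the background; both values, and all function names, are illustrative assumptions. The exact formulation used by TFBSFootprinter is given in the Methods and may differ.

```python
import math

NUCLEOTIDES = "ACGT"


def build_pfm(sites):
    """Count each nucleotide at each position of equal-length binding sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all binding sites must have the same length")
    pfm = {nt: [0] * length for nt in NUCLEOTIDES}
    for site in sites:
        for pos, nt in enumerate(site.upper()):
            pfm[nt][pos] += 1  # raises KeyError for non-ACGT characters
    return pfm


def pfm_to_pwm(pfm, background=None, pseudocount=0.8):
    """Convert counts to log2-odds weights against background frequencies."""
    background = background or {nt: 0.25 for nt in NUCLEOTIDES}
    length = len(pfm["A"])
    pwm = {nt: [0.0] * length for nt in NUCLEOTIDES}
    for pos in range(length):
        total = sum(pfm[nt][pos] for nt in NUCLEOTIDES)
        for nt in NUCLEOTIDES:
            # The pseudocount keeps unobserved nucleotides from giving log(0).
            freq = (pfm[nt][pos] + pseudocount * background[nt]) / (total + pseudocount)
            pwm[nt][pos] = math.log2(freq / background[nt])
    return pwm


def score_sequence(pwm, sequence):
    """Sum the position-specific weights for a sequence of motif length."""
    return sum(pwm[nt][pos] for pos, nt in enumerate(sequence.upper()))


def scan(pwm, sequence):
    """Score every window of motif length along one strand of a sequence."""
    length = len(pwm["A"])
    return [
        (start, score_sequence(pwm, sequence[start:start + length]))
        for start in range(len(sequence) - length + 1)
    ]


if __name__ == "__main__":
    # Hypothetical binding sites, for illustration only.
    pwm = pfm_to_pwm(build_pfm(["ACGT", "ACGT", "ACGA", "ACGT"]))
    best_start, best_score = max(scan(pwm, "GGACGTGG"), key=lambda hit: hit[1])
    print(best_start, round(best_score, 2))
```

A full scan would also score the reverse complement of each window; that step is omitted here for brevity.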
Old game, new tricks
Updates to the traditional PWM have appeared over the years. One addresses the fact that traditional PWMs assume that the binding preference of a TF is independent at each position of the sequence it binds, resulting in both dinucleotide [19, 20] and position flexible [21–23] models. These become relevant when a TF protein has significantly different binding modes due to changes in conformation, structure, or splicing [24]. Ultimately, however, the difference between position independent and dependent models has been shown to be minimal, and TFs with more complex binding specificities can be accounted for using multiple position-independent PWMs [25, 26].
New statistical and computational algorithms, as they have come into vogue, have been brought to bear on the problem of TF-DNA binding: regression, Monte Carlo simulations, Markov models, machine learning, and deep learning, to name the most prominent. An important distinction is that these, and many other, approaches seek to improve what their creators viewed as the primary aspect of prediction of TFBSs, the binding model itself. There has been great success in targeting this aspect, and while it is critical to know, the location of binding alone is not indicative of function [27]. Because only incremental gains have been achieved with new computational modeling of TF-DNA binding events, it can be argued that further work in this space is best directed at discovery of binding sites for TFs which are not currently cataloged or modeled. Indeed, this is often a key difference to be noted when choosing between traditional PWMs and some of the later computational tools/models. The existence of a greater number of TF binding models is a strong reason to choose to incorporate the more extensively used traditional PWMs in binding prediction.
In the search for increased accuracy, other new models have improved TFBS prediction by instead incorporating other relevant biological data, for example: 3D structure of DNA [28–32], chromatin accessibility/DNase hypersensitivity sites [33, 34], overlap in gene ontology [35], amino acid physicochemical properties [36], and gene expression and chromatin accessibility [37, 38]. These alternative models often match or outperform strictly sequence-based models [23, 32].
TFBSFootprinter incorporates transcription-relevant data
We sought to identify multiple sources of experimental data relevant to gene expression and transcription factor binding, and to incorporate them into a comprehensive model in order to improve prediction of functional TFBSs. Specifically, clustering of TFBSs has been shown to be an indicator of functionality [27,39,40]; conservation of genetic sequence across genomes of related species is one of the most successfully used attributes in identification of TFBSs [40, 41]; proximity to TSS is strongly linked to TFBS functionality [42]; correlation of expression between a transcription factor and another gene is an indication of a functional relationship [37,43,44]; variants in non-coding regions have a demonstrated effect on gene expression [45–47] and variants affecting gene expression are enriched in TFBSs [48]; open chromatin regions (ascertained by ATAC-Seq or DNase-sensitivity) correlate with TF binding [49]; and finally, as previously mentioned, significant effort has gone into identifying the actual composition of the binding sites themselves through the use of sequencing of TFBSs (e.g., ChIP-Seq and HT-SELEX) [16,22,50].
Ensembl identifier-oriented system of analyses
For our tool, instead of using simple absolute genomic coordinates, the Ensembl transcript ID was chosen as the basic unit of reference. This is useful for several reasons. First, the Ensembl database is one of the most well-maintained biological databases in existence. It is continually updated and expanded and contains a wealth of sequence and regulatory information on a large number of vertebrate species. As a result, the tool we present here — TFBSFootprinter — can offer predictions in 125 vertebrates at the time of writing. From human and model organisms such as mouse and zebrafish, to African bush elephant, the catalog will increase as the Ensembl database itself expands. Second, it allows the inclusion of important datasets which are gene-centric, such as FANTOM TSSs and expression data, GTEx eQTLs, and all annotations which are compiled within Ensembl itself. Finally, the Ensembl transcript ID provides an easy point of reference for a greater audience of scientists, thus increasing the accessibility and utility of the tool.
Non sum qualis eram
This study supports the hypothesis that future advances in prediction offered by incorporation of transcription-related biological data will outshine those which can be achieved by improvements in modeling of TF-DNA binding alone. We show here that, by incorporating a variety of relevant biological data, the TFBSFootprinter tool predicts TFBSs more accurately than a PWM alone. As a proof of usage, we apply TFBSFootprinter in a comparative analysis of locations of variation in the promoters of modern human and Neanderthal genomes.
Results
High-scoring TFBSs differ between modern humans and Neanderthals and are enriched for homeobox and brain-expressed genes
In analysis of 21,990 SNPs occurring in comparison of the modern human and Neanderthal promoteromes (the collection of all human/Neanderthal proximal promoters of protein-coding genes), a total of 108 TF models, representing 110 unique TF proteins, showed a significant difference in scoring between the two human species (Table 1). Compared to all protein-coding genes, the complete set of TFs cataloged in the JASPAR database is very strongly enriched for homeobox genes (Fisher’s exact test: odds ratio [OR] 85.16; p-value 5.79 × 10^-177) and moderately enriched for brain, eye, and retina genes (OR 1.80; p-value 3.50 × 10^-11). Likewise, compared to the complete set of JASPAR TFs, homeobox genes are strongly over-represented among the differentially binding (DB) TFs (78/110; OR 12.40; p-value 2.96 × 10^-17). Similarly, genes with brain-specific expression (79/110) are enriched in DB TFs compared to the JASPAR set (OR 2.86; p-value 3.48 × 10^-6), while JASPAR TFs themselves are enriched compared to all protein-coding genes (OR 1.80; p-value 3.50 × 10^-11). Immune-specific genes (29/110) were under-represented in DB TFs compared to the JASPAR set (Fisher’s exact test: OR 0.37; p-value 1.44 × 10^-5).
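For readers unfamiliar with the enrichment statistics reported above, the following is a minimal sketch of an odds ratio and a one-sided (over-representation) Fisher's exact test on a 2x2 table. Whether the reported p-values are one- or two-sided is not stated in this section, so the one-sided form is an assumption. The foreground counts (78 homeobox genes among 110 DB TFs) come from the text; the comparison-row counts are hypothetical placeholders, so the output will not reproduce the reported values.

```python
from math import comb


def odds_ratio(a, b, c, d):
    """Sample odds ratio (a*d)/(b*c) for the 2x2 table [[a, b], [c, d]]."""
    if b * c == 0:
        return float("inf")
    return (a * d) / (b * c)


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact p-value for over-representation.

    Table layout:
                      in category    not in category
        foreground         a                b
        comparison         c                d

    Returns the hypergeometric probability of observing `a` or more
    foreground members in the category, with all margins held fixed.
    """
    row1 = a + b
    col1 = a + c
    n = a + b + c + d
    upper = min(row1, col1)
    favourable = sum(
        comb(col1, k) * comb(n - col1, row1 - k) for k in range(a, upper + 1)
    )
    return favourable / comb(n, row1)


if __name__ == "__main__":
    # Foreground row from the text: 78 of 110 DB TFs are homeobox genes.
    db_homeobox, db_other = 78, 110 - 78
    # Comparison row: HYPOTHETICAL placeholder counts, for illustration only.
    rest_homeobox, rest_other = 100, 400
    table = (db_homeobox, db_other, rest_homeobox, rest_other)
    print("odds ratio:", odds_ratio(*table))
    print("p-value:", fisher_exact_greater(*table))
```

Exact integer arithmetic via `math.comb` avoids the underflow that a naive floating-point product would hit for very small p-values.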
DB TF target genes
A total of 74 genes were identified, hereafter referred to as ‘DB TF target genes’, whose promoters contain SNPs where differential binding occurs by our determined set of statistically significant DB TFs (Table 2). The target gene with the greatest number of DB events (143) was FARS2, whose promoter contains 2 SNPs over which 105 (97.22%) of the DB TF binding models showed differential binding (Table 2). The remaining top ten DB TF target genes with the next highest number of DB occurrences were: TEX19, NARF, SLC6A11, LSM5, FAM172A, WNT7A, EXO1, KRT8, and ZNF37A. Of these, all but LSM5 have been identified with brain (FARS2, SLC6A11, FAM172A, WNT7A, and ZNF37A) or reproductive (FARS2, TEX19, NARF, WNT7A, EXO1, KRT8) tissue- or cell-specific expression in the Protein Atlas database (Supplementary Table 1). Among all 74 DB TF target genes, 30 have reproductive specific expression, 28 have brain specific expression, 18 have blood cell specific expression, and 16 have mitochondria specific expression or localization.
DB TFs are highly expressed in immune cells
Data derived from the bulk RNA-Seq experiments of the FANTOM5 project were used to generate a cluster map of DB TF expression within the 100 tissues with the highest aggregate expression, presented as Figure 1A. Relative aggregate expression for individual tissues is presented as bars in Figure 1B and shows that, within these 100 tissues, those with the highest expression are immune cell types/tissues. The immune cell types with highest aggregate DB TF expression are mast cells, eosinophils, CD4+ T cells, CD8+ T cells, basophils, CD14+/CD16- monocytes, and natural killer cells (Supplementary Figure 1); and of the 20 cell types/tissues with the highest aggregate expression, 16 are immune.
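The tissue ranking described above can be sketched as below, assuming that "aggregate expression" means the sum of expression values across all DB TFs in a tissue and that "relative" means scaled to the highest-ranked tissue. Both are assumptions for illustration, and the expression values in the example are hypothetical.

```python
def aggregate_expression(expression_by_tissue):
    """Sum expression over all genes for each tissue.

    expression_by_tissue maps tissue name -> {gene name: expression value}.
    """
    return {
        tissue: sum(gene_values.values())
        for tissue, gene_values in expression_by_tissue.items()
    }


def top_tissues(expression_by_tissue, top_n=100):
    """Return (tissue, aggregate, relative aggregate) for the top_n tissues."""
    totals = aggregate_expression(expression_by_tissue)
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)[:top_n]
    if not ranked:
        return []
    highest = ranked[0][1]
    return [
        (tissue, total, total / highest if highest else 0.0)
        for tissue, total in ranked
    ]


if __name__ == "__main__":
    # Hypothetical expression values, for illustration only.
    example = {
        "mast cell": {"FOS": 5.0, "JUN": 3.0},
        "retina": {"FOS": 1.0, "JUN": 1.0},
        "liver": {"FOS": 0.5, "JUN": 0.5},
    }
    for row in top_tissues(example, top_n=2):
        print(row)
```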
DB TFs coexpress in neural and immune tissues
Expression data for DB TFs in the FANTOM5 dataset were extracted for cluster analysis (Figure 1A). Tissues of several types cluster together by their expression of DB TFs, with the largest clusters being neural (30 brain and eye/retinal tissues) and immune (28 immune cells/tissues) (Figure 1B). Likewise, the DB TFs have been clustered based on their tissue expression in the FANTOM5 dataset. Data on tissue group specificity (brain and retina; male/female) from the Human Protein Atlas (bulk and single-cell RNA-Seq), status as a development-related gene (homeobox, forkhead box, and SRY-related HMG-box), and above 90th quantile expression in immune cell types as defined by the Database of Immune Cell eQTLs (DICE) were analyzed and included as Figure 1C. These data show two distinct top-level clusters: the larger cluster contains 89 genes, the majority of which are known to have a role in development (78/89) and specific expression in a brain or retina tissue (72/89); while the smaller cluster of 17 genes all show specific expression in blood and immune cells.
Within the larger cluster of 89 developmental and brain-specific genes there is a subcluster of 12 genes whose expression is most limited to FANTOM5 neural tissues: NKX6-2, CUX2, NR2E1, POU3F2, POU6F2, EMX1, VAX1, POU3F3, DLX1, DLX2, OTX1, POU3F1. All of these genes are identified in the Protein Atlas as being enriched or enhanced in the brain or retina. GO analysis of the 12 genes with the PANTHER-based geneontology.org web server [51, 52] shows the top 10 enriched biological process terms: ‘cerebral cortex GABAergic interneuron fate commitment’ (GO:0021893; 100x enrichment), ‘positive regulation of amacrine cell differentiation’ (GO:1902871; 100x), ‘forebrain ventricular zone progenitor cell division’ (GO:0021869; 100x), ‘negative regulation of photoreceptor cell differentiation’ (GO:0046533; 100x), ‘regulation of photoreceptor cell differentiation’ (GO:0046532; 100x), ‘regulation of amacrine cell differentiation’ (GO:1902869; 100x), ‘neuroblast differentiation’ (GO:0014016; 100x), ‘negative regulation of oligodendrocyte differentiation’ (GO:0048715; 100x), ‘cerebral cortex GABAergic interneuron differentiation’ (GO:0021892; 100x), and ‘negative regulation of glial cell differentiation’ (GO:0045686; 100x).
Additionally, within the brain tissues cluster we observe sub-clusters of fetal and newborn tissues (Figure 1D). Overall, DB TFs show the highest expression in medial frontal gyrus (newborn), medial temporal gyrus (adult), parietal lobe (fetal), occipital lobe (fetal), eye (fetal), pineal gland (adult), temporal lobe (fetal), retina (adult), occipital cortex (newborn), parietal lobe (newborn), dura mater (adult), medial temporal gyrus (newborn), and spinal cord (fetal) (Supplementary Figure 1). In a similar fashion, the cluster with high expression in blood and immune cells/tissues comprises 17 genes: FOS, RORA, FOXK1, FOXK2, IRF3, NFYA, TBP, MAFG, POU2F1, ARID5A, RARA, JUN, CUX1, FOXP1, MEF2C, MAFB, and RXRA. All of these genes have expression ≥90th quantile of all genes in at least one immune cell type as cataloged in the Database of Immune Cell eQTLs (DICE) [53]. GO analysis of the 17 genes (geneontology.org web server) shows enrichment for immune-related biological process terms: ‘CD4-positive, alpha-beta T cell differentiation involved in immune response’ (GO:0002294; 76x enrichment), ‘T-helper cell differentiation’ (GO:0042093; 76x), and ‘alpha-beta T cell differentiation involved in immune response’ (GO:0002293; 73x), among others.
DB TFs have developmental time-point related expression profiles in brain
Cluster analysis of Allen Brain Atlas bulk RNA-Seq expression data for DB TF genes revealed distinct clusters of time-point specific expression (Figure 2B). Specifically, we observe clusters of expression for brain tissues at different grouped time points: pcw (8 to 37 weeks post conception), early (4 months to 4 years), and late (8 years to 40 years). Nearly all of the pre-birth (pcw) brain tissues cluster together, without admixture with the other grouped timepoints. The exception is a small cluster where ‘cerebellar cortex’ from all three groups is joined with ‘pcw cerebellum’, ‘pcw upper (rostral) rhombic lip’, and ‘mediodorsal nucleus of thalamus’ of all three groups. All 15 pre-birth cortical brain tissues form a cluster. Similarly, a large cluster contains all of the other early and late tissues which primarily segregate into two age-related subclusters. Those genes comprising the cluster with highest aggregate expression in the data set (module 1 genes: NFYA, TCF3, MEF2C, SOX10, CUX2, TBP, FOXP1, IRF3, SOX15, RARA, FOXK2, NKX6-2, FOXK1, FOS, JUN, POU6F1, POU3F2, MAFG, POU3F3, MAFB, and CUX1) show a dynamic and cyclic pattern of expression during the pre-birth timepoints which then stabilizes in infancy and trends downwards in early childhood (Figure 2A).
Single-nucleus RNA-Seq of cortical cells reveals differential expression of DB TFs
Analysis of Allen Brain Atlas single-nucleus RNA-Seq data from 49,417 cortical brain cells [55] was performed to identify which of the modern human (MH)-Neanderthal DB TF genes may play a functional role in specific annotated brain regions, layers, and cell types. Analyses in the original paper have defined metadata indicating cell class (excitatory, inhibitory, and non-neuronal), layer (L1–L6), and predicted type (e.g., astrocyte, microglia, endothelial, etc.), as well as brain regions (Figure 3A). Cell groups defined by the intersection of class, layer, and region were analyzed for differentially expressed genes (DEGs). Several DB TFs were DEGs in specific categories, occurring in nearly all cases in glutamatergic neurons, usually in layers L4–L6, and across nearly all brain regions. Categories where four or more DB TFs were DEGs are presented in Table 3, and included cut-like homeobox 1 (CUX1), cut-like homeobox 2 (CUX2), estrogen-related receptor gamma (ESRRG), forkhead box protein P1 (FOXP1), forkhead box protein P2 (FOXP2), myocyte enhancer factor 2C (MEF2C), POU class 6 homeobox 2 (POU6F2), paired related homeobox 1 (PRRX1), and RAR-related orphan receptor alpha (RORA). In all categories POU6F2 was a DEG, and in all but two either FOXP2 or FOXP1 (or both) were DEGs. FOXP2 was present as a DEG in excitatory L4 cells in A1C (primary auditory cortex), MTG (medial temporal gyrus), S1lm (lower limb somatosensory cortex), and S1ul (upper limb somatosensory cortex).
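The grouping and filtering steps described above can be sketched as follows. The differential expression test itself is not shown; the sketch assumes a DEG list is already available for each class/layer/region group. The record fields, function names, and example DEG lists are hypothetical; only the "four or more DB TFs" threshold comes from the text.

```python
from collections import defaultdict


def group_cells(cells):
    """Group cell identifiers by the (class, layer, region) intersection."""
    groups = defaultdict(list)
    for cell in cells:
        key = (cell["class"], cell["layer"], cell["region"])
        groups[key].append(cell["id"])
    return dict(groups)


def categories_with_db_tf_degs(degs_by_group, db_tfs, min_count=4):
    """Keep groups in which at least min_count DB TFs are DEGs."""
    db_tf_set = set(db_tfs)
    selected = {}
    for group, degs in degs_by_group.items():
        shared = sorted(db_tf_set.intersection(degs))
        if len(shared) >= min_count:
            selected[group] = shared
    return selected


if __name__ == "__main__":
    # Hypothetical cell records and DEG lists, for illustration only.
    cells = [
        {"id": "cell_1", "class": "excitatory", "layer": "L4", "region": "MTG"},
        {"id": "cell_2", "class": "excitatory", "layer": "L4", "region": "MTG"},
        {"id": "cell_3", "class": "inhibitory", "layer": "L2", "region": "MTG"},
    ]
    print(group_cells(cells))
    degs_by_group = {
        ("excitatory", "L4", "MTG"): ["POU6F2", "FOXP2", "CUX2", "RORA", "GENE_X"],
        ("inhibitory", "L2", "MTG"): ["POU6F2", "GENE_Y"],
    }
    db_tfs = ["CUX1", "CUX2", "ESRRG", "FOXP1", "FOXP2", "MEF2C", "POU6F2", "PRRX1", "RORA"]
    print(categories_with_db_tf_degs(degs_by_group, db_tfs))
```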
Gene ontology analysis of L4 excitatory neurons
In the Allen Brain Atlas cortical cell snRNA-Seq data, multiple DB TFs were consistently DEGs in both layer 4 and excitatory cell groups, and in their intersection. Gene ontology analysis of the top 100 marker genes of L4 excitatory cells using the DisGeNet ontology within the Enrichr tool [56] showed the top over-enriched terms ‘autism spectrum disorders’, ‘autistic disorder’, ‘narcolepsy’, ‘neurodevelopmental disorders’, ‘intelligence’, ‘epilepsy’, ‘schizophrenia’, ‘refractive errors’, ‘cognition’, and ‘apraxia, developmental verbal’ (Figure 3D). The top 10 over-enriched biological process ontology terms for L4 excitatory cells were: ‘corticospinal neuron axon guidance’ (GO:0021966; >100x enrichment), ‘corticospinal tract morphogenesis’ (GO:0021957; 91x), ‘ventricular cardiac muscle cell differentiation’ (GO:0055012; 42x), ‘synaptic membrane adhesion’ (GO:0099560; 35x), ‘central nervous system projection neuron axonogenesis’ (GO:0021952; 30x), ‘synaptic transmission, GABAergic’ (GO:0051932; 29x), ‘positive regulation of excitatory postsynaptic potential’ (GO:2000463; 28x), ‘negative regulation of smooth muscle cell migration’ (GO:0014912; 27x), ‘positive regulation of dendrite morphogenesis’ (GO:0050775; 25x), and ‘modulation of excitatory postsynaptic potential’ (GO:0098815; 25x).
In GO analysis of cellular component, the top 10 over enriched terms for L4 excitatory cells were: ‘anchored component of presynaptic membrane’ (GO:0099026; 61x enrichment), ‘presynaptic cytosol’ (GO:0099523; 47x), ‘NMDA selective glutamate receptor complex’ (GO:0017146; 47x), ‘intrinsic component of presynaptic active zone membrane’ (GO:0098945; 45x), ‘node of Ranvier’ (GO:0033268; 42x), ‘integral component of presynaptic active zone membrane’ (GO:0099059; 40x), ‘GABA-A receptor complex’ (GO:1902711; 34x), ‘GABA receptor complex’ (GO:1902710; 30x), ‘presynaptic active zone membrane’ (GO:0048787; 27x), and ‘intrinsic component of presynaptic membrane’ (GO:0098889; 22x).
In GO analysis of molecular function, the top 10 over enriched terms for L4 excitatory cells were: ‘transmembrane receptor protein tyrosine phosphatase activity’ (GO:0005001; 50x enrichment), ‘transmembrane receptor protein phosphatase activity’ (GO:0019198; 50x), ‘GABA-gated chloride ion channel activity’ (GO:0022851; 49x), ‘inhibitory extracellular ligand-gated ion channel activity’ (GO:0005237; 42x), ‘ligand-gated anion channel activity’ (GO:0099095; 35x), ‘GABA-A receptor activity’ (GO:0004890; 34x), ‘syntaxin-1 binding’ (GO:0017075; 32x), ‘GABA receptor activity’ (GO:0016917; 29x), ‘transmitter-gated channel activity’ (GO:0022835; 17x), and ‘transmitter-gated ion channel activity’ (GO:0022824; 17x).
DB TFs and Human Accelerated Regions
Doan et al. have identified and cataloged a list of genes in close proximity to human accelerated regions (HARs; elevated divergence in humans vs. other primates) as well as those which have direct interaction with a HAR through chromatin conformation changes, e.g., enhancers [57]. In the set of 110 DB TF genes, 28 were found to be HAR-associated. Fisher’s exact test revealed that HAR-association was not significantly over-represented in DB TFs versus all JASPAR TFs (OR 0.98; p-value 0.57). However, compared to all protein-coding genes, the JASPAR TFs as a group are enriched for HAR-association (OR 3.27; p-value 6.73 × 10^-26).
TFBSFootprinter availability
The TFBSFootprinter tool is available for installation via Conda, which makes it usable on most operating systems (https://anaconda.org/thirtysix/tfbs-footprinting3). It is also available as a Python library (https://pypi.org/project/TFBS-footprinting3/) and can be installed on a Linux system using the single command ‘pip install TFBS-footprinting3’. Due to size considerations, supporting experimental data for both human and non-human species is downloaded on demand on first usage. Documentation on background, usage, and options is available both within the program and more extensively online (tfbs-footprinting.readthedocs.io).
Experimental Datasets
Experimental data from a total of six databases were incorporated into the TFBSFootprinter algorithm (Figure 4A). Data from the relevant datasets were pre-processed to generate score distributions with which putative TFBS predictions could later be compared, as described in the Methods. Each dataset allows for scoring of transcription-relevant markers in or near putative regulatory elements identified by PWM analysis: co-localization with ChIP-Seq metacluster, cap analysis of gene expression (CAGE) peak, ATAC-Seq peak, or CpG island; correlation of expression between predicted TF and gene of interest; co-localization of eQTL and effect on expression of target gene; measure of conservation in related vertebrate species (Figure 4C). The simplicity of this piece-wise approach allows for easy inclusion of additional TFBS relevant data in the future.
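One way to picture this piece-wise design is sketched below. It assumes, purely for illustration, that each feature's raw value is converted to an empirical percentile against its pre-computed score distribution and that the per-feature values are then summed. The actual scoring used by TFBSFootprinter is described in the Methods and may differ; the feature names and distributions here are hypothetical.

```python
from bisect import bisect_right


def empirical_percentile(sorted_background, value):
    """Fraction of pre-computed background scores that are <= value."""
    if not sorted_background:
        raise ValueError("background distribution is empty")
    return bisect_right(sorted_background, value) / len(sorted_background)


def combined_score(feature_values, background_distributions):
    """Sum per-feature percentiles for one putative TFBS.

    Features without a pre-computed distribution are skipped, so a new
    dataset can be added by supplying one more background distribution.
    """
    total = 0.0
    for feature, value in feature_values.items():
        background = background_distributions.get(feature)
        if background is None:
            continue
        total += empirical_percentile(background, value)
    return total


if __name__ == "__main__":
    # Hypothetical pre-computed distributions (sorted ascending).
    backgrounds = {
        "pwm": [1.0, 2.0, 3.0, 4.0],
        "conservation": [0.1, 0.2, 0.6, 0.9],
    }
    # Hypothetical raw feature values for one predicted site.
    site = {"pwm": 3.5, "conservation": 0.7, "unlisted_feature": 42.0}
    print(combined_score(site, backgrounds))
```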
Inclusion of empirical datasets improves TFBSFootprinter accuracy
The performance of both individual datasets and combinations of datasets, in the identification of experimentally verified functional TFBSs and TFBS ChIP-Seq peaks, was tested by receiver operating characteristic (ROC) analysis (Figure 5, Supplementary Figure 2, Supplementary Figure 3, Supplementary Figure 4, Figure 6, Supplementary Figure 5). Four different benchmarking approaches were used. Depending on the benchmarking approach used, different combinations of transcription-relevant features produced the best ROC scores. We observed that when using all available features TFBSFootprinter consistently outperformed the PWM, and that for the great majority of TFs tested the best TFBSFootprinter model outperformed the best DeepBind model (benchmark 1, 60/65; benchmark 2, 28/40; benchmark 3, 13/14; and benchmark 4, 11/14). In the majority of cases a subset of the available transcription-relevant data produced the optimal model (Supplementary Figure 2). In addition, the features which were most frequently observed as components of the best models (Figure 5A, Figure 6A, Supplementary Figure 3A, Supplementary Figure 4A) varied by benchmark.
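As a reminder of what a ROC score summarizes, the sketch below computes the area under the ROC curve (AUC) from labeled prediction scores using the rank-sum (Mann-Whitney U) formulation, with tied scores receiving their average rank. It is a generic illustration of the metric, not the benchmarking code used in this study, and the example labels and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the rank-sum formulation.

    labels: iterable of 0/1 (1 = true functional site).
    scores: iterable of prediction scores, higher = more confident.
    """
    pairs = sorted(zip(scores, labels), key=lambda pair: pair[0])
    n_pos = sum(1 for _, label in pairs if label)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")

    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        # Find the run of tied scores starting at index i.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2.0  # mean of 1-based ranks i+1 .. j
        positives_in_run = sum(1 for k in range(i, j) if pairs[k][1])
        rank_sum_pos += average_rank * positives_in_run
        i = j

    u_statistic = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u_statistic / (n_pos * n_neg)


if __name__ == "__main__":
    # Hypothetical labels and scores, for illustration only.
    labels = [0, 0, 1, 1]
    scores = [0.1, 0.4, 0.35, 0.8]
    print(roc_auc(labels, scores))
```

Comparing two models on the same TF then reduces to comparing their AUC values on the same labeled sites.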
Discussion
In this study we have introduced and tested a new method for prediction of transcription factor binding sites. The approach leverages nine separate types of transcription-relevant data to augment prediction of functional TFBSs beyond the classical PWM. As a proof of use, we have used the tool to analyze differences in regulatory regions observed in comparison of modern human and Neanderthal DNA. The results align with and complement previous studies on Neanderthal DNA, and with studies on brain development and evolution. While the current study focuses on human DNA, it is important to note that the TFBSFootprinter tool is built to also make predictions in many other species, based on its integration of Ensembl data.
Regulatory changes drive evolution
Current knowledge indicates that, across the tree of life, major radiations of new species are preceded and accompanied by a commensurate increase in the number of regulatory genes. Comparative genomic analyses suggest that at the time of eukaryogenesis, or the origin of the last eukaryotic common ancestor (LECA), a significant increase in novel TF classes occurred [58]. Eukaryogenesis is one of the major transitions of life on Earth, with an explosion of diversity owing to the endosymbiotic synthesis of an energy-producing α-proteobacterial mitochondrion-progenitor within an archaeon [59]. If the new in-house energy source could be described as the engine driving the diversity arising from LECA, the complementary argument could be made that it was TF proteins which did the steering. Later, prior to the colonization of land by plants, there was an increase in the TF families of the ancestral aquatic streptophytes, which were then already present in the first land plants [60]. Again, prior to the radiation of bilaterian (multicellular) metazoans an increase in transcription factor families occurred [61], roughly quadrupling in ratio [62]. Arguably, these increases in TF family numbers immediately prior to major transitions in life could be pointed to as evidence of the critical role of TFs in the rise and adaptation of eukaryotes and the complexity arising within their successors.
Far more recently, a transposon-mediated shift in regulatory signaling in mammalian pregnancy has led to the endometrial stromal cell type [63] and rewiring of a stress response producing the decidual stromal cell type [64]. In humans, the development of our cognitive skills is possibly due to a delay of synaptic-function gene expression and corresponding synaptogenesis in the prefrontal cortex, owing to transcription factors such as myocyte enhancer factor 2A (MEF2A) [65–68]. At the same time, evidence is accumulating that maturation occurred earlier in the closely related Neanderthal, as it does in chimpanzee [69], which is supported by samples of jaw [70], tooth [71], and, most importantly, cranium [72]. Furthermore, a globular brain structure was not present in modern humans at the time of divergence from the Neanderthal and Denisovan lineages, but is a unique product of modern human development within the last 35,000–100,000 years [73], making it particularly attractive for further analyses.
Modern human brain regulatory changes
The TF genes we have identified as differentially binding show a strong enrichment for homeobox genes, the primary regulators of development in humans and other multicellular organisms, compared to the complete set of JASPAR TFs, which are themselves very strongly enriched compared to all protein coding genes. A further six DB TFs are forkhead box genes, also known drivers of development. Similarly, the DB TFs are enriched for brain-specific expression compared to JASPAR TFs, which are again enriched compared to all human genes. Hierarchical clustering of TF expression revealed that DB TF genes associated with neural and immune function formed distinct groups. Within the tissue axis, in the neural cluster, adult and fetal/newborn neural tissues cluster separately in several instances (Figure 1). Given these findings in a broader set of data, it became important to focus more closely on the expression of these DB TFs in developing neural tissues specifically. Analysis of RNA-Seq data from whole brain subtissues further revealed differences in expression of these genes in an age-dependent, and likely development-dependent, manner. Specifically, expression of DB TFs in all tissues for the two groups spanning 8 weeks post conception to 4 years post birth was largely segregated from all tissues for the group defined by 8 years to 40 years of age (Figure 2). Additionally, several modules of DB TFs identified by cluster analysis appear to show a cyclic pattern of expression during fetal development which then stabilizes post birth (Figure 2A).
DB TFs co-occur as DEGs in single-cell RNA-Seq of glutamatergic cells of L4 human cortex
Recent proliferation of single-cell transcriptomics has begun to clarify the direct role of transcription factors in cell type determination and identity (Arendt et al. 2019). To determine in which cell types DB TFs are expressed in adult brain, single-nucleus data for 49,417 cortical cells were examined. A subset of the DB TFs were annotated as commonly co-occurring DEGs: CUX1, CUX2, ESRRG, FOXP1, FOXP2, MEF2C, POU6F2, PRRX1, and RORA. Interestingly, we observed that these DB TF DEGs were primarily found co-occurring in glutamatergic cells of the L4 cortex. Within this set of nine genes, six (CUX1, CUX2, FOXP1, FOXP2, MEF2C, and RORA) are cataloged in the SFARI Gene knowledgebase [74] (sfari.org; vers. 3.0) as autism spectrum disorder (ASD) candidate genes. Likewise, six (CUX1, FOXP1, FOXP2, MEF2C, POU6F2, and RORA) are present in the genome-wide association studies (GWAS) catalog [75] (www.ebi.ac.uk/gwas) for the ‘schizophrenia’ (MONDO_0005090) term. Supporting this connection, a recent large-scale study on schizophrenia genetics in 74,776 individuals with schizophrenia (and 101,023 control individuals) has shown enrichment for genes which are highly expressed in glutamatergic neurons of both human and mouse cerebral cortex [76]. Similarly, a recent large-scale study on ASD genetics in 11,986 individuals with ASD (and 23,598 control individuals) identified 102 ASD risk genes, and showed that fetal-development excitatory neurons expressed the greatest number of these among all cell types identified [77].
Co-occurring DB TF DEGs associate with HARs and neuropsychiatric disease
Human-accelerated regions (HARs) are locations in the genome where the rate of evolutionary change has accelerated since divergence with chimpanzee. A study of 2,737 HARs has shown that they are enriched for TFBSs generally, and for TFs associated with neural development specifically (Doan et al. 2016). While the DB TFs were not enriched for HAR association compared to JASPAR TFs (OR 0.98; p-value 0.57), the JASPAR TFs are enriched compared to all genes (OR 3.27; p-value 6.73 × 10^-26). Using the HARs cataloged by [57] we identified 28 of the DB TFs which are either in close proximity to a HAR or directly interact with one through chromatin conformation changes. Additionally, all but one (CUX2) of the co-occurring DB TF DEGs (CUX1, ESRRG, FOXP1, FOXP2, MEF2C, POU6F2, PRRX1, and RORA) identified in the L4 excitatory neurons of cerebral cortex (Allen Brain Atlas scRNA-Seq data) contain HARs or show HAR interactivity. Several of these contain or associate with HAR variants that have been noted to cause neurological disease. POU6F2 possesses an intronic HAR where a mutant allele (GRCh38:chr7:39,033,595) is associated with ASD (Doan et al. 2016). The promoter of the CUX1 gene has been determined by ChIA-PET analysis of chromatin to interact with a HAR, located ∼200 kb away; unrelated individuals with intellectual disability (ID) (IQ<40) and ASD have been identified with a homozygous mutation (GRCh38:chr7:101,606,361) in this HAR. Luciferase reporter assays indicate that when the mutant version of this HAR interacts with the promoter of the CUX1 gene, its expression is increased three-fold, while cultured differentiated neurons with enhanced CUX1 expression produced an increased synaptic spine density. Similarly, another HAR (GRCh38:chr5:88,480,873) has been shown to interact with the promoter of MEF2C, and a mutation in this HAR creates a putative MEF2A binding site, reducing expression by ∼50%.
Mutations in the MEF2C gene are associated with autism [78, 79], mental retardation [78–80], schizophrenia [81, 82], epilepsy [78, 79], and speech abnormalities [78]. Additionally, binding motifs for several of the DB TFs identified in our study were found to be enriched in HAR regions; POU6F1 and POU2F1 in ultra-conserved HARs, and HNF1A in all HARs [57]. After these intriguing results, further direct analysis of HARs using the TFBSFootprinter tool is merited.
Importantly, one of the identified DB TFs in our data is the FOXP2 TF gene, which is the first noted “speech gene” [83]. It has been shown that for the early postnatal period of mouse, FOXP2 is a negative regulator of MEF2C, and that the likely result is promotion of synaptogenesis of cortical striatum [84]. In addition to this interaction with MEF2C, there is a POU3F2 (which our results show is also a DB TF) binding site within the FOXP2 gene which has been shown to affect FOXP2 expression and which is associated with a selective sweep occurring since the divergence of humans and Neanderthal [85, 86]. Because of its significantly documented role in human evolution, further analysis is warranted regarding regulation of FOXP2 in other relevant areas (e.g., introns, 3’ UTR, enhancers, etc.) and for genome wide binding sites of FOXP2 itself.
DB TF target genes are expressed in oligodendrocytes, testis/sperm, maternal-fetal interface, and mitochondria
There were 74 genes identified as DB TF target genes. Of these, all but 8 have specific expression in reproductive (30), brain and retina (28), immune (18), or mitochondria (16) cells or tissues. Within the 30 DB TF target genes with reproductive cell- and tissue-specific expression, the majority (19) are expressed in male tissues (testis, early and late spermatids, spermatocytes, spermatogonia). A total of 12 were expressed in female-specific cells and tissues, and in many cases specifically those involved in the maternal-fetal interface (cervix, granulosa cells, cytotrophoblasts, extravillous trophoblasts, syncytiotrophoblasts, and endometrial ciliated cells). Within the 28 DB TF target genes with brain- and retina-specific expression, a total of 16 have expression in oligodendrocytes, 12 in inhibitory neurons, and 11 in excitatory neurons.
The DB TF target gene with the greatest number of differential binding events in its promoter region is FARS2, a member of the mitochondrial aminoacyl-tRNA synthetases (mt-aaRSs). scRNA-Seq data from the Protein Atlas indicates FARS2 has enhanced expression in excitatory neurons, oligodendrocyte precursor cells, inhibitory neurons, oligodendrocytes, astrocytes, microglial cells, late spermatids, and early spermatids. mt-aaRSs charge their cognate mt-tRNA with the appropriate amino acid, and mutations in these genes cause a variety of diseases but most predominantly those which affect the central nervous system [87], perhaps due to delayed myelination, demyelination, or both [88]. In addition to their canonical roles, the mt-aaRSs are hypothesized to have functions in monitoring of amino acid levels, as sensors for the mitochondrial environment, and transcriptional regulation [87] and through the addition of new protein domains have been associated with neural development and immune response, among others, as reviewed by [89]. The FARS2 gene produces the mt-PheRS protein (phenylalanyl-tRNA synthetase) which is mitochondria-locating and responsible for attaching phenylalanine to its corresponding mt-tRNA for mitochondrial protein translation [90]. Intragenic variants in the FARS2 gene have been linked to two primary clinical manifestations, early-onset epileptic mitochondrial encephalopathy and spastic paraplegia, and for patients in both groups symptoms can also include intellectual disability or developmental delay [90]. Deletions within FARS2 and reduced expression levels have also been associated with schizophrenia [91].
Importantly, FARS2 shares a bidirectional promoter with another mitochondrial gene, LYRM4 (LYR Motif-Containing Protein 4), which has a TSS just ∼400 bp away. Together with ISCU, NFS1, and FXN, the ISD11 protein of the LYRM4 gene is involved in formation of iron–sulfur (Fe–S) clusters [92, 93], essential cofactors in many basic biological processes such as formation of respiratory chain complexes I, II, III [94] and subsequently oxidative phosphorylation [95]. Mutations in LYRM4 have been noted to cause deficits in oxidative phosphorylation reactions [96], which are critical in neuronal development and schizophrenia as reviewed in [97]. Supporting this connection, polymorphisms occurring in the FARS2-LYRM4 bidirectional promoter region have been shown to be associated with cognitive deficit and schizophrenia [98]. Likewise, a 900 kb microdeletion in this region (encompassing RPP40, PPP1R3G, LYRM4, and part of FARS2 and CDYL genes) has been shown to produce gyral pattern anomaly, intellectual disability and speech and language disorder [99].
A comparison of metabolite levels in pre-frontal cortex, visual cortex, cerebellum, kidney, and muscle between human and other primates (chimpanzee and macaque) showed ‘aminoacyl-tRNA biosynthesis’ as the most significantly enriched pathway for metabolites having higher levels in human, for all three brain regions but not muscle or kidney [100]. Likewise, the ‘Phenylalanine, tyrosine and tryptophan biosynthesis’ pathway was enriched for all three brain regions. This observation fits with the ongoing hypothesis that there exists a requirement for coevolution of nuclear and mitochondrial genes coding for mitochondrial proteins, known as mitonuclear compensatory coevolution, as reviewed in [101], and that this coevolution is a driver of speciation [102, 103]. In the mitonuclear compensatory coevolution hypothesis, nuclear-encoded genes coding for aminoacyl tRNA synthetases (like FARS2) and OXPHOS complex components (like LYRM4) are expected, and in some cases have been shown, to evolve more rapidly in response to high rates of evolutionary change in mitochondrially-encoded genes, with which their protein products subsequently directly interact [101]. Another consideration is how to regulate translation of OXPHOS complex components when the corresponding genes exist in both mitochondrial and nuclear genomes [104, 105]. The high number of observed DB TFBSs in the bidirectional promoter of the nuclear FARS2 and LYRM4 genes may thus allow better-regulated coordination of cross-compartment OXPHOS component gene expression, and may even contribute to the effects of both mitonuclear compensation and speciation.
Taken all together, the LYRM4-FARS2 locus is potentially of great interest in the divergence of modern human with other hominins and hominids. It is possible to imagine that differences in neuronal development and function, and even diet, between MH and Neanderthal could require corresponding differences in mitochondrial activity, metabolism, and amino-acid sensing.
Lots of work left to do
The divergence of modern human and Neanderthal is estimated to have occurred between 400,000 and 800,000 years ago [106–108]. Since then both subspecies have experienced unique evolutionary, social, and cultural paths, many of which overlapped and intertwined. What appears to be both divergent and unifying in the hominin lineage is significant change to the brain leading to abilities in cognition, social function, language and creativity. In the case of humans, and likely for Neanderthal as well, many of the genes which are at the core of our novel capabilities are the same ones which are commonly the root of our maladies. Autism and schizophrenia, among other neuropsychiatric disorders, appear to be maintained in humans at a greater rate than would be expected if they were strictly deleterious: the global prevalence of ASD is about 1% [109] and that of schizophrenia is 0.28% [110]. Likewise, as the study of these diseases continues, the direction of understanding tends towards a continuum of effects rather than a discrete on/off state. A study by Linscott et al. reports prevalence of psychotic experiences in the general population at 7.2% [111]. Our results have shown that DB TFs are enriched for development and brain and, as studies on HARs have shown, mutations in regulatory regions alone are sufficient to cause significant neuropsychiatric effects [57]. As a result, there appear to be significant opportunities to analyze gene regulation for understanding not only the evolution of human cognitive abilities but also the neuropsychiatric disorders which have accompanied it.
Limitations
We have not included analyses of 3’ UTRs, introns, or enhancers. The single most highly expressed transcript for each protein-coding gene was chosen for analysis, while MH vs. Neanderthal SNPs may have occurred closer to the TSSs of lower-expressed transcripts and thereby have higher and perhaps more significant scoring. Only transcripts for protein-coding genes were analyzed, despite the existence of numerous other transcribed sequence types. The SNP dataset we used for analysis is based on multiple Neanderthal individuals and has a lower sequencing coverage than newer datasets. The inclusion of multiple individuals is useful for comparing modern humans to Neanderthals as a group, but can only provide a more general comparison. Using a higher coverage dataset would allow greater assurance that SNPs under inquiry are legitimate. We chose to perform the TFBSFootprinter analysis using all transcription-relevant features available. In the future we plan to expand testing and assessment of empirical datasets and incorporate an option to use the combination of features which is proven best for each individual TF.
TFBSFootprinter
The TFBSFootprinter tool incorporates 7 different transcription-relevant empirical data features in the prediction of TFBSs. It can take as input any Ensembl transcript ID from any of 125 vertebrate species available in the Ensembl database. Starting with a list of Ensembl transcript IDs for a target species (e.g., Homo sapiens), TFBSFootprinter will retrieve from the Ensembl REST server a user-defined region of DNA sequence surrounding each transcription start site (TSS). The sequence is then scored using up to 575 JASPAR TFBS profiles, or a more limited set as defined by the user. User-defined p-values may be used to filter results; the corresponding score thresholds have been determined by scoring each JASPAR TFBS profile on the complete human genome. Each putative TFBS is then additionally scored based on proximity/overlap with TSS, TFBS metaclusters, open chromatin, and eQTLs which affect expression levels of the proximal gene, as well as conservation of sequence, correlation of expression with proximal (target) gene, and CpG content. Additional transcription-relevant data can be added easily in the future as is appropriate.
We believe that TFBSFootprinter provides an excellent way to predict TFBSs, thus easily supplementing current investigations into gene function or providing a means to perform larger-scale analyses of groups of related target genes. The ability to identify conserved binding sites in a large number of species particularly widens its applicability to researchers studying various vertebrates. After completion of analysis a publication-ready figure depicting the top scoring TFBS candidates is produced. Additionally, a number of tables (.csv) and JavaScript Object Notation (.json) files presenting various aspects of the results are output. Primary among these is a list of computational predictions in the target species which are supported by empirical data, sorted by a sum of the combined log likelihood scores (the combined affinity score). Importantly, scoring of non-human species becomes limited by the availability of external data for that species; at this time the only data available for non-human species are sequence conservation, CpG, and JASPAR motifs.
Methods
Identification of the canonical promoter
Analysis of differences in TF binding affinity between MH and Neanderthal promoters requires defining the promoter search space, and accordingly the target promoter boundaries. To this end, the complete set of ChIP-Seq data (15,982 experiments; 1,391 TFs; and 1,039 cell types and tissues) from the GTRD database (gtrd.biouml.org; version 20.06) was downloaded for review [112]. The single most highly expressed transcript from each Ensembl gene object was identified, a total of 40,490 transcripts, as described by transcript-level expression data from GTEx dataset (v8). Subsequently, all ChIP-Seq peaks with a fold-enrichment ≥10 (58,976,174 peaks) were mapped to any overlap of the region ±40,000 bp relative to the TSS of all most highly expressed transcripts (Figure 7). From this analysis the sequence ±2,500 bp relative to transcript TSS was defined as the region of interest for identifying promoter region MH-Neanderthal SNPs.
Analysis of modern human vs. Neanderthal genetic variation
The Neanderthal Genome Project has cataloged a total of 388,388 SNPs in comparing modern human vs. Neanderthal genomes [1]. These were further reduced to 21,990 SNPs in the proximal promoters (−2,500 to +2,500 nts relative to the TSS) of those transcripts which are defined by Ensembl as “protein-coding” (Figure 8C). Using the TFBSFootprinter tool, a 50bp region centered on each SNP was analyzed for binding of 575 TFs, for both the modern human version and the Neanderthal variant. TFBSFootprinter automatically retrieves the human sequences at a target region, and custom Python scripts were used to modify these sequences for the Neanderthal variant. All TFBS predictions which overlap the target SNP position were kept. The complete result set was then reduced based on the combined affinity score p-value using a Benjamini-Hochberg derived critical p-value corresponding to a false discovery rate cutoff of 0.01 to address multiple testing. For each putative TFBS meeting the cutoff in either subspecies, the corresponding matched pair of PWM scores was kept. To identify statistically different scoring for TFs between subspecies: for each TF, using the compiled matched scores, the Wilcoxon rank statistical test was performed using the SciPy stats Python library [113] and subsequent results filtered using a Benjamini-Hochberg derived critical p-value corresponding to a false discovery rate cutoff of 0.10 to address multiple testing.
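The Benjamini-Hochberg filtering used at both stages above can be sketched as follows. This is a minimal, pure-Python illustration of the standard step-up rule, not the code used in the study; the function name and the example p-values are hypothetical. The per-TF p-values themselves would come from a paired test such as `scipy.stats.wilcoxon` applied to the matched PWM scores, which is an assumption about how the cited SciPy test was invoked.

```python
def benjamini_hochberg_critical_p(p_values, fdr):
    """Return the BH critical p-value for a given false discovery rate.

    The critical value is the largest sorted p-value p_(k) satisfying
    p_(k) <= (k / m) * fdr, where m is the number of tests.
    Returns None when no p-value passes.
    """
    m = len(p_values)
    critical = None
    for rank, p in enumerate(sorted(p_values), start=1):
        if p <= rank / m * fdr:
            critical = p  # keep the largest passing p-value (step-up rule)
    return critical


# Hypothetical per-TF p-values, for illustration only.
example_p_values = [0.001, 0.008, 0.039, 0.041, 0.042,
                    0.065, 0.074, 0.205, 0.212, 0.216]

critical_p = benjamini_hochberg_critical_p(example_p_values, fdr=0.10)
significant = [p for p in example_p_values
               if critical_p is not None and p <= critical_p]
```

Note that every p-value at or below the critical value is retained, including ones (here 0.039 and 0.041) that individually exceed their own rank threshold; this is the expected behavior of the step-up procedure.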
A total of 108 TF models were identified as scoring differently across human vs. Neanderthal SNP locations. For each of these we extracted RNA-Seq data from the FANTOM data set (across all CAGE peaks associated with that TF) and kept the data for the 100 tissues with the highest aggregate expression across all of the target TF genes. In the case of hetero-dimer JASPAR TF models (e.g., FOS::JUN, NR1H3::RXRA, and POU5F1::SOX2) the expression of each TF component gene was used. Expression was extracted as TPM values and was normalized by log2 transformation. From the subsequent normalized expression data values, hierarchical clustering was performed and visualized using SciPy, Matplotlib, and Seaborn Python libraries [113–115].
The results of hierarchical clustering revealed a cluster of TF genes which showed unique expression in neural and immune tissues. These gene sets were used to perform PANTHER-based gene ontology enrichment analysis (using the www.geneontology.org web server) [51, 52], with the default statistical settings using the Fisher’s exact test method and an FDR threshold of 0.05.
Analysis of brain expression of differentially binding TFs
Expression data in the form of reads per kilobase per million reads (RPKM) were extracted for 26 unique tissues across 31 timepoints (from 8 weeks post-conception to 40 years after birth) from the Allen Brain Atlas (brainspan.org). RPKM values were then converted to TPM and log2 transformed. Ages were grouped into three phases of growth for simplicity of analysis and interpretation: pcw (8 to 37 weeks post conception), early (4 months to 4 years), and late (8 years to 40 years). Correspondingly, log-transformed TPM values were grouped and averaged and used to perform clustermap analysis to identify grouping tissues at time phases with similar expression profiles.
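The RPKM-to-TPM conversion and log transformation described above can be sketched as follows. The conversion rescales each sample so that its values sum to one million, which is the standard relationship between the two units. The pseudocount of 1 in the log2 step is an assumption made here to keep zero values finite; the text does not state whether one was used. Function names and the sample values are hypothetical.

```python
import math


def rpkm_to_tpm(rpkm_values):
    """Convert the RPKM values of one sample to TPM.

    TPM_i = RPKM_i / sum_j(RPKM_j) * 1e6, so each sample sums to one million.
    """
    total = sum(rpkm_values)
    if total == 0:
        return [0.0 for _ in rpkm_values]
    return [value / total * 1_000_000 for value in rpkm_values]


def log2_transform(tpm_values, pseudocount=1.0):
    """log2(TPM + pseudocount); the pseudocount is an assumption."""
    return [math.log2(value + pseudocount) for value in tpm_values]


# Hypothetical RPKM values for three genes in a single sample.
sample_rpkm = [10.0, 30.0, 60.0]
sample_tpm = rpkm_to_tpm(sample_rpkm)
sample_log_tpm = log2_transform(sample_tpm)
```

Because the conversion is a per-sample rescaling, relative expression between genes within a sample is unchanged; only comparability between samples is affected.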
Analysis of brain sample scRNA-Seq
The Allen Brain Atlas has performed single-nucleus RNA-Seq analysis of 49,417 nuclei derived from 8 brain cortex regions within the middle temporal gyrus (MTG), anterior cingulate gyrus (CgG), primary visual cortex (V1C), primary motor cortex (M1C), primary somatosensory cortex (S1C) and primary auditory cortex (A1C) [55]. These data were downloaded as a count matrix along with a table of the associated metadata (https://portal.brain-map.org/atlases-and-data/rnaseq/human-multiple-cortical-areas-smart-seq) and loaded into the Python library SCANPY [116]. Using a modified version of the workflow described previously in [117], samples were filtered by Gaussian fit of read count (300,000<x<3,500,000), expressed gene count (2,000<x), and number of cells in which a gene is expressed (>50), resulting in a final count of 46,959 cells and 42,185 genes for further analysis. Using SCANPY, counts were normalized by cell (‘pp.normalize_total’; target_sum=1,000,000) and log transformed (‘pp.log1p’), highly variable genes were identified (‘pp.highly_variable_genes’; flavor=‘seurat_v3’; n_top_genes=5,000; layer=‘counts’), principal component analysis was performed (‘pp.pca’; n_comps=15; svd_solver=‘arpack’), and k-nearest neighbors were computed (‘pp.neighbors’; n_neighbors=15). Expression relationships between cells were graphically visualized with the Python implementation [118] of the ForceAtlas2 [119] graph layout algorithm as called in SCANPY (‘tl.draw_graph’ and ‘pl.draw_graph’).
Annotation data regarding brain region, cortical layer, and GABAergic/glutamatergic/non-neuronal cell type features were extracted from Allen Brain Atlas sample data for mapping onto derived cell type clusters. The top 100 marker genes for each cell type+cortical layer (CT/CL) cluster (e.g., excitatory L4) were identified as those with higher expression unique to each cluster by Welch t-test in SCANPY (‘tl.rank_genes_groups’). Expression of those DB TF genes which were identified as marker genes in a CT/CL cluster was mapped onto cluster figures.
Gene ontology analysis of target cluster marker genes was performed using the Protein Analysis Through Evolutionary Relationships (PANTHER) tool at the geneontology.org web server [52]. Ontological term overabundance among cluster marker gene lists was established by Fisher’s exact test and results were filtered by FDR<0.05; analyses were performed for biological process, molecular function, and cellular component terms. Disease term gene ontology analysis was performed using Enrichr [56, 120] based on ontology compiled by DisGeNET [121].
TFBSFootprinter Methodology and Scoring
A computational pipeline was created in Python to allow for automated vertebrate promoter sequence retrieval from the Ensembl database (Ensembl version 94 was used in this analysis). The user-defined target sequence is then analyzed using 575 different transcription factor position weight matrices (PWMs), or a user-defined subset, derived from position frequency matrices (PFMs) taken from the JASPAR database. Each TFBS prediction results in a log-likelihood score indicating the likelihood of a particular TF binding the DNA at that location. After this initial step, seven additional gene transcription related features are assessed for each TFBS prediction, each of which generates its own log-likelihood score based on proximity to or overlap with that feature. The features which may be considered for each TFBS prediction are: vertebrate sequence conservation (GERP), proximity to CAGE peaks (FANTOM5), correlation of expression between target gene and TF predicted to bind promoter (FANTOM5), overlap with ChIP-Seq TF metaclusters (GTRD), overlap with ATAC-Seq peaks (ENCODE), eQTLs (GTEx), and observed/expected CpG ratio (Ensembl). A summation of these scores, for each putative TFBS, then equals a value which we describe as the ‘combined affinity score’. In this way the model’s parameters are significantly more empirically flexible and therefore robust, and ultimately generate a more complete picture of a binding site instead of just computational prediction of binding affinity to a static set of nucleotides.
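The summation that yields the combined affinity score can be illustrated with a short sketch. The feature keys, the example values, and the decision to include the PWM score in the sum and to treat unavailable features as contributing zero are all assumptions made for illustration; they are not taken from the TFBSFootprinter source code.

```python
# Hypothetical keys for the PWM score and the seven additional features.
FEATURE_NAMES = (
    "pwm",
    "conservation_gerp",
    "cage_distance",
    "cage_correlation",
    "gtrd_metaclusters",
    "atac_seq",
    "eqtl",
    "cpg",
)


def combined_affinity_score(scores):
    """Sum the per-feature log-likelihood scores of one putative TFBS.

    Features missing from `scores` (e.g., data not available for a
    non-human species) are assumed to contribute 0.
    """
    unknown = set(scores) - set(FEATURE_NAMES)
    if unknown:
        raise KeyError(f"unrecognized feature(s): {sorted(unknown)}")
    return sum(scores.get(name, 0.0) for name in FEATURE_NAMES)


# Illustrative values only; these are not results from the study.
human_site = {
    "pwm": 9.2, "conservation_gerp": 1.5, "cage_distance": 2.0,
    "cage_correlation": 0.8, "gtrd_metaclusters": 3.1,
    "atac_seq": 1.2, "eqtl": 0.0, "cpg": 0.4,
}
non_human_site = {"pwm": 9.2, "conservation_gerp": 1.5, "cpg": 0.4}

human_score = combined_affinity_score(human_site)
non_human_score = combined_affinity_score(non_human_site)
```

The second example mirrors the limited analysis available for non-human species, where only the motif, conservation, and CpG features can be scored.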
Ensembl sequence retrieval
The Ensembl Representational State Transfer (REST) server application programming interface (API) (Yates et al. 2014) is used by TFBSFootprinter for automated retrieval of user-defined DNA sequence near the transcription start site of an established Ensembl transcript ID. Annotations for the transcript, and Ensembl-defined regulatory regions (e.g., ‘promoter flanking region’) are also retrieved and mapped in the final output figure.
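As a rough sketch of this kind of retrieval, the snippet below builds a strand-aware promoter region string and the corresponding URL for the public Ensembl REST `sequence/region` endpoint. The helper names, the example coordinates, and the choice of this endpoint are assumptions for illustration; TFBSFootprinter's internal requests may be organized differently. The fetch function performs a network call and is defined but not invoked here.

```python
import urllib.request

ENSEMBL_REST = "https://rest.ensembl.org"


def promoter_region(chromosome, tss, strand, upstream, downstream):
    """Build an Ensembl region string around a TSS.

    'Upstream' is relative to the direction of transcription, so the
    window is mirrored for transcripts on the reverse strand (-1).
    """
    if strand == 1:
        start, end = tss - upstream, tss + downstream
    else:
        start, end = tss - downstream, tss + upstream
    start = max(1, start)  # coordinates are 1-based
    return f"{chromosome}:{start}..{end}:{strand}"


def sequence_url(species, region):
    """URL for the Ensembl REST sequence/region endpoint."""
    return f"{ENSEMBL_REST}/sequence/region/{species}/{region}"


def fetch_sequence(url, timeout=30):
    """Fetch plain-text sequence from the REST server (network required)."""
    request = urllib.request.Request(url, headers={"Content-Type": "text/plain"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8").strip()


# Hypothetical transcript on the reverse strand, TSS at position 1,000.
region = promoter_region("7", tss=1000, strand=-1, upstream=900, downstream=100)
url = sequence_url("human", region)
```

Requesting the region with strand -1 returns the reverse-complemented sequence, so downstream scoring can treat all promoters in the direction of transcription.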
PWMs
A total of 575 transcription factor PFMs retrieved from the 2018 JASPAR database [22] (http://jaspar.genereg.net/) are used to create PWMs (Eq 1), as described by Nishida, Frith, and Nakai (2009), where N is the set of nucleotides in the currently scanned sequence; a_i is the number of instances of nucleotide a at position i; b is a pseudocount set to 0.8 as per [122]; S is the number of sequences describing the motif; n_nuc is the count of the nucleotide in the background sequence; and l_bg is the length of the background sequence. The background frequencies for each nucleotide were set to match those of the human genome as determined previously [123].
Without a score threshold, scoring of a 1,000 nt promoter with all 575 JASPAR TF profiles will produce ∼1,150,000 predictions. In order to address this issue, each of the 575 JASPAR PWMs was used to score the complete human genome (a total of 3,375,096,897,466 TF-DNA binding calculations). Scores for each TF were then used to generate a distribution which allowed pairing scores with p-values (from 10^-1 to 10^-5, or smaller when possible, depending on TF profile length). As a result, an appropriate score threshold can be chosen at the discretion of the user. Computation was performed using the supercomputing resources of CSC – IT Center for Science Ltd. (a non-profit owned by the state of Finland and Finnish higher education institutions).
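A minimal sketch of PFM-to-PWM conversion consistent with the symbol definitions above is given below. It assumes the common pseudocount-weighted form, in which each weight is log2 of ((a_i + b × bg) / (S + b)) divided by bg, with bg = n_nuc / l_bg; this reading of Eq 1 is an assumption. The toy PFM and the background frequencies are hypothetical and are not the human-genome values from [123].

```python
import math

PSEUDOCOUNT = 0.8  # value of b stated in the text
NUCLEOTIDES = "ACGT"


def pfm_to_pwm(pfm, background, pseudocount=PSEUDOCOUNT):
    """Convert a position frequency matrix to log2-odds weights.

    pfm: dict mapping nucleotide -> list of counts per motif position.
    background: dict mapping nucleotide -> background frequency.
    """
    length = len(pfm["A"])
    pwm = {nuc: [] for nuc in NUCLEOTIDES}
    for i in range(length):
        n_seqs = sum(pfm[nuc][i] for nuc in NUCLEOTIDES)  # S at this position
        for nuc in NUCLEOTIDES:
            bg = background[nuc]
            freq = (pfm[nuc][i] + pseudocount * bg) / (n_seqs + pseudocount)
            pwm[nuc].append(math.log2(freq / bg))
    return pwm


def score_site(pwm, site):
    """Sum the position-specific weights for a candidate site."""
    return sum(pwm[nuc][i] for i, nuc in enumerate(site))


# Toy three-position motif built from 10 sequences (illustrative only).
toy_pfm = {"A": [10, 0, 1], "C": [0, 9, 1], "G": [0, 1, 8], "T": [0, 0, 0]}
toy_background = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}

toy_pwm = pfm_to_pwm(toy_pfm, toy_background)
consensus_score = score_site(toy_pwm, "ACG")
mismatch_score = score_site(toy_pwm, "TTT")
```

The pseudocount keeps weights finite for nucleotides never observed at a position, so a mismatching site receives a strongly negative score rather than an undefined one.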
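The pairing of scores with p-values can be illustrated with an empirical lookup: the p-value of a score is taken as the fraction of genome-wide scores for that TF that are at least as high. This is a sketch under that assumption; the actual construction of the distribution in TFBSFootprinter may differ, and the tiny score list below is hypothetical. The first lines also reproduce the approximate count of unfiltered predictions, assuming the figure counts both strands and ignores the small end effect of motif length.

```python
import bisect

# Approximate number of unfiltered predictions for one promoter:
# positions x profiles x strands (both strands is an assumption).
promoter_length = 1_000
n_profiles = 575
n_strands = 2
unfiltered_predictions = promoter_length * n_profiles * n_strands


def empirical_p_value(sorted_scores, score):
    """Fraction of background scores greater than or equal to `score`.

    sorted_scores must be sorted in ascending order.
    """
    first_at_or_above = bisect.bisect_left(sorted_scores, score)
    return (len(sorted_scores) - first_at_or_above) / len(sorted_scores)


# Hypothetical background scores for a single TF profile.
background_scores = sorted([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
p_for_score_8 = empirical_p_value(background_scores, 8)
```

With a genome-scale background, the smallest attainable p-value is limited by the number of distinct high scores a profile can produce, which is why shorter profiles cannot always reach the most stringent thresholds.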
CAGE peak locations and Spearman correlation of expression values
Cap analysis of gene expression (CAGE) uses sequencing of cDNA generated from RNA to both determine TSSs and quantify their expression levels. The FANTOM project has performed CAGE across the human genome [124] and the results are freely available for download (http://fantom.gsc.riken.jp/data/). Using the genomic locations of FANTOM CAGE peaks, the distances from all human nucleotides to the nearest CAGE peak were calculated. The distribution of these distances was used to generate a log-likelihood score for all observed distances. The CAGE peak locations and distance/log-likelihood score pairings are then used during de novo prediction of TFBSs (Eq 2). In Eq 2, N is the number of all CAGE peaks associated with the target gene; d_i is the distance to the current CAGE peak; p_i is the peak count of the current CAGE peak; and p_total is the total peak count for this gene.
Expression data for CAGE peaks associated with the 575 JASPAR TF genes were then combined with expression data for all CAGE peaks to perform a total of 386,652,770 Spearman correlation analyses using the ‘spearmanr’ function from the SciPy Stats module [113]. Bonferroni correction was performed to account for multiple testing. Due to the size of the analysis, a cutoff correlation magnitude of 0.3 was used, and all correlation pairs with lesser values (−0.3<x<0.3) were discarded. A distribution was generated from the resulting correlation data which was used to generate log-likelihood scores for each possible correlation value (Eq 3). The CAGE peak expression correlation/log-likelihood score pairings are then used during de novo prediction of TFBSs. In Eq 3, c_current is the Spearman correlation between the expression of the target gene and the expression of the TF corresponding to the putative TFBS; and c_all is the distribution of all Spearman correlations between JASPAR TF genes and all genes. Computation was performed using the supercomputing resources of the CSC – IT Center for Science Ltd.
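The correlation step and the magnitude cutoff can be sketched in pure Python as below. The study used SciPy's `spearmanr`; the rank-based implementation here is a dependency-free stand-in written for illustration, and the expression vectors are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, assigning tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns NaN if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def passes_cutoff(rho, cutoff=0.3):
    """Keep only correlations with magnitude of at least the cutoff."""
    return abs(rho) >= cutoff


# Hypothetical expression of a TF and a target gene across five samples.
tf_expression = [1.0, 2.0, 3.0, 4.0, 5.0]
target_expression = [2.0, 4.0, 5.0, 4.5, 10.0]
rho = spearman(tf_expression, target_expression)
```

Both strongly positive and strongly negative correlations pass the cutoff, matching the description that only values between −0.3 and 0.3 were discarded.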
Experimental TFBSs compiled by the GTRD
Experimental data on TFBS binding (derived from ChIP-Seq, HT-SELEX, and PBMs) is one of the most direct ways to locate potentially functional TFBSs. While binding alone is not indicative of function, clusters of binding sites have been shown to imply functionality [27, 39]. The GTRD project (gtrd.biouml.org) is the largest comprehensive collection of uniformly processed human and mouse ChIP-Seq peaks and has compiled data from 8,828 experiments extracted from the Gene Expression Omnibus (GEO), Sequence Read Archive (SRA), and Encyclopedia of DNA Elements (ENCODE) databases [112]. One of the outputs of the performed analyses is a set of metaclusters, places where TF binding events cluster together in the human genome. We retrieved the metacluster data (28,524,954 peaks) from the GTRD database (version 18.0) and subsequently mapped the number of overlapped metaclusters for all human nucleotides. The distribution of these overlaps was used to generate a log-likelihood score for all observed overlap counts. The metacluster locations and overlap/log-likelihood score pairings are then used during de novo prediction of TFBSs (Eq 4). In Eq 4, n_overlap is the number of metaclusters overlapped by the current putative TFBS; and D_human genome is the distribution of the number of overlapping metaclusters for every location in the human genome.
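Counting the metaclusters overlapped by a putative TFBS (the quantity n_overlap) reduces to an interval-overlap test, sketched below. Coordinates are assumed to be 1-based and inclusive, and the intervals shown are hypothetical. A linear scan is used for clarity; a genome-scale implementation would more likely use a sorted index or interval tree.

```python
def count_overlaps(site_start, site_end, intervals):
    """Count intervals sharing at least one position with the site.

    All coordinates are treated as 1-based and inclusive (an assumption).
    """
    return sum(
        1
        for start, end in intervals
        if start <= site_end and end >= site_start
    )


# Hypothetical metacluster intervals on a single chromosome.
metaclusters = [(100, 250), (240, 400), (390, 395), (1000, 1200)]

# Putative 12-nt TFBS spanning positions 245-256.
n_overlap = count_overlaps(245, 256, metaclusters)
```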
ATAC-Seq peaks
Assay for Transposase-Accessible Chromatin using sequencing (ATAC-Seq) is an experimental method for revealing the location of open chromatin [125]. These locations are indicative of genomic regions which, due to their unpacked nature, may allow TFs to bind to DNA and subsequently influence transcription. Open chromatin regions have been shown to be useful in the prediction of TFBSs [126]. We retrieved and compiled data from 135 ATAC-Seq experiments stored in the ENCODE project database (www.encodeproject.org) and mapped the distance from all human nucleotides to the nearest ATAC-Seq peak, and the distribution of these distances was used to generate a log-likelihood score for all observed distances. The ATAC-Seq peak locations and distance/log-likelihood score pairings are then used during de novo prediction of TFBSs (Eq 5). In Eq 5, N is the number of ATAC-Seq peaks within the current target region; d_i is the distance to the current ATAC-Seq peak; and D_human genome is the distribution of the distances from all human nucleotides to the nearest ATAC-Seq peak.
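The distance from a nucleotide to its nearest peak can be found with a binary search over sorted peaks, as sketched below. The sketch assumes sorted, non-overlapping peak intervals with inclusive coordinates and defines the distance as zero inside a peak; these conventions and the example peaks are assumptions, not details taken from the tool.

```python
import bisect


def distance_to_nearest_peak(position, peaks):
    """Distance from a position to the nearest peak (0 if inside one).

    peaks: list of (start, end) tuples, sorted and non-overlapping,
    with inclusive coordinates.
    """
    if not peaks:
        raise ValueError("at least one peak is required")
    starts = [start for start, _ in peaks]
    idx = bisect.bisect_right(starts, position)
    candidates = []
    if idx > 0:  # nearest peak starting at or before the position
        _, end = peaks[idx - 1]
        candidates.append(0 if position <= end else position - end)
    if idx < len(peaks):  # nearest peak starting after the position
        candidates.append(peaks[idx][0] - position)
    return min(candidates)


# Hypothetical ATAC-Seq peaks on a single chromosome.
atac_peaks = [(100, 200), (500, 600)]
example_distances = [distance_to_nearest_peak(p, atac_peaks)
                     for p in (50, 150, 300, 450, 700)]
```

For repeated queries the list of peak starts would be built once rather than on every call; it is rebuilt here only to keep the function self-contained.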
eQTLs
The Genotype-Tissue Expression (GTEx) project (gtexportal.org) has performed expression quantitative trait loci (eQTL) analysis on 10,294 samples from 48 tissues from 620 persons (version 7) [45]. This analysis has identified 7,621,511 variant locations in the genome, usually 1–5 nt, that affect gene expression. eQTL data were extracted from the GTEx database and used to construct a distribution of the magnitude of effect on gene expression, which was then used to generate log-likelihood scores (Eq 6). Next, we generated a second distribution of the distance from each gene to its variants; the distance was limited to 1,000,000 bp from either end of the transcript, as this is the search area over which GTEx scans for variants affecting expression of each gene. The variant locations, magnitude of effect/log-likelihood score pairings, and distance/log-likelihood score pairings are then used during de novo prediction of TFBSs (Eq 7). In Eq 6, N is the number of eQTLs overlapping the current putative TFBS; m_i is the magnitude of effect of an eQTL overlapping the current putative TFBS; and D_human genome is the distribution of all eQTL magnitudes in all human nucleotides. In Eq 7, N is the number of eQTLs overlapping the current putative TFBS; l_tf is the length of the current putative TFBS (nucleotides); n_v is the number of variants associated with the target gene; l_g is the length of the current gene (nucleotides); l_t is the length of the target transcript; and D_human genome is the distribution of all eQTL overlaps for nucleotide windows the same length as the current TFBS, in the human genome.
CpG islands
Because methylation of DNA acts as a repressor of transcription, active promoters tend to be unmethylated. When methylated, the cytosine in a CpG dinucleotide can deaminate to thymine, so methylated regions gradually lose CpG dinucleotides over evolutionary time. Therefore, a CpG ratio close to what would be expected by chance is often indicative of an active promoter region [127, 128]. Accordingly, CpG ratios (observed/expected) across a 200 nt window were computed for all human nucleotides. A distribution of these ratios was generated and used to generate log-likelihood scores for each possible ratio. CpG ratio/log-likelihood score pairings are then used during de novo prediction of TFBSs. Here, r_obs/exp is the ratio of observed to expected CpG dinucleotides in a 200 nt window centered on the current putative TFBS; and D_genome is the distribution of r_obs/exp across all nucleotide locations in the target genome.
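To make the observed/expected ratio concrete, the sketch below uses a commonly used formulation in which the expected CpG count is (number of C × number of G) / window length. Whether TFBSFootprinter uses exactly this formula is an assumption, and `cpg_obs_exp` is a hypothetical name.

```python
def cpg_obs_exp(window):
    """Observed/expected CpG ratio for one sequence window.

    ASSUMPTION: expected CpG count = (#C * #G) / window length, a common
    formulation; the text does not state which formula was used.
    """
    seq = window.upper()
    c_count, g_count = seq.count("C"), seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0  # no CpG is possible, and this avoids division by zero
    observed = seq.count("CG")
    expected = c_count * g_count / len(seq)
    return observed / expected


# A 200 nt toy window made of repeated "ACGT" (hypothetical sequence).
toy_window = "ACGT" * 50
toy_ratio = cpg_obs_exp(toy_window)
```

In practice the function would be applied to the 200 nt window centered on each position, and the resulting ratios pooled to form the genome-wide distribution.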
Conservation of vertebrate DNA
Conservation of DNA across species boundaries is evolutionarily costly and thus implies function. While differential regulation of gene transcription accounts for much of the variation among species [129], there is conservation of regulatory elements across closely related species groups, such as among primates [130]. The Ensembl EPO (Enredo, Pecan, Ortheus) pipeline has created whole-genome multiple sequence alignments for distinct clades of vertebrates [131]. As of Ensembl release 94, whole-genome alignments exist for mammals (70 species), fish (48 species), amniotes (32 species), and sauropsids (7 species). The number of available species is dependent on the current Ensembl release and is continually growing. Ensembl has performed sequence conservation analysis to identify constrained elements for each species in each species group using the genomic evolutionary rate profiling (GERP) tool [132]. For each of the vertebrate species of Ensembl release 94, we calculated the distance from every nucleotide in the associated species genome to the nearest GERP constrained element, and generated distributions of distances which were used to calculate log-likelihood scores for each distance. GERP element distance/log-likelihood score pairings, for each species, are then used during de novo prediction of TFBSs in the relevant species. Here, d_i is the distance between the current putative TFBS and the nearest conserved element in an alignment of 70 mammalian genomes (GERP); and D_genome is the distribution of distances between all nucleotides in the target genome and the nearest GERP conserved element.
Combined affinity score
The weight scores from each experimental dataset are then summed for each putative TFBS, and the sum is reported as the ‘combined affinity score’. For analysis of human sequences this is represented by Eq 10. Because less experimental data is currently available for non-human species, the combined affinity score for other vertebrates is described by Eq 11. A complete scoring of the promoter regions (1,000 nt) of more than 80,000 transcripts was used to generate p-values for combined affinity scoring; computation was performed using the supercomputing resources of CSC – IT Center for Science Ltd.
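The summation and the background-derived p-value can be illustrated with a short sketch. Eq 10 and Eq 11 are not reproduced in this text, so the feature names, the example values, and the empirical p-value with a +1 pseudocount are all assumptions made for illustration.

```python
from bisect import bisect_left


def combined_affinity_score(feature_scores, features=None):
    """Sum the per-dataset weight scores of one putative TFBS.

    `features` restricts the sum to a subset, e.g. when fewer datasets
    are available for a non-human species (hypothetical usage).
    """
    if features is None:
        features = feature_scores.keys()
    return sum(feature_scores[name] for name in features)


def empirical_p_value(score, sorted_background):
    """Fraction of background scores >= `score`, with a +1 pseudocount.

    ASSUMPTION: the text only says a complete promoter scoring was used to
    generate p-values; this empirical estimator is one common choice.
    """
    n_at_least = len(sorted_background) - bisect_left(sorted_background, score)
    return (n_at_least + 1) / (len(sorted_background) + 1)


# Hypothetical weight scores for one putative TFBS.
site = {"pwm": 8.5, "cage": 1.25, "cpg": 0.5, "conservation": 2.0}
full_score = combined_affinity_score(site)
reduced_score = combined_affinity_score(site, ("pwm", "cpg", "conservation"))
p_value = empirical_p_value(9, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
```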
Benchmarking TFBSFootprinter
In order to robustly assess the ability of PWM, TFBSFootprinter, and DeepBind models to identify functional human transcription factor binding sites, several approaches were taken for benchmarking. These included using ChIP-Seq peaks from the GTRD database (compiled and uniformly processed from ENCODE, GEO, and SRA databases) as well as experimentally validated functional regulatory binding sites to serve as true positives. True negatives were also defined using several approaches in order to mitigate biases inherent in each approach.
ChIP-Seq peaks for 1,392 transcription factors from 15,982 experiments uniformly processed with MACS2 [133] were downloaded from the GTRD database. Using the full complement of Ensembl (release 100) protein-coding transcripts, the ChIP-Seq peaks were reduced to those which occur within the window -1000 to +200 bp relative to an Ensembl transcript TSS. In the first benchmark, for each JASPAR TF with an identically named match in this subset of GTRD peaks, the top 100 peaks which had a fold-enrichment of ≥50x were kept as true positives and matched locations 1,000 bp upstream were used as true negatives. In the second benchmark, for each JASPAR TF with an identically named match in this subset of GTRD peaks, the top 100 peaks which had a fold-enrichment of ≥50x were kept as true positives and the bottom 100 peaks which had a fold-enrichment of ≤2x were kept as true negatives. Using this set of high and low-occupancy locations, a 50-bp window centered on the peak summit was defined for analysis with TFBSFootprinter and DeepBind.
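The peak selection steps above can be sketched in a few lines. This is an illustration under stated assumptions: peaks are dictionaries with hypothetical `summit` and `fold_enrichment` keys, strand is encoded as +1 or -1, and the 50-bp window is returned as a half-open interval. None of these names come from the GTRD or TFBSFootprinter code.

```python
def in_promoter_window(summit, tss, strand, upstream=1000, downstream=200):
    """True if the summit lies within -upstream..+downstream of the TSS.

    `strand` is +1 or -1, so that "upstream" follows transcript orientation.
    """
    offset = (summit - tss) * strand
    return -upstream <= offset <= downstream


def select_strong_peaks(peaks, min_fold=50.0, top_n=100):
    """Keep the top_n peaks with fold-enrichment >= min_fold, strongest first."""
    strong = [p for p in peaks if p["fold_enrichment"] >= min_fold]
    return sorted(strong, key=lambda p: p["fold_enrichment"], reverse=True)[:top_n]


def analysis_window(summit, width=50):
    """Half-open (start, end) interval of `width` bp centered on the summit."""
    half = width // 2
    return summit - half, summit + half


# Hypothetical peaks for one TF near a forward-strand TSS at position 10,000.
peaks = [
    {"summit": 9500, "fold_enrichment": 80.0},
    {"summit": 9900, "fold_enrichment": 55.0},
    {"summit": 9950, "fold_enrichment": 1.5},
    {"summit": 5000, "fold_enrichment": 120.0},
]
promoter_peaks = [p for p in peaks if in_promoter_window(p["summit"], 10000, 1)]
true_positives = select_strong_peaks(promoter_peaks)
windows = [analysis_window(p["summit"]) for p in true_positives]
```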
Experimentally verified and curated TFBSs belonging to the annotated regulatory binding sites (ABS) [134], ORegAnno [135], and Pleiades promoter project [136] databases were retrieved as curated GFF files from the Pazar database [137]. The TFs associated with each experimental TFBS were identified using a combination of Python scripting and manual parsing, due to deprecated gene naming in some cases. From this data, 504 experimentally validated binding sites affecting gene expression for 20 DeepBind TFs and 607 experimentally validated binding sites affecting gene expression for 25 JASPAR TFs were selected. The selection criterion for the chosen TFs was that they have at least 10 experimentally validated binding sites affecting gene expression. All target sites were converted from hg19 to GRCh38 genomic coordinates using Ensembl REST. Subsequently, 50 bp sequences centered on each experimentally validated functional binding site in the human genome were retrieved to serve as true positives.
Paired with these true positives were true negatives from two different sources. In the first approach, for each true positive, 50 locations were chosen in random Ensembl transcripts, at the same distance from the TSS. In the second approach, the 50 locations were chosen within the promoter of the same Ensembl transcript, at least 50 nucleotides away from the true positive.
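The second approach, drawing negatives from the same promoter, can be sketched as below. The sketch assumes that the 50-nucleotide separation is measured between site centers and that locations are drawn without replacement; both are assumptions, and `sample_same_promoter_negatives` is a hypothetical helper.

```python
import random


def sample_same_promoter_negatives(tp_center, promoter_start, promoter_end,
                                   n=50, min_gap=50, seed=0):
    """Pick up to n positions in the promoter at least min_gap nt from the TP.

    ASSUMPTIONS: distance is center-to-center, sampling is without
    replacement, and a fixed seed is used only to make the sketch repeatable.
    """
    rng = random.Random(seed)
    eligible = [pos for pos in range(promoter_start, promoter_end + 1)
                if abs(pos - tp_center) >= min_gap]
    return rng.sample(eligible, min(n, len(eligible)))


# Hypothetical promoter spanning positions 0..1000 with a true positive at 500.
negatives = sample_same_promoter_negatives(500, 0, 1000)
```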
For each of the TFBSFootprinter, PWM, and DeepBind methods, the true positive locations were scored using that method and the model for the corresponding TF which had previously been experimentally verified as functional at that location. However, for the set of true negatives, each method scored all locations using all TF models.
Analyzing effect of feature combinations on TFBSFootprinter accuracy
Subsequently, for each method, and for each TF, the corresponding true positive and true negative scores were used to generate receiver operating characteristic (ROC) curves using the ‘roc_curve’ function of the scikit-learn Python library [138].
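For readers who want to see what the area under such a ROC curve measures, the following dependency-free sketch computes it directly as the probability that a true positive outscores a true negative, with ties counted as one half. This pairwise formulation is swapped in for illustration and is not the scikit-learn call used in the study, although the two should agree on the resulting area. The example scores are hypothetical.

```python
def roc_auc(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    wins = 0.0
    for pos in positive_scores:
        for neg in negative_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical scores for one TF and one scoring method.
true_positive_scores = [0.9, 0.8, 0.4]
true_negative_scores = [0.7, 0.3, 0.2, 0.1]
auc_value = roc_auc(true_positive_scores, true_negative_scores)
```

The quadratic loop is acceptable for the benchmark sizes described here (tens of positives, 50 negatives per positive); a rank-based version would be preferable for much larger sets.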
Using the TFBSFootprinter tool, all 128 possible combinations of transcription-relevant features (PWM, CAGE, eQTL, metaclusters, ATAC-Seq, CpG, sequence conservation, expression correlation) which include PWM as a component were used in scoring of the true positives and true negatives. This allowed identification of the best possible feature-combination TFBSFootprinter model for each TF, which is described as ‘TFBSFootprinter best by TF’, as well as the TFBSFootprinter model which performed best on average across all TFs, named ‘TFBSFootprinter overall best’. In assessment of the DeepBind tool, both available models, based on SELEX or ChIP-Seq data, were used. Using a related-samples t-test, comparisons of ROC scores were made between PWM and each of TFBSFootprinter best by TF, TFBSFootprinter overall best, TFBSFootprinter all features, and DeepBind best by TF. Similarly, comparisons were made between DeepBind and each of TFBSFootprinter best by TF, TFBSFootprinter overall best, and TFBSFootprinter all features.
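The count of 128 follows from fixing PWM and toggling each of the seven remaining features (2^7 = 128). The sketch below enumerates those combinations and shows one way the ‘best by TF’ and ‘overall best’ models could be picked from a table of ROC scores. The helper names and the toy score table are hypothetical.

```python
from itertools import combinations

OPTIONAL_FEATURES = ("CAGE", "eQTL", "metaclusters", "ATAC-Seq", "CpG",
                     "sequence conservation", "expression correlation")


def feature_combinations():
    """All feature sets that contain PWM plus any subset of the other seven."""
    combos = []
    for k in range(len(OPTIONAL_FEATURES) + 1):
        for subset in combinations(OPTIONAL_FEATURES, k):
            combos.append(("PWM",) + subset)
    return combos


def best_by_tf(roc_scores):
    """For each TF, the combination with the highest ROC score."""
    return {tf: max(per_combo, key=per_combo.get)
            for tf, per_combo in roc_scores.items()}


def overall_best(roc_scores):
    """The single combination with the highest mean ROC score across TFs."""
    combos = next(iter(roc_scores.values())).keys()
    return max(combos,
               key=lambda c: sum(roc_scores[tf][c] for tf in roc_scores) / len(roc_scores))


all_combos = feature_combinations()

# Toy ROC scores for two hypothetical TFs and three combinations.
a, b, c = ("PWM",), ("PWM", "CAGE"), ("PWM", "CpG")
toy_scores = {"TF_A": {a: 0.60, b: 0.90, c: 0.80},
              "TF_B": {a: 0.65, b: 0.50, c: 0.85}}
per_tf = best_by_tf(toy_scores)
overall = overall_best(toy_scores)
```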
Data availability
Code reproducing the analyses and figures in the manuscript is available on GitHub at https://github.com/thirtysix/TFBS_footprinting_manuscript. Results of benchmarking of TFBSFootprinter (https://osf.io/hzny6/) and MH vs. Neanderthal promoter analysis (https://osf.io/r2mtw/) are available as Open Science Framework (OSF) repositories.
Funding
This work was supported by the Finnish Cultural Foundation and Fimlab to HB, and Academy of Finland and Jane & Aatos Erkko Foundation to SP.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Figure supplement legends
Supplementary Figure 1. Aggregate expression of DB TFs in 100 tissues. FANTOM5 RNA-Seq data was extracted as TPM values and the 100 tissues with the highest aggregate expression of the differentially binding TF genes were selected for clustering (Figure 1), which produced the depicted order of labeled tissues.
Supplementary Figure 2. ROC Analysis in the identification of TFBS ChIP-Seq peaks – strong and distal binding. ROC Curves are presented for the best performing TFBSFootprinter model for each TF.
Supplementary Figure 3. ROC Analysis model performance in the identification of experimentally verified functional TFBSs – random Ensembl transcripts. ROC analysis was performed using experimentally verified functional TFBSs as annotated in the ORegAnno/Pleiades/ABS datasets as true positives, where true negatives were random locations in other Ensembl transcripts at the same distance from the TSS as the associated true positive. All ROC curve analyses were performed on TFs which had at least 10 true positives; 50 true negatives per true positive were used for each analysis. Each true positive/negative segment analyzed was 50 nucleotides long, and the highest TFBS score for the relevant dataset(s) was used for each true positive/negative. (A) Barplot of the frequency of experimental data type in the top 20 performing TFBSFootprinter models. (B) Boxplot of ROC scores for TFBSFootprinter and DeepBind for 14 TFs (left). ROC scores were also calculated based on using individual experimental metrics to show how well each contributes to accuracy of the combined model. (C) ROC scores for each individual TF tested, for each primary TFBS prediction model under study. The best scoring model among all is named for each TF (right). TFBSFootprinter best by TF, based on using the highest ROC score achieved by some combination of experimental data models; TFBSFootprinter overall best, based on using the combination of experimental data models which had the best average ROC score across all TFs analyzed; DeepBind best by TF, based on using the higher ROC score of the SELEX or ChIP-Seq DeepBind models.
Supplementary Figure 4. ROC Analysis model performance in the identification of experimentally verified functional TFBSs – random location in same Ensembl transcript. ROC analysis was performed using experimentally verified functional TFBSs as annotated in the ORegAnno/Pleiades/ABS datasets as true positives, where true negatives were random locations in the same Ensembl transcript as the associated true positive. All ROC curve analyses were performed on TFs which had at least 10 true positives; 50 true negatives per true positive were used for each analysis. Each true positive/negative segment analyzed was 50 nucleotides long, and the highest TFBS score for the relevant dataset(s) was used for each true positive/negative. (A) Barplot of the frequency of experimental data type in the top 20 performing TFBSFootprinter models. (B) Boxplot of ROC scores for TFBSFootprinter and DeepBind for 14 TFs (left). ROC scores were also calculated based on using individual experimental metrics to show how well each contributes to accuracy of the combined model. (C) ROC scores for each individual TF tested, for each primary TFBS prediction model under study. The best scoring model among all is named for each TF (right). TFBSFootprinter best by TF, based on using the highest ROC score achieved by some combination of experimental data models; TFBSFootprinter overall best, based on using the combination of experimental data models which had the best average ROC score across all TFs analyzed; DeepBind best by TF, based on using the higher ROC score of the SELEX or ChIP-Seq DeepBind models.
Supplementary Figure 5. ROC Analysis in the identification of TFBS ChIP-Seq peaks – strong and weak binding. ROC Curves are presented for the best performing TFBSFootprinter model for each TF.
Abbreviations
- ATAC-Seq: Assay for Transposase-Accessible Chromatin using sequencing
- CAGE: Cap analysis of gene expression
- ChIP-Seq: Chromatin immunoprecipitation with massively parallel DNA sequencing
- CREs: Cis-regulatory elements
- DICE: Database of Immune Cell eQTLs
- HT-SELEX: High-throughput systematic evolution of ligands by exponential enrichment
- PWM: Position weight matrix
- TF: Transcription factor
- TFBS: Transcription factor binding site
- TSSs: Transcription start sites
- PCW: Weeks post conception
Acknowledgements
Heini Huhtala is acknowledged for assistance with statistical techniques, and Professor Matti Nykter and Payam Emami Khoonsari, PhD, are gratefully thanked for discussions on practical and theoretical concerns. The non-profit CSC – IT Center for Science Ltd, owned by the state of Finland and Finnish higher education institutions, is acknowledged for providing computational resources for analyses.
Footnotes
Analysis of ChIP-Seq data was used to redefine target promoter regions (old, -900 to +100; new, -2,500 to +2,500; relative to TSS). Subsequently, all analyses of modern human vs. Neanderthal SNPs in promoters were re-run and the results re-analyzed. Details regarding the genes with differential binding in their promoters were added. Analyses on enrichment of the DB TFs in various cell and tissue types were added. Discussion has been expanded.