The cis-regulatory effects of modern human-specific variants

The Neanderthal and Denisovan genomes enabled the discovery of sequences that differ between modern and archaic humans, the majority of which are noncoding. However, our understanding of the regulatory consequences of these differences remains limited, in part due to the decay of regulatory marks in ancient samples. Here, we used a massively parallel reporter assay in embryonic stem cells, neural progenitor cells and bone osteoblasts to investigate the regulatory effects of the 14,042 single-nucleotide modern human-specific variants. Overall, 1,791 (13%) of sequences containing these variants showed active regulatory activity, and 407 (23%) of these drove differential expression between human groups. Differentially active sequences were associated with divergent transcription factor binding motifs, and with genes enriched for vocal tract and brain anatomy and function. This work provides insight into the regulatory function of variants that emerged along the modern human lineage and the recent evolution of human gene expression.


30
The fossil record allows us to directly compare skeletons between modern humans and their 31 closest extinct relatives, the Neanderthal and the Denisovan. From this we can make inferences 32 not only about skeletal differences, but also about other systems, such as the brain. These 33 approaches have uncovered a myriad of traits that distinguish modern from archaic humans. For 34 example, our face is flat with smaller jaws, our development is slower, our pelvises are narrower, 35 our limbs tend to be slenderer, and our brain differs in its substructure proportions 1-3 (especially 36 the cerebellum 4 ). Despite our considerable base of knowledge of how modern humans differ 37 from archaic humans at the phenotypic level, we know very little about the genetic changes that 38 have given rise to these phenotypic differences. 39

40
The Neanderthal and the Denisovan genomes provide a unique insight into the genetic 41 underpinnings of recent human phenotypic evolution. The vast majority of genetic changes that 42 separate modern and archaic humans are found outside protein-coding regions, and some of these 43 likely affect gene expression 5 . Such regulatory changes may have a sizeable impact on human 44 evolution, as alterations in gene regulation are thought to underlie most of the phenotypic 45 differences between closely related groups 6-9 . Indeed, there is mounting evidence that many of 46 the noncoding variants that emerged in modern humans have altered gene expression in cis, 47 shaped phenotypes, and have been under selection 5,10-18 . Fixed variants, in particular, could 48 potentially underlie phenotypes specific to modern humans, and some of these variants might 49 have been driven to fixation by positive selection. 9 two cell types (Supplementary Fig. 1b). Thus, the lack of activity of the osteoblast putative 155 positive controls is likely because, in contrast to the ESC and NPC positive controls, the 156 osteoblast putative positive controls were not previously tested in an MPRA and are not driving 157 as much expression as the true positive controls in the ESCs and NPCs. Overall, these results 158 suggest that the lentiMPRA was technically reproducible and adequately powered to detect 159 expression. We first examined which of the assayed sequences are able to drive expression. To do so, we 171 utilized MPRAnalyze 40 , which uses a model for each of the RNA and DNA counts, estimates 172 transcription rate and then identifies sequences driving significant expression. We also added an 173 additional stringency filter whereby a sequence is only considered expressed if it had an 174 RNA/DNA ratio significantly higher than that of the scrambled sequences (FDR < 0.05). We 175 found that in ESCs, 8% (1,183) of sequence pairs drove expression in at least one of the alleles, 176 6% (814) in osteoblasts, and 4% (602) in NPCs (FDR < 0.05, see Methods). Hereinafter, we 177 refer to these sequences as active sequences. Overall, 13% (1,791) of archaic and modern human 178 sequence pairs were active in at least one cell type, 4% (586) in at least two cell types, and 2% 179 (222) in all three cell types (overlap of 75-fold higher than expected, P < 10 -100 , Super Exact 180 test 44 , Fig. 2a).

10
To test whether activity in our lentiMPRA reflects true biological function, we investigated 183 whether our active sequences had expected regulatory characteristics in the modern human 184 genome. Active sequences tend to bear active chromatin marks in their endogenous genomic 185 environment. Therefore, we examined whether active sequences in lentiMPRA tend to be 186 enriched for markers of active chromatin. We first tested overlap with five histone modification 187 marks and one histone variant associated with active chromatin (H3K27ac, H3K4me1, 188 H3K4me2, H3K4me3, H3K9ac, and H2A.Z), as well as with two histone modification marks 189 associated with repressed chromatin (H3K9me3 and H3K27me3, see Methods) 45 . We found that 190 on average, active sequences were 1.7x more likely than inactive sequences to have an active 191 chromatin mark. At the same time, these sequences tended to be depleted of repressive marks 192 ( Fig. 2b-d, Supplementary File 2). This trend gets stronger when looking at more highly active 193 sequences. For example, while only 18% of inactive sequences overlap an H3K4me2 peak, 70% 194 of active sequences with an RNA/DNA ratio > 3 in ESCs overlap H3K4me2 peaks (FDR = 195 1.9x10 -21 , hypergeometric test, Fig. 2b-d, Supplementary File 2). To further test the functional 196 characteristics of active sequences, we analyzed chromHMM annotation 45,46 , which uses 197 chromatin signatures to subdivide the genome into functional regions. Compared to inactive 198 sequences, we found that active sequences are enriched for promoter and enhancer marks 199 ( Supplementary Fig. 2a-c, Fig. 2e, Supplementary File 1-2). We also found that compared to 200 inactive sequences, active sequences are 6-32% closer to GTEx 47 eQTLs, depending on cell type 201 (FDR < 0.05, t-test). Active sequences are also 1.

224
Differentially active sequences between modern and archaic humans 225 We next set out to identify modern and archaic human sequences driving differential expression. 226 We used MPRAnalyze 40 to compare expression driven by the modern and archaic sequences. 227 Out of the active sequence pairs in each cell type, 110 (9%) in ESCs drive significantly 228 differential expression between modern and archaic humans, 243 (30%) in osteoblasts, and 153 229 (25%) in NPCs (FDR ≤ 0.05, see Methods, cell-type differences). We refer to these sequence pairs hereinafter as differentially active 231 sequences. Overall, we see significant overlap between cell types in differentially active 232 sequences: 407 sequences (23% of active sequences) were differentially active in at least one cell 233 type, 89 (5%) in at least two cell types, and 10 (0.6%) in all three cell types (8-fold higher than 234 expected, P = 5x10 -7 , Super Exact test 44 , Fig. 3d). 235

13
As expected from such closely related organisms, and similar to other MPRAs that compared 237 nucleotide variants (see Discussion) including one that compared human and chimp sequences 30 , 238 most sequences drove modest magnitudes of expression difference; of the 407 differentially 239 active sequences, the median fold-change was 1.2x, and only five sequences had a fold-change 240 greater than 2x (Fig. 3a-c). We refer to differentially active sequences where modern human 241 expression is higher/lower than archaic human expression as up/downregulating sequences, 242 respectively. In ESCs and NPCs, sequences were equally likely to be up-or downregulating 243 (51% and 52% of differentially active sequences were downregulating, P = 0.92 and P = 0.63, 244 respectively, Binomial test), while in osteoblasts downregulation was observed slightly more 245 often (59%, P = 6.9x10 -3 ). For sequences that are differentially active in more than one cell type, 246 we found an exceptional agreement between the direction of differential activity, with the 247 regulation of 107 out of 109 pairs going in the same direction (P = 9.2x10 -30 , Binomial test), and 248 a high correlation between the magnitudes of differential activity (Pearson's R = 0.82, P = 249 1.6x10 -27 ). That differentially activity sequences from one cell type are predictive of differential 250 activity in other cell types, even of cell types as disparate as those used here, suggests that these 251 sequences are likely to be differentially active in other cell types not assayed in this lentiMPRA. differentially active sequences between cell types. Super Exact test P-value is presented for the overlap of the three 257 groups. In the 10 sequences that were differentially active across all three cells types, the direction of fold-change 258 was identical across all cell types (P = 1.9x10 -3 , Binomial test). e. Violin plots of predicted TF binding score 259 difference between modern and archaic sequences. Positive scores represent increased binding in the modern 260 sequence. Points show mean.

262
To further test the replicability of these results, we examined the relationship between pairs of 263 overlapping differentially active sequences (i.e., variants that are < 200bp apart and thus appear 264 in more than one sequence, three overlapping pairs in ESCs, five in osteoblasts, and two in 265 NPCs). We found that the direction of expression change is identical in all pairs of overlapping 266 sequences (P = 2.0x10 -3 , binomial test), and that the magnitude of their expression change is 267 highly correlated (Pearson's R = 0.95, 2.4x10 -5 , Supplementary Fig. 3a). To validate these 268 results with an orthogonal method, we tested four differentially active sequences from each cell 269 type in a luciferase reporter assay and found that the direction and magnitude of differential 270 expression tended to replicate the lentiMPRA results (9 out of 12 sequences, Pearson's R = 0.67, 271 P = 3.7x10 -4 , Supplementary Fig. 3b, Supplementary File 1). These results suggest that the 272 lentiMPRA was both technically reproducible across cell types and assays and also indicative of overlapping each annotation to that proportion in the other active sequences. We found that 281 sequences driving differential expression are over-represented within putative enhancer regions 282 (P = 3.6x10 -3 , Fisher's exact test, see Discussion). This is mainly driven by enrichment within 283 ESCs and NPCs (2.2x and 2.54x, P = 5.1x10 -3 , and P = 2.0x10 -6 , respectively, Supplementary 284

293
Molecular mechanisms underlying differential activity 294 Next, we sought to understand what regulatory mechanisms could be associated with differential 295 activity. Changes in expression are often linked to changes in regulatory marks, such as DNA 296 methylation. Thus, we checked the overlap between differentially active sequences and 297 differential methylation in modern versus archaic humans 12,20,21 . As hypermethylation of 298 regulatory elements is often associated with transcriptional downregulation 49 , we tested if higher 299 activity levels in the lentiMPRA are associated with hypomethylation and vice versa. Because 300 the DNA methylation maps originate from bone samples, we compared them to the osteoblast 301 lentiMPRA data. We found that upregulating sequences indeed have a slight but significant 302 tendency to be hypomethylated and that downregulating sequences tend to be hypermethylated in 303 modern compared to archaic humans (-1% difference for upregulating sequences, and +2% for 304 downregulating sequences, P = 0.028, paired t-test). This trend strengthens when looking at the 305 most differentially regulating sequences. For example, the top ten most downregulating 306 sequences show on average +3.2% methylation in modern humans, whereas the top ten most 307 upregulating sequences show -7.4% methylation. We also found that differentially active 308 sequences tend to overlap previously reported differentially methylated regions 12 more often than 309 expected by chance (P = 0.01, overlap within 10 kb, hypergeometric test). 310

311
We conjectured that some of the differential activity in these loci might have been driven by 312 alterations in transcription factor (TF) binding. To investigate this, we compared predicted TF 313 binding affinity to the modern and archaic sequences using FIMO 50 . We found that: (1) 314 compared to other active sequences, the difference in predicted binding between the modern and 315 archaic human alleles tends to be larger for differentially active sequences (combined across cell 316 types: 4.3x, P = 0.02, t-test, Supplementary Fig. 4a); (2) the directionality of differential 317 expression tends to match the directionality of differential binding, i.e., upregulating sequences 318 tend to have stronger predicted binding for the modern human sequence, whereas 319 downregulating sequences tend to have stronger predicted binding for the archaic sequence (P = 320 3.7x10 -6 for ESCs, P = 1.7x10 -6 for osteoblasts, and P = 1.3x10 -5 for NPCs, binomial test, Fig.  321 3e, see Methods); and (3) the magnitude of expression difference is correlated with the 322 magnitude of predicted binding difference (Pearson's R = 0.43 and P = 1.2x10 -3 for ESCs, 323 Pearson's R = 0.23 and P = 0.02 for osteoblasts, and Pearson's R = 0.35 and P = 2.4x10 -3 for 324 NPCs, Supplementary Fig. 4b-d and Supplementary File 4). These results support the notion 325 that alterations in TF binding played a role in shaping some of the expression differences 326 between modern and archaic humans. 327 To identify the TFs that primarily drove these observations, we investigated which motif changes 329 are most predictive of expression changes. For each TF and the sequences it is predicted to 330 differentially bind, we examined the correlation between binding and expression fold-change. 331 We found that changes to the motifs of 11 TFs in osteoblasts and 13 in NPCs were predictive of 332 expression changes, i.e., higher affinity to the modern human sequences was predictive of higher 333 expression of that sequence (Supplementary Fig. 4e, Supplementary File 4). All of these TFs 334 had a positive correlation between changes in their predicted binding affinity and changes in 335 expression of their bound sequences, reflective of an activator function. 336 337 Next, we sought to explore if some motif changes are particularly over-represented within 338 differentially active sequences, suggestive of a more central role in shaping modern human 339 regulatory evolution. To control for sequence composition biases, we used active sequences as a 340 background to search for motif enrichment in differentially active sequences. We found two 341 enriched motifs: ZNF281 (4.6-fold, FDR = 0.016, in NPCs) and SP3 (2.5-fold, FDR = 0.046, in 342 NPCs, Supplementary File 4). ZNF281, among other functions, is an inhibitor of neuronal 343 differentiation 51 . SP3 is a ubiquitously expressed TF involved in a wide variety of processes 344 including cell differentiation, proliferation, and synaptic gene regulation 52 . Both TFs bind GC-345 rich sequences with overlapping binding specificities 53,54 , evident also through the overlap 346 between them in predicted differential binding (Supplementary File 4). Out of 153 differentially 347 active sequences in NPCs, 23 (15%) are predicted to be differentially bound by SP3 (2.5-fold, 348 FDR = 0.046), and 14 by ZNF281 (4.6-fold, FDR = 0.016). Intriguingly, ZNF281 and SP3 are 349 also two of the TFs whose predicted differential binding is most tightly linked with differential 350 expression ( Supplementary Fig. 4e-g). Overall, these data support a model whereby variants in 351 ZNF281 and SP3 motifs might have modulated their binding in NPCs, thereby shaping neural 352 expression differences between modern and archaic humans.

364
Phenotypic consequences of differential expression

365
In an attempt to assess the functional effects of the differential transcriptional activity we 366 detected, we first sought to link each sequence to the gene(s) it might regulate in its endogenous 367 genomic location. While most regulatory sequences are known to affect their closest gene 55,56 , 368 some exert their function through interactions with more distal genes, often reflected in 369 20 chromatin conformation capture assays, such as Hi-C interactions 57 , or eQTL associations 57,58 . 370 To predict the genes linked to each sequence we combined data from four sources: (1) proximity 371 to transcription start sites; (2) proximity to eQTLs 47 ; (3) proximity to putative enhancers 59 ; and 372 (4) spatial interaction with promoters using Hi-C data 58 (see Methods). Using these data, we 373 generated for each cell type a list of genes potentially regulated by each sequence. Overall, 1,341 374 out of the 1,791 active sequences (75%) were linked to at least one putative target gene 375 To study the potential functional effects of sequences driving differential expression, we 378 analyzed functions associated with their linked genes. To control for confounders such as cell 379 type-specific regulation, gene length, and GC content, we compared differentially active 380 sequences to other active sequences (instead of the genomic background), which minimizes 381 inherent biases in the active sequences. First, we tested Gene Ontology terms and found an 382 enrichment of the terms G-protein coupled receptor signaling pathway (2.9-fold, FDR = 0.019, 383 in ESCs), and regulation of transcription, DNA-templated (1.8-fold, FDR = 3.3x10 -3 , in 384 downregulating sequences across all cells combined, Supplementary File 4). To obtain a more 385 detailed picture of phenotypic function, we ran Gene ORGANizer, a tool that uses monogenic 386 disorders to link genes to the organs they affect 60 . We analyzed the genes linked to differentially 387 active sequences and found that for genes linked to sequences driving up-regulation, the most 388 enriched body parts belong to the vocal tract, i.e., the vocal cords (5.0-fold, FDR = 1.1x10 -3 ), 389 voice box (larynx, 3.8-fold, FDR = 4.2x10 -3 ), and pharynx (3.3-fold, FDR = 8.3x10 -3 , all within 390 ESCs, Fig. 4a). Interestingly, we have previously reported that the most extensive DNA 391 methylation changes in modern compared to archaic humans arose in genes affecting the vocal 392 tract 12 . Most subcranial, appendicular and axial skeletal parts are also over-represented within 393 upregulating sequences. Conversely, within sequences driving downregulation, the enriched 394 body parts are the cerebellum (3.0-fold, FDR = 3.9x10 -3 , in NPCs) and urethra (2.3-fold, FDR = 395 0.04, in ESCs, Fig. 4a, Supplementary File 5). This is in line with previous reports of cerebellar 396 anatomy differences between modern humans and Neanderthals 1-3 , including results suggesting 397 that the biggest differences in brain anatomy are in the cerebellum 4 . These data also provide 398 leads into the functional divergence of organs, like the urethra, that are not preserved in the fossil 399 record. 400 401 Next, we delved into individual phenotypes associated with the differentially active sequences. 402 To this end, we used the Human Phenotype Ontology (HPO) database 61 , a curated database of 403 genes and the phenotypes they underlie in monogenic disorders. HPO covers a broad range of 404 phenotypes related to anatomy, physiology, and behavior. Within upregulating sequences, the 405 enriched phenotypes were involved in learning ability, craniofacial morphology, speech, 406 hormonal activity, heart anatomy, testicular descent, osteoarthritis, muscle tone, and spine 407 curvature (FDR < 0.05, Fig. 4b). Within downregulating sequences, we detected an enrichment 408 of phenotypes affecting cerebellar size, intellectual ability, seizures, kidney function, and 409 walking (FDR < 0.05, Fig. 4b, Supplementary File 5). These results reveal body parts and 410 phenotypes that were particularly associated with gene expression changes between modern and 411 archaic humans, and could be new candidates for phenotypes under selection. 412

23
Downregulation of SATB2 potentially underlies brain and skeletal differences 423 This catalog of cis-regulatory changes allows us to explore specific sequence changes that 424 potentially underlie divergent phenotypes observed from fossils. To use the most robust data 425 from lentiMPRA, we examined the ten sequences that are differentially active across all three 426 cell types, and looked at their linked genes. To investigate the phenotypes that are potentially 427 linked to these genes, we looked for those genes whose phenotypes can be compared to the fossil 428 record (i.e., skeletal phenotypes). The only gene that fit these criteria was SATB2, a regulator of 429 brain and skeletal phenotypes 62 . First, we analyzed its linked variant (C to T transition), which is 430 at a position that is highly conserved in vertebrates (GRCh38: 199,469,203 on chromosome 2, 431 PhyloP score = 0.996). This position is found within a CpG island between two alternative TSSs 432 of SATB2 (Fig. 4c). It is also found in the first intron of SATB2-AS1, an antisense lncRNA which 433 upregulates SATB2 protein levels 63 . To determine if this position lies within a regulatory region, 434 we investigated it for chromatin marks in modern humans. We found that it overlaps a DNase I-435 hypersensitive site 64 and shows many peaks of active histone modification marks in all three cell 436 types (Fig. 4c, Supplementary File 1). Indeed, this sequence drives high expression in all three 437 cell types (fourth, eighth, and 19th percentile among active sequences, in ESCs, osteoblasts, and 438 NPCs, respectively, FDR < 10 -5 across all). Although this sequence shows hallmarks of activity 439 in modern humans, compared to the archaic sequence the modern human sequence is 440 downregulating in all three cell types (-9% in ESCs, FDR = 6.8x10 -4 , -27% in osteoblasts, FDR 441 = 2.2x10 -42 , and -12% in NPCs, FDR = 1.1x10 -7 ). These results suggest that the ancestral version 442 of this sequence possibly promoted even higher expression in archaic humans. 443 24 SATB2 encodes a transcription factor expressed in developing bone and brain. Its activity 445 promotes bone formation, jaw patterning, cortical upper layer neuron specification, and 446 tumorigenesis 62 . Genome-wide association studies (GWAS) show that common variants near 447 SATB2 are mainly associated with brain and bone phenotypes, such as reaction time, anxiety, 448 mathematical abilities, schizophrenia, autism, and bone density 65 . Heterozygous loss-of-function 449 mutations in SATB2 result in the SATB2-associated syndrome, which primarily affects 450 neurological and craniofacial phenotypes. This includes speech delay, behavioral anomalies 451 (e.g., jovial personality, aggressive outbursts, and hyperactivity), autistic tendencies, small jaws, 452 dental abnormalities, and morphological changes to the palate 66 . Additionally, reduced functional 453 levels of SATB2 due to heterozygous loss-of-function have been shown to be the cause of these 454 phenotypes in both human 62,66-68 and mouse 69-71 . Because these phenotypes are driven by 455 changes to functional SATB2 levels 62 , we conjectured that the differential expression of SATB2 456 predicted from lentiMPRA might be linked to divergent modern human phenotypes. Thus, we 457 examined whether the phenotypes SATB2 affects are divergent between archaic and modern 458 humans (e.g., if modern human jaw size is different than the jaw size of archaic humans). We 459 focused on phenotypes available for examination from the fossil record, primarily skeletal 460 differences between modern humans and Neanderthals. From HPO, we generated a list of 17 461 phenotypes known to be affected by SATB2 and found that 88% (15) of them are divergent 462 between these human groups (Supplementary File 6). These include the length of the skull, size 463 of the jaws, and length of the dental arch. Next, based on SATB2 downregulation in modern 464 humans predicted from lentiMPRA, we examined whether the direction of a phenotypic change 465 between patients and healthy individuals matches the direction of phenotypic change between 466 modern and archaic humans. For example, given that SATB2-associated syndrome patients have 467 smaller jaws, we tested if modern human jaws are smaller compared to archaic humans. If 468 SATB2 expression is not in fact related to phenotypic divergence, there is a 50% likelihood for a 469 given phenotype to match the fossil record. Yet, we observed a match in direction in 80% of the Notably, it has limitations that could influence our results, mainly by potentially generating false 495 negatives. First, our lentiMPRA library inserts were limited to ~200bp in length, due to 496 oligonucleotide synthesis technical restrictions, which may be insufficient to detect the activity 497 of longer enhancer sequences 41 . Second, some minimally active sequences may not be expressed 498 at a high enough level to pass our limit of detection. Third, some sequences may regulate 499 expression post-transcriptionally, which lentiMPRA is not designed to detect. Fourth, since test 500 sequences are randomly integrated into the genome, sequences that are dependent on their 501 endogenous genomic environments (e.g., on nearby TF binding sites) might show reduced 502 activity when inserted in new locations, while others might show activity that they otherwise 503 would not have. Our design partially addresses this through the use of insulators and multiple 504 independent integrations, which are intended to dilute location-specific effects. Additionally, all 505 biases are expected to similarly affect the modern and archaic human versions of each 506 sequence 41 . Fifth, transcriptional repression is less likely to be detected due to the low basal 507 activity of the minimal promoter used. Finally, differences in the trans environment of a cell 508 could have an effect on the ability of a sequence to exert its cis-regulatory effects. However, 509 previous data comparing substantially more divergent organisms (e.g., human-chimpanzee 30 and 510 human-mouse 72 ) show that cis-regulatory changes exert their effects similarly across different 511 trans environments, suggesting the potential effect of such trans differences is likely minimal 512 between modern and archaic humans. 513 Importantly, when genomes from additional modern human individuals are sequenced and new 514 variants mapped, it might become clear that some of the variants we analyzed have not reached 515 fixation. However, regardless of whether they are completely fixed or not, these variants 516 represent derived substitutions that likely emerged in modern humans and spread to considerable 517 frequency. Further investigation is required to determine when they emerged, how rapidly they 518 spread, and whether their effect was neutral or adaptive. 519 As expected, we observed differences in activity and differential activity between cell 520 types 28,43,72 . Although some of this variation is likely biological (i.e. cell type-specific gene 521 regulation), it is difficult to determine what proportion of it is due to biological versus technical 522 factors (e.g., differences in lentivirus preparation, infection rate, or cell growth, see Methods). 523 Importantly, these differences are expected to result in false negatives rather than false positives. 524 In other words, some of the sequences that appear as active or differentially active in one cell 525 type might actually be active or differentially active in additional cell types (including cell types 526 that were not tested in this study). Thus, we largely refrained from comparisons between cell 527 types. Despite these caveats and limitations, lentiMPRA is a powerful high-throughput tool to 528 characterize the regulatory activity of derived variants, and indeed has become a common assay 529 to study the capability of sequences to promote expression 19 . 530 With this method, we found that 1,791 (13%) of the 14,042 sequence pairs can promote 531 expression in at least one of the three cell types tested, and that 405 (23%) of these active 532 sequences show differential activity between modern and archaic humans (average fold-change: 533 1.24x, standard deviation: 0.18, Fig. 2, Supplementary File 1). Interpreting these results in light 534 of previous MPRAs is challenging, not only because of key differences in statistical power and 535 experimental design (e.g., sequence length), but also because of differing variant selection 536 processes for each MPRA. With the exception of highly repetitive regions, which were removed 537 from our library for technical reasons, the sequences we selected included all known modern 538 human-derived fixed or nearly fixed variants (see Methods). Conversely, previous reporter 539 assays and MPRAs on human intra-or inter-species variation used biased sets of variants by 540 selecting sequences with putative regulatory function (e.g., eQTLs 28 , TF binding sites 16 , ChIP-541 seq peaks 29 , or TSSs 72 ) and/or regions showing particularly rapid evolution (e.g., human 542 accelerated regions 30,31,73,74 ). In line with the fact that our data was not pre-filtered for putative 543 regulatory regions, the proportion of active and differentially active sequences we observed tends 544 to be slightly lower than these previous studies, but the magnitude of differential expression in 545 our active sequences was similar 16,28-31,72-74 . At the same time, we were capable of measuring 546 regulatory activity in regions that would otherwise be excluded by filtering for a specific set of 547 marks. Thus, future MPRAs on unfiltered sets of variants will enable the comparison of the 548 patterns we observed to patterns within modern humans, between more deeply divergent clades, 549 and of non-fixed modern-archaic differences. 550 Emergence of new regulatory elements is a common phenomenon in mammalian genomes 48,75 . 551 Indeed, some instances of differential activity in our dataset could potentially be explained by the 552 emergence of entirely new regulatory sequences: in 22-34% (depending on cell type) of 553 differentially active sequences, only the derived sequence promotes activity (Supplementary 554 File 1). Some of these cases could potentially be the result of a new TF binding site. This is 555 supported by the observation that 20 of these sequences show significant predicted TF binding 556 only in the derived allele (Supplementary File 4). 557

558
Our results also suggest that sequences driving differential expression are over-represented 559 within putative enhancers, especially in ESCs and NPCs (Supplementary Fig. 2c-d,  560 Supplementary File 3). Enhancers have been suggested to be an ideal substrate for evolution 561 because of their tissue-specificity and temporal modularity 76 . Indeed, previous studies of 562 introgression between archaic and modern humans suggested that enhancers are some of the 563 most divergent regions between modern and archaic humans 11,25,77 . In line with the enrichment 564 we observed in NPCs, brain-related putative enhancers show particularly low introgression, 565 perhaps suggesting that the modern human sequences in these regions were adaptive 25,77 . To 566 fully characterize the underlying mechanisms of differential activity in enhancers, it is important 567 to disentangle the various factors and confounders that might contribute to this enrichment. 568 There are several alternative explanations for the enrichment we observe, namely that variants 569 within enhancers could be more likely to alter expression compared to other active sequences, or 570 they could be particularly detectable in lentiMPRA. This could be tested using saturation 571 mutagenesis MPRAs 43 to compare the effect of random mutations in enhancer and non-enhancer 572 modern human-derived active sequences. 573 574 Our results suggest that differentially active sequences are not randomly distributed across the 575 genome, but rather tend to be linked to genes affecting particular body parts and phenotypes. The 576 most pronounced enrichment was in the vocal tract, i.e., the vocal cords, larynx, and pharynx. 577 This was evident in both with Gene ORGANizer, where these organs are over-represented by up 578 to 5-fold, and also by HPO phenotype analysis, where the most enriched phenotype was nasal 579 speech (a phenotype that occurs when there is an abnormal closure of the soft palate against the 580 walls of the pharynx, 13-fold, FDR = 3x10 -4 , Fig. 4b, Supplementary File 5). Overall, 53 of the 581 407 differentially active sequences were linked to genes which are known to affect one or more 582 vocal tract phenotypes. Previous reports have also suggested that the vocal tract went through 583 particularly extensive regulatory changes between modern and archaic humans 12 , as well as 584 between humans and chimpanzees 78 . Intriguingly, the anatomy of the vocal tract differs between 585 humans and chimpanzees, and has been suggested to affect human phonetic range 79 . Comparing 586 the anatomy of archaic and modern human larynges is currently impossible because the soft 587 tissues of the larynx rapidly decay postmortem. However, together with these previous 588 reports 12,78 , our results enable the study of vocal tract evolution from a genetic point of view and 589 suggest that genes influencing the modern human vocal tract have possibly gone through 590 regulatory changes that are not shared by archaic humans. 591

592
We also identified an enrichment of brain-related phenotypes, particularly those affecting the 593 size of the cerebellum (Fig. 4, Supplementary File 5). The cerebellum is involved in motor 594 control and perception, as well as more complex functions such as cognitive processing, 595 emotional regulation, language, and working memory 80 . Interestingly, the cerebellum has been 596 described as the most morphologically divergent brain region between modern and archaic 597 humans 1,4 . Evidence of divergent brain and cerebellar evolution can also be found at the 598 regulatory level. Studies of Neanderthal alleles introduced into modern humans through 599 introgression provide a clue as to the functional effects of divergent loci between archaic and 600 modern humans. These works have shown that many of the introgressed sequences were likely 601 negatively selected, with the strongest effect in regulatory regions 11,25 , particularly in brain 602 enhancers 77 . Studies of introgressed sequences have also shown that the cerebellum is one of the 603 regions with the most divergent expression between Neanderthal and modern human alleles 10 . 604 Together with our results, these data collectively suggest that sequences separating archaic and 605 modern humans are particularly linked to functions of the brain, and especially the cerebellum. 606 607 Functional information on archaic human genomes is particularly challenging to obtain because 608 of the postmortem decay of RNA and epigenetic marks in ancient samples. MPRA not only 609 provides a new avenue to identify differential regulation in archaic samples, but also reveals the 610 sequence changes underlying these differences. Here, we present a catalog providing regulatory 611 insight into the sequence changes that separate modern from archaic humans. This resource will 612 hopefully help assign functional context to various signatures of sequence divergence, such as 613 selective sweeps and introgression deserts, and facilitate the study of modern human evolution 614 through the lens of gene regulation. 615

Selection of fixed, derived variants and design of DNA oligonucleotides
We selected the variants for our lentiMPRA in the following manner. As a basis, we used the list 618 of 321,820 modern human-derived single nucleotide changes reported to differ between modern 619 humans and the Altai Neanderthal genome 33  15cm 100 mg/mL carbenicillin LB agar plates. Colonies were pooled and midiprepped (Qiagen). 665 We collected approximately 6 million colonies, such that ~200 barcodes were associated with 666 each oligo on average. To determine the sequences of the random barcodes and which oligos 667 they were associated with, we first amplified a fragment containing the oligo, mP and barcode 668 from each plasmid in the lentiMPRA library using primers that contain Illumina flow cell 669 adapters (P7-pLSmp-ass-gfp and P5-pLSmP-ass-i#, Supplementary File 1). We sequenced these 670 amplified sequences with a NextSeq 150PE kit using custom primers (R1, pLSmP-ass-seq-R1; 671 R2 (index read), pLSmP-ass-seq-ind1; R3, pLSmP-ass-seq-R2, Supplementary File 1) to obtain 672 approximately 150M total reads. We later did a second round of barcode association sequencing 673 of these fragments to obtain approximately 76M additional reads, for a combined total of 674 225,592,667 reads. To associate barcodes with oligos, we first mapped read pairs (R1 and R3) to 675 the original list of 28,993 oligos using bowtie2 (--very-sensitive) 82 . Next, we filtered out pairs of 676 reads that (1) did not map to the same oligo, (2) did not have at least one of the reads in the pair 677 with a mapping quality of ≥ 6, or (3) did not have the "proper pair" SAM designation. We linked 678 each pair of reads with the read covering its barcode (R2) and saved only those barcode reads 679 having at least a quality score of 30 across all 15 bases in the R2 read. We removed any barcodes 680 associated with more than a single unique oligo (i.e., "promiscuous" barcodes), as well as any 681 barcodes where we did not see evidence of its oligo association at least three times. replicate and DNA type separate, with 3-cycle PCR using primers that include adapters 741 necessary for sequencing (P7-pLSmp-ass16UMI-gfp and P5-pLSmP-5bc-i#, Supplementary File 742 1). These primers also contained a sample index for demultiplexing and a UMI for consolidating 743 replicate molecules (see later). A second round of PCR was performed to amplify the library for 744 sequencing using primers targeting the adapters (P5, P7, Supplementary File 1). The fragments 745 were purified and further sequenced with six runs of NextSeq 15PE with 10-cycle dual index 746 reads, using custom primers (R1, pLSmP-ass-seq-ind1; R2 (read for UMI), pLSmP-UMI-seq; 747 R3, pLSmP-bc-seq; R4 (read for sample index), pLSmP-5bc-seq-R2, Supplementary File 1). 748 Later, an additional two runs of 15PE of only the ESC samples were performed due to lower 749 lentivirus infection efficiency in this cell type. Each samples' R1 and R3 reads (containing the 750 barcode) were mapped with bowtie2 [ 82 ] (--very-sensitive) to the barcode-oligo association list. 751 Next, we applied several quality filters on the resulting alignments. We first filtered out read 752 pairs that didn't map as proper pairs, and then ensured the mapped sequence completely matched 753 the known barcode sequence by requiring that both R1 and R3 reads have CIGAR stings = 15M, 754 MD flags = 15 and a mapping quality of at least 20. Next, we consolidated read abundance per 755 barcode by selecting only reads with unique UMIs, the result being abundance counts for each 756 barcode, across each replicate library of each cell type for both RNA and DNA. replicate + allele, dnaDesign = ~ replicate + barcode + allele, reducedDesign = ~ replicate). We 795 extracted P-values and the differential expression estimate (fold-change of the derived relative to 796 ancestral sequence). Then, we corrected the P-values of the set of active oligos (see above) for 797 multiple testing with the Benjamini-Hochberg method to generate an FDR for each sequence. 798 We set a cutoff of FDR ≤ 0.05 to call a sequence capable of driving differential expression. From 799 this we generated, for each cell type, a list of sequences with differential expression between the 800 ancestral and derived alleles (Supplementary File 1). Overall, 11,207 out of the 14,042 loci were linked to at least one putative target gene, with a 881 median of four target genes per locus. 2,830 of the remaining loci were linked to their closest 882 TSS, regardless of distance. The last 5 without hg38 coordinates for their closest TSS were not 883 linked to a gene. Importantly, these links do not necessarily mean that these target genes are 884 regulated by these loci, but rather they serve as a list of potential target genes for the loci 885 showing a regulatory function through lentiMPRA. 886

Differential transcription factor binding sites
We predicted differences in binding of human transcription factors caused by each of our 889 variants as follows. First, we downloaded the entire set of publicly available human transcription 890 factor binding motifs (7,705 motifs, 6,608 publicly available) from the Catalogue of Inferred 891 Sequence Binding Preferences (CIS-BP) database (http://cisbp.ccbr.utoronto.ca/), and filtered 892 them to include only motifs labeled as directly determined (i.e., we filtered out inferred motifs), 893 resulting in 4,351 motifs. Next, to enrich our mapping result for matches covering the variant 894 location, we trimmed each of our oligo sequences containing a single variant to +/-30 bp around 895 the variant (the length of the longest motif). We did not trim oligos containing >1 variant. We 896 used FIMO 50 to map each remaining motif to both the ancestral and derived alleles of each 897 trimmed sequence (or untrimmed, for sequences with >1 variant). For each motif mapping to 898 both the ancestral and derived alleles at the same strand and location, we required that at least 899 one allele had a q-value (as supplied by FIMO) ≤ 0.05. For motifs where only one of the 900 ancestral or derived alleles had a reported mapping from FIMO for a given oligo, we searched all 901 the mappings of that motif for the lowest mapping score, and substituted that score for the 902 missing allele's score. Then, we found cases where the FIMO predicted binding score of a motif 903 differed between the ancestral and derived alleles. Finally, we linked each motif to the 904 transcription factor (TF) it is most confidently associated with in CIS-BP, thereby generating 905 lists of TFs that showed differential predicted binding for each sequence. For cases in which 906 multiple unique motifs corresponded to the same TF, we used the motif with the largest score 907 difference between alleles. TF enrichment analyses were done on all predicted differential TF 908 binding sites for TFs with a minimum of 10 predicted differential sites. This was done for each 909 cell type, as well as for the union of all differentially active sequences across all cell types. 910 Fisher's exact test was used to compute enrichment of a TF among differentially active 911 sequences compared to other active sequences. P-values were FDR-adjusted. A threshold of 10 912 sites per TF was used for the analysis of correlation between predicted binding and expression. 913 Overlapping loci with genomic features

914
The following datasets were used for the overlap analyses: GENCODE v28 GRCh38 human 915 genome TSSs 84 , GTEx v8 eQTLs 47 , and broad peaks for the following histone modification 916 marks: H3K27ac, H3K4me1, H3K4me2, H3K4me3, H3K9ac, H3K9me3, and H3K27me, and 917 the histone variant H2A.Z from the Roadmap Project for ESCs, ESC-derived NPCs, and 918 osteoblasts 45 . We overlapped each of these datasets with the lists of non-active and active 919 sequences, and computed enrichment P-values using a hypergeometric test. We repeated this for 920 various RNA/DNA cutoffs (1, 1.5, 2, 2.5, 3 and 3.5). Sex chromosomes were removed from the 921 analyses. P-values were FDR-adjusted using the Benjamini-Hochberg procedure.