Integrative analysis of 10,000 epigenomic maps across 800 samples for regulatory genomics and disease dissection

To help elucidate genetic variants underlying complex traits, we develop EpiMap, a compendium of 833 reference epigenomes across 18 uniformly-processed and computationally-completed assays. We define chromatin states, high-resolution enhancers, activity patterns, enhancer modules, upstream regulators, and downstream target gene functions. We annotate 30,247 genetic variants associated with 534 traits, recognize principal and partner tissues underlying each trait, infer trait-tissue, tissue-tissue and trait-trait relationships, and partition multifactorial traits into their tissue-specific contributing factors. Our results demonstrate the importance of dense, rich, and high-resolution epigenomic annotations for complex trait dissection, and yield numerous new insights for understanding the molecular basis of human disease.


Introduction
Genome-wide association studies (GWAS) have been extremely successful in discovering more than 19 100,000 genomic loci containing common single-nucleotide polymorphisms (SNPs) associated with 20 complex traits and disease-related phenotypes, providing a very important starting point for the 21 systematic dissection of the molecular mechanism of human disease 1,2 . However, the vast majority of 22 these genetic associations remain devoid of any mechanistic hypothesis underlying their molecular and 23 cellular functions, as more than 90% of them lie outside protein-coding exons and likely play non-24 coding roles in gene-regulatory regions whose annotation remains incomplete 3-6 . 25 To help annotate non-coding regions of the genome, large-scale experimental mapping of epigenomic 26 modifications associated with diverse classes of gene-regulatory elements has been undertaken by 27 several large consortia, including the ENCyclopedia of DNA Elements (ENCODE), Roadmap 28 Epigenomics, and the Genomics of Gene Regulation (GGR) [7][8][9] . Integration of these datasets, and in 29 particular histone modification marks and DNA accessibility maps, has helped infer chromatin states 30 and annotate diverse classes of gene-regulatory elements, including distal-acting and tissue-specific 31 enhancer regions and proximal-acting and mostly-constitutive promoter regions [10][11][12] across 16 cell 32 types by ENCODE in 2012 7,8 , and 111 cell and tissue types by Roadmap Epigenomics in 2015 7,8 , 33 by incorporation of 1698 new datasets across three consortia, joint uniform re-processing of a total of 23 3030 reference datasets, and computational completion of 14,952 epigenomic maps across 859 tissues 24 and 18 marks. We rely on epigenomic imputation [25][26][27] , which maximizes the consistency and quality of 25 an ensemble of epigenomic maps by leveraging correlation patterns between related assays and 26 between related cell types to infer missing datasets, and to generate high-quality predictions of 27 experimentally-profiled datasets using all experimentally-profiled datasets. 28 We show that EpiMap greatly surpasses previous reference maps in scope, scale, and coverage of 29 biological space. We combine single-mark maps into a multi-mark chromatin state reference for each of 30 833 high-quality epigenomes, combine multiple enhancer chromatin states with DNA accessibility to 31 infer a high-resolution enhancer map for each cell type, group enhancers into modules based on their 32 common activity patterns across cell types, and use these modules to infer candidate upstream 33 regulators based on motif enrichments, and downstream gene functions based on gene ontology 1 enrichments, providing an important reference for studies of gene regulation both for disease dissection 2 and for basic biological studies of human tissues. 3 Lastly, we use EpiMap to increase our understanding of the molecular processes underlying complex 4 traits and human disease by systematic integration of genetic variants associated with 926 traits and 5 high-resolution enhancer annotations across 833 tissues. Compared to the Roadmap Epigenomics 6 resource analysis 8 , we achieve a nearly 10-fold increase in the number of traits showing significant 7 epigenomic enrichments (N=534) and a dramatic increase in the number of loci with putative causal 8 variants within enriched enhancer annotations (N=30k). We also exploit the greatly-increased number 9 of samples by developing a new hierarchical approach for narrowing down the tissue-level resolution of 10 GWAS enrichments thus resulting in many new enriched traits and loci, to distinguish multifactorial and 11 polyfactorial traits from unifactorial traits, to learn principal-partner tissue pairs that cooperate to explain 12 multifactorial traits, to partition multifactorial-trait SNPs by tissue thus revealing the distinct biological 13 processes and explaining trait comorbidity patterns, and to learn a trait-trait relationship network that 14 helps guide the interpretation of unifactorial and multifactorial traits. 15 These results indicate that EpiMap is a valuable new reference for both gene-regulatory studies and 16 disease studies seeking to elucidate the molecular basis of complex disorders. 17

18
Generation and validation of EpiMap reference epigenome compendium 19 We generated EpiMap, the largest integrated compendium of epigenomics maps to-date, spanning 859 20 epigenomes across 18 epigenomic assays (Fig. 1a, Supp. Fig. S1) by aggregation, uniform 21 processing, and computational completion of three major resources: Roadmap Epigenomics 8 Table S1). 29 We classified these 3,030 assays into multiple tiers, according to their completeness. Tier 1 assays (7 30 marks, 2197 total experiments, 36% complete on average) include DNase-seq (accessible chromatin), 31 H3K4me3 (promoters), H3K36me3 (transcribed), H3K4me1 (poised enhancers), H3K9me3 32 (heterochromatin), H3K27ac (active enhancers/promoters), and H3K27me3 (polycomb repression). Tier 1 2 assays (6 marks, 365 experiments, 7% complete on average) include H3K9ac, ATAC-seq, H3K4me2, 2 H2AFZ, H3K79me2, and H4K20me1, in order of abundance. Tier 3 assays consist of general factors 3 associated with key biological processes (5 factors, 335 total experiments, 8% complete on average), 4 including POLR2A (transcription initiation), EP300 (enhancer activation), CTCF (insulation, chromatin 5 looping), SMC3 (cohesin), and RAD21 (cohesin component, double-stranded-break repair). Lastly, Tier 6 4 assays (22 marks, 133 experiments, 0.7% complete on average) consist of 16 acetylation marks, 4 7 methylation marks (H3K9me2, H3K79me1, H3K9me1, H3K23me2), H3F3A, and H3T11ph. 8 We generated a total of 14,952 imputed datasets covering 18 marks and assays across 859 samples 9 using ChromImpute 25 . We constructed genome-wise browser visualization tracks available from the 10 WashU Epigenome Browser 29 . Imputed tracks showed strong agreement with high-quality observed 11 tracks, individual enhancer elements, precise boundaries in histone modification peaks, and finer-12 grained features of epigenomic assays, such as dips in H3K27ac peaks indicative of nucleosome 13 displacement (Fig. 1b). They also captured the density, location, intensity, and cell-type-specificity of 14 both active and repressed marks across thousands of regions and all 859 samples (Fig. 1c). We found 15 85% peak recovery and 75% average genome-wide correlation for puncate marks representing 59% of 16 tracks (Supp. Fig. S2). 17 Disagreement between imputed and observed tracks helped flag 138 potentially problematic datasets 18 which showed markedly lower QC scores (Supp. Fig. S2, S3), and revealed potential sample or 19 antibody swaps (Supp. Fig S4, S5) some of which were independently flagged by the data producers. 20 We also used the difference between observed and imputed datasets to recognize 15 experiments with 21 potential antibody cross-reactivity or secondary specificity (Supp. Fig S6-S8). We removed from 22 subsequent analyses the 138 flagged datasets and 442 tracks based solely on ATAC-seq or low-quality 23 DNase-seq data as they showed lower correlations, resulting in 2,850 observed and 14,510 imputed 24 marks across 833 samples used in the remainder of this work. 25 Sample relationships, chromatin states, and high-resolution enhancer mapping. 26 The resulting compendium of 833 high-quality epigenomes represents a major increase in biological 27 space coverage, with 75% of epigenomes (624 of 833) corresponding to new biological specimens 28 across 33 tissue groups (categories), providing the opportunity to study their relationships 29 systematically, providing insights into the primary determinants of the epigenomic landscape. Both 30 hierarchical and two-dimensional embedding 30 clustering of the genome-wide correlation patterns of 31 multiple marks (Fig. 2a,b, Supp. Fig. S9-S11) grouped these samples firstly by lifestage (adult vs. 32 embryonic) and sample type (complex tissues vs. primary cells vs. cell lines), and secondly by distinct 33 groups of brain, blood, immune, stem-cell, epithelial, stromal, and endothelial samples within them. By 1 contrast, donor sex was not a primary factor in sample grouping. 2 We clustered samples using each individual mark (Fig. 2a,b) evaluated within mark-specific relevant 3 genomic regions. We found that active marks (H3K27ac, H3K4me1, H3K9ac) primarily group samples 4 by differentiation lineage, leading to separate blood, immune, spleen, thymus, epithelial, stromal, and 5 endothelial clusters. These marks also grouped lung, kidney, heart, muscle, and brain samples each 6 into distinct clusters, regardless of whether they were adult or embryonic. By contrast, repressive marks 7 (H3K27me3, H3K9me3) better captured the stage of differentiation, grouping together pluripotent, 8 iPSC-derived, and embryonic samples, which were separated into different groups by active marks. In 9 each case, observed and imputed data co-clustered, but imputed datasets better captured the 10 continuity between different sample types, and more clearly revealed the grouping of sample types, 11 likely driven both by their cleaner signal tracks and their sheer number. 12 We generated a reference epigenomic annotations for each for the 833 samples, using combinations of 13 histone modification maps to map genome-wide locations of 18 chromatin states 7,8,10 (Fig. 2c), 14 including multiple types of enhancer, promoter, transcribed, bivalent, and repressed regions (Supp. 15 Fig. S12). We applied this model on a scaled mixture of observed and imputed data, excluding the 138 16 flagged observed datasets (see Methods). We found broad consistency in state coverage (Supp. Fig.  17 S13, S14) and state definitions (Supp. Fig. S15) across cell types. 18 We also generated a high-resolution annotation of active enhancer regions by intersecting 3. covering 0.8% of the genome in each sample on average, and together capturing 13% of the human 22 genome (Fig. 3a, Supp. Fig. S18) a 31% increase from previous maps (Supp. Fig S19), and 23 constructed an 'enhancer-sharing tree' (Fig. 2d, Supp. Fig. S16, S17) relating all 833 samples in a 24 nested set of binary groupings based on their number of shared active enhancers. 25

Regulatory genomics of enhancer modules, target biological processes, and upstream regulators 26
For each high-resolution enhancer region, we defined its activation state in each of the 833 27 epigenomes using flanking nucleosome H3K27ac levels, and used it to group enhancers with 28 coordinated activity into 300 enhancer modules ( Fig. 3a and Supp. Fig S20-S22). We distinguished 29 290 tissue-specific modules (1.8M enhancers, 88%) with activity in only 2% of samples on average, 30 and 10 broadly-active modules (251k enhancers, 12%) showing activity across 77% of sample 31 categories on average. Embryonic modules combined multiple tissue types, including heart, lung, 32 kidney, muscle, and digestive system, while adult modules separated internal organs. 33 We predicted candidate gene-regulatory roles of each enhancer module based on highly-significant 1 ( Fig. 6c), including coronary artery disease (CAD) 50 with 19 different tissue groups, including liver, 23 heart, adipose, muscle, and endocrine samples. 24 We next used the enriched tissues of multifactorial traits to partition their associated SNPs into 25 (potentially-overlapping) sub-groups, which were enriched in distinct biological pathways, thus revealing 26 distinct processes through which multifactorial traits may act (Fig. 6d, Supp. Fig. S29 This partitioning of genetic loci into tissues also helped inform the shared genetic risk between pairs of 2 co-enriched traits, by revealing the tissues that may underlie their common biological basis (Fig. 6d). 3 For example, the same partitioning of CAD loci showed that CAD loci in heart, muscle, and endothelial 4 enhancers were preferentially also associated with high blood pressure and atrial fibrillation risk loci. 5 However, CAD loci in liver and endocrine enhancers were instead associated with systolic blood 6 pressure 51 . Similarly CAD loci also associated with waist-to-hip ratio 49,51,52 overlapped adipose but not 7 liver, endocrine, or heart enhancers, and CAD loci associated with HDL cholesterol 53 overlapped liver, 8 adipose, and endocrine enhancer but not heart tissues. 9 Tissue enrichment and co-enrichment patterns paint network of complex trait associations. 10 We next classified each GWAS trait according to its enriched tissues, and linked traits showing similar 11 enrichment patterns into a trait-trait co-enrichment network (Fig. 7, and Supp. Fig. S30, S31). 12 Unifactorial traits formed the cores of highly-connected communities, including: cognitive and 13 psychiatric traits in brain and neurons; heart beat intervals in heart; cholesterol measures in liver; 14 filtration rate in kidney; immune traits in T-cells; and blood cell counts in hematopoietic cells. 15 Multifactorial traits connected these communities, including: CAD linking heart, endocrine, and liver; 16 HDL and triglycerides levels linking liver and adipose; lung function-related traits linking lung, heart, and 17 digestive tissues; blood pressure measurements surrounding heart and linking to endocrine, 18 endothelial, and liver; cell count fraction traits implicated principally blood often partnered with liver, 19 digestive, and other tissues. Polyfactorial traits (e.g. waist-to-hip ratio and heel bone mineral density) 20 were centrally located linking diverse categories. 21 We found that this trait-trait co-enrichment network captures many biologically-meaningful relationships 22 that are missed by genetic information alone (Supp Fig. S32, S33, S34). A genetic overlap network 23 connecting any two traits that share even 5% of their loci (Jaccard index, 10kb resolution) results in 24 only 934 edges, and only captures 5% of the edges captured by our epigenomics-centered analysis 25 (N=283 of 5547), highlighting the importance of our epigenomics lens in capturing the common 26 biological basis of complex traits. 27

28
In this work, we presented EpiMap, the most complete and comprehensive map of the human 29 epigenome, encompassing ~15k uniformly-processed, QC-metric pruned, and computationally-30 completed datasets. Our resource encompasses 833 distinct biological samples, each with 18 31 epigenomic marks, painting a rich epigenomic landscape. For each reference epigenome, we generate 32 a chromatin state annotation based on multiple chromatin marks and distinguish diverse classes of 1 enhancer, promoter, transcribed, repressed, repetitive, heterochromatic, and quiescent states. We also 2 provide a high-resolution enhancer annotation track, that combines multiple active enhancer states and 3 DNase-accessible regions, covering only 0.8% of each reference epigenome, and collectively 13% of 4 the genome across all 833 tissues. 5 EpiMap greatly expands the biological space covered by previous reference epigenome maps, by 6 incorporating samples of the endocrine system, placenta, extraembryonic membranes, reproductive 7 system, stromal and endothelial primary cells, and a large collection of widely-used cancer cell lines. 8 We also greatly increase the coverage of embryonic and adult brain, heart, muscle, kidney, lung, and 9 liver tissues, as well as blood, immune, lymphoblastoid, and epithelial cells. This broader biological 10 space has important implications both in capturing gene-regulatory elements and upstream regulators 11 of an increased set of tissue-specific biological pathways, and in annotating gene-regulatory variants 12 across a broader biological spectrum that now capture many more traits and disease phenotypes that 13 were previously uncaptured. 14 The integration and uniform processing of these samples enabled us to paint a detailed picture of 15 biological sample relationships, both outlining the relationships between our 33 broad tissue categories, 16 and detailing the relationships of individual samples within these categories. For example, we found 17 that lifestage played an important role in establishing high-level organization of sample similarities, that 18 primary cells clustered separately from their tissue of origin, that adult samples separated by tissue but 19 embryonic samples clustered together. We also recognized differences in the epigenomic relationships 20 painted by different marks, with repressive marks better distinguishing developmental stages, and 21 active marks better distinguishing different tissues and lineages. These relationships can help guide the 22 prioritization of new biological samples to profile for gene-regulatory or disease studies, by studying 23 gene-regulatory motif enrichments, downstream-gene pathway enrichments, and GWAS trait 24 enrichments in the context of the coverage of biological space by closely-and distantly-related 25

samples. 26
Our analysis enabled us to elucidate important gene-regulatory relationships between enhancers, their 27 gene-regulatory targets, and their upstream regulators. By recognizing modules of common enhancer 28 activity, we partitioned 2.1 million gene-regulatory elements into 300 co-regulated sets, distinguishing 29 broadly-active enhancer modules vs. tissue-specific modules, and their distinct gene-regulatory circuitry 30 and biological roles. Gene-regulatory sequence motifs enriched in modules of common activity patterns 31 enabled us to recognize upstream regulators of these modules, including many newly-profiled tissues 32 such as TEAD3 in placenta and epithelial cells and NFIB in various embryonic organs, and 33 distinguishing tissue-specific vs. promiscuous gene-regulatory motifs such as RFX1-5, GRHL1, 1 HNF1A/B, and AP-1. Similarly, common gene ontology enrichments of genes proximal to these 2 modules enabled us to pinpoint the common biological pathways they likely control in tissue-specific, 3 lineage-specific, and broadly-active biological roles. 4 Our work also provided the most comprehensive analysis to date of the gene-regulatory underpinnings 5 of complex traits and human disease. We found statistically-significant enrichments between 534 traits 6 and all 833 tissues, shedding light on 30,247 loci containing SNPs within enriched annotations, and 7 thus providing meaningful insights into their potential mechanisms of action. These enrichments helped 8 distinguish unifactorial, multifactorial, and polyfactorial traits, based on the number of distinct tissue 9 types they implicate, and revealed principal vs. partner tissues that play likely driver vs. auxiliary roles 10 across traits. The multiple enriched tissues in multifactorial traits allowed us to dissect their complexity, 11 by partitioning SNPs by tissues, which showed distinct gene pathway enrichments and shared genetic 12 risks with different related traits. Finally, we used these tissue-trait enrichment and co-enrichment 13 patterns to reveal the shared biological basis of hundreds of complex traits through the lens of their 14 enriched tissues, providing an important basis for studying the common and distinct components of 15 disease comorbidity relationships. 16 We expect EpiMap to enable many new methodological developments in the study of gene regulation 17 and disease. For example, the much more densely-populated tree of samples enabled us to develop a 18 new hierarchical approach to GWAS trait-tissue enrichment analysis that directly compares the 19 enrichment of parent-child tree node pairs, thus finding the appropriate level of tissue specificity where 20 genetic traits may act. The approach itself is general, and likely also applicable to hierarchical analyses 21 of motif and gene pathway enrichments across the tree. We expect that many additional multi-22 resolution approaches will be applicable to these datasets, enabling us to recognize variability between 23 and within group rigorously and systematically. Similarly, the large number of traits and tissues enabled 24 us to infer networks of tissue-trait, trait-trait, and tissue-tissue relationships, and we expect that many 25 additional graph algorithms and analysis approaches will be applicable to the study of these networks. 26 EpiMap also has several limitations that we hope will be overcome with continued technological 27 improvements. First, outside cancer lines and isolated primary cells, our tissue samples are primarily 28 from bulk dissections, which combine multiple underlying cell types, thus hiding the individual 29 contributions and distinct biology of each cell type, and sometimes missing altogether the contribution 30 of lower-abundance cell types. As single-cell ATAC-seq 55 and single-cell ChIP-seq 56,57 approaches 31 mature, we expect that new cell-type-specific maps will become possible, enabling us to partition the 32 enhancers and chromatin states discovered here into their constituent cell types, and also to discover 33 new enhancers and gene-regulatory elements from lower-abundance cell types that are currently not 1 detectable in bulk samples. The presence of such single-cell maps should also allow systematic 2 methods for epigenomic deconvolution 58 . In addition, our current approach for uniformly processing and 3 imputing missing marks does not take into consideration the genotype of different individuals, which 4 contains important information for inferring the activity pattern of genes or gene-regulatory regions in a 5 new individual based on their genotype, but also their phenotypic variables. Lastly, while EpiMap 6 represents a substantial increase relative to previous maps, we are still missing many tissues, 7 environmental stimulation conditions and developmental stages that may active enhancers and other 8 gene-regulatory elements that may not be currently visible in our compendium. 9

10
We thank the ENCODE, Roadmap, and GGR consortia for generating high-quality public datasets and 11 rapidly disseminating their results to the broader community. We thank Daofeng Li and Ting Wang, and 12 Idan Gabdank and J. Seth Strattan for making our observed and imputed genome-wide tracks and 13 chromatin state annotations available through the WashU Epigenome Browser and the ENCODE portal 14 respectively. We thank Jason Ernst for advice, guidance, and for developing the ChromImpute 15 methodology and code base. We thank Pouya Kheradpour for help with motif enrichment analysis 16 software. We thank Luke Ward for incorporating our annotations in HaploReg. We thank the Charles 17   . Table S1).

Epigenomic datasets and processing 2
Primary data sources and metadata information: We analyzed 3,030 datasets, including 2329 3 epigenomic ChIP-seq datasets, 635 DNase-seq datasets, and 66 ATAC-seq datasets from ENCODE at 4 https://www.encodeproject.org/, released as of Sept. 24th 2018. These marks include: Tier 1 assays: 5 DNase-seq, H3K4me1, H3K4me3, H3K27ac, H3K36me3, H3K9me3, and H3K27me3; Tier 2 assays: 6 ATAC−seq, H3K9ac, H3K4me2, H2AFZ, H3K79me2, and H4K20me1; Tier 3 assays: POLR2A, EP300, 7 CTCF, SMC3, RAD21; and Tier 4 histone marks: 16 non-imputed histone acetylation marks, 4 8 methylation marks (H3K9me2, H3K79me1, H3K9me1, H3K23me2), H3F3A, and H3T11ph. We 9 assigned unique sample IDs to each unique combination of: extended biosample summary, donor, sex, 10 age, and lifestage, wherever each attribute was available. We removed samples with genetic 11 perturbations, and kept only samples with appropriately matched ChIP-seq controls. We provide a 12 metadata matrix including the mapping between ENCODE accessions and our unique sample IDs 13 (Supp . Table S1, also at compbio.mit.edu/epimap). We mapped the 111 Roadmap epigenomes and 16 14 ENCODE 2012 epigenomes to any of our samples with overlapping dataset accessions if the 15 accessions were used in the flagship Roadmap epigenomics analysis. This mapping assigned 25 16 samples to ENCODE 2012 and 184 samples to Roadmap 2015, some of which were merged multi-17 donor samples in Roadmap, out of the final 833 samples that passed QC. These were merged into 16 18 and 111 tissue types respectively in the Roadmap 2015 publication. 19 Uniform data processing: We downloaded one alignment file per replicate, prioritizing filtered 20 alignments in hg19 whenever possible. We uniformly processed ChIP-seq and DNase-seq datasets 21 according to the processing pipelines established by the Roadmap Epigenomics Consortium 8 . Briefly, 1 we filtered out improperly paired and non-uniquely mapped reads, truncated reads to 36 base pairs, 2 filtered out a blacklist of low complexity and artifact regions (ENCODE accession ENCSR636HFF), and 3 filtered reads against a mappability track of uniquely mappable regions for 36bp reads 60 . We converted 4 bam files to tagAlign, used liftOver 61 to map GRCh38 alignments to hg19, and pooled all experiments 5 within each ID and assay combination. We subsampled pooled ChIP-seq data sets to a maximum of 30 6 million reads and DNase-seq and ATAC-seq data sets to a maximum of 50 million reads. We used SPP 7 to estimate fragment length. In cases with extremely low fragment length in ATAC-seq and DNase-seq 8 datasets we used the average fragment length (73) from the average of the rest of the tracks. We 9 generated -log10 p-value signal tracks against matched whole cell extract (WCE) for both ChIP-seq 10 and accessibility data sets using the MACS2 62 and the SPP 63 peak caller and cross correlation analysis 11 to identify the proper fragment length as in the Roadmap analysis. 12

Epigenomic Imputation 13
Imputation: We carried out epigenomic imputation on 859 unique cell types using ChromImpute 25 for a 14 total of 10,778 imputed datasets over thirteen Tier 1 and Tier 2 assays using predictors trained on all 35 15 epigenomic assays across 859 samples. We additionally imputed 4,345 datasets for the five DNA 16 associated factors, using only the 35 epigenomic assays as features to train predictors with 17 ChromImpute. We provide all imputed and processed observed tracks along with tracksets for the 833 18 QCed samples at https://epigenome.wustl.edu/epimap 29 . 19 Quality control: For imputation quality control (QC) and validation, we compare observed tracks to 20 imputed tracks when both were available (i.e. when at least two original observed datasets were 21 available for that cell type). We calculated all imputation QC metrics from the original ChromImpute 22 publication 25 , including genome-wide correlation, % imputed and observed peak recovery, and AUC for 23 all pairs of imputed and observed tracks. In addition to quantitative metrics, we visually inspected 24 epigenomic predictions as part of our quality control. We show (Fig. 1B) three dense and varied regions 25 of different resolutions (25kb, 200kb, 1.5Mb) for each of two randomly chosen samples containing both 1 observed and imputed tracks for each assay. We calculated epigenomic profile quality metrics NSC, 2 RSC, and read depth for all datasets and compared these to the imputation QC metrics (tables in Supp. 3 Table S1). We flagged low-quality tracks by detecting the elbow in the ranked correlation metrics, which 4 we calculated as the point where the change in correlation exceeded 5% of the correlation. 5 Sample and antibody swap detection: To systematically identify both potential sample or antibody 6 (Ab) swaps and poor quality experiments, we computed the correlation of each observed experiment 7 against all 10,734 imputed tracks for histone marks and assays (all imputed tracks before removing 8 samples by QC). We then calculated the average correlation among the top 10 most similar tracks to 9 each observed track. We flagged potential Ab swaps by comparing the average correlation against 10 samples of the putative mark against those computed for other marks. We fit an OLS model to each 11 mark comparison, flagged datasets with residuals greater than 3 standard deviations of the average 12 correlation, and visually confirmed 7 Ab swaps (6 low-quality tracks). Similarly, we flagged potential 13 sample swaps by comparing the correlation between imputed and observed tracks against the average 14 correlation in the top 10 tracks in the same mark. We fit an OLS model and flagged datasets with 15 residuals greater than 3 standard deviations of the residuals distribution. We report 19 potentially 16 swapped samples, of which 5 were also flagged as low-quality tracks (Supp. Fig. S8). 17 Secondary reactivities: In addition to genome-wide QC of imputed tracks, we also focused on the 18 specific differences between observed and imputed tracks. For each observed mark, we generated a 19 genome-wide 'delta' track, computed as the difference in signal intensity between observed and 20 imputed data, re-scaling imputed tracks to match signal intensity properties of the observed tracks, as 21 observed tracks showed a general bias for higher intensity. Some of these 'delta' tracks showed 22 surprisingly high correlations with 'primary' tracks of non-putative marks, indicating potential secondary 23 antibody reactivities. In order to flag these reactivities, we compared the average correlation of each of 24 the delta tracks to the top 10 closest imputed tracks for each mark. As with antibody swaps, we fit an 25 OLS model in each mark combination to flag outliers. We flagged 19 tracks and report 15 after visual 1 inspection as potential secondary reactivities or single replicate swaps (e.g. in the case of DNase-seq) 2 (Supp. Fig. S7 and S8). We noted that some cases showed clear difference tracks that don't match 3 available antibodies, suggesting that the secondary reactivity is not a common mark in our 4 compendium. 5 Biological space coverage: To evaluate the similarity of imputed and observed tracks across 6 samples, we calculated the pairwise genomic correlations between all pairs of imputed and observed 7 signal tracks. We hierarchically clustered each individual mark's imputed or observed correlation matrix 8 using Ward's method. We averaged all imputed matrices for the six main marks (H3K27ac, H3K4me1, 9 H3K4me3, H3K36me3, H3K27me3, H3K9me3) to create a fused correlation matrix, which we similarly 10 clustered. We plotted the hierarchically clustered tree for the fused matrix alongside the metadata 11 information for each epigenome using the circlize R package 64 .  Figure 4C, we reduced the enrichments by modules matrix to terms with a maximal -25 log 10 p-value > 4 that were enriched in less than 10% of modules. The full enrichment matrix is shown 1 as Supp. Figure S27. As in the case of the diagonalized module centers, we labeled each term 2 according to the module containing its maximal signal. We used a bag of words approach (as described 3 in Roadmap 8 ) to pick 72 representative terms out of 865 total terms for Figure 3B such that each tissue 4 group has at least one term and the rest are representatively allocated across groups. 5 Motif enrichment: We performed motif enrichment analysis across enhancer modules as described in 6 the Roadmap paper. Briefly, we measured enrichments for 1,772 motifs against a background of all 7 enhancers. We report the motif with the highest enrichment in any module for each of 300 previously 8 identified motif clusters 8,66 . We only report motifs with a maximal log 2 enrichment ratio of at least 1.5, 9 resulting in 86 motifs, which we show with their PWM logos against all 300 modules as Figure 3C. 10

GWAS enrichment analysis 11
We pruned the NHGRI-EBI GWAS catalog 1 (downloaded from https://www.ebi.ac.uk/gwas/docs/file-12 downloads on May 3rd, 2019) using a greedy approach: within each trait + PMID combination, we 13 ranked associations by their significance (p-value), and added SNPs iteratively if they were not within 14 5kb of previously added SNPs. We also removed all associations in the HLA locus (for hg19: 15 chr6:29,691,116-33,054,976). This reduced the catalog from 121k to 113k associations. Finally, we 16 reduced the catalog to 926 unique GWAS (from 5454 GWAS) with an initial sample size of at least 17 20,000 cases or individuals (wherever cases and controls were not annotated). This resulted in 66,801 18 lead SNPs, which landed in 33,417 unique genome intervals when we split the genome into 10,000bp 19

intervals. 20
Flat epigenome enrichments and module epigenome enrichments: We performed the hyper-21 geometric test to evaluate GWAS enrichments on full epigenomes and on modules. For these flat 22 enrichments, we compare each number of SNP-enhancer intersections for each enhancer set 23 (epigenome or module) to the full set of intersections in all M enhancers. As above, we correct for 24 multiple testing for each GWAS and enhancer set combination by computing and correcting with null 25 association p-values for epigenomes and modules using the null catalogs generated for the tree 1 enrichment. Rarefaction curves were calculated on the epigenome enrichments by iteratively adding 2 the sample that was either (Fig. 4D) significantly enriched or (Supp. Fig. S29) the maximal enrichment 3 for the most remaining GWAS until all GWAS were accounted for. 4 Tree epigenome enrichments: We constructed a tree by hierarchically clustering the Jaccard 5 similarity of the binary enhancer by epigenomes matrix using complete-linkage clustering. Then, for 6 each node in the tree, we calculated its consensus epigenomic set, defined as the set of all enhancers 7 present in all leaves of the subtree, such that each node's set was a superset of that of its parent. For 8 each GWAS, we asked whether the novel consensus enhancers at a node were significantly enriched 9 for lead SNPs by comparing the enrichment between each node and its parent as measured by the 10 likelihood ratio test between two logistic regressions. 11 Briefly, for each GWAS catalog unique trait and PubMedID, we find all intersections of its pruned SNPs 12 with M=2,069,090 enhancers. Then Y is an indicator vector of size M which shows the intersected 13 enhancers. We find all consensus enhancers (intersection of epigenomes in the subtree) in the node of 14 interest (vector X N ) and in its parent (X P ). All vectors are 1xM. We calculate X D = X N -X P (specific 15 enhancers), which is also in {0,1} (1xM) as each node contains a superset of its parent's enhancers. 16 We then calculate following two logistic regressions: 17 M1: Y ~ X P + 1 18 M2: Y ~ X P + X D + 1 19 We calculated the log-likelihood difference and apply the likelihood ratio test to test whether adding the 20 specific enhancers (M2) is significantly different from the parent model (M1). To correct for multiple 21 testing on a per GWAS and node basis, we generated 1000 null GWAS catalogs by shuffling the 22 associations across GWAS. We used these catalogs to compute the null association p-values for each 23 permuted GWAS and used the 1st, 5th, and 10th smallest p-values for each GWAS and node 24 combination as their 0.1%, 0.5%, and 1% FDR cutoff. 1 For the CAD example, GO terms 67 were calculated using the nearest gene of each enhancer hit by a 2 lead SNP. We pruned genes to expressed genes by calculating average RNA-seq profiles for each 3 tissue group and excluded genes that had log 2 FPKM of less than 2 in the average RNA-seq of each 4 sample's group. 341 of 833 samples have matched RNA-seq, which we list in addition to releasing the 5 processed data at compbio.mit.edu/epimap. We kept only the GO terms that were significant in 25% or 6 less of nodes, report the top 2 GO terms per node in Figure 6C, and all GO terms in Supp. Fig. S37. 7 Tissue similarity: We assigned each internal node in the tree to a unique tissue if its subtree's leaves 8 are more than 50% of that tissue and as "Multiple" if the subtree is not majority one tissue. We assigned 9 tissue labels to 641 of 832 (77%) internal nodes where the majority of leaves corresponded to a single 10 group. Using these assignments, we created a tissue by GWAS matrix by adding the -log10 p-values 11 for each tissue node set from all of the GWAS enrichments on the tree. We binarized this matrix and 12 computed the jaccard similarity across tissues to calculate a tissue similarity matrix. To assess 13 significance of tissue overlap, we compared each overlap value against the overlaps from 10,000 14 permuted enrichments. We collapsed each permuted matrix into a tissue by GWAS matrix to compute 15 the overlaps under the null. We performed the permutations for each tissue against other tissues by 16 shuffling the enrichment p-values on the node by GWAS matrix. Specifically, we (a) binarized the 17 enrichment matrix, (b) fixed the column of the group of interest and (c) permuted the remainder of the 18 matrix keeping its row and column marginals the same and then (d) calculated the cosine distance 19 between the permuted and the original matrix of enrichments. 20 Cross-GWAS Network: To evaluate the cross-GWAS similarity, we normalized the tissue by GWAS 21 matrix for each GWAS to obtain the proportion of significance attributed to each tissue for each GWAS 22 (Supp. Fig S38). We reduced the matrix to 538 significant GWAS with at least 20,000 cases (or 23 individuals when no cases were specified). We created a GWAS-GWAS network using the cosine 24 distance matrix as an adjacency matrix, keeping 5,547 links with a cosine distance of 0.25 or less. We 25 used the Fruchterman-Reingold algorithm to lay out the graph 68 . We use the tissue by GWAS matrix to 1 color links according to the maximum tissue in the product between each pair of nodes and to color 2 nodes according to the maximal tissue for each node (Supp Fig. S39) . 3 In order to compare the epigenetic network to trait genetic similarity, we binned snps in the GWAS 4 catalog into 10kb windows starting from the beginning of each chromosome. We counted the number of 5 intersecting bins between two traits and keep any trait pairs with jaccard similarity of at least 1%. To 6 compare this to the epigenetic network, we plotted only links in the epigenetic network that coincide 7 with any SNP-sharing GWAS pairs. Additionally, we plotted the heatmaps of the tree enrichments 8 distance matrix and the genetic similarity matrix side by side, first organized by hierarchically clustering 9 the enrichments matrix and then by clustering the genetic similarity matrix (Supp. Fig. S41 and S42).   (17) H3F3A (16) H2AK5ac (7) H2BK120ac (7) H2BK5ac (7) H3K18ac (7) H3K23ac (7) H3K4ac (7) H3K79me1 (7) H4K8ac (7) H2BK12ac (6) H2BK15ac (6) H3K14ac (6) H4K91ac (6) H3K56ac (4) H3K9me1 (4) H4K5ac (4) H2BK20ac (3) H3K23me2 (2) H2AK9ac (1) H3T11ph (1) (8) Eye (5) Myosat (6) Placenta & EEM (31) PNS (10) Thymus (11) Lung (43) Heart (41)

Figure 2
Brain (52) Digestive (73) Muscle (61) Mesench (4) Adipose (9) Urinary (3) Endocrine (29) Other (7) ESC (9) Pancreas (7) Spleen ( CORD 621_STOMACH     presynaptic membrane cell-cell signaling involved in cell fate commitment cell adhesion molecule binding neural precursor cell proliferation sensory perception of sound lens morphogenesis in camera-type eye central nervous system projection neuron axonogenesis potassium ion transport voltage-gated potassium channel complex synaptic transmission substrate-specific channel activity calcium ion transmembrane transporter activity intracellular non-membrane-bounded organelle symbiosis, encompassing mutualism through parasitism cytosolic ribosome ribosome SRP-dependent cotransl. protein targeting to membrane alpha-beta T cell activation response to type I interferon innate immune response-activating signal transduction granulocyte chemotaxis myeloid leukocyte mediated immunity cytokine production lymphocyte costimulation leukocyte activation involved in immune response membrane-enclosed lumen modulation by virus of host morphology or physiology mRNA splicing, via spliceosome covalent chromatin modification embryonic digit morphogenesis hemidesmosome assembly positive regulation of endothelial cell proliferation neg.         Cholesterol, total Systolic blood pressure Trait pair tissue co-enrichment (cosine similarity)