NUCOME: A Comprehensive Database of Nucleosome Organizations in Mammalian Genomes

Nucleosome organization is involved in many regulatory activities in various organisms. However, studies integrating nucleosome organizations in mammalian genomes are very limited, mainly due to the lack of comprehensive data management. Here, we present NUCOME, which is the first database to organize publicly available MNase-seq data resources and manage uniformly processed datasets covering various cell types in human and mouse. NUCOME provides standard, qualified and informative nucleosome organization data at the genome scale and at any genomic region for users’ downstream analyses. NUCOME is freely available at http://compbio.tongji.edu.cn/NUCOME/.

Keywords: Nucleosome, Database, Transcriptional regulation, MNase
The nucleosome is the fundamental unit of eukaryotic chromatin and is involved in regulatory activities through interactions with DNA-binding proteins, including regulatory factors, chromatin remodelers, histone chaperones and polymerases (1). Genome-wide nucleosome organization maps have been established in multiple species, and some consistent nucleosome positioning patterns of specific regulatory elements have been reported. For example, nucleosomes act as barriers to transcription factors (TFs) interacting with cis-regulatory elements. Nucleosome-free regions (NFRs) and regularly spaced nucleosome arrays are strongly associated with transcription initiation (1-4). Nucleosome remodelers are essential for nucleosome dynamics that remove, slide and reposition nucleosomes to overcome barriers and facilitate transcription initiation and elongation (3,5-7). The conserved nucleosome positioning patterns on transcription start sites (TSSs) and other regulatory elements highlight the importance of the nucleosome in regulatory activities.
Furthermore, previous studies have shown that tissue- and disease-specific nucleosome organization widely exists in the mammalian genome and is involved in cell differentiation (8,9), reprogramming (9,10), tissue impairment (11-13) and diseases (14-16). Most studies focus on identifying specific nucleosome organization features and associating these features with gene transcription or chromatin modifications. Cell type-specific nucleosome organizations typically indicate the chromatin environment that involves distinct regulatory factors and cellular processes (17). Therefore, nucleosome organization maps are critical for deriving a panoramic view of chromatin structure, chromatin modification and their relationship in regulatory function. However, compared to other types of regulatory landscapes, such as histone modification, DNA methylation and transcription factors, nucleosome organizations have not been sufficiently explored.
To systematically explore the regulatory function of the nucleosome, a comprehensive and dedicated database of nucleosome landscapes is urgently needed to manage, explore and analyze these data resources. MNase-seq is the most widely used technology for generating nucleosome organization maps (18,19). The large size of mammalian genomes is a major challenge in experiments that require a very high sequencing depth; thus, MNase-seq data analyses are complicated and time-consuming. We established a comprehensive database named NUCOME (Nucleosome Organizations in Mammalian Genomes, Figure 1) that organizes extensive MNase-seq data and characterizes nucleosome organizations. The following are the three major features of the NUCOME database: 1) NUCOME is one of the most extensive catalogs of MNase-seq data and is managed via a standard analysis pipeline and quality control (QC) metrics. 2) NUCOME provides high-quality nucleosome organization information for various human and mouse cell and tissue types. 3) NUCOME provides a web interface containing multiple modules, including data search, nucleosome organization visualization and an application that quantifies informative nucleosome organization data in genomic regions.
order. For each sample, a "Pass" or "Fail" label was assigned for each QC measurement, except for sequencing coverage. The criteria for "Pass" and "Fail" for AA/TT/AT di-nucleotide frequency, nucleosomal DNA length, nucleosome depletion at TSSs, and enrichment of well-positioned nucleosome arrays at DHSs and UTRs were defined previously (20). For nucleosome fuzziness downstream of TSSs, samples with values lower than 0.4 were defined as "Pass", while others were defined as "Fail".

Calculation of QC measurements
For each cell or tissue type, the reference nucleosome organization map was selected via two indicators. First, the total number of 'Pass' QC measurements for each sample was defined as the first rank indicator. Then, we calculated the sum of the rank quantiles of all QC measurements for each sample as the second indicator. For each cell or tissue type, the sample with the top rank for both indicators was selected as the reference nucleosome organization map.
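The two-indicator selection described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the metric names, the example values, the handling of ties, and the choice to use the second indicator as a tie-breaker for the first are all assumptions made for illustration.

```python
from typing import Dict, FrozenSet, List


def rank_quantiles(values: List[float], higher_is_better: bool = True) -> List[float]:
    """Return each value's rank quantile in (0, 1]; the best value gets 1.0.

    Ties are broken by input order (an assumption, not from the paper).
    """
    n = len(values)
    # Sort so that the best value ends up last and receives the highest rank.
    order = sorted(range(n), key=lambda i: values[i], reverse=not higher_is_better)
    quantiles = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        quantiles[idx] = rank / n
    return quantiles


def select_reference(
    samples: Dict[str, Dict[str, float]],
    passes: Dict[str, int],
    lower_is_better: FrozenSet[str] = frozenset({"fuzziness"}),
) -> str:
    """Pick the sample with the most 'Pass' labels, then the largest quantile sum."""
    names = sorted(samples)
    metrics = sorted(samples[names[0]])
    quantile_sum = {name: 0.0 for name in names}
    for metric in metrics:
        values = [samples[name][metric] for name in names]
        q = rank_quantiles(values, higher_is_better=metric not in lower_is_better)
        for name, qi in zip(names, q):
            quantile_sum[name] += qi
    # Assumption: the pass count is the primary key, the quantile sum the secondary key.
    return max(names, key=lambda name: (passes[name], quantile_sum[name]))


# Hypothetical QC values for three samples of one cell type.
samples = {
    "s1": {"tss_depletion": 0.8, "fuzziness": 0.30},
    "s2": {"tss_depletion": 0.6, "fuzziness": 0.35},
    "s3": {"tss_depletion": 0.9, "fuzziness": 0.50},
}
passes = {"s1": 2, "s2": 2, "s3": 1}  # number of 'Pass' labels per sample
reference = select_reference(samples, passes)
```

In this toy input, s1 and s2 tie on the number of 'Pass' labels, and s1 is chosen because its summed rank quantiles are larger.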

Prediction of TF binding sites
We collected TF ChIP-seq data from the Cistrome DB database (21) as the actual TF binding profile. Qualified ChIP-seq samples were selected by the QC measurements provided in the database. For each TF ChIP-seq sample, the top 5,000 binding peaks (ordered by the fold change of peaks) were defined as the positive binding sites (bound sites). The negative binding sites (unbound sites) were defined as 20,000 random DHSs that do not overlap with any detected peaks in the ChIP-seq sample. All 25,000 bound and unbound sites were used for evaluating the TF binding prediction. Motif score was the first predictor. We scanned the entire genome for significant motif hits for each TF by using BINOCh (22). For any TF bound or unbound site with one or more significant motif hits, the highest motif matching score among those hits was assigned as the motif score of that site. For sites without any significant motif hits, the motif score was assigned as 0. Nucleosome depletion level acted as the second predictor to improve the prediction performance. Here, the nucleosome depletion level at a given site was calculated as the degree of nucleosome occupancy deficiency at the center of the site. The maximum nucleosome occupancy in the central 200 bp bin of the site was defined as N_center, and the maximum nucleosome occupancy surrounding the central bin, spanning 200 bp on both sides, was defined as N_background. If N_center was larger than N_background, the nucleosome depletion level of the site was assigned as 0. Otherwise, the nucleosome depletion level of the site was calculated as 1 - N_center/N_background.
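The two predictors defined above can be written as a short Python sketch. The function names, the per-base-pair occupancy input, and the handling of a window with zero background occupancy are assumptions made for illustration; they are not taken from the NUCOME pipeline.

```python
from typing import Iterable, Sequence


def motif_score(hit_scores: Iterable[float]) -> float:
    """Highest score among significant motif hits at a site, or 0 if there are none."""
    return max(hit_scores, default=0.0)


def nucleosome_depletion_level(occupancy: Sequence[float], bin_size: int = 200) -> float:
    """Depletion level for a site, given per-bp occupancy over a 3 * bin_size window.

    The window is assumed to be centered on the site: one flanking bin,
    the central bin, and a second flanking bin.
    """
    if len(occupancy) != 3 * bin_size:
        raise ValueError("occupancy must cover exactly 3 * bin_size positions")
    left = occupancy[:bin_size]
    center = occupancy[bin_size:2 * bin_size]
    right = occupancy[2 * bin_size:]
    n_center = max(center)
    n_background = max(max(left), max(right))
    # Assumption: a window with no background signal is treated as not depleted.
    if n_background == 0 or n_center > n_background:
        return 0.0
    return 1.0 - n_center / n_background


# Toy window: flanks peak at 10 and 8, the center peaks at 5.
toy_window = [10.0] * 200 + [5.0] * 200 + [8.0] * 200
toy_depletion = nucleosome_depletion_level(toy_window)  # 1 - 5/10
```

Values close to 1 indicate a strongly depleted center relative to its flanks, while 0 indicates no depletion.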
A logistic regression model, fitted with the 'glm' function in R, was used to predict the actual TF binding status (bound or unbound) from the motif score and the nucleosome depletion level. The R package 'pROC' was used to evaluate the prediction power by calculating the true-positive rate and true-negative rate at different thresholds, and area under the curve (AUC) scores calculated by the 'auc' function in R were used as an indicator of the performance of the TF binding prediction. The prediction improvement was calculated as AUC(motif score + nucleosome depletion level) - AUC(motif score).
(Table S1), representing one of the most extensive data sets of nucleosome organization in mammalian genomes reported to date.
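The AUC-based evaluation described in the TF binding prediction method can be sketched in Python rather than R. This sketch does not fit the logistic model; it assumes that predicted scores from a motif-only model and from a motif plus nucleosome depletion model are already available, and it computes the AUC as the probability that a bound site outscores an unbound site (ties count as one half), which is equivalent to the area under the ROC curve. All values below are invented toy numbers.

```python
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Rank-based AUC: fraction of (bound, unbound) pairs ranked correctly.

    Quadratic in the number of sites, which is fine for a sketch but slow
    for 5,000 bound and 20,000 unbound sites.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both bound (1) and unbound (0) sites are required")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data: 1 = bound site, 0 = unbound site.
labels = [1, 1, 0, 0, 0]
motif_only = [0.9, 0.2, 0.5, 0.1, 0.3]  # hypothetical motif-only predictions
combined = [0.9, 0.6, 0.5, 0.1, 0.3]    # hypothetical motif + depletion predictions

# Prediction improvement as defined in the text.
improvement = auc(labels, combined) - auc(labels, motif_only)
```

In this toy case the second bound site has a weak motif score but is rescued by the combined model, so the improvement is positive.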

NUCOME provides standard qualified nucleosome organization maps
To avoid variation in the processed data among the datasets, we applied a streamlined analysis pipeline starting with the raw sequencing data. We developed a standard workflow for managing MNase-seq data named CAM that includes read mapping, nucleosome organization profiling, nucleosome array detection and QC assessment (20). The workflow guarantees the quality of the data in the database and offers a convenient approach for users to download processed, standard and qualified data for their downstream analyses.

Reference nucleosome organization maps selected based on quality comparison
To avoid users' confusion in selecting high quality data among various samples, we determined reference nucleosome organization maps based on a quality comparison. We

Quality-supervised data query guides users in filtering samples
The 'Search' module in NUCOME allows for data retrieval in a quality-supervised manner to

Data visualization of reference nucleosome organization maps
The 'JBrowser' module provides an overview of the reference nucleosome organization maps of the entire genome queried by gene or genomic region (Figure 2b). The nucleosome organization maps focus on genome-wide nucleosome occupancy and the location of nucleosome arrays. Nucleosome occupancy is measured as the number of reads mapped to a genomic site in a cell population and represents the probability that a nucleosome is localized at that site. Thus, nucleosome occupancy is an important feature that reflects DNA accessibility. It has been reported that well-positioned nucleosome arrays downstream of TSSs participate in transcription regulation with either positive or negative effects (1) in a cooperative context with other regulatory factors (23). We hypothesized that the location of nucleosome arrays may indicate significant regulatory activities. Thus, we display the nucleosome occupancy and nucleosome arrays in the data visualization module to present nucleosome organization on a genome-wide scale.

Informative nucleosome organization illustrating regulatory function
The 'NuP Browser' module provides users with a friendly and flexible platform to query nucleosome organization information in any genomic region or any gene. Compared with the 'JBrowser' module, 'NuP Browser' module provides more nucleosome organization features and quantifies the information as easily processed scores describing the local chromatin structure, including nucleosome occupancy, nucleosome array score, nucleosome depletion level and nucleosome profile. The text format output allows users to perform nucleosome positioning analyses without encountering difficulties in data processing.
The nucleosome typically participates in biological processes by interacting with other regulatory factors. Here, we used an example to explore the function of nucleosome organizations around TSSs in transcription activity. We profiled aggregated nucleosome positioning patterns around the TSSs of genes with different expression levels and compared the differences between the nucleosome positioning patterns (Supplementary Figure 2a). The analysis illuminated the relationship between nucleosome organization and transcription activity. The promoters of highly expressed genes tend to exhibit stronger canonical nucleosome organization, including nucleosome depletion at TSSs and a regularly spaced nucleosome array downstream of the TSSs (Figure 2c). The better nucleosome positioning pattern exhibits an increased nucleosome depletion level (Figure 2d) and reduced nucleosome fuzziness (Figure 2e). Many previous studies have reported similar observations using specific cell culture systems or tissues (24-30). Here, we showed that this mode of regulation is conserved among various human and mouse tissues and cell types by performing this analysis in multiple samples stored in NUCOME (Supplementary Figure 2b-e). Using this comprehensive database, researchers can explore the features of nucleosome-mediated regulation in depth.

Nucleosome organization influences transcription factor binding
Previous studies have confirmed that nucleosome positioning patterns around TFs may indicate the regulatory function of the TFs and play a predictive role in gene expression (31).
Here, we explored the ability to predict TF binding in vivo by introducing informative nucleosome organization (Supplementary Figure 3a).  Figure 3b, c). Taken together, these results demonstrate that nucleosome organization information derived from NUCOME can be widely applied to elucidate the transcription regulatory program in human and mouse.
NUCOME is a comprehensive database that organizes the most extensive collection of MNase-seq data and provides standard nucleosome organization maps of various cell and tissue types in human and mouse. NUCOME provides three modules in a web interface for 1) querying data, 2) visualizing reference nucleosome organization maps and 3) querying nucleosome organization information in any genomic region and providing text format outputs for users' downstream analyses. Given that nucleosome organization participates in various regulatory activities, and NUCOME is the first comprehensive database of nucleosome organization data, the database can be a valuable resource for elucidating a panoramic view of the transcription regulatory program in human and mouse.
We thank Ji Liao for his contribution in the early stage of this project.
Figure 1. Structure of the NUCOME database.
points are TFs whose AUC scores improve by more than 0.05, while the round points are TFs whose AUC scores improve by less than 0.05. b) The distribution of AUC score improvements obtained by introducing the nucleosome depletion level into the prediction model. The 'Improve_G' group includes TFs that exhibit at least a 0.05 improvement, while the 'Improve_S' group includes TFs with less than 0.05 AUC score improvement. c) A scatter plot with a red regression line revealing the significantly negative correlation between the prediction improvement obtained by introducing the nucleosome depletion level and the prediction performance using the DNA motif score only (p-value < 2.2 × 10^-16). The x-axis represents the AUC score obtained using the DNA motif score only. The y-axis represents the AUC score improvement obtained by introducing the nucleosome depletion level. d) The distribution of prediction performance using the DNA motif score only. The 'Motif_L' group includes TFs that have AUC scores less than 0.65 using the DNA motif score, while the 'Motif_H' group includes TFs with AUC scores higher than 0.65. e) Overlap between TFs in the 'Improve_S' group and the 'Motif_L' group. The p-value is calculated by the Chi-square test.