plot2DO: a tool to assess the quality and distribution of genomic data

Summary: Micrococcal nuclease digestion followed by deep sequencing (MNase-seq) is the most widely used method to investigate nucleosome organization on a genome-wide scale. We present plot2DO, a software package for creating 2D occupancy plots, which allows biologists to evaluate the quality of MNase-seq data and to visualize the distribution of nucleosomes near the functional regions of the genome (e.g. gene promoters and origins of replication).

Availability and Implementation: The plot2DO open source package is freely available on GitHub at https://github.com/rchereji/plot2DO under the MIT license.

Contact: razvan.chereji@nih.gov

Supplementary Information: Supplementary data are available at Bioinformatics online.


Introduction
Nucleosomes, each consisting of 147 bp of DNA wrapped around a histone octamer in about 1.7 turns, are the basic units of DNA packaging in eukaryotes (Luger et al., 1997). Access of DNA-binding factors to their target sites is about 10-20 times higher if these sites are located in nucleosome-free regions (NFRs) (Liu et al., 2006). Therefore, knowing the precise positions of nucleosomes and NFRs is essential for understanding DNA binding and gene regulation.
Currently, the most widely used method for mapping nucleosomes is MNase-seq: chromatin is digested with micrococcal nuclease (MNase) and the remaining undigested DNA fragments are subjected to high-throughput sequencing. Unfortunately, MNase has a strong sequence preference (Dingwall et al., 1981; Hörz and Altenburger, 1981), and the nucleosomal fragments that result from an MNase-seq experiment are affected by the degree of MNase digestion (Chereji et al., 2017). Furthermore, after a mild digestion, a large fraction of the genome is not yet broken into mono-nucleosomal DNA fragments (~150 bp long) and is discarded from further analysis, while after an extensive digestion, many nucleosomes occupying A/T-rich sequences are over-digested and lost from the sample of mono-nucleosomal fragments. Therefore, MNase-seq experiments require careful control of the level of digestion, and the variable degree of digestion must always be taken into account, especially when multiple samples are compared and differences in nucleosome occupancy are observed among them.
Here, we present plot2DO, a tool for plotting the 2D occupancy (2DO) of genomic data. It is useful not only for assessing the degree of digestion as an initial quality check of MNase-seq data, but also for gaining insight into the nucleosome organization near functional regions of the genome and into the kinetics of MNase digestion.

Usage
Plot2DO is an open source package written in R, which can be launched from the command line in a terminal. The user selects the type of distribution to be plotted (occupancy/coverage of undigested DNA fragments, distribution of fragment centers (nucleosome dyads), or distribution of the 5'/3' ends of the fragments) and the reference points to be aligned (transcription start sites (TSS), transcript termination sites (TTS), +1 nucleosomes, or a list of specific user-provided sites). The user can also choose the width of the plotted window and perform in silico size selection of the undigested DNA, by specifying the size limits of the fragments to be used as representative of the nucleosome population.
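The distribution types and the in silico size selection described above can be illustrated with a minimal Python sketch. This is not plot2DO's own code (which is written in R): the function names, the representation of a fragment as a 1-based inclusive (start, end) pair, and the choice of the leftmost coordinate as the 5' end are all assumptions made for illustration.

```python
def select_by_size(fragments, min_len, max_len):
    """In silico size selection: keep fragments with min_len <= length <= max_len."""
    return [(s, e) for (s, e) in fragments if min_len <= e - s + 1 <= max_len]


def positions_for(fragment, kind):
    """Return the genomic positions a fragment contributes to, for one plot type.

    kind is a hypothetical label, not a plot2DO option value.
    """
    start, end = fragment
    if kind == "occupancy":       # every base pair covered by the fragment
        return list(range(start, end + 1))
    if kind == "dyads":           # fragment center (nucleosome dyad)
        return [(start + end) // 2]
    if kind == "five_prime":      # assumed: leftmost coordinate of the fragment
        return [start]
    if kind == "three_prime":     # assumed: rightmost coordinate of the fragment
        return [end]
    raise ValueError(f"unknown distribution type: {kind}")


# Three fragments with lengths 147, 91 and 201 bp.
fragments = [(100, 246), (300, 390), (500, 700)]
nucleosomal = select_by_size(fragments, 140, 160)   # keeps only the 147-bp fragment
dyad = positions_for(nucleosomal[0], "dyads")
```

Selecting a narrow size range before plotting dyads is one way to restrict the analysis to fragments that plausibly represent intact mono-nucleosomes.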
Plot2DO allows the investigation of paired-end sequencing data originating from a variety of organisms (yeast, fly, worm, mouse, and human) and mapped to any of the following genome versions: sacCer3, dm3, dm6, ce10, ce11, mm9, mm10, hg18, hg19. The full list of available options and multiple usage examples are provided as supplementary data at Bioinformatics online.

Discussion and conclusion
The usefulness of plot2DO is demonstrated in Fig. 1. Fig. 1A shows the default panels generated by plot2DO. The 2DO plot (heat map) indicates the relative coverage of the DNA fragments of specified lengths, at different locations relative to the selected reference (TSS by default). Each row of the 2DO plot shows the average occupancy generated by DNA fragments of a given length (indicated on the right side) as a function of the position (indicated on top). Each column of the 2DO plot shows the average occupancy generated at a specific position relative to the reference point, as a function of the DNA fragment length. The average one-dimensional occupancy (shown above the 2DO plot), generated by stacking DNA fragments of all lengths, represents the sum of the elements in each column of the matrix shown in the heat map. To compute these occupancy profiles, the raw sequencing data are normalized such that the average occupancy is 1 for each chromosome. The third panel generated by plot2DO is the fragment length histogram, which is shown to the right of the 2DO plot. Note that this histogram takes into account all the sequencing reads, not just the ones that are mapped to the regions shown in the heat map (i.e. the lengths of the reads far from the reference points are also considered).
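The relationship between the 2DO matrix and the one-dimensional occupancy can be made concrete with a short Python sketch. It is a simplified illustration under stated assumptions, not plot2DO's implementation: all names are hypothetical, fragments and sites are assumed to lie on a single chromosome, strand orientation of the reference sites is ignored, and the unit-mean normalization is shown as a separate helper applied to a single coverage vector.

```python
def two_d_occupancy(fragments, sites, half_window, min_len, max_len):
    """Average coverage per (fragment length, position relative to site).

    Rows are fragment lengths min_len..max_len; columns are relative
    positions -half_window..+half_window. Fragments are 1-based inclusive
    (start, end) pairs on the same chromosome as the sites.
    """
    n_rows = max_len - min_len + 1
    n_cols = 2 * half_window + 1
    matrix = [[0.0] * n_cols for _ in range(n_rows)]
    if not sites:
        return matrix
    for site in sites:
        for start, end in fragments:
            length = end - start + 1
            if not (min_len <= length <= max_len):
                continue
            # Clip the fragment to the plotted window around this site.
            lo = max(start - site, -half_window)
            hi = min(end - site, half_window)
            for rel in range(lo, hi + 1):
                matrix[length - min_len][rel + half_window] += 1.0
    # Average over all aligned sites.
    return [[v / len(sites) for v in row] for row in matrix]


def column_sums(matrix):
    """One-dimensional occupancy: sum over fragment lengths at each position."""
    return [sum(col) for col in zip(*matrix)]


def normalize_to_unit_mean(profile):
    """Rescale a coverage vector so that its mean is 1 (left unchanged if all zero)."""
    mean = sum(profile) / len(profile)
    return [v / mean for v in profile] if mean else list(profile)


# Two fragments (11 bp and 5 bp) centered on a single site at position 100.
m = two_d_occupancy([(95, 105), (98, 102)], [100], 3, 5, 11)
profile = column_sums(m)
```

In this toy example the 11-bp fragment spans the whole window and the 5-bp fragment covers only the five central positions, so the column sums are higher in the center than at the two window edges.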
Apart from the TSS/TTS alignments, plot2DO can also align specific lists of sites that are provided by the user (Fig. 1B). The plots created by plot2DO are particularly useful for investigating the level of digestion and the effect of MNase on the subset of intact nucleosomes that are obtained in a sample. Fig. 1C shows the effect of MNase on the +1 nucleosomes (the first nucleosomes downstream from the gene promoters). As the digestion proceeds, the fragments protected by nucleosomes become shorter and, if the digestion is not stopped early enough, are eventually destroyed by MNase. See the Supplementary information for details on the data sets used in Fig. 1.
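As a rough illustration of how the shift toward shorter fragments could be quantified, the following Python sketch builds a fragment length histogram from all fragments and reports the most frequent length. Using the modal length as a summary of the digestion level is an assumption made here for illustration; it is not a statistic that plot2DO is stated to report, and the function names are hypothetical.

```python
from collections import Counter


def length_histogram(fragments):
    """Count fragments by length, using all 1-based inclusive (start, end) pairs."""
    return Counter(end - start + 1 for start, end in fragments)


def modal_length(histogram):
    """Most frequent fragment length; ties are resolved toward the shorter length."""
    top = max(histogram.values())
    return min(length for length, count in histogram.items() if count == top)


# Two 147-bp fragments and one 131-bp fragment.
hist = length_histogram([(1, 147), (10, 156), (20, 150)])
mode = modal_length(hist)
```

Comparing this value across samples gives only a crude indication of relative digestion; the full histogram and the 2DO plot carry considerably more information.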
Nucleosome mapping experiments are very expensive, and it is important to assess the quality of MNase digestion before a large amount of money is spent on sequencing. We recommend that before committing to a large sequencing experiment that results in hundreds of millions of reads, a trial sequencing round should be performed (e.g. getting only a few million reads) and the level of digestion should be examined using plot2DO. We strongly advocate that in all MNase-seq and MNase-ChIP-seq experiments, plot2DO should be the first plot to do.

The easiest way to download this package is by using the GitHub interface (click the green Clone or download button) at https://github.com/rchereji/plot2DO.
If you want to download the package from the terminal, then you first need to install a git client of your choice, using the package manager that is available for your system. For example, in Ubuntu or other Debian-based distributions of Linux, you can use apt-get:

    $ sudo apt-get install git

In Fedora, CentOS, or Red Hat Linux, you can use yum:

    $ sudo yum install git

Installers for Windows and OSX are available for download at the following websites: http://git-scm.com/download/win and http://git-scm.com/download/mac.
After git has been installed, run the following command from the folder where you want to download the plot2DO package:

    $ git clone https://github.com/rchereji/plot2DO.git

To run plot2DO, you will need to have R installed on your computer, together with the following R packages.

Dependencies
Plot2DO uses the following R packages: biomaRt, caTools, colorRamps, GenomicRanges, optparse, rtracklayer, and Rsam-

Note that with the --simplifyPlot=on option, it is possible to plot only the 2DO panel (without the panels showing the one-dimensional occupancy and the histogram of fragment lengths), which makes it easier to combine multiple such panels in a single figure (Fig. S3).

Plot2DO can process sequencing data from multiple organisms: Saccharomyces cerevisiae, Drosophila melanogaster, Caenorhabditis elegans, Mus musculus and Homo sapiens. The following genome versions are available for these organisms: yeast: sacCer3; fly: dm3, dm6; worm: ce10, ce11; mouse: mm9, mm10; human: hg18, hg19. For the multicellular organisms the alignment of +1 nucleosomes is not possible, as the locations of these nucleosomes could vary from cell type to cell type, and these positions should be identified separately in each cell type.

The figures resulting from these commands are shown in Figure S4.