Abstract
The identification of sites of DNA replication initiation in mammalian cells has been challenging. Here, we present unbiased detection of replication initiation events in human cells using BrdU incorporation and single-molecule nanopore sequencing. Increases in BrdU incorporation allow us to measure DNA replication dynamics, including identification of replication initiation, fork direction and termination on individual nanopore sequencing reads. Importantly, initiation and termination events are identified on single molecules with high resolution, throughout S-phase and across the human genome. We find a significant enrichment of initiation sites within the broad initiation zones identified by population-level studies. However, these focused initiation sites only account for ∼20% of all identified replication initiation events. Most initiation events are dispersed throughout the genome and are missed by cell population approaches. This indicates that most initiation occurs at sites that, individually, are rarely used. These dispersed initiation sites contrast with the focused sites identified by population studies, in that they do not show a strong relationship to transcription or a particular epigenetic signature. Therefore, single-molecule sequencing enables unbiased detection and characterisation of DNA replication initiation events, including the numerous dispersed initiation events that replicate most of the human genome.
Introduction
DNA replication is a fundamental cellular process that is conserved throughout life. It is critical for genomic stability that genomes are replicated once and only once. In eukaryotes, DNA replication is initiated from multiple sites along each chromosome, for example, tens of thousands across the human genome. These sites have been linked to disease as sites of elevated mutation rates [1, 2] and have also been implicated in chromosomal translocations [3–5]. In Saccharomyces cerevisiae, sites of DNA replication initiation are defined by a sequence motif bound by the origin recognition complex (ORC) [6], although they vary in how frequently they are used [7]. However, in metazoans, ORC has weak sequence specificity and there is conflicting evidence for sequence bias, such as G-quadruplexes, at replication initiation sites [8–10].
Numerous genomic methods have been used either to examine the dynamics of DNA replication or to directly identify the sites of replication initiation [10]. Predominantly, these methods utilise short-read DNA sequencing (or microarrays) and measure the average signal from a population of millions of cells. Cell population replication dynamics have been studied either by examining replication time (e.g. repli-seq [11, 12]; sort-seq [13, 14]) or by determining replication fork direction (e.g. Okazaki fragment sequencing - ‘Ok-seq’ [15, 16]; Polymerase-usage sequencing - ‘Pu-seq’ [17, 18]; GLOE-seq [19]). In mammalian cells, these methods have identified broad zones of replication initiation (30-100 kb, initiation zones - IZs), separated by large regions (interquartile range 83-369 kb; median 183 kb) of implied unidirectional fork progression [16]. There is a high degree of concordance within and between these approaches, for example, in the identified IZs [10]. Alternatively, multiple approaches have been used to directly identify replication initiation sites, including the abundance of short nascent strands (SNS-seq) [20], ‘bubble’ structures at activated origins [21], early replication (e.g. EdUseq-HU; ini-seq) [22–24] or binding of replication initiation factors (chromatin-immunoprecipitation of ORC or MCM complexes) [25–27]. However, in mammalian cells, these approaches differ in the number and locations of sites identified and show poor concordance with IZs identified from replication dynamics studies [10, 28, 29]. This discrepancy could result from each assay detecting different steps in the process of genome replication (e.g. licensing, origin activation, fork progression, etc.). However, it has also been postulated that this inconsistency could be due, in part, to the reliance on cell population-based approaches to study what may be a heterogeneous process, thus resulting in low sensitivity and a high rate of false negatives [10].
Single-molecule and single-cell analyses of DNA replication have the potential to resolve the complexity of heterogeneity within cell population data. Genome sequencing from single cells in S phase reveals which portions of the genome have already replicated [30]. This gives a ‘snapshot’ of the state of genome replication, but without the resolution to identify individual replication initiation sites. By comparison, single-molecule approaches, such as combing and DNA fibre, rely on labelling nascent strands with radiolabelled nucleotides or, more recently, halogen- or fluorophore-labelled analogues that are then detected by radiography or microscopy [31–33]. These methods indicate that replication initiation is highly stochastic with sites ∼100 kb apart [33–35]. However, they are low throughput and generally lack genomic location information. Hybridization probes can give some location information; however, this further reduces throughput [36]. Recently, an Optical Replication Mapping (ORM) method generated genome-wide replication dynamics on single megabase-length molecules at high coverage and with genomic coordinate information [37]. However, ORM requires cell synchronisation and transfection with bulky fluorophore-labelled nucleotides and is therefore only suitable for certain cell types. In addition, ORM only labels the first 2% of S-phase, depends upon activation of the intra-S-phase checkpoint, and has a resolution of ∼15 kb.
To address the limitations of population-sequencing approaches and optical single-molecule techniques, we developed the first genome-wide DNA sequencing method using ultra-long single molecules to detect DNA replication dynamics, called DNAscent [38]. This method utilises nanopore sequencing and bespoke base-calling models to identify, at base resolution, the sites of BrdU incorporation [38, 39]. We validated this approach in Saccharomyces cerevisiae by in vivo incorporation of BrdU during a single S-phase. The resulting patterns of BrdU incorporation allowed high-throughput and high-resolution identification of replication fork direction and sites of replication initiation, termination and fork pausing on single molecules. This demonstrated that, even in S. cerevisiae with its generally well-defined origin sites, 10-20% of replication initiation events are at sites not detected by population-level approaches. Variants of this nanopore sequencing-based method have allowed genome-wide measurement of mean fork velocity [40] and ensemble identification of replication fork pause sites [41].
Here we apply DNAscent to cultured human cells and, for the first time, identify DNA replication initiation events on single, sequenced molecules across the human genome. The sites we identify are enriched within the initiation zones previously identified by population-level replication dynamics studies; we term these ‘focused initiation sites’. However, the majority of initiation sites identified by DNAscent are outside previously reported initiation zones; we term these ‘dispersed initiation sites’. Unlike focused sites, dispersed sites are not related to a particular epigenetic mark or transcription context. We propose a model that integrates the ‘focused’ sites of high replication initiation efficiency and low-efficiency ‘dispersed’ initiation sites occurring throughout most of the genome.
Results
Nanopore detection of BrdU in human genomic DNA
We set out to establish suitable growth conditions for cultured human cells (HeLa-S3 and hTERT-RPE1) that would allow detection of BrdU incorporated into nascent DNA using nanopore sequencing (DNAscent). Asynchronously growing cells were treated with a range of BrdU concentrations (0.3 – 50 μM), either for the duration of one cell cycle (20 hr HeLa-S3 or 27 hr hTERT-RPE1) or for pulses of 2 or 24 hrs (Fig. 1; Supplementary Fig. S1). Then, extracted genomic DNA was subjected to PCR-free nanopore sequencing followed by detection of BrdU at single-base resolution. This produces a BrdU probability at every thymidine position on each single-molecule read (Fig. 1A). Analysis of these BrdU probabilities showed that they have concentration-dependent bimodal distributions, each with a peak close to zero (i.e. thymidine calls) and a second peak approaching a probability of one (i.e. BrdU calls) (Fig. 1B; Supplementary Fig. S1A). As expected, the proportion of high BrdU probabilities depended on the BrdU concentration used in labelling. Similar results were observed for both cell lines and various BrdU pulse lengths (Supplementary Fig. S1A). We used a probability threshold of >0.5 to call BrdU at each thymidine position on each single-molecule read. With this threshold, we observed a low false positive rate (0.1%) on genomic DNA from cells not treated with BrdU (Fig. 1B), consistent with data from S. cerevisiae [39].
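The thresholding step described above can be illustrated with a minimal Python sketch. It is not the DNAscent implementation: the function names and the toy probabilities are hypothetical, and only the >0.5 threshold is taken from the text.

```python
from typing import List

# Threshold stated in the text: a thymidine position is called as BrdU
# when its BrdU probability is greater than 0.5.
BRDU_THRESHOLD = 0.5


def call_brdu(probabilities: List[float], threshold: float = BRDU_THRESHOLD) -> List[bool]:
    """Return one True/False BrdU call per thymidine position (hypothetical helper)."""
    return [p > threshold for p in probabilities]


def fraction_called(probabilities: List[float], threshold: float = BRDU_THRESHOLD) -> float:
    """Fraction of thymidine positions called as BrdU; 0.0 for an empty read."""
    calls = call_brdu(probabilities, threshold)
    return sum(calls) / len(calls) if calls else 0.0


# Toy probabilities, invented for illustration (not real data).
toy_read = [0.02, 0.97, 0.50, 0.88, 0.10]
print(call_brdu(toy_read))
print(fraction_called(toy_read))
```

Applied to reads from cells not treated with BrdU, a quantity like `fraction_called` would correspond to a false positive rate of the kind reported above.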
Individual BrdU probabilities were used to determine the proportion of thymidine positions substituted with BrdU in windows (of 290 thymidines, corresponding to ∼1 kb for the human genome) across each single-molecule read (Fig. 1C, 1D, Supplementary Fig. S1B). Under these conditions, we observed reads with background levels of BrdU incorporation (Fig. 1A & 1C, left) consistent with parental DNA, and reads with higher BrdU incorporation (Fig. 1A & 1C, right) consistent with nascent DNA. Analysis of these BrdU substitution levels showed that they have concentration-dependent bimodal distributions (Fig. 1D). For example, with 1.5 or 10 μM BrdU treatments we observed ∼35% or ∼65% modal BrdU incorporation, respectively, on the nascent DNA. This indicates that there is a wide range of BrdU pulse concentrations (≥1.5 μM) that enables parental and nascent DNA to be distinguished. Levels of BrdU substitution are comparable between cell lines and independent of pulse length (Fig. 1D, Supplementary Fig. S1B).
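The windowing of per-thymidine calls into substitution levels can be sketched as follows. This is an illustrative sketch only: the text specifies windows of 290 thymidines, but whether windows overlap and how a trailing partial window is handled are assumptions made here, and the function name is hypothetical.

```python
from typing import List, Sequence


def windowed_substitution(calls: Sequence[bool], window: int = 290) -> List[float]:
    """Fraction of thymidines called as BrdU in consecutive windows along a read.

    Assumptions (not stated in the text): windows do not overlap, and a
    trailing partial window is discarded.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    fractions = []
    for start in range(0, len(calls) - window + 1, window):
        chunk = calls[start:start + window]
        fractions.append(sum(chunk) / window)
    return fractions


# Toy read: one fully substituted window, one unsubstituted window,
# and 10 leftover thymidines that are dropped.
toy_calls = [True] * 290 + [False] * 290 + [True] * 10
print(windowed_substitution(toy_calls))
```

Counting thymidines rather than base pairs keeps the number of measurements per window constant regardless of local sequence composition.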
We used mass spectrometry to determine the level of BrdU incorporated into bulk cellular DNA for a range of BrdU treatment concentrations. We observe a clear concentration-dependent increase in BrdU incorporation. Next, we used the mass spectrometry data to validate the nanopore sequencing measurements of BrdU incorporation. We see a strong concordance between the two methods across a range of BrdU treatment concentrations, pulse times and cell lines (Fig. 1E, Supplementary Fig. S1C). For cells subjected to a 50 μM BrdU treatment, the BrdU incorporation level determined by nanopore sequencing is slightly lower than the level determined by mass spectrometry. This is consistent with the DNAscent algorithm slightly under-calling BrdU in DNA with very high levels of BrdU incorporation (e.g. >80% in the nascent strand for the 50 µM BrdU treatment; Fig. 1D), as previously reported [39]. We also performed BrdU-immunoprecipitation and short read sequencing (BrdU-IP-seq) from the same cell cultures (Supplementary Table S1, Supplementary Fig. S1C). BrdU-IP-seq data does not determine absolute levels of BrdU incorporation, rather relative levels of incorporation between samples that have been barcoded and pooled prior to immunoprecipitation [11]. Although we observe a concentration-dependent increase in nascent strand pull-down, the BrdU-IP-seq data is not linearly related to the amount of BrdU incorporated as determined by mass spectrometry (or nanopore sequencing; Supplementary Fig. S1C).
With cell cultures labelled for a time equivalent to one cell cycle, we can expect up to 50% labelled, nascent reads. We classified reads with ≥5% BrdU as nascent strands and found ∼40% of such nascent reads in both HeLa-S3 and hTERT-RPE1 cell cultures (Supplementary Table S2). These values are consistent with expectation given that some cells exhibit a longer G1-phase [42] and a fraction of cells within cultures are quiescent. Therefore, we have established conditions under which nanopore sequencing can be used to detect BrdU incorporated into nascent human genomic DNA.
Independent detection of BrdU and CpG methylation
We sought to determine whether nanopore sequencing could independently detect CpG methylation and BrdU incorporation on the same DNA strand without interference. DNAscent was trained on S. cerevisiae genomic DNA, which lacks base methylation, and conversely, nanopore detection of methylation has not previously been undertaken in the presence of BrdU. To examine the potential for interference, we called CpG methylation (using Nanopolish [43]) and BrdU incorporation (using DNAscent) in parallel on sequencing reads from cultures treated with a range of BrdU concentrations (Fig. 1F). As previously reported [43], nanopore sequencing for detection of 5mC agrees well with bisulfite sequencing datasets (not shown), unaffected by the presence of BrdU in the reads analysed. For example, very high levels of BrdU incorporation (∼60%) in a nascent read do not result in an increase in methylation detection (Fig. 1A, C & F). Furthermore, the example nascent read contains a methylated CpG island (70.41-70.42 Mbp) that has not affected the frequency of BrdU calls (Fig. 1A, C & F).
Next, we sought to quantify, genome-wide, any effect of 5mC on BrdU detection by nanopore sequencing. First, hypo- and hypermethylated CpG islands (CGIs) in HeLa-S3 cells were identified using published bisulfite sequencing data [44]. From the published data, we observe ∼140-fold more methylation in hyper- compared to hypomethylated CGIs. Then, in these CGI categories, we assessed levels of BrdU detection in nanopore sequence reads from HeLa-S3 cell cultures treated with a range of BrdU concentrations (in 100 bp windows; Fig. 1G and Supplementary Fig. S1D). In data from cells not treated with BrdU, we observe a BrdU false positive rate of ∼0.1% at hypomethylated CGIs (Fig. 1G, left panel, black line), consistent with the genomic average. Within a narrow window centred on hypermethylated CGIs we observe a slight increase in base-resolution BrdU false positives, i.e. on DNA not treated with BrdU, peaking at ∼5% (Fig. 1G, left panel, blue line). Conversely, in reads with a high level of BrdU incorporation (∼80% on nascent reads; Fig. 1D), we observe a slight dip in base-resolution BrdU detection (indicating false negatives) across hypomethylated CGIs (Fig. 1G, right panel, black line). Across hypermethylated CGIs we observe no variation in BrdU detection (Fig. 1G, right panel, blue line). Hence, even in the most methylated regions of the genome we observe only minor, localised effects on BrdU detection. In subsequent analyses, we look at BrdU incorporation in windows (of 290 thymidines) where any minor effect of methylation on BrdU detection will be further reduced by >5-fold. Therefore, we consider that variation in methylation does not adversely affect our ability to detect levels of BrdU incorporation in nanopore sequencing reads.
Whole genome single molecule identification of replication initiation sites
To identify temporal patterns of DNA replication on nascent single molecules, we treated asynchronously growing cell cultures with increasing concentrations of BrdU (0 μM to 12 μM in 0.5 μM increments over 1 hr, followed by a further 1 hr at 12 μM; Fig. 2A). Then, ultra-long nanopore sequencing reads (N50 >120 kb) were generated and used to determine sites of BrdU incorporation, as described above. In Figures 2B and 2C, we show two example single molecule reads visualising BrdU probabilities at each thymidine (black dots) and windowed levels of BrdU incorporation (blue lines). Background levels of BrdU signal represent sequences replicated prior to the addition and incorporation of BrdU (Fig. 2 B & C). Gradients of increasing BrdU incorporation, corresponding to sequences replicated during the sequential BrdU additions, indicate replication fork direction. Therefore, minima and maxima in BrdU incorporation identify DNAscent replication initiation and termination events respectively. On this basis, replication fork direction and DNAscent replication initiation and termination sites were identified computationally genome-wide (see methods; Fig. 2 B & C, fork direction indicated by open arrows, light and dark green bars indicate initiation and termination sites, respectively).
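The logic linking BrdU gradients to fork direction, initiation and termination can be illustrated with a toy Python sketch. This is not the computational procedure used by DNAscent (which is described in the methods and not reproduced here); the function names and the `min_step` noise tolerance are hypothetical. It assumes windowed BrdU fractions ordered by increasing reference coordinate, so that BrdU rising from left to right indicates a rightward-moving fork.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # inclusive (first_window, last_window) indices


def fork_directions(fractions: List[float], min_step: float = 0.02) -> List[int]:
    """Direction between consecutive windows: +1 rightward, -1 leftward, 0 flat."""
    directions = []
    for left, right in zip(fractions, fractions[1:]):
        diff = right - left
        if diff > min_step:
            directions.append(1)    # BrdU rising: fork moving rightward
        elif diff < -min_step:
            directions.append(-1)   # BrdU falling: fork moving leftward
        else:
            directions.append(0)    # change within the noise tolerance
    return directions


def find_events(fractions: List[float], min_step: float = 0.02) -> Tuple[List[Interval], List[Interval]]:
    """Return (initiations, terminations) as window-index intervals.

    A minimum (leftward fork followed by rightward fork) is an initiation;
    a maximum (rightward fork followed by leftward fork) is a termination.
    """
    initiations, terminations = [], []
    last_dir, last_idx = 0, -1
    for i, d in enumerate(fork_directions(fractions, min_step)):
        if d == 0:
            continue
        if last_dir == -1 and d == 1:
            initiations.append((last_idx + 1, i))
        elif last_dir == 1 and d == -1:
            terminations.append((last_idx + 1, i))
        last_dir, last_idx = d, i
    return initiations, terminations


# Toy profile with a flat-bottomed minimum spanning windows 2 and 3.
print(find_events([0.6, 0.4, 0.2, 0.2, 0.4, 0.6]))  # ([(2, 3)], [])
```

In this sketch the width of the returned interval reflects how precisely a minimum or maximum can be placed, which is one way to think about the resolution of an individual site.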
For the example 526 kb single molecule read in Fig. 2B, we identified three DNAscent replication initiation events and four termination events. For the example 309 kb single molecule read shown (Fig. 2C), we identified four DNAscent replication initiation events and three termination events. We note that the minima (DNAscent replication initiation events) vary in width, for example the minimum flanking 8.15 Mb is wider than the other three minima on this molecule (Fig. 2C). We consider two explanations. First, wide minima could result from multiple initiation events with replicons merging prior to the BrdU addition. Second, the width of minima could indicate how far replication forks have progressed from a single initiation event prior to BrdU addition. In the second scenario, the minima width is a measure of relative replication initiation time across a molecule, with wider minima arising from earlier replication initiation events. Given reported distances between replication initiation events of ∼100 kb [33–35], we favour this second explanation for these molecules, especially for narrower minima. Analogous to the scenarios at minima, wider maxima could either arise from a single replication termination event or multiple events occurring during the final 1 hr at 12 μM BrdU where we do not have temporal resolution.
At the depth of sequencing performed for the HeLa-S3 sample, we identified a total of 2,577 DNAscent replication initiation sites (Supplementary Table S3; these sites were used for subsequent initiation site density analyses) and 2,791 termination sites. For hTERT-RPE1 we find 912 initiation sites and 1,099 termination sites. Next, we filtered the set of initiation sites to those with a resolution of <5 kb (dashed line in Supplementary Fig. S2A), which are most likely to result from a single initiation event. The 1,690 high-resolution HeLa-S3 DNAscent initiation sites are distributed across the whole genome (this set was used for all intersection analyses). Comparisons to published population-level relative replication timing data (sort-seq from the same HeLa-S3 stock [13]) show that we have identified initiation sites throughout S phase (Fig. 2D, left panel). These observations are consistent with expectations, given that labelling was performed on asynchronously growing cell cultures with an unbiased representation of all stages of S phase.
To compare the density of DNAscent replication initiation sites between different genomic regions, we determined the number of replication initiations per gigabase of mapped reads (abbreviated to RIGR). Across four equally sized replication timing quartiles we observed a slight enrichment in replication initiation density within the earliest quartile (15% more than expectation; p<0.00005; Supplementary Table S4). We also observe a decrease in GC-content for DNAscent initiation sites from later replicating regions of the genome (Fig. 2D, panel 2). This may be a consequence of lower gene density in later replicating parts of the genome [45]. However, we do not observe any localised variations in DNA sequence composition or enriched sequence motifs associated with DNAscent initiation sites (see methods), consistent with low sequence specificity for human ORC [8].
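The RIGR statistic defined above is a simple normalised count, sketched below in Python. The function name and the example numbers are hypothetical; restricting both the site count and the mapped bases to the same genomic region is an assumption about how the statistic is partitioned between regions.

```python
def rigr(n_initiation_sites: int, mapped_bases: int) -> float:
    """Replication initiations per gigabase of mapped reads.

    Both arguments are assumed to refer to the same genomic region
    (for example, one replication timing quartile).
    """
    if mapped_bases <= 0:
        raise ValueError("mapped_bases must be positive")
    return n_initiation_sites / (mapped_bases / 1e9)


# Invented example: 100 sites found on 2 Gb of mapped read sequence.
print(rigr(100, 2_000_000_000))  # 50.0
```

Normalising by mapped read bases rather than by region size accounts for uneven read coverage between regions.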
Initiation sites identified by single molecule sequencing are enriched in Ok-seq initiation zones
The 1,690 high-resolution DNAscent replication initiation sites were compared with initiation sites and zones reported by other studies (in the same cell line where possible). For the example reads described above, of the seven DNAscent initiation sites, three intersect with published HeLa-S3 ORM initiation zones (Fig. 2B & C; yellow bars) [37] and two intersect with published HeLa-S3 Ok-seq initiation zones (Fig. 2B & C; blue bars) [16]. Across the high-resolution DNAscent initiation sites identified in HeLa-S3 cells, we observe a clear enrichment in published ORM [37], Pu-seq [17] and Ok-seq [16] initiation zones, which is most pronounced in early S phase (Fig. 2D). We determined the relative distance between Ok-seq initiation zones and DNAscent initiation sites to test for spatial correlation. DNAscent initiation sites and Ok-seq initiation zones occur in much closer proximity than expected by random chance (p<0.001; Supplementary Fig. S2B).
Given the strong enrichment of published Ok-seq initiation zones within and in close proximity to DNAscent initiation sites (Fig. 2D; Supplementary Fig. S2B), we next examined the reciprocal relationship. In HeLa-S3 cells, Ok-seq previously identified 8,984 replication IZs with a mean size of 32 kb [16]. Within these Ok-seq IZs we identified 507 DNAscent replication initiation sites in HeLa-S3 cells. For example, on chromosome 15 a 599 kb single-molecule sequencing read identified two DNAscent replication initiation sites, both of which are contained within Ok-seq IZs (Fig. 3A). Overall, within Ok-seq IZs, we find a DNAscent initiation site density that is double expectation (RIGR = 157 compared to an expectation of 77.8 from simulations; p<0.00001; Supplementary Table S4); we term these ‘focused’ DNAscent initiation sites. The high density of DNAscent initiation sites in Ok-seq IZs is observed across S phase but is greatest in the earliest (first) replication timing quartile (RIGR = 180) and progressively falls through the second (RIGR = 155), third (RIGR = 124) and fourth (RIGR = 104) timing quartiles. In summary, DNAscent identifies high-resolution replication initiation sites on single molecules, throughout S phase, that are enriched in the initiation zones identified by published population-level replication dynamics studies.
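The comparison of an observed RIGR against simulations can be sketched in Python as follows. The fold enrichment uses the values reported above. The text does not state how p-values were derived from the simulations, so the empirical p-value shown here is a common convention assumed purely for illustration, and both function names are hypothetical.

```python
from typing import Sequence


def fold_enrichment(observed: float, expected: float) -> float:
    """Ratio of observed density to the mean density from simulations."""
    if expected <= 0:
        raise ValueError("expected must be positive")
    return observed / expected


def empirical_p(observed: float, simulated: Sequence[float]) -> float:
    """One-sided empirical p-value for enrichment (assumed convention).

    Counts simulations at least as extreme as the observation and adds
    one to numerator and denominator so the estimate is never zero.
    """
    n_extreme = sum(1 for value in simulated if value >= observed)
    return (1 + n_extreme) / (1 + len(simulated))


# Values reported in the text for Ok-seq IZs.
print(round(fold_enrichment(157.0, 77.8), 2))  # 2.02
```

Under this convention, a p-value below 0.00001 can only be reported if at least 100,000 simulations were run and none reached the observed density.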
Most replication initiation occurs outside of initiation zones
Although we observe a strong enrichment of DNAscent initiation sites within and in close proximity to published Ok-seq initiation zones, we note that this was the case for only a subset of sites (Supplementary Fig. S2A). For example, despite this significant enrichment, only 20% of the DNAscent initiation sites lie within Ok-seq IZs (focused sites; 31% in the first quartile of S phase, falling to just 5% in the fourth quartile of S phase; Supplementary Table S4). Therefore, we undertook a more detailed comparison between these datasets. In Ok-seq data the proportion of reads mapping to each strand serves as a proxy for replication fork direction (RFD). Three features have previously been described from genomic plots of Ok-seq RFD: sharp positive gradients identifying IZs (Fig. 3A, B & C, upper panels, indicated by dark blue boxes), plateaus potentially consistent with a single progressing replication fork (Fig. 3B & C, upper panels, indicated by the absence of blue boxes), and gradual negative gradients identified as replication termination zones (TZs; Fig. 3A, B & C, upper panels, indicated by light blue boxes). We identified DNAscent replication initiation sites across all three Ok-seq features.
DNAscent initiation site density (RIGR) is ∼50% lower outside of Ok-seq IZs and lower than expected by a random distribution (Fig. 3D). However, Ok-seq plateaus and TZs cover ∼2x and ∼7x more of the genome respectively than Ok-seq IZs. Therefore, despite the lower initiation density, 19% and 61% of all DNAscent initiation sites intersect with Ok-seq plateaus and ‘termination’ zones, respectively – we term these ‘dispersed’ DNAscent initiation sites. We see clear examples of multiple replication initiation sites on individual molecules that span plateaus of population-level RFD (e.g. Fig. 3B) and within regions designated by population-level data as termination zones (e.g. Fig. 3C).
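The reasoning above, that a lower density spread over a much larger territory still yields most of the sites, can be checked with a short worked calculation. The sketch below assumes, for illustration only, that the ∼50% lower density applies equally to plateaus and TZs; the function name is hypothetical and the inputs are the approximate ratios quoted in the text.

```python
from typing import Dict


def expected_shares(coverage: Dict[str, float], density: Dict[str, float]) -> Dict[str, float]:
    """Share of sites per feature, proportional to coverage x density."""
    weights = {name: coverage[name] * density[name] for name in coverage}
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


# Genome coverage relative to Ok-seq IZs (approximate ratios from the text).
coverage = {"IZ": 1.0, "plateau": 2.0, "TZ": 7.0}
# Site density relative to IZs (assumed equal outside IZs, ~50% lower).
density = {"IZ": 1.0, "plateau": 0.5, "TZ": 0.5}

for name, share in expected_shares(coverage, density).items():
    print(f"{name}: {share:.1%}")
```

With these rounded inputs the shares come out at roughly 18%, 18% and 64%, close to the observed 20%, 19% and 61%, which suggests the reported proportions follow largely from the relative genome coverage of each feature.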
In contrast to Ok-seq, Pu-seq and ORM IZs, when considering all high-resolution (<5 kb) DNAscent initiation sites we did not observe a significant enrichment in the published sites identified by chromatin immunoprecipitation (ChIP; ORC [26] or Mcm7 [27]), SNS-seq [46], or Ini-seq [23, 24] (Supplementary Fig. S3A). However, when we consider just focused DNAscent initiation sites, we see a modest enrichment for Mcm7-ChIP [27], Ini-seq [23, 24] and SNS-seq sites [24, 46] (Supplementary Fig. S3B). Therefore, ∼20% of DNAscent initiation sites showed enrichment for the published initiation sites identified by a range of independent population-based genomic assays. However, the majority (80%) of replication initiation sites are missed by all population averaged datasets and are dispersed throughout the genome.
Transcription excludes replication initiation promoting co-directionality
For expressed genes, population-level studies of replication dynamics (e.g. Ok-seq) report a bias for co-directional replication and transcription at transcription start sites (TSSs), and a counter-directional fork bias at transcription end sites (TESs) [47]. As described above, the majority of replication initiation sites detected on single molecules are missed by population-level data (Fig. 3). Related to this, we observe many replication forks moving in a direction opposite to the average reported by bulk population data (e.g. rightward forks in Fig. 3B). Therefore, we tested whether our single-molecule replication fork directions displayed a bias at highly transcribed genes. Transcribed genes were divided into two categories: ‘low’ and ‘high’ transcription (based upon nucleoplasmic RNA-seq from HeLa-S3 cells [48]). The median gene transcription is ∼30-fold greater in the high transcription compared to the low transcription category (each containing 8,123 genes). At genes with a low level of transcription there is no significant bias in replication fork co-directionality (Fig. 4A, left panel). At genes with a high level of transcription, co-directional replication forks are significantly overrepresented from 3 kbp upstream of the TSSs and at least 10 kbp into the gene body (Fig. 4A, right panel, green line; p<0.01). Counter-directional replication and transcription is significantly underrepresented across the same region of highly transcribed genes (Fig. 4A, right panel, blue line). We observe similar replication fork biases at TSSs in the RPE1 dataset (Supplementary Fig. S4A). Additionally, we observe a modest trend towards overrepresentation of counter-directional replication upstream of the TESs of highly transcribed genes, although no individual point passes a 95% confidence threshold (Supplementary Fig. S4B).
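The classification of forks as co- or counter-directional with transcription can be sketched in Python. The encoding is an assumption made for illustration: fork direction is +1 for forks moving towards increasing reference coordinates and -1 otherwise, gene strand is '+' or '-', and the function names are hypothetical.

```python
from typing import Iterable, Tuple


def is_codirectional(fork_direction: int, gene_strand: str) -> bool:
    """True when the fork moves in the same direction as transcription."""
    if fork_direction not in (1, -1):
        raise ValueError("fork_direction must be +1 or -1")
    if gene_strand not in ("+", "-"):
        raise ValueError("gene_strand must be '+' or '-'")
    # A '+' strand gene is transcribed towards increasing coordinates.
    return (fork_direction == 1) == (gene_strand == "+")


def codirectional_fraction(pairs: Iterable[Tuple[int, str]]) -> float:
    """Fraction of (fork_direction, gene_strand) pairs that are co-directional."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no fork/gene pairs supplied")
    return sum(is_codirectional(f, s) for f, s in pairs) / len(pairs)


# Invented example: four forks overlapping genes.
toy_pairs = [(1, "+"), (-1, "+"), (-1, "-"), (1, "+")]
print(codirectional_fraction(toy_pairs))  # 0.75
```

In the absence of any bias this fraction would be expected to sit near 0.5, so departures from 0.5 at a given distance from the TSS indicate over- or underrepresentation of co-directional forks.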
The observations above suggest that replication initiation may preferentially take place flanking highly transcribed genes. As a direct test, we determined the density of DNAscent replication initiation sites (RIGR) within and flanking genes with low and high levels of transcription (as defined above). Across genes with a low level of transcription, we see no significant variation in DNAscent initiation site density compared to random expectations (Fig. 4B, left panel). However, in highly transcribed genes we see a significantly lower density of DNAscent initiation sites within the transcribed portion (RIGR = 50.3, compared to a mean of 81.8 from randomisations; p<0.00001; Fig. 4B, right panel). In contrast, DNAscent replication initiation site density is elevated within a 25 kb window upstream of the highly transcribed genes (RIGR = 126.2, compared to a mean of 73.7 from randomisations; p<0.00001; Fig. 4B, right panel). Therefore, we observe that highly transcribed genes tend to exclude replication initiation from the gene body, with initiation often taking place in upstream regions.
Next, we tested whether proximity to high transcription is a general property of DNAscent replication initiation sites or is specific to the focused subset. Figure 4C visualises transcript abundance within 100 kb of each DNAscent replication initiation site (Fig. 4C left heatmap early S phase; Supplementary Fig. S4D heatmap all sites; other transcription-associated chromatin marks, Supplementary Fig. S4D). We separately considered focused and dispersed DNAscent replication initiation sites with each subdivided by replication time. We observe clear minima in transcription centred on the focused DNAscent replication initiation sites (dark blue and red plots). This is clearest in early S-phase (dark blue), but apparent for the smaller number found within late S-phase (red). We see no evidence for a similar relationship to transcription for the more numerous dispersed DNAscent initiation sites (light blue and orange plots). Furthermore, the focused DNAscent initiation sites are also enriched for multiple signals of accessible/open chromatin (e.g. DNase-seq read depth, Fig. 4C right heatmap early S phase; Supplementary Fig. S4D), unlike the dispersed DNAscent initiation sites. Therefore, our single molecule replication dynamics data demonstrate a clear difference in transcriptional and epigenetic context between replication initiation sites that are focused in IZs and the majority that are dispersed throughout the genome. In summary, we report a novel class of ‘dispersed’ replication initiation sites, undetectable by population-level studies, that replicate most of the genome.
Discussion
For the first time, we report DNA replication initiation sites identified genome-wide in human cell cultures using ultra-long single molecule sequencing. To achieve this, we have established a set of experimental conditions (in HeLa-S3 and hTERT-RPE1 cells) resulting in concentration-dependent BrdU incorporation, followed by nucleotide resolution quantitative analogue detection with nanopore sequencing and DNAscent software (Fig. 1C & D). This approach allowed us to quantify the relationship between BrdU concentration in the culture media (0.3 – 50 µM) and the amount of BrdU incorporated in vivo by DNA polymerases (validated by mass spectrometry; Fig. 1E). We show that there is minimal interference between Me-CpG and BrdU detection in nanopore sequencing data, allowing independent quantification of both base modifications (Fig. 1F & G). Quantitative detection of BrdU incorporated into DNA allowed us to perform sequential additions of BrdU to the culture media to generate gradients of BrdU incorporation on single-molecule sequencing reads (Fig. 2A-C). This enabled replication fork direction to be determined and therefore sites of initiation and termination to be discovered. This method is performed in unperturbed cells and uses low levels of analogue in the culture media thereby avoiding the risk of inducing DNA damage and/or perturbing replication fork dynamics. Additionally, the fork direction information is generated without the need for high concentrations of a second potentially more toxic analogue [49–51] as used in double labelling protocols. Therefore, we present a method to identify DNA replication initiation and termination sites in human cells that is readily transferable between cell types and to answer various biological questions.
Using ultra-long single-molecule sequencing, we have identified and located thousands of high-resolution DNA replication initiation and termination events in cultured human cells. These include reads that feature multiple replication initiation and termination events (Fig. 2 & 3). The use of unperturbed, asynchronous cells has allowed us to identify replication initiation events throughout S-phase (Fig. 2D). These are functionally activated origins that we refer to as replication initiation sites. Given the high numbers of MCM double-hexamers loaded onto DNA in human cells [52, 53], these initiation events likely reflect only a fraction of licensed sites. Due to sequencing depth limitations, our single-molecule dataset samples initiation sites rather than exhaustively identifying all sites. Nevertheless, the frequency of initiation sites that we identify is consistent with a model in which many more sites are licensed (with MCM), and identified by methods such as MCM ChIP-seq, than are functionally activated to initiate replication forks.
A lack of concordance between various published methods for identifying DNA replication initiation sites has frequently been reported [10, 28, 37]. Here we identify replication initiation sites that are enriched in the initiation zones identified by population-level replication dynamics studies (Ok-seq, PU-seq) and initiation zones from an optical mapping study (ORM) (Fig. 2D & E) [16, 17, 37]. Remarkably, when considering DNAscent focused initiation sites, we also find concordance with other major published cell population methods (Supplementary Fig. S3B), despite previous comparisons between these methods not showing concordance. We propose that focused sites are used at higher frequency and are therefore identifiable by cell population methods. When these cell population datasets are filtered by the focused initiation sites identified here, the comparisons are less susceptible to method-specific noise. However, we do not find any significant sequence motif enrichment in the initiation sites we identified, whether considering all or only the focused initiation sites.
Although we show enrichment of replication initiation sites in the initiation zones identified in cell populations, this accounts for only 20% of DNAscent initiation sites with 80% situated outside of IZs (Fig. 5). We find that within the replication IZs identified by population-level studies there is a much higher density of DNAscent initiation sites (higher RIGR) compared to a lower density of initiation sites across the rest of the genome (Fig. 3D). Notably, we observe initiation sites within regions marked as termination zones in population-level data (Fig. 3C) and termination sites within population-level IZs (Fig. 3A).
Our single-molecule data, with multiple independent BrdU incorporation measurements across each read, allow high-confidence identification of individual replication initiation events, even when the site may be rarely used within a population of cells. Therefore, this method is well-suited to identifying the numerous initiation sites from which the majority of the genome is replicated, but which are individually rarely used and therefore missed by population datasets.
We propose that there are genomic regions that strongly favour replication initiation. In these regions, replication initiates in a sufficiently high proportion of cells to permit detection by cell population studies (Fig. 5). However, most replication initiation sites are more spatially dispersed with high cell-to-cell variability thus preventing their detection in population-level analyses. This model can reconcile the order-of-magnitude difference between the spacing of population-level IZs (megabase) and single molecule inter-origin distances (IODs; ∼100 kb). In addition, Ok-seq and ensembles of single-cell DNA replication sequencing (scRepli-seq) [30] imply megabase regions of unidirectional fork movement [16]. However, our data clearly identify multiple dispersed replication initiation sites across such regions. The paucity of initiation sites within highly-transcribed regions, observed in our data (Fig. 4B) and from population level Ok-seq studies [16, 47], leads us to propose that transcription may be one mechanism that strongly determines regions of favoured initiation. This could be due to a permissive chromatin environment (Fig. 4C, Supplementary Fig. S4D) adjacent to active genes and/or by displacing MCMs from gene bodies [53] (Fig. 5). Therefore, we propose that high levels of transcription confine replication initiation to consistent sites within the cell population, which we term ‘focused’ sites, that are thus observable by cell-population techniques (Fig. 5). In our dataset, 20% of replication initiation events are located within initiation zones – i.e. focused sites; whereas the other 80% of initiation events are located at sites dispersed throughout the genome (including within transcribed genes; Fig. 4C), with each site being used at low frequency within the population (Fig. 5).
Our demonstration of DNAscent in cultured human cells opens the way for numerous future studies with the potential to address further longstanding biological questions. For example, DNAscent can be applied to look at the roles of cis- and trans-acting factors in DNA replication and models of human disease via perturbation of gene function. DNAscent is applicable to a wide range of cell types and different organisms – requiring only that cells can be cultured, that cells incorporate BrdU and that high molecular-weight DNA can be extracted. Ultra-long molecules combined with improved reference genomes, for example recent telomere-to-telomere assemblies, will allow analyses of DNA replication in understudied repetitive regions of genomes, including centromeres and telomeres.
Furthermore, gradients of analogue incorporation have been shown to allow detection of replication fork pause sites [38]. Variants of the BrdU incorporation regime [54] or the use of multiple analogues [55] allows quantification of replication fork kinetics to identify the genomic context of challenges to replication fork progression. Additionally, independent detection of DNA methylation and BrdU incorporation on the same molecules will enable assessment of how methylation status impacts DNA replication dynamics and the kinetics of methylation re-establishment on nascent DNA.
More broadly, in addition to our single-molecule detection of DNA replication dynamics, we envisage that DNAscent will allow detection of BrdU incorporated through other cellular pathways, including DNA repair synthesis. Moreover, DNAscent could be used as a molecular tool to detect ex vivo labelled single-stranded DNA breaks, either those induced in vivo or generated in vitro, for example with a DNA glycosylase. Furthermore, steadily increasing nanopore sequencing throughput will allow the identification of more BrdU-labelled sites allowing greater statistical power in subsequent analyses.
Optical methods that visualise the incorporation of modified nucleotides into DNA have provided enormous biological insights into pathways of DNA replication, recombination and repair. However, to date these methods have been limited by low throughput, low spatial resolution, a lack of the underlying sequence context and/or the requirement for perturbation to S-phase progression. Our application of nanopore sequencing provides a step change to these powerful methods, to provide quantitative, high throughput, high resolution, sequence-specified detection of base analogues. This has allowed, for the first time, the discovery of replication initiation sites on single sequence-resolved molecules across the human genome.
Methods
Cell line maintenance and BrdU treatment
HeLa-S3 (adherent) and hTERT-RPE1 were maintained in DMEM Glutamax (HeLa-S3) or DMEM/F12 Glutamax (hTERT-RPE1, both Gibco), with the addition of 10% foetal bovine serum (Sigma) and 1% penicillin/streptomycin (Gibco). Cells were maintained at 70% confluency in 5% CO2. For BrdU pulse concentration scoping experiments asynchronous cultures were treated with the indicated concentrations (0.3 – 50 μM) of BrdU for 20 hrs (HeLa-S3) or 2, 24 or 27 hrs (hTERT-RPE1). For BrdU gradient experiments, asynchronous HeLa-S3 or hTERT-RPE1 cell cultures were treated with BrdU from 0 μM to 12 μM over 1 hr, with the addition of 0.5 μM every 2.5 minutes until 12 μM which was incubated for a further 1 hr.
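The arithmetic of the gradient regime above can be checked with a short sketch. It is illustrative only: the function name is hypothetical, and it assumes the first addition is made at time zero, which the text does not state.

```python
def brdu_gradient_schedule(step_um=0.5, interval_min=2.5, final_um=12.0):
    """Return (time in minutes, cumulative BrdU in uM) for each addition.

    Assumption (not stated in the text): the first addition is at t = 0.
    """
    n_additions = round(final_um / step_um)
    return [(i * interval_min, (i + 1) * step_um) for i in range(n_additions)]


schedule = brdu_gradient_schedule()
# 24 additions of 0.5 uM at 2.5 minute intervals reach 12 uM within the hour.
print(len(schedule), schedule[0], schedule[-1])
```

With these defaults the schedule has 24 steps, consistent with a 1 hr ramp followed by the 1 hr incubation at 12 μM.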
Sample harvesting
At appropriate time points samples were harvested for genomic DNA for long read nanopore sequencing, BrdU-IP short read sequencing and mass spectrometry. Cells were washed twice in ice-cold D-PBS and scrape harvested. Pellets were collected by centrifuging samples at 500 g for 5 mins at 4°C and stored at -20°C.
Standard DNA extraction
For genomic DNA extraction, care was taken to avoid vortexing and pipetting to prevent shearing of DNA. Where necessary, wide bore tips were used. DNA was extracted with phenol:chloroform; specifically, frozen cell pellets were resuspended in 250 μl D-PBS and 4 volumes of digestion buffer (20 mM Tris HCl pH 8.0, 0.2 M EDTA, 1% SDS, 100 μg/ml DNase free RNase A) and incubated for 5 mins at room temperature. Proteinase K was added to a final concentration of 1 mg/ml and incubated overnight at 56°C with gentle shaking.
Proteinase K addition was repeated until lysates were clear (1-4 hrs) then an equal volume of phenol:chloroform added. Samples were shaken well and separated by centrifuging at 1700 g, 10 mins. Phenol:chloroform addition and separation was repeated with the top aqueous layer. An equal volume of chloroform was added to the top aqueous layer, shaken well and centrifuged as above. DNA was precipitated with 1/10 volume of 7.5 M ammonium acetate and three volumes of ice-cold ethanol or isopropanol and centrifuged at 21130 g for 1 hr at 4°C. The DNA pellet was washed with 70% ethanol and air-dried. DNA was resuspended in 1x TE overnight at 4°C. DNA concentration was determined with high sensitivity dsDNA kit for Qubit as per the manufacturer’s recommendations (Invitrogen) and 260/230 and 260/280 purity determined with microvolume spectroscopy (Nanodrop or Denovix).
Ultra-high molecular weight DNA extraction
Ultra-long DNA extractions were performed using the Circulomics Nanobind CBB kit (NB-900-001-01) and UHMW DNA aux kit (provided by the manufacturer on request), following the Circulomics protocol “Nanobind UHMW DNA Extraction – Cultured Cells Protocol”. A pellet containing approximately 6 million cells was used as the sample input for each extraction. For each step that required tip-mixing, the samples were mixed continuously until homogeneous mixtures were achieved. For the overnight elution, a 10 µl pipette tip was left in each tube to ensure that the disc remained submerged in the elution buffer.
Oxford Nanopore Technology MinION sequencing
Sequencing libraries were prepared using the Genomic DNA by Ligation Sequencing Kit (Oxford Nanopore Technologies, SQK-LSK109) following the manufacturer’s instructions with the following changes to enrich for longer read lengths. DNA was incubated at 20°C for 30 mins and 65°C for 30 mins for the end repair step. All AMPure bead cleanup steps used 0.4x volume of beads and Long Fragment Buffer was used in the final AMPure bead elution wash steps. Sequencing adapter ligations were performed as 0.5x volume reactions for 30 mins at room temperature.
For genome wide sequencing without barcoding, 4 μg input DNA was used. For genome wide sequencing with barcoding, 1-1.5 μg input DNA was used with Native barcoding genomic DNA protocol (Oxford Nanopore Technologies) using barcoding kit EXP-NBD104 (Oxford Nanopore Technologies). Barcode ligation reactions were performed as 0.5x volume reactions and incubated for 30 mins at room temperature. Equimolar amounts of barcoded reactions were pooled and 2 μg taken forward for sequencing adapter ligation.
For all sequencing runs recommended amounts of libraries were loaded onto R9.4.1 MinION flow cells (FLO-MIN106D, Oxford Nanopore Technologies) and sequenced with MinION MkB (Oxford Nanopore Technologies) following manufacturer’s instructions. Where appropriate sequencing runs were paused and flow cells washed and reloaded according to manufacturer’s instructions.
Oxford Nanopore Technology PromethION sequencing
Sequencing libraries were prepared using the Ultra-Long DNA Sequencing Kit (Oxford Nanopore Technologies, SQK-ULK001), and Nanobind UL Library Prep Kit (Circulomics, NB-900-601-01). DNA input was approximately 40 µg (HeLa-S3) and 15 µg (RPE), both in a volume of 750 µl. For elution, the samples were kept at room temperature overnight and were placed above a magnet to keep the Nanobind disks submerged. The quantity of final library was enough to load the PromethION flow cell three times with nuclease washes in between each load. To maximise sequencing yield, we loaded two R9.4.1 (Oxford Nanopore Technologies, FLO-PRO002) flow cells for each sample and picked the best performing flow cell (in terms of Gb output) to wash and load again. Each flow cell was run on the PromethION 24 for 48 hours regardless of whether it was washed and reloaded, with a 6-hour pore scan frequency as an optimisation for long-read sequencing. Nuclease washes were performed using the Flow Cell Wash Kit (Oxford Nanopore Technologies, EXP-WSH004).
BrdU-IP short read sequencing
Genomic DNA, fragmented to 300 bp using a Bioruptor Pico, was prepared for multiplexed pooled anti-BrdU ImmunoPrecipitation Illumina NGS sequencing libraries as [38, 56]. Specifically, starting input for sonication was 6 µg DNA. After sonication DNA was ethanol precipitated, then underwent End repair and A-tailing using NEBNext Ultra II end repair module (E7546). Illumina compatible primers with barcodes were added using NEBNext Ultra II ligation module (E7595). DNA was purified using AMPure XP beads at 0.9x then equal quantities of barcoded DNA pooled and 20 ng reserved for input DNA. 3 µg DNA was heat denatured and BrdU-containing DNA immunoprecipitated using 60 µl anti-BrdU antibody (BD, 347580) in IP buffer (1x PBS, 0.0625 % Triton X-100) overnight at 4°C with rotation. Protein G Dynabeads (60 µl, Thermo Fisher 10003D) were added for 1hr then beads washed three times in ice cold IP buffer, twice in TE and then eluted in elution buffer (1x TE, 1% SDS). Immunoprecipitated DNA was purified using AMPure XP beads at 0.9x. IP and input DNA were amplified separately using Illumina compatible indexes and NEBNext Ultra II Q5 Master Mix (M0544) for an equal number of cycles, typically 15-17, depending on recovery, and purified using AMPure XP beads at 0.9x.
Libraries were checked for fragment size distribution using Tapestation and libraries quantified using Library Quant as [13]. Libraries were multiplexed where appropriate and at least 35 million reads collected per condition by 80 bp single end sequencing using NextSeq 500 (Illumina) as [13].
Mass spectrometry
Mass spectrometry samples were prepared from genomic DNA samples and data collected as described in [38].
Methylation detection by nanopore sequencing
To examine interference between the signal from CpG methylation and incorporated BrdU, Nanopolish [43] was used to base-call 5mC using the same sequencing files that were separately used to call BrdU with DNAscent2. We visualised methylation and BrdU incorporation on individual reads (e.g. Fig. 1F). Raw nanopore fast5 sequencing files, and subsequent guppy basecalled fastq files, and minimap2 alignment files, generated in the DNAscent pipeline were used to call 5mC with Nanopolish, using the following commands:
nanopolish index -d </path/to/fast5/files/> <corresponding.fastq.gz>
nanopolish call-methylation -r <corresponding.fastq.gz> -b <corresponding_alignment.bam> -g <reference.fasta> > <output.tsv>
CpG island analysis
CpG island annotations were downloaded from the UCSC Genome Browser [57] and processed bisulfite data from HeLa-S3 cells were downloaded from ENCODE (ENCSR550RTN) [58]. Bisulfite sequencing BED files that list CpGs, their coverage in the data and the fraction that were found to be methylated (biological duplicates ENCFF696OLO and ENCFF804NTQ, combined for subsequent analysis) were intersected with the CpG island annotations. CpG islands with sufficient mean coverage (>= 3) in the bisulfite data were ranked by mean proportion methylation, the top third of which were categorised as ’high methylation’ and the bottom third of which were categorised as ’low methylation’ (8,844 CpG islands, each). DNAscent detect data (HeLa-S3 cells treated for 20 hours with 0, 0.3, 1.5, 5 or 10 µM BrdU) with nanopore read depth and number of BrdU calls at thymidine positions were each converted to bigwig format. The following deepTools [59] commands were used to sum nanopore read coverage and DNAscent BrdU calls in 100 bp windows 2.5 kbp up- and downstream of the CpG island sets described above. Proportion BrdU per 100 bp window were plotted by dividing the summed BrdU count by the read count at thymidine positions.
computeMatrix reference-point --regionsFileName [<lowMeth_CGIs.bed> | <highMeth_CGIs.bed>] --scoreFileName [<nanoDepth.bw> | <DNAscentBrdU.bw>] --outFileName <outMatrix.mat.gz> --referencePoint center --beforeRegionStartLength 2500 --afterRegionStartLength 2500 --binSize 100 --averageTypeBins sum
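The CpG island ranking and the per-window proportion described above can be sketched as follows. This is a minimal illustration with hypothetical function names and input layout, not the code used for the analysis.

```python
def classify_cgis(cgis, min_cov=3):
    """Split CpG islands into low and high methylation thirds.

    cgis: list of (name, mean_coverage, mean_methylation) tuples
    (a hypothetical layout chosen for this sketch).
    """
    kept = sorted((c for c in cgis if c[1] >= min_cov), key=lambda c: c[2])
    third = len(kept) // 3
    low = [c[0] for c in kept[:third]]
    high = [c[0] for c in kept[len(kept) - third:]]
    return low, high


def proportion_brdu(brdu_sums, depth_sums):
    """Divide summed BrdU calls by summed read depth in each window.

    Windows without coverage yield None rather than a division error.
    """
    return [b / d if d else None for b, d in zip(brdu_sums, depth_sums)]
```

Islands in the middle third are left unclassified, matching the two categories used in the text.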
BrdU-IP short read data analysis
Sequencing data were downloaded from Basespace (Illumina) and pre-barcodes were demultiplexed using FASTX barcode splitter:
cat </path/to/fastq/files> | fastx_barcode_splitter.pl --bcfile <text/file/with/barcodes> --bol --prefix
Pre-barcodes were removed with FASTX barcode trimmer:
fastx_trimmer -f 6 -i <my_file.fastq> -o <my_trimmed_file.fastq.gz> -z
Sequencing and barcode trimming steps were checked for quality using FASTQC [60]:
fastqc <my_file.fastq> -o </path/to/save/output/>
Reads were mapped to hg38 using BWA-MEM, and filtered for uniquely mapping reads and duplicate reads excluded using Samtools. Proportion of BrdU incorporation per sample is calculated as number of uniquely mapped reads (with duplicates excluded) for IP/INPUT, as a proportion of the total pooled sample.
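The calculation above can be read in more than one way. The sketch below shows one possible reading, in which each sample's IP/INPUT read ratio is expressed as a share of the summed ratios across the pool; the function name and dictionary inputs are hypothetical.

```python
def brdu_ip_proportions(ip_reads, input_reads):
    """Per-sample IP/INPUT ratio as a proportion of the pooled total.

    ip_reads, input_reads: dicts mapping sample name to the number of
    uniquely mapped, de-duplicated reads. This is one interpretation of
    the calculation and is an assumption of this sketch.
    """
    ratios = {s: ip_reads[s] / input_reads[s] for s in ip_reads}
    total = sum(ratios.values())
    return {s: r / total for s, r in ratios.items()}


example = brdu_ip_proportions({"0uM": 100, "10uM": 300}, {"0uM": 100, "10uM": 100})
print(example)
```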
For visualisation, coverage at 5’ ends of reads was calculated using bedtools to output a coverage.bed file using the script bwa_map.bash available on our GitHub repository.
Blacklist regions were removed using bedtools using the following blacklist: hg38-blacklist.v2.bed from https://github.com/Boyle-Lab/Blacklist/tree/master/lists with the addition of two further regions found to have extremely high coverage in our HeLa sequencing data: chr8 127218000 127230000 and chr15 67840000 67841000. Coverage was mapped into windows using bedtools. BigWig files were generated for intermediate data visualisation using UCSC bedGraphToBigWig using the script gencoverageToBigWig.bash available on our GitHub repository.
Nanopore long read data analysis
Nanopore sequencing reads were basecalled with guppy and mapped to hg38 with minimap2 [61]. Bam files were filtered where described and indexed with samtools [62] and BrdU incorporation identified with DNAscent2 [39] (first running DNAscent index, then DNAscent detect using the default minimum read length of 1000 bp unless otherwise stated). Detect files were converted to modBAM using the detect_to_modBAM script, available in our GitHub repository.
For meta-analysis (plotting of distributions in Fig. 1), BrdU incorporation was analysed directly from .detect files or after converting to windowed fraction of BrdU incorporation.
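A windowed fraction of BrdU incorporation can be sketched as below. The 290-thymidine window matches the window size used elsewhere in the Methods; the use of non-overlapping windows, the 0.5 probability cutoff for calling a position as BrdU, and the dropping of a trailing partial window are assumptions of this sketch.

```python
def windowed_fraction(probs, window=290, prob_cutoff=0.5):
    """Fraction of thymidine positions called as BrdU in each window.

    probs: per-thymidine BrdU probabilities along one read, in order.
    A trailing partial window is dropped (an assumption of this sketch).
    """
    fractions = []
    for start in range(0, len(probs) - window + 1, window):
        chunk = probs[start:start + window]
        fractions.append(sum(p > prob_cutoff for p in chunk) / window)
    return fractions
```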
Read visualisation
For a read of interest, the relevant line from the modBAM file was converted back to detect format and passed to an R [63] script (read_&_gene_annotation_plotting.R) for visualisation. Plots include the probability of BrdU at each thymidine position, the determined level of BrdU incorporation (in windows of 290 thymidines; ∼1 kb) and where appropriate the inferred replication fork direction, initiation and termination sites. Finally, the script annotates reads with data from other genomic datasets, including genes from Gencode (GRCh38.p13) [64].
Identification of fork direction, initiation and termination sites
Replication fork direction was detected in nanopore sequencing experiments which followed the scheme described in Figure 2A. Nanopore sequencing and detection of BrdU with DNAscent was carried out as described above. The DNAscent detect data was used to calculate replication fork position and direction, and therefore, replication initiation and termination sites. This process was carried out using a custom R script (ori-ter-fork_calls.R), with major steps outlined below:
Proportion BrdU incorporation was calculated in 290 thymidine windows as described above.
Gradients of BrdU proportion were detected using the Total Variation Regularized Numerical Differentiation algorithm [65]. In short, total variation regularisation is used to denoise the first derivative of the windowed BrdU values by fitting a curve which minimises both regression from the measurements and variance across the fitted curve. Regions of fitted first derivatives greater than 1 indicate rightward forks (5’ -> 3’ on the forward strand of the genome), regions with gradients less than -1 indicate leftward forks (5’ -> 3’ on the reverse strand of the genome). These regions, and their orientation, are labelled with open chevrons in the example reads shown.
Adjacent replication fork calls with divergent or convergent orientations were used to define initiation and termination events, respectively.
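The last two steps can be illustrated with a simplified sketch. It assumes the denoised first derivative per window has already been computed (the total variation regularised differentiation itself is not reimplemented here), and the function names and tuple layout are hypothetical rather than taken from ori-ter-fork_calls.R.

```python
def call_forks(gradient, cutoff=1.0):
    """Group consecutive windows into forks: (direction, first_idx, last_idx).

    gradient: denoised first derivative of windowed BrdU, one value per window.
    Values above +cutoff are rightward, values below -cutoff are leftward.
    """
    forks = []
    for i, g in enumerate(gradient):
        direction = "right" if g > cutoff else "left" if g < -cutoff else None
        if direction is None:
            continue
        if forks and forks[-1][0] == direction and forks[-1][2] == i - 1:
            forks[-1] = (direction, forks[-1][1], i)  # extend the current fork
        else:
            forks.append((direction, i, i))
    return forks


def call_events(forks):
    """Divergent neighbours (left then right) mark initiation;
    convergent neighbours (right then left) mark termination."""
    events = []
    for (d1, _, end1), (d2, start2, _) in zip(forks, forks[1:]):
        if d1 == "left" and d2 == "right":
            events.append(("initiation", end1, start2))
        elif d1 == "right" and d2 == "left":
            events.append(("termination", end1, start2))
    return events
```

Each event is reported as the window interval between the two flanking forks, so a narrower interval corresponds to a higher-resolution call.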
Search for sequence motifs at replication initiation sites
The high-resolution DNAscent replication initiation sites (or the subset that intersect with Ok-seq IZs) were analysed to identify enriched sequence motifs using HOMER [66] as described elsewhere [20]. We did not identify any highly significant sequence motifs and of those identified the most significant were present in <∼5% of initiation sites.
Replication initiation site density (RIGR)
To compare the observed number of DNAscent replication initiation sites in different genomic regions we determined the initiation site density, defined as Replication Initiations per Gigabase of mapped Reads (abbreviated to RIGR). This controls for any differences in aggregated region size, sequence coverage or ploidy when comparing with a haploid reference genome (hg38). Briefly, for a particular set of regions (e.g. Ok-seq IZs) the number of intersecting DNAscent initiation sites was determined using BEDTools intersect (requiring >50% of the initiation site to overlap a region of interest). This number of initiation sites was then normalised to the sequence coverage (in Gb) calculated using Samtools bedcov. The significance of observed DNAscent initiation site densities was determined by comparison to 1000 randomised sets of initiation sites. Briefly, each randomisation used BEDTools shuffle to randomly permute the genomic location of DNAscent initiation sites within a randomly selected subset of the nanopore reads (using Samtools view --subsample).
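The RIGR normalisation reduces to a single division, sketched below. The empirical p-value shown is one common way of comparing an observed value with randomisations and is an assumption of this sketch; the text does not specify the exact statistic used.

```python
def rigr(n_sites, covered_bases):
    """Replication Initiations per Gigabase of mapped Reads."""
    return n_sites / (covered_bases / 1e9)


def empirical_p(observed, randomised):
    """One-sided empirical p-value for enrichment.

    Counts randomisations at least as large as the observed value and
    applies a +1 correction so the result is never exactly zero.
    """
    at_least = sum(r >= observed for r in randomised)
    return (at_least + 1) / (len(randomised) + 1)
```

For example, 50 initiation sites across regions covered by 2.5 Gb of mapped reads give a RIGR of 20.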
Comparison of whole genome DNAscent initiation sites with other datasets
To compare the locations of DNAscent replication initiation sites with transcribed regions of genome, the high-resolution initiation sites were intersected with annotated genes with support for transcriptional activity. Nucleoplasmic RNA-seq from HeLa-S3 cells [48] was reanalysed to produce normalised read counts per gene as a measure for RNA polymerase II activity on DNA.
Adapters and low quality bases were trimmed from raw fastq files using the following cutadapt [67] command:
cutadapt --minimum-length 10 --quality-cutoff 15,10 --trim-n -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT --output <trimmed_fastq_1.gz> --paired-output <trimmed_fastq_2.gz> <raw_fastq_1.gz> <raw_fastq_2.gz>
Trimmed reads were then pseudo-aligned and gene counts generated using Salmon [68]. Human cDNA, ncRNA and reference genome (GRCh38.p14) files were downloaded from Ensembl [69], concatenated, and the reference genome sequence designated as ‘decoys’. The following Salmon commands were used:
salmon index --transcripts <transcriptome.fa> --index <SALMON_INDEX> --decoys <decoys.txt>
salmon quant --libType A --index <SALMON_INDEX> --mates1 <trimmed_fastq_1.gz> --mates2 <trimmed_fastq_2.gz> --seqBias --gcBias --posBias --output <SALMON_QUANT>
The output quant.sf file contains normalised read counts (TPM, transcripts per million) for each isoform in the annotated transcriptome. To find the expression levels of genes the TPM values of all isoforms of a gene were summed. The genomic coordinates of the most highly expressed isoform of a gene were used as the coordinates for the gene. Genes with no read counts were excluded from further analysis. The coordinates of transcribed genes, and their expression level (log2 of TPM), were converted to bigwig format. The following deepTools [59] command was used to plot average transcriptional activity within 100 kbp of DNAscent initiation sites:
computeMatrix reference-point --regionsFileName [<focused_IS.bed> | <dispersed_IS.bed>] --scoreFileName <log2TPM.bw> --outFileName <outMatrix.mat.gz> --referencePoint center --beforeRegionStartLength 100000 --afterRegionStartLength 100000 --binSize 1000 --averageTypeBins mean
The output matrices were used to plot heatmaps and were separately processed to find the geometric mean. For significance and to plot 99% confidence intervals, 1000 simulations of DNAscent initiation sites with randomised positions within the mapped reads were analysed as above.
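The isoform-to-gene summarisation used to build the expression track can be sketched as follows. The input tuple layout is hypothetical, and a summed TPM of zero is used here as a stand-in for "no read counts", which is an assumption of this sketch.

```python
import math
from collections import defaultdict


def summarise_genes(isoforms):
    """Collapse isoform-level TPM to one record per gene.

    isoforms: iterable of (gene_id, tpm, chrom, start, end).
    Returns {gene_id: (chrom, start, end, log2_tpm)}, taking coordinates
    from the most highly expressed isoform and summing TPM over isoforms.
    """
    total = defaultdict(float)
    best = {}
    for gene, tpm, chrom, start, end in isoforms:
        total[gene] += tpm
        if gene not in best or tpm > best[gene][0]:
            best[gene] = (tpm, chrom, start, end)
    genes = {}
    for gene, tpm_sum in total.items():
        if tpm_sum <= 0:
            continue  # unexpressed genes are excluded
        _, chrom, start, end = best[gene]
        genes[gene] = (chrom, start, end, math.log2(tpm_sum))
    return genes
```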
To assess association with chromatin structure we compared high-resolution DNAscent initiation sites to DNase-seq rep 1 and 2 (ENCSR959ZXU) and ChIP-seq for the following histone modifications: H2AFZ ChIP-seq rep 1 and 2 (ENCSR000AQN), H3K4me1 ChIP-seq rep 1 and 2 (ENCSR000APW), H3K4me2 ChIP-seq rep 1 and 2 (ENCSR000AOE), H3K4me3 ChIP-seq rep 1 and 2 (ENCSR340WQU), H3K9ac ChIP-seq rep 1 and 2 (ENCSR000AOH), H3K9me3 ChIP-seq rep 1 and 2 (ENCSR000AQO), H3K27ac ChIP-seq rep 1 and 2 (ENCSR000AOC), H3K27me3 ChIP-seq rep 1 and 2 (ENCSR000APB), H3K36me3 ChIP-seq rep 1 and 2 (ENCSR000AOD), H3K79me2 ChIP-seq rep 1 and 2 (ENCSR000AOG), H4K20me1 ChIP-seq rep 1 and 2 (ENCSR000AOI). The following deepTools [59] command was used to assess chromatin structure 100 kbp up- and downstream of high-resolution DNAscent initiation sites:
computeMatrix reference-point --regionsFileName [<focused_IS.bed> | <dispersed_IS.bed>] --scoreFileName [<DNaseSeq.bw> | <chromatinMarkChIPseq.bw>] --outFileName <outMatrix.mat.gz> --referencePoint center --beforeRegionStartLength 100000 --afterRegionStartLength 100000 --binSize 1000 --averageTypeBins mean
For significance and to plot 99% confidence intervals, 1000 simulations of DNAscent initiation sites with randomised positions within the mapped reads were analysed as above.
Relative distance analysis was performed using the BEDTools suite [70] as described previously [71]. Briefly, the relative distance between each DNAscent initiation site and the nearest Ok-seq initiation zone was determined. Statistical significance was determined using 1000 randomisations for the DNAscent initiation site data.
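The relative distance statistic can be illustrated with a simplified sketch. It treats features as single points on one chromosome and is not a reimplementation of the BEDTools command: each query position is scored by its distance to the nearer of its two flanking reference points, divided by the distance between those two points, giving values between 0 and 0.5.

```python
from bisect import bisect_right


def relative_distances(query_points, reference_points):
    """Relative distance of each query point to its flanking reference points.

    Points are integer coordinates on a single chromosome (a simplification).
    Queries without a reference point on both sides are skipped.
    """
    refs = sorted(reference_points)
    out = []
    for q in query_points:
        i = bisect_right(refs, q)
        if i == 0 or i == len(refs):
            continue  # no flanking pair of reference points
        left, right = refs[i - 1], refs[i]
        out.append(min(q - left, right - q) / (right - left))
    return out
```

If query positions are unrelated to the reference set, the values are expected to be spread evenly between 0 and 0.5, whereas an excess near 0 indicates spatial association.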
Fork counts across transcription start sites
To distinguish between inactive genes and highly transcribed genes, expressed genes (identification described above, but excluding overlapping genes) were ranked by their expression level, and divided into two equally sized categories (8,123 genes each): ‘low’ and ‘high’ expression.
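The split into expression categories can be sketched as below; the function name and the (gene_id, tpm) input layout are hypothetical.

```python
def split_by_expression(genes):
    """Rank genes by expression and split into 'low' and 'high' halves.

    genes: list of (gene_id, tpm). With an odd number of genes the extra
    gene falls in the 'high' half (an arbitrary choice in this sketch).
    """
    ranked = sorted(genes, key=lambda g: g[1])
    half = len(ranked) // 2
    return ranked[:half], ranked[half:]
```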
Replication fork calls, as described above, were separated by directionality, each converted to bigwig format and intersected with a 20 kbp region centred around the transcription start and end sites (TSS and TES, respectively) using the following deepTools command:
computeMatrix reference-point --regionsFileName [<high_TSS.bed> | <low_TSS.bed> | <high_TES.bed> | <low_TES.bed>] --scoreFileName [<left_forks.bw> | <right_forks.bw>] --outFileName <outMatrix.mat.gz> --referencePoint center --beforeRegionStartLength 10000 --afterRegionStartLength 10000 --binSize 10 --averageTypeBins sum
The columns of the output matrix were summed to get the final forks counts in 10 bp windows with respect to the locations of expressed gene start and end sites. Forks were divided into codirectional or counterdirectional to the direction of gene transcription and plotted using custom R scripts.
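The column summation and the orientation rule can be sketched in a few lines. This is an illustration with hypothetical function names, not the custom R scripts themselves.

```python
def column_sums(matrix):
    """Sum each column of a regions-by-bins matrix to get per-bin fork counts."""
    return [sum(col) for col in zip(*matrix)]


def fork_orientation(fork_direction, gene_strand):
    """Classify a fork relative to transcription.

    A rightward fork on a '+' strand gene, or a leftward fork on a '-'
    strand gene, travels in the same direction as transcription.
    """
    same = (fork_direction == "right") == (gene_strand == "+")
    return "co-directional" if same else "counter-directional"
```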
Authorship contribution
JC, RW and CN conceived the study and planned the experiments. JC, RW, TB, LC, AD, VK, CW and KG performed the experimental work, JC, RW, ST and CN analysed the data. JC, RW and CN wrote the manuscript.
Data availability
Raw and processed data will be shared post-peer review.
Competing interest statement
The authors declare they have no competing interests.
Supplementary Figure S1: Concentration-dependent detection of BrdU in human genomic DNA by nanopore sequencing – further figures, accompanies figure 1.
A) Frequency distributions of BrdU probabilities at every thymidine position for nanopore sequencing reads from hTERT-RPE1 cells grown in the indicated range of BrdU concentrations, for the indicated times. Inserts show the full y-axis for the 50 µM sample. B) Frequency distributions of the fraction BrdU incorporated in each window (290 thymidines) for the same reads from hTERT-RPE1 cell cultures as in Supplementary Fig. 1A. Insert shows the full y-axis for the 50 µM sample. C) Top three rows: BrdU detection as determined by Mass Spectrometry, DNAscent or BrdU-IP for hTERT-RPE1 or HeLa-S3 cells treated with the indicated time and range of BrdU concentrations (as for Supplementary Fig. 1A, B and HeLa-S3 in Fig. 1). Bottom three rows: Pairwise comparisons of fraction BrdU incorporated in DNA as determined from Mass Spectrometry, DNAscent or BrdU-IP for hTERT-RPE1 and HeLa-S3 cell cultures. Dashed line is y=x. (HeLa-S3 mass spectrometry vs DNAscent shown in Fig. 1E). D) Meta-analysis of fraction BrdU detected by DNAscent at thymidine positions relative to CGI centres for reads from HeLa-S3 cells treated with 0.0, 0.3, 1.5, 10 or 50 μM BrdU (as per Fig. 1B, D; 0.0 μM and 50 μM also shown in Fig. 1G). Fraction BrdU calculated in 100 bp windows. Dashed lines (200 bps apart) indicate the minimum bounds of CGIs. CGIs were separated into three groups based on mean methylation level, blue lines show the top third (high methylation), black lines show bottom third (low methylation).
Supplementary Figure S2: Single molecule detection of DNA replication dynamics on ultra-long nanopore sequencing reads – further figures, accompanies figure 2
A) Histogram of HeLa-S3 DNAscent initiation site lengths (kbp). Dashed line indicates 5 kbp cut off for high resolution initiation sites (<5 kbp). B) Histogram of relative distance between each DNAscent initiation site and the nearest Ok-seq initiation zone (red line) with randomised data (blue line) and confidence intervals shown (dashed grey lines, inner 95%, outer 99%).
Supplementary Figure S3: DNAscent reveals stochastic replication initiation sites not identified by population studies – further figures, accompanies figure 3.
A) Summary plots and heatmaps comparing replication initiation sites identified by DNAscent, and sorted by replication timing (blue), with: proportion GC content, Ok-seq initiation zones (HeLa-S3) [16], PU-seq initiation zones [17], optical mapping (ORM) initiation zones (HeLa-S3) [37], ini-seq initiation sites [23], ini-seq (updated method) initiation sites and SNS-seq initiation sites from the same publication [24], ORC ChIP peaks (HeLa-S3) [26], Mcm7 ChIP peaks (HeLa) [27], SNS-seq initiation sites (HeLa-S3) [46], G4 K minus, G4 K plus, G4 PDS minus, G4 PDS plus (all yellow; HEK-293T) [72] from HeLa-S3 datasets where available as indicated. Ok-seq, PU-seq and ORM are also shown in Fig. 2. B) Summary plots and heatmaps as in Supplementary Fig. S3A except DNAscent initiation sites were divided into two groups, overlapping with Ok-seq initiation sites (focused, dark blue) or not overlapping (dispersed, green). C) Summary plots show the mean signal across DNAscent initiation sites for the heatmaps shown in Supplementary Fig. S3A, here including comparison to random expectations (black line) and 99% confidence intervals (grey band) from 1000 randomisations of DNAscent initiation site co-ordinates. D) Summary plots show the mean signal across focused (blue) and dispersed (green) DNAscent initiation sites for the heatmaps shown in Supplementary Fig. S3B, here including comparison to random expectations (black line) and 99% confidence intervals (grey band) from 1000 randomisations of DNAscent initiation site co-ordinates.
Supplementary Figure S4: A fifth of initiation sites are focused by high levels of proximal transcription to define initiation zones – further figures, accompanies Figure 4.
A) Replication fork direction relative to the level and direction of transcription for hTERT-RPE1 cells. The bottom half (left) or top half (right) of transcribed genes were determined by normalised transcript counts (Transcripts Per Million; TPM). Forks identified by DNAscent occurring within 10 kb of transcription start sites are included. The count of replication forks co-directional with transcription is shown in green and counter-directional in blue. Confidence intervals from 1000 randomisations are shown in grey (99%) around average fork density (in black). B) As A except forks identified by DNAscent from HeLa-S3 reads occurring within 10 kb of transcription end sites are included. C) As A except forks identified by DNAscent from hTERT-RPE1 reads occurring within 10 kb of transcription end sites are included. D) Heatmaps comparing replication initiation sites identified by DNAscent with: transcription, DNase-seq, and ChIP-seq for the following histone marks: H2AFZ, H3K4me1, H3K4me2, H3K4me3, H3K9ac, H3K9me3, H3K27ac, H3K27me3, H3K36me3, H3K79me2, H4K20me1. Initiation sites are separated by S-phase replication timing (early or late) and whether they overlap with Ok-seq IZs (focused/dispersed). Comparison with focused early sites shown in dark blue, and with dispersed early sites shown in mid blue. Comparison with focused late sites shown in red, and with dispersed late sites shown in orange.
Supplementary Table S1: BrdU-IP data
Relative numbers of reads (uniquely mapping, after excluding duplicates) between pooled samples, giving a semi-quantitative BrdU-IP measure of BrdU incorporation for HeLa-S3 and hTERT-RPE1 samples, as in Fig. 1 and Supplementary Fig. S1.
Supplementary Table S2: Fraction nascent reads
Fraction of reads classed as replicated from HeLa-S3 20hr cultures and hTERT-RPE1 2hr, 24hr and 27hr cultures, as shown in Fig. 1 and Supplementary Fig. S1. Reads were counted as replicated where >5% of thymidine positions had a BrdU probability >0.5.
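The read-level criterion above can be written as a short sketch. It assumes each read is available as a list of per-thymidine BrdU probabilities; the function names and the handling of reads without any thymidine calls are illustrative assumptions, not the authors' pipeline.

```python
from typing import Sequence


def is_replicated(
    brdu_probs: Sequence[float],
    prob_cutoff: float = 0.5,
    fraction_cutoff: float = 0.05,
) -> bool:
    """True if more than `fraction_cutoff` of thymidines exceed `prob_cutoff`."""
    if not brdu_probs:
        # Assumption: a read with no thymidine calls is not counted as replicated.
        return False
    n_called = sum(1 for p in brdu_probs if p > prob_cutoff)
    return n_called / len(brdu_probs) > fraction_cutoff


def fraction_replicated(reads: Sequence[Sequence[float]]) -> float:
    """Fraction of reads in a sample that meet the read-level criterion."""
    if not reads:
        return 0.0
    return sum(is_replicated(r) for r in reads) / len(reads)
```

Both cutoffs are strict inequalities, matching the ">5%" and ">0.5" wording of the table description.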
Supplementary Table S3: Numbers of replication initiation and termination sites identified
Table shows the numbers of leftward and rightward forks, initiation and termination sites identified in the HeLa-S3 and hTERT-RPE1 datasets, from all reads or when analysing just those reads where at least one 3000-thymidine window contained >5% thymidine positions with a BrdU probability >0.5 (nascent reads).
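The window-based definition of a nascent read differs from a whole-read fraction because a single labelled window is sufficient. A minimal sketch follows, assuming non-overlapping windows of consecutive thymidines and ignoring a trailing partial window; the table description does not specify how windows are tiled, so both choices, and the name `has_nascent_window`, are assumptions.

```python
from typing import Sequence


def has_nascent_window(
    brdu_probs: Sequence[float],
    window: int = 3000,
    prob_cutoff: float = 0.5,
    fraction_cutoff: float = 0.05,
) -> bool:
    """True if any full window has more than `fraction_cutoff` BrdU-positive thymidines."""
    calls = [p > prob_cutoff for p in brdu_probs]
    # Assumption: windows are non-overlapping and a trailing partial window is skipped.
    for start in range(0, len(calls) - window + 1, window):
        if sum(calls[start:start + window]) / window > fraction_cutoff:
            return True
    return False
```

Under these assumptions, a long read with one short BrdU-labelled stretch can qualify as nascent even when its overall fraction of BrdU-positive thymidines is below 5%.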
Supplementary Table S4: DNAscent replication initiation site density (RIGR) for various genomic regions.
Density (RIGR) of DNAscent replication initiation sites (from HeLa-S3 and hTERT-RPE1 datasets) in various genomic regions.
Acknowledgments
The authors thank Stephanie Barker (University of Oxford, for help with reagent preparation), Amanda Williams and Becky Busby (Zoology Sequencing, University of Oxford, for use of NextSeq and Tapestation), Paolo Spingardi and Skirmantas Kriaucionis (University of Oxford, for Mass Spectrometry data collection) and Ildem Akerman (University of Birmingham, for sharing whole-genome datasets).
This work was supported by the Biotechnology and Biological Sciences Research Council (BBSRC), part of UK Research and Innovation, through the Core Capability Grant BB/CCG2220/1 at the Earlham Institute, its constituent Transformative Genomics (BBS/E/ER/23NB0006), the Earlham Institute Strategic Programme Grant Cellular Genomics BBX011070/1, its constituent work packages BBS/E/ER/230001B (Cellular Genomics WP2 Consequences of somatic genome variation on traits), and grants BB/N016858/1, BB/W006014/1 and BB/Y00549X/1. This work was supported by a Wellcome Trust Investigator Award 110064/Z/15/Z.