ABSTRACT
Accurate knowledge on the core components of mutation rates is of vital importance to understand genome dynamics. By performing a single-genome and model-free analysis of 39894 retrotransposon remnants, we reveal core, sequence-dependent, nucleotide substitution rates (germline) at each of the 3.2 billion positions of the human genome. Benefiting from the data made available in such detail, we show that a simulated genome generated by equilibrating a random DNA sequence solely using our rate constants, exhibits nucleotide organisation observed in the actual human genome, with or without repeat elements. This directly demonstrates the key role of the core nucleotide substitution rates in shaping the oligomeric composition of the human genome. We next generate the basal mutability profile of the human genome and show the depletion of the moieties with low basal mutability in the database of cancer mutations.
INTRODUCTION
The stability, dynamics and organisation of genomes are key factors that influence the molecular evolution of life1. Single-nucleotide substitutions occur an order of magnitude more frequently than common insertions/deletions2,3, and are major contributors to the sampling pathways by which a genome changes over the passage of time. A thorough understanding of the descriptors that govern single-nucleotide substitutions (mutations hereafter) is thus essential to comprehend genome dynamics and its connection to the underlying first principle processes.
For a given genomic position and i->j nucleotide substitution, the mutation rate, as expressed by the rate constant ri,j, can be presented as a single-base average value and fluctuations contributed by short-range context , CpG-associated , long-range , gene/functional , and specific effects: The term can be estimated through genomic averages for the individual i->j mutation rates, and is reported for the genomes of human4,5 and other species6-8. By investigating the aggregation patterns in substitution frequencies, it was shown that the ri,j variation is subjected to two distinct short-range (<10 nt) and long-range (>1000 nt) effects9,10. In the equation above, the short-range effect is captured through the term and describes the totality of the intrinsic properties and sequence-dependent interactions of DNA with overall mutagenic and reparation processes in a given organism11. The better-studied mutation patterns at a CpG context12-14 are separated in the term, since besides having a specific short-range dyad context, CpG mutations also depend on a number of regional factors that alter the epigenetic targeting of the CpG sites10,15-18. Many relatively recent studies have shed light on the variation caused by the regional effects that depend on a long-range sequence context through secondary mechanisms, such as recombination and GC-biased gene conversion1,4,19,20, transcriptioncoupled biased genome repair21 and instability22, chromatin organisation23, replication-associated mutational bias24 and inhomogenous repair25, differential DNA mismatch repair26, non-allelic gene conversion27, and male mutation bias28. The term captures the change in mutation rates in genes and other functional elements under strong selection bias and reflects observations such as the increased neutral substitution rates in exons29,30 and the possible reduction of the mutation rates in X-chromosome31. holds the highly specific increase or decrease in mutation rates governed by targeted mechanisms32.
Herein, we obtain the core components (Eq. 2) of the spontaneous single-nucleotide substitution rates via the direct analysis of 39894 L1 mobile DNA remnants33 in the same, human, genome (a single-genome approach).
Our transposon exposed k (Trek) method provides the rates at single-nucleotide resolution in L1, where we demonstrate sufficient sequence variability to cover a wide-range of sequence contexts. We use this coverage to determine the core rate constants for all possible nucleotide substitutions (3 per position) at every single 3.2 billion position in the human genome. The Trek method reveals the variation in a model-free manner and at a level beyond accounting for only the two immediate neighbouring nucleotides34. Furthermore, we make our dataset, holding the time-dependent rate constants for individual substitutions that account for up to 7-mer sequence context effect, publically available. Importantly, we demonstrate that values alone can generate a sequence, from first principles, starting from a random DNA sequence, whose key features reflect the oligomeric organisation of the actual human genome. Next, we calculate the core mutability profile of the human genome evaluating the basal predisposition to single-nucleotide substitutions and outline the decreased frequency of the stable sequence motifs among the sites linked to somatic cancer mutations.
RESULTS AND DISCUSSION
Revealing the core single-nucleotide substitution rates
The repetitive occurrence of mobile DNA elements in different regions within the same genome33 provides the opportunity to obtain the core mutation rate constants that account for the immediate effects of neighbouring nucleotides. After the initial inactivation at different time epochs35-38, individual remnants of many transposon subfamilies within a genome have been subjected to largely the same overall mutagenic and proofreading conditions as the rest of the genome39, hence can also serve as markers of mutation rates applicable to genomic sites that share the immediate sequencecontext. For the purpose of this study, we have used the hominoid lineage of the L1 LINE retrotransposons, spanning 3.1 to 20.4 myr (million years) of age36. The constituent subfamilies of the lineage are L1PA5, L1PA4, L1PA3, L1PA2 and the most recent L1Hs. Their respective age and the number of insertions in the human genome are presented in Supplementary Table S1. The choice was made through the following reasoning. The L1 elements have a long (∼6k nt) sequence without extended repeats like in the LTR elements33. This enables their robust mapping on a chosen template and provides essential local sequence variability around different nucleotide positions within L1 elements. There are distinct L1 subfamilies that were active at different time epochs, with detailed molecular clock analyses available35-38 to reveal and, importantly, validate the age of each subfamily. They are well-represented and, unlike other classes of transposable elements, are uniformly scattered across mostly the intergenic regions of the human genome33,40,41. Unlike SINEs and LTRs, LINE sites show very low level of RNA polymerase enrichment, as a marker of transcriptional association, in normal tissues42. The selected most-recent subfamilies are sufficiently young36 a) to enable an unambiguous identification of the genomic coordinates of the borders for the remnants; b) to assume that each position in those elements would be unlikely to mutate multiple times over the studied period of their existence as remnants in the human genome (see Methods); c) to attribute a time-invariance to the rates during the analysed period of mutation accumulation4,43,44. The young L1 subfamilies have most of their remnants coming from the genomic regions with G+C content close to the genomic average value40 (see Supplementary Fig. S1). Finally, many matching positions in our studied five L1 representatives share the same consensus bases, hence, such positions are not polymorphic due to adaptive pressure and can serve as internal references for inferring the rates.
The Trek methodology of obtaining rates, along with the considerations for filtering out the possible selection and non-neutral mutation sites, is presented in Fig. 1 with further details in Methods and Supplementary Fig. S2. The acquired data on the full set of position-specific mutation rates are presented in Fig. 2.
A total of 661 positions, at the 3’ side of the L1 elements used, passed our robustness checks (see Methods) and were thus employed to infer the corresponding values from the analysis of all the young L1 remnants in the human genome. We recorded the data in the Trek database that contains a set of well-defined constants (see below for the extent of sequence context coverage in the Trek database) capturing the influence of the unique arrangement of neighbouring nucleotides at those positions. Owing to the nature of the selected L1 elements, as discussed above, and the Trek procedure design (Fig. 1), we expect the absence of the contribution, the elimination of at the averaging stage (Fig. 1b) and the removal of the and effects through our robustness checks embedded within the Trek procedure (see Fig. 1c,d and Methods). Therefore, our method provides the core variation (Fig. 2) of the mutation rates at around the genomic average values for each i->j base substitution. If the above is correct and our method indeed results in values, further averaging of the core rates (median values shown in Fig. 2) should give us the single-base genomic average substitution rates, cancelling out the remaining contribution. In fact, the comparisons of our Trek-derived with two published datasets that report on the genomic average rates4,5 show an excellent correlation (Supplementary Fig. S3, Pearson’s R>0.99) confirming the absence of any bias and unusual mutation rates in the time-accumulated substitutions at the L1 sites that pass the Trek procedure. The genome simulation, described later in this work, provides an additional validation for our rate constants. The values (Eq. 1 and 2) for all possible i->j substitutions inferred for each of the eligible individual L1 positions are thus assumed to be common for any other sites in the genome that share the short-range sequence context.
The influence range of neighbour nucleotides
To apply the constants to the human genome, we first established the optimal length of a DNA sequence (k-mer, where k is the length of the sequence) capturing most of the influences that modulate mutation rates of the base at the centre. For this, we evaluated the power of the knowledge of the neighbouring arrangement of nucleotides in predicting the constants for each of the twelve i->j substitution types, where i and j are the four DNA bases. We built test predictors for individual substitution types via a tree-based machine learning technique, while using varying lengths of sequences centred at the positions where the rate constants were to be predicted (see Methods). The aim of the machine learning procedure was to establish the optimal sequence length to minimise the error in the predicted rate constants (Supplementary Fig. S4 and S5). In agreement with prior evidence9,10,45,46, but now obtained for each individual i->j substitution type from Trek data, the optimal window was found to be 5-7-nt (both 5- and 7-nt resulting in comparable results for many substitution types) and was subsequently used as guidance for the direct mapping of the Trek rate constants from the L1 sequence onto any given human nuclear DNA sequence for the assignment.
Mapping the Trek data on any DNA sequence
The upper 7-nt size window for determining the single-nucleotide substitution rate constants at the central base accounts for 3 upstream and 3 downstream bases relative to each nucleotide position. Our substitution positions that pass the Trek criteria capture 636 unique 7-mers out of the possible 16384 (47). Therefore, for many loci in the human genome we need to use a smaller window (< 7-mer) as a match criterion to assign to one of the Trek rate constant sets. By trimming the size of the k-mer to 5, hence accounting for 2 upstream and 2 downstream bases, we cover 404 unique sequences out of possible 1024 (45). Further reduction of the size to 3, allows having data for 56 unique triads out of 64 (leaving out only the CpG containing triads, see below). For the single-base case (1-mers), where we average out all short-range neighbour effects and longer-range sequence variability, we obtain data for all the 4 bases and 4×3 possible substitutions as shown in Fig. 2 (the median values on the top of the figure). The coverage of the longer k-mers is however increased nearly twice when we account for the strand-symmetry, as described in Methods. Please note, that for each unique k-mer we obtained 3 constants via the described analysis of a large pool of L1 remnants from different genomic loci.
With the above considerations, we created a program (Trek mapper) to produce the full range of core mutation rate constants for any sequence, accounting for the context information for up to the 7-mer window and pulling the matching core data from the Trek database. Should a representative match be absent with the full 7-nt long sequence, the window around the given position in a query sequence is shortened into the longest variant possible (out of the 5-nt, 3-nt or 1-nt lengths) with a full match in the Trek database (see Methods and Supplementary Fig. S6). In this way, for all the possible 16384 7-mers, our Trek database reports 49152 rate constants (3×16384), of which 3168 (6.4%) account for the 7-mer context, 23232 (47.3%) account for the nested 5-mer context, 17120 (34.8%) for 3-mer and only 5632 (11.5%, CpG containing sequences) constants do not account for any context effect on the central base, since we eliminate those by design, due to the contributions. We thus produce an unprecedented dataset that reports, and makes publically available (Supplementary Data 2 and Data 3), the direct, rates for all individual i->j substitutions accounting for the context effects beyond the 64 triads34. If we consider only the unique values in the Trek database, we report 2078 unique rate constants (taking into account different extent of averaging, where multiple entries are present for the different context ranges), of which 1208 (58.1%), 782 (37.6%), 85 (4.1%) and 3 (0.1%) entries account for 7-, 5-, 3- and 1-mer contexts respectively. The 1-mer averaged data were used for the k-mers in that contain either C or G bases of a CpG dyad at the centre, to assign the overall mutation rate constants by the Trek mapper. This was done since none of the CpG sites in the L1 elements passed our robustness checks, due to the targeted epigenetic control of the single-nucleotide substitutions there12-14, which were also non-uniform with time (active targeting, , while in the viable epoch for each L1 subfamily) and were present to silence the active retrotransposons.
Our current data are for the human nuclear genome. However, the general approach for obtaining constants is applicable to any organism where the genome contains a set of well-characterised and related young mobile elements silenced at different time epochs and without notable context bias.
The origin of the oligomeric landscape in the human genome
The full set of sequence-dependent human mutation rates (all 3 constants per position) enabled us to perform a sophisticated in silico evolution of a random DNA sequence, guided solely by our values. We started from a random sequence of 5 million (mln) nt with a G+C content of 60% (substantially greater than the 40.45% G+C content for the human genome). We performed random mutations weighted by Trek-inferred probabilities (see Methods, Supplementary Fig. S7 and Supplementary Video 1), where, after each cycle, the mutation rate constants were updated for the sequence positions that were either mutated or fell within the influence zone of the performed mutations. The simulation was continued until the overall G+C content of the simulated sequence became constant (see Fig. 3a-c).
The simulation converged to generate a sequence with the A, T, G and C compositions of 30.91, 30.90, 19.06 and 19.13% respectively. Note, that these values are in good agreement with the A, T, G and C compositions of the repeat-masked human genome of 29.75, 29.79, 20.24 and 20.22% respectively (see Methods). Furthermore, the simulated genome captures the contents of different individual oligomers (k-mers) in the human genome. The data for all the possible 16 dyads, 64 triads, 1024 pentads and 16384 heptads are presented in Fig. 3d-g and show a significant (see the correlation coefficients) correlation between the compositional landscapes of the Trek-simulated genome and the actual human genome. Regardless of the starting composition of the initial DNA sequences, our simulations always equilibrated to a state with similar oligomer (up to 7-mer) content. The k-mer contents shown in Fig. 3d-g for the actual human genome were calculated from the repeat-masked version of the RefSeq human genome, where all the identified repeat elements, including the L1, were disregarded. This assured the removal of a potential bias due to the presence of L1 elements in the human genome. As constants are free of the contribution (see above), the simulated genome produced higher alterations in representing the k-mer contents that have CpGs (red points in Fig. 3d-g), which directly demonstrates the contribution of to the background compositional landscape of the human genome.
The correlations in Fig. 3 are from simulations where the rate constants were symmetrised according to the inherit strand-symmetry in double helical DNA (see Methods). The results without such equalisation are still significant, though producing slightly worse correlation coefficients (Supplementary Fig. S8).
To confirm that the observed correlations for different k-mer contents (Fig. 3d-g) arise due to our sequence-context-dependent core mutation rates, rather than as a side effect, by a pure chance, in a sequence where the simulation makes only the single-base composition converge to that of the real human genome (such as in sequence generated using an ideal 4×4 single-nucleotide substitution rate matrix), we calculated the expected distribution of different k-mers in a fully random genome but with the exact human A, T, G and C base compositions. In the complete absence of any sequence-context effects, the probability of the occurrence (fraction) of any k-mer in a sufficiently long sequence is equal to the product of the occurrence probabilities of their constituent bases. For instance, the probability of observing the AGT triad is the pAGT = pApGpT product, where the individual pi probabilities are the base contents expressed in fractions. The comparison of the k-mer fractions obtained in this way with the human genome data (Supplementary Fig. S9) shows a substantially reduced correlation (for the genomic 7-mer content, Pearson’s R=0.59 compared to 0.74 using Trek rates), supporting the attribution of the important role to the values in shaping the compositional landscape of our genome.
The basal mutability profile of the human genome
Our Trek mapper provides the full set of constants for each position in the whole human genome. Such data enables us to calculate the context-dependent overall mutability by taking the sum of the individual rate constants for the 3 possible mutations at each base position, thus producing the core mutability constant for the substitution of a given base i by any other base N. Supplementary Fig. S10 shows a comparison of the basal mutability profiles calculated for the individual chromosomes (red) with the whole genome profile (green), where most of the chromosomes exhibit the same overall distribution as the whole genome. Further grouping and analysis47 of the unique sequences found in regions of different basal mutability for the whole human genome reveals motifs that tune the stabilities of the bases at the centre (see Fig. 4 and the caption for further elaborations). Note that the observed sequence-determined mutational biases can potentially contribute to the initial nucleation of a more extensive base-content pattern formation in chromosomes48,49.
Basal mutability profile in cancer-linked sites
A recent study50 suggested that the multi-etiologic nature of cancer is primarily linked to random chance. To this end, genomic sites with higher intrinsic mutability (lower stability) might exhibit a higher prevalence of cancer-related genome alterations, as compared to sites of lower intrinsic mutability (higher stability). Although the sequence context of cancer mutations and their dependence on cancer types is out of the scope of the present work and is covered in detail elsewhere51-54, here we examined the simple relationship between our calculated basal mutability values and observed cancerassociated somatic mutations accessed via the annotated COSMIC cancer database55 (see Methods).
Since the Trek data are for the core spontaneous substitutions, we restricted the analysis to the noncoding and non-polymorphic (not identified as SNP) single-nucleotide substitutions (6 mln mutations) in cancer. By mapping these sites to the human genome and retrieving the sequence-context information (7-nt long sequences centred at the mutation points), we processed the data with Trek mapper and obtained the profile for the non-coding sites detected in human cancer. The outcome in Fig. 5, overlapped with the whole-genome profile, shows that stable sites in the human genome, assigned by the Trek mapper to have mutability constants below 1.13 byr-1, are significantly less represented in cancer.
Like many other disease-causing mutation sites56, most of the sites that are highly enriched in cancer (Fig. 6a) are CpGs57, which, even without accounting for the methylation driven increase12-14 of the mutation rates, show high basal mutability values58. However, Fig. 6b,c demonstrates the discussed trend in the 7-mer cancer enrichment ratio (Methods) vs. basal mutability dependence even when all the CpG sites are removed from the analysis. Overall, our results show that the intrinsic basal mutability of different sites in DNA may contribute to their absence/presence in pathological genotypes. In particular, we observe that 7-mers with low mutability of the central base are relatively depleted in cancer-linked mutation data. They present a cancer enrichment ratio that is smaller than 1, whereas for the unstable 7-mers, the enrichment ratio, on average, tends to 1 (Fig. 6c), meaning that the enrichment in cancer is comparable to the one in the whole genome.
CONCLUSIONS
We have employed a single-genome and model-free approach (Trek) that reveals the core components of the spontaneous single-nucleotide substitution rates and basal mutability constants for the human nuclear genome. Although the mobile DNA elements have been used before4,39,43,59 for estimating averaged substitution rates, the increased quality of the human reference sequence and the detailed subfamily divergence studies for the L1 elements35-38 done during the past decade enabled the construction of a specific direct method for the single-genome retrieval of the core rate constants at a single-nucleotide resolution, while also accounting for the comprehensive short-range context effects beyond the previous Bayesian estimates for the +1/-1 base effects34. The retrieval of our data in a single-genome manner adds additional value, since it ensures the absence of potential bias due to a) differences in the molecular machinery that influence the mutation rates while comparing the genomes of different species, and b) the short generation span where only the most probable mutations would become visible while comparing genomic data in families (parents/offspring) to detect de novo substitutions. As the major outcome of this study, we used our context-dependent rate constants to provide the first direct demonstration of the equilibration of a random DNA sequence into the one with overall (genomic base content) and short-range (different k-mer contents) characteristics that closely mirror that of the actual human genome. The L1-derived rates recapitulated fundamental properties of the repeat-masked and therefore L1-free human genome (Supplementary Fig. S11). This simulation thus provides significant evidence in support of the role of core neutral mutations in shaping the compositional dynamics of complex genomes60 and additionally validates the reliability of the obtained Trek-derived constants. Importantly, our study demonstrates that the non-specific core mutation rates are capable of producing apparent selection or depletion patterns in the human genome. To this end, our in silico equilibrated sequences, obtained solely based on the full set of constants, can now serve as true background standards for the comparisons to reveal real selection61 for or against different sequence motifs. The extended set of core mutation rate constants we report can potentially help advance our understanding in genome dynamics, with possible implications for the role of random mutations in the emergence of pathological genotypes and the evolution of proteomes.
METHODS
The human reference genome sequence was taken from the Ensembl database (www.ensembl.org), and was of the version hg19/GRCh37. The positions and span of the retrotransposons were taken from the output of the RepeatMasker62 processing, accessed through the UCSC genome database (www.genome.ucsc.edu). The repeat annotations were those corresponding to the version of the used human RefSeq genome (hg19/GRCh37). The R programming language (www.r-project.org) was used for all the consecutive analyses. Most of the computations were performed on the available Linux workstation and computing cluster facilities hosted at the Department of Chemistry, University of Cambridge, and the Cancer Research UK Cambridge Institute.
Revealing the core mutation rate constants
All the remnant sequences of the mentioned subfamilies (Supplementary Table S1) were first aligned onto the 6064 nt reference sequence. As the reference, we took the consensus sequence of the human L1Hs retrotransposon (Fig. 1a,b and Supplementary Fig. S2). The alignment was done in a pairwise manner with high end-gap penalties (“overlap” mode) that, while allowing insertions and deletions, did not severely break the queried sequences for false mappings with a better global alignment score. R with the Biostrings library for alignment was used. After the alignment, all the relevant mutation fractions were collected for each position in the five L1 subfamilies reporting on a specific time epoch (Fig. 1b,c). For example, if the position i in the reference sequence was G (b1), the mutation rate constants were calculated for the G->A (b1->b2) transition and G->C (b1->b3), G->T (b1->b4) transversions. First, the base fractions were calculated for five time-reporting L1 subfamilies; i.e. to get the fraction of mutations accumulated in ∼20.4 myr (age of L1PA536), all bases in L1PA5 remnants that were precisely mapped on the ith position of the reference sequence were counted and the fractions of G, A, C and T bases retrieved (Fig. 1c). Here, we applied one of the robustness checks and made sure that the fractions were estimated if at least 700 mapped bases were present for the ith position in each time-reporting subfamily (Fig. 1b). We also aimed to calculate such substitution rates for only the positions where the mutations are random and not specifically selected for or against. In other words, the position should not be a polymorphic or a subfamily speciation-defining nucleotide. We filtered out such cases by ensuring that any eligible ith position had the same nucleotide of the reference sequence as its most prevalent variant with a minimum of 80% occurrence in all subfamilies (Fig. 1c). The average crude single-nucleotide substitution rate is noted to be 12.85×10-9 mutations per site per generation5. Assuming an average generation length of 20 years2, the mutation rate constant in a time domain can be crudely approximated as 0.64 byr-1. In the course of 20.4 myr (the age of L1PA5), this should result in only a 1.31% mutated base fraction at a give site, caused by the average spontaneous substitution rates. Therefore, by assuming a threshold of 80%, we allow up to 15 times variation of the rates from the average estimate, which is a safe range5 for the direct estimation of the single-nucleotide substitution rates and their core variation. Having the mutation fraction data from five different ages and for three (b1->b2, b1->b3, b1->b4) possible substitutions at the position i allowed to fit a linear model via the least squares methodology for the fraction-versus-time dependence for each mutation separately (Fig. 1d). If the data, hence the fitted line, were of high quality, the slope was expected to represent the mutation rate constant. We applied the third robustness filtering at this stage, by making sure that the rates were calculated for only the cases where the time correlation of the mutation fractions in Fig. 1d had greater than 0.7 Pearson’s correlation coefficient. This ensured that the retrieved fractions of the mutations comprised of only the time-accumulated substitutions, rather than of targeted substitutions during the active life-span of the L1 elements, before their silencing. Please note, however, that the correlation coefficients in most of such time correlations that passed the whole Trek procedure were substantially higher (the observed Pearson’s correlation coefficients were centred at 0.92 with 0.07 standard deviation). The procedure was done for all the 6064 positions in the L1 reference sequence, except the positions 5856-5895 and 6018-6064, close to the 3’-end (Supplementary Fig. S2) that engulf the low-complexity G-rich and A-rich sequences correspondingly, prone to alignment errors. One of the reasons for the usage of only the young L1 subfamilies (spanning 20.4 myr age) was to minimise the potential error in rate constant determination in the Trek procedure caused by repeated substitutions hitting the same position during the considered period of the mutation accumulation. The effect is indeed negligible for 20.4 myr span, as we can estimate using the above mentioned 0.64 byr-1 value5 for the average i->j substitution rate constant, r. Since the rate constant is sufficiently small to induce only a small δfj change in substituted base fraction during the δt = 0.0204 byr (20.4 myr) time period (see above), we can equate the δfj change in the fraction of the base j (at the given position that had the original base i identity in a large population of homologous sequences) to the p ≈ δfj ≈ rδt substitution probability within δt period. We can thus make a crude estimation for the probability of the second substitution to another, k≠j, base happening at the same position to be (rδt)2, which is the product of individual substitution probabilities assuming that the rate constant does not change from our average estimate r across those two substitution types. We can permit this for the sake of the back-of-the-envelope estimation of the order of the effect expected from the repeated mutations hitting the same site within 20.4 myr period. To this end, the apparent change in i->j substitution fraction that we would observe by neglecting the additional j->k substitution, would underestimate the more realistic δf and be equal to , as we would not count the j bases that emerged but became additionally substituted by k. Hence, the corresponding apparent rate constant that neglects second substitution would also underestimate the actual value r, and can be expressed as . This means that the underestimation of the actual rate constant would be by [r-r(1-rδt)]×100/r = 100×rδt %. Putting 0.64 byr-1 for r and the 0.0204 byr for δt, we can expect only 1.3% contribution from the repeated second substitution at the same position within 20.4 myr. Since some of the other non-j bases could become js and balance the underestimation of the δfj fraction, the error could be even smaller. This shows that repeated substitutions can be safely ignored in 20.4 myr time-scale. Furthermore, the validity of our rate constants was further checked through the two independent analyses reflected in Supplementary Fig. S3 and Fig. S4.
Finding the influence range of neighbour nucleotides
We have used generalised boosted models63 (GBM) to elucidate the effective range for the core sequence-context effects. This was achieved by developing test models to evaluate the predictive strength of only the neighbouring bases in defining the core mutation rate of the central base. The GBM was used as implemented in the gbm library for R. For each i->j mutation type, all the found Trek data were taken without the possible outliers, which were filtered by allowing only the usage of the values that were within the 1.65 (a value that keeps ∼90% data if normally distributed) times standard deviation range of the constants in a given mutation category. The sequences were then processed to produce pos/bi uncoupled features that were associated with the relative adjacent positions (pos, - for upstream and + for downstream positions) and their possible four bi base types. Those features took values 0 or 1, depending on whether the base at an associated relative position was of the bi (1) or any other base type (0). For instance, if we wanted to develop a model accounting for only a single upstream (pos = -1) and a single downstream (pos = +1) nucleotides, hence predicting the mutation rates for different 3-mers, where the central base is the one that mutates, then we produced 8 pos/bi features for the GBM fitting. There, 4 binary features (-1/A, - 1/C, -1/G and -1/T) described whether the upstream -1 position is of base type A, C, G or T, and 4 binary features described the same for the downstream +1 position. We built the models using 3-, 5-, 7-, 9-, and 11-mers, thus accounting for 1, 2, 3, 4 and 5 upstream and the same number of downstream neighbour bases. The absence of the coupling in the binary features, unlike in the case where, for instance, one employs only two binary features per 4 states, enabled us to also investigate the predictive significance of each nucleobase identity at a given neighbouring position, which was useful in deciding against the construction of more complex machine learning models (see below) using additional features with higher level of abstraction for the sequence information (overall base content, sequence-derivative properties). The GBM models were then fitted by systematically trying different permutations of the tuning values63 for the number of trees (50-7500), interaction depth (1-10), shrinkage (0.001, 0.01 and 0.1), the number of minimum observations per node (1-28) and the bag fraction (0.25-0.65). The optimal combinations of the tuning parameters were found per mutation type and sequence length, via a 16-fold cross validation repeated 7 times. The found best parameters are accessible in the Supplementary Data, and the predictive performances of the best models, from the repeated cross validation studies, are presented in Supplementary Fig. S5. To make a predictor of based on a sequence, we found it much better to use direct values coming from the proposed Trek methodology, rather than the GBM models, as the Trek values were already well averaged across multiple occurrences of the same sequence in different loci of the human genome (Fig. 1-3 and Supplementary Fig. S3). Furthermore, the overall poor performance of the GBM models implies that the influence of the immediate context is highly non-additive and non-Bayesian, which is expected taking into account the nature of the core context-dependent mutation rates. The latter rates reflect the intrinsic short-range sequence properties, interactions and recognition with the overall mutagenic and repair machinery present in a given organism. There, the whole sequence at a certain small scale9 is what defines the interaction11, and it is hard to represent such effects through even smaller-scale constituents. The direct model-free approach used in our Trek mapper methodology (see below) thus seems essential in mapping the rate constants throughout the human genome. To this end, the GBM models here had a sole purpose of identifying the optimal range of influence for accounting the neighbouring nucleotides. The optimal range was found to be captured, on average, by a 5-7-nt long window (Supplementary Fig. S4 and S5) which is in an excellent agreement with the prior <10 nt estimate9-11,45,46. We used the maximum 7-nt length to stratify the Trek data for the further model-free mapping on any provided sequence, including the whole human genome.
Mapping the Trek mutation data on any sequence
We developed a Trek mapper program. For each i position in a query sequence (Supplementary Fig. S6), the program looks at the bases i-3 to i+3. If the exact 7-mer, with the associated rate constant values, is not available in the Trek database, the program reduces the size of the sequence to 5, by considering i-2 to i+2 positions, or, if necessary, to 3- or 1-mers, until an exact match is found in the database. This would essentially mean that some reported mutation rates would come from the actual triad data. About half will come from pentads and some from heptads, accounting for more precise sequence-context information. A few will originate from the fully averaged single-base (1-mer) Trek mutation rates. Single-base values are also returned for the terminal positions in a query sequence. For each unique sequence in the discussed 7-, 5-, 3- and 1-mers, if the k-mer appears more than once in the reference L1 sequence, of course with different neighbours at the positions out of the k-mer range, we average the Trek values by taking the median. For instance, the rG->A mutation rate constant in the 3-mer AGT represents the rate averaged across all the appearances of AGT in the L1 reference sequence, which would normally be with varying other neighbour bases, out of the 3-mer range. The rG->A in AGT will therefore represent the average rate constant across all the representatives of the significant range, the NNAGTNN 7-mers, present in L1, where N can be any of the four bases. In the same way, the mutation rate constants for the single bases (1-mers) can be considered as fully averaged across all the possible neighbour effects in NNNGNNN sequences. Our algorithm therefore makes the most of transposon exposed mutation rate data of the human genome, returning the best possible values inferable from our Trek database and, where uncertainty is present, returning the best averaged values for a shorter context range. Furthermore, we have enabled an option to use symmetrised Trek parameterisation, assuming an overall strand-invariance of the mutation rates. In the latter case, the complementary mutation rate constants of the central bases in two reverse complementary k-mers were equalised. For example, the G->A mutation rate constant (r1) in the 3-mer AGC was set equal to the complementary C->T rate constant (r2) in the reverse complementary GCT. The data equalisation was done in the following way: if both r1 and r2 were of the same quality accounting for the whole sequence-context information in both 3-mer variants, then both values were set to (r1+ r2)/2; however, if one of the rate constants was determined with a better quality, since the full 3-mer data for the other case was missing and the 1-mer average was used as a replacement, then the rate constant of the better quality variant was assigned to both r1 and r2. Accounting for the strand symmetry improves the results of the validation studies, further refining the mutation rate constant values and increasing the coverage of longer k-mers in the Trek database.
The described Trek mapper program, along with the associated data can be accessed through the http://trek.atgcdynamics.org web page. Future improvements in data and the program, through extending the types of mobile DNA in the Trek procedure, will be reflected on the same web site. The Trek mapper server application was written in R, using the Shiny library and server application (http://shiny.rstudio.com).
Equilibration of a random DNA sequence with the Trek rates
A 5 million (mln) nt sized random genome was created with the initial A, T, G and C base contents set to 20, 20, 30 and 30% correspondingly, hence with 60% genomic G+C content. The length was selected to cope with the finite computational and time resources, though operating on lengthier sequences will not change the outcome of the calculations, since the captured sequence-context effects are within 7-nt window. We first calculated the probabilities of all the possible mutations in this random sequence, which basically meant the assignment of 3 mutation rate constants per position in the genome, describing the conversion into the three bases other than the base already present in the respective position. This was done using the Trek mapper described above. Next, at each step, we sampled 5000 mutations weighted by the calculated 3×5 mln rate constants. We then identified those 5000 positions and the corresponding mutation types that were sampled to happen (Supplementary Fig. S7), performed those mutations, and, updated the probability values via the Trek mapper. Repeated multiple times, the process evolved the sequence (Supplementary Video 1) ruled by the core spontaneous rate constants that are sensitive to the changes in the sequence composition at the immediate vicinity in the genome.
For the comparison of the simulated sequence at equilibrium with the real human genome (RefSeq), we calculated the fractions of different oligomers (k-mers) in both sequences (Supplementary Data). The k-mer contents of the human genome were calculated by sliding a window of size k (from 1 to 7) and counting the occurrence of each 4k unique sequence. We used a direct calculation of the lexicological index64 of a string to increase the computational efficiency of the k-mer counting. Although, data from the masked human genome were used in the k-mer analyses to rule out any bias from the presence of the same L1 elements in the object of application of Trek data, the comparison of the masked and unmasked genomes showed only negligible differences in both single base and short k-mer contents. If, however, we consider only the L1 elements, the k-mer content was quite different from the rest of the genome.
Basal mutability constants at cancer-linked sites
We took all the non-coding somatic single-nucleotide substitution data associated with cancer from the COSMIC database55 Error! Hyperlink reference not valid. NCV dataset accessed in February, 2015). Since our Trek rate constants are for the spontaneous mutations, we only considered the sites that were also not declared as known SNPs (the status was present in the NCV dataset). This was to ensure that we excluded sites where an active polymorphism is potentially encouraged by a natural selection. About 3.7% data from the remaining set of cancer-linked mutations were duplicates, with no differences found in genomic location and mutation types. We removed those, keeping only the single first-encountered copies of such entries. The resulting data contained 5984711 mutation entries.
The cancer enrichment score (Fig. 6) for a given k-mer sequence was calculated by taking the ratio of the occurrence fractions, calculated in (numerator) only the cancer-linked sites (where the linked base is the central one in the k-mer) and (denominator) in the whole repeat-masked human genome (Supplementary Data).
Data access
The Trek database and the mapper are housed at http://trek.atgcdynamics.org. Additional figures and tables, referenced in the text, can be found in the Supplementary Information (Table S1 and Figures S1-S11). We deposit the video showing the in silico dynamics of a random sequence upon equilibration (Supplementary Video 1), the raw data on the mutation rate constants in the reference L1 sequence (Supplementary Data 1), the resulting Trek database processed with (Supplementary Data 2) and without (Supplementary Data 3) the strand-symmetry considerations, the k-mer content for the masked and unmasked human genomes (Supplementary Data 4), the full set of 7-mer sequences with the respective cancer enrichment scores and mutability values (Supplementary Data 5), and the GBM parameters that were minimising the error of the tree-based test models (Supplementary Data 6). All the associated method and analyses source codes are available from the authors upon request.
AUTHOR CONTRIBUTIONS
A.B.S and S.B. designed the study, performed the research, interpreted the results and wrote the paper. The authors declare no competing financial interests.
ACKNOWLEDGEMENTS
This research was supported by the Herchel Smith Fund. SB is a Wellcome Trust Senior Investigator. We thank Dr. Chris Lowe for proofreading the manuscript.