The evolution and convergence of mutation spectra across mammals

Despite the key role genetic mutations play in shaping phenotypic differences between species, little is currently known about the evolution of germline mutation spectra across mammals. Domesticated species are likely particularly interesting case studies because of their high mutation rates and complex evolutionary histories, which can span multiple founding events and genetic bottlenecks. Here we have developed a new reusable workflow, nSPECTRa, that can undertake the key steps in characterising mutation spectra, from determining ancestral alleles to characterising multiple forms of variation. We apply nSPECTRa to seven species, including several that have undergone domestication, and highlight how nSPECTRa can provide important novel insights into mutation rate evolution. While mutation spectra most often show marked differences between species and even breeds, certain mutation types have risen to a high frequency in subpopulations of different species, indicative of convergent evolution in mutation rates. This includes the previously characterized TCC->TTC change enriched among European humans, which is also enriched among East Asian cattle. We show Indicine cattle are particularly interesting examples of how different mutation spectra segregate within a population and subsequently spread across the globe. Together, this work has important implications for understanding the control and evolution of mammalian mutation rates.


Introduction
As a key driver of phenotypic differences between individuals, and the main substrate upon which selection acts, characterising the processes that generate germline mutations is vital to determine the processes driving traits, diseases and species evolution.Owing to the importance of DNA fidelity in mammals, there are several hundred genes 1 involved in DNA repair, spanning at least seven different repair pathways 2 , from mismatch repair to nonhomologous end joining.Each of these different pathways is preferentially associated with the repair of a different spectrum of DNA changes, and because of the large number of genes and pathways involved, natural genetic variation across them is expected to lead to differences in the efficiency with which different mutation types are repaired between individuals.Although genetic variants with a large effect on DNA repair have been observed, most notably within families with high incidences of cancer 3 , these are generally rare in any given population, with the majority of genetic polymorphisms affecting DNA repair between individuals likely to have a small effect 4 .Nevertheless, the accumulation of multiple mutations across hundreds of repair genes over time could potentially lead to noticeable differences in the spectra of the mutations found between individuals.Supporting this idea, Harris and Pritchard 5 found that different human populations preferentially carry different numbers of DNA mutations in different K-mer contexts.For example, human European populations have been observed to carry a relatively greater proportion of TCC>TTC changes, suggesting that individuals from this population have less efficiently corrected such changes over recent human history.
In addition to varying rates by sequence context, the distribution of single nucleotide variants (SNVs) has been shown to be uneven across the genome, with directly adjacent SNVs occurring more often than expected and most often on the same DNA strand 6 .Although many of these neighbouring changes appear to be the result of a single mutational event in a single generation, that have been termed multi-nucleotide polymorphisms (MNPs) 7 , more commonly adjacent SNVs are found at different frequencies in the population, suggesting that they are the result of two independent mutation events leading to what has been termed sequential dinucleotide polymorphisms (SDMs) 6 .Consistent with this, flanking heterozygosity has been shown to be associated with an increased probability of base substitutions in the human genome 8 , suggesting DNA mutations are often not independent of those that have previously occurred nearby.In previous work, these SDMs were shown to be even more differentiated between populations than single nucleotide variants (SNVs) and to be driven via different mutational processes 6 .
Beyond heritable variation in the ability to repair DNA damage, other factors also play a role in shaping the distribution of base changes in the genome.Perhaps the most notable differences are in the rate by which certain sequences are prone to mutate 9 , different environmental exposures to mutagens 10 , natural and artificial selection, and biased gene conversion 11 .One of the largest drivers of DNA changes in mammals is the readiness by which methylated CpG sites deaminate to TpG dinucleotides, with such changes making up over a third of all observed mutations 12 .Similarly, selection is expected to rapidly remove novel deleterious mutations from the population, potentially skewing the mutational profile of the remaining changes.However, as most novel mutations are expected to be neutral due to falling in non-coding, intergenic, or intronic regions, the impact of selection may not substantially skew the mutation profiles between individuals.This is especially true, as the location and number of functional regions are expected to be largely consistent across individuals, at least within species.Biased gene conversion (BGC), the process by which one copy of an allele is preferentially replaced by another through recombination 11 , can also alter the spectrum of SNVs in the genome.This effect is known to be skewed towards increasing the frequency of C:G base pairs at the expense of A:T base pairs.However, recent studies have indicated that BGC is not likely to be a major driver of the observed differences in mutation spectra observed between individuals 13 .
Although most studies have been carried out within species, by comparing the genomes of offspring to those of their parents, Bergeron et al. 14 estimated germline mutation rates across 68 vertebrate species.Factors such as generation time, age at sexual maturity, and other fecundity traits were observed to be linked to differences in the number of germline mutations per generation between species.Domesticated species were shown to be particularly interesting models for studying germline mutation profiles owing to their unusual evolutionary histories.The authors observed them to have unusually high mutation rates, which they attributed to a reduction in generation time in these species, which may have arisen due to population bottlenecks following domestication, rather than any inherent differences in the underlying mutational processes.However, the relatively few mutations identified when comparing parental and offspring genomes means that it is not possible to compare the spectra of mutations between species and among domesticated populations using such trio-based approaches.The impact on mutation spectra of phenomena such as multiple domestication events in cattle 15 , selection for phenotypic diversity in dogs 16 , and population bottlenecks across domesticated species 17 is largely unknown.
Consequently, despite the importance of better understanding how patterns of mutation differ between individuals, little work has been done to study mutation spectra differences outside of the major model organisms 18,19 due to a lack of suitable large whole-genome sequenced cohorts and easy-to-use software tools.However, factors such as the everdecreasing costs of whole-genome sequencing and the availability of tools for studying SNV mutation rates, such as mutyper 20 and Relate 21 , have now enabled the potential application of these approaches across a wider range of species.In this study, we expand on these tools to develop a unified, reusable workflow and analysis approach that can characterise different types of SNV and SDM mutation spectra in any diploid species and apply it to study the evolution and convergence of mutation spectra.We applied this workflow across several mammalian species, with a particular focus on domesticated animals, to characterise how mutation spectra have evolved and converged both within and across species.

A novel reusable workflow for characterising mutation spectra across species
To enable the characterisation of mutation rate spectra across any diploid species, we generated a publicly available, reusable Nextflow workflow, nSPECTRa (www.github.com/evotools/nSPECTRa). nSPECTRa enables users to run a range of analyses, including imputation, annotation, and calculation of both SNV and SDM spectra, and their relative rates over time.Importantly nSPECTRa can infer ancestral alleles for any species, a prerequisite for determining the direction of mutations, but which are generally unavailable for most species.The entire nSPECTRa workflow is shown in Figure 1.To demonstrate the utility of this pipeline and to explore the patterns of germline variation across mammals, we applied nSPECTRa to whole-genome variant calls of unrelated samples from seven different species (309 cattle, 175 African buffalo, 79 water buffalo, 36 horses, 350 pigs, 606 dogs, and 2,561 humans).These species span a diverse range of mammals and cover three Bovidae and five domesticated species, enabling comparisons both within and across groups.
A key challenge when attempting to compare mutation spectra within and across species is that both the total number of mutations and the frequency of ancestral K-mers in the respective genomes differ between animals.This means that both factors need to be controlled for to enable comparisons of the relative rate at which different K-mers mutate across species.To achieve this, we adapted the median of ratios method originally developed for RNA-seq data analysis.Using this approach, the proportion of each K-mer in the genome showing a particular base change was first normalised by the occurrence of the corresponding K-mer in the ancestral genome.These normalised K-mer-specific mutation rates were further corrected by calculating a sample-wise median of ratios normalisation factor to control for differences in total mutation number between animals, as described by Anders and Huber 22 .
As shown in Supplementary Figure 1, this novel approach for processing mutation rates successfully places the mutation spectra of animals onto the same scale, so that they can be directly compared both within and across species.

Divergence of SNV mutation profiles and impact on protein evolution across mammals
Figure 2A shows a Principal Component Analysis (PCA) representation of the relationship between the SNV mutation spectra across the seven studied species.All species showed a clear separation based on their spectra of germline SNV variation, highlighting the substantial divergence in mutational profiles across mammals.Consistent with their comparatively close evolutionary distances, the two domesticated and one wild Bovidae (cattle, water buffalo, and African buffalo) showed the most similar mutational profiles.Figure 2B shows the large differences between species across a diverse range of mutation types, with the key driver of PC1 being the general difference in the rate of C>G mutations across species, with humans and horses having the highest rate of these changes, and pigs having the lowest.More specific mutational profiles were also observed.For example, pig species appear to present an unusually low rate of C>T mutations in CpG contexts.
Notably, on various principal components, the species do not cluster according to their known evolutionary distance, suggesting that their relationship by mutational profiles is not exclusively a function of their time since divergence.For example, PC5 separates African buffalo from the domesticated bovids, water buffalo, and cattle, while PC6 further separates the latter two domesticated species (Supplementary Figure 2).PC5 is in part driven by African buffalo presenting a comparatively high rate of V[C>A]T changes (Figure 2B, Supplementary Figure 3), mutations that are particularly enriched in the COSMIC SBS14 signature associated with defective DNA mismatch repair 23 .
To identify the key mutations driving the observed differences between species more clearly, we trained a gradient boosting machine learning model on the SNV mutation spectra profiles.
Consistent with the clear separation in mutation profiles between species, this model was 100% accurate in assigning species labels to individuals (Table 1).From this model, we calculated SHAP (SHapley Additive exPlanations) values 24 that indicated which model features (i.e., mutation type) were most important in differentiating between groups.Figure 2C shows the mutational changes that were most informative in separating species.Notably, only one or two mutation types are sufficient to explain most of the differences between species; for example, C[A>T]C rates are substantially elevated in horses but depleted in humans.
Importantly, these mutational biases are not only observed in intergenic and intronic regions, but also among coding changes, meaning that they impact the relative rate of protein evolution between these species.For example, the relative bias for T[C>G]T changes in cattle non-coding regions was also observed among cattle missense changes, skewing the relative rate of amino acid change in this species (Figure 2D).This potentially has particular implications for water buffaloes, which show a comparative enrichment for T[C>A]A changes (Supplementary Figure 4), and so are primed to accumulate stop-gained mutations at a comparatively higher rate than the other species examined.

SDM mutation profiles identify fine-scale relationships among populations
Sequential dinucleotide mutations are adjacent single-nucleotide base mutations that are found at different frequencies in the population, suggesting that they are a result of multiple mutational events across different generations.In previous work in humans we demonstrated SDMs are controlled by distinct mutational processes to SNVs, and that the rate of SDMs in different K-mer contexts can more effectively separate human populations than that of SNVs 6 .This result was observed to carry over across species.Pigs, humans and dogs all separate into sub-groups based on their SDM profiles (Supplementary Figure 5), with these groups reflecting known evolutionary relationships between sub-populations of these species, namely Asian and European pigs, ancient and modern dog breeds, and African versus non-African human populations.Notably, when analysing mutation profiles within the one species, the Indian water buffalo breeds did not separate with any combination of PCs based on SNV mutational profiles, but were clearly separated on PC2 vs. PC4 calculated from SDM profiles (Figure 3).Consistent with this, a gradient boosting model trained on SNV profiles only has an accuracy of 0.50 at assigning correct water buffalo breed labels (Table 1), but in contrast, the accuracy of the model trained on SDM profiles was 0.88.This is despite three of these water buffalo breeds coming from geographically proximal regions of western India and clustering closely on genotype-based PCA (see Dutta et al. 2020) 25 .This illustrates that SDM mutational profiles and machine learning models are potentially effective approaches for assigning breed labels to livestock species.

Figure 3. More effective clustering of animals by breed using SDMs versus SNVs. A) For PC1 vs. PC2 calculated from SNV profiles, only the divergent Mediterranean buffalo was clearly separated from the other water buffalo breeds. The inset shows the geographic origins of the six Indian breeds. No combination of the first eight PCs calculated from the SNV mutation profiles could be used to separate these breeds. B) PC2 vs. PC4 based on SDM profiles effectively clusters all seven breeds. PC1 vs. PC2 based on the SDM profiles are shown in
Supplementary Figure 6.

Mutation spectra divergence within breeds and sub-populations
Cattle are unusual in that they have been domesticated at least twice leading to two lineages, Bos taurus taurus (taurine cattle) and Bos taurus indicus (indicine cattle), which last had a common ancestor over 200,000 years ago 26 .These two lineages have historically populated different global areas, with taurine animals migrating to Europe, West Africa, and North Asia, whereas indicine cattle populated South Asia and later East Africa (Figure 4A).This separation was reflected in their mutation spectra, with the two lineages separating on PC1 for both the SNV and SDM mutation spectra (Figure 4B+C).However, surprisingly even animals from the same geographic origin or breed can be further separated by differences in their mutational

Convergence of mutational profiles across species
Using the SHAP values derived from gradient boosting models applied within species to differentiate population groups, we examined whether certain mutation types are enriched or depleted in populations across species.As shown in Figure 5A SDM biases are largely all private to sub-populations, with no SDM type showing unusual rates across two populations from different species.This is consistent with SDMs being more specifically associated with particular groups and populations, enabling SDMs to better differentiate between them, and suggests little convergence in SDM profiles across species and populations.However, a range of SNV mutation types were observed to be associated with specific populations across different species (Figure 5B).In particular, C>T changes in different flanking contexts: for  Most notably one of the main drivers of the separation of East Asian indicine cattle from other indicine cattle is an elevated rate of TCC>TTC mutations.This is the same mutational profile that has been extensively characterised among European humans across a number of studies 5,27 , but whose cause is unknown.Closer examination highlights that, as in humans, there has been a pulse of an increase in the rate of these mutations specifically in this one population of cattle (Figure 6A).This suggests the exact same mutational bias has arisen at least twice, once in European humans and once in East Asian cattle.This mutation is a strong contributor to the observed separation of indicine cattle described above on PC4 (Supplementary Figure 7).Therefore, these results suggest not only are European humans associated with a pulse of TCC>TTC mutations, but this change is specifically linked to a subgroup of indicine cattle that we can map to changes in its frequency between different animals of the same breed.

Comparison with cancer somatic COSMIC signatures
We finally compared the germline mutational spectra observed in each species to the curated set of human somatic mutational signatures in the COSMIC cancer dataset 28,29 , to see if any of the COSMIC signatures may be contributing to the observed mutation profiles.Cosine similarities between the rank 1 nonnegative matrix factorization (NMF) signatures for each species and the different COSMIC signatures is shown in Figure 7A.The highest cosine similarities between the signatures for the different species with the COSMIC database signatures range from 0.84 to 0.94 for the horse and pig species, respectively (Supplementary Data 1), with all of the NMF signatures for the different species most strongly matching the COSMIC SBS5 pattern.This pattern has been observed across several studies to be linked to human germline mutation rates, and this signature appears to make a major contribution to the mutation spectra of all species studied.Across-species links to the different COSMIC signatures were broadly comparable.However, a clear exception is observed for signature SBS1, which is strongly linked to the mutation spectra of all species apart from pigs (Figure 7A).SBS1 is known to be driven by the deamination of CpG sites to TpG dinucleotides.While this is one of the strongest drivers of mutation rates across all species, it is observed to be comparatively low in pigs (Figure 7B).To exclude the possibility that a problem with the pig ancestral genome was leading us to underestimate the number of changes due to misassigning the direction of change, we collapsed CpG to TpG changes with their reverse change (TpG to CpG).This collapsed frequency was though still lower in pigs than the other species examined (Supplementary Figure 9).Pigs show a generally similar number of methylated CpG sites across their genome as other species 30 , meaning a reduced frequency of these sites can also not be invoked to explain this observed difference.Closer examination of the two major pig sub-groups, European and Asian, highlights that Asian pigs are particularly depleted with rare CpG>TpG changes (Figure 7C) suggesting they are the stronger contributors to this observed depletion.

Discussion
In this work we present a novel workflow designed to streamline the process of characterising different types of mutation spectra in any diploid species.As well as enabling the profiling of the rates of the more commonly studied SNVs, it also enables the characterisation of the rates of SDMs, that we illustrate can provide distinct insights.In particular, SDMs more clearly separate species, and are more effective at assigning breed labels to domesticated animals.
The nSPECTRa workflow incorporates all the major steps required to study mutation spectra, from imputation and phasing, through variant annotation and mutation classification using tools like mutyper, to calculating PCs from the resulting mutation profiles.The workflow allows users to run all these steps with little intervention.Importantly, nSPECTRa also incorporates a workflow for reconstructing the ancestral state of a given genome, an important prerequisite for determining the direction of change of mutations.With ancestral genomes currently missing for most species this can be a major obstacle to performing mutation spectra analyses.The only pre-requisite for determining this ancestral state of a genome is a multiple genome alignment in HAL format containing the reference genome of interest.Some HAL files are publicly available, such as the 241 mammals from the zoonomia project 31 , or can be generated de novo using the progressive cactus alignment tool 32 .
Although other studies of mutation spectra have used different approaches to determine ancestral alleles, such as the est-sfs approach 13,33 , an important disadvantage of such approaches is that they rely on variant allele frequencies, that are in turn dependent on the availability of good quality population genomics data for the focal group.In contrast, the approach implemented in nSPECTRa only requires the availability of reference genomes, that are increasingly available for most species, and can characterise ancestral alleles across both polymorphic and fixed sites in the focal species.Consequently, we expect this ancestral genome workflow to be of potential use across studies that require an ancestral sequence and is not restricted to those investigating mutation spectra.
We highlight how the median of ratios method, introduced for normalising RNA-seq data, is also an effective approach for addressing the issue of normalising mutation spectra, so that they can be compared across species that exhibit different mutation counts.An alternate approach adopted in a recent study to correct for this is to downsample the variants to match the species with the lowest diversity 13 .However, as highlighted in this previous study, this method has the potential disadvantage of effectively adding noise to the estimates, and can lead to mutation counts of 0 following downsampling, effectively removing signals.Future direct comparison of these two approaches would be of interest to evaluate their relative merits and impact on downstream results.
Demonstrating the utility of the nSPECTRa pipeline, we characterise the mutation profiles across five domesticated species (pigs, cattle, water buffalo, dogs and horses) and two outgroups (African buffalo and humans) providing insights into the evolution of these species.
Cattle were observed to have an unusually complex pattern of mutation spectra evolution, partly reflecting the fact the species were domesticated at least twice from two independent ancestral populations, leading to distinct mutational profiles in the two lineages.More A further potential avenue of future research is further investigation of the PARP11 locus, where we identified a variant strongly associated with the separation of the indicine lineages.
The PARP family of genes have been strongly linked to DNA repair 35 , with PARP inhibitors developed as a cancer treatment as their inhibition blocks tumour cells from repairing harmful DNA mutations 36 .This makes the locus a strong candidate for being behind the divergence between lineages, but given the relatively limited number of indicine genomes currently publicly available, this result should be treated with caution until it can be replicated in further cohorts.However, this result does illustrate how the study of mutational processes, both within and across species, could potentially provide important insights into the genetic mechanisms controlling DNA damage repair.
A particularly notable finding is the observation of the pulse in TCC>TTC changes, previously observed in European humans, in East Asian cattle.The specificity of the change suggests disruption of the same mutational pathway is potentially responsible in both populations.
Although strongest for this change, this convergence in mutational profiles across populations from different species is seen for other mutation types and suggests expanding these analyses to further species and populations would identify more examples of such convergence in mutation spectra.This could potentially be used to refine the potential mechanisms underlying shared biases in mutation spectra.
These differences in mutation spectra are expected to have downstream consequences.For example, we demonstrate that the biases in mutation rates in non-coding areas of the genome are also observed in coding regions, consequently impacting the rate of change of different amino acids.This ultimately means that not all amino acid changes are equally likely across species, and certain species may be prone to accumulate certain deleterious mutations at different rates over extended timescales.
Some limitations of the current study should be highlighted.Notably, when comparing mutation spectra across species, different genomic regions are invariably studied.Although the results are normalised by sequence content, and the majority of the genome was studied in each species, this could lead to mismatches in the types of sequences examined.For example, certain repeat classes are not found in all species 37 , which may have distinct mutational profiles.However, other approaches to try and mitigate this issue, for example restricting the analyses to orthologous regions found across every species, would also likely lead to biases such as towards ultra-conserved regions.For this reason, we have primarily focused on the within species analyses.A further aspect to note is the fact that the population levels differ between species: for instance, cattle and water buffalo are broken down by breed, while other populations are by geographical area.This is partly a result of available annotations for the different datasets and partly because the concept of breed does not apply to species such as African buffalo.But this means the ability to separate populations should not be directly compared across species.Likewise changes in mutation spectra can have multiple causes, such as changes in demography.This is potentially the driver of the observed difference in CpG deamination rates in pigs.A previous study in human populations highlighting that differences in the rate at which CpG sites mutate is a result of the fact that these sites are more likely to mutate more than once.This impacts their relative frequency and the occurrence of this is dependent on demography 27 .
In conclusion, we introduce the nSPECTRa workflow for determining mutation spectra in any species, and we illustrate the utility of approaches such as the median of ratios method to normalise the output of this workflow.We also demonstrate their utility by applying them across seven different mammalian species to study the evolution and convergence of mutation spectra.With the increasing availability of population genetic data for an everexpanding set of species, we expect these resources and approaches to be of widespread use to provide further important insights into the evolution of mutation spectra and their relevance to shaping species and populations.

Defining the ancestral sequence
Reliable definition of mutation changes relies on the accurate definition of the ancestral state of each base in the genome of interest.To achieve this, we used CACTUS 32 that can generate ancestral genomes from multiple genome alignments.For the dog, human, horse and pig species we provided the 241-way cactus alignment generated by Armstrong et al. 32 and accessible at https://cglgenomics.ucsc.edu/data/cactus/.For each species we specified a different ancestral branch, representative of the different clades: fullTreeAnc209 for the dog, fullTreeAnc105 for the human, fullTreeAnc92 for the macaque, fullTreeAnc226 for the horse and fullTreeAnc192 for the pig.
As this existing alignment did not contain suitable genomes for the bovids, for these we generated a novel cactus alignment containing 17 publicly available genomes (Supplementary Data 2).The supporting phylogenetic tree was defined using the MASH (v2.2) software 38 with default parameters to calculate the pairwise genome distances.These were then used as input for phylip's neighbour algorithm, using the UPGMA approach, to define the guide tree for the CACTUS run.This phylogenetic tree was represented using FigTree 39 and shown in Supplementary Figure 11.
Prior to generation of the multiple genome alignments for the 17 assemblies, these were masked using a combination of DustMasker 40 , WindowsMasker 41 and RepeatMasker 42 with trf 43 , to provide homogenous and comparable levels of soft-masking of the different genomes.
Cactus (v2019.03)was then used to generate the multiple genome alignments and the ancestral state at each split of the phylogenetic tree.The ancestor Inner 13 has been selected as ancestral state for all the Bovinae genomes, since common among all of them.This Inner 13 ancestral genome presents a good coverage of all the assemblies.And for the cattle, water buffalo and African buffalo genomes we had coverages of 90%, 94% and 93.5%, respectively.

Datasets and variant pre-processing
We considered variant datasets for 7 different species in this study: human, cattle, water buffalo, African buffalo, horse, pig and dog.
The human dataset was generated as part of the 1000 genome project, and consists of 3,202 samples from five super-populations 44  were retrieved from the Genome Variation Map database 47 .
The human genomes had already been filtered and phased.For the other species, we performed the same pre-processing of the datasets, reducing as much as possible factors such as batch effect and depth of sequencing effects.Any cattle, African buffalo, macaque, horse and dog samples with very high missingness (>75%) and low average depth of sequencing (<8X) were discarded.For pigs, only the high missingness filter (>75%) could be applied due to absence of the DP flag in the VCF annotation.For each dataset we only retained biallelic SNPs with a call rate > 90% and minor allele count (MAC) >2.Following variant filtering, we also retained only 2 nd -degree-unrelated individuals identified by converting the biallelic SNPs to plink binary format using plink v1.90b4 48and then running the KING software 49

Determining mutation counts
The inference of the mutation spectra was performed using Mutyper (v0.6.1) 50and the sequential dinucleotide mutation (SDM) approach from Prendergast et al. 6 .The two algorithms were implemented into nSPECTRa (nextflow SPECTRum analysis), a novel Nextflow mutation spectrum analysis workflow that performs the analyses in a species-agnostic, lowinteraction and highly parallelised fashion.The Relate software, that can calculate mutation rates and their change over time 21 , was also included as an option in this workflow, but was not used in the current study.The scheme of the workflow can be seen in Figure 1, and can be summarised as follow: 1) pre-processing of the VCF file; 2) preparation of the ancestral reference; 3) computation of the mutation spectra.
The nSPECTRa VCF pre-processing consists of first imputing the sparse missing genotypes followed by phasing.The workflow supports either Beagle v5 or greater 51 or shapeit4 v4.2.0 52 .
In this study we applied Beagle (v5.2) with the following effective population sizes (Ne): 1000 for the cattle 25 , 358 for water buffalo 25 , 5000 for the African Buffalo 53 , 5000 for the pig 54 , 100 for horse 55 and, 230 for dog 56 .
Following the imputation and phasing, the workflow uses the variant effect predictor (VEP) software to define the effect of each variant considered in the analysis.This step can be performed by either passing a cache compliant with the installed version of VEP 57 , or by providing a custom GFF, suitable for recently annotated genomes or genomes for which a cache is not available.More specifically, we used the VEP cache v104 to annotate the cattle and the dog datasets (references ARS_UCD1.The ancestral reference state consists of detecting the most likely ancestral base in the reference genome used to call the variants.To do so, nSPECTRa takes the HAL generated by CACTUS as input, and extracts the alignments from the reference and the Inner13 genome in multiple alignment format (MAF) using the hal2maf function from HAL 58 .The alignments are then processed using a series of custom scripts that identify the most likely base in the Inner13 genome covering the reference genome, and uses this information to generate an ancestral reference genome with the ancestral base instead of the reference base, where defined.For non-Bovidae species, the principle is the same, but uses a different set of alignments and different ancestral genomes for each species: fullTreeAnc209 for the dog, fullTreeAnc105 for the human, fullTreeAnc92 for the macaque, fullTreeAnc226 for the horse and fullTreeAnc192 for the pig.
Once the inputs are prepared, the third stage can be run using one of the three software previously described.For Relate, the workflow runs all the steps required on the guide, preparing the inputs, estimating the effective population size and, finally, calculating the mutation spectra and mutation rate.
For mutyper, the software can calculate the mutation spectra for all samples, chromosome by chromosome to increase the throughput, for a set of K-mer sizes specified by the user, allowing them to account for multiple context sizes (e.g.K = 3 will consider 1 basepair flanking sequences such as A[T>C]A, whereas K = 5 will consider 2 basepair flanking sequences such as TA[T>C]AG).The raw mutation counts are then normalized using the initial state frequency in the ancestral sequence (e.g. the number of A[T>C]A changes is divided by the number of ATA K-mers in the ancestral genome).This allows the user to factor in the increased or decreased probability of a mutation occurring because of the intrinsic K-mer content of the ancestral genome.If mutyper and relate are run together, the Ne will be used in combination with the Ksfs vector generated by mutyper to perform demographic analysis and inference on the different populations considered.
Finally, SDM will estimate the rate of sequential dinucleotide mutations for each individual in each population, considering not only the context but also the effect of the variants and its position in the codon.The raw counts are normalized for the initial K-mer content in the ancestral sequence, analogously to what is done for the Mutyper counts.

Grouping of the populations
The grouping of the different individuals have been done according to the individual characteristics of each dataset considered.
For the cattle, we classified the individuals as belonging to 6 major groups: African indicine The pig dataset has been categorized depending on the individual geographical ancestry: European (n = 161), Asian (n = 180) and wild pig species (n = 15).This separation was validated by a PCA of the samples.Seven samples that were extreme outliers on this PCA were excluded from downstream analyses.
The horse dataset presents the lowest number among the species considered (n = 36) and lacks information on any sub-groupings; therefore, these samples were considered as a single pan-population.
Due to their high degree of diversity, dogs have been classified by applying the hierarchical clustering on principal component (HCPC) method from the FactoMineR R package 59 .This identified two groups: modern breeds (n = 507), and ancestral breeds and village dogs (n = 117).
Finally, humans are classified into the same 5 super-populations, as defined by the 1000 genome project: East-Asian (n = 506), South-East Asian (n= 504), European (n = 524), African (n = 678) and American (n = 349) but due to their high levels of admixture the American population was dropped from analyses.Following normalising SNV and SDM mutation counts by the occurrences of their ancestral Kmer in the ancestral genome we then applied the median of ratios method 22 to enable the comparison of normalised counts across individuals and species.We implemented the median of ratios as an R script function that applies the formula described by Anders S. and Hubers W. (2010): ) 1/ First, we compute the geometrical mean for each change across all samples.We then divide the value at each count for each sample by the geometrical mean of each count type.Then, we compute the median ratio for each sample, producing a normalization factor.Finally, each original count is divided by the normalization factor, obtaining the final value to be used for the analyses.
Dividing the values by the K-mer counts in the ancestral genome allows us to make the datasets comparable across species by accounting for the probability of a change to occur in a species due to an increased number of sites.The second, allows us to make samples comparable between each other.

Cross-species spectra analysis
We investigated the relationships between the populations for both SNVs and SDMs using a PCA on the normalized values.We also represented the heatmap of the single-change mutation matrix pooled for each population using a modified version of the plot_standardized_profile_heatmap function from the MutationalPatterns R package 60 .We also represented the single-change mutation profile using a modified version of the plot_96_profile function from the MutationalPatterns R package 60 .
For the multispecies comparisons, we first combined all the K-mer normalized datasets, and then normalized them through the same median of ratios approach.We represented the relationship among the mutation spectra in the different species using the same methods previously described.
Derived allele frequencies in the different populations have been computed using the command bcftools +fill-tag, and values were then extracted using bcftools query.These have then been processed by a custom script that counts how many rare changes occur in each group.

Modelling and SHAP values
To identify which mutation types best differentiate populations and species we fitted the normalised mutation spectra as features in multi-class extreme gradient boosting models using the xgbtree method in the caret R package 61 with five-fold cross-validation.To assess model performance 20% of the individuals were left out of the data prior to model training to act as an independent test set from which model metrics such as accuracies could be derived.
To identify those genetic changes that best differentiate populations we calculated SHAP 24 values from these models using the xgboost R package 62 .
Testing for Indicine cattle variant's whose genotypes were associated with cattle SNV PC4 was carried out using plink2 48 .Having restricted the initial cattle dataset to indicine cattle (African or Asian) the first 20 genotype PCs were calculated using plink2 having pruned variants using the --indep-pairwise 50 10 0.1 flag and a minor allele frequency cutoff of 0.01.These PCs were then provided as covariates to the plink --glm function while also filtering out variants with a Hardy-Weinberg P value less than 1x10 -6 or with more than 10% missing genotypes.

Comparison with COSMIC profiles
Finally, we compared the spectrum for each species with the COSMIC signatures (v3.3) 29 .The signatures for the GRCh38 human genome have been downloaded directly from the COSMIC website (https://cancer.sanger.ac.uk/signatures/documents/2047/COSMIC_v3.3.1_SBS_GRCh38.txt) and processed locally.
We identified the mutation signatures by applying the NMF analysis implemented in the R NMF package 63 (rank = 1 and performing 10 iterations).After extracting the signatures, we identified the most similar SBS COSMIC profile by using a customized version of the rename_nmf_signatures function from MutationalPatterns, that computes the cosine similarity between the NMF and the COSMIC profiles We used a cosine cut-off of 0.8 to identify the most similar signature in the COSMIC database.
After identifying the most similar COSMIC signatures, we extracted the contribution to the signatures for each species.

Figure 2 .
Figure 2. Variations in the SNV rates across species.A) Principal component (PC) analysis of the relationship between different species based on the rate of SNV mutations of different ancestral 3-mers.Each point represents an individual, and the species with larger numbers were randomly downsampled to a maximum of 80 individuals prior to calculating the PCs.B) Relative contributions of the different mutation types in each species.The 5' and 3' contexts of each change are shown on the left and bottom axes, respectively, with the observed base change at the top.Each row of changes corresponds to a different species, with the downsampled number of individuals shown on the right.Darker blue colours indicate the relative enrichment of the corresponding change.C) SHAP values associated with changes showing the greatest discrimination among the species.The x-axis indicates the impact of the corresponding feature (y-axis) on the model prediction.A greater spread in SHAP values indicated that the mutation type was more strongly associated with discriminating the corresponding species from others.The colour of the points indicates the feature (mutation) value, and the x-axis indicates whether it is relatively enriched (positive value) or depleted (negative) in the corresponding species.The darker purple positive points indicate that the species is associated with an enrichment of the given change, and the darker purple negative points indicate that the species is depleted with the change.D)The rate of different changes that lead to amino acid changes in humans versus cattle.Note the triplet changes correspond to K-mers that may not be in frame with a single codon.The colour of each point corresponds to the ratio of the rate of the same change between the same species, but in intergenic regions.The protein-coding changes enriched in cattle (above the diagonal) are generally also enriched in intergenic regions in cattle (coloured yellow/light green).
profiles.West African cattle breeds were separated on PC6 of the cattle SDM mutational spectra, with members of the NDama breed split across the two groups (Figure4B) suggesting distinct mutational signatures are segregating within this breed.Intriguingly, Indicine animals from the Indian subcontinent were separated on PC4 of the cattle SNV mutational spectra (Figure4C).Notably, African and East Asian indicine animals were similarly separated on this PC.This suggests that two distinct mutational profiles segregate among indicine cattle, with the animals transferred to East Africa carrying a distinct mutational profile compared to those that migrated to East Asia following the original domestication event of all Bos indicus animals on the Indian subcontinent.Examination of the PC4 loadings highlights that the changes linked to the two lineages are specific, with the East Asian indicine lineage comparatively enriched with A[C>T]C and T[C>T]C changes, and the African indicine lineage preferentially associated with N[A>C]G mutations (Supplementary Figure7).We investigated whether any genetic variants were associated with the genetic mutation profile captured by PC4 of the cattle SNV mutational spectra.Controlling for the first 20 genotype PCs and removing variants with a minor allele frequency <0.1, the variant with the strongest association with PC4 (chromosome 5:106108408; P = 6.1x10 -9 ) fell directly upstream of the PARP11 gene (Supplementary Figure8).

Figure 4 .
Figure 4. Divergence of mutation spectra within cattle sub-populations.A) The historic migration patterns of the two cattle subspecies, Bos taurus taurus in green and Bos taurus indicus in orange.B) Separation of West African Bos taurus animals on PC6, calculated from the SDM profiles.Each West African animal was labelled with its breed of origin.The "Taurine" group corresponds to the European taurine animals and the "Indicine" group corresponds to animals from the Indian subcontinent.C) The separation of the Indicine cattle on PC4 when calculated from their SNV profiles.
example, the A[C>T]C change is enriched among European humans and depleted from African taurine cattle, and the A[C>T]G change is depleted among European humans, enriched among West African buffalo, and differentiated between ancient and modern dog breeds.

Figure 5 .
Figure 5. Convergence of SNV mutation profiles across species.A) Discriminating SDM signatures are generally not shared across populations of different species.Lines indicate changes that are preferentially associated with different sub-populations in the SHAP analyses.As horse had no annotated sub-populations it is excluded.The width of each bar corresponds to the observed variance in the SHAP values for the corresponding change.The higher the variance, broadly the more strongly the change (shown on the right) is specifically associated with the sub-population (shown on the left).Only changes with a variance greater than 0.05 are shown.B) In contrast to SDMs, SNV profiles are shared across sub-populations with three selected changes that are linked to sub-populations in different species highlighted.

Figure 6 .
Figure 6.The convergent appearance of TCC>TTC mutation pulses in human and cattle subpopulations.A) The top row shows the SHAP values in human (left) and cattle (right) of the SNV changes most predictive of individual sub-populations.The TCC>TTC change is marked in the red rectangles and associated with separating the sub-populations in both species.The bottom row shows the fraction of TCC>TTC changes found at different frequencies in each subpopulation.Consistent with previous observations 5 , a pulse is observed of variants with a frequency of around 1% in European humans.A similar enrichment for TCC>TTC changes is observed in East Asian Indicine cattle.B) The relative enrichment of different changes in the different cattle groups.The TCC>TTC change in East Asian indicine cattle is marked by a red asterisk.

Figure 7 .
Figure 7. COSMIC signatures in each species.A) Cosine similarities between the NMF signatures on each species and the known COSMIC signatures, with the blue-to-red gradient representative of the 0-to-1 similarity values.B) The fraction of mutations that are CCG>CTG changes by derived allele frequency across species and populations.The Asian pigs show the lowest rate of these changes, in particular among rare changes.For comparison data on 15 wild pigs are included in this plot.
surprisingly though, cattle populations traditionally treated as single populations separated based on their mutational profiles.African taurine cattle, including animals from the same N'Dama breed, separated based on their SDM profiles.The causes of this remains unclear, with no individual SDM change driving this separation, but rather differences across a range of SDM types.It is possible admixture from other breeds into one or other population could skew their SDM profiles, but arguing against this hypothesis, the samples cluster closely in a traditional genotype-based PCA (Supplementary Figure10) which would not be expected if, for example, recent admixture from European cattle had occurred.Indicine cattle also separate based on their SNV mutation spectra.In particular African indicine cattle were associated with N[A>C]G changes and East Asian indicine cattle were linked to A[C>T]C and T[C>T]C changes, suggesting independent ancestries of these two indicine populations.Most surprisingly, though, indicine animals from the Indian subcontinent could also be broken down into these two lineages based on their profiles, despite their close geographic origin and situation at the original site of domestication.It is possible this is due to a bottleneck event leading to two independent Bos indicus lineages, or even potentially two independent Bos indicus cattle domestication events.But arguing against this, the enrichment of TCC>TTC changes observed in East Asian cattle is preferentially found among low frequency variants, suggesting the divergence in the rate of these changes happened more recently.Perhaps more plausibly, introgression from other Bos species may have driven these differences, with up to 1.57% of Chinese indicine genomes proposed to have arisen through introgression from banteng, B. javanicus34 .Future studies intersecting the location of such regions with known DNA repair genes, or study of the mutational spectra in B. javanicus, may provide insights into the potential causes of these distinct mutational spectra.
2 and CanFam3.1, respectively), VEP cache version 97 for the human (reference GRCh38.p12), and 94 for the domestic Horse (reference EquCab2.0), the NCBI GFF annotation for the water buffalo and the pig (reference Sscrofa 10.2) and the Ensembl rapid release annotation for the African buffalo (reference GCA_902825105.1).
25The cattle (Bos taurus) dataset consists of 477 animals from 71 worldwide populations, representative of both subspecies (B.t.taurus and B. t.indicus) and from four continents (Europe, Asia, Africa and South America) described in Zhao et al.45.The water buffalo (Bubalus bubali) dataset consists of 79 water buffalo genomes of 7 breeds including the Mediterranean buffalo from Italy and 6 others from India25.The African Buffalo (Syncerus caffer) dataset consists of 196 buffaloes from 4 different subspecies (S. c. caffer, S. c. nanus, S. c. aequinoctialis and S. c. brachyceros) from conservation areas from across the range of this species in Africa 46 .The pig (S. scrofa) and horse (E.caballus) datasets

Table 1 .
The performance of predicting the population groups within species (or species label across species in bottom row) using gradient boosting models trained on the spectrum of SNVs or SDMs.Significance P values of the accuracy above that expected from the no information rate are shown.