## Abstract

Although the utility of short tandem repeats on the Y-chromosome (Y-STRs) has long been recognized and leveraged in forensics, genealogy and paternity testing, the bulk of these applications have relied on only a few dozen loci identified as having remarkably high mutation rates. Recent efforts have expanded the set of Y-STRs with known mutation rates to two hundred markers, but the limited throughput of the capillary method for estimating mutation rates has left the mutability of most Y-STRs uncharacterized, particularly those with dinucleotide repeat units. To address this limitation, we developed a novel method capable of concurrently estimating the mutation rates of all Y-STRs by leveraging population-scale whole-genome sequencing data. Extensive simulations confirmed that our method robustly accounts for PCR stutter artifacts and obtains unbiased mutation rate estimates. Application of the method to orthogonal datasets from the 1000 Genomes Project and Simons Genome Diversity Project utilized evolutionary data from over 250,000 meioses to estimate the mutation rates of more than 700 Y-STRs with 2–6 base pair repeat units, yielding the largest such set to date. Comparison of these estimates with those from father-son studies indicated a high degree of concordance for loci that have been previously characterized. In addition, we identified nearly 100 previously uncharacterized Y-STRs with per-generation mutation rates greater than 1 in 3000. Altogether, our study provides a broadly applicable method for estimating Y-STR mutation rates from whole-genome sequencing cohorts, outlines a framework for imputing Y-STRs, vastly expands the number of identified loci with high discriminative power and provides the first chromosome-wide characterization of the mutation rates of dinucleotide short tandem repeats.

## Introduction

Over the past 20 years, a multitude of fields have increasingly leveraged Y-STRs due to their unique combination of high mutation rate and paternal inheritance pattern. Prior to the advent of genome-wide SNP genotype data, population genetics utilized these highly mutable markers to build phylogenies (Takezaki and Nei 1996; Forster et al. 2000) and to draw a host of demographic inferences (Pritchard et al. 1999). In forensics, Y-STRs are commonly used to resolve cases in which DNA samples contain multiple donors or are difficult to profile using traditional autosomal techniques (Kayser et al. 1997; Roewer 2009). Y-STRs are also widely used in genealogy to ascertain the relatedness of families (Kayser et al. 2007) and in paternity cases, even resolving historical debates such as the contentious paternal relationship between Thomas Jefferson and Sally Hemings’ children (Foster et al. 1998). More recently, we employed these markers to demonstrate that one can infer the surname of an anonymous genome, a finding that stimulated important conversations related to genetic privacy (Gymrek et al. 2013).

Despite the immense utility of Y-STRs, the vast majority of their applications rely on only a few dozen markers. The small size of this panel is largely the result of the cumbersome and expensive method used to estimate Y-STR mutation rates. This process typically involves genotyping large pedigrees or thousands of father-son pairs using capillary electrophoresis, from which the frequency of discordant genotypes provides an estimate of the mutation rate (Heyer et al. 1997; Kayser et al. 2000; Dupuy et al. 2004; Gusmao et al. 2005). Recently, several large-scale studies expanded the set of loci with available mutation rates to include several hundred markers and identified a handful of rapidly mutating Y-STRs with mutation rates in excess of 10^{−2} mutations per generation (mpg) (Ballantyne et al. 2010; Burgarella and Navascues 2011). However, these studies characterized long Y-STRs with 3–6 bp motifs that were identified in prior scans for polymorphic loci (Kayser et al. 2004), ignoring loci with fewer repeats or dinucleotide repeats. Since genome-wide studies of human populations have identified dinucleotide repeats as among the most abundant and heterozygous of the STR classes (Willems et al. 2014), characterizing these markers may identify new promising candidates for male lineage differentiation. In addition, characterizing the mutation rates of the full spectrum of Y-STRs will be instrumental towards understanding their mutational mechanisms and developing accurate sequence-based predictors of mutability for use across the genome.

Fortunately, the rapid advancement of next-generation sequencing technologies has provided a unique opportunity to address these issues. Coupled with vast improvements in the depth and quality of whole genome sequencing (WGS) datasets, the advent of STR genotyping tools has made it possible to genotype Y-STRs chromosome-wide (Gymrek et al. 2012; Highnam et al. 2013; Warshauer et al. 2013). As a result, mutation rate estimation procedures that leverage these datasets can perform unbiased scans for mutable loci instead of only considering previously ascertained sites. While it may seem appropriate to apply traditional STR mutation rate estimators based on microsatellite distance measures to these datasets (Goldstein et al. 1995; Slatkin 1995), these estimators assume simplistic STR mutation models and have been shown to be susceptible to haplogroup size fluctuations (Zhivotovsky et al. 2006). An alternative approach is to develop methods that also leverage the rich evolutionary information of recently generated high-resolution population-scale Y-chromosome phylogenies (Francalacci et al. 2013; Poznik et al. 2013; Wei et al. 2013a). A recent study developed one such method, but it also required a simple mutation model and only partially utilized the phylogenetic information (Ravid-Amir and Rosset 2010). Furthermore, all of the above methods assume error-free genotypes and are therefore poorly equipped to deal with the sources of error prevalent in WGS-based STR call sets.

In this study, we demonstrate how to effectively integrate Y-STR genotypes and Y-SNP phylogenies derived from whole-genome sequencing data to estimate Y-STR mutation rates. Using various simulations, we demonstrate that our approach results in unbiased mutation rate estimates for almost all considered mutation models, even in the presence of extensive PCR stutter. We then apply our approach to data from the Simons Genome Diversity Project (SGDP) and 1000 Genomes Project (1KGP) (Genomes Project et al. 2015) to estimate the mutation rates of over 700 Y-STRs, most of which have never been characterized. The resulting sets of estimates were remarkably concordant, uncovered a large number of unknown highly polymorphic markers and shed light on the sequence factors that govern Y-STR mutability.

## Materials and Methods

### Mutation Rate Method Overview

Our approach to estimating Y-STR mutation rates, which is outlined in Figure 1, is motivated by the notion that current Y-SNP phylogenies are sufficiently detailed and accurate to infer STR mutation models. Given a phylogeny and a set of STR genotypes, Felsenstein’s pruning algorithm (Felsenstein 1981) and numerical optimization can be used to evaluate and improve the likelihood of a mutation model until convergence, providing an estimate of the mutation rate. However, due to the error-prone and low-coverage nature of WGS-based STR call sets, utilizing these genotypes will result in vastly inflated mutation rate estimates. To avoid these biases, we analyze the number of repeats observed in all individuals’ reads to learn a locus-specific error model and use this error model to compute genotype posteriors. As these posteriors account for genotype uncertainty, we utilize them during the mutation model optimization process instead of fixed genotypes to obtain robust estimates. More detailed descriptions of each of the steps involved in this approach are contained in the sections below.

### Y-SNP Phylogeny Construction

We downloaded Y-chromosome SNP calls for male SGDP samples from the project website and utilized VCFtools (Danecek et al. 2011) to remove loci where more than 10% of the calls were heterozygous. For the remaining polymorphic sites, we removed individual calls that were heterozygous, had fewer than 7 supporting reads or had more than 10% of reads supporting an uncalled allele. Lastly, we removed loci if fewer than 150 samples met these criteria or more than 10% of reads had zero mapping quality. These filters resulted in nearly 39,000 high-quality polymorphic SNPs, which were used to generate a maximum likelihood phylogeny using RAxML (Stamatakis 2014) and the options *-m ASC_GTRGAMMA -f d --asc-corr lewis*. We used Dendroscope (Huson and Scornavacca 2012) to root the resulting phylogeny along the branch marked by the M42 and M94 mutations, well-annotated markers associated with the split of the A haplogroup from all other haplogroups (Jobling and Tyler-Smith 2003). For the 1000 Genomes dataset, we utilized a RAxML phylogeny generated by the 1000 Genomes Y-chromosome working group.

### Modeling STR Genotyping Errors

PCR stutter artifacts are one of the primary causes of STR genotyping errors and typically involve the insertion or deletion of copies of the STR repeat unit in a subset of the reads at a locus. To mitigate the effects of these errors, we developed a method to learn locus-specific stutter models. Our stutter model *Θ* is parameterized by the allele frequencies for each STR allele (*f*_{i}), the probability that stutter adds (*u*) or removes (*d*) repeats from the true allele in an observed read, and a geometric distribution with parameter *ρ*_{s} that controls the size of the stutter-induced changes. Given a stutter model and a set of observed reads (*R*), the posterior probability of each individual’s haploid genotype is:

$$P(g_i = j \mid R, \Theta) = \frac{f_j \prod_{k} P(r_{k,i} \mid g_i = j, \Theta)}{\sum_{a} f_a \prod_{k} P(r_{k,i} \mid g_i = a, \Theta)}$$

$$P(r_{k,i} \mid g_i = j, \Theta) = \begin{cases} u\,\rho_s (1-\rho_s)^{r_{k,i}-j-1} & r_{k,i} > j \\ d\,\rho_s (1-\rho_s)^{j-r_{k,i}-1} & r_{k,i} < j \\ 1-u-d & r_{k,i} = j \end{cases}$$

where *g*_{i} and *r*_{k,i} denote the number of repeats in the locus and k^{th} read for the i^{th} individual, respectively. To learn these parameters, we employed an expectation-maximization framework in which the E-step computes the genotype posteriors for every sample under the observed read repeat counts and the current stutter model. The M-step then utilizes these posterior probabilities to update the stutter model parameters for *N* samples, *A* alleles and *Q* reads as follows:
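One set of M-step updates consistent with the definitions above is sketched below. It assumes the standard maximum-likelihood updates for a model with multinomial allele frequencies and geometric stutter sizes; the shorthand $w_{i,j}$ for the E-step posterior and the indicator notation $\mathbb{1}[\cdot]$ are introduced here for illustration only.

$$
\begin{aligned}
w_{i,j} &= P(g_i = j \mid R, \Theta) \\
f_j &= \frac{1}{N} \sum_{i=1}^{N} w_{i,j} \\
u &= \frac{1}{Q} \sum_{i=1}^{N} \sum_{j=1}^{A} w_{i,j} \sum_{k} \mathbb{1}[r_{k,i} > j] \\
d &= \frac{1}{Q} \sum_{i=1}^{N} \sum_{j=1}^{A} w_{i,j} \sum_{k} \mathbb{1}[r_{k,i} < j] \\
\rho_s &= \frac{\sum_{i=1}^{N} \sum_{j=1}^{A} w_{i,j} \sum_{k} \mathbb{1}[r_{k,i} \neq j]}{\sum_{i=1}^{N} \sum_{j=1}^{A} w_{i,j} \sum_{k} \lvert r_{k,i} - j \rvert}
\end{aligned}
$$

Under these assumed updates, the allele frequencies average the posteriors over samples, the stutter frequencies are posterior-weighted fractions of reads above or below the genotype, and the geometric parameter is the ratio of the expected number of stutter reads to the expected total stutter size.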

In addition to PCR stutter, alignment errors may also cause reads to have a detected number of repeats that differ from their underlying genotype. As these errors are also incorporated when learning the stutter model, the stutter model accounts for the combined frequency of these errors and thereby generates robust posteriors.
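The EM procedure described above can be illustrated with a minimal Python sketch. It is not the implementation used in this study: the function names, the starting values, the fixed iteration count and the small `eps` floor that keeps parameters inside (0, 1) are all assumptions made for the example, and the M-step uses standard maximum-likelihood updates for a geometric stutter size.

```python
import math

def read_log_prob(r, g, u, d, rho):
    """log P(read shows r repeats | true allele g) under the stutter model."""
    if r == g:
        return math.log(1.0 - u - d)
    direction = u if r > g else d
    return math.log(direction * rho) + (abs(r - g) - 1) * math.log(1.0 - rho)

def genotype_posterior(reads, alleles, freqs, u, d, rho):
    """E-step for one sample: posterior over candidate haploid genotypes."""
    logs = [math.log(freqs[a]) + sum(read_log_prob(r, a, u, d, rho) for r in reads)
            for a in alleles]
    top = max(logs)  # subtract the maximum before exponentiating
    weights = [math.exp(x - top) for x in logs]
    total = sum(weights)
    return [w / total for w in weights]

def fit_stutter_model(samples, n_iter=25, eps=1e-6):
    """samples: one list of per-read repeat counts for each individual."""
    alleles = sorted({r for reads in samples for r in reads})
    freqs = {a: 1.0 / len(alleles) for a in alleles}
    u, d, rho = 0.05, 0.05, 0.9  # arbitrary starting values (assumption)
    n_reads = sum(len(reads) for reads in samples)
    for _ in range(n_iter):
        counts = dict.fromkeys(alleles, 0.0)
        up = down = step_total = 0.0
        for reads in samples:
            post = genotype_posterior(reads, alleles, freqs, u, d, rho)
            for a, w in zip(alleles, post):
                counts[a] += w
                for r in reads:
                    if r > a:
                        up += w
                    elif r < a:
                        down += w
                    step_total += w * abs(r - a)
        # M-step: posterior-weighted maximum-likelihood updates
        floored = {a: max(c / len(samples), eps) for a, c in counts.items()}
        norm = sum(floored.values())
        freqs = {a: f / norm for a, f in floored.items()}
        if step_total > 0:
            rho = min(max((up + down) / step_total, eps), 1.0 - eps)
        else:
            rho = 1.0 - eps
        u = min(max(up / n_reads, eps), 0.5 - eps)
        d = min(max(down / n_reads, eps), 0.5 - eps)
    return freqs, u, d, rho
```

Each inner list passed to `fit_stutter_model` holds the repeat counts seen in one individual's reads, so a call such as `fit_stutter_model([[10] * 9 + [9], [11] * 10])` returns the fitted allele frequencies and the three stutter parameters.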

### STR Mutation Model

We modeled STR mutations using a length-dependent variant of a generalized stepwise mutation model. The model is characterized by a per-generation mutation rate *μ*, a geometric step size distribution with parameter *ρ*_{m} and a spring-like length constraint *β* that causes alleles to mutate back towards a central allele denoted as having zero length. Alleles in this model can have negative values as an allele’s value merely indicates the deviation, in repeats, from the central allele. Given a starting allele *a*_{t}, the probability of observing allele *a*_{t + 1} the following generation is:

$$P(a_{t+1} = a_t + k \mid a_t) = \begin{cases} 1-\mu & k = 0 \\ \mu\, f_i\, \rho_m (1-\rho_m)^{k-1} & k > 0 \\ \mu\, f_d\, \rho_m (1-\rho_m)^{-k-1} & k < 0 \end{cases}$$

where the fractions of mutations increasing or decreasing the size of the STR are *f*_{i} and *f*_{d} = 1 – *f*_{i}, with *f*_{i} determined by *β* and the current allele. To avoid biologically implausible models, we constrained *β* to have non-negative values, where *β* = 0 reduces to a traditional generalized stepwise mutation model and increasingly positive values of *β* represent STRs with stronger tendencies to mutate back towards the central allele.
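As an illustration of the single-generation model, the sketch below evaluates the transition probability between two alleles. The function name is hypothetical, and the specific form used for the expansion fraction, *f*_{i} = max(0, min(1, 0.5(1 − *β* *ρ*_{m} *a*_{t}))), is an assumption chosen so that *β* = 0 gives symmetric steps and positive *β* pulls alleles back towards zero; it should not be read as the definitive formula.

```python
def step_probability(a_t, a_next, mu, beta, rho_m):
    """P(a_{t+1} = a_next | a_t) for one generation of the length-constrained model."""
    # ASSUMED form of the expansion fraction: linear in the allele, clipped to [0, 1].
    f_i = max(0.0, min(1.0, 0.5 * (1.0 - beta * rho_m * a_t)))
    f_d = 1.0 - f_i
    k = a_next - a_t
    if k == 0:
        return 1.0 - mu  # no mutation this generation
    # Geometric step size distribution on 1, 2, 3, ...
    geometric = rho_m * (1.0 - rho_m) ** (abs(k) - 1)
    return mu * (f_i if k > 0 else f_d) * geometric
```

For an allele two repeats above the central allele, `step_probability(2, 1, 1e-3, 0.3, 0.8)` exceeds `step_probability(2, 3, 1e-3, 0.3, 0.8)`, reflecting the tendency to contract back towards the center.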

### Mutation Model Likelihood

We utilized Felsenstein’s pruning algorithm (Felsenstein 1981) to evaluate the likelihood of an STR mutation model. Given a model *M*, dataset *D* comprised of observed STR genotypes and a SNP-based phylogeny *T* with root node *R*, the likelihood is

$$P(D \mid M, T) = \sum_{g} P(R = g)\, P(D_R \mid R = g, M)$$

Due to the structure of the phylogeny, the conditional probability of the data *D*_{Ni} below each interior node *N*_{i} given the node’s genotype can be expressed in terms of transition probabilities to each child node *C*_{j} and the conditional probability of the data *D*_{Cj} in its subtree:

$$P(D_{N_i} \mid N_i = g, M) = \prod_{j} \sum_{g'} P(C_j = g' \mid N_i = g, M)\, P(D_{C_j} \mid C_j = g', M)$$

While descending the phylogeny, this recursive relation applies until a node with no children is encountered. These nodes represent an observed sample and the conditional probability of the data in its subtree is merely given by its genotype likelihoods. Therefore, the likelihood of a mutation model can be calculated using a post-order tree traversal in which one computes the genotype likelihoods of each observed genotype and the conditional probability of the data in each interior node’s subtree given the node’s genotype. The total data likelihood is then readily computed using the root node’s conditional probabilities and a uniform prior. Because normalizing the genotype likelihoods of each sample does not affect the relative model likelihoods, one can use genotype posteriors calculated using a uniform prior interchangeably. In addition, to avoid numerical underflow issues, we compute the total log-likelihood of the data instead of the raw likelihood.
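The post-order computation can be sketched as follows. This is a minimal illustration rather than the code used in the study: the tree encoding (a dictionary mapping each node to its `(child, branch_length)` pairs), the `transition` callback and the per-node rescaling used to avoid underflow are assumptions made for the example, and the recursive traversal would need to be made iterative for very deep phylogenies.

```python
import math

def log_likelihood(tree, root, leaf_probs, transition, n_alleles):
    """Log-likelihood of the leaf data under a mutation model.

    tree:       {node: [(child, branch_length), ...]}; leaves have no entry
    leaf_probs: {leaf: [P(data | allele index g) for g in range(n_alleles)]}
    transition: branch_length -> matrix T with T[g][h] = P(child = h | parent = g)
    """
    log_scale = 0.0

    def conditional(node):
        nonlocal log_scale
        children = tree.get(node, [])
        if not children:
            return list(leaf_probs[node])
        probs = [1.0] * n_alleles
        for child, length in children:
            child_probs = conditional(child)
            T = transition(length)
            for g in range(n_alleles):
                probs[g] *= sum(T[g][h] * child_probs[h] for h in range(n_alleles))
        # Rescale and remember the factor so deep trees do not underflow.
        top = max(probs)
        log_scale += math.log(top)
        return [p / top for p in probs]

    root_probs = conditional(root)
    # Uniform prior over the root allele.
    return log_scale + math.log(sum(root_probs) / n_alleles)
```

In practice the `transition` callback would return the multi-generation transition probabilities for the branch, and the leaf vectors would be the genotype posteriors computed from the stutter model.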

### STR Transition Probabilities

To accelerate the computation of parent-to-child transition probabilities along each branch in the phylogeny, we devised a means of rapidly computing the STR transition probabilities across hundreds of generations. Given a mutation model *M* and a vector of allele probabilities for generation *t*, the probability of observing allele *v* in the next generation is

$$P(a_{t+1} = v \mid M) = \sum_{w} P(a_{t+1} = v \mid a_t = w, M)\, P(a_t = w)$$

To calculate the probability of observing each allele in the next generation, we construct an *N*-by-*N* transition matrix *Γ*, where *N* is the number of STR alleles, rows 1, 2, …, *N* correspond to consecutive alleles from the smallest to the largest allele in the modeled range and each column represents the transition probabilities from one allele to all other alleles. We modify this matrix such that the first and last columns have one non-zero entry along the diagonal to prevent the boundary states from mutating and provide an assessment of how frequently they are encountered. We also modify the first and last rows of the matrix so that they represent transition inequalities (transitions to alleles at or beyond each boundary) that result in normalized transition probabilities. Recursive application of the transition matrix then readily results in the allele probabilities after *M* generations:

$$\mathbf{p}_{t+M} = \Gamma^{M}\, \mathbf{p}_{t}$$

We balance the tradeoff between computation time and boundary state collisions by utilizing the smallest allele range such that the minimum and maximum observed STR alleles have less than a 10^{−5} probability of drifting into the boundary states when progressing from the root node to the deepest leaf node.

### Numerical Optimization of the Mutation Model

Given STR genotypes for a locus of interest, we developed a maximum likelihood approach to estimate the underlying mutation model. Our approach first estimates the central allele of the mutation model by computing the median observed STR length and then normalizes all genotypes relative to this reference point. It then randomly selects mutation model parameters *μ*, *β*, and *ρ*_{m} subject to the constraint that they lie within the ranges of 10^{−5} – 0.05, 0 – 0.75 and 0.5 – 1.0, respectively. Using these bounds, the Nelder-Mead optimization algorithm (Nelder and Mead 1965) and the outlined method for computing each model’s likelihood, the numerical optimization method iteratively updates the mutation model’s parameters until the likelihood converges. After repeating this procedure using three different random initializations to increase the probability of discovering a global optimum, it selects the optimized set of parameters with the highest total likelihood.

### Simulating Exact STR Genotypes

Values of *μ*, *β*, and *ρ*_{m} ranging from 10^{−5}–10^{−2}, 0–0.75, and 0.6–1.0, respectively, were used to simulate genotypes under a host of different mutation models. Using either the 1KGP phylogeny or the SGDP phylogeny, each simulation was performed as follows:

1. Randomly assign the root node an STR allele between −4 and 4 and mark it as active
2. Remove an active node and mark it as inactive. For each of this node’s children:
   1. Calculate the child’s allele probabilities using the branch length, the true mutation model and the parent node’s genotype
   2. Randomly select an STR allele based on these probabilities
   3. Mark the descendant node as active
3. While active nodes remain, go to step 2
4. Report the exact STR alleles for a random subset of the samples (leaf nodes) based on the required sample size
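The simulation steps above can be sketched in Python as follows. Two simplifications are assumptions of this example rather than part of the procedure: child alleles are drawn by stepping through each branch one generation at a time instead of computing the full allele probability vector, and mutations are symmetric (no length constraint). The function names and the tree encoding are hypothetical.

```python
import random

def sample_child(parent_allele, generations, mu, rho_m, rng):
    """Draw a child allele by simulating each generation along the branch."""
    allele = parent_allele
    for _ in range(generations):
        if rng.random() < mu:
            size = 1
            while rng.random() >= rho_m:  # geometric step size on 1, 2, 3, ...
                size += 1
            allele += size if rng.random() < 0.5 else -size
    return allele

def simulate_genotypes(tree, root, mu, rho_m, n_report, seed=0):
    """tree: {node: [(child, branch_length_in_generations), ...]}"""
    rng = random.Random(seed)
    alleles = {root: rng.randint(-4, 4)}  # step 1: random root allele
    active = [root]
    while active:                         # step 3: repeat while nodes remain
        node = active.pop()               # step 2: deactivate one node
        for child, generations in tree.get(node, []):
            alleles[child] = sample_child(alleles[node], generations, mu, rho_m, rng)
            active.append(child)
    leaves = sorted(n for n in alleles if not tree.get(n))
    chosen = rng.sample(leaves, n_report)  # step 4: report a random subset
    return {leaf: alleles[leaf] for leaf in chosen}
```

Passing a fixed `seed` makes a simulated dataset reproducible, which is convenient when comparing estimators on identical genotypes.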

### Simulating STR Sizes in Reads with PCR Stutter

We first used the procedure above to simulate STR genotypes down the phylogeny. The true genotype for a particular sample *g _{i}*, in concert with a given stutter model, was then utilized to simulate the STR sizes observed in each read as follows:

1. Sample the number of observed reads *n*_{reads,i} for each sample with genotype *g*_{i} from the read count distribution.
2. For each read from 1 through *n*_{reads,i}, sample a number *n* ~ U(0,1)
   1. If *n* < *d*, randomly sample an artifact size *a*_{j} from a geometric distribution with parameter *ρ*_{s}. Report the read’s STR size as *g*_{i} – *a*_{j}
   2. If *d* ≤ *n* < 1 – *u*, report the read’s STR size as *g*_{i}
   3. Otherwise, randomly sample an artifact size *a*_{j} from a geometric distribution with parameter *ρ*_{s}. Report the read’s STR size as *g*_{i} + *a*_{j}
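The read simulation steps above translate directly into a short Python sketch. The function names are hypothetical, and representing the read count distribution as a list of observed per-sample read counts to draw from is an assumption of this example.

```python
import random

def geometric(rho, rng):
    """Sample an artifact size from a geometric distribution on 1, 2, 3, ..."""
    size = 1
    while rng.random() >= rho:
        size += 1
    return size

def simulate_reads(genotype, read_counts, u, d, rho_s, rng):
    """Simulate the STR sizes observed in the reads of one sample."""
    n_reads = rng.choice(read_counts)  # step 1: draw the number of reads
    sizes = []
    for _ in range(n_reads):           # step 2: one uniform draw per read
        n = rng.random()
        if n < d:                      # stutter removes repeats
            sizes.append(genotype - geometric(rho_s, rng))
        elif n < 1.0 - u:              # read reflects the true genotype
            sizes.append(genotype)
        else:                          # stutter adds repeats
            sizes.append(genotype + geometric(rho_s, rng))
    return sizes
```

For example, `simulate_reads(12, [3, 5, 8], u=0.02, d=0.05, rho_s=0.9, rng=random.Random(1))` returns a list of 3, 5 or 8 repeat counts, most of which equal 12.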

To assess whether estimates would be accurate for even the most sparsely sequenced loci, we used read count distributions obtained from both Y-STR call sets (see below) corresponding to loci in the 10^{th} percentile by coverage. We also used a read count distribution representative of the median coverage in the SGDP dataset to assess performance at higher coverage.

### Collection of previously published mutation rate estimates

We collated STR mutation rates from two previous large studies to enable comparison with our mutation rate estimates (Ballantyne et al. 2010; Burgarella and Navascues 2011). Utilizing only the estimates obtained from analyzing thousands of father-son pairs, we collected at least one mutation rate estimate for nearly 190 Y-STRs. To associate each marker with a locus in the reference genome, we utilized the published set of primer sequences and the isPCR tool (Hinrichs et al. 2006) to map the primers to hg19 coordinates. We then ran Tandem Repeats Finder (Benson 1999) (TRF) on each region and pinpointed the coordinates using the published repeat structure (Ballantyne et al. 2010) to generate a list of annotated STR regions. We also ran TRF on additional regions previously published as part of comprehensive Y-STR maps to obtain coordinates for a total of 261 annotated Y-STRs (Hanson and Ballantyne 2006).

### STR Region Selection

We ran TRF on the hg19 assembly of the human reference genome and utilized previously described score thresholds to select only those regions significantly more repetitive than random genomic DNA (Willems et al. 2014). As this tool occasionally reports multiple overlapping repeats for a single genomic region, we merged overlapping entries in which the highest scoring entry contained 85% of the bases in the entries’ union. Overlapping entries that failed this criterion but had the same period were further merged as they frequently represent loci comprised of two neighboring motifs (e.g. [GATA]_{10} [TACA]_{8}), while the remaining regions were omitted. We further removed regions that overlapped the annotated markers, failed to liftOver (Hinrichs et al. 2006) to the GRCh38 assembly or were lifted to the X chromosome. We then generated the complete STR reference using these regions and the annotated STRs described above.

### Y-STR Call Set Generation

We downloaded BWA-MEM (Li 2013) alignments for 179 male and 108 female SGDP samples from the project website and extracted and merged the Y-chromosome alignments into a single BAM file using SAMtools (Li 2013). STR genotypes were then generated using HipSTR, a multi-sample haplotype-based STR caller that specifically accounts for the PCR stutter artifacts that drive most STR genotyping errors. HipSTR was run using the merged BAM, the hg19 STR regions described above, and the options *--min-reads 25 --haploid-chrs chrY --hide-allreads*. Similarly, we downloaded BWA-MEM alignments for 1320 male and 1371 female samples in the 1KGP phase 3 data release. As these alignments were relative to the GRCh38 assembly, we ran HipSTR using the corresponding GRCh38 STR regions and the options *--min-reads 100 --haploid-chrs chrY --hide-allreads*.

### Y-STR Call Set Filtration

To mitigate potential mutation estimate errors caused by pseudoautosomal and duplicated regions, we applied a series of stringent quality controls. We began by filtering the SGDP genotypes, as the 30X sequencing data and PCR-free protocol provided the highest quality dataset. To remove Y-STRs with putative homologous sites on the X chromosome, loci with more than 2 genotyped females were discarded. We further removed sites where more than 7.5% of reads had an indel in the STR flanks or 15% of reads had a stutter artifact, statistics that HipSTR reports based on the maximum likelihood alignment of each read relative to its sample’s most probable haplotype. These loci likely represent instances in which duplicated copies of a polymorphic locus are mapping to a single reference genome locus, HipSTR failed to generate sufficient candidate alleles or the STR is flanked by an indel. For the remaining loci, we discarded unreliable calls on a per-sample basis if more than 10% of an individual’s reads had an indel in the flanks. Because the mutation model outlined above assumes that all alleles are integral copies of the repeat unit, we discarded loci where more than 5% of samples’ genotypes violated this assumption. To avoid errors introduced by neighboring repeats, we omitted genotyped loci that overlapped one another or multiple STR regions, an issue that can arise when HipSTR expands an STR region to include proximal indels. Finally, we removed loci in which fewer than 100 samples had genotype posteriors above 66%, as these loci had too few samples for accurate inference.

To filter the 1000 Genomes call set, we first removed loci that did not pass the SGDP dataset filters. We then applied a set of filters identical to those described above except that we only removed loci with more than 15 genotyped females and did not apply a stutter frequency cutoff. These alterations account for the 1000 Genomes dataset’s larger sample size and lack of a PCR-free protocol.

### Estimating Y-STR Mutation Rates

For each locus in the SGDP and 1KGP call sets that passed the requisite quality control filters, we first used the EM algorithm to learn a PCR stutter model. The read STR sizes required to run this algorithm were obtained from the MALLREADS VCF field in which HipSTR reports the maximum likelihood STR size observed in each read that spans its sample’s most probable haplotype. In conjunction with a uniform prior, this stutter model was then used to compute the genotype posteriors for each sample with a HipSTR quality score greater than 0.66. Samples with quality scores below this threshold were omitted because the genotype uncertainty can result in erroneous reported read sizes. Finally, in conjunction with the optimization procedure and the appropriately scaled Y-SNP phylogeny, these genotype posteriors were used to obtain a point estimate of the mutation rate.

### Confidence Interval Estimation

We utilized a delete-*d* jackknife approach to estimate mutation rate confidence intervals (Shao and Wu 1989). For each Y-STR, we sampled without replacement half of the STR genotypes above the genotype posterior threshold a total of 250 times and recalculated the log mutation rate using each of these subsets. Given these subsample estimates and the log estimate obtained using all samples, the standard error (SE) and confidence interval (CI) for the log mutation rate were calculated according to:
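A hedged sketch of one way to carry out this calculation is given below. It assumes the usual delete-*d* jackknife variance with randomly drawn subsets, centered on the full-sample estimate, and a normal-approximation interval with a critical value of 1.96; these choices, the function name and the `estimator` callback are assumptions of the example and may differ from the exact expressions used in the study.

```python
import math
import random

def jackknife_ci(samples, estimator, n_subsamples=250, z=1.96, seed=0):
    """Delete-d jackknife SE and CI using random half-sized subsamples."""
    rng = random.Random(seed)
    n = len(samples)
    keep = n // 2   # size of each subsample
    d = n - keep    # number of observations deleted
    full = estimator(samples)
    subs = [estimator(rng.sample(samples, keep)) for _ in range(n_subsamples)]
    # ASSUMED variance formula: (n - d) / (d * S) * sum of squared deviations
    # of the S subsample estimates from the full-sample estimate.
    variance = (n - d) / (d * n_subsamples) * sum((s - full) ** 2 for s in subs)
    se = math.sqrt(variance)
    return full, se, (full - z * se, full + z * se)
```

When exactly half of the samples are deleted, the factor (n − d)/d equals one, so the standard error reduces to the root mean squared deviation of the subsample estimates from the full-sample estimate.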

### Effective Number of Meioses

For each phylogeny, we computed the sum of the branch lengths in generations after scaling (see results section). This resulted in estimates of ~177,600 and ~72,600 meioses in the 1KGP and SGDP phylogenies, respectively.
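The calculation amounts to summing every branch length and multiplying by the scaling factor, as in the sketch below. The tree encoding and function name are hypothetical, and the example assumes that branch lengths are converted to generations by simple multiplication with the chosen factor. The two totals reported above sum to roughly 250,200 meioses across both phylogenies.

```python
def effective_meioses(tree, scaling_factor):
    """Total tree length in generations.

    tree: {node: [(child, branch_length), ...]} with unscaled branch lengths
    """
    raw_total = sum(length for children in tree.values() for _child, length in children)
    return raw_total * scaling_factor
```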

### Estimating the Number of De Novo Mutations

To predict the number of de novo mutations on paternally inherited chromosomes:

1. We constructed a genome-wide reference of STRs using an approach identical to that for Y-STRs
2. We bootstrapped Y-STR loci for each repeat unit 2-4 base pairs in length 1000 times. For each bootstrapped dataset and each repeat unit length, we
   1. Built a sequence-based mutation rate model using the sampled Y-STRs
   2. Utilized the fitted models to predict the mutation rate of each locus in the genome-wide reference with the same repeat unit based on its sequence properties
   3. Summed the resulting values to obtain an aggregate mutation estimate
3. We selected the 5^{th} and 95^{th} percentiles of aggregated estimates to obtain a 95% confidence interval for each repeat unit length

To build a sequence-based mutation rate model for each motif length, we assigned all fixed Y-STRs a log mutation rate of -5 (the minimum bound during optimization), all polymorphic Y-STRs the mean log estimated mutation rate between the two WGS datasets and utilized numerical optimization to fit a model of the form
where *T* is a threshold, *s* is the slope of the line and *I* is the length of the longest uninterrupted tract for each locus. These fitted models provide an estimate of the mean rate for a given tract length, but to account for uncertainty and to omit estimates for loci below the mutation rate optimization threshold, we predicted mutation rates as follows:
where *t*_{N−2} is sampled from a t-distribution with *N* – 2 degrees of freedom and *N, L _{j}* and are the number, length and mean length of Y-STRs used to fit each model, respectively. To avoid potential biases, we did not generate predictions for loci whose tract lengths were above the maximum length used to train each model.

### Y-STR Imputation Method

Given a set of samples with Y-SNP genotypes and a reference panel with Y-SNP and Y-STR genotypes, we extended the mutation rate estimation method to impute missing STR genotypes. Using the approach outlined in Figure 1, we first construct a phylogeny relating all samples and learn a mutation model. We then use this learned mutation model to pass two sets of messages along the tree and compute exact posteriors for each node. Samples with observed genotypes correspond to leaves in the tree and their posteriors represent imputation probabilities. In particular, for a node *N*_{i} in a binary phylogeny with parent *P*_{i}, sibling *S*_{i} and children *C*_{1i} and *C*_{2i}, its probability conditioned on the observed genotypes is given by

$$P(N_i \mid D) \propto P(D_{N_i} \mid N_i)\, P(N_i \mid D_{-N_i})$$

where *D*_{Ni} and *D*_{–Ni} denote the data in and not in node *N*_{i}’s subtree, respectively. The first term in this expression is computed using a bottom-up traversal of the tree from the leaves to the root node. Each node in the tree combines the probabilities of its two children using the recurrence

$$P(D_{C_{1i}} \mid C_{1i}) = \prod_{j=1}^{2} \sum_{GC_{ji}} P(GC_{ji} \mid C_{1i})\, P(D_{GC_{ji}} \mid GC_{ji})$$

where *GC*_{1i} and *GC*_{2i} denote the two children of node *C*_{1i}. This recurrence applies to all nodes except the leaves, where genotype posteriors or a uniform prior are used for samples with and without genotype information, respectively. Similarly, the second term in the node posterior expression is computed using a top-down traversal of the tree from the root to the leaves. After assigning the root node a uniform probability, each node combines information from its parent and sibling:

$$P(N_i \mid D_{-N_i}) \propto \sum_{P_i} P(N_i \mid P_i)\, P(P_i \mid D_{-P_i}) \sum_{S_i} P(S_i \mid P_i)\, P(D_{S_i} \mid S_i)$$
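The two message-passing traversals can be sketched in Python as shown below. This is an illustrative sketch under stated assumptions, not the study's implementation: the tree encoding, the `transition` callback and the function names are hypothetical, every message is renormalized for numerical stability (which leaves the posteriors unchanged), and the recursion would need to be made iterative for very deep trees.

```python
def normalise(values):
    total = sum(values)
    return [v / total for v in values]

def node_posteriors(tree, root, leaf_probs, transition, n_alleles):
    """Posterior allele probabilities for every node given all observed leaves.

    tree:       {node: [(child, branch_length), ...]}; leaves have no entry
    leaf_probs: {leaf: genotype posteriors}; leaves without data may be omitted
    transition: branch_length -> matrix T with T[g][h] = P(child = h | parent = g)
    """
    alleles = range(n_alleles)
    up, msg, down = {}, {}, {}

    def upward(node):
        children = tree.get(node, [])
        if children:
            probs = [1.0] * n_alleles
        else:
            # Uniform vector for samples without genotype information.
            probs = list(leaf_probs.get(node, [1.0] * n_alleles))
        for child, length in children:
            upward(child)
            T = transition(length)
            msg[child] = [sum(T[g][h] * up[child][h] for h in alleles) for g in alleles]
            probs = [p * m for p, m in zip(probs, msg[child])]
        up[node] = normalise(probs)

    def downward(node):
        children = tree.get(node, [])
        for child, length in children:
            # Combine the parent's outside information with the sibling messages.
            context = list(down[node])
            for sibling, _ in children:
                if sibling != child:
                    context = [c * m for c, m in zip(context, msg[sibling])]
            T = transition(length)
            down[child] = normalise(
                [sum(T[g][h] * context[g] for g in alleles) for h in alleles])
            downward(child)

    upward(root)
    down[root] = [1.0 / n_alleles] * n_alleles  # uniform prior at the root
    downward(root)
    return {n: normalise([a * b for a, b in zip(up[n], down[n])]) for n in up}
```

The entries returned for leaves that were omitted from `leaf_probs` are the imputation probabilities for those samples.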

## Results

### Mutation Rate Estimates with Perfect Genotypes

We validated our mutation rate estimation algorithm by simulating STR genotypes under various mutation models and assessing how accurately we could recover the true mutation rate when given error-free observations. For the majority of models, we obtained an unbiased estimate of the log mutation rate using both phylogenies (**Figure S1**). Slight upward biases were observed for the smallest simulated mutation rate (10^{−5} mpg), but these stem from the lower bound imposed during numerical optimization. As previous studies have developed estimators involving simplified mutation models, we sought to assess how these simplifications might affect estimates. Restricting the optimized models to single step mutations resulted in stronger upward biases for low mutation rate scenarios, strong downward biases for high mutation rate scenarios and higher variance in the estimates (**Figure S2**). The effect of disabling the length constraint for optimized models was much less pronounced, but also resulted in large downward biases for many rapidly mutating scenarios. Altogether, these results illustrate that if given error-free genotypes, our method is well powered to obtain accurate estimates across a host of mutation scenarios but that making overly simplistic assumptions about STR mutation models can result in marked biases.

### Mutation Rate Estimates with PCR Stutter

We extended the above simulations to introduce the effects of PCR stutter, a primary driver of STR genotyping errors. After simulating STR genotypes under various mutation models, we generated observed reads using various stutter models and distributions of reads per sample. Application of the EM algorithm to this data resulted in relatively unbiased estimates of each stutter model parameter for nearly all scenarios (**Figure S3**). Slight downward biases were observed for the geometric step size parameter when both stutter frequencies were 1%, but this is likely caused by the scarcity of informative instances of stutter in this setting. The variance of the stutter parameter estimates decreased substantially with increases in sample size and mean number of reads per sample, as these led to more stutter-informative reads.

We next sought to assess whether the stutter parameter estimates were sufficiently precise for mutation rate inference. To this end, we estimated mutation rates after computing genotype posteriors using the learned stutter models. For comparison, we also generated estimates when posteriors were computed with exact knowledge of the stutter model or using a naive approach based on the fraction of reads supporting each allele. In scenarios with low average coverage, the fraction-based posteriors resulted in marked biases, particularly for low mutation rates, demonstrating the importance of correctly accounting for stutter artifacts in these settings (**Figure 2, Figures S4-S5**). In contrast, posteriors generated using the estimated and exact stutter models obtained relatively unbiased mutation rate estimates across all scenarios and yielded estimates with similar variance. The primary exception to these trends was the slight upward bias observed for rates of 10^{−5} mpg, but this bias was also observed in the simulations with exact genotypes. Collectively, these results indicate that the combination of the EM and mutation rate algorithms obtains robust estimates suitable for downstream analyses.

### Call Set Validation

To assess the level of genotyping errors present in each call set, we stringently filtered each call set and compared them to capillary electrophoresis datasets involving a subset of the same male samples. For 565 samples in the 1000 Genomes Project, the concordance for 3500 calls at 13 loci in the PowerPlex Y23 panel was 97.5%, indicating that the low coverage data was not prohibitive for obtaining accurate Y-STR genotypes. An analogous comparison of 3300 calls at 48 loci for 76 SGDP samples resulted in an even higher concordance of 99.7%. These comparisons were restricted to loci with 3-5 base pair motifs and therefore may not reflect the quality for loci with shorter motifs due to their increased propensity for stutter. Nonetheless, they are indicative of the high quality of the data for larger repeat motifs.

### Scaling Phylogeny Branch Lengths

Although the maximum-likelihood phylogeny generated for each dataset has numerical branch lengths, these lengths are not scaled in units of generations as required by our method. We therefore sought to determine an appropriate scaling factor using mutation rate estimates for 15 loci in the Y-chromosome Haplotype Reference Database (YHRD) (Willuweit et al. 2007). We chose these loci as a calibration point because their estimates are based on more than 7000 father-son pairs per locus and should therefore be relatively precise. For the 1000 Genomes data, we used the PowerPlex capillary data for each locus, assumed error-free genotypes, scaled the phylogeny using a range of factors and estimated the set of mutation rates using each scaling factor. The choice of scaling factor had essentially no effect on the correlation with the YHRD estimates, resulting in an R^{2} of 0.89 (**Figure S6**). However, the total squared error between the estimates was minimized for a factor of 2800, which we therefore selected as the optimal scaling. For the SGDP data, we performed an analogous analysis using HipSTR genotypes for 9 of these 15 loci, again resulting in a uniform R^{2} of 0.91 and an optimal factor of roughly 3200 (**Figure S6**).
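
The selection of a scaling factor by minimizing total squared error can be sketched as a simple grid search. The snippet below is an illustration rather than the procedure used in this study: it assumes, for simplicity, that a per-generation rate equals a rate per unit of unscaled branch length divided by the scaling factor, whereas the analysis above re-estimated all rates for each factor. The function names and toy values are hypothetical.

```python
def total_squared_error(estimates, reference):
    """Sum of squared differences between two equally long lists of rates."""
    return sum((e - r) ** 2 for e, r in zip(estimates, reference))


def best_scaling_factor(rates_per_branch_unit, reference_rates, factors):
    """Return the factor whose rescaled rates best match the reference rates.

    Simplifying assumption (not from the study): dividing a rate expressed per
    unit of unscaled branch length by the factor gives a per-generation rate.
    """
    errors = {}
    for factor in factors:
        per_generation = [r / factor for r in rates_per_branch_unit]
        errors[factor] = total_squared_error(per_generation, reference_rates)
    best = min(errors, key=errors.get)
    return best, errors


# Toy calibration set (hypothetical values, not YHRD estimates).
reference_rates = [2.0e-3, 3.0e-3, 1.0e-3]
rates_per_branch_unit = [r * 2800 for r in reference_rates]
best, errors = best_scaling_factor(
    rates_per_branch_unit, reference_rates, range(2000, 4001, 100)
)
print(best)
```

Under this simplified assumption, changing the factor shifts every log rate by the same constant, which would leave the correlation between log rates unchanged while still altering the squared error.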

An alternative approach to scaling each phylogeny is to select the factor that best matches the total number of generations in the tree to the value based on published Y-SNP mutation rates. To explore how this approach might impact the scaling, we calculated factors using a recently published Y-SNP mutation rate of 3E-8 mutations per generation (Xue et al. 2009; Helgason et al. 2015) and the total numbers of called SNPs and called sites in each SNP dataset. The resulting estimates for the 1KGP data and SGDP data were remarkably concordant with those above, as they were only 14% and 34% greater. However, to maximize our concordance with pedigree estimates, we chose to utilize the first set of scaling factors outlined above.
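
One way to picture the SNP-based alternative is the arithmetic below. It is a hedged sketch that assumes the published rate is expressed per site per generation and that every called SNP corresponds to a single mutation on the tree; the function name and the example numbers are hypothetical and are not taken from either dataset.

```python
def snp_clock_scaling_factor(n_snps, n_sites, snp_rate, unscaled_tree_length):
    """Scaling factor implied by a SNP molecular clock.

    Assumptions (illustrative only):
      * snp_rate is in mutations per site per generation;
      * each called SNP arose from one mutation somewhere on the tree.
    """
    # Expected mutations per generation across all called sites.
    mutations_per_generation = snp_rate * n_sites
    # Total generations summed over every branch of the tree.
    total_generations = n_snps / mutations_per_generation
    # Factor that converts unscaled branch lengths into generations.
    factor = total_generations / unscaled_tree_length
    return factor, total_generations


# Hypothetical inputs chosen only to make the arithmetic easy to follow.
factor, total_generations = snp_clock_scaling_factor(
    n_snps=60_000, n_sites=10_000_000, snp_rate=3e-8, unscaled_tree_length=70.0
)
print(round(total_generations), round(factor))
```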

### Y-STR Stutter Models

We applied the EM algorithm to each of the filtered call sets to learn per-locus stutter models. Across both datasets, the learned parameters demonstrated a strong bias in favor of stutter-induced contractions over expansions for nearly all loci (**Figure S7**). The stutter parameter estimates were highly correlated between the two datasets, reflecting the algorithm’s ability to capture each locus’ distinctive error profile, but as expected the PCR-free protocol resulted in significantly lower stutter rates for the SGDP dataset relative to the 1KGP dataset (**Figure S7**). Within each dataset, the rates of stutter exhibited an inverse correlation with repeat unit size for a given allele length and a positive correlation with allele length for a given repeat unit size (**Figure S8**).

### Y-STR Mutation Rate Estimates

After utilizing the learned stutter models to compute genotype posteriors, we applied the mutation rate estimation algorithm to each polymorphic locus, resulting in estimates for 702 loci with 2-6 base pair motifs. Stratifying these estimates by motif length indicated a wide degree of variability both within and between classes (**Figure 3**). Within each class, mutation rates varied by two or more orders of magnitude, indicating that Y-STR mutation rates are highly dependent on the loci under consideration. Relative to other Y-STR classes, loci with previously characterized rates had substantially higher estimates, illustrating that they have been selected for their high mutability (**Figure 3, Tables 1-2**). While the bulk of uncharacterized loci with tri- and tetranucleotide motifs were substantially less polymorphic, we identified 29 of these loci with mutation rates greater than 10^{−3.5} mpg, of which the five most mutable loci had rates ranging from 10^{−2.44} to 10^{−2.29} mpg (**Table 2**). Dinucleotide repeats were also highly polymorphic and 70 of these loci had mutation rates above 10^{−3.5} mpg.

### Mutation Rate Concordance

In order to assess the reliability of our mutation rate estimates, we measured the concordance between the two sets of WGS-based estimates obtained in this study. Despite substantial differences in the quality of the sequencing data, the analyzed populations and the study sizes, we obtained an R^{2} of 0.92 between the 1KGP and SGDP log mutation rate estimates (**Figure 4**). This high concordance extended to slowly mutating markers, as estimates for loci with a mean estimated mutation rate below 10^{−3.5} mpg had an R^{2} of 0.80. To assess the potential impact of genotyping errors on these estimates, we regenerated them using the 1000 Genomes capillary genotypes for 23 loci, resulting in R^{2} of 0.98 and 0.94 with the SGDP and 1KGP estimates for the same loci. These comparisons illustrate that our method obtains robust locus-specific values while accounting for varying degrees of PCR stutter artifacts and genotyping errors. Furthermore, the inter-dataset concordance suggests that there are either very few errors in the phylogenies or that these errors have little impact on the resulting mutation rate estimates.

Next, we assessed the accuracy of our mutation rate estimates by comparing them to results from prior studies based on roughly 1500 and 500 father-son transmissions per Y-STR (**Figure 4**) (Ballantyne et al. 2010; Burgarella and Navascues 2011). The R^{2} between these two studies was only 0.34, a low concordance that likely stems from the small sample size and large uncertainty in the Burgarella et al. estimates. By comparison, the SGDP and Ballantyne et al. estimates had an R^{2} of 0.66. Although markedly higher, this concordance was substantially reduced by the plateau in the Ballantyne et al. estimates at 10^{−3.5} mpg, a threshold that stems from loci without any detected mutations. While accurately characterizing these loci using the father-son approach would require tens of thousands of additional pairs, our method easily obtains replicable estimates below this threshold by leveraging over 222,000 meioses in the phylogenies. An analogous comparison of the SGDP and Burgarella et al. estimates resulted in an R^{2} of 0.32. However, restricting this comparison to a subset of loci characterized using more than 5000 father-son pairs resulted in a substantially higher R^{2} of 0.87 (**Figure S9**). Collectively, these comparisons demonstrate that our method accurately replicates father-son estimates when those estimates are based on sufficient pairs, but that the father-son approach is poorly suited to obtaining point estimates for the mutation rates of slowly mutating markers.
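
The sample-size limitation of the father-son design can be illustrated with a short calculation. The sketch below assumes that each transmission is an independent trial with a fixed per-generation mutation rate; the function names and the example rate are hypothetical.

```python
def prob_no_mutations(mutation_rate, n_pairs):
    """Chance of seeing zero mutations across n_pairs independent transmissions."""
    return (1.0 - mutation_rate) ** n_pairs


def pairs_for_expected_mutations(mutation_rate, expected_mutations):
    """Number of father-son pairs at which the expected mutation count is reached."""
    return expected_mutations / mutation_rate


# Hypothetical slowly mutating locus: 1 mutation per 10,000 generations.
rate = 1e-4
p_zero = prob_no_mutations(rate, 1500)
n_needed = pairs_for_expected_mutations(rate, 3)
print(f"P(no mutations in 1500 pairs) = {p_zero:.2f}")
print(f"Pairs for 3 expected mutations = {n_needed:.0f}")
```

For a locus this slow, most studies of about 1500 pairs would observe no mutations at all, so the resulting estimate reflects the study size rather than the locus.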

### Discriminative Power

Given the large number of markers with novel mutation rates, we sought to assess the potential gains in discriminative power they might provide. We therefore computed the probability of observing at least one mutation over one generation for various groups of loci. Utilizing the full panel of 190 Y-STRs characterized by Ballantyne et al. resulted in a discrimination probability of 42%. Extending this set of markers to incorporate those with novel rates in this study increased this probability to 50%. However, because of the constraints imposed by the Illumina sequencing reads, we were unable to genotype many of the long markers in the Ballantyne et al. study, particularly most of the 13 rapidly mutating markers with mutation rates greater than 0.01 mpg. The subset of their markers we were able to genotype resulted in a discrimination probability of 12% and incorporating our novel marker estimates improved this probability to 24%.
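
The probability of observing at least one mutation over one generation can be computed as one minus the probability that no locus mutates. The sketch below assumes that loci mutate independently of one another; the function name and the example panel are hypothetical.

```python
import math


def discrimination_probability(mutation_rates):
    """Probability that at least one locus mutates in a single generation.

    Assumes independent loci, so P(no mutation) is the product of (1 - mu)
    over all loci. Logs are summed to keep the product numerically stable.
    """
    log_p_none = sum(math.log1p(-mu) for mu in mutation_rates)
    # 1 - exp(x), computed accurately even when x is close to zero.
    return -math.expm1(log_p_none)


# Hypothetical panel: 100 loci, each mutating once per 1000 generations.
panel = [1e-3] * 100
print(f"{discrimination_probability(panel):.3f}")
```

Because the contribution of each locus depends on its rate, a small number of rapidly mutating markers can contribute as much as many slowly mutating ones.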

### Sequence Determinants of Y-STR Mutability

To assess the extent to which sequence characteristics drive STR mutation rates, we analyzed how allele length, repeat-motif length, and interruptions to the repeat structure affect mutability. For STRs with and without interruptions, major allele length only explained a modest amount of the variance in log mutation rate for loci with di-, tri-, and tetra-nucleotide motifs (R^{2} = 0.16, R^{2} = 0.25, and R^{2} = 0.42) (**Figure 5**). Restricting this analysis to STRs without interruptions substantially improved the variance explained (R^{2} = 0.83, R^{2} = 0.67, and R^{2} = 0.82), suggesting that interruptions to the repeat structure disrupt the correlation between allele length and mutability. A subsequent analysis of the relationship between the log mutation rate and the length of the longest uninterrupted repeat tract indicated that this was a more general predictor of mutability (**Figure 5**), as it explained over 75% of the variance for each of the three motif lengths regardless of the number of interruptions. Stratifying loci with dinucleotide repeat units by motif indicated that these trends also apply at a much finer scale (**Figure S10**). Major allele length was once again a relatively poor predictor of the log mutation rate for loci with AC, AG and AT repeat motifs, but uninterrupted tract length explained over 80% of the variance for each motif.
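
The variance-explained values above can be read as the R^{2} of a regression of log mutation rate on a length measure. As a minimal sketch, assuming an ordinary least-squares line with an intercept (for which R^{2} equals the squared Pearson correlation), the calculation looks as follows; the function name and the toy data are hypothetical.

```python
def r_squared(x, y):
    """R^2 of a simple least-squares line y ~ a + b*x (squared Pearson r)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each vary")
    return sxy ** 2 / (sxx * syy)


# Toy data (hypothetical): tract length in repeat units vs log10 mutation rate.
tract_lengths = [10, 12, 14, 16]
log_rates_clean = [-4.0, -3.5, -3.0, -2.5]   # perfectly log-linear
log_rates_noisy = [-4.0, -3.0, -3.6, -2.5]   # same trend with scatter
print(r_squared(tract_lengths, log_rates_clean))
print(r_squared(tract_lengths, log_rates_noisy))
```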

### De Novo Mutations

We sought to use the mutation rates and sequence properties of each Y-STR to predict the expected number of genome-wide de novo mutations. Because the Y-STR mutation rates are only applicable to the male germ line due to differing numbers of meioses, we restricted this analysis to the number of de novo mutations on paternally inherited chromosomes. To generate a prediction, we built sequence-based models of Y-STR mutability, obtained per-locus estimates by applying these models to each locus in a genome-wide reference of STRs and aggregated the results (**Methods**). After utilizing bootstrapping and sampling techniques to account for uncertainty in both the fitted models and the predicted values, we obtained 95% confidence intervals of 27–34, 2–11 and 37–102 mutations for loci with di-, tri- and tetranucleotide motifs. The 95% confidence interval for the total number of expected de novo mutations on paternally inherited chromosomes was 72–140 mutations. These estimates are likely conservative as we omitted loci with 5–6 base pair motifs due to the relatively small number of Y-STRs and omitted genome-wide loci that were longer than the Y-STRs used to train each model.
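
The general shape of such a prediction can be sketched with a percentile bootstrap. The example below is a simplified stand-in for the procedure described in the Methods: it assumes a single log-linear model of mutation rate against uninterrupted tract length, resamples only the training loci, and therefore captures model uncertainty but not uncertainty in the predicted values. All names and numbers are hypothetical.

```python
import random


def fit_line(x, y):
    """Least-squares slope and intercept for y ~ intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sxx
    return slope, mean_y - slope * mean_x


def expected_mutations(slope, intercept, tract_lengths):
    """Sum of predicted per-generation rates (log10 scale model) over loci."""
    return sum(10 ** (intercept + slope * t) for t in tract_lengths)


def bootstrap_interval(train_x, train_y, target_x, n_boot=1000, seed=0):
    """Percentile 95% interval for the expected number of mutations."""
    rng = random.Random(seed)
    n = len(train_x)
    totals = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [train_x[i] for i in idx]
        if len(set(xs)) < 2:
            continue  # a line cannot be fitted to a single distinct x value
        ys = [train_y[i] for i in idx]
        slope, intercept = fit_line(xs, ys)
        totals.append(expected_mutations(slope, intercept, target_x))
    if not totals:
        raise ValueError("no usable bootstrap replicates")
    totals.sort()
    lower = totals[int(0.025 * len(totals))]
    upper = totals[min(len(totals) - 1, int(0.975 * len(totals)))]
    return lower, upper


# Toy training loci (hypothetical): tract length vs log10 mutation rate.
train_x = [8, 10, 12, 14, 16]
train_y = [-5.0, -4.6, -4.2, -3.8, -3.4]
lower, upper = bootstrap_interval(train_x, train_y, target_x=[9, 11, 13])
print(lower, upper)
```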

### Imputing Y-STRs

We extended the mutation rate estimation procedure to develop a Y-STR imputation approach. Briefly, after building a SNP phylogeny relating all samples and learning a mutation model as outlined in **Figure 1**, this approach passes two sets of messages along the phylogeny to compute the exact marginal posteriors for each node, resulting in imputation probabilities for samples without observed Y-STR genotypes. To assess the accuracy of this technique, we once again turned to the capillary PowerPlex Y23 genotypes for the 1KGP dataset, as this panel is one of the most commonly used in forensic and genealogical settings. Over 100 iterations, we randomly constructed reference and imputation panels of 500 and 70 samples, utilized the reference panel’s Y-STR genotypes to infer a mutation model and compute node posteriors, and compared the imputed genotypes for the imputation panel to their true underlying values. The resulting imputed probabilities roughly matched their true accuracy, indicating that the posteriors computed using this technique are well calibrated (**Figure S11**). When using all imputed genotypes, even those with probabilities below 50%, this approach resulted in an overall accuracy of 66% across markers (**Table 3**). However, discarding imputed genotypes with probabilities less than 70% resulted in an overall accuracy of 88% and retained more than 40% of the calls. On a marker-by-marker basis, accuracy was generally inversely proportional to the estimated mutation rates, with the most slowly mutating markers having accuracies on the order of 95%. This trend stems from the fact that as the mutation rate increases, shorter branch lengths are required to obtain an estimate with similar confidence.
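
The trade-off between accuracy and the fraction of calls retained can be made concrete with a small helper. This sketch assumes that each imputed genotype is summarized by its posterior probability and by whether it matched the true genotype; the function name and the toy calls are hypothetical.

```python
def accuracy_at_threshold(calls, threshold):
    """Accuracy and retained fraction after discarding low-confidence calls.

    calls: list of (posterior_probability, is_correct) pairs.
    Returns (accuracy among retained calls, fraction of calls retained).
    """
    if not calls:
        raise ValueError("calls must not be empty")
    kept = [is_correct for prob, is_correct in calls if prob >= threshold]
    if not kept:
        return float("nan"), 0.0
    return sum(kept) / len(kept), len(kept) / len(calls)


# Toy imputed calls (hypothetical posteriors and outcomes).
calls = [(0.9, True), (0.8, True), (0.6, False), (0.4, True), (0.3, False)]
print(accuracy_at_threshold(calls, 0.0))
print(accuracy_at_threshold(calls, 0.7))
```

If the posteriors are well calibrated, raising the threshold should increase accuracy among the retained calls at the cost of discarding more of them.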

## Discussion

Over the past two decades, tremendous advances in sequencing technology have fundamentally transformed the applications of Y-STRs. The initial scarcity of available SNP genotypes resulted in the development of methods capable of inferring coalescent models from Y-STR genotypes alone. Methods designed to also learn STR mutational dynamics either marginalized over these coalescent models (Nielsen 1997) or aimed to simultaneously infer the coalescent and mutational models (Wilson and Balding 1998; Wilson et al. 2003). With the advent of population-scale WGS datasets, many of these STR-centric approaches have instead utilized SNPs, resulting in substantially more detailed phylogenies. On the Y-chromosome, these detailed phylogenies now provide the evolutionary context required to interpret Y-STR mutations, obviating the need for expensive tree enumeration or marginalization approaches. However, the errors prevalent in WGS-based Y-STR genotypes require methods capable of accounting for genotype uncertainty, preventing the application of many traditional microsatellite distance measures designed for capillary data (Goldstein et al. 1995; Slatkin 1995).

In this study, we developed a novel method to leverage these datasets. One inherent advantage of our approach is its ability to model and learn many of the salient features of microsatellite mutations. Through the incorporation of a geometric step size distribution, we allow both single step mutations that predominate at tetranucleotide loci (Kayser et al. 2000; Sun et al. 2012) as well as multistep mutations that frequently occur at dinucleotide loci (Huang et al. 2002; Sun et al. 2012). In addition, the model’s length constraint parameter replicates the intra-locus phenomenon of shorter STR alleles preferentially expanding and longer alleles preferentially contracting (Xu et al. 2000; Huang et al. 2002). As these parameters are learned from the observed STR genotypes, our method avoids many biases that stem from imposing single-step mutations or assuming parameters a priori. Nonetheless, our mutation model does not capture the full complexity of STR mutational dynamics as it ignores intra-locus mutation rate variation (Ellegren 2000). Incorporating these and other mutational characteristics may be of interest in future studies.
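
To illustrate how a geometric step size distribution accommodates both single-step and multistep mutations, the sketch below uses the standard geometric form P(k) = p(1 − p)^{k−1} for a step of k repeat units. The parameter name p and the example values are assumptions made for illustration; the exact parameterization used in this study, including the length constraint, is described in the Methods and is not reproduced here.

```python
import random


def step_size_pmf(k, p):
    """Probability that a mutation changes the allele by k repeat units (k >= 1).

    Standard geometric form: larger p concentrates mass on single steps.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    if k < 1:
        return 0.0
    return p * (1.0 - p) ** (k - 1)


def sample_step_size(p, rng):
    """Draw a step size by counting trials until the first success."""
    k = 1
    while rng.random() >= p:
        k += 1
    return k


rng = random.Random(0)
# Hypothetical values: p near 1 mimics mostly single-step behaviour,
# while a smaller p allows frequent multistep mutations.
print([round(step_size_pmf(k, 0.95), 4) for k in (1, 2, 3)])
print([round(step_size_pmf(k, 0.70), 4) for k in (1, 2, 3)])
print([sample_step_size(0.70, rng) for _ in range(5)])
```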

In addition to its mutational model flexibility, our approach is advantageous because of its ability to leverage evolutionary data. From the large number of meioses in each phylogeny, it can obtain extremely replicable and accurate estimates, as demonstrated by the strong concordance between our WGS-based estimates and their strong concordance with father-son estimates based on sufficient pairs. In contrast, the estimates of Ballantyne et al. and Burgarella et al. showed poor concordance, likely due to the small number of pairs used in one of these studies. This underscores the fact that without vast numbers of samples, pedigree-based approaches cannot obtain precise point estimates for more slowly mutating markers. We believe that this limitation, coupled with our method’s ability to analyze any WGS dataset and hundreds of STRs in parallel, makes it a simple and scalable alternative to pedigree-based estimation approaches.

One longstanding concern regarding Y-STR mutation rates has been the apparent discrepancy between evolutionary and pedigree-based mutation rates. A host of studies have suggested that evolutionary rates are 3-4 times lower, resulting in substantial inconsistencies in Y-STR based lineage dating and large discrepancies from Y-SNP based TMRCA estimates (Zhivotovsky et al. 2004; Zhivotovsky et al. 2006; Wei et al. 2013b). Because this study harnessed evolutionary data, we sought to avoid any potential issues by scaling each phylogeny such that our estimates best matched those from pedigree-based studies. Nonetheless, our investigations into an alternative scaling based on a SNP molecular clock resulted in similar scaling factors that only differed by 15-30%. Coupled with the strong concordance we observed with pedigree estimates, our study provides little evidence for a substantial difference between mutation rates estimated from these two types of data.

Empowered by the accuracy and parallelizability of our method, we were able to obtain Y-STR mutation rates on an unprecedented scale. The set of estimates for over 700 polymorphic STRs is, to our knowledge, the largest Y-STR set to date, substantially expanding upon those previously obtained for 190 loci (Ballantyne et al. 2010; Burgarella and Navascues 2011). Two of the largest prior studies of autosomal STR mutability characterized 350 and 2500 markers using traditional family-based approaches (Huang et al. 2002; Sun et al. 2012), but these studies only observed mutations for 50 and 800 of these loci. As a result, the scope of our study also parallels those of the largest autosomal studies.

Despite the large-scale nature of our study, it has several inherent limitations. Because we analyzed sequencing datasets composed of 80-100 bp Illumina reads, we were unable to genotype and characterize the mutation rates of many long Y-STRs. Given the strong positive correlation between tract length and mutation rate observed here and in previous studies, we anticipate that many long dinucleotide loci will be extremely mutable and will add significant discriminative power to Y-STR panels. We were also unable to include homopolymers in this study despite their shorter lengths due to a rapid degradation in base quality scores, but we anticipate that many of these markers are also highly polymorphic. As a result, future studies may benefit from reapplying our analysis to both of these sets of markers as sequencing technologies, especially those enabling long reads, continue to mature.

Nonetheless, given the extent of our set of estimates, we were able to shed new light on the sequence factors governing STR mutability. While prior Y-STR studies have primarily focused on loci with longer alleles and 3–6 base pair motifs, the results here extend these analyses to shorter loci and repeats with dinucleotide motifs. In particular, we found that for all examined repeat unit lengths, the longest uninterrupted tract length is an extremely strong predictor of the log mutation rate, replicating the exponential trend between mutation rate and tract length previously observed in a host of pedigree-based studies (Brinkmann et al. 1998; Kayser et al. 2000; Xu et al. 2000; Ballantyne et al. 2010). In contrast, allele length alone was a poor predictor. Coupled with the fact that Y-STRs without interruptions were much more mutable than interrupted ones with the same major allele length, our study provides strong evidence that interruptions to the repeat structure decrease mutation rates. This finding supports what has long been posited in STR evolutionary models (Kruglyak et al. 1998; Sainudiin et al. 2004) and has been shown in a handful of small-scale experimental studies of STR mutability (Petes et al. 1997; Bacon et al. 2000) but contradicts the recent findings of Ballantyne et al., in which no effect was observed. This discrepancy may stem from the fact that they primarily considered longer repeats with uninterrupted tract lengths at least 8 repeat units long.

In addition to estimating Y-STR mutation rates, we have outlined a Y-STR imputation method that is, to the best of our knowledge, the first of its kind. A preliminary assessment of this method’s accuracy indicated that imputation accuracies of up to 95% can be achieved for some of the most slowly mutating markers in the PowerPlex Y23 but that the performance is much poorer for more rapidly mutating markers. However, the accuracy of our approach is essentially linear in the shortest time to the most recent common ancestor. As a result, as population-scale sequencing datasets for the Y-chromosome continue to expand in scope to tens of thousands of individuals, we expect its accuracy to increase substantially. We also anticipate that our Y-STR mutation rate method and its relevant extensions may be applied to autosomal STRs. Although recombination complicates the generation of sufficiently detailed phylogenies, tools capable of inferring ancestral recombination graphs and the associated phylogenies continue to improve (Minichiello and Durbin 2006; Rasmussen et al. 2014). As a result, it may be possible to apply these approaches to ensembles of trees and aggregate the results.

The large corpus of mutation rate estimates has also enabled novel predictions about genome-wide STR variation. Prior studies have estimated a rate of approximately 75 de novo SNP mutations per generation (Conrad et al. 2011; Francioli et al. 2015) but have largely ignored STRs despite their elevated mutation rates. Based on our projections for de novo mutations on paternally inherited chromosomes, the number of de novo STR mutations is likely to exceed that of SNPs. An increasingly large number of candidate gene studies and genome-wide analyses have highlighted instances in which STR variations modulate gene expression (Gebhardt et al. 1999; Shimajiri et al. 1999; Contente et al. 2002; Gymrek et al. 2015). We therefore hope that as others aim to dissect the genetic basis of complex diseases and traits, this study will motivate them to consider STRs as the causal genetic elements.

## Web Resources

Simons Genome Diversity Project, https://www.simonsfoundation.org/life-sciences/simons-genome-diversity-project-dataset/

Dendroscope, http://dendroscope.org/

RAxML, http://sco.h-its.org/exelixis/web/software/raxml/index.html

Simons Genome Diversity Project capillary genotypes, ftp://ftp.cephb.fr/hgdp_supp9/genotype-supp9.txt

1000 Genomes Project alignments, ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/data/

1000 Genomes PowerPlex Y23 genotypes, ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20140107_chrY_str_haplotypes/s_PowerPLexY23_1000Y_QA_20130107.txt

1000 Genomes Project Y-chromosome phylogeny, ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/chrY/1000Y.Supplementary.Data.File.2016.01.04.tar.gz

Y-Chromosome STR Haplotype Reference Database mutation rates, https://yhrd.org/pages/resources/mutation_rates

## Supplemental Figures

## Acknowledgements

M.G. was supported by the National Defense Science and Engineering Graduate Fellowship. Y.E. holds a Career Award at the Scientific Interface from the Burroughs Wellcome Fund. This study was partially supported by NIJ grant 2014-DN-BX-K089 (Y.E. and T.W.).