Variance in estimated pairwise genetic distance under high versus low coverage sequencing: The contribution of linkage disequilibrium.

The mean pairwise genetic distance among haplotypes is an estimator of the population mutation rate θ and a standard measure of variation in a population. With the advent of next-generation sequencing (NGS) methods, this and other population parameters can be estimated under different modes of sampling. One approach is to sequence individual genomes with high coverage, and to calculate genetic distance over all sample pairs. The second approach, typically used for microbial samples or for tumor cells, is sequencing a large number of pooled genomes with very low individual coverage. With low coverage, pairwise genetic distances are calculated across independently sampled sites rather than across individual genomes. In this study, we show that the variance in genetic distance estimates is reduced with low coverage sampling if the mean pairwise linkage disequilibrium weighted by allele frequencies is positive. Practically, this means that if on average the most frequent alleles over pairs of loci are in positive linkage disequilibrium, low coverage sequencing results in improved estimates of θ, assuming similar per-site read depths. We show that this result holds under the expected distribution of allele frequencies and linkage disequilibria for an infinite sites model at mutation-drift equilibrium. From simulations, we find that the conditions for reduced variance only fail to hold in cases where variant alleles are few and at very low frequency. These results are applied to haplotype frequencies from a lung cancer tumor to compute the weighted linkage disequilibria and the expected error in estimated genetic distance using high versus low coverage.


Introduction
One of the defining empirical problems in evolutionary genetics is the measurement and characterization of genetic heterogeneity in natural and experimental populations. The advent of next-generation sequencing (NGS) provides researchers with a tool set for efficiently generating sequence data from large numbers of genotypes and over extensive regions of the genome, including whole-exome and whole-genome sequencing of multiple individuals. This data has the potential to provide the statistical power necessary to make robust inferences of genotype frequencies and their distributions.
High-throughput NGS technology gives researchers choices between different approaches to sampling genotypes from a population. A standard method, most widely used in studies of multicellular organisms, is to sample individuals and sequence their genomes at high coverage, i.e. generating reads containing most or all of the polymorphic sites of interest for each genome. An alternative approach is to sequence from a pooled set of individuals at a read depth much smaller than the number of genomes in the sample, e.g. Futschik and Schlötterer (2010); Anand et al. (2016), leading to a very low average coverage per individual genome. Figure 1 illustrates these two scenarios for a small model population: a sample of n individuals sequenced with full coverage, versus low coverage sequencing at read depth n from a pooled set of individuals. stant read depth, pooling improves the accuracy in estimated θ due to effectively larger sample size (Futschik and Schlötterer, 2010;Ferretti et al., 2013), while Korneliussen et al. (2013) have shown that low read depth can lead to estimation bias in the Tajima D test statistics.
Considering the effects of coverage on parameter estimation, if the number of genomes sampled is held constant, lower coverage leads to smaller sample size, and consequently greater error. However, Ferretti et al. (2014) have shown that as long as the reduction in coverage is compensated by the In this study, we will consider limiting cases of high and low coverage sequencing to investigate the contribution of linkage disequilbria to estimates of θ. High coverage sequencing (HCS) is represented by complete coverage of all polymorphic sites from n different genomes, as would typically be the case for individual sampling (including single-cell sequencing). Low coverage sequencing (LCS) is represented by a case where a very large sample of genomes is pooled and sequenced at a read depth n for each site, so that allelic variants at different sites are almost always drawn from different genomes. We will compute variances in the Tajima estimator E( π) = 4N u = θ π , which is calculated from the mean pairwise genetic distance in a sample of n genotypes: where π i,j is the Hamming distance for the haplotype pair i, j summed over all polymorphic sites.
We hypothesize that under most conditions, the variance in π estimated using HCS increases with greater linkage disequilibrium across polymorphic sites, i.e. strong linkage disequilibria inflate the estimation error across sites in a haplotype by reducing the number of independent genealogical sample paths. We will investigate this hypothesis analytically, and will additionally validate our results using individual-based simulations. We will also apply these results to NGS data by analyzing allele and haplotype frequencies from cancer cell genomes.

The Sampling Models
Consider a population of N organisms with mutations distributed over S segregating sites. We wish to estimate the mean genetic distance π for the population and its sample variance var( π) under the high and low coverage modes of sequencing. For HCS, we draw n N individual organisms (or cells) from the population and sequence their entire genomes, exomes, or any regions containing the polymorphic sites of interest.
For an idealized model of LCS, we assume a mean coverage depth n M , where M is the number of genotypes contributing to the pooled sample (M may be N or of the same order). If reads are short, the majority will contain at most a single polymorphic site. Together, these conditions lead to each polymorphic site being sampled independently of other polymorphic sites with respect to the genome of origin (note that in the second panel of Figure 1, multiple sites are sampled from the same genome simply because there are very few genomes to draw this random sample from). When computing sample genetic distance, extreme HCS sums over the Hamming distances of all haplotype pairs, while extreme LCS results in summing over all pairs for each segregating site sampled from a different genome.
We assume an infinite sites model (Kimura, 1969;Tajima, 1996) so that there are only two alleles per segregating site. This allows an unambiguous binary classification of alleles, with mutations as ancestral "wildtype" vs.
"reference" genotype. This also allows us to specify the direction (sign) of linkage disequilibria. For cancer cells, the reference corresponds to the normal germline genotype, with somatic mutations defining the variant genotypes of the clonal lineages. Our methods and results also apply to multiallelic states provided that some allele, usually the most common, is designated as a reference and all other alleles are aggregated to create a biallelic state.
Definitions. Throughout the paper, we use the following definitions and terminology as a formal way of defining Eqn. (1) for high and low coverage sequencing: Variables: Let z denote a genotype, at either a single locus s or across multiple loci. We define the frequency distribution of z over samples i as z i ∼ p(z), which are iid among i = 1...n. We use z is to denote site s in haplotype i (for HCS). We write z is,s to denote sample i at site s when sites are sampled by pooling at low coverage (LCS) when we wish to highlight the fact that z is and z ir are read from distinct haplotypes.
Pairs: For both HCS and LCS, the estimators of π include an average that share a common element are not, and thus the same for φ ij .

Moments of
We also define an expectation for the product E(φ ij , φ jk ) = κ for pairs of pairs with a shared element.

Pairs of pairs:
Let P denote the set of all ordered pairs of pairs, with P 3 ⊂ P defining the subset of ordered pairs of pairs with a single shared element, shared element from this triplet. In Appendix A1, we discuss the properties of ordered pairs of pairs, including the derivation of the following relation which we will use below to compute var( π) under low and high coverage sequencing, where φ n = 1 We will use this result twice, once for LCS with φ ij = f ijs , and once for HCS with φ ij = g ij .

Case 1: Low Coverage Sequencing (LCS)
For LCS, we use the indicator function at a single site s, where h s is also known as the heterozygosity at locus s. Under LCS we define the estimator forπ for Eqn. (1) as:

10
. CC-BY 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/108928 doi: bioRxiv preprint first posted online Feb. 15, 2017; To evaluate the variance ofπ LCS , first note that the expectation of products f ij,s , f jk,s for ordered pairs is: From the assumption of statistical independence among sites s located on different reads with pooling, it follows (Appendix A1) that for a sample of Approximate statistical independence across sites requires that the number of possible samples of size n is much larger than the number of segregating sites (i.e. N n so that N n S N ).

Case 2: High Coverage Sequencing (HCS)
Computing pairwise differences among samples of individuals genomes under HCS involves calculating moments of sums rather than sums of moments.
For z i ∼ p(z) sampled independently under HCS, applying g ij = s I(z is = and derive its variance below. For a sample of individual haplotypes i = 1...n, consider z is ∼ Bern(p s ) as before, but with correlated z is , z ir due to linkage disequilibrium (LD) between sites. We define the LD D rs for 11 . CC-BY 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/108928 doi: bioRxiv preprint first posted online Feb. 15, 2017; (arbitrarily labeled) alleles R, r and S, s at the two sites as follows. Letting q s , q r = 1 − p s , 1 − p r (Lewontin and Kojima, 1960), As with LCS, we have, for h s = 2p s q s , The probability of different identity among sites s, r in a sample pair i, j is p(f ijs f ijr = 1) = p(RS, rs) + p(rs, RS) + p(Rs, rS) + p(rS, Rs), where p(RS, rs) = p(z i,sr = RS, z j,sr = rs) etc. Therefore γ sr = E(f ij,s f ij,r ) = 2(p s p r + D sr )(q s q r + D sr ) + 2(p s q r − D sr )(q s p r − D sr ) and similarly, considering triplet samples with shared element j paired with i and k, the probability of different identity between i and j at site r and j and k at site s is p(f ijs f jkr = 1) = p(R, rS, s) + p(R, rs, S) + p(r, RS, s) + p(r, Rs, S), where p(R, rS, s) = p(z ir = R, z jr = r, z js = S, z ks = s) etc.
Using these terms, we compute the expectation: At linkage equilibrium (D sr = 0 for all s, r), high coverage sequencing of n individuals and low coverage sequencing of pooled individuals at read depth n are statistically equivalent, i.e. both equations simplify to γ sr = δ sr = 12 . CC-BY 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The mean and sample variance terms for the expected pairwise distances are, respectively, while the covariance κ for the ordered pair of pairs with a shared j element is: By incorporating κ, we construct the sample estimate and variances for g ij .
Because we are now averaging over haplotypes z i (rather than independent counts for each site), g n is itself an average across pairs, like f n in the LCS case. Applying Eqn.
(2), we find that 13 . CC-BY 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.

Difference and Independence
Using the results in Eqns. (3) and (4), we derive the difference between the sample variances in pairwise differences under HCS vs. LCS as By collecting terms, we can rewrite the above as For notational convenience, we define: 14 . CC-BY 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/108928 doi: bioRxiv preprint first posted online Feb. 15, 2017; We remark that E(A sr ) > 0 is a sufficient but not necessary condition for ∆ > 0. The variance in mean pairwise distance can be reduced with pooled LCS even for E(A sr ) < 0, because negative A sr may be offset by the positive contributions of D 2 sr to the B sr term when pairwise LD values in the population are sufficiently high. However, with large sample sizes, the A sr term dominates because it scales as ∼ 1/n while the B sr term scales as ∼ 1/n 2 ; consequently, the sign of E(A sr ) generally predicts that of ∆.
E(A sr ) > 0 requires that most "major" alleles (those with p s , p r > 0.5) at different loci are in positive LD, while major and minor allele pairs at different loci (p s > 0.5, p r < 0.5 or vice-versa) are in negative LD. The weighted LD A sr provides a measure of the extent to which major alleles are in positive LD, regardless of whether the more common allele is a reference/wildtype or variant/mutant at a particular site. These results predict that when the mean weighted LD is positive, the sample variance (error) in estimated pairwise genetic distance will be reduced by LCS.

Possible Caveats: Random Read Depth and Fractional Coverage
Our results are based on a comparison of two extreme-case scenarios: full coverage sequencing of n sampled genomes versus low coverage sequencing of M sampled genomes at a fixed read depth n M , which is an idealization, because in reality the actual read depth under NGS varies across sites.

Random Read Depth.
We first consider the effect of having the read depth as Poisson random variable rather than a constant. In Appendix A2, we show that if the read depth n s ∼ P oiss(n) (for mean read depth n), then 15 . CC-BY 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/108928 doi: bioRxiv preprint first posted online Feb. 15, 2017; the variance of the estimated pairwise genetic distance under LCS is: This result follows because a Poisson random variable with mean n has variance = n (indeed, read depth variance across sites will be typically of Different (ξ i , ξ j ) result in different expressions for γ. If both are on the same haplotype, ξ i = ξ j = 1, we get the γ rs in the HCS limit (see 16 . CC-BY 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/108928 doi: bioRxiv preprint first posted online Feb. 15, 2017; derivation of Eqn. (4)). If both are independent, ξ i = ξ j = 0, we get We apply the law of total probability to evaluate with p(ξ i = 1) = p(ξ j = 1) = ρ. The same argument from total probability (applied over cases where i, j, k have s, r sites on various combinations of shared vs. different reads) applies to continuity of δ rs .
Letπ ρ denote the estimator under a sampling scheme with fractional coverage. It follows from the continuity of δ, γ and from the fact that the marginal expectations E(f ij ) are the same under HCS and LCS that the resulting var(π ρ ) is a continuous function of ρ with var(π 0 ) = var(π LCS ) and var(π 1 ) = var(π HCS ). This result is qualitatively consistent with the relationship between the variance in θ and the fraction of missing sequence data in Ferretti et al. (2014).

Dependence on Allele Frequencies and Linkage Disequilibria
Low coverage sequencing reduces the error in estimates ofπ when ∆ > 0, which holds if the expected weighted linkage disequilibria defined by Eqn. CC-BY 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
As noted above, E(A sr ) > 0 requires positive linkage disequilibria among pairs of alleles that are rare (p < 0.5) and among pairs of alleles that are common (p > 0.5), and by symmetry, negative LD among most pairs of major/minor alleles across loci. For a population of clonal, non-recombinant genomes, we can determine the distribution of LD given a distribution of allele frequencies.
In an infinite sites model without recombination, letting p s , p r be the frequencies of variant alleles S, R at loci s, r, where p s < p r , it follows that all occurrences of S must be in S, R haplotypes (conversely, if S co-occurs with r, there can be no S, R haplotypes). Therefore, if p s + p r > 1 and p s < p r , then p(S, R) = p s and the LD between loci s, r D sr = p s (1−p r ).
When p s +p r < 1, we can compute the expected LD by counting the number of cases that S can co-occur with the R allele versus the r wildtype, since the constraint of S either always or never co-occurring with R when p s < p r .

Computing Linkage Disequilibria and E(A sr )
Let N be the total number of haplotypes and let s, r denote any two loci.
We observe variant allele frequencies p s = k N and p r = h N at loci s and r for k, h = 1, . . . , N . Without loss of generality, we assume k ≤ h (corresponding to p s < p r ). Let C sr denote the event that mutations S and R co-occur and S and r don't co-occur, whileC sr denotes the event that S and R don't co-occur (i.e. S co-occurs with the "wildtype" allele at r). Let (C sr ∪C sr ) c denote the complement of C sr ∪C sr such that S and R co-occur on some haplotype and S and r co-occur on some other haplotype. Because there is no recombination and no multiple mutations per site, C sr ∪C sr and (C sr ∪ C sr ) c = ∅. We compute the probability of co-occurrence p(C sr |C sr ∪C sr ): 18 . CC-BY 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
(2) N ≥ h + k. We first derive p(C sr ) and p(C sr ) and then p(C sr |C sr ∪ C sr ) = p(C sr )/(p(C sr ) + p(C sr )).
Counting the possible combinations, we find The numerators in the first and second equations give the number of ways in which k S alleles can co-occur with r or with R, respectively, given N k possible positions for the S alleles. Therefore, As expected, in the limiting cases of a new mutation (k = 1), D sr = 0, while To compute E(A sr ), we evaluate the weighted linkage disequilibria over 19 . CC-BY 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/108928 doi: bioRxiv preprint first posted online Feb. 15, 2017; the distribution of η k ,η h , the number of sites at which there are exactly k, h copies of variant alleles. The expected weighted LD η = (η 1 , . . . , η N ) is with the convention that 1 2 = 0, so that the marginal expectation of A sr is given by This approximation holds based on a Taylor expansion of the ratio of two random variables, and the fact that ν(η) is much smaller than δ(η) due to the typically small values of A sr for any pair s, r. The ratio of expectations 20 . CC-BY 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/108928 doi: bioRxiv preprint first posted online Feb. 15, 2017; is: The expectations . Therefore, we can approximate the expectation of E(A sr ) (and at least obtain the correct sign from the numerator) to determine if ∆ is positive when the expectations and second moments of an allele frequency distribution are known. The same approach can be used to compute the expectated linkage disequilibria over all η k , E[E(D sr |η)], by substitution D sr for A sr (i.e. no factor of (2p s − 1)(2p r − 1)) in the equations above.
To calculate E(A sr ) for an infinite sites model at mutation-drift equilib-21 . CC-BY 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/108928 doi: bioRxiv preprint first posted online Feb. 15, 2017; rium, we use the expectations and variances of η k (Fu, 1995): where the coefficients σ k,k and σ k,h are functions of harmonic series sums a n and B n , i.e.
while the covariance coefficients are (for h > k) CC-BY 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/108928 doi: bioRxiv preprint first posted online Feb. 15, 2017; linkage disequilibria, we show that these results follow as consequences of the distributions of pairwise LD, and compare these predictions to simulation results for representative parameters. We remark that numerical estimates of E(A sr ) computed from Eqn. (9) will be only approximate for several reasons: first, because the expectations of ratios do not exactly equal ratios of expectations (Eqn. (8)), and second, because the expectation and variance of η k are derived for sampling distributions where n N (Watterson, 1975;Fu, 1995). However, they usually provide a sufficiently good approximation (Wakeley and Takahashi, 2003) as n → N , so that our estimates of E(A sr ) should be of the correct sign and magnitude.

Comparison to individual-based simulations
To simulate Fisher-Wright genetic drift in an infinite sites model, we initial- To simulate full coverage sequencing of individuals, n haplotypes were randomly selected without replacement from the model population. The Hamming distances were calculated for all pairs in a sample, variant allele 23 . CC-BY 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/108928 doi: bioRxiv preprint first posted online Feb. 15, 2017; frequencies and linkage disequilibria were calculated over all individuals and all pairs in the model population. Sequencing of pooled samples at very low coverage was simulated by selecting n alleles without replacement at every segregating site and summing pairwise distances over all sites (corresponding to sampling with replacement with respect to genomes, but without replacement with respect to each locus). ∆ was estimated from the data as the difference in the sample variances between the HCS and LCS pairwise distances. In every replicate run, A sr was calculated from the mutation frequencies p s , p r and D sr using Eqn. (6). All simulations were implemented using Python 2.7.3, the code is available from the corresponding author upon the estimated pairwise genetic distance π converges to the Tajima estimator 24 . CC-BY 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/108928 doi: bioRxiv preprint first posted online Feb. 15, 2017; for haploids θ = 2N u, which is π = 150, 300 for N = 200, 500, respectively. Table 2   ing sites, and for small sample sizes (corresponding to low coverage depth with NGS). ∆ scales approximately as ∼ 1/n for large n; consequently, for 25 . CC-BY 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/108928 doi: bioRxiv preprint first posted online Feb. 15, 2017; sample numbers and coverage depths of the order ∼ 100, ∆ will be smaller by nearly an order of magnitude relative to the values obtained for n = 20 (simulations were performed for n = 10, 50, the results are not shown due to qualitative similarity to the data in Tables 1-2).
The two observed cases withĀ sr < 0 are for T = 10, with a negative  Table 2, and the number of segregating sites S in Table 1), we obtain CC-BY 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/108928 doi: bioRxiv preprint first posted online Feb. 15, 2017; E(η k ) ∼ θ/k suggests that high frequency alleles that contribute to large positive LD are rare, the high variance in allele frequencies means that any given sample path will usually contain several haplotypes characterized by multiple variant alleles at very high frequency. This means that even though the expectation for a particular η k will be low for k ∼ N , most genealogies will be characterized by a high count for some individual large value(s) of k, as is seen in Figure 2a. This generates a positive skewed distribution of pairwise linkage disequilibria (consistent with the distributions of D sr derived numerically for non-recombining loci in (Golding, 1984)), as is seen

Analysis of cancer sequence data
In this section, we apply the results of our derivations to genomic data by (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/108928 doi: bioRxiv preprint first posted online Feb. 15, 2017; were performed with GATK (McKenna et al., 2010). Through the matching of read ends, somatic mutations co-occurring within ∼ 100 bp in single genomes were identified (Sengupta et al. 2015). These mutation pairs define two locus haplotypes that can be tallied without the need for phasing, giving estimates of haplotype frequencies (defined by two proximate polymorphic sites) directly from the read counts. Therefore, if we have heterozygous fixation at a single SNV site, a population consisting of 0/1 (reference and variant) genotypes, a mean genetic distance measure of π = 1/2 is meaningless because the population is homogeneous with respect to the 0/1 genotype, and variant allele frequencies must be rescaled to reflect this.

28
. CC-BY 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/108928 doi: bioRxiv preprint first posted online Feb. 15, 2017; Figure 3 shows the distribution of mutant allele frequencies in Sample 1; note the high frequency of values near p = 0.5, the skew of the distribution is presumably the result of a low rate of detection of rare variants.  Ling et al. 2015) address the issue heterozygous genotype fixation by only considering polymorphic, segregating sites when comparing allele frequency spectra to neutral models, to the exclusion of sites that are ≥ 0.5 within a margin of sampling error; which also excludes sites whose frequencies p > 0.5 due to loss of heterozygosity.
For the truncated range of allele frequencies p = [0, 0.5], the frequencies are rescaled to reflect heterozygosity, which for diploids means mapping p = 2p, or more generally, p = p/f c where f c is the cutoff for the inference of fixation. With this mapping, π for a sample where all genotypes at a variant site are 0/1 is 0.
Assuming diploidy at all of the genotyped SNV sites and defining fixation as p = 0.5, we find that for sample size n = 65, the binomial probability of observing fewer than x = 26 mutant alleles is Bin(x ≤ 25|n = 65, p = 0.5) = 0.041, so we use f c = 0.4 as as a cutoff defining polymorphic sites. By this criterion, and the rescaling p = p/f c , there are only between 6 (sample 4) and 10 (sample 3) adjacent segregating sites, and consequently between 3 and 5 haplotypes defined by such a pair out of the original 69. The LD and ∆ values for this subset of haplotypes are summarized in Table 3. The differences in variances ∆ remain positive, consistent with sample variance being lower with pooling. ∆ is small (0.034 ≤ ∆ ≤ 0.070), implying that in practice the estimation errors for π would be negligibly different between 29 . CC-BY 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/108928 doi: bioRxiv preprint first posted online Feb. 15, 2017; high and low coverage sequencing for this data set. However, the small ∆ are partly a result of the small number of segregating sites (i.e. π max = S n /2) while var( π) estimated by individual cell sampling may be expected to increase for more segregating sites, as was the case in the simulation data for larger time intervals.

TABLES 3a-b HERE
The values of ∆ are also sensitive to the choice of truncation, as many of the SNVs occur in genotypes that are close to fixation in the tumor.
For example, if we use f c = 0.49, x = 32 as a cutoff to define segregating sites rather than f c = 0.40, we obtainĀ sr < 0 and ∆ < 0 (of the order ∼ 0.1). The sign reversal results from some lower frequency SNVs uniquely co-occuring in genomes with other SNVs that are close to fixation.
The remaining allele and haplotype distributions contribute negative linkage disequilibria between the high frequency SNVs at one locus and high frequency reference alleles at the other site. The greater absolute value of ∆ is a consequence of the fact that with a cutoff of f c = 0.49, there are now 21-28 haplotypes (and 42-56 segregating sites) rather than the 6-10 for the f c = 0.40 cutoff. The negative weighted LDs and ∆ with this cutoff are shown in the second panel Table 3b, illustrating that for some samples, the variances in π may actually be slightly higher with low coverage sequencing.

Discussion
The reduction of error in estimated genetic distance through low coverage sequencing reflects the loss of information due to non-independence across sites through linkage disequilibria. If the most frequent alleles at the ma-30 . CC-BY 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/108928 doi: bioRxiv preprint first posted online Feb. 15, 2017; jority of sites are in positive LD, any error in the estimated frequency and heterozygosity at one site covaries with the error at the other sites with individual sampling. In contrast, with LCS, each site provides independent information, so that the error across sites is uncorrelated. For S n segregating sites in a sample of n and a variance in estimated distance per site σ 2 , with independent sampling the error across sites will approach σ 2 /S n . In the extreme case where allele frequencies across sites are nearly identical (complete linkage), the sample variance is σ 2 independent of the number of sites.
Another way to think of this is to consider the information gain that comes from sampling different loci from different subclasses of individuals in a sample under LCS, so that each polymorphic site has its own sample genealogy. This is analogous to the results of Pluzhnikov and Donnelly (1996), who found that in the presence of recombination, the optimal sequence length per genome for estimating allele frequencies and θ was sufficiently high to provide a large sample size while sufficiently short to provide low coverage per genome when recombination rates are low. With greater recombination rates, there is no information gain through low coverage because all but the closest loci have their own coalescent genealogy.
Conversely and by symmetry, a negative association of major allele frequencies across pairs of sites means that an error in estimated distance at one site will on average be compensated by an error in the opposite direction at another site, leading to reduction in variance under high coverage sequencing of individual genomes (analogous to improved estimation of the mean by sampling positive and negative extremes of a distribution). Both heuristic considerations and simulation results suggest that such a scenario is unlikely except for distributions of allele frequencies that give very small error values regardless, at least under neutral evolution. This was further 31 . CC-BY 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/108928 doi: bioRxiv preprint first posted online Feb. 15, 2017; confirmed by numerical calculation of the expected distribution of pairwise linkage disequilibria in an infinite sites model for clonal, ameiotic organisms.
Our results, like those of Ferretti et al. (2014) suggest that low coverage sequencing over pooled samples should be used to estimate the genetic distance (and consequently, population mutation rate parameter θ and the Tajima D statistic) under most conditions if reduction of estimation error is the sole criterion. However, there are several caveats to this conclusion, some theoretical, others practical. For example, we know that when most pairwise LD are approximately 0, the difference ∆ between HCS and LCS estimates will be very small. A number of recent studies have shown that LD are generally among sites that are not physically linked in the genomes of sexually reproducing model organisms, including Drosophila (Andolfatto and Przeworski, 2000) and humans (Peterson et al., 1995;Reich et al., 2001).
This suggests that any error introduced by sampling alleles from genomes individually with high coverage rather than pooled low coverage may be negligible for non-clonal genomes.
In contrast, in clonal organisms, or for regions of genome under very low recombination in sexually reproducing organisms, LD values will be high. Depending on the distribution of allele frequencies, ∆ will be large when evaluated over many polymorphic sites. In the cases of cancer and microbial genomics, the standard NGS approach to sequencing reads from large numbers of cells at low coverage suggests an improved estimation of π (and consequently, θ and N e ) relative to what would be obtained from more expensive single cell sequencing approaches. Furthermore, single-cell sequencing usually entails a much smaller sample size n than the coverage depths of 100-1000 that are standard for pooled sequencing. Moreover, ∆ is defined on the assumption of the same effective sample size n for both LCS 32 . CC-BY 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/108928 doi: bioRxiv preprint first posted online Feb. 15, 2017; and HCS, when in fact LCS is associated with pooling and high read depth n, as is often the case, then this is often sufficient to reverse the sign of var( π HCS ) − var( π LCS ) even in the rare cases when ∆ < 0 for HCS sample size equal to LCS read depth n. This is because LCS combined with pooling increases the sample size per segregating site.
Finally, we remark that this study was to a large part motivated by efforts to apply the methods and theory of population genetics to cancer biology, where individual cell sequencing at high coverage versus pooled sampling with low coverage are often presented as alternative approaches.
The case study computing ∆ from lung cancer data in the previous section was used as proof of principle. A more accurate and refined analysis would have to take into consideration a number of potentially confounding variables. These include polyploidy and aneuploidy (so that with ploidy X, fixation corresponds to p = 1/X), as well as accounting for the loss of heterozygosity through mitotic recombination, reflected in frequencies p > 0.5.
The sensitivity of var( π) to the choice of cutoff f c defining fixation for both the diploid and polyploid cases is of interest as an area for future research.

Acknowledgments
MS and JL were supported by the St. David's Foundation impact fund.
YN and PM were supported by NIH grant 2R01CA132897-06A1. We thank Kalamakar Gulukota and Yuan Ji at NorthShore University HealthSystem for providing the lung cancer data used in this paper. We also thank two anonymous reviewers for their helpful critique and comments, as well as the following individuals for advice and assistance: Mark Kirkpatrick, Jon Wilkins, Matthew Cowperthwaite, Habil Zare, Mikhail Matz, and Andrea Sottoriva.

33
. CC-BY 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/108928 doi: bioRxiv preprint first posted online Feb. 15, 2017; Figure 1. Illustration of high coverage sequencing (HCS) versus low coverage sequencing (LCS). In this example, the population consists of 8 haplotypes G1...G8 characterized by 4 segregating sites S1...S4. We assume a sampling depth of n = 3 and sufficiently many reads to capture all segregating sites. In the left panel, we have a random instance of HCS via the complete sequencing of G2, G4, G5 (gray ovals representing sampling), giving a mean pairwise distance of π = 2. In the right panel, we have a random instance of LCS, such that G1, G3, G8 are sequenced at S1, G4,G5 and G8 at S2, etc, giving a mean genetic distance π = 8/3. Note that E( π) is the same under both modes of sampling, the differences are due to var( π).

34
. CC-BY 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.  (Table 1a) and N = 500 (Table 1b)   These tables show the number of segregating sites S n in a sample of n = 20, the mean pairwise genetic distances π HCS , π L CS (for high and low coverage sequencing, respectively), and the variances in pairwise genetic distance for HCS vs. LCS. The latter are used to compute ∆ S in Table 1. Table 2a shows these summary statistics for N = 200, Table 2b for N = 500. (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/108928 doi: bioRxiv preprint first posted online Feb. 15, 2017; to the diploid cutoff value. The lower panel shows the same for f c = 0.49, selected arbitrarily close to p = 0.5 to show the sensitivity of ∆ to the cutoff.
The f c = 0.40 calculations are based on 6-10 remaining polymorphic sites, the f c = 0.49 on 42-56 sites, depending on the sample. Table A2. This table summarizes the same parameter estimates as in Table   2a (for N = 200, T = 1000), however, the LCS read depth is now a Poisson random variable with mean n = 20 rather than a constant. Note that the simulation values for the means and variances of π and of A sr , ∆ are largely unchanged due to the negligible contribution of variance in read depth to the error in parameter estimation.
. CC-BY 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/108928 doi: bioRxiv preprint first posted online Feb. 15, 2017; . CC-BY 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.

39
. CC-BY 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.

40
. CC-BY 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.

Depth
In practice, the read depth with LCS is not constant and varies considerably across sites. Assuming the read depth follows a Poisson distribution with mean n equal to the number of haplotypes in HCS. Computing the Taylor expansion of the variance of genetic distance estimator, we find the difference between fixed and Poisson read depth is O( 1 n 2 ). Let n s denote the read depth at locus s which follows a Poisson distribution n s ∼ P oi(n) with mean n. Let (1 − 2h s ) .
The second term is zero because the expectation is independent of n,

41
. CC-BY 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/108928 doi: bioRxiv preprint first posted online Feb. 15, 2017; The first term is non-trivial. Let g(n s ) = 1 (1 − 2h s ) and expand E(g(n s )) at n, E{g(n s )} ≈ E{g(n)} + E{g (n)(n s − n)} + 1 2 E{g (n)(n s − n) 2 } = g(n) + 0 + 1 2 where a = 2h s (1 − 2h s ) and b = −2h 2 s + 6h s . So 1 2 E{g (n)(n s − n) 2 } = 1 2 g (n)var(n s ) = 1 2 g (n)n = O 1 n 2 . Therefore, where the first term of the last equality is the variance ofπ LCS given the read depth is fixed at n across all loci. This result holds for any sampling distribution of read depths where the variance is of the order ∼ n.
The very small ∼ O( 1 n 2 ) contribution of non-constant read depth to the error in estimatedπ is confirmed using simulation results. Compare the variances and other parameter estimates observed in Table 2 for N = 200, n = 20 where the read depths are constant to those in Table A1, where the read depth is a Poisson random variable with parameter n.

42
. CC-BY 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint . http://dx.doi.org/10.1101/108928 doi: bioRxiv preprint first posted online Feb. 15, 2017; . CC-BY 4.0 International license It is made available under a (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.