Estimation of Pairwise Genetic Distances Under Independent Sampling of Segregating Sites vs. Haplotype Sampling

Genetic distance is a standard measure of variation in populations. When sequencing genomes individually, genetic distances are computed over all pairs of multilocus haplotypes in a sample. However, when next-generation sequencing methods obtain reads from heterogeneous assemblages of genomes (e.g. for microbial samples in a biofilm or cells from a tumor), individual reads are often drawn from different genomes. This means that pairwise genetic distances are calculated across independently sampled sites rather than across haplotype pairs. In this paper, we show that while the expected pairwise distance under whole haplotype sampling (WHS) is the same as with independent locus sampling (ILS), the sample variances of pairwise distance differ and depend on the direction and magnitude of linkage disequilibrium (LD) among polymorphic sites. We derive a weighted LD value that, when positive, predicts higher sample variance in estimated genetic distance for WHS. Weighted LD is positive when, on average, the most common alleles at two loci are in positive LD. Using individual-based simulations of an infinite sites model under Fisher-Wright genetic drift, variances of estimated genetic distance are found to be almost always higher under WHS than under ILS, suggesting a reduction in estimation error when sites are sampled independently. We apply these results to haplotype frequencies from a lung cancer tumor to compute weighted LD and the variances in estimated genetic distance under ILS vs. WHS, and find that the relative magnitudes of variances under WHS vs. ILS are sensitive to sampled allele frequencies.


Introduction
Genetic variation is the raw material for evolutionary change, consequently, (Dexter and Leith, 1986; from genetic distances to the population mutation rate estimated from the number of segregating sites $S_n$ in a sample of $n$ genotypes $S_n$/( (Watterson, 1975) is the basis for the Tajima D test for selection. Values of $\theta_\pi$ that are inflated relative to $\theta_S$ may be the result of diversifying selection

In studies of multicellular organisms, $\pi$ is estimated directly from com-

Pairs: In both cases, that is, for WHS and ILS, respectively, the estimators $\pi_1$ and $\pi_2$ include an average $\sum_{i<j} \phi_{ij} / n(n-1)$ of some function

The value of $N_3$ follows from the fact that there are $n(n-1)(n-2)/6$ ways to select a triplet $i, j, k$, and three ways to select a shared element from this triplet. In Appendix A1, we cover some of the properties of ordered pairs of pairs, including the derivation of the following relation, which we will use below to compute $\mathrm{var}(\pi)$ under ILS and WHS,

where $z_{i_s,s} \sim \mathrm{Bern}(p_s)$, i.e. $p(z_{i_s,s}) = p_s$ for $z_{i_s,s} \in \{0, 1\}$ such that

(note that $h_s$ is the heterozygosity at locus $s$).
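The triplet counting argument for $N_3$ can be checked by brute force. The sketch below is illustrative only: it enumerates all unordered pairs from a sample of $n$ items and counts how many pairs of pairs share exactly one element. The function name `count_pairs_of_pairs` is hypothetical, and the count is over unordered pairs of pairs, so a convention that also orders the two pairs would double it.

```python
from itertools import combinations
from math import comb


def count_pairs_of_pairs(n):
    """Classify unordered pairs of distinct pairs {i, j}, {k, l} from n items.

    Returns (shared_one, disjoint): how many pairs of pairs share exactly
    one element, and how many share none.
    """
    pairs = list(combinations(range(n), 2))
    shared_one = 0
    disjoint = 0
    for a, b in combinations(pairs, 2):
        overlap = len(set(a) & set(b))
        if overlap == 1:
            shared_one += 1
        else:
            # Two distinct pairs cannot share both elements, so overlap is 0.
            disjoint += 1
    return shared_one, disjoint


if __name__ == "__main__":
    n = 6
    shared_one, disjoint = count_pairs_of_pairs(n)
    # Triplet argument: choose the triplet, then choose the shared element.
    print(shared_one == 3 * comb(n, 3))
    # Disjoint case: choose four items, then split them into two pairs (3 ways).
    print(disjoint == 3 * comb(n, 4))
```

The two cases together account for all `comb(comb(n, 2), 2)` pairs of distinct pairs, which is a convenient sanity check when assembling variance terms from pair-level covariances.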

The expectation of the indicator function for ordered pairs of pairs includes a covariance term, namely,

The sampling estimator for $\pi$ under ILS is given by

From the assumption of statistical independence among sites $s$ located on different reads under ILS, it follows (Appendix A1) that for a sample of $n$,

We remark that in practice, the assumption of independence requires that

Using these terms, we compute the expectation:

Assuming independence (linkage equilibrium, $D_{sr} = 0$ for all $s, r$) gives results equivalent to ILS, i.e. both equations simplify to $\gamma_{sr} = \delta_{sr} = 4 p_s q_s p_r q_r$.
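As a worked check of the linkage-equilibrium limit, the sketch below computes the probability that two independently drawn haplotypes differ at both of two biallelic sites, given allele frequencies $p_s$, $p_r$ and disequilibrium $D_{sr}$. Identifying this probability with the two-site terms above is an assumption, since their defining equations are not reproduced here, and all function names are hypothetical. Expanding the four haplotype frequencies gives $4 p_s q_s p_r q_r + 2 D_{sr}(2p_s - 1)(2p_r - 1) + 4 D_{sr}^2$, which reduces to $4 p_s q_s p_r q_r$ at $D_{sr} = 0$.

```python
def haplotype_freqs(p_s, p_r, d_sr):
    """Two-locus haplotype frequencies from allele frequencies and LD."""
    q_s, q_r = 1.0 - p_s, 1.0 - p_r
    freqs = {
        (1, 1): p_s * p_r + d_sr,
        (1, 0): p_s * q_r - d_sr,
        (0, 1): q_s * p_r - d_sr,
        (0, 0): q_s * q_r + d_sr,
    }
    if min(freqs.values()) < -1e-12:
        raise ValueError("d_sr is outside the range allowed by p_s and p_r")
    return freqs


def prob_differ_at_both(p_s, p_r, d_sr):
    """Probability that two independent haplotypes differ at both sites."""
    freqs = haplotype_freqs(p_s, p_r, d_sr)
    total = 0.0
    for hap_a, f_a in freqs.items():
        for hap_b, f_b in freqs.items():
            if hap_a[0] != hap_b[0] and hap_a[1] != hap_b[1]:
                total += f_a * f_b
    return total


def closed_form(p_s, p_r, d_sr):
    """Expanded form of the same probability (derived for this sketch)."""
    q_s, q_r = 1.0 - p_s, 1.0 - p_r
    return (4 * p_s * q_s * p_r * q_r
            + 2 * d_sr * (2 * p_s - 1) * (2 * p_r - 1)
            + 4 * d_sr ** 2)


if __name__ == "__main__":
    print(prob_differ_at_both(0.7, 0.6, 0.0))   # linkage equilibrium
    print(prob_differ_at_both(0.7, 0.6, 0.05))  # positive LD between major alleles
    print(closed_form(0.7, 0.6, 0.05))
```

In this expansion the middle term is positive when the more common alleles at both sites are in positive LD, which is the situation the abstract associates with higher sample variance under WHS.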

The mean and sample variance terms for the expected pairwise distances are, respectively,

while the covariance $\kappa$ for the ordered pair of pairs with a shared $j$ element is:

By incorporating $\kappa$, we can construct the sample estimate and variances for $g_{ij}$. For the WHS model, $z_i \sim p(z)$, independently, from which we construct the sample estimate for $g_n$ as:

now averaging over haplotypes $z_i$ (rather than independent counts for each site).
Note that $g_n$ is again an average across pairs, like $f_n$ in the ILS case.
We again apply the result in Eqn.
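The contrast between the two sampling schemes can be illustrated with a small Monte Carlo sketch. It is not the simulation used in this paper: it draws from a fixed, hypothetical two-site haplotype distribution with replacement, and uses a generic mean pairwise difference whose normalization may differ from the estimators in the text. It relies on the fact that, at a biallelic site where $k$ of $n$ sampled alleles are of one type, exactly $k(n-k)$ of the pairs differ. All function names and the toy population are assumptions.

```python
import random
from math import comb


def pairwise_distance_from_counts(counts, n):
    """Mean pairwise distance from per-site counts of the '1' allele."""
    return sum(k * (n - k) for k in counts) / comb(n, 2)


def sample_whs(haplotypes, freqs, n, rng):
    """Draw n whole haplotypes, so sites are sampled jointly."""
    sample = rng.choices(haplotypes, weights=freqs, k=n)
    counts = [sum(h[s] for h in sample) for s in range(len(haplotypes[0]))]
    return pairwise_distance_from_counts(counts, n)


def sample_ils(haplotypes, freqs, n, rng):
    """Draw n alleles independently at every site from the marginal frequencies."""
    n_sites = len(haplotypes[0])
    p = [sum(f * h[s] for h, f in zip(haplotypes, freqs)) for s in range(n_sites)]
    counts = [sum(rng.random() < p_s for _ in range(n)) for p_s in p]
    return pairwise_distance_from_counts(counts, n)


def mean_and_variance(values):
    """Sample mean and unbiased sample variance."""
    m = sum(values) / len(values)
    v = sum((x - m) ** 2 for x in values) / (len(values) - 1)
    return m, v


def compare_schemes(haplotypes, freqs, n=10, reps=20000, seed=1):
    """Return ((mean, var) under WHS, (mean, var) under ILS)."""
    rng = random.Random(seed)
    whs = [sample_whs(haplotypes, freqs, n, rng) for _ in range(reps)]
    ils = [sample_ils(haplotypes, freqs, n, rng) for _ in range(reps)]
    return mean_and_variance(whs), mean_and_variance(ils)


if __name__ == "__main__":
    # Toy population: two sites whose major alleles always co-occur (positive LD).
    haplotypes = [(1, 1), (0, 0)]
    freqs = [0.7, 0.3]
    (mean_whs, var_whs), (mean_ils, var_ils) = compare_schemes(haplotypes, freqs)
    print("WHS mean, variance:", mean_whs, var_whs)
    print("ILS mean, variance:", mean_ils, var_ils)
```

In this toy case both means should be close to the summed heterozygosity $2 \times 2(0.7)(0.3) = 0.84$, while the variance is larger under WHS because the two sites always vary together.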

Using the results in Eqns. (4) and (5), we derive the difference between the sample variances in pairwise differences under WHS vs. ILS as

By collecting terms, we can rewrite the above as

For notational convenience, we define:

In the absence of linkage disequilibria among pairs ($D_{sr} = 0$ and therefore

i.e. that for most pairs of loci $s, r$, the "major" alleles (those with $p_s, p_r > 0.5$) are in positive LD, while major and minor allele pairs ($p_s > 0.5$, $p_r < 0.5$ or

In the absence of recombination, multilocus haplotypes behave as alleles at a single locus, so that the infinite sites model becomes effectively an infinite alleles model (Tajima, 1996). Therefore, every new mutation is in

The values of $\Delta$ are also sensitive to the choice of truncation, as many of the SNVs occur in genotypes that are close to fixation in the tumor.

For example, if we use $f_c = 0.49$, $x = 32$ as a cutoff to define segregating sites rather than $f_c = 0.40$, we obtain $\bar{A}_{sr} < 0$ and $\Delta < 0$ (of the order $\sim 0.1$). The sign reversal results from some lower frequency SNVs uniquely co-occurring in genomes with other SNVs that are close to fixation. The remaining allele and haplotype distributions contribute negative link-

For the variance, note that