## Abstract

*F*_{ST} and kinship are key parameters often estimated in modern population genetics studies. Kinship matrices have also become a fundamental quantity used in genome-wide association studies and heritability estimation. The most frequently used estimators of *F*_{ST} and kinship are method of moments estimators whose accuracies depend strongly on the existence of simple underlying forms of structure, such as the island model of non-overlapping, independently evolving subpopulations. However, modern data sets have revealed that these simple models of structure do not likely hold in many populations, including humans. In this work, we provide new results on the behavior of these estimators in the presence of arbitrarily complex population structures. After establishing a framework for assessing bias and consistency of genome-wide estimators, we calculate the accuracy of *F*_{ST} and kinship estimators under arbitrary population structures, characterizing biases and estimation challenges unobserved under their originally assumed models of structure. We illustrate our results using simulated genotypes from an admixture model, constructing a one-dimensional geographic scenario that departs nontrivially from the island model. Using 1000 Genomes Project data, we verify that population-level pairwise *F*_{ST} estimates underestimate differentiation measured by an individual-level pairwise *F*_{ST} estimator introduced here. We show that the calculated biases are due to unknown quantities that cannot be estimated under the established frameworks, highlighting the need for innovative estimation approaches in complex populations. We provide initial results that point towards a future estimation framework for generalized *F*_{ST} and kinship.

## 1 Introduction

In population genetics studies, one is often interested in characterizing structure, genetic differentiation, and relatedness among individuals. Two quantities often considered in this context are *F*_{ST} and kinship. *F*_{ST} is a parameter that measures structure in a subdivided population, satisfying *F*_{ST} = 0 for an unstructured population and *F*_{ST} = 1 if every SNP has fixated in every subpopulation. More specifically, *F*_{ST} is the probability that alleles drawn randomly from a subpopulation are “identical by descent” (IBD) relative to an ancestral population [1, 2]. The kinship coefficient is a measure of relatedness between individuals defined in terms of IBD probabilities, and it is closely related to *F*_{ST} [1].

The most frequently used *F*_{ST} estimators are derived and justified under the “island model” assumption, in which subpopulations are non-overlapping and have evolved independently from a common ancestral population. The Weir-Cockerham (WC) *F*_{ST} estimator assumes islands of differing sample sizes and equal *F*_{ST} per island [3]. The “Hudson” *F*_{ST} estimator assumes two islands with different *F*_{ST} values [4]. These *F*_{ST} estimators are ratio estimators derived using the method of moments to have unbiased numerators and denominators, which gives approximately unbiased ratio estimates [3–5], and they are important contributions used widely in the field.

Kinship coefficients are now commonly calculated in population genetics studies to capture structure and relatedness. They are utilized in principal components analyses and linear-mixed effects models to correct for structure in Genome-Wide Association Studies (GWAS) and to estimate genome-wide heritability [6–15]. The most commonly used kinship estimator for genotype data [9, 10, 13–18] is also a method of moments estimator whose operating characteristics are largely unknown in the presence of structure. As we show here, the required assumption for this popular estimator to be accurate is that the average kinship be zero, which implies that the population must be unstructured.

Recent genome-wide studies have revealed that humans and other natural populations are structured in a complex manner that violate the assumptions of the above estimators. This has been observed in several large human studies, such as the Human Genome Diversity Project [19], the 1000 Genomes Project [20], and other contemporary [21, 22] and archaic populations [23, 24]. Therefore, there is a need for innovative approaches designed for complex population structures. To this end, we reveal the operating characteristics of these frequently used *F*_{ST} and kinship estimators in the presence of arbitrary forms of structure with the goal of identifying new estimation strategies for *F*_{ST} and kinship.

We generalized the definition of *F*_{ST} for arbitrary population structures in the first paper in this series [25]. Additionally, we derived connections between *F*_{ST} and three models: arbitrary kinship coefficients [1, 26], individual-specific allele frequencies [27, 28], and admixture models [29–31]. Here, we study existing *F*_{ST} and kinship method of moments estimators in models that allow for arbitrary population structures (see Fig. 1 for an overview of the results). First, we obtain new strong convergence results for a family of ratio estimators that includes *F*_{ST} and kinship estimators. Next, we calculate the convergence values of these estimators under arbitrary population structures, where we find biases that are not present under their original assumptions about structure. We characterize the limit of the standard kinship estimator for the first time, identifying complex biases or distortions that have not been described before. We construct an admixture model, which represents a form of structure distinct from the island model, to illustrate our theoretical findings through simulation. We analyze 1000 Genomes Project populations to illustrate their non-island nature, and measure differentiation that is missed by the Hudson *F*_{ST} estimator. We identify a new direction for estimating *F*_{ST} and kinship in a nearly unbiased fashion, which is the topic of our next paper in this series [32].

## 2 Models and definitions

Here we summarize new arbitrary population structure models, definitions, and results presented in detail in the first paper in this series [25] (Fig. 1). We assume a complete matrix of *m* SNPs and *n* individuals. We concentrate on biallelic genotypes *x*_{ij} for SNP *i* and individual *j*, encoded as the number of reference alleles: *x*_{ij} = 2 is homozygous for the reference allele, *x*_{ij} = 0 is homozygous for the alternative allele, and *x*_{ij} = 1 is heterozygous. We assume the existence of a panmictic ancestral population *T* characterized by ancestral reference allele frequencies ∈ (0, 1) for every SNP *i*.

### 2.1 The kinship model and the generalized *F*_{ST}

Under the kinship model, individuals receive their alleles as determined by their inbreeding and kinship coefficients. The inbreeding coefficient of *j* is the probability that two alleles at a random SNP of individual *j* are IBD [33]. Similarly, the kinship coefficient of *j* and *k* is the probability that two alleles chosen at random from each individual and at a random SNP are IBD [1]. The ancestral population *T* determines what is IBD: only relationships since *T* count toward IBD. The first two moments of the genotypes are
where self-kinship is [1, 2, 26, 33]. Lastly, if *S* is a panmictic population that evolved from *T*, then is the value of shared by all individuals *j* in *S* relative to *T*, and equals Wright’s *F*_{ST} for this subdivided population [2].

The generalized *F*_{ST} definition that we proposed [25] requires the notion of *local populations*, needed to mirror at the individual level Wright’s distinction between structural inbreeding due to the population structure from local inbreeding [2]. The local population *L*_{j} of individual *j* is the most recent ancestral population of *j* [25]. Similarly, the *jointly local population L*_{jk} of a pair of individuals *j* and *k* is the most recent ancestral population shared by *j* and *k*, which is ancestral to both *L*_{j} and *L*_{k} [25]. For *T* ancestral to *L*_{j} or *L*_{jk}, as needed, we have three parameter pairs: “total” , “local” , and “structural” kinship and inbreeding coefficients, related by [25]
A *locally outbred* individual has and therefore . Similarly, a pair of *locally unrelated* individuals have and therefore . The generalized *F*_{ST} is given by
where *w*_{j} > 0, are weights chosen to capture the sampling procedure of individuals [25]. The individual-level pairwise *F*_{ST} is the special case of Eq. (4) for *n* = 2 individuals, given by
where the second equality holds for any *T* ancestral to *L*_{jk} [25].

### 2.2 The coancestry model for individual-specific allele frequencies

Previous *F*_{ST} estimators are often in terms of population allele frequencies [2–5]. Our earlier proposed coancestry model [25] extends previous models [5, 34] of population allele frequencies to individuals. The *individual-specific allele frequency* (IAF) is denoted *π*_{ij} ∈ [0, 1] for SNP *i* and individual *j* [27, 28]. In our model, IAFs are random variables drawn from *T* according to the population structure, with covariances between individuals *j* and *k* parametrized by the *individual-specific coancestry coefficients* . We assume that the IAF moments and genotypes are drawn as
We derived the following correspondence between coancestry and kinship coefficients by marginalizing *π*_{ij} from this model and comparing to Eqs. (1) and (2) [25]:
For this reason, and the similarities between Eqs. (1) and (2) and Eqs. (6) and (7), estimators based on genotypes can be readily restated in terms of IAFs, and viceversa. Due to Eq. (8), individuals in the coancestry model are locally outbred and unrelated, a key difference from the more general kinship model, so and for *j* ≠ *k* also hold. Therefore, *F*_{ST} in this model equals

## 3 Assessing the accuracy of genome-wide estimators

Many *F*_{ST} and kinship coefficient method of moments estimators are “ratio estimators”, a class that tends to be biased and have no closed form expectation [35]. In the literature, the expectation of a ratio is frequently approximated with a ratio of expectations [3–5]. Specifically, estimators are often called “unbiased” if the ratio of expectations is unbiased, even though the ration estimator itself may be biased. Here we characterize the behavior of two ratio estimator families calculated from genome-wide data, detailing conditions where this previous approximation is justified and providing additional criteria to assess the accuracy of such estimators.

The general problem involves random variables *a*_{i} and *b*_{i} calculated from genotypes at each SNP *i*, such that E[*a*_{i}] = *Ac*_{i} and E[*b*_{i}] = *Bc*_{i} and the goal is to estimate . *A* and *B* are constants shared across SNPs (given by *F*_{ST} or ), while *c*_{i} depends on the ancestral allele frequency and varies per SNP. The problem is that the single SNP estimator is biased, since [35]. Below we study two estimator families that combine SNPs to better estimate .

The solution we recommend is the “ratio-of-means” estimator , where and , which is common for *F*_{ST} estimators [3–5]. Note that and , where . We will assume bounded terms (|*a*_{i}|, |*b*_{i}| ≤ *C* for some finite *C*), a convergent , and *Bc* ≠ 0, which are satisfied by common estimators. Given independent SNPs, we prove almost sure convergence to the desired quantity (Appendix A.1),
a strong result that implies , justifying previous work [3–5]. Moreover, the error between these expectations scales with (Appendix A.2), just as for standard ratio estimators [35]. Although real SNPs are not independent due to genetic linkage, this estimator will perform well if the effective number of independent SNPs is large.

Another approach is the “mean-of-ratios” estimator , used often to estimate kinship coefficients [9, 10, 13–18] and *F*_{ST} [20]. If each is biased, their average across SNPs will also be biased, even as *m* → ∞. However, if for all SNPs *i* = 1, …, *m* as the number of individuals *n* → ∞, and Var is bounded, then
Therefore, mean-of-ratios estimators must satisfy more restrictive conditions than ratio-of-means estimators, as well as both large *n* and *m*, to estimate well.

## 4 *F*_{ST} estimation based on the island model

### 4.1 The island model *F*_{ST} estimator for infinite population sample sizes

Here we study the Weir-Cockerham (WC) [3] and “Hudson” [4] *F*_{ST} estimators, which assume the island model. These method of moment estimators have small sample size corrections that remarkably make them consistent as the number of independent SNPs *m* goes to infinity for finite numbers of individuals. However, these small sample corrections also make the estimators more notationally cumbersome than needed here. In order to illustrate clearly how these estimators behave, both under the island model and arbitrary structure, here we construct simplified versions that assume infinite sample sizes per population. This simplification corresponds to eliminating statistical sampling, leaving only genetic sampling to analyze [36]. Note that our simplified estimator nevertheless illustrates the general behavior of the WC and Hudson estimators under arbitrary structure, and the results are equivalent to those we would obtain under finite sample sizes of indivduals.

The Hudson *F*_{ST} estimator compares two populations [4]; we present a generalized Hudson estimator for *K* populations in Appendix B. Let us assume that population sample sizes are infinite, so allele frequencies are known. Let *j* index populations rather than individuals, *n* be the number of populations, and *π*_{ij} be the allele frequency in population *j* at SNP *i*. In this special case, both WC and Hudson simplify to the following island model *F*_{ST} estimator:
The goal is to estimate *F*_{ST} of Eq. (10) with uniform weights , under our coancestry model defined in Eqs. (6) – (8).

### 4.2 *F*_{ST} estimation under the island model

Under the island model, for *j* ≠ *k*, the estimator of Eq. (14) can be derived directly using the method of moments (Appendix C.1). Given the IAF moment Eqs. (6) and (7), the expectations of the two recurrent terms of Eq. (14) are
Eliminating and solving for *F*_{ST} in this system of equations recovers the estimator of Eq. (14).

Before applying the convergence result of Eq. (11), we test that its assumptions are met. The SNP *i* terms are and , which satisfy E[*a*_{i}] = *Ac*_{i} and E[*b*_{i}] = *Bc*_{i} with *A* = *F*_{ST}, *B* = 1, and . Further, over the distribution across SNPs. Lastly, since hold, then and , and since *n* ≥ 2, *C* = 1 bounds both |*a*_{i}| and |*b*_{i}|. Therefore, for independent SNPs,

### 4.3 *F*_{ST} estimation under arbitrary coancestry

Now we consider applying the island *F*_{ST} estimator to non-island settings. The key difference is that for every (*j*, *k*) will be assumed in our coancestry model of Eqs. (6) and (7). In this general setting, (*j*, *k*) may index either populations or individuals. The two terms of now satisfy
where is the mean coancestry with uniform weights. There are two equations but three unknowns: *F*_{ST}, , and . Island models satisfy , which allows for the consistent estimation of *F*_{ST}. Therefore, the new unknown precludes consistent *F*_{ST} estimation without additional assumptions.

The island model *F*_{ST} estimator converges more generally to
where it should be noted that
is the average of all between-individual coancestry coefficients, a term that appears in a related result for populations [5]. Therefore, under arbitrary structure the island model estimator’s bias is due to the coancestry between individuals (or islands in the traditional, non-overlapping subpopulation setting).

Since (Appendix D), this estimator has a downward bias in non-island settings: it is asymptotically unbiased only when , while bias is maximal when , where . For example, if for most pairs of individuals, then as well, and . Therefore, the magnitude of the bias of is unknown if is unknown, and small may arise even if *F*_{ST} is very large.

### 4.4 Consistent estimator of the individual-level pairwise *F*_{ST}

The individual-level pairwise *F*_{ST}, equal to *F*_{ST} for *n* = 2 and denoted by *F*_{jk}, is always an island model since *T* = *L*_{jk} must the most recent ancestral population shared by (*j*, *k*) and satisfies [25]. Hence, *F*_{jk} can be estimated consistently using of Eq. (14) with *n* = 2, which simplifies to
where the limit is stated for general *T* ≠ *L*_{jk} and matches *F*_{jk} under the coancestry model [25].

To obtain an estimator of *F*_{jk} that uses genotypes, we replace *π*_{ij} by in Eq. (16) and convert kinship to inbreeding coefficients using , resulting in
which converges to *F*_{jk} in Eq. (5) if *j* and *k* are locally outbred and locally unrelated . For general values of , and ,
which is obtained by substituting Eq. (3) into Eq. (17) and rearranging. Since is the only negative term in Eq. (18), local kinship can result in negative estimated from genotypes.

To compare our individual-level estimates to Hudson estimates between the two populations *S*_{u} and *S*_{v} that are not necessarily panmictic, consider the following average across populations and its limit assuming locally outbred and locally unrelated individuals:
If *S*_{u} and *S*_{v} are panmictic populations, then *L*_{j} = *S*_{u}, *L*_{k} = *S*_{v}, and *L*_{jk} = *L*_{uv}, so every pair and their average match the limit of the Hudson estimator. In Section 7, we show empirically that tends to be larger than the corresponding Hudson estimate when *S*_{u} and *S*_{v} are structured.

### 4.5 Coancestry estimation as a method of moments

Since the generalized *F*_{ST} is given by coancestry coefficients in Eq. (10), a new *F*_{ST} estimator could be derived from estimates of . Here we attempt to define a method of moments estimator for , and find an underdetermined estimation problem, just as for *F*_{ST}.

Given IAFs and Eqs. (6) and (7), the first and second moments that average across SNPs are where , and is as before.

Suppose first that only are of interest. There are *n* estimators given by Eq. (21) with *j* = *k*, each corresponding to an unknown . However, all these estimators share two nuisance parameters: and . While can be estimated from Eq. (20), there are no more equations left to estimate , so this system is underdetermined. The estimation problem remains underdetermined if all estimators of Eq. (21) are considered rather than only the *j* = *k* cases. Therefore, we cannot estimate coancestry coefficients consistently using only the first two moments and without additional assumptions.

## 5 Characterizing a kinship estimator and its relationship to *F*_{ST}

Estimation of kinship coefficients is an important problem, particularly for GWAS approaches that control for population structure [6–18, 37, 38]. Additionally, kinship coefficients are closely related to the generalized *F*_{ST} of Eq. (4) and the biases of in Eq. (15) (since coancestry and kinship coefficients are related by Eq. (3)). In this section, we focus on a standard kinship method of moments estimator and calculate its limit for the first time (Fig. 1). We study estimators that use genotypes or IAFs, and construct *F*_{ST} estimators from their kinship estimates. We find biases comparable to those of , and define unbiased *F*_{ST} estimators that require knowing the mean kinship or coancestry, or its proportion relative to *F*_{ST}. Lastly, we present a new kinship method of moments estimator with a uniform bias, which facilitates the estimation of the unknown mean kinship parameter needed to unbias kinship and *F*_{ST} estimates (Fig. 1).

### 5.1 Characterization of the standard kinship estimator

Here we analyze a standard kinship estimator that is in frequent use [9, 10, 13–18]. We generalize this estimator to use weights in estimating the ancestral allele frequencies, and we write it as a ratio-of-means estimator due to the favorable theoretical properties of this format as detailed in Section 3:
The estimator in Eq. (23) resembles the sample genotype covariance, but centers by SNP *i* rather than by individuals *j* and *k*, and normalizes by estimates of . We also derive the estimator of Eq. (23) directly using the method of moments (Appendix C.2). The weights in Eq. (22) must satisfy *w*_{j} > 0 and , so and hold.

Assuming the moments of Eqs. (1) and (2), we find that Eq. (23) converges to
where and . (See Appendix E for moments involving *x*_{ij} and that lead to Eq. (24).) Therefore, the bias of varies per *j* and *k*. Analogous distortions have been observed for sample covariances of genotypes [39]. Similarly, inbreeding coefficient estimates derived from Eq. (23) converge to
The limits of the ratio-of-means versions of two more estimators [14] are, if uses Eq. (22),

The estimators of Eqs. (23) and (26) are unbiased when is replaced by [10, 14], and are consistent when is consistent [27]. Surprisingly, of Eq. (22) is not consistent (it does not converge almost surely) for arbitrary population structures, which is at the root of the bias of Eq. (24). In particular, although is unbiased, its variance (see Appendix E),
may be asymptotically non-zero as *n* → ∞, since ∈ (0, 1) is fixed and may take on any value in [0, 1] for arbitrary population structures. Further, as *n* → ∞ if and only if for almost all pairs of individuals (*j*, *k*). These observations hold for any weights such that *w*_{j} > 0, . An important consequence is that the plug-in estimate of is biased (Appendix E),
which is present in all estimators we have studied.

### 5.2 Estimation of coancestry coefficients from IAFs

Here we form a coancestry coefficient estimator analogous to Eq. (23) but using IAFs. Assuming the moments of Eqs. (6) and (7), this estimator and its limit are where and are analogous to and . Eq. (28) generalizes Eq. (12) for arbitrary weights. Thus, use of IAFs does not ameliorate the estimation problems we have identified for genotypes. Like Eq. (27), of Eq. (28) is not consistent because Var , which causes the bias observed in Eq. (29).

### 5.3 Plug-in *F*_{st} estimator from inbreeding or coancestry estimates

Since the generalized *F*_{ST} is defined as a mean inbreeding coefficient in Eq. (4), or equivalently a mean self-coancestry coefficient in Eq. (10), here we study *F*_{ST} estimators constructed as either or . Although the previous and are biased, we nevertheless plug them into our definition of *F*_{ST} so that we may study how bias manifests. Note that we do not recommend utilizing these *F*_{ST} estimators in practice, but we find these results informative for identifying how to proceed in deriving new estimators.

Remarkably, the three estimators of Eqs. (25) and (26) give exactly the same plug-in if the weights in *F*_{ST} and of Eq. (22) match, namely
where the limit assumes locally outbred individuals so holds. The analogous *F*_{ST} estimator for IAFs and its limit are
The estimators of Eqs. (30) and (31) for individuals and their limits resemble those of classical *F*_{ST} estimators for populations of the form [5, 34]. of Eq. (31) for uniform weight is also *G*_{ST} for a biallelic locus [40] if we treat individuals *j* as populations and combine SNPs as a ratio-of-means estimator. Compared to of Eq. (14), of Eq. (31) admits arbitrary weights and, by forgoing bias correction under the island model, is a simpler target of study.

Like of Eq. (14), of Eqs. (30) and (31) are downwardly biased since . of Eq. (31) may converge arbitrarily close to zero since can be arbitrarily close to *F*_{ST} (Appendix D). Moreover, although for large *n* (due to Eq. (9)), in extreme cases can exceed *F*_{ST} under the coancestry model (where holds) and also under extreme local kinship, where of Eq. (30) converges to a negative value.

### 5.4 Adjusted consistent *F*_{ST} estimators and the “bias coefficient”

Here we explore two adjustments to from IAFs of Eq. (31) that rely on having minimial additional information needed to correct its bias. If is known, the bias in Eq. (31) can be reversed, yielding the consistent estimator
Consistent estimates are also possible if a scaled version of is known, namely
which we call the “bias coefficient” and has interesting properties. This coefficient measures the strength of the covariances relative to the variances, and satisfies 0 ≤ s ≤ 1 (Appendix D). The limit of Eq. (31) in terms of *s* is
Treating the limit as equality and solving for *F*_{ST} yields the following consistent estimator:
Note that and from Eqs. (13) and (14) are the special case of Eqs. (35) and (36) for uniform weights and ; hence, generalizes .

Lastly, using either Eq. (31) or Eq. (34), the relative error of converges to
which is approximated by s if *F*_{ST} ≪ 1, hence the name “bias coefficient”.

### 5.5 A new direction for *F*_{ST} and kinship estimation

Here, we outline a new estimation framework for kinship coefficients that has properties favorable for obtaining nearly unbiased estimates. These new kinship estimates can then also be utilized for *F*_{ST} estimation. We summarize our ideas here and then fully develop the estimation framework and study its operating characteristics in the next paper in this series [32].

Applying the method of moments to Eqs. (1) and (2), we derive the following estimator,
which compares favorably to the standard estimator of Eq. (24) by having a uniform bias in the limit, controlled by the sole parameter . If were known, Eq. (38) could be adjusted to yield unbiased kinship estimates:
Remarkably, Eq. (38) itself can be used to estimate : assuming and a large number of SNPs *m*, then
from which can be solved. However, additional steps can be taken to provide a more stable estimate than that based on [32]. Our improved kinship estimator will result in a plug-in *F*_{ST} estimator with increased accuracy.

The analogous coancestry estimator using IAF is
Lastly, note that the inbreeding coefficient estimator derived from Eq. (38) using equals of Eq. (26), so the resulting plug-in *F*_{ST} estimator equals that of Eq. (30). Therefore, Eq. (38) by itself does not directly yield a new *F*_{ST} estimator.

## 6 An admixture simulation illustrates challenges in *F*_{ST} and kinship estimation

### 6.1 Overview of simulations

We simulate genotypes from two models to illustrate our results when the true population structure parameters are known. One is an island model, the other an admixture model differing from the island model by its pervasive covariance, and designed to induce large biases in existing *F*_{ST} estimators (Fig. 2). Both simulations have *n* = 1000 individuals, *m* = 300, 000 SNP loci, and *K* = 10 islands or intermediate populations. These simulations have *F*_{ST} = 0.1, comparable to estimates between human populations [4].

Our island model satisfies the Hudson estimator assumptions: populations are independent, and each population *S*_{u} has a different *F*_{ST} value of (Fig. 2A). Ancestral allele frequencies are drawn uniformly in [0.01, 0.5]. Allele frequencies for *S*_{u} and SNP *i* are drawn independently from the Balding-Nichols (BN) distribution [41] with parameters and . Every individual *j* in island *S*_{u} draws alleles randomly with probability . Podpulation sample sizes were drawn randomly (Appendix F).

Our admixture model is a “BN-PSD” model [9, 17, 25, 27, 42, 43], which we analyzed in our previous paper in this series [25]. The intermediate populations are islands that draw from the BN model, then each individual *j* constructs its allele frequencies as , which is a weighted average of with the admixture proportions *q*_{ju} of *j* and *u* as weights (which satisfy , as in the Pritchard-Stephens-Donnelly [PSD] admixture model [29–31]). We constructed *q*_{ju} that model admixture resulting from spread by random walk of the intermediate populations along a one-dimensional geography, as follows. Intermediate populations *S*_{u} are placed on a line with differentiation that grows with coordinate (Fig. 3A). Upon differentiation, individuals in each *S*_{u} spread with random walks, a process modeled by Normal densities (Fig. 3B). Admixed individuals derive their ancestry proportional to these Normal densities, resulting in a genetic structure governed by geography (Fig. 3C, Fig. 2B) and departing strongly from the island model (Fig. 3D). The amount of spread was chosen to give *s* = 0.5, which by Eq. (37) results in a large bias for (in contrast, the island simulation has *s* = 0.1). See Appendix F for additional details regarding these simulations.

### 6.2 Weir-Cockerham and Hudson *F*_{ST} estimators misapplied to an admixed population

Our admixture simulation illustrates the large biases that can arise if the WC and Hudson *F*_{ST} estimators are misapplied to non-island populations to estimate the generalized *F*_{ST}. First, we test these estimators in our island model. This simulation satisfies the assumptions of the Hudson estimator (which we generalized for *K* population islands in Appendix B), so it is consistent (Fig. 4A). The WC estimator assumes that for all *u*, which does not hold; nevertheless, WC has a small bias (Fig. 4A). For comparison, we added the “plug-in” *F*_{ST} estimator of Eq. (30) (weights from Appendix F), which is derived from the kinship estimator of Eq. (23) and does not have island model corrections. Since the number of islands *K* is large, the plug-in estimator has a small relative bias of about ; greater bias is expected for smaller *K*.

To apply the WC and Hudson estimators to the admixture model, individuals are assigned to “populations” grouping by their maximum admixture proportions (Fig. 3D). Both WC and Hudson estimates are smaller than the true *F*_{ST} by nearly half, as predicted by the limit of of Eq. (15) (Fig. 4C). By construction, the plug-in also has a large relative bias of about *s* = 50%; remarkably, the WC and Hudson estimators suffer from comparable biases. Thus, the island model corrections of the WC and Hudson estimators are insufficient for estimating *F*_{ST} in our admixture scenario.

### 6.3 Evaluation of individual-level pairwise *F*_{ST} estimators

Fig. 5A shows the matrix of true individual-level pairwise *F*_{ST} values, *F*_{jk}, for every pair of individuals in our simulation. *F*_{jk} is a distance between pairs of individuals, with *F*_{jk} = 0 for pairs from the same population and increasing values for more distant population pairs. Larger lead to smaller *F*_{jk} (see Eq. (16)), hence the (Fig. 2B) and *F*_{jk} (Fig. 5A) matrices are negatively correlated.

Both of our consistent *F*_{jk} estimators perform well, using IAFs (Eq. (16), Fig. 5B) and genotypes (Eq. (17), Fig. 5C). Estimates from genotypes have a greater root-mean-squared error (RMSE, 3.43% relative to the mean *F*_{jk}) than the estimates from true IAFs (RMSE of 0.319%).

### 6.4 Evaluation of the standard kinship estimator

Our admixture simulation illustrates the distortions of the kinship estimator of Eq. (23). The limit of Eq. (23) has a fixed bias if for all *j*. For that reason, we chose that vary per *u* (Fig. 3A), which causes large differences in per *j* and large distortions in .

Compared to the true (Fig. 6A, where are plotted along the diagonal), are very distorted, with an abundance of cases, negative estimates (blue in Fig. 6B), but remarkably also cases with (top left corner of Fig. 6B). Our ratio-of-means estimator agrees with the limit of Eq. (24) (Fig. 6C), which an RMSE of 2.14% relative to the mean . In contrast, mean-of-ratios estimates have an RMSE of 10.77% from the limit of Eq. (24) (not shown). The distortions are similar for the estimator that uses IAFs of Eq. (29) (not shown), with reduced RMSEs from its limit of 0.32% and 8.82% for the ratio-of-means and mean-of-ratios estimates, respectively.

### 6.5 Evaluation of plug-in and adjusted *F*_{ST} estimators

We illustrate the behavior of our plug-in and adjusted *F*_{ST} estimators using our admixture simulation. We tested IAF (Fig. 7A) and genotype (Fig. 7B) versions of our estimators. The unadjusted plug-in of Eq. (31) is severely biased (blue), by construction, and matches the calculated limit for IAFs and genotypes (green dotted lines in Fig. 7, which are close because ). We also tested the two consistent “adjusted” estimators and of Eqs. (32) and (36), which estimate *F*_{ST} quite well (blue predictions overlap the true *F*_{ST} red dashed line in Fig. 7). However, and are oracle methods, since they require parameters that are not known in practice.

Prediction intervals were computed from estimates over 39 independently-simulated IAF and genotype matrices (Appendix G). Estimator limits are always contained in these intervals, which holds since the number of independent SNPs (*m* = 300, 000) is sufficiently large. Estimates that use genotypes have wider intervals than estimates from IAFs; however, IAFs are not known in practice, and use of estimated IAFs might increase noise. Genetic linkage, not present in our simulation, will also increase noise in real data.

## 7 Analysis of 1000 Genomes Project populations

We analyze 1000 Genomes Project (TGP) populations [20] with the Hudson *F*_{ST} estimator for two populations and our individual-level *F*_{ST} estimator, , of Eq. (17). We focus on since it is currently our only consistent estimator for arbitrary population structures. We analyze the 20,417,698 biallelic SNP ascertained in YRI from autosomal chromosomes in the final “phase 3” data on the TGP website (dated 2013-05-02). Of these, 14,145,759 SNPs are polymorphic in the Hispanic populations and 8,932,115 in the European populations discussed below. Individuals in these data are roughly locally outbred and locally unrelated [20], which is the only requirement for the consistency of estimated from genotypes.

First we focus on YRI, CEU, and CHB, which were analyized previously [4]. These population pairs are geographically distant, so the island model is more likely to fit well. Indeed, Hudson estimates are relatively close to (compare upper and lower triangle of Fig. 8A). In other words, the structure within populations is dwarfed by the structure between populations. A direct comparison to the Hudson estimates is given by of Eq. (19), which averages across populations for *j* ∈ *S*_{u} and *k* ∈ *S*_{v}. We find good agreement between Hudson estimates and , corroborating a good fit of the island model (Fig. 8D).

Next, we analyze the four Hispanic populations in the TGP: PEL, MXL, CLM, and PUR. Hispanic individuals are admixed primarily from Native American, European, and African super-populations. Each of these populations is structured, a consequence of variable individual admixture proportions [27], so pairwise comparisons are poorly fit by the island model. The complex structure of these populations is confirmed by , finding many individuals that have closer relatives from other populations compared to some individuals from the same population (lower triangle of Fig. 8B). Here we find that are always larger than their corresponding Hudson estimates (Fig. 8E). The largest proportional discrepancy is between PUR and CLM, whose Hudson estimate is 40% of . The Hudson estimator is solely a function of average allele frequencies and sample sizes per population (Appendix B), so it averages out the substructure within populations, explaining the smaller estimates observed relative to .

Lastly, we analyze four European populations: FIN, GBR, IBS, and TSI. We exclude CEU due to its similarity to GBR and because it was not sampled within Europe. The structure of European populations was previously found to disagree with the island model [44]. We confirm structure within these populations, although differentiation is much smaller here (Fig. 8C). Notably, proportional differences between Hudson and are as large within Europe (Fig. 8F) as in the Hispanic populations (Fig. 8E). The largest proportional difference was between TSI and IBS, whose Hudson estimate is 41% of . Thus, our individual-level pairwise *F*_{ST} estimator, , detects structure that is missed by island model estimators.

## 8 Discussion

We investigated the most commonly utilized estimators of *F*_{ST} and kinship, both of which can be derived using the method of moments (Fig. 1). We determined the bias of these estimators under models of arbitrary population structure. We calculated the bias that occurs in the *F*_{ST} estimator when the island model assumption is violated. This bias is present even when individual-specific allele frequencies are known without error. We also showed that the kinship estimator is biased when the population is structured (particularly when the average kinship is of a similar magnitude to the true kinship coefficient), and that the bias may be different for each pair of individuals.

Use of island model *F*_{ST} estimators requires taking certain precautions, as exemplified in the Hudson *F*_{ST} estimator work [4]. First, the Hudson estimator is given for two populations only, since two panmictic populations are always independent relative to their last common ancestor population. Second, only geographically distant population pairs were compared [4], which appear internally unstructured relative to the structure between populations. However, *F*_{ST} is often estimated between closely related populations, for example, within Mexico [21], the United Kingdom [22], and between contemporary and archaic European [23] and Eurasian populations [24]. These geographically close populations are more likely to have comparable structure within and between populations, a case where Hudson underestimates differentiation, just as in the Hispanic and European populations in Fig. 8. Our analyses highlight the need for new tools that measure differentiation in complex population structures.

We have shown that the misapplication of existing *F*_{ST} estimators on non-island population structures may lead to estimates that approach zero even when the true generalized *F*_{ST} is large. Weir-Cockerham [3] and Hudson [4] *F*_{ST} estimates in our admixture simulation are biased by nearly a factor of two (Fig. 4). These estimators were derived assuming independent populations, so the observed biases arise from their misapplication to non-island populations. Nevertheless, natural populations often do not adhere to the island model, particularly human populations [44–46].

The kinship coefficient estimator we investigated is often used to control for population structure in GWAS and to estimate genome-wide heritability [9, 10, 13–18]. While this estimator was known to be biased [10, 18], no closed form limit had been calculated until now. We found that kinship estimates are biased downwardly on average, but bias also varies for every pair of individuals (Fig. 1, Fig. 6). Thus, the use of these distorted kinship estimates may be problematic in GWAS or estimating heritability, but to what extent remains to be determined.

We developed a theoretical framework for assessing these genome-wide ratio estimators of *F*_{ST} and kinship. We proved that common ratio-of-means estimators converge almost surely to the ratio of expectations for infinite independent SNPs (Appendix A.1). Our result justifies approximating the expectation of a ratio-of-means estimator with the ratio of expectations [3–5]. However, mean-of-ratios estimators may not converge to the ratio of expectations for infinite SNPs. Mean-of-ratios estimators are potentially asymptotically unbiased for infinite individuals, but it is unclear which estimators have this behavior. We found that the ratio-of-means kinship estimator had much smaller errors from the ratio of expectations than the more common mean-of-ratios estimator, whose convergence value is unknown. Thus, we recommend ratio-of-means estimators, whose asymptotic behavior is well understood.

The Hudson estimator is a consistent estimator of the pairwise *F*_{ST} for two populations [4], which is reported often [21–24, 45, 47]. We derived a consistent estimator of the individual-level pairwise *F*_{ST}, *F*_{jk}, which extends the previous pairwise *F*_{ST} to individuals [25]. However, kinship or *F*_{ST} estimates for more than two individuals cannot be recovered from *F*_{jk} estimates. Conceptually, kinship and *F*_{ST} are in terms of a single ancestral population *T*, whereas each *F*_{jk} is relative to a jointly local population *L*_{jk} that varies per (*j*, *k*) pair (see Eq. (5)). Practically, there is loss of information since *Fjj* = 0 for every *j* by definition: for *n* individuals, there are *n* more than *F*_{jk} parameters. We used our *F*_{jk} estimator to identify structure with individual resolution in 1000 Genomes Project populations (Fig. 8).

Accurate estimation of generalized *F*_{ST} and kinship coefficients in arbitrary population structures will require further innovations, and the results provided here may be useful in leading to more robust estimators in the future. This, in particular, is the topic we tackle in the next paper in this series [32].

## Appendices

## A Accuracy of ratio estimators

### A.1 Almost sure convergence of ratio-of-means estimators with independent and uniformly bounded terms

Here we prove that , where and give the ratio-of-means estimator described in the main text. It suffices to prove and , from which the result follows using the continuous mapping theorem [48, 49]. The proof for follows, which applies analogously to . Our *a*_{i} are independent but not identically distributed, since they depend on that varies per SNP, so the standard law of large numbers does not apply to *Â*_{m}. We show almost sure convergence using Kolmogorov’s criterion for the Strong Law of Large Numbers [50], which is satisfied for bounded Var(*a*_{i}). Since |*a*_{i}| ≤ *C* ≤ ∞ for all *i* and some *C* (see main text), then , so Var(*a*_{i}) ≤ *C*^{2}. Therefore, , as desired.

### A.2 Order of error of expectations

The error of the ratio of expectations from the expectation of the ratio is given by
which follows from Cov(*X*, *Y*) = E[*XY*] − E[*X*] E[*Y*] [51] and expanding the covariance. Previous work on ratio estimators [35, 51] assumes IID *a*_{i} and *b*_{i}, which does not hold for SNPs. Assuming independent SNPs (Cov(*a*_{i}, *a*_{j}) = 0 for *i* ≠ *j*) and large *m* so is practically independent of any given *a*_{i} and *b*_{j}, then
Since *a*_{i}, *b*_{i} are bounded, | Cov(*a*_{i}, *b*_{i}) | ≤ *C*^{2} for the same *C* of the previous section, so
holds for some large enough *m* and *C*. Hence is as for standard ratio estimators [35].

## B Generalized Hudson *F*_{ST} estimator

The Hudson *F*_{ST} estimator compares two populations [4]. We generalize this estimator for *n* independent populations, where *F*_{ST} equals the mean pairwise *F*_{ST} for every pair of populations. We average numerators and denominators of the pairwise estimator before computing the ratio. Let *j* index the n populations, *n*_{j} be the number of individuals sampled from *j*, and be the sample allele frequency in *j* for SNP *i*, then
which consistently estimates *F*_{ST} in island models.

## C Derivation of method of moment estimators

### C.1 *F*_{ST} island model estimator

Assuming the coancestry model of Eqs. (6) and (7) for islands ( for *j* ≠ *k*), the first and second moments of the IAFs are:
appears by averaging Eq. (C.2) over *j*:
Since Eq. (C.1) has the same value for every *j*, and Eq. (C.3) as well for every *j* ≠ *k*, we average these to reduce estimation variance. The results are in terms of :
*F*_{ST} also appears in Eq. (C.6) because *j* = *k* terms are introduced in the double sum. Subtracting Eq. (C.4) and Eq. (C.6) in turn from Eq. (C.5) results in:
To reduce variance further, we average across SNPs, giving
where . Eliminating and solving for *F*_{ST} in this system of equations results in the following *F*_{ST} estimator:
This estimator is simplified noting that appears in the IAF sample variance,
so substituting it into Eq. (C.7) recovers Eq. (14) as desired:

### C.2 Standard kinship estimator

Here we assume the kinship model of Eqs. (1) and (2). Since Eq. (1) is the same for all individuals *j*, we average these first moments to reduce variance:
Each appears once per (*j*, *k*) pair in Eq. (2), recast here in terms of the sample covariance:
Variance in the kinship estimate is reduced by averaging across SNPs, yielding:
The resulting estimator of is
which is plugged into Eq. (C.8) and then is solved for, recovering Eq. (23) as desired:

## D Mean coancestry bounds

Here we prove that, for any weights such that ,
holds, and for uniform weights also holds. Furthermore, holds iff for all (*j*, *k*), and holds for island models.

The Cauchy-Schwarz inequality for covariances implies . Therefore,
where the second inequality follows from Jensen’s inequality, since *x*^{2} is a convex function. Since , then *F*_{ST} ≤ 1 as well. Equality in the second bound requires for all *j*, and equality in the first bound requires , so that requires for all (*j*, *k*). Since all *w*_{j}, , then
where the second inequality follows from dropping *j* ≠ *k* terms from the double sum of . The case gives , with equality for island models by construction.

## E Moments of estimator building blocks

Here we calculate first and some second moments for “building block” quantities that recur in our estimators, particularly terms involving *x*_{ij} and , and which enable us to calculate the limits of our estimators. Below are examples for genotypes, which follow from Eqs. (1) and (2); calculations for IAFs follow analogously from Eqs. (6) and (7) (not shown).

## F Admixture and island model simulations

### F.1 Construction of population island allele frequencies

We simulate *K* = 10 population islands and *m* = 300, 000 independent SNPs. Every SNP *i* draws . We set , where τ ≤ 1 tunes *F*_{ST}. For the island model, , so gives the desired *F*_{ST} (τ ≈ 0.18 for *F*_{ST} = 0.1). For the admixture model, τ is found numerically (τ ≈ 0.90 for *F*_{ST} = 0.1; see last subsection). Lastly, are drawn from the Balding-Nichols distribution [41]:

### F.2 Random island sizes

We randomly generate samples sizes r = (*r*_{u}) for *K* islands and individuals, as follows. First, draw **x** ~ Dirichlet (1, …, 1) of length K and **r** = round(*n***x**). While , draw a new **r**, to prevent small islands (they do not occur in real data). Lastly, while , a random *u* is updated to *r*_{u} ← *r*_{u} + sgn(*δ*), which brings *δ* closer to zero. The resulting **r** is as desired. Weights for individuals *j* in *S*_{u} are so the generalized *F*_{ST} matches from the island model, which Hudson estimates [25].

### F.3 Admixture proportions from 1D geography

We construct *q*_{ju} resulting from random-walk migrations along a one-dimensional geography. Let *x*_{u} be the coordinate of intermediate population *u* and *y*_{j} the coordinate of a modern individual *j*. We assume *q*_{ju} is proportional to *f* (|*x*_{u} − *y*_{j}|), or
where *f* is the Normal density function with *μ* = 0 and tunable *σ*. The Normal density models random walks, where *σ* sets the spread of the populations (Fig. 6). Our simulation uses *x*_{u} = *u* and , so intermediate population span [1, *K*] and individuals span . For the WC and Hudson *F*_{ST} estimators, individual *j* is assigned to the subpopulation *S*_{u} with the largest *q*_{ju} (Fig. 3D); thus these subpopulations have equal sample size, so is appropriate.

### F.4 Choosing *σ* and *τ*

Here we find values for *σ* (controls *q*_{jk}) and T (scales ) that give and *F*_{ST} = 0.1 in the admixture model. We previously found that and holds for the BN-PSD model [25]. In our simulation, and hold, so and . Therefore,
depends only on *σ*. A numerical root finder finds that σ ≈ 1.78 gives . For fixed *q*_{ju},
where *F*_{ST} is the desired value. *F*_{ST} = 0.1 is achieved with τ ≈ 0.901.

## G Prediction intervals of *F*_{ST} estimators

Prediction intervals with *α* = 95% correspond to the range of *n* = 39 independent *F*_{ST} estimates. In the general case, *n* independent statistics are given in order *X*_{(1)} < … < *X*_{(n)}. Then *I* = [*X*_{(j)}, *X*_{(n+1−j)}] is a prediction interval with confidence [52]. In our case, *j* = 1 and *n* = 39 gives *α* = 0.95, as desired. Each estimate was constructed from simulated data with the same dimensions and structure as before (fixed and *q*_{ju}; fixed sample sizes too for island models), but with , *π*_{ij}, *x*_{ij} drawn anew.