## Abstract

*F*_{ST} is a fundamental measure of genetic differentiation and population structure, currently defined for subdivided populations. *F*_{ST} in practice typically assumes *independent, non-overlapping subpopulations*, which all split simultaneously from their last common ancestral population so that genetic drift in each subpopulation is probabilistically independent of the other subpopulations. We introduce a generalized *F*_{ST} definition for arbitrary population structures, where individuals may be related in arbitrary ways, allowing for arbitrary probabilistic dependence among individuals. Our definitions are built on identity-by-descent (IBD) probabilities that relate individuals through inbreeding and kinship coefficients. We generalize *F*_{ST} as the mean inbreeding coefficient of the individuals’ local populations relative to their last common ancestral population. We show that the generalized definition agrees with Wright’s original and the independent subpopulation definitions as special cases. We define a novel coancestry model based on “individual-specific allele frequencies” and prove that its parameters correspond to probabilistic kinship coefficients. Lastly, we extend the Pritchard-Stephens-Donnelly admixture model in the context of our coancestry model and calculate its *F*_{ST}. To motivate this work, we include a summary of analyses we have carried out in follow-up papers, where our new approach has been applied to simulations and global human data, showcasing the complexity of human population structure, demonstrating our success in estimating kinship and *F*_{ST}, and the shortcomings of existing approaches. The probabilistic framework we introduce here provides a theoretical foundation that extends *F*_{ST} in terms of inbreeding and kinship coefficients to arbitrary population structures, paving the way for new estimators and novel analyses.

Note: This article is Part I of a two-part manuscript. We refer to the two parts in the text as Part I and Part II, respectively.

**Part I:** Alejandro Ochoa and John D. Storey. “*F*_{ST} and kinship for arbitrary population structures I: Generalized definitions”. *bioRxiv* (10.1101/083915) (2019). https://doi.org/10.1101/083915. First published 2016-10-27.

**Part II:** Alejandro Ochoa and John D. Storey. “*F*_{ST} and kinship for arbitrary population structures II: Method of moments estimators”. *bioRxiv* (10.1101/083923) (2019). https://doi.org/10.1101/083923. First published 2016-10-27.

## 1 Introduction

A population of mating organisms is *structured* if its individuals do not mate randomly, which results in an increase in mean homozygosity over the population compared to that of a randomly mating population [3, 4]. *F*_{ST} is a parameter that measures population structure [5, 6], which is typically understood through homozygosity. An unstructured population has *F*_{ST} = 0 and genotypes at each locus have Hardy-Weinberg proportions. At the other extreme, a fully differentiated population has *F*_{ST} = 1 and every subpopulation at every locus is homozygous for some allele. In addition to measuring population differentiation, *F*_{ST} is also used to model DNA profile matching uncertainty in forensics [7–13] and to identify loci under selection [14–21]. Current *F*_{ST} definitions assume a population partitioned or subdivided into discrete, non-overlapping subpopulations [5, 6, 22–24]. Many *F*_{ST} estimators further assume that subpopulations have evolved independently from the most recent common ancestor (MRCA) population [21–24], which occurs only if every subpopulation split from the MRCA population at the same time (Fig. 1A, Fig. 2A). However, populations such as humans are not naturally subdivided [11, 25–27] (Fig. 1B); thus, arbitrarily imposed subdivisions may yield correlated subpopulations that no longer satisfy the independent subpopulations model assumed by existing *F*_{ST} estimators (Fig. 2B). In this work, we build a generalized *F*_{ST} definition applicable to arbitrary population structures, including arbitrary evolutionary dependencies.

Natural populations are often structured due to population size differences and the constraints of distance and geography [31]. For example, the genetic population structure of humans shows evidence of population bottlenecks migrating out of Africa [32–40] as well as numerous admixture events [41–45]. Notably, human populations display genetic similarity that decays smoothly with geographic distance, rather than taking on discrete values as would be expected for independent subpopulations [11, 27, 35, 37–39] (Fig. 1B). Current *F*_{ST} definitions do not apply to these complex population structures.

*F*_{ST} is known by many names, including fixation index [6] and coancestry coefficient [23, 46]). *F*_{ST} is also alternatively defined in terms of the variance of subpopulation allele frequencies [6], variance components [47], correlations [22], and genetic distance [46]. Our generalized *F*_{ST} is defined using inbreeding coefficients, like Wright’s *F*_{ST}. There is also a diversity of summary statistics that measure locus-specific differentiation, such as *G*_{ST}, , and *D*, which are functions of observed allele frequencies, and which approximate *F*_{ST} under certain conditions [48–53]. We consider *F*_{ST} as a genome-wide evolutionary parameter given by relatedness, which modulates the random drift of allele frequencies across loci but does not depend on these frequencies, mutation rates, or other locus-specific features. We review these previous *F*_{ST} definitions in greater detail in Supplementary Information, Section S1. The focus of our work is to generalize and accurately estimate the genome-wide *F*_{ST} in individuals with arbitrary relatedness, and does not presently concern locus-specific *F*_{ST} estimation or the identification of loci under selection.

The developments in this paper have led to improved estimates of *F*_{ST} and kinship in Part II [2]. We have also applied these new probabilistic quantities and estimators to data from the Human Origins and 1000 Genomes Project data sets in ref. [54]. To motivate the generalized definitions we present in this work, in Section 2 we provide an overview of simulation results demonstrating the accuracy of the estimators (from Part II) and findings from analyzing the Human Origins and 1000 Genomes Project datasets (from ref. [54]). These results establish that a generalized definition of *F*_{ST} in terms of kinship and inbreeding for arbitrary population structures is needed.

In Section 3 we formally define kinship and inbreeding coefficients, which measure how individuals are related, quantify population structure, and are the foundation of our work. We then generalize *F*_{ST} in terms of individual parameters (namely, inbreeding coefficients), and in analogy to Wright’s *F*_{IS}, model local inbreeding on an individual basis. Our *F*_{ST} applies to arbitrary population structures, generalizing previous *F*_{ST} definitions restricted to subdivided populations.

In Section 4 we show a connection between the coalescent and kinship, inbreeding and generalized *F*_{ST}. This provides a generalization of a previous result showing the relationship between the coalescent and the classic *F*_{ST} defined on subdivided populations. In Section 5 we define a coancestry model that parametrizes the correlations of “individual-specific allele frequencies” (IAFs), a recent tool that also accommodates individual-specific relationships [55, 56]. Our model is related to previous models between populations [23, 57]. We prove that our coancestry parameters correspond to kinship coefficients, thereby preserving their probabilistic interpretations, and we relate these parameters to *F*_{ST}.

Lastly, in Section 6 we provide a novel *F*_{ST} analysis for admixed individuals by applying our coancestry model from Section 5 to the widely-used Pritchard-Stephens-Donnelly (PSD) admixture model, in which individuals derive their ancestry from several ancestral subpopulations with individual-specific admixture proportions [58–60]. We analyze an extension of the PSD model [55, 61–64] that generates allele frequencies from the Balding-Nichols distribution [7], and propose a more complete coancestry model for the ancestral subpopulations. We derive equations relating *F*_{ST} to the model parameters of PSD and its extensions. These results enable us to use an admixture simulation without independent subpopulations to benchmark kinship and *F*_{ST} estimators in Section 6 of Part II.

Our generalized definitions permit the analysis of *F*_{ST} and kinship estimators under arbitrary population structures, and pave the way forward to new estimation approaches, which are the focus of our following work in this series (Part II).

## 2 Motivating analyses

The results presented here lead to a deeper understanding of the limitations of existing *F*_{ST}, kinship, and inbreeding estimators. Specifically, the assumptions underlying existing estimators are too restrictive and do not align with the properties of human populations that have been revealed through recent studies. In Part II, we theoretically calculate and then numerically verify complex biases that manifest in existing estimators when the population structure and relatedness violate the assumptions of non-overlapping and independently evolving subpopulations. This then leads to the new estimators of *F*_{ST}, kinship, and inbreeding proposed in Part II. In ref. [54], we applied the estimators from Part II to data from the Human Origins study and 1000 Genomes Project (TGP). Those analyses show that the theory, methods, and simulations from Part I and Part II hold on real data. Although the results summarized in this section involve details presented in full in Part II and ref. [54], it may be useful to the reader to see the ultimate consequences of the theory presented in the current paper, Part I.

In Part II, we carried out simulations in two scenarios. The first scenario approximately satisfies the assumptions of the existing (Weir-Cockerham) estimate of *F*_{ST}. The second scenario is an admixture model (described in Section 6), which reflects the characteristics we have observed in real data where there are no well-defined independent subpopulations. Columns A and B of Fig. 3 show the results of these simulations. Both the existing and proposed estimators do well in the first scenario (Fig. 3A), where the population is divided into non-overlapping subpopulations that have independently evolved from a common ancestral population. However, in the second scenario (Fig. 3B), where these assumptions are violated, the existing estimators show notable downward bias. Our theoretical results determine exactly what this bias is for both kinship and *F*_{ST}.

In ref. [54], we then analyzed data from the Human Origins [28–30] and TGP studies [65], both of which consist of individuals sampled from a global distribution of ancestries. For the TGP data, we specifically limited our analysis to Hispanics. Our novel kinship estimates calculated on these data reveal a complex population structure in the global human population (Fig. 3C) and in Hispanics in particular (Fig. 3D). Since there are no independent subpopulations in the human data, existing kinship and *F*_{ST} estimates in these data will also be downwardly biased, which can be seen in the bottom two rows of Fig. 3C-D. In contrast, our more accurate novel *F*_{ST} estimates measure greater differentiation than has been previously reported (Fig. 3C-D, second and fourth rows). A deeper analysis of our calculations reveals a clear connection between our estimated kinship structure (but not existing estimates) and the global human migrations under the African Origins model [54]. Our results suggest that common population genetic analyses on real human data will greatly benefit from our improved kinship and *F*_{ST} estimation framework.

## 3 Generalized definitions in terms of individuals

Now that we have established the need for a more flexible population structure model that does not assume independent subpopulations, we shall introduce here novel definitions required for this goal. First we review the formal definitions of kinship and inbreeding coefficients. Then we define a “local” population for every individual, which allows us to distinguish “structural” inbreeding due to the population structure from the “local” inbreeding that applies to individuals with closely-related parents. We then introduce our generalized *F*_{ST} definition as the mean structural inbreeding coefficient, and show that this definition equals the previous *F*_{ST} definition for independent subpopulations. We also generalize previous formulas for changing the reference ancestral population for kinship and inbreeding coefficients. Lastly, we review the connection between kinship coefficients and the covariance of genotypes.

### 3.1 Overview of data and model parameters

Table 1 summarizes the notation used in this work. Our models assume that genotypes at every locus evolve neutrally—by random drift only, in the absence of recent mutation and selection. Thus, only the population structure shapes the covariance structure of genotypes.

Let *x*_{ij} be observed biallelic genotypes for locus *i* ∈ {1,…, *m*} and diploid individual *j* ∈ {1,…, *n*}. Given a chosen reference allele at each locus, genotypes are encoded as the number of reference alleles: *x*_{ij} = 2 is homozygous for the reference allele, *x*_{ij} = 0 is homozygous for the alternative allele, and *x*_{ij} = 1 is heterozygous. We focus on biallelic loci since they vastly outnumber other types of genetic variants in humans. Note that a multiallelic model, which would require additional notation, could follow in analogy to previous *F*_{ST} work for populations [23].
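The genotype encoding above can be illustrated with a minimal sketch. The function name `encode_genotype` and the example alleles are hypothetical and chosen only for illustration; the sketch assumes a biallelic locus with a known reference allele.

```python
def encode_genotype(allele_1: str, allele_2: str, reference: str) -> int:
    """Return the number of reference alleles (0, 1, or 2) in a diploid genotype."""
    return int(allele_1 == reference) + int(allele_2 == reference)


# Hypothetical biallelic locus: reference allele "A", alternative allele "G".
example_pairs = [("A", "A"), ("A", "G"), ("G", "G")]
encoded = [encode_genotype(a, b, reference="A") for a, b in example_pairs]
```

Here `encoded` holds one value of *x*_{ij} per example genotype: homozygous reference, heterozygous, and homozygous alternative.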

We assume the existence of a panmictic ancestral population *T* for all individuals under consideration. *T* is generally not required to be the MRCA population, so many choices of *T* are possible. Note that *T* is a collection of organisms ancestral to a given set of individual organisms, shared by all loci, and it is not assumed that the alleles at a given locus coalesce in *T*. Two alleles are said to be “identical by descent” (IBD) if they originate from a single ancestor organism that lived more recently than the given ancestral population *T* [4, 6, 66]. In other words, relationships that precede *T* in time do not count as IBD, while relationships since *T* count toward IBD probabilities. Every locus *i* is assumed to have been polymorphic in *T*, with an ancestral reference allele frequency $p_i^T \in (0, 1)$, and no new mutations have occurred since then.

The inbreeding coefficient of individual *j* relative to *T*, $f_j^T$, is defined as the probability that the two alleles of any random locus of *j* are IBD when the ancestral population is *T* [67]. Therefore, $f_j^T$ measures the amount of relatedness within an individual, or the extent of dependence between its alleles at each locus. Similarly, the kinship coefficient of individuals *j* and *k* relative to *T*, $\varphi_{jk}^T$, is defined as the probability that two alleles at any random locus, each picked at random from each of the two individuals, are IBD when the ancestral population is *T* [5]. $\varphi_{jk}^T$ measures the amount of relatedness between individuals, or the extent of dependence across their alleles at each locus. Note that children *j* of parents (*k, l*) have an expected $f_j^T$ of $\varphi_{kl}^T$ [5]. Both $f_j^T$ and $\varphi_{jk}^T$ combine relatedness due to the population structure with recent or “local” relatedness, such as that of family members [68]. The values of $f_j^T$ and $\varphi_{jk}^T$ are functions of the chosen ancestral population *T*, which determines the level of relatedness that is treated as unrelated [4, 66]. Thus, $f_j^T$ and $\varphi_{jk}^T$ increase if *T* is an earlier rather than a more recent population. The expression “$f$ relative to *T*” refers to the value of $f$ when *T* is chosen as the reference ancestral population [6, 66]. The mean $f_j^T$ is positive in a structured population [67], and it also increases slowly over time in finite panmictic populations due to genetic drift [69].

Given an ancestral population *T* (not necessarily the MRCA population in this context) and an unstructured subpopulation *S* that evolved from *T*, Malécot defined *F*_{ST} as the mean $f_j^T$ over the individuals in *S* relative to *T* [5], which we denote by $f_S^T$. When *S* is itself structured, Wright defined three coefficients that connect *T*, *S* and individuals *I* in *S* [6]: *F*_{IT} (“total inbreeding”) is the mean $f_j^T$ of individuals (*I*) relative to *T*; *F*_{IS} (“local inbreeding”) is the mean $f_j^S$ of individuals (*I*) relative to *S*, which Wright did not consider to be part of the population structure; lastly, *F*_{ST} (“structural inbreeding”) is the mean $f_j^T$ relative to *T* that would result if individuals in *S* mated randomly (and which equals our $f_S^T$). The special case *F*_{IS} = 0 gives *F*_{ST} = *F*_{IT} [6]. See Supplementary Information, Section S1.1 for a more detailed review of these definitions. Wright created the distinction between *F*_{ST} and *F*_{IT} with animal breeding in mind, since mating systems for artificial selection could cause the local inbreeding (*F*_{IS}) and therefore also *F*_{IT} to be large at times, but *F*_{ST} measures the more relevant mean inbreeding that results after random mating resumes in the strain [67]. However, in large, natural populations *F*_{IS} is small so *F*_{ST} ≈ *F*_{IT} in these cases. The *F*_{ST} definition has been extended to a set of disjoint subpopulations, where it is the average *F*_{ST} of each subpopulation from the last common ancestral population [23, 24].

In practice, the ancestral population *T* is usually not identified explicitly, which obscures its role in estimating kinship and *F*_{ST}. Here we clarify this important matter. Every population of mating organisms can be modeled as descending from a panmictic ancestral population *T* (whether real or a mathematical construct) that at every locus contained the pool of ancestral alleles that modern individuals inherited. By default, the recommended choice of *T* is the MRCA population of the individuals in the sample [22–24, 66, 70]. For example, if all individuals are drawn from one effectively panmictic population, then this population is the MRCA population. In a pedigree with unrelated founders, the MRCA population consists of these founders [6, 31]. In a population structure defined by a tree, the MRCA population is the root node at which the first split occurs (Fig. 2). The choice of *T* sets the minimum possible value of $\varphi_{jk}^T$: a pair of unrelated individuals drawn from *T* have $\varphi_{jk}^T = 0$, and an individual from *T* (with unrelated parents by definition) has $f_j^T = 0$ [71]. Thus, assuming that such unrelated pairs are present in a sample, the set of $\varphi_{jk}^T$ values is in terms of the MRCA population *T* if and only if $\min_{j \ne k} \varphi_{jk}^T = 0$. If $\min_{j \ne k} \varphi_{jk}^T > 0$, then *T* is more ancestral than the MRCA population. Estimates with $\min_{j \ne k} \varphi_{jk}^T < 0$, which is impossible if $\varphi_{jk}^T$ is a probability, have an implicit *T* that is more recent than the MRCA population and cannot be interpreted biologically. For humans, if we ignore the limited Neanderthal and Denisovan introgressions [42, 43], the MRCA population is the real population estimated to have existed in Africa ≈100-200 thousand years ago [32, 33, 40], which first split into the ancestral southern African KhoeSan population (who speak unique “click languages”) and the rest of humans [32, 33, 37, 38, 40].
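The rule relating the minimum kinship coefficient to the reference population can be sketched as a small check on a kinship matrix. The function name `implied_reference`, the numerical tolerance `tol`, and the example matrices are assumptions made for illustration; they are not part of the original text.

```python
from typing import List


def implied_reference(kinship: List[List[float]], tol: float = 1e-8) -> str:
    """Compare the reference population T implied by a kinship matrix to the MRCA population.

    Uses only the minimum off-diagonal (between-individual) kinship value.
    The tolerance `tol` is an illustrative choice for treating values as zero.
    """
    n = len(kinship)
    min_phi = min(kinship[j][k] for j in range(n) for k in range(n) if j != k)
    if min_phi < -tol:
        # Negative values cannot be IBD probabilities.
        return "more recent than MRCA"
    if min_phi > tol:
        return "more ancestral than MRCA"
    return "MRCA"


# Hypothetical kinship matrices for small samples.
kin_mrca = [[0.55, 0.10, 0.00], [0.10, 0.55, 0.00], [0.00, 0.00, 0.50]]
kin_older = [[0.60, 0.20], [0.20, 0.60]]
kin_negative = [[0.50, -0.05], [-0.05, 0.50]]
```

In this sketch `kin_mrca` has a minimum of zero, so its values are relative to the MRCA population, while the other two matrices illustrate the two ways the reference can differ.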

### 3.2 Local populations

Our generalized *F*_{ST} definition depends on the notion of a local population. Our formulation includes as special cases the independent subpopulations and admixture models, and its generality is in line with recent efforts to model population structure on a fine scale [72, 73], through continuous spatial models [27, 74–76], or in a manner that makes minimal assumptions [56].

We define the *local population L*_{j} of an individual *j* as the MRCA population of *j*. In the simplest case, if *j*’s parents belong to the same panmictic subpopulation *S*, then *S* = *L*_{j}. However, if *j*’s parents belong to different subpopulations, then *L*_{j} is modeled as an admixed population (see example below). More broadly, *L*_{j} is the most recent panmictic population from which individual *j* drew its alleles and relative to which its inbreeding coefficient can be meaningfully defined. We define the “local” inbreeding coefficient of *j* to be $f_j^{L_j}$, and *j* is said to be *locally outbred* if $f_j^{L_j} = 0$.

For any population *T* ancestral to *L*_{j}, the parameter trio $(f_j^T, f_j^{L_j}, f_{L_j}^T)$ are individual-level analogs of Wright’s trio (*F*_{IT}, *F*_{IS}, *F*_{ST}) defined for a subdivided population [6], with *L*_{j} playing the role of *S*. Moreover, just like Wright’s coefficients satisfy

$$(1 - F_{IT}) = (1 - F_{IS})(1 - F_{ST}), \tag{1}$$

our individual-level parameters satisfy

$$(1 - f_j^T) = (1 - f_j^{L_j})(1 - f_{L_j}^T), \tag{2}$$

since the probability of the absence of IBD of *j* relative to *T* (which is $1 - f_j^T$) equals the product of the independent probabilities of absence of IBD at two levels: of *j* relative to *L*_{j} (which is $1 - f_j^{L_j}$), and of *L*_{j} relative to *T* (which is $1 - f_{L_j}^T$). Note that an individual *j* is locally outbred if and only if $f_j^T = f_{L_j}^T$.

Similarly, we define the *jointly local population L*_{jk} of the pair of individuals *j* and *k* as the MRCA population of *j* and *k*. Hence, *L*_{jk} is ancestral to both *L*_{j} and *L*_{k} (Fig. 2B). We define the “local” kinship coefficient to be $\varphi_{jk}^{L_{jk}}$, and *j* and *k* are said to be *locally unrelated* if $\varphi_{jk}^{L_{jk}} = 0$. Since the expected inbreeding coefficient of an individual is the kinship of its parents [5], it follows that locally-unrelated parents have locally-outbred offspring.

Consider an individual *j* in an admixture model, deriving alleles from two distinct subpopulations *A* and *B* with proportions *q*_{jA} and *q*_{jB} = 1 − *q*_{jA}. Then *L*_{j} is modeled as a population that at locus *i* has a reference allele frequency of $q_{jA} p_i^A + q_{jB} p_i^B$, where $p_i^A$ and $p_i^B$ are the allele frequencies in *A* and *B*, respectively. Considering a pair of individuals (*j, k*) and varying their admixture proportions, their jointly local population at one extreme is *L*_{jk} = *L*_{j} = *L*_{k} if and only if *q*_{jA} = *q*_{kA} (in other words, these individuals have the same local population if and only if their admixture proportions are the same); at the other extreme *L*_{jk} is the MRCA population of *A* and *B* if and only if *q*_{jA} = 1 and *q*_{kA} = 0 or vice versa (in other words, these individuals have the most distant jointly local population if and only if they are not admixed and belong to opposite subpopulations).
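As a small numerical illustration of the two-subpopulation example, the reference allele frequency of the local population *L*_{j} is the admixture-weighted mixture of the frequencies in *A* and *B*. The function name `local_allele_freq` and the frequency values below are hypothetical.

```python
def local_allele_freq(q_jA: float, p_A: float, p_B: float) -> float:
    """Allele frequency of the local population of an individual admixed between A and B.

    q_jA is the admixture proportion from A; the proportion from B is 1 - q_jA.
    """
    if not 0.0 <= q_jA <= 1.0:
        raise ValueError("admixture proportion must lie in [0, 1]")
    return q_jA * p_A + (1.0 - q_jA) * p_B


# Hypothetical allele frequencies at one locus in subpopulations A and B.
p_A, p_B = 0.2, 0.8
freq_unadmixed_A = local_allele_freq(1.0, p_A, p_B)  # individual entirely from A
freq_unadmixed_B = local_allele_freq(0.0, p_A, p_B)  # individual entirely from B
freq_half = local_allele_freq(0.5, p_A, p_B)         # evenly admixed individual
```

Two individuals with equal `q_jA` obtain the same frequency at every locus, which corresponds to sharing the same local population.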

### 3.3 The generalized *F*_{ST} for arbitrary population structures

Recall the individual-level analog of Wright’s *F*_{ST} is $f_{L_j}^T$, which measures the inbreeding coefficient of individual *j* relative to *T* due exclusively to the population structure (Fig. 2B, Table 1 and Section 3.2). We generalize *F*_{ST} for a set of *n* individuals as

$$F_{ST} = \sum_{j=1}^{n} w_j f_{L_j}^T, \tag{3}$$

where the most meaningful choice of *T* is the MRCA population of all individuals under consideration, and $w_j > 0$ with $\sum_{j=1}^{n} w_j = 1$ are fixed weights for these individuals. The simplest weights are $w_j = \frac{1}{n}$ for all *j*. However, we allow for flexibility in the weights so that one may assign them to reflect how individuals were sampled, such as a skewed or uneven sampling scheme. For example, if there are two local populations and the first has twice as many samples as the second, then this can be counteracted by weighting every individual from the first local population half as much as every individual from the second local population. In general, individuals can be weighted inversely proportional to their local population’s sample sizes, a scheme used implicitly in the Hudson pairwise *F*_{ST} estimator [24] and which we iterated for a hierarchy of subdivisions in our analysis of the Human Origins dataset [54]. However, for complex population structures without discrete subpopulations and no obvious sampling biases relative to geography or other variables, we favor uniform weights over complicated weighting schemes (the admixed Hispanic individuals were weighted uniformly in [54]).
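The weighted mean and the inverse-sample-size weighting scheme described above can be sketched as follows. The function names, the population labels, and the per-individual structural inbreeding values in `f_struct` are hypothetical; in practice these coefficients are unknown parameters that must be estimated.

```python
from collections import Counter
from typing import List, Optional, Sequence


def inverse_size_weights(labels: Sequence[str]) -> List[float]:
    """Weights inversely proportional to each local population's sample size, summing to 1."""
    counts = Counter(labels)
    num_pops = len(counts)
    return [1.0 / (num_pops * counts[label]) for label in labels]


def generalized_fst(f_struct: Sequence[float], weights: Optional[Sequence[float]] = None) -> float:
    """Weighted mean of per-individual structural inbreeding coefficients."""
    n = len(f_struct)
    if weights is None:
        weights = [1.0 / n] * n  # uniform weights
    if len(weights) != n or min(weights) <= 0 or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be positive, match f_struct, and sum to 1")
    return sum(w * f for w, f in zip(weights, f_struct))


# Hypothetical sample: the first local population has twice as many samples.
labels = ["pop1", "pop1", "pop2"]
f_struct = [0.1, 0.1, 0.3]
balanced_weights = inverse_size_weights(labels)
fst_uniform = generalized_fst(f_struct)
fst_balanced = generalized_fst(f_struct, balanced_weights)
```

With these values, each individual of the oversampled population receives half the weight of the individual from the other population, so both local populations contribute equally to `fst_balanced`.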

This generalized *F*_{ST} definition summarizes the population structure with a single value, intuitively measuring the average distance of every individual from *T*. Moreover, our definition contains the previous *F*_{ST} definition as a special case, as discussed shortly. For simplicity, we kept Wright’s traditional *F*_{ST} notation rather than using something that resembles our $f_{L_j}^T$ notation. A more consistent notation could be $\bar{f}_{L}^{T}$, which more clearly denotes the weighted average of $f_{L_j}^T$ across individuals. Our definition is more general because the traditional *S* population is replaced by a set of local populations {*L*_{j}}, which may differ for every individual.

#### 3.3.1 Mean heterozygosity in a structured population

Our generalized *F*_{ST} parametrizes the reduction in mean heterozygosity relative to the ancestral population *T* for arbitrary population structures, thus generalizing the familiar connection of the classical *F*_{ST} to allele fixation in an independently-evolving subpopulation. Here we will assume locally-outbred individuals, for which $f_j^T = f_{L_j}^T$. The expected proportion of heterozygotes *H*_{ij} of an individual with inbreeding coefficient $f_j^T$ at locus *i* with an ancestral allele frequency $p_i^T$ is given by [67]

$$H_{ij} = 2 p_i^T (1 - p_i^T)(1 - f_j^T).$$

The weighted mean of these expected proportions of heterozygotes across individuals, $\bar{H}_i = \sum_{j=1}^{n} w_j H_{ij}$, is given by our generalized *F*_{ST}:

$$\bar{H}_i = 2 p_i^T (1 - p_i^T)(1 - F_{ST}). \tag{4}$$

Hence, individuals have Hardy-Weinberg proportions at every locus if and only if *F*_{ST} = 0, which in turn happens if and only if $f_{L_j}^T = 0$ for each *j*. In the other extreme, individuals have fully fixed alleles at every locus ($\bar{H}_i = 0$) if and only if *F*_{ST} = 1, which in turn happens if and only if $f_{L_j}^T = 1$ for each *j*.

Eq. (4) presents an apparent paradox since a given sample estimate of the heterozygosity on one side does not depend on *T*, while *F*_{ST} and $p_i^T$ on the other side vary depending on our choice of ancestral population *T*. In fact, both sides of Eq. (4) are constant with respect to *T* under our model: *F*_{ST} increases as *T* is taken to be a more distant ancestral population, but $p_i^T$ also changes so that $p_i^T (1 - p_i^T)(1 - F_{ST})$ is constant in expectation (see Supplementary Information, Section S4 for a proof of this result).
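The relation between mean heterozygosity and the generalized *F*_{ST} can be checked numerically at a single locus. This sketch assumes locally-outbred individuals and uses the standard expression 2*p*(1 − *p*)(1 − *f*) for the expected heterozygosity of an individual with inbreeding coefficient *f*; the ancestral allele frequency, inbreeding values, and weights are hypothetical.

```python
def expected_heterozygosity(p_anc: float, f_total: float) -> float:
    """Expected proportion of heterozygotes given ancestral frequency and inbreeding."""
    return 2.0 * p_anc * (1.0 - p_anc) * (1.0 - f_total)


# Hypothetical values: one locus, three locally-outbred individuals.
p_anc = 0.3
f_struct = [0.05, 0.10, 0.30]   # structural inbreeding coefficients
weights = [0.50, 0.25, 0.25]    # positive weights summing to 1

# Generalized FST as the weighted mean of the inbreeding coefficients.
fst = sum(w * f for w, f in zip(weights, f_struct))

# Weighted mean heterozygosity computed individual by individual...
mean_het = sum(w * expected_heterozygosity(p_anc, f) for w, f in zip(weights, f_struct))
# ...and computed in one step from FST.
mean_het_from_fst = expected_heterozygosity(p_anc, fst)
```

The two quantities agree because the heterozygosity expression is linear in the inbreeding coefficient, so averaging individuals is equivalent to plugging in their weighted mean.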

#### 3.3.2 *F*_{ST} under the independent subpopulations model

Here we show that our generalized *F*_{ST} contains as a special case the currently-used *F*_{ST} definition for independent subpopulations. As discussed above, *F*_{ST} estimators often assume the independent subpopulations model, in which the population is divided into *K* non-overlapping subpopulations that evolved independently from their MRCA population *T* [22–24]. For simplicity, individuals are often further assumed to be locally outbred and locally unrelated. These assumptions result in the following block structure for our parameters,

$$f_j^T = f_{L_j}^T = f_{S_u}^T \ \text{ if } j \in S_u, \qquad \varphi_{jk}^T = \begin{cases} f_{S_u}^T & \text{if } j \ne k \text{ and } j, k \in S_u, \\ 0 & \text{if } j \in S_u,\ k \in S_{u'},\ u \ne u', \end{cases}$$

where *j, k* ∈ {1,…, *n*} index individuals, *S*_{u}, *S*_{u′} are disjoint subpopulations treated as sets containing individuals, and *u, u*′ ∈ {1,…, *K*} index these subpopulations. This population structure corresponds to a tree in which every subpopulation split from *T* at the same time (Fig. 2A), which is the required demographic scenario that leads to probabilistically-independent subpopulations.

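The generalized *F*_{ST} applied to independent subpopulations agrees with the previous *F*_{ST} definition of the mean per-subpopulation *F*_{ST} [23, 24]:

$$F_{ST} = \sum_{j=1}^{n} w_j f_{L_j}^T = \sum_{u=1}^{K} \Big( \sum_{j \in S_u} w_j \Big) f_{S_u}^T = \frac{1}{K} \sum_{u=1}^{K} f_{S_u}^T,$$

where the weights *w*_{j} are such that $\sum_{j \in S_u} w_j = \frac{1}{K}$ for each *u*. Note also that the *S*_{u} for *u* ∈ {1,…, *K*} act as the *K* unique local populations, where *L*_{j} = *S*_{u} whenever *j* ∈ *S*_{u}.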
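The block structure implied by independent subpopulations can be sketched by building a kinship matrix from per-subpopulation inbreeding values. The function name, labels, and values are hypothetical; the sketch assumes locally-outbred, locally-unrelated individuals, so that two distinct individuals in the same subpopulation share that subpopulation's inbreeding coefficient as their kinship, pairs from different subpopulations have zero kinship, and the diagonal holds the self kinship (1 + *f*)/2.

```python
from typing import Dict, List, Sequence


def independent_subpop_kinship(labels: Sequence[str], f_subpop: Dict[str, float]) -> List[List[float]]:
    """Kinship matrix (relative to the MRCA population) under independent subpopulations."""
    n = len(labels)
    kinship = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for k in range(n):
            if j == k:
                # Self kinship of a locally-outbred individual.
                kinship[j][k] = 0.5 * (1.0 + f_subpop[labels[j]])
            elif labels[j] == labels[k]:
                # Same subpopulation: kinship equals the subpopulation inbreeding.
                kinship[j][k] = f_subpop[labels[j]]
            # Different subpopulations: kinship stays at zero.
    return kinship


# Hypothetical sample with K = 2 subpopulations.
labels = ["S1", "S1", "S2"]
f_subpop = {"S1": 0.1, "S2": 0.3}
kinship = independent_subpop_kinship(labels, f_subpop)

# With weights summing to 1/K within each subpopulation, FST is the plain mean.
fst = sum(f_subpop.values()) / len(f_subpop)
```

The zero off-diagonal blocks in `kinship` reflect the independence of subpopulations, which holds only when all of them split from *T* at the same time.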

### 3.4 IBD probabilities with respect to a reference ancestral population

In developing the generalized *F*_{ST}, we have made use of equations that relate IBD probabilities in a hierarchy. Here we generalize these equations to individual inbreeding and kinship coefficients, which allow for transformations of these probabilities under a change of reference ancestral population. Our relationships are straightforward generalizations of Wright’s equation relating *F*_{IT}, *F*_{IS}, and *F*_{ST} in Eq. (1), now more generally applicable.

Let *A* be a population ancestral to population *B*, which is in turn ancestral to population *C*. The inbreeding coefficients relating every pair of populations in {*A*, *B*, *C*} satisfy

$$(1 - f_C^A) = (1 - f_C^B)(1 - f_B^A). \tag{5}$$

A similar form applies for individual inbreeding and kinship coefficients given relative to populations *A* and *B*, respectively,

$$(1 - f_j^A) = (1 - f_j^B)(1 - f_B^A), \qquad (1 - \varphi_{jk}^A) = (1 - \varphi_{jk}^B)(1 - f_B^A),$$

which generalizes Eq. (2). All of these cases follow since the absence of IBD of *C* (or *j*, or *j, k*) relative to *A* requires independent absence of IBD at two levels: of *C* (or *j*, or *j, k*) relative to *B*, and of *B* relative to *A*. All of the above equations can be extended to a multi-level hierarchy just like Wright did for Eq. (1), by iterating at each level [6].
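The change of reference population amounts to multiplying probabilities of absence of IBD across the two levels. A minimal sketch follows, with hypothetical function names and values; the same functions apply to inbreeding and kinship coefficients, and the second function simply inverts the first.

```python
def to_ancestral_reference(coef_rel_b: float, f_b_rel_a: float) -> float:
    """Re-express an IBD probability given relative to B as relative to the more ancestral A."""
    return 1.0 - (1.0 - coef_rel_b) * (1.0 - f_b_rel_a)


def to_recent_reference(coef_rel_a: float, f_b_rel_a: float) -> float:
    """Inverse transformation: from the ancestral reference A to the more recent B."""
    if f_b_rel_a >= 1.0:
        raise ValueError("B is fully fixed relative to A; the transformation is undefined")
    return 1.0 - (1.0 - coef_rel_a) / (1.0 - f_b_rel_a)


# Hypothetical values: kinship of a pair relative to B, and inbreeding of B relative to A.
phi_rel_b = 0.25
f_b_rel_a = 0.20
phi_rel_a = to_ancestral_reference(phi_rel_b, f_b_rel_a)   # 1 - 0.75 * 0.80
phi_round_trip = to_recent_reference(phi_rel_a, f_b_rel_a)
```

Moving to a more ancestral reference can only increase a coefficient, consistent with the earlier statement that these values grow as *T* is taken to be an earlier population.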

### 3.5 Genotype moments under the kinship model

In the kinship model [5, 6, 67, 77], genotypes *x*_{ij} are random variables with first and second moments given by

$$\mathrm{E}\left[ x_{ij} \mid T \right] = 2 p_i^T, \tag{6}$$

$$\mathrm{Var}\left( x_{ij} \mid T \right) = 2 p_i^T (1 - p_i^T)(1 + f_j^T), \tag{7}$$

$$\mathrm{Cov}\left( x_{ij}, x_{ik} \mid T \right) = 4 p_i^T (1 - p_i^T) \varphi_{jk}^T. \tag{8}$$

Eq. (6) is a consequence of assuming no selection or new mutations, leaving random drift as the only evolutionary force acting on genotypes [67]. Eq. (7) shows how inbreeding modulates the genotype variance: an outbred individual relative to *T* has the Binomial variance of $2 p_i^T (1 - p_i^T)$ that corresponds to independently-drawn alleles; a fully inbred individual has a scaled Bernoulli variance of $4 p_i^T (1 - p_i^T)$ that corresponds to maximally correlated alleles [6]. Lastly, Eq. (8) shows how kinship modulates the correlations between individuals: unrelated individuals relative to *T* have uncorrelated genotypes, while $\varphi_{jk}^T = 1$ holds for the extreme of identical and fully inbred twins, which have maximally correlated genotypes [5, 77]. Hence, $f_j^T$ and $\varphi_{jk}^T$ parametrize the frequency of non-independent allele draws within and between individuals. The “self kinship”, arising from comparing Eq. (7) to the *j* = *k* case in Eq. (8), implies $\varphi_{jj}^T = \frac{1}{2}(1 + f_j^T)$, which is a rescaled inbreeding coefficient resulting from comparing an individual with itself or its identical twin.
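The second moments of the kinship model can be tabulated for one locus by scaling a kinship matrix. The sketch below assumes the covariance form 4*p*(1 − *p*) times kinship, with self kinship (1 + *f*)/2 on the diagonal; the function name and the example individuals (one outbred, one fully inbred, mutually unrelated) are hypothetical.

```python
from typing import List


def genotype_covariance(p_anc: float, kinship: List[List[float]]) -> List[List[float]]:
    """Theoretical genotype covariance matrix at one locus under the kinship model."""
    scale = 4.0 * p_anc * (1.0 - p_anc)
    return [[scale * phi for phi in row] for row in kinship]


# Hypothetical pair: individual 0 is outbred (f = 0), individual 1 is fully inbred (f = 1).
inbreeding = [0.0, 1.0]
kinship = [
    [0.5 * (1.0 + inbreeding[0]), 0.0],
    [0.0, 0.5 * (1.0 + inbreeding[1])],
]
p_anc = 0.5
cov = genotype_covariance(p_anc, kinship)
```

The diagonal of `cov` recovers the two extremes of the variance: the Binomial variance 2*p*(1 − *p*) for the outbred individual and the scaled Bernoulli variance 4*p*(1 − *p*) for the fully inbred one.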

## 4 Kinship and the generalized *F*_{ST} in terms of the coalescent

Slatkin (1991) [78] derived an expression for the classical *F*_{ST} (for a subdivided population) in terms of mean coalescence times,

$$F_{ST} = \frac{\bar{t} - \bar{t}_0}{\bar{t}},$$

where $\bar{t}_0$ is the mean coalescence time for alleles at a random locus within a subpopulation *S*, and $\bar{t}$ is the mean coalescence time for alleles at a random locus across subpopulations. Here we generalize this expression to encompass inbreeding and kinship coefficients, as well as the generalized *F*_{ST}.

In all cases that follow, we generalize $\bar{t}$ to denote the mean coalescence time for two alleles at a random locus drawn from the ancestral population *T*; in practice it corresponds to the mean coalescence time of the alleles of the two most distant individuals in the sample. The inbreeding and kinship coefficients are given by

$$f_j^T = \frac{\bar{t} - \bar{t}_j}{\bar{t}}, \qquad \varphi_{jk}^T = \frac{\bar{t} - \bar{t}_{jk}}{\bar{t}},$$

where $\bar{t}_j$ is the mean coalescence time of the two alleles of individual *j* at a random locus, and $\bar{t}_{jk}$ is the mean coalescence time of two alleles drawn at random from each of two individuals *j* and *k* at a random locus (see Supplementary Information, Section S2 for derivations). These mean coalescence times could be estimated as average coalescence times for a large number of neutral loci across the genome. If all individuals in the sample are locally outbred, we obtain the desired expression for the generalized *F*_{ST}:

$$F_{ST} = \frac{\bar{t} - \sum_{j=1}^{n} w_j \bar{t}_j}{\bar{t}}.$$

Therefore, the generalized *F*_{ST} equals the relative difference between the weighted mean coalescence times of the alleles within individuals versus the mean coalescence time between the most distantly-related individuals in the sample.
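The coalescent expressions can be illustrated with a short numerical sketch, assuming each coefficient is the relative difference between the ancestral mean coalescence time and the corresponding within-individual (or between-individual) mean coalescence time. The function name and all times (in arbitrary units) are hypothetical.

```python
def ibd_from_coalescence(t_pair: float, t_anc: float) -> float:
    """IBD probability as a relative reduction in mean coalescence time."""
    if t_anc <= 0:
        raise ValueError("ancestral mean coalescence time must be positive")
    return (t_anc - t_pair) / t_anc


# Hypothetical mean coalescence times, in arbitrary units.
t_anc = 1000.0                        # two alleles drawn from the ancestral population T
t_within = [900.0, 800.0, 700.0]      # the two alleles of each locally-outbred individual
weights = [1.0 / 3] * 3               # uniform weights

inbreeding = [ibd_from_coalescence(t, t_anc) for t in t_within]

# Generalized FST computed from the weighted mean within-individual time...
weighted_time = sum(w * t for w, t in zip(weights, t_within))
fst_from_times = ibd_from_coalescence(weighted_time, t_anc)
# ...and as the weighted mean of the individual inbreeding coefficients.
fst_from_inbreeding = sum(w * f for w, f in zip(weights, inbreeding))
```

Both routes give the same value because the expression is linear in the within-individual coalescence time.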

## 5 The coancestry model for individual allele frequencies

*F*_{ST} and its estimators are most often studied in terms of subpopulation allele frequencies [22–24, 57]. Here we introduce a coancestry model for individuals, which is based on *individual-specific allele frequencies* (IAFs) [55, 56] that accommodate arbitrary population-level relationships between individuals. Some authors use the terms “coancestry” and “kinship” interchangeably [23, 70, 71]; in our framework, kinship coefficients are general IBD probabilities (following [68]), and we reserve coancestry coefficients for the IAF covariance parameters (in analogy to the work of [23]). This coancestry model is the foundation behind the extension of the PSD admixture model we present in Section 6 below, and simplifies the analysis of *F*_{ST} estimator bias in Section 3 of Part II.

In this section we introduce two parameters (see Table 1). First, *π*_{ij} ∈ [0, 1] is the IAF of individual *j* at locus *i*: individual *j* draws each of its two alleles independently, and each is the reference allele with probability *π*_{ij}. Allowing every locus-individual pair to have a potentially unique allele frequency accommodates arbitrary forms of population structure at the level of allele frequencies [56]. Second, *θ*_{jk}^{T} ∈ [0, 1] is the coancestry coefficient of individuals *j* and *k* relative to an ancestral population *T*, which modulates the covariance of *π*_{ij} and *π*_{ik} as shown below.

### 5.1 The coancestry model

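In our coancestry model, the IAFs *π*_{ij} have the following first and second moments,

$$
\begin{aligned}
\mathrm{E}[\pi_{ij} \mid T] &= p_i^T, && \text{(9)} \\
\mathrm{Cov}(\pi_{ij}, \pi_{ik} \mid T) &= p_i^T \left(1 - p_i^T\right) \theta_{jk}^T, && \text{(10)} \\
x_{ij} \mid \pi_{ij} &\sim \mathrm{Binomial}(2, \pi_{ij}), && \text{(11)}
\end{aligned}
$$

where *p*_{i}^{T} is the ancestral reference allele frequency at locus *i* and *x*_{ij} is the genotype of individual *j* at locus *i*.
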
Eq. (9) implies that random drift is the only force acting on the IAFs, and is analogous to Eq. (6) in the kinship model. Eq. (10) is analogous to Eqs. (7) and (8) in the kinship model, with individual coancestry coefficients playing the role of the kinship and inbreeding coefficients (for *j* = *k*), a relationship elaborated in the next section. Lastly, Eq. (11) draws the two alleles of a genotype independently from the IAF, which models locally-outbred and locally-unrelated individuals [23]. Hence, the coancestry model excludes local relationships, so it is more restrictive than the kinship model.

Our coancestry model between individuals is closely related to previous models between subpopulations [23, 57]. However, previous models allowed *θ*_{jk}^{T} < 0 [23]. We require that *θ*_{jk}^{T} ≥ 0 for two reasons: (1) covariance is non-negative in latent structure models [79], such as population structure, and (2) it is necessary in order to relate *θ*_{jk}^{T} to IBD probabilities as shown next.

### 5.2 Relationship between coancestry and kinship coefficients

Here we show, by matching moments, that the coancestry coefficients for IAFs, *θ*_{jk}^{T}, defined above can be written in terms of the kinship and inbreeding coefficients utilized in our more general model. Conditional on the IAFs, genotypes in the coancestry model have a Binomial distribution, so

$$
\mathrm{E}[x_{ij} \mid \pi_{ij}] = 2 \pi_{ij},
\qquad
\mathrm{Var}(x_{ij} \mid \pi_{ij}) = 2 \pi_{ij} \left(1 - \pi_{ij}\right).
$$

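We calculate total moments by marginalizing the IAFs. The total expectation is

$$
\mathrm{E}[x_{ij} \mid T] = \mathrm{E}\big[\mathrm{E}[x_{ij} \mid \pi_{ij}] \mid T\big] = 2 p_i^T,
$$

which agrees with Eq. (6) of the kinship model. The total covariance is calculated using

$$
\mathrm{Cov}(x_{ij}, x_{ik} \mid T) = \mathrm{E}\big[\mathrm{Cov}(x_{ij}, x_{ik} \mid \pi_{ij}, \pi_{ik}) \mid T\big] + \mathrm{Cov}\big(\mathrm{E}[x_{ij} \mid \pi_{ij}], \mathrm{E}[x_{ik} \mid \pi_{ik}] \mid T\big).
$$
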
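The first term is zero for *j* ≠ *k*, and for *j* = *k* it is

$$
\mathrm{E}\big[\mathrm{Var}(x_{ij} \mid \pi_{ij}) \mid T\big] = 2\, \mathrm{E}\big[\pi_{ij} \left(1 - \pi_{ij}\right) \mid T\big] = 2 p_i^T \left(1 - p_i^T\right) \left(1 - \theta_{jj}^T\right).
$$
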
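The second term equals 4 Cov (*π*_{ij}, *π*_{ik}|*T*) for all (*j*, *k*) cases, which is given by Eq. (10). All together,

$$
\begin{aligned}
\mathrm{Var}(x_{ij} \mid T) &= 2 p_i^T \left(1 - p_i^T\right) \left(1 + \theta_{jj}^T\right), \\
\mathrm{Cov}(x_{ij}, x_{ik} \mid T) &= 4 p_i^T \left(1 - p_i^T\right) \theta_{jk}^T \quad \text{for } j \neq k.
\end{aligned}
$$
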
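Comparing the above to Eqs. (7) and (8), we find that

$$
\theta_{jk}^T =
\begin{cases}
f_j^T & \text{if } j = k, \\
\varphi_{jk}^T & \text{if } j \neq k.
\end{cases}
$$
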
Therefore, our coancestry coefficients are equal to kinship coefficients, except that self-coancestries are equal to inbreeding coefficients.
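
The moment-matching argument can be checked exactly on a small example. The sketch below assumes a hypothetical two-point joint distribution for the IAFs of two individuals at one locus (the probabilities and helper names are illustrative and not part of the model), enumerates all genotype pairs with exact rational arithmetic, and compares the genotype variance and covariance with the values predicted from the coancestry coefficients.

```python
from fractions import Fraction as Fr
from itertools import product
from math import comb


def binom2_pmf(x, pi):
    """P(x | pi) for a genotype x ~ Binomial(2, pi)."""
    return comb(2, x) * pi**x * (1 - pi) ** (2 - x)


# Hypothetical joint distribution of (pi_j, pi_k); probabilities sum to one.
iaf_joint = {
    (Fr(1, 4), Fr(1, 4)): Fr(3, 8),
    (Fr(3, 4), Fr(3, 4)): Fr(3, 8),
    (Fr(1, 4), Fr(3, 4)): Fr(1, 8),
    (Fr(3, 4), Fr(1, 4)): Fr(1, 8),
}


def expect(func):
    """Exact expectation of func(x_j, x_k, pi_j, pi_k) under the model."""
    total = Fr(0)
    for (pi_j, pi_k), prob in iaf_joint.items():
        for x_j, x_k in product(range(3), repeat=2):
            weight = prob * binom2_pmf(x_j, pi_j) * binom2_pmf(x_k, pi_k)
            total += weight * func(x_j, x_k, pi_j, pi_k)
    return total


# Coancestry coefficients from the IAF moments (both IAFs share the mean p).
p = expect(lambda xj, xk, pj, pk: pj)
scale = p * (1 - p)
theta_jj = (expect(lambda xj, xk, pj, pk: pj * pj) - p * p) / scale
theta_jk = (expect(lambda xj, xk, pj, pk: pj * pk) - p * p) / scale

# Genotype moments obtained by marginalizing the IAFs.
mean_x = expect(lambda xj, xk, pj, pk: xj)
var_x = expect(lambda xj, xk, pj, pk: xj * xj) - mean_x**2
cov_x = expect(lambda xj, xk, pj, pk: xj * xk) - mean_x**2
```

In this example `var_x` equals `2 * scale * (1 + theta_jj)` and `cov_x` equals `4 * scale * theta_jk`, so the self-coancestry plays the role of an inbreeding coefficient and the off-diagonal coancestry plays the role of a kinship coefficient.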

Since individuals in our IAF coancestry model are locally outbred and locally unrelated, we also have *θ*_{jj}^{T} = *f*_{L_j}^{T} and *θ*_{jk}^{T} = *f*_{L_jk}^{T} for *j* ≠ *k*. Replacing these quantities in Eq. (3), we obtain the generalized *F*_{ST} in terms of coancestry coefficients:

$$
F_{ST} = \sum_{j=1}^{n} w_j \theta_{jj}^T.
$$

## 6 Coancestry and *F*_{ST} in admixture models

The Pritchard-Stephens-Donnelly (PSD) admixture model [58] is a well-established, tractable model of structure that is more complex than the independent subpopulations model. There are several algorithms available to estimate the PSD model parameters [58–60, 64, 80]. This model assumes the existence of several intermediate ancestral subpopulations, from which individuals draw alleles according to their admixture proportions. However, the PSD model was not developed with *F*_{ST} in mind; we will present a modified model that is compatible with our coancestry model. The results presented in this section are applied to evaluate kinship and *F*_{ST} estimators in Section 6 of Part II, where an admixed population without independent subpopulations is simulated and the true kinship and *F*_{ST} are known.

The PSD model is a special case of our coancestry model with the following additional parameters (see Table 1). The number of intermediate subpopulations is denoted by *K*. Let *p*_{i}^{S_u} ∈ [0, 1] be the reference allele frequency at locus *i* in intermediate subpopulation *S*_{u} (*u* ∈ {1,…, *K*}; compare to previous notation in Table 1). Lastly, *q*_{ju} ∈ [0, 1] is the admixture proportion of individual *j* for intermediate subpopulation *S*_{u}. These proportions satisfy ∑_{u=1}^{K} *q*_{ju} = 1 for each *j*.

### 6.1 The PSD model with Balding-Nichols allele frequencies

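The original algorithm for fitting the PSD model [58] utilizes prior distributions for intermediate subpopulation allele frequencies and admixture proportions according to

$$
\begin{aligned}
p_i^{S_u} &\sim \mathrm{Uniform}(0, 1), && \text{(15)} \\
\mathbf{q}_j = (q_{j1}, \ldots, q_{jK}) &\sim \mathrm{Dirichlet}(\alpha, \ldots, \alpha).
\end{aligned}
$$
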
Subsequent work has shown [56, 60] that the PSD model of [58] is then equivalent to forming IAFs

$$
\pi_{ij} = \sum_{u=1}^{K} q_{ju}\, p_i^{S_u},
$$

where genotypes are then drawn independently according to *x*_{ij}|*π*_{ij} ~ Binomial(2, *π*_{ij}).
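
The generative process just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: IAFs are taken to be the admixture-weighted averages of the intermediate subpopulation allele frequencies, allele frequencies are drawn uniformly on (0, 1), and admixture proportions come from a symmetric Dirichlet with a concentration parameter `alpha`. The function names, dimensions, and the value of `alpha` are hypothetical.

```python
import random


def draw_admixture_proportions(rng, n_ind, k, alpha):
    """Draw each row of Q from a symmetric Dirichlet via normalized Gammas."""
    Q = []
    for _ in range(n_ind):
        g = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
        total = sum(g)
        Q.append([v / total for v in g])
    return Q


def form_iafs(P, Q):
    """IAF matrix with entries pi[i][j] = sum_u Q[j][u] * P[i][u]."""
    return [
        [sum(q_ju * p_iu for q_ju, p_iu in zip(q_j, p_i)) for q_j in Q]
        for p_i in P
    ]


def draw_genotypes(rng, iafs):
    """Draw x_ij ~ Binomial(2, pi_ij) as the sum of two Bernoulli draws."""
    return [
        [(rng.random() < pi) + (rng.random() < pi) for pi in row]
        for row in iafs
    ]


rng = random.Random(1)            # fixed seed for reproducibility
m_loci, n_ind, K = 50, 4, 3       # hypothetical dimensions

# Intermediate subpopulation allele frequencies, one row per locus.
P = [[rng.random() for _ in range(K)] for _ in range(m_loci)]
Q = draw_admixture_proportions(rng, n_ind, K, alpha=1.0)
iafs = form_iafs(P, Q)
X = draw_genotypes(rng, iafs)     # genotypes coded as 0, 1 or 2
```

Each IAF is a convex combination of subpopulation allele frequencies, so it stays within [0, 1] whenever the rows of `Q` sum to one.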

Here we consider an extension of this model, which we call the “BN-PSD” model, by replacing Eq. (15) with the Balding-Nichols (BN) distribution [7] to generate the allele frequencies for the intermediate subpopulations from their MRCA population *T*. The BN-PSD model establishes an independent subpopulations structure of the intermediate subpopulations *S*_{u} as illustrated in Fig. 4. This combined model has been used to simulate structured genotypes [55, 62, 63], and is the target of some inference algorithms [61, 64]. The BN distribution is the following reparametrized Beta distribution,

$$
\mathrm{BN}(p, F) = \mathrm{Beta}\!\left(\frac{1 - F}{F}\, p,\ \frac{1 - F}{F}\, (1 - p)\right),
$$

where *p* is the ancestral allele frequency and *F* is the inbreeding coefficient [7]. The resulting allele frequencies *p*^{*} fit into our coancestry model, since E[*p*^{*}] = *p* and Var(*p*^{*}) = *p*(1 − *p*)*F*.
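
The stated mean and variance can be checked numerically. The sketch below assumes the usual Beta parameterization of the BN distribution, with shape parameters *p*(1 − *F*)/*F* and (1 − *p*)(1 − *F*)/*F*, and uses only the Python standard library. The function names, the seed, and the chosen values of *p* and *F* are hypothetical.

```python
import random


def bn_shape_parameters(p, F):
    """Beta shape parameters of BN(p, F); requires 0 < p < 1 and 0 < F < 1."""
    if not (0 < p < 1 and 0 < F < 1):
        raise ValueError("p and F must lie strictly between 0 and 1")
    nu = (1 - F) / F
    return nu * p, nu * (1 - p)


def draw_bn(rng, p, F):
    """Draw one allele frequency from the Balding-Nichols distribution."""
    a, b = bn_shape_parameters(p, F)
    return rng.betavariate(a, b)


rng = random.Random(42)           # fixed seed for reproducibility
p_anc, F = 0.3, 0.1               # hypothetical ancestral frequency and F
draws = [draw_bn(rng, p_anc, F) for _ in range(100_000)]

mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / len(draws)

# Theoretical moments for comparison.
expected_mean = p_anc
expected_var = p_anc * (1 - p_anc) * F
```

With this many draws, `mean` and `var` should fall close to `expected_mean` and `expected_var`, up to Monte Carlo error.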

In BN-PSD, the allele frequencies at each locus *i* for intermediate subpopulation *S*_{u} are drawn independently from

$$
p_i^{S_u} \mid T \sim \mathrm{BN}\!\left(p_i^T, f_{S_u}^T\right),
$$

where *p*_{i}^{T} is the ancestral allele frequency and *f*_{S_u}^{T} is the inbreeding coefficient of *S*_{u} relative to *T* (compare to notation in Table 1).

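We calculate the coancestry parameters of this model by matching moments conditional on the admixture proportions **Q** = (*q*_{ju}). We calculate the expectation as

$$
\mathrm{E}[\pi_{ij} \mid T] = \sum_{u=1}^{K} q_{ju}\, \mathrm{E}\big[p_i^{S_u} \mid T\big] = p_i^T \sum_{u=1}^{K} q_{ju} = p_i^T,
$$

and the IAF covariance is

$$
\mathrm{Cov}(\pi_{ij}, \pi_{ik} \mid T) = \sum_{u=1}^{K} q_{ju} q_{ku}\, \mathrm{Var}\big(p_i^{S_u} \mid T\big) = p_i^T \left(1 - p_i^T\right) \sum_{u=1}^{K} q_{ju} q_{ku} f_{S_u}^T,
$$

where the cross terms between different intermediate subpopulations vanish because their allele frequencies are drawn independently.
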
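By matching these to Eq. (10), we arrive at coancestry coefficients and *F*_{ST} of

$$
\theta_{jk}^T = \sum_{u=1}^{K} q_{ju} q_{ku} f_{S_u}^T,
\qquad
F_{ST} = \sum_{j=1}^{n} w_j \sum_{u=1}^{K} q_{ju}^2 f_{S_u}^T.
$$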
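
To make this result concrete, the following Python sketch evaluates the coancestry coefficients and *F*_{ST} for a toy example, assuming that each coancestry coefficient is the sum over intermediate subpopulations of the product of the two individuals' admixture proportions and the subpopulation inbreeding coefficient. The admixture proportions, inbreeding coefficients, uniform weights, and function names are all hypothetical.

```python
def bnpsd_coancestry(Q, f_sub):
    """Coancestry theta[j][k] = sum_u Q[j][u] * Q[k][u] * f_sub[u]."""
    n = len(Q)
    return [
        [
            sum(Q[j][u] * Q[k][u] * f_sub[u] for u in range(len(f_sub)))
            for k in range(n)
        ]
        for j in range(n)
    ]


def fst_from_coancestry(theta, weights):
    """Generalized FST as the weighted mean of the self-coancestries."""
    return sum(w * theta[j][j] for j, w in enumerate(weights))


# Hypothetical example with K = 2 intermediate subpopulations.
f_sub = [0.1, 0.3]                # inbreeding coefficients of S_1 and S_2
Q = [
    [1.0, 0.0],                   # unadmixed individual from S_1
    [0.5, 0.5],                   # evenly admixed individual
    [0.0, 1.0],                   # unadmixed individual from S_2
]
weights = [1 / 3, 1 / 3, 1 / 3]   # uniform weights (an assumption)

theta = bnpsd_coancestry(Q, f_sub)
fst = fst_from_coancestry(theta, weights)
```

In this example the evenly admixed individual has a self-coancestry of 0.1, lower than the average of 0.2 for its two source subpopulations, because alleles drawn from different independent subpopulations are not IBD relative to *T*.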

### 6.2 The BN-PSD model with full coancestry

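The BN-PSD model contains a restriction that the *K* intermediate subpopulations are independent. Suppose instead that the intermediate subpopulation allele frequencies satisfy our more general coancestry model:

$$
\mathrm{E}\big[p_i^{S_u} \mid T\big] = p_i^T,
\qquad
\mathrm{Cov}\big(p_i^{S_u}, p_i^{S_v} \mid T\big) = p_i^T \left(1 - p_i^T\right) \theta_{uv}^T,
$$

where *θ*_{uv}^{T} is the coancestry of the intermediate subpopulations *S*_{u} and *S*_{v}. Note that the previous BN-PSD model satisfies *θ*_{uu}^{T} = *f*_{S_u}^{T} and *θ*_{uv}^{T} = 0 for *u* ≠ *v*. Repeating our calculations assuming our full coancestry setting, individual coancestry coefficients and *F*_{ST} are given by

$$
\begin{aligned}
\theta_{jk}^T &= \sum_{u=1}^{K} \sum_{v=1}^{K} q_{ju} q_{kv}\, \theta_{uv}^T, && \text{(18)} \\
F_{ST} &= \sum_{j=1}^{n} w_j \sum_{u=1}^{K} \sum_{v=1}^{K} q_{ju} q_{jv}\, \theta_{uv}^T. && \text{(19)}
\end{aligned}
$$
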
Therefore, all coancestry coefficients of the intermediate subpopulations influence the individual coancestry coefficients and the overall *F*_{ST}. The form for *θ*_{jk}^{T} above has a simple probabilistic interpretation: the probability of IBD at random loci between individuals *j* and *k* corresponds to the sum, over each pair of subpopulations *u* and *v*, of the probability of the pairing (*q*_{ju}*q*_{kv}) times the probability of IBD between these subpopulations (*θ*_{uv}^{T}). Note that Eq. (18) was derived independently for a related model [81], but the value of *F*_{ST} for a set of admixed individuals, which we provide in Eq. (19), had not been described before to the best of our knowledge.
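
The double sum described above amounts to a quadratic form in the admixture proportions, which the following Python sketch evaluates for a toy example. The subpopulation coancestry values, admixture proportions, uniform weights, and function names are hypothetical. The example also checks that setting the off-diagonal subpopulation coancestries to zero lowers the self-coancestry of an admixed individual, as expected when the intermediate subpopulations are independent.

```python
def individual_coancestry(Q, theta_sub):
    """theta[j][k] = sum_u sum_v Q[j][u] * Q[k][v] * theta_sub[u][v]."""
    n, K = len(Q), len(theta_sub)
    return [
        [
            sum(
                Q[j][u] * Q[k][v] * theta_sub[u][v]
                for u in range(K)
                for v in range(K)
            )
            for k in range(n)
        ]
        for j in range(n)
    ]


def fst_from_coancestry(theta, weights):
    """Generalized FST as the weighted mean of the self-coancestries."""
    return sum(w * theta[j][j] for j, w in enumerate(weights))


# Hypothetical coancestry matrix of K = 2 correlated subpopulations.
theta_sub = [
    [0.10, 0.05],
    [0.05, 0.30],
]
Q = [
    [1.0, 0.0],                   # unadmixed individual from S_1
    [0.5, 0.5],                   # evenly admixed individual
    [0.0, 1.0],                   # unadmixed individual from S_2
]
weights = [1 / 3, 1 / 3, 1 / 3]   # uniform weights (an assumption)

theta = individual_coancestry(Q, theta_sub)
fst = fst_from_coancestry(theta, weights)

# Independent-subpopulations special case: keep only the diagonal.
theta_sub_indep = [[0.10, 0.0], [0.0, 0.30]]
theta_indep = individual_coancestry(Q, theta_sub_indep)
```

Here the admixed individual has a self-coancestry of 0.125 under the correlated subpopulations and 0.1 under the independent special case, so shared ancestry between intermediate subpopulations raises both the individual coancestries and the overall *F*_{ST}.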

## 7 Discussion

We presented a generalized *F*_{ST} definition corresponding to a weighted mean of individual-specific inbreeding coefficients. Compared to previous *F*_{ST} definitions, ours is applicable to arbitrary population structures, and in particular does not require the existence of non-overlapping subpopulations.

We considered two closely related population structure models with individual-level resolution: the kinship model for genotypes, and our new coancestry model for IAFs (individual-specific allele frequencies). The kinship model is the most general, applicable to the genotypes in arbitrary sets of individuals. Our IAF model requires a local form of Hardy-Weinberg equilibrium, and it does not model locally-related or locally-inbred individuals. Nevertheless, IAFs arise in many applications, including admixture models [59], estimation of local kinship [55], genome-wide association studies [82], and logistic factor analysis [56]. We proved that kinship coefficients, which control genotype covariance, also control IAF covariance under our coancestry model.

We also calculated *F*_{ST} for admixture models. To achieve this, we framed the PSD (Pritchard-Stephens-Donnelly) admixture model as a special case of our IAF coancestry model, and studied extensions where the intermediate subpopulations are more structured. *F*_{ST} was previously studied in an admixture model under Nei’s *F*_{ST} definition for one locus, where *F*_{ST} in the admixed population is given by a ratio involving admixture proportions and intermediate subpopulation allele frequencies [52]. On the other hand, our *F*_{ST} is an IBD probability shared by all loci and independent of allele frequencies. Under our framework, the *F*_{ST} of an admixed individual is a sum of products, which is quadratic in the admixture proportions and linear in the coancestry coefficients of the intermediate subpopulations. In the future, inference algorithms for our admixture model with fully-correlated intermediate subpopulations could yield improved results, including coancestry and *F*_{ST} estimates.

Our probabilistic model reconnects *F*_{ST} [21, 23, 24] to inbreeding and kinship coefficients [68, 70, 83, 84], all quantities of great interest in population genetics, but which are currently studied in isolation. The main reason for this isolation is that *F*_{ST} estimation assumes the independent subpopulations model, in which kinship coefficients are uninteresting. However, the study of the generalized *F*_{ST} in arbitrary population structures requires the consideration of arbitrary kinship coefficients [68]. Our work lays the foundation necessary to study estimation of the generalized *F*_{ST}, which is the focus of our next publication in this series (Part II).

## Acknowledgments

This research was supported in part by NIH grant R01 HG006448.

## References

- [1].
- [2].
- [3].
- [4].
- [5].
- [6].
- [7].
- [8].
- [9].
- [10].
- [11].
- [12].
- [13].
- [14].
- [15].
- [16].
- [17].
- [18].
- [19].
- [20].
- [21].
- [22].
- [23].
- [24].
- [25].
- [26].
- [27].
- [28].
- [29].
- [30].
- [31].
- [32].
- [33].
- [34].
- [35].
- [36].
- [37].
- [38].
- [39].
- [40].
- [41].
- [42].
- [43].
- [44].
- [45].
- [46].
- [47].
- [48].
- [49].
- [50].
- [51].
- [52].
- [53].
- [54].
- [55].
- [56].
- [57].
- [58].
- [59].
- [60].
- [61].
- [62].
- [63].
- [64].
- [65].
- [66].
- [67].
- [68].
- [69].
- [70].
- [71].
- [72].
- [73].
- [74].
- [75].
- [76].
- [77].
- [78].
- [79].
- [80].
- [81].
- [82].
- [83].
- [84].
- [85].
- [86].
- [87].
- [88].
- [89].
- [90].
- [91].
- [92].
- [93].
- [94].
- [95].
- [96].
- [97].
- [98].
- [99].
- [100].
- [101].
- [102].
- [103].
- [104].
- [105].
- [106].
- [107].
- [108].