Mathematical constraints on FST: multiallelic markers in arbitrarily many populations
=====================================================================================

* Nicolas Alcala
* Noah A. Rosenberg

## Abstract

Interpretations of values of the *F**ST* measure of genetic differentiation rely on an understanding of its mathematical constraints. Previously, it has been shown that *F**ST* values computed from a biallelic locus in a set of multiple populations and *F**ST* values computed from a multiallelic locus in a pair of populations are mathematically constrained by the frequency of the allele that is most frequent across populations. We report here the mathematical constraint on *F**ST* given the frequency *M* of the most frequent allele at a multiallelic locus in a set of multiple populations, providing the most general description to date of mathematical constraints on *F**ST* in terms of *M*. Using coalescent simulations of an island model of migration with an infinitely-many-alleles mutation model, we argue that the joint distribution of *F**ST* and *M* helps in disentangling the separate influences of mutation and migration on *F**ST*. Finally, we show that our results explain puzzling patterns of microsatellite differentiation, such as the lower *F**ST* values in interspecific comparisons between humans and chimpanzees than in the intraspecific comparison of chimpanzee populations. We discuss the implications of our results for the use of *F**ST*.

**Subject Areas** statistical genetics

Keywords

* allele frequency
* chimpanzee
* genetic differentiation
* migration
* population structure

## 1. Introduction

Multiallelic loci such as microsatellites and haplotype assignments are used to study genetic differentiation in a variety of fields, ranging from ecology and conservation genetics to anthropology and human genomics.
Genetic differentiation is often measured for multiallelic loci using the multiallelic extension of Wright’s fixation index *F**ST* [1]: ![Formula][1] For a polymorphic multiallelic locus with *I* distinct alleles in a set of *K* subpopulations, denoting by *p**k,i* the frequency of allele *i* in subpopulation ![Graphic][2] and ![Graphic][3]. *F**ST* values are known to be smaller for multiallelic than for biallelic loci [2]. One reason invoked to explain this difference is that within-subpopulation heterozygosity *H**S* mathematically constrains the maximal value of *F**ST* to be below 1, and the constraint is stronger when *H**S* is high. This phenomenon was noticed concurrently in simulation-based, empirical, and theoretical studies [3, 4, 5, 6, 7], and the mathematical constraints describing the dependence were subsequently clarified [8, 9]. Studies have found that the maximal value of *F**ST* can be viewed as constrained not only by functions of the within-subpopulation allele frequency distribution such as *H**S*, but alternatively by aspects of the global allele frequency distribution across subpopulations. For a biallelic locus in *K* = 2 subpopulations, Maruki *et al*. [10] showed that the maximal *F**ST* as a function of the frequency *M* of the most frequent allele decreases as *M* increases from ![Graphic][4] to 1 (see also [11]). Generalizing the biallelic case to arbitrarily many alleles, Jakobsson *et al*. [12] showed that for multiallelic loci with an unspecified number of distinct alleles, the maximal *F**ST* increases from 0 to 1 as a function of *M* if ![Graphic][5], and decreases from 1 to 0 for ![Graphic][6] in the manner reported by Maruki *et al*. [10] for biallelic loci. Edge and Rosenberg [13] generalized these results to the case of a fixed finite number of alleles, showing that the maximal *F**ST* differs slightly from the unspecified case when the fixed number of distinct alleles is an odd number. 
Generalizing the simplest case of *K* = *I* = 2 in a different direction, Alcala and Rosenberg [14] considered biallelic loci in the case of arbitrarily many subpopulations *K*. We showed that the maximal value of *F**ST* displays a peculiar behavior as a function of *M*: the upper bound has a maximum of 1 if and only if ![Graphic][7], for integers *k* with ![Graphic][8]. The constraints on the maximal value of *F**ST* dissipate as *K* tends to infinity, even though for any fixed *K*, there always exists a value of *M* for which ![Graphic][9]. Relating *F**ST* to its maximum as a function of *M* helps explain surprising phenomena that arise during population-genetic data analysis. For example, Jakobsson *et al*. [12] showed that stronger constraints on *F**ST* could explain the low *F**ST* values seen in pairs of African human populations. They also found that such constraints could explain the lower *F**ST* values seen in high-diversity multiallelic loci compared to lower-diversity loci—microsatellites compared to single-nucleotide polymorphisms. Alcala and Rosenberg [14] showed that constraints on the maximal *F**ST* could explain the lower *F**ST* values between human populations seen when computing *F**ST* pairwise rather than from all populations simultaneously. In this study, we characterize the relationship between *F**ST* and the frequency *M* of the most frequent allele, for a *multiallelic* locus and an arbitrary specified value of the number of subpopulations *K*. We derive the mathematical upper bound on *F**ST* in terms of *M*, extending the biallelic result of Alcala and Rosenberg [14] to the multiallelic case, and providing the most comprehensive description of the mathematical constraints on *F**ST* in terms of *M* to date (Table 1). 
To assist in interpreting the new bound, we simulate the joint distribution of *F**ST* and *M* in the island migration model, describing its properties as a function of the number of subpopulations, the migration rate, and the mutation rate. The *K*-subpopulation upper bound on *F**ST* in terms of *M* facilitates an explanation of counterintuitive aspects of intra- and inter-species genetic differentiation. We discuss the importance of the results for applications of *F**ST* more generally.

[Table 1](http://biorxiv.org/content/early/2021/07/25/2021.07.23.453474/T1) Table 1 Studies describing the mathematical constraints on *F**ST*.

## 2. Model

Our goal is to derive the range of values *F**ST* can take (the lower and upper bounds on *F**ST*) as a function of the frequency *M* of the most frequent allele for a multiallelic locus, when the number of subpopulations *K* is a fixed finite value greater than or equal to 2. We follow previous studies [12, 13, 14, 15] in describing notation and constructing the scenario.

We consider a polymorphic locus with an unspecified number of distinct alleles, in a setting with *K* subpopulations contributing equally to the total population. We denote the frequency of allele *i* in subpopulation *k* by *p**k,i*, with sum ![Graphic][10] across subpopulations. Each allele frequency *p**k,i* lies in [0, 1]. Within subpopulations, allele frequencies sum to 1: for each ![Graphic][11]. Hence, *σ**i* lies in [0, *K*], and ![Graphic][12]. We number alleles from most to least frequent, so *σ**i* ≥ *σ**j* for *i* ≤ *j*. Because by assumption the locus is polymorphic, *σ**i* < *K* for each *i*. Alleles 1 and 2 have nonzero frequency in at least one subpopulation, so that *σ*1 > 0 and *σ*2 > 0. We denote the mean frequency of the most frequent allele across subpopulations by *M* = *σ*1/*K*. We then have 0 < *M* < 1.
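The notation above can be illustrated with a minimal Python sketch, which is not code from the study. It assumes the usual equal-weight definitions, *H**S* = 1 − (1/*K*) Σ*k* Σ*i* *p**k,i*² and *H**T* = 1 − Σ*i* (*σ**i*/*K*)², with *F**ST* = (*H**T* − *H**S*)/*H**T*. The function name `fst_from_matrix` and the example matrix are hypothetical.

```python
from typing import List, Tuple


def fst_from_matrix(p: List[List[float]]) -> Tuple[float, float]:
    """Return (M, F_ST) for a K x I matrix of allele frequencies.

    Rows are subpopulations and columns are alleles; each row must sum to 1.
    Subpopulations are weighted equally, as in the model section.
    """
    K = len(p)
    I = len(p[0])
    for row in p:
        if len(row) != I or abs(sum(row) - 1.0) > 1e-9:
            raise ValueError("each row needs I entries that sum to 1")
    # sigma_i: frequency of allele i summed across the K subpopulations
    sigma = [sum(p[k][i] for k in range(K)) for i in range(I)]
    M = max(sigma) / K  # mean frequency of the most frequent allele
    h_s = 1.0 - sum(x * x for row in p for x in row) / K
    h_t = 1.0 - sum((s / K) ** 2 for s in sigma)
    if h_t <= 0.0:
        raise ValueError("monomorphic locus: F_ST is undefined")
    return M, (h_t - h_s) / h_t


# Illustrative (made-up) frequencies for K = 2 subpopulations and I = 3 alleles.
example = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]]
M, fst = fst_from_matrix(example)
print(f"M = {M:.3f}, F_ST = {fst:.3f}")
```

Passing a matrix whose rows are identical returns *F**ST* = 0, the case in which all subpopulations share the same allele frequency vector.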
We treat the allele frequencies *p**k,i* and associated quantities *M* and *σ**i* as parametric values, and not as estimates computed from data. Eq. 1 expresses *F**ST* as a ratio involving within-subpopulation heterozygosity, *H**S*, and total heterozygosity, *H**T*, with 0 ≤ *H**S* < 1 and 0 ≤ *H**T* < 1. Because we assume the locus is polymorphic, *H**T* > 0. We write eq. 1 in terms of allele frequencies, permitting the number of distinct alleles to be arbitrarily large:

![Formula][13]

Hence, our goal is, for fixed *σ*1 = *KM*, 0 < *σ*1 < *K*, to identify the matrices (*p**k,i*)*K*×∞, with *p**k,i* in [0, 1], ![Graphic][14] and ![Graphic][15], that minimize and maximize *F**ST* in eq. 2. Note that we adopt the interpretation of *F**ST* as a “statistic” that describes a mathematical function of allele frequencies rather than as a “parameter” that describes coancestry of individuals in a population [e.g. 16]. See Alcala and Rosenberg [14] for a discussion of interpretations of *F**ST* when studying its mathematical properties.

## 3. Mathematical constraints

### (a) Lower bound of *F**ST*

Bounds on *F**ST* in terms of the frequency of the most frequent allele can be written with respect to *M* or *σ*1, noting that *M* ranges in (0, 1) and *σ*1 ranges in (0, *K*).

For the lower bound, from eq. 2, for any choice of *σ*1, *F**ST* = 0 can be achieved. Consider (*σ*1, *σ*2, …) with *σ**i* in [0, *K*) for each *i*, *σ**i* ≥ *σ**j* for *i* ≤ *j*, ![Graphic][16], and *σ*1 > 0 and *σ*2 > 0. We set *p**k,i* = *σ**i* /*K* for all subpopulations *k* and alleles *i*; this choice yields *F**ST* = 0.

*F**ST* = 0 implies that the numerator of eq. 2, *H**T* − *H**S*, is zero. This numerator can be written ![Graphic][17]. The Cauchy-Schwarz inequality guarantees that ![Graphic][18], with equality if and only if *p*1,*i* = *p*2,*i* = … = *p**K,i* = *σ**i* /*K*. Applying the Cauchy-Schwarz inequality to all alleles *i*, the numerator of eq.
2 is zero only if for all *i*, (*p*1,*i*, *p*2,*i*, …, *p**K,i*) = (*σ**i* /*K, σ**i* /*K*, …, *σ**i* /*K*). Thus, we can conclude that the allele frequency matrices in which all *K* subpopulations have identical allele frequency vectors are the only matrices for which *F**ST* = 0. The lower bound on *F**ST* is equal to 0 irrespective of *M* or *σ*1, for any value of the number of subpopulations *K*. ### (b) Upper bound of *F**ST* To derive the upper bound on *F**ST* in terms of *M* = *σ*1/*K*, we must maximize *F**ST* in eq. 2, assuming that *σ*1 and *K* are constant. The computations are performed in the Appendix; we write the main result as a function of *σ*1, noting that it can be converted into a function of *M* by replacing *σ*1 with *KM*. We first show in Theorem 1 from the Appendix that the maximal *F**ST* requires that (i) the sum of squared allele frequencies across alleles and subpopulations, ![Graphic][19], is maximal, and (ii) alleles *i* = 2, 3, … are each present in at most one subpopulation, but allele 1 might be present in more than one subpopulation. We then separately maximize *F**ST* as a function of *σ*1 for *σ*1 in (0, 1] and in (1, *K*). The two cases differ in that allele 1 appears in a single subpopulation in the former case, and it must appear in at least two subpopulations in the latter. The maximal *F**ST* as a function of *σ*1 for *σ*1 in (0, *K*) is ![Formula][20] where ![Graphic][21]. Here, ⌈*x*⌉ denotes the smallest integer greater than or equal to *x*, ⌊*x*⌋ denotes the greatest integer less than or equal to *x*, and {*x*} = *x* − ⌊*x*⌋ denotes the fractional part of *x*. Note that for *σ*1 = 1, the maximal values from eq. 3 if *σ*1 = 1 and the limit as *σ*1 tends to 1 from above both equal 1, so that the maximal value as a function of *σ*1 is continuous. The proof appears in the Appendix. 
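Because the formula image for eq. 3 is not reproduced in this text, the following Python sketch uses a closed form reconstructed by evaluating *F**ST* at the maximizing allele frequency configurations reported from the Appendix. It should be read as an assumption rather than a transcription of eq. 3. *J* is taken to be ⌈1/*σ*1⌉, and the function name `fst_upper_bound` is hypothetical.

```python
import math


def fst_upper_bound(sigma1: float, K: int) -> float:
    """Reconstructed upper bound on F_ST given sigma1 = K * M (an assumption).

    Case 0 < sigma1 <= 1: all alleles private, J = ceil(1 / sigma1) per subpopulation.
    Case 1 < sigma1 < K: only the most frequent allele is shared.
    """
    if K < 2 or not 0.0 < sigma1 < K:
        raise ValueError("need K >= 2 and 0 < sigma1 < K")
    if sigma1 <= 1.0:
        J = math.ceil(1.0 / sigma1)
        # Sum of squared frequencies within one subpopulation at the maximum.
        s = 1.0 - sigma1 * (J - 1) * (2.0 - J * sigma1)
        return (K - 1) * s / (K - s)
    fl = math.floor(sigma1)   # integer part of sigma1
    fr = sigma1 - fl          # fractional part {sigma1}
    num = K * (K - 1) - sigma1 ** 2 + fl - 2 * (K - 1) * fr + (2 * K - 1) * fr ** 2
    den = K * (K - 1) - sigma1 ** 2 - fl + 2 * sigma1 - fr ** 2
    return num / den


# The bound equals 1 when sigma1 is an integer and falls below 1 in between.
for s1 in (0.54, 1.0, 2.5, 3.0):
    print(s1, round(fst_upper_bound(s1, K=6), 4))
```

Under this reconstruction, the two cases agree at *σ*1 = 1, where both give 1, and for *K* = 2 and *M* = 0.27 the function returns approximately 0.34.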
From the Appendix, *F**ST* reaches its upper bound for *σ*1 in interval (0, 1] when each allele is present in only a single subpopulation, and when each subpopulation has exactly *J* alleles with a nonzero frequency: *J* − 1 alleles at frequency *σ*1 and one allele at frequency 1 − (*J* − 1)*σ*1 ≤ *σ*1. Because each subpopulation has *J* distinct alleles and no alleles are shared across subpopulations, this upper bound requires that the locus has *KJ* alleles of nonzero frequency.

For *σ*1 in (1, *K*), *F**ST* reaches its maximal value when there are ⌊*σ*1⌋ subpopulations in which the most frequent allele has frequency 1, a single subpopulation in which the most frequent allele has frequency {*σ*1} and a private allele has frequency 1 − {*σ*1}, and *K* − ⌊*σ*1⌋ − 1 subpopulations each with a different private allele at frequency 1. Only the most frequent allele is shared across subpopulations, and at most a single subpopulation displays polymorphism. This maximal value requires that the locus has *K* − ⌊*σ*1⌋ + 1 alleles of nonzero frequency.

### (c) Properties of the upper bound

Figure 1 shows the maximal value of *F**ST* in terms of *M* = *σ*1/*K* for various values of the number of subpopulations, *K*. We describe a number of properties of this upper bound.

![Figure 1](http://biorxiv.org/https://www.biorxiv.org/content/biorxiv/early/2021/07/25/2021.07.23.453474/F1.medium.gif)

[Figure 1](http://biorxiv.org/content/early/2021/07/25/2021.07.23.453474/F1) Figure 1 Bounds on *F**ST* as a function of the frequency of the most frequent allele, *M*, for different numbers of subpopulations *K*. (a) *K* = 2. (b) *K* = 3. (c) *K* = 6. (d) *K* = 40. (e) *K* = 100. The gray region represents the space between the upper and lower bounds on *F**ST*. The dashed line represents the curve that the jagged maximal *F**ST* touches when ![Graphic][22], computed from eq. 4. The upper bound is computed from eq. 3; for each *K*, the lower bound is 0 for all values of *M*.
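The maximizing configurations described in part (b) of this section can be built explicitly and their *F**ST* evaluated directly from the heterozygosity definition. The Python sketch below does this under stated assumptions: *J* = ⌈1/*σ*1⌉, subpopulations are weighted equally, and the function names and allele labels are hypothetical.

```python
import math
from typing import Dict, Hashable, List

Row = Dict[Hashable, float]  # allele label -> frequency within one subpopulation


def maximizing_rows(sigma1: float, K: int) -> List[Row]:
    """Build one allele frequency configuration described as maximizing F_ST."""
    rows: List[Row] = []
    if sigma1 <= 1.0:
        J = math.ceil(1.0 / sigma1)
        rest = 1.0 - (J - 1) * sigma1
        for k in range(K):
            # J - 1 private alleles at frequency sigma1, plus one for the remainder.
            row: Row = {(k, j): sigma1 for j in range(J - 1)}
            if rest > 1e-12:
                row[(k, J - 1)] = rest
            rows.append(row)
        return rows
    fl = math.floor(sigma1)
    fr = sigma1 - fl
    rows += [{"shared": 1.0} for _ in range(fl)]         # most frequent allele fixed
    mixed: Row = {("private", fl): 1.0 - fr}
    if fr > 0.0:
        mixed["shared"] = fr                             # the one polymorphic subpopulation
    rows.append(mixed)
    rows += [{("private", k): 1.0} for k in range(fl + 1, K)]
    return rows


def fst(rows: List[Row]) -> float:
    """F_ST = (H_T - H_S) / H_T with equally weighted subpopulations."""
    K = len(rows)
    sigma: Dict[Hashable, float] = {}
    for row in rows:
        for allele, f in row.items():
            sigma[allele] = sigma.get(allele, 0.0) + f
    h_s = 1.0 - sum(f * f for row in rows for f in row.values()) / K
    h_t = 1.0 - sum((s / K) ** 2 for s in sigma.values())
    return (h_t - h_s) / h_t


rows = maximizing_rows(2.16, K=6)
n_alleles = len({a for row in rows for a in row})
print(n_alleles, round(fst(rows), 4))
```

For *σ*1 = 2.16 and *K* = 6, the configuration has *K* − ⌊*σ*1⌋ + 1 = 5 alleles, in line with the count stated in the text.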
#### Piecewise structure of the upper bound

First, we observe that the upper bound has a piecewise structure. For ![Graphic][23], the upper bound depends on ![Graphic][24]. As *KM* increases in (0, 1], each decrement in the integer value of ![Graphic][25] produces a distinct “piece” with domain ![Graphic][26], for integers *j* ≥ 2. Within each interval ![Graphic][27], *J* has the constant value *j*. At ![Graphic][28], the upper bound transitions between its two cases.

For ![Graphic][29], the upper bound depends on ⌊*σ*1⌋ = ⌊*KM*⌋. As *KM* increases in [1, *K*), each increment in ⌊*KM*⌋ also produces a distinct piece of the domain. For each *k* from 1 to *K* − 1, ⌊*KM*⌋ = *k* for *M* in ![Graphic][30].

Counting the intervals of the domain, we see that an infinite number of distinct intervals occur for *M* in ![Graphic][31], and *K* − 1 intervals occur for *M* in ![Graphic][32]. Within intervals, the function describing the upper bound is smooth.

#### Behavior of the upper bound for *M* in ![Graphic][33]

For *M* in ![Graphic][34], we can compute the value of the upper bound at the transition points between distinct pieces of the domain, namely values of ![Graphic][35] for integers *j* ≥ 2. Applying eq. 3, we observe that at ![Graphic][36], the upper bound has value ![Graphic][37]. In other words, the upper bound touches the curve

![Formula][38]

This curve is represented in Fig. 1 as a dashed line. Note that for *K* = 2, the special case considered by Jakobsson *et al*. [12], eq. 4 reduces to *q** (*M*) = *M*/(1 − *M*) = *σ*1/(2 − *σ*1), which matches eq. 21 from Jakobsson *et al*. [12]. In fact, setting *K* = 2, eq. 3 for *M* in ![Graphic][39] reduces to the *K* = 2 upper bound on *F**ST* in eq. 9 of [12].

For *M* in ![Graphic][40], the upper bound is equal to 1 if and only if its numerator is equal to its denominator. Setting the numerator and denominator equal, we find that the upper bound is equal to 1 if and only if ![Graphic][41].
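As a worked check on the curve touched by the upper bound, the value at a transition point can be derived directly from the maximizing configuration. The derivation assumes that the transition points are *σ*1 = 1/*j*, that is, *M* = 1/(*Kj*), and that *H**S* and *H**T* are the usual equal-weight heterozygosities. It is a sketch under those assumptions, not a reproduction of eq. 4.

```latex
% Configuration at \sigma_1 = 1/j: each of the K subpopulations carries
% j private alleles, each at frequency 1/j, so the locus has Kj alleles.
\begin{align*}
H_S &= 1 - \frac{1}{K}\sum_{k=1}^{K}\sum_{i} p_{k,i}^2 = 1 - \frac{1}{j}, \\
H_T &= 1 - \sum_{i}\left(\frac{\sigma_i}{K}\right)^2
     = 1 - Kj \cdot \frac{1}{(Kj)^2} = 1 - \frac{1}{Kj}, \\
F_{ST} &= \frac{H_T - H_S}{H_T}
     = \frac{\frac{1}{j} - \frac{1}{Kj}}{1 - \frac{1}{Kj}}
     = \frac{K-1}{Kj-1}
     = \frac{(K-1)\,M}{1-M}, \qquad M = \frac{1}{Kj}.
\end{align*}
```

For *K* = 2 this expression is *M*/(1 − *M*), in agreement with the reduction stated in the text, and for *j* = 1 (that is, *M* = 1/*K*) it equals 1.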
#### Behavior of the upper bound for *M* in ![Graphic][42] For *M* in ![Graphic][43], we also find the locations where the upper bound is equal to 1. Setting the numerator and denominator equal, the upper bound is equal to 1 if and only if {*σ*1} = 0, that is, if and only if *σ*1 is an integer and ![Graphic][44] for *k* = 2, 3, …, *K* − 1. Hence, noting that the upper bound is equal to 1 at ![Graphic][45], we conclude that the upper bound can equal 1 if and only if ![Graphic][46] for integers *k* = 1, 2, …, *K* − 1. For fixed *K*, the upper bound on *F**ST* has exactly *K* − 1 maxima at which *F**ST* can equal 1, at ![Graphic][47] We can conclude that *F**ST* is unconstrained within the unit interval only for a finite set of values of the frequency *M* of the most frequent allele. The size of this set increases with the number of subpopulations *K*. We can also conclude that because the upper bound is a smooth function on each interval of its domain, and because it possesses maxima at interval boundaries ![Graphic][48], it must possess local minima in intervals ![Graphic][49] for *k* = 1, 2, …, *K* − 2. Indeed, such minima are visible in Figure 1 in cases with *K* = 3, *K* = 6, *K* = 40, and *K* = 100; for *K* = 2, only one maximum occurs, so that there is no interval between a pair of maxima in which a minimum can occur. Note that because we restrict attention to *M* in (0, 1), we do not count the point at *M* = 1 and *F**ST* = 0 as a local minimum. ## 4. Joint distribution of *M* and *F**ST* under an evolutionary model So far, we have described the mathematical constraint imposed on *F**ST* by *M* without respect to the frequency with which particular values of *M* arise in evolutionary scenarios. 
As an assessment of the bounds in evolutionary models can illuminate the settings in which they are most salient in population-genetic data analysis [9, 14, 17, 18, 19, 20], we simulated the joint distribution of *F**ST* and *M* under an island migration model, relating the distribution to the mathematical bounds on *F**ST*. This analysis considers allele frequency distributions, and hence values of *M* and *F**ST*, generated by evolutionary models. The simulation approach is modified from [14, 15]. ### (a) Simulations We simulated alleles under a coalescent model, using the software MS [21]. We considered a total population of *KN* diploid individuals subdivided into *K* subpopulations of equal size *N*. At each generation, a proportion *m* of the individuals in a subpopulation originated outside the subpopulation. Thus, the scaled migration rate is 4*Nm*, and it corresponds to twice the number of individuals in a subpopulation that originate elsewhere. We considered the island model [22, 23], in which migrants have the same probability ![Graphic][50] to come from any specific other subpopulation. We used an infinitely-many-alleles model; mutations occur at a rate *µ*, so that the scaled mutation rate is 4*Nµ*. We examined three values of *K* (2, 6, 40), three values of 4*Nµ* (0.1,1,10), and three values of 4*Nm* (0.1, 1, 10). Note that in MS, time is scaled in units of 4*N* generations, and there is no need to specify subpopulation sizes *N*. MS simulates an infinitely-many-sites model, where each mutation occurs at a new site; each haplotype is a new allele, so that each mutation creates a new allele. For each parameter triplet (*K*, 4*Nµ*, 4*Nm*), we performed 1,000 replicate simulations, sampling 100 sequences per subpopulation in each replicate. We computed *F**ST* values from the parametric allele (haplotype) frequencies. 
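The simulation setup can be sketched in Python as follows. The command builder follows the pattern of the MS commands listed in File S1. The second function treats each distinct haplotype as an allele, as described above, and assumes the haplotypes have already been read into one list of strings per subpopulation. Both function names are hypothetical, and the toy haplotypes are made up.

```python
from collections import Counter
from typing import Dict, List, Tuple


def ms_command(K: int, theta: float, mig: float, n: int = 100, reps: int = 1000) -> str:
    """Build an MS command line in the pattern used in File S1."""
    sizes = " ".join([str(n)] * K)
    return f"./ms {K * n} {reps} -t {theta:g} -I {K} {sizes} {mig:g}"


def haplotype_stats(samples: List[List[str]]) -> Tuple[float, float]:
    """Return (M, F_ST), treating each distinct haplotype string as an allele."""
    K = len(samples)
    freqs: List[Dict[str, float]] = []
    for haps in samples:
        counts = Counter(haps)
        freqs.append({h: c / len(haps) for h, c in counts.items()})
    # sigma: frequency of each haplotype summed across subpopulations
    sigma: Dict[str, float] = {}
    for row in freqs:
        for h, f in row.items():
            sigma[h] = sigma.get(h, 0.0) + f
    M = max(sigma.values()) / K
    h_s = 1.0 - sum(f * f for row in freqs for f in row.values()) / K
    h_t = 1.0 - sum((s / K) ** 2 for s in sigma.values())
    return M, (h_t - h_s) / h_t


print(ms_command(K=2, theta=1, mig=0.1))

# Toy sample: 4 haplotypes from each of K = 2 subpopulations.
toy = [["000", "000", "010", "010"], ["101", "101", "101", "000"]]
M, fst = haplotype_stats(toy)
print(f"M = {M:.3f}, F_ST = {fst:.3f}")
```

Keeping frequencies in dictionaries avoids fixing the number of alleles in advance, which suits a mutation model in which each new mutation creates a new allele.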
MS commands appear in File S1; note that the simulation approach here uses the standard method of simulating MS with a specified mutation rate *θ* = 4*Nµ*, whereas in our previous analyses of biallelic cases [14, 15], we had employed the alternative approach of requiring simulated datasets to possess exactly one segregating site. Figure 2 shows the joint distribution of *M* and *F**ST* for the nine values of (4*Nµ*, 4*Nm*) in the case of *K* = 2. Figures S1 and S2 provide similar figures for *K* = 6 and *K* = 40, respectively. ![Figure 2](http://biorxiv.org/https://www.biorxiv.org/content/biorxiv/early/2021/07/25/2021.07.23.453474/F2.medium.gif) [Figure 2](http://biorxiv.org/content/early/2021/07/25/2021.07.23.453474/F2) Figure 2 Joint density of the frequency *M* of the most frequent allele and *F**ST* in the island migration model with *K* = 2 subpopulations, for different scaled migration rates 4*Nm* and mutation rates 4*Nµ*. (a) 4*Nµ* = 0.1, 4*Nm* = 0.1. (b) 4*Nµ* = 1, 4*Nm* = 0.1. (c) 4*Nµ* = 10, 4*Nm* = 0.1. (d) 4*Nµ* = 0.1, 4*Nm* = 1. (e) 4*Nµ* = 1, 4*Nm* = 1. (f) 4*Nµ* = 10, 4*Nm* = 1. (g) 4*Nµ* = 0.1, 4*Nm* = 10. (h) 4*Nµ* = 1, 4*Nm* = 10. (i) 4*Nµ* = 10, 4*Nm* = 10. The black solid line represents the upper bound on *F**ST* in terms of *M* (eq. 3); the black point plots the mean values of *M* and *F**ST*. Colors represent the density of loci, estimated using a Gaussian kernel density estimate with a bandwidth of 0.02, with density set to 0 outside of the bounds. Loci are simulated using coalescent software MS, assuming an island model of migration and an infinitely-many-alleles mutation model. Each panel considers 1,000 replicate simulations, with 100 lineages sampled per subpopulation. Figures S1 and S2 present similar results for *K* = 6 and *K* = 40 subpopulations, respectively. 
### (b) Impact of the mutation rate For fixed migration rate 4*Nm* and number of subpopulations *K*, the main impact of the mutation rate is on the frequency *M* of the most frequent allele. For *K* = 2, under weak mutation (4*Nµ* = 0.1), the joint distribution of *M* and *F**ST* is highest in the high-*M* region, for all values of 4*Nm* (Fig. 2A, D, G). Although most simulation replicates produce ![Graphic][51] with an upper bound on *F**ST* less than one, this set of parameter values does give rise to replicates near the peak at ![Graphic][52]. Under intermediate mutation (4*Nµ* = 1), the increased mutation rate tends to decrease *M*, shifting the joint distribution to lower values of *M* for all values of 4*Nm* (Fig. 2B, E, H). Finally, under strong mutation (4*Nµ* = 10), the joint distribution of *M* and *F**ST* is highest in the low-*M* region, for all values of 4*Nm* (Fig. 2C, F, I). In this region, the upper bound on *F**ST* is most strongly constrained, leading to low *F**ST* values. ### (c) Impact of the migration rate For fixed mutation rate 4*Nµ* and number of subpopulations *K*, the impact of the migration rate is seen primarily in the *F**ST* values rather than the values of *M*. Under weak migration (4*Nm* = 0.1), subpopulations are differentiated, and the joint distribution of *M* and *F**ST* is highest near the upper bound on *F**ST* in terms of *M* (Fig. 2A, B, C). Under intermediate migration (4*Nm* = 1), differentiation between subpopulations decreases, and the joint density of *M* and *F**ST* is highest at lower values of *F**ST* (Fig. 2D, E, F). Under strong migration (4*Nm* = 10), the joint density of *M* and *F**ST* nears the lower bound (Fig. 2G, H, I). ### (d) Impact of the number of subpopulations In Figure 1, the number of subpopulations changes the shape of the region in which *F**ST* is permitted to range as a function of *M*. 
Thus, in simulations, the impact of the number of subpopulations *K* is observed in cases in which a change in *K* permits *F**ST* to expand its range within the unit square for (*M, F**ST*). For each of the nine choices of (4*Nµ*, 4*Nm*), Figure 3 summarizes the means observed for (*M, F**ST*) in Figures 2, S1, and S2, corresponding to *K* = 2, *K* = 6, and *K* = 40, respectively. ![Figure 3](http://biorxiv.org/https://www.biorxiv.org/content/biorxiv/early/2021/07/25/2021.07.23.453474/F3.medium.gif) [Figure 3](http://biorxiv.org/content/early/2021/07/25/2021.07.23.453474/F3) Figure 3 Mean frequency *M* of the most frequent allele and mean *F**ST* in the island migration model, for different scaled migration rates 4*Nm* and mutation rates 4*Nµ* and different numbers of subpopulations *K*. (a) *K* = 2. (b) *K* = 6. (c) *K* = 40. The black solid line represents the upper bound on *F**ST* in terms of *M* (eq. 3). The colored points represent the mean *M* and mean *F**ST*, where colors correspond to values of 4*Nm*. These points are taken from Figures 2, S1, and S2. The number of subpopulations generally increases *F**ST* for fixed 4*Nµ* and 4*Nm*. For example, the mean *F**ST* can be substantially larger for *K* = 6 than for *K* = 2. Consider (4*Nµ*, 4*Nm*) = (0.1, 0.1). For *K* = 2, the mean *F**ST* is near its upper bound (Fig. 3A); for *K* = 6, *F**ST* is not as close to the bound (Fig. 3B). However, because the upper bound for *K* = 6 exceeds that for *K* = 2, the mean *F**ST* is nevertheless larger in the case of *K* = 6. ## 5. Example: humans and chimpanzees We now use our theoretical results to explain observed patterns of genetic differentiation. In particular, *F**ST* values between humans and chimpanzees for multiallelic loci have been seen to be perhaps surprisingly low in light of the fact that they represent inter-species computations rather than computations within species [8, 24]. 
We investigate this observation by examining data on 246 multiallelic microsatellite loci assembled by Pemberton *et al*. [25] from several studies of worldwide human populations and a study of chimpanzees [26]. We consider *F**ST* comparisons both between humans and chimpanzees and among populations of chimpanzees. For the human data, we consider all 5795 individuals in the dataset, and for the chimpanzee data, we consider 84 chimpanzee individuals from 6 populations: one bonobo population, and 5 common chimpanzee populations (Central, Eastern, Western, hybrid, and captive). In the data analysis, we perform a computation to summarize the relationship of *F**ST* to the upper bound. For a set of *Z* loci, denote by *F**z* and *M**z* the values of *F**ST* and *M* at locus *z*. The mean *F**ST* for the set, or ![Graphic][53], is ![Formula][54] Using eq. 3, we can compute the corresponding maximum *F**ST* given the observed *σ**z* = *KM**z*, *z* = 1, 2, …, *Z*. Denoting this quantity by ![Graphic][55], we have ![Formula][56] ![Graphic][57] measures the proximity of the *F**ST* values to their upper bounds: it ranges from 0, if *F**ST* values at all loci equal 0, to 1, if *F**ST* values at all loci equal their upper bounds. We computed the parametric allele frequencies for each subpopulation—the human and chimpanzee groups for the human-chimpanzee comparison, and chimpanzee subpopulations for the comparison of chimpanzees—averaging across subpopulations to obtain the frequency *M* of the most frequent allele. We then computed *F**ST* and the associated upper bound for each locus, averaging across loci to obtain the overall ![Graphic][58] and ![Graphic][59] for the full microsatellite set (eqs. 5 and 6). 
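The summary computation can be sketched in Python. Two assumptions are made, because the formula images are not reproduced here: the normalized quantity is read as the ratio of the mean *F**ST* to the mean of the per-locus maxima, and the upper bound is a closed form reconstructed from the maximizing configurations described in Section 3. Function names and the toy loci are hypothetical.

```python
import math
from typing import List, Tuple


def fst_upper_bound(sigma1: float, K: int) -> float:
    """Reconstructed upper bound on F_ST given sigma1 = K * M (an assumption)."""
    if sigma1 <= 1.0:
        J = math.ceil(1.0 / sigma1)
        s = 1.0 - sigma1 * (J - 1) * (2.0 - J * sigma1)
        return (K - 1) * s / (K - s)
    fl = math.floor(sigma1)
    fr = sigma1 - fl
    num = K * (K - 1) - sigma1 ** 2 + fl - 2 * (K - 1) * fr + (2 * K - 1) * fr ** 2
    den = K * (K - 1) - sigma1 ** 2 - fl + 2 * sigma1 - fr ** 2
    return num / den


def mean_fst_and_ratio(loci: List[Tuple[float, float]], K: int) -> Tuple[float, float]:
    """loci holds (M_z, F_z) pairs; returns (mean F_ST, mean F_ST / mean maximum)."""
    Z = len(loci)
    mean_f = sum(f for _, f in loci) / Z
    mean_max = sum(fst_upper_bound(K * m, K) for m, _ in loci) / Z
    return mean_f, mean_f / mean_max


# Made-up (M, F_ST) values for three loci in a K = 2 comparison.
toy_loci = [(0.27, 0.10), (0.40, 0.25), (0.55, 0.30)]
mean_f, ratio = mean_fst_and_ratio(toy_loci, K=2)
print(f"mean F_ST = {mean_f:.3f}, ratio to maximum = {ratio:.3f}")
```

Taking the ratio of means, rather than the mean of per-locus ratios, gives 0 when every locus has *F**ST* = 0 and 1 when every locus sits at its bound, matching the range described above.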
Surprisingly, given the longer evolutionary time between humans and chimpanzees than among chimpanzee populations, the *F**ST* value is significantly greater when comparing chimpanzee populations ![Graphic][60] than when comparing humans and chimpanzees (![Graphic][61], Wilcoxon rank sum test).

The explanation for this result can be found in the properties of the upper bound on *F**ST* given *M*. Values of *M* are similar in the two comparisons (Fig. 4A, 4B). However, *K* differs, equaling 2 for the human-chimpanzee comparison and 6 for the comparison of chimpanzee subpopulations. Because the theoretical range of *F**ST* is seen to be smaller for *F**ST* values computed among smaller sets of subpopulations than among larger sets (Fig. 1), the *F**ST* values among chimpanzees possess a larger range. For example, the maximal *F**ST* at the mean *M* of 0.27 observed in pairwise comparisons is 0.34 for *K* = 2 (red segment in Figure 4A), whereas the maximal *F**ST* at the mean *M* of 0.36 observed for six chimpanzee populations is 0.93 for *K* = 6 (Fig. 4B). Given the stronger constraint in pairwise calculations than in calculations with more subpopulations, it is not unexpected that pairwise *F**ST* values would be smaller than those in a 6-subpopulation computation. A high *F**ST* among chimpanzees compared to between humans and chimpanzees is a byproduct of mathematical constraints on *F**ST*.

![Figure 4](http://biorxiv.org/https://www.biorxiv.org/content/biorxiv/early/2021/07/25/2021.07.23.453474/F4.medium.gif)

[Figure 4](http://biorxiv.org/content/early/2021/07/25/2021.07.23.453474/F4) Figure 4 *F**ST* values for comparisons involving humans and chimpanzees based on multiallelic microsatellite loci. (a) *F**ST* between humans and chimpanzees, considering *K* = 2 subpopulations (humans, chimpanzees). (b) *F**ST* among *K* = 6 chimpanzee subpopulations.
In (a) and (b), colors represent the number of points in a neighborhood of radius 0.03; red points indicate the mean *M* and *F**ST*, and vertical red segments indicate the permissible range of *F**ST* at the mean *M*. (c) *F**ST*, computed using eq. 2, and *F**ST* /*F*max, computed using eqs. 2 and 3. Each point plotted represents one locus.

Interestingly, the effect of *K* on *F**ST* is largely eliminated when each *F**ST* value is normalized by the associated maximum given *K* and *M* (Fig. 4C). The normalization leads to higher values for human-chimpanzee comparisons than among chimpanzee subpopulations (![Graphic][62] and 0.20, respectively; *p* = 1.1 × 10^−9, Wilcoxon rank sum test), as expected from the greater evolutionary distance between humans and chimpanzees compared to that among chimpanzees.

## 6. Discussion

We have analyzed the range of values that *F**ST* can take as a function of the frequency *M* of the most frequent allele at a multiallelic locus, for arbitrarily many subpopulations. We showed that *F**ST* can span the full unit interval only for a finite set of values of *M*, at ![Graphic][63] for integers *k* in [1, *K* − 1]. For all other *M*, *F**ST* necessarily lies below 1. The range of values that *F**ST* can take enlarges as the number of subpopulations *K* increases.

This study provides the most complete relationship between *F**ST* and *M* obtained to date, generalizing previous results for the case of *K* = 2 subpopulations [12] and for a restriction to *I* = 2 alleles [14]. Interestingly, the maximal *F**ST* we have obtained merges patterns observed in these previous studies. Fixing *K* = 2, we obtain the upper bound on *F**ST* in terms of *M* that was reported by Jakobsson *et al*. [12]. As *K* increases, the piecewise pattern seen by Jakobsson *et al*. [12] for the maximal *F**ST* in the *K* = 2 case for *M* in ![Graphic][64] is observed in the multiallelic case for *M* in ![Graphic][65].
The decay from ![Graphic][66] to (*M*, *F**ST*) = (1, 0) seen by Jakobsson *et al*. [12] for *K* = 2 is observed in the decay from ![Graphic][67] to (1, 0) for arbitrary *K*. The allele frequency values for which the upper bound is reached for *M* in ![Graphic][68] generalize those seen for the case of *K* = 2 and *M* in ![Graphic][69] [12]. The upper bound is reached when all alleles are private, each subpopulation has as many alleles as possible at frequency *KM*, and at most one additional allele. The allele frequency values for which the upper bound is reached for *M* in ![Graphic][70] also generalize those seen for *K* = 2 and *M* in ![Graphic][71]: the maximum is reached when the most frequent allele is fixed in all subpopulations except one, and a single private allele is present in this last subpopulation.

The results from Alcala and Rosenberg [14] for *I* = 2 produce a more constrained upper bound on *F**ST* than for arbitrary *I*, with the domain of *M* restricted to ![Graphic][72]. Nevertheless, many properties of the maximal *F**ST* we observe for unspecified *I* and *M* in ![Graphic][73] are similar to those seen for *I* = 2 and *M* in ![Graphic][74]: finitely many peaks at points ![Graphic][75], local minima between the peaks, and an increase in coverage of the unit square for (*M*, *F**ST*) as *K* increases. The maximal *F**ST* functions for *M* in ![Graphic][76] for unspecified *I* and for *I* = 2 agree, as the number of alleles required to maximize *F**ST* in this interval in the case of unspecified *I* is simply equal to 2.

In assuming that the number of alleles is unspecified, we found that the number of distinct alleles needed for achieving the maximal *F**ST* is ![Graphic][77] for *M* in ![Graphic][78] and *K* − ⌊*σ*1⌋ + 1 for *M* in ![Graphic][79].
With a maximal number of distinct alleles, such as in the *I* = 2 case of Alcala and Rosenberg [14] with *K* specified and in the *K* = 2 case with *I* specified [13], the upper bound on *F**ST* is less than or equal to that seen in the corresponding unspecified-*I* case. For the case of *K* = 2, specifying *I* has a relatively small effect in reducing the maximal value of *F**ST* [13]. As in Edge and Rosenberg [13], specifying *I* in the case of larger values of *K* is expected to have the greatest impact on the *F**ST* upper bound at the lowest end of the domain for *M*. In coalescent simulations, we found that the joint distribution of *M* and *F**ST* within their permissible space can help separate the impact of mutation and migration. Although the dependence of *F**ST* on mutation and migration rates has been long documented, the symmetric role that mutation and migration play under the island model [22] illustrates the difficulty in separating their effects. Under the island model, allele frequency *M* is informative about the scaled mutation rate 4*Nµ*, and comparing the value of *F**ST* to its maximum given *M* is informative about the scaled migration rate 4*Nm*. Adding a dimension that is more sensitive to mutation than to migration—*M* in our case— enables the separation of their effects. Other statistics, such as total heterozygosity *H**T* or within-subpopulation heterozygosity *H**S*, have the potential to play a similar role [20]. Our results can inform data analyses. In particular, we caution users to examine upper bounds on *F**ST* to assess how mathematical constraints influence observations. As the constraints are strongest for *K* = 2, this step is most important in pairwise comparisons. For example, visual inspection of the values of *M* and *F**ST* within their bounds can suggest that constraints have an effect. ![Graphic][80] can provide a helpful summary by evaluating the proximity of *F**ST* values to their maxima. 
Further, joint use of *M* along with *F**ST* could be useful in various applications of *F**ST*, such as in inference of model parameters by approximate Bayesian computation [27] and machine-learning [28]. *F**ST* outlier tests to detect local adaptation from multiallelic loci [29] could search for *F**ST* values that represent outliers not in the distribution of *F**ST* values, but rather, outliers in relation to associated upper bounds. Computing null distributions for *F**ST* conditional on *M* could enhance the approach. In an example data analysis, we have shown that taking into account mathematical constraints on *F**ST* can help understand puzzling *F**ST* behavior. In our example, *F**ST* at a set of loci was higher when comparing *K* = 6 chimpanzee populations than when comparing humans and chimpanzees (*K* = 2), even though the same loci were used and the mean value for *M* was similar in the two comparisons. A comparison of *F**ST* values to their respective maxima explained these counterintuitive results. In human populations, efforts to understand *F**ST* patterns trace in large part to Lewontin’s foundational *F**ST* -like computation [30], in which it was seen that among-population differences (analogous to *F**ST*) were small relative to within-population differences (analogous to 1 − *F**ST*). Studies using loci with different numbers of alleles, loci with different frequencies for the most frequent allele, and samples with different numbers of subpopulations have varied to some extent in their estimates for the numerical value of *F**ST* [14, 31, 32, 33, 34]. Mathematical results on *F**ST* bounds provide part of the explanation for these differences: they establish that *F**ST* calculations from data differing in the character of the loci and subpopulation sets can produce numerically distinct values not due to features of the underlying human biology, but rather, due to different constraints on the *F**ST* measure itself. 
The mathematical results explain quantitative variation in *F**ST* calculations, serving to support the qualitative claim that worldwide human genetic differentiation measurements represented by *F**ST*-like statistics have low values, as was argued by Lewontin fifty years ago.

## Data accessibility

All data are publicly available (see cited references).

## Authors’ contributions

NA and NAR designed the study. NA analysed the data. NA and NAR wrote the manuscript.

## Competing interests

Where authors are identified as personnel of the International Agency for Research on Cancer/World Health Organization, the authors alone are responsible for the views expressed in this article and they do not necessarily represent the decisions, policy or views of the International Agency for Research on Cancer/World Health Organization.

## Funding

Support was provided by NIH grant R01 HG005855 and a France-Stanford Center for Interdisciplinary Studies grant.

## Supplementary File S1: MS commands

We applied MS, specifying the scaled mutation and migration parameters. We performed the simulations for *K* = 2, *K* = 6, and *K* = 40 subpopulations. For each command, we replace x by the desired 4*Nµ* value and y by the desired 4*Nm* value.

*K* = 2

./ms 200 1000 -t x -I 2 100 100 y

*K* = 6

./ms 600 1000 -t x -I 6 100 100 100 100 100 100 y

*K* = 40

./ms 4000 1000 -t x -I 40 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 y

## Acknowledgements

We thank Maike Morrison for helpful conversations.

## Appendix

### Proof of eq. 3

This appendix provides the derivation of the upper bound on *F**ST* as a function of *σ*1 that appears in eq. 3. First, we reduce the problem of maximizing *F**ST* to the problem of maximizing the sum of squared allele frequencies across alleles and subpopulations, ![Graphic][81].
Next, we maximize *S* as a function of *σ*1, separately in the intervals (0, 1] and (1, *K*).

### Reducing the problem of maximizing *F**ST*

Suppose *K* ≥ 2 is a specified integer. Suppose *σ*1 is a fixed value, with 0 < *σ*1 < *K*. We leave the number of alleles *I* unspecified. For each *i* ≥ 1, we write ![Graphic][82], with *σ**i* ≥ *σ**j* for each *i* and *j* with *i* ≤ *j*. For convenience, *σ*1 is taken to mean both the function that computes the sum ![Graphic][83] for a specified set of values of the *p**k,i* and a fixed value for that sum. For each (*k, i*) with 1 ≤ *k* ≤ *K* and *i* ≥ 1, *p**k,i* lies in [0, 1], and ![Graphic][84] for all *k*, 1 ≤ *k* ≤ *K*.

Define *F**ST* as in eq. 2. We seek to maximize *F**ST* over all possible sets of values of the *p**k,i* with a fixed value *σ*1 for the sum ![Graphic][85]. Note that because *σ*1 < *K* and ![Graphic][86], it follows that *σ*2 > 0.

Denote the sum of squared frequencies of allele 1 across subpopulations, ![Graphic][87], by *S*1. Denote ![Graphic][88] for the corresponding sum of squared frequencies of all alleles. We express eq. 2 in terms of *σ*1, *S*1, and *S*:

![Formula][89]

By construction of eq. 2, the denominator of eq. A.1 lies in (0, *K*²), as 0 < *H**T* < 1 from the fact that *σ*2 > 0. The numerator lies in [0, *K*²), as 0 ≤ *H**S* ≤ *H**T* < 1, so that 0 ≤ *H**T* − *H**S* < 1. *F**ST* lies in [0, 1], as 0 ≤ *H**S* and 0 < *H**T* imply 0 ≤ (*H**T* − *H**S*)/*H**T* ≤ 1.

Theorem 1. *F**ST* *satisfies*

![Formula][90]

*with equality requiring that for each i* ≥ 2, *there exists at most one value of k for which p**k,i* > 0.

*Proof*. Because ![Graphic][91] is subtracted from both the numerator and the denominator of eq. A.1, and the numerator does not exceed the denominator, *F**ST* can be bounded above by minimizing this term. Because *p**k,i* ≥ 0 for all (*k, i*), each sum ![Graphic][92] is bounded below by zero. Setting the sum to 0 for all *i* ≥ 2 gives the upper bound in eq. A.2.
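For readers without access to the typeset formulas, the two displays can be written out as follows. This is a reconstruction from the definitions in this section, not a copy of the original equations; it assumes *H**S* = 1 − *S*/*K* and *H**T* = 1 − Σ*i* (*σ**i*/*K*)², and the index ℓ runs over subpopulations other than *k*.

```latex
\begin{align}
F_{ST}
  &= \frac{(K-1)S + S_1 - \sigma_1^2
           - \sum_{i \ge 2} \sum_{k=1}^{K} \sum_{\ell \ne k} p_{k,i}\, p_{\ell,i}}
          {K^2 - \sigma_1^2 - S + S_1
           - \sum_{i \ge 2} \sum_{k=1}^{K} \sum_{\ell \ne k} p_{k,i}\, p_{\ell,i}}
  && \text{(cf. eq. A.1)} \\
  &\le \frac{(K-1)S + S_1 - \sigma_1^2}{K^2 - \sigma_1^2 - S + S_1}
  && \text{(cf. eq. A.2)}
\end{align}
```

The first line follows from multiplying the numerator and denominator of (*H**T* − *H**S*)/*H**T* by *K*² and expanding Σ*i* *σ**i*² into squared terms and cross products; the second line drops the nonnegative cross-product sum for alleles *i* ≥ 2.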
For the equality condition, ![Graphic][93] if and only if all products *p**k,i* *p**ℓ,i* are zero, that is, if and only if for each *i* ≥ 2, at most one value of *k* has *p**k,i* > 0. □

By Theorem 1, to maximize *F**ST* for fixed *σ*1, we must maximize the quantity in eq. A.2. It suffices to consider sets of values of *p**k,i* for which for each *i* ≥ 2, at most one value of *k* has *p**k,i* > 0.

### The case of *σ*1 in (0, 1]

In this section, we find the set of values of the *p**k,i* that maximize *F**ST* for *σ*1 in (0, 1]. We proceed in two steps. (i) We show that for *σ*1 in (0, 1], the maximal *F**ST* occurs at a set of *p**k,i* values for which *all* alleles are private: that is, for each *i* ≥ 1, *p**k,i* > 0 for at most one value of *k*. (ii) We determine the set of *p**k,i* values that, with all alleles private, maximizes *F**ST*.

(i) In eq. A.2, note that ![Graphic][94]. Because ![Graphic][95] is subtracted from both numerator and denominator in eq. A.2, the quantity in eq. A.2 is maximal when ![Graphic][96] is minimal. In other words, the upper bound on *F**ST* is maximal if and only if ![Graphic][97] is minimal. Because *σ*1 ≤ 1, a minimum of 0 for ![Graphic][98] is achieved if and only if there is a single value *k* = *k′* at which *p**k′*,1 = *σ*1, so that *p**k*,1 = 0 for all *k* ≠ *k′*. We then have ![Graphic][99], and from eq. A.2,

![Formula][100]

Each allele is private, and because allele 1 is the most frequent, *p**k,i* lies in [0, *σ*1] for all (*k, i*).

(ii) The problem of finding the set of *p**k,i* values that maximizes *F**ST* has now been reduced to the problem of maximizing the right-hand side of eq. A.3, with the constraint that all alleles are private. Because the numerator in eq. A.3 increases with *S* and the denominator decreases with *S*, the maximum is achieved if and only if *S* achieves its maximal value.
In other words, we seek to maximize ![Graphic][101], with the constraints ![Graphic][102] and *p**k,i* ≤ *σ*1 for each (*k, i*) with 1 ≤ *k* ≤ *K* and *i* ≥ 1. Because each allele is private, the maximum is achieved by separately maximizing each ![Graphic][103] with constraints ![Graphic][104] and *p**k,i* ≤ *σ*1. This maximization is precisely that of Lemma 3 of Rosenberg and Jakobsson [35]. Applying the lemma, the maximum is achieved with *p**k*,1 = *p**k*,2 = … = *p**k,J*−1 = *σ*1, *p**k,J* = 1 − (*J* − 1)*σ*1, and *p**k,i* = 0 for ![Graphic][105]. It satisfies ![Graphic][106]. In other words, each subpopulation *k* possesses *J* − 1 private alleles with frequency *σ*1 and one private allele with frequency 1 − (*J* − 1)*σ*1. Hence, *S* = *K*[1 − *σ*1(*J* − 1)(2 − *Jσ*1)], so that eq. A.3 leads to eq. 3 for *σ*1 in (0, 1].

### The case of *σ*1 in (1, *K*)

This section finds the set of values of the *p**k,i* that maximizes *F**ST* for *σ*1 in (1, *K*). For ![Graphic][107] in (1, *K*), because 0 ≤ *p**k*,1 ≤ 1 for all *k*, *p**k*,1 > 0 for at least two values of *k*. Writing *S*\* = *S* − *S*1, eq. A.2 can be rewritten

![Formula][108]

Because the numerator increases with *S*1, and because the numerator increases with *S*\* and the denominator decreases with *S*\*, the upper bound on *F**ST* is greatest when both *S*1 and *S*\* are maximized subject to ![Graphic][109] for each *k* and ![Graphic][110] for each *i*. If *S*1 and *S*\* can be simultaneously maximized at the same set of values of the *p**k,i*, then this set of values of the *p**k,i* achieves the maximal *F**ST*.

We proceed in three steps. (i) First, we find the set of values of the *p**k,i* that maximizes *S*1. (ii) Next, we find the set of values that maximizes *S*\*. (iii) We then conclude that because the same set maximizes both *S*1 and *S*\* separately, this set achieves the upper bound in eq. A.4, and hence in eq. A.2.

1. We first maximize *S*1 for fixed *σ*1 in (1, *K*).
More precisely, we seek to maximize ![Graphic][111] with constraints ![Graphic][112] and *p**k*,1 ≤ 1 for each *k* from 1 to *K*. This maximization is precisely that performed in Theorem 1 from Alcala and Rosenberg [14], a corollary of Lemma 3 of Rosenberg and Jakobsson [35]. Applying the theorem, the maximum is achieved by setting *p*1,1 = *p*2,1 = … = *p*⌊*σ*1⌋,1 = 1, *p*⌊*σ*1⌋+1,1 = {*σ*1}, and *p**k*,1 = 0 for all *k* > ⌊*σ*1⌋ + 1. The maximal value of *S*1 is {*σ*1}² + ⌊*σ*1⌋.

2. Next, we maximize ![Graphic][113]. Because, by Theorem 1, all alleles with *i* ≥ 2 are private at the set of values of the *p**k,i* that maximizes *F**ST* for fixed *σ*1, each nonzero *p**k,i* for *i* ≥ 2 is equal to the associated *σ**i*. The sum of the frequencies of all alleles across all subpopulations is ![Graphic][114], so that ![Graphic][115]. The problem of maximizing *S*\* is the problem of maximizing ![Graphic][116] with the constraints ![Graphic][117] *σ*1 and *σ**i* ≤ 1 for each *i* from 2 to ∞. This maximization is again that performed in Lemma 3 of Rosenberg and Jakobsson [35]. Applying the lemma, the maximum is achieved by setting ![Graphic][118] and *σ**i* = 0 for *i* > *K* − ⌊*σ*1⌋ + 1. The maximum is (1 − {*σ*1})² + (*K* − ⌊*σ*1⌋ − 1).

3. *S*1 is maximized at a set of *p**k,i* for which ⌊*σ*1⌋ subpopulations are fixed for allele 1, allele 1 has frequency {*σ*1} in one subpopulation, and allele 1 has frequency 0 in all other subpopulations. *S*\* is maximized at a set of *p**k,i* for which *K* − ⌊*σ*1⌋ − 1 subpopulations are fixed, each for a distinct allele *i* with *i* ≥ 2, one subpopulation possesses a distinct allele *i* ≥ 2 with frequency 1 − {*σ*1}, and all ⌊*σ*1⌋ other subpopulations possess no alleles *i* ≥ 2 of nonzero frequency.

The upper bound in eq. A.4 depends on both *S*1 and *S*\*, each of which depends on the *p**k,i*.
Were the set of values of the *p**k,i* that maximizes *S*1 and the set of values of the *p**k,i* that maximizes *S*\* to differ, additional work would be required to find the set of values of the *p**k,i* that maximizes *F**ST*. However, we now observe that *S*1 and *S*\* can be simultaneously maximized at the same set of values of *p**k,i*, so that the same set of values of the *p**k,i* maximizes *S*1 and *S*\* and hence *F**ST*. In particular, ⌊*σ*1⌋ subpopulations are fixed for allele 1, each of *K* − ⌊*σ*1⌋ − 1 subpopulations is fixed for its own private allele, and a single subpopulation possesses allele 1 with frequency {*σ*1} and a private allele with frequency 1 − {*σ*1}. The number of alleles of nonzero frequency is *K* − ⌊*σ*1⌋ + 1. Only the most frequent allele has the possibility of being shared by more than one subpopulation, and at most a single subpopulation possesses more than one allele of nonzero frequency. Substituting the maximal values of *S*1 and *S*\* into eq. A.4, for *σ*1 in (1, *K*), we obtain the maximal *F**ST* in terms of *σ*1 in eq. 3.

![Figure S1](https://www.biorxiv.org/content/biorxiv/early/2021/07/25/2021.07.23.453474/F5.medium.gif)

Figure S1. Joint density of the frequency *M* of the most frequent allele and *F**ST* in the island migration model with *K* = 6 subpopulations, for different scaled migration rates 4*Nm* and mutation rates 4*Nµ*. The simulation procedure and figure design follow Figure 2.

![Figure S2](https://www.biorxiv.org/content/biorxiv/early/2021/07/25/2021.07.23.453474/F6.medium.gif)

Figure S2. Joint density of the frequency *M* of the most frequent allele and *F**ST* in the island migration model with *K* = 40 subpopulations, for different scaled migration rates 4*Nm* and mutation rates 4*Nµ*.
The simulation procedure and figure design follow Figure 2.

* Received July 23, 2021.
* Revision received July 23, 2021.
* Accepted July 25, 2021.
* © 2021, Posted by Cold Spring Harbor Laboratory

This pre-print is available under a Creative Commons License (Attribution-NonCommercial-NoDerivs 4.0 International), CC BY-NC-ND 4.0, as described at [http://creativecommons.org/licenses/by-nc-nd/4.0/](http://creativecommons.org/licenses/by-nc-nd/4.0/)

## References

1. Nei, M., 1973 Analysis of gene diversity in subdivided populations. Proceedings of the National Academy of Sciences USA 70: 3321–3323.
2. Holsinger, K. E., and B. S. Weir, 2009 Genetics in geographically structured populations: defining, estimating and interpreting FST. Nature Reviews Genetics 10: 639–650.
3. Jin, L., and R. Chakraborty, 1995 Population structure, stepwise mutations, heterozygote deficiency and their implications in DNA forensics. Heredity 74: 274–285.
4. Charlesworth, B., 1998 Measures of divergence between populations and the effect of forces that reduce variability. Molecular Biology and Evolution 15: 538–543.
5. Nagylaki, T., 1998 Fixation indices in subdivided populations. Genetics 148: 1325–1332.
6. Hedrick, P. W., 1999 Highly variable loci and their interpretation in evolution and conservation. Evolution 53: 313–318.
7. Balloux, F., H. Brünner, N. Lugon-Moulin, J. Hausser, and J. Goudet, 2000 Microsatellites can be misleading: an empirical and simulation study. Evolution 54: 1414–1422.
8. Long, J. C., and R. A. Kittles, 2003 Human genetic diversity and the nonexistence of biological races. Human Biology 75: 449–471.
9. Hedrick, P. W., 2005 A standardized genetic differentiation measure. Evolution 59: 1633–1638.
10. Maruki, T., S. Kumar, and Y. Kim, 2012 Purifying selection modulates the estimates of population differentiation and confounds genome-wide comparisons across single-nucleotide polymorphisms. Molecular Biology and Evolution 29: 3617–3623.
11. Rosenberg, N. A., L. M. Li, R. Ward, and J. K. Pritchard, 2003 Informativeness of genetic markers for inference of ancestry. American Journal of Human Genetics 73: 1402–1422.
12. Jakobsson, M., M. D. Edge, and N. A. Rosenberg, 2013 The relationship between FST and the frequency of the most frequent allele. Genetics 193: 515–528.
13. Edge, M. D., and N. A. Rosenberg, 2014 Upper bounds on FST in terms of the frequency of the most frequent allele and total homozygosity: the case of a specified number of alleles. Theoretical Population Biology 97: 20–34.
14. Alcala, N., and N. A. Rosenberg, 2017 Mathematical constraints on FST: biallelic markers in arbitrarily many populations. Genetics 206: 1581–1600.
15. Alcala, N., and N. A. Rosenberg, 2019 ![Graphic][119], Jost’s D, and FST are similarly constrained by allele frequencies: A mathematical, simulation, and empirical study. Molecular Ecology 28: 1624–1636.
16. Weir, B. S., and C. C. Cockerham, 1984 Estimating F-statistics for the analysis of population structure. Evolution 38: 1358–1370.
17. Whitlock, M. C., 2011 ![Graphic][120] and D do not replace FST. Molecular Ecology 20: 1083–1091.
18. Rousset, F., 2013 Exegeses on maximum genetic differentiation. Genetics 194: 557–559.
19. Alcala, N., J. Goudet, and S. Vuilleumier, 2014 On the transition of genetic differentiation from isolation to panmixia: what we can learn from GST and D. Theoretical Population Biology 93: 75–84.
20. Wang, J., 2015 Does GST underestimate genetic differentiation from marker data? Molecular Ecology 24: 3546–3558.
21. Hudson, R. R., 2002 Generating samples under a Wright–Fisher neutral model of genetic variation. Bioinformatics 18: 337–338.
22. Maruyama, T., 1970 Effective number of alleles in a subdivided population. Theoretical Population Biology 1: 273–306.
23. Wakeley, J., 1998 Segregating sites in Wright’s island model. Theoretical Population Biology 53: 166–174.
24. Long, J. C., 2009 Update to Long and Kittles’s “Human genetic diversity and the nonexistence of biological races” (2003): fixation on an index. Human Biology 81: 799–803.
25. Pemberton, T. J., M. DeGiorgio, and N. A. Rosenberg, 2013 Population structure in a comprehensive genomic data set on human microsatellite variation. G3: Genes, Genomes, Genetics 3: 891–907.
26. Becquet, C., N. Patterson, A. C. Stone, M. Przeworski, and D. Reich, 2007 Genetic structure of chimpanzee populations. PLoS Genetics 3: e66.
27. Beaumont, M. A., 2010 Approximate Bayesian computation in evolution and ecology. Annual Review of Ecology, Evolution, and Systematics 41: 379–406.
28. Schrider, D. R., and A. D. Kern, 2018 Supervised machine learning for population genetics: a new paradigm. Trends in Genetics 34: 301–312.
29. Hoban, S., J. L. Kelley, K. E. Lotterhos, M. F. Antolin, G. Bradburd, et al., 2016 Finding the genomic basis of local adaptation: pitfalls, practical solutions, and future directions. American Naturalist 188: 379–397.
30. Lewontin, R. C., 1972 The apportionment of human diversity. Evolutionary Biology 6: 381–398.
31. Barbujani, G., A. Magagni, E. Minch, and L. L. Cavalli-Sforza, 1997 An apportionment of human DNA diversity. Proceedings of the National Academy of Sciences USA 94: 4516–4519.
32. Ruvolo, M., and M. T. Seielstad, 2001 “The apportionment of human diversity” 25 years later. In R. S. Singh, C. S. Krimbas, D. B. Paul and J. Beatty, editors, Thinking about Evolution: Historical, Philosophical, and Political Perspectives. Cambridge University Press, Cambridge, 141–151.
33. Rosenberg, N. A., J. K. Pritchard, J. L. Weber, H. M. Cann, K. K. Kidd, et al., 2003 Response to comment on “Genetic structure of human populations”. Science 300: 1877c.
34. Li, J. Z., D. M. Absher, H. Tang, A. M. Southwick, A. M. Casto, et al., 2008 Worldwide human relationships inferred from genome-wide patterns of variation. Science 319: 1100–1104.
35. Rosenberg, N. A., and M. Jakobsson, 2008 The relationship between homozygosity and the frequency of the most frequent allele. Genetics 179: 2027–2036.

[1]: /embed/graphic-1.gif [2]: /embed/inline-graphic-1.gif [3]: /embed/inline-graphic-2.gif [4]: /embed/inline-graphic-3.gif [5]: /embed/inline-graphic-4.gif [6]: /embed/inline-graphic-5.gif [7]: /embed/inline-graphic-6.gif [8]: /embed/inline-graphic-7.gif [9]: /embed/inline-graphic-8.gif [10]: /embed/inline-graphic-9.gif [11]: /embed/inline-graphic-10.gif [12]: /embed/inline-graphic-11.gif [13]: /embed/graphic-3.gif [14]: /embed/inline-graphic-12.gif [15]: /embed/inline-graphic-13.gif [16]: /embed/inline-graphic-14.gif [17]: /embed/inline-graphic-15.gif [18]: /embed/inline-graphic-16.gif [19]: /embed/inline-graphic-17.gif [20]: /embed/graphic-4.gif [21]: /embed/inline-graphic-18.gif [22]: F1/embed/inline-graphic-19.gif [23]: /embed/inline-graphic-20.gif [24]: /embed/inline-graphic-21.gif [25]: /embed/inline-graphic-22.gif [26]: /embed/inline-graphic-23.gif [27]: /embed/inline-graphic-24.gif [28]: /embed/inline-graphic-25.gif [29]: /embed/inline-graphic-26.gif [30]: /embed/inline-graphic-27.gif [31]: /embed/inline-graphic-28.gif [32]: /embed/inline-graphic-29.gif [33]:
/embed/inline-graphic-30.gif [34]: /embed/inline-graphic-31.gif [35]: /embed/inline-graphic-32.gif [36]: /embed/inline-graphic-33.gif [37]: /embed/inline-graphic-34.gif [38]: /embed/graphic-6.gif [39]: /embed/inline-graphic-35.gif [40]: /embed/inline-graphic-36.gif [41]: /embed/inline-graphic-37.gif [42]: /embed/inline-graphic-38.gif [43]: /embed/inline-graphic-39.gif [44]: /embed/inline-graphic-40.gif [45]: /embed/inline-graphic-41.gif [46]: /embed/inline-graphic-42.gif [47]: /embed/inline-graphic-43.gif [48]: /embed/inline-graphic-44.gif [49]: /embed/inline-graphic-45.gif [50]: /embed/inline-graphic-46.gif [51]: /embed/inline-graphic-47.gif [52]: /embed/inline-graphic-48.gif [53]: /embed/inline-graphic-49.gif [54]: /embed/graphic-9.gif [55]: /embed/inline-graphic-50.gif [56]: /embed/graphic-10.gif [57]: /embed/inline-graphic-51.gif [58]: /embed/inline-graphic-52.gif [59]: /embed/inline-graphic-53.gif [60]: /embed/inline-graphic-54.gif [61]: /embed/inline-graphic-55.gif [62]: /embed/inline-graphic-56.gif [63]: /embed/inline-graphic-57.gif [64]: /embed/inline-graphic-58.gif [65]: /embed/inline-graphic-59.gif [66]: /embed/inline-graphic-60.gif [67]: /embed/inline-graphic-61.gif [68]: /embed/inline-graphic-62.gif [69]: /embed/inline-graphic-63.gif [70]: /embed/inline-graphic-64.gif [71]: /embed/inline-graphic-65.gif [72]: /embed/inline-graphic-66.gif [73]: /embed/inline-graphic-67.gif [74]: /embed/inline-graphic-68.gif [75]: /embed/inline-graphic-69.gif [76]: /embed/inline-graphic-70.gif [77]: /embed/inline-graphic-71.gif [78]: /embed/inline-graphic-72.gif [79]: /embed/inline-graphic-73.gif [80]: /embed/inline-graphic-74.gif [81]: /embed/inline-graphic-77.gif [82]: /embed/inline-graphic-78.gif [83]: /embed/inline-graphic-79.gif [84]: /embed/inline-graphic-80.gif [85]: /embed/inline-graphic-81.gif [86]: /embed/inline-graphic-82.gif [87]: /embed/inline-graphic-83.gif [88]: /embed/inline-graphic-84.gif [89]: /embed/graphic-12.gif [90]: /embed/graphic-13.gif [91]: 
/embed/inline-graphic-85.gif [92]: /embed/inline-graphic-86.gif [93]: /embed/inline-graphic-87.gif [94]: /embed/inline-graphic-88.gif [95]: /embed/inline-graphic-89.gif [96]: /embed/inline-graphic-90.gif [97]: /embed/inline-graphic-91.gif [98]: /embed/inline-graphic-92.gif [99]: /embed/inline-graphic-93.gif [100]: /embed/graphic-14.gif [101]: /embed/inline-graphic-94.gif [102]: /embed/inline-graphic-95.gif [103]: /embed/inline-graphic-96.gif [104]: /embed/inline-graphic-97.gif [105]: /embed/inline-graphic-98.gif [106]: /embed/inline-graphic-99.gif [107]: /embed/inline-graphic-100.gif [108]: /embed/graphic-15.gif [109]: /embed/inline-graphic-101.gif [110]: /embed/inline-graphic-102.gif [111]: /embed/inline-graphic-103.gif [112]: /embed/inline-graphic-104.gif [113]: /embed/inline-graphic-105.gif [114]: /embed/inline-graphic-106.gif [115]: /embed/inline-graphic-107.gif [116]: /embed/inline-graphic-108.gif [117]: /embed/inline-graphic-109.gif [118]: /embed/inline-graphic-110.gif [119]: /embed/inline-graphic-75.gif [120]: /embed/inline-graphic-76.gif