Skip to main content
bioRxiv
  • Home
  • About
  • Submit
  • ALERTS / RSS
Advanced Search
New Results

On the Number of Siblings and p-th Cousins in a Large Population Sample

View ORCID ProfileVladimir Shchur, View ORCID ProfileRasmus Nielsen
doi: https://doi.org/10.1101/145599
Vladimir Shchur
1Department of Integrative Biology, University of California Berkeley 4098 Valley Life Sciences Building (VLSB) Berkeley, CA 94720-3140
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for Vladimir Shchur
Rasmus Nielsen
1Department of Integrative Biology, University of California Berkeley 4098 Valley Life Sciences Building (VLSB) Berkeley, CA 94720-3140
2Museum of Natural History, University of Copenhagen, Øster Voldgade 5-7, 1350 København K, Denmark
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for Rasmus Nielsen
  • Abstract
  • Full Text
  • Info/History
  • Metrics
  • Preview PDF
Loading

1. Abstract

The number of individuals in a random sample with close relatives in the sample is a quantity of interest when designing Genome Wide Association Studies (GWAS) and other cohort based genetic, and non-genetic, studies. In this paper, we develop expressions for the distribution and expectation of the number of p-th cousins in a sample from a population of size N under two diploid Wright-Fisher models. We also develop simple asymptotic expressions for large values of N. For example, the expected proportion of individuals with at least one p-th cousin in a sample of K individuals, for a diploid dioecious Wright-Fisher model, is approximately 1 − e−(22p−1)K/N. Our results show that a substantial fraction of individuals in the sample will have at least a second cousin if the sampling fraction (K/N) is on the order of 10−2. This confirms that, for large cohort samples, relatedness among individuals cannot easily be ignored.

2. Introduction

As genomic sequencing and genotyping techniques are becoming cheaper, the data sets analysed in genomic studies are becoming larger. With an increase in the proportion of individuals in the population sampled, we might also expect an increase in the proportion of related individuals in the sample. For example, Moltke et al. (2014) found in a sample of 2,000 Inuit from Greenland that almost half of the sample had one or more close relatives in the sample. The census population size for Greenland Inuit is only about 60,000 individuals and the effective population size might be substantially lower. Henn et al. (2012) found 5000 pairs of third-cousin and 30,000 pairs of fourth cousin relatives in a sample of 5000 selfreported Europeans, with nearly every individual having a detected cryptic relationship. In Genome Wide Association Studies (GWAS), related individuals are routinely removed from the sample, but other strategies also exist for using relatedness as a covariate in the statistical analyses (e.g., Visscher et al. 2008). These observations raise the following question: given a particular effective population size, how many close relatives would we expect to find in a sample? The answer to this question may help guide study designs and strategies for addressing relatedness in population samples and improve design for GWAS. Of particular interest is the number of individuals in the sample without relatives, i.e. the number of individuals remaining in the sample if individuals with relatives are removed.

Substantial progress has been made on understanding the structure of a pedigree in a population. For example, Chang (1999) showed that the most recent common ancestor of all present-day individuals is expected to have lived log2(N) generations in the past if N is the population size. A great deal of progress has also been made in understanding the difference between genealogical processes in full diploid pedigree models versus the approximating coalescent process (e.g., Wakeley et al. 2012; Wilton et al. 2016). However, the distribution and expectation of the number of individuals with relatives in a random population sample is still unknown.

In this paper we will address this question by exploring two diploid and dioecious Wright-Fisher models. We will use these models to derive distributions and expectations of the number of individuals that have, or do not have, siblings, first, second, etc. cousins within a sample.

3. Dioecious Wright-Fisher Model

The Wright-Fisher model (Fisher 1930; Wright 1931) describes the genealogy of a population with constant effective population size N. The model assumes that generations do not overlap. Let Embedded Image and Embedded Image be two successive generations with N individuals in each. Then for each individual ĝi from Embedded Image a parent gj is selected randomly and uniformly from Embedded Image.

In our study we consider a diploid population where each individual has two parents, one male and one female. Similarly to the original haploid Wright-Fisher model, the dioecious Wright-Fisher model (see e.g. Nagylaki 1997, King et al. 2017) assumes that generations do not overlap and, for each individual, the parents are chosen from the previous generation uniformly at random. The difference is that instead of a single parent, in the dioecious case, each individual has two parents, one male and one female, which are drawn independently from the corresponding sets of males and females in the preceding generation. We will refer to this model as the ‘non-monogamous Wright-Fisher model’ because we will also consider a model in which female and male parents form monogamous pairs. We will refer to the latter model as the ‘monogamous Wright-Fisher model’. As we will assume exactly equal proportions of males and females, the monogamous Wright-Fisher model is identical to the bi-parental monoecious model in King et al. (2017).

For both the non-monogamous and monogamous models, we assume that there are exactly N male and N female individuals. Each individual from generation Embedded Image (we enumerate generations backward in time starting from 0, i.e. Embedded Image is the present generation and Embedded Image is the generation of parents of individuals from Embedded Image) is assigned to a parent pair (one male and one female parent) from Embedded Image. As we described above, under the non-monogamous model, male and female parents are chosen independently from each other for every individual. In the monogamous case, the parent pairs are fixed, i.e. we assume each male and female is part of exactly one potential parent pair.

The two diploid models are similar to each other in that the marginal distribution of the number of offspring of each individual is binomially distributed with mean 2. However, they differ from each other in the correlation structure among parents. The important difference between these two models is that the monogamous model does not allow for half-siblings (we say that two individuals are half-siblings if they share only one parent). On the other hand under the non-monogamous model, full siblings (individuals which share both parents) have a very low probability of appearing.

We note that other dioecious versions of the Wright-Fisher models could be considered with varying degree of promiscuity, but most would likely have distributions of relatedness that are somewhat intermediate between these two models, as long as they otherwise maintain Wright-Fisher dynamics. We also note that none of these models probably accurately describe the behaviour of human populations, which likely have a much higher variance in offspring number, variable population sizes, etc.

As mentioned above, individuals are siblings if they have the same parents. If individuals share only one parent, we call them half-siblings. We say that two individuals are p-th cousins if there is at least one coalescence between their genealogies in generation Embedded Image. Of course, the amount of shared genetic material would depend on the number of shared ancestors in a certain generation. For two individuals, the number of shared ancestors is given in the supplementary materials of King et al. (2017) (see the discussion below). Notice, that two individuals can have different relations simultaneously. An example of such a situation is given in Figure 1: the individuals related by this genealogy are half-siblings and first-cousins at the same time.

Figure 1.
  • Download figure
  • Open in new tab
Figure 1.

The two offspring (Individual 1 and Individual 2) related by this genealogy are half-siblings and first cousins at the same time. Notice that Individual 1 has a tree-like genealogy (no cycles, no inbreeding). The second individual though has inbreeding in its genealogy.

Let Embedded Image be a random sample of size K of individuals from the present-day generation Embedded Image of a population described by either a monogamous or non-monogamous Wright-Fisher models. In this paper we derive the number UT (notation for monogamous case) or VT (notation for non-monogamous case) of individuals in Embedded Image which do not have (T – 1)-order cousins (T = 1 would stand for (half-)siblings, T = 2 for first cousins, etc.) within Embedded Image and have genealogy with no cycles. We will derive the probability distribution of U1 and V1 and expectations of UT and VT for T > 2 in terms of Stirling numbers of the second kind. Further we present a simple analytical approximation of expectations of UT and VT. We derive this approximation as an exponential function of the ratio of the sample size to the effective population size.

The condition that individual’s genealogy does not have cycles means that there is no inbreeding in the history of the individual. Indeed, a cycle appears when two mating individuals share an ancestor, hence they are related to each other. On the contrary, if there is no inbreeding within T generations of ancestors of a certain individual, then all the ancestors have different parents, hence in the Embedded Image there are exactly 2k ancestors of the individual under consideration.

Notice that the requirement that there is no inbreeding is satisfied as long as 2T is small compared to the effective population size N. In this paper we are particularly interested in large populations. We will compute the fraction of individuals with siblings (T = 1) or p-th cousins (T = p + 1) in a sample in the limit of the effective population size N going to infinity. For fixed values of T and the sample size, K, the number of siblings and cousins goes to zero in the limit of large N. However, for a fixed ratio K/N, there is a positive expected number of siblings and offspring, but the expected number of cycles in the genealogy is small compared to K. This observation follows from the fact that the probability that two individuals share a parent is 1/N, which is a rare event for large N. Hence for large N all the ancestors of an individual are unrelated with high probability. We will, therefore, approximate the number of individuals who have siblings (or p-th cousins) by K – UT or K – VT depending on the model. We notice that using this method we cannot characterise, for example, the overlap between the set of individuals who have siblings and the set of individuals who have first-cousins, so we cannot provide an approximation of the number of individuals who have at least some kind of relatives within several generations.

Every genealogy has the same probability under the model. Hence our problem is equivalent to counting the number of possible genealogies with certain properties. To enumerate different genealogies, we will use the following approach. Firstly, we divide a sample Embedded Image into subsets of siblings (in case of non-monogamous model, we create two independent partitions of the sample, one of partitions corresponding to shared fathers and the other corresponding to shared mothers). Then we assume that individuals from the same subset have the same parent couple (in the case of the monogamous model) or the same father or mother (in the case of the non-monogamous model), and individuals from different subsets have different parents. This approach is the basis for our analyses and leads us to the proof of formulas for expectations of UT and VT.

The combinatorial technique used to obtain exact formulas for expectations of UT and VT is very similar to the technique used in King et al. (2017) (see supplementary materials S1). In particular, we have to keep track of the number of ancestors at each generation which is the question of interest of the section S1.1 of King et al. (2017). Notice, that results in our paper and the result of S1.2 of King et al. (2017) complement each other. We find the expected number of individuals in a sample which do not have any relatives with respect to a certain generation, hence we know approximately the number of individuals which share at least one ancestor in that generation with at least one more individual from the given sample. However we cannot characterise finer relatedness (e.g. the number of shared ancestors in a given generation,) as more than one coalescence per generation between genealogies of two individuals is possible. The pairwise analysis of individuals can be performed using King et al. (2017) results, though it can be computationally challenging. The asymptotic behaviour derivation for E(UT)/K and E(VT)/K (for fixed K/N ratio) is a completely new result to the best of our knowledge.

We remind the reader that the Stirling number of the second kind S(n, k) is the number of ways to partition a set of size n into k non-empty disjoint subsets. A generalisation of this is the r–associated Stirling number of the second kind, Sr(n, k) (Comtet 1974), which is the number of partitions of a set of size n into k non-empty subsets of size at least r. We provide more detailed information on the Stirling numbers of the second kind in the Appendix.

4. Probability distribution U1

We say that two individuals are siblings if they have the same parents. In this section we study the number of individuals U1 without siblings within a sample of a population. We derive both the probability distribution and expectation of U1.

Theorem 1.

Let U1 be a random variable representing the number of individuals in a sample Embedded Image of size K without siblings in Embedded Image under monogamous dioecious Wright-Fisher model. Then

  • the probability distribution of U1 is Embedded Image

  • the expectation of U1 is Embedded Image

  • if K/N = α Embedded Image

Proof. We begin the proof by computing the number of possible partitions of Embedded Image into u subsets of size 1 and t subsets of size greater than or equal to 2. Each subset of such a partition corresponds to the descendants in Embedded Image of the same couple of parents from Embedded Image. There are Embedded Image such partitions (see figure 2). Here the first multiplier corresponds to the number of choices of the first u individuals and the second multiplier corresponds to the number of partitions of the remaining K – u individuals into t disjoint subsets.

Now we need to assign u + t subsets to different couples of parents from Embedded Image. There are Embedded Image possibilities for choosing couples that have descendants in Embedded Image and (u+t)! permutations which assign these particular couples to different subsets of the given partitions of Embedded Image.

Finally, summing over all possible values of t we get Embedded Image where ⌊·⌋ stands for the floor integer part.

Figure 2.
  • Download figure
  • Open in new tab
Figure 2.

Illustration to the proof of Theorem 1. Each dot correspond to an individual. The bottom set of points corresponds to the individuals in the sample Embedded Image. This sample is divided in disjoint subsets (the set of points in the middle): this partition corresponds to sets of siblings, or in other words individuals from each subset will be assigned to the same couple of parents. The top row corresponds to the set of couples in the parent generation. Subsets of siblings (from the middle row) are assigned to different couples of parents (from the top row).

The expression for expectation of U1 is much simpler. The probability π1 that an individual Embedded Image does not have any siblings in Embedded Image is π1 = (1 – 1/N)K−1, because all other individuals from Embedded Image can be assigned to any couple of parents except for the parents of the individual Embedded Image. By linearity, the expectation of U1 is Embedded Image

To prove the last statement of the theorem it is enough to rewrite Embedded Image because K/N = α by definition. Now notice that Embedded Image

Hence the last statement of the theorem is proved Embedded Image

5. Expectation of U2

In this section we will provide an expression for expectation of the number U2 of individuals in a sample which do not have first cousins in this sample. We will also establish a limit for Embedded Image in the case of a fixed ratio between K and N.

Theorem 2.

Let U2 be a random variable representing the number of individuals in a sample Embedded Image of size K without first cousins in Embedded Image under a monogamous dioecious Wright-Fisher model. Then the expectation of U2 is Embedded Image

Proof. Similarly to the case of Embedded Image, we need to find the probability π2 for a single individual not to have first cousins within Embedded Image. Then the expectation Embedded Image. Denote individuals from Embedded Image which have descendants in Embedded Image by Embedded Image.

Choose an individual Embedded Image, let Embedded Image and Embedded Image, be parents of s0. If s0 does not have first cousins, then Embedded Image and Embedded Image are assigned to different couples from Embedded Image and those couples do not have other descendants in Embedded Image.

Similarly to derivation of distribution of U1, we first partition Embedded Image into m disjoint subsets. We choose m couples from Embedded Image and establish a one-to-one correspondence between the subsets and the couples. There are N possibilities to choose a couple of parents for Embedded Image, N – 1 choices for Embedded Image and (N – 2) choices for all other 2m – 2 individuals from Embedded Image. Summing over m we get Embedded Image

Our next goal is to find the limit of Embedded Image for a fixed ratio of sample size to the population size. We assume that K/N = α for some constant 0 ≤ α ≤ 1 and we consider the limit of Embedded Image for K → ∞.

Theorem 3.

Let 0 ≤ α ≤ 1 and set K = αN. Then Embedded Image

The following lemma states that the sum of the first βK terms of the series in the formula for Embedded Image is small for large values of K. This makes it possible to make further approximations under the hypothesis that m = O(K).

Lemma 1.

Let K = αN for some 0 ≤ α ≤ 1 and set β = (2 ln 2)−1. Then Embedded Image

Proof. Denote Embedded Image

First, notice that Embedded Image

We will show that for β = (2 ln2)−1 < 1/2 Embedded Image which will immediately prove the statement of the Lemma.

Our goal is to prove that Embedded Image for some constants c1,c2 and K large enough.

We begin by approximating the following ratio for m ≤ ⌊βK⌋ Embedded Image by applying approximation (11). Here G1 = G(K, m) and G2 = G(K, m + 1).

Notice that 0 < G1 < G2 < − W0(−2e−2) < 1/2. The following term is bounded by a constant (we remind the reader that 0 < m ≤ βK < K/2) Embedded Image

After simplification, all the factorials in the formula are of the form (constK)!, hence they can be approximated uniformly in K by Stirling’s approximation Embedded Image

For simplicity of notations we drop all terms 1 + O(1/K) in (2). We also notice that Embedded Image

So for K large enough the ratio (2) has the following approximation Embedded Image

The derivative of G(x)1/x(x – G(x))1−1/x (x ≥ 1) with respect to x is Embedded Image

H(x) has one real root x = 2ln2 if x ≥ 1. The derivative H(x) is positive for x > 2ln2, so G(x)1/x(x – G(x))1−1/x is an increasing function of x for x > 2 ln 2. Hence as soon as K/m > 2ln2, or m < K/(2ln2), the following inequality holds Embedded Image

Consequently, for sufficiently large K we obtain the following upper bound for (2) Embedded Image

Hence, by recursion for m < ⌊βK⌋ Embedded Image

Now we use the obtained inequality to prove limit (1) Embedded Image where the second equality holds by summing over the geometric progression.

Lemma 2.

Let K = αN for some 0 ≤ α ≤ 1, set β = (2 ln 2)−1. Then for any m such that ⌊βK⌋ ≤ m < K Embedded Image

Proof. From the proof of Lemma 1, for K large enough and for β ≤ m/K ≤ 1 Embedded Image

Notice that xex = −1/e + O((x − 1)2) near x = −1. Hence 1 − G(x) = O(|x − 1|) and x − G(x) = O(|x − 1|) for x → 1. By definition, the Lambert W-function (Olver et al. (2010)) is the inverse function of xex. If x1 > −1 and x2 < −1 are two points in the neighbourhood of −1 such that x1ex1 = x2ex2, then |x1 – x2| = O(|x1 – 1|) = O(|x2 – 1|). For x > 1, −xe−x ∈ [−1/e; 0]. The value of the main branch, W0(xex), is in the interval [−1,0]. So −x and W0(−xe−x) correspond to x1 and x2.

Hence Embedded Image

Now we use mean value theorem to approximate Embedded Image where H(x) is given by expression (3). Denote Δx = |x – 1|, and notice that Embedded Image and ln G(x) are continuous near x = 1 and Ĥ(1) = 1, ln G(1) = 0. So for small Δx Embedded Image and hence Embedded Image which leads to the approximation of (5) with m = O(K) Embedded Image

We use this estimate and the Taylor expansion of logarithm to get Embedded Image

Finally, we estimate the ratio TK,N(m)/TK,N(m + 1) for K large enough Embedded Image with some constant C0, which depend on α.

Now we are ready to prove the theorem.

Proof. Firstly, notice that Embedded Image and (1 – 1/N)(1 – 2/N)2 → 1 as N → ∞. Hence, the lower bound is valid for any α and K Embedded Image where the right part trivially converges to e−4α with N → ∞ (we remind that K = αN for some constant 0 ≤ α ≤ 1).

Now we prove that this bound is sharp by applying subsequently Lemmas 1 and 2 Embedded Image because from Lemma 2 it follows Embedded Image

6. General Case: Expectation of Up for p ≥ 2

Similarly to the expectation of U2, we can find the probability of the expected numbers Up(p ≥ 2) of individuals which do not have (p − 1)-cousins and with pedigrees without cycles.

Lemma 3.

Let Embedded Image be a set and Embedded Image be a subset of size Embedded Image. The number of partitions of a set Embedded Image of size N into M disjoint subsets such that all elements of Embedded Image are in different subsets is Embedded Image

Proof. Let Embedded Image, Embedded Image, such that each element, Embedded Image, makes its own subset Pi = {ei} in the partition of Embedded Image. If Embedded Image there are Embedded Image ways to choose such a subset. Then, Embedded Image should be split into M – t non-empty subsets, Pt+1, Pt+2,…, PM, to obtain a partition of Embedded Image into exactly M subsets. There are S(N – k, M – t) possible ways of doing that. Each of the k – t elements of Embedded Image are then added to distinct subsets among the remaining M – t subsets, Pi, i > t, which can be done in Embedded Image ways.

Summing over all possible values of t we prove the statement.

Remark 1. For k = 1, Lemma 3 turns into the well-known recursive formula for Stirling numbers of the second kind.

The next theorem establishes the expression for the expectation of Up and its limit for fixed K to N ratio in the general case. Due to the size of the formula we had to introduce additional notations for readability.

Theorem 4.
  • For any natural p ≥ 1 the expectation of Up is Embedded Image and Embedded Image

  • If K = αN (i = 1, 2,…,p), then Embedded Image

Proof. To prove the first statement, we apply repeatedly the same arguments as used for Theorem 2: for each generation, we split the ancestors of the sample into subsets of siblings while controlling that ancestors of the given individual are not in the same subsets.

The proof of (7) is similar to the proof of Theorem 3. First we can show that we can substitute summations over mi > βK for some constant β (see Lemma 1). Then we use estimations for Qi that are similar to those obtained in Lemma 2.

7. Non-monogamous Wright-Fisher model

Similar results to those obtained for the monogamous case also hold for the non-monogamous dioecious Wright-Fisher model. However, in contrast to the monogamous case, the probability that two individuals are full siblings or full p-th cousins (i.e. sharing two ancestors) is rather small. Most familial relationships would involve sharing only one common ancestor at a given generation, i.e. related individuals would typically be half siblings or half p-th cousins.

Let Vp be a random variable representing the number of individuals in a sample Embedded Image of size K without half siblings or full siblings (p = 1) or half p-th cousins or full p-th cousins (p ≥ 2) in Embedded Image under the non-monogamous Wright-Fisher model. The next theorem established the expression for the expectation of Vp and its limit for K → ∞ in the case of fixed ratio between K and the population sizes N.

Theorem 5.
  • For any natural p ≥ 1, the expectation of Vp is Embedded Image where we assume m0 = K and Embedded Image and Embedded Image

  • If population sizes K = αN, then Embedded Image

    The proof of the theorem is similar to the case of the monogamous model. The function Pj counts the number of possibilities to have exactly mj parents (male plus female)

    In particular, Embedded Image

Corollary 1.

The qualitative behaviour of Ui and Vi is the same, more precisely Embedded Image

8. Numerical Results

In this section we present numerical results for expectations of Up and Vp, p = 1,2, 3. Every plot of figures 3 and 4 represents the behaviour of Embedded Image or Embedded Image for a particular p = 1, 2, 3. Those values are computed by formulas (6) or (8) for different values of N(N = 20,100,200) as a function of the ratio K/N. We also add corresponding limiting distribution to every plot to illustrate the convergence.

Figure 3.
  • Download figure
  • Open in new tab
Figure 3.

Embedded Image as a function of the K/N ratio for N = 50 (●), 100 (▲), 200 (▀) and the corresponding limiting distribution (⋆).

Figure 4.
  • Download figure
  • Open in new tab
Figure 4.

Embedded Image as a function of the K/N ratio for N = 50 (●), 100 (▲), 200 (▀) and the corresponding limiting distribution (⋆).

Because the effective population sizes are typically rather large (at least thousands of individuals) we might expect a satisfactory approximation of E(Up) and E(Vp) by its limiting distribution even for relatively small K/N ratios. One can also check that in our proofs the errors in the estimates are of the order of 1/N, hence for the desired ratio we can estimate the absolute error for smaller values of K, N numerically and then increase N to get the desired precision.

9. Discussion

In this paper we analysed the expected values of the number of individuals without siblings and p-th cousins in a large sample of a population. To do that we used two extensions of Wright-Fisher model which keeps track of the two parents of an individual.

The first extension corresponds to a monogamous population and the second to a non-monogamous population. The two models represent two extremes in terms of degree of promiscuity, and we might expect that in most other dioecious versions of the Wright-Fisher model, with intermediate degrees of promiscuity, the number of individuals without siblings or p-th cousins is somewhere in between those two regimes - as long as the models otherwise maintain Wright-Fisher dynamics.

Under both models we derived expressions for these expectations under the hypothesis that the pedigrees have no cycles (except for the one appearing in full sibs). Notice that this restriction is not too strong, because one can easily show that the chance that an individual has a pedigree with a cycle is a second-order effect as soon as the number of ancestors (≤ 2p) in a generation is much smaller than the effective population size N.

The important result of the paper is the limiting distributions for Embedded Image and Embedded Image. It turns out that Embedded Image and Embedded Image converge point-wise to e−cK/N where the constant c is 22p−2 for Up and 22p−1 for Vp.

We notice that even when the sampling fraction is relative low, the proportion of individuals in the sample with no close relatives can be small. For example, for the non-monogamous model and a sampling faction of 5%, the proportion of individuals with at least a second cousin is approx. 70% if the population size is at least N = 200. For a sampling fraction of 2% the proportion in individuals with at least a second cousin is close to 50% for reasonably large population sizes in case of random mating population or almost 30% in case of monogamous population. For sampling fractions on the order of 0.01 or larger, we expect a large proportion of individuals to have at least one other individual in the sample to which they are closely related. This fact should be taken into account in all genetic, and non-genetic, epidemiological studies working on large cohorts.

In the study of Danish population structure, Athanasiadis et al. (2016) discovered 3 pairs of first cousins and one pair of second cousins in a sample of just 406 individuals. Based on their estimate of an effective population size of 500,000, we would expect to find 1.32 individuals with first cousins under the monogamous model and 2.63 individuals with first cousins under the non-monogamous model. The empirical number of 3 first-cousins in the sample is therefore not significantly different of the expected number of 1.32 under the monogamous model assumption. It is also not statistically significantly different from the expected number of 2.63 under the non-monogamous model. The expected number of second cousins in the sample is 5.24 and 10.41 under the monogamous and non-monogamous models, respectively. The inferred number of 1 is much smaller than this, likely because it is difficult to infer second cousins empirically. We would in general expect that the true number of second cousins is larger than the true number of first cousins.

Notice, that the probability for two individuals to be p-th cousins is approximately Embedded Image, where cm is 1 for monogamous model and 2 for non-monogamous model. Hence, the expected number of pairs of p-th cousins in a sample of size K is approximately Embedded Image. Henn et al. (2012) found approximately 5000 pairs of third cousins and 30000 pairs of fourth cousins in a sample of only 5000 individuals with European ancestry, which would be expected for effective population sizes of 2 · 105 – 3 · 105 under the monogamous model and twice that (4 · 105 – 6 · 105) under the non-monogamous model. These numbers are roughly compatible with estimates of effective population sizes obtained for modern European populations (e.g., Athanasiadis et al. (2016)). We note that effective population size is a tricky concept for a spatially distributed population such as European humans, but the breeding structure observed in these samples suggest that the degree of relatedness in the sample is compatible with population sizes on the order of 105 – 106.

Acknowledgement

The work was supported by the UCOP Catalyst Award CA-16-376437.

10. Appendix: Stirling numbers of the second kind and their generalisation

In this section we provide definitions and properties of Stirling numbers of the second kind.

The Stirling number of a second kind S(n,k) is the number of ways to partition a set of size n into k non-empty disjoint subsets. These numbers can be computed using the recursion (Abramowitz and Stegun 1972) Embedded Image with S(0, 0) = S(n, 0) = S(0, n) = 0 for n > 0. Notice that S(n, n) = 1.

An r–associated Stirling number of the second kind, Sr(n, k) (Comtet 1974), is the number of partitions of a set of size n into k non-empty subsets of size at least r. These numbers obey a recursion formula (Comtet 1974) similar to that for Stirling numbers of second kind Embedded Image with Sr(n, 0) = Sr(1, 1) = 0. In particular, for r = 2 Embedded Image

10.1 Uniformly valid approximation for S(n, k)

The following useful approximation of Stirling numbers of the second kind is established by Temme (1993) Embedded Image where t0 = n/k – 1, x0 ≠ 0 is the non-zero root of the equation Embedded Image and Embedded Image

The following form of this approximation is known Embedded Image with G = −W0(−n/ke−n/k), where W0 is the main branch of Lambert W-function (Olver et al. 2010).

We did not find a reference for the formula (11) in the literature, so we provide briefly the proof. Notice that −1/e < −n/ke−n/k < 0, hence G ∈ (0,1). Let us show that x0 = n/k – G is the non-zero root of equation (10) Embedded Image where the second equality is due to the Lambert function property e−W(x) = W(x)/x. Substituting t0 and x0 in approximation (9) by their values and simplifying the formula, one gets the needed result. Obviously, Embedded Image

Now consider eAkn−k Embedded Image which finished the proof of equivalence of approximations (9) and (11).

References

  1. [1].↵
    Abramowitz M, and Stegun IA, (1972) Handbook of mathematical functions with formulas, graphs and mathematical tables. Dover, New York, p 825
  2. [2].↵
    Athanasiadis G, Cheng JY, Vilhjlmsson BJ, et al. (2016) Nationwide Genomic Study in Denmark Reveals Remarkable Population Homogeneity. Genetics 204(2):711–722. doi:10.1534/genetics.116.189241
    OpenUrlAbstract/FREE Full Text
  3. [3].↵
    Chang JT (1999) Adv Appl Prob 31:1002–1026
    OpenUrl
  4. [4].↵
    Comtet L (1974) Advanced Combinatorics. Reidel, Dordrecht, Holland
  5. [5].↵
    Fisher RA (1930) The genetical theory of natural selection. Clarendon Press, Oxford
  6. [6].↵
    Henn BM, Hon L, Macpherson JM, Eriksson N, Saxonov S, Pe’er I, et al. (2012) Cryptic Distant Relatives Are Common in Both Isolated and Cosmopolitan Genetic Samples. PLoS ONE 7(4):e34267. doi:10.1371/journal.pone.0034267
    OpenUrlCrossRefPubMed
  7. [7].↵
    King L, Wakeley J, Carmi S (2017) A non-zero variance of Tajima?s estimator for two sequences even for infinitely many unlinked loci. Theor Popul Biol. doi:10.1016/j.tpb.2017.03.002
    OpenUrlCrossRef
  8. [8].
    Moltke I, Fumagalli M, Korneliussen TS, Crawford JE, et al. (2015) Uncovering the genetic history of the present-day Greenlandic population. Am J Hum Genet 96:54–69. doi:10.1016/j.ajhg.2014.11.012
    OpenUrlCrossRefPubMed
  9. [9].↵
    Nagylaki T (1997) Multinomial-Sampling Models for Random Genetic Drift. Genetics 145:485–491
    OpenUrlAbstract/FREE Full Text
  10. [10].↵
    Olver FWJ et al. (2010) NIST Handbook of Mathematical Functions. Cambridge University Press, Cambridge, p 131
  11. [11].↵
    Temme NM (1993) Asymptotic Estimates of Stirling Numbers. Stud Appl Math 89:233–243. doi: 10.1002/sapm1993893233
    OpenUrlCrossRef
  12. [12].↵
    Visscher PM, Andrew R, Nyholt DR (2008) Genome-wide association studies of quantitative traits with related individuals: little (power) lost but much to be gained. Eur J Hum Genet 16:387–390. doi:10.1038/sj.ejhg.5201990
    OpenUrlCrossRefPubMed
  13. [13].↵
    Wakeley J, King L, Low BS, Ramachandran S (2012) Gene Genealogies Within a Fixed Pedigree, and the Robustness of Kingmans Coalescent. Genetics 190: 1433–1445. doi: 10.1534/genetics.111.135574
    OpenUrlAbstract/FREE Full Text
  14. [14].↵
    Wilton PR, Baduel P, Landon MM, and Wakeley J (2016) Population structure and coalescence in pedigrees: comparisons to the structured coalescent and a framework for inference. Theoretical Population Biology 115:1–12 doi:10.1101/054957
    OpenUrlCrossRef
  15. [15].↵
    Wright S (1931) Evolution in Mendelian populations. Genetics. 16:97–159
    OpenUrlFREE Full Text
Back to top
PreviousNext
Posted May 03, 2018.
Download PDF
Email

Thank you for your interest in spreading the word about bioRxiv.

NOTE: Your email address is requested solely to identify you as the sender of this article.

Enter multiple addresses on separate lines or separate them with commas.
On the Number of Siblings and p-th Cousins in a Large Population Sample
(Your Name) has forwarded a page to you from bioRxiv
(Your Name) thought you would like to see this page from the bioRxiv website.
CAPTCHA
This question is for testing whether or not you are a human visitor and to prevent automated spam submissions.
Share
On the Number of Siblings and p-th Cousins in a Large Population Sample
Vladimir Shchur, Rasmus Nielsen
bioRxiv 145599; doi: https://doi.org/10.1101/145599
Digg logo Reddit logo Twitter logo CiteULike logo Facebook logo Google logo Mendeley logo
Citation Tools
On the Number of Siblings and p-th Cousins in a Large Population Sample
Vladimir Shchur, Rasmus Nielsen
bioRxiv 145599; doi: https://doi.org/10.1101/145599

Citation Manager Formats

  • BibTeX
  • Bookends
  • EasyBib
  • EndNote (tagged)
  • EndNote 8 (xml)
  • Medlars
  • Mendeley
  • Papers
  • RefWorks Tagged
  • Ref Manager
  • RIS
  • Zotero
  • Tweet Widget
  • Facebook Like
  • Google Plus One

Subject Area

  • Evolutionary Biology
Subject Areas
All Articles
  • Animal Behavior and Cognition (2633)
  • Biochemistry (5221)
  • Bioengineering (3643)
  • Bioinformatics (15711)
  • Biophysics (7213)
  • Cancer Biology (5593)
  • Cell Biology (8045)
  • Clinical Trials (138)
  • Developmental Biology (4735)
  • Ecology (7462)
  • Epidemiology (2059)
  • Evolutionary Biology (10520)
  • Genetics (7698)
  • Genomics (10082)
  • Immunology (5148)
  • Microbiology (13823)
  • Molecular Biology (5354)
  • Neuroscience (30577)
  • Paleontology (211)
  • Pathology (871)
  • Pharmacology and Toxicology (1519)
  • Physiology (2234)
  • Plant Biology (4983)
  • Scientific Communication and Education (1036)
  • Synthetic Biology (1379)
  • Systems Biology (4130)
  • Zoology (803)