ABSTRACT
In this work we develop a novel algorithm for reconstructing the genomes of ancestral individuals, given genotype or sequence data from contemporary individuals and an extended pedigree of family relationships. A pedigree with complete genomes for every individual enables the study of allele frequency dynamics and haplotype diversity across generations, including deviations from neutrality such as transmission distortion. When studying heritable diseases, ancestral haplotypes can be used to augment genome-wide association studies or compute polygenic risk scores for the reconstructed individuals.
The building blocks of our reconstruction algorithm are segments of Identity-By-Descent (IBD) shared between two or more genotyped individuals. The method alternates between finding a source for each IBD segment and assembling IBD segments placed within each ancestral individual. After each iteration we perform conflict resolution to remove IBD segments that do not align with well-reconstructed haplotypes and upweight the probability that these segments should be placed in other individuals. We repeat this process until we are no longer successfully reconstructing additional ancestral haplotypes. Unlike previous approaches, our method is able to accommodate complex pedigree structures with hundreds of individuals genotyped at millions of SNPs.
We apply our method to an Old Order Amish pedigree from Lancaster, Pennsylvania, whose founders came to the United States from Europe during the early 18th century. The pedigree includes 1338 individuals from the past 10 generations, 394 with genotype data. The motivation for reconstruction is to understand the genetic basis of diseases segregating in the family through tracking haplotype transmission over time. Using our algorithm thread, we are able to reconstruct an average of 230 ancestral individuals per autosome. thread was developed for endogamous populations, but can be applied to any extensive pedigree with the recent generations genotyped. We anticipate that this type of practical ancestral reconstruction will become more common and necessary to understand rare and complex heritable diseases in extended families.
1 Introduction
Pedigree structures and associated genetic data provide a wealth of information for studying recent evolution. Nuclear families (parents and children) and other small pedigrees have been used to estimate mutation and recombination rates in humans [6, 8, 27, 44] and other species [24, 41, 46]. Pedigrees have informed breeding of domesticated animals [33], enabled the study of short-term evolution in natural populations [9], and can be used to study heritable diseases [4].
Genetic studies of rare, recessive traits pose a challenge to researchers when individuals expressing these traits are too sparse or too scattered to obtain sufficient genetic data. Endogamous populations with detailed pedigree records provide an important exception. Endogamous populations, defined by the practice of marrying within a social, ethnic, or geographic group, are often characterized by small effective population sizes with limited external admixture. These groups are of great interest to geneticists because a single small population can provide enough data to inform rare trait and rare variant studies with worldwide implications [32, 37]. Endogamous populations are also informative for common disease [16, 45].
Extended pedigrees from endogamous populations provide a valuable system for studying heritable disease, but genetic data is typically limited to recent generations. If genetic information from every individual in the pedigree were available, we would be in a better position to understand the transmission of causal variants throughout the history of the population. More specifically, we often know the disease phenotypes of ancestral individuals, but cannot obtain their genetic information. In these cases, reconstructed haplotypes allow us to augment genome-wide association studies (GWAS), where large sample sizes are essential. In addition, reconstructed genomes would enable the computation of polygenic risk scores (PRS) [25, 48] for ancestral individuals.
Reconstructed ancestral haplotypes also allow us to study genome dynamics over short time scales, including inheritance patterns and haplotype transmission. In populations with large nuclear families, transmission distortion [11, 34] and other deviations from neutrality are particularly visible. Understanding which parts of the genome are over- or under-represented in the recent generations could help us identify forms of deleterious variation. From a theoretical perspective, there has been relatively little work on the question of how much ancestral reconstruction is possible given genetic information from contemporary individuals (see Hayes et al. [18] for an example from a small livestock pedigree).
Previous work on ancestral reconstruction has typically been applied to small pedigrees with no loops (marriage of close relatives). One of the earliest examples comes from the Lander-Green algorithm [28], which uses a hidden Markov model (HMM) with inheritance vectors as the hidden state and genotypes as the observed variables. Methods such as SimWalk2 [43] and Merlin [2] use descent graphs and sparse gene flow trees (respectively) to extend the idea of likelihood-based computation to larger pedigrees. However, these methods do not perform reconstruction explicitly and also do not handle loops, as tree-based intermediate steps are common to both algorithms. With millions of loci and hundreds of individuals, the runtimes of these methods are prohibitive (see [42] for a runtime overview). Other HMM-based approaches such as HAPPY [35], GAIN [31], and RABBIT [51] reconstruct genome ancestry blocks, but do not tie them to specific individuals. HAPLORE [50] quantifies possible ancestral haplotype configurations but does not incorporate recombination, and the Bayesian approach in Fishelson et al. [14] is more suitable for haplotyping.
Lindholm et al. [30] reconstructed ancestral haplotypes for the purpose of identifying regions that contain susceptibility genes for schizophrenia. However, their pedigree was much smaller (with no loops), far fewer markers (450) were used, and several of the reconstruction steps were done by inspection or by hand, which does not scale to our scenario. Another study [22] reconstructed the African haplotype of an African-European individual who migrated to Iceland in 1802 and had 788 descendants, 182 of whom were genotyped. However, this scenario is much simpler, as the regions of African ancestry within each descendant were easily identified and all belonged to the same individual.
The problem studied here is different from pedigree reconstruction, where genetic information is used to reconstruct (previously unknown) family relationships. See [20, 21, 23, 26, 39, 47] for discussions of pedigree reconstruction.
In this study we apply our method to an Old Order Amish population from Lancaster, Pennsylvania, whose members can trace their ancestry to founders who came from Europe to Philadelphia in the early 18th century (see Figure 3 of [29] for an analysis of the contributions of the 554 founders). The Amish are an ethno-religious group in the Anabaptist tradition, with a history of detailed record keeping and marriage within the Amish community [13]. In this work, we study an unpublished pedigree of 1338 individuals, augmented [3] from a pedigree of 784 individuals originally described in the Amish Study of Major Affective Disorder [10, 15]. Roughly one third of the individuals in the original pedigree display some form of mood disorder, and about 19% have been diagnosed with bipolar disorder specifically [25]. Bipolar disorder in a broad sense is roughly 80% heritable in this pedigree [25], and recent work has focused on understanding the genetic basis of this disease [15]. The availability of genetic data from 394 contemporary individuals from this pedigree gives us an opportunity to use reconstruction as another lens on inheritance patterns of mood disorders.
Here we present a novel algorithm, thread, for reconstructing ancestral haplotypes given an arbitrary pedigree structure and genotyped or sequenced individuals from the recent generations. thread can be applied in a variety of scenarios including pedigrees with loops, inter-generational marriage, and remarriage. More ancestral chromosomes will be reconstructed as the percentage of individuals with genetic data increases, but our method can be applied even when this fraction is modest. This work represents a key step towards understanding the limits of quantifying the genomes of ancestral individuals in the absence of ancient DNA. thread is available as an open-source software package: https://github.com/mathiesonlab/thread.
2 Methods
Problem statement
The first input to thread is a pedigree structure 𝒫. For each individual p ∈ 𝒫, we have information about the mother p(m) and father p(f), which are also members of 𝒫. In the case of founders or married-in individuals, we let p(m) and p(f) be 0. The pedigree may contain loops, meaning that the parents of a child share a recent common ancestor. The second input is a dataset of phased haplotypes (e.g. in Variant Call Format, VCF) from a subset of individuals in the pedigree, typically from the most recent generations. Phasing assigns the alleles of each individual to parental haplotypes. Our aim is to reconstruct the haplotypes of as many ancestral individuals in the pedigree as possible. An illustration of the problem is shown in Figure 1.
High level description
thread is built upon the idea of Identity-By-Descent (IBD). IBD segments are long stretches of DNA shared by a cohort of two or more individuals due to descent from a common ancestor (source). Each segment is analyzed independently (as opposed to working sequentially along the chromosome as an HMM would). We attempt to find the source of each IBD segment, as well as individuals who are on descendance paths from this ancestor to the cohort. After this step we proceed through each individual, clustering and assembling their associated IBD segments into haplotypes. During this grouping step we identify IBD segments that have been poorly placed – in the next iteration we will update their common ancestors. We alternate the process of analyzing IBD segments and individuals until we are no longer building new haplotypes. A schematic of thread is shown in Figure 2, and pseudocode is given in Algorithm S1 (Supplementary Material).
Input pedigree
The Amish pedigree under study was developed from several sources, including the book Descendants of Christian Fisher [5], the Anabaptist Genealogy Database (AGDB) [3] and associated software PedHunter [29], and the Amish Study of Major Affective Disorder [10]. The AGDB is covered by an IRB-approved protocol at the NIH. All work contained within this study was approved by the IRB of the Perelman School of Medicine at the University of Pennsylvania. The complete pedigree structure is shown in Figure S1 (created with the kinship2 R package [40]).
Step 1: We first read in the pedigree structure. We do not require that individuals be separated into generations, and we allow inter-generational marriage and loops. Let t be the total number of individuals in the pedigree (here t = 1338), and n be the number of genotyped individuals (here n = 394). We further define m to be the number of ungenotyped individuals with genotyped descendants (here m = 686). In our case, this leaves 258 individuals with no genotyped descendants; we do not expect to be able to reconstruct these individuals.
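The quantities t, n, and m can be obtained with a single upward traversal from the genotyped individuals. The following is a minimal sketch, not the thread implementation: it assumes a hypothetical dictionary `parents` that maps each individual to a (mother, father) pair, with 0 for unknown parents as in the problem statement, and the toy pedigree is invented for illustration.

```python
def ancestors_of(parents, individuals):
    """Return every ancestor of the given individuals (unknown parents are coded 0)."""
    seen = set()
    stack = list(individuals)
    while stack:
        person = stack.pop()
        for parent in parents.get(person, (0, 0)):
            if parent != 0 and parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def pedigree_counts(parents, genotyped):
    """Return (t, n, m, rest), where rest = individuals with no genotyped descendants."""
    genotyped = set(genotyped)
    t = len(parents)                                        # everyone in the pedigree
    n = len(genotyped)                                      # genotyped individuals
    m = len(ancestors_of(parents, genotyped) - genotyped)   # ungenotyped ancestors
    return t, n, m, t - n - m


# Hypothetical toy pedigree: only "child" is genotyped, and "aunt" has no descendants.
toy_parents = {
    "gm": (0, 0), "gf": (0, 0), "spouse": (0, 0),
    "mother": ("gm", "gf"), "aunt": ("gm", "gf"),
    "child": ("mother", "spouse"),
}
counts = pedigree_counts(toy_parents, ["child"])
```

For the Amish pedigree the same bookkeeping gives 1338 - 394 - 686 = 258 individuals with no genotyped descendants.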
Genotypes for each genotyped individual were obtained from Illumina Omni 2.5M SNP arrays, and then phased into haplotypes using SHAPEIT2 [12]. We identify IBD segments between pairs of genotyped individuals using GERMLINE [17], although IBD-Groupon, which detects IBD in groups, could be used instead [19]. For each IBD segment I, we combine pairs until we obtain a cohort C of individuals who share this segment, with 2 ≤ |C| ≤ n. Here, the size of C ranged from two to 180 individuals. The descendance path of an IBD segment includes all descendants of the source who also passed down the IBD segment to reach the cohort. Table 1 shows the number of unique IBD segments found on each chromosome.
Step 2: In the next phase of thread, sources for each IBD segment are identified independently. By the end of this step we will have enumerated all possible individuals who could have been the source of each IBD segment I, given its associated cohort C. This process is done only once and is not part of the iterative phase. When searching for all common ancestors of a cohort, each previous generation doubles the number of ancestors to search. thread maximizes efficiency in this exponential problem by merging overlapping paths using a modified breadth-first search algorithm (explained in detail below and in pseudocode in Algorithm S2).
First, all the individuals in the cohort are added to a queue. For example, in Figure 3 the cohort is C = {1, 2, 5, 7, 8}, so we would start out with Q = (1, 2, 5, 7, 8). We then pop the first individual off the queue, p0. If p0 is an ancestor of all individuals in the cohort, we add p0 to a set of possible sources. Either way, we add p0’s parents to the back of the queue and keep processing individuals (even if p0 is an ancestor, its parents may be ancestors via paths that do not include p0). In this example we would consider individual 1 first. Since it has not been processed, we add its parents to the back of the queue.
Each time we add an individual p to the queue, we keep track of how many paths exist from p to the members of the cohort, using a multiset Mp. For the members of the cohort, Mp = {p} (just one path to themselves). When we add a parent to the queue, we concatenate the multisets of the individual’s children. For individual a in this example, its multiset would become Ma = {1, 2}, indicating one path to individual 1 and one path to individual 2. Going further up the pedigree, individual ℓ has two children, h and e with Mh = {1, 2, 5, 7, 8} and Me = {5}. Concatenating these two multisets, we obtain the multiset Mℓ = {1, 2, 5, 5, 7, 8}, indicating that there are two possible paths from ℓ to cohort member 5. As soon as an individual’s multiset contains all members of the cohort, the individual can be a source.
There are two post-processing phases to the source-finding algorithm. (1) We trim redundant sources: a redundant source is an ancestor of another source without adding any unique descendance paths. In other words, we do not want to include individuals if all their paths to the cohort go through another source. If the cardinality of an individual’s multiset is not greater than the maximum cardinality of the multisets of its children, it is redundant (for example, k is a redundant ancestor since |Mk| = |Mh|). (2) We merge couples into a single source, as typically we will not be able to resolve the source of an IBD segment beyond the couple level. Spouses with different multiset cardinality (usually caused by remarriage) are an exception. Individual ℓ is an example; we do not consider couple kℓ a source because |Mk| < |Mℓ| due to ℓ’s remarriage to m. If the cardinalities had been the same (and not redundant), we would have considered kℓ a source.
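The redundancy test in phase (1) reduces to a comparison of multiset cardinalities. The sketch below is illustrative only: the multisets for h, e, k, and ℓ (written `l` in the code) are the ones quoted in the text, while the `children` lists are our reading of the example (ℓ has children h and e; k is assumed to reach the cohort only through h).

```python
from collections import Counter


def cardinality(multiset):
    """Total number of elements in a multiset, counting repeats."""
    return sum(multiset.values())


def is_redundant(person, multisets, children):
    """A candidate is redundant if its multiset is no larger than its largest child's."""
    own = cardinality(multisets[person])
    largest_child = max(
        (cardinality(multisets[c]) for c in children.get(person, [])), default=0
    )
    return own <= largest_child


multisets = {
    "h": Counter([1, 2, 5, 7, 8]),
    "e": Counter([5]),
    "k": Counter([1, 2, 5, 7, 8]),
    "l": Counter([1, 2, 5, 5, 7, 8]),
}
children = {"k": ["h"], "l": ["h", "e"]}   # assumed from the worked example

k_redundant = is_redundant("k", multisets, children)
l_redundant = is_redundant("l", multisets, children)
```

Under these assumptions k is flagged as redundant while ℓ is kept, matching the outcome described above.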
In the Figure 3 example, we identify three potential sources: S = {gh, ℓ, pq}. Note that we cannot stop processing the queue when we get to source gh, as there exist sources further up the pedigree that contain unique paths.
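The multiset bookkeeping can be illustrated on a small pedigree. The sketch below is not Algorithm S2: instead of the queue-based search it uses a memoized traversal that processes children before parents, which yields the same multisets under the definition given above. The pedigree, the names, and the helper functions are hypothetical, and the trimming and couple-merging phases are omitted.

```python
from collections import Counter

# Hypothetical toy pedigree: child -> (mother, father); 0 marks an unknown parent.
PARENTS = {
    "A": (0, 0), "B": (0, 0),      # founder couple
    "X": (0, 0), "Y": (0, 0),      # married-in individuals
    "C": ("A", "B"), "D": ("A", "B"),
    "1": ("C", "X"),
    "2": ("D", "Y"), "3": ("D", "Y"),
}


def children_of(parents):
    """Invert the child -> parents map into a parent -> children map."""
    kids = {p: [] for p in parents}
    for child, pair in parents.items():
        for parent in pair:
            if parent != 0:
                kids.setdefault(parent, []).append(child)
    return kids


def path_multisets(parents, cohort):
    """For each individual, count the descent paths to each cohort member."""
    kids = children_of(parents)
    cohort = set(cohort)
    memo = {}

    def multiset(p):
        if p not in memo:
            if p in cohort:
                memo[p] = Counter({p: 1})      # one path to themselves
            else:
                total = Counter()
                for child in kids.get(p, []):  # concatenate the children's multisets
                    total += multiset(child)
                memo[p] = total
        return memo[p]

    return {p: multiset(p) for p in parents}


def possible_sources(parents, cohort):
    """Individuals whose multiset contains every member of the cohort."""
    msets = path_multisets(parents, cohort)
    return sorted(
        p for p, m in msets.items()
        if p not in cohort and all(m[c] > 0 for c in cohort)
    )


msets = path_multisets(PARENTS, ["1", "2"])
sources = possible_sources(PARENTS, ["1", "2"])
```

In this toy example only the founders A and B reach both cohort members, so after couple merging they would form the single source AB.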
The use of multisets allows us to quickly determine the number of descendance paths from each source to the cohort. For each source s and each individual c in the cohort, let ms(c) be the multiplicity of c in Ms. For example, in Mℓ, the multiplicity of individual 5 is two, meaning that there are two paths from ℓ to individual 5. The total number of descendance paths (d) from source s to cohort C (sharing IBD I) is the product of all the multiplicities:

$$ d(s) = \prod_{c \in C} m_s(c) $$
In this example, we obtain d(gh) = 1, d(ℓ) = 2, and d(pq) = 8. A few of these descendance paths are shown in blue in Figure 4 for clarity.
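As a worked check of the path count, the product of multiplicities can be computed directly from a multiset. This is a minimal sketch: the multiset for ℓ is the one given in the text, whereas the second multiset is a made-up example (it is not the multiset of pq in Figure 3) chosen only to show how multiplicities of two compound.

```python
from collections import Counter
from math import prod


def descendance_paths(multiset, cohort):
    """Product of the multiplicities of the cohort members in a source's multiset."""
    return prod(multiset[c] for c in cohort)


cohort = [1, 2, 5, 7, 8]

M_l = Counter([1, 2, 5, 5, 7, 8])             # multiset of ℓ from the text
d_l = descendance_paths(M_l, cohort)          # 1 * 1 * 2 * 1 * 1

M_hypothetical = Counter({1: 2, 2: 2, 5: 2, 7: 1, 8: 1})   # invented for illustration
d_hypothetical = descendance_paths(M_hypothetical, cohort)
```

A cohort member missing from the multiset has multiplicity zero, so the product is zero and the individual is not a source.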
Before moving into the iterative part of the algorithm, we take note of individuals that are on all paths from all sources. For example, individual b happens to be on all 11 paths from the sources, so we know that individual b should have the IBD segment.
Step 3: At this stage we begin the iterative part of the algorithm. Every iteration begins with lists of reconstructed individuals and unreconstructed individuals. During the first iteration, the reconstructed list only includes genotyped individuals. The goal of Step 3 is to select a source for each IBD segment out of the potential sources enumerated in Step 2. We use the greedy approach of choosing the source with the fewest paths, provided that it does not conflict with one of the reconstructed individuals. The intuition behind choosing the source with the fewest paths is that this source will (often) be more recent than others, with fewer meioses separating the source from the cohort. For example, in Figure 3, we would choose source gh since it has only one descendance path. Once we select a source, we can begin to look at the individuals that lie on paths from this source. In the case of only one path, all the individuals on the path will be given the IBD segment (b, c, and d in this example), thus augmenting the associated cohort. In the more common situation when we have multiple paths from the source, we give the IBD segment only to individuals that appear on all the paths. However, if we try to give this IBD segment to a reconstructed individual and it conflicts with both their haplotypes, we reject the source and immediately choose the source with the next fewest paths. These tentative assignments result in potentially conflicting IBDs being assigned to the same individual, which we resolve in Step 4.
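The greedy choice in Step 3 can be sketched as a minimum over the candidate sources that have not yet been rejected. The function name and the alphabetical tie-breaking rule are assumptions made for this illustration; the path counts are those of the Figure 3 example.

```python
def choose_source(path_counts, rejected=frozenset()):
    """Pick the non-rejected source with the fewest descendance paths.

    Ties are broken by source name, which is an arbitrary choice for this sketch.
    Returns None when every candidate has been rejected.
    """
    candidates = {s: d for s, d in path_counts.items() if s not in rejected}
    if not candidates:
        return None
    return min(candidates, key=lambda s: (candidates[s], s))


path_counts = {"gh": 1, "l": 2, "pq": 8}       # d(gh), d(ℓ), d(pq) from the example

first_choice = choose_source(path_counts)
second_choice = choose_source(path_counts, rejected={"gh"})
no_choice = choose_source(path_counts, rejected={"gh", "l", "pq"})
```

Calling the same function again with a growing `rejected` set mirrors how a conflicting source is replaced by the source with the next fewest paths.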
Step 4: During Step 3, we analyzed each IBD segment independently, identifying ancestral individuals who likely also share the IBD segment. In Step 4, we analyze the individuals independently and assemble the IBDs that have been placed within the individual. Say we are analyzing ancestral individual p with putative set of IBD segments ℐp. The goal of assembly is to separate the IBD segments into two haplotypes such that their sequences are consistent within each group. At a high level, this process can be compared to de novo genome assembly, where many small reads are stitched together to create contigs and chromosomes. However, we may have misplaced IBD segments, which we will need to identify and remove.
Our grouping algorithm (covered in pseudocode in Algorithm S3) begins by identifying regions of homozygosity within the IBD segments. This is accomplished by condensing all segments in ℐp down into a single sequence with a list of alleles at each site. Any region greater than 300 kb with only one allele per site and at least 100 SNPs is declared homozygous. It is important to identify these regions early in the grouping algorithm, otherwise we may assume only one group shares this stretch. Each homozygous region is duplicated so that each chromosome will have a copy, and IBD segments contained within homozygous regions are not used in the next stages.
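The homozygosity scan can be sketched as a search for maximal runs of sites carrying exactly one allele. The representation below (a sorted list of base-pair positions and a parallel list of allele sets) and the decision to let uncovered sites break a run are assumptions of this sketch, not details taken from Algorithm S3.

```python
def homozygous_regions(positions, allele_sets, min_length_bp=300_000, min_snps=100):
    """Return (start_bp, end_bp) for runs of single-allele sites that pass both thresholds."""
    runs = []
    start = None
    for i, alleles in enumerate(allele_sets):
        if len(alleles) == 1:
            if start is None:
                start = i                      # open a new run
        elif start is not None:
            runs.append((start, i - 1))        # close the run at the previous site
            start = None
    if start is not None:
        runs.append((start, len(allele_sets) - 1))

    return [
        (positions[a], positions[b])
        for a, b in runs
        if positions[b] - positions[a] > min_length_bp and (b - a + 1) >= min_snps
    ]


# Synthetic example: 200 sites spaced 4 kb apart.
positions = [i * 4000 for i in range(200)]
allele_sets = [{"A"}] * 120 + [{"A", "G"}] + [{"C"}] * 79
regions = homozygous_regions(positions, allele_sets)
```

In the synthetic data the first run spans 120 SNPs and 476 kb and is retained; the second run is longer than 300 kb but has only 79 SNPs, so it is discarded.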
We process the remaining IBDs (those not incorporated into a group) one by one, from longest to shortest (in kbp). If the IBD does not overlap with any of the current groups, we create a new group initialized by the IBD segment. If the IBD does overlap with one or more groups, we add it to the group with the largest overlap (above a threshold).
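The longest-first pass can be sketched as follows. Several simplifications are assumptions of this illustration rather than statements about thread: each IBD segment is a hypothetical `{site index: allele}` dictionary, length is measured in sites rather than kbp, "overlap" counts shared sites only when the alleles agree at every shared site, and the threshold `min_overlap` is an invented parameter.

```python
def consistent_overlap(segment, group):
    """Number of shared sites, or 0 if the segment disagrees with the group anywhere."""
    shared = segment.keys() & group.keys()
    if any(segment[s] != group[s] for s in shared):
        return 0
    return len(shared)


def greedy_grouping(segments, min_overlap=2):
    """Assign segments to groups, longest first; returns (groups, leftover)."""
    groups = []      # each group is a merged {site: allele} dictionary
    leftover = []    # segments that overlap a group, but by less than the threshold
    for seg in sorted(segments, key=len, reverse=True):
        overlaps = [consistent_overlap(seg, g) for g in groups]
        best = max(overlaps, default=0)
        if best >= min_overlap:
            groups[overlaps.index(best)].update(seg)   # join the best-matching group
        elif best == 0:
            groups.append(dict(seg))                   # no usable overlap: new group
        else:
            leftover.append(seg)
    return groups, leftover


# Two synthetic haplotypes over 10 sites, cut into overlapping segments.
hap1 = "AAAAAAAAAA"
hap2 = "ACACACACAC"


def cut(hap, lo, hi):
    return {i: hap[i] for i in range(lo, hi + 1)}


segments = [cut(hap1, 0, 5), cut(hap1, 3, 9), cut(hap2, 0, 6), cut(hap2, 5, 9)]
groups, leftover = greedy_grouping(segments)
rebuilt = ["".join(g[i] for i in range(10)) for g in groups]
```

On this synthetic input the four segments assemble into two groups that reproduce the two haplotypes, with nothing left over for the merging stage.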
At this point in the grouping algorithm, we have a set of homozygous groups, a set of heterozygous groups, and a set of remaining IBDs. If an IBD overlaps two groups, we use it to merge these groups into one. Finally, we merge groups that “line up” with each other – i.e. they do not overlap, but their IBD segments span adjacent SNPs and were likely separated by an ancestral recombination event. At the end of this process, three situations may emerge:
(1) We have two clear groups (which we denote as strong) forming two haplotypes. This is the ideal scenario and it means we have a successful reconstruction of the individual. To be declared strong, a group must meet a combination of thresholds: a minimum number of IBD segments and a minimum coverage (#SNPs reconstructed/#SNPs genotyped on the chromosome). We use a sliding scale: if the group contains 1-2 IBDs, it must cover 90% of the SNPs. If a group contains 3-9 IBDs, it must cover 70% of the SNPs. And if a group contains 10 or more IBDs, it must cover 50% of the SNPs. These parameters can be customized by the user.
(2) We have two strong groups, but we also have several weaker ones. This scenario is resolvable, as we can retain the two strong groups as the reconstruction, and reject the other groups. Specifically, the two best groups must meet our strong threshold and the rank three group must either have half as many IBD segments or be half as long. The IBD segments from the rejected groups are highly informative: since this individual was on all paths from the selected source, if the IBD segment does not fit with the reconstructed haplotypes, then we know the source was incorrect. Throughout Step 4 we collect all IBD segments that have been incorrectly sourced to update in the next iteration.
(3) In all other situations, we typically cannot resolve the individual’s haplotypes. We may have only one group (which could be one of the individual’s haplotypes), but we do not declare the individual reconstructed. We could have many groups without two strong ones, or we may not have given the individual any IBDs to group.
At the end of Step 4, we move individuals from the first two scenarios into the reconstructed list. IBDs that did not cause any conflicts are marked as processed and we retain the rest to re-source in the next iteration. An illustration of the grouping algorithm is shown in Figure 5.
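The sliding scale used to declare a group strong can be written as a small predicate. This is a sketch with the default thresholds stated above; the function name is hypothetical, and treating each coverage threshold as inclusive ("at least 90%") is an assumption.

```python
def is_strong(n_ibd, coverage):
    """Apply the sliding scale: fewer IBD segments require higher SNP coverage.

    n_ibd    -- number of IBD segments in the group
    coverage -- fraction of genotyped SNPs on the chromosome covered by the group
    """
    if n_ibd >= 10:
        return coverage >= 0.5
    if n_ibd >= 3:
        return coverage >= 0.7
    if n_ibd >= 1:
        return coverage >= 0.9
    return False      # an empty group is never strong


examples = {
    (2, 0.95): is_strong(2, 0.95),
    (2, 0.80): is_strong(2, 0.80),
    (5, 0.70): is_strong(5, 0.70),
    (12, 0.45): is_strong(12, 0.45),
}
```

Because the group with ten or more segments needs only 50% coverage, any individual declared reconstructed has at least half of each haplotype covered.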
Iteration and Step 5
At the end of Step 4 we have a set of IBD segments that were incorrectly sourced. We then repeat Step 3: we update the source for each such IBD by selecting the source with the next fewest paths. This allows us to assign the IBD to a new set of individuals. In the next Step 4 we treat reconstructed individuals and unreconstructed individuals differently. If an individual is already marked as reconstructed, we use each additional IBD to strengthen its groups or reject the new source of the IBD. If an individual has not been reconstructed, we run the grouping algorithm again. We keep iterating Steps 3 and 4 until we are no longer reconstructing new individuals.
The final step is to return the haplotype sequences for the reconstructed individuals. These may contain some gaps, but due to our coverage and length thresholds, if an individual is declared reconstructed, we will return at least half of each haplotype (for the chromosome under consideration).
Simulations
To validate our method, we simulate genetic data from an endogamous population. To generate the levels of IBD sharing seen in the Amish population, we first simulate marriage and offspring between individuals who share a common ancestor three generations in the past. For the founder genomes of these small pedigrees we use haplotypes drawn from European individuals (CEU from the 1000 Genomes Project [1]). This process simulates endogamy pre-immigration to the United States. Then we use these composite individuals as founders and feed them through our exact pedigree structure (of 1338 individuals), simulating meiosis and recombination from a human genetic map (chr 21). We record the genomes of all individuals in this simulated system, but only use the same 394 genotyped individuals when we run thread. After reconstruction, we compare the genomes we built to the true underlying genomes (accounting for arbitrary haplotype order).
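Because the labels of the two reconstructed haplotypes are arbitrary, the comparison with the truth has to consider both pairings. The sketch below shows one way to do this; representing unreconstructed sites as `None`, skipping them in the comparison, and averaging the two haplotype similarities are assumptions of this illustration.

```python
def similarity(reconstructed, truth):
    """Fraction of matching alleles over sites that were actually reconstructed."""
    compared = [(r, t) for r, t in zip(reconstructed, truth) if r is not None]
    if not compared:
        return 0.0
    return sum(r == t for r, t in compared) / len(compared)


def pair_similarity(reconstructed_pair, true_pair):
    """Score both haplotype orderings and keep the better one."""
    r1, r2 = reconstructed_pair
    t1, t2 = true_pair
    direct = (similarity(r1, t1) + similarity(r2, t2)) / 2
    swapped = (similarity(r1, t2) + similarity(r2, t1)) / 2
    return max(direct, swapped)


true_pair = ("ACGT", "AGGT")
# Reconstructed in the opposite order, with one gap and one error in the second haplotype.
reconstructed_pair = (list("AGGT"), ["A", None, "G", "A"])
score = pair_similarity(reconstructed_pair, true_pair)
```

Here the swapped pairing wins: the first reconstructed haplotype matches the second true haplotype exactly, and the other matches at two of its three reconstructed sites.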
3 Results
In our validation, we compared the true genomes from our CEU simulations to those reconstructed by thread. In the parts we reconstruct, we often see sequence similarity that is either close to 100% or around 70%. On average we see about 84% sequence similarity with the true haplotypes – symmetries between maternal and paternal lineages in the pedigree structure may account for part of the discrepancy. In the simulations we reconstructed 107 individuals, which is lower than for the real Amish data.
Moving to the real data, we began by testing the grouping algorithm on genotyped individuals. Figure 6 shows two chromosomes of a genotyped individual that were reconstructed using thread. Each horizontal line represents one IBD segment shared with a cohort of other genotyped individuals. IBD segments of the same color represent haplotypes, and have a consistent sequence along the chromosome. For example, if we condensed the orange IBD segments in Figure 6B, a single sequence would emerge. The small vertical lines represent heterozygous sites between the two haplotypes.
In general we found that our grouping algorithm worked very well for genotyped individuals, who typically share many IBD segments with other members of the pedigree. Very occasionally we obtained three groups (example in Figure 6A).
Next we investigated the number of sources per IBD segment and the number of descendance paths per source. These distributions are shown in Figure S4 for chromosome 21.
After running thread on each autosome using the entire pedigree and all genotyped individuals, we assessed the results in terms of how many individuals were successfully reconstructed (based on the criteria in Section 2). This means that at least half the chromosome can be constructed, with sufficient support in terms of coverage and number of IBD segments. Typically thread converged in 6-10 iterations and we were able to reconstruct between 166 and 260 individuals per chromosome (24%-38% of the 686 individuals with genotyped descendants). See Table 1 for the details of each chromosome.
The conflict resolution step was essential for removing misplaced IBD segments and routing them to other sources. An example is shown in Figure 7. In this case, the green and blue groups were removed from this individual, as they were much less strong than the cyan and red groups. In the next iteration, we re-source the associated IBDs and consider the individual reconstructed. Examples of successful ancestral reconstructions are shown in Figure 9, for a variety of different chromosomes and generations back in time. As expected, in the more distant generations, we place fewer IBD segments and generally have less coverage over the chromosome.
Although we reconstruct many individuals well in the recent generations, there are many haplotypes we are unable to resolve. A few examples are shown in Figure S3. Sometimes we build one haplotype successfully, but not the other (Figure S3A). Often we have some successful reconstruction, but the groups do not meet our threshold for “two strong” since the third group has too many IBD segments (Figure S3B). Four haplotypes could represent ambiguity between the individual’s spouse or close relative (Figure S3C). Sometimes we are placing too many IBDs in this individual, which could arise if they have many descendants (Figure S3D).
Table 1 and Figure 8 show our results in a holistic view. Table 1 shows how many individuals we are successfully reconstructing for each chromosome. Figure 8 shows these same results on the family level, broadly indicating which individuals we are reconstructing well. Figure S2 shows these results on the individual level.
4 Discussion
The methodology behind thread represents a new direction for ancestral reconstruction that scales in both the number of individuals and the number of loci. Previous ancestral haplotype reconstruction algorithms have been too slow to apply or too rigid to accommodate a complex pedigree, have required steps performed by hand, or have considered a more diverse ancestral population. Although a likelihood approach to reconstruction is theoretically possible, our work represents a practical alternative as pedigree size and complexity continue to grow. We note that our method is most suitable when genotyped individuals exhibit high levels of IBD sharing. As effective population size and/or admixture levels increase, this type of method will become less useful.
There are many possible algorithmic improvements to our method. In particular, choosing the source with the fewest paths may bias us toward poor reconstructions in some situations. A more robust probabilistic approach might take other aspects into account, including: (1) the number of generations separating the cohort and the ancestor, (2) the length of the IBD segment, and (3) the location of the IBD segment on the chromosome. Due to recombination events at each generation, all of these factors affect the likelihood that an IBD is passed down, intact, from a certain ancestor to the descendants. In terms of implementation, thread could be parallelized across IBD segments and individuals.
The grouping algorithm could make use of the genetic map to merge groups at recombination hotspots. More realistic simulations could model crossover interference and sex-specific recombination maps, as in Caballero et al. [7].
Individual-level reconstruction opens the door for many types of downstream analysis. Using reconstructed genomes to augment GWAS could increase sample sizes by hundreds of individuals when the phenotype is known. More generally, quantifying allele frequency changes, transmission distortion, and un-reconstructable (“lost”) regions of the genome allows us to model genome dynamics on a recent time scale. thread could be applied to other genetically characterized endogamous populations with high levels of recessive traits, such as Mennonites and Hutterites [36]. Our method would also be suitable for model organisms and domestic animals, where extensive pedigree records are common.
Our results could also be used to find individuals of clinical significance in cases where a gene-inhibiting drug may provide a therapeutic option for a disease. More specifically, loss of function (LoF) mutations in some genes have been shown to protect against disease [38, 49]. As gene inhibition has not been extensively studied in humans, identifying individuals who are already heterozygous null or homozygous null could be extremely valuable.
Acknowledgments
The authors would like to thank Jeff Knerr for providing invaluable computational support, Amy Williams for guidance with simulations, Iain Mathieson for feedback on the method and manuscript, and the participants in the Amish Study of Major Affective Disorder (ASMAD). This research is supported in part by the National Institutes of Health, National Cancer Institute.