Abstract
With the increasing use of massively parallel sequencing approaches in evolutionary biology, the need for fast and accurate methods to investigate genetic structure and evolutionary history is more important than ever. We propose new distance measures for estimating genetic distances between individuals when allelic variation, gene dosage and recombination could compromise standard approaches.
We present four distance measures based on single nucleotide polymorphisms (SNPs) and evaluate them against previously published measures using coalescent-based simulations. Simulations were used to test (i) whether the measures give unbiased and accurate distance estimates, (ii) whether they can accurately identify the genomic mixture of hybrid individuals and (iii) whether they give precise (low variance) estimates.
The results showed that the SNP-based GENPOFAD distance we propose appears to work well in the widest circumstances. It was the most accurate method for estimating genetic distances and is also relatively good at estimating the genomic mixture of hybrid individuals.
Our simulations provide benchmarks to compare the performance of different distance measures in specific situations.
Introduction
The last few decades have witnessed a methodological revolution in the field of population genetics. Model-based likelihood approaches have been propelled to the forefront of species and population level studies (e.g. Beaumont and Rannala 2004; Beaumont et al. 2002; Huelsenbeck et al. 2001). These changes have been made possible by the remarkable advances in computing technology and the application of computationally intensive Monte Carlo methodology.
But even these sophisticated methods face critical challenges from the overwhelming amount of data generated by massively parallel sequencing technologies. In many cases, state-of-the-art models and methods cannot accommodate population genomics data. Consequently, rapid approaches that allow for investigations of patterns and processes still have their utility in this discipline.
Our objective is to present new, flexible, and robust distance measures for estimating genetic distances from single nucleotide polymorphism (SNP) data. We focus on the estimation of distances between individuals (or organisms), even though the distances could certainly be useful in many other circumstances. There are good reasons to focus at the level of individuals rather than populations or species. Individuals are central to biology. Measurements based on morphology, spatial positioning, or genetics are generally performed at the individual level. Individuals are also the fundamental units of natural selection, the central concept of evolutionary biology. And finally, estimates of genetic relatedness between individuals can reveal correlations between genetic and phenotypic distances, spatial genetic structure across a landscape, species boundaries, and could be used for genetic or phylogenetic diversity (PD) surveys.
Although obtaining genetic distances among individuals seems relatively straightforward, there can be several complicating factors. One is the presence of SNPs among gene copies in non-haploid individuals. Polyploidy, which is defined by the presence of more than two genome copies in a nucleus, leads to further complexities. Not only is there the potential presence of more than two character states for each nucleotide, there is also the potential for non-conventional segregation of chromosomes. Finally, recombination along chromosomes renders the problem of calculating distances between organisms even more complex. Given the importance of estimating genetic distances between individuals and the increasing availability of genome-wide sequence data, we think that this issue deserves further investigation.
Only a few approaches, generally motivated by very different research questions, have been proposed to handle SNPs, polyploidy or recombination. Although not based on sequence data, Bruvo et al. (2004) proposed an interesting approach to deal with ploidy level variation when estimating distances between individuals from microsatellite data, which could be generalized to sequence data. Their method consisted of directly comparing the alleles of one individual with those of another, while accounting for the “missing alleles” in comparisons between ploidy levels. Joly and Bruneau (2006) proposed the pofad algorithm to estimate the genetic distance of individuals from allelic sequence information. Their idea for comparing homozygotes and heterozygotes could be seen as comparing alleles that share a most recent common ancestor. However, their implementation could not be applied to polyploid organisms. Later, Göker and Grimm (2008) proposed different methods to estimate distances between “populations” using, among others, community ecology statistics such as Shannon’s entropy or Euclidean distances. Although not originally designed for the problem we address here, they could nevertheless be relevant if one considers an individual as a “population” of sequences. Their approaches could be applied to individuals of mixed ploidy levels, but they did not deal with the potential presence of recombination.
Here, we propose four methods for estimating genetic distances between individuals from nucleotide sequence data. One of these is an adaptation of Nei’s genetic distance (Nei et al. 1983) for this specific problem, but the three other methods are novel. All methods are very general in that they can be applied to individuals of any ploidy level, but also when individuals have different ploidy levels. We first describe in detail the challenges involved in estimating genetic distances between individuals. We then describe the new methods and compare them and others using simulations. We finish by making recommendations on the use of distance measures in different contexts.
Problems associated with the estimation of distances between individuals
Allelic variation
Although the estimation of genetic distances between DNA sequences is straightforward, the potential presence of more than one allele at autosomal loci in non-haploid individuals makes it more complex to estimate the genetic distances between individuals, especially when combining information from multiple loci (Joly and Bruneau 2006). Also, one important property of distances that measure overall difference between individuals is that the comparison of a heterozygous individual with itself should have a distance of 0, something that is not necessarily obtained with all existing approaches. For instance, taking the mean pairwise distance between all alleles will not generally give a distance of 0 when comparing an individual with itself.
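The last point can be checked with a small numerical sketch. The alleles below are hypothetical, and the allele-to-allele distance is assumed to be the simple proportion of differing sites (p-distance), which is only one of many possible choices.

```python
from itertools import product

def p_distance(a, b):
    """Proportion of differing positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def mean_pairwise(alleles_x, alleles_y):
    """Mean distance over all allele pairs, one allele taken from each individual."""
    pairs = list(product(alleles_x, alleles_y))
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)

# Hypothetical heterozygous diploid: the two alleles differ at 1 of 4 sites.
het = ["ACGT", "ACGA"]
self_distance = mean_pairwise(het, het)
print(self_distance)  # 0.125, not 0
```

The two cross-allele comparisons each contribute 0.25, so the individual appears to be at a non-zero distance from itself.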
Polyploidy
Polyploidy brings two other problematic issues: inheritance and gene dosage. Inheritance of diploids is always disomic while it can be either disomic or multisomic in polyploids (Comai 2005). Polyploids are disomic if chromosomes group by pairs at meiosis, one example being homeologous chromosomes in allopolyploids. However, they are multisomic when chromosomes form multivalents. In many cases, inheritance of polyploid taxa is unknown or difficult to determine precisely. Some polyploids are even characterized by a mixture of inheritance modes. For instance, a marker could have mainly disomic inheritance with occasional multisomic inheritance, or different chromosomes could have different modes of inheritance within a genome (Wendel 2000).
Gene dosage is another issue associated with polyploidy (Bruvo et al. 2004). In diploids, gene dosage is obvious: a homozygous individual has two copies of the same allele and a heterozygous individual has one copy of each allele. In polyploids, it is rare that we know the exact dosage of each allele in the genome. A tetraploid that has the observed nucleotide state ‘A’ at a position (i.e., it is homozygous) can only have genotype ‘AAAA’. However, a tetraploid individual with observed states ‘A’ and ‘T’ at a site could have the genotypes ‘ATTT’, ‘AATT’, or ‘AAAT’. The unknown dosage of these character states makes it more difficult to estimate precisely the genetic distances between polyploids. The situation can become even more complicated when there are more than two character states at a sequence site, a feature that becomes more likely in higher polyploids. Finally, another important feature of the desired distance measure is the capacity to estimate distances between individuals of different ploidy levels (Bruvo et al. 2004).
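The dosage ambiguity described above can be enumerated directly. The following is a minimal sketch; the function name `possible_genotypes` is introduced here for illustration only and is not part of any published software.

```python
from itertools import combinations_with_replacement

def possible_genotypes(observed_states, ploidy):
    """List every unordered genotype of the given ploidy that contains
    each observed nucleotide state at least once."""
    states = sorted(set(observed_states))
    genotypes = []
    for combo in combinations_with_replacement(states, ploidy):
        if set(combo) == set(states):
            genotypes.append("".join(combo))
    return genotypes

print(possible_genotypes("A", 4))   # ['AAAA']
print(possible_genotypes("AT", 4))  # ['AAAT', 'AATT', 'ATTT']
```

With three observed states in a tetraploid, or with higher ploidy levels, the number of compatible genotypes grows, which is why measures that do not require dosage information are attractive.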
Distance definitions
We propose four new distance measures to calculate the genetic distance between individuals from sequence data. The main novelty of these proposed measures is that they are all computed at the nucleotide level. Therefore, we define them first at the individual nucleotide site level, and explain later how these distances can be extended to strings of nucleotides, some potentially linked (within loci) and others unlinked. These measures assume that we know the nucleotides present at a given position in an individual but not necessarily gene dosage, which is typical for data obtained from genotyping or sequencing. All proposed distances are bounded between 0 and 1 and have the property that the distance between an individual and itself is 0.
matchstates
This measure looks at each nucleotide present at a given sequence site in one individual and checks if there is a nucleotide in the other individual that matches. More formally, consider a specific sequence site i that might be present in multiple alleles or gene copies in an individual. Let $S_X^i$ be the complete set of nucleotides for individual X at site i and let $|S_X^i|$ be the number of nucleotide states observed for individual X at site i. The matchstates distance between individual X and individual Y at site i is
$$d_{XY}^i = \frac{|S_X^i \,\triangle\, S_Y^i|}{|S_X^i| + |S_Y^i|},$$
where $S_X^i \triangle S_Y^i$ denotes the set of elements that belong to either $S_X^i$ or $S_Y^i$, but not to both.
genpofad
The genpofad measure is named after the pofad algorithm described by Joly and Bruneau (2006). The genpofad distance can be defined as one minus the ratio of the number of nucleotides shared between two individuals divided by the maximum number of nucleotides observed in either of the individuals at a given sequence site. Following the notation introduced above, with $S_X^i$ the set of nucleotides observed for individual X at site i,
$$d_{XY}^i = 1 - \frac{|S_X^i \cap S_Y^i|}{\max\left(|S_X^i|, |S_Y^i|\right)}.$$
mrca
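The mrca distance measure gives a distance of 0 whenever two individuals share at least one nucleotide at a given site and a distance of 1 otherwise. Formally, with $S_X^i$ the set of nucleotides observed for individual X at site i, the mrca distance between individual X and individual Y is
$$d_{XY}^i = \begin{cases} 0 & \text{if } |S_X^i \cap S_Y^i| \geq 1, \\ 1 & \text{otherwise.} \end{cases}$$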
nei
This distance is the application of Nei’s genetic distance (Nei et al. 1983) at the nucleotide level. The frequency of each nucleotide is estimated per site for each individual and then the nei genetic distance between individual X and individual Y for site i is estimated as
$$d_{XY}^i = 1 - \sum_{j} \sqrt{f_{Xj}^i \, f_{Yj}^i},$$
where $f_{Xj}^i$ is the frequency of nucleotide j in individual X at site i. This formula is flexible as it can be easily applied among individuals from different ploidy levels. Gene dosage is assumed to be known, but the measure can also be used when dosage is unknown by giving equal weight to each nucleotide present.
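The four site-level measures can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the POFAD implementation: matchstates is assumed to be the size of the symmetric difference divided by the sum of the two set sizes, nei is assumed to follow the form one minus the sum over nucleotides of the square root of the product of frequencies, and all function names are introduced here for illustration.

```python
from math import sqrt

def matchstates(sx, sy):
    """Assumed form: unmatched states in either individual over total states."""
    sx, sy = set(sx), set(sy)
    return len(sx ^ sy) / (len(sx) + len(sy))

def genpofad(sx, sy):
    """One minus shared states over the larger number of observed states."""
    sx, sy = set(sx), set(sy)
    return 1 - len(sx & sy) / max(len(sx), len(sy))

def mrca(sx, sy):
    """0 if the individuals share at least one state at the site, 1 otherwise."""
    return 0.0 if set(sx) & set(sy) else 1.0

def equal_freqs(states):
    """Equal weight for each observed state, for use when dosage is unknown."""
    states = set(states)
    return {n: 1 / len(states) for n in states}

def nei(fx, fy):
    """Assumed form: 1 - sum_j sqrt(f_Xj * f_Yj); fx and fy map state -> frequency."""
    return 1 - sum(sqrt(fx[n] * fy[n]) for n in set(fx) & set(fy))

x, y = {"A"}, {"A", "T"}  # hypothetical observed states at one site
print(matchstates(x, y))  # 1/3 under the assumed form
print(genpofad(x, y))     # 0.5
print(mrca(x, y))         # 0.0
print(nei(equal_freqs(x), equal_freqs(y)))
```

Because every function works on sets or frequency dictionaries, the same code applies to individuals of any ploidy level, including comparisons between ploidy levels.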
Extension to multiple sites and genes
The extension of all distance measures to many sites within a locus is easily done by taking the average distance over all DNA positions such as
$$d_{XY} = \frac{1}{s} \sum_{i=1}^{s} d_{XY}^i,$$
where s is the number of sites and $d_{XY}^i$ is the contribution of site i to the distance. An estimate of standard error is then provided by the standard statistical formula
$$\mathrm{SE}(d_{XY}) = \sqrt{\frac{1}{s(s-1)} \sum_{i=1}^{s} \left(d_{XY}^i - d_{XY}\right)^2}.$$
In some cases, it might be important to divide nucleotides into different loci, such as when several unlinked genes are sampled throughout the genome, each containing several linked nucleotides. We suggest that distances be calculated first across sites within a marker to obtain distance matrices for each marker. Once this is done, one can compute a genome-wide distance matrix by taking the mean of all marker matrices. In calculating this genome-wide distance matrix, it is possible to scale each individual matrix by dividing the distances of a given matrix by the maximum distance in that matrix. This scaling gives the same weight to all markers whatever their variability, which could be interesting if the markers do not have the same evolutionary rates (e.g., exons, introns, non-coding regions, etc.). If the nucleotides cannot easily be divided into distinct loci, such as when we have a long contiguous sequence along a chromosome, the average distance over all DNA positions is appropriate because each site is then assumed to represent an independent assessment of the distance between the individuals.
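The averaging and scaling steps described in this section can be sketched as follows. The standard error is assumed here to be the sample standard deviation of the per-site distances divided by the square root of the number of sites; the function names and the nested-list matrix representation are illustrative choices, not those of POFAD.

```python
from math import sqrt
from statistics import mean, stdev

def locus_distance(site_distances):
    """Mean per-site distance and its standard error (assumed: sd / sqrt(s))."""
    s = len(site_distances)
    d = mean(site_distances)
    se = stdev(site_distances) / sqrt(s) if s > 1 else float("nan")
    return d, se

def scale_matrix(matrix):
    """Divide every entry by the largest distance in the matrix."""
    largest = max(max(row) for row in matrix)
    if largest == 0:
        return [row[:] for row in matrix]
    return [[v / largest for v in row] for row in matrix]

def genome_wide(matrices, scale=False):
    """Entry-wise mean of per-marker distance matrices, optionally scaled first."""
    if scale:
        matrices = [scale_matrix(m) for m in matrices]
    n = len(matrices[0])
    return [[mean(m[i][j] for m in matrices) for j in range(n)] for i in range(n)]

# Two hypothetical markers with very different variability.
fast = [[0.0, 0.2], [0.2, 0.0]]
slow = [[0.0, 0.01], [0.01, 0.0]]
print(genome_wide([fast, slow]))
print(genome_wide([fast, slow], scale=True))
```

Without scaling, the fast marker dominates the genome-wide matrix; with scaling, both markers contribute equally.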
Implementation
All these algorithms are implemented in POFAD version 1.06 (www.plantevolution.org/en/pofad.html). The matchstates algorithm is also implemented in SplitsTree4 (Huson and Bryant 2006).
Simulations
Computer simulations were performed to compare the performance of the distances in different situations. We evaluated three properties of the distance measures. First, we tested whether the measures provide an unbiased and accurate estimate of distances between organisms. Second, we investigated how well the different distances detect the genomic mixture of hybrid individuals. Third, we evaluated how precise these different measures were. We evaluated our new distance metrics along with other previously published distances of Göker and Grimm (2008) that are relevant in the present context: the min distance and the Phylogenetic Bray-Curtis (pbc) distance. The frq and the entropy distance measures of Göker and Grimm (2008) were not investigated because they are not bounded between 0 and 1 and because they are more relevant in a context of host-parasite associations as originally described. Finally, we also evaluated the recent 2isp method (Potts et al. 2014), even though this distance is not bounded between 0 and 1, as it is similar to our proposed methods (see Appendix for mathematical definitions of previously published distances).
Accuracy of distance measures
To investigate whether the distance measures were accurate for estimating distances between individuals, we simulated tetraploid individuals (2n = 4x) along a species tree using the coalescent and estimated the genetic distances between individuals that have been evolving for different periods of time. Gene sequences of 1000 bp were simulated using MCcoal (Rannala and Yang 2003) on a species tree where the individuals compared had the following divergence times (τ): 0, 0.0005, 0.001, 0.002, 0.003, and 0.005. The divergence times (τ) represent the expected number of mutations per site from the node in the species tree to the present time. However, the expected divergence times of the sequences between individuals will be greater than the time of species divergence as the time to coalescence of the sequences in the ancestral species needs to be considered (Nei 1987; Edwards and Beerli 2000; Arbogast et al. 2002). The expected time to coalescence in the ancestral species (population) is equal to 2N or θ/2 (Edwards and Beerli 2000). The expected genetic distance is thus twice the coalescence time expectation, which is twice the time since the species divergence plus twice the expectation for the coalescent time in the ancestral population: d = 2τ + θ. Distance measures were thus compared to this expected sequence divergence, but also with the expected species divergence (2τ). Simulations were performed with two population sizes (θ = 0.001 and θ = 0.01) that were held constant throughout the tree. The larger population size increased the number of polymorphisms in individuals. All simulations were repeated 2000 times.
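As a worked illustration of the expectation d = 2τ + θ, the short sketch below tabulates the expected species divergence and sequence divergence for the divergence times and population sizes listed above. The function name is introduced for illustration only.

```python
def expected_sequence_divergence(tau, theta):
    """Expected distance between two sequences sampled from species that
    diverged tau ago, with ancestral population size theta: d = 2*tau + theta."""
    return 2 * tau + theta

taus = [0, 0.0005, 0.001, 0.002, 0.003, 0.005]
thetas = [0.001, 0.01]

for theta in thetas:
    for tau in taus:
        d = expected_sequence_divergence(tau, theta)
        print(f"theta={theta:.3f} tau={tau:.4f} 2tau={2 * tau:.4f} d={d:.4f}")
```

The gap between the two expectations is always θ, so it is proportionally largest for recent divergences and large populations.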
Estimation of the genomic mixture of hybrids
To investigate how good the different distance measures are at detecting the genomic mixture of hybrid individuals, we estimated and compared the genetic distance of an allopolyploid with its two parents. For this, we simulated an allopolyploid speciation event. Gene copies inherited from one parent in the allopolyploid were then transferred by descent in the allopolyploid species via multisomic inheritance (i.e., they can be assumed to form a panmictic population and simulated with the coalescent), and were evolving independently from the gene copies inherited from the other parent. This allowed us to simulate gene sequences using multi-labeled species trees (see Jones et al. 2013). The parental species were tetraploids whereas the allopolyploid species was either octoploid with four gene copies coming from each parent or hexaploid with four copies coming from one parent and two from the other. This allowed us to test two ratios of parental genome contribution in the hybrid.
Gene sequences of 1000 bp were simulated on a species tree as described above with a population size θ = 0.001 and with a divergence time to the two parental species fixed at τ = 0.003. Three different scenarios were investigated for the timing of the allopolyploid event: τ = 0 (in which case it is an immediate descendant of the two parental species), τ = 0.001 or τ = 0.002. To investigate the hybrid mixture of the allopolyploid individual, we estimated a hybrid index that indicates the relative distance of the hybrid from its two parents:
$$I = \frac{d_{AX}}{d_{AX} + d_{BX}},$$
where A and B are the two parents and X the hybrid, and where $d_{AX}$ is the genetic distance between species A and the hybrid. The hybrid index (I) is bounded between 0 and 1 and an index of 0.5 indicates that the hybrid is equally distant from both parents. Cases where both $d_{AX}$ and $d_{BX}$ were equal to zero were given I = 0.5. All simulations were repeated 2000 times.
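A minimal sketch of the hybrid index follows, assuming it is computed as the distance from parent A to the hybrid divided by the sum of the distances from both parents to the hybrid. The function name and the example distances are hypothetical.

```python
def hybrid_index(d_ax, d_bx):
    """Relative distance of hybrid X from parent A, bounded between 0 and 1.
    Assumed form: d_AX / (d_AX + d_BX); returns 0.5 when both distances are 0."""
    total = d_ax + d_bx
    if total == 0:
        return 0.5
    return d_ax / total

print(hybrid_index(0.004, 0.004))  # 0.5: hybrid equally distant from both parents
print(hybrid_index(0.002, 0.006))  # 0.25: hybrid closer to parent A
print(hybrid_index(0.0, 0.0))      # 0.5 by convention
```

Under this form, values below 0.5 indicate a hybrid that is genetically closer to parent A and values above 0.5 a hybrid closer to parent B.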
Effect of the number of markers on precision
We also estimated the impact of gene number on precision in the two previous simulation settings. For the precision of the genetic distance estimate, we used the simulations with θ = 0.001 and the expected distance of 0.01. For the hybrid index, we used the framework of the octoploid speciation event at τ = 0.001. In both cases, we evaluated the statistics (distance or hybrid index) with 1, 2, 5, 10, 20, and 40 markers. Distances were estimated 100 times for each scenario, and the standard deviation among estimates was computed and plotted to investigate how it decreases with the number of markers for each method.
Results
Theoretical considerations
Before comparing the different distance methods, it is relevant to note the similarities between the SNP-based methods proposed here and the previously published methods based on whole marker sequences. For example, mrca is the same as min applied to a single nucleotide. As such, it is interesting to compare the performance of this pair of methods in the simulations. Moreover, the genpofad distance is equivalent to the pofad algorithm of Joly and Bruneau (2006) when applied to a single nucleotide in diploid individuals. For a locus evolving under an infinite site mutation model without recombination, the genpofad distance should give the same distance as pofad when extended to the whole locus (see below). However, genpofad has the advantage that it could be applied to individuals of any ploidy level.
Distance accuracy
Only genpofad provided an accurate estimate of the sequence divergence (2τ + θ; Figs. 1,2). The genpofad estimates were very accurate with small population sizes (θ = 0.001), but tended to slightly underestimate the distance for small divergence times with θ = 0.01 (Figs. 1,2). Moreover, it also underestimated sequence divergence within populations (i.e., when species divergence = 0), suggesting that it is not a very accurate estimator of θ. Nevertheless, it was the best estimator of θ among the methods tested.
Other distance measures had interesting properties. min underestimated sequence divergence (Fig. 1), but provided an accurate estimate of the species divergence (Fig. 2). matchstates and pbc provided similar estimates that fell between the expected sequence divergence and the species divergence. The other estimates either largely overestimated sequence divergence (2isp, nei) or underestimated species divergence (mrca) in all situations (Figs. 1,2).
Hybrid genetic mixture
Distance measures were evaluated for estimating the intermediacy of hybrid individuals relative to their parents. When the parents contributed an equal number of gene copies, all methods were accurate, but nei provided the most precise estimate of the hybrid index (Fig. 3). genpofad and 2isp were the second best methods according to precision, followed very closely by pbc and matchstates. mrca and min provided imprecise estimates of the hybrid index (Fig. 3).
No method provided an accurate hybrid index estimate when one parent contributed twice the number of gene copies as the other (Fig. 3), but some methods performed better than others. pbc was by far the best method, followed by genpofad and matchstates. As before, mrca and min provided the worst estimates of the hybrid index. Also, although some evidence for an unequal contribution was visible for the young hybrid with genpofad and matchstates, evidence of unequal parental contribution for older hybrids was only observed with the pbc distance.
Effect of the number of markers on precision
Evaluation of the methods’ precision showed different results for the distance accuracy and for the hybrid index simulations. For the estimation of the genetic distance, all methods showed a similar precision and the increase in precision (decrease in standard deviation among replicates) was similar for the different methods, with the exception of 2isp, which had a much larger error than all others (Fig. 3a). The pattern was different for the precision of the hybrid index. The methods mrca and min were much less precise than the others and they required more markers to converge on stable estimates (Fig. 3b). The remaining methods had a similar precision, although they could be ranked as follows for precision (from best to worst): nei > genpofad = 2isp > matchstates > pbc (Fig. 3b).
Discussion
With the increasing use of massively parallel sequencing approaches in evolutionary biology, fast, accurate, and precise methods to investigate genetic structure and evolutionary history are required. Concatenation approaches are known to be inconsistent in some circumstances (Degnan and Rosenberg 2006; Salter Kubatko and Degnan 2007) and fully Bayesian approaches to population/species reconstruction (e.g. Heled and Drummond 2010; Liu et al. 2009) are computationally demanding with large numbers of markers. Although faster coalescent alternatives exist for genomic studies (Bryant et al. 2012), distance measures nevertheless remain an interesting strategy, especially given the consistent properties of some indices (Liu et al. 2009; Mossel and Roch 2010).
Until now, the toolset of distance measures was limited for studying the relationships of individuals. Overcoming this shortcoming is critical given that individuals are the fundamental unit for many studies at the species level. The main problems encountered at this level are those of allelic variation and polyploidy. However, the potential presence of recombination in the nuclear genome and the SNP-based nature of many contemporary studies represent further challenges. We thus present here new distance measures that are all estimated at the nucleotide level in order to alleviate these biological complexities.
Advantages of SNP-based distances
Interestingly, SNP-based distances held up well in comparison with whole-sequence distances in our simulations. This is relevant because the simulation of long (1000 bp) sequences without recombination should favor distances estimated on whole sequences. On the contrary, the most accurate method for estimating genetic distances was a SNP-based method. One can expect SNP-based methods to rapidly gain an advantage over whole-sequence methods in the presence of recombination. In many empirical studies that use large numbers of markers, it is indeed very difficult to rule out completely the presence of recombination, especially if markers are long. Whereas recombination should not affect the performance of SNP-based methods, it will affect those based on whole sequences. SNP-based methods are thus expected to be particularly useful given the increasing abundance of genome-scale studies based on whole genomes or reduced-representation sequencing data.
Another important factor to consider is the length of markers. Massively parallel sequencing technologies generally result in markers of small sequence lengths. With such data, we expect the relative advantage of distance measures based on the whole marker sequence to decrease with decreasing sequence length. Indeed, we can get an idea of that effect when going from 1000 bp sequences to SNP data by comparing the distances min and mrca, as mrca is identical to min applied to a single SNP. Consequently, SNP-based methods are particularly well suited for SNP-based studies or for studies using short markers.
Importance of gene dosage information
Of the methods evaluated here, two can actually take into account exact gene dosage information if known: pbc and nei. One would expect this type of information to be particularly important for estimating unequal genomic mixtures in hybrid individuals. This actually seems to be the case for pbc, which was the best method according to this criterion. However, nei did not appear to benefit from gene dosage information in the same situation. Our results tend to show, however, that gene dosage information is not critical for good performance in all situations. This is especially true for the estimation of genetic distances, where the best method did not use gene dosage information. This is a very encouraging result given that such information is rarely known precisely in genomic studies involving polyploids.
Method performances
In terms of genetic distance accuracy, the best method was genpofad, a SNP-based method. It provided very accurate estimates of sequence divergence at small population sizes (θ = 0.001), even if the estimates were slightly biased at larger population sizes (θ = 0.01). It was also found to slightly underestimate θ within populations, even though it was still better than all other methods in this respect.
The minimum allelic distance between individuals (min) provided an accurate estimate of the species divergence time, which is an interesting property. This observation concurs with previous studies that have shown this measure to be a consistent estimator of species distances in certain situations (Mossel and Roch 2010; DeGiorgio and Degnan 2014). However, the simulations showed that this measure performs poorly when it comes to estimating the genomic mixture of individuals, both in terms of accuracy and precision. Interestingly, two distance measures provided estimates that fell between the expected sequence divergence and the species divergence, that is between 2τ + θ and 2τ. These are the matchstates and the pbc methods.
Regarding hybrid mixture estimates, the best method was clearly pbc, which was the only method to be close to accurate when estimating unequal contribution of the parents in the young hybrid. Moreover, evidence for unequal contribution remained even for older hybrids, whereas that signal was lost for all other methods. Note that this assumes that we know the exact number of copies in the hybrid (i.e., gene dosage), information that might not always be available in empirical datasets, which could affect the performance of the pbc distance. Among other methods, genpofad and matchstates were slightly better as they showed slight evidence for the unequal parental contributions for the young hybrid and they provided precise estimates. The methods min and mrca were not precise and did not detect unequal parental contributions. This is not surprising as these methods essentially ignore polymorphisms by considering only the most similar nucleotides (mrca) or alleles (min).
Perhaps the best recommendation we can provide is to use the genpofad distance in general, as this is the most accurate method in terms of expected genetic distance and it is relatively good at estimating genomic mixture between individuals. Moreover, its performance will not be affected by the presence of recombination or if only short markers are available. In cases where species divergence times are of interest and in the absence of recombination, the min distance is of great interest. Finally, if gene dosage is known and genomic admixture is of main interest, then the pbc distance is the best choice if recombination is absent. In any case, we hope that this study and the simulation framework we propose for comparing the performance of distance measures will stimulate the development and testing of further SNP-based distance measures.
Appendix
Definition of previously published distance measures
In the following definitions based on whole marker sequences, $A_X$ represents the complete set of alleles for individual X and $|A_X|$ is the number of alleles observed for individual X. Also, let $d_{ij}$ be the genetic distance between alleles i and j.
MIN distance
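The min distance was proposed by Göker and Grimm (2008) in the present context, but it has often been used in other contexts as well (e.g. Joly et al. 2009; Liu et al. 2009; Mossel and Roch 2010). With $A_X$ the set of alleles of individual X and $d_{ij}$ the genetic distance between alleles i and j, it can be described as:
$$d_{XY} = \min_{i \in A_X,\; j \in A_Y} d_{ij}$$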
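As a hedged illustration, the min distance can be sketched as the smallest allele-to-allele distance between two individuals. The allele distance is assumed here to be the proportion of differing sites, and the sequences are hypothetical.

```python
def p_distance(a, b):
    """Proportion of differing positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def min_distance(alleles_x, alleles_y):
    """Smallest distance over all pairs of alleles, one from each individual."""
    return min(p_distance(a, b) for a in alleles_x for b in alleles_y)

x = ["ACGT", "ACGA"]  # hypothetical heterozygous diploid
y = ["TCGA", "TCCA"]  # hypothetical second individual
print(min_distance(x, y))  # 0.25: closest pair is ACGA vs TCGA
print(min_distance(x, x))  # 0.0: an individual is at distance 0 from itself

# Applied to single-nucleotide "alleles", min is 0 whenever a state is shared.
print(min_distance(["A", "T"], ["T"]))  # 0.0
```

The last call shows the single-site case, in which min only asks whether the two individuals share a state.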
Phylogenetic Bray-Curtis distance (PBC)
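The pbc distance was defined by Göker and Grimm (2008) as:
$$d_{XY} = \frac{\sum_{i \in A_X} \min_{j \in A_Y} d_{ij} + \sum_{j \in A_Y} \min_{i \in A_X} d_{ij}}{|A_X| + |A_Y|}$$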
2ISP distance
The 2isp distance is a nucleotide-based distance (Potts et al. 2014). It estimates the distance between nucleotides using the step-matrix presented in Figure 1 of Potts et al. (2014).