Abstract
Clustering individuals based on genetic data has become commonplace in many genetic and ecological studies. Most often, statistical inference of population structure is done by applying model-based approaches, such as Bayesian clustering, aided by visualization using distance-based approaches, such as PCA (Principal Component Analysis). While existing distance-based approaches suffer from a lack of statistical rigour, model-based approaches entail assumptions of prior conditions, such as that the subpopulations are at Hardy-Weinberg equilibrium. Here we present a distance-based approach for inference of population structure using genetic data, based on the network theory concept of a community, a dense subgraph within a network. A network is constructed from the pairwise genetic-distance matrix of all sampled individuals, and community detection algorithms are used to partition the network into communities, interpreted as a partition of the population into subpopulations. The statistical significance of the structure can be estimated by using permutation tests to evaluate the significance of the partition's modularity, a network theory concept measuring the strength with which a partition divides the network. In order to further characterize population structure, a measure of the Strength of Association (SA) of an individual to its assigned community is calculated, and the Strength of Association Distribution (SAD) of the communities is analysed to provide additional details of the population structure. The approach presented here provides a novel, computationally efficient method for inference of population structure which assumes neither an underlying model nor prior conditions, making inference potentially more robust. The method is implemented in the software NetStruct, available at https://github.com/GiliG/NetStruct.
1 Introduction
Inference and analysis of population structure from genetic data can be used to understand underlying evolutionary and demographic processes experienced by populations, and is therefore an important aspect of many genetic studies. Such inference is mainly done by clustering individuals into groups, often referred to as demes or subpopulations. Evaluation of population structure and gene flow levels between subpopulations allows inference of migration patterns and their genetic consequences, amongst other processes[1, 2]. As sequencing of larger portions of the genome is becoming more readily available, methods for such inference should ideally be able to take into account a large number of loci.
There are two types of approaches to clustering individuals based on genetic data: model-based approaches and distance-based approaches. Model-based approaches evaluate the likelihood of the observed data assuming that they are randomly drawn from a predefined model of the populations, for example that there are K subpopulations and that these subpopulations are at Hardy-Weinberg equilibrium. Distance-based approaches aim to identify clusters by analysing a matrix describing genetic distance between individuals or populations, for example by graphic visualization using multidimensional scaling (e.g. PCA), without prior assumptions. Over the last decade or so, model-based approaches have been the more dominant procedures for inference of population structure, mostly through implementations of Bayesian clustering techniques in programs such as STRUCTURE[3], ADMIXTURE[4] and BAPS[5]. It has been pointed out that distance-based methods have several disadvantages: they are not rigorous enough and rely on graphical visualization, they depend on the distance measure used, it is difficult to assess the significance of the resulting clustering, and it is difficult to incorporate additional information such as the geographical location of the samples[3]. Given these disadvantages, it would seem that distance-based measures are unsuitable for statistical inference of population structure. On the other hand, in model-based approaches the interpretation of the results is restricted by heavy reliance on the prior assumptions of the model, for example that the populations meet certain equilibrium conditions, such as migration-drift or Hardy-Weinberg equilibrium[3].
There has recently been a flourish of network theory applications to genetic questions in genomics[6], landscape genetics[7] and migration-selection dynamics[8]. Recently, a network-based visualization tool (NETVIEW[9]) for fine-scale genetic population structure, using a Super Paramagnetic Clustering algorithm[10], has been proposed and successfully applied to the analysis of livestock breeds[11, 12]. However, this method still suffers from many of the disadvantages of distance-based clustering approaches, and a more rigorous and statistically testable distance-based approach is still missing.
Development of a suitable distance-based network approach that does not suffer from the disadvantages listed above necessitates a clear definition of genetic population structure in equivalent network theory terminology. A genetically defined subpopulation is commonly thought of as a group of individuals within the population which are more genetically related (or more genetically similar) to each other than they are to other individuals in the population, as a result of many possible genetic processes such as migration, mutation and selection. In a network, a group of nodes which are more densely and strongly connected within the group than outside the group, relative to the given topology of the network, is called a "community". Therefore, in network theory terminology, the equivalent of a genetic population structure should be the community partition of a network constructed with individuals as nodes and edges defined using an appropriate genetic distance or relatedness measure. In network science, clustering nodes into groups has been extensively studied, and community detection specifically has attracted much interest[13]. Since the definition of a community is not rigid, and identifying optimal partitions is computationally expensive, many approaches and algorithms to optimally detect communities in networks have been proposed[14, 15].
We propose a network-based approach for analysing population structure based on genetic data. We show that by applying recent advances in network theory, it is possible to design a distance-based approach that does not suffer from the previously described disadvantages, and also does not suffer from the disadvantages of model-based approaches. We also show how rigorous statistical inference can be incorporated into this network-based approach in a manner that does not entail prior assumptions or conditions about the data. The process is also computationally efficient with regard to the number of loci incorporated in the analysis, and can therefore be used with a large number of loci (e.g. microsatellites, SNPs). Moreover, we define a new measure of the strength with which an individual is associated with the community to which it is assigned, called Strength of Association (SA), and we show how Strength of Association Distribution (SAD) analysis can be used to infer further details regarding population structure. The analysis is demonstrated on genetic data from human populations extracted from the HapMap project[16]. In addition to presenting a new distance-based alternative for population detection in population genetic studies, one that complements existing model-based methods to give a more detailed and robust account of population structure, we believe that defining the problem of genetic population structure analysis in network terminology will allow future adoption and adaptation of network methods to address population genetic questions.
2 Methods
In this section we describe a network-based approach for constructing genetic networks and inferring population structure by detecting community partitions on these networks, and the relevant theory. Following detection of community structure, we propose an additional exploratory analysis of the strength of association of individuals to communities that may shed light on finer details of the community structure and therefore on population structure and underlying genetic processes.
2.1 Constructing networks from genetic data
A network can be described by an adjacency matrix, where the element in column i and row j is the weight of the edge connecting node i and node j. Therefore a genetic-distance matrix (a matrix describing some measure of genetic distance between all pairs of individuals, based on their genotypes) of a population can also be regarded as the adjacency matrix of a genetic-distance network. Many genetic distance and relatedness measures have been proposed[17], but if we restrict the discussion to symmetric relatedness measures (where relatedness between individual 1 and 2 is the same as between individual 2 and 1), the genetic network thus described is a weighted undirected network (each edge is characterised by a weight but has no directionality). Since we would like to extract information about the population structure from this network, we further restrict the discussion to genetic distance measures which are relative to the sampled population, i.e. measures that take into account the allele frequencies of the total sampled population (allele frequencies of local sampled populations should not be incorporated in the measure, since this would mean that the null hypothesis is other than that there is no population structure).
In such a network, the strength of the connection between each dyad is relative to the genetic similarity between them, where shared rare alleles convey a stronger connection than do common alleles. Since even unrelated individuals may share many alleles, especially when many loci are examined, it is likely that this network will be extremely dense. It may therefore be useful, both from a computational point-of-view and in order to emphasize strong genetic relations within the population to increase detection power, to remove edges from the network with weights below a certain threshold. In this way a sparser network that consists of strong relatedness ties is attained.
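The edge-removal step described above can be sketched in a few lines. This is a minimal illustration, assuming the network is held as a symmetric list-of-lists weight matrix; the function name `threshold_network` and the toy weights are hypothetical and are not taken from NetStruct.

```python
def threshold_network(weights, threshold):
    """Return a copy of a symmetric weight matrix with weak edges removed.

    Edges whose weight is below `threshold` are set to 0 (treated as absent).
    The diagonal is zeroed as well, so no self-loops remain.
    """
    n = len(weights)
    sparse = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and weights[i][j] >= threshold:
                sparse[i][j] = weights[i][j]
    return sparse


# Toy relatedness matrix for four individuals (made-up values).
toy = [
    [0.00, 0.21, 0.20, 0.18],
    [0.21, 0.00, 0.19, 0.17],
    [0.20, 0.19, 0.00, 0.18],
    [0.18, 0.17, 0.18, 0.00],
]
sparse = threshold_network(toy, 0.19)

# Count the undirected edges that survive the threshold.
edge_count = sum(1 for i in range(4) for j in range(i + 1, 4) if sparse[i][j] > 0)
```

Raising the threshold removes more of the weak ties, so in this toy case the fourth individual, whose ties are all below 0.19, ends up disconnected from the rest.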
2.2 Network communities and genetic population structure
In network theory, the term community refers to a subset of nodes in a network more densely connected to each other than to nodes outside the subset[18]. There are now several algorithms for efficiently partitioning a network into communities[14, 15]. Most commonly, a partition of a network into communities is evaluated by calculating the modularity of the partition, a quality measure (between -1 and 1) indicating whether the partition is more or less modular than would be expected[19]. The modularity of a particular community partition of a weighted network A is defined as the weight of the intra-community connections minus the expected weight of the intra-community connections in a random network preserving the total edge weight of each node[20]:

$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j) \qquad (1)$$

where $k_i = \sum_j A_{ij}$ is the sum of the weights of the edges attached to node i, $m = \frac{1}{2} \sum_{i,j} A_{ij}$ is the sum over all edge weights in the network, $c_i$ is the community to which node i is assigned, and $\delta(c_i, c_j)$ is a delta function with value one if nodes i and j are in the same community and zero otherwise. A positive modularity value indicates that the partition is more modular than expected. A partition consisting of one community that includes all nodes has a modularity of zero, and therefore, for every network, the modularity of the optimal partition (the one maximizing modularity) is always non-negative.
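As an illustration of the modularity measure just described, the following sketch evaluates the standard weighted modularity of a given partition. It assumes a symmetric list-of-lists weight matrix and a list of community labels; the names `modularity`, `adj` and `membership` are hypothetical, and the toy network (two triangles joined by a single edge) is made up for the example.

```python
def modularity(adj, membership):
    """Weighted modularity Q of a partition.

    adj        : symmetric matrix (list of lists) of non-negative edge weights
    membership : membership[i] is the community label of node i
    """
    n = len(adj)
    strength = [sum(row) for row in adj]  # total edge weight attached to each node
    two_m = sum(strength)                 # twice the total edge weight
    if two_m == 0:
        return 0.0
    q = 0.0
    for i in range(n):
        for j in range(n):
            if membership[i] == membership[j]:
                # Observed weight minus the weight expected at random.
                q += adj[i][j] - strength[i] * strength[j] / two_m
    return q / two_m


# Toy network: two triangles (0,1,2) and (3,4,5) joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = [[0.0] * 6 for _ in range(6)]
for u, v in edges:
    adj[u][v] = adj[v][u] = 1.0

q_split = modularity(adj, [0, 0, 0, 1, 1, 1])   # one community per triangle
q_single = modularity(adj, [0, 0, 0, 0, 0, 0])  # everything in one community
```

For this toy network the two-triangle partition has modularity 5/14 (about 0.357), while the single-community partition has modularity zero, in line with the property noted above.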
Since in a subdivided population the individuals in a subpopulation are expected to be more highly related in comparison to a random redistribution of relatedness levels between individuals, communities in the genetic network are expected to coincide with the subpopulations of the underlying population structure. We therefore propose that population structure can be ascertained by constructing a genetic network based on a genetic distance measure, and then applying one or several community detection methods to identify a partition which maximizes modularity. It is important to note that it is possible that the partition with the highest modularity is the entire network (with modularity of zero), and therefore community detection algorithms can also identify scenarios with no subdivision of the population.
Several approaches have been suggested in order to evaluate the statistical significance of community partitions [14]. However, since the genetic network was constructed using multilocus genetic data, it is possible to pursue an alternative approach where the optimal modularity of the community partition can be compared with the modularity of optimal community partitions of networks constructed from permutations of the genetic data. In this way it is possible to directly evaluate whether the modularity attained is significantly different than zero, and whether the network is significantly modular. This can be done either by permuting the genetic data (in each locus independently) and then constructing a genetic-distance matrix, or by permuting the genetic distance network while preserving the matrix symmetry (the latter is more computationally efficient when many loci are analysed).
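The first permutation scheme described above can be sketched as follows. This is a minimal sketch under stated assumptions: genotypes are shuffled among individuals independently at each locus, `best_modularity` is a hypothetical user-supplied callable that builds the network from a genotype table and returns the modularity of the best partition found, and the p-value uses the common "add one" convention, which is a choice made here and is not specified in the text.

```python
import random


def permute_genotypes(genotypes, rng):
    """Shuffle genotypes among individuals, independently at each locus.

    genotypes[i][l] is the genotype (a tuple of alleles) of individual i
    at locus l. Allele frequencies per locus are preserved, while any
    association between loci within an individual is broken.
    """
    n = len(genotypes)
    n_loci = len(genotypes[0])
    permuted = [[None] * n_loci for _ in range(n)]
    for l in range(n_loci):
        column = [genotypes[i][l] for i in range(n)]
        rng.shuffle(column)
        for i in range(n):
            permuted[i][l] = column[i]
    return permuted


def permutation_p_value(genotypes, best_modularity, n_perm=1000, seed=0):
    """One-sided permutation p-value for the observed optimal modularity.

    best_modularity : hypothetical callable mapping a genotype table to the
                      modularity of the optimal partition detected on it.
    """
    rng = random.Random(seed)
    observed = best_modularity(genotypes)
    exceed = 0
    for _ in range(n_perm):
        if best_modularity(permute_genotypes(genotypes, rng)) >= observed:
            exceed += 1
    # "Add one" convention: the observed data count as one permutation.
    return (exceed + 1) / (n_perm + 1)
```

In practice `best_modularity` would wrap the full pipeline (distance matrix, thresholding and community detection), which is why the second scheme, permuting the distance network directly, is cheaper when many loci are analysed.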
2.3 Strength of Association Distribution (SAD) analysis
Revealing the division of the population into subpopulations may shed light on many aspects of the underlying evolutionary and ecological processes, but more information can be attained by further analysing the characteristics of the partition. The partitioning of the network into dense subgraphs, as presented above, does not convey information regarding how important each individual is to the detected partition. Here we introduce a measure intended to evaluate this aspect, the Strength of Association (SA) of individual i to its community. Given a community partition C and an individual i, we define the Strength of Association as

$$SA(i, C) = Q_C - \max_{k \neq c_i} Q_{C_k(i)} \qquad (2)$$

where $Q_C$ is the modularity of the partition C, $c_i$ is the community to which i is assigned in C, and $C_k(i)$ is the partition identical to C except that node i is assigned to community k instead of its original community. Thus high SA values indicate that the partition C is sensitive to the assignment of i, and that the assignment of i to its community is essential, whereas low SA values indicate that there is another community to which the individual is also well assigned. From a population genetic perspective, the measure evaluates how strongly individuals are related to the group to which they were assigned, and SA is expected to be low when individuals are recent descendants of individuals from more than one subpopulation. Specifically, potential hybrids are expected to show low SA values, and the k that maximizes $Q_{C_k(i)}$ in equation 2 is the probable origin of the second lineage of the individual.
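The SA measure can be sketched directly from its verbal definition. The sketch below assumes that SA is the drop in modularity incurred when individual i is moved to its best alternative community, and it uses the standard weighted modularity; the function names and the toy network are hypothetical and are not taken from NetStruct.

```python
def modularity(adj, membership):
    """Weighted modularity of a partition (adj: symmetric weight matrix)."""
    n = len(adj)
    strength = [sum(row) for row in adj]
    two_m = sum(strength)
    if two_m == 0:
        return 0.0
    q = 0.0
    for i in range(n):
        for j in range(n):
            if membership[i] == membership[j]:
                q += adj[i][j] - strength[i] * strength[j] / two_m
    return q / two_m


def strength_of_association(adj, membership, node):
    """Return (SA, best alternative community) for `node`.

    SA is taken here as Q of the original partition minus the highest Q
    obtained by reassigning `node` to any other existing community.
    """
    q_original = modularity(adj, membership)
    best_q, best_k = None, None
    for k in sorted(set(membership)):
        if k == membership[node]:
            continue
        moved = list(membership)
        moved[node] = k
        q_moved = modularity(adj, moved)
        if best_q is None or q_moved > best_q:
            best_q, best_k = q_moved, k
    if best_q is None:
        return None, None  # a single community: SA is undefined
    return q_original - best_q, best_k


# Toy network: two triangles (0,1,2) and (3,4,5) joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = [[0.0] * 6 for _ in range(6)]
for u, v in edges:
    adj[u][v] = adj[v][u] = 1.0
membership = [0, 0, 0, 1, 1, 1]

sa_core, _ = strength_of_association(adj, membership, 0)      # interior node
sa_bridge, alt = strength_of_association(adj, membership, 2)  # node tied to both groups
```

In this toy case the bridging node 2 has a lower SA than the interior node 0, which mirrors the expectation that individuals with ties to a second group are less strongly associated with their assigned community.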
The SA measure is a measure at the individual level (although it takes into account genetic data of the entire population). We introduce an exploratory subpopulation-level analysis that evaluates characteristics of subpopulations, the Strength of Association Distribution (SAD) analysis. This analysis examines the distribution of the SA values of the different communities and compares the statistical attributes of these distributions (e.g. the mean and variance of the SA values). Since different scenarios are expected to result in different cohesion of the subpopulations, it may be possible to infer what underlying processes were responsible for shaping the genetics of the population.
For example, a closed disconnected subpopulation is expected to display a narrow SAD with a high mean (high community cohesion), since in a closed population individuals will be strongly related relative to the entire population, and individuals descended from lineages outside the subpopulation are rare. A subpopulation experiencing constant moderate gene flow levels is expected to display a wide or left-skewed SAD with a high mean, since there should be many individuals with lineages that are mostly from the subpopulation, but recent migrants and descendants of recent migrants are expected to have low SA values, increasing the variance and the left-skewness of the distribution. A subpopulation experiencing constant strong gene flow levels is expected to display a wide SAD with a low mean (low community cohesion), as many individuals will be descendants of migrants. A bimodal SAD may indicate subgroups within the subpopulation experiencing different gene flow regimes, with the two groups corresponding to the two modes.
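The summary statistics used to compare SADs (mean, variance and skewness) can be computed per community with a short sketch such as the one below. It assumes SA values have already been calculated for each individual; the function name `sad_summary` and the example values are hypothetical, and population (rather than sample) moments are an arbitrary choice made here.

```python
from statistics import mean, pvariance


def sad_summary(sa_values, membership):
    """Mean, variance and skewness of the SA values within each community.

    sa_values  : sa_values[i] is the SA of individual i
    membership : membership[i] is the community label of individual i
    """
    groups = {}
    for sa, community in zip(sa_values, membership):
        groups.setdefault(community, []).append(sa)

    summary = {}
    for community, values in groups.items():
        mu = mean(values)
        var = pvariance(values, mu)
        if var > 0:
            # Population skewness: third central moment / variance^(3/2).
            third = sum((v - mu) ** 3 for v in values) / len(values)
            skew = third / var ** 1.5
        else:
            skew = 0.0
        summary[community] = {"mean": mu, "variance": var, "skewness": skew}
    return summary


# Made-up SA values: community "a" has one weakly associated individual.
sa_values = [0.30, 0.31, 0.32, 0.10, 0.20, 0.20, 0.20]
membership = ["a", "a", "a", "a", "b", "b", "b"]
summary = sad_summary(sa_values, membership)
```

A negative skewness, as for community "a" here, corresponds to the left-skewed pattern described above, where a few individuals have much lower SA than the core of the community.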
3 Analysis of human SNP data
In order to demonstrate the applicability of the network approach to inferring population structure, and of the SAD analysis, we selected a dataset of human SNP data, extracted from the HapMap database[16]. This dataset is well suited for the demonstration of a new approach since it is taken from a population whose structure and demographic history are well known from archaeological, historical and genetic studies. The genetic data for this analysis consisted of 50 randomly selected individuals from each of the 11 groups in the HapMap project (550 individuals overall). For each individual, 1000 polymorphic SNPs from each autosome were randomly selected (22,000 sites per individual overall).
From these genotypes a genetic network was constructed (without any information on the original grouping of the individuals) using, for genetic distance calculation, a simple frequency-weighted allele-sharing relatedness measure. Analogous to the molecular similarity index[17, 21], we defined the frequency-weighted similarity at locus l for individual i with alleles a and b, with frequencies $f_a$ and $f_b$ (in the total sample) respectively, and individual j with alleles c and d:

$$S_{ij,l} = \frac{1}{4} \left[ (1 - f_a)(I_{ac} + I_{ad}) + (1 - f_b)(I_{bc} + I_{bd}) \right] \qquad (3)$$

where $I_{ac}$ is one if alleles a and c are identical and zero otherwise, and the other indicators are similarly defined. Note that this measure is commutative with respect to i and j. Given a sample with L loci, the weight of the edge connecting individuals i and j is defined as the mean frequency-weighted similarity over all loci:

$$A_{ij} = \frac{1}{L} \sum_{l=1}^{L} S_{ij,l} \qquad (4)$$
The relatedness measure defined in equation 3 is a very simple symmetric relatedness measure that measures diversity relative to the entire population, since it takes into account the allele frequencies at the level of the entire population (with sharing of rare alleles conveying a stronger connection than sharing of common ones). Other, more sophisticated, measures are likely to construct more accurate networks, and may be specific to the type of marker considered (e.g. for microsatellites the length of the repeat might be taken into account) or include additional information (e.g. geographic locations of the samples). The formulation presented here is designed to analyse diploid populations, but it can easily be generalized to any level of ploidy. Different thresholds were used to create different networks, and community detection algorithms were used to identify the underlying genetic population structure.
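A minimal sketch of the frequency-weighted relatedness described above is given below for diploid genotypes. It assumes that the per-locus similarity takes the form S = (1/4)[(1 - f_a)(I_ac + I_ad) + (1 - f_b)(I_bc + I_bd)], with allele frequencies taken from the total sample; this exact form, the function names and the toy genotypes are assumptions made for illustration and are not taken from NetStruct.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Per-locus allele frequencies in the total sample.

    genotypes[i][l] is a pair of alleles for individual i at locus l.
    Returns one {allele: frequency} dictionary per locus.
    """
    n_loci = len(genotypes[0])
    freqs = []
    for l in range(n_loci):
        counts = Counter(allele for ind in genotypes for allele in ind[l])
        total = sum(counts.values())
        freqs.append({allele: c / total for allele, c in counts.items()})
    return freqs


def locus_similarity(g_i, g_j, f):
    """Frequency-weighted similarity of two diploid genotypes at one locus."""
    a, b = g_i
    c, d = g_j
    # Shared rare alleles (low f) contribute more than shared common ones.
    return 0.25 * ((1 - f[a]) * ((a == c) + (a == d))
                   + (1 - f[b]) * ((b == c) + (b == d)))


def relatedness_matrix(genotypes):
    """Edge weights: mean per-locus similarity over all loci."""
    n = len(genotypes)
    n_loci = len(genotypes[0])
    freqs = allele_frequencies(genotypes)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            total = sum(
                locus_similarity(genotypes[i][l], genotypes[j][l], freqs[l])
                for l in range(n_loci)
            )
            adj[i][j] = adj[j][i] = total / n_loci
    return adj


# Toy sample: three individuals typed at a single biallelic locus.
toy_genotypes = [[("A", "A")], [("A", "T")], [("T", "T")]]
toy_adj = relatedness_matrix(toy_genotypes)
```

The double loop over pairs of individuals, with an inner sum over loci, is what makes network construction scale with the number of loci times the square of the number of individuals.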
There are currently many algorithms used for detecting community structure, relying on different network theory concepts (reviewed by Fortunato[14] and Lancichinetti[15]). We have used several different algorithms (implemented using igraph[22]), detailed in Appendix A, and we show here the results from the classic Girvan-Newman algorithm[13] (the results using different algorithms are not qualitatively different). Figure 1 shows the assignment results for networks with a low threshold of 0.188, a medium threshold of 0.194, and a network constructed using two high thresholds, 0.198 for most individuals and 0.207 for the East Asian component. (A network component is a group of nodes that are not connected to any other node in the network, and it arises here when using a high enough threshold. The East Asian component is much denser than the rest of the network, and high thresholds are necessary for distinguishing between the network components within high-density networks.) In all cases network components of one or two nodes were removed from the network.
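The removal of very small network components mentioned above can be sketched as follows. This is an illustrative sketch on a thresholded list-of-lists weight matrix, using a plain depth-first search; the function names and the toy network are hypothetical, and in practice a graph library such as igraph would typically be used for this step.

```python
def connected_components(adj):
    """Connected components of a weighted network (edges have weight > 0)."""
    n = len(adj)
    seen = [False] * n
    components = []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        stack, component = [start], []
        while stack:
            u = stack.pop()
            component.append(u)
            for v in range(n):
                if adj[u][v] > 0 and not seen[v]:
                    seen[v] = True
                    stack.append(v)
        components.append(sorted(component))
    return components


def drop_small_components(adj, min_size=3):
    """Keep only nodes in components with at least `min_size` nodes.

    Returns the retained node indices and the reduced weight matrix.
    """
    keep = sorted(
        v
        for component in connected_components(adj)
        if len(component) >= min_size
        for v in component
    )
    reduced = [[adj[i][j] for j in keep] for i in keep]
    return keep, reduced


# Toy network: a triangle (0,1,2), a pair (3,4) and an isolated node (5).
toy_edges = [(0, 1), (0, 2), (1, 2), (3, 4)]
toy_net = [[0.0] * 6 for _ in range(6)]
for u, v in toy_edges:
    toy_net[u][v] = toy_net[v][u] = 0.2

kept_nodes, reduced_net = drop_small_components(toy_net)
```

With the default `min_size=3`, components of one or two nodes are discarded, so only the triangle is retained in this toy case.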
Permutation tests using 1000 permutations of the genotypes were conducted, and all community partitions were strongly significant (p ≤ 0.001) at all thresholds. With a low threshold the partition corresponded to an African/Non-African division (Fig. 1C), with a medium threshold to an African/Indo-European/East Asian division (Fig. 1B), and with high thresholds it resulted in six communities: African, Indian, European, Mexican, Chinese and Japanese (Fig. 1A; some of the other algorithms also detected the Masai population as a distinct community, Appendix A). The trend in which higher thresholds reveal more detailed structure is consistent with the known broad patterns of human population differentiation. The coarse division of the population into two groups at the low threshold corresponds to the more ancient "out-of-Africa" migrations, the additional division of the Eurasian population at the medium threshold corresponds to the more recent migrations to Asia, and lastly the divisions at the high thresholds correspond to the most recent relevant migrations to India, Japan and Mexico.
As a demonstration of the SAD analysis, the network with medium threshold (Fig. 1B) was analysed, and the distribution of the SA values for the three communities detected are shown in Figure 2. The SAD of the Indo-European community (blue) is the one with the lowest mean SA, and is a wide left-skewed distribution, consistent with a sub-population with defined core and peripheral areas that experienced extensive gene flow. Given that the individuals belonging to this community are from European, Indian or Mexican ancestry, a probable explanation is that the core consists of the two European sampled populations and that the Indian and Mexican ancestry individuals have lower association with this group. This can be clearly observed when the distribution is decomposed to three distributions based on the sampled population (Figure 3).
The distribution of the African community (orange) has a high mean, and is narrow and bimodal. This is consistent with a cohesive subpopulation with limited gene flow, but also with the existence of two distinct subgroups within the population that have different levels of association to the community. Figure 4 shows the decomposition of the distribution into the four sampled populations composing it, and it can be seen that the bimodality can be explained by the fact that the Masai population (A2) is found to be a distinct population, as detected by some of the community detection algorithms (Appendix A), as well as by a lower association of the American individuals (A1), with some individuals having very low association, as indicated by the left tail of the distribution. This lower association is consistent with higher gene flow experienced by the African-ancestry Americans from other American groups.
The distribution of the East Asian community (pink) has a high mean and is a very narrow distribution, consistent with a subpopulation experiencing limited gene flow. This can be explained from the known historical trend of the relative isolation of East Asia from Europe and Africa.
4 Discussion
We present a distance-based approach for analysis of population structure, which does not entail the assumptions of an underlying model or any prior conditions. The approach is set in a network theory framework and uses the concepts of "modularity" and "community". The method allows computationally efficient assignment of individuals to subpopulations, and is also applicable in cases where many loci are studied. An additional analysis of the SAD of the communities can be used to explore the population structure beyond assignment of individuals to populations, by evaluating the strength with which individuals are associated with their assigned populations (this measure may be useful for detecting admixed individuals). Potentially, inference of genetic and ecological processes from population structure detected by this approach should be more robust than inference from structure detected by model-based approaches, since no prior conditions are assumed. Ideally, population-level studies would benefit from exploring structure using our network approach in combination with Bayesian clustering methods and visualization by multidimensional scaling, as these complement each other and may give a more robust and detailed picture.
One issue which has been a concern in model-based implementations is the assessment of the number of subpopulations, K[3, 23], as K is usually one of the model parameters. By setting K, these procedures regard the subpopulations as equivalent, even though this is often not the case. For example, for the network shown in Figure 1B, K = 3, yet the three subpopulations show very different distributions of within-population relatedness (Fig. 2). In the network-based approach there is no such issue, as most approaches for detecting community structure do not assume K a priori, but rather find the optimal K that maximizes modularity (e.g. [19, 24]), or acquire K as part of the detection process (e.g. [18, 25, 26]), without assuming any equivalence of the communities. Nevertheless, as has been demonstrated on human genetic data, using different thresholds for edge removal results in different K values (Fig. 1), and the same is observed when using different community detection algorithms (Appendix A). Since, in the analysis presented here, these are statistically significant community structures, this may reflect the fact that there is not necessarily a "correct" K value, but rather that different methods reveal structure at different hierarchical levels. Different significant community structures emerge, producing a semi-hierarchical structure, in the sense that a community partition at a given level does not depend on "higher" level partitions. True hierarchical community partition procedures[27, 28] can possibly be useful for hierarchical population structure analysis, but in most of these procedures each level is constrained by higher levels. It is important to note that the sampling scheme may also affect the number of subpopulations detected; for example, in a population with a continuous isolation-by-distance gene flow pattern, sampling at discrete locations far enough apart may result in arbitrary K values which have no biological meaning[3].
With full genome sequencing becoming more and more accessible, procedures for population structure analysis must also take into account computational considerations. The procedure presented here is composed of three consecutive steps, with construction of the network taking $O(Ln^2)$ time (n is the number of individuals in the sample and L is the number of loci involved in the analysis). Computation time for community detection depends on the algorithm used, but fast near-linear algorithms, taking $O(n + m)$ time (m is the number of edges in the network) and approaching $O(n)$ time for sparse networks, are already available[25]. SAD analysis depends on m and on the number of communities detected, c, and takes $O(cnm)$ time, approaching $O(cn^2)$ for sparse networks (constructed using high thresholds). Only the first stage involves the number of loci; therefore the computation time of the entire procedure is linear with respect to the number of loci, and there should be no computational limitation on including full genome sequences in analyses.
Since the genetic-distance measure, the threshold and the community detection algorithm remain, for now, user-defined, and may result in different population structures, care must be taken when defining these parameters, and preferably several options should be explored. Further studies may provide guidelines for setting these parameters as a function of the particulars of the system under study. Network theory, and particularly community detection, is a highly active field of research, but our understanding of the usefulness of particular community detection procedures for different types of networks is still minimal, and future advancements in network theory may provide clearer guidelines for algorithm and threshold choice.
We believe a network approach may provide an additional viewpoint on population structure analysis, one less hampered by prior assumptions. Moreover, defining population genetic problems in network terminology is important in itself since currently many tools and methods are developed within the network theory framework in order to study complex systems. These methods may become accessible to the field of population genetics once network terminology is incorporated in population genetic theory and practice.
The method presented here is implemented in the program NetStruct, which uses community detection algorithms implemented in the software package igraph[22], and is available at https://github.com/GiliG/NetStruct.