Abstract
We present an axiomatic approach for multilocus informativeness measures for determining the amount of information that a set of polymorphic genetic markers provides about individual ancestry. We then reveal several surprising properties of a decision-theoretic based measure that is consistent with the set of proposed criteria for multilocus informativeness. In particular, these properties highlight the interplay between information originating from population priors and the information extractable from the population genetic variants. This analysis then reveals a certain deficiency of mutual information based multilocus informativeness measures when such population priors are incorporated. Finally, we analyse and quantify the inevitable inherent decrease in informativeness due to learning from finite population samples.
1 Introduction
“The investigations on the foundations of geometry suggest the problem: To treat in the same manner, by means of axioms, those physical sciences in which mathematics plays an important part.”
– Hilbert 1902, p.454
The literature on inferring individual ancestry from genetic markers and the related measures of ancestry informativeness is vast and involves multiple perspectives and mathematical approaches (e.g. [Rosenberg et al., 2003]; [Ding et al., 2011]). The inferential task is conceptually linked to the uncertainty inherent in any effective classification scheme given a target set of polymorphic genetic markers from a number of source populations. However, informativeness is commonly interpreted as a property of the data independent of any particular classification heuristic; in that sense it is ideally meant to capture only the relevant aspect of informational content, which is why formulating a good measure is hardly straightforward. In this paper we adopt an essentially axiomatic approach which relies on first producing a set of appropriate and justifiable criteria that any measure should comply with. This is in contrast to previous approaches that have considered measures of informativeness in a more ad hoc fashion, without anchoring them in any rigorous framework. Moreover, following the successful approach of [Shannon, 1948] in formalizing the transfer of information, it is commonly recognized that the justification for regarding a quantity as an information measure resides in the associated mathematical theorems demonstrating operational significance ([Csiszár, 2008]). Here we aim to abide by this realization by deriving novel properties of ancestry informativeness that are also of potential practical significance.
This paper follows in the footsteps of the preliminary analysis laid out in [Tal, 2012b]. That work focused on reviewing and comparing multiple candidates for informativeness based on simple distribution divergences (e.g. the class of f-divergences), distance metrics (the Mahalanobis distance) and differentiation measures (the population-genetic FST), and on illuminating the drawbacks of each in the context of a particular set of criteria. Here, we both refine and extend that set, and focus on information-theoretic and decision-theoretic informativeness measures, which are frequently invoked in the literature and also utilized in practical inference applications. Existing approaches for deriving measures for inferring ancestry have mostly focused on single-locus measures, have not rigorously incorporated the diminishing effects on informativeness of noise resulting from finite samples, and have not appropriately accounted for the effects of source-population size discrepancies. Crucially, previous related work has not sought justification in a firm conceptual or mathematically formal framework.
Our model highlights the important aspects of ancestry informativeness given simplifying assumptions on the nature of the genetic data. Theoretical work in population genetics often utilizes haploid rather than diploid models for the sake of simplified analysis (e.g. [Carja and Feldman, 2012]). We first consider a model of haploid populations given known allele frequencies at biallelic loci from two subpopulations with known class priors. The priors represent the discrepancy in source population sizes, an aspect often incorporated in population models ([Rosenberg et al., 2003]; [Rosenberg, 2005]). Subsequently, sampling considerations enter into the analysis, and we derive enhanced informativeness measures that better reflect practical research studies.
2 The Population Model
We consider for simplicity a basic model of two haploid populations, denoted P and Q, and a set of biallelic variants from these populations (following [Tal, 2012b]). We denote by Cn an informativeness measure for ancestry inference across a set of n loci, which captures the information given by a set of polymorphisms and the knowledge of relative population sizes, and which crucially complies with a given stipulated set of criteria.
More rigorously, Cn(α, P, Q) is a measure of ancestry informativeness given a set of n biallelic markers (such as SNPs), where P and Q are vectors of known allele frequencies (p1, … , pn) and (q1, … , qn) from the respective populations, with 0 < pi < 1 and 0 < qi < 1. We (naturally) assume that pi and qi are true population parameters of polymorphic loci, i.e., each locus in each population is properly biallelic (the degrading effect of utilizing frequency estimates is introduced in a later section). A genotype sample of length n from one of the populations is then defined by an ordered sequence of n polymorphic alleles from the respective population, with a population frequency prescribed by the corresponding allele frequencies. We conveniently signify complete informativeness by an upper bound of 1, such that Cn(α, P, Q) → 1 whenever asymptotically definite classification is inherently possible.
We differentiate between a trivial locus and an uninformative one. The former case involves allele frequencies that are exactly equal between the two populations, pi = qi, while in the latter case the differentiation at that locus is below a threshold such that there is no contribution to Cn. We shall show that there exist non-trivial but uninformative loci, i.e. that inclusion of loci with frequency differences greater than zero does not always contribute to informativeness.
The model includes a population prior α, which is arbitrarily assigned to population P, such that 1 − α is the prior of population Q. This prior is interpreted as the probability that a sample belongs to population P when its genotype is unknown, and simply reflects the known discrepancy in population sizes, treated from a Bayesian perspective, as in the model of [Rosenberg et al., 2003]. In effect, if we denote by N1 and N2 the sizes of populations P and Q respectively, then α = N1/(N1 + N2). Although the full notation is Cn(α, P, Q), we will interchangeably use Cn for simplicity of notation, where contextually sufficient.
Formally, let X ∈ {0, 1} be a binary variable representing the source populations P and Q respectively, where X ~ Bernoulli(1 − α). Now let Yi ∈ {0, 1} be an allele at biallelic haploid locus i, with pi = Pr(Yi = 1|X = 0) and qi = Pr(Yi = 1|X = 1). Assuming linkage equilibrium, the genotype frequencies, hk for population P and gk for population Q, are a simple product of allele frequencies (formulated in closed form as in [Tal, 2012b]): hk = ∏i pi^fn(k,i) (1 − pi)^(1 − fn(k,i)) and gk = ∏i qi^fn(k,i) (1 − qi)^(1 − fn(k,i)), for k = 0, … , 2^n − 1, where fn(k, i) is the ith bit of k (e.g., for n = 3, h0 = (1 − p1)(1 − p2)(1 − p3), h1 = p1(1 − p2)(1 − p3), … , h7 = p1p2p3).
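The product form of the genotype frequencies can be sketched in a few lines of Python. This is an illustrative sketch only: the function name genotype_freqs and the example frequencies are assumptions, and the bit ordering (locus 1 is the least significant bit of k) is chosen so that h1 = p1(1 − p2)(1 − p3), as in the n = 3 example above.

```python
from math import prod

def genotype_freqs(freqs):
    """Return the 2**n genotype frequencies of one population under linkage equilibrium.

    freqs[i] is the frequency of allele 1 at locus i+1. Genotype k carries
    allele 1 at locus i+1 exactly when bit i of k (least significant first) is set.
    """
    n = len(freqs)
    return [
        prod(f if (k >> i) & 1 else 1.0 - f for i, f in enumerate(freqs))
        for k in range(2 ** n)
    ]

# Hypothetical allele frequencies for population P at n = 3 loci.
p = [0.3, 0.6, 0.2]
h = genotype_freqs(p)
```

Since each genotype is a distinct combination of alleles, the returned frequencies sum to one.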
2.1 The Criteria for Informativeness Measures
Here we specify a set of criteria for multilocus informativeness with justification stemming from established empirical studies and theoretical results, and from basic intuitive reasoning (subsequently elaborated on in the discussion). This set of criteria is an extended, corrected and refined reformulation following the preliminary treatment in [Tal, 2012b].
Zero: Cn = 0 if and only if the two populations are virtually the same population, i.e., across n loci, pi = qi for all i and (implicitly) the prior α = 1/2. Naturally, under a single class (population) the classification task becomes void given any set of loci, while on the other hand, any level of population structure captured by a set of loci should admit some potential for successful assignment.
Performance: Cn should be a monotonic non-decreasing function of n. Informally, each additional locus should potentially be informative given the set of loci already included in Cn, but never diminish the aggregated informativeness. This criterion is justified by the phenomenon of asymptotically perfect classification which is achievable when effectively utilizing information from the intrinsically high dimensional nature of functional data ([Delaigle and Hall, 2012]).
Asymptotics: if allele frequencies at each locus differ between populations by at least ε > 0, where ε is any predefined value as small as we wish, then Cn → 1 as n → ∞. Informally, with this Cn an infinite number of loci with even very slight frequency differences between the two populations induces complete informativeness. This criterion is strongly justified from both empirical studies ([Witherspoon et al., 2007]) with high-sequencing data and from theoretical considerations on separation in high dimensional spaces ([Edwards, 2003], [Tal, 2012a]).
Neutrality: The inclusion of trivial loci (pi = qi) should not affect Cn, as naturally the two populations are not differentiated with respect to such loci. Note that the alleles at such loci are still polymorphisms within each population, but these polymorphisms occur with (theoretically) equal frequencies.
Continuity: Cn should be continuous in pi, qi and α, as one would not expect small fluctuations in allele frequencies at any locus or in population sizes to have large effects on informativeness.
Dominance: for any finite number of loci n, we expect Cn to be maximal if and only if for some locus i the differentiation is maximal, i.e. δi = |qi − pi| → 1. Informally, any single locus with maximal allele frequency difference is sufficient for accurate assignment of any genotype: one may simply classify according to the presence or absence of a given allele at that locus. The ‘if and only if’ assures that (for any finite n) no other scenario that does not include δi → 1 would result in Cn → 1. The asymptotic limit (→ 1) here follows from the continuity criterion.
Delta: Overlapping allele-frequency differences should induce a strict ranking among loci. More precisely, when some locus i has a wider and completely overlapping allele frequency difference compared to locus k (without loss of generality, pi < pk and qi > qk), then the inclusion of locus i in Cn should result in higher or equal total informativeness vs. the inclusion of k. Naturally, the contribution to informativeness of a marker which is both rarer in one population and more common in the other population – in relation to some other marker – should always be greater.
Invariances: Naturally, we expect Cn to admit to several natural invariances and symmetries: [a] invariant to different ordering of sequenced loci, i.e., the components of the allele frequency vectors P and Q may be specified in any order, as long as they remain in synchrony; [b] symmetric with respect to the two populations, i.e., Cn(P, Q) = Cn(Q, P); [c] invariant to the arbitrary choice of the alleles to which we assign the frequency parameters – the simultaneous substitution of pi with (1 − pi) in P and qi with (1 − qi) in Q.
Prior: Cn → 1 if α → 0 or α → 1, since if the discrepancy of source population size is extremely large, the probability for correct assignment should be asymptotically 1, irrespective of the allele frequency values. In that mostly hypothetical case, one would simply assign any unknown genotype to the large population. The use of limits here results from the continuity criterion and the framework that specifies Cn < 1 and 0 < α < 1.
In formal terms:
Let P = (p1, … , pn), Q = (q1, … , qn), allele frequencies 0 < pi,qi < 1, population P prior 0 < α < 1. Cn(α, P, Q), abbreviated Cn(P, Q) or just Cn, should satisfy:
Cn = 0 iff P = Q (∀i pi = qi) and α = 1/2
Cn+1 ≥ Cn
∀ ε > 0, if |pi − qi| ≥ ε for all i, then Cn → 1 as n → ∞
pn+1 = qn+1 ⇒ Cn+1 = Cn
Cn is continuous in pi, qi and α
∀n, Cn → 1 iff |pi − qi| → 1 for some i
If without loss of generality, pn < qn, then for all ε : 0 < ε < 1 − qn, Cn(P, (q1, … , qn + ε)) ≥ Cn(P, Q) and for all ε : 0 < ε < pn, Cn((p1, … , pn − ε), Q) ≥ Cn(P, Q).
Cn(P, Q) = Cn((pσ(1), … , pσ(n)), (qσ(1), … , qσ(n))) for all permutations σ ∈ Sn
Cn(α,P,Q) = Cn(1 − α, Q,P)
Cn(P,Q) = Cn((p1, … , 1 − pi, … , pn), (q1, … , 1 − qi, … , qn))
Cn → 1 as α → 0 and as α → 1
Obviously, any measure Cn that complies with the set of criteria admits an infinite number of ‘correlated’ measures representing the degrees of freedom of Cn (just as Shannon entropy H has degrees of freedom represented by a linear factor). Formally, this sense of a correlation between two functions f and g implies that for any two sets of parameters x and y, f(x) ≤ f(y) if and only if g(x) ≤ g(y).
It is easy to show that equivalently, this implies the existence of a monotonic function h such that f = h(g). In terms of our inferential framework, this means that for any two panels of SNPs (along with corresponding population priors) represented by x and y above, correlated informativeness measures always admit the same ranking, and one is a monotonic function of the other. In Appendix F we prove that any measure correlated with Cn also complies with our set of criteria.
2.2 Informativeness based on information-theoretic concepts
A well-known information theoretic measure of shared entropy is the mutual information, also commonly interpreted and utilized as a powerful measure of statistical dependency, sensitive also to nonlinear functional relationships ([Steuer et al., 2002]). The mutual information between an allele at a single locus and the source population has been explored in the context of feature selection ([Peng et al., 2005]) and ancestry informativeness ([Rosenberg et al., 2003]).
We would like to examine this instantiation of mutual information as a candidate for our Cn. From basic definitions of mutual information and conditional probability, we utilize our assumption of linkage equilibrium within each population to express the joint multivariate distributions [Y1, … , Yn|X] and [Y1, … , Yn] in terms of the allele frequencies p(yi|X). With the population priors translating into P(X = 0) = α, P(X = 1) = 1 − α, we get (see [Tal, 2012b, Eq. 2]), I(X; Y) = Σk [α hk log(hk/mk) + (1 − α) gk log(gk/mk)], where mk = α hk + (1 − α) gk is the marginal frequency of genotype k and the sum runs over all 2^n genotypes.
To comply with an upper bound of 1 corresponding to complete informativeness, normalization is required in the formulation of Cn. The maximal value of this expression of mutual information is min{H(X), H(Y)}, but since H(X) would be the minimum of the two in all non-trivial cases, we may normalize by H(X) = −α log α − (1 − α) log(1 − α), so that Cn = I(X; Y)/H(X).
However, this candidate then fails the zero criterion (since Cn = 0 if pi = qi for all values of prior α) and most crucially fails the priors criterion, since Cn = 0 instead of 1 as α approaches 0 or 1, as illustrated in Fig. 1 (correcting Fig. 1B in [Tal, 2012b], which lacks proper normalization). The failure of the priors criterion is a characteristic of the non-normalized formulation as well, simply since H(X) is virtually zero at the prior extremes. Therefore, the deficiency exposed here similarly applies to the non-normalized informativeness for assignment measure, denoted In from [Rosenberg et al., 2003]. We shall therefore henceforth refer to the mutual information-based informativeness as In.
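The behaviour described above can be checked numerically. The sketch below assumes the normalized candidate is I(X; Y)/H(X), with I(X; Y) computed as H(Y) − H(Y|X) from the genotype frequencies; the helper names and the example allele frequencies are illustrative assumptions.

```python
from math import log2, prod

def genotype_freqs(freqs):
    """2**n genotype frequencies under linkage equilibrium (locus 1 = lowest bit)."""
    return [
        prod(f if (k >> i) & 1 else 1.0 - f for i, f in enumerate(freqs))
        for k in range(2 ** len(freqs))
    ]

def entropy(dist):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(x * log2(x) for x in dist if x > 0)

def normalized_mi(alpha, p, q):
    """I(X; Y) / H(X) for prior alpha on population P (assumed normalization)."""
    h, g = genotype_freqs(p), genotype_freqs(q)
    mix = [alpha * hk + (1 - alpha) * gk for hk, gk in zip(h, g)]
    h_y_given_x = alpha * entropy(h) + (1 - alpha) * entropy(g)
    return (entropy(mix) - h_y_given_x) / entropy([alpha, 1 - alpha])

p, q = [0.3, 0.6], [0.1, 0.4]          # hypothetical two-locus panel
balanced = normalized_mi(0.5, p, q)     # equal priors
extreme = normalized_mi(1e-6, p, q)     # strongly unequal priors
same_pop = normalized_mi(0.3, p, p)     # identical populations, unequal prior
```

In this example the value at the extreme prior is smaller than at equal priors, whereas the priors criterion asks for it to approach 1, and identical populations give zero even though α ≠ 1/2.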
A related information theoretic measure is the variation of information (VI). It has been used as a criterion for comparing two partitions or clusterings of the same data set, and measures the amount of information lost and gained in changing from one clustering to another ([Meila, 2007]). The measure is defined as the difference between the joint entropy and the mutual information, and has the benefit of being a true metric.
Expressing VI between the population and genotype distributions in terms of our variables and incorporating the priors we have,
As with the previous candidate, we normalize by the maximal value attained by VI, which here is the joint entropy. Therefore,
However, this candidate fails the important dominance criterion, irrespective of the normalization factor chosen (in fact, Cn decreases as the absolute allele frequency difference increases at a locus), as illustrated in Fig. 2.
Some intuition about the deficiency of mutual information between the class and genotype in capturing informativeness when population priors are incorporated is gained by noting that mutual information represents the reduction of uncertainty in one random variable when knowing the other. Typically, in the ancestry inference framework that occupies us here, this means that I(X; Y) is the (average) reduction of uncertainty about the source population X, when knowing the genotype at n loci Y = [Y1, … , Yn]. By symmetry of mutual information, this inevitably also represents the reduction of uncertainty about the genotype at n loci when knowing the source population X, but since typically H(Y) ≫ H(X) for n ≫ 1, this reduction in H(Y) is relatively inconsequential. The absolute amount of reduction in source population uncertainty strictly depends on the initial amount of its uncertainty, which is encapsulated by the prior α. This explains why the priors criterion fails with In: as the prior approaches the extremes of 0 or 1, H(X) → 0 and consequently the reduction in this quantity also approaches zero. But this scenario represents virtual certainty about inferring the source population, such that In should have approached 1 instead. This basic insight is schematically depicted in Fig. 3. The normalized formulation similarly fails this criterion, since, and similarly
3 Cn Based on a Naïve Bayes Classifier
The optimal classifier under known class-conditional densities is the Bayes classifier (with class priors) or alternatively the maximum-likelihood (ML) classifier, where data are classified to the most probable class. The Bayes error then is the expected error of this classifier, usually under a 0/1 risk function ([Hastie et al., 2009]). Here, we wish to derive a simple formulation of the Bayes error in the context of our population genetic model. The assumption of linkage equilibrium within each source population corresponds to within-class stochastic independence, and motivates the use of the popular naïve Bayes classifier. The genotype frequencies are then simply the product of population-conditional allele frequencies across an independent set of loci (as in [Cornuet et al., 1999]; [Phillips et al., 2007]; [Tal, 2012b]). A similar approach for a decision-theoretic informativeness is the multilocus version of the optimal rate of correct assignment (ORCA) with general priors from [Rosenberg et al., 2003]. From basic definitions, the Bayes error for discrete data from two classes can be expressed as a prior-weighted sum of probabilities over all instances of the data ([Hastie et al., 2009]). For our framework, this translates to a sum over the 2^n possible genotypes, indexed by k, E = Σk min{α hk, (1 − α) gk}, where the genotype frequencies gk and hk are defined in Eq.(2.1). Since for two classes the Bayes error is bounded below by 0 and above by 1/2, a straightforward transformation equivalent to proper normalization is Cn = 1 − 2E. Thus, Cn = 1 − 2 Σk min{α hk, (1 − α) gk}.
It is possible to derive an equivalent formulation based on the variational distance, a form of f-divergence ([Nguyen et al., 2009, section 2.1.1]), with a natural modification to incorporate our priors (see Appendix A for proof), Cn = Σk |α hk − (1 − α) gk|, where the genotype frequencies are again as in Eq. (2.1).
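The equivalence of the two formulations can be illustrated numerically. The sketch below assumes the Bayes error takes the standard two-class form Σk min{α hk, (1 − α) gk} with Cn = 1 − 2 × error, and that the prior-weighted variational form is Σk |α hk − (1 − α) gk|; function names and example values are illustrative assumptions.

```python
from math import prod

def genotype_freqs(freqs):
    """2**n genotype frequencies under linkage equilibrium (locus 1 = lowest bit)."""
    return [
        prod(f if (k >> i) & 1 else 1.0 - f for i, f in enumerate(freqs))
        for k in range(2 ** len(freqs))
    ]

def bayes_cn(alpha, p, q):
    """Cn as one minus twice the Bayes error (assumed normalization)."""
    h, g = genotype_freqs(p), genotype_freqs(q)
    bayes_error = sum(min(alpha * hk, (1 - alpha) * gk) for hk, gk in zip(h, g))
    return 1 - 2 * bayes_error

def variational_cn(alpha, p, q):
    """Cn as a prior-weighted variational distance between genotype distributions."""
    h, g = genotype_freqs(p), genotype_freqs(q)
    return sum(abs(alpha * hk - (1 - alpha) * gk) for hk, gk in zip(h, g))

p, q = [0.3, 0.6, 0.2], [0.1, 0.4, 0.5]   # hypothetical three-locus panel
gap = abs(bayes_cn(0.3, p, q) - variational_cn(0.3, p, q))
```

The two values agree because a + b − 2 min{a, b} = |a − b| for each genotype, and the prior-weighted frequencies α hk and (1 − α) gk together sum to one.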
Finally, we would like to produce a general formulation of Cn which separates the classifier function (denoted here Dk, with a corresponding indicator function) from the error rate formalism. This will serve us in subsequent sections for deriving other decision-theoretic forms of Cn. Following Eq. (3.2) we can write, where (in case of equal distances, Dk = 0, we arbitrarily choose to classify to population P). For the naïve Bayes, Dk simply compares the genotype probabilities weighted by the priors,
It is proven in Appendix B that these equivalent formulations of Cn satisfy the complete set of informativeness criteria.
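Several of the criteria can be spot-checked numerically for a single panel. This is not a substitute for the proofs; it is a sketch that assumes Cn = Σk |α hk − (1 − α) gk| and uses hypothetical allele frequencies and helper names.

```python
from math import isclose, prod

def genotype_freqs(freqs):
    """2**n genotype frequencies under linkage equilibrium (locus 1 = lowest bit)."""
    return [
        prod(f if (k >> i) & 1 else 1.0 - f for i, f in enumerate(freqs))
        for k in range(2 ** len(freqs))
    ]

def cn(alpha, p, q):
    """Bayes-error informativeness in its assumed variational form."""
    h, g = genotype_freqs(p), genotype_freqs(q)
    return sum(abs(alpha * hk - (1 - alpha) * gk) for hk, gk in zip(h, g))

p, q, alpha = [0.3, 0.6], [0.1, 0.4], 0.3   # hypothetical panel and prior
base = cn(alpha, p, q)

checks = {
    # Zero: identical populations give 0 only when the prior is 1/2.
    "zero": cn(0.5, p, p) == 0 and cn(alpha, p, p) > 0,
    # Performance: an extra locus never lowers Cn.
    "performance": cn(alpha, p + [0.2], q + [0.7]) >= base - 1e-12,
    # Neutrality: a trivial locus (equal frequencies) leaves Cn unchanged.
    "neutrality": isclose(cn(alpha, p + [0.5], q + [0.5]), base),
    # Invariance [b]: swapping populations together with their priors.
    "symmetry": isclose(cn(1 - alpha, q, p), base),
    # Invariance [c]: relabelling the alleles at locus 1.
    "relabel": isclose(cn(alpha, [1 - p[0]] + p[1:], [1 - q[0]] + q[1:]), base),
    # Prior: Cn approaches 1 at an extreme prior.
    "prior": cn(1e-9, p, q) > 0.999999,
}
```

Each entry of checks evaluates to True for this panel; a failure for some other panel would point to a bug in the sketch rather than in the criteria.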
In fact, it is possible to derive a parametrized family of measures of infinite cardinality, based on a specific generalization of the variational distance that is also compliant with the full set of criteria (proof in Appendix G). Formally, denote this family of measures by , parametrized by the integer k : 1, … , ∞, where p = (p1, … , pn), q = (q1, … ,qn), and
Note that .
Crucially, it can be shown that for any k, is qualitatively different or uncorrelated with the Bayes-based Cn, i.e., is not a monotonic function of it (Appendix H). The compliance of the family of measures with all criteria therefore implies that the Bayes-error Cn is not a unique solution in the context of our axiomatic framework for an informativeness measure.
The benefit of the Bayes-error Cn over any instantiation from the infinite class derives from the former’s simplicity in both formulation and computational cost in comparison to the latter. Therefore, we will henceforth focus our analysis on the simpler formulation.
We now reveal a core discrepancy between our Bayes-error Cn and the information-theoretic In. We first demonstrate that Cn and In are not correlated (i.e., no function h exists such that I1 = h(C1)) even for the most rudimentary case of n = 1 and equal priors. To see this, denote I1 by f(x, y) and C1 by g(x, y), both a function only of the two allele frequencies x and y at a single locus (setting α = 0.5), such that f(x, y) = Hb((x + y)/2) − [Hb(x) + Hb(y)]/2 and g(x, y) = |x − y|, where Hb(t) = −t log2 t − (1 − t) log2(1 − t) is the binary entropy function.
To prove f and g are not correlated we only need to identify two points (p, q) and (p′, q′) such that g(p, q) = g(p′, q′) but f(p, q) ≠ f(p′, q′). It is easily verified that one such instance is (p, q) = (0.3, 0.1) and (p′, q′) = (0.3, 0.5). This lack of correlation even for the most rudimentary case strongly implies non-correlation in the general sense of n ≥ 1. Crucially, simulations indeed indicate many cases of discrepancy between Cn and In in their ranking for informativeness of particular marker panels and associated values of the population prior. In such cases, the relative ranking given by Cn or In to these panels is almost surely prior-dependent (as illustrated by one such typical scenario in Fig. 4).
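The stated instance can be verified directly. The sketch below assumes equal priors and logarithms to base 2, under which the single-locus mutual information reduces to a difference of binary entropies and C1 reduces to |x − y|; the function names are illustrative.

```python
from math import isclose, log2

def binary_entropy(t):
    """Entropy in bits of a Bernoulli(t) variable, for 0 < t < 1."""
    return -t * log2(t) - (1 - t) * log2(1 - t)

def f(x, y):
    """Single-locus mutual information I1 at alpha = 0.5 (assumed closed form)."""
    return binary_entropy((x + y) / 2) - (binary_entropy(x) + binary_entropy(y)) / 2

def g(x, y):
    """Single-locus Bayes-error informativeness C1 at alpha = 0.5."""
    return abs(x - y)

same_g = isclose(g(0.3, 0.1), g(0.3, 0.5))   # both equal 0.2
f_first, f_second = f(0.3, 0.1), f(0.3, 0.5)
```

Here g takes the same value at both points while f is roughly 0.047 at the first and 0.030 at the second, so no function h can satisfy f = h(g).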
3.1 Several distinctive properties of the decision-theoretic informativeness
In this section we reveal an insightful set of properties of the Bayes-based Cn, which especially highlight a unique balance between information originating from the genetic data vs. information attributable to population size discrepancy, represented by the priors (see Appendix C for the proofs of these properties).
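Loci-subadditivity (w/out priors): The sum of the informativeness of two sequences of loci cannot be lower than the informativeness of the aggregated sequence, or put more simply in terms of an ordered-list concatenation operator ║, Cn+m(P ║ P′, Q ║ Q′) ≤ Cn(P, Q) + Cm(P′, Q′), where (P, Q) and (P′, Q′) are the allele frequency vectors of the two sequences of n and m loci respectively.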
For n = m = 1 the strength of this subadditivity (the inequality above) reflects the redundancy in informativeness within a pair of individual loci. We note that when population priors are incorporated in Cn, this property no longer necessarily holds.1 We may nevertheless allow for priors by introducing a correction term | 1 − 2α|, at the expense of overstretching somewhat the meaningful sense of proper subadditivity (see proof in Appendix C).
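A randomized check of loci-subadditivity under equal priors might look as follows. It assumes Cn = Σk |hk − gk|/2 at α = 1/2 and uses list concatenation for ║; the names, the seed and the sampling ranges are illustrative assumptions.

```python
import random
from math import prod

def genotype_freqs(freqs):
    """2**n genotype frequencies under linkage equilibrium (locus 1 = lowest bit)."""
    return [
        prod(f if (k >> i) & 1 else 1.0 - f for i, f in enumerate(freqs))
        for k in range(2 ** len(freqs))
    ]

def cn_equal_priors(p, q):
    """Bayes-error informativeness at alpha = 0.5 (assumed variational form)."""
    h, g = genotype_freqs(p), genotype_freqs(q)
    return sum(abs(0.5 * hk - 0.5 * gk) for hk, gk in zip(h, g))

def random_freqs(n, rng):
    """n allele frequencies drawn away from the fixation boundaries."""
    return [rng.uniform(0.01, 0.99) for _ in range(n)]

rng = random.Random(1)
violations = 0
for _ in range(1000):
    n, m = rng.randint(1, 3), rng.randint(1, 3)
    p1, q1 = random_freqs(n, rng), random_freqs(n, rng)
    p2, q2 = random_freqs(m, rng), random_freqs(m, rng)
    joint = cn_equal_priors(p1 + p2, q1 + q2)          # concatenated panel
    if joint > cn_equal_priors(p1, q1) + cn_equal_priors(p2, q2) + 1e-12:
        violations += 1
```

No violations are expected here, since at equal priors the measure coincides with the total variation distance, which is subadditive over independent coordinates.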
Population-subadditivity (w/out priors): Cn complies with a triangle inequality between any three populations. Formally, for any n loci given three populations P, Q and R, Cn(P, R) ≤ Cn(P, Q) + Cn(Q, R).
This property affords the interpretation of informativeness as a distance metric between populations, relative to any given number of loci (n). We note here that (contrary to the erroneous claim in [Tal, 2012b]) subadditivity does not generally hold with general priors,2 where the three priors are derived from relative population sizes N1, N2 and N3 of P, Q and R respectively such that α = N1/(N1 + N2), β = N2/(N2 + N3), γ = N1/(N1 + N3), and thus any two priors define the third, e.g., γ = αβ/(αβ + (1 − α)(1 − β)).
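One way to see that two of the priors determine the third is to work with ratios of population sizes. The short derivation below expresses γ through α and β; it follows only from the three definitions above and is given as an illustration.

```latex
\begin{align*}
\frac{N_2}{N_1} &= \frac{1-\alpha}{\alpha}, \qquad
\frac{N_3}{N_2} = \frac{1-\beta}{\beta}, \\
\frac{N_3}{N_1} &= \frac{N_2}{N_1}\cdot\frac{N_3}{N_2}
                 = \frac{(1-\alpha)(1-\beta)}{\alpha\beta}, \\
\gamma &= \frac{N_1}{N_1+N_3}
        = \frac{1}{1 + N_3/N_1}
        = \frac{\alpha\beta}{\alpha\beta + (1-\alpha)(1-\beta)}.
\end{align*}
```

As a sanity check, three populations of equal size give α = β = 1/2 and hence γ = 1/2.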
Prior-washout: The effect of the prior is washed out with additional loci; i.e., the effect on informativeness from the consideration of population priors generally diminishes with more loci. Formally, the range for Cn as a function of α (0 < α < 1) never increases with n → n + 1, i.e. the minimal value of Cn(α) increases or is unchanged (note this does not imply that for any two priors and any n, |Cn+1(α1) − Cn+1(α2)| ≤ |Cn(α1) − Cn(α2)|). Moreover, asymptotically with n the prior is completely washed out becoming uninformative.
Prior-sensitivity: Cn as a function of the prior α is convex and the minimum is unique almost surely (and Cn is not necessarily minimal at α = 1/2). This means that Cn is always sensitive to fluctuations in the prior.
Prior-dominance: There is a particular asymmetric balance between information stemming from the population priors and information from the genetic markers. For any given allele frequencies at n loci, there exist thresholds for prior disparity, beyond which the information provided by the priors eclipses the information provided by genetic markers. At that level of disparity and beyond, Cn [a] is only a function of the prior α, [b] is insensitive to small fluctuations of allele frequencies, and [c] is ultimately unaffected by the exclusion of any or all n loci. In consequence, for any panel of markers there always exists a level of prior disparity (as a function of the allele frequencies) beyond which an effective classifier may ignore the given genotype sample, simply assigning it to the population with the larger prior.
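One such threshold can be made explicit under the assumption that Cn = Σk |α hk − (1 − α) gk|. If α hk ≥ (1 − α) gk for every genotype k, the absolute values drop out and Cn = 2α − 1 regardless of the allele frequencies. The sketch below computes the smallest prior for which this holds on the side favouring population P (the side favouring Q is symmetric); the function names and example panel are illustrative assumptions.

```python
from math import isclose, prod

def genotype_freqs(freqs):
    """2**n genotype frequencies under linkage equilibrium (locus 1 = lowest bit)."""
    return [
        prod(f if (k >> i) & 1 else 1.0 - f for i, f in enumerate(freqs))
        for k in range(2 ** len(freqs))
    ]

def cn(alpha, p, q):
    """Bayes-error informativeness in its assumed variational form."""
    h, g = genotype_freqs(p), genotype_freqs(q)
    return sum(abs(alpha * hk - (1 - alpha) * gk) for hk, gk in zip(h, g))

def prior_threshold(p, q):
    """Smallest alpha with alpha*h_k >= (1-alpha)*g_k for every genotype k.

    The largest ratio g_k/h_k factorizes over loci, so it is the product of
    the per-locus maxima of q_i/p_i and (1-q_i)/(1-p_i).
    """
    r = prod(max(qi / pi, (1 - qi) / (1 - pi)) for pi, qi in zip(p, q))
    return r / (1 + r)

p, q = [0.3, 0.6], [0.1, 0.4]       # hypothetical two-locus panel
threshold = prior_threshold(p, q)   # about 0.659 for this panel
beyond = cn(0.8, p, q)              # prior above the threshold
one_locus = cn(0.8, p[:1], q[:1])   # same prior, second locus excluded
no_loci = cn(0.8, [], [])           # same prior, all loci excluded
```

For this panel all three values at α = 0.8 equal 2α − 1 = 0.6, illustrating that beyond the threshold the loci add nothing to the prior.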
Uninformative-loci: There always exists some such that the inclusion of an extra locus with an allele frequency difference does not change Cn.3 Formally, for any P and Q with n loci, and for any prior α, without loss of generality,
We would next like to inquire whether a formulation of Cn based on the error rate of any valid classifier may also satisfy our axiomatic framework, or perhaps not all classifiers are equivalent in this sense.
4 Cn based on a Nearest Centroid (NC) Classifier
Whereas a Bayes classifier is based on the probability of observing a given genotype in a target population, distance methods assign a genotype to the “closest” population ([Liao et al., 2009], [Degen et al., 2017]). In particular, we will require a genetic distance metric adopted as a measure between an individual and a population, or rather, the population genetic centroid. Importantly, these centroids are unique - defined irrespective of the distance measure one employs, and irrespective of the presence or absence of LD in the data or model.
In practical settings, the centroids are learned from the training data of m individual samples, x1, … , xm and computed using a simple arithmetic average: M = (x1 + ⋯ + xm)/m, (simulated in Fig. 5) or in terms of our model, M = [E(Y1|X), … , E(Yn|X)] such that, M1 = (p1, … , pn) and M2 = (q1, … , qn).
The distance between any genotype xk, defined here as the k-th coordinate in {0,1}^n, and centroid M1 depends on the distance metric we choose to employ. Using a Euclidean metric, the distance to the centroid is d(xk, M1) = [Σi (fn(k,i) − pi)^2]^(1/2), where as before, fn(k,i) := the ith bit of k. Alternatively, using a Manhattan distance or L1 norm, the distance to the centroid would be d(xk, M1) = Σi |fn(k,i) − pi|.
In the context of a genotypic space, the L1 norm is equivalent to the allele-sharing distance (ASD), and is therefore more appropriate for our framework than the Euclidean; we therefore confine our development to the former (the Euclidean distance is the square root of the ASD only for distances between genotypes but not for distances to centroids). The classifier term is then simply a difference of the distances of a genotype to the two centroids, i.e.,
However, Cn based on such a classifier does not comply with the crucial dominance criterion for any n ≥ 3, as illustrated by counter-example in Fig. 6.
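A nearest-centroid informativeness under the L1 metric can be sketched as below. The sketch makes several assumptions: a genotype is assigned to P when its L1 distance to M1 is no larger than its distance to M2 (ties go to P), the priors enter only the error rate and not the classifier, and the error rate is mapped to Cn as 1 − 2 × error. All names and example values are illustrative.

```python
from math import isclose, prod

def genotype_freqs(freqs):
    """2**n genotype frequencies under linkage equilibrium (locus 1 = lowest bit)."""
    return [
        prod(f if (k >> i) & 1 else 1.0 - f for i, f in enumerate(freqs))
        for k in range(2 ** len(freqs))
    ]

def l1_to_centroid(k, centroid):
    """L1 distance between genotype k (read bitwise) and a centroid of frequencies."""
    return sum(abs(((k >> i) & 1) - c) for i, c in enumerate(centroid))

def nc_cn(alpha, p, q):
    """Informativeness from the error rate of the L1 nearest-centroid rule."""
    h, g = genotype_freqs(p), genotype_freqs(q)
    error = 0.0
    for k in range(len(h)):
        to_p = l1_to_centroid(k, p) <= l1_to_centroid(k, q)   # ties go to P
        # A genotype assigned to P is an error when it came from Q, and vice versa.
        error += (1 - alpha) * g[k] if to_p else alpha * h[k]
    return 1 - 2 * error

def bayes_cn(alpha, p, q):
    """Bayes-error informativeness, for comparison (assumed variational form)."""
    h, g = genotype_freqs(p), genotype_freqs(q)
    return sum(abs(alpha * hk - (1 - alpha) * gk) for hk, gk in zip(h, g))

single = nc_cn(0.5, [0.2], [0.8])                 # one locus, equal priors
p, q = [0.3, 0.6, 0.2], [0.1, 0.4, 0.5]           # hypothetical panel
nc_value, bayes_value = nc_cn(0.5, p, q), bayes_cn(0.5, p, q)
```

Because the Bayes rule minimizes the error rate, the nearest-centroid value can never exceed the Bayes-error value for the same panel and prior.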
The underlying problem is that the L1 metric does not take into account the variances of the genotype distribution of each population along the direction from any genotype to its population centroid. To overcome this issue, we use normalization that captures the distance to the mean in units of standard deviations, as in [Liao et al., 2009] and [Patterson et al., 2006]. In a Euclidean space, the Mahalanobis distance is the distance of a test point to the centroid divided by the width of the ellipsoid in the direction of the test point ([Hastie et al., 2009]), given by a variance-covariance matrix Σ. In terms of the distance of a binary vector x to mean vector μ, DM(x) = [(x − μ)ᵀ Σ⁻¹ (x − μ)]^(1/2).
Similarly, we define a corresponding ASD-Mahalanobis (ASD-M) distance,
The assumption of linkage equilibrium implies a diagonal covariance matrix (comprising only variances) so that the distance of xk to M1 for our two metrics is, such that our classifier function Dk is
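For the Euclidean case, a diagonal covariance matrix reduces the Mahalanobis distance to a variance-scaled sum over loci. The sketch below assumes each locus is a Bernoulli variable with variance μi(1 − μi), where μi is the centroid coordinate; it covers only the standard Mahalanobis form, not the ASD-M variant, and the function name is illustrative.

```python
from math import isclose, sqrt

def mahalanobis_to_centroid(genotype, centroid):
    """Mahalanobis distance of a 0/1 genotype to a centroid, diagonal covariance.

    The variance at each locus is taken to be mu*(1 - mu), the variance of a
    Bernoulli allele with frequency mu (requires 0 < mu < 1).
    """
    return sqrt(
        sum((x - mu) ** 2 / (mu * (1 - mu)) for x, mu in zip(genotype, centroid))
    )

# Two loci with frequency 0.5: each locus contributes 0.25 / 0.25 = 1.
example = mahalanobis_to_centroid([1, 0], [0.5, 0.5])
# A rare allele (frequency 0.1) is further from the centroid when present.
rare_present = mahalanobis_to_centroid([1], [0.1])
rare_absent = mahalanobis_to_centroid([0], [0.1])
```

Carrying a rare allele places a genotype three standard deviations from the centroid in this example, against one third of a standard deviation when the allele is absent.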
Finally, we would like to incorporate the class priors into our classifier function Dk. Since priors do not naturally enter a distance classifier as they do a Bayesian posterior, we approach this problem by reformulating the expression of Dk for the Bayes and NC classifiers in a similar fashion. The Bayes classifier from Eq.(3.5) can be reformulated as, whereas the NC classifier from Eq. (4.1) as,
From this we derive the function that best reflects the effect of the prior as −log .
Interestingly, this alternative formulation of Dk provides insight into why different classifiers deviate from the optimal Bayesian, at least for multidimensional binary distributions; in the case of the NC classifier, this is due to being only a rough approximation of − log pi.
Finally, we transform the resulting error rate to an expression for Cn as in Eq. (3.4), where the genotype frequencies are as in Eq. (2.1).
Nevertheless, despite the use of a normalized metric in the NC classifier, the distance-based Cn does not comply with all required criteria (see Appendix D). We briefly remark here that this deficiency is most probably a characteristic of many other non-optimal classifiers. For instance, we have examined a simple information-theoretic classifier based on the notion of typical sequences from [Tal et al., 2017]. Using a closed-form formulation of the error rate of the cross-entropy classifier (their Appendix A.2) and allowing for class priors, it is easy to show by counter example that at least two of our criteria are not met: the performance criterion (e.g. pi = 0.1/qi = 0.3 fails at n = 5) and the neutrality criterion (e.g. pi = 0.03/qi = 0.2, n : 1… 6 and pi = 0.1/qi = 0.1 for n > 6 fails at n = 11).
5 Sampling Effects
Our information measure has been defined parametrically, from the underlying properties of genotype distributions across two populations. In practice, however, researchers employ estimates made from sampled data rather than parametric allele frequencies. Simulations of a variety of classification methods on genetic data show that performance is always degraded with smaller population samples, most notably given low values of differentiation ([Cornuet et al., 1999]; [Rosenberg, 2005]; [Tal, 2012a]). Nevertheless, there is information available for classification under these more restrictive circumstances, which we would like to quantify.
The test error, also referred to as generalization error, is the prediction error over an independent test sample. Here the training set is fixed, and test error refers to the error for this specific training set. A related quantity is the expected prediction error (or expected test error). In practice, estimation of test error conditional on a particular training set is in general difficult, given just the data from that training set. Instead, cross-validation and related methods may provide reasonable estimates of the expected test error ([Hastie et al., 2009]). We therefore extend the information measure to encompass these limiting circumstances.
Denote by Cn,m the informativeness under a sampling scenario, effectively an expectation over all training samples of size m, where m = {m1,m2} indicates sample sizes from the two populations. The introduction of this sampling framework calls for a new set of criteria, which are formulated in reference to Cn to replace the existing criteria.
[1*]Sampling-Effect: For any sample size m, Cn,m ≤ Cn. This requirement is justified by established observations and theoretical results on reduced classification performance with smaller samples (e.g., [Cornuet et al., 1999]; [Rosenberg, 2005]).
[2*]Sampling-Convergence: As sample size m increases Cn,m → Cn. This requirement is justified by common observations on the effect of increasing sample size ([Cornuet et al., 1999]; [Rosenberg, 2005]), and is also motivated by the consistency property of an estimator, namely that it approaches the underlying parameter being estimated with probability approaching one as the sample size becomes large.
The introduction of sampling effects requires a reformulation of Cn that distinguishes frequency estimates from the parametric frequencies. Essentially, the former are used in the expression for the classifier, while the latter are used in the expectation of the conditional error, and thus we express a classifier module that is separate from the averaging process. Instead of the misclassification rate, we require a conditional test error (the expected error of a classifier over an independent test sample, given a particular training sample) and the expected test error (an average of the former over all training samples of size m) ([Hastie et al., 2009]). Our goal is to express the expected test error for the Bayesian and nearest-centroid classifiers in closed form.
Let m1 and m2 denote the given sample sizes from the two haploid biallelic populations. Let Xi and Yi be the average number of allele ‘A’ in locus i in training samples of size m1 and m2 respectively. The maximum likelihood estimator (MLE) of the true allele frequencies is an unbiased estimator,4 i.e., E[Xi] = pi and E[Yi] = qi.
The distributions of the sample means (allele frequency estimates) Xi and Yi for populations P and Q respectively are binomial,5 in the sense that the allele counts satisfy m1Xi ~ Bin(m1, pi) and m2Yi ~ Bin(m2, qi).
The expected test error under all training samples of size m = {m1, m2} is the expectation with respect to the sampling distribution of the allele frequencies of the conditional test error Cn(X, Y), which is conditional on a particular sample of size m of frequency values X and Y, where cn follows the general formulation in Eq. (3.4),
And where the classifier compares genotype sample probabilities given a particular sample, following the Bayes classifier formulation of Eq. (3.5), where hk,gk,fn(k,i) are defined in Eq. (2.1).
Cn,m indeed complies with the two new criteria (see Appendix E for proof).
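The sampling framework above can be illustrated numerically. The following is a minimal sketch, not the closed-form expression of the paper: it assumes Cn = 1 − 2·(Bayes error) = Σz |αFn(p, z) − (1 − α)Fn(q, z)|, a plug-in Bayes rule built from the sample frequencies, and exact enumeration over binomial training samples (feasible only for very small n and m). All function names are hypothetical.

```python
from itertools import product
from math import comb


def genotype_prob(freqs, z):
    """Probability of haploid genotype z (tuple of 0/1) under independent loci."""
    prob = 1.0
    for f, bit in zip(freqs, z):
        prob *= f if bit == 1 else 1.0 - f
    return prob


def bayes_c(alpha, p, q):
    """Parametric Cn, assumed here to be sum_z |alpha*F(p,z) - (1-alpha)*F(q,z)|."""
    return sum(
        abs(alpha * genotype_prob(p, z) - (1 - alpha) * genotype_prob(q, z))
        for z in product((0, 1), repeat=len(p))
    )


def conditional_c(alpha, p, q, x, y):
    """1 - 2*(test error) of the plug-in rule built from the estimates x and y."""
    error = 0.0
    for z in product((0, 1), repeat=len(p)):
        # The classifier only sees the estimated frequencies x and y.
        assign_to_p = alpha * genotype_prob(x, z) >= (1 - alpha) * genotype_prob(y, z)
        # The error itself is weighted by the true frequencies p and q.
        if assign_to_p:
            error += (1 - alpha) * genotype_prob(q, z)
        else:
            error += alpha * genotype_prob(p, z)
    return 1.0 - 2.0 * error


def binom_pmf(m, f, count):
    """Binomial probability of observing `count` copies of the allele in m draws."""
    return comb(m, count) * f**count * (1.0 - f) ** (m - count)


def expected_c(alpha, p, q, m1, m2):
    """Exact expectation of conditional_c over all training samples of size (m1, m2)."""
    n = len(p)
    total = 0.0
    for counts_p in product(range(m1 + 1), repeat=n):
        weight_p = 1.0
        for i in range(n):
            weight_p *= binom_pmf(m1, p[i], counts_p[i])
        x = [c / m1 for c in counts_p]
        for counts_q in product(range(m2 + 1), repeat=n):
            weight_q = 1.0
            for i in range(n):
                weight_q *= binom_pmf(m2, q[i], counts_q[i])
            y = [c / m2 for c in counts_q]
            total += weight_p * weight_q * conditional_c(alpha, p, q, x, y)
    return total
```

Because no plug-in rule can beat the Bayes rule, every conditional value is bounded above by the parametric Cn, so the expectation returned by `expected_c` should never exceed `bayes_c`, in line with the sampling-effect criterion.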
6 Discussion
Much of the work on elucidating factors contributing to the informativeness of genetic markers for ancestry inference has focused only on a small number of core factors, namely, higher informativeness with additional loci, with wider source population divergence, and with a larger population sample size in the learning phase ([Estoup and Angers, 1998]; [Cornuet et al., 1999]; [Edwards, 2003]; [Rosenberg et al., 2003]; [Wang, 2006]; [Witherspoon et al., 2007]; [Ding et al., 2011]; [Tal, 2012a]; [Tal, 2012b]). A central goal of this paper was to provide a formal grounding for an effective decision-theoretic measure compliant with an extended set of such factors framed axiomatically at the outset, subsequently revealing a host of novel intrinsic properties of both theoretical importance and practical relevance. Although one normally proceeds by explicitly deriving a measure from a set of axioms, we were constrained by the sheer cardinality of this set, adopting a top-down approach instead: demonstrating that a prospective informativeness measure adheres to all criteria. Naturally, in any axiomatic framework one strives to include some minimal set of consistent and independent criteria, such that no criterion contradicts, or logically follows from, a combination of others ([Rodin, 2014]). The consistency of our framework follows in retrospect from the demonstration of at least one candidate in full compliance (the Bayes-based Cn). The independence aspect is harder to make formal but should be sufficiently intuitively captured from the basic formulations.
The information-theoretic measure
Mutual Information based measures have been widely used in both feature selection ([Battiti, 1994]; [Last et al., 2001]; [Grall-Maes and Beauseroy, 2002]; [Huang and Chow, 2005]; [Huang and Rong, 2009]), most effectively exemplified in the Max-Dependency principle ([Peng et al., 2005]), and in deriving ancestry informativeness measures ([Rosenberg et al., 2003]). The information-theoretic approach has origins more generally in the infomax principle, where a subset of features Yθ should be chosen so that the mutual information between the features and the class label X, I(Yθ;X), is maximized (equivalently, that the conditional entropy H(X|Yθ) is minimized). Some formal justification for the relevance of this principle to classification tasks is given by Fano’s inequality, which relates the conditional entropy between the source and destination in a communication channel to the transmission error ([Zhao et al., 2013]). As a result, mutual information can be used to express both an upper and lower bound for the Bayes error rate ([Brown et al., 2012]).
Nevertheless, much of the justification for employing mutual information in ancestry informativeness or feature selection is simply based on its usefulness in practice. For instance, [Peng et al., 2005] utilize a multivariate mutual information for feature selection since “minimal error usually requires the maximal statistical dependency of the target class c on the data distribution in the subspace Rm (and vice versa)”. This approach is effectively epitomized by a recent analysis of filters for feature selection, where the authors explain that “an intuitive [filter] J would be some measure of correlation between the feature and the class label-the intuition being that a stronger correlation between these should imply a greater predictive ability when using the feature” ([Brown et al., 2012]).
Here we have shown that multilocus ancestry informativeness measures based on mutual information that also account for population priors (such as the multilocus version of In proposed in [Rosenberg et al., 2003]) do not properly incorporate the information available from such priors. In the context of our axiomatic framework, this amounts to non-compliance with the priors criterion, which requires any measure of informativeness to approach its maximum when priors approach their extremes (reflecting a high discrepancy in source population sizes). This is of significance since informativeness ultimately represents the value of information given by the data (loci and priors) that an effective classifier can utilize in making correct inferences. In essence, it is a proxy for the performance of effective classifiers in a given setting, the level of certainty they can achieve. Indeed, we have shown that the decision and information theoretic measures may give different relative rankings to SNP panels under certain values of the priors. This ranking discrepancy is incongruent with the claim in [Rosenberg et al., 2003] that ORCA (analogous to our Bayes-based Cn) and In are highly correlated and produce similar estimates of panel ranking so that one ‘can proceed using only one statistic’. This is most likely a consequence of our focus here on multilocus rather than single-locus informativeness, along with the incorporation of population priors. Indeed, numerical simulations indicate that the ranking discrepancy between Cn and In is accentuated with a larger panel size (n) and with greater inequality in population priors – a rigorous analysis of which is not further pursued here and is left for future work.
Properties of Cn
We have revealed properties of the Bayes-based Cn that broadly pertain to two important aspects of this measure: subadditivity across loci and populations, and an interplay of information between the prior and the genetic variants. The loci-subadditivity property, which compares the sum of informativeness of two genetic sequences to that of the aggregated sequence, can be viewed as a generalization of a theorem by [Gattepaille and Jakobsson, 2012] on the informativeness of two-locus haplotypes of biallelic markers in the absence of linkage disequilibrium within populations. These authors were concerned with the effect of LD on the strength and sign of the gain of informativeness for assignment (GIA), a term they used to describe an inequality similar to the one above, utilizing the mutual information based In of [Rosenberg et al., 2003], concluding that “although there are a number of predictable behaviors of GIA-such as that GIA ≤ 0 when markers are in linkage equilibrium and that GIA is often large for cases where private alleles exist–GIA is not a trivial function of LD or allele frequencies”. Their primary motivation was to use this statistic as a heuristic for deciding whether to combine SNPs into haplotypes for improved accuracy in population assignment schemes, specifically targeting that part of LD that is due to physical linkage in the genomes, i.e., to handle the issue of having markers close along a chromosome (a type of data that is typically pruned before further use in assignment schemes).
The population-subadditivity property allows interpreting Cn as a distance metric between populations relative to any given number of loci. By definition, a metric on a set X is a distance function d : X × X → ℝ that satisfies four conditions: d(x,y) ≥ 0 (the non-negativity of informativeness), d(x,y) = 0 if and only if x = y (the zero criterion), d(x, y) = d(y,x) (the invariances criterion for class symmetry) and d(x,z) ≤ d(x, y) + d(y,z) (the triangle inequality, i.e., this population-subadditivity property). This feature of Cn is in stark contrast to standard differentiation measures such as FST which are not metrics (see [Tal, 2013] for analysis of several population genetic distances).
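The four metric conditions can be spot-checked numerically. The sketch below is illustrative only; it assumes equal priors (α = 0.5), so that Cn reduces to half the L1 distance between the two multilocus genotype distributions, and it uses hypothetical function names.

```python
from itertools import product


def genotype_prob(freqs, z):
    """Probability of haploid genotype z (tuple of 0/1) under independent loci."""
    prob = 1.0
    for f, bit in zip(freqs, z):
        prob *= f if bit == 1 else 1.0 - f
    return prob


def c_equal_priors(p, q):
    """Cn with alpha = 0.5: half the L1 distance between genotype distributions."""
    return 0.5 * sum(
        abs(genotype_prob(p, z) - genotype_prob(q, z))
        for z in product((0, 1), repeat=len(p))
    )


def check_metric_conditions(p, q, r, tol=1e-12):
    """Evaluate the four metric conditions for three populations p, q, r."""
    d_pq = c_equal_priors(p, q)
    d_qp = c_equal_priors(q, p)
    d_qr = c_equal_priors(q, r)
    d_pr = c_equal_priors(p, r)
    return {
        "non_negative": d_pq >= 0.0,
        "zero_on_identity": c_equal_priors(p, p) <= tol,
        "symmetric": abs(d_pq - d_qp) <= tol,
        "triangle": d_pr <= d_pq + d_qr + tol,
    }
```

A random search over allele-frequency vectors is of course not a proof; it only serves as a sanity check of the direction of the triangle inequality.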
The prior-washout property describes the diminished effect of the population prior as additional loci are included in Cn. Although this phenomenon is a direct consequence of the performance criterion, we have chosen to separately highlight it since it offers another perspective on the balance between the informative roles of the population prior vs. the genetic variants. For Bayesian classifiers, the washout effect also emerges from taking an information-theoretic perspective, by utilizing the asymptotic equipartition property (AEP) of relative entropy typical sets ([Cover and Thomas, 2006, p. 388]).
The interplay between the effects of the population prior vs. that of the genetic variants on our Cn is most sharply exemplified by the prior-sensitivity, prior-dominance and uninformative-loci properties. We have shown that while Cn is always sensitive to fluctuations in the prior, there are cases (which we characterize in detail in the proofs) where an additional locus is uninformative (i.e., Cn is invariant) given any value of the prior; and more surprisingly, we show that for any given set of loci, prior extremities beyond certain thresholds (determined by allele frequencies at these loci, Eqs. (C.2), (C.3) in Appendix C) render the information from the loci completely redundant for the assignment task. At this range of high disparity of population size, informativeness is only a function of the priors, with the consequence that effective classifiers may then simply assign unknown genotypes to the largest source population (as evidenced by the prior and likelihood terms of the log-posterior ratio of a Bayes classifier).
Feature Selection
Searching for the best m features out of the n available for the classification task is known to be an NP-hard problem and thus exhaustive evaluation of possible feature subsets is usually infeasible in practice due to the large amount of computational effort required ([Huang and Rong, 2009]). This is why single-locus rather than multilocus based heuristics are common ([Rosenberg, 2005]). While Cn is not intended as a metric or heuristic for implementing feature selection schemes as it is inherently multilocus in nature, it could still be employed for related tasks. Primarily, it can be used for comparing between a small number of given SNP panels, for verification subsequent to a standard feature selection task. Moreover, the delta criterion implies the possibility of ranking single loci for their contribution to an already existing panel in the particular case where a subset of loci has completely overlapping population frequency differences. We emphasize in this context that it is generally not possible to rank loci for their contribution to total informativeness, due to their interdependence and redundancies. Indeed, it has long been recognized in feature selection that combinations of individually good features do not necessarily lead to good classification performance. In other words, selecting the loci that are individually most informative does not necessarily produce the optimal panel, or more compactly put, “the m best features are not the best m features” ([Peng et al., 2005]).
In the context of feature selection, it is common to make a distinction between filter and wrapper methods. Filter type methods select variables regardless of the classification model, and are based only on general principles like a correlation of features with the predicted variable. Such methods are particularly effective in computation time for a large set of features, and in robustness to overfitting when the number of observations is small. In contrast, a wrapper is a feature selector that convolves with a classifier (e.g., naive Bayes classifier), with the direct goal of minimizing the classification error of that particular classifier. Usually, wrappers can yield high classification accuracy for a particular classifier at the cost of high computational complexity and less generalization of the selected features on other classifiers.
While Cn has features of both filters and wrappers, it is more appropriately categorized as a wrapper, since it incorporates multiple features and has an explicit relation to the classification model (essentially derived from a classifier error rate). In this context it is worth noting that our Bayes-error based Cn is not prone to overfitting since it effectively uses an infinite training sample (i.e., it is parametrized by underlying allele frequencies). Moreover, the finite-sample Cn,m also does not suffer from overfitting as it employs estimates of single SNP frequencies, rather than multilocus genotype frequencies.
Estimation from finite samples
Previous work on incorporating finite population samples in ancestry informativeness measures and in panel selection has proceeded by simply replacing true allele frequencies with sample frequencies in the informativeness measure. This approach, however, cannot provide general insight into the diminished informational content due to working with frequency estimates (e.g., [Rosenberg et al., 2003] in their Bayes-based ORCA and mutual information-based In for both the single- and multilocus formulations; see [Sampson et al., 2011] for a discussion of this issue). While this approach may be appropriate in the context of practical feature selection schemes based on particular training samples, it does not convey the intrinsic degrading effect on informativeness resulting from finite samples. Closer to our approach is the analysis in [Rosenberg, 2005] of a performance function for panel selection, which proves (Theorem 7) a result akin to our sampling-convergence criterion of Cn,m for the Bayes-based ORCA statistic computed by simulation using sample allele frequencies. However, this result only captures the conditional test error, whereas we target the expected test error, an expectation of the former over all possible samples of size m. That treatment also includes a performance result characterizing the effect of additional loci (Corollary 8 of Theorem 7) given some minimal sample size threshold, an analysis which we do not undertake here for Cn,m.
Distance-based Methods
We have also analyzed a simple distance-based classifier and derived a model of its error rate to see whether other effective classifiers can serve as a basis for informativeness. The construction of a nearest-centroid classifier for the high-dimensional discrete genetic data has revealed that simple genetic metrics such as the allele sharing distance do not take into account the variance-covariance matrices of the distribution of genotypes of each population (theoretically, if the centroids lie on opposite vertices of the hypercube genotype space, the classifier may associate up to 2(n − 2) vertices to the ‘wrong’ centroid). Such metrics effectively ignore the width of the distributions in the direction of individual to population-mean comparison. This issue probably explains the weak performance of the misclassification rate Cc from Fig. 2 in [Witherspoon et al., 2007], with data from low MAF sites (“rare” alleles, MAF < 0.1). Nevertheless, other treatments have ignored this issue, e.g., [Degen et al., 2017] who use a k-Nearest Neighbor (k-NN) algorithm for classification to one of a collection of source populations based on simple pairwise individual allele-sharing distance.
To overcome this issue, we developed a normalization akin to a Mahalanobis distance adapted to the discrete nature of the genetic space. In a Euclidean space, the Mahalanobis distance is the distance of a test point to the centroid divided by the width of the ellipsoid in the direction of the test point ([Hastie et al., 2009]) and is normally applied within classification techniques such as the k-Nearest Neighbor (k-NN) or linear and quadratic discriminant analysis (LDA, QDA). The use of a Mahalanobis distance is common procedure in classification schemes involving continuous gene expression data. [Dudoit et al., 2002] have also implemented a similar normalization for nearest-centroid classification of gene expression profiles, using diagonal linear discriminant analysis (DLDA). However, their model differs from ours in a few crucial respects: they assume normality of the underlying data distributions and equal class densities. A similar approach was adopted in [Tibshirani et al., 2002, Tibshirani et al., 2003] describing a method of classifying gene expression test samples according to closest shrunken centroids, also standardizing by variance estimates.
An approach that bears a stronger resemblance to our SNP distribution model is described in [Patterson et al., 2006]. The paper describes an application of PCA for genetic biallelic data using standardized distances. Each entry in the PCA matrix is normalized by the variance of C(i, j), the number of occurrences of the MAF allele for locus j from individual i. The authors report that “We verified (unpublished data) that the normalization improves results when using simulated genetic data, and that on real data known structure becomes clearer.” Even closer to our distance-approach is the work of [Liao et al., 2009] which introduces a classifier for SNP data genotypes with distance normalization by variance and a similar correction for population priors via a logarithmic term. However, since it employs a shrinkage centroid, deriving a simple expression for its error rate would be hardly feasible.
The naïve Bayes classifier upon which Cn is formulated derives multilocus genotype frequencies under an implicit assumption of linkage equilibrium, i.e., the absence of linkage disequilibrium (LD). Indeed, we would not normally expect to see linkage disequilibrium within distinct populations, except between markers that are in close proximity (physical linkage, pruned in common preprocessing procedures) or in cases of recent admixture ([Pritchard et al., 2000]). At a population level, admixture manifests as LD between markers reflecting a shared history of descent, and invalidating the independence assumption. Haplotype based analysis was devised to potentially harness this information ([Lawson et al., 2012]). A possible approach for extending our model for ancestry informativeness to take advantage of information from LD that exists in structured populations might involve including variance-covariance matrices that specify pairwise LD estimates for normalization in an NC classifier, instead of the diagonal covariance matrices. This extension is left for future work.
Conclusion
In this paper we have taken an axiomatic approach to select a measure for ancestry informativeness. We have shown that a measure based on an optimal Bayes classifier complies with the full set of proposed criteria, in contrast to a popular measure based on mutual information. The core deficiency of the information theoretic approach is in properly incorporating the information available from population priors. This is exemplified in decision and information theoretic measures providing different relative rankings to SNP panels under certain population priors. The measure based on a Bayes error was moreover shown to possess several surprising properties which characterize the interplay between information originating from the priors and the genetic markers. We have also analyzed a distance-based classifier adapted to discrete high-dimensional population genetic data, as an instance of an effective classifier which nevertheless cannot form a basis for a satisfactory multilocus ancestry informativeness measure. Finally, we have extended the framework to formally quantify the inherent degrading effect on informativeness from using finite population samples in the learning phase.
Acknowledgements
Special thanks to Jim Portegies and Guido Montúfar for fruitful technical discussions. We also acknowledge Jürgen Jost and the Max Planck Institute for Mathematics in the Sciences for its generous support and the platform to present an earlier form of these ideas in an internal seminar.
Appendix A The relation between the Bayes error and some generalization of the variational distance
A.1 The relation
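Here we develop a proof for the general case of unequal priors. It is required to show that
$$C_n = 1 - 2E_n = \sum_{k=0}^{2^n-1} \left|\alpha h_k - (1-\alpha) g_k\right|, \qquad \text{(A.1)}$$
where
$$E_n = \sum_{k=0}^{2^n-1} \min\{\alpha h_k,\ (1-\alpha) g_k\}$$
is the Bayes error.
In fact, we note that |x − y| = x + y − 2 min{x, y} for all x, y. Therefore
$$\sum_{k=0}^{2^n-1} \left|\alpha h_k - (1-\alpha) g_k\right| = \sum_{k=0}^{2^n-1} \left(\alpha h_k + (1-\alpha) g_k\right) - 2\sum_{k=0}^{2^n-1} \min\{\alpha h_k,\ (1-\alpha) g_k\} = \alpha + (1-\alpha) - 2E_n = 1 - 2E_n,$$
since the genotype frequencies of each population sum to 1.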
A.2 Other representations of Cn
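Reminder:
$$C_n = \sum_{k=0}^{2^n-1} \left|\alpha h_k - (1-\alpha) g_k\right|,$$
where
$$h_k = \prod_{i=1}^{n} p_i^{f_n(k,i)} (1-p_i)^{1-f_n(k,i)}, \qquad g_k = \prod_{i=1}^{n} q_i^{f_n(k,i)} (1-q_i)^{1-f_n(k,i)},$$
and fn(k, i) is the ith bit of k in its binary form.
To simplify the proofs, we also use another representation of Cn,
$$C_n = \sum_{z \in I_n} \left|\alpha F_n(p,z) - (1-\alpha) F_n(q,z)\right|, \qquad \text{(A.2)}$$
where In = {0,1}^n and
$$F_n(p,z) = \prod_{i=1}^{n} p_i^{z_i} (1-p_i)^{1-z_i}.$$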
Eq. (A.2) is equivalent to Eq. (3.3).
Proof. There is a one-to-one correspondence between each k = 0, … , 2^n − 1 and a corresponding element z(k) = (fn(k, 1), … , fn(k, n)) ∈ In. Moreover, we have Fn(p, z(k)) = hk and similarly Fn(q, z(k)) = gk. This completes the proof.
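Eq. (A.1) and Eq. (A.2) together imply
$$C_n = 1 - 2\sum_{z \in I_n} \min\{\alpha F_n(p,z),\ (1-\alpha) F_n(q,z)\}. \qquad \text{(A.3)}$$
To simplify the proofs, we will use all four of these representations of Cn (i.e. (3.2), (3.3), (A.2), (A.3)).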
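The equivalence of these representations can be checked numerically. The following minimal sketch assumes the standard product form for genotype frequencies (allele ‘A’ coded as bit 1) and reads fn(k, i) as the ith bit of k with the most significant bit first; the bit order does not affect the sums. Function names are hypothetical.

```python
from itertools import product


def genotype_prob(freqs, bits):
    """Product-form genotype frequency for a sequence of 0/1 allele indicators."""
    prob = 1.0
    for f, bit in zip(freqs, bits):
        prob *= f if bit == 1 else 1.0 - f
    return prob


def c_abs_over_k(alpha, p, q):
    """Absolute-value form, summing over integer genotype indices k."""
    n = len(p)
    total = 0.0
    for k in range(2**n):
        bits = [(k >> (n - 1 - i)) & 1 for i in range(n)]  # binary digits of k
        h_k = genotype_prob(p, bits)
        g_k = genotype_prob(q, bits)
        total += abs(alpha * h_k - (1 - alpha) * g_k)
    return total


def c_abs_over_z(alpha, p, q):
    """Absolute-value form, summing over binary vectors z in {0,1}^n."""
    return sum(
        abs(alpha * genotype_prob(p, z) - (1 - alpha) * genotype_prob(q, z))
        for z in product((0, 1), repeat=len(p))
    )


def c_min_over_z(alpha, p, q):
    """Bayes-error form: 1 - 2 * sum_z min{alpha*F(p,z), (1-alpha)*F(q,z)}."""
    bayes_error = sum(
        min(alpha * genotype_prob(p, z), (1 - alpha) * genotype_prob(q, z))
        for z in product((0, 1), repeat=len(p))
    )
    return 1.0 - 2.0 * bayes_error
```

With zero loci all three functions reduce to |2α − 1|, the informativeness of the prior alone.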
Appendix B
Proof of compliance of Bayes Cn with all criteria
1. Zero: (⇐): If P = Q (i.e. pi = qi, ∀i = 1, … , n) and α = 0.5 then we have Fn(p, z) = Fn(q, z) for all z ∈ In. Thus from Eq. (A.2), and since the sum of all genotype frequencies from a single population is 1, Cn = 0.
(⇒): Conversely, we need to show that Cn = 0 implies P = Q (i.e., pi = qi, ∀i = 1, … ,n) and α = 0.5. First, notice that if Cn = 0 then trivially each summand of Eq. (A.2) must be zero, i.e. αFn(p, z) = (1 − α)Fn(q, z), ∀z ∈ In. By summing over all z ∈ In we have α = 1 − α, or α = 0.5. Therefore Fn(p, z) = Fn(q, z), ∀z ∈ In. Denote by In(i) = {z ∈ In : zi = 1} the set of genotypes carrying allele ‘A’ at locus i. Then we have for every i = 1, … , n
$$p_i = \sum_{z \in I_n(i)} F_n(p,z) = \sum_{z \in I_n(i)} F_n(q,z) = q_i.$$
2. Performance: See proof in [Tal, 2012b, Appendix B.1], and Fig. B.1 for illustration.
3. Asymptotics: See proof in [Tal, 2012b, Appendix B.2], and Fig. B.2 for illustration.
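4. Neutrality: We need to prove that Cn = Cn+1 if pn+1 = qn+1. In fact, by denoting r = pn+1 = qn+1, we have
$$C_{n+1} = \sum_{z \in I_n} \Big( \left|\alpha F_n(p,z)\,r - (1-\alpha) F_n(q,z)\,r\right| + \left|\alpha F_n(p,z)(1-r) - (1-\alpha) F_n(q,z)(1-r)\right| \Big) = \big(r + (1-r)\big) \sum_{z \in I_n} \left|\alpha F_n(p,z) - (1-\alpha) F_n(q,z)\right| = C_n.$$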
5. Continuity: From Eq. (A.2), Cn is a sum of absolute values of continuous functions and therefore has no singularities, and since pi,qi and α are real-valued parameters, Cn is continuous with respect to its parameters.
6. Dominance: (⇒) We need to prove that for a fixed n and some fixed α ∈ (0, 1), if there exists i ∈ {1, … , n} such that |qi − pi| → 1 then Cn → 1. In fact, without loss of generality we assume that pn → 0 and qn → 1. We note that
If zn = 1 then Fn(p, z) contains the factor pn → 0, and then Fn(p, z) → 0.
If zn = 0 then Fn(q, z) contains the factor 1 − qn → 0, and then Fn(q, z) → 0.
Therefore min{αFn(p, z), (1 − α)Fn(q, z)} → 0 for all z ∈ In. This results in En → 0 and consequently Cn → 1.
(⇐) Conversely, assume that Cn → 1 for a fixed n and some fixed α ∈ (0,1); we need to prove that there exists i ∈ {1, … , n} such that |qi − pi| → 1. In fact, we first note that p ≠ q because otherwise |2α − 1| = Cn → 1 which is a contradiction. Moreover, because Cn satisfies the neutrality and invariances criteria, we can assume that pi < qi for all i. Now, put and . Because Cn → 1, it follows that
By choosing z = (0, … , 0) we imply , therefore . Similarly, by choosing z = (1, … , 1) we imply , therefore . Assume that In then we can choose z ∈ In such that
Note that
for (otherwise, it implies qi(> pi) → 1 and contradicts to the definition of );
for and (from the definition of ).
Therefore we have
Similarly, we have for ; for and and therefore
It implies that which contradicts to (G.4). Thus which implies the proof (see Fig. B.3 for illustration).
7. Delta: For simplicity in notation, we examine n + 1 loci (where n can also be 0), and the proof proceeds without loss of generality with respect to locus n + 1 with allele frequencies p and q. First, express Cn+1 as in Eq. (B.1),
Each pair of absolute-value terms can be seen as a sum of two V-shaped functions of q, without loss of generality (illustrated in Fig. B.4). Let us examine the first pair, indexed by z. The zero ‘tip’ of one V-shaped function occurs where, while the zero tip of the other occurs where,
Assume, without loss of generality, that q > p at the zero-tip described by Eq. (B.4). Since the RHS equals the RHS of Eq. (B.5), then q < p at the zero-tip described by Eq. (B.5). Now, since the absolute values of the slopes of all V-shaped function pairs are equal, their sum is a three-piece piecewise-linear convex function where the center section is of slope zero (Fig. B.4). This is also the case for any paired sum in Eq. (B.3), and since the sum of convex functions is convex, Cn+1 is necessarily convex. Crucially, the slope-zero sections of all these 2^n convex functions (as a function of q) necessarily partially overlap, so that their sum Cn+1 also has a slope-zero section (Fig. B.4). Thus, if q < p then substituting a lower value for q can only increase Cn+1; similarly, if q > p then substituting a higher value for q can only increase Cn+1.
8. Invariances: [a] since the genotype probabilities Fn(p, z) and Fn(q, z) in Eq. (A.2) are each a commutative product of allele frequencies from all loci, Cn is invariant to different ordering of loci; [b] Cn(α, P, Q) = Cn(1 − α, Q, P) follows from the presence of an absolute value in the formulation of Eq. (A.2); [c] the simultaneous substitution of pi with (1 − pi) and qi with (1 − qi) simply changes the order of the summation terms in Eq. (A.2) and thus does not affect Cn. We illustrate this with regard to C2:
Now, after the substitution (termed here ) we get,
9. Prior: If α → 0 or α → 1 then one of the two terms within the sum in Eq. (A.2) diminishes to zero and what remains in the limit is the sum over all genotype probabilities in one population, which equals 1, therefore Cn → 1.
Appendix C Proof of the properties of the Bayes-based Cn
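Loci subadditivity (w/out priors): we need to show that Cn+m(P║P′, Q║Q′) ≤ Cn(P, Q)+Cm(P′, Q′). Let h′j designate the 2^m genotype frequencies associated with frequency vector P′, and similarly g′j with reference to frequency vector Q′. Then from Eq. (3.3) it directly follows that,
$$\begin{aligned}
C_{n+m}(P\|P', Q\|Q') &= \tfrac{1}{2} \sum_{k}\sum_{j} \left|h_k h'_j - g_k g'_j\right| \\
&= \tfrac{1}{2} \sum_{k}\sum_{j} \left|h'_j (h_k - g_k) - g_k (g'_j - h'_j)\right| \\
&\le \tfrac{1}{2} \sum_{k}\sum_{j} \Big( h'_j \left|h_k - g_k\right| + g_k \left|g'_j - h'_j\right| \Big) \\
&= \tfrac{1}{2} \sum_{j} h'_j \sum_{k} \left|h_k - g_k\right| + \tfrac{1}{2} \sum_{k} g_k \sum_{j} \left|h'_j - g'_j\right| \\
&= C_n(P, Q) + C_m(P', Q'),
\end{aligned}$$
where the transition between lines 2 and 3 is due to the triangle inequality (|a − b| ≤ |a| + |b|), and the transition between lines 4 and 5 is due to the sum over all genotype frequencies in any population equaling 1. We note that when general priors are incorporated in Cn, this property may fail.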
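As a numerical illustration of loci subadditivity without priors, the sketch below concatenates two sets of loci and compares the informativeness of the joint panel with the sum over the two sub-panels. It assumes equal priors (α = 0.5) and independent loci; the function names are hypothetical.

```python
from itertools import product


def genotype_prob(freqs, z):
    """Probability of haploid genotype z (tuple of 0/1) under independent loci."""
    prob = 1.0
    for f, bit in zip(freqs, z):
        prob *= f if bit == 1 else 1.0 - f
    return prob


def c_equal_priors(p, q):
    """Cn with alpha = 0.5: half the L1 distance between genotype distributions."""
    return 0.5 * sum(
        abs(genotype_prob(p, z) - genotype_prob(q, z))
        for z in product((0, 1), repeat=len(p))
    )


def subadditivity_gap(p, q, p_extra, q_extra):
    """Cn(P,Q) + Cm(P',Q') - Cn+m(P||P', Q||Q'); expected to be non-negative."""
    joint = c_equal_priors(list(p) + list(p_extra), list(q) + list(q_extra))
    return c_equal_priors(p, q) + c_equal_priors(p_extra, q_extra) - joint
```

A non-negative gap for every tested pair of panels is consistent with the inequality; the gap is zero when one of the two sub-panels is uninformative (P′ = Q′).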
We may nevertheless allow for priors by introducing a correction term |1 − 2α|,
In fact, we have
Population-subadditivity (w/out priors): the compliance of Cn(P, Q) with the triangle inequality follows from a formulation in the form of the variational distance, as per Eq. (3.3). [Khosravifard et al., 2007] prove that such distance indeed satisfies the triangle inequality.
Prior-washout: This property is a direct implication of the performance criterion, since the minimum with respect to α also increases, Cn+1(α) ≥ Cn(α) (see Fig. 5), and asymptotically follows from the asymptotics criterion Cn(α) → 1 with n, such that the prior is completely washed out at the limit. This property can also be inferred from an information-theoretic perspective on a Bayes classifier ([Cover and Thomas, 2006, p. 388]). The normalized log-likelihood ratio with priors π1 and π2 can be formulated as,
$$\frac{1}{n}\log\frac{\pi_1 P_1(X_1,\dots,X_n)}{\pi_2 P_2(X_1,\dots,X_n)} = \frac{1}{n}\log\frac{\pi_1}{\pi_2} + \frac{1}{n}\log\frac{P_1(X_1,\dots,X_n)}{P_2(X_1,\dots,X_n)},$$
where from the AEP the second term tends to D(P1║P2) or −D(P2║P1) in accordance to P1 or P2 being the true source distribution, while the first term tends to 0, thus the effect of the prior washes out.
Prior-sensitivity: We first show that Cn as a function of the prior α is convex. Notice from Eq. (3.3) that it is a sum of V-shaped convex functions of α, f(α) = |(hk + gk)α − gk| and therefore is itself a convex function. To find α giving the minimum of Cn(α) notice that the minimum must occur at one or more of the 2^n ‘singularity’ points of this piecewise linear convex function, given simply by,
$$\alpha_k = \frac{g_k}{h_k + g_k}, \qquad k = 0, \dots, 2^n - 1,$$
and therefore, the minimum occurs at,
$$\alpha^{*} = \arg\min_{\alpha \in \{\alpha_0, \dots, \alpha_{2^n-1}\}} C_n(\alpha).$$
To show that Cn is never invariant to fluctuations in α, i.e., that this minimum is unique, we need to prove there are no domains where the slope of Cn(α) is zero. The slope of each linear section of Cn(α) is some combination of ±(hk + gk), the coefficient of α. For the slope to be zero this sum over all genotypes has to be exactly zero, a situation which is almost surely (with probability 1) impossible since the allele frequencies are real-valued parameters and thus genotype frequencies are never exactly equal between populations.
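The search for the minimizing prior can be sketched as follows. This is an illustrative brute-force version, exponential in n, which assumes all allele frequencies lie strictly inside (0, 1) so that hk + gk > 0; function names are hypothetical.

```python
from itertools import product


def genotype_prob(freqs, z):
    """Probability of haploid genotype z (tuple of 0/1) under independent loci."""
    prob = 1.0
    for f, bit in zip(freqs, z):
        prob *= f if bit == 1 else 1.0 - f
    return prob


def c_n(alpha, p, q):
    """Cn as a function of the prior alpha, in the absolute-value form."""
    return sum(
        abs(alpha * genotype_prob(p, z) - (1 - alpha) * genotype_prob(q, z))
        for z in product((0, 1), repeat=len(p))
    )


def breakpoints(p, q):
    """Kinks of the piecewise-linear Cn(alpha): alpha_k = g_k / (h_k + g_k)."""
    points = set()
    for z in product((0, 1), repeat=len(p)):
        h = genotype_prob(p, z)
        g = genotype_prob(q, z)
        points.add(g / (h + g))
    return sorted(points)


def minimizing_prior(p, q):
    """A convex piecewise-linear function attains its minimum at one of its kinks."""
    return min(breakpoints(p, q), key=lambda a: c_n(a, p, q))
```

Evaluating Cn only at the kinks suffices because Cn equals 1 at both α = 0 and α = 1, so the endpoints can never undercut an interior kink.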
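Prior-dominance: We need to show that for any combination of allele frequencies there always exist two thresholds strictly in (0,1), 0 < α0 ≤ α1 < 1, such that a prior that is more extreme than these thresholds fully determines Cn, i.e., all the genetic loci become effectively uninformative. In fact, we note that αFn(p, z) − (1 − α)Fn(q, z) ≥ 0 if and only if
$$\alpha \ge \frac{F_n(q,z)}{F_n(p,z) + F_n(q,z)}.$$
Therefore, by putting
$$\alpha_0(p,q) = \min_{z \in I_n} \frac{F_n(q,z)}{F_n(p,z) + F_n(q,z)} \quad \text{and} \quad \alpha_1(p,q) = \max_{z \in I_n} \frac{F_n(q,z)}{F_n(p,z) + F_n(q,z)},$$
it follows immediately that 0 < α0(p, q) ≤ α1(p, q) < 1. Moreover, for all α ≥ α1(p, q) we have
$$C_n = \sum_{z \in I_n} \big(\alpha F_n(p,z) - (1-\alpha) F_n(q,z)\big) = 2\alpha - 1,$$
and for all α ≤ α0(p, q) we have
$$C_n = \sum_{z \in I_n} \big((1-\alpha) F_n(q,z) - \alpha F_n(p,z)\big) = 1 - 2\alpha.$$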
From Cn(α1) = 2α1 − 1, Cn(α0) = 1 − 2α0 and the non-negativity criterion it also follows that α0 ≤ 1/2 ≤ α1. Now, since α1 and α0 are continuous functions of pi and qi (as Fn(p, z) and Fn(q, z) are simply products of those frequencies), slight fluctuations in any allele frequency may induce only a slight change in the thresholds α1 and α0, so that the prior remains within the range for which Cn = |2α − 1|. More formally, for any ε > 0, and prior α = α1 + ε or α = α0 − ε strictly between 0 and 1, Cn = |2α − 1| and is invariant to slight fluctuations in allele frequencies.
From the uninformative-loci property and the delta criterion it immediately follows that any locus i, for which Cn is invariant to small fluctuations in pi or qi, is uninformative; i.e., it does not contribute information in the sense that Cn(α) = Cn−1(α), where this locus is excluded in Cn−1. Crucially, this implies that for any prior within the invariant range for Cn(P, Q), i.e., α ≥ α1 or α ≤ α0, all loci are effectively uninformative.
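To see this, we need to show that the prior thresholds of Cn are included in those of Cn+1, i.e., that the interval [α0, α1] for n loci is contained in the corresponding interval for n + 1 loci; consequently, the exclusion of any m loci would result in Cn(α) = Cn−m(α) = C0 = |1 − 2α|. In fact, we have from Eq. (C.3)
By a similar proof we also have the corresponding relation for the other threshold. This completes the proof.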
Note that calculating the thresholds α1 and α0 is of exponential complexity O(k) = O(2^n), but we can achieve a much better result, linear in n, i.e., O(n). This is possible from the following simple transformation,
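One way such a linear-time computation can work is sketched below in Python. The sketch rests on assumptions that are not stated in this passage: genotypes are binary vectors z ∈ {0,1}^n; Fn(p, z) is the product over loci of pi when zi = 1 and 1 − pi when zi = 0; Cn = Σz |αFn(p, z) − (1 − α)Fn(q, z)|; and the thresholds are the smallest and largest values over z of Fn(q, z) / (Fn(p, z) + Fn(q, z)). Under these assumptions the ratio Fn(p, z) / Fn(q, z) factorises over loci, so its extremes can be found one locus at a time. All function names are hypothetical, and all allele frequencies are taken to lie strictly between 0 and 1.

```python
from itertools import product


def genotype_freq(freqs, z):
    """Assumed form of Fn(freqs, z): product over loci of f_i if z_i == 1, else 1 - f_i."""
    out = 1.0
    for f, zi in zip(freqs, z):
        out *= f if zi == 1 else 1.0 - f
    return out


def bayes_cn(alpha, p, q):
    """Assumed form of Cn: sum over all 2^n genotypes of |alpha*Fn(p,z) - (1-alpha)*Fn(q,z)|."""
    return sum(
        abs(alpha * genotype_freq(p, z) - (1.0 - alpha) * genotype_freq(q, z))
        for z in product((0, 1), repeat=len(p))
    )


def thresholds_brute(p, q):
    """(alpha0, alpha1) by enumerating all 2^n genotypes: exponential in n."""
    ratios = [
        genotype_freq(q, z) / (genotype_freq(p, z) + genotype_freq(q, z))
        for z in product((0, 1), repeat=len(p))
    ]
    return min(ratios), max(ratios)


def thresholds_linear(p, q):
    """(alpha0, alpha1) in O(n): the extremes of Fn(p,z)/Fn(q,z) factorise over loci."""
    min_ratio = max_ratio = 1.0
    for pi, qi in zip(p, q):
        # Per-locus ratio for z_i = 1 and for z_i = 0.
        r1, r0 = pi / qi, (1.0 - pi) / (1.0 - qi)
        min_ratio *= min(r1, r0)
        max_ratio *= max(r1, r0)
    # Fn(q,z) / (Fn(p,z) + Fn(q,z)) = 1 / (1 + Fn(p,z)/Fn(q,z)), decreasing in the ratio.
    return 1.0 / (1.0 + max_ratio), 1.0 / (1.0 + min_ratio)


p, q = (0.02, 0.2), (0.24, 0.1)  # illustrative allele frequencies
alpha0, alpha1 = thresholds_linear(p, q)
print(alpha0, alpha1)
print(bayes_cn(0.95, p, q), abs(2 * 0.95 - 1))  # prior above alpha1
print(bayes_cn(0.20, p, q), abs(2 * 0.20 - 1))  # prior below alpha0
```

For these illustrative frequencies the thresholds are roughly α0 ≈ 0.279 and α1 ≈ 0.931, and for priors outside that interval the computed Cn coincides with |2α − 1|.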
Uninformative-loci: We need to show that formally for any P and Q with n loci, and without loss of generality,
The equality Cn = Cn+1 above occurs iff the following inequalities, for k : 0 to 2^n − 1, from the proof of the performance criterion ([Tal, 2012b, Appendix B.1]) are equalities, and
This in turn occurs when, following Eq. (C.4), αhkp ≤ (1 − α)gkq and αhk(1 − p) ≤ (1 − α)gk(1 − q) for each k : 0 to 2^n − 1 for which min{αhk, (1 − α)gk} = αhk.
Now write p + ε instead of q: αhkp ≤ (1 − α)gk(p + ε) and αhk(1 − p) ≤ (1 − α)gk(1 − p − ε). We get, for each k : 0 to 2^n − 1 for which αhk ≤ (1 − α)gk, and therefore the LHS is non-positive and the RHS is non-negative, and as expected, 0 < ε < 1. Also, when following Eq. (C.5), αhkp ≥ (1 − α)gkq and αhk(1 − p) ≥ (1 − α)gk(1 − q) for each k : 0 to 2^n − 1 for which min{αhk, (1 − α)gk} = (1 − α)gk.
Now write p + ε instead of q: αhkp ≥ (1 − α)gk(p + ε) and αhk(1 − p) ≥ (1 − α)gk(1 − p − ε). We get, for each k : 0 to 2^n − 1 for which αhk ≥ (1 − α)gk, and therefore the LHS is non-positive and the RHS is non-negative, and as expected, 0 < ε < 1.
Thus, unless αhk = (1 − α)gk for some k : 0 to 2^n − 1, for any p there is always a non-degenerate range around zero such that, for any ε within that range, the inclusion of an extra locus with allele frequencies p and p + ε does not increase Cn.6
The range for ε is, finally, the overlap of all such ranges (since they are satisfied simultaneously), where the internal minima are taken over k : 0 to 2^n − 1. Since the probability for the occurrence of αhk = (1 − α)gk for any k ranging over all possible genotype frequencies is negligibly small, we can conclude that the theorem is true, almost surely.7
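The numerical example in footnote 3 (p1 = 0.02/q1 = 0.24, p2 = 0.2/q2 = 0.1, p3 = 0.1/q3 = 0.25, α = 0.7) can be used to check this property directly. The Python sketch below makes the following assumptions, none of which are stated in this passage: genotypes are binary vectors, Fn is a product of per-locus allele frequencies, and Cn = Σz |αFn(p, z) − (1 − α)Fn(q, z)|. It evaluates Cn with and without the third locus. The function names are hypothetical, and the contrasting third locus (0.1 versus 0.9) is an illustrative choice that does not come from the paper.

```python
from itertools import product


def genotype_freq(freqs, z):
    """Assumed form of Fn(freqs, z): product over loci of f_i if z_i == 1, else 1 - f_i."""
    out = 1.0
    for f, zi in zip(freqs, z):
        out *= f if zi == 1 else 1.0 - f
    return out


def bayes_cn(alpha, p, q):
    """Assumed form of Cn: sum over all 2^n genotypes of |alpha*Fn(p,z) - (1-alpha)*Fn(q,z)|."""
    return sum(
        abs(alpha * genotype_freq(p, z) - (1.0 - alpha) * genotype_freq(q, z))
        for z in product((0, 1), repeat=len(p))
    )


alpha = 0.7
p2, q2 = (0.02, 0.2), (0.24, 0.1)          # first two loci of footnote 3
c2 = bayes_cn(alpha, p2, q2)

# Third locus from footnote 3: q3 - p3 = 0.15 falls inside the uninformative range.
c3_uninformative = bayes_cn(alpha, p2 + (0.1,), q2 + (0.25,))

# A much more divergent third locus (illustrative values, not from the paper).
c3_informative = bayes_cn(alpha, p2 + (0.1,), q2 + (0.9,))

print(c2, c3_uninformative, c3_informative)
```

Under these assumptions the first two values agree up to floating-point error, because adding the third locus does not flip the sign of any term αhk − (1 − α)gk, whereas the more divergent locus flips at least one sign and so raises Cn.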
Appendix D Proof that the NC-based Cn complies with a subset of the criteria
Zero: If P = Q (i.e., pi = qi for all i) then by definition α = 0.5 (since these are effectively the same population) and thus hk = gk, Dk = 0 and dk = 0 for all k. From Eq. (9), and since the sum of all genotype frequencies from each single population is 1,
It remains to be proven in future work whether conversely, Cn = 0 implies P = Q (i.e., pi = qi for all i) and α = 0.5.
Performance: This criterion fails. The discontinuity inherent in the classifier expression dk renders Cn decreasing as n → n + 1 under certain scenarios (see Fig. D.1A).
Asymptotics: A formal proof that Cn → 1 is left for future work, but numerical simulations strongly indicate it (see Fig. D.1B for illustration).
Neutrality: We need to prove that if pn+1 = qn+1 then Cn+1 = Cn.
Proof. From Eq. (4.2) we have, where the subscript k|n + 1 designates that the term is defined with reference to n + 1 loci. First notice that the classifier expression does not change, since pn+1 = qn+1 by assumption. Finally, we split the multiple sum into two multiple sums, the first in which genotypes have a ‘0’ allele at the (n + 1)-th locus, and the second in which they have a ‘1’ allele at that locus, noticing also that the sum from 2^n to 2^(n+1) − 1 equals the sum from 0 to 2^n − 1 when the (n + 1)-th locus is fixed,
Continuity: This criterion fails (see, e.g., the discontinuity points in Fig. D.2).
Dominance: We need to prove that for any finite number of loci, Cn → 1 iff |qi – pi| → 1.
Proof. We prove, without loss of generality, that if qn+1 → 1 and pn+1 → 0 then Cn+1 → 1. In fact, we have
Note that the classifier . It implies that if pn+1 → 0 and qn+1 → 1 then
Therefore
Thus, when pn+1 → 0 and qn+1 → 1 we have and . It follows from Eq. (D.1) that Cn+1 → 1.
Delta: This criterion fails (see, e.g., the decrease of Cn at several locations in Fig. D.2).
Invariances: [a] since Dk is commutative as a sum, and hk and gk are commutative as a product, Cn is invariant to different ordering of the loci; [b] Cn(α, P, Q) = Cn(1 – α,Q,P) follows from the symmetry of Dk and Cn with respect to pi and qi; [c] the simultaneous substitution of pi with (1 – pi) and qi with (1 – qi) simply changes the order of summation in Dk and Cn. We illustrate this with reference to C1, and this is easily proven by induction on n. We write down the explicit form of C1, where,
Denote by the measure following this substitution. It is straightforward that, since Dk (and dk) were similarly modified by the substitution,
Priors: At the limit α → 0, Dk → +∞, dk = 1 and thus in Eq. (9) both αhkdk → 0 and (1 – α)gk(1 – dk) → 0, therefore Cn → 1. Similarly, at the limit α → 1, Dk → −∞, dk = 0 and both αhkdk → 0 and (1 – α)gk(1 – dk) → 0, therefore Cn → 1.
Appendix E Proof of compliance of Sampling-based Cn,m with the new criteria
Here we prove the two new criteria and also show by simulation compliance with two other criteria from Cn. We first rewrite Eq. (5.2) in the form of where with
Note that we have immediately that cn(α, p, q, p, q) = Cn.
[3] Asymptotics: A proof is beyond the scope of this paper, but numerical simulations validate it (see Fig. E.1A for illustration).
[6] Dominance: A formal proof is beyond the scope of this paper, but numerical simulations validate it (see Fig. E.1B for illustration).
[1*] Sampling-Effect: We need to show that for any sample size m > 0, Cn,m ≤ Cn. In fact, for given , , for each z ∈ In we have d(α, x, y, z) ∈ {0,1}, therefore
It implies that, for given , , we have
Thus
This completes the proof (see Fig. E.2A for illustration).
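The inequality can also be checked numerically. The Python sketch below is a rough illustration under explicit assumptions that are not taken from Eq. (5.2) itself: cn(α, p, q, x, y) is modelled as Σz (2d − 1)(αFn(p, z) − (1 − α)Fn(q, z)), where the decision d ∈ {0,1} is made from the estimated frequencies x, y; Cn,m is approximated by a Monte Carlo average over estimates drawn from m sampled alleles per locus; and zero estimates are replaced by 1/(m + 1) as described in footnote 5. All function names and the chosen frequencies are hypothetical.

```python
import random
from itertools import product


def genotype_freq(freqs, z):
    """Assumed form of Fn(freqs, z): product over loci of f_i if z_i == 1, else 1 - f_i."""
    out = 1.0
    for f, zi in zip(freqs, z):
        out *= f if zi == 1 else 1.0 - f
    return out


def plug_in_cn(alpha, p, q, x, y):
    """Assumed form of cn(alpha, p, q, x, y): decisions d in {0, 1} come from the
    estimates (x, y); the outcome is scored under the true frequencies (p, q)."""
    total = 0.0
    for z in product((0, 1), repeat=len(p)):
        d = 1 if alpha * genotype_freq(x, z) >= (1.0 - alpha) * genotype_freq(y, z) else 0
        a = alpha * genotype_freq(p, z) - (1.0 - alpha) * genotype_freq(q, z)
        total += a if d == 1 else -a  # (2d - 1) * a can never exceed |a|
    return total


def estimate(freqs, m, rng):
    """Sample m alleles per locus; a zero estimate becomes 1/(m+1), following footnote 5."""
    est = []
    for f in freqs:
        count = sum(rng.random() < f for _ in range(m))
        est.append(count / m if count > 0 else 1.0 / (m + 1))
    return est


rng = random.Random(0)
alpha, p, q, m = 0.5, (0.1, 0.3, 0.6), (0.4, 0.5, 0.2), 10
cn = plug_in_cn(alpha, p, q, p, q)  # with the true frequencies this reduces to Cn
draws = [
    plug_in_cn(alpha, p, q, estimate(p, m, rng), estimate(q, m, rng))
    for _ in range(500)
]
cn_m = sum(draws) / len(draws)  # Monte Carlo approximation of Cn,m
print(cn, cn_m)
```

Every individual draw is bounded by Cn because each term (2d − 1)a is at most |a|, so the average over draws cannot exceed Cn either; this mirrors the structure of the argument above without reproducing its exact notation.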
[2*] Sampling-Convergence: We need to show that as sample size increases Cn,m → Cn. In fact, we note that when m1 → ∞, and . Similarly, when m2 → ∞, and . Therefore when m = {m1,m2} → ∞ we have
This completes the proof (see Fig. E.2B for illustration).
Appendix F Proof that every monotonic function of a compliant informativeness measure is also compliant
Let f : [0,1] → [0,1] be a function such that
(a) f(x) = 0 iff x = 0;
(b) f(x) = 1 iff x = 1;
(c) f is continuous;
(d) f is monotone.
We show here that if Cn is a compliant informativeness measure, then so is f(Cn). In fact, the Zero criterion follows from (a); the Performance criterion follows from (d); the Asymptotics criterion follows from (b) and (c); the Neutrality criterion follows from f being a single-valued function; the Continuity criterion follows from (c); the Dominance criterion follows from (b) and (c); the Delta criterion follows from (d); the Invariances criterion follows from f being a single-valued function; the Prior criterion follows from (b) and (c).
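A small numerical illustration of this argument is sketched below in Python. It assumes, as a stand-in for a compliant measure, the form Cn = Σz |αFn(p, z) − (1 − α)Fn(q, z)| with binary genotypes and product-form genotype frequencies, and it reads "monotone" as non-decreasing (which conditions (a) and (b) force). The two transforms, x² and √x, and the allele frequencies are illustrative choices, and the function names are hypothetical.

```python
import math
from itertools import product


def genotype_freq(freqs, z):
    """Assumed form of Fn(freqs, z): product over loci of f_i if z_i == 1, else 1 - f_i."""
    out = 1.0
    for f, zi in zip(freqs, z):
        out *= f if zi == 1 else 1.0 - f
    return out


def bayes_cn(alpha, p, q):
    """Assumed form of Cn: sum over all 2^n genotypes of |alpha*Fn(p,z) - (1-alpha)*Fn(q,z)|."""
    return sum(
        abs(alpha * genotype_freq(p, z) - (1.0 - alpha) * genotype_freq(q, z))
        for z in product((0, 1), repeat=len(p))
    )


alpha, p, q = 0.5, (0.1, 0.3, 0.6), (0.4, 0.5, 0.2)

# C0, C1, C2, C3: the measure as loci are added one at a time.
cn_by_loci = [bayes_cn(alpha, p[:n], q[:n]) for n in range(len(p) + 1)]

# Two continuous, increasing maps of [0, 1] onto itself that fix 0 and 1.
transforms = {"square": lambda x: x * x, "sqrt": math.sqrt}

print("Cn", cn_by_loci)
for name, f in transforms.items():
    # The transformed sequence keeps the same ordering as the original one.
    print(name, [f(c) for c in cn_by_loci])
```

Because each transform preserves order and fixes the endpoints 0 and 1, the transformed sequences remain non-decreasing in the number of loci and still start at zero, which is the behaviour the Zero and Performance criteria require.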
Appendix G The alternative non-decision-theoretic formulation of in compliance with all criteria
We construct here a family of measures , parametrized by the integer k : 1, … , ∞, where p = (p1, … ,pn), q = (q1, … ,qn), and
Note that , by using the same technique as in Appendix A, we also have
We show that these measures are compliant with all criteria.
Zero: (⇒): If P = Q (i.e. pi = qi, ∀i = 1, … , n) and α = 0.5 then we have for all . Thus from Eq. (G.1), and since the sum of all genotype frequencies from a single population is 1, (⇐): Conversely, we need to show that implies P = Q (i.e., pi = qi, ∀i = 1, … ,n) and α = 0.5. First, notice that if then trivially each summand of Eq. (G.1) must be zero, i.e. , . By summing for all we have α = 1 – α, i.e., α = 0.5. Therefore , . Denote by . Then we have for every i = 1, … , n which implies pi = qi.
Performance: We need to prove that . In fact, by denoting , , , we have
Asymptotics: First of all, note that is also a Bayes error for classifying over two n-loci (k + 1)-allele populations. By applying the same technique as in [Tal, 2012b, Appendix B.1], is bounded above by , the misclassification rate from simple multinomial model. Moreover, we have,
Thus, . It implies .
Neutrality: We need to prove that if pn+1 = qn+1. In fact, by denoting , , , we have
Continuity: From Eq. (G.1), is a sum of absolute values of continuous functions and therefore has no singularities, and since pi,qi and α are real-valued parameters, is continuous with respect to its parameters.
Dominance: (⇒) We need to prove that for a fixed n and some fixed α ∈ (0,1), if there exists i ∈ {1, … , n} such that |qi – pi| → 1 then . In fact, without loss of generality we assume that pn → 0 and qn → 1. We note that
If zn = 0 then and then
If zn = 1 then and then .
Therefore for all . This results in and consequently from Eq. (G.2) .
(⇐) Conversely, assume that for a fixed n and some fixed α ∈ (0,1), we need to prove that there exists i ∈ {1, … , n} such that |qi – pi| → 1. In fact, we first note that p ≠ q because otherwise which is a contradiction. Moreover, because satisfy the neutrality and invariance criteria, we can assume that pi < qi for all i. Now, put and . Because , it implies
By choosing z = (0, … , 0) we obtain , therefore . Similarly, by choosing z = (1, … , 1) we obtain , therefore . Assume that then we can choose such that . Note that
for (otherwise, it implies qi(> pi) → 1 and contradicts the definition of );
for and (from the definition of ).
Therefore we have . Similarly, we have for ; for and and therefore .
It implies that which contradicts (G.4). Thus which completes the proof.
Delta:
Invariances: [a] since the genotype probabilities and in Eq. (G.1) are each a commutative product of allele frequencies from all loci, is invariant to different ordering of loci; [b] follows from the presence of an absolute value in the formulation of Eq. (G.1); [c] the simultaneous substitution of pi with (1 – pi) and qi with (1 – qi) simply changes the order of the summation terms in Eq. (G.1) and thus does not affect .
Prior: If α → 0 or α → 1 then one of the two terms within the sum in Eq. (G.1) diminishes to zero and what remains in the limit is the sum over all genotype probabilities in one population, which equals 1, therefore .
Appendix H Proof that the alternative is not a monotonic function of the Bayes Cn
More precisely, we show that, for given n and k > 1 we can find (α, p, q) and (α′, p′, q′) such that Cn(α, p, q) = Cn(α′, p′, q′) but . In fact, we can choose, for example, α = α′ = 0.5, p1 = 0.1, q1 = 0, , , and pi = qi = = = 0.3 for all i = 2, … , n. By applying the Neutrality criterion, we have but . Fig. H.1 illustrates the behavior of the difference when k runs from 1 to 10.
Footnotes
† trandat{at}mis.mpg.de
1 A simple counter example for a loci pair is: p1 = 0.24/q1 = 0.18, and prior α = 0.51.
2 e.g., n = 1, p = 1/4, q = 1/2, r = 3/4, α = 2/5, β = 3/5, γ = 1/2. In general, for higher dimensions such failure may occur with qi = (pi ri)^(1/2), β = 1 − α, γ = 1/2; e.g., n = 3; pi = 0.7; ri = 0.95; qi = (pi ri)^(1/2), α = 0.45, β = 0.55, γ = 1/2.
3 e.g., the 3rd locus here is uninformative: p1 = 0.02/q1 = 0.24, p2 = 0.2/q2 = 0.1, p3 = 0.1/q3 = 0.25 with α = 0.7.
4 It is also possible to use a proper Bayesian approach for estimating allele frequencies, with Beta priors at each SNP locus, as in [Rannala and Mountain, 1997, Eq. 5].
5 In simulating Cn,m we replace allele frequency estimates of zero with a small constant, 1/(m+1), a common procedure to avoid zero genotype frequencies ([Rosenberg, 2005]; [Phillips et al., 2007]).
6 Note that the existence of this non-degenerate range around zero for ε does not depend on the allele frequency at the extra locus (denoted in this proof as p).
7 It is true almost surely since the probability that a genotype would have the same frequency in both populations, weighed by the priors, is vanishingly small (even with only one locus, since allele frequencies differ between populations).