The “Gini index” in genetics: measuring genetic architecture complexity of quantitative traits

Genetic architecture is a general term that is widely used in complex trait genetics. It refers to the number of functional loci that explain variation in a complex trait and to the distribution of genetic effects across these loci. Understanding how complex the genetic architecture of a trait is matters for evaluating the potential power of mapping functional loci and of predicting the trait. However, there has been no quantitative measure of genetic architecture complexity, which makes it difficult to link results from genetic data analysis to the term. Inspired by the “Gini index” for measuring income distribution in economics, I develop a genetic architecture score (“GA score”) to measure genetic architecture complexity. Simulations indicate that the GA score is an effective measure of the complexity of the genetic architecture of complex traits.


Introduction
Genetic architecture has been widely adopted as a general term that defines how a complex trait is regulated by the genome. Interestingly, although the concept is easily defined, we know little about the genetic architecture of any complex trait. Uncovering the underlying genetic architecture, step by step, therefore becomes the main goal of quantitative genetics research.
Although the genetic architecture of each measured complex trait is unknown, estimates derived from the data (e.g. heritability) always need to be interpreted in light of the potential complexity of the genetic architecture. For instance, two complex traits that have the same estimated narrow-sense heritability ($h^2$) may have very different genetic architectures, as the number of genetic variants that contribute to such an $h^2$ value may differ substantially. In general, the more polygenic a trait is, the more difficult it is to map functional loci using genome-wide association studies (GWAS) [1]. Thus, it is particularly useful to develop a general statistic that measures the complexity of a complex trait, given its phenotypic measurements and genotypes in a population.
The genetic architecture complexity describes how polygenic a trait is, namely, how evenly the genetic variance is distributed across all the genotyped variants. The well-known economic concept of the "Gini index" [2] measures the distribution of income in a population. Mimicking it, and given a certain null hypothesis, I develop here a score that measures the complexity level of genetic architecture as the distribution of genetic variance across the genome. Simulations indicate that the developed score is an effective measurement.

Statistical modeling
Prior to the development of the GA score, the genetic architecture complexity of a quantitative trait needs to be defined using the phenotype and genome-wide genotype data. Consider the linear whole-genome regression model

$$y_i = \sum_{j} x_{i,j} b_j + e_i \qquad (1)$$

where $y_i$ is the phenotypic value of individual $i$, $x_{i,j}$ its genotypic value at SNP $j$ coded as $-2f_j$, $1 - 2f_j$ or $2 - 2f_j$ ($f_j$ is the allele frequency of SNP $j$), $b_j$ the SNP effect, and $e_i$ the residual. The different SNP effects are weighted through SNP-specific weights $\lambda_j$ (eq. (2)). The hierarchical model defined by eqs. (1) and (2) is a double hierarchical generalized linear model [3] for high-throughput genetic markers [4]. The main idea of most current whole-genome regression methods is to optimize the estimation of the weights $\lambda_j$, as the markers ought to be weighted differently for different traits because of their different genetic architectures.
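To make the genotype coding concrete, the following is a minimal sketch of the centering described above and of the genetic part of the linear predictor. The function names (`allele_frequency`, `center_genotypes`, `linear_predictor`) and the toy allele counts are hypothetical, and the allele frequency is assumed here to be estimated from the same sample.

```python
def allele_frequency(counts):
    """Sample allele frequency from allele counts coded 0/1/2."""
    return sum(counts) / (2 * len(counts))


def center_genotypes(counts, f=None):
    """Code allele counts 0/1/2 as -2f, 1-2f, 2-2f."""
    if f is None:
        # Assumption: f is estimated from the same sample.
        f = allele_frequency(counts)
    return [c - 2 * f for c in counts]


def linear_predictor(x_rows, b):
    """Genetic value sum_j x_ij * b_j for every individual i."""
    return [sum(x * bj for x, bj in zip(row, b)) for row in x_rows]


# Hypothetical allele counts of one SNP in eight individuals.
counts_snp = [0, 1, 2, 1, 0, 2, 1, 1]
f = allele_frequency(counts_snp)
x = center_genotypes(counts_snp)

# Two individuals, two SNPs, hypothetical centered genotypes and effects.
g = linear_predictor([[-1.0, 1.0], [0.0, 2.0]], [0.5, 0.25])
```

With this coding, each genotype column has mean zero whenever the frequency is estimated from the sample itself.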

Genetic architecture score
Let us define the genetic architecture complexity level as the weight distribution of genome-wide SNP effects, so that
• The highest complexity: $\lambda_j$ is uniformly distributed across all the markers;
• The lowest complexity: $\lambda_j > 0$ and $\lambda_{j'} = 0$ for all $j' \neq j$, i.e. only one marker is predictive of the trait.
Under the null hypothesis of the highest complexity, we have

$$\lambda_j = \lambda \quad \text{for all } j \qquad (3)$$

and eq. (1) becomes a ridge regression, a.k.a. the SNP-BLUP model [5,6]. Under this null hypothesis, every marker is predictive of the trait, so that for SNP $j$, after training the SNP-BLUP model in a training set, the variance explained by SNP $j$ in a test set is proportional to its expected value. Namely, we have

$$r_j^2 \propto 2 f_j (1 - f_j) \hat{b}_j^2 \qquad (4)$$

where $r_j^2$ is the variance captured by SNP $j$ in the test set, $f_j$ the allele frequency of SNP $j$ in the test set, and $\hat{b}_j$ the estimated SNP effect in the training set.
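As a small worked illustration of the expected per-SNP variance $2 f_j (1 - f_j) \hat{b}_j^2$, the sketch below evaluates it for three SNPs. The function name `expected_variance`, the allele frequencies and the effect estimates are all hypothetical values chosen for illustration, not results from the paper.

```python
def expected_variance(f, b_hat):
    """Expected variance explained by one SNP: 2 f (1 - f) b_hat^2."""
    return 2.0 * f * (1.0 - f) * b_hat ** 2


# Hypothetical test-set allele frequencies and training-set effect estimates.
freqs = [0.5, 0.1, 0.3]
effects = [0.2, -0.5, 0.1]

ev = [expected_variance(f, b) for f, b in zip(freqs, effects)]

# Relative share of each SNP, which is what the proportionality concerns.
total = sum(ev)
shares = [v / total for v in ev]
```

Note that a rare SNP with a large estimated effect (the second one here) can still carry the largest expected share.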
Because SNP-BLUP estimates are shrunk in high-dimensional genotype data, one cannot simply add up $2 f_j (1 - f_j) \hat{b}_j^2$ over multiple SNPs to obtain the cumulative expected variance explained by a group of SNPs. Nevertheless, it is possible to mimic the definition of the Gini index. Ranking the SNPs from the smallest to the largest expected variance explained, one can calculate the observed cumulative variance explained by the first $k$ SNPs, $V^O_k$ (eq. (5)). A genetic architecture (GA) score $\theta$ can then be defined as the area ratio illustrated in Figure 1, i.e.

$$\theta = 1 - A/B = (B - A)/B$$
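The area ratio can be sketched numerically as follows. This is only an illustration under stated assumptions: the normalized curve of Figure 1 is assumed to be available as two sequences running from 0 to 1, the curve is assumed to lie on or below the diagonal (as a Lorenz curve does), and the area is approximated with the trapezoidal rule. The function name `ga_score` and the toy curves are hypothetical.

```python
def ga_score(x, y):
    """GA score 1 - A/B from a normalized cumulative curve.

    x, y: sequences of equal length, both running from 0 to 1.
    Assumption: the curve lies on or below the diagonal y = x.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (>= 2)")
    # Area under the curve by the trapezoidal rule.
    under = sum(
        (x[k] - x[k - 1]) * (y[k] + y[k - 1]) / 2.0
        for k in range(1, len(x))
    )
    B = 0.5          # area of the triangle under the diagonal
    A = B - under    # area between the diagonal and the curve
    return 1.0 - A / B


grid = [0.0, 0.25, 0.5, 0.75, 1.0]

# Null (highest complexity): the curve coincides with the diagonal.
theta_null = ga_score(grid, grid)

# "Monogenic" extreme on a four-SNP grid: only the last SNP contributes.
theta_mono = ga_score(grid, [0.0, 0.0, 0.0, 0.0, 1.0])
```

On a grid of $m$ SNPs, the monogenic extreme gives $1/m$ under this approximation, so the score approaches zero as the number of markers grows.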

Results
Three phenotypes were simulated based on the simulated genotypes announced by the Genetics Society of America (GSA) [7], for both the training set (n = 2 000) and the validation samples (simulated young generations, n = 1 500). All three phenotypes had exactly the same simulated narrow-sense heritability ($h^2$ = 0.5) but different numbers of causal variants (50, 5 000 and 50 000, respectively). The whole genome contains 57 458 genotyped markers in total.
A SNP-BLUP model was fitted for each phenotype using the bigRR package [6] in the training population. The estimated SNP effects were passed on to the validation set to compute the GA score $\theta$. The estimated GA scores were 0.42, 0.85 and 0.93 for the 50-, 5 000- and 50 000-causal-variant architectures, respectively. Thus, the estimated GA score increases, although not linearly, with the number of causal variants.
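For readers who want to experiment with similar scenarios, the sketch below shows one simple way to simulate a phenotype with a chosen number of causal variants and a fixed heritability. It is not the GSA simulation procedure, whose details are not given here. The function `simulate_phenotype`, the normally distributed causal effects, the random toy genotypes and the rescaling of the noise are all assumptions made for illustration.

```python
import random


def variance(v):
    """Population variance of a list of numbers."""
    mean = sum(v) / len(v)
    return sum((a - mean) ** 2 for a in v) / len(v)


def simulate_phenotype(x_rows, n_causal, h2, seed=1):
    """Simulate y = g + e with var(g) / (var(g) + var(e)) equal to h2."""
    rng = random.Random(seed)
    m = len(x_rows[0])
    # Assumption: causal SNPs are picked at random with normal effects.
    causal = rng.sample(range(m), n_causal)
    effects = {j: rng.gauss(0.0, 1.0) for j in causal}
    g = [sum(row[j] * b for j, b in effects.items()) for row in x_rows]
    e = [rng.gauss(0.0, 1.0) for _ in x_rows]
    # Rescale the noise so that the variance ratio matches h2 in this sample.
    scale = (variance(g) * (1.0 - h2) / (h2 * variance(e))) ** 0.5
    e = [scale * a for a in e]
    y = [a + b for a, b in zip(g, e)]
    return y, g, e, sorted(causal)


# Toy genotypes: 50 individuals, 20 SNPs, allele counts 0/1/2.
geno_rng = random.Random(0)
x_rows = [[geno_rng.choice([0, 1, 2]) for _ in range(20)] for _ in range(50)]

y, g, e, causal = simulate_phenotype(x_rows, n_causal=5, h2=0.5)
realized_h2 = variance(g) / (variance(g) + variance(e))
```

Here the heritability is matched as a ratio of sample variances, ignoring any chance covariance between the genetic values and the noise.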

Discussion
The GA score developed here measures the genetic architecture complexity in terms of the distribution of genetic variance over the genome. Nevertheless, the estimated scores are more useful when comparing multiple phenotypes. An interesting follow-up study would be to correlate such a score with the sample size required in a GWAS meta-analysis of a certain trait, so that one can predict the chance of new discoveries in GWAS prior to data collection.
The observed cumulative variance explained by the first $k$ ranked SNPs is

$$V^O_k = r^2\left(\tilde{y}, \sum_{j^*=1}^{k} x_{j^*} \hat{b}_{j^*}\right) \qquad (5)$$

where $j^*$ represents the rank, $\tilde{y}$ and $x$ are the phenotype and genotype vectors in the test sample, and $r$ stands for the correlation coefficient. Under the null, $V^O_k \propto s$, where $s$ is the number of SNPs included as predictors. A deviation from the null hypothesis will result in a function $V^O = g(s)$ deviating from the straight line $V^O = cs$ (Figure 1), where $c = V^O_m/m$ and $m$ is the total number of SNPs. Therefore, the area between the curve $V^O = g(s)$ and the line $V^O = cs$ measures the genetic architecture complexity.
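The curve $V^O = g(s)$ and the null line $V^O = cs$ can be sketched as below. The sketch assumes that $r$ is the Pearson correlation and that the ranking of the SNPs is supplied by the caller. The function names (`pearson_r`, `observed_curve`, `null_line`) and the toy test-set data are hypothetical.

```python
from math import sqrt


def pearson_r(u, v):
    """Pearson correlation coefficient (0.0 if either input is constant)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    if suu == 0 or svv == 0:
        return 0.0
    return suv / sqrt(suu * svv)


def observed_curve(x_rows, y_test, b_hat, order):
    """V^O_k for k = 1..m: squared correlation between the test phenotype
    and the predictor built from the first k ranked SNPs."""
    pred = [0.0] * len(y_test)
    curve = []
    for j in order:
        pred = [p + row[j] * b_hat[j] for p, row in zip(pred, x_rows)]
        curve.append(pearson_r(y_test, pred) ** 2)
    return curve


def null_line(curve):
    """Straight line V^O = c * s with c = V^O_m / m."""
    m = len(curve)
    c = curve[-1] / m
    return [c * s for s in range(1, m + 1)]


# Toy test set: four individuals, two SNPs (centered genotypes).
x_rows = [[-1.0, 0.5], [0.0, -0.5], [1.0, 0.5], [0.0, -0.5]]
b_hat = [0.1, 1.0]                 # hypothetical training-set estimates
order = [0, 1]                     # SNP 0 has the smaller expected variance
y_test = [0.4, -0.5, 0.6, -0.5]    # equals the full predictor in this toy case

curve = observed_curve(x_rows, y_test, b_hat, order)
line = null_line(curve)
```

In this toy case almost all of the variance is added by the last-ranked SNP, so the curve stays well below the null line until the final step, which is the pattern expected for a low-complexity architecture.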

Figure 1. Illustration of the GA score definition. The black curve plots $V^O_k/V^O_m$ against $V^E_k/V^E_m$. The GA score is defined as the ratio (B − A)/B, where the area A varies from zero (null: highest complexity) to B = 0.5 (lowest complexity: "monogenic").

Figure 2. GA score applied to three different scenarios of the GSA simulated data. A narrow-sense heritability of 0.50 was simulated in all scenarios. 50 (A), 5 000 (B), and 50 000 (C) causal markers were simulated, respectively.