Abstract
Admixture regression methodology exploits the natural experiment of random mating between individuals with different ancestral backgrounds to infer the environmental and genetic components of trait variation across culturally-defined racial and ethnic groups. This paper provides a new statistical framework for admixture regression based on the linear polygenic index model widely used in behavioural genetics. Using this framework we develop a new test of the differential impact of multi-racial identities on trait variation, an orthogonalization procedure for added explanatory variables, and a partially linear semiparametric functional form. The methodology is illustrated with data from the Adolescent Brain Cognitive Development Study.
1 Introduction
Racial/ethnic group identities such as Black, White, Hispanic, Native American, East Asian and South Asian show empirically strong linkages to medical and behavioural traits such as obesity (Wang et al. 2007), type 2 diabetes (Cheng et al. 2013), hypertension (Lackland 2014), asthma (Choudry et al. 2006), neuropsychological performance (Llibre-Guerra et al. 2018), smoking behaviours (Choquet et al. 2021), and sleep disorders (Halder et al. 2015). An important research question is to what degree any such observed trait variation arises from differences in the typical diets, cultural practices and other environmental particularities of the racial/ethnic groups, or from similarity in genetic pools within each group traceable to shared geographic ancestry. Many diverse national populations descend from continental groups that were geographically isolated from one another until the last few hundred years. Modern genetic technology can measure with high accuracy the proportion of an individual’s ancestry associated with these continental groups. Also, in many culturally diverse nations, most individuals can reliably self-identify as members of one or more racial or ethnic groups. Admixture regression leverages these two data sources, self-identified race or ethnicity (SIRE) and genetically-measured admixture proportions, to decompose trait variation correspondingly. Admixture regression has been widely applied to medical and behavioural traits including asthma (Salari et al. 2007), body mass index (Klimentidis et al. 2009), type 2 diabetes (Cheng et al. 2013), blood pressure (Klimentidis et al. 2012), neuropsychological performance (Lasker et al. 2019), and sleep depth (Halder et al. 2015). It has particular value in the case of complex behavioural traits where reliably identifying genetic loci associated with trait variation is beyond the current reach of science.
Admixture mapping is a more technically challenging methodology, often used in conjunction with admixture regression, which uses ancestral population trait differences to attempt to identify genetic loci associated with a trait. This paper focusses exclusively on admixture regression.
This paper first develops a new statistical framework for admixture regression of behavioural traits by linking it to the linear polygenic index model from behavioural genetics; this framework clarifies the key assumptions that are implicit in this simple and powerful statistical technique. The paper then extends the admixture regression methodology in several ways. We provide a new test statistic for identifying whether a given multi-racial identity differs in its trait impact from the average impact of its component single-SIRE categories. We examine the role of additional explanatory variables in the admixture regression and their interpretation with and without orthogonalization with respect to the core explanatory variables. We generalize the linear admixture regression specification to a partially linear semiparametric form.
We illustrate our methodology using neuropsychological performance data from the Adolescent Brain Cognitive Development database. Neuropsychological performance is one of the most complex traits to which admixture regression analysis has been applied. Using our new test statistic, we find that some multi-racial categories have identifiably distinct impact on trait variation relative to their component categories. We find that orthogonalization of additional variables can substantially change the interpretation of the core coefficients in the admixture regression. Our results hint that a partially linear semiparametric specification potentially adds empirical value.
2 A statistical framework for admixture regression tests of trait variation
2.1 Variable definitions
We assume that the database consists of n individuals indexed by i = 1, n who have each self-identified their racial or ethnic group membership(s), recorded a score on a behavioural trait, si, and provided a personal DNA sample. The k racial or ethnic group self-identification choices are captured by a matrix of zero-one dummy variables SIREij, i = 1, n; j = 1, k. We assume that every individual has self-identified as belonging to at least one and possibly more of the k groups.
We assume that a set of m geographic ancestries covered in the study have been chosen, such as African, European, Amerindian, South Asian, and East Asian, indexed by h = 1, m. The genotyped DNA samples are carefully decomposed into admixture proportions of geographic ancestry, as discussed in Section 4 below. For each individual the ancestry proportions across the chosen geographic ancestries sum to one. This gives a matrix of ancestry proportions Aih, i = 1, n; h = 1, m with 0 ≤ Aih ≤ 1 for all i, h and $\sum_{h=1}^{m} A_{ih} = 1$ for each i.
In most applications of admixture regression, individuals’ racial or ethnic group identities will have statistical relationships with individuals’ genetically identified geographic ancestries and also with the observed trait si. The objective of admixture regression is to decompose trait variation into linear components due to genetic ancestries and linear components due to racial/ethnic group related effects.
2.2 An empirically infeasible GWAS model of ancestry-related behavioural trait variation
Admixture regression is an indirect method of analyzing group-related trait variation. In this subsection we provide a foundation for admixture regression by considering a more direct, but empirically infeasible, alternative approach based on a linear polygenic index model. Then in the next subsection we will show that the admixture regression model can be viewed as a statistically feasible simplification of this linear polygenic index model, in which proportional ancestries serve as statistical proxies for ancestry-related genetic differences.
The human genetic code contains a very large number of genetic variants (the alleles on the genome which vary between individuals) called single nucleotide polymorphisms or SNPs. Consider hypothetically a complete list of all genetic variants with any impact on variation in the observed trait, and let Z denote the number of SNPs in this list. Assign a value of 0, 1 or 2 to each SNP for individual i depending upon the number of minor alleles for that SNP. Let SNPiz, i = 1, n; z = 1, Z denote the number of minor alleles on the zth SNP of the ith individual in the sample. The biochemical process linking human genetic variation to behavioural trait variation is unimaginably complex, and scientific understanding of the full biochemical process is very limited. Genome-wide association studies (GWAS) have made slow but steady progress in statistically modeling these linkages, although precise biochemical linkages are beyond the contemporary scientific frontier for most behavioural traits. A standard, admittedly highly simplified, model of the gene variation - trait variation nexus is the linear polygenic index model, in which the genetic component of a trait is a simple linear function of a relevant subset of the individual’s genetic variants. The linear polygenic index model has been applied to a wide range of medical and behavioural traits including body mass index (Yengo et al. 2018), neuroticism (Nagel et al. 2018), depression susceptibility (Wray et al. 2018), suicidal ideation (Mullins et al. 2014), schizophrenia (Mistry et al. 2018), educational attainment (Lee et al. 2018), neuropsychological test performance (Savage et al. 2018), and risk-taking (Clifton et al. 2017). The linear admixture regression model can be derived elegantly by invoking this standard linear polygenic index model, and hence we impose it in our model:

$$p_i = \sum_{z=1}^{Z} \beta_z \, \mathrm{SNP}_{iz} \qquad (1)$$

where pi denotes the “genetic potential” of individual i regarding the observable trait and $\beta_z$ is the linear coefficient on the zth SNP.
Let Prz(·) denote the univariate probability distribution for SNPz (probabilities of the three possible values 0, 1, and 2). The probability distributions of many SNPs differ substantially across geographic ancestries, hence we define the conditional probability distributions: let Prz(·|Ah = 1) denote the conditional probability distributions for SNPz for individuals with purebred (that is, Ah = 1) ancestry for h = 1, m. There is no assumption of genetic homogeneity within the ancestral populations, only that they are genetically distinct and hence these (unobserved) conditional probability distributions exist as hypothetical entities. There is no attempt to estimate these conditional probability distributions directly, but rather only to use them to create conditional expectations in the construction of statistical proxy variables.
The expectation of pi under each purebred probability distribution defines the average genetic trait potential of each purebred ancestry:

$$\bar{p}_h = E[\, p_i \mid A_h = 1 \,], \quad h = 1, m \qquad (2)$$

which are not observed directly, but will be inferred indirectly from the admixture regression findings.
A key assumption of the admixture regression model is that admixture arises from recent random mating between the previously geographically-isolated ancestral groups. Assuming recent random mating between ancestral lines, it follows from the fundamental processes of sexual reproduction that the univariate probability distribution of any SNP for an admixed individual is the convex combination of the purebred probability distributions, with linear coefficients equal to the individual’s admixture proportions. (The relationship between the multivariate distributions is more complicated, but the multivariate distributions do not impact the expected trait given the linear polygenic index assumption.) We use the subscript i· to denote the vector created from the ith row of a matrix. We assume that mating across geographic ancestries is recent and random, and therefore in particular that the univariate frequency distribution of each SNP for any individual is the convex combination of the purebred frequency distributions:

$$\mathrm{Pr}_z(\cdot \mid A_{i\cdot}) = \sum_{h=1}^{m} A_{ih} \, \mathrm{Pr}_z(\cdot \mid A_h = 1) \qquad (3)$$
Equation (3) is a fundamental condition for the admixture regression methodology.
The linearity of genetic potential in the SNPs (1) and the random mixing assumption (3) imply that the conditional expected genetic potential of an admixed individual is a convex combination of the purebred average genetic potentials, weighted by the individual’s admixture proportions. Combining (1), (2) and (3), the conditional expected value of genetic potential for an individual with admixture proportions Ai· is the convex combination of the unobserved purebred averages from (2), denoted $\bar{p}_h$, with observed linear coefficients Aih:

$$E[\, p_i \mid A_{i\cdot} \,] = \sum_{h=1}^{m} A_{ih} \, \bar{p}_h \qquad (4)$$
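The step from random mixing to a convex combination of expected genetic potentials can be checked numerically. The sketch below is illustrative only: the three SNPs, their allele-count distributions in two purebred ancestries, the coefficients `beta`, and the admixture proportions `a_i` are all invented numbers, not estimates from any dataset.

```python
import numpy as np

# Hypothetical allele-count distributions (P[SNP=0], P[SNP=1], P[SNP=2])
# for 3 SNPs in 2 purebred ancestries; all numbers are invented.
pr_pure = np.array([
    [[0.64, 0.32, 0.04], [0.25, 0.50, 0.25], [0.81, 0.18, 0.01]],  # ancestry 1
    [[0.16, 0.48, 0.36], [0.49, 0.42, 0.09], [0.36, 0.48, 0.16]],  # ancestry 2
])
beta = np.array([0.5, -0.2, 0.1])   # hypothetical SNP coefficients
counts = np.array([0.0, 1.0, 2.0])  # possible minor-allele counts

def expected_potential(pr):
    """Expected linear polygenic index given per-SNP count distributions."""
    return float(beta @ (pr @ counts))

# Average genetic potential of each purebred ancestry.
pure_means = np.array([expected_potential(pr_pure[h]) for h in range(2)])

# An admixed individual: each SNP distribution is the convex combination
# of the purebred distributions, weighted by the admixture proportions.
a_i = np.array([0.3, 0.7])
pr_mixed = np.tensordot(a_i, pr_pure, axes=1)  # shape (3 SNPs, 3 counts)

lhs = expected_potential(pr_mixed)  # expectation under the mixed distributions
rhs = float(a_i @ pure_means)       # convex combination of purebred means
print(lhs, rhs)                     # the two agree up to rounding error
```

Only the univariate distribution of each SNP enters the calculation, which is why the linearity of the polygenic index matters: with a nonlinear index the joint distribution across SNPs would also be needed.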
2.3 Ancestry proportions as a statistical proxy for ancestry-linked trait variation
A key difference in the admixture regression methodology compared to GWAS is that there is no attempt to estimate (1) directly. Rather, admixture regression uses the natural experiment of subpopulation mixing to infer differences in the conditional expected value of (1) arising from differences in the probability distribution of genetic variants across ancestries.
For expositional simplicity, in this subsection we assume that every individual included in the sample has self-identified as belonging to exactly one from the pre-specified set of k racial or ethnic groups, so that $\sum_{j=1}^{k} \mathrm{SIRE}_{ij} = 1$ for all i. In this case, the n × k matrix of racial/ethnic group explanatory variables used in the admixture regression, denoted G, is simply set equal to the SIRE matrix: Gij = SIREij for i = 1, n; j = 1, k. Multi-racial individuals (those who have self-identified as belonging to two or more groups) will be introduced into the analysis in the next subsection.
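Define the environmental component of the trait, $e_i$, as the observed trait minus genetic potential:

$$e_i = s_i - p_i \qquad (5)$$

where $e_i$ is defined as all trait variation not captured by $p_i$. Equation (5) is only definitional; later we will impose various conditions on $e_i$ to enable statistical identification of the model. Define

$$\eta_i = p_i - E[\, p_i \mid A_{i\cdot} \,]$$

as the genetic component of the trait for each i which is not explained by geographic ancestry $A_{i\cdot}$. Writing $\bar{p}_h$ for the average genetic potential of purebred ancestry h defined in (2), so that $E[\, p_i \mid A_{i\cdot} \,] = \sum_{h} A_{ih} \bar{p}_h$ by (4), simple substitution into (5) gives:

$$s_i = \sum_{h=1}^{m} A_{ih} \, \bar{p}_h + \eta_i + e_i \qquad (6)$$

Recall that $\sum_{h=1}^{m} A_{ih} = 1$ for all i, so that one term in (6) is redundant; substitute

$$A_{i1} = 1 - \sum_{h=2}^{m} A_{ih}$$

into (6) to get:

$$s_i = b_{A1} + \sum_{h=2}^{m} b_{Ah} A_{ih} + \eta_i + e_i \qquad (7)$$

where $b_{Ah} = \bar{p}_h - \bar{p}_1$; h = 2, m, and $b_{A1} = \bar{p}_1$.

Equation (7) is not well-specified as a regression model since the error term will not be mean zero conditional on $A_{i\cdot}$ due to racial and ethnic group-related effects in $e_i$. In order to transform (7) into a regression model it is necessary to add explanatory terms to the regression model to remove the expected value of $e_i$ conditional on $A_{i\cdot}$. This is accomplished by assuming that the expected differences in $e_i$ conditional on $A_{i\cdot}$ are linearly dependent on the group identifiers $G_{i\cdot}$ and not otherwise dependent upon admixture proportions:

$$e_i = b_{G1} + \sum_{j=2}^{k} b_{Gj} G_{ij} + \xi_i \qquad (8)$$

where $b_{Gj}$ captures the environmental component associated with membership in group j relative to the reference group j = 1, $b_{G1}$ is the environmental component of the reference group, and $\xi_i$ is assumed to be independent of $A_{i\cdot}$, $G_{i\cdot}$ and $\eta_i$. Although not strictly necessary, we also assume for simplicity that both residuals are normally distributed: $\eta_i \sim N(0, \sigma_\eta^2)$ and $\xi_i \sim N(0, \sigma_\xi^2)$. Combining (7) and (8) produces the key linear admixture regression specification:

$$s_i = b_0 + \sum_{j=2}^{k} b_{Gj} G_{ij} + \sum_{h=2}^{m} b_{Ah} A_{ih} + \varepsilon_i \qquad (9)$$

where $b_0 = b_{A1} + b_{G1}$ and $\varepsilon_i = \eta_i + \xi_i$. Note that $\varepsilon_i$ is normally distributed with zero mean and variance $\sigma_\varepsilon^2 = \sigma_\eta^2 + \sigma_\xi^2$ and is independent of $A_{i\cdot}$ and $G_{i\cdot}$.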
The ordinary least squares coefficient estimates of bGj, bAh, j = 2, k; h = 2, m in (9) are maximum likelihood and efficient. In many applications, the analyst also has information on the sampling substructure of the data, such as its division into site-specific subsamples. In this case, a linear mixed effects model can be used for estimating (9) rather than ordinary least squares. This involves partially decomposing the residual term εi in (9) into linear random effects components linked to data collection site identifiers and/or other subsample identifiers, see Heeringa and Berglund (2021).
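As a minimal sketch of ordinary least squares estimation of specification (9), the following simulates single-SIRE group dummies and ancestry proportions and recovers the coefficients. The sample size, the "true" coefficients, and the noise level are all assumptions made for illustration; the site-level random effects mentioned above are not modelled here.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, m = 500, 3, 3

# Simulated single-SIRE dummies (n x k) and ancestry proportions (n x m).
sire = np.eye(k)[rng.integers(0, k, size=n)]
anc = rng.dirichlet(np.ones(m), size=n)       # each row sums to one

# Hypothetical true coefficients; group 1 and ancestry 1 are the references.
b0, b_g, b_a = 0.2, np.array([0.5, -0.3]), np.array([1.0, -0.7])
noise = rng.normal(0.0, 0.1, size=n)
s = b0 + sire[:, 1:] @ b_g + anc[:, 1:] @ b_a + noise

# Design matrix: intercept, G_2..G_k, A_2..A_m. The reference columns are
# dropped because each full set of columns sums to one (perfect collinearity).
X = np.column_stack([np.ones(n), sire[:, 1:], anc[:, 1:]])
b_hat, *_ = np.linalg.lstsq(X, s, rcond=None)

# Classical OLS covariance matrix and standard errors.
resid = s - X @ b_hat
sigma2 = resid @ resid / (n - X.shape[1])
cov_b = sigma2 * np.linalg.inv(X.T @ X)
std_err = np.sqrt(np.diag(cov_b))
```

The estimated coefficients on the retained columns are interpreted relative to the omitted reference group and reference ancestry, which is why the choice of reference categories should be reported alongside the results.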
2.4 Adding multi-racial individuals to the regression
An identifying assumption of the admixture regression technique is that the environmental influences associated with racial/ethnic group membership are captured by the group membership self-identification choices, SIRE. Many individuals self-identify as belonging to two or more racial or ethnic groups and the model must be adapted to this reality. In the context of our statistical framework, there are essentially three approaches: evenly splitting the individual’s affiliation across their chosen groups, creating a new group for one or more particular multi-racial combinations, or deleting particular multi-racial observations where neither of the other two approaches seems appropriate.
Recall that SIRE is the n × k matrix of race/ethnicity self-identifications, and we now allow that some individuals choose more than one category, so that $\sum_{j=1}^{k} \mathrm{SIRE}_{ij} > 1$ for some i. The simplest regression specification in this case is to assume that the group environment faced by a multi-racial individual is the average of the component group environments:

$$G_{ij} = \frac{\mathrm{SIRE}_{ij}}{\sum_{j'=1}^{k} \mathrm{SIRE}_{ij'}}, \quad i = 1, n; \; j = 1, k \qquad (10)$$
Although (10) is a reasonable specification, it is restrictive. It is possible to replace (10) with a more general specification at some loss of parsimony. Suppose that we are concerned about imposing the restrictive condition (10) for some common multi-racial choice (such as, for example, Black-White biracial in a US dataset). Let V1 denote a k-vector with ones for the included race/ethnicity groups in this particular multi-racial combination and zeros elsewhere. We can supplement (10) by adding a k + 1st group and using a different rule for this subset of multi-racials:

$$G_{i,k+1} = \begin{cases} 1 & \text{if } \mathrm{SIRE}_{i\cdot} = V_1 \\ 0 & \text{otherwise} \end{cases}, \qquad G_{ij} = \begin{cases} 0 & \text{if } \mathrm{SIRE}_{i\cdot} = V_1 \\ \mathrm{SIRE}_{ij} \big/ \sum_{j'=1}^{k} \mathrm{SIRE}_{ij'} & \text{otherwise} \end{cases}, \quad j = 1, k \qquad (11)$$

where SIREi· = V1 denotes vector equality between these two k-vectors. There are now k + 1 groups: the originally specified SIRE groups and a new group for the selected multiracial combination. G becomes an n × (k + 1) matrix, and the regression (9) described in the previous subsection applies exactly as before but with one extra dimension to G. Any small number of defined multi-racial groups can be appended in this way. The only change to the regression methodology is that G becomes an n × k* matrix (with an associated increase in the set of estimated parameters) where k* – k is the number of multiracial combinations added as new categories.
It is not feasible to use rule (11) for all race/ethnicity choice combinations due to lack of parsimony; there are $2^k - k - 1$ potential multi-racial combinations and each one added requires an additional parameter in the regression. It can only be used for the common multi-racial choices where there is sufficient data of that combination in the sample. For all others, it is necessary to stick with the restrictive assumption (10) or drop the observations from the sample. This will be illustrated in the empirical application in Section 4.
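The two rules for building G, even splitting across chosen groups and unit exposure to an added composite group, can be sketched as follows. The function name `build_group_matrix`, its `composites` argument, and the toy SIRE rows are hypothetical and introduced only for illustration; the sketch assumes every individual has chosen at least one group.

```python
import numpy as np

def build_group_matrix(sire, composites=()):
    """Build G from an n x k zero-one SIRE matrix.

    Rows matching a pattern in `composites` get unit exposure to a new
    column; all other rows are split evenly across their chosen groups.
    Assumes every row contains at least one chosen group.
    """
    sire = np.asarray(sire, dtype=float)
    n, k = sire.shape
    G = np.zeros((n, k + len(composites)))
    G[:, :k] = sire / sire.sum(axis=1, keepdims=True)   # even split
    for c, pattern in enumerate(composites):
        match = np.all(sire == np.asarray(pattern, dtype=float), axis=1)
        G[match, :k] = 0.0      # no exposure to the single-SIRE columns
        G[match, k + c] = 1.0   # unit exposure to the composite column
    return G

sire = [[1, 0, 0],   # single choice
        [1, 1, 0],   # two choices, declared as a composite group below
        [0, 1, 1],   # two choices, split evenly
        [1, 1, 1]]   # three choices, split evenly
G = build_group_matrix(sire, composites=[(1, 1, 0)])
print(G)
```

Every row of G sums to one under either rule, so the reference-category convention of the regression carries over unchanged when composite columns are appended.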
Once a regression model is estimated using (11), it is possible to test the accuracy of restrictive assumption (10) for that multi-racial group. The restrictive assumption implicit in (10) requires that the average of the coefficients of the components equals the added-group coefficient in the unrestricted model:

$$b_{Gj^*} = \frac{1}{\#j^*} \sum_{j \in j^*} b_{Gj} \qquad (12)$$

where j* indexes the added multiracial category, #j* denotes the number of components in the multiracial category (typically either two or three), the sum runs over these elements only, and the coefficient of the reference category is zero. This is a linear restriction on the vector of coefficients, or multiple linear restrictions for k* – k greater than one, which can be tested with a t-test (for each group coefficient singly) or a Wald test for all of them, as detailed below.

Let $\hat{b}$ denote the (m + k*)-vector of all the estimated coefficients in the admixture regression (9), and let $\hat{\Sigma}_b$ denote the estimated (m + k*) × (m + k*) covariance matrix of these estimates.

First consider the case k* – k = 1. Let R denote the (m + k*)-vector expressing restriction (12) imposed on b. For example, if the group combination consists of individuals who choose all three of the first, second, and third SIRE categories (recalling that the first SIRE category has a zero beta by definition) the restriction vector is:

$$R = \left(0, \; -\tfrac{1}{3}, \; -\tfrac{1}{3}, \; 0, \ldots, 0, \; 1, \; 0, \ldots, 0\right)'$$

where the 1 is element k* in the vector. Any other restriction of type (12) is easily stated in this way. In the case of one group, this gives rise to a standard t-test of the one coefficient restriction, and in particular:

$$t = \frac{R'\hat{b}}{\sqrt{R'\hat{\Sigma}_b R}} \qquad (13)$$

For the case k* – k > 1 it is possible to test each multi-racial group equality individually as above using (13) or perform a joint Wald test on all of them. Let R denote the (m + k*) × (k* – k)-matrix of all the linear restrictions, giving the standard Wald test:

$$W = (R'\hat{b})' \, (R'\hat{\Sigma}_b R)^{-1} \, (R'\hat{b}) \approx \chi^2_{k^* - k} \qquad (14)$$

where ≈ denotes the approximate distribution for large n. In the case of estimation by linear mixed effects modeling, both test statistics (13) and (14) have large-n asymptotic distributions rather than exact finite-sample distributions, but they remain valid tests.
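A short sketch of how the single-restriction t-statistic and the joint Wald statistic can be computed from an estimated coefficient vector and its covariance matrix. The function name `restriction_tests`, the coefficient ordering, and the numerical values of `b_hat` and `cov_b` are hypothetical; in practice both would come from the fitted admixture regression.

```python
import numpy as np

def restriction_tests(b_hat, cov_b, R):
    """Test the linear restrictions R'b = 0 (one column of R per restriction).

    Returns the vector of single-restriction t-statistics and the joint
    Wald statistic.
    """
    b_hat = np.asarray(b_hat, dtype=float)
    R = np.asarray(R, dtype=float).reshape(len(b_hat), -1)
    d = R.T @ b_hat                      # estimated restriction gaps
    V = R.T @ cov_b @ R                  # covariance matrix of the gaps
    t_stats = d / np.sqrt(np.diag(V))    # one t-statistic per restriction
    wald = float(d @ np.linalg.solve(V, d))
    return t_stats, wald

# Hypothetical ordering: [intercept, b_G2, b_G3, b_G4, b_A2], where group 4
# is an added composite of groups 2 and 3.
b_hat = np.array([0.1, 0.4, -0.2, 0.3, 0.8])
cov_b = 0.01 * np.eye(5)                 # invented covariance matrix

# Restriction: b_G4 - (b_G2 + b_G3) / 2 = 0.
R = np.array([0.0, -0.5, -0.5, 1.0, 0.0])
t_stats, wald = restriction_tests(b_hat, cov_b, R)
print(t_stats, wald)
```

With a single restriction the Wald statistic equals the square of the t-statistic, so the two tests agree; the joint version is only needed when several composite groups are tested together.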
3 Extensions of the linear admixture regression model
3.1 Additional explanatory variables with and without orthogonalization
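It is straightforward to include additional explanatory variables in the admixture regression model. Let xi1, xi2,…, xil denote a set of explanatory variables that help to linearly explain the trait along with the ancestry proportions and group identities. We modify specification (9) to include these:

$$s_i = b_0 + \sum_{j=2}^{k} b_{Gj} G_{ij} + \sum_{h=2}^{m} b_{Ah} A_{ih} + \sum_{q=1}^{l} b_{Xq} x_{iq} + \varepsilon_i \qquad (15)$$

where $b_0$ is the intercept, $b_{Xq}$, q = 1, l are the coefficients on the added variables and $\varepsilon_i$ is the regression residual, and keep all the other assumptions as before. The estimation theory for (15) is essentially identical to that of (9) as discussed above.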
In some cases, the admixture regression model with additional explanatory variables (15) can be made more useful and informative by orthogonal rotation of one or more of the explanatory variables, in order to aggregate the full linear effects of proportional ancestries and group identities into their associated coefficients. To understand why such an orthogonal rotation might be useful, consider the hypothetical case of an admixture regression model of Body Mass Index (BMI) in which waist measurement is one of the explanatory variables. Waist measurement has such strong explanatory power for BMI that its presence in an admixture regression model like (15) will diminish the direct explanatory power of proportional ancestries and group identities; their total impact will be partly hidden within the waist measurement variable. This can be remedied by orthogonalizing the waist measurement variable with respect to the proportional ancestry and group identity variables before estimating the admixture regression, as explained next.
Suppose that variable x1 in (15) has strong explanatory power for s and substantial correlation with proportional ancestry and/or group identity variables, and therefore the analyst wishes to orthogonalize it with respect to Gij and Aih; j = 2, k; h = 2, m. In a first step, the analyst can perform a simple least squares regression decomposition of x1 into the component linearly explained by these variables, and the residual, orthogonal component $\tilde{x}_1$:

$$x_{i1} = c_0 + \sum_{j=2}^{k} c_{Gj} G_{ij} + \sum_{h=2}^{m} c_{Ah} A_{ih} + \tilde{x}_{i1} \qquad (16)$$

where $c_0$, $c_{Gj}$ and $c_{Ah}$ are the first-step least squares coefficients. Since all the explanatory variables are deterministic (that is, conditionally fixed variables rather than random variables in the regression model), this orthogonalization step (16) is interpreted as a matrix transformation of fixed vectors and does not alter any statistical assumptions of the main regression model. It merely serves to linearly rotate the deterministic explanatory variables used in the actual, second-stage, admixture regression. Replacing x1 with $\tilde{x}_1$ in (15) changes the interpretation of the coefficients bGj and bAh, j = 2, k; h = 2, m since they now include the Gij and Aih related explanatory power from x1. An illustrative example will be provided in Section 4 below.
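The effect of the orthogonalization step can be seen in a small simulation. This is a sketch under invented assumptions: one group dummy, one ancestry proportion, a single added variable `x1`, and made-up coefficients. It compares the second-stage regression using raw `x1` with the one using the first-step residual.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400

# Hypothetical core regressors: intercept, one group dummy, one ancestry proportion.
core = np.column_stack([np.ones(n), rng.integers(0, 2, n), rng.uniform(0, 1, n)])
# Added variable x1, constructed to be correlated with the core regressors.
x1 = core @ np.array([0.5, 1.0, -2.0]) + rng.normal(size=n)
s = core @ np.array([0.1, 0.3, 0.6]) + 0.8 * x1 + rng.normal(size=n)

def ols(X, y):
    """Least squares coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Step 1: regress x1 on the core regressors and keep the residual.
x1_tilde = x1 - core @ ols(core, x1)

# Step 2: regression with raw x1 versus orthogonalized x1.
b_raw = ols(np.column_stack([core, x1]), s)
b_orth = ols(np.column_stack([core, x1_tilde]), s)
b_core_only = ols(core, s)

print("coefficient on x1, raw vs orthogonalized:", b_raw[-1], b_orth[-1])
print("core coefficients with raw x1:          ", b_raw[:3])
print("core coefficients with orthogonalized x1:", b_orth[:3])
```

In this sketch the coefficient on the added variable is the same in both versions, while the core coefficients in the orthogonalized version coincide with those from a regression that omits `x1` altogether. That is the sense in which orthogonalization folds the group- and ancestry-related explanatory power of the added variable back into the core coefficients.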
3.2 A semiparametric extension of the admixture regression model
The linear dependence of the trait on admixture proportions in our regression model is in part an artifact of the assumption of a linear polygenic index (1). It is possible to weaken this linearity assumption using nonparametric regression methods. We replace the restrictive assumption of a linear polygenic index (1) with a very general description of genetic potential as a function of the full vector of genetic variants:

$$p_i = p(\mathrm{SNP}_{i\cdot})$$

and instead of linearity as in (1) only require smoothness conditions on the conditional expectation of p(·) as a function of the ancestral proportions vector, as delineated below.

As in earlier subsections, we consider pi as a stochastic function of the ancestral proportions vector Ai·, but now without imposing the strict linearity (4) arising from the linear polygenic index assumption:

$$E[\, p_i \mid A_{i\cdot} \,] = g(A_{i\cdot})$$

where g(·) is an unknown smooth function. Define the unexplained component of pi as before:

$$\eta_i = p_i - g(A_{i\cdot})$$

and as before we assume that $\eta_i$ is normally distributed with mean zero and independent of Ai· and Gi·. We impose exactly the same assumptions on ei as in Subsection 2.3, giving:

$$s_i = g(A_{i\cdot}) + \sum_{j=2}^{k} b_{Gj} G_{ij} + \varepsilon_i \qquad (17)$$

where the intercept is absorbed into g(·) and $\varepsilon_i$ is normally distributed with mean zero and variance $\sigma_\varepsilon^2$ and independent of Ai· and Gi·. This equation (17) is a partially linear nonparametric regression model, see, e.g. Li and Racine (2007). This model can be consistently estimated using the three-step procedure of Robinson (1988). We will impose Condition 7.1 from Li and Racine (2007) in order to justify this procedure within our framework (see the Technical Appendix for details).
For the case m > 2 the general specification (17) suffers from the curse of dimensionality and is unlikely to be estimable on moderate-sized datasets. A more restrictive specification is needed to give the model sufficient parsimony for estimation. One reasonable specification choice is to restrict the nonlinearity in the impact of ancestries on the trait to a single ancestral category, which we assume is ancestry category 2, giving rise to the specification:

$$s_i = g(A_{i2}) + \sum_{j=2}^{k} b_{Gj} G_{ij} + \sum_{h=3}^{m} b_{Ah} A_{ih} + \varepsilon_i \qquad (18)$$

where g(·) is now an unknown smooth function of the single ancestry proportion Ai2, and we will now rely on this more restrictive specification throughout the remainder of this subsection.
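We assume that the unconditional density Pr(A2) is continuous and strictly positive everywhere on the [0,1] interval. Let $\hat{f}(a)$ denote the nonparametrically estimated unconditional density of Ai2:

$$\hat{f}(a) = \frac{1}{n\phi} \sum_{i=1}^{n} k\!\left(\frac{A_{i2} - a}{\phi}\right)$$

where k(•) is a kernel weighting function. In our empirical application in Section 4 we use the Gaussian kernel weighting function,

$$k\!\left(\frac{A_{i2} - a}{\phi}\right) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{(A_{i2} - a)^2}{2\phi^2}\right)$$

where ϕ is the chosen bandwidth.

Let X denote the matrix of linear-component explanatory variables, with ith row $X_{i\cdot}$. In the first step of the Robinson procedure, the conditional means of the dependent variable and linear-component explanatory variables are estimated nonparametrically as functions of the nonparametric-component explanatory variable, Ai2:

$$E[\, s_i \mid A_{i2} = a \,] \quad \text{and} \quad E[\, X_{i\cdot} \mid A_{i2} = a \,]$$

that is:

$$\hat{s}(a) = \frac{1}{n\phi \hat{f}(a)} \sum_{i=1}^{n} s_i \, k\!\left(\frac{A_{i2} - a}{\phi}\right) \quad \text{and} \quad \hat{X}(a) = \frac{1}{n\phi \hat{f}(a)} \sum_{i=1}^{n} X_{i\cdot} \, k\!\left(\frac{A_{i2} - a}{\phi}\right)$$

In the second step, the linear parameters of the model (18) are estimated by ordinary least squares, replacing the dependent variable and linear-component explanatory variables with the deviations from their conditional mean functions:

$$\hat{b} = (\tilde{X}'\tilde{X})^{-1} \tilde{X}'\tilde{s}$$

where $\tilde{s}_i = s_i - \hat{s}(A_{i2})$ and $\tilde{X}_{i\cdot} = X_{i\cdot} - \hat{X}(A_{i2})$. Note that $\hat{b}$ is a (k + m – 3)-vector and X is an n × (k + m – 3)-matrix where the index first runs from 2 to k over j and then from 3 to m over h.

In the third step, the nonparametric component of the model is estimated by subtracting the predicted linear component from both sides of (18) and then applying standard nonparametric regression:

$$s_i - X_{i\cdot}\hat{b} = g(A_{i2}) + X_{i\cdot}(b - \hat{b}) + \varepsilon_i$$

and then:

$$\hat{g}(a) = \frac{1}{n\phi \hat{f}(a)} \sum_{i=1}^{n} \left(s_i - X_{i\cdot}\hat{b}\right) k\!\left(\frac{A_{i2} - a}{\phi}\right)$$

where $\hat{f}(a)$ is the kernel density estimate defined above.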
The partially linear nonparametric approach to admixture regression is more empirically challenging than the linear specification. Proper implementation of the technique involves a tradeoff between parsimony, the generality of the specification used, and the distributional features of the available data. An example of (18) will be estimated in Section 4 below.
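The three steps of the Robinson procedure can be sketched in a few lines. This is a simplified illustration and not the np package routine used in the empirical section: it uses a single fixed bandwidth rather than cross-validated bandwidths, does not use leave-one-out kernel means, and runs on simulated data with an invented nonlinear function. The names `kernel_mean` and `robinson` are hypothetical.

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def kernel_mean(a_eval, a_obs, values, bandwidth):
    """Kernel-weighted estimate of E[values | A = a] at each point in a_eval."""
    a_eval = np.atleast_1d(np.asarray(a_eval, dtype=float))
    w = gaussian_kernel((a_eval[:, None] - a_obs[None, :]) / bandwidth)
    w = w / w.sum(axis=1, keepdims=True)   # normalise weights per evaluation point
    return w @ values

def robinson(s, X, a, bandwidth):
    """Three-step estimator for s = g(a) + X b + error."""
    # Step 1: conditional means of s and of each column of X given a.
    s_tilde = s - kernel_mean(a, a, s, bandwidth)
    X_tilde = X - kernel_mean(a, a, X, bandwidth)
    # Step 2: OLS on the deviations from the conditional means.
    b_hat = np.linalg.lstsq(X_tilde, s_tilde, rcond=None)[0]
    # Step 3: nonparametric regression of s - X b_hat on a.
    def g_hat(a_eval):
        return kernel_mean(a_eval, a, s - X @ b_hat, bandwidth)
    return b_hat, g_hat

# Simulated data (all values invented for illustration).
rng = np.random.default_rng(2)
n = 800
a = rng.uniform(0, 1, n)                       # nonparametric-component variable
X = np.column_stack([rng.integers(0, 2, n),    # hypothetical group dummy
                     rng.normal(size=n) + a])  # regressor correlated with a
b_true = np.array([0.5, -0.3])
s = np.sin(2 * np.pi * a) + X @ b_true + 0.1 * rng.normal(size=n)

b_hat, g_hat = robinson(s, X, a, bandwidth=0.05)
print(b_hat)                      # estimates of the linear coefficients
print(g_hat([0.25, 0.5, 0.75]))   # estimated nonparametric component
```

The kernel estimates are least reliable near the ends of the [0,1] interval, where observations lie on only one side of the evaluation point, which is one reason the highlighted ancestry variable needs good coverage of the whole range.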
4 Empirical Application
In this section, we illustrate the techniques by performing an admixture regression analysis of neuropsychological performance from the Adolescent Brain Cognitive Development (ABCD) database. The ABCD study is the largest long-term study of brain development and child health in the United States, testing 11,000 children ages 9-10 at 21 testing sites; see Karcher and Barch (2020) for an overview. Our sample consists of age and gender-adjusted scores and genotyped DNA samples of the 9972 children who met our sample selection criteria, along with questionnaire responses of their parent(s)/guardian(s). The dependent variable in our model is the composite neuropsychological performance score based on the NIH Toolbox® (NIHTBX) neurocognitive battery provided in the ABCD database; this consists of tasks measuring attention, episodic memory, language abilities, executive function, processing speed, and working memory. Our core explanatory variables are seven SIRE variables, White, Black, Hispanic, Native American, East Asian, South Asian, and Other (and including multiple SIRE choices from among these) and five genetic ancestry proportions of European, African, Amerindian, East Asian and South Asian background obtained from the genotyped DNA samples. Children whose parent(s)/guardian(s) identified the child as belonging to Pacific Islander racial groups were excluded from our analyses owing to a lack of corresponding ancestry category in our chosen five categories. The ABCD Version 3 database provides 516,598 genotyped SNP variants for each individual’s DNA sample. After quality control, filtering, and pruning we were left with 99,642 SNP variants to determine the five ancestry proportions, employing the Admixture 1.3 software package (Alexander et al. 2015). See the Supplemental Materials for a more detailed description of the ABCD database, our sample selection procedure, and the construction of the variables that we use.
Table 1 shows means and standard deviations of the regression dependent variable on data subsets sorted by SIRE choice. On the full sample, by construction, the dependent variable has a mean of zero and standard deviation of one. There is considerable dispersion in the subsample means sorted by SIRE; for example, the means differ by 1.02 standard deviations (using the full-sample standard deviation for simplicity) between two of the largest SIRE categories shown, White-only SIRE and Black-only SIRE. The considerable variation in means for SIRE-based subsamples provides an initial justification for performing admixture regression analysis. This is a table of descriptive statistics; the standard errors shown are not appropriate for formal hypothesis testing since there is no adjustment for potential site-linked and family-linked correlations, particularly relevant in the case of the smaller subsample categories.
Table 2 displays empirical results from three specifications of the admixture regression methodology. Recall that one SIRE variable and one ancestry proportion variable must be left out as an identification condition of the admixture regression: we leave out the White SIRE variable and the European ancestry proportion variable. Model 1 uses a linear regression specification and singleton SIRE categories for the group-identity variables G; individuals who choose multiple SIRE categories have G exposures equally divided between the chosen SIRE categories as in (10). Three of the four ancestral proportion variables and one of the six group-identity variables have statistically significant coefficients. Model 2 adds a selected set of multiple-SIRE composite categories to the G specification. We include the seven two-category choices with the largest number of observations in our sample. Individuals with one of these two-category choices have unit exposure to the associated explanatory variable, and no exposure to the weighted single-SIRE variables (see equation 11 above). The same three of four ancestral proportion variables as in Model 1 are significant in Model 2, with similar coefficients to Model 1. None of the single-SIRE group identity variables is significant. Five of the seven selected two-SIRE group identity variables have significantly different coefficients from that implied by equal weightings of the component single-category coefficients. One of these (Hispanic-Other) has a statistically significant coefficient; the other four are not significantly different from zero, but are significantly different from the value implied by the composite single-category coefficients. Random effects are included in all models except Model 3 to capture any common variation associated with the 22 individual data collection sites in the ABCD study or associated with those families having multiple individuals in the sample.
We use the lmer maximum likelihood mixed effects model estimation routine from the lme4 package in R, see Bates et al. (2015), for all models except Model 3. See Nakagawa and Schielzeth (2013) for the definition and interpretation of conditional and marginal R² in a linear mixed effects model.
Table 2: Linear Specifications with and without Composite Groups and a Partially Linear Semiparametric Specification
Model 3 implements a partially linear nonparametric specification. This specification requires that the highlighted ancestry proportion (whose impact is estimated nonparametrically) has observations throughout the [0,1] range. Table 3 gives the number of sample observations in decile bins of percent ancestry for each of the five genetic ancestry categories. We use African proportional ancestry as the highlighted variable since it fulfils the requirement for observations throughout the [0,1] interval and therefore partially linear nonparametric estimation is feasible. Figure 1 shows the probability density of African ancestry for the full sample population; Figure 2 shows the density restricted to those individuals having measured African ancestry greater than 0.5%, which provides greater detail in the graph by excluding observations with near-zero ancestry. Interestingly, this density has three local peaks, at approximately 5%, 40% and 80% African ancestry.
Partially linear semiparametric Model 3 (18) is estimated using the npplreg routine in the R package np written and maintained by Hayfield and Racine (2020). We use the simple average SIRE specification of G as in Model 1. We use the Gaussian kernel throughout, and all bandwidths are chosen by iterated least-squares cross-validation. The linear coefficient estimates in Model 3 do not differ notably from those in Model 1. Figure 3 displays the nonparametric estimate of the impact of African ancestry on the performance variable along with the corresponding linear impact estimate from Model 1, that is, the estimated nonparametric function $\hat{g}(A_2)$ and the linear term $\hat{b}_{A2} A_2$ for A2 ∈ [0,1]. There is some graphical evidence for an uptick in the nonlinear gradient for ancestry proportions above 90%. We now briefly examine this further.
Model 3 does not capture the efficiency gain and test statistic bias reduction from the mixed effects modeling used in the estimation of the other models. Figure 3 of Model 3 is estimated in the second stage of a two-stage semiparametric estimation process and this weakens its empirical reliability. To examine more carefully the graphical pattern observed in Figure 3, but with single-stage estimation and the advantage of mixed effects modeling, we estimate a piecewise linear specification for Ai2 ≥ 0.9. This was chosen in order to mimic the observed nonlinear uptick seen in Figure 3 within a linear regression functional form. Recall that African ancestry proportion is ancestry variable 2, giving the formulation:
where D[•] is a zero-one dummy variable and bkink is the added coefficient. The results are shown as Models 4 and 5 in Table 4. In Model 4 we use the simple average SIRE specification of G as in Model 1; Model 5 adds the same seven two-SIRE combination groups as in Model 2. The coefficient bkink is significantly positive in one of the two models; the significance of this finding must be treated with caution since the particular kink specification (20) is based on examination of Figure 3 using the same data.
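As an illustration of how a kink regressor of this general type can be built, the following Python sketch constructs one added column from the African ancestry proportion. It assumes the added term is the dummy D[A ≥ 0.9] multiplied by the excess (A − 0.9), so that the fitted line stays continuous at the kink; this is an assumption made for the example, not a restatement of specification (20), and the name `kink_term` is hypothetical.

```python
def kink_term(ancestry, threshold=0.9):
    """Return D[ancestry >= threshold] * (ancestry - threshold).

    Assumed functional form for illustration only: zero below the
    threshold, then rising linearly with ancestry above it.
    """
    dummy = 1.0 if ancestry >= threshold else 0.0
    return dummy * (ancestry - threshold)


# Toy African ancestry proportions (not ABCD data).
african = [0.00, 0.45, 0.90, 0.95, 1.00]
kink = [kink_term(a) for a in african]
print(kink)
```

Under this assumed form, the coefficient on the added column would measure the change in slope above the threshold relative to the slope below it.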
Table 5 adds two new variables, US born child and Social-Economic Status (SES), to the admixture regression model. US born child equals one if the child was born in the USA and zero if born elsewhere. SES is a factor-analytic composite of underlying variables from the ABCD database including neighborhood SES, subjective SES as determined from a set of questionnaire answers by the parent(s)/guardian(s) of the child on parental/guardian marital status, completed level of parental/guardian education, reported neighborhood safety, and parental/guardian employment. See the Supplemental Materials for more detailed discussion. Models 6 and 7 are identical to Models 4 and 5 (respectively) from Table 4, except for the addition of these two variables. As discussed in Section 3 above, including additional explanatory variables complicates the interpretation of an admixture regression model in terms of the implied decomposition of trait variation into linear components linked to group identities and components linked to genetic ancestries. The SES variable covaries strongly with both genetic and environmental components of neuropsychological performance scores. To retain the standard interpretability of the admixture regression it is important to orthogonalize SES with respect to the group identity and ancestry variables before running the regression. For completeness, Models 6 and 7 are shown with and without the orthogonalization of SES (versions a and b of each model). If the purpose of the estimation is to identify the total impact of SES on the trait, the regression with raw SES is more appropriate (version a). For admixture analysis intended to capture the total effects of group identity and genetic ancestry on the trait, orthogonalized SES is more appropriate (version b).
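The orthogonalization step can be sketched as replacing SES with its residual from an ordinary least squares regression on the core variables. The Python example below is a minimal illustration under that assumption, using one SIRE column and one ancestry column of toy values; the function names `solve` and `residualize` and all data are hypothetical, and the actual estimation in the paper uses the full set of group identity and ancestry variables.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def residualize(y, columns):
    """Residuals from an OLS regression of y on an intercept and the columns."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[p] * r[q] for r in rows) for q in range(k)] for p in range(k)]
    xty = [sum(r[p] * yi for r, yi in zip(rows, y)) for p in range(k)]
    beta = solve(xtx, xty)
    return [yi - sum(bj * rj for bj, rj in zip(beta, r))
            for r, yi in zip(rows, y)]


# Toy values (not ABCD data): one SIRE share, one ancestry proportion, raw SES.
sire_black = [0.0, 0.0, 1.0, 1.0, 0.5, 0.0]
african = [0.02, 0.10, 0.80, 0.90, 0.45, 0.05]
ses = [0.8, 0.3, -0.9, -0.4, 0.1, 0.5]

ses_orth = residualize(ses, [sire_black, african])
```

By construction the residual `ses_orth` is uncorrelated in sample with each core variable, so adding it to the regression leaves the least squares coefficients on those variables unchanged.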
5 Conclusion
Many behavioural traits covary strongly with racial/ethnic self-identities, but it is often ambiguous whether this covariance reflects environmental causes associated with racial/ethnic identity groups or reflects underlying genetic similarity among group members arising from shared geographic ancestry. Admixture regression relies on the natural experiment of recent genetic admixture of previously geographically-isolated ancestral groups to measure the explanatory power arising from racial/ethnic group identities and that arising from ancestry-based similarities of genetic background. The admixture regression methodology, in various formulations, has been applied to a wide range of medical and behavioural traits including asthma, obesity, type 2 diabetes, hypertension, neuropsychological performance, and sleep depth.
This paper provides a statistical framework for admixture regression based on the linear polygenic index model of behavioural genetics, and develops refinements and extensions of the methodology within this framework. We provide a simple new test procedure for determining whether multiple-SIRE categories have independent explanatory power not captured by the individual component categories. We consider additional explanatory variables in the admixture regression and their interpretation with and without orthogonalization with respect to core variables. We weaken the linearity assumption and develop a partially linear semiparametric regression specification. We illustrate our methodology using neuropsychological performance test data from the Adolescent Brain Cognitive Development database, but the techniques have broader applicability.
Supplementary material
1. Materials and Methods
1.1 Dataset
The Adolescent Brain Cognitive Development Study (ABCD) is a collaborative longitudinal project between 21 sites across the US. Its goal is to further research into the psychological and neurobiological basis of development. At baseline, around 11,000 9-10 year old children were sampled, using a probabilistic sampling strategy, from public and private elementary schools and through non-school-based community outreach between 2016 and 2018, with the goal of creating a broadly representative sample of US children of this age. Children who were not fluent in English (or whose parents were not fluent in either English or Spanish) were excluded, along with those with severe medical, neurological, or psychiatric conditions. Informed consent was provided by parents.
The baseline ABCD 3.0 data release was used. For this analysis, we excluded individuals who did not have NIH Toolbox® results, who did not have admixture data, or who were identified as being a Pacific Islander. This left 9972 individuals.
1.2 Variables
1.2.1. Admixture
Subjects were genotyped using Illumina XX, with 516,598 variants directly genotyped and surviving the quality control done by the data provider. We used the 3.0 release of the dataset, which also includes an edition with imputed variants using TOPMED and Eagle 2.4. Because we had very few samples from Pacific Islanders, we excluded these from further analysis to simplify the reference populations needed (n = 69). All our work was done on build 38. Files in hg19/GRCh37 coordinates were lifted to hg38 using liftOver (https://github.com/sritchie73/liftOverPlink) and the GRC chain file at ftp://ftp.ensembl.org/pub/assembly_mapping/homo_sapiens/ (GRCh37_to_GRCh38.chain.gz).
Before global admixture estimation, we applied quality control using plink 1.9. We used only directly genotyped, bi-allelic, autosomal SNP variants (494,433 and 493,196 before and after lifting, respectively). We pruned variants for linkage disequilibrium at the 0.1 R² level using plink 1.9 (--indep-pairwise 10000 100 0.1), as recommended in the admixture documentation (https://vcru.wisc.edu/simonlab/bioinformatics/programs/admixture/admixture-manual.pdf). This variant filtering was done in the reference population dataset to reduce bias from sample representativeness. After pruning, we were left with 99,642 variants. To ensure a reasonable balance in the estimation dataset, we merged the target samples from ABCD with reference population data for the populations of interest. We desired a k=5 solution (European, Amerindian, African, East Asian, and South Asian), so we merged with relevant samples from 1000 Genomes and from the HGDP. The following populations were excluded: Adygei, Balochi, Bedouin, Bougainville, Brahui, Burusho, Druze, Hazara, Makrani, Mozabite, Palestinian, Papuan, San, Sindhi, Uygur, Yakut. These reference populations were excluded because they were overly admixed or because, in the case of Melanesians and San, the individuals in the ABCD sample lacked significant portions of these ancestries.
Because the estimation sample would still be very skewed towards European ancestry using this joint sample, we used repeated subsetting to achieve balance. Specifically, we split the ABCD target samples into 50 random subsets, each with about 222 persons, and merged them one at a time with the reference data, followed by running admixture k=5 on each merged subset. We verified that these subsets produced stable results by examining the stability of the estimates for the reference samples. There was very little variation across runs, e.g. for the reference sample with the most variance (European, NA12342), the mean estimate was 98.3% with SD=0.17% across the 50 runs. Since Admixture does not label the resulting clusters, we used 5 reference samples to index the populations so the data would be merged correctly. In no case did this produce any inconsistencies.
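The repeated subsetting step can be sketched in Python as a random partition of the target sample identifiers. This is an illustrative sketch only: the identifier format, the seed, the total of 11,100 identifiers (chosen so that 50 subsets hold exactly 222 each) and the function name `split_into_subsets` are assumptions, and the merging with reference data and the admixture runs themselves are not shown.

```python
import random


def split_into_subsets(ids, n_subsets=50, seed=1):
    """Shuffle the ids and deal them into n_subsets groups of near-equal size."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    # Taking every n_subsets-th element gives group sizes differing by at most one.
    return [shuffled[i::n_subsets] for i in range(n_subsets)]


# Hypothetical identifiers standing in for the ABCD target samples.
target_ids = [f"ABCD_{i:05d}" for i in range(11100)]
subsets = split_into_subsets(target_ids)

# Each subset would then be merged with the same reference panel
# before a separate k=5 admixture run.
print(len(subsets), len(subsets[0]))
```

Because every subset is combined with the same reference panel, the reference-sample estimates can be compared across runs as a stability check, as described above.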
1.2.2. Neuropsychological Performance
The NIH Toolbox® (NIHTBX) neuropsychological battery was designed to measure a broad range of cognitive abilities. It consists of seven tasks which index attention (Flanker Inhibitory Control and Attention Task), episodic memory (Picture Sequence Memory Task), language abilities (Picture Vocabulary Task & Oral Reading Recognition Task), executive function (Dimensional Change Card Sort Task & Flanker Inhibitory Control and Attention Task), processing speed (Pattern Comparison Processing Speed Task), and working memory (List Sorting Working Memory Task) (Akshoomoff et al., 2014; Weintraub et al., 2013). NIHTBX was normed for samples between ages 3 and 85; tasks correlate highly with comparable ability assessments (Weintraub et al., 2013). Moreover, this battery has been shown to be measurement invariant across American ethnic groups (Lasker, Pesta, Fuerst, & Kirkegaard, 2019).
Age-corrected composite scores, based on the seven tasks, were provided by ABCD. We regressed out sex from these age-corrected composite scores. The residuals were then standardized.
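A minimal Python sketch of this two-step preparation follows. With sex as a single binary regressor plus an intercept, regressing it out is equivalent to subtracting each group's mean, which is what the sketch does. The toy scores, the 0/1 coding of sex, the use of the population standard deviation, and the function names are assumptions made for illustration.

```python
from statistics import mean, pstdev


def residualize_on_binary(scores, group):
    """Subtract the mean of each group, i.e. OLS residuals on a binary regressor."""
    group_means = {
        g: mean([s for s, gg in zip(scores, group) if gg == g])
        for g in set(group)
    }
    return [s - group_means[g] for s, g in zip(scores, group)]


def standardize(values):
    """Rescale to mean zero and unit (population) standard deviation."""
    m, sd = mean(values), pstdev(values)
    return [(v - m) / sd for v in values]


# Toy age-corrected composite scores and a 0/1 sex indicator (not ABCD data).
scores = [95.0, 105.0, 110.0, 90.0, 100.0, 120.0]
sex = [0, 1, 0, 1, 0, 1]

z = standardize(residualize_on_binary(scores, sex))
```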
1.2.3. Self-identified Race and Ethnicity
Self-identified race was based on parental responses to 18 questions asking about the child’s race (“What race do you consider the child to be? Please check all that apply”). From these questions, six broad racial categories were created: European (“White”), African (“Black/African American”), Native American (“American/Native American” and “Alaska Native”), South Asian (“Asian Indian”), East Asian (“Chinese,” “Filipino,” “Japanese,” “Korean,” “Vietnamese,” and “Other Asian”), and Other (“Other race,” “Refused to answer,” “Don’t know”). The Other Asian group (N = 66) was classified as “East Asian” because the Asian ancestry component was predominantly East Asian (44%), not South Asian (7%); the remaining ancestry was predominantly European (40%). The Pacific Islander groups (“Native Hawaiian,” “Guamanian,” “Samoan,” and “Other Pacific Islander”) were excluded as we did not have a corresponding admixture component. Self-identified ethnicity was based on parental responses to one question asking about Latin American ethnicity (“Do you consider yourself Hispanic/Latino/Latina?”). From this we created an additional ethnic category.
Descriptive statistics for the SIRE groups are shown in Table S1. Statistics are reported for single ethnic categories (i.e., individuals reported as being only White, Black, East Asian, Native American, or Other, with no combinations such as Hispanic & White), for Hispanics, for the seven most common two-category combinations (i.e., Hispanic & White, Hispanic & Black, Hispanic & Other, non-Hispanic Black & White, non-Hispanic East Asian & White, non-Hispanic Native American & White, and non-Hispanic South Asian & White), and finally for all remaining groups combined.
Table S1. Descriptive Statistics for the SIRE Groups.
The racial and ethnic variables were then recoded to create interval categories for which individuals are assigned a percentage of each SIRE category based on the number of responses chosen (Liebler & Halpern-Manners, 2008; Kirkegaard et al., 2019). By this coding, if someone was marked as White and Hispanic, they were assigned scores of .5 for white and .5 for Hispanic and 0 for the other 5 categories. The correlations between these interval scores and genetic ancestry components are shown in Table S2 below (N = 9972). These associations are similar to those found by others (for example: Guo et al. 2014); self-identified race generally corresponds with genetic ancestry.
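The interval coding can be sketched in Python as follows. The sketch assumes each selected category receives an equal share and that at least one category is selected; the category labels and the function name `interval_code` are illustrative choices rather than the variable names used in the ABCD data.

```python
# Six racial categories plus the Hispanic ethnic category (labels are illustrative).
CATEGORIES = [
    "White", "Black", "Native American", "South Asian",
    "East Asian", "Other", "Hispanic",
]


def interval_code(selected):
    """Assign each selected category an equal share and all others zero."""
    share = 1.0 / len(selected)
    return {c: (share if c in selected else 0.0) for c in CATEGORIES}


# Example from the text: a child marked as White and Hispanic.
codes = interval_code({"White", "Hispanic"})
print(codes)
```

A child with a single reported category receives a score of 1 for that category, so the seven scores always sum to one.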
Table S2. Correlations between Interval Coded SIRE and Genetic Ancestry.
1.2.4. Region of Birth (US Born)
Region of birth was based on the parental response to the question, “In which country was the child born?”. The response “United States” was recoded as 1 and all other responses were recoded as 0.
1.2.5. Socioeconomic Status
Socioeconomic status was based on seven indicators: financial adversity, area deprivation index, neighborhood safety protocol, parental education, parental income, parental marital status, and parental employment status. These are detailed below:
1.2.5.1. Financial Adversity (Reverse Coded)
Parents answered a seven item Financial Adversity Questionnaire (PRFQ). They were asked: “In the past 12 months, has there been a time when you and your immediate family experienced any of the following:
“Needed food but could not afford to buy it or could not afford to go out to get it?”,
“Were without telephone service because you could not afford it?”
“Did not pay the full amount of the rent or mortgage because you could not afford it?”,
“Were evicted from your home for not paying the rent or mortgage?”,
“Had services turned off by the gas or electric company, or the oil company would not deliver oil because payments were not made?”,
“Had someone who needed to see a doctor or go to the hospital but did not go because you could not afford it?”, and
“Had someone who needed a dentist but could not go because you could not afford it?”
For each of the seven items they answered “yes” (1) or “no” (0). We summed responses. Thus the maximum was 7 and the minimum was 0.
This variable was reverse coded, so that higher scores indicated less financial adversity, and then standardized.
1.2.5.2. Area Deprivation Index (ADI) (Reverse Coded)
Parents completed a residential history questionnaire. They provided the residential addresses and the number of full years they lived at each residence. For each address an Area Deprivation Index (ADI) was computed by ABCD and the national percentile of the area’s socioeconomic status was given. ADI was based on the following variables:
“Percentage of occupied housing units without complete plumbing (log)”
“Percentage of occupied housing units without a telephone”
“Percentage of occupied housing units without a motor vehicle”
“Percentage of single”
“Percentage of population below 138% of the poverty threshold”
“Percentage of families below the poverty level”
“Percentage of civilian labor force population aged >=16 y unemployed (unemployment rate)”
“Percentage of occupied housing units with >1 person per room (crowding)”
“Percentage of owner”
“Median monthly mortgage”
“Median gross rent”
“Median home value”
“Income disparity defined by Singh as the log of 100 × ratio of the number of households with <10000 annual income to the number of households with >50000 annual income”
“Median family income”
“Percentage of population aged >=25 y with at least a high school diploma”
“Percentage of population aged >=25 y with <9 y of education”
We weighted the ADI for the last three residences by the number of years at each residence. Before weighting, we recoded zero full years of residence as one-half of a year, so as to give weight to time spent at a residence that was less than one year. The weighted ADI scores were then reverse coded, so that higher values indicated higher socioeconomic neighborhoods, and then standardized.
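The years-weighted average can be illustrated with a short Python sketch. The input layout (a list of ADI percentile and full-years pairs for up to three residences), the toy values and the function name `weighted_adi` are assumptions; the subsequent reverse coding and standardization are omitted.

```python
def weighted_adi(residences):
    """Years-weighted mean ADI over a child's most recent residences.

    residences: list of (adi_percentile, full_years) pairs.
    Zero full years is recoded to half a year before weighting.
    """
    weights = [years if years > 0 else 0.5 for _, years in residences]
    total = sum(weights)
    return sum(adi * w for (adi, _), w in zip(residences, weights)) / total


# Toy residential history (not ABCD data): under one year, three years, one year.
child = [(40.0, 0), (60.0, 3), (80.0, 1)]
score = weighted_adi(child)
print(score)
```

In this toy case the weights are 0.5, 3 and 1, so the weighted ADI is 280 / 4.5, or about 62.2.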
1.2.5.3. Neighborhood Safety Protocol
Parents were asked three Likert scale (1 = strongly disagree; 5 = strongly agree) questions about neighborhood safety: “I feel safe walking in my neighborhood, day or night,” “Violence is not a problem in my neighborhood,” and “My neighborhood is safe from crime.” We used the precomputed summary scores, for which the three item scores were summed and then divided by three.
1.2.5.4. Education
Parents were asked, “What is the highest grade or level of school you have completed or the highest degree you have received.” To create an interval variable, we recoded parental education as 0 to 18: Never attended/Kindergarten only = 0, 1st grade = 1, 2nd grade = 2, 3rd grade = 3, 4th grade = 4, 5th grade = 5, 6th grade = 6, 7th grade = 7, 8th grade = 8, 9th grade = 9, 10th grade = 10, 11th grade = 11, 12th grade = 12, High school graduate = 12, GED or equivalent Diploma General = 12, Associate degree: Occupational Program = 14, Associate degree: Academic Program = 14, Bachelor’s degree = 16, Master’s degree = 18, Professional school = 18, Doctoral degree = 18. We standardized the education scores for each parent, averaged them across both parents, and standardized the averaged scores.
1.2.5.5. Income
Family income was an interval variable which reflected the parents’ total combined family income in the past 12 months. The variable was recoded as follows: 1.00 = less than $5000 (recode: 4,500); 2.00 = $5000 to 11,999 (recode: 5,000); 3.00 = $12,000 to 15,999 (recode: 12,000); 4.00 = $16,000 to 24,999 (recode: 16,000); 5.00 = $25,000 to 34,999 (recode: 25,000); 6.00 = $35,000 to 49,999 (recode: 35,000); 7.00 = $50,000 to 74,999 (recode: 50,000); 8.00 = $75,000 to 99,999 (recode: 75,000); 9.00 = $100,000 to 199,999 (recode: 100,000); 10.00 = $200,000 and greater (recode: 200,000).
1.2.5.6. Marital Status
Parental marital status was coded as 1 if married and 0 for any other arrangement.
1.2.5.7. Employment Status
Parental employment was coded as 1 if at least one parent was currently working either full or part time and 0 for all other cases.
1.2.5.8. General SES
Missing data for the seven economic indicators were imputed using the mice package (df, m=5, maxit = 50, method = ‘pmm’, seed = 500). Descriptive statistics for the imputed SES indicators are provided in Table S3, while the correlation matrix for the imputed variables (N = 9972), along with neuropsychological performance, is shown in Table S4.
Table S3. Descriptive Statistics for the SES Indicators.
Table S4. Correlation Matrix for the SES Indicators, Genetic Ancestry, and Neuropsychological Performance (N = 9972 in all cases).
We then submitted the seven SES indicators to Principal Component Analysis (PCA). For this, we used the R package PCAmixdata, which handles mixed categorical and continuous data (Chavent, Kuentz-Simonet, & Saracco, 2014). The first unrotated component explained 40% of the variance. The PCA_1 loadings for the seven SES indicators were as follows: financial adversity (.250), area deprivation index (.153), neighborhood safety protocol (.256), parental education (.504), parental income (.658), parental marital status (.270), and parental employment status (.183). The vector of PCA_1 loadings correlated with the vector of SES indicator effects on Neuropsychological Performance at .84 (N = 7). This indicates that the better measures of general SES have higher cognitive loadings.
PCA_1 scores correlated with Neuropsychological Performance at r = .43 in the full sample. Among non-Hispanic Whites, non-Hispanic Blacks, and Hispanics, the correlations between PCA_1 and Neuropsychological Performance were r = .26 (N = 5533), r = .34 (N = 1434), and r = .32 (N = 1869), respectively. These magnitudes of child-parental SES correlations are consistent with those previously reported (Flores-Mendoza, Ardila, Gallegos, & Reategui-Colareta, 2021; Sirin, 2005). The congruence coefficients for the SES component loadings were greater than or equal to .97 for the largest three SIRE groups (non-Hispanic Whites, non-Hispanic Blacks, and Hispanics), indicating essentially identical structures across groups.
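For readers unfamiliar with congruence coefficients, the following Python sketch computes one between two loading vectors. It assumes the coefficient used is Tucker's congruence coefficient (the uncentered cosine between the vectors), which the text does not state explicitly. The first vector holds the full-sample PCA_1 loadings reported above; the second is a hypothetical subgroup vector invented for the example.

```python
from math import sqrt


def congruence(x, y):
    """Tucker's congruence coefficient: sum(xy) / sqrt(sum(x^2) * sum(y^2))."""
    cross = sum(a * b for a, b in zip(x, y))
    return cross / sqrt(sum(a * a for a in x) * sum(b * b for b in y))


# Full-sample PCA_1 loadings as reported in the text.
full_sample = [0.250, 0.153, 0.256, 0.504, 0.658, 0.270, 0.183]
# Hypothetical subgroup loadings, for illustration only.
subgroup = [0.24, 0.17, 0.25, 0.52, 0.64, 0.28, 0.19]

phi = congruence(full_sample, subgroup)
print(round(phi, 3))
```

Unlike a Pearson correlation, this coefficient does not center the loadings, so it is sensitive to their overall level as well as their pattern.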
2. Methods
A series of regression models were run with NIHTBX as the dependent variable. The NIHTBX and socioeconomic variables were standardized (based on the subsample of 9972 retained). The ancestry variables were left unstandardized; thus the ancestry coefficients can be interpreted as the change in standardized NIHTBX score associated with a change from 0% to 100% of the given ancestry. European ancestry and White SIRE were selected as reference values and thus not included as independent variables.
For the regression analyses, following the recommendations of Heeringa and Berglund (2021), we used a three-level (site, family, individual) multi-level mixed effects model. This model was applied to the pooled twin and regular ABCD baseline sample. This specification approximates the ABCD Data Exploration and Analysis Portal (DEAP) specification (Heeringa and Berglund, 2021).
Technical Appendix
In this technical appendix we re-state condition 7.1 from (Racine and Li 2007, p. 224) in the context of our partially linear admixture regression model (18).
We assume that the (k + m – 1)-vector of observations (si, Gij, Aih), j = 2, …, k; h = 2, …, m, has an i.i.d. distribution over observations i = 1, …, n and that the conditional mean functions E[Gij | Ai2] and E[Aih | Ai2] are twice differentiable throughout the interior of the domain of A2, the closed unit interval. Let m(•) denote any of these conditional mean functions or their first or second derivative functions. As in Racine and Li, we impose the following Lipschitz-type smoothness condition on these conditional mean functions and their first and second derivatives:
$$|m(a + v) - m(a)| \le H(a)\,|v|,$$
where H(•) is some continuous function such that E[H(A2)²] is finite. The expectation of H(A2)² is over the probability distribution of A2.
We continue to assume that εi is mean-zero normally distributed with constant variance. Since Gij only takes the values of zero and one and Aih is confined to the unit interval, it necessarily follows that both have bounded fourth moments. We assume that k(•) is a bounded second-order kernel.
To formally derive the limiting distribution of the Robinson estimator, it is necessary to define a trimming parameter which ensures that the estimates are bounded away from zero. Let t denote a trimming parameter and consider the estimator described in the text but where observations such that
in (19) are dropped from the subsequent estimation steps. Let ϕ denote the kernel bandwidth for sample size n. Assume that the trimming parameter obeys the following two limiting conditions as n → ∞: nϕ²t⁴ → ∞ and nt⁻⁴ϕ⁸ → 0.
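As a worked illustration of what the two rate conditions permit, read them as nϕ²t⁴ → ∞ and nt⁻⁴ϕ⁸ → 0, and suppose, purely as an assumption for this example, that the bandwidth shrinks at the rate ϕ = c n^(−1/5) for some constant c > 0 and that the trimming parameter is t = n^(−δ) with δ > 0. Both conditions then reduce to the same restriction on δ:

```latex
\begin{align*}
n\,\phi^{2}\,t^{4} &= c^{2}\, n^{\,3/5 - 4\delta} \;\to\; \infty
  &&\Longleftrightarrow\quad \delta < \tfrac{3}{20}, \\
n\,t^{-4}\,\phi^{8} &= c^{8}\, n^{\,-3/5 + 4\delta} \;\to\; 0
  &&\Longleftrightarrow\quad \delta < \tfrac{3}{20}.
\end{align*}
```

Under this assumed bandwidth rate, the trimming parameter may therefore shrink to zero, but only more slowly than n^(−3/20).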
Under these conditions we have from (Robinson 1988) that
where the matrix X is defined in the main text of the paper above, in step two of the Robinson procedure.