Abstract
The integration of genetic data within large-scale social and health surveys provides new opportunities to test long standing theories of parental investments in children and within-family inequality. Genetic predictors, called polygenic scores, allow novel assessments of young children’s abilities that are uncontaminated by parental investments, and family-based samples allow indirect tests of whether children’s abilities are reinforced or compensated. We use over 16,000 sibling pairs from the UK Biobank to test whether the relative ranking of siblings’ polygenic scores for educational attainment is consequential for actual attainments. We find strong evidence of compensatory processes, on average, where the association between genotype and phenotype of educational attainment is reduced by over 20% for the higher-ranked sibling compared to the lower-ranked sibling. These effects are most pronounced in high socioeconomic status areas. We find no evidence that similar processes hold in the case of height or for relatives who are not full biological siblings (e.g. cousins). Our results provide a new use of polygenic scores to understand processes that generate within-family inequalities and also suggest important caveats to causal interpretations the effects of polygenic scores using siblingdifference designs.
Introduction
A large social science literature has produced theoretical and empirical support that parental actions shape patterns of within-family inequalities over the life cycle (1–3). Theoretically, the key components of the question include parental attitudes about inequalities of outcomes of their children (4, 5) and the potential of differential returns to investment based on children’s talents (i.e. technologies of skill formation)(6–8). Separating these mechanisms has proven difficult. Empirically, measurement limitations have been important bottlenecks in progress. Ideally, researchers could use measures that (a) occur early in life so they represent endowments and (b) capture endowments of children that are not related to parental behaviors and investments (no feedback effects). Typically, birth weight has been used in analyses that examine whether parents reinforce or compensate for children’s endowments (3, 9–11). However, focusing on birth weight is imperfect because it can be affected by parental behaviors and investments (12–16) and it limits the scope of analysis due to a focus on a single measurement. This scope limitation occurs both in terms of life course outcomes that can be tied to birth weight as well as failing to examine parental responses to endowments that are not associated with birthweight. Alternative measures, such as test scores (4) can be problematic because parents can shape these outcomes prior to research measurement and they may not be “early enough” to capture endowments—for example, before children acquire language skills.
Summary genetic assessments (i.e. polygenic scores, PGS) have the capacity to overcome these empirical limitations, as these measures are fixed at conception—thus, they have no feedback effects and they can capture endowments tied to early outcomes. PGS also provide the possibility of expanding the domains of analysis outside of birthweight. Studies have begun to use these measures to show associations with early childhood outcomes (17), and some research has found evidence that PGS are associated with parental responses (18, 19). However, fewer studies have incorporated these measures into analyses of within-family inequalities. Might parents use observable phenotypic downstream outcomes tied to PGS in their efforts to increase or decrease differences in later outcomes of their children? We begin this direction of empirical analysis by linking two highly predictive PGS measures (education and height) to phenotypic outcomes in adulthood in a sample of over 16,000 sibling pairs from the UK Biobank.
We propose that theoretical models of parental responses to children’s endowments can be assessed with PGS measures of siblings. Because our data do not include measures of early life, we use an indirect test of the accumulated parental responses to their children’s abilities by comparing educational attainments of the siblings with their PGS measurements. In order to provide an omnibus test of compensation vs. reinforcement, we test whether the relative ranking of PGS within sibling pairs is consequential for predicting adult outcomes. We focus on the case of education as an exemplar where parents may have the means and desire to shape inequalities in their children’s attainments (20) and the case of height, where they have neither, so that we have a negative test. We hypothesize that if parents prefer, on average, for equalizing their children’s outcomes (i.e. compensatory behaviors), the PGS will be less predictive of attainment in the sibling with a higher relative rank. If parents prefer, on average, to reinforce their children’s relative advantages, the PGS will be more predictive of attainment in the sibling with a higher relative rank. We then explore whether these patterns differ by the socioeconomic status of the families. A large literature has shown that parents in advantaged settings often show compensatory preferences and behaviors while parents in disadvantaged settings show reinforcing preferences and behaviors (11, 21, 22). Therefore, we divide the families in our data based on area-level socioeconomic status and estimate the relative ranking effects outlined above.
Results
We analyzed educational attainment in the full sample and a sibling sub-sample from respondents with European ancestry in the UK Biobank. Siblings were matched based on kinship estimates of genetic relatedness (see Methods). SI Appendix Table A1 provides descriptive statistics of the full sample and our sibling analysis sample. Our baseline results conformed with other analyses (23–25) showing that the education attainment (EA)-PGS predicts educational attainment in the UKB sample (Table 1, Column 1), that these associations are retained in the sibling sub-sample (Table 1, Column 2) and that the association is reduced by >50% when family fixed effects are included (Table 1, Column 3). Specifically, the association between a standard deviation increase in EA-PGS and EA falls from 0.91 years in Column 1 to 0.44 years in Column 3.
Sibling-difference analysis controls for influences shared by siblings growing up in the same household, including the portion of the PGS associated with genetic nurture (26, 27) that is shared among siblings. However, unshared environmental factors, including parental actions to reinforce or compensate for sibling differences in PGS are retained in the PGS associations. SI Table 2 presents average differences and correlations in sibling phenotype and genotype measures. As expected, sibling correlations in the genotype measures are r~0.5. Correlations in EA and height are ~0.3. On average siblings differ in PGS by ~0.75 SD, 4 years of schooling, and 8 cm in height.
In order to conduct an omnibus test of reinforcing vs. compensating processes in reaction to sibling differences in PGS, we add dyadic measures of relative ranking of PGS and age (see Methods) as well as an interaction between the indicator for higher relative ranking and PGS to predict EA.
Table 2 shows that inclusion of the dyadic measures is consequential. Conforming to the birth order literature, the older sibling (even accounting for age indicators (i.e. fixed effects)) attains over 0.5 years of schooling more than the younger sibling of the pair (28–30).2 The interaction between the indicator for higher ranked EA-PGS and the EA-PGS score is negative (p-value <0.09), so that the EA-PGS has a smaller association with attainment for siblings who are higher ranked within the sibship (See Figure 1). This result suggests, on average, compensatory processes within families and a concomitant reduction in within-family inequalities in educational attainment.
Table 2 Column 3 and 4 stratify the sibling analysis based on area (place of birth)-estimates of socioeconomic status. Results show that birth order effects are smaller in high SES areas, the gender gap is smaller in high SES areas, and the compensatory effects are much larger in high SES areas (p-value <0.05) (See Figures 2 and 3). These results also show, in general, the importance of families for shaping links between genetic measures and social outcomes.
Table 3 performs a negative test. We hypothesized to find no effect of families and relative rank on genetic penetrance for the case of height. The results suggest no effect and can rule out even modest size effects.
Table 4 performs a second negative test. We form dyads in the data who are related (e.g. cousins) but are not full siblings (kinship <0.18). We show that earlier results of impacts of relative rankings of PGS are absent in these related, non-sibling pairs.
Discussion
These results contribute to several literatures in the social and genetic sciences. In the social sciences, the use of PGS to measure relative traits between siblings allows expanded analyses estimating the extent to which parents compensate or reinforce their children’s endowments. Future work could expand this direction by estimating a larger set of domains of PGS as well as focusing on early life measurements of children’s outcomes. A major limitation for this direction of inquiry is the modest sample of siblings in many datasets.
Our results show lower levels of genetic penetrance for educational attainment for children who have higher PGS than their sibling. This phenomenon is larger in places whose residents have higher socioeconomic status than in places with low socioeconomic status. Together, these results are consistent with parental preferences for equality among their children, which may be accomplished through compensating investments.
The findings are also consequential for genetic analyses more broadly. Some recent innovative analysis linking genetics to outcomes has taken seriously the confounding between children’s genetics and family background when explaining children’s later outcomes as adults, seeking to decompose PGS into direct (child) and indirect (parent) effects (26, 31, 32). This work has been motivated, in part, by previous research that used sibling comparisons in PGS to predict education. For example, some early use of educational attainment PGS (33) used sibling pairs to show that the PGS continued to predict education. The implication was then drawn that the PGS contained some causal effects of genetics on important outcomes such as schooling.
Later work has prioritized comparisons of between family estimates with within-family estimates to assure a lack of confounding by family background—using EA2 (34) and EA3(23) has been a bit more mixed, with the within-family analysis often reducing the estimates of genetic penetrance by 50% compared to between-family analysis. However, even with these smaller effects, researchers have relied on this framework to show the likelihood of causal PGS effects on education and related outcomes (25, 35, 36). New work (36) has discussed common issues in genetic analysis that can be overcome with family-based designs, including population stratification, assortative mating, and dynastic effects. One commonality of these issues is that they are presumed to reflect shared effects on all children in a household that can be eliminated through sibling-comparisons or, alternatively, the use of parental genetic data. Indeed, researchers have stated that sibling analysis “rules out purely social transmission as an explanation for the associations between children’s education-linked genetics and their attainment.”(25)
However, while these may be the most common concerns for analysis linking genetics with outcomes, a rich set of theoretical models and empirical results from demography and adjacent social sciences suggest there are yet another set of complications that should be explored before claiming that genetic effects are causal. In particular, families shape outcomes, parents react to children’s abilities and have preferences for their set of children’s outcomes. Other family members (who also often share genetics with the child/children) also contribute. An important outcome of these processes is the possibility of statistical interference between the outcomes of siblings(37), so that siblings designs both eliminate some empirical concerns but raise others.
The current results demonstrate that whatever causal effects exist between PGS and outcomes can be mediated through families, even with a sibling fixed effects design. This suggests caution in interpreting sibling fixed effects models as “causal genetic effects” and reemphasizes that the value of this strategy is to control for shared environments but not produce causal estimates. Indeed, our results suggest that within-sibling analysis of genetics and outcomes continue to reflect family processes rather than pure “genetic effects”. Future work will need to incorporate additional strategies to separate “genetic” and “family” causal processes.
Methods and Materials
Data
We used data from the UK Biobank project(38). The participants, aged between 37 and 74 years, were originally recruited between 2006 and 2010. These data are restricted, but one can gain access by following the procedures described in www.ukbiobank.ac.uk/register-apply/. Although siblings are not identified in the survey, respondents’ genetics can be used to measure genetic relatedness among all pairs of respondents. We first use the UKB provided kinship file, listing all pairwise kinships among 100,000 pairs in the sample of nearly 500,000 individuals. We first choose pairs with kinship >0.2, which reflects first degree biological relatives (parents/siblings). We then choose remaining pairs who are <13 years apart in age, leaving ~22,000 sibling dyads. We then chose to keep only one dyad from any family with more one dyad, leaving ~17,600 dyads. We include only respondents of European ancestry in our analysis.
Polygenic scores
We constructed PGS for two traits for which large genome wide association studies (GWAS) are publicly available and do not contain UKB samples: Height (39) and Educational Attainment(40). We removed single-nucleotide polymorphisms (SNPs) in strong linkage disequilibrium (LD). We LD-clumped the GWAS summary data by PLINK(41), using 1000 Genome Project Phase III European genotype data as reference. We used a LD window size of 1Mb and a pairwise r2 threshold of 0.1. We did not apply any p-value thresholding to select SNPs. Final weights were produced by using PRSice-2(42). The PGSs were normalized to have mean zero and SD one and oriented so that each PGS was positively correlated with its corresponding outcome.
Phenotypes
Educational levels of the UK Biobank participants were measured by mapping each major educational qualification that can be identified from the survey measures to an International Standard Classification of Education (ISCED) category and imputing a years-of-education equivalent for each ISCED category(43). Height is measured standing height.
Sibling variables
We created three variables indicating the relative status of the members of sibling pairs. First, we created an indicator for the sibling with the higher EA-PGS score. Second, we created an indicator for the sibling with the higher Height-PGS score. Third, we created an indicator for the sibling who is older, due to well known birth order effects on educational attainment.(28–30) We note that self-reported birth order is only available for a subset of UKB respondents.
Place based socioeconomic status
The UKB does not contain information about childhood background socioeconomic status of the respondents. Therefore we created information based on place of birth and year of birth to predict which places/years of birth had high vs. low (based on median) predicted educational attainment. We split families based on whether neither sibling was born in a place/year with high predicted attainment vs. either sibling was born in a place/year with high predicted attainment. The median predicted schooling was 13.65 years in a regression that had place of birth fixed effects and year of birth fixed effects.
Sample Characteristics
SI Table 1 presents descriptive statistics for the full sample and the analysis sub-sample of siblings. Educational attainment is slightly higher in the full sample and number of siblings (by definition) is higher in the sibling sample. EA3 is slightly higher on average in the sibling sample but demographic characteristics are quite similar.
Sample Characteristics for Falsification Exercise
We created a second sample of related individuals who were not full biological siblings in order to develop a negative (falsification) test of the analysis. To construct the sample, we used dyads with kinship measurements below 0.18 and randomly chose one dyad in cases where a respondent was linked with multiple sample members. Sample characteristics are shown in SI Table 3A and Dyadic Characteristics are shown in SI Table 4A.
Analysis
We tested associations using linear regression models. We clustered standard errors at the family level. We conducted sibling difference analysis using family fixed effects regression (44). We control for sex, age indicators, and 20 genetic principal components. We controlled for age indicators (fixed effects) for the well documented secular increases in schooling over this time period.(45)
We use regression analysis to link PGS to educational attainments in our results. We compare results with and without sibling fixed effects and also with measures reflecting siblings’ relative position in the dyad based on EA3 and age. That is, we regress educational attainment for respondent i in family f on demographic characteristics (age, sex), a polygenic score and controls for 20 principle components and a family-clustered error term:
We next add sibling fixed effects to the model.
Finally, we enrich the model with dyadic measures of relative position, namely an indicator for whether the respondent has a higher PGS than his/her sibling and the interaction between this indicator and PGS:
Where the key coefficient is β4, which reflects differences in genetic penetrance based on relative ranking in the sibship on the PGS. A positive coefficient would suggest that parents reinforce early evidence of a child’s ability (relative to his/her sibling) while a negative coefficient suggests that parents compensate. It is possible that these effects vary by family socioeconomic status.3 Therefore, we stratify the model between families who were born in high socioeconomic status areas/years vs. low socioeconomic status areas/years.
Supplemental Information
Appendix Tables
Footnotes
1 The authors gratefully acknowledge use of the facilities of the Center for Demography of Health and Aging at the University of Wisconsin-Madison, funded by NIA Center Grant P30 AG017266. We thank members of the Social Genomics Working Group at University of Wisconsin and Benjamin Domingue for helpful comments. This research has been conducted using the UK Biobank Resource under Application 57284.
↵2 We would have preferred to control for birth order, but the UKB only asked birth order information for a subset of the sample.
↵3 Unfortunately, the UKB is extremely limited in measurements of childhood experiences and has nearly no information about parents’ characteristics of the respondents (e.g. parental education, family socioeconomic status during childhood, etc).