A General Approach to Adjusting Genetic Studies for Assortative Mating

The effects of assortative mating (AM) on estimates from genetic studies has been receiving increasing attention in recent years. We extend existing AM theory to more general models of sorting and conclude that correct theory-based AM adjustments require knowledge of complicated, unknown historical sorting patterns. We propose a simple, general-purpose approach using polygenic indexes (PGIs). Our approach can estimate the fraction of genetic variance and genetic correlation that is driven by AM. Our approach is less effective when applied to Mendelian randomization (MR) studies for two reasons: AM can induce a form of selection bias in MR studies that remains after our adjustment; and, in the MR context, the adjustment is particularly sensitive to PGI estimation error. Using data from the UK Biobank, we find that AM inflates genetic correlation estimates between health traits and education by 14% on average. Our results suggest caution in interpreting genetic correlations or MR estimates for traits subject to AM.

In this section, we describe a model of assortative mating (AM) and how AM influences genome-wide association studies (GWAS) and (Mendelian randomization) MR estimates.First, we describe the model of sorting.Then we then calculate what the effect of AM is on the LD between genetic variants after one generation of sorting.We then extend that calculation to show how AM-induced LD influences the bias of GWAS estimates.This simple, singlegeneration derivation is useful to develop intuition for what factors are important for influencing LD and selection bias in MR studies.One conclusion of the derivation is that the bias in GWAS coefficients is a linear function of the (conditional) association of each SNP on the treatment, outcome, and other traits on which the population may be sorting.Finally, we show that, in a population that already has undergone an arbitrary amount of sorting, the bias from AM will always be a linear function of these associations.

The model of sorting
We consider a model of AM, where mothers and fathers mate based on a pair of latent phenotypes, where  ! is the maternal phenotype and  " is the paternal phenotype.Specifically, we assume that parents will sort based on their rank-order of the latent phenotype.(That is, the female with the highest value of  ! with mate the male with the highest value of  !, and so forth.)This model of common, ordinal preferences follows the classic economic model of AM of Becker (1973).In contrast to existing models of sorting in the genetics literature where parents sort within or across single, observed traits (e.g., Gianola et al. 1982), this model is more general since  ! and  " can be a function of any other variables and can also include a random component.
Because these latent phenotypes are ordinal (as opposed to cardinal phenotypes), the model will still hold under monotonic transformations of the phenotypes.Because of this, it is possible to transform  ! and  " such that they are each normally distributed, each with a variance of one, and such that  ≡  !=  " whenever they correspond to matched parents.(The mean of the latent phenotypes will be fixed in a later step.) Next, we consider how the sorting phenotypes relate to a pair of observed traits,  and .(We assume that these traits have been standardized to mean zero and variance one.)In what follows, we will often refer to  as the "treatment trait" and  as the "outcome trait", in anticipation of an MR application of this theory, though we do not assume a causal relationship between them unless otherwise noted.Imagine we project each sorting phenotype onto  and , giving us In this model, we allow for some of the sorting to be described by the treatment and outcome traits, but we also allow for residual terms,  ! and  " , which could include other traits relevant for sorting or for the random component of sorting.We have also allowed for the -parameters to vary for mothers and fathers, which allows for different traits to matter differently for men and women.Imagine that we project the sorting phenotypes onto the space spanned by some set of SNPs: where  !,ℓ ' and  !,ℓ ( are the alleles that are transmitted and untransmitted, respectively, to the mother's offspring and  ",ℓ ' and  ",ℓ ( are the corresponding values for the father.In this case,  ℓ ! and  ℓ " are not the causal effects of the SNPs but simply the (conditional) associations of each SNP.The values  ! and  " are the maternal and paternal residuals that aren't explained by SNPs.The mean of the residuals is fixed by the sorting process.Specifically, because  !=  " , it must be the case that E( ! ) = E, " .. One way for this to hold is if and where  ℓ is the allele frequency of SNP ℓ.This way It will be the same in mothers and fathers since we are only considering SNPs on the autosomes, which are inherited independently of the sex chromosomes.We can also project , ,  !, and  " onto these SNPs, giving us Substituting these expressions into (1) and (2) and rearranging terms gives us Therefore, this implies that The intuition of these expressions is that the effect of some SNP on the sorting phenotypes is the sum of the effect of that SNP on the treatment, outcome, and other traits times the association of those traits with the sorting phenotype.

AM-induced LD after one generation of AM
We begin by assuming the set of parents come from a randomly mating population but that the parents themselves are not randomly mating.The key implication of this assumption is Cov, By the central limit theorem and under the assumption that the association of each SNP on the sorting phenotypes is small, each of the random variables, ,| !,* ' = 1., ,| ",+ ' = 1., and (), are approximately normally distributed with variance one.The mean of ,| !,* ' = 1. is Likewise, Substituting in the (approximate) distribution functions and simplifying gives us (The fourth line follows since the integrand is a normal pdf, and the fifth line follows from the Taylor expansion of an exponential.)Following the same steps, we can also show that From this expression, we see that the LD between a pair of SNPs increases as the association between either SNP and the sorting phenotype increases.Indeed, in the special case the mothers and fathers sort on the same underlying phenotype (i.e.,  ℓ !=  ℓ " ∀ℓ), the covariance is directly proportional to each SNP's association with the sorting phenotype.
Bias in GWAS after one generation of AM Next, we are interested in how AM produces bias in GWAS estimates.We begin considering the bias for the treatment trait.Using  R * # to denote the GWAS estimate of the association of SNP  on trait , we calculate .
(The approximation in the seventh line above is because Var, * .>  * ,1 −  * .due to AM, but the inflation of the variance will be small relative to Var, * .if  * ! and  * " are small.)Next, substituting in eqs (14-15), we get to denote the random-mating heritability of the treatment trait, to denote the random-mating genetic covariance of the treatment and outcome traits, to denote the random-mating genetic covariance of the treatment and maternal sorting trait (after residualizing for  and ), and to denote the random-mating genetic covariance of the treatment and paternal sorting trait (after residualizing for  and ), all under random mating.
Recall, however, that  ! and  " are the residuals from projecting  ! and  " , respectively, onto  and , which implies that  ! and  " are uncorrelated with  and .While it's technically feasible for those variables to be (phenotypically) uncorrelated while being genetically correlated, genetic correlations nearly always have the same sign as phenotypic correlations (Antilla et al., 2018), so the genetic correlation is likely to be negligible.Therefore, we assume that  #) !=  #) " =  $) !=  $) " = 0. Substituting in these terms gives us.
By symmetry, the bias for the outcome trait is Below, we note that we may see SNP-selection bias when the bias on the treatment GWAS estimates is a function of the effect of the SNP on the outcome trait ( * $ ).Therefore, we should expect to see SNP-selection bias when From this expression, we see that there are two key ways that selection bias can arise.First, if there is cross-or multi-trait AM, then 1 # $  ' " +  ' $  # " 2 ≠ 0 (except in rare knife edge cases), which implies that selection will always be a risk in that case.Second, even if there is no sorting on the treatment, there can be selection bias if there is sorting on the outcome and the treatment trait is genetically correlated with either the outcome trait or the residual of the selection trait.Notably, if there is no selection on the outcome trait (i.e.,  # $ =  ' $ = 0), then there is no selection bias.Hartwig et al. (2018) claims that MR is not biased if there is single trait sorting solely on the treatment or outcome trait and if the traits are genetically uncorrelated.Eqs. ( 20) and ( 21) show that this is not true.Sorting on just the treatment implies that Assume that the exclusion restriction is satisfied for SNP  in a randomly mating population, such that  * $ =  * # where  is the causal effect of the treatment on the outcome.The MR ratio for this SNP (under the assumption of sorting only on treatment and no genetic correlation) is This will not equal  unless the treatment is not heritable (i.e., ℎ # / = 0) or the treatment has not causal effect on the outcome (i.e.,  = 0).So the MR estimate will generally be biased even if the MR study only includes SNPs that satisfy exclusion under random mating.
Consider as well the case where parents only sort on the outcome trait.Then  !$ =  " $ = 1 and The MR ratio for SNP  in this case is This is also not generally equal to  unless the outcome is not heritable or the treatment has no causal effect on the outcome.

Inflation of genetic variance and genetic correlation after one generation of AM
While extensive theory has been derived for how AM affects genetic variance and genetic correlation (Fisher, 1918;Crow and Feselstien, 1968;Crow andKimura, 1970, Gianola 1982), here we provide a parallel derivation in context of the parameterization above.
We define the "excess variance" as the difference in the genetic variance in the current generation and the genetic variance under random mating.That is, for the treatment trait, We further calculate By symmetry, the excess variance for the outcome trait is The intuition for these expressions is simple.Genetic variance for  trait increases when parents sort on  (i.e.,  !# ≠ 0 and  " # ≠ 0), when they sort on a genetically correlated trait (i.e.,  !$ ≠ 0,  " $ ≠ 0,  #$ / ≠ 0)), and when there is cross trait sorting between a pair of genetically correlated traits.The parallel statements hold for trait .Now considering the excess covariance, Following parallel steps to the calculation of the excess variance obtains As with the excess variance, the first term of this expression implies that excess covariance will increase if both parents sort on the treatment or outcome trait, and the second term implies that excess covariance increases if there is multi-or cross-trait sorting.
The formula for the excess genetic correlation does not have a clean closed form except in certain simple cases.

AM-induced LD with arbitrary AM
Above we have derived a simple expression of the effect of AM on AM-induced LD and GWAS bias after a single generation of sorting.We now consider the case where the parents potential come from a population that may have had non-random mating for several generations.We do not assume that the population is in equilibrium or that the parameters defining the sorting have been constant over time.However, by inductive reasoning, we will show that if, in the parents' generation, the covariance between a pair of SNPs is of the form In the AM case, Next, we note that by the properties of linear projections By substituting (33) into (32) and rearranging terms, we have where Following a parallel derivation, we have where As before, substituting (34) and ( 37) into (31), completing the square, and rearranging terms gives us And by symmetry, we have So by induction, if the covariance between pairs of SNPs is of this form in one generation (as it is in a randomly mating population), it must be of the same form in any subsequent generation under our model of sorting.

GWAS Bias with arbitrary AM
As before, we would like to evaluate the bias of GWAS estimates for the treatment trait under a general model of sorting.The initial steps in this case are identical to the firstgeneration of AM derivation such that Here, we substitute in the general form of AM-induced LD, eq. ( 30), into (46), giving us Next, substituting in (14-15), we obtain  ) " .Therefore, we see that in any generation of arbitrary sorting, the bias in GWAS estimates for the treatment trait will always be linear in the true conditional associations of SNPs on the treatment trait.This property of the bias is useful because there is risk of SNP-selection bias if  * # / * $ = 0.By the linearity above, this derivative is a constant, making it easier to determine whether the condition holds.

Simulation design.
Studies already exist with large-scale simulations evaluating the effect of AM on genetic correlation (Border et al., 2022) and MR analyses (Hartwig et al., 2018).Here, we consider a series of simplified simulations that confirm some of their points, fill in the gaps that were missed in their analyses, and provide counter examples to some of their broader claims.We also evaluate the performance of our adjustments to genetic correlation and MR estimates under various conditions.
In these simulations, we consider four genetic architectures: (1) no correlation/perfect pleiotropy, (2) partial correlation/perfect pleiotropy (3) no pleiotropy, (4) partial pleiotropy.Under (1), the simulated effect of each SNP on each trait is drawn from a bivariate standard normal distribution with no correlation.Under (2), simulated effect of each SNP is drawn from a bivariate normal with mean zero, variance one, and a correlation of 0.5.Under (3), half of SNPs have simulated effects drawn from standard normal for trait 1 and have an effect of zero on trait 2, and for the residual half, the SNPs have a simulated effect of zero on trait 1 and an effect drawn from a standard normal distribution for trait 2. Under (4), 25% of SNPs affect trait 1 but not trait 2, 25% affect trait 2 but not trait 1, and 50% affect both traits.In this partial pleiotropy case, effect sizes are drawn from a standard normal distribution, with equal effects on trait 1 and 2 for pleiotropic SNPs.See Figure S1 for illustrations of these genetic architectures.
In the MR analyses, each of these approaches can be mapped to different ways that the exclusion restriction may be met or violated.For example, cases (1)-(3) correspond to scenarios where no SNP satisfies the exclusion restriction.Case (4) corresponds to a case where half of SNPs do.Notably, the 25% of SNPs that have no effect on the treatment trait will generally be excluded from MR since they would not meet the statistical significance criterion for inclusion.As a result, more than half of included SNPs would satisfy exclusion and therefore (in the randomly mating population) the assumptions of median-based MR (and Mode-based MR) should be met.Despite this, if we assume that the causal effect of the treatment on the outcome is 0 and 0.5 for cases (1) and ( 2), respectively, both median-based and mode-based MR will still be consistent in a randomly mating population since the median and modal SNP satisfy exclusion.
We simulate 4 sorting mechanisms: (1) sorting on just the treatment phenotype, (2) sorting on just the outcome phenotype, (3) cross-trait sorting, and (4) multi-trait sorting.Crosstrait sorting is when males with a high value for trait 1 are more likely to mate with females with a high value of trait 2 (independent of their value for trait 1).When considering the effect of AM on genetic variance or genetic correlation, due to their symmetry, we collapse mechanisms (1) and ( 2) into a single mechanism, which we call single-trait sorting.Multi-trait sorting is when people sort based on linear combination of both trait 1 and trait 2. In our multitrait sorting simulation, individuals sort on the linear combination of both traits with a weight of 0.8 on trait 1 and 0.2 on trait 2. In each simulation, in the first generation of 10,000 people, we simulate a population that is assumed to be randomly mating.(That is, each SNP is uncorrelated with each other SNP in the genome.)To maximize statistical power and to increase the effect of AM on AM-induced LD in subsequent generations, we simulate both the treatment and outcome traits with a heritability of one.To produce the next generation, we randomly assign half of the population as male and the other half as female.We then rankmatch one person from each group to the corresponding person from the other group.(E.g., in the single-trait sorting cases, the male with the highest trait value would be matched with the female with the highest trait value.)Because our simulated sample is so large, this results in a near-perfect correlation of the traits on which mates are sorting.We model near-perfect sorting for computational reasons since stronger sorting will give rise to large amounts of AMinduced LD in few generations.
Each pair is simulated to have two children (which maintains the population size).For simplicity, we assume perfect recombination between each pair of SNPs.This matching and mating procedure is repeated until we have four generations of data, with the earliest generation of data corresponding to random mating and the fourth generation having the largest impacts of AM.We then calculate the genetic correlation of the two traits for each generation and the MR estimate of the effect of the treatment on the outcome, each with and without the proposed AM adjustment.We also calculate the increase in the genetic variance of each trait relative to the genetic variance in the first (randomly mating) generation.
The PGIs used for the AM adjustments are simulated in the following way.To account for the potential of causal SNPs that are not included in the PGI, we simulate cases where all causal SNPs are included in the PGI and when only every other SNP is included.To account for estimation error in the PGI weights, the weight for SNP  and trait  was set as where  ∈ {0.0, 0.2, 0.4, 0.6, 0.8, 1.0} is the parameter that sets the fraction of the PGI that is due to error,  * # ~(0,1) is an error term, and  * # is the simulated effect size (also drawn from a standard normal distribution).Combing both sources of error, this means we produce 12 PGIs for each replication of the simulation.
The parameter  in this paper and the parameter  defined in Becker et al. (2021) are related.Becker et al. defines  as the amount of shrinkage observed in a regression of some trait on the PGI for that trait.Under our model Becker et al. estimate  = 1.3 for educational attainment, which implies that  = .4 in that case.

Simulation study results
Supplementary Figure S4 reports our simulation results related to how AM inflates genetic variance and Supplementary Figure S5 reports the results for genetic correlation using a PGI with no error as a best-case scenario (see Supplementary Tables S1 and S3 for complete results).The simulations have two main implications for the effect on AM on genetic variance.First, single-trait AM on trait 1 has no effect on the genetic variance of trait 2 if the traits are uncorrelated in the randomly mating population (i.e., the rg = 0 and No Pleiotropy cases).Second, we see inflation of the genetic variance for both traits in all other settings, though this inflation is very small in the multi-trait AM cases for the uncorrelated genetic architectures.Both of these observations are consistent with the theory summarized in Main Text Table 1.
The simulations also have two main implications on the effect of AM on genetic correlation.First, single-trait AM inflates the genetic correlation between pairs of traits when there is a positive genetic correlation in the randomly mating population, but it does not affect the genetic correlation when the traits are not initially genetically correlated.This is consistent with the theoretical results of Gianola (1982) based on a bivariate normal model.Our simulations show that his conclusions hold also under our more complicated partial pleiotropy architecture.Second, cross-trait and multi-trait AM always inflate genetic correlation, a replication of the results reported in Border et al. (2022).When we apply our AM adjustment using a PGI without error to our simulated data, we see that we mostly recover the random mating estimate of genetic correlation, as expected based on our theory.The partial recovery is due to the fact our adjustment does not correct for any within-chromosome effects on LD.As expected, however, the residual inflation after the adjustment is generally small.When we consider calculate the adjusted genetic correlation in a randomly mating population, the adjustment is not biased on average, though the adjustment does become unstable for large values of  (i.e., very noisy PGIs).In AM populations, however, some downward bias is introduced, implying that our approach over-adjusts for AM.The implication of this over adjustment is that, our method will not identify a role of AM in genetic correlation estimates when there is none; however, if our method estimates a non-zero effect of AM, that estimate will be an upper bound on the true effect of AM on the genetic correlation.
The MR simulation results presented in Supplementary Figure S7 and Supplementary Tables S2 and S4 produce several important conclusions.First, for the architectures considered, single-trait AM on the treatment trait generally does not influence MR estimates except in the partial pleiotropy case, where it produces a bias toward the null.The reason for this bias in the partial pleiotropy case is that AM produces LD between the SNPs that are causal for both traits and the SNPs that are casual for only the "treatment" trait.Because MR analyses rely on preselection of instrument based on GWAS summary statistics, this AM-induced LD will result in preferential selection of SNP-treatment associations with inflated coefficients (is this a type of winner's curse?If so, I think we should use that phrase).This AM-induced LD will not inflate the outcome GWAS coefficients because those SNPs do not affect the outcome trait.Because the treatment GWAS coefficient is in the denominator of the MR ratios, this will result in bias towards the null.This attenuation does not affect the rg=.5 case because there are no SNPs that affect the treatment trait and not the outcome trait.
Second, MR estimates are inflated away from zero when there is sorting on the outcome trait.The mechanism by which this occurs is analogous to the reason that estimates are biased toward the null when there is sorting on the treatment.
Third, cross-trait and multi-trait AM inflates MR estimates in all cases except the partial pleiotropy case.This is because cross-and multi-trait AM generally increase LD between SNPs if one increases the outcome trait and the other increases the treatment trait.This will push the GWAS effects (which will capture the effects of both SNPs in the sorting population) to have concordant associations.In the case of the partial pleiotropy genetic architecture, cross-trait assortative mating inflates both the treatment and outcome GWAS associations equally since the fraction of SNPs that affect one trait and not the other are equally sized.Therefore, the inflation of the associations cancels out in the MR ratios.In the multi-trait sorting case, because there is greater weight on the treatment trait, the inflation of the GWAS associations is larger, which actually leads to a downward bias of the MR associations.
Fourth, in all cases, if a PGI without error is used, we see that our adjusted MR estimates is closer to the MR estimate that would be obtained under random mating than the unadjusted estimates.A small amount of the residual bias is due to within-chromosome LD, though the remaining bias is due to other issues discussed in the next paragraph.
Fifth, while the AM-adjusted MR estimates mostly obtain the random-mating MR estimate, large gaps remain in the Outcome AM case when rg=.5 and in the Cross-and Multitrait AM cases for both perfect-pleiotropy architectures.This is due to SNP-selection bias.In these cases, the GWAS associations are inflated more by AM for SNPs that have large, signconcordant effects on both traits.Since such SNPs have especially inflated treatment associations, these SNPs are more likely to satisfy the significance criterion for inclusion in the MR study.For example, in the cross-trait AM/rg=0 simulation, SNPs with concordant-signed effects and discordant-signed effects are equally likely to be included in the MR analysis in a randomly mating population (concordant: 8.7%, SE = 0.2%; discordant: 8.4%, SE = 0.2%), but in the fourth generation, concordant-signed SNPs have a 23.5% chance (SE = 0.2%) of being included compared to discordant-signed SNPs, which have only a 7.8% chance (SE = 0.2%).This bias is in addition to the bias from AM-induced LD and therefore is not accounted for by MRadj.Because of this, we see that in many cases, over half of the bias in our MR study simulations appears to be due to selection of instruments.
Sixth, considering the results in Supplementary Table S4, error in the PGI can generate adjusted estimates that are downward biased, even in a randomly mating population.This is because the adjustment factor is the ratio of two very small numbers, so even a small amount of error can make the adjustment factor approximate a Cauchy distribution as the ratio of two mean-zero normal distributions.This implies that, unless very large samples are available to produce high powered PGIs, this approach may be impractical to adjust MR estimates.
Overall, our simulation results highlight that both single-trait, cross-trait, and multi-trait AM can have dramatic effects on the interpretation of genetic correlation and MR studies.Our adjustment can be a useful tool for understanding the role of AM on genetic correlation estimates, but it is less useful for MR applications due to SNP-selection bias and bias due to noisy in the PGI.

III. Mendelian randomization Estimators
In this paper, we consider three different Mendelian randomization estimators: MR-Egger (Bowden et al. 2015), MR-median (Bowden et al. 2016), and MR-Mode (Guo et al., 2018).Each of these approaches rely on different assumptions about the SNPs used in the analysis.A more detailed summary of each of them has been described elsewhere (Sanderson et al., 2022), but we here we provide a simple, high-level description of each of them below.We also provide more detailed information about how they were implemented in the analyses contained in this paper.

Inverse-Variance Weighted MR (MR-IVW)
Before describe the three methods used in this paper, we describe what might be considered the classic estimator: inverse-variance weighted (IVW) MR.We do not use this estimator in our analyses since it relies on strictly stronger assumptions that the other three estimators considered.For this estimator, the MR ratios ( * $ / * # ) are first calculated for each SNP.Then, a weighted average is taken across all the ratios, where the weight for SNP ,  * , is the inverse variance of the treatment GWAS association,  * = 1/ * / .
The usual assumption for this estimator is that all included SNPs satisfy the exclusion restriction.Under random mating this condition is generally met by the "no pleiotropy" architecture of our simulations since the SNPs that violate exclusion (i.e., those which do not affect the treatment but do affect the outcome) are likely to be omitted during the selection step.The estimator will also be consistent under the knife-edge case that all violations average out to be zero.Under random mating, this condition holds under the simulated full-pleiotropy architectures if we assume that the true causal effect is equal to  T , and that deviations in the ,1 −  * .+ (1 −  + ). * * , .