## Abstract

In both statistical genetics and phylogenetics, a major goal is to identify correlations between genetic loci or other aspects of the phenotype or environment and a focal trait. In these two fields, there are sophisticated but disparate statistical traditions aimed at these tasks. The disconnect between their respective approaches is becoming untenable as questions in medicine, conservation biology, and evolutionary biology increasingly rely on integrating data from within and among species, and once-clear conceptual divisions are becoming increasingly blurred. To help bridge this divide, we derive a general model describing the covariance between the genetic contributions to the quantitative phenotypes of different individuals. Taking this approach shows that standard models in both statistical genetics (e.g., Genome-Wide Association Studies; GWAS) and phylogenetic comparative biology (e.g., phylogenetic regression) can be interpreted as special cases of this more general quantitative-genetic model. The fact that these models share the same core architecture means that we can build a unified understanding of the strengths and limitations of different methods for controlling for genetic structure when testing for associations. We develop intuition for why and when spurious correlations may occur using analytical theory and conduct population-genetic and phylogenetic simulations of quantitative traits. The structural similarity of problems in statistical genetics and phylogenetics enables us to take methodological advances from one field and apply them in the other. We demonstrate this by showing how a standard GWAS technique—including both the genetic relatedness matrix (GRM) as well as its leading eigenvectors, corresponding to the principal components (PCs) of the genotype matrix, in a regression model—can mitigate spurious correlations in phylogenetic analyses.
As a case study of this, we re-examine an analysis testing for co-evolution of expression levels between genes across a fungal phylogeny, and show that including principal components as covariates decreases the false positive rate while simultaneously increasing the true positive rate. More generally, this work provides a foundation for more integrative approaches for understanding the genetic architecture of phenotypes and how evolutionary processes shape it.

## Introduction

Statistical genetics and phylogenetic comparative biology share the goal of identifying correlations between features of individuals (or populations) that are structured by historical patterns of ancestry. In the case of statistical genetics, researchers search for causal genetic variants underlying a phenotype of interest, whereas in phylogenetic comparative biology, researchers are typically interested in testing for associations among phenotypes or between a phenotype and an environmental variable. In both cases, these tests are designed to isolate the influence of a focal variable from that of many potential confounding variables. But despite the shared high-level goal, the statistical traditions in these two fields have developed largely separately, and—at least superficially—do not resemble each other. Moreover, researchers in these two statistical traditions may have different understandings of the nature of the problems they are trying to solve.

In statistical genetics, phenotypes and genotypes can be spuriously associated because of confounding due to population structure [1–4] or assortative mating [5, 6]. For example, in their famous “chopsticks” thought experiment, Lander & Schork [1] imagined that genetic variants that have drifted to higher frequency in subpopulations in which chopsticks are frequently used will appear, in a broad sample, to be associated with individual ability to use chopsticks, even though the association is due to cultural confounding and not to genetic causation. Confounding can also be genetic [7]—if a genetic variant that changes a phenotype is more common in one population than others, leading to differences in average phenotype among populations, then other, non-causal variants that have drifted to relatively high frequency in this population may appear to be associated with the phenotype in a broad sample. In addition to affecting genome-wide association study (GWAS) results, such confounding can affect heritability estimation [8, 9], genetic correlation estimates [10, 11], and prediction of phenotypes from polygenic scores [12–16]. Although many candidate solutions have been offered [17–21], the two most common approaches involve adjusting for shared ancestry using the genetic relatedness matrix (GRM, [22]), either by incorporating individual values on the first several eigenvectors of this matrix (i.e. the principal components of the genotype matrix) as fixed effects [23], or by modeling covariance among individuals attributable to genome-wide relatedness in a linear mixed model (LMM, [24–27]).

In phylogenetic comparative biology, researchers typically aim to control for the similarity of related species by incorporating the species tree into the analysis. There has been a great deal of controversy over the years as to what the underlying goals and implicit assumptions of phylogenetic comparative methods (PCMs) are [see for examples refs. 28–34]. But broadly speaking, it seems that many researchers understand the goal of PCMs to be avoiding “phylogenetic pseudoreplication” [35]—mistaking similarity due to shared phylogenetic history for similarity due to independent evolutionary events [33]. This is most commonly done by conducting a standard regression, using either generalized least squares (GLS) or a generalized linear mixed model (GLMM), but including the expected covariance structure owing to phylogeny [36–40]. (Throughout this paper, we do not make a distinction between phylogenetic GLS and phylogenetic GLMM models. We refer to them generically by the shorthand GLS for the general case and PGLS for cases where the phylogenetic variance-covariance is used; see below for definitions.) This covariance structure reflects both the relatedness of species and the expected distribution of phenotypes under a model of phenotypic evolution [41, 42], such as a Brownian motion [43] or an Ornstein-Uhlenbeck [44, 45] process. (The “phylogenetically independent contrasts” method [46], which ushered in modern PCMs is statistically equivalent to a PGLS model assuming a Brownian model [47].)

In recent years, however, signs have emerged that these two subfields may benefit from closer conversation, as emerging approaches in both statistical genetics and phylogenetics encounter questions that call for the other subfield’s expertise. For example, in humans, evolutionarily conserved sequences are enriched for trait and disease heritability [48, 49], and conservation across related species can be used to prioritize medically relevant variants in fine mapping [50, 51] and rare variant association studies [52, 53]. Similarly, conservation geneticists have used multi-species alignments to estimate the fitness effects of mutations in a focal population [54]. And there is growing interest in using estimated ancestral recombination graphs (ARGs) to perform explicitly tree-based versions of QTL mapping and complex trait analysis [55, 56]. From the phylogenetics side, researchers are increasingly employing GWAS-like approaches (“PhyloG2P” methods; [57]) for mapping phenotypes of interest for which the variation primarily segregates among rather than within species. Furthermore, phylogenetic biologists have been developing phylogenetic models that consider within-species polymorphisms in genes [58, 59] and how ancient polymorphisms that are still segregating among species may confound association tests [60].

Such emerging connections suggest that it would be beneficial to better understand the ways in which statistical genetics and phylogenetic comparative biology relate to each other. To this end, we start from first principles and develop a general statistical model for investigating associations between focal variables while controlling for shared ancestry and environment. We show that both standard GWAS and standard phylogenetic regression emerge as special cases of this more general model. We illustrate how this deep connection provides insights into why these two classes of methods can be misled by certain types of unmodeled structure, why solutions in both fields work (and when they do not), and how statistical advances in either domain can be applied straightforwardly to the other.

## Results and Discussion

### A general polygenic model of quantitative trait variation

We assume a standard model in which many genetic factors of small effect influence a phenotype in an additive way—that is, there is no dominance or interaction among genetic loci (epistasis). Denoting by *β*_{l} the additive effect size of the variant at the *l*th locus and *G*_{il} the genotype of the *i*th individual at the *l*th locus, we write

$$G_i = \sum_{l} \beta_l G_{il}, \qquad (1)$$
where *G*_{i} is the genetic component of the phenotype of individual *i*, sometimes called a genetic value or breeding value. We then express the phenotype of individual *i*, *Y*_{i}, as the sum of the genetic component and an environmental component, *E*_{i},

$$Y_i = G_i + E_i. \qquad (2)$$
Due to shared ancestry, the genotypes of individuals in the sample will be correlated; thus, the genetic components of the individuals in the sample will be correlated. Moreover, the environments experienced by individuals may be correlated, and these environmental effects may be correlated with the genetic components. Thus,

$$\mathrm{Cov}(Y_i, Y_j) = \mathrm{Cov}(G_i, G_j) + \mathrm{Cov}(G_i, E_j) + \mathrm{Cov}(E_i, G_j) + \mathrm{Cov}(E_i, E_j). \qquad (3)$$
For the rest of the paper, our focus will be on the first term Cov(*G*_{i}, *G*_{j}), the covariance in phenotypes between individuals due to genetic covariance. As we will show below, many models used by both statistical geneticists and phylogenetic biologists can be understood without reference to the components that include environmental effects. Interestingly, some statistical geneticists interpret the use of the GRM in the standard GWAS model as being primarily a proxy for shared (but unmeasured) environment among individuals [7], since in many cases, individuals whose genotypes are correlated will also experience similar environments. Phylogenetic biologists, on the other hand, tend to interpret the use of the phylogenetic covariance matrix in the model exclusively as controlling for genetic effects and thereby implicitly assume that the effects of the environment on phenotypes are unstructured [29, 32]. On macroevolutionary time scales, it is not so obvious that genetic and environmental similarities will mirror each other. For example, dispersal is less limited on long timescales [61, 62], so relatively closely related species might live in very different conditions. However, for traits such as gene expression (which is increasingly studied in a phylogenetic context; [59, 63–66]) that are strongly environmentally dependent and for which measurements are taken from species in different environments, we may need to develop new ways to model the environmental terms. This is a challenge beyond the scope of the present paper. We also note that there are some circumstances in which genetic covariance in equation 3 is undefined, such as when effect sizes have an undefined variance [67], or under certain phenomenological models of evolution on phylogenies [68, 69]; we reserve these situations for future work and focus on situations in which the genetic covariance is finite in the subsequent sections.
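The additive model above is straightforward to sketch numerically. The following toy simulation of equations 1 and 2 is our own construction with arbitrary parameter choices (diploid 0/1/2 genotype counts, Gaussian effect sizes, unstructured environment), not code from any particular package:

```python
import numpy as np

rng = np.random.default_rng(0)
n, L = 200, 1_000                      # individuals, loci

# Per-locus allele frequencies and diploid genotype counts (0/1/2).
p = rng.uniform(0.05, 0.95, size=L)
G_mat = rng.binomial(2, p, size=(n, L))

beta = rng.normal(0.0, 0.1, size=L)    # additive effect sizes
G = G_mat @ beta                       # genetic values (equation 1)
E = rng.normal(0.0, 1.0, size=n)       # unstructured environmental component
Y = G + E                              # phenotypes (equation 2)
```

In a structured sample, the genotype rows would be correlated across individuals, which is what induces the covariance terms discussed above.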

### Conceptualizations of the genetic covariance

Individuals who are more closely related will have more similar genotypes. For example, individuals in the same local population may share the same alleles identical by descent due to recent common ancestry. On the other hand, individuals in different species may not share alleles due to the species being fixed for alternative alleles at a given locus. Using equation 1,

$$\mathrm{Cov}(G_i, G_j) = \sum_{l} \mathrm{Cov}(\beta_l G_{il}, \beta_l G_{jl}) + \sum_{l \neq m} \mathrm{Cov}(\beta_l G_{il}, \beta_m G_{jm}). \qquad (4)$$

The first term arises from the correlations at a single locus, while the second term arises from correlations among loci across individuals. We focus on the first term, though the second can be important in biologically realistic situations. A particularly important case is selection on polygenic traits, which can cause correlations among the genetic contributions to a trait across loci. For example, if a population experiences directional selection on a highly polygenic phenotype, much of the phenotypic change, compared with a related population that has not experienced such selection, is due to small, coordinated changes in allele frequency [70, 71]. In addition, linkage can affect the evolution of polygenic traits [72] and the results of heritability estimates [73].

In full generality, it is hard to say much more, since both the effect sizes and genotypes might be viewed as random and as dependent in arbitrary ways. One way to gain further insight is to imagine a variable *Z* that renders the genotypes *G*_{il} and the effect sizes *β*_{l} conditionally independent. As an example, we will imagine the locus-level allele frequency as a candidate for such a variable below. Conditioning on *Z* and using the definition of covariance and the law of total expectation, the first term becomes

$$\sum_{l} \mathrm{Cov}(\beta_l G_{il}, \beta_l G_{jl}) = \sum_{l} \Big( \mathbb{E}\big[\mathbb{E}(\beta_l^2 \mid Z)\, \mathbb{E}(G_{il} G_{jl} \mid Z)\big] - \mathbb{E}\big[\mathbb{E}(\beta_l \mid Z)\, \mathbb{E}(G_{il} \mid Z)\big]\, \mathbb{E}\big[\mathbb{E}(\beta_l \mid Z)\, \mathbb{E}(G_{jl} \mid Z)\big] \Big). \qquad (5)$$
This formula is fully general, as long as the genetic covariance exists, and can be applied in any evolutionary model in which there is a variable *Z*—potentially a latent variable—that accounts for the relationship between effect sizes and genotypes [74]. Moreover, it applies when the variable *Z* = *β* or *Z* = *G*.

Methods across subfields of evolutionary and statistical genetics rely heavily on different versions of a matrix that encodes relationships among individuals or species by specializing equation 5. When we are referring to this matrix in general, we call it Σ. In a sample of *n* individuals (or *n* species), Σ is *n × n*, and Σ_{ij} is proportional to some version of equation 5.

Among other names, in different settings, Σ might take the form of a “genetic relatedness matrix,” “kinship matrix,” “expected genetic relatedness matrix,” or “phylogenetic variance-covariance matrix.” Below, we consider the off-diagonal entries of each of these matrices in turn.

### The genetic relatedness matrix

We can simplify equation 5 if the second term is equal to 0, which can be accomplished in a number of ways. One way is to mean-center the genotypes. In that case, we write

$$\tilde{G}_{il} = G_{il} - \mathbb{E}(G_{il}).$$

Then, the linear model of the phenotype (equation 2) can be written

$$Y_i = \mu + \sum_{l} \beta_l \tilde{G}_{il} + E_i,$$

where $\mu = \sum_{l} \beta_l \mathbb{E}(G_{il})$ is the mean genetic value in the population, and $\tilde{G}_{il}$ is the mean-centered genotype. Then, equation 5 becomes

$$\begin{aligned} \mathrm{Cov}(G_i, G_j) &= \sum_{l} \Big( \mathbb{E}\big[\mathbb{E}(\beta_l^2 \mid Z)\, \mathbb{E}(\tilde{G}_{il} \tilde{G}_{jl} \mid Z)\big] - \mathbb{E}\big[\mathbb{E}(\beta_l \mid Z)\, \mathbb{E}(\tilde{G}_{il} \mid Z)\big]\, \mathbb{E}\big[\mathbb{E}(\beta_l \mid Z)\, \mathbb{E}(\tilde{G}_{jl} \mid Z)\big] \Big) \\ &= \sum_{l} \mathbb{E}\big[\mathbb{E}(\beta_l^2 \mid Z)\, \mathbb{E}(\tilde{G}_{il} \tilde{G}_{jl} \mid Z)\big]. \end{aligned}$$

The second line follows because mean-centering implies that $\mathbb{E}(\tilde{G}_{il} \mid Z) = 0$ for all *i*.

One reasonable candidate for *Z*—i.e. a variable conditional on which the genotype and effect size are independent—is the allele frequency *p*_{l}. Conditioning on *p*_{l} and using mean-centered genotypes,

$$\mathrm{Cov}(G_i, G_j) = \sum_{l} \mathbb{E}\big[\mathbb{E}(\beta_l^2 \mid p_l)\, \mathbb{E}(\tilde{G}_{il} \tilde{G}_{jl} \mid p_l)\big].$$

In the presence of dense genotype data, we can approximate the expectation over genotypes by the empirical mean,

$$\mathbb{E}(\tilde{G}_{il} \tilde{G}_{jl} \mid p_l = p) \approx \frac{1}{n_p} \sum_{l : p_l = p} \tilde{G}_{il} \tilde{G}_{jl},$$

where *n*_{p} is the number of sites with frequency *p*. Then, provided an estimate of $\mathbb{E}(\beta_l^2 \mid p_l)$, we can compute

$$\Sigma_{ij} \propto \sum_{l} \mathbb{E}(\beta_l^2 \mid p_l)\, \tilde{G}_{il} \tilde{G}_{jl}.$$

If we suppose that $\mathbb{E}(\beta_l^2 \mid p_l) \propto [2 p_l (1 - p_l)]^{-1}$, then this formula recapitulates the Genetic Relatedness Matrix (GRM), a realization of Σ that is commonly used to estimate heritability from SNP data [75] or to accommodate covariance due to relatedness in GWAS [24–27]. And considering $\mathbb{E}(\beta_l^2 \mid p_l) \propto [2 p_l (1 - p_l)]^{\alpha}$ recovers the more general “alpha model” or “LDAK model” [22, 73]. In plant and animal breeding, where this framework first appeared [76], sometimes the same normalization is used as in human genetics, and sometimes genotypes are mean centered but not standardized [77, 78].
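As a concrete illustration, here is a minimal sketch of a GRM under this kind of frequency-dependent weighting. This is our own construction assuming diploid 0/1/2 genotype counts; exact scaling conventions vary across tools, but with `alpha = -1` the weights correspond to the standard standardized GRM and with `alpha = 0` to centered, unstandardized genotypes:

```python
import numpy as np

def grm(genotypes, alpha=-1.0):
    """GRM from an n x L matrix of diploid 0/1/2 genotype counts.

    Loci are mean-centered and weighted by [2p(1-p)]^(alpha/2);
    alpha = -1 mimics the standard standardized GRM, while alpha = 0
    leaves genotypes centered but unstandardized, as in some
    breeding applications.
    """
    p = genotypes.mean(axis=0) / 2.0            # per-locus allele frequency
    keep = (p > 0) & (p < 1)                    # drop monomorphic loci
    X = genotypes[:, keep] - 2.0 * p[keep]      # mean-center
    w = (2.0 * p[keep] * (1.0 - p[keep])) ** (alpha / 2.0)
    Z = X * w                                   # scaled, centered genotypes
    return Z @ Z.T / Z.shape[1]                 # average over loci
```

The resulting matrix is symmetric and positive semidefinite, as required of a realization of Σ.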

### The (pedigree-based) kinship matrix

Historically, plant and animal breeders, along with human and behavior geneticists interested in resemblance of relatives, have frequently faced a situation in which they have had 1) (at least partial) pedigree data describing the parentage of sets of individual plants or animals, 2) phenotypic data on those individuals, but 3) no genome-wide genetic information. In such a situation, one can model the entries of Σ as a function of expected genetic similarity based on the pedigree information, as opposed to realized genetic sharing observed from genotypes [77, 79–81]. One can specialize equation 4 by fixing the effect sizes, leading to

$$\mathrm{Cov}(G_i, G_j) = 2 \theta_{ij} V_A,$$
where *θ*_{ij} is the kinship coefficient (obtained from the pedigree) relating individuals *i* and *j*, and *V*_{A} is the additive genetic variance. Although many derivations exist in standard texts [e.g. 81, 82], we include one in the appendix for completeness.

Methods based on this formulation include the “animal model” [79, 81, 83, 84], a widely used approach for prediction of breeding values in quantitative genetics. The connection between the animal model and genome-wide marker-based approaches was plain to the quantitative geneticists who first developed marker-based approaches to prediction [76], and it is also noted in papers aimed at human geneticists [22, 75, 85], whose initial interest in the framework focused on heritability estimation. Similarly, the animal model is known to be intimately connected to the phylogenetic methods we discuss later [38–40]. One implication is that close connections between methods used in statistical genetics and phylogenetics, which are our focus here, must exist.

### The expected genetic relatedness matrix (eGRM)

We may make other assumptions to model the genetic covariances among individuals. For instance, we might let *Z* = *β*, the effect sizes themselves. Unlike in the previous subsection, the effect sizes are not fixed; they are random and independent of genotype. In this case, the covariance is computed with respect to alternative realizations of the mutational process (as in the branch-based approach in [86]) and, in this subsection, also alternative realizations of the underlying gene tree(s).

Then, if genotypes are independent of effect sizes and the expected effect size is 0, we show in the appendix that

$$\mathrm{Cov}(G_i, G_j) = \mathbb{E}(\beta^2) \sum_{l} \mathbb{E}(G_{il} G_{jl}),$$

so that, if all loci are equivalent,

$$\Sigma_{ij} \propto \mathbb{E}(G_{il} G_{jl}).$$

In theory, the expectation of the product of genotypes can be computed based on coalescent theory [87], focusing on the gene trees that describe the pattern of shared inheritance of alleles rather than the distribution of the alleles themselves. In practice, the gene trees underlying genetic variation in the sample cannot be observed directly. However, developing a coalescent approach provides theoretical understanding in its own right, and it also forms a basis for doing complex trait analyses using estimated genome-wide gene trees [55, 56].

Here we develop the gene-tree-based view using a different derivation from that presented by McVean [87]. For simplicity, we assume haploid genetic data; the extension to diploid data is straightforward but tedious [87]. First, let 𝒯 be a tree (including branch lengths) and ℚ be the measure on tree space induced by the population history. In addition, let *μ* be the mutation rate and *T* the total branch length in 𝒯. Then, the density of a tree conditional on it having a mutation in the infinite sites model is

$$f(\mathcal{T} \mid \text{mutation}) = \frac{T \, \mathrm{d}\mathbb{Q}(\mathcal{T})}{\mathbb{E}_{\mathbb{Q}}(T)}.$$
This formula indicates that conditioning on a gene tree having a mutation results in trees with more total branch length. Hence, the distribution of gene trees that underlie SNPs is different from the unconditional distribution of gene trees; in particular, gene trees that have a SNP will tend to have more total branch length than those that do not have a SNP.

We also need that

$$\mathbb{E}(G_{il} G_{jl} \mid \mathcal{T}, \text{mutation}) = \frac{T_{ij}}{T},$$

where *T*_{ij} is the length of time in the tree shared by samples *i* and *j*. This fits the intuition that, conditional on a site being variable, individuals will share that mutation only if it occurs in a common ancestor of those two individuals. Then,

$$\Sigma_{ij} \propto \mathbb{E}(G_{il} G_{jl} \mid \text{mutation}) = \int \frac{T_{ij}}{T} \frac{T \, \mathrm{d}\mathbb{Q}(\mathcal{T})}{\mathbb{E}_{\mathbb{Q}}(T)} = \frac{\mathbb{E}_{\mathbb{Q}}(T_{ij})}{\mathbb{E}_{\mathbb{Q}}(T)}. \qquad (12)$$
We note that this formula explicitly requires conditioning on the site being variable and that we are ignoring the effects of linkage disequilibrium among sites.

The connection between the genotype-based view of the GRM and the gene-tree-based view, in which Σ is viewed as the conditional (on the gene tree(s)) expectation of the GRM over random mutations, developed by McVean [87], has been the basis for several recent methods in statistical genetics [55, 56, 88] and in phylogenetics [89]. (The expectation can also be computed for versions of the GRM that standardize genotypes by a function of the allele frequency [56, 88].) For example, Link and colleagues [55] use estimated gene trees in a region of the genome to compute the expectation of a local GRM formed from neutral variants falling on the estimated trees as a Poisson process. These matrices are then used as input to a variance-components model, which brings some advantages in mapping QTLs. Specifically, the resulting (conditional) expected genetic relatedness matrices naturally incorporate LD, providing better estimates of local genetic relatedness than could be formed from a handful of SNPs in a local region [55, 88].

### The phylogenetic variance-covariance matrix

In an extreme case, we might consider only variation among long-separated species. In this case, there may be only a single tree that describes the relationships among species, and the expectation over gene trees used in the previous subsection can be dropped. Then the entries of the relatedness matrix Σ, which in the case of phylogenetic methods is referred to as the phylogenetic variance-covariance (or vcv), are given by

$$\Sigma_{ij} = \mathbb{E}(L)\, \mathbb{E}(\beta^2)\, \frac{T_{ij}}{T}, \qquad (13)$$

where *L* is the number of causal loci. This can be recognized as the covariance under the Brownian motion model [43] commonly used to model continuous traits in phylogenetics, given a phylogenetic tree, when setting the diffusion rate *σ*^{2} of the Brownian motion process to

$$\sigma^2 = \frac{\mathbb{E}(L)\, \mathbb{E}(\beta^2)}{T}. \qquad (14)$$
The formulation of equation 14 may look unfamiliar in phylogenetics, where the Brownian motion rate is typically taken to be *V*_{G}*/N*, where *V*_{G} is the additive genetic variance and *N* is the effective population size, following Lande [90], or simply *U* 𝔼 (*β*^{2}), where *U* represents the total mutation rate toward causative alleles, following Lynch and Hill [91]. To reconcile our result with the existing literature, note that 𝔼 (*L*) = *UT*, so that

$$\sigma^2 = \frac{UT\, \mathbb{E}(\beta^2)}{T} = U\, \mathbb{E}(\beta^2),$$
as shown by Lynch and Hill [91]. Further, under a neutral model, the equilibrium additive genetic variance *V*_{G} is proportional to *NU* 𝔼 (*β*^{2}) [92]. Thus, under neutrality,

$$\frac{V_G}{N} \propto U\, \mathbb{E}(\beta^2) = \sigma^2,$$
showing that under a neutral model, the Lande formulation is equivalent to the Lynch and Hill formulation, up to constants that depend on ploidy. Thus, we see that our equation 14 matches familiar formulations in the literature [93].

Consistent with previous arguments (e.g., [34]), this result also implies that one straightforward interpretation of the standard PGLS model is that it stratifies the regression between focal variables by an unobserved variable (or variables) that evolved primarily by drift. Hansen and colleagues have pointed out that this may not be an appropriate model for testing for adaptation [31, 32, 94], which was the primary motivation for developing many comparative methods in the first place [95]. Moreover, recently, standard PGLS has fallen into question in scenarios in which there is discordance between the gene trees and the species tree [96, 97]. Our formulation makes it clear that the standard PGLS formulation only applies when there is a single tree underlying all loci; if there is instead a distribution of gene trees, equation 12 suggests that the appropriate thing to do is to average over gene trees, as suggested by Hibbins *et al*. [97]. Nonetheless, as we illustrate below, the fact that the standard phylogenetic regression falls out as a special case of the same general model as standard statistical genetics models is useful in practice, even if not always in theory.
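The construction of Σ from a tree can be made concrete. The sketch below is our own toy construction for the simplest case—two star clades joined at the root, anticipating the Felsenstein-style "worst case" simulations discussed later—where each entry of the vcv is the branch length shared from the root:

```python
import numpy as np

def two_clade_vcv(n_per_clade, t_internal, t_tip):
    """Phylogenetic vcv for two star clades joined at the root.

    Sigma_ij is the branch length shared from the root: t_internal for two
    tips in the same clade, 0 for tips in different clades, and
    t_internal + t_tip on the diagonal.
    """
    n = 2 * n_per_clade
    Sigma = np.zeros((n, n))
    for clade in (0, 1):
        s = slice(clade * n_per_clade, (clade + 1) * n_per_clade)
        Sigma[s, s] = t_internal                 # shared internal branch
    np.fill_diagonal(Sigma, t_internal + t_tip)  # each tip's total depth
    return Sigma
```

Multiplying this matrix by a Brownian motion rate gives the trait covariance under the model above; general trees require walking the tree to find each pair's shared path length, but the principle is the same.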

### How the same type of unmodeled structure misleads both GWAS and phylogenetic regressions

The result of the previous section that standard models in statistical genetics and phylogenetics are closely related immediately suggests that these models might suffer the same pathologies under model misspecification, and that solutions to these pathologies could be shared across domains. Here we illustrate this by studying the problem of how unmodeled (phylo)genetic structure biases estimates of regression covariates. This problem has received much attention in both the statistical genetics [98, 99] and phylogenetics literature [33, 34, 100], but the approaches taken in the two fields differ.

We assume that we have a sample of size *n* with a vector of a predictor, *x* = (*x*_{1}, *x*_{2}, …, *x*_{n})^{T}, and a trait, *y* = (*y*_{1}, *y*_{2}, …, *y*_{n})^{T}. In the context of genome-wide association studies, *x* may be the (centered) genotypes at a locus to be tested for association, while in the context of phylogenetics, *x* is often an environmental variable or another trait that is hypothesized to influence *y*. Then, the regression model is

$$y_i = x_i \beta + G_i + E_i, \qquad (15)$$
where *G*_{i} and *E*_{i} are the genetic and environmental components, as in equation 2, and *β* is the effect of *x* on *y*. In genome-wide association studies, *β* is the effect size of the locus being examined, while in phylogenetics it may quantify the effect of an environmental variable or other continuous trait, rather than the effect of an allele.

### Theoretical analysis

To understand the purpose and limitations of corrections for (phylo)genetic structure, we examined the properties of the estimators of regression coefficients with and without correction for (phylo)genetic structure. The simplest estimator, $\hat{\beta}_{OLS}$, is the ordinary least squares estimator. In that case, the estimated effect size is

$$\hat{\beta}_{OLS} = (x^T x)^{-1} x^T y = \frac{x^T y}{x^T x}.$$
Because equation 15 shows that *y* is (phylo)genetically structured with covariance matrix Σ, we can expand in terms of the eigenvectors and eigenvalues of Σ, which form an orthonormal basis of ℝ^{n}, by diagonalizing,

$$\Sigma = V \Lambda V^T,$$
where *V* = [*v*_{1} *v*_{2} *· · · v*_{n}] is a matrix whose columns are the eigenvectors of Σ, and Λ = diag(*λ*_{1}, *λ*_{2}, …, *λ*_{n}) is a diagonal matrix whose diagonal is the eigenvalues of Σ. This decomposition is always possible because Σ is symmetric and positive semidefinite by virtue of being a covariance matrix. Then,

$$\hat{\beta}_{OLS} = \frac{\sum_{k=1}^{n} (x^T v_k)(v_k^T y)}{x^T x}. \qquad (16)$$
This shows that we can conceptualize the ordinary least squares estimator as adding up the correlations between *x* and *y* projected onto each eigenvector of Σ. Loosely, large-magnitude slope estimates arise when *x* and *y* both project with large magnitude onto one or more eigenvectors of Σ. If an eigenvector of Σ is correlated with a confounding variable, such as the underlying (phylo)genetic structure, then *x* and *y* may both have substantial projections onto it, even if *x* and *y* are only spuriously associated due to the confound.
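This decomposition of the OLS estimator is easy to verify numerically. In the sketch below (our construction, with arbitrary random data), the ordinary slope and the sum over eigenvector projections agree exactly because the eigenvectors form an orthonormal basis (*V V*^{T} = *I*):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
A = rng.normal(size=(n, n))
Sigma = A @ A.T                      # a symmetric PSD "covariance" matrix
lam, V = np.linalg.eigh(Sigma)       # Sigma = V diag(lam) V^T

x = rng.normal(size=n)
y = rng.normal(size=n)

# Ordinary least squares slope...
beta_ols = (x @ y) / (x @ x)
# ...equals the sum of x-y products projected onto each eigenvector,
# because V V^T = I implies x^T y = sum_k (x^T v_k)(v_k^T y).
beta_eig = sum((x @ V[:, k]) * (V[:, k] @ y) for k in range(n)) / (x @ x)
```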

Two seemingly distinct approaches have been proposed to address this issue. First, researchers have proposed including the eigenvectors of Σ as covariates. In the phylogenetic setting, this is known as phylogenetic eigenvector regression [101]. (In practice, researchers often use the eigenvectors of a distance matrix derived from the phylogenetic tree rather than Σ itself, but these two matrices have a straightforward mathematical connection [102]). In the statistical genetics setting, the analogous approach is to include the principal component projections of the data that are used to generate the genetic relatedness matrix—i.e., the principal components of the genotype matrix [23]—in the regression. To see that the eigenvectors of Σ are equivalent to the projections of the data onto the principal components, suppose we have an *n × L* genotype matrix *G* whose rows represent individuals and whose columns represent genetic loci. Entry *i, j* contains the number of copies of a non-reference allele carried by individual *i* at locus *j*. Then, the projection of the data onto the *k*th principal component is given by

$$s_k = G w_k,$$

where *w*_{k} is the *k*th eigenvector of the *L × L* covariance matrix of the genotypes, i.e. *G*^{T}*Gw*_{k} = *λ*_{k}*w*_{k}. Then,

$$\begin{aligned} G^T G w_k &= \lambda_k w_k \\ (G G^T)(G w_k) &= \lambda_k (G w_k), \end{aligned}$$
where the second line follows from left multiplying the first line by *G*. For clarity, we emphasize that this is distinct from computing PCs on the phenotypic matrix (if multiple phenotypes are measured), which is intended to solve a different problem [103, 104].
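This equivalence can be checked numerically: if *w*_{k} is an eigenvector of *G*^{T}*G*, the score vector *Gw*_{k} is an eigenvector of *GG*^{T} with the same eigenvalue. A minimal sketch, using a random matrix as a stand-in for a centered genotype matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
n, L = 30, 80
G = rng.normal(size=(n, L))          # stand-in for a centered genotype matrix

# w_k: eigenvectors (loadings) of the L x L matrix G^T G...
lamL, W = np.linalg.eigh(G.T @ G)
# ...so the PC scores G w_k are eigenvectors of the n x n matrix G G^T
# with the same eigenvalues: (G G^T)(G w_k) = G (G^T G w_k) = lambda_k (G w_k).
score = G @ W[:, -1]                 # projection onto the leading PC
lhs = (G @ G.T) @ score
```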

Now, we construct a design matrix *X* = [*x v*_{1} *v*_{2} *· · · v*_{J} ], whose columns are the predictor *x*, followed by the first *J* eigenvectors of Σ. (In theory, any set of eigenvectors of Σ might be included in the design matrix, but in statistical-genetic practice, it is typically the leading eigenvectors.) Then, using standard theory, the vector of coefficient estimates is

$$\hat{\beta} = (X^T X)^{-1} X^T y.$$
After some linear algebra (see Appendix), the estimate of the coefficient of the predictor *x* is

$$\hat{\beta}_x = \frac{\sum_{k=J+1}^{n} (x^T v_k)(v_k^T y)}{\sum_{k=J+1}^{n} (x^T v_k)^2}, \qquad (17)$$
i.e., it is the OLS estimator, but the first *J* eigenvectors of Σ are removed. This shows why inclusion of the PCs as covariates can correct for (phylo)genetic structure: it simply eliminates some of the dimensions on which *x* and *y* may covary spuriously. However, it also shows the limitations of including principal components as covariates. First, because it is simply cutting out entire dimensions, it can result in a loss of power. Second, confounding that aligns with principal components that are not included in the design matrix is not corrected.
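The equivalence—OLS with the leading eigenvectors as covariates equals the estimator with those dimensions excised—can also be verified numerically. A sketch with arbitrary random data, using numpy's least-squares solver for the multiple regression:

```python
import numpy as np

rng = np.random.default_rng(3)
n, J = 40, 3
A = rng.normal(size=(n, n))
_, V = np.linalg.eigh(A @ A.T)       # orthonormal eigenvectors of a PSD Sigma
V = V[:, ::-1]                       # leading eigenvectors first
x, y = rng.normal(size=n), rng.normal(size=n)

# OLS with the first J eigenvectors included as covariates:
X = np.column_stack([x, V[:, :J]])
beta_x = np.linalg.lstsq(X, y, rcond=None)[0][0]

# Same coefficient from OLS with those J dimensions simply removed:
num = sum((x @ V[:, k]) * (V[:, k] @ y) for k in range(J, n))
den = sum((x @ V[:, k]) ** 2 for k in range(J, n))
```

The agreement follows from the Frisch-Waugh-Lovell theorem: residualizing *x* on an orthonormal set of covariates leaves exactly its projection onto the remaining eigenvectors.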

The second approach is, rather than including the eigenvectors of Σ as covariates, to use Σ itself to model the residual correlation structure. In phylogenetic biology, this is accomplished using phylogenetic generalized least squares (PGLS) [37, 38], whereas in statistical genetics this is accomplished using linear mixed models (LMM) [27, 105]. In both settings, it is common to add a “white noise” or “environmental noise” term, such that the residual covariance structure is $\sigma_g^2 \Sigma + \sigma_e^2 I$, where $\sigma_g^2$ scales the contribution of genetics, $\sigma_e^2$ scales the contribution of environment, and *I* is the identity matrix. In the context of phylogenetics, the relative sizes of $\sigma_g^2$ and $\sigma_e^2$ are of interest when estimating the measure of phylogenetic signal known as Pagel’s lambda [106, 107], whereas in statistical genetics, they are the subject of heritability estimation [108]. Then, both PGLS and LMM approaches model the data as

$$y = x\beta + \epsilon, \qquad \epsilon \sim \mathcal{N}\!\left(0, \sigma_g^2 \Sigma + \sigma_e^2 I\right),$$
where $\sigma_g^2$ and $\sigma_e^2$ are typically estimated, for example by maximum likelihood [109], residual maximum likelihood [75], Haseman-Elston regression [110], or other methods [27, 105, 111]. For the theoretical analysis that follows, we assume $\sigma_g^2 = 1$ and $\sigma_e^2 = 0$. This does not restrict the applicability of our analysis, because $\sigma_g^2 \Sigma + \sigma_e^2 I$ has the same eigenvectors as Σ, with corresponding eigenvalues $\sigma_g^2 \lambda_i + \sigma_e^2$, where *λ*_{i} are the eigenvalues of Σ.

With these assumptions, the regression coefficient can be estimated via generalized least squares,

$$\hat{\beta}_{GLS} = (x^T \Sigma^{-1} x)^{-1} x^T \Sigma^{-1} y. \qquad (18)$$
By diagonalizing Σ (see Appendix), the estimated regression coefficient is

$$\hat{\beta}_{GLS} = \frac{\sum_{k=1}^{n} (x^T v_k)(v_k^T y) / \lambda_k}{\sum_{k=1}^{n} (x^T v_k)^2 / \lambda_k}. \qquad (19)$$
Like the ordinary least squares estimator in equation 16, this expression includes all the eigenvectors of Σ. However, it downweights each eigenvector according to its eigenvalue. Thus, GLS downweights dimensions according to their importance in Σ, which aims to describe the structure according to which *x* and *y* may be spuriously correlated. However, unlike equation 17, it retains all dimensions. Compared with adjusting for the leading eigenvectors of Σ using OLS, the GLS approach retains some ability to detect contributions to associations that align with the leading eigenvectors. It also adjusts for Σ in its entirety, rather than just its leading eigenvectors. This means that it adjusts for even very recent (phylo)genetic structure, which will likely not be encoded by the leading eigenvectors. That said, one disadvantage of GLS is that it assumes that all eigenvectors of Σ contribute to confounding in proportion to their eigenvalues, potentially resulting in an inability to completely control for confounding if the effect of an eigenvector of Σ is not proportional to its eigenvalue, as may be the case with, for example, environmental confounding.

Where sample sizes and computational resources allow it, the state-of-the-art practice in statistical genetics is to use a linear mixed model framework while *also* including principal components as covariates [27, 105, 111]. This at first may seem surprising, because it seems to be controlling for Σ twice. However, the analysis above suggests that including the principal components as covariates and using generalized least squares have different, and perhaps complementary, outcomes. To see how they interact, again construct the design matrix *X* = (*x v*_{1} *v*_{2} … *v*_{J}), whose columns are the predictor *x*, followed by the first *J* eigenvectors of Σ, and use the generalized least squares estimator,

$$\hat{\beta} = (X^T \Sigma^{-1} X)^{-1} X^T \Sigma^{-1} y.$$
After some linear algebra (see Appendix), we again obtain the estimate of the regression coefficient of *x*,
Thus, combining the principal components with Σ in generalized least squares may provide the benefits of both approaches: if there is confounding in an eigenvector of Σ that is “too large” (i.e., out of proportion with its associated eigenvalue), then if that eigenvector is included in the design matrix, it will simply be excised from the estimator, as in equation 17. However, we still maintain the ability to control for spurious association between *x* and *y* due to the structure of Σ but not along included eigenvectors, as in equation 19. The major difficulty is in identifying the eigenvectors of Σ that might be associated with confounding effects larger than their corresponding eigenvalues would suggest.
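This interplay is easy to check numerically. The sketch below (using `numpy`; the covariance matrix, sample size, and seed are arbitrary illustrative choices, not from our analyses) builds a design matrix containing a predictor plus the leading eigenvector of Σ and confirms that, once that eigenvector is included as a covariate, confounding added to the response along that direction has no effect on the estimated coefficient of *x*:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50

# A toy positive-definite Sigma standing in for (phylo)genetic covariance.
A = rng.normal(size=(n, n))
Sigma = A @ A.T + n * np.eye(n)
eigvals, eigvecs = np.linalg.eigh(Sigma)
v1 = eigvecs[:, -1]            # leading eigenvector (largest eigenvalue)

x = rng.normal(size=n)
y = rng.normal(size=n)

def gls_coef_on_x(y, X, Sigma):
    """First entry of the GLS estimate (X^T Sigma^-1 X)^-1 X^T Sigma^-1 y."""
    Si = np.linalg.inv(Sigma)
    return np.linalg.solve(X.T @ Si @ X, X.T @ Si @ y)[0]

X = np.column_stack([x, v1])   # predictor plus the leading PC as a fixed effect

b0 = gls_coef_on_x(y, X, Sigma)
b_shift = gls_coef_on_x(y + 5.0 * v1, X, Sigma)  # confounding along v1
print(np.isclose(b0, b_shift))                   # v1's contribution is excised
```

Because the eigenvector is itself a column of the design matrix, its coefficient absorbs any component of the response along that direction, which is exactly the "excision" described above; confounding along eigenvectors *not* included in the design is still handled by the Σ weighting.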

### Simulation analysis

To put the intuition developed from the previous subsection into practice, we performed simulations in both phylogenetic and statistical-genetic settings. First, to explore how the different approaches outlined above correct for both (phylo)genetic structure and environmental confounding, we performed simulations inspired by Felsenstein’s “worst case” scenario [34, 46]. Felsenstein’s worst case supposes that there are two diverged groups of samples that are measured for two variables *x* and *y*, which are then tested for association; the only (phylo)genetic structure is between the two groups. In the phylogenetic setting, we represent the two clades as star trees with 100 tips each, connected by internal branches, and we simulate *x* and *y* as arising from independent instances of Brownian motion along the tree (see Methods). In the statistical genetics setting, we used msprime [112] to simulate 100 diploid samples from each of two populations, and then simulated quantitative traits using the alpha model [22] (see Methods). In this setting, McVean [87] showed that the first principal component captures population membership; hence, we only include the first principal component to capture any residual confounding. To perform inference in the phylogenetic case, we used the package `phylolm` [109], and for the statistical-genetic case, we used a custom implementation of `GREML` [75].

We first explored the impact of deepening the divergence between the two clades, starting from no divergence and increasing to high divergence (Figure 1A,C). As expected, we see that ordinary least squares fails to control for the population stratification as the divergence time becomes large, resulting in high false positive rates. However, all of the other approaches appropriately control for the population stratification. This is as expected: in the case of two populations, all of the (phylo)genetic stratification is due to the accumulation of genetic variants in each group. Hence, either discarding the correlation between *x* and *y* on the dimension corresponding to group membership as in equation 17 or downweighting it as in equation 19 is sufficient to remove the confounding effect of the population stratification.
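A minimal version of this simulation can be run without any phylogenetic software, because under Brownian motion the tip values are multivariate normal with covariance given by shared branch lengths. The sketch below (parameter values are illustrative, not those used in our simulations) builds the two-star-clade covariance matrix, simulates independent traits, and compares the false positive rate of OLS with and without a clade-membership covariate, which here plays the role of the first principal component:

```python
import numpy as np

rng = np.random.default_rng(1)
m, t_div = 100, 4.0          # tips per clade; divergence (internal branch length)
n = 2 * m

# Two star clades: tip variance t_div + 0.5, within-clade covariance t_div,
# zero covariance across clades.
Sigma = np.zeros((n, n))
Sigma[:m, :m] = t_div
Sigma[m:, m:] = t_div
np.fill_diagonal(Sigma, t_div + 0.5)
L = np.linalg.cholesky(Sigma)

group = np.concatenate([np.zeros(m), np.ones(m)])  # stands in for PC1

def ols_t_on_x(x, y, covariates=None):
    """t statistic for x in OLS of y on (intercept, x, covariates)."""
    cols = [np.ones_like(x), x] + ([] if covariates is None else covariates)
    X = np.column_stack(cols)
    XtXi = np.linalg.inv(X.T @ X)
    b = XtXi @ X.T @ y
    resid = y - X @ b
    s2 = resid @ resid / (len(y) - X.shape[1])
    return b[1] / np.sqrt(s2 * XtXi[1, 1])

reps, crit = 500, 1.96
fp_ols = fp_pc = 0
for _ in range(reps):
    x = L @ rng.normal(size=n)   # independent Brownian traits, no true effect
    y = L @ rng.normal(size=n)
    fp_ols += abs(ols_t_on_x(x, y)) > crit
    fp_pc += abs(ols_t_on_x(x, y, covariates=[group])) > crit
print(fp_ols / reps, fp_pc / reps)
```

With deep divergence, plain OLS rejects the null far more often than the nominal 5%, while adding the single group covariate restores calibration, since in this two-clade setting all the stratification lies along group membership.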

Despite the success of both OLS with principal components and generalized least squares in controlling for population stratification, it has recently been recognized that phylogenetic generalized least squares does not control for all types of confounding in Felsenstein’s worst case: for example, if there is a large shift in *x* and *y* on the branch leading to one of the groups, phylogenetic generalized least squares produces high false positive rates [34]. Because including the first principal component will completely eliminate the contribution to the estimated coefficient that projects along group membership, whereas generalized least squares will only downweight it, we reasoned that including the first principal component in either ordinary or generalized least squares should restore control even in the presence of large shifts.

We tested our hypothesis using simulations with a divergence time at which ordinary least squares was not sufficient to correct for population stratification. In the phylogenetic case, we simulated an additional shift in one of the clades for both *x* and *y* by sampling from independent normal distributions, while in the statistical-genetic case, we simulated an environmental shift sampled from a normal distribution in one of the clades (Figure 1B,D). As expected, ordinary least squares is insufficient to address the confounding, and becomes increasingly prone to false positives as the size of the shift increases. In line with our hypothesis, phylogenetic generalized least squares and linear mixed modeling also fail to control for the shift as it becomes large, while including just a single PC in each case is sufficient to regain control over false positives.
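This failure mode can likewise be sketched directly. In the toy simulation below (again with illustrative parameter choices), the same unreplicated shift is added to both traits in one clade; GLS using the Brownian Σ merely downweights the clade-contrast dimension and so produces many false positives, whereas adding the clade indicator (standing in for PC1) as a fixed effect excises that dimension entirely:

```python
import numpy as np

rng = np.random.default_rng(2)
m, t_div, shift = 100, 1.0, 15.0   # tips per clade; divergence; shift size
n = 2 * m

# Two-star-clade Brownian covariance, as in Felsenstein's worst case.
Sigma = np.zeros((n, n))
Sigma[:m, :m] = t_div
Sigma[m:, m:] = t_div
np.fill_diagonal(Sigma, t_div + 0.5)
L = np.linalg.cholesky(Sigma)
Si = np.linalg.inv(Sigma)

grp = np.concatenate([np.ones(m), np.zeros(m)])  # clade-1 indicator

def gls_t_on_x(x, y, extra=None):
    """t statistic for x under GLS with covariance Sigma."""
    cols = [np.ones(n), x] + ([] if extra is None else [extra])
    X = np.column_stack(cols)
    XtXi = np.linalg.inv(X.T @ Si @ X)
    b = XtXi @ X.T @ Si @ y
    resid = y - X @ b
    s2 = resid @ Si @ resid / (n - X.shape[1])
    return b[1] / np.sqrt(s2 * XtXi[1, 1])

reps = 200
fp_gls = fp_pc = 0
for _ in range(reps):
    x = L @ rng.normal(size=n) + shift * grp  # same unreplicated shift
    y = L @ rng.normal(size=n) + shift * grp  # hits both traits
    fp_gls += abs(gls_t_on_x(x, y)) > 1.96
    fp_pc += abs(gls_t_on_x(x, y, extra=grp)) > 1.96
print(fp_gls / reps, fp_pc / reps)
```

The shift is "too large" relative to the eigenvalue of the clade-contrast eigenvector, so GLS alone is badly miscalibrated, while the GLS-plus-indicator model remains near the nominal 5%.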

The preceding analysis might suggest that including principal components as covariates is sufficient to adjust for (phylo)genetic structure while also being superior to generalized least squares in dealing with environmental confounding. Recent work, however, suggests that principal components may not be able to adjust for more subtle signatures of population structure [8, 15, 99, 113]. To explore this, we simulated both phylogenetic regression and a variant association test using a more complicated model of population structure. For the phylogenetic case, we simulated pure birth trees with 200 tips, while in the statistical genetics case, we simulated pure birth trees with 20 tips and sampled 10 diploids from each tip using msprime. Then, as before, we simulated using a Brownian motion model in the phylogenetic case, or an additive model for the statistical genetic case.

As expected, using ordinary least squares without any principal components does not control for population structure in either the phylogenetic or the statistical-genetic setting, while the methods that use generalized least squares estimates of the regression coefficients appropriately model population structure (Figure 2). Although adding additional principal components reduces the false positive rate of ordinary least squares, it does not reach the desired false positive rate of 5%. This is in line with our theoretical analysis: as seen in equation 17, including principal components in ordinary least squares eliminates dimensions that explain the most genetic differentiation, but the correlations on the remaining dimensions are not adjusted. Because there is substantial fine-scale population structure in these simulations, removal of just a few dimensions with large eigenvalues is not sufficient to control for the subtle signature of population structure. On the other hand, generalized least-squares approaches, as shown in equation 19, will continue to correct for population structure that is found deeper into the eigenvectors of the correlation matrix (echoing points previously raised in the phylogenetics literature [114–116]). We also note that while our analysis focuses on the eigenvectors of Σ, we suspect similar lines of reasoning may apply to other situations where eigenvector regression is used, such as in spatial ecology [117].

### A case study of applying PCs to PCMs

Although the eigenvectors of the phylogenetic variance-covariance matrix (or closely related quantities) have often been included in regression models by researchers using phylogenetic eigenvector regression [101], to the best of our knowledge, phylogenetic biologists have not previously used these eigenvectors as fixed effects in a PGLS model—which we have shown above to be an effective strategy in theory. To illustrate the approach in practice, we re-examine a recent study by Cope *et al*. [118] that tested for co-evolution in mRNA expression counts across 18 fungal species. More specifically, these researchers were interested in testing whether genes whose protein products physically interacted (using independent data from [119]) were more likely to have correlated expression counts than those whose protein products did not. They found support for this prediction. While we suspect the core finding is robust, and there are some theoretical reasons to expect that RNA expression counts should be Brownian-like under some selective scenarios [120], other studies have shown expression counts for many genes in this dataset (and many others) are not well-described by a Brownian process [65, 121]. As such, some of their observed correlations could be spurious due to unmodeled phylogenetic structure [34].

We re-analyzed the data of Cope *et al*. [118] with the addition of PCs of (phylogenetic) Σ as fixed effects in the PGLS model (see Methods and Materials for details). Cope *et al*. used a correlated multivariate Brownian model to test their hypothesis, which is slightly different from the more common PGLS approach [122], but the two are close enough for our purposes. We conducted several iterations of the analyses, varying the number of PCs included from 1 to 10; Figure 3A shows how the different species project onto each principal component. We found that, as anticipated, the number of significant correlations decreased as more PCs were included (Figure 3B). However, as more PCs were included, the proportion of significant correlations in gene-expression count data in which the genes are known to physically interact increased (up to about 8 PCs; Figure 3C). If we assume that the significant correlations for physically interacting genes are more likely to be true positives than those for pairs of genes not known to interact physically, then the results suggest that including the PCs in the analysis might reduce the false positive rate while still finding many of the true positives.

Uyeda *et al*. [34] suggest that one way to mitigate the spurious correlations arising from large, unreplicated events (see above sections) would be to simply use indicator variables in the regression model that encode the part of the phylogeny from which a tip descends (i.e., if there was a concordant shift in the means of both traits along a single branch, the indicator variable would be 0 if a species was on one side of that branch and 1 if on the other). This is similar in spirit to the use of hidden Markov models for the evolution of discrete traits [100, 123]. However, as Uyeda *et al*. point out, this leaves open the hard problem of identifying the branches on which to stratify. It is not possible to include an indicator for every branch, as the model would then be overdetermined. Using the simple method borrowed from GWAS of including PCs of Σ as fixed effects in the typical phylogenetic regression may be a promising (partial) solution to the problem of spurious correlations.

### The genetic model vs. the statistical model

We began by adding assumptions to equation 2 in order to show that common practices in disparate areas of genetics can be seen as special cases of the same model. One notable assumption is that of a purely additive model [124] for the phenotype (equation 1). There are two reasons we might be suspicious of this assumption. First, it is debatable to what extent most traits obey the additive model, given evidence of non-additive genetic contributions to traits across species [125, 126]. However, even if non-additive contributions are important for determining individual phenotypes or for understanding traits’ biology, they might still contribute a relatively small fraction of trait variance, meaning they might be safely ignored for some purposes [127–129] (but see [130]). Second, we used a neutral coalescent model to find an expression for the Brownian motion diffusion parameter in terms of the effect sizes of individual loci (equation 14). Although this provides a satisfying justification for the use of a phylogenetic regression model with a Brownian covariance structure and for averaging over gene trees to accommodate ILS (*sensu* [97]), it is likely unreasonable in many situations. It has long been appreciated that, while a population-mean phenotype will be expected to evolve according to a Brownian process under simple quantitative-genetic models of genetic drift [41, 90, 93, 131], the Brownian rate estimated from phylogenetic comparative data is orders of magnitude too slow to be consistent with plausible values for the quantitative-genetic parameters used to derive the Brownian model [93, 131–133]. There are more elaborate explanations than pure genetic drift for why long-term evolution may show relatively simple dynamics [134], but understanding the coalescent patterns of loci under these scenarios is likely challenging [135] and beyond the scope of the present paper.

However, even if one finds the genetic model unreasonable, the equivalence of the *statistical* models used in statistical genetics and phylogenetics still holds: that is, the core structures of the models are the same, whether or not one is willing to interpret the parameters in the same way. Indeed, phylogenetic biologists have been here before, with the realization that PGLMMs are structurally equivalent to the pedigree-based analyses using the animal model from quantitative genetics [38–40], even though the recognition that they were equivalent did not rely on a specific genetic model for phenotypes (above, we prove that they can both be derived from the same genetic model). Nonetheless, the recognition of a structural equivalency between the animal model and the phylogenetic model made it possible to use techniques from quantitative genetics to solve hard problems in phylogenetic comparative methods. For example, inspired by a similar model from [136], Felsenstein developed a phylogenetic threshold model [137, 138], in which discrete phenotypes are determined by a continuous liability that itself evolves according to a Brownian process. Hadfield [139] proved this model was identical to a variant of the animal model and that existing MCMC algorithms could be used to efficiently estimate parameters and extend the threshold model to the multivariate case, which had not been previously derived.

### Incorporating natural selection

All of the connections highlighted in this manuscript depend in some way on the assumption that causal variants are neutral. However, in many species, including humans, there is direct evidence of selection acting on genetic and heritable phenotypic variation [140–147]. In both phylogenetic and statistical genetic traditions, there are attempts to capture the qualitative features of natural selection when performing regression analyses. In phylogenetics, selection is often modeled by the use of the Ornstein-Uhlenbeck process [44, 148–151], which can be derived from a quantitative genetic model of stabilizing selection [90]. Under the Ornstein-Uhlenbeck model, phenotypes on a phylogeny will be drawn toward a common optimum value; hence, correlations between tips in Σ decay faster than expected under a Brownian motion model. Modifying PGLS to include an Ornstein-Uhlenbeck model of the genetic covariance is straightforward, because the data remain multivariate Gaussian [37, 109]. Assuming a single optimum over an entire phylogenetic tree may be unrealistic. To mitigate that issue, several approaches for detecting optimum shifts have been developed [152, 153].
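To make the contrast between the two covariance structures concrete, the sketch below builds tip covariance matrices under Brownian motion and under a stationary Ornstein-Uhlenbeck process for a toy three-taxon tree ((A,B),C); the tree height, split time, and selection strength α are arbitrary illustrative values:

```python
import numpy as np

# Shared branch lengths for ((A,B),C): total height T = 1, A-B split at 0.7.
T = 1.0
t_shared = np.array([[1.0, 0.7, 0.0],
                     [0.7, 1.0, 0.0],
                     [0.0, 0.0, 1.0]])

def bm_cov(t_shared, sigma2=1.0):
    """Brownian motion: covariance is proportional to shared branch length."""
    return sigma2 * t_shared

def ou_cov(t_shared, alpha, sigma2=1.0, T=1.0):
    """Stationary Ornstein-Uhlenbeck: shared history is discounted
    exponentially in the phylogenetic distance 2*(T - t_shared),
    so distantly related tips decorrelate faster than under BM."""
    d = 2.0 * (T - t_shared)                    # distance between each pair
    return sigma2 / (2.0 * alpha) * np.exp(-alpha * d)

C_bm = bm_cov(t_shared)
C_ou = ou_cov(t_shared, alpha=2.0)
print(C_bm)
print(C_ou)
```

Dividing each off-diagonal entry by the diagonal shows the key qualitative difference: under OU, the correlation between sister taxa A and B is smaller than the BM value 0.7, and it shrinks further as α grows, which is why the data remain Gaussian and PGLS needs only a different Σ.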

On the other hand, in statistical genetics, natural selection is sometimes incorporated via the *alpha*-model, in which the effect size of an allele is proportional to (2*p*(1 *− p*))^{−α} [75, 154, 155]. This qualitatively captures the idea that rare alleles will have larger effect sizes than common alleles; however, the realism and accuracy of the alpha model has been questioned [74, 155–157]. A promising path forward in statistical genetics is the development of models that explicitly include the action of natural selection.
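The alpha model is straightforward to simulate. In the sketch below (α and the allele-frequency distribution are arbitrary illustrative choices), effect sizes are drawn with standard deviation proportional to (2*p*(1 *− p*))^{−α}, so rarer alleles receive systematically larger effects:

```python
import numpy as np

rng = np.random.default_rng(3)
alpha = 0.5   # illustrative; alpha = 0 recovers frequency-independent effects

p = rng.uniform(0.01, 0.99, size=10_000)       # allele frequencies
scale = (2 * p * (1 - p)) ** (-alpha)          # alpha-model effect-size scale
beta = rng.normal(0.0, scale)                  # one effect size per variant

# Rare alleles get systematically larger (absolute) effects:
rare = np.abs(beta[p < 0.05]).mean()
common = np.abs(beta[(p > 0.45) & (p < 0.55)]).mean()
print(rare > common)
```

Because the scale diverges as *p* approaches 0 or 1, the model encodes (in a purely phenomenological way) the signature that negative selection keeps large-effect alleles rare.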

### Towards a more integrative study of the genetic bases of phenotypes

Building up this general framework is a step towards inference methods that coherently integrate intra- and interspecific variation to understand the genotype-to-phenotype map and how evolutionary processes, acting at different time scales, shape it. We can envision a number of paths forward for both statistical genetics and phylogenetics that can build on the work we present here.

Nearly two decades of GWAS in humans have revealed thousands of genetic variants associated with complex traits and disease [158–160]. Analyzing the identified variants and their estimated effect sizes shows two key trends: effect sizes tend to be larger for rarer alleles [154, 155, 161–164], and effect sizes tend to be larger in more evolutionarily conserved regions [50, 165–169]. A likely explanation for these patterns is that many traits in humans are evolving subject to stabilizing selection [156, 170–172], which in turn impacts their genetic architecture by determining the distribution of effect sizes as a function of allele frequency [154–156]. Though the value of evolutionary conservation for triaging functional variants in the human genome has long been appreciated, it is becoming even more important as we collect larger and larger samples of human variants, the vast majority of which are extremely rare [173, 174]. In fact, recent work has shown that evolutionary conservation accounts for the vast majority of the predictive power of a state-of-the-art deep learning approach to variant annotation [169, 175]. A limitation of current approaches for utilizing evolutionary conservation is that typically there is no way to include information about the phylogenetic structure of the data (i.e., only multiple sequence alignments are used). Overcoming this is not straightforward and will require mechanistic modeling: the observed level of conservation is a nonlinear function of the strength of selection acting against variants at a locus; small changes in the strength of negative selection can greatly decrease the amount of variability seen on phylogenetic timescales, and this can cause counter-intuitive behavior of conservation scores [54, 176]. We focus here on applications to human genetics, but the same challenges exist whatever the focal organism is.
For example, a recent study identified a variant of large effect for body size in canids by aggregating data from multiple species (both extinct and extant) [177]; while an outstanding discovery, we anticipate that variants with smaller effects might also be identified using similar data with a comprehensive approach for including phylogenetic structure. As such, the paucity of interpretable evolutionary methodology that combines trait-related information within and across species is a barrier to understanding the forces that shape trait-related genetic variation. There is a rich tradition of phylogenetic models that have been developed to depict various evolutionary scenarios [42, 122] that could potentially be leveraged in order to make such inferences. One of our goals in laying out the underlying connections among these modeling traditions is to facilitate the transfer of these models across disciplinary divides.

In phylogenetics, the PhyloG2P research program [57] also encompasses a wide spectrum of processes at different levels of divergence. The so-called PhyloGWAS method [178] is used to identify causal variants for a phenotype that segregated in ancestral populations but are fixed in descendant lineages. In other cases, researchers have used phylogenetic approaches to investigate convergent mutations associated with phenotypic convergence across lineages (e.g., [179]). And then there are methods, such as RERConverge [174] and PhyloACC [180], that identify regions with a relatively high number of substitutions (but not necessarily the same ones) in phylogenetically distinct lineages that have convergently evolved the same phenotype, presumably due to positive selection or relaxed negative selection [57]. For example, Sackton *et al*. [181] used such an approach to identify regulatory regions that had high rates of evolution in lineages of flightless birds; they also demonstrated that some of these regions influence wing development using experimental perturbations. Such rate association tests seem to be very similar, both conceptually and likely statistically, to techniques used in rare-variant association studies, which look for local enrichment of rare variants in cases vs. controls, rather than associating single variants with phenotype [52, 182–184].

More generally, integration over gene trees may be necessary for a wide range of phylogenetic methods. In a series of papers, Hahn and colleagues [60, 96, 97, 185, 186] have pointed out that conducting comparative tests using the species tree can be misleading: if the genes underlying a phenotype of interest have different topologies than that of a species tree—due to incomplete lineage sorting (ILS) and other processes [187]—researchers may mistake similarity due to common ancestry for convergence. This phenomenon, known as hemiplasy [188], may be relatively common [185, 186], motivating the development of ILS-aware phylogenetic regression [97] and PhyloG2P methods [189].

There are clear biological rationales explaining why various types of analyses will be more or less informative at different timescales. But this is a difference of degree and not of kind. And the different methodological traditions in statistical genetics and phylogenetics are just that — traditions. There is no reason a researcher should think about the problem of trait mapping in a fundamentally distinct way just because she happened to be trained in a statistical genetics or phylogenetics lab. Ultimately, we should work to take the best ideas from both of these domains and blend them into a more cohesive paradigm that will facilitate richer insights into the molecular basis of phenotypes.

## Materials and Methods

### Simulation details

To perform phylogenetic simulations, we used the `fastBM` function from the `phytools` R package [190]. In all cases, Brownian motions were simulated independently and with rate 1. When performing phylogenetic simulations of Felsenstein’s Worst Case, we used `stree` from `ape` [191] to simulate two star trees of 100 tips, where each tip in the star tree had length 0.5. We then connected the two star trees using internal branches of varying length. To add a non-Brownian confounder, in each simulation we added an independent normal random variable with varying standard deviations to the *x* and *y* values for individuals from clade 1. (Within a given simulation, all individuals in clade 1 were augmented by the *same* value for each trait, while between simulations, the confounding effect was a random draw.) When performing simulations in a more complicated phylogeny, we used `TreeSim` [192] to generate pure-birth trees with birth rate = 1 and complete taxon sampling. Each simulation replicate used a different tree. For ordinary least squares on phylogenetic data, we used the `R` function `lm`. For PGLS on phylogenetic data, we used the `R` package `phylolm` [109] with the Brownian motion model and no environmental noise.

To perform genome-wide association study simulations, we first generated neutral tree sequences and mutations using `msprime` [112]. To ensure our results were not simply due to genetic linkage, we simulated a high recombination rate of 10^{−5} per generation with a mutation rate an order of magnitude lower, 10^{−6} per generation. We first simulated causal variants on a sequence of length 100000, and generated phenotypes by sampling an effect size for each variant from a normal distribution with mean 0 and variance proportional to (2*p*_{l}(1 *− p*_{l}))^{−α}, where *p*_{l} is the allele frequency of variant *l*. We then created each individual’s phenotype using the additive model, equation 1. We then added environmental noise so that the trait’s heritability was less than 1. In all simulations, every population had diploid population size 10000. To simulate the variant being tested for association, we simulated independent tree sequences and mutations and selected a random variant with allele frequency greater than 0.1. When simulating a GWAS analogue of Felsenstein’s Worst Case, we drew 100 diploid samples from each population, and varied the divergence time of the two populations. To include an environmental shift in one population, we added a normal random variable with varying standard deviation only to individuals in population 1. To simulate under a more complicated population structure, we simulated 20-tip pure-birth trees using `TreeSim` with a birth rate of 5. We then multiplied all branch lengths by 10000 to convert them into generations, and imported them into `msprime` using the `from_species_tree` function. We then generated tree sequences and mutations, sampling 10 diploid individuals from each population. Note that each replicate simulation was performed on an independent random population tree. We performed association testing using a custom `python` implementation of the linear mixed model.
We first used restricted maximum likelihood to estimate the genetic and environmental variance components, and then used generalized least squares to estimate the regression coefficients and their standard errors.

### Phylogenetic analysis of yeast gene expression data

We obtained the species tree, gene expression matrix, and list of physically interacting genes from https://github.com/acope3/GeneExpression_coevolution [118]. We then randomly subsampled 500 genes that had measurements in at least 15 of the 20 species to test for association, resulting in 124750 pairs. Because of differential missingness among genes, we computed phylogenetic PC loadings only on the subtree for which both genes had data present, meaning that each pair may have had slightly different PC loadings. We then used `phylolm` [109] with no measurement error to estimate the regression coefficient. For each number of principal components included, we corrected for multiple testing by controlling the FDR at 0.05 using the Benjamini-Hochberg procedure [193].
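The Benjamini-Hochberg step-up procedure is simple to implement; for reference, a `python` sketch with made-up p-values (our analyses used a standard implementation) is:

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Return a boolean mask of discoveries controlling the FDR at level q."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    thresh = q * np.arange(1, m + 1) / m       # i/m * q for the i-th smallest p
    passed = p[order] <= thresh
    # Step-up: reject everything up to the LARGEST i whose p passes.
    k = np.max(np.nonzero(passed)[0]) + 1 if passed.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.27, 0.60]
print(benjamini_hochberg(pvals, q=0.05))
```

For these six p-values, only the two smallest fall below their step-up thresholds, so only they are declared discoveries at FDR 0.05.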

## Data availability

Code used to generate the results in this paper can be found at https://github.com/Schraiber/PGLS_GWAS

## Acknowledgements

We thank Matt Hahn, Nick Mancuso, and members of the Pennell, Edge, and Mooney labs for their thoughtful comments on parts of this study. Alex Cope provided additional guidance on our analysis of the yeast gene expression data.

## Appendix: Covariance of the genetic component of the phenotype with pedigree-based kinship

With a known pedigree but no genotype data, we might derive the covariances among individuals in the genetic component of a trait by fixing the (unobserved) effect sizes and allele frequencies. For simplicity, we will also assume that individuals are not inbred. Given the pedigree, at any given locus, diploid individuals *i* and *j* have some probability of inheriting 0, 1, or 2 alleles identical by descent (IBD). Call these probabilities *r*_{0}, *r*_{1}, and *r*_{2}, respectively. For example, for a parent-offspring pair, *r*_{1} = 1, and for a pair of full siblings, *r*_{0} = 1*/*4, *r*_{1} = 1*/*2, and *r*_{2} = 1*/*4.

Following equation 4, the covariance of the genetic component of the trait for individuals *i* and *j* is
As in the main text, we assume that genotypes at distinct loci are independent, causing the second term to vanish. With the effect sizes treated as fixed constants, we have
The *G* values are allelic counts of a non-reference allele (we assume the locus is biallelic), and if we fix the non-reference allele frequencies *p*_{l}, then 𝔼 (*G*_{il}) = 𝔼 (*G*_{jl}) = 2*p*_{l}, giving
To proceed, we need 𝔼 (*G*_{il}*G*_{jl}) given the IBD probabilities *r*_{0}, *r*_{1}, and *r*_{2}. We thus compute the desired expectation conditional on each IBD state. *G*_{il}*G*_{jl} = 1 if both individuals are heterozygous, *G*_{il}*G*_{jl} = 2 if one individual is heterozygous and the other is homozygous for the non-reference allele, and *G*_{il}*G*_{jl} = 4 if both individuals are homozygous for the non-reference allele. If individuals *i* and *j* share no alleles IBD, using Hardy–Weinberg genotype probabilities (per the assumption of no inbreeding), the conditional expectation is therefore
A similar calculation given one allele inherited IBD gives
(To explain the first line: for the first term, the individuals are both heterozygous if the IBD allele is of the non-reference type and both of the non-IBD alleles are reference alleles, or if the IBD allele is a reference allele and both of the non-IBD alleles are non-reference alleles. For the second term, since one individual is homozygous for the non-reference allele, the IBD allele must be of the non-reference type, and of the other two alleles, one must be of each type. For the third term, both individuals are homozygous if the IBD allele and both non-IBD alleles are non-reference alleles.)

If the individuals share two alleles IBD, then
Combining these terms weighted by the IBD probabilities gives the desired expectation,
where the last line comes from noticing that *r*_{0} + *r*_{1} + *r*_{2} = 1 and defining the kinship coefficient *θ*_{ij} = *r*_{1}*/*4 + *r*_{2}*/*2, equal to the probability that a pair of alleles chosen at random, one from individual *i* and one from individual *j*, is IBD.

We can now return to the main covariance of interest by plugging the expression for 𝔼 (*G*_{il}*G*_{jl}) from equation 21 into equation 20, giving
The final line comes from noting that the sum in the previous line is the additive genetic variance (*V*_{A}) under the assumptions here, that is, the variance of the genetic component of the phenotype among outbred individuals. This is the result required in the main text.
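This identity is easy to verify by simulation. The sketch below (locus count, family count, and distributions are arbitrary illustrative choices) draws unlinked parental haplotypes at Hardy-Weinberg proportions, generates pairs of full siblings, and checks that the empirical covariance of their genetic values matches 2*θ* *V*_{A} = *V*_{A}/2 for full siblings (*θ* = 1/4):

```python
import numpy as np

rng = np.random.default_rng(4)
n_loci, fams = 200, 20_000
p = rng.uniform(0.1, 0.9, size=n_loci)     # allele frequencies
beta = rng.normal(size=n_loci)             # fixed effect sizes

# Parental haplotypes drawn at Hardy-Weinberg proportions (no inbreeding).
mom = rng.random((2, fams, n_loci)) < p
dad = rng.random((2, fams, n_loci)) < p

def child_genotype(rng):
    """Each child inherits one randomly chosen haplotype allele per parent."""
    mi = rng.integers(2, size=(1, fams, n_loci))
    di = rng.integers(2, size=(1, fams, n_loci))
    return (np.take_along_axis(mom, mi, 0)[0].astype(int)
            + np.take_along_axis(dad, di, 0)[0].astype(int))

z1 = (child_genotype(rng) - 2 * p) @ beta  # genetic values of sibling 1
z2 = (child_genotype(rng) - 2 * p) @ beta  # ... and sibling 2 (same parents)

V_A = np.sum(beta ** 2 * 2 * p * (1 - p))  # additive genetic variance
emp = np.cov(z1, z2)[0, 1]
print(emp, 0.5 * V_A)                      # full sibs: Cov = 2 * (1/4) * V_A
```

The two printed numbers agree to within Monte Carlo error, matching the result derived above.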

### Covariance of genotypes when effect sizes and allele frequencies are independent

We have
where the second line comes from the assumption that genotypes and effect sizes are independent, and the third line comes from adding and subtracting a term in the second line and simplifying. Further making the assumption that 𝔼 (*β*) = 0,

### The estimated regression coefficient from ordinary least squares with principal components

Recall the design matrix *X* = [*x v*_{1} *v*_{2} *· · · v*_{J} ], where the *v*_{i} are orthonormal eigenvectors of Σ. Then,
i.e., the first row and column contain the projections of *x* onto each eigenvector, and the rest of the matrix is an identity matrix. The identity block arises because the eigenvectors are orthonormal, so their cross products vanish. Then,
Ultimately, we only need the first row of (*X*^{T} *X*)^{−1} to estimate the regression coefficient for *x*. Note that by expanding *x*^{T} *x* in terms of the eigenvectors, we have that
so that the first row of (*X*^{T} *X*)^{−1} is
We also compute
so that
as desired. Note that when *J* = 0, this recovers the ordinary least squares estimator (equation 16), as expected.
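As a numerical check of this expression, the sketch below (random data; dimensions are arbitrary) compares a direct least-squares fit with the predictor and the leading *J* eigenvectors as columns against the closed form in which the eigenvector projections are subtracted from both the numerator and the denominator (the residualization logic underlying the derivation above):

```python
import numpy as np

rng = np.random.default_rng(5)
n, J = 30, 3
A = rng.normal(size=(n, n))
Sigma = A @ A.T
V = np.linalg.eigh(Sigma)[1][:, ::-1]   # orthonormal eigenvectors, leading first
x, y = rng.normal(size=n), rng.normal(size=n)

# Direct OLS with the leading J eigenvectors included as covariates:
X = np.column_stack([x] + [V[:, j] for j in range(J)])
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0][0]

# Closed form: project the included PCs out of numerator and denominator.
px = np.array([x @ V[:, j] for j in range(J)])   # projections of x
py = np.array([y @ V[:, j] for j in range(J)])   # projections of y
beta_formula = (x @ y - px @ py) / (x @ x - px @ px)

print(np.isclose(beta_lstsq, beta_formula))
```

The two estimates agree to machine precision, and setting *J* = 0 in the closed form recovers the plain OLS estimator, as in the text.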

### The estimated regression coefficient from generalized least squares

Recall that we can diagonalize the covariance matrix Σ as
where *V* = [*v*_{1} *v*_{2} *· · · v*_{n}] is a matrix whose columns are the eigenvectors of Σ and Λ = diag(*λ*_{1}, *λ*_{2}, …, *λ*_{n}) is a diagonal matrix whose entries are the eigenvalues of Σ. If we have design matrix *X* = [*x v*_{1} *v*_{2} *· · · v*_{J}], we can rewrite the generalized least squares estimator as
where we have used the fact that the columns of *V* form an orthonormal basis. Next, compute
which is a (*J* + 1) *× n* matrix; the pattern extends across the *J* + 1 rows, and the identity entries again arise because the eigenvectors are orthonormal. Then,
Note that *V* ^{T} *X* = (*X*^{T} *V*)^{T}, so that
As in the case for ordinary least squares, we will only need the first row of the inverse,
Next, compute
so that
and finally
as desired. Setting *J* = 0 recovers the GLS estimator without any PCs included as covariates, as in equation 18.
