ABSTRACT
Phylogenetic generalized least squares (PGLS) regression is one of the most commonly used methods for examining evolutionary correlations between two traits. Unlike conventional correlation methods such as the Pearson and Spearman’s rank tests, PGLS regression treats the two analyzed traits differently when correcting for phylogenetic non-independence. In examining the correlations of CRISPR-Cas and prophage contents with optimal growth temperature and minimal doubling time, we noticed that conflicting results appear at a remarkable frequency (26.3%) after swapping the independent and dependent variables. We then generated 12000 simulations of the evolution of two traits (X1 and X2) along a binary tree containing 100 terminal nodes with different models and variances. In this simulated dataset, swapping the dependent and independent variables gave conflicting results at a frequency of 17.2%. By conventional correlation analysis of the trait changes along the phylogenetic branches (ΔX1 and ΔX2), we established a golden standard for whether X1 and X2 correlate in each simulation. With this golden standard, we compared six potential criteria for dependent variable selection: log-likelihood, Akaike information criterion, R², p-value, Pagel’s λ, and the estimated λ in Pagel’s λ model. The last two criteria were found to be equivalent in their performance of dependent variable selection and superior to the other four criteria. Because Pagel’s λ values, as indicators of phylogenetic signals, are generally calculated at the beginning of phylogenetic comparative studies, we recommend, for practical convenience, that the trait with the higher λ value be used as the dependent variable in future PGLS regressions. Logical analysis of cause and effect should be done after a significant correlation has been established by PGLS regression, rather than serving as a guide for the choice of the dependent variable.
Due to mutations, genetic drift, or natural selection, some biological traits tend to evolve together over evolutionary time (Felsenstein 1985, Revell and Collar 2009, Goswami et al. 2014, Caetano and Harmon 2019, Revell et al. 2022). Correlation analysis between traits could quantify the magnitude and direction of changes in one trait given knowledge of evolutionary changes in another, providing evidence or counterexamples for hypotheses (Pearman et al. 2014), prompting deep thinking about biological processes, and contributing to better understanding and reconstruction of the events that occurred during evolutionary history (Bartomeus et al. 2018, Bawa et al. 2019, Suarez-Castro et al. 2020).
The correlation between two variables is usually assessed by computing the Pearson, Spearman’s rank, or Kendall rank correlation coefficient. However, the values of evolutionary traits generally violate a basic assumption of these standard statistical methods. They are not independent of each other but related through the evolutionary relationships of the analyzed species (Felsenstein 1985). Ignoring the phylogenetic dependences would distort the trait correlation (Revell 2010, Whitney and Garland 2010). A series of methods have been developed to address the problem of phylogenetic non-independence (Felsenstein 1985, Grafen 1989, Lynch 1991, Garamszegi 2014). Among them, the phylogenetic generalized least squares (PGLS), initially formulated by Grafen (1989) and subsequently developed by Martins and Hansen (1997), Pagel (1997, 1999), and Rohlf (2001), became the most commonly employed method to determine the relationship between two or more traits. When interpreting the correlation between two traits using PGLS regression, the regressions’ significant positive and negative slopes correspond to significant positive and negative correlations, respectively.
The two traits occupy equivalent positions in standard correlation analyses such as the Pearson and Spearman’s rank tests. In a regression model, however, one trait is designated as the independent variable and the other as the dependent variable. When exploring trait correlations using regression models, there is a common misconception that the positions of the independent and dependent variables are not equivalent. A causal relationship is often assumed in regression analysis (Waugh 1943). For example, when analyzing the relationship between GC content and ecological factors, we tend to choose the ecological factors as the independent variable and the GC content as the dependent variable (Travnicek et al. 2019, Hu et al. 2022), on the assumption that the environmental factors might have driven the evolution of GC content. Statistically, however, the regression analysis only examines whether there is a significant positive or negative correlation between the two traits. In standard regression analysis, the independent and dependent variables are exchangeable: exchanging them affects neither the sign (positive or negative) of the regression coefficient nor the outcome of the significance test (p < 0.05 or p ≥ 0.05). Therefore, we could arbitrarily select one trait as the independent variable and designate the other as the dependent variable in standard regression analysis. In this respect, PGLS regression differs from standard regression. Swapping the independent and dependent variables in a PGLS analysis results in inconsistent estimates of parameters such as Pagel’s λ (Freckleton et al. 2002), and therefore in inconsistent corrections for the phylogenetic non-independence of the trait values. In principle, this inconsistency might lead to conflicting results.
When scrutinizing the correlations in a recent publication of our laboratory (Liu et al. 2023), we noticed that, in some cases, swapping the independent and dependent variables qualitatively affects the results and conclusions of the correlation analysis (see the Results section). That is, a significant correlation between two traits (p < 0.05) might disappear (p ≥ 0.05) after swapping the independent and dependent variables. In these cases, how do we know which regression to accept?
In the present study, using simulated data, we first evaluated the prevalence of conflicting results arising from swapping the independent and dependent variables, and then compared potential criteria for appropriately designating the independent and dependent variables in PGLS analysis.
MATERIALS AND METHODS
Phylogenetic Generalized Least Squares
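Phylogenetic generalized least squares, as proposed by Martins and Hansen (1997), relies on the generalized least squares model and can be written as

$$y = X\alpha + \varepsilon, \tag{1}$$

where y is an n-dimensional vector of values of trait y, considered as the response variable in the regression model, and n is the sample size. X is an n × (1 + p) matrix consisting of a column of ones and p columns of explanatory variables; the first column of ones can be interpreted as the intercept. α is a column vector of regression coefficients. ε is a column vector of errors following a multivariate normal distribution with a mean of 0 and a variance-covariance matrix of σ²∑, where ∑ is a matrix describing the phylogenetic relationships (topology and branch lengths). Therefore, ε has a multinormal probability density given by

$$f(\varepsilon) = (2\pi)^{-n/2}\,\lvert\sigma^{2}\Sigma\rvert^{-1/2}\exp\left\{-\tfrac{1}{2}\,\varepsilon^{\top}(\sigma^{2}\Sigma)^{-1}\varepsilon\right\}. \tag{2}$$

According to equation (1), equation (2) can be expressed as

$$f(y \mid \alpha, \sigma^{2}) = (2\pi)^{-n/2}\,\lvert\sigma^{2}\Sigma\rvert^{-1/2}\exp\left\{-\tfrac{1}{2}\,(y - X\alpha)^{\top}(\sigma^{2}\Sigma)^{-1}(y - X\alpha)\right\}, \tag{3}$$

and the log-likelihood function is given by

$$\log L(\alpha, \sigma^{2}) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\log\lvert\sigma^{2}\Sigma\rvert - \frac{1}{2}\,(y - X\alpha)^{\top}(\sigma^{2}\Sigma)^{-1}(y - X\alpha). \tag{4}$$

The parameters α and σ² can be estimated by maximizing the value of log L(α, σ²) via a maximum likelihood approach.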
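The estimation step described above can be illustrated with a minimal, standard-library Python sketch. It is not the phylolm implementation used in this study; the helper names (`inv_and_logdet`, `gls_fit`) and the four-tip toy data are hypothetical, and ∑ is treated as known. The sketch uses the standard closed-form maximum likelihood solutions of the generalized least squares model.

```python
import math

def inv_and_logdet(m):
    """Invert a square matrix by Gauss-Jordan elimination; also return log|det|."""
    n = len(m)
    a = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(m)]
    logdet = 0.0
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(a[r][col]))  # partial pivoting
        a[col], a[piv] = a[piv], a[col]
        p = a[col][col]
        logdet += math.log(abs(p))
        a[col] = [v / p for v in a[col]]
        for r in range(n):
            if r != col:
                f = a[r][col]
                a[r] = [vr - f * vc for vr, vc in zip(a[r], a[col])]
    return [row[n:] for row in a], logdet

def matmul(a, b):
    return [[sum(u * v for u, v in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(r) for r in zip(*a)]

def gls_fit(y, X, Sigma):
    """Maximum likelihood estimates of alpha and sigma^2, and the maximized log-likelihood."""
    n = len(y)
    Sinv, logdet_S = inv_and_logdet(Sigma)
    XtSi = matmul(transpose(X), Sinv)                  # X' Sigma^-1
    A, _ = inv_and_logdet(matmul(XtSi, X))             # (X' Sigma^-1 X)^-1
    alpha = [r[0] for r in matmul(A, matmul(XtSi, [[v] for v in y]))]
    resid = [[yi - sum(xij * aj for xij, aj in zip(xi, alpha))] for yi, xi in zip(y, X)]
    quad = matmul(matmul(transpose(resid), Sinv), resid)[0][0]   # r' Sigma^-1 r
    sigma2 = quad / n
    # log|sigma^2 Sigma| = n log(sigma^2) + log|Sigma|; the quadratic term equals n at the optimum
    loglik = -0.5 * (n * math.log(2 * math.pi) + n * math.log(sigma2) + logdet_S + n)
    return alpha, sigma2, loglik

# Hypothetical toy data: four tips, two pairs of sister species, tree depth 1.
C = [[1.0, 0.5, 0.0, 0.0],
     [0.5, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.5],
     [0.0, 0.0, 0.5, 1.0]]
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]
X = [[1.0, xi] for xi in x]                            # column of ones plus one explanatory variable
alpha_hat, sigma2_hat, loglik = gls_fit(y, X, C)
print(alpha_hat, sigma2_hat, loglik)
```

With ∑ set to the identity matrix, the same function reduces to ordinary least squares, which is a convenient sanity check.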
All the PGLS regressions in this study were performed using the R (version 4.0.2) package phylolm (version 2.6.2) (Ho and Ané 2014).
Evolutionary Models
The Brownian motion (BM) model on a phylogeny is like a “random walk” model, in which the variance of the trait value increases at a constant rate σ² per unit of time. In this model, ∑ij = Cij, where Cij is the distance from the root node to the most recent common ancestor of tips i and j. If i = j, Cij is the distance from the root node to tip i. Pagel (1997) introduced the parameter λ to represent different evolution rates on branches and to transform branch lengths. In Pagel’s λ model, the off-diagonal elements of the variance-covariance matrix are multiplied by λ. We denote the variance-covariance matrix modified by λ as ∑(λ). The multinormal probability density can be written as

$$f(y \mid \alpha, \sigma^{2}, \lambda) = (2\pi)^{-n/2}\,\lvert\sigma^{2}\Sigma(\lambda)\rvert^{-1/2}\exp\left\{-\tfrac{1}{2}\,(y - X\alpha)^{\top}\bigl(\sigma^{2}\Sigma(\lambda)\bigr)^{-1}(y - X\alpha)\right\}, \tag{5}$$

and the log-likelihood function is given by

$$\log L(\alpha, \sigma^{2}, \lambda) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\log\lvert\sigma^{2}\Sigma(\lambda)\rvert - \frac{1}{2}\,(y - X\alpha)^{\top}\bigl(\sigma^{2}\Sigma(\lambda)\bigr)^{-1}(y - X\alpha). \tag{6}$$

The parameter λ, lying between 0 and 1, is estimated by a search procedure that maximizes the likelihood in equation (6) (Freckleton et al. 2002). The higher the λ value, the stronger the phylogenetic dependence. If λ = 0, there is no phylogenetic dependence between the residuals. If λ = 1, ∑(λ) = C, and Pagel’s λ model and the Brownian motion model are equivalent.
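As a small illustration of the λ transformation described above, the following Python sketch scales the off-diagonal elements of a variance-covariance matrix by λ and leaves the diagonal untouched. The function name `lambda_transform` and the three-tip matrix are hypothetical, chosen only for illustration.

```python
def lambda_transform(C, lam):
    """Return Sigma(lambda): off-diagonal elements of C multiplied by lam, diagonal kept."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lambda is expected to lie between 0 and 1")
    n = len(C)
    return [[C[i][j] if i == j else lam * C[i][j] for j in range(n)] for i in range(n)]

# Hypothetical three-tip example: tips 0 and 1 share 0.6 units of history.
C = [[1.0, 0.6, 0.0],
     [0.6, 1.0, 0.0],
     [0.0, 0.0, 1.0]]

brownian = lambda_transform(C, 1.0)      # identical to C (Brownian motion)
independent = lambda_transform(C, 0.0)   # diagonal matrix (no phylogenetic dependence)
halfway = lambda_transform(C, 0.5)       # shared history down-weighted by half
```

The two boundary cases correspond to the statements in the text: λ = 1 recovers C, and λ = 0 removes all covariance between tips.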
The phylogenetic signals (λ) were estimated using the R (version 4.0.2) package phytools (version 1.0-3) (Revell 2012).
Empirical Datasets
A dataset including the empirical minimal doubling times, the CRISPR spacer numbers, the optimal growth temperature, and the number of prophages of 262 bacteria was extracted from the Supplemental Material Table S1 of Liu et al. (2023). The phylogenetic tree of these 262 bacteria was retrieved from the Genome Taxonomy Database (GTDB; accessed 8 April 2022) (Parks et al. 2022).
Simulation Data
First, we generated a binary tree containing 100 terminal nodes using the package ape in R (Paradis and Schliep 2019). The trait X1 evolved along the tree under a Brownian motion model (simulated with the ape package in R) with a variance rate σ²_BM = 4, and the trait X2 was then simulated as

$$X_2 = X_1 + \varepsilon, \tag{7}$$

where ε is a normally distributed random noise term with a mean of 0 and a variance of 1, 4, 16, 64, 256, or 1024. The term ε introduces noise into the dependent variable X2, and increasing its variance from 1 to 1024 changes the correlation between X1 and X2 from strong to weak. We named this case “BM & BM + Norm.”
Then, we simulated the trait X1 under a normal distribution with a mean of 0 and a variance of 4. The trait X2 was simulated based on equation (7), where ε was simulated under a Brownian motion model with σ²_BM set to 1, 4, 16, 64, 256, or 1024. We named this case “Norm & Norm + BM.”
Each parameter condition was simulated 1000 times, giving 12000 simulations in total.
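The two simulation cases can be sketched in plain Python as follows. This is only an illustration of the design, not the ape-based R code used in the study: the tree generator (`random_binary_tree`, random splits with uniform branch lengths), the edge-list representation, and all function names are assumptions made for this example.

```python
import random

NOISE_VARIANCES = [1, 4, 16, 64, 256, 1024]
CASES = ["BM & BM + Norm", "Norm & Norm + BM"]

def random_binary_tree(n_tips, rng):
    """Grow a rooted binary tree by repeatedly splitting a random leaf.
    Returns (edges, tips); each edge is (parent, child, branch_length)."""
    edges, tips, next_id = [], [0], 1
    while len(tips) < n_tips:
        parent = tips.pop(rng.randrange(len(tips)))
        for _ in range(2):
            edges.append((parent, next_id, rng.random()))
            tips.append(next_id)
            next_id += 1
    return edges, tips

def simulate_bm(edges, sigma2, rng, root_value=0.0):
    """Brownian motion: each branch adds a normal increment with variance sigma2 * length."""
    values = {0: root_value}
    for parent, child, length in edges:          # parents always precede their children
        values[child] = values[parent] + rng.gauss(0.0, (sigma2 * length) ** 0.5)
    return values

def simulate_pair(case, noise_var, n_tips=100, seed=None):
    """Simulate tip values of X1 and X2 = X1 + noise for one parameter condition."""
    rng = random.Random(seed)
    edges, tips = random_binary_tree(n_tips, rng)
    if case == "BM & BM + Norm":
        bm = simulate_bm(edges, 4.0, rng)
        x1 = {t: bm[t] for t in tips}
        x2 = {t: x1[t] + rng.gauss(0.0, noise_var ** 0.5) for t in tips}
    elif case == "Norm & Norm + BM":
        bm = simulate_bm(edges, float(noise_var), rng)
        x1 = {t: rng.gauss(0.0, 2.0) for t in tips}      # variance 4, so sd 2
        x2 = {t: x1[t] + bm[t] for t in tips}
    else:
        raise ValueError(f"unknown case: {case}")
    return edges, x1, x2

n_total = len(CASES) * len(NOISE_VARIANCES) * 1000       # replicates per condition
edges, x1, x2 = simulate_pair("BM & BM + Norm", 16, seed=1)
```

Looping `simulate_pair` over both cases, the six noise variances, and 1000 replicates reproduces the size of the design (12000 simulations).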
RESULTS
Swapping the Dependent and Independent Variables in PGLS Analyses of Empirical Data Sometimes Gave Conflicting Results
In a recent publication of our laboratory (Liu et al. 2023), a dataset containing a series of traits of 262 bacteria was deposited as Supplemental Material Table S1. We scrutinized the correlations of empirical minimal doubling time and optimal growth temperature with the genomic characters. In total, 38 pairs of traits were re-analyzed by PGLS using Pagel’s λ model. In ten pairs, swapping the dependent and independent variables gave conflicting results and led to different conclusions (Table 1). For example, when the average prophage number was chosen as the dependent variable and the optimal growth temperature as the independent variable, the PGLS analysis showed a significant negative correlation between the two traits (p = 9 × 10⁻⁴, Table 1). However, no statistically significant correlation was observed when the dependent and independent variables were swapped (p = 0.242, Table 1). For each pair of conflicting results, we have to identify the correct one and base the conclusion on it. But how do we know which one is correct?
Prevalence of Conflicting Results from Swapping the Dependent and Independent Variables
In the above survey of correlations in the empirical data, we found that swapping the dependent and independent variables led to conflicting results in 26.3% of cases (10 of 38 pairs), so such conflicts are not rare. To assess the prevalence of conflicting results caused by swapping the dependent and independent variables, we simulated the evolution of two traits along a binary tree containing 100 terminal nodes. Different distributions of the traits were simulated. In the case of “BM & BM + Norm,” we constructed trait X1 under the Brownian motion model (“BM”), and trait X2, as “BM + Norm,” equals X1 plus a noise term “Norm.” In the case of “Norm & Norm + BM,” the noise term is “BM.” To account for varying levels of correlation, we set the variance of the noise term to 1, 4, 16, 64, 256, or 1024. Each parameter condition was simulated 1000 times.
For the data of each simulation, we performed two rounds of PGLS analysis, X1∼X2 and X2∼X1. The results are deposited in Supplementary Table S1 and summarized in Table 2. In 3786 simulations, neither X1∼X2 nor X2∼X1 gave a significant correlation (p ≥ 0.05 for both). In 6150 simulations, both X1∼X2 and X2∼X1 gave significant correlations (p < 0.05). In each of these 6150 cases, the regression coefficients of X1∼X2 and X2∼X1 have the same sign. In the other 2064 simulations, only one of X1∼X2 and X2∼X1 gave a significant correlation (p < 0.05 for one and p ≥ 0.05 for the other). That is, the frequency of conflicting results caused by swapping the dependent and independent variables in our simulated dataset is 17.2% (2064 of 12000).
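The classification used here can be written down as a short Python sketch. The function name `classify_pair` and the 0.05 threshold argument are illustrative; the counts are taken from the text, and the example p-values are those reported above for prophage number and optimal growth temperature.

```python
def classify_pair(p_forward, p_reverse, alpha=0.05):
    """Classify the outcome of the two regressions (X1~X2 and X2~X1) by significance."""
    sig_forward, sig_reverse = p_forward < alpha, p_reverse < alpha
    if sig_forward and sig_reverse:
        return "both significant"
    if not sig_forward and not sig_reverse:
        return "neither significant"
    return "conflicting"

# Empirical example from Table 1: p = 9e-4 one way, p = 0.242 after swapping.
example = classify_pair(9e-4, 0.242)

# Counts reported for the simulated dataset.
total, both_significant, conflicting = 12000, 6150, 2064
neither_significant = total - both_significant - conflicting
conflict_frequency = conflicting / total
print(example, neither_significant, f"{conflict_frequency:.1%}")
```

The three categories are mutually exclusive and exhaustive, so the count of simulations with no significant result in either direction follows from the other two.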
From Table 2, we can see a relationship between the variance in the simulations and the frequency of conflicting results caused by swapping the dependent and independent variables. When the variance of the noise term is small (e.g., 1 or 4), i.e., when there is a strong correlation between X1 and X2, swapping the independent and dependent variables gives almost the same results. As the variance of the noise term increases, i.e., as the correlation between X1 and X2 becomes weaker, swapping the independent and dependent variables leads to many conflicting results. When the variance equals 16 in “BM & BM + Norm” and 64 in “Norm & Norm + BM,” close to 50% of cases give conflicting results. However, as the variance becomes much larger and the correlation between X1 and X2 becomes even weaker, the frequency of conflicting results caused by swapping the dependent and independent variables diminishes (Table 2).
Establishing a Golden Standard to Evaluate the Correlations Between Simulated Traits
In the PGLS analysis of both the empirical and the simulated datasets, swapping the dependent and independent variables produces a significant frequency of conflicting results. Therefore, we should not arbitrarily select one trait as the dependent variable in the PGLS analysis of two traits. Taking advantage of the simulated data, we will try to find a criterion for selecting a better dependent variable.
In empirical phylogenetic data, we only know the trait values for the terminal nodes, and potential correlations among the traits can be estimated by methods like PGLS. By contrast, in simulated phylogenetic data, we also know the trait values of the internal nodes. As the changes along different phylogenetic branches are independent, we can measure the evolutionary correlation between two traits by analyzing their changes along the phylogenetic branches using standard statistical methods. First, we calculated the changes in the trait X1 and the trait X2 along evolutionary branches per unit time, ΔX1/L and ΔX2/L, where L is the branch length. Then we applied the Shapiro-Wilk test to ΔX1/L and ΔX2/L to determine whether they are normally distributed. If both satisfied normality, we used the Pearson correlation to detect the correlation between ΔX1/L and ΔX2/L. Otherwise, we used Spearman’s rank correlation. These analyses revealed significant correlations between the two traits, X1 and X2, in 7902 simulations (7099 positive and 803 negative, p < 0.05 for all these cases) but not in the other 4098 simulations (p ≥ 0.05 for all these cases) (Supplementary Table S1). These results provide “golden standards” to judge whether PGLS analyses of the trait values on the terminal nodes (X1 and X2) give correct results.
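The first step of this procedure, computing per-branch changes per unit time and correlating them, can be sketched in Python as below. The edge-list representation, the function names, and the five-node toy tree are hypothetical. The normality check (Shapiro-Wilk), the Spearman fallback, and the significance test are deliberately left out; in practice they would come from a statistics library.

```python
import math

def branch_rates(edges, values):
    """Trait change along each branch per unit time: (child - parent) / L."""
    return [(values[child] - values[parent]) / length for parent, child, length in edges]

def pearson_r(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_ab = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    s_aa = sum((u - mean_a) ** 2 for u in a)
    s_bb = sum((v - mean_b) ** 2 for v in b)
    return s_ab / math.sqrt(s_aa * s_bb)

# Hypothetical toy tree: root 0 -> internal node 1 and tip 2; node 1 -> tips 3 and 4.
edges = [(0, 1, 0.5), (0, 2, 1.0), (1, 3, 0.5), (1, 4, 0.5)]
x1 = {0: 0.0, 1: 0.4, 2: -1.0, 3: 0.9, 4: 0.1}   # values at all nodes, known in a simulation
x2 = {0: 0.0, 1: 0.5, 2: -0.8, 3: 1.2, 4: 0.0}

d1 = branch_rates(edges, x1)   # ΔX1/L for each branch
d2 = branch_rates(edges, x2)   # ΔX2/L for each branch
r = pearson_r(d1, d2)
print(d1, d2, r)
```

Because the internal-node values are needed, this calculation is only possible for simulated data, which is why it can serve as a reference for the tip-based PGLS results.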
PGLS analyses of X1∼X2 show significant positive correlations in 6317 simulations (p < 0.05), significant negative correlations in 29 simulations (p < 0.05), and no significant correlations in 5654 simulations (p ≥ 0.05) (Supplementary Table S1). The same analyses of X2∼X1 show significant positive correlations in 7999 simulations (p < 0.05), significant negative correlations in 19 simulations (p < 0.05), and no significant correlations in 3982 simulations (p ≥ 0.05) (Supplementary Table S1). By comparing with the golden standards, we found that PGLS analyses gave correct results for both X1∼X2 and X2∼X1 in 7475 simulations (62.29%) and incorrect results for both X1∼X2 and X2∼X1 in 2461 simulations (20.51%). In 2064 simulations (17.20%), only one of the two competing models, X1∼X2 or X2∼X1, gave correct results. Therefore, limited by the performance of PGLS regression itself, the best accuracy we could achieve in analyzing our 12000 simulations is 79.49% (62.29% + 17.20%), which requires choosing the correct model in every conflicting case. If we arbitrarily selected one trait (X1 or X2) as the independent variable and, in the worst case, chose the wrong model every time, the accuracy would be only 62.29%. In the following search for an accurate criterion for dependent variable selection, we aim to recover as many correct results as possible from the 2064 simulations where X1∼X2 and X2∼X1 gave conflicting results.
Looking for an Accurate Criterion for Dependent Variable Selection
Referring to the golden standards, we evaluated six potential criteria for their performance in selecting a better model from X1∼X2 and X2∼X1.
In statistics, the goodness of fit of two competing statistical models is often compared using each model’s log-likelihood (LLK) value. We first examined whether a higher (or lower) LLK could accurately predict the better model between X1∼X2 and X2∼X1. By calculating the LLKs of the two models for the 2064 simulations (Supplementary Table S1), we found that the models selected from X1∼X2 and X2∼X1 by a lower LLK (denoted as ModelLLK,lower) have more correct results than the alternate models (denoted as ModelLLK,higher), 1079 vs. 985. A χ² test showed that the difference is statistically significant (p = 0.004, Table 3).
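The reported p-value can be approximately reproduced with a short Python sketch. The layout of the test is an assumption, since the text does not specify it: a 2 × 2 table (selected vs. alternate model, correct vs. incorrect result) analyzed with Yates' continuity correction, which is the default for 2 × 2 tables in R's chisq.test. The function name `chi2_2x2_yates` is hypothetical.

```python
import math

def chi2_2x2_yates(a, b, c, d):
    """Chi-square test with continuity correction for the 2x2 table [[a, b], [c, d]].
    Returns (statistic, p_value) with one degree of freedom."""
    n = a + b + c + d
    corrected = max(0.0, abs(a * d - b * c) - n / 2)
    stat = n * corrected ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(stat / 2))   # upper tail of chi-square with 1 df
    return stat, p_value

# Assumed table for the 2064 conflicting simulations (exactly one model is correct in each):
#                    correct  incorrect
# Model LLK,lower      1079      985
# Model LLK,higher      985     1079
stat, p_value = chi2_2x2_yates(1079, 985, 985, 1079)
print(round(stat, 3), round(p_value, 3))
```

Under these assumptions the result rounds to the value given in the text; a different table layout or an uncorrected test would give a slightly different number.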
The Akaike information criterion (AIC) is a widely used estimator of the quality of statistical models for a given dataset (Akaike 1974). It balances the goodness of fit of the model against the model’s complexity. By calculating the AIC values of the two competing models for the 2064 simulations (Supplementary Table S1), we found that the models selected by a higher AIC value (denoted as ModelAIC,higher) have significantly more correct results than the alternate models (denoted as ModelAIC,lower), 1079 vs. 985, p = 0.004 (Table 3).
The R² describes the proportion of the total variation in the dependent variable that is explained by the independent variables in the regression model. The p-value of the regression coefficient is the probability of observing a test statistic at least as extreme as the one obtained, under the null hypothesis that the regression coefficient equals 0. These two parameters are widely used indicators of the goodness of fit of regression models. The models selected by a higher R² (denoted as ModelR²,higher) and a lower p-value (denoted as Modelp,lower) have significantly more correct results than the alternate models (p = 4 × 10⁻⁴ for both cases, Table 3).
The phylogenetic signal is a measure of the extent to which the phylogenetic structure influences species trait values. Pagel’s λ is the most commonly used indicator of the phylogenetic signal (Pagel 1999). The estimated λ (written here as λ̂) is a parameter in Pagel’s λ model (Pagel 1997) that measures the relatedness of the regression residuals with the phylogenetic structure. We found that the models that use the trait with a higher λ value as the dependent variable (denoted as Modelλ,higher) have significantly more correct results than the alternate models (denoted as Modelλ,lower) (p < 2.2 × 10⁻¹⁶, Table 3). The models selected by a higher λ̂ (denoted as Modelλ̂,higher) also have significantly more correct results than the alternate models (denoted as Modelλ̂,lower) (p < 2.2 × 10⁻¹⁶, Table 3).
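The selection rule based on Pagel's λ is simple enough to state as a small Python sketch. The function name `choose_dependent`, the tie-breaking behavior, and the λ values are hypothetical and for illustration only; in practice the λ of each trait would be estimated beforehand (for example with phytools in R, as in this study).

```python
def choose_dependent(trait_a, lambda_a, trait_b, lambda_b):
    """Return (dependent, independent): the trait with the higher Pagel's lambda
    becomes the dependent variable. Ties keep the first trait as dependent."""
    if lambda_a >= lambda_b:
        return trait_a, trait_b
    return trait_b, trait_a

# Hypothetical phylogenetic signals for two traits.
dependent, independent = choose_dependent("X1", 0.35, "X2", 0.92)
formula = f"{dependent} ~ {independent}"
print(formula)
```

Only one PGLS regression then needs to be run, using the resulting formula.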
Furthermore, we defined a virtual criterion in which one model was randomly chosen from the two competing models, X1∼X2 and X2∼X1. The results of the models selected by this virtual criterion (Modelrc) were compared with the better models selected by the above six criteria. Although the better models selected by the above six criteria consistently have more correct results than Modelrc (Table 4), the differences of ModelLLK,lower and ModelAIC,higher from Modelrc are not statistically significant (p = 0.093 for both cases), the differences of ModelR²,higher and Modelp,lower from Modelrc are marginally significant (p = 0.046 for both cases), and the differences of Modelλ,higher and Modelλ̂,higher (the model selected by the higher estimated λ) from Modelrc are highly significant (p < 2.2 × 10⁻¹⁶ for both cases).
From Tables 3-4 and S2, it can be seen that the criteria within each pair gave identical results: LLK and AIC, R² and p-value, and Pagel’s λ and the estimated λ (λ̂). Among the three pairs, Pagel’s λ and λ̂ seem to be the best criteria for dependent variable selection. To evaluate these impressions quantitatively, we performed χ² tests to compare Pagel’s λ with λ̂, LLK, AIC, R², and p-value using the 2064 simulations where X1∼X2 and X2∼X1 gave conflicting results. As shown in Table 5, the equivalence between Pagel’s λ and λ̂ and the superiority of λ to LLK, AIC, R², and p-value were statistically confirmed.
In summary, Pagel’s λ and the estimated λ (λ̂) are the best criteria for dependent variable selection. Among the 2064 simulations where X1∼X2 and X2∼X1 gave conflicting results, these two criteria led to correct results in 1736 simulations. Combined with the 7475 simulations in which both X1∼X2 and X2∼X1 gave correct results, PGLS analysis achieves an accuracy of 76.8% when the trait with the stronger phylogenetic signal is selected as the dependent variable. As this accuracy is still 2.7 percentage points lower than the upper limit of PGLS analysis (79.49%), a better criterion might be found in the future. Moreover, the PGLS regression analysis itself should also be improved.
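The accuracy figures quoted in this section follow from simple arithmetic on the reported counts, as the following Python sketch shows. The variable names are illustrative; all counts are taken from the text.

```python
total = 12000
both_correct = 7475        # X1~X2 and X2~X1 both agree with the golden standard
both_incorrect = 2461      # neither model agrees with the golden standard
conflicting = 2064         # exactly one of the two models is correct
resolved_by_lambda = 1736  # conflicting cases resolved correctly by the higher-lambda rule

assert both_correct + both_incorrect + conflicting == total

worst_case = both_correct / total                          # wrong choice in every conflict
upper_limit = (both_correct + conflicting) / total         # right choice in every conflict
with_lambda = (both_correct + resolved_by_lambda) / total  # choice guided by Pagel's lambda

print(f"worst case  : {worst_case:.2%}")
print(f"upper limit : {upper_limit:.2%}")
print(f"lambda rule : {with_lambda:.1%}")
print(f"gap to upper limit: {100 * (upper_limit - with_lambda):.1f} percentage points")
```

The gap between the λ-based accuracy and the upper limit corresponds to the 328 conflicting simulations (2064 − 1736) in which the higher-λ rule picked the incorrect model.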
DISCUSSION
In the PGLS correlation analysis of an empirical dataset (Liu et al. 2023), we recognized that swapping the dependent and independent variables could lead to a remarkable frequency of conflicting results. We then simulated the evolution of two traits (X1 and X2) along a binary tree containing 100 terminal nodes 12000 times, with different models and variances. PGLS analysis of these simulated datasets showed that swapping the dependent and independent variables gave conflicting results at a frequency of 17.2%.
Taking advantage of simulation, we established a golden standard for whether X1 and X2 are correlated in each simulation by conventional correlation analysis of the changes of the two traits along the branches of the phylogenetic tree. With this golden standard, we can tell which model, X1∼X2 or X2∼X1, is correct. Six potential criteria for dependent variable selection, namely LLK, AIC, R², p-value, Pagel’s λ, and the estimated λ (λ̂), have been compared. The last two criteria are equivalent in dependent variable selection and superior to the other four criteria. Pagel’s λ values are generally calculated at the beginning of a phylogenetic comparative analysis, so they are already known before the PGLS analysis. If we choose the trait with the higher λ value as the dependent variable, two rounds of PGLS analysis, X1∼X2 and X2∼X1, are not required. Considering this practical convenience, Pagel’s λ is preferable to λ̂.
It should be highlighted that the terms independent variable and dependent variable are misleading in evolutionary correlation studies. They should not be taken literally. A PGLS regression analysis does not provide a model that uses the independent variable to explain or predict changes in the dependent variable. It replaces conventional correlation methods, like Pearson and Spearman’s rank tests, in phylogenetic comparative studies. The choice of the dependent variable in a PGLS regression analysis should not be based on a pre-assumption of the cause-and-effect relationship between the analyzed traits but should guarantee an accurate perception of the relationship, whether correlated or not.
SUPPLEMENTARY MATERIAL
Supplementary material is available on GitHub at https://github.com/BNU-Genome-Evolution/dependent-variable-selection.
ACKNOWLEDGMENTS
This work was supported by the National Natural Science Foundation of China (Grant number 31671321).