Abstract
Sex differences in human transcriptomes have been argued to drive sex-differential natural selection (SDS). Here, we show that previous evidence supporting this hypothesis has been largely unfounded. We develop a new method to test for a genome-wide relationship between sex differences in expression and selection on expression-influencing alleles (eQTLs). We apply it across 39 human tissues and find no evidence for a general relationship. We offer possible explanations for the lack of evidence, including that it is due in part to eQTL ascertainment bias towards sites under weak selection. We conclude that the drivers of ongoing SDS in humans remain to be identified.
Introduction
Sex differences in gene expression have been theorized to be the result of long-term sex-differential selection (SDS), where allelic fitness effects differ between males and females1,2. When a gene product is only beneficial in one sex, it is expected that expression modifiers will evolve to increase expression for that sex and decrease expression in the other3. The reverse causality has also been proposed: sex differences in gene expression might be the driver of SDS on expression modifiers1,4.
SDS acting on viability within the current generation can generate between-sex differences in allele frequency5,6 at expression modifiers (expression Quantitative Trait Loci, or eQTLs)3. Sexually dimorphic traits (diseases being one example with strong repercussions on survival6,7,8,9) may be subject to SDS acting through viability. Previous work has proposed and tested theoretical models for relating divergence in eQTL allele frequencies with differential expression between sexes1,10,11. Some of the empirical results have, however, been called into question12,13,14.
A controversial result by Cheng and Kirkpatrick (2016)1 (hereafter “CK16”) is a characteristic pattern relating between-sex FST1,5,15 (henceforth “FST”) to sex differences in gene expression (henceforth “Δ”). They observed high FST values at intermediate values of Δ and near-zero values of FST when a gene is expressed evenly between the sexes (Δ = 0) or is only expressed in one sex (Δ = ±1). They nicknamed this bimodal pattern “Twin Peaks”. The Twin Peaks pattern has been used as a signal to detect SDS in several species1,4,16,17.
Here, we revisit the CK16 model and statistical approach. We find that their interpretation was based on previously unappreciated statistical artifacts. We then refine their model and apply it to new, more extensive data on gene expression and allele frequencies. Across 39 human tissues, we find no evidence for a genome-wide relationship between viability SDS and Δ. We discuss how a bias in eQTL discovery towards eQTLs under weaker selection can explain the lack of signal for a relationship between SDS and Δ, and how the drivers of SDS may still be investigated.
Results
Twin Peaks is a Statistical Artifact
We begin by revisiting the Twin Peaks pattern with a critical eye towards caveats in its application and interpretation. Most importantly, we reconsider the model that CK16 proposed to explain the pattern. It is based on two key assumptions. First, the relationship between a gene’s expression levels and its effect on fitness in each sex is linear. Second, sexually antagonistic selection is symmetric, meaning selection coefficients are equal in magnitude and opposite in sign between sexes. This yields a quadratic relationship between FST and Δ at a biallelic site affecting expression. At small absolute values of Δ,
$$F_{ST} \approx 2A\,(2pq)\,\Delta^2, \tag{1}$$
where p is the allele frequency in zygotes and q = 1 − p. FST is the between-sex fixation index18, which is used to quantify allele frequency differences between males and females. In the model, these differences are due solely to the sex differences in post-zygotic fitness effects of alleles. The quantity Δ is the sex difference in gene expression (Methods). Finally, A is a compound parameter involving the within-sex effect of gene expression level on fitness.
The expectation in CK16 for FST at extreme expression differences (|Δ| → 1), however, is based on intuition rather than a model. The authors suggest that if a gene is not expressed in one sex (Δ = ±1), then selection will not act on it. Selection on the other sex should optimize expression levels, so under the symmetrical selection assumption there will be no ongoing directional selection in that sex either. As neither sex experiences selection, there will be no force driving increased FST. CK16 then interpolate a bimodal shape by joining the quadratic relationship at low |Δ| and the expectation for FST = 0 at Δ = ±1 (Fig. 1).
However, the assumption of symmetric selection used at low values of Δ may be inappropriate for large Δ values. In particular, when Δ = ±1, the lack of expression in one sex plausibly suggests different selection between males and females. It is therefore not intuitive that FST should simply go to zero at sites regulating expression in these genes. Additionally, although the quantity 2pq appears in Eq. 1, CK16 did not include that term in fitting the model, effectively assuming it to only contribute random noise to the relationship between FST and Δ.
Because of these caveats to the model, we decided to revisit the support for Twin Peaks. We first asked whether the Twin Peaks pattern is due to SDS by applying the statistical tests of CK16 to data generated under a null hypothesis of no relationship between Δ and SDS. We replicated the pattern shown in CK16 by using the same data and methods. Namely, we performed a 4th-degree polynomial regression of FST on Δ, using allele count data from 1000 Genomes19 and expression data from the gonads (ovaries and testes) in GTEx v320. We refer to the curve and associations generated using these datasets as the “real data” (blue line in Fig. 2a).
We then generated an empirical null by permuting ovary and testes tissue labels in the expression data while retaining the sex labels associated with FST values (Methods). We then recomputed Δ values and again fit a 4th-degree polynomial. We find that 21% of permutations qualify as Twin Peaks according to the criteria used by CK16, and that the polynomial fit to the real data is not visually distinct from the fits to the null (Fig. 2a). Higher-order polynomial regressions can also yield spurious fits because distant points have an oversized impact21,22. We therefore binned genes by Δ values and examined the relationship with mean FST in each bin. Again, the real data show no unusual relationship between FST and Δ compared to null data (Fig. 2b). From these results, we conclude the Twin Peaks pattern is not statistically significant.
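The binned comparison can be sketched as follows. This is a minimal illustration and not the code used for Fig. 2b: the function name `mean_fst_by_delta_bin`, the use of equal-width bins on [−1, 1], and the default of 10 bins are all assumptions made for the example.

```python
def mean_fst_by_delta_bin(deltas, fsts, n_bins=10):
    """Return (bin_center, mean_fst, n_genes) for equal-width bins of Δ on [-1, 1].

    deltas and fsts are parallel lists with one entry per gene.
    The bin layout is an illustrative assumption, not the published choice.
    """
    width = 2.0 / n_bins
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for d, f in zip(deltas, fsts):
        # Clamp the index so that a gene with Δ = 1 falls in the last bin.
        idx = min(int((d + 1.0) / width), n_bins - 1)
        sums[idx] += f
        counts[idx] += 1
    result = []
    for i in range(n_bins):
        center = -1.0 + (i + 0.5) * width
        # Empty bins are reported as NaN rather than zero.
        mean_fst = sums[i] / counts[i] if counts[i] else float("nan")
        result.append((center, mean_fst, counts[i]))
    return result
```

Applying the same function to the real Δ values and to each permuted set of Δ values gives per-bin means that can be compared directly, without relying on a polynomial fit.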
Importantly, this permutation method differs from the method used in CK16, which permuted Δ values across genes. Both methods break the associations between FST and Δ as desired. Our sex-label permutation method, however, preserves the association of FST with each gene’s overall expression, maintaining gene features such as expression variance, which the CK16 method does not. This explains why their reported p-value for Twin Peaks curves (0.016) is lower than ours (0.21).
We hypothesized that one reason for the spurious Twin Peaks pattern is confounding. In particular, variance in expression (regardless of sex) is positively correlated with both Δ and FST. Previous work has shown that genes subject to weaker stabilizing selection show higher variance in expression23. Higher variance means larger differences between randomly selected subgroups, and therefore it should translate to larger values of Δ even in the absence of SDS. In turn, stronger (sex-agnostic) selection can lead to stronger drift at linked sites24,25,26,27. Taken together, the confounding with variance in gene expression could generate a relationship between FST and |Δ| in the absence of SDS. Indeed, expression variance and FST are positively correlated (Pearson p = 0.011; Fig. 2c), consistent with this hypothesis. In sum, we do not find support for Cheng and Kirkpatrick’s conclusion that there is a genome-wide relationship between FST and Δ.
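The claim that higher expression variance inflates |Δ| between arbitrary subgroups can be illustrated with a small simulation. This is a sketch under stated assumptions and not an analysis from this study: expression is drawn from a normal distribution (clamped at zero), the two groups have 10 samples each, and Δ is taken to be the normalized difference of group means. The function name `mean_abs_delta` and all parameter values are hypothetical.

```python
import random

def mean_abs_delta(sd, n_per_group=10, n_reps=500, mean_expr=100.0, seed=1):
    """Average |Δ| between two random groups drawn from the same distribution.

    No sex effect is simulated, so any nonzero |Δ| is sampling noise alone.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_reps):
        # Draw i.i.d. expression values; clamp at zero because TPM is non-negative.
        samples = [max(rng.gauss(mean_expr, sd), 0.0) for _ in range(2 * n_per_group)]
        # Samples are exchangeable, so splitting by position is a random split.
        x_a = sum(samples[:n_per_group]) / n_per_group
        x_b = sum(samples[n_per_group:]) / n_per_group
        total += abs((x_a - x_b) / (x_a + x_b))
    return total / n_reps

low_variance = mean_abs_delta(sd=5.0)
high_variance = mean_abs_delta(sd=40.0)
```

Under these assumptions, `high_variance` exceeds `low_variance` even though neither simulated gene has any true sex difference in expression.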
No evidence for genome-wide SDS on eQTLs
Although we do not find support for a relationship between sex differences in gene expression and selection using CK16’s methodology for generating Twin Peaks, one may still exist. We test this hypothesis across many tissues using improved statistical modeling, data, and methods. Despite the caveats to portions of the model discussed above, we believe that CK16’s theoretical expectation of a quadratic relationship between FST and small values of Δ is valid. We therefore built on that model by introducing the compound parameter δ² = 4pqΔ² and rewriting Eq. 1 as
$$F_{ST} \approx A\,\delta^2 \tag{2}$$
(Methods). To estimate A, for each gene-tissue pair, we used the single cis-eQTL from GTEx v828 with the strongest association with its expression. Our estimator of A is then the slope of the inverse-variance-weighted linear regression of FST on δ² (Eq. 2). One advantage of this formulation over CK16’s is that it allows direct estimation of A, the strength of SDS on sex differences in expression. This model also accounts for variation in allele frequencies across sites. Further, by using a single eQTL to calculate FST for the whole gene, we circumvent biases which can arise when using a simple mean to estimate gene-wide FST29,30.
To accompany the updated regression methodology, we also updated the datasets for both allele frequencies and gene expression. For allele frequencies, we used the Non-Finnish European subset in gnomAD v3 (averaging over 60,000 allele samples per site)31. We used expression data from 49 tissues from the GTEx v8 dataset28 (averaging over 200 male samples and 150 female samples). Both datasets greatly expand our sample size compared to CK16, and GTEx v8 provides sample-specific sex labels for calculating Δ across multiple tissues instead of just the gonads—ovaries and testes—as in the original Twin Peaks paper.
Using the updated statistical framework, we find no evidence that A differs significantly from zero in any tissue (Fig. 3; Methods). One explanation for the absence of a pattern may be found in recent work by Mostafavi et al. (2023). The authors contrasted how selection impacts the discovery of genome-wide association study (GWAS) hits with the discovery of eQTLs. Variants with large effects on phenotypes are expected to segregate at low frequencies, reducing discovery power. In GWAS, this is counterbalanced by increased power due to their large effect sizes. Consequently, in GWAS, low-frequency variants can still be detected if their effect is large enough. In contrast, eQTL discovery is based only on the effect of genotype on gene expression, which does not necessarily translate to fitness-relevant trait variation. Detection of strongly selected sites is therefore less likely32. This ascertainment bias can weaken the relationship between Δ and FST at eQTLs, as sites with high FST are less likely to be eQTLs.
In “Twin Peaks is a Statistical Artifact”, we suggested that large sex differences in expression may be entirely unrelated to SDS, for instance by merely tagging genes with high expression variance. There are other potential explanations for the lack of a genome-wide relationship. While some sex differences in expression may be driven by or drive SDS, these may be the exception rather than the rule. The majority of sex differences in expression may be a regulatory side effect of SDS on different genes. Lastly, sex-differential gene expression may be a result of past SDS rather than a driver of current SDS33. Regardless of the reason, if a causal relationship between Δ and FST is rare in the genome, it would be difficult to detect using models that assume a pervasive, persistent relationship between the two.
Conclusion
Previous work suggested a genome-wide relationship between sex differences in expression and SDS. However, this work was based on statistical artifacts and confounded effects, such as stabilizing selection and variance in gene expression. Even when using newer data and improved statistical methods, we found no evidence for a genome-wide relationship between sex-differential expression and contemporary SDS. In contrast, studies that measured SDS irrespective of expression or trait variation have reported pervasive, genome-wide signals of SDS in the human genome5,10,34. While causal relationships, past and present, between SDS and sex-differential gene expression in humans remain plausible, they are yet to be fully elucidated.
Materials and Methods
Permuting Twin Peaks sex-labels
In the section “Twin Peaks is a Statistical Artifact”, we demonstrate that the Twin Peaks curve presented in CK16 is not statistically significant compared to a permuted null. To do so, we used the methods and datasets described in CK16 and Eq. 1. We computed allele counts from the 1000 Genomes dataset19 and used them to calculate FST. We filtered out any sites with only a single alternative allele in either males or females (i.e., singletons). We used the Transcripts Per Million (TPM) normalized GTEx v3 (referred to as “pilot” on the download page) dataset20 for the expression levels used to calculate Δ. Because GTEx v3 does not have individual sample labels, this analysis only compared expression in the gonads (ovaries and testes), where expression level summaries represent a single biological sex. We used the Ensembl GRCh38 v7735 annotation file for gene annotations. Following CK16, we limited our analysis to protein-coding genes.
To estimate FST, we used Hudson’s estimator36, based on the R package used in CK1637, as presented by Bhatia et al. (2013)29:
$$\hat{F}_{ST} = \frac{(p_m - p_f)^2 - \frac{p_m(1 - p_m)}{n_m - 1} - \frac{p_f(1 - p_f)}{n_f - 1}}{p_m(1 - p_f) + p_f(1 - p_m)}. \tag{3}$$
Here, pm and pf are the allele frequencies in males and females respectively, and nm and nf are the number of males and females respectively. As a measure of sex-differential expression we used Δ as defined in CK16:
$$\Delta = \frac{x_m - x_f}{x_m + x_f}.$$
Here, xm and xf are the TPM-normalized expression levels averaged within males and within females, respectively. Because the estimator in Eq. 3 is site-specific while Δ is gene-wide, we generated a gene-wide estimate of FST by taking a simple average of all sites within a gene body plus 1000bp upstream and downstream. To generate the Twin Peaks curve, we used a 4th-degree polynomial regression of the gene-wide FST estimate on Δ. We note that here and in CK16, this regression ignores variation in heterozygosity across sites (Eq. 1). In our improved model (Results: “No evidence for genome-wide SDS on eQTLs” and Methods: “Applying a new model and method for SDS-expression regression”) we correct this omission.
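The two quantities in this paragraph can be sketched in a few lines. This is an illustration and not the R package used in CK16: the function names are hypothetical, Hudson’s estimator is written in the ratio form given by Bhatia et al. (2013), and Δ is assumed to be the normalized difference (xm − xf)/(xm + xf), which matches the range of −1 to 1 described in the text.

```python
def hudson_fst(p_m, p_f, n_m, n_f):
    """Hudson's FST estimator for one biallelic site (Bhatia et al. 2013 form).

    p_m, p_f: allele frequencies in males and females.
    n_m, n_f: sample sizes in males and females.
    """
    numerator = ((p_m - p_f) ** 2
                 - p_m * (1.0 - p_m) / (n_m - 1)
                 - p_f * (1.0 - p_f) / (n_f - 1))
    denominator = p_m * (1.0 - p_f) + p_f * (1.0 - p_m)
    if denominator == 0:
        return float("nan")  # site is monomorphic for the same allele in both sexes
    return numerator / denominator

def delta(x_m, x_f):
    """Sex difference in expression; assumed form (x_m - x_f) / (x_m + x_f)."""
    total = x_m + x_f
    if total == 0:
        return float("nan")  # gene is not expressed in either sex
    return (x_m - x_f) / total

def gene_fst(site_fsts):
    """Gene-wide FST as a simple (unweighted) mean over sites, as in this analysis."""
    return sum(site_fsts) / len(site_fsts)
```

Note that the sample-size correction terms can make the estimate slightly negative when the two allele frequencies are equal, which is expected behavior for this estimator.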
To compare the original Twin Peaks curve to curves generated under the null of no SDS, we generated an empirical null distribution of 4th-degree regression curves using sex-label permutations of gene expression data. We permuted the ovary and testes labels for each GTEx sample, then recalculated Δ for all genes. We then re-performed the 4th-degree polynomial regression on the new Δ values (the FST values remain unchanged). This was repeated 500 times. By permuting sex labels, we break the association between sex-differential expression and FST, while preserving other gene-level features. To quantitatively evaluate the significance of Twin Peaks against the null distribution, we used the three criteria laid out by Cheng and Kirkpatrick (2016) for classifying a curve as Twin Peaks. Namely, a 4th-degree polynomial must 1) be significant (p < 0.05) for the 4th-degree term by ANOVA, 2) have a negative coefficient for the Δ⁴ term, and 3) have three real roots. Any 4th-degree regression passing all three criteria is classified as Twin Peaks.
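The label-permutation step can be sketched as below. This is a minimal illustration, not the pipeline used here: the data layout (a dict mapping each gene to a list of per-sample expression values), the "M"/"F" label coding, and the function names are assumptions, and the polynomial fitting and the three classification criteria are left out.

```python
import random

def delta_from_samples(values, labels):
    """Δ for one gene from per-sample expression values and 'M'/'F' labels.

    Assumes the gene is expressed in at least one group (nonzero total).
    """
    males = [v for v, s in zip(values, labels) if s == "M"]
    females = [v for v, s in zip(values, labels) if s == "F"]
    x_m = sum(males) / len(males)
    x_f = sum(females) / len(females)
    return (x_m - x_f) / (x_m + x_f)

def permuted_delta_sets(expression, labels, n_perm=500, seed=0):
    """Return n_perm dicts of gene -> Δ, each computed under shuffled labels.

    expression: dict gene -> list of values, in the same sample order as labels.
    FST values are not touched; only Δ is recomputed.
    """
    rng = random.Random(seed)
    shuffled = list(labels)
    null_sets = []
    for _ in range(n_perm):
        # One shuffle per permutation, shared by all genes, so that
        # gene-level features such as expression variance are preserved.
        rng.shuffle(shuffled)
        null_sets.append({gene: delta_from_samples(values, shuffled)
                          for gene, values in expression.items()})
    return null_sets
```

Each returned set of Δ values would then be regressed against the unchanged FST values to produce one null curve.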
Applying a new model and method for SDS-expression regression
In the section “No evidence for genome-wide SDS on eQTLs”, we described a revised method for testing a genome-wide relationship between SDS and Δ. For this analysis, we revised the inference model and data. We used the Non-Finnish European subset in the gnomAD v3 dataset31 to calculate between-sex FST and heterozygosity at each eQTL. This set of samples has an order of magnitude more individuals (average of 31,470 samples per site) compared to the 1000 Genomes data (average of 2,300 samples per site when combining all ancestry groups) originally used in CK16. Additionally, we used the normalized TPM from the current GTEx v8 for expression level data, which includes many more samples (average of 236 male and 168 female samples per tissue) than the pilot (8 male samples and 13 female samples in the gonads)28. This version also provides sample-specific sex labels, enabling us to use additional tissues beyond gonads.
The equation we base our regression on is the model relating FST with δ² shown in Eq. 2. Now, rather than performing a 4th-degree polynomial regression as in the analysis for “Twin Peaks is a Statistical Artifact”, we perform a weighted linear regression of FST on δ² = 4pqΔ², isolating A as the coefficient of regression. Note that by including 4pq in the independent variable, we allow information about heterozygosity to affect the SDS-expression relationship. We weighted each point of the regression by 1/Var(expression), where Var(expression) is the averaged variance in expression of each sex. This should decrease the leverage of points with large |Δ|. To get a gene-wide estimate of FST, we used eQTLs from GTEx as mapped in the v8 study28. For each gene, we chose the single cis-eQTL (within 1Mbp of the gene’s midpoint) with the lowest p-value for association with the gene and used that site for calculating FST, p, and q. By using an eQTL selected this way, we tried to isolate the effect of selection on gene expression to the site presumed to be contributing most to expression changes in that gene.
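The weighted regression described above can be sketched as a closed-form weighted least-squares slope. This is an illustration under stated assumptions, not the fitting code used here: the function names are hypothetical, and the fit includes an intercept, which the text does not specify.

```python
def weighted_slope(x, y, w):
    """Slope of a weighted least-squares line y = a + b*x (intercept assumed)."""
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    s_xy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    s_xx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    return s_xy / s_xx

def estimate_A(p, delta, fst, expr_var):
    """Estimate A as the slope of FST on δ² = 4pqΔ², weighting by 1/Var(expression).

    All arguments are parallel lists with one entry per gene (one lead eQTL each).
    """
    delta_sq = [4.0 * pi * (1.0 - pi) * di ** 2 for pi, di in zip(p, delta)]
    weights = [1.0 / v for v in expr_var]
    return weighted_slope(delta_sq, fst, weights)
```

Because the heterozygosity term 4pq enters the predictor, two genes with the same Δ but different eQTL allele frequencies contribute different δ² values to the fit.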
To determine the significance of our A estimates, we generated 90% null acceptance regions by permuting sex labels. We permuted sex labels as in “Permuting Twin Peaks sex-labels”, such that for each tissue, sex labels are permuted in GTEx expression samples and Δ is recalculated across all genes. Then, for each iteration, A is recalculated by linear regression using the new Δ values (FST and 4pq remain unchanged). The 90% null acceptance region was defined by the 5th and 95th percentiles of 1,000 A values calculated on permuted expression data.
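Given the A values from permuted data, the acceptance region can be sketched as follows. This is an illustration only: the function names are hypothetical, and the percentile interpolation rule (Python’s `statistics.quantiles` with `method="inclusive"`) is an assumption, since the text does not state which rule was used.

```python
import statistics

def null_acceptance_region_90(null_values):
    """Return the (5th, 95th) percentiles of A values from permuted data."""
    # n=20 splits the data into 20 equal-probability groups, giving cut points
    # at 5%, 10%, ..., 95%; the first and last are the bounds needed here.
    cuts = statistics.quantiles(null_values, n=20, method="inclusive")
    return cuts[0], cuts[-1]

def outside_region(a_observed, null_values):
    """True if the observed A falls outside the 90% null acceptance region."""
    lower, upper = null_acceptance_region_90(null_values)
    return a_observed < lower or a_observed > upper
```

An observed estimate of A for a tissue would be treated as significant only if `outside_region` returns True for that tissue’s permuted values.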
Acknowledgements
We thank Jared Cole for comments on the manuscript and other members of the Harpak Lab for helpful conversations. The work was funded by NIH grants GM11685307 to M.K., and R35GM151108 and a Pew Scholarship to A.H.