Addressing Overlapping Sample Challenges in Genome-Wide Association Studies: Meta-Reductive Approach

Polygenic risk scores (PRS) are instrumental in genetics, offering insights into an individual's genetic risk for a range of diseases based on accumulated genetic variations. These scores rely on Genome-Wide Association Studies (GWAS). However, precision in PRS is often challenged by the requirement of extensive sample sizes and the potential for overlapping datasets that can inflate PRS calculations. In this study, we present a novel methodology, the Meta-Reductive Approach (MRA), that was derived algebraically to adjust GWAS results, aiming to neutralize the influence of select cohorts. Our approach recalibrates summary statistics using algebraic derivations. Validating our technique with datasets from Alzheimer’s disease studies, we observed perfect correlation between the summary statistics from the proposed approach and those from a “leave-one-out” strategy. This method offers a promising avenue for enhancing the accuracy of PRS, especially when derived from meta-analyzed GWAS data.


Introduction
Polygenic risk scores (PRS) have emerged as an essential tool in the field of genetics 1,2. These scores offer a unique insight into an individual's genetic predisposition to a wide array of diseases and traits, capturing the cumulative effects of multiple genetic variants 3. Genome-Wide Association Studies (GWAS) serve as the basis for creating PRS 4. GWAS investigate the entire genetic makeup of individuals to identify genetic variations associated with specific diseases or traits. The predictive accuracy and precision of PRS are enhanced when the base GWAS summary statistics come from a sizeable sample, and the population in the GWAS matches the population where the PRS is being applied 4,5. Due to this need for a substantial sample size, studies often aim to meta-analyze all available genetic datasets to achieve the statistical power necessary for identifying genetic markers linked to the trait or disease. However, this approach presents a challenge in securing independent datasets for training, testing, and validating PRS performance 6. The use of overlapping samples can inflate the PRS calculations, resulting in imprecise risk predictions.
A logical approach might be to exclude a specific cohort of interest and then rerun meta-analyses with the remaining datasets. However, given the significant computational resources needed and the difficulties in accessing detailed summary statistics for all cohorts, this isn't always viable. Nonetheless, we do have access to the cohort-level data for the specific dataset we aim to employ as a training and testing set.
Recognizing this advantage, we formulated an alternative technique that incorporates the cohort-level result of our chosen dataset along with the meta-analysis GWAS findings. The goal is to neutralize the impact of the overlapping cohort of interest on the meta-analysis GWAS summary statistics, thus producing a PRS that avoids the inflationary tendencies arising from overlapping samples.
In this study, we derived equations to adjust GWAS results, effectively eliminating the impact of selected cohorts. Through comprehensive simulations and real data analysis, we demonstrated that our methodology effectively updates the base data's summary statistics, thereby addressing the challenge.

Derivation of Adjusted Summary Statistics: Meta-Reductive Approach
We analyzed two distinct sets of summary statistics:
1. A compilation from n datasets meta-analyzed using an inverse variance-based approach 7.
2. A specific dataset of interest that was also part of the meta-analysis.

For these datasets:
$B$ and $SE$ symbolize the effect size and standard error, respectively, from the aggregate meta-analysis across n datasets.
$\beta_i$ and $SE_i$ specify the effect size and standard error for the individual cohort i.
Our primary aim was to compute a summary statistic that eliminates the influence of the dataset of interest, providing a clearer perspective on the overarching genetic structure.

i. Inverse-Variance-Weighted Effect-Size Estimation
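The inverse variance method gives more weight to studies with smaller variance because they offer more precise estimates. The weight, $w_i$, is the inverse of the variance, or squared standard error, of the effect size, $\beta_i$. Given

$$B = \frac{\sum_{i=1}^{n} w_i \beta_i}{\sum_{i=1}^{n} w_i},$$

where $w_i = \frac{1}{SE_i^2}$.

Expanding this, with the dataset of interest indexed as dataset n:

$$B \sum_{i=1}^{n} w_i = \sum_{i=1}^{n-1} w_i \beta_i + w_n \beta_n$$

This is the weighted sum of the effect sizes across all datasets, including the one of interest. Now, to remove the effect of the specific dataset, $\beta_n$, we rearrange:

$$\sum_{i=1}^{n-1} w_i \beta_i = B \sum_{i=1}^{n} w_i - w_n \beta_n$$

Which yields:

$$\frac{\sum_{i=1}^{n-1} w_i \beta_i}{\sum_{i=1}^{n-1} w_i} = \frac{B \sum_{i=1}^{n} w_i - w_n \beta_n}{\sum_{i=1}^{n-1} w_i}$$

This equation essentially adjusts the overall effect size, $B$, by subtracting the influence of the dataset of interest.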
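As an illustration of the inverse-variance weighting described above, the following is a minimal Python sketch of a fixed-effect meta-analysis for a single marker. The function name `ivw_meta` and the example numbers are hypothetical and are not taken from the study.

```python
import math


def ivw_meta(betas, ses):
    """Fixed-effect inverse-variance-weighted meta-analysis for one marker.

    betas: per-cohort effect sizes (beta_i)
    ses:   per-cohort standard errors (SE_i)
    Returns the pooled effect size and its standard error.
    """
    if len(betas) != len(ses) or not betas:
        raise ValueError("betas and ses must be non-empty and of equal length")
    # w_i = 1 / SE_i^2: more precise cohorts receive larger weights.
    weights = [1.0 / se ** 2 for se in ses]
    total_weight = sum(weights)
    # Pooled effect: weighted mean of the cohort effect sizes.
    beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    # Pooled standard error: sqrt(1 / sum of weights).
    se = math.sqrt(1.0 / total_weight)
    return beta, se


# Hypothetical example: three cohorts, the third being less precise.
beta_meta, se_meta = ivw_meta([0.10, 0.20, 0.30], [0.10, 0.10, 0.20])
print(beta_meta, se_meta)  # approximately 0.1667 and 0.0667
```

The third cohort has twice the standard error of the others, so it receives a quarter of their weight and pulls the pooled estimate only slightly toward 0.30.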

ii. Standard Error Derivation
The standard error (SE) offers a measure of the statistical accuracy of an estimate. Here, we adjust the SE based on the weights of all datasets excluding the one of interest. Using:

$$SE = \sqrt{\frac{1}{\sum_{i=1}^{n} w_i}}$$

We derive:

$$\sum_{i=1}^{n-1} w_i = \frac{1}{SE^2} - w_n = \frac{1}{SE^2} - \frac{1}{SE_n^2}$$

This equation gives the combined weight of all datasets, excluding the dataset of interest.
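A short worked example may make the weight subtraction concrete. The numbers are hypothetical: suppose the meta-analysis standard error is $SE = 0.05$ and the dataset of interest (dataset n) has $SE_n = 0.10$, with each weight defined as the inverse of the squared standard error.

```latex
\begin{aligned}
\sum_{i=1}^{n} w_i &= \frac{1}{SE^2} = \frac{1}{0.05^2} = 400 \\
w_n &= \frac{1}{SE_n^2} = \frac{1}{0.10^2} = 100 \\
\sum_{i=1}^{n-1} w_i &= 400 - 100 = 300
\end{aligned}
```

The remaining weight is positive only when $SE_n > SE$, which holds whenever the dataset of interest is one of several contributors to the meta-analysis.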

iii. Adjusted Effect Size and Standard Error
After removing the influence of the dataset of interest, the modified effect size is given by:

$$\beta_{adj} = \frac{\frac{B}{SE^2} - \frac{\beta_n}{SE_n^2}}{\frac{1}{SE^2} - \frac{1}{SE_n^2}}$$

This adjusted beta, $\beta_{adj}$, has the contribution of the specific dataset n nullified.
Additionally, the adjusted standard error is:

$$SE_{adj} = \sqrt{\frac{1}{\frac{1}{SE^2} - \frac{1}{SE_n^2}}}$$

This adjustment ensures that the standard error reflects the precision of our new effect size estimate, free from the influence of the specific dataset.
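The adjustment can be sketched in a few lines of Python. This is an illustrative implementation under the assumption of a fixed-effect inverse-variance meta-analysis; the function name `mra_adjust`, its argument names, and the example values are hypothetical rather than the authors' code.

```python
import math


def mra_adjust(beta_meta, se_meta, beta_n, se_n):
    """Remove one cohort's contribution from meta-analysis summary statistics.

    beta_meta, se_meta: effect size and standard error from the full meta-analysis
    beta_n, se_n:       effect size and standard error of the cohort to remove
    Returns the adjusted effect size and adjusted standard error.
    """
    w_all = 1.0 / se_meta ** 2   # total weight of all n datasets
    w_n = 1.0 / se_n ** 2        # weight of the dataset of interest
    w_rest = w_all - w_n         # combined weight of the remaining datasets
    if w_rest <= 0:
        # The cohort cannot carry as much weight as the meta-analysis it is part of.
        raise ValueError("se_n must be larger than se_meta")
    beta_adj = (beta_meta * w_all - beta_n * w_n) / w_rest
    se_adj = math.sqrt(1.0 / w_rest)
    return beta_adj, se_adj


# Hypothetical example: three equally precise cohorts (SE = 0.10) with
# effect sizes 0.10, 0.20 and 0.30 give a pooled beta of 0.20 and
# SE = sqrt(1/300). Removing the third cohort should recover the pooled
# result of the first two: beta = 0.15, SE = sqrt(1/200).
beta_adj, se_adj = mra_adjust(0.20, math.sqrt(1.0 / 300.0), 0.30, 0.10)
print(beta_adj, se_adj)  # approximately 0.15 and 0.0707
```

The explicit check on the remaining weight guards against inputs where the cohort was not actually part of the meta-analysis, or where rounding in published summary statistics makes the subtraction non-positive.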

Validation:
To validate our methodological approach, we utilized summary statistics from four publicly accessible Alzheimer's disease studies: Kunkle et al. 8, Kunkle et al. 9 AA, Bellenguez et al. 10, and Moreno-Grau S. et al. 11. From these studies, 100,000 markers were selected to conduct a meta-analysis using the METASOFT software 12.
Following the initial meta-analysis, we applied a systematic "leave-one-out" strategy. For each iteration, we excluded the summary statistics from one dataset and conducted a meta-analysis of the remaining three. The results from this procedure served as our individual-level data for the three datasets in question.
For the final step of validation, we calculated the adjusted effect-size and standard-error values based on our proposed method and compared them against the individual-level data derived from the "leave-one-out" meta-analyses. Our results showed a perfect positive correlation between the summary statistics obtained with the "leave-one-out" strategy and those obtained with our approach. Figure 1 illustrates this correlation for both effect size and standard error.
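The comparison logic can be illustrated on simulated data. The sketch below is not a reproduction of the validation above: it uses randomly generated effect sizes and standard errors instead of the Alzheimer's disease summary statistics, and a plain fixed-effect inverse-variance meta-analysis instead of METASOFT. All function names and parameter values are hypothetical, and the helper functions are restated so the block runs on its own.

```python
import math
import random


def ivw_meta(betas, ses):
    # Fixed-effect inverse-variance-weighted pooling for one marker.
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    return beta, math.sqrt(1.0 / total)


def mra_adjust(beta_meta, se_meta, beta_n, se_n):
    # Subtract one cohort's weighted contribution from the pooled result.
    w_rest = 1.0 / se_meta ** 2 - 1.0 / se_n ** 2
    beta_adj = (beta_meta / se_meta ** 2 - beta_n / se_n ** 2) / w_rest
    return beta_adj, math.sqrt(1.0 / w_rest)


def pearson(x, y):
    # Pearson correlation coefficient, written out to avoid extra dependencies.
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def compare_mra_to_leave_one_out(n_markers=1000, n_cohorts=4, left_out=0, seed=0):
    rng = random.Random(seed)
    mra_beta, loo_beta, mra_se, loo_se = [], [], [], []
    for _ in range(n_markers):
        # Simulated per-cohort summary statistics for one marker.
        betas = [rng.gauss(0.0, 0.05) for _ in range(n_cohorts)]
        ses = [rng.uniform(0.02, 0.10) for _ in range(n_cohorts)]
        # Full meta-analysis, then algebraic removal of one cohort.
        beta_all, se_all = ivw_meta(betas, ses)
        b_adj, s_adj = mra_adjust(beta_all, se_all, betas[left_out], ses[left_out])
        # Reference: rerun the meta-analysis without that cohort.
        keep = [i for i in range(n_cohorts) if i != left_out]
        b_loo, s_loo = ivw_meta([betas[i] for i in keep], [ses[i] for i in keep])
        mra_beta.append(b_adj)
        loo_beta.append(b_loo)
        mra_se.append(s_adj)
        loo_se.append(s_loo)
    max_diff = max(
        max(abs(a - b) for a, b in zip(mra_beta, loo_beta)),
        max(abs(a - b) for a, b in zip(mra_se, loo_se)),
    )
    return pearson(mra_beta, loo_beta), pearson(mra_se, loo_se), max_diff


r_beta, r_se, max_diff = compare_mra_to_leave_one_out()
print(r_beta, r_se, max_diff)
```

Under these assumptions the two routes are algebraically identical, so any difference between them reflects floating-point rounding only. With real summary statistics, rounding in the published values or a different meta-analysis model could introduce small discrepancies.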

Figure 1. Comparison between the adjusted results from the Meta-Reductive Analysis (MRA) (Beta_adj