Sharing GWAS summary statistics results in more citations

Rates of sharing of genome-wide association studies (GWAS) summary statistics are historically low, limiting potential for scientific discovery. Here we show, using GWAS Catalog data, that GWAS papers that share data get on average 81.8% more citations, an effect that is sustained over time.

were so, it would provide an additional incentive, beyond good citizenship, for data sharing. We, therefore, used data from the GWAS Catalog 10 to explore the current sharing landscape of human GWAS summary statistics and to analyse the relationship between sharing and potential citations.

Results
We collected sharing and citation information from 5756 studies with results published in the GWAS Catalog (Supplementary Data 1) 10 . Roughly one in ten (604, 10.5%) had summary statistics available for download. The proportion of summary statistics-sharing studies has increased over the years, especially since 2015, but even in 2021, only 121 out of 578 studies (~21%) shared their summary statistics (Fig. 1). Although we considered the GWAS Catalog as the prime source of GWAS summary statistics, some datasets might be available elsewhere (e.g. authors' or consortium's websites or alternative repositories), causing some studies to be mislabeled as non-sharers. To verify that our measure of sharing (whether the summary statistics were available in the GWAS Catalog) was valid, we manually inspected a random sample of 353 manuscripts (out of 629) from two journals with high levels of GWAS publications, PLoS Genetics and Nature Genetics, for which the GWAS Catalog did not hold summary statistics. We found that 324 (91.7%) did not provide full summary statistics or provided only controlled-access data, 5 (1.4%) claimed to provide access but had links that were either broken or contained no data, and only 24 (6.8%) linked to full summary statistics on non-GWAS Catalog websites (Supplementary Data 2).
Most mislabeled articles (i.e. classified as non-sharers in GWAS Catalog but sharing data elsewhere) in our sample appeared after 2017, indicating that sharing elsewhere also increased over time. We next downloaded the full text of 3317/5152 non-sharer articles in our dataset that were available from PubMed Central and developed a custom search strategy to identify articles sharing data outside GWAS Catalog (see Methods). We found 217 additional sharers, raising the total proportion of data-sharing articles to 14.26% (Fig. 1, Supplementary Data 3).
Satisfied that this was a valid measure, we used logistic regression to study which factors influence sharing. According to the Bayesian Information Criterion (BIC), the optimal model included the year of publication and log-journal impact factor.
Both year (OR = 1.4911 [1.4373-1.5469]) and journal impact factor (log(SJR) OR = 2.6896 [2.4118-2.9993]) have positive effects on sharing, suggesting that sharing has increased over time, and tends to be more frequent in journals with higher reported impact factors (Supplementary Data 4).
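As a reading aid, the reported odds ratios can be related back to the log-odds scale on which a logistic model is fitted. The following is a minimal sketch, not the authors' code: the helper names are hypothetical, the 5% baseline probability is an invented illustration, and compounding the per-year odds ratio assumes year enters the model as a continuous predictor with a constant effect.

```python
import math

OR_YEAR = 1.4911     # reported odds ratio per year of publication
OR_LOG_SJR = 2.6896  # reported odds ratio per unit of log(SJR)


def coef_from_or(odds_ratio: float) -> float:
    """Return the logistic coefficient (log-odds scale) for an odds ratio."""
    return math.log(odds_ratio)


def compounded_or(odds_ratio_per_unit: float, units: float) -> float:
    """Odds ratio implied by a change of `units` in a continuous predictor."""
    return math.exp(coef_from_or(odds_ratio_per_unit) * units)


def apply_or(prob: float, odds_ratio: float) -> float:
    """Shift a baseline probability by an odds ratio; return the new probability."""
    odds = prob / (1.0 - prob) * odds_ratio
    return odds / (1.0 + odds)


# Odds ratio implied by five additional years under the constant-effect assumption
five_year_or = compounded_or(OR_YEAR, 5)

# Hypothetical baseline: a 5% sharing probability, moved forward by one year
next_year_prob = apply_or(0.05, OR_YEAR)
```

Note that an odds ratio multiplies odds, not probabilities, so the change in the probability of sharing depends on the baseline chosen.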
We decided to investigate the impact of sharing on a paper's citations using the relative citation ratio (RCR), which compares the number of citations an article has to the average citation rates of the journals in its co-citation network 19 . In the early years of GWAS, such articles appeared to outperform their co-citation network before a gradual decrease in the median score (towards RCR = 1), except for the most recent complete year, 2021. This bump may reflect incomplete data or a sudden behaviour change (Fig. 2a). As a broad pattern, studies that shared their summary statistics in the GWAS Catalog had consistently higher RCR over the years than their non-sharing counterparts (Fig. 2b). Again, the data from 2021 appeared anomalous, with sharing papers showing only a weak advantage over non-sharing papers.
To try and understand the 2021 data, which had the shortest follow-up time by definition, we analysed the citation patterns of sharing and non-sharing studies over time by year of publication (Fig. 3). On average, GWAS citations rise quickly and stabilise around 2 years after publication. Then citations either stay stable or slowly decrease throughout the following years. However, summary statistics-sharing GWAS citation counts grow faster (Fig. 3a) and sustain higher mean citation counts, regardless of the year of publication (Fig. 3b). Given that the citation advantage of sharing over non-sharing papers takes two or more years to accumulate, we decided to exclude the anomalous data points from 2021 because there had not been sufficient time for them to stabilise.
To analyse the effect of sharing on citations, we first used the BIC to build an optimal linear model of log(RCR) from all considered covariates except sharing status (i.e. year of publication, SJR, publication in one of the top 20 GWAS journals by number of publications, and NLM score). The selected model included the year of publication, log-journal impact factor, and the National Library of Medicine's (NLM) "molecular/cellular" score. The molecular/cellular score represents the proportion of molecular/cellular MeSH terms in the articles' text, used to predict the translation potential of the research 20 . By adding a binary variable describing sharing practice, we concluded that sharing summary statistics has a positive effect on the RCR, providing 81.8% more citations on average than non-sharing articles (RCR ratio = 1.8177 [1.6798-1.967], P < 2e-16, Supplementary Data 5).
We recognised that our custom search was likely to be imperfect, not least because only about 64% of papers (3317 of 5152) had full text available. We estimated the same quantity using GWAS Catalog inferred sharing status, and found the estimated effect of sharing to be very similar (RCR ratio = 1.8438 [1.6858-2.0166]), providing reassurance that our result is robust to remaining mislabelling.
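The step from the RCR ratio to the quoted percentage can be sketched as follows. This is an illustration rather than the authors' code: the function names are hypothetical, and it assumes that log(RCR) uses the natural logarithm and that the bracketed interval is a symmetric 95% Wald interval on the log scale.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def ratio_to_percent(ratio: float) -> float:
    """Express a multiplicative effect as a percentage change."""
    return (ratio - 1.0) * 100.0


def log_scale_summary(ratio: float, ci_low: float, ci_high: float):
    """Recover the log-scale coefficient and an approximate standard error.

    Assumes the interval is symmetric around the coefficient on the log scale.
    """
    beta = math.log(ratio)
    se = (math.log(ci_high) - math.log(ci_low)) / (2.0 * Z_95)
    return beta, se


# Reported estimate and interval for the sharing term
beta, se = log_scale_summary(1.8177, 1.6798, 1.967)

# Back-transforming the coefficient gives the ratio, hence the percentage
pct = ratio_to_percent(math.exp(beta))
```

Because the model is linear in log(RCR), the sharing coefficient acts multiplicatively on the RCR itself, which is why the effect is reported as a ratio.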

Discussion
Data sharing in the life sciences remains a controversial topic. We showed that overall summary statistics sharing rates are low, although we see a remarkable increase in the past 5 years. Many factors not included in this work but analysed elsewhere 21 , such as changes in scientific culture towards sharing, growing incentives from public and private funders, and varying privacy regulations across countries, along with technical difficulties, may influence sharing of GWAS summary statistics and other datasets. This may be further complicated by the multifactorial nature of data in many cases, the lack of clear definitions of what constitutes shared data, and the challenge of verifying the completeness of any dataset. Funders like Wellcome Trust, the NIH, the MRC and the ERC have mandated open-access publishing for articles, but strong mandates on data sharing are still generally lacking, and existing journal policies on data are not consistently enforced 22 . Thus, while data sharing remains reliant on the goodwill and diligence of researchers, both the inertia to changing practice and the effort required may outweigh the limited incentives, leaving data unshared.
Citations are imperfect yet crucial metrics for evaluating research impact, which affects hiring decisions and career prospects. We hypothesised that sharing GWAS summary statistics may positively affect citations by allowing other scientists to conduct research using shared data and, in turn, cite the original research. Indeed, we observed a consistent pattern of increased citation rates over time, and by using linear models, we estimated that sharing increased citation rates by 81.8% on average, an estimate slightly higher than the 68% increase in citations found in a study of microarray data sharing >15 years ago 23 , and much higher than the 25% increase predicted in papers linking to more general biological data repositories 24 .
Our analysis of 353 GWAS papers that did not use the GWAS Catalog revealed that most studies did not share data at all or shared either restricted access and/or incomplete data (e.g. only top significant hits), which hampers reuse. Only 24 articles shared full summary statistics without controlled access or request requirements using alternative repositories, and five provided links that did not work anymore. An additional, broader analysis including all 3317 non-sharing papers for which full text was available provided 217 mislabeled sharers, although the estimated effect of sharing was similar to that using GWAS Catalog sharing only. These results highlight that the GWAS Catalog has become the de facto standard for unrestricted summary statistic sharing as well as a reliable, future-proof data storage platform. Therefore, we encourage authors to use standard repositories like GWAS Catalog whenever possible.
Finally, the field of GWAS has been focused on studies of white European subjects conducted by authors based in North American or European institutions, reflecting both early concerns of ancestry or admixture confounding and concentration of scientific funding in these regions 25-27 . This has led to a well-documented understudy of diverse populations (see https://gwasdiversitymonitor.com/ for a visual approach to the issue), and the data that is now accruing demonstrates the value of studying the whole human population to have better coverage of all human variation as well as to enable equitable benefits as GWAS findings begin to have clinical impacts 28 . The data we use reflect this history, and thus cannot be considered to reflect the impact of data sharing on citations of studies of underrepresented populations, although we do expect the direction of the effect would also be positive.
Whilst our work shows that there can be a direct benefit to the authors for sharing data, further work is needed to properly understand the other barriers to sharing, and to allow that these barriers may be different in studies of under-represented populations, to more fully support wider sharing of GWAS data for the benefit of all.
While appreciating the issue's complexity, we support the implementation of more data-sharing mandates and recognition-based incentives, such as alternative metrics to promote data-sharing work, independent of journal of publication, as well as the inclusion of data generation and stewardship on researchers' CVs 29,30 . We also agree with other authors that the nature of increasingly large and more complex datasets will require improved training on data stewardship 13 .
We consider that the strongest incentive for scientists to share data is good citizenship because data sharing increases the ability of all of us to make discoveries through meta-analysis or integrative studies, thus accelerating scientific knowledge. However, and despite the observed recent trend changes, that incentive alone is clearly insufficient because papers sharing data remain a minority. We hope the robust evidence here that data sharing can increase citations independent of the journal of publication will provide further incentives and that we will see sharing of summary statistics continue to increase in the coming years.

Methods
Analyses. The GWAS Catalog 10 is an established and high-quality repository of curated human GWAS results, providing easy access to summary statistics made public by authors (via curator inclusion or author submission). Its large coverage (400,000+ associations from 5690 publications as of May 2022) and its easy-to-access statistics make it an ideal reference database for our analyses. Hence, we downloaded the full list of studies and available summary statistics in GWAS Catalog on 26th May 2022.
We fetched citation information for each study from NIH's database using iCiteR v0.2.1 31 , a wrapper for NIH's iCite API 32 . To quantify citations, here we focused on relative citation ratio (RCR), an improved metric to quantify the influence of a research article by using co-citation networks to field-normalise the number of citations 19 . We also used iCiteR to retrieve the number of citations each study received each year.
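The retrieval step can be sketched in Python, although the study itself used the iCiteR R package. The sketch below is an assumption-laden illustration: the endpoint URL and the JSON field names (`data`, `pmid`, `relative_citation_ratio`) reflect our understanding of the public iCite API and should be checked against its documentation, the function names are hypothetical, and the payload is a mock so that no network access is needed.

```python
import json
from urllib.parse import urlencode

# Assumed iCite endpoint; verify against the current API documentation
ICITE_ENDPOINT = "https://icite.od.nih.gov/api/pubs"


def build_icite_url(pmids):
    """Build a query URL for a batch of PubMed identifiers."""
    query = urlencode({"pmids": ",".join(str(p) for p in pmids)}, safe=",")
    return ICITE_ENDPOINT + "?" + query


def extract_rcr(payload_text):
    """Map each PMID to its relative citation ratio (None when not reported)."""
    payload = json.loads(payload_text)
    return {
        record["pmid"]: record.get("relative_citation_ratio")
        for record in payload.get("data", [])
    }


# Mock response with invented values, shaped like the assumed API output
mock_payload = json.dumps(
    {
        "data": [
            {"pmid": 111, "relative_citation_ratio": 2.5},
            {"pmid": 222, "relative_citation_ratio": None},
        ]
    }
)

url = build_icite_url([111, 222])
rcr_by_pmid = extract_rcr(mock_payload)
```

Keeping URL construction separate from parsing means the parsing logic can be checked offline before any real request is made.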
Despite not being an appropriate indicator for the individual quality of a given paper, journal impact factor can affect citations via journal visibility and prestige. We retrieved 2021 SJR (SCimago Journal Rank) scores to assess overall journal prestige 33,34 . There were 723 journals in our dataset, of which 691 had SJR data available for at least 1 year. Those [...] iCite tool uses Medical Subject Headings (MeSH) terms in articles' text to predict the potential for translation of research 20 . The tool provides scores that represent the proportion of terms that can be classified within three overarching branches of the MeSH ontology: Human, Animal, and Molecular/Cellular.
For each set of models, we sequentially added and removed predictors, using the BIC to choose the optimal model. For (1), this procedure selected the logistic model

$$\operatorname{logit}\,P(\mathrm{pSS} = 1) = \beta_0 + \beta_1\,\mathrm{year} + \beta_2\,\mathrm{lSJR},$$

where pSS stands for public summary statistics dataset available, encoded as [0, 1], year is the year of online publication (2007–2020), and lSJR is the logarithm of the SJR score, log(SJR). For (2), we selected covariates excluding pSS, which produced the baseline linear model

$$\log(\mathrm{RCR}) = \alpha_{\mathrm{year}} + \gamma_1\,\mathrm{lSJR} + \gamma_2\,\mathrm{molcel} + \varepsilon,$$

where molcel corresponds to the NLM molecular/cellular score, which contributed to model fit. We compared this baseline to

$$\log(\mathrm{RCR}) = \alpha_{\mathrm{year}} + \gamma_1\,\mathrm{lSJR} + \gamma_2\,\mathrm{molcel} + \gamma_3\,\mathrm{pSS} + \varepsilon$$

to quantify the effect of sharing on log(RCR). In this case, modelling year as a factor (hence the year-specific intercept $\alpha_{\mathrm{year}}$), rather than as a continuous variable, improved model fit.
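The add-and-remove search guided by the BIC can be sketched as a greedy procedure. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `top20` covariate label, the sample size, and the toy log-likelihood values are all hypothetical, and in practice the scoring function would refit the model for each candidate set of predictors.

```python
import math
from typing import Callable, FrozenSet, Iterable


def bic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """Bayesian Information Criterion: k * ln(n) - 2 * ln(L); lower is better."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


def stepwise_bic(
    candidates: Iterable[str],
    score: Callable[[FrozenSet[str]], float],
) -> FrozenSet[str]:
    """Greedy search: repeatedly add or remove the one predictor that lowers BIC most."""
    pool = frozenset(candidates)
    current: FrozenSet[str] = frozenset()
    best = score(current)
    while pool:
        # Every model that differs from the current one by exactly one predictor
        neighbours = [current ^ frozenset({p}) for p in pool]
        cand_score, cand = min(
            ((score(m), m) for m in neighbours), key=lambda pair: pair[0]
        )
        if cand_score >= best:
            break  # no single addition or removal improves the BIC
        current, best = cand, cand_score
    return current


def toy_score(model: FrozenSet[str]) -> float:
    """Hypothetical scorer: invented log-likelihood gains for each predictor."""
    gains = {"year": 40.0, "lSJR": 60.0, "top20": 1.0}
    log_lik = -1000.0 + sum(gains[p] for p in model)
    return bic(log_lik, n_params=len(model) + 1, n_obs=5000)


selected = stepwise_bic(["year", "lSJR", "top20"], toy_score)
```

In this toy setting the weakly informative `top20` term is left out, because its small gain in log-likelihood does not offset the ln(n) penalty that the BIC charges for each extra parameter.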
While we expect the manually curated GWAS Catalog to contain most publicly available summary statistics datasets, authors can choose to share their data on a different platform (e.g. their own or consortium's website, Dryad, or GWAS archive), posing a potential bias in our analysis. To explore this scenario, we randomly selected 50% of studies labelled as non-sharers in two of the journals with most published GWAS (PLoS Genetics (100 studies) and Nature Genetics (253 studies)) and manually checked whether their summary statistics were listed in the manuscript as freely available elsewhere and whether the statistics still resided at any such URL. We noted that most mislabeled articles in our sample appeared after 2017. We broadened our analysis by checking for full-text availability on PubMed Central for 5152 non-sharer articles (Supplementary Data 3) and downloading the full text for 3317 where it was available. We developed a custom search strategy to identify sharing articles, matching phrases such as "available for download", "available at figshare" and more complex patterns.
Where the text search suggested data was available via dbGaP, we confirmed that data was freely available (i.e. not via a data access committee) by confirming the dbGaP identifier contained files in the "analyses" subdirectory according to index file https://ftp.ncbi.nlm.nih.gov/dbgap/studies/Ftp_Table_of_Contents.zip downloaded on 25 October 2022. Full code for performing this search is at https://github.com/chr1swallace/data-sharing-search.
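A phrase-matching search of this kind can be sketched with regular expressions. The sketch below is illustrative only and is not the search in the linked repository: the first two patterns paraphrase the phrases quoted above, while the third "more complex" pattern, the function names, and the example sentences are hypothetical.

```python
import re

SHARING_PATTERNS = [
    # Phrases quoted in the text, made tolerant of case and whitespace
    re.compile(r"available\s+for\s+download", re.IGNORECASE),
    re.compile(r"available\s+(?:at|on|from|via)\s+figshare", re.IGNORECASE),
    # Hypothetical longer pattern: "summary statistics ... are available"
    # within a single sentence (no full stop in between)
    re.compile(
        r"summary\s+statistics[^.]{0,120}?\b(?:are|is)\s+"
        r"(?:freely\s+|publicly\s+)?available",
        re.IGNORECASE,
    ),
]


def find_sharing_phrases(text: str) -> list:
    """Return every matched phrase so that hits can be reviewed manually."""
    hits = []
    for pattern in SHARING_PATTERNS:
        hits.extend(match.group(0) for match in pattern.finditer(text))
    return hits


def looks_like_sharer(text: str) -> bool:
    """Flag an article whose full text contains at least one sharing phrase."""
    return bool(find_sharing_phrases(text))


# Invented example sentences
flag_a = looks_like_sharer("Full summary statistics are available for download.")
flag_b = looks_like_sharer("GWAS results are AVAILABLE AT FigShare.")
flag_c = looks_like_sharer("Data are available on request from the authors.")
```

Returning the matched phrases, rather than only a yes/no flag, makes it easier to inspect false positives such as statements about data available on request.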
Reporting summary. Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability
All source and generated data underlying figures in this study are available in two Zenodo repositories, one containing the main data analysis (https://doi.org/10.5281/zenodo.7516613) 36 and another containing the extended search for sharing outside GWAS Catalog (https://doi.org/10.5281/zenodo.7516708) 37 . These repositories contain links and information about how the source data was obtained. GWAS Catalog accessions and PubMed Identifiers for all GWAS Catalog studies included in our analysis are available in Supplementary Data 1.

Code availability
All code used in this work is publicly available without restriction in two Zenodo repositories, one containing the main data analysis (https://doi.org/10.5281/zenodo.7516613) 36 and another containing the extended search for sharing outside GWAS Catalog (https://doi.org/10.5281/zenodo.7516708) 37 . These repositories contain the scripts and datasets used to generate all figures, results, and supplementary tables.