COVID-19-related research data availability and quality according to the FAIR principles: A meta-research study

Background According to the FAIR principles, scientific research data should be Findable, Accessible, Interoperable, and Reusable. The COVID-19 pandemic has led to massive research activity and an unprecedented number of topical publications in a short time. To date, no evaluation has assessed whether COVID-19-related research data comply with the FAIR principles (FAIRness). Objective Our objective was to investigate the availability of open data in COVID-19-related research and to assess compliance with FAIRness. Methods We conducted a comprehensive search and retrieved all open-access articles related to COVID-19 from journals indexed in PubMed, available in the Europe PubMed Central database, published from January 2020 through June 2023, using the metareadr package. Using rtransparent, a validated automated tool, we identified articles that included a link to their raw data hosted in a public repository. We then screened the links and included only repositories containing data specific to their corresponding paper. Subsequently, we automatically assessed the adherence of the repositories to the FAIR principles using the FAIRsFAIR Research Data Object Assessment Service (F-UJI) and the rfuji package. The FAIR scores ranged from 1 to 22 and comprised four components. We report descriptive analyses for each article type, journal category, and repository. We used linear regression models to identify the factors most influencing the FAIRness of data. Results 5,700 URLs, each pointing to data shared in a general-purpose repository, were included in the final analysis. The mean (standard deviation, SD) level of compliance with FAIR metrics was 9.4 (4.88). The percentages of moderate or advanced compliance were as follows: Findability: 100.0%, Accessibility: 21.5%, Interoperability: 46.7%, and Reusability: 61.3%. The overall and component-wise monthly trends were consistent over the follow-up.
Reviews (9.80, SD=5.06, n=160), articles in dental journals (13.67, SD=3.51, n=3), and data deposited in Harvard Dataverse (15.79, SD=3.65, n=244) had the highest mean FAIRness scores, whereas letters (7.83, SD=4.30, n=55), articles in neuroscience journals (8.16, SD=3.73, n=63), and data deposited in GitHub (4.50, SD=0.13, n=2,152) showed the lowest. Regression models showed that the most influential factor on FAIRness scores was the repository (R²=0.809). Conclusion This study underscores the potential for improvement across all facets of the FAIR principles, with particular emphasis on enhancing the Interoperability and Reusability of data shared in general-purpose repositories during the COVID-19 pandemic.


Introduction
The COVID-19 pandemic introduced a significant shift in the scientific publishing ecosystem, catalyzed by the urgency of sharing findings in a rapidly evolving global health crisis (1). This led to an unprecedented proliferation of preprint publications and open-access materials, allowing researchers worldwide to freely access both peer-reviewed and non-peer-reviewed findings (2). Open-access publications are just one aspect of a larger, comprehensive movement: open science (3). Some funders and journals, such as CIHR (4), NIH (5), BMJ (6), and PLOS (7), have aligned themselves with open science. Central to open science are three components: open protocols, open-access publications, and open data; collectively, they enhance transparency, collaboration, and dissemination (8). Data openness is the cornerstone of research validation and replication, fortifying scientific credibility (9). Precise, exhaustive datasets form the bedrock on which scientific conclusions rest and inform the development of further research. In contrast, a major issue during the COVID-19 pandemic was the lack of high-quality, timely, and reliable data, partially feeding the burgeoning "infodemic" (10), in which an excessive amount of information, including false or misleading content, circulates in digital and physical spaces. Inaccurate or insufficient data can lead to skepticism and mistrust toward research findings, eroding public confidence and impeding a science-informed response (11).
To optimally utilize open research data, it must align with the FAIR principles, i.e., data should be Findable, Accessible, Interoperable, and Reusable (12). These criteria foster better data utility, extending its applicability beyond the original work and facilitating the exploration of different theories, the substantiation of claims, the probing of debates, the prevention of unnecessary duplication, and the derivation of fresh knowledge from existing data (13). While privacy concerns may impede complete data openness, sharing metadata can be a partial but meaningful substitute (14). Metadata can provide insights into the nature and structure of the data, facilitating interpretation and usability (15). Notably, recent studies demonstrate that data sharing, the first requirement for open data, remains sparse in medical research (16), and that shared data often fail to meet the FAIR principles (17). Clarifying the quality of the data generated throughout the COVID-19 pandemic is essential for assessing research integrity, improving decision-making, gaining public trust, and ensuring future preparedness (18). Thus, this study aimed to assess the adherence of COVID-19-related research data to the FAIR principles (FAIRness), a critical step towards improving data quality and trust in scientific outputs.

Methods
The protocol of this study was deposited on the Open Science Framework (OSF) website prior to beginning the study (https://doi.org/10.17605/OSF.IO/XAYP9) (21)(22)(23). As we were interested in subgroup analyses by study type, we further used EPMC's pubType column to detect reviews ("review|systematic review|meta-analysis|review-article"), research articles ("research-article"), and letters ("letter"). Since EPMC's categorization of randomized trials was deemed insufficiently sensitive (24)(25)(26) and provided no category for observational studies, we used the L•OVE (Living OVerview of Evidence, https://iloveevidence.com) database to detect randomized trials and observational studies and classify them as such. L•OVE, powered by the Epistemonikos Foundation, is an open platform that maps and organizes the best evidence in various medical and health sciences fields (27). We applied the "Reporting data" filter on the L•OVE website to detect the PMIDs of RCTs in our dataset. Then, we downloaded all identified open-access COVID-19-related records in XML full-text format from the EPMC database using the metareadr package (28).

Data Extraction
We used the rtransparent package (29) to programmatically assess data availability in the included studies. The reliability of this package has previously been validated, with an accuracy of 94.2% (89.7%-97.9%) in detecting the data availability of assessed papers (30). rtransparent uses the oddpub package (31) to detect data-sharing statements in the XML files of papers. Briefly, oddpub uses regular expressions to identify whether an article mentions a) a general database in which data are frequently deposited (e.g., figshare); b) a field-specific database in which data are frequently deposited (e.g., dbSNP); c) online repositories in which data/code are frequently deposited (e.g., GitHub); d) language referring to commonly shared file formats (e.g., csv); e) language referring to the availability of data as a supplement (e.g., "supplementary data"); and f) language referring to the presence of a data sharing statement (e.g., "data availability statement"). It then checks whether these were mentioned in the context of positive statements (e.g., "can be downloaded") or negative statements (e.g., "not deposited") to produce its final adjudication. This adjudication indicates whether a data sharing statement is present and which aspect of data sharing was detected (e.g., mention of a public database), and extracts the phrase in which this was detected.
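oddpub is an R package; as a rough illustration of this regex-based adjudication, a minimal Python sketch could look as follows. The keyword lists here are hypothetical and heavily abbreviated, not oddpub's actual validated patterns:

```python
import re

# Hypothetical, abbreviated patterns for illustration only;
# the real oddpub package uses a much larger, validated set.
REPO_PATTERN = re.compile(r"\b(figshare|dbSNP|GitHub|zenodo|dryad)\b", re.IGNORECASE)
POSITIVE_PATTERN = re.compile(
    r"\b(can be downloaded|are available|is available|deposited in)\b", re.IGNORECASE)
NEGATIVE_PATTERN = re.compile(
    r"\b(not deposited|not available|upon request)\b", re.IGNORECASE)

def detect_data_sharing(text: str) -> dict:
    """Roughly adjudicate whether `text` contains a positive
    data-sharing statement that mentions a known repository."""
    mentions_repo = REPO_PATTERN.search(text) is not None
    negative = NEGATIVE_PATTERN.search(text) is not None  # negation overrides
    positive = POSITIVE_PATTERN.search(text) is not None and not negative
    return {
        "repository_mentioned": mentions_repo,
        "open_data": mentions_repo and positive,
    }
```

Note that a negative phrase overrides any positive match, mirroring the context check described above.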
Our previous study showed low FAIRness of data provided in field-specific databases and supplements (17). This stems from missing properties that reduce FAIRness, such as the lack of an identifier for the dataset, non-machine-readable metadata, and the use of non-general file formats in field-specific databases. Therefore, we focused on studies that provided a link to a public database for their data. This also reduced work that would have added little to our study and helped automate the workflow.
After filtering the studies that provided their data in a general-purpose repository (limited to those defined and detected by the oddpub package; the list of these repositories is available in Appendix 2), we searched their full-text XML files for URLs to their datasets.
To do this, we used keywords related to general-purpose databases and identified every URL that contained one of them; these keywords are available in Appendix 3. After obtaining all candidate dataset URLs, we manually screened the links. We included a URL only when it belonged to that specific study, i.e., we excluded URLs to general datasets.
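As an illustration of this keyword-based URL harvesting, a minimal Python sketch could be written as below. The keyword list here is a hypothetical stand-in for the full list in Appendix 3:

```python
import re

# Illustrative keywords only (assumed); the study's full list is in Appendix 3.
REPO_KEYWORDS = ["github.com", "zenodo.org", "figshare.com", "osf.io", "dataverse"]

# Match URLs up to whitespace, quotes, or XML delimiters.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>)]+")

def extract_repository_urls(xml_text: str) -> list[str]:
    """Return every URL in a full-text XML string that points at a
    general-purpose repository, judged by a simple keyword match."""
    urls = URL_PATTERN.findall(xml_text)
    return [u for u in urls if any(k in u.lower() for k in REPO_KEYWORDS)]
```

In the study itself, the URLs surviving this automated pass were then screened manually, since keyword matching alone cannot tell a study-specific deposit from a link to a general dataset.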

FAIRness assessment
The FAIR principles comprise four main components describing how shared data/metadata should be: Findable, Accessible, Interoperable, and Reusable (12). These four components are divided into 10 subcomponents (F1-F4, A1-A2, I1-I3, and R1) and five sub-subcomponents. To measure the level of FAIRness, a tool named FAIRsFAIR Research Data Object Assessment Service (F-UJI) has been developed by the FAIRsFAIR project (32). F-UJI is a web service that programmatically assesses the FAIRness of research data objects based on metrics developed by the FAIRsFAIR project. It checks each component and subcomponent of FAIRness and assigns a score for each metric as well as an overall score. The lowest score for each component is 0 and the highest ranges from 3 to 8; the overall score ranges from 1 (because all assessed objects have URLs, FsF-F1-01D=1) to 22. The metrics, scores, and definitions of each metric are illustrated in Appendix 4.
After finalizing the URLs, we used the F-UJI tool to automatically assess the FAIRness of each dataset. This software is Python-based. We used the rfuji package (33), an R application programming interface (API) client for F-UJI. The workflow for running each piece of software is available in Appendix 5.
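The study drove F-UJI through the rfuji R client; for illustration, a minimal Python sketch of the same idea is shown below. It assumes a locally hosted F-UJI instance on port 1071 (the port used in F-UJI's own documentation), omits authentication, and uses the example DOI purely as a placeholder:

```python
import json
from urllib import request

# Assumed local F-UJI deployment; adjust host/port to your instance.
FUJI_ENDPOINT = "http://localhost:1071/fuji/api/v1/evaluate"

def build_fuji_request(dataset_url: str) -> dict:
    """Assemble the JSON body for one F-UJI assessment request."""
    return {"object_identifier": dataset_url, "use_datacite": True}

def assess_fairness(dataset_url: str) -> dict:
    """POST one dataset URL to a running F-UJI service and return the
    parsed JSON assessment (requires a live F-UJI instance)."""
    body = json.dumps(build_fuji_request(dataset_url)).encode()
    req = request.Request(FUJI_ENDPOINT, data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    # Placeholder identifier, not a real deposit from the study.
    print(build_fuji_request("https://example.org/dataset"))
```

Iterating such a call over the screened URL list yields one machine-readable FAIRness report per dataset, which can then be reduced to the per-component scores analyzed here.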

Analysis
We reported the general characteristics of papers that had shared their data in a general-purpose repository. For the FAIRness assessment, we performed a descriptive analysis of compliance with FAIR metrics. FAIR-level differences between journals and trends over time were explored. We established a categorization system comprising four compliance levels for each FAIR component: a score of 0 was classified as incomplete; 1 as initial; the component's maximum score as advanced; and any score in between as moderate.
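This four-level scheme can be sketched as a small mapping function (a minimal illustration, assuming integer component scores as produced by F-UJI):

```python
def compliance_level(score: int, max_score: int) -> str:
    """Map one FAIR component score to the four-level scheme:
    0 = incomplete, 1 = initial, the component's maximum = advanced,
    and anything in between = moderate."""
    if score == 0:
        return "incomplete"
    if score == max_score:
        return "advanced"
    if score == 1:
        return "initial"
    return "moderate"
```

For example, with a Findability maximum of 7, a score of 4 would be classed as moderate.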
We performed the Kruskal-Wallis rank sum test to compare the level of FAIRness between different article types, journal subject areas, and repositories. To determine the most influential factor among article type, journal subject area, and repository, we ran separate regression models, adjusted for the number of citations and SJR score, and then compared the R² of the models. The factor in the model with the highest R² was considered the most influential.
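The model comparison amounts to fitting one least-squares model per factor and ranking the factors by R². A minimal sketch of that logic on synthetic data (not the study's actual models, which were fit in R and adjusted for citations and SJR score):

```python
import numpy as np

def r_squared(X: np.ndarray, y: np.ndarray) -> float:
    """Ordinary-least-squares R-squared for a design matrix X
    (intercept column included) and outcome vector y."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot

def most_influential(models: dict, y: np.ndarray) -> str:
    """Return the name of the factor whose model explains
    the largest share of variance in y."""
    return max(models, key=lambda name: r_squared(models[name], y))
```

With dummy-coded factor columns in each design matrix, the factor whose model attains the highest R² is declared the most influential, mirroring the comparison reported in Table 6.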

FAIRness results
The FAIRness score for 480 repositories was 1, meaning the repository was either inaccessible or empty. We eliminated these from our analyses. Therefore, our final analyses were performed on the FAIRness results of the remaining 5,700 repositories.
The mean (standard deviation, SD) level of compliance with FAIR metrics was 9.4 (4.88). The mean score for each component was as follows: Findability: 4. The percentages of moderate or advanced compliance were as follows: Findability: 100.0%, Accessibility: 21.5%, Interoperability: 46.7%, and Reusability: 61.3% (Figure 2). Figure 3 shows the yearly mean score for each FAIR component; all components show decreasing trends. However, the monthly trends, both overall and component-wise, were consistent over the follow-up (Figure 4).

FAIRness by article type
Reviews had the highest mean FAIRness score (9.80, SD=5.06, n=160), whereas letters had the lowest (7.83, SD=4.30, n=55). The Kruskal-Wallis rank sum test showed a P-value of 0.15 for the difference between groups. Table 3 shows the detailed information.

FAIRness by journal subject area
Articles in dental journals had the highest mean FAIRness score (13.67, SD=3.51, n=3), whereas articles in neuroscience journals had the lowest (8.16, SD=3.73, n=63). The Kruskal-Wallis rank sum test showed a P-value of <0.001 for the difference between groups. Table 4 shows the detailed information.

FAIRness by repository
Harvard Dataverse had the highest mean FAIRness score (15.79, SD=3.65, n=244), whereas GitHub had the lowest (4.50, SD=0.13, n=2,152). The Kruskal-Wallis rank sum test showed a P-value of <0.001 for the difference between groups. Table 5 shows the detailed information.

The most influential factor
The R² for the repository model was the highest (R²=0.809). The R² for the model including all three factors was 0.812 (Table 6). The P-values for the number of citations and the SJR score were above 0.29 in all models.

Figure 1. The flow diagram of the study.

Figure 3. Yearly trends for components of FAIR.

Figure 4. Monthly trend for the FAIR score and its components.

Data Sources and Study Selection
From January 1, 2020 until April 15, 2023, there were 345,332 COVID-19-related articles, including open-access and non-open-access publications. Of those, 257,348 (74.5%) had full text available from the EPMC. However, 7 (<0.01%) of these open-access articles could not be downloaded because of technical issues and were excluded from our analyses. Consequently, the sample included 257,341 full-text articles.
Of these, 20,873 (8.1%) articles were detected to have shared their data, of which 8,015 (38.4%) had shared it in a general-purpose repository. After screening the URLs, 6,180 URLs were included.

Table 2. Summary of FAIR metrics.

Table 3. FAIRness score for each article type (mean and SD).

Table 4. FAIRness score for each journal subject area (mean and SD).

Table 5. FAIRness score for each repository (mean and SD).

Table 6. R² for different linear regression models, adjusted for the number of citations and SJR score.