Abstract
Preprints in the life sciences are gaining popularity, but release of a preprint still precedes only a fraction of peer-reviewed publications. Quantitative evidence on the relationship between preprints and article-level metrics of peer-reviewed research remains limited. We examined whether having a preprint on bioRxiv.org was associated with the Altmetric Attention Score and number of citations of the corresponding peer-reviewed article. We integrated data from PubMed, CrossRef, Altmetric, and Rxivist (a collection of bioRxiv metadata). For each of 26 journals (comprising a total of 46,451 articles and 3,817 preprints), we used log-linear regression, adjusted for publication date and scientific subfield, to estimate fold-changes of Attention Score and citations between articles with and without a preprint. We also performed meta-regression of the fold-changes on journal-level characteristics. By random effects meta-analysis across journals, releasing a preprint was associated with a 1.53 times higher Attention Score + 1 (95% CI 1.42 to 1.65) and 1.31 times more citations + 1 (95% CI 1.24 to 1.38) of the peer-reviewed article. Journals with larger fold-changes of Attention Score tended to have lower impact factors and lower percentages of articles released as preprints. In contrast, a journal’s fold-change of citations was not associated with impact factor, percentage of articles released as preprints, or access model. The findings from this observational study can help researchers and publishers make informed decisions about how to incorporate preprints into their work.
Introduction
Preprints offer a way to freely disseminate research findings while a manuscript is being peer reviewed (Berg et al., 2016). Although releasing a preprint in disciplines such as physics and computer science—primarily via arXiv.org—is standard practice (Ginsparg, 2011), preprints in the life sciences are just starting to catch on (Abdill and Blekhman, 2019; “PrePubMed: Monthly Statistics for December 2018,” n.d.), spurred by the efforts of ASAPbio (“ASAPbio: Accelerating Science and Publication in biology,” n.d.), bioRxiv.org (now the largest repository of biology preprints), and others. Some researchers in the life sciences remain reluctant to release their work as preprints, partly for fear of being scooped (as preprints are not universally considered a marker of priority) (Bourne et al., 2017). Furthermore, some journals explicitly or implicitly refuse to accept manuscripts released as preprints (Reichmann et al., 2019), perhaps partly for fear of publishing articles not seen as novel or newsworthy. Currently, most peer-reviewed articles in the life sciences are not preceded by a preprint (Abdill and Blekhman, 2019).
Although the advantages of preprints have been well articulated (Bourne et al., 2017; Sarabipour et al., 2019), quantitative evidence for these advantages remains relatively sparse. In particular, how does releasing a preprint relate to the outcomes—in so far as they can be measured—of the peer-reviewed article? A recent study suggested that articles with preprints had higher Altmetric Attention Scores and more citations than those without (Serghiou and Ioannidis, 2018), but the study was based on only 776 peer-reviewed articles with preprints (commensurate with the smaller size of bioRxiv at the time) and pooled articles that were published in different journals. Here we sought to build on that study by leveraging the rapid growth of bioRxiv.
Materials and Methods
Code to reproduce this study is available at https://doi.org/10.6084/m9.figshare.8855795.
Collecting the data
Data came from four primary sources: PubMed, Altmetric, CrossRef, and Rxivist. We obtained data for peer-reviewed articles from PubMed using NCBI’s E-utilities API via the rentrez R package (Winter, 2017). We obtained Altmetric Attention Scores using the Altmetric Details Page API via the rAltmetric R package. The Altmetric Attention Score (“Attention Score”) is an aggregate measure of mentions from various sources, including social media, mainstream media, and policy documents (“Our sources,” 2015). We obtained numbers of citations, as well as links between bioRxiv preprints and peer-reviewed articles, using the CrossRef API via the rcrossref R package. We verified and supplemented the links from CrossRef using Rxivist (Abdill and Blekhman, 2019) via the Postgres database in the publicly available Docker image (https://hub.docker.com/r/blekhmanlab/rxivist_data). We merged data from the various sources using the Digital Object Identifier (DOI) and PubMed ID of the peer-reviewed article. We obtained journal impact factors and access models from the journals’ websites. As in previous work (Abdill and Blekhman, 2019), we classified access models as “immediately open” (in which all articles receive an open access license immediately upon publication) or “closed or hybrid” (anything else).
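The merging step above can be sketched as follows. This is a minimal, hypothetical Python illustration of joining per-article records from multiple sources on a shared key; the field names and function are invented for this sketch, not taken from the study’s actual pipeline (which used R packages against the PubMed, Altmetric, and CrossRef APIs).

```python
def merge_on_doi(pubmed, altmetric, crossref):
    """Combine per-article records keyed by DOI into one dict per article.

    Hypothetical sketch: each argument is a list of dicts, each dict
    containing a "doi" field plus source-specific fields.
    """
    merged = {}
    for records in (pubmed, altmetric, crossref):
        for rec in records:
            doi = rec["doi"].lower()  # DOIs are case-insensitive
            merged.setdefault(doi, {}).update(rec)
    # Keep only articles present in all three sources
    keys = set(r["doi"].lower() for r in pubmed)
    for source in (altmetric, crossref):
        keys &= set(r["doi"].lower() for r in source)
    return {doi: merged[doi] for doi in keys}
```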
We included peer-reviewed articles published between January 1, 2015 and December 31, 2018. Since bioRxiv began accepting preprints on November 7, 2013, our start date ensures sufficient time for the earliest preprints to be published. We obtained each article’s Attention Score and number of citations on June 21, 2019; thus, all predictions of Attention Score and citations are for this date. Preprints and peer-reviewed articles have distinct DOIs, and thus accumulate Attention Scores and citations independently of each other. To exclude news, commentaries, etc. (since PubMed indexes various types of publications), we only included articles that had a DOI, Medical Subject Headings (MeSH) terms, and at least 21 days between date received and date accepted (peer review time). These criteria excluded some peer-reviewed articles (e.g., no articles published in PeerJ had MeSH terms), but we chose to favor specificity over sensitivity. We included articles from journals having at least 200 articles meeting the above criteria, of which at least 50 had been released as preprints. We excluded articles from journals that also publish articles outside the life sciences, since such articles would likely not be released as preprints on bioRxiv and could confound the analysis. We manually inspected 50 randomly selected articles from the final set, and found that all 50 were original research articles rather than commentaries, reviews, or other publication types.
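The article-level inclusion criteria can be expressed as a simple filter. This is a Python sketch with hypothetical record fields standing in for the parsed PubMed data; it is not the study’s actual code.

```python
from datetime import date

# Inclusion window and minimum peer review time, as stated above
START, END = date(2015, 1, 1), date(2018, 12, 31)
MIN_REVIEW_DAYS = 21  # minimum days between date received and date accepted

def meets_criteria(article):
    """Return True if a peer-reviewed article passes all inclusion filters.

    Hypothetical fields: doi, mesh_terms, pub_date, received, accepted.
    """
    if not article.get("doi") or not article.get("mesh_terms"):
        return False
    if not (START <= article["pub_date"] <= END):
        return False
    review_days = (article["accepted"] - article["received"]).days
    return review_days >= MIN_REVIEW_DAYS
```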
Calculating principal components of MeSH term assignments
Medical Subject Headings (MeSH) are a controlled vocabulary used to index PubMed and other biomedical databases (“Medical Subject Headings,” 1999). For each journal, we generated a binary matrix of MeSH term assignments for the peer-reviewed articles (1 if a given term was assigned to a given article, and 0 otherwise). We only included MeSH terms assigned to at least 5% of articles in a given journal, and excluded the terms “Female” and “Male” (which referred to the biological sex of the study animals and were not related to the article’s field of research), resulting in between 13 and 59 MeSH terms per journal. We calculated the principal components (PCs) using the prcomp function in the R stats package and scaling the assignments for each term to have unit variance. We calculated the percentage of variance explained by each PC as that PC’s eigenvalue divided by the sum of all eigenvalues.
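The PCA step above can be sketched in a few lines of numpy. This is an illustrative analogue of the R call prcomp(..., scale. = TRUE), assuming a toy binary matrix; it mirrors the scaling to unit variance and the variance-explained calculation described above.

```python
import numpy as np

def mesh_pcs(X, n_pcs=10):
    """PCA of a binary article-by-MeSH-term matrix X (rows = articles).

    Columns are centered and scaled to unit variance (as with
    prcomp's scale. = TRUE); returns the PC scores for each article
    and the proportion of variance explained by each component.
    """
    X = np.asarray(X, dtype=float)
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    U, s, Vt = np.linalg.svd(Xs, full_matrices=False)
    eig = s**2 / (X.shape[0] - 1)   # eigenvalues of the correlation matrix
    var_explained = eig / eig.sum() # each eigenvalue over the sum of all
    scores = Xs @ Vt.T              # PC scores for each article
    return scores[:, :n_pcs], var_explained
```

Note that any MeSH term assigned to all or none of a journal’s articles would have zero variance and must be dropped before scaling; the 5% assignment threshold above removes the rarest terms.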
Quantifying the associations
For each journal, we fit two linear regression models, one in which the dependent variable was log2(Attention Score + 1) and one in which the dependent variable was log2(citations + 1). In each model, the independent variables were the article’s preprint status (encoded as 1 for articles preceded by a preprint and 0 otherwise), publication date (equivalent to time since publication, encoded using a natural cubic spline with three degrees of freedom), and values for the top ten PCs of MeSH term assignments. The spline for publication date provides flexibility to fit the non-linear accumulation of citations over time (Wang et al., 2013).
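The per-journal model can be sketched as a plain least-squares fit. In this simplified Python illustration, the publication-date spline and MeSH-term PCs are collapsed into a generic covariate matrix; the actual analysis was fit in R with a natural cubic spline basis.

```python
import numpy as np

def preprint_fold_change(metric, preprint, covariates):
    """OLS of log2(metric + 1) on preprint status plus covariates.

    metric:     Attention Scores or citation counts (>= 0)
    preprint:   0/1 indicator of whether the article had a preprint
    covariates: 2-D array of adjustment variables (standing in for the
                publication-date spline basis and MeSH-term PCs)
    Returns the fold-change of (metric + 1) associated with a preprint.
    """
    y = np.log2(np.asarray(metric, dtype=float) + 1)
    n = len(y)
    X = np.column_stack([np.ones(n), preprint, covariates])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    log2_fc = beta[1]      # coefficient on the preprint indicator
    return 2.0 ** log2_fc  # back-transform to a fold-change
```

Because the dependent variable is log2-transformed and preprint status is binary, exponentiating the coefficient yields a multiplicative fold-change of (metric + 1).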
We extracted from each linear regression the coefficient (for the main analysis, this was a log2 fold-change) and corresponding 95% confidence interval (CI) for releasing a preprint, and exponentiated them to produce a fold-change and corresponding 95% CI. For each of log2(Attention Score + 1) and log2(citations + 1), we performed a random effects meta-analysis based on the Hartung-Knapp-Sidik-Jonkman method (IntHout et al., 2014) using the metagen function of the meta R package (Schwarzer et al., 2015). For each metric’s meta-regression, we fit a linear regression model in which the dependent variable was the log2 fold-change and the independent variables were the journal’s access model (encoded as 0 for “closed or hybrid” and 1 for “immediately open”), log2(impact factor in 2017), and log2(percentage of articles released as preprints).
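The pooling step can be illustrated with a basic random-effects calculation. This Python sketch uses the classical DerSimonian-Laird estimate of between-journal variance; it is a simplification of the Hartung-Knapp-Sidik-Jonkman method actually applied via meta::metagen in R, which additionally adjusts the standard error of the pooled estimate.

```python
def pooled_fold_change(log2_fcs, ses):
    """Random-effects pooling of per-journal log2 fold-changes.

    log2_fcs: per-journal log2 fold-change estimates
    ses:      their standard errors
    Returns the pooled estimate back-transformed to a fold-change.
    (DerSimonian-Laird sketch, not the HKSJ method used in the study.)
    """
    w = [1 / se**2 for se in ses]  # inverse-variance (fixed-effect) weights
    sw = sum(w)
    fixed = sum(wi * b for wi, b in zip(w, log2_fcs)) / sw
    q = sum(wi * (b - fixed) ** 2 for wi, b in zip(w, log2_fcs))
    df = len(log2_fcs) - 1
    c = sw - sum(wi**2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c)  # between-journal variance
    w_re = [1 / (se**2 + tau2) for se in ses]  # random-effects weights
    pooled = sum(wi * b for wi, b in zip(w_re, log2_fcs)) / sum(w_re)
    return 2.0 ** pooled
```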
As a secondary analysis, we added to the original linear regression model a variable corresponding to the number of days by which release of the preprint preceded publication of the peer-reviewed article (using 0 for articles without a preprint). In this model, the association between preprint status and either Attention Score or citations can no longer be interpreted using a single log2 fold-change.
Results
We first assembled a dataset of peer-reviewed articles from the life sciences, including each article’s Altmetric Attention Score and number of citations and whether it had a corresponding preprint on bioRxiv. Overall, our dataset included 46,451 articles, 3,817 of which had a preprint, published in 26 journals between January 1, 2015 and December 31, 2018 (Table 1). Release of the preprint preceded publication of the peer-reviewed article by a median of 182 days (Fig. S1). Across journals, each article’s Attention Score and citations were weakly correlated with each other (median Spearman correlation 0.29, Fig. S2).
To quantify associations with releasing a preprint for articles published in each journal, we fit linear regression models in which the dependent variables were log2(Attention Score + 1) and log2(citations + 1) (since both metrics were greater than or equal to zero and spanned orders of magnitude, Fig. S3). Each regression model included terms for an article’s preprint status, publication date (since, for example, older articles tend to have more citations) and approximate scientific subfield within the journal (since, for example, articles with preprints may be enriched in subfields that tend to receive more or fewer citations). We approximated scientific subfield as the top ten PCs of MeSH term assignments (Fig. S4 and S5), analogously to how genome-wide association studies use PCs to adjust for population stratification (Price et al., 2006). As preprint status is binary and the dependent variable is log2-transformed, the coefficient from linear regression corresponded to a log2 fold-change.
The fold-changes and lower bounds of the corresponding 95% confidence intervals (CIs) of both metrics were > 1 for most journals (Fig. 1A), indicating higher Attention Scores and more citations for articles released as preprints (Fig. 1B-C and S6). The fold-changes of Attention Score and citations were not significantly correlated with each other (Spearman correlation 0.21, p value 0.31). By random effects meta-analysis across journals, releasing a preprint was associated with a 1.53 times higher Altmetric Attention Score + 1 (95% CI 1.42 to 1.65) and 1.31 times more citations + 1 (95% CI 1.24 to 1.38) of the peer-reviewed article (Fig. 1A). We obtained similar results if we also considered the number of days by which each preprint preceded its peer-reviewed article (Fig. S7). If we excluded the PCs of MeSH term assignments from the regression, the fold-changes associated with releasing a preprint increased modestly for each metric (Fig. S8).
We next performed meta-regression of the log2 fold-changes on journal-level characteristics. Higher impact factor (which was correlated with mean log2(Attention Score + 1): Spearman correlation 0.84, p value 1.9·10−6) and higher percentage of articles released as preprints were significantly associated with a smaller log2 fold-change of Attention Score + 1 (Table 2 and Fig. 2). Neither variable, however, was associated with log2 fold-change of citations + 1. A journal’s access model (immediately open vs. closed or hybrid) was not associated with log2 fold-change of either metric.
Discussion
Here we find that peer-reviewed articles with a preprint on bioRxiv tend to have higher Altmetric Attention Scores and more citations than those without. The difference in citations, in particular, appears robust across journals of various fields of research, impact factors, access models, and percentages of articles released as preprints. Overall, our findings confirm and extend those of previous work (Serghiou and Ioannidis, 2018).
However, our data and analysis have several limitations. First, our data do not include other article-level metrics such as number of views, for which no universal API exists. Second, we only included preprints on bioRxiv, so the associations we observe may not apply to preprints on other repositories such as arXiv Quantitative Biology and PeerJ Preprints. Third, some preprints on bioRxiv may have been published as peer-reviewed articles, but not yet detected as such by bioRxiv’s internal system (Abdill and Blekhman, 2019). Fourth, our analysis ignores characteristics of the preprints themselves. Fifth, grouping scientific articles by their research area(s) is an ongoing challenge (Waltman and van Eck, 2012), and the principal components of MeSH terms are only a simple approximation. Sixth, our analysis does not indicate whether the associations between preprints, Attention Scores, and citations have changed over time, and the associations may change as the culture of preprints in the life sciences evolves.
Finally and most importantly, the data are observational, so we cannot conclude that releasing a preprint is causal for a higher Altmetric Attention Score and more citations of the peer-reviewed article. It could be that, for articles published in a wide range of journals (and accounting for publication date and scientific subfield), having a preprint on bioRxiv is merely a marker for research that is likely to receive more attention and citations anyway. In the future, it may be possible to link Attention Scores and citations with author-level characteristics such as h-index and institutional affiliation (unfortunately, unique author identifiers such as those from ORCID currently have low coverage of the published literature). If there is a causal role for preprints, it may be related to increased visibility that leads to “preferential attachment” (Wang et al., 2013) while the manuscript is in peer review. Without a randomized trial of preprints, these effects are extremely difficult to distinguish.
Despite these caveats, our findings contribute to the growing evidence of quantifiable benefits of preprints in biology, and may have implications for preprints in chemistry and medicine (Kiessling et al., 2016; Rawlinson and Bloom, 2019). We anticipate our study will help researchers and publishers make informed decisions about how to incorporate preprints into their work.
Acknowledgments
We thank Altmetric for providing their data free of charge for research purposes. We thank Tony Capra and Doug Ruderfer for helpful comments on the manuscript.