Publication practices during the COVID-19 pandemic: Biomedical preprints and peer-reviewed literature

The coronavirus pandemic introduced many changes to our society and deeply affected established publication practices in the biomedical sciences. In this article, we present a comprehensive study of the changes in the scholarly publication landscape for the biomedical sciences during the COVID-19 pandemic, with special emphasis on preprints posted on the bioRxiv and medRxiv servers. We observe the emergence of a new category of preprint authors working in the fields of immunology, microbiology, infectious diseases, and epidemiology, who extensively used preprint platforms during the pandemic to share their immediate findings. The majority of these findings were works-in-progress not yet fit for prompt acceptance by refereed journals. The COVID-19 preprints that became peer-reviewed journal articles were often submitted to journals concurrently with their posting on a preprint server, and the entire publication cycle, from preprint to the online journal article, took on average 63 days. This included an expedited peer-review process of 43 days and a journal production stage of 15 days, although there was wide variation in publication delays between journals. Only one third of the COVID-19 preprints posted during the first nine months of the pandemic appeared as peer-reviewed journal articles. These journal articles display high Altmetric Attention Scores, further emphasizing the significance of COVID-19 research during 2020. This article will be relevant to editors, publishers, open science enthusiasts, and anyone interested in the changes that the 2020 crisis brought to publication practices and the culture of preprints in the life sciences.


Introduction
The lifecycle of any research starts and ends with scholarly communication. Despite a variety of avenues to communicate research findings, the foundation of modern publication practices is publication in a peer-reviewed journal. The peer-review system is, at present, deeply ingrained in scientific minds as the gold standard for research quality. Certainly, the peer-review process improves the drafted manuscript, but previous studies showed that its positive effect on the overall quality of the final report is minor [1]. Besides, the traditional peer-review system is notorious for reviewer bias, lack of agreement between reviewers, harsh criticism concealed by anonymity, multiple cycles of reviews and rejections by different journals, and the associated delays and expenses [2].
Data from these sources were retrieved in JavaScript Object Notation (JSON) format. Data analysis and visualization were done in Python (pandas, numpy, requests, matplotlib, bokeh, and seaborn) using Jupyter Notebook.

To search PubMed, we used Entrez Programming Utilities (E-utilities) [43], an application programming interface (API) that allows searching 38 databases from the National Center for Biotechnology Information (NCBI). For E-utilities, data were downloaded via CSV and converted to Microsoft Excel for further analysis and visualization.

Rxivist [41] is a Python-based web crawler that parses the bioRxiv website, detects newly posted preprints, and stores metadata about each item in a PostgreSQL database. The metadata we extracted contained the title, authors, submission date, category, DOI of the preprint and, if published, the new DOI and the journal of publication.

Crossref [42] is an official DOI registration agency of the International DOI Foundation that operates a cross-publisher citation linking system for academic content, including journals, conference proceedings, books, data sets, etc. It works with thousands of publishers to provide authorized access to their metadata, including the DOI, publication date, and other basic information.

CORD-19 was created by the Allen Institute for AI (AI2) in collaboration with many partners and released on March 16, 2020. We used its 2020.09.02 release, downloaded on 2020.09.30 from CADRE [46], for metadata associated with refereed journal articles.

Data challenges and study limitations
Analysis of published preprints. When a preprint is published in a peer-reviewed journal, a reference to the new DOI of the journal article appears next to its title, and the DOIs of the preprint and the published article are permanently linked in indexing platforms and tools, which pull from various APIs.
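As an illustration of the E-utilities search mentioned above, the query below counts PubMed records via the esearch endpoint. This is a minimal sketch, not the study's actual pipeline: the query string is hypothetical, and the study exported its results via CSV rather than JSON.

```python
import requests

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def esearch_params(term, retmax=0):
    """Build the query parameters for an E-utilities esearch call against PubMed."""
    return {"db": "pubmed", "term": term, "retmode": "json", "retmax": retmax}

def pubmed_count(term):
    """Return the number of PubMed records matching the query (network call)."""
    resp = requests.get(EUTILS_ESEARCH, params=esearch_params(term), timeout=30)
    resp.raise_for_status()
    return int(resp.json()["esearchresult"]["count"])

# Illustrative query, not the one used in the study:
# pubmed_count('COVID-19[Title/Abstract] AND "Journal Article"[Publication Type]')
```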
Rxivist [41] proved to be an excellent tool for extracting published DOIs for preprints that eventually appeared as peer-reviewed journal articles, but only when bioRxiv records linked preprints to their external publications. Rxivist also had a two-week delay in updating its metrics, and this delay may explain why some peer-reviewed preprint analogues were missing from Rxivist. Additionally, at the time of our study, Rxivist did not include medRxiv preprints in its database; this changed after Nov 27, 2020. We found that the most reliable method of extracting metadata about each individual preprint was accessing the BioRxiv API [40]. Using the Python library requests, we extracted information about each preprint based on its DOI, which gave us a column called 'published'. Within this column, if the preprint was also published in a journal, the metadata provided the DOI corresponding to the published version of the paper.
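The per-preprint lookup described above can be sketched as follows. The endpoint path and field names reflect our reading of the public bioRxiv API and should be treated as assumptions rather than a restatement of the study's exact code.

```python
import requests

def fetch_preprint_record(preprint_doi, server="biorxiv"):
    """Fetch a single preprint's metadata record from the bioRxiv API."""
    url = f"https://api.biorxiv.org/details/{server}/{preprint_doi}"
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return resp.json()["collection"][0]

def published_doi(record):
    """Return the journal-article DOI from the 'published' field, or None
    when the preprint has no linked published version ('NA')."""
    published = record.get("published", "NA")
    return None if published in ("", "NA") else published
```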

To ensure we found all published preprints, we also accessed data from the Crossref, Dimensions, and CORD-19 APIs. To establish the linkage between the preprints and the corresponding peer-reviewed journal articles, we performed both DOI and title matching. All channels were then combined, and duplicates were dropped. For a detailed demonstration of the data obtained through every data channel, see Published Collections in SI.
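This combine-and-deduplicate step can be sketched as below. The column names (`preprint_doi`, `published_doi`, `title`) and the crude title normalization are illustrative assumptions; the study's actual matching logic may differ.

```python
import pandas as pd

def normalize_title(title):
    """Lower-case and strip punctuation/whitespace so near-identical titles match."""
    return "".join(ch for ch in title.lower() if ch.isalnum())

def combine_channels(frames):
    """Concatenate preprint-to-article links from several sources, then drop
    duplicates: first on exact DOI pairs, then on normalized titles."""
    combined = pd.concat(frames, ignore_index=True)
    combined["title_key"] = combined["title"].map(normalize_title)
    combined = combined.drop_duplicates(subset=["preprint_doi", "published_doi"])
    combined = combined.drop_duplicates(subset=["title_key"])
    return combined.drop(columns="title_key")
```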

To validate whether we had found all peer-reviewed preprint versions based on the combination of Rxivist, Crossref, CORD-19, Dimensions, and the BioRxiv API, we randomly selected a sample of 100 preprints that our data returned as "unpublished" from both bioRxiv and medRxiv and searched Google Scholar by title. Our analysis of "unpublished" preprints returned 10% of bioRxiv and 4% of medRxiv preprints as being published in refereed journals. All found journal publications had slight modifications in article titles or author lists, and the original "unpublished" preprints were not linked on the preprint servers to the corresponding published versions. In comparison, this false-negative rate is lower than the 37.5% reported by Blekhman et al. [51]. All manually found journal article versions of "unpublished" preprints were added to the data discussed in this article.
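This validation step can be sketched as a reproducible random sample whose titles are then checked by hand; the seed, column names, and helper functions here are illustrative.

```python
import pandas as pd

def validation_sample(unpublished, n=100, seed=42):
    """Draw a reproducible random sample of preprints flagged as 'unpublished',
    for manual title searches (e.g., in Google Scholar)."""
    return unpublished.sample(n=min(n, len(unpublished)), random_state=seed)

def false_negative_rate(found_published):
    """Share of sampled 'unpublished' preprints that turned out to be published;
    found_published is a boolean Series produced by the manual check."""
    return found_published.mean()
```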

Double DOI. When we looked for published preprints based on title matching, we encountered a few instances where two published DOIs existed for a peer-reviewed preprint version. In one case, it was an erratum for the paper, and in the other, it was a publication on another preprint server. In both cases, we used only the DOI of the article in the peer-reviewed journal, and the publication on the other preprint server was removed from further analysis. We also encountered a few cases where preprints with different DOIs were linked to the same DOI of the published version. On inspection, these preprints were similar, but not identical, in titles and author lists. For our analysis, we kept only the DOI of the preprint that was published earlier.
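The tie-break for preprints sharing one published DOI can be sketched as follows (column names are hypothetical):

```python
import pandas as pd

def keep_earliest_preprint(links):
    """When several preprint DOIs link to the same published DOI, keep only
    the preprint with the earliest posting date."""
    ordered = links.sort_values("preprint_date")
    return ordered.drop_duplicates(subset="published_doi", keep="first")
```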

PubMed. As mentioned in the Introduction, the NIH Preprint Pilot started in June 2020 and, at this stage, primarily focuses on NIH-supported and COVID-19 related preprints from various servers. By Sept 26, PubMed indexed 1,048 preprints from medRxiv, bioRxiv, ChemRxiv, arXiv, Research Square, and SSRN, of which 1,043 were on COVID-19; this constituted only 11.5% of the 9,072 medRxiv and bioRxiv COVID-19 related preprints retrieved from the BioRxiv API. For these reasons, we did not use PubMed as a data source for preprints. We used PubMed (through E-utilities) to obtain metadata on peer-reviewed articles of the "Journal Article" and "Review" article types.

In analyzing PubMed dates, we found that articles with a missing day-of-publication were coded as being published on January 1st; a similar issue was reported earlier for Crossref dates [38]. Based on the low number of preprints in January, we decided to avoid discussing January data for PubMed (this month is omitted in Fig 2).

Categories. In general, we used a single category for a preprint, as indicated in the metadata from the BioRxiv API. However, as of September 25, we found that six out of a total of 1,956 COVID-19 related bioRxiv preprints (0.3%) displayed two categories. Since this contradicted the server's statement that "Only one subject area can be selected for an article", we omitted the additional category in our analysis. The journals' scope categories were extracted from Crossref [54].

Dates. For the online publication date of journal articles, we used ArticleDate@DateType="Electronic" from PubMed. When ArticleDate@DateType="Electronic" was not available, we substituted it with the "created-date" from Crossref.

Before deciding which dates to use in our study, we carefully analyzed those used in previous studies and noted some inconsistency between different authors (Table 4). Throughout this article, the online publication date of a journal article corresponds to ArticleDate@DateType="Electronic" in PubMed and/or the "created-date" from Crossref.
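The date fallback described above, together with the January 1st caveat noted for PubMed dates, can be sketched as below; the function and the "suspicious" flag are illustrative, not part of the study's pipeline.

```python
from datetime import date

def article_online_date(pubmed_electronic, crossref_created):
    """Prefer PubMed's ArticleDate with DateType="Electronic"; fall back to
    Crossref's "created-date" when PubMed does not provide one. Dates whose
    day (or month) was missing upstream are often coded as January 1st, so
    flag those for manual inspection."""
    chosen = pubmed_electronic if pubmed_electronic is not None else crossref_created
    suspicious = chosen is not None and (chosen.month, chosen.day) == (1, 1)
    return chosen, suspicious
```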

To assess the preprint pre-submission time, we subtracted the preprint deposition date from the date the journal article was "received". To assess the review time, we subtracted the date the journal article was "received" from the date it was "accepted". To assess the production stage time, we subtracted the date the journal article was "accepted" from the date it was posted online.

Consistently throughout the pandemic, medRxiv experienced a significantly higher flux of COVID-19 preprints compared with bioRxiv (Table 1 and Scholarly Output in SI). On average, medRxiv preprints on COVID-19 constituted 78% (SD = 2%) of the combined bioRxiv and medRxiv preprints in any single month, except January, when the number of COVID-19 related medRxiv preprints was only 27% of the COVID-19 related bioRxiv preprints. May was the most productive month for authors of medRxiv preprints. In June, the number of medRxiv COVID-19 preprints declined by 31%, while the number of bioRxiv preprints increased by 6%. After June, we noted a

The scholarly output in our study is defined by bioRxiv and medRxiv preprints in relation to the "Journal Article" and "Review" article types in PubMed. Based on our analysis, in February, COVID-19 preprints from medRxiv and bioRxiv constituted only 2% of biomedical articles on all topics, but this fraction increased to 15% in May (Fig 3). The number of peer-reviewed articles on COVID-19 grew from the start of the pandemic, reaching a peak in July. In contrast, the volume of peer-reviewed literature unrelated to coronavirus slowly declined. As a result, the fraction of COVID-19 journal articles with respect to all articles indexed in PubMed increased since the start of the pandemic and reached 71% in October.
At that time, the number of COVID-19 bioRxiv and medRxiv preprints stood at 9% with respect to the COVID-19 peer-reviewed literature in PubMed, but this fraction was as high as 57% in February 2020. Thus, early in the pandemic, there were over half as many preprints as there were peer-reviewed articles about the newly emerged coronavirus.

We also analyzed categories for bioRxiv preprints unrelated to COVID-19 (see Categories Analysis in SI) deposited into the server during Jan 1 - Sept 30, 2020. We found that the majority

76% of all published medRxiv preprints (Fig 6). Publication rates vary across the preprint categories (Fig 7): COVID-19 preprints in the bioRxiv categories of microbiology and biochemistry display the highest publication rates, at 22%.

2020, and we found them at 34% and 29% for bioRxiv and medRxiv preprints, respectively.

Despite being higher than the publication rate of 18% derived from our data in October, the reanalyzed publication rates are still low.
T = t_PS + t_R + t_PR,

where pre-submission time (t_PS) is the interval between a preprint's posting on a preprint server and its submission to the peer-reviewed journal; peer-review time (t_R) is the duration of the peer-review process; and production stage time (t_PR) is the interval between the article's official acceptance and its publication online.
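Given the four dates described above, the delays can be computed as in this sketch (the column names are assumptions; the dates are pandas datetime columns):

```python
import pandas as pd

def publication_delays(df):
    """Compute the three publication delays, in days, from four dates:
    preprint posting, journal 'received', 'accepted', and online publication."""
    out = pd.DataFrame(index=df.index)
    out["t_PS"] = (df["received"] - df["posted"]).dt.days            # pre-submission
    out["t_R"] = (df["accepted"] - df["received"]).dt.days           # peer review
    out["t_PR"] = (df["published_online"] - df["accepted"]).dt.days  # production
    out["T"] = out[["t_PS", "t_R", "t_PR"]].sum(axis=1)              # full cycle
    return out
```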

The descriptive statistics for these publication delays are summarized in Table 2 and Fig 8 and are discussed in detail below. It is worth noting that none of the publication delays displays a standard Gaussian distribution (Fig 9); thus, we discuss both their medians and means. The difference in this time between medRxiv and bioRxiv preprints is statistically significant.

We also explored whether T can explain the different publication rates for preprint

preprint on a preprint server shortly after, or even prior to, its submission to the peer-reviewed journal. In our quest to explain the expedited publication times, we analyzed the review time (t_R) and the production stage period (t_PR) (Fig 10).

We found that the mean review time (t_R) for COVID-19 related bioRxiv and medRxiv preprints is 43.4 days ( [67]. This discrepancy in early data is likely due to a severe skew in the frequency distribution of t_R. The major advance in speeding up the peer-review process was observed for PLOS ONE.

The mean production stage time (t_PR) for COVID-19 related bioRxiv and medRxiv preprints is 14.6 days, about one third of the average t_R found above for the same set of articles (Table 2).

The difference in t_PR for medRxiv and bioRxiv preprints on COVID-19 is not significant (Table 3).

As compared to the t_PR of 147 days reported by Björk

Fig 10). We found that the average pre-submission time (t_PS) for COVID-19 related preprints is 5.6 days (Table 2), a positive value implying that, on average, authors posted their manuscript to the preprint server before advancing it to journal publishers (Fig 11). Authors of bioRxiv COVID-19 preprints waited longer than authors of medRxiv COVID-19 preprints, and this difference is statistically significant (Table 3). The distribution of t_PS frequencies indicates a median of 0 days (Table 2, Fig 9). A more detailed analysis showed that 44% of the COVID-19 preprints were deposited to the bioRxiv or medRxiv servers after being submitted to the journal (negative t_PS), and only 28% of preprints were posted more than 10 days before they were submitted to the journal where they were published. Our results mirror earlier findings by Anderson [69], who reported those values as 57% and 29%, respectively, for papers that had preprint analogues and were

The t_PS, in days, is plotted for bioRxiv (red) and medRxiv (blue) preprints deposited during Jan 1 - Sept 30, 2020. The 0 date is the date the preprint was submitted to the peer-reviewed journal, and a positive t_PS indicates that the preprint was deposited before being submitted to the journal.

We reasoned that variations in the set of journals publishing the majority of preprints could be explained by differences in elapsed times for each journal (Fig 12). Indeed, in post hoc

(Table 5). In the previous section, we discussed the journals that published the majority of bioRxiv and medRxiv preprints based on the number of preprints.

preprints and their article analogues (Fig 13). We found that the majority of COVID-19 preprints in both medRxiv and bioRxiv were published in journals whose scope is general biochemistry, genetics, and molecular biology. Additionally, microbiology preprints from bioRxiv were published in journals specializing in microbiology, infectious diseases, and virology. The latter category is currently absent from both the bioRxiv and medRxiv platforms but is listed among the Scopus categories. The majority of medRxiv preprints were published in journals whose scope is general medicine. Preprints in infectious diseases and epidemiology were published in journals whose scope is infectious diseases and microbiology (medical). For both medRxiv and bioRxiv, the

To assess the visibility of COVID-19 preprints, we compared the Altmetric Attention Scores of COVID-19 related articles that had associated preprints to those that did not, and to articles unrelated to COVID-19 that were published between Jan 1 and Nov 19, 2020 (Fig 14, Table 6). We also stratified our results by journal to eliminate the potential effect of a journal's impact factor (IF) or other journal-specific variables. For the top ten journals that published the majority of COVID-19 preprints, we found that the Altmetric Attention Scores for articles that had associated preprints were slightly higher on average, but not significantly different from those for articles that did not have associated preprints (Table 6).


Discussion
In this paper, we explored how publication practices in the biomedical sciences reacted to an emergency such as the COVID-19 pandemic. Our first focus was analyzing the usage of the two major biomedical preprint servers, bioRxiv and medRxiv. Following the deposition of the first preprint on a "novel coronavirus" in mid-January 2020 [56], preprint submissions to these two platforms increased rapidly. Submissions of new coronavirus-related preprints reached 10 to 20 per day by February and increased to about 150 per day by May (Fig 1). In addition to this incredible flow of

The answer to this question lies within the trends in the most active fields on each preprint server. Our analysis revealed that the majority of COVID-19 related preprints in bioRxiv were deposited in the fields most relevant to coronavirus research, such as microbiology, bioinformatics, and immunology (Fig 4A)

preprints. Our analysis of publication delays yielded a median t_PS of 0 days for published COVID-19 preprints, implying that preprints were submitted to preprint servers and to journals simultaneously (Fig 9 and Fig 11). A more detailed analysis showed that only 28% of preprints were deposited into the servers more than 10 days prior to journal submission.
The journal champions varied at various moments throughout the pandemic, which we found to be related to variations in publication times among the journals (Fig 12). For example, it took the

an earlier observation that the publication peak for COVID-19 preprints in May translates into the July summit for journal article publications (Fig 2).

The complementary efforts of preprint servers and scholarly journals to disseminate knowledge promptly, while differentiating reliable and important findings from those that may be misleading, attest to the utmost relevance of the COVID-19 topic during 2020, as evident from the Altmetric Attention Scores for COVID-19 research articles (Fig 14).

In summary, our analysis showed that early in the pandemic, preprints prevailed in disseminating findings on the topic of the public health emergency. Preprint authors deposited them into fields previously underrepresented on the bioRxiv or medRxiv servers but those that were

consultations. We also thank Dr. Oscar Tutusaus for his assistance with manuscript editing.