bioRxiv
New Results

Linguistic Analysis of the bioRxiv Preprint Landscape

David N. Nicholson, Vincent Rubinetti, Dongbo Hu, Marvin Thielk, Lawrence E. Hunter, Casey S. Greene
doi: https://doi.org/10.1101/2021.03.04.433874
David N. Nicholson
1 Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
Vincent Rubinetti
1 Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
2 Center for Health AI, University of Colorado School of Medicine, Aurora, CO, USA
Dongbo Hu
1 Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
Marvin Thielk
3 Elsevier, Philadelphia, PA, USA
Lawrence E. Hunter
4 Center for Computational Pharmacology, University of Colorado School of Medicine, Aurora, CO, USA
Casey S. Greene
1 Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
2 Center for Health AI, University of Colorado School of Medicine, Aurora, CO, USA
5 Department of Biochemistry and Molecular Genetics, University of Colorado School of Medicine, Aurora, CO, USA
  • For correspondence: casey.s.greene@cuanschutz.edu

Abstract

Preprints allow researchers to make their findings available to the scientific community before they have undergone peer review. Studies of bioRxiv preprints have largely focused on article metadata and on how often these preprints are downloaded, cited, published, and discussed online. The language contained within the bioRxiv preprint repository has yet to be examined. We compared linguistic features of bioRxiv preprints with those of published biomedical text as a whole, because this comparison offers an opportunity to examine how peer review changes these documents. The most prevalent features that changed appear to be associated with typesetting and with mentions of supplementary sections or additional files. In addition to text comparison, we created document embeddings derived from a preprint-trained word2vec model. We found that these embeddings can distinguish different scientific approaches and concepts, link preprints to their peer-reviewed versions when the pair has not been annotated, and identify journals that publish papers linguistically similar to a given preprint. We also used these embeddings to examine factors associated with the time elapsed between the posting of a first preprint and the appearance of a peer-reviewed publication. We found that preprints with more versions posted and more textual changes took longer to publish. Lastly, we constructed a web application (https://greenelab.github.io/preprint-similarity-search/) that allows users to identify the journals and articles that are most linguistically similar to a bioRxiv or medRxiv preprint and to observe where the preprint would be positioned within a published article landscape.
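
As a rough illustration of how document embeddings can be used to find linguistically similar journals, the sketch below averages word vectors into a document vector and ranks journals by cosine similarity to that vector. It is only a sketch under stated assumptions: the abstract does not say how word vectors are combined or how journals are represented, so the averaging step and the journal centroids are assumptions, and the word vectors, journal names, and function names are hypothetical toy values rather than anything taken from the authors' code.

```python
import math

# Hypothetical toy word vectors; a real model would be a word2vec model
# trained on preprint text, with far more dimensions and vocabulary.
WORD_VECTORS = {
    "gene": [0.9, 0.1, 0.0],
    "expression": [0.8, 0.2, 0.1],
    "neuron": [0.1, 0.9, 0.0],
    "spike": [0.0, 0.8, 0.2],
    "species": [0.1, 0.1, 0.9],
    "habitat": [0.0, 0.2, 0.8],
}


def embed_document(tokens, word_vectors):
    """Average the vectors of known tokens (assumed pooling strategy).

    Returns None when no token is in the vocabulary.
    """
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_journals(doc_vector, journal_centroids):
    """Return (journal, similarity) pairs, most similar first."""
    scores = {
        name: cosine_similarity(doc_vector, centroid)
        for name, centroid in journal_centroids.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical journal centroids: here each is the embedding of a few
# representative words, standing in for the mean embedding of its papers.
JOURNAL_CENTROIDS = {
    "Genomics Journal": embed_document(["gene", "expression"], WORD_VECTORS),
    "Neuro Journal": embed_document(["neuron", "spike"], WORD_VECTORS),
    "Ecology Journal": embed_document(["species", "habitat"], WORD_VECTORS),
}

preprint_tokens = "we measured gene expression in each neuron".split()
preprint_vector = embed_document(preprint_tokens, WORD_VECTORS)
ranking = rank_journals(preprint_vector, JOURNAL_CENTROIDS)

for journal, score in ranking:
    print(f"{journal}: {score:.3f}")
```

Out-of-vocabulary tokens are simply skipped here, which keeps the toy example short; a production pipeline would also need tokenization, stop-word handling, and a policy for documents with no known tokens.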

Competing Interest Statement

Marvin Thielk receives a salary from Elsevier Inc. where he contributes NLP expertise to health content operations. Elsevier did not restrict the results or interpretations that could be published in this manuscript. The opinions expressed here do not reflect the official policy or positions of Elsevier Inc.

Footnotes

  • a Funded by The Gordon and Betty Moore Foundation (GBMF4552); The National Institutes of Health (T32 HG000046)

  • b Funded by The Gordon and Betty Moore Foundation (GBMF4552); The National Institutes of Health (R01 HG010067)

  • c Funded by The Gordon and Betty Moore Foundation (GBMF4552)

  • This manuscript (permalink) was automatically generated from greenelab/annorxiver_manuscript@2034e45 on May 12, 2021.

  • danich1, dnicholson329

  • vincerubinetti, vincerubinetti

  • dongbohu

  • MarvinT, TheNeuralCoder

  • LEHunter, ProfLHunter

  • cgreene, greenescientist

  • Add comparison to preprints in motion set described in Polka et al. https://www.biorxiv.org/content/10.1101/2021.02.20.432090v1

  • https://github.com/greenelab/annorxiver_manuscript

  • https://github.com/greenelab/preprint-similarity-search

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
Posted May 25, 2021.

Subject Area

  • Bioinformatics