bioRxiv
New Results

RefDeduR: A text-normalization and decision-tree aided R package enabling accurate and high-throughput reference deduplication for large datasets

Jiaxian Shen, Fangqiong Ling, Erica M. Hartmann
doi: https://doi.org/10.1101/2022.09.29.510210
Jiaxian Shen
1 Department of Civil and Environmental Engineering, Northwestern University, IL, USA
  • For correspondence: jiaxianshen2022@u.northwestern.edu
Fangqiong Ling
2 Department of Energy, Environmental and Chemical Engineering, Washington University in St. Louis, MO, USA
Erica M. Hartmann
1 Department of Civil and Environmental Engineering, Northwestern University, IL, USA

Abstract

As the scientific literature grows exponentially and research becomes increasingly interdisciplinary, accurate and high-throughput reference deduplication is vital in evidence synthesis studies (e.g., systematic reviews, meta-analyses) to ensure the completeness of datasets while reducing the manual screening burden. Existing tools fail to fulfill these emerging needs, as they are often labor-intensive, insufficient in accuracy, and limited to clinical fields. Here, we present RefDeduR, a text-normalization and decision-tree aided R package that enables accurate and high-throughput reference deduplication. We modularize the pipeline into text normalization, three-step exact matching, and two-step fuzzy matching processes. We also introduce a decision-tree algorithm, consider preprints when they co-exist with a peer-reviewed version, and provide actionable recommendations. Therefore, the tool is customizable, accurate, high-throughput, and practical. RefDeduR provides an effective solution to perform reference deduplication and represents a valuable advance in expanding the open-source toolkit to support evidence synthesis research.
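
To make the pipeline shape described above concrete, the following is a minimal illustrative sketch of the general idea: normalize text, remove exact matches, then flag near matches for review. It is not RefDeduR's code or API, and it is written in Python rather than R purely for illustration. It collapses the three-step exact matching and two-step fuzzy matching into one step each and omits the decision tree and preprint handling. The function names, the record schema (a dict with a "title" key), and the 0.9 similarity threshold are all assumptions made for this example.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", ascii_only.lower())
    return " ".join(cleaned.split())


def deduplicate(records, threshold=0.9):
    """Return (unique, flagged) for a list of reference records.

    records:   list of dicts with a "title" key (hypothetical schema).
    threshold: assumed similarity cutoff, not a value from the paper.
    flagged:   list of (kept_record, likely_duplicate) pairs for manual review.
    """
    # Step 1: exact matching on normalized titles; keep the first occurrence.
    seen = set()
    exact_unique = []
    for rec in records:
        key = normalize(rec["title"])
        if key not in seen:
            seen.add(key)
            exact_unique.append((key, rec))

    # Step 2: fuzzy matching; compare each survivor with the records kept so far.
    unique = []
    flagged = []
    for key, rec in exact_unique:
        match = None
        for other_key, other in unique:
            if SequenceMatcher(None, key, other_key).ratio() >= threshold:
                match = other
                break
        if match is None:
            unique.append((key, rec))
        else:
            flagged.append((match, rec))

    return [rec for _, rec in unique], flagged
```

Flagging near matches as pairs, rather than silently dropping them, leaves the final call to a human reviewer or a later rule. The pairwise comparison in step 2 is quadratic in the number of records, so a sketch like this would need blocking or sorting to scale to large datasets.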

Competing Interest Statement

The authors have declared no competing interest.

Footnotes

  • https://github.com/jxshen311/RefDeduR

  • https://jxshen311.github.io/RefDeduR/

  • https://github.com/jxshen311/RefDeduR_benchmark

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
Posted October 03, 2022.
Subject Area

  • Scientific Communication and Education