bioRxiv
New Results

Microbial contaminants cataloged as novel human sequences in recent human pan-genomes

Mosè Manni, Evgeny Zdobnov
doi: https://doi.org/10.1101/2020.03.16.994376
Mosè Manni
1 Department of Genetic Medicine and Development, University of Geneva Medical School, Geneva, Switzerland
2 Swiss Institute of Bioinformatics, Geneva, Switzerland
  • For correspondence: mose.manni@unige.ch evgeny.zdobnov@unige.ch
Evgeny Zdobnov
1 Department of Genetic Medicine and Development, University of Geneva Medical School, Geneva, Switzerland
2 Swiss Institute of Bioinformatics, Geneva, Switzerland

Abstract

Human pan-genome studies offer the opportunity to identify human non-reference sequences (NRSs), which are, by definition, not represented in the human reference genome (GRCh38). NRSs serve as useful catalogues of genetic variation for population and disease studies; although the majority consist of repetitive elements, a substantial fraction is made up of non-repetitive, non-reference (NRNR) sequences. The presence of non-human sequences in these catalogues can inflate the number of “novel” human sequences, lead to overestimates of genetic differentiation among populations, and jeopardize subsequent analyses that rely on these resources. We uncovered almost 2,000 contaminant sequences of microbial origin among the NRNR sequences from recent human pan-genome studies. The contaminant contigs (3,501,302 bp) harbour genes encoding a total of 4,720 predicted proteins (>40 aa). The major sources of contamination are related to Rhizobiales, Burkholderiales, Pseudomonadales and Lactobacillales, which may have been associated with the original samples or introduced later during sequencing experiments. We additionally observed that the majority of the novel human protein-coding genes described in one of the studies entirely overlap repetitive regions and are likely to be false-positive predictions. We report here the list of contaminant sequences in three recent human pan-genome catalogues and discuss strategies to increase decontamination efficacy in current and future pan-genome studies.

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-ND 4.0 International license.
Posted March 18, 2020.
bioRxiv 2020.03.16.994376; doi: https://doi.org/10.1101/2020.03.16.994376
Subject Area

  • Genomics