bioRxiv
New Results

A Standard Operating Procedure For Outlier Removal In Large-Sample Epidemiological Transcriptomics Datasets

Hege Marie Bøvelstad, Einar Holsbø, Lars Ailo Bongo, Eiliv Lund
doi: https://doi.org/10.1101/144519
Hege Marie Bøvelstad
Norwegian Institute of Public Health
Einar Holsbø
UiT - The Arctic University of Norway
Lars Ailo Bongo
UiT - The Arctic University of Norway
  • For correspondence: larsab@cs.uit.no
Eiliv Lund
UiT - The Arctic University of Norway

Abstract

Transcriptome measurements and other -omics data are increasingly used in epidemiological studies. Most omics studies to date are small, with sample sizes in the tens or sometimes low hundreds, but this is changing. Our Norwegian Women and Cancer (NOWAC) datasets are already one or two orders of magnitude larger. The NOWAC biobank contains about 50,000 blood samples from a prospective study, and around 125 breast cancer cases occur in this cohort each year. Because biological variation in gene expression is large, many observations are needed to draw scientific conclusions; this is true for both microarray and RNA-seq data. Hence, larger datasets are likely to become more common soon. Technical outliers are observations that were distorted in the lab or during sampling. If not removed, these observations add bias and variance to later statistical analyses and may skew the results. Hence, quality assessment and data cleaning are important. We find common quality assessment libraries difficult to work with for large datasets for two reasons: slow execution speed and unsuitable visualizations. In this paper, we present our standard operating procedure (SOP) for outlier removal in large-sample transcriptomics datasets. Our SOP combines automatic outlier detection with manual evaluation to avoid removing valuable observations. We use laboratory quality measures and statistical measures of deviation to aid the analyst. These are implemented in the nowaclean R package, available on GitHub (https://github.com/3inar/nowaclean). Finally, we evaluate our SOP on one of our larger datasets, with 832 observations.
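
To illustrate how a statistical measure of deviation can shortlist observations for manual evaluation, the following is a minimal Python sketch. It is not the nowaclean implementation and does not use its API: the function name, the choice of a PCA-based distance with a robust z-score, the threshold, and the simulated data are all assumptions made for illustration only.

```python
import numpy as np


def flag_candidate_outliers(expr, n_components=2, threshold=3.5):
    """Return row indices of samples that deviate strongly in PCA space.

    Hypothetical helper (not part of nowaclean). `expr` is a
    samples-by-genes matrix of expression values.
    """
    expr = np.asarray(expr, dtype=float)
    centered = expr - expr.mean(axis=0)

    # SVD of the centered matrix; u * s gives the principal component scores.
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]

    # Distance of each sample from the coordinate-wise median of the scores.
    dist = np.linalg.norm(scores - np.median(scores, axis=0), axis=1)

    # Robust z-score based on the median absolute deviation (MAD).
    med = np.median(dist)
    mad = np.median(np.abs(dist - med))
    if mad == 0:
        return np.array([], dtype=int)
    robust_z = (dist - med) / (1.4826 * mad)

    return np.flatnonzero(robust_z > threshold)


# Simulated data (assumption): 200 samples, 50 genes, one distorted sample.
rng = np.random.default_rng(0)
expr = rng.normal(size=(200, 50))
expr[7] += 10.0  # mimic a technical artifact affecting every gene in sample 7

candidates = flag_candidate_outliers(expr)
print("Candidates for manual review:", candidates)
```

The function only proposes candidates; in keeping with the SOP described above, the decision to remove an observation would remain with the analyst.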

Copyright 
The copyright holder for this preprint is the author/funder. It is made available under a CC-BY 4.0 International license.
  • Posted May 31, 2017.


Subject Area

  • Epidemiology