PT - JOURNAL ARTICLE
AU - Hege Marie Bøvelstad
AU - Einar Holsbø
AU - Lars Ailo Bongo
AU - Eiliv Lund
TI - A standard operating procedure for outlier removal in large-sample epidemiological transcriptomics datasets
AID - 10.1101/144519
DP - 2017 Jan 01
TA - bioRxiv
PG - 144519
4099 - http://biorxiv.org/content/early/2017/05/31/144519.short
4100 - http://biorxiv.org/content/early/2017/05/31/144519.full
AB - Transcriptome measurements and other omics-type data are increasingly used in epidemiological studies. Most omics studies to date are small, with sample sizes in the tens or sometimes low hundreds, but this is changing. Our Norwegian Women and Cancer (NOWAC) datasets are to date one to two orders of magnitude larger. The NOWAC biobank contains about 50,000 blood samples from a prospective study. Around 125 breast cancer cases occur in this cohort each year. The large biological variation in gene expression means that many observations are needed to draw scientific conclusions. This is true for both microarray and RNA-seq data. Hence, larger datasets are likely to become more common soon. Technical outliers are observations that were somehow distorted at the lab or during sampling. If not removed, these observations add bias and variance to later statistical analyses and may skew the results. Hence, quality assessment and data cleaning are important. We find common quality assessment libraries difficult to work with for large datasets for two reasons: slow execution speed and unsuitable visualizations. In this paper, we present our standard operating procedure (SOP) for large-sample transcriptomics datasets. Our SOP combines automatic outlier detection with manual evaluation to avoid removing valuable observations. We use laboratory quality measures and statistical measures of deviation to aid the analyst. These are available in the nowaclean R package, currently available on GitHub (https://github.com/3inar/nowaclean). Finally, we evaluate our SOP on one of our larger datasets with 832 observations.