PT - JOURNAL ARTICLE
AU - Bøvelstad, Hege Marie
AU - Holsbø, Einar
AU - Bongo, Lars Ailo
AU - Lund, Eiliv
TI - A standard operating procedure for outlier removal in large-sample epidemiological transcriptomics datasets
AID - 10.1101/144519
DP - 2017 Jan 01
TA - bioRxiv
PG - 144519
4099 - http://biorxiv.org/content/early/2017/05/31/144519.short
4100 - http://biorxiv.org/content/early/2017/05/31/144519.full
AB - Transcriptome measurements and other -omics data are increasingly used in epidemiological studies. Most omics studies to date are small, with sample sizes in the tens or sometimes low hundreds, but this is changing. Our Norwegian Women and Cancer (NOWAC) datasets are one or two orders of magnitude larger. The NOWAC biobank contains about 50000 blood samples from a prospective study. Around 125 breast cancer cases occur in this cohort each year. The large biological variation in gene expression means that many observations are needed to draw scientific conclusions. This is true for both microarray and RNA-seq data. Hence, larger datasets are likely to become more common soon. Technical outliers are observations that were distorted in the lab or during sampling. If not removed, these observations add bias and variance to later statistical analyses and may skew the results. Hence, quality assessment and data cleaning are important. We find common quality assessment libraries difficult to work with for large datasets for two reasons: slow execution speed and unsuitable visualizations. In this paper, we present our standard operating procedure (SOP) for large-sample transcriptomics datasets. Our SOP combines automatic outlier detection with manual evaluation to avoid removing valuable observations. We use laboratory quality measures and statistical measures of deviation to aid the analyst. These are implemented in the nowaclean R package, currently available on GitHub (https://github.com/3inar/nowaclean). Finally, we evaluate our SOP on one of our larger datasets with 832 observations.