TY  - JOUR
T1  - DataPackageR: Reproducible data preprocessing, standardization and sharing using R/Bioconductor for collaborative data analysis
JF  - bioRxiv
DO  - 10.1101/342907
SP  - 342907
AU  - Greg Finak
AU  - Bryan T. Mayer
AU  - William Fulp
AU  - Paul Obrecht
AU  - Alicia Sato
AU  - Eva Chung
AU  - Drienna Holman
AU  - Raphael Gottardo
Y1  - 2018/01/01
UR  - http://biorxiv.org/content/early/2018/06/08/342907.abstract
N2  - A central tenet of reproducible research is that scientific results are published along with the underlying data and software code necessary to reproduce and verify the findings. A host of tools and software have been released that facilitate such work-flows and scientific journals have increasingly demanded that code and primary data be made available with publications. There has been little practical advice on implementing reproducible research work-flows for large ‘omics’ or systems biology data sets used by teams of analysts working in collaboration. In such instances it is important to ensure all analysts use the same version of a data set for their analyses. Yet, instantiating relational databases and standard operating procedures can be unwieldy, with high “startup” costs and poor adherence to procedures when they deviate substantially from an analyst’s usual work-flow. Ideally a reproducible research work-flow should fit naturally into an individual’s existing work-flow, with minimal disruption. Here, we provide an overview of how we have leveraged popular open source tools, including Bioconductor, Rmarkdown, git version control, R, and specifically R’s package system combined with a new tool DataPackageR, to implement a lightweight reproducible research work-flow for preprocessing large data sets, suitable for sharing among small-to-medium sized teams of computational scientists.
Our primary contribution is the DataPackageR tool, which decouples time-consuming data processing from data analysis while leaving a traceable record of how raw data is processed into analysis-ready data sets. The software ensures packaged data objects are properly documented and performs checksum verification of these along with basic package version management, and importantly, leaves a record of data processing code in the form of package vignettes. Our group has implemented this work-flow to manage, analyze and report on pre-clinical immunological trial data from multi-center, multi-assay studies for the past three years.
antibody: A protein produced by B cells that recognizes a specific antigen. Antibodies are released by B cells and bind to antigens that the body recognizes as foreign (non-self), such as bacterial and viral antigens.
baseline: In vaccine trials this refers to a time point in the study before treatment is given, usually immediately before treatment.
FASTQ: A standard, text-based file format for storing RNA and DNA sequence data along with quality information for individual nucleotide calls and some associated metadata. FASTQ files are the standard output of RNA sequencing experiments.
FCM: Flow cytometry is a high content, high throughput assay enabling simultaneous multiparametric (i.e. cell-surface and intracellular protein abundance, or DNA content) measurement on suspended single-cells as they pass through the detection apparatus of a flow cytometer.
prime-boost: A vaccination strategy where one type of vaccine formulation is given to prime the immune system and another is given to boost the immune response. An example heterologous prime-boost modality is where a recombinant DNA formulation is given as the prime vaccine and a protein is given as the boost.
QC: The process of checking data quality.
RNASeq: Ribonucleic acid (RNA) sequencing technology used to measure gene expression in biological isolates from cells or tissues.
VISC: The Vaccine Immunology Statistical Center is a research team at the Fred Hutchinson Cancer Research Center working on analysis and integration of pre-clinical and clinical HIV vaccine trial data.
ER  - 