RT Journal Article SR Electronic T1 Accessible, interactive and cloud-enabled genomic workflows integrated with the NCI Genomic Data Commons JF bioRxiv FD Cold Spring Harbor Laboratory SP 2022.08.11.503660 DO 10.1101/2022.08.11.503660 A1 Hung, Ling-Hong A1 Fukuda, Bryce A1 Schmitz, Robert A1 Hoang, Varik A1 Lloyd, Wes A1 Yeung, Ka Yee YR 2023 UL http://biorxiv.org/content/early/2023/04/10/2022.08.11.503660.abstract AB Large-scale data resources such as the NCI’s Cancer Research Data Commons (CRDC) and the Genotype-Tissue Expression (GTEx) portal have the potential to simplify the analysis of cancer data by providing data that can be used as standards or controls. However, comparisons with data that is processed using different methodologies or even different versions of software, parameters and supporting datasets can lead to artefactual results. Reproducing the exact workflows from text-based standard operating procedures (SOPs) is problematic as the documentation can be incomplete or out of date, especially for complex workflows involving many executables and scripts. We extend our open-source Biodepot-workflow-builder (Bwb) platform to provide a dynamic solution that disseminates the computational protocols to process large-scale sequencing data developed by the National Cancer Institute (NCI) Genomic Data Commons (GDC). Specifically, we converted the GDC DNA sequencing (DNA-Seq) and the GDC mRNA sequencing (mRNA-Seq) SOPs into reproducible, self-installing, containerized, and interactive graphical workflows. Secure integration with protected-access CRDC data is achieved using the Data Commons Framework Services (DCFS) Gen3 protocol. These graphical workflows can be applied to reproducibly analyze datasets across other repositories and/or custom user data. Analyses can be performed on a local laptop or desktop, or on cloud providers.
With RNA-Seq datasets from the GDC and GTEx, we illustrate the importance of uniform analysis of control and treatment data for accurate inference of differentially expressed genes. Furthermore, we demonstrate that these best practices for analyzing RNA-Seq data from different sources can be achieved using our accessible workflows. Most importantly, we demonstrate how our reproducible distribution of the methodology can transform the analyses of cancer genomic data by enabling researchers to leverage datasets across multiple repositories to enhance data interpretation.

Competing Interest Statement: LHH and KYY have equity interest in Biodepot LLC, which receives compensation from NCI SBIR contract numbers 75N91020C00009 and 75N91021C00022. The terms of this arrangement have been reviewed and approved by the University of Washington in accordance with its policies governing outside work and financial conflicts of interest in research.

AMI, Amazon Machine Image; API, application programming interface; AWS, Amazon Web Services; Bwb, Biodepot-workflow-builder; CPTAC, Clinical Proteomic Tumor Atlas Consortium; CRDC, Cancer Research Data Commons; CWL, Common Workflow Language; DCFS, Data Commons Framework Services; dbGaP, database of Genotypes and Phenotypes; DNA-Seq, DNA sequencing; DTT, Data Transfer Tool; EC2, Elastic Compute Cloud; GDC, Genomic Data Commons; IGV, Integrative Genomics Viewer; miRNA-Seq, micro RNA sequencing; NCI, National Cancer Institute; NGS, Next-generation sequencing; PON, Panel of Normals; RNA-Seq, RNA sequencing; TARGET, Therapeutically Applicable Research to Generate Effective Treatments; TCGA, The Cancer Genome Atlas; WDL, Workflow Description Language; WGS, whole genome sequencing; WXS, whole exome sequencing