PT - JOURNAL ARTICLE
AU - Hung, Ling-Hong
AU - Fukuda, Bryce
AU - Schmitz, Robert
AU - Hoang, Varik
AU - Lloyd, Wes
AU - Yeung, Ka Yee
TI - Accessible, interactive and cloud-enabled genomic workflows integrated with the NCI Genomic Data Commons
AID - 10.1101/2022.08.11.503660
DP - 2023 Jan 01
TA - bioRxiv
PG - 2022.08.11.503660
4099 - http://biorxiv.org/content/early/2023/04/10/2022.08.11.503660.short
4100 - http://biorxiv.org/content/early/2023/04/10/2022.08.11.503660.full
AB - Large-scale data resources such as the NCI’s Cancer Research Data Commons (CRDC) and the Genotype-Tissue Expression (GTEx) portal have the potential to simplify the analysis of cancer data by providing data that can be used as standards or controls. However, comparisons with data that were processed using different methodologies, or even different versions of software, parameters and supporting datasets, can lead to artefactual results. Reproducing the exact workflows from text-based standard operating procedures (SOPs) is problematic because the documentation can be incomplete or out of date, especially for complex workflows involving many executables and scripts. We extend our open-source Biodepot-workflow-builder (Bwb) platform to provide a dynamic solution that disseminates the computational protocols developed by the National Cancer Institute (NCI) Genomic Data Commons (GDC) for processing large-scale sequencing data. Specifically, we converted the GDC DNA sequencing (DNA-Seq) and the GDC mRNA sequencing (mRNA-Seq) SOPs into reproducible, self-installing, containerized, and interactive graphical workflows. Secure integration with protected-access CRDC data is achieved using the Data Commons Framework Services (DCFS) Gen3 protocol. These graphical workflows can be applied to reproducibly analyze datasets from other repositories and/or custom user data. Analyses can be performed on a local laptop or desktop, or on cloud providers.
With RNA-Seq datasets from the GDC and GTEx, we illustrate the importance of uniform analysis of control and treatment data for accurate inference of differentially expressed genes. Furthermore, we demonstrate that these best practices for analyzing RNA-Seq data from different sources can be achieved using our accessible workflows. Most importantly, we demonstrate how our reproducible distribution of the methodology can transform the analyses of cancer genomic data by enabling researchers to leverage datasets across multiple repositories to enhance data interpretation.

Competing Interest Statement
LHH and KYY have equity interest in Biodepot LLC, which receives compensation from NCI SBIR contract numbers 75N91020C00009 and 75N91021C00022. The terms of this arrangement have been reviewed and approved by the University of Washington in accordance with its policies governing outside work and financial conflicts of interest in research.

AMI: Amazon Machine Image
API: application programming interface
AWS: Amazon Web Services
Bwb: Biodepot-workflow-builder
CPTAC: Clinical Proteomic Tumor Atlas Consortium
CRDC: Cancer Research Data Commons
CWL: Common Workflow Language
DCFS: Data Commons Framework Services
dbGaP: database of Genotypes and Phenotypes
DNA-Seq: DNA sequencing
DTT: Data Transfer Tool
EC2: Elastic Compute Cloud
GDC: Genomic Data Commons
IGV: Integrative Genomics Viewer
miRNA-Seq: micro RNA sequencing
NCI: National Cancer Institute
NGS: Next-generation sequencing
PON: Panel of Normals
RNA-Seq: RNA sequencing
TARGET: Therapeutically Applicable Research to Generate Effective Treatments
TCGA: The Cancer Genome Atlas
WDL: Workflow Description Language
WGS: whole genome sequencing
WXS: whole exome sequencing