PT - JOURNAL ARTICLE
AU - Nathan C. Sheffield
AU - Michał Stolarczyk
AU - Vincent P. Reuter
AU - André F. Rendeiro
TI - Linking big biomedical datasets to modular analysis with Portable Encapsulated Projects
AID - 10.1101/2020.10.08.331322
DP - 2021 Jan 01
TA - bioRxiv
PG - 2020.10.08.331322
4099 - http://biorxiv.org/content/early/2021/05/19/2020.10.08.331322.short
4100 - http://biorxiv.org/content/early/2021/05/19/2020.10.08.331322.full
AB - Organizing and annotating biological sample data is critical in data-intensive bioinformatics. Unfortunately, metadata formats from a data provider are often incompatible with requirements of a processing tool. There is no broadly accepted standard to organize metadata across biological projects and bioinformatics tools, restricting the portability and reusability of both annotated datasets and analysis software. To address this, we present Portable Encapsulated Projects (PEP), a formal specification for biological sample metadata structure. The PEP specification accommodates typical features of data-intensive bioinformatics projects with many samples, whether from individual experiments, organisms, or single cells. In addition to standardization, the PEP specification provides descriptors and modifiers for different organizational layers of a project, which improve portability among computing environments and facilitate use of different processing tools. PEP includes a schema validator framework, allowing formal definition of required metadata attributes for any type of biomedical data analysis. We have implemented packages for reading PEPs in both Python and R to provide a language-agnostic interface for organizing project metadata. PEP therefore presents an important step toward unifying data annotation and processing tools in data-intensive biological research projects. Competing Interest Statement: The authors have declared no competing interest.