Abstract
Data management and publication are core components of the research process. An emerging challenge that has received limited attention in biology is managing, working with, and providing access to data under continual active collection. “Living data” present unique challenges in quality assurance and control, data publication, archiving, and reproducibility. We developed a living data workflow for a long-term ecological study that addresses many of the challenges associated with managing this type of data. We do this by leveraging existing tools to: 1) perform quality assurance and control; 2) import, restructure, version, and archive data; 3) rapidly publish new data in ways that ensure appropriate credit to all contributors; and 4) automate most steps in the data pipeline to reduce the time and effort required by researchers. The workflow uses two tools from software development, version control and continuous integration, to create a modern data management system that automates the pipeline.
Glossary
- CI/continuous integration
- (also see Box 2) the continuous application of quality control. A practice used in software engineering to continuously implement processes for automated testing and integration of new code into a project.
- Git
- (also see Box 1) Git is an open source program for tracking changes in text files (version control), and is the core technology that GitHub, the social and user interface, is built on top of.
- GitHub
- (also see Box 1) a web-based hosting service for version control using Git.
- GitHub-Travis integration
- connects the Travis continuous integration service to build and test projects hosted at GitHub. Once set up, a GitHub project will automatically deploy CI and test pull requests through Travis.
- GitHub-Zenodo integration
- connects a GitHub project to a Zenodo archive. Zenodo takes an archive of your GitHub repository each time you create a new release.
- Living data
- data that continue to be updated and added to, while simultaneously being made available for analyses. For example: long-term observational studies, experiments with repeated sampling, data derived from automated sensors (e.g., weather stations or GPS collars).
- Pull request
- A set of proposed changes to the files in a GitHub repository made by one collaborator, to be reviewed by other collaborators before being accepted or rejected.
- QA/QC
- Quality Assurance/Quality Control. The process of ensuring the data in our repository meet a certain quality standard.
- Repository
- a location (folder) containing all the files for a particular project. Files could include code, data files, or documentation. Each file’s revision history is also stored in the repository.
- testthat
- an R package that facilitates formal, automated testing
- Travis CI
- (also see Box 2) a hosted continuous integration service that is used to test and build GitHub projects. Open source projects are tested at no charge.
- unit test
- a software testing approach that checks to make sure that pieces of code work in the expected way
- Version control
- A system for managing changes made to a file or set of files over time that allows the user to a) see what changes were made when and b) revert to a previous state if desired.
- Zenodo
- a general, open-access, research data repository
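
To make the QA/QC and unit test entries above concrete, the following is a minimal sketch of an automated data check of the kind a continuous integration service could run on each pull request. It is written in Python with plain assertions rather than the testthat R package named in the glossary. The field names, species codes, and weight limit are hypothetical and chosen only for illustration; they are not taken from the study's data.

```python
from datetime import date

# Hypothetical species codes and weight limit, for illustration only.
VALID_SPECIES = {"AA", "BB", "CC"}
MAX_WEIGHT_G = 300


def check_record(record, today=None):
    """Return a list of QA/QC problems found in one data record."""
    today = today or date.today()
    problems = []
    if record["species"] not in VALID_SPECIES:
        problems.append("unknown species code")
    if not 0 < record["weight_g"] <= MAX_WEIGHT_G:
        problems.append("weight out of range")
    if record["date"] > today:
        problems.append("date in the future")
    return problems


def test_clean_record_passes():
    # A well-formed record should produce no problems.
    record = {"species": "AA", "weight_g": 42, "date": date(2018, 3, 1)}
    assert check_record(record, today=date(2018, 6, 1)) == []


def test_bad_record_is_flagged():
    # An unknown code and a negative weight should both be reported.
    record = {"species": "XX", "weight_g": -5, "date": date(2018, 3, 1)}
    assert check_record(record, today=date(2018, 6, 1)) == [
        "unknown species code",
        "weight out of range",
    ]


if __name__ == "__main__":
    test_clean_record_passes()
    test_bad_record_is_flagged()
    print("all checks passed")
```

In a setup like the one the glossary describes, tests of this kind would run automatically whenever new data are proposed through a pull request, so that a failing check blocks the change until the data are corrected.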