MultiCellDS: a standard and a community for sharing multicellular data

Cell biology is increasingly focused on cellular heterogeneity and multicellular systems. To make the fullest use of experimental, clinical, and computational efforts, we need standardized data formats, community-curated “public data libraries”, and tools to combine and analyze shared data. To address these needs, our multidisciplinary community created MultiCellDS (MultiCellular Data Standard): an extensible standard, a library of digital cell lines and tissue snapshots, and support software. With the help of experimentalists, clinicians, modelers, and data and library scientists, we can grow this seed into a community-owned ecosystem of shared data and tools, to the benefit of basic science, engineering, and human health.

data are available online to drive machine learning, we lack a standardized way to record extracted features, such as cell positions, sizes, shapes, and immunohistochemical stain statuses. Moreover, our lack of standardized data prevents us from directly linking experimental and computational model systems, while also hindering our efforts to reconcile experimental and simulation results against clinical knowledge. With standardized data, we could directly couple experimental, computational, and clinical workflows, develop unified tools, and exchange insights. This would bring key aspects of information science to multicellular biology, while facilitating reproducibility. See Figure 1.
To address these unmet needs, we developed the MultiCellular Data Standard (MultiCellDS), a community-driven project [3] that facilitates the quantitative recording of the cellular microenvironment and phenotype in the form of digital cell lines and digital snapshots. Community-curated, centralized repositories of standardized data will pave the way to new data workflows and pipelines that will revolutionize the way we collaborate and learn in multicellular biology. By sharing standardized data, with formats that work for experimental, clinical, and computational model systems, we can work together to bridge the knowledge divide between molecular cell biology and the phenotypes of multicellular systems, tissues, organisms, and patients.
[Figure 1: Integration of experimental, clinical, and computational workflows with standardized data] [Box 1: Glossary of key terms (see below)]

A community-driven project
To jump-start the project, a core team (consisting of a cancer biologist, a mathematical modeler, data scientists, and a medical oncologist) drafted a working prototype of the data standard. To ensure that the standard could adapt to the diverse needs of the experimental, computational, and clinical communities, we assembled a group of invited "reviewers" (over thirty members spanning multiple disciplines, at institutions in the US and Europe) to critique the nascent standard and suggest improvements in three rounds of review. The invited reviews were supplemented by public talks and reviews to get feedback from the broader research community. The core team was responsible for leading the reviews, incorporating all feedback, and coordinating data and software contributions. This structure (a core team accountable to a skilled, multidisciplinary panel of reviewers) helped to balance the need for extensive community feedback and involvement with the need for fast, iterative development.

Use cases to drive development
Each round of review was driven by a set of use cases: to represent cell phenotype measurements as digital cell lines (round 1); to record simulation and experimental data as digital snapshots (round 2); and to record segmented pathology data and de-identified clinical annotations in digital snapshots (round 3). These terms are defined in Box 1 and the descriptions below. Each round of review iteratively refined the data standard until we could complete the use cases. This helped ensure that the standard was not just a theoretical dictionary of terms, but a workable data language. Tackling unexpected problems suggested new data elements and metadata that could never have been anticipated purely through brainstorming and committee meetings. As a side effect, this process helped populate an initial "public library" of data, while driving software development for data analysis, visualization, and simulation.

Digital snapshots: flash-freezing multiscale biology
Biological systems are typically observed or simulated at discrete, sampled times. Digital snapshots allow us to "freeze" these systems at a single time point and systematically record their states. Each snapshot begins with metadata: information on who generated the data and how (provenance) and other relevant details. A snapshot then records the microenvironmental context (e.g., oxygen concentration), either spatially or as average values. The snapshot closes with the multicellular data: e.g., cell positions, their phenotypes (at scales ranging from receptor status to gross behavior and morphology), and their types, if known (Figure 2). We use the same data elements for in vitro, in vivo, and in silico systems to facilitate interdisciplinary work.
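The three-part layout described above (metadata, microenvironment, multicellular data) can be sketched as a small data structure. This is a minimal illustration only, not the MultiCellDS schema itself: the class names, field names, and units below are all assumptions introduced for this example, and the values are placeholders rather than measurements.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Metadata:
    """Provenance: who generated the data and how (hypothetical fields)."""
    creator: str
    method: str
    notes: str = ""


@dataclass
class Cell:
    """State of one cell at the snapshot time."""
    position: Tuple[float, float, float]  # assumed unit: microns
    phenotype: Dict[str, float] = field(default_factory=dict)
    cell_type: Optional[str] = None  # None when the type is not known


@dataclass
class Snapshot:
    """A multicellular system frozen at a single time point."""
    time: float  # assumed unit: minutes
    metadata: Metadata
    microenvironment: Dict[str, float]  # average values, e.g. oxygen
    cells: List[Cell] = field(default_factory=list)

    def count_by_type(self) -> Dict[str, int]:
        """Tally cells per type, grouping untyped cells as 'unknown'."""
        counts: Dict[str, int] = {}
        for cell in self.cells:
            key = cell.cell_type or "unknown"
            counts[key] = counts.get(key, 0) + 1
        return counts


# Placeholder values for illustration; these are not measurements.
snapshot = Snapshot(
    time=0.0,
    metadata=Metadata(creator="example lab", method="simulation"),
    microenvironment={"oxygen": 38.0},
    cells=[
        Cell(position=(0.0, 0.0, 0.0), cell_type="MCF-7"),
        Cell(position=(10.0, 0.0, 0.0), cell_type="MCF-7"),
        Cell(position=(0.0, 10.0, 0.0)),
    ],
)
print(snapshot.count_by_type())
```

Because the same structure holds whether the cells came from an image, an animal model, or a simulation, a tool written against it does not need to know where the snapshot originated.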

Digital cell lines: putting cell phenotype in context
By analyzing time series of digital snapshots, we can quantitate cell phenotype and correlate it with microenvironmental conditions. A digital cell line collects such measurements for a single cell type as phenotype datasets, each of which systematically records cell behavior in a single microenvironmental context (e.g., under normoxic culture conditions). To better systematize our current knowledge while exposing missing data, we cluster cell phenotype into several functional groups: cell cycling, cell death, mechanics, adhesion, motility, pharmacodynamics, secretion and uptake processes, and cell size/mass/morphology. A digital cell line can contain many phenotype datasets if it has been studied in many conditions, and each phenotype dataset can expand as our knowledge increases. Each phenotype dataset is matched to a description of the microenvironmental context, and it can be extended to embed any scale of data, such as "omics" data (Figure 3).
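The hierarchy described above (one cell line, many phenotype datasets, each tied to a microenvironmental context and split into functional groups) might be sketched as follows. The class names, the dictionary-based layout, and the parameter names are assumptions made for illustration and do not reproduce the actual MultiCellDS elements; the numeric values are placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Functional groups as listed in the text.
FUNCTIONAL_GROUPS = (
    "cell cycling",
    "cell death",
    "mechanics",
    "adhesion",
    "motility",
    "pharmacodynamics",
    "secretion and uptake",
    "size/mass/morphology",
)


@dataclass
class PhenotypeDataset:
    """Phenotype measurements in one microenvironmental context."""
    microenvironment: Dict[str, float]
    groups: Dict[str, Dict[str, float]] = field(default_factory=dict)

    def missing_groups(self) -> List[str]:
        """Functional groups with no recorded parameters yet."""
        return [g for g in FUNCTIONAL_GROUPS if g not in self.groups]


@dataclass
class DigitalCellLine:
    """All phenotype datasets recorded for a single cell type."""
    name: str
    datasets: Dict[str, PhenotypeDataset] = field(default_factory=dict)


# Placeholder values for illustration; these are not measurements.
line = DigitalCellLine(name="example_line")
line.datasets["normoxic"] = PhenotypeDataset(
    microenvironment={"oxygen": 38.0},
    groups={
        "cell cycling": {"cycle_duration": 24.0},
        "motility": {"speed": 0.5},
    },
)
missing = line.datasets["normoxic"].missing_groups()
print(missing)
```

Keeping a fixed list of functional groups is what makes missing data visible: an empty group is an explicit gap rather than a silent omission.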

A public library of digital cell lines and tissue snapshots
While testing and refining the data standard, we nucleated a "public library" of open data. To test whether digital cell lines could work beyond human cancer cell lines, we created digital cell lines for murine lymphoma, endothelial cells (to demonstrate highly motile, non-cancerous cells), yeast (our first non-mammalian cells), and bacteria (our first prokaryotic cell lines). Beyond "standardized" cell lines like MCF-7, we also created patient-derived digital cell lines for glioblastoma multiforme [4] and ductal carcinoma in situ of the breast [5]. In the course of creating over 200 digital cell lines, we demonstrated that the hierarchical phenotype dataset structure could scale from basic (e.g., parameters derived from ATCC culture protocol documents [6]) to extremely detailed (e.g., MCF-10A and MDA-MB-231 lines derived from a multi-institution study [7]). We also seeded the library with digital snapshots, including reference cancer simulation datasets from [8] and [9] and segmented breast cancer pathology images [10] with patient clinical annotations. Over the next several years, we plan to drastically extend this public library to include segmented TCGA pathology data [11] and segmented mouse liver data [12]. This entire library, stored in a centralized repository called MultiCellDB (multicellular database; see http://portals.MultiCellDS.org), is freely available under the CC BY 4.0 license.

Incentivizing good behavior: Rewarding contribution via attribution
There are three major types of contributions for community-curated data in MultiCellDB: generating the original data or measurements; performing data analysis or transformation; and actively curating the data (potentially from many sources!). All three types of contributions are essential, and they should be tracked in the provenance for reproducibility, transparency, and proper attribution. Moreover, the software tools used for data analyses and transformations need to be properly recorded. When a digital cell line or snapshot is used in a later publication, it is essential to record this chain of contributions, not only for reproducibility, but also to incentivize future contributions. Succinctly citing a chain of contributions is challenging, but we propose the following form: "We used digital cell line MCF-7 [refs1] version n1 (MultiCellDB id1), created with data and contributions from [refs2,refs3]." Here, refs1 cites the publications or preprints that created the first and current version of the digital cell line, refs2 cites the original data source(s), and refs3 cites tools, software, and post-publication analyses or protocols used to transform the original data (refs2) into MultiCellDS data elements. It is also important to cite a fixed version of the digital cell line to ensure that future replication studies use the same data. The fixed version is recorded as the version number (n1) and a unique identifier (id1) (Box 2). Additional details on provenance and other metadata tracking can be found in the further resource documents below.
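The proposed citation form is regular enough to generate from recorded provenance. The sketch below is one way this could look; the function name `format_citation` and its parameters are hypothetical and are not part of any MultiCellDS software. The example call uses the template's own placeholders (refs1, n1, id1, and so on) rather than real references.

```python
from typing import Sequence


def format_citation(
    name: str,
    version: str,
    identifier: str,
    line_refs: Sequence[str],
    data_refs: Sequence[str],
    tool_refs: Sequence[str],
) -> str:
    """Build the proposed citation sentence for a digital cell line.

    line_refs: publications that created the first and current version
    data_refs: original data sources
    tool_refs: tools, software, and analyses used to transform the data
    """
    cite_line = ",".join(line_refs)
    cite_sources = ",".join(list(data_refs) + list(tool_refs))
    return (
        f"We used digital cell line {name} [{cite_line}] version {version} "
        f"(MultiCellDB {identifier}), created with data and contributions "
        f"from [{cite_sources}]."
    )


# Template placeholders from the text, not real references.
citation = format_citation(
    name="MCF-7",
    version="n1",
    identifier="id1",
    line_refs=["refs1"],
    data_refs=["refs2"],
    tool_refs=["refs3"],
)
print(citation)
```

Generating the sentence from stored metadata, rather than writing it by hand, helps keep the version number and identifier consistent with the data that were actually used.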

The payoffs for sharing data in a common format
There are additional benefits to having a common format for multicellular data. With a fixed target of data elements, software developers across labs can work together to write data analysis, visualization, and simulation software that can be connected into sophisticated, reproducible research pipelines (Figure 4). This should lead to higher-quality software with development costs spread over more labs, while allowing researchers to cross-validate their results in a variety of compatible tools.
Because MultiCellDS uses the same data elements for experimental, clinical and simulation data, we can even use the same tools across disciplines, allowing better integration of experimental data into simulations, and more quantitative model validation. Insights in one domain can more readily "cross-pollinate" advances in the others when the data can be seamlessly read by the same tools across disciplines. Storing standardized data in centralized data repositories helps to archive critical data in the long term, thus improving reproducibility and repeatability. Moreover, MultiCellDS contributors can potentially widen their impact with increased data reuse and citations.

Getting a bird's eye view of biology with centralized data repositories
By uniformly collecting cell phenotype knowledge in a centralized repository (MultiCellDB), we gain a unique opportunity to take a step back from focused single-lab investigations to compare cell behavior across many cell types. This uniform recording will help us to identify conserved behavior as well as contradictory data that may point to unknown biology or experimental error. Moreover, we can more readily identify gaps in our knowledge, to more systematically plan future experiments.
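As a rough illustration of such a cross-library survey, the sketch below scans one parameter at a time across several cell lines, flags parameters whose values agree within a tolerance as "conserved", and lists the cell lines where a parameter has not been measured. The function name `survey`, the relative-spread criterion, and the 10% default tolerance are all assumptions chosen for this example, and the library contents are placeholders rather than data from MultiCellDB.

```python
from typing import Dict, List, Union

Report = Dict[str, Dict[str, Union[bool, List[str]]]]


def survey(library: Dict[str, Dict[str, float]], tolerance: float = 0.1) -> Report:
    """Summarize each parameter across all cell lines in the library.

    A parameter counts as conserved when it was measured in at least two
    cell lines and (max - min) / |mean| does not exceed the tolerance.
    """
    parameters = sorted({p for values in library.values() for p in values})
    report: Report = {}
    for parameter in parameters:
        measured = {
            name: values[parameter]
            for name, values in library.items()
            if parameter in values
        }
        missing = sorted(name for name in library if name not in measured)
        vals = list(measured.values())
        mean = sum(vals) / len(vals)
        spread = (max(vals) - min(vals)) / abs(mean) if mean else 0.0
        report[parameter] = {
            "conserved": len(vals) > 1 and spread <= tolerance,
            "missing": missing,
        }
    return report


# Placeholder values for illustration; these are not measurements.
library = {
    "line_A": {"cycle_time": 20.0, "speed": 1.0},
    "line_B": {"cycle_time": 21.0, "speed": 3.0},
    "line_C": {"cycle_time": 20.5},
}
report = survey(library)
print(report)
```

A parameter that fails the check is not necessarily wrong: as noted above, disagreement may reflect unknown biology as much as experimental error, so the output is a list of places to look rather than a verdict.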

Future directions, challenges, and a call to arms
We have developed a standard to systematically record cell phenotypes, microenvironmental conditions, and the state of multicellular systems. In doing so, we created an initial "public library" of digital cell lines and tissue snapshots. Building upon this, the community is actively creating an ecosystem of standards-compliant software to analyze, visualize, and simulate these shared data. In the near future, multidisciplinary teams will mine the shared data repository to formulate new biological hypotheses, encode these into computer models, and compare simulation outputs to experimental and clinical data to test and refine the hypotheses. Standardized data and shared software tools could accelerate this process, helping to close the gap between the benchtop, mathematical models, and the clinic.
Yet challenges remain. Recording the phenotypes of many cells in snapshots, or of single cell types in digital cell lines, falls short of multicellular biology. This is analogous to many actors delivering monologues on a shared stage. Drama and biology get interesting when the players interact. We must next characterize cell networks, including cell-cell interactions, cell mutations, and cell lineages. These enhancements can be woven into MultiCellDS using complementary ontologies, such as the Cell Behavior Ontology [13].
We must also expand the phenotype datasets to incorporate other critical measurements, such as genomic, proteomic, and metabolomic data. Moreover, we must account for the hysteresis in cell phenotype parameters: cells undergoing stresses do not always return to their original phenotypes once the stresses are removed. The community has begun testing ideas to address these key facets of multicellular systems biology, but broader participation is needed.
This early groundwork relied heavily upon data scientists and mathematicians to define data elements to characterize cell phenotype, microenvironmental conditions, and metadata. It is time to grow the community. We need experts across experimental biology and clinical practice to improve and expand the digital cell lines. We need a community discussion on what it means to improve a measurement: how do we know that a new measurement of cell motility is better than the old one? This will lead to useful discussions of not just reproducibility, but also quality assurance in experimental pipelines. Over time, it will provide a useful library of "reference phenotype values" to help experimentalists quantitatively compare their findings and separate protocol differences from genuine biological effects.
Moving forward, we must ensure that data donation and curation are user-friendly, and that the data standard evolves to meet the needs of the community. Just as community encyclopedias rely upon volunteer editors to update articles, we need volunteers to transform the scattered treasure of unannotated data into curated, standardized data in MultiCellDB. As MultiCellDS grows, we anticipate making the leap from a grassroots effort to a self-sustaining community, where the availability of standardized data and compatible software drives further adoption, contributions, new techniques, and community growth. Lastly, it is up to the community to make use of the data: to contribute more data; to mine the data for patterns that drive new hypotheses; to test new hypotheses in computational, experimental, and clinical models; and to unlock new knowledge that drives scientific progress and yields new therapies and strategies to improve health.