Annotated Compendium of 102 Breast Cancer Gene-Expression Datasets

Transcriptomic data from breast-cancer patients are widely available in public repositories. However, before a researcher can perform statistical inferences or make biological interpretations from such data, they must find relevant datasets, download the data, and perform quality checks. In many cases, it is also useful to normalize and standardize the data for consistency and to use updated genome annotations. Additionally, researchers need to parse and interpret metadata: clinical and demographic characteristics of patients. Each of these steps requires computational and/or biomedical expertise, thus imposing a barrier to reuse for many researchers. We have identified and curated 102 publicly available, breast-cancer datasets representing 17,151 patients. We created a reproducible, computational pipeline to download the data, perform quality checks, renormalize the raw gene-expression measurements (when available), assign gene identifiers from multiple databases, and annotate the metadata against the National Cancer Institute Thesaurus, thus making it easier to infer semantic meaning and compare insights across datasets. We have made the curated data and pipeline freely available for other researchers to use. Having these resources in one place promises to accelerate breast-cancer research, enabling researchers to address diverse types of questions, using data from a variety of patient populations and study contexts.


Background
Among women worldwide, breast cancer is the most frequently diagnosed cancer type and the leading cause of cancer death; it accounts for 23% of total cases and 14% of deaths 1 , with over 2 million new cases diagnosed and 600,000 deaths in 2020 2 . These figures represent a substantial public-health burden. Breast cancers exhibit considerable diversity and complexity 3 and are driven by a combination of factors, including genomic material inherited from parents and an accumulation of somatic events within individual cells; these events include mutations, chromosomal rearrangements, gene amplifications, and deletions that eventually lead to evasion of apoptosis and unregulated cellular proliferation 4 . While earlier diagnoses and treatment advances have improved survival and lowered mortality rates 5,6 , researchers continue to seek to understand the molecular underpinnings of this disease, identify subtypes of the disease, and explore ways to tailor treatments for individual patients 3,7 .
Microarrays and RNA sequencing have been used in hundreds of thousands of gene-expression studies (spanning many biomedical contexts) 8 , helping researchers to identify differentially expressed genes, define disease subtypes, inform treatment and patient-management strategies, and predict patient outcomes 9 . Much of the data from these studies has been deposited in public databases such as Gene Expression Omnibus (GEO) 8,10 and ArrayExpress 11 . These databases offer opportunities for reuse, including studies that combine insights from multiple datasets. However, making data accessible is not enough. To be most useful for addressing biomedical questions, the data must also be findable, interoperable, and reusable 12 . First, researchers must locate the data.
Gene-expression datasets are available in diverse repositories. These repositories support keyword searches; however, our experience is that these searches miss many relevant datasets and require considerable manual effort to filter through search results. Although in some cases researchers can reuse data exactly in the form they are found in public repositories, data depositors use a heterogeneous mix of methods for verifying data quality, normalizing expression measurements, and then filtering, summarizing, and annotating these measurements. Sometimes steps are skipped or may be completed using invalid methods. Furthermore, genome annotations are regularly updated, so datasets that were annotated years ago may have stale annotations. Fortunately, gene-expression data are frequently provided in raw form, so researchers can perform these steps on their own, thus resulting in datasets that are more consistent and of higher quality.
A key challenge to reusing publicly available datasets is working with metadata. Nearly all gene-expression data from human patients have clinical and/or demographic data associated with them. However, researchers use different terms, abbreviations, and spellings (including misspellings) to describe the same clinical and demographic phenomena. For example, we have seen variable names as diverse as "drfs_even_time_years", "distal_recurrence_time", "distant_recurrence_free_survival_mo", "dmfs_time", "t_tdm", and "time" to describe the length of time a patient survived without a metastatic event after diagnosis. Some variable names are short and/or vague (e.g., "cell_type", "chemo", "nhg", "pt"). Sometimes measurement units are included in names of numeric variables (e.g., "age_years", "size_mm", "mfsdel_month"), but other times they are not. Some variable names incorrectly or inadequately reflect the semantics of those variables. For example, "ethnicity" sometimes refers to population ancestral groups (often referred to as "race"); "gender" sometimes refers to a person's biological sex assigned at birth. Clarifying these semantic inconsistencies is especially important when researchers wish to identify datasets that have particular metadata variables and/or to combine multiple datasets.
Formatting differences also cause challenges. In GEO, data are provided in the "series matrix" and SOFT formats 13 , whereas ArrayExpress uses standardized formats like MAGE-TAB 14 . Software tools are available to download and convert the data into formats and data structures that are not specific to gene-expression data 15,16 ; however, other challenges remain. Researchers sometimes store multiple values in a single cell (e.g., "female|tamoxifen|58"), use key/value pairs within cells (e.g., "sex: female", "treatment=tamoxifen", "age=58 years"), or include missing values (using descriptors such as "NA", "n/a", or "?") 17 . Addressing these problems requires time and expertise. To ensure semantic consistency, biomedical domain experts must manually review, interpret, and standardize metadata. To deal with formatting problems, researchers often must write custom computer code. These barriers have caused these valuable data resources to be underutilized and have slowed the research enterprise.
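To illustrate the formatting problems described above, the sketch below shows how one might split a multi-value metadata cell into key/value pairs and normalize missing-value markers. This is a minimal Python illustration only (our actual curation scripts used bash and R); the function name and the list of missing-value markers are assumptions, not an exhaustive specification.

```python
import re

# Assumed, non-exhaustive markers that datasets use for missing values.
MISSING_MARKERS = {"na", "n/a", "?", ""}

def parse_metadata_cell(cell):
    """Split a multi-value metadata cell into key/value pairs,
    handling both 'key: value' and 'key=value' styles.
    Bare values with no key are stored under the key None."""
    parsed = {}
    for part in cell.split("|"):
        match = re.match(r"^\s*([^:=|]+?)\s*[:=]\s*(.*?)\s*$", part)
        if match:
            key, value = match.groups()
        else:
            key, value = None, part.strip()
        parsed[key] = None if value.lower() in MISSING_MARKERS else value
    return parsed
```

Bare values (e.g., "female|tamoxifen|58") carry no keys at all, so cells like these still require dataset-specific interpretation by a human curator.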
We have addressed these challenges for breast cancer by finding and curating 102 gene-expression datasets from the public domain. These datasets represent 17,151 patient samples. For each dataset, we performed quality-control checks, uniformly processed the expression data, mapped the expression data to multiple gene-reference databases, and standardized the metadata variables against the National Cancer Institute Thesaurus (NCIT) 18 , a popular reference standard with unique codes and synonyms for biomedical terms. Additionally, a practicing medical oncologist reviewed ambiguous term mappings. We have deposited these datasets, along with the code we used to process them, in an open-access repository, thus making it easy for other researchers to find, access, integrate, and reuse the data, as well as to reproduce our process. This repository is not tied to any programming language; the data can be imported using a variety of tools, including Microsoft Excel and software libraries available for commonly used programming languages.

Data finding
Initially, we searched for datasets within GEO and ArrayExpress. Our initial research interest was the interaction between race and HER2 amplification status among breast-cancer patients, and this guided our choice of search parameters. For GEO, our search parameters were ("Breast Cancer"[All Fields] AND "Her2"[All Fields] AND "race"[All Fields]). We selected the study types as "Expression profiling by array" or "Expression profiling by high-throughput sequencing". We individually screened the available datasets for relevance and duplicates.
In ArrayExpress, our search parameters were "Breast Cancer" AND "Her2" AND "race".
Next, we searched within PubMed for articles that matched the following criteria: "Breast Cancer" AND "Her2" AND "race". These criteria resulted in a limited number of datasets, so we expanded our searches in PubMed to breast cancer more broadly. We also searched for datasets using the Google search engine. As part of this process, we identified two existing compendia of curated breast-cancer datasets: the MetaGxBreast 19 and curatedBreastData 20 packages. Across the identified datasets, we found a mix of data-generation platforms, including Affymetrix, Agilent, and Illumina microarrays, custom microarrays, and Illumina-based RNA sequencing.
Because the Affymetrix platforms were the most common-and well-recognized normalization and annotation packages could be applied to these platforms-we focused on these. We also included datasets that had been processed using RNA sequencing technology because of its advantages over microarray technologies 27 . With one exception, we excluded datasets that used Agilent microarrays, Illumina microarrays, or uncommon microarray platforms, to reduce heterogeneity. The exception was the METABRIC project, which used Illumina microarrays; we included this dataset in our compendium because it is large (1,980 samples) and provides extensive metadata. We excluded Affymetrix datasets for which raw data were unavailable or that were from cell lines or single cells. Raw data were not publicly available for the METABRIC dataset or the RNA-sequencing datasets; we used the processed versions of these datasets. This resulted in 102 datasets with a total of 17,151 samples.
Most of the samples are from breast tumors; however, a small number are from normal breast tissue as indicated in the metadata.

Data curation
We automated all steps of the curation process using custom computer scripts. Some scripts used the bash interpreter (Linux operating system); other scripts used the R statistical software 28 . We configured the scripts to be executed within Docker containers so that our curation process is reproducible and portable across computer systems 29 . We obtained the GEO data using the GEOquery 15 package. For the other sources, our scripts downloaded the data directly from the repositories. The data comprised both gene-expression data and metadata, which we processed to achieve consistency across all datasets.
Our workflow was divided into six major steps described below.
In step one, we downloaded the metadata for the Affymetrix datasets, cleaned up column names by removing non-informative suffixes (e.g., "_ch1"), and removed columns that we determined would not add value in downstream analyses. For example, some columns had only one unique value (e.g., "extraction protocol", "last_update_date"), while others had all unique values (e.g., "sample_name"). Because this process was subjective, and other researchers might find these columns useful, we retained the unfiltered metadata as a separate part of the outputs. Upon reviewing the metadata, we made additional customizations on a case-by-case basis. When a single dataset was processed using more than one platform, we separated the samples into multiple data files according to their respective platforms. Finally, we summarized the metadata variables from each dataset. These summaries indicated the unique values for each variable, the number of unique values, and the minimum, maximum, and mean (for numeric variables).
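The column-filtering heuristic in step one can be sketched as follows. This Python sketch is for illustration only (our pipeline used bash and R scripts), and the function name and data layout are hypothetical; as noted above, the unfiltered metadata should be retained separately because the heuristic is subjective.

```python
def clean_metadata_columns(rows):
    """Given metadata as a list of dicts (one per sample), strip the
    non-informative '_ch1' suffix from column names and drop columns
    whose values are all identical or all unique."""
    n_samples = len(rows)
    kept = []
    for col in rows[0]:
        n_unique = len({row[col] for row in rows})
        # Keep columns that are neither constant nor sample identifiers.
        if 1 < n_unique < n_samples:
            kept.append(col)
    return [{c.removesuffix("_ch1"): row[c] for c in kept} for row in rows]
```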
In step two, we downloaded raw CEL files for the Affymetrix datasets and normalized them using the SCAN.UPC 30 package, yielding zero-centered, log2-transformed intensities. This normalization step made use of custom chip definition files (CDF) for Affymetrix GeneChips using updated probe-set definitions from the Brainarray Microarray consortium 31 . Custom CDFs are necessary because the manufacturer's probe-set definitions have become outdated and are sometimes incorrectly assigned to genes or to multiple genes. Some GEO datasets had a mixture of Affymetrix and Agilent data. Our scripts selected the Affymetrix samples only.
In step three, we performed quality-control measures on the expression data. We used the DoppelgangR 32 package to identify Affymetrix samples that may have been duplicated.
Duplicates can occur when samples are inadvertently mislabeled or when studies include samples from previously published datasets. After manually reviewing the sample pairs that had been flagged as duplicates by DoppelgangR, we determined that an expression-similarity threshold of 0.99 may be appropriate for excluding samples; this score errs on the side of only excluding samples that were most likely to be duplicates. The sample pairs that exceed this threshold are indicated in our output files. However, we retained expression data for all samples and leave it to data users to choose if and when to exclude these samples.
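The duplicate check described above can be illustrated with a simplified sketch: compute a pairwise expression-similarity score and flag pairs exceeding the 0.99 cutoff. This is only an illustration of the idea using Pearson correlation; DoppelgangR's actual procedure is more elaborate, and the function names here are hypothetical.

```python
from itertools import combinations
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two expression profiles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var = (sum((a - mx) ** 2 for a in x) *
           sum((b - my) ** 2 for b in y)) ** 0.5
    return cov / var

def flag_duplicate_pairs(samples, threshold=0.99):
    """Flag sample pairs whose profiles correlate above the threshold.
    `samples` maps sample IDs to expression vectors."""
    return [(a, b)
            for (a, xa), (b, xb) in combinations(sorted(samples.items()), 2)
            if pearson(xa, xb) > threshold]
```

As described above, flagged pairs would be reported rather than removed, leaving the exclusion decision to data users.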
Next, we evaluated each dataset using the IQRray 33 method, which identifies poor-quality arrays by ranking all probe intensities in an array, computing the average rank for each probe set, and returning a quality score for each sample in the dataset. Based on platform-specific quality thresholds from the method authors' prior analysis, we excluded samples that did not meet these thresholds. The IQRray method is specific to samples processed on Affymetrix platforms.
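The core intuition behind this rank-based scoring can be shown with a toy computation: rank every probe intensity on the array, average the ranks within each probe set, and summarize how strongly probe sets separate. The sketch below is a loose approximation for intuition only; it does not reproduce the published IQRray method or its platform-specific thresholds.

```python
from statistics import mean

def toy_array_quality_score(intensities, probe_sets):
    """Rank all probe intensities, average the ranks within each probe
    set, and return the interquartile range of those averages (loosely,
    better arrays show stronger probe-set separation)."""
    order = sorted(range(len(intensities)), key=lambda i: intensities[i])
    ranks = [0] * len(intensities)
    for r, i in enumerate(order):
        ranks[i] = r + 1
    set_means = sorted(mean(ranks[i] for i in idx)
                       for idx in probe_sets.values())
    q1 = set_means[len(set_means) // 4]        # crude quartile indices,
    q3 = set_means[(3 * len(set_means)) // 4]  # adequate for intuition
    return q3 - q1
```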
In steps four and five, we processed the metadata and expression data, respectively, for non-Affymetrix datasets. We created a separate script for each dataset because they were heterogeneous. For instance, the files were downloaded from different sources, and the variables we excluded differed widely in each of these datasets. We also had to separate the TCGA dataset into tumor-only and normal-only datasets. Otherwise, the tasks performed in these steps paralleled what we did in steps one and two.
In step six, we standardized the names of individual samples, ensuring consistent capitalization and removing suffixes. We also removed the control-probeset values from the Affymetrix datasets. Next, we filtered out the poor-quality arrays identified by the IQRray method.
To ensure consistency, we matched the samples in the expression data to samples in the metadata; thus the samples that were filtered by the IQRray method were also removed from the metadata. We also excluded any gene-expression samples for which metadata were missing. Finally, because some datasets used Entrez Gene identifiers while others used HUGO Gene Nomenclature Committee (HGNC) 34 identifiers, we used the biomaRt 35,36 package from Bioconductor 37 to map gene identifiers across Entrez Gene, HGNC, and Ensembl 38 . When a single gene symbol was associated with multiple Ensembl identifiers, we retained genes from the main chromosomes and excluded gene identifiers from haplotypic regions. While this removed most duplicated genes, some remained. Through manual inspection, we found that the first identifier (sorted alphanumerically) was the most appropriate identifier, so we retained those. We determined that these identifiers were most appropriate by looking up each gene in The Human Gene Database 39 . This database has a section for each gene named "External Ids" where different identifiers are listed. We found that the first identifiers matched those listed in this section. We also excluded data for which any gene identifier was missing.
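The tie-breaking rule we applied to duplicated gene symbols can be sketched as below. The function name is hypothetical, and the gene symbols and Ensembl identifiers in the test data are placeholders rather than real annotations.

```python
def resolve_duplicate_ids(symbol_to_ensembl):
    """When a gene symbol maps to multiple Ensembl identifiers, keep
    the first identifier after alphanumeric sorting (the rule we
    arrived at by manual inspection, as described above)."""
    return {symbol: sorted(ids)[0]
            for symbol, ids in symbol_to_ensembl.items() if ids}
```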

Thesaurus mapping
We summarized the metadata variables into a single spreadsheet. The spreadsheet has two columns: "NCIT_Term_field_name" and "NCIT_Term_values," which contain the NCIT preferred name(s) and unique identifier(s) that we selected for each metadata variable name and its associated values, respectively. The former typically reflects the biomedical context of the variable; the latter typically indicates the manner in which the values were represented (e.g., units of measurement). In many instances, a single NCIT term was insufficient to convey the semantic meaning, so we used multiple NCIT terms. For example, for the variable pr_status_by_ihc, we used NCIT terms "Progesterone Receptor Status" and "Immunohistochemical Test Result". In most cases, we retained the original metadata variable names; however, in a few cases, the names were particularly generic (e.g., title or description), so we renamed them. In other cases, a single column had multiple values per cell, so we split the column and named the new columns appropriately.
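One envisioned use of these mappings is locating datasets whose metadata were annotated with a given NCIT term. A minimal sketch follows; the nested-dictionary layout and the dataset accessions in the test are assumptions for illustration (the released mappings live in a spreadsheet, not this structure).

```python
def datasets_with_ncit_term(annotations, term):
    """Return the IDs of datasets in which any metadata variable was
    mapped to the given NCIT preferred term. `annotations` maps
    dataset ID -> variable name -> list of NCIT terms."""
    return sorted(dataset_id
                  for dataset_id, variables in annotations.items()
                  if any(term in terms for terms in variables.values()))
```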

Availability of data and materials
All gene expression data, metadata, and computer scripts generated during the current study are in a publicly accessible Open Science Framework (OSF) 41 repository (accessible from https://osf.io/eky3p). The list of datasets is summarized in Table 1.

Discussion
Huge sums are budgeted each year by the National Institutes of Health and other grant-awarding agencies to generate gene-expression data and accompanying metadata. Additionally, researchers put much time and effort into generating the original data. Placing the data in the public domain is invaluable for research, but that alone is not enough. To move the field forward, we need compendia of datasets that follow common themes. By reprocessing, curating, and sharing these breast-cancer datasets, we save other researchers the time they would have spent completing the same steps, thus reducing duplicated effort. These efforts also make the data more accessible to researchers who lack the expertise to complete the reprocessing and curation steps themselves. Accordingly, we have sought to increase the public benefit of prior investments. Much of the metadata in public repositories is uncurated. We have taken considerable time to clean and manually curate the metadata variables in the datasets we reprocessed because it is difficult for many researchers to decipher the semantic meaning of a variable solely from its name or contents. Human review is essential (and time-consuming). We envision that researchers will search our compendium for datasets associated with given NCIT term(s) and perform diverse types of meta-analyses using those data. Supplementary File 1 contains these NCIT term mappings.
To ensure reproducibility, we tested our scripts using an iterative process. We configured the scripts to be executed within Docker containers so that others can re-execute them on different operating systems and obtain the same outputs. Finally, our output files are simple text files and thus are programming-language agnostic. Other research groups have created valuable packages such as MetaGxBreast 19 and curatedBreastData 20 . These packages are accessible using the R programming language and are implemented within the Bioconductor framework, which is used heavily in the biomedical community 43 ; in contrast, the outputs from our scripts can be accessed on any platform.
Although our compendium consists of many datasets relevant to a common biomedical theme, care must be taken when combining datasets. Systematic differences between data-generation platforms, laboratory equipment, and environmental conditions can bias analytical results 44 . We did not adjust for these differences but rather leave it to consumers of the data to identify scenarios and techniques for adjusting the data. Analyses that combine inferences across datasets-rather than combining data directly-may be less prone to such biases.
We mapped the metadata to NCIT terms but did not standardize the field names or values in the text files. Therefore, when combining datasets, researchers may wish to perform this step. For instance, one dataset might have a column that indicates estrogen-receptor status using values of "1" or "0", whereas another dataset might represent the same information using values of "positive" and "negative" or "P" and "N". To provide insight into the frequency with which biomedical concepts are represented in the metadata, we summarized the NCIT terms mapped to the metadata fields; summaries for the NCIT terms used to describe field names can be found in Supplementary File 2. In addition to aiding breast-cancer researchers in identifying datasets and samples associated with research topics of interest, the ontology mappings can help biomedical-semantics researchers develop and test algorithms that automate the ontology-mapping process. Although we were careful to ensure consistency when annotating the metadata variables, this process is subjective in nature. In many cases, several different NCIT terms could have been used to represent a given biomedical concept.
In nearly every case, more generic (parent) or specific (child) terms could have been used as alternatives. The NCI Term Browser provides mappings between terms and links to other ontologies, thus making it possible for researchers to use our annotations as a starting point to find alternative terms.
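As an example of the value-standardization step that we leave to data consumers, the sketch below harmonizes heterogeneous estrogen-receptor status encodings onto a single vocabulary. The set of recognized encodings is an assumption for illustration and would need to be verified against each dataset before use.

```python
# Assumed encodings; verify against each dataset before use.
ER_POSITIVE = {"1", "positive", "pos", "p", "+"}
ER_NEGATIVE = {"0", "negative", "neg", "n", "-"}

def harmonize_er_status(value):
    """Map heterogeneous estrogen-receptor status values onto
    'positive'/'negative'; None means unmapped (review manually)."""
    v = str(value).strip().lower()
    if v in ER_POSITIVE:
        return "positive"
    if v in ER_NEGATIVE:
        return "negative"
    return None
```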

Conclusions
This standardized resource has the potential to reduce the time that breast-cancer researchers spend finding and preparing publicly available gene-expression data, allowing them to focus more on answering biomedical questions. We have previously 9 surveyed the literature on types of analyses that are commonly performed using gene-expression data from breast-cancer patients. Examples include predicting patient outcomes, defining breast-cancer subtypes, and identifying differentially expressed genes. We believe this resource will make these types of studies easier to conduct and give them greater statistical power.

Supplementary Data
Supplementary File 1 is a summary of the metadata in which the variables have been mapped to the NCIT terms.