ABSTRACT
Prevalence of allergies has reached ~50% of industrialized populations and with children under ten being the most susceptible. However, the combination of the complexity of atopic allergy susceptibility/development and environmental factors has made identification of gene biomarkers challenging. The amount of publicly accessible transcriptomic data presents an unprecedented opportunity for mechanistic discoveries and validation of complex disease signatures across studies. However, this necessitates structured methodologies and visual tools for the interpretation of results. Here, we present a curated collection of transcriptomic datasets relevant to immunoglobin E (IgE)-mediated atopic diseases (ranging from allergies to primary immunodeficiencies). 30 datasets from the Gene Expression Omnibus (GEO), encompassing 1761 transcriptome profiles, were made available on the Gene Expression Browser (GXB), an online and open-source web application that allows for the query, visualization, and annotation of metadata. The thematic compositions, disease categories, sample number, and platforms of the collection are described. Ranked gene lists and sample grouping are used to facilitate data visualization/interpretation and are available online via GXB (http://ige.gxbsidra.org/dm3/geneBrowser/list). Dataset validation using associated publications showed good concordance in GXB gene expression trend and fold-change.
Database URL: http://ige.gxbsidra.org/dm3/geneBrowser/list
INTRODUCTION
Allergic disease is highly prevalent and currently reaches ~50% of the populations in industrialized nations (1). Although the generation of allergic responses is well-understood, the early sensitization steps and factors contributing to the development of immunoglobin E (IgE)-mediated diseases remain unclear. IgE is the major mediator of atopic response in humans, whereas atopy represents the predisposition to become over IgE-sensitized to allergens. However, not all encounters with a potential allergen will lead to sensitization. Similarly, not all sensitizations will result in a symptomatic allergic response even in atopic individuals.
The effect of IgE spans across multiple systems. In the circulation system, IgE increases flow and permeability of the blood vessels, fluid and protein in tissues, as well as flow to the lymph nodes. In the airway, IgE decreases air conduct diameter, increases mucus congestion, and can induce blockage. In the gastro-intestinal (GI) tract, IgE increases fluid secretion, peristalsis, and expulsion-diarrhea. The immunoglobin has also been suggested to play a role in the defense against parasite infection and as a general gate keeper for any foreign materials entering the body (2,3). Ultimately, IgE is involved in the normal spectrum of reaction to expulse foreign material from the body; hence, protecting by elimination. Nevertheless, over sensitization, which is developed by unknown mechanism(s), can lead to imbalance and pathology in affected individuals.
Two main groups of immune signals initiate the production of IgE in response to an antigen: 1) the signals that drive the differentiation of CD4 naive T cell to T helper type 2 (Th2) cells and 2) the cytokines and co-stimulatory molecules secreted by Th2 cells, which subsequently promote T follicular helper cells-induced immunoglobulin B cells class switch towards IgE production. Antigen characteristics, such as concentration and localization of the encounter (i.e. tissue, mucosa, circulation), can also affect Th2 cell induction. IL-4, IL-13, and STAT6 are key mediators of Th2 responses and IgM class switch to IgE. IL-4 secretion and mast cell CD40 surface expression also contribute to the IgE class switch to IgE at the site of allergic reaction.
Systemic levels of IgE alone is not a sufficient indicator for allergy risk (4). Peripheral blood IgE level can increase upon sensitization, but not reliable enough for deducing a diagnosis of allergy or allergen type (5). Concentration, binding strength or affinity, specificity, and portion of specific IgE to total IgE are all factors in translating a humoral IgE response into a clinical symptom (6). However, genetic component exists in allergic disease. Studies have demonstrated strong heritable components of allergic diseases and atopy, estimated at 33%-76% (7,8). Genetic components of food allergy and asthma have also been reviewed (9–12). Although single susceptibility gene have been identified for certain allergies (13–16), most genome-wide association studies reported multiple loci of susceptibility or gene regions to a wide range of allergy development (17–19). Environmental factors further complicate allergic susceptibility - the hygiene hypothesis of allergic susceptibility has received much attention in the last few decades (20). Other exposure-related factors (e.g. diet, pollution, tobacco smoke) are also likely to have contributed to the increasing susceptibility to allergic diseases in developed nations (1).
IgE may also be implicated in some primary immunodeficiency diseases (PIDs). For CD40 ligand and CD40 deficiency, circulating IgE+ plasma cells are absent (21). For others, IgE plasma levels are elevated: DOCK8 deficiency, AD-HIES Job’s syndrome, Comel-Netherton syndrome, PGM3 deficiency, IPEX (immune dysregulation, polyendocrinopathy, enteropathy X-linked), and Tyk2 deficiency. Large phenotypic heterogeneity is observed among these PIDs; however, the underlying mechanisms is not completely understood. This knowledge gap mirrors our limited understanding of the involved gene products mediating phenotypes atopic diseases.
There is a crucial need to pursue genetic markers of an atopic constitution amidst growing concern over the increasing abundance of IgE atopy and that current treatments remain mostly palliative. As both environmental interaction and genetic predisposition forms the determinant of allergy development, extrapolating potential gene markers of allergy atopy from a broad range of studies with associated clinical data is highly relevant, particularly given the high variations of the disease clinical phenotypes. With the advances in high through-put platforms for transcriptomic analyses, a large number of gene expression datasets is regularly deposited on publicly accessible repositories such as the NCBI Gene Expression Omnibus (GEO). GEO currently hosts over 90 000 datasets comprising over 2 million samples (https://www.ncbi.nlm.nih.gov/geo/) and presents an enormous opportunity for data mining across multiple studies.
The web-based Gene Expression Browser (GXB) is an open-source interface that allows for custom compilation of selected datasets (i.e. of interest to users) and facilitates visualization of gene expression data. GXB has been previously described (22) and used to generate a number of Data Notes (23–28). As well, GXB is a useful tool for novel gene/function discovery (29) and in system re-analysis approaches (30).
A search strategy was implemented to identify GEO datasets relevant to IgE-related atopic disease, including PIDs, and uploaded those datasets to the GXB platform online. The associated study metadata, such as the detailed gene information, relevant literature, study design, and sample information, were also uploaded to facilitate in-depth interpretation. In creating this collection of datasets, we aimed provide a resource which will facilitate risk prediction and interception of allergic diseases. The datasets were retrieved from publicly available GEO series and selected by relevance to IgE and atopic diseases and filtered by analysis platform and species. As previously demonstrated, data mining and re-analysis/re-interpretation of large and publicly available dataset is a promising avenue (31) to elucidate complicated diseases such as IgE-related atopy.
METHODS
Justification of data selection and filter
The focus of the GEO dataset selection was primarily on whether or not the dataset involved 1) IgE production or function or 2) if the pathology studied implicated directly or indirectly IgE production or function (Table 1A). We ensured a broad search approach by combining the 7 independent search results (Table 1A) using the “Merge collection” function available on “My NCBI” (https://www.ncbi.nlm.nih.gov/myncbi/) (Table 1B) and by performing another search using an assembly of all the terms used in Table 1A (Table 1C). 203 potentially relevant datasets were identified from the initial query (Table 1D). The query results were then manually filtered to restrict datasets to human sample, expression profiling by microarrays, and relevance to IgE-related atopic diseases. The process involved inspecting the study description, design, and sample type for each dataset and resulted in 30 curated datasets. The following criteria were deemed very important: clear description of tissue type, comparisons between patient vs healthy or stimulated vs unstimulated samples, and disease category implicating IgE. Furthermore, datasets/studies that are indirectly relevant to IgE-mediated atopic disease, such as gene expression in B cells after tonsillectomy, were also included as they were deemed valuable to 1) the discovery of putative novel gene-disease association, 2) to improving our knowledge of adaptive immunity, and/or 3) to increasing our knowledge about factors affecting IgE production.
The web-based GXB platform: a visualization tool for gene expression data
GXB is a valuable tool for training researchers about the reductionist investigative approaches (see description of such program in (31)). The creation of a web-based GXB platform was previously described in detail (22). In brief, the GXB is a simple interactive interface designed for visualization of large quantities of heterogenous data (Suppl. Figure 2). The platform allows for customizable data plots with overlapping metadata information, changeable sample order, as well as generation of sharable mini-URLs that encapsulate information about the display settings in use. The dataset navigation page allows for quick identification of datasets of interest either through filtering using pre-define lists or via query terms. The user has access to multiple functionalities within the GXB graphic interface. In brief, the data-viewing interface enables interactive browsing and graphic representation of large-scale data. This interpretable format displays ranked gene lists and expression results. The interface also allows for user flexibility in terms of changing how the gene list is ranked, the method used for ranking, sample grouping (i.e. disease type), sample sorting (i.e. gender or age) and view type (i.e. bar or chart). The end user can browse through the datasets, format graph for a selected gene within a dataset, and export data (i.e. annotation, FC, signal, groups). The original GEO data and annotation are accessible from the GXB interface (Downloads tabs) as well as from their GEO page (links provided on Study tab). The associated source code and instructions are publicly available (https://github.com/BenaroyaResearch/gxbrowser)], as well as the necessary R scripts (https://github.com/BenaroyaResearch/gxrscripts).
Construction of the dataset collection on GXB
The selected datasets were downloaded from GEO in SOFT file format and uploaded (http://ige.gxbsidra.org/dm3/geneBrowser/list) via the GXB interface (accessed by navigating the Tools menu located in the top-right corner of the GXB webpage; Tools / Chips Loaded / Upload Expression Data). The SOFT files contain metadata and normalized signal intensity data, generated by methods indicated by the author(s). These SOFT files can be analyzed directly for differentially expressed genes; thus, no additional processing was required. Using the “Sample Set Annotation Tool” of the GXB interface, the datasets were annotated according to the information provided on GEO. The raw signal data type of the dataset (ex. raw signal, log2 transformed and etc.,) are also specified for each GEO entry. The default data display is in linear scale (see further details below in “Presentation of datasets”). When necessary, the sample annotation file (which is part of the SOFT file) were edited in order to add group information; as it was important to identify groups in order to compute fold change (FC). For each dataset, individual samples were grouped based on relevant study variables. Three datasets (GSE19190, GSE75603, and GSE8507) were split via GXB’s “Sample Set Annotation Tool/Group Sets” interface to provide a more meaningful group comparison (these datasets share the same GSE number in Table 2). Genes were then ranked based on FCs of the specified two-group comparison. All the information annotated and presented on GXB is assembled in a SQL database (Data Citation 1).
Data availability
The curated datasets collected that have been described in this data note were assembled from the public repository NCBI GEO website: http://www.ncbi.nlm.nih.gov/gds/. In this study, we cited each dataset GEO accession number, and raw signal and annotation files are made available for download from the GXB web-application (http://ige.gxbsidra.org/dm3/geneBrowser/list).
RESULTS
Description of datasets
After applying the filtering strategy as previously described, we curated 30 datasets encompassing 1761 transcriptome profiles relevant to IgE-related atopic diseases. Detailed information on each dataset is presented in Table 2 and the summary of the data collection is presented in an aggregation of pie charts (Figure 1). The data collection includes a wide range of studies, sample types, as well as diseases. In total, 12 different microarray platforms are represented, with the majority being Affymetrix Human Array chips (various version). 17 datasets were generated from in vitro studies and 13 from ex vivo studies. Ten sample types are represented, with the most abundant being PBMC (n = 8) and airway epithelial cells (n = 5). Sample size of each dataset ranges from 5 to 628, with most studies having 10-50 samples. The dataset collection covered 7 main disease categories, including allergy, asthma, healthy responses, hyper IgE syndrome (HIES), dermatitis, atopic irritable bowel syndrome (IBS), and Urticaria. In the majority of the studies, comparisons are made between patients and health controls (n = 17), followed by stimulation (n = 6) and time course (n = 5). The frequency of the terms retrieved from the associated GEO descriptions and those derived from the MeSH/keywords of the published articles are visualized through word clouds presented in Figure 2.
Presentation of datasets
On the graphic interface of GXB, genes expression values are displayed in linear scale. The original signal data type can be found under the Info tab of individual datasets. The associated metadata are available under the Samples tab and this information can be used for graphical overlaying (via “Overlays” dropdown menu). The FCs are also displayed in linear scale (see Supplemental Information for a detailed example of FC calculation). For FC analysis, each subject/sample are grouped according to the experimental design (experimental variable vs controls) and/or, if available, as per the corresponding publications. It is also possible to rank the genes based on expression difference by selecting “Advanced” in the “Rank Lists” dropdown menu of the graphic interface. This display option can be more robust than FC when low expression is observed in one group. It is important to note that integration and re-analysis of the datasets is not the intent of this collection. Therefore, a more meaningful use of this collection lies in the comparison of FC expression for a gene across multiple relevant datasets (i.e. reductionist approach (31).
Dataset validation
Quality assessment of the datasets was performed by looking for key gene expression, i.e. marker genes, and highly differentiated genes as indicated in the associated publications. Table 2 includes the type of validation, the genes used, as well as the strength of the validation and the associated comments. Certain datasets were split in two to facilitate comparisons (GSE19190, GSE75603, and GSE8507); in these cases, validation is indicated for one of the two datasets as indicated in Table 2. Data validation was achieved for most datasets except for 6, due to having no publications (GSE14842, GSE37157, and GSE41861) or incomparable units (i.e. z-score) and/or absence of FC information (GSE64639, GSE54522 and GSE44956). Trend validations were done on 4 datasets that had no linked publications (GSE13619 and GSE99948) or had incomparable units to GXB (GSE70760 and GSE72542). But when possible, literature values were used for additional validation of the trend (i.e., GSE92688and GSE99948). Examples of the validation results are presented in Figure 3.
DISCUSSION
Potential application of the dataset collection
To demonstrate the potential use of the dataset collection, gene expression profile of different tissues for house dust mite (HDM) allergy were compared. Nasal epithelial (GSE19190) and PBMC-derived CD4 T cells (GSE70760) gene expression profiles (detailed in Table 2) were compared. In both dataset, genes associated with the Th2 pathway/axis were increased in HDM-sensitized patients compared against healthy controls (32,33); illustrating that Th2 response may results in the symptomatic phase. This is further supported by the IgE levels and sIgG/sIgE ratio in the same study (33).
SERPINB2, a gene coding for the inhibitor of plasminogen activator PAI-2, is markedly upregulated in patients in both datasets, hence tissues (Suppl. Figure 1A from GSE70760 and Figure 1B from GSE19190). This is consistent with previous association of asthma severity and biomarker panel including SERPINB2 from PBMC (34) and correlation of SERPINB2 expression in respiratory epithelial cells with atopic asthma severity (35). SERPINB2 has been reported to have a role in the interleukin-12-mediated signaling pathway; evidence from mice showed that SERPINB2 regulates IFNg production, causing down regulation of Th1 cytokines in macrophages (36).
The expression profiles of SERPINB2 in both tissues suggest an important role of the gene as a first line regulator of immune response, perhaps by preventing excessive Th1 response. Interestingly, IFNg was not decreased in PBMC-derived T cells, but IFNGR1 was, suggesting that in T cells, SERPINB2 may exert its role on the expression of the receptor rather than on IFNg itself. In Schroder et al (2010), stimulation of SERPINB2 −/− cells with antiCD40/IFNg resulted in greater Th1 cytokine production, thus supporting the idea that SERPINB2 affects IFNGR (36).
However, certain genes are found to be differently expressed between nasal epithelium or PBMC-derived T cells. An explanation is that the activation of Th1 immune response may be different in the target tissue and peripheral circulation. For example, IL1B, a Th1 promoting cytokine is found to be increased in CD4+ T cells in allergic patients (Suppl. Figure 1C, from GSE70760), but in the nasal epithelium, the gene is not upregulated in patients with severe case of allergic rhinitis (“uncontrol asthma”) compared against healthy control (Suppl. Figure 1D, from GSE19190). Further hypothesis-generating comparisons can be made by comparing the top differently expressed genes in both datasets as listed in Suppl. Table 1.
Conclusion
The dataset collection may be useful for exploring specific gene signatures in response to natural antigen/allergen exposure (i.e. allergic patients vs control) and delineate the major genetic drivers associated with increased level of IgE (i.e. cellular responses to specific antigen/allergen). Furthermore, comparative investigation of similarities and differences in expression of genes between datasets can highlight key mechanistic differences of immune signaling and/or provide insights for tissue-targeted intervention. In compiling the present dataset collection, we hope to offer a resource that may improve accessibility of public omics data to researchers in this field.
DATA CITATION
1. IgE_GXB_Database Figshare http://doi.org/10.6084/m9.figshare.7176851 (2018).
AUTHOR CONTRIBUTIONS
MG and SSYH contributed to conceptualization. MG, SSYH, and FA contributed to data curation and validation. MG and SSYH led, and FA supported, investigation and visualization. MG and SSYH performed formal analyses. SB and MT contributed to the maintenance of software. MG contributed writing – original draft, methodology, and project administration. MG and SSYH led, and FA, SB, MT, DC supported, writing – review & editing. MG lead and SSYH supported supervision. DC contributed funding acquisition and resources. The contributor’s roles listed above (underlined) follow the Contributor Roles Taxonomy (CRediT) described in Nature Communication 2014 (37) and managed by The Consortia Advancing Standards in Research Administration Information (CASRAI) (https://casrai.org/credit/).
COMPETING INTERESTS
No competing interests were disclosed.
FUNDING
All authors listed on this publication are affiliated with Sidra Medicine and the work is supported by the Qatar Foundation and a Qatar National Research Fund grant NPRP10-0205-170348.
ACKNOWLEDGEMENT
We would like to thank all the investigators who decided to make their datasets publicly available by depositing them into the NCBI GEO repository.