The Immune Signatures data resource, a compendium of systems vaccinology datasets

Vaccines are among the most cost-effective public health interventions for preventing infection-induced morbidity and mortality, yet much remains to be learned regarding the mechanisms by which vaccines protect. Systems immunology combines traditional immunology with modern ‘omic profiling techniques and computational modeling to promote rapid and transformative advances in vaccinology and vaccine discovery. The NIH/NIAID Human Immunology Project Consortium (HIPC) has leveraged systems immunology approaches to identify molecular signatures associated with the immunogenicity of many vaccines. However, comparative analyses have been limited by the distributed nature of some data, potential batch effects across studies, and the absence of multiple relevant studies from non-HIPC groups in ImmPort. To support comparative analyses across different vaccines, we have created the Immune Signatures Data Resource, a compendium of standardized systems vaccinology datasets. This data resource is available through ImmuneSpace, along with code to reproduce the processing and batch normalization starting from the underlying study data in ImmPort and the Gene Expression Omnibus (GEO). The current release comprises 1405 participants from 53 cohorts profiling the response to 24 different vaccines. This novel systems vaccinology data release represents a valuable resource for comparative and meta-analyses that will accelerate our understanding of mechanisms underlying vaccine responses.


Background & Summary
Vaccines, one of humanity's greatest public health achievements, save millions of lives every year by preventing infectious diseases 1,2 . Despite their widespread use and efficacy, much remains to be learned regarding their molecular mechanisms of action. This is true both for vaccines against pandemic infections such as influenza 3 , and SARS-coronavirus-2 4 , as well as for infections for which there are currently no authorized or approved vaccines such as HIV [5][6][7] . Elucidating the commonalities and differences in the immune responses induced by different vaccines and their association with protective antibody responses will provide deeper insight and a framework for the evidence-based design of better vaccines or vaccination strategies. Recent technologies have provided tools to probe the immune response to vaccination and integrate hierarchical levels of the biological system 8 . Alluded to as systems vaccinology 9 , this new application of systems biology tools provides new insights into molecular mechanisms of vaccine-induced immunogenicity and protection [10][11][12][13] .
The National Institute of Allergy and Infectious Diseases (NIAID) established a multi-institutional consortium, Human Immunology Project Consortium (HIPC) 14,15 , to characterize the immune system in diverse populations in response to a stimulus, such as vaccination, using high-dimensional 'omic platforms and modern computational tools 14 . Since the inception of the consortium in 2010, members of HIPC have published >500 articles, including many that describe molecular signatures associated with vaccine-induced protection. These studies include molecular signatures that predict the immunogenicity of vaccination against yellow fever [16][17][18][19] , seasonal influenza in healthy young adults, elderly [20][21][22][23][24] , and children 25 , shingles 26,27 , dengue 28,29 , malaria 30,31 , and meta-analyses of common signatures across different vaccines 32,33 . These molecular signatures resulted from large-scale data analysis using high-throughput systems biology approaches coupled with detailed clinical phenotyping in well-characterized human cohorts.
Predicting immunogenicity from 'omic signatures remains challenging, prompting methodological innovation to advance the field towards delivering on the promises of precision vaccination [34][35][36] . The factors that contribute to robust vaccination responses are highly complex and span multiple biological scales. The vast collection of high-dimensional profiling datasets poses significant challenges for comparative analysis of these studies, including biological variability as well as data challenges such as volume, technical noise, and diverse sample processing pipelines. Data integration of cellular and molecular signatures to predict vaccine responses requires harmonization and normalization of data from multiple sources 37 . The generation of big data poses simultaneous challenges and opportunities with the potential of contributing to precision medicine. The biological interpretation of the resulting molecular features correlated with robust responses is another key factor. Understanding how effective vaccines stimulate protective immune responses, and how these mechanisms may differ between vaccine types and targeted pathogens remains a substantial challenge for the field. Moreover, the systems vaccinology field has been limited by a lack of a formal framework to standardize immune signatures gathered from diverse studies, creating a bottleneck for comparative analysis. To address these challenges, and in support of advances in systems vaccinology by the HIPC project and the broader scientific community, we present the creation of the Immune Signatures Data Resource, a compendium of systems vaccinology studies that enables standardized comparative analysis to identify molecular signatures that correlate with the outcomes of vaccinations.
The current release of the Immune Signatures Data Resource consists of 4795 transcriptomic samples from 1405 participants curated from 30 ImmPort studies (16 from HIPC-related studies, 14 non-HIPC studies) (Fig. 2, Table 1). The transcriptomic profiling dataset is derived from 53 cohorts of 820 young adults (18-49 years old) and 585 (≥50 years old) older adult samples. The data resource covers 24 vaccines targeting 11 pathogens and 6 vaccine types (Figs. 1b, 4a, Table 2), thus creating a critical mass of data that will serve as a valuable resource for the broader scientific community. Additionally, data assembly and integration of these data set enables derivation of comparable signatures for each study for comparative analysis of the underlying data.

Methods
Database background information and structure. Compatibility with immport and immunespace, the central databases of the human immunology project consortium. Given the exponential growth of the number of datasets of multiple modalities, an urgent need emerged for data sharing across the broader scientific community. The HIPC implements the NIH Data Sharing policy to promote the principles of Findability, Accessibility, Interoperability, and Reusability (FAIR) via ImmPort, created under the National Institute of Allergy and Infectious Diseases Division of Allergy, Immunology, and Transplantation (NIAID-DAIT). ImmPort (ImmPort. org) is an open repository of participant-level large-scale human immunology data designed to aid scientists with data standards and guidelines for efficient secondary analyses 38,39 . ImmPort facilitates data sharing of immunology studies creating a centralized knowledge base and resources, and serves as a central data repository for HIPC. ImmuneSpace 14,33 extends ImmPort, providing access to additional data (e.g., standardized gene expression matrices) and web-based R tools for data accession, analysis, and reporting. Studies in the Immune Signatures Data Resource are archived through the Shared Data Portal on ImmPort and ImmuneSpace repositories and may be updated over time. To provide a consistent data source for reproducible results, we also archived a static copy of the data as a "virtual study" in ImmuneSpace (Figs. 1a and 2).
Identification of vaccine study cohorts with transcriptomic profiles. Through a literature search conducted from July 2017 to January 2020 with terms including "Vaccine [AND] signatures", "Vaccine [AND gene expression", "Vaccine [AND] immune response [AND] gene expression", we identified target publications containing transcriptomics profiling datasets and vaccination responses. We found 16 HIPC-funded vaccinology studies in ImmPort with transcriptomics datasets generated with matching immune response outcomes and surveyed HIPC centers of their publications. We excluded non-human study cohorts, cohorts with B cell and T cell transcriptomics since most studies are PBMC or whole blood-derived, studies other than with existing HIPC studies, as well as published systems vaccinology papers and databases, were submitted to the ImmPort database. ImmuneSpace captures these datasets to create a combined compendium dataset. Quality control assessments of these data include array quality checks for microarray studies, batch correction, imputations for missing age and sex/y-chromosome presence information, and normalization per study. The combined virtual study included transcriptional profiles and antibody response measurements from 1405 participants across 53 cohorts, profiling the response to 24 different vaccines. Note that Hepatitis A/B (Twinrix) cohort also received Diphtheria/Tetanus toxoid (Td) and Cholera inactivated vaccine at the same time (Dukoral). (b) Demographic data included biological sex, race, vaccine, and number of participants.
www.nature.com/scientificdata www.nature.com/scientificdata/ intramuscular mode of vaccine route, studies with subjects beyond our target age range (<18), and those studies that lack vaccine stimulation. Notably, we have supplemented the HIPC data previously available in ImmPort by curating and submitting 14 additional human vaccination studies to ImmPort. For studies that were not in ImmPort/ImmuneSpace, we located the underlying data by surveying public transcriptome databases (e.g., Gene Expression Omnibus (GEO)) or reaching out to study authors to request data access, allowing us to submit to ImmPort on their behalf. These datasets were then made available via ImmuneSpace to be processed for standardization, preprocessing checks, and normalization. The standard analytical pipeline enables reproducibility and comparability of future studies to be correlated with publicly available immune response measurement. This process created the virtual study for the HIPC named the Immune Signatures Data Resource (Figs. 1a, 2).
Gene expression data processing pipeline. Data were read directly from ImmuneSpace using ImmuneSpaceR functions and subsequently preprocessed, quality controlled, and integrated using the following pipeline: Quality control of microarray experiments. The ArrayQualityMetrics R package 40 was used for quality control and assurance of all microarray experiments (Fig. 3a). Outlier detection was based on the following statistics: i)  Continued www.nature.com/scientificdata www.nature.com/scientificdata/ Mean absolute difference of M-values (log-ratios) of each pair of arrays, ii) the Kolmogorov-Smirnov statistic K a between each array's signal intensity distribution and the distribution of the pooled data and, iii) the Hoeffding's statistic D a on the joint distribution of A (average) and M values for each array. Using pre-specified criteria within an established public microarray data reuse pipeline 40 , we flagged for removal arrays that failed all three quality control statistics.
Preprocessing. Raw probe intensity data for Affymetrix studies were background-corrected and summarized using the RMA algorithm 41 while the function read.ilmn (limma R package) was used to read and background correct Illumina raw probe intensities. To integrate RNA-seq and microarray data, raw counts for RNA-seq data were transformed using the variance stabilizing transformation (VST). VST yields expression values that are  www.nature.com/scientificdata www.nature.com/scientificdata/ normalized across samples and by library size and approximately homoskedastic. After a proper log-2 transformation they can be analyzed as microarray data, using linear models in the limma framework. Expression data within each study were quantile normalized and log-transformed separately for each cohort/sample type.
Annotation. We annotated the manufacturing IDs (probes from microarray/Illumina) to their corresponding gene alias. Gene aliases were mapped to the recent gene symbols from the HUGO Gene Nomenclature Committee 42 [accessed Dec 23, 2020]. For the rare case where a gene alias mapped to more than one gene symbol, the mapping was resolved by the following: i) If a gene alias mapped to itself as a symbol, as well as other symbols, then it was mapped to itself; ii) if the gene alias mapped to multiple symbols that did not include itself, then the gene alias was dropped from the study. As a result, the raw gene expression matrix was reduced to 10086 HUGO gene aliases with known unique mapping.
Gene-based expression profiles. Expression data were summarized at the probe level (for microarray data) and gene-alias level (RNA-seq) to the canonical Gene-Symbol level. The probes/gene-aliases were summarized by selecting the probe or gene-alias with the highest average expression (mean of probes across all samples, take the highest mean) across all samples within the matrix (cohort and sample type).
Cross-Study normalization. One of the main assumptions in expression analysis is that differences in gene expression across conditions occur in a relatively small number of processes. As such, the distribution across conditions should be similar, and departures of these assumptions are corrected, for example, using quantile normalization. This procedure usually creates a target distribution using all samples available, but we observed dissimilar distributions in our collection stemming from various platforms used. Such differences lead to extensive distributions and introduce artifacts in the data (Fig. 3b,c). The target distribution was obtained from samples using Affymetrix platforms, resulting in a well-defined distribution, and each sample in our collection was quantile normalized to this target distribution. Before cross-study normalization, there were 35,725 representative gene symbols present. There were 25,639 genes removed after normalization, as these genes were not present in all the studies. This yielded a final expression matrix of 4795 samples from 1405 participants representing 10,086 genes (Fig. 2).
Determining and adjusting for technical confounders. We studied the primary sources of variation in the data, including the study effect (which also encompasses the impact of different expression platforms (RNA-seq, Affymetrix arrays, Illumina arrays, etc.), sample types (Whole blood, PBMC), as well as demographics. We conducted Principal Component Analysis (PCA) to visualize such associations in a bidimensional space of principal components (PCs) and applied Principal Variance Component Analysis (PVCA) 43 to quantify the amount of variability attributed to different experimental conditions. This approach models the multivariate distribution of the PCs computed for the PCA as a function of experimental factors and estimates the total variance explained by each factor via mixed-effect models. Since many studies included only one vaccine, temporal variations due to vaccine response were confounded with the study effect. The assessment of the primary technical sources of variation was carried out using only the pre-vaccination data, not affected by the targeted pathogen and vaccine type used in the different studies. Of note, all studies enrolled healthy volunteers, and the first biosample was obtained pre-vaccination. The targeted pathogen and vaccine type should not affect these baseline data.
Platform, study, and sample types were identified as significant sources of variation in the gene expression matrix. The effect of those three variables was estimated by modeling gene expression at baseline (at which no vaccine or timepoint effect exists) with a linear model using the limma framework, including feature set vendor (Platform/Affy), study (batch factors), and sample type, Y-chromosome genes presence, as covariates. Study and cell-type effects were estimated using a linear model with age, Y-chromosome genes presence (biological sex), study, sample type (Whole Blood/PBMC), study, and platform as additive effects. From here, the study, platform, and cell-type effects were eliminated from the entirety of the expression matrix. There were three studies (SDY1276, SDY1264, SDY180) that contained multiple cohorts and were treated as separate studies.
Biological sex imputation. Imputation of biological sex, as defined by the presence of a Y-chromosome, was carried out based on the gene expression profiles of 13 Y-chromosome genes. Within each study, a multidimensional scaling was first applied to the Y-chromosome gene expression profiles. K-means clustering was then used to cluster samples into two groups. Participants in the cluster with higher mean expression values were considered male (i.e., the Y-chromosome was present) while those in the cluster with lower expression were considered female (i.e., the Y-chromosome was absent). The consistency of the Y-chromosome presence assignment across time points was verified (Fig. 3d). In the (few) cases where imputation was not in agreement across all time points, the reported sex was used and if no sex was reported, imputation followed a majority rule principle.
Age imputation. Age imputation for studies without reported ages (SDY1260, SDY1264, SDY1293, SDY1294, SDY1364, SDY1370, SDY1373, SDY984) employed the RAPToR R v1.1.5 package 44 . The RAPToR algorithm takes in a reference set of gene expression time series with reported ages and generates a near-continuous, high-temporal resolution from the interpolated reference dataset. Transcriptomic profiles of participants without reported ages were compared to the reference dataset via a correlation profile, providing age estimates for the sample. Finally, random subsets of genes from the subject's transcriptomic profile were bootstrapped to ascertain a confidence interval for the imputed age. We generated the reference dataset using the transcriptomic profiles of 21 studies in our resource for which age was reported. The studies were split into younger (age <50) and older (2022) 9:635 | https://doi.org/10.1038/s41597-022-01714-7 www.nature.com/scientificdata www.nature.com/scientificdata/ (age ≥50) cohorts, thus two different models were generated, and only baseline transcriptomic profiles were used in the reference dataset. As RAPToR also enables phenotypic data to be incorporated into the interpolation model, each possible combination of phenotypic features was tested. These phenotypic features included the top variables found during our PVCA tests as well as demographic information such as reported age, cohort and matrix type, Y chromosome imputation, study accession, feature set vendor and platform names, and cell types. For each combination, RAPToR predicted the age of participants in the 21 studies with known age, and the goodness of fit was evaluated by the coefficient of determination (R 2 ) and confirmed via RMSE. The best model for the younger and older cohorts was then used to impute ages for the 7 studies without reported age (Fig. 3e,f) Immune response datasets processing pipeline. To identify the molecular signatures that correlate with vaccine immunogenicity, we included immune response readouts in the creation of this data resource. For studies that were missing vaccine response endpoints in their public data deposition, we contacted study authors and requested available antibody response measures to vaccine antigens. Once shared, these data were submitted to ImmPort and linked to the relevant studies. These readouts include neutralizing antibody titers (Nab), hemagglutination inhibition assay (HAI) results for influenza studies, and Immunoglobulin IgG ELISA assay results. In participants for whom the humoral immune response was measured with multiple assays, the preference was given to HAI for influenza or Nab for non-influenza studies, then IgG ELISA datasets. The antibody measures www.nature.com/scientificdata www.nature.com/scientificdata/ were normalized within each study by estimating the fold-change differences between the post-vaccination time-point (generally between day 28 or day 30) compared to the baseline measurement. For influenza studies where the vaccine included multiple strains, the fold changes between the post-vaccination versus baseline were calculated for each strain, and the maximum fold change (MFC) over the strains was selected 33 . Due to the variability in baseline antibody (Ab) levels and immune memory such as influenza vaccines, we also estimated the maximum residual after baseline adjustment (maxRBA) method by calculating the maximum residual across all vaccine strains to adjust for variable baseline Ab levels using the R package titer 20 . A total of 30 studies with 1405 participants and 4795 samples have both transcriptomics and immune response readout data available (Fig. 2). This dataset enables researchers to carry out comparative analyses using immunogenicity data as well as prediction of the quality of response across multiple vaccines.

technical Validation
Quality control and assurance. For global quality control across all public microarray data, we used a well-established pipeline available through the ArrayQualitymetrics R package 40 . Using pre-specified criteria established in the existing public microarray data reuse pipeline 59 , arrays that failed 3 out of 3 calculated quality control statistics were flagged for removal (see Methods). Consistent with standard practice to perform such quality control analysis prior to downstream analysis and dataset submission to the Gene Expression Omnibus, none  www.nature.com/scientificdata www.nature.com/scientificdata/ of the samples were outliers by all three statistics (Fig. 3a). As expected for data from published peer-reviewed studies, all the identified studies passed the quality assurance method using the Arrayqualitymetrics method.

Y-chromosomal presence and age imputation.
A few studies were missing information for sex and for age. To achieve data completeness, we included the biological sex imputation based on the imputed presence of the Y-chromosome using gene expression, as well as imputation of age when the variable was missing or defined by a broad range of values. Age imputation employed the RAPToR tool using 21 studies with reported age to define the best predictive model for the younger (age <50 years) and older (age ≥ 50 years) cohorts separately. The model with the lowest root mean square error (RMSE) from the young cohort was generated by taking into account the model (X ~ age_reported + matrix) with a coefficient of determination of R 2 = 0.367 (Fig. 3e), while the old cohort yielded a prediction with R 2 of 0.536 for their highest performing model (Fig. 3f).
Definition of vaccination studies transcriptomic cohort. Data preprocessing in ImmuneSpace yielded a total of 30 studies and 59 cohorts, with 1482 participants and 5413 samples. After the data was preprocessed and quality control measures were performed, we further assessed the identified cohorts as defined in the flow diagram (Fig. 2). This curation included: i) removing participants that were not relevant to the objective (n = 34); ii) removing samples due to inconsistencies with time design determination (n = 178); iii) removing participants with no baseline expression data (n = 42). Some studies, such as SDY1368 and SDY67, were dropped from the normalized data sets as they did not include subjects within our target age range (18-50 years). In summary, we report that the final Immune Signatures Data Resource contains 53 cohorts from 30 studies with 1405 participants and 4795 samples. www.nature.com/scientificdata www.nature.com/scientificdata/ Assessment and adjustment of the batch effects. We evaluated the main sources of variation on the gene expression matrix to identify and adjust technical confounders (RNA-seq, Affymetrix arrays, Illumina arrays, etc.), study, and specimen types (e.g., whole blood vs. PBMCs) using the baseline samples. Since all studies enrolled healthy volunteers, and the first sample was taken pre-vaccination, pathogen and vaccine type would not affect the baseline data. Figure 3b clearly demonstrates robust clustering of samples by study, which are also grouped by platform type. The study effect and type of platform used accounted for the vast majority (95%) of variation, followed by specimen types (3.6%). It is thus essential that the data are corrected for these major effects prior to any analytical usage [see Materials and Methods for further details]. The study, platform type, and specimen type-specific effects were estimated using a linear model that also included age and Y-chromosome presence as additive effects using only baseline expression. Once the study, platform, and specimen-type effects were estimated, they were eliminated from the entirety of the expression matrix. Figure 3b shows that those effects can successfully be adjusted from the data, thus leading to a matrix of expression that is free of most technical biases induced by the laboratory and cell-type effects.
Immune signatures transcriptomics and immune response datasets. We report the total number of assay samples collected from the transcriptomic and immune response datasets tallied by targeted pathogen and vaccine type, across multiple systems vaccinology datasets (Fig. 4a). We captured about ~3000 HAI antibody titer results from influenza studies that were measured by the standard HAI assay pre-and at multiple time points post-vaccination, depending on the study. Mean titers were calculated for the reported strains of the virus and were based on the highest dilution reported at day 28-30 post-vaccination. In addition, neutralizing antibody (NAB) titers and IgG ELISA results specific to each pathogen were determined by each study and are summarized (Fig. 4a). The overall transcriptomics dataset comprises multiple time points from 7 days pre-vaccination up to day 180 days post-vaccination (Fig. 4b). While most of the datasets focus on the young adult population (ages 18-50 years old), the data resource also includes studies that profile older adults following hepatitis B, influenza, and varicella vaccination (Fig. 4c) that may be useful for analysis. The Euler diagram describes the dataset overlap of participants with transcriptomics datasets and corresponding to one or more immune response datasets (Fig. 4d).
Heterogeneity of the immune response to vaccination across targeted pathogens and vaccine types was reflected in variation in the longitudinal trajectories of HAI and NAB titer measurements (Fig. 5a,b). HAI and   www.nature.com/scientificdata www.nature.com/scientificdata/ www.nature.com/scientificdata www.nature.com/scientificdata/ NAB titers generally increased by 14-28 days after vaccination but attenuated at different times for each vaccine (Fig. 5a,b). Change in NAB titers after vaccination were significantly different across the 5 unique combinations of targeted pathogen and vaccine types where these measurements were reported (ANOVA p < 10 −10 ), with significant differences across all 5 groups except between meningococcus and yellow fever vaccines (Fig. 5c). Some influenza vaccination studies reported both HAI and NAB measures of immunogenicity, and there was a significant positive correlation between the vaccination-induced changes in these titers across participants (Spearman's rho = 0.45, p < 10 −10 ) (Fig. 5d).

Usage Notes
The expression data and accompanying meta-data have been made available with different formats and options to ease usage. Data are available as standard expression sets (eSet) objects, the R/Bioconductor structure unifying expression values, metadata, and gene annotation Both normalized data and batch-adjusted data are available (Table 4). Users interested in a single study or those planning to work exclusively within participants' changes may opt for the normalized data without batch adjustment. For comparison of time points across studies or developing algorithms that use expression data, batch corrected matrices should be employed. Imputed age values for participants with no reported age were included to facilitate the use of age as a covariate in future analysis. Such analysis can be carried out with the complete data set and can be followed up by a sensitivity analysis using the small cohort with age-reported data. For the use of expression sets with the corresponding immune response per participant, these are available in eSets noted with a response. The selected immune response outcome per study is also summarized in Table 3.

acknowledgements
This research was conducted within the Human Immunology Project Consortium (HIPC) and supported by the National Institute of Allergy and Infectious Diseases. This work was supported in part by NIH grants U19AI128949, U19AI118608, U19AI118626, and U19AI089992, U19AI090023, U19AI089992, U19AI128914, U19AI118610, U19AI128913. The HIPC projects are listed at https://www.immuneprofiling.org/hipc/page/ showPage?pg=projects. This work was supported in part by the Canadian Institutes of Health Research [funding reference number FDN-154287]