FrozenChicken: Promoting the meta-analysis of chicken microarray data

The FrozenChicken RData package contains the frozen vectors for the commercially available (in situ oligonucleotide) Affymetrix Chicken Genome Array (GEO platform id GPL3213). The package is intended to simplify the meta-analysis of chicken microarray data by the research community that studies vertebrate development using the chick model organism. It is freely available at https://github.com/iduarte/FrozenChicken. (*Equal contribution.)


1 | Search for relevant GEO data series
The chicken microarray datasets were gathered using the ESearch function of the Entrez Programming Utilities (E-utilities), which provide programmatic access to the NCBI Entrez query system. This search returned a list of Unique Identifiers (UIDs) for 1739 records that met the following query criteria:
-search the GEO DataSets database (gds);
-search for GEO platform id GPL3213, the Affymetrix Chicken Genome Array (the commercially available chicken microarray chip);
-select only records that have .CEL supplementary files available for download.
The issued query was the following: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=gds&term=GPL3213%5BACCN%5D+AND+cel%5BsuppFile%5D&retmax=5000&usehistory=y

2 | Gather summary data for the data series found
Using the list of UIDs returned by the previous step, we gathered the metadata associated with each entry. For this we used the ESummary function from NCBI's E-utilities, which returns the document summaries for each UID, including the GEO Series (GSE) identifier of each experiment. The ESummary tool directly uses the esearch URL retrieved from the previous step.
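The two E-utilities requests described above can be assembled programmatically. The following is a minimal sketch in base R, not the exact script used for this work: the helper build_eutils_url is hypothetical, and the use of the query_key and WebEnv parameters for ESummary is an assumption based on the usehistory=y flag in the query. Spaces are encoded as %20 here, which is equivalent to the + used in the URL quoted above.

```r
# Base address of the NCBI E-utilities (taken from the query quoted in the text).
eutils_base <- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

# Hypothetical helper (not part of any package): build an E-utilities URL
# from a tool name and a named list of query parameters.
build_eutils_url <- function(tool, params) {
  # Percent-encode every value; reserved = TRUE also encodes "[", "]" and spaces.
  encoded <- vapply(
    params,
    function(v) URLencode(as.character(v), reserved = TRUE),
    character(1)
  )
  query <- paste(names(params), encoded, sep = "=", collapse = "&")
  paste0(eutils_base, "/", tool, ".fcgi?", query)
}

# ESearch: GEO DataSets records on platform GPL3213 with .CEL supplementary files.
esearch_url <- build_eutils_url("esearch", list(
  db         = "gds",
  term       = "GPL3213[ACCN] AND cel[suppFile]",
  retmax     = 5000,
  usehistory = "y"
))

# ESummary: reuse the result set stored on the history server.
# ASSUMPTION: query_key and WebEnv must be read from the ESearch response;
# the values below are placeholders, not real identifiers.
esummary_url <- build_eutils_url("esummary", list(
  db        = "gds",
  query_key = "1",
  WebEnv    = "WEBENV_FROM_ESEARCH"
))

print(esearch_url)
```

The resulting URLs can be passed to any HTTP client; no request is sent by the code above.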

3 | Download the relevant data sets
Careful manual inspection of the results retrieved from the previous step led to the collection of

Case Study Using FrozenChicken
In this section we show how to use the FrozenChicken vectors in a microarray analysis of chicken transcriptomics data. The workflow described here was performed in R, using RStudio (version 1.1.463). The typical workflow of a microarray data analysis is shown in Figure 1.

Case Study | 1. Install FrozenChicken and Additional R Packages
We start by installing the FrozenChicken package, which is hosted on GitHub. Installing it directly from GitHub requires the remotes package. If you do not have it, install it first:

## Install the remotes package from the CRAN repository
install.packages("remotes")

## Load the package
library(remotes)

Then install the R package FrozenChicken directly from GitHub:

remotes::install_github("iduarte/FrozenChicken")

Next, load the library named affyChickGenomeArrayfrmavecs. The frozen parameters then become available for the normalization of chicken microarray data from different experiments (provided that all of them use the same Affymetrix Chicken Genome Array platform).

## Load the FrozenChicken package
# This is the full name of the FrozenChicken data object
library(affyChickGenomeArrayfrmavecs)

## Load the affyChickGenomeArrayfrmavecs data set
data(affyChickGenomeArrayfrmavecs)

To complete this case study, the following R packages are required:

## Install Bioconductor (if not already installed)
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(version = "3.10")

## Install the required packages (if not already installed)
BiocManager::install(c("ArrayExpress", "GEOquery", "Biobase", "affy",
                       "arrayQualityMetrics", "ggplot2", "frma", "devtools"))

Case Study | 2. Obtaining the Gene Expression Matrix
To conduct a transcriptomics data analysis, one must obtain a gene expression matrix, i.e. a data table that reports the expression level measured for each gene. When the data to be analysed originate from microarrays deposited in public repositories, namely GEO or ArrayExpress, the following steps generate a gene expression matrix:
1. Download the .CEL files (raw microarray data from Affymetrix) from the data repository. The data can be downloaded using the GEOquery and ArrayExpress packages, respectively.
2. Read the raw .CEL files into R using the affy package, creating an 'AffyBatch' object containing the microarray data.
3. Extract the gene expression matrix from the 'AffyBatch' object.
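The three steps above can be sketched for a GEO series as follows. This is an illustrative outline under stated assumptions, not the exact code of this case study: the function read_cel_series is hypothetical, the accession must be supplied by the user, and the archive is assumed to follow the usual GEO naming pattern <accession>_RAW.tar. Note that exprs() on an 'AffyBatch' returns probe-level intensities; probeset-level values are obtained only after normalization and summarization.

```r
# Hypothetical helper: download, read and extract the raw data of one GEO series.
# gse_id   : a GEO Series accession supplied by the user (none is assumed here)
# base_dir : directory where the supplementary files are stored
read_cel_series <- function(gse_id, base_dir = "data") {
  # Fail early with a clear message if the Bioconductor packages are missing.
  for (pkg in c("GEOquery", "affy", "Biobase")) {
    if (!requireNamespace(pkg, quietly = TRUE)) {
      stop("Package '", pkg, "' is required but not installed.")
    }
  }
  dir.create(base_dir, showWarnings = FALSE, recursive = TRUE)

  # Step 1. Download the supplementary files into base_dir/<gse_id>/.
  GEOquery::getGEOSuppFiles(gse_id, baseDir = base_dir)

  # ASSUMPTION: the raw data archive is named <gse_id>_RAW.tar.
  series_dir <- file.path(base_dir, gse_id)
  cel_dir <- file.path(series_dir, "cel")
  untar(file.path(series_dir, paste0(gse_id, "_RAW.tar")), exdir = cel_dir)

  # Step 2. Read the .CEL files into an 'AffyBatch' object.
  cel_files <- affy::list.celfiles(cel_dir, full.names = TRUE)
  affybatch <- affy::ReadAffy(filenames = cel_files)

  # Step 3. Extract the (probe-level) intensity matrix.
  list(affybatch = affybatch, intensities = Biobase::exprs(affybatch))
}

# Usage (requires an internet connection and a real accession):
# result <- read_cel_series("<GSE accession>")
# affybatch_chick <- result$affybatch
```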
All arrays passed the QC criteria, so all of them are included in the next analysis steps. If any arrays had been flagged as outliers, they would have to be removed from the analysis and the quality control steps re-run.
## Load the required library for quality control
library(arrayQualityMetrics)

## Step 3. Quality Control Report
# This package uses the raw affybatch object directly
# and not the expression matrix
# (which is why we log transform the data).
arrayQualityMetrics(expressionset = affybatch_chick,
                    outdir = output_dir,
                    force = FALSE,
                    do.logtransform = TRUE)

Case Study | 4. Normalization with fRMA Using FrozenChicken
Normalizing data obtained from different experiments is essential to make the arrays comparable. With the frozen RMA (fRMA) method this is straightforward, because it relies on a vector of frozen parameters pre-computed from diverse datasets generated on the same microarray chip. FrozenChicken provides these pre-computed frozen parameters for the commercial Affymetrix Chicken Genome Array, to be used with the fRMA package.
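To give an intuition for why frozen parameters make arrays from different experiments comparable, the toy example below quantile-normalizes two simulated arrays to one fixed reference distribution. It is a simplified sketch in base R and not the fRMA implementation: fRMA also performs background correction and a robust weighted summarization using frozen probe-level estimates. The function normalize_to_reference and all simulated values are illustrative assumptions.

```r
# Illustrative helper: map the values of one array onto a fixed reference
# distribution, keeping the rank order of the original values.
normalize_to_reference <- function(x, reference) {
  stopifnot(length(x) == length(reference))
  # The k-th smallest value of x is replaced by the k-th smallest reference value.
  sort(reference)[rank(x, ties.method = "first")]
}

set.seed(1)
n_probes <- 1000

# Stand-in for a frozen reference distribution (simulated, not real data).
reference <- rnorm(n_probes, mean = 8, sd = 2)

# Two simulated arrays from different "experiments" with different scales.
array_a <- rnorm(n_probes, mean = 7, sd = 1.5)
array_b <- rnorm(n_probes, mean = 10, sd = 3)

norm_a <- normalize_to_reference(array_a, reference)
norm_b <- normalize_to_reference(array_b, reference)

# Both arrays now share exactly the same distribution of values,
# while the ordering of probes within each array is unchanged.
same_distribution <- identical(sort(norm_a), sort(norm_b))
same_order_a <- all(order(array_a) == order(norm_a))
print(c(same_distribution = same_distribution, same_order_a = same_order_a))
```

Because the reference is fixed in advance, each array can be normalized on its own, without access to the other arrays in the analysis.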

## Step 4 - frozenRMA normalization using the FrozenChicken vectors
# Load the frma package, which provides the frma() function
library(frma)
eset_chick_frma <- frma(affybatch_chick,
                        background = "rma",
                        normalize = "quantile",
                        summarize = "robust_weighted_average",
                        target = "probeset",
                        input.vecs = affyChickGenomeArrayfrmavecs,
                        output.param = NULL,
                        verbose = FALSE)

# The data are log2 transformed by the fRMA normalization process
expres_chick_frma <- exprs(eset_chick_frma)

Case Study | 5. Data Visualization
Once the data have been normalized, we must confirm that the normalization was successful by running the quality control steps on the newly normalized data and comparing the results with the pre-normalized data. Here we show two of the most relevant plots for evaluating the success of the normalization procedure: a boxplot (where each box corresponds to the intensity distribution of one array), and a Principal Component Analysis (PCA) plot to view the variation between the arrays (here, each dot is one array).
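Both plots can be produced with base R alone. The sketch below uses a small simulated log2 expression matrix so that it runs on its own; it is not the plotting code behind the figures of this work, and the object names and the batch labels are illustrative assumptions. In a real analysis the simulated matrix would be replaced by the normalized matrix (e.g. expres_chick_frma).

```r
set.seed(42)

# Simulated log2 expression matrix: rows are probesets, columns are arrays.
n_probesets <- 500
n_arrays <- 6
expr_mat <- matrix(
  rnorm(n_probesets * n_arrays, mean = 8, sd = 2),
  nrow = n_probesets,
  dimnames = list(NULL, paste0("array", seq_len(n_arrays)))
)

# Illustrative grouping of the arrays (e.g. experiment of origin).
batch <- factor(rep(c("A", "B"), each = 3))

# Boxplot: one box per array, showing its intensity distribution.
boxplot(expr_mat, las = 2, ylab = "log2 intensity",
        main = "Intensity distribution per array")

# PCA: arrays must be the observations, so the matrix is transposed.
pca <- prcomp(t(expr_mat), center = TRUE, scale. = FALSE)

# Percentage of the total variance explained by each principal component.
pct_var <- 100 * pca$sdev^2 / sum(pca$sdev^2)

# PCA plot: each dot is one array, coloured by group.
plot(pca$x[, 1], pca$x[, 2], col = as.integer(batch), pch = 19,
     xlab = sprintf("PC1 (%.1f%%)", pct_var[1]),
     ylab = sprintf("PC2 (%.1f%%)", pct_var[2]),
     main = "PCA of arrays")
legend("topright", legend = levels(batch), col = seq_along(levels(batch)), pch = 19)
```

Running the same code on the raw and on the normalized matrix gives the before and after views that are compared in the performance evaluation.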

FrozenChicken Performance Evaluation
Before normalization, the samples show variation both between and within batches (Figure 3A). Additionally, the PCA found that 92.64% of the variance between the data points is explained by the identity of the experiment (Figure 3C). Thus, the major source of variation in the raw intensity measurements is batch effects, which should be reduced by the fRMA normalization.
Since the purpose of normalization is to remove unwanted variation between the transcriptional profiles, we expect that after normalization the relative gene expression estimates will be distributed homogeneously across the arrays, and that the variance found by the PCA will decrease.
Our results show that, after normalizing the samples with the frozen vectors from FrozenChicken, the arrays exhibit similar distribution profiles (Figure 3B), indicating that the normalization was successful.
In the PCA, as expected, the variance described by the first component (PC1) has decreased to 44.84% (Figure 3D). Additionally, the distances between the points along the first principal component range between -1500 and 1500 for the raw data (Figure 3C), while for the normalized values they have decreased nearly 10-fold, ranging between -200 and 100 (Figure 3D), further confirming the success of the fRMA normalization using the pre-computed parameters from FrozenChicken.
It should be noted that, despite the successful normalization, there is still variation in the data (Figure 3D), mostly explained by the difference in tissue types (Figure 3D and Figure 4B), i.e. the biological variability that we are interested in studying. In the PCA of the raw data, by contrast, the samples cluster by data repository, showing that the major source of variation explained by the first component was the experiment, i.e. technical variation that we are not interested in studying (Figure 3C and Figure 4A).

Conclusion
This case study shows that FrozenChicken is a reliable data package to be used with fRMA normalization in the pre-processing of chicken microarray data from different experiments, thereby simplifying future meta-analyses of chicken transcriptomics datasets from public repositories. This package will especially benefit the chicken research community, directly contributing to the quality of scientific research that uses the chicken model organism. At the time of this publication (February 2021), the Zenodo tutorial (DOI:10.5281/zenodo.3765944) describing this package had been downloaded over 1820 times in less than one year, suggesting that the package has attracted the attention of its target audience.