PT - JOURNAL ARTICLE
AU - de los Campos, Gustavo
AU - Pook, Torsten
AU - Gonzalez-Raymundez, Agustin
AU - Simianer, Henner
AU - Mias, George
AU - Vazquez, Ana I.
TI - Analysis of variance when both input and output sets are high-dimensional
AID - 10.1101/2020.02.15.950949
DP - 2020 Jan 01
TA - bioRxiv
PG - 2020.02.15.950949
4099 - http://biorxiv.org/content/early/2020/02/15/2020.02.15.950949.short
4100 - http://biorxiv.org/content/early/2020/02/15/2020.02.15.950949.full
AB - Motivation: Modern genomic data sets often involve multiple data layers (e.g., DNA sequence, gene expression), each of which can itself be high-dimensional. The biological processes underlying these data layers can lead to intricate multivariate association patterns. Results: We propose and evaluate two methods for analysis of variance when both input and output sets are high-dimensional. Our approach uses random effects models to estimate the proportion of variance of vectors in the linear span of the output set that can be explained by regression on the input set. We consider a method based on an orthogonal basis (Eigen-ANOVA) and one that uses random vectors (Monte Carlo ANOVA, MC-ANOVA) in the linear span of the output set. We used simulations to assess the bias and variance of each method and to compare them with those of Partial Least Squares (PLS), an approach commonly used in multivariate high-dimensional regression. The MC-ANOVA method gave nearly unbiased estimates in all the simulation scenarios considered. Estimates produced by Eigen-ANOVA and PLS had noticeable biases. Finally, we demonstrate the insights that can be obtained with MC-ANOVA and Eigen-ANOVA by applying these two methods to the study of multi-locus linkage disequilibrium in chicken genomes and to the assessment of inter-dependencies between gene expression, methylation and copy-number variants in data from breast cancer tumors. Availability: The Supplementary data include an R implementation of each of the proposed methods as well as the scripts used in the simulations and in the real-data analyses. Contact: gustavoc{at}msu.edu. Supplementary information: Supplementary data are available at Bioinformatics online.