Abstract
Rapid advances in technology have led to a wealth of large-scale molecular omics datasets. Integrating such data offers an unprecedented opportunity to assess molecular interactions at multiple functional levels and to provide a more comprehensive understanding of the biological pathways involved in different disease subgroups. However, integrating multiple omics datasets is challenging because of the heterogeneity of the platforms used. There is a need to address the complex and correlated nature of the different data types in order to identify a robust and reliable multi-omics signature that can predict a phenotype of interest.
We introduce a novel multivariate dimension reduction method for multiple omics integration, classification and identification of a multi-omics molecular signature. DIABLO (Data Integration Analysis for Biomarker discovery using a Latent component method for Omics studies) models the correlation structure between omics datasets, resulting in an improved ability to associate biomarkers across multiple functional levels with phenotypes of interest. We demonstrate the capabilities of DIABLO using simulated data and studies of breast cancer and asthma, integrating up to four types of omics datasets to identify relevant biomarkers while retaining classification and predictive performance competitive with existing methods.
Our statistical integrative framework can benefit a diverse range of research areas with varying types of study designs, and it enables module-based analyses. Importantly, the graphical outputs of our method assist in the interpretation of such complex analyses and provide biological insights.
List of abbreviations
- DIABLO, Data Integration Analysis for Biomarker discovery using a Latent component method for Omics studies;
- AUC, area under the receiver operating characteristic curve;
- PLS, Projection to Latent Structure models;
- sPLS-DA, sparse PLS-Discriminant Analysis;
- sGCCA, sparse generalized canonical correlation analysis;
- PCA, Principal Component Analysis;
- BER, Balanced Error Rate;
- Enet, elastic net;
- RF, random forest;
- SVM, support vector machine;
- KEGG, Kyoto Encyclopedia of Genes and Genomes.