A Bayesian approach to multivariate and multilevel modelling with non-random missingness for hierarchical clinical proteomics data

Irene SL Zeng; Thomas Lumley; Katya Ruggiero; Martin Middleditch

doi:10.1101/153049

Abstract

High-throughput mass-spectrometry-based proteomics data from clinical studies bring challenges to statistical analysis. These challenges originate from the hierarchical levels of protein abundance data and from interactions between the clinical study design and the experimental design. The non-random missingness of measurements within this vast amount of information adds further complexity to the analysis. We propose multivariate multilevel models to analyse protein abundances and to handle abundance-dependent missingness within a Bayesian framework. The proposed model enables variance decomposition at different levels of the data hierarchy and provides shrinkage of protein-level estimates for a group of proteins. A logistic missingness and censored model with an informative prior is used to handle incomplete data. Hamiltonian Monte Carlo (HMC)/No-U-Turn Sampling and Gibbs MCMC algorithms are developed to derive the posterior distributions of the study parameters; HMC is demonstrated to be more efficient for these high-dimensional correlated data. The improvement offered by the proposed missing-data model is assessed against the univariate mixed-effects model and the multivariate multilevel model using complete data, in a simulation study and a clinical proteomics study. The proposed modelling framework can be applied to other types of data with a similar structure and a missing-not-at-random (MNAR) mechanism.
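To illustrate what abundance-dependent missingness means in practice, the following minimal Python sketch simulates log-abundances and removes values with a probability given by a logistic function of the abundance itself. It is not the model fitted in the paper: the function name, the normal distribution for abundances, and all parameter values (`mu`, `sigma`, `alpha`, `beta`) are assumptions chosen only for illustration.

```python
import numpy as np

def simulate_abundance_dependent_missingness(
    n=10_000, mu=20.0, sigma=2.0, alpha=36.0, beta=-2.0, seed=0
):
    """Simulate log-abundances with logistic, abundance-dependent missingness.

    All names and parameter values are hypothetical, not taken from the paper.
    logit P(missing | y) = alpha + beta * y, with beta < 0 so that
    low-abundance values are more likely to be missing.
    """
    rng = np.random.default_rng(seed)
    y = rng.normal(mu, sigma, size=n)            # true log-abundances
    p_missing = 1.0 / (1.0 + np.exp(-(alpha + beta * y)))
    missing = rng.random(n) < p_missing          # True where the value is lost
    return y, missing

y, missing = simulate_abundance_dependent_missingness()
full_mean = y.mean()                 # mean of all true abundances
observed_mean = y[~missing].mean()   # mean of the values that remain observed
print(f"fraction missing: {missing.mean():.3f}")
print(f"full mean: {full_mean:.3f}, complete-case mean: {observed_mean:.3f}")
```

Because low abundances are preferentially lost, the complete-case mean overestimates the true mean in this simulation, which is the kind of bias a model that accounts for the missingness mechanism is intended to address.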

The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.