Variational Autoencoder Modular Bayesian Networks (VAMBN) for Simulation of Heterogeneous Clinical Study Data

Luise Gootjes-Dreesbach; Meemansa Sood; Akrishta Sahay; Martin Hofmann-Apitius; Holger Fröhlich

doi:10.1101/760744

Abstract

In the era of Big Data, one of the major obstacles to progress in biomedical research is the existence of data “silos”: legal and ethical constraints often do not allow sensitive patient data from clinical studies to be shared across institutions. While federated machine learning now allows models to be built from scattered data, there is still a need to investigate, mine and understand clinical data that cannot be accessed directly. Simulating sufficiently realistic virtual patients could be a way to fill this gap.

In this work we propose a new machine learning approach (VAMBN) to learn a generative model of longitudinal clinical study data. VAMBN addresses typical key aspects of such data, namely limited sample size coupled with comparatively many variables of different numerical scales and statistical properties, and many missing values. We show that with VAMBN we can simulate virtual patients in a sufficiently realistic manner while providing theoretical guarantees on data privacy. In addition, VAMBN allows for simulating counterfactual scenarios. Hence, VAMBN could facilitate data sharing as well as the design of clinical trials.

Footnotes

Abbreviations

VAMBN: Variational Autoencoder Modular Bayesian Network
BN: Bayesian Network
MBN: Modular Bayesian Network
VAE: Variational Autoencoder
HI-VAE: Heterogeneous and Incomplete Data Variational Autoencoder
DAG: directed acyclic graph
MCAR: missing completely at random
MAR: missing at random
MNAR: missing not at random
BIC: Bayesian Information Criterion
PPMI: Parkinson’s Progression Markers Initiative
PD: Parkinson’s Disease
UPDRS: Unified Parkinson’s Disease Rating Scale
ESS: Epworth Sleepiness Scale
RBD: REM sleep behavior disorder
CSF: cerebrospinal fluid

The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-ND 4.0 International license.