TY  - JOUR
T1  - Evaluating Sample Augmentation in Microarray Datasets with Generative Models: A Comparative Pipeline and Insights in Tuberculosis
JF  - bioRxiv
DO  - 10.1101/2021.05.03.442476
SP  - 2021.05.03.442476
AU  - Ayushi Gupta
AU  - Saad Ahmad
AU  - Atharva Sune
AU  - Chandan Gupta
AU  - Harleen Kaur
AU  - Rintu Kutum
AU  - Tavpritesh Sethi
Y1  - 2021/01/01
UR  - http://biorxiv.org/content/early/2021/05/04/2021.05.03.442476.abstract
N2  - High-throughput screening technologies have created a fundamental challenge for statistical and machine learning analyses, namely the curse of dimensionality. Gene expression data are a quintessential example: high dimensional in variables (Large P) and comparatively much smaller in samples (Small N). However, these many variables are not independent. This understanding is reflected in Systems Biology approaches that treat the transcriptome as a network of coordinated biological functioning or that describe it through principal axes of variation underlying gene expression. Recent advances in generative deep learning offer a new paradigm for tackling the curse of dimensionality by generating new data from the underlying latent space captured as a deep representation of the observed data. These advances have led to widespread applications of approaches such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), especially in domains where millions of data points exist, such as computer vision and single-cell data. Very few studies have focused on generative modeling of bulk transcriptomic data and microarrays, despite these being among the largest types of publicly available biomedical data, which may limit the potential of these data to yield hundreds of novel biomarkers. Here we review the potential of generative models in recapitulating and extending biomedical knowledge from microarray data.
We also conduct a comparative analysis of VAE, GAN and Gaussian mixture model (GMM) approaches on a dataset focused on Tuberculosis. We further review whether previously known axes genes can be used as an effective strategy for employing domain knowledge when designing generative models, as a means to further reduce biological noise and enhance signals that can be validated by standard enrichment approaches or functional experiments. Competing Interest Statement: The authors have declared no competing interest.
ER  -