PT  - JOURNAL ARTICLE
AU  - Md Nazmul Haque
AU  - Sadia Sharmin
AU  - Amin Ahsan Ali
AU  - Abu Ashfaqur Sajib
AU  - Mohammad Shoyaib
TI  - Use of relevancy and complementary information for discriminatory gene selection from high-dimensional cancer data
AID  - 10.1101/2020.02.25.964304
DP  - 2020 Feb 25
TA  - bioRxiv
PG  - 2020.02.25.964304
4099  - http://biorxiv.org/content/early/2020/02/25/2020.02.25.964304.short
4100  - http://biorxiv.org/content/early/2020/02/25/2020.02.25.964304.full
AB  - With the advent of high-throughput technologies, the life sciences are generating huge amounts of biomolecular data. Global gene expression profiles provide a snapshot of which genes are and are not transcribed in a cell or tissue at a particular moment under a particular condition. The high dimensionality of such gene expression data (i.e., a very large number of features/genes measured in far fewer samples) makes it difficult to identify, de novo, the key genes (biomarkers) that truly and significantly contribute to a particular phenotype or condition, such as cancer or another disease. As the number of genes increases, simple feature selection methods perform poorly both at selecting effective, informative features and at capturing biological information. To address these issues, we propose a Mutual information based Gene Selection method (MGS) for selecting informative genes, together with two methods for ranking the selected genes, one based on frequency (MGSf) and one based on Random Forest (MGSrf). We tested our methods on four real gene expression datasets derived from different studies of cancerous and normal samples. On these datasets, our methods achieved better classification rates than recently reported methods. Our methods could also detect the key relevant pathways with a causal relationship to the phenotype.