RT Journal Article
SR Electronic
T1 Machine learning based imputation techniques for estimating phylogenetic trees from incomplete distance matrices
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 744789
DO 10.1101/744789
A1 Ananya Bhattacharjee
A1 Md. Shamsuzzoha Bayzid
YR 2019
UL http://biorxiv.org/content/early/2019/08/22/744789.abstract
AB Background: Due to recent advances in sequencing technologies and in species tree estimation methods capable of taking gene tree discordance into account, notable progress has been achieved in constructing large-scale phylogenetic trees from genome-wide data. However, substantial challenges remain in leveraging this huge amount of molecular data. One of the foremost among these challenges is the need for efficient tools that can handle missing data. Popular distance-based methods such as neighbor joining and UPGMA require that the input distance matrix does not contain any missing values. Results: We introduce two highly accurate machine learning based distance imputation techniques. One of our approaches is based on matrix factorization, and the other is an autoencoder-based deep learning technique. We evaluate these two techniques on a collection of simulated and biological datasets, and show that our techniques are more accurate and robust than the best alternative technique for distance imputation. Moreover, our proposed techniques can handle a substantial amount of missing data, to the extent where the best alternative method fails. Conclusions: This study shows for the first time the power and feasibility of applying deep learning techniques for imputing distance matrices. Our proposed deep learning framework is highly accurate and scalable to large datasets. We have made these techniques freely available as cross-platform software (available at https://github.com/Ananya-Bhattacharjee/ImputeDistances).