Abstract
Inferring phylogenetic trees for a set of taxa is one of the primary objectives in evolutionary biology. Numerous approaches exist to reconstruct phylogenetic trees from different kinds of biological data, such as genome sequences, gene sequences, protein sequences, and protein-protein interaction graphs. However, each method has its own strengths and weaknesses. To date, no existing method is guaranteed to recover the true phylogenetic tree in every case. Various studies have identified distinct branch length configurations under which the existing methods fail to infer the correct tree topology reliably. Recent developments in deep neural architectures have shown strong performance on various hard and long-standing problems in biological research. In this study, we propose a novel deep convolutional neural network (CNN)-based model, DeePhy, to reconstruct phylogenetic trees from unaligned sequences. The sequences are represented on a two-dimensional coordinate plane by means of a biological semantics-based mapping. Additionally, to assess the robustness of a method, we propose a novel bootstrapping technique that generates replicas from the unaligned sequences. We train the model on sequence triplets, where the output is a triplet tree topology. We show that the well-trained DeePhy outperforms state-of-the-art methods such as Neighbour Joining, Maximum Parsimony, Maximum Likelihood, and Bayesian Inference in inferring triplet tree topologies. We evaluate DeePhy on data simulated under numerous critical conditions and various branch length configurations. We use the McNemar test to compare the performance of DeePhy with that of the state-of-the-art methods. The results show that, in most cases, DeePhy is significantly more accurate and remarkably more robust than the conventional methods in determining triplet tree topologies.
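As an illustration of the paired comparison mentioned above, the following is a minimal sketch of an exact (binomial) McNemar test on per-triplet correctness flags for two methods. It is not the authors' evaluation code: the function name `mcnemar_exact`, the boolean-list input format, and the choice of the exact variant rather than the chi-squared approximation are all assumptions made for the example.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Exact two-sided McNemar test on paired outcomes (hypothetical helper).

    correct_a[i] and correct_b[i] say whether method A and method B
    inferred the right topology for the same test triplet i.
    Returns (b, c, p_value), where b counts cases only A got right
    and c counts cases only B got right.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both methods must be scored on the same test cases")

    # Only discordant pairs carry information about a difference.
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if not x and y)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the methods never disagree

    # Under the null hypothesis, b ~ Binomial(n, 0.5).
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy data, invented for illustration only.
method_a = [True] * 10
method_b = [False] * 8 + [True] * 2
b, c, p = mcnemar_exact(method_a, method_b)
print(b, c, p)
```

Because the test looks only at the cases where the two methods disagree, it suits comparisons in which both methods are scored on exactly the same simulated triplets.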
Using a triplet-based supertree reconstruction technique, we combine the triplet tree topologies to reconstruct trees with more than three taxa. Here too, comparison metrics such as the Robinson-Foulds score, the Matching Split score, and the Deformity Index show that DeePhy outperforms the conventional methods in inferring trees. Finally, to analyze the performance of DeePhy on a real biological dataset, we apply it to the Gadiformes dataset. Reassuringly, the trees that DeePhy reconstructs from real biological data agree with known or widely accepted topologies. Although various practical challenges remain to be addressed, the outcomes of our study suggest that deep learning approaches can succeed in inferring accurate phylogenetic trees.
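To make the tree comparison concrete, the sketch below computes a Robinson-Foulds-style distance between two rooted trees written as nested tuples of leaf names. It is an illustrative assumption rather than the scoring used in the study: it compares rooted clades (the original metric is usually defined on the splits of unrooted trees), and the names `clades` and `rf_distance` are hypothetical.

```python
def clades(tree):
    """Collect the non-trivial clades of a rooted tree.

    A tree is a leaf name (str) or a tuple of subtrees. Each clade is the
    frozenset of leaf names below one internal node; the root clade, which
    contains every taxon, is dropped because all trees on the same taxa
    share it.
    """
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    root = leaves(tree)
    found.discard(root)
    return found


def rf_distance(tree_a, tree_b):
    """Number of clades present in exactly one of the two trees."""
    return len(clades(tree_a) ^ clades(tree_b))


# The three taxa A, B, C admit three rooted triplet topologies.
ab_c = (("A", "B"), "C")
ac_b = (("A", "C"), "B")
print(rf_distance(ab_c, ac_b))               # different topologies
print(rf_distance(ab_c, ("C", ("B", "A"))))  # same topology, reordered
```

A distance of zero means the two trees contain the same clades regardless of the order in which children are written, so the same function can score both triplet predictions and larger supertrees.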
Competing Interest Statement
The authors have declared no competing interest.