Abstract
In the era of structural genomics, with a large number of protein structures becoming available, identifying domains is an important problem in protein function analysis because it forms the first step in protein classification. Domain identification has been an active area of research for over four decades, and a wide range of automated methods has been proposed. The proposed network-based machine learning approach, NML-DIP, combines supervised (SVM) and unsupervised (k-means) machine learning techniques for domain identification in proteins. The algorithm first represents the protein structure as a protein contact network and uses topological properties of the network, viz., length, density, and interaction strength (which assesses inter- and intra-domain interactions), as the feature vector for a first SVM that distinguishes single-domain from multi-domain proteins. A second SVM then predicts the number of domains in multi-domain proteins, so the method does not require prior information about the number of domains. The k-means algorithm is then used to identify the domain boundaries, which are assessed against CATH annotations. The performance of the proposed algorithm is evaluated on four benchmark datasets and compared with four state-of-the-art domain identification methods. Its performance is comparable to that of other domain identification tools, and it works well even when domains are formed from non-contiguous segments. Performance improves significantly when the number of domains is supplied as prior information. The algorithm is available at: https://bit.ly/NML-DIP.
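As a rough illustration of the network and clustering steps described above, the following minimal sketch builds a contact network from residue coordinates, computes its density, and splits the residues into a given number of domains with k-means. It is not the NML-DIP implementation. The 7 Å cutoff, the use of raw 3D coordinates as the clustering input, the farthest-point initialisation, the toy two-lobe coordinates, and all function names are assumptions made for illustration. The two SVM stages are omitted, so the number of domains `k` is passed in directly, and the intra/inter contact count is only a simplified stand-in for the interaction strength feature.

```python
import math

def contact_network(coords, cutoff=7.0):
    """Edge set of a contact network: residues i < j are linked when they lie
    within `cutoff` of each other (the 7.0 value is an assumed threshold)."""
    n = len(coords)
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if math.dist(coords[i], coords[j]) <= cutoff}

def density(n_nodes, edges):
    """Fraction of all possible residue pairs that are in contact."""
    possible = n_nodes * (n_nodes - 1) / 2
    return len(edges) / possible if possible else 0.0

def kmeans(points, k, iters=100):
    """Plain Lloyd's k-means returning one cluster label per point.
    Farthest-point initialisation (an assumption) keeps the sketch deterministic."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = []
    for _ in range(iters):
        # Assign every point to its nearest centre.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        # Recompute each centre as the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            updated.append(tuple(sum(axis) / len(members) for axis in zip(*members))
                           if members else centers[c])
        if updated == centers:
            break
        centers = updated
    return labels

def intra_inter_contacts(edges, labels):
    """Count contacts within and between predicted domains
    (a hypothetical proxy, not the paper's interaction strength)."""
    intra = sum(1 for i, j in edges if labels[i] == labels[j])
    return intra, len(edges) - intra

# Toy "protein": two compact lobes of nine residues each, 40 units apart.
lobe_a = [(float(x), float(y), 0.0) for x in range(0, 12, 4) for y in range(0, 12, 4)]
lobe_b = [(x + 40.0, y, z) for x, y, z in lobe_a]
coords = lobe_a + lobe_b

edges = contact_network(coords)
labels = kmeans(coords, k=2)
print("density:", density(len(coords), edges))
print("domain labels:", labels)
print("intra/inter contacts:", intra_inter_contacts(edges, labels))
```

On this toy input the two lobes share no contacts, so the clustering recovers them as separate domains. Real structures would need coordinates parsed from a structure file, and the domain count would come from the classifiers rather than being fixed by hand.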