KCF-Convoy: efficient Python package to convert KEGG Chemical Function and Substructure fingerprints

Motivation In silico methodologies to assess pharmaceutical activity and toxicity are increasingly important in QSAR, and many chemical fingerprints have been developed to tackle this problem. Among them, KEGG Chemical Function and Substructure (KCF-S) has been shown to perform well in some pharmaceutical and metabolic studies. However, the software that generates KCF-S fingerprints has limited usability: the input file must be Molfile or SDF format, and the output is only a text file. Results We established a new Python package, KCF-Convoy, to generate KCF format and KCF-S fingerprints from Molfile, SDF, SMILES, and InChI seamlessly. The obtained KCF-S was used in a number of supervised machine-learning methods to distinguish herbicides from other pesticides, and to find characteristic substructures in taxonomy groups. Availability KCF-Convoy is implemented as a Python package freely available at https://github.com/KCF-Convoy and the user can use the package management system “pip” and also the Docker environment. Contact maskot@chemsys.t.u-tokyo.ac.jp


Introduction
Functionality and toxicity are inseparable characteristics of chemical compounds, necessitating the effective estimation of safety for the development of pharmaceuticals and bioactive compounds (Roy et al., 2015). To reduce the research and development costs of such compounds, and also from the viewpoint of animal welfare, there is an increasing need to develop in vitro and in silico methodologies that can replace conventional toxicity tests. To this end, it is considered beneficial to use animal experiment data that are already available, and to apply artificial intelligence and machine learning techniques to obtain brand new knowledge on toxicology. These approaches should help to achieve high-accuracy toxicity prediction methods covering various types of compounds that cannot be studied using existing Quantitative Structure-Activity Relationship (QSAR) models.
Among them, KCF-S has been shown to perform the best in some pharmaceutical and metabolic studies (Kotera et al., 2013;Sawada et al., 2014). However, there are currently only two ways to generate KCF format, namely via the chemical analysis tools on the GenomeNet API ( http://www.genome.jp/tools/gn_ca_tools_api.html ), and the downloadable KCFCO software provided by GenomeNet ( http://www.genome.jp/en/gn_ftp.html ). Both these ways have limited usability: the input file must be only Molfile or SDF format, and the output is only a text file.
Here, we established a new Python package, named KCF-Convoy, to generate KCF format and KCF-S fingerprints from Molfile, SDF, SMILES, and InChI files seamlessly ( Figure 1). The procedure developed by Kotera et al., 2013 (Figure 1a) requires several steps that generate intermediary files, and the resulting KCF-S fingerprints are obtained as a text file. KCF-Convoy ( Figure 1b) can generate KCF-S fingerprints without producing intermediary text files. The KCF-S fingerprints are generated in NumPy arrays ( http://www.numpy.org/ ), and can be directly subjected to the scikit-learn machine-learning library ( http://scikit-learn.org/ ) and other Python libraries. We also added options to select KCF-S attributes that represent different levels of chemical substructures (Kotera et al., 2013). The KCF-Convoy package should improve the usability of KCF-S fingerprints and further contribute to chemical bioinformatics studies.

Fig. 1. Differences between (a) the original and (b) proposed procedures for generating KCF Substructure
(KCF-S) vectors. Ellipses and rectangles indicate executable programs and text files, respectively.

M.Sato et al.
We prepared a variety of installation methods as explained in the Wiki page of https://github.com/KCF-Convoy/ .
One option is to use pip, the Python package management system. The RDKit library should be installed in advance. Miniconda ( https://conda.io/miniconda.html ) is recommended for this purpose. The following pip command can then be used to install the KCF-Convoy package:

% pip install kcfconvoy
It is also recommended to use the Docker environment to install KCF-Convoy as explained in the KCF-Convoy wiki.

Different from the original KCF Converter in GenomeNet, KCF-Convoy enables molecules to be input not only as
Molfile and SDF files, but also as SMILES and InChI files. It also enables the conversion of KCF format to other formats such as Molfile.
KCF-Convoy depicts a molecule as a mathematical graph structure and calculates KCF-defined biochemical substructures (Kotera et al., 2013), where vertices (nodes) are assigned KEGG atom labels that represent atomic elements with hybrid orbit and other physicochemical information (Hattori et al., 2003). KEGG atom labels generally consist of three letters; for example, C1a, where C is the element symbol of carbon, C1 indicates a sp3 carbon, and C1a indicates a methyl carbon (an sp3 carbon that forms a bond with only one atom other than hydrogen atoms). Definitions of all KEGG atoms are listed in http://www.genome.jp/kegg/reaction/KCF.html .
KCF-Convoy provides an integer KCF vector if a query molecule is given, and an integer KCF matrix if a group of molecules are given.

Applications
It has been shown previously that KCF-S vectors can outperform other chemical fingerprints in drug-target interaction prediction (Sawada et al., 2014) and enzymatic reaction prediction (Kotera et al., 2013). In this paper, we demonstrate two other examples to show the usefulness of KCF-S vectors.

Supervised machine learning to distinguish herbicides from other pesticides
An advantage of KCF-S compared to other chemical fingerprints is its interpretability. Figure 2 shows examples of molecules and their KCF Substructures that were found specifically either in herbicides or in other pesticides registered in the KEGG database (The complete list of 918 pesticides was obtained from http://www.kegg.jp/keggbin/get_htext?br08007.keg ). The C1a-O2a-C8y-N5x-C8y substructure was found only in herbicides, whereas the S0-P1a-O2b-C1b-C1a substructure was found in insecticides and fungicides, but not in herbicides. These examples show that KCF-S enables the interpretation of substructures that contribute to the molecular functions of the corresponding molecules.
To distinguish herbicides from other pesticides, we used different supervised machine-learning methods in the scikit-learn package, namely logistic regression, kNN (k-nearest neighbors), linear SVM (support vector machine), polynomial SVM, RBF (radial basis function) SVM, sigmoid SVM, decision tree, random forest, AdaBoost, extra tree, Gaussian process, gradient boosting, SGD (stochastic gradient decent), LDA (linear discriminant analysis), QDA (quadratic discriminant analysis), MLP (multi-layer perceptron) and naive Bayes. We used the KEGG BRITE database (Kanehisa et al., 2016) to collect the chemical structures of 918 pesticides, which included 365 herbicides, 299 insecticides, and 240 fungicides. Table 1 shows the summary of the results for the supervised machine-learning methods using KCF-S, Morgan, RDKit, Layered, and Pattern fingerprints as inputs. For most machine-learning methods, KCF-S generated better prediction results than the other four fingerprints. See https://github.com/KCF-Convoy/kcfconvoy/wiki/Exampleusage-of-KCF-Convoy-for-machine-learning for the detailed results and the Python code.

Extraction of characteristic substructures in a genus
Based on the assumption that taxonomically related species conserve similar metabolic pathways and chemical substructures, we investigated the appearance of substructures in the metabolite-species relationships described in the KNApSAcK database (Nakamura et al., 2014). As a result, we detected chemical substructures that were specific in a particular genus. We found that the N1b-C2c-S2a substructure was present only in eight metabolites out of the 50899 metabolites in KNApSAcK (Figure 3), and four of them were specifically present in genus Brassica. The eight metabolites were subjected to the E-zyme2 program (Moriya et al., 2016), which predicts candidate enzyme orthologs that can catalyze a reaction given as a query substrate-product pair. The prediction results

KCF-Convoy
showed that a Brassica oleracea (wild cabbage) cytochrome (CYP) enzyme (EC 1.14.14.1) could catalyze the interconversion of two of the metabolites (C00027106 and C00036583). We considered this reasonable because the difference between these two metabolites is one methoxy group (-OCH 3 ), indicating the hydroxylation reaction (typically catalyzed by CYP) may be followed by the methylation reaction.
The substructure effects in ligand-protein interactions have been investigated systematically and comprehensively (e.g., Malhotra and Karanicolas, 2017;Korkuc and Walther, 2016). This study reinforces the potential of KCF-S for de novo pathway reconstruction of natural products, and should also allow further understanding of substructure effects in wider range of issues in molecular biology.