A validated and interpretable predictive model of cruzain inhibitors

Chagas disease affects 8–11 million people worldwide, most of them living in Latin America. Moreover, migratory phenomenon have spread the infection beyond endemic areas. Efforts for the development of new pharmacological therapies are paramount, as the pharmacological profile of the two marketed drugs currently available, nifurtimox and benznidazole, needs to be improved. Cruzain, a parasitic cysteine protease, is one of the most attractive biological targets due to its roles in parasite survival and immune evasion. In this work, we generated Quantitative Structure-Activity Relationship linear models for the prediction of pIC50 values of cruzain inhibitors. The statistical parameters for internal and external validation indicate high predictability with a cross-validated correlation coefficient of and an external correlation coefficient of . The applicability domain is quantitatively defined, according to QSAR good practices, using the leverage method. A qualitative interpretation of the model is provided based on protein-ligand interactions obtained from docking studies and structural information codified in the molecular descriptors relevant to the QSAR model. The model described in this work will be valuable for the discovery of novel cruzain inhibitors. Author summary Chagas disease is a major health problem in Latin America. The disease involves a long-lasting silent phase that usually culminates in serious or fatal heart damage. Despite its prevalence, there are only two antichagas approved drugs available. Despite these drugs have been in the market for more than 50 years, significant undesirable side effects and modest effectiveness in the chronic phase are prevalent. The need of new drugs to treat this disease is evident. Cruzain is a vital protein for the survival of Trypanosoma cruzi, the parasite causative of Chagas disease. Inhibition of this species-specific protein has been associated with improvements in pharmacological effects in animal models. Thus, blocking the activity of cruzain is an attractive approach for the development of antichagas agents. In this work, we present a validated mathematical model capable of predicting the cruzain inhibition value of a molecule from its chemical structure. This model can contribute to the identification of potential pharmacological alternatives against Chagas disease.

Introduction 1 metabolism, enzymes in ergosterol biosynthesis and the kinetoplastid proteasome [5]. 23 Cruzain is a cathepsin L-like cysteine protease present in all stages of the parasite 24 life cycle. It plays significant roles in the trypanosomal growth, survival and evasion 25 from the host immune response. Plasma membrane-anchored cruzain degrades the Fc 26 fraction of antibodies, overcoming the classic path of complement activation [3,6]. In 27 the amastigotic intracellular stage, this cysteine protease degrades transcription factors, 28 such as NFkB and thus prevents the activation of macrophages [3]. Cruzain generates 29 the bloodstream pro-inflammatory peptide Lys-bradykinin, which activates host 30 immune cells, promoting the parasite uptake and spread by phagocytosis [6]. The use of 31 cruzain inhibitors in animal models has shown to be effective in clearing the parasite 32 burden, even in the chronic phase. The vinyl-sulphonic compound known as K777 was 33 one the first proof-of-concept studies about anti-tripanosomal activity of cruzain 34 inhibitors in animal models [7,8]. Parasite death induced by cruzain inhibitors is 35 attributed to the accumulation of a peptide precursor in the Golgi complex. Therefore, 36 these in vitro and in vivo evidence have validated cruzain as a potential biological 37 target for Chagas Disease [3,6]. A variety of chemotypes for cruzain inhibition have 38 activity. QSAR modeling is widely used in drug discovery, especially in the prediction of 48 enzyme inhibition and ADME-Tox properties [9]. In virtual screening, validated QSAR 49 models are used for prioritizing molecules for experimental evaluation. Carefully 50 validated QSAR models have rendered novel chemotypes and scaffolds with a desirable 51 biological activity [10]. The quality of a QSAR model can be evaluated using the OECD 52 principles [11]. These principles are a series of guidelines originally developed for the use 53 of QSAR modelling for regulatory purposes, but they became a valuable tool in the 54 standard QSAR practice [11,12]. In this work, we explored public databases of 55 structurally diverse cruzain inhibitors for the generation of QSAR predictive models of 56 this biological endpoint. The structural properties, encoded by molecular descriptors, 57 are rationalized in terms of protein-inhibitor interactions, using molecular docking, thus 58 providing a possible mechanistic interpretation of the model. This work will be useful in 59 the search of cruzain inhibitors.

61
Data compilation, curation, and pre-processing 62 Cruzain inhibitors were collected from the ChEMBL (release 24) database, searching by 63 molecular target using the keyword cruzain. Molecules annotated with IC 50 values were 64 selected and duplicated or missing values compounds were eliminated. Finally, a were calculated from their SMILES representation, using the wash tool in the software 74 Molecular Operating Environment 2019.01 (MOE) [15]. Lastly, the structures were 75 energy minimized using the MMFF94x force field. The curated database is available in 76 S1 Feature selection and model calculation were performed in Weka 3.8 [17,18].

92
Selection of relevant features was carried out using the Correlation-Based Feature The subset of features with the highest score were used in the generation of the

104
The goodness of fit for the model was estimated by calculating the following statistical 105 parameters: coefficient of determination (R 2 ), adjusted coefficient of determination 106 (R 2 − adj), F statistic (variance ratio) and its associated p-value. Internal validation 107 was carried out through the k-fold leave-some-out cross-validation with k = 10. The using the leverage method [23]. Leverage values, h i , are computed using Eq (3), where 122 X is the descriptor matrix of the training set and x i is the descriptor vector for a query 123 molecule. per molecule were selected to generate a database of bound conformations. These data 146 were used to generate protein-ligand interaction fingerprints in MOE.

147
The similarity maps tool included in the RDKit module for python was used to 148 generate partial charges and SLogP diagrams. These diagrams show the atomic 149 contributions to logP, calculated with the Wildman and Crippen algorithm [29], and the 150 Gasteiger partial charges [27]. The 2D depictions of molecules in the similarity maps 151 were generated by projecting the calculated 3D conformation of the molecules, so that 152 these depictions resemble their docked pose.

Results and Discussion
In this work, we present the preparation and analysis of a data set of 110 cruzain known inhibitors, as has been reported previously [2].  Electrostatic component of potential energy GCUT SLOGP 2 The GCUT descriptors using atomic contribution to logP using the Wildman and Crippen SlogP method.
Contact distances of vsurf EWmin vsurf EWmin1 Lowest hydrophilic energy vsurf ID8 Hydrophobic integy moment (-1.6 kcal/mol) vsurf Wp5 Polar volume (-3.0 kcal/mol) v i is the atomic Van der Waals surface area of atom i, q i is the Gasteiger-Marsili partial charge over the atom i, and L i is the Wildman-Crippen atomic contribution to LogP.
Statistical parameters describing the goodness of fit for the model are presented in 183 Table 2. Coefficients of determination near to 1 indicates that a high ratio of variance 184 present in the original data is explained by the model. In this case, 83% of the variance 185 already present in the pIC 50 of the training set is explained by Eq 5. The ratio of the 186 mean squared error of the one-parameter model and the generated model is measured 187 by the F statistic. If this ratio is high enough, the prediction made by Eq 5 has an error 188 less than the native variability in the data. The F value for this model is presented in 189  to the linear fit. The R 2 for this external set is 0.71, above the typical threshold of 0.6. 208 However, although a high value of both Q 2 and R 2 is required, it is not sufficient for the 209 predictability estimation since these parameters just measure the linear correspondence 210 between predicted and experimental values but not their 1:1 identity relationship [20]. 211 Since there is not consensus in the establishment of an universal predictability criteria 212 for QSAR modeling, one of the proposed practices is to calculate a set of parameters 213 that could characterize the deviation from an ideal prediction, as suggested by Chirico 214 et al [30] and Gramatica et al [31].   leverage values higher than the calculated limit. In these regions, any prediction made 234 by the model is considered an extrapolation and its reliability is low. Cruciani et al [32]. Basically, the hydrophobic integy moment is the unbalance between 247 the center of mass of the molecules and the hydrophobic regions. Thus, the descriptor 248 may be related to the complementariety of inhibitors with the binding site, i.e. the 249 ability to form hydrogen bonds or polar contacts with the catalytic site or the S1 250 subsite and hydrophobic or π interactions in the S2 or S1' subsites.  takes into consideration mostly halogen atoms bound to aromatic or aliphatic groups.

275
The most potent compounds in the data set are also rich in halogen-containing groups. 276 Halogenated substituents are frequently used, among with other effects, to fulfill steric 277 contacts into protein cavities, so they can exert a shape-complementary effect with the 278 cruzain binding site, particularly in the well-defined S2 cavity and in the clefts formed 279 by the S1 and S1' subsites. field is representative of favorable polar and hydrogen bond donor-acceptor regions [32]. 285 The total polar volume at this energy (Wp5) is positively correlated with biological activity, as can be deduced from its coefficient in the model equation. Futhermore, EWmin1 indicates that a lowest hydrophilic interaction energy is more favorable for 288 cruzain inhibition. Fig (8) shows isosurfaces for the interaction fields at a level of -3.0 289 kcal/mol, with the same molecules as in Fig (7). It is clear from these representations 290 that polar volumes extend around hydrogen bond donors and acceptors, mainly on 291 those groups that mimic the peptide bond. Thus, these grid-based descriptors account 292 for the ability of inhibitors to form hydrogen bonds in the binding site for the peptide 293 bond recognition. for the binding into cruzain subsites but also their spatial distribution, as described by 298 their integy moments and polar molecular fields. These requirements resemble a 299 pharmacophore model that molecules within the applicability domain must meet to 300 bind into the protein and exert its inhibitory effect.

301
In summary, we have presented a QSAR model with a well-defined endpoint, as 302 described in methodology section for the criteria of data selection. The algorithm is 303 unambiguously presented, which consists in the application of Eq 5 to calculate the 304 predicted pIC 50 for cruzain inhibition, given the required descriptors. The applicability 305 domain is defined using the leverage method, and a limit value is also given for the