Simulation-based comparison of Biopharmaceutics Classification System and drug structure

Background The Biopharmaceutics Classification System (BCS), which classifies bioactive molecules based on solubility and permeability, is widely used to guide new drug development and drug formulation, as well as predict pharmacokinetics. Here we performed computer simulations to study correlations between a molecule’s structure and its BCS classification. Methods A total of 411 small molecules were assigned to BCS categories based on published drug data, and their Pybel-FP4 fingerprints were extrapolated. The information gain(IG) of each fingerprint was calculated and its characteristic structure analyzed. IG was calculated using multiple thresholds, and results were verified using support vector machine prediction, while taking into account the dose coefficient(0-0.1, 0.1-1, or>1). Structural functional features common to fingerprints of compounds in each type of BCS class were determined using computer simulations. Results BCS classes III and IV appear to share several structural and functional characteristics, including Secondary aliphaticamine, Michael_acceptor, Isothiourea, and Sulfonamide Sulfonic_derivatives. Conclusion We demonstrate that our approach can correlate characteristic fingerprints of small-molecule drugs with BCS classifications, which may help guide the development and optimization of new drugs.


Introduction
The Biopharmaceutics Classification System (BCS) [1][2] was proposed in 1995 [3] to classify drugs into four categories based on their in vitro solubility and ability to be absorbed in the intestine (permeability):class I, high solubility/high permeability; class II, low solubility/high permeability; class III, high solubility/low permeability; and class IV, low solubility/low permeability [4] (Fig 1). The definition of high solubility according to the US Food and Drug Administration is that the highest dose of a single administration can be dissolved in 250 ml or less of an aqueous solution at 37℃ at pH1.0-7.5. Solubility and intestinal permeability have proven to be an adequate starting point for drug product development and regulation [5]. The role of BCS in drug development is facilitating biowaivers of in vivo bioequivalence studies [6]. At present, the BCS classification of drugs is achieved mainly experimentally, which requires a large amount of human, material, and financial resources, so a new method is urgently needed. Computer simulation is widely used in CADD: for example, FP4 and MACCS [7] molecular fingerprints [8] are used in new drug research. These fingerprints are developed to describe chemical structures in a chemical database. Meanwhile, computer simulation uses some evaluation indicators, such as information gain (IG), which refers to the weight of a fingerprint in this category [9], and frequency of a substructure (f) [10]. Even though a few studies of the application of computer simulation have been published recently, such as estimation of ADME (Absorption, Distribution, Metabolism, Excretion) properties and prediction of drug-induced liver toxicity [9][10], research on prediction of drug BCS classification based on structure has not been reported.
In this study, we combined BCS classification with computer simulation for the first time to identify the characteristic structures of each type of small-molecule drug. This approach may be useful for determining the BCS classification of new drugs. We used FP4 molecular fingerprints to describe the chemical structures of small-molecule drugs, and then calculated IG and f values to evaluate computer simulations. Furthermore, we used a Support-Vector Machine (SVM) [11][12][13] as the evaluation criteria for BCS classification accuracy (Fig 2).
SVM is a two-class classification model for linear separable cases. For linear indivisible cases, samples with low-dimensional input space are transformed into high-dimensional feature spaces by nonlinear mapping algorithms so that samples can be classified linearly [14]. Vector Machine (SVM) data output. Red and green balls represent small-molecule drugs. The degree of difference, called a hyperplane, in the molecular fingerprints is given by the distance separatingthe red data points from the green point. The red or green data points closest to the hyperplane form thesupport vector.
In addition, we introduced the concept of dose coefficient F, which we define as the ratio of the molecular mass of each small-molecule drug to the mass in the maximum dose. Drug dissolution and absorption as well as the requirement for excipients are all critical factors in design of drugs in all BCS classes [15][16][17]. Dose size affects the absorption of the drug, and the maximum dose of the drug affects the solubility of the drug. Moreover, small-molecule drugs have different molecular masses. Therefore using the dose coefficient F may eliminate the influence of the maximum dose.
In these ways, the present study may help promote the development of new drugs and further development of existing drugs, and it may shorten the time-to-market for drugs.

Establishment of database and molecular fingerprint analysis
A total of 359 small drug molecules were identified by the Provisional BCS Classification system (http://www.ddfint.net/search.cfm) (S1 Table). An additional 52 small drug molecules were identified in the literature using keyword search terms [18]. The molecular structures were downloaded (http://pubchem.ncbi.nlm.nih.gov/) for computer simulation.The BCS classification in the present study was the one recommended by the World Health Organization with CLogP [19] as the classification standard. Molecular fingerprints were calculated by inserting the molecular structure into Open Babel software. The resulting .sdf files were entered into ChemDes (http://www.scbdd.com/fingerprints/index/), which calculated the Pybel-FP4 fingerprints [20].
The following formulas were used to calculate information entropy and IG value [9] for each molecular fingerprint: Information entropy of all molecules in the entire database (1) where p 1 stands for the possibility of molecules in the first category, and p 0 indicates the possibility of molecules in the second category.
The effect of a molecular fingerprint on the overall system where P(t) stands for the probability that a molecular fingerprint will appear in the entire system, and P( ) indicates the probability that a molecular fingerprint will not appear in the t entire system.

Information entropy of a molecular fingerprint under high solubility conditions
Information entropy of a molecular fingerprint under low solubility conditions The influence of a molecular fingerprint on the entire molecule The larger the IG value, the greater is the effect of the structural composition on the entire molecular structure.
In this study, values of 1 equate to high solubility and 0 to low solubility.
Next, the frequency of a substructure (f) value was calculated [10]: Both f and IG values were ordered from largest to smallest, and the first 20 values were selected and the common parts were taken as the characteristic molecular fingerprint to be determined. If the IG value of the top 20 molecular fingerprints was greater than 0.01, it indicated that the molecular fingerprint had a significant influence on the whole molecule, but a value below 0.01 meant the influence was small and the BCS category could not be clearly distinguished.

SVM verification of BCS classification
To determine the accuracy of the SVM macros in differentiating small-molecule drugs into separate BCS classes, the classifications provided by the Provisional BCS website were used as a reference. IG values were extrapolated from molecular fingerprints and converted to binary file format. The binary FP4 molecular fingerprints were run through the SVM macros.
The data were divided into training and test sets in the ratio of 1:4 based on the drug class assigned to each small molecule by the following macros: SP1 (I-II A validation set [21] was created by examining the SVM output of the 359 small-molecule drugs from the known BCS classification as recognized by the World Health Organization. Data were derived as above, but thresholds were removed from the SVM macros. The accuracy of the output was compared to the known class designation of these molecules. Every combination comparing one set to another was tested in the SVM software to calculate the accuracy of the classification. Due to the similarity between the training and test sets, the classification by the SVM software was considered accurate if it showed a value between 70% and 90%. A value less than 70% meant that the BCS classification was not complete enough to distinguish between the two types of data, while more than 90% meant excessive SVM training, such that the machine's ability to generalize was insufficient. After the SVM was performed, molecular fingerprints were entered into SMARTS_InteLigand from Open Babel (http://www.scbdd.com/pybel_desc/fps-fp4/) [22] and the small-molecule drug structure was recreated based on fingerprint characteristics.

Classification based on dose coefficient F
When the SVM verification result was outside the ideal percentage range, the accuracy of the computer simulation prediction was improved using secondary classification according to dose coefficient F. We defined the concept of dose coefficient F for the first time, which considers the ratio of the molecular mass of each small-molecule drug to the mass in the maximum dose. This classification was divided into the following categories: 0 to 0.1, 0.1 to 1, and >1.

Verification of results
The obtained BCS feature structures were compared with those of small drug molecules with clear BCS classification in the BCS database to verify whether the results were accurate (Fig 3).   (Tables 1-4 and S1 File). However, the common molecular fingerprints determined by IG and f values overlapped among BCS categories, making it impossible to assign molecules to a BCS category based solely on these values. Therefore, this study utilized SVM software to verify accuracy.

Table 1. I-II comparison(SP1) result (S1 File. I-II Comparison(SP1)).
Abbreviations: SP1, I-II comparison; p 0 (t), the probability of fingerprint in the first category; p 1 (t), the probability of fingerprint in the second category; IG, Information gain; f, frequency of a substructure; f^2, square of f.

Table 2. III-IV comparison(SP0) result (S1 File. III-IV Comparison(SP0)).
Abbreviations: SP0, III-IV comparison; p 0 (t), the probability of fingerprint in the first category; p 1 (t), the probability of fingerprint in the second category; IG, Information gain; f, frequency of a substructure; f^2, square of f.

Table 3. I-III comparison(PS1) result (S1 File. I-III Comparison(PS1)).
Abbreviations: PS1, I-III comparison; p 0 (t), the probability of fingerprint in the first category; p 1 (t), the probability of fingerprint in the second category; IG, Information gain; f, frequency of a substructure; f^2, square of f.

Table 4. II-IV comparison(PS0)result (S1 File. II-IV Comparison(PS0)).
Abbreviations: PS0, II-IV comparison; p 0 (t), the probability of fingerprint in the first category; p 1 (t), the probability of fingerprint in the second category; IG, Information gain; f, frequency of a substructure; f^2, square of f.

SVM prediction results
The accuracy of the BCS classification between the training and test sets was evaluated by SVM (Tables 5-10). When the IG threshold was 0.005 and only solubility or permeability was considered, the resulting SVM prediction was highly accurate (70-85%). In contrast, when drug classes were compared based on high or low solubility or high and low permeability, the accuracy of the SVM prediction was low, indicating large variability between the BCS populations. These data suggest that the current terms defining each drug class are too vague for this type of throughput and require additional information to improve classification accuracy. Abbreviations: SP1, I-II comparison; SVM, Support-Vector Machine. Table 6. SVM predictive value under low permeability.

Table 8. SVM predictive value under low solubility.
Abbreviations: PS0, II-IV comparison; SVM, Support-Vector Machine. Table 9. SVM predictive value compared with high permeability and low permeability.

SVM prediction results based on dose coefficient F classification
The BCS classification was subdivided by the dose coefficient, and SVM prediction was performed (Tables 11-16 and S2 File). Compared to the SVM predictions without dose coefficient F, the SVM predictions with the coefficient of 0-0.1 improved BCS prediction accuracy when the SVM had a threshold of 0.01 or 0.02. Furthermore, when the dose coefficient was greater than 1, the accuracy of the SVM prediction improved within each threshold range. This indicated that the dose coefficient F affected the accuracy of the SVM prediction and the screening of the characteristic molecular fingerprint. Therefore, addition of the dose coefficient F can improve SVM accuracy. Abbreviations: BCS, Biopharmaceutical classification system; SVM, Support-Vector Machine. Abbreviations: BCS, Biopharmaceutical classification system; SVM, Support-Vector Machine. Abbreviations: BCS, Biopharmaceutical classification system; SVM, Support-Vector Machine. Abbreviations: BCS, Biopharmaceutical classification system; SVM, Support-Vector Machine.

Molecular fingerprints characteristic of BCS classes III and IV
Based on the IG value, f value, dose coefficient F, and SVM prediction results, characteristic molecular fingerprints were screened for small-molecule drugs assigned to BCS classes III and IV (Tables 17-19  Abbreviations: BCS, Biopharmaceutics Classification System; p 0 (t), the probability of fingerprint in the first category; p 1 (t), the probability of fingerprint in the second category; IG, Information gain. Table 19. Characteristic molecular fingerprints with dose coefficients greater than 1.
Abbreviations: BCS, Biopharmaceutics Classification System; p 0 (t), the probability of fingerprint in the first category; p 1 (t), the probability of fingerprint in the second category; IG, Information gain.

Verification of results
We verified the accuracy of our results by comparing the obtained BCS feature structures with those of small drug molecules with clear BCS classification in the BCS database (S3 File). These data confirmed that the molecular feature structures we identified were consistent with small-molecule drugs found in BCS classes III and IV, and that this method can be used as an accurate predictor.

Discussion
In this study, we used computer simulation to analyze the common molecular fingerprints of 411 drug molecules. We determined that IG value, f value, and dose coefficient F were all necessary for the accuracy of SVM classification prediction. Lastly, we generated five macros to calculate and classify BCS (S4 File).
Different BCS classifications have been developed by the World Health Organization, the US Food and Drug Administration and the European Medicines Agency. The FDA definition of high solubility is that the highest dose allowed for a single drug administration can be dissolved in 250 ml or less of an aqueous solution at 37℃ and pH1.0-7.5. The other regulatory bodies narrow the pH range to 1.2-6.8 or 1.0-6.8 [23]. In this study, we followed the classification from the World Health Organization.
The introduction of the dose coefficient into the classification prediction software improved the reliability and accuracy of the results. Some drugs have a maximum dose of 1000 (Cefmetazole), while some drugs have a maximum dose of only 0.003 (Alfacalcidol).
Moreover, small-molecule drugs have different molecular masses. In order to eliminate these two differences and make the results more reliable and accurate, we defined the concept of dose coefficient F as a standard for BCS secondary classification for the first time, which considers the ratio of the molecular mass of each small-molecule drug to the mass in the maximum dose.
This study failed to obtain the characteristic fingerprints of class I and class II drugs, which may be due to the fact that the constructed database is not large enough, such that the frequencies of fingerprints of various features are too low. Therefore, in future studies, it is necessary to continuously increase the number of small-molecule drugs in the database, so that the characteristic fingerprints obtained are more meaningful. Since the present study used only CLogP standards, future studies should include ALogP and KLogP [24] to make the research more extensive.

Conclusion
In this study, we used FP4 fingerprints to describe the chemical structures of drugs, and calculated IG values and f values as indicators of computer simulations. Furthermore, we used SVM as the evaluation criteria for the accuracy of BCS classification. The structural features of drugs in BCS classes III and IV were successfully obtained, including Secondary aliphaticamine, Michael_acceptor, Isothiourea, and Sulfonamide Sulfonic_derivatives. These structural features can be used for the classification and formulation of drugs in these two classes. We believe that with the increase in number of the exact classification of class I and class II drugs, the characteristic structures of the two types of drugs can be obtained successfully and further guide the development of new drugs.