DUBS: A Framework for Developing Directory of Useful Benchmarking Sets for Virtual Screening

Benchmarking is a crucial step in evaluating virtual screening methods for drug discovery. One major issue among benchmarking datasets is the lack of a standardized format for representing the protein and ligand structures used to benchmark a virtual screening method. To address this, we introduce the Directory of Useful Benchmarking Sets (DUBS) framework, a simple and flexible tool to rapidly create benchmarking sets from the Protein Data Bank. DUBS uses a simple text-based input format along with the Lemon data mining framework to efficiently access and organize data from the Protein Data Bank and to output commonly used inputs for virtual screening software. The simple input format used by DUBS allows users to define their own benchmarking datasets and access the corresponding information directly from the software package. Currently, DUBS takes less than two minutes to create a benchmark using this format. Since DUBS is driven by a simple Python script, users can easily modify it to create more complex benchmarks. We hope that DUBS will be a useful community resource that provides a standardized representation for benchmarking datasets in virtual screening.


Introduction
Small molecule protein docking is one of many essential tools applied in virtual screening pipelines for drug discovery. [1][2][3] The docking protocol produces a 3D small molecule ligand pose inside the binding cavity of a protein, representative of the ligand conformation that is co-crystallized with the protein target. Energetics of the binding site interactions of this pose are calculated to determine whether the small molecule binds to the protein target, as well as to estimate the binding affinity between the small molecule and the target. These calculations are used to virtually screen large libraries of synthetically feasible molecules to identify potential hits for specific targets. 4 The virtual screening community has developed several benchmarking datasets to evaluate how well different docking methods perform at these vital tasks. [5][6][7][8][9][10][11][12][13] One of the popular benchmarking sets is the Astex Diverse Benchmark 5, which is used to evaluate how well a given methodology can reproduce the crystal pose of a small molecule ligand in the bound (holo) form of a protein. Similarly, the PDBBind 14 and related CASF 11,15 datasets assess the ability of a docking method to produce and select a crystal-like pose for the small molecule ligand, as well as how well the methodology can rank the binding affinity of the small molecule. The most common metric used to identify a crystal-like pose is a root mean square deviation (RMSD) within 2.0 Å of the true crystal pose 16,17. The Directory of Useful Decoys (DUD) 8 and Directory of Useful Decoys Enhanced (DUD-E) 9 datasets provide decoy ligands which may not bind to the associated target proteins, to evaluate whether docking methods can distinguish binders from non-binders. Furthermore, the PINC is Not Cognate (PINC) 12 benchmark is designed to evaluate how well a methodology can 'cross dock' a given ligand.
Cross docking refers to a process where a ligand is docked using the holo structure of a target protein crystallized with a different ligand. Similarly, the Holo-Apo Set 2 (HAP2) 13 benchmark is designed to measure the ability of methods to reproduce the crystal structure of a bound ligand using the apo (or unbound) form of the target protein. In addition to benchmarking sets derived from known protein structures, the docking community has produced 'challenge sets' designed to rigorously validate a docking methodology in a blind manner. Two of these datasets are the Community Structure Activity Resource (CSAR) [18][19][20][21] and the Drug Design Data Resource (D3R), [22][23][24] which use data donated by the virtual screening community. Given the importance of these benchmarking sets as shown in the above works, it is clear that their continued development is important to improving and validating virtual screening pipelines.
The performance of any new virtual screening or docking methodology is tested on at least a few of the above benchmarks to gauge advancement in the field. Unfortunately, there are nearly as many ways to format the input structure files for docking as there are benchmarking sets, since the input file format specifications are not followed diligently. For example, the DUD 8 and DUD-E 9 benchmarks remove the element type for the atoms of the target proteins (a required component of the PDB file format as per http://www.wwpdb.org/documentation/file-format.php). Additionally, the PINC 12 benchmark only provides files in the MOL2 format with the protein atom names removed, causing difficulties when converting these files back to the PDB format. Finally, no reference inputs are provided for the Astex and HAP2 benchmarks, and these need to be rederived for any new method comparison. To support these benchmarking sets, developers must handle odd corner cases and other formatting issues, which takes additional development time better spent improving their scientific methodology. The differences in the structure of the targets, decoys, and ligands result in errors associated with high-quality evaluation of screening methods. 25 In general, attempts to use a non-standard version of these benchmarks may lead to differences in reported performance compared to prior publications. To alleviate these issues, we have created the Directory of Useful Benchmarking Sets (DUBS), which aims to provide a framework for curated and standardized versions of past and future benchmarking sets.
For DUBS, we chose to base our benchmarking framework on the highly standardized and widely accepted Macromolecular Transmission Format (MMTF) 26 for input, as these files are already curated by the RCSB and provide a standard way of representing atomic coordinates, inter-residue bonding, intra-residue bonding, and other important crystallographic information. Using MMTF data structures, the entire Protein Data Bank takes less than 10 gigabytes of space 27, allowing users to easily keep a copy of the PDB on their local hard drive. Furthermore, this is small enough to fit into the random access memory of modern workstations, enabling incredibly quick processing times compared to other formats. Our recently published Lemon framework 28 is used for rapid processing, with simple Python scripts generating suitable inputs for docking software evaluation. These files are written using the Chemfiles 29 input/output library, which supports reading and writing a variety of formats in a highly standards-compliant manner. We have chosen the PDB file format for protein input, as this format is standard in the docking community, and we prefer the SDF file format for small molecule input, as it allows for the storage of formal charge instead of partial charge. We did not want to include partial charges because such parameters can bias the performance of a virtual screening methodology [30][31][32] and therefore should be handled with care for each docking simulation. As with the input file types, DUBS can output ligands (and proteins) in a variety of formats using the Chemfiles 29 input/output library, including SDF, Tinker XYZ, PDB, CML, ARC, CSSR, GRO, MOL2, CIF, MMTF, and SMI. DUBS can be used to standardize existing benchmarking datasets as well as to rapidly create new user-defined benchmarks for virtual screening applications.

Methods
DUBS is designed to work with a specific input format that can incorporate different types of information to develop user-defined benchmarking sets. An overview of the DUBS pipeline is given in Figure 1a, where an input file formatted in this manner is parsed using the provided Python script, referred to as the parser henceforth. Currently, the input format for DUBS is built around seven tags that are used by the parser to identify different types of information in the input file. This information is stored in nine dictionaries, which keep track of the alignment of each benchmark protein to a reference protein (optional) and the pairing of ligands to their respective proteins. To better describe the input format and options, an example 'block' of the DUBS input format is shown in Listing 1 for hivproteasea2 from the PINC benchmarking set, and the program options for each tag are provided in Listing 2 for developing new benchmarking sets. The input files for several existing benchmarks, such as PDBbind, Astex, PINC, HAP2, and DUD-E, are provided at https://github.com/chopralab/lemon/tree/dubs/ and as Supporting Information files.
Some benchmarking sets such as PINC and HAP2 require the use of protein alignment, as they are designed to evaluate docking performance on non-native protein conformations. To address this need, a customized version of the TM-align algorithm 33 is implemented in the Lemon framework 28 to allow for fast and accurate alignment between crystal structures of the same target protein. Briefly, this algorithm matches residues between the reference and non-reference crystal structures and attempts to find the affine transformation which minimizes the distance between the alpha carbons of the matched residues.
In contrast to the original algorithm, the Lemon implementation incorporates the chain name of the residues in addition to their standardized residue ID. Additionally, this algorithm makes use of an optimized version of the Kabsch algorithm 34 to improve performance.
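The core superposition step can be sketched as follows. This is a generic, minimal Kabsch implementation in Python/NumPy for illustration only, not the optimized version used inside Lemon:

```python
import numpy as np

def kabsch_rotation(P, Q):
    """Return the rotation matrix R that best maps centered P onto centered Q."""
    H = P.T @ Q                      # 3x3 covariance of the centered coordinates
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.diag([1.0, 1.0, d])       # guard against an improper rotation (reflection)
    return Vt.T @ D @ U.T

def aligned_rmsd(P, Q):
    """RMSD between two point sets (e.g. matched alpha carbons) after
    optimal translation and rotation."""
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)
    R = kabsch_rotation(Pc, Qc)
    diff = Pc @ R.T - Qc
    return float(np.sqrt((diff ** 2).sum() / len(P)))
```

In practice the point sets would be the alpha-carbon coordinates of the residues matched by the TM-align step described above.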
In the input file format used by DUBS, the first tag is the reference protein tag, denoted "@<reference>". This tag informs the parser that information on the reference protein will be read in on the next line. This line must contain the PDB ID of the reference protein and the file path of an MMTF (or other Chemfiles-supported) file for that protein, separated by a single space. The line may also include the PDB chemical ID of the ligand bound to the reference protein so that it can be removed before alignment. This ligand code must be placed after the file path, separated by a single space. It should be noted that only one reference protein is allowed per reference tag and that the reference MMTF (or other Chemfiles-supported) file must be stored locally. The parser then saves the reference PDB ID and path as a key-value pair in the file path dictionary and the ligand to be removed in a separate dictionary with the reference PDB ID as the key. If a reference file is used but the PDB ID of the reference is not known, place a * followed by a unique number in place of the PDB ID (e.g., *1, *2, *3). This allows for proper storage of the file path and corresponding alignment information for the protein in the appropriate dictionaries.
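For example, a reference entry might look like the following. The PDB ID, file path, and ligand code here are invented for illustration; the second entry shows the placeholder form used when the reference PDB ID is unknown:

```
@<reference>
1ABC ./structures/1abc.mmtf LIG

@<reference>
*1 ./structures/unknown_reference.mmtf
```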
The second tag is the aligned protein tag, denoted "@<align_prot>". This tag informs the parser that information on proteins to be aligned to the reference protein will be read on multiple subsequent lines. These lines should contain the PDB IDs of any proteins planned for alignment and the PDB chemical IDs of any ligands that need to be removed before alignment, separated by a single space. It should be noted that a corresponding file is not needed for these proteins. Each of these PDB IDs is stored in a list in the alignment protein dictionary with the reference PDB ID as the key. The ligands to be removed are stored in a separate dictionary with the PDB ID of the corresponding protein as the key. As an example, this tag was used in the input file for the PINC dataset (Listing 1). If no PDB IDs are provided but the tag is present (Listing 2), DUBS will not perform alignment, and the docking benchmark will be a cognate set where each ligand is docked into its co-crystal structure.
The third and fourth tags are the small molecule (sm) and non-small molecule ligand alignment tags, denoted "@<align_sm_ligand>" and "@<align_non_sm_ligand>", respectively. These tags inform the parser that information on various proteins and their respective ligands will be read in on the subsequent lines. The fifth tag is the end tag, denoted "@<end>". This tag informs the parser that no additional information will be read and stored for the current reference protein until the next reference tag is detected. This allows comments to be written in the file outside of each block of information at the user's discretion. The sixth and seventh tags are the no-alignment small molecule and non-small molecule ligand tags, denoted "@<no_align_sm_ligands>" and "@<no_align_non_sm_ligands>", respectively.
These tags are used when no reference protein is given. In the input file, the subsequent lines should have the same format as their respective @<align_...> counterparts. If the alignment features of DUBS are to be used, the alignment tags must appear between the reference tag and the end tag. The protein and ligand alignment tags used in between need not appear in any particular order, and may be omitted entirely if they are not needed. For cases where a reference protein is not used, simply place any ligands after their respective no-alignment tags. Any user comments added to the input file must come before the first tag, or between an end tag and a reference tag. After parsing the input data and storing it in the appropriate dictionaries, the Lemon framework 28 is used to perform alignment and scoring on the benchmarking data to produce standardized docking output files.
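The tag-and-dictionary scheme described above can be sketched in a few lines of Python. The tag names follow the text; the PDB IDs, file path, and ligand codes in the embedded sample are invented, and the real DUBS parser handles more tags and edge cases:

```python
# Minimal sketch of a DUBS-style input parser (illustrative only).
SAMPLE = """\
@<reference>
1ABC ./structures/1abc.mmtf LIG
@<align_prot>
2XYZ ATP
3DEF
@<end>
"""

def parse_blocks(text):
    paths, ref_ligands = {}, {}          # reference PDB ID -> file path / bound ligand
    align_prots, align_ligands = {}, {}  # reference -> aligned proteins; protein -> ligand to strip
    ref, tag = None, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("@<"):        # a tag switches the parser's mode
            tag = line
            continue
        fields = line.split()
        if tag == "@<reference>":
            ref = fields[0]
            paths[ref] = fields[1]
            if len(fields) > 2:          # optional ligand to remove before alignment
                ref_ligands[ref] = fields[2]
            align_prots[ref] = []
        elif tag == "@<align_prot>":
            align_prots[ref].append(fields[0])
            if len(fields) > 1:
                align_ligands[fields[0]] = fields[1]
        # lines after @<end> are ignored until the next tag (free-form comments)
    return paths, ref_ligands, align_prots, align_ligands
```

A single pass over the file is enough because each tag simply switches which dictionary the following lines are stored in.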
The DUBS software was developed with Python 3.6 and gcc 6.3.0. It is freely available for installation via the Python Package Index (PyPI). All standardized benchmarking sets are available for download directly from GitHub (https://github.com/chopralab/lemon/tree/dubs/). The source code for the Lemon framework, its API, and documentation are available on GitHub at https://chopralab.github.io/lemon/latest/.

Results and Discussion
We showcase the ability of DUBS to reproduce and standardize previously published benchmarking sets.
The input file described above was created for six well-established benchmarking sets that derive from the Protein Data Bank (see Listings S2-S7). We measured the time required to generate these benchmarking sets with the Hadoop copy of the Protein Data Bank stored in RAM, using a single CPU core. We generated the benchmarking inputs 50 times for each benchmark to calculate a mean and standard deviation for each calculation. We measured the total time taken by the application as well as the 'user' and 'system' time, which represent the time spent by the application itself and the time spent accessing hardware and allocating memory, respectively. These results are shown in Figure 1b and indicate that DUBS is able to produce a benchmarking set in less than two minutes. The most computationally expensive benchmark to generate is the PDBBind Refined set, simply due to the number of ligands that must be written as output files by the software. The next most computationally expensive benchmark is PINC, due to the large number of alignments required to generate the benchmark: 949 ligands as well as 60 proteins need to be aligned. It should be noted that the application spent a greater proportion of its time in 'user' mode for this benchmark than in 'system' mode.
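The distinction between wall, 'user', and 'system' time can be reproduced with a short Python sketch using only the standard library; the workload below is a stand-in for illustration, not an actual DUBS benchmark run:

```python
import os
import time

def timed(workload):
    """Run `workload` and return (wall, user, system) seconds: total elapsed
    time, CPU time spent executing the program's own code, and CPU time spent
    in the kernel (e.g. writing output files, allocating memory)."""
    t0, w0 = os.times(), time.perf_counter()
    workload()
    t1, w1 = os.times(), time.perf_counter()
    return w1 - w0, t1.user - t0.user, t1.system - t0.system

# Stand-in CPU-bound workload; alignment-heavy benchmarks like PINC would
# show most of their cost in the 'user' component, as reported in Figure 1b.
wall, user, system = timed(lambda: sum(i * i for i in range(10 ** 6)))
```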