Predicting the performance of fingerprint similarity searching

Methods Mol Biol. 2011:672:159-73. doi: 10.1007/978-1-60761-839-3_6.

Abstract

Fingerprints are bit string representations of molecular structure that typically encode structural fragments, topological features, or pharmacophore patterns. Various fingerprint designs are used in virtual screening, and their search performance depends essentially on three parameters: the nature of the fingerprint, the active compounds serving as reference molecules, and the composition of the screening database. Predicting the performance of fingerprint similarity searching is therefore of considerable interest and practical relevance. A quantitative assessment of the likelihood that a fingerprint search will retrieve active compounds, if any are present in the screening database, would substantially help in selecting the type of fingerprint most suitable for a given search problem. The method presented herein uses concepts from information theory to relate the fingerprint feature distributions of reference compounds to those of screening libraries. If these feature distributions do not differ sufficiently, active database compounds that are similar to reference molecules cannot be retrieved because they disappear in the "background." By quantifying the difference in feature distributions using the Kullback-Leibler divergence and relating the divergence to compound recovery rates obtained for different benchmark classes, fingerprint search performance can be quantitatively predicted.
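
The central quantity named in the abstract, the Kullback-Leibler divergence between the feature distribution of the reference compounds and that of the screening database, can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes that each fingerprint bit is an independent Bernoulli variable and that bit frequencies are estimated with a Laplace pseudocount. The function names (bit_frequencies, kl_divergence_bernoulli) and the toy fingerprints are hypothetical and serve only as illustration.

```python
import math


def bit_frequencies(fingerprints, pseudocount=1.0):
    """Estimate the per-bit 'on' frequency of a set of fingerprints.

    A Laplace pseudocount (an assumption of this sketch) keeps every
    frequency strictly between 0 and 1 so the logarithms stay finite.
    """
    n = len(fingerprints)
    n_bits = len(fingerprints[0])
    return [
        (sum(fp[i] for fp in fingerprints) + pseudocount) / (n + 2.0 * pseudocount)
        for i in range(n_bits)
    ]


def kl_divergence_bernoulli(p, q):
    """KL divergence D(P || Q) in nats.

    Assumes independent Bernoulli bits, so the divergence is the sum of
    the per-bit divergences.
    """
    total = 0.0
    for pi, qi in zip(p, q):
        total += pi * math.log(pi / qi)
        total += (1.0 - pi) * math.log((1.0 - pi) / (1.0 - qi))
    return total


# Toy 4-bit fingerprints (made up for illustration, not real compounds).
reference = [[1, 1, 0, 0], [1, 1, 0, 1], [1, 0, 0, 0]]
database = [[0, 1, 1, 0], [1, 0, 1, 1], [0, 0, 1, 0], [0, 1, 0, 1]]

p_ref = bit_frequencies(reference)
q_db = bit_frequencies(database)

# Divergence of the reference distribution from the database "background".
divergence = kl_divergence_bernoulli(p_ref, q_db)

# Identical distributions give zero divergence: nothing stands out.
self_divergence = kl_divergence_bernoulli(q_db, q_db)

print(f"D(reference || database) = {divergence:.4f} nats")
print(f"D(database || database)  = {self_divergence:.4f} nats")
```

In this sketch, a divergence near zero corresponds to the case in which active compounds disappear in the background, while larger values indicate that the reference set is more clearly distinguishable from the database. Relating such values to compound recovery rates would additionally require benchmark data, which is not reproduced here.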

MeSH terms

  • Artificial Intelligence
  • Bayes Theorem
  • Benchmarking
  • Computing Methodologies*
  • Databases, Factual
  • Forecasting
  • Information Theory
  • Ligands
  • Linear Models
  • Molecular Structure
  • Pharmaceutical Preparations / chemistry
  • Pharmaceutical Preparations / classification
  • Structure-Activity Relationship

Substances

  • Ligands
  • Pharmaceutical Preparations