Written for drug developers rather than computer scientists, this monograph adopts a systematic approach to mining scientific data sources, covering all key steps in rational drug discovery, from compound screening to lead compound selection and personalized medicine. The book is clearly divided into four sections: the first discusses the different data sources available, both commercial and non-commercial, while the second looks at the role and value of data mining in drug discovery. The third compares the most common applications and strategies for polypharmacology, where data mining can substantially enhance the research effort. The final section is devoted to systems biology approaches for compound testing.
Throughout the book, industrial and academic drug discovery strategies are addressed, with contributors coming from both areas, enabling an informed decision on when and which data mining tools to use for one's own drug discovery project.
Number of pages: 575
Year of publication: 2013
Contents
Cover
Methods and Principles in Medicinal Chemistry
Title Page
Copyright
List of Contributors
Preface
A Personal Foreword
Part One: Data Sources
Chapter 1: Protein Structural Databases in Drug Discovery
1.1 The Protein Data Bank: The Unique Public Archive of Protein Structures
1.2 PDB-Related Databases for Exploring Ligand–Protein Recognition
1.3 The sc-PDB, a Collection of Pharmacologically Relevant Protein–Ligand Complexes
1.4 Conclusions
References
Chapter 2: Public Domain Databases for Medicinal Chemistry
2.1 Introduction
2.2 Databases of Small Molecule Binding and Bioactivity
2.3 Trends in Medicinal Chemistry Data
2.4 Directions
2.5 Summary
Acknowledgments
References
Chapter 3: Chemical Ontologies for Standardization, Knowledge Discovery, and Data Mining
3.1 Introduction
3.2 Background
3.3 Chemical Ontologies
3.4 Standardization
3.5 Knowledge Discovery
3.6 Data Mining
3.7 Conclusions
Acknowledgments
References
Chapter 4: Building a Corporate Chemical Database Toward Systems Biology
4.1 Introduction
4.2 Setting the Scene
4.3 Dealing with Chemical Structures
4.4 Increased Accuracy of the Registration of Data
4.5 Implementation of the Platform
4.6 Linking Chemical Information to Analytical Data
4.7 Linking Chemicals to Bioactivity Data
4.8 Conclusions
Acknowledgment
References
Part Two: Analysis and Enrichment
Chapter 5: Data Mining of Plant Metabolic Pathways
5.1 Introduction
5.2 Pathway Representation
5.3 Pathway Management Platforms
5.4 Obtaining Pathway Information
5.5 Constructing Organism-Specific Pathway Databases
5.6 Conclusions
References
Chapter 6: The Role of Data Mining in the Identification of Bioactive Compounds via High-Throughput Screening
6.1 Introduction to the HTS Process: the Role of Data Mining
6.2 Relevant Data Architectures for the Analysis of HTS Data
6.3 Analysis of HTS Data
6.4 Identification of New Compounds via Compound Set Enrichment and Docking
6.5 Conclusions
Acknowledgments
References
Chapter 7: The Value of Interactive Visual Analytics in Drug Discovery: An Overview
7.1 Creating Informative Visualizations
7.2 Lead Discovery and Optimization
7.3 Genomics
References
Chapter 8: Using Chemoinformatics Tools from R
8.1 Introduction
8.2 System Call
8.3 Shared Library Call
8.4 Wrapping
8.5 Java Archives
8.6 Conclusions
References
Part Three: Applications to Polypharmacology
Chapter 9: Content Development Strategies for the Successful Implementation of Data Mining Technologies
9.1 Introduction
9.2 Knowledge Challenges in Drug Discovery
9.3 Case Studies
9.4 Knowledge-Based Data Mining Technologies
9.5 Future Trends and Outlook
References
Chapter 10: Applications of Rule-Based Methods to Data Mining of Polypharmacology Data Sets
10.1 Introduction
10.2 Materials and Methods
10.3 Results
10.4 Discussion
10.5 Conclusions
References
Chapter 11: Data Mining Using Ligand Profiling and Target Fishing
11.1 Introduction
11.2 In Silico Ligand Profiling Methods
11.3 Summary and Conclusions
References
Part Four: Systems Biology Approaches
Chapter 12: Data Mining of Large-Scale Molecular and Organismal Traits Using an Integrative and Modular Analysis Approach
12.1 Rapid Technological Advances Revolutionize Quantitative Measurements in Biology and Medicine
12.2 Genome-Wide Association Studies Reveal Quantitative Trait Loci
12.3 Integration of Molecular and Organismal Phenotypes Is Required for Understanding Causative Links
12.4 Reduction of Complexity of High-Dimensional Phenotypes in Terms of Modules
12.5 Biclustering Algorithms
12.6 Ping-Pong Algorithm
12.7 Module Commonalities Provide Functional Insights
12.8 Module Visualization
12.9 Application of Modular Analysis Tools for Data Mining of Mammalian Data Sets
12.10 Outlook
References
Chapter 13: Systems Biology Approaches for Compound Testing
13.1 Introduction
13.2 Step 1: Design Experiment for Data Production
13.3 Step 2: Compute Systems Response Profiles
13.4 Step 3: Identify Perturbed Biological Networks
13.5 Step 4: Compute Network Perturbation Amplitudes
13.6 Step 5: Compute the Biological Impact Factor
13.7 Conclusions
References
Index
Methods and Principles in Medicinal Chemistry
Edited by R. Mannhold, H. Kubinyi, G. Folkers
Editorial Board
H. Buschmann, H. Timmerman, H. van de Waterbeemd, T. Wieland
Previous Volumes of this Series:
Dömling, Alexander (Ed.)
Protein-Protein Interactions in Drug Discovery
2013
ISBN: 978-3-527-33107-9
Vol. 56
Kalgutkar, Amit S./Dalvie, Deepak/ Obach, R. Scott/Smith, Dennis A.
Reactive Drug Metabolites
2012
ISBN: 978-3-527-33085-0
Vol. 55
Brown, Nathan (Ed.)
Bioisosteres in Medicinal Chemistry
2012
ISBN: 978-3-527-33015-7
Vol. 54
Gohlke, Holger (Ed.)
Protein-Ligand Interactions
2012
ISBN: 978-3-527-32966-3
Vol. 53
Kappe, C. Oliver/Stadler, Alexander/ Dallinger, Doris
Microwaves in Organic and Medicinal Chemistry
Second, Completely Revised and Enlarged Edition
2012
ISBN: 978-3-527-33185-7
Vol. 52
Smith, Dennis A./Allerton, Charlotte/ Kalgutkar, Amit S./van de Waterbeemd, Han/Walker, Don K.
Pharmacokinetics and Metabolism in Drug Design
Third, Revised and Updated Edition
2012
ISBN: 978-3-527-32954-0
Vol. 51
De Clercq, Erik (Ed.)
Antiviral Drug Strategies
2011
ISBN: 978-3-527-32696-9
Vol. 50
Klebl, Bert/Müller, Gerhard/Hamacher, Michael (Eds.)
Protein Kinases as Drug Targets
2011
ISBN: 978-3-527-31790-5
Vol. 49
Sotriffer, Christoph (Ed.)
Virtual Screening
Principles, Challenges, and Practical Guidelines
2011
ISBN: 978-3-527-32636-5
Vol. 48
Rautio, Jarkko (Ed.)
Prodrugs and Targeted Delivery
Towards Better ADME Properties
2011
ISBN: 978-3-527-32603-7
Vol. 47
All books published by Wiley-VCH are carefully produced. Nevertheless, authors, editors, and publisher do not warrant the information contained in these books, including this book, to be free of errors. Readers are advised to keep in mind that statements, data, illustrations, procedural details or other items may inadvertently be inaccurate.
Library of Congress Card No.: applied for
British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library.
Bibliographic information published by the Deutsche Nationalbibliothek
The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet at http://dnb.d-nb.de.
© 2014 Wiley-VCH Verlag GmbH & Co. KGaA, Boschstr. 12, 69469 Weinheim, Germany
All rights reserved (including those of translation into other languages). No part of this book may be reproduced in any form – by photoprinting, microfilm, or any other means – nor transmitted or translated into a machine language without written permission from the publishers. Registered names, trademarks, etc. used in this book, even when not specifically marked as such, are not to be considered unprotected by law.
Print ISBN: 978-3-527-32984-7
ePDF ISBN: 978-3-527-65601-1
ePub ISBN: 978-3-527-65600-4
mobi ISBN: 978-3-527-65599-1
oBook ISBN: 978-3-527-65598-4
List of Contributors
Mohammad Afshar
Ariana Pharma
28 rue Docteur Finlay
75015 Paris
France
Kamal Azzaoui
Novartis Institutes for Biomedical Research (NIBR/CPC/iSLD)
Forum 1 Novartis Campus
4056 Basel
Switzerland
Igor I. Baskin
Strasbourg University
Faculty of Chemistry
UMR 7177 CNRS
1 rue Blaise Pascal
67000 Strasbourg
France
and
MV Lomonosov Moscow State University
Leninsky Gory
119992 Moscow
Russia
James N.D. Battey
Philip Morris International R&D
Biological Systems Research
Quai Jeanrenaud 5
2000 Neuchâtel
Switzerland
Sven Bergmann
Université de Lausanne
Department of Medical Genetics
Rue du Bugnon 27
1005 Lausanne
Switzerland
Sharon D. Bryant
Inte:Ligand GmbH
Clemens Maria Hofbauer-Gasse 6
2344 Maria Enzersdorf
Austria
Allen Cornett
Novartis Institutes for Biomedical Research (NIBR/DMP)
220 Massachusetts Avenue
Cambridge, MA 02139
USA
Renée Deehan
Selventa
One Alewife Center
Cambridge, MA 02140
USA
David A. Drubin
Selventa
One Alewife Center
Cambridge, MA 02140
USA
Christof Gaenzler
TIBCO Software Inc.
1235 Westlake Drive, Suite 210
Berwyn, PA 19312
USA
Michael Gilson
University of California
San Diego
Skaggs School of Pharmacy and Pharmaceutical Sciences
9500 Gilman Drive
La Jolla, CA 92093
USA
Janna Hastings
European Bioinformatics Institute
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
UK
Julia Hoeng
Philip Morris International R&D
Biological Systems Research
Quai Jeanrenaud 5
2000 Neuchâtel
Switzerland
Nikolai V. Ivanov
Philip Morris International R&D
Biological Systems Research
Quai Jeanrenaud 5
2000 Neuchâtel
Switzerland
Edgar Jacoby
Janssen Research & Development
Turnhoutseweg 30
2340 Beerse
Belgium
Jeremy L. Jenkins
Novartis Institutes for Biomedical Research (NIBR/DMP)
220 Massachusetts Avenue
Cambridge, MA 02139
USA
Nathalie Jullian
Ariana Pharma
28 rue Docteur Finlay
75015 Paris
France
Esther Kellenberger
UMR 7200 CNRS-UdS
Structural Chemogenomics
74 route du Rhin
67400 Illkirch
France
Thierry Langer
Prestwick Chemical SAS
220, Blvd. Gonthier d'Andernach
67400 Illkirch-Strasbourg
France
Tiqing Liu
University of California
San Diego
Skaggs School of Pharmacy and Pharmaceutical Sciences
9500 Gilman Drive
La Jolla, CA 92093
USA
Gilles Marcou
Strasbourg University
Faculty of Chemistry
UMR 7177 CNRS
1 rue Blaise Pascal
67000 Strasbourg
France
and
MV Lomonosov Moscow State University
Leninsky Gory
119992 Moscow
Russia
Elyette Martin
Philip Morris International R&D
Quai Jeanrenaud 5
2000 Neuchâtel
Switzerland
Florian Martin
Philip Morris International R&D
Biological Systems Research
Quai Jeanrenaud 5
2000 Neuchâtel
Switzerland
Aurélien Monge
Philip Morris International R&D
Quai Jeanrenaud 5
2000 Neuchâtel
Switzerland
David Mosenkis
TIBCO Software Inc.
1235 Westlake Drive, Suite 210
Berwyn, PA 19312
USA
George Nicola
University of California San Diego
Skaggs School of Pharmacy and Pharmaceutical Sciences
9500 Gilman Drive
La Jolla, CA 92093
USA
Florian Nigsch
Novartis Institutes for Biomedical Research (NIBR)
CPC/LFP/MLI
4002 Basel
Switzerland
Manuel C. Peitsch
Philip Morris International R&D
Biological Systems Research
Quai Jeanrenaud 5
2000 Neuchâtel
Switzerland
Maxim Popov
Novartis Institutes for Biomedical Research (NIBR/CPC/iSLD)
Forum 1 Novartis Campus
4056 Basel
Switzerland
Pavel Pospisil
Philip Morris International R&D
Quai Jeanrenaud 5
2000 Neuchâtel
Switzerland
John P. Priestle
Novartis Institutes for Biomedical Research (NIBR/CPC/iSLD)
Forum 1 Novartis Campus
4056 Basel
Switzerland
Josep Prous Jr.
Prous Institute for Biomedical Research
Research and Development
Rambla Catalunya 135
08008 Barcelona
Spain
Jordi Quintana
Parc Científic Barcelona (PCB)
Drug Discovery Platform
Baldiri Reixac 4
08028 Barcelona
Spain
Didier Rognan
UMR 7200 CNRS-UdS
Structural Chemogenomics
74 route du Rhin
67400 Illkirch
France
Ansgar Schuffenhauer
Novartis Institutes for Biomedical Research (NIBR/CPC/iSLD)
Forum 1 Novartis Campus
4056 Basel
Switzerland
Alain Sewer
Philip Morris International R&D
Biological Systems Research
Quai Jeanrenaud 5
2000 Neuchâtel
Switzerland
Christoph Steinbeck
European Bioinformatics Institute
Wellcome Trust Genome Campus
Hinxton, Cambridge CB10 1SD
UK
Ty M. Thomson
Selventa
Cambridge, MA 02140
USA
Yannic Tognetti
Ariana Pharma
28 rue Docteur Finlay
75015 Paris
France
Antoni Valencia
Prous Institute for Biomedical Research, SA
Computational Modeling
Rambla Catalunya 135
08008 Barcelona
Spain
Thibault Varin
Eli Lilly and Company
Lilly Research Laboratories
Lilly Corporate Center
Indianapolis, IN 46285
USA
Jurjen W. Westra
Selventa
Cambridge, MA 02140
USA
Preface
In general, the extraction of information from databases is called data mining. A database is a data collection that is organized in a way that allows easy accessing, managing, and updating its contents. Data mining comprises numerical and statistical techniques that can be applied to data in many fields, including drug discovery. A functional definition of data mining is the use of numerical analysis, visualization, or statistical techniques to identify nontrivial numerical relationships within a data set to derive a better understanding of the data and to predict future results. Through data mining, one derives a model that relates a set of molecular descriptors to biological key attributes such as efficacy or ADMET properties. The resulting model can be used to predict key property values of new compounds, to prioritize them for follow-up screening, and to gain insight into the compounds' structure–activity relationship. Data mining models range from simple, parametric equations derived from linear techniques to complex, nonlinear models derived from nonlinear techniques. More detailed information is available in literature [1–7].
This book is organized into four parts. Part One deals with different sources of data used in drug discovery, for example, protein structural databases and the main small-molecule bioactivity databases.
Part Two focuses on different ways for data analysis and data enrichment. Here, an industrial insight into mining HTS data and identifying hits for different targets is presented. Another chapter demonstrates the strength of powerful data visualization tools for simplification of these data, which in turn facilitates their interpretation.
Part Three comprises some applications to polypharmacology. For instance, it describes the positive outcomes that data mining can produce for ligand profiling and target fishing in the chemogenomics era.
Finally, in Part Four, systems biology approaches are considered. For example, the reader is introduced to integrative and modular analysis approaches to mine large molecular and phenotypical data. It is shown how the presented approaches can reduce the complexity of the rising amount of high-dimensional data and provide a means for integrating different types of omics data. In another chapter, a set of novel methods is established that quantitatively measure the biological impact of chemicals on biological systems.
The series editors are grateful to Remy Hoffmann, Arnaud Gohier, and Pavel Pospisil for organizing this book and for the opportunity to work with such excellent authors. Last but not least, we thank Frank Weinreich and Heike Nöthe from Wiley-VCH for their valuable contributions to this project and to the entire book series.
Düsseldorf, Weisenheim am Sand, Zürich, May 2013
Raimund Mannhold, Hugo Kubinyi, Gerd Folkers
References
1. Cruciani, G., Pastor, M., and Mannhold, R. (2002) Suitability of molecular descriptors for database mining: a comparative analysis. Journal of Medicinal Chemistry, 45, 2685–2694.
2. Obenshain, M.K. (2004) Application of data mining techniques to healthcare data. Infection Control and Hospital Epidemiology, 25, 690–695.
3. Weaver, D.C. (2004) Applying data mining techniques to library design, lead generation and lead optimization. Current Opinion in Chemical Biology, 8, 264–270.
4. Yang, Y., Adelstein, S.J., and Kassis, A.I. (2009) Target discovery from data mining approaches. Drug Discovery Today, 14, 147–154.
5. Campbell, S.J., Gaulton, A., Marshall, J., Bichko, D., Martin, S., Brouwer, C., and Harland, L. (2010) Visualizing the drug target landscape. Drug Discovery Today, 15, 3–15.
6. Geppert, H., Vogt, M., and Bajorath, J. (2010) Current trends in ligand-based virtual screening: molecular representations, data mining methods, new application areas, and performance evaluation. Journal of Chemical Information and Modeling, 50, 205–216.
7. Hasan, S., Bonde, B.K., Buchan, N.S., and Hall, M.D. (2012) Network analysis has diverse roles in drug discovery. Drug Discovery Today, 17, 869–874.
A Personal Foreword
The term data mining is well recognized by many scientists and is often used when referring to techniques for advanced data retrieval and analysis. However, since there have been recent advances in techniques for data mining applied to the discovery of drugs and bioactive molecules, assembling these chapters from experts in the field has led to a realization that depending upon the field of interest (biochemistry, computational chemistry, and biology), data mining has a variety of aspects and objectives.
Coming from the ligand molecule world, one can state that the understanding of chemical data is more complete because, in principle, chemistry is governed by physicochemical properties of small molecules and our “microscopic” knowledge in this domain has advanced considerably over the past decades. Moreover, chemical data management has become relatively well established and is now widely used. In this respect, data mining consists of a thorough retrieval and analysis of data coming from different sources (but mainly from literature), followed by a thorough cleaning of the data and their organization into compound databases. These methods have helped the scientific community for several decades to address pathological effects related to simple (single target) biological problems. Today, however, it is widely accepted that many diseases can only be tackled by modulating the ligand's biological/pharmacological profile, that is, its “molecular phenotype.” These approaches require novel methodologies and, due to increased accessibility to high computational power, data mining is definitely one of them.
Coming from the biology world, the perception of data mining differs slightly. It is not just a matter of literature text mining anymore, since the disease itself, as well as the clinical or phenotypical observations, may be used as a starting point. Due to the complexity of human biology, biologists start with hypotheses based upon empirical observations, create plausible disease models, and search for possible biological targets. For successful drug discovery, these targets need to be druggable. Moreover, modern systems biology approaches take into account the full set of genes and proteins expressed in the drug environment (omics), which can be used to generate biological network information. Data mining these data, when structured into such networks, will provide interpretable information that leads to an increased knowledge of the biological phenomenon. Logically, such novel data mining methods require new and more sophisticated algorithms.
This book aims to cover (in a nonexhaustive manner) the data mining aspects for these two parallel but meant-to-be-convergent fields, which should not only give the reader an idea of the existence of different data mining approaches, algorithms, and methods used but also highlight some elements to assess the importance of linking ligand molecules to diseases. However, there is awareness that there is still a long way to go in terms of gathering, normalizing, and integrating relevant biological and pharmacological data, which is an essential prerequisite for making more accurate simulations of compound therapeutic effects.
This book is structured into four parts: Part One, Data Sources, introduces the reader to the different sources of data used in drug discovery. In Chapter 1, Kellenberger et al. present the Protein Data Bank and related databases for exploring ligand–protein recognition and its application in drug design. Chapter 2 by Nicola et al. is a reprint of a recently published article in Journal of Medicinal Chemistry (2012, 55 (16): 6987–7002) that nicely presents the main small-molecule bioactivity databases currently used in medicinal chemistry and the modern trends for their exploitation. In Chapter 3, Hastings et al. point out the importance of chemical ontologies for the standardization of chemical libraries in order to extract and organize chemical knowledge in a way similar to biological ontologies. Chapter 4 by Martin et al. presents the importance of a corporate chemical registry system as a central repository for uniform chemical entities (including their spectrometric data) and as an important point of entry for exploring public compound activity databases for systems biology data.
Part Two, Analysis and Enrichment, describes different ways for data analysis and data enrichment. In Chapter 5, Battey et al. didactically present the basics of plant pathway construction, the potential for their use in data mining, and the prediction of pathways using information from an enzymatic structure. Even though this chapter deals with plant pathways, the information can be readily interpreted and applied directly to metabolic pathways in humans. In Chapter 6, Azzaoui et al. present an industrial insight into mining HTS data and identifying hits for different targets and the associated challenges and pitfalls. In Chapter 7, Mosenkis et al. clearly demonstrate, using different examples, how powerful data visualization tools are key to the simplification of complex results, making them readily intelligible to the human brain and eye. We also welcome Chapter 8 by Marcou et al. that provides a concrete example of the increasingly frequent need for powerful statistical processing tools. This is exemplified by the use of R in the chemoinformatics process. Readers will note that this chapter is built like a tutorial for the R language in order to process, cluster, and visualize molecules, which is demonstrated by its application to a concrete example. For programmers, this may serve as an initiation to the use of this well-known bioinformatics tool for processing chemical information.
Part Three, Applications to Polypharmacology, contains chapters detailing tools and methods to mine data with the aim of elucidating preclinical profiles of small molecules and selecting potential new drug targets. In Chapter 9, Prous et al. nicely present three examples of knowledge bases that attempt to relate, in a comprehensive manner, the interactions between chemical compounds, biological entities (molecules and pathways), and their assays. The second part of this chapter presents the challenges that these knowledge-based data mining methodologies face when searching for potential mechanisms of action of compounds. In Chapter 10, Jullian et al. introduce the reader to the advantages of using rule-based methods when exploring polypharmacological data sets, compared to standard numerical approaches, and their application in the development of novel ligands. Finally, in Chapter 11, Bryant et al. familiarize us with the positive outcomes that data mining can produce for ligand profiling and target fishing in the chemogenomics era. The authors show how searching through ligand and target pharmacophoric structural and descriptor spaces can help to design or extend libraries of ligands with desired pharmacological, yet lowered toxicological, properties.
In Part Four, Systems Biology Approaches, we are pleased to include two exciting chapters coming from the biological world. In Chapter 12, Bergmann introduces us to integrative and modular analysis approaches to mine large molecular and phenotypical data. The author argues that the presented approaches can reduce the complexity of the rising amount of high-dimensional data and provide a means of integrating different types of omics data. Moreover, astute integration is required for the understanding of causative links and the generation of more predictive models. Finally, in the very robust Chapter 13, Sewer et al. present systems biology-based approaches and establish a set of novel methods that quantitatively measure the biological impact of chemicals on biological systems. These approaches incorporate methods that use mechanistic causal biological network models, built on systems-wide omics data, to identify any compound's mechanism of action and assess its biological impact at the pharmacological and toxicological level. Using a five-step strategy, the authors clearly provide a framework for the identification of biological networks that are perturbed by short-term exposure to chemicals. The quantification of such perturbation using their newly introduced impact factor “BIF” then provides an immediately interpretable assessment of such impact and enables observations of early effects to be linked with long-term health impacts.
We are pleased that you have selected this book and hope that you find the content both enjoyable and educational. As many authors have accompanied their chapters with clear, concise pictures, and as someone once said “one figure can bear thousand words,” this Personal Foreword also contains a figure (see below). We believe that the novel applications of data mining presented in these pages by authors coming from both chemical and biological communities will provide the reader with more insight into how to reshape this pyramid into a trapezoidal form, with an enlarged knowledge area. Thus, improved data processing techniques leading to the generation of readily interpretable information, together with an increased understanding of the therapeutic processes, will enable scientists to make wiser decisions regarding what to do next in their efforts to develop new drugs.
We wish you happy and inspiring reading.
Strasbourg, March 14, 2013
Remy Hoffmann, Arnaud Gohier, and Pavel Pospisil
Part One
Data Sources
1
Protein Structural Databases in Drug Discovery
Esther Kellenberger and Didier Rognan
The Protein Data Bank (PDB) was founded in the early 1970s to provide a repository of three-dimensional (3D) structures of biological macromolecules. Since then, scientists from around the world have submitted coordinates and information to mirror sites in the United States, Europe, and Asia. In 2003, the Research Collaboratory for Structural Bioinformatics Protein Data Bank (RCSB PDB, USA), the Protein Data Bank in Europe (PDBe, known before 2009 as the Macromolecular Structure Database at the European Bioinformatics Institute, MSD-EBI), and the Protein Data Bank Japan (PDBj) at Osaka University formally merged into a single standardized archive, named the worldwide PDB (wwPDB, http://www.wwpdb.org/) [1]. At its creation in 1971 at the Brookhaven National Laboratory, the PDB registered seven structures. The number of structures deposited each year has been constantly increasing (Figure 1.1), and the archive held more than 75 000 entries in 2011.
Figure 1.1 Yearly growth of deposited structures in the Protein Data Bank (accessed August 2011).
The growth rate was especially boosted in the 2000s by structural genomics initiatives [2,3]. Research centers from around the globe made joint efforts to overexpress, crystallize, and solve the protein structures at a high throughput for a reduced cost. Particular attention was paid to the quality and the utility of the structures, thereby resulting in supplementation of the PDB with new folds (i.e., three-dimensional organization of secondary structures) and new functional families [4,5].
The TargetTrack archive (http://sbkb.org) registers the status of macromolecules currently under investigation by all contributing centers (Table 1.1) and illustrates the difficulty in getting high-resolution crystal structures, since only 5% of targets complete the multistep process from cloning to deposition in the PDB.
Table 1.1 TargetTrack status statistics
Although only 450 complexes between an FDA-approved drug and a relevant target are available according to DrugBank [6], the PDB provides structural information for a wealth of potentially druggable proteins, with more than 40 000 different sequences that cover about 18 000 clusters of similar sequences (more than 30% identity).
The PDB stores 3D structures of biological macromolecules, mainly proteins (about 92% of the database), nucleic acids, or complexes between proteins and nucleic acids. PDB depositions are restricted to coordinates that are obtained using experimental data. More than 87% of PDB entries are determined by X-ray diffraction. About 12% of the structures have been computed from nuclear magnetic resonance (NMR) measurements. A few hundred structures were built from electron microscopy data. Purely theoretical models, such as ab initio or homology models, have no longer been accepted since 2006. For most entries, the PDB provides access to the original biophysical data: structure factors for X-ray structures and restraints files for NMR structures. During the past two decades, advances in experimental devices and computational methods have considerably improved the quality of acquired data and have allowed characterization of large and complex biological specimens [7,8]. As an example, the largest set of coordinates in the PDB describes a bacterial ribosomal termination complex (Figure 1.2) [9]. Its structure, determined by electron microscopy, includes 45 chains of proteins and nucleic acids for a total molecular weight exceeding 2 million Da.
Figure 1.2 Comparative display of the largest macromolecule in the PDB (Escherichia coli ribosomal termination complex, PDB code 1ml5, left) and of a prototypical drug (aspirin, PDB code 2qqt, right).
To stress the quality issue, one can note the recent increase in the number of crystal structures solved at very high resolution: 90% of the 438 structures with a resolution better than 1 Å was deposited after year 2000. More generally, the enhancement in the structure accuracy translates into a more precise representation of the biopolymer details (e.g., alternative conformations of an amino acid side chain) and into the enlarged description of the molecular environment of the biopolymer, that is, of the nonbiopolymer molecules, also named ligands. Ligands can be any component of the crystallization solution (ions, buffers, detergents, crystallization agents, etc.), but it can also be biologically relevant molecules (cofactors and prosthetic groups, inhibitors, allosteric modulators, and drugs). Approximately 11 000 different free ligands are spread across 70% of the PDB files.
The conception of a standardized representation of structural data was a requisite of the database creation. The PDB format was thus born in the 1970s and was designed as a human-readable format. Initially based on the 80 columns of a punch card, it has not much evolved over time and still consists in a flat file divided into two sections organized into labeled fields (see the latest PDB file format definition at http://www.wwpdb.org/docs.html). The first section, or header, is dedicated to the technical description and the annotation (e.g., authors, citation, biopolymer name, and sequence). The second one contains the coordinates of biopolymer atoms (ATOM records), the coordinates of ligand atoms (HETATM records), and the bonds within atoms (CONECT records). The PDB format is roughly similar to the connection table of MOL and SD files [10], but with an incomplete description of the molecular structure. In practice, no information is provided in the CONECT records for atomic bonds within biopolymer residues. Bond orders in ligands (simple, double, triple, and aromatic) are not specified and the connectivity data may be missing or wrong. In the HETATM records, each atom is defined by an arbitrary name and an atomic element (as in the periodic table). Because the hydrogen atoms are usually not represented in crystal structures, there are often atomic valence ambiguities in the structure of ligands.
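The record layout described above can be illustrated with a short Python sketch. It is a minimal reader, not a complete parser: it assumes the fixed column positions of the published PDB format for ATOM, HETATM, and CONECT records, ignores the header section, and skips fields such as occupancy and alternate locations. The helper `_atom_line` and the four sample atoms (including their coordinates and the residue name `LIG`) are hypothetical and exist only to make the example self-contained.

```python
from dataclasses import dataclass


@dataclass
class Atom:
    record: str    # "ATOM" (biopolymer) or "HETATM" (ligand, ion, solvent)
    serial: int
    name: str
    res_name: str
    chain: str
    res_seq: int
    x: float
    y: float
    z: float
    element: str


def parse_pdb(text):
    """Collect coordinate records and CONECT bonds from PDB-format text."""
    atoms, bonds = [], set()
    for line in text.splitlines():
        record = line[0:6].strip()
        if record in ("ATOM", "HETATM"):
            atoms.append(Atom(
                record=record,
                serial=int(line[6:11]),       # columns 7-11
                name=line[12:16].strip(),     # columns 13-16
                res_name=line[17:20].strip(), # columns 18-20
                chain=line[21:22].strip(),    # column 22
                res_seq=int(line[22:26]),     # columns 23-26
                x=float(line[30:38]),         # columns 31-38
                y=float(line[38:46]),         # columns 39-46
                z=float(line[46:54]),         # columns 47-54
                element=line[76:78].strip(),  # columns 77-78
            ))
        elif record == "CONECT":
            origin = int(line[6:11])
            # Up to four bonded serial numbers follow, five columns each.
            for start in range(11, 31, 5):
                field = line[start:start + 5].strip()
                if field:
                    # CONECT lists each bond from both ends; a frozenset
                    # stores it once. Bond order is not encoded.
                    bonds.add(frozenset((origin, int(field))))
    return atoms, bonds


def _atom_line(record, serial, name, res_name, chain, res_seq, x, y, z, element):
    """Hypothetical demo helper: write one fixed-column coordinate record."""
    return ("{:<6}{:>5} {:<4} {:>3} {:1}{:>4}    {:8.3f}{:8.3f}{:8.3f}"
            "{:6.2f}{:6.2f}          {:>2}").format(
        record, serial, name, res_name, chain, res_seq, x, y, z, 1.0, 0.0, element)


# Made-up coordinates for two protein atoms and a two-atom ligand fragment.
SAMPLE = "\n".join([
    _atom_line("ATOM", 1, "N", "ALA", "A", 1, 11.104, 6.134, -6.504, "N"),
    _atom_line("ATOM", 2, "CA", "ALA", "A", 1, 11.639, 6.071, -5.147, "C"),
    _atom_line("HETATM", 3, "C1", "LIG", "A", 101, 20.000, 10.000, 5.000, "C"),
    _atom_line("HETATM", 4, "O1", "LIG", "A", 101, 21.200, 10.000, 5.000, "O"),
    "CONECT{:>5}{:>5}".format(3, 4),
    "CONECT{:>5}{:>5}".format(4, 3),
])

atoms, bonds = parse_pdb(SAMPLE)
ligand_atoms = [a for a in atoms if a.record == "HETATM"]
print(len(atoms), len(ligand_atoms), sorted(tuple(sorted(b)) for b in bonds))
```

Note that the sketch recovers only which ligand atoms are connected. As the text points out, bond orders and hydrogen atoms are absent from the file, so a chemically complete ligand structure has to be inferred by other means.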
To overcome limits in data handling and storage capacity for very large biological molecules, two new formats were introduced: the macromolecular crystallographic information file (mmCIF) in 1997 and the PDB markup language (PDBML), an XML format derivative, in 2005 [11,12]. They are better suited to the description of ligands but are not widely used by the scientific community. Few programs are actually able to read the mmCIF and PDBML formats, whereas almost all programs can display molecules from PDB input coordinates.
Errors and inconsistencies are still frequent in PDB data (see examples in Table 1.2). Some of them result from changes over time in the collection, curation, and processing of the data [13]. Others are directly introduced by the depositors because of the limits of experimental methods or because of an incomplete knowledge of the chemistry and/or biology of the studied sample. In 2007, the wwPDB released a complete remediated archive [14]. In practice, sequence database references and taxonomies were updated and primary citations were verified. Significant efforts have also been devoted to the chemical description and nomenclature of the biopolymers and ligands. The PDB file format was upgraded (v3.0) to integrate uniformity and remediation data, and a reference dictionary called the Chemical Component Dictionary has been established to provide an accurate description of all the molecular entities found in the database. To date, however, only a few modeling programs (e.g., MOE¹ and SYBYL²) make use of the dictionary to complement the ligand information encoded in PDB files.
Table 1.2 Common errors in PDB files and effect of the wwPDB remediation
Description of errors | Impacted data | Status upon remediation
Invalid source organism | Annotation | Fixed
Invalid reference to protein sequence databases | Annotation | Fixed
Inconsistencies in protein sequences (a) | Annotation | Fixed
Violation of nomenclature in protein (b) | Structure | Fixed
Incomplete CONECT record for ligand residues | Structure | Partly solved
Wrong chemistry in ligand residues | Structure | Partly solved
Violation of nomenclature in ligand (c) | Structure | Unfixed
Wrong coordinates (d) | Structure | Unfixed
a. In HEADER and ATOM records.
b. For example, residue or atom names.
c. Discrepancy between the structure described in the PDB file and the definition in the Chemical Component Dictionary.
d. For example, wrong side chain rotamers in proteins.
The wwPDB remediation led in March 2009 to version 3.2 of the PDB archive, with a focus on the detailed chemistry of biopolymers and bound ligands. Remediation is still ongoing and the latest remediated archive was released in July 2011. There are nevertheless still structural errors in the database. Some are easily detectable, for example, erroneous bond lengths and bond angles, steric clashes, or missing atoms. These errors are very frequent (e.g., the number of atomic clashes in the PDB was estimated to be 13 million in 2010), but in principle can be fixed by recomputing coordinates from structure factors or NMR restraints using a proper force field [15]. Other structural errors are not obvious. For example, a wrong protein topology is identified only if new coordinates supersede the obsolete structure or if the structure is retracted [16]. Hopefully, these errors are rare. More common and yet undisclosed structural ambiguities concern the ionization and the tautomerization of biopolymers and ligands (e.g., three different protonation states are possible for histidine residues).
To evaluate the accuracy of a PDB structure, querying the PDB-related databases PDBREPORT and PDB_REDO is a good start [15]. PDBREPORT (http://swift.cmbi.ru.nl/gv/pdbreport/) registers, for each PDB entry, all structural anomalies in biopolymers. PDB_REDO (http://www.cmbi.ru.nl/pdb_redo/) holds rerefined copies of the PDB structures solved by X-ray crystallography (Figure 1.3).
Figure 1.3 PDB_REDO characteristics of the 3rte PDB entry.
The quality issue was recently discussed from a drug design perspective with benchmarks for structure-based computer-aided methods [17–19]. The consensus is that the PDB is an invaluable resource of structural information provided that data quality is not overstated.
The bioactive structure of ligands in complex with a relevant target is of special interest for drug design. During the last decade, many databases of ligand/protein information have been derived from the PDB. Their creation was always motivated by the ever-growing amount of structural data. Each database, however, has its own focus, which can be a large-scale analysis of ligands and/or proteins in PDB complexes, or training and/or testing affinity prediction, or other structure-based drug design methods (e.g., docking). Accordingly, ligands are either thoroughly collected across all PDB complexes or only retained if they satisfy predefined requirements. As a consequence, the number of entries in PDB-related databases ranges from a few thousand to over 50 000. These databases also differ greatly in their content. This section does not intend to establish an exhaustive list. We have chosen to discuss only the recent or widely used databases and to group them according to their main purposes (Table 1.3).
Table 1.3 Representative examples of PDB-related databases useful for drug design
The wwPDB contributors have developed free Web-based tools to match chemical structures in the PDB files to entities in the Chemical Component Dictionary; the Ligand Expo and PDBeChem resources are linked to the RCSB PDB and PDBe, respectively, and provide the chemical structure of all ligands of every PDB file [20,21]. A few other databases also hold one entry for each PDB entry. The Het-PDB database was designed in 2003 at the Nagahama Institute of Bio-Science and Technology to survey the nonbiopolymer molecules in the PDB and to draw statistics about their frequency and interaction mode [22]. It is still updated monthly and covers 12 000 ligands in the PDB. It revealed that the most frequent ligands in the PDB were metal ions, sugars, and nucleotides, all of which can be considered part of the functional protein, either as a result of a posttranslational modification or as cofactors. Another important database was developed at Uppsala University to provide structural biologists with topology and parameter files for ligands [23]. This database, named HIC-Up, was maintained until 2008 by G. Kleywegt, who now leads the PDBe. Another useful service has been offered by the Structural Bioinformatics group in Berlin: the Web interface of the SuperLigands database allows the search for 2D and 3D similar ligands in the PDB [24]. The last update of SuperLigands was made in December 2009. Other PDB ligand warehouses have been developed during the last decade but, like HIC-Up and SuperLigands, are not actively maintained, since the RCSB PDB and the PDBe directly integrate most of their data or services.
A few databases collect binding affinities such as experimentally determined inhibition (IC50, Ki) or dissociation (Kd) constants for PDB complexes. The largest ones are Binding MOAD, PDBbind, and BindingDB [25–27]. Both Binding MOAD and PDBbind were developed at the University of Michigan, and have in common the separation of biologically relevant PDB ligands from invalid ones, such as salts and buffers. Their focuses are, however, different. For example, PDBbind disregards any complex without binding data, whereas Binding MOAD groups proteins into functional families and chooses the highest affinity complex as a representative. BindingDB considers only potential drug targets in the PDB, but collects data for many ligands that are not represented in the PDB.
In all cases, data gathering implies the manual review of the reference publications in PDB files and, more generally, expert parsing of scientific literature. BindingDB also contains data extracted from two other Web resources, PubChem BioAssay and ChEMBL. PubChem BioAssay database at the National Center for Biotechnology Information (NIH) contains biological screening results. ChEMBL is the chemogenomics data resource at the European Molecular Biology Laboratory. It contains binding data and other bioactivities extracted from scientific literature for more than a million bioactive small molecules, including many PDB ligands.
Affinity databases were recently made available from two of the wwPDB mirror sites. The RCSB PDB Web site now includes hyperlinks to the actively maintained ones, BindingDB and Binding MOAD. The PDBe Web site communicates with ChEMBL.
As already described, RCSB PDB and PDBe resources currently provide chemical description and 3D coordinates for all ligands in the PDB. They also provide tools for inspection of protein–ligand binding (Ligand Explorer at RCSB PDB and PDBeMotifs at PDBe). But as already discussed in this chapter, PDB data are prone to chemical ambiguities and not directly suitable for a fine description of nonbonded intermolecular interactions. Several initiatives have aimed at the structural characterization of protein–ligand interactions at the PDB scale. Among the oldest is Relibase, which automatically analyzes all PDB entries, identifies all complexes involving nonbiopolymer groups, and supplies the structural data with additional information, such as atom and bond types [28]. Relibase allows various types of queries (text searching, 2D substructure searching, 3D protein–ligand interaction searching, and ligand similarity searching) and complex analyses, such as automatic superposition of related binding sites to compare ligand binding modes. The Web version of Relibase is freely available to academic users, but does not include all possibilities for exploration of PDB complexes.
Whereas Relibase holds as many entries as the PDB holds ligand–protein complexes, other databases were built using only a subset of the PDB information. For example, the sc-PDB is a nonredundant assembly of 3D structures for “druggable” PDB complexes [29]. Druggability here does not imply the existence of a drug–protein complex, but that both the binding site and the bound ligand obey topological and physicochemical rules typical of pharmaceutical targets and drug candidates, respectively. Strict selection rules and extensive manual verifications ensure the selection in the PDB of binary complexes between a small biologically relevant ligand and a druggable protein binding site. The preparation, content, and applications of the sc-PDB are detailed in Section 1.3.
Along the same lines, the PSMDB database endeavors to set up a smaller and yet most diverse data set of PDB ligand–protein complexes [30]. Full PDB entries are parsed to select structures determined by X-ray diffraction with a resolution better than 2 Å, with at least one protein chain longer than 50 amino acids, and a noncovalently bound small ligand. The PDB file of each selected complex is split into free protein structure and bound ligand(s). The added value of PSMDB does not lie in these output structure files, which contain the original PDB coordinates, but in the handling of redundancy at both the protein and ligand levels.
With the growing interest of the pharmaceutical industry in fragment-based approaches to drug design [31], several applications focusing on individual fragments derived from PDB ligands have recently emerged. Algorithms for molecule fragmentation were applied to a selection of PDB ligands to define a library of fragment binding sites [32], to map the amino acid preferences of such fragments [33], or to extract possible bioisosteres [34].
We decided in 2002 to set up a collection of protein–ligand binding sites called sc-PDB, originally designed for reverse docking applications [35]. While docking a set of ligands to a single protein was already a well-established computational technique for identifying potentially interesting novel ligands, the reverse paradigm (docking a single ligand to a set of protein active sites) was still a marginal approach. The main difficulty was indeed to automate the setup of protein–ligand binding sites with appropriate attributes, such as physicochemical (e.g., ionization and tautomerization states) and pharmacological properties of the ligand. It was not our intention to cover all ligand–protein complexes in the PDB, but rather to compile a large and yet not redundant set of experimental structures for known or potential therapeutic targets that had been cocrystallized with a known drug/inhibitor/activator or with a small endogenous ligand that could be replaced by a drug/inhibitor/activator (e.g., sildenafil in phosphodiesterase-5 is an adenosine mimic). Selection rules as well as the applicability domain of the database have considerably evolved over time and are reviewed in the following sections.
In brief, the selection scheme is made of simple and intelligible selection rules for the function and properties of the protein, the physicochemical properties of its ligand, and its binding mode (Figure 1.4).
Figure 1.4 Flowchart to select sc-PDB entries from the PDB. Unwanted molecules at step 3 are identified using a dictionary or simple filters (based on ligand molecular weight, ligand surface area buried into the protein, number of amino acids close to the ligand, number of rings, and number of rotatable bonds of ligand). The bioligand in step 4 is the ligand that passes step 3 and maximizes the product of ligand molecular weight and surface area buried into the protein.
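Steps 3 and 4 of the flowchart can be sketched as follows. This is an illustration only: the contents of the UNWANTED dictionary, the molecular weight window, and the candidate values are hypothetical placeholders, since the actual sc-PDB thresholds are not given here. Only the final criterion (maximizing the product of molecular weight and buried surface area) comes from the caption above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    het_code: str
    mol_weight: float    # Da
    buried_area: float   # ligand surface area buried in the protein, in Å^2


# Hypothetical stand-in for the dictionary of unwanted molecules.
UNWANTED = {"HOH", "SO4", "GOL", "PEG"}


def select_bioligand(candidates, min_weight=70.0, max_weight=800.0) -> Optional[Candidate]:
    """Step 3: drop unwanted molecules; step 4: maximize MW x buried area."""
    kept = [c for c in candidates
            if c.het_code not in UNWANTED
            and min_weight <= c.mol_weight <= max_weight]
    return max(kept, key=lambda c: c.mol_weight * c.buried_area, default=None)


# Hypothetical HET groups found in one PDB entry.
entry = [
    Candidate("SO4", 96.1, 60.0),     # removed by the dictionary
    Candidate("LG1", 454.4, 520.0),   # large, deeply buried ligand
    Candidate("LG2", 180.2, 150.0),   # small, partly exposed ligand
]
bioligand = select_bioligand(entry)
print(bioligand.het_code if bioligand else "no bioligand")
```

Using the product of the two quantities favors ligands that are both large and well buried, so a small additive that happens to sit deep in a pocket does not displace the main ligand.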
The first publicly available version of the database was released in 2004 [35]. The database was named sc-PDB (an acronym for screening the Protein Data Bank) (Table 1.4). At that time, it contained the atomic coordinates of proteins and their “druggable” binding sites. The protein was defined as all biopolymer chains, ions, and cofactors in the vicinity of the ligand. The binding site includes only the protein residues less than 6.5 Å away from the ligand. Notably, all atoms were represented, including the hydrogen atoms not described in crystal structures. From 2005 onward, the sc-PDB has also provided the atomic coordinates of ligands. The ligand chemistry has been validated using an in-house dictionary, manually built from scratch and supplemented since 2007 with manually checked entries of the PDB Chemical Component Dictionary. The all-atom representation of both partners of sc-PDB complexes has allowed us to refine the position of polar hydrogen atoms in the protein binding site and to compute an optimized pose of the bound ligand [29]. The sc-PDB is updated annually and regularly enriched with new information (ligand descriptors, binding mode encoded into an interaction fingerprint (IFP) [36], and cavity volume) and new functionalities (classification of similar binding sites [37]). A Web interface enables querying the database by combining requests about ligand chemical structures and properties, protein function and source organism, binding site properties, and ligand/protein binding properties (Figure 1.5).
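The 6.5 Å binding-site definition given above amounts to a simple distance filter. The sketch below is a minimal illustration, assuming that protein atoms are supplied as (residue identifier, coordinates) pairs and the ligand as a list of heavy-atom coordinates; the data layout and the example coordinates are hypothetical.

```python
import math


def binding_site_residues(protein_atoms, ligand_atoms, cutoff=6.5):
    """Return residues with at least one atom closer than `cutoff` (Å)
    to any ligand heavy atom.

    protein_atoms: iterable of (residue_id, (x, y, z))
    ligand_atoms:  list of (x, y, z) for ligand heavy atoms
    """
    site = set()
    for res_id, pos in protein_atoms:
        if any(math.dist(pos, lig) < cutoff for lig in ligand_atoms):
            site.add(res_id)
    return sorted(site)


# Hypothetical coordinates; residue_id = (chain, number, name).
protein = [
    (("A", 27, "ASP"), (0.0, 0.0, 0.0)),
    (("A", 27, "ASP"), (1.5, 0.0, 0.0)),
    (("A", 94, "ILE"), (2.0, 0.0, 1.0)),
    (("A", 150, "GLY"), (20.0, 0.0, 0.0)),
]
ligand = [(0.0, 0.0, 6.0), (0.0, 0.0, 7.5)]
print(binding_site_residues(protein, ligand))
```

A residue is retained as a whole as soon as one of its atoms falls within the cutoff, which matches a residue-based rather than atom-based site definition. For full PDB entries, a spatial grid or k-d tree would replace the quadratic loop.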
Table 1.4 Annotation and available search options in the Web interface to the sc-PDB
Object | Properties
PDB X-ray structure | PDB identifier; Resolution; Deposition date
Ligand | HET code; Chemical structure; Formula; Molecular weight; LogP; LogS; Polar surface area; H-Bond donor count; H-Bond acceptor count; Number of rotatable bonds; Number of rings; Rule-of-five number of violations
Protein | Name; EC number; Uniprot accession number; Uniprot name; Source organism name; Source organism taxonomy; Source organism kingdom; Mutant/wild type
Ligand binding site | Ion/cofactor; Number of residues; Number of nonstandard amino acids; Number of chains; Average B-factor; Center of mass
Protein–ligand interactions | Number of hydrophobic interactions; Aromatic face-to-face interactions; Aromatic face-to-edge interactions; H-Bond (donor in protein or ligand); Ionic interaction (cation in protein or ligand); Metal coordination; Affinity data (Ki, Kd, IC50, or pKd); Ligand buried surface area
Figure 1.5 sc-PDB output for PDB protein–ligand complexes (3 hits) between an indole-containing ligand (blue substructure) of molecular weight <350 and a human kinase to which the ligand donates at least one hydrogen bond.
The current version of the database contains 9891 entries corresponding to 3039 different proteins (according to protein sc-PDB name [37]) and 5505 different ligands (according to canonical SMILES strings). The sc-PDB protein space is redundant. There are 395 different proteins with more than 5 copies and single-copy proteins represent 55% of the database entries. Noteworthy is the complex nature of many proteins: a cofactor is bound to 219 proteins; calcium, magnesium, manganese, cobalt, zinc, or iron ions are found in 981 different proteins. No sc-PDB ligands are located at the interface of a protein–protein complex. The functional and species distribution of sc-PDB proteins reflects the bias in protein function space of the PDB itself, yet the sc-PDB is enriched in enzymes. The sc-PDB ligand space is also redundant, and the most prevalent ligands are cofactors and other nucleotides, which are also the most promiscuous ligands (e.g., more than 100 different protein targets for adenosine 5′-diphosphate or nicotinamide adenine dinucleotide). About 75% of the sc-PDB ligands are not primary bioorganic metabolites (nucleic acids, peptides, amino acids, sugars, or lipids) or their derivatives. Most of them pass Lipinski's rule of five (69% with no violations and 20% with a single violation). The sc-PDB ligand space does not match that of commercial drugs because of a bias toward polar and flexible ligands. Finally, the sc-PDB ligand ensemble is not very diverse: more than half of the sc-PDB ligands are highly similar to at least one other, nonidentical ligand in the pool (similarity being evaluated by a Tanimoto coefficient higher than 0.6, computed on feature-based circular 2D FCFP4 fingerprints).
The sc-PDB database has been developed for reverse docking applications [35] and is therefore an invaluable source for establishing large-scale docking benchmarks. Most validation studies, which flourished in the literature in the last decade, have been applied to a restricted set of a few hundred PDB targets [38–41] and in the best cases to a “clean” set of high-resolution protein structures in which erroneous PDB data (Table 1.2) have been removed [42]. In daily drug discovery programs, many targets under investigation do not obey such strict rules. Assessing the robustness of docking algorithms against a larger and more representative set of protein 3D structures is therefore of interest. The sc-PDB provides a unique source for such benchmarks since ligand, protein, and active site coordinates have been preprocessed and are ready for automated docking. When applied to a collection of 5681 complexes, Tietze and Apostolakis reported with the GlamDock software [43] an accuracy (RMSD to the X-ray structure below 2.0 Å) significantly lower than that obtained with restricted protein sets, with only 77% sampling accuracy (RMSD of the best pose <2.0 Å) and 47% scoring accuracy (RMSD of the top-ranked pose <2.0 Å). Along the same lines, we reported the accuracy of four docking algorithms in posing low molecular weight fragments into druggable sc-PDB binding sites and observed that ranking poses by a purely topological scoring function based on protein–ligand interaction fingerprints was much superior to ranking them by classical energy-based scoring functions [36].
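The distinction between sampling accuracy and scoring accuracy can be made concrete with a short sketch. It assumes that each docked pose and the X-ray ligand share the same atom order and coordinate frame (no superposition and no symmetry correction, both of which a production benchmark would need), and the RMSD values in the demo are invented for illustration.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two matched lists of (x, y, z)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum((p - q) ** 2
                  for a, b in zip(coords_a, coords_b)
                  for p, q in zip(a, b))
    return math.sqrt(squared / len(coords_a))


def docking_accuracy(rmsds_per_complex, threshold=2.0):
    """Return (sampling accuracy, scoring accuracy).

    rmsds_per_complex: one list per complex, holding the RMSD of each pose
    to the X-ray ligand, ordered by docking rank (top-ranked pose first).
    """
    n = len(rmsds_per_complex)
    sampling = sum(min(r) < threshold for r in rmsds_per_complex) / n
    scoring = sum(r[0] < threshold for r in rmsds_per_complex) / n
    return sampling, scoring


# Invented RMSD values (Å) for four complexes, two poses each.
benchmark = [[0.8, 3.1], [4.0, 1.5], [5.2, 6.0], [1.9, 0.7]]
sampling, scoring = docking_accuracy(benchmark)
print(f"sampling {sampling:.0%}, scoring {scoring:.0%}")
```

In the second complex a near-native pose exists but is not ranked first, so it counts toward sampling accuracy only. The gap between the two figures is therefore a direct measure of scoring-function failures.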
Coming back to the seminal application for which the sc-PDB archive was initially developed (reverse docking), it soon became apparent that the concept could be easily applied to a large and heterogeneous set of binding sites with a naïve target ranking scheme consisting of simple docking scores. Serial docking of four test ligands (biotin, methotrexate, 4-hydroxytamoxifen, and 6-hydroxy-1,6-dihydropurine ribonucleoside) to a collection of 2148 binding sites enabled recovering the known target(s) of these ligands within the top 1% scoring entries, using the GOLD docking algorithm. These results were quite encouraging since they validated per se the reverse docking concept and notably the automated binding site setup protocol, despite well-known insufficiencies regarding, for example, ionization/tautomerization of binding site residues as well as water-mediated ligand binding effects. These initial trials were applied to high-affinity ligands, which were relatively selective for very few targets. When applied to smaller and more permissive compounds (e.g., AMP), a larger list of potential targets (top 5 to 10%) had to be selected to fish the correct protein targets [35]. The main reason was an inaccurate scoring of the “good” binding sites, which was not a real surprise given the abundant literature about the limitations of fast scoring functions utilized in docking algorithms [19,44]. In order to overcome these severe limitations, alternative target ranking schemes independent of any energy calculation have been developed. One particular problem in docking-based target fishing is that the distribution of docking scores may be quite heterogeneous across different binding sites with diverse physicochemical properties. Therefore, score normalization according to ligand and/or target properties is necessary to get rid of frequent target hitters [45–47].
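One simple way to normalize scores per target is a z-score against the scores that a background set of ligands obtains on the same binding site. The sketch below illustrates this idea only; it is not necessarily the scheme used in [45–47], the target names and scores are invented, and it assumes that a higher docking score means a better fit.

```python
from statistics import mean, pstdev


def normalize_by_target(query_scores, background_scores):
    """Z-score the query ligand's score on each target.

    query_scores:      {target: docking score of the query ligand}
    background_scores: {target: [scores of background ligands on that target]}
    """
    z_scores = {}
    for target, score in query_scores.items():
        reference = background_scores[target]
        spread = pstdev(reference)
        z_scores[target] = (score - mean(reference)) / spread if spread > 0 else 0.0
    return z_scores


# Invented scores: T2 scores every ligand highly (a frequent hitter).
query = {"T1": 60.0, "T2": 70.0}
background = {"T1": [30.0, 40.0, 50.0], "T2": [60.0, 70.0, 80.0]}

raw_ranking = sorted(query, key=query.get, reverse=True)
z = normalize_by_target(query, background)
normalized_ranking = sorted(z, key=z.get, reverse=True)
print(raw_ranking, normalized_ranking)
```

On raw scores the frequent hitter T2 ranks first, whereas after normalization T1 does, because the query ligand scores well above what is typical for that site.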
Another promising approach consists in the conversion of protein–ligand coordinates (docking poses) into simple 1D IFPs [36]. Assuming that a virtual hit is more likely to be a true hit if it shares a similar target–ligand interaction profile with a known ligand, docking poses can be ranked by decreasing similarity of the IFP to that of the reference compound(s). Combining docking scores with IFP similarities allows removing many false positives (wrong targets with high docking scores), while still selecting the true targets in the final hit list [48].
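The IFP-based ranking described above can be sketched with fingerprints stored as bit strings and compared by the Tanimoto coefficient. The 8-bit fingerprints and pose names below are invented for illustration; real IFPs encode several interaction types per binding site residue and are much longer.

```python
def tanimoto(ifp_a: str, ifp_b: str) -> float:
    """Tanimoto coefficient between two bit strings of equal length."""
    if len(ifp_a) != len(ifp_b):
        raise ValueError("fingerprints must have the same length")
    both = sum(a == "1" and b == "1" for a, b in zip(ifp_a, ifp_b))
    either = sum(a == "1" or b == "1" for a, b in zip(ifp_a, ifp_b))
    return both / either if either else 0.0


def rank_poses_by_ifp(reference_ifp, pose_ifps):
    """Order pose identifiers by decreasing IFP similarity to the reference."""
    return sorted(pose_ifps,
                  key=lambda pose: tanimoto(reference_ifp, pose_ifps[pose]),
                  reverse=True)


# Invented fingerprints: one bit per (residue, interaction type) pair.
reference = "11010010"          # interaction pattern of a known ligand
poses = {
    "pose_A": "11010000",       # misses one reference interaction
    "pose_B": "00101101",       # entirely different interactions
    "pose_C": "11010010",       # reproduces the reference pattern
}
ranking = rank_poses_by_ifp(reference, poses)
print(ranking)
```

Because the ranking depends only on which interactions are formed, it is independent of any energy calculation, which is why it can rescue poses that a fast scoring function misranks.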
The sc-PDB provides, for each entry, all-atom Cartesian coordinates for the ligand, the target, and the binding site. By “binding site” we mean any monomer (amino acid, ion, cofactor, or prosthetic group) within 6.5 Å of any ligand heavy atom. Although the definition is conservative and excludes many potentially interesting pockets, it has the advantage of favoring cavities with well-described ligand occupancy. sc-PDB entries, therefore, can be used by cavity detection algorithms [49] to predict the most likely ligand binding sites and whether they are druggable or not, in other words, whether the pocket could accommodate an orally available, rule-of-five-compliant, drug-like molecule. When applied to 4915 sc-PDB protein structures, Volkamer et al. reported that the ligand is present in one of the three largest pockets in 90% of cases [50]. We used a grid-based cavity detection method (VolSite) to map cavity points with pharmacophoric properties of the closest protein atom, thus defining an ideal virtual ligand for each binding site (Figure 1.6).
Figure 1.6 Detection and pharmacophoric annotation of VolSite cavity points in the X-ray structure of Lactobacillus dihydrofolate reductase (PDB code 4dfr). The cognate ligand (methotrexate, sticks) is shown in the binding site of the protein (green transparent surface). Cavity points are colored by pharmacophoric properties (H-bond acceptor and negative ionizable, red; H-bond donor and positive ionizable, blue; hydrophobe, white; aromatic, cyan; null, magenta).
Predicting the druggability of a given target from its three-dimensional structure is an intense field of research in order to reduce attrition rates in pharmaceutical discovery [51]. As druggability is by far more complex than the simple propensity of a particular protein cavity to accommodate high-affinity drug-like compounds, other terms, such as “bindability” [52] or “ligandability” [51] have been proposed recently, since they better capture target property ranges (cavity volume, polarity, and buriedness) known to be important for druggable targets [52–56]. Since these important properties are theoretically encoded in the aforementioned cavity site points, we investigated whether the present cavity descriptors might be suitable for predicting the ligandability of cavities from their 3D structures. A training set of 62 cavities (50% druggable and 50% undruggable) was assembled from literature [53,57] and the distribution of site point properties was given as input for a support vector machine (SVM) classifier. The best cross-validated classification model achieves a very good accuracy of 80% and a Matthews correlation coefficient (MCC) of 0.62. Of course, larger sets of proteins of known (non)druggability are necessary to draw general conclusions, but the observed trend is quite promising and suggests that druggable target triage may be considered at an early level of drug discovery programs on condition that a high-resolution X-ray structure is available.
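The two quality measures quoted above are both derived from a confusion matrix. The sketch below shows the computation. The counts are hypothetical: they were chosen only to be compatible with a balanced set of 62 cavities and do not reproduce the actual cross-validation results, which are not detailed here.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of correctly classified cases."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_cc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient; 0.0 when a marginal total is zero."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


# Hypothetical confusion matrix: 31 druggable and 31 undruggable cavities.
tp, fn = 25, 6    # druggable cavities predicted druggable / undruggable
tn, fp = 25, 6    # undruggable cavities predicted undruggable / druggable

acc = accuracy(tp, tn, fp, fn)
mcc = matthews_cc(tp, tn, fp, fn)
print(f"accuracy = {acc:.2f}, MCC = {mcc:.2f}")
```

Unlike accuracy, the MCC uses all four cells of the confusion matrix and ranges from -1 to 1, with 0 corresponding to random prediction, so it remains informative when the two classes are not balanced.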
A second interesting application of the sc-PDB is the quantitative comparison of its binding sites. Assuming that similar binding sites recognize similar ligands, comparing binding sites, notably in the absence of 3D structure conservation, permits identifying unexpected secondary targets for bioactive ligands. Several alignment-dependent or alignment-independent binding site comparison methods have been benchmarked on diverse collections of sc-PDB ligand binding sites [58–61] and have enabled the definition of global and local similarity thresholds for defining two sites as similar. Screening a library of binding sites for similarity to any given query is, therefore, possible and has already led to the identification of an unexpected off-target (Synapsin I) for some but not all serine/threonine protein kinase inhibitors (Figure 1.7) [62].
Figure 1.7 Computational protocol used to detect local similarities between ATP-binding sites in pim-1 kinase and Synapsin I. The ATP-binding site in pim-1 kinase (occupied by the ligand staurosporine) is compared with SiteAlign [58] (step a) to 6415 binding sites stored in the sc-PDB database. Among the top scoring entries (step b), Synapsin I is the only protein not belonging to the protein kinase target family (step c) and present in numerous copies (step d). A systematic SiteAlign comparison (step e) of the ATP-binding site in Synapsin I with 978 other ATP-binding sites (from the sc-PDB) suggests that some but not all ATP-binding sites of protein kinases (steps f and g) resemble that of Synapsin I [62].
Interestingly, only inhibitors of binding sites (cyclin-dependent kinase type 2, pim-1, and casein kinase II) predicted to be similar to that of Synapsin I were indeed found to bind to Synapsin I, sometimes with nanomolar affinities, whereas inhibitors of binding sites distant from that of Synapsin I (e.g., checkpoint kinase 1, protein kinase A, HSP-90α, DAG kinase, and DNA topoisomerase II) were not recognized by the enzyme [62].
The structural knowledge encoded by 3500 protein–ligand complexes in the sc-PDB has been used to derive a model able to discriminate, from simple 1D cavity fingerprints, 120 000 ligand-interacting from 500 000 ligand-noninteracting protein atoms [63]. When applied to a novel complex, the model was able to predict with 70% accuracy the protein atoms that are likely to interact with a ligand and, therefore, to prioritize protein structure-based pharmacophore queries specifically targeting these hot spots.
The sc-PDB data set offers the opportunity to delineate evolutionary relationships between ligands and their targets or binding sites. By examining the distribution patterns of sc-PDB ligands in the protein universe, Ji et al. reported that synthetic compounds (e.g., enzyme inhibitors) tend to bind to a single protein fold, whereas “superligands” (metabolites) are much more permissive and can be accommodated by more than 10 different protein folds [64]. Target fold promiscuity was mostly found for ancestral ligands (e.g., nucleotide-containing metabolites) that appeared quite early in evolution and behave as hubs of metabolic networks. Interestingly, these ligands share common physicochemical properties (high flexibility and polarity) responsible for their promiscuity. Likewise, the analysis of cofactor usage (organic molecules and transition metal ions) by primitive redox proteins in the sc-PDB clearly shows that organic cofactors (NAD and NADP) are used much more often than metals, probably because of the abundance of neutral residues at the border of the corresponding binding sites [65]. Finally, a survey of known interactions between phenolic ligands and their sc-PDB targets provides some explanations for the classically observed discrepancy between potent in vitro and moderate in vivo antioxidant properties of phenols [66]. A tight hydrogen bonding of phenolic moieties to many sc-PDB proteins suggests that reactive oxygen species (ROS) cannot be scavenged by phenols if they are already engaged in interactions with surrounding proteins.
Relationships between ligands and their targets could also be integrated in rational drug discovery programs. For example, retrieving from the sc-PDB 171 diverse protein kinases cocrystallized with ATP competitors and aligning their binding sites led to the observation that crystal water patterns (position, hydrogen bond network to the kinase, and known inhibitor) were not necessarily conserved despite very high binding site similarities, thus suggesting novel avenues for optimizing the fine selectivity of kinase inhibitors [67]. By comparing the structure of unrelated targets binding to the same natural flavonoids, Quinn and coworkers introduced the concept of protein fold topology (PFT) [68], characterized by short stretches of not necessarily conserved secondary structures providing shared anchoring points to a common ligand. The concept was demonstrated for natural products binding to both biosynthetic enzymes and therapeutic targets and may explain why natural compounds are abundant among existing drugs [69].
In a recent report, Meslamani and Rognan describe a novel protein cavity kernel able to quantitatively measure the 3D similarity between two sc-PDB binding sites. A novel chemogenomic screening method based on an SVM was designed to browse the sc-PDB protein–ligand space and predict binary protein–ligand interactions from separate ligand and cavity fingerprints. The best SVM model was able to predict with a high recall (70%) and exquisite specificity (99%) and precision (99%) the binding of 14 117 external ligands to a set of 531 sc-PDB targets [70].
Exploiting structural knowledge on known protein–ligand complexes is a key step in the rational design of bioactive compounds. This knowledge has gained considerable value in recent years, thanks to parallel endeavors of structural biologists and computational biologists/chemists to release an ever-increasing number of high-quality data. Many smart algorithms to parse and analyze the PDB have been described in the last couple of years, with a large spectrum of applications ranging from hit identification and optimization to massive ligand profiling against a large array of possible targets. With the expected better coverage of the therapeutic target space by the PDB in the coming years, we anticipate a significant boost to rational drug discovery and notably a better interplay between protein structure-based and ligand-centric methods.
Notes
1. Chemical Computing Group, Montreal, Quebec, Canada H3A 2R7.
2. Tripos, St. Louis, MO 63144-2319, USA.
References
1. Berman, H., Henrick, K., Nakamura, H., and Markley, J.L. (2007) The worldwide Protein Data Bank (wwPDB): ensuring a single, uniform archive of PDB data. Nucleic Acids Research, 35, D301–D303.
2. Dessailly, B.H., Nair, R., Jaroszewski, L., Fajardo, J.E., Kouranov, A., Lee, D., Fiser, A., Godzik, A., Rost, B., and Orengo, C. (2009) PSI-2: structural genomics to cover protein domain family space. Structure, 17, 869–881.
3. Nair, R., Liu, J., Soong, T.T., Acton, T.B., Everett, J.K., Kouranov, A., Fiser, A., Godzik, A., Jaroszewski, L., Orengo, C., Montelione, G.T., and Rost, B. (2009) Structural genomics is the largest contributor of novel structural leverage. Journal of Structural and Functional Genomics, 10, 181–191.
4. Chandonia, J.M. and Brenner, S.E. (2006) The impact of structural genomics: expectations and outcomes. Science, 311, 347–351.
5. Brown, E.N. and Ramaswamy, S. (2007) Quality of protein crystal structures. Acta Crystallographica Section D, 63, 941–950.
6. Knox, C., Law, V., Jewison, T., Liu, P., Ly, S., Frolkis, A., Pon, A., Banco, K., Mak, C., Neveu, V., Djoumbou, Y., Eisner, R., Guo, A.C., and Wishart, D.S. (2011) DrugBank 3.0: a comprehensive resource for ‘omics’ research on drugs. Nucleic Acids Research, 39, D1035–D1041.
7. Joachimiak, A. (2009) High-throughput crystallography for structural genomics. Current Opinion in Structural Biology, 19, 573–584.
8. Montelione, G.T. and Szyperski, T. (2010) Advances in protein NMR provided by the NIGMS Protein Structure Initiative: impact on drug discovery. Current Opinion in Drug Discovery & Development, 13, 335–349.
9. Klaholz, B.P., Pape, T., Zavialov, A.V., Myasnikov, A.G., Orlova, E.V., Vestergaard, B., Ehrenberg, M., and van Heel, M. (2003) Structure of the Escherichia coli ribosomal termination complex with release factor 2. Nature, 421, 90–94.
10. Dalby, A., Nourse, J.G., Hounshell, W.D., Gushurst, A.K.I., Grier, D.L., Leland, B.A., and Laufer, J. (1992) Description of several chemical structure file formats used by computer programs developed at Molecular Design Limited. Journal of Chemical Information and Computer Sciences, 32, 244–255.
11. Bourne, P.E., Berman, H.M., McMahon, B., Watenpaugh, K.D., Westbrook, J.D., and Fitzgerald, P.M.D. (1997) Macromolecular crystallographic information file. Methods in Enzymology, 277, 571–590.
12. Westbrook, J., Ito, N., Nakamura, H., Henrick, K., and Berman, H.M. (2005) PDBML: the representation of archival macromolecular structure data in XML. Bioinformatics, 21, 988–992.
13. Dutta, S., Burkhardt, K., Swaminathan, G.J., Kosada, T., Henrick, K., Nakamura, H., and Berman, H.M. (2008) Data deposition and annotation at the worldwide protein data bank. Methods in Molecular Biology, 426, 81–101.
14. Henrick, K., Feng, Z., Bluhm, W.F., Dimitropoulos, D., Doreleijers, J.F., Dutta, S., Flippen-Anderson, J.L., Ionides, J., Kamada, C., Krissinel, E., Lawson, C.L., Markley, J.L., Nakamura, H., Newman, R., Shimizu, Y., Swaminathan, J., Velankar, S., Ory, J., Ulrich, E.L., Vranken, W., Westbrook, J., Yamashita, R., Yang, H., Young, J., Yousufuddin, M., and Berman, H.M. (2008) Remediation of the protein data bank archive. Nucleic Acids Research, 36, D426–D433.
15. Joosten, R.P., te Beek, T.A., Krieger, E., Hekkelman, M.L., Hooft, R.W., Schneider, R., Sander, C., and Vriend, G. (2011) A series of PDB related databases for everyday needs. Nucleic Acids Research, 39, D411–D419.
16. Joosten, R.P. and Vriend, G. (2007) PDB improvement starts with data deposition. Science, 317, 195–196.
17. Hartshorn, M.J., Verdonk, M.L., Chessari, G., Brewerton, S.C., Mooij, W.T.M., Mortenson, P.N., and Murray, C.W. (2007) Diverse, high-quality test set for the validation of protein–ligand docking performance. Journal of Medicinal Chemistry, 50, 726–741.
18. Hawkins, P., Warren, G., Skillman, A., and Nicholls, A. (2008) How to do an evaluation: pitfalls and traps. Journal of Computer-Aided Molecular Design, 22, 179–190.
19. Dunbar, J.B., Smith, R.D., Yang, C.-Y., Ung, P.M.-U., Lexa, K.W., Khazanov, N.A., Stuckey, J.A., Wang, S., and Carlson, H.A. (2011) CSAR benchmark exercise of 2010: selection of the protein–ligand complexes. Journal of Chemical Information and Modeling, 51, 2036–2046.
20. Feng, Z., Chen, L., Maddula, H., Akcan, O., Oughtred, R., Berman, H.M., and Westbrook, J. (2004) Ligand Depot: a data warehouse for ligands bound to macromolecules. Bioinformatics, 20, 2153–2155.
21. Golovin, A. and Henrick, K. (2008) MSDmotif: exploring protein sites and motifs. BMC Bioinformatics, 9, 312.
22. Yamaguchi, A., Iida, K., Matsui, N., Tomoda, S., Yura, K., and Go, M. (2004) Het-PDB Navi.: a database for protein–small molecule interactions. Journal of Biochemistry, 135, 79–84 [Erratum: Journal of Biochemistry (Tokyo), 2004, 135 (5), 651].
23. Kleywegt, G.J. and Jones, T.A. (1998) Databases in protein crystallography. Acta Crystallographica Section D, 54, 1119–1131.
24.