Drug discovery is all about finding small molecules that interact in a desired way with larger molecules, namely proteins and other macromolecules in the human body. If the three-dimensional structures of both the small and the large molecule are known, their interaction can be tested by computer simulation with a reasonable degree of accuracy. Alternatively, if active ligands are already available, molecular similarity searches can be used to find new molecules. Such virtual screening can even be applied to compounds that have yet to be synthesized, as opposed to "real" screening, which requires cost- and labor-intensive laboratory testing of previously synthesized compounds.

Unique in its focus on the end user, this is a real "how to" book that does not presuppose prior experience in virtual screening or a background in computational chemistry. It is both a desktop reference and a practical guide to virtual screening applications in drug discovery, offering a comprehensive and up-to-date overview. The book is clearly divided into four major parts: the first provides a detailed description of the methods required for and applied in virtual screening, while the second discusses the most important challenges to improving the impact and success of the technique. The third and fourth parts offer practical guidelines and several case studies covering the most important scenarios for new drug discovery, together with general guidelines for the entire workflow of virtual screening studies. Throughout the text, medicinal chemists from academia as well as from large and small pharmaceutical companies report on their experience and pass on priceless practical advice on how to make the best use of these powerful methods.
Page count: 1034
Year of publication: 2011
Contents
Cover
Methods and Principles in Medicinal Chemistry
Title Page
Copyright
Dedication
List of Contributors
Preface
Reference
A Personal Foreword
Part One: Principles
Chapter 1: Virtual Screening of Chemical Space: From Generic Compound Collections to Tailored Screening Libraries
1.1 Introduction
1.2 Concepts of Chemical Space
1.3 Concepts of Druglikeness and Leadlikeness
1.4 Diversity-Based Libraries
1.5 Focused Libraries
1.6 Virtual Combinatorial Libraries and Fragment Spaces
1.7 Databases of Chemical and Biological Information
1.8 Conclusions and Outlook
1.9 Glossary
Acknowledgment
References
Chapter 2: Preparing and Filtering Compound Databases for Virtual and Experimental Screening
2.1 Introduction
2.2 Ligand Databases
2.3 Considering Physicochemical Properties
2.4 Undesirables
2.5 Property-Based Filtering for Selected Targets
2.6 Summary
References
Chapter 3: Ligand-Based Virtual Screening
3.1 Introduction
3.2 Descriptors
3.3 Search Databases and Queries
3.4 Virtual Screening Techniques
3.5 Conclusions
References
Chapter 4: The Basis for Target-Based Virtual Screening: Protein Structures
4.1 Introduction
4.2 Selecting a Protein Structure for Virtual Screening
4.3 Setting Up a Protein Model for vHTS
4.4 Summary
4.5 Glossary of Crystallographic Terms
References
Chapter 5: Pharmacophore Models for Virtual Screening
5.1 Introduction
5.2 Compilation of Compounds
5.3 Pharmacophore Model Generation
5.4 Validation of Pharmacophore Models
5.5 Pharmacophore-Based Screening
5.6 Postprocessing of Pharmacophore-Based Screening Hits
5.7 Pharmacophore-Based Parallel Screening
5.8 Application Examples for Synthetic Compound Screening
5.9 Application Examples for Natural Product Screening
5.10 Conclusions
References
Chapter 6: Docking Methods for Virtual Screening: Principles and Recent Advances
6.1 Principles of Molecular Docking
6.2 Docking-Based Virtual Screening Flowchart
6.3 Recent Advances in Docking-Based VS Methods
6.4 Future Trends in Docking
References
Part Two: Challenges
Chapter 7: The Challenge of Affinity Prediction: Scoring Functions for Structure-Based Virtual Screening
7.1 Introduction
7.2 Physicochemical Basis of Protein–Ligand Recognition
7.3 Classes of Scoring Functions
7.4 Interesting New Approaches to Scoring Functions
7.5 Comparative Assessment of Scoring Functions
7.6 Tailoring Scoring Strategies in Virtual Screening
7.7 Caveats for Development of Scoring Functions
7.8 Conclusions
Acknowledgment
References
Chapter 8: Protein Flexibility in Structure-Based Virtual Screening: From Models to Algorithms
8.1 How Flexible Are Proteins? – A Historical Perspective
8.2 Flexible Protein Handling in Protein–Ligand Docking
8.3 Flexible Protein Handling in Docking-Based Virtual Screening
8.4 Summary
References
Chapter 9: Handling Protein Flexibility in Docking and High-Throughput Docking: From Algorithms to Applications
9.1 Introduction: Docking and High-Throughput Docking in Drug Discovery
9.2 The Challenge of Accounting for Protein Flexibility in Docking
9.3 Accounting for Protein Flexibility in Docking-Based Drug Discovery and Design
9.4 Conclusions
References
Chapter 10: Consideration of Water and Solvation Effects in Virtual Screening
10.1 Introduction
10.2 Experimental Approaches for Analyzing Water Molecules
10.3 Computational Approaches for Analyzing Water Molecules
10.4 Water-Sensitive Virtual Screening: Approaches and Applications
10.5 Conclusions and Recommendations
References
Part Three: Applications and Practical Guidelines
Chapter 11: Applied Virtual Screening: Strategies, Recommendations, and Caveats
11.1 Introduction
11.2 What Is Virtual Screening?
11.3 Spectrum of Virtual Screening Approaches
11.4 Molecular Similarity as a Foundation and Caveat of Virtual Screening
11.5 Goals of Virtual Screening
11.6 Applicability Domain
11.7 Reference and Database Compounds
11.8 Biological Activity versus Compound Potency
11.9 Methodological Complexity and Compound Class Dependence
11.10 Search Strategies and Compound Selection
11.11 Virtual and High-Throughput Screening
11.12 Practical Applications: An Overview
11.13 LFA-1 Antagonist
11.14 Selectivity Searching
11.15 Concluding Remarks
Acknowledgments
References
Chapter 12: Applications and Success Stories in Virtual Screening
12.1 Introduction
12.2 Practical Considerations
12.3 Successful Applications of Virtual Screening
12.4 Conclusion
References
Part Four: Scenarios and Case Studies: Routes to Success
Chapter 13: Scenarios and Case Studies: Examples for Ligand-Based Virtual Screening
13.1 Introduction
13.2 1D Ligand-Based Virtual Screening
13.3 2D Ligand-Based Virtual Screening
13.4 3D Ligand-Based Virtual Screening
13.5 Summary
References
Chapter 14: Virtual Screening on Homology Models
14.1 Introduction
14.2 Homology Models versus Crystal Structures: Comparative Evaluation of Screening Performance
14.3 Challenges of Homology Model-Based Virtual Screening
14.4 Case Studies
References
Chapter 15: Target-Based Virtual Screening on Small-Molecule Protein Binding Sites
15.1 Introduction
15.2 Structure-Based VS for Histone Arginine Methyltransferase PRMT1 Inhibitors
15.3 Identification of Nanomolar Histamine H3 Receptor Antagonists by Structure- and Pharmacophore-Based VS
15.4 Summary
Acknowledgment
References
Chapter 16: Target-Based Virtual Screening to Address Protein–Protein Interfaces
16.1 Introduction
16.2 Some Recent PPIM Success Stories
16.3 Protein–Protein Interfaces
16.4 PPIMs' Chemical Space and ADME/Tox Properties
16.5 Drug Discovery, Chemical Biology, and In Silico Screening Methods: Overview and Suggestions for PPIM Search
16.6 Case Studies
16.7 Conclusions and Future Directions
References
Chapter 17: Fragment-Based Approaches in Virtual Screening
17.1 Introduction
17.2 In Silico Fragment-Based Approaches
17.3 Our Approach to High-Throughput Fragment-Based Docking
17.4 Lessons Learned from Our Fragment-Based Docking
17.5 Challenges of Fragment-Based Approaches
Acknowledgments
References
Appendix A: Software Overview
Appendix B: Virtual Screening Application Studies
Index
Methods and Principles in Medicinal Chemistry
Edited by R. Mannhold, H. Kubinyi, G. Folkers
Editorial Board
H. Buschmann, H. Timmerman, H. van de Waterbeemd, T. Wieland
Previous Volumes of this Series:
Rautio, Jarkko (Ed.)
Prodrugs and Targeted Delivery
Towards Better ADME Properties
2011
ISBN: 978-3-527-32603-7
Vol. 47
Smit, Martine J. / Lira, Sergio A. / Leurs, Rob (Eds.)
Chemokine Receptors as Drug Targets
2011
ISBN: 978-3-527-32118-6
Vol. 46
Ghosh, Arun K. (Ed.)
Aspartic Acid Proteases as Therapeutic Targets
2010
ISBN: 978-3-527-31811-7
Vol. 45
Ecker, Gerhard F. / Chiba, Peter (Eds.)
Transporters as Drug Carriers
Structure, Function, Substrates
2009
ISBN: 978-3-527-31661-8
Vol. 44
Faller, Bernhard / Urban, Laszlo (Eds.)
Hit and Lead Profiling
Identification and Optimization of Drug-like Molecules
2009
ISBN: 978-3-527-32331-9
Vol. 43
Sippl, Wolfgang / Jung, Manfred (Eds.)
Epigenetic Targets in Drug Discovery
2009
ISBN: 978-3-527-32355-5
Vol. 42
Todeschini, Roberto / Consonni, Viviana
Molecular Descriptors for Chemoinformatics
Volume I: Alphabetical Listing / Volume II: Appendices, References
2009
ISBN: 978-3-527-31852-0
Vol. 41
van de Waterbeemd, Han / Testa, Bernard (Eds.)
Drug Bioavailability
Estimation of Solubility, Permeability, Absorption and Bioavailability
Second, Completely Revised Edition
2008
ISBN: 978-3-527-32051-6
Vol. 40
Ottow, Eckhard / Weinmann, Hilmar (Eds.)
Nuclear Receptors as Drug Targets
2008
ISBN: 978-3-527-31872-8
Vol. 39
Vaz, Roy J. / Klabunde, Thomas (Eds.)
Antitargets
Prediction and Prevention of Drug Side Effects
2008
ISBN: 978-3-527-31821-6
Vol. 38
Series Editors
Prof. Dr. Raimund Mannhold
Molecular Drug Research Group
Heinrich-Heine-Universität
Universitätsstrasse 1
40225 Düsseldorf
Germany
Prof. Dr. Hugo Kubinyi
Donnersbergstrasse 9
67256 Weisenheim am Sand
Germany
Prof. Dr. Gerd Folkers
Collegium Helveticum
STW/ETH Zurich
8092 Zurich
Switzerland
Volume Editor
Prof. Dr. Christoph Sotriffer
University of Würzburg
Institute of Pharmacy and Food Chemistry
Am Hubland
97074 Würzburg
Germany
Cover Description
Virtual screening is a process of hierarchical filtering that searches chemical space for compounds suitable for interaction with targets from biological space. The illustrated hit stems from a virtual screening study conducted by Brenk et al. (2003), which is discussed in Chapters 10 and 12 of this book. (The support of Dr. Matthias Zentgraf in preparing this graphic is gratefully acknowledged.)
All books published by Wiley-VCH are carefully produced. Nevertheless, authors, editors, and publisher do not warrant the information contained in these books, including this book, to be free of errors. Readers are advised to keep in mind that statements, data, illustrations, procedural details or other items may inadvertently be inaccurate.
Library of Congress Card No.: applied for
British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library.
Bibliographic information published by the Deutsche Nationalbibliothek
The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet at http://dnb.d-nb.de.
© 2011 WILEY-VCH Verlag GmbH & Co. KGaA, Boschstr. 12, 69469 Weinheim, Germany
All rights reserved (including those of translation into other languages). No part of this book may be reproduced in any form – by photoprinting, microfilm, or any other means – nor transmitted or translated into a machine language without written permission from the publishers. Registered names, trademarks, etc. used in this book, even when not specifically marked as such, are not to be considered unprotected by law.
Typesetting Thomson Digital, Noida, India
Printing and Binding ???
Cover Design Schulz Grafik-Design, Fußgönheim
Printed in the Federal Republic of Germany
Printed on acid-free paper
ISBN: 978-3-527-32636-5
Dedicated with love to Edith, Mathilde, Jonathan, and Therese
List of Contributors
Éric Arnoult
Janssen-Cilag S.A.
Campus de Maigremont
BP 615
27106 Val de Reuil Cedex
France
Jürgen Bajorath
Rheinische Friedrich-Wilhelms-Universität
Department of Life Science Informatics
B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry
Dahlmannstr. 2
53113 Bonn
Germany
Daniele Bemporad
Janssen Pharmaceutical N.V.
Johnson & Johnson Pharmaceutical Research and Development
Turnhoutseweg 30
2340 Beerse
Belgium
Markus Boehm
Pfizer Pharma Therapeutics R & D
Worldwide Medicinal Chemistry
Eastern Point Rd
Groton, CT 06340
USA
Christophe Buyck
Tibotec BVBA
Turnhoutseweg 30
2340 Beerse
Belgium
Amedeo Caflisch
University of Zurich
Department of Biochemistry
Winterthurerstrasse 190
8057 Zurich
Switzerland
Claudio N. Cavasotto
University of Texas at Houston
School of Biomedical Informatics
7000 Fannin, Ste. 600
Houston, TX 77030
USA
Jason C. Cole
Cambridge Crystallographic Data Centre
12 Union Road
Cambridge CB2 1EZ
UK
Maxwell D. Cummings
Tibotec BVBA
Turnhoutseweg 30
2340 Beerse
Belgium
Preface
In an early definition by Walters, Stahl, and Murcko [1], virtual screening (VS) is described as the “Use of high-performance computing to analyze large databases of chemical compounds in order to identify possible drug candidates.”
Virtual screening has become an integral part of the drug discovery process. It has largely been a numbers game, focusing on questions such as: how can we filter the enormous chemical space of more than 10^60 conceivable compounds down to a manageable number that can be synthesized or purchased and tested? Although filtering the entire chemical universe is a fascinating prospect, more practical VS scenarios focus on designing and optimizing targeted combinatorial libraries and on enriching libraries of available compounds from in-house compound repositories or vendor offerings.
The purpose of virtual screening is to come up with hits of novel chemical structure that bind to the macromolecular target of interest. Thus, success of a virtual screen is defined in terms of finding interesting new scaffolds rather than many hits. Interpretations of VS accuracy should therefore be considered with caution. Low hit rates of interesting scaffolds are clearly preferable over high hit rates of already known scaffolds.
In a logical and didactic way, this volume is organized in four parts covering principles, challenges, practical guidelines, and case studies under different scenarios. The chapters of Part One are dedicated to virtual screening of chemical space, the processing of small-molecule databases for virtual screening, ligand-based and target-based virtual screening, virtual screening with 3D pharmacophore models, and docking methods. Challenges discussed in Part Two comprise affinity prediction, fragment-based approaches, handling of protein flexibility, consideration of water and solvation effects, as well as parallel virtual screening for compound profiling and prediction of off-target effects. Finally, strategies, recommendations, and caveats for applying virtual screening methodology are given, and many success stories are described.
As added value, this volume contains two appendices. A brief tabular compilation, including classification, short descriptions, references, and links, gives a very informative software overview. In addition, successful virtual screening application studies for pharmacological targets are tabulated.
We are very grateful to Christoph Sotriffer, who assembled a team of leading experts to discuss all the above-mentioned aspects. This book is well suited both for practitioners in medicinal chemistry and for graduate students who want to learn how to apply virtual screening methodology. We are also grateful to Frank Weinreich and Nicola Oberbeckmann-Winter for their ongoing support of and enthusiasm for our series "Methods and Principles in Medicinal Chemistry."
October 2010
Raimund Mannhold, Düsseldorf
Hugo Kubinyi, Weisenheim am Sand
Gerd Folkers, Zürich
Reference
1. Walters, W.P., Stahl, M.T., and Murcko, M.A. (1998) Virtual screening: an overview. Drug Discovery Today, 3, 160–178.
A Personal Foreword
Seek, and you will find.
Drug discovery in general and virtual screening in particular is a matter of searching. Often compared with looking for the needle in the haystack, virtual screening is an attempt to identify molecules with very particular properties by scanning through large pools of chemical compounds. As with any search, it makes sense to clarify a few questions before starting: What do I actually look for? Where should I search? How should the search be carried out? None of these questions is as trivial as it initially sounds – that is why an entire book can be dedicated to virtual screening.
Indeed, while the general concept is very simple, turning virtual screening into a practically useful method is much more complex. First of all, the object of the search, that is, the intended goal, must be defined. Obviously, in the context of drug discovery it is always a bioactive molecule, be it an enzyme inhibitor, a receptor agonist or antagonist, or a disruptor of macromolecular complexes. The essential question, however, is which molecular property determines the bioactivity and provides a suitable discrimination between molecules showing this activity and those that do not. The entire search makes sense only if such relevant and discriminating properties can be defined. Concepts of molecular similarity and models of molecular recognition are required in this context. In practice, the issue is further complicated by the fact that bioactivity alone is not sufficient for a virtual screening hit to be of value in drug discovery projects. Many additional criteria, such as synthetic accessibility and a suitable pharmacokinetic profile, must be met. Virtual screening must take such boundary conditions into account as well.
This leads to the second issue: the area in which the search should be carried out. Simply going through the entire collection of all hitherto synthesized or isolated chemical compounds is certainly not the best idea. Again, criteria must be defined to set reasonable boundaries to the search. On the other hand, the big advantage of virtual over experimental screening is the possibility to extend the search to new areas of chemical space, as characterized, for example, by virtual combinatorial libraries. There are many good reasons for following such a route, but the danger of inadvertently just increasing the haystack (without many more needles in it) should not be overlooked.
The third issue concerns the technology itself. Which strategies and which tools are available for carrying out efficient searches? Owing to the multifaceted nature of virtual screening, this touches many of the computational disciplines involved, from chemo- and bioinformatics to molecular modeling and computational chemistry. In fact, virtual screening is not a single method, but rather a process or workflow, a combination of multiple approaches. Thinking of virtual screening as a hierarchical filtering procedure is probably the most common view, as reflected in the illustration chosen for the cover of this book. This view of hierarchical filtering has been propagated in particular by Gerhard Klebe, one of the pioneers of structure-based virtual screening, whose lab I had the pleasure to join back in the year 2000. It was also there that I personally came in touch with virtual screening. At that time it was still a nascent technique, with just a few dozen publications available. But the impressive results that appeared in the literature paved the way for fast dissemination of the technology, with hundreds of applications and a wealth of methodological developments in the past 10 years. In this sense, virtual screening has matured into a standard technique of early drug discovery, widely applied in industry and academia, and there is certainly no doubt today that the technique works and delivers practically useful results.
Nevertheless, success is not guaranteed, and many limitations exist that certainly preclude a simple "black-box" application. Using virtual screening is easy nowadays, with many dedicated software tools and databases freely available to anyone. Applying it correctly, however, is not as straightforward, and even published studies sometimes raise doubts as to whether the chosen settings and procedures can really be expected to deliver meaningful results. Besides internal method validation, experimental confirmation would be mandatory, but this is not always readily accessible and is sometimes not carried out with sufficient detail and care. On the other hand, even very cautious application of the best possible techniques can lead to results that, upon experimental validation, disprove the original design hypothesis, for example, with respect to a certain requested interaction or binding mode. There is, in fact, a gray area in virtual screening where good results and valuable hits are obtained, but not for the right (or expected) reasons. One may attribute this to the fact that the filtering steps simply serve to eliminate all compounds with completely unsuitable properties. This increases the hit rate in the set of molecules that passed the filters and may lead to the discovery of actives even though the details of their interaction and their scores are not correctly estimated. For many practical purposes this may even be acceptable, but the scientific claim of the method certainly is to reliably predict active compounds on a rational basis of molecular similarity and correct assumptions about the underlying protein–ligand interactions.
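The hit-rate effect of hierarchical filtering described above can be made concrete with a toy sketch; the compounds, property values, and filter thresholds below are entirely hypothetical and not taken from any study in this book:

```python
def funnel(compounds, filters):
    """Apply filters in sequence and record how many compounds survive each step."""
    surviving = list(compounds)
    counts = [len(surviving)]
    for keep in filters:
        surviving = [c for c in surviving if keep(c)]
        counts.append(len(surviving))
    return surviving, counts

# A tiny mock library; 'active' marks the (normally unknown) true actives.
library = [
    {"name": "cpd1", "mw": 320, "logp": 2.1, "active": True},
    {"name": "cpd2", "mw": 680, "logp": 4.5, "active": False},  # too heavy
    {"name": "cpd3", "mw": 410, "logp": 6.8, "active": False},  # too lipophilic
    {"name": "cpd4", "mw": 290, "logp": 1.4, "active": True},
    {"name": "cpd5", "mw": 450, "logp": 3.9, "active": False},
]

hits, counts = funnel(library, [
    lambda c: c["mw"] <= 500,    # crude molecular weight filter
    lambda c: c["logp"] <= 5.0,  # crude lipophilicity filter
])
# counts == [5, 4, 3]: the hit rate rises from 2/5 in the raw library to
# 2/3 after filtering, although no filter "knows" which compounds are active.
```

This is exactly the gray-area mechanism mentioned above: enrichment arises merely from discarding compounds with unsuitable properties, without any claim about the correctness of individual predictions.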
These considerations bring me to my motivation for compiling this book: instead of focusing purely on the identification of active compounds, it is the question which underlying models, concepts, and procedures enable us to succeed in this endeavor. My feeling was that this should be of interest for the user and the developer alike. Accordingly, the goal was to arrange a book that on the one hand outlines the essential features for a successful practical application of virtual screening and on the other hand discusses the underlying hypotheses and limitations that are essential for a more complete understanding and also takes a step toward refined methods and an improved virtual screening technology.
With this intention in mind, I chose to organize the book in four major parts: principles, challenges, practical guidelines, and case studies under different scenarios. Part One, "Principles," illustrates the main concepts of virtual screening and its fundamental techniques and discusses the essential aspects of their successful application. As virtual screening requires a collection of compounds that offers a reasonable chance of containing molecules with the requested properties, attention is first paid to the underlying data. Simply screening the available corporate compound database is certainly the most straightforward approach, but success is by no means guaranteed, since the collection may be heavily biased toward certain chemotypes with relatively limited diversity. To proceed more systematically, it makes sense to think about compound databases in terms of chemical space, that is, the entire collection of all possible molecules. Which regions are of relevance for drug discovery in general and for my specific target in particular? How well are these regions covered by real, already existing compounds? And how could the search be expanded into unexplored areas by means of virtual compound libraries? The first chapter is dedicated to these more fundamental issues of search spaces, whereas the second chapter focuses on the more practical aspects related to compound databases. Chemical structures must be represented in a format that conveys all required information in an unambiguous way and is amenable to rapid processing by many different software tools. In addition, stereochemistry, tautomers, and different protonation states need to be considered. Furthermore, it may be advantageous to generate multiple conformers of each molecule. Once it is clear how structures should be represented and stored, the question arises which structures actually need to be processed.
As illustrated in detail in this chapter, a raw database should normally be prefiltered to eliminate all compounds that are highly unlikely to be of practical use in a drug discovery project.
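One widely used form of such prefiltering is a druglikeness check in the spirit of Lipinski's rule of five (MW <= 500, logP <= 5, at most 5 hydrogen-bond donors, at most 10 acceptors; compounds violating more than one criterion are flagged). The sketch below is a generic illustration of this idea, not a procedure prescribed by the chapter, and the property values are invented:

```python
def passes_rule_of_five(mw, logp, hbd, hba):
    """True if at most one of Lipinski's four criteria is violated
    (MW <= 500, logP <= 5, H-bond donors <= 5, H-bond acceptors <= 10)."""
    violations = sum([mw > 500, logp > 5, hbd > 5, hba > 10])
    return violations <= 1

# Illustrative property values, not computed from real structures.
assert passes_rule_of_five(mw=350, logp=2.3, hbd=2, hba=5)       # druglike
assert not passes_rule_of_five(mw=720, logp=6.1, hbd=3, hba=12)  # filtered out
```

In a real workflow, such property filters would be complemented by substructure filters for reactive or otherwise undesirable groups, as discussed in Chapter 2.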
Once the search databases are set up, screening can start even in the absence of a 3D target structure, provided that at least one active ligand is known. Accordingly, Chapter 3 introduces ligand-based virtual screening. This class of methods is based on the principle of molecular similarity, that is, the general assumption that similar molecules exhibit similar binding properties for a given target. As simple as this sounds, it is unfortunately not clear in which terms this similarity should be measured and which molecular properties are the best predictors of similarity in terms of binding affinity. Therefore, a discussion of descriptors and similarity measures lies at the heart of this chapter, before screening based on such similarity measures is described. In this context, attention is also paid both to the selection of the reference ligand for the search and to the machine-learning approaches that can be applied if multiple active and inactive ligands are available.
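A common instance of such a similarity measure is the Tanimoto coefficient computed on binary fingerprints. The sketch below illustrates the general idea only; the fingerprints, bit positions, and compound names are invented:

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient: shared bits / total distinct bits (range 0..1)."""
    if not fp_a and not fp_b:
        return 1.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)

reference = {1, 4, 9, 17, 23, 42}    # fingerprint of the known active ligand
database = {
    "cpd_A": {1, 4, 9, 17, 23, 42},  # identical fingerprint, T = 1.0
    "cpd_B": {1, 4, 9, 17, 30, 55},  # partial overlap, T = 0.5
    "cpd_C": {2, 8, 16, 32},         # no overlap, T = 0.0
}

# Rank database compounds by decreasing similarity to the reference.
ranked = sorted(database, key=lambda name: tanimoto(reference, database[name]),
                reverse=True)
# ranked == ["cpd_A", "cpd_B", "cpd_C"]
```

Which descriptor generates the fingerprint bits (structural keys, circular environments, pharmacophore features, etc.) is precisely the choice that Chapter 3 discusses.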
If structural information about the target is available, structure-based (i.e., target-based) virtual screening can be applied. It is the second large class of approaches, used either as an alternative to ligand-based screening or in combination with it. Before the techniques that use protein structure information are described in later chapters, Chapter 4 analyzes the quality of the underlying experimental data and shows how such data must be handled in order to avoid potential pitfalls. The focus here is on protein crystal structures and the possible errors and limitations of the corresponding structural data. This is coupled with a detailed discussion of how an appropriate protein structure can be selected and how it should be set up for virtual screening.
Chapter 5 naturally makes the transition from ligand-based to target-based virtual screening, as it deals with 3D pharmacophore models. Such models can be generated from a set of active ligands, from the structure of a binding site (ideally complexed with a ligand), or from a combination of both. Particular attention should be paid to model validation, as discussed in detail in this chapter. Enrichment assessments and enrichment metrics are illustrated; these aspects are relevant not only for pharmacophore-based techniques but for virtual screening in general. Since no chapter is dedicated exclusively to validation aspects, this section is a "must read" for anyone not yet familiar with the validation of virtual screening procedures. The second half of Chapter 5 is dedicated to the screening techniques based on pharmacophore models and to the postprocessing of screening hits. A broad range of application examples concludes the chapter.
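The enrichment metrics mentioned here can be sketched with the enrichment factor (EF), which compares the fraction of actives recovered in the top-ranked subset with the fraction expected from random selection. The ranking data below are made up for demonstration:

```python
def enrichment_factor(ranked_labels, top_fraction):
    """EF = (actives_top / n_top) / (actives_total / n_total)."""
    n = len(ranked_labels)
    n_top = max(1, round(n * top_fraction))
    actives_total = sum(ranked_labels)
    actives_top = sum(ranked_labels[:n_top])
    return actives_top * n / (actives_total * n_top)

# 1 = active, 0 = inactive, sorted by decreasing virtual screening score.
ranking = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]
ef_20 = enrichment_factor(ranking, 0.20)
# The top 20% (2 of 10 compounds) contains 2 of the 4 actives: EF = 2.5,
# i.e., 2.5-fold better than random selection.
```

An ideal method recovers all actives early (large EF at small fractions), whereas random ranking gives EF = 1 on average; such retrospective numbers are exactly what the validation sections of Chapter 5 discuss.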
Moving on to the core technique of target-based virtual screening, Chapter 6 of Part One is dedicated to docking methods. The fundamentals of molecular docking as a process of sampling and scoring are introduced, before going through the docking-based virtual screening workflow, with ligand and protein setup, choice of the docking program and its application, and postprocessing of the docking results. Finally, recent advances of the docking field are summarized, including those related to the enduring issue of protein flexibility and the role of water, two topics that already point to the next part of the book.
In Part Two, three of the most frequently encountered challenges in structure-based virtual screening are analyzed in detail: the first about scoring (Chapter 7), the second about protein flexibility (Chapters 8 and 9), and the third about the consideration of water and solvation effects (Chapter 10). All three issues are closely related to each other since protein flexibility and water molecules directly affect the scoring of protein–ligand complexes. With an ideal scoring method at hand, docking and structure-based virtual screening would reduce to a pure search problem. Nonbinders could be clearly distinguished from binders, the binders could be correctly ranked by affinity, and this ranking would be based on the correct docking pose. Unfortunately, none of these ultimate goals has been reached so far. In particular, the discrimination of nonbinders and the ranking of binders are still not possible with sufficient reliability. The chapter on scoring functions describes available approaches, their limitations, recent developments, and the general strategies and recommendations for their application in virtual screening.
Protein flexibility as the second major challenge is discussed in two separate chapters. The first is dedicated primarily to the algorithmic description of the methods available for handling protein flexibility, based on a new classification of the different approaches. The second chapter focuses on their application in high-throughput docking and virtual screening. As these chapters illustrate, a multitude of different approaches are already available for at least partial consideration of protein flexibility, but taking full account of all ligand and protein degrees of freedom and of the mutual adaptation between the binding partners is still out of reach for practical application in virtual screening. Accordingly, one should carefully select the most appropriate approach, depending on the known properties and the available data of the target system, and one should be aware that “surprises” may occur as long as flexible systems are not treated fully flexibly.
Similar considerations apply to water molecules, given that the accurate prediction of water-mediated effects is still not possible. Again, the main problems are scoring and the dynamic nature of water interactions. Nevertheless, as illustrated in Chapter 10, many methods for analyzing water molecules in protein binding sites are available and water-sensitive virtual screening can be carried out. Even though perfect predictions are out of reach, careful consideration of the current knowledge about water molecules can certainly improve the outcome of virtual screening studies.
Despite all the challenges and limitations, impressive results can be obtained with virtual screening, as illustrated in Part Three, dedicated to "Applications and Practical Guidelines." Strategies, recommendations, and caveats for applied virtual screening are summarized in Chapter 11. Focusing on ligand-based approaches, this chapter discusses the main aspects affecting the outcome of ligand-based virtual screening, outlines the applicability domain, and comments on opportunities and intrinsic limitations. This is followed in Chapter 12 by a summary of selected applications and success stories of prospective virtual screening. Here, particular emphasis is placed on workflow comparison. Both ligand-based and structure-based studies are presented, and it is shown that these two branches of virtual screening should not be seen as competing: it is rather their combination that may enhance hit rates.

Further application examples and a more detailed discussion of particular case studies and practical aspects are provided in the five chapters of Part Four, dedicated to virtual screening under different scenarios. As the first scenario, ligand-based application studies are presented and discussed (Chapter 13). In the absence of a target structure, an alternative scenario consists of homology modeling and the use of the resulting models for structure-based virtual screening; this is discussed in depth in Chapters 14 and 15. As with structure-based virtual screening using experimentally determined structures, the studies discussed primarily focus on small-molecule binding sites. In contrast, Chapter 16 is dedicated to the emerging field of protein–protein interactions and how target-based virtual screening can be used to identify modulators of such interactions.
Finally, Chapter 17 describes how fragment-based approaches may be used in virtual screening and illustrates how these techniques can lead to the detection of small bioactive compounds.
All application studies mentioned throughout the book are summarized in a tabular compilation in the Appendix. The same is true for all cited programs, tools, and databases for virtual screening. Each entry in the tables contains a reference to the literature or a web site address. In fact, the tables were compiled with the explicit hope that the reader and the user may find them helpful for rapidly retrieving further information.
Editing this book was a valuable experience for me. It was a real pleasure to get in touch with all the scientists from industry and academia who contributed to this volume. Their willingness to share their knowledge and experience in virtual screening in the form of this book was essential for making this project come true. I am, therefore, deeply indebted to all authors for their contributions and their cooperation. I also thank the members of my group at the University of Würzburg for their help, in particular with respect to the preparation of the Appendix.
The book would not have been realized in this form without the kind invitation, the encouragement, and the valuable suggestions of the series editors. Their support is gratefully acknowledged. Finally, I also want to thank Dr. Frank Weinreich and Dr. Nicola Oberbeckmann-Winter from Wiley-VCH in Weinheim for their very pleasant collaboration and support during all steps of editing this volume.
May all readers find what they seek – in virtual screening and in this book!
October 2010
Christoph Sotriffer, Würzburg
Part One
Principles
Chapter 1
Virtual Screening of Chemical Space: From Generic Compound Collections to Tailored Screening Libraries
Markus Boehm
1.1 Introduction
Today's challenge of making the drug discovery process more efficient remains unchanged. The need for developing safe and innovative drugs, under the increasing pressure of speed and cost reduction, has shifted the focus toward improving the early discovery phase of lead identification and optimization. “Fail early, fail fast, and fail cheap” has often been quoted as the key principle contributing to the overall efficiency gain in drug discovery. While high-throughput screening (HTS) of large compound libraries is still the major source for discovering novel hits in the pharmaceutical industry, virtual screening has made an increasing impact in many areas of the lead identification process and has evolved into an established computational technology in modern drug discovery over the past 10 years.
Traditionally, virtual screening is conducted simply by searching a company's proprietary database of compound collections, and this approach continues to be a mainstream application. However, the continuous development of novel and more sophisticated virtual screening methods has opened up the possibility of also searching for compounds that do not necessarily exist in physical form in a screening collection. Such compounds can be obtained from a multitude of external sources, such as compound libraries from commercial vendors, or from public or commercial databases. Moreover, virtual screening can deal with molecules that exist purely as virtual entities, derived from de novo design ideas or the enumeration of combinatorial libraries. Taken to its extreme, any molecule conceivable by the human mind can in theory be evaluated by virtual screening. This has led to the concept of chemical space comprising the entire collection of all possible molecules – real and imaginary – that could be created. Since this chemical space is huge, it is crucial for the success of drug discovery to identify those regions of chemical space that contain molecules of oral, druglike quality that are likely to be biologically active. Virtual screening has the unique capability of not only searching the small fraction of chemical space occupied by compounds in existing screening collections but also exploring new and so far undiscovered regions (Figure 1.1). The challenge for the future is to better define and systematically explore those promising areas of chemical space.
Figure 1.1 Regions of biologically and medicinally relevant chemical space within the continuum of chemical space. Only a small portion of chemical space has been sampled by existing compound collections, which led to the discovery of drugs (A). Virtual screening has the unique opportunity to expand into unexplored chemical space to find new pockets of space where drugs are likely to be discovered (B).
1.2 Concepts of Chemical Space
Despite the fact that the term chemical space has received widespread attention in drug discovery, only a few concrete definitions have been proposed. Lipinski suggested that chemical space “can be viewed as being analogous to the cosmological universe in its vastness, with chemical compounds populating space instead of stars” [1]. More concretely, chemical space can be defined as the entire collection of all meaningful chemical compounds, typically restricted to small organic molecules [2]. To navigate through the vastness of chemical space, compounds can be mapped onto the coordinates of a multidimensional descriptor space. Each dimension represents various properties describing the molecules, such as physicochemical or topological properties, molecular fingerprints, or similarity to a given reference compound [3]. Depending on the particular descriptor and property set used for defining a chemical space, the representation of compounds in this chemical space varies. Thus, the relative distribution of molecules within the chemical space and the relationships between them strongly depend on the chosen descriptor set. The consequence is that changes in the chemical representation of molecules are likely to result in changes in their neighborhood relationships. This aspect is important to keep in mind when it comes to measuring diversity or similarity within a set of molecules.
How vast is chemical space? Various estimates of its size have been proposed, with the number of all possible small organic compounds ranging anywhere from 10¹⁸ to 10¹⁸⁰ molecules [4]. The first attempt to systematically enumerate all molecules of up to 13 heavy atoms, applying basic chemical feasibility rules, resulted in fewer than 10⁹ structures [5]. However, with every additional heavy atom the number of possible structures grows exponentially due to the combinatorial explosion of enumeration. Thus, it is estimated that with fewer than 30 heavy atoms more than 10⁶³ molecules with a molecular weight of less than 500 can be generated that are predicted to be stable at room temperature and stable toward oxygen and water [6]. Compared to the estimated number of atoms in the entire observable universe (10⁸⁰), it seems that for all practical purposes chemical space is infinite, and any attempt to fully capture it even with computational methods appears to be futile. Moreover, contrasted with the roughly 10⁶ compounds in a typical screening collection of a large pharmaceutical company, it becomes obvious that only a tiny fraction of chemical space is ever examined.
One might ask why hit identification in drug discovery is successful, despite the fact that only a very limited set of compounds within the entire chemical space is being probed. It has been hypothesized that existing screening collections are not just randomly selected from chemical space, but are already enriched with molecules that are likely to be recognized by biological targets [7]. Many synthesized compounds have been derived from natural products, metabolites, protein substrates, natural ligands, and other biogenic molecules. Hence, a certain “biogenic bias” is inherently built into existing screening libraries resulting in an increased chance of finding active hits. This observation indicates that, given the vast and infinite size of chemical space, the goal should not be to exhaustively sample the entire space but to identify those regions that contain compounds likely to be active against biological targets (biologically relevant chemical space).
Another limiting factor is that not all biologically active molecules have the desired physicochemical properties required for oral drugs. There are many aspects important for a biologically active compound to become a safe and orally administered drug, such as absorption, permeability, metabolic stability, or toxicity. The concept of druglikeness has been introduced to determine the characteristics necessary for a drug likely to be successful. Over time, this has been further extended toward leadlike criteria with more stringent rules and guidelines recommended for compounds in a screening collection (Section 1.3). It is generally assumed that molecules have an increased chance to be successfully developed into a medicine when they satisfy lead- and druglike criteria (medicinally relevant chemical space).
Unfortunately, not much is known about the size and regions of biologically and medicinally relevant chemical space. Current definitions of such relevant spaces often rely on the knowledge of existing, mostly orally administered drugs, and are limited by the chemical diversity of historical screening collections and by the biological diversity of known druggable targets. On the one hand, the data accumulated so far suggest that compounds active against certain target families (e.g., GPCRs or kinases) tend to cluster together in specific regions of chemical space [8]. For individual targets from those families, the relevant chemical space seems to be well defined, and the likelihood of finding drugs in these defined regions is high (Section 1.5). On the other hand, there are many target classes that have been deemed difficult or undruggable, such as certain proteases or phosphatases. Also, a fairly unresolved area in drug discovery is the identification of small-molecule modulators of protein–protein interactions in biological signaling cascades [9]. It is assumed that the chemical space represented by traditional screening collections is inadequate to successfully tackle these “tough targets,” and new regions of chemical space need to be explored. Possible sources of chemical matter potentially occupying such unexplored regions of space can be derived from natural products or through emerging technologies such as diversity-oriented synthesis for generating natural product-like combinatorial libraries (Section 1.4).
1.3 Concepts of Druglikeness and Leadlikeness
It has been demonstrated that the lead development stage contributes 40% to the overall attrition rate throughout the whole drug development process, beginning from the first assay development to final registration [10]. Therefore, it is assumed that significant improvements can be realized in the early phase of lead identification and development. In-depth analysis of marketed oral drugs led to the introduction of druglikeness that defines the physicochemical properties that determine key issues of drug development, such as absorption and permeability. Lipinski's influential analysis of compounds failing to become orally administered drugs resulted in the well-known “rule of five” [11]. In short, the rule predicts that poor absorption or permeation of a drug is more likely to occur when there are more than 5 H-bond donors, 10 H-bond acceptors, the molecular weight is greater than 500, and the calculated log P is greater than 5 (Table 1.1). The concept of druglikeness has been widely accepted and embraced by scientists in drug discovery nowadays, with many variations and extensions of the original rules, and it has served its purpose well to help optimize pharmacokinetic properties of drug candidate molecules [12, 13].
Table 1.1 Comparison of properties typically used for leadlikeness and druglikeness criteria.
  Property                            Leadlikeness   Druglikeness
  Molecular weight (MW)               ≤350           ≤500
  Lipophilicity (clog P)              ≤3.0           ≤5.0
  H-bond donors (sum of NH and OH)    ≤3             ≤5
  H-bond acceptors (sum of N and O)   ≤8             ≤10
  Polar surface area (PSA)            ≤120 Ų         ≤150 Ų
  Number of rotatable bonds           ≤8             ≤10
  Structural filters                  reactive groups, warhead-containing agents, frequent hitters, promiscuous inhibitors

The rules defining druglikeness, however, should not necessarily be applied to lead molecules. One of the reasons is the observation that, on average, compounds in comparison to their initial leads become larger and more complex during the lead optimization phase, and the associated physicochemical properties (e.g., molecular weight, calculated log P, etc.) increase accordingly [14, 15]. To ensure that the properties of an optimized compound remain within druglike space, the criteria for leadlikeness have been more narrowly defined to accommodate the expected growth during drug optimization (Table 1.1). Complementary to the comparison of drug and lead pairs from historical data, Hann et al. analyzed in a more theoretical approach, using a simplified ligand–receptor interaction model, how the probability of finding a hit varies with the complexity of a molecule [16]. The model shows that the probability of observing a “useful interaction event” decreases when molecules become increasingly complex. This suggests that less complex molecules, in accordance with leadlike criteria, are more likely to turn into hits (albeit weaker) serving as common starting points for the successful discovery of drugs.
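Property cutoffs of this kind are straightforward to apply as a computational filter. The following is a minimal Python sketch assuming the molecular properties have already been computed by a cheminformatics toolkit; the thresholds follow Table 1.1, the compound values are hypothetical, and note that Lipinski's original rule only flags compounds violating two or more of its criteria:

```python
# Rule-based property filters in the spirit of Table 1.1.
# Thresholds from the table; the compound data below are invented examples.
DRUGLIKE = {"mw": 500, "clogp": 5.0, "hbd": 5, "hba": 10, "psa": 150, "rotb": 10}
LEADLIKE = {"mw": 350, "clogp": 3.0, "hbd": 3, "hba": 8, "psa": 120, "rotb": 8}

def violations(props, limits):
    """Count how many property limits a compound exceeds."""
    return sum(1 for key, limit in limits.items() if props[key] > limit)

def passes(props, limits, allowed=0):
    """True if the compound exceeds at most `allowed` limits."""
    return violations(props, limits) <= allowed

# Hypothetical compound: MW 420, clog P 3.8, 2 donors, 6 acceptors,
# PSA 95 A^2, 6 rotatable bonds.
compound = {"mw": 420, "clogp": 3.8, "hbd": 2, "hba": 6, "psa": 95, "rotb": 6}
print(passes(compound, DRUGLIKE))  # within all druglike limits
print(passes(compound, LEADLIKE))  # too large and too lipophilic for leadlike limits
```

In practice such a filter is run over an entire virtual library before screening, with the structural filters of Table 1.1 applied separately as substructure queries.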
Another aspect underlining the importance of leadlike properties is associated with the fundamental shift in the screening paradigm in drug discovery from functional biological assays to biochemical assays. While biological assays measure a true biological activity, biochemical assays are designed to measure specific molecular interactions between a compound and its target. Biochemical assays are highly sensitive assays, well suited for screening compounds in a high-throughput fashion, but due to their artificial nature they are also susceptible to compound interference resulting in false positive hits. It has been suggested that compounds with leadlike properties also interact with their targets in a leadlike manner, that is, by noncovalent binding through hydrogen bonds, hydrophobic interactions, and monoionic bonding [17]. In general, such desirable interaction types result in reversible, time-independent, and competitive binding characteristics allowing the generation of meaningful structure–activity data. In contrast, nonleadlike compounds tend to bind to their target in nonleadlike ways, such as forming covalent, chelate, or polyionic bonds. Thus, nonleadlike compounds are more prone to generating artifact data in biochemical assays.
Among the well-known offenders with nonleadlike properties are protein-reactive compounds, warhead-containing agents, frequent hitters, and aggregator compounds (Table 1.1) [18]. Computationally, the elimination of reactive and warhead-containing compounds can be accomplished by applying various sets of substructure filters [17, 19]. Frequent hitters can be identified by statistical models or other virtual screening methods [20]. Aggregator compounds have been described as being promiscuous inhibitors by forming aggregates in solution, resulting in nonspecific binding and interference with the biochemical assay [21]. However, they are difficult to predict computationally and require additional biophysical methods (e.g., light scattering experiments) or modifications of the biochemical assay (e.g., addition of detergent or protein serum) to support their detection experimentally [22]. Exacerbating the problem, the interference of compounds in biochemical screens resulting in artifact data mostly depends on the individual assay conditions, which makes it difficult to develop generally applicable rules for detecting potential false positives across different assays.
The ultimate goal of identifying compounds with leadlike properties is to design high-quality screening libraries, whether for experimental or virtual screening purposes [23]. From a practical standpoint, it appears that leadlike criteria are more straightforward to implement by applying rules to filter out nonleadlike compounds, with the aim of enriching the compound collection with leadlike matter. In other words, one can agree on which compounds not to screen, but the question of which compounds to screen often leads to lengthy debates among experienced medicinal chemists.
1.4 Diversity-Based Libraries
Since the advent of large-scale combinatorial chemistry in drug discovery coupled with high-speed parallel synthesis of thousands of compounds, the concept of molecular diversity has increasingly gained importance. When little or nothing is known about the biological target, it is often assumed that screening a compound library as diverse as possible maximizes the chance of finding active hits. Moreover, the continuous addition of compounds to the screening file, either from internal combinatorial library efforts or through purchase of external compound collections, is most valuable when the underlying overall diversity can be expanded. At the same time, there is an ever-growing pressure to reduce costs by decreasing the number of compounds that need to be screened while simultaneously maintaining diversity. Hence, well-defined strategies for the optimal design of diversity-based libraries are necessary.
1.4.1 Concepts of Molecular Diversity
The generally accepted understanding of molecular diversity is a quantitative description of dissimilarity between molecules in a given set of compounds. The exact interpretation of this concept, however, has created quite a heated debate in the scientific literature [24]. For example, Roth fervently advocated that per se “there is no such thing as diversity” [25]. Diversity of chemical structure does not necessarily imply diversity of biological activity. In order to be meaningful, diversity can only be applied within a frame of reference, that is, the biological assay. Hence, structural diversity of compounds should be interpreted only with respect to their relative effect in biological screens. Finding descriptors for biological activity is necessary to describe the diversity of biological activities for compounds present in a library. Unfortunately, it is often difficult or impossible to predict in advance which descriptors are most effective in a given situation. While it remains a matter of subjectivity what makes a compound set diverse, how to quantify diversity, or whether one compound set is more diverse than another, the minimum value gained by a diversity application is the elimination of redundancy within a screening set. A diverse set of compounds should contain only nonredundant molecules that simultaneously span a wide range of properties covering the chemical space.
The basis of removing redundancy from a compound set is formed by the general belief that similar molecules typically exhibit similar biological activities. This concept has been defined as the similarity property principle or neighborhood behavior, and is the fundamental assumption behind all similarity and diversity applications [26]. Although generally accepted, one can quickly find arguments against this principle, as there are many examples described where subtle modifications of a compound can lead to dramatic changes in activity (activity cliffs, “magic methyl,” etc.) or major changes in the molecular structure not resulting in significant activity differences (flat SAR). From a statistical point of view, however, it has been demonstrated that a set of compounds similar to an active hit contains a higher number of actives compared to a random set, thus increasing the probability of finding actives [27]. Various groups have analyzed large activity data sets and came to the conclusion that on average there is a 30% chance that a compound within a certain similarity cutoff (Tanimoto coefficient ≥0.85 using Daylight fingerprints) of an active hit is itself active [28, 29]. The downside of this finding is that diversity methods selecting a representative compound within a subset of similar compounds incur a 70% chance of picking an inactive compound and excluding compounds that might have had activity. Exacerbating the effect, diversity selections often tend to more aggressively reduce the size of screening sets by loosening similarity criteria beyond the range where the similarity property principle is applicable. This might lead to a decreased coverage of biological space, limiting the chance of finding actives within the chosen subset.
1.4.2 Descriptor-Based Diversity Selection
Various strategies for the design of diversity-based screening collections have been proposed. Before initiating the selection process, some more fundamental questions should be addressed. For instance, it is often unclear how large a screening library should be and how many cluster representatives need to be selected. Using fingerprints and default similarity cutoffs for clustering (see above) and assuming the presence of actives in a cluster, there is only a 30% probability of identifying an active hit when a single representative per cluster is chosen. The selection of five compounds per cluster increases the chance of finding actives to 80% [28]. This finding suggests the selection of multiple representatives per cluster to increase the likelihood of uncovering actives. However, this comes at the expense of including fewer clusters during the selection process.
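The 30% and 80% figures quoted above can be reproduced approximately with a simple independence model; the independence assumption is a simplification introduced here for illustration, not taken from Ref. [28]:

```python
# If each compound near an active cluster member has an (assumed independent)
# probability p = 0.3 of being active itself, the chance that a selection of
# k cluster representatives contains at least one active is 1 - (1 - p)**k.
def p_at_least_one_active(k, p=0.3):
    return 1 - (1 - p) ** k

for k in (1, 2, 5):
    print(k, round(p_at_least_one_active(k), 2))
# k = 1 recovers the cited ~30% chance; k = 5 gives ~0.83, in line with
# the ~80% figure quoted from Ref. [28].
```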
A mathematical model was developed by Harper et al. to provide a more quantitative framework for assessing the optimal parameters of a screening collection and their effect on the probability of producing lead series in a given biological assay [30]. For each cluster in a screening collection, the percentage of compounds expected to hit the biological target, as well as the probability of an existing lead molecule in the cluster, is empirically estimated. According to the model, the expected number of lead series per screen (“lead discovery rate”) increases linearly with the number of compounds in a screening library. However, the probability of finding one or more lead series in a given screen does not grow proportionally with the size of the library. For instance, it was estimated that an average hit rate of 1.2 leads per screen is required to find at least one lead on 70% of screens. To increase the proportion of screens identifying leads to 80% and beyond requires a sharp increase in the number of compounds to be screened. Many companies have experienced this effect of diminishing returns: dramatic increases in the size of their screening collections have not translated into a proportional increase in successful screening campaigns. One of the main conclusions from the analysis is that, in order to increase the chance of finding lead series, a screening library of a given size should contain as many diverse clusters as possible, ideally with only one or few representatives per cluster. Increasing the number of compounds per cluster at the cost of decreasing the number of clusters ultimately lowers the likelihood of finding leads.
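The relation between lead discovery rate and screen success rate can be sketched with a Poisson model: if lead series arrive at an expected rate lam per screen, the probability of at least one is 1 − exp(−lam). This is a simplified reading of the Harper et al. framework, not its full empirical parameterization, but it reproduces the quoted 1.2-leads-for-70% figure:

```python
import math

# Poisson sketch of the "lead discovery rate": lam is the expected number
# of lead series per screen.
def p_at_least_one_lead(lam):
    return 1 - math.exp(-lam)

def required_rate(target):
    """Lead discovery rate needed to find >=1 lead on a given fraction of screens."""
    return -math.log(1 - target)

print(round(p_at_least_one_lead(1.2), 2))  # ~0.70, matching the cited estimate
print(round(required_rate(0.80), 2))       # ~1.61 leads/screen for 80% success
print(round(required_rate(0.95), 2))       # ~3.00: diminishing returns set in
```

Because the lead discovery rate grows only linearly with library size while the required rate grows like −log of the failure fraction, pushing the success rate toward 100% demands disproportionately large libraries, which is the diminishing-returns effect described above.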
In principle, there are three main steps required to carry out diversity-based subset selections: (1) the calculation of descriptors representing the compound structures, (2) a quantitative method to describe the similarity or dissimilarity of molecules in relationship to each other, and (3) selection methods to identify compounds based on their similarity or dissimilarity values that best represent the entire compound set. In the following, the three steps are described in more detail.
Numerous descriptors encoding molecular properties with varying degrees of information content and complexity have been developed [31]. The current version of the Dragon software alone calculates over 3200 molecular descriptors [32]. The many different representations can be classified according to the type of information they encode [4, 33]. Whole-molecule descriptors represent different properties of a molecule in a single number, such as molecular weight or calculated log P. Descriptors derived from 2D representations of molecules include topological indices, which describe a structure according to its size and shape by a single number, and fingerprint-based descriptors, characterizing molecules by their substructural features. Graph-based molecular descriptors attempt to reduce the molecular complexity while capturing the overall information content of the molecular topology and properties. Descriptors derived from the 3D structure of molecules consist of fingerprint-based descriptors and other more complex representations, encoding properties such as shape or pharmacophore information of a molecule.
In order to quantify the degree of similarity or dissimilarity between two compounds, various similarity coefficients have been developed for different applications, many of them widely used for chemical similarity searching [34]. Several groups compared the performance of different similarity coefficients in combination with various fingerprint types, and it was often found that the Tanimoto coefficient markedly outperformed other similarity measures, making it the similarity coefficient of choice for fingerprint-based similarity searching [35].
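For binary fingerprints with a bits set in one molecule, b in the other, and c in common, the Tanimoto coefficient is c/(a + b − c). A minimal sketch, with bit positions invented for illustration:

```python
# Tanimoto coefficient on binary fingerprints represented as Python sets of
# "on" bit positions. Real fingerprints would come from a cheminformatics
# toolkit; the bit assignments below are hypothetical.
def tanimoto(fp_a, fp_b):
    """|A & B| / |A | B| for two sets of on-bits."""
    if not fp_a and not fp_b:
        return 1.0  # convention: two empty fingerprints are identical
    return len(fp_a & fp_b) / len(fp_a | fp_b)

fp1 = {3, 17, 42, 101, 255}
fp2 = {3, 17, 42, 200}
print(tanimoto(fp1, fp2))  # 3 shared bits / 6 distinct bits = 0.5
```

The coefficient ranges from 0 (no bits in common) to 1 (identical fingerprints), which is what makes thresholds such as the ≥0.85 cutoff mentioned above easy to interpret.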
Methods for selecting diverse subsets from a compound collection include (1) dissimilarity-based compound selection, (2) clustering, (3) partitioning, and (4) the use of optimization approaches, and are discussed in the following. Dissimilarity-based methods involve the selection of compound sets that maximize the dissimilarity between pairs of molecules [36]. In an iterative fashion, those molecules from the compound collection that are most dissimilar to the already selected compounds are added to the subset. The MaxMin selection technique and the sphere exclusion algorithm are the most widely used dissimilarity-based methods [37, 38]. Clustering methods involve the identification of groups of compounds such that compounds within a cluster are highly similar whereas compounds from different clusters are dissimilar. Choosing one or only a few representatives per cluster, usually the cluster centroids, has been demonstrated to be the best strategy for designing a highly diverse subset to maximize the chances of hit identification. Many different clustering algorithms have been developed, and they can be divided into hierarchical and nonhierarchical methods [39]. Since clustering is based on relative similarities of molecules to each other and not on an absolute scale in chemical space, it is often difficult to compare two different data sets, which is required, for instance, when purchasing new compound collections. In contrast, partitioning or cell-based methods provide an absolute measure of compounds in terms of their location in chemical space, spanned by a predefined descriptor set [40]. A low-dimensional descriptor space is required, where descriptors are mapped onto each axis of the chemical space by binning (partitioning) the range of their values into a set of cells.
Molecules that fall into the same cells can be considered similar, and a diverse subset of compounds is selected by taking one or a few representatives from each cell. Pearlman's well-known BCUT descriptors, typically mapped into a six-dimensional space, were developed for the use in partitioning-based approaches [41]. A chemical global positioning system, ChemGPS, was introduced to provide a low-dimensional chemical space as a frame of reference suitable for diversity analysis [42]. A set of 72 descriptors was condensed into a nine-dimensional space by means of principal component analysis. Finally, optimization-based approaches use genetic algorithms or simulated annealing to efficiently sample large chemical spaces [43, 44].
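Of the dissimilarity-based methods mentioned above, MaxMin is easily sketched: repeatedly add the candidate whose minimum distance to the compounds already selected is largest. The sketch below uses 1 − Tanimoto as the distance and invented fingerprints; production implementations add a random seed choice and distance caching:

```python
# Greedy MaxMin diversity selection over fingerprints stored as sets of
# on-bit positions. The fingerprints below are hypothetical illustrations.
def tanimoto_distance(a, b):
    return 1.0 - len(a & b) / len(a | b)

def maxmin_select(fps, n_pick, seed=0):
    """Indices of n_pick diverse compounds, starting from index `seed`."""
    selected = [seed]
    while len(selected) < n_pick:
        best = max(
            (i for i in range(len(fps)) if i not in selected),
            # score = distance to the nearest already-selected compound
            key=lambda i: min(tanimoto_distance(fps[i], fps[j]) for j in selected),
        )
        selected.append(best)
    return selected

fps = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {1, 3, 4}]
print(maxmin_select(fps, 2))  # [0, 2]: picks the compound sharing no bits with the seed
```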
1.4.3 Scaffold-Based Diversity Selection
An alternative approach to describe the diversity of a compound collection has been realized by the classification of molecules according to their underlying scaffolds. Compared to methods using traditional descriptors such as fingerprints, scaffold classification methods provide a different view of comparing databases of compounds. Scaffold diversity and coverage, as well as over- or underrepresented regions of scaffold space, can be easily assessed across different data sets, such as publicly or commercially available screening collections [45]. Scaffold analysis is also applied to HTS data to retrieve more chemically intuitive clustering results [46]. Finally, classification of compounds according to their scaffolds can help identify privileged structures and serve as a starting point for designing scaffold-focused libraries (Section 1.5) [47, 48].
Although there is no exact definition for a molecular scaffold, it generally refers to a common structural core motif. Scaffolds often resemble the chemotypes of molecules, which medicinal chemists use to categorize compounds into chemical series. Bemis and Murcko have introduced the widely used classification of compounds according to their molecular framework [49]. The molecular framework of a compound, also referred to as “Murcko scaffold,” is formed by deleting all terminal acyclic side-chain atoms from the original molecule. In addition, all atom and bond types can be removed to arrive at the graph framework of the molecule. The removal of linker length and ring size information results in the reduced graph representation of the molecule. The feature tree descriptor used in FTrees is a popular example where compounds are described by a graph (tree) that represents each molecular fragment and functional group (feature) as a node and their connectivity as edges [50]. This reduces the molecular descriptor complexity while still maintaining the overall topology and property information, making this descriptor ideal for scaffold hopping searches [51]. In a related approach, “molecular equivalence indices” (MEQI) classify molecules with respect to a variety of structural features and topological shapes, which can be used to hierarchically classify compound sets into classes of chemotypes [52]. Recently, a hierarchical classification system, Scaffold Tree, has been described [53]. Each level of the hierarchy consists of well-defined chemical substructures by iteratively removing rings from the molecular framework. Prioritization rules ensure that peripheral rings are removed first to achieve unique classification trees. 
Besides the benefit of its visually intuitive presentation of the scaffold tree, potential applications of this method are the detection of potential chemical series from screening hits on the basis of their hierarchical classification and the retrosynthetic combinatorial analysis of library compounds to identify the scaffolds that have been most likely used. The idea of a hierarchical classification of scaffolds has been expanded to incorporate the biological space associated with the compounds. The program Scaffold Hunter has been developed both to analyze the complex relationship of structure and activity data and to identify scaffolds of compounds likely to contain the desired biological activity [54, 55]. Analogous to the Scaffold Tree approach, scaffolds are hierarchically organized, however, using activity data as the key selection criterion during the structural deconstruction and tree building process. Scaffolds that share activity with their neighboring scaffolds in the hierarchical tree but are not represented by compounds in the data set are identified. Such virtual scaffolds can serve as starting points for the discovery of new biologically relevant scaffolds.
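At its simplest, scaffold-based classification amounts to grouping compounds by a scaffold key such as the Murcko framework. A sketch, assuming the scaffolds have already been computed by a toolkit; the compound identifiers and scaffold strings are hypothetical placeholders:

```python
from collections import defaultdict

# Group screening hits by a precomputed scaffold key. In practice the key
# would be the Murcko framework derived from each structure; here both the
# IDs and the scaffold strings are invented for illustration.
hits = [
    ("cpd-001", "scaffold-A"),
    ("cpd-002", "scaffold-A"),
    ("cpd-003", "scaffold-B"),
]

series = defaultdict(list)
for compound_id, scaffold in hits:
    series[scaffold].append(compound_id)

# Scaffolds with several members suggest a potential chemical series.
for scaffold, members in sorted(series.items()):
    print(scaffold, len(members))
```

Hierarchical schemes such as Scaffold Tree refine this idea by relating scaffold keys to each other through iterative ring removal rather than treating them as flat, unrelated classes.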
1.4.4 Sources of Diversity
Besides the established sources of obtaining diversity, mainly from historic compound collections, publicly or commercially available compound libraries, and natural products, novel approaches toward expanding diversity have been described in the recent literature.
The systematic enumeration of all possible organic molecules of up to 11 atoms of C, N, O, and F, applying simple valence, chemical stability, and synthetic feasibility rules, has been reported [56]. A total of 26.4 million compounds were generated and collected in a chemical universe database (GDB-11). An extended version (GDB-13), containing 970 million molecules of up to 13 atoms of C, N, O, S, and Cl enumerated in a similar manner, has since been published, making it the largest database of publicly available virtual molecules [5]. It contains a vast number of unexplored structures and provides a new source of design ideas for identifying bioactive small molecules and scaffolds. The first successful application of the GDB, the discovery of a novel class of NMDA glycine site inhibitors, has recently been reported [57].
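To give a flavor of such rule-based enumeration, the sketch below applies a single, deliberately simplified feasibility rule at the level of heavy-atom compositions: the total valence must suffice for a connected molecule, with hydrogens assumed to fill any remaining valences. This is a hypothetical toy, not the actual GDB procedure, which enumerates full molecular graphs under much richer stability and feasibility rules.

```python
from itertools import combinations_with_replacement

# Standard valences of the GDB-11 element set.
VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1}

def feasible_formulas(max_atoms):
    """Enumerate heavy-atom compositions of up to `max_atoms` atoms of
    C, N, O, F whose total valence could support a connected molecule."""
    results = []
    for n in range(1, max_atoms + 1):
        for combo in combinations_with_replacement("CNOF", n):
            # A connected graph on n atoms needs at least n - 1 bonds,
            # i.e. a total heavy-atom valence of at least 2 * (n - 1).
            if sum(VALENCE[a] for a in combo) >= 2 * (n - 1):
                results.append("".join(combo))
    return results

formulas = feasible_formulas(3)
print("FFF" in formulas)  # False: three fluorines cannot form a connected molecule
```

Even this crude filter conveys the spirit of the approach: exhaustive generation followed by pruning with simple chemical rules, leaving a complete but feasible candidate space.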
Bioactive molecules have been shown to contain only a limited number of unique ring systems. For that reason, and in analogy to the chemical universe of the GDB, several groups have explored the ring universe to identify novel ring systems and heteroaromatic scaffolds. A comprehensive collection of more than 40 000 different rings extracted from the CAS registry has been classified into ring systems on the basis of their topology, showing that the distribution of rings is not continuous but contains many significant voids [58]. A drug ring database containing ring systems from proprietary and commercial compound collections has been developed as a source for scaffold replacement design [59]. In another study, a database of over 600 000 heteroaromatic ring scaffolds was generated; comparison with scaffolds associated with biological activity revealed that bioactive scaffolds are very sparsely distributed, forming well-defined “bioactivity islands” in virtual scaffold space [60]. It is, however, unclear whether biological activity is truly limited to such a small region of ring space, or whether most ring systems are simply not synthetically accessible and thus have never been prepared. To overcome this limitation, the challenge ahead is to actively develop novel synthetic routes to molecules with so far unexplored ring systems. A “virtual exploratory heterocyclic library” (VEHICLe) of almost 25 000 ring systems was created, containing a complete enumerated set of heteroaromatic rings after removal of those deemed synthetically infeasible according to a set of empirical rules [61]. Interestingly, the authors find that only 1700 of these rings (7%) have been published, and of those only a small fraction is routinely used in the synthesis of druglike molecules.
They highlight many simple and apparently tractable heterocycles that have not been described in the literature so far and put out a “challenge to creative organic chemists to either make them or explain why they cannot be made.”
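The ring-extraction step underlying such analyses can be sketched without any cheminformatics toolkit: on a plain molecular graph, the fused-ring systems are exactly the connected components that remain after all bridge bonds (bonds whose deletion disconnects the graph) are removed. Below is a minimal sketch of this idea, using Tarjan's bridge-finding algorithm on toy atom-numbered graphs; the graph encoding is our own illustrative assumption.

```python
def ring_systems(graph):
    """Return the fused-ring systems of a molecular graph, found as the
    connected components remaining after all bridge bonds are removed."""
    disc, low, bridges, t = {}, {}, set(), [0]

    def dfs(u, parent):                      # Tarjan's bridge finding
        disc[u] = low[u] = t[0]; t[0] += 1
        for v in graph[u]:
            if v == parent:
                continue
            if v in disc:                    # back edge: u-v lies on a cycle
                low[u] = min(low[u], disc[v])
            else:
                dfs(v, u)
                low[u] = min(low[u], low[v])
                if low[v] > disc[u]:         # no cycle through u-v: a bridge
                    bridges.add(frozenset((u, v)))

    for node in graph:
        if node not in disc:
            dfs(node, None)

    seen, systems = set(), []
    for start in graph:                      # components over non-bridge bonds
        if start in seen:
            continue
        comp, stack = {start}, [start]
        while stack:
            u = stack.pop()
            for v in graph[u]:
                if v not in comp and frozenset((u, v)) not in bridges:
                    comp.add(v)
                    stack.append(v)
        seen |= comp
        if len(comp) > 1:                    # singleton atoms carry no ring
            systems.append(comp)
    return systems

# Two three-membered rings joined by a single (bridge) bond between atoms 2 and 3:
biaryl_like = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
               3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(ring_systems(biaryl_like))  # two separate ring systems
```

Rings sharing an atom or bond end up in the same component and are thus reported as one fused system, which matches the usual ring-system notion used in these database studies.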
It has been argued that the trend in drug discovery over the past decade toward achiral, aromatic compounds, presumably driven by their amenability to high-throughput synthesis, may have contributed to a higher failure rate among drug development candidates [62]. Concurrently, it has been reported that the complexity of a molecule is a key criterion determining the success of a drug candidate [63]. Increased molecular complexity, measured as the extent of bond saturation and the number of chiral centers, has been shown to correlate with overall improved compound developability. Changes in molecular complexity affect the three-dimensional shape of a compound, which can lead to improved interactions with the target receptor. The resulting gains in potency and selectivity ultimately increase the chance of a successful drug candidate. Although aromatic rings and achiral frameworks still dominate classical drug structures, these findings suggest a trend away from flat aromatic structures toward more complex molecules.
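A saturation measure commonly used in such analyses, Fsp3, is simply the fraction of carbon atoms that are sp3-hybridized, reported alongside the stereocenter count. A minimal sketch, taking precomputed hybridization labels and a stereocenter count as input rather than deriving them from a structure (which a real toolkit would do):

```python
def complexity_metrics(carbon_hybridizations, n_stereocenters):
    """Complexity in the sense described above:
    Fsp3 = (# sp3 carbons) / (# carbons), plus the chiral center count."""
    fsp3 = (sum(h == "sp3" for h in carbon_hybridizations)
            / len(carbon_hybridizations))
    return {"Fsp3": fsp3, "stereocenters": n_stereocenters}

print(complexity_metrics(["sp2"] * 6, 0)["Fsp3"])  # benzene-like: 0.0
print(complexity_metrics(["sp3"] * 6, 0)["Fsp3"])  # cyclohexane-like: 1.0
```

By this measure a flat heteroaromatic screening hit scores near zero, while a saturated, stereochemically rich natural product scores near one, making the metric a convenient proxy for the "escape from flatness" discussed above.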
In the recent past, natural products and natural product-like molecules that lie outside traditional “rule of five” druglike space have gained renewed interest in drug discovery [64, 65]. Technological advances have enabled approaches that combine the unique diversity of building blocks from natural product sources with the strengths of combinatorial library design. The diversity-oriented synthesis (DOS) approach allows the rapid synthesis of chemical libraries containing structurally complex molecules with a range of scaffold variations and chiral centers, creating a broad distribution of diverse compounds capable of binding a range of biological targets [66]. The main emphasis of natural product-like drug discovery has so far been on the identification of novel tool compounds to probe the target of interest and to support further pharmacological in vitro studies, rather than on the development of oral drugs [67, 68]. Finally, macrocyclic molecules (containing a ring of 12 or more atoms) represent another emerging structural class outside classical oral druglike space, with strong potential for historically difficult targets such as protein–protein interactions [69]. Macrocycles are capable of forming high-affinity interactions with the shallow contact surfaces typical of the interfaces involved in protein–protein interactions. Owing to their intrinsic conformational constraint, they can position arrays of functional groups across a wide interaction area without the penalty of introducing multiple rotatable bonds.
Virtual screening provides an excellent opportunity to explore large databases of virtual small molecules and ring systems such as those highlighted above: it can support the design of combinatorial libraries built around novel scaffolds or ring systems, or be employed for tasks such as bioisosteric replacement design and scaffold hopping. However, to increase the chance of successfully synthesizing molecules proposed by virtual screening methods, more effort must be put into the development of predictive methods that account for chemical feasibility. Such methods should predict not only whether a particular compound can be synthesized but also whether it can be rapidly followed up (i.e., chemically enabled) with analogues during lead optimization in a medicinal chemistry campaign. Computational approaches to assess synthetic accessibility, mainly based on retrosynthetic or complexity-based analysis of molecules, have been described in the literature [70, 71].
1.5 Focused Libraries