A look at the methods and algorithms used to predict protein structure
A thorough knowledge of the function and structure of proteins is critical for the advancement of biology and the life sciences as well as the development of better drugs, higher-yield crops, and even synthetic biofuels. To that end, this reference sheds light on the methods used for protein structure prediction and reveals the key applications of modeled structures. This indispensable book covers the applications of modeled protein structures and unravels the relationship between pure sequence information and three-dimensional structure, which continues to be one of the greatest challenges in molecular biology. With this resource, readers will find an all-encompassing examination of the problems, methods, tools, servers, databases, and applications of protein structure prediction and they will acquire unique insight into the future applications of the modeled protein structures. The book begins with a thorough introduction to the protein structure prediction problem and is divided into four themes: a background on structure prediction, the prediction of structural elements, tertiary structure prediction, and functional insights.
Within those four sections, the following topics are covered:
* Databases and resources that are commonly used for protein structure prediction
* The structure prediction flagship assessment (CASP) and the protein structure initiative (PSI)
* Definitions of recurring substructures and the computational approaches used for solving sequence problems
* Difficulties with contact map prediction and how sophisticated machine learning methods can solve those problems
* Structure prediction methods that rely on homology modeling, threading, and fragment assembly
* Hybrid methods that achieve high-resolution protein structures
* Parts of the protein structure that may be conserved and used to interact with other biomolecules
* How the loop prediction problem can be used for refinement of the modeled structures
* The computational model that detects the differences between protein structure and its modeled mutant
Whether working in the field of bioinformatics or molecular biology research or taking courses in protein modeling, readers will find the content in this book invaluable.
Page count: 913
Year of publication: 2011
Table of Contents
Cover
WILEY SERIES ON BIOINFORMATICS: COMPUTATIONAL TECHNIQUES AND ENGINEERING
Title page
Copyright page
PREFACE
CONTRIBUTORS
CHAPTER 1 INTRODUCTION TO PROTEIN STRUCTURE PREDICTION
1.1. INTRODUCTION TO PROTEIN STRUCTURES
1.2. PROTEIN STRUCTURE PREDICTION METHODS
CHAPTER 2 CASP: A DRIVING FORCE IN PROTEIN STRUCTURE MODELING
2.1. WHY CRITICAL ASSESSMENT OF PROTEIN STRUCTURE PREDICTION (CASP) WAS NEEDED?
2.2. CASP PRINCIPLES AND ORGANIZATION
2.3. CASP PROCESS
2.4. METHOD CLASSES AND PREDICTION DIFFICULTY CATEGORIES
2.5. TBM
2.6. FREE MODELING OF NEW FOLD PROTEINS
2.7. OTHER MODELING CATEGORIES
2.8. SERVERS IN CASP
2.9. MODELING CHALLENGES AND CASP INITIATIVES
CHAPTER 3 THE PROTEIN STRUCTURE INITIATIVE
3.1. BACKGROUND, RATIONALE, AND HISTORY
3.2. OVERVIEW, PIPELINE, AND RESOURCES
3.3. TARGET SELECTION AND TARGET CATEGORIES
3.4. PERFORMANCE OF PSI
3.5. DISSEMINATION OF RESULTS
3.6. CONCLUSIONS AND FUTURE PERSPECTIVES
CHAPTER 4 PREDICTION OF ONE-DIMENSIONAL STRUCTURAL PROPERTIES OF PROTEINS BY INTEGRATED NEURAL NETWORKS
4.1. INTRODUCTION
4.2. LOCAL STRUCTURAL PROPERTIES
4.3. GLOBAL STRUCTURAL PROPERTIES
4.4. SPINE AND REAL-SPINE
4.5. CONCLUSION AND OUTLOOK
CHAPTER 5 LOCAL STRUCTURE ALPHABETS
5.1. INTRODUCTION
5.2. REPEATING STRUCTURAL ELEMENTS IN PROTEINS
5.3. BEYOND SECONDARY STRUCTURES
5.4. LOCAL STRUCTURE LIBRARIES
5.5. PBS
5.6. CONCLUSIONS AND PERSPECTIVES
ACKNOWLEDGMENTS
CHAPTER 6 SHEDDING LIGHT ON TRANSMEMBRANE TOPOLOGY
6.1. INTRODUCTION
6.2. STRUCTURE OF INTEGRAL MEMBRANE PROTEINS
6.3. TOPOLOGY PREDICTION OF MEMBRANE PROTEINS
6.4. DATABASES AND BENCHMARK SETS
6.5. CONCLUSIONS
CHAPTER 7 CONTACT MAP PREDICTION BY MACHINE LEARNING
7.1. INTRODUCTION
7.2. BINARY CONTACT MAP PREDICTION BY 2D-RNN
7.3. INCORPORATING TEMPLATE INFORMATION
7.4. FILTERING CONTACT MAPS
7.5. MULTI-CLASS DISTANCE MAPS
7.6. CASP8 EVALUATION
7.7. CONCLUSIONS
CHAPTER 8 A SURVEY OF REMOTE HOMOLOGY DETECTION AND FOLD RECOGNITION METHODS
8.1. INTRODUCTION
8.2. LITERATURE REVIEW
8.3. HIERARCHICAL MULTI-CLASS CLASSIFIERS
8.4. EXPERIMENTAL RESULTS
8.5. CONCLUSIONS
CHAPTER 9 INTEGRATIVE PROTEIN FOLD RECOGNITION BY ALIGNMENTS AND MACHINE LEARNING
9.1. INTRODUCTION
9.2. ALIGNMENT FOLD RECOGNITION METHODS
9.3. MACHINE LEARNING FOLD RECOGNITION METHODS
9.4. CONCLUSIONS
ACKNOWLEDGMENTS
CHAPTER 10 TASSER-BASED PROTEIN STRUCTURE PREDICTION
10.1. INTRODUCTION
10.2. METHODOLOGY
10.3. BENCHMARKING TASSER FOR STRUCTURE PREDICTION
10.4. FURTHER DEVELOPMENTS OF TASSER
10.5. TASSER’S PERFORMANCE IN CASP6-CASP8
10.6. APPLICATIONS OF TASSER
10.7. CONCLUSIONS
ACKNOWLEDGEMENTS
CHAPTER 11 COMPOSITE APPROACHES TO PROTEIN TERTIARY STRUCTURE PREDICTION: A CASE-STUDY BY I-TASSER
11.1. INTRODUCTION
11.2. I-TASSER: A COMPOSITE METHOD FOR PROTEIN STRUCTURE PREDICTION
11.3. AB INITIO PREDICTION OF I-TASSER ON SMALL PROTEINS
11.4. BLIND TEST OF I-TASSER IN CASP EXPERIMENTS
11.5. CONCLUDING REMARKS
CHAPTER 12 HYBRID METHODS FOR PROTEIN STRUCTURE PREDICTION
12.1. INTRODUCTION
12.2. SOURCES OF LIMITED STRUCTURAL DATA
12.3. TRANSLATION INTO STRUCTURAL RESTRAINTS
12.4. USE OF LIMITED EXPERIMENTAL DATA TO ELUCIDATE STRUCTURE
12.5. CONCLUSIONS AND FUTURE OUTLOOK
ACKNOWLEDGEMENT
CHAPTER 13 MODELING LOOPS IN PROTEIN STRUCTURES
13.1. INTRODUCTION
13.2. STRUCTURAL AND FUNCTIONAL ROLES OF LOOPS IN PROTEINS
13.3. PREDICTION OF LOOP CONFORMATIONS
13.4. APPLICATION OF LOOP PREDICTIONS
CHAPTER 14 MODEL QUALITY ASSESSMENT USING A STATISTICAL PROGRAM THAT ADOPTS A SIDE CHAIN ENVIRONMENT VIEWPOINT
14.1. INTRODUCTION
14.2. BACKGROUND
14.3. THE CIRCLE ALGORITHM: AN MQAP
14.4. THE FAMS-ACE2 (CIRCLE + CONSENSUS METHOD) ALGORITHM
14.5. CONCLUSIONS
CHAPTER 15 MODEL QUALITY PREDICTION
15.1. INTRODUCTION
15.2. A BRIEF HISTORY OF MODEL QUALITY ASSESSMENT
15.3. STATE-OF-THE-ART METHODS
15.4. PER-RESIDUE MODEL QUALITY PREDICTION
15.5. THE ASSESSMENT OF MODEL QUALITY ASSESSMENT
15.6. ONLINE RESOURCES
15.7. CONCLUSIONS AND FUTURE DIRECTIONS FOR DEVELOPERS
CHAPTER 16 LIGAND-BINDING RESIDUE PREDICTION
16.1. INTRODUCTION
16.2. EVALUATION OF LIGAND-BINDING PREDICTION METHODS
16.3. APPLICATION: HOMOLOGY MODELING OF BINDING SITES
16.4. CONCLUSION AND FUTURE OUTLOOK
CHAPTER 17 MODELING AND VALIDATION OF TRANSMEMBRANE PROTEIN STRUCTURES
17.1. INTRODUCTION
17.2. COMPARATIVE MODELING
17.3. EXPERIMENTAL DATA FITTING
17.4. QUALITY ASSESSMENT
ACKNOWLEDGMENTS
CHAPTER 18 STRUCTURE-BASED MACHINE LEARNING MODELS FOR COMPUTATIONAL MUTAGENESIS
18.1. INTRODUCTION
18.2. METHODOLOGY
18.3. PROTEIN-SPECIFIC MUTAGENESIS MODELS
18.4. UNIVERSAL MODELS OF THERMODYNAMIC STABILITY
18.5. CONCLUSIONS
CHAPTER 19 CONFORMATIONAL SEARCH FOR THE PROTEIN NATIVE STATE
19.1. THE QUEST FOR THE PROTEIN NATIVE STATE
19.2. EXHAUSTIVE SEARCH: DISCRETIZATION OF CONFORMATIONAL SPACE
19.3. SYSTEMATIC SEARCH: MD
19.4. BIASED RANDOM WALK: METROPOLIS MONTE CARLO (MC)
19.5. GUIDED SEARCH OF CONFORMATIONAL SPACE
19.6. ENHANCED SAMPLING OF CONFORMATIONAL SPACE
19.7. DISCUSSION OF FUTURE RESEARCH DIRECTIONS
CHAPTER 20 MODELING MUTATIONS IN PROTEINS USING MEDUSA AND DISCRETE MOLECULE DYNAMICS
20.1. INTRODUCTION
20.2. METHODS
20.3. RESULTS AND DISCUSSION
20.4. CONCLUSION AND FUTURE DIRECTIONS
ACKNOWLEDGMENTS
Index
Color Plates
WILEY SERIES ON BIOINFORMATICS: COMPUTATIONAL TECHNIQUES AND ENGINEERING
Series Editors, Yi Pan & Albert Zomaya
Knowledge Discovery in Bioinformatics: Techniques, Methods and Applications / Xiaohua Hu & Yi Pan
Grid Computing for Bioinformatics and Computational Biology / Albert Zomaya & El-Ghazali Talbi
Analysis of Biological Networks / Björn H. Junker & Falk Schreiber
Bioinformatics Algorithms: Techniques and Applications / Ion Mandoiu & Alexander Zelikovsky
Machine Learning in Bioinformatics / Yanqing Zhang & Jagath C. Rajapakse
Biomolecular Networks / Luonan Chen, Rui-Sheng Wang, & Xiang-Sun Zhang
Computational Systems Biology / Huma Lodhi
Computational Intelligence and Pattern Analysis in Biology Informatics / Ujjwal Maulik, Sanghamitra, & Jason T. Wang
Mathematics of Bioinformatics: Theory, Practice, and Applications / Matthew He
Introduction to Protein Structure Prediction: Methods and Algorithms / Huzefa Rangwala & George Karypis
Copyright © 2010 by John Wiley & Sons, Inc. All rights reserved
Published by John Wiley & Sons, Inc., Hoboken, New Jersey
Published simultaneously in Canada
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.
Library of Congress Cataloging-in-Publication Data:
Rangwala, Huzefa.
Introduction to protein structure prediction : methods and algorithms / Huzefa Rangwala, George Karypis.
p. cm.—(Wiley series in bioinformatics; 14)
Includes bibliographical references and index.
ISBN 978-0-470-47059-6 (hardback)
ISBN 978-1-118-09946-9 (ebk)
1. Proteins—Structure—Mathematical models. 2. Proteins—Structure—Computer simulation. I. Karypis, G. (George) II. Title.
QP551.R225 2010
572′.633—dc22
2010028352
PREFACE
PROTEIN STRUCTURE PREDICTION
Proteins play a crucial role in governing several life processes. Stunningly complex networks of proteins perform innumerable functions in every living cell. Knowing the function and structure of proteins is crucial for the development of better drugs, higher yield crops, and even synthetic biofuels. As such, knowledge of protein structure and function leads to crucial advances in life sciences and biology. The motivation behind the structural determination of proteins is based on the belief that structural information provides insights as to their function, which will ultimately result in a better understanding of intricate biological processes.
Breakthroughs in large-scale sequencing have led to a surge in the available protein sequence information that has far outstripped our ability to characterize the structural and functional characteristics of these proteins. Several research groups have been working on determining the three-dimensional structure of proteins using a wide variety of computational methods. The problem of unraveling the relationship between the amino acid sequence of a protein and its three-dimensional structure has been one of the grand challenges in molecular biology. The importance and far-reaching implications of being able to predict the structure of a protein from its amino acid sequence are manifested by the ongoing biennial competition on “Critical Assessment of Protein Structure Prediction” (CASP), which started more than 16 years ago. CASP is designed to assess the performance of current structure prediction methods, and over the years the number of participating groups has continued to increase.
This book presents a series of chapters by authors who are involved in the task of structure determination and using modeled structures for applications involving drug discovery and protein design. The book is divided into the following themes.
BACKGROUND ON STRUCTURE PREDICTION
Chapter 1 provides an introduction to the protein structure prediction problem along with information about databases and resources that are widely used. Chapters 2 and 3 provide information regarding two very important initiatives in the field: (i) the structure prediction flagship competition (CASP), and (ii) the protein structure initiative (PSI), respectively. Since many of the approaches developed have been tested in the CASP competition, Chapter 2 lays the foundation for the need for such an evaluation, the problem definitions, significant innovations, competition format, as well as future outlook. Chapter 3 describes the protein structure initiative, which is designed to determine representative three-dimensional structures within the human genome.
PREDICTION OF STRUCTURAL ELEMENTS
Within each structural entity called a protein there lies a set of recurring substructures, and within these substructures are smaller substructures. Beyond the goal of predicting the three-dimensional structure of a protein from its sequence, several other problems have been defined, and methods have been developed for solving them. Chapters 4–6 provide the definitions of these recurring substructures, called local alphabets or secondary structures, and the computational approaches used for solving these problems. Chapter 6 specifically focuses on the class of transmembrane proteins, which are known to be harder to crystallize. Knowing which pairs of residues within a protein are in contact or in close proximity provides useful distance constraints that can be used while modeling the three-dimensional structure of the protein. Chapter 7 focuses on the problem of contact map prediction and also shows the use of sophisticated machine learning methods to solve the problem. A successful solution for each of these subproblems assists in solving the overarching protein structure prediction problem.
TERTIARY STRUCTURE PREDICTION
Chapters 8–11 discuss the widely used structure prediction methods that rely on homology modeling, threading, and fragment assembly. Chapters 8 and 9 discuss the problems of fold recognition and remote homology detection, which attempt to model the three-dimensional structure of a protein using known structures. Chapters 10 and 11 discuss approaches that combine threading with modeling the protein in parts or fragments; such approaches usually help in modeling the structure of proteins that have no close homolog within the structure databases. Chapter 12 is a survey of the hybrid methods that use a combination of computational and experimental methods to achieve high-resolution protein structures in a high-throughput manner. Chapter 17 provides information about the challenges in modeling transmembrane proteins along with a discussion of some of the widely used methods for these sets of proteins.
Chapter 13 describes the loop prediction problem and how the technique can be used for refinement of the modeled structures. Chapters 14 and 15 assess the modeled structures and provide a notion of the quality of structures. This is extremely important for biologists, who would like a metric that describes the quality of a structure before using it. Chapter 19 provides insights into the different conformations that a protein may take and the approaches used to sample the different conformations.
FUNCTIONAL INSIGHTS
Certain parts of the protein structure may be conserved and interact with other biomolecules (e.g., proteins, DNA, RNA, and small molecules), performing a particular function through such interactions. Chapter 16 discusses the problem of ligand-binding site prediction and its role in determining the function of proteins. The approach uses some of the homology modeling principles used for modeling the entire structure. Chapter 18 introduces a computational model that detects the differences between a protein structure (modeled or experimentally determined) and its modeled mutant. Chapter 20 describes the use of molecular dynamics-based approaches for modeling mutants.
ACKNOWLEDGEMENTS
We wish to acknowledge the many people who have helped us with this project. We first thank all the coauthors who spent time and energy editing their chapters and who also served as reviewers, providing critical feedback for improving other chapters. Kevin Deronne, Christopher Kauffman, and Rezwan Ahmed also assisted in reviewing several of the chapters and helped the book take a form that is complete on the topic of protein structure prediction and exciting to read. Finally, we wish to thank our families and friends.
We hope that you as a reader benefit from this book and feel as excited about this field as we are.
Huzefa Rangwala
George Karypis
CONTRIBUTORS
NIR BEN-TAL, Department of Biochemistry and Molecular Biology, Tel Aviv University, Tel Aviv, Israel
AURÉLIE BORNOT, Institut National de la Santé et de la Recherche Médicale, UMR-S 665, Dynamique des Structures et Interactions des Macromolécules Biologiques (DSIMB), Université Paris Diderot, Paris, France
ALEXANDRE G. DE BREVERN, Institut National de la Santé et de la Recherche Médicale, Université Paris Diderot, Institut National de la Transfusion Sanguine, 75015, Paris, France
JIANLIN CHENG, Computer Science Department and Informatics Institute, University of Missouri, Columbia, MO 65211
FENG DING, Department of Biochemistry and Biophysics, University of North Carolina, Chapel Hill, NC 27599
NICHOLAS E. DIXON, School of Chemistry, University of Wollongong, NSW 2522, Australia
NIKOLAY V. DOKHOLYAN, Department of Biochemistry and Biophysics, University of North Carolina, Chapel Hill, NC 27599
ESHEL FARAGGI, Indiana University School of Informatics, Indiana University-Purdue University Indianapolis, and Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, IN 46202
KRZYSZTOF FIDELIS, Protein Structure Prediction Center, Genome Center, University of California, Davis, Davis, CA
ANDRAS FISER, Department of Systems and Computational Biology and Department of Biochemistry, Albert Einstein College of Medicine, Bronx, NY 10461
NARCIS FERNANDEZ-FUENTES, Leeds Institute of Molecular Medicine, University of Leeds, Leeds, UK
ADAM GODZIK, Program in Bioinformatics and Systems Biology, Sanford-Burnham Medical Research Institute, La Jolla, CA 92037
THOMAS HUBER, The University of Queensland, School of Chemistry and Molecular Biosciences, QLD, Australia
AGNEL PRAVEEN JOSEPH, Institut National de la Santé et de la Recherche Médicale, UMR-S 665, Dynamique des Structures et Interactions des Macromolécules Biologiques (DSIMB), Université Paris Diderot, Paris, France
KAZUHIKO KANOU, School of Pharmacy, Kitasato University, Tokyo 108-8641, Japan
GEORGE KARYPIS, Department of Computer Science, University of Minnesota, Minneapolis, MN 55455
CHRIS KAUFFMAN, Department of Computer Science, University of Minnesota, Minneapolis, MN 55455
BOSTJAN KOBE, The University of Queensland, School of Chemistry and Molecular Biosciences, Brisbane, Australia
ANDRIY KRYSHTAFOVYCH, Protein Structure Prediction Center, Genome Center, University of California, Davis, Davis, CA
ALBERTO J.M. MARTIN, Complex and Adaptive Systems Lab, School of Computer Science and Informatics, UCD Dublin, Ireland
MAJID MASSA, Department of Bioinformatics and Computational Biology, George Mason University, Manassas, VA 20110
LIAM J. MCGUFFIN, School of Biological Sciences, The University of Reading, Reading, UK
CATHERINE MOONEY, Shields Lab, School of Medicine and Medical Science, University College Dublin, Ireland
JOHN MOULT, Institute for Bioscience and Biotechnology Research, University of Maryland, Rockville, MD 20850
DMITRI MOURADOV, The University of Queensland, School of Chemistry and Molecular Biosciences, QLD, Australia
CHRISTINE ORENGO, Department of Structural and Molecular Biology, University College London, London, UK
SHASHI BHUSHAN PANDIT, Center for the Study of Systems Biology, School of Biology, Georgia Institute of Technology, Atlanta, GA 30318
GIANLUCA POLLASTRI, Complex and Adaptive Systems Lab, School of Computer Science and Informatics, UCD Dublin, Ireland
HUZEFA RANGWALA, Department of Computer Science, George Mason University, Fairfax, VA 22030
BURKHARD ROST, Department of Biochemistry and Molecular Biophysics, Columbia University, New York, NY 10032
AMBRISH ROY, Center for Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109
MAYA SCHUSHAN, Department of Biochemistry and Molecular Biology, Tel Aviv University, Tel Aviv, Israel
AMARDA SHEHU, Department of Computer Science, George Mason University, Fairfax, VA 22030
MAYUKO TAKEDA-SHITAKA, School of Pharmacy, Kitasato University, Tokyo 108-8641, Japan
ISTVÁN SIMON, Institute of Enzymology, BRC, Hungarian Academy of Sciences, Budapest, Hungary
JEFFREY SKOLNICK, Center for the Study of Systems Biology, School of Biology, Georgia Institute of Technology, Atlanta, GA 30318
ALLISON N. TEGGE, Computer Science Department and Informatics Institute, University of Missouri, Columbia, MO 65211
GENKI TERASHI, School of Pharmacy, Kitasato University, Tokyo 108-8641, Japan
GÁBOR E. TUSNADY, Institute of Enzymology, BRC, Hungarian Academy of Sciences, Budapest, Hungary
HIDEAKI UMEYAMA, School of Pharmacy, Kitasato University, Tokyo 108-8641, Japan
IOSIF I. VAISMAN, Department of Bioinformatics and Computational Biology, George Mason University, Manassas, VA 20110
IAN WALSH, Complex and Adaptive Systems Lab, School of Computer Science and Informatics, UCD Dublin, Ireland
ZHENG WANG, Computer Science Department, University of Missouri, Columbia, MO 65211
SITAO WU, Center for Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109
SHUANGYE YIN, Department of Biochemistry and Biophysics, University of North Carolina, Chapel Hill, NC 27599
YANG ZHANG, Center for Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109
HONGYI ZHOU, Center for the Study of Systems Biology, School of Biology, Georgia Institute of Technology, Atlanta, GA 30318
YAOQI ZHOU, Indiana University School of Informatics, Indiana University-Purdue University Indianapolis, and Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, IN 46202
CHAPTER 1
INTRODUCTION TO PROTEIN STRUCTURE PREDICTION
HUZEFA RANGWALA
Department of Computer Science, George Mason University, Fairfax, VA
GEORGE KARYPIS
Department of Computer Science, University of Minnesota, Minneapolis, MN
Proteins have a vast influence on the molecular machinery of life. Stunningly complex networks of proteins perform innumerable functions in every living cell. Knowing the function and structure of proteins is crucial for the development of improved drugs, better crops, and even synthetic biofuels. As such, knowledge of protein structure and function leads to crucial advances in life sciences and biology.
With recent advances in large-scale sequencing technologies, we have seen an exponential growth in protein sequence information. Protein structures are primarily determined using X-ray crystallography or nuclear magnetic resonance (NMR) spectroscopy, but these methods are time-consuming, expensive, and not feasible for all proteins. The experimental approaches to determine protein function (e.g., gene knockout, targeted mutation, and gene-expression inhibition studies) are low-throughput in nature [1,2]. As such, our ability to produce sequence information far outpaces the rate at which we can produce structural and functional information.
Consequently, researchers are increasingly reliant on computational approaches to extract useful information from experimentally determined three-dimensional (3D) structures and functions of proteins. Unraveling the relationship between pure sequence information and 3D structure and/or function remains one of the fundamental challenges in molecular biology.
Function prediction is generally approached by using inheritance through homology [2], that is, proteins with similar sequences (common evolutionary ancestry) frequently carry out similar functions. However, several studies [2–4] have shown that a stronger correlation exists between structure conservation and function, that is, structure implies function, and a higher correlation exists between sequence conservation and structure, that is, sequence implies structure (sequence → structure → function).
1.1. INTRODUCTION TO PROTEIN STRUCTURES
In this section we introduce the basic definitions and facts about protein structure, describe the four levels of protein structure, and provide details about protein structure databases.
1.1.1. Protein Structure Levels
Within each structural entity called a protein lies a set of recurring substructures, and within these substructures are smaller substructures still. As an example, consider hemoglobin, the oxygen-carrying molecule in human blood. Hemoglobin has four subunits (polypeptide chains) that come together to form its quaternary structure. Each subunit assembles (i.e., folds) itself independently to form a tertiary structure. These tertiary structures are composed of multiple secondary structure elements, which in hemoglobin’s case are α-helices. α-Helices (and their counterpart, β-sheets) have elegant repeating patterns that depend on the sequence of amino acids.
1.1.1.1. Primary Structure.
Amino acids form the basic building blocks of proteins. An amino acid consists of a central carbon atom (Cα) bonded to an amino (NH2) group, a carboxyl (COOH) group, and a side chain (R) group. The side chain group differentiates the various amino acids. In proteins, there are primarily 20 different amino acids that form the building blocks. A protein is a chain of amino acids linked by peptide bonds. Each peptide bond forms between the carboxyl group of one amino acid and the amino group of the next. This polypeptide chain of amino acids is known as the primary structure or the protein sequence.
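As a minimal sketch of how a primary structure is commonly handled in software, the snippet below treats a protein as a string over a 20-letter alphabet. The one-letter residue codes and the function names are assumptions introduced for illustration; they do not appear in the text above.

```python
from collections import Counter

# Assumed one-letter codes for the 20 standard amino acids.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def validate_sequence(sequence: str) -> str:
    """Return the upper-cased sequence, or raise ValueError on non-standard letters."""
    seq = sequence.strip().upper()
    unknown = sorted(set(seq) - STANDARD_AMINO_ACIDS)
    if unknown:
        raise ValueError(f"non-standard residue codes: {unknown}")
    return seq


def peptide_bond_count(sequence: str) -> int:
    """A linear chain of n residues is joined by n - 1 peptide bonds."""
    return max(len(sequence) - 1, 0)


def composition(sequence: str) -> Counter:
    """Count how often each residue type occurs in the chain."""
    return Counter(sequence)
```

For example, a hypothetical four-residue chain such as "MKVL" contains three peptide bonds.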
1.1.1.2. Secondary Structure.
A sequence of characters representing the secondary structure of a protein describes the general 3D form of local regions. These regions organize themselves independently from the rest of the protein into patterns of repeatedly occurring structural fragments. The most dominant local conformations of polypeptide chains are α-helices and β-sheets. These local structures have a certain regularity in their form, attributed to the hydrogen bond interactions between various residues. An α-helix has a coil-like structure, whereas a β-sheet consists of parallel strands of residues. In addition to regular secondary structure elements, irregular shapes form an important part of the structure and function of proteins. These elements are typically termed coil regions.
Secondary structure can be divided into several types, although usually at least three classes (α-helix, coil, and β-sheet) are used. No unique method of assigning residues to a particular secondary structure state from atomic coordinates exists, although the most widely accepted protocol is based on the Dictionary of Protein Secondary Structure (DSSP) algorithm [5]. DSSP uses the following structural classes: H (α-helix), G (3₁₀-helix), I (π-helix), E (β-strand), B (isolated β-bridge), T (turn), S (bend), and – (other). Several other secondary structure assignment algorithms use a reduction scheme that converts this eight-state assignment down to three states by assigning H and G to the helix state (H), E and B to the strand state (E), and the rest (I, T, S, and –) to a coil state (C). This is the format generally used in structure databases.
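The eight-to-three state reduction described above can be sketched as a simple lookup table. The function name and the example string are hypothetical, and the sketch assumes the "other" state is written as an ASCII hyphen.

```python
# Reduction scheme from the text: H, G -> H (helix); E, B -> E (strand);
# I, T, S, and '-' -> C (coil). The ASCII hyphen for "other" is an assumption.
DSSP_TO_THREE_STATE = {
    "H": "H", "G": "H",
    "E": "E", "B": "E",
    "I": "C", "T": "C", "S": "C", "-": "C",
}


def reduce_dssp(dssp_states: str) -> str:
    """Map an eight-state DSSP string to the three-state H/E/C alphabet."""
    try:
        return "".join(DSSP_TO_THREE_STATE[state] for state in dssp_states)
    except KeyError as err:
        raise ValueError(f"unknown DSSP state: {err.args[0]!r}") from None


# Hypothetical eight-state assignment for a 20-residue fragment.
print(reduce_dssp("-HHHHGGGTTEEEEB-SS-I"))  # prints CHHHHHHHCCEEEEECCCCC
```

Other reduction schemes exist (for example, ones that treat G or B as coil), so the mapping table is the part to change when comparing against a method that uses a different convention.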
1.1.1.3. Tertiary Structure.
The tertiary structure of the protein is defined as the global 3D structure, represented by 3D coordinates for each atom. These tertiary structures are composed of multiple secondary structure elements, and the 3D structure is a function of the interactions between the side chains of the different amino acids. Hence, the linear ordering of amino acids forms secondary structure; arranging secondary structures yields tertiary structure.
1.1.1.4. Quaternary Structure.
Quaternary structures represent the interaction between multiple polypeptide chains. The interaction between the various chains is largely due to non-covalent interactions between the atoms of the different chains. Examples of these interactions include hydrogen bonding, van der Waals interactions, and ionic bonding; covalent disulfide bonds can also link chains.
Research in computational structure prediction concerns itself mainly with predicting secondary and tertiary structures from known experimentally determined primary structure or sequence. This is due to the relative ease of determining primary structure and the complexity involved in quaternary structure.
1.1.2. Protein Sequence and Structure Databases
The large amount of protein sequence information, experimentally determined structure information, and structural classification information is stored in publicly available databases. In this section we review some of the databases that are used in this field, and provide their availability information in Table 1.1.
TABLE 1.1 Protein Sequence and Structure Databases
Database | Information | Availability Link
UniProt | Sequence | http://www.pir.uniprot.org/
UniRef | Cluster sequences | http://www.pir.uniprot.org/
NCBI nr | Nonredundant sequences | ftp://ftp.ncbi.nlm.nih.gov/blast/db/
PDB | Structure | http://www.rcsb.org/
SCOP | Structure classification | http://scop.mrc-lmb.cam.ac.uk/scop/
CATH | Structure classification | http://www.cathdb.info/
FSSP | Structure classification | http://www.ebi.ac.uk/dali/fssp/
ASTRAL | Compendium | http://astral.berkeley.edu/
The databases referred to in this table are the most popular for protein structure-related information.
1.1.2.1. Sequence Databases.
The Universal Protein Resource (UniProt) [6] is the most comprehensive warehouse containing information about protein sequences and their annotation. It is a database of protein sequences and their functions that is formed by aggregating the information present in the Swiss-Prot, TrEMBL, and Protein Information Resource (PIR) databases. Version 13.2 of the UniProtKB database (released on April 8, 2008) consists of 5,939,836 protein sequence entries (Swiss-Prot providing 362,782 entries and TrEMBL providing 5,577,054 entries).
However, several proteins have high pairwise sequence identity, and as such lead to redundant information. The UniProt database [6] creates a subset of sequences such that the sequence identity between all pairs of sequences within the subset is less than a predetermined threshold. In essence, UniProt contains the UniRef100, UniRef90, and UniRef50 subsets where within each group the sequence identity between a pair of sequences is less than 100%, 90%, and 50%, respectively.
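The idea of a subset in which no pair of sequences reaches a given identity threshold can be illustrated with a greedy filter. This is a sketch under strong assumptions, not a description of how the UniRef subsets are actually built: identity is computed position by position on sequences assumed to be pre-aligned and of equal length, and all names are hypothetical.

```python
from typing import List


def percent_identity(a: str, b: str) -> float:
    """Percentage of identical positions between two pre-aligned sequences.

    Assumption: both sequences are non-empty and already aligned to the
    same length; real pipelines compute identity from an alignment.
    """
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


def nonredundant_subset(sequences: List[str], threshold: float) -> List[str]:
    """Greedily keep a sequence only if its identity to every sequence
    kept so far is strictly below the threshold (in percent)."""
    kept: List[str] = []
    for seq in sequences:
        if all(percent_identity(seq, rep) < threshold for rep in kept):
            kept.append(seq)
    return kept
```

With a threshold of 100 only exact duplicates are dropped, while lower thresholds such as 90 or 50 yield progressively smaller subsets. The result of a greedy filter depends on the input order, which is one reason production tools use more careful clustering.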
The National Center for Biotechnology Information (NCBI) also provides a nonredundant (NCBI nr) database of protein sequences using sequences from a wide variety of sources. This database will have pairs of proteins with high sequence identity, but removes all the duplicates. The NCBI nr version 2.2.18 (released on March 2, 2008) contains 6,441,864 protein sequences.
1.1.2.2. Protein Data Bank (PDB).
The Research Collaboratory for Structural Bioinformatics (RCSB) PDB [7] stores experimentally determined 3D structures of biological macromolecules, including nucleic acids and proteins. As of April 20, 2008 this database consists of 46,287 protein structures that were determined using X-ray crystallography (90%), NMR (9%), and other methods such as cryo-electron microscopy (cryo-EM). These experimental methods are time-consuming and expensive, and X-ray crystallography additionally requires the protein to crystallize.
1.1.2.3. Structure Classification Databases.
Various methods have been proposed to categorize protein structures. These methods are based on the pairwise structural similarity between the protein structures, as well as the topological and geometric arrangement of atoms and of the predominant secondary-structure-like subunits. Structural Classification of Proteins (SCOP) [8]; Class, Architecture, Topology, and Homologous superfamily (CATH) [9]; and Families of Structurally Similar Proteins (FSSP) [10] are three widely used structure classification databases. The classification methodology involves breaking a protein chain or complex into independent folding units called domains, and then classifying these domains into a set of hierarchical classes sharing similar structural characteristics.
SCOP Database.
SCOP [8] is a manually curated database that provides a detailed and comprehensive description of the evolutionary and structural relationships between proteins whose structure is known (present in the PDB). SCOP classifies protein structures using visual inspection as well as structural comparison with a suite of automated tools. The basic unit of classification is generally a domain. SCOP classification is based on four hierarchical levels that encompass evolutionary and structural relationships [8]. In particular, proteins with a clear evolutionary relationship are classified within the same family. Generally, protein pairs within the same family have pairwise residue identities greater than 30%. Protein pairs with low sequence identity, but whose structural and functional features imply a probable common evolutionary origin, are classified within the same superfamily. Protein pairs with similar major secondary structure elements and topological arrangement of substructures (as well as favoring certain packing geometries) are classified within the same fold. Finally, protein pairs having a predominant set of secondary structures (e.g., all-α-helix proteins) lie within the same class. The four hierarchical levels, that is, family, superfamily, fold, and class, define the structure of the SCOP database.
The SCOP 1.73 version database (released on September 26, 2007) classifies 34,494 PDB entries (97,178 domains) into 1086 unique folds, 1777 unique superfamilies, and 3464 unique families.
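The four SCOP levels nest inside one another, so two domains can be compared by asking how deep in the hierarchy they still agree. The following is a minimal sketch of that idea. It assumes each domain carries a dotted identifier read as class.fold.superfamily.family; the identifiers in the example and the function name deepest_shared_level are hypothetical and chosen only for illustration.

```python
# Levels of the SCOP hierarchy, from most general to most specific.
SCOP_LEVELS = ("class", "fold", "superfamily", "family")


def deepest_shared_level(id_a, id_b):
    """Return the most specific SCOP level shared by two domains, or None.

    Each argument is a dotted identifier such as "a.1.1.2", read as
    class.fold.superfamily.family (a format assumed for this sketch).
    """
    shared = None
    for level, part_a, part_b in zip(SCOP_LEVELS, id_a.split("."), id_b.split(".")):
        if part_a != part_b:
            break  # once a level differs, deeper levels cannot be shared
        shared = level
    return shared


# Hypothetical identifiers, chosen only to exercise each case.
print(deepest_shared_level("a.1.1.2", "a.1.1.3"))  # agree down to superfamily
print(deepest_shared_level("a.1.1.2", "a.1.2.1"))  # agree down to fold
print(deepest_shared_level("a.1.1.2", "b.1.1.2"))  # different class
```

Because the comparison stops at the first level that differs, two domains in different classes are never reported as sharing a fold, which mirrors the nesting of the hierarchy.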
CATH Database.
The CATH [9] database is a semi-automated protein structure classification database like the SCOP database. CATH uses a consensus of three automated classification techniques to break a chain into domains and classify them in the various structural categories [11]. Domains for proteins that are not resolved by the consensus approach are determined manually. These domains are then classified into the following hierarchical categories using both manual and automated methods in conjunction.
The first level, class, is determined based on the secondary structure composition and packing within the structure. The second level, architecture, clusters proteins sharing the same orientation of the secondary structure elements but ignoring the connectivity between these substructural units. The third level, topology, groups protein pairs with a high structure alignment score as determined by the SSAP [12] algorithm; such pairs in essence share both the overall shape and the connectivity of secondary structures. The fourth level, homologous superfamily, groups proteins that share a common ancestor, as identified by sequence alignment as well as the SSAP structure alignment method. Structures are further classified within the same sequence family if they share a high sequence identity.
The CATH 3.1.0 version database (released on January 19, 2007) classifies 30,028 proteins (93,885 domains) from the PDB into 40 architecture-level classes, 1084 topology-level classes, and 2091 homologous-level classes.
FSSP Database.
The FSSP [10] is a structure classification database. FSSP uses an automatic classification scheme that employs exhaustive structure-to-structure alignment of proteins using the DALI [13] alignment method. FSSP does not provide a hierarchical classification like the SCOP and CATH databases, but instead employs a hierarchical clustering algorithm over the pairwise structure similarity scores. The resulting clusters can be used to define fold classes, although these definitions are not very accurate.
There have been several studies [14,15] analyzing the relationship between the SCOP, CATH, and FSSP databases for representing the fold space for proteins. The major disagreement between the three databases lies in the domain identification step, rather than the domain classification step. A high percentage of agreement exists between the SCOP, CATH, and FSSP databases especially at the fold level with sequence identity greater than 25%.
ASTRAL Compendium.
The ASTRAL (A Structural Alignment Library) [16–18] compendium is a set of databases and tools used for analysis of protein structures and sequences. This database is partially derived from, and augments, the SCOP [8] database. ASTRAL provides accurate linkage between the biological sequence and the reported structure in PDB, and identifies the domains within the sequence using SCOP. Since the majority of domain sequences in PDB are very similar to others, ASTRAL tools reduce the redundancy by selecting high-quality representatives. Using the reduced nonredundant set of representative proteins allows for sampling of all the different structures in the PDB. This also removes bias due to overrepresented structures. Subsets provided by ASTRAL are based on SCOP domains and use high-quality structure files only. Independent subsets of representative proteins are identified using a greedy algorithm with a filtering criterion based on pairwise sequence identity determined using the Basic Local Alignment Search Tool (BLAST) [19], an e-value-based threshold, or a SCOP level-based filter.
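The greedy selection of representatives can be illustrated with a short sketch. It is not the ASTRAL implementation: the identity measure below is a crude position-by-position comparison standing in for a BLAST-derived identity, and the domain names, sequences, and quality scores are made up for the example.

```python
def ungapped_identity(seq_a, seq_b):
    """Fraction of identical positions over the shorter sequence.

    A crude stand-in for an alignment-based identity; residues are
    compared position by position with no gaps.
    """
    n = min(len(seq_a), len(seq_b))
    if n == 0:
        return 0.0
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return matches / n


def select_representatives(domains, max_identity):
    """Greedily keep domains whose identity to every kept domain is below the cutoff.

    `domains` is a list of (name, sequence, quality) tuples. Higher-quality
    domains are considered first, so they win over their near-duplicates.
    """
    kept = []
    for name, seq, quality in sorted(domains, key=lambda d: d[2], reverse=True):
        if all(ungapped_identity(seq, kept_seq) < max_identity for _, kept_seq, _ in kept):
            kept.append((name, seq, quality))
    return [name for name, _, _ in kept]


# Toy domains with made-up sequences and quality scores.
domains = [
    ("d1", "MKTAYIAKQR", 0.9),
    ("d2", "MKTAYIAKQK", 0.7),  # differs from d1 at one position out of ten
    ("d3", "GSHMLEDPVV", 0.8),  # shares no position with d1
]
print(select_representatives(domains, max_identity=0.4))
```

With a 40% cutoff the lower-quality near-duplicate d2 is dropped in favor of d1, while the unrelated d3 is kept. Ordering by quality before filtering is one plausible way to prefer high-quality representatives; other orderings would give different subsets.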
1.2. PROTEIN STRUCTURE PREDICTION METHODS
One of the biggest goals in structural bioinformatics is the prediction of the 3D structure of a protein from its one-dimensional (1D) protein sequence. The goal is to be able to determine the shape (known as a fold) that a given amino acid sequence will adopt. The problem is further divided based on whether the sequence will adopt a new fold or bear resemblance to an existing fold (template) in some protein structure database. Fold recognition is easy when the sequence in question has a high degree of sequence similarity to a sequence with known structure [20]. If the two sequences share evolutionary ancestry they are said to be homologous. For such sequence pairs we can build the structure for the query protein by choosing the structure of the known homologous sequence as template. This is known as comparative modeling.
In the case where no good template structure exists for the query, one must attempt to build the protein tertiary structure from scratch. These methods are usually called ab initio methods. In a third fold prediction scenario, there may not be good sequence similarity with a known structure, but a structural template may still exist for the given sequence. To clarify this case, if one knew the target structure, one could extract the template using structure–structure alignments of the target against the entire structural database. It is important to note that the target and template need not be homologous. Depending on whether they are, this scenario corresponds to the fold prediction (homologous) or the fold prediction (analogous) problem in the Critical Assessment of Protein Structure Prediction (CASP) competition.
1.2.1. Comparative Modeling
Comparative modeling, or homology modeling, is used when there exists a clear relationship between the sequence of a query protein (unknown structure) and a sequence of a known structure. The most basic approach to structure prediction for such (query) proteins is to perform a pairwise sequence alignment against each sequence in protein sequence databases. This can be accomplished using sequence alignment algorithms such as Smith-Waterman [21] or sequence search algorithms (e.g., BLAST [19]). With a good sequence alignment in hand, the challenge in comparative modeling becomes how best to build a 3D protein structure for a query protein using the template structure.
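To make the pairwise alignment step concrete, the following is a minimal sketch of Smith-Waterman local alignment. The scoring scheme (a fixed match reward, mismatch penalty, and linear gap penalty) is an assumption made to keep the example short; practical protein alignment normally uses a substitution matrix and affine gap penalties.

```python
def smith_waterman(seq_a, seq_b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment.

    Uses a simple match/mismatch score and a linear gap penalty; these
    values are illustrative choices, not taken from the text.
    """
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; scores are floored at zero, which makes the alignment local.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best-scoring cell until a zero score is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(seq_a[i - 1])
            out_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(seq_a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(seq_b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("AAAGGG", "TTTGGGCC"))
```

In this toy case only the shared GGG segment is aligned, which is the point of a local method: unrelated flanking regions do not drag the score down.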
The heart of the above process is the selection of a suitable structural template based on sequence pair similarity. This is followed by the alignment of query sequence to the template structure selected to build the backbone of the query protein. Finally the entire modeled structure is refined by loop construction and side chain modeling. Several comparative modeling methods, more commonly known as modeler programs, have been developed over the past several years [22,23] focusing on various parts of the problem.
As seen in the various years of CASP [24,25], the span of comparative modeling approaches [22,23] follows five basic steps: (i) selecting one or more suitable templates, (ii) utilizing sensitive sequence-template alignment algorithms, (iii) building a protein model using the sequence-structure alignment as reference, (iv) evaluating the quality of the model, and (v) refining the model. These typical steps for the comparative modeling process are shown in Figure 1.1.
FIGURE 1.1 Flowchart for the comparative modeling process.
1.2.2. Fold Prediction (Homologous)
While satisfactory methods exist to detect homologs (proteins that share similar evolutionary ancestry) with high levels of similarity, accurately detecting homologs at low levels of sequence similarity (remote homology detection) remains a challenging problem. Some of the most popular approaches for remote homology prediction compare a protein with a collection of related proteins using methods such as Position-Specific Iterative-BLAST (PSI-BLAST) [26], protein family profiles [27], hidden Markov models (HMMs) [28,29], and Sequence Alignment and Modeling System (SAM) [30]. These schemes produce models that are generative in the sense that they build a model for a set of related proteins and then check to see how well this model explains a candidate protein.
In recent years, the performance of remote homology detection has been further improved through the use of methods that explicitly model the differences between the various protein families (classes) by building discriminative models. In particular, a number of different methods that use Support Vector Machines (SVM) [31] have been developed to produce results that are generally superior to those produced by either pairwise sequence comparisons or approaches based on generative models—provided there are sufficient training data [32–39].
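As an illustration of the kind of sequence similarity measure a discriminative method can use, the sketch below computes a k-mer spectrum kernel in the spirit of the string kernels cited above [34]. It is a simplified reading of that idea rather than the published implementation: sequences are compared by counting shared substrings of length k, and the default k = 3 is an arbitrary choice for the example.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count every overlapping substring of length k in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def spectrum_kernel(seq_a, seq_b, k=3):
    """Inner product of the k-mer count vectors of two sequences."""
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # Only k-mers present in both sequences contribute to the sum.
    return sum(count * counts_b[kmer] for kmer, count in counts_a.items())


# Made-up sequences, used only to show the call.
print(spectrum_kernel("MKTAYIAKQR", "GGMKTAYLLL"))
```

A matrix of such values over all pairs of training sequences could then be handed to any SVM implementation that accepts a precomputed kernel, with family membership supplying the positive and negative labels.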
1.2.3. Fold Prediction (Analogous)
Occasionally a query sequence will have a native fold similar to another known fold in a database, but the two sequences will have no detectable similarity. In many cases the two proteins will lack an evolutionary relationship as well. As the definition of this problem relies on the inability of current methods to detect sequence similarity, the set of proteins falling into this category remains in flux. As new methods continue to improve at finding sequence similarities as a result of increasing database size and better techniques, the number of proteins in question decreases. Techniques to find structures for such query sequences revolve around mounting the query sequence on a series of template structures in a process known as threading [40–42]. An objective energy function provides a score for each alignment, and the highest scoring template is chosen.
Obviously, if the correct template does not exist in the series then the method will not produce an accurate prediction. As a result of this limitation, predicting the structure of proteins in this category is as challenging as predicting protein targets that belong to new or rare folds.
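The score-and-select loop at the heart of threading can be sketched as follows. Everything specific here is an assumption made for illustration: templates are reduced to strings of buried (B) and exposed (E) positions, the scoring function simply rewards hydrophobic residues at buried positions and other residues at exposed ones, and alignments are ungapped. Real threading methods use far richer energy functions and allow gaps.

```python
# Residues treated as hydrophobic in this toy scoring scheme (an assumption).
HYDROPHOBIC = set("AVILMFWC")


def environment_score(residue, environment):
    """Score +1 when a residue suits its environment and -1 otherwise.

    'B' marks a buried template position and 'E' an exposed one.
    """
    suits = (residue in HYDROPHOBIC) == (environment == "B")
    return 1 if suits else -1


def thread(query, template_envs):
    """Best ungapped placement of the query on one template.

    Returns (score, offset), or None if the template is shorter than the query.
    """
    best = None
    for offset in range(len(template_envs) - len(query) + 1):
        window = template_envs[offset:offset + len(query)]
        score = sum(environment_score(r, e) for r, e in zip(query, window))
        if best is None or score > best[0]:
            best = (score, offset)
    return best


def best_template(query, templates):
    """Return (name, score, offset) for the highest-scoring template, or None."""
    results = []
    for name, envs in templates.items():
        placed = thread(query, envs)
        if placed is not None:
            results.append((name, placed[0], placed[1]))
    return max(results, key=lambda r: r[1]) if results else None


# Hypothetical template library and query.
templates = {
    "tmplA": "EEBBEEBB",
    "tmplB": "BBBBEEEE",
}
print(best_template("LLLLKKDD", templates))
```

The sketch also shows the limitation noted above: the function can only return the best entry in the library it is given, so a query whose true fold is absent still receives a template, just a wrong one.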
1.2.4. Ab Initio
Techniques to predict novel protein structure have come a long way in recent years, although a definitive solution to the problem remains elusive. Research in this area can be roughly divided into fragment assembly [43–45] and first principle-based approaches, although occasionally the two are combined [46]. The former attempt to assign a fragment with known structure to a section of the unknown query sequence. The latter start with an unfolded conformation, usually surrounded by solvent, and allow simulated physical forces to fold the protein as would normally happen in vivo. Usually, algorithms from either class will use reduced representations of query proteins during initial stages to reduce the overall complexity of the problem.
Even in the case of these ab initio prediction methods, the state-of-the-art methods [46–48] determine several template structures (using the template selection methods used in comparative modeling). The final protein is modeled using an assembly of fragments or substructures fitted together using a highly optimized approximate energy and statistics-based potential function.
This book presents methods developed for protein structure prediction. In particular methods and problems that are prevalent in a biennial structure prediction competition (CASP) are discussed in the first half of the book. The second half of the book discusses approaches that combine experimental and computational approaches for structure prediction and also new techniques for predicting structures of transmembrane proteins. Finally, the book discusses the applications of protein structure within the context of function prediction and drug discovery.
REFERENCES
1. G. Pandey, V. Kumar, and M. Steinbach. Computational approaches for protein function prediction: A survey. Technical Report 06-23, Department of Computer Science and Engineering, University of Minnesota, 2006.
2. D. Lee, O. Redfern, and C. Orengo. Predicting protein function from sequence and structure. Nature Reviews. Molecular Cell Biology, 8(12):995–1005, 2007.
3. J.C. Whisstock and A.M. Lesk. Prediction of protein function from protein sequence and structure. Quarterly Reviews of Biophysics, 36(3):307–340, 2003.
4. D. Devos and A. Valencia. Practical limits of function prediction. Proteins, 41(1):98–107, 2000.
5. W. Kabsch and C. Sander. Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22:2577–2637, 1983.
6. UniProt Consortium. The Universal Protein Resource (UniProt). Nucleic Acids Research, 36(Database issue):D190–D195, 2008.
7. H.M. Berman, T.N. Bhat, P.E. Bourne, Z. Feng, G. Gilliland, H. Weissig, and J. Westbrook. The Protein Data Bank and the challenge of structural genomics. Nature Structural Biology, 7:957–959, 2000.
8. A.G. Murzin, S.E. Brenner, T. Hubbard, and C. Chothia. SCOP: A structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology, 247:536–540, 1995.
9. C.A. Orengo, A.D. Michie, S. Jones, D.T. Jones, M.B. Swindells, and J.M. Thornton. CATH: A hierarchic classification of protein domain structures. Structure, 5(8):1093–1108, 1997.
10. L. Holm and C. Sander. The FSSP database: Fold classification based on structure-structure alignment of proteins. Nucleic Acids Research, 24(1):206–209, 1996.
11. S. Jones, M. Stewart, A. Michie, M.B. Swindells, C. Orengo, and J.M. Thornton. Domain assignment for protein structures using a consensus approach: Characterization and analysis. Protein Science, 7(2):233–242, 1998.
12. W.R. Taylor and C.A. Orengo. Protein structure alignment. Journal of Molecular Biology, 208(1):1–22, 1989.
13. L. Holm and C. Sander. Protein structure comparison by alignment of distance matrices. Journal of Molecular Biology, 233(1):123–138, 1993.
14. C. Hadley and D. Jones. A systematic comparison of protein structure classifications: SCOP, CATH and FSSP. Structure, 7(9):1099–1112, 1999.
15. R. Day, D.A.C. Beck, R.S. Armen, and V. Daggett. A consensus view of fold space: Combining SCOP, CATH, and the Dali Domain Dictionary. Protein Science, 12(10):2150–2160, 2003.
16. S.E. Brenner, P. Koehl, and M. Levitt. The ASTRAL compendium for sequence and structure analysis. Nucleic Acids Research, 28:254–256, 2000.
17. J.-M. Chandonia, N.S. Walker, L.L. Conte, P. Koehl, M. Levitt, and S.E. Brenner. ASTRAL compendium enhancements. Nucleic Acids Research, 30(1):260–263, 2002.
18. J.M. Chandonia, G. Hon, N.S. Walker, L.L. Conte, P. Koehl, M. Levitt, and S.E. Brenner. The ASTRAL compendium in 2004. Nucleic Acids Research, 32:D189–D192, 2004.
19. S.F. Altschul, W. Gish, W. Miller, E.W. Myers, and D.J. Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215:403–410, 1990.
20. P. Bourne and H. Weissig. Structural Bioinformatics. Hoboken, NJ: John Wiley & Sons, 2003.
21. T.F. Smith and M.S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147:195–197, 1981.
22. P.A. Bates and M.J.E. Sternberg. Model building by comparison at CASP3: Using expert knowledge and computer automation. Proteins: Structure, Function, and Genetics, 3:47–54, 1999.
23. A. Fiser, R.K. Do, and A. Sali. Modeling of loops in protein structures. Protein Science, 9:1753–1773, 2000.
24. C. Venclovas. Comparative modeling in CASP5: Progress is evident, but alignment errors remain a significant hindrance. Proteins: Structure, Function, and Genetics, 53:380–388, 2003.
25. C. Venclovas and M. Margelevicius. Comparative modeling in CASP6 using consensus approach to template selection, sequence-structure alignment, and structure assessment. Proteins: Structure, Function, and Bioinformatics, 7:99–105, 2005.
26. S.F. Altschul, T.L. Madden, A.A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D.J. Lipman. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Research, 25(17):3389–3402, 1997.
27. M. Gribskov, A.D. McLachlan, and D. Eisenberg. Profile analysis: Detection of distantly related proteins. PNAS, 84:4355–4358, 1987.
28. A. Krogh, M. Brown, I. Mian, K. Sjolander, and D. Haussler. Hidden Markov models in computational biology: Applications to protein modeling. Journal of Molecular Biology, 235:1501–1531, 1994.
29. P. Baldi, Y. Chauvin, T. Hunkapiller, and M. McClure. Hidden Markov models of biological primary sequence information. PNAS, 91:1053–1063, 1994.
30. K. Karplus, C. Barrett, and R. Hughey. Hidden Markov models for detecting remote protein homologies. Bioinformatics, 14(10):846–856, 1998.
31. V. Vapnik. Statistical Learning Theory. New York: John Wiley, 1998.
32. T. Jaakkola, M. Diekhans, and D. Haussler. A discriminative framework for detecting remote protein homologies. Journal of Computational Biology, 7(1/2):95–114, 2000.
33. L. Liao and W.S. Noble. Combining pairwise sequence similarity and support vector machines for detecting remote protein evolutionary and structural relationships. Proceedings of the International Conference on Research in Computational Molecular Biology, 225–232, 2002.
34. C. Leslie, E. Eskin, and W.S. Noble. The spectrum kernel: A string kernel for SVM protein classification. Proceedings of the Pacific Symposium on Biocomputing, 564–575, 2002.
35. C. Leslie, E. Eskin, W.S. Noble, and J. Weston. Mismatch string kernels for SVM protein classification. Advances in Neural Information Processing Systems, 20(4):467–476, 2003.
36. Y. Hou, W. Hsu, M.L. Lee, and C. Bystroff. Efficient remote homology detection using local structure. Bioinformatics, 19(17):2294–2301, 2003.
37. Y. Hou, W. Hsu, M.L. Lee, and C. Bystroff. Remote homology detection using local sequence-structure correlations. Proteins: Structure, Function, and Bioinformatics, 57:518–530, 2004.
38. H. Saigo, J.P. Vert, N. Ueda, and T. Akutsu. Protein homology detection using string alignment kernels. Bioinformatics, 20(11):1682–1689, 2004.
39. R. Kuang, E. Ie, K. Wang, M. Siddiqi, Y. Freund, and C. Leslie. Profile-based string kernels for remote homology detection and motif extraction. Journal of Bioinformatics and Computational Biology, 3:152–160, 2004.
40. D.T. Jones, W.R. Taylor, and J.M. Thornton. A new approach to protein fold recognition. Nature, 358:86–89, 1992.
41. D.T. Jones. GenTHREADER: An efficient and reliable protein fold recognition method for genomic sequences. Journal of Molecular Biology, 287(4):797–815, 1999.
42. J.U. Bowie, R. Luethy, and D. Eisenberg. A method to identify protein sequences that fold into a known three-dimensional structure. Science, 253:164–170, 1991.
43. K.T. Simons, C. Kooperberg, E. Huang, and D. Baker. Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. Journal of Molecular Biology, 268:209–225, 1997.
44. K. Karplus, R. Karchin, J. Draper, J. Casper, Y. Mandel-Gutfreund, M. Diekhans, and R. Hughey. Combining local-structure, fold-recognition, and new fold methods for protein structure prediction. Proteins: Structure, Function, and Genetics, 53:491–496, 2003.
45. J. Lee, S.-Y. Kim, K. Joo, I. Kim, and J. Lee. Prediction of protein tertiary structure using PROFESY, a novel method based on fragment assembly and conformational space annealing. Proteins: Structure, Function, and Bioinformatics, 56:704–714, 2004.
46. C.A. Rohl, C.E.M. Strauss, K.M.S. Misura, and D. Baker. Protein structure prediction using Rosetta. Methods in Enzymology, 383:66–93, 2004.
47. Y. Zhang. I-TASSER server for protein 3D structure prediction. BMC Bioinformatics, 9:40, 2008.
48. Y. Zhang, A.J. Arakaki, and J. Skolnick. TASSER: An automated method for the prediction of protein tertiary structures in CASP6. Proteins: Structure, Function, and Bioinformatics, 7:91–98, 2005.
CHAPTER 2
CASP: A DRIVING FORCE IN PROTEIN STRUCTURE MODELING
ANDRIY KRYSHTAFOVYCH and KRZYSZTOF FIDELIS
Protein Structure Prediction Center, Genome Center, University of California, Davis, Davis, CA
JOHN MOULT
Center for Advanced Research in Biotechnology, University of Maryland, College Park, College Park, MD
2.1. WHY WAS CRITICAL ASSESSMENT OF PROTEIN STRUCTURE PREDICTION (CASP) NEEDED?
More than half a century has elapsed since it was shown that amino acid sequence determines the three-dimensional structure of a protein [1], but a general procedure to translate sequence into structure is still to be established. Several dozen methods for generating protein structure from sequence have been developed, providing different levels of model accuracy in different modeling circumstances. With such a variety of modeling approaches and success levels, it was important to establish an objective procedure to compare the performances of the methods and learn their advantages and weaknesses. Also, with only sparse reports on the performance of most methods it was difficult to arrive at a clear understanding of current capabilities and bottlenecks in the field. Specifically, it was not possible to address many key questions about modeling methods, in particular:
1. What are the most effective strategies for protein structure modeling?
2. What are the main factors influencing the outcome of a protein structure modeling experiment and how close can a model get to the corresponding experimental structure?
3. How can related structures on which a model can be based be identified reliably (the template identification problem)? How accurately can coordinates from the template structure be mapped to the correct positions on the target sequence (the alignment problem)? Are models produced by altering/refining templates more accurate than the models built by simply copying coordinates of the template (the refinement problem)?
4. How well can the reliability of the model in general and specific regions in particular be estimated (the quality assessment problem)?
5. How well can fully automatic modeling servers perform, compared with a combination of computing methods and human knowledge?
6. Has there been progress in the field?
7. What are the bottlenecks to further progress?
8. Where can future efforts be most productively focused?
In order to rigorously address these issues, John Moult and colleagues pioneered the CASP experiment in 1994 [2]. The initiative was well accepted by the community of computational biologists, and the experiment, after eight completed rounds, continues to attract considerable attention from protein structure modelers around the world. Two hundred thirty-four predictor groups from 25 countries participated in the last completed experiment, CASP8, submitting over 80,000 predictions (see Fig. 2.1 for historical CASP participation statistics), and approximately the same number of predictor groups are participating in CASP9, which is currently (July 2010) under way.
FIGURE 2.1 Statistics on (a) the number of participating groups and (b) number of submitted predictions in CASP experiments held so far. In panel (b), bars representing the number of tertiary structure predictions are shown in dark gray, while bars representing the cumulative number of predictions in other categories (secondary structure, residue-residue contacts, disorder regions, domain boundaries, function, quality assessment) are shown in light gray.
Even though we, CASP co-organizers, continue to emphasize that CASP is primarily a scientific endeavor aimed at establishing the current state of the art in protein structure prediction, many view it more as a “world championship” in this field of science. Thus, to a large extent, CASP owes its popularity to the twin human drives of competitiveness and curiosity. Whatever the case, a large community of structure modelers devotes considerable effort to the process, and it has now been emulated in other areas of computational biology [3–6].
2.2. CASP PRINCIPLES AND ORGANIZATION
In the pre-CASP times, protein structure modeling methods were tested using the procedure schematically shown in Figure 2.2a. Method developers selected sequences to test their own methods (usually with different research groups selecting different sets of proteins), and assessed the results by comparing models to the experimental structures already known to them at the time of “prediction.” Many apparently successful modeling results were reported in the literature but the inability of others to reproduce the results and the lack of resulting useful applications strongly suggested that this testing approach was not strict enough to ensure objective assessment of the results. In particular, many felt that the reported results were too easily influenced by the known answers. CASP was established to address the deficiencies in these traditional testing procedures. The main principles of CASP summarized in Figure 2.2b are:
“Blind” prediction regime. Predictors are required to submit their models before the answers (experimental structures) are publicly available. This is the primary CASP principle for ensuring rigorous conclusions.
Independent assessment of the results. Experts in the field are invited to perform an independent assessment of all submitted models. The assessors may not participate in the experiment in the role of predictors.
Same targets for everyone. Proteins for modeling (“targets” in CASP jargon) are selected not by the predictors but by the organizers, who are not permitted to participate in the experiment and so have no interest in introducing any selection bias. The same set of targets is used to test all the methods, thus facilitating direct comparison of performance. Organizers strive to provide a reasonably large set of targets with a balanced range of difficulty, so that the assessment is statistically sound and shows the range of success and failure across the spectrum of structure modeling problems.
Anonymity of assessment. All information that could be directly or indirectly used to identify submitting research groups is stripped from the predictions. This information is not made available to the assessors until after their analysis of the results is completed.
Same evaluation criteria for everyone. All predictions are evaluated using the same set of numerical criteria.
Data availability for post-experiment comparisons. All predictions and automatic evaluation results are released to the public upon completion of each CASP experiment, so as to allow others to reproduce the results, and to facilitate methods development.
Control of the experiment by the participants. Those participating in CASP are involved in shaping the rules and scope of the experiment through a variety of mechanisms, particularly a discussion forum (FORCASP) and a predictors’ meeting at each conference, where motions for change are considered and voted upon.
FIGURE 2.2 Schematics of (a) pre-CASP and (b) CASP testing procedures for protein structure prediction methods.
Together, these principles ensure a more objective determination of capabilities in the field of protein structure modeling than the conventional peer-review publication system. They make unjustified claims more difficult to publish, and provide a powerful mechanism for predictors to establish the strength of their methods.
The principles remain untouched from one experiment to another, but a number of changes and additions to the details have been introduced, and these are summarized in Table 2.1.
TABLE 2.1 Changes in the Consecutive CASPs
2.3. CASP PROCESS
CASP is a complicated process, requiring careful planning, data management, and security. The Protein Structure Prediction Center, established in 1996 at the Lawrence Livermore Laboratory to support the experiment and based since 2005 at the University of California, Davis, provides the infrastructure for methods testing, develops method evaluation and visualization tools, and handles all data management issues [7].
Experiments are held every 2 years. The timetable of a typical CASP round is schematically shown in Figure 2.3. The experiment is open to all. The Prediction Center releases targets for prediction and collects models from registered participants for approximately 3 months. Targets for structure prediction are either structures soon-to-be solved by X-ray crystallography or nuclear magnetic resonance (NMR) spectroscopy, or structures already solved but not yet publicly accessible. Prediction methods are divided into two categories—those using a combination of computational methods and human experience, and those relying solely on computational methods. The integrity of the latter category is ensured by requiring that servers process target information and return models automatically. A window of 3 weeks is usually provided for prediction of a target by human-expert groups and 3 days by servers. Following closing of the server prediction window, the server models are posted at the Prediction Center web site. These models can then be used by human-expert predictors as starting points for further, more detailed modeling. They are also used for testing model quality assessment methods in CASP. Once all models of a target have been collected and the experimental structure is available, the Prediction Center performs a standard numerical evaluation of the models, taking the experimental structure as the gold standard. A battery of tools is used for the numerical evaluation of predictions—LGA [8], ACE [9], DAL [10], MAMMOTH [11], DALI [12]. If the target consists of more than one well-defined structural domain, the evaluation is performed on each of these as well as on the complete target (the official domain boundaries are defined by the assessors). The results of automatic evaluation are made available to the independent assessors, who typically add their own analysis methods and make more subjective assessments of the merits and faults of the models. 
The identity of the predictors is concealed from the assessors while they conduct their analysis. Assessment outcomes are presented to the community at the predictors’ meeting usually held in December of a CASP year. At that time, results of the evaluations are also made publicly available through the Prediction Center web site (http://predictioncenter.org) allowing predictors to compare their own models with those submitted by other groups. Details of all the experiments completed so far and their results are available through this web site. The web site also hosts a discussion forum, FORCASP, allowing exchange of thoughts by the predictors. The articles by the assessors, the organizers, and the most successful prediction groups are published in special issues of the journal Proteins: Structure, Function, and Bioinformatics. There are currently eight such issues available, one for each of the eight CASP experiments [2,13–19]. The articles in the special issues discuss in detail the methods tested in CASP, the evaluation results, and the analysis of the progress made. Below we briefly summarize the state of the art in different CASP modeling categories.
FIGURE 2.3 Timetable of the CASP experiment.
2.4. METHOD CLASSES AND PREDICTION DIFFICULTY CATEGORIES
In evaluating the ability of prediction methods, it is important to realize that the difficulty of a modeling problem is determined by many factors. In theory, it is possible to calculate the structure of any protein from knowledge of its amino acid sequence and environmental conditions alone, since it has long been established that these factors determine the functional conformation [1]. In practice, it is not yet possible to follow the detailed folding behavior of a system with as many atoms and degrees of freedom as a protein, nor to thoroughly search for the global free energy minimum of such a system [20–22]. Two types of methods for combating these limitations have been developed. One, by far the most effective at present, utilizes experimental structures of evolutionarily related proteins, providing templates on which to base a model. For cases where no such relationship exists, or none can be discovered, partially effective structure prediction techniques have been developed using simplified energy functions and employing approximate energy landscape search strategies. These two approaches define the two main classes of prediction methods: template-based modeling (TBM), sometimes referred to as comparative or homology modeling, and template-free modeling. Historically, template-free methods were often termed ab initio (or first principles), but members of the CASP community objected on the grounds that these methods often make use of knowledge-based potentials to evaluate interactions and assemblies of observed peptide fragment conformations to generate trial structures. Template-free methods are currently effective only for modeling small proteins (100 residues or less). Template-based methods can be applied wherever it is possible to identify a structurally similar protein that can be used as a template for building the model, irrespective of size.
When the two approaches have been applied to the same modeling problem, template-based methods have usually proven more accurate than template-free methods. Thus, the most significant division in modeling difficulty is between cases where a model can be built based on templates derived from known experimental structures, and those where it cannot. At one extreme, high-resolution models competitive with experiments can be produced for proteins with sequences very similar to that of a known structure. At the other extreme, low resolution, very approximate models can be generated by template-free methods for proteins with no detectable sequence or structure relationship to known structures. To properly assess method successes and failures, CASP subdivides modeling into these two separate categories, each with its own challenges, and hence requiring its own evaluation procedures.
2.5. TBM
Whenever there is a detectable sequence relationship between two proteins, the corresponding structures have been found to be similar. Thus, if at least a single structure within a family of homologous proteins is determined experimentally, then template-based methods can be used to model practically all proteins in that family. The potential of this modeling is huge—by some estimates, structures are already known for a quarter of the protein single-domain families of significant size and half of all known sequences can be partially modeled due to their membership in these families (M. Levitt in [23]).
A typical template-based method consists of several consecutive steps: identifying probable templates; selecting or combining suitable templates; aligning the target sequence with the template(s); copying structurally conserved regions from the selected template(s); modeling structurally variable regions; packing side chains; refining the model; and evaluating its quality. Each modeling step is prone to errors, but, as a rule, the earlier in the process an error is introduced, the costlier it is. As the template-based category covers a wide range of structure similarity, different kinds of errors are typical for different modeling difficulty subcategories.
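The step of copying structurally conserved regions can be sketched as follows, assuming a single template, a pairwise alignment written as two gapped strings, and C-alpha coordinates only. The function names, sequences, and coordinates are hypothetical and do not correspond to any particular modeling package. The sketch also shows why an early alignment error is costly: every later step inherits the residue-to-coordinate mapping chosen here.

```python
def copy_aligned_backbone(target_aln, template_aln, template_ca):
    """Transfer template C-alpha coordinates to target residues via an alignment.

    target_aln, template_aln: equal-length aligned sequences, '-' marks a gap.
    template_ca: one (x, y, z) tuple per template residue, in sequence order.
    Returns one entry per target residue: copied coordinates, or None where
    the target residue has no aligned template residue.
    """
    if len(target_aln) != len(template_aln):
        raise ValueError("aligned sequences must have the same length")
    if sum(c != "-" for c in template_aln) != len(template_ca):
        raise ValueError("template coordinates do not match template sequence")
    model = []
    template_index = 0
    for target_res, template_res in zip(target_aln, template_aln):
        coord = None
        if template_res != "-":
            coord = template_ca[template_index]
            template_index += 1
        if target_res != "-":
            model.append(coord)
    return model


def unmodeled_segments(model):
    """Return inclusive (start, end) target index ranges with no coordinates."""
    segments = []
    start = None
    for i, coord in enumerate(model):
        if coord is None and start is None:
            start = i
        elif coord is not None and start is not None:
            segments.append((start, i - 1))
            start = None
    if start is not None:
        segments.append((start, len(model) - 1))
    return segments


# Hypothetical alignment: the target has a two-residue insertion (Y, L)
# and lacks the template residue Q.
target_aln = "MKT-AYLLG"
template_aln = "MKTQA--LG"
template_ca = [(3.8 * i, 0.0, 0.0) for i in range(7)]

model = copy_aligned_backbone(target_aln, template_aln, template_ca)
loops = unmodeled_segments(model)
print(loops)
```

The residues reported by `unmodeled_segments` are the structurally variable regions that the next step in the list would have to build without template coordinates.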
2.5.1. High-Resolution TBM
The most reliable models can be built in cases where there is a strong sequence relationship between the target protein and a template (i.e., higher than ∼40% sequence identity between target and template). In these situations target and template are expected to have very similar structures. Template selection and alignment errors are rare here, and simply copying the backbone of a suitable template may be sufficient to produce a model that rivals NMR or low-resolution X-ray structures in accuracy (∼1 Å C-alpha atom root-mean-square deviation [RMSD] from the experimental structure). The main effort in this class of prediction shifts to modeling of regions of structure not present in a template (loops), proper placement of side chains, and fine adjustment of the structure (refinement).
Such high-resolution models often present a level of detail that is sufficient for detecting sites of protein–protein interactions, understanding enzyme reaction mechanisms, interpreting disease-causing mutations, molecular replacement in solving crystal structures, and occasionally even drug design.
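As a rough illustration of the ∼40% sequence identity figure quoted above, the sketch below computes percent identity from a pairwise alignment and compares it with that threshold. Sequence identity can be normalized in several ways; counting only columns where both sequences have a residue is an assumption made here, and the function names and example sequences are hypothetical.

```python
def percent_identity(target_aln, template_aln):
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(target_aln) != len(template_aln):
        raise ValueError("aligned sequences must have the same length")
    aligned = 0
    identical = 0
    for a, b in zip(target_aln, template_aln):
        if a == "-" or b == "-":
            continue  # gap columns are excluded from the denominator
        aligned += 1
        if a.upper() == b.upper():
            identical += 1
    if aligned == 0:
        raise ValueError("no aligned residue pairs")
    return 100.0 * identical / aligned


def is_high_resolution_candidate(identity, threshold=40.0):
    """True if identity exceeds the approximate threshold quoted in the text."""
    return identity > threshold


close_pair = percent_identity("ACDEFGHIKL", "ACDQFGHVKL")   # 8 of 10 columns match
gapped_pair = percent_identity("AC-DE", "ACQDF")            # 3 of 4 aligned columns match
print(close_pair, is_high_resolution_candidate(close_pair))
print(gapped_pair, is_high_resolution_candidate(gapped_pair))
```

A threshold check of this kind is only a first guess at difficulty; very short alignments can show high identity by chance, so the aligned length should be considered alongside the percentage.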
2.5.2. Medium Difficulty Range TBM
New, more sensitive methods of detecting remote sequence relationships, especially the Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST) and profile–profile methods, have greatly extended our ability to utilize structure templates based on more remote sequence relationships. The quality of models in this category has steadily improved over the course of the CASP experiments. Models with a quite accurate core (typically 2–3 Å C-alpha atom RMSD from the native structure) can now often be generated. Factors still limiting progress include difficulty in recognizing the best templates, combining information from several templates, aligning target sequences with template structures, adjusting for considerable shifts in conserved regions of structure, and modeling regions not represented in any of the available templates. As in high-resolution homology modeling, refinement methods play a role in improving the accuracy of final models.
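The C-alpha RMSD values quoted in this section can be made concrete with a minimal sketch. It assumes the model has already been optimally superimposed on the experimental structure and that residues correspond one to one; the superposition step itself is not shown. The function name and coordinates are hypothetical.

```python
import math


def ca_rmsd(model, reference):
    """RMSD between matched C-alpha positions of two pre-superimposed structures.

    model, reference: equal-length lists of (x, y, z) tuples in angstroms.
    """
    if len(model) != len(reference):
        raise ValueError("coordinate lists must have the same length")
    if not model:
        raise ValueError("need at least one residue")
    squared = 0.0
    for (mx, my, mz), (rx, ry, rz) in zip(model, reference):
        squared += (mx - rx) ** 2 + (my - ry) ** 2 + (mz - rz) ** 2
    return math.sqrt(squared / len(model))


# Hypothetical four-residue fragment; each model atom is displaced by about
# 1 angstrom from its experimental position, in a different direction.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
model = [(1.0, 0.0, 0.0), (3.8, -1.0, 0.0), (7.6, 0.0, 1.0), (10.4, 0.0, 0.0)]

rmsd = ca_rmsd(model, reference)
print(f"{rmsd:.2f}")  # approximately 1.00
```

Because the deviations are squared before averaging, a few badly placed residues can dominate the value, which is one reason the text distinguishes the accuracy of the core from that of regions not covered by a template.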
