Harness the power of machine learning for quick and efficient calculations of protein structures and properties
Machine Learning in Protein Science is a unique and practical reference that shows how to employ machine learning approaches for full quantum mechanical (FQM) calculations of protein structures and properties, thereby saving costly computing time and making this technology available for routine users.
Machine Learning in Protein Science provides comprehensive coverage of topics including:
Machine Learning in Protein Science is an essential reference on the subject for biochemists, molecular biologists, theoretical chemists, biotechnologists, and medicinal chemists, as well as students in related programs of study.
Page count: 442
Publication year: 2025
Cover
Table of Contents
Title Page
Copyright
Chapter 1: Introduction
1.1 Background and Motivation
Bibliography
Chapter 2: Fundamental of Theoretical Calculations on Protein Systems
2.1 Strategies for Protein Calculations and Predictions
2.2 Force Field-based Calculation on Protein Systems
2.3 Quantum Mechanical Calculation on Protein Systems
2.4 ML on Protein Systems
Bibliography
Chapter 3: Protein Structure Prediction by Artificial Intelligence
3.1 AlphaFold from Google
3.2 ESM and ESMFold from Meta AI
Bibliography
Note
Chapter 4: Methods and Tools for Predicting Protein Folding Free Energy Change upon Mutation
4.1 Introduction
4.2 Clustered Tree Regression
4.3 Materials and Methods
4.4 Conclusion
Bibliography
Chapter 5: Deep Neural Network-assisted Full-system Quantum Mechanical (FQM) Calculations of Proteins
5.1 Introduction
5.2 DNN-based GFCC Method
5.3 DNN-based Two-body Molecular Fractionation with Conjugate Caps
5.4 Conclusion
Bibliography
Chapter 6: Transfer Learning-assisted FQM Calculations of Proteins
6.1 Introduction
6.2 Inductive Transfer Learning Force Field
6.3 Transfer-learning-based Deep Learning Protocol for FQM Calculations
Bibliography
Chapter 7: Protein Interaction Prediction with Artificial Intelligence
7.1 Background
7.2 Methods and Technical Framework
7.3 Results
7.4 Summary and Outlook
Bibliography
Chapter 8: Protein Function Annotation with Machine Learning
8.1 Background
8.2 Methods and Technical Framework
8.3 Protein Function Annotation
Bibliography
Chapter 9: Machine Learning-driven Ab Initio Protein Design
9.1 Background
9.2 Advances and Applications in ML-driven Ab Initio Protein Design
9.3 Graph Neural Networks for Flexible Protein Design
Bibliography
Chapter 10: Large Language Model of Protein Systems
10.1 Background
10.2 Methodology and Training of Protein Language Models
10.3 Applications of PLMs
10.4 Case Introduction: Decoding Protein Interactions Using ESM
10.5 Conclusion and Outlook
Bibliography
Chapter 11: Outlook
11.1 Background
11.2 AI’s Transformative Role
11.3 Applications Across Protein Science
11.4 AI-Augmented Experimental Techniques in Modern Protein Research
Bibliography
Index
End User License Agreement
Chapter 1
Figure 1.1 (a) The primary structure of a protein can be understood as a linear string. (b)...
Figure 1.2 The three-dimensional structural model of proteins is usually predicted by bioin...
Figure 1.3 The detailed structure of the binding sites between one drug molecule and a prot...
Figure 1.4 This image illustrates the bioinformatics analysis process from genes to protein...
Figure 1.5 This illustration showcases the application of machine learning in drug design. ...
Figure 1.6 This image shows the dynamic changes in protein structure. This change is crucia...
Chapter 2
Figure 2.1 A three-dimensional (3D) structural model of proteins. Proteins are long-chain m...
Figure 2.2 Amino acids are the basic units that make up proteins, and each amino acid molec...
Figure 2.3 Representation methods for protein, ligand, and hybrid features, which enable th...
Figure 2.4 The tertiary structure of proteins and their corresponding amino acid sequences....
Figure 2.5 Molecular mechanics potential energy function with continuum solvent.
Figure 2.6 The picture shows different models of asphalt molecular structure, with the cent...
Figure 2.7 This figure shows the excitation process of the entire system simulated by time-...
Figure 2.8 The figure is about the force field duel to the quantum mechanics calculation (A...
Figure 2.9 This figure shows the motion state of molecules at different temperatures, with ...
Figure 2.10 The figure shows the 3D structure of a protein, the interaction network of its a...
Chapter 3
Figure 3.1 Overview of AlphaFold structure.
Figure 3.2 Evoformer block in the AlphaFold structure. Arrows show the information flow. Th...
Figure 3.3 The pair representation interpreted as directed edges in a graph.
Figure 3.4 Triangle multiplicative update and triangle self-attention. The circles represen...
Figure 3.5 Predicting the backbone structure of a protein from a cluster of amino acids at ...
Figure 3.6 Biochemical properties of amino acids are represented in the Transformer model’s...
Figure 3.7 Protein sequence representations encode and organize biological variations.(a) E...
Figure 3.8 Structure predictions on CAMEO structure 7QQA and CASP target T1056 at all ESM-2...
Figure 3.9 ESMFold model architecture. Arrows show the information flow in the network from...
Chapter 4
Figure 4.1 Overview of CTR. (a) Protein chains. (b) Single-point amino acid mutation (e.g.,...
Figure 4.2 Distributions of the FireProt dataset. (a) Distribution of protein folding free energy ...
Figure 4.3 Results of the clustering process. (a) Visualization of the feature groups after...
Figure 4.4 Test losses over iterations using XGBoost. (a) Test loss of Regressor 1 trained ...
Figure 4.5 Experimental protein folding free energy change values against predicted o...
Figure 4.6 Predicted results by CTR with XGBoost and other regression methods. (a) Scatter ...
Figure 4.7 Performance of only one regressor using XGBoost. (a) Scatter plot of experimenta...
Chapter 5
Figure 5.1 Predicting structures of different scales of objects using quantum mechanics (QM...
Figure 5.2 Demonstration of protein structure.
Figure 5.3 Different stages of predicting structural information of a protein molecular wit...
Figure 5.4 The structure of DNNs. Deep means a large number of layers.
Figure 5.5 The process of predicting the energy and atomic force of protein structures usin...
Figure 5.6 MFCC fragmentation scheme utilized in the NN-GMFCC approach. (a) Peptide bond is...
Figure 5.7 (a) Main chain torsions in alanine dipeptide and the Ramachandran plots (Archana...
Figure 5.8 Structures of four largest protein systems studied in this work, along with PDB ...
Figure 5.9 CPU computational time and scaling of GMFCC/MM (B97XD/6-31G*) and NN-GMFCC.
Figure 5.10 Correlation of the atomic forces of proteins between NN-GMFCC calculations and G...
Figure 5.11 Comparison of the relative energies of 19 conformers (selected from 2 ns MD simu...
Figure 5.12 RMSEs for training and testing sets of (a) energies and (b) forces of 20 one-res...
Figure 5.13 Deviation between atomic forces for (a) protein 2 (1A3J) and (b) protein 5 (6OAB...
Figure 5.14 Computational time (CPU time) required for the NN-TMFCC (red dots) and full-QM(...
Figure 5.15 Relative energy of 20 conformations for (a) protein 2 (1A3J) and (b) protein 5 (...
Figure 5.16 The TMFCC scheme is used in the NN-TMFCC approach to construct residue-based fra...
Chapter 6
Figure 6.1 Inductive transfer learning force field (ITLFF) architecture, where proteins...
Figure 6.2 Statistics of datasets for the cap and 20 capped residues calculated by double-h...
Figure 6.3 Details of the performance of ITLFF on protein 5 k26. (a) The sequence and struc...
Figure 6.4 Illustration of TDL-FQM architecture. The (a) proteins are (b) “cut” into a seri...
Figure 6.5 Schematic diagram of the designed fragmental algorithm.
Figure 6.6 Transferability assessment and statistics and distribution of datasets. (a) The ...
Chapter 7
Figure 7.1 The figure shows a feature extraction and classification process based on protei...
Figure 7.2 The process of generating graph networks from Protein Data Bank (PDB) files and ...
Figure 7.3 The process of machine learning predicting hot spot regions in protein–protein i...
Figure 7.4 The process of predicting and analyzing protein interactions was demonstrated, i...
Figure 7.5 The model first extracts the molecular structural features of drugs and proteins...
Figure 7.6 This model utilizes attention mechanisms to encode and decode information by com...
Figure 7.7 Three forms of PPI were demonstrated: overall PPI, domain–domain PPI, and peptid...
Figure 7.8 The clustering and classification performance of different datasets, as well as ...
Chapter 8
Figure 8.1 The different types and functions of proteins, including contractile proteins (s...
Figure 8.2 The seven main functions of proteins are enzymatic proteins, storage proteins, h...
Figure 8.3 The step-by-step construction process of proteins from primary structure (amino ...
Figure 8.4 The workflow of a protein function prediction model was demonstrated, including ...
Figure 8.5 This graph compares the predictive performance of four different methods (NetGO3...
Figure 8.6 This figure shows the conservation of GDF8 and GDF11 protein sequences, the stru...
Chapter 9
Figure 9.1 This figure illustrates the process of using the deep learning model DeepPotenti...
Figure 9.2 This figure illustrates how a machine learning (ML) model takes protein sequence...
Figure 9.3 This image depicts the process of optimizing the functionality of Green Fluoresc...
Figure 9.4 This figure illustrates the process of extracting features from protein structur...
Figure 9.5 This includes operations such as adding, deleting, replacing, repairing, and sup...
Figure 9.6 The process of protein simulation includes protein fragmentation, modeling, comp...
Chapter 10
Figure 10.1 ProGen is a conditional language model based on transformer architecture, with 1...
Figure 10.2 This figure illustrates the process of inputting protein sequences into the ESM-...
Figure 10.3 Comprehensively demonstrated protein structure prediction technology. The upper ...
Figure 10.4 This figure shows the structure and functional analysis of a protein. (a) the ge...
Figure 10.5 This figure illustrates the concepts of design space, diffusion process, uncondi...
Chapter 11
Figure 11.1 Proteins are one of the most fundamental molecules in the body of life.
Figure 11.2 This figure shows the four levels of protein structure: from the basic amino aci...
Figure 11.3 This figure shows the four structural levels of proteins: the primary structure ...
Figure 11.4 This figure shows the visualization of LLMs tools developed to improve research ...
Figure 11.5 This figure describes the predicted structures of various proteins and complexes...
Chapter 4
Table 4.1 Performance of all methods in our experiments. Each method is adopted separately...
Jinjin Li
Yanqiang Han
Authors
Jinjin Li
Shanghai Jiao Tong University
Shanghai
China, 200240
Yanqiang Han
Shanghai Jiao Tong University
Shanghai
China, 200240
Cover Design: Wiley
Cover Images: © NicoElNino/Shutterstock, © ibreakstock/Shutterstock
Library of Congress Card No.: applied for
British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library.
Bibliographic information published by the Deutsche Nationalbibliothek
The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet at http://dnb.d-nb.de.
Print ISBN 9783527352159
ePDF ISBN 9783527842360
ePub ISBN 9783527842353
oBook ISBN 9783527842346
© 2026 WILEY-VCH GmbH, Boschstraße 12, 69469 Weinheim, Germany
All books published by WILEY-VCH are carefully produced. Nevertheless, authors, editors, and publisher do not warrant the information contained in these books, including this book, to be free of errors. Readers are advised to keep in mind that statements, data, illustrations, procedural details or other items may inadvertently be inaccurate.
All rights reserved (including those of translation into other languages, text and data mining and training of artificial technologies or similar technologies). No part of this book may be reproduced in any form – by photoprinting, microfilm, or any other means – nor transmitted or translated into a machine language without written permission from the publishers. Registered names, trademarks, etc. used in this book, even when not specifically marked as such, are not to be considered unprotected by law.
The manufacturer’s authorized representative according to the EU General Product Safety Regulation is WILEY-VCH GmbH, Boschstr. 12, 69469 Weinheim, Germany, e-mail: [email protected].
AI Disclaimer: While the publisher and the authors have used their best efforts in preparing this work, including a review of the content of the work, neither the publisher nor the authors make any representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
Proteins are the molecular machines that power life itself. Every cell in a living organism contains a vast array of proteins, each responsible for specific tasks, from facilitating chemical reactions to maintaining structural integrity and regulating gene expression. The study of proteins is essential for understanding the fundamental processes of life, ranging from cellular metabolism to disease pathology. At the molecular level, proteins are composed of long chains of amino acids that fold into specific three-dimensional structures, a process known as protein folding (Ptitsyn, 1991; Richardson and Richardson, 1992). The unique shape of a protein determines its functionality, as only a specific conformation allows it to interact with other molecules, catalyze biochemical reactions, and maintain cellular processes (Figure 1.1).
Figure 1.1 (a) The primary structure of a protein can be understood as a linear string of amino acids. (b) The secondary structure refers to how the peptide chain twists and folds locally, based on the primary structure, to form local three-dimensional elements. (c) The tertiary structure is the complete three-dimensional fold of a single chain, formed when multiple secondary-structure elements pack together. (d) The quaternary structure refers to the assembly of multiple folded chains into a complex.
However, despite the critical role of proteins in cellular function, a major challenge in molecular biology remains: understanding how proteins achieve their three-dimensional shapes and how mutations in these structures can lead to diseases. For decades, researchers have attempted to predict protein structures based on their amino acid sequences, but this task has proven to be extraordinarily complex. The sequence of amino acids in a protein is like a string of letters in an alphabet, yet the way these letters arrange themselves into a specific shape is governed by intricate physical and chemical interactions that are not immediately obvious from the sequence alone.
In the past, the understanding of protein structures relied heavily on experimental methods such as X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy, and cryo-electron microscopy (cryo-EM). These techniques can provide high-resolution information on the structure of proteins, but they are time-consuming, expensive, and often require high-quality samples, which are not always available. Moreover, they struggle to capture the dynamic nature of proteins, which constantly change shape during their interactions with other molecules. These challenges have led researchers to seek out computational approaches that can predict protein structure from sequence, simulate protein dynamics, and investigate the effects of mutations on protein function (Figure 1.2).
Figure 1.2 The three-dimensional structural model of proteins is usually predicted by bioinformatics software based on the amino acid sequence of proteins or analyzed through experimental methods such as X-ray crystallography, nuclear magnetic resonance (NMR), or cryo-electron microscopy. Different colors represent different secondary structures of proteins.
Computational protein biology has seen immense progress in recent years. The development of new algorithms and the exponential growth of computational power have paved the way for the application of more efficient techniques. Among the most groundbreaking advancements in this field is the application of machine learning (ML) and artificial intelligence (AI) to predict protein structures and functions (Jumper et al., 2021; Rives et al., 2021). The ability to predict a protein’s structure from its sequence without the need for experimental data has been one of the “holy grails” of computational biology. ML models, particularly those based on deep learning techniques, have shown immense promise in this area, outperforming traditional methods in accuracy and speed. One of the most notable breakthroughs in this domain is AlphaFold, a deep learning algorithm developed by DeepMind. AlphaFold’s ability to predict protein structures with near-experimental accuracy has revolutionized the field and demonstrated the potential of AI-driven approaches in protein science (Figure 1.3).
Figure 1.3 The detailed structure of the binding sites between one drug molecule and a protein molecule demonstrates how drugs interact with proteins, which is crucial for drug design and understanding protein function.
The success of AlphaFold (Jumper et al., 2021), which has been heralded as a major milestone in structural biology, highlights the potential of ML to solve long-standing problems in computational biology. AlphaFold uses deep neural networks trained on vast datasets of known protein structures to predict the three-dimensional structure of proteins based on their amino acid sequences. The algorithm has achieved unprecedented levels of accuracy, predicting the structures of a wide range of proteins with remarkable precision. AlphaFold’s success has provided a glimpse into the future of protein research, where ML models can be used not only to predict protein structure but also to simulate protein function, understand the effects of mutations, and design novel proteins with desired properties.
Despite the significant strides made in protein structure prediction, several challenges remain. While AlphaFold’s algorithm is capable of predicting the structure of individual proteins, the prediction of protein–protein interactions (PPIs), protein–ligand binding, and the dynamic behavior of proteins in complex biological environments remain open problems. These processes are crucial for understanding cellular signaling pathways, enzyme catalysis, and drug design (Krasner, 1972). In particular, predicting how proteins interact with one another and how their structures change in response to different conditions is a complex task that requires a deeper understanding of the molecular forces at play. Moreover, protein interactions often occur in crowded cellular environments, making it difficult to model these interactions accurately using traditional computational methods (Zheng et al., 2020).
Furthermore, the impact of mutations on protein structure and function remains a significant challenge. Mutations in DNA can lead to changes in the amino acid sequence of a protein, which in turn may alter its structure and function. Some mutations can lead to loss of function, while others may result in gain of function, causing diseases such as cancer, neurodegenerative disorders, and genetic diseases. Being able to predict the effects of mutations on protein structure and function is crucial for understanding disease mechanisms and developing therapeutic strategies. Although ML models have shown promise in predicting the effects of mutations, there is still much to be done in terms of improving the accuracy and robustness of these predictions.
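To make the idea of a sequence-level mutation concrete, the following is a minimal sketch of applying a single amino acid substitution to a protein sequence. The function name `apply_point_mutation`, the compact notation (wild-type letter, 1-based position, mutant letter), and the toy sequence are assumptions made for illustration; they are not taken from the book.

```python
def apply_point_mutation(sequence: str, mutation: str) -> str:
    """Apply a substitution written as <wild-type><position><mutant>, e.g. "A4G".

    The position is 1-based, following the usual convention for residue numbering.
    """
    wild_type, mutant = mutation[0], mutation[-1]
    index = int(mutation[1:-1]) - 1  # convert 1-based position to 0-based index

    if not 0 <= index < len(sequence):
        raise ValueError(f"position {index + 1} is outside the sequence")
    if sequence[index] != wild_type:
        raise ValueError(
            f"expected {wild_type} at position {index + 1}, found {sequence[index]}"
        )
    # Strings are immutable, so build the mutant sequence from slices.
    return sequence[:index] + mutant + sequence[index + 1:]


# Hypothetical toy sequence, not a real protein.
wild_type_sequence = "MKTAYIAK"
mutant_sequence = apply_point_mutation(wild_type_sequence, "A4G")
print(mutant_sequence)  # MKTGYIAK
```

A predictor of mutation effects would take a wild-type and mutant pair like this as input; the sketch covers only the bookkeeping step, not the prediction itself.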
In addition to structure and mutation prediction, protein function annotation remains one of the most important challenges in bioinformatics. While the genome sequencing revolution has provided us with vast amounts of sequence data, the function of many proteins remains unknown. The process of assigning a biological function to a protein based on its sequence is known as function annotation. Traditionally, function annotation has relied on experimental techniques, such as gene knockout experiments, to determine the role of a protein in a biological context. However, these methods are time-consuming and expensive. Computational methods, particularly those based on ML, have the potential to accelerate the process of function annotation by predicting the biological role of a protein based on its sequence, structure, or interaction with other molecules.
The need for accurate, high-throughput methods for protein function annotation has become even more urgent in the context of personalized medicine. With the increasing availability of genomic data, there is a growing demand for tools that can predict how genetic variations in individuals affect protein function. The ability to link specific genetic mutations to disease-causing proteins can provide valuable insights into the molecular basis of disease and guide the development of targeted therapies. In this regard, ML has the potential to revolutionize the way we approach drug discovery and personalized medicine by enabling the rapid identification of disease-related proteins and the design of therapies that target these proteins.
The integration of quantum mechanical calculations into protein research represents another promising avenue for improving the accuracy of protein predictions. Quantum mechanics, which describes the behavior of matter at the atomic and subatomic levels, provides a powerful framework for modeling the interactions between atoms and molecules. By applying quantum mechanical methods to protein systems, researchers can gain a deeper understanding of the forces that govern protein folding, stability, and interactions. Quantum mechanical calculations are particularly useful for studying the detailed electronic structure of proteins, including the behavior of electrons and the formation of chemical bonds. However, these calculations are computationally expensive and often require specialized software and hardware. As a result, they have been limited to small systems or simplified models. The challenge lies in developing methods that combine the accuracy of quantum mechanical calculations with the scalability needed to model large, complex proteins.
In recent years, the combination of quantum mechanics and ML has emerged as a promising strategy to overcome the computational limitations of traditional quantum mechanical methods (Peral-García et al., 2024). By using ML algorithms to predict the parameters required for quantum mechanical calculations, researchers can improve the efficiency and accuracy of these methods. For example, deep learning techniques have been used to predict the electronic structure of molecules, enabling researchers to simulate the behavior of proteins more efficiently. This hybrid approach has the potential to revolutionize the field of protein modeling by making high-level quantum mechanical calculations accessible for larger and more complex systems.
As protein research continues to evolve, new computational techniques are allowing for more nuanced simulations and predictions, accelerating our ability to explore protein behavior. One of the key areas of innovation is in the development of more efficient algorithms for protein structure prediction, PPIs, and protein–ligand binding. Traditional molecular dynamics simulations, which track the movement of atoms in a protein over time, have been essential in understanding protein folding and dynamics. However, these simulations are often limited by computational cost and the difficulty of modeling large protein complexes accurately over long time scales. Emerging hybrid techniques that combine molecular dynamics with ML models hold promise for overcoming these barriers. By using data-driven approaches to predict the behavior of molecular systems, these models can enhance the efficiency of simulations and provide deeper insights into protein functionality.
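The core loop of a molecular dynamics simulation, tracking positions and velocities over time under a force, can be sketched in a few lines. The example below uses the velocity Verlet integrator on a one-dimensional harmonic oscillator. Both choices are assumptions made for illustration: the text does not name an integrator, and a real protein simulation would use a full force field over thousands of atoms.

```python
def velocity_verlet(position, velocity, force_fn, mass, dt, n_steps):
    """Integrate Newton's equations of motion with the velocity Verlet scheme.

    Returns the list of positions (including the initial one) and the final velocity.
    """
    trajectory = [position]
    force = force_fn(position)
    for _ in range(n_steps):
        # Half-step velocity update using the current force.
        velocity_half = velocity + 0.5 * dt * force / mass
        # Full-step position update.
        position = position + dt * velocity_half
        # Recompute the force at the new position, then finish the velocity update.
        force = force_fn(position)
        velocity = velocity_half + 0.5 * dt * force / mass
        trajectory.append(position)
    return trajectory, velocity


SPRING_CONSTANT = 1.0  # toy value in arbitrary units (assumption)


def harmonic_force(x):
    """Force of a harmonic spring, F = -k x, standing in for a real force field."""
    return -SPRING_CONSTANT * x


trajectory, final_velocity = velocity_verlet(
    position=1.0, velocity=0.0, force_fn=harmonic_force, mass=1.0, dt=0.01, n_steps=1000
)
total_energy = 0.5 * final_velocity**2 + 0.5 * SPRING_CONSTANT * trajectory[-1] ** 2
print(f"final position: {trajectory[-1]:.4f}, total energy: {total_energy:.6f}")
```

In a hybrid scheme of the kind described above, `force_fn` is the natural place where an ML model could stand in for an expensive force evaluation, while the integration loop itself stays unchanged.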
Moreover, the need for high-resolution predictions in the study of protein dynamics is particularly evident in understanding the mechanisms behind diseases caused by protein misfolding. Diseases like Alzheimer’s, Parkinson’s, and cystic fibrosis have been linked to specific protein misfolding events. These diseases are often the result of mutations that destabilize the protein structure, leading to aggregation, loss of function, or toxic gain-of-function effects. Predicting the structural consequences of mutations and understanding the resulting changes in protein behavior are crucial for identifying therapeutic targets. ML methods have the potential to revolutionize this process by identifying patterns in the sequence–structure–function relationships of proteins that were previously invisible to conventional methods.
At the same time, the integration of protein structure prediction with personalized medicine offers a unique opportunity to transform healthcare. As more individuals undergo genome sequencing and provide detailed information about their genetic predispositions, the need for predictive models that can link specific genetic variations to diseases becomes increasingly important. The ability to predict how individual mutations will affect protein folding, stability, and interactions could allow for more precise and personalized treatments, reducing the trial-and-error approach that is common in drug development today. ML-driven techniques could enable the creation of drug candidates tailored to an individual’s specific genetic makeup, optimizing treatment efficacy while minimizing adverse effects.
Beyond prediction, another important area of protein research is the de novo design of proteins with desired properties. Protein engineering has traditionally focused on modifying existing proteins for industrial or therapeutic purposes. However, the ability to design entirely new proteins from scratch offers an exciting frontier in biotechnology and synthetic biology (Huang et al., 2016). Advances in ML and computational modeling have made it increasingly feasible to design novel proteins with tailored functions, such as enzymes that catalyze specific reactions or therapeutic proteins that bind to disease-causing agents. The application of generative models and reinforcement learning algorithms to protein design is an area of active research, with the goal of creating proteins with unprecedented capabilities for use in medicine, agriculture, and environmental sustainability.
One of the most ambitious goals of protein science is to bridge the gap between computational predictions and experimental validation. While algorithms like AlphaFold have made substantial progress in accurately predicting protein structures, there is still a need for experimental validation to confirm these predictions in a laboratory setting. The integration of computational approaches with experimental data is crucial for improving the reliability of predictions and translating computational models into practical applications. For example, high-throughput experimental techniques, such as cryo-EM and mass spectrometry, could provide complementary data to refine computational models and validate predictions. This synergy between computational and experimental methods could pave the way for more rapid advancements in drug discovery, protein engineering, and our understanding of fundamental biological processes.
The integration of quantum mechanics with ML in protein research is also driving significant advancements in the field. Quantum mechanical calculations, which provide a detailed and accurate description of atomic interactions, have long been recognized for their potential to improve our understanding of protein behavior. However, these calculations are computationally expensive, limiting their applicability to small systems or simplified models. To address this, researchers have begun using ML algorithms to accelerate quantum mechanical calculations and extend them to larger, more complex systems. This hybrid approach allows for the accurate simulation of protein folding, stability, and interactions, facilitating the study of proteins that were previously too large or too dynamic to model with traditional quantum mechanics.
The application of deep neural networks to protein research is another area that has garnered significant attention. Deep learning models, particularly convolutional neural networks (CNNs) and recurrent neural networks (RNNs), have shown promise in tasks such as protein structure prediction (Karwasra et al., 2024), mutation effect prediction, and function annotation. These models excel at recognizing patterns in large datasets, which is particularly useful for identifying relationships between protein sequences and their corresponding structures or functions. The ability to train these models on vast amounts of data has led to breakthroughs in the prediction of protein functions, particularly in cases where experimental data is limited or unavailable.
ML also holds the potential to improve drug discovery by identifying novel drug candidates (Vamathevan et al., 2019; Catacutan et al., 2024). In traditional drug discovery pipelines, researchers often screen thousands of compounds to find those that interact with a target protein. However, this process is time-consuming and costly. ML algorithms can significantly speed up this process by predicting the binding affinity of compounds to a target protein before experimental screening. These models can analyze vast chemical libraries and identify potential drug candidates based on their predicted interaction with the protein of interest. This approach not only saves time and resources but also enables the discovery of compounds with higher potency and fewer side effects.
Another exciting development in protein research is the application of graph-based ML models to protein design and analysis (Akid et al., 2024; Ingraham et al., 2019). Proteins can be represented as graphs, with nodes corresponding to amino acids and edges representing interactions between them. Graph neural networks (GNNs) have been used to predict protein structure, function, and interactions by learning from the graph representations of proteins. These models offer a new way to capture the complex relationships between amino acids in a protein and can be applied to tasks such as protein folding, PPI prediction, and protein–ligand docking (Réau et al., 2023; Knutson et al., 2022).
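The graph representation described above can be made concrete with a minimal sketch. It is not taken from any of the cited works: the 8 Å C-alpha distance cutoff, the toy coordinates, and the function names `contact_graph` and `message_pass` are all assumptions chosen for illustration. The aggregation step is a plain neighbor average with no learned weights, so it shows only the skeleton of what a GNN layer does.

```python
import math

def contact_graph(ca_coords, cutoff=8.0):
    """Build an adjacency list: residues i and j are linked if their
    C-alpha atoms lie within `cutoff` angstroms (assumed threshold)."""
    n = len(ca_coords)
    graph = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(ca_coords[i], ca_coords[j]) <= cutoff:
                graph[i].add(j)
                graph[j].add(i)
    return graph

def message_pass(graph, features):
    """One round of mean aggregation: each node's new feature is the
    average of its own value and the values of its neighbors."""
    updated = {}
    for node, neighbours in graph.items():
        values = [features[node]] + [features[k] for k in neighbours]
        updated[node] = sum(values) / len(values)
    return updated

# Hypothetical C-alpha coordinates (angstroms) for a four-residue fragment.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
graph = contact_graph(coords)

# Toy per-residue feature, e.g. a hydrophobicity-like score.
features = {0: 1.0, 1: 0.0, 2: 1.0, 3: 0.0}
smoothed = message_pass(graph, features)
```

In this toy fragment the first three residues are mutually connected, while the fourth lies beyond the cutoff and keeps its original feature value. A trained GNN would replace the plain average with learned transformations and stack several such rounds.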
As computational power continues to grow and ML models become more sophisticated, the future of protein science looks increasingly promising. The integration of AI-driven approaches with quantum mechanical calculations, protein structure prediction, and protein design will likely accelerate the pace of discovery and open up new possibilities for therapeutic interventions. The ability to predict and design proteins with tailored functions will not only revolutionize drug discovery but also transform fields such as materials science, agriculture, and environmental sustainability (Figure 1.4).
Figure 1.4 This image illustrates the bioinformatics analysis process from genes to protein structures. On the left is a diagram of genes, representing DNA sequences that contain genetic information. The middle section displays the amino acid sequence, which is obtained when DNA is transcribed into mRNA and the mRNA is then translated into protein during gene expression. On the right is the three-dimensional structure of the protein, predicted through homology modeling and molecular simulation techniques. Homology modeling is a method of predicting unknown protein structures based on the known structures of related proteins, while molecular simulation is used to study the dynamic behavior of proteins at the atomic level. This process is of great significance for understanding the function of proteins and designing new drugs.
In the near future, the collaboration between experimentalists and computational biologists will be key to advancing our understanding of protein systems. While computational predictions provide invaluable insights into protein structure and function, experimental validation remains essential for confirming these predictions and translating them into practical applications. By combining the power of AI and quantum mechanics with experimental techniques, the field of protein research is poised to make major breakthroughs in understanding the molecular basis of life, improving human health, and addressing global challenges (Figure 1.5).
Figure 1.5 This illustration showcases the application of machine learning in drug design. (a) The role of message-passing neural networks in hit discovery. Through a series of messaging steps, molecular embeddings rich in contextual information are generated, capturing the connectivity among atoms. Upon aggregation, these vectors are fed into a feedforward neural network (FFNN) to yield predicted property values. Based on user-defined thresholds, compounds may subsequently be prioritized for in vitro potency validation. (b) Deep Docking (DD) synergizes physics-based docking methods with deep learning to evaluate vast chemical libraries, uncovering their potential to bind with specific proteins. Initially, a fraction of the chemical library (approximately 1%) undergoes physics-based docking, with the resulting scores utilized for training the FFNN, which then swiftly predicts docking scores for the remaining compounds (around 99%).
The rise of computational tools in protein research has been fueled by several factors, including advancements in computational power, the availability of large biological datasets, and the increasing recognition of the importance of protein systems in health, disease, and biotechnology. Historically, the study of proteins has relied heavily on experimental methods such as X-ray crystallography, NMR, and cryo-EM, each of which has its limitations. For example, X-ray crystallography requires proteins to be crystallized, which is often challenging for membrane proteins or large complexes. NMR, on the other hand, is limited by the size of the proteins that can be studied and requires high concentrations, making it less suitable for some biological systems. Cryo-EM has emerged as a powerful method for studying large, dynamic proteins, but it requires specialized equipment and is still relatively time-consuming. These experimental approaches, though invaluable, are often expensive, labor-intensive, and time-consuming.
Computational approaches, in contrast, provide a complementary toolset for protein research. The ability to simulate the behavior of proteins and predict their structures, functions, and interactions without the need for expensive experimental setups has opened new possibilities in protein science. Computational models, such as molecular dynamics simulations and quantum mechanical calculations, offer insights into protein folding, stability, and function on time scales and at resolutions that were previously unattainable. Moreover, these models can be used to test hypotheses about protein behavior before conducting costly experiments, saving both time and resources.
ML has become an increasingly powerful tool in this computational arsenal. By training algorithms on large datasets of protein sequences, structures, and functions, ML models can identify complex patterns that traditional methods cannot easily detect. These models have demonstrated remarkable success in predicting protein structures (e.g., AlphaFold), estimating the effects of mutations on protein function, and identifying novel protein–ligand interactions for drug discovery. The use of ML also allows researchers to develop predictive models that can generalize across a wide variety of proteins, making it easier to study proteins with unknown functions or to explore new avenues for drug development. One of the key advantages of ML is its ability to integrate diverse sources of data – ranging from genomic sequences to experimental structural data – into a unified model, enabling researchers to make more accurate predictions and derive new biological insights.
The application of ML and computational methods is particularly important in the context of personalized medicine. As we move toward an era of precision healthcare, where treatments are tailored to an individual’s genetic makeup, the ability to predict how genetic variations will affect protein function is becoming crucial. Genetic mutations can have profound effects on protein structure and function, potentially leading to diseases or altered responses to drugs. By leveraging large-scale genomic and proteomic data, ML algorithms can identify which mutations are likely to be pathogenic, how they affect protein stability, and what therapeutic strategies may be effective in mitigating these effects. This opens up the possibility of designing personalized treatments that are more effective and have fewer side effects.
In addition to the practical benefits for medicine, the computational study of protein systems has significant implications for biotechnology, agriculture, and environmental sustainability. For example, proteins play a key role in many industrial processes, including the production of biofuels, enzymes, and pharmaceuticals. The ability to design proteins with specific functions – such as enzymes that catalyze desired chemical reactions or proteins that bind to toxic compounds – could greatly enhance the efficiency of these processes. In agriculture, proteins that are resistant to pathogens or pests could be engineered to improve crop yields and reduce the need for chemical pesticides. In the environmental sector, proteins could be designed to break down pollutants or to capture carbon, contributing to efforts to combat climate change (Figure 1.6).
Figure 1.6 This image shows the dynamic changes in protein structure. These changes are crucial for understanding protein function, stability, and interactions with other molecules. By studying them, scientists can better understand the working mechanisms of proteins in living organisms.
The need for accurate predictions in the field of protein systems extends beyond traditional protein folding and structure prediction. Protein interactions are essential for nearly every biological process, from signaling pathways to immune responses. Understanding how proteins interact with each other and with small molecules (such as drugs or other ligands) is fundamental to drug discovery, systems biology, and synthetic biology. Proteins often work in large, dynamic networks, where the binding of one protein can have cascading effects on the entire system. Predicting these interactions, especially in complex biological systems, has been a long-standing challenge in computational biology. Recent advances in ML and graph-based methods, however, offer new ways to model PPIs, protein–ligand interactions, and protein–small molecule binding. These methods can leverage the structural information of individual proteins and the network of interactions to predict how proteins will behave in various biological contexts.
The convergence of ML, quantum mechanical calculations, and experimental biology is poised to revolutionize the field of protein research. While ML models are powerful in making predictions, they are still limited by the quality and quantity of data they are trained on. Experimental methods, however, provide the real-world data needed to validate and refine these predictions. Integrating these two approaches, experimental data and computational models, could significantly accelerate the discovery of new protein functions, the development of new drugs, and the design of novel proteins for biotechnological applications. For instance, experimental techniques such as X-ray crystallography, NMR, and cryo-EM can generate high-resolution structural data that can be used to train ML models and improve the accuracy of structure prediction. Similarly, high-throughput screening methods can generate large datasets that can be used to train models for drug discovery, protein–ligand binding prediction, and mutation effect prediction.
One particularly exciting area of research is the application of quantum mechanical calculations in protein science. Quantum mechanics provides a level of detail that is unmatched by classical simulations, offering insights into the atomic-level interactions between molecules. However, quantum mechanical calculations are computationally expensive, especially when applied to large, complex proteins. Recent developments in ML and hybrid approaches, which combine quantum mechanics with data-driven models, promise to make quantum mechanical simulations more accessible and applicable to larger protein systems. This approach, often referred to as deep learning-assisted full-system quantum mechanical (FQM) calculations (Li et al., 2021), has the potential to improve the accuracy of protein simulations, enabling researchers to study proteins in more detail than ever before.
As the field progresses, it is clear that the future of protein research will be shaped by the continued integration of computational and experimental approaches. The growing availability of large biological datasets, advances in computational algorithms, and the increasing power of ML models will accelerate our understanding of protein systems. Furthermore, the application of quantum mechanical calculations, deep learning, and transfer learning will expand the range of proteins that can be studied, providing new insights into their functions and interactions. The ultimate goal of this research is to harness the power of computational models to predict and design proteins with desired functions, opening up new possibilities for drug discovery, protein engineering, and biotechnology.
In summary, the study of protein systems is a dynamic and rapidly evolving field that stands at the intersection of biology, chemistry, physics, and computer science. The integration of computational tools with experimental techniques is transforming our understanding of proteins and their roles in biological processes. As computational power continues to increase and new algorithms are developed, the potential for breakthrough discoveries in medicine, biotechnology, and environmental science continues to grow. The future of protein research holds great promise, with ML, quantum mechanical calculations, and protein design leading the way toward a deeper understanding of the molecular basis of life.
Hajer Akid, Kirsley Chennen, Gabriel Frey, Julie Thompson, Mounir Ben Ayed, and Nicolas Lachiche. Graph-based machine learning model for weight prediction in protein–protein networks. BMC Bioinformatics, 25(1): 349, November 2024. ISSN 1471-2105. https://doi.org/10.1186/s12859-024-05973-6
Denise B. Catacutan, Jeremie Alexander, Autumn Arnold, and Jonathan M. Stokes. Machine learning in preclinical drug discovery. Nature Chemical Biology, 20(8): 960–973, August 2024. ISSN 1552-4469. https://doi.org/10.1038/s41589-024-01679-1
Po-Ssu Huang, Scott E. Boyken, and David Baker. The coming of age of de novo protein design. Nature, 537(7620): 320–327, September 2016. ISSN 1476-4687. https://doi.org/10.1038/nature19946
John Ingraham, Vikas Garg, Regina Barzilay, and Tommi Jaakkola. Generative models for graph-based protein design. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, et al. Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873): 583–589, August 2021. ISSN 1476-4687. https://doi.org/10.1038/s41586-021-03819-2
Ritu Karwasra, Kushagra Khanna, Kapil Suchal, Ajay Sharma, and Surender Singh. Chapter 13 - protein structure prediction with recurrent neural network and convolutional neural network: a case study. In Khalid Raza, Debmalya Barh, Deepak Singh, and Naeem Ahmad, editors, Deep Learning Applications in Translational Bioinformatics, volume 15 of Advances in Ubiquitous Sensing Applications for Healthcare, pp. 211–229. Academic Press, January 2024. https://doi.org/10.1016/B978-0-443-22299-3.00013-X
Carter Knutson, Mridula Bontha, Jenna A. Bilbrey, and Neeraj Kumar. Decoding the protein–ligand interactions using parallel graph neural networks. Scientific Reports, 12(1): 7624, May 2022. ISSN 2045-2322. https://doi.org/10.1038/s41598-022-10418-2
Joseph Krasner. Drug-protein interaction. Pediatric Clinics of North America, 19(1): 51–63, February 1972. ISSN 0031-3955. https://doi.org/10.1016/S0031-3955(16)32666-9
Wei Li, Haibo Ma, Shuhua Li, and Jing Ma. Computational and data driven molecular material design assisted by low scaling quantum mechanics calculations and machine learning. Chemical Science, 12(45): 14987–15006, 2021. https://doi.org/10.1039/D1SC02574K
David Peral-García, Juan Cruz-Benito, and Francisco José García-Peñalvo. Systematic literature review: Quantum machine learning and its applications. Computer Science Review, 51: 100619, February 2024. ISSN 1574-0137. https://doi.org/10.1016/j.cosrev.2024.100619
O.B. Ptitsyn. How does protein synthesis give rise to the 3D-structure? FEBS Letters, 285(2): 176–181, 1991. ISSN 1873-3468. https://doi.org/10.1016/0014-5793(91)80799-9
Manon Réau, Nicolas Renaud, Li C Xue, and Alexandre M J J Bonvin. DeepRank-GNN: a graph neural network framework to learn patterns in protein–protein interfaces. Bioinformatics, 39(1): btac759, January 2023. ISSN 1367-4811. https://doi.org/10.1093/bioinformatics/btac759
David C. Richardson and Jane S. Richardson. The kinemage: a tool for scientific communication. Protein Science, 1(1): 3–9, 1992. ISSN 1469-896X. https://doi.org/10.1002/pro.5560010102
Alexander Rives, Joshua Meier, Tom Sercu, Siddharth Goyal, Zeming Lin, Jason Liu, et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proceedings of the National Academy of Sciences of the United States of America, 118(15): e2016239118, April 2021. https://doi.org/10.1073/pnas.2016239118
Jessica Vamathevan, Dominic Clark, Paul Czodrowski, Ian Dunham, Edgardo Ferran, George Lee, et al. Applications of machine learning in drug discovery and development. Nature Reviews Drug Discovery, 18(6): 463–477, June 2019. ISSN 1474-1784. https://doi.org/10.1038/s41573-019-0024-5
Shuangjia Zheng, Yongjian Li, Sheng Chen, Jun Xu, and Yuedong Yang. Predicting drug–protein interaction using quasi-visual question answering system. Nature Machine Intelligence, 2(2): 134–140, February 2020. ISSN 2522-5839. https://doi.org/10.1038/s42256-020-0152-y
The study of protein systems and their behavior is an essential element of molecular biology, bioinformatics, and drug design. Over the past few decades, significant advances in computational techniques have allowed researchers to model, predict, and analyze protein structures and functions at unprecedented levels of detail (Peng et al., 2022; Aithani et al., 2023; Floudas, 2007; Xu et al., 2007). Protein systems are highly complex and dynamic, governed by interactions that can span vast ranges of timescales and spatial dimensions. As such, the strategies for protein calculations and predictions must combine various computational tools and models that capture both the static structures and the dynamic behaviors of proteins.
Protein science, in this context, involves the development of computational strategies to predict the three-dimensional (3D) structure of a protein from its amino acid sequence, understand its function, and predict its interactions with other molecules. The theoretical approaches used in protein science have evolved over time, from simple energy-based models to the most sophisticated quantum mechanical and machine learning (ML) methods available today. Each method has its advantages and limitations, and often, the best approach involves integrating these diverse techniques to achieve a more accurate and comprehensive understanding of protein systems (Lebowitz et al., 2002).
A central challenge in computational protein science is the prediction of protein structure. A protein’s structure is intricately linked to its function. The sequence of amino acids in a protein defines its 3D conformation, which in turn dictates how the protein interacts with other molecules, including ligands, other proteins, and nucleic acids. Traditionally, experimental techniques, such as X-ray crystallography and nuclear magnetic resonance (NMR) spectroscopy, were used to determine protein structures. However, these methods are often resource-intensive, requiring pure protein samples, high-resolution equipment, and lengthy experimental processes. Furthermore, the determination of protein structures via these techniques can be particularly challenging for membrane proteins, large complexes, or intrinsically disordered proteins.
To address these limitations, computational methods have become an integral part of protein structure prediction. These methods are not only faster and less expensive but also have the ability to predict structures for a wide range of proteins, including those that may be too difficult or expensive to study experimentally. The computational strategies for protein structure prediction can be broadly categorized into three main approaches: homology modeling, ab initio prediction, and threading or fold recognition.
Homology modeling is one of the most widely used methods in protein structure prediction. It is based on the premise that evolutionarily related proteins will have similar structures, even if their sequences differ. If the target protein shares significant sequence similarity with a protein whose 3D structure is known, homology modeling can provide a reliable prediction of the target protein’s structure. This method relies on the alignment of the target protein’s sequence to a template sequence with a known structure, followed by the construction of a 3D model based on the template’s geometry. The accuracy of homology modeling depends heavily on the degree of sequence similarity between the target and template proteins, and it is most effective when the target sequence shares more than 30–40% identity with the template. When sequence similarity is low, homology modeling can lead to inaccurate predictions.
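The dependence on sequence identity can be illustrated with a short sketch that computes percent identity from a pairwise alignment and checks it against a cutoff. The example sequences are invented, the helper names are hypothetical, and the 30% threshold is simply the lower end of the range quoted above. Identity is computed here over aligned, non-gap columns; other definitions of the denominator are also in use and give somewhat different values.

```python
def percent_identity(aligned_target, aligned_template):
    """Percentage of aligned (non-gap) columns with identical residues."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    aligned = 0
    matches = 0
    for a, b in zip(aligned_target, aligned_template):
        if a == "-" or b == "-":
            continue  # skip columns where either sequence has a gap
        aligned += 1
        if a == b:
            matches += 1
    return 100.0 * matches / aligned if aligned else 0.0

def template_is_usable(identity, threshold=30.0):
    """Assumed rule of thumb: accept templates at or above the threshold."""
    return identity >= threshold

# Invented, already-aligned sequences ('-' marks a gap).
target = "MKT-AYIAKQR"
template = "MKTLAYLSKQ-"
identity = percent_identity(target, template)
usable = template_is_usable(identity)
```

For this pair, seven of the nine aligned columns match, so the template would pass the assumed cutoff. In practice the alignment itself would come from a dedicated alignment program rather than being supplied by hand.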
On the other hand, ab initio prediction methods attempt to predict protein structures without any prior knowledge of homologous proteins. These methods rely on physical principles to model the protein’s 3D structure, using force fields to simulate the interactions between atoms and molecules. The goal of ab initio methods is to find the conformation that minimizes the system’s energy, representing the most stable structure of the protein. While ab initio methods have been highly successful for small proteins, they become computationally expensive for larger systems due to the large number of possible conformations that must be sampled. The accuracy of ab initio predictions also depends on the precision of the force fields used, and for many years, ab initio methods have struggled to match the accuracy of experimental structures for large proteins (Figure 2.1).
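The idea of searching for the minimum-energy conformation can be shown on the smallest possible system: two atoms interacting through a Lennard-Jones potential, a standard non-bonded term in molecular force fields. The sketch below is a toy under stated assumptions (reduced units with epsilon = sigma = 1, a fixed step size, plain steepest descent) and is not the procedure of any particular structure-prediction program. A real protein has thousands of coupled degrees of freedom and many local minima, which is exactly why conformational sampling becomes expensive.

```python
def lj_energy(r, epsilon=1.0, sigma=1.0):
    """Lennard-Jones pair energy: 4*eps*[(sigma/r)^12 - (sigma/r)^6]."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)

def lj_gradient(r, epsilon=1.0, sigma=1.0):
    """Analytical derivative dE/dr of the Lennard-Jones energy."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (-12.0 * sr6 * sr6 + 6.0 * sr6) / r

def minimize_distance(r_start, step=0.01, n_steps=500):
    """Steepest descent on the interatomic distance r (assumed settings)."""
    r = r_start
    for _ in range(n_steps):
        r -= step * lj_gradient(r)
    return r

# Start from a stretched pair and relax toward the energy minimum.
r_min = minimize_distance(1.5)
e_min = lj_energy(r_min)
```

The descent settles at the analytical minimum, r = 2^(1/6) sigma (about 1.122 in reduced units), where the energy equals -epsilon. The same principle, following the negative gradient of a force-field energy, underlies energy minimization of full protein models.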
Figure 2.1 A three-dimensional (3D) structural model of proteins. Proteins are long-chain molecules composed of amino acids that form specific 3D structures through complex folding. These structures are crucial for the function of proteins, and the different colored helices and layers represent the different secondary structural elements of proteins.
Threading, or fold recognition, is another computational technique used when there is no suitable template available for homology modeling. In this approach, the sequence of the target protein is compared to a database of known protein folds, and the best-fitting fold is identified. The target sequence is then “threaded” through the template to construct a model. Threading does not require high sequence identity between the target and template proteins, making it useful for proteins that have no detectable sequence homologs of known structure. This method has been successful in predicting the structures of proteins whose folds could not be assigned from sequence similarity alone, but it still relies on the assumption that the folds present in the database are representative of all possible protein structures, so a genuinely novel fold that is absent from the database cannot be recovered.
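A heavily simplified sketch of the fold-recognition step is given below. Each template fold is reduced to a string of buried ('B') and exposed ('E') positions, and a sequence is scored by how well its hydrophobic residues coincide with buried positions. Everything here is an illustrative assumption: the fold names, the burial patterns, the hydrophobic residue set, and the +1/-1 scoring. Practical threading programs use statistical potentials and allow gaps through dynamic-programming alignment, both of which are omitted.

```python
# Assumed set of hydrophobic residues (one-letter codes).
HYDROPHOBIC = set("AVILMFWC")

def threading_score(sequence, environment):
    """Score how well a sequence fits a fold's burial pattern.
    'B' = buried position, 'E' = exposed position. A hydrophobic residue
    at a buried site, or a non-hydrophobic residue at an exposed site,
    scores +1; any other pairing scores -1."""
    if len(sequence) != len(environment):
        raise ValueError("sequence and template must have equal length")
    score = 0
    for residue, env in zip(sequence, environment):
        fits = (residue in HYDROPHOBIC) == (env == "B")
        score += 1 if fits else -1
    return score

def best_fold(sequence, fold_library):
    """Return the name of the template with the highest score."""
    return max(
        fold_library,
        key=lambda name: threading_score(sequence, fold_library[name]),
    )

# Hypothetical fold library: names and burial patterns are invented.
fold_library = {
    "fold_alpha": "BEEBBEEB",
    "fold_beta": "BEBEBEBE",
}
query = "LKDVIESA"
winner = best_fold(query, fold_library)
```

The query is assigned to whichever library entry scores best, which also makes the method's main limitation visible: a sequence whose true fold is missing from `fold_library` is still forced onto one of the available templates.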
Beyond structure prediction, another essential component of protein calculations is understanding how proteins function. The function of a protein is closely tied to its 3D structure, as the specific arrangement of atoms within the protein determines how it interacts with other molecules. Traditional methods for studying protein function often involve experimental techniques such as enzyme assays, binding studies, or mutagenesis experiments. However, computational methods can complement these experimental techniques by predicting a protein’s function based on its structure or sequence.
The most common computational method for protein function prediction is based on the sequence–structure–function paradigm, which assumes that similar sequences and structures will have similar functions. This approach relies on databases of annotated protein structures and their associated functions, allowing researchers to predict the function of a protein by comparing it to known sequences and structures. Several tools and databases have been developed to facilitate this process, including Pfam, InterPro, and the Protein Data Bank (PDB). These databases provide access to a wealth of information about protein families, domains, and functional sites.
In addition to sequence–structure–function relationships, protein–protein interactions (PPIs) are a critical aspect of protein function (Lu et al., 2020). Proteins do not function in isolation but instead interact with other proteins to form complex molecular networks. Understanding these interactions is crucial for deciphering cellular pathways, disease mechanisms, and drug targets (Milroy et al., 2014). The prediction of PPIs has become an important research area in computational biology, and several methods have been developed to predict protein interactions based on sequence or structural data (Figure 2.2).
Figure 2.2 Amino acids are the basic units that make up proteins. Each amino acid molecule contains at least one amino group and one carboxyl group, both of which are attached to the same carbon atom, called the alpha carbon.
Sequence-based PPI prediction methods rely on the idea that interaction partners can be inferred from sequence alone, for example because proteins whose sequences resemble those of known interacting pairs are more likely to interact. These methods typically involve searching for conserved motifs or domains within protein sequences that are known to mediate interactions. While sequence-based methods are relatively simple and computationally efficient, they are limited in their accuracy and applicability, as they do not account for the structural context of protein interactions.
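The motif search mentioned above can be sketched with a regular expression. The example uses the proline-rich PxxP core motif, which is recognized by SH3 domains; the test sequence is invented and the function name `find_motif` is hypothetical. A hit only flags a candidate interaction site, since short motifs also occur by chance, which reflects the limited accuracy of purely sequence-based approaches.

```python
import re

# PxxP: proline, any two residues, proline. The lookahead makes the
# search report overlapping occurrences as well.
PXXP = re.compile(r"(?=(P..P))")

def find_motif(sequence, pattern=PXXP):
    """Return (start_index, matched_text) for every occurrence,
    including overlapping ones. Indices are zero-based."""
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence)]

# Invented sequence containing a proline-rich stretch.
seq = "MSAPPLPPRTKE"
hits = find_motif(seq)
```

Here the proline-rich stretch yields two overlapping matches, while a sequence without the motif returns an empty list. Curated motif and domain databases would replace the single hard-coded pattern in any realistic screen.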
Structure-based methods for PPI prediction take into account the 3D structures of proteins and the specific surfaces that are involved in protein binding. One approach is molecular docking, where the 3D structures of two interacting proteins are computationally “docked” to predict their binding interface. This method involves solving the complex problem of determining how two proteins come together to form a stable complex, taking into account the spatial arrangement of atoms and the energetics of the interaction. While docking simulations can be highly accurate when the structures of the interacting proteins are known, they are limited by the availability of high-quality structural data and the assumption that the binding sites are static.
A more recent advancement in PPI prediction is the integration of ML techniques with sequence-based and structure-based methods (Zhang et al., 2024). ML algorithms, such as deep learning, can be trained on large datasets of known protein interactions to identify patterns and make predictions about new interactions. These methods often incorporate sequence information, structural data, and evolutionary profiles to improve the accuracy of PPI predictions.