97,99 €
Covers the fundamentals and techniques of multiple biological sequence alignment and analysis, and shows readers how to choose the appropriate sequence analysis tools for their tasks
This book describes the traditional and modern approaches in biological sequence alignment and homology search. This book contains 11 chapters, with Chapter 1 providing basic information on biological sequences. Next, Chapter 2 contains fundamentals in pair-wise sequence alignment, while Chapters 3 and 4 examine popular existing quantitative models and practical clustering techniques that have been used in multiple sequence alignment. Chapter 5 describes, characterizes and relates many multiple sequence alignment models. Chapter 6 describes how traditionally phylogenetic trees have been constructed, and available sequence knowledge bases can be used to improve the accuracy of reconstructing phylogeny trees. Chapter 7 covers the latest methods developed to improve the run-time efficiency of multiple sequence alignment. Next, Chapter 8 covers several popular existing multiple sequence alignment server and services, and Chapter 9 examines several multiple sequence alignment techniques that have been developed to handle short sequences (reads) produced by the Next Generation Sequencing technique (NSG). Chapter 10 describes a Bioinformatics application using multiple sequence alignment of short reads or whole genomes as input. Lastly, Chapter 11 provides a review of RNA and protein secondary structure prediction using the evolution information inferred from multiple sequence alignments.
• Covers the full spectrum of the field, from alignment algorithms to scoring methods, practical techniques, and alignment tools and their evaluations
• Describes theories and developments of scoring functions and scoring matrices
•Examines phylogeny estimation and large-scale homology search
Multiple Biological Sequence Alignment: Scoring Functions, Algorithms and Applications is a reference for researchers, engineers, graduate and post-graduate students in bioinformatics, and system biology and molecular biologists.
Ken Nguyen, PhD, is an associate professor at Clayton State University, GA, USA. He received his PhD, MSc and BSc degrees in computer science all from Georgia State University. His research interests are in databases, parallel and distribute computing and bioinformatics. He was a Molecular Basis of Disease fellow at Georgia State and is the recipient of the highest graduate honor at Georgia State, the William M. Suttles Graduate Fellowship.
Xuan Guo, PhD, is a postdoctoral associate at Oak Ridge National Lab, USA. He received his PhD degree in computer science from Georgia State University in 2015. His research interests are in bioinformatics, machine leaning, and cloud computing. He is an editorial assistant of International Journal of Bioinformatics Research and Applications.
Yi Pan, PhD, is a Regents' Professor of Computer Science and an Interim Associate Dean and Chair of Biology at Georgia State University. He received his BE and ME in computer engineering from Tsinghua University in China and his PhD in computer science from the University of Pittsburgh. Dr. Pan's research interests include parallel and distributed computing, optical networks, wireless networks and bioinformatics. He has published more than 180 journal papers with about 60 papers published in various IEEE/ACM journals. He is co-editor along with Albert Y. Zomaya of the Wiley Series in Bioinformatics.
Sie lesen das E-Book in den Legimi-Apps auf:
Seitenzahl: 432
Veröffentlichungsjahr: 2016
Cover
Title Page
Copyright
Preface
Chapter 1: Introduction
1.1 Motivation
1.2 The Organization of this Book
1.3 Sequence Fundamentals
Chapter 2: Protein/DNA/RNA Pairwise Sequence Alignment
2.1 Sequence Alignment Fundamentals
2.2 Dot-Plot Matrix
2.3 Dynamic Programming
2.4 Word Method
2.5 Searching Sequence Databases
References
Chapter 3: Quantifying Sequence Alignments
3.1 Evolution and Measuring Evolution
3.2 Substitution Matrices and Scoring Matrices
3.3 GAPS
3.4 Scoring Multiple Sequence Alignments
3.5 Circular Sum Score
3.6 Conservation Score Schemes
3.7 Diversity Scoring Schemes
3.8 Stereochemical Property Methods
3.9 Hierarchical Expected Matching Probability Scoring Metric (HEP)
Chapter 4: Sequence Clustering
4.1 Unweighted Pair Group Method with Arithmetic Mean – UPGMA
4.2 Neighborhood-Joining Method – NJ
4.3 Overlapping Sequence Clustering
Chapter 5: Multiple Sequences Alignment Algorithms
5.1 Dynamic Programming
5.2 Progressive Alignment
5.3 Consistency and Probabilistic MSA
5.4 Genetic Algorithms
5.5 New Development in Multiple Sequence Alignment Algorithms
5.6 Test Data and Alignment Methods
5.7 Results
Chapter 6: Phylogeny in Multiple Sequence Alignments
6.1 The Tree of Life
6.2 Phylogeny Construction
6.3 Inferring Phylogeny from Multiple Sequence Alignments
Chapter 7: Multiple Sequence Alignment on High-Performance Computing Models
7.1 Parallel Systems
7.2 Exiting Parallel Multiple Sequence Alignment
7.3 Reconfigurable-Mesh Computing Models – (R-Mesh)
7.4 Pairwise Dynamic Programming Algorithms
7.5 Progressive Multiple Sequence Alignment ON R-Mesh
Chapter 8: Sequence Analysis Services
8.1 EMBL-EBI: European Bioinformatics Institute
8.2 NCBI: National Center for Biotechnology Information
8.3 GenomeNet and Data Bank of Japan
8.4 Other Sequence Analysis and Alignment Web Servers
8.5 SeqAna: Multiple Sequence Alignment with Quality Ranking
8.6 Pairwise Sequence Alignment and Other Analysis Tools
8.7 Tool Evaluation
Chapter 9: Multiple Sequence for Next-Generation Sequences
9.1 Introduction
9.2 Overview of Next Generation Sequence Alignment Algorithms
Chapter 10: Multiple Sequence Alignment for Variations Detection
10.1 Introduction
10.2 Genetic Variants
10.3 Variation Detection Methods Based on MSA
10.4 Evaluation Methodology
10.5 Conclusion and Future Work
Chapter 11: Multiple Sequence Alignment for Structure Detection
11.1 Introduction
11.2 RNA Secondary Structure Prediction Based on MSA
11.3 Protein Secondary Structure Prediction Based on MSA
11.4 Conclusion and Future Work
References
Index
End User License Agreement
xi
xii
xiii
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
133
134
135
136
137
138
139
140
141
142
143
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
219
220
221
222
223
224
225
Cover
Table of Contents
Preface
Begin Reading
Chapter 1: Introduction
Figure 1.1 Transcription and translation processes: DNA RNA protein (the “central dogma” of biology).
Figure 1.2 The yeast Sec23/24 heterodimer 1M2V: (a) protein structure and (b) primary sequence.
Figure 1.3 BLOSUM62 substitution matrix.
Figure 1.4 Four fundamental wobble base pairs.
Figure 1.5 NCBI's sequence formats.
Figure 1.6 Illustration of GenBank sequence format.
Figure 1.7 (a) Phylis format and (b) Clustal format.
Chapter 2: Protein/DNA/RNA Pairwise Sequence Alignment
Figure 2.1 Dot-plot matrix of two sequences ACACACTA and AGCACACA. The connected diagonal lines indicate matching segments.
Figure 2.2 Structures of the CALM3 protein. Based on PyMOL rendering of PDB 1a29.
Figure 2.3 Dot-plot matrix of human CalM against itself to reveal motifs.
Figure 2.4 Generating a scoring matrix for sequence “GAATTCAGTTA” and sequence “GGATTCCGA.”
Figure 2.5 Backtracking, (a–c), to obtain the optimal alignment (d).
Figure 2.6 Global alignment of the two sequences.
Figure 2.7 Smith–Waterman's local alignment of two sequences: “G A A T T C A G T T A” and “G G A T T C C G A.”
Figure 2.8 Displays all available words of size 3.
Figure 2.9 Displays an exact word matched.
Figure 2.10 (a) Filtering dot plot; (b) being reevaluated by score matrix; (c) filtering highest scored diagonals starting from the core region – identified by an arrow; (d) assembling the diagonals.
Chapter 3: Quantifying Sequence Alignments
Figure 3.1 A human Calmodulin wraps around its binding domain in the plasma membrane. atoms are depicted as circles.
Figure 3.2 Displays the possible paths a nucleotide can be mutated to with the same probability in Jukes and Cantor's model.
Figure 3.3 Shows the score by (a) edit distance and (b) BLOSUM62 matrix.
Figure 3.4 Illustrates the steps in assembling an MSA.
Figure 3.5 Traversal of a tree by an SP measurement. The total visits on each edge is shown. Some edges are visited more often than others.
Figure 3.6 Circular traversal of a tree. The dotted arrow indicates the traversal path.
Figure 3.7 Taylor's Venn diagram of amino acid properties.
Figure 3.8 The true table of Taylor's Venn diagram.
Figure 3.9 Amino Acid Class Hierarchy (AACH) used in PIMA [50]. Uppercase characters are amino acids; lowercase characters are amino acid classes. X is the wild-card character of any type, including a gap.
Figure 3.10 HEP scoring tree generated from BLOSUM62 substitution matrix. The matching cardinality of each amino acid class is shown at the bottom of the corresponding class symbol.
Figure 3.11 PIMA active scoring tree for the third column.
Figure 3.12 Matching Leucine from NNV0 with a similar amino acid in sequence SFV1 [51]. The dotted arrow shows a mismatch, the dashed arrow shows a better match, and the solid arrow shows the best match.
Figure 3.13 The RT-OSM sequences. The six motifs of the RT-OSM are indicated by roman numeral(I–VI). The bold and capitalized letters represent the core amino acids of each motif. (
Source
: Adapted from Hudak and McClure 1999 [51].)
Figure 3.14 Motif detection accuracy on the RT-OSM data set. Score of 1.0 means a method correctly detects and ranks all six motifs in the data set.
Figure 3.15 Reliability scores on the BAliBASE3.0 benchmark.
Figure 3.16 Reliability scores on 150 references of the PREFAB4.0 benchmark.
Figure 3.17 BAliBASE total column (TC) scores.
Figure 3.18 Total column (TC) scores of 150 random sequence sets from PREFAB4.0.
Chapter 4: Sequence Clustering
Figure 4.1 (a) is the input sequences, (b) substitutions on first column, and (c) substitutions on second column.
Figure 4.2 Illustrates each step of building the UPGMA phylogeny.
Figure 4.3 Producing NJ tree – not drawn to scale. (a) is initial star tree; (b) shows the combined OTU of A and B; (c) is the NJ tree with the new OTU AB; and (d) is the final NJ evolution tree.
Figure 4.4 Illustration of different alignment outcomes from the same input sequences. (a) shows the input sequences with the motifs in bold; (b) shows the expected alignment; (c) and (d) show the alignment based on different phylogeny guiding tree.
Figure 4.5 Illustration of the clustering of the sequences to generate a sequence pattern. (a) is the all-pairwise alignments, (b) is sequence clusterization, and (c) is sequence pattern identification, where S' is the cluster pattern. The boxes represent the DP local alignment results.
Chapter 5: Multiple Sequences Alignment Algorithms
Figure 5.1 Sample alignment of BAliBASE reference 1 [52] ubiquitin set. Each block represents a secondary feature such as alpha helices, beta strands, or unimportant structures.
Figure 5.2 Illustration of the divide and conquer multiple sequence alignment (DCA).
Figure 5.3 An example of progressive multiple sequence alignment. (a) is the input sequences, (b) is an alignment guiding tree, (c) is external nodes alignment, and (d) is internal nodes alignment.
Figure 5.4 The Amino Acid Class Hierarchy (AACCH) used in PIMA family; X represent a wildcard for matching any symbol, including gap.
Figure 5.5 A pattern generated from a pair-wise alignment.
Figure 5.6 POA algorithm: (a) input sequences, (b) directed graphs for the sequences, (c) graph fusion of sequences S1 and S2, (d) alignment of S1 and S2, (e) graph fusion of S3 to alignment (S1,S2), (f) final partial order graph and final multiple sequence alignment.
Figure 5.7 PSAlign algorithm: (a) input sequences, (b) minimum spanning tree, (c) undirected graph constructed from pair-wise alignments and spanning three, (d) partial order graph, and (e) final alignment result.
Figure 5.8 Basic pair-HMM for aligning two sequences and . State emits two letters that are being aligned from each sequence. States and emit a letter in sequence and , respectively, that are aligned to a gap.
Figure 5.9 Generating extended pair-wise alignment in T-Coffee. (a) sequence SeqA is aligned to sequence SeqC via immediate sequence SeqB. (b) new pair-wise alignment for sequence SeqA and sequence SeqC.
Figure 5.10 T-Coffee multiple sequence alignment schema.
Figure 5.11 Representing three sequences in de Bruijn graph.
Figure 5.12 Superpath transformation: (a) detachment and (b) -cut.
Figure 5.13 Illustration of SAGA, where the two parents are crossed breed. The dotted boxes represent the consistent alignment between the two parents.
Figure 5.14 SAGA alignment scheme. At any generation , the parents are crossed breed or mutated by a random operation to generate a new set of children .
Figure 5.15 Neural networks.
Figure 5.16 Knowledge-based multiple sequence alignment scheme. Initially, the input sequences are partitioned into semi-related groups. The groups' representative sequences are used to search for any available biological information from sequence knowledge bases. The sequences are given different weights and regrouped based on the discovered knowledge.
Figure 5.17 Clusters alignment by partial order graph, where bold texts represent beta-strand blocks, italics represents alpha-helix blocks, and others are nonsignificant blocks. (a) and (b) show the two clusters to be aligned, each sequence in the cluster is preceding by its name; (c) is the partial order graph representing clusters (a) and (b); (d) is the fusing graph of (a) and (b). These sequences are obtained from BAliBASE.
Figure 5.18 Finding the maximum matching score between two partial order graph nodes by sliding techniques. The overlapping columns are enclosed in the rectangle. Nonoverlapping columns are considered matching with gaps.
Figure 5.19 Creation of an annotated sequence for KB-MSA consistency database. (a) shows the input sequences and the final annotated pattern for seq1 having three featured blocks f1, f2, and f3 with weights 1/3, 2/3, and 1/3, respectively; (b) shows the immediate pair-wise local alignments of seq1 and their weight-marking vectors.
Figure 5.20 KB-MSA tested against ClustalW, T-Coffee, ProbCons, and MAFFT on BAliBASE benchmark. With complete sequence knowledge, KB-MSA increases average alignment score by more than 9%. With 10% sequence knowledge, KB-MSA yields about 4% improvement.
Figure 5.21 KB-MSA testing against ClustalW, T-Coffee, ProbCons, and MAFFT on SABmark benchmark. With complete sequence knowledge, KB-MSA increases average alignment score by at least 8% on the original data sets and more than 6% on false-positive data sets.
Figure 5.22 KB-MSA testing against ClustalW, T-Coffee, ProbCons, and MAFFT on HOMSTRAD. With complete sequence knowledge, KB-MSA increases average alignment score about 10% on data sets with less than 20% sequence identity.
Figure 5.23 KB-MSA testing against ClustalW, T-Coffee, ProbCons, and MAFFT. With complete sequence knowledge, KB-MSA increases average alignment score at least 7% on data sets with less than 20% sequence identity.
Figure 5.24 This Figure illustrates the changeability between branches of an alignment guiding tree. Assume that the distances from A, B, C, and D to G are all less than , then A, B, C, and D are interchangeable.
Figure 5.25 Percentage of motifs aligned by PADT-NJ, PADT-UPGMA, ClustalW, T-Coffee, ProbCons, MUSCLE, and DIAlign on RT-OSM sequence set PADT-NJ and PADT-UPGMA surpass ProbCons by 5% and 7%, respectively.
Figure 5.26 Total column (TC) score by PADT-NJ, PADT-UPGMA, ClustalW, T-Coffee, ProbCons, MUSCLE, and DIAlign on BAliBASE Ref#1, Ref#2, and Ref#3. PADT-NJ gains at least 9% improvement on Ref#2.
Figure 5.27 Total column (TC) score by PADT-NJ, PADT-UPGMA, ClustalW, T-Coffee, ProbCons, MUSCLE, and DIAlign on BAliBASE Ref#4 and Ref#5. PADT-NJ gains 6% on Ref#5 over DIAlign.
Figure 5.28 Quality score by PADT-NJ, PADT-UPGMA, ClustalW, T-Coffee, ProbCons, MUSCLE, and DIAlign on PREFAB data sets. On the first group, PADT-UPGMA gains 4% improvement over DIAlign and more than 10% improvement over others.
Chapter 6: Phylogeny in Multiple Sequence Alignments
Figure 6.1 A small clade of the phylogeny. (
Source
: Adapted from Mayr et al. 2005 [109].)
Figure 6.2 (a) Rooted phylogeny, (b) un-rooted phylogeny, and (c) transformation of (b) into a rooted phylogeny.
Figure 6.3 A tree generated from the additive distance matrix of Table 6.1.
Figure 6.4 Adding a new node to a tree – the dashed lines represent all possible placements of a new edge.
Figure 6.5 (a) A rooted phylogeny created for sequences in Table 6.2 with the mutations are represented in the internal nodes. (b) A Fitch's representation of the tree in (a).
Figure 6.6 Illustration of character mutations. Figure (b) represents the construction of internal nodes with Fitch's algorithm.
Figure 6.7 Illustration of the nearest-neighbor interchange technique. (a) the original tree with four OTUs in which is closest neighbor of and is closest neighbor of ; (b) the first interchange where the nearest neighbor of (which is ) is interchanged with the nearest neighbor of (which is ), and (c) the second interchange of the neighbors.
Figure 6.8 Illustration of calculations of maximum likelihood of a tree.
Chapter 7: Multiple Sequence Alignment on High-Performance Computing Models
Figure 7.1 Allowable configurations on four port processing units; (a) shows the port's directions; (b) shows the 15 possible port connections, where the last five port configurations in curly braces are not allowed in linear R-Mesh (LR-Mesh) models.
Figure 7.2 Two 1-bit max switches. (a) Fusing {NSEW} to find the max of two inputs from North and West ports (
Source
: Adapted from Bertossi and Mei 2000 [132]); (b) construction of a 1-bit 4-input max switch.
Figure 7.3 An -bit three-input max switch, where the rectangle represents a 1-bit four-input max switch from Figure 7.2.
Figure 7.4 An -bit adder/subtractor that can perform addition or subtraction between two 1UN numbers during a broadcasting time. For additions, the inputs are on the North and West borders and the output is on the South border. For subtractions, the inputs are on the West and South borders and the output is on the North border. The number on the Westbound is 1-bit left-shifted. The dotted lines represent the omitted processing units that are the same as the ones in the last rows. This Figure shows the addition of 3 and 3. Note: the leading 1 bit of input number on the Westbound (left) has been shifted off. The right border is fed with zero (or no signal) during the subtract operation.
Figure 7.5 A dynamic programming R-Mesh. Each cell is a combination of a three-input max switch and three adder/subtractor units. The “+” and “” represent the actual functions of the adder/subtractor units in the configuration.
Figure 7.6 A configuration of a four-way max switch to solve the longest common subsequence (LCS). The South-East processing unit (in bold) conFigure NS, E, W if the symbols at row and column are different; otherwise, it conFigure N, E, SW. This Figure shows a configuration when the two symbols are different.
Figure 7.7 A configuration to select one of the two inputs in 1UN notation using the rightmost bit as a selector . When , the switch is turned on to allow the data to flow through and get selected by the max switch. When the selector , the on/off switch produces zeros and the other data flow will be selected. , is the difference between opening gap cost and extending gap cost .
Figure 7.8 An R-Mesh represents an -bit on/off switches. By default, all processing units on the last column (column ) conFigure {NS,E,W} and fuse {NSEW} when a signal (i.e., 1) travels through them. All cells on the main antidiagonal cells of the first rows and columns fuse {NE,S,W}, cells above the main antidiagonal fuse {NS,E,W}, and cells below the main antidiagonal fuse {N,S,EW}. Figure (a) shows the R-Mesh configuration on a selector bit of 1 (s = 1) and Figure (b) shows the R-Mesh configuration on a selector bit of 0 (s = 0).
Chapter 8: Sequence Analysis Services
Figure 8.1 SeqAna's multiple sequence alignment and quality ranking service.
Figure 8.2 The hierarchical expected matching scoring tree built from BLOSUM62.
Figure 8.3 Pairwise alignment service.
Figure 8.4 SeqAna's Unique BLAST Service.
Figure 8.5 Sequence format converting service.
Chapter 9: Multiple Sequence for Next-Generation Sequences
Figure 9.1 Assembling sequence reads.
Figure 9.2 “GATTATTACCA” suffix tree.
Chapter 10: Multiple Sequence Alignment for Variations Detection
Figure 10.1 Types of single-nucleotide changes and short tandem repeats.
Figure 10.2 Types of copy number variations, including insertion and deletion.
Figure 10.3 Types of chromosomal rearrangements.
Figure 10.4 Variant calling workflow using multiple sequence alignment.
Figure 10.5 Categories of variant callers based on read mapping.
Figure 10.6 The four types of variants defined by MUM alignment.
Figure 10.7 The workflow of Mugsy for variant detection. The uppers are the processes, and the lowers are employed tools.
Figure 10.8 The workflow of SoftSV. SoftSV uses the paired-end reads and the soft-clip information to identify both small and large indels, inversions, tandem repeats, and translocations.
Chapter 11: Multiple Sequence Alignment for Structure Detection
Figure 11.1 Seven RNA SS elements. Hairpin (H) loop, internal (I) loop, bulges (B) loop, multihelix (M) loop, single-stranded region, helix stem, and pseudoknot are shown in the 2D representations.
Figure 11.2 The workflows to predict RNA SS with the alignment and folding in three distinct orders.
Figure 11.3 Generate the SS for each RNA sequence from the multialigned sequences.
Chapter 1: Introduction
Table 1.1 Common Amino Acids
Table 1.2 Ambiguous Amino Acids
Chapter 3: Quantifying Sequence Alignments
Table 3.1 DNA/RNA Scoring Matrix Used by NCBI
Table 3.2 250 PAM Substitution Matrix
Table 3.3 BLOSUM45 Substitution Matrix
Table 3.4 GONNET Matrix Generated from SWISS-PROT Database
Table 3.5 Each Label Column Represents a Residue Position in a Multiple Sequence Alignment
Table 3.6 Each Label Column Represents a Residue Position in a Multiple Sequence Alignment
Chapter 6: Phylogeny in Multiple Sequence Alignments
Table 6.1 A Distance Matrix
Table 6.2 An Example Set of Five Sequences Each Containing Five Residue Symbols
Table 6.3 Illustration of Bootstrapping Sampling and Replacement
Chapter 7: Multiple Sequence Alignment on High-Performance Computing Models
Table 7.1 Summary of Progressive Multiple Sequence Alignment Components Along with Their Time and CPU Complexity
Chapter 9: Multiple Sequence for Next-Generation Sequences
Table 9.1 Burrows–Wheeler Transformation
Table 9.2 Reversing Burrows–Wheeler Transformation
Table 9.3 Next-Generation Sequencing Alignment Tools – Download
Table 9.4 Next-Generation Sequencing Alignment Tools – Constraints
Chapter 10: Multiple Sequence Alignment for Variations Detection
Table 10.1 Tools for Comparing Variant Calls
Table 10.2 The Simulator Algorithms for the Next-Generation Sequencing
Wiley Series on
Bioinformatics: Computational Techniques and Engineering
A complete list of the titles in this series appears at the end of this volume.
Ken Nguyen
Xuan Guo
Yi Pan
Copyright © 2016 by John Wiley & Sons, Inc. All rights reserved
Published by John Wiley & Sons, Inc., Hoboken, New Jersey
Published simultaneously in Canada
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.
Library of Congress Cataloging-in-Publication Data:
Names: Nguyen, Ken, 1975- author. | Guo, Xuan, 1987- author. | Pan, Yi, 1960-author.
Title: Multiple biological sequence alignment : scoring functions, algorithms and applications / Ken Nguyen, Xuan Guo, Yi Pan.
Description: Hoboken, New Jersey : John Wiley & Sons, 2016. | Includes bibliographical references and index.
Identifiers: LCCN 2016004186| ISBN 9781118229040 (cloth) | ISBN 9781119273752 (epub)
Subjects: LCSH: Sequence alignment (Bioinformatics)
Classification: LCC QH441 .N48 2016 | DDC 572.8-dc23 LC record available at http://lccn.loc.gov/2016004186
Cover image courtesy of GettyImages/OktalStudio
The advancement in sequencing technology has resulted in enormous amount of data that must be analyzed and categorized. Recently, the US National Center for Biotechnology Information (NCBI) announced the availability of whole genome sequences for more than 1000 species. The number of sequenced individual organisms is growing very fast around the world. However, the availability of sequence information is only the first step in understanding how cells survive, reproduce, and adjust their behavior. The natural next step is to analyze the data, extract knowledge from them, and get a better understanding of the functions, and control mechanisms of these species. Since sequence alignment is the fundamental operation in dealing with the sequencing data, it is gaining a lot of interest from researchers who are mining biological information from massive sequence databases. Multiple sequence alignment algorithms are used to align three or more biological sequences. For example, multiple biological sequence alignment is used in motif finding, sequence clustering, and sequence mining. The current advancement in sequencing technology has created a massive number of biological sequences (DNA/RNA/protein, etc.), thus expanding the sequence databases at a rate exceeding the capability of existing sequence analysis tools and facility. As a result, many traditional sequence alignment models and methods become obsolete, along with the books covering only these topics.
Exhaustive dynamic programming is a straightforward way to compute optimal multiple sequence alignments. However, this approach is prohibitive in terms of both time and space. To overcome these constraints, heuristics such as progressive alignment have been suggested. Other new heuristic algorithms based on information extracted from sequences themselves are described in the book. Another issue is how to evaluate the quality of the alignments obtained. Clearly, a simple numerical score for matching or mismatching of two biological symbols is not always the right thing to do. We need to take into account the biological meanings of these symbols. Furthermore, the validations should be done by biological experiments or by biological experts on real biological sequences based on their experiences. The aim of this book is to cover as much topics as possible arising recently in this field apart from traditional multiple sequence alignment algorithms. This book provides detailed descriptions of the traditional and modern approaches in biological sequence alignments and homology search. It covers the full spectrum of the field from alignment algorithms to scoring methods,practical techniques, alignment tools, and their evaluations and applications. It provides the details of the state of the computational methodologies for challenging tasks in multiple sequence alignment including the following topics:
Background of protein/DNA/RNA pairwise and multiple sequence alignment
Sequence clustering methods
Theories and developments of scoring functions and scoring matrices
Cutting-edge development in multiple sequence alignment algorithms
Knowledgebase-assisted sequence alignment
Evaluation of existing sequence alignment algorithms
Homologous sequence classification and organization
Phylogeny estimation
Large-scale homology search
Parallel and distributed sequence alignment algorithms
Popular sequence alignment servers
Alignment algorithms for next-generation sequences
Variants detection based on multiple sequence alignment
RNA/protein secondary structure prediction using multiple sequence alignment
Multiple sequence alignment is often included as a chapter in most of bioinformatics and sequence analysis books. It is the first book in the field that provides a comprehensive view into recently developed large-scale alignment models and techniques that promise to handle the current and future challenges of aligning enormous amount of sequences. It is also the first book specifically written for multiple biological sequence alignment.
Readers will be informed with the fundamentals in multiple biological sequence alignment and analysis to understand the state-of-the-art techniques in sequence alignment and be able to choose appropriate sequence analysis tools for their tasks. The intended readers of the book are researchers, engineers, graduate and postgraduate students in bioinformatics and systems biology, and molecular biologists. It provides the readers with a comprehensive book in multiple biological sequence alignment theory, methodology, evaluation, applications, and technologies.
Finally, we would like to express our sincere thanks to Brett Kurzman (editor), Allison McGinniss (project editor), Simone Taylor (senior editor), and Diana Gialo (editorial assistant) from Wiley for their guidance and help in finalizing this book. We would also like to thank our families for their support, patience, and love. We hope that our readers will enjoy reading this timely-produced book and give us feedback for future improvements.
Ken Nguyen, PhD Department of Computer Science and Information Technology Clayton State University 2000 Clayton State Blvd. Morrow, GA 30260, USA Email: [email protected]
Xuan Guo, PhDSchool of Genome Science and TechnologyUniversity of Tennessee Knoxville F337 Walters Life Science, Knoxville, TN 37996-0840, USA Email: [email protected]
Computer Science and Mathematics DivisionOak Ridge National LaboratoryOak Ridge, TN 37831-6420 Email: [email protected]
Yi Pan, PhD Department of Computer Science Georgia State University 25 Park Place, Room 744, Atlanta, GA 30302-5060, USA Email: [email protected]
Majority of organisms on Earth, though diverse, share a significant biological similarity. There is an abundance of biological sequence data showing that any two mammals can have as many as 99% genes in common. Humans and fruit flies are two very different species that share at least 50% common genes. These striking facts have been discovered largely through biological sequence analysis.
Multiple sequence alignment is a fundamental task in bioinformatics and sequence analysis. In the early 1970s, deoxyribonucleic acid (DNA) sequences were obtained using laborious methods based on 2D chromatography. Thus, the number of sequences is limited and often being studied and annotated individually by scientists. By the late 1970s, Gilbert [1] and Sanger and Coulson [2] proposed DNA sequencing by chemical degradation and enzymatic synthesis, respectively. Their works earned a Nobel Prize in chemistry in 1980. Later, sequences are obtained by many newer methods such as dye-based methods [3], microarrays, mass spectrometry, X-ray, ultracentrifugation, and so on. Since the development of Sanger's method, the volume of sequences being identified and deposited is enormous. The current commercial sequencing such as “454 sequencing” can read up to 20 million bases per run and produce the sequences in hours. With this vast amount of sequences, manually annotating each sequence is infeasible. However, we need to categorize them by family, analyze them, find features that are common between them, and so on. The main step to solve this problem is finding the best way to start with the sequence fundamentals, thus leading the readers to the most modern and practical alignment techniques that have been proven to be effective in biological sequence analysis.
There are two popular trends in sequence analysis. One trend focuses primarily on applying rigorous mathematical methods to bring out the optimal alignment of the sequences, thus leading to revelation of possible hidden biological significance between sequences. The other trend stretches on correctly identifying the actual biological significance between the sequences, where some or all biological features may have already been known. These two trends emerge from specific tasks that bioinformatics scientists are dealing with. The first trend relates to prediction of the sequence structures and homology, evolution of species, or determination of the relationship between sequences in order to categorize and organize sequence databases. The second trend is to perform a daily task inwhich scientists want to arrange similar known features of the sequences into the same columns to see how closely they resemble each other. Thus, the second trend can be seen in evolution analysis, in sequence structure and functional analysis, or in drug design and discovery. In the later case, for each specific virus sequence, drug designers search for possible drug-like compounds from libraries of simple sequence models annotated with functional sites and specific drug-like compounds that can bind [4 5] . Hence, aligning a sequence obtained from a new virus against the library of sequences may lead to a manageable set of sequences and compounds to work with.
Lesen Sie weiter in der vollständigen Ausgabe!
Lesen Sie weiter in der vollständigen Ausgabe!
Lesen Sie weiter in der vollständigen Ausgabe!
Lesen Sie weiter in der vollständigen Ausgabe!
Lesen Sie weiter in der vollständigen Ausgabe!
Lesen Sie weiter in der vollständigen Ausgabe!
Lesen Sie weiter in der vollständigen Ausgabe!
Lesen Sie weiter in der vollständigen Ausgabe!
Lesen Sie weiter in der vollständigen Ausgabe!
Lesen Sie weiter in der vollständigen Ausgabe!
Lesen Sie weiter in der vollständigen Ausgabe!
Lesen Sie weiter in der vollständigen Ausgabe!
Lesen Sie weiter in der vollständigen Ausgabe!
Lesen Sie weiter in der vollständigen Ausgabe!
