Multiple Biological Sequence Alignment - Ken Nguyen - E-Book

Multiple Biological Sequence Alignment E-Book

Ken Nguyen

0,0
97,99 €

oder
-100%
Sammeln Sie Punkte in unserem Gutscheinprogramm und kaufen Sie E-Books und Hörbücher mit bis zu 100% Rabatt.

Mehr erfahren.
Beschreibung

Covers the fundamentals and techniques of multiple biological sequence alignment and analysis, and shows readers how to choose the appropriate sequence analysis tools for their tasks

This book describes the traditional and modern approaches in biological sequence alignment and homology search.  This book contains 11 chapters, with Chapter 1 providing basic information on biological sequences. Next, Chapter 2 contains fundamentals in pair-wise sequence alignment, while Chapters 3 and 4 examine popular existing quantitative models and practical clustering techniques that have been used in multiple sequence alignment. Chapter 5 describes, characterizes and relates many multiple sequence alignment models. Chapter 6 describes how traditionally phylogenetic trees have been constructed, and available sequence knowledge bases can be used to improve the accuracy of reconstructing phylogeny trees. Chapter 7 covers the latest methods developed to improve the run-time efficiency of multiple sequence alignment.  Next, Chapter 8 covers several popular existing multiple sequence alignment server and services, and Chapter 9 examines several multiple sequence alignment techniques that have been developed to handle short sequences (reads) produced by the Next Generation Sequencing technique (NSG). Chapter 10 describes a Bioinformatics application using multiple sequence alignment of short reads or whole genomes as input. Lastly, Chapter 11 provides a review of RNA and protein secondary structure prediction using the evolution information inferred from multiple sequence alignments.

• Covers the full spectrum of the field, from alignment algorithms to scoring methods, practical techniques, and alignment tools and their evaluations

• Describes theories and developments of scoring functions and scoring matrices

•Examines phylogeny estimation and large-scale homology search

Multiple Biological Sequence Alignment: Scoring Functions, Algorithms and Applications is a reference for researchers, engineers, graduate and post-graduate students in bioinformatics, and system biology and molecular biologists.

Ken Nguyen, PhD, is an associate professor at Clayton State University, GA, USA. He received his PhD, MSc and BSc degrees in computer science all from Georgia State University. His research interests are in databases, parallel and distribute computing and bioinformatics. He was a Molecular Basis of Disease fellow at Georgia State and is the recipient of the highest graduate honor at Georgia State, the William M. Suttles Graduate Fellowship.

Xuan Guo, PhD, is a postdoctoral associate at Oak Ridge National Lab, USA. He received his PhD degree in computer science from Georgia State University in 2015. His research interests are in bioinformatics, machine leaning, and cloud computing. He is an editorial assistant of International Journal of Bioinformatics Research and Applications.

Yi Pan, PhD, is a Regents' Professor of Computer Science and an Interim Associate Dean and Chair of Biology at Georgia State University. He received his BE and ME in computer engineering from Tsinghua University in China and his PhD in computer science from the University of Pittsburgh. Dr. Pan's research interests include parallel and distributed computing, optical networks, wireless networks and bioinformatics. He has published more than 180 journal papers with about 60 papers published in various IEEE/ACM journals. He is co-editor along with Albert Y. Zomaya of the Wiley Series in Bioinformatics.

Sie lesen das E-Book in den Legimi-Apps auf:

Android
iOS
von Legimi
zertifizierten E-Readern

Seitenzahl: 432

Veröffentlichungsjahr: 2016

Bewertungen
0,0
0
0
0
0
0
Mehr Informationen
Mehr Informationen
Legimi prüft nicht, ob Rezensionen von Nutzern stammen, die den betreffenden Titel tatsächlich gekauft oder gelesen/gehört haben. Wir entfernen aber gefälschte Rezensionen.



Table of Contents

Cover

Title Page

Copyright

Preface

Chapter 1: Introduction

1.1 Motivation

1.2 The Organization of this Book

1.3 Sequence Fundamentals

Chapter 2: Protein/DNA/RNA Pairwise Sequence Alignment

2.1 Sequence Alignment Fundamentals

2.2 Dot-Plot Matrix

2.3 Dynamic Programming

2.4 Word Method

2.5 Searching Sequence Databases

References

Chapter 3: Quantifying Sequence Alignments

3.1 Evolution and Measuring Evolution

3.2 Substitution Matrices and Scoring Matrices

3.3 GAPS

3.4 Scoring Multiple Sequence Alignments

3.5 Circular Sum Score

3.6 Conservation Score Schemes

3.7 Diversity Scoring Schemes

3.8 Stereochemical Property Methods

3.9 Hierarchical Expected Matching Probability Scoring Metric (HEP)

Chapter 4: Sequence Clustering

4.1 Unweighted Pair Group Method with Arithmetic Mean – UPGMA

4.2 Neighborhood-Joining Method – NJ

4.3 Overlapping Sequence Clustering

Chapter 5: Multiple Sequences Alignment Algorithms

5.1 Dynamic Programming

5.2 Progressive Alignment

5.3 Consistency and Probabilistic MSA

5.4 Genetic Algorithms

5.5 New Development in Multiple Sequence Alignment Algorithms

5.6 Test Data and Alignment Methods

5.7 Results

Chapter 6: Phylogeny in Multiple Sequence Alignments

6.1 The Tree of Life

6.2 Phylogeny Construction

6.3 Inferring Phylogeny from Multiple Sequence Alignments

Chapter 7: Multiple Sequence Alignment on High-Performance Computing Models

7.1 Parallel Systems

7.2 Exiting Parallel Multiple Sequence Alignment

7.3 Reconfigurable-Mesh Computing Models – (R-Mesh)

7.4 Pairwise Dynamic Programming Algorithms

7.5 Progressive Multiple Sequence Alignment ON R-Mesh

Chapter 8: Sequence Analysis Services

8.1 EMBL-EBI: European Bioinformatics Institute

8.2 NCBI: National Center for Biotechnology Information

8.3 GenomeNet and Data Bank of Japan

8.4 Other Sequence Analysis and Alignment Web Servers

8.5 SeqAna: Multiple Sequence Alignment with Quality Ranking

8.6 Pairwise Sequence Alignment and Other Analysis Tools

8.7 Tool Evaluation

Chapter 9: Multiple Sequence for Next-Generation Sequences

9.1 Introduction

9.2 Overview of Next Generation Sequence Alignment Algorithms

Chapter 10: Multiple Sequence Alignment for Variations Detection

10.1 Introduction

10.2 Genetic Variants

10.3 Variation Detection Methods Based on MSA

10.4 Evaluation Methodology

10.5 Conclusion and Future Work

Chapter 11: Multiple Sequence Alignment for Structure Detection

11.1 Introduction

11.2 RNA Secondary Structure Prediction Based on MSA

11.3 Protein Secondary Structure Prediction Based on MSA

11.4 Conclusion and Future Work

References

Index

End User License Agreement

Pages

xi

xii

xiii

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

25

26

27

28

29

30

31

32

33

34

35

36

37

38

39

40

41

42

43

44

45

46

47

48

49

50

51

52

53

54

55

56

57

58

59

60

61

62

63

64

65

66

67

68

69

70

71

72

73

74

75

76

77

78

79

80

81

82

83

84

85

86

87

88

89

90

91

92

93

94

95

96

97

98

99

100

101

103

104

105

106

107

108

109

110

111

112

113

114

115

116

117

118

119

120

121

122

123

124

125

126

127

128

129

130

131

133

134

135

136

137

138

139

140

141

142

143

145

146

147

148

149

150

151

152

153

154

155

156

157

158

159

161

162

163

164

165

166

167

168

169

170

171

172

173

174

175

176

177

179

180

181

182

183

184

185

186

187

188

189

190

191

192

193

194

195

196

197

199

200

201

202

203

204

205

206

207

208

209

210

211

212

213

214

215

216

199

200

201

202

203

204

205

206

207

208

209

210

211

212

213

214

215

216

219

220

221

222

223

224

225

Guide

Cover

Table of Contents

Preface

Begin Reading

List of Illustrations

Chapter 1: Introduction

Figure 1.1 Transcription and translation processes: DNA RNA protein (the “central dogma” of biology).

Figure 1.2 The yeast Sec23/24 heterodimer 1M2V: (a) protein structure and (b) primary sequence.

Figure 1.3 BLOSUM62 substitution matrix.

Figure 1.4 Four fundamental wobble base pairs.

Figure 1.5 NCBI's sequence formats.

Figure 1.6 Illustration of GenBank sequence format.

Figure 1.7 (a) Phylis format and (b) Clustal format.

Chapter 2: Protein/DNA/RNA Pairwise Sequence Alignment

Figure 2.1 Dot-plot matrix of two sequences ACACACTA and AGCACACA. The connected diagonal lines indicate matching segments.

Figure 2.2 Structures of the CALM3 protein. Based on PyMOL rendering of PDB 1a29.

Figure 2.3 Dot-plot matrix of human CalM against itself to reveal motifs.

Figure 2.4 Generating a scoring matrix for sequence “GAATTCAGTTA” and sequence “GGATTCCGA.”

Figure 2.5 Backtracking, (a–c), to obtain the optimal alignment (d).

Figure 2.6 Global alignment of the two sequences.

Figure 2.7 Smith–Waterman's local alignment of two sequences: “G A A T T C A G T T A” and “G G A T T C C G A.”

Figure 2.8 Displays all available words of size 3.

Figure 2.9 Displays an exact word matched.

Figure 2.10 (a) Filtering dot plot; (b) being reevaluated by score matrix; (c) filtering highest scored diagonals starting from the core region – identified by an arrow; (d) assembling the diagonals.

Chapter 3: Quantifying Sequence Alignments

Figure 3.1 A human Calmodulin wraps around its binding domain in the plasma membrane. atoms are depicted as circles.

Figure 3.2 Displays the possible paths a nucleotide can be mutated to with the same probability in Jukes and Cantor's model.

Figure 3.3 Shows the score by (a) edit distance and (b) BLOSUM62 matrix.

Figure 3.4 Illustrates the steps in assembling an MSA.

Figure 3.5 Traversal of a tree by an SP measurement. The total visits on each edge is shown. Some edges are visited more often than others.

Figure 3.6 Circular traversal of a tree. The dotted arrow indicates the traversal path.

Figure 3.7 Taylor's Venn diagram of amino acid properties.

Figure 3.8 The true table of Taylor's Venn diagram.

Figure 3.9 Amino Acid Class Hierarchy (AACH) used in PIMA [50]. Uppercase characters are amino acids; lowercase characters are amino acid classes. X is the wild-card character of any type, including a gap.

Figure 3.10 HEP scoring tree generated from BLOSUM62 substitution matrix. The matching cardinality of each amino acid class is shown at the bottom of the corresponding class symbol.

Figure 3.11 PIMA active scoring tree for the third column.

Figure 3.12 Matching Leucine from NNV0 with a similar amino acid in sequence SFV1 [51]. The dotted arrow shows a mismatch, the dashed arrow shows a better match, and the solid arrow shows the best match.

Figure 3.13 The RT-OSM sequences. The six motifs of the RT-OSM are indicated by roman numeral(I–VI). The bold and capitalized letters represent the core amino acids of each motif. (

Source

: Adapted from Hudak and McClure 1999 [51].)

Figure 3.14 Motif detection accuracy on the RT-OSM data set. Score of 1.0 means a method correctly detects and ranks all six motifs in the data set.

Figure 3.15 Reliability scores on the BAliBASE3.0 benchmark.

Figure 3.16 Reliability scores on 150 references of the PREFAB4.0 benchmark.

Figure 3.17 BAliBASE total column (TC) scores.

Figure 3.18 Total column (TC) scores of 150 random sequence sets from PREFAB4.0.

Chapter 4: Sequence Clustering

Figure 4.1 (a) is the input sequences, (b) substitutions on first column, and (c) substitutions on second column.

Figure 4.2 Illustrates each step of building the UPGMA phylogeny.

Figure 4.3 Producing NJ tree – not drawn to scale. (a) is initial star tree; (b) shows the combined OTU of A and B; (c) is the NJ tree with the new OTU AB; and (d) is the final NJ evolution tree.

Figure 4.4 Illustration of different alignment outcomes from the same input sequences. (a) shows the input sequences with the motifs in bold; (b) shows the expected alignment; (c) and (d) show the alignment based on different phylogeny guiding tree.

Figure 4.5 Illustration of the clustering of the sequences to generate a sequence pattern. (a) is the all-pairwise alignments, (b) is sequence clusterization, and (c) is sequence pattern identification, where S' is the cluster pattern. The boxes represent the DP local alignment results.

Chapter 5: Multiple Sequences Alignment Algorithms

Figure 5.1 Sample alignment of BAliBASE reference 1 [52] ubiquitin set. Each block represents a secondary feature such as alpha helices, beta strands, or unimportant structures.

Figure 5.2 Illustration of the divide and conquer multiple sequence alignment (DCA).

Figure 5.3 An example of progressive multiple sequence alignment. (a) is the input sequences, (b) is an alignment guiding tree, (c) is external nodes alignment, and (d) is internal nodes alignment.

Figure 5.4 The Amino Acid Class Hierarchy (AACCH) used in PIMA family; X represent a wildcard for matching any symbol, including gap.

Figure 5.5 A pattern generated from a pair-wise alignment.

Figure 5.6 POA algorithm: (a) input sequences, (b) directed graphs for the sequences, (c) graph fusion of sequences S1 and S2, (d) alignment of S1 and S2, (e) graph fusion of S3 to alignment (S1,S2), (f) final partial order graph and final multiple sequence alignment.

Figure 5.7 PSAlign algorithm: (a) input sequences, (b) minimum spanning tree, (c) undirected graph constructed from pair-wise alignments and spanning three, (d) partial order graph, and (e) final alignment result.

Figure 5.8 Basic pair-HMM for aligning two sequences and . State emits two letters that are being aligned from each sequence. States and emit a letter in sequence and , respectively, that are aligned to a gap.

Figure 5.9 Generating extended pair-wise alignment in T-Coffee. (a) sequence SeqA is aligned to sequence SeqC via immediate sequence SeqB. (b) new pair-wise alignment for sequence SeqA and sequence SeqC.

Figure 5.10 T-Coffee multiple sequence alignment schema.

Figure 5.11 Representing three sequences in de Bruijn graph.

Figure 5.12 Superpath transformation: (a) detachment and (b) -cut.

Figure 5.13 Illustration of SAGA, where the two parents are crossed breed. The dotted boxes represent the consistent alignment between the two parents.

Figure 5.14 SAGA alignment scheme. At any generation , the parents are crossed breed or mutated by a random operation to generate a new set of children .

Figure 5.15 Neural networks.

Figure 5.16 Knowledge-based multiple sequence alignment scheme. Initially, the input sequences are partitioned into semi-related groups. The groups' representative sequences are used to search for any available biological information from sequence knowledge bases. The sequences are given different weights and regrouped based on the discovered knowledge.

Figure 5.17 Clusters alignment by partial order graph, where bold texts represent beta-strand blocks, italics represents alpha-helix blocks, and others are nonsignificant blocks. (a) and (b) show the two clusters to be aligned, each sequence in the cluster is preceding by its name; (c) is the partial order graph representing clusters (a) and (b); (d) is the fusing graph of (a) and (b). These sequences are obtained from BAliBASE.

Figure 5.18 Finding the maximum matching score between two partial order graph nodes by sliding techniques. The overlapping columns are enclosed in the rectangle. Nonoverlapping columns are considered matching with gaps.

Figure 5.19 Creation of an annotated sequence for KB-MSA consistency database. (a) shows the input sequences and the final annotated pattern for seq1 having three featured blocks f1, f2, and f3 with weights 1/3, 2/3, and 1/3, respectively; (b) shows the immediate pair-wise local alignments of seq1 and their weight-marking vectors.

Figure 5.20 KB-MSA tested against ClustalW, T-Coffee, ProbCons, and MAFFT on BAliBASE benchmark. With complete sequence knowledge, KB-MSA increases average alignment score by more than 9%. With 10% sequence knowledge, KB-MSA yields about 4% improvement.

Figure 5.21 KB-MSA testing against ClustalW, T-Coffee, ProbCons, and MAFFT on SABmark benchmark. With complete sequence knowledge, KB-MSA increases average alignment score by at least 8% on the original data sets and more than 6% on false-positive data sets.

Figure 5.22 KB-MSA testing against ClustalW, T-Coffee, ProbCons, and MAFFT on HOMSTRAD. With complete sequence knowledge, KB-MSA increases average alignment score about 10% on data sets with less than 20% sequence identity.

Figure 5.23 KB-MSA testing against ClustalW, T-Coffee, ProbCons, and MAFFT. With complete sequence knowledge, KB-MSA increases average alignment score at least 7% on data sets with less than 20% sequence identity.

Figure 5.24 This Figure illustrates the changeability between branches of an alignment guiding tree. Assume that the distances from A, B, C, and D to G are all less than , then A, B, C, and D are interchangeable.

Figure 5.25 Percentage of motifs aligned by PADT-NJ, PADT-UPGMA, ClustalW, T-Coffee, ProbCons, MUSCLE, and DIAlign on RT-OSM sequence set PADT-NJ and PADT-UPGMA surpass ProbCons by 5% and 7%, respectively.

Figure 5.26 Total column (TC) score by PADT-NJ, PADT-UPGMA, ClustalW, T-Coffee, ProbCons, MUSCLE, and DIAlign on BAliBASE Ref#1, Ref#2, and Ref#3. PADT-NJ gains at least 9% improvement on Ref#2.

Figure 5.27 Total column (TC) score by PADT-NJ, PADT-UPGMA, ClustalW, T-Coffee, ProbCons, MUSCLE, and DIAlign on BAliBASE Ref#4 and Ref#5. PADT-NJ gains 6% on Ref#5 over DIAlign.

Figure 5.28 Quality score by PADT-NJ, PADT-UPGMA, ClustalW, T-Coffee, ProbCons, MUSCLE, and DIAlign on PREFAB data sets. On the first group, PADT-UPGMA gains 4% improvement over DIAlign and more than 10% improvement over others.

Chapter 6: Phylogeny in Multiple Sequence Alignments

Figure 6.1 A small clade of the phylogeny. (

Source

: Adapted from Mayr et al. 2005 [109].)

Figure 6.2 (a) Rooted phylogeny, (b) un-rooted phylogeny, and (c) transformation of (b) into a rooted phylogeny.

Figure 6.3 A tree generated from the additive distance matrix of Table 6.1.

Figure 6.4 Adding a new node to a tree – the dashed lines represent all possible placements of a new edge.

Figure 6.5 (a) A rooted phylogeny created for sequences in Table 6.2 with the mutations are represented in the internal nodes. (b) A Fitch's representation of the tree in (a).

Figure 6.6 Illustration of character mutations. Figure (b) represents the construction of internal nodes with Fitch's algorithm.

Figure 6.7 Illustration of the nearest-neighbor interchange technique. (a) the original tree with four OTUs in which is closest neighbor of and is closest neighbor of ; (b) the first interchange where the nearest neighbor of (which is ) is interchanged with the nearest neighbor of (which is ), and (c) the second interchange of the neighbors.

Figure 6.8 Illustration of calculations of maximum likelihood of a tree.

Chapter 7: Multiple Sequence Alignment on High-Performance Computing Models

Figure 7.1 Allowable configurations on four port processing units; (a) shows the port's directions; (b) shows the 15 possible port connections, where the last five port configurations in curly braces are not allowed in linear R-Mesh (LR-Mesh) models.

Figure 7.2 Two 1-bit max switches. (a) Fusing {NSEW} to find the max of two inputs from North and West ports (

Source

: Adapted from Bertossi and Mei 2000 [132]); (b) construction of a 1-bit 4-input max switch.

Figure 7.3 An -bit three-input max switch, where the rectangle represents a 1-bit four-input max switch from Figure 7.2.

Figure 7.4 An -bit adder/subtractor that can perform addition or subtraction between two 1UN numbers during a broadcasting time. For additions, the inputs are on the North and West borders and the output is on the South border. For subtractions, the inputs are on the West and South borders and the output is on the North border. The number on the Westbound is 1-bit left-shifted. The dotted lines represent the omitted processing units that are the same as the ones in the last rows. This Figure shows the addition of 3 and 3. Note: the leading 1 bit of input number on the Westbound (left) has been shifted off. The right border is fed with zero (or no signal) during the subtract operation.

Figure 7.5 A dynamic programming R-Mesh. Each cell is a combination of a three-input max switch and three adder/subtractor units. The “+” and “” represent the actual functions of the adder/subtractor units in the configuration.

Figure 7.6 A configuration of a four-way max switch to solve the longest common subsequence (LCS). The South-East processing unit (in bold) conFigure NS, E, W if the symbols at row and column are different; otherwise, it conFigure N, E, SW. This Figure shows a configuration when the two symbols are different.

Figure 7.7 A configuration to select one of the two inputs in 1UN notation using the rightmost bit as a selector . When , the switch is turned on to allow the data to flow through and get selected by the max switch. When the selector , the on/off switch produces zeros and the other data flow will be selected. , is the difference between opening gap cost and extending gap cost .

Figure 7.8 An R-Mesh represents an -bit on/off switches. By default, all processing units on the last column (column ) conFigure {NS,E,W} and fuse {NSEW} when a signal (i.e., 1) travels through them. All cells on the main antidiagonal cells of the first rows and columns fuse {NE,S,W}, cells above the main antidiagonal fuse {NS,E,W}, and cells below the main antidiagonal fuse {N,S,EW}. Figure (a) shows the R-Mesh configuration on a selector bit of 1 (s = 1) and Figure (b) shows the R-Mesh configuration on a selector bit of 0 (s = 0).

Chapter 8: Sequence Analysis Services

Figure 8.1 SeqAna's multiple sequence alignment and quality ranking service.

Figure 8.2 The hierarchical expected matching scoring tree built from BLOSUM62.

Figure 8.3 Pairwise alignment service.

Figure 8.4 SeqAna's Unique BLAST Service.

Figure 8.5 Sequence format converting service.

Chapter 9: Multiple Sequence for Next-Generation Sequences

Figure 9.1 Assembling sequence reads.

Figure 9.2 “GATTATTACCA” suffix tree.

Chapter 10: Multiple Sequence Alignment for Variations Detection

Figure 10.1 Types of single-nucleotide changes and short tandem repeats.

Figure 10.2 Types of copy number variations, including insertion and deletion.

Figure 10.3 Types of chromosomal rearrangements.

Figure 10.4 Variant calling workflow using multiple sequence alignment.

Figure 10.5 Categories of variant callers based on read mapping.

Figure 10.6 The four types of variants defined by MUM alignment.

Figure 10.7 The workflow of Mugsy for variant detection. The uppers are the processes, and the lowers are employed tools.

Figure 10.8 The workflow of SoftSV. SoftSV uses the paired-end reads and the soft-clip information to identify both small and large indels, inversions, tandem repeats, and translocations.

Chapter 11: Multiple Sequence Alignment for Structure Detection

Figure 11.1 Seven RNA SS elements. Hairpin (H) loop, internal (I) loop, bulges (B) loop, multihelix (M) loop, single-stranded region, helix stem, and pseudoknot are shown in the 2D representations.

Figure 11.2 The workflows to predict RNA SS with the alignment and folding in three distinct orders.

Figure 11.3 Generate the SS for each RNA sequence from the multialigned sequences.

List of Tables

Chapter 1: Introduction

Table 1.1 Common Amino Acids

Table 1.2 Ambiguous Amino Acids

Chapter 3: Quantifying Sequence Alignments

Table 3.1 DNA/RNA Scoring Matrix Used by NCBI

Table 3.2 250 PAM Substitution Matrix

Table 3.3 BLOSUM45 Substitution Matrix

Table 3.4 GONNET Matrix Generated from SWISS-PROT Database

Table 3.5 Each Label Column Represents a Residue Position in a Multiple Sequence Alignment

Table 3.6 Each Label Column Represents a Residue Position in a Multiple Sequence Alignment

Chapter 6: Phylogeny in Multiple Sequence Alignments

Table 6.1 A Distance Matrix

Table 6.2 An Example Set of Five Sequences Each Containing Five Residue Symbols

Table 6.3 Illustration of Bootstrapping Sampling and Replacement

Chapter 7: Multiple Sequence Alignment on High-Performance Computing Models

Table 7.1 Summary of Progressive Multiple Sequence Alignment Components Along with Their Time and CPU Complexity

Chapter 9: Multiple Sequence for Next-Generation Sequences

Table 9.1 Burrows–Wheeler Transformation

Table 9.2 Reversing Burrows–Wheeler Transformation

Table 9.3 Next-Generation Sequencing Alignment Tools – Download

Table 9.4 Next-Generation Sequencing Alignment Tools – Constraints

Chapter 10: Multiple Sequence Alignment for Variations Detection

Table 10.1 Tools for Comparing Variant Calls

Table 10.2 The Simulator Algorithms for the Next-Generation Sequencing

Wiley Series on

Bioinformatics: Computational Techniques and Engineering

A complete list of the titles in this series appears at the end of this volume.

Multiple Biological Sequence Alignment

Scoring Functions, Algorithms and Applications

 

 

Ken Nguyen

Xuan Guo

Yi Pan

 

 

 

Copyright © 2016 by John Wiley & Sons, Inc. All rights reserved

Published by John Wiley & Sons, Inc., Hoboken, New Jersey

Published simultaneously in Canada

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Cataloging-in-Publication Data:

Names: Nguyen, Ken, 1975- author. | Guo, Xuan, 1987- author. | Pan, Yi, 1960-author.

Title: Multiple biological sequence alignment : scoring functions, algorithms and applications / Ken Nguyen, Xuan Guo, Yi Pan.

Description: Hoboken, New Jersey : John Wiley & Sons, 2016. | Includes bibliographical references and index.

Identifiers: LCCN 2016004186| ISBN 9781118229040 (cloth) | ISBN 9781119273752 (epub)

Subjects: LCSH: Sequence alignment (Bioinformatics)

Classification: LCC QH441 .N48 2016 | DDC 572.8-dc23 LC record available at http://lccn.loc.gov/2016004186

Cover image courtesy of GettyImages/OktalStudio

Preface

The advancement in sequencing technology has resulted in enormous amount of data that must be analyzed and categorized. Recently, the US National Center for Biotechnology Information (NCBI) announced the availability of whole genome sequences for more than 1000 species. The number of sequenced individual organisms is growing very fast around the world. However, the availability of sequence information is only the first step in understanding how cells survive, reproduce, and adjust their behavior. The natural next step is to analyze the data, extract knowledge from them, and get a better understanding of the functions, and control mechanisms of these species. Since sequence alignment is the fundamental operation in dealing with the sequencing data, it is gaining a lot of interest from researchers who are mining biological information from massive sequence databases. Multiple sequence alignment algorithms are used to align three or more biological sequences. For example, multiple biological sequence alignment is used in motif finding, sequence clustering, and sequence mining. The current advancement in sequencing technology has created a massive number of biological sequences (DNA/RNA/protein, etc.), thus expanding the sequence databases at a rate exceeding the capability of existing sequence analysis tools and facility. As a result, many traditional sequence alignment models and methods become obsolete, along with the books covering only these topics.

Exhaustive dynamic programming is a straightforward way to compute optimal multiple sequence alignments. However, this approach is prohibitive in terms of both time and space. To overcome these constraints, heuristics such as progressive alignment have been suggested. Other new heuristic algorithms based on information extracted from sequences themselves are described in the book. Another issue is how to evaluate the quality of the alignments obtained. Clearly, a simple numerical score for matching or mismatching of two biological symbols is not always the right thing to do. We need to take into account the biological meanings of these symbols. Furthermore, the validations should be done by biological experiments or by biological experts on real biological sequences based on their experiences. The aim of this book is to cover as much topics as possible arising recently in this field apart from traditional multiple sequence alignment algorithms. This book provides detailed descriptions of the traditional and modern approaches in biological sequence alignments and homology search. It covers the full spectrum of the field from alignment algorithms to scoring methods,practical techniques, alignment tools, and their evaluations and applications. It provides the details of the state of the computational methodologies for challenging tasks in multiple sequence alignment including the following topics:

Background of protein/DNA/RNA pairwise and multiple sequence alignment

Sequence clustering methods

Theories and developments of scoring functions and scoring matrices

Cutting-edge development in multiple sequence alignment algorithms

Knowledgebase-assisted sequence alignment

Evaluation of existing sequence alignment algorithms

Homologous sequence classification and organization

Phylogeny estimation

Large-scale homology search

Parallel and distributed sequence alignment algorithms

Popular sequence alignment servers

Alignment algorithms for next-generation sequences

Variants detection based on multiple sequence alignment

RNA/protein secondary structure prediction using multiple sequence alignment

Multiple sequence alignment is often included as a chapter in most of bioinformatics and sequence analysis books. It is the first book in the field that provides a comprehensive view into recently developed large-scale alignment models and techniques that promise to handle the current and future challenges of aligning enormous amount of sequences. It is also the first book specifically written for multiple biological sequence alignment.

Readers will be informed with the fundamentals in multiple biological sequence alignment and analysis to understand the state-of-the-art techniques in sequence alignment and be able to choose appropriate sequence analysis tools for their tasks. The intended readers of the book are researchers, engineers, graduate and postgraduate students in bioinformatics and systems biology, and molecular biologists. It provides the readers with a comprehensive book in multiple biological sequence alignment theory, methodology, evaluation, applications, and technologies.

Finally, we would like to express our sincere thanks to Brett Kurzman (editor), Allison McGinniss (project editor), Simone Taylor (senior editor), and Diana Gialo (editorial assistant) from Wiley for their guidance and help in finalizing this book. We would also like to thank our families for their support, patience, and love. We hope that our readers will enjoy reading this timely-produced book and give us feedback for future improvements.

Ken Nguyen, PhD Department of Computer Science and Information Technology Clayton State University 2000 Clayton State Blvd. Morrow, GA 30260, USA Email: [email protected]

Xuan Guo, PhDSchool of Genome Science and TechnologyUniversity of Tennessee Knoxville F337 Walters Life Science, Knoxville, TN 37996-0840, USA Email: [email protected]

Computer Science and Mathematics DivisionOak Ridge National LaboratoryOak Ridge, TN 37831-6420 Email: [email protected]

Yi Pan, PhD Department of Computer Science Georgia State University 25 Park Place, Room 744, Atlanta, GA 30302-5060, USA Email: [email protected]

Chapter 1Introduction

Majority of organisms on Earth, though diverse, share a significant biological similarity. There is an abundance of biological sequence data showing that any two mammals can have as many as 99% genes in common. Humans and fruit flies are two very different species that share at least 50% common genes. These striking facts have been discovered largely through biological sequence analysis.

Multiple sequence alignment is a fundamental task in bioinformatics and sequence analysis. In the early 1970s, deoxyribonucleic acid (DNA) sequences were obtained using laborious methods based on 2D chromatography. Thus, the number of sequences is limited and often being studied and annotated individually by scientists. By the late 1970s, Gilbert [1] and Sanger and Coulson [2] proposed DNA sequencing by chemical degradation and enzymatic synthesis, respectively. Their works earned a Nobel Prize in chemistry in 1980. Later, sequences are obtained by many newer methods such as dye-based methods [3], microarrays, mass spectrometry, X-ray, ultracentrifugation, and so on. Since the development of Sanger's method, the volume of sequences being identified and deposited is enormous. The current commercial sequencing such as “454 sequencing” can read up to 20 million bases per run and produce the sequences in hours. With this vast amount of sequences, manually annotating each sequence is infeasible. However, we need to categorize them by family, analyze them, find features that are common between them, and so on. The main step to solve this problem is finding the best way to start with the sequence fundamentals, thus leading the readers to the most modern and practical alignment techniques that have been proven to be effective in biological sequence analysis.

1.1 Motivation

There are two popular trends in sequence analysis. One trend focuses primarily on applying rigorous mathematical methods to bring out the optimal alignment of the sequences, thus leading to revelation of possible hidden biological significance between sequences. The other trend stretches on correctly identifying the actual biological significance between the sequences, where some or all biological features may have already been known. These two trends emerge from specific tasks that bioinformatics scientists are dealing with. The first trend relates to prediction of the sequence structures and homology, evolution of species, or determination of the relationship between sequences in order to categorize and organize sequence databases. The second trend is to perform a daily task inwhich scientists want to arrange similar known features of the sequences into the same columns to see how closely they resemble each other. Thus, the second trend can be seen in evolution analysis, in sequence structure and functional analysis, or in drug design and discovery. In the later case, for each specific virus sequence, drug designers search for possible drug-like compounds from libraries of simple sequence models annotated with functional sites and specific drug-like compounds that can bind [4 5] . Hence, aligning a sequence obtained from a new virus against the library of sequences may lead to a manageable set of sequences and compounds to work with.

Lesen Sie weiter in der vollständigen Ausgabe!

Lesen Sie weiter in der vollständigen Ausgabe!

Lesen Sie weiter in der vollständigen Ausgabe!

Lesen Sie weiter in der vollständigen Ausgabe!

Lesen Sie weiter in der vollständigen Ausgabe!

Lesen Sie weiter in der vollständigen Ausgabe!

Lesen Sie weiter in der vollständigen Ausgabe!

Lesen Sie weiter in der vollständigen Ausgabe!

Lesen Sie weiter in der vollständigen Ausgabe!

Lesen Sie weiter in der vollständigen Ausgabe!

Lesen Sie weiter in der vollständigen Ausgabe!

Lesen Sie weiter in der vollständigen Ausgabe!

Lesen Sie weiter in der vollständigen Ausgabe!

Lesen Sie weiter in der vollständigen Ausgabe!