123,99 €
A comprehensive overview of high-performance pattern recognition techniques and approaches to Computational Molecular Biology
This book surveys developments in pattern recognition techniques and approaches related to Computational Molecular Biology. Providing broad coverage of the field, the authors present fundamental and technical information on these techniques and approaches and discuss their related problems. The text consists of twenty-nine chapters, organized into seven parts: Pattern Recognition in Sequences, Pattern Recognition in Secondary Structures, Pattern Recognition in Tertiary Structures, Pattern Recognition in Quaternary Structures, Pattern Recognition in Microarrays, Pattern Recognition in Phylogenetic Trees, and Pattern Recognition in Biological Networks.
Number of pages: 1185
Year of publication: 2015
Wiley Series
Cover
Title Page
Copyright
List of Contributors
Preface
Part 1: Pattern Recognition in Sequences
Chapter 1: Combinatorial Haplotyping Problems
1.1 Introduction
1.2 Single Individual Haplotyping
1.3 Population Haplotyping
References
Chapter 2: Algorithmic Perspectives of the String Barcoding Problems
2.1 Introduction
2.2 Summary of Algorithmic Complexity Results for Barcoding Problems
2.3 Entropy-Based Information Content Technique for Designing Approximation Algorithms for String Barcoding Problems
2.4 Techniques for Proving Inapproximability Results for String Barcoding Problems
2.5 Heuristic Algorithms for String Barcoding Problems
2.6 Conclusion
Acknowledgments
References
Chapter 3: Alignment-Free Measures for Whole-Genome Comparison
3.1 Introduction
3.2 Whole-Genome Sequence Analysis
3.3 Underlying Approach
3.4 Experimental Results
3.5 Conclusion
Author's Contributions
3.6 Acknowledgments
References
Chapter 4: A Maximum Likelihood Framework for Multiple Sequence Local Alignment
4.1 Introduction
4.2 Multiple Sequence Local Alignment
4.3 Motif Finding Algorithms
4.4 Time Complexity
4.5 Case Studies
4.6 Conclusion
References
Chapter 5: Global Sequence Alignment with a Bounded Number of Gaps
5.1 Introduction
5.2 Definitions and Notation
5.3 Problem Definition
5.4 Algorithms
5.5 Conclusion
References
Part 2: Pattern Recognition in Secondary Structures
Chapter 6: A Short Review on Protein Secondary Structure Prediction Methods
6.1 Introduction
6.2 Representative Protein Secondary Structure Prediction Methods
6.3 Evaluation of Protein Secondary Structure Prediction Methods
6.4 Conclusion
Acknowledgments
References
Chapter 7: A Generic Approach to Biological Sequence Segmentation Problems: Application to Protein Secondary Structure Prediction
7.1 Introduction
7.2 Biological Sequence Segmentation
7.3 MSVMpred
7.4 Postprocessing with a Generative Model
7.5 Dedication to Protein Secondary Structure Prediction
7.6 Conclusions and Ongoing Research
Acknowledgments
References
Chapter 8: Structural Motif Identification and Retrieval: A Geometrical Approach
8.1 Introduction
8.2 A Few Basic Concepts
8.3 State of the Art
8.4 A Novel Geometrical Approach to Motif Retrieval
8.5 Implementation Notes
8.6 Conclusions and Future Work
Acknowledgment
References
Chapter 9: Genome-Wide Search for Pseudoknotted Noncoding RNA: A Comparative Study
9.1 Introduction
9.2 Background
9.3 Methodology
9.4 Results and Interpretation
9.5 Conclusion
References
Part 3: Pattern Recognition in Tertiary Structures
Chapter 10: Motif Discovery in Protein 3D-Structures using Graph Mining Techniques
10.1 Introduction
10.2 From Protein 3D-Structures to Protein Graphs
10.3 Graph Mining
10.4 Subgraph Mining
10.5 Frequent Subgraph Discovery
10.6 Feature Selection
10.7 Feature Selection for Subgraphs
10.8 Discussion
10.9 Conclusion
Acknowledgments
References
Chapter 11: Fuzzy and Uncertain Learning Techniques for the Analysis and Prediction of Protein Tertiary Structures
11.1 Introduction
11.2 Genetic Algorithms
11.3 Supervised Machine Learning Algorithm
11.4 Fuzzy Application
11.5 Conclusion
References
Chapter 12: Protein Inter-Domain Linker Prediction
12.1 Introduction
12.2 Protein Structure Overview
12.3 Technical Challenges and Open Issues
12.4 Prediction Assessment
12.5 Current Approaches
12.6 Domain Boundary Prediction Using Enhanced General Regression Network
12.7 Inter-Domain Linkers Prediction Using Compositional Index and Simulated Annealing
12.8 Conclusion
References
Chapter 13: Prediction of Proline Cis–Trans Isomerization
13.1 Introduction
13.2 Methods
13.3 Model Evaluation and Analysis
13.4 Conclusion
References
Part 4: Pattern Recognition in Quaternary Structures
Chapter 14: Prediction of Protein Quaternary Structures
14.1 Introduction
14.2 Protein Structure Prediction
14.3 Template-Based Predictions
14.4 Critical Assessment of Protein Structure Prediction
14.5 Quaternary Structure Prediction
14.6 Conclusion
Acknowledgments
References
Chapter 15: Comparison of Protein Quaternary Structures by Graph Approaches
15.1 Introduction
15.2 Similarity in the Graph Model
15.3 Measuring Structural Similarity Via MCES
15.4 Protein Comparison Via Graph Spectra
15.5 Conclusion
References
Chapter 16: Structural Domains in Prediction of Biological Protein–Protein Interactions
16.1 Introduction
16.2 Structural Domains
16.3 The Prediction Framework
16.4 Feature Extraction and Prediction Properties
16.5 Feature Selection
16.6 Classification
16.7 Evaluation and Analysis
16.8 Results and Discussion
16.9 Conclusion
References
Part 5: Pattern Recognition in Microarrays
Chapter 17: Content-Based Retrieval of Microarray Experiments
17.1 Introduction
17.2 Information Retrieval: Terminology and Background
17.3 Content-Based Retrieval
17.4 Microarray Data and Databases
17.5 Methods for Retrieving Microarray Experiments
17.6 Similarity Metrics
17.7 Evaluating Retrieval Performance
17.8 Software Tools
17.9 Conclusion and Future Directions
Acknowledgment
References
Chapter 18: Extraction of Differentially Expressed Genes in Microarray Data
18.1 Introduction
18.2 From Microarray Image to Signal
18.3 Microarray Signal Analysis
18.4 Algorithms for DE Gene Selection
18.5 Gene Ontology Enrichment and Gene Set Enrichment Analysis
18.6 Conclusion
References
Chapter 19: Clustering and Classification Techniques for Gene Expression Profile Pattern Analysis
19.1 Introduction
19.2 Transcriptome Analysis
19.3 Microarrays
19.4 RNA-Seq
19.5 Benefits and Drawbacks of RNA-Seq and Microarray Technologies
19.6 Gene Expression Profile Analysis
19.7 Real Case Studies
19.8 Conclusions
References
Chapter 20: Mining Informative Patterns in Microarray Data
20.1 Introduction
20.2 Patterns with Similarity
20.3 Conclusion
References
Chapter 21: Arrow Plot and Correspondence Analysis Maps for Visualizing the Effects of Background Correction and Normalization Methods on Microarray Data
21.1 Overview
21.2 Arrow Plot
21.3 Significance Analysis of Microarrays
21.4 Correspondence Analysis
21.5 Impact of the Preprocessing Methods
21.6 Conclusions
Acknowledgments
References
Part 6: Pattern Recognition in Phylogenetic Trees
Chapter 22: Pattern Recognition in Phylogenetics: Trees and Networks
22.1 Introduction
22.2 Networks and Trees
22.3 Patterns and Their Processes
22.4 The Types of Patterns
22.5 Fingerprints
22.6 Constructing Networks
22.7 Multi-Labeled Trees
22.8 Conclusion
References
Chapter 23: Diverse Considerations for Successful Phylogenetic Tree Reconstruction: Impacts from Model Misspecification, Recombination, Homoplasy, and Pattern Recognition
23.1 Introduction
23.2 Overview on Methods and Frameworks for Phylogenetic Tree Reconstruction
23.3 Influence of Substitution Model Misspecification on Phylogenetic Tree Reconstruction
23.4 Influence of Recombination on Phylogenetic Tree Reconstruction
23.5 Influence of Diverse Evolutionary Processes on Species Tree Reconstruction
23.6 Influence of Homoplasy on Phylogenetic Tree Reconstruction: The Goals of Pattern Recognition
23.7 Concluding Remarks
Acknowledgments
References
Chapter 24: Automated Plausibility Analysis of Large Phylogenies
24.1 Introduction
24.2 Preliminaries
24.3 A Naïve Approach
24.4 Toward a Faster Method
24.5 Improved Algorithm
24.6 Implementation
24.7 Evaluation
24.8 Conclusion
Acknowledgment
References
Chapter 25: A New Fast Method for Detecting and Validating Horizontal Gene Transfer Events Using Phylogenetic Trees and Aggregation Functions
25.1 Introduction
25.2 Methods
25.3 Experimental Study
25.4 Results and Discussion
25.5 Conclusion
References
Part 7: Pattern Recognition in Biological Networks
Chapter 26: Computational Methods for Modeling Biological Interaction Networks
26.1 Introduction
26.2 Measures/Metrics
26.3 Models of Biological Networks
26.4 Reconstructing and Partitioning Biological Networks
26.5 PPI Networks
26.6 Mining PPI Networks—Interaction Prediction
26.7 Conclusions
References
Chapter 27: Biological Network Inference at Multiple Scales: From Gene Regulation to Species Interactions
27.1 Introduction
27.2 Molecular Systems
27.3 Ecological Systems
27.4 Models and Evaluation
27.5 Learning Gene Regulation Networks
27.6 Learning Species Interaction Networks
27.7 Conclusion
References
Chapter 28: Discovering Causal Patterns with Structural Equation Modeling: Application to Toll-Like Receptor Signaling Pathway in Chronic Lymphocytic Leukemia
28.1 Introduction
28.2 Toll-Like Receptors
28.3 Structural Equation Modeling
28.4 Application
28.5 Conclusion
References
Chapter 29: Annotating Proteins with Incomplete Label Information
29.1 Introduction
29.2 Related Work
29.3 Problem Formulation
29.4 Experimental Setup
29.5 Experimental Analysis
29.6 Conclusions
Acknowledgments
References
Index
Wiley Series on Bioinformatics: Computational Techniques and Engineering
End User License Agreement
Chapter 1: Combinatorial Haplotyping Problems
Figure 1.1 A chromosome in three individuals. There are four SNPs.
Figure 1.2 Sequence reads and assembly of the two chromosome copies.
Figure 1.3 SNP matrix and conflict graphs.
Figure 1.4 Haplotypes and corresponding genotypes.
Chapter 2: Algorithmic Perspectives of the String Barcoding Problems
Figure 2.1 An example of a valid barcode.
Figure 2.2 Inclusion relationships among the various barcoding problems.
Figure 2.3 A greedy algorithm for solving MSC [13].
Figure 2.4 Greedy selection of strings in our solution generates a tree partition of the set of input sequences such that each leaf node has exactly one sequence.
Chapter 3: Alignment-Free Measures for Whole-Genome Comparison
Figure 3.1 Whole-genome phylogeny of the 2009 world pandemic Influenza A (H1N1) generated by the UA. The clades are highlighted with different shades of gray. The node Mexico/4108 is probably the closest isolate to the origin of the influenza. The organisms that do not fall into one of the two main clades according to the literature are shown in bold.
Figure 3.2 Whole-genome phylogeny of the genus Plasmodium by UA, with our whole-genome distance highlighted on the branches.
Figure 3.3 Whole-genome phylogeny of prokaryotes by UA. The clades are highlighted with different shades of gray. Only two organisms do not fall into the correct clade: Methanosarcina acetivorans (Archaea) and Desulfovibrio vulgaris subsp. vulgaris (Bacteria).
Chapter 4: A Maximum Likelihood Framework for Multiple Sequence Local Alignment
Figure 4.1 Sequence local alignment and motif finding. (a) Multiple sequence local alignment of protein–DNA interaction binding sites. The short motif sequence shown in a highlighted box represents the site recognized and bound by a regulatory protein. (b) Motif sequences (sites) aligned to discover a pattern. (c) Nucleotide counting matrix derived from the aligned sequences in (b) by counting column-wise. (d) Sequence logo representation of the motif alignment, which visually displays the residue conservation across the alignment positions. The sequence logo is plotted using the BiLogo server (http://bipad.cmh.edu/bilogo.html). Note that the sequence logo plots the motif conservation across positions. Each column consists of a stack of nucleotide letters (A, C, G, T). The height of a letter represents its occurrence percentage at that position. The total information content of each column is adjusted to a maximum of 2.0 bits.
Figure 4.2 Results for sequences of CRP binding sites. (a) Motif sequence logo of CRP binding sites (22 bp). Sequence logos are plotted using the BiLogo server (http://bipad.cmh.edu/bilogo.html). (b) Comparison of the curves along 100 iterations at maximum.
Figure 4.3 Alignment of the HTH_ICLR domains using the 39 most diverse sequences from the NCBI CDD database [20]. The 2G7U chain A structure data (mmdb: 38,409, Zhang et al. [20], unpublished) is used to plot the 3D structure using Cn3D software downloaded from NCBI [20]. (a) Alignment of the HTH_ICLR domains. The core motif is aligned in the center; the left subunit (motif-1) is aligned with a three-residue shift to the left compared to the same alignment reported in the CDD [20]. The right subunit (motif-2) is aligned with a two-residue shift to the left. (b) Sequence logos of three motifs: motif-1, core-motif, and motif-2, displayed from left to right. The sequence logos are plotted using WebLogo. (c) The organization of a typical protein in the IclR family (chain A) from NCBI CDD. (d) The 3D structure of the 2G7U chain A. The highlighted part of the structure is the HTH_ICLR domains.
Chapter 5: Global Sequence Alignment with a Bounded Number of Gaps
Figure 5.1 Distribution of gap lengths in Homo sapiens exome sequencing.
Figure 5.2 Matrices and for and .
Figure 5.3 Matrices and for and .
Figure 5.4 Solution for , , , and .
Figure 5.5 Matrices and for , , and .
Chapter 6: A Short Review on Protein Secondary Structure Prediction Methods
Figure 6.1 The development history of protein secondary structure prediction methods.
Figure 6.2 The architecture of an NN for protein secondary structure prediction.
Figure 6.3 A general flowchart for the development of protein secondary structure prediction methods.
Chapter 7: A Generic Approach to Biological Sequence Segmentation Problems: Application to Protein Secondary Structure Prediction
Figure 7.1 Topology of MSVMpred. The cascade of classifiers computes estimates of the class posterior probabilities for the position in the sequence currently at the center of the analysis window .
Figure 7.2 Topology of the hybrid prediction method. Thanks to the postprocessing with the generative model, the context effectively available to perform the prediction exceeds that resulting from the combination of the two analysis windows of MSVMpred.
Figure 7.3 Correlation between the recognition rate () of MSVMpred2 and the one of the hybrid model. The dashed line is obtained by linear regression.
Chapter 8: Structural Motif Identification and Retrieval: A Geometrical Approach
Figure 8.1 Simplified diagram of the multilevel protein hierarchy. Source: Modified from a public-domain image created by Mariana Ruiz Villarreal (http://commons.wikimedia.org/wiki/File:Main_protein_structure_levels_en.svg).
Figure 8.2 Approximation of the secondary structure elements and motifs.
Figure 8.3 Greek key in theory and in practice: (a) idealized structure and (b) noncoplanar Greek key in staphylococcal nuclease [5].
Figure 8.4 ProSMoS: the meta-matrix of the 3br8 protein. Lines 2–8 list the SSEs inside the protein with their geometrical and biological properties, and lines 10–16 describe the relation matrix among the SSEs of 3br8.
Figure 8.5 (a) The geometric model of an SSM. Vertices and edges are labeled. (b) Two SSEs with the same SSM graph representation but with different connections.
Figure 8.6 A super-SSE of cardinality is present in a protein only if its subsets of cardinality are present as well.
Chapter 9: Genome-Wide Search for Pseudoknotted Noncoding RNA: A Comparative Study
Figure 9.1 A pseudoknotted secondary structure (top), its arc-based representation (middle), and dot–bracket notation (bottom).
Figure 9.2 A hypothetical Stockholm alignment with a pseudoknot.
Figure 9.3 Methodology used in the analysis of pseudoknotted ncRNA search.
Chapter 10: Motif Discovery in Protein 3D-Structures using Graph Mining Techniques
Figure 10.1 Triangulation example in a 2D-space. (a) Triangulation meets the Delaunay condition. (b) Triangulation does not meet the Delaunay condition.
Figure 10.2 Example of an all- protein 3D-structure (lipid-binding protein, PDB identifier: 2JW2) transformed into graphs of amino acids using the All Atoms technique (81 nodes and 562 edges), then using the Main Atom technique with as the main atom (81 nodes and 276 edges). Both techniques are used with a distance threshold of 7 Å.
Figure 10.4 Example of an – protein 3D-structure (for cell adhesion, PDB identifier: 3BQN) transformed into graphs of amino acids using the All Atoms technique (364 nodes and 3306 edges), then using the Main Atom technique with as the main atom (364 nodes and 1423 edges). Both techniques are used with a distance threshold of 7 Å.
Figure 10.5 Example of an unlabeled graph (a) and a labeled graph (b).
Figure 10.6 Example of a subgraph (g) that is frequent in the graphs (a), (b), and (c) with a support of .
Figure 10.7 Example of a candidates search tree in a BFS manner (a) and a DFS manner (b). Each node in the search tree represents a candidate subgraph.
Chapter 11: Fuzzy and Uncertain Learning Techniques for the Analysis and Prediction of Protein Tertiary Structures
Figure 11.1 Chemical structure of a protein, Protein Data Bank (PDB) code 1AIX [6].
Figure 11.9 Basic structure of an SVM.
Figure 11.5 An example of the possible moves in the HH model (a), as well as a sequence with 48 monomers [HHHHPHHPHHHHHPPHPPHHPPHPPPPPPHPPHPPPHPPHHPPHHHPH] (b). Source: Custódio et al. [16]. Reproduced with permission of Elsevier.
Figure 11.6 A molecule used to depict the , labeled phi, , labeled psi, and , labeled omega, dihedral angles.
Figure 11.7 Illustration of the final product of an all-atom model. This is the actual backbone of protein 1AIX [6].
Figure 11.8 Basic structure of an ANN.
Figure 11.10 Structure of an ANFIS.
Chapter 12: Protein Inter-Domain Linker Prediction
Figure 12.1 N-score and C-score are the number of AA residues that do not match when comparing the predicted linker segment with the known linker segment. A lower score means a more accurate prediction. For an exact match, the N-score and C-score should be equal to zero.
Figure 12.2 Basic EGRN architecture.
Figure 12.3 Method overview. The method consists of two main steps: calculating the compositional index and then refining the prediction by detecting the optimal set of threshold values that distinguish inter-domain linkers from nonlinkers using simulated annealing.
Figure 12.4 Linker preference profile. Linker regions (1, 2, …, 5) less than a threshold value of 0 are identified from the protein sequence.
Figure 12.5 Optimal threshold values for XYNA_THENE protein sequence in DomCut data set.
Figure 12.6 Optimal threshold values for 6pax_A protein sequence in DS-All data set.
Chapter 13: Prediction of Proline Cis–Trans Isomerization
Figure 13.1 A general architecture of Method I ensemble. The collection of DTs , where the are independently, identically distributed random DTs, and each DT cast “a unit vote” for the final classification of input .
Figure 13.2 A flowchart of Method I ensemble. The left-hand side shows the main flow of the Method I ensemble, while the right-hand side expands the process that builds the next split in the main flowchart.
Chapter 14: Prediction of Protein Quaternary Structures
Figure 14.1 The different protein structures: (a) primary structure (polypeptide chain), (b) secondary structure (α-helix), (c) tertiary structure (example: myoglobin), and (d) quaternary structure (example: hemoglobin).
Figure 14.2 The schematic representation of different polypeptide chains that form various oligomers. Source: Chou and Cai (2003) [22]. Reproduced with permission of John Wiley and Sons, Inc.
Figure 14.3 A schematic drawing to illustrate protein oligomers with: (a) C5 symmetry, (b) D4 symmetry, (c) tetrahedron symmetry, (d) cubic symmetry, and (e) icosahedron symmetry. Source: Chou and Cai (2003) [22]. Reproduced with permission of John Wiley and Sons, Inc.
Figure 14.4 Comparison of determined number of protein sequences and protein structures based on statistical data from PDB and UniProtKB.
Figure 14.5 Definition procedure for the huge number of possible sequence order patterns as a set of discrete numbers. Panels (a–c) represent the correlation mode between all the most contiguous residues, all the second-most contiguous residues, and all the third-most contiguous residues, respectively. Source: Chou and Cai (2003) [22]. Reproduced with permission of John Wiley and Sons, Inc.
Chapter 15: Comparison of Protein Quaternary Structures by Graph Approaches
Figure 15.1 An illustration of protein graph remodeling.
Figure 15.2 Protein graph transformation.
Figure 15.3 (a) a P-graph with and . (b) The P-table of . (c) The line graph of .
Figure 15.4 The process of constructing modular graph from and .
Figure 15.5 Protein structures with display style in schematic view.
Figure 15.6 An illustration of cospectral graphs.
Figure 15.7 An example of isomorphic graphs.
Figure 15.8 Clustering constructed by different distance metrics. (a) For RMSD, (b) for the Seidel spectrum, and (c) for the CATH code.
Chapter 16: Structural Domains in Prediction of Biological Protein–Protein Interactions
Figure 16.1 Four levels of the Class, Architecture, Topology, and Homologous superfamily (CATH) hierarchy.
Figure 16.2 Domain-based framework used to predict PPI types.
Figure 16.3 ROC curves and AUC values for all subsets of features of (a) MW and (b) ZH data sets.
Figure 16.4 Schematic view of level 3 of CATH DDIs present in the MW and ZH data sets.
Chapter 17: Content-Based Retrieval of Microarray Experiments
Figure 17.1 Query-by-example framework for content-based information retrieval.
Figure 17.2 Format of a gene expression matrix: the expression of the i-th gene in the j-th condition is shown by e_ij.
Figure 17.3 Microarray experiment retrieval framework based on differential expression fingerprints: (a) fingerprint extraction, (b) database search and ranked retrieval.
Chapter 18: Extraction of Differentially Expressed Genes in Microarray Data
Figure 18.1 Probe set in Affymetrix technology.
Figure 18.2 Scanned image of cDNA microarray.
Figure 18.3 The plot for four Affymetrix chips (a) before and (b) after normalization using quantile normalization.
Figure 18.4 Selection of DE gene using control. Volcano plots for (a) within-group comparison and (b) between-group comparison.
Figure 18.5 Enriched bar charts of selected genes using GO on biological process (a) and cellular components (b) categories.
Figure 18.6 Acyclic graph of cellular component analysis of selected genes.
Chapter 19: Clustering and Classification Techniques for Gene Expression Profile Pattern Analysis
Figure 19.1 Overview of the RNA-Seq workflow.
Figure 19.2 RNA-Seq benefits.
Figure 19.3 Microarray benefits.
Figure 19.4 An example of GELA model for breast cancer.
Chapter 20: Mining Informative Patterns in Microarray Data
Figure 20.1 A conceptual view of a microarray data matrix. There are three parts in the data matrix: (i) numerical gene expression data, (ii) gene annotation, and (iii) sample annotation. The gene and sample annotations are important as the data only have meaning within the context of the underlying biology.
Figure 20.2 Clustering result on iris data, a 150 × 4 data set.
Figure 20.3 Consensus matrix for iris data with = 3 and =4.
Figure 20.4 Two genes that coexpress only under a subset of the samples: (a) the original patterns and (b) under extracted subset of samples.
Figure 20.5 An example of different types of biclusters. A1: constant bicluster, A2: constant rows, A3: constant columns, A4: coherent values (additive model), A5: coherent values (multiplicative model), A6: coherent values (multiplicative model), A7: coherent evolution on the columns, and A8: coherent sign changes on rows and columns.
Figure 20.6 An illustration of the (a) iterative row and column clustering combination method, and (b) divide-and-conquer algorithm.
Figure 20.7 An example of the OPC-tree constructed by Liu and Wang [17]. (a) The sequence set and (b) the OPC-tree.
Figure 20.8 An illustration of triclustering. (a) The 3D gene–sample-time microarray data. (b) An enlargement of expression pattern of on sample . (c) The tricluster that contains {} and {}.
Chapter 21: Arrow Plot and Correspondence Analysis Maps for Visualizing the Effects of Background Correction and Normalization Methods on Microarray Data
Figure 21.1 Example of a ROC curve.
Figure 21.2 Densities of the expression levels of two populations and the respective empirical ROC curves. represents the expression levels in the control group and represents the expression levels in the experimental group. The same classification rule was considered for estimating the ROC curves. The densities were estimated by the Kernel density estimator from two samples of size 100, simulated from two normal distributions. (a) ; (b) .
Figure 21.3 Arrow plot. To select upregulated genes, the conditions AUC and OVL were considered, corresponding to triangles on the plot. To select downregulated genes, the conditions AUC 0.2 and OVL 0.5 were considered, corresponding to squares on the plot. To select special genes, the conditions OVL 0.5 and 0.4 AUC 0.6 were considered, corresponding to stars on the plot.
Figure 21.4 CA symmetric maps of the cancer classification leave-one-out cross-validation error rate estimates obtained from two classifiers: kNN (on the right side) and SVM (on the left side). The quality of representation of the points onto each axis/dimension in the CA map is indicated within brackets.
Figure 21.5 CA maps of the values of (FDR) corresponding to 20% of the top DE genes with the lowest FDR induced by SAM.
Figure 21.6 CA maps of the number of DE genes upregulated (a), number of DE genes downregulated (c), and number of special genes (b) detected by the Arrow plot for the data set.
Chapter 22: Pattern Recognition in Phylogenetics: Trees and Networks
Figure 22.1 A rooted phylogenetic tree (a) and a rooted phylogenetic network (b) for 12 species (labeled A–L), illustrating the various terms used to describe the diagrams. The arrows indicate the direction of evolutionary history, away from the common ancestor at the root.
Figure 22.2 (a) A species phylogeny for six species (labeled A–F) in which there is gene flow between two species plus a hybridization event. (b–f) Possible gene trees that might arise from the species phylogeny. (Not all possible gene trees are shown.) The hybridization event involves species D and F, which hybridize to form species E. This would create gene trees that match either (c) or (d). The gene flow might involve introgression of genes from species C into species B. This would create gene trees that match either (b) or (c/d). (Similarly, the gene flow might involve lateral transfer of genes from species C to species B. This would also create gene trees that match either (b) or (c/d).) There might also be Incomplete Lineage Sorting (ILS), and in example (e) there is genic polymorphism in the immediate ancestor of species A and B. If the different forms are not all passed to each descendant, then the gene tree will not track the species phylogeny through time. There might also be gene Duplication and Loss (D–L), and in example (f) there is gene duplication in the most recent common ancestor of species A, B, and C. These gene copies then track the species phylogeny through the two subsequent divergence events. One of the gene copies is then lost in species A, and the other gene copy is lost in species B and C. Note that the gene trees shown in (b), (e), and (f) would be identical when sampled in the contemporary species, even though the biological mechanisms that formed them are quite different.
Figure 22.3 Recombination during meiosis. Chromosomes are represented as oblongs, and genes as circles. During meiosis, homologous chromosomes form pairs, and parts of the pairs potentially cross over each other, as shown in the top pair of pictures. If the touching parts break apart and rejoin, then the chromosomes will have exchanged some of their gene copies, as shown in the bottom pair of pictures. Modified from Reference [26].
Figure 22.4 Sampled locations with different gene trees, mapped onto the 32 chromosomes of the nuclear genome of the zebra finch. Two different gene-tree patterns are shown, labeled ABBA (204 locations) and BABA (158 locations). Adjacent locations with the same gene tree may be part of the same chromosomal block, but adjacent locations with different trees are in different blocks. The authors attribute the different gene trees mainly to introgression. Reproduced with permission from Reference [33].
Figure 22.5 The relationships between (a) gene trees, (b) a MUL-tree, and (c) a phylogenetic network for four species (labeled A–D). Each gene tree can be derived from the MUL-tree by deleting duplicated labels, so that each label occurs once and once only in the tree. In this example, the two gene trees shown in (a) are not the complete set of trees that could be derived from the MUL-tree in (b). The network can be derived from the MUL-tree by minimizing the number of reticulations in the network. In this example, only one reticulation is needed.
Chapter 23: Diverse Considerations for Successful Phylogenetic Tree Reconstruction: Impacts from Model Misspecification, Recombination, Homoplasy, and Pattern Recognition
Figure 23.1 The number of articles per year with the terms “phylogenetic reconstruction” in the title or the abstract, as measured by PubMed <http://www.ncbi.nlm.nih.gov/pubmed> (search conducted on 18 February 2014).
Figure 23.2 An illustrative example of a gene tree (thin lines) inside a species tree (gray tree in the background) and the effect of the three main evolutionary processes that can generate discordance between them.
Figure 23.3 An illustrative example of an ARG and its corresponding embedded trees.
Chapter 24: Automated Plausibility Analysis of Large Phylogenies
Figure 24.1 Euler traversal of a tree.
Figure 24.2 Node from .
Figure 24.3 Unrooted phylogeny of 24 taxa (leaves). The taxa for which we induce a subtree are marked in gray. The lines are inner nodes that represent the common ancestors and hence the minimum number of nodes needed to maintain the evolutionary relationships among the selected taxa. The numbers denote the order of each node in the preorder traversal of the tree assuming that we root it at node 0.
Figure 24.4 Induced tree for Example 24.13.
Figure 24.5 Distribution of all relative RF distances between the large mega-phylogeny and the reference trees from STBase.
Figure 24.6 Distribution of relative RF distances for the 20,000 largest reference trees (30–2065 taxa).
Figure 24.7 Running time of the effective inducing step (dashed) compared to the overall execution time of the effective algorithm (dotted).
Figure 24.8 Speedup of the effective inducing tree approach. The speedup is calculated by dividing the overall naïve inducing time by the effective inducing time.
Figure 24.9 Total execution time of the naïve algorithm (dashed) compared to the effective approach (dotted).
Figure 24.10 Time needed for the preprocessing step of the effective algorithm.
Chapter 25: A New Fast Method for Detecting and Validating Horizontal Gene Transfer Events Using Phylogenetic Trees and Aggregation Functions
Figure 25.1 Intra- and intergroup phylogenetic relationships following an HGT. A horizontal gene transfer from species of the group to species of the group is shown by an arrow; the dotted line shows the position of species in the tree after the transfer; denotes the rest of the species of the group and denotes the rest of the species of the group . Each species is represented by a unique nucleotide sequence.
Figure 25.2 HGT-QFUNC sensitivity results for functions , , , and when detecting partial HGT in a synthetic data set based on -value ordering—boxplot representation. The abscissa represents the sensitivity percentage and the ordinate the tested function. The median value is shown by a vertical black line within each box. Simulations for 2, 4, 8, 16, 32, and 64 random nonreciprocal sequence transfers between prokaryotic species (first value between parentheses) were carried out. Average simulation results under the medium degree of recombination (when 25% of the resulting sequence belong to one of the parent sequences) are depicted in the left panel. Average simulation results under the highest level of recombination (when 50% of the resulting sequence belong to the source sequence and 50% to the destination sequence) are depicted in the right panel. For each data set, the maximum allowed number of positive values was double the number of transfers (i.e., 4, 8, 16, 32, 64, and 128, respectively). Calculations were done over 50 replicates for each combination of parameters.
Figure 25.3 Remaining HGT-QFUNC sensitivity results for functions , , , and when detecting complete and partial HGT in a synthetic data set based on -value ordering—boxplot representation. The abscissa represents the sensitivity percentage and the ordinate the tested function. The median value is shown by a vertical black line within each box. Simulations for 2, 4, 8, 16, 32, and 64 random nonreciprocal sequence transfers between prokaryotic species (first value between parentheses) were carried out. Average simulation results for data without recombination are depicted in the left panel. The right panel depicts the results of the same simulations, for the cases of 1 and 128 transfers, with recombination levels of 0% (no recombination), 25%, and 50%. Average simulation results under the highest level of recombination (when 50% of the resulting sequence belong to the source sequence and 50% to the destination sequence) are depicted in the right panel. For each data set, the maximum allowed number of positive values was double the number of transfers (i.e., 4, 8, 16, 32, 64, and 128, respectively). Calculations were done over 50 replicates for each combination of parameters.
Figure 25.4 HGT-QFUNC sensitivity results for functions , , , and when detecting complete HGT in a prokaryotic data set based on -value ordering (maximum -value of 0.05)—boxplot representation. The abscissa represents the sensitivity percentage and the ordinate the tested function. The median value is shown by a vertical black line within each box. The HGT-QFUNC algorithm was limited to the following maximum numbers of positive values: (a) 300 HGT (corresponds to 50% bootstrap support in the HGT-Detection algorithm); (b) 200 HGT (corresponds to 75% bootstrap support in the HGT-Detection algorithm); (c) 100 HGT (corresponds to 90% bootstrap support in the HGT-Detection algorithm).
Figure 25.5 Remaining HGT-QFUNC sensitivity results for functions , , , and when detecting complete and partial HGT in a synthetic data set based on -value ordering (maximum -value of 0.05)—boxplot representation. The abscissa represents the sensitivity percentage and the ordinate the tested function. The median value is shown by a vertical black line within each box. Simulations for 2, 4, 8, 16, 32, and 64 random nonreciprocal sequence transfers between prokaryotic species (first value between parentheses) were carried out. Average simulation results for data without recombination are depicted in the left panel. The right panel depicts the results of the same simulations, for the cases of 1 and 128 transfers, with recombination levels of 0% (no recombination), 25%, and 50%. Average simulation results under the highest level of recombination (when 50% of the resulting sequence belong to the source sequence and 50% to the destination sequence) are depicted in the right panel. For each data set, the maximum allowed number of positive values was double the number of transfers (i.e., 4, 8, 16, 32, 64, and 128, respectively). Calculations were done over 50 replicates for each combination of parameters.
Figure 25.6 HGT-QFUNC sensitivity results for functions , , , and when detecting complete HGT in a prokaryotic data set based on -value ordering—boxplot representation. The abscissa represents the sensitivity percentage and the ordinate the tested function. The median value is shown by a vertical black line within each box. The HGT-QFUNC algorithm was limited to the following maximum numbers of positive values: (a) 300 HGT (corresponds to 50% bootstrap support in the HGT-Detection algorithm); (b) 200 HGT (corresponds to 75% bootstrap support in the HGT-Detection algorithm); (c) 100 HGT (corresponds to 90% bootstrap support in the HGT-Detection algorithm).
Figure 25.7 Distribution of the HGT-QFUNC maximum percentages of positive values chosen for prokaryotic data. The abscissa represents the percentage of the maximum possible number of HGTs between individual species. The ordinate represents the corresponding HGT-Detection bootstrap confidence level. Average values correspond to less than 6%, 4%, and 2% of the maximum possible number of HGTs for the 50%, 75%, and 90% bootstrap confidence levels, respectively.
Figure 25.8 HGT-QFUNC sensitivity results for functions , , , and when detecting partial HGT in a synthetic data set based on -value ordering (maximum -value of 0.05)—boxplot representation. The abscissa represents the sensitivity percentage and the ordinate the tested function. The median value is shown by a vertical black line within each box. Simulations for 2, 4, 8, 16, 32, and 64 random nonreciprocal sequence transfers between prokaryotic species (first value between parentheses) were carried out. Average simulation results under the medium degree of recombination (when 25% of the resulting sequence belong to one of the parent sequences) are depicted in the left panel. Average simulation results under the highest level of recombination (when 50% of the resulting sequence belong to the source sequence and 50% to the destination sequence) are depicted in the right panel. For each data set, the maximum allowed number of positive values was double the number of transfers (i.e., 4, 8, 16, 32, 64, and 128, respectively). Calculations were done over 50 replicates for each combination of parameters.
Chapter 26: Computational Methods for Modeling Biological Interaction Networks
Figure 26.1 A PPI network.
Figure 26.2 A metabolic network [90].
Figure 26.3 Human PPI network. Proteins are shown as graph nodes [35].
Chapter 27: Biological Network Inference at Multiple Scales: From Gene Regulation to Species Interactions
Figure 27.2 Receiver Operating Characteristic (ROC). An ROC curve for a perfect predictor, random expectation, and a typical predictor between these two extremes is shown. The Area Under the ROC curve (AUROC) is used as the scoring metric.
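As a reading aid for the AUROC captions in this list, the following minimal sketch computes the area under the ROC curve from predictor scores and binary ground-truth labels. It is an illustration only, not the chapter's implementation: the function name `auroc` and the toy scores are assumptions, and the area is obtained through the standard pairwise (rank-based) formulation, in which a perfect predictor scores 1.0 and random expectation is 0.5.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the fraction of (positive, negative) pairs in which the
    positive example receives the higher score; ties count as 0.5.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: two true interactions (1) and two non-interactions (0).
labels = [1, 1, 0, 0]
perfect = auroc([0.9, 0.8, 0.2, 0.1], labels)    # every positive outranks every negative
typical = auroc([0.9, 0.4, 0.6, 0.1], labels)    # one positive is outranked by a negative
constant = auroc([0.5, 0.5, 0.5, 0.5], labels)   # uninformative scores, all ties
```

The pairwise form is quadratic in the number of examples, which is acceptable for a small illustration; a sort-based computation would be preferable for large networks.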
Figure 27.1 Hypothetical circadian clock networks from the literature and that inferred from the TiMet gene expression data. The panels P2010 (a) and P2013 (b) constitute hypothetical networks from the literature [42, 43]. The TiMet network (c) displays the reconstructed network from the TiMet data, described in Section 27.5.4, using the hierarchical Bayesian regression model from Section 27.5.1. Gene interactions are shown by black lines with arrowheads; protein interactions are shown by dashed lines. The interactions in the reconstructed network were obtained from their estimated posterior probabilities. Those above a selected threshold were included in the interaction network; those below were discarded. The choice of the cutoff threshold is, in principle, arbitrary. For optimal comparability, we selected the cutoff such that the average number of interactions from the published networks was matched (0.6 for molecular interactions).
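The cutoff selection described in this caption can be sketched as follows. This is a hedged illustration under stated assumptions, not the authors' code: the function `select_cutoff`, the edge names, and the posterior values are all hypothetical, and edges whose probability equals the cutoff are kept, which is one possible reading of "above a selected threshold".

```python
def select_cutoff(posteriors, target_edges):
    """Pick a posterior-probability cutoff that keeps about target_edges edges.

    posteriors maps an edge (regulator, target) to its estimated posterior
    probability. Returns (cutoff, kept_edges). With tied probabilities the
    number of kept edges can exceed target_edges.
    """
    ranked = sorted(posteriors.items(), key=lambda kv: kv[1], reverse=True)
    if target_edges <= 0 or not ranked:
        return 1.0, []
    k = min(target_edges, len(ranked))
    cutoff = ranked[k - 1][1]  # probability of the k-th most probable edge
    kept = [edge for edge, p in ranked if p >= cutoff]
    return cutoff, kept


# Hypothetical posterior probabilities for four candidate interactions.
posteriors = {("A", "B"): 0.9, ("B", "C"): 0.7, ("C", "A"): 0.3, ("A", "C"): 0.1}
# Suppose the published networks contain two interactions on average.
cutoff, kept = select_cutoff(posteriors, target_edges=2)
```

Matching the edge count of the published networks fixes the otherwise arbitrary threshold, so the reconstructed and literature networks are compared at equal density.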
Figure 27.3 AUROC scores obtained for different reconstruction methods, and different experimental settings. Boxplots of AUROC scores obtained from LASSO, Elastic Net (both in Section 27.4.2), homogBR (homogeneous Bayesian regression, Section 27.4.3), and nonhomogBR (nonhomogeneous Bayesian regression, Section 27.5.1). The latter utilizes light-induced partitioning of the observations. The subpanels are coarse-mRNA: incomplete data, only with mRNA concentrations and coarse gradient, interp-mRNA: incomplete data with interpolated gradient, coarse-complete: complete data with protein and mRNA concentrations, and interp-complete: complete data with interpolated gradient. The coarse gradients are computed from Equation 27.12 from 4-h intervals, and the interpolated gradients (interp) are derived from a Gaussian process as described in Section 27.5.2.
Figure 27.4 Multiple global change-point example. Partitioning with a horizontal change-point vector and vertical vector . The pseudo-change-points define the left and upper boundaries; and define the lower and right boundaries, where and are the number of locations along the horizontal and vertical directions, respectively. The number of change-points is , and the number of segments .
Figure 27.5 Mondrian process example. (a) An example partitioning with a Mondrian process. (b) The associated tree with labels of the latent variable identifying each nonoverlapping segment with leaf nodes (light gray) designated as , where indexes all tree nodes.
Figure 27.6 Diagram of the niche model. Species are indicated by triangles. A species is placed with a niche value into the interval . A value is uniformly drawn that defines the centre of the range . All species with a value inside this interval, that is, , as indicated by the gray triangles, are consumed (“eaten”) by species . Diagram adapted from Reference [62].
Figure 27.7 Spatial distribution. Shown are the spatial distributions of growth rates entering Equation 27.16 as the spatial parameter (Section 27.6.1) decreases from to . A value of 0 corresponds to uniformly random noise, and is Brownian noise.
Figure 27.8 Comparative evaluation of four network reconstruction methods for the stochastic population dynamics data. Boxplots of AUROC scores obtained on the realistic simulated data described in Section 27.6.5.3 for different settings of the spatial parameter, with lower values causing stronger heterogeneity in the data. Box color scheme: BRAMP (white), BRAM (light gray), homogBR (gray), and LASSO (dark gray).
Figure 27.9 Comparison on synthetic data. Boxplots of AUROC scores obtained with five methods on the synthetic data described in Section 27.5: a Bayesian regression model with Mondrian process change-points (BRAMP, Section 27.6.4), a Bayesian regression model with global change-points (BRAM, Section 27.6.2), a Bayesian linear regression model without change-points (homogBR, Section 27.4.3), L1-penalized sparse regression (LASSO, Section 27.4.2), and the sparse regression Elastic Net method (Section 27.4.2). The boxplots show the distributions of the scores for 30 independent data sets with higher scores indicating better learning performance.
Figure 27.10 Species interaction network. Species interactions as inferred with BRAMP (Section 27.6.3), with an inferred marginal posterior probability of 0.5 (thick lines) and 0.1 (thin lines). Solid lines are positive interactions (e.g., mutualism, commensalism) and dashed are negative interactions (e.g., resource competition). Species are represented by numbers and have been ordered phylogenetically as displayed in Table 27.2.
Chapter 28: Discovering Causal Patterns with Structural Equation Modeling: Application to Toll-Like Receptor Signaling Pathway in Chronic Lymphocytic Leukemia
Figure 28.1 Path analysis: and are the exogenous variables. The bidirectional arrow means that they are correlated. and are endogenous and accompanied by an error term. can also be considered as exogenous, as it originates an effect that goes to . receives a direct effect from and receives a direct effect from . also receives an indirect effect of through .
Figure 28.2 Description of the steps of analysis.
Figure 28.3 Initial model for the TLR1/2 pathway.
Figure 28.4 Final model of the TLR1/2 pathway for M-CLL patients.
Figure 28.5 Final model of the TLR1/2 pathway for U-CLL patients.
Figure 28.6 Initial model for TLR2/6 pathway.
Figure 28.7 Final model of the TLR2/6 pathway for M-CLL patients.
Figure 28.8 Final model of the TLR2/6 pathway for U-CLL patients.
Figure 28.9 Initial model for the TLR7 pathway.
Figure 28.9 Final model of the TLR7 pathway for M-CLL patients.
Figure 28.10 Final model of the TLR7 pathway for U-CLL patients.
Figure 28.11 Initial model for TLR9 pathway.
Figure 28.12 Final model of the TLR9 pathway for M-CLL patients.
Figure 28.13 Final model of the TLR9 pathway for U-CLL patients.
Chapter 29: Annotating Proteins with Incomplete Label Information
Figure 29.1 The two tasks studied in this chapter: “?” denotes the missing functions; “1” means the protein has the corresponding function; “0” in (a), (b), and (c) means the protein does not have the corresponding function. Task1 replenishes the missing functions and Task2 predicts the function of proteins p4 and p5, which are completely unlabeled. Some figures are from Reference [35].
Figure 29.2 The benefit of using both the “guilt by association” rule and function correlations (ProDM_nFC is ProDM with no function correlation, and ProDM_nGBA is ProDM with no “guilt by association” rule).
Chapter 2: Algorithmic Perspectives of the String Barcoding Problems
Table 2.1 List of a subset of approximability results proved in Reference [3]
Chapter 3: Alignment-Free Measures for Whole-Genome Comparison
Table 3.1 Example of counters for the ACS approach
Table 3.2 Benchmark for prokaryotes—Archaea and Bacteria domains
Table 3.3 Benchmark for unicellular eukaryotes—genus Plasmodium
Table 3.4 Comparison of whole-genome phylogeny reconstructions
Table 3.5 Comparison of whole-genome phylogeny of Influenza virus
Table 3.7 Comparison of whole-genome phylogeny of Plasmodium
Table 3.8 Main statistics for the underlying approach averaged over all experiments
Chapter 4: A Maximum Likelihood Framework for Multiple Sequence Local Alignment
Table 4.1 Values of pla, obtained for the different motif finding algorithms, for the CRP data set ()
Chapter 6: A Short Review on Protein Secondary Structure Prediction Methods
Table 6.1 Some state-of-the-art protein secondary structure prediction tools
Table 6.2 Performance comparison of protein secondary structure prediction methods
Table 6.3 Performance comparison of protein secondary structure prediction methods at three prediction difficulty levels
Table 6.4 Misclassification rates of residue secondary structure states based on the benchmark data set with 538 proteins
Table 6.5 Performance comparison of alignment/threading methods in the prediction of protein secondary structure
Table 6.6 Compositions of the predicted and actual secondary structure types in the Y538 data set
Table 6.7 Data sets for protein secondary structure prediction
Chapter 7: A Generic Approach to Biological Sequence Segmentation Problems: Application to Protein Secondary Structure Prediction
Table 7.1 Prediction accuracy of MSVMpred2 and the hybrid model
Chapter 8: Structural Motif Identification and Retrieval: A Geometrical Approach
Table 8.1 The reference table for the SSC algorithm
Table 8.2 CCMS Benchmarking Results
Table 8.3 Number of common super-SSEs found by CCMS in the testing data set, listed by their cardinality
Table 8.4 Number of common super-SSEs found by CCMS removing 2qx8 and 2qx9 from the testing data set, listed by their cardinality
Chapter 9: Genome-Wide Search for Pseudoknotted Noncoding RNA: A Comparative Study
Table 9.1 The six ncRNA families used in our study
Table 9.2 The 13 genomes used in our study
Table 9.3
Table 9.4 Results of ncRNA search with RNATOPS
Table 9.5 Results of ncRNA search with Infernal
Chapter 10: Motif Discovery in Protein 3D-Structures using Graph Mining Techniques
Table 10.1 Characteristics of existing subgraph selection approaches
Chapter 12: Protein Inter-Domain Linker Prediction
Table 12.1 Accuracy of domain boundary placement on the CASP7 benchmark data set
Table 12.2 Comparison of prediction accuracy of EGRN with other predictors
Table 12.3 Prediction performance of publicly available domain linker prediction approaches
Chapter 13: Prediction of Proline Cis–Trans Isomerization
Table 13.1 Model performance comparisons
Chapter 15: Comparison of Protein Quaternary Structures by Graph Approaches
Table 15.1 Summary of related studies in protein graph building by graph-theoretic approaches
Table 15.2 Annotations of eight selected macromolecules in the PDB
Table 15.3 Entries of selected macromolecules and their CATH codes
Table 15.4 Comparison of the proposed method with DALI RMSD
Table 15.5 The annotation and description of the MHC gene family
Table 15.6 Entries of selected macromolecules and their CATH codes
Chapter 16: Structural Domains in Prediction of Biological Protein–Protein Interactions
Table 16.1 Properties employed in different studies for the prediction of obligate and nonobligate complexes
Table 16.2 Data sets included in this review and their number of complexes
Table 16.3 Prediction results of SVM for the MW and ZH data sets
Table 16.4 A summary of the number of CATH DDIs of level 3 present in the ZH and MW data sets
Chapter 18: Extraction of Differentially Expressed Genes in Microarray Data
Table 18.1 Results of DE gene selection methods using sensitivity and accuracy
Chapter 19: Clustering and Classification Techniques for Gene Expression Profile Pattern Analysis
Table 19.1 Microarrays versus RNA-Seq
Table 19.2 Gene expression profile data set
Table 19.3 Table of RNA-Seq counts
Table 19.4 Logic formulas in early stage
Table 19.5 Logic formulas in late stage
Table 19.6 Classification accuracies [%] on Alzheimer data sets
Table 19.7 Classification accuracies [%] on multiple sclerosis and psoriasis data sets
Table 19.8 Classification accuracy [%] on breast cancer data set
Chapter 20: Mining Informative Patterns in Microarray Data
Table 20.1 Corresponding mean-squared residue score and average correlation value of A1–A7
Chapter 21: Arrow Plot and Correspondence Analysis Maps for Visualizing the Effects of Background Correction and Normalization Methods on Microarray Data
Table 21.1 Description of the three cDNA microarray data sets considered
Table 21.2 Values of (FDR) corresponding to 20% of the top DE genes with the lowest FDR induced by SAM for the 36 preprocessing strategies per data set: (at the top), (in the middle), and (at the bottom)
Table 21.3 Number of DE genes as upregulated (at the top), special (in the middle), and downregulated (at the bottom) that were detected by the Arrow plot for the data set
Chapter 22: Pattern Recognition in Phylogenetics: Trees and Networks
Table 22.1 Summary of the processes involved in the evolution of chromosome blocks
Table 22.2 Possible fingerprints for gene flow via different evolutionary processes
Chapter 23: Diverse Considerations for Successful Phylogenetic Tree Reconstruction: Impacts from Model Misspecification, Recombination, Homoplasy, and Pattern Recognition
Table 23.1 An updated list of the most commonly used phylogenetic tree reconstruction programs available to date
Chapter 24: Automated Plausibility Analysis of Large Phylogenies
Table 24.1 Test results for a mega-phylogeny of 55,473 taxa
Table 24.2 Test results for different input tree sizes (150–2554 taxa). The algorithm is executed on 30,000 small trees for each run. Each small tree contains exactly 64 taxa
Table 24.3 Test results for 1 million simulated reference trees (each containing 128 taxa)
Chapter 27: Biological Network Inference at Multiple Scales: From Gene Regulation to Species Interactions
Table 27.1 Improvement of the Bayesian regression model with Mondrian process change-points (BRAMP) on the stochastic population dynamics data
Table 27.2 Indices with full scientific names as appearing in Figure 27.10
Chapter 28: Discovering Causal Patterns with Structural Equation Modeling: Application to Toll-Like Receptor Signaling Pathway in Chronic Lymphocytic Leukemia
Table 28.1 Effects of the TLR1/2 pathway on M-CLL patients
Table 28.3 Goodness-of-fit indices for TLR 1/2 pathways
Table 28.2 Effects of the TLR1/2 pathway on U-CLL patients
Table 28.4 Effects of the TLR2/6 pathway on M-CLL patients
Table 28.6 Goodness-of-fit indices for TLR 2/6 pathways
Table 28.5 Effects of the TLR2/6 pathway on U-CLL patients
Table 28.7 Effects of the TLR7 pathway on M-CLL patients
Table 28.9 Goodness-of-fit indices for TLR7 pathway
Table 28.8 Effects of the TLR7 pathway on U-CLL patients
Table 28.10 Effects of the TLR9 pathway on M-CLL patients
Table 28.12 Goodness-of-fit indices for TLR 9 pathway
Chapter 29: Annotating Proteins with Incomplete Label Information
Table 29.1 Data set statistics (Avg ± Std means average number of functions for each protein and its standard deviation)
Table 29.2 Results of replenishing missing functions on ScPPI
Table 29.4 Results of replenishing missing functions on HumanPPI
Table 29.5 Prediction results on completely unlabeled proteins of ScPPI
Table 29.7 Prediction results on completely unlabeled proteins of HumanPPI
Table 29.8 Runtime analysis (s)
Edited by
Mourad Elloumi
Laboratory of Technologies of Information and Communication and Electrical Engineering (LaTICE), and University of Tunis-El Manar, Tunisia
Costas S. Iliopoulos
King's College London, UK
Jason T. L. Wang
New Jersey Institute of Technology, USA
Albert Y. Zomaya
The University of Sydney, Australia
Copyright © 2016 by John Wiley & Sons, Inc. All rights reserved
Published by John Wiley & Sons, Inc., Hoboken, New Jersey
Published simultaneously in Canada
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.
