Pattern Recognition in Computational Molecular Biology - Mourad Elloumi - E-Book

Description

A comprehensive overview of high-performance pattern recognition techniques and approaches to Computational Molecular Biology

This book surveys the development of techniques and approaches for pattern recognition in Computational Molecular Biology. Providing broad coverage of the field, the authors cover fundamental and technical information on these techniques and approaches and discuss their related problems. The text consists of twenty-nine chapters, organized into seven parts: Pattern Recognition in Sequences, Pattern Recognition in Secondary Structures, Pattern Recognition in Tertiary Structures, Pattern Recognition in Quaternary Structures, Pattern Recognition in Microarrays, Pattern Recognition in Phylogenetic Trees, and Pattern Recognition in Biological Networks.

  • Surveys the development of techniques and approaches on pattern recognition in biomolecular data
  • Discusses pattern recognition in primary, secondary, tertiary and quaternary structures, as well as microarrays, phylogenetic trees and biological networks
  • Includes case studies and examples to further illustrate the concepts discussed in the book
Pattern Recognition in Computational Molecular Biology: Techniques and Approaches is a reference for practitioners and professional researchers in Computer Science, Life Science, and Mathematics. This book also serves as supplementary reading for graduate students and young researchers interested in Computational Molecular Biology.


Page count: 1185

Year of publication: 2015




Table of Contents

Wiley Series

Cover

Title Page

Copyright

List of Contributors

Preface

Part 1: Pattern Recognition in Sequences

Chapter 1: Combinatorial Haplotyping Problems

1.1 Introduction

1.2 Single Individual Haplotyping

1.3 Population Haplotyping

References

Chapter 2: Algorithmic Perspectives of the String Barcoding Problems

2.1 Introduction

2.2 Summary of Algorithmic Complexity Results for Barcoding Problems

2.3 Entropy-Based Information Content Technique for Designing Approximation Algorithms for String Barcoding Problems

2.4 Techniques for Proving Inapproximability Results for String Barcoding Problems

2.5 Heuristic Algorithms for String Barcoding Problems

2.6 Conclusion

Acknowledgments

References

Chapter 3: Alignment-Free Measures for Whole-Genome Comparison

3.1 Introduction

3.2 Whole-Genome Sequence Analysis

3.3 Underlying Approach

3.4 Experimental Results

3.5 Conclusion

Author's Contributions

3.6 Acknowledgments

References

Chapter 4: A Maximum Likelihood Framework for Multiple Sequence Local Alignment

4.1 Introduction

4.2 Multiple Sequence Local Alignment

4.3 Motif Finding Algorithms

4.4 Time Complexity

4.5 Case Studies

4.6 Conclusion

References

Chapter 5: Global Sequence Alignment with a Bounded Number of Gaps

5.1 Introduction

5.2 Definitions and Notation

5.3 Problem Definition

5.4 Algorithms

5.5 Conclusion

References

Part 2: Pattern Recognition in Secondary Structures

Chapter 6: A Short Review on Protein Secondary Structure Prediction Methods

6.1 Introduction

6.2 Representative Protein Secondary Structure Prediction Methods

6.3 Evaluation of Protein Secondary Structure Prediction Methods

6.4 Conclusion

Acknowledgments

References

Chapter 7: A Generic Approach to Biological Sequence Segmentation Problems: Application to Protein Secondary Structure Prediction

7.1 Introduction

7.2 Biological Sequence Segmentation

7.3 MSVMpred

7.4 Postprocessing with a Generative Model

7.5 Dedication to Protein Secondary Structure Prediction

7.6 Conclusions and Ongoing Research

Acknowledgments

References

Chapter 8: Structural Motif Identification and Retrieval: A Geometrical Approach

8.1 Introduction

8.2 A Few Basic Concepts

8.3 State of the Art

8.4 A Novel Geometrical Approach to Motif Retrieval

8.5 Implementation Notes

8.6 Conclusions and Future Work

Acknowledgment

References

Chapter 9: Genome-Wide Search for Pseudoknotted Noncoding RNA: A Comparative Study

9.1 Introduction

9.2 Background

9.3 Methodology

9.4 Results and Interpretation

9.5 Conclusion

References

Part 3: Pattern Recognition in Tertiary Structures

Chapter 10: Motif Discovery in Protein 3D-Structures using Graph Mining Techniques

10.1 Introduction

10.2 From Protein 3D-Structures to Protein Graphs

10.3 Graph Mining

10.4 Subgraph Mining

10.5 Frequent Subgraph Discovery

10.6 Feature Selection

10.7 Feature Selection for Subgraphs

10.8 Discussion

10.9 Conclusion

Acknowledgments

References

Chapter 11: Fuzzy and Uncertain Learning Techniques for the Analysis and Prediction of Protein Tertiary Structures

11.1 Introduction

11.2 Genetic Algorithms

11.3 Supervised Machine Learning Algorithm

11.4 Fuzzy Application

11.5 Conclusion

References

Chapter 12: Protein Inter-Domain Linker Prediction

12.1 Introduction

12.2 Protein Structure Overview

12.3 Technical Challenges and Open Issues

12.4 Prediction Assessment

12.5 Current Approaches

12.6 Domain Boundary Prediction Using Enhanced General Regression Network

12.7 Inter-Domain Linkers Prediction Using Compositional Index and Simulated Annealing

12.8 Conclusion

References

Chapter 13: Prediction of Proline Cis–Trans Isomerization

13.1 Introduction

13.2 Methods

13.3 Model Evaluation and Analysis

13.4 Conclusion

References

Part 4: Pattern Recognition in Quaternary Structures

Chapter 14: Prediction of Protein Quaternary Structures

14.1 Introduction

14.2 Protein Structure Prediction

14.3 Template-Based Predictions

14.4 Critical Assessment of Protein Structure Prediction

14.5 Quaternary Structure Prediction

14.6 Conclusion

Acknowledgments

References

Chapter 15: Comparison of Protein Quaternary Structures by Graph Approaches

15.1 Introduction

15.2 Similarity in the Graph Model

15.3 Measuring Structural Similarity via MCES

15.4 Protein Comparison via Graph Spectra

15.5 Conclusion

References

Chapter 16: Structural Domains in Prediction of Biological Protein–Protein Interactions

16.1 Introduction

16.2 Structural Domains

16.3 The Prediction Framework

16.4 Feature Extraction and Prediction Properties

16.5 Feature Selection

16.6 Classification

16.7 Evaluation and Analysis

16.8 Results and Discussion

16.9 Conclusion

References

Part 5: Pattern Recognition in Microarrays

Chapter 17: Content-Based Retrieval of Microarray Experiments

17.1 Introduction

17.2 Information Retrieval: Terminology and Background

17.3 Content-Based Retrieval

17.4 Microarray Data and Databases

17.5 Methods for Retrieving Microarray Experiments

17.6 Similarity Metrics

17.7 Evaluating Retrieval Performance

17.8 Software Tools

17.9 Conclusion and Future Directions

Acknowledgment

References

Chapter 18: Extraction of Differentially Expressed Genes in Microarray Data

18.1 Introduction

18.2 From Microarray Image to Signal

18.3 Microarray Signal Analysis

18.4 Algorithms for DE Gene Selection

18.5 Gene Ontology Enrichment and Gene Set Enrichment Analysis

18.6 Conclusion

References

Chapter 19: Clustering and Classification Techniques for Gene Expression Profile Pattern Analysis

19.1 Introduction

19.2 Transcriptome Analysis

19.3 Microarrays

19.4 RNA-Seq

19.5 Benefits and Drawbacks of RNA-Seq and Microarray Technologies

19.6 Gene Expression Profile Analysis

19.7 Real Case Studies

19.8 Conclusions

References

Chapter 20: Mining Informative Patterns in Microarray Data

20.1 Introduction

20.2 Patterns with Similarity

20.3 Conclusion

References

Chapter 21: Arrow Plot and Correspondence Analysis Maps for Visualizing the Effects of Background Correction and Normalization Methods on Microarray Data

21.1 Overview

21.2 Arrow Plot

21.3 Significance Analysis of Microarrays

21.4 Correspondence Analysis

21.5 Impact of the Preprocessing Methods

21.6 Conclusions

Acknowledgments

References

Part 6: Pattern Recognition in Phylogenetic Trees

Chapter 22: Pattern Recognition in Phylogenetics: Trees and Networks

22.1 Introduction

22.2 Networks and Trees

22.3 Patterns and Their Processes

22.4 The Types of Patterns

22.5 Fingerprints

22.6 Constructing Networks

22.7 Multi-Labeled Trees

22.8 Conclusion

References

Chapter 23: Diverse Considerations for Successful Phylogenetic Tree Reconstruction: Impacts from Model Misspecification, Recombination, Homoplasy, and Pattern Recognition

23.1 Introduction

23.2 Overview on Methods and Frameworks for Phylogenetic Tree Reconstruction

23.3 Influence of Substitution Model Misspecification on Phylogenetic Tree Reconstruction

23.4 Influence of Recombination on Phylogenetic Tree Reconstruction

23.5 Influence of Diverse Evolutionary Processes on Species Tree Reconstruction

23.6 Influence of Homoplasy on Phylogenetic Tree Reconstruction: The Goals of Pattern Recognition

23.7 Concluding Remarks

Acknowledgments

References

Chapter 24: Automated Plausibility Analysis of Large Phylogenies

24.1 Introduction

24.2 Preliminaries

24.3 A Naïve Approach

24.4 Toward a Faster Method

24.5 Improved Algorithm

24.6 Implementation

24.7 Evaluation

24.8 Conclusion

Acknowledgment

References

Chapter 25: A New Fast Method for Detecting and Validating Horizontal Gene Transfer Events Using Phylogenetic Trees and Aggregation Functions

25.1 Introduction

25.2 Methods

25.3 Experimental Study

25.4 Results and Discussion

25.5 Conclusion

References

Part 7: Pattern Recognition in Biological Networks

Chapter 26: Computational Methods for Modeling Biological Interaction Networks

26.1 Introduction

26.2 Measures/Metrics

26.3 Models of Biological Networks

26.4 Reconstructing and Partitioning Biological Networks

26.5 PPI Networks

26.6 Mining PPI Networks—Interaction Prediction

26.7 Conclusions

References

Chapter 27: Biological Network Inference at Multiple Scales: From Gene Regulation to Species Interactions

27.1 Introduction

27.2 Molecular Systems

27.3 Ecological Systems

27.4 Models and Evaluation

27.5 Learning Gene Regulation Networks

27.6 Learning Species Interaction Networks

27.7 Conclusion

References

Chapter 28: Discovering Causal Patterns with Structural Equation Modeling: Application to Toll-Like Receptor Signaling Pathway in Chronic Lymphocytic Leukemia

28.1 Introduction

28.2 Toll-Like Receptors

28.3 Structural Equation Modeling

28.4 Application

28.5 Conclusion

References

Chapter 29: Annotating Proteins with Incomplete Label Information

29.1 Introduction

29.2 Related Work

29.3 Problem Formulation

29.4 Experimental Setup

29.5 Experimental Analysis

29.6 Conclusions

Acknowledgments

References

Index

Wiley Series on Bioinformatics: Computational Techniques and Engineering

End User License Agreement

List of Illustrations

Chapter 1: Combinatorial Haplotyping Problems

Figure 1.1 A chromosome in three individuals. There are four SNPs.

Figure 1.2 Sequence reads and assembly of the two chromosome copies.

Figure 1.3 SNP matrix and conflict graphs.

Figure 1.4 Haplotypes and corresponding genotypes.

Chapter 2: Algorithmic Perspectives of the String Barcoding Problems

Figure 2.1 An example of a valid barcode.

Figure 2.2 Inclusion relationships among the various barcoding problems.

Figure 2.3 A greedy algorithm for solving MSC [13].

Figure 2.4 Greedy selection of strings in our solution generates a tree partition of the set of input sequences such that each leaf node has exactly one sequence.

Chapter 3: Alignment-Free Measures for Whole-Genome Comparison

Figure 3.1 Whole-genome phylogeny of the 2009 world pandemic Influenza A (H1N1) generated by the UA. The clades are highlighted with different shades of gray. The node Mexico/4108 is probably the closest isolate to the origin of the influenza. The organisms that do not fall into one of the two main clades according to the literature are shown in bold.

Figure 3.2 Whole-genome phylogeny of the genus Plasmodium by UA, with our whole-genome distance highlighted on the branches.

Figure 3.3 Whole-genome phylogeny of prokaryotes by UA. The clades are highlighted with different shades of gray. Only two organisms do not fall into the correct clade: Methanosarcina acetivorans (Archaea) and Desulfovibrio vulgaris subsp. vulgaris (Bacteria).

Chapter 4: A Maximum Likelihood Framework for Multiple Sequence Local Alignment

Figure 4.1 Sequence local alignment and motif finding. (a) Multiple sequence local alignment of protein–DNA interaction binding sites. The short motif sequence shown in a highlighted box represents the site recognized and bound with a regulatory protein. (b) Motif sequences (sites) aligned to discover a pattern. (c) Nucleotide counting matrix is derived from the aligned sequences in (b) and counting column-wise. (d) Sequence logo representation of the motif alignment, which visually displays the residue conservation across the alignment positions. Sequence logo is plotted using BiLogo server (http://bipad.cmh.edu/bilogo.html). Notice, sequence logo plots the motif conservation across positions. Each column consists of stacks of nucleotide letters (A, C, G, T). The height of a letter represents its occurrence percentage on that position. The total information content on each column is adjusted to 2.0 bits as a maximum.

Figure 4.2 Results for sequences of CRP binding sites. (a) Motif sequence logo of CRP binding sites (22 bp). Sequence logos are plotted using BiLogo server (http://bipad.cmh.edu/bilogo.html). (b) Comparison of the curves along 100 iterations at maximum.

Figure 4.3 Alignment of the HTH_ICLR domains using 39 most diverse sequences from NCBI CDD database [20]. The 2G7U chain A structure data (mmdb: 38,409, Zhang et al. [20], unpublished) is used to plot the 3D structure using Cn3D software downloaded from NCBI [20]. (a) Alignment of the HTH_ICLR domains. The core motif is aligned in the center, the left subunit (motif-1) is aligned with three residues shift to the left compared to the same alignment reported in the CDD [20]. The right subunit (motif-2) is aligned with two residues shift to the left. (b) Sequence logos of three motifs: motif-1, core-motif, and motif-2 displayed from left to right. The sequence logos are plotted using WebLogo. (c) The organization of a typical protein in the IclR family (chain A) from NCBI CDD. (d) The 3D structure of the 2G7U chain A. The highlighted structure part is the HTH_ICLR domains.

Chapter 5: Global Sequence Alignment with a Bounded Number of Gaps

Figure 5.1 Distribution of gap lengths in Homo sapiens exome sequencing.

Figure 5.2 Matrices […] and […] for […] and […].

Figure 5.3 Matrices […] and […] for […] and […].

Figure 5.4 Solution for […], […], […], and […].

Figure 5.5 Matrices […] and […] for […], […], and […].

Chapter 6: A Short Review on Protein Secondary Structure Prediction Methods

Figure 6.1 The development history of protein secondary structure prediction methods.

Figure 6.2 The architecture of an NN for protein secondary structure prediction.

Figure 6.3 A general flowchart for the development of protein secondary structure prediction methods.

Chapter 7: A Generic Approach to Biological Sequence Segmentation Problems: Application to Protein Secondary Structure Prediction

Figure 7.1 Topology of MSVMpred. The cascade of classifiers computes estimates of the class posterior probabilities for the position in the sequence currently at the center of the analysis window .

Figure 7.2 Topology of the hybrid prediction method. Thanks to the postprocessing with the generative model, the context effectively available to perform the prediction exceeds that resulting from the combination of the two analysis windows of MSVMpred.

Figure 7.3 Correlation between the recognition rate () of MSVMpred2 and the one of the hybrid model. The dashed line is obtained by linear regression.

Chapter 8: Structural Motif Identification and Retrieval: A Geometrical Approach

Figure 8.1 Simplified diagram of the multilevel protein hierarchy. Source: Modified from a public-domain image created by Mariana Ruiz Villarreal (http://commons.wikimedia.org/wiki/File:Main_protein_structure_levels_en.svg).

Figure 8.2 Approximation of the secondary structure elements and motifs.

Figure 8.3 Greek key in theory and in practice: (a) idealized structure and (b) noncoplanar Greek key in staphylococcal nuclease [5].

Figure 8.4 ProSMoS: the meta-matrix of the 3br8 protein. Lines 2–8 list the SSEs inside the protein with their geometrical and biological properties, and lines 10–16 describe the relation matrix among the SSEs of 3br8.

Figure 8.5 (a) The geometric model of an SSM. Vertices and edges are labeled. (b) Two SSEs with the same SSM graph representation but with different connections.

Figure 8.6 A super-SSE of cardinality is present in a protein only if its subsets of cardinality are present as well.

Chapter 9: Genome-Wide Search for Pseudoknotted Noncoding RNA: A Comparative Study

Figure 9.1 A pseudoknotted secondary structure (top), its arc-based representation (middle), and dot–bracket notation (bottom).

Figure 9.2 A hypothetical Stockholm alignment with a pseudoknot.

Figure 9.3 Methodology used in the analysis of pseudoknotted ncRNA search.

Chapter 10: Motif Discovery in Protein 3D-Structures using Graph Mining Techniques

Figure 10.1 Triangulation example in a 2D-space. (a) Triangulation meets the Delaunay condition. (b) Triangulation does not meet the Delaunay condition.

Figure 10.2 Example of an all- protein 3D-structure (lipid-binding protein, PDB identifier: 2JW2) transformed into graphs of amino acids using the All Atoms technique (81 nodes and 562 edges), then using the Main Atom technique with as the main atom (81 nodes and 276 edges). Both techniques are used with a distance threshold of 7 Å.

Figure 10.4 Example of an – protein 3D-structure (for cell adhesion, PDB identifier: 3BQN) transformed into graphs of amino acids using the All Atoms technique (364 nodes and 3306 edges), then using the Main Atom technique with as the main atom (364 nodes and 1423 edges). Both techniques are used with a distance threshold of 7 Å.

Figure 10.5 Example of an unlabeled graph (a) and a labeled graph (b).

Figure 10.6 Example of a subgraph (g) that is frequent in the graphs (a), (b), and (c) with a support of .

Figure 10.7 Example of a candidates search tree in a BFS manner (a) and a DFS manner (b). Each node in the search tree represents a candidate subgraph.

Chapter 11: Fuzzy and Uncertain Learning Techniques for the Analysis and Prediction of Protein Tertiary Structures

Figure 11.1 The chemical structure of a protein, Protein Data Bank (PDB) code 1AIX [6].

Figure 11.9 Basic structure of an SVM.

Figure 11.5 An example of the possible moves in the HH model (a), as well as a sequence with 48 monomers [HHHHPHHPHHHHHPPHPPHHPPHPPPPPPHPPHPPPHPPHHPPHHHPH] (b). Source: Custódio et al. [16]. Reproduced with permission of Elsevier.

Figure 11.6 A molecule used to depict the φ (labeled phi), ψ (labeled psi), and ω (labeled omega) dihedral angles.

Figure 11.7 Illustration of the final product of an all-atom model. This is the actual backbone of protein 1AIX [6].

Figure 11.8 Basic structure of an ANN.

Figure 11.10 Structure of an ANFIS.

Chapter 12: Protein Inter-Domain Linker Prediction

Figure 12.1 N-score and C-score are the number of AA residues that do not match when comparing the predicted linker segment with the known linker segment. A lower score means a more accurate prediction. For exact match, the N-score and C-score should be equal to zero.

Figure 12.2 Basic EGRN architecture.

Figure 12.3 Method overview. The method consists of two main steps: calculating the compositional index and then refining the prediction by detecting the optimal set of threshold values that distinguish inter-domain linkers from nonlinkers using simulated annealing.

Figure 12.4 Linker preference profile. Linker regions (1, 2, …, 5) less than a threshold value of 0 are identified from the protein sequence.

Figure 12.5 Optimal threshold values for XYNA_THENE protein sequence in DomCut data set.

Figure 12.6 Optimal threshold values for 6pax_A protein sequence in DS-All data set.

Chapter 13: Prediction of Proline Cis–Trans Isomerization

Figure 13.1 A general architecture of Method I ensemble. The collection of DTs , where the are independently, identically distributed random DTs, and each DT cast “a unit vote” for the final classification of input .

Figure 13.2 A flowchart of Method I ensemble. The left-hand side shows the main flow of the Method I ensemble, while the right-hand side expands the step of the main flowchart that builds the next split.

Chapter 14: Prediction of Protein Quaternary Structures

Figure 14.1 The different protein structures: (a) primary structure (polypeptide chain), (b) secondary structure (α-helix), (c) tertiary structure (example: myoglobin), and (d) quaternary structure (example: hemoglobin).

Figure 14.2 The schematic representation of different polypeptide chains that form various oligomers. Source: Chou and Cai (2003) [22]. Reproduced with permission of John Wiley and Sons, Inc.

Figure 14.3 A schematic drawing to illustrate protein oligomers with: (a) C5 symmetry, (b) D4 symmetry, (c) tetrahedral symmetry, (d) cubic symmetry, and (e) icosahedral symmetry. Source: Chou and Cai (2003) [22]. Reproduced with permission of John Wiley and Sons, Inc.

Figure 14.4 Comparison of determined number of protein sequences and protein structures based on statistical data from PDB and UniProtKB.

Figure 14.5 Definition procedure for the huge number of possible sequence order patterns as a set of discrete numbers. Panels (a–c) represent the correlation mode between all the most contiguous residues, all the second-most contiguous residues, and all the third-most contiguous residues, respectively. Source: Chou and Cai (2003) [22]. Reproduced with permission of John Wiley and Sons, Inc.

Chapter 15: Comparison of Protein Quaternary Structures by Graph Approaches

Figure 15.1 An illustration of protein graph remodeling.

Figure 15.2 Protein graph transformation.

Figure 15.3 (a) a P-graph with and . (b) The P-table of . (c) The line graph of .

Figure 15.4 The process of constructing modular graph from and .

Figure 15.5 Protein structures with display style in schematic view.

Figure 15.6 An illustration of cospectral graphs.

Figure 15.7 An example of isomorphic graphs.

Figure 15.8 Clustering constructed by different distance metrics. (a) For RMSD, (b) for the Seidel spectrum, and (c) for the CATH code.

Chapter 16: Structural Domains in Prediction of Biological Protein–Protein Interactions

Figure 16.1 Four levels of the Class, Architecture, Topology, and Homologous superfamily (CATH) hierarchy.

Figure 16.2 Domain-based framework used to predict PPI types.

Figure 16.3 ROC curves and AUC values for all subsets of features of (a) MW and (b) ZH data sets.

Figure 16.4 Schematic view of level 3 of CATH DDIs present in the MW and ZH data sets.

Chapter 17: Content-Based Retrieval of Microarray Experiments

Figure 17.1 Query-by-example framework for content-based information retrieval.

Figure 17.2 Format of a gene expression matrix: the expression of the ith gene in the jth condition is shown by e_ij.

Figure 17.3 Microarray experiment retrieval framework based on differential expression fingerprints: (a) fingerprint extraction, (b) database search and ranked retrieval.

Chapter 18: Extraction of Differentially Expressed Genes in Microarray Data

Figure 18.1 Probe set in Affymetrix technology.

Figure 18.2 Scanned image of cDNA microarray.

Figure 18.3 The plot for four Affymetrix chips (a) before and (b) after normalization using quantile normalization.

Figure 18.4 Selection of DE gene using control. Volcano plots for (a) within-group comparison and (b) between-group comparison.

Figure 18.5 Enriched bar charts of selected genes using GO on biological process (a) and cellular components (b) categories.

Figure 18.6 Acyclic graph of cellular component analysis of selected genes.

Chapter 19: Clustering and Classification Techniques for Gene Expression Profile Pattern Analysis

Figure 19.1 Overview of the RNA-Seq workflow.

Figure 19.2 RNA-Seq benefits.

Figure 19.3 Microarray benefits.

Figure 19.4 An example of GELA model for breast cancer.

Chapter 20: Mining Informative Patterns in Microarray Data

Figure 20.1 A conceptual view of a microarray data matrix. There are three parts in the data matrix: (i) numerical gene expression data, (ii) gene annotation, and (iii) sample annotation. The gene and sample annotations are important as the data only have meaning within the context of the underlying biology.

Figure 20.2 Clustering result on iris data, a 150 × 4 data set.

Figure 20.3 Consensus matrix for iris data with = 3 and =4.

Figure 20.4 Two genes that coexpress only under a subset of the samples: (a) the original patterns and (b) under extracted subset of samples.

Figure 20.5 An example of different types of biclusters. A1: constant bicluster, A2: constant rows, A3: constant columns, A4: coherent values (additive model), A5: coherent values (multiplicative model), A6: coherent values (multiplicative model), A7: coherent evolution on the columns, and A8: coherent sign changes on rows and columns.

Figure 20.6 An illustration of the (a) iterative row and column clustering combination method, and (b) divide-and-conquer algorithm.

Figure 20.7 An example of the OPC-tree constructed by Liu and Wang [17]. (a) The sequence set and (b) the OPC-tree.

Figure 20.8 An illustration of triclustering. (a) The 3D gene–sample-time microarray data. (b) An enlargement of expression pattern of on sample . (c) The tricluster that contains {} and {}.

Chapter 21: Arrow Plot and Correspondence Analysis Maps for Visualizing the Effects of Background Correction and Normalization Methods on Microarray Data

Figure 21.1 Example of a ROC curve.

Figure 21.2 Densities of the expression levels of two populations and the respective empirical ROC curves. represents the expression levels in the control group and represents the expression levels in the experimental group. The same classification rule was considered for estimating the ROC curves. The densities were estimated by the Kernel density estimator from two samples of size 100, simulated from two normal distributions. (a) ; (b) .

Figure 21.3 Arrow plot. To select upregulated genes, the conditions AUC and OVL were considered, corresponding to triangles on the plot. To select downregulated genes, the conditions AUC 0.2 and OVL 0.5 were considered, corresponding to squares on the plot. To select special genes, the conditions OVL 0.5 and 0.4 AUC 0.6 were considered, corresponding to stars on the plot.

Figure 21.4 CA symmetric maps of the cancer classification leave-one-out cross-validation error rate estimates obtained from two classifiers: kNN (on the right side) and SVM (on the left side). The quality of representation of the points onto each axis/dimension in the CA map is indicated within brackets.

Figure 21.5 CA maps of the values of (FDR) corresponding to 20% of the top DE genes with the lowest FDR induced by SAM.

Figure 21.6 CA maps of the number of DE genes upregulated (a), number of DE genes downregulated (c), and number of special genes (b) detected by the Arrow plot for the data set.

Chapter 22: Pattern Recognition in Phylogenetics: Trees and Networks

Figure 22.1 A rooted phylogenetic tree (a) and a rooted phylogenetic network (b) for 12 species (labeled A–L), illustrating the various terms used to describe the diagrams. The arrows indicate the direction of evolutionary history, away from the common ancestor at the root.

Figure 22.2 (a) A species phylogeny for six species (labeled A–F) in which there is gene flow between two species plus a hybridization event. (b–f) Possible gene trees that might arise from the species phylogeny. (Not all possible gene trees are shown.) The hybridization event involves species D and F, which hybridize to form species E. This would create gene trees that match either (c) or (d). The gene flow might involve introgression of genes from species C into species B. This would create gene trees that match either (b) or (c/d). (Similarly, the gene flow might involve lateral transfer of genes from species C to species B. This would also create gene trees that match either (b) or (c/d).) There might also be Incomplete Lineage Sorting (ILS), and in example (e) there is genic polymorphism in the immediate ancestor of species A and B. If the different forms are not all passed to each descendant, then the gene tree will not track the species phylogeny through time. There might also be gene Duplication and Loss (D–L), and in example (f) there is gene duplication in the most recent common ancestor of species A, B, and C. These gene copies then track the species phylogeny through the two subsequent divergence events. One of the gene copies is then lost in species A, and the other gene copy is lost in species B and C. Note that the gene trees shown in (b), (e), and (f) would be identical when sampled in the contemporary species, even though the biological mechanisms that formed them are quite different.

Figure 22.3 Recombination during meiosis. Chromosomes are represented as oblongs, and genes as circles. During meiosis, homologous chromosomes form pairs, and parts of the pairs potentially cross over each other, as shown in the top pair of pictures. If the touching parts break apart and rejoin, then the chromosomes will have exchanged some of their gene copies, as shown in the bottom pair of pictures. Modified from Reference [26].

Figure 22.4 Sampled locations with different gene trees, mapped onto the 32 chromosomes of the nuclear genome of the zebra finch. Two different gene-tree patterns are shown, labeled ABBA (204 locations) and BABA (158 locations). Adjacent locations with the same gene tree may be part of the same chromosomal block, but adjacent locations with different trees are in different blocks. The authors attribute the different gene trees mainly to introgression. Reproduced with permission from Reference [33].

Figure 22.5 The relationships between (a) gene trees, (b) a MUL-tree, and (c) a phylogenetic network for four species (labeled A–D). Each gene tree can be derived from the MUL-tree by deleting duplicated labels, so that each label occurs once and once only in the tree. In this example, the two gene trees shown in (a) are not the complete set of trees that could be derived from the MUL-tree in (b). The network can be derived from the MUL-tree by minimizing the number of reticulations in the network. In this example, only one reticulation is needed.

Chapter 23: Diverse Considerations for Successful Phylogenetic Tree Reconstruction: Impacts from Model Misspecification, Recombination, Homoplasy, and Pattern Recognition

Figure 23.1 The number of articles per year with terms “phylogenetic reconstruction” in the title or the abstract, as measured by PubMed <http://www.ncbi.nlm.nih.gov/pubmed> (search conducted on 18 February 2014).

Figure 23.2 An illustrative example of a gene tree (thin lines) inside a species tree (gray tree in the background) and the effect of the three main evolutionary processes that can generate discordance between them.

Figure 23.3 An illustrative example of an ARG and its corresponding embedded trees.

Chapter 24: Automated Plausibility Analysis of Large Phylogenies

Figure 24.1 Euler traversal of a tree.

Figure 24.2 Node from .

Figure 24.3 Unrooted phylogeny of 24 taxa (leaves). The taxa for which we induce a subtree are marked in gray. The lines are inner nodes that represent the common ancestors and hence the minimum number of nodes needed to maintain the evolutionary relationships among the selected taxa. The numbers denote the order of each node in the preorder traversal of the tree assuming that we root it at node 0.

Figure 24.4 Induced tree for Example 24.13.

Figure 24.5 Distribution of all relative RF distances between the large mega-phylogeny and the reference trees from STBase.

Figure 24.6 Distribution of relative RF distances for the 20,000 largest reference trees (30–2065 taxa).

Figure 24.7 Running time of the effective inducing step (dashed) compared to the overall execution time of the effective algorithm (dotted).

Figure 24.8 Speedup of the effective inducing tree approach. The speedup is calculated by dividing the overall naïve inducing time by the effective inducing time.

Figure 24.9 Total execution time of the naïve algorithm (dashed) compared to the effective approach (dotted).

Figure 24.10 Time needed for the preprocessing step of the effective algorithm.

Chapter 25: A New Fast Method for Detecting and Validating Horizontal Gene Transfer Events Using Phylogenetic Trees and Aggregation Functions

Figure 25.1 Intra- and intergroup phylogenetic relationships following an HGT. A horizontal gene transfer from species of the group to species of the group is shown by an arrow; the dotted line shows the position of species in the tree after the transfer; denotes the rest of the species of the group and denotes the rest of the species of the group . Each species is represented by a unique nucleotide sequence.

Figure 25.2 HGT-QFUNC sensitivity results for functions , , , and when detecting partial HGT in a synthetic data set based on -value ordering—boxplot representation. The abscissa represents the sensitivity percentage and the ordinate the tested function. The median value is shown by a vertical black line within each box. Simulations for 2, 4, 8, 16, 32, and 64 random nonreciprocal sequence transfers between prokaryotic species (first value between parentheses) were carried out. Average simulation results under the medium degree of recombination (when 25% of the resulting sequence belongs to one of the parent sequences) are depicted in the left panel. Average simulation results under the highest level of recombination (when 50% of the resulting sequence belongs to the source sequence and 50% to the destination sequence) are depicted in the right panel. For each data set, the maximum allowed number of positive values was double the number of transfers (i.e., 4, 8, 16, 32, 64, and 128, respectively). Calculations were done over 50 replicates for each combination of parameters.

Figure 25.3 Remaining HGT-QFUNC sensitivity results for functions , , , and when detecting complete and partial HGT in a synthetic data set based on -value ordering—boxplot representation. The abscissa represents the sensitivity percentage and the ordinate the tested function. The median value is shown by a vertical black line within each box. Simulations for 2, 4, 8, 16, 32, and 64 random nonreciprocal sequence transfers between prokaryotic species (first value between parentheses) were carried out. Average simulation results for data without recombination are depicted in the left panel. The right panel depicts the results of the same simulations, for the cases of 1 and 128 transfers, with recombination levels of 0% (no recombination), 25%, and 50%. Average simulation results under the highest level of recombination (when 50% of the resulting sequence belongs to the source sequence and 50% to the destination sequence) are depicted in the right panel. For each data set, the maximum allowed number of positive values was double the number of transfers (i.e., 4, 8, 16, 32, 64, and 128, respectively). Calculations were done over 50 replicates for each combination of parameters.

Figure 25.4 HGT-QFUNC sensitivity results for functions , , , and when detecting complete HGT in a prokaryotic data set based on -value ordering (maximum -value of 0.05)—boxplot representation. The abscissa represents the sensitivity percentage and the ordinate the tested function. The median value is shown by a vertical black line within each box. The HGT-QFUNC algorithm was limited to the following maximum numbers of positive values: (a) 300 HGT (corresponds to 50% bootstrap support in the HGT-Detection algorithm); (b) 200 HGT (corresponds to 75% bootstrap support in the HGT-Detection algorithm); (c) 100 HGT (corresponds to 90% bootstrap support in the HGT-Detection algorithm).

Figure 25.6 Remaining HGT-QFUNC sensitivity results for functions , , , and when detecting complete and partial HGT in a synthetic data set based on -value ordering (maximum -value of 0.05)—boxplot representation. The abscissa represents the sensitivity percentage and the ordinate the tested function. The median value is shown by a vertical black line within each box. Simulations for 2, 4, 8, 16, 32, and 64 random nonreciprocal sequence transfers between prokaryotic species (first value between parentheses) were carried out. Average simulation results for data without recombination are depicted in the left panel. The right panel depicts the results of the same simulations, for the cases of 1 and 128 transfers, with recombination levels of 0% (no recombination), 25%, and 50%. Average simulation results under the highest level of recombination (when 50% of the resulting sequence belongs to the source sequence and 50% to the destination sequence) are depicted in the right panel. For each data set, the maximum allowed number of positive values was double the number of transfers (i.e., 4, 8, 16, 32, 64, and 128, respectively). Calculations were done over 50 replicates for each combination of parameters.

Figure 25.6 HGT-QFUNC sensitivity results for functions , , , and when detecting complete HGT in a prokaryotic data set based on -value ordering—boxplot representation. The abscissa represents the sensitivity percentage and the ordinate the tested function. The median value is shown by a vertical black line within each box. The HGT-QFUNC algorithm was limited to the following maximum numbers of positive values: (a) 300 HGT (corresponds to 50% bootstrap support in the HGT-Detection algorithm); (b) 200 HGT (corresponds to 75% bootstrap support in the HGT-Detection algorithm); (c) 100 HGT (corresponds to 90% bootstrap support in the HGT-Detection algorithm).

Figure 25.7 Distribution of the HGT-QFUNC maximum percentages of positive values chosen for prokaryotic data. The abscissa represents the percentage of the maximum possible number of HGTs between individual species. The ordinate represents the corresponding HGT-Detection bootstrap confidence level. Average values correspond to less than 6%, 4%, and 2% of the maximum possible number of HGTs for the 50%, 75%, and 90% bootstrap confidence levels, respectively.

Figure 25.8 HGT-QFUNC sensitivity results for functions , , , and when detecting partial HGT in a synthetic data set based on -value ordering (maximum -value of 0.05)—boxplot representation. The abscissa represents the sensitivity percentage and the ordinate the tested function. The median value is shown by a vertical black line within each box. Simulations for 2, 4, 8, 16, 32, and 64 random nonreciprocal sequence transfers between prokaryotic species (first value between parentheses) were carried out. Average simulation results under the medium degree of recombination (when 25% of the resulting sequence belongs to one of the parent sequences) are depicted in the left panel. Average simulation results under the highest level of recombination (when 50% of the resulting sequence belongs to the source sequence and 50% to the destination sequence) are depicted in the right panel. For each data set, the maximum allowed number of positive values was double the number of transfers (i.e., 4, 8, 16, 32, 64, and 128, respectively). Calculations were done over 50 replicates for each combination of parameters.

Chapter 26: Computational Methods for Modeling Biological Interaction Networks

Figure 26.1 A PPI network.

Figure 26.2 A metabolic network [90].

Figure 26.3 Human PPI network. Proteins are shown as graph nodes [35].

Chapter 27: Biological Network Inference at Multiple Scales: From Gene Regulation to Species Interactions

Figure 27.2 Receiver Operating Characteristic (ROC). An ROC curve for a perfect predictor, random expectation, and a typical predictor between these two extremes is shown. The Area Under the ROC curve (AUROC) is used as the scoring metric.

Figure 27.1 Hypothetical circadian clock networks from the literature and that inferred from the TiMet gene expression data. The panels P2010 (a) and P2013 (b) constitute hypothetical networks from the literature [42, 43]. The TiMet network (c) displays the reconstructed network from the TiMet data, described in Section 27.5.4, using the hierarchical Bayesian regression model from Section 27.5.1. Gene interactions are shown by black lines with arrowheads; protein interactions are shown by dashed lines. The interactions in the reconstructed network were obtained from their estimated posterior probabilities. Those above a selected threshold were included in the interaction network; those below were discarded. The choice of the cutoff threshold is, in principle, arbitrary. For optimal comparability, we selected the cutoff such that the average number of interactions from the published networks was matched (0.6 for molecular interactions).

Figure 27.3 AUROC scores obtained for different reconstruction methods and different experimental settings. Boxplots of AUROC scores obtained from LASSO, Elastic Net (both in Section 27.4.2), homogBR (homogeneous Bayesian regression, Section 27.4.3), and nonhomogBR (nonhomogeneous Bayesian regression, Section 27.5.1). The latter utilizes light-induced partitioning of the observations. The subpanels are coarse-mRNA: incomplete data, only with mRNA concentrations and coarse gradient; interp-mRNA: incomplete data with interpolated gradient; coarse-complete: complete data with protein and mRNA concentrations; and interp-complete: complete data with interpolated gradient. The coarse gradients are computed from Equation 27.12 from 4-h intervals, and the interpolated gradients (interp) are derived from a Gaussian process as described in Section 27.5.2.

Figure 27.4 Multiple global change-point example. Partitioning with a horizontal change-point vector and vertical vector . The pseudo-change-points define the left and upper boundaries; and define the lower and right boundaries, where and are the number of locations along the horizontal and vertical directions, respectively. The number of change-points is , and the number of segments .

Figure 27.5 Mondrian process example. (a) An example partitioning with a Mondrian process. (b) The associated tree with labels of the latent variable identifying each nonoverlapping segment with leaf nodes (light gray) designated as , where indexes all tree nodes.

Figure 27.6 Diagram of the niche model. Species are indicated by triangles. A species is placed with a niche value into the interval . A value is uniformly drawn that defines the centre of the range . All species with a value inside this interval, that is, , as indicated by the gray triangles, are consumed (“eaten”) by species . Diagram adapted from Reference [62].

Figure 27.7 Spatial distribution. Shown are the spatial distributions of growth rates entering Equation 27.16 as the spatial parameter (Section 27.6.1) decreases from to . A value of 0 corresponds to uniformly random noise, and is Brownian noise.

Figure 27.8 Comparative evaluation of four network reconstruction methods for the stochastic population dynamics data. Boxplots of AUROC scores obtained on the realistic simulated data described in Section 27.6.5.3 for different settings of the spatial parameter, with lower values causing stronger heterogeneity in the data. Box color scheme: BRAMP (white), BRAM (light gray), homogBR (gray), and LASSO (dark gray).

Figure 27.9 Comparison on synthetic data. Boxplots of AUROC scores obtained with five methods on the synthetic data described in Section 27.5: a Bayesian regression model with Mondrian process change-points (BRAMP, Section 27.6.4), a Bayesian regression model with global change-points (BRAM, Section 27.6.2), a Bayesian linear regression model without change-points (homogBR, Section 27.4.3), L1-penalized sparse regression (LASSO, Section 27.4.2), and the sparse regression Elastic Net method (Section 27.4.2). The boxplots show the distributions of the scores for 30 independent data sets with higher scores indicating better learning performance.

Figure 27.10 Species interaction network. Species interactions as inferred with BRAMP (Section 27.6.3), with an inferred marginal posterior probability of 0.5 (thick lines) and 0.1 (thin lines). Solid lines are positive interactions (e.g., mutualism, commensalism) and dashed are negative interactions (e.g., resource competition). Species are represented by numbers and have been ordered phylogenetically as displayed in Table 27.2.

Chapter 28: Discovering Causal Patterns with Structural Equation Modeling: Application to Toll-Like Receptor Signaling Pathway in Chronic Lymphocytic Leukemia

Figure 28.1 Path analysis: and are the exogenous variables. The bidirectional arrow means that they are correlated. and are endogenous and accompanied by an error term. can also be considered as exogenous, as it originates an effect that goes to . receives a direct effect from and receives a direct effect from . also receives an indirect effect of through .

Figure 28.2 Description of the steps of analysis.

Figure 28.3 Initial model for the TLR1/2 pathway.

Figure 28.4 Final model of the TLR1/2 pathway for M-CLL patients.

Figure 28.5 Final model of the TLR1/2 pathway for U-CLL patients.

Figure 28.6 Initial model for TLR2/6 pathway.

Figure 28.7 Final model of the TLR2/6 pathway for M-CLL patients.

Figure 28.8 Final model of the TLR2/6 pathway for U-CLL patients.

Figure 28.9 Initial model for the TLR7 pathway.

Figure 28.9 Final model of the TLR7 pathway for M-CLL patients.

Figure 28.10 Final model of the TLR7 pathway for U-CLL patients.

Figure 28.11 Initial model for TLR9 pathway.

Figure 28.12 Final model of the TLR9 pathway for M-CLL patients.

Figure 28.13 Final model of the TLR9 pathway for U-CLL patients.

Chapter 29: Annotating Proteins with Incomplete Label Information

Figure 29.1 The two tasks studied in this chapter: “?” denotes the missing functions; “1” means the protein has the corresponding function; “0” in (a), (b), and (c) means the protein does not have the corresponding function. Task1 replenishes the missing functions and Task2 predicts the function of proteins p4 and p5, which are completely unlabeled. Some figures are from Reference [35].

Figure 29.2 The benefit of using both the “guilt by association” rule and function correlations (ProDM_nFC is ProDM with no function correlation, and ProDM_nGBA is ProDM with no “guilt by association” rule).

List of Tables

Chapter 2: Algorithmic Perspectives of the String Barcoding Problems

Table 2.1 List of a subset of approximability results proved in Reference [3]

Chapter 3: Alignment-Free Measures for Whole-Genome Comparison

Table 3.1 Example of counters for the ACS approach

Table 3.2 Benchmark for prokaryotes—Archaea and Bacteria domains

Table 3.3 Benchmark for unicellular eukaryotes—genus Plasmodium

Table 3.4 Comparison of whole-genome phylogeny reconstructions

Table 3.5 Comparison of whole-genome phylogeny of Influenza virus

Table 7 Comparison of whole-genome phylogeny of Plasmodium

Table 3.8 Main statistics for the underlying approach averaged over all experiments

Chapter 4: A Maximum Likelihood Framework for Multiple Sequence Local Alignment

Table 4.1 Values of pla, obtained for the different motif finding algorithms, for the CRP data set ()

Chapter 6: A Short Review on Protein Secondary Structure Prediction Methods

Table 6.1 Some state-of-the-art protein secondary structure prediction tools

Table 6.2 Performance comparison of protein secondary structure prediction methods

Table 6.3 Performance comparison of protein secondary structure prediction methods at three prediction difficulty levels

Table 6.4 Misclassification rates of residue secondary structure states based on the benchmark data set with 538 proteins

Table 6.5 Performance comparison of alignment/threading methods in the prediction of protein secondary structure

Table 6.6 Compositions of the predicted and actual secondary structure types in the Y538 data set

Table 6.7 Data sets for protein secondary structure prediction

Chapter 7: A Generic Approach to Biological Sequence Segmentation Problems: Application to Protein Secondary Structure Prediction

Table 7.1 Prediction accuracy of MSVMpred2 and the hybrid model

Chapter 8: Structural Motif Identification and Retrieval: A Geometrical Approach

Table 8.1 The reference table for the SSC algorithm

Table 8.2 CCMS Benchmarking Results

Table 8.3 Number of common super-SSEs found by CCMS in the testing data set, listed by their cardinality

Table 8.4 Number of common super-SSEs found by CCMS removing 2qx8 and 2qx9 from the testing data set, listed by their cardinality

Chapter 9: Genome-Wide Search for Pseudoknotted Noncoding RNA: A Comparative Study

Table 9.1 The six ncRNA families used in our study

Table 9.2 The 13 genomes used in our study

Table 9.3

Table 9.4 Results of ncRNA search with RNATOPS

Table 9.5 Results of ncRNA search with Infernal

Chapter 10: Motif Discovery in Protein 3D-Structures using Graph Mining Techniques

Table 10.1 Characteristics of existing subgraph selection approaches

Chapter 12: Protein Inter-Domain Linker Prediction

Table 12.1 Accuracy of domain boundary placement on the CASP7 benchmark data set

Table 12.2 Comparison of prediction accuracy of EGRN with other predictors

Table 12.3 Prediction performance of publicly available domain linker prediction approaches

Chapter 13: Prediction of Proline Cis–Trans Isomerization

Table 13.1 Model performance comparisons

Chapter 15: Comparison of Protein Quaternary Structures by Graph Approaches

Table 15.1 Summary of related studies in protein graph building by graph-theoretic approach

Table 15.2 Annotations of eight selected macromolecules in the PDB

Table 15.3 Entries of selected macromolecules and their CATH codes

Table 15.4 Comparison of the proposed method with DALI RMSD

Table 15.5 The annotation and description of the MHC gene family

Table 15.6 Entries of selected macromolecules and their CATH codes

Chapter 16: Structural Domains in Prediction of Biological Protein–Protein Interactions

Table 16.1 Properties employed in different studies for the prediction of obligate and nonobligate complexes

Table 16.2 Data sets included in this review and their number of complexes

Table 16.3 Prediction results of SVM for the MW and ZH data sets

Table 16.4 A summary of the number of CATH DDIs of level 3 present in the ZH and MW data sets

Chapter 18: Extraction of Differentially Expressed Genes in Microarray Data

Table 18.1 Results of DE gene selection methods using sensitivity and accuracy

Chapter 19: Clustering and Classification Techniques for Gene Expression Profile Pattern Analysis

Table 19.1 Microarrays versus RNA-Seq

Table 19.2 Gene expression profile data set

Table 19.3 Table of RNA-Seq counts

Table 19.4 Logic formulas in early stage

Table 19.5 Logic formulas in late stage

Table 19.6 Classification accuracies [%] on Alzheimer data sets

Table 19.7 Classification accuracies [%] on multiple sclerosis and psoriasis data sets

Table 19.8 Classification accuracy [%] on breast cancer data set

Chapter 20: Mining Informative Patterns in Microarray Data

Table 20.1 Corresponding mean-squared residue score and average correlation value of A1–A7

Chapter 21: Arrow Plot and Correspondence Analysis Maps for Visualizing The Effects of Background Correction and Normalization Methods on Microarray Data

Table 21.1 Description of the three cDNA microarray data sets considered

Table 21.2 Values of (FDR) corresponding to 20% of the top DE genes with the lowest FDR induced by SAM for the 36 preprocessing strategies per data set: (at the top), (in the middle), and (at the bottom)

Table 21.3 Number of DE genes as upregulated (at the top), special (in the middle), and downregulated (at the bottom) that were detected by the Arrow plot for the data set

Chapter 22: Pattern Recognition in Phylogenetics: Trees and Networks

Table 22.1 Summary of the processes involved in the evolution of chromosome blocks

Table 22.2 Possible fingerprints for gene flow via different evolutionary processes

Chapter 23: Diverse Considerations for Successful Phylogenetic Tree Reconstruction: Impacts from Model Misspecification, Recombination, Homoplasy, and Pattern Recognition

Table 23.1 An updated list of the most commonly used phylogenetic tree reconstruction programs available to date

Chapter 24: Automated Plausibility Analysis of Large Phylogenies

Table 24.1 Test results for a mega-phylogeny of 55,473 taxa

Table 24.2 Test results for different input tree sizes (150–2554 taxa). The algorithm is executed on 30,000 small trees for each run. Each small tree contains exactly 64 taxa

Table 24.3 Test results for 1 million simulated reference trees (each containing 128 taxa)

Chapter 27: Biological Network Inference at Multiple Scales: From Gene Regulation to Species Interactions

Table 27.1 Improvement of the Bayesian regression model with Mondrian process change-points (BRAMP) on the stochastic population dynamics data

Table 27.2 Indices with full scientific names as appearing in Figure 27.10

Chapter 28: Discovering Causal Patterns with Structural Equation Modeling: Application to Toll-Like Receptor Signaling Pathway in Chronic Lymphocytic Leukemia

Table 28.1 Effects of the TLR1/2 pathway on M-CLL patients

Table 28.3 Goodness-of-fit indices for TLR 1/2 pathways

Table 28.2 Effects of the TLR1/2 pathway on U-CLL patients

Table 28.4 Effects of the TLR2/6 pathway on M-CLL patients

Table 28.6 Goodness-of-fit indices for TLR 2/6 pathways

Table 28.5 Effects of the TLR2/6 pathway on U-CLL patients

Table 28.7 Effects of the TLR7 pathway on M-CLL patients

Table 28.9 Goodness-of-fit indices for TLR7 pathway

Table 28.8 Effects of the TLR7 pathway on U-CLL patients

Table 28.10 Effects of the TLR9 pathway on M-CLL patients

Table 28.12 Goodness-of-fit indices for TLR 9 pathway

Chapter 29: Annotating Proteins with Incomplete Label Information

Table 29.1 Data set statistics (Avg ± Std means average number of functions for each protein and its standard deviation)

Table 29.2 Results of replenishing missing functions on ScPPI

Table 29.4 Results of replenishing missing functions on HumanPPI

Table 29.5 Prediction results on completely unlabeled proteins of ScPPI

Table 29.7 Prediction results on completely unlabeled proteins of HumanPPI

Table 29.8 Runtime analysis (s)

Pattern Recognition in Computational Molecular Biology

Techniques and Approaches

Edited by

Mourad Elloumi

Laboratory of Technologies of Information and Communication and Electrical Engineering (LaTICE), and University of Tunis-El Manar, Tunisia

 

Costas S. Iliopoulos

King's College London, UK

 

Jason T. L. Wang

New Jersey Institute of Technology, USA

 

Albert Y. Zomaya

The University of Sydney, Australia

 

 

Copyright © 2016 by John Wiley & Sons, Inc. All rights reserved

Published by John Wiley & Sons, Inc., Hoboken, New Jersey

Published simultaneously in Canada

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.