From Sequences to Graphs -  - E-Book

From Sequences to Graphs E-Book

0,0
126,99 €

-100%
Sammeln Sie Punkte in unserem Gutscheinprogramm und kaufen Sie E-Books und Hörbücher mit bis zu 100% Rabatt.

Mehr erfahren.
Beschreibung

In order to study living organisms, scientists not only study them at an overall macroscopic scale but also on a more detailed microscopic scale. This observation, pushed to its limits, consists of investigating the very center of each cell, where we find the molecules that determine the way it functions: DNA (deoxyribonucleic acid) and RNA (ribonucleic acid). In an organism, DNA carries the genetic information, which is called the genome. It is represented as four-letter sequences using the letters A, C, G and T; based on these sequences, computer methods described in this book can answer fundamental questions in bioinformatics. This book explores how to quickly find sequences of a few hundred nucleotides within a genome that may be made up of several billion, how to compare those sequences and how to reconstruct the complete sequence of a genome. It also discusses the problems of identifying bacteria in a given environment and predicting the structure of RNA based on its sequence.

Sie lesen das E-Book in den Legimi-Apps auf:

Android
iOS
von Legimi
zertifizierten E-Readern

Seitenzahl: 408

Veröffentlichungsjahr: 2022

Bewertungen
0,0
0
0
0
0
0
Mehr Informationen
Mehr Informationen
Legimi prüft nicht, ob Rezensionen von Nutzern stammen, die den betreffenden Titel tatsächlich gekauft oder gelesen/gehört haben. Wir entfernen aber gefälschte Rezensionen.


Ähnliche


Table of Contents

Cover

Title Page

Copyright

Preface

Author Biographies

1 Methodological Concepts: Algorithmic Solutions of Bioinformatics Problems

1.1. Data, models, problem formalism in bioinformatics

1.2. Mathematical preliminaries

1.3. Vocabulary in text algorithmics

1.4. Graph theory

1.5. Algorithmic problems

1.6. Problem solutions

1.7. Complexity classes

1.8. Some algorithmic techniques

1.9. Validation

1.10. Conclusion

1.11. References

2 Sequence Indexing

2.1. Introduction

2.2. Word indexing

2.3. Full-text indexing

2.4. Indexing choice criteria

2.5. Conclusion and perspectives

2.6. References

3 Sequence Alignment

3.1. Introduction

3.2. Exact alignment

3.3. Heuristic alignment

3.4. References

4 Genome Assembly

4.1. Introduction

4.2. Sequencing technologies

4.3. Assembly strategies

4.4. Scaffold construction methods

4.5. Scaffold-ordering methods

4.6. Assembly validation

4.7. Conclusion

4.8. References

5 Metagenomics and Metatranscriptomics

5.1. What is metagenomics?

5.2. “Who are they”: taxonomic characterization of microbial communities

5.3. “What are they able to do?”: functional metagenomics

5.4. Comparative metagenomics

5.5. Conclusion

5.6. References

6 RNA Folding

6.1. Introduction

6.2. Optimization for structure prediction

6.3. Analyzing the Boltzmann ensemble

6.4. Studying RNA structure in practice

6.5. References

Conclusion

List of Authors

Index

Wiley End User License Agreement

List of Tables

Chapter 1

Table 1.1. Runtime comparison for polynomial (green, the first six lines) and ex...

Table 1.2. Different possible cases during a binary-type validation. The green b...

Chapter 2

Table 2.1. Features of the presented indexing structures. The structures allow (...

Table 2.2. Space occupied by each indexing structure for indexing a human genome...

Table 2.3. Space complexity of the indexing structures according to n, the size ...

Chapter 6

Table 6.1. Reference implementations of the algorithms considered in this chapte...

Guide

Cover

Table of Contents

Title Page

Copyright

Preface

Author Biographies

1 Methodological Concepts: Algorithmic Solutions of Bioinformatics Problems

Conclusion

List of Authors

Index

Wiley End User License Agreement

Pages

v

iii

iv

xi

xii

xiii

xiv

xv

xvii

xviii

xix

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33

34

35

36

37

38

39

40

41

42

43

44

45

46

47

48

49

50

51

52

53

54

55

56

57

58

59

60

61

62

63

64

65

66

67

68

69

70

71

72

73

74

75

76

77

78

79

80

81

82

83

84

85

86

87

88

89

90

91

92

93

94

95

96

97

98

99

100

101

102

103

104

105

106

107

108

109

110

111

113

114

115

116

117

118

119

120

121

122

123

124

125

126

127

128

129

130

131

132

133

134

135

136

137

138

139

140

141

142

143

144

145

146

147

148

149

150

151

152

153

154

155

156

157

158

159

160

161

162

163

164

165

166

167

168

169

170

171

172

173

174

175

176

177

178

179

180

181

182

183

184

185

186

187

188

189

190

191

192

193

194

195

196

197

198

199

200

201

202

203

204

205

206

207

208

209

210

211

212

213

214

215

216

217

218

219

220

221

222

223

224

225

226

227

228

229

230

231

233

234

235

237

238

239

240

241

242

243

244

SCIENCES

Computer Science, Field Directors – Valerie Berthe and Jean-Charles Pomerol

Bioinformatics, Subject Heads – Anne Siegel and Helene Touzet

From Sequences to Graphs

Discrete Methods and Structures for Bioinformatics

Coordinated by

Annie Chateau

Mikaël Salson

First published 2022 in Great Britain and the United States by ISTE Ltd and John Wiley & Sons, Inc.

Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms and licenses issued by the CLA. Enquiries concerning reproduction outside these terms should be sent to the publishers at the under mentioned address:

ISTE Ltd27-37 St George’s RoadLondon SW19 4EUUK

www.iste.co.uk

John Wiley & Sons, Inc111 River StreetHoboken, NJ 07030USA

www.wiley.com

© ISTE Ltd 2022

The rights of Annie Chateau and Mikael Salson to be identified as the authors of this work have been asserted by them in accordance with the Copyright, Designs and Patents Act 1988.

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s), contributor(s) or editor(s) and do not necessarily reflect the views of ISTE Group.

Library of Congress Control Number: 2022940907

British Library Cataloguing-in-Publication Data

A CIP record for this book is available from the British Library

ISBN 978-1-78945-066-8

ERC code:

PE6 Computer Science and Informatics

PE6_13 Bioinformatics, biocomputing, and DNA and molecular computation

Preface

Scientists have long been interested in studying living organisms, both at a macroscopic scale, by analyzing their external appearance or their overall internal functioning, and at a more microscopic scale. Taken to its extreme, their observation consists of studying the nucleus of cells and the molecules of living organisms that define their functioning: namely DNA and RNA. In an organism, DNA is actually the carrier of genetic information, which is called the genome, thus playing an important role. However, the genome is not everything; it is composed of genes that allow for RNA production which have various functions such as protein synthesis or regulation of cell activity. In digital form, these DNA or RNA molecules are most often represented as text from four-letter alphabets (A, C, G and T for DNA; A, C, G and U for RNA). Using these DNA and RNA sequences, computer-based methods are able to answer a number of biological questions. This is the heart of this book. In the chapters of this book, we will find the answers to these questions, as well as their limitations to some fundamental questions that have been, and are still being, addressed by bioinformatics. How can a short sequence of a few hundred nucleotides quickly be searched for in a genome that can make a few billion of them? How can sequences be compared with one another? How can the complete sequence of a genome be reconstructed? How can the bacteria that make up our intestinal flora be identified? Based on their sequences, how can the structure that certain RNAs will adopt be predicted?

The methods that are described in this book have their roots anchored in two fields with long-standing foundations, which have long evolved alongside each other without any significant interaction: computer science and biology. It was during the 20th century that the symbiosis between computational methods and biological problems led to combined modeling and the design of bioinformatics algorithms, methods and tools.

DNA was first sequenced in the late 1970s, with low volumes and incurring huge costs. The need to store and manipulate this data automatically soon became very pressing. As a result, the mid-1980s witnessed the development of the first sequence databases. These databases are pooled by the community who feeds them sequencing experiments, which are increasingly growing and require more efficient methods. This is how sequence alignment methods are implemented, which are dedicated to these genomic sequences, and designed for optimizing the time and space used for this operation. These databases are not only maintained, but also expanded and made public internationally, further accelerating access to knowledge. Data acquisition is also accelerating since the first complete bacterial or yeast genomes in the mid-1990s, as well as with the human genome project, which has kept many teams busy for over a decade. Access to knowledge about these genomes leads to questioning living organisms from a completely new point of view, and opens up new avenues for several fields of application, particularly in the health sector, and also in ecology and evolution, along with the enrichment of the fundamental knowledge of organisms and how they function.

Since the mid-2000s, genomic data have been acquired at a much faster pace following the advent of high-throughput sequencers, which enable, to a certain extent, the transformation of DNA or RNA molecules into short sequences of letters at a low cost and at an increasingly frenetic pace. There is an ongoing discussion of projects involving several thousand, or even tens of thousands, complete genomes of individuals of interest. These developments make it now possible to question living organisms in a finer manner, at the scale of varieties and individuals of the same species, but also at the scale of the different tissues that make up an organism, or even at the scale of a natural environment sample containing thousands of different organisms. With these new questions emerges the need to model data as a whole, in a structured way, and to develop methods for the purpose of answering them.

At the same time, storage and information processing capacities, as well as computational performance allowed by increasingly powerful processors and exploiting increasingly complex parallelism, have accompanied an extraordinarily rapid progress in the field of algorithmics and problem modeling based on the use of elaborate discrete structures. Some particular operations that seemed inaccessible have become commonplace at a lower cost in modern programs, and it is not uncommon today to run calculations over a grid whose capacities far exceed what could be imagined some 20 years ago. Nonetheless, this is not enough to make feasible all the studies that we would like to achieve on sequencing data and their derivatives.

The data produced by the sequencers, due to their quantity (up to 10 million nucleotides per second) and their particularities (whose lengths and types of errors vary according to sequencing technologies), require the appropriate use of methods in order to extract relevant information therefrom in a reasonable amount of time without resorting to gigantic computing infrastructures.

Although the methods developed are generally independent of the technology, they must take into account the constraints of the technology in order to yield solutions for practical applications. In particular, the increase in the volume of data to be processed makes some solutions impractical and requires the use of much more faster heuristics. Therefore, the methods used in bioinformatics are most often at the crossroads between exact and approximate methods.

In order to better understand the specific terms and tools of bioinformatics, we have introduced most of them in Chapter 1. This chapter also covers in detail the data (DNA, RNA and proteins) which we are working with, and the way they are obtained. Moreover, this chapter presents some algorithmic notions that are useful for understanding this book, and addresses the concepts used in bioinformatics more broadly. The remaining chapters present the most commonly studied problems in bioinformatics from genomic data. Some chapters focus more on tools, others on methods and still others on a more detailed description of data. We briefly present the questions which the following chapters of this book will answer.

Sequence indexing. In order to address the influx of data, how can these be easily stored, queried and manipulated? This is the subject of Chapter 2, which explains how to respond to these different aspects. The crucial issues here are the conservation of information, the flexibility of the structure and its ability to answer in a reasonable time the most common question, namely “Is this sequence indeed in my genome?”

Sequence alignment. When studying one or more sequences, a question arises very quickly: how can we tell whether the sequences are similar if a sequence is approximately found in another, and also can a score that allows for classifying these comparisons between them be determined? A crucial point in bioinformatics is to be able to answer the question “What are the most significant occurrences of my pattern in my sequence?” This is what is called the sequence alignment, which is the subject of Chapter 3. This chapter also deals with the aspects of alignment-free comparison where, in order to cope with the volume of data and sometimes significant error rates, making use of an alignment is not feasible and heuristics are developed.

Genome assembly. In Chapter 4, the following question is addressed: “How can the complete sequence of an organism be obtained based on the reads produced by sequencing?” This fundamental problem thus arose from a technical difficulty that makes it impossible to read the genome of an organism in one piece from its cells. This technical difficulty will most likely disappear if advances in sequencing make it possible to read the genome in a single pass; however, the assembly is for the moment essential to the knowledge of the genome and raises many problems, such as “how to choose between two possibilities to assemble the reads?” or “how could the quality of the reconstruction be evaluated?” Graphs prove to be very interesting models in this context of reconstruction.

Metagenomics and metatranscriptomics. When several organisms are mixed in a sample, for example, of soil, from a marine environment or from the internal environment of an organism (the well-known microbiota), additional problems can occur along with those already existing during the assembly. For example, “how can we determine which species are present?” or still “how can genomes be assembled when they are mixed together?” This is the subject of Chapter 5.

RNA folding. RNA data are particular data because their secondary structure plays a fundamental role in the functioning of organisms. Chapter 6 proposes an overview of methods for modeling and inferring this secondary structure from sequence data. Here the fundamental question is “how can the folding of a word on itself be found and evaluated, taking into account the affinities between the characters of this word?”

Apart from the solutions provided to answer this large number of questions, it now becomes all the more necessary to take a step back from the methods capable of processing these data. What does it mean when we “find the same piece of sequence” of one organism in another, and what is the significance of this assertion with regard to the parameters chosen, the methods used and their limitations in terms of modeling? What is the place of this method in the landscape of methods already developed and of the problems raised by current data and knowledge?

It is this critical thinking, which can only be developed with full knowledge of the facts, that we also wish to cultivate in our readers. Indeed, the discrete structures used are models that we build, mathematical objects that are not absolute truths, that we design according to an objective, while keeping in mind that these models may have their limitations. Those limitations are expressed in terms of data representativeness, power of expression and classical methods that can be applied. We finally make trade-offs between these limitations and the need to obtain quick answers to crucial questions for the purpose of increasing the knowledge of living beings. The research presented in this book illustrates this explorative process. It is certainly an illusion to understand each of the mechanisms that underlie all of these methods, but the underlying general scheme can be a useful guide for our scientific practice.

This book is part of a series dedicated to methods in bioinformatics; in particular, it is related to the analysis of sequences characterizing genetic information in organisms. This book is intended for all students from the master’s level upwards, doctoral students, young and senior researchers. This book focuses on examples of modeling and problem solving using discrete structures. Far from being exhaustive, this book is intended to be a reminder and an opening for those who are dedicated to one of the fields covered, and an introduction for the future generation of bioinformaticians. Written by researchers in the field, coming from various backgrounds, with most of them significantly involved in bioinformatics training, this book has been conceived with a pedagogical effort that makes it accessible to less informed readers. We sincerely hope that you will enjoy reading this book.

In order to make this book, it was necessary to harness the energy and will of several people, whom we would like to thank in particular. First of all, we would like to thank the researchers who agreed to write the various chapters, an exercise that added to their already busy schedules, and with a quality which we applaud. Second, we would like to thank our coordinators, Hélène Touzet and Anne Siegel, for their kindness, advice and support throughout this adventure. Finally, we would like to thank ISTE for having trusted us and given us complete freedom in the organization and coordination of this book.

May 2022

Author Biographies

Annie Chateau has been a lecturer at the University of Montpellier since 2006. She switched to bioinformatics in 2004 after a thesis in mathematical logic. Her interests include algorithms and combinatorial structures related to various problems such as sequence alignment, genomic rearrangements, genome scaffolding, articulation between phylogeny and synteny, and more generally NGS data manipulation. She has been teaching algorithms in bioinformatics for about 15 years, at all levels, and is currently in charge of the Master’s in Bioinformatics in Montpellier.

Tom Davot-Grangé is currently a temporary teaching and research associate at the University of Montpellier. He defended his thesis on genome scaffolding in 2020. His area of research relates to the fields of fundamental computer science, such as graph theory and algorithmic complexity, as well as to bioinformatics within particular sequence contig scaffolding, a problem he studied during his PhD.

Cervin Guyomar is a research engineer at INRAE, in the UMR GenPhySE based in Toulouse. He defended a thesis in bioinformatics in 2018 on the metagenomics of the pea aphid holobiont. He was particularly involved in proposing innovative methods to assemble genomes and characterize microbial species variation in bacterial communities. Since then, he has been developing analyses for many types of genomic data, in particular for the functional annotation of animal genomes.

Dominique Lavenier is a senior researcher at the CNRS, IRISA, Rennes, France. After working in the field of machine architectures, and more particularly with specialized parallel architectures for genomic data processing, he joined the Symbiose bioinformatics group in Rennes in 2000. He created and co-directed the DEA in genomics and computer science (2000–2004), and then the Master’s in Bioinformatics in Rennes (2004–2008). Between 2008 and 2010, he was seconded as a university professor at ENS Cachan, Brittany branch. In 2012, he created the GenScale bioinformatics team, IRISA/Inria, Rennes. From 2016 to 2021, he was a member of the CoNRS in the computer science section (06) and in the interdisciplinary commission (CID 51): modeling and analysis of biological data and systems. His current research activities include the processing of genomic data volumes, and more specifically data structures and associated parallel algorithms.

Thierry Lecroq obtained a thesis in computer science from the University of Orleans in 1992. In the same year, he was appointed as a lecturer at the University of Rouen Normandy, France. He was then appointed as a professor at the same university in 2002. He is currently the director of the research team Information Processing in Biology and Health (TIBS) of the Laboratory of Informatics, Information and System Processing (LITIS). He is the co-author of several books and chapters on text algorithmics. He was one of the coordinators of the CNRS working group focused on this field for more than 10 years (2008–2018). He teaches in the Computer Science Department of the University of Rouen Normandy, France, and mainly in the Master’s of Bioinformatics.

Claire Lemaitre has been an Inria Research Fellow in the Genscale team of the Inria Rennes Bretagne Atlantique center and the IRISA computer science laboratory in Rennes since 2010. She has a multidisciplinary background in bioinformatics, which she completed with a PhD in 2008. Since her thesis on genomic rearrangements in mammalian genomes, she has been interested in the organization and evolution of genomes. For this purpose, she has developed algorithms and computational methods for comparing and analyzing DNA sequences, whether they are complete genomes or data from massive sequencing. In particular, she has developed software for genome assembly, detection of genomic variants and comparison of metagenomic data.

Laurent Noé has been a lecturer at the University of Lille since 2006. He is currently focusing on algorithms and models related to sequence comparison, NGS data and spaced seeds. He teaches networks, information coding and programming, and has been teaching bioinformatics for 15 years. He is currently co-responsible for the reinforced research course in computer science at the University of Lille.

Yann Ponty has been a senior researcher since 2020. He is responsible for the AMIBio team of the Computer Science Laboratory of the Ecole Polytechnique, where he has been conducting research in bioinformatics since his admission to the CNRS in 2009. He discovered RNA bioinformatics during his thesis and defended in 2006 at the University of Paris-Saclay. He developed original algorithmic methods for folding prediction according to the thermodynamic or kinetic principles, the research in structured RNA in genomes and the design of functional RNAs. His works are inspired, on the one hand, by algorithmic techniques, including dynamic programming and parameterized algorithms, and, on the other hand, by enumerative and analytical combinatorics through the random generation and analysis of algorithms on average. During his spare time, he teaches RNA bioinformatics in training courses for the Master’s in Bioinformatics at Paris-Saclay and Sorbonne University.

Vladimir Reinharz has been a professor at the University of Quebec in Montreal and a member of the Laboratory of Algebra, Combinatorics and Mathematical Computing since 2020. He has been conducting research in bioinformatics since his thesis in 2015 at McGill. He develops algorithmic methods to understand the relationship between sequence, structure and function in RNA, as well as the evolution and modularity of its three-dimensional structure. To this end, he uses various techniques and representations, from dynamic programming and integer programming including graphs. He teaches bioinformatics at all levels.

Mikaël Salson has been a lecturer in computer science at the University of Lille since 2010. He defended a thesis on full-text compressed indexes. He then developed with his colleagues algorithmic methods and software programs for the analysis of high-throughput sequencing data (RNA-Seq, V(D)J recombinations, microRNAs, etc.) or for their indexation (V(D)J recombinations, sequencing datasets). He is also responsible for the first year of the master’s degree in bioinformatics in Lille.