Biological evolution is the process by which species arise, change and disappear over time. Its study relies on sophisticated methods that involve both mathematical modeling of the biological processes at play and the design of efficient algorithms to fit these models to genetic and morphological data. Models and Methods for Biological Evolution outlines the main methods used to study evolution and provides a broad overview of the variety of formal approaches involved, notably combinatorial optimization, stochastic models and statistical inference techniques. Some of the most relevant applications of these methods are detailed, for example, the study of migratory events in ancient human populations or the progression of epidemics. This book should thus be of interest to applied mathematicians interested in central problems in biology, and to biologists eager to gain a deeper understanding of widely used techniques of evolutionary data analysis.
Number of pages: 583
Year of publication: 2024
SCIENCES
Computer Science, Field Directors – Valérie Berthé and Jean-Charles Pomerol
Bioinformatics, Subject Heads – Anne Siegel and Hélène Touzet
Coordinated by
Gilles Didier
Stéphane Guindon
First published 2024 in Great Britain and the United States by ISTE Ltd and John Wiley & Sons, Inc.
Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms and licenses issued by the CLA. Enquiries concerning reproduction outside these terms should be sent to the publishers at the under mentioned address:
ISTE Ltd, 27-37 St George’s Road, London SW19 4EU, UK
www.iste.co.uk
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
www.wiley.com
© ISTE Ltd 2024
The rights of Gilles Didier and Stéphane Guindon to be identified as the authors of this work have been asserted by them in accordance with the Copyright, Designs and Patents Act 1988.
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s), contributor(s) or editor(s) and do not necessarily reflect the views of ISTE Group.
Library of Congress Control Number: 2023932775
British Library Cataloguing-in-Publication Data
A CIP record for this book is available from the British Library
ISBN 978-1-78945-069-9
ERC code:
PE6 Computer Science and Informatics
PE6_13 Bioinformatics, biocomputing, and DNA and molecular computation
LS8 Ecology, Evolution and Environmental Biology
LS8_7 Macroevolution, paleobiology
Gilles DIDIER1 and Stéphane GUINDON2
1IMAG, CNRS, Université de Montpellier, France
2LIRMM, CNRS, Université de Montpellier, France
Evolution, in its usual life-science sense of the phenomenon by which living species change over time, is an essential biological process. It is arguably the most important one, since all other biological phenomena derive from it in some way. Evolution is not only at the origin of the extraordinary diversity of living beings; it has also shaped every biological function that can be observed. Its general theory dates back to the 19th century and the work of Darwin. Although this theory is now widely accepted, the study of evolution remains an extremely lively and fertile scientific field, one that has developed extensively in recent decades and is consequently rather broad.
This book therefore cannot aim to present every aspect of evolution. It focuses instead on the many interactions between biology, mathematics and computer science that the study of evolution has fostered. Evolutionary studies make such extensive use of mathematical models and algorithms mainly because biological evolution has been going on for more than 4 billion years and is not, with rare exceptions (see Chapter 11), directly observable on our time scale. As such, it can only be studied from the data that are available to us today, that is, present-day species and the fossil record. In order to test hypotheses about the mechanisms governing evolution, it is generally necessary to express them in terms of mathematical models. These models are simplified representations of biological reality that can be used to (imperfectly) reconstruct the evolutionary history of the contemporary species that we observe (as well as ancestral species in the case of fossils). By fitting these models to the data, their relevance can be assessed and the hypotheses initially proposed can be validated or rejected. The design of models and the inference of their parameters raise mathematical, computational and statistical problems that have helped open up new fields of research, both theoretical and applied, in these areas.
The central object in the study of evolution is the tree of life. It is first of all a natural representation of the diversification of species, in the sense that it describes the relationships between species (or even between individuals). In modeling, we shall see that it is sometimes interpreted as the structure along which evolution unfolds, which we can try to reconstruct from the available data, and sometimes as a representation of the statistical dependency between the characters carried by the species. Trees are also theoretical objects that have been studied in computer science and mathematics, particularly from the point of view of their combinatorics. Chapter 1 briefly presents this aspect before describing different evolutionary models that generate trees and the probabilities of these trees under each model.
Although trees represent the framework in which evolution takes place, evolution itself operates on the various characters (taken in the broadest sense here) that can be found in living beings. Moreover, it is through these characters that it can be studied. The most widely used “character” within this framework is genetic material, namely the molecules/polymers whose sequence of elementary building blocks can be written as (long) words over finite alphabets. In fact, the development of genetics (genetic material having first been identified as the medium of evolution) and of DNA sequencing techniques has revolutionized the study of biological evolution by changing the nature of the exploitable data in this field and causing an explosion in their amount. Using these (and other) data to better understand evolution requires ever greater mathematical and computational resources to cope with the ever-changing amount and type of data.
Chapter 2 presents the main Markovian models of DNA sequence evolution. These models, generally considered as mechanistic, describe evolutionary processes at the molecular level, over periods of time long enough for intraspecies genetic variability to be negligible compared to interspecies variability. The vast majority of these models assume that the different positions along the genetic sequences evolve independently of each other and follow the same continuous-time Markovian model. This chapter also describes probabilistic models that account for the variability of evolutionary rates along sequences, an important phenomenon from a biological point of view, particularly with regard to the evolution of the coding parts of genomes, which are constrained by the structure of the genetic code. Finally, models of the same type as those used for DNA sequences can also be used to model the evolution of discrete characters, such as the presence or absence of a given morphological characteristic or the number of fingers.
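To make the idea of independent sites following one continuous-time Markov model concrete, here is a minimal sketch using the Jukes-Cantor (JC69) model. JC69 is chosen here only because it is the simplest such model; it is an assumption of this example, not a model singled out in the text, and the function names are hypothetical.

```python
import math

NUCLEOTIDES = "ACGT"


def jc69_transition_matrix(t):
    """Return the 4x4 transition matrix P(t) under the JC69 model.

    t is the branch length in expected substitutions per site.
    """
    decay = math.exp(-4.0 * t / 3.0)
    p_same = 0.25 + 0.75 * decay  # probability of ending on the same nucleotide
    p_diff = 0.25 - 0.25 * decay  # probability of ending on one specific other nucleotide
    return [[p_same if i == j else p_diff for j in range(4)] for i in range(4)]


def pair_likelihood(seq_a, seq_b, t):
    """Likelihood of two aligned sequences separated by branch length t.

    Sites are treated as independent and identically distributed, so the
    likelihood is a product over sites. Stationary frequencies are uniform
    (1/4 each) under JC69.
    """
    p = jc69_transition_matrix(t)
    likelihood = 1.0
    for x, y in zip(seq_a, seq_b):
        i, j = NUCLEOTIDES.index(x), NUCLEOTIDES.index(y)
        likelihood *= 0.25 * p[i][j]
    return likelihood
```

Each row of the matrix sums to one, and for large t every entry tends to 1/4, the stationary distribution of the model.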
Evolution also concerns the physical characteristics of species, in particular the so-called quantitative characters such as height, weight and so on. Although these are less used for phylogenetic inference, understanding their evolution is essential to biology, for example, for testing hypotheses about morphometric and allometric relationships in ecology. Evolutionary models of continuous characters also prove to be a relevant tool for detecting possible traces of natural selection in the evolution of morphological characters. Chapter 3 presents in detail the generic framework in which these models are implemented, as well as a wide range of regularly used approaches to appropriately model the correlation between character values that derives from evolutionary relationships between the compared species.
The models presented in Chapters 2 and 3 assume that the characters being considered are probabilistically independent (e.g. sequence sites and the different edges in a phylogeny are considered as independent). Although this independence avoids an explosion of the computation time and of the size of the models, it is unrealistic in many cases. Chapter 4 presents different approaches that make it possible to highlight and study the evolutionary interdependence of several characters, discrete or continuous. As with the models presented in Chapter 3, the co-evolution models treat the phylogenetic tree as a nuisance parameter in order to evaluate the part of the correlation between morphological characters that is not explained by evolutionary relationships.
Genetic sequences do not only evolve by means of mutations, as presented in Chapter 2, but sometimes change in a more radical way. In particular, at the genome level, evolution sometimes proceeds by duplication or inversion of whole sections of chromosomes. These types of changes are very rich in terms of information on the evolutionary distances between genomes. They provide a more comprehensive view of evolution than the “simple” models of point substitutions between nucleotides. Nonetheless, genomic rearrangements are more difficult to model mathematically. Chapter 5 reviews the general approach to detecting these rearrangements and reconstructing evolution at the genome scale.
The first five chapters of our book provide an overview of evolutionary models. The next four chapters illustrate how some of these models are applied in the context of phylogenetic inference, that is, for determining the evolutionary relationships between species or individuals and the time since they diverged. They present several approaches commonly used to answer these questions.
The first approach to reconstructing the evolution of a group of species consists of considering a distance or dissimilarity matrix, that is, a measure of the “resemblance” between pairs of species. Here, the assumption is made that the further apart species are from an evolutionary perspective, the less similar they are from the point of view of the chosen measure. Under this hypothesis, the tree that best represents these distances, which we will try to determine, is close to the one that traces their evolution. The dissimilarity, or distance, used can be calculated from genetic sequences and morphological characters. Although a distance drastically summarizes all the characters of the species, different methods presented in Chapter 6 enable the reconstruction of realistic evolutionary trees from this information. Because of their computational speed, these approaches are even virtually the only ones that can be applied in practice to reconstruct trees comprising a large number of species.
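As an illustration of how such a matrix can be computed from genetic sequences, the sketch below derives pairwise distances from aligned sequences. The choice of the Jukes-Cantor correction and the function names are assumptions made for this example; the tree-building methods of Chapter 6 that would consume the matrix are not shown.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of aligned sites at which the two sequences differ."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be aligned and non-empty")
    diffs = sum(1 for x, y in zip(seq_a, seq_b) if x != y)
    return diffs / len(seq_a)


def jc69_distance(seq_a, seq_b):
    """Jukes-Cantor estimate of the number of substitutions per site.

    Corrects the raw proportion of differences for multiple substitutions
    at the same site. The estimate is infinite when p >= 3/4.
    """
    p = p_distance(seq_a, seq_b)
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def distance_matrix(sequences):
    """Build a symmetric matrix (dict of dicts) from a name -> sequence dict."""
    names = list(sequences)
    return {
        a: {b: jc69_distance(sequences[a], sequences[b]) for b in names}
        for a in names
    }
```

The corrected distance is always at least as large as the raw proportion of differences, which reflects the substitutions hidden by repeated changes at the same site.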
While the methods presented in Chapter 6 prove to be very fast, they do not directly take the evolution of the species under study into consideration, since the evolutionary distances between species only provide an approximate summary of the raw data. Other approaches directly involve evolutionary mechanisms or models in phylogenetic inference. One of the first to be considered is parsimony, which is based on the principle of Occam’s razor and seeks phylogenies involving the fewest possible evolutionary events. Parsimony has gradually been supplanted by approaches based on probabilistic models such as those presented in Chapter 2. The first way to use such models for phylogenetic inference consists of looking for the tree that maximizes the probability of the observed data under the chosen model. This is called maximum likelihood. The method is quite close to parsimony in spirit, and these two approaches are described in Chapter 7.
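The parsimony criterion can be sketched for a fixed tree with Fitch's algorithm, which counts the minimum number of state changes needed to explain one character on a rooted binary tree. The nested-tuple tree encoding and the function names are assumptions of this example; the search over candidate trees is not shown.

```python
def fitch_score(tree, states):
    """Minimum number of state changes for one character on a binary tree.

    tree   -- a leaf name (str) or a pair (left_subtree, right_subtree)
    states -- dict mapping each leaf name to its observed state
    """
    def visit(node):
        if isinstance(node, str):
            return {states[node]}, 0
        left, right = node
        left_set, left_cost = visit(left)
        right_set, right_cost = visit(right)
        common = left_set & right_set
        if common:
            # The children can share a state: no extra change is needed here.
            return common, left_cost + right_cost
        # The children disagree: one change is required on this node's edges.
        return left_set | right_set, left_cost + right_cost + 1

    return visit(tree)[1]


def parsimony_score(tree, alignment):
    """Sum of per-site Fitch scores for a name -> sequence alignment."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_score(tree, {name: seq[i] for name, seq in alignment.items()})
        for i in range(n_sites)
    )
```

For instance, with the alignment a = "AAG", b = "AAG", c = "CCG", d = "CCT", the tree (("a", "b"), ("c", "d")) needs 3 changes while (("a", "c"), ("b", "d")) needs 5, so parsimony prefers the first.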
Probabilistic models of sequence evolution can also be used for phylogenetic inference in a Bayesian framework, where we no longer seek only the tree that maximizes the probability of the observed data, but associate with any tree its posterior probability, namely its probability conditional on the observed data. This approach takes into account the uncertainty inherent to the inference process, thereby overcoming a limitation of maximum likelihood approaches. Chapter 8 presents the general principles of Bayesian approaches based on Markov chain Monte Carlo methods, as well as their applications to phylogeny.
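The principle of Markov chain Monte Carlo sampling can be shown on a deliberately tiny problem: sampling the posterior distribution of a single branch length between two sequences, rather than sampling trees. The Jukes-Cantor likelihood, the exponential prior, the tuning values and the function names are all assumptions of this sketch.

```python
import math
import random


def log_posterior(t, n_sites, n_diffs, prior_rate=10.0):
    """Unnormalized log posterior of branch length t under JC69.

    Assumes an Exponential(prior_rate) prior on t (illustrative choice).
    """
    if t <= 0.0:
        return -math.inf
    decay = math.exp(-4.0 * t / 3.0)
    p_same = 0.25 + 0.75 * decay
    p_diff = 0.25 - 0.25 * decay
    if p_diff <= 0.0:
        return -math.inf
    log_lik = (n_sites - n_diffs) * math.log(p_same) + n_diffs * math.log(p_diff)
    return log_lik - prior_rate * t


def metropolis_branch_length(n_sites, n_diffs, n_steps=5000, step_sd=0.05, seed=0):
    """Random-walk Metropolis sampler; returns the list of visited values of t."""
    rng = random.Random(seed)
    t = 0.1
    current = log_posterior(t, n_sites, n_diffs)
    samples = []
    for _ in range(n_steps):
        proposal = t + rng.gauss(0.0, step_sd)  # symmetric proposal
        candidate = log_posterior(proposal, n_sites, n_diffs)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            t, current = proposal, candidate
        samples.append(t)
    return samples
```

The spread of the returned samples, not just their mean, is what expresses the uncertainty about the branch length; in a real analysis the initial part of the chain would typically be discarded as burn-in.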
Bayesian approaches are not the only ones capable of quantifying uncertainty in the trees inferred by the various methods. Evaluating this uncertainty is essential, since in many situations, evolutionary history is difficult to reconstruct, to the point of causing controversy among those who study it. Chapter 9 presents several approaches for quantifying the uncertainty associated with each branch of a phylogenetic tree. The non-parametric bootstrap technique, well known to statisticians, has long been the preferred approach. Other, faster approaches have emerged over the last two decades. These are described in detail, providing an overview of current solutions for quantifying uncertainty in phylogenetic tree reconstructions.
While the previous chapters of this book focus on the models and methods that underpin modern tools for studying evolution, the following chapters illustrate how these approaches can be used to further expand our knowledge of biology, which in turn can sometimes contribute to enhancing phylogenetic analyses.
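The non-parametric bootstrap can be sketched as follows: alignment columns are resampled with replacement, a tree is inferred from each pseudo-replicate, and the support of a branch is the fraction of replicates in which the corresponding clade appears. The `infer_clades` argument is a hypothetical user-supplied function standing in for any tree reconstruction method; it is assumed to return the clades of the inferred tree as hashable objects such as frozensets of leaf names.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample the columns of a name -> sequence alignment with replacement."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {
        name: "".join(seq[i] for i in columns)
        for name, seq in alignment.items()
    }


def bootstrap_support(alignment, infer_clades, n_replicates=100, seed=0):
    """Fraction of bootstrap replicates in which each clade is recovered.

    infer_clades -- hypothetical callable: alignment -> iterable of clades
    """
    rng = random.Random(seed)
    counts = {}
    for _ in range(n_replicates):
        replicate = bootstrap_alignment(alignment, rng)
        for clade in infer_clades(replicate):
            counts[clade] = counts.get(clade, 0) + 1
    return {clade: count / n_replicates for clade, count in counts.items()}
```

Because every replicate requires a full tree inference, the cost grows linearly with the number of replicates, which is one motivation for the faster alternatives mentioned above.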
For example, including fossil species in phylogenies is beneficial at several levels. Beyond their paleontological interest, fossil species are essential to dating phylogenetic trees, more precisely to dating their internal nodes, which represent speciations. The evolutionary models presented in Chapters 2 and 3 do not allow evolutionary durations to be determined in “absolute” time. These are only determined relative to an evolutionary rate that is not identifiable without a temporal calibration point (e.g. the date of an evolutionary event). On the other hand, the geological strata in which fossils are found provide a means to date, more or less precisely, the period during which the associated species lived. Integrating fossil species into a phylogeny thus makes it possible to add time information and, in particular, to estimate speciation dates. Chapter 10 presents the challenges and particularities of this type of data and the difficulties of integrating them into phylogenetic analysis.
Another field of application for phylogenetic models is phylodynamics, presented in Chapter 11. The point here is to trace the evolution of pathogens, viruses or bacteria, in order to understand the underlying population dynamics. The phylogenetic tree no longer describes a sequence of speciations but rather the transmission events of the pathogen between successive hosts. The pace at which these events occur, when analyzed within the context of an appropriate epidemiological model, is directly related to the dynamics of the epidemic. Chapter 11 gives an overview of modern statistical approaches, unfortunately topical at the time of writing this preface, that combine phylogenetics and epidemiology.
Finally, it should be noted that in phylogenetic inference methods presented so far (with the exception of the previous chapter), the evolutionary unit is the species. Tracing the evolution of individuals constituting a species, or even a population, is another major challenge. Chapter 12 gives a broad overview of the advances made possible by the genealogy of individuals traced on the basis of genetic data combined with probabilistic models in population genetics. Such approaches allow us to infer variations in the size of ancestral human populations and to reconstruct, from the analysis of georeferenced genetic sequences alone, migratory events that occurred several thousand years ago.
The aim of this book is not to provide an exhaustive overview of techniques for studying evolution or of their applications. For example, phylogenomic approaches, which aim at reconstructing evolution based on the analysis of multiple genes, are not presented. Similarly, techniques for detecting traces left by natural selection within genetic sequences, or those for molecular dating, are not discussed in detail. Our approach here has been to focus on some of the approaches most frequently used at the current time and to describe them in depth. Many examples of applications of these methods can of course be found and, again, we only give an overview of these. Nevertheless, we hope that this book will spark the interest and curiosity of readers unfamiliar with the subject, while allowing students and researchers in the field to deepen their knowledge of these questions.
November 2023