This volume offers comprehensive coverage of the many different aspects of systems biology, resulting in an excellent overview of the experimental and computational approaches currently in use to study biological systems.
Each chapter provides a valuable introduction to one specific branch of systems biology, covering the current state of the art and pointing to future directions. After presenting different methods for the integrative analysis of omics data, the book goes on to describe techniques that allow the direct quantification of carbon fluxes in large metabolic networks, including the use of 13C-labelled substrates and genome-scale metabolic models. The latter are explained on the basis of the model organism Escherichia coli as well as human metabolism. Subsequently, the authors deal with the application of such techniques to human health and cell factory engineering, with a focus on recent progress in building genome-scale models and regulatory networks. They highlight the importance of such information for specific biological processes, including the ageing of cells, the immune system, and organogenesis. The book concludes with a summary of recent advances in genome editing, which have enabled precise genetic modifications and even dynamic control of gene expression.
This book is part of the Advanced Biotechnology series, which covers all pertinent aspects of the field, with each volume prepared by eminent scientists who are experts on the topic in question.
Page count: 907
Year of publication: 2017
Cover
Related Titles
Title Page
Copyright
List of Contributors
About the Series Editors
Chapter 1: Integrative Analysis of Omics Data
Summary
1.1 Introduction
1.2 Omics Data and Their Measurement Platforms
1.3 Data Processing: Quality Assessment, Quantification, Normalization, and Statistical Analysis
1.4 Data Integration: From a List of Genes to Biological Meaning
1.5 Outlook and Perspectives
References
Chapter 2: 13C Flux Analysis in Biotechnology and Medicine
2.1 Introduction
2.2 Theoretical Foundations of 13C MFA
2.3 Metabolic Flux Analysis in Biotechnology
2.4 Metabolic Flux Analysis in Medicine
2.5 Emerging Challenges for 13C MFA
2.6 Conclusion
Acknowledgments
Disclosure
References
Chapter 3: Metabolic Modeling for Design of Cell Factories
Summary
3.1 Introduction
3.2 Building and Refining Genome-Scale Metabolic Models
3.3 Strain Design Algorithms
3.4 Case Studies
3.5 Conclusions
Acknowledgments
References
Chapter 4: Genome-Scale Metabolic Modeling and In silico Strain Design of Escherichia coli
4.1 Introduction
4.2 The COBRA Approach
4.3 History of E. coli Metabolic Modeling
4.4 In silico Model-Based Strain Design of E. coli Cell Factories
4.5 Future Directions of Model-Guided Strain Design in E. coli
References
Chapter 5: Accelerating the Drug Development Pipeline with Genome-Scale Metabolic Network Reconstructions
Summary
5.1 Introduction
5.2 Metabolic Reconstructions in the Drug Development Pipeline
5.3 Species-Level Microbial Reconstructions
5.4 The Human Reconstruction
5.5 Community Models
5.6 Personalized Medicine
5.7 Conclusion
References
Chapter 6: Computational Modeling of Microbial Communities
Summary
6.1 Introduction
6.2 Ecological Models
6.3 Genome-Scale Metabolic Models
6.4 Concluding Remarks
References
Chapter 7: Drug Targeting of the Human Microbiome
Summary
7.1 Introduction
7.2 The Human Microbiome
7.3 Association of the Human Microbiome with Human Diseases
7.4 Drug Targeting of the Human Microbiome
7.5 Future Perspectives
7.6 Concluding Remarks
Acknowledgments
References
Chapter 8: Toward Genome-Scale Models of Signal Transduction Networks
8.1 Introduction
8.2 The Potential of Network Reconstruction
8.3 Information Transfer Networks
8.4 Approaches to Reconstruction of ITNs
8.5 The rxncon Approach to ITNWR
8.6 Toward Quantitative Analysis and Modeling of Large ITNs
8.7 Conclusion and Outlook
Acknowledgments
References
Chapter 9: Systems Biology of Aging
Summary
9.1 Introduction
9.2 The Biology of Aging
9.3 The Mathematics of Aging
9.4 Future Challenges
Conflict of Interest
References
Chapter 10: Modeling the Dynamics of the Immune Response
10.1 Background
10.2 Dynamics of NF-κB Signaling
10.3 JAK/STAT Signaling
10.4 Conclusions
Acknowledgments
References
Meyers, R.A. (ed.)
2012
Print ISBN: 978-3-527-32607-5
Dehmer, M., Emmert-Streib, F., Graber, A., Salvador, A. (eds.)
2011
Print ISBN: 978-3-527-32750-8
Published:
Villadsen, J. (ed.)
2016
Print ISBN: 978-3-527-33674-6
Love, J. Ch. (ed.)
2016
Print ISBN: 978-3-527-33281-6
Wittmann, Ch., Liao, J.C. (eds.)
2017
Print ISBN: 978-3-527-34179-5
Wittmann, Ch., Liao, J.C. (eds.)
2017
Print ISBN: 978-3-527-34181-8
Coming soon:
Yoshida, T. (ed.)
Applied Bioengineering
2017
Print ISBN: 978-3-527-34075-0
Edited by Jens Nielsen and Stefan Hohmann
Sang Yup Lee
Tobias Österlund, Marija Cvijovic and Erik Kristiansson
Data generation and analysis are essential parts of systems biology. Today, large amounts of omics data can be generated fast and cost-efficiently thanks to the development of modern high-throughput measurement techniques. Their interpretation is, however, challenging because of the high dimensionality and the often substantial levels of noise. Integrative analysis provides a framework for analysis of the omics data from a biological perspective, starting from the raw data, via preprocessing and statistical analysis, to the interpretation of the results. By integrating the data into structures created from biological information available in resources, databases, or genome-scale models, the focus moves from the individual transcripts or proteins to the entire pathways and other relevant biochemical functions present in the cell. The result provides a context-based interpretation of the omics data, which can be used to form a holistic and unbiased view of biological systems at a molecular level. The concept of integrative analysis can be used for many forms of omics data, including genome sequencing, transcriptomics, and proteomics, and can be applied to a wide range of fields within the life sciences.
Systems biology is an interdisciplinary approach to biology and medicine that employs both experimentation and mathematical modeling to achieve a better understanding of biological systems by describing their shape, state, behavior, and evolutionary history. An important aim of systems biology is to deliver predictive and informative models that highlight the fundamental and presumably conserved relationships of biomolecular systems and thereby provide an improved insight into the many cellular processes [1]. Systems biology research methodology is a cyclical process fueled by quantitative experiments in combination with mathematical modeling (Figure 1.1) [2, 3]. In its most basic form, the cycle starts with the formulation of a set of hypotheses, which is followed by knowledge generation and model construction where an abstract description of the biological system (a model) is formulated and its parameters are estimated from data taken from the literature. The final step is defined by model predictions, where the constructed model is used to address the original hypotheses by providing a quantitative analysis of the system, which, in turn, generates new biological insight.
Figure 1.1 Systems biology research methodology. In the systems biology cycle, novel hypotheses are first formulated, followed by knowledge generation, model construction, and model predictions, which, in turn, lead to new biological insights. The development of high-throughput techniques has enabled rapid and cost-efficient generation of omics data from, for example, genome sequencing, transcriptomics, and proteomics. Integrative analysis provides a framework in which omics data is systematically analyzed in a biological context, by integrating the data into known biological networks or other data resources, which enables improved interpretation and easier incorporation into quantitative models.
The development of high-throughput measurement techniques in recent years has resulted in an unprecedented ability to rapidly and cost-efficiently generate molecular data. Bioassays are today established for large-scale characterization of genes and their expression at the different layers defined by the central dogma: the genome, the transcriptome, and the proteome. The resulting data, which in this chapter will be referred to as omics data, is, however, complex because of its high dimensionality and is therefore hard to interpret and to integrate directly into quantitative models. The concept of integrative analysis is a framework to systematically analyze the different components of omics data in relation to their corresponding biological functions and properties. The resulting biological interpretation can be used to form a holistic and unbiased view of biological systems at a molecular level. Thanks to the comprehensiveness of omics data, all components (i.e., genes, transcripts, or proteins) can be measured simultaneously, which opens up opportunities for testing existing hypotheses as well as generating completely new hypotheses about the studied biological system.
The process of integrative analysis can be divided into two main steps: data processing and data integration (Figure 1.2). Integrative analysis starts from raw omics data and ends with the biological interpretation, and during this process the dimensionality of the data is reduced. The first step, data processing, takes the high-dimensional omics data and, by applying computational and statistical tools, removes noise and errors while identifying the genes and other components that carry information significant for the experiment. The second step, data integration, uses the list of identified genes to pinpoint relevant functions and pathways by integrating the data on top of a “scaffold” built from established biological information collected from various resources and databases. The result, which is based on the combined analysis of genes with similar functional properties, has a substantially reduced dimension, which considerably facilitates its interpretation.
Figure 1.2 Description of the concept of integrative analysis as a tool for reduction of the dimension of omics data. Integrative analysis starts with raw omics data, which is typically affected by high levels of noise and errors. Computational and statistical approaches are first used to process the data to produce a ranked list of genes that are found to be of significant importance in the experiment. The gene list is used as input to the data integration, where known biological information is used as a basis for the interpretation of the data. During integrative analysis, the dimension of the data is significantly reduced, from potentially millions of data points to a limited number of significant biological functions and pathways, which considerably facilitates the interpretation.
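The chapter keeps the data-processing step conceptual at this point. As a minimal, hypothetical sketch of how a ranked gene list might be produced in practice (synthetic data, a simple per-gene t-test, and Benjamini–Hochberg correction, none of which are prescribed by the authors), the step could look as follows:

```python
# Minimal, hypothetical sketch of the data-processing step: per-gene statistical
# testing followed by multiple-testing correction. All data are synthetic.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic normalized expression matrix: 1000 genes x (3 control + 3 treated) samples
control = rng.normal(loc=5.0, scale=1.0, size=(1000, 3))
treated = rng.normal(loc=5.0, scale=1.0, size=(1000, 3))
treated[:50] += 2.0                       # the first 50 genes are truly perturbed

# Per-gene two-sample t-test between conditions
t_stat, p_val = stats.ttest_ind(treated, control, axis=1)

# Benjamini-Hochberg adjustment to control the false discovery rate
order = np.argsort(p_val)
ranks = np.arange(1, p_val.size + 1)
adjusted = np.minimum.accumulate((p_val[order] * p_val.size / ranks)[::-1])[::-1]
q_val = np.empty_like(adjusted)
q_val[order] = adjusted

significant = np.flatnonzero(q_val < 0.05)   # ranked gene list handed to data integration
print(f"{significant.size} genes significant at FDR < 0.05")
```

In an actual analysis, the normalization and the choice of statistical test depend on the measurement platform, as discussed in Section 1.3.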
Many studies in the life sciences aim to understand biological systems, often in relation to a perturbation caused by, for example, disease, genetic variability, changes in environmental parameters, or other factors introduced through laboratory experiments. A commonly used measurement technique is transcriptomics, where the transcriptional response is analyzed and the genes that are differentially expressed between the investigated conditions are identified. In this setting, data integration shifts the focus from which genes are differentially expressed to a biological context in which activated and repressed pathways, functions, or subnetworks can be identified. This provides a more relevant view of the data, which paves the way toward more sound and detailed biological conclusions.
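The chapter does not single out a specific integration method here, but one common embodiment of this idea is over-representation analysis, in which the significant genes are tested for enrichment in each pathway with a hypergeometric test. The sketch below uses hypothetical counts:

```python
# Sketch of over-representation analysis for a single pathway (hypothetical counts).
from scipy.stats import hypergeom

N = 20000   # annotated genes in the "universe"
K = 150     # genes annotated to the pathway of interest
n = 400     # differentially expressed genes from the data-processing step
k = 12      # overlap: differentially expressed genes that belong to the pathway

# Tail probability P(X >= k) under sampling without replacement
p_enrichment = hypergeom.sf(k - 1, N, K, n)
print(f"enrichment p-value: {p_enrichment:.3g}")

# Repeating the test over all pathways (with multiple-testing correction) yields
# the reduced list of significant functions and pathways described in the text.
```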
In this chapter, we provide a broad overview of integrative analysis of omics data. We will describe the general concept of integrative analysis and provide an outline of the many associated computational steps. It should, however, be pointed out that this topic has been extensively researched in recent years and – given the scope of the topic at hand – we will not be able to cover all aspects and details in a single chapter. We have therefore provided a comprehensive set of references throughout the text, which are recommended starting points for further reading. Also, our main focus throughout this chapter will be on data generated by techniques from genomics, transcriptomics, and proteomics. This means that other types of data that are commonly encountered in systems biology, such as metabolomics and lipidomics, will receive little attention; for these, we instead refer the reader to the recent reviews by Robinson et al. [4] and Kim et al. [5].
The chapter is organized as follows. Section 1.2 contains an overview of some of the types of omics data that are commonly used in integrative analysis. This is followed by Section 1.3, where we focus on data processing, from quality assessment of the raw data to the statistical analysis. Section 1.4 explains the concept of data integration and describes the different approaches and data resources that can be used. We end the chapter with an outlook discussing future challenges related to the continuous growth of biological information.
In this section three commonly used types of omics data will be described, namely genome sequencing, transcriptomics (RNA sequencing and microarrays), and mass spectrometry (MS)-based proteomics.
Genome sequencing is used to determine the order of the complete set of nucleotides present in an organism. The comparative analysis of the genome of a strain or a multicellular organism in relation to a reference genome is referred to as “resequencing,” which enables identification of the complete genotype and its variation between individuals. This includes both small mutations, such as single nucleotide polymorphisms (SNPs) and short insertions/deletions (indels), and larger structural variations such as genome rearrangements and copy number alterations [6]. The resulting information, containing a list of all identified genetic variants, is often subjected to integrative analysis in order to provide a biological context in which the genotype can be linked to a phenotype [7]. Whole-genome and exome resequencing are important techniques for the study of human disease [8]; in cancer, for example, the set of germline and somatic mutations is often a good predictor of the tumor phenotype, including aggressiveness, ability to metastasize, and drug resistance [9].
Transcriptomics is the large-scale analysis of gene expression at the transcript level. Modern transcriptomics is based on RNA-seq, in which RNA is reverse-transcribed into complementary DNA (cDNA) and then sequenced en masse [10]. From the resulting data, the relative abundance of expressed mRNA and other functional noncoding RNA can be estimated. RNA-seq can also provide detailed information about alternative splicing and expression of isoforms, as well as antisense transcription [11]. Analogous to transcriptomics, proteomics is the study of gene expression at the protein level. Large-scale proteomics data is generated by bottom-up tandem MS (shotgun proteomics), where a mixture of proteins extracted from a sample is first enzymatically digested (using, e.g., trypsin), followed by peptide separation using liquid chromatography. The peptides are then subjected to two consecutive stages of mass spectrometry, in which the individual peptides are first separated and then fragmented to generate a set of mass spectra. The resulting data provides information about the peptide sequences and their relative abundance in the sample [12]. Proteomics can also be used to study post-translational modifications, such as phosphorylation and ubiquitination [13]. Integrative analysis of transcriptomic and proteomic data has long been used to study and interpret differences in gene expression between tissues and individuals, as well as between medical, environmental, or experimental conditions [14, 15].
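The chapter does not go into how relative transcript abundances are derived from raw read counts; a commonly used normalization (one of several, not specifically endorsed by the authors) is transcripts per million (TPM), sketched here with made-up numbers:

```python
# Minimal sketch: relative transcript abundance as transcripts per million (TPM).
# Read counts and transcript lengths below are made up for illustration.
import numpy as np

counts = np.array([500.0, 1200.0, 300.0, 8000.0])   # mapped reads per transcript
lengths_kb = np.array([1.5, 2.0, 0.8, 4.2])          # transcript lengths in kilobases

rate = counts / lengths_kb          # length-normalized read rate per transcript
tpm = rate / rate.sum() * 1e6       # scale so that abundances sum to one million

print(tpm.round(1))                 # relative abundances comparable between samples
```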
The recently introduced next-generation sequencing (NGS) technology has revolutionized large-scale characterization of DNA [16]. In contrast to traditional Sanger sequencing, which is inherently a serial process, NGS is massively parallel and can characterize billions of DNA fragments simultaneously. This has enabled rapid and cost-efficient generation of vast volumes of DNA sequence data, and, consequently, genome resequencing and transcriptomics are today almost exclusively based on NGS. Several NGS platforms are available, each with its own performance and characteristics [17]. The Illumina platform uses a sequencing-by-synthesis approach in which fluorescence-tagged nucleotides are consecutively incorporated to form the reverse strand of single-stranded DNA fragments. Each incorporated base is registered using a camera, which provides information about the nucleotide sequence of billions of fragments simultaneously. The Illumina sequencing technique has a high throughput, and a single run can generate more than 1 terabase of sequence data. The generated reads are, however, relatively short (currently 100–300 bases) [18]. The IonTorrent platform also applies a sequencing-by-synthesis scheme, but the incorporated bases are instead registered by semiconductor measurement of fluctuations in pH resulting from the release of hydrogen ions [17]. The IonTorrent platform provides quick sequencing runs and can generate reads of up to 400 bases but has a lower throughput than the Illumina platform. A third commonly used platform is Pacific Biosciences (PacBio), which uses a sequencing technique in which fluorescence pulses from the incorporated tagged nucleotides are detected in real time [18]. PacBio can generate sequence reads of up to 20 000 bases but still has a limited throughput compared to the Illumina and IonTorrent platforms [19].
Similar to DNA sequencing technology, the performance and throughput of MS-based proteomics have increased drastically during the last decade. This is a result of improvements in and optimization of many of the steps in the proteomics workflow. In particular, improved protein digestion through the use of multiple proteases, optimized chromatographic peptide separation, and novel instrumentation with higher resolving power and scan speed have significantly increased the performance, with respect to both sequencing depth and proteome coverage [20]. As a consequence, MS-based proteomics can today be used to identify >10 000 unique proteins in a single sample using low volumes of starting material, thus generating a comprehensive snapshot of the proteome [21, 22].
Microarray technology, first introduced 20 years ago, is based on fluorescence-tagged cDNA that is hybridized to unique gene-specific probes distributed over a chip. A laser scanner is used to extract information about the amount of DNA captured by each probe. Microarrays were previously popular, for example, for large-scale transcriptomics and identification of SNPs but have, compared to NGS-based techniques, lower resolution and are plagued by high technical variation and systematic error [23, 24]. Even though the microarray measurement technology has to a large extent been superseded, there is a large accumulated body of microarray data present in the public repositories that can be subjected to integrative analysis [25]. There is a vast literature regarding all steps of the processing of microarray data, and it will therefore be less extensively covered in this chapter [26, 27].
Yi Ern Cheah*, Clinton M. Hasenour* and Jamey D. Young
Metabolism is essential for cellular function. Metabolic reactions break down nutrients to provide the chemical energy and material resources required for organisms to grow and survive. Nature provides a diverse and adaptable collection of enzymes that can be reassembled by metabolic engineers to create recombinant organisms capable of producing a wide range of valuable biochemicals. On the other hand, metabolic physiologists study how metabolism is natively regulated and how human diseases cause this regulation to go awry. In both contexts, there is a critical need to quantify metabolic phenotypes under baseline conditions and then to assess the response to targeted genetic, nutritional, or environmental changes. These studies enable researchers to decipher how metabolic pathways function and how they can be manipulated to achieve a desired outcome.
The rapid expansion of genome sequence information has opened the floodgates to new genetic, transcriptomic, and proteomic studies of industrial host organisms and biomedical disease models. However, these studies typically infer metabolic pathway alterations indirectly from changes in enzyme expression, rather than from direct measurements of metabolic flux. This can be misleading, since metabolic enzymes are tightly regulated by allosteric feedback, post-translational modifications, and substrate availability. Therefore, mRNA or protein abundances often do not correlate strongly with metabolic pathway fluxes [1]. Furthermore, metabolic fluxes cannot be determined solely by static measurements of intracellular metabolite pool sizes [2]. In contrast, 13C metabolic flux analysis (13C MFA) can provide functional readouts of in vivo metabolic pathway activities. The goal of 13C MFA is to identify a unique flux solution that describes the rates of all major intracellular metabolic pathways under the experimental conditions of interest. This solution is usually depicted as a flux map, which is an arrow diagram that shows the biochemical reactions connecting metabolites in the network and the rate of each interconversion (Figure 2.1). Because metabolic fluxes represent the final outcome of cellular regulation at many different levels, a flux map provides an ultimate representation of the cellular phenotype [3].
Figure 2.1 Simplified flux map representing fermentative yeast metabolism. Fluxes entering glycolysis, pentose phosphate pathway (PPP), and citric acid cycle (CAC) are shown relative to a glucose uptake rate of 100. The major extracellular products are ethanol, glycerol, and CO2.
In simple metabolic networks, it is occasionally possible to calculate a unique flux solution by measuring extracellular exchange rates and solving for the intracellular fluxes using stoichiometric mass balances. However, in typical networks containing multiple parallel pathways that carry out similar biochemical conversions, or when internal cycles or reaction reversibilities exist within the network (Figure 2.2), it is impossible to obtain a unique flux solution without imposing assumptions regarding cofactor balancing or metabolic objective functions [4]. Imposing such assumptions can provide a stoichiometrically feasible prediction of one possible flux solution, but confirming whether this solution is accurate requires additional measurements. These measurements can be most readily obtained through isotope labeling experiments (ILEs), where exogenous substrates (e.g., glucose, glutamine, etc.) containing heavy isotopes (e.g., 13C, 14C, 2H, 3H, 15N, etc.) are administered to the cells or tissues under investigation. Then, at some later time after these substrates have been sufficiently metabolized, the incorporation of heavy isotopes into downstream metabolic products is measured. It is typically necessary to collect samples at multiple time points to assess whether the isotope labeling has fully equilibrated or whether it is still changing dynamically, since this has important implications for data interpretation and analysis [5, 6]. Because different biochemical pathways rearrange the heavy atoms in unique ways, the measured patterns of isotope incorporation provide specific information about the route(s) that the labeled atoms took through the metabolic network and can even provide quantitative information about the relative activities of these various pathways (Figure 2.3). Although earlier work relied heavily on radioactive tracers and scintillation counting to detect isotope incorporation, modern ILEs typically involve the use of stable isotope tracers followed by detection with either mass spectrometry (MS) or nuclear magnetic resonance (NMR). These stable isotope measurements provide richer datasets that translate into a greater number of experimentally derived constraints needed to enforce a unique flux solution [8]. The essence of 13C MFA is the integration of isotope labeling measurements with directly measurable extracellular exchange rates to calculate the unknown intracellular flux solution.
Figure 2.2 Situations in which stoichiometric flux balancing fails to provide a unique solution. The fluxes interconverting metabolites A and B (blue arrows) are not observable based on measurements of the total uptake of metabolite A and/or the total production of metabolite B (black arrows).
Figure 2.3 A straightforward 13C tracer experiment. Assuming that each pathway produces a unique labeling pattern in the three-carbon end product, the mass isotopomer distribution (MID) of the product pool provides a direct readout of relative flux in the network. Mass isotopomers are molecules with the same chemical formula but different molecular weights due to varying incorporation of heavy atoms. Mass isotopomers are denoted M0, M1, M2, and so on, in the order of increasing weight.
Adapted from Duckwall et al. [7].
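The identifiability problem sketched in Figure 2.2 can be made concrete with a toy stoichiometric model (not taken from the chapter): two parallel routes converting A to B leave the null space of the stoichiometric matrix two-dimensional, so exchange-rate measurements alone cannot resolve the split between the routes.

```python
# Toy version of Figure 2.2 (hypothetical network): A is taken up, converted to B
# by two parallel routes, and B is secreted. Columns of S correspond to the fluxes
# [A uptake, route 1, route 2, B secretion]; rows are steady-state metabolite balances.
import numpy as np
from scipy.linalg import null_space

S = np.array([[1.0, -1.0, -1.0,  0.0],     # balance on A
              [0.0,  1.0,  1.0, -1.0]])    # balance on B

dof = S.shape[1] - np.linalg.matrix_rank(S)
print("flux degrees of freedom:", dof)                    # -> 2
print("null space dimension:", null_space(S).shape[1])    # -> 2

# Measuring uptake and secretion adds only one independent constraint (they must be
# equal at steady state), so the split between the two parallel routes remains free;
# isotope labeling data are needed to resolve it.
```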
The central challenge of 13C MFA derives from the fact that intracellular metabolic fluxes are per se unmeasurable quantities, whose values must be calculated from measurable extracellular exchange rates and isotope labeling patterns [9]. In general, the relationship between the labeling of network metabolites and the intracellular flux solution is complex and nonlinear. It is therefore practically impossible to determine analytical expressions for fluxes as explicit functions of measurement data for most biologically relevant networks. As a result, the typical 13C MFA procedure involves a least-squares regression to calculate a flux solution that minimizes the lack of fit between experimental measurements and computationally simulated measurements (Figure 2.4). The latter are derived from a mathematical model that comprises a system of mass and isotopomer balances on all metabolites in the biochemical network. Antoniewicz [6] and Niedenfuhr et al. [5] provide excellent introductions to the different types of ILEs and modeling approaches that have been used for flux determination. Most ILEs are designed to collect samples under metabolic and isotopic steady-state conditions, where the mass and isotopomer balances are described by algebraic equations. However, isotopically nonstationary metabolic flux analysis (INST-MFA) can be used to regress fluxes from transient isotope labeling measurements by applying a model composed of algebraic (steady-state) mass balances coupled to differential (non-steady-state) isotopomer balances. In either scenario, the balance equations can be solved for any particular set of flux parameters to simulate the available isotope labeling measurements. The central task of 13C MFA is thus equivalent to performing nonlinear parameter estimation, whereby the flux parameters in the isotopomer model are iteratively adjusted until the best fit with experimental data is achieved.
Figure 2.4 13C flux estimation procedure. The process is initialized by selecting a set of “free” fluxes that span the nullspace of the stoichiometric matrix. Starting guesses are provided for all free fluxes to seed the procedure. This enables calculation of all linearly dependent fluxes through stoichiometric mass balancing. The flux values are then substituted into the isotopomer balances to simulate measurable isotopomer abundances. The sum-of-squared residuals (SSR) is calculated, which represents the total deviation between all measured quantities and their model-predicted values. An optimization algorithm is then applied to iteratively adjust the values of the free fluxes until the SSR is minimized. At this point, the flux solution is recovered, and various statistical tests can be applied to assess the goodness of fit and to estimate the uncertainties of the regressed parameters.
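As a minimal numerical illustration of this regression loop (a deliberately trivial stand-in for a real isotopomer model, with hypothetical MIDs), the fraction of flux through one of two pathways in the spirit of Figure 2.3 could be fitted like this:

```python
# Toy illustration of the 13C MFA regression loop (not the chapter's model):
# simulate the measurable labeling for candidate fluxes, then minimize the SSR.
import numpy as np
from scipy.optimize import least_squares

# Hypothetical MIDs (M0, M1, M2, M3) of a three-carbon product formed exclusively
# via pathway A or exclusively via pathway B (cf. Figure 2.3).
MID_A = np.array([0.0, 1.0, 0.0, 0.0])      # pathway A yields singly labeled product
MID_B = np.array([0.0, 0.0, 0.0, 1.0])      # pathway B yields fully labeled product

measured = np.array([0.02, 0.68, 0.01, 0.29])   # hypothetical measured MID

def residuals(theta):
    f = theta[0]                                 # fraction of flux through pathway A
    simulated = f * MID_A + (1.0 - f) * MID_B    # "forward simulation" step
    return simulated - measured                  # lack of fit

fit = least_squares(residuals, x0=[0.5], bounds=(0.0, 1.0))
print("estimated flux fraction through pathway A:", round(float(fit.x[0]), 3))
print("SSR at the optimum:", 2.0 * fit.cost)     # least_squares reports 0.5 * SSR

# In a real 13C MFA model the simulation step solves the isotopomer (or EMU)
# balances for the full network, and the free parameters are the free fluxes.
```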
Because the isotopomer model must be solved up to hundreds or even thousands of times during the flux regression to achieve an optimal fit, a great deal of effort has been placed on developing improved strategies to simulate ILEs. The process of constructing balance equations that describe the relative abundance of measurable isotopomers – first using 1H-NMR positional 13C enrichment measurements and later using 13C-NMR or MS measurements – was systematized by the introduction of atom mapping matrices [10] and isotopomer mapping matrices [11], respectively. This allowed isotopomer models of arbitrary complexity to be automatically constructed from a list of precursor–product relationships. Wiechert et al. achieved a major breakthrough when they showed how the inherently nonlinear isotopomer balances could be transformed into a cascaded system of linear equations by converting from isotopomer variables to so-called cumomer variables [12]. They developed the first generalized software for isotopomer modeling that was made publicly available, called 13CFLUX, which relied on cumomer balancing to simulate ILEs [13]. Wiechert [14] provides an excellent review of the state of the art and historical development of 13C MFA up to 2001. In the following sections, we outline some of the major theoretical advances that have occurred in the past 15 years since that review appeared.
While the cumomer method provides substantial computational savings by transforming a nonlinear system of isotopomer balances into a cascaded system of linear equations, the total number of isotopomer/cumomer variables needed to simulate ILEs for a given metabolic network is the same. For any single metabolite, the number of possible isotopomers grows exponentially with the number of potentially labeled atoms (e.g., a metabolite with N carbon atoms that can be either 12C or 13C possesses 2^N isotopomers). For carbon labeling experiments, this results in large, yet manageable, systems of isotopomer balance equations since most central metabolites have ≤6 carbon atoms and therefore ≤64 possible isotopomers. On the other hand, in situations where multiple heavy atoms are administered simultaneously (e.g., mixtures of 13C- and 2H-labeled tracers), the number of isotopomers associated with each metabolite can grow into the thousands or even millions. Therefore, a new framework is required to reduce the computational burden of modeling these more sophisticated labeling experiments.
The elementary metabolite unit (EMU) approach was developed by Antoniewicz et al. [15] to address precisely this problem of combinatorial explosion. Through a novel path-tracing algorithm, the method systematically identifies the minimal set of atom groups (so-called EMUs) required to simulate the available isotopomer measurements. By constructing balance equations only for those EMUs that contribute to measurable outputs of the model, the number of variables can be minimized without any loss of information. Indeed, it can be shown that measurements simulated by the EMU and cumomer approaches are numerically identical. Furthermore, the balance equations relating the labeling of EMUs to pathway fluxes can be arranged into a cascaded series of linear equations, much like the foregoing cumomer approach. Thus, the EMU method maintains the efficiencies achieved by the cumomer method while providing further computational savings [15, 16]. In addition to modeling mixed-tracer ILEs, the reduction in system size has obvious benefits when modeling other complex ILEs such as those required for INST-MFA [17]. Several software packages based on the EMU method are now publicly available: Metran [18], OpenFLUX [19], and INCA [20], to name a few.
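A key property that makes the EMU bookkeeping tractable (shown here as a sketch of one building block, not the full algorithm) is that when an EMU is formed by condensation of two smaller EMUs, its mass isotopomer distribution is the convolution of the reactant MIDs:

```python
# One building block of EMU simulation: the MID of an EMU formed by condensation
# of two smaller EMUs is the convolution of their MIDs (hypothetical numbers).
import numpy as np

mid_2carbon = np.array([0.5, 0.5, 0.0])   # 2-carbon EMU: 50% M0, 50% M1
mid_1carbon = np.array([0.9, 0.1])        # 1-carbon EMU: 10% labeled

mid_3carbon = np.convolve(mid_2carbon, mid_1carbon)
print(mid_3carbon)                        # [0.45, 0.5, 0.05, 0.0] -> M0..M3

# The full method assembles linear balance equations for all required EMUs,
# grouped by size, and solves them as a cascade from small to large EMUs.
```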
While the solution of the forward tracer simulation (i.e., calculation of isotopomer measurements from a given set of flux values) is always uniquely determined for well-defined networks, the inverse problem (i.e., regression of fluxes from isotopomer measurements) may not possess a unique solution [21, 22]. Furthermore, because fluxes are determined by a least-squares approach, not all fluxes can be estimated with equal precision. Therefore, it is imperative to not only assign values to all observable flux parameters but also to determine their uncertainties. The most expedient way to estimate flux uncertainties is to compute the flux covariance matrix at the optimal flux solution [23, 24]. The diagonal elements of this matrix correspond to the variances of the associated flux values. However, the covariance matrix reflects a linearization of the underlying isotopomer balance equations and therefore provides only local estimates of parameter uncertainties. An alternative approach involves Monte Carlo analysis of synthetic datasets derived from the optimal flux solution [9]. This can be a computationally demanding approach, since in practice one should continue generating new synthetic datasets and refitting until the desired uncertainty metric stabilizes. Therefore, many rounds of Monte Carlo analysis are typically required before the corresponding “tail” probabilities finally converge.
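For the linearized (covariance-matrix) route, the computation amounts to evaluating the residual Jacobian at the optimum. A generic sketch with hypothetical numbers, assuming uncorrelated measurements with a common variance, is:

```python
# Local (linearized) flux uncertainty: Cov ~ sigma^2 * (J^T J)^-1, where J is the
# Jacobian of the residuals with respect to the free fluxes at the optimal solution.
# The Jacobian and measurement variance below are hypothetical.
import numpy as np

J = np.array([[ 1.0, -0.5],
              [ 0.2,  1.1],
              [-0.7,  0.3],
              [ 0.4,  0.9]])              # d(residual_i) / d(free flux_j)
sigma2 = 0.01 ** 2                        # assumed measurement variance

cov = sigma2 * np.linalg.inv(J.T @ J)     # linearized flux covariance matrix
print("flux standard errors:", np.sqrt(np.diag(cov)))
```

A least-squares solver such as scipy.optimize.least_squares exposes an approximation of this Jacobian at the solution (fit.jac), whereas the Monte Carlo alternative instead repeats the entire regression on many noise-perturbed synthetic datasets.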
A different approach was taken by Antoniewicz et al. [25], who performed a sensitivity analysis by systematically perturbing each individual flux parameter. By gradually varying a given parameter away from its optimal value while adjusting the remaining parameters to minimize the sum-of-squared residuals (SSR), one can computationally trace the envelope of points that describes the sensitivity of the SSR to the varied parameter. The parameter continuation approach of Antoniewicz et al. [25] is more computationally efficient than Monte Carlo when applied to typical 13C MFA models. A drawback, however, is that the sensitivity calculation must be performed for each fitted parameter one at a time. Suthers et al. [16] have partially addressed this problem through the application of flux coupling analysis. Using their method, confidence intervals need to be calculated only once for each group of fully coupled fluxes, thus reducing the total number of computations required. Parallelization of the sensitivity analysis algorithm has also been achieved [20], which can result in dramatic time savings when implemented within a cluster computing environment.
The precision with which a particular flux can be estimated is determined by the sensitivity of the available measurements to the flux in question, which is a function of (i) the isotope tracer applied, (ii) the atom transitions within the metabolic network, (iii) the intracellular flux distribution, and (iv) the available isotopomer measurements. Since (ii) and (iii) are not under the control of the experimenters, the key elements of experimental design involve choosing appropriate combinations of (i) and (iv) to determine the fluxes of interest. Until recently, the prevailing philosophy has been to measure as many metabolites as possible within the pathways of interest using whichever analytical techniques (e.g., NMR, gas chromatography-mass spectrometry (GC-MS), liquid chromatography-mass spectrometry (LC-MS), etc.) can be practically implemented by the experimenters. Therefore, the focus of experimental design has been on choosing a labeling strategy that will maximize the capacity of these isotopomer measurements to differentiate between alternative flux states. In the earliest examples, tracer substrates were chosen heuristically, leading to fairly simple labeling schemes. Typically, this involved either (i) feeding [1-13C]glucose with 1H-NMR measurements of amino acid positional 13C enrichments [26] or (ii) feeding a mixture of [U-13C6]glucose and unlabeled glucose with 2D [13C,1H]-correlation spectroscopy (COSY) NMR measurements of amino acid isotopomer abundances [27–29]. Schmidt et al. [9] were the first to introduce a mixture of differentially labeled substrates to determine fluxes within a comprehensive 13C MFA model. They applied a combination of [1-13C]glucose and [U-13C6]glucose to quantify the intracellular flux distribution of recombinant Aspergillus niger using 2D NMR of hydrolyzed protein and chitin components.
The increased precision of flux estimates obtainable from combining differentially labeled substrates, whether fed simultaneously or in parallel experiments, is now widely understood and accepted. However, it is not always appreciated that the optimal tracer combination can depend strongly on the network topology and the available measurements, which vary from system to system. This implies that there is a need to tailor the labeling strategy to the system of interest rather than naïvely following prior conventions. The first systematic treatment of optimal design of ILEs was introduced by Möllney et al. [30]. Their approach was built upon the classical theory of optimal experiment design (OED), which recognizes that the most important statistical property describing parameter identifiability – at least in the neighborhood of the best-fit solution – is the covariance matrix of the estimated parameters. An important feature of this matrix is that it can be approximated a priori (i.e., in the absence of experimental data) if initial flux estimates are available. In order to apply standard optimization algorithms, it is necessary to define a scalar measure related to the covariance matrix that can serve as the objective function to be optimized. Typically, this objective is computed from the determinant (D-criterion) or trace (A-criterion) of the covariance matrix. By selecting the labeling scheme or tracer combination that optimizes the chosen objective function, it is possible to systematically compute an experimental design that is expected to maximize flux identifiability. Although the optimal design will change depending upon the actual flux values, Möllney et al. [30] found that the optimal design depends only weakly on the initial flux values chosen. Therefore, this procedure can be used to derive a satisfactory initial labeling strategy, which can be further refined as experimental data are obtained. Several authors have since applied the approach of Möllney et al. [30] to examine flux identifiability in a variety of systems ranging from bacterial cultures [24, 31] to plant seeds [32, 33].
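As a small illustration of these scalar design criteria (with hypothetical a priori covariance matrices for two candidate tracer mixtures, not values from the cited studies):

```python
# Comparing two candidate tracer designs by the classical D- and A-criteria.
# The predicted flux covariance matrices are hypothetical; in practice they are
# approximated a priori from measurement sensitivities at an initial flux estimate.
import numpy as np

cov_design_1 = np.array([[0.04, 0.01],
                         [0.01, 0.09]])   # predicted covariance, tracer mixture 1
cov_design_2 = np.array([[0.02, 0.00],
                         [0.00, 0.20]])   # predicted covariance, tracer mixture 2

for name, cov in [("design 1", cov_design_1), ("design 2", cov_design_2)]:
    print(name,
          "D-criterion (det):", round(float(np.linalg.det(cov)), 4),
          "A-criterion (trace):", round(float(np.trace(cov)), 4))

# The design minimizing the chosen criterion (smallest joint or average flux
# uncertainty) is selected and can be refined once real data become available.
```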
Although classical OED approaches are based on a scalar performance criterion derived from the parameter covariance matrix, in some cases it may be undesirable to condense all system information into a single number. Furthermore, the covariance matrix only provides a local estimate of flux uncertainty. For these reasons, several alternative approaches have been developed to investigate the identifiability of 13C MFA systems [22, 34, 35]. Metallo et al. [36] examined flux identifiability of a carcinoma cell line using a variety of single tracers. Rather than using the parameter covariance matrix as a local estimate of flux uncertainty, they applied the parameter continuation method of Antoniewicz et al. [25] to compute more accurate nonlinear confidence intervals on all fluxes. They identified [1,2-13C2]glucose and [U-13C5]glutamine as the most useful single tracers for flux determination in glycolysis and citric acid cycle (CAC) pathways, respectively. Walther et al. [37] later extended this approach to examine mixtures of tracers identified using a genetic algorithm.
Another recent development was the introduction of the elementary metabolite unit basis vector (EMU-BV) approach, which can be used to express the labeling of any metabolite in the network as a linear combination of EMUs [38]. This allows the influence of fluxes on the isotopomer measurements to be decoupled from substrate labeling, and thereby enables a fully a priori approach to tracer selection that does not depend on the choice of a reference flux map. Crown et al
