The definitive introduction to data analysis in quantitative proteomics
This book provides all the necessary knowledge about mass spectrometry based proteomics methods and computational and statistical approaches to pursue the planning, design and analysis of quantitative proteomics experiments. The author’s carefully constructed approach allows readers to easily make the transition into the field of quantitative proteomics. Through detailed descriptions of wet-lab methods, computational approaches and statistical tools, this book covers the full scope of a quantitative experiment, allowing readers to acquire new knowledge as well as acting as a useful reference work for more advanced readers.
Computational and Statistical Methods for Protein Quantification by Mass Spectrometry:
With clear and thorough descriptions of the various methods and approaches, this book is accessible to biologists, informaticians, and statisticians alike and is aimed at readers across the academic spectrum, from advanced undergraduate students to post doctorates entering the field.
Number of pages: 510
Year of publication: 2012
Contents
Cover
Title Page
Copyright
Preface
Terminology
Acknowledgements
Chapter 1: Introduction
1.1 The composition of an organism
1.2 Homeostasis, physiology, and pathology
1.3 Protein synthesis
1.4 Site, sample, state, and environment
1.5 Abundance and expression – protein and proteome profiles
1.6 The importance of exact specification of sites and states
1.7 Relative and absolute quantification
1.8 In vivo and in vitro experiments
1.9 Goals for quantitative protein experiments
1.10 Exercises
Chapter 2: Correlations of mRNA and protein abundances
2.1 Investigating the correlation
2.2 Codon bias
2.3 Main results from experiments
2.4 The ideal case for mRNA-protein comparison
2.5 Exploring correlation across genes
2.6 Exploring correlation within one gene
2.7 Correlation across subsets
2.8 Comparing mRNA and protein abundances across genes from two situations
2.9 Exercises
2.10 Bibliographic notes
Chapter 3: Protein level quantification
3.1 Two-dimensional gels
3.2 Protein arrays
3.3 Western blotting
3.4 ELISA – Enzyme-Linked Immunosorbent Assay
3.5 Bibliographic notes
Chapter 4: Mass spectrometry and protein identification
4.1 Mass spectrometry
4.2 Isotope composition of peptides
4.3 Presenting the intensities – the spectra
4.4 Peak intensity calculation
4.5 Peptide identification by MS/MS spectra
4.6 The protein inference problem
4.7 False discovery rate for the identifications
4.8 Exercises
4.9 Bibliographic notes
Chapter 5: Protein quantification by mass spectrometry
5.1 Situations, protein, and peptide variants
5.2 Replicates
5.3 Run – experiment – project
5.4 Comparing quantification approaches/methods
5.5 Classification of approaches for quantification using LC-MS/MS
5.6 The peptide (occurrence) space
5.7 Ion chromatograms
5.8 From peptides to protein abundances
5.9 Protein inference and protein abundance calculation
5.10 Peptide tables
5.11 Assumptions for relative quantification
5.12 Analysis for differentially abundant proteins
5.13 Normalization of data
5.14 Exercises
5.15 Bibliographic notes
Chapter 6: Statistical normalization
6.1 Some illustrative examples
6.2 Non-normally distributed populations
6.3 Testing for normality
6.4 Outliers
6.5 Variance inequality
6.6 Normalization and logarithmic transformation
6.7 Exercises
6.8 Bibliographic notes
Chapter 7: Experimental normalization
7.1 Sources of variation and level of normalization
7.2 Spectral normalization
7.3 Normalization at the peptide and protein level
7.4 Normalizing using sum, mean, and median
7.5 MA-plot for normalization
7.6 Local regression normalization – LOWESS
7.7 Quantile normalization
7.8 Overfitting
7.9 Exercises
7.10 Bibliographic notes
Chapter 8: Statistical analysis
8.1 Use of replicates for statistical analysis
8.2 Using a set of proteins for statistical analysis
8.3 Missing values
8.4 Prediction and hypothesis testing
8.5 Statistical significance for multiple testing
8.6 Exercises
8.7 Bibliographic notes
Chapter 9: Label based quantification
9.1 Labeling techniques for label based quantification
9.2 Label requirements
9.3 Labels and labeling properties
9.4 Experimental requirements
9.5 Recognizing corresponding peptide variants
9.6 Reference free vs. reference based
9.7 Labeling considerations
9.8 Exercises
9.9 Bibliographic notes
Chapter 10: Reporter based MS/MS quantification
10.1 Isobaric labels
10.2 iTRAQ
10.3 TMT – Tandem Mass Tag
10.4 Reporter based quantification runs
10.5 Identification and quantification
10.6 Peptide table
10.7 Reporter based quantification experiments
10.8 Exercises
10.9 Bibliographic notes
Chapter 11: Fragment based MS/MS quantification
11.1 The label masses
11.2 Identification
11.3 Peptide and protein quantification
11.4 Exercises
11.5 Bibliographic notes
Chapter 12: Label based quantification by MS spectra
12.1 Different labeling techniques
12.2 Experimental setup
12.3 MaxQuant as a model
12.4 The MaxQuant procedure
12.5 Exercises
12.6 Bibliographic notes
Chapter 13: Label free quantification by MS spectra
13.1 An ideal case – two protein samples
13.2 The real world
13.3 Experimental setup
13.4 Forms
13.5 The quantification process
13.6 Form detection
13.7 Pair-wise retention time correction
13.8 Approaches for form-tuple detection
13.9 Pair-wise alignment
13.10 Using a reference run for alignment
13.11 Complete pair-wise alignment
13.12 Hierarchical progressive alignment
13.13 Simultaneous iterative alignment
13.14 The end result and further analysis
13.15 Exercises
13.16 Bibliographic notes
Chapter 14: Label free quantification by MS/MS spectra
14.1 Abundance measurements
14.2 Normalization
14.3 Proposed methods
14.4 Methods for single abundance calculation
14.5 Methods for relative abundance calculation
14.6 Comparing methods
14.7 Improving the reliability of spectral count quantification
14.8 Handling shared peptides
14.9 Statistical analysis
14.10 Exercises
14.11 Bibliographic notes
Chapter 15: Targeted quantification – Selected Reaction Monitoring
15.1 Selected Reaction Monitoring – the concept
15.2 A suitable instrument
15.3 The LC-MS/MS run
15.4 Label free and label based quantification
15.5 Requirements for SRM transitions
15.6 Finding optimal transitions
15.7 Validating transitions
15.8 Assay development
15.9 Exercises
15.10 Bibliographic notes
Chapter 16: Absolute quantification
16.1 Performing absolute quantification
16.2 Label based absolute quantification
16.3 Label free absolute quantification
16.4 Exercises
16.5 Bibliographic notes
Chapter 17: Quantification of post-translational modifications
17.1 PTM and mass spectrometry
17.2 Modification degree
17.3 Absolute modification degree
17.4 Relative modification degree
17.5 Discovery based modification stoichiometry
17.6 Exercises
17.7 Bibliographic notes
Chapter 18: Biomarkers
18.1 Evaluation of potential biomarkers
18.2 Evaluating threshold values for biomarkers
18.3 Exercises
18.4 Bibliographic notes
Chapter 19: Standards and databases
19.1 Standard data formats for (quantitative) proteomics
19.2 Databases for proteomics data
19.3 Bibliographic notes
Chapter 20: Appendix A: Statistics
20.1 Samples, populations, and statistics
20.2 Population parameter estimation
20.3 Hypothesis testing
20.4 Performing the test – test statistics and p-values
20.5 Comparing means of populations
20.6 Comparing variances
20.7 Percentiles and quantiles
20.8 Correlation
20.9 Regression analysis
20.10 Types of values and variables
Chapter 21: Appendix B: Clustering and discriminant analysis
21.1 Clustering
21.2 Discriminant analysis
21.3 Bibliographic notes
Bibliography
Index
This edition first published 2013 © 2013 John Wiley & Sons, Ltd
Registered office: John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom
For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com.
The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.
Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book. This publication is designed to provide accurate and authoritative information in regard to the subject matter covered. It is sold on the understanding that the publisher is not engaged in rendering professional services. If professional advice or other expert assistance is required, the services of a competent professional should be sought.
Library of Congress Cataloging-in-Publication Data applied for.
A catalogue record for this book is available from the British Library.
ISBN: 978-1-119-96400-1
Preface
Mass spectrometry based proteomics has quickly become the method of choice for the high-throughput analysis of entire proteomes. Through the advent of a variety of powerful approaches for quantitative proteomics, these experiments now also yield large volumes of quantitative data on protein expression levels.
The ability to quantify a proteome in high-throughput opens up many new possibilities for research, including protein biomarker discovery, fundamental insight into the dynamics of a cell or tissue proteome over time, and analysis of expression changes as a function of (external) perturbations. Such analyses in turn benefit both biological insight and clinical applications, and form a crucial element of the emerging field of systems biology.
Adding quantitative data to the already complex output of a proteomics experiment does, however, further increase the importance of correct data processing and results evaluation. As such, the computational methods employed to analyze such data are increasingly seen as a crucial step in the overall workflow.
With this in mind, the intention of this book is to systematize and describe the different experimental approaches for protein quantification by mass spectrometry, and to present the corresponding computational and statistical methods used to analyze data from such experiments.
The first three chapters act as an introduction, introducing terms, concepts, and protein quantification as well as its relation to mRNA quantification. Chapter 4 then provides a brief introduction to mass spectrometry based proteomics, of which a more detailed description can be found in Eidhammer et al. (2007). Chapter 5 systematizes and classifies the different quantification approaches that are described in more detail in Chapters 9 to 15, each of which is dedicated to a specific approach. Statistical data processing and analysis for quantitative proteomics experiments form the subject of Chapters 6, 7, and 8. The final four chapters (16 to 19) deal with more specific quantification tasks and a few orthogonal topics associated with protein quantification.
Given that statistics form an essential part of the computational methods used to process, analyze, and interpret quantitative mass spectrometry data, we have also included two appendices that can be used as a reference on the statistical knowledge required for a complete understanding of the methods described in the various chapters.
It is not the goal of this book to provide an exhaustive theoretical foundation for the computational analysis of quantitative proteomics, but rather to present the main challenges, and extract and systematize the common principles used in solving these problems. The presentation is illustrated throughout by numerous figures and examples.
We have tried to restrict the description of a given subject to one section or chapter, but for some of the subjects it is unavoidable to treat them several times in different contexts. In these cases we have included the necessary cross references.
Note that we have also included many references and websites, although we are aware that websites can change rapidly.
The book is directed at biologists, (bio-)informaticians, and statisticians alike, and is aimed at readers across the academic spectrum, from advanced undergraduate students to post docs first entering the field of protein quantification.
Terminology
Many of the terms used in molecular biology and proteomics do not have a unique or commonly accepted definition. We have mainly tried to follow the IUPAC (International Union of Pure and Applied Chemistry) Compendium of analytical nomenclature. They have an ongoing revision of the terms used in the larger field of mass spectrometry, and a first draft exists.
IUPAC: http://www.iupac.org/publications/analytical_compendium
Ongoing revision: http://www.msterms.com/
First draft: http://www.sgms.ch/links/IUPAC_MS_Terms_Draft.pdf
We have endeavored to employ customarily used terms wherever possible, defining our own terms only where it was necessary or more appropriate for clarity. When several synonyms exist for a term, these are also given.
Acknowledgements
Numerous papers and websites have been used in the writing of this book, and citations have been given for these works. However, in order to enhance readability we have tried to minimize literature references in the text, and have instead included a bibliographic section at the end of each chapter. The sources we consulted are detailed there, as well as additional literature that may be interesting or relevant to the reader. It is our sincere hope that this structure will serve to provide the authors of the source literature with the appropriate acknowledgments.
In addition, there are a lot of people whom we have consulted during the work, and we especially want to thank Rein Aasland, Thin Thin Aye, Frode Berven, Niklaas Colaert, Sven Degroeve, Olav Mjaavatten, Jill Anette Opsahl, Eystein Oveland, Kjell Petersen, Pål Puntervoll, and An Staes for helpful discussions and for sharing their insights and profound knowledge.
IE dedicates his work to his wife Maria.
GEE thanks the Centre for Clinical Research, Haukeland University Hospital and the Department of Public Health and Primary Health Care, University of Bergen for approving and providing facilities to work with this book. He also thanks his colleagues in the Western Regional Health Authorities Statisticians Network and in the Life Style Epidemiology Research Group for an inspiring work environment, as well as his co-workers in numerous medical and health related research projects through the years. Finally, GEE sincerely thanks his wife Kirsti, his family and his friends for continuing love and support.
HB would like to thank his friends and colleagues at the Proteomics Unit at the Department of Biomedicine, and at the Department of Informatics / the Computational Biology Unit for numerous interesting discussions and collaborations over the last years. Thanks also to my international collaborators in the extended Computational Omics and Systems Biology group and the PRIDE team at the European Bioinformatics Institute. Our ongoing collaborations make it a lot easier to see how the work I do every day constitutes a small part of the bigger puzzle that is proteomics. Finally, I would like to thank my friends and family for their continuing support.
LM would like to thank Ghent University, VIB, and his Computational Omics and Systems Biology (CompOmics) group members for creating a vibrant and intellectually stimulating atmosphere, and for many useful discussions. Special thanks go out to my wife Leen, and two sons, Ruben and Alexander, for their patience and understanding during the long hours of writing. Ruben, at 4 years of age, was fond of asking why daddy was always working on his computer, while Alexander at 9 months was simply looking bemused at the incessant sound of fingers tapping away at the keyboard. For better or for worse, both of them will now have to learn to live with a lot more fatherly attention!
Chapter 3: Protein level quantification
Quantitative proteomics experiments have been performed for many years, using different methods and techniques. In this chapter we briefly describe the most common methods that do not rely on the combination of liquid chromatography and mass spectrometry.
3.1 Two-dimensional gels
In general, 2D SDS-PAGE gel quantification is based on the signal intensity of the spot in which the protein has been found. It is important to note that this intensity is an indirect measure, since protein spots are first stained or labeled in order to become visible, and it is the intensity of the stain or label that is subsequently measured. Obviously, spots containing more than one protein present challenges to any such quantitative study as the unexpected proteins add staining intensity that will be incorrectly assumed to be derived from the identified protein. And even if the spot is known to hold more than one protein, it is not possible to deconvolute the contributions of the individual proteins to the total amount of staining.
On the other hand, different variants such as truncated or modified forms of the protein are typically spread over different spots, making the global quantification of a protein’s expression level problematic. Indeed, such an analysis requires the addition of all intensities across all the different variant spots. Missed spots, that is, spots that could not be identified for whatever reason, will thus result in an underestimation of the quantification. Also remember that each individual spot has a chance of containing contaminant proteins that can contribute to staining intensity. The fact that different forms often localize in different spots can however be beneficial if one wants to compare different variants with one another, for example, comparing a phosphorylated form of a protein to its unphosphorylated form.
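The summation over variant spots described above can be sketched as follows. This is a minimal illustration, not a method taken from the book: the function name, the spot list, and the intensity values are all hypothetical, and a spot that could not be identified is represented by `None`.

```python
from collections import defaultdict


def total_protein_intensity(spots):
    """Sum stain intensities over all identified spots of each protein.

    `spots` is a list of (protein_id, intensity) pairs. A protein_id of
    None marks a missed spot, i.e. a spot that could not be identified.
    """
    totals = defaultdict(float)
    for protein_id, intensity in spots:
        if protein_id is None:
            # The intensity of a missed spot cannot be assigned to any
            # protein, so the affected protein is underestimated.
            continue
        totals[protein_id] += intensity
    return dict(totals)


# Hypothetical gel: P1 is spread over several variant spots, one spot is missed.
spots = [("P1", 1200.0), ("P1", 300.0), (None, 500.0), ("P2", 800.0)]
totals = total_protein_intensity(spots)
print(totals)
```

If the missed spot in this made-up example actually held a variant of P1, the reported total for P1 would be too low by the intensity of that spot, which is the underestimation effect noted in the text.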
Another difficulty for 2D gel based quantification lies in the usable range of the staining procedures used for quantification. Several staining procedures do not stain at all below a certain amount of protein present, and become saturated above a certain amount of protein present. The interval between these amounts can be considered ‘covered’ by the staining procedure and is referred to as the dynamic range.
In addition to high sensitivity, it is desirable to have a linear response in staining intensity versus amount of protein present across the dynamic range. Most nonfluorescent staining agents have a relatively poor dynamic range, with even the best nonfluorescent technique (silver staining) only yielding a dynamic range of about one order of magnitude. Fluorescent stains can attain the sensitivity of silver staining when used under optimal conditions, and can result in a much better dynamic range, providing up to five orders of magnitude. Fluorescent detection does come with its own caveats, however, including fading signals if the fluorophore is (slowly) decomposed by the influx of light, and interference from the fluorescence of fluorophores that become associated with detergent micelles rather than proteins.
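To make the notion of dynamic range concrete, the sketch below expresses it in orders of magnitude and models an idealized stain that gives no signal below a lower limit, responds linearly in between, and saturates above an upper limit. The function names, the strictly linear response, and the numeric limits are assumptions chosen for illustration; real stains deviate from this idealization.

```python
import math


def dynamic_range_orders(lower_limit, saturation_limit):
    """Dynamic range, in orders of magnitude, between the lowest amount of
    protein that stains and the amount at which the stain saturates."""
    if lower_limit <= 0 or saturation_limit <= lower_limit:
        raise ValueError("need 0 < lower_limit < saturation_limit")
    return math.log10(saturation_limit / lower_limit)


def stain_response(amount, lower_limit, saturation_limit, slope=1.0):
    """Idealized stain intensity: zero below the lower limit, linear within
    the dynamic range, constant above the saturation limit."""
    if amount < lower_limit:
        return 0.0
    return slope * min(amount, saturation_limit)


# Hypothetical limits in arbitrary amount units.
narrow = dynamic_range_orders(1.0, 10.0)       # about one order of magnitude
wide = dynamic_range_orders(1.0, 100000.0)     # about five orders of magnitude
print(narrow, wide)
print([stain_response(a, 1.0, 10.0) for a in (0.5, 5.0, 50.0)])
```

Amounts outside the covered interval are indistinguishable from one another in this model: everything below the lower limit reads as zero and everything above saturation reads as the same maximum.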
3.1.1 Comparing results from different experiments – DIGE
One of the most serious issues of 2D gel based quantification lies in the poor reproducibility of gels. As a result, comparing gels is not a simple process. A widely used technique to avoid the typical low reproducibility between different gels is the fluorescence-based Differential In-Gel Electrophoresis (DIGE) approach. In this technique, the proteins in the different samples are labeled with different fluorescent dyes having different excitation wavelengths. The samples are then mixed, and the proteins separated on the same 2D PAGE gel. Because of the different excitation wavelengths of the labels, separate gel patterns can be obtained for each sample. This effectively allows the samples to be run under identical circumstances, greatly limiting the factors that can cause variation, and the spots from the two samples are therefore much more directly comparable. Figure 3.1 illustrates the process.
Figure 3.1 Illustration of the DIGE procedure. Two samples plus a reference sample (usually obtained by pooling) are each labeled with different dyes and then mixed. A 2D gel is then used to separate the proteins. Using different excitation wavelengths of light, three different gel images can be obtained from the same gel, one for each sample.
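One way the three images from a single DIGE gel might be combined is sketched below: each sample spot is expressed as a log2 ratio against the pooled reference channel, and the two samples are then compared on that scale. The text does not prescribe this calculation, so treat it as an assumption; the spot identifiers and intensities are hypothetical.

```python
import math


def log2_ratios_to_reference(sample, reference):
    """Per-spot log2(sample / reference) for spots with a positive
    intensity in both the sample image and the reference image."""
    return {
        spot: math.log2(sample[spot] / reference[spot])
        for spot in sample
        if spot in reference and sample[spot] > 0 and reference[spot] > 0
    }


# Hypothetical spot intensities read from the three images of one gel.
reference = {"s1": 100.0, "s2": 200.0, "s3": 50.0}
sample_a = {"s1": 200.0, "s2": 200.0, "s3": 25.0}
sample_b = {"s1": 100.0, "s2": 400.0, "s3": 25.0}

ratios_a = log2_ratios_to_reference(sample_a, reference)
ratios_b = log2_ratios_to_reference(sample_b, reference)

# Difference of the log ratios: positive means higher in sample A than in B.
difference = {
    spot: ratios_a[spot] - ratios_b[spot]
    for spot in ratios_a
    if spot in ratios_b
}
print(difference)
```

Because all three channels come from the same gel, the spot positions coincide and no matching between gels is needed, which is the advantage the DIGE design is aiming for.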
3.2 Protein arrays
Another often-used method for discovery oriented protein quantification is provided by protein arrays. The use of protein arrays can be explained by considering three components:
a glass plate or slide in a regular grid;
a set of capture molecules that are fixed to the plate;
a set of binding molecules that are to be bound to the capture molecules.
The capture molecules are typically antibodies, proteins, or peptides, although other (macro-)molecules such as DNA can be used as well. The different types of capture molecules enable different types of analyses to be carried out. We can divide the protein arrays into forward arrays and reverse arrays.
3.2.1 Forward arrays
In forward arrays the proteins to be analyzed are the binding molecules.
Antibody arrays
In this design the capture molecules are a collection of antibodies. Upon sample application, these antibodies bind the proteins they were raised against, and bound proteins can subsequently be detected. Despite the straightforward design, there are several problems with this type of array. First and foremost, there must be a sufficiently specific and sensitive antibody available for each protein to be measured. This is for instance not the case for every human protein, let alone for less well-studied species. Furthermore, antibodies tend to be expensive, and can often suffer from nonspecific binding (leading to a falsely exaggerated signal) or poor binding efficiency (leading to a falsely subdued signal). Indeed, antibodies that function quite well in simpler situations such as Western blotting (see Section 3.3) may not function as well when confronted with a whole-proteome mixture. Since proteins cover a very broad range of physicochemical parameters, it is furthermore difficult to optimize binding and washing conditions to ensure reasonable capture of all proteins.
Other types of forward arrays
Since antibody arrays can be challenging from a practical perspective, other types of forward arrays have been developed, where the capture molecules consist of peptides, small molecules or stretches of DNA sequence. Usually, these arrays are designed to capture a specific subset of proteins, such as DNA-binding, or drug-binding proteins.
3.2.2 Reverse arrays
As the name implies, reverse arrays take the opposite approach, and the proteins to be analyzed are the capture molecules. The objective here is often antibody profiling, where a sample of plasma or serum from a patient is run over the array to detect antibodies against the spotted proteins. Such arrays are particularly useful for the detection of auto-antibodies, which are aimed at the patient’s own proteins and form important factors in a variety of auto-immune diseases. These arrays are more easily produced than antibody arrays, but still pose the challenge that the relevant proteins must all be (recombinantly) expressed and purified. Reverse arrays that cover most of the proteome of various model organisms, and of course humans, are now commercially available.
3.2.3 Detection of binding molecules
Detection of the binding molecules is usually carried out using fluorescence, although more advanced techniques such as surface plasmon resonance and atomic force microscopy can also be used. In the case of fluorescent detection, the fluorophore can either be attached directly to the sample proteins after lysis, or it can be added after capture through the use of a secondary antibody in so-called sandwich designs.
3.2.4 Analysis of protein array readouts
Since protein arrays closely resemble RNA microarrays, the processing of readout data is commonly handled in the same fashion. Protein arrays typically follow design guidelines copied from RNA microarrays as well, including the use of on-array replicate spots that are scattered over the array.
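As a small illustration of how on-array replicate spots could be used, the sketch below summarizes the replicates of each protein by their median and reports the coefficient of variation as a rough check on spot-to-spot consistency. The choice of median and CV, the function name, and the readout values are assumptions for illustration, not a procedure given in the text.

```python
from statistics import mean, median, stdev


def summarize_replicate_spots(readout):
    """Summarize replicate spot intensities per protein.

    `readout` maps a protein identifier to the list of intensities read
    from its replicate spots. The coefficient of variation (CV) is None
    when fewer than two replicates exist or the mean is zero.
    """
    summary = {}
    for protein, values in readout.items():
        cv = None
        if len(values) > 1 and mean(values) != 0:
            cv = stdev(values) / mean(values)
        summary[protein] = {"median": median(values), "cv": cv}
    return summary


# Hypothetical readout: P1 has three replicate spots, P2 only one.
readout = {"P1": [10.0, 12.0, 11.0], "P2": [5.0]}
summary = summarize_replicate_spots(readout)
print(summary)
```

A large CV for a protein suggests that its replicate spots disagree, for example because of local artifacts on the slide, and that its summarized value should be treated with caution.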
3.3 Western blotting
Where 2D gel electrophoresis and protein arrays provide discovery oriented methods used to detect and quantify as many proteins as possible in a sample, Western blotting provides a targeted means to quantify a single protein in a sample. In Western blotting, the protein complement of a sample after lysis is typically separated on a single gel dimension, and the separated proteins are subsequently transferred (blotted) onto a membrane with a certain affinity for proteins (typically nitrocellulose or polyvinylidene fluoride (PVDF)).
This membrane is then probed with an affinity reagent that binds specifically to the protein of interest. This affinity reagent is most commonly an antibody specific to the protein. Detection of the bound antibody can be performed through a secondary antibody carrying a linked enzyme, or directly through a linked enzyme on the primary antibody.
The linked enzyme is chosen so that it performs an observable reaction, most commonly horseradish peroxidase converting a chemiluminescent reagent, producing detectable light in proportion to the amount of bound enzyme, which in turn depends on the amount of protein. Other types of detection are possible, for instance using antibodies connected to a fluorophore, or using radioactively labeled antibodies. This last method, however, is expensive, laborious, and dangerous, and is only used in those cases where exquisite sensitivity is required.
3.4 ELISA – Enzyme-Linked Immunosorbent Assay
Another popular means of targeted protein quantification is offered by ELISA (Enzyme-Linked Immunosorbent Assay). Here, a whole proteome lysate is affixed to a solid substrate (often a microtiter plate) and is then probed by a specific affinity reagent, usually an antibody. This affinity reagent is then either directly or indirectly (via a secondary antibody for instance) detected through a coupled enzyme (often horseradish peroxidase, similar to Western blotting).
ELISA is often used in clinical assays, and the development of an ELISA is therefore often the endpoint of a biomarker discovery pipeline.
3.5 Bibliographic notes
