This book grew out of an online interactive course offered through statcourse.com. It soon became apparent to the author that the course was too limited in time and length, given the broad backgrounds of the enrolled students. The statisticians who took the course needed to be brought up to speed both on the biological context and on the specialized statistical methods needed to handle large arrays. Biologists and physicians, even though fully knowledgeable about the procedures used to generate microarrays, EEGs, or MRIs, needed a full introduction to the resampling methods (the bootstrap, decision trees, and permutation tests) before the specialized methods applicable to large arrays could be introduced. The intended audience consists of statisticians, medical and biological research workers, and all those research workers who make use of satellite imagery, including agronomists and meteorologists. The book therefore provides a step-by-step approach not only to the specialized methods needed to analyze the data from microarrays and images, but also to the resampling methods, step-down multi-comparison procedures, multivariate analysis, and data collection and pre-processing. While many alternative techniques for analysis have been introduced in the past decade, the author has selected only those techniques for which software is available, along with a list of the links from which the software may be purchased or downloaded without charge. Topical coverage includes: very large arrays; permutation tests; applying permutation tests; gathering and preparing data for analysis; multiple tests; bootstrap; applying the bootstrap; classification methods; decision trees; and applying decision trees.
Page count: 191
Year of publication: 2011
Table of Contents
Title Page
Copyright
Preface
Chapter 1: Very Large Arrays
1.1 Applications
1.2 Problems
1.3 Solutions
Chapter 2: Permutation Tests
2.1 Two-Sample Comparison
2.2 k-Sample Comparison
2.3 Computing The p-Value
2.4 Multiple-Variable Comparisons
2.5 Categorical Data
2.6 Software
Chapter 3: Applying the Permutation Test
3.1 Which Variables Should Be Included?
3.2 Single-Value Test Statistics
3.3 Recommended Approaches
3.4 To Learn More
Chapter 4: Biological Background
4.1 Medical Imaging
4.2 Microarrays
4.3 To Learn More
Chapter 5: Multiple Tests
5.1 Reducing the Number of Hypotheses to Be Tested
5.2 Controlling the Overall Error Rate
5.3 Controlling the False Discovery Rate
5.4 Gene Set Enrichment Analysis
5.5 Software for Performing Multiple Simultaneous Tests
5.7 To Learn More
Chapter 6: The Bootstrap
6.1 Samples and Populations
6.2 Precision of an Estimate
6.3 Confidence Intervals
6.4 Determining Sample Size
6.5 Validation
6.6 Building a Model
6.7 How Large Should The Samples Be?
6.9 To Learn More
Chapter 7: Classification Methods
7.1 Nearest Neighbor Methods
7.2 Discriminant Analysis
7.3 Logistic Regression
7.4 Principal Components
7.5 Naive Bayes Classifier
7.6 Heuristic Methods
7.7 Decision Trees
7.8 Which Algorithm Is Best for Your Application?
7.9 Improving Diagnostic Effectiveness
7.10 Software for Decision Trees
Chapter 8: Applying Decision Trees
8.1 Photographs
8.2 Ultrasound
8.3 MRI Images
8.4 EEGs and EMGs
8.5 Misclassification Costs
8.6 Receiver Operating Characteristic
8.7 When the Categories are as Yet Undefined
8.8 Ensemble Methods
8.9 Maximally Diversified Multiple Trees
8.10 Putting it all Together
8.12 To Learn More
Glossary of Biomedical Terminology
Glossary of Statistical Terminology
Appendix: An R Primer
R.1 Getting Started
R.2 Store and Retrieve Data
R.3 Resampling
R.4 Expanding R's Capabilities
Bibliography
Author Index
Subject Index
Copyright © 2010 by John Wiley & Sons, Inc. All rights reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey.
Published simultaneously in Canada.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4744. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.
Library of Congress Cataloging-in-Publication Data:
Good, Phillip I.
Analyzing the large number of variables in biomedical and satellite imagery/Phillip I. Good.
p. cm. Includes bibliographical references and index.
ISBN 978-0-470-92714-4 (pbk.)
1. Data mining. 2. Mathematical statistics. 3. Biomedical engineering-Data processing.
4. Remote sensing-Data processing. I. Title.
QA76.9.D343G753 2011
066.3'12–dc22
2010030988
Preface
This text arose from a course I teach for http://statcourse.com on the specialized techniques required to analyze the very large data sets that arise in the study of medical images—EEGs, MEGs, MRI, fMRI, PET, ultrasound, and X-rays, as well as microarrays and satellite imagery.
The course participants included both biomedical research workers and statisticians, and it soon became obvious that while the one required a more detailed explanation of statistical methods, the other needed to know a great deal more about the biological context in which the data was collected.
Toward this end, the present text includes a chapter aimed at statisticians on the collection and preprocessing of biomedical data as well as a glossary of biological terminology. For biologists and physicians whose training in statistics may have been in a distant past, a glossary of statistical terminology with expanded definitions is provided.
You'll find that the chapters in this text are paired for the most part: An initial chapter that provides a detailed explanation of a statistical method is followed by one illustrating the application of the method to real-world data.
As a statistic without the software to make it happen is as useless as sheet music without an instrument to perform on, I have included links to the many specialized programs that may be downloaded from the Internet (in many cases without charge) as well as a number of program listings. As R is rapidly being adopted as the universal language for processing very large data sets, an R primer is also included in an appendix.
PHILLIP I. GOOD
HUNTINGTON BEACH CA
Chapter 1
Very Large Arrays
1.1 Applications
Very large arrays of data, that is, data sets for which the number of observations per subject may be an order of magnitude greater than the number of subjects that are observed, arise in genetics research (microarrays), neurophysiology (EEGs), and image analysis (ultrasound, MRI, fMRI, MEG, and PET maps, telemetry). Microarrays of as many as 22,000 genes may be collected from as few as 50 subjects. While EEG readings are collected from a relatively small number of leads, they are collected over a period of time, so that the number of observations per subject is equal to the number of leads times the number of points in time at which readings are taken. fMRI images of the brain can be literally four dimensional when the individual time series are taken into account.
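As a rough illustration of the sizes involved, the short Python sketch below multiplies out the dimensions of an array and compares the result with the number of subjects. Only the microarray figures (22,000 genes, 50 subjects) come from the text; the EEG lead count, the number of time points, and the factor of 10 used to stand in for "an order of magnitude" are hypothetical values chosen for illustration.

```python
def observations_per_subject(*dims):
    """Multiply the dimensions of one subject's array (e.g., leads x time points)."""
    total = 1
    for d in dims:
        total *= d
    return total


def is_very_large(n_subjects, per_subject, factor=10):
    """True if observations per subject exceed the subject count by `factor`.

    The default factor of 10 is an assumed reading of "an order of magnitude".
    """
    return per_subject >= factor * n_subjects


# Figures from the text: 22,000 genes measured on as few as 50 subjects.
microarray = observations_per_subject(22_000)

# Hypothetical EEG study: 32 leads, 5,000 time points per lead.
eeg = observations_per_subject(32, 5_000)

print(microarray, is_very_large(50, microarray))
print(eeg, is_very_large(50, eeg))
```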
In this chapter, we consider the problems that arise when we attempt to analyze such data, potential solutions to these problems, and our plan of attack in the balance of this book.
1.2 Problems
1. The limited number of subjects means that the precision of any estimate is equally limited. If n is the sample size, the precision of an estimate is roughly proportional to the square root of n; that is, its standard error shrinks only as 1/√n.
2. The large number of variables means that it is almost certain that changes in one or several of them will appear to be statistically significant purely by chance.
3. The large number of variables means that missing and/or erroneously recorded data is inevitable.
4. The various readings are not independent and identically distributed; rather, they are interdependent both in space and in time.
5. Measurements are seldom Gaussian (normally distributed), nor likely to adhere to any other well-tabulated distribution.
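The second problem in the list above can be made concrete with a little arithmetic. The Python sketch below assumes that every null hypothesis is true, that the tests are independent, and that each is run at a 5% significance level. The independence assumption is a simplification adopted only for illustration, since the fourth problem notes that real readings are interdependent. The function names and the 5% level are choices made here, not taken from the text.

```python
import random


def expected_false_positives(n_tests, alpha):
    """Expected number of tests declared significant when every null is true."""
    return n_tests * alpha


def prob_at_least_one(n_tests, alpha):
    """Chance of one or more false positives, assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def simulate_false_positives(n_tests, alpha, seed=0):
    """Count chance 'discoveries' by drawing one null p-value per test.

    Under a true null hypothesis a p-value is uniform on (0, 1), so each
    draw falls below alpha with probability alpha.
    """
    rng = random.Random(seed)
    return sum(rng.random() < alpha for _ in range(n_tests))


n_genes = 22_000   # microarray size quoted in Section 1.1
alpha = 0.05       # assumed per-test significance level

print(expected_false_positives(n_genes, alpha))   # about 1,100
print(prob_at_least_one(n_genes, alpha))          # indistinguishable from 1
print(simulate_false_positives(n_genes, alpha))   # one simulated count
```

Even in this idealized setting, roughly one gene in twenty would appear significant by chance alone, which is why the adjustments discussed later in the book are needed.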
1.3 Solutions
Solutions to these problems require all of the following.
Distribution-free methods—permutation tests, bootstrap, and decision trees—are introduced in Chapters 2, 6, and 7, respectively. Their application to very large arrays is the subject of Chapters 3, 6, and 8.
One might ask, why not use parametric tests? To which Karniski et al. (1994) would respond:
Utilizing currently available parametric statistical tests, there are essentially four methods that are frequently used to attempt to answer the question. One may combine data from multiple variables to reduce the number of variables, such as in principal component analysis. One may use multiple tests of single variables and then adjust the critical value.
One may use univariate tests, and then adjust the results for violation of the assumption of sphericity (in repeated measures design). Or one may use multivariate tests, so long as the number of subjects far exceeds the number of variables.
Methods for reducing the number of variables under review are also considered in Chapters 3, 5, and 8.
Methods for controlling significance levels and/or false discovery rates are discussed in Chapter 5.
Chapter 4, on gathering and preparing data, provides the biomedical background essential to those who will be analyzing very large data sets derived from medical images and microarrays.
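To give a flavor of the distribution-free methods mentioned above, here is a minimal Python sketch of a two-sample permutation test. It is an illustration under stated assumptions rather than the procedure developed in Chapter 2: the test statistic (absolute difference in means), the Monte Carlo approximation with 10,000 random rearrangements, and the sample data are all choices made here.

```python
import random
from statistics import fmean


def permutation_test(x, y, n_perm=10_000, seed=0):
    """Approximate two-sided permutation p-value for a difference in means.

    The labels are repeatedly shuffled across the pooled observations; the
    p-value is the share of rearrangements whose difference in means is at
    least as extreme as the one observed. No distributional form is assumed,
    only that the observations are exchangeable under the null hypothesis.
    """
    rng = random.Random(seed)
    observed = abs(fmean(x) - fmean(y))
    pooled = list(x) + list(y)
    n_x = len(x)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(fmean(pooled[:n_x]) - fmean(pooled[n_x:]))
        if diff >= observed:
            count += 1
    # Count the observed arrangement itself so the estimate is never zero.
    return (count + 1) / (n_perm + 1)


# Hypothetical measurements for two groups of five subjects each.
control = [10.1, 9.8, 10.4, 10.0, 9.9]
treated = [12.2, 11.9, 12.5, 12.1, 12.0]

print(permutation_test(control, treated))
```

With samples this small, every rearrangement could be enumerated exactly; random sampling of rearrangements is used here only because it scales to larger samples.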