R, the statistical and graphical environment, is rapidly emerging as an important set of teaching and research tools for biologists. This book draws upon the popularity and free availability of R to couple the theory and practice of biostatistics into a single treatment, so as to provide a textbook for biologists learning statistics, R, or both. An abridged description of biostatistical principles and analysis sequence keys are combined with worked examples of the practical use of R into a complete practical guide to designing and analyzing real biological research. Topics covered include:
* simple hypothesis testing, graphing
* exploratory data analysis and graphical summaries
* regression (linear, multi and non-linear)
* simple and complex ANOVA and ANCOVA designs (including nested, factorial, blocking, split-plot and repeated measures)
* frequency analysis and generalized linear models.
Linear mixed effects modeling is also incorporated extensively throughout as an alternative to traditional modeling techniques. The book is accompanied by a companion website www.wiley.com/go/logan/r with an extensive set of resources comprising all R scripts and data sets used in the book, additional worked examples, the biology package, and other instructional materials and links.
Number of pages: 813
Year of publication: 2011
Contents
Preface
R quick reference card
General key to statistical methods
1 Introduction to R
1.1 Why R?
1.2 Installing R
1.3 The R environment
1.4 Object names
1.5 Expressions, Assignment and Arithmetic
1.6 R Sessions and workspaces
1.7 Getting help
1.8 Functions
1.9 Precedence
1.10 Vectors - variables
1.11 Matrices, lists and data frames
1.12 Object information and conversion
1.13 Indexing vectors, matrices and lists
1.14 Pattern matching and replacement (character search and replace)
1.15 Data manipulation
1.16 Functions that perform other functions repeatedly
1.17 Programming in R
1.18 An introduction to the R graphical environment
1.19 Packages
1.20 Working with scripts
1.21 Citing R in publications
1.22 Further reading
2 Data sets
2.1 Constructing data frames
2.2 Reviewing a data frame - fix()
2.3 Importing (reading) data
2.4 Exporting (writing) data
2.5 Saving and loading of R objects
2.6 Data frame vectors
2.7 Manipulating data sets
2.8 Dummy data sets - generating random data
3 Introductory statistical principles
3.1 Distributions
3.2 Scale transformations
3.3 Measures of location
3.4 Measures of dispersion and variability
3.5 Measures of the precision of estimates - standard errors and confidence intervals
3.6 Degrees of freedom
3.7 Methods of estimation
3.8 Outliers
3.9 Further reading
4 Sampling and experimental design with R
4.1 Random sampling
4.2 Experimental design
5 Graphical data presentation
5.1 The plot() function
5.2 Graphical Parameters
5.3 Enhancing and customizing plots with low-level plotting functions
5.4 Interactive graphics
5.5 Exporting graphics
5.6 Working with multiple graphical devices
5.7 High-level plotting functions for univariate (single variable) data
5.8 Presenting relationships
5.9 Presenting grouped data
5.10 Presenting categorical data
5.11 Trellis graphics
5.12 Further reading
6 Simple hypothesis testing – one and two population tests
6.1 Hypothesis testing
6.2 One- and two-tailed tests
6.3 t-tests
6.4 Assumptions
6.5 Statistical decision and power
6.6 Robust tests
6.7 Further reading
6.8 Key for simple hypothesis testing
6.9 Worked examples of real biological data sets
7 Introduction to Linear models
7.1 Linear models
7.2 Linear models in R
7.3 Estimating linear model parameters
7.4 Comments about the importance of understanding the structure and parameterization of linear models
8 Correlation and simple linear regression
8.1 Correlation
8.2 Simple linear regression
8.3 Smoothers and local regression
8.4 Correlation and regression in R
8.5 Further reading
8.6 Key for correlation and regression
8.7 Worked examples of real biological data sets
9 Multiple and curvilinear regression
9.1 Multiple linear regression
9.2 Linear models
9.3 Null hypotheses
9.4 Assumptions
9.5 Curvilinear models
9.6 Robust regression
9.7 Model selection
9.8 Regression trees
9.9 Further reading
9.10 Key and analysis sequence for multiple and complex regression
9.11 Worked examples of real biological data sets
10 Single factor classification (ANOVA)
10.1 Null hypotheses
10.2 Linear model
10.3 Analysis of variance
10.4 Assumptions
10.5 Robust classification (ANOVA)
10.6 Tests of trends and means comparisons
10.7 Power and sample size determination
10.8 ANOVA in R
10.9 Further reading
10.10 Key for single factor classification (ANOVA)
10.11 Worked examples of real biological data sets
11 Nested ANOVA
11.1 Linear models
11.2 Null hypotheses
11.3 Analysis of variance
11.4 Variance components
11.5 Assumptions
11.6 Pooling denominator terms
11.7 Unbalanced nested designs
11.8 Linear mixed effects models
11.9 Robust alternatives
11.10 Power and optimisation of resource allocation
11.11 Nested ANOVA in R
11.12 Further reading
11.13 Key for nested ANOVA
11.14 Worked examples of real biological data sets
12 Factorial ANOVA
12.1 Linear models
12.2 Null hypotheses
12.3 Analysis of variance
12.4 Assumptions
12.5 Planned and unplanned comparisons
12.6 Unbalanced designs
12.7 Robust factorial ANOVA
12.8 Power and sample sizes
12.9 Factorial ANOVA in R
12.10 Further reading
12.11 Key for factorial ANOVA
12.12 Worked examples of real biological data sets
13 Unreplicated factorial designs – randomized block and simple repeated measures
13.1 Linear models
13.2 Null hypotheses
13.3 Analysis of variance
13.4 Assumptions
13.5 Specific comparisons
13.6 Unbalanced un-replicated factorial designs
13.7 Robust alternatives
13.8 Power and blocking efficiency
13.9 Unreplicated factorial ANOVA in R
13.10 Further reading
13.11 Key for randomized block and simple repeated measures ANOVA
13.12 Worked examples of real biological data sets
14 Partly nested designs: split plot and complex repeated measures
14.1 Null hypotheses
14.2 Linear models
14.3 Analysis of variance
14.4 Assumptions
14.5 Other issues
14.6 Further reading
14.7 Key for partly nested ANOVA
14.8 Worked examples of real biological data sets
15 Analysis of covariance (ANCOVA)
15.1 Null hypotheses
15.2 Linear models
15.3 Analysis of variance
15.4 Assumptions
15.5 Robust ANCOVA
15.6 Specific comparisons
15.7 Further reading
15.8 Key for ANCOVA
15.9 Worked examples of real biological data sets
16 Simple Frequency Analysis
16.1 The chi-square statistic
16.2 Goodness of fit tests
16.3 Contingency tables
16.4 G-tests
16.5 Small sample sizes
16.6 Alternatives
16.7 Power analysis
16.8 Simple frequency analysis in R
16.9 Further reading
16.10 Key for Analysing frequencies
16.11 Worked examples of real biological data sets
17 Generalized linear models (GLM)
17.1 Dispersion (over or under)
17.2 Binary data - logistic (logit) regression
17.3 Count data - Poisson generalized linear models
17.4 Assumptions
17.5 Generalized additive models (GAMs) - non-parametric GLM
17.6 GLM and R
17.7 Further reading
17.8 Key for GLM
17.9 Worked examples of real biological data sets
Bibliography
R index
Statistics index
Companion website for this book: wiley.com/go/logan/r
Companion website
A companion website for this book is available at:
www.wiley.com/go/logan/r
The website includes figures from the book for downloading.
A John Wiley & Sons, Inc., Publication
This edition first published 2010, © 2010 by Murray Logan
Blackwell Publishing was acquired by John Wiley & Sons in February 2007. Blackwell’s publishing program has been merged with Wiley’s global Scientific, Technical and Medical business to form Wiley-Blackwell.
Registered office: John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK
Editorial offices: 9600 Garsington Road, Oxford, OX4 2DQ, UK; The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK; 111 River Street, Hoboken, NJ 07030-5774, USA
For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com/wiley-blackwell
The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.
Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book. This publication is designed to provide accurate and authoritative information in regard to the subject matter covered. It is sold on the understanding that the publisher is not engaged in rendering professional services. If professional advice or other expert assistance is required, the services of a competent professional should be sought.
Library of Congress Cataloguing-in-Publication Data
Logan, Murray.
Biostatistical design and analysis using R : a practical guide / Murray Logan.
p. cm.
Includes bibliographical references and index.
ISBN 978-1-4443-3524-8 (hardcover : alk. paper) – ISBN 978-1-4051-9008-4 (pbk. : alk. paper)
1. Biometry. 2. R (Computer program language) I. Title.
QH323.5.L645 2010
570.1′5195–dc22
2009053162
A catalogue record for this book is available from the British Library.
Typeset in 10.5/13pt Minion by Laserwords Private Limited, Chennai, India
Preface
R is a powerful and flexible statistical and graphical environment that is freely distributed under the GNU General Public Licence [a] for all major computing platforms (Windows, MacOSX and Linux). This open source licence, along with a relatively simple scripting syntax, has promoted diverse and rapid evolution and contribution. As the broader scientific community continues to gain greater instruction and exposure to the overall project, the popularity of R as a teaching and research tool continues to accelerate.
It is now widely acknowledged that R proficiency as a scientific skill set is becoming increasingly desirable and useful throughout the scientific community. However, as with most open source developments, the emphasis of the R project remains on the expansive development of tools and features. Applied documentation remains somewhat sparse and somewhat incomprehensible to the average biologist. Whilst there are a number of excellent texts on R emerging, the bulk of these texts are devoted to the R language itself. Any featured examples therein are used primarily for the purpose of illustrating the suite of commonly used R features and procedures, rather than to illustrate how R can be used to perform common biostatistical analyses.
Coinciding with the increasing interest in R as both a learning and research tool for biostatistics, has been the success of a relatively new major biostatistics textbook (Quinn and Keough, 2002). This text provides detailed coverage of most of the major statistical concepts and tests that biologists are likely to encounter with an emphasis on the practical implementation of these concepts with real biological data. Undoubtedly, a large part of the appeal of this book is attributable to the extensive use of real biological examples to augment and reinforce the text. Furthermore, by concentrating on the information biologists need to implement their research, and avoiding the overuse of complex mathematical descriptions, the authors have appealed to those biologists who don’t require (or desire) a knowledge of performing or programming entire analyses from scratch. Such biologists tend to use statistical software that is already available and specifically desire information that will help them achieve reliable statistical and biological outcomes. Quinn and Keough (2002) also advocate a number of alternative texts that provide more detailed coverage of specific topics and that also adopt this real example approach.
Typically, most biostatistical texts focus on the principles of design and analysis without extending into the practical use of software to implement these principles. Similarly, R/S-plus texts tend to concentrate on documenting and showcasing the features of R without providing much of a biostatistical account of the principles behind the features or illustrating how these tools can be extended to achieve comprehensive real world analyses. Consequently, many biological students and professionals struggle to translate the theoretical advice into computational outcomes. Although some of these difficulties can be addressed after extensively reading through a number of software references, many of the difficulties remain. The inconsistency and incompatibility between theory texts and software reference texts is mainly the result of differing intentions of the two genres and is a source of great frustration.
The reluctance of biostatistical texts to promote or instruct on any particular statistical software (except for extremely specialized cases where historically only a single dedicated program was available) is in part an acknowledgment of the diversity of software packages available (each of which differs substantially in the range of features offered as well as the user interface and output provided). Furthermore, software upgrades generally involve major alterations to the way in which preexisting tasks are performed, and thus being associated with a single software package tends to restrict the longevity and audience of the text. In contrast, although contributors are constantly extending the feature set of R environments, overall the project maintains a consistent user interface. Consequently, there is currently both a need and opportunity for a text that fills the gap between biostatistics texts and software texts, so as to assist biologists with the practical side of performing statistical analysis.
Many biological researchers and students have at one stage or another used one or other of the major biostatistics texts and gained a good understanding of the principles. However, from time to time (and particularly when preparing to generate a new design or analyse a new data set), they require a quick refresher to help remind them of the issues and principles relevant to their current design and/or analysis scenarios. In most cases, they do not need to re-read the more discursive texts and in many cases express a reluctance to invest large amounts of valuable research time doing so. Therefore, there is also a need for a quick reference that summarizes the key concepts of contemporary biostatistics and leads users step-wise through each of the analysis procedures and options. Such a guide would also help users to identify their areas of statistical naivete and enable them to return to a more comprehensive text with a more focused and efficient objective.
Therefore, the intended focus of this book will be to highlight the major concepts, principles and issues in contemporary biostatistics as well as demonstrate how to use R (as a research design, analysis and presentation tool) to complete examples from major biostatistics textbooks. In so doing, this proposed text acknowledges the important role that statistical software and real examples play in reinforcing statistical principles and practices.
Hence, in summary, the intentions of the book are three-fold:
(i) To provide very brief refresher summaries of the main concepts, issues and options involved in a range of contemporary biostatistical analyses
(ii) To provide key guides that step users through the procedures and options of a range of contemporary biostatistical analyses
(iii) To provide detailed R scripts and documentation that enable users to perform a range of real worked examples from statistics texts that are popular among biological and environmental scientists
Worked examples
Where possible and appropriate, this book will make use of the same examples that appear in the popular biostatistical texts, so as to take advantage of the history and information surrounding those examples as well as any familiarity that users may have with them. That said, access to these other texts will not be necessary to get good value out of the materials.
Website
This book is augmented by a website (http://www.wiley.com/go/logan/r) which includes:
* raw data sets and R analysis scripts associated with all worked examples
* the biology package that contains many functions utilized in this book
* an R reference card containing links to pages within the book
Typographical conventions
Throughout this book, all R language objects and functions will be printed in courier (monospaced) typeface. Commands will begin with the standard R command prompt (>) and lines continuing on from a previous line will begin with the continuation prompt (+). In syntax used within the chapter keys, dataset is used as an example and should be replaced by the name of the actual data frame when used. Similarly, all vector names should be replaced by the names used to denote the various variables in your data set.
Acknowledgements
The inspiration for this book came primarily from Gerry Quinn and Mick Keough, towards whom I am both indebted and infuriated (in equal quantities). As authors of a statistical piece themselves, they should have known better than to encourage others to attempt such an undertaking! I also wish to acknowledge the intellectualizing and suggestions of Patrick Baker and Andrew Robinson, the former's regular supply of ideas remaining a constant source of material and torment. Countless numbers of students and colleagues have also helped refine the materials and format of this book. As almost all of the worked examples in this book are adapted from the major biostatistical texts, the contributions of these other authors cannot be overstated. Finally, I would like to thank Nat, Kara, Saskia and Anika for your support and tolerance while I wrote this “extremely quite boring book with rid-ic-li-us pictures” (S. Logan, age 7).
[a] This is an open source licence that ensures that the application as well as its source code is freely available to use, modify and redistribute.
R quick reference card
Session management
> q() Quitting R (see page 8)
> ls() List the objects in the current environment (see page 7)
> rm(...) Remove objects from the current environment (see page 7)
> setwd(dir) Set the current working directory (see page 7)
> getwd() Get the current working directory (see page 7)
Getting help
> ?function Getting help on a function (see page 8)
> help(function) Getting help on a function (see page 8)
> example(function) Run the examples associated with the manual page for the function (see page 8)
> demo(topic) Run an installed demonstration script (see page 8)
> apropos("topic") Return names of all objects in search list that match “topic” (see page 9)
> help.search("topic") Getting help about a concept (see page 9)
> help.start() Launch R HTML documentation (see page 9)
Built in constants
> LETTERS the 26 upper-case letters of the English alphabet (see page 17)
> letters the 26 lower-case letters of the English alphabet (see page 17)
> month.name English names of the 12 months of the year
> month.abb Abbreviated English names of the 12 months of the year
> pi π, the ratio of a circle's circumference to its diameter (see page 105)
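As a brief illustration of the constants listed above, the following sketch treats them as ordinary vectors. The variable names and the radius value are hypothetical and chosen only for the example.

```r
# Built-in constants behave like any other vector and can be indexed
first_five <- LETTERS[1:5]      # first five upper-case letters
n_lower    <- length(letters)   # number of lower-case letters
spring     <- month.name[3:5]   # full month names, positions 3 to 5
abbrev     <- month.abb[12]     # abbreviated name of the 12th month

# pi is a numeric constant; here it gives the area of a circle
radius <- 2                     # hypothetical radius
area   <- pi * radius^2

print(first_five)
print(spring)
print(round(area, 3))
```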
Packages
> installed.packages() List of all currently installed packages (see page 44)
> update.packages() Update installed packages (see page 44)
> install.packages(pkgs) Install package(s) (pkgs) from CRAN mirror (see page 45)
R CMD INSTALL package Install an add-on package (see page 43)
> library(package) Loading an add-on package (see page 45)
> data(name) Load a data set or structure inbuilt into R or a loaded package.
Importing/Exporting
> source("file") Input, parse and sequentially evaluate the file (see page 45)
> sink("file") Redirect non-graphical output to file
> read.table() Read data in table format and create a data frame, with variables in columns (see page 51)
> read.table() Read data left on the clipboard in table format and create a data frame, with variables in columns (see page 51)
> read.systat("file.syd", to.data.frame=T) Read SYSTAT data file and create a data frame (see page 52)
> read.spss("file.sav", to.data.frame=T) Read SPSS data file and create a data frame (see page 52)
> as.data.frame(read.mtp("file.mtp")) Read Minitab Portable Worksheet data file and create a data frame (see page 52)
> read.xport("file") Read SAS XPORT data file and create a data frame (see page 52)
> write.table() Write the contents of a dataframe to file in table format (see page 53)
> save(object, file="file.RData") Write the contents of the object to file (see page 53)
> load(file="file.RData") Load the contents of a file (see page 53)
> dump(object, file="file") Save the contents of an object to a file (see page 53)
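The round trip below is a minimal sketch of how some of the import and export functions above fit together. The data frame, its column names and the temporary file names are assumptions made for the example, not data from the book.

```r
# A small hypothetical data frame (illustrative values only)
dat <- data.frame(site = c("A", "B", "C"), count = c(12, 7, 30))

# Write it to a temporary tab-delimited text file, then read it back
txt_file <- tempfile(fileext = ".txt")
write.table(dat, file = txt_file, sep = "\t", row.names = FALSE, quote = FALSE)
dat_in <- read.table(txt_file, header = TRUE, sep = "\t")

# save() stores the R object itself; load() restores it under its original name
rdata_file <- tempfile(fileext = ".RData")
save(dat, file = rdata_file)
rm(dat)                   # remove the object from the workspace
load(file = rdata_file)   # dat is available again

print(dat_in)
```

Text files written with write.table() can be opened by other software, whereas .RData files preserve R objects exactly but are intended to be read by R.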
Generating Vectors
> c(...) Concatenate objects (see page 6)
> seq(from, to, by=, length=) Generate a sequence (see page 12)
> rep(x, times, each) Replicate each of the values of x (see page 13)
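A short sketch of the three vector generators above, using made-up values. The card abbreviates the length.out argument of seq() as length; the full name is used here.

```r
x  <- c(2, 4, 6)                        # concatenate three values into a vector
s1 <- seq(from = 0, to = 1, by = 0.25)  # fixed step size
s2 <- seq(1, 10, length.out = 4)        # fixed number of values
r1 <- rep(x, times = 2)                 # repeat the whole vector twice
r2 <- rep(x, each = 2)                  # repeat each element twice

print(s1)
print(s2)
print(r1)
print(r2)
```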
Character vectors
> paste(..., ) Combine multiple vectors together after converting them into character vectors (see page 13)
> substr(x, start, stop) Extract substrings from a character vector (see page 14)
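The following sketch shows paste() and substr() on hypothetical labels. The sep and collapse arguments are standard paste() arguments that the card's "..." leaves implicit.

```r
sites <- paste("Site", 1:3)                  # numbers are converted to character
codes <- paste("Q", 1:3, sep = "")           # sep = "" joins without a space
one   <- paste(c("a", "b", "c"), collapse = "-")   # collapse into a single string
genus <- substr(c("Eucalyptus", "Acacia"), start = 1, stop = 3)  # first 3 characters

print(sites)
print(codes)
print(one)
print(genus)
```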
Factors
> factor(x) Convert the vector (x) into a factor (see page 15)
> factor(x, levels=c()) Convert the vector (x) into a factor and define the order of levels (see page 15)
> gl(levels, reps, length, labels=) Generate a factor vector by specifying the pattern of levels (see page 15)
> levels(factor) Lists the levels (in order) of a factor (see page 54)
> levels(factor) <- Sets the names of the levels of a factor (see page 54)
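As an illustration of the factor functions above, this sketch uses an invented shading treatment. The level names are assumptions made for the example.

```r
shade <- c("low", "high", "medium", "low", "high")

f1 <- factor(shade)                                       # levels sorted alphabetically
f2 <- factor(shade, levels = c("low", "medium", "high"))  # levels in a chosen order

# gl(): 2 levels, each repeated 3 times, with descriptive labels
f3 <- gl(2, 3, labels = c("control", "treatment"))

# Rename the levels of an existing factor
levels(f3) <- c("C", "T")

print(levels(f1))
print(levels(f2))
print(f3)
```

Setting the level order explicitly matters because many summaries and plots present groups in level order rather than in order of appearance.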
Matrices
> matrix(x,nrow, ncol, byrow=F) Create a matrix with nrow and/or ncol dimensions out of a vector (x) (see page 16)
> cbind(...) Create a matrix (or data frame) by combining the sequence of vectors, matrices or data frames by columns (see page 16)
> rbind(...) Create a matrix (or data frame) by combining the sequence of vectors, matrices or data frames by rows (see page 16)
> rownames(x) Read (or set with <-) the row names of the matrix (x) (see page 17)
> colnames(x) Read (or set with <-) the column names of the matrix (x) (see page 17)
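The matrix functions above can be sketched as follows. The values and the row and column names are hypothetical.

```r
# Fill a 2 x 3 matrix column by column (byrow = FALSE is the default)
m <- matrix(1:6, nrow = 2, ncol = 3, byrow = FALSE)

# Set row and column names with the assignment forms
rownames(m) <- c("r1", "r2")
colnames(m) <- c("a", "b", "c")

# Add a named column and a named row
m_wide <- cbind(m, d = c(7, 8))
m_long <- rbind(m, r3 = c(9, 10, 11))

print(m)
print(dim(m_wide))
print(dim(m_long))
```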
Lists
> list(...) Generate a list of named (for arguments in the form name=x) and/or unnamed (for arguments in the form x) components from the sequence of objects (see page 17)
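A minimal sketch of a list that mixes named and unnamed components. The component names and values are invented for the example.

```r
# Two named components and one unnamed component
lst <- list(name = "quadrat", counts = c(3, 8, 1), 42)

comp_names <- names(lst)        # unnamed components have an empty name
total      <- sum(lst$counts)   # named components can be accessed with $
third      <- lst[[3]]          # any component can be accessed by position

print(comp_names)
print(total)
print(third)
```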
Data frames
> data.frame(...) Convert a set of vectors into a data frame (see page 49)
> row.names(dataframe) Read (or set with <-) the row names of the data frame (see page 49)
> fix(dataframe) View and edit a dataframe in a spreadsheet (see page 49)
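The sketch below builds a small data frame from two vectors and sets its row names. The variable names and values are hypothetical; fix() is omitted because it opens an interactive editor.

```r
# Two vectors of equal length become the columns of a data frame
treatment <- c("burnt", "burnt", "unburnt")
richness  <- c(5, 9, 14)
plots <- data.frame(treatment, richness)

# Replace the default row names (1, 2, 3) with plot labels
row.names(plots) <- c("P1", "P2", "P3")

print(plots)
print(plots["P2", "richness"])  # look up a value by row and column name
```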
Indexing
Vectors
> x[i] Select the ith element (see page 21)
> x[i:j] Select the ith through jth elements inclusive (see page 21)
> x[c(1,5,6,9)] Select specific elements (see page 21)
> x[-i] Select all except the ith element (see page 21)
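The four vector indexing forms above can be tried on a small made-up vector:

```r
x <- c(10, 20, 30, 40, 50, 60, 70, 80, 90)

third   <- x[3]              # a single element
middle  <- x[2:4]            # a run of consecutive elements
picked  <- x[c(1, 5, 6, 9)]  # specific elements in any order
dropped <- x[-3]             # everything except the third element

print(third)
print(middle)
print(picked)
print(dropped)
```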
