Few books on statistical data analysis in the natural sciences are written at a level that a non-statistician will easily understand. This is a book written in colloquial language, avoiding mathematical formulae as much as possible, trying to explain statistical methods using examples and graphics instead. To use the book efficiently, readers should have some computer experience. The book starts with the simplest of statistical concepts and carries readers forward to a deeper and more extensive understanding of the use of statistics in environmental sciences. The book concerns the application of statistical and other computer methods to the management, analysis and display of spatial data. These data are characterised by including locations (geographic coordinates), which leads to the necessity of using maps to display the data and the results of the statistical methods. Although the book uses examples from applied geochemistry, and a large geochemical survey in particular, the principles and ideas equally well apply to other natural sciences, e.g., environmental sciences, pedology, hydrology, geography, forestry, ecology, and health sciences/epidemiology.
The book is unique because it supplies direct access to software solutions (based on R, the Open Source version of the S-language for statistics) for applied environmental statistics. For all graphics and tables presented in the book, the R code is provided in the form of executable R-scripts. In addition, a graphical user interface for R, called DAS+R, was developed for convenient, fast and interactive data analysis.
Statistical Data Analysis Explained: Applied Environmental Statistics with R provides, on an accompanying website, the software to undertake all the procedures discussed, and the data employed for their description in the book.
Number of pages: 668
Year of publication: 2011
Contents
Preface
Acknowledgements
About the authors
1: Introduction
1.1 The Kola Ecogeochemistry Project
2: Preparing the Data for Use in R and DAS+R
2.1 Required data format for import into R and DAS+R
2.2 The detection limit problem
2.3 Missing values
2.4 Some “typical” problems encountered when editing a laboratory data report file to a DAS+R file
2.5 Appending and linking data files
2.6 Requirements for a geochemical database
2.7 Summary
3: Graphics to Display the Data Distribution
3.1 The one-dimensional scatterplot
3.2 The histogram
3.3 The density trace
3.4 Plots of the distribution function
3.5 Boxplots
3.6 Combination of histogram, density trace, one-dimensional scatterplot, boxplot, and ECDF-plot
3.7 Combination of histogram, boxplot or box-and-whisker plot, ECDF-plot, and CP-plot
3.8 Summary
4: Statistical Distribution Measures
4.1 Central value
4.2 Measures of spread
4.3 Quartiles, quantiles and percentiles
4.4 Skewness
4.5 Kurtosis
4.6 Summary table of statistical distribution measures
4.7 Summary
5: Mapping Spatial Data
5.1 Map coordinate systems (map projection)
5.2 Map scale
5.3 Choice of the base map for geochemical mapping
5.4 Mapping geochemical data with proportional dots
5.5 Mapping geochemical data using classes
5.6 Surface maps constructed with smoothing techniques
5.7 Surface maps constructed with kriging
5.8 Colour maps
5.9 Some common mistakes in geochemical mapping
5.10 Summary
6: Further Graphics for Exploratory Data Analysis
6.1 Scatterplots (xy-plots)
6.2 Linear regression lines
6.3 Time trends
6.4 Spatial trends
6.5 Spatial distance plot
6.6 Spiderplots (normalised multi-element diagrams)
6.7 Scatterplot matrix
6.8 Ternary plots
6.9 Summary
7: Defining Background and Threshold, Identification of Data Outliers and Element Sources
7.1 Statistical methods to identify extreme values and data outliers
7.2 Detecting outliers and extreme values in the ECDF- or CP-plot
7.3 Including the spatial distribution in the definition of background
7.4 Methods to distinguish geogenic from anthropogenic element sources
7.5 Summary
8: Comparing Data in Tables and Graphics
8.1 Comparing data in tables
8.2 Graphical comparison of the data distributions of several data sets
8.3 Comparing the spatial data structure
8.4 Subset creation – a mighty tool in graphical data analysis
8.5 Data subsets in scatterplots
8.6 Data subsets in time and spatial trend diagrams
8.7 Data subsets in ternary plots
8.8 Data subsets in the scatterplot matrix
8.9 Data subsets in maps
8.10 Summary
9: Comparing Data Using Statistical Tests
9.1 Tests for distribution (Kolmogorov-Smirnov and Shapiro-Wilk tests)
9.2 The one-sample t-test (test for the central value)
9.3 Wilcoxon signed-rank test
9.4 Comparing two central values of the distributions of independent data groups
9.5 Comparing two central values of matched pairs of data
9.6 Comparing the variance of two data sets
9.7 Comparing several central values
9.8 Comparing the variance of several data groups
9.9 Comparing several central values of dependent groups
9.10 Summary
10: Improving Data Behaviour for Statistical Analysis: Ranking and Transformations
10.1 Ranking/sorting
10.2 Non-linear transformations
10.3 Linear transformations
10.4 Preparing a data set for multivariate data analysis
10.5 Transformations for closed number systems
10.6 Summary
11: Correlation
11.1 Pearson correlation
11.2 Spearman rank correlation
11.3 Kendall-tau correlation
11.4 Robust correlation coefficients
11.5 When is a correlation coefficient significant?
11.6 Working with many variables
11.7 Correlation analysis and inhomogeneous data
11.8 Correlation results following additive logratio or centred logratio transformations
11.9 Summary
12: Multivariate Graphics
12.1 Profiles
12.2 Stars
12.3 Segments
12.4 Boxes
12.5 Castles and trees
12.6 Parallel coordinates plot
12.7 Summary
13: Multivariate Outlier Detection
13.1 Univariate versus multivariate outlier detection
13.2 Robust versus non-robust outlier detection
13.3 The chi-square plot
13.4 Automated multivariate outlier detection and visualisation
13.5 Other graphical approaches for identifying outliers and groups
13.6 Summary
14: Principal Component Analysis (PCA) and Factor Analysis (FA)
14.1 Conditioning the data for PCA and FA
14.2 Principal component analysis (PCA)
14.3 Factor analysis
14.4 Summary
15: Cluster Analysis
15.1 Possible data problems in the context of cluster analysis
15.2 Distance measures
15.3 Clustering samples
15.4 Clustering variables
15.5 Evaluation of cluster validity
15.6 Selection of variables for cluster analysis
15.7 Summary
16: Regression Analysis (RA)
16.1 Data requirements for regression analysis
16.2 Multiple regression
16.3 Classical least squares (LS) regression
16.4 Robust regression
16.5 Model selection in regression analysis
16.6 Other regression methods
16.7 Summary
17: Discriminant Analysis (DA) and Other Knowledge-Based Classification Methods
17.1 Methods for discriminant analysis
17.2 Data requirements for discriminant analysis
17.3 Visualisation of the discriminant function
17.4 Prediction with discriminant analysis
17.5 Exploring for similar data structures
17.6 Other knowledge-based classification methods
17.7 Summary
18: Quality Control (QC)
18.1 Randomised samples
18.2 Trueness
18.3 Accuracy
18.4 Precision
18.5 Analysis of variance (ANOVA)
18.6 Using maps to assess data quality
18.7 Variables analysed by two different analytical techniques
18.8 Working with censored data – a practical example
18.9 Summary
19: Introduction to R and Structure of the DAS+R Graphical User Interface
19.1 R
19.2 R-scripts
19.3 A brief overview of relevant R commands
19.4 DAS+R
19.5 Summary
References
Plates
Index
Copyright © 2008 John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester,
West Sussex PO19 8SQ, England
Telephone (+44) 1243 779777
Email (for orders and customer service enquiries): [email protected]
Visit our Home Page on www.wileyeurope.com or www.wiley.com
All Rights Reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except under the terms of the Copyright, Designs and Patents Act 1988 or under the terms of a licence issued by the Copyright Licensing Agency Ltd, 90 Tottenham Court Road, London W1T 4LP, UK, without the permission in writing of the Publisher. Requests to the Publisher should be addressed to the Permissions Department, John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex PO19 8SQ, England, or emailed to [email protected], or faxed to (+44) 1243 770620.
Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The Publisher is not associated with any product or vendor mentioned in this book.
This publication is designed to provide accurate and authoritative information in regard to the subject matter covered. It is sold on the understanding that the Publisher is not engaged in rendering professional services. If professional advice or other expert assistance is required, the services of a competent professional should be sought.
Other Wiley Editorial Offices
John Wiley & Sons Inc., 111 River Street, Hoboken, NJ 07030, USA
Jossey-Bass, 989 Market Street, San Francisco, CA 94103-1741, USA
Wiley-VCH Verlag GmbH, Boschstr. 12, D-69469 Weinheim, Germany
John Wiley & Sons Australia Ltd, 33 Park Road, Milton, Queensland 4064, Australia
John Wiley & Sons (Asia) Pte Ltd, 2 Clementi Loop #02-01, Jin Xing Distripark, Singapore 129809
John Wiley & Sons Canada Ltd, 6045 Freemont Blvd, Mississauga, Ontario, L5R 4J3
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.
British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library
ISBN 978-0-470-98581-6
Preface
Although several books already exist on statistical data analysis in the natural sciences, few are written at a level that a non-statistician will easily understand. In our experience many colleagues in earth and environmental sciences are not sufficiently trained in mathematics or statistics to easily comprehend the necessary formalism. This is a book written in colloquial language, avoiding mathematical formulae as much as possible (some may argue too much), and trying to explain the methods using examples and graphics instead. To use the book efficiently, readers should have some computer experience and some basic understanding of statistical methods. We start with the simplest of statistical concepts and carry readers forward to a deeper and more extensive understanding of the use of statistics in the natural sciences. Importantly, users of the book, rather than readers, will require a sound knowledge of their own branch of natural science.
In the book we try to demonstrate, based on practical examples, how data analysis in environmental sciences should be approached, outline advantages and disadvantages of methods, and show and discuss the do’s and don’ts. We do not use "simple toy examples" to demonstrate how well certain statistical techniques function. Instead, the book uses a single, large, real-world example data set, which is investigated in more and more depth throughout the book. We feel that this makes it an interesting read from beginning to end, without preventing the use of single chapters as a reference for certain statistical techniques. This approach also clearly demonstrates the limits of classical statistical data analysis with environmental (geochemical) data. The special properties of environmental data (e.g., spatial dependencies, outliers, skewed distributions, closure) do not agree well with the assumptions of "classical" (Gaussian) statistics. These are, however, the statistical methods taught in all basic statistics courses at universities because they are the most fundamental statistical methods. As a consequence, to this day, techniques that are far from ideal for the data at hand are widely applied by earth and environmental scientists in data analysis. Applied earth science data call for the use of robust and non-parametric statistical methods. These techniques are extensively used and demonstrated in the book. The focus of the book is on the exploratory use of statistical methods, with extensive application of graphical data analysis techniques.
The book concerns the application of statistical and other computer methods to the management, analysis and display of spatial data. These data are characterised by including locations (geographic coordinates), which leads to the necessity of using maps to display the data and the results of the statistical methods. Although the book uses examples from applied geochemistry, the principles and ideas apply equally well to other natural sciences, e.g., environmental sciences, pedology, hydrology, geography, forestry, ecology, and health sciences/epidemiology, that is, to anybody using spatially dependent data. The book will be useful to postgraduate students, possibly final year students with dissertation projects, students and others interested in the application of modern statistical methods (and not so much in theory), and natural scientists and other applied statistical professionals. The book can be used as a textbook full of practical examples, or in a basic university course on exploratory data analysis for spatial data. The book can also serve as a manual to many statistical methods and will help readers to better understand how different methods can be applied to their data – and what should not be done with the data.
The book is unique because it supplies direct access to software solutions (based on R, the Open Source version of the S-language for statistics) for applied environmental statistics. For all graphics and tables presented in the book, the R code is provided in the form of executable R-scripts. In addition, a graphical user interface for R, called DAS+R, was developed by the last author for convenient, fast and interactive data analysis. Powerful software that combines statistical data analysis and mapping is one of the highlights of these tools. This software may be used with the example data as a teaching/learning tool, or with the reader’s own data for research.
Clemens Reimann
Geochemist
Peter Filzmoser
Statistician
Robert G. Garrett
Geochemist
Rudolf Dutter
Statistician
Trondheim, Vienna, Ottawa
September 1, 2007.
Acknowledgements
This book is the result of a fruitful cooperation between statisticians and geochemists that has spanned many years. We thank our institutions (the Geological Surveys of Norway (NGU) and Canada (GSC) and Vienna University of Technology (VUT)) for providing us with the time and opportunity to write the book. The Department for International Relations of VUT and NGU supported some meetings of the authors.
We thank the Wiley staff for their very professional support and discussions.
Toril Haugland and Herbert Weilguni were our test readers; they critically read the whole manuscript and made many corrections and valuable comments.
Many external reviewers read single chapters of the book and suggested important changes.
The software accompanying the book was developed with the help of many VUT students, including Andreas Alfons, Moritz Gschwandner, Alexander Juschitz, Alexander Kowarik, Johannes Löffler, Martin Riedler, Michael Schauerhuber, Stefan Schnabl, Christian Schwind, Barbara Steiger, Stefan Wohlmuth and Andreas Zainzinger, together with the authors.
Friedrich Leisch of the R core team and John Fox and Matthias Templ were always available for help with R and good advice concerning R-commander.
Friedrich Koller supplied lodging, many meals and stimulating discussions for Clemens Reimann when working in Vienna. Similarly, the Filzmoser family generously hosted Robert G. Garrett during a working visit to Austria.
NGU allowed us to use the Kola Project data; the whole Kola Project team is thanked for many important discussions about the interpretation of the results through many years.
Arne Bjørlykke, Morten Smelror, and Rolf Tore Ottesen wholeheartedly backed the project over several years.
Heidrun Filzmoser is thanked for translating the manuscript from Word into LaTeX. The families of the authors are thanked for their continued support, patience with us and understanding.
Many others who are not named above contributed to the outcome; we wish to express our gratitude to all of them.
About the authors
Clemens REIMANN
Clemens Reimann (born 1952) holds an M.Sc. in Mineralogy and Petrology from the University of Hamburg (Germany), a Ph.D. in Geosciences from Leoben Mining University, Austria, and a D.Sc. in Applied Geochemistry from the same university. He has worked as a lecturer in Mineralogy and Petrology and Environmental Sciences at Leoben Mining University, as an exploration geochemist in eastern Canada, and in contract research in environmental sciences in Austria, and he managed the laboratory of an Austrian cement company before joining the Geological Survey of Norway in 1991 as a senior geochemist. From March to October 2004 he was director and professor at the German Federal Environment Agency (Umweltbundesamt, UBA), responsible for Division II, Environmental Health and Protection of Ecosystems. At present he is chairman of the EuroGeoSurveys geochemistry expert group, acting vice president of the International Association of GeoChemistry (IAGC), and associate editor of both Applied Geochemistry and Geochemistry: Exploration, Environment, Analysis.
Peter FILZMOSER
Peter Filzmoser (born 1968) studied Applied Mathematics at the Vienna University of Technology, Austria, where he also wrote his doctoral thesis and habilitation devoted to the field of multivariate statistics. His research led him to the area of robust statistics, resulting in many international collaborations and various scientific papers in this area. His interest in applications of robust methods resulted in the development of R software packages. He was and is involved in the organisation of several scientific events devoted to robust statistics. Since 2001 he has been dozent at the Statistics Department at Vienna University of Technology. He was visiting professor at the Universities of Vienna, Toulouse and Minsk.
Robert G. GARRETT
Bob Garrett studied Mining Geology and Applied Geochemistry at Imperial College, London, and joined the Geological Survey of Canada (GSC) in 1967 following post-doctoral studies at Northwestern University, Evanston. For the next 25 years his activities focussed on regional geochemical mapping in Canada, and overseas for the Canadian International Development Agency, to support mineral exploration and resource appraisal. Throughout his work he has used computers and statistics to manage data, assess their quality, and maximise the knowledge extracted from them. In the 1990s he commenced collaborations with soil and agricultural scientists in Canada and the US concerning trace elements in crops. Since then he has been involved in various Canadian Federal and university-based research initiatives aimed at providing sound science to support Canadian regulatory and international policy activities concerning risk assessments and risk management for metals. He retired in March 2005 but remains active as an Emeritus Scientist.
Rudolf DUTTER
Rudolf Dutter is senior statistician and full professor at Vienna University of Technology, Austria. He studied Applied Mathematics in Vienna (M.Sc.) and Statistics at Université de Montréal, Canada (Ph.D.). He spent three years as a post-doctoral fellow at ETH, Zurich, working on computational robust statistics. Research and teaching activities followed at the Graz University of Technology, and as a full professor of statistics at Vienna University of Technology, both in Austria. He also taught and consulted at Leoben Mining University, Austria; currently he consults in many fields of applied statistics with main interests in computational and robust statistics, development of statistical software, and geostatistics. He is author and coauthor of many publications and several books, e.g., an early booklet in German on geostatistics.
1
Introduction
Statistical data analysis is about studying data – graphically or via more formal methods. Exploratory Data Analysis (EDA) techniques (Tukey, 1977) provide many tools that transform large and cumbersome data tabulations into easy-to-grasp graphical displays which are largely independent of assumptions about the data. They are used to “visualise” the data. Graphical data analysis is often criticised as non-scientific because of its apparent ease. This critique probably stems from many scientists trained in formal statistics not being aware of the power of graphical data analysis.
Occasionally, even in graphical data analysis mathematical data transformations are useful to improve the visibility of certain parts of the data. A logarithmic transformation would be a typical example of a transformation that is used to reduce the influence of unusually high values that are far removed from the main body of data.
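The effect of a logarithmic transformation on an unusually high value can be illustrated with a minimal sketch. It is written in Python rather than taken from the book's R-scripts; the concentrations are made up for illustration, and the helper `gap_share` is a hypothetical name. It measures how much of the plotted data range is taken up by the empty gap between the largest value and the rest of the data, before and after taking logarithms.

```python
import math

# Made-up concentrations (illustrative only); the last value is unusually high.
values = [1.0, 2.0, 5.0, 10.0, 20.0, 50.0, 1000.0]


def gap_share(data):
    """Fraction of the data range occupied by the gap below the largest value."""
    ordered = sorted(data)
    return (ordered[-1] - ordered[-2]) / (ordered[-1] - ordered[0])


raw_share = gap_share(values)
log_share = gap_share([math.log10(v) for v in values])

print(f"gap share on the original scale: {raw_share:.3f}")
print(f"gap share on the log10 scale:    {log_share:.3f}")
```

On the original scale almost the whole axis is empty space below the single high value; on the logarithmic scale the main body of data occupies more than half of the axis and its structure becomes visible.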
Graphical data analysis is a creative process; it is far from simple to produce informative graphics. The choice of graphic, symbols, and data subsets, among other things, is crucial for gaining an understanding of the data. It is about iterative learning, from one graphic to the next until an informative presentation is found, or as Tukey (1977) said “It is important to understand what you can do before you learn to measure how well you seem to have done it”.
However, for a number of purposes graphics are not sufficient to describe a given data set. Here the realm of descriptive statistics is entered. Descriptive statistics are based on model assumptions about the data and are thus more restrictive than EDA. A typical model assumption used in descriptive statistics would be that the data follow a normal distribution. The normal distribution is characterised by a typical bell shape (see Figure 4.1 upper left) and depends on two parameters, mean and variance (Gauss, 1809). Many natural phenomena are described by a normal distribution. Thus this distribution is often used as the basic assumption for statistical methods and estimators. Statisticians commonly assume that the data under investigation are a random selection of many more possible observations that altogether follow a normal distribution. Many formulae for statistical calculations, e.g., for mean, standard deviation and correlation, are based on a model. It is always possible to use the empirical data at hand and the given statistical formula to calculate “values”, but only if the data follow the model will the values be representative and remain so if another random sample is taken. If the distribution of the samples deviates from the shape of the model distribution, e.g., the bell shape of the normal distribution, statisticians will often try to use transformations that force the data to approach a normal distribution. For environmental data a simple log-transformation of the data will often suffice to approach a normal distribution. In such a case it is said that the data come from a lognormal distribution.
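The relation between a lognormal and a normal distribution can be checked numerically. The following is a minimal sketch in Python (not one of the book's R-scripts) that uses simulated data, not project data. The seed, sample size and lognormal parameters are arbitrary assumptions, and `skewness` is a hypothetical helper implementing the usual moment-based skewness, which is close to zero for a symmetric, bell-shaped sample.

```python
import math
import random
import statistics


def skewness(data):
    """Moment-based sample skewness; close to 0 for a symmetric sample."""
    mean = statistics.fmean(data)
    sd = statistics.pstdev(data)
    return sum(((x - mean) / sd) ** 3 for x in data) / len(data)


rng = random.Random(42)  # fixed seed so that the sketch is reproducible

# Simulated lognormal "concentrations"; mu = 0 and sigma = 1 are arbitrary choices.
raw = [rng.lognormvariate(0.0, 1.0) for _ in range(2000)]
logged = [math.log(x) for x in raw]

skew_raw = skewness(raw)
skew_log = skewness(logged)

print(f"skewness of the raw data:             {skew_raw:.2f}")
print(f"skewness of the log-transformed data: {skew_log:.2f}")
```

The raw values are strongly right-skewed, whereas the log-transformed values are nearly symmetric, as expected when the data come from a lognormal distribution.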
Environmental data are frequently characterised by exceptionally high values that deviate widely from the main body of data. In such a case even a data transformation will not help to approach a normal distribution. Here other statistical methods are needed that will still provide reliable results. Robust statistical procedures have been developed for such data and are used frequently in this book.
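Why robust procedures are preferred for such data can be seen in a minimal sketch, again in Python rather than R and with made-up numbers. It compares two classical estimators (mean and standard deviation) with two common robust counterparts (median and median absolute deviation, MAD) when a single value in a small data set is replaced by an extreme one. The choice of median and MAD here is an illustration, not a summary of the methods used later in the book; the factor 1.4826 is the usual constant that makes the MAD comparable to the standard deviation for normally distributed data.

```python
import statistics


def mad(data):
    """Median absolute deviation, scaled to be comparable to the standard deviation."""
    centre = statistics.median(data)
    return 1.4826 * statistics.median([abs(x - centre) for x in data])


# Made-up values; in the second list one value is replaced by an extreme one.
clean = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0]
contaminated = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 500.0]

for name, data in (("clean", clean), ("contaminated", contaminated)):
    print(
        f"{name:>12}: mean={statistics.mean(data):7.2f} "
        f"sd={statistics.stdev(data):7.2f} "
        f"median={statistics.median(data):5.1f} "
        f"MAD={mad(data):5.2f}"
    )
```

The single extreme value shifts the mean far away from the main body of data and inflates the standard deviation, while the median and the MAD are unchanged.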
Inductive statistics is used to test hypotheses that are formulated by the investigator. Most methods rely heavily on the normal distribution model. Other methods exist that are not based on these model assumptions (non-parametric statistical tests) and these are often preferable for environmental data.
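As an illustration of a test that does not rely on the normal distribution model, the following minimal Python sketch implements the exact two-sided sign test for a hypothesised median. The sign test is chosen here only because it is short enough to write out in full; it is not necessarily the test the book recommends, and the data values and the function name `sign_test_p_value` are made up for the example.

```python
import math


def sign_test_p_value(data, hypothesised_median):
    """Exact two-sided sign test: only the signs of the differences are used."""
    above = sum(1 for x in data if x > hypothesised_median)
    below = sum(1 for x in data if x < hypothesised_median)
    n = above + below  # values equal to the hypothesised median are dropped
    k = min(above, below)
    # Under the null hypothesis the count of values above the median
    # follows a binomial distribution with n trials and probability 0.5.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Made-up, strongly right-skewed measurements with two very high values.
values = [3.1, 4.7, 5.2, 6.0, 6.8, 7.5, 9.9, 12.4, 48.0, 210.0]

p_low = sign_test_p_value(values, 2.0)   # all ten values lie above 2.0
p_mid = sign_test_p_value(values, 6.5)   # six values above, four below

print(f"H0: median = 2.0 -> p = {p_low:.4f}")
print(f"H0: median = 6.5 -> p = {p_mid:.4f}")
```

Because only the signs of the differences enter the calculation, the two very high values have no more influence on the result than any other value above the hypothesised median.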