The first part of this book is devoted to methods that seek the relevant dimensions of the data. The variables thus obtained provide a synthetic description, which often results in a graphical representation of the data. After a general presentation of discriminant analysis, the second part is devoted to clustering methods, which constitute another approach, often complementary to the methods described in the first part, to synthesizing and analyzing the data. The book concludes by examining the links between data mining and data analysis.
Table of Contents
Preface
Chapter 1. Principal Component Analysis: Application to Statistical Process Control
1.1. Introduction
1.2. Data table and related subspaces
1.3. Principal component analysis
1.4. Interpretation of PCA results
1.5. Application to statistical process control
1.6. Conclusion
1.7. Bibliography
Chapter 2. Correspondence Analysis: Extensions and Applications to the Statistical Analysis of Sensory Data
2.1. Correspondence analysis
2.2. Multiple correspondence analysis
2.3. An example of application at the crossroads of CA and MCA
2.4. Conclusion: two other extensions
2.5. Bibliography
Chapter 3. Exploratory Projection Pursuit
3.1. Introduction
3.2. General principles
3.3. Some indexes of interest: presentation and use
3.4. Generalized principal component analysis
3.5. Example
3.6. Further topics
3.7. Bibliography
Chapter 4. The Analysis of Proximity Data
4.1. Introduction
4.2. Representation of proximity data in a metric space
4.3. Isometric embedding and projection
4.4. Multidimensional scaling and approximation
4.5. A fielded application
4.6. Bibliography
Chapter 5. Statistical Modeling of Functional Data
5.1. Introduction
5.2. Functional framework
5.3. Principal components analysis
5.4. Linear regression models and extensions
5.5. Forecasting
5.6. Concluding remarks
5.7. Bibliography
Chapter 6. Discriminant Analysis
6.1. Introduction
6.2. Main steps in supervised classification
6.3. Standard methods in supervised classification
6.4. Recent advances
6.5. Conclusion
6.6. Bibliography
Chapter 7. Cluster Analysis
7.1. Introduction
7.2. General principles
7.3. Hierarchical clustering
7.4. Partitional clustering: the k-means algorithm
7.5. Miscellaneous clustering methods
7.6. Block clustering
7.7. Conclusion
7.8. Bibliography
Chapter 8. Clustering and the Mixture Model
8.1. Probabilistic approaches in cluster analysis
8.2. The mixture model
8.3. EM algorithm
8.4. Clustering and the mixture model
8.5. Gaussian mixture model
8.6. Binary variables
8.7. Qualitative variables
8.8. Implementation
8.9. Conclusion
8.10. Bibliography
Chapter 9. Spatial Data Clustering
9.1. Introduction
9.2. Non-probabilistic approaches
9.3. Markov random fields as models
9.4. Estimating the parameters for a Markov field
9.5. Application to numerical ecology
9.6. Bibliography
List of Authors
Index
First published in France in 2003 by Hermes Science/Lavoisier entitled: Analyse des données © LAVOISIER, 2003
First published in Great Britain and the United States in 2009 by ISTE Ltd and John Wiley & Sons, Inc.
Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms and licenses issued by the CLA. Enquiries concerning reproduction outside these terms should be sent to the publishers at the undermentioned address:
ISTE Ltd, 27-37 St George’s Road, London SW19 4EU, UK
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
www.iste.co.uk
www.wiley.com
© ISTE Ltd, 2009
The rights of Gérard Govaert to be identified as the author of this work have been asserted by him in accordance with the Copyright, Designs and Patents Act 1988.
Library of Congress Cataloging-in-Publication Data
Analyse des données. English.
Data analysis / edited by Gérard Govaert.
p. cm.
Includes bibliographical references and index.
ISBN 978-1-84821-098-1
1. Mathematical statistics. I. Govaert, Gérard. II. Title.
QA276.D325413 2009
519.5--dc22
2009016228
British Library Cataloguing-in-Publication Data
A CIP record for this book is available from the British Library
ISBN: 978-1-84821-098-1
Statistical analysis has traditionally been separated into two phases: an exploratory phase, drawing on a set of descriptive and graphical techniques, and a decisional phase, based on probabilistic models. Some of the tools employed as part of the exploratory phase belong to descriptive statistics, whose elementary exploratory methods consider only a very limited number of variables. Other tools belong to data analysis, the subject matter of this book. This topic comprises more elaborate exploratory methods to handle multidimensional data, and is often seen as stepping beyond a purely exploratory context.
The first part of this book is concerned with methods for obtaining the pertinent dimensions from a collection of data. The variables so obtained provide a synthetic description, often leading to a graphical representation of the data. A considerable number of methods have been developed, adapted to different data types and different analytical goals. Chapters 1 and 2 discuss two reference methods, namely Principal Components Analysis (PCA) and Correspondence Analysis (CA), which we illustrate with examples from statistical process control and sensory analysis. Chapter 3 looks at the family of methods known as Projection Pursuit, less well known but with a promising future, which can be seen as an extension of PCA and CA that makes it possible to specify the structures being sought. Multidimensional scaling methods, discussed in Chapter 4, seek to represent proximity data in a low-dimensional Euclidean space. Chapter 5 is devoted to functional data analysis, where a function such as a temperature or rainfall curve, rather than a simple numerical vector, is used to characterize each individual.
The second part is concerned with methods of clustering, which seek to organize data into homogeneous classes. These methods provide an alternative means, often complementary to those discussed in the first part, of synthesizing and analyzing data. In view of the clear link between clustering and discriminant analysis – in pattern recognition the former is termed unsupervised and the latter supervised learning – Chapter 6 gives a general introduction to discriminant analysis. Chapter 7 then provides an overall picture of clustering. The statistical interpretation of clustering in terms of mixtures of probability distributions is discussed in Chapter 8, and Chapter 9 looks at how this approach can be applied to spatial data.
I would like to express my heartfelt thanks to all the authors who were involved in this publication. Without their expertise, their professionalism, their invaluable contributions and the wealth of their experience, it would not have been possible.
Gérard GOVAERT
Principal component analysis (PCA) is an exploratory statistical method for graphical description of the information present in large datasets. In most applications, PCA consists of studying p variables measured on n individuals. When n and p are large, the aim is to synthesize the huge quantity of information into an easy and understandable form.
Unidimensional or bidimensional studies of the variables can be performed using graphical tools (histograms, box plots) or numerical summaries (mean, variance, correlation). However, in a multidimensional context these simple preliminary studies are insufficient, since they do not take into account the possible relationships between variables, which are often the most important point.
Principal component analysis is often considered the basic method of factor analysis; it aims to find linear combinations of the p variables, called principal components, that can be used to visualize the observations in a simple way. Because it transforms a large number of correlated variables into a few uncorrelated principal components, PCA is a dimension reduction method. However, PCA can also be used for multivariate outlier detection, in particular by studying the last principal components. This property is useful in multidimensional quality control.
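These two uses, dimension reduction via the leading components and outlier detection via the trailing ones, can be sketched in a few lines of code. The following Python fragment is a minimal illustration with synthetic data and names of our own choosing; it is not the chapter's implementation (which relies on the SPAD software, as noted below).

```python
import numpy as np

def pca(X):
    # Standardize the variables (centering and scaling), as is usual
    # when they are measured in different units.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Correlation matrix of the standardized data (p x p).
    R = (Z.T @ Z) / Z.shape[0]
    # Eigendecomposition; eigh returns eigenvalues in ascending order,
    # so reorder with the largest eigenvalue first.
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Principal components: projections of the individuals.
    scores = Z @ eigvecs
    return scores, eigvals

# Synthetic data standing in for a real (n, p) table such as Table 1.1.
rng = np.random.default_rng(0)
X = rng.normal(size=(57, 8))
scores, eigvals = pca(X)

# Dimension reduction: keep the first two components for a 2D plot.
plane = scores[:, :2]
# Outlier hunting: individuals with large coordinates on the *last*
# component deviate from the common linear structure.
last = np.abs(scores[:, -1])
print(plane.shape, last.argmax())
```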
Data are generally represented in a rectangular table with n rows for the individuals and p columns corresponding to the variables. Choosing individuals and variables to analyze is a crucial phase which has an important influence on PCA results. This choice has to take into account the aim of the study; in particular, the variables have to describe the phenomenon being analyzed.
Usually PCA deals with numerical variables. However, ordinal variables, such as ranks, can also be processed by PCA. Later in this chapter we present the concept of supplementary variables, which makes it possible to take nominal variables into account as well.
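As a concrete sketch of the supplementary-variable idea, the following hypothetical Python fragment (invented names and data) fits PCA on the active variables only and then positions a supplementary numeric variable by its correlation with the resulting components, without letting it influence the axes.

```python
import numpy as np

rng = np.random.default_rng(1)
X_active = rng.normal(size=(57, 8))   # active variables build the axes
price = rng.normal(size=57)           # supplementary variable (e.g. price)

# PCA on the active variables only.
Z = (X_active - X_active.mean(0)) / X_active.std(0)
eigvals, eigvecs = np.linalg.eigh(Z.T @ Z / len(Z))
scores = Z @ eigvecs[:, ::-1]         # components, strongest first

# The supplementary variable is positioned afterwards via its
# correlation with each of the first two principal components.
corr = [np.corrcoef(price, scores[:, j])[0, 1] for j in range(2)]
print(corr)
```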
Let X be the (n, p) matrix of observations:
\[
X = \begin{pmatrix} x_1^1 & \cdots & x_1^p \\ \vdots & & \vdots \\ x_n^1 & \cdots & x_n^p \end{pmatrix}
\]
where $x_i^k$ is the value taken by variable $k$ on individual $i$.
Table 1.1 is an example of such a data matrix. Computations have been carried out using the SPAD software, version 5, kindly provided by J.-P. Gauchi.
The data file contains 57 brands of mineral water described by 11 variables, defined in Table 1.2. The data come from the bottle labels. The numerical variables are homogeneous; they are all active variables (see section 1.4.3). A variable of a different kind, such as price, would be considered a supplementary variable. Qualitative variables such as country, type and whether the water is still or sparkling (PG) are necessarily supplementary variables.
Table 1.1. Data table
Let $g$ be the vector of arithmetic means of each of the $p$ variables, defining the centroid:
\[
g = (\bar{x}^1, \ldots, \bar{x}^p)'
\]
Table 1.2. Variable description

Name: Complete water name as labeled on the bottle
Country: Identified by the official car registration letters; sometimes it is necessary to add a letter, for example Crete: GRC (Greece Crete)
Type: M for mineral water, S for spring water
PG: P for still water, G for sparkling water
CA: Calcium ions (mg/litre)
MG: Magnesium ions (mg/litre)
NA: Sodium ions (mg/litre)
K: Potassium ions (mg/litre)
SUL: Sulfate ions (mg/litre)
NO3: Nitrate ions (mg/litre)
HCO3: Bicarbonate ions (mg/litre)
CL: Chloride ions (mg/litre)
where $\bar{x}^k = \frac{1}{n} \sum_{i=1}^{n} x_i^k$ denotes the mean of variable $k$.
However, in some applications it can be useful to give each individual its own weight $p_i$, as with grouped data or a reweighted sample. These weights, which are positive numbers summing to 1, can be viewed as frequencies and are stored in a diagonal matrix of size $n$:
\[
D_p = \mathrm{diag}(p_1, \ldots, p_n)
\]
We define the linear correlation coefficient between variables $k$ and $\ell$ by:
\[
r_{k\ell} = \frac{s_{k\ell}}{s_k \, s_\ell}
\]
where $s_{k\ell} = \sum_{i=1}^{n} p_i (x_i^k - \bar{x}^k)(x_i^\ell - \bar{x}^\ell)$ is the covariance between variables $k$ and $\ell$, and $s_k = \sqrt{s_{kk}}$ is the standard deviation of variable $k$.
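A minimal numeric check of these definitions, assuming a toy data table and names of our own choosing:

```python
import numpy as np

# Toy (n, p) data table; any numeric matrix would do.
X = np.array([[1.0, 2.0],
              [3.0, 5.0],
              [4.0, 4.0]])
n = X.shape[0]

p_i = np.full(n, 1.0 / n)      # uniform weights summing to 1
D_p = np.diag(p_i)             # diagonal weight matrix of size n

g = p_i @ X                    # centroid: weighted means xbar^k
Xc = X - g                     # centered data table
S = Xc.T @ D_p @ Xc            # weighted covariance matrix (s_kl)
s = np.sqrt(np.diag(S))        # standard deviations s_k
R = S / np.outer(s, s)         # correlations r_kl = s_kl / (s_k * s_l)
print(g)
print(R)                       # the diagonal is 1 by construction
```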