Data Analysis - E-Book

Description

The first part of this book is devoted to methods that seek the relevant dimensions of data. The variables thus obtained provide a synthetic description, which often results in a graphical representation of the data. After a general presentation of discriminant analysis, the second part is devoted to clustering methods, which offer another way, often complementary to the methods described in the first part, to synthesize and analyze the data. The book concludes by examining the links between data mining and data analysis.

Number of pages: 547

Year of publication: 2013


Table of Contents

Preface

Chapter 1. Principal Component Analysis: Application to Statistical Process Control

1.1. Introduction

1.2. Data table and related subspaces

1.3. Principal component analysis

1.4. Interpretation of PCA results

1.5. Application to statistical process control

1.6. Conclusion

1.7. Bibliography

Chapter 2. Correspondence Analysis: Extensions and Applications to the Statistical Analysis of Sensory Data

2.1. Correspondence analysis

2.2. Multiple correspondence analysis

2.3. An example of application at the crossroads of CA and MCA

2.4. Conclusion: two other extensions

2.5. Bibliography

Chapter 3. Exploratory Projection Pursuit

3.1. Introduction

3.2. General principles

3.3. Some indexes of interest: presentation and use

3.4. Generalized principal component analysis

3.5. Example

3.6. Further topics

3.7. Bibliography

Chapter 4. The Analysis of Proximity Data

4.1. Introduction

4.2. Representation of proximity data in a metric space

4.3. Isometric embedding and projection

4.4. Multidimensional scaling and approximation

4.5. A fielded application

4.6. Bibliography

Chapter 5. Statistical Modeling of Functional Data

5.1. Introduction

5.2. Functional framework

5.3. Principal components analysis

5.4. Linear regression models and extensions

5.5. Forecasting

5.6. Concluding remarks

5.7. Bibliography

Chapter 6. Discriminant Analysis

6.1. Introduction

6.2. Main steps in supervised classification

6.3. Standard methods in supervised classification

6.4. Recent advances

6.5. Conclusion

6.6. Bibliography

Chapter 7. Cluster Analysis

7.1. Introduction

7.2. General principles

7.3. Hierarchical clustering

7.4. Partitional clustering: the k-means algorithm

7.5. Miscellaneous clustering methods

7.6. Block clustering

7.7. Conclusion

7.8. Bibliography

Chapter 8. Clustering and the Mixture Model

8.1. Probabilistic approaches in cluster analysis

8.2. The mixture model

8.3. EM algorithm

8.4. Clustering and the mixture model

8.5. Gaussian mixture model

8.6. Binary variables

8.7. Qualitative variables

8.8. Implementation

8.9. Conclusion

8.10. Bibliography

Chapter 9. Spatial Data Clustering

9.1. Introduction

9.2. Non-probabilistic approaches

9.3. Markov random fields as models

9.4. Estimating the parameters for a Markov field

9.5. Application to numerical ecology

9.6. Bibliography

List of Authors

Index

First published in France in 2003 by Hermes Science/Lavoisier entitled: Analyse des données © LAVOISIER, 2003

First published in Great Britain and the United States in 2009 by ISTE Ltd and John Wiley & Sons, Inc.

Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms and licenses issued by the CLA. Enquiries concerning reproduction outside these terms should be sent to the publishers at the undermentioned address:

ISTE Ltd, 27-37 St George’s Road, London SW19 4EU, UK

John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA

www.iste.co.uk

www.wiley.com

© ISTE Ltd, 2009

The rights of Gérard Govaert to be identified as the author of this work have been asserted by him in accordance with the Copyright, Designs and Patents Act 1988.

Library of Congress Cataloging-in-Publication Data

Analyse des données. English.

Data analysis / edited by Gérard Govaert.

p. cm.

Includes bibliographical references and index.

ISBN 978-1-84821-098-1

1. Mathematical statistics. I. Govaert, Gérard. II. Title.

QA276.D325413 2009

519.5--dc22

2009016228

British Library Cataloguing-in-Publication Data

A CIP record for this book is available from the British Library

ISBN: 978-1-84821-098-1

Preface

Statistical analysis has traditionally been separated into two phases: an exploratory phase, drawing on a set of descriptive and graphical techniques, and a decisional phase, based on probabilistic models. Some of the tools employed as part of the exploratory phase belong to descriptive statistics, whose elementary exploratory methods consider only a very limited number of variables. Other tools belong to data analysis, the subject matter of this book. This topic comprises more elaborate exploratory methods to handle multidimensional data, and is often seen as stepping beyond a purely exploratory context.

The first part of this book is concerned with methods for obtaining the pertinent dimensions from a collection of data. The variables so obtained provide a synthetic description, often leading to a graphical representation of the data. A considerable number of methods have been developed, adapted to different data types and different analytical goals. Chapters 1 and 2 discuss two reference methods, namely Principal Component Analysis (PCA) and Correspondence Analysis (CA), which we illustrate with examples from statistical process control and sensory analysis. Chapter 3 looks at a family of methods known as Projection Pursuit, less well known but with a promising future. These methods can be seen as an extension of PCA and CA in which the structures being sought can be specified. Multidimensional scaling methods, discussed in Chapter 4, seek to represent proximity matrix data in a low-dimensional Euclidean space. Chapter 5 is devoted to functional data analysis, where a function such as a temperature or rainfall curve, rather than a simple numerical vector, is used to characterize individuals.

The second part is concerned with methods of clustering, which seek to organize data into homogeneous classes. These methods provide an alternative means, often complementary to those discussed in the first part, of synthesizing and analyzing data. In view of the clear link between clustering and discriminant analysis (in pattern recognition the former is termed unsupervised and the latter supervised learning), Chapter 6 gives a general introduction to discriminant analysis. Chapter 7 then provides an overall picture of clustering. The statistical interpretation of clustering in terms of mixtures of probability distributions is discussed in Chapter 8, and Chapter 9 looks at how this approach can be applied to spatial data.

I would like to express my heartfelt thanks to all the authors who were involved in this publication. Without their expertise, their professionalism, their invaluable contributions and the wealth of their experience, it would not have been possible.

Gérard GOVAERT

Chapter 1

Principal Component Analysis: Application to Statistical Process Control

1.1. Introduction

Principal component analysis (PCA) is an exploratory statistical method for graphical description of the information present in large datasets. In most applications, PCA consists of studying p variables measured on n individuals. When n and p are large, the aim is to synthesize the huge quantity of information into an easy and understandable form.

Unidimensional or bidimensional studies can be performed on variables using graphical tools (histograms, box plots) or numerical summaries (mean, variance, correlation). However, in a multidimensional context these simple preliminary studies are insufficient, since they do not take into account the possible relationships between variables, which are often the most important point.
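The limitation of purely univariate summaries can be illustrated with a minimal sketch. The three small series below are hypothetical values chosen for illustration, not data from this chapter: two variables share exactly the same mean and variance, yet their relationships with a third variable are opposite.

```python
from math import sqrt
from statistics import mean, pvariance


def correlation(a, b):
    """Linear correlation of two equally long sequences (weights 1/n)."""
    ma, mb = mean(a), mean(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b)) / len(a)
    return cov / sqrt(pvariance(a) * pvariance(b))


# Hypothetical illustrative values.
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]    # increases with x
y_down = [10, 8, 6, 4, 2]  # same values in reverse order

# The univariate summaries of y_up and y_down are identical...
same_summaries = (
    mean(y_up) == mean(y_down) and pvariance(y_up) == pvariance(y_down)
)

# ...but their relationships with x are opposite.
r_up = correlation(x, y_up)
r_down = correlation(x, y_down)
print(same_summaries, r_up, r_down)
```

Only a summary that involves both variables at once, here the correlation, distinguishes the two situations.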

Principal component analysis is often considered as the basic method of factor analysis, which aims to find linear combinations of the p variables called components used to visualize the observations in a simple way. Because it transforms a large number of correlated variables into a few uncorrelated principal components, PCA is a dimension reduction method. However, PCA can also be used as a multivariate outlier detection method, especially by studying the last principal components. This property is useful in multidimensional quality control.
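As a minimal sketch of this idea, the code below computes the two principal components of a hypothetical two-variable dataset and checks that they are uncorrelated. The data values and function names are assumptions made for illustration. The closed-form rotation angle used here is valid only for two variables; it is a shortcut for this example, not the general procedure, which relies on an eigendecomposition of the covariance or correlation matrix.

```python
from math import atan2, cos, sin


def centered(col):
    """Subtract the arithmetic mean from every value of a column."""
    m = sum(col) / len(col)
    return [v - m for v in col]


def cov(a, b):
    """Covariance of two already centered columns, uniform weights 1/n."""
    return sum(u * v for u, v in zip(a, b)) / len(a)


# Hypothetical data: two correlated variables measured on six individuals.
x1 = centered([2.0, 4.0, 6.0, 8.0, 10.0, 12.0])
x2 = centered([1.0, 3.0, 2.0, 5.0, 4.0, 6.0])

s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)

# For a 2x2 covariance matrix, the direction of maximum variance makes an
# angle theta with the first axis, where tan(2 * theta) = 2 * s12 / (s11 - s22).
theta = 0.5 * atan2(2.0 * s12, s11 - s22)
u1 = (cos(theta), sin(theta))    # first principal axis
u2 = (-sin(theta), cos(theta))   # second principal axis, orthogonal to u1

# Principal components: linear combinations of the centered variables.
c1 = [u1[0] * a + u1[1] * b for a, b in zip(x1, x2)]
c2 = [u2[0] * a + u2[1] * b for a, b in zip(x1, x2)]

var1, var2 = cov(c1, c1), cov(c2, c2)
print("covariance between components:", cov(c1, c2))
print("variances of components:", var1, var2)
```

The two components carry the same total variance as the original variables (var1 + var2 equals s11 + s22), but the first one concentrates most of it, which is what makes a reduction to fewer dimensions possible.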

1.2. Data table and related subspaces

1.2.1. Data and their characteristics

Data are generally represented in a rectangular table with n rows for the individuals and p columns corresponding to the variables. Choosing individuals and variables to analyze is a crucial phase which has an important influence on PCA results. This choice has to take into account the aim of the study; in particular, the variables have to describe the phenomenon being analyzed.

Usually PCA deals with numerical variables. However, ordinal variables such as ranks can also be processed by PCA. Later in this chapter, we present the concept of supplementary variables, which makes it possible to integrate nominal variables afterwards.

1.2.1.1. Data table

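Let X be the (n, p) matrix of observations:

$$X = \begin{pmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{pmatrix}$$

where $x_{ij}$ is the value of variable j for individual i.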

Table 1.1 is an example of such a data matrix. Computations have been carried out using SPAD 5 software, version 5 1, kindly provided by J.-P. Gauchi.

The data file contains 57 brands of mineral water described by 11 variables defined in Table 1.2. The data come from the bottle labels. Numerical variables are homogeneous; they are all active variables (see section 1.4.3). A variable of a different kind, such as price, would be considered as a supplementary variable. On the other hand, qualitative variables such as country, type and whether the water is still or sparkling (PG) are necessarily supplementary variables.

Table 1.1. Data table

1.2.1.2. Summaries

1.2.1.2.1. Centroid

Let $g$ be the vector of arithmetic means of each of the p variables, defining the centroid:

$$g = (\bar{x}_1, \ldots, \bar{x}_p)^\top$$

where $\bar{x}_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij}$ is the arithmetic mean of the j-th column of X.

Table 1.2. Variable description

Variable   Description
Name       Complete water name as labeled on the bottle
Country    Identified by the official car registration letters; sometimes it is necessary to add a letter, for example Crete: GRC (Greece Crete)
Type       M for mineral water, S for spring water
PG         P for still water, G for sparkling water
CA         Calcium ions (mg/litre)
MG         Magnesium ions (mg/litre)
NA         Sodium ions (mg/litre)
K          Potassium ions (mg/litre)
SUL        Sulfate ions (mg/litre)
NO3        Nitrate ions (mg/litre)
HCO3       Hydrogen carbonate (bicarbonate) ions (mg/litre)
CL         Chloride ions (mg/litre)

However, it can be useful for some applications to use weights $p_i$ that vary from one individual to another, as with grouped data or a reweighted sample. These weights, which are positive numbers summing to 1, can be viewed as frequencies and are stored in a diagonal matrix of size n:

$$D = \operatorname{diag}(p_1, \ldots, p_n)$$
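The effect of individual weights on the centroid can be sketched as follows. The function name `centroid`, the data values and the weights are hypothetical, introduced only for illustration. The weights are kept as a plain list, which corresponds to the diagonal of the weight matrix, and uniform weights 1/n are assumed when none are given.

```python
def centroid(rows, weights=None):
    """Weighted mean of each column; uniform weights 1/n by default."""
    n = len(rows)
    if weights is None:
        weights = [1.0 / n] * n
    # Weights must be positive numbers summing to 1.
    if min(weights) <= 0 or abs(sum(weights) - 1.0) > 1e-12:
        raise ValueError("weights must be positive and sum to 1")
    p = len(rows[0])
    return [sum(w * row[j] for w, row in zip(weights, rows)) for j in range(p)]


# Hypothetical table: 3 individuals (rows) described by 2 variables (columns).
X = [[1.0, 10.0],
     [2.0, 20.0],
     [3.0, 60.0]]

g_uniform = centroid(X)                       # ordinary arithmetic means
g_weighted = centroid(X, [0.5, 0.25, 0.25])   # first individual counts double
print(g_uniform, g_weighted)
```

With uniform weights the result reduces to the ordinary arithmetic means, while unequal weights pull the centroid toward the individuals that carry more weight.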

1.2.1.2.2. Covariance matrix and correlation matrix

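We define the linear correlation coefficient between variables k and $\ell$ by:

$$r_{k\ell} = \frac{s_{k\ell}}{s_k \, s_\ell}$$

where $s_{k\ell}$ is the covariance between variables k and $\ell$, and $s_k$ and $s_\ell$ are their standard deviations.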
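A minimal sketch of this coefficient, computed as the covariance divided by the product of the standard deviations, is given below. The function name `weighted_correlation` and the two short series are hypothetical and are not taken from Table 1.1. Individual weights are assumed to be positive and to sum to 1, with uniform weights 1/n as the default.

```python
from math import sqrt


def weighted_correlation(xk, xl, weights=None):
    """Linear correlation between two variables with individual weights."""
    n = len(xk)
    if weights is None:
        weights = [1.0 / n] * n  # uniform weights
    # Weighted means of each variable.
    mk = sum(w * a for w, a in zip(weights, xk))
    ml = sum(w * b for w, b in zip(weights, xl))
    # Weighted covariance and standard deviations.
    s_kl = sum(w * (a - mk) * (b - ml) for w, a, b in zip(weights, xk, xl))
    s_k = sqrt(sum(w * (a - mk) ** 2 for w, a in zip(weights, xk)))
    s_l = sqrt(sum(w * (b - ml) ** 2 for w, b in zip(weights, xl)))
    return s_kl / (s_k * s_l)


# Hypothetical measurements of two variables on four individuals.
var_k = [10.0, 20.0, 30.0, 40.0]
var_l = [3.0, 5.0, 4.0, 8.0]

r = weighted_correlation(var_k, var_l)
print(round(r, 4))
```

The coefficient is symmetric in the two variables and always lies between -1 and 1, so the full set of coefficients can be arranged in a symmetric matrix with ones on the diagonal.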
