This book presents material on both the analysis of the classical concepts of correlation and the development of their robust versions, and it discusses the related concepts of correlation matrices, partial correlation, canonical correlation, and rank correlations, together with the corresponding robust and non-robust estimation procedures. Every chapter contains a set of examples with simulated and real-life data. Key features:
* Makes modern and robust correlation methods readily available and understandable to practitioners, specialists, and consultants working in various fields.
* Focuses on the implementation of the methodology and the application of robust correlation with R.
* Introduces the main approaches in robust statistics, such as Huber's minimax approach and Hampel's approach based on influence functions.
* Explores various robust estimates of the correlation coefficient, including the minimax variance and bias estimates as well as the most B- and V-robust estimates.
* Contains applications of robust correlation methods to exploratory data analysis, multivariate statistics, statistics of time series, and real-life data.
* Includes an accompanying website featuring computer code and datasets.
* Features exercises and examples throughout the text using both small and large data sets.
Theoretical and applied statisticians, specialists in multivariate statistics, robust statistics, robust time series analysis, data analysis, and signal processing will benefit from this book. Practitioners who use correlation-based methods in their work, as well as postgraduate students in statistics, will also find this book useful.
Page count: 426
Year of publication: 2016
Cover
Wiley Series in Probability and Statistics
Title Page
Copyright
Dedication
Preface
Acknowledgements
About the companion website
Chapter 1: Introduction
1.1 Historical Remarks
1.2 Ontological Remarks
References
Chapter 2: Classical Measures of Correlation
2.1 Preliminaries
2.2 Pearson's Correlation Coefficient: Definitions and Interpretations
2.3 Nonparametric Measures of Correlation
2.4 Informational Measures of Correlation
2.5 Summary
References
Chapter 3: Robust Estimation of Location
3.1 Preliminaries
3.2 Huber's Minimax Approach
3.3 Hampel's Approach Based on Influence Functions
3.4 Robust Estimation of Location: A Sequel
3.5 Stable Estimation
3.6 Robustness Versus Gaussianity
3.7 Summary
References
Chapter 4: Robust Estimation of Scale
4.1 Preliminaries
4.2 M- and L-Estimates of Scale
4.3 Huber Minimax Variance Estimates of Scale
4.4 Highly Efficient Robust Estimates of Scale
4.5 Monte Carlo Experiment
4.6 Summary
References
Chapter 5: Robust Estimation of Correlation Coefficients
5.1 Preliminaries
5.2 Main Groups of Robust Estimates of the Correlation Coefficient
5.3 Asymptotic Properties of the Classical Estimates of the Correlation Coefficient
5.4 Asymptotic Properties of Nonparametric Estimates of Correlation
5.5 Bivariate Independent Component Distributions
5.6 Robust Estimates of the Correlation Coefficient Based on Principal Component Variances
5.7 Robust Minimax Bias and Variance Estimates of the Correlation Coefficient
5.8 Robust Correlation via Highly Efficient Robust Estimates of Scale
5.9 Robust M-Estimates of the Correlation Coefficient in Independent Component Distribution Models
5.10 Monte Carlo Performance Evaluation
5.11 Robust Stable Radical M-Estimate of the Correlation Coefficient of the Bivariate Normal Distribution
5.12 Summary
References
Chapter 6: Classical Measures of Multivariate Correlation
6.1 Preliminaries
6.2 Covariance Matrix and Correlation Matrix
6.3 Sample Mean Vector and Sample Covariance Matrix
6.4 Families of Multivariate Distributions
6.5 Asymptotic Behavior of Sample Covariance Matrix and Sample Correlation Matrix
6.6 First Uses of Covariance and Correlation Matrices
6.7 Working with the Covariance Matrix–Principal Component Analysis
6.8 Working with Correlations–Canonical Correlation Analysis
6.9 Conditionally Uncorrelated Components
6.10 Summary
References
Chapter 7: Robust Estimation of Scatter and Correlation Matrices
7.1 Preliminaries
7.2 Multivariate Location and Scatter Functionals
7.3 Influence Functions and Asymptotics
7.4 M-functionals for Location and Scatter
7.5 Breakdown Point
7.6 Use of Robust Scatter Matrices
7.7 Further Uses of Location and Scatter Functionals
7.8 Summary
References
Chapter 8: Nonparametric Measures of Multivariate Correlation
8.1 Preliminaries
8.2 Univariate Signs and Ranks
8.3 Marginal Signs and Ranks
8.4 Spatial Signs and Ranks
8.5 Affine Equivariant Signs and Ranks
8.6 Summary
References
Chapter 9: Applications to Exploratory Data Analysis: Detection of Outliers
9.1 Preliminaries
9.2 State of the Art
9.3 Problem Setting
9.4 A New Measure of Outlier Detection Performance
9.5 Robust Versions of the Tukey Boxplot with Their Application to Detection of Outliers
9.6 Robust Bivariate Boxplots and Their Performance Evaluation
9.7 Summary
References
Chapter 10: Applications to Time Series Analysis: Robust Spectrum Estimation
10.1 Preliminaries
10.2 Classical Estimation of a Power Spectrum
10.3 Robust Estimation of a Power Spectrum
10.4 Performance Evaluation
10.5 Summary
References
Chapter 11: Applications to Signal Processing: Robust Detection
11.1 Preliminaries
11.2 Robust Minimax Detection Based on a Distance Rule
11.3 Robust Detection of a Weak Signal with Redescending M-Estimates
11.4 A Unified Neyman–Pearson Detection of Weak Signals in a Fusion Model with Fading Channels and Non-Gaussian Noises
11.5 Summary
References
Chapter 12: Final Remarks
12.1 Points of Growth: Open Problems in Multivariate Statistics
12.2 Points of Growth: Open Problems in Applications
Wiley Series in Probability and Statistics
Index
WILEY SERIES IN PROBABILITY AND STATISTICS
End User License Agreement
Chapter 2: Classical Measures of Correlation
Figure 2.1 Data with positive correlation.
Figure 2.5 Linear dependent data correlation.
Figure 2.6 Linear dependent data correlation and determination coefficients.
Figure 2.9 Approximately nonlinear dependent data correlation and determination coefficients.
Figure 2.10 Ellipse of equal probability for the standard bivariate normal distribution with the major and minor diameters dependent on the correlation coefficient.
Figure 2.11 The sample correlation coefficient as the cosine of the angle between the variable vectors.
Figure 2.12 The quadrant correlation coefficient.
Chapter 3: Robust Estimation of Location
Figure 3.1
Figure 3.2 Tukey's sensitivity curve for the sample mean
Figure 3.3 Tukey's sensitivity curve for the sample median
Figure 3.4 Influence function for the population mean
Figure 3.5 Influence function for the population median
Figure 3.6 Hampel's score function
Figure 3.7 Huber's skipped mean score function
Figure 3.8 Mosteller-Tukey's biweight score function
Figure 3.9 Optimal score function for the class of nondegenerate distribution densities
Figure 3.10 Optimal score function for the class of distribution densities with a bounded variance
Figure 3.11 Optimal score function for the class of finite distribution densities
Figure 3.12 Optimal score function for the class of approximate finite distribution densities
Figure 3.13 Optimal score function for the class of nondegenerate distribution densities with a bounded variance in the zone intermediate between the two classes
Figure 3.14 Efficiency and stability of the estimates of location at the normal distribution with density
Figure 3.16 Efficiency and stability of the estimates of location at the Cauchy distribution with density
Figure 3.17 Standard Gaussian distribution density
Chapter 4: Robust Estimation of Scale
Figure 4.1 Score function for the standard deviation.
Figure 4.3 Score function for the median absolute deviation.
Figure 4.4 Score function for the minimax variance estimate of scale: the trimmed standard deviation.
Figure 4.5 The median absolute deviation score function at the standard normal density.
Figure 4.6 Efficiency of the estimate of scale at the standard normal.
Figure 4.7 Breakdown point of the estimate of scale.
Figure 4.8 Influence function of the estimate of scale.
Figure 4.9 Typical dependence of Monte Carlo accuracy on the number of trials.
Figure 4.10 Score function for the Huber -estimate of scale.
Figure 4.11 Standardized variance (axis ) versus average absolute bias (axis ) performance at the standard normal distribution: .
Figure 4.14 Average bias dependence on the contamination fraction at the contaminated normal distribution: .
Figure 4.12 Standardized variance (axis ) versus average absolute bias (axis ) performance at the standard normal distribution: .
Figure 4.13 Average bias dependence on the contamination fraction at the contaminated normal distribution: .
Chapter 5: Robust Estimation of Correlation Coefficients
Figure 5.1 Impact of outliers on the Pearson correlation coefficient
Figure 5.2 Asymptotic relative efficiencies of nonparametric correlation measures (axis ) versus the correlation coefficient (axis ) of the normal distribution
Chapter 9: Applications to Exploratory Data Analysis: Detection of Outliers
Figure 9.1 Tukey's univariate boxplot
Figure 9.2 -boxplot
Figure 9.3 Hypotheses testing under shift contamination
Figure 9.4 The relationship between the ROC curve and -mean
Figure 9.5 -bivariate boxplot
Figure 9.6 -bivariate boxplot realization: , , 4 suspicious observations
Figure 9.7 Legend
Figure 9.8 Shift contamination:
Figure 9.10 Shift contamination:
Figure 9.11 Scale contamination:
Figure 9.13 Scale contamination:
Figure 9.12
Figure 9.14 Ellipse deviation estimate
Figure 9.15 Ellipse shape deviations of the Tukey bagplot and -boxplot
Figure 9.16 The variances of location estimates: grey—-boxplot, black—Tukey's bagplot
Chapter 10: Applications to Time Series Analysis: Robust Spectrum Estimation
Figure 10.1 Median Fourier transform power spectrum estimate breakdown point property: the mixture of two sinusoids model with the and duration intervals.
Figure 10.3 Median Fourier transform power spectrum estimate breakdown point property: the median periodogram power spectrum estimate.
Figure 10.4 Power spectrum estimation with robust filter-cleaners: model with AO contamination; , (Spangl 2008).
Figure 10.7 Power spectra estimation: model with AO contamination; , .
Figure 10.8 Smoothed classical power spectrum estimation in the disorder model with contamination; , .
Figure 10.9 Smoothed robust power spectrum estimation in the disorder model with contamination; , .
Figure 10.6 Power spectra estimation by the Yule–Walker method: model with AO contamination; , .
Chapter 11: Applications to Signal Processing: Robust Detection
Figure 11.1 Error probability in the Gaussian noise: , .
Figure 11.3 Error probability in the generalized Gaussian noise close to uniform: asymptotics, , .
Figure 11.2 Error probability in the contaminated Gaussian noise: asymptotics, , .
Figure 11.4 Probability of missing in the Gaussian noise: , , .
Figure 11.6 Probability of missing in the Gaussian noise: , .
Figure 11.7 The parallel fusion model with the sensor nodes and fusion center.
Figure 11.8 ROC curves for detection in the Gaussian noise at SNR = 0, 10, 15, and 20 dB with =100.
Figure 11.9 ROC curves for detection in the Cauchy noise at GSNR = 0, 10, 20, and 30 dB with =100.
Figure 11.10 ROC curves for detection in the Laplace noise at SNR = 0, 10, and 20 dB with =100.
Chapter 2: Classical Measures of Correlation
Table 2.1 Pearson's correlation between a random observation and its sign
Table 2.2 Pearson's correlation between a random observation and its rank
Chapter 3: Robust Estimation of Location
Table 3.1 Efficiency and stability of -estimates of location
Chapter 4: Robust Estimation of Scale
Table 4.1 Computation time in microseconds
Table 4.5 Monte Carlo means and standardized variances in the Cauchy distribution model
Table 4.2 Monte Carlo means and standardized variances in the scale-only standard normal distribution model
Table 4.3 Monte Carlo means and standardized variances in the location-scale standard normal distribution model
Table 4.4 Monte Carlo means and standardized variances in the Tukey gross error distribution model (, )
Chapter 5: Robust Estimation of Correlation Coefficients
Table 5.1 Normal distribution :
Table 5.8 Bivariate Cauchy -distribution :
Table 5.2 Normal distribution :
Table 5.3 Contaminated normal distribution : , , ,
Table 5.4 Contaminated normal : , , ,
Table 5.6 ICD Cauchy distribution :
Chapter 9: Applications to Exploratory Data Analysis: Detection of Outliers
Table 9.1 Boundaries of the power and false alarm rate
Table 9.2 -means for detection tests under scale contamination: ,
Table 9.3 -means for detection tests under shift contamination: ,
Table 9.4 -means for detection tests under shift contamination with the different values of :
Table 9.5 -means for boxplot tests applied to server data
Table 9.6 Types of contaminated Gaussian distribution densities
Chapter 11: Applications to Signal Processing: Robust Detection
Table 11.1 The factor in (11.13) for various noises
Table 11.2 Detection efficiency and stability for various detectors and noise distributions (the best detector performances for each noise distribution are boldfaced except the maximum likelihood and the minimum error sensitivity cases)
Table 11.3 The factor for various detectors and noise distributions (the best detector performances for each noise distribution are boldfaced except the maximum likelihood case)
WILEY SERIES IN PROBABILITY AND STATISTICS
Established by WALTER A. SHEWHART and SAMUEL S. WILKS
Editors: David J. Balding, Noel A. C. Cressie, Garrett M. Fitzmaurice, Geof H. Givens, Harvey Goldstein, Geert Molenberghs, David W. Scott, Adrian F. M. Smith, Ruey S. Tsay, Sanford Weisberg
Editors Emeriti: J. Stuart Hunter, Iain M. Johnstone, Joseph B. Kadane, Jozef L. Teugels
A complete list of the titles in this series appears at the end of this volume.
Georgy L. Shevlyakov
Peter the Great Saint-Petersburg Polytechnic University, Russia
Hannu Oja
University of Turku, Finland
This edition first published 2016 © 2016 by John Wiley and Sons Ltd
Registered office
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom
For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com.
The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.
Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. It is sold on the understanding that the publisher is not engaged in rendering professional services and neither the publisher nor the author shall be liable for damages arising herefrom. If professional advice or other expert assistance is required, the services of a competent professional should be sought.
Library of Congress Cataloging-in-Publication Data
Names: Shevlyakov, Georgy L. | Oja, Hannu.
Title: Robust correlation : theory and applications / Georgy L. Shevlyakov, Peter The Great Saint-Petersburg Polytechnic University, Russia, Hannu Oja, University Of Turku, Finland.
Description: Chichester, West Sussex : John Wiley & Sons, Inc., 2016. | Series: Wiley series in probability and statistics | Includes bibliographical references and index.
Identifiers: LCCN 2016017308 (print) | LCCN 2016020693 (ebook) | ISBN 9781118493458 (cloth) | ISBN 9781119264538 (pdf) | ISBN 9781119264491 (epub)
Subjects: LCSH: Correlation (Statistics) | Mathematical statistics.
Classification: LCC QA278.2 .S4975 2016 (print) | LCC QA278.2 (ebook) | DDC 519.5/37–dc23
LC record available at https://lccn.loc.gov/2016017308
A catalogue record for this book is available from the British Library.
To our families
Robust statistics as a branch of mathematical statistics appeared due to the seminal works of John W. Tukey (1960), Peter J. Huber (1964), and Frank R. Hampel (1968). It has been intensively developed since the 1960s and has by now definitely taken shape. The term "robust" (Latin: strong, sturdy, tough, vigorous) as applied to statistical procedures was proposed by George E.P. Box (1953).
The principal reason for research in this field of statistics is of a general mathematical nature. Optimality (accuracy) and stability (reliability) are the mutually complementary characteristics of many mathematical procedures. It is well-known that the performance of optimal procedures is, as a rule, rather sensitive to “small” perturbations of prior assumptions. In mathematical statistics, the classical example of such an unstable optimal procedure is given by the least squares method: its performance may become disastrously poor under small deviations from normality.
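This instability is easy to reproduce numerically. The following R sketch is ours, not the book's; the 10% contamination level and the tenfold inflated scale are arbitrary illustrative choices. It compares the sample mean, i.e., the least squares estimate of location, with the sample median under a contaminated normal model:

# Monte Carlo comparison of the sample mean (the least squares estimate of
# location) and the sample median under 10% gross-error contamination
set.seed(1)
n <- 100; trials <- 2000; eps <- 0.1
est <- replicate(trials, {
  gross <- rbinom(n, 1, eps) == 1                  # which observations are gross errors
  x <- ifelse(gross, rnorm(n, sd = 10), rnorm(n))  # contaminated normal sample
  c(mean = mean(x), median = median(x))
})
apply(est, 1, var)  # the variance of the mean degrades far more than that of the median

At the pure normal model the mean is the more efficient of the two estimates; a small fraction of gross errors reverses the ranking.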
Roughly speaking, robustness means stability of statistical inference under the departures from the accepted distribution models. Since the term “stability” is generally overloaded in mathematics, the term “robustness” may be regarded as its synonym.
Peter J. Huber and Frank R. Hampel contributed much to robust statistics: they proposed and developed two principal approaches to robustness, namely the minimax approach and the approach based on influence functions, which have been applied to almost all areas of statistics: robust estimation of location, scale, regression, and multivariate model parameters, as well as robust hypothesis testing. It is remarkable that although robust statistics involves mathematically highly refined asymptotic tools, robust methods nevertheless show satisfactory performance in small samples and are quite useful in applications.
The main topic of our book is robust correlation. Correlation analysis is widely used in multivariate statistics and data analysis: computing correlation and covariance matrices is both an initial and a basic step in most procedures of multivariate statistics, for example, in principal component analysis, factor and discriminant analysis, detection of multivariate outliers, etc.
Our work presents new results generally related to robust correlation and data analysis technologies, with emphasis both on theoretical aspects and on the practical needs of data processing: we have written the book to be accessible both to users of statistical methods and to professional statisticians. However, the mathematical prerequisites include the basics of calculus, linear algebra, and mathematical statistics.
Chapter 1 is an introduction to the book, providing historical aspects of the origin and development of the notion of "correlation" in science as well as ontological remarks on the subject of statistics and data processing. Chapter 2 delivers a survey of the classical measures of correlation, aimed mostly at estimating linear dependencies.
Chapter 3 presents Huber's and Hampel's principal approaches to robustness in mathematical statistics, with novel additions to them, namely a stable estimation approach and an essay on robustness versus Gaussianity, the latter of which could be helpful for students and their teachers. Except for a few paragraphs on the application of Huber's minimax approach to distribution classes of a non-neighborhood nature, Chapters 1 to 3 are accessible to a wide audience.
Chapters 4 to 8 comprise the core of the book and contain most of the new theoretical and experimental (Monte Carlo) results. Chapter 4 treats the problems of robust estimation of a scale parameter, and the results obtained are used in Chapter 5 for the design of highly robust and efficient estimates of a correlation coefficient, including robust minimax (in the Huber sense) estimates. Chapter 6 provides an overview of classical multivariate correlation measures and inference tools based on the covariance and correlation matrices. Chapter 7 deals with robust correlation measures and inference tools based on various robust covariance matrix functionals and estimates; in particular, robust versions of principal component and canonical correlation analysis are given. Chapter 8 treats correlation measures and inference tools based on various concepts of univariate and multivariate signs and ranks.
Chapters 9 to 11 are devoted to applications of the aforementioned robust estimates of correlation, as well as of location and scale, to different problems of statistical data and signal analysis, with a few examples of real-life data and signal processing. Chapter 9 is confined to applications to exploratory data analysis and its technologies, mostly treating the important problem of detecting outliers in the data. Chapter 10 outlines a few novel approaches to robust estimation of time series power spectra: although the results obtained are preliminary, they are promising and deserve further thorough study. In Chapter 11, various problems of robust signal detection are posed and treated, in whose solution Huber's minimax and stable approaches to robust detection are successfully exploited.
Chapter 12 outlines several open problems in robust multivariate analysis and its applications.
From the aforementioned it follows that the book has two main blocks: Chapters 1 to 3 and 9 to 11 are aimed at applied statisticians and users of statistics, while Chapters 4 to 8 focus on the theoretical aspects of robust correlation.
Most of the contents of the book, namely Chapters 1 to 5 and 9 to 11, have been written by the first author. The second author contributed Chapters 6 to 8 on general multivariate analysis.
John W. Tukey, Peter J. Huber, Frank R. Hampel, Elvezio M. Ronchetti, and Peter J. Rousseeuw have essentially influenced, directly or indirectly, our views on robustness and data analysis.
The first author is deeply grateful to his teachers and colleagues for their helpful and constructive discussions, namely, to Igor B. Chelpanov, Peter Filzmoser, Eugene P. Gilbo, Jana Jureckova, Abram M. Kagan, Vladimir Ya. Katkovnik, Yuriy S. Kharin, Kiseon Kim, Lev B. Klebanov, Stephan Morgenthaler, Yakov Yu. Nikitin, Boris T. Polyak, Alexander M. Shurygin, and Nikita O. Vilchevski.
Some results presented in Chapters 4, 5, and 9 to 11 by the first author are based on the Ph.D. and M.Sc. dissertations of his former students, including Kliton Andrea, JinTae Park, Pavel Smirnov, Galina Lavrentyeva, Nickolay Lyubomishchenko, and Nikita Vassilevskiy—we would like to thank them.
Research on multivariate analysis reported by the second author is to some degree based on the thesis works of several of his former students, including Jyrki Möttönen, Samuli Visuri, Esa Ollila, Sara Taskinen, Seija Sirkiä, and Klaus Nordhausen. We wish to thank them all. The second author is naturally also indebted to many colleagues and coauthors for valuable discussions, and expresses his sincere thanks for discussions and cooperation in this specific research area to Christopher Croux, Tom Hettmansperger, Annaliisa Kankainen, Visa Koivunen, Ron Randles, Bob Serfling, and Dave Tyler.
We are also grateful to Igor Bezdvornyh and Maksim Sovetnikov for their technical help in the preparation of the manuscript.
Finally, we wish to thank our wives, Elena and Ritva, for their patience, support, and understanding.
Don't forget to visit the companion website for this book:
www.wiley.com/go/Shevlyakov/Robust
There you will find valuable material designed to enhance your learning, including:
Datasets
R codes
Scan this QR code to visit the companion website.
This book is mostly about correlation and association, and partly about regression, i.e., about those areas of science where one studies the dependencies between random variables that mathematically describe the relations between observed phenomena and the features associated with them. Evidently, these concepts and terms first appeared in the applied sciences, not in mathematics. Below we briefly overview the historical aspects of these concepts.
The word "correlation" is of late Latin origin, meaning "association", "connection", "correspondence", "interdependence", "relationship", but a relationship not in the deterministic functional form conventional at that time.
The term "correlation" was introduced into science by the French naturalist Georges Cuvier (1769–1832), one of the major figures in the natural sciences of the early 19th century, who founded paleontology and comparative anatomy. Cuvier discovered and studied the relationships between the parts of animals, between the structure of animals and their mode of existence, between the species of animals and plants, and many others. This experience led him to establish the general principles of "the correlation of parts" and of "the functional correlation" (Rudwick 1997):
Today comparative anatomy has reached such a point of perfection that, after inspecting a single bone, one can often determine the class, and sometimes even the genus of the animal to which it belonged, above all if that bone belonged to the head or the limbs. … This is because the number, direction, and shape of the bones that compose each part of an animal's body are always in a necessary relation to all the other parts, in such a way that – up to a point – one can infer the whole from any one of them and vice versa.
From Cuvier to Galton, correlation was understood as a qualitatively described relationship, not deterministic but of a statistical nature, although observed at that time within a rather narrow range of phenomena.
The notion of regression is connected with the great names of Laplace, Legendre, Gauss, and Galton (1885), who coined the term. Laplace (1799) was the first to propose a method for processing astronomical data, namely the least absolute values method. Legendre (1805) and Gauss (1809), independently of each other, introduced the least squares method.
Francis Galton (1822–1911), a British anthropologist, biologist, psychologist, and meteorologist, understood correlation as an interrelationship on average between arbitrary random variables (Galton 1888):
Two variable organs are said to be co-related when the variation of the one is accompanied on the average by more or less variation of the other, and in the same direction.… It is easy to see that co-relation must be the consequence of the variations of the two organs being partly due to common causes.… If they were in no respect due to common causes, the co-relation would be nil.
Correlation analysis (this term was also coined by Galton) deals with estimating the value of correlation by numerical indexes or coefficients.
Similarly to Cuvier, Galton arrived at regression dependence by observing live nature, in particular by processing his heredity and sweet peas data (Galton 1894). Regression characterizes the correlation dependence between random variables functionally, on average. Studying the sizes of sweet pea seeds, he noticed that the offspring seeds did not tend to reproduce the size of their parents, being closer to the population mean than their parents were. Namely, the seeds were smaller than their parents in the case of large parent sizes, and vice versa. Galton called this dependence regression, since reverse changes had been observed; at first he used the term "the law of reversion". Further studies showed that, on average, the offspring regression to the population mean was proportional to the parent deviations from it, which allowed the observed dependence to be described by a linear function. A similar linear regression was described by Galton as a result of processing the heights of 930 adult children and their 205 parents (Galton 1894).
The term "regression" became popular, and it is now used for functional dependencies on average between any random variables. Using modern terminology, we may say that Galton considered the slope of the simple linear regression line as a measure of correlation (Galton 1888):
Let $y$ be the deviation of the subject [in units of the probable error, $Q$], whichever of the two variables may be taken in that capacity; and let $x_1, x_2, x_3, \dots$ be the corresponding deviations of the relative, and let the mean of these be $X$. Then we find: (1) that $y = rX$ for all values of $y$; (2) that $r$ is the same, whichever of the two variables is taken for the subject; (3) that $r$ is always less than 1; (4) that $r$ measures the closeness of co-relation.
Now we briefly comment on the above-mentioned properties (1)–(4): the first is just the simple linear regression equation between the standardized variables $y$ and $X$; the second means that the co-relation is symmetric with regard to the two variables; the third and fourth show that Galton had not yet recognized the idea of negative correlation: stating that $r$ could not be greater than 1, he evidently understood $r$ as a positive measure of "co-relation". Originally $r$ stood for the regression slope, and that is really so for the standardized variables; Galton perceived the correlation coefficient as a scale-invariant regression slope.
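Galton's reading of $r$ as the slope for standardized variables is easy to verify numerically; in this R sketch of ours (simulated data, arbitrary parameters), the least squares slope for the standardized variables coincides with the sample correlation coefficient:

set.seed(2)
x <- rnorm(200); y <- 0.6 * x + rnorm(200, sd = 0.8)
zx <- (x - mean(x)) / sd(x)  # standardized variables
zy <- (y - mean(y)) / sd(y)
coef(lm(zy ~ zx))[["zx"]]    # regression slope for the standardized variables
cor(x, y)                    # Pearson's r: the same number up to rounding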
Galton contributed much to science by studying the problems of heredity of qualitative and quantitative features, which he examined numerically on the basis of the concept of correlation. The data on demography, heredity, and sociology collected by Galton, with the corresponding numerical examples of computed correlations, are used to the present day.
Karl Pearson (1857–1936), a British mathematician, statistician, biologist, and philosopher, wrote out the explicit formulas for the population product-moment correlation coefficient (Pearson 1895)
$$\rho = \rho(X, Y) = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sqrt{E(X - \mu_X)^2 \, E(Y - \mu_Y)^2}} \qquad (1.1)$$
and its sample version
$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}} \qquad (1.2)$$
(here $\bar{x}$ and $\bar{y}$ are the sample means of the observations $x_i$ and $y_i$ of the random variables $X$ and $Y$). However, Pearson did not definitely distinguish the population and sample versions of the correlation coefficient, as is commonly done at present.
Thus, on the one hand, the sample correlation coefficient $r$ is a statistical counterpart of the correlation coefficient $\rho = \mathrm{cov}(X, Y)/(\sigma_X \sigma_Y)$ of a bivariate distribution, where $\sigma_X^2$, $\sigma_Y^2$, and $\mathrm{cov}(X, Y)$ are the variances and the covariance of the random variables $X$ and $Y$, respectively.
On the other hand, it is an efficient maximum likelihood estimate of the correlation coefficient $\rho$ of the bivariate normal distribution (Kendall and Stuart 1963) with density
$$N(x, y;\, \mu_1, \mu_2, \sigma_1, \sigma_2, \rho) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1 - \rho^2}} \exp\left\{-\frac{1}{2(1 - \rho^2)}\left[\frac{(x - \mu_1)^2}{\sigma_1^2} - \frac{2\rho(x - \mu_1)(y - \mu_2)}{\sigma_1\sigma_2} + \frac{(y - \mu_2)^2}{\sigma_2^2}\right]\right\} \qquad (1.3)$$
where $\mu_1 = EX$, $\mu_2 = EY$, $\sigma_1^2 = \operatorname{var} X$, $\sigma_2^2 = \operatorname{var} Y$.
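A quick R check of formula (1.2), ours rather than the book's, against the built-in estimator:

set.seed(3)
x <- rnorm(50); y <- 0.5 * x + rnorm(50)
r <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))  # formula (1.2)
all.equal(r, cor(x, y))  # TRUE: any 1/(n - 1) normalizations cancel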
Galton (1888) derived the bivariate normal distribution (1.3), and he was the first to use it to describe the scatter of the frequencies of children's and parents' statures. Pearson noted that "in 1888 Galton had completed the theory of bivariate normal correlation" (Pearson 1920).
Like Galton, Auguste Bravais (1846), a French naval officer and astronomer, came very near to the definition (1.1) when he called one parameter of the bivariate normal distribution “une correlation”, but he did not recognize it as a measure of the interrelationship between variables. However, “his work in Pearson's hands proved useful in framing formal approaches in those areas” (Stigler 1986).
Pearson's formulas (1.1) and (1.2) proved to be fruitful for studying dependencies: correlation analysis and most multivariate statistical analysis tools are based on the pairwise Pearson correlations; we may also add the correlation and spectral theories of stochastic processes, etc.
Since the time Pearson introduced the sample correlation coefficient (1.2), many other measures of correlation aimed at estimating the closeness of interrelationship have been used (the coefficients of association, determination, contingency, etc.). Some of them were proposed by Karl Pearson himself (1920).
It would not be out of place to note the contributions to correlation analysis of other British statisticians.
Ronald Fisher (1890–1962) is one of the creators of mathematical statistics. In particular, he originated the analysis of variance and, together with Karl Pearson, stands at the beginning of the theory of hypothesis testing. He introduced the notion of a sufficient statistic and proposed the maximum likelihood method (Fisher 1922). Fisher also paid much attention to correlation analysis: his tools for verifying the significance of correlation under the normal law are in use to this day.
George Yule (1871–1951) is a prominent statistician of the first half of the 20th century. He contributed much to the statistical theories of regression, correlation (Yule's coefficient of contingency between random events), and spectral analysis.
Maurice Kendall (1907–1983) is one of the creators of nonparametric statistics, in particular of nonparametric correlation analysis (the Kendall $\tau$ rank correlation) (Kendall 1938). It is noteworthy that he is the coauthor of a classical course in mathematical statistics (Kendall and Stuart 1962, 1963, 1968).
In what follows, we present their contributions to correlation analysis in more detail.
Our personal research experience in applied statistics and real-life data analysis is relatively broad and long. It concerns the problems of data processing in medicine (cardiology and ophthalmology), biology (genetics), economics and finance (financial mathematics), industry (mechanical engineering, energetics, and materials science), and the analysis of semantic data and informatics (information retrieval from big data). Besides, and due to, those problems, we have been working in theoretical statistics, mostly in robust and nonparametric statistics, as well as in multivariate statistics and time series analysis. Now we briefly outline our vision of the topic of this book to indicate its place in the general context of statistical data analysis, with its philosophy and ideological environment.
The reader should only remember that any classification is a convention; so are the forthcoming ones.
The customary forms of data representation are as follows (Shevlyakov and Vilchevski 2002, 2011):
as a sample of real numbers, the most convenient form to deal with;
as a sample of real-valued vectors of a given dimension;
as an observed realization of a real-valued continuous process (function);
as a sample of data of a "non-numerical nature", representing qualitative variables;
as the semantic type of data (statements, texts, pictures, etc.).
The first three possibilities mostly occur in the natural and technical sciences, where measurement techniques are well developed, clearly defined, and largely standardized. In the social sciences, the last two forms are relatively common.
To summarize: in this book we deal mostly with the first three forms and, partially, with the fourth.
The experience of treating various statistical problems shows that practically all of them are solved with the use of only a few qualitatively different types of data statistics. Here we do not discuss how to use them in solving statistical problems; we only note that their solutions result in computing some of those statistics, and that final decision making essentially depends on their values (Mosteller and Tukey 1977; Tukey 1962).
These data statistics may be classified as follows:
measures of location (central tendency, mean values),
measures of scale (spread, dispersion, scatter),
measures of correlation (interdependence, association),
measures of extreme values,
measures of a data distribution shape,
measures of data spectrum.
To summarize: in this book we mainly focus on the measures of correlation, however dealing if needed with the other types of data statistics.
The main aims of data analysis can be formulated as follows:
(A1)
compact representation of data,
(A2)
estimation of model parameters explaining and/or revealing data structure,
(A3)
prediction.
A human mind cannot efficiently work with large volumes of information, since there exist natural psychological bounds on perception ability (Miller 1956). Thus it is necessary to provide a compact output of information for expert analysis: only in this case may we expect a satisfactory final decision. Note that data processing often begins and ends with this first item (A1).
The next step (A2) is to propose an explanatory underlying model for the observed data and phenomena. It may be a regression model, a distribution model, or any other, desirably of low complexity: an essentially multiparametric model is usually a "bad" model. Nevertheless, we should recall an astute remark of George Box: "All models are wrong, but some of them are useful" (Box and Draper 1987). However, parametric models are the first to consider and examine.
Finally, the first two aims are only steps toward the last aim (A3): here we have to admit that this aim remains the main challenge for statistics and for science as a whole.
To summarize: in this book we pursue aims (A1) and (A2).
The need for stability in statistical inference directly leads to the use of robust statistical methods. It may be roughly stated that, with respect to the level of prior information about underlying data distributions, robust statistical methods occupy the intermediate place between classical parametric and nonparametric methods.
In parametric statistics, the shape of an underlying data distribution is assumed known up to the values of unknown parameters. In nonparametric statistics, it is supposed that the underlying data distribution belongs to some sufficiently "wide" class of distributions (continuous, symmetric, etc.). In robust statistics, at least within Huber's minimax approach (Huber 1964), we also consider distribution classes, but with more detailed information about the underlying distribution, say, in the form of a neighborhood of the normal distribution. The latter peculiarity allows the efficiency of robust procedures to be raised as compared with nonparametric methods, while simultaneously retaining their high stability.
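To make this intermediate position concrete, here is a minimal R sketch of ours (not from the book) using huber() from the standard MASS package; the single gross outlier below is an arbitrary illustrative choice:

library(MASS)               # provides huber(): Huber's M-estimate of location
set.seed(5)
x <- c(rnorm(50), 100)      # a normal sample plus one gross outlier
mean(x)                     # classical parametric estimate, ruined by the outlier
median(x)                   # nonparametric: very stable, less efficient at the normal
huber(x, k = 1.5)$mu        # Huber's M-estimate: a compromise between the two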
At present, there exist two main approaches in robustness:
Huber's minimax approach — quantitative robustness (Huber 1981; Huber and Ronchetti 2009).
Hampel's approach based on influence functions — qualitative robustness (Hampel 1968; Hampel et al. 1986).
In Chapter 3, we describe these approaches in detail. Now we classify the existing approaches in statistics with respect to the level of prior information about the underlying data distribution in the case of point parameter estimation:
A given data distribution with a random parameter — the Bayesian statistics (Berger 1985; Bernardo and Smith 1994; Jaynes 2003).
A given data distribution with an unknown parameter — the classical parametric statistics (Fisher 1922; Kendall and Stuart 1963).
A data distribution with an unknown parameter belongs to a distribution class, usually a neighborhood of a given distribution, e.g., the normal — the robust statistics (Hampel et al. 1986; Huber 1981; Kolmogorov 1931; Tukey 1960).
A data distribution with an unknown parameter belongs to some general distribution class — the classical nonparametric statistics (Hettmansperger and McKean 1998; Kendall and Stuart 1963; Wasserman 2007).
A data distribution does not exist in the case of unique samples and frequency instability — the probability-free approaches to data analysis: fuzzy (Zadeh 1975), exploratory (Bock and Diday 2000; Tukey 1977), interval probability (Kuznetsov 1991; Walley 1990), logical-algebraic, geometrical (Billard and Diday 2003; Diday 1972).
Note that the upper and lower levels of this hierarchy, namely the Bayesian and the probability-free approaches, are being intensively developed at present.
To summarize: in this book we mainly use Huber's and Hampel's robust approaches to statistical data analysis.
Berger JO 1985 Statistical Decision Theory and Bayesian Analysis, Springer.
Bernardo JM and Smith AFM 1994 Bayesian Theory, Wiley.
Billard L and Diday E 2003 From the statistics of data to the statistics of knowledge: symbolic data analysis. J. Amer. Statist. Assoc. 98, 991–999.
Bock HH and Diday E (eds) 2000 Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical Information from Complex Data, Springer.
Box GEP and Draper NR 1987 Empirical Model-Building and Response Surfaces, Wiley.
Bravais A 1846 Analyse mathématique sur les probabilités des erreurs de situation d'un point. Mémoires présentés par divers savants à l'Académie des Sciences de l'Institut de France. Sciences Mathématiques et Physiques 9, 255–332.
Diday E 1972 Nouvelles Méthodes et Nouveaux Concepts en Classification Automatique et Reconnaissance des Formes. Thèse de doctorat d'état, Univ. Paris IX.
Fisher RA 1922 On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society A 222, 309–368.
Galton F 1885 Regression towards mediocrity in hereditary stature. Journal of the Anthropological Institute 15, 246–263.
Galton F 1888 Co-relations and their measurement, chiefly from anthropometric data. Proceedings of the Royal Society of London 45, 135–145.
Galton F 1894 Natural Inheritance, Macmillan, London.
Gauss CF 1809 Theoria Motus Corporum Celestium, Perthes, Hamburg; English translation: Theory of the Motion of the Heavenly Bodies Moving about the Sun in Conic Sections, Dover, New York, 1963.
Hampel FR 1968 Contributions to the Theory of Robust Estimation. PhD thesis, University of California, Berkeley.
Hampel FR, Ronchetti E, Rousseeuw PJ, and Stahel WA 1986 Robust Statistics. The Approach Based on Influence Functions, Wiley.
Hettmansperger TP and McKean JW 1998 Robust Nonparametric Statistical Methods. Kendall's Library of Statistics, Edward Arnold, London.
Huber PJ 1964 Robust estimation of a location parameter. Ann. Math. Statist. 35, 73–101.
Huber PJ 1981 Robust Statistics, Wiley.
Huber PJ and Ronchetti E (eds) 2009 Robust Statistics, 2nd edn, Wiley.
Jaynes ET 2003 Probability Theory. The Logic of Science, Cambridge University Press.
Kendall MG 1938 A new measure of rank correlation. Biometrika 30, 81–89.
Kendall MG and Stuart A 1962 The Advanced Theory of Statistics. Distribution Theory, vol. 1, Griffin, London.
Kendall MG and Stuart A 1963 The Advanced Theory of Statistics. Inference and Relationship, vol. 2, Griffin, London.
Kendall MG and Stuart A 1968 The Advanced Theory of Statistics. Design and Analysis, and Time Series, vol. 3, Griffin, London.
Kolmogorov AN 1931 On the method of median in the theory of errors. Math. Sbornik 38, 47–50.
Kuznetsov VP 1991 Interval Statistical Models, Radio i Svyaz, Moscow (in Russian).
Legendre AM 1805 Nouvelles méthodes pour la détermination des orbites des comètes, Didot, Paris.
Miller GA 1956 The magical number seven, plus or minus two: some limits on our capacity for processing information. Psychological Review 63, 81–97.
Mosteller F and Tukey JW 1977 Data Analysis and Regression, Addison–Wesley.
Pearson K 1895 Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society of London A 186, 343–414.
Pearson K 1920 Notes on the history of correlations. Biometrika 13, 25–45.
Rudwick MJS 1997 Georges Cuvier, Fossil Bones, and Geological Catastrophes, University of Chicago Press.
Shevlyakov GL and Vilchevski NO 2002 Robustness in Data Analysis: Criteria and Methods, VSP, Utrecht.
Shevlyakov GL and Vilchevski NO 2011 Robustness in Data Analysis, De Gruyter, Boston.
Stigler SM 1986 The History of Statistics: The Measurement of Uncertainty before 1900, Belknap Press/Harvard University Press.
Tukey JW 1960 A survey of sampling from contaminated distributions. In Contributions to Probability and Statistics (ed. Olkin I), pp. 448–485, Stanford Univ. Press.
Tukey JW 1962 The future of data analysis. Ann. Math. Statist. 33, 1–67.
Tukey JW 1977 Exploratory Data Analysis, Addison–Wesley.
Walley P 1990 Statistical Reasoning with Imprecise Probabilities, Chapman & Hall.
Wasserman L 2007 All of Nonparametric Statistics, Springer.
Zadeh LA 1975 Fuzzy logic and approximate reasoning. Synthese 30, 407–428.
In this chapter we define several conventional measures of correlation, focusing mostly on Pearson's correlation coefficient and constructions closely related to it, and we list their principal properties and computational peculiarities.
Here we comment on the requirements that should be imposed on the measures of correlation to distinguish them from the measures of location and scale (Renyi 1959; Schweizer and Wolff 1981).
Let $\delta = \delta(X, Y)$ be a measure of correlation between arbitrary random variables $X$ and $Y$. Here we consider both positive–negative correlation, with
$$-1 \le \delta(X, Y) \le 1,$$
and positive correlation, with
$$0 \le \delta(X, Y) \le 1.$$
It is natural to impose the following requirements on $\delta(X, Y)$:
(R1)
Symmetry: $\delta(X, Y) = \delta(Y, X)$.
(R2)
Invariancy to linear transformations of random variables: $\delta(aX + b,\ cY + d) = \delta(X, Y)$ for all $a > 0$, $c > 0$ and all real $b$, $d$.
(R3)
Attainability of the limit values 0, $+1$, and $-1$:
for independent $X$ and $Y$, $\delta(X, Y) = 0$;
$\delta(X, Y) = 1$ for positively functionally dependent $X$ and $Y$;
$\delta(X, Y) = -1$ for negatively functionally dependent $X$ and $Y$, for positive–negative correlation.
(R4)
Invariancy to strictly monotonic transformations of random variables:
$\delta(g(X), h(Y)) = \delta(X, Y)$ for strictly monotonic functions $g$ and $h$.
(R5)
.
Requirement (R1) holds for almost all known measures of correlation, being a natural assumption for correlation analysis, where, in contrast to regression analysis, it is not known which variables are dependent and which are not.
Requirement (R2) makes a measure of correlation independent of the chosen measures of location and scale, since each of them reflects qualitatively different data characteristics.
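For Pearson's coefficient, the linear invariance of (R2) can be checked in a couple of lines of R (our sketch; any positive scales and arbitrary shifts would do):

set.seed(6)
x <- rnorm(100); y <- x + rnorm(100)
cor(x, y)
cor(2 * x + 3, 0.5 * y - 1)  # identical: unchanged by positive linear transformations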
Requirements (R3a), (R3b), and (R3c) are, on the one hand, merely technical: it is practically and theoretically convenient to deal with a bounded scaleless measure of correlation; on the other hand, they refer to the correspondence of the limit values of $\delta$ to the limit cases of association between the random variables $X$ and $Y$: the relation $\delta(X, Y) = 0$ may mean independence of $X$ and $Y$, whereas $|\delta(X, Y)| = 1$ indicates a functional dependence between $X$ and $Y$.
The first three requirements hold for almost all known measures of correlation. This is not so with the relation $\delta(X, Y) = 0$, which does not guarantee independence of $X$ and $Y$ for several measures of correlation, for example, for Pearson's product-moment correlation, the Spearman rank correlation, and a few others.
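A standard counterexample, reproduced here as an R sketch of ours: for symmetric $X$ and $Y = X^2$ the variables are strictly dependent, yet Pearson's correlation is close to zero:

set.seed(7)
x <- rnorm(10000)
y <- x^2       # fully determined by x, hence dependent
cor(x, y)      # near 0: zero correlation does not imply independence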
However, this property does hold for the maximal correlation coefficient, defined as
$$\delta_{\max}(X, Y) = \sup_{g,\, h} \rho(g(X), h(Y)),$$
where $\rho$ is Pearson's product-moment correlation coefficient (1.1) and the supremum is taken over Borel-measurable functions $g$ and $h$ for which $\rho(g(X), h(Y))$ makes sense (Gebelein 1941). The independence of the random variables $X$ and $Y$ also follows from the null value of Sarmanov's correlation coefficient (also called the maximal correlation coefficient) (Sarmanov 1958): in the case of a continuous symmetric bivariate distribution of $X$ and $Y$, its value is the reciprocal of the minimal eigenvalue of a certain integral operator. Admittedly, both Gebelein's and Sarmanov's correlation coefficients are rather complicated to use.
Recently, in Székely et al. (2007), a distance correlation has been proposed: its equality to zero implies independence, but, like Gebelein's and Sarmanov's correlation coefficients, it is much more complicated to compute than the classical measures of correlation.
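On the same counterexample as above, the distance correlation does reveal the dependence; a sketch of ours, assuming the third-party energy package (which implements dcor()) is installed:

# install.packages("energy")  # assumed available
library(energy)
set.seed(8)
x <- rnorm(500); y <- x^2
cor(x, y)    # near zero despite strict dependence
dcor(x, y)   # clearly positive: distance correlation vanishes only under independence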
Requirements (R4) and (R5) refer to the rank measures of correlation, for example, to the Spearman and Kendall $\tau$ rank correlation coefficients.
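Requirement (R4) is easy to illustrate for both rank coefficients in R (our sketch; exp() and the cubic map below are arbitrary strictly increasing transformations):

set.seed(9)
x <- rnorm(100); y <- x + rnorm(100)
cor(x, y, method = "spearman")
cor(exp(x), y^3 + y, method = "spearman")  # unchanged under strictly increasing maps
cor(x, y, method = "kendall")
cor(exp(x), y^3 + y, method = "kendall")   # likewise invariant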
Now we list the well-known seven postulates of Renyi (1959), formulated for a measure of dependence $\delta(X, Y)$ defined on the segment $[0, 1]$:
(P1)
$\delta(X, Y)$ is defined for any pair of random variables $X$ and $Y$, neither of them being constant with probability 1.
(P2)
$\delta(X, Y) = \delta(Y, X)$.
(P3)
$0 \le \delta(X, Y) \le 1$.
(P4)
$\delta(X, Y) = 0$ if and only if $X$ and $Y$ are independent.
(P5)
$\delta(X, Y) = 1$ if there is a strict dependence between $X$ and $Y$, that is, either $Y = g(X)$ or $X = h(Y)$, where $g$ and $h$ are Borel-measurable functions.
(P6)
If the Borel-measurable functions $g$ and $h$ map the real axis onto itself in a one-to-one way, then $\delta(g(X), h(Y)) = \delta(X, Y)$.
(P7)
If the joint distribution of $X$ and $Y$ is normal, then $\delta(X, Y) = |\rho(X, Y)|$, where $\rho(X, Y)$ is Pearson's product-moment correlation coefficient.
This set of postulates is more restrictive than the proposed set (R1)–(R5), mostly because of the chosen range $[0, 1]$ and the last postulate (P7), which yields the absolute value of Pearson's correlation $|\rho|$. Later we return to this set when considering informational measures of correlation. Moreover, in what follows we generally focus on the conventional tools of correlation analysis based on Pearson's correlation coefficient and measures closely related to it, implicitly using Renyi's postulates.
Below we present a series of different conceptual and computational definitions of the population and sample Pearson's correlation coefficients $\rho$ and $r$, respectively. Each definition indicates a different way of thinking about this measure within different statistical contexts, using algebraic, geometric, and trigonometric settings (Rogers and Nicewander 1988).
The definitions (1.1) and (1.2) of Pearson's $\rho$ and $r$, traditional for introductory statistics textbooks, can evidently be rewritten as
$$\rho = E\left[\left(\frac{X - \mu_X}{\sigma_X}\right)\left(\frac{Y - \mu_Y}{\sigma_Y}\right)\right]$$
and
$$r = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right), \qquad (2.1)$$
where $s_x^2 = n^{-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$ and $s_y^2 = n^{-1}\sum_{i=1}^{n}(y_i - \bar{y})^2$ are the mean squared errors.
Equation (2.1) for the sample correlation coefficient can be regarded as the sample covariance of the standardized random variables, namely $(x_i - \bar{x})/s_x$ and $(y_i - \bar{y})/s_y$.
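A numerical check of this reading (our sketch, not the book's; note the $1/n$ convention of (2.1) for both the scales and the averaging, so that the normalizations cancel):

set.seed(4)
x <- rnorm(100); y <- -0.7 * x + rnorm(100, sd = 0.5)
sx <- sqrt(mean((x - mean(x))^2))  # scale with the 1/n convention
sy <- sqrt(mean((y - mean(y))^2))
mean(((x - mean(x)) / sx) * ((y - mean(y)) / sy))  # sample covariance of standardized scores
cor(x, y)                                          # the same value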
Pearson's correlation coefficient possesses the properties (R1) and (R2), with the bounds $-1 \le \rho \le 1$: the cases $\rho = 1$ and $\rho = -1$ correspond to linear dependence between the variables.
Thus, Pearson's correlation coefficient is a measure of the linear interrelationship between random variables. Furthermore, the relations $\rho = 0$ and $r = 0$ do not imply independence of the random variables. The typical shapes of correlated data clusters are exhibited in Figs 2.1 to 2.5.
Figure 2.1 Data with positive correlation.
Figure 2.2 Data with negative correlation.
Figure 2.3 Data with approximately zero correlation.
Figure 2.4 Approximately nonlinear dependent data correlation.
Figure 2.5 Linear dependent data correlation.
The problem of estimation of the correlation coefficient is directly related to the linear regression problem of fitting the straight line of the conditional expectation (Kendall and Stuart 1963)
$$E(Y \mid X = x) = \mu_Y + \rho \frac{\sigma_Y}{\sigma_X}(x - \mu_X).$$
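The link between the two problems can be made concrete in R (our sketch, with simulated data): the least squares slope equals $r$ rescaled by the ratio of the sample standard deviations, in line with the conditional expectation line above:

set.seed(10)
x <- rnorm(200); y <- 1 + 2 * x + rnorm(200)
coef(lm(y ~ x))[["x"]]      # least squares slope
cor(x, y) * sd(y) / sd(x)   # r * s_y / s_x: the same value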