NONPARAMETRIC STATISTICS WITH APPLICATIONS TO SCIENCE AND ENGINEERING WITH R
Introduction to the methods and techniques of traditional and modern nonparametric statistics, incorporating R code
Nonparametric Statistics with Applications to Science and Engineering with R presents modern nonparametric statistics from a practical point of view, with the newly revised edition including custom R functions implementing nonparametric methods to explain how to compute them and make them more comprehensible. Relevant built-in functions and packages on CRAN are also provided, with sample code. The R code in the new edition not only enables readers to perform nonparametric analysis easily, but also to visualize and explore data using R's powerful graphics systems, such as the ggplot2 package and the R base graphics system. The new edition includes useful tables at the end of each chapter that help the reader find the data sets, files, functions, and packages used in and relevant to the respective chapter. New examples and exercises that enable readers to gain a deeper insight into nonparametric statistics and increase their comprehension are also included.
Some of the sample topics discussed in Nonparametric Statistics with Applications to Science and Engineering with R include:
* Basics of probability, statistics, Bayesian statistics, order statistics, Kolmogorov–Smirnov test statistics, rank tests, and designed experiments
* Categorical data, estimating distribution functions, density estimation, least squares regression, curve fitting techniques, wavelets, and bootstrap sampling
* EM algorithms, statistical learning, nonparametric Bayes, WinBUGS, properties of ranks, and the Spearman coefficient of rank correlation
* Chi-square and goodness-of-fit, contingency tables, the Fisher exact test, the McNemar test, Cochran's test, the Mantel–Haenszel test, and empirical likelihood
Nonparametric Statistics with Applications to Science and Engineering with R is a highly valuable resource for graduate students in engineering and the physical and mathematical sciences, as well as researchers who need a more comprehensive, but succinct, understanding of modern nonparametric statistical methods.
Page count: 576
Year of publication: 2022
Cover
Title Page
Copyright
Preface
Acknowledgments
1 Introduction
1.1 Efficiency of Nonparametric Methods
1.2 Overconfidence Bias
1.3 Computing with R
1.4 Exercises
References
Note
2 Probability Basics
2.1 Helpful Functions
2.2 Events, Probabilities, and Random Variables
2.3 Numerical Characteristics of Random Variables
2.4 Discrete Distributions
2.5 Continuous Distributions
2.6 Mixture Distributions
2.7 Exponential Family of Distributions
2.8 Stochastic Inequalities
2.9 Convergence of Random Variables
2.10 Exercises
References
Notes
3 Statistics Basics
3.1 Estimation
3.2 Empirical Distribution Function
3.3 Statistical Tests
3.4 Confidence Intervals
3.5 Likelihood
3.6 Exercises
References
4 Bayesian Statistics
4.1 The Bayesian Paradigm
4.2 Ingredients for Bayesian Inference
4.3 Point Estimation
4.4 Interval Estimation: Credible Sets
4.5 Bayesian Testing
4.6 Bayesian Prediction
4.7 Bayesian Computation and Use of WinBUGS
4.8 Exercises
References
Note
5 Order Statistics
5.1 Joint Distributions of Order Statistics
5.2 Sample Quantiles
5.3 Tolerance Intervals
5.4 Asymptotic Distributions of Order Statistics
5.5 Extreme Value Theory
5.6 Ranked Set Sampling
5.7 Exercises
References
6 Goodness of Fit
6.1 Kolmogorov–Smirnov Test Statistic
6.2 Smirnov Test to Compare Two Distributions
6.3 Specialized Tests for Goodness of Fit
6.4 Probability Plotting
6.5 Runs Test
6.6 Meta Analysis
6.7 Exercises
References
7 Rank Tests
7.1 Properties of Ranks
7.2 Sign Test
7.3 Spearman Coefficient of Rank Correlation
7.4 Wilcoxon Signed Rank Test
7.5 Wilcoxon (Two‐Sample) Sum Rank Test
7.6 Mann–Whitney U Test
7.7 Test of Variances
7.8 Walsh Test for Outliers
7.9 Exercises
References
Notes
8 Designed Experiments
8.1 Kruskal–Wallis Test
8.2 Friedman Test
8.3 Variance Test for Several Populations
8.4 Exercises
References
Note
9 Categorical Data
9.1 Chi‐Square and Goodness‐of‐Fit
9.2 Contingency Tables: Testing for Homogeneity and Independence
9.3 Fisher Exact Test
9.4 McNemar Test
9.5 Cochran's Test
9.6 Mantel–Haenszel Test
9.7 Central Limit Theorem for Multinomial Probabilities
9.8 Simpson's Paradox
9.9 Exercises
References
Notes
10 Estimating Distribution Functions
10.1 Introduction
10.2 Nonparametric Maximum Likelihood
10.3 Kaplan–Meier Estimator
10.4 Confidence Interval for
10.5 Plug‐in Principle
10.6 Semi‐Parametric Inference
10.7 Empirical Processes
10.8 Empirical Likelihood
10.9 Exercises
References
11 Density Estimation
11.1 Histogram
11.2 Kernel and Bandwidth
11.3 Exercises
References
12 Beyond Linear Regression
12.1 Least‐Squares Regression
12.2 Rank Regression
12.3 Robust Regression
12.4 Isotonic Regression
12.5 Generalized Linear Models
12.6 Exercises
References
13 Curve Fitting Techniques
13.1 Kernel Estimators
13.2 Nearest Neighbor Methods
13.3 Variance Estimation
13.4 Splines
13.5 Summary
13.6 Exercises
References
Notes
14 Wavelets
14.1 Introduction to Wavelets
14.2 How Do the Wavelets Work?
14.3 Wavelet Shrinkage
14.4 Exercises
References
Notes
15 Bootstrap
15.1 Bootstrap Sampling
15.2 Nonparametric Bootstrap
15.3 Bias Correction for Nonparametric Intervals
15.4 The Jackknife
15.5 Bayesian Bootstrap
15.6 Permutation Tests
15.7 More on the Bootstrap
15.8 Exercises
References
Note
16 EM Algorithm
Definition
16.1 Fisher's Example
16.2 Mixtures
16.3 EM and Order Statistics
16.4 MAP via EM
16.5 Infection Pattern Estimation
Exercises
References
17 Statistical Learning
17.1 Discriminant Analysis
17.2 Linear Classification Models
17.3 Nearest Neighbor Classification
17.4 Neural Networks
17.5 Binary Classification Trees
Exercises
References
Note
18 Nonparametric Bayes
18.1 Dirichlet Processes
18.2 Bayesian Contingency Tables and Categorical Models
18.3 Bayesian Inference in Infinitely Dimensional Nonparametric Problems
Exercises
References
Appendix A: WinBUGS
A.1 Using WinBUGS
A.2 Built‐in Functions and Common Distributions in BUGS
Appendix B: R Coding
B.1 Programming in R
B.2 Basics of R
B.3 R Commands
B.4 R for Statistics
R Index
Author Index
Subject Index
End User License Agreement
Chapter 1
Table 1.1 Asymptotic relative efficiency (ARE) of some basic nonparametric t...
Chapter 4
Table 4.1 Some conjugate pairs.
Table 4.2 Treatment of … according to the value of log‐Bayes factor.
Chapter 6
Table 6.1 Upper quantiles for Kolmogorov–Smirnov test statistic.
Table 6.2 Tail probabilities for Smirnov two‐sample test.
Table 6.3 Null distribution of Anderson–Darling test statistic: modification...
Table 6.4 Quantiles for Shapiro–Wilk test statistic.
Table 6.5 Coefficients for the Shapiro–Wilk test.
Chapter 7
Table 7.1 Quantiles of … for the Wilcoxon signed rank test.
Table 7.2 Distribution of … when … and ….
Chapter 9
Table 9.1 Mendel's data.
Table 9.2 Horse‐kick fatalities data.
Table 9.3 Observed groups of dolphins, including time of day and activity.
Table 9.4 Five reviewers found 27 issues in software example as in Gilb and ...
Chapter 10
Table 10.1 Waiting times for insects to visit flowers.
Chapter 12
Table 12.1 Size of pituitary fissure for subjects of various ages.
Table 12.2 Cæsarean section birth data.
Table 12.3 Bliss beetle data.
Chapter 14
Table 14.1 Some common wavelet filters from the Daubechies, Coiflet, and Sym...
Chapter 16
Table 16.1 Frequency distribution of the number of children among 4075 widow...
Table 16.2 Some of the 20 steps in the EM implementation of ZIP modeling on ...
Appendix A
Table A.1 Built‐in functions in WinBUGS.
Table A.2 Built‐in distributions with BUGS names and their parametrizations....
Appendix B
Table B.1 Built‐in functions in R.
Table B.2 Statistics functions in R.
Table B.3 Probability functions in R.
Chapter 2
Figure 2.1 Probability density function for DRAM chip defect frequency (…) a...
Figure 2.2 Distribution functions … (2,4) and … (3,6): (a) plot of … a...
Figure 2.3 (a) Histogram of single sample generated from Poisson … distribut...
Chapter 3
Figure 3.1 Empirical distribution function based on normal samples (sizes 20...
Figure 3.2 Graph of statistical test power for binomial test for specific al...
Figure 3.3 (a) The binomial … PMF. (b) 95% confidence intervals based on exa...
Chapter 4
Figure 4.1 The normal … likelihood, … prior, and posterior for data …
Figure 4.2 Bayesian credible set based on … density.
Figure 4.3 (a) Posterior density for …. (b) Posterior predictive density for...
Chapter 5
Figure 5.1 Diagram of simple system of three components in series (a) and pa...
Figure 5.2 Distribution of order statistics from a sample of five ….
Chapter 6
Figure 6.1 Comparing the EDF for river length data versus normal distributio...
Figure 6.2 Fitted distributions: (a) … and (b) mixture of normals.
Figure 6.3 EDF for samples of … generated from normal and exponential with …
Figure 6.4 Plots of EDF versus … CDF for (a) … observations of … data and (b...
Figure 6.5 (a) Plot of EDF versus normal CDF and (b) normal probability plot...
Figure 6.6 Weibull probability plot of 30 observations generated from a norm...
Figure 6.7 Data from … are plotted against data from (a) …, (b) …, (c) …, an...
Figure 6.8 Probability distribution of runs under ….
Figure 6.9 Runs versus games for (a) 2005 St. Louis Cardinals and (b) 2003 D...
Chapter 7
Figure 7.1 Nineteenth‐century country carolers singing “Hogmanay, Trollolay,...
Chapter 8
Figure 8.1 Box plot for crop yields.
Figure 8.2 Box plot of vehicle performance grades of three cars (A,B,C).
Chapter 9
Figure 9.1 Genetic model for a dihybrid cross between round, yellow peas and...
Figure 9.2 Original data of horse‐kick fatalities from von Bortkiewicz (1898...
Figure 9.3 Barplot of dolphin's data.
Figure 9.4 Waffle chart for showing party split in voting for 1964 Civil Rig...
Figure 9.5 Waffle chart for showing demographic and party splits in voting f...
Figure 9.6 (a) Matrix of 1200 plots (…). Lighter color corresponds to higher...
Chapter 10
Figure 10.1 Kaplan–Meier estimator for waiting times (solid line for male fl...
Figure 10.2 Kaplan–Meier estimator cord strength (in coded units).
Figure 10.3 Empirical likelihood ratio as a function of (a) the mean and (b)...
Chapter 11
Figure 11.1 Playfair's 1786 bar chart of wheat prices in England.
Figure 11.2 Empirical “density” (a) and histogram (b) for 30 normal … variab...
Figure 11.3 Histograms with normal fit of 5000 generated variables using (a)...
Figure 11.4 (a) Normal, (b) triangular, (c) box, and (d) Epanechnikov kernel...
Figure 11.5 Density estimation for sample of size … using various kernels: (...
Figure 11.6 Density estimation for sample of size … using various bandwidth ...
Figure 11.7 Density estimation for 2001 radiation measurements using bandwid...
Figure 11.8 (a) Univariate density estimator for first variable. (b) Univari...
Figure 11.9 Bivariate density estimation for sample of size ….
Chapter 12
Figure 12.1 (a) Plot of test #1 scores (during term) and test #2 scores (eig...
Figure 12.2 Regression: least squares (…) and nonparametric (…).
Figure 12.3 Star data with (a) ordinary least squares (OLS) regression, (b) ...
Figure 12.4 Anscombe's four regressions: least squares (dashed line) versus ...
Figure 12.5 (a) Greatest convex minorant based on nine observations. (b) Gre...
Figure 12.6 Cæsarean birth infection observed proportions (…) and model pred...
Chapter 13
Figure 13.1 Linear Regression (solid line) and local estimator (dashed line)...
Figure 13.2 (a) A family of symmetric beta kernels. (b) ….
Figure 13.3 Nadaraya–Watson estimators for different values of bandwidth.
Figure 13.4 Loess curve fitting for motorcycle data using (a) … (b) … (c) …
Figure 13.5 A cubic spline drawing of letter ….
Figure 13.6 (a) Interpolating sine function. (b) Interpolating a surface. (c...
Figure 13.7 (a) Square plus noise. (b) Motorcycle data: time (…) and acceler...
Figure 13.8 Blazar OJ287 luminosity.
Chapter 14
Figure 14.1 Wavelets from the Daubechies family. Depicted are scaling functi...
Figure 14.2 Wavelet‐based data processing.
Figure 14.3 (a) Haar wavelet …. (b) Some dilations and translations of Haar ...
Figure 14.4 A function interpolating … on [0,8).
Figure 14.5 (a) Hard and (b) soft thresholding with … (dashed line for refer...
Figure 14.6 Demo output: (a) original doppler signal, (b) noisy doppler, (c)...
Figure 14.7 Panel (a) shows … hourly measurements of the water level for a w...
Figure 14.8 One step in wavelet transformation of 2‐D data exemplified on ce...
Chapter 15
Figure 15.1 Baron Von Munchausen: the first bootstrapper.
Figure 15.2 Scatter plot of 24 distance–velocity pairs. Distance is measured...
Figure 15.3 (a) Histogram of correlations from 50 000 bootstrap samples. (b)...
Figure 15.4 95% confidence band the CDF of Crowder's data using 1000 bootstr...
Figure 15.5 (a) The histogram of 50,000 BB resamples for the correlation bet...
Figure 15.6 A coin of Manuel I Comnenus (1143–1180).
Figure 15.7 Panels (a) and (b) show permutation null distribution of statist...
Chapter 16
Figure 16.1 Observations from the … mixture (histogram), the mixture (dotted ...
Chapter 17
Figure 17.1 Targets illustrating the difference between model bias and varia...
Figure 17.2 Two types of iris classified according to (a) petal length versu...
Figure 17.3 Nearest‐neighbor classification of 50 observations plotted in (a...
Figure 17.4 Basic structure of feed‐forward neural network.
Figure 17.5 Purifying a tree by splitting.
Figure 17.6 (a) Location of 37 tropical (circles) and other (plus signs) hur...
Figure 17.7 Binary tree classification applied to Fisher's iris data using (...
Chapter 18
Figure 18.1 The base CDF … is shown as a dotted line. Fifteen random CDFs fr...
Figure 18.2 For a sample … Beta(2,2) observations, a boxplot of “noninformat...
Figure 18.3 Histograms of 40 000 samples from (a) posterior of lambda and (b...
Figure 18.4 Bayesian rule (18.7) and comparable hard and soft thresholding r...
Figure 18.5 (a) A noisy doppler signal (SNR 7, …, noise variance …). (b) Sig...
Figure 18.6 Approximation of Bayesian shrinkage rule calculated by WinBUGS....
Appendix A
Figure A.1 Traces of the four parameters from simple example: (a) …, (b) … (...
Appendix B
Figure B.1 RStudio console features four boxes: Source Editor, Console, Work...
Figure B.2 Graphical summary of lengths for 141 rivers in North America usin...
Figure B.3 Histogram using ggplot.
Figure B.4 Plot of two normal densities.
Established by Walter A. Shewhart and Samuel S. Wilks
Editors: David J. Balding, Noel A. C. Cressie, Garrett M. Fitzmaurice, Geof H. Givens, Geert Molenberghs, David W. Scott, Adrian F. M. Smith, Ruey S. Tsay
Editors Emeriti: Harvey Goldstein, J. Stuart Hunter, Iain M. Johnstone, Joseph B. Kadane, Jozef L. Teugels
The Wiley Series in Probability and Statistics is well established and authoritative. It covers many topics of current research interest in both pure and applied statistics and probability theory. Written by leading statisticians and institutions, the titles span both state‐of‐the‐art developments in the field and classical methods.
Reflecting the wide range of current research in statistics, the series encompasses applied, methodological and theoretical statistics, ranging from applications and new techniques made possible by advances in computerized practice to rigorous treatment of theoretical approaches.
This series provides essential and invaluable reading for all statisticians, whether in academia, industry, government, or research.
A complete list of titles in this series can be found at http://www.wiley.com/go/wsps
Second Edition
Paul Kvam
University of Richmond
Richmond, Virginia, USA
Brani Vidakovic
Texas A&M University
College Station, Texas, USA
Seong‐joon Kim
Chosun University
Gwangju, South Korea
This second edition first published 2023
© 2023 John Wiley & Sons, Inc.
Edition History
John Wiley & Sons, Inc. (1e, 2007)
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.
The right of Paul Kvam, Brani Vidakovic, and Seong‐joon Kim to be identified as the authors of this work has been asserted in accordance with law.
Registered Office
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
Editorial Office
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.
Wiley also publishes its books in a variety of electronic formats and by print‐on‐demand. Some content that appears in standard print versions of this book may not be available in other formats.
Limit of Liability/Disclaimer of Warranty
In view of ongoing research, equipment modifications, changes in governmental regulations, and the constant flow of information relating to the use of experimental reagents, equipment, and devices, the reader is urged to review and evaluate the information provided in the package insert or instructions for each chemical, piece of equipment, reagent, or device for, among other things, any changes in the instructions or indication of usage and for added warnings and precautions. While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
Library of Congress Cataloging‐in‐Publication Data applied for
Hardback ISBN: 9781119268130
Cover image: © Aleksandr Semenov/Shutterstock
Cover design by Wiley
Danger lies not in what we don't know, but in what we think we know that just ain't so.
Mark Twain (1835–1910)
This textbook is a substantial revision of a previous textbook written in 2007 by Kvam and Vidakovic. The biggest difference in this version is the adoption of the R programming language as a supplementary learning tool for the purpose of teaching concepts, illustrating examples, and completing computational homework assignments. In the original book, the authors relied on Matlab.
There has been plenty of change in the world of nonparametric statistics since we finished the first edition of this book. While the statistics community had already adapted to a modern framework for data analysis that relies increasingly on nonparametric procedures (not to mention Bayesian alternatives to traditional inference), we sense more adopters in engineering, medical research, chemistry, biology, and especially the behavioral sciences with each passing year. However, the field of nonparametric statistics has also receded toward the periphery of the statistics curriculum in the wake of data science, which continues to encroach on graduate curricula associated with statistics, causing more programs to replace traditional statistics courses with trendier versions involving data structures.
There are quality monographs/texts dealing with nonparametric statistics, such as the encyclopedic book by Hollander and Wolfe, Nonparametric Statistical Methods, or the excellent book by Conover, Practical Nonparametric Statistics, which has served as a staple for a generation of professors tasked to teach a course in this subject. Before engaging in writing the first version of this textbook, we taught several iterations of a graduate course on nonparametric statistics at Georgia Tech. The audience consisted of MS and PhD students in Engineering Statistics, Electrical Engineering, Bioengineering, Management, Logistics, Applied Mathematics, and Physics. While comprising a nonhomogeneous group, all of the students had solid mathematical, programming, and statistical training needed to benefit from the course.
In our course, we relied on the third edition of Conover's book, which is mainly concerned with what most of us think of as traditional nonparametric statistics: proportions, ranks, categorical data, goodness of fit, and so on, with the understanding that the text would be supplemented by the instructor's handouts. We ended up supplying an increasing number of handouts every year, for units such as density and function estimation, wavelets, Bayesian approaches to nonparametric problems, EM algorithm, splines, machine learning, and other arguably modern nonparametric topics. Later on, we decided to merge the handouts and fill the gaps.
With this new edition, we adhere to the traditional form one expects in an academic textbook, but we aim to provide more informal discussion and commentary to balance with the regimen of lessons that help the student progress through a statistics methods course. Unlike newer books that focus on data science, we want to help the student learn more than just how to implement a statistical procedure. We want them to understand, to a higher degree, what they are doing (or what R is doing for them).
We hope the book provides all of the tools and motivation for a student to study methods of nonparametric statistics, but we also aim to keep a conversational tone in our writing. Reading math‐infused textbooks can be challenging, but it need not be a drudgery. For that reason, we remind the reader of the bigger picture, including the historical and cultural aspects linked to the development and application of nonparametric procedures. We think it is important to acknowledge the fundamental contributions to the field of nonparametric statistics by not only our field's pioneers, such as Karl Pearson, Nathan Mantel, or Brad Efron, but also others in our vanguard, including François‐Marie Arouet (Voltaire), Karl Popper, and Baron Von Munchausen.
Computing. The book is integrated with R, and for many procedures covered in this book, we feature subroutines and packages (free libraries of code) of R code. The choice of software was natural: engineers, scientists, and increasingly statisticians are communicating in the “R language.” R is an open‐source language and environment for statistical computing that is quickly emerging as the standard for research and development. R provides a wide variety of packages that allow users to perform many kinds of analyses, along with powerful graphics capabilities. For Bayesian calculation we previously relied on WinBUGS, a free software package from Cambridge's Biostatistics Research Unit. Both R and WinBUGS are briefly covered in two appendices for readers less familiar with them. For R programmers who want to see a variety of programming modules for nonparametric inference in the R language, we refer you to the R‐series guide Nonparametric Statistical Methods Using R by Kloke and McKean.
Outline of Chapters. For a typical graduate student to cover the full breadth of this textbook, two semesters would be required. For a one‐semester course, the instructor should necessarily cover Chapters 1–3 and 5–9 to start. Depending on the scope of the class, the last part of the course can include different chapter selections.
Chapters 2–4 contain important background material the student needs to understand to effectively learn and apply the methods taught in a nonparametric analysis course. Because the ranks of observations have special importance in a nonparametric analysis, Chapter 5 presents basic results for order statistics and includes statistical methods to create tolerance intervals.
Traditional topics in estimation and testing are presented in Chapters 7–10 and should receive emphasis even for students who are most curious about advanced topics such as density estimation (Chapter 11), curve fitting (Chapter 13), and wavelets (Chapter 14). These topics include a core of rank tests that are analogous to common parametric procedures (e.g. t‐tests, analysis of variance).
Basic methods of categorical data analysis are contained in Chapter 9. Although most students in the biological sciences are exposed to a wide variety of statistical methods for categorical data, engineering students and other students in the physical sciences typically receive less schooling in this quintessential branch of statistics. Topics include methods based on tabled data, chi‐square tests, and the introduction of general linear models. Also included in the first part of the book is the topic of “goodness of fit” (Chapter 6), which refers to testing data not in terms of some unknown parameters, but the unknown distribution that generated it. In a way, goodness of fit represents an interface between distribution‐free methods and traditional parametric methods of inference, and both analytical and graphical procedures are presented. Chapter 10 presents the nonparametric alternative to maximum likelihood estimation and likelihood ratio‐based confidence intervals.
The term “regression” is familiar from your previous course that introduced you to statistical methods. Nonparametric regression provides an alternative method of analysis that requires fewer assumptions of the response variable. In Chapter 12, we use the regression platform to introduce other important topics that build on linear regression, including isotonic (constrained) regression, robust regression, and generalized linear models. In Chapter 13, we introduce more general curve fitting methods. Regression models based on wavelets (Chapter 14) are presented in a separate chapter.
In the latter part of the book, emphasis is placed on nonparametric procedures that are becoming more relevant to engineering researchers and practitioners. Beyond the conspicuous rank tests, this text includes many of the newest nonparametric tools available to experimenters for data analysis. Chapter 17 introduces fundamental topics of statistical learning as a basis for data mining and pattern recognition and includes discriminant analysis, nearest‐neighbor classifiers, neural networks, and binary classification trees. Computational tools needed for nonparametric analysis include bootstrap resampling (Chapter 15) and the EM algorithm (Chapter 16). Bootstrap methods, in particular, have become indispensable for uncertainty analysis with large data sets and elaborate stochastic models.
The textbook also unabashedly includes a review of Bayesian statistics and an overview of nonparametric Bayesian estimation. If you are familiar with Bayesian methods, you might wonder what role they play in nonparametric statistics. Admittedly, the connection is not obvious, but in fact nonparametric Bayesian methods (Chapter 18) represent an important set of tools for complicated problems in statistical modeling and learning, where many of the models are nonparametric in nature.
The book is intended both as a reference text and a text for a graduate course. We hope the reader will find this book useful. All comments, suggestions, updates, and critiques will be appreciated.
April 2022 Paul Kvam
Department of Mathematics
University of Richmond
Brani Vidakovic
Department of Statistics
Texas A & M University
Seong‐joon Kim
Department of Industrial Engineering
Chosun University
We would like to thank Lori Kvam, Draga Vidakovic, and the rest of our families.
For every complex question, there is a simple answer and it is wrong.
H. L. Mencken
Jacob Wolfowitz first coined the term nonparametric, saying “We shall refer to this situation [where a distribution is completely determined by the knowledge of its finite parameter set] as the parametric case, and denote the opposite case, where the functional forms of the distributions are unknown, as the non‐parametric case” (Wolfowitz, 1942). From that point on, nonparametric statistics was defined by what it is not: traditional statistics based on known distributions with unknown parameters. Randles, Hettmansperger, and Casella (2004) extended this notion by stating that “nonparametric statistics can and should be broadly defined to include all methodology that does not use a model based on a single parametric family.”
Traditional statistical methods are based on parametric assumptions; that is, the data can be assumed to be generated by some well‐known family of distributions, such as normal, exponential, Poisson, and so on. Each of these distributions has one or more parameters (e.g. the normal distribution has $\mu$ and $\sigma^2$), at least one of which is presumed unknown and must be inferred. The emphasis on the normal distribution in linear model theory is often justified by the central limit theorem, which guarantees approximate normality of sample means provided the sample sizes are large enough. Other distributions also play an important role in science and engineering. Physical failure mechanisms often characterize the lifetime distribution of industrial components (e.g. Weibull or lognormal), so parametric methods are important in reliability engineering.
However, with complex experiments and messy sampling plans, the generated data might not be attributed to any well‐known distribution. Analysts limited to basic statistical methods can be trapped into making parametric assumptions about the data that are not apparent in the experiment or the data. In the case where the experimenter is not sure about the underlying distribution of the data, statistical techniques are needed that can be applied regardless of the true distribution of the data. These techniques are called nonparametric methods, or distribution‐free methods.
The terms nonparametric and distribution‐free are not synonymous… Popular usage, however, has equated the terms… Roughly speaking, a nonparametric test is one which makes no hypothesis about the value of a parameter in a statistical density function, whereas a distribution‐free test is one which makes no assumptions about the precise form of the sampled population.
J. V. Bradley (1968)
It can be confusing to understand what is implied by the word “nonparametric.” What is termed modern nonparametrics includes statistical models that are quite refined, except the distribution for error is left unspecified. Wasserman's recent book All of Nonparametric Statistics (Wasserman, 2005) emphasizes only modern topics in nonparametric statistics, such as curve fitting, density estimation, and wavelets. Conover's Practical Nonparametric Statistics (Conover, 1999), on the other hand, is a classic nonparametrics textbook but mostly limited to traditional binomial and rank tests, contingency tables, and tests for goodness of fit. Topics that are not really under the distribution‐free umbrella, such as robust analysis, Bayesian analysis, and statistical learning, also have important connections to nonparametric statistics and are all featured in this book. Perhaps this text could have been titled A Bit Less of Parametric Statistics with Applications in Science and Engineering, but it surely would have sold fewer copies. On the other hand, if sales were the primary objective, we would have titled this Nonparametric Statistics for Data Science or maybe Nonparametric Statistics with Pictures of Naked People.
Doubt is not a pleasant condition, but certainty is absurd.
François Marie Voltaire (1694–1778)
It would be a mistake to think that nonparametric procedures are simpler than their parametric counterparts. On the contrary, a primary criticism of using parametric methods in statistical analysis is that they oversimplify the population or process we are observing. Indeed, parametric families are useful not because they are perfectly appropriate, but because they are perfectly convenient.
Table 1.1 Asymptotic relative efficiency (ARE) of some basic nonparametric tests.

                     Parametric test    Nonparametric test   ARE (normal)   ARE (double exponential)
Two‐sample test      t‐test             Mann–Whitney         0.955          1.50
Three‐sample test    One‐way layout     Kruskal–Wallis       0.864          1.50
Variances test       F‐test             Conover              0.760          1.08
Nonparametric methods are inherently less powerful than parametric methods. This must be true because parametric methods assume more information when constructing inferences about the data. In such comparisons the nonparametric estimators are less efficient, where the efficiency of two estimators is assessed by comparing their variances for the same sample size. This inefficiency of one method relative to another is measured by power in hypothesis testing, for example.
However, even when the parametric assumptions hold perfectly true, we will see that nonparametric methods are only slightly less powerful than the more presumptuous statistical methods. Furthermore, if the parametric assumptions about the data fail to hold, only the nonparametric method is valid. A t‐test between the means of two normal populations can be dangerously misleading if the underlying data are not actually normally distributed. Some examples of the relative efficiency of nonparametric tests are listed in Table 1.1, where asymptotic relative efficiency (ARE) is used to compare parametric procedures (second column) with their nonparametric counterparts (third column). ARE describes the relative efficiency of two estimators of a parameter as the sample size approaches infinity and is listed for the normal distribution, where parametric assumptions are justified, and the double‐exponential distribution. For example, if the underlying data are normally distributed, the t‐test requires 955 observations to have the same power as the Wilcoxon signed‐rank test based on 1000 observations.
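To illustrate, the sketch below estimates the power of the one‐sample t‐test and the Wilcoxon signed‐rank test on the same normally distributed data (our own toy example with an arbitrary shift of 0.3; this simulation is not from the book):

set.seed(1)
pvals <- replicate(2000, {
  x <- rnorm(30, mean = 0.3)      # data satisfy the parametric assumptions
  c(t = t.test(x)$p.value, w = wilcox.test(x)$p.value)
})
rowMeans(pvals < 0.05)            # estimated power; the rank test loses very little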
Parametric assumptions allow us to extrapolate away from the data. For example, it is hardly uncommon for an experimenter to make inferences about a population's extreme upper percentile (say, 99th percentile) with a sample so small that none of the observations would be expected to exceed that percentile. If the assumptions are not justified, this is grossly unscientific.
Nonparametric methods are seldom used to extrapolate outside the range of observed data. In a typical nonparametric analysis, little or nothing can be said about the probability of obtaining future data beyond the largest sampled observation or less than the smallest one. For this reason, the actual measurements of a sample item means less than its rank within the sample. In fact, nonparametric methods are typically based on ranks of the data, and properties of the population are deduced using order statistics (Chapter 5). The measurement scales for typical data are as follows:
Nominal scale: numbers used only to categorize outcomes (e.g. we might define a random variable to equal one in the event a coin flips heads and zero if it flips tails).
Ordinal scale: numbers can be used to order outcomes (e.g. the event X is greater than the event Y if X = medium and Y = small).
Interval scale: order between numbers and distances between numbers are used to compare outcomes.
Only interval scale measurements can be used by parametric methods. Nonparametric methods based on ranks can use ordinal scale measurements, and simpler nonparametric techniques can be used with nominal scale measurements.
The binomial distribution is characterized by counting the number of independent observations that are classified into a particular category. Binomial data can be formed from measurements based on a nominal scale of measurements; thus binomial models are among the most frequently encountered models in nonparametric analysis. For this reason, Chapter 3 includes a special emphasis on statistical estimation and testing associated with binomial samples.
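For example, a nominal‐scale sample reduces to a binomial count with one line of R (a toy sketch with invented data, shown only to fix ideas):

outcomes <- c("head", "tail", "head", "head", "tail", "head")   # nominal scale data
x <- sum(outcomes == "head")                 # binomial count out of n = 6 trials
binom.test(x, length(outcomes), p = 0.5)     # simple inference about the success probability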
Be slow to believe what you worst want to be true
Samuel Pepys (1633–1703)
Confirmation Bias or Overconfidence Bias describes our tendency to search for or interpret information in a way that confirms our preconceptions. Business and finance has shown interest in this psychological phenomenon (Tversky and Kahneman, 1974) because it has proven to have a significant effect on personal and corporate financial decisions where the decision maker will actively seek out and give extra weight to evidence that confirms a hypothesis they already favor. At the same time, the decision maker tends to ignore evidence that contradicts or disconfirms their hypothesis.
Overconfidence bias has a natural tendency to affect an experimenter's data analysis for the same reasons. While the dictates of the experiment and the data sampling should reduce the possibility of this problem, one of the clear pathways open to such bias is the infusion of parametric assumptions into the data analysis. After all, if the assumptions seem plausible, the researcher has much to gain from the extra certainty that comes from the assumptions in terms of narrower confidence intervals and more powerful statistical tests.
Nonparametric procedures serve as a buffer against this human tendency of looking for the evidence that best supports the researcher's underlying hypothesis. Given the subjective interests behind many corporate research findings, nonparametric methods can help alleviate doubt about their validity in cases where these procedures lend statistical significance to the corporation's claims.
If everything isn't black and white, I say…
Why the hell not?
John Wayne (1907–1979)
Because a typical nonparametric analysis can be computationally intensive, computer support is essential to understand both theory and applications. Numerous software products can be used to complete exercises and run nonparametric analysis in this textbook, including SAS, SPSS, MINITAB, MATLAB, StatXact, and JMP (to name a few). A student familiar with one of these platforms can incorporate it with the lessons provided here, and without too much extra work.
It must be stressed, however, that demonstrations in this book rely mainly on a single software package called R (maintained by the R Foundation). R is a GNU (free) programming environment for statistical computing and graphics. Today, R is one of the fastest growing software programs, with over 5000 packages that enable us to perform various kinds of statistical analysis. Because of its open‐source and extensible nature, it has been widely used in research and engineering practice and is rapidly becoming the dominant software tool for data manipulation, modeling, analysis, and graphical display. R is available on Unix systems, Microsoft Windows, and Apple Macintosh. If you are unfamiliar with R, in the first appendix, we present a brief tutorial along with a short description of some R procedures that are used to solve analytical problems and demonstrate nonparametric methods in this book. For a more comprehensive guide, we recommend the book An Introduction to R (Venables, Smith, and the R Core Team, 2014). For more detailed information, visit
http://www.r-project.org
A user‐friendly computing platform for R is provided by R‐Studio, which can be downloaded for free at
https://www.rstudio.com
RStudio Cloud allows students a convenient way of accessing the RStudio development environment without having to worry about installation problems associated with R and RStudio. Classroom instructors and students can easily share work spaces using R‐Markdown files, for example,
http://rstudio.cloud
We hope that many students of statistics will find this book useful, but it was written primarily with the scientist and engineer in mind. With nothing against statisticians (some of our acquaintances know statisticians), our approach emphasizes the application of the method over its mathematical theory. We have intentionally made the text less heavy with theory and instead emphasized applications and examples. If you come into this course thinking the history of nonparametric statistics is dry and unexciting, you are probably right, at least compared with the history of ancient Rome, the British monarchy, or maybe even Wayne Newton.1 Nonetheless, we made efforts to convince you otherwise by noting the interesting historical context of the research and the personalities behind its development. For example, we will learn more about Karl Pearson (1857–1936) and R. A. Fisher (1890–1962), legendary scientists and competitive archrivals, who both contributed greatly to the foundation of nonparametric statistics through their separate research directions.
In short, this book features techniques of data analysis that rely less on the assumptions of the data's good behavior – the very assumptions that can get researchers in trouble. Science's gravitation toward distribution‐free techniques is due to both a deeper awareness of experimental uncertainty and the availability of ever‐increasing computational abilities to deal with the implied ambiguities in the experimental outcome.
1.1
Describe a potential data analysis in your field of study where parametric methods are appropriate. How would you defend this assumption?
1.2
Describe another potential data analysis in your field of study where parametric methods may not be appropriate. What might prevent you from using parametric assumptions in this case?
1.3
Describe three ways in which overconfidence bias can affect the statistical analysis of experimental data. How can this problem be overcome?
1.4
For an analysis of variance involving three treatment groups, the traditional one‐way layout is more efficient than Kruskal and Wallis's nonparametric test. If the Kruskal–Wallis test requires 400 observations to achieve the desired test power, how many samples would the parametric test need to achieve the same power?
1.5
Find an example of data from your field of study that is considered ordinal.
Bradley, J. V. (1968), Distribution Free Statistical Tests, Englewood Cliffs, NJ: Prentice Hall.
Conover, W. J. (1999), Practical Nonparametric Statistics, New York: Wiley.
Randles, R. H., Hettmansperger, T. P., and Casella, G. (2004), “Introduction to the Special Issue Nonparametric Statistics,” Statistical Science, 19, 561–562.
Tversky, A., and Kahneman, D. (1974), “Judgment Under Uncertainty: Heuristics and Biases,” Science, 185, 1124–1131.
Venables, W. N., Smith, D. M., and the R Core Team (2014), An Introduction to R, version 3.1.0, Technical Report, The Comprehensive R Archive Network (CRAN).
Wasserman, L. (2005), All of Nonparametric Statistics, New York: Springer‐Verlag.
Wolfowitz, J. (1942), “Additive Partition Functions and a Class of Statistical Hypotheses,” Annals of Mathematical Statistics, 13, 247–279.
1. Strangely popular Las Vegas entertainer.
Probability theory is nothing but common sense reduced to calculation.
Pierre Simon Laplace (1749–1827)
In Chapters 2 and 3, we review some fundamental concepts of elementary probability and statistics. If you think you can use these chapters to catch up on all the statistics you forgot since you passed “Introductory Statistics” in your college sophomore year, you are acutely mistaken. What is offered here is an abbreviated reference list of definitions and formulas that have applications to nonparametric statistical theory. Some parametric distributions, useful for models in both parametric and nonparametric procedures, are listed, but the discussion is abridged.
Permutations: The number of arrangements of $n$ distinct objects is $n! = n(n-1)\cdots 2\cdot 1$. In R: factorial(n).
Combinations: The number of distinct ways of choosing $k$ items from a set of $n$ is $\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$. In R: choose(n,k). Note that all possible ways of choosing $k$ items from a set of $n$ can be obtained by combn(n,k).
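As a quick illustration (a minimal sketch using only the base R functions named above):

choose(4, 2)       # number of ways to choose 2 items out of 4, i.e. 6
combn(4, 2)        # enumerate the subsets; each column is one choice
factorial(5)       # 5! = 120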
$\Gamma(t) = \int_0^\infty x^{t-1} e^{-x}\,dx$ is called the gamma function. If $t$ is a positive integer, $\Gamma(t) = (t-1)!$. In R: gamma(t).
Incomplete gamma is defined as $\gamma(t, z) = \int_0^t x^{z-1} e^{-x}\,dx$. In R: pgamma(t,z,1). The upper tail incomplete gamma is defined as $\Gamma(t, z) = \int_t^\infty x^{z-1} e^{-x}\,dx$; in R: 1‐pgamma(t,z,1). If $z$ is an integer, $\Gamma(t, z) = (z-1)!\, e^{-t} \sum_{i=0}^{z-1} t^i/i!$. Note that pgamma is a cumulative distribution function (CDF) of the gamma distribution. With the scale parameter set to 1, pgamma reduces to the (normalized) incomplete gamma.
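A minimal numerical check of these relationships (the helper inc_gamma is our own, not a base R function):

inc_gamma <- function(t, z) integrate(function(x) x^(z - 1) * exp(-x), 0, t)$value
inc_gamma(2, 3)               # lower incomplete gamma by direct integration
pgamma(2, 3, 1) * gamma(3)    # pgamma is normalized, so multiply back by gamma(z)
gamma(5)                      # equals (5 - 1)! = 24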
Beta function: $B(a,b) = \int_0^1 t^{a-1}(1-t)^{b-1}\,dt = \Gamma(a)\Gamma(b)/\Gamma(a+b)$. In R: beta(a,b).
Incomplete beta: $B(x, a, b) = \int_0^x t^{a-1}(1-t)^{b-1}\,dt$, for $0 \le x \le 1$. In R: pbeta(x,a,b) represents the normalized incomplete beta defined as $I_x(a,b) = B(x, a, b)/B(a, b)$.
Summations of powers of integers: $\sum_{i=1}^n i = \frac{n(n+1)}{2}$, $\sum_{i=1}^n i^2 = \frac{n(n+1)(2n+1)}{6}$, $\sum_{i=1}^n i^3 = \frac{n^2(n+1)^2}{4}$.
Floor function: $\lfloor a \rfloor$ denotes the greatest integer $\le a$. In R: floor(a).
Geometric series: $\sum_{i=0}^{n-1} q^i = \frac{1-q^n}{1-q}$; for $|q| < 1$, $\sum_{i=0}^{\infty} q^i = \frac{1}{1-q}$.
Stirling's formula: To approximate the value of a large factorial, $n! \approx \sqrt{2\pi n}\,\left(\frac{n}{e}\right)^n$.
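A quick check of the approximation in R (our own illustration):

n <- 20
exact  <- factorial(n)
approx <- sqrt(2 * pi * n) * (n / exp(1))^n
c(exact = exact, approx = approx, relative_error = (exact - approx) / exact)
# the relative error behaves like 1/(12n), about 0.4% here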
Common limit for $e$: For a constant $c$, $\lim_{n\to\infty}\left(1 + \frac{c}{n}\right)^n = e^c$. This can also be expressed as $\left(1 + \frac{c}{n}\right)^n \to e^c$ as $n \to \infty$.
Newton's formula: For a positive integer $n$, $(a+b)^n = \sum_{k=0}^{n} \binom{n}{k} a^k b^{n-k}$.
Taylor series expansion: For a function $f$, its Taylor series expansion about $x = a$ is defined as
$f(x) = f(a) + f'(a)(x-a) + \frac{f''(a)}{2!}(x-a)^2 + \cdots + \frac{f^{(k)}(a)}{k!}(x-a)^k + R_k,$
where $f^{(k)}(a)$ denotes the $k$th derivative of $f$ evaluated at $a$ and, for some $\xi$ between $a$ and $x$,
$R_k = \frac{f^{(k+1)}(\xi)}{(k+1)!}(x-a)^{k+1}.$
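For instance, the sketch below compares exp(0.5) with its fourth‐order Taylor polynomial about $a = 0$ (our own example):

x <- 0.5
k <- 0:4
approx <- sum(x^k / factorial(k))    # 1 + x + x^2/2! + x^3/3! + x^4/4!
c(exact = exp(x), approx = approx, remainder = exp(x) - approx)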
Convex function: A function $f$ is convex if for any $0 \le p \le 1$, $f(px + (1-p)y) \le p\,f(x) + (1-p)\,f(y)$ for all values of $x$ and $y$. If $f$ is twice differentiable, then $f$ is convex if $f''(x) \ge 0$. Also, if $-f$ is convex, then $f$ is said to be concave.
Bessel function: $J_n(x)$ is defined as the solution to the equation $x^2 \frac{d^2 y}{dx^2} + x \frac{dy}{dx} + (x^2 - n^2)\,y = 0$. In R: besselJ(x,n).
The conditional probability of event $A$ occurring given that event $B$ occurs is $P(A|B) = P(AB)/P(B)$, where $AB$ represents the intersection of events $A$ and $B$, and $P(B) > 0$.
Events $A$ and $B$ are stochastically independent if and only if $P(A|B) = P(A)$, or equivalently, $P(AB) = P(A)P(B)$.
Law of total probability: Let $A_1, \ldots, A_n$ be a partition of the sample space $S$, i.e. $A_1 \cup \cdots \cup A_n = S$ and $A_i \cap A_j = \emptyset$ for $i \ne j$. For an event $B$, $P(B) = \sum_{i=1}^n P(B|A_i)\,P(A_i)$.
Bayes formula: For an event $B$ with $P(B) > 0$ and a partition $A_1, \ldots, A_n$ of $S$, $P(A_i|B) = \dfrac{P(B|A_i)\,P(A_i)}{\sum_{j=1}^n P(B|A_j)\,P(A_j)}$.
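The two rules above translate directly into R; in this minimal sketch the prevalence and error rates are invented for illustration:

prior <- c(0.02, 0.98)      # P(A1) = "defective", P(A2) = "not defective"
lik   <- c(0.95, 0.10)      # P(B | A1), P(B | A2), where B = "test flags the item"
pB    <- sum(lik * prior)   # law of total probability
post  <- lik * prior / pB   # Bayes formula, P(A_i | B)
post                        # approximately 0.162 and 0.838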
A function that assigns real numbers to points in the sample space of events is called a random variable.1
For a random variable $X$, $F_X(x) = P(X \le x)$ represents its (cumulative) distribution function, which is nondecreasing with $F(-\infty) = 0$ and $F(\infty) = 1$. In this book, it will often be denoted simply as CDF. The survivor function is defined as $S(x) = 1 - F(x)$.
If the CDF's derivative exists, $f(x) = dF(x)/dx$ represents the probability density function, or PDF.
A discrete random variable is one that can take on a countable set of values $x_1, x_2, x_3, \ldots$, so that $\sum_i P(X = x_i) = 1$. Over the support, the probability $p(x_i) = P(X = x_i)$ is called the probability mass function, or PMF.
A continuous random variable is one that takes on any real value in an interval, so $P(X \in A) = \int_A f(x)\,dx$, where $f$ is the density function of $X$.
For two random variables $X$ and $Y$, their joint distribution function is $F(x, y) = P(X \le x, Y \le y)$. If the variables are continuous, one can define the joint density function as $f(x, y) = \frac{\partial^2}{\partial x\,\partial y} F(x, y)$. The conditional density of $X$, given $Y = y$, is $f(x|y) = f(x, y)/f_Y(y)$, where $f_Y(y)$ is the density of $Y$.
Two random variables $X$ and $Y$, with distributions $F_X$ and $F_Y$, are independent if the joint distribution $F(x, y)$ of $(X, Y)$ is such that $F(x, y) = F_X(x)\,F_Y(y)$. For any sequence of random variables $X_1, X_2, \ldots$ that are independent with the same (identical) marginal distribution, we will denote this using i.i.d.
For a random variable $X$ with distribution function $F$, the expected value of some function $\phi(X)$ is defined as $\mathbb{E}\,\phi(X) = \int \phi(x)\,dF(x)$. If $F$ is continuous with density $f$, then $\mathbb{E}\,\phi(X) = \int \phi(x)\,f(x)\,dx$. If $X$ is discrete, then $\mathbb{E}\,\phi(X) = \sum_i \phi(x_i)\,p(x_i)$.
The $k$th moment of $X$ is denoted as $\mathbb{E}X^k$. The $k$th moment about the mean, or $k$th central moment of $X$, is defined as $\mathbb{E}(X - \mu)^k$, where $\mu = \mathbb{E}X$.
The variance of a random variable $X$ is the second central moment, $\mathrm{Var}\,X = \mathbb{E}(X - \mu)^2$. Often, the variance is denoted by $\sigma_X^2$, or simply by $\sigma^2$ when it is clear which random variable is involved. The square root of the variance, $\sigma_X = \sqrt{\mathrm{Var}\,X}$, is called the standard deviation of $X$.
With $0 \le p \le 1$, the $p$th quantile of $F$, denoted $x_p$, is the value $x$ such that $P(X \le x) \ge p$ and $P(X \ge x) \ge 1 - p$. If the CDF $F$ is invertible, then $x_p = F^{-1}(p)$. The 0.5th quantile is called the median of $X$.
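For example, with an invertible CDF the quantile is simply the inverse CDF; a small sketch using the exponential distribution:

p <- 0.5
x_p <- qexp(p, rate = 2)    # F^{-1}(0.5), the median of an Exp(rate = 2) distribution
pexp(x_p, rate = 2)         # applying F recovers p
log(2) / 2                  # closed form for the same median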
For two random variables $X$ and $Y$, the covariance of $X$ and $Y$ is defined as $\mathrm{Cov}(X, Y) = \mathbb{E}\,[(X - \mu_X)(Y - \mu_Y)]$, where $\mu_X$ and $\mu_Y$ are the respective expectations of $X$ and $Y$.
For two random variables $X$ and $Y$ with covariance $\mathrm{Cov}(X, Y)$, the correlation coefficient is defined as $\rho(X, Y) = \dfrac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$, where $\sigma_X$ and $\sigma_Y$ are the respective standard deviations of $X$ and $Y$. Note that $-1 \le \rho \le 1$ is a consequence of the Cauchy–Schwarz inequality (Section 2.8).
The characteristic function of a random variable $X$ is defined as $\varphi_X(t) = \mathbb{E}\,e^{itX}$. The moment generating function of a random variable $X$ is defined as $m_X(t) = \mathbb{E}\,e^{tX}$, whenever the integral exists. By differentiating $k$ times and letting $t = 0$, we have that $m_X^{(k)}(0) = \mathbb{E}X^k$.
The conditional expectation of a random variable $X$, given $Y = y$, is defined as $\mathbb{E}(X|Y = y) = \int x\,f(x|y)\,dx$, where $f(x|y)$ is the conditional density of $X$ given $Y = y$.
For random variables $X$ and $Y$ with finite means and variances, we can obtain moments of $X$ through its conditional distribution: $\mathbb{E}X = \mathbb{E}[\mathbb{E}(X|Y)]$ and $\mathrm{Var}\,X = \mathbb{E}[\mathrm{Var}(X|Y)] + \mathrm{Var}[\mathbb{E}(X|Y)]$. These two equations are commonly referred to as Adam and Eve's rules.
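A short simulation illustrating both rules, using our own toy model $Y \sim \mathcal{P}(4)$ and $X\,|\,Y = y \sim \mathcal{N}(y, 1)$:

set.seed(1)
n <- 100000
y <- rpois(n, 4)
x <- rnorm(n, mean = y, sd = 1)
mean(x)     # close to E[E(X|Y)] = E(Y) = 4
var(x)      # close to E[Var(X|Y)] + Var[E(X|Y)] = 1 + 4 = 5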
Ironically, parametric distributions have an important role to play in the development of nonparametric methods. Even if we are analyzing data without making assumptions about the distributions that generate the data, these parametric families appear nonetheless. In counting trials, for example, we can generate well‐known discrete distributions (e.g. binomial, geometric) assuming only that the counts are independent and probabilities remain the same from trial to trial.
A simple Bernoulli random variable $X$ is dichotomous with $P(X = 1) = p$ and $P(X = 0) = 1 - p$ for some $0 \le p \le 1$. It is denoted as $X \sim \mathrm{Ber}(p)$. Suppose an experiment consists of $n$ independent trials in which two outcomes are possible (e.g. success or failure), with $P(\text{success}) = p$ for each trial. If $X$ is defined as the number of successes (out of $n$), then $X = X_1 + X_2 + \cdots + X_n$, and there are $\binom{n}{x}$ arrangements of $x$ successes and $n - x$ failures, each having the same probability $p^x (1-p)^{n-x}$. $X$ is a binomial random variable with PMF
$p_X(x) = \binom{n}{x} p^x (1-p)^{n-x}, \quad x = 0, 1, \ldots, n.$
This is denoted by $X \sim \mathrm{Bin}(n, p)$. From the moment generating function $m_X(t) = (p e^t + 1 - p)^n$, we obtain $\mathbb{E}X = np$ and $\mathrm{Var}\,X = np(1-p)$.
The cumulative distribution for a binomial random variable is not simplified beyond the sum; i.e. $F(x) = \sum_{i \le x} \binom{n}{i} p^i (1-p)^{n-i}$. However, interval probabilities can be computed in R using pbinom(x,n,p), which computes the CDF at value $x$. The PMF is also computed in R using dbinom(x,n,p).
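For example, with arbitrary values $n = 10$ and $p = 0.3$:

n <- 10; p <- 0.3
dbinom(4, n, p)                      # P(X = 4)
pbinom(4, n, p)                      # P(X <= 4), the CDF at x = 4
sum(dbinom(0:4, n, p))               # the same CDF value written as a sum
pbinom(6, n, p) - pbinom(3, n, p)    # interval probability P(4 <= X <= 6)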
A Poisson random variable may characterize the number of events occurring in some fixed interval of time, so that the events occur with a constant rate $\lambda$ and independently of the time since any previous event. The PMF for the Poisson distribution is
$p_X(x) = \frac{\lambda^x}{x!}\, e^{-\lambda}, \quad x = 0, 1, 2, \ldots$
This is denoted by $X \sim \mathcal{P}(\lambda)$. From $m_X(t) = \exp\{\lambda(e^t - 1)\}$, we have $\mathbb{E}X = \lambda$ and $\mathrm{Var}\,X = \lambda$; the mean and the variance coincide.
The sum of a finite independent set of Poisson variables also has a Poisson distribution. Specifically, if $X_i \sim \mathcal{P}(\lambda_i)$, then $Y = X_1 + \cdots + X_k$ is distributed as $\mathcal{P}(\lambda_1 + \cdots + \lambda_k)$. Furthermore, the Poisson distribution is a limiting form for a binomial model, i.e.
$\lim_{n \to \infty,\; np \to \lambda} \binom{n}{x} p^x (1-p)^{n-x} = \frac{\lambda^x}{x!}\, e^{-\lambda}.$
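The limiting relationship is easy to see numerically; here $n$ is large, $p$ is small, and $np = 2$ (our own illustration):

n <- 1000; lambda <- 2; p <- lambda / n
x <- 0:6
rbind(binomial = dbinom(x, n, p), poisson = dpois(x, lambda))
# the two rows agree to roughly three decimal places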
R commands for the Poisson CDF, PMF, quantile, and random numbers are ppois, dpois, qpois, and rpois.
Suppose we are dealing with i.i.d. trials again, this time counting the number of successes observed until a fixed number of failures ($k$) occur. If we observe $k$ consecutive failures at the start of the experiment, for example, the count is $X = 0$ and $P(X = 0) = p^k$, where $p$ is the probability of failure. If $X = x$, we have observed $x$ successes and $k$ failures in $x + k$ trials. There are $\binom{x+k}{x}$ different ways of arranging those $x + k$ trials, but we can only be concerned with the arrangements in which the last trial ended in a failure. So there are really only $\binom{x+k-1}{x}$ arrangements, each equal in probability. With this in mind, the PMF is
$p_X(x) = \binom{x+k-1}{x}\, p^k (1-p)^x, \quad x = 0, 1, 2, \ldots$
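This PMF matches R's dnbinom once the labels are mapped: R counts “failures” before the size‐th “success” with success probability prob, so with the roles of success and failure swapped relative to the text above, dnbinom(x, size = k, prob = p) reproduces the formula, with $p$ being the failure probability of the text (a minimal sketch):

k <- 3; p <- 0.4
x <- 0:5
choose(x + k - 1, x) * p^k * (1 - p)^x   # PMF as derived in the text
dnbinom(x, size = k, prob = p)           # identical values from R's built-in function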