Nonparametric Statistics with Applications to Science and Engineering with R

Paul Kvam

Description

An introduction to the methods and techniques of traditional and modern nonparametric statistics, incorporating R code.

Nonparametric Statistics with Applications to Science and Engineering with R presents modern nonparametric statistics from a practical point of view, with the newly revised edition including custom R functions that implement nonparametric methods, explaining how to compute them and making them more comprehensible. Relevant built-in functions and packages on CRAN are also provided, with sample code. The R code in the new edition not only enables readers to perform nonparametric analysis easily, but also to visualize and explore data using R's powerful graphics systems, such as the ggplot2 package and the R base graphics system. The new edition includes useful tables at the end of each chapter that help the reader find the data sets, files, functions, and packages that are used in and relevant to the respective chapter. New examples and exercises that enable readers to gain deeper insight into nonparametric statistics and increase their comprehension are also included.

Some of the sample topics discussed in Nonparametric Statistics with Applications to Science and Engineering with R include:

* Basics of probability, statistics, Bayesian statistics, order statistics, Kolmogorov–Smirnov test statistics, rank tests, and designed experiments
* Categorical data, estimating distribution functions, density estimation, least squares regression, curve fitting techniques, wavelets, and bootstrap sampling
* EM algorithms, statistical learning, nonparametric Bayes, WinBUGS, properties of ranks, and the Spearman coefficient of rank correlation
* Chi-square and goodness-of-fit, contingency tables, the Fisher exact test, the McNemar test, Cochran's test, the Mantel–Haenszel test, and empirical likelihood

Nonparametric Statistics with Applications to Science and Engineering with R is a highly valuable resource for graduate students in engineering and the physical and mathematical sciences, as well as researchers who need a more comprehensive, but succinct, understanding of modern nonparametric statistical methods.


Table of Contents

Cover

Title Page

Copyright

Preface

Acknowledgments

1 Introduction

1.1 Efficiency of Nonparametric Methods

1.2 Overconfidence Bias

1.3 Computing with R

1.4 Exercises

References

Note

2 Probability Basics

2.1 Helpful Functions

2.2 Events, Probabilities, and Random Variables

2.3 Numerical Characteristics of Random Variables

2.4 Discrete Distributions

2.5 Continuous Distributions

2.6 Mixture Distributions

2.7 Exponential Family of Distributions

2.8 Stochastic Inequalities

2.9 Convergence of Random Variables

2.10 Exercises

References

Notes

3 Statistics Basics

3.1 Estimation

3.2 Empirical Distribution Function

3.3 Statistical Tests

3.4 Confidence Intervals

3.5 Likelihood

3.6 Exercises

References

4 Bayesian Statistics

4.1 The Bayesian Paradigm

4.2 Ingredients for Bayesian Inference

4.3 Point Estimation

4.4 Interval Estimation: Credible Sets

4.5 Bayesian Testing

4.6 Bayesian Prediction

4.7 Bayesian Computation and Use of WinBUGS

4.8 Exercises

References

Note

5 Order Statistics

5.1 Joint Distributions of Order Statistics

5.2 Sample Quantiles

5.3 Tolerance Intervals

5.4 Asymptotic Distributions of Order Statistics

5.5 Extreme Value Theory

5.6 Ranked Set Sampling

5.7 Exercises

References

6 Goodness of Fit

6.1 Kolmogorov–Smirnov Test Statistic

6.2 Smirnov Test to Compare Two Distributions

6.3 Specialized Tests for Goodness of Fit

6.4 Probability Plotting

6.5 Runs Test

6.6 Meta Analysis

6.7 Exercises

References

7 Rank Tests

7.1 Properties of Ranks

7.2 Sign Test

7.3 Spearman Coefficient of Rank Correlation

7.4 Wilcoxon Signed Rank Test

7.5 Wilcoxon (Two‐Sample) Sum Rank Test

7.6 Mann–Whitney U Test

7.7 Test of Variances

7.8 Walsh Test for Outliers

7.9 Exercises

References

Notes

8 Designed Experiments

8.1 Kruskal–Wallis Test

8.2 Friedman Test

8.3 Variance Test for Several Populations

8.4 Exercises

References

Note

9 Categorical Data

9.1 Chi‐Square and Goodness‐of‐Fit

9.2 Contingency Tables: Testing for Homogeneity and Independence

9.3 Fisher Exact Test

9.4 McNemar Test

9.5 Cochran's Test

9.6 Mantel–Haenszel Test

9.7 Central Limit Theorem for Multinomial Probabilities

9.8 Simpson's Paradox

9.9 Exercises

References

Notes

10 Estimating Distribution Functions

10.1 Introduction

10.2 Nonparametric Maximum Likelihood

10.3 Kaplan–Meier Estimator

10.4 Confidence Interval for

10.5 Plug‐in Principle

10.6 Semi‐Parametric Inference

10.7 Empirical Processes

10.8 Empirical Likelihood

10.9 Exercises

References

11 Density Estimation

11.1 Histogram

11.2 Kernel and Bandwidth

11.3 Exercises

References

12 Beyond Linear Regression

12.1 Least‐Squares Regression

12.2 Rank Regression

12.3 Robust Regression

12.4 Isotonic Regression

12.5 Generalized Linear Models

12.6 Exercises

References

13 Curve Fitting Techniques

13.1 Kernel Estimators

13.2 Nearest Neighbor Methods

13.3 Variance Estimation

13.4 Splines

13.5 Summary

13.6 Exercises

References

Notes

14 Wavelets

14.1 Introduction to Wavelets

14.2 How Do the Wavelets Work?

14.3 Wavelet Shrinkage

14.4 Exercises

References

Notes

15 Bootstrap

15.1 Bootstrap Sampling

15.2 Nonparametric Bootstrap

15.3 Bias Correction for Nonparametric Intervals

15.4 The Jackknife

15.5 Bayesian Bootstrap

15.6 Permutation Tests

15.7 More on the Bootstrap

15.8 Exercises

References

Note

16 EM Algorithm

Definition

16.1 Fisher's Example

16.2 Mixtures

16.3 EM and Order Statistics

16.4 MAP via EM

16.5 Infection Pattern Estimation

Exercises

References

17 Statistical Learning

17.1 Discriminant Analysis

17.2 Linear Classification Models

17.3 Nearest Neighbor Classification

17.4 Neural Networks

17.5 Binary Classification Trees

Exercises

References

Note

18 Nonparametric Bayes

18.1 Dirichlet Processes

18.2 Bayesian Contingency Tables and Categorical Models

18.3 Bayesian Inference in Infinitely Dimensional Nonparametric Problems

Exercises

References

Appendix A: WinBUGS

A.1 Using WinBUGS

A.2 Built‐in Functions and Common Distributions in BUGS

Appendix B: R Coding

B.1 Programming in R

B.2 Basics of R

B.3 R Commands

B.4 R for Statistics

R Index

Author Index

Subject Index

End User License Agreement

List of Tables

Chapter 1

Table 1.1 Asymptotic relative efficiency (ARE) of some basic nonparametric t...

Chapter 4

Table 4.1 Some conjugate pairs.

Table 4.2 Treatment of … according to the value of log‐Bayes factor.

Chapter 6

Table 6.1 Upper quantiles for Kolmogorov–Smirnov test statistic.

Table 6.2 Tail probabilities for Smirnov two‐sample test.

Table 6.3 Null distribution of Anderson–Darling test statistic: modification...

Table 6.4 Quantiles for Shapiro–Wilk test statistic.

Table 6.5 Coefficients for the Shapiro–Wilk test.

Chapter 7

Table 7.1 Quantiles of … for the Wilcoxon signed rank test.

Table 7.2 Distribution of … when … and ….

Chapter 9

Table 9.1 Mendel's data.

Table 9.2 Horse‐kick fatalities data.

Table 9.3 Observed groups of dolphins, including time of day and activity.

Table 9.4 Five reviewers found 27 issues in software example as in Gilb and ...

Chapter 10

Table 10.1 Waiting times for insects to visit flowers.

Chapter 12

Table 12.1 Size of pituitary fissure for subjects of various ages.

Table 12.2 Cæsarean section birth data.

Table 12.3 Bliss beetle data.

Chapter 14

Table 14.1 Some common wavelet filters from the Daubechies, Coiflet, and Sym...

Chapter 16

Table 16.1 Frequency distribution of the number of children among 4075 widow...

Table 16.2 Some of the 20 steps in the EM implementation of ZIP modeling on ...

Appendix A

Table A.1 Built‐in functions in WinBUGS.

Table A.2 Built‐in distributions with BUGS names and their parametrizations....

Appendix B

Table B.1 Built‐in functions in R.

Table B.2 Statistics functions in R.

Table B.3 Probability functions in R.

List of Illustrations

Chapter 2

Figure 2.1 Probability density function for DRAM chip defect frequency (…) a...

Figure 2.2 Distribution functions … (2,4) and … (3,6): (a) plot of … a...

Figure 2.3 (a) Histogram of single sample generated from Poisson … distribut...

Chapter 3

Figure 3.1 Empirical distribution function based on normal samples (sizes 20...

Figure 3.2 Graph of statistical test power for binomial test for specific al...

Figure 3.3 (a) The binomial … PMF. (b) 95% confidence intervals based on exa...

Chapter 4

Figure 4.1 The normal … likelihood, … prior, and posterior for data …

Figure 4.2 Bayesian credible set based on … density.

Figure 4.3 (a) Posterior density for …. (b) Posterior predictive density for...

Chapter 5

Figure 5.1 Diagram of simple system of three components in series (a) and pa...

Figure 5.2 Distribution of order statistics from a sample of five ….

Chapter 6

Figure 6.1 Comparing the EDF for river length data versus normal distributio...

Figure 6.2 Fitted distributions: (a) … and (b) mixture of normals.

Figure 6.3 EDF for samples of … generated from normal and exponential with …

Figure 6.4 Plots of EDF versus … CDF for (a) … observations of … data and (b...

Figure 6.5 (a) Plot of EDF versus normal CDF and (b) normal probability plot...

Figure 6.6 Weibull probability plot of 30 observations generated from a norm...

Figure 6.7 Data from … are plotted against data from (a) …, (b) …, (c) …, an...

Figure 6.8 Probability distribution of runs under ….

Figure 6.9 Runs versus games for (a) 2005 St. Louis Cardinals and (b) 2003 D...

Chapter 7

Figure 7.1 Nineteenth‐century country carolers singing “Hogmanay, Trollolay,...

Chapter 8

Figure 8.1 Box plot for crop yields.

Figure 8.2 Box plot of vehicle performance grades of three cars (A,B,C).

Chapter 9

Figure 9.1 Genetic model for a dihybrid cross between round, yellow peas and...

Figure 9.2 Original data of horse‐kick fatalities from von Bortkiewicz (1898...

Figure 9.3 Barplot of dolphin's data.

Figure 9.4 Waffle chart for showing party split in voting for 1964 Civil Rig...

Figure 9.5 Waffle chart for showing demographic and party splits in voting f...

Figure 9.6 (a) Matrix of 1200 plots (…). Lighter color corresponds to higher...

Chapter 10

Figure 10.1 Kaplan–Meier estimator for waiting times (solid line for male fl...

Figure 10.2 Kaplan–Meier estimator cord strength (in coded units).

Figure 10.3 Empirical likelihood ratio as a function of (a) the mean and (b)...

Chapter 11

Figure 11.1 Playfair's 1786 bar chart of wheat prices in England.

Figure 11.2 Empirical “density” (a) and histogram (b) for 30 normal … variab...

Figure 11.3 Histograms with normal fit of 5000 generated variables using (a)...

Figure 11.4 (a) Normal, (b) triangular, (c) box, and (d) Epanechnikov kernel...

Figure 11.5 Density estimation for sample of size … using various kernels: (...

Figure 11.6 Density estimation for sample of size … using various bandwidth ...

Figure 11.7 Density estimation for 2001 radiation measurements using bandwid...

Figure 11.8 (a) Univariate density estimator for first variable. (b) Univari...

Figure 11.9 Bivariate density estimation for sample of size ….

Chapter 12

Figure 12.1 (a) Plot of test #1 scores (during term) and test #2 scores (eig...

Figure 12.2 Regression: least squares (…) and nonparametric (…).

Figure 12.3 Star data with (a) ordinary least squares (OLS) regression, (b) ...

Figure 12.4 Anscombe's four regressions: least squares (dashed line) versus ...

Figure 12.5 (a) Greatest convex minorant based on nine observations. (b) Gre...

Figure 12.6 Cæsarean birth infection observed proportions (…) and model pred...

Chapter 13

Figure 13.1 Linear regression (solid line) and local estimator (dashed line)...

Figure 13.2 (a) A family of symmetric beta kernels. (b) ….

Figure 13.3 Nadaraya–Watson estimators for different values of bandwidth.

Figure 13.4 Loess curve fitting for motorcycle data using (a) … (b) … (c) …

Figure 13.5 A cubic spline drawing of letter ….

Figure 13.6 (a) Interpolating sine function. (b) Interpolating a surface. (c...

Figure 13.7 (a) Square plus noise. (b) Motorcycle data: time (…) and acceler...

Figure 13.8 Blazar OJ287 luminosity.

Chapter 14

Figure 14.1 Wavelets from the Daubechies family. Depicted are scaling functi...

Figure 14.2 Wavelet‐based data processing.

Figure 14.3 (a) Haar wavelet …. (b) Some dilations and translations of Haar ...

Figure 14.4 A function interpolating … on [0,8).

Figure 14.5 (a) Hard and (b) soft thresholding with … (dashed line for refer...

Figure 14.6 Demo output: (a) original doppler signal, (b) noisy doppler, (c)...

Figure 14.7 Panel (a) shows … hourly measurements of the water level for a w...

Figure 14.8 One step in wavelet transformation of 2‐D data exemplified on ce...

Chapter 15

Figure 15.1 Baron Von Munchausen: the first bootstrapper.

Figure 15.2 Scatter plot of 24 distance–velocity pairs. Distance is measured...

Figure 15.3 (a) Histogram of correlations from 50 000 bootstrap samples. (b)...

Figure 15.4 95% confidence band the CDF of Crowder's data using 1000 bootstr...

Figure 15.5 (a) The histogram of 50,000 BB resamples for the correlation bet...

Figure 15.6 A coin of Manuel I Comnenus (1143–1180).

Figure 15.7 Panels (a) and (b) show permutation null distribution of statist...

Chapter 16

Figure 16.1 Observations from the … mixture (histogram), the mixture (dotted...

Chapter 17

Figure 17.1 Targets illustrating the difference between model bias and varia...

Figure 17.2 Two types of iris classified according to (a) petal length versu...

Figure 17.3 Nearest‐neighbor classification of 50 observations plotted in (a...

Figure 17.4 Basic structure of feed‐forward neural network.

Figure 17.5 Purifying a tree by splitting.

Figure 17.6 (a) Location of 37 tropical (circles) and other (plus signs) hur...

Figure 17.7 Binary tree classification applied to Fisher's iris data using (...

Chapter 18

Figure 18.1 The base CDF … is shown as a dotted line. Fifteen random CDFs fr...

Figure 18.2 For a sample … Beta(2,2) observations, a boxplot of “noninformat...

Figure 18.3 Histograms of 40 000 samples from (a) posterior of lambda and (b...

Figure 18.4 Bayesian rule (18.7) and comparable hard and soft thresholding r...

Figure 18.5 (a) A noisy doppler signal (SNR … 7, …, noise variance …). (b) Sig...

Figure 18.6 Approximation of Bayesian shrinkage rule calculated by WinBUGS.

Appendix A

Figure A.1 Traces of the four parameters from simple example: (a) …, (b) … (...

Appendix B

Figure B.1 RStudio console features four boxes: Source Editor, Console, Work...

Figure B.2 Graphical summary of lengths for 141 rivers in North America usin...

Figure B.3 Histogram using ggplot.

Figure B.4 Plot of two normal densities.



WILEY SERIES IN PROBABILITY AND STATISTICS

Established by Walter A. Shewhart and Samuel S. Wilks

Editors: David J. Balding, Noel A. C. Cressie, Garrett M. Fitzmaurice, Geof H. Givens, Geert Molenberghs, David W. Scott, Adrian F. M. Smith, Ruey S. Tsay

Editors Emeriti: Harvey Goldstein, J. Stuart Hunter, Iain M. Johnstone, Joseph B. Kadane, Jozef L. Teugels

The Wiley Series in Probability and Statistics is well established and authoritative. It covers many topics of current research interest in both pure and applied statistics and probability theory. Written by leading statisticians and institutions, the titles span both state‐of‐the‐art developments in the field and classical methods.

Reflecting the wide range of current research in statistics, the series encompasses applied, methodological and theoretical statistics, ranging from applications and new techniques made possible by advances in computerized practice to rigorous treatment of theoretical approaches.

This series provides essential and invaluable reading for all statisticians, whether in academia, industry, government, or research.

A complete list of titles in this series can be found at http://www.wiley.com/go/wsps

Nonparametric Statistics with Applications to Science and Engineering with R

 

Second Edition

Paul Kvam

University of Richmond

Richmond, Virginia, USA

Brani Vidakovic

Texas A&M University

College Station, Texas, USA

Seong‐joon Kim

Chosun University

Gwangju, South Korea

 

 

 

 

 

 

 

 

This second edition first published 2023

© 2023 John Wiley & Sons, Inc.

Edition History

John Wiley & Sons, Inc. (1e, 2007)

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.

The right of Paul Kvam, Brani Vidakovic, and Seong‐joon Kim to be identified as the authors of this work has been asserted in accordance with law.

Registered Office

John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA

Editorial Office

John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA

For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.

Wiley also publishes its books in a variety of electronic formats and by print‐on‐demand. Some content that appears in standard print versions of this book may not be available in other formats.

Limit of Liability/Disclaimer of Warranty

In view of ongoing research, equipment modifications, changes in governmental regulations, and the constant flow of information relating to the use of experimental reagents, equipment, and devices, the reader is urged to review and evaluate the information provided in the package insert or instructions for each chemical, piece of equipment, reagent, or device for, among other things, any changes in the instructions or indication of usage and for added warnings and precautions. While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

Library of Congress Cataloging‐in‐Publication Data applied for

Hardback ISBN: 9781119268130

Cover image: © Aleksandr Semenov/Shutterstock

Cover design by Wiley

Preface

Danger lies not in what we don't know, but in what we think we know that just ain't so.

Mark Twain (1835–1910)

This textbook is a substantial revision of a previous textbook written in 2007 by Kvam and Vidakovic. The biggest difference in this version is the adoption of the R programming language as a supplementary learning tool for the purpose of teaching concepts, illustrating examples, and completing computational homework assignments. In the original book, the authors relied on Matlab.

There has been plenty of change in the world of nonparametric statistics since we finished the first edition of this book. While the statistics community had already adapted to a modern framework for data analysis that relies increasingly on nonparametric procedures (not to mention Bayesian alternatives to traditional inference), we sense more adopters in engineering, medical research, chemistry, biology, and especially the behavioral sciences with each passing year. However, the field of nonparametric statistics has also receded toward the periphery of the statistics curriculum in the wake of data science, which continues to encroach on graduate curricula associated with statistics, causing more programs to replace traditional statistics courses with trendier versions involving data structures.

There are quality monographs/texts dealing with nonparametric statistics, such as the encyclopedic book by Hollander and Wolfe, Nonparametric Statistical Methods, or the excellent book by Conover, Practical Nonparametric Statistics, which has served as a staple for a generation of professors tasked to teach a course in this subject. Before engaging in writing the first version of this textbook, we taught several iterations of a graduate course on nonparametric statistics at Georgia Tech. The audience consisted of MS and PhD students in Engineering Statistics, Electrical Engineering, Bioengineering, Management, Logistics, Applied Mathematics, and Physics. Although they comprised a nonhomogeneous group, all of the students had the solid mathematical, programming, and statistical training needed to benefit from the course.

In our course, we relied on the third edition of Conover's book, which is mainly concerned with what most of us think of as traditional nonparametric statistics: proportions, ranks, categorical data, goodness of fit, and so on, with the understanding that the text would be supplemented by the instructor's handouts. We ended up supplying an increasing number of handouts every year, for units such as density and function estimation, wavelets, Bayesian approaches to nonparametric problems, EM algorithm, splines, machine learning, and other arguably modern nonparametric topics. Later on, we decided to merge the handouts and fill the gaps.

With this new edition, we adhere to the traditional form one expects in an academic textbook, but we aim to provide more informal discussion and commentary to balance with the regimen of lessons that help the student progress through a statistics methods course. Unlike newer books that focus on data science, we want to help the student learn more than just how to implement a statistical procedure. We want them to understand, to a higher degree, what they are doing (or what R is doing for them).

We hope the book provides all of the tools and motivation for a student to study methods of nonparametric statistics, but we also aim to keep a conversational tone in our writing. Reading math‐infused textbooks can be challenging, but it need not be a drudgery. For that reason, we remind the reader of the bigger picture, including the historical and cultural aspects linked to the development and application of nonparametric procedures. We think it is important to acknowledge the fundamental contributions to the field of nonparametric statistics by not only our field's pioneers, such as Karl Pearson, Nathan Mantel, or Brad Efron, but also others in our vanguard, including François‐Marie Arouet (Voltaire), Karl Popper, and Baron Von Munchausen.

Computing. The book is integrated with R, and for many procedures covered in this book, we feature R subroutines and packages (free libraries of code). The choice of software was natural: engineers, scientists, and increasingly statisticians are communicating in the “R language.” R is an open‐source language for statistical computing and is quickly emerging as the standard environment for research and development. R provides a wide variety of packages that allow the user to perform many kinds of analyses, along with powerful graphics capabilities. For Bayesian calculation we previously relied on WinBUGS, a free software package from Cambridge's Biostatistics Research Unit. Both R and WinBUGS are briefly covered in two appendices for readers less familiar with them. For R programmers who want to see a variety of programming modules for nonparametric inference in the R language, we refer you to the R‐series guide Nonparametric Statistical Methods Using R by Kloke and McKean.

Outline of Chapters. For a typical graduate student to cover the full breadth of this textbook, two semesters would be required. For a one‐semester course, the instructor should necessarily cover Chapters 1–3 and 5–9 to start. Depending on the scope of the class, the last part of the course can include different chapter selections.

Chapters 2–4 contain important background material the student needs to understand to effectively learn and apply the methods taught in a nonparametric analysis course. Because the ranks of observations have special importance in a nonparametric analysis, Chapter 5 presents basic results for order statistics and includes statistical methods to create tolerance intervals.

Traditional topics in estimation and testing are presented in Chapters 7–10 and should receive emphasis even for students who are most curious about advanced topics such as density estimation (Chapter 11), curve fitting (Chapter 13), and wavelets (Chapter 14). These topics include a core of rank tests that are analogous to common parametric procedures (e.g. $t$‐tests, analysis of variance).

Basic methods of categorical data analysis are contained in Chapter 9. Although most students in the biological sciences are exposed to a wide variety of statistical methods for categorical data, engineering students and other students in the physical sciences typically receive less schooling in this quintessential branch of statistics. Topics include methods based on tabled data, chi‐square tests, and the introduction of general linear models. Also included in the first part of the book is the topic of “goodness of fit” (Chapter 6), which refers to testing data not in terms of some unknown parameters, but the unknown distribution that generated it. In a way, goodness of fit represents an interface between distribution‐free methods and traditional parametric methods of inference, and both analytical and graphical procedures are presented. Chapter 10 presents the nonparametric alternative to maximum likelihood estimation and likelihood ratio‐based confidence intervals.

The term “regression” is familiar from your previous course that introduced you to statistical methods. Nonparametric regression provides an alternative method of analysis that requires fewer assumptions of the response variable. In Chapter 12, we use the regression platform to introduce other important topics that build on linear regression, including isotonic (constrained) regression, robust regression, and generalized linear models. In Chapter 13, we introduce more general curve fitting methods. Regression models based on wavelets (Chapter 14) are presented in a separate chapter.

In the latter part of the book, emphasis is placed on nonparametric procedures that are becoming more relevant to engineering researchers and practitioners. Beyond the conspicuous rank tests, this text includes many of the newest nonparametric tools available to experimenters for data analysis. Chapter 17 introduces fundamental topics of statistical learning as a basis for data mining and pattern recognition and includes discriminant analysis, nearest‐neighbor classifiers, neural networks, and binary classification trees. Computational tools needed for nonparametric analysis include bootstrap resampling (Chapter 15) and the EM algorithm (Chapter 16). Bootstrap methods, in particular, have become indispensable for uncertainty analysis with large data sets and elaborate stochastic models.

The textbook also unabashedly includes a review of Bayesian statistics and an overview of nonparametric Bayesian estimation. If you are familiar with Bayesian methods, you might wonder what role they play in nonparametric statistics. Admittedly, the connection is not obvious, but in fact nonparametric Bayesian methods (Chapter 18) represent an important set of tools for complicated problems in statistical modeling and learning, where many of the models are nonparametric in nature.

The book is intended both as a reference text and a text for a graduate course. We hope the reader will find this book useful. All comments, suggestions, updates, and critiques will be appreciated.

April 2022 Paul Kvam

Department of Mathematics

University of Richmond

 

Brani Vidakovic

Department of Statistics

Texas A & M University

 

Seong‐joon Kim

Department of Industrial Engineering

Chosun University

Acknowledgments

We would like to thank Lori Kvam, Draga Vidakovic, and the rest of our families.

1 Introduction

For every complex question, there is a simple answer and it is wrong.

H. L. Mencken

Jacob Wolfowitz first coined the term nonparametric, saying “We shall refer to this situation [where a distribution is completely determined by the knowledge of its finite parameter set] as the parametric case, and denote the opposite case, where the functional forms of the distributions are unknown, as the non‐parametric case” (Wolfowitz, 1942). From that point on, nonparametric statistics was defined by what it is not: traditional statistics based on known distributions with unknown parameters. Randles, Hettmansperger, and Casella (2004) extended this notion by stating that “nonparametric statistics can and should be broadly defined to include all methodology that does not use a model based on a single parametric family.”

Traditional statistical methods are based on parametric assumptions; that is, the data can be assumed to be generated by some well‐known family of distributions, such as normal, exponential, Poisson, and so on. Each of these distributions has one or more parameters (e.g. the normal distribution has $\mu$ and $\sigma^2$), at least one of which is presumed unknown and must be inferred. The emphasis on the normal distribution in linear model theory is often justified by the central limit theorem, which guarantees approximate normality of sample means provided the sample sizes are large enough. Other distributions also play an important role in science and engineering. Physical failure mechanisms often characterize the lifetime distribution of industrial components (e.g. Weibull or lognormal), so parametric methods are important in reliability engineering.

However, with complex experiments and messy sampling plans, the generated data might not be attributed to any well‐known distribution. Analysts limited to basic statistical methods can be trapped into making parametric assumptions about the data that are not apparent in the experiment or the data. In the case where the experimenter is not sure about the underlying distribution of the data, statistical techniques are needed that can be applied regardless of the true distribution of the data. These techniques are called nonparametric methods, or distribution‐free methods.

The terms nonparametric and distribution‐free are not synonymous… Popular usage, however, has equated the terms… Roughly speaking, a nonparametric test is one which makes no hypothesis about the value of a parameter in a statistical density function, whereas a distribution‐free test is one which makes no assumptions about the precise form of the sampled population.

J. V. Bradley (1968)

It can be confusing to understand what is implied by the word “nonparametric.” What is termed modern nonparametrics includes statistical models that are quite refined, except the distribution for error is left unspecified. Wasserman's recent book All of Nonparametric Statistics (Wasserman, 2005) emphasizes only modern topics in nonparametric statistics, such as curve fitting, density estimation, and wavelets. Conover's Practical Nonparametric Statistics (Conover, 1999), on the other hand, is a classic nonparametrics textbook but is mostly limited to traditional binomial and rank tests, contingency tables, and tests for goodness of fit. Topics that are not really under the distribution‐free umbrella, such as robust analysis, Bayesian analysis, and statistical learning, also have important connections to nonparametric statistics and are all featured in this book. Perhaps this text could have been titled A Bit Less of Parametric Statistics with Applications in Science and Engineering, but it surely would have sold fewer copies. On the other hand, if sales were the primary objective, we would have titled this Nonparametric Statistics for Data Science or maybe Nonparametric Statistics with Pictures of Naked People.

1.1 Efficiency of Nonparametric Methods

Doubt is not a pleasant condition, but certainty is absurd.

Francois Marie Voltaire (1694–1778)

It would be a mistake to think that nonparametric procedures are simpler than their parametric counterparts. On the contrary, a primary criticism of using parametric methods in statistical analysis is that they oversimplify the population or process we are observing. Indeed, parametric families are used not because they are perfectly appropriate, but because they are perfectly convenient.

Table 1.1 Asymptotic relative efficiency (ARE) of some basic nonparametric tests.

                     Parametric test    Nonparametric test   ARE (normal)   ARE (double exponential)
Two‐sample test      t‐test             Mann–Whitney         0.955          1.50
Three‐sample test    one‐way layout     Kruskal–Wallis       0.864          1.50
Variances test       F‐test             Conover              0.760          1.08

Nonparametric methods are inherently less powerful than parametric methods. This must be true because parametric methods assume more information in constructing their inferences about the data. When that extra information is not assumed, the estimators are less efficient, where the efficiencies of two estimators are assessed by comparing their variances for the same sample size. In hypothesis testing, for example, this inefficiency of one method relative to another is measured in terms of power.

However, even when the parametric assumptions hold perfectly true, we will see that nonparametric methods are only slightly less powerful than the more presumptuous statistical methods. Furthermore, if the parametric assumptions about the data fail to hold, only the nonparametric method is valid. A t‐test between the means of two normal populations can be dangerously misleading if the underlying data are not actually normally distributed. Some examples of the relative efficiency of nonparametric tests are listed in Table 1.1, where asymptotic relative efficiency (ARE) is used to compare parametric procedures (second column) with their nonparametric counterparts (third column). ARE describes the relative efficiency of two estimators of a parameter as the sample size approaches infinity and is listed for the normal distribution, where parametric assumptions are justified, and the double‐exponential distribution. For example, if the underlying data are normally distributed, the t‐test requires 955 observations to have the same power as the Wilcoxon signed‐rank test based on 1000 observations.
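
As a rough empirical check of the ARE values in Table 1.1 (a sketch, not one of the text's examples), one can simulate two normal samples separated by a shift and compare the rejection rates of the two‐sample t‐test and the Mann–Whitney (Wilcoxon rank‐sum) test; t.test and wilcox.test are standard base R functions, and the sample size and shift below are arbitrary choices for illustration.

# Sketch: empirical power of the t-test vs. the Mann-Whitney test on normal data
set.seed(1)
B <- 2000; n <- 30; shift <- 0.5              # assumed settings for illustration
reject <- matrix(NA, B, 2)
for (b in 1:B) {
  x <- rnorm(n); y <- rnorm(n, mean = shift)
  reject[b, 1] <- t.test(x, y)$p.value < 0.05       # parametric two-sample test
  reject[b, 2] <- wilcox.test(x, y)$p.value < 0.05  # nonparametric counterpart
}
colMeans(reject)   # estimated powers are close, consistent with ARE = 0.955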

Parametric assumptions allow us to extrapolate away from the data. For example, it is hardly uncommon for an experimenter to make inferences about a population's extreme upper percentile (say, 99th percentile) with a sample so small that none of the observations would be expected to exceed that percentile. If the assumptions are not justified, this is grossly unscientific.

Nonparametric methods are seldom used to extrapolate outside the range of observed data. In a typical nonparametric analysis, little or nothing can be said about the probability of obtaining future data beyond the largest sampled observation or less than the smallest one. For this reason, the actual measurements of a sample item means less than its rank within the sample. In fact, nonparametric methods are typically based on ranks of the data, and properties of the population are deduced using order statistics (Chapter 5). The measurement scales for typical data are as follows:

Nominal scale: numbers used only to categorize outcomes (e.g. we might define a random variable to equal one in the event a coin flips heads and zero if it flips tails).

Ordinal scale: numbers can be used to order outcomes (e.g. the event X is greater than the event Y if X = medium and Y = small).

Interval scale: order between numbers and distances between numbers are used to compare outcomes.

Only interval scale measurements can be used by parametric methods. Nonparametric methods based on ranks can use ordinal scale measurements, and simpler nonparametric techniques can be used with nominal scale measurements.

The binomial distribution is characterized by counting the number of independent observations that are classified into a particular category. Binomial data can be formed from measurements based on a nominal scale of measurements; thus binomial models are among the most frequently encountered models in nonparametric analysis. For this reason, Chapter 3 includes a special emphasis on statistical estimation and testing associated with binomial samples.

1.2 Overconfidence Bias

Be slow to believe what you worst want to be true

Samuel Pepys (1633–1703)

Confirmation Bias or Overconfidence Bias describes our tendency to search for or interpret information in a way that confirms our preconceptions. Business and finance have shown interest in this psychological phenomenon (Tversky and Kahneman, 1974) because it has proven to have a significant effect on personal and corporate financial decisions, where the decision maker will actively seek out and give extra weight to evidence that confirms a hypothesis they already favor. At the same time, the decision maker tends to ignore evidence that contradicts or disconfirms their hypothesis.

Overconfidence bias has a natural tendency to affect an experimenter's data analysis for the same reasons. While the dictates of the experiment and the data sampling should reduce the possibility of this problem, one of the clear pathways open to such bias is the infusion of parametric assumptions into the data analysis. After all, if the assumptions seem plausible, the researcher has much to gain from the extra certainty that comes from the assumptions in terms of narrower confidence intervals and more powerful statistical tests.

Nonparametric procedures serve as a buffer against this human tendency of looking for the evidence that best supports the researcher's underlying hypothesis. Given the subjective interests behind many corporate research findings, nonparametric methods can help alleviate doubt about their validity in cases where these procedures lend statistical significance to the corporation's claims.

If everything isn't black and white, I say…

Why the hell not?

John Wayne (1907–1979)

1.3 Computing with R

Because a typical nonparametric analysis can be computationally intensive, computer support is essential to understand both theory and applications. Numerous software products can be used to complete exercises and run nonparametric analysis in this textbook, including SAS, SPSS, MINITAB, MATLAB, StatXact, and JMP (to name a few). A student familiar with one of these platforms can incorporate it with the lessons provided here, and without too much extra work.

It must be stressed, however, that demonstrations in this book rely mainly on a single software package called R (maintained by the R Foundation). R is a “GNU” (free) programming environment for statistical computing and graphics. Today, R is one of the fastest growing software environments, with over 5000 packages that enable users to perform various kinds of statistical analysis. Because of its open‐source and extensible nature, it has been widely used in research and engineering practice and is rapidly becoming the dominant software tool for data manipulation, modeling, analysis, and graphical display. R is available on Unix systems, Microsoft Windows, and Apple Macintosh. If you are unfamiliar with R, in Appendix B we present a brief tutorial along with a short description of some R procedures that are used to solve analytical problems and demonstrate nonparametric methods in this book. For a more comprehensive guide, we recommend the book An Introduction to R (Venables, Smith, and the R Core Team, 2014). For more detailed information, visit

http://www.r-project.org

A user‐friendly computing platform for R is provided by R‐Studio, which can be downloaded for free at

https://www.rstudio.com

RStudio Cloud allows students a convenient way of accessing the RStudio development environment without having to worry about installation problems associated with R and RStudio. Classroom instructors and students can easily share work spaces using R‐Markdown files, for example,

http://rstudio.cloud
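
For readers new to R, a minimal first session might look like the following sketch; install.packages, library, and wilcox.test are standard base/CRAN commands, and ggplot2 is the graphics package mentioned above.

# Install a CRAN package once, then load it in each session
install.packages("ggplot2")
library(ggplot2)

# Read documentation and try a first nonparametric test
?wilcox.test
wilcox.test(rnorm(20), rnorm(20) + 1)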

We hope that many students of statistics will find this book useful, but it was written primarily with the scientist and engineer in mind. With nothing against statisticians (some of our acquaintances know statisticians), our approach emphasizes the application of the method over its mathematical theory. We have intentionally made the text less heavy with theory and instead emphasized applications and examples. If you come into this course thinking the history of nonparametric statistics is dry and unexciting, you are probably right, at least compared with the history of ancient Rome, the British monarchy, or maybe even Wayne Newton.1 Nonetheless, we made efforts to convince you otherwise by noting the interesting historical context of the research and the personalities behind its development. For example, we will learn more about Karl Pearson (1857–1936) and R. A. Fisher (1890–1962), legendary scientists and competitive archrivals, who both contributed greatly to the foundation of nonparametric statistics through their separate research directions.

In short, this book features techniques of data analysis that rely less on the assumptions of the data's good behavior – the very assumptions that can get researchers in trouble. Science's gravitation toward distribution‐free techniques is due to both a deeper awareness of experimental uncertainty and the availability of ever‐increasing computational abilities to deal with the implied ambiguities in the experimental outcome.

1.4 Exercises

1.1

Describe a potential data analysis in your field of study where parametric methods are appropriate. How would you defend this assumption?

1.2

Describe another potential data analysis in your field of study where parametric methods may not be appropriate. What might prevent you from using parametric assumptions in this case?

1.3

Describe three ways in which overconfidence bias can affect the statistical analysis of experimental data. How can this problem be overcome?

1.4

For an analysis of variance involving three treatment groups, the traditional one‐way layout is more efficient than Kruskal and Wallis's nonparametric test. If the Kruskal–Wallis test requires 400 observations to achieve the desired test power, how many observations would the parametric test need to achieve the same power?

1.5

Find an example of data from your field of study that is considered ordinal.

References

Bradley, J. V. (1968), Distribution Free Statistical Tests, Englewood Cliffs, NJ: Prentice Hall.

Conover, W. J. (1999), Practical Nonparametric Statistics, New York: Wiley.

Randles, R. H., Hettmansperger, T. P., and Casella, G. (2004), “Introduction to the Special Issue Nonparametric Statistics,” Statistical Science, 19, 561–562.

Tversky, A., and Kahneman, D. (1974), “Judgment Under Uncertainty: Heuristics and Biases,” Science, 185, 1124–1131.

Venables, W. N., Smith, D. M., and the R Core Team (2014), An Introduction to R, Version 3.1.0, Technical Report, The Comprehensive R Archive Network (CRAN).

Wasserman, L. (2005), All of Nonparametric Statistics, New York: Springer‐Verlag.

Wolfowitz, J. (1942), “Additive Partition Functions and a Class of Statistical Hypotheses,” Annals of Mathematical Statistics, 13, 247–279.

Note

1 Strangely popular Las Vegas entertainer.

2 Probability Basics

Probability theory is nothing but common sense reduced to calculation.

Pierre Simon Laplace (1749–1827)

In Chapters 2 and 3, we review some fundamental concepts of elementary probability and statistics. If you think you can use these chapters to catch up on all the statistics you forgot since you passed “Introductory Statistics” in your college sophomore year, you are acutely mistaken. What is offered here is an abbreviated reference list of definitions and formulas that have applications to nonparametric statistical theory. Some parametric distributions, useful for models in both parametric and nonparametric procedures, are listed, but the discussion is abridged.

2.1 Helpful Functions

Permutations: The number of arrangements of $n$ distinct objects is $n! = n(n-1)\cdots 2\cdot 1$. In R: factorial(n).

Combinations: The number of distinct ways of choosing $k$ items from a set of $n$ is
$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}.$$
In R: choose(n,k). Note that all possible ways of choosing $k$ items from a set of $n$ can be obtained by combn(n,k).

Gamma function: $\Gamma(t) = \int_0^\infty x^{t-1} e^{-x}\,dx$ is called the gamma function. If $t$ is a positive integer, $\Gamma(t) = (t-1)!$. In R: gamma(t).

Incomplete gamma is defined as
$$\gamma(z, t) = \int_0^t x^{z-1} e^{-x}\,dx.$$
In R: pgamma(t,z,1). The upper tail incomplete gamma is defined as
$$\Gamma(z, t) = \int_t^\infty x^{z-1} e^{-x}\,dx;$$
in R: 1-pgamma(t,z,1). If $z$ is an integer,
$$\Gamma(z, t) = (z-1)!\, e^{-t} \sum_{i=0}^{z-1} \frac{t^i}{i!}.$$
Note that pgamma is a cumulative distribution function (CDF) of the gamma distribution. With the scale parameter set to 1, pgamma reduces to the (normalized) incomplete gamma.

Beta function: $B(a, b) = \int_0^1 t^{a-1}(1-t)^{b-1}\,dt = \Gamma(a)\Gamma(b)/\Gamma(a+b)$. In R: beta(a,b).

Incomplete beta: $B(x, a, b) = \int_0^x t^{a-1}(1-t)^{b-1}\,dt$, $0 \le x \le 1$. In R: pbeta(x,a,b) represents the normalized incomplete beta, defined as $I_x(a, b) = B(x, a, b)/B(a, b)$.

Summations of powers of integers:
$$\sum_{i=1}^{n} i = \frac{n(n+1)}{2}, \qquad \sum_{i=1}^{n} i^2 = \frac{n(n+1)(2n+1)}{6}, \qquad \sum_{i=1}^{n} i^3 = \frac{n^2(n+1)^2}{4}.$$

Floor function: $\lfloor a \rfloor$ denotes the greatest integer less than or equal to $a$. In R: floor(a).

Geometric series: For $|p| < 1$,
$$\sum_{i=0}^{n} p^i = \frac{1 - p^{n+1}}{1 - p} \quad \text{and} \quad \sum_{i=0}^{\infty} p^i = \frac{1}{1 - p}.$$

Stirling's formula: To approximate the value of a large factorial,
$$n! \approx \sqrt{2\pi}\, n^{n + 1/2} e^{-n}.$$

Common limit for $e$: For a constant $a$,
$$\lim_{n \to \infty} \left(1 + \frac{a}{n}\right)^{n} = e^{a}.$$
This can also be expressed as $(1 + ah)^{1/h} \to e^{a}$ as $h \to 0$.

Newton's formula: For a positive integer $n$,
$$(a + b)^n = \sum_{k=0}^{n} \binom{n}{k} a^{k} b^{n-k}.$$

Taylor series expansion: For a function $f(x)$, its Taylor series expansion about $x = a$ is defined as
$$f(x) = f(a) + f'(a)(x - a) + \frac{f''(a)}{2!}(x - a)^2 + \cdots + \frac{f^{(k)}(a)}{k!}(x - a)^k + R_k,$$
where $f^{(k)}(a)$ denotes the $k$th derivative of $f$ evaluated at $a$ and, for some $\xi$ between $a$ and $x$,
$$R_k = \frac{f^{(k+1)}(\xi)}{(k+1)!}(x - a)^{k+1}.$$

Convex function: A function $f$ is convex if for any $0 \le p \le 1$,
$$f(px + (1 - p)y) \le p f(x) + (1 - p) f(y)$$
for all values of $x$ and $y$. If $f$ is twice differentiable, then $f$ is convex if $f''(x) \ge 0$. Also, if $-f$ is convex, then $f$ is said to be concave.

Bessel function: $J_n(x)$ is defined as the solution to the equation
$$x^2 \frac{d^2 y}{dx^2} + x \frac{dy}{dx} + (x^2 - n^2)\, y = 0.$$
In R: besselJ(x,n).
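
A few of the helper functions above can be checked directly in R; the calls below use only the base functions already named in this section, with arbitrary arguments chosen for illustration.

# Quick numerical checks of the helper functions (base R)
factorial(5)        # 5! = 120 arrangements of 5 objects
choose(10, 3)       # 120 ways to choose 3 items from 10
combn(4, 2)         # all 2-element subsets of {1, 2, 3, 4}
gamma(5)            # Gamma(5) = 4! = 24
pgamma(2, 3, 1)     # normalized incomplete gamma with t = 2, z = 3
beta(2, 3)          # B(2, 3) = Gamma(2) Gamma(3) / Gamma(5) = 1/12
pbeta(0.5, 2, 3)    # normalized incomplete beta at x = 0.5
floor(3.7)          # greatest integer not exceeding 3.7
besselJ(1, 0)       # Bessel function J_0 evaluated at x = 1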

2.2 Events, Probabilities, and Random Variables

The conditional probability of event $A$ occurring given that event $B$ occurs is $P(A|B) = P(AB)/P(B)$, where $AB$ represents the intersection of events $A$ and $B$, and $P(B) > 0$.

Events $A$ and $B$ are stochastically independent if and only if $P(A|B) = P(A)$ or, equivalently, $P(AB) = P(A)P(B)$.

Law of total probability: Let $A_1, A_2, \dots, A_n$ be a partition of the sample space $\Omega$, i.e. $A_1 \cup A_2 \cup \dots \cup A_n = \Omega$ and $A_i \cap A_j = \emptyset$ for $i \ne j$. For event $B$, $P(B) = \sum_{i=1}^{n} P(B|A_i)P(A_i)$.

Bayes formula: For an event $B$ where $P(B) > 0$ and partition $A_1, \dots, A_n$ of $\Omega$,
$$P(A_i | B) = \frac{P(B|A_i)\,P(A_i)}{\sum_{j=1}^{n} P(B|A_j)\,P(A_j)}.$$

A function that assigns real numbers to points in the sample space of events is called a random variable.1

For a random variable $X$, $F(x) = P(X \le x)$ represents its (cumulative) distribution function, which is nondecreasing with $F(-\infty) = 0$ and $F(\infty) = 1$. In this book, it will often be denoted simply as CDF. The survivor function is defined as $S(x) = 1 - F(x)$.

If the CDF's derivative exists, $f(x) = dF(x)/dx$ represents the probability density function, or PDF.

A discrete random variable is one that can take on a countable set of values $x_1, x_2, x_3, \dots$ so that $\sum_i P(X = x_i) = 1$. Over the support, the probability $P(X = x_i)$ is called the probability mass function, or PMF.

A continuous random variable is one that takes on any real value in an interval, so $P(a < X \le b) = \int_a^b f(x)\,dx$, where $f(x)$ is the density function of $X$.

For two random variables $X$ and $Y$, their joint distribution function is $F(x, y) = P(X \le x, Y \le y)$. If the variables are continuous, one can define the joint density function $f(x, y)$ as
$$f(x, y) = \frac{\partial^2}{\partial x\, \partial y} F(x, y).$$
The conditional density of $Y$, given $X = x$, is $f(y|x) = f(x, y)/f(x)$, where $f(x)$ is the density of $X$.

Two random variables $X$ and $Y$, with distributions $F_X$ and $F_Y$, are independent if the joint distribution $F(x, y)$ of $(X, Y)$ is such that $F(x, y) = F_X(x)\,F_Y(y)$. For any sequence of random variables $X_1, X_2, \dots, X_n$ that are independent with the same (identical) marginal distribution, we will denote this using i.i.d.
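
The law of total probability and the Bayes formula translate directly into vectorized arithmetic in R; the probabilities below are made up purely for illustration.

# Bayes formula with a three-event partition (illustrative probabilities)
prior <- c(0.5, 0.3, 0.2)        # P(A_1), P(A_2), P(A_3): a partition of Omega
lik   <- c(0.10, 0.40, 0.70)     # P(B | A_i)
pB    <- sum(lik * prior)        # law of total probability: P(B)
post  <- lik * prior / pB        # Bayes formula: P(A_i | B)
pB; post; sum(post)              # the posterior probabilities sum to 1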

2.3 Numerical Characteristics of Random Variables

For a random variable $X$ with distribution function $F$, the expected value of some function $\phi(X)$ is defined as $\mathbb{E}\phi(X) = \int \phi(x)\,dF(x)$. If $F$ is continuous with density $f(x)$, then $\mathbb{E}\phi(X) = \int \phi(x) f(x)\,dx$. If $X$ is discrete, then $\mathbb{E}\phi(X) = \sum_i \phi(x_i)\,P(X = x_i)$.

The $k$th moment of $X$ is denoted as $\mathbb{E}X^{k}$. The $k$th moment about the mean, or $k$th central moment of $X$, is defined as $\mathbb{E}(X - \mu)^{k}$, where $\mu = \mathbb{E}X$.

The variance of a random variable $X$ is the second central moment, $\operatorname{Var} X = \mathbb{E}(X - \mu)^2$. Often, the variance is denoted by $\sigma_X^2$ or simply by $\sigma^2$ when it is clear which random variable is involved. The square root of the variance, $\sigma_X = \sqrt{\operatorname{Var} X}$, is called the standard deviation of $X$.

With $0 < p < 1$, the $p$th quantile of $F$, denoted $x_p$, is the value $x$ such that $P(X \le x) \ge p$ and $P(X \ge x) \ge 1 - p$. If the CDF $F$ is invertible, then $x_p = F^{-1}(p)$. The 0.5th quantile is called the median of $X$.

For two random variables $X$ and $Y$, the covariance of $X$ and $Y$ is defined as $\operatorname{Cov}(X, Y) = \mathbb{E}[(X - \mu_X)(Y - \mu_Y)]$, where $\mu_X$ and $\mu_Y$ are the respective expectations of $X$ and $Y$.

For two random variables $X$ and $Y$ with covariance $\operatorname{Cov}(X, Y)$, the correlation coefficient is defined as
$$\operatorname{Corr}(X, Y) = \frac{\operatorname{Cov}(X, Y)}{\sigma_X\, \sigma_Y},$$
where $\sigma_X$ and $\sigma_Y$ are the respective standard deviations of $X$ and $Y$. Note that $-1 \le \operatorname{Corr}(X, Y) \le 1$ is a consequence of the Cauchy–Schwartz inequality (Section 2.8).

The characteristic function of a random variable $X$ is defined as $\varphi_X(t) = \mathbb{E}\, e^{itX}$. The moment generating function of a random variable $X$ is defined as $m_X(t) = \mathbb{E}\, e^{tX}$, whenever the integral exists. By differentiating $k$ times and letting $t \to 0$, we have that $m_X^{(k)}(0) = \mathbb{E}X^{k}$.

The conditional expectation of a random variable $X$ given $Y = y$ is defined as $\mathbb{E}(X | Y = y) = \int x\, f(x|y)\,dx$, where $f(x|y)$ is the conditional density of $X$ given $Y = y$.

For random variables $X$ and $Y$ with finite means and variances, we can obtain moments of $X$ through its conditional distribution:
$$\mathbb{E}X = \mathbb{E}\big[\mathbb{E}(X|Y)\big], \qquad \operatorname{Var} X = \mathbb{E}\big[\operatorname{Var}(X|Y)\big] + \operatorname{Var}\big[\mathbb{E}(X|Y)\big].$$
These two equations are commonly referred to as Adam and Eve's rules.
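
These definitions are easy to verify by simulation; the sketch below checks the mean, variance, median, and correlation for an exponential random variable and a correlated pair, using only base R, with arbitrarily chosen parameters.

# Monte Carlo check of mean, variance, quantile, and correlation
set.seed(2)
x <- rexp(1e5, rate = 2)         # for Exp(2): E X = 1/2, Var X = 1/4
mean(x); var(x)                  # compare with 0.5 and 0.25
quantile(x, 0.5); log(2) / 2     # sample median vs. theoretical median
y <- x + rnorm(1e5, sd = 0.5)    # a second variable correlated with x
cov(x, y); cor(x, y)             # sample covariance and correlation coefficient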

2.4 Discrete Distributions

Ironically, parametric distributions have an important role to play in the development of nonparametric methods. Even if we are analyzing data without making assumptions about the distributions that generate the data, these parametric families appear nonetheless. In counting trials, for example, we can generate well‐known discrete distributions (e.g. binomial, geometric) assuming only that the counts are independent and probabilities remain the same from trial to trial.

2.4.1 Binomial Distribution

A simple Bernoulli random variable $X$ is dichotomous with $P(X = 1) = p$ and $P(X = 0) = 1 - p$ for some $0 \le p \le 1$. It is denoted as $X \sim \mathrm{Ber}(p)$. Suppose an experiment consists of $n$ independent trials in which two outcomes are possible (e.g. success or failure), with $P(\text{success}) = p$ for each trial. If $X$ is defined as the number of successes (out of $n$), then $X = X_1 + X_2 + \dots + X_n$, and there are $\binom{n}{x}$ arrangements of $x$ successes and $n - x$ failures, each having the same probability $p^{x}(1-p)^{n-x}$. $X$ is a binomial random variable with PMF
$$p_X(x) = \binom{n}{x} p^{x} (1-p)^{n-x}, \qquad x = 0, 1, \dots, n.$$

This is denoted by $X \sim \mathrm{Bin}(n, p)$. From the moment generating function $m_X(t) = (pe^{t} + 1 - p)^{n}$, we obtain $\mathbb{E}X = np$ and $\operatorname{Var} X = np(1-p)$.

The cumulative distribution for a binomial random variable is not simplified beyond the sum; i.e. $F(x) = \sum_{i \le x} \binom{n}{i} p^{i}(1-p)^{n-i}$. However, interval probabilities can be computed in R using pbinom(x,n,p), which computes the CDF at value $x$. The PMF is also computed in R using dbinom(x,n,p).
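
The binomial PMF, CDF, and moments can be checked with the R functions named above; the parameter values below are arbitrary.

# Binomial probabilities and moments in R
n <- 10; p <- 0.3
dbinom(4, n, p)          # P(X = 4) = choose(10, 4) * 0.3^4 * 0.7^6
pbinom(4, n, p)          # P(X <= 4), the CDF at 4
sum(dbinom(0:4, n, p))   # the same value, summing the PMF
x <- rbinom(1e5, n, p)   # simulated binomial sample
mean(x); var(x)          # compare with n*p = 3 and n*p*(1-p) = 2.1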

2.4.2 Poisson Distribution

A Poisson random variable may characterize the number of events occurring in some fixed interval of time, so that the events occur with a constant rate $\lambda$ and independently of the time since any previous event. The PMF for the Poisson distribution is
$$p_X(x) = \frac{\lambda^{x}}{x!} e^{-\lambda}, \qquad x = 0, 1, 2, \dots$$

This is denoted by $X \sim \mathcal{P}(\lambda)$. From $m_X(t) = \exp\{\lambda(e^{t} - 1)\}$, we have $\mathbb{E}X = \lambda$ and $\operatorname{Var} X = \lambda$; the mean and the variance coincide.

The sum of a finite independent set of Poisson variables also has a Poisson distribution. Specifically, if $X_i \sim \mathcal{P}(\lambda_i)$, then $Y = X_1 + \dots + X_k$ is distributed as $\mathcal{P}(\lambda_1 + \dots + \lambda_k)$. Furthermore, the Poisson distribution is a limiting form for a binomial model, i.e.

(2.1)
$$\lim_{n \to \infty,\ np \to \lambda} \binom{n}{x} p^{x} (1-p)^{n-x} = \frac{\lambda^{x}}{x!} e^{-\lambda}.$$

R commands for Poisson CDF, PDF, quantile, and a random number are ppois, dpois, qpois, and rpois.
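
The Poisson limit in (2.1) is easy to see numerically; the sketch below compares binomial probabilities with large $n$ and small $p$ to Poisson probabilities with the same mean.

# Poisson approximation to the binomial with lambda = n * p
lambda <- 2
n <- 1000; p <- lambda / n
rbind(binomial = dbinom(0:5, n, p),
      poisson  = dpois(0:5, lambda))   # the two rows nearly agree for large n
ppois(3, lambda)                        # Poisson CDF at 3
qpois(0.95, lambda)                     # 0.95 quantile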

2.4.3 Negative Binomial Distribution

Suppose we are dealing with i.i.d. trials again, this time counting the number of successes observed until a fixed number of failures ($k$) occur. If we observe $k$ consecutive failures at the start of the experiment, for example, the count is $X = 0$ and $P(X = 0) = p^{k}$, where $p$ is the probability of failure. If $X = x$, we have observed $x$ successes and $k$ failures in $x + k$ trials. There are $\binom{x + k}{x}$ different ways of arranging those $x + k$ trials, but we can only be concerned with the arrangements in which the last trial ended in a failure. So there are really only $\binom{x + k - 1}{x}$ arrangements, each equal in probability. With this in mind, the PMF is
$$p_X(x) = \binom{x + k - 1}{x} p^{k} (1-p)^{x}, \qquad x = 0, 1, 2, \dots$$
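
In R, dnbinom(x, size, prob) gives the probability of x “failures” occurring before the size‐th “success” with success probability prob; with the roles of success and failure interchanged as in the derivation above (p being the probability of failure), dnbinom(x, k, p) matches the PMF just obtained. A brief check, with arbitrary k and p:

# Negative binomial: successes observed before the k-th failure
k <- 3; p <- 0.4                               # p = probability of failure
x <- 0:6
pmf <- choose(x + k - 1, x) * p^k * (1 - p)^x  # PMF derived above
cbind(pmf, dnbinom(x, size = k, prob = p))     # the two columns agree
sum(dnbinom(0:200, k, p))                      # probabilities sum to (nearly) 1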