An essential roadmap to the application of computational statistics in contemporary data science.

In Computational Statistics in Data Science, a team of distinguished mathematicians and statisticians delivers an expert compilation of concepts, theories, techniques, and practices in computational statistics for readers who seek a single, standalone sourcebook on statistics in contemporary data science. The book contains multiple sections devoted to key, specific areas in computational statistics, offering modern and accessible presentations of up-to-date techniques. Computational Statistics in Data Science provides complimentary access to finalized entries in the Wiley StatsRef: Statistics Reference Online compendium.

Readers will also find:

* A thorough introduction to computational statistics relevant and accessible to practitioners and researchers in a variety of data-intensive areas
* Comprehensive explorations of active topics in statistics, including big data, data stream processing, quantitative visualization, and deep learning

Perfect for researchers and scholars working in any field requiring intermediate and advanced computational statistics techniques, Computational Statistics in Data Science will also earn a place in the libraries of scholars researching and developing computational data-scientific technologies and statistical graphics.
Cover
Title Page
Copyright
List of Contributors
Preface
Reference
Part I: Computational Statistics and Data Science
1 Computational Statistics and Data Science in the Twenty‐First Century
1 Introduction
2 Core Challenges 1–3
3 Model‐Specific Advances
4 Core Challenges 4 and 5
5 Rise of Data Science
Acknowledgments
Notes
References
2 Statistical Software
1 User Development Environments
2 Popular Statistical Software
3 Noteworthy Statistical Software and Related Tools
4 Promising and Emerging Statistical Software
5 The Future of Statistical Computing
6 Concluding Remarks
Acknowledgments
References
Further Reading
3 An Introduction to Deep Learning Methods
1 Introduction
2 Machine Learning: An Overview
3 Feedforward Neural Networks
4 Convolutional Neural Networks
5 Autoencoders
6 Recurrent Neural Networks
7 Conclusion
References
4 Streaming Data and Data Streams
1 Introduction
2 Data Stream Computing
3 Issues in Data Stream Mining
4 Streaming Data Tools and Technologies
5 Streaming Data Pre‐Processing: Concept and Implementation
6 Streaming Data Algorithms
7 Strategies for Processing Data Streams
8 Best Practices for Managing Data Streams
9 Conclusion and the Way Forward
References
Part II: Simulation‐Based Methods
5 Monte Carlo Simulation: Are We There Yet?
1 Introduction
2 Estimation
3 Sampling Distribution
4 Estimating
5 Stopping Rules
6 Workflow
7 Examples
References
6 Sequential Monte Carlo: Particle Filters and Beyond
1 Introduction
2 Sequential Importance Sampling and Resampling
3 SMC in Statistical Contexts
4 Selected Recent Developments
Acknowledgments
Note
References
7 Markov Chain Monte Carlo Methods, A Survey with Some Frequent Misunderstandings
1 Introduction
2 Monte Carlo Methods
3 Markov Chain Monte Carlo Methods
4 Approximate Bayesian Computation
5 Further Reading
Abbreviations and Acronyms
Notes
References
Note
8 Bayesian Inference with Adaptive Markov Chain Monte Carlo
1 Introduction
2 Random‐Walk Metropolis Algorithm
3 Adaptation of Random‐Walk Metropolis
4 Multimodal Targets with Parallel Tempering
5 Dynamic Models with Particle Filters
6 Discussion
Acknowledgments
Notes
References
9 Advances in Importance Sampling
1 Introduction and Problem Statement
2 Importance Sampling
3 Multiple Importance Sampling (MIS)
4 Adaptive Importance Sampling (AIS)
Acknowledgments
Notes
References
Part III: Statistical Learning
10 Supervised Learning
1 Introduction
2 Penalized Empirical Risk Minimization
3 Linear Regression
4 Classification
5 Extensions for Complex Data
6 Discussion
References
11 Unsupervised and Semisupervised Learning
1 Introduction
2 Unsupervised Learning
3 Semisupervised Learning
4 Conclusions
Acknowledgment
Notes
References
12 Random Forests
1 Introduction
2 Random Forest (RF)
3 Random Forest Extensions
4 Random Forests of Interaction Trees (RFIT)
5 Random Forest of Interaction Trees for Observational Studies
6 Discussion
References
13 Network Analysis
1 Introduction
2 Gaussian Graphical Models for Mixed Partial Compositional Data
3 Theoretical Properties
4 Graphical Model Selection
5 Analysis of a Microbiome–Metabolomics Dataset
6 Discussion
References
14 Tensors in Modern Statistical Learning
1 Introduction
2 Background
3 Tensor Supervised Learning
4 Tensor Unsupervised Learning
5 Tensor Reinforcement Learning
6 Tensor Deep Learning
Acknowledgments
References
15 Computational Approaches to Bayesian Additive Regression Trees
1 Introduction
2 Bayesian CART
3 Tree MCMC
4 The BART Model
5 BART Example: Boston Housing Values and Air Pollution
6 BART MCMC
7 BART Extensions
8 Conclusion
References
Part IV: High‐Dimensional Data Analysis
16 Penalized Regression
1 Introduction
2 Penalization for Smoothness
3 Penalization for Sparsity
4 Tuning Parameter Selection
References
17 Model Selection in High‐Dimensional Regression
1 Model Selection Problem
2 Model Selection in High‐Dimensional Linear Regression
3 Interaction‐Effect Selection for High‐Dimensional Data
4 Model Selection in High‐Dimensional Nonparametric Models
5 Concluding Remarks
References
18 Sampling Local Scale Parameters in High-Dimensional Regression Models
1 Introduction
2 A Blocked Gibbs Sampler for the Horseshoe
3 Sampling
4 Sampling
5 Appendix: A. Newton–Raphson Steps for the Inverse‐cdf Sampler for
Acknowledgment
References
Note
19 Factor Modeling for High-Dimensional Time Series
1 Introduction
2 Identifiability
3 Estimation of High‐Dimensional Factor Model
4 Determining the Number of Factors
Acknowledgment
References
Part V: Quantitative Visualization
20 Visual Communication of Data: It Is Not a Programming Problem, It Is Viewer Perception
1 Introduction
2 Case Studies Part 1
3 Let StAR Be Your Guide
4 Case Studies Part 2: Using StAR Principles to Develop Better Graphics
5 Ask Colleagues Their Opinion
6 Case Studies: Part 3
7 Iterate
8 Final Thoughts
Notes
References
21 Uncertainty Visualization
1 Introduction
2 Uncertainty Visualization Theories
3 General Discussion
References
22 Big Data Visualization
1 Introduction
2 Architecture for Big Data Analytics
3 Filtering
4 Aggregating
5 Analyzing
6 Big Data Graphics
7 Conclusion
References
23 Visualization‐Assisted Statistical Learning
1 Introduction
2 Better Visualizations with Seriation
3 Visualizing Machine Learning Fits
4 Condvis2 Case Studies
5 Discussion
References
24 Functional Data Visualization
1 Introduction
2 Univariate Functional Data Visualization
3 Multivariate Functional Data Visualization
4 Conclusions
Acknowledgment
References
Part VI: Numerical Approximation and Optimization
25 Gradient‐Based Optimizers for Statistics and Machine Learning
1 Introduction
2 Convex Versus Nonconvex Optimization
3 Gradient Descent
4 Proximal Gradient Descent: Handling Nondifferentiable Regularization
5 Stochastic Gradient Descent
References
26 Alternating Minimization Algorithms
1 Introduction
2 Coordinate Descent
3 EM as Alternating Minimization
4 Matrix Approximation Algorithms
5 Conclusion
References
27 A Gentle Introduction to Alternating Direction Method of Multipliers (ADMM) for Statistical Problems
1 Introduction
2 Two Perfect Examples of ADMM
3 Variable Splitting and Linearized ADMM
4 Multiblock ADMM
5 Nonconvex Problems
6 Stopping Criteria
7 Convergence Results of ADMM
Acknowledgments
References
28 Nonconvex Optimization via MM Algorithms: Convergence Theory
1 Background
2 Convergence Theorems
3 Paracontraction
4 Bregman Majorization
References
Part VII: High‐Performance Computing
29 Massive Parallelization
1 Introduction
2 Gaussian Process Regression and Surrogate Modeling
3 Divide‐and‐Conquer GP Regression
4 Empirical Results
5 Conclusion
Acknowledgments
References
30 Divide‐and‐Conquer Methods for Big Data Analysis
1 Introduction
2 Linear Regression Model
3 Parametric Models
4 Nonparametric and Semiparametric Models
5 Online Sequential Updating
6 Splitting the Number of Covariates
7 Bayesian Divide‐and‐Conquer and Median‐Based Combining
8 Real‐World Applications
9 Discussion
Acknowledgment
References
31 Bayesian Aggregation
1 From Model Selection to Model Combination
2 From Bayesian Model Averaging to Bayesian Stacking
3 Asymptotic Theories of Stacking
4 Stacking in Practice
5 Discussion
References
32 Asynchronous Parallel Computing
1 Introduction
2 Asynchronous Parallel Coordinate Update
3 Asynchronous Parallel Stochastic Approaches
4 Doubly Stochastic Coordinate Optimization with Variance Reduction
5 Concluding Remarks
References
Index
Abbreviations and Acronyms
End User License Agreement
Chapter 2
Table 1 Summary of selected statistical software.
Table 2 Summary of selected user environments/workflows.
Chapter 3
Table 1 Connection between input and output matrices in the third layer of L...
Chapter 4
Table 1 Streaming data versus static data [9, 10]
Chapter 5
Table 1 Probabilities for each action figure
Chapter 8
Table 1 Summary of ingredients of Algorithm 2 for the four adaptive MCMC me...
Table 2 Summary of recommended algorithms for specific problems and their s...
Chapter 9
Table 1 Summary of the notation.
Table 2 Comparison of various AIS algorithms according to different feature...
Table 3 Comparison of various AIS algorithms according to the computational...
Chapter 21
Table 1 Summary of uncertainty visualization theory detailed in this chapte...
Chapter 29
Table 2 Updated GPU/CPU results based on a more modern cascade of supercomp...
Chapter 1
Figure 1 A nontraditional and critically important application in computatio...
Chapter 3
Figure 1 An MLP with three layers.
Figure 2 Convolution operation with stride size.
Figure 3 Pooling operation with stride size.
Figure 4 LeNet‐5 of LeCun et al. [8].
Figure 5 Architecture of an autoencoder.
Figure 6 Architecture of variational autoencoder (VAE).
Figure 7 Feedforward network.
Figure 8 Architecture of recurrent neural network (RNN).
Figure 9 Architecture of long short‐term memory network (LSTM).
Chapter 4
Figure 1 Taxonomy of concept drift in data stream.
Chapter 5
Figure 1 Histograms of simulated boxes and mean number of boxes for two Mont...
Figure 2 Estimated risk at (a) and at (b) with pointwise Bonferroni corr...
Figure 3 Estimated density of the marginal posterior for from an initial r...
Figure 4 Estimated autocorrelations for nonlinchpin sampler (a) and linchpin...
Chapter 7
Figure 1 Importance sampling with importance distribution of an exponential
Figure 2 Failed simulation of a Student's distribution with mean when si...
Figure 3 Recovery of a Normal distribution when simulating realizations ...
Figure 4 Histogram of simulations of a distribution with the target dens...
Figure 5 (a) Histogram of iterations of a slice sampler with a Normal ta...
Figure 6 100 last moves of the above slice sampler.
Figure 7 Independent Metropolis sequence with a proposal equal to the dens...
Figure 8 Fit of a Metropolis sample of size to a target when using a trunc...
Figure 9 Graph of a truncated Normal density and fit by the histogram of an ...
Chapter 9
Figure 1 Graphical description of three possible dependencies between the ad...
Chapter 12
Figure 1 Decision tree for headache data.
Figure 2 RFIT analysis of the headache data: (a) Estimated ITE with SE error...
Figure 3 Exploring important effect moderators in the headache data: (a) Var...
Figure 4 Comparison of MSE averaged over 1000 interaction trees using method...
Chapter 13
Figure 1 The metabolite–microbe interaction network. Only edges linking a me...
Figure 2 Scatter plots of microbe and metabolite pairs.
Chapter 14
Figure 1 An example of first‐, second‐, and third‐order tensors.
Figure 2 Tensor fibers, unfolding and vectorization.
Figure 3 An example of magnetic resonance imaging. The image is obtained fro...
Figure 4 A third‐order tensor with a checkerbox structure.
Figure 5 A schematic illustration of the low‐rank tensor clustering method....
Figure 6 The tensor formulation of multidimensional advertising decisions.
Figure 7 Illustration of the tensor‐based CNN compression from Kossaifi et a...
Chapter 15
Figure 1 A Bayesian tree.
Figure 2 The Boston housing data was compiled from the 1970 US Census, where...
Figure 3 The distribution of and the sparse Dirichlet prior [16]. The key ...
Chapter 16
Figure 1 LASSO and nonconvex penalties: both SCAD and MCP do not penalize th...
Chapter 17
Figure 1 Hierarchy‐preserving solution paths by RAMP. (a) Strong hierarchy; ...
Chapter 18
Figure 1 Marginal prior of for different choices of.
Figure 2 Estimated autocorrelations for for the three algorithms. Approxim...
Figure 3 Trace plots (with true value indicated) and density estimates for o...
Figure 4 (a) Plots for and (in dashed gray and dashed black, respectiv...
Figure 5 Plot of as a function of, where varies between and 1.
Figure 6 The posterior mean of in a normal means problem: the ‐axis and
Chapter 20
Figure 1 ACS 2017 state estimates of the number of households (millions).
Figure 2 ACS 2017 state estimates of the number of households (millions). A ...
Figure 3 ACS 2017 median household income (USD) with 95% confidence interval...
Figure 4 Log 10 US ACS 2017 state estimates of the number of households (per...
Figure 5 ACS 2017 state estimates of the number of households (millions), wi...
Figure 6 2017 ACS household median income (USD) estimates with 95% confidenc...
Figure 7 Sloppy plot of 2017 ACS household median income (USD) estimates.
Figure 8 Sloppy plot of 2017 ACS household median income (USD) estimates wit...
Figure 9 ACS 2017 state estimates of the number of households (millions).
Figure 10 ACS 2017 state estimates of the number of households (millions). T...
Figure 11 2017 ACS household median income (USD) estimates with confidence i...
Chapter 21
Figure 1 A subset of the graphical annotations used to show properties of a ...
Figure 2 The process of generating a quantile dotplot from a log‐normal dist...
Figure 3 Illustration of HOPs compared to error bars from the same distribut...
Figure 4 Example Cone of Uncertainty produced by the National Hurricane Cent...
Figure 5 (a) An example of an ensemble hurricane path display that utilizes ...
Chapter 22
Figure 1 Classic dataflow visualization architecture.
Figure 2 Client–server visualization architecture.
Figure 3 (a) Piecewise linear confidence intervals and (b) bootstrapped regr...
Figure 4 Dot plot and histogram.
Figure 5 2D binning of 100 000 points.
Figure 6 2D binning of thousands of clustered points.
Figure 7 Massive data scatterplot matrix by Dan Carr [9].
Figure 8 nD aggregator illustrated with 2D example.
Figure 9 (a) Parallel coordinate plots of all columns and (b) aggregated col...
Figure 10 Code snippets for computing statistics on aggregated data sources....
Figure 11 Box plots of 100 000 Gaussians.
Figure 12 Lensing a scatterplot matrix.
Figure 13 Sorted and scrolling parallel coordinates [27].
Chapter 23
Figure 1 Parallel coordinate plot of the Pima data, colored by the diabetes ...
Figure 2 Heatmap of the LDA scores for measuring group separation for one an...
Figure 3 PD/ICE plots for predictor smoke, from two fits to the FEV data. Ea...
Figure 4 Condvis2 screenshot for a linear model and random forest fit to the...
Figure 5 Condvis2 screenshot for a linear model and random forest fit to the...
Figure 6 Condvis2 section plots for glucose and age from a BART (dashed line...
Figure 7 Condvis2 section plots for glucose and age showing classification b...
Figure 8 Condvis2 section plots for mixed effects models and random forest f...
Figure 9 Condvis2 section plots of two mixed effects models and a fixed effe...
Chapter 24
Figure 1 Functional data: the hip (a) and knee (b) angles of each of the 39 ...
Figure 2 The functional boxplots for the hip and knee angles of each of the ...
Figure 3 The bivariate and marginal MS plots for the hip and knee angles of ...
Figure 4 The two‐stage functional boxplots for the hip (a) and knee (b) angl...
Figure 5 The trajectory functional boxplot (a) and the MSBD–WO plot (b) for ...
Chapter 26
Figure 1 Minimizing of Equation (3) via coordinate descent starting from t...
Chapter 29
Figure 1 Simple computer surrogate model example where the response, , is m...
Figure 2 Example local designs under MSPE and ALC criteria. Numbers plotte...
Figure 3 LAGP‐calculated predictive mean on “Herbie's Tooth” data. Actually,...
Figure 4 Time versus accuracy comparison on SARCOS data.
Chapter 31
Figure 1 The organization and connections of concepts in this chapter.
Chapter 32
Figure 1 Synchronous versus asynchronous parallel computing with shared memo...
Edited by
Walter W. Piegorsch, University of Arizona
Richard A. Levine, San Diego State University
Hao Helen Zhang, University of Arizona
Thomas C. M. Lee, University of California‐Davis
This edition first published 2022
© 2022 John Wiley & Sons, Ltd.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.
The right of Walter W. Piegorsch, Richard A. Levine, Hao Helen Zhang, Thomas C. M. Lee to be identified as the author(s) of the editorial material in this work has been asserted in accordance with law.
Registered Office(s)
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK
Editorial Office
9600 Garsington Road, Oxford, OX4 2DQ, UK
For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.
Wiley also publishes its books in a variety of electronic formats and by print‐on‐demand. Some content that appears in standard print versions of this book may not be available in other formats.
Limit of Liability/Disclaimer of Warranty
The contents of this work are intended to further general scientific research, understanding, and discussion only and are not intended and should not be relied upon as recommending or promoting scientific method, diagnosis, or treatment by physicians for any particular patient. In view of ongoing research, equipment modifications, changes in governmental regulations, and the constant flow of information relating to the use of medicines, equipment, and devices, the reader is urged to review and evaluate the information provided in the package insert or instructions for each medicine, equipment, or device for, among other things, any changes in the instructions or indication of usage and for added warnings and precautions. While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
Library of Congress Cataloging‐in‐Publication Data
ISBN 9781119561071 (hardback)
Cover Design: Wiley
Cover Image: © goja1/Shutterstock
Ayodele Adebiyi
Landmark University
Omu‐Aran, Kwara
Nigeria
Anirban Bhattacharya
Texas A&M University
College Station, TX
USA
Peter Calhoun
Jaeb Center for Health Research
Tampa, FL
USA
Wu Changye
Université Paris Dauphine PSL
Paris
France
Xueying Chen
Novartis Pharmaceuticals Corp.
East Hanover, NJ
USA
Jerry Q. Cheng
New York Institute of Technology
New York, NY
USA
Hugh Chipman
Acadia University
Wolfville, Nova Scotia
Canada
Olawande Daramola
Cape Peninsula University of Technology
Cape Town
South Africa
Katarina Domijan
Maynooth University
Maynooth
Ireland
Víctor Elvira
School of Mathematics
University of Edinburgh, Edinburgh
UK
Juanjuan Fan
Department of Mathematics and Statistics
San Diego State University
San Diego, CA
USA
James M. Flegal
University of California
Riverside, CA
USA
Marc G. Genton
King Abdullah University of Science and Technology
Thuwal
Saudi Arabia
Edward George
The Wharton School
University of Pennsylvania
Philadelphia, PA
USA
Robert B. Gramacy
Virginia Polytechnic Institute and State University
Blacksburg, VA
USA
Richard Hahn
The School of Mathematical and Statistical Sciences
Arizona State University
Tempe, AZ
USA
Botao Hao
DeepMind
London
UK
Andrew J. Holbrook
University of California
Los Angeles, CA
USA
Mingyi Hong
University of Minnesota
Minneapolis, MN
USA
Cho‐Jui Hsieh
University of California
Los Angeles, CA
USA
Jessica Hullman
Northwestern University
Evanston, IL
USA
David R. Hunter
Penn State University
State College, PA
USA
Catherine B. Hurley
Maynooth University
Maynooth
Ireland
Xiang Ji
Tulane University
New Orleans, LA
USA
Adam M. Johansen
University of Warwick
Coventry
UK
James E. Johndrow
University of Pennsylvania
Philadelphia, PA
USA
Galin L. Jones
University of Minnesota
Twin‐Cities Minneapolis, MN
USA
Seung Jun Shin
Korea University
Seoul
South Korea
Matthew Kay
Northwestern University
Evanston, IL
USA
Alexander D. Knudson
The University of Nevada
Reno, NV
USA
Taiwo Kolajo
Federal University Lokoja
Lokoja
Nigeria
and
Covenant University
Ota
Nigeria
Alfonso Landeros
University of California
Los Angeles, CA
USA
Kenneth Lange
University of California
Los Angeles, CA
USA
Thomas C.M. Lee
University of California at Davis
Davis, CA
USA
Richard A. Levine
Department of Mathematics and Statistics
San Diego State University
San Diego, CA
USA
Hongzhe Li
University of Pennsylvania
Philadelphia, PA
USA
Jia Li
The Pennsylvania State University
University Park, PA
USA
Lexin Li
University of California
Berkeley, CA
USA
Yao Li
University of North Carolina at Chapel Hill
Chapel Hill, NC
USA
Yufeng Liu
University of North Carolina at Chapel Hill
Chapel Hill, NC
USA
Rong Ma
University of Pennsylvania
Philadelphia, PA
USA
Shiqian Ma
University of California
Davis, CA
USA
Luca Martino
Universidad Rey Juan Carlos de Madrid
Madrid
Spain
Robert McCulloch
The School of Mathematical and Statistical Sciences
Arizona State University
Tempe, AZ
USA
Weibin Mo
University of North Carolina at Chapel Hill
Chapel Hill, NC
USA
Edward Mulrow
NORC at the University of Chicago
Chicago, IL
USA
Akihiko Nishimura
Johns Hopkins University
Baltimore, MD
USA
Lace Padilla
University of California
Merced, CA
USA
Vincent A. Pisztora
The Pennsylvania State University
University Park, PA
USA
Matthew Pratola
The Ohio State University
Columbus, OH
USA
Christian P. Robert
Université Paris Dauphine PSL
Paris
France
and
University of Warwick
Coventry
UK
Alfred G. Schissler
The University of Nevada
Reno, NV
USA
Rodney Sparapani
Institute for Health and Equity
Medical College of Wisconsin
Milwaukee, WI
USA
Kelly M. Spoon
Computational Science Research Center
San Diego State University
San Diego, CA
USA
Xiaogang Su
Department of Mathematical Sciences
University of Texas
El Paso, TX
USA
Marc A. Suchard
University of California
Los Angeles, CA
USA
Ying Sun
King Abdullah University of Science and Technology
Thuwal
Saudi Arabia
Nola du Toit
NORC at the University of Chicago
Chicago, IL
USA
Dootika Vats
Indian Institute of Technology Kanpur
Kanpur
India
Matti Vihola
University of Jyväskylä
Jyväskylä
Finland
Justin Wang
University of California at Davis
Davis, CA
USA
Will Wei Sun
Purdue University
West Lafayette, IN
USA
Leland Wilkinson
H2O.ai, Mountain View
California
USA
and
University of Illinois at Chicago
Chicago, IL
USA
Joong‐Ho Won
Seoul National University
Seoul
South Korea
Yichao Wu
University of Illinois at Chicago
Chicago, IL
USA
Min‐ge Xie
Rutgers University
Piscataway, NJ
USA
Ming Yan
Michigan State University
East Lansing, MI
USA
Yuling Yao
Columbia University
New York, NY
USA
and
Center for Computational Mathematics
Flatiron Institute
New York, NY
USA
Chun Yip Yau
Chinese University of Hong Kong
Shatin
Hong Kong
Hao H. Zhang
University of Arizona
Tucson, AZ
USA
Hua Zhou
University of California
Los Angeles, CA
USA
Computational statistics is a core area of modern statistical science and its connections to data science represent an ever‐growing area of study. One of its important features is that the underlying technology changes quite rapidly, riding on the back of advances in computer hardware and statistical software. In this compendium we present a series of expositions that explore the intermediate and advanced concepts, theories, techniques, and practices that act to expand this rapidly evolving field. We hope that scholars and investigators will use the presentations to inform themselves on how modern computational and statistical technologies are applied, and also to build springboards that can develop their further research. Readers will require knowledge of fundamental statistical methods and, depending on the topic of interest they peruse, any advanced statistical aspects necessary to understand and conduct the technical computing procedures.
The presentation begins with a thoughtful introduction on how we should view Computational Statistics & Data Science in the 21st Century (Holbrook, et al.), followed by a careful tour of contemporary Statistical Software (Schissler, et al.). Topics that follow address a variety of issues, collected into broad topic areas such as Simulation‐based Methods, Statistical Learning, Quantitative Visualization, High‐performance Computing, High‐dimensional Data Analysis, and Numerical Approximations & Optimization.
Internet access to all of the articles presented here is available via the online collection Wiley StatsRef: Statistics Reference Online (Davidian, et al., 2014–2021); see https://onlinelibrary.wiley.com/doi/book/10.1002/9781118445112.
From Deep Learning (Li, et al.) to Asynchronous Parallel Computing (Yan), this collection provides a glimpse into how computational statistics may progress in this age of big data and transdisciplinary data science. It is our fervent hope that readers will benefit from it.
We wish to thank the fine efforts of the Wiley editorial staff, including Kimberly Monroe‐Hill, Paul Sayer, Michael New, Vignesh Lakshmikanthan, Aruna Pragasam, Viktoria Hartl‐Vida, Alison Oliver, and Layla Harden in helping bring this project to fruition.
Tucson, Arizona; San Diego, California; Tucson, Arizona; Davis, California
Walter W. Piegorsch
Richard A. Levine
Hao Helen Zhang
Thomas C. M. Lee
Davidian, M., Kenett, R.S., Longford, N.T., Molenberghs, G., Piegorsch, W.W., and Ruggeri, F., eds. (2014–2021). Wiley StatsRef: Statistics Reference Online. Chichester: John Wiley & Sons. doi:10.1002/9781118445112.
Andrew J. Holbrook1, Akihiko Nishimura2, Xiang Ji3, and Marc A. Suchard1
1University of California, Los Angeles, CA, USA
2Johns Hopkins University, Baltimore, MD, USA
3Tulane University, New Orleans, LA, USA
We are in the midst of the data science revolution. In October 2012, the Harvard Business Review famously declared data scientist the sexiest job of the twenty‐first century [1]. By September 2019, Google searches for the term “data science” had multiplied over sevenfold [2], one multiplicative increase for each intervening year. In the United States between the years 2000 and 2018, the number of bachelor's degrees awarded in either statistics or biostatistics increased over 10‐fold (382–3964), and the number of doctoral degrees almost tripled (249–688) [3]. As of 2020, seemingly every major university has established or is establishing its own data science institute, center, or initiative.
Data science [4, 5] combines multiple preexisting disciplines (e.g., statistics, machine learning, and computer science) with a redirected focus on creating, understanding, and systematizing workflows that turn real‐world data into actionable conclusions. The ubiquity of data in all economic sectors and scientific disciplines makes data science eminently relevant to cohorts of researchers for whom the discipline of statistics was previously closed off and esoteric. Data science's emphasis on practical application only enhances the importance of computational statistics, the interface between statistics and computer science primarily concerned with the development of algorithms producing either statistical inference1 or predictions. Since both of these products comprise essential tasks in any data scientific workflow, we believe that the pan‐disciplinary nature of data science only increases the number of opportunities for computational statistics to evolve by taking on new applications2 and serving the needs of new groups of researchers.
This is the natural role for a discipline that has increased the breadth of statistical application from the beginning. First put forward by R.A. Fisher in 1936 [6, 7], the permutation test allows the scientist (who owns a computer) to test hypotheses about a broader swath of functionals of a target population while making fewer statistical assumptions [8]. With a computer, the scientist uses the bootstrap [9, 10] to obtain confidence intervals for population functionals and parameters of models too complex for analytic methods. Newton–Raphson optimization and the Fisher scoring algorithm facilitate generalized linear regression for binary, count, and categorical outcomes. More recently, Markov chain Monte Carlo (MCMC) has made Bayesian inference practical for massive, hierarchical, and highly structured models that are useful for the analysis of a significantly wider range of scientific phenomena.
While computational statistics increases the diversity of statistical applications historically, certain central difficulties exist and will continue to remain for the rest of the twenty‐first century. In Section 2, we present the first class of Core Challenges, or challenges that are easily quantifiable for generic tasks. Core Challenge 1 is Big N, or statistical inference when the number N of observations or data points is large; Core Challenge 2 is Big P, or statistical inference when the model parameter count P is large; and Core Challenge 3 is Big M, or statistical inference when the model's objective or density function is multimodal (having many modes M)3. When large, each of these quantities brings its own unique computational difficulty. Since well over 2.5 exabytes (or 2.5 × 10^18 bytes) of data come into existence each day [15], we are confident that Core Challenge 1 will survive well into the twenty‐second century.
But Core Challenges 2 and 3 will also endure: data complexity often increases with size, and researchers strive to understand increasingly complex phenomena. Because many examples of big data become “big” by combining heterogeneous sources, big data often necessitate big models. With the help of two recent examples, Section 3 illustrates how computational statisticians make headway at the intersection of big data and big models with model‐specific advances. In Section 3.1, we present recent work in Bayesian inference for big N and big P regression. Beyond the simplified regression setting, data often come with structures (e.g., spatial, temporal, and network), and correct inference must take these structures into account. For this reason, we present novel computational methods for a highly structured and hierarchical model for the analysis of multistructured and epidemiological data in Section 3.2.
The growth of model complexity leads to new inferential challenges. While we define Core Challenges 1–3 in terms of generic target distributions or objective functions, Core Challenge 4 arises from inherent difficulties in treating complex models generically. Core Challenge 4 (Section 4.1) describes the difficulties and trade‐offs that must be overcome to create fast, flexible, and friendly “algo‐ware”. This Core Challenge requires the development of statistical algorithms that maintain efficiency despite model structure and, thus, apply to a wider swath of target distributions or objective functions “out of the box”. Such generic algorithms typically require little cleverness or creativity to implement, limiting the amount of time data scientists must spend worrying about computational details. Moreover, they aid the development of flexible statistical software that adapts to complex model structure in a way that users easily understand. But it is not enough that software be flexible and easy to use: mapping computations to computer hardware for optimal implementations remains difficult. In Section 4.2, we argue that Core Challenge 5, effective use of computational resources such as central processing units (CPU), graphics processing units (GPU), and quantum computers, will become increasingly central to the work of the computational statistician as data grow in magnitude.
Before providing two recent examples of twenty‐first century computational statistics (Section 3), we present three easily quantified Core Challenges within computational statistics that we believe will always exist: big N, or inference from many observations; big P, or inference with high‐dimensional models; and big M, or inference with nonconvex objective – or multimodal density – functions. In twenty‐first century computational statistics, these challenges often co‐occur, but we consider them separately in this section.
Having a large number of observations makes different computational methods difficult in different ways. In a worst‐case scenario, the exact permutation test requires the production of every possible permuted dataset. Cheaper alternatives, resampling methods such as the Monte Carlo permutation test or the bootstrap, may require anywhere from thousands to hundreds of thousands of randomly produced datasets [8, 10]. When, say, population means are of interest, each Monte Carlo iteration requires summations involving N expensive memory accesses. Another example of a computationally intensive model is Gaussian process regression [16, 17]; it is a popular nonparametric approach, but the exact method for fitting the model and predicting future values requires matrix inversions that scale O(N^3). As the rest of the calculations require relatively negligible computational effort, we say that matrix inversions represent the computational bottleneck for Gaussian process regression.
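To make the resampling cost concrete, here is a minimal sketch of a Monte Carlo permutation test for a two-sample difference in means. The function name, statistic, and synthetic data are illustrative choices, not anything prescribed in the text; each of the n_perm iterations touches every observation, which is exactly the repeated work that dominates when N is large.

```python
import numpy as np

def perm_test_diff_means(x, y, n_perm=10_000, rng=None):
    """Monte Carlo permutation test for a difference in means (illustrative sketch).

    The p-value is the fraction of randomly relabeled datasets whose statistic
    is at least as extreme as the observed one.
    """
    rng = np.random.default_rng() if rng is None else rng
    pooled = np.concatenate([x, y])
    n_x = len(x)
    observed = x.mean() - y.mean()
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)                 # one randomly relabeled dataset
        stat = perm[:n_x].mean() - perm[n_x:].mean()
        exceed += abs(stat) >= abs(observed)
    return (exceed + 1) / (n_perm + 1)                 # add-one correction

# Example with small synthetic samples
rng = np.random.default_rng(1)
print(perm_test_diff_means(rng.normal(0, 1, 50), rng.normal(0.3, 1, 50), rng=rng))
```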
To speed up a computationally intensive method, one only needs to speed up the method's computational bottleneck. We are interested in performing Bayesian inference [18] based on a large vector of observations X = (x_1, ..., x_N). We specify our model for the data with a likelihood function π(X | θ) = ∏_{n=1}^{N} π(x_n | θ) and use a prior distribution with density function π(θ) to characterize our belief about the value of the P‐dimensional parameter vector θ a priori. The target of Bayesian inference is the posterior distribution of θ conditioned on X:

$$\pi(\theta \mid \mathbf{X}) = \frac{\pi(\mathbf{X} \mid \theta)\,\pi(\theta)}{\int \pi(\mathbf{X} \mid \theta)\,\pi(\theta)\,\mathrm{d}\theta} \qquad (1)$$
The denominator's multidimensional integral quickly becomes impractical as P grows large, so we choose to use the Metropolis–Hastings (M–H) algorithm to generate a Markov chain with stationary distribution π(θ | X) [19, 20]. We begin at an arbitrary position θ^(0) and, for each iteration m, randomly generate the proposal state θ* from the transition distribution with density q(θ* | θ^(m)). We then accept proposal state θ* with probability

$$a = \min\left(1,\ \frac{\pi(\theta^{*} \mid \mathbf{X})\, q(\theta^{(m)} \mid \theta^{*})}{\pi(\theta^{(m)} \mid \mathbf{X})\, q(\theta^{*} \mid \theta^{(m)})}\right) \qquad (2)$$
The ratio on the right no longer depends on the denominator in Equation (1), but one must still compute the likelihood and its N terms π(x_n | θ*).
It is for this reason that likelihood evaluations are often the computational bottleneck for Bayesian inference. In the best case, these evaluations are O(N), but there are many situations in which they scale O(N^2) [21, 22] or worse. Indeed, when P is large, it is often advantageous to use more advanced MCMC algorithms that use the gradient of the log‐posterior to generate better proposals. In this situation, the log‐likelihood gradient may also become a computational bottleneck [21].
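As an illustration of the accept step, the following sketch implements a random-walk Metropolis sampler for a generic log-posterior known only up to a constant; because the Gaussian proposal is symmetric, the q terms in Equation (2) cancel. This is a toy sketch with illustrative names, not the chapter's implementation.

```python
import numpy as np

def metropolis_hastings(log_post, theta0, n_iter=5_000, step=0.5, rng=None):
    """Random-walk Metropolis sketch for a generic unnormalized log-posterior.

    `log_post` returns log pi(theta | X) up to a constant, so the intractable
    normalizing integral in Equation (1) never appears.
    """
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta0, dtype=float)
    lp = log_post(theta)                                # likelihood evaluation: the bottleneck
    chain = np.empty((n_iter, theta.size))
    for m in range(n_iter):
        proposal = theta + step * rng.standard_normal(theta.size)
        lp_prop = log_post(proposal)
        if np.log(rng.uniform()) < lp_prop - lp:        # accept with the probability in Equation (2)
            theta, lp = proposal, lp_prop
        chain[m] = theta
    return chain

# Example: standard bivariate Gaussian target
chain = metropolis_hastings(lambda t: -0.5 * np.sum(t**2), np.zeros(2))
print(chain.mean(axis=0))
```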
One of the simplest models for big P problems is ridge regression [23], but computing its solution can become expensive even in this classical setting. Ridge regression estimates the coefficient θ by minimizing the distance between the observed and predicted values y and Xθ, along with a weighted square norm of θ:

$$\hat{\theta}_{\mathrm{ridge}} = \underset{\theta}{\operatorname{arg\,min}} \left\{ \lVert \mathbf{y} - \mathbf{X}\theta \rVert_2^2 + \lambda \lVert \theta \rVert_2^2 \right\} = \left(\mathbf{X}^{\top}\mathbf{X} + \lambda \mathbf{I}\right)^{-1} \mathbf{X}^{\top}\mathbf{y}$$
For illustrative purposes, we consider the following direct method for computing the ridge estimator.4 We can first multiply the design matrix X by its transpose at the cost of O(N P^2) and subsequently invert the resulting P × P matrix at the cost of O(P^3). The total complexity of O(N P^2 + P^3) shows that (i) a large number of parameters is often sufficient for making even the simplest of tasks infeasible and (ii) a moderate number of parameters can render a task impractical when there are a large number of observations. These two insights extend to more complicated models: the same complexity analysis holds for the fitting of generalized linear models (GLMs) as described in McCullagh and Nelder [12].
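A short numerical sketch of the direct ridge solve makes the complexity visible; numpy is assumed, and the penalty value and synthetic data are arbitrary illustrations.

```python
import numpy as np

def ridge_direct(X, y, lam):
    """Direct ridge solve: (X^T X + lam I)^{-1} X^T y.

    Forming X^T X costs O(N P^2) and solving the P x P system costs O(P^3),
    matching the complexity discussed in the text.
    """
    N, P = X.shape
    gram = X.T @ X                                        # O(N P^2)
    rhs = X.T @ y                                         # O(N P)
    return np.linalg.solve(gram + lam * np.eye(P), rhs)   # O(P^3); avoids an explicit inverse

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 50))
theta_true = np.zeros(50)
theta_true[:5] = 2.0
y = X @ theta_true + rng.standard_normal(1000)
print(ridge_direct(X, y, lam=1.0)[:5])
```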
In the context of Bayesian inference, the length P of the vector θ dictates the dimension of the MCMC state space. For the M‐H algorithm (Section 2.1) with a P‐dimensional Gaussian target and proposal, Gelman et al. [25] show that the proposal distribution's covariance should be scaled by a factor inversely proportional to P. Hence, as the dimension of the state space grows, it behooves one to propose states that are closer to the current state of the Markov chain, and one must greatly increase the number of MCMC iterations. At the same time, an increasing P often slows down rate‐limiting likelihood calculations (Section 2.1). Taken together, one must generate many more, much slower MCMC iterations. The wide applicability of latent variable models [26] (Sections 3.1 and 3.2), for which each observation has its own parameter set (e.g., each x_n having its own θ_n), means M‐H simply does not work for a huge class of models popular with practitioners.
For these reasons, Hamiltonian Monte Carlo (HMC) [27] has become a popular algorithm for fitting Bayesian models with large numbers of parameters. Like M‐H, HMC uses an accept step (Equation 2). Unlike M‐H, HMC takes advantage of additional information about the target distribution in the form of the log‐posterior gradient. HMC works by doubling the state space dimension with an auxiliary Gaussian “momentum” variable p independent of the “position” variable θ. The constructed Hamiltonian system has energy function given by the negative logarithm of the joint distribution

$$H(\theta, p) = -\log\{\pi(\theta \mid \mathbf{X})\,\pi(p)\}$$
and we produce proposals by simulating the system according to Hamilton's equations

$$\frac{\mathrm{d}\theta}{\mathrm{d}t} = \frac{\partial H}{\partial p}, \qquad \frac{\mathrm{d}p}{\mathrm{d}t} = -\frac{\partial H}{\partial \theta} = \nabla_{\theta} \log \pi(\theta \mid \mathbf{X})$$
Thus, the momentum of the system moves in the direction of the steepest ascent for the log‐posterior, forming an analogy with first‐order optimization. The cost is repeated gradient evaluations that may comprise a new computational bottleneck, but the result is effective MCMC for tens of thousands of parameters [21, 28]. The success of HMC has inspired research into other methods leveraging gradient information to generate better MCMC proposals when P is large [29].
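The sketch below shows one HMC proposal using leapfrog integration of Hamilton's equations with an identity mass matrix, a simplifying assumption; the step size and path length are illustrative and would be tuned in practice.

```python
import numpy as np

def hmc_step(log_post, grad_log_post, theta, step=0.1, n_leapfrog=20, rng=None):
    """One HMC proposal with an identity mass matrix (illustrative sketch).

    Leapfrog integration of Hamilton's equations followed by a Metropolis
    accept step on the total energy H(theta, p).
    """
    rng = np.random.default_rng() if rng is None else rng
    p = rng.standard_normal(theta.size)                  # Gaussian momentum
    theta_new, p_new = theta.copy(), p.copy()
    p_new += 0.5 * step * grad_log_post(theta_new)       # half step for momentum
    for _ in range(n_leapfrog - 1):
        theta_new += step * p_new                        # full step for position
        p_new += step * grad_log_post(theta_new)         # full step for momentum
    theta_new += step * p_new
    p_new += 0.5 * step * grad_log_post(theta_new)
    h_old = -log_post(theta) + 0.5 * p @ p
    h_new = -log_post(theta_new) + 0.5 * p_new @ p_new
    if np.log(rng.uniform()) < h_old - h_new:            # accept with prob min(1, exp(-dH))
        return theta_new
    return theta

# Example: one step on a standard Gaussian target
theta = hmc_step(lambda t: -0.5 * t @ t, lambda t: -t, np.ones(10))
print(theta[:3])
```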
Global optimization, or the problem of finding the minimum of a function with arbitrarily many local minima, is NP‐complete in general [30], meaning – in layman's terms – it is impossibly hard. In the absence of a tractable theory, by which one might prove one's global optimization procedure works, brute‐force grid and random searches and heuristic methods such as particle swarm optimization [31] and genetic algorithms [32] have been popular. Due to the overwhelming difficulty of global optimization, a large portion of the optimization literature has focused on the particularly well‐behaved class of convex functions [33, 34], which do not admit multiple local minima. Since Fisher introduced his “maximum likelihood” in 1922 [35], statisticians have thought in terms of maximization, but convexity theory still applies by a trivial negation of the objective function. Nonetheless, most statisticians safely ignored concavity during the twentieth century: exponential family likelihoods are log‐concave, so Newton–Raphson and Fisher scoring are guaranteed optimality in the context of GLMs [12, 34].
Nearing the end of the twentieth century, multimodality and nonconvexity became more important for statisticians considering high‐dimensional regression, that is, regression with many covariates (big P). Here, for purposes of interpretability and variance reduction, one would like to induce sparsity on the weights vector θ by performing best subset selection [36, 37]:

$$\hat{\theta} = \underset{\theta}{\operatorname{arg\,min}} \left\{ \lVert \mathbf{y} - \mathbf{X}\theta \rVert_2^2 + \lambda \lVert \theta \rVert_0 \right\} \qquad (3)$$
where λ > 0, and ‖θ‖₀ denotes the ℓ0‐norm, that is, the number of nonzero elements of θ. Because best subset selection requires an immensely difficult nonconvex optimization, Tibshirani [38] famously replaces the ℓ0‐norm with the ℓ1‐norm, thereby providing sparsity, while nonetheless maintaining convexity.
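To contrast the two penalties computationally, the following sketch counts nonzeros (the ℓ0 penalty) and runs a tiny coordinate-descent lasso built on the soft-thresholding operator, the proximal map of the ℓ1-norm. It is a pedagogical toy under simplifying assumptions, not a reference solver.

```python
import numpy as np

def l0_norm(theta):
    """Number of nonzero coefficients: the penalty in best subset selection."""
    return np.count_nonzero(theta)

def soft_threshold(z, t):
    """Proximal operator of the l1-norm; the workhorse of lasso solvers."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_sweeps=100):
    """Tiny coordinate-descent lasso for (1/2N)||y - X theta||^2 + lam ||theta||_1."""
    N, P = X.shape
    col_ss = (X ** 2).sum(axis=0) / N
    theta = np.zeros(P)
    for _ in range(n_sweeps):
        for j in range(P):
            r = y - X @ theta + X[:, j] * theta[j]        # partial residual excluding feature j
            theta[j] = soft_threshold(X[:, j] @ r / N, lam) / col_ss[j]
    return theta

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 20))
y = X[:, :3] @ np.array([3.0, -2.0, 1.5]) + rng.standard_normal(200)
theta_hat = lasso_cd(X, y, lam=0.1)
print(l0_norm(theta_hat), np.round(theta_hat[:5], 2))     # sparse estimate, first few coefficients
```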
Historically, Bayesians have paid much less attention to convexity than have optimization researchers. This is most likely because the basic theory [13] of MCMC does not require such restrictions: even if a target distribution has one million modes, the well‐constructed Markov chain explores them all in the limit. Despite these theoretical guarantees, a small literature has developed to tackle multimodal Bayesian inference [39–42] because multimodal target distributions do present a challenge in practice. In analogy with Equation (3), Bayesians seek to induce sparsity by specifying priors such as the spike‐and‐slab [43–45], for example,

$$\theta_p \mid \gamma_p \sim \gamma_p\, \mathcal{N}(0, \sigma_\theta^2) + (1 - \gamma_p)\, \delta_0, \qquad \gamma_p \sim \mathrm{Bernoulli}(q), \qquad p = 1, \ldots, P$$
As with the best subset selection objective function, the spike‐and‐slab target distribution becomes heavily multimodal as P grows and the support of the discrete distribution of γ = (γ_1, ..., γ_P) grows to 2^P potential configurations.
In the following section, we present an alternative Bayesian sparse regression approach that mitigates the combinatorial problem along with a state‐of‐the‐art computational technique that scales well both in N and P.
These challenges will remain throughout the twenty‐first century, but it is possible to make significant advances for specific statistical tasks or classes of models. Section 3.1 considers Bayesian sparse regression based on continuous shrinkage priors, designed to alleviate the heavy multimodality (big M) of the more traditional spike‐and‐slab approach. This model presents a major computational challenge as N and P grow, but a recent computational advance makes the posterior inference feasible for many modern large‐scale applications.
And because of the rise of data science, there are increasing opportunities for computational statistics to grow by enabling and extending statistical inference for scientific applications previously outside of mainstream statistics. Here, the science may dictate the development of structured models with complexity possibly growing in N and P. Section 3.2 presents a method for fast phylogenetic inference, where the primary structure of interest is a “family tree” describing a biological evolutionary history.
With the goal of identifying a small subset of relevant features among a large number of potential candidates, sparse regression techniques have long featured in a range of statistical and data science applications [46]. Traditionally, such techniques were commonly applied in the “” setting, and correspondingly computational algorithms focused on this situation [47], especially within the Bayesian literature [48].
Due to a growing number of initiatives for large‐scale data collections and new types of scientific inquiries made possible by emerging technologies, however, increasingly common are datasets that are “big N” and “big P” at the same time. For example, modern observational studies using health‐care databases routinely involve large numbers of patients and clinical covariates [49]. The UK Biobank provides brain imaging data on a large cohort of patients, with P depending on the scientific question of interest [50]. Single‐cell RNA sequencing can generate datasets with N (the number of cells) in the millions and P (the number of genes) in the tens of thousands, with the trend indicating further growth in data size to come [51].
Bayesian sparse regression, despite its desirable theoretical properties and flexibility to serve as a building block for richer statistical models, has always been relatively computationally intensive even before the advent of “big N and big P” data [45, 52, 53]. A major source of its computational burden is severe posterior multimodality (big M) induced by the discrete binary nature of spike‐and‐slab priors (Section 2.3). The class of global–local continuous shrinkage priors is a more recent alternative that shrinks the θ_p's in a more continuous manner, thereby alleviating (if not eliminating) the multimodality issue [54, 55]. This class of prior is represented as a scale mixture of Gaussians:

$$\theta_p \mid \lambda_p, \tau \sim \mathcal{N}(0, \tau^2 \lambda_p^2), \qquad \lambda_p \sim \pi_{\mathrm{local}}(\cdot), \qquad \tau \sim \pi_{\mathrm{global}}(\cdot)$$
The idea is that the global scale parameter τ would shrink most θ_p's toward zero, while the local scale parameters λ_p, with their heavy‐tailed prior π_local(·), allow a small number of λ_p's and hence θ_p's to be estimated away from zero. While motivated by two different conceptual frameworks, the spike‐and‐slab can be viewed as a subset of global–local priors in which π_local(·) is chosen as a mixture of delta masses placed at λ_p = 0 and λ_p = ∞. Continuous shrinkage mitigates the multimodality of spike‐and‐slab by smoothly bridging small and large values of λ_p.
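The following sketch draws coefficients from one member of this class, a horseshoe-style prior with half-Cauchy local and global scales; the specific choices of π_local and π_global are assumptions made for illustration, not the only options covered by the text.

```python
import numpy as np

def sample_global_local_prior(P, rng=None):
    """Draw coefficients from a global-local scale mixture of Gaussians.

    Horseshoe-style illustration: heavy-tailed half-Cauchy local scales
    lambda_p, a half-Cauchy global scale tau, and
    theta_p | lambda_p, tau ~ N(0, tau^2 * lambda_p^2).
    """
    rng = np.random.default_rng() if rng is None else rng
    tau = np.abs(rng.standard_cauchy())        # global scale shrinks everything
    lam = np.abs(rng.standard_cauchy(P))       # heavy tails let a few lambda_p escape
    theta = rng.normal(0.0, tau * lam)         # conditional Gaussian draw
    return theta, tau, lam

theta, tau, lam = sample_global_local_prior(1000, rng=np.random.default_rng(2))
print(np.mean(np.abs(theta) < 0.01 * np.abs(theta).max()))  # most coefficients sit near zero
```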
On the other hand, the use of continuous shrinkage priors does not address the increasing computational burden from growing N and P in modern applications. Sparse regression posteriors under global–local priors are amenable to an effective Gibbs sampler, a popular class of MCMC we describe further in Section 4.1. Under the linear and logistic models, the computational bottleneck of this Gibbs sampler stems from the need for repeated updates of θ from its conditional distribution

$$\theta \mid \Omega, \lambda, \tau, \mathbf{y}, \mathbf{X} \sim \mathcal{N}\!\left(\Phi^{-1}\mathbf{X}^{\top}\Omega\,\mathbf{y},\ \Phi^{-1}\right), \qquad \Phi = \mathbf{X}^{\top}\Omega\,\mathbf{X} + \tau^{-2}\Lambda^{-2} \qquad (4)$$
where Ω is an additional diagonal‐matrix parameter and Λ = diag(λ_1, ..., λ_P).5 Sampling from this high‐dimensional Gaussian distribution requires O(N P^2 + P^3) operations with the standard approach [58]: O(N P^2) for computing the term XᵀΩX and O(P^3) for Cholesky factorization of Φ. While an alternative approach by Bhattacharya et al. [48] provides the complexity of O(N^2 P + N^3), the computational cost remains problematic in the big N and big P regime at O(min{N^2 P, N P^2}) after choosing the faster of the two.
The conjugate gradient (CG) sampler of Nishimura and Suchard [57], combined with their prior‐preconditioning technique, overcomes this seemingly inevitable growth of the computational cost. Their algorithm is based on a novel application of the CG method [59, 60], which belongs to a family of iterative methods in numerical linear algebra. Despite its first appearance in 1952, CG received little attention for the next few decades, only making its way into major software packages such as MATLAB in the 1990s [61]. With its ability to solve a large and structured linear system Φθ = b via a small number of matrix–vector multiplications of the form v → Φv, without ever explicitly inverting Φ, however, CG has since emerged as an essential and prototypical algorithm for modern scientific computing [62, 63].
Despite its earlier rise to prominence in other fields, CG did not find practical applications in Bayesian computation until rather recently [57, 64]. We can offer at least two explanations for this. First, being an algorithm for solving a deterministic linear system, it is not obvious how CG would be relevant to Monte Carlo simulation, such as sampling from the Gaussian distribution (4); ostensibly, such a task requires computing a “square root” L of the precision matrix Φ (e.g., Φ = LᵀL) so that Var(L⁻¹z) = Φ⁻¹ for z ∼ N(0, I). Secondly, unlike direct linear algebra methods, iterative methods such as CG have a variable computational cost that depends critically on the user's choice of a preconditioner and thus cannot be used as a “black‐box” algorithm.6 In particular, this novel application of CG to Bayesian computation is a reminder that other powerful ideas in other computationally intensive fields may remain untapped by the statistical computing community; knowledge transfers will likely be facilitated by having more researchers working at intersections of different fields.
Nishimura and Suchard [57] turn CG into a viable algorithm for Bayesian sparse regression problems by realizing that (i) we can obtain a Gaussian vector b ∼ N(XᵀΩy, Φ) by first generating η ∼ N(0, I_N) and δ ∼ N(0, I_P) and then setting b = XᵀΩy + XᵀΩ^{1/2}η + τ⁻¹Λ⁻¹δ, and (ii) subsequently solving Φθ = b yields a sample θ from the distribution (4). The authors then observe that the mechanism through which a shrinkage prior induces sparsity of the θ_p's also induces a tight clustering of eigenvalues in the prior‐preconditioned matrix τ²ΛΦΛ. This fact makes it possible for prior‐preconditioned CG to solve the system Φθ = b in K matrix–vector operations of the form v → Φv, where K roughly represents the number of significant θ_p's that are distinguishable from zero under the posterior. For Φ having a structure as in (4), Φv can be computed via matrix–vector multiplications of the form v → Xv and w → Xᵀw, so each v → Φv operation requires a fraction of the computational cost of directly computing Φ and then factorizing it.
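The sketch below illustrates the matrix-free idea under the notation of Equation (4): Φ is only ever touched through products v → Φv, and an (approximate) draw is produced by solving Φθ = b with scipy's CG routine. It omits the prior preconditioning that gives the authors' method its speed, so it should be read as a schematic under stated assumptions rather than as their algorithm.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def sample_gaussian_cg(X, y, omega, tau, lam, rng=None):
    """Approximate draw from N(Phi^{-1} X^T Omega y, Phi^{-1}) via CG,
    with Phi = X^T diag(omega) X + tau^{-2} diag(lam)^{-2}.
    """
    rng = np.random.default_rng() if rng is None else rng
    N, P = X.shape
    prior_prec = 1.0 / (tau * lam) ** 2                  # tau^{-2} Lambda^{-2} diagonal

    def matvec(v):                                       # v -> Phi v without ever forming Phi
        return X.T @ (omega * (X @ v)) + prior_prec * v

    Phi = LinearOperator((P, P), matvec=matvec)
    # b ~ N(X^T Omega y, Phi): combine two independent standard Gaussian vectors
    b = (X.T @ (omega * y)
         + X.T @ (np.sqrt(omega) * rng.standard_normal(N))
         + np.sqrt(prior_prec) * rng.standard_normal(P))
    theta, info = cg(Phi, b)                             # solve Phi theta = b iteratively
    return theta

rng = np.random.default_rng(3)
X = rng.standard_normal((500, 200))
y = X[:, :5] @ np.ones(5) + rng.standard_normal(500)
draw = sample_gaussian_cg(X, y, omega=np.ones(500), tau=1.0, lam=np.ones(200), rng=rng)
print(draw[:5])
```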
Prior‐preconditioned CG demonstrates an order of magnitude speedup in posterior computation when applied to a comparative effectiveness study of atrial fibrillation treatment involving a large number of patients and covariates [57]. Though unexplored in their work, the algorithm's heavy use of matrix–vector multiplications provides avenues for further acceleration. Technically, the algorithm's complexity may be characterized as O(N P) per CG iteration, for the matrix–vector multiplications by X and Xᵀ, but the theoretical complexity is only a part of the story. Matrix–vector multiplications are amenable to a variety of hardware optimizations, which in practice can make orders of magnitude difference in speed (Section 4.2). In fact, given how arduous manually optimizing computational bottlenecks can be, designing algorithms so as to take advantage of common routines (as those in Level 3 BLAS
