Computational Statistics in Data Science

Description

An essential roadmap to the application of computational statistics in contemporary data science.

In Computational Statistics in Data Science, a team of distinguished mathematicians and statisticians delivers an expert compilation of concepts, theories, techniques, and practices in computational statistics for readers who seek a single, standalone sourcebook on statistics in contemporary data science. The book contains multiple sections devoted to key, specific areas in computational statistics, offering modern and accessible presentations of up-to-date techniques. Computational Statistics in Data Science provides complimentary access to finalized entries in the Wiley StatsRef: Statistics Reference Online compendium. Readers will also find:

* A thorough introduction to computational statistics relevant and accessible to practitioners and researchers in a variety of data-intensive areas

* Comprehensive explorations of active topics in statistics, including big data, data stream processing, quantitative visualization, and deep learning

Perfect for researchers and scholars working in any field requiring intermediate and advanced computational statistics techniques, Computational Statistics in Data Science will also earn a place in the libraries of scholars researching and developing computational data-scientific technologies and statistical graphics.




Table of Contents

Cover

Title Page

Copyright

List of Contributors

Preface

Reference

Part I: Computational Statistics and Data Science

1 Computational Statistics and Data Science in the Twenty‐First Century

1 Introduction

2 Core Challenges 1–3

3 Model‐Specific Advances

4 Core Challenges 4 and 5

5 Rise of Data Science

Acknowledgments

Notes

References

2 Statistical Software

1 User Development Environments

2 Popular Statistical Software

3 Noteworthy Statistical Software and Related Tools

4 Promising and Emerging Statistical Software

5 The Future of Statistical Computing

6 Concluding Remarks

Acknowledgments

References

Further Reading

3 An Introduction to Deep Learning Methods

1 Introduction

2 Machine Learning: An Overview

3 Feedforward Neural Networks

4 Convolutional Neural Networks

5 Autoencoders

6 Recurrent Neural Networks

7 Conclusion

References

4 Streaming Data and Data Streams

1 Introduction

2 Data Stream Computing

3 Issues in Data Stream Mining

4 Streaming Data Tools and Technologies

5 Streaming Data Pre‐Processing: Concept and Implementation

6 Streaming Data Algorithms

7 Strategies for Processing Data Streams

8 Best Practices for Managing Data Streams

9 Conclusion and the Way Forward

References

Part II: Simulation‐Based Methods

5 Monte Carlo Simulation: Are We There Yet?

1 Introduction

2 Estimation

3 Sampling Distribution

4 Estimating

5 Stopping Rules

6 Workflow

7 Examples

References

6 Sequential Monte Carlo: Particle Filters and Beyond

1 Introduction

2 Sequential Importance Sampling and Resampling

3 SMC in Statistical Contexts

4 Selected Recent Developments

Acknowledgments

Note

References

7 Markov Chain Monte Carlo Methods, A Survey with Some Frequent Misunderstandings

1 Introduction

2 Monte Carlo Methods

3 Markov Chain Monte Carlo Methods

4 Approximate Bayesian Computation

5 Further Reading

Abbreviations and Acronyms

Notes

References

Note

8 Bayesian Inference with Adaptive Markov Chain Monte Carlo

1 Introduction

2 Random‐Walk Metropolis Algorithm

3 Adaptation of Random‐Walk Metropolis

4 Multimodal Targets with Parallel Tempering

5 Dynamic Models with Particle Filters

6 Discussion

Acknowledgments

Notes

References

9 Advances in Importance Sampling

1 Introduction and Problem Statement

2 Importance Sampling

3 Multiple Importance Sampling (MIS)

4 Adaptive Importance Sampling (AIS)

Acknowledgments

Notes

References

Part III: Statistical Learning

10 Supervised Learning

1 Introduction

2 Penalized Empirical Risk Minimization

3 Linear Regression

4 Classification

5 Extensions for Complex Data

6 Discussion

References

11 Unsupervised and Semisupervised Learning

1 Introduction

2 Unsupervised Learning

3 Semisupervised Learning

4 Conclusions

Acknowledgment

Notes

References

12 Random Forests

1 Introduction

2 Random Forest (RF)

3 Random Forest Extensions

4 Random Forests of Interaction Trees (RFIT)

5 Random Forest of Interaction Trees for Observational Studies

6 Discussion

References

13 Network Analysis

1 Introduction

2 Gaussian Graphical Models for Mixed Partial Compositional Data

3 Theoretical Properties

4 Graphical Model Selection

5 Analysis of a Microbiome–Metabolomics Data

6 Discussion

References

14 Tensors in Modern Statistical Learning

1 Introduction

2 Background

3 Tensor Supervised Learning

4 Tensor Unsupervised Learning

5 Tensor Reinforcement Learning

6 Tensor Deep Learning

Acknowledgments

References

15 Computational Approaches to Bayesian Additive Regression Trees

1 Introduction

2 Bayesian CART

3 Tree MCMC

4 The BART Model

5 BART Example: Boston Housing Values and Air Pollution

6 BART MCMC

7 BART Extensions

8 Conclusion

References

Part IV: High‐Dimensional Data Analysis

16 Penalized Regression

1 Introduction

2 Penalization for Smoothness

3 Penalization for Sparsity

4 Tuning Parameter Selection

References

17 Model Selection in High‐Dimensional Regression

1 Model Selection Problem

2 Model Selection in High‐Dimensional Linear Regression

3 Interaction‐Effect Selection for High‐Dimensional Data

4 Model Selection in High‐Dimensional Nonparametric Models

5 Concluding Remarks

References

18 Sampling Local Scale Parameters in High-Dimensional Regression Models

1 Introduction

2 A Blocked Gibbs Sampler for the Horseshoe

3 Sampling

4 Sampling

5 Appendix: A. Newton–Raphson Steps for the Inverse‐cdf Sampler for

Acknowledgment

References

Note

19 Factor Modeling for High-Dimensional Time Series

1 Introduction

2 Identifiability

3 Estimation of High‐Dimensional Factor Model

4 Determining the Number of Factors

Acknowledgment

References

Part V: Quantitative Visualization

20 Visual Communication of Data: It Is Not a Programming Problem, It Is Viewer Perception

1 Introduction

2 Case Studies Part 1

3 Let StAR Be Your Guide

4 Case Studies Part 2: Using StAR Principles to Develop Better Graphics

5 Ask Colleagues Their Opinion

6 Case Studies: Part 3

7 Iterate

8 Final Thoughts

Notes

References

21 Uncertainty Visualization

1 Introduction

2 Uncertainty Visualization Theories

3 General Discussion

References

22 Big Data Visualization

1 Introduction

2 Architecture for Big Data Analytics

3 Filtering

4 Aggregating

5 Analyzing

6 Big Data Graphics

7 Conclusion

References

23 Visualization‐Assisted Statistical Learning

1 Introduction

2 Better Visualizations with Seriation

3 Visualizing Machine Learning Fits

4 Condvis2 Case Studies

5 Discussion

References

24 Functional Data Visualization

1 Introduction

2 Univariate Functional Data Visualization

3 Multivariate Functional Data Visualization

4 Conclusions

Acknowledgment

References

Part VI: Numerical Approximation and Optimization

25 Gradient‐Based Optimizers for Statistics and Machine Learning

1 Introduction

2 Convex Versus Nonconvex Optimization

3 Gradient Descent

4 Proximal Gradient Descent: Handling Nondifferentiable Regularization

5 Stochastic Gradient Descent

References

26 Alternating Minimization Algorithms

1 Introduction

2 Coordinate Descent

3 EM as Alternating Minimization

4 Matrix Approximation Algorithms

5 Conclusion

References

27 A Gentle Introduction to Alternating Direction Method of Multipliers (ADMM) for Statistical Problems

1 Introduction

2 Two Perfect Examples of ADMM

3 Variable Splitting and Linearized ADMM

4 Multiblock ADMM

5 Nonconvex Problems

6 Stopping Criteria

7 Convergence Results of ADMM

Acknowledgments

References

28 Nonconvex Optimization via MM Algorithms: Convergence Theory

1 Background

2 Convergence Theorems

3 Paracontraction

4 Bregman Majorization

References

Part VII: High‐Performance Computing

29 Massive Parallelization

1 Introduction

2 Gaussian Process Regression and Surrogate Modeling

3 Divide‐and‐Conquer GP Regression

4 Empirical Results

5 Conclusion

Acknowledgments

References

30 Divide‐and‐Conquer Methods for Big Data Analysis

1 Introduction

2 Linear Regression Model

3 Parametric Models

4 Nonparametric and Semiparametric Models

5 Online Sequential Updating

6 Splitting the Number of Covariates

7 Bayesian Divide‐and‐Conquer and Median‐Based Combining

8 Real‐World Applications

9 Discussion

Acknowledgment

References

31 Bayesian Aggregation

1 From Model Selection to Model Combination

2 From Bayesian Model Averaging to Bayesian Stacking

3 Asymptotic Theories of Stacking

4 Stacking in Practice

5 Discussion

References

32 Asynchronous Parallel Computing

1 Introduction

2 Asynchronous Parallel Coordinate Update

3 Asynchronous Parallel Stochastic Approaches

4 Doubly Stochastic Coordinate Optimization with Variance Reduction

5 Concluding Remarks

References

Index

Abbreviations and Acronyms

End User License Agreement

List of Tables

Chapter 2

Table 1 Summary of selected statistical software.

Table 2 Summary of selected user environments/workflows.

Chapter 3

Table 1 Connection between input and output matrices in the third layer of L...

Chapter 4

Table 1 Streaming data versus static data [9, 10]

Chapter 5

Table 1 Probabilities for each action figure

Chapter 8

Table 1 Summary of ingredients of Algorithm 2 for the four adaptive MCMC me...

Table 2 Summary of recommended algorithms for specific problems and their s...

Chapter 9

Table 1 Summary of the notation.

Table 2 Comparison of various AIS algorithms according to different feature...

Table 3 Comparison of various AIS algorithms according to the computational...

Chapter 21

Table 1 Summary of uncertainty visualization theory detailed in this chapte...

Chapter 29

Table 2 Updated GPU/CPU results based on a more modern cascade of supercomp...

List of Illustrations

Chapter 1

Figure 1 A nontraditional and critically important application in computatio...

Chapter 3

Figure 1 An MLP with three layers.

Figure 2 Convolution operation with stride size.

Figure 3 Pooling operation with stride size.

Figure 4 LeNet‐5 of LeCun et al. [8].

Figure 5 Architecture of an autoencoder.

Figure 6 Architecture of variational autoencoder (VAE).

Figure 7 Feedforward network.

Figure 8 Architecture of recurrent neural network (RNN).

Figure 9 Architecture of long short‐term memory network (LSTM).

Chapter 4

Figure 1 Taxonomy of concept drift in data stream.

Chapter 5

Figure 1 Histograms of simulated boxes and mean number of boxes for two Mont...

Figure 2 Estimated risk at (a) and at (b) with pointwise Bonferroni corr...

Figure 3 Estimated density of the marginal posterior for from an initial r...

Figure 4 Estimated autocorrelations for nonlinchpin sampler (a) and linchpin...

Chapter 7

Figure 1 Importance sampling with importance distribution of an exponential

Figure 2 Failed simulation of a Student's distribution with mean when si...

Figure 3 Recovery of a Normal distribution when simulating realizations ...

Figure 4 Histogram of simulations of a distribution with the target dens...

Figure 5 (a) Histogram of iterations of a slice sampler with a Normal ta...

Figure 6 100 last moves of the above slice sampler.

Figure 7 Independent Metropolis sequence with a proposal equal to the dens...

Figure 8 Fit of a Metropolis sample of size to a target when using a trunc...

Figure 9 Graph of a truncated Normal density and fit by the histogram of an ...

Chapter 9

Figure 1 Graphical description of three possible dependencies between the ad...

Chapter 12

Figure 1 Decision tree for headache data.

Figure 2 RFIT analysis of the headache data: (a) Estimated ITE with SE error...

Figure 3 Exploring important effect moderators in the headache data: (a) Var...

Figure 4 Comparison of MSE averaged over 1000 interaction trees using method...

Chapter 13

Figure 1 The metabolite–microbe interaction network. Only edges linking a me...

Figure 2 Scatter plots of microbe and metabolite pairs.

Chapter 14

Figure 1 An example of first‐, second‐, and third‐order tensors.

Figure 2 Tensor fibers, unfolding and vectorization.

Figure 3 An example of magnetic resonance imaging. The image is obtained fro...

Figure 4 A third‐order tensor with a checkerbox structure.

Figure 5 A schematic illustration of the low‐rank tensor clustering method....

Figure 6 The tensor formulation of multidimensional advertising decisions.

Figure 7 Illustration of the tensor‐based CNN compression from Kossaifi et a...

Chapter 15

Figure 1 A Bayesian tree.

Figure 2 The Boston housing data was compiled from the 1970 US Census, where...

Figure 3 The distribution of and the sparse Dirichlet prior [16]. The key ...

Chapter 16

Figure 1 LASSO and nonconvex penalties: both SCAD and MCP do not penalize th...

Chapter 17

Figure 1 Hierarchy‐preserving solution paths by RAMP. (a) Strong hierarchy; ...

Chapter 18

Figure 1 Marginal prior of for different choices of .

Figure 2 Estimated autocorrelations for for the three algorithms. Approxim...

Figure 3 Trace plots (with true value indicated) and density estimates for o...

Figure 4 (a) Plots for and (in dashed gray and dashed black, respectiv...

Figure 5 Plot of as a function of , where varies between and 1.

Figure 6 The posterior mean of in a normal means problem: the ‐axis and

Chapter 20

Figure 1 ACS 2017 state estimates of the number of households (millions).

Figure 2 ACS 2017 state estimates of the number of households (millions). A ...

Figure 3 ACS 2017 median household income (USD) with 95% confidence interval...

Figure 4 Log 10 US ACS 2017 state estimates of the number of households (per...

Figure 5 ACS 2017 state estimates of the number of households (millions), wi...

Figure 6 2017 ACS household median income (USD) estimates with 95% confidenc...

Figure 7 Sloppy plot of 2017 ACS household median income (USD) estimates.

Figure 8 Sloppy plot of 2017 ACS household median income (USD) estimates wit...

Figure 9 ACS 2017 state estimates of the number of households (millions).

Figure 10 ACS 2017 state estimates of the number of households (millions). T...

Figure 11 2017 ACS household median income (USD) estimates with confidence i...

Chapter 21

Figure 1 A subset of the graphical annotations used to show properties of a ...

Figure 2 The process of generating a quantile dotplot from a log‐normal dist...

Figure 3 Illustration of HOPs compared to error bars from the same distribut...

Figure 4 Example Cone of Uncertainty produced by the National Hurricane Cent...

Figure 5 (a) An example of an ensemble hurricane path display that utilizes ...

Chapter 22

Figure 1 Classic dataflow visualization architecture.

Figure 2 Client–server visualization architecture.

Figure 3 (a) Piecewise linear confidence intervals and (b) bootstrapped regr...

Figure 4 Dot plot and histogram.

Figure 5 2D binning of 100 000 points.

Figure 6 2D binning of thousands of clustered points.

Figure 7 Massive data scatterplot matrix by Dan Carr [9].

Figure 8 nD aggregator illustrated with 2D example.

Figure 9 (a) Parallel coordinate plots of all columns and (b) aggregated col...

Figure 10 Code snippets for computing statistics on aggregated data sources....

Figure 11 Box plots of 100 000 Gaussians.

Figure 12 Lensing a scatterplot matrix.

Figure 13 Sorted and scrolling parallel coordinates [27].

Chapter 23

Figure 1 Parallel coordinate plot of the Pima data, colored by the diabetes ...

Figure 2 Heatmap of the LDA scores for measuring group separation for one an...

Figure 3 PD/ICE plots for predictor smoke, from two fits to the FEV data. Ea...

Figure 4 Condvis2 screenshot for a linear model and random forest fit to the...

Figure 5 Condvis2 screenshot for a linear model and random forest fit to the...

Figure 6 Condvis2 section plots for glucose and age from a BART (dashed line

Figure 7 Condvis2 section plots for glucose and age showing classification b...

Figure 8 Condvis2 section plots for mixed effects models and random forest f...

Figure 9 Condvis2 section plots of two mixed effects models and a fixed effe...

Chapter 24

Figure 1 Functional data: the hip (a) and knee (b) angles of each of the 39 ...

Figure 2 The functional boxplots for the hip and knee angles of each of the ...

Figure 3 The bivariate and marginal MS plots for the hip and knee angles of ...

Figure 4 The two‐stage functional boxplots for the hip (a) and knee (b) angl...

Figure 5 The trajectory functional boxplot (a) and the MSBD–WO plot (b) for ...

Chapter 26

Figure 1 Minimizing of Equation (3) via coordinate descent starting from t...

Chapter 29

Figure 1 Simple computer surrogate model example where the response, , is m...

Figure 2 Example local designs under MSPE and ALC criteria. Numbers plotte...

Figure 3 LAGP‐calculated predictive mean on “Herbie's Tooth” data. Actually,...

Figure 4 Time versus accuracy comparison on SARCOS data.

Chapter 31

Figure 1 The organization and connections of concepts in this chapter.

Chapter 32

Figure 1 Synchronous versus asynchronous parallel computing with shared memo...



Computational Statistics in Data Science

Edited by

Walter W. Piegorsch, University of Arizona

Richard A. Levine, San Diego State University

Hao Helen Zhang, University of Arizona

Thomas C. M. Lee, University of California‐Davis

This edition first published 2022

© 2022 John Wiley & Sons, Ltd.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.

The right of Walter W. Piegorsch, Richard A. Levine, Hao Helen Zhang, Thomas C. M. Lee to be identified as the author(s) of the editorial material in this work has been asserted in accordance with law.

Registered Office(s)

John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA

John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK

Editorial Office

9600 Garsington Road, Oxford, OX4 2DQ, UK

For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.

Wiley also publishes its books in a variety of electronic formats and by print‐on‐demand. Some content that appears in standard print versions of this book may not be available in other formats.

Limit of Liability/Disclaimer of Warranty

The contents of this work are intended to further general scientific research, understanding, and discussion only and are not intended and should not be relied upon as recommending or promoting scientific method, diagnosis, or treatment by physicians for any particular patient. In view of ongoing research, equipment modifications, changes in governmental regulations, and the constant flow of information relating to the use of medicines, equipment, and devices, the reader is urged to review and evaluate the information provided in the package insert or instructions for each medicine, equipment, or device for, among other things, any changes in the instructions or indication of usage and for added warnings and precautions. While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

Library of Congress Cataloging‐in‐Publication Data

ISBN 9781119561071 (hardback)

Cover Design: Wiley

Cover Image: © goja1/Shutterstock

List of Contributors

Ayodele Adebiyi

Landmark University

Omu‐Aran, Kwara

Nigeria

Anirban Bhattacharya

Texas A&M University

College Station, TX

USA

Peter Calhoun

Jaeb Center for Health Research

Tampa, FL

USA

Wu Changye

Université Paris Dauphine PSL

Paris

France

Xueying Chen

Novartis Pharmaceuticals Corp.

East Hanover, NJ

USA

Jerry Q. Cheng

New York Institute of Technology

New York, NY

USA

Hugh Chipman

Acadia University

Wolfville, Nova Scotia

Canada

Olawande Daramola

Cape Peninsula University of Technology

Cape Town

South Africa

Katarina Domijan

Maynooth University

Maynooth

Ireland

Víctor Elvira

School of Mathematics

University of Edinburgh, Edinburgh

UK

Juanjuan Fan

Department of Mathematics and Statistics

San Diego State University

San Diego, CA

USA

James M. Flegal

University of California

Riverside, CA

USA

Marc G. Genton

King Abdullah University of Science and Technology

Thuwal

Saudi Arabia

Edward George

The Wharton School

University of Pennsylvania

Philadelphia, PA

USA

Robert B. Gramacy

Virginia Polytechnic Institute and State University

Blacksburg, VA

USA

Richard Hahn

The School of Mathematical and Statistical Sciences

Arizona State University

Tempe, AZ

USA

Botao Hao

DeepMind

London

UK

Andrew J. Holbrook

University of California

Los Angeles, CA

USA

Mingyi Hong

University of Minnesota

Minneapolis, MN

USA

Cho‐Jui Hsieh

University of California

Los Angeles, CA

USA

Jessica Hullman

Northwestern University

Evanston, IL

USA

David R. Hunter

Penn State University

State College, PA

USA

Catherine B. Hurley

Maynooth University

Maynooth

Ireland

Xiang Ji

Tulane University

New Orleans, LA

USA

Adam M. Johansen

University of Warwick

Coventry

UK

James E. Johndrow

University of Pennsylvania

Philadelphia, PA

USA

Galin L. Jones

University of Minnesota

Twin‐Cities Minneapolis, MN

USA

Seung Jun Shin

Korea University

Seoul

South Korea

Matthew Kay

Northwestern University

Evanston, IL

USA

Alexander D. Knudson

The University of Nevada

Reno, NV

USA

Taiwo Kolajo

Federal University Lokoja

Lokoja

Nigeria

and

Covenant University

Ota

Nigeria

Alfonso Landeros

University of California

Los Angeles, CA

USA

Kenneth Lange

University of California

Los Angeles, CA

USA

Thomas C.M. Lee

University of California at Davis

Davis, CA

USA

Richard A. Levine

Department of Mathematics and Statistics

San Diego State University

San Diego, CA

USA

Hongzhe Li

University of Pennsylvania

Philadelphia, PA

USA

Jia Li

The Pennsylvania State University

University Park, PA

USA

Lexin Li

University of California

Berkeley, CA

USA

Yao Li

University of North Carolina at Chapel Hill

Chapel Hill, NC

USA

Yufeng Liu

University of North Carolina at Chapel Hill

Chapel Hill, NC

USA

Rong Ma

University of Pennsylvania

Philadelphia, PA

USA

Shiqian Ma

University of California

Davis, CA

USA

Luca Martino

Universidad Rey Juan Carlos de Madrid

Madrid

Spain

Robert McCulloch

The School of Mathematical and Statistical Sciences

Arizona State University

Tempe, AZ

USA

Weibin Mo

University of North Carolina at Chapel Hill

Chapel Hill, NC

USA

Edward Mulrow

NORC at the University of Chicago

Chicago, IL

USA

Akihiko Nishimura

Johns Hopkins University

Baltimore, MD

USA

Lace Padilla

University of California

Merced, CA

USA

Vincent A. Pisztora

The Pennsylvania State University

University Park, PA

USA

Matthew Pratola

The Ohio State University

Columbus, OH

USA

Christian P. Robert

Université Paris Dauphine PSL

Paris

France

and

University of Warwick

Coventry

UK

Alfred G. Schissler

The University of Nevada

Reno, NV

USA

Rodney Sparapani

Institute for Health and Equity

Medical College of Wisconsin

Milwaukee, WI

USA

Kelly M. Spoon

Computational Science Research Center

San Diego State University

San Diego, CA

USA

Xiaogang Su

Department of Mathematical Sciences

University of Texas

El Paso, TX

USA

Marc A. Suchard

University of California

Los Angeles, CA

USA

Ying Sun

King Abdullah University of Science and Technology

Thuwal

Saudi Arabia

Nola du Toit

NORC at the University of Chicago

Chicago, IL

USA

Dootika Vats

Indian Institute of Technology Kanpur

Kanpur

India

Matti Vihola

University of Jyväskylä

Jyväskylä

Finland

Justin Wang

University of California at Davis

Davis, CA

USA

Will Wei Sun

Purdue University

West Lafayette, IN

USA

Leland Wilkinson

H2O.ai, Mountain View

California

USA

and

University of Illinois at Chicago

Chicago, IL

USA

Joong‐Ho Won

Seoul National University

Seoul

South Korea

Yichao Wu

University of Illinois at Chicago

Chicago, IL

USA

Min‐ge Xie

Rutgers University

Piscataway, NJ

USA

Ming Yan

Michigan State University

East Lansing, MI

USA

Yuling Yao

Columbia University

New York, NY

USA

and

Center for Computational Mathematics

Flatiron Institute

New York, NY

USA

Chun Yip Yau

Chinese University of Hong Kong

Shatin

Hong Kong

Hao H. Zhang

University of Arizona

Tucson, AZ

USA

Hua Zhou

University of California

Los Angeles, CA

USA

Preface

Computational statistics is a core area of modern statistical science and its connections to data science represent an ever‐growing area of study. One of its important features is that the underlying technology changes quite rapidly, riding on the back of advances in computer hardware and statistical software. In this compendium we present a series of expositions that explore the intermediate and advanced concepts, theories, techniques, and practices that act to expand this rapidly evolving field. We hope that scholars and investigators will use the presentations to inform themselves on how modern computational and statistical technologies are applied, and also to build springboards that can develop their further research. Readers will require knowledge of fundamental statistical methods and, depending on the topic of interest they peruse, any advanced statistical aspects necessary to understand and conduct the technical computing procedures.

The presentation begins with a thoughtful introduction on how we should view Computational Statistics & Data Science in the 21st Century (Holbrook, et al.), followed by a careful tour of contemporary Statistical Software (Schissler, et al.). Topics that follow address a variety of issues, collected into broad topic areas such as Simulation‐based Methods, Statistical Learning, Quantitative Visualization, High‐performance Computing, High‐dimensional Data Analysis, and Numerical Approximations & Optimization.

Internet access to all of the articles presented here is available via the online collection Wiley StatsRef: Statistics Reference Online (Davidian, et al., 2014–2021); see https://onlinelibrary.wiley.com/doi/book/10.1002/9781118445112.

From Deep Learning (Li, et al.) to Asynchronous Parallel Computing (Yan), this collection provides a glimpse into how computational statistics may progress in this age of big data and transdisciplinary data science. It is our fervent hope that readers will benefit from it.

We wish to thank the fine efforts of the Wiley editorial staff, including Kimberly Monroe‐Hill, Paul Sayer, Michael New, Vignesh Lakshmikanthan, Aruna Pragasam, Viktoria Hartl‐Vida, Alison Oliver, and Layla Harden in helping bring this project to fruition.

Tucson, Arizona; San Diego, California; Tucson, Arizona; Davis, California

Walter W. Piegorsch

Richard A. Levine

Hao Helen Zhang

Thomas C. M. Lee

Reference

Davidian, M., Kenett, R.S., Longford, N.T., Molenberghs, G., Piegorsch, W.W., and Ruggeri, F., eds. (2014–2021). Wiley StatsRef: Statistics Reference Online. Chichester: John Wiley & Sons. doi:10.1002/9781118445112.

Part I: Computational Statistics and Data Science

1 Computational Statistics and Data Science in the Twenty‐First Century

Andrew J. Holbrook1, Akihiko Nishimura2, Xiang Ji3, and Marc A. Suchard1

1University of California, Los Angeles, CA, USA

2Johns Hopkins University, Baltimore, MD, USA

3Tulane University, New Orleans, LA, USA

1 Introduction

We are in the midst of the data science revolution. In October 2012, the Harvard Business Review famously declared data scientist the sexiest job of the twenty‐first century [1]. By September 2019, Google searches for the term “data science” had multiplied over sevenfold [2], one multiplicative increase for each intervening year. In the United States between the years 2000 and 2018, the number of bachelor's degrees awarded in either statistics or biostatistics increased over 10‐fold (382–3964), and the number of doctoral degrees almost tripled (249–688) [3]. In 2020, seemingly every major university has established or is establishing its own data science institute, center, or initiative.

Data science [4, 5] combines multiple preexisting disciplines (e.g., statistics, machine learning, and computer science) with a redirected focus on creating, understanding, and systematizing workflows that turn real‐world data into actionable conclusions. The ubiquity of data in all economic sectors and scientific disciplines makes data science eminently relevant to cohorts of researchers for whom the discipline of statistics was previously closed off and esoteric. Data science's emphasis on practical application only enhances the importance of computational statistics, the interface between statistics and computer science primarily concerned with the development of algorithms producing either statistical inference1 or predictions. Since both of these products comprise essential tasks in any data scientific workflow, we believe that the pan‐disciplinary nature of data science only increases the number of opportunities for computational statistics to evolve by taking on new applications2 and serving the needs of new groups of researchers.

This is the natural role for a discipline that has increased the breadth of statistical application from the beginning. First put forward by R.A. Fisher in 1936 [6, 7], the permutation test allows the scientist (who owns a computer) to test hypotheses about a broader swath of functionals of a target population while making fewer statistical assumptions [8]. With a computer, the scientist uses the bootstrap [9, 10] to obtain confidence intervals for population functionals and parameters of models too complex for analytic methods. Newton–Raphson optimization and the Fisher scoring algorithm facilitate linear regression for binary, count, and categorical outcomes. More recently, Markov chain Monte Carlo (MCMC) has made Bayesian inference practical for massive, hierarchical, and highly structured models that are useful for the analysis of a significantly wider range of scientific phenomena.
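To make the preceding concrete, the following is a minimal sketch of a Monte Carlo permutation test for a difference in two group means; the simulated data, group sizes, and number of permutations are illustrative choices rather than anything prescribed in this chapter.

```python
# Minimal Monte Carlo permutation test for a difference in two group means.
# All data and settings below are illustrative, not taken from the text.
import numpy as np

rng = np.random.default_rng(seed=1)
x = rng.normal(loc=0.0, scale=1.0, size=50)   # group 1
y = rng.normal(loc=0.5, scale=1.0, size=50)   # group 2

observed = x.mean() - y.mean()
pooled = np.concatenate([x, y])

n_perm = 10_000                                # Monte Carlo, not the exact N! enumeration
count = 0
for _ in range(n_perm):
    rng.shuffle(pooled)                        # relabel observations at random
    count += abs(pooled[:50].mean() - pooled[50:].mean()) >= abs(observed)

p_value = (count + 1) / (n_perm + 1)           # add-one correction for a valid p-value
print(f"two-sided permutation p-value: {p_value:.4f}")
```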

While computational statistics has historically increased the diversity of statistical applications, certain central difficulties exist and will continue to remain for the rest of the twenty‐first century. In Section 2, we present the first class of Core Challenges, or challenges that are easily quantifiable for generic tasks. Core Challenge 1 is Big N, or statistical inference when the number "N" of observations or data points is large; Core Challenge 2 is Big P, or statistical inference when the model parameter count "P" is large; and Core Challenge 3 is Big M, or statistical inference when the model's objective or density function is multimodal (having many modes "M")3. When large, each of these quantities brings its own unique computational difficulty. Since well over 2.5 exabytes (or $2.5 \times 10^{18}$ bytes) of data come into existence each day [15], we are confident that Core Challenge 1 will survive well into the twenty‐second century.

But Core Challenges 2 and 3 will also endure: data complexity often increases with size, and researchers strive to understand increasingly complex phenomena. Because many examples of big data become “big” by combining heterogeneous sources, big data often necessitate big models. With the help of two recent examples, Section 3 illustrates how computational statisticians make headway at the intersection of big data and big models with model‐specific advances. In Section 3.1, we present recent work in Bayesian inference for big N and big P regression. Beyond the simplified regression setting, data often come with structures (e.g., spatial, temporal, and network), and correct inference must take these structures into account. For this reason, we present novel computational methods for a highly structured and hierarchical model for the analysis of multistructured and epidemiological data in Section 3.2.

The growth of model complexity leads to new inferential challenges. While we define Core Challenges 1–3 in terms of generic target distributions or objective functions, Core Challenge 4 arises from inherent difficulties in treating complex models generically. Core Challenge 4 (Section 4.1) describes the difficulties and trade‐offs that must be overcome to create fast, flexible, and friendly “algo‐ware”. This Core Challenge requires the development of statistical algorithms that maintain efficiency despite model structure and, thus, apply to a wider swath of target distributions or objective functions “out of the box”. Such generic algorithms typically require little cleverness or creativity to implement, limiting the amount of time data scientists must spend worrying about computational details. Moreover, they aid the development of flexible statistical software that adapts to complex model structure in a way that users easily understand. But it is not enough that software be flexible and easy to use: mapping computations to computer hardware for optimal implementations remains difficult. In Section 4.2, we argue that Core Challenge 5, effective use of computational resources such as central processing units (CPU), graphics processing units (GPU), and quantum computers, will become increasingly central to the work of the computational statistician as data grow in magnitude.

2 Core Challenges 1–3

Before providing two recent examples of twenty‐first century computational statistics (Section 3), we present three easily quantified Core Challenges within computational statistics that we believe will always exist: big N, or inference from many observations; big P, or inference with high‐dimensional models; and big M, or inference with nonconvex objective – or multimodal density – functions. In twenty‐first century computational statistics, these challenges often co‐occur, but we consider them separately in this section.

2.1 Big N

Having a large number of observations makes different computational methods difficult in different ways. A worst‐case scenario, the exact permutation test requires the production of N! datasets. Cheaper alternatives, resampling methods such as the Monte Carlo permutation test or the bootstrap, may require anywhere from thousands to hundreds of thousands of randomly produced datasets [8, 10]. When, say, population means are of interest, each Monte Carlo iteration requires N summations involving expensive memory accesses. Another example of a computationally intensive model is Gaussian process regression [16, 17]; it is a popular nonparametric approach, but the exact method for fitting the model and predicting future values requires matrix inversions that scale $O(N^3)$. As the rest of the calculations require relatively negligible computational effort, we say that matrix inversions represent the computational bottleneck for Gaussian process regression.
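The sketch below illustrates where this bottleneck arises in practice: forming and factorizing the N × N kernel matrix dominates the cost of an exact Gaussian process fit. The squared-exponential kernel, the nugget value, and the problem size are assumptions made purely for illustration.

```python
# Sketch of the O(N^3) bottleneck in exact Gaussian process regression:
# factorizing the N x N kernel matrix dominates the cost.
# The squared-exponential kernel and nugget are illustrative choices.
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(0)
N = 1_000
X = rng.uniform(0, 10, size=(N, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(N)

def sq_exp_kernel(A, B, lengthscale=1.0):
    d2 = (A[:, None, 0] - B[None, :, 0]) ** 2
    return np.exp(-0.5 * d2 / lengthscale**2)

K = sq_exp_kernel(X, X) + 0.01 * np.eye(N)     # O(N^2) work and memory, plus noise variance
L = cho_factor(K, lower=True)                  # O(N^3): the computational bottleneck
alpha = cho_solve(L, y)                        # O(N^2) once the factor is available

X_new = np.linspace(0, 10, 5).reshape(-1, 1)
pred_mean = sq_exp_kernel(X_new, X) @ alpha    # GP posterior mean at new inputs
print(pred_mean)
```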

To speed up a computationally intensive method, one only needs to speed up the method's computational bottleneck. We are interested in performing Bayesian inference [18] based on a large vector of observations $\mathbf{x} = (x_1, \dots, x_N)$. We specify our model for the data with a likelihood function $\pi(\mathbf{x} \mid \boldsymbol{\theta})$ and use a prior distribution with density function $\pi(\boldsymbol{\theta})$ to characterize our belief about the value of the P‐dimensional parameter vector $\boldsymbol{\theta}$ a priori. The target of Bayesian inference is the posterior distribution of $\boldsymbol{\theta}$ conditioned on $\mathbf{x}$

$$\pi(\boldsymbol{\theta} \mid \mathbf{x}) = \frac{\pi(\mathbf{x} \mid \boldsymbol{\theta})\,\pi(\boldsymbol{\theta})}{\int \pi(\mathbf{x} \mid \boldsymbol{\theta})\,\pi(\boldsymbol{\theta})\,\mathrm{d}\boldsymbol{\theta}} \qquad (1)$$

The denominator's multidimensional integral quickly becomes impractical as P grows large, so we choose to use the Metropolis–Hastings (M–H) algorithm to generate a Markov chain with stationary distribution $\pi(\boldsymbol{\theta} \mid \mathbf{x})$ [19, 20]. We begin at an arbitrary position $\boldsymbol{\theta}^{(0)}$ and, for each iteration $s = 0, 1, 2, \dots$, randomly generate the proposal state $\boldsymbol{\theta}^{*}$ from the transition distribution with density $q(\boldsymbol{\theta}^{*} \mid \boldsymbol{\theta}^{(s)})$. We then accept the proposal state $\boldsymbol{\theta}^{*}$ with probability

$$a = \min\!\left(1,\ \frac{\pi(\boldsymbol{\theta}^{*} \mid \mathbf{x})\, q(\boldsymbol{\theta}^{(s)} \mid \boldsymbol{\theta}^{*})}{\pi(\boldsymbol{\theta}^{(s)} \mid \mathbf{x})\, q(\boldsymbol{\theta}^{*} \mid \boldsymbol{\theta}^{(s)})}\right) = \min\!\left(1,\ \frac{\pi(\mathbf{x} \mid \boldsymbol{\theta}^{*})\,\pi(\boldsymbol{\theta}^{*})\, q(\boldsymbol{\theta}^{(s)} \mid \boldsymbol{\theta}^{*})}{\pi(\mathbf{x} \mid \boldsymbol{\theta}^{(s)})\,\pi(\boldsymbol{\theta}^{(s)})\, q(\boldsymbol{\theta}^{*} \mid \boldsymbol{\theta}^{(s)})}\right) \qquad (2)$$

The ratio on the right no longer depends on the denominator in Equation (1), but one must still compute the likelihood $\pi(\mathbf{x} \mid \boldsymbol{\theta}^{*})$ and its N constituent terms $\pi(x_n \mid \boldsymbol{\theta}^{*})$.

It is for this reason that likelihood evaluations are often the computational bottleneck for Bayesian inference. In the best case, these evaluations are $O(N)$, but there are many situations in which they scale $O(N^2)$ [21, 22] or worse. Indeed, when N is large, it is often advantageous to use more advanced MCMC algorithms that use the gradient of the log‐posterior to generate better proposals. In this situation, the log‐likelihood gradient may also become a computational bottleneck [21].
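A minimal random-walk Metropolis–Hastings sketch for a toy one-parameter posterior appears below; because the Gaussian proposal is symmetric, the transition-density ratio in Equation (2) cancels, and each iteration's cost is dominated by the O(N) likelihood evaluation. The target, step size, and chain length are illustrative assumptions.

```python
# Minimal random-walk Metropolis-Hastings sketch targeting a toy posterior.
# With a symmetric Gaussian proposal the q-ratio in Equation (2) cancels,
# leaving the likelihood-times-prior ratio. All settings are illustrative.
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(loc=1.5, scale=1.0, size=1_000)    # observed data

def log_post(theta):
    # N(theta, 1) likelihood (the O(N) bottleneck) plus a N(0, 10^2) prior
    return -0.5 * np.sum((x - theta) ** 2) - 0.5 * theta**2 / 100.0

theta = 0.0
chain = np.empty(5_000)
for s in range(chain.size):
    proposal = theta + 0.05 * rng.standard_normal()    # symmetric random walk
    log_accept = log_post(proposal) - log_post(theta)  # log of the ratio in (2)
    if np.log(rng.uniform()) < log_accept:
        theta = proposal
    chain[s] = theta

print("posterior mean estimate:", chain[1_000:].mean())   # discard burn-in
```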

2.2 Big P

One of the simplest models for big P problems is ridge regression [23], but computing can become expensive even in this classical setting. Ridge regression estimates the coefficient vector $\boldsymbol{\beta}$ by minimizing the distance between the observed and predicted values $\mathbf{y}$ and $\mathbf{X}\boldsymbol{\beta}$ along with a weighted square norm of $\boldsymbol{\beta}$:

$$\hat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta}}{\arg\min}\ \left\{ \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^{2} + \lambda \|\boldsymbol{\beta}\|_2^{2} \right\} = \left(\mathbf{X}^{\top}\mathbf{X} + \lambda\mathbf{I}\right)^{-1}\mathbf{X}^{\top}\mathbf{y}$$

For illustrative purposes, we consider the following direct method for computing $\hat{\boldsymbol{\beta}}$.4 We can first multiply the design matrix $\mathbf{X}$ by its transpose at the cost of $O(NP^2)$ and subsequently invert the resulting $P \times P$ matrix at the cost of $O(P^3)$. The total $O(NP^2 + P^3)$ complexity shows that (i) a large number of parameters is often sufficient for making even the simplest of tasks infeasible and (ii) a moderate number of parameters can render a task impractical when there are a large number of observations. These two insights extend to more complicated models: the same complexity analysis holds for the fitting of generalized linear models (GLMs) as described in McCullagh and Nelder [12].
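The direct method just described can be sketched in a few lines; the comments mark the O(NP²) and O(P³) steps. The dimensions and penalty value are illustrative.

```python
# Direct (normal-equations) ridge solve, annotating the costs discussed above.
# Dimensions and the penalty value are illustrative.
import numpy as np

rng = np.random.default_rng(3)
N, P, lam = 5_000, 300, 1.0
X = rng.standard_normal((N, P))
y = X @ rng.standard_normal(P) + rng.standard_normal(N)

gram = X.T @ X                                               # O(N P^2): multiply X by its transpose
beta_hat = np.linalg.solve(gram + lam * np.eye(P), X.T @ y)  # O(P^3): solve the P x P system
print(beta_hat[:5])
```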

In the context of Bayesian inference, the length P of the vector $\boldsymbol{\theta}$ dictates the dimension of the MCMC state space. For the M–H algorithm (Section 2.1) with P‐dimensional Gaussian target and proposal, Gelman et al. [25] show that the proposal distribution's covariance should be scaled by a factor inversely proportional to P. Hence, as the dimension of the state space grows, it behooves one to propose states that are closer to the current state of the Markov chain, and one must greatly increase the number of MCMC iterations. At the same time, an increasing P often slows down rate‐limiting likelihood calculations (Section 2.1). Taken together, one must generate many more, much slower MCMC iterations. The wide applicability of latent variable models [26] (Sections 3.1 and 3.2) for which each observation has its own parameter set (e.g., observation $x_n$ has parameters $\boldsymbol{\theta}_n$) means M–H simply does not work for a huge class of models popular with practitioners.

For these reasons, Hamiltonian Monte Carlo (HMC) [27] has become a popular algorithm for fitting Bayesian models with large numbers of parameters. Like M–H, HMC uses an accept step (Equation 2). Unlike M–H, HMC takes advantage of additional information about the target distribution in the form of the log‐posterior gradient. HMC works by doubling the state space dimension with an auxiliary Gaussian "momentum" variable $\mathbf{p}$ independent of the "position" variable $\boldsymbol{\theta}$. The constructed Hamiltonian system has energy function given by the negative logarithm of the joint distribution

$$H(\boldsymbol{\theta}, \mathbf{p}) = -\log\!\big(\pi(\boldsymbol{\theta} \mid \mathbf{x})\, \exp(-\mathbf{p}^{\top}\mathbf{p}/2)\big) = -\log \pi(\boldsymbol{\theta} \mid \mathbf{x}) + \tfrac{1}{2}\,\mathbf{p}^{\top}\mathbf{p}$$

and we produce proposals by simulating the system according to Hamilton's equations

$$\frac{\mathrm{d}\boldsymbol{\theta}}{\mathrm{d}t} = \frac{\partial H}{\partial \mathbf{p}} = \mathbf{p}, \qquad \frac{\mathrm{d}\mathbf{p}}{\mathrm{d}t} = -\frac{\partial H}{\partial \boldsymbol{\theta}} = \nabla_{\boldsymbol{\theta}} \log \pi(\boldsymbol{\theta} \mid \mathbf{x})$$

Thus, the momentum of the system moves in the direction of the steepest ascent for the log‐posterior, forming an analogy with first‐order optimization. The cost is repeated gradient evaluations that may comprise a new computational bottleneck, but the result is effective MCMC for tens of thousands of parameters [21, 28]. The success of HMC has inspired research into other methods leveraging gradient information to generate better MCMC proposals when P is large [29].
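The following sketch implements this recipe with the common leapfrog integrator and an identity mass matrix, consistent with the energy function above; the toy Gaussian target, step size, and number of leapfrog steps are illustrative assumptions rather than recommendations.

```python
# Minimal Hamiltonian Monte Carlo sketch with leapfrog integration and an
# identity mass matrix. The toy target and tuning values are illustrative.
import numpy as np

rng = np.random.default_rng(7)
P = 50
precision = np.diag(np.linspace(0.5, 2.0, P))     # toy Gaussian target N(0, precision^-1)

def log_post(theta):
    return -0.5 * theta @ precision @ theta

def grad_log_post(theta):
    return -precision @ theta

def hmc_step(theta, step_size=0.1, n_leapfrog=20):
    p = rng.standard_normal(P)                    # resample Gaussian momentum
    theta_new, p_new = theta.copy(), p.copy()
    p_new += 0.5 * step_size * grad_log_post(theta_new)   # half step for momentum
    for _ in range(n_leapfrog - 1):
        theta_new += step_size * p_new                    # full step for position
        p_new += step_size * grad_log_post(theta_new)     # full step for momentum
    theta_new += step_size * p_new
    p_new += 0.5 * step_size * grad_log_post(theta_new)   # final half step
    # Metropolis correction based on the change in total energy H
    log_accept = (log_post(theta_new) - 0.5 * p_new @ p_new) - (log_post(theta) - 0.5 * p @ p)
    return theta_new if np.log(rng.uniform()) < log_accept else theta

theta = np.zeros(P)
for _ in range(1_000):
    theta = hmc_step(theta)
print("final sample (first 3 coordinates):", theta[:3])
```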

2.3 Big M

Global optimization, or the problem of finding the minimum of a function with arbitrarily many local minima, is NP‐complete in general [30], meaning – in layman's terms – it is impossibly hard. In the absence of a tractable theory, by which one might prove one's global optimization procedure works, brute‐force grid and random searches and heuristic methods such as particle swarm optimization [31] and genetic algorithms [32] have been popular. Due to the overwhelming difficulty of global optimization, a large portion of the optimization literature has focused on the particularly well‐behaved class of convex functions [33, 34], which do not admit multiple local minima. Since Fisher introduced his “maximum likelihood” in 1922 [35], statisticians have thought in terms of maximization, but convexity theory still applies by a trivial negation of the objective function. Nonetheless, most statisticians safely ignored concavity during the twentieth century: exponential family log‐likelihoods are log‐concave, so Newton–Raphson and Fisher scoring are guaranteed optimality in the context of GLMs [12, 34].

Nearing the end of the twentieth century, multimodality and nonconvexity became more important for statisticians considering high‐dimensional regression, that is, regression with many covariates (big P). Here, for purposes of interpretability and variance reduction, one would like to induce sparsity on the weights vector $\boldsymbol{\beta}$ by performing best subset selection [36, 37]:

$$\hat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta}}{\arg\min}\ \left\{ \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^{2} + \lambda \|\boldsymbol{\beta}\|_0 \right\} \qquad (3)$$

where $\lambda > 0$, and $\|\boldsymbol{\beta}\|_0$ denotes the $\ell_0$‐norm, that is, the number of nonzero elements of $\boldsymbol{\beta}$. Because best subset selection requires an immensely difficult nonconvex optimization, Tibshirani [38] famously replaces the $\ell_0$‐norm with the $\ell_1$‐norm, thereby providing sparsity, while nonetheless maintaining convexity.
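In practice, the $\ell_1$ relaxation retains the sparsity that motivated (3): a lasso fit returns exact zeros for most coefficients. The brief sketch below uses scikit-learn purely for convenience; the simulated data and penalty weight are illustrative assumptions.

```python
# The l1 relaxation in practice: a lasso fit produces exact zeros in the
# estimated coefficient vector. Data and the penalty weight are illustrative.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(8)
N, P = 200, 50
X = rng.standard_normal((N, P))
beta_true = np.zeros(P)
beta_true[:3] = [2.0, -1.5, 1.0]                 # only three nonzero effects
y = X @ beta_true + rng.standard_normal(N)

fit = Lasso(alpha=0.1).fit(X, y)                 # l1-penalized least squares
print("number of nonzero estimates:", np.count_nonzero(fit.coef_))
```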

Historically, Bayesians have paid much less attention to convexity than have optimization researchers. This is most likely because the basic theory [13] of MCMC does not require such restrictions: even if a target distribution has one million modes, the well‐constructed Markov chain explores them all in the limit. Despite these theoretical guarantees, a small literature has developed to tackle multimodal Bayesian inference [39–42] because multimodal target distributions do present a challenge in practice. In analogy with Equation (3), Bayesians seek to induce sparsity by specifying priors such as the spike‐and‐slab [43–45], for example,

$$\beta_p \mid \gamma_p \sim \gamma_p\, \mathcal{N}(0, \sigma_{\beta}^{2}) + (1 - \gamma_p)\, \delta_0, \qquad \gamma_p \sim \mathrm{Bernoulli}(\pi_0), \qquad p = 1, \dots, P$$

where $\delta_0$ denotes a point mass at zero.

As with the best subset selection objective function, the spike‐and‐slab target distribution becomes heavily multimodal as P grows and the support of $\boldsymbol{\gamma}$'s discrete distribution grows to $2^{P}$ potential configurations.

In the following section, we present an alternative Bayesian sparse regression approach that mitigates the combinatorial problem along with a state‐of‐the‐art computational technique that scales well both in N and P.

3 Model‐Specific Advances

These challenges will remain throughout the twenty‐first century, but it is possible to make significant advances for specific statistical tasks or classes of models. Section 3.1 considers Bayesian sparse regression based on continuous shrinkage priors, designed to alleviate the heavy multimodality (big M) of the more traditional spike‐and‐slab approach. This model presents a major computational challenge as N and P grow, but a recent computational advance makes the posterior inference feasible for many modern large‐scale applications.

And because of the rise of data science, there are increasing opportunities for computational statistics to grow by enabling and extending statistical inference for scientific applications previously outside of mainstream statistics. Here, the science may dictate the development of structured models with complexity possibly growing in N and P. Section 3.2 presents a method for fast phylogenetic inference, where the primary structure of interest is a "family tree" describing a biological evolutionary history.

3.1 Bayesian Sparse Regression in the Age of Big N and Big P

With the goal of identifying a small subset of relevant features among a large number of potential candidates, sparse regression techniques have long featured in a range of statistical and data science applications [46]. Traditionally, such techniques were commonly applied in the “” setting, and correspondingly computational algorithms focused on this situation [47], especially within the Bayesian literature [48].

Due to a growing number of initiatives for large‐scale data collections and new types of scientific inquiries made possible by emerging technologies, however, increasingly common are datasets that are "big N" and "big P" at the same time. For example, modern observational studies using health‐care databases routinely involve patients and clinical covariates [49]. The UK Biobank provides brain imaging data on patients, with , depending on the scientific question of interest [50]. Single‐cell RNA sequencing can generate datasets with N (the number of cells) in millions and P (the number of genes) in tens of thousands, with the trend indicating further growth in data size to come [51].

3.1.1 Continuous shrinkage: alleviating big M

Bayesian sparse regression, despite its desirable theoretical properties and flexibility to serve as a building block for richer statistical models, has always been relatively computationally intensive even before the advent of "big N and big P" data [45, 52, 53]. A major source of its computational burden is severe posterior multimodality (big M) induced by the discrete binary nature of spike‐and‐slab priors (Section 2.3). The class of global–local continuous shrinkage priors is a more recent alternative that shrinks the $\beta_p$ in a more continuous manner, thereby alleviating (if not eliminating) the multimodality issue [54, 55]. This class of prior is represented as a scale mixture of Gaussians:

$$\beta_p \mid \lambda_p, \tau \sim \mathcal{N}\!\left(0,\ \tau^{2}\lambda_p^{2}\right), \qquad \lambda_p \sim \pi_{\mathrm{local}}(\cdot), \qquad \tau \sim \pi_{\mathrm{global}}(\cdot)$$

The idea is that the global scale parameter $\tau$ would shrink most $\beta_p$ toward zero, while the local scale parameters $\lambda_p$, with their heavy‐tailed prior $\pi_{\mathrm{local}}(\cdot)$, allow a small number of $\lambda_p$ and hence $\beta_p$ to be estimated away from zero. While motivated by two different conceptual frameworks, the spike‐and‐slab can be viewed as a subset of global–local priors in which $\pi_{\mathrm{local}}(\cdot)$ is chosen as a mixture of delta masses placed at $\lambda_p = 0$ and $\lambda_p = \infty$. Continuous shrinkage mitigates the multimodality of spike‐and‐slab by smoothly bridging small and large values of $\lambda_p$.
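A few lines suffice to draw from one member of this class; the sketch below assumes half-Cauchy local scales (the horseshoe prior, one well-known member of the global–local family), with the global scale and dimension chosen only for illustration.

```python
# Draws from a global-local scale mixture of Gaussians, assuming half-Cauchy
# local scales (the horseshoe prior). The global scale tau is illustrative.
import numpy as np

rng = np.random.default_rng(11)
P, tau = 10_000, 0.01

lam = np.abs(rng.standard_cauchy(P))       # heavy-tailed local scales lambda_p
beta = rng.normal(0.0, tau * lam)          # beta_p | lambda_p, tau ~ N(0, tau^2 lambda_p^2)

# Most draws sit near zero, while the heavy tail lets a few escape shrinkage.
print("fraction with |beta_p| < 0.01:", np.mean(np.abs(beta) < 0.01))
print("largest |beta_p|:", np.abs(beta).max())
```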

On the other hand, the use of continuous shrinkage priors does not address the increasing computational burden from growing N and P in modern applications. Sparse regression posteriors under global–local priors are amenable to an effective Gibbs sampler, a popular class of MCMC we describe further in Section 4.1. Under the linear and logistic models, the computational bottleneck of this Gibbs sampler stems from the need for repeated updates of $\boldsymbol{\beta}$ from its conditional distribution

$$\boldsymbol{\beta} \mid \tau, \boldsymbol{\lambda}, \boldsymbol{\Omega}, \mathbf{y}, \mathbf{X} \sim \mathcal{N}\!\left(\boldsymbol{\Phi}^{-1}\mathbf{X}^{\top}\boldsymbol{\Omega}\mathbf{y},\ \boldsymbol{\Phi}^{-1}\right), \qquad \boldsymbol{\Phi} = \mathbf{X}^{\top}\boldsymbol{\Omega}\mathbf{X} + \tau^{-2}\boldsymbol{\Lambda}^{-2} \qquad (4)$$

where $\boldsymbol{\Omega}$ is a diagonal matrix of additional parameters and $\boldsymbol{\Lambda} = \operatorname{diag}(\lambda_1, \dots, \lambda_P)$.5 Sampling from this high‐dimensional Gaussian distribution requires $O(NP^{2} + P^{3})$ operations with the standard approach [58]: $O(NP^{2})$ for computing the term $\mathbf{X}^{\top}\boldsymbol{\Omega}\mathbf{X}$ and $O(P^{3})$ for Cholesky factorization of $\boldsymbol{\Phi}$. While an alternative approach by Bhattacharya et al. [48] provides the complexity of $O(N^{2}P)$, the computational cost remains problematic in the big N and big P regime at $O(\min\{N^{2}P,\ NP^{2}\})$ after choosing the faster of the two.
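The standard approach referenced above can be sketched as follows for the linear-model case with $\boldsymbol{\Omega} = \sigma^{-2}\mathbf{I}$, an assumption made only for concreteness; the comments mark the O(NP²) and O(P³) steps that dominate the cost.

```python
# Standard (direct) sampler for the conditional Gaussian in (4) via Cholesky
# factorization. For concreteness we take the linear model with Omega = I/sigma^2;
# all dimensions and parameter values are illustrative.
import numpy as np

rng = np.random.default_rng(5)
N, P, sigma, tau = 1_000, 500, 1.0, 0.1
X = rng.standard_normal((N, P))
y = X[:, :5] @ np.ones(5) + sigma * rng.standard_normal(N)
lam = np.abs(rng.standard_cauchy(P))                      # local scales

omega_diag = np.full(N, 1.0 / sigma**2)                   # Omega = I / sigma^2
Phi = (X.T * omega_diag) @ X + np.diag(1.0 / (tau * lam) ** 2)   # O(N P^2)
L = np.linalg.cholesky(Phi)                               # O(P^3)

b = X.T @ (omega_diag * y)
mean = np.linalg.solve(L.T, np.linalg.solve(L, b))        # Phi^{-1} X^T Omega y
beta = mean + np.linalg.solve(L.T, rng.standard_normal(P))   # adds N(0, Phi^{-1}) noise
print("posterior draw, first 5 coordinates:", beta[:5])
```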

3.1.2 Conjugate gradient sampler for structured high‐dimensional Gaussians

The conjugate gradient (CG) sampler of Nishimura and Suchard [57] combined with their prior‐preconditioning technique overcomes this seemingly inevitable growth of the computational cost. Their algorithm is based on a novel application of the CG method [59, 60], which belongs to a family of iterative methods in numerical linear algebra. Despite its first appearance in 1952, CG received little attention for the next few decades, only making its way into major software packages such as MATLAB in the 1990s [61]. With its ability to solve a large and structured linear system $\boldsymbol{\Phi}\boldsymbol{\beta} = \mathbf{b}$ via a small number of matrix–vector multiplications $\mathbf{v} \to \boldsymbol{\Phi}\mathbf{v}$ without ever explicitly inverting $\boldsymbol{\Phi}$, however, CG has since emerged as an essential and prototypical algorithm for modern scientific computing [62, 63].

Despite its earlier rise to prominence in other fields, CG had not found practical applications in Bayesian computation until rather recently [57, 64]. We can offer at least two explanations for this. First, being an algorithm for solving a deterministic linear system, it is not obvious how CG would be relevant to Monte Carlo simulation, such as sampling from the Gaussian distribution (4); ostensibly, such a task requires computing a "square root" $\mathbf{L}$ of the precision matrix $\boldsymbol{\Phi}$ so that $\mathbf{L}\mathbf{L}^{\top} = \boldsymbol{\Phi}$ and $\mathrm{Var}\!\left(\mathbf{L}^{-\top}\mathbf{z}\right) = \boldsymbol{\Phi}^{-1}$ for $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. Secondly, unlike direct linear algebra methods, iterative methods such as CG have a variable computational cost that depends critically on the user's choice of a preconditioner and thus cannot be used as a "black‐box" algorithm.6 In particular, this novel application of CG to Bayesian computation is a reminder that other powerful ideas in other computationally intensive fields may remain untapped by the statistical computing community; knowledge transfers will likely be facilitated by having more researchers working at intersections of different fields.

Nishimura and Suchard [57] turn CG into a viable algorithm for Bayesian sparse regression problems by realizing that (i) we can obtain a Gaussian vector $\mathbf{b} \sim \mathcal{N}(\mathbf{X}^{\top}\boldsymbol{\Omega}\mathbf{y},\ \boldsymbol{\Phi})$ by first generating independent standard Gaussian vectors $\boldsymbol{\zeta}$ and $\boldsymbol{\eta}$ and then setting $\mathbf{b} = \mathbf{X}^{\top}\boldsymbol{\Omega}\mathbf{y} + \mathbf{X}^{\top}\boldsymbol{\Omega}^{1/2}\boldsymbol{\zeta} + \tau^{-1}\boldsymbol{\Lambda}^{-1}\boldsymbol{\eta}$ and (ii) subsequently solving $\boldsymbol{\Phi}\boldsymbol{\beta} = \mathbf{b}$ yields a sample $\boldsymbol{\beta}$ from the distribution (4). The authors then observe that the mechanism through which a shrinkage prior induces sparsity of the $\beta_p$ also induces a tight clustering of eigenvalues in the prior‐preconditioned matrix $\tau^{2}\boldsymbol{\Lambda}\boldsymbol{\Phi}\boldsymbol{\Lambda}$. This fact makes it possible for prior‐preconditioned CG to solve the system in a small number of matrix–vector operations of the form $\mathbf{v} \to \boldsymbol{\Phi}\mathbf{v}$, where that number roughly corresponds to the number of significant $\beta_p$ that are distinguishable from zero under the posterior. For $\boldsymbol{\Phi}$ having a structure as in (4), $\boldsymbol{\Phi}\mathbf{v}$ can be computed via matrix–vector multiplications of the form $\mathbf{v} \to \mathbf{X}\mathbf{v}$ and $\mathbf{w} \to \mathbf{X}^{\top}\mathbf{w}$, so each operation requires a fraction of the computational cost of directly computing $\boldsymbol{\Phi}$ and then factorizing it.
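A matrix-free sketch of steps (i) and (ii) appears below, using SciPy's conjugate gradient solver with a linear operator so that $\boldsymbol{\Phi}$ is applied only through products with $\mathbf{X}$ and $\mathbf{X}^{\top}$ and never formed explicitly. The sketch omits the prior-preconditioning step that is central to the authors' method, and the simulated data and hyperparameter values are illustrative assumptions.

```python
# Matrix-free version of steps (i) and (ii) using SciPy's conjugate gradient
# solver: Phi is applied through X and X^T only and never formed explicitly.
# Prior preconditioning is omitted; all data and settings are illustrative.
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(5)
N, P, sigma, tau = 1_000, 500, 1.0, 0.1
X = rng.standard_normal((N, P))
y = X[:, :5] @ np.ones(5) + sigma * rng.standard_normal(N)
lam = np.abs(rng.standard_cauchy(P))
omega_diag = np.full(N, 1.0 / sigma**2)        # Omega = I / sigma^2 (linear model)

def phi_matvec(v):
    # Phi v = X^T Omega X v + (tau Lambda)^{-2} v, via O(NP) matrix-vector products
    return X.T @ (omega_diag * (X @ v)) + v / (tau * lam) ** 2

Phi_op = LinearOperator((P, P), matvec=phi_matvec)

# Step (i): b ~ N(X^T Omega y, Phi) from two independent standard Gaussian vectors
zeta, eta = rng.standard_normal(N), rng.standard_normal(P)
b = X.T @ (omega_diag * y) + X.T @ (np.sqrt(omega_diag) * zeta) + eta / (tau * lam)

# Step (ii): solving Phi beta = b with CG yields a draw from the conditional (4)
beta_cg, info = cg(Phi_op, b)
print("CG converged:", info == 0, "| first 5 coordinates:", beta_cg[:5])
```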

Prior‐preconditioned CG demonstrates an order of magnitude speedup in posterior computation when applied to a comparative effectiveness study of atrial fibrillation treatment involving patients and covariates [57]. Though unexplored in their work, the algorithm's heavy use of matrix–vector multiplications provides avenues for further acceleration. Technically, the algorithm's complexity may be characterized as $O(NP)$ per matrix–vector multiplication by $\mathbf{X}$ and $\mathbf{X}^{\top}$, but the theoretical complexity is only a part of the story. Matrix–vector multiplications are amenable to a variety of hardware optimizations, which in practice can make orders of magnitude difference in speed (Section 4.2). In fact, given how arduous manually optimizing computational bottlenecks can be, designing algorithms so as to take advantage of common routines (as those in Level 3 BLAS