Statistics for Data Science and Analytics - Peter C. Bruce - E-Book

Description

Introductory statistics textbook with a focus on data science topics such as prediction, correlation, and data exploration

Statistics for Data Science and Analytics is a comprehensive guide to statistical analysis using Python, presenting important topics useful for data science such as prediction, correlation, and data exploration. The authors provide an introduction to statistical science and big data, as well as an overview of Python data structures and operations.

A range of statistical techniques are presented with their implementation in Python, including hypothesis testing, probability, exploratory data analysis, categorical variables, surveys and sampling, A/B testing, and correlation. The text introduces binary classification, a foundational element of machine learning, validation of statistical models by applying them to holdout data, and probability and inference via the easy-to-understand method of resampling and the bootstrap instead of using a myriad of “kitchen sink” formulas. Regression is taught both as a tool for explanation and for prediction.

This book is informed by the authors’ experience designing and teaching both introductory statistics and machine learning at Statistics.com. Each chapter includes practical examples, explanations of the underlying concepts, and Python code snippets to help readers apply the techniques themselves.

Statistics for Data Science and Analytics includes information on sample topics such as:

  • Int, float, and string data types, numerical operations, manipulating strings, converting data types, and advanced data structures like lists, dictionaries, and sets
  • Experiment design via randomizing, blinding, and before-after pairing, as well as proportions and percents when handling binary data
  • Specialized Python packages like numpy, scipy, pandas, scikit-learn and statsmodels—the workhorses of data science—and how to get the most value from them
  • Statistical versus practical significance, random number generators, functions for code reuse, and binomial and normal probability distributions

Written by and for data science instructors, Statistics for Data Science and Analytics is an excellent learning resource for instructors prescribing a required intro stats course for their programs, as well as for students and professionals seeking to transition to the data science field.


Page count: 522

Publication year: 2024




Table of Contents

Cover

Table of Contents

Title Page

Copyright

Dedication

About the Authors

Acknowledgments

About the Companion Website

Introduction

Statistics and Data Science

Accompanying Web Resources

Python

Using Python with this Book

1 Statistics and Data Science

1.1 Big Data: Predicting Pregnancy

1.2 Phantom Protection from Vitamin E

1.3 Statistician, Heal Thyself

1.4 Identifying Terrorists in Airports

1.5 Looking Ahead

1.6 Big Data and Statisticians

2 Designing and Carrying Out a Statistical Study

2.1 Statistical Science

2.2 Big Data

2.3 Data Science

2.4 Example: Hospital Errors

2.5 Experiment

2.6 Designing an Experiment

2.7 The Data

2.8 Variables and Their Flavors

2.9 Python: Data Structures and Operations

2.10 Are We Sure We Made a Difference?

2.11 Is Chance Responsible? The Foundation of Hypothesis Testing

2.12 Probability

2.13 Significance or Alpha Level

2.14 Other Kinds of Studies

2.15 When to Use Hypothesis Tests

2.16 Experiments Falling Short of the Gold Standard

2.17 Summary

2.18 Python: Iterations and Conditional Execution

2.19 Python: numpy, scipy, and pandas—The Workhorses of Data Science

Exercises

Notes

3 Exploring and Displaying the Data

3.1 Exploratory Data Analysis

3.2 What to Measure—Central Location

3.3 What to Measure—Variability

3.4 What to Measure—Distance (Nearness)

3.5 Test Statistic

3.6 Examining and Displaying the Data

3.7 Python: Exploratory Data Analysis/Data Visualization

Exercises

Notes

4 Accounting for Chance—Statistical Inference

4.1 Avoid Being Fooled by Chance

4.2 The Null Hypothesis

4.3 Repeating the Experiment

4.4 Statistical Significance

4.5 Power

4.6 The Normal Distribution

4.7 Summary

4.8 Python: Random Numbers

Exercises

Notes

5 Probability

5.1 What Is Probability

5.2 Simple Probability

5.3 Probability Distributions

5.4 From Binomial to Normal Distribution

5.5 Appendix: Binomial Formula and Normal Approximation

5.6 Python: Probability

Exercises

6 Categorical Variables

6.1 Two-way Tables

6.2 Conditional Probability

6.3 Bayesian Estimates

6.4 Independence

6.5 Multiplication Rule

6.6 Simpson’s Paradox

6.7 Python: Counting and Contingency Tables

Exercises

Notes

7 Surveys and Sampling

7.1 Literary Digest—Sampling Trumps “All Data”

7.2 Simple Random Samples

7.3 Margin of Error: Sampling Distribution for a Proportion

7.4 Sampling Distribution for a Mean

7.5 The Bootstrap

7.6 Rationale for the Bootstrap

7.7 Standard Error

7.8 Other Sampling Methods

7.9 Absolute vs. Relative Sample Size

7.10 Python: Random Sampling Strategies

Exercises

Notes

8 More than Two Samples or Categories

8.1 Count Data—R×C Tables

8.2 The Role of Experiments (Many Are Costly)

8.3 Chi-Square Test

8.4 Single Sample—Goodness-of-Fit

8.5 Numeric Data: ANOVA

8.6 Components of Variance

8.7 Factorial Design

8.8 The Problem of Multiple Inference

8.9 Continuous Testing

8.10 Bandit Algorithms

8.11 Appendix: ANOVA, the Factor Diagram, and the F-Statistic

8.12 More than One Factor or Variable—From ANOVA to Statistical Models

8.13 Python: Contingency Tables and Chi-square Test

8.14 Python: ANOVA

Exercises

Notes

9 Correlation

9.1 Example: Delta Wire

9.2 Example: Cotton Dust and Lung Disease

9.3 The Vector Product Sum Test

9.4 Correlation Coefficient

9.5 Correlation is not Causation

9.6 Other Forms of Association

9.7 Python: Correlation

Exercises

Notes

10 Regression

10.1 Finding the Regression Line by Eye

10.2 Finding the Regression Line by Minimizing Residuals

10.3 Linear Relationships

10.4 Prediction vs. Explanation

10.5 Python: Linear Regression

Exercises

Note

11 Multiple Linear Regression

11.1 Terminology

11.2 Example—Housing Prices

11.3 Interaction

11.4 Regression Assumptions

11.5 Assessing Explanatory Regression Models

11.6 Assessing Regression for Prediction

11.7 Python: Multiple Linear Regression

Exercises

Note

12 Predicting Binary Outcomes

12.1 K-Nearest-Neighbors

12.2 Python: Classification

12.3 Exercises

Note

Index

End User License Agreement

List of Tables

Chapter 2

Table 2.1 Hospital errors (partial): before and after treatment.

Table 2.2 Reduction in major errors in hospitals.

Table 2.3 Error reduction: compact table.

Chapter 3

Table 3.1 Hospital error reductions, treatment, and control groups.

Table 3.2 Musical genre preferences.

Table 3.3 Musical genre preferences.

Table 3.4 Frequency distribution—reduction in errors.

Table 3.5 Error reduction frequency table (control).

Chapter 4

Table 4.1 Permutation of error reduction scores into two groups.

Table 4.2 Average error reduction: first random group—second random group, 5...

Table 4.3 Sorted results from 1000 trials (first 15 shown).

Chapter 5

Table 5.1 Probability of different successes.

Table 5.2 Cholesterol scores for a group of 10 subjects.

Table 5.3 Simple -table.

Chapter 6

Table 6.1 Applications to UC Berkeley departments.

Table 6.2 Applications to UC Berkeley departments.

Table 6.3 Numbers of applicants to UC Berkeley departments A, B, C….

Table 6.4 Percentage of department applications by gender.

Table 6.5 Male/female applications by department.

Table 6.6 Admission rates (percent) by department.

Table 6.7 Apparent discrimination against women at Berkeley.

Table 6.8 Berkeley admission rates by department.

Chapter 7

Table 7.1 Toyota Corolla used car prices.

Table 7.2 Mean of 20 resampled or bootstrapped values.

Table 7.3 90% confidence interval from percentiles of resampling distributio...

Chapter 8

Table 8.1 Marriage therapy.

Table 8.2 Expected outcomes if treatments yield the same results.

Table 8.3 Absolute difference between observed and expected.

Table 8.4 Resampling output—distribution of outcome after one shuffling.

Table 8.5 Tabulation of one shuffling.

Table 8.6 Marriage therapy: A few of the sums of resampled differences, in d...

Table 8.7 Frequencies of 315 interior digits in Imanishi-Kari data.

Table 8.8 Fat absorption data (grams).

Table 8.9 Comparing means.

Table 8.10 Variance of group means.

Table 8.11 Resampled variances.

Table 8.12 Doughnut data with group means and grand average.

Table 8.13 Hypothetical outcomes of factorial design.

Table 8.14 User responses to two online product treatments.

Table 8.15 Web page load times in seconds for 3 different server configurati...

Chapter 9

Table 9.1 Training and productivity at Delta Wire.

Table 9.2 Excerpt of baseball payroll and total wins.

Chapter 11

Table 11.1 Boston Housing data variables.

Table 11.2 Boston Housing data: correlation matrix.

Table 11.3 Tayko data fields.

Table 11.4 Tayko data, spending known (top-10 rows).

Table 11.5 Tayko data, validation partition.

Table 11.6 Predictions from the regression.

Table 11.7 Actual spending in hold-out data revealed.

Table 11.8 Actual spending in hold-out data revealed, adding residuals.

Table 11.9 Finding root mean squared error (RMSE).

Table 11.10 Predicting everyone is average.

Chapter 12

Table 12.1 Hypothetical customer purchases.

Table 12.2 Online course customers.

List of Illustrations

Chapter 2

Figure 2.1 Dart throws off-target in consistent fashion (biased). Source: Pe...

Figure 2.2 All possible outcomes for three coin tosses.

Figure 2.3 Kerrich coin tosses. Number of tosses on the x-axis and proportio...

Figure 2.4 The American Community Survey is a detailed version of the full c...

Figure 2.5 Dart throw misses. Source: Peter Bruce (Book Author).

Chapter 3

Figure 3.1 Frequency histogram of error reductions in treatment group.

Figure 3.2 Hospital sizes by number of beds (hypothetical data for a mid-siz...

Figure 3.3 Back-to-back histogram.

Figure 3.4 Bar chart—applicants by department.

Figure 3.5 Boxplot of metropolitan area hospital sizes (y-axis shows number ...

Figure 3.6 Error reductions (y-axis) for control hospitals (0) and treatment ...

Figure 3.7 Simple line plot using the (a) pyplot and (b) object-oriented int...

Figure 3.8 Customization of a graph with title, axes labels, legends and gri...

Figure 3.9 Creating two graphs next to each other.

Figure 3.10 Examples of visualizations for exploratory data analysis created...

Chapter 4

Figure 4.1 Histogram of 1000 trials—permutation of control and treatment gro...

Figure 4.2 The theoretical Normal distribution (the x-axis is expressed in s...

Figure 4.3 Histogram of 1000 trials—permutation of control and treatment gro...

Figure 4.4 Histogram of 1000 resampled difference in the length of time peop...

Chapter 5

Figure 5.1 Blaise Pascal, 1623–1662. Source: Unknown author/Wikimedia Common...

Figure 5.2 Venn diagram: E and Ē (Ē means “not E”).

Figure 5.3 Venn diagram—two events of interest (B and E) and their intersect...

Figure 5.4 Distribution of the number of heads in 150 flips of a fair coin....

Figure 5.5 Probability of successes (hits) in 5 at-bats, calculated using th...

Figure 5.6 .

Figure 5.7 Histograms of 5000 random numbers sampled from different distribu...

Figure 5.8 Probability density function (pdf), cumulative distribution funct...

Figure 5.9 Probability mass function (pmf), cumulative distribution function...

Chapter 6

Figure 6.1 Bayesian calculation (medical test example).

Figure 6.2 Example distribution for 10 degrees of freedom.

Figure 6.3 Bletchley Park.

Chapter 7

Figure 7.1 The Literary Digest was the premier literary and political commen...

Figure 7.2 Gallup poll.

Figure 7.3 90% confidence interval; proportion “favorable” on -axis.

Figure 7.4 Histogram of Toyota prices.

Figure 7.5 Histogram of used Toyota Corolla resale values.

Figure 7.6 Normal distribution with mean = 0, SD = 1;...

Figure 7.7 William S. Gossett’s 1908 article.

Figure 7.8 Gossett’s description of his simulation.

Figure 7.9 Gossett’s plot of one simulation.

Figure 7.10 Sampling distribution for a proportion.

Figure 7.11 Bootstrap sampling distribution for the mean price of Toyota Cor...

Figure 7.12 Stratified sampling of the CRIM variable of the boston-housing.c...

Chapter 8

Figure 8.1 The frequency (y-axis) of leading digits (x-axis) in most multi-d...

Figure 8.2 Histogram of one 315-digit resample (bars) compared to the observ...

Figure 8.3 Dot plots for the doughnut experiment.

Figure 8.4 Boxplots for the doughnut experiment.

Figure 8.5 Frequency histogram of resampled variances from doughnut data pro...

Figure 8.6 Resulting ANOVA table from statsmodels.

Figure 8.7 Probability density of (a) and its inverse cumulative density (...

Figure 8.8 Resampling procedure for the marriage therapy data. The observed ...

Figure 8.9 Visualizations for the doughnut experiment. (a) Dot plots, (b) do...

Figure 8.10 Frequency histogram of resampled variances from doughnut data pr...

Chapter 9

Figure 9.1 Delta Wire productivity vs. training.

Figure 9.2 Pulmonary capacity (PEFR) and exposure to cotton dust (years).

Figure 9.3 Baseball payroll vs. total wins, 2006–2008.

Figure 9.4 Baseball histogram of shuffled vector product sums (000).

Figure 9.5 Baseball payroll vs. total wins, 2006–2008; .

Figure 9.6 Pulmonary capacity (PEFR) and exposure to cotton dust (years); ....

Figure 9.7 Delta Wire productivity vs. training; .

Figure 9.8 Resampling distribution of correlation coefficient for baseball u...

Figure 9.9 Murder rates and alphabetical order of states, .

Figure 9.10 Hypothetical data on tax rates and revenue.

Figure 9.11 Distribution of resampled statistics for the Baseball dataset. I...

Figure 9.12 Visualizing correlation between two variables. (a) Heatmap and (...

Chapter 10

Figure 10.1 Slope and intercept of a line.

Figure 10.2 Payroll vs. total wins 2006–2008.

Figure 10.3 Estimated trend line, drawn by eye.

Figure 10.4 Minimizing residuals.

Figure 10.5 Delta Wire hours of training and productivity.

Figure 10.6 Pulmonary capacity (PEFR) and exposure to cotton dust (years).

Figure 10.7 Trend line with negative slope.

Figure 10.8 Payroll residual plot.

Figure 10.9 Delta Wire training residual plot.

Figure 10.10 PEFR residual plot.

Figure 10.11 Assessing the performance of regression.

Figure 10.12 Assessing the performance of regression for prediction.

Figure 10.13 Output of the summary function of the fitted model.

Figure 10.14 Residual plot of the fitted model.

Figure 10.15 Linear regression model fitted to the full dataset set (gray li...

Chapter 11

Figure 11.1 Different terms for variables in regression.

Figure 11.2 Boston Housing data variables.

Figure 11.3 Multiple linear regression: Boston Housing data.

Figure 11.4 Regression output with CRIM*RM interaction.

Figure 11.5 Random x and y coordinates, plotted.

Figure 11.6 Boston Housing, predicted values vs. residuals.

Figure 11.7 QQ-plot for residuals.

Figure 11.8 Regression via resampling—revisiting the PEFR data with a single...

Figure 11.9 Analyzing PEFR regression intercept bootstrapped output.

Figure 11.10 Statsmodels regression output.

Figure 11.11 Multiple linear regression: Boston Housing data (same as Figure...

Figure 11.12 Bootstrapped coefficients for the Boston Housing model.

Figure 11.13 Tayko data, multiple linear regression output.

Figure 11.14 Actual vs. predicted MEDV plots for the main effects model (a) ...

Figure 11.15 Residual plot (a) and QQ-plot (b) for the interaction model.

Figure 11.16 Histograms of the coefficients from the resampling procedure. T...

Figure 11.17 Distribution of resampled coefficients and the 95% confidence i...

Chapter 12

Figure 12.1 Riding Mower, classifying new household (cross) as owner (filled...

Figure 12.2 Finding the nearest single neighbor (k = 1).

Figure 12.3 Five nearest neighbors (k = 5).


Statistics for Data Science and Analytics

 

Peter C. Bruce

Founder, Institute for Statistics Education at Statistics.com

Peter Gedeck

Senior Data Scientist, Collaborative Drug Discovery

Janet Dobbins

Chair, Data Community DC

 

 

 

 

 

Copyright © 2025 by John Wiley & Sons, Inc. All rights reserved, including rights for text and data mining and training of artificial intelligence technologies or similar technologies.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey. Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.

Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates in the United States and other countries and may not be used without written permission. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Cataloging-in-Publication Data Applied for:

Hardback ISBN: 9781394253807

Cover Design: Wiley
Cover Image: © gremlin/Getty Images

 

 

 

To my wife, Liz, whose editorial help and judgment is impeccable—PB

 

To my parents, Helga and Erhard Gedeck, and my sister, Heike, who have always been supportive of my education—PG

 

To my friend Peter, whose guidance on my data science journey has been invaluable, and to my husband Peter, whose encouragement adds joy to every step—JD

About the Authors

Peter C. Bruce is the Founder of the Institute for Statistics Education at Statistics.com. Founded in 2002, Statistics.com was the first educational institution devoted solely to online education in statistics and data science.

Peter Gedeck is a senior data scientist at Collaborative Drug Discovery. He specializes in the development of cloud-based software for managing data in the drug discovery process. In addition, he teaches data science at the University of Virginia.

Janet Dobbins is a leading voice in the Washington, DC, data science community. She is Chair of the Board of Directors for Data Community DC (DC2) and co-organizes the popular Data Science DC meetups. She previously worked at the Institute for Statistics Education at Statistics.com.

Acknowledgments

Julian Simon, an early resampling pioneer, first kindled Peter C. Bruce’s interest in statistics with his permutation and bootstrap approach to statistics, his Resampling Stats software (first released in the late 1970s), and his statistics text on the same subject. Robert Hayden, who co-authored early versions of parts of the text you are reading, was instrumental in getting this project launched.

Michelle Everson has taught many sessions of introductory statistics using versions of this book and has been vigilant in pointing out ambiguities and omissions. Her active participation in the statistics education community has been an asset as we have strived to improve and perfect this text. Meena Badade also teaches using this text and has also been very helpful in bringing to our attention errors and points requiring clarification. Diane Murphy reviewed the latest version of the book with care and contributed many useful corrections and suggestions.

Many students at the Institute for Statistics Education at Statistics.com have helped clarify confusing points and refine this book over the years.

We also thank our editor at Wiley, Brett Kurzman, who shepherded this book through the acceptance and launch process quickly and smoothly. Nandhini Karuppiah, the Managing Editor, helped guide us through the production process, and Govind Nagaraj managed the copyediting.

About the Companion Website

This book is accompanied by a companion website:

www.wiley.com/go/Wiley_Statistics_for_Data

We are happy that you have chosen our book for your course. For instructors who adopt the book, we provide these supplemental materials:

Short answers to exercises in the text

Datasets and Python examples

Videos mentioned in the text

Link to GitHub repository and Jupyter notebooks

Introduction

Statistics and Data Science

As of the writing of this book, the fields of statistics and data science are evolving rapidly to meet the changing needs of business, government, and research organizations. It is an oversimplification, but still useful, to think of two distinct communities as you proceed:

The traditional academic and medical research communities that typically conduct extended research projects adhering to rigorous regulatory or publication standards, and

Businesses and large organizations that use statistical methods to extract value from their data, often on the fly. Reliability and value are more important than academic rigor to this data science community.

Most users of statistical methods now fall in the second category, as those methods are a basic component of what is now called artificial intelligence (AI). However, most of the specific techniques, as well as the language of statistics, had their origin in the first group. As a result, there is a certain amount of “baggage” that is not truly relevant to the data science community. That baggage can sometimes be obscure or confusing and, in this book, we provide guidance on what is or is not important to data science. Another feature of this book is the use of resampling/simulation methods to develop the underpinnings of statistical inference (the most difficult topic in an introductory course) in a transparent and understandable fashion.

We start off with some examples of statistics in action (including two of statistics gone wrong), then dive right in to look at the proper design of studies and account for the possible role of chance. All the standard topics of introductory statistics are here (probability, descriptive statistics, inference, sampling, correlation, etc.), but sometimes they are introduced not as separate standalone topics but rather in the context of the situation in which they are needed.

Accompanying Web Resources

Python code, datasets, some solutions, and other material accompanying this book can be found at https://introductorystatistics.com/.

Python

Python is a general programming language that can be used in many different areas. It is especially popular in the machine learning and data science communities. A wide range of libraries provide efficient solutions for almost every need, from simple one-off scripts to web servers and highly complex scientific applications. As we will see throughout this book, it also has great support for statistics.
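As a first, minimal taste of that support (the numbers here are made up for illustration), Python's standard library already ships a statistics module that covers the basic descriptive measures used in the early chapters:

```python
# Descriptive statistics with the standard library alone; later
# chapters lean on numpy and pandas for heavier lifting.
import statistics

error_reductions = [5, 9, 2, 11, 7, 3, 8]  # hypothetical data

print(statistics.mean(error_reductions))    # arithmetic mean
print(statistics.median(error_reductions))  # middle value when sorted
print(statistics.stdev(error_reductions))   # sample standard deviation
```

Anything beyond this (random sampling, dataframes, model fitting) comes from the packages introduced later in the book.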

You can use Python in many different ways. For most people new to the language, the easiest way to get started is to use Python in Jupyter notebooks (see https://jupyter.org). Jupyter notebooks are documents that contain both code and rich text elements, such as figures, links, equations, etc. Because of this, they are an ideal environment to learn Python and to present your work. You will find notebooks with the example code of this book on our website (https://introductorystatistics.com/).

A great way to get started with Python is to run code on one of the freely accessible cloud computing platforms. Google Colab (https://colab.research.google.com/) has a free tier that is sufficient for all the examples in this book.

An alternative to cloud computing platforms is to install Python locally on your computer. You can download and install different versions of Python from https://www.python.org. However, it is more convenient to use Anaconda (https://www.anaconda.com). Anaconda is a free package manager for Python and R programming languages focusing on scientific computing. It distributes the most popular Python packages for science, mathematics, engineering, and data analysis. We provide detailed installation instructions on our website at https://introductorystatistics.com/.

Using Python with this Book

With some exceptions, this book presents relevant Python code in the second part of each chapter. The book is not an in-depth step-by-step introduction to computer programming as a discipline, but rather it provides the tools you need to implement the statistical procedures that are discussed in this book. Because many of these procedures are based on iterative resampling, rather than simply calculating formulas, you will get useful practice with the data handling and manipulation that is a Python strength. No specific level of Python ability is required to get started. If you are completely new to Python, you could consider launching yourself with a quick self-study guide (easily found on the web), but, in general, you should be able to follow along.
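To preview the iterative-resampling style described above, here is a minimal sketch (with made-up data and illustrative names) of bootstrapping the mean of a small sample rather than reaching for a formula:

```python
# Bootstrap a rough 90% interval for a sample mean by resampling
# with replacement, instead of applying a standard-error formula.
import random

random.seed(1)
sample = [48, 24, 51, 12, 21, 41, 25, 23, 32, 61]  # hypothetical values

boot_means = []
for _ in range(1000):
    # draw a resample of the same size, with replacement
    resample = [random.choice(sample) for _ in sample]
    boot_means.append(sum(resample) / len(resample))

boot_means.sort()
# the 5th and 95th percentiles bracket a rough 90% interval
print(boot_means[50], boot_means[950])
```

The same loop-resample-tabulate pattern recurs throughout the book, which is why comfort with basic Python data handling pays off quickly.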

1 Statistics and Data Science

Statistical methods first came into use before homes had electricity, and had several phases of rapid growth:

The first big boost came from manufacturers and farmers who were able to decrease costs, produce better products, and improve crop yields via statistical experiments.

Similar experiments helped drug companies graduate from snake oil purveyors to makers of scientifically proven remedies.

In the late 20th century, computing power enabled a new class of computationally intensive methods, like the resampling methods that we will study.

In the early decades of the current millennium, organizations discovered that the rapidly growing repositories of data they were collecting (“big data”) could be mined for useful insights.

As with any powerful tool, the more you know about it the better you can apply it and the less likely you will go astray. The lurking dangers are illustrated when you type the phrase “How to lie with...” into a web search engine. The likely autocompletion is “statistics.”

Much of the book that follows deals with important issues that can determine whether data yields meaningful information or not:

How to assess the role that random chance can play in creating apparently interesting results or patterns in data

How to design experiments and surveys to get useful and reliable information

How to formulate simple statistical models to describe relationships between one variable and another

We will start our study in the next chapter with a look at how to design experiments, but, before we dive in, let’s look at some statistical wins and losses from different arenas.

1.1 Big Data: Predicting Pregnancy

In 2010, a statistician from Target described how the company used customer transaction data to make educated guesses about whether customers are pregnant or not. On the strength of these guesses, Target sent out advertising flyers to likely prospects, centered around the needs of pregnant women.

How did Target use data to make those guesses? The key was data used to “train” a statistical model: data in which the outcome of interest—pregnant/not pregnant—was known in advance. Where did Target get such data? The “not pregnant” data was easy—the vast majority of customers are not pregnant, so data on their purchases is easy to come by. The “pregnant” data came from a baby shower registry. Both datasets were quite large, containing lists of items purchased by thousands of customers.

Some clues are obvious—the purchase of a crib and baby clothes is a dead giveaway. But, from Target’s perspective, by the time a customer purchased these obvious big-ticket items, it was too late—they had already chosen their shopping venue. Target wanted to reach customers earlier, before they decided where to do their shopping for the big day. For that, Target used statistical modeling to make use of non-obvious patterns in the data that distinguish pregnant from non-pregnant customers. One clue that emerged was a shift in the pattern of supplement purchases—e.g. a customer who was not buying supplements 60 days ago but is buying them now.
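Target's actual model is proprietary, but the train-on-labeled-outcomes idea can be sketched in a toy form (the baskets and item names below are entirely made up): tally how often each item appears in the labeled "pregnant" versus "not pregnant" purchase histories, then score a new basket against those tallies.

```python
# Toy illustration of learning from labeled purchase data:
# per-group item frequencies act as a crude trained model.
from collections import Counter

pregnant_baskets = [["prenatal_vitamins", "lotion"], ["lotion", "zinc"]]
other_baskets = [["soda", "chips"], ["lotion", "soda"], ["chips", "zinc"]]

preg_counts = Counter(item for b in pregnant_baskets for item in b)
other_counts = Counter(item for b in other_baskets for item in b)

def score(basket):
    # positive score: purchases look more like the "pregnant" group
    return sum(
        preg_counts[item] / len(pregnant_baskets)
        - other_counts[item] / len(other_baskets)
        for item in basket
    )

print(score(["prenatal_vitamins", "lotion"]))  # positive: leans "pregnant"
print(score(["soda", "chips"]))                # negative: leans "not pregnant"
```

Real classifiers (introduced in Chapter 12) are more principled, but the ingredients are the same: labeled training data and patterns that separate the two outcomes.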

1.2 Phantom Protection from Vitamin E

In 1993, researchers examining a database on nurses’ health found that nurses who took vitamin E supplements had 30% to 40% fewer heart attacks than those who didn’t. These data fit with theories that antioxidants such as vitamins E and C could slow damaging processes within the body. Linus Pauling, winner of the Nobel Prize in Chemistry in 1954, was a major proponent of these theories, which were one driver of the nutritional supplements industry.

However, the heart health benefits of vitamin E turned out to be illusory. A study completed in 2007 divided 14,641 male physicians randomly into four groups:

Take 268 mg of vitamin E every other day

Take 500 mg of vitamin C every day

Take both vitamin E and C

Take a placebo

Those who took vitamin E fared no better than those who did not. Since random assignment made vitamin E the only systematic difference between the groups, a genuine vitamin E effect would have shown up. Several meta-analyses, which are consolidated reviews of the results of multiple published studies, have reached the same conclusion. One found that vitamin E at the above dosage might even increase mortality.

What happened to make the researchers in 1993 think they had found a link between vitamin E and disease inhibition? In reviewing a vast quantity of data, researchers thought they saw an interesting association. In retrospect, with the benefit of a well-designed experiment, it appears that this association was merely a chance coincidence. Unfortunately, coincidences happen all the time in life. In fact, they happen to a greater extent than we think possible.

1.3 Statistician, Heal Thyself

In 1993, Mathsoft Corp., the developer of Mathcad mathematical software, acquired StatSci, the developer of S-PLUS statistical software, the precursor to R. Mathcad was an affordable tool popular with engineers—prices were in the hundreds of dollars and the number of users was in the hundreds of thousands. S-PLUS was a high-end graphical and statistical tool used primarily by statisticians—prices were in the thousands of dollars and the number of users was in the thousands.

In looking to boost revenues, Mathsoft turned to an established marketing principle—cross-selling. In other words, try to convince the people who bought product A to buy product B. With the acquisition of a highly regarded niche product, S-PLUS, and an existing large customer base for Mathcad, Mathsoft decided that the logical thing to do would be to ramp up S-PLUS sales via direct mail to its installed Mathcad user base. It also decided to purchase lists of similar prospective customers for both Mathcad and S-PLUS.

This major mailing program boosted revenues, but it boosted expenses even more. The company lost over $13 million in 1993 and 1994 combined—significant numbers for a company that had only $11 million in 1992 revenue.

What happened?

In retrospect, it was clear that the mailings were not well targeted. The costs of the unopened mail exceeded the revenue from the few recipients who did respond. Mathcad users turned out not to be likely users of S-PLUS. The huge losses could have been avoided through the use of two common statistical techniques:

Doing a test mailing to the various lists being considered to (1) determine whether the list is productive and (2) test different headlines, copy, pricing, etc., to see what works best.

Using predictive modeling techniques to identify which names on a list are most likely to turn into customers.
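The value of a test mailing can be seen with simple break-even arithmetic: a full mailing pays off only if the response rate observed in the test exceeds the cost per piece divided by the revenue per sale. The cost and revenue figures below are invented assumptions, not Mathsoft's actual numbers.

```python
# Break-even response rate for a direct mailing (invented figures)
cost_per_piece = 2.00      # printing + postage per piece, assumed
revenue_per_sale = 400.00  # net revenue per responding customer, assumed

break_even_rate = cost_per_piece / revenue_per_sale
print(f"Break-even response rate: {break_even_rate:.2%}")  # 0.50%
```

A small test mailing reveals whether a list's actual response rate clears this bar before the company commits to mailing the entire list.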

1.4 Identifying Terrorists in Airports

Since the September 11, 2001, Al Qaeda attacks in the United States and subsequent attacks elsewhere, security screening programs at airports have become a major undertaking, costing billions of dollars per year in the United States alone. Most of these resources are consumed in an exhaustive screening process. All passengers and their tickets are reviewed, their baggage is screened and individuals pass through detectors of varying sophistication. An individual and his or her bag can only receive a limited amount of attention in a screening process that is applied to everyone. The process is largely the same for each individual. Potential terrorists can see the process and its workings in detail and identify weaknesses.

To improve the effectiveness of the system, security officials have studied ways of focusing more concentrated attention on a small number of travelers. In the years after the attacks, one technique used enhanced screening for a limited number of randomly selected travelers. While it adds some uncertainty to the screening process, which acts as a deterrent to attackers, random selection does nothing to focus attention on high-risk individuals.

Determining who is at high-risk is, of course, the problem. How do you know who the high-risk passengers are?

One method is passenger profiling—specifying some guidelines about what passenger characteristics merit special attention. These characteristics were determined by a reasoned, logical approach. For example, purchasing a ticket for cash, as the 2001 hijackers did, raises a red flag. The Transportation Security Administration trains a cadre of Behavior Detection Officers. The Administration also maintains a specific no-fly list of individuals who trigger special screening.

There are several problems with the profiling and no-fly approaches.

Profiling can generate backlash and controversy because it comes close to stereotyping. American National Public Radio commentator Juan Williams was fired when he made an offhand comment to the effect that he would be nervous about boarding an aircraft in the company of people in full Muslim garb.

Profiling, since it tends to merge with stereotyping and is based on logic and reason, enables terrorist organizations to engineer attackers who do not meet the profile criteria.

No-fly lists are imprecise (a name may match thousands of individuals) and often erroneous. Senator Edward Kennedy was once pulled aside because he supposedly showed up on a no-fly list.

An alternative or supplemental approach is a statistical one—separate out passengers who are “different” for additional screening, where “different” is defined quantitatively across many variables that are not made known to the public. The statistical term is “outlier.” Different does not necessarily prove that the person is a terrorist threat, but the theory is that outliers may have a higher threat probability. Turning the work over to a statistical algorithm mitigates some of the controversy around profiling, since security officers would lack the authority to make discretionary decisions.

Defining “different” requires a statistical measure of distance, which we will learn more about later.
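As a preview, here is one simple, entirely hypothetical way to quantify "different": standardize each variable to z-scores and compute each traveler's distance from the average traveler. The variables and values are invented for illustration; real screening systems use many more variables that are not made public.

```python
from statistics import mean, stdev

# Hypothetical screening variables for five travelers:
# (ticket price in $, days booked in advance, bags checked)
travelers = [
    (300, 30, 1),
    (320, 25, 2),
    (280, 40, 1),
    (310, 35, 2),
    (900,  0, 0),   # unusual on every variable
]

# Standardize each variable, then measure each traveler's distance
# from the "average" traveler; a large distance marks an outlier.
cols = list(zip(*travelers))
means = [mean(c) for c in cols]
sds = [stdev(c) for c in cols]

def distance(row):
    return sum(((x - m) / s) ** 2
               for x, m, s in zip(row, means, sds)) ** 0.5

distances = [distance(t) for t in travelers]
print(distances.index(max(distances)))  # prints 4: the unusual traveler
```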

1.5 Looking Ahead

We’ll be studying many things in this book, but several important themes will be

Learning more about random processes and statistical tools that will help quantify the role of chance and distinguish real phenomena from chance coincidence.

Learning how to design experiments and studies that can provide more definitive answers to questions such as whether a medical therapy works, which marketing message generates a better response, and which management technique or industrial process produces fewer errors.

Learning how to specify and interpret statistical models that describe the relationship between two variables, or between a response variable and several “predictor” variables, in order to:

Explain/understand phenomena and answer research questions (“What factors contribute to a drug’s success, or the response to a marketing message?”).

Make predictions (“Will a given subscriber leave this year?” “Is a given insurance claim fraudulent?”)

1.6 Big Data and Statisticians

Before the turn of the millennium, by and large, statisticians did not have to be too concerned with programming languages, SQL queries, and the management of data. Database administration and data storage in general was someone else’s job, and statisticians would obtain or get handed data to work on and analyze. A statistician might, for example,

Direct the design of a clinical trial to determine the efficacy of a new therapy

Help a researcher determine how many subjects to enroll in a study

Analyze data to prepare for legal testimony

Conduct sample surveys and analyze the results

Help a scientist analyze data that comes out of a study

Help an engineer improve an industrial process

All of these tasks involve examining data, but the number of records was likely to be in the hundreds or thousands at most, and the challenge of obtaining the data and preparing it for analysis was not overwhelming. So the task of obtaining the data could safely be left to others.

1.6.1 Data Scientists

The advent of big data has changed things. The explosion of data means that more interesting things can be done with data, and they are often done in real time or on a rapid turnaround schedule. FICO, the credit-scoring company, uses statistical models to predict credit card fraud, collecting customer data, merchant data, and transaction data 24 hours a day. FICO has more than two billion customer accounts to protect, so it is easy to see that this statistical modeling is a massive undertaking.

The science of computer programming and details of database administration lie beyond the scope of this book, but these fields now lie within the scope of statistical work. The statistician must be conversant with the data, as well as how to get it and work with it.

Statisticians are increasingly asked to plug their statistical models into big data environments, where the challenge of wrangling and preparing analyzable data is paramount, and requires both programming and database skills.

Programmers and database administrators are increasingly interested in adding statistical methods to their toolkits, as companies realize that their databases possess value that is strategic, not just administrative, and goes well beyond the original reason for collecting the data.

Around 2010, the term data scientist came into use to describe analysts who combined these two sets of skills. Job announcements now carry the term data scientist with greater frequency than the term statistician, reflecting the importance that organizations attach to managing, manipulating, and obtaining value out of their vast and rapidly growing quantities of data.

We close this chapter with a probability experiment:

Try It Yourself

Write down a series of 50 random coin flips without actually flipping the coins. That is, write down a series of 50 made-up H’s and T’s selected in such a way that they appear random.

Now actually flip a coin 50 times.

If you compare the made-up series to the real tosses, the longest streaks of either H or T generally occur in the ACTUAL tosses. When people are asked to make up random tosses, they will rarely “allow” more than four H’s or T’s in a row. By the time they have written down four H’s in a row, they think it is time to switch over to T, or else the series would not appear random. By contrast, instructors who teach this exercise in class often see a streak of 8 T’s or H’s in a row in the real tosses. Most people think that such a streak is not random, and yet it clearly is.
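You can check this claim by simulation. A short Python sketch counts how often a genuinely random series of 50 flips contains a streak of five or more identical outcomes:

```python
import random

random.seed(1)  # for reproducibility

def longest_run(flips):
    """Length of the longest streak of identical outcomes."""
    best = run = 1
    for prev, cur in zip(flips, flips[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best

# How often does a random series of 50 flips contain a streak
# of 5 or more? (Made-up series rarely do.)
n_series = 10_000
hits = sum(longest_run(random.choices("HT", k=50)) >= 5
           for _ in range(n_series))
print(f"Proportion with a streak of 5+: {hits / n_series:.2f}")
```

Running this shows that long streaks are the rule, not the exception, in genuinely random sequences.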

In 1913, a roulette wheel at the Monte Carlo casino landed on black 26 times in a row. As the streak developed, gamblers, convinced that the wheel would most certainly have to end the streak, increasingly bet heavily on red—they lost millions.

The message here is that random variation reliably produces patterns that appear non-random.

Why is this significant? Just as with coin tosses, there is a significant component of random variation (engineers call it noise) in the data that routinely flow through life—whether business life, government affairs, the education world, or personal life. So much data…so much random variation…how do we know what is real and what is random?

We can’t know for certain, though we do know that random behavior can appear to be real. One purpose of this book is to teach you about probability, help you evaluate the potential random component in data, and provide ways of modeling it. This gives us a benchmark against which to measure patterns, and lets us form educated guesses about whether observed events or patterns of interest might really be due to chance. When we understand randomness better, we can curb the tendency to chase after random patterns, and produce more reliable analyses of data.

2Designing and Carrying Out a Statistical Study

In this chapter, we study random behavior and how it can fool us, and we learn how to design studies to gain useful and reliable information. After completing this chapter, you should be able to

Use coin flips to replicate random processes, and interpret the results of coin-flipping experiments

Use an informal working definition of probability

Define, intuitively, p-value

Describe the different data formats you will encounter, including relational database and flat file formats

Describe the difference between data encountered in traditional statistical research, and “big data”

Explain the use of treatment and control groups in experiments

Define statistical bias

Explain the role of randomization and blinding in assigning subjects in a study

Explain the difference between observational studies and experiments

Design a statistical study following basic principles

2.1 Statistical Science

“It’s not what you don’t know that hurts you, it’s what you know for sure that ain’t so.” (Will Rogers, American humorist)


Nearly all large organizations now have huge stores of customer and other data that they mine for insight, in hopes of boosting revenue or reducing costs. In the academic world, over five million research articles are published per year in scholarly and scientific journals. These activities afford ample opportunity to dive into the data, and discover things that aren’t true, particularly when the diving is done automatically and at a large scale. Statistical methods play a large role in this extraction of meaning from data. However, the science of statistics also provides tools to study data more carefully, and distinguish what’s true from what ain’t so.

2.2 Big Data

In most organizations today, raw data are plentiful (often too plentiful), and this is a two-edged sword.

Huge amounts of data make prediction possible in circumstances where small amounts of data don’t help. One type of recommendation system, for example, needs to process millions of transactions to locate transactions with the same item you are looking at—enough so that reliable information about associated items can be deduced.

On the other hand, huge data flows and incorrect data can obscure meaningful patterns in the data, and generate false ones. Useful data are often difficult and expensive to gather. We need to find ways to get the most information, and the most accurate information, for each dollar spent in assembling and preparing data.

2.3 Data Science

The terms big data, machine learning, data science, and artificial intelligence (AI) often go together, and bring different things to mind for different people. The term artificial intelligence, particularly with the advent of Chat-GPT and generative AI, suggests almost magical methods that approach human-like cognition capabilities. Privacy-minded individuals may think of large corporations or spy agencies combing through petabytes of personal data in hopes of locating tidbits of information that are interesting or useful. Analysts may focus on statistical and machine learning models that can predict an unknown value of interest (loan default, acceptance of a sales offer, filing a fraudulent insurance claim or tax return, for example).

Statistical science, by contrast, has well over a century of history, and its methods were originally tailored to data that were small and well-structured. However, it is an important contributor to the field of data science which, when it is well practiced, is not just aimless trolling for patterns, but starts out with questions of interest:

What additional product should we recommend to a customer?

Which price will generate more revenue?

Does the MRI show a malignancy?

Is a customer likely to terminate a subscription?

All these questions require some understanding of random behavior and all benefit from an understanding of the principles of well-designed statistical studies, so this is where we will start.

2.4 Example: Hospital Errors

Healthcare accounts for about 18% of the United States GDP (as of 2024), is a regular subject of political controversy and proposals for reform, and produces enormous amounts of data and analysis. One area of study is the problem of medical errors—violations of the Hippocratic oath’s “do no harm” provision. Millions of hospitalized patients each year around the world are affected by treatment errors (mostly medication errors). A 2017 report from the National Institutes of Health (NIH) in the U.S. estimated that 250,000 deaths per year resulted from medical errors. There are various approaches to dealing with the problem.


Clinical Decision Support systems (CDS) are used to guide practitioners in diagnosis and treatment, and can provide rule-based alerts when standard treatment protocols are violated. However, all those rules must be programmed and kept up-to-date in an extremely complex medical environment. Many false alarms result, which can cause practitioners to ignore the alerts. Recent advances in machine learning have enabled systems that learn on their own to provide alerts, without experts having to program rules. These systems allow for the correction of errors once they occur, but what about identifying the causes of errors and reducing their frequency?

One obvious and uncontroversial innovation has been to promote the use of checklists to reduce errors of omission. Other ideas may not be so obvious. No-fault error reporting has been proposed, in which staff are encouraged to report all errors, both their own and those committed by others, without fear of punishment. This could have the benefit of generating better information about errors and their sources, but could also hinder accountability efforts. How could you find out whether such a program really works? The answer: a well-designed statistical study.

2.5 Experiment

To tie together our study of statistics we will look at an experiment designed to test whether no-fault reporting of all hospital errors reduces major errors in hospitals (errors resulting in further hospitalization, serious complications, or even death). An experiment like this was conducted by a hospital in Quebec, Canada, but it was too small to provide definitive conclusions. For illustrative purposes, we will look at hypothetical data that a larger study might have produced.

Experiments are used in industry, medicine, social science, and data science. The ubiquitous A/B test (more on that later) is an experiment. The key feature of an experiment is that the investigator manipulates some variable that is believed to affect an outcome of interest, in order to demonstrate the importance and effect (or lack thereof) of the variable. This stands in contrast to a survey or other analysis of existing data, where the analyst simply collects and analyzes data. For example, in a web experiment, the marketing investigator might try out a new product price to see how it affects sales.

Experiments can be uncontrolled or controlled. In an uncontrolled experiment, the investigator collects data on the group or time period for which the variable of interest has been changed. In a web experiment, for example, the price of a product might be increased by 25%, and then sales compared to prior sales.

Experiment vs. Observational Study

In the fifth inning of the third game of the 1932 baseball World Series between the NY Yankees and the Chicago Cubs, the great slugger Babe Ruth came to bat and pointed towards center field, as if to indicate that he planned to hit the next pitch there. On the next pitch, he indeed hit the ball for a home run into the center field bleachers.[a]

A Babe Ruth home run was an impressive feat, but not that uncommon. He hit one every 11.8 at-bats. What made this one so special is that he predicted it. In statistical terms, he specified in advance a theory about a future event—the next swing of the bat—and an outcome of interest—a home run to center field.

In statistics, we make an important distinction between studying pre-existing data (an observational study) and collecting data to answer a pre-specified question (an experiment or prospective study). The most impressive and durable results in science come when the researcher specifies a question in advance, then collects data in a well-designed experiment to answer the question. Offering commentary on the past can be helpful, but is no match for predicting the future.

[a] There is some controversy about whether he actually pointed to center field or to left field and whether he was foreshadowing a prospective home run or taunting Cubs players. You can Google the incident (“Babe Ruth called shot”) and study videos on YouTube, then judge for yourself.

2.6 Designing an Experiment

The problem with an uncontrolled experiment is the uncertainty involved in the comparison. Suppose sales drop 10% in the web experiment with the new price. Can you be sure that nothing else has changed since the experiment started? Probably not—companies are making modifications and trying new things all the time. Hence, the need for a control group.

In a controlled experiment, two groups are used and they are handled in the same way, except that one is given the treatment (e.g. the increased price), and the other is not given the treatment. In this way, we can eliminate the confounding effect of other factors not being studied.

2.6.1 A/B Tests; A Controlled Experiment for the Hospital Plans

In our errors experiment, we could compare two groups of hospitals. One group uses the no-fault plan and one does not. The group that gets the change in treatment you wish to study (here, the no-fault plan) is called the treatment group. The group that gets no treatment or the standard treatment is called the control group.

An experiment like this, testing a control group vs. a treatment group, is also called an A/B test, particularly in the field of marketing, where one web treatment might be tested against another. Sometimes, particularly in marketing, there might not be an established control scenario and we are simply comparing one proposed new treatment against another proposed new treatment (e.g. two different web pages).

How do you decide which hospitals go into which group?

You would like the two groups to be similar to one another, except for the treatment/control difference. That way, if the treatment group does turn out to have fewer errors, you can be confident that it was due to the treatment. One way to do this would be to study all the hospitals in detail, examine all their relevant characteristics, and assign them to treatment/control in such a way that the two groups end up being similar across all these attributes. There are two problems with this approach.

It is usually not possible to think of all the relevant characteristics that might affect the outcome. Research is replete with the discovery of factors that were unknown prior to the study or thought to be unimportant.

The researcher, who has a stake in the outcome of the experiment, may consciously or unconsciously assign hospitals in a way that enhances the chances of the success of their pet theory.

Oddly enough, the best strategy is to assign hospitals randomly: for example, by tossing a coin.

2.6.2 Randomizing

True random assignment eliminates both conscious and unconscious bias in the assignment to groups. It does not guarantee that the groups will be equal in all respects. However, it does guarantee that any departure from equality is due simply to the chance allocation, and the larger the samples, the smaller those chance differences will tend to be. With extremely large samples, differences due to chance virtually disappear, and you are left with differences that are real—provided the assignment to groups is really random.

Random assignment lets us make the claim that any difference in group outcomes that is more than might reasonably happen by chance is, in fact, due to the different treatment of the groups. The study of probability in this book lets us quantify the role that chance can play and take it into account.

We can imagine an experiment in which both groups got the same treatment. We would expect to see some differences from one hospital to another. An everyday example of this might be tossing a coin. If you toss a coin 10 times you will get a certain number of heads. Do it again and you will probably get a different number of heads.

Though the results vary, there are laws of chance that allow you to calculate things like how many heads you would expect on average or how much the results would vary from one set of 10 (or 100 or 1000) tosses to the next. If we assign subjects at random, we can use these same laws of chance—or a lot of coin tosses—to analyze our results.
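This set-to-set variability is easy to see in a simulation: repeat the "toss a coin 10 times" experiment many times and record the number of heads in each set. The counts vary, but their average settles close to 5.

```python
import random

random.seed(2)  # for reproducibility

# Repeat the 10-toss experiment many times and record heads per set
n_repeats = 10_000
heads_counts = [sum(random.random() < 0.5 for _ in range(10))
                for _ in range(n_repeats)]

mean_heads = sum(heads_counts) / n_repeats
print("First five results:", heads_counts[:5])
print(f"Average heads per set of 10: {mean_heads:.2f}")
```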

If we have Doctor Jones assign subjects using her own best judgment, we will have no knowledge of the (often subconscious) factors that influence assignment. These factors may bias assignment so that we can no longer say that the only thing (besides chance assignment) distinguishing the treatment and control groups is the treatment. Random assignment is not always possible—for example, it is rarely feasible to assign elementary school students at random to two different teaching methods while holding everything else in the educational setting constant.

Randomization can be difficult or impossible in some situations, but it is relatively easy in the A/B testing that is popular in digital marketing. Web visitors can be easily randomized to one web page or another; email recipients can easily be assigned randomly to one version or another of an email.

2.6.3 Planning

You need some hospitals and you estimate you can find about 100 within a reasonable distance. You will probably need to present a plan for your study to the hospitals to get their approval. That seems like a nuisance, but they cannot let just anyone do any study they please on the patients.1 In addition to writing a plan to get approval, you know that one of the biggest problems in interpreting studies is that many are poorly designed. You want to avoid that, so you think carefully about your plan and ask others for advice. It would be good to talk to a statistician with experience in medical work. Your plan is to ask the 100 or so available hospitals if they are willing to join your study. They have a right to say no. You hope quite a few will say yes. In particular, you hope to recruit 50 willing hospitals and randomly assign them to treatment and control.

Try It Yourself

How exactly would you assign hospitals randomly? Think about options for the scenario where it doesn’t matter if the groups are exactly equal-sized, and for the scenario where you want two groups of equal size.
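One possible approach, sketched in Python with hypothetical hospital names: flip a coin for each hospital when unequal group sizes are acceptable, or shuffle the list and split it in half when the groups must be exactly equal.

```python
import random

random.seed(3)  # for reproducibility
hospitals = [f"Hospital {i}" for i in range(1, 51)]  # 50 hypothetical names

# Option 1: a coin flip per hospital (group sizes may differ slightly)
coin_assignment = {h: random.choice(["treatment", "control"])
                   for h in hospitals}

# Option 2: shuffle the list and split it in half (exactly 25 per group)
shuffled = hospitals[:]
random.shuffle(shuffled)
treatment, control = shuffled[:25], shuffled[25:]
```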

Figure 2.1 Dart throws off-target in consistent fashion (biased).

Source: Peter Bruce (Book Author).

2.6.4 Bias

Randomization is used to try to make the two groups similar at the beginning. It is important to keep them as similar as possible during the experiment; we want to be sure the treatment is the only difference between them. Any difference in outcome due to non-random extraneous factors is a form of bias. Statistical bias is a technical concept, though it overlaps with the everyday sense of the word, which refers to people’s opinions or states of mind.

Definitions: Bias

Statistical bias is the tendency for an estimate, model, or procedure to yield results that are consistently off-target for a specific purpose (as in Fig. 2.1).

For example, the mean (average) income for a region might not be a good estimate for the income of a typical resident, if part of the region is home to a small number of very high-income residents. Their incomes would likely raise the average above that of most typical residents (i.e. ones selected at random). Another example is the sights on a long-range rifle. Gravity pulls a bullet downward in flight, and the more distant the target, the greater the drop, so the point of aim in the sights is consistently biased upward compared to where the bullet lands.
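The income example can be illustrated with a few invented numbers. The median, which is less affected by a handful of very high incomes, stays close to a typical resident while the mean is pulled far above most of them.

```python
from statistics import mean, median

# Hypothetical incomes in $1000s: most residents earn 40-80,
# but two very high earners pull the mean upward.
incomes = [40, 45, 50, 55, 60, 65, 70, 80, 900, 1500]

print(f"Mean income:   {mean(incomes)}")    # 286.5 -- above 8 of 10 residents
print(f"Median income: {median(incomes)}")  # 62.5 -- a more typical value
```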

Bias can often creep into a study when humans are involved, either as subjects or experimenters. For one thing, subject behavior can be changed by the fact that they are participating in a study. Experience has also shown that people respond positively to attention, and just being part of a study may cause subjects to change. A positive response to the attention of being in a study is called the Hawthorne effect. Awareness of an issue can significantly affect perceptions, which is why potential jurors in a trial are asked if they have seen news coverage of a case at issue.

Out-of-Control Toyotas?