Applied Regression Modeling

Iain Pardoe
Description

Master the fundamentals of regression without learning calculus with this one-stop resource.

The newly and thoroughly revised 3rd Edition of Applied Regression Modeling delivers a concise but comprehensive treatment of the application of statistical regression analysis for those with little or no background in calculus. Accomplished instructor and author Dr. Iain Pardoe has reworked many of the more challenging topics, included learning outcomes and additional end-of-chapter exercises, and added coverage of several brand-new topics, including multiple linear regression using matrices. The methods described in the text are clearly illustrated with multi-format datasets available on the book's supplementary website.

In addition to a thorough explanation of foundational regression techniques, the book introduces modeling extensions that illustrate advanced regression strategies, including model building, logistic regression, Poisson regression, discrete choice models, multilevel models, Bayesian modeling, and time series forecasting. Illustrations, graphs, and computer software output appear throughout the book to assist readers in understanding and retaining the more complex content.

Applied Regression Modeling covers a wide variety of topics, like:

* Simple linear regression models, including the least squares criterion, how to evaluate model fit, and estimation/prediction

* Multiple linear regression, including testing regression parameters, checking model assumptions graphically, and testing model assumptions numerically

* Regression model building, including predictor and response variable transformations, qualitative predictors, and regression pitfalls

* Three fully described case studies, including one each on home prices, vehicle fuel efficiency, and pharmaceutical patches

Perfect for students of any undergraduate statistics course in which regression analysis is a main focus, Applied Regression Modeling also belongs on the bookshelves of non-statistics graduate students, including MBAs, and of students in vocational, professional, and applied courses like data science and machine learning.




Table of Contents

Cover

Applied Regression Modeling

Copyright

Dedication

Preface

Acknowledgments

Introduction

About the Companion Website

Chapter 1: Foundations

1.1 Identifying and Summarizing Data

1.2 Population Distributions

1.3 Selecting Individuals at Random—Probability

1.4 Random Sampling

1.5 Interval Estimation

1.6 Hypothesis Testing

1.7 Random Errors and Prediction

1.8 Chapter Summary

Chapter 2: Simple Linear Regression

2.1 Probability Model for X and Y

2.2 Least Squares Criterion

2.3 Model Evaluation

2.4 Model Assumptions

2.5 Model Interpretation

2.6 Estimation and Prediction

2.7 Chapter Summary

Chapter 3: Multiple Linear Regression

3.1 Probability Model for (X1, X2, …) and Y

3.2 Least Squares Criterion

3.3 Model Evaluation

3.4 Model Assumptions

3.5 Model Interpretation

3.6 Estimation and Prediction

3.7 Chapter Summary

Chapter 4: Regression Model Building I

4.1 Transformations

4.2 Interactions

4.3 Qualitative Predictors

4.4 Chapter Summary

Chapter 5: Regression Model Building II

5.1 Influential Points

5.2 Regression Pitfalls

5.3 Model Building Guidelines

5.4 Model Selection

5.5 Model Interpretation Using Graphics

5.6 Chapter Summary

Bibliography

Glossary

Index

End User License Agreement

List of Tables

Chapter 3

Table 3.1 Shipping data with response variable weekly labor hours and four p...

Chapter 4

Table 4.1 TV commercial data: spending in $m, millions of retained impress...

Table 4.2 Car data with city miles per gallon, engine size in liters, for ...

Chapter 5

Table 5.1 Car data with miles per gallon, size (l), of cylinders, pass...

Table 5.2 Computer component data.

Table 5.3 Simulated dataset containing missing values.

Table 5.4 Some automated model selection results for the SIMULATE data file.

Table 5.5 Credit card data to illustrate model interpretation using predictor...

List of Illustrations

Chapter 1

Figure 1.1 Histogram for home prices example.

Figure 1.2 Histogram for a simulated population of sale prices, together w...

Figure 1.3 Standard normal density curve together with a shaded area of be...

Figure 1.4 QQ‐plot for the home prices example.

Figure 1.5 The central limit theorem in action. The upper density curve (a) ...

Figure 1.6 Home prices example—density curve for the t‐distribution with d...

Figure 1.7 Relationships between critical values, significance levels, test ...

Figure 1.8 Relationships between critical values, significance levels, test ...

Chapter 2

Figure 2.1 (a)–(d) Different kinds of association between sale price and flo...

Figure 2.2 Scatterplot showing the simple linear regression model for the ho...

Figure 2.3 Linear equation for the simple linear regression model.

Figure 2.4 Illustration of the least squares criterion for the simple linear...

Figure 2.5 Simple linear regression model fitted to sample data for the home...

Figure 2.6 How well does the model fit each dataset?

Figure 2.7 Interpretation of the regression standard error for simple linear...

Figure 2.8 Measures of variation used to derive the coefficient of determina...

Figure 2.9 Examples of values for a variety of scatterplots.

Figure 2.10 Examples of correlation values and corresponding values for a ...

Figure 2.11 Simple linear regression model fitted to hypothetical population...

Figure 2.12 Illustration of the sampling distribution of the slope for the s...

Figure 2.13 Scatterplot illustrating random error probability distributions....

Figure 2.14 Examples of residual plots for which the four simple linear regr...

Figure 2.15 Examples of residual plots for which the four simple linear regr...

Figure 2.16 Examples of histograms of residuals for which the normality regr...

Figure 2.17 Examples of QQ‐plots of residuals for which the normality regres...

Figure 2.18 Simple linear regression model for the home prices–floor size ex...

Figure 2.19 Scatterplot illustrating confidence intervals for the mean, , a...

Figure 2.20 Scatterplot of standing height (in cm) and upper arm length ...

Figure 2.21 Residual plot for the body measurements example.

Figure 2.22 Histogram and QQ‐plot of residuals for the body measurements exa...

Figure 2.23 Scatterplot of versus for the body measurements example with...

Figure 2.24 Examples of residual plots for Problem 14.

Figure 2.25 Examples of QQ‐plots for Problem 14.

Chapter 3

Figure 3.1 Multiple linear regression model with two predictors fitted to a ...

Figure 3.2 Scatterplot matrix for the home prices example.

Figure 3.3 Scatterplot of simulated data with low correlation between and ...

Figure 3.4 Scatterplot of simulated data with high correlation between and...

Figure 3.5 Relationships between critical values, significance levels, test ...

Figure 3.6 Scatterplot of simulated data with low correlation between and ...

Figure 3.7 Scatterplot matrix for simulated data with high correlation betwe...

Figure 3.8 Residual plots for the MLRA example, with model in (a) and mode...

Figure 3.9 Model residual plots for the MLRA example. Moving across each p...

Figure 3.10 Histogram and QQ‐plot of the model residuals for the MLRA exam...

Chapter 4

Figure 4.1 Scatterplot of versus for the TV commercial example with fitt...

Figure 4.2 Scatterplot of versus for the TV commercial example with fitt...

Figure 4.3 Predictor effect plot of versus for the TV commercial example...

Figure 4.4 Histograms of (a) and (b) for the TV commercial dataset.

Figure 4.5 Histograms of (a) and (b) for a simulated dataset.

Figure 4.6 Scatterplot of versus for the home prices–age example with fi...

Figure 4.7 Scatterplot of versus for the cars example with a fitted line...

Figure 4.8 Scatterplot of versus for the cars example with a fitted line...

Figure 4.9 Predictor effect plot of versus for the cars example.

Figure 4.10 Scatterplot of versus for the work experience example with a...

Figure 4.11 Scatterplot of versus for the work experience example with a...

Figure 4.12 Scatterplot of versus for the home taxes example with a fit...

Figure 4.13 Scatterplot of versus for the home taxes example with a fitt...

Figure 4.14 Scatterplot of versus with the points marked according to th...

Figure 4.15 Scatterplot of versus with points marked by for the sales–...

Figure 4.16 Scatterplot of versus with the points marked according to ge...

Figure 4.17 Scatterplot of versus with the points marked according to ge...

Figure 4.18 Scatterplot of versus with the points marked according to ca...

Chapter 5

Figure 5.1 Histogram of studentized residuals from the model fit to all the ...

Figure 5.2 Histogram of studentized residuals from the model fit to the CARS ...

Figure 5.3 Scatterplot of leverage versus ID number for the model fit to the...

Figure 5.4 Scatterplot of Cook's distance versus ID number for the model fit...

Figure 5.5 Scatterplot of Cook's distance versus ID number for the model fit...

Figure 5.6 Scatterplot of studentized residuals versus fitted values for the...

Figure 5.7 Scatterplot of studentized residuals versus fitted values for the...

Figure 5.8 Scatterplot of the residuals from the model versus for the un...

Figure 5.9 Scatterplot of the residuals versus for the unemployment model ...

Figure 5.10 Scatterplot of versus for the computer components example.

Figure 5.11 Scatterplot of versus with points marked by for the comput...

Figure 5.12 Scatterplots of versus (a) with the fitted line from the cor...

Figure 5.13 Scatterplot of in hundreds of units versus hours in hundreds...

Figure 5.14 Predictor effect plot for in the credit card example. is on ...

Figure 5.15 Predictor effect plot for in the credit card example. effect...

Guide

Cover Page

Title Page

Copyright

Dedication

Preface

Acknowledgments

Introduction

About the Companion Website

Table of Contents

Begin Reading

Bibliography

Glossary

Index

WILEY END USER LICENSE AGREEMENT


Applied Regression Modeling

Third Edition

Iain Pardoe

Thompson Rivers University

The Pennsylvania State University

Copyright

This edition first published 2021

© 2021 John Wiley & Sons, Inc

Edition History

Second Edition, 2012, John Wiley & Sons, Inc

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.

The right of Iain Pardoe to be identified as the author of this work has been asserted in accordance with law.

Registered Office

John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA

Editorial Office

111 River Street, Hoboken, NJ 07030, USA

For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.

Wiley also publishes its books in a variety of electronic formats and by print‐on‐demand. Some content that appears in standard print versions of this book may not be available in other formats.

Limit of Liability/Disclaimer of Warranty

While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

Library of Congress Cataloging‐in‐Publication Data

Names: Pardoe, Iain, 1970– author.

Title: Applied regression modeling / Iain Pardoe, Thompson Rivers University, The Pennsylvania State University.

Description: Third edition. | Hoboken, New Jersey : Wiley, [2020] | Includes bibliographical references and index.

Identifiers: LCCN 2020028117 (print) | LCCN 2020028118 (ebook) | ISBN 9781119615866 (cloth) | ISBN 9781119615880 (adobe pdf) | ISBN 9781119615903 (epub)

Subjects: LCSH: Regression analysis. | Statistics.

Classification: LCC QA278.2 .P363 2020 (print) | LCC QA278.2 (ebook) | DDC 519.5/36–dc23

LC record available at https://lccn.loc.gov/2020028117

LC ebook record available at https://lccn.loc.gov/2020028118

Cover Design: Wiley

Cover Image: Courtesy of Bethany Pardoe

To Bethany and Sierra

Preface

The first edition of this book was developed from class notes written for an applied regression course taken primarily by undergraduate business majors in their junior year at the University of Oregon. Since the regression methods and techniques covered in the book have broad application in many fields, not just business, the second edition widened its scope to reflect this. This third edition refines and improves the text further. Details of the major changes for the third edition are included at the end of this preface.

The book is suitable for any undergraduate or graduate statistics course in which regression analysis is the main focus. A recommended prerequisite is an introductory probability and statistics course. It is also appropriate for use in an applied regression course for MBAs and for vocational, professional, or other non‐degree courses. Mathematical details have deliberately been kept to a minimum, and the book does not contain any calculus. Instead, emphasis is placed on applying regression analysis to data using statistical software, and understanding and interpreting results. Optional formulas are provided for those wishing to see these details and the book now includes an informal overview of matrices in the context of multiple linear regression.

Chapter 1 reviews essential introductory statistics material, while Chapter 2 covers simple linear regression. Chapter 3 introduces multiple linear regression, while Chapters 4 and 5 provide guidance on building regression models, including transforming variables, using interactions, incorporating qualitative information, and using regression diagnostics. Each of these chapters includes homework problems, mostly based on analyzing real datasets provided with the book. Chapter 6 (www.wiley.com/go/pardoe/AppliedRegressionModeling3e) contains three in‐depth case studies, while Chapter 7 (www.wiley.com/go/pardoe/AppliedRegressionModeling3e) introduces extensions to linear regression and outlines some related topics. The appendices contain a list of statistical software packages that can be used to carry out all the analyses covered in the book (each with detailed instructions available from the book website www.wiley.com/go/pardoe/AppliedRegressionModeling3e), a table of critical values for the t‐distribution, notation and formulas used throughout the book, a glossary of important terms, a short mathematics refresher, a tutorial on multiple linear regression using matrices, and brief answers to selected homework problems.

The first five chapters of the book have been used successfully in quarter‐length courses at a number of institutions. An alternative approach for a quarter‐length course would be to skip some of the material in Chapters 4 and 5 and substitute one or more of the case studies in Chapter 6 (www.wiley.com/go/pardoe/AppliedRegressionModeling3e), or briefly introduce some of the topics in Chapter 7 (www.wiley.com/go/pardoe/AppliedRegressionModeling3e). A semester‐length course could comfortably cover all the material in the book.

The website for the book, which can be found at www.wiley.com/go/pardoe/AppliedRegressionModeling3e, contains supplementary material designed to help both the instructor teaching from this book and the student learning from it. There you'll find all the datasets used for examples and homework problems in formats suitable for most statistical software packages, as well as detailed instructions for using the major packages, including SPSS, Minitab, SAS, JMP, Data Desk, EViews, Stata, Statistica, R, and Python. There is also some information on using the Microsoft Excel spreadsheet package for some of the analyses covered in the book (dedicated statistical software is necessary to carry out all of the analyses). The website also includes information on obtaining an instructor's manual containing complete answers to all the homework problems, as well as instructional videos, practice quizzes, and further ideas for organizing class time around the material in the book.

The book contains the following stylistic conventions:

When displaying calculated values, the general approach is to be as accurate as possible when it matters (such as in intermediate calculations for problems with many steps), but to round appropriately when convenient or when reporting final results for real‐world questions. Displayed results from statistical software use the default rounding employed in R throughout.

In the author's experience, many students find some traditional approaches to notation and terminology a barrier to learning and understanding. Thus, some traditions have been altered to improve ease of understanding. These include: using familiar Roman letters in place of unfamiliar Greek letters (e.g., e rather than ε and b rather than β); replacing the nonintuitive Ȳ for the sample mean of Y with m_Y; using NH and AH for the null hypothesis and alternative hypothesis, respectively, rather than the usual H_0 and H_a.

Major changes for the third edition

The second edition of this book was used in the regression analysis course run by Statistics.com from 2012 to 2020. The lively discussion boards provided an invaluable source of suggestions for changes to the book. This edition clarifies and expands on concepts that students found challenging and addresses every question posed in those discussions.

There is expanded material on assessing model assumptions, analysis of variance, sums of squares, lack of fit testing, hierarchical models, influential observations, weighted least squares, multicollinearity, and logistic regression.

A new appendix provides an informal overview of matrices in the context of multiple linear regression.

I've added learning objectives to the beginning of each chapter and text boxes at the end of each section that summarize the important concepts.

As in the first two editions, this edition uses mathematics to explain methods and techniques only where necessary, and formulas are used within the text only when they are instructive. However, the book also includes additional formulas in optional sections to aid those students who can benefit from more mathematical detail.

I've added many more end‐of‐chapter problems. In total, the number of problems has increased by nearly 70%.

I've updated and added new references.

The book website has been expanded to include instructional videos and practice quizzes.

Iain Pardoe

Nelson, British Columbia

January 2020

Acknowledgments

I am grateful to a number of people who helped to make this book a reality. Dennis Cook and Sandy Weisberg first gave me the textbook‐writing bug when they approached me to work with them on their classic applied regression book [Cook and Weisberg, 1999], and Dennis subsequently motivated me to transform my teaching class notes into my own applied regression book. People who provided data for examples used throughout the book include: Victoria Whitman for the house price examples; Wolfgang Jank for the autocorrelation example on beverage sales; Craig Allen for the case study on pharmaceutical patches; Cathy Durham for the Poisson regression example in the chapter on extensions. The multilevel and Bayesian modeling sections of the chapter on extensions are based on work by Andrew Gelman and Hal Stern. A variety of anonymous reviewers provided extremely useful feedback on the second edition of the book, as did many of my students at the University of Oregon and Statistics.com. Finally, I'd like to thank colleagues at Thompson Rivers University and the Pennsylvania State University, as well as Kathleen Santoloci and Mindy Okura‐Marszycki at Wiley.

Iain Pardoe

INTRODUCTION

I.1 STATISTICS IN PRACTICE

Statistics is used in many fields of application since it provides an effective way to analyze quantitative information. Some examples include:

A pharmaceutical company is developing a new drug for treating a particular disease more effectively. How might statistics help you decide whether the drug will be safe and effective if brought to market?

Clinical trials involve large‐scale statistical studies of people—usually both patients with the disease and healthy volunteers—who are assessed for their response to the drug. To determine that the drug is both safe and effective requires careful statistical analysis of the trial results, which can involve controlling for the personal characteristics of the people (e.g., age, gender, health history) and possible placebo effects, comparisons with alternative treatments, and so on.

A manufacturing firm is not getting paid by its customers in a timely manner—this costs the firm money on lost interest. You've collected recent data for the customer accounts on amount owed, number of days since the customer was billed, and size of the customer (small, medium, large). How might statistics help you improve the on‐time payment rate?

You can use statistics to find out whether there is an association between the amount owed and the number of days and/or size. For example, there may be a positive association between amount owed and number of days for small and medium‐sized customers but not for large‐sized customers—thus it may be more profitable to focus collection efforts on small and medium‐sized customers billed some time ago, rather than on large‐sized customers or customers billed more recently.

A firm makes scientific instruments and has been invited to make a sealed bid on a large government contract. You have cost estimates for preparing the bid and fulfilling the contract, as well as historical information on similar previous contracts on which the firm has bid (some successful, others not). How might statistics help you decide how to price the bid?

You can use statistics to model the association between the success/failure of past bids and variables such as bid cost, contract cost, bid price, and so on. If your model proves useful for predicting bid success, you could use it to set a maximum price at which the bid is likely to be successful.

As an auditor, you'd like to determine the number of price errors in all of a company's invoices—this will help you detect whether there might be systematic fraud at the company. It is too time‐consuming and costly to examine all of the company's invoices, so how might statistics help you determine an upper bound for the proportion of invoices with errors?

Statistics allows you to infer about a population from a relatively small random sample of that population. In this case, you could take a sample of 100 invoices, say, to find a proportion, p, such that you could be 95% confident that the population error rate is less than that quantity p.

A firm manufactures automobile parts and the factory manager wants to get a better understanding of overhead costs. You believe two variables in particular might contribute to cost variation: machine hours used per month and separate production runs per month. How might statistics help you to quantify this information?

You can use statistics to build a multiple linear regression model that estimates an equation relating the variables to one another. Among other things you can use the model to determine how much cost variation can be attributed to the two cost drivers, their individual effects on cost, and predicted costs for particular values of the cost drivers.

You work for a computer chip manufacturing firm and are responsible for forecasting future sales. How might statistics be used to improve the accuracy of your forecasts?

Statistics can be used to fit a number of different forecasting models to a time series of sales figures. Some models might just use past sales values and extrapolate into the future, while others might control for external variables such as economic indices. You can use statistics to assess the fit of the various models, and then use the best‐fitting model, or perhaps an average of the few best‐fitting models, to base your forecasts on.

As a financial analyst, you review a variety of financial data, such as price/earnings ratios and dividend yields, to guide investment recommendations. How might statistics be used to help you make buy, sell, or hold recommendations for individual stocks?

By comparing statistical information for an individual stock with information about stock market sector averages, you can draw conclusions about whether the stock is overvalued or undervalued. Statistics is used for both “technical analysis” (which considers the trading patterns of stocks) and “quantitative analysis” (which studies economic or company‐specific data that might be expected to affect the price or perceived value of a stock).

You are a brand manager for a retailer and wish to gain a better understanding of the association between promotional activities and sales. How might statistics be used to help you obtain this information and use it to establish future marketing strategies for your brand?

Electronic scanners at retail checkout counters and online retailer records can provide sales data and statistical summaries on promotional activities such as discount pricing and the use of in‐store displays or e‐commerce websites. Statistics can be used to model these data to discover which product features appeal to particular market segments and to predict market share for different marketing strategies.

As a production manager for a manufacturer, you wish to improve the overall quality of your product by deciding when to make adjustments to the production process, for example, increasing or decreasing the speed of a machine. How might statistics be used to help you make those decisions?

Statistical quality control charts can be used to monitor the output of the production process. Samples from previous runs can be used to determine when the process is “in control.” Ongoing samples allow you to monitor when the process goes out of control, so that you can make the adjustments necessary to bring it back into control.

As an economist, one of your responsibilities is providing forecasts about some aspect of the economy, for example, the inflation rate. How might statistics be used to estimate those forecasts optimally?

Statistical information on various economic indicators can be entered into computerized forecasting models (also determined using statistical methods) to predict inflation rates. Examples of such indicators include the producer price index, the unemployment rate, and manufacturing capacity utilization.

As general manager of a baseball team with limited financial resources, you'd like to obtain strong, yet undervalued players. How might statistics help you to do this?

A wealth of statistical information on baseball player performance is available, and objective analysis of these data can reveal information on those players most likely to add value to the team (in terms of winning games) relative to a player's cost. This field of statistics even has its own name, sabermetrics.

I.2 Learning Statistics

What is this book about?

This book is about the application of statistical methods, primarily regression analysis and modeling, to enhance decision‐making. Regression analysis is by far the most used statistical methodology in real‐world applications. Furthermore, many other statistical techniques are variants or extensions of regression analysis, so once you have a firm foundation in this methodology, you can approach these other techniques without too much additional difficulty. In this book we show you how to apply and interpret regression models, rather than deriving results and formulas (there is no calculus in the book).

Why are non‐math major students required to study statistics?

In many aspects of modern life, we have to make decisions based on incomplete information (e.g., health, climate, economics, business). This book will help you to understand, analyze, and interpret such data in order to make informed decisions in the face of uncertainty. Statistical theory allows a rigorous, quantifiable appraisal of this uncertainty.

How is the book organized?

Chapter 1 reviews the essential details of an introductory statistics course necessary for use in later chapters. Chapter 2 covers the simple linear regression model for analyzing the linear association between two variables (a “response” and a “predictor”). Chapter 3 extends the methods of Chapter 2 to multiple linear regression where there can be more than one predictor variable. Chapters 4 and 5 provide guidance on building regression models, including transforming variables, using interactions, incorporating qualitative information, and diagnosing problems. Chapter 6 (www.wiley.com/go/pardoe/AppliedRegressionModeling3e) contains three case studies that apply the linear regression modeling techniques considered in this book to examples on real estate prices, vehicle fuel efficiency, and pharmaceutical patches. Chapter 7 (www.wiley.com/go/pardoe/AppliedRegressionModeling3e) introduces some extensions to the multiple linear regression model and outlines some related topics. The appendices contain a list of statistical software that can be used to carry out all the analyses covered in the book, a t‐table for use in calculating confidence intervals and conducting hypothesis tests, notation and formulas used throughout the book, a glossary of important terms, a short mathematics refresher, a tutorial on multiple linear regression using matrices, and brief answers to selected problems.

What else do you need?

The preferred calculation method for understanding the material and completing the problems is to use statistical software rather than a statistical calculator. It may be possible to apply many of the methods discussed using spreadsheet software (such as Microsoft Excel), although some of the graphical methods may be difficult to implement and statistical software will generally be easier to use. Although a statistical calculator is not recommended for use with this book, a traditional calculator capable of basic arithmetic (including taking logarithmic and exponential transformations) will be invaluable.

What other resources are recommended?

Good supplementary textbooks (some at a more advanced level) include Chatterjee and Hadi (2013), Dielman (2004), Draper and Smith (1998), Fox (2015), Gelman et al. (2020), Kutner et al. (2004), Mendenhall and Sincich (2020), Montgomery et al. (2021), Ryan (2008), and Weisberg (2013).

About the Companion Website

This book is accompanied by a companion website for Instructors and Students:

www.wiley.com/go/pardoe/AppliedRegressionModeling3e

Datasets used for examples

R code

Presentation slides

Statistical software packages

Chapter 6 – Case studies

Chapter 7 – Extensions

Appendix A – Computer Software help

Appendix B – Critical values for t-distributions

Appendix C – Notation and formulas

Appendix D – Mathematics refresher

Appendix E – Multiple Linear Regression Using Matrices

Appendix F – Answers for selected problems

Instructor's manual

Chapter 1: Foundations

This chapter provides a brief refresher of the main statistical ideas that are a useful foundation for the main focus of this book, regression analysis, covered in subsequent chapters. For more detailed discussion of this material, consult a good introductory statistics textbook such as Freedman et al. (2007) or Moore et al. (2018). To simplify matters at this stage, we consider univariate data, that is, datasets consisting of measurements of a single variable from a sample of observations. By contrast, regression analysis concerns multivariate data where there are two or more variables measured from a sample of observations. Nevertheless, the statistical ideas for univariate data carry over readily to this more complex situation, so it helps to start out as simply as possible and make things more complicated only as needed.

After reading this chapter you should be able to:

Summarize univariate data graphically and numerically.

Calculate and interpret a confidence interval for a univariate population mean.

Conduct and draw conclusions from a hypothesis test for a univariate population mean using both the rejection region and p‐value methods.

Calculate and interpret a prediction interval for an individual univariate value.

1.1 Identifying and Summarizing Data

One way to think about statistics is as a collection of methods for using data to understand a problem quantitatively—we saw many examples of this in the introduction. This book is concerned primarily with analyzing data to obtain information that can be used to help make decisions in real‐world contexts.

The process of framing a problem in such a way that it is amenable to quantitative analysis is clearly an important step in the decision‐making process, but this lies outside the scope of this book. Similarly, while data collection is also a necessary task—often the most time‐consuming part of any analysis—we assume from this point on that we have already obtained data relevant to the problem at hand. We will return to the issue of the manner in which these data have been collected—namely, whether we can consider the sample data to be representative of some larger population that we wish to make statistical inferences for—in Section 1.3.

For now, we consider identifying and summarizing the data at hand. For example, suppose that we have moved to a new city and wish to buy a home. In deciding on a suitable home, we would probably consider a variety of factors, such as size, location, amenities, and price. For the sake of illustration, we focus on price and, in particular, see if we can understand the way in which sale prices vary in a specific housing market. This example will run through the rest of the chapter, and, while no one would probably ever obsess over this problem to this degree in real life, it provides a useful, intuitive application for the statistical ideas that we use in the rest of the book in more complex problems.

For this example, identifying the data is straightforward: the units of observation are a random sample of n = 30 single‐family homes in our particular housing market, and we have a single measurement for each observation, the sale price in thousands of dollars ($), represented using the notation Y = Price. Here, Y is the generic letter used for any univariate data variable, while Price is the specific variable name for this dataset. These data, obtained from Victoria Whitman, a realtor in Eugene, Oregon, are available in the HOMES1 data file on the book website—they represent sale prices of 30 homes in south Eugene during 2005. This represents a subset of a larger file containing more extensive information on 76 homes, which is analyzed as a case study in Chapter 6 (refer to www.wiley.com/go/pardoe/AppliedRegressionModeling3e).

The particular sample in the HOMES1 data file is random because the 30 homes have been selected randomly somehow from the population of all single‐family homes in this housing market. For example, consider a list of homes currently for sale, which are considered to be representative of this population. A random number generator—commonly available in spreadsheet or statistical software—can be used to pick out 30 of these. Alternative selection methods may or may not lead to a random sample. For example, picking the first 30 homes on the list would not lead to a random sample if the list were ordered by the size of the sale price.

We can simply list small datasets such as this. The values of Price in this case are as follows:

155.5  195.0  197.0  207.0  214.9  230.0  239.5  242.0  252.5  255.0
259.9  259.9  269.9  270.0  274.9  283.0  285.0  285.0  299.0  299.9
319.0  319.9  324.5  330.0  336.0  339.0  340.0  355.0  359.9  359.9

However, even for these data, it can be helpful to summarize the numbers with a small number of sample statistics (such as the sample mean and standard deviation), or with a graph that can effectively convey the manner in which the numbers vary. A particularly effective graph is a stem‐and‐leaf plot, which places the numbers along the vertical axis of the plot, ordered in adjoining data intervals (called “bins”) from the lowest at the top to the highest at the bottom. For example, a stem‐and‐leaf plot for the 30 sample prices looks like the following:

1 | 6
2 | 0011344
2 | 5666777899
3 | 002223444
3 | 666

In this plot, the decimal point is two digits to the right of the stem. So, the “1” in the stem and the “6” in the leaf represents 160 or, because of rounding, any number between 155 and 164.9. In particular, it represents the lowest price in the dataset of 155.5 (thousand dollars). The next part of the graph shows two prices between 195 and 204.9, two prices between 205 and 214.9, one price between 225 and 234.9, two prices between 235 and 244.9, and so on. A stem‐and‐leaf plot can easily be constructed by hand for small datasets such as this, or it can be constructed automatically using statistical software. The appearance of the plot can depend on the type of statistical software used—this particular plot was constructed using R statistical software (as are all the plots in this book). Instructions for constructing stem‐and‐leaf plots are available as computer help #13 in the software information files available from the book website at www.wiley.com/go/pardoe/AppliedRegressionModeling3e.
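As a brief illustration, the following R sketch reproduces this kind of plot with base R's stem() function, assuming the 30 HOMES1 sale prices have been typed in as the vector price (the variable name is ours, not the book's):

# Sale prices in $ thousands for the 30 sampled homes (HOMES1 data)
price <- c(155.5, 195.0, 197.0, 207.0, 214.9, 230.0, 239.5, 242.0, 252.5, 255.0,
           259.9, 259.9, 269.9, 270.0, 274.9, 283.0, 285.0, 285.0, 299.0, 299.9,
           319.0, 319.9, 324.5, 330.0, 336.0, 339.0, 340.0, 355.0, 359.9, 359.9)
# Stem-and-leaf plot; the scale argument adjusts the bin width if needed
stem(price)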

The overall impression from this graph is that the sample prices range from the mid‐150s to the mid‐350s, with some suggestion of clustering around the high 200s. Perhaps the sample represents quite a range of moderately priced homes, but with no very cheap or very expensive homes. This type of observation often arises throughout a data analysis—the data begin to tell a story and suggest possible explanations. A good analysis is usually not the end of the story since it will frequently lead to other analyses and investigations. For example, in this case, we might surmise that we would probably be unlikely to find a home priced at much less than $150,000 in this market, but perhaps a realtor might know of a nearby market with more affordable housing.

A few modifications to a stem‐and‐leaf plot produce a histogram—the value axis is now horizontal rather than vertical, and the counts of observations within the bins are displayed as bars (with the counts, or frequency, shown on the vertical axis) rather than by displaying individual values with digits. Figure 1.1 shows a histogram for the home prices data generated by statistical software (see computer help #14).

Figure 1.1 Histogram for home prices example.

Histograms can convey very different impressions depending on the bin width, start point, and so on. Ideally, we want a large enough bin size to avoid excessive sampling “noise” (a histogram with many bins that looks very wiggly), but not so large that it is hard to see the underlying distribution (a histogram with few bins that looks too blocky). A reasonable pragmatic approach is to use the default settings in whichever software package we are using, and then perhaps to create a few more histograms with different settings to check that we are not missing anything. There are more sophisticated methods, but for the purposes of the methods in this book, this should suffice.
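In the same spirit, here is a minimal R sketch of this bin‐width experimentation, reusing the price vector defined above (the breaks values are illustrative choices, not the book's settings):

# Default bin settings, similar in spirit to Figure 1.1
hist(price, xlab = "Sale price ($ thousands)", main = "")
# A few alternative bin counts to check nothing is being missed
hist(price, breaks = 5, xlab = "Sale price ($ thousands)", main = "")
hist(price, breaks = 15, xlab = "Sale price ($ thousands)", main = "")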

In addition to graphical summaries such as the stem‐and‐leaf plot and histogram, sample statistics can summarize data numerically. For example:

The sample mean, m_Y, is a measure of the “central tendency” of the data Y‐values.

The sample standard deviation, s_Y, is a measure of the spread or variation in the data Y‐values.

We will not bother here with the formulas for these sample statistics. Since almost all of the calculations necessary for learning the material covered by this book will be performed by statistical software, the book only contains formulas when they are helpful in understanding a particular concept or provide additional insight to interested readers.

We can calculate sample standardized Z‐values from the data Y‐values:

Z = (Y − m_Y) / s_Y.

Sometimes, it is useful to work with sample standardized Z‐values rather than the original data Y‐values since sample standardized Z‐values have a sample mean of 0 and a sample standard deviation of 1. Try using statistical software to calculate sample standardized Z‐values for the home prices data, and then check that the mean and standard deviation of the Z‐values are 0 and 1, respectively.
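A quick way to try this in R, reusing the price vector from the earlier sketch:

# Standardize: subtract the sample mean and divide by the sample standard deviation
z <- (price - mean(price)) / sd(price)   # equivalently: as.vector(scale(price))
mean(z)   # 0, up to rounding error
sd(z)     # 1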

Statistical software can also calculate additional sample statistics, such as:

the median (another measure of central tendency, but which is less sensitive than the sample mean to very small or very large values in the data)—half the dataset values are smaller than this quantity and half are larger;

the minimum and maximum;

percentiles or quantiles, such as the 25th percentile—this is the smallest value that is larger than 25% of the values in the dataset (i.e., 25% of the dataset values are smaller than the 25th percentile, while 75% of the dataset values are larger).

Here are the values obtained by statistical software for the home prices example (see computer help #10):

Sample size, n (valid)       30
Missing                       0
Mean                   278.6033
Median                 278.9500
Standard deviation      53.8656
Minimum                155.5000
Maximum                359.9000
25th percentile        241.3750
50th percentile        278.9500
75th percentile        325.8750
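These values can be reproduced in R with one‐line calls (again reusing price; note that quantile() must be told to use the same percentile definition as the output above, since R's default type differs slightly):

length(price)    # sample size, n = 30
mean(price)      # 278.6033
median(price)    # 278.95
sd(price)        # 53.8656
min(price)       # 155.5
max(price)       # 359.9
quantile(price, c(0.25, 0.50, 0.75), type = 6)   # 241.375 278.950 325.875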

There are many other methods—numerical and graphical—for summarizing data. For example, another popular graph besides the histogram is the boxplot; see Chapter 6 (www.wiley.com/go/pardoe/AppliedRegressionModeling3e) for some examples of boxplots used in case studies.

1.2 Population Distributions

While the methods of the preceding section are useful for describing and displaying sample data, the real power of statistics is revealed when we use samples to give us information about populations. In this context, a population is the entire collection of objects of interest, for example, the sale prices for all single‐family homes in the housing market represented by our dataset. We would like to know more about this population to help us make a decision about which home to buy, but the only data we have is a random sample of 30 sale prices.

Nevertheless, we can employ “statistical thinking” to draw inferences about the population of interest by analyzing the sample data. In particular, we use the notion of a model—a mathematical abstraction of the real world—which we fit to the sample data. If this model provides a reasonable fit to the data, that is, if it can approximate the manner in which the data vary, then we assume it can also approximate the behavior of the population. The model then provides the basis for making decisions about the population, by, for example, identifying patterns, explaining variation, and predicting future values. Of course, this process can work only if the sample data can be considered representative of the population. One way to address this is to randomly select the sample from the population. There are other more complex sampling methods that are used to select representative samples, and there are also ways to make adjustments to models to account for known nonrandom sampling. However, we do not consider these here—any good sampling textbook should cover these issues.

Sometimes, even when we know that a sample has not been selected randomly, we can still model it. Then, we may not be able to formally infer about a population from the sample, but we can still model the underlying structure of the sample. One example would be a convenience sample—a sample selected more for reasons of convenience than for its statistical properties. When modeling such samples, any results should be reported with a caution about restricting any conclusions to objects similar to those in the sample. Another kind of example is when the sample comprises the whole population. For example, we could model data for all 50 states of the United States of America to better understand any patterns or systematic associations among the states.

Since the real world can be extremely complicated (in the way that data values vary or interact together), models are useful because they simplify problems so that we can better understand them (and then make more effective decisions). On the one hand, we therefore need models to be simple enough that we can easily use them to make decisions, but on the other hand, we need models that are flexible enough to provide good approximations to complex situations. Fortunately, many statistical models have been developed over the years that provide an effective balance between these two criteria. One such model, which provides a good starting point for the more complicated models we consider later, is the normal distribution.

From a statistical perspective, a distribution (strictly speaking, a probability distribution) is a theoretical model that describes how a random variable varies. For our purposes, a random variable represents the data values of interest in the population, for example, the sale prices of all single‐family homes in our housing market. One way to represent the population distribution of data values is in a histogram, as described in Section 1.1. The difference now is that the histogram displays the whole population rather than just the sample. Since the population is so much larger than the sample, the bins of the histogram (the consecutive ranges of the data that comprise the horizontal intervals for the bars) can be much smaller than in Figure 1.1. For example, Figure 1.2 shows a histogram for a simulated population of sale prices. The scale of the vertical axis now represents proportions (density) rather than the counts (frequency) of Figure 1.1.

Figure 1.2 Histogram for a simulated population of sale prices, together with a normal density curve.

As the population size gets larger, we can imagine the histogram bars getting thinner and more numerous, until the histogram resembles a smooth curve rather than a series of steps. This smooth curve is called a density curve and can be thought of as the theoretical version of the population histogram. Density curves also provide a way to visualize probability distributions such as the normal distribution. A normal density curve is superimposed on Figure 1.2. The simulated population histogram follows the curve quite closely, which suggests that this simulated population distribution is quite close to normal.
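A small R simulation in the spirit of Figure 1.2; the population size, mean, and standard deviation below are illustrative choices on our part, not the book's simulation settings:

set.seed(1)
# Simulate a large "population" of sale prices
pop <- rnorm(100000, mean = 280, sd = 50)
# Density-scale histogram with many narrow bins
hist(pop, breaks = 100, freq = FALSE, xlab = "Sale price ($ thousands)", main = "")
# Superimpose the corresponding normal density curve
curve(dnorm(x, mean = 280, sd = 50), add = TRUE)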

To see how a theoretical distribution can prove useful for making statistical inferences about populations such as that in our home prices example, we need to look more closely at the normal distribution. To begin, we consider a particular version of the normal distribution, the standard normal, as represented by the density curve in Figure 1.3. Random variables that follow a standard normal distribution have a mean of 0 (represented in Figure 1.3 by the curve being symmetric about 0, which is under the highest point of the curve) and a standard deviation of 1 (represented in Figure 1.3 by the curve having a point of inflection—where the curve bends first one way and then the other—at −1 and +1). The normal density curve is sometimes called the “bell curve” since its shape resembles that of a bell. It is a slightly odd bell, however, since its sides never quite reach the ground (although the ends of the curve in Figure 1.3 are quite close to zero on the vertical axis, they would never actually quite reach there, even if the graph were extended a very long way on either side).

Figure 1.3 Standard normal density curve together with a shaded area of 0.475 between 0 and 1.96, which represents the probability that a standard normal random variable lies between 0 and 1.96.

The key feature of the normal density curve that allows us to make statistical inferences is that areas under the curve represent probabilities. The entire area under the curve is one, while the area under the curve between one point on the horizontal axis (a, say) and another point (b, say) represents the probability that a random variable that follows a standard normal distribution is between a and b. So, for example, Figure 1.3 shows there is a probability of 0.475 that a standard normal random variable lies between 0 and 1.96, since the area under the curve between 0 and 1.96 is 0.475.

We can obtain values for these areas or probabilities from a variety of sources: tables of numbers, calculators, spreadsheet or statistical software, Internet websites, and so on. In this book, we print only a few select values since most of the later calculations use a generalization of the normal distribution called the “t‐distribution.” Also, rather than areas such as that shaded in Figure 1.3, it will become more useful to consider “tail areas” (e.g., the area to the right of a particular point), and so for consistency with later tables of numbers, the following table allows calculation of such tail areas:

Normal distribution probabilities (tail areas) and percentiles (horizontal axis values)

Upper‐tail area          0.1     0.05    0.025   0.01    0.005   0.001
Horizontal axis value    1.282   1.645   1.960   2.326   2.576   3.090
Two‐tail area            0.2     0.1     0.05    0.02    0.01    0.002

In particular, the upper‐tail area to the right of 1.960 is 0.025; this is equivalent to saying that the area between 0 and 1.960 is 0.475 (since the entire area under the curve is 1 and the area to the right of 0 is 0.5). Similarly, the two‐tail area, which is the sum of the areas to the right of 1.960 and to the left of −1.960, is two times 0.025, or 0.05.
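The tail areas and percentiles in this table can be computed directly in R with pnorm() and qnorm(); a brief sketch:

# Upper-tail areas for the table's horizontal axis values
pnorm(c(1.282, 1.645, 1.960, 2.326, 2.576, 3.090), lower.tail = FALSE)
# -> approximately 0.1, 0.05, 0.025, 0.01, 0.005, 0.001
# Going the other way: the percentile with upper-tail area 0.025
qnorm(0.025, lower.tail = FALSE)      # 1.959964
# Two-tail area corresponding to 1.96
2 * pnorm(1.96, lower.tail = FALSE)   # approximately 0.05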

How does all this help us to make statistical inferences about populations such as that in our home prices example? The essential idea is that we fit a normal distribution model to our sample data and then use this model to make inferences about the corresponding population. For example, we can use probability calculations for a normal distribution (as shown in Figure 1.3) to make probability statements about a population modeled using that normal distribution—we will show exactly how to do this in Section 1.3. Before we do that, however, we pause to consider an aspect of this inferential sequence that can make or break the process. Does the model provide a close enough approximation to the pattern of sample values that we can be confident the model adequately represents the population values? The better the approximation, the more reliable our inferential statements will be.

We saw in Figure 1.2 how a density curve can be thought of as a histogram with a very large sample size. So one way to assess whether our population follows a normal distribution model is to construct a histogram from our sample data and visually determine whether it “looks normal,” that is, approximately symmetric and bell‐shaped. This is a somewhat subjective decision, but with experience you should find that it becomes easier to discern clearly nonnormal histograms from those that are reasonably normal. For example, while the histogram in Figure 1.2 clearly looks like a normal density curve, the normality of the histogram of 30 sample sale prices in Figure 1.1 is less certain. A reasonable conclusion in this case would be that while this sample histogram is not perfectly symmetric and bell‐shaped, it is close enough that the corresponding (hypothetical) population histogram could well be normal.

An alternative way to assess normality is to construct a QQ‐plot (quantile–quantile plot), also known as a normal probability plot, as shown in Figure 1.4 (see computer help #22 in the software information files available from the book website). If the points in the QQ‐plot lie close to the diagonal line, then the corresponding population values could well be normal. If the points generally lie far from the line, then normality is in question. Again, this is a somewhat subjective decision that becomes easier to make with experience. In this case, given the fairly small sample size, the points are probably close enough to the line that it is reasonable to conclude that the population values could be normal.

Figure 1.4 QQ‐plot for the home prices example.

There are also a variety of quantitative methods for assessing normality—brief details and references are provided in Section 3.4.2.

Optional—technical details of QQ‐plots

For the purposes of this book, the technical details of QQ‐plots are not too important. For those who are curious, however, a brief description follows. First, calculate a set of equally spaced percentiles (quantiles) from a standard normal distribution. For example, if the sample size, n, is 9, then the calculated percentiles would be the 10th, 20th, ..., 90th. Then construct a scatterplot with the observed data values ordered from low to high on the vertical axis and the calculated percentiles on the horizontal axis. If the two sets of values are similar (i.e., if the sample values closely follow a normal distribution), then the points will lie roughly along a straight line. To facilitate this assessment, a diagonal line that passes through the first and third quartiles is often added to the plot. The exact details of how a QQ‐plot is drawn can differ depending on the statistical software used (e.g., sometimes the axes are switched or the diagonal line is constructed differently).
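A rough R sketch of this construction, reusing the price vector from Section 1.1; base R's qqnorm() and qqline() produce the same kind of plot directly, so this is purely to illustrate the recipe above:

n <- length(price)
# Equally spaced standard normal percentiles: 1/(n+1), 2/(n+1), ..., n/(n+1)
theoretical <- qnorm((1:n) / (n + 1))
# Ordered sample values (vertical) against the calculated percentiles (horizontal)
plot(theoretical, sort(price), xlab = "Theoretical quantiles", ylab = "Sample quantiles")
# Diagonal line through the first and third quartiles
qqline(price)
# One-line equivalent: qqnorm(price); qqline(price)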

1.3 Selecting Individuals at Random—Probability

Having assessed the normality of our population of sale prices by looking at the histogram and QQ‐plot of sample sale prices, we now return to the task of making probability statements about the population. The crucial question at this point is whether the sample data are representative of the population for which we wish to make statistical inferences. One way to increase the chance of this being true is to select the sample values from the population at random—we discussed this in the context of our home prices example in Section 1.1. We can then make reliable statistical inferences about the population by considering properties of a model fit to the sample data—provided the model fits reasonably well.

We saw in Section 1.2 that a normal distribution model fits the home prices example reasonably well. However, we can see from Figure 1.1 that a standard normal distribution is inappropriate here, because a standard normal distribution has a mean of 0 and a standard deviation of 1, whereas our sample data have a mean of 278.6033 and a standard deviation of 53.8656. We therefore need to consider more general normal distributions with a mean that can take any value and a standard deviation that can take any positive value (standard deviations cannot be negative).

Let Y represent the population values (sale prices in our example) and suppose that Y is normally distributed with mean (or expected value) E(Y) and standard deviation SD(Y). This textbook uses this notation with familiar Roman letters in place of the traditional Greek letters, μ (mu) and σ (sigma), which, in the author's experience, are unfamiliar and awkward for many students. We can abbreviate this normal distribution as Normal(E(Y), SD(Y)²), where the first number is the mean and the second number is the square of the standard deviation (also known as the variance). Then the population standardized value,

Z = (Y − E(Y)) / SD(Y),

has a standard normal distribution with mean 0 and standard deviation 1. In symbols,

Z ~ Normal(0, 1).

We are now ready to make a probability statement for the home prices example. Suppose that we would consider a home as being too expensive to buy if its sale price is higher than $380,000. What is the probability of finding such an expensive home in our housing market? In other words, if we were to randomly select one home from the population of all homes, what is the probability that it has a sale price higher than $380,000? To answer this question, we need to make a number of assumptions. We have already decided that it is probably safe to assume that the population of sale prices (Y) could be normal, but we do not know the mean, E(Y), or the standard deviation, SD(Y), of the population of home prices. For now, let us assume that E(Y) = 280 and SD(Y) = 50 (fairly close to the sample mean of 278.6033 and sample standard deviation of 53.8656). (We will be able to relax these assumptions later in this chapter.) From the theoretical result above, Z = (Y − 280)/50 has a standard normal distribution with mean 0 and standard deviation 1.

Next, to find the probability that a randomly selected Y is greater than 380, we perform some standard algebra on probability statements. In particular, if we write “the probability that Y is bigger than 380” as “Pr(Y > 380),” then we can make changes to Y (such as adding, subtracting, multiplying, and dividing other quantities) as long as we do the same thing to 380. It is perhaps easier to see how this works by example:

Pr(Y > 380) = Pr((Y − 280)/50 > (380 − 280)/50) = Pr(Z > 2.00).

The second equality follows since (Y − 280)/50 is defined to be Z, which is a standard normal random variable with mean 0 and standard deviation 1. From the normal table in Section 1.2, the probability that a standard normal random variable is greater than 1.96 is 0.025. Thus, Pr(Z > 2.00) is slightly less than 0.025 (draw a picture of a normal density curve with 1.96 and 2.00 marked on the horizontal axis to convince yourself of this fact). In other words, there is slightly less than a 2.5% chance of finding an expensive home (sale price above $380,000) in our housing market, under the assumption that Y is normal with mean 280 and standard deviation 50.
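R can also compute this probability directly, without a table lookup; a brief check under the same assumptions (mean 280, standard deviation 50):

# Pr(Y > 380) for Y normal with mean 280 and standard deviation 50
pnorm(380, mean = 280, sd = 50, lower.tail = FALSE)   # 0.02275013
# Equivalently, via the standardized value: Pr(Z > 2)
pnorm((380 - 280) / 50, lower.tail = FALSE)           # 0.02275013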