Regression Analysis By Example Using R - Ali S. Hadi - E-Book


Ali S. Hadi

Description

Regression Analysis By Example Using R

A STRAIGHTFORWARD AND CONCISE DISCUSSION OF THE ESSENTIALS OF REGRESSION ANALYSIS

In the newly revised sixth edition of Regression Analysis By Example Using R, distinguished statistician Dr. Ali S. Hadi delivers an expanded and thoroughly updated discussion of exploratory data analysis using regression analysis in R. The book provides in-depth treatments of regression diagnostics, transformation, multicollinearity, logistic regression, and robust regression. The author clearly demonstrates effective methods of regression analysis with examples that contain the types of data irregularities commonly encountered in the real world. This newest edition also offers a brand-new, easy-to-read chapter on the freely available statistical software package R.

Readers will also find:

* Reorganized, expanded, and upgraded exercises at the end of each chapter, with an emphasis on data analysis
* Updated data sets and examples throughout the book
* Complimentary access to a companion website that provides data sets in xlsx, csv, and txt formats

Perfect for upper-level undergraduate or beginning graduate students in statistics, mathematics, biostatistics, and computer science programs, Regression Analysis By Example Using R will also benefit readers who need a reference for quick updates on regression methods and applications.


Page count: 730

Publication year: 2023




Table of Contents

COVER

TABLE OF CONTENTS

TITLE PAGE

COPYRIGHT

DEDICATION

PREFACE

ABOUT THE COMPANION WEBSITE

CHAPTER 1: INTRODUCTION

1.1 WHAT IS REGRESSION ANALYSIS?

1.2 PUBLICLY AVAILABLE DATA SETS

1.3 SELECTED APPLICATIONS OF REGRESSION ANALYSIS

1.4 STEPS IN REGRESSION ANALYSIS

1.5 SCOPE AND ORGANIZATION OF THE BOOK

NOTES

CHAPTER 2: A BRIEF INTRODUCTION TO R

2.1 WHAT IS R AND RSTUDIO?

2.2 INSTALLING R AND RSTUDIO

2.3 GETTING STARTED WITH R

2.4 DATA VALUES AND OBJECTS IN R

2.5 R PACKAGES (LIBRARIES)

2.6 IMPORTING (READING) DATA INTO R WORKSPACE

2.7 WRITING (EXPORTING) DATA TO FILES

2.8 SOME ARITHMETIC AND OTHER OPERATORS

2.9 PROGRAMMING IN R

2.10 BIBLIOGRAPHIC NOTES

NOTE

CHAPTER 3: SIMPLE LINEAR REGRESSION

3.1 INTRODUCTION

3.2 COVARIANCE AND CORRELATION COEFFICIENT

3.3 EXAMPLE: COMPUTER REPAIR DATA

3.4 THE SIMPLE LINEAR REGRESSION MODEL

3.5 PARAMETER ESTIMATION

3.6 TESTS OF HYPOTHESES

3.7 CONFIDENCE INTERVALS

3.8 PREDICTIONS

3.9 MEASURING THE QUALITY OF FIT

3.10 REGRESSION LINE THROUGH THE ORIGIN

3.11 TRIVIAL REGRESSION MODELS

3.12 BIBLIOGRAPHIC NOTES

NOTES

CHAPTER 4: MULTIPLE LINEAR REGRESSION

4.1 INTRODUCTION

4.2 DESCRIPTION OF THE DATA AND MODEL

4.3 EXAMPLE: SUPERVISOR PERFORMANCE DATA

4.4 PARAMETER ESTIMATION

4.5 INTERPRETATIONS OF REGRESSION COEFFICIENTS

4.6 CENTERING AND SCALING

4.7 PROPERTIES OF THE LEAST SQUARES ESTIMATORS

4.8 MULTIPLE CORRELATION COEFFICIENT

4.9 INFERENCE FOR INDIVIDUAL REGRESSION COEFFICIENTS

4.10 TESTS OF HYPOTHESES IN A LINEAR MODEL

4.11 PREDICTIONS

4.12 SUMMARY

NOTES

CHAPTER 5: REGRESSION DIAGNOSTICS: DETECTION OF MODEL VIOLATIONS

5.1 INTRODUCTION

5.2 THE STANDARD REGRESSION ASSUMPTIONS

5.3 VARIOUS TYPES OF RESIDUALS

5.4 GRAPHICAL METHODS

5.5 GRAPHS BEFORE FITTING A MODEL

5.6 GRAPHS AFTER FITTING A MODEL

5.7 CHECKING LINEARITY AND NORMALITY ASSUMPTIONS

5.8 LEVERAGE, INFLUENCE, AND OUTLIERS

5.9 MEASURES OF INFLUENCE

5.10 THE POTENTIAL–RESIDUAL PLOT

5.11 REGRESSION DIAGNOSTICS IN R

5.12 WHAT TO DO WITH THE OUTLIERS?

5.13 ROLE OF VARIABLES IN A REGRESSION EQUATION

5.14 EFFECTS OF AN ADDITIONAL PREDICTOR

5.15 ROBUST REGRESSION

NOTES

CHAPTER 6: QUALITATIVE VARIABLES AS PREDICTORS

6.1 INTRODUCTION

6.2 SALARY SURVEY DATA

6.3 INTERACTION VARIABLES

6.4 SYSTEMS OF REGRESSION EQUATIONS: COMPARING TWO GROUPS

6.5 OTHER APPLICATIONS OF INDICATOR VARIABLES

6.6 SEASONALITY

6.7 STABILITY OF REGRESSION PARAMETERS OVER TIME

NOTES

CHAPTER 7: TRANSFORMATION OF VARIABLES

7.1 INTRODUCTION

7.2 TRANSFORMATIONS TO ACHIEVE LINEARITY

7.3 BACTERIA DEATHS DUE TO X-RAY RADIATION

7.4 TRANSFORMATIONS TO STABILIZE VARIANCE

7.5 DETECTION OF HETEROSCEDASTIC ERRORS

7.6 REMOVAL OF HETEROSCEDASTICITY

7.7 WEIGHTED LEAST SQUARES

7.8 LOGARITHMIC TRANSFORMATION OF DATA

7.9 POWER TRANSFORMATION

7.10 SUMMARY

NOTES

CHAPTER 8: WEIGHTED LEAST SQUARES

8.1 INTRODUCTION

8.2 HETEROSCEDASTIC MODELS

8.3 TWO-STAGE ESTIMATION

8.4 EDUCATION EXPENDITURE DATA

8.5 FITTING A DOSE–RESPONSE RELATIONSHIP CURVE

NOTES

CHAPTER 9: THE PROBLEM OF CORRELATED ERRORS

9.1 INTRODUCTION: AUTOCORRELATION

9.2 CONSUMER EXPENDITURE AND MONEY STOCK

9.3 DURBIN–WATSON STATISTIC

9.4 REMOVAL OF AUTOCORRELATION BY TRANSFORMATION

9.5 ITERATIVE ESTIMATION WITH AUTOCORRELATED ERRORS

9.6 AUTOCORRELATION AND MISSING VARIABLES

9.7 ANALYSIS OF HOUSING STARTS

9.8 LIMITATIONS OF THE DURBIN–WATSON STATISTIC

9.9 INDICATOR VARIABLES TO REMOVE SEASONALITY

9.10 REGRESSING TWO TIME SERIES

NOTES

CHAPTER 10: ANALYSIS OF COLLINEAR DATA

10.1 INTRODUCTION

10.2 EFFECTS OF COLLINEARITY ON INFERENCE

10.3 EFFECTS OF COLLINEARITY ON FORECASTING

10.4 DETECTION OF COLLINEARITY

NOTES

CHAPTER 11: WORKING WITH COLLINEAR DATA

11.1 INTRODUCTION

11.2 PRINCIPAL COMPONENTS

11.3 COMPUTATIONS USING PRINCIPAL COMPONENTS

11.4 IMPOSING CONSTRAINTS

11.5 SEARCHING FOR LINEAR FUNCTIONS OF THE β'S

11.6 BIASED ESTIMATION OF REGRESSION COEFFICIENTS

11.7 PRINCIPAL COMPONENTS REGRESSION

11.8 REDUCTION OF COLLINEARITY IN THE ESTIMATION DATA

11.9 CONSTRAINTS ON THE REGRESSION COEFFICIENTS

11.10 PRINCIPAL COMPONENTS REGRESSION: A CAUTION

11.11 RIDGE REGRESSION

11.12 ESTIMATION BY THE RIDGE METHOD

11.13 RIDGE REGRESSION: SOME REMARKS

11.14 SUMMARY

11.15 BIBLIOGRAPHIC NOTES

NOTES

CHAPTER 12: VARIABLE SELECTION PROCEDURES

12.1 INTRODUCTION

12.2 FORMULATION OF THE PROBLEM

12.3 CONSEQUENCES OF VARIABLES DELETION

12.4 USES OF REGRESSION EQUATIONS

12.5 CRITERIA FOR EVALUATING EQUATIONS

12.6 COLLINEARITY AND VARIABLE SELECTION

12.7 EVALUATING ALL POSSIBLE EQUATIONS

12.8 VARIABLE SELECTION PROCEDURES

12.9 GENERAL REMARKS ON VARIABLE SELECTION METHODS

12.10 A STUDY OF SUPERVISOR PERFORMANCE

12.11 VARIABLE SELECTION WITH COLLINEAR DATA

12.12 THE HOMICIDE DATA

12.13 VARIABLE SELECTION USING RIDGE REGRESSION

12.14 SELECTION OF VARIABLES IN AN AIR POLLUTION STUDY

12.15 A POSSIBLE STRATEGY FOR FITTING REGRESSION MODELS

12.16 BIBLIOGRAPHIC NOTES

NOTES

CHAPTER 13: LOGISTIC REGRESSION

13.1 INTRODUCTION

13.2 MODELING QUALITATIVE DATA

13.3 THE LOGIT MODEL

13.4 EXAMPLE: ESTIMATING PROBABILITY OF BANKRUPTCIES

13.5 LOGISTIC REGRESSION DIAGNOSTICS

13.6 DETERMINATION OF VARIABLES TO RETAIN

13.7 JUDGING THE FIT OF A LOGISTIC REGRESSION

13.8 THE MULTINOMIAL LOGIT MODEL

13.9 CLASSIFICATION PROBLEM: ANOTHER APPROACH

NOTES

CHAPTER 14: FURTHER TOPICS

14.1 INTRODUCTION

14.2 GENERALIZED LINEAR MODEL

14.3 POISSON REGRESSION MODEL

14.4 INTRODUCTION OF NEW DRUGS

14.5 ROBUST REGRESSION

14.6 FITTING A QUADRATIC MODEL

14.7 DISTRIBUTION OF PCB IN U.S. BAYS

NOTES

REFERENCES

INDEX

END USER LICENSE AGREEMENT

List of Tables

Chapter 1

Table 1.1 Variables in Milk Production Data

Table 1.2 Variables in Right-To-Work Laws Data

Table 1.3 Variables in Study of Domestic Immigration

Table 1.4 Variables in Egyptian Skulls Data

Table 1.5 Variables in Study of Water Pollution in New York Rivers

Table 1.6 Variables in Cost of Health Care Data

Table 1.7 Notation for Data Used in Regression Analysis

Table 1.8 Various Classifications of Regression Analysis

Chapter 2

Table 2.1 Some Useful R Functions for Information About Data

Table 2.2 Some Useful R Functions for Testing Object Type

Table 2.3 Some Useful R Functions for Reading File Formats

Table 2.4 Some Arithmetic, Logical, and Relational Operators in R and Their ...

Table 2.5 Some R Commands Useful for Matrix Calculations and Manipulations

Table 2.6 Some Useful R Commands or Functions

Chapter 3

Table 3.1 Notation for the Data Used in Simple Regression and Correlation

Table 3.2 Algebraic Signs of the Quantities and

Table 3.3 Data Set with a Perfect Nonlinear Relationship Between and , Ye...

Table 3.4 Length of Service Calls (in Minutes) and Number of Units Repaired...

Table 3.5 Quantities Needed for Computation of Correlation Coefficient Betwe...

Table 3.6 Fitted Values, , and Ordinary Least Squares Residuals, , for Com...

Table 3.7 Standard Regression Output

Table 3.8 Regression Output for Computer Repair Data

Table 3.9 Regression Output When Is Regressed on for Labor Force Partici...

Chapter 4

Table 4.1 Notation for Data Used in Multiple Regression Analysis

Table 4.2 Description of Variables in Supervisor Performance Data

Table 4.3 Partial Residuals

Table 4.4 Regression Output for Supervisor Performance Data

Table 4.5 Analysis of Variance (ANOVA) Table in Multiple Regression

Table 4.6 Supervisor Performance Data: Analysis of Variance (ANOVA) Table

Table 4.7 Regression Output from the Regression of on and

Table 4.8 Analysis of Variance (ANOVA) Table in Simple Regression

Table 4.9 Regression Output When Is Regressed on for 20 Observations

Table 4.10 Regression Output When Is Regressed on for 18 Observations

Table 4.11 Regression Outputs for Salary Discriminating Data

Table 4.12 Regression Output When Salary Is Related to Four Predictor Variab...

Table 4.13 ANOVA Table When the Beginning Salary Is Regressed on Education

Table 4.14 Variables in the Cigarette Consumption Data

Chapter 5

Table 5.1 Hamilton's (1987) Data

Table 5.2 New York Rivers Data: The -Tests for the Individual Coefficients...

Table 5.3 New York Rivers Data: Standardized Residuals, , and Leverage Valu...

Table 5.4 New York Rivers Data. Influence Measures from Fitting Model (5.18)...

Table 5.5 Functions for Computing Regression Diagnostics

Table 5.6 Classification of the Five Points in Figure 5.15

Chapter 6

Table 6.1 Regression Equations for the Six Categories of Education and Manag...

Table 6.2 Regression Analysis of Salary Survey Data

Table 6.3 Regression Analysis of Salary Data: Expanded Model

Table 6.4 Regression Analysis of Salary Data: Expanded Model, Observation 33...

Table 6.5 Estimates of Base Salary Using the Nonadditive Model in (6.2)

Table 6.6 Data on Preemployment Testing Program

Table 6.7 Regression Results, Preemployment Data: Model 1

Table 6.8 Regression Results, Preemployment Data: Model 3

Table 6.9 Separate Regression Results

Table 6.10 Variable Descriptions in the Education Expenditures Data

Table 6.11 Regression Output from the Regression of the Weekly Wages, , on

Table 6.12 Some Regression Outputs When Fitting Three Models to the Car Data...

Table 6.13 Corn Yields by Fertilizer Group

Table 6.14 Variables for the Presidential Election Data (1916–1996)

Chapter 7

Table 7.1 Linearizable Simple Regression Functions with Corresponding Transf...

Table 7.2 Number of Surviving Bacteria (Units of 100)

Table 7.3 Estimated Regression Coefficients from Model (7.7)

Table 7.4 Estimated Regression Coefficients When Is Regressed on Time

Table 7.5 Transformations to Stabilize Variance

Table 7.6 Number of Injury Incidents and Proportion of Total Flights

Table 7.7 Estimated Regression Coefficients (When Is Regressed on )

Table 7.8 Estimated Regression Coefficients When Is Regressed on

Table 7.9 Number of Supervised Workers () and Supervisors () in 27 Industr...

Table 7.10 Estimated Regression Coefficients When Number of Supervisors () ...

Table 7.11 Estimated Regression Coefficients of the Original Equation When F...

Table 7.12 Estimated Regression Coefficients When Is Regressed on

Table 7.13 Estimated Regression Coefficients When Is Regressed on and ...

Table 7.14 Correlation Coefficient Between and for Some Values of

Table 7.15 Correlation Coefficient Between and for Some Values of

Table 7.16 Wind Chill Factor (F) for Various Values of Wind speed, , in Mi...

Table 7.17 Annual World Crude Oil Production in Millions of Barrels (1880–19...

Table 7.18 Average Price Per Megabyte in Dollars from 1988 to 1998

Chapter 8

Table 8.1 Variables in Cost of Education Survey

Table 8.2 State Expenditures on Education, Variable List

Table 8.3 Regression Results: State Expenditures on Education for the Year 1...

Table 8.4 Regression Results: State Expenditures on Education in 1975 , Ala...

Table 8.5 Weights for Weighted Least Squares

Table 8.6 OLS and WLS Coefficients for Education Data in 1975 , Alaska Omit...

Chapter 9

Table 9.1 Consumer Expenditure and Money Stock

Table 9.2 Results When Consumer Expenditure Is Regressed on Money Stock,

Table 9.3 Comparison of Regression Estimates

Table 9.4 Regression on Housing Starts () Versus Population ()

Table 9.5 Results of the Regression of Housing Starts () on Population () ...

Table 9.6 Ski Sales Versus PDI

Table 9.7 Ski Sales Versus PDI and Seasonal Variables

Chapter 10

Table 10.1 EEO Data: Regression Results

Table 10.2 Data Combinations for Three Predictor Variables

Table 10.3 Import Data (1949–1966): Regression Results

Table 10.4 Import Data (1949–1959): Regression Results

Table 10.5 Import Data (1949–1959): Regression Coefficients for All Possible...

Table 10.6 Regression Results for the Advertising Data

Table 10.7 Pairwise Correlation Coefficients for the Advertising Data

Table 10.8 Variance Inflation Factors for Three Data Sets

Table 10.9 Condition Indices for Three Data Sets

Table 10.10 Import Data (1949–1959): Eigenvalues and Corresponding Condition...

Table 10.11 Six Eigenvectors of the Correlation Matrix of the Predictors

Table 10.12 Variables for the Gasoline Consumption Data

Chapter 11

Table 11.1 The PCs for the Import Data (1949–1959)

Table 11.2 Advertising Data: The Eigenvalues and Corresponding Eigenvectors ...

Table 11.3 Regression Results Obtained from Fitting the Model in (11.10)

Table 11.4 Regression Results Obtained from Fitting the Model in (11.13)

Table 11.5 Regression Results of Import Data (1949–1959) with the Constraint...

Table 11.6 Regression Results When Fitting Model (11.16) to the Import Data ...

Table 11.7 Regression Results of Fitting Model (11.23) to the Import Data 19...

Table 11.8 Regression Results of Fitting Model (11.26) to the Import Data (1...

Table 11.9 Estimated Regression Coefficients for the Standardized and Origin...

Table 11.10 Response Variable and Set of Principal Components of Four Pred...

Table 11.11 Regression Results Using All Four PCs of Hald's Data

Table 11.12 Regression Results Using the First Three PCs of Hald's Data

Table 11.13 Ridge Estimates , as Functions of the Ridge Parameter , for th...

Table 11.14 Residual Sum of Squares, , and Variance Inflation Factors, , a...

Table 11.15 OLS and Ridge Estimates of the Regression Coefficients for IMPOR...

Table 11.16 Three Eigenvectors of the Correlation Matrix of the Three Predic...

Table 11.17 Regression Output from the Regression of on the Principal Comp...

Chapter 12

Table 12.1 Correlation Matrix for the Supervisor Performance Data

Table 12.2 Variables Selected by the Forward Selection Method

Table 12.3 Variables Selected by Backward Elimination Method

Table 12.4 Values of Statistic (All Possible Equations)

Table 12.5 Variables Selected on the Basis of Statistic

Table 12.6 Homicide Data: Description of Variables

Table 12.7 Homicide Data: The OLS Results from Fitting Model (12.13)

Table 12.8 Homicide Data: The Estimated Coefficients, Their -Tests, and the...

Table 12.9 Description of Variables, Means, and Standard Deviations, SD

Table 12.10 OLS Regression Output for the Air Pollution Data (15 Predictor V...

Table 12.11 OLS Regression Output for the Air Pollution Data (10 Predictor V...

Table 12.12 OLS Regression Output for the Air Pollution Data (Eight Predicto...

Table 12.13 List of Variables for Data in the file Property.Valuation.csv at...

Chapter 13

Table 13.1 Output from the Logistic Regression Using , , and

Table 13.2 Output From the Logistic Regression Using and

Table 13.3 Output from the Logistic Regression Using

Table 13.4 The AIC and BIC Criteria for Various Logistic Regression models

Table 13.5 Multinomial Logistic Regression Output with RW, SSPG, and IR (Bas...

Table 13.6 Multinomial Logistic Regression Output with SSPG and IR (Base Lev...

Table 13.7 Classification Table of Diabetes Data Using Multinomial Logistic ...

Table 13.8 Ordinal Logistic Regression Model (Proportional Odds) Using SSPG ...

Table 13.9 Classification Table of Diabetes Data Using Multinomial Logistic ...

Table 13.10 Results from the OLS Regression of on

Table 13.11 Classification of Observations by Fitted Values

Table 13.12 Field-Goal-Kicking Performances of the American Football League ...

Chapter 14

Table 14.1 Output From the Poisson Regression Using and

Table 14.2 Output From the Linear Regression Using and

Table 14.3 Data Illustrating Robust Regression

Table 14.4 Least Squares Quadratic Fit for the Data Set in Table 14.3

Table 14.5 Robust Regression Quadratic Fit for the Data Set in Table 14.3...

Table 14.6 Least Squares Regression of (PCB85) on (PCB84) for the Data in ...

Table 14.7 Least Squares Regression of (PCB85) on (PCB84) for the Data Set...

List of Illustrations

Chapter 1

Figure 1.1 A schematic illustration of the iterative nature of the regressio...

Figure 1.2 A flowchart illustrating the dynamic iterative regression process...

Chapter 3

Figure 3.1 Graphical illustration of the correlation coefficient.

Figure 3.2 Scatter plot of versus in Table 3.3.

Figure 3.3 Scatter plots of Anscombe's data with the fitted regression lines...

Figure 3.4 Computer repair data: scatter plot of minutes versus units.

Figure 3.5 Computer repair data: plot of minutes versus units with the fitt...

Figure 3.6 Graph of the probability density function of a -distribution. Th...

Figure 3.7 Graphical illustration of various quantities computed after fitti...

Chapter 5

Figure 5.1 Plot of the data with the least squares fitted line for the Ans...

Figure 5.2 Plot matrix for Hamilton's data with the pairwise correlation coe...

Figure 5.3 Rotating plot for Hamilton's data.

Figure 5.4 Two scatter plots of residuals versus illustrating violations o...

Figure 5.5 New York Rivers data: scatter plot of versus .

Figure 5.6 New York Rivers data: index plots of the standardized residuals,

Figure 5.7 New York Rivers data: index plots of influence measures: (a) Cook...

Figure 5.8 New York Rivers data: potential–residual plot.

Figure 5.9 Scatter plot of population size, , versus time, . The curve is ...

Figure 5.10 Rotating plot for the Scottish hills races data.

Figure 5.11 Scottish hills races data: added-variable plots for (a) Distance...

Figure 5.12 Scottish hills races data: residual plus component plots for (a)...

Figure 5.13 Scottish hills races data: potential–residual plot.

Figure 5.14 P-R plot used in Exercise 5.4.

Figure 5.15 Plot of versus , for distinct observations with the least s...

Chapter 6

Figure 6.1 Standardized residuals versus years of experience ().

Figure 6.2 Standardized residuals versus education-management categorical va...

Figure 6.3 Standardized residuals versus years of experience: expanded model...

Figure 6.4 Standardized residuals versus years of experience: expanded model...

Figure 6.5 Standardized residuals versus education-management categorical va...

Figure 6.6 Requirements for employment on pretest.

Figure 6.7 Standardized residuals versus test score: Model 1.

Figure 6.8 Standardized residuals versus test score: Model 3.

Figure 6.9 Standardized residuals versus race: Model 1.

Figure 6.10 Standardized residuals versus test: Model 1, minority only.

Figure 6.11 Standardized residuals versus test: Model 1, white only.

Chapter 7

Figure 7.1 Graphs of the linearizable function .

Figure 7.2 Graphs of the linearizable function .

Figure 7.3 Graphs of the linearizable function .

Figure 7.4 Graphs of the linearizable functions: (a) and (b) .

Figure 7.5 Plot of against time .

Figure 7.6 Plot of the standardized residuals from (7.7) against time .

Figure 7.7 Plot of against time .

Figure 7.8 Plot of the standardized residuals against time after transform...

Figure 7.9 An example of heteroscedastic residuals.

Figure 7.10 Plot of against .

Figure 7.11 Plot of the standardized residuals versus .

Figure 7.12 Plot of the standardized residuals from the regression of on

Figure 7.13 Number of supervisors () versus number supervised ().

Figure 7.14 Plot of the standardized residuals against when number of supe...

Figure 7.15 Plot of the standardized residuals against when is regressed...

Figure 7.16 Scatter plot of versus .

Figure 7.17 Plot of the standardized residuals against when is regressed...

Figure 7.18 Plot of standardized residuals against the fitted values when ...

Figure 7.19 Plot of standardized residuals against when is regressed on

Figure 7.20 Plot of standardized residuals against when is regressed on

Figure 7.21 Animals data: scatter plots of brain weight versus body weight....

Figure 7.22 Scatter plots of versus for various values of .

Chapter 8

Figure 8.1 Example of heteroscedastic residuals.

Figure 8.2 Nonconstant variance with replicated observations.

Figure 8.3 Plot of standardized residuals versus fitted values.

Figure 8.4 Plot of standardized residuals versus regions.

Figure 8.5 Plot of standardized residuals versus each of the predictor varia...

Figure 8.6 Plot of standardized residuals versus each of the predictor varia...

Figure 8.7 Plot of standardized residuals versus each of the predictor varia...

Figure 8.8 Plot of the standardized residuals versus fitted values (excludin...

Figure 8.9 Plot of the standardized residuals versus region (excluding Alask...

Figure 8.10 Standardized residuals versus fitted values for WLS solution.

Figure 8.11 Standardized residuals by geographic region for WLS solution.

Figure 8.12 Logistic response function.

Chapter 9

Figure 9.1 Index plot of the standardized residuals.

Figure 9.2 Index plot of standardized residuals after one iteration of the C...

Figure 9.3 Index plot of standardized residuals from the regression of on

Figure 9.4 Index plot of the standardized residuals from the regression of

Figure 9.5 Index plot of the standardized residuals. Quarters 1 and 4 are in...

Figure 9.6 Model for ski sales and PDI adjusted for season.

Figure 9.7 Index plot of the standardized residuals with seasonal variables ...

Chapter 10

Figure 10.1 Standardized residuals against fitted values of ACHV.

Figure 10.2 Pairwise scatter plots of the three predictor variables FAM, PEE...

Figure 10.3 Import data (1949–1966): index plot of the standardized residual...

Figure 10.4 Import data (1949–1959): index plot of the standardized residual...

Figure 10.5 Standardized residuals versus fitted values of sales.

Figure 10.6 Index plot of the standardized residuals.

Chapter 11

Figure 11.1 Index plot of the standardized residuals. Import data (1949–1959...

Figure 11.2 Standardized residuals against fitted values of Import data (194...

Figure 11.3 Scatter plots of versus each of the PCs of Hald's data.

Figure 11.4 Ridge trace: IMPORT data (1949–1959).

Chapter 12

Figure 12.1 Supervisor's Performance data: scatter plot of versus for su...

Figure 12.2 Air Pollution data: ridge traces for , , (the 15-variable mo...

Figure 12.3 Air Pollution data: ridge traces for , , (the 15-variable mo...

Figure 12.4 Air Pollution data: ridge traces for , , (the 15-variable mo...

Figure 12.5 Air Pollution data: ridge traces for , , (the 10-variable mo...

Figure 12.6 Air Pollution data: ridge traces for , , , and (the 10-va...

Chapter 13

Figure 13.1 Logistic response function.

Figure 13.2 Bankruptcy data: Index plot of , the standardized deviance resi...

Figure 13.3 Bankruptcy data: Index plot of , the scaled difference in the r...

Figure 13.4 Bankruptcy data: Index plot of , the change in the chi-squared ...

Figure 13.5 Side-by-side boxplots for the Diabetes data.

Chapter 14

Figure 14.1 Scatter plot of versus for the Data Set in Table 14.3.

Figure 14.2 Least squares and robust fits superposed on the scatter plot of

Figure 14.3 Least squares and robust fits superposed on scatter plot of (PC...



REGRESSION ANALYSIS BY EXAMPLE USING R

 

Sixth Edition

 

Ali S. Hadi

The American University in Cairo

Samprit Chatterjee

New York University


Copyright © 2024 by John Wiley & Sons, Inc. All rights reserved.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey. Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.

Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates in the United States and other countries and may not be used without written permission. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Cataloging-in-Publication Data applied for:

ISBN: 9781119830870 (HB); ePDF: 9781119830887; ePub: 9781119830894

Cover Design: Wiley

Cover Images: © Ali S. Hadi

 

Dedicated to:

The memory of my parents – A. S. H.

Allegra, Martha, and Rima – S. C.

 

 

 

It's a gift to be simple …

Old Shaker hymn

True knowledge is knowledge of why things are as they are, and not merely what they are.

Isaiah Berlin

PREFACE

I felt a great sense of sadness working alone on this edition of the book after Professor Samprit Chatterjee, my longtime teacher, mentor, friend, and co-author, passed away in April 2021. Our first paper was published in 1986 (Chatterjee and Hadi, 1986). Samprit and I also co-authored our 1988 book (Chatterjee and Hadi, 1988) as well as several other papers. My sincere condolences to his family and friends. May God rest his soul in peace.

Regression analysis has become one of the most widely used statistical tools for analyzing multifactor data. It is appealing because it provides a conceptually simple method for investigating functional relationships among variables. The standard approach in regression analysis is to take data, fit a model, and then evaluate the fit using summary statistics and tests such as the Durbin–Watson statistic. Our approach is broader. We view regression analysis as a set of data-analytic techniques that examine the interrelationships among a given set of variables. The emphasis is not on formal statistical tests and probability calculations; we argue instead for an informal analysis directed toward uncovering patterns in the data. We have attempted to write a book for readers with diverse backgrounds, and we have tried to emphasize the art of data analysis rather than the development of statistical theory.

The material presented is intended for anyone who is involved in analyzing data. The book should be helpful to those who have some knowledge of the basic concepts of statistics. In the university, it could be used as a text for a course on regression analysis for students whose specialization is not statistics but who nevertheless use regression analysis extensively in their work. For students whose major emphasis is statistics, and who take a course on regression analysis from a book at the level of Rao (1973), Seber (1977), or Sen and Srivastava (1990), this book can be used to balance and complement the theoretical aspects of the subject with practical applications. Outside the university, this book can be profitably used by those whose present approach to analyzing multifactor data consists of looking at standard computer output (, standard errors, etc.), but who want to go beyond these summaries for a more thorough analysis.

We utilize most standard and some not-so-standard summary statistics on the basis of their intuitive appeal. We rely heavily on graphical representations of the data and employ many variations of plots of regression residuals. We are not overly concerned with precise probability evaluations. Graphical methods for exploring residuals can suggest model deficiencies or point to troublesome observations. Upon further investigation into their origin, the troublesome observations often turn out to be more informative than the well-behaved observations. We often find that more information is obtained from a quick examination of a plot of residuals than from a formal test of statistical significance of some limited null hypothesis. In short, the presentation in the chapters of this book is guided by the principles and concepts of exploratory data analysis.

As we mentioned in previous editions, the statistical community has been most supportive, and we have benefitted greatly from their suggestions in improving the text. Our presentation of the various concepts and techniques of regression analysis relies on carefully developed examples. In each example, we have isolated one or two techniques and discussed them in some detail. The data were chosen to highlight the techniques being presented. Although when analyzing a given set of data it is usually necessary to employ many techniques, we have tried to choose the various data sets so that it would not be necessary to discuss the same technique more than once. Our hope is that after working through the book, the reader will be ready and able to analyze their data methodically, thoroughly, and confidently.

The emphasis in this book is on the analysis of data rather than on plugging numbers into formulas, tests of hypotheses, or confidence intervals. Therefore no attempt has been made to derive the techniques. Techniques are described, the required assumptions are given and, finally, the success of the technique in the particular example is assessed. Although derivations of the techniques are not included, we have tried to refer the reader in each case to sources in which such discussion is available. Our hope is that some of these sources will be followed up by the reader who wants a more thorough grounding in theory.

Recently there has been a qualitative change in the analysis of linear models: from model fitting to model building, from overall tests to clinical examinations of data, from macroscopic to microscopic analysis. To do this kind of analysis a computer is essential and, in previous editions, we assumed its availability; however, to keep the book accessible to a wide community, we did not wish to endorse or associate it with any of the commercially available statistical packages.

We are particularly heartened by the arrival of the language R, which is available on the Internet under the General Public License (GPL). The language has excellent computing and graphical features. It is also free! For these and other reasons, I decided to introduce and use R in this edition of the book to enable the readers to use R on their own datasets and reproduce the various types of graphs and analysis presented in this book. Although a knowledge of R would certainly be helpful, no prior knowledge of R is assumed.

Major changes have been made in streamlining the text, removing ambiguities, and correcting errors pointed out by readers and others detected by the authors. Chapter 2 is new in this edition. It gives a brief but, we believe, sufficient introduction to R that enables readers to use R to carry out the regression analysis computations as well as the graphical displays presented in this edition of the book. To help the readers out, we provide all the necessary R code in the new Chapter 2 and throughout the rest of the chapters. Section 5.11, about regression diagnostics in R, is new. New references have also been added. The index at the end of the book has been enhanced. The addition of the new chapter increased the number of pages. To offset this increase, data tables larger than 10 rows have been deleted from the book because the reader can obtain them in digital form from the Book's Website at http://www.aucegypt.edu/faculty/hadi/RABE6. This Website contains, among other things, all the data sets that are included in this book, the R code that is used to produce the graphs and tables in this book, and more. Also, the use of R enabled us to delete the statistical tables in the appendix because the reader can now use R to compute the p-values as well as the critical values of test statistics for any desired significance level, not just the customary ones such as 0.1, 0.05, and 0.01.
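
For instance (a sketch, not reproduced from the book, with arbitrary degrees of freedom and significance level), the base R functions qt() and pt() return critical values and p-values for the t distribution, making printed tables unnecessary:

```r
# Critical value of the t distribution with 20 degrees of freedom
# for a two-sided test at the 0.05 significance level
qt(0.975, df = 20)

# Two-sided p-value for an observed test statistic t = 2.5
2 * pt(2.5, df = 20, lower.tail = FALSE)
```

Analogous functions (qf/pf, qnorm/pnorm, qchisq/pchisq) cover the other test statistics used in the book.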

We have rewritten some of the exercises and added new ones at the end of the chapters. We feel that the exercises reinforce the understanding of the material in the preceding chapters. Also new to this edition: a Solutions Manual and PowerPoint files are available to instructors only, by contacting the authors at [email protected] or [email protected].

Previous editions of this book have been translated into Persian, Korean, and Chinese. We are grateful to the translators Prof. H. A. Niromand, Prof. Zhongguo Zheng, Prof. Kee Young Kim, Prof. Myoungshic Jhun, Prof. Hyuncheol Kang, and Prof. Seong Keon Lee. We are fortunate to have had assistance and encouragement from several friends, colleagues, and associates. Some of our colleagues and students at New York University, Cornell University, and The American University in Cairo have used portions of the material in their courses and have shared with us their comments and the comments of their students. Special thanks are due to our friend and former colleague Jeffrey Simonoff (New York University) for comments, suggestions, and general help. The students in our classes on regression analysis have all contributed by asking penetrating questions and demanding meaningful and understandable answers. Our special thanks go to Nedret Billor (Cukurova University, Turkey) and Sahar El-Sheneity (Cornell University) for their very careful reading of an earlier edition of this book.

We also appreciate the comments provided by Habibollah Esmaily, Hassan Doosti, Fengkai Yang, Mamunur Rashid, Saeed Hajebi, Zheng Zhongguo, Robert W. Hayden, Marie Duggan, Sungho Lee, Hock Lin (Andy) Tai, and Junchang Ju. We also thank Lamia Abdellatif for proofreading parts of this edition, Dimple Philip for preparing the Latex style files and the corresponding PDF version, Dean Gonzalez for helping with the production of some of the figures, and Michael New for helping with the front and back covers.

ALI S. HADI

Cairo, Egypt

September 2023

ABOUT THE COMPANION WEBSITE

This book is accompanied by a companion website.

www.wiley.com/go/hadi/regression_analysis_6e

This website includes:

Table of contents

Preface

Book cover

Places where you can purchase the book

Data sets

Stata, SAS or SPSS users

R users

Errata/Comments/Feedback

Solutions to Exercises

CHAPTER 1INTRODUCTION

1.1 WHAT IS REGRESSION ANALYSIS?

Regression analysis is a conceptually simple method for investigating functional relationships among variables. A real estate appraiser may wish to relate the sale price of a home to selected physical characteristics of the building and the taxes (local, school, county) paid on the building. We may wish to examine whether cigarette consumption is related to various socioeconomic and demographic variables such as age, education, income, and price of cigarettes. The relationship is expressed in the form of an equation or a model connecting the response or dependent variable and one or more explanatory or predictor variables. In the cigarette consumption example, the response variable is cigarette consumption (measured by the number of packs of cigarettes sold in a given state on a per capita basis during a given year) and the explanatory or predictor variables are the various socioeconomic and demographic variables. In the real estate appraisal example, the response variable is the price of a home and the explanatory or predictor variables are the characteristics of the building and the taxes paid on the building.

We denote the response variable by Y and the set of predictor variables by X1, X2, ..., Xp, where p denotes the number of predictor variables. The true relationship between Y and X1, X2, ..., Xp can be approximated by the regression model

Y = f(X1, X2, ..., Xp) + ε,    (1.1)

where ε is assumed to be a random error representing the discrepancy in the approximation. It accounts for the failure of the model to fit the data exactly. The function f describes the relationship between Y and X1, X2, ..., Xp. An example is the linear regression model

Y = β0 + β1X1 + β2X2 + ... + βpXp + ε,    (1.2)

where β0, β1, ..., βp, called the regression parameters or coefficients, are unknown constants to be determined (estimated) from the data. We follow the commonly used notational convention of denoting unknown parameters by Greek letters.
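
As a preview of later chapters, a linear regression model of this form can be fitted in R with the built-in lm() function. The sketch below uses simulated data with made-up coefficient values (1, 2, and -3), not data from the book:

```r
# Simulate n = 100 observations from the linear model
# Y = 1 + 2*X1 - 3*X2 + e, where e is a normal random error
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n, sd = 0.5)

# Fit the model; coef() returns the estimated regression coefficients,
# which should be close to the true values 1, 2, and -3
fit <- lm(y ~ x1 + x2)
coef(fit)
```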

The predictor or explanatory variables are also called by other names such as independent variables, covariates, regressors, factors, and carriers. The name independent variable, though commonly used, is the least preferred, because in practice the predictor variables are rarely independent of each other.

1.2 PUBLICLY AVAILABLE DATA SETS

Regression analysis has numerous areas of applications. A partial list would include economics, finance, business, law, meteorology, medicine, biology, chemistry, engineering, physics, education, sports, history, sociology, and psychology. A few examples of such applications are given in Section 1.3. Regression analysis is learned most effectively by analyzing data that are of direct interest to the reader. We invite the readers to think about questions (in their own areas of work, research, or interest) that can be addressed using regression analysis. Readers should collect the relevant data and then apply the regression analysis techniques presented in this book to their own data. To help the reader locate real-life data, this section provides some sources and links to a wealth of data sets that are available for public use.

A number of data sets are available in books and on the Internet. The book by Hand et al. (1994) contains data sets from many fields. These data sets are small in size and are suitable for use as exercises. The book by Chatterjee et al. (1995) provides numerous data sets from diverse fields. The data are included in a diskette that comes with the book and can also be found at the Website.1

Data sets are also available on the Internet at many other sites. Some of the Websites given below allow direct copying and pasting into the statistical package of choice, while others require downloading the data file and then importing it into a statistical package. Some of these sites also contain further links to yet other data sets or statistics-related Websites.

The Data and Story Library (DASL, pronounced “dazzle”) is one of the most interesting sites that contains a number of data sets accompanied by the “story” or background associated with each data set. DASL is an online library2 of data files and stories that illustrate the use of basic statistical methods. The data sets cover a wide variety of topics. DASL comes with a powerful search engine to locate the story or data file of interest.

Another Website, which also contains data sets arranged by the method used in the analysis, is the Electronic Dataset Service.3 The site also contains many links to other data sources on the Internet.

Finally, this book has a Website,4 which contains, among other things, all the data sets that are included in this book and more.

1.3 SELECTED APPLICATIONS OF REGRESSION ANALYSIS

Regression analysis is one of the most widely used statistical tools because it provides simple methods for establishing a functional relationship among variables. It has extensive applications in many subject areas. The cigarette consumption and the real estate appraisal, mentioned above, are but two examples. In this section, we give a few additional examples demonstrating the wide applicability of regression analysis in real-life situations. Some of the data sets described here will be used later in the book to illustrate regression techniques or in the exercises at the end of various chapters.

1.3.1 Agricultural Sciences

The Dairy Herd Improvement Cooperative (DHI) in upstate New York collects and analyzes data on milk production. One question of interest here is how to develop a suitable model to predict current milk production from a set of measured variables. The response variable (current milk production in pounds) and the predictor variables are given in Table 1.1. Samples are taken once a month during milking. The period during which a cow gives milk is called lactation. Number of lactations is the number of times a cow has calved or given milk. The recommended management practice is to have the cow produce milk for about 305 days and then allow a 60-day rest period before beginning the next lactation. The data set, consisting of 199 observations, was compiled from the DHI milk production records. The Milk Production data can be found at the Book's Website.

Table 1.1 Variables in Milk Production Data

Variable

Definition

Current

Current month milk production in pounds

Previous

Previous month milk production in pounds

Fat

Percent of fat in milk

Protein

Percent of protein in milk

Days

Number of days since present lactation

Lactation

Number of lactations

I79

Indicator variable (0 if Days < 79 and 1 if Days ≥ 79)

1.3.2 Industrial and Labor Relations

In 1947, the United States Congress passed the Taft–Hartley Amendments to the Wagner Act. The original Wagner Act had permitted the unions to use a Closed Shop Contract5 unless prohibited by state law. The Taft–Hartley Amendments made the use of Closed Shop Contract illegal and gave individual states the right to prohibit union shops6 as well. These right-to-work laws have caused a wave of concern throughout the labor movement. A question of interest here is: What are the effects of these laws on the cost of living for a four-person family living on an intermediate budget in the United States? To answer this question a data set consisting of 38 geographic locations has been assembled from various sources. The variables used are defined in Table 1.2. The Right-To-Work Laws data can be found at the Book's Website.

Table 1.2 Variables in Right-To-Work Laws Data

Variable

Definition

COL

Cost of living for a four-person family

PD

Population density (person per square mile)

URate

State unionization rate in 1978

Pop

Population in 1975

Taxes

Property taxes in 1972

Income

Per capita income in 1974

RTWL

Indicator variable (1 if there are right-to-work laws in the state and 0 otherwise)

1.3.3 Government

Information about domestic immigration (the movement of people from one state or area of a country to another) is important to state and local governments. It is of interest to build a model that predicts domestic immigration, or to answer the question of why people leave one place to go to another. There are many factors that influence domestic immigration, such as weather conditions, crime, taxes, and unemployment rates. A data set for the 48 contiguous states has been created. Alaska and Hawaii are excluded from the analysis because the environments of these states are significantly different from those of the other 48, and their locations present certain barriers to immigration. The response variable here is net domestic immigration, which represents the net movement of people into and out of a state over the period 1990–1994 divided by the population of the state. Eleven predictor variables thought to influence domestic immigration are defined in Table 1.3 and can be found at the Book's Website.

1.3.4 History

A question of historical interest is how to estimate the age of historical objects based on some age-related characteristics of the objects. For example, the variables in Table 1.4 can be used to estimate the age of Egyptian skulls. Here the response variable is Year and the other four variables are possible predictors. There are 150 observations in this data set. The original source of the data is Thomson and Randall-Maciver (1905), but they can be found in Hand et al. (1994, pp. 299–301). An analysis of the data can be found in Manly (1986). The Egyptian Skulls data can be found at the Book's Website.

Table 1.3 Variables in Study of Domestic Immigration

Variable

Definition

State

State name

NDIR

Net domestic immigration rate over the period 1990–1994

Unemp

Unemployment rate in the civilian labor force in 1994

Wage

Average hourly earnings of production workers in manufacturing in 1994

Crime

Violent crime rate per 100,000 people in 1993

Income

Median household income in 1994

Metrop

Percentage of state population living in metropolitan areas in 1992

Poor

Percentage of population who fall below the poverty level in 1994

Taxes

Total state and local taxes per capita in 1993

Educ

Percentage of population 25 years or older who have a high school degree or higher in 1990

BusFail

The number of business failures divided by the population of the state in 1993

Temp

Average of the 12 monthly average temperatures (in degrees Fahrenheit) for the state in 1993

Region

Region in which the state is located (northeast, south, midwest, west)

Table 1.4 Variables in Egyptian Skulls Data

Variable

Definition

Year

Approximate year of skull formation (negative = B.C.; positive = A.D.)

MB

Maximum breadth of skull

BH

Basibregmatic height of skull

BL

Basialveolar length of skull

NH

Nasal Height of skull

Table 1.5 Variables in Study of Water Pollution in New York Rivers

Variable

Definition

Y

Mean nitrogen concentration (mg/liter) based on samples taken at regular intervals during the spring, summer, and fall months

X1

Agriculture: percentage of land area currently in agricultural use

X2

Forest: percentage of forest land

X3

Residential: percentage of land area in residential use

X4

Commercial/Industrial: percentage of land area in either commercial or industrial use

1.3.5 Environmental Sciences

In a 1976 study exploring the relationship between water quality and land use, Haith (1976) obtained the measurements (shown in Table 1.5) on 20 river basins in New York State. A question of interest here is how the land use around a river basin contributes to the water pollution as measured by the mean nitrogen concentration (mg/liter). The dataset can be found at the Book's Website.

1.3.6 Industrial Production

Nambe Mills in Santa Fe, New Mexico, makes a line of tableware produced by sand casting a special alloy of metals. After casting, the pieces go through a series of shaping, grinding, buffing, and polishing steps. Data were collected for 59 items produced by the company. The relationship between the polishing time and the product diameter and type (Bowl, Casserole, Dish, Tray, and Plate) is used to estimate the polishing time for new products that are designed or suggested for design and manufacture. The variables representing product types are coded as binary variables (1 corresponds to the type and 0 otherwise). Diam is the diameter of the item (in inches), polishing time is measured in minutes, and price in dollars. The polishing time is the major item in the cost of the product. The production decision will be based on the estimated time of polishing. The data were obtained from the DASL library and can be found there and also in the file Industrial.Production.csv at the Book's Website.

1.3.7 The Space Shuttle Challenger

The explosion of the space shuttle Challenger in 1986, killing its crew, was a shattering tragedy, and a Presidential Commission was appointed to look into the case. The O-rings in the booster rockets used in space launching play a very important part in launch safety. The rigidity of the O-rings is thought to be affected by the temperature at launching. There are six O-rings in a booster rocket. The data consist of two variables: the number of rings damaged and the temperature at launching for each of the 23 flights. The data set can be found at the Book's Website. The analysis performed before the launch did not include the launches in which no O-ring was damaged and came to the wrong conclusion. A detailed discussion of the problem is found in The Flight of the Space Shuttle Challenger in Chatterjee et al. (1995, pp. 33–35). Note here that the response variable is a proportion bounded between 0 and 1.
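
Because the response is a count of damaged rings out of six, a binomial logistic regression is one natural way to model it (this is the topic of a later chapter). The temperatures and damage counts below are made up for illustration; they are not the actual Challenger data:

```r
# Hypothetical launches: temperature (degrees F) and number of
# O-rings damaged out of the six on the booster rocket
temp    <- c(53, 57, 63, 66, 70, 75, 78, 81)
damaged <- c(3, 2, 1, 1, 0, 0, 0, 0)

# cbind(successes, failures) lets glm() model the proportion damaged/6
# as a function of temperature on the logistic scale
fit <- glm(cbind(damaged, 6 - damaged) ~ temp, family = binomial)
coef(fit)   # a negative temp coefficient means less damage when warmer
```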

1.3.8 Cost of Health Care

The cost of delivery of health care has become an important concern. Getting data on this topic is extremely difficult because it is highly proprietary. These data were collected by the Department of Health and Social Services of the State of New Mexico and cover 52 of the 60 licensed facilities in New Mexico in 1988. The variables in these data describe each facility's size, volume of usage, expenditures, and revenue. The location of the facility is also indicated: whether it is in a rural or nonrural area. Specific definitions of the variables are given in Table 1.6 and the data can be found at the Book's Website. There are several ways of looking at a body of data and extracting various kinds of information. For example: (a) Are rural facilities different from nonrural facilities? and (b) How do the hospital characteristics affect the total patient care revenue?

Table 1.6 Variables in Cost of Health Care Data

Variable

Definition

RURAL

Rural home (1) and nonrural home (0)

BED

Number of beds in home

MCDAYS

Annual medical in-patient days (hundreds)

TDAYS

Annual total patient days (hundreds)

PCREV

Annual total patient care revenue ($100)

NSAL

Annual nursing salaries ($100)

FEXP

Annual facilities expenditures ($100)

NETREV

PCREV – NSAL – FEXP

1.4 STEPS IN REGRESSION ANALYSIS

Regression analysis includes the following steps:

Statement of the problem

Selection of potentially relevant variables

Data collection

Model specification

Choice of fitting method

Model fitting

Model validation and criticism

Using the chosen model(s) for the solution of the posed problem.

These steps are examined below.

1.4.1 Statement of the Problem

Regression analysis usually starts with a formulation of the problem. This includes the determination of the question(s) to be addressed by the analysis. The problem statement is the first and perhaps the most important step in regression analysis. It is important because an ill-defined problem or a misformulated question can lead to wasted effort. It can lead to the selection of an irrelevant set of variables or to a wrong choice of the statistical method of analysis. A question that is not carefully formulated can also lead to the wrong choice of a model. Suppose we wish to determine whether or not an employer is discriminating against a given group of employees, say women. Data on salary, qualifications, and gender are available from the company's records to address the issue of discrimination. There are several definitions of employment discrimination in the literature. For example, discrimination occurs when on the average (a) women are paid less than equally qualified men, or (b) women are more qualified than equally paid men. To answer the question: “On the average, are women paid less than equally qualified men?” we choose salary as a response variable, and qualification and gender as predictor variables. But to answer the question: “On the average, are women more qualified than equally paid men?” we choose qualification as a response variable and salary and gender as predictor variables; that is, the roles of the variables have been switched.
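
In R, the two formulations differ only in which variable stands on the left-hand side of the model formula. The data frame and variable names below are hypothetical, invented solely to make the switch of roles concrete:

```r
# Hypothetical employee records: salary, a qualification score, and gender
set.seed(2)
n <- 50
d <- data.frame(
  Salary        = rnorm(n, mean = 60000, sd = 8000),
  Qualification = rnorm(n, mean = 5, sd = 1),
  Gender        = factor(sample(c("F", "M"), n, replace = TRUE))
)

# (a) Are women paid less than equally qualified men?
#     Salary is the response; qualification and gender are the predictors.
fit_a <- lm(Salary ~ Qualification + Gender, data = d)

# (b) Are women more qualified than equally paid men?
#     The roles are switched: qualification is now the response.
fit_b <- lm(Qualification ~ Salary + Gender, data = d)
```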

1.4.2 Selection of Potentially Relevant Variables

The next step after the statement of the problem is to select a set of variables that are thought by the experts in the area of study to explain or predict the response variable. The response variable is denoted by Y and the explanatory or predictor variables are denoted by X1, X2, ..., Xp, where p denotes the number of predictor variables. An example of a response variable is the price of a single-family house in a given geographical area. A possible relevant set of predictor variables in this case is: area of the lot, area of the house, age of the house, number of bedrooms, number of bathrooms, type of neighborhood, style of the house, amount of real estate taxes, and so forth.

1.4.3 Data Collection

The next step after the selection of potentially relevant variables is to collect the data from the environment under study to be used in the analysis. Sometimes the data are collected in a controlled setting so that factors that are not of primary interest can be held constant. More often the data are collected under nonexperimental conditions where very little can be controlled by the investigator. In either case, the collected data consist of observations on n subjects. Each of these observations consists of measurements for each of the potentially relevant variables. The data are usually recorded as in Table 1.7. A column in Table 1.7 represents a variable, whereas a row represents an observation, which is a set of values for a single subject (e.g., a house); one value for the response variable and one value for each of the predictors. The notation xij refers to the ith value of the jth variable. The first subscript refers to the observation number and the second refers to the variable number.
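
In R, a data frame mirrors the layout of Table 1.7: each column is a variable and each row is an observation, so indexing with [i, j] retrieves the value of the jth variable for the ith observation. The tiny data frame below is illustrative only:

```r
# Rows are observations; the columns are the response Y and predictors X1, X2
d <- data.frame(
  Y  = c(10, 12, 9),
  X1 = c(1.5, 2.0, 1.1),
  X2 = c(7, 8, 6)
)

# d[i, j]: value of the jth variable for the ith observation
d[2, 3]       # second observation, third variable (X2): 8
d[2, "X2"]    # the same value, selected by variable name
```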

Each of the variables in Table 1.7 can be classified as either quantitative or qualitative. Examples of quantitative variables are the house price, number of bedrooms, age, and taxes. Examples of qualitative variables are neighborhood type (e.g., good or bad neighborhood) and house style (e.g., ranch, colonial, etc.). In this book we deal mainly with the cases where the response variable is quantitative. A technique used in cases where the response variable is binary7 is called logistic regression. This is introduced in Chapter 13