Regression Analysis By Example Using R

A STRAIGHTFORWARD AND CONCISE DISCUSSION OF THE ESSENTIALS OF REGRESSION ANALYSIS

In the newly revised sixth edition of Regression Analysis By Example Using R, distinguished statistician Dr. Ali S. Hadi delivers an expanded and thoroughly updated discussion of exploratory data analysis using regression analysis in R. The book provides in-depth treatments of regression diagnostics, transformation, multicollinearity, logistic regression, and robust regression. The author clearly demonstrates effective methods of regression analysis with examples that contain the types of data irregularities commonly encountered in the real world. This newest edition also offers a brand-new, easy-to-read chapter on the freely available statistical software package R. Readers will also find:

* Reorganized, expanded, and upgraded exercises at the end of each chapter, with an emphasis on data analysis
* Updated data sets and examples throughout the book
* Complimentary access to a companion website that provides data sets in xlsx, csv, and txt formats

Perfect for upper-level undergraduate or beginning graduate students in statistics, mathematics, biostatistics, and computer science programs, Regression Analysis By Example Using R will also benefit readers who need a reference for quick updates on regression methods and applications.
Page count: 730
Year of publication: 2023
COVER
TABLE OF CONTENTS
TITLE PAGE
COPYRIGHT
DEDICATION
PREFACE
ABOUT THE COMPANION WEBSITE
CHAPTER 1: INTRODUCTION
1.1 WHAT IS REGRESSION ANALYSIS?
1.2 PUBLICLY AVAILABLE DATA SETS
1.3 SELECTED APPLICATIONS OF REGRESSION ANALYSIS
1.4 STEPS IN REGRESSION ANALYSIS
1.5 SCOPE AND ORGANIZATION OF THE BOOK
NOTES
CHAPTER 2: A BRIEF INTRODUCTION TO R
2.1 WHAT IS R AND RSTUDIO?
2.2 INSTALLING R AND RSTUDIO
2.3 GETTING STARTED WITH R
2.4 DATA VALUES AND OBJECTS IN R
2.5 R PACKAGES (LIBRARIES)
2.6 IMPORTING (READING) DATA INTO R WORKSPACE
2.7 WRITING (EXPORTING) DATA TO FILES
2.8 SOME ARITHMETIC AND OTHER OPERATORS
2.9 PROGRAMMING IN R
2.10 BIBLIOGRAPHIC NOTES
NOTE
CHAPTER 3: SIMPLE LINEAR REGRESSION
3.1 INTRODUCTION
3.2 COVARIANCE AND CORRELATION COEFFICIENT
3.3 EXAMPLE: COMPUTER REPAIR DATA
3.4 THE SIMPLE LINEAR REGRESSION MODEL
3.5 PARAMETER ESTIMATION
3.6 TESTS OF HYPOTHESES
3.7 CONFIDENCE INTERVALS
3.8 PREDICTIONS
3.9 MEASURING THE QUALITY OF FIT
3.10 REGRESSION LINE THROUGH THE ORIGIN
3.11 TRIVIAL REGRESSION MODELS
3.12 BIBLIOGRAPHIC NOTES
NOTES
CHAPTER 4: MULTIPLE LINEAR REGRESSION
4.1 INTRODUCTION
4.2 DESCRIPTION OF THE DATA AND MODEL
4.3 EXAMPLE: SUPERVISOR PERFORMANCE DATA
4.4 PARAMETER ESTIMATION
4.5 INTERPRETATIONS OF REGRESSION COEFFICIENTS
4.6 CENTERING AND SCALING
4.7 PROPERTIES OF THE LEAST SQUARES ESTIMATORS
4.8 MULTIPLE CORRELATION COEFFICIENT
4.9 INFERENCE FOR INDIVIDUAL REGRESSION COEFFICIENTS
4.10 TESTS OF HYPOTHESES IN A LINEAR MODEL
4.11 PREDICTIONS
4.12 SUMMARY
NOTES
CHAPTER 5: REGRESSION DIAGNOSTICS: DETECTION OF MODEL VIOLATIONS
5.1 INTRODUCTION
5.2 THE STANDARD REGRESSION ASSUMPTIONS
5.3 VARIOUS TYPES OF RESIDUALS
5.4 GRAPHICAL METHODS
5.5 GRAPHS BEFORE FITTING A MODEL
5.6 GRAPHS AFTER FITTING A MODEL
5.7 CHECKING LINEARITY AND NORMALITY ASSUMPTIONS
5.8 LEVERAGE, INFLUENCE, AND OUTLIERS
5.9 MEASURES OF INFLUENCE
5.10 THE POTENTIAL–RESIDUAL PLOT
5.11 REGRESSION DIAGNOSTICS IN R
5.12 WHAT TO DO WITH THE OUTLIERS?
5.13 ROLE OF VARIABLES IN A REGRESSION EQUATION
5.14 EFFECTS OF AN ADDITIONAL PREDICTOR
5.15 ROBUST REGRESSION
NOTES
CHAPTER 6: QUALITATIVE VARIABLES AS PREDICTORS
6.1 INTRODUCTION
6.2 SALARY SURVEY DATA
6.3 INTERACTION VARIABLES
6.4 SYSTEMS OF REGRESSION EQUATIONS: COMPARING TWO GROUPS
6.5 OTHER APPLICATIONS OF INDICATOR VARIABLES
6.6 SEASONALITY
6.7 STABILITY OF REGRESSION PARAMETERS OVER TIME
NOTES
CHAPTER 7: TRANSFORMATION OF VARIABLES
7.1 INTRODUCTION
7.2 TRANSFORMATIONS TO ACHIEVE LINEARITY
7.3 BACTERIA DEATHS DUE TO X-RAY RADIATION
7.4 TRANSFORMATIONS TO STABILIZE VARIANCE
7.5 DETECTION OF HETEROSCEDASTIC ERRORS
7.6 REMOVAL OF HETEROSCEDASTICITY
7.7 WEIGHTED LEAST SQUARES
7.8 LOGARITHMIC TRANSFORMATION OF DATA
7.9 POWER TRANSFORMATION
7.10 SUMMARY
NOTES
CHAPTER 8: WEIGHTED LEAST SQUARES
8.1 INTRODUCTION
8.2 HETEROSCEDASTIC MODELS
8.3 TWO-STAGE ESTIMATION
8.4 EDUCATION EXPENDITURE DATA
8.5 FITTING A DOSE–RESPONSE RELATIONSHIP CURVE
NOTES
CHAPTER 9: THE PROBLEM OF CORRELATED ERRORS
9.1 INTRODUCTION: AUTOCORRELATION
9.2 CONSUMER EXPENDITURE AND MONEY STOCK
9.3 DURBIN–WATSON STATISTIC
9.4 REMOVAL OF AUTOCORRELATION BY TRANSFORMATION
9.5 ITERATIVE ESTIMATION WITH AUTOCORRELATED ERRORS
9.6 AUTOCORRELATION AND MISSING VARIABLES
9.7 ANALYSIS OF HOUSING STARTS
9.8 LIMITATIONS OF THE DURBIN–WATSON STATISTIC
9.9 INDICATOR VARIABLES TO REMOVE SEASONALITY
9.10 REGRESSING TWO TIME SERIES
NOTES
CHAPTER 10: ANALYSIS OF COLLINEAR DATA
10.1 INTRODUCTION
10.2 EFFECTS OF COLLINEARITY ON INFERENCE
10.3 EFFECTS OF COLLINEARITY ON FORECASTING
10.4 DETECTION OF COLLINEARITY
NOTES
CHAPTER 11: WORKING WITH COLLINEAR DATA
11.1 INTRODUCTION
11.2 PRINCIPAL COMPONENTS
11.3 COMPUTATIONS USING PRINCIPAL COMPONENTS
11.4 IMPOSING CONSTRAINTS
11.5 SEARCHING FOR LINEAR FUNCTIONS OF THE β'S
11.6 BIASED ESTIMATION OF REGRESSION COEFFICIENTS
11.7 PRINCIPAL COMPONENTS REGRESSION
11.8 REDUCTION OF COLLINEARITY IN THE ESTIMATION DATA
11.9 CONSTRAINTS ON THE REGRESSION COEFFICIENTS
11.10 PRINCIPAL COMPONENTS REGRESSION: A CAUTION
11.11 RIDGE REGRESSION
11.12 ESTIMATION BY THE RIDGE METHOD
11.13 RIDGE REGRESSION: SOME REMARKS
11.14 SUMMARY
11.15 BIBLIOGRAPHIC NOTES
NOTES
CHAPTER 12: VARIABLE SELECTION PROCEDURES
12.1 INTRODUCTION
12.2 FORMULATION OF THE PROBLEM
12.3 CONSEQUENCES OF VARIABLES DELETION
12.4 USES OF REGRESSION EQUATIONS
12.5 CRITERIA FOR EVALUATING EQUATIONS
12.6 COLLINEARITY AND VARIABLE SELECTION
12.7 EVALUATING ALL POSSIBLE EQUATIONS
12.8 VARIABLE SELECTION PROCEDURES
12.9 GENERAL REMARKS ON VARIABLE SELECTION METHODS
12.10 A STUDY OF SUPERVISOR PERFORMANCE
12.11 VARIABLE SELECTION WITH COLLINEAR DATA
12.12 THE HOMICIDE DATA
12.13 VARIABLE SELECTION USING RIDGE REGRESSION
12.14 SELECTION OF VARIABLES IN AN AIR POLLUTION STUDY
12.15 A POSSIBLE STRATEGY FOR FITTING REGRESSION MODELS
12.16 BIBLIOGRAPHIC NOTES
NOTES
CHAPTER 13: LOGISTIC REGRESSION
13.1 INTRODUCTION
13.2 MODELING QUALITATIVE DATA
13.3 THE LOGIT MODEL
13.4 EXAMPLE: ESTIMATING PROBABILITY OF BANKRUPTCIES
13.5 LOGISTIC REGRESSION DIAGNOSTICS
13.6 DETERMINATION OF VARIABLES TO RETAIN
13.7 JUDGING THE FIT OF A LOGISTIC REGRESSION
13.8 THE MULTINOMIAL LOGIT MODEL
13.9 CLASSIFICATION PROBLEM: ANOTHER APPROACH
NOTES
CHAPTER 14: FURTHER TOPICS
14.1 INTRODUCTION
14.2 GENERALIZED LINEAR MODEL
14.3 POISSON REGRESSION MODEL
14.4 INTRODUCTION OF NEW DRUGS
14.5 ROBUST REGRESSION
14.6 FITTING A QUADRATIC MODEL
14.7 DISTRIBUTION OF PCB IN U.S. BAYS
NOTES
REFERENCES
INDEX
END USER LICENSE AGREEMENT
Chapter 1
Table 1.1 Variables in Milk Production Data
Table 1.2 Variables in Right-To-Work Laws Data
Table 1.3 Variables in Study of Domestic Immigration
Table 1.4 Variables in Egyptian Skulls Data
Table 1.5 Variables in Study of Water Pollution in New York Rivers
Table 1.6 Variables in Cost of Health Care Data
Table 1.7 Notation for Data Used in Regression Analysis
Table 1.8 Various Classifications of Regression Analysis
Chapter 2
Table 2.1 Some Useful R Functions for Information About Data
Table 2.2 Some Useful R Functions for Testing Object Type
Table 2.3 Some Useful R Functions for Reading File Formats
Table 2.4 Some Arithmetic, Logical, and Relational Operators in R and Their ...
Table 2.5 Some R Commands Useful for Matrix Calculations and Manipulations
Table 2.6 Some Useful R Commands or Functions
Chapter 3
Table 3.1 Notation for the Data Used in Simple Regression and Correlation
Table 3.2 Algebraic Signs of the Quantities and
Table 3.3 Data Set with a Perfect Nonlinear Relationship Between and , Ye...
Table 3.4 Length of Service Calls (in Minutes) and Number of Units Repaired...
Table 3.5 Quantities Needed for Computation of Correlation Coefficient Betwe...
Table 3.6 Fitted Values, , and Ordinary Least Squares Residuals, , for Com...
Table 3.7 Standard Regression Output
Table 3.8 Regression Output for Computer Repair Data
Table 3.9 Regression Output When Is Regressed on for Labor Force Partici...
Chapter 4
Table 4.1 Notation for Data Used in Multiple Regression Analysis
Table 4.2 Description of Variables in Supervisor Performance Data
Table 4.3 Partial Residuals
Table 4.4 Regression Output for Supervisor Performance Data
Table 4.5 Analysis of Variance (ANOVA) Table in Multiple Regression
Table 4.6 Supervisor Performance Data: Analysis of Variance (ANOVA) Table
Table 4.7 Regression Output from the Regression of on and
Table 4.8 Analysis of Variance (ANOVA) Table in Simple Regression
Table 4.9 Regression Output When Is Regressed on for 20 Observations
Table 4.10 Regression Output When Is Regressed on for 18 Observations
Table 4.11 Regression Outputs for Salary Discriminating Data
Table 4.12 Regression Output When Salary Is Related to Four Predictor Variab...
Table 4.13 ANOVA Table When the Beginning Salary Is Regressed on Education
Table 4.14 Variables in the Cigarette Consumption Data
Chapter 5
Table 5.1 Hamilton's (1987) Data
Table 5.2 New York Rivers Data: The -Tests for the Individual Coefficients...
Table 5.3 New York Rivers Data: Standardized Residuals, , and Leverage Valu...
Table 5.4 New York Rivers Data. Influence Measures from Fitting Model (5.18)...
Table 5.5 Functions for Computing Regression Diagnostics
Table 5.6 Classification of the Five Points in Figure 5.15
Chapter 6
Table 6.1 Regression Equations for the Six Categories of Education and Manag...
Table 6.2 Regression Analysis of Salary Survey Data
Table 6.3 Regression Analysis of Salary Data: Expanded Model
Table 6.4 Regression Analysis of Salary Data: Expanded Model, Observation 33...
Table 6.5 Estimates of Base Salary Using the Nonadditive Model in (6.2)
Table 6.6 Data on Preemployment Testing Program
Table 6.7 Regression Results, Preemployment Data: Model 1
Table 6.8 Regression Results, Preemployment Data: Model 3
Table 6.9 Separate Regression Results
Table 6.10 Variable Descriptions in the Education Expenditures Data
Table 6.11 Regression Output from the Regression of the Weekly Wages, , on
Table 6.12 Some Regression Outputs When Fitting Three Models to the Car Data...
Table 6.13 Corn Yields by Fertilizer Group
Table 6.14 Variables for the Presidential Election Data (1916–1996)
Chapter 7
Table 7.1 Linearizable Simple Regression Functions with Corresponding Transf...
Table 7.2 Number of Surviving Bacteria (Units of 100)
Table 7.3 Estimated Regression Coefficients from Model (7.7)
Table 7.4 Estimated Regression Coefficients When Is Regressed on Time
Table 7.5 Transformations to Stabilize Variance
Table 7.6 Number of Injury Incidents and Proportion of Total Flights
Table 7.7 Estimated Regression Coefficients (When Is Regressed on )
Table 7.8 Estimated Regression Coefficients When Is Regressed on
Table 7.9 Number of Supervised Workers () and Supervisors () in 27 Industr...
Table 7.10 Estimated Regression Coefficients When Number of Supervisors () ...
Table 7.11 Estimated Regression Coefficients of the Original Equation When F...
Table 7.12 Estimated Regression Coefficients When Is Regressed on
Table 7.13 Estimated Regression Coefficients When Is Regressed on and ...
Table 7.14 Correlation Coefficient Between and for Some Values of
Table 7.15 Correlation Coefficient Between and for Some Values of
Table 7.16 Wind Chill Factor (F) for Various Values of Wind speed, , in Mi...
Table 7.17 Annual World Crude Oil Production in Millions of Barrels (1880–19...
Table 7.18 Average Price Per Megabyte in Dollars from 1988 to 1998
Chapter 8
Table 8.1 Variables in Cost of Education Survey
Table 8.2 State Expenditures on Education, Variable List
Table 8.3 Regression Results: State Expenditures on Education for the Year 1...
Table 8.4 Regression Results: State Expenditures on Education in 1975, Ala...
Table 8.5 Weights for Weighted Least Squares
Table 8.6 OLS and WLS Coefficients for Education Data in 1975, Alaska Omit...
Chapter 9
Table 9.1 Consumer Expenditure and Money Stock
Table 9.2 Results When Consumer Expenditure Is Regressed on Money Stock,
Table 9.3 Comparison of Regression Estimates
Table 9.4 Regression on Housing Starts () Versus Population ()
Table 9.5 Results of the Regression of Housing Starts () on Population () ...
Table 9.6 Ski Sales Versus PDI
Table 9.7 Ski Sales Versus PDI and Seasonal Variables
Chapter 10
Table 10.1 EEO Data: Regression Results
Table 10.2 Data Combinations for Three Predictor Variables
Table 10.3 Import Data (1949–1966): Regression Results
Table 10.4 Import Data (1949–1959): Regression Results
Table 10.5 Import Data (1949–1959): Regression Coefficients for All Possible...
Table 10.6 Regression Results for the Advertising Data
Table 10.7 Pairwise Correlation Coefficients for the Advertising Data
Table 10.8 Variance Inflation Factors for Three Data Sets
Table 10.9 Condition Indices for Three Data Sets
Table 10.10 Import Data (1949–1959): Eigenvalues and Corresponding Condition...
Table 10.11 Six Eigenvectors of the Correlation Matrix of the Predictors
Table 10.12 Variables for the Gasoline Consumption Data
Chapter 11
Table 11.1 The PCs for the Import Data (1949–1959)
Table 11.2 Advertising Data: The Eigenvalues and Corresponding Eigenvectors ...
Table 11.3 Regression Results Obtained from Fitting the Model in (11.10)
Table 11.4 Regression Results Obtained from Fitting the Model in (11.13)
Table 11.5 Regression Results of Import Data (1949–1959) with the Constraint...
Table 11.6 Regression Results When Fitting Model (11.16) to the Import Data ...
Table 11.7 Regression Results of Fitting Model (11.23) to the Import Data 19...
Table 11.8 Regression Results of Fitting Model (11.26) to the Import Data (1...
Table 11.9 Estimated Regression Coefficients for the Standardized and Origin...
Table 11.10 Response Variable and Set of Principal Components of Four Pred...
Table 11.11 Regression Results Using All Four PCs of Hald's Data
Table 11.12 Regression Results Using the First Three PCs of Hald's Data
Table 11.13 Ridge Estimates , as Functions of the Ridge Parameter , for th...
Table 11.14 Residual Sum of Squares, , and Variance Inflation Factors, , a...
Table 11.15 OLS and Ridge Estimates of the Regression Coefficients for IMPOR...
Table 11.16 Three Eigenvectors of the Correlation Matrix of the Three Predic...
Table 11.17 Regression Output from the Regression of on the Principal Comp...
Chapter 12
Table 12.1 Correlation Matrix for the Supervisor Performance Data
Table 12.2 Variables Selected by the Forward Selection Method
Table 12.3 Variables Selected by Backward Elimination Method
Table 12.4 Values of Statistic (All Possible Equations)
Table 12.5 Variables Selected on the Basis of Statistic
Table 12.6 Homicide Data: Description of Variables
Table 12.7 Homicide Data: The OLS Results from Fitting Model (12.13)
Table 12.8 Homicide Data: The Estimated Coefficients, Their -Tests, and the...
Table 12.9 Description of Variables, Means, and Standard Deviations, SD
Table 12.10 OLS Regression Output for the Air Pollution Data (15 Predictor V...
Table 12.11 OLS Regression Output for the Air Pollution Data (10 Predictor V...
Table 12.12 OLS Regression Output for the Air Pollution Data (Eight Predicto...
Table 12.13 List of Variables for Data in the file Property.Valuation.csv at...
Chapter 13
Table 13.1 Output from the Logistic Regression Using , , and
Table 13.2 Output From the Logistic Regression Using and
Table 13.3 Output from the Logistic Regression Using
Table 13.4 The AIC and BIC Criteria for Various Logistic Regression models
Table 13.5 Multinomial Logistic Regression Output with RW, SSPG, and IR (Bas...
Table 13.6 Multinomial Logistic Regression Output with SSPG and IR (Base Lev...
Table 13.7 Classification Table of Diabetes Data Using Multinomial Logistic ...
Table 13.8 Ordinal Logistic Regression Model (Proportional Odds) Using SSPG ...
Table 13.9 Classification Table of Diabetes Data Using Multinomial Logistic ...
Table 13.10 Results from the OLS Regression of on
Table 13.11 Classification of Observations by Fitted Values
Table 13.12 Field-Goal-Kicking Performances of the American Football League ...
Chapter 14
Table 14.1 Output From the Poisson Regression Using and
Table 14.2 Output From the Linear Regression Using and
Table 14.3 Data Illustrating Robust Regression
Table 14.4 Least Squares Quadratic Fit for the Data Set in Table 14.3
Table 14.5 Robust Regression Quadratic Fit for the Data Set in Table 14.3...
Table 14.6 Least Squares Regression of (PCB85) on (PCB84) for the Data in ...
Table 14.7 Least Squares Regression of (PCB85) on (PCB84) for the Data Set...
Chapter 1
Figure 1.1 A schematic illustration of the iterative nature of the regressio...
Figure 1.2 A flowchart illustrating the dynamic iterative regression process...
Chapter 3
Figure 3.1 Graphical illustration of the correlation coefficient.
Figure 3.2 Scatter plot of versus in Table 3.3.
Figure 3.3 Scatter plots of Anscombe's data with the fitted regression lines...
Figure 3.4 Computer repair data: scatter plot of minutes versus units.
Figure 3.5 Computer repair data: plot of minutes versus units with the fitt...
Figure 3.6 Graph of the probability density function of a -distribution. Th...
Figure 3.7 Graphical illustration of various quantities computed after fitti...
Chapter 5
Figure 5.1 Plot of the data with the least squares fitted line for the Ans...
Figure 5.2 Plot matrix for Hamilton's data with the pairwise correlation coe...
Figure 5.3 Rotating plot for Hamilton's data.
Figure 5.4 Two scatter plots of residuals versus illustrating violations o...
Figure 5.5 New York Rivers data: scatter plot of versus .
Figure 5.6 New York Rivers data: index plots of the standardized residuals,
Figure 5.7 New York Rivers data: index plots of influence measures: (a) Cook...
Figure 5.8 New York Rivers data: potential–residual plot.
Figure 5.9 Scatter plot of population size, , versus time, . The curve is ...
Figure 5.10 Rotating plot for the Scottish hills races data.
Figure 5.11 Scottish hills races data: added-variable plots for (a) Distance...
Figure 5.12 Scottish hills races data: residual plus component plots for (a)...
Figure 5.13 Scottish hills races data: potential–residual plot.
Figure 5.14 P-R plot used in Exercise 5.4.
Figure 5.15 Plot of versus , for distinct observations with the least s...
Chapter 6
Figure 6.1 Standardized residuals versus years of experience ().
Figure 6.2 Standardized residuals versus education-management categorical va...
Figure 6.3 Standardized residuals versus years of experience: expanded model...
Figure 6.4 Standardized residuals versus years of experience: expanded model...
Figure 6.5 Standardized residuals versus education-management categorical va...
Figure 6.6 Requirements for employment on pretest.
Figure 6.7 Standardized residuals versus test score: Model 1.
Figure 6.8 Standardized residuals versus test score: Model 3.
Figure 6.9 Standardized residuals versus race: Model 1.
Figure 6.10 Standardized residuals versus test: Model 1, minority only.
Figure 6.11 Standardized residuals versus test: Model 1, white only.
Chapter 7
Figure 7.1 Graphs of the linearizable function .
Figure 7.2 Graphs of the linearizable function .
Figure 7.3 Graphs of the linearizable function .
Figure 7.4 Graphs of the linearizable functions: (a) and (b) .
Figure 7.5 Plot of against time .
Figure 7.6 Plot of the standardized residuals from (7.7) against time .
Figure 7.7 Plot of against time .
Figure 7.8 Plot of the standardized residuals against time after transform...
Figure 7.9 An example of heteroscedastic residuals.
Figure 7.10 Plot of against .
Figure 7.11 Plot of the standardized residuals versus .
Figure 7.12 Plot of the standardized residuals from the regression of on
Figure 7.13 Number of supervisors () versus number supervised ().
Figure 7.14 Plot of the standardized residuals against when number of supe...
Figure 7.15 Plot of the standardized residuals against when is regressed...
Figure 7.16 Scatter plot of versus .
Figure 7.17 Plot of the standardized residuals against when is regressed...
Figure 7.18 Plot of standardized residuals against the fitted values when ...
Figure 7.19 Plot of standardized residuals against when is regressed on
Figure 7.20 Plot of standardized residuals against when is regressed on
Figure 7.21 Animals data: scatter plots of brain weight versus body weight....
Figure 7.22 Scatter plots of versus for various values of .
Chapter 8
Figure 8.1 Example of heteroscedastic residuals.
Figure 8.2 Nonconstant variance with replicated observations.
Figure 8.3 Plot of standardized residuals versus fitted values.
Figure 8.4 Plot of standardized residuals versus regions.
Figure 8.5 Plot of standardized residuals versus each of the predictor varia...
Figure 8.6 Plot of standardized residuals versus each of the predictor varia...
Figure 8.7 Plot of standardized residuals versus each of the predictor varia...
Figure 8.8 Plot of the standardized residuals versus fitted values (excludin...
Figure 8.9 Plot of the standardized residuals versus region (excluding Alask...
Figure 8.10 Standardized residuals versus fitted values for WLS solution.
Figure 8.11 Standardized residuals by geographic region for WLS solution.
Figure 8.12 Logistic response function.
Chapter 9
Figure 9.1 Index plot of the standardized residuals.
Figure 9.2 Index plot of standardized residuals after one iteration of the C...
Figure 9.3 Index plot of standardized residuals from the regression of on
Figure 9.4 Index plot of the standardized residuals from the regression of
Figure 9.5 Index plot of the standardized residuals. Quarters 1 and 4 are in...
Figure 9.6 Model for ski sales and PDI adjusted for season.
Figure 9.7 Index plot of the standardized residuals with seasonal variables ...
Chapter 10
Figure 10.1 Standardized residuals against fitted values of ACHV.
Figure 10.2 Pairwise scatter plots of the three predictor variables FAM, PEE...
Figure 10.3 Import data (1949–1966): index plot of the standardized residual...
Figure 10.4 Import data (1949–1959): index plot of the standardized residual...
Figure 10.5 Standardized residuals versus fitted values of sales.
Figure 10.6 Index plot of the standardized residuals.
Chapter 11
Figure 11.1 Index plot of the standardized residuals. Import data (1949–1959...
Figure 11.2 Standardized residuals against fitted values of Import data (194...
Figure 11.3 Scatter plots of versus each of the PCs of Hald's data.
Figure 11.4 Ridge trace: IMPORT data (1949–1959).
Chapter 12
Figure 12.1 Supervisor's Performance data: scatter plot of versus for su...
Figure 12.2 Air Pollution data: ridge traces for , , (the 15-variable mo...
Figure 12.3 Air Pollution data: ridge traces for , , (the 15-variable mo...
Figure 12.4 Air Pollution data: ridge traces for , , (the 15-variable mo...
Figure 12.5 Air Pollution data: ridge traces for , , (the 10-variable mo...
Figure 12.6 Air Pollution data: ridge traces for , , , and (the 10-va...
Chapter 13
Figure 13.1 Logistic response function.
Figure 13.2 Bankruptcy data: Index plot of , the standardized deviance resi...
Figure 13.3 Bankruptcy data: Index plot of , the scaled difference in the r...
Figure 13.4 Bankruptcy data: Index plot of , the change in the chi-squared ...
Figure 13.5 Side-by-side boxplots for the Diabetes data.
Chapter 14
Figure 14.1 Scatter plot of versus for the Data Set in Table 14.3.
Figure 14.2 Least squares and robust fits superposed on the scatter plot of
Figure 14.3 Least squares and robust fits superposed on scatter plot of (PC...
Sixth Edition
Ali S. Hadi
The American University in Cairo
Samprit Chatterjee
New York University
Copyright © 2024 by John Wiley & Sons, Inc. All rights reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey. Published simultaneously in Canada.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.
Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates in the United States and other countries and may not be used without written permission. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.
Library of Congress Cataloging-in-Publication Data applied for:
ISBN: 9781119830870 (HB); ePDF: 9781119830887; ePub: 9781119830894
Cover Design: Wiley
Cover Images: © Ali S. Hadi
The memory of my parents – A. S. H.
Allegra, Martha, and Rima – S. C.
It's a gift to be simple …
Old Shaker hymn
True knowledge is knowledge of why things are as they are, and not merely what they are.
Isaiah Berlin
I have felt a great sense of sadness working alone on this edition of the book since Professor Samprit Chatterjee, my longtime teacher, mentor, friend, and co-author, passed away in April 2021. Our first paper was published in 1986 (Chatterjee and Hadi, 1986). Samprit and I also co-authored our 1988 book (Chatterjee and Hadi, 1988) as well as several other papers. My sincere condolences to his family and friends. May God rest his soul in peace.
Regression analysis has become one of the most widely used statistical tools for analyzing multifactor data. It is appealing because it provides a conceptually simple method for investigating functional relationships among variables. The standard approach in regression analysis is to take data, fit a model, and then evaluate the fit using statistics such as t, F, R², and the Durbin–Watson statistic. Our approach is broader. We view regression analysis as a set of data analytic techniques that examine the interrelationships among a given set of variables. The emphasis is not on formal statistical tests and probability calculations. We argue for an informal analysis directed toward uncovering patterns in the data. We have attempted to write a book for readers with diverse backgrounds, and we have tried to put the emphasis on the art of data analysis rather than on the development of statistical theory.
The material presented is intended for anyone who is involved in analyzing data. The book should be helpful to those who have some knowledge of the basic concepts of statistics. In the university, it could be used as a text for a course on regression analysis for students whose specialization is not statistics but who nevertheless use regression analysis quite extensively in their work. For students whose major emphasis is statistics, and who take a course on regression analysis from a book at the level of Rao (1973), Seber (1977), or Sen and Srivastava (1990), this book can be used to balance and complement the theoretical aspects of the subject with practical applications. Outside the university, this book can be profitably used by those whose present approach to analyzing multifactor data consists of looking at standard computer output (t values, standard errors, etc.), but who want to go beyond these summaries for a more thorough analysis.
We utilize most standard and some not-so-standard summary statistics on the basis of their intuitive appeal. We rely heavily on graphical representations of the data and employ many variations of plots of regression residuals. We are not overly concerned with precise probability evaluations. Graphical methods for exploring residuals can suggest model deficiencies or point to troublesome observations. Upon further investigation into their origin, the troublesome observations often turn out to be more informative than the well-behaved ones. We often find that more information is obtained from a quick examination of a plot of residuals than from a formal test of statistical significance of some limited null hypothesis. In short, the presentation in the chapters of this book is guided by the principles and concepts of exploratory data analysis.
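As a small illustration of this graphical approach, a residual plot takes only a few lines in R. This is a sketch using the built-in cars data set (speed versus stopping distance) as a stand-in for the book's own data sets, which are available on the companion website:

```r
# Sketch: exploring standardized residuals graphically in base R.
# 'cars' ships with R; substitute your own data frame and formula.
fit <- lm(dist ~ speed, data = cars)   # ordinary least squares fit
res <- rstandard(fit)                  # standardized residuals

plot(fitted(fit), res,
     xlab = "Fitted values",
     ylab = "Standardized residuals")  # look for patterns, not p-values
abline(h = 0, lty = 2)                 # reference line at zero
```

A curved band or a funnel shape in such a plot suggests a model deficiency (nonlinearity or heteroscedasticity) long before any formal test is run.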
As we mentioned in previous editions, the statistical community has been most supportive, and we have benefitted greatly from their suggestions in improving the text. Our presentation of the various concepts and techniques of regression analysis relies on carefully developed examples. In each example, we have isolated one or two techniques and discussed them in some detail. The data were chosen to highlight the techniques being presented. Although when analyzing a given set of data it is usually necessary to employ many techniques, we have tried to choose the various data sets so that it would not be necessary to discuss the same technique more than once. Our hope is that after working through the book, the reader will be ready and able to analyze their data methodically, thoroughly, and confidently.
The emphasis in this book is on the analysis of data rather than on plugging numbers into formulas, tests of hypotheses, or confidence intervals. Therefore no attempt has been made to derive the techniques. Techniques are described, the required assumptions are given and, finally, the success of the technique in the particular example is assessed. Although derivations of the techniques are not included, we have tried to refer the reader in each case to sources in which such discussion is available. Our hope is that some of these sources will be followed up by the reader who wants a more thorough grounding in theory.
Recently there has been a qualitative change in the analysis of linear models: from model fitting to model building, from overall tests to clinical examinations of data, from macroscopic to microscopic analysis. To do this kind of analysis a computer is essential. In previous editions we assumed its availability, but we did not wish to endorse, or associate the book with, any of the commercially available statistical packages, in order to keep the book accessible to a wider community.
We are particularly heartened by the arrival of the language R, which is available on the Internet under the General Public License (GPL). The language has excellent computing and graphical features. It is also free! For these and other reasons, I decided to introduce and use R in this edition of the book to enable the readers to use R on their own datasets and reproduce the various types of graphs and analysis presented in this book. Although a knowledge of R would certainly be helpful, no prior knowledge of R is assumed.
Major changes have been made in streamlining the text, removing ambiguities, and correcting errors pointed out by readers and others detected by the authors. Chapter 2 is new in this edition. It gives a brief but, we believe, sufficient introduction to R that enables readers to use R to carry out the regression analysis computations as well as the graphical displays presented in this edition of the book. To help readers, we provide all the necessary R code in the new Chapter 2 and throughout the rest of the chapters. Section 5.11, about regression diagnostics in R, is new. New references have also been added, and the index at the end of the book has been enhanced. The addition of the new chapter increased the number of pages. To offset this increase, data tables larger than 10 rows have been deleted from the book because the reader can obtain them in digital form from the Book's Website at http://www.aucegypt.edu/faculty/hadi/RABE6. This Website contains, among other things, all the data sets that are included in this book, the R code that is used to produce the graphs and tables in this book, and more. Also, the use of R enabled us to delete the statistical tables in the appendix because the reader can now use R to compute the p-values as well as the critical values of test statistics for any desired significance level, not just the customary ones such as 0.1, 0.05, and 0.01.
We have rewritten some of the exercises and added new ones at the end of the chapters. We feel that the exercises reinforce the understanding of the material in the preceding chapters. Also new to this edition, a Solutions Manual and PowerPoint files are available to instructors only, by contacting the authors at [email protected] or [email protected].
Previous editions of this book have been translated into Persian, Korean, and Chinese. We are grateful to the translators Prof. H. A. Niromand, Prof. Zhongguo Zheng, Prof. Kee Young Kim, Prof. Myoungshic Jhun, Prof. Hyuncheol Kang, and Prof. Seong Keon Lee. We are fortunate to have had assistance and encouragement from several friends, colleagues, and associates. Some of our colleagues and students at New York University, Cornell University, and The American University in Cairo have used portions of the material in their courses and have shared with us their comments and those of their students. Special thanks are due to our friend and former colleague Jeffrey Simonoff (New York University) for comments, suggestions, and general help. The students in our classes on regression analysis have all contributed by asking penetrating questions and demanding meaningful and understandable answers. Our special thanks go to Nedret Billor (Cukurova University, Turkey) and Sahar El-Sheneity (Cornell University) for their very careful reading of an earlier edition of this book.
We also appreciate the comments provided by Habibollah Esmaily, Hassan Doosti, Fengkai Yang, Mamunur Rashid, Saeed Hajebi, Zheng Zhongguo, Robert W. Hayden, Marie Duggan, Sungho Lee, Hock Lin (Andy) Tai, and Junchang Ju. We also thank Lamia Abdellatif for proofreading parts of this edition, Dimple Philip for preparing the Latex style files and the corresponding PDF version, Dean Gonzalez for helping with the production of some of the figures, and Michael New for helping with the front and back covers.
ALI S. HADI
Cairo, Egypt
September 2023
This book is accompanied by a companion website.
www.wiley.com/go/hadi/regression_analysis_6e
This website includes:
Table of contents
Preface
Book cover
Places where you can purchase the book
Data sets
Stata, SAS or SPSS users
R users
Errata/Comments/Feedback
Solutions to Exercises
Regression analysis is a conceptually simple method for investigating functional relationships among variables. A real estate appraiser may wish to relate the sale price of a home to selected physical characteristics of the building and the taxes (local, school, county) paid on the building. We may wish to examine whether cigarette consumption is related to various socioeconomic and demographic variables such as age, education, income, and price of cigarettes. The relationship is expressed in the form of an equation or a model connecting the response or dependent variable and one or more explanatory or predictor variables. In the cigarette consumption example, the response variable is cigarette consumption (measured by the number of packs of cigarettes sold in a given state on a per capita basis during a given year) and the explanatory or predictor variables are the various socioeconomic and demographic variables. In the real estate appraisal example, the response variable is the price of a home and the explanatory or predictor variables are the characteristics of the building and the taxes paid on it.
We denote the response variable by Y and the set of predictor variables by X1, X2, …, Xp, where p denotes the number of predictor variables. The true relationship between Y and X1, X2, …, Xp can be approximated by the regression model

Y = f(X1, X2, …, Xp) + ε,

where ε is assumed to be a random error representing the discrepancy in the approximation. It accounts for the failure of the model to fit the data exactly. The function f(X1, X2, …, Xp) describes the relationship between Y and X1, X2, …, Xp. An example is the linear regression model

Y = β0 + β1X1 + β2X2 + ⋯ + βpXp + ε,

where β0, β1, …, βp, called the regression parameters or coefficients, are unknown constants to be determined (estimated) from the data. We follow the commonly used notational convention of denoting unknown parameters by Greek letters.
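As a concrete illustration, the linear regression model above can be fitted by least squares in R with the built-in lm() function. The data below are simulated; the variable names and the true coefficient values are invented for this sketch.

```r
# A minimal sketch: simulate data from a linear model with p = 2
# predictors and estimate the regression coefficients with lm().
set.seed(42)
n   <- 100
x1  <- runif(n, 0, 10)
x2  <- runif(n, 0, 10)
eps <- rnorm(n)                       # the random error term
y   <- 3 + 2 * x1 - 0.5 * x2 + eps    # true (in practice unknown) model
fit <- lm(y ~ x1 + x2)                # least squares fit
coef(fit)                             # estimates of beta0, beta1, beta2
```

With a sample this size, the estimated coefficients land close to the true values 3, 2, and −0.5 used in the simulation.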
The predictor or explanatory variables are also called by other names such as independent variables, covariates, regressors, factors, and carriers. The name independent variable, though commonly used, is the least preferred, because in practice the predictor variables are rarely independent of each other.
Regression analysis has numerous areas of applications. A partial list would include economics, finance, business, law, meteorology, medicine, biology, chemistry, engineering, physics, education, sports, history, sociology, and psychology. A few examples of such applications are given in Section 1.3. Regression analysis is learned most effectively by analyzing data that are of direct interest to the reader. We invite the readers to think about questions (in their own areas of work, research, or interest) that can be addressed using regression analysis. Readers should collect the relevant data and then apply the regression analysis techniques presented in this book to their own data. To help the reader locate real-life data, this section provides some sources and links to a wealth of data sets that are available for public use.
A number of data sets are available in books and on the Internet. The book by Hand et al. (1994) contains data sets from many fields. These data sets are small in size and are suitable for use as exercises. The book by Chatterjee et al. (1995) provides numerous data sets from diverse fields. The data are included in a diskette that comes with the book and can also be found at the Website.1
Data sets are also available on the Internet at many other sites. Some of the Websites given below allow the direct copying and pasting into the statistical package of choice, while others require downloading the data file and then importing them into a statistical package. Some of these sites also contain further links to yet other data sets or statistics-related Websites.
The Data and Story Library (DASL, pronounced “dazzle”) is one of the most interesting sites that contains a number of data sets accompanied by the “story” or background associated with each data set. DASL is an online library2 of data files and stories that illustrate the use of basic statistical methods. The data sets cover a wide variety of topics. DASL comes with a powerful search engine to locate the story or data file of interest.
Another Website, which also contains data sets arranged by the method used in the analysis, is the Electronic Dataset Service.3 The site also contains many links to other data sources on the Internet.
Finally, this book has a Website,4 which contains, among other things, all the data sets that are included in this book and more. These and other data sets can be found at the Book's Website.
Regression analysis is one of the most widely used statistical tools because it provides simple methods for establishing a functional relationship among variables. It has extensive applications in many subject areas. The cigarette consumption and real estate appraisal problems mentioned above are but two examples. In this section, we give a few additional examples demonstrating the wide applicability of regression analysis in real-life situations. Some of the data sets described here will be used later in the book to illustrate regression techniques or in the exercises at the end of various chapters.
The Dairy Herd Improvement Cooperative (DHI) in upstate New York collects and analyzes data on milk production. One question of interest here is how to develop a suitable model to predict current milk production from a set of measured variables. The response variable (current milk production in pounds) and the predictor variables are given in Table 1.1. Samples are taken once a month during milking. The period that a cow gives milk is called lactation. Number of lactations is the number of times a cow has calved or given milk. The recommended management practice is to have the cow produce milk for about 305 days and then allow a 60-day rest period before beginning the next lactation. The data set, consisting of 199 observations, was compiled from the DHI milk production records. The Milk Production data can be found at the Book's Website.
Table 1.1 Variables in Milk Production Data
Variable
Definition
Current
Current month milk production in pounds
Previous
Previous month milk production in pounds
Fat
Percent of fat in milk
Protein
Percent of protein in milk
Days
Number of days since present lactation
Lactation
Number of lactations
I79
Indicator variable (0 if Days ≤ 79 and 1 if Days > 79)
In 1947, the United States Congress passed the Taft–Hartley Amendments to the Wagner Act. The original Wagner Act had permitted the unions to use a Closed Shop Contract5 unless prohibited by state law. The Taft–Hartley Amendments made the use of Closed Shop Contract illegal and gave individual states the right to prohibit union shops6 as well. These right-to-work laws have caused a wave of concern throughout the labor movement. A question of interest here is: What are the effects of these laws on the cost of living for a four-person family living on an intermediate budget in the United States? To answer this question a data set consisting of 38 geographic locations has been assembled from various sources. The variables used are defined in Table 1.2. The Right-To-Work Laws data can be found at the Book's Website.
Table 1.2 Variables in Right-To-Work Laws Data
Variable
Definition
COL
Cost of living for a four-person family
PD
Population density (person per square mile)
URate
State unionization rate in 1978
Pop
Population in 1975
Taxes
Property taxes in 1972
Income
Per capita income in 1974
RTWL
Indicator variable (1 if there are right-to-work laws in the state and 0 otherwise)
Information about domestic immigration (the movement of people from one state or area of a country to another) is important to state and local governments. It is of interest to build a model that predicts domestic immigration, or to answer the question of why people leave one place to go to another. There are many factors that influence domestic immigration, such as weather conditions, crime, taxes, and unemployment rates. A data set for the 48 contiguous states has been created. Alaska and Hawaii are excluded from the analysis because the environments of these states are significantly different from the other 48, and their locations present certain barriers to immigration. The response variable here is net domestic immigration, which represents the net movement of people into and out of a state over the period 1990–1994, divided by the population of the state. Eleven predictor variables thought to influence domestic immigration are defined in Table 1.3, and the data can be found at the Book's Website.
A question of historical interest is how to estimate the age of historical objects based on some age-related characteristics of the objects. For example, the variables in Table 1.4 can be used to estimate the age of Egyptian skulls. Here the response variable is Year and the other four variables are possible predictors. There are 150 observations in this data set. The original source of the data is Thomson and Randall-Maciver (1905), but they can be found in Hand et al. (1994, pp. 299–301). An analysis of the data can be found in Manly (1986). The Egyptian Skulls data can be found at the Book's Website.
Table 1.3 Variables in Study of Domestic Immigration
Variable
Definition
State
State name
NDIR
Net domestic immigration rate over the period 1990–1994
Unemp
Unemployment rate in the civilian labor force in 1994
Wage
Average hourly earnings of production workers in manufacturing in 1994
Crime
Violent crime rate per 100,000 people in 1993
Income
Median household income in 1994
Metrop
Percentage of state population living in metropolitan areas in 1992
Poor
Percentage of population who fall below the poverty level in 1994
Taxes
Total state and local taxes per capita in 1993
Educ
Percentage of population 25 years or older who have a high school degree or higher in 1990
BusFail
The number of business failures divided by the population of the state in 1993
Temp
Average of the 12 monthly average temperatures (in degrees Fahrenheit) for the state in 1993
Region
Region in which the state is located (northeast, south, midwest, west)
Table 1.4 Variables in Egyptian Skulls Data
Variable
Definition
Year
Approximate year of skull formation (negative = B.C.; positive = A.D.)
MB
Maximum breadth of skull
BH
Basibregmatic height of skull
BL
Basialveolar length of skull
NH
Nasal Height of skull
Table 1.5 Variables in Study of Water Pollution in New York Rivers
Variable
Definition
Y
Mean nitrogen concentration (mg/liter) based on samples taken at regular intervals during the spring, summer, and fall months
X1
Agriculture: percentage of land area currently in agricultural use
X2
Forest: percentage of forest land
X3
Residential: percentage of land area in residential use
X4
Commercial/Industrial: percentage of land area in either commercial or industrial use
In a 1976 study exploring the relationship between water quality and land use, Haith (1976) obtained the measurements (shown in Table 1.5) on 20 river basins in New York State. A question of interest here is how the land use around a river basin contributes to the water pollution as measured by the mean nitrogen concentration (mg/liter). The dataset can be found at the Book's Website.
Nambe Mills in Santa Fe, New Mexico, makes a line of tableware produced by sand casting a special alloy of metals. After casting, the pieces go through a series of shaping, grinding, buffing, and polishing steps. Data were collected on 59 items produced by the company. The relationship between polishing time and the product's diameter and type (Bowl, Casserole, Dish, Tray, or Plate) is used to estimate the polishing time for new products that are designed or suggested for design and manufacture. The variables representing product types are coded as binary variables (1 if the item is of that type and 0 otherwise). Diam is the diameter of the item (in inches), polishing time is measured in minutes, and price in dollars. Polishing time is the major component of the cost of the product, and the production decision is based on the estimated polishing time. The data were obtained from the DASL library, where they can still be found, and are also in the file Industrial.Production.csv at the Book's Website.
The explosion of the space shuttle Challenger in 1986, which killed the crew, was a shattering tragedy, and a Presidential Commission was appointed to look into the case. The O-rings in the booster rockets used in space launching play a very important part in launch safety. The rigidity of the O-rings is thought to be affected by the temperature at launch. There are six O-rings in a booster rocket. The data consist of two variables: the number of O-rings damaged and the temperature at launch for each of 23 flights. The data set can be found at the Book's Website. The analysis performed before the launch excluded the flights in which no O-ring was damaged and came to the wrong conclusion. A detailed discussion of the problem is found in The Flight of the Space Shuttle Challenger in Chatterjee et al. (1995, pp. 33–35). Note here that the response variable is a proportion bounded between 0 and 1.
The cost of delivery of health care has become an important concern. Getting data on this topic is extremely difficult because it is highly proprietary. These data were collected by the Department of Health and Social Services of the State of New Mexico and cover 52 of the 60 licensed facilities in New Mexico in 1988. The variables in these data are characteristics that describe each facility's size, volume of usage, expenditures, and revenue. The location of the facility, whether rural or nonrural, is also indicated. Specific definitions of the variables are given in Table 1.6, and the data can be found at the Book's Website. There are several ways of looking at a body of data and extracting various kinds of information. For example: (a) Are rural facilities different from nonrural facilities? and (b) How do the hospital characteristics affect the total patient care revenue?
Table 1.6 Variables in Cost of Health Care Data
Variable
Definition
RURAL
Rural home (1) and nonrural home (0)
BED
Number of beds in home
MCDAYS
Annual medical in-patient days (hundreds)
TDAYS
Annual total patient days (hundreds)
PCREV
Annual total patient care revenue ($100)
NSAL
Annual nursing salaries ($100)
FEXP
Annual facilities expenditures ($100)
NETREV
PCREV – NSAL – FEXP
Regression analysis includes the following steps:
Statement of the problem
Selection of potentially relevant variables
Data collection
Model specification
Choice of fitting method
Model fitting
Model validation and criticism
Using the chosen model(s) for the solution of the posed problem.
These steps are examined below.
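As a rough preview, the model fitting, validation, and prediction steps listed above might look like the following in R. The data frame, variable names, and all numbers below are invented for illustration.

```r
# Hypothetical illustration of the last three steps: fit a model,
# criticize it, and use it. All values here are made up.
price <- c(245, 312, 279, 308, 199, 219, 405, 324, 319, 255)  # in $1000
area  <- c(1.4, 2.0, 1.6, 1.9, 1.1, 1.3, 2.6, 2.1, 2.0, 1.5)  # 1000 sq ft
houses <- data.frame(price, area)

fit <- lm(price ~ area, data = houses)         # model fitting
summary(fit)                                    # coefficients, R-squared, etc.
plot(fitted(fit), resid(fit))                   # residual plot for model criticism
predict(fit, newdata = data.frame(area = 1.8))  # using the model for prediction
```

In practice the validation step would involve the full range of diagnostic plots and checks discussed in later chapters, not a single residual plot.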
Regression analysis usually starts with a formulation of the problem. This includes the determination of the question(s) to be addressed by the analysis. The problem statement is the first and perhaps the most important step in regression analysis. It is important because an ill-defined problem or a misformulated question can lead to wasted effort: to the selection of an irrelevant set of variables, to a wrong choice of the statistical method of analysis, or to the wrong choice of a model. Suppose we wish to determine whether or not an employer is discriminating against a given group of employees, say women. Data on salary, qualifications, and gender are available from the company's records to address the issue of discrimination. There are several definitions of employment discrimination in the literature. For example, discrimination occurs when on the average (a) women are paid less than equally qualified men, or (b) women are more qualified than equally paid men. To answer the question "On the average, are women paid less than equally qualified men?" we choose salary as the response variable, and qualification and gender as predictor variables. But to answer the question "On the average, are women more qualified than equally paid men?" we choose qualification as the response variable and salary and gender as predictor variables; that is, the roles of the variables have been switched.
The next step after the statement of the problem is to select a set of variables that are thought by the experts in the area of study to explain or predict the response variable. The response variable is denoted by Y and the explanatory or predictor variables are denoted by X1, X2, …, Xp, where p denotes the number of predictor variables. An example of a response variable is the price of a single-family house in a given geographical area. A possible relevant set of predictor variables in this case is: area of the lot, area of the house, age of the house, number of bedrooms, number of bathrooms, type of neighborhood, style of the house, amount of real estate taxes, and so forth.
The next step after the selection of potentially relevant variables is to collect the data from the environment under study to be used in the analysis. Sometimes the data are collected in a controlled setting so that factors that are not of primary interest can be held constant. More often the data are collected under nonexperimental conditions where very little can be controlled by the investigator. In either case, the collected data consist of n observations, one for each of n subjects. Each of these observations consists of measurements for each of the potentially relevant variables. The data are usually recorded as in Table 1.7. A column in Table 1.7 represents a variable, whereas a row represents an observation, which is a set of values for a single subject (e.g., a house): one value for the response variable and one value for each of the predictors. The notation xij refers to the ith value of the jth variable. The first subscript refers to the observation number and the second to the variable number.
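The layout of Table 1.7, with rows as observations and columns as variables, maps directly onto an R data frame. The small data set below is invented for illustration; indexing with dat[i, j] plays the role of the subscript notation for the ith observation on the jth variable.

```r
# Data laid out as in Table 1.7: rows are observations, columns are
# variables. The values are made up for this sketch.
dat <- data.frame(
  Y  = c(62, 75, 58, 81),   # response, e.g., house price in $1000
  X1 = c(3, 4, 2, 4),       # e.g., number of bedrooms
  X2 = c(30, 45, 25, 50)    # e.g., age of the house in years
)
nrow(dat)          # n, the number of observations
ncol(dat) - 1      # p, the number of predictor variables
dat[2, "X2"]       # the value of variable X2 for observation 2
```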
Each of the variables in Table 1.7 can be classified as either quantitative or qualitative. Examples of quantitative variables are the house price, number of bedrooms, age, and taxes. Examples of qualitative variables are neighborhood type (e.g., good or bad neighborhood) and house style (e.g., ranch, colonial, etc.). In this book we deal mainly with the cases where the response variable is quantitative. A technique used in cases where the response variable is binary7 is called logistic regression. This is introduced in Chapter 13.
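As a brief preview of logistic regression, a binary response can be modeled in R with glm() and the binomial family. The data below (study hours and pass/fail outcomes) are entirely hypothetical.

```r
# A hypothetical logistic regression: binary response modeled with glm().
hours  <- c(0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5)  # hours of study
passed <- c(0, 0, 0, 0, 1, 0, 1, 1, 1, 1)            # 1 = pass, 0 = fail
fit <- glm(passed ~ hours, family = binomial)
coef(fit)                                            # coefficients on the log-odds scale
predict(fit, data.frame(hours = 3), type = "response")  # estimated P(pass)
```

Unlike lm(), the fitted values here are probabilities bounded between 0 and 1, which is exactly what a binary or proportion response requires.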