Applied Statistics

Dieter Rasch

Description

Instructs readers on how to use methods of statistics and experimental design with R software.

Applied statistics covers both the theory and the application of modern statistical and mathematical modelling techniques to applied problems in industry, public services, commerce, and research. It proceeds from a strong theoretical background, but it is practically oriented to develop one's ability to tackle new and non-standard problems confidently. Taking a practical approach to applied statistics, this user-friendly guide teaches readers how to use methods of statistics and experimental design without going deep into the theory. Applied Statistics: Theory and Problem Solutions with R includes chapters that cover R package sampling procedures, analysis of variance, point estimation, and more. It follows on the heels of Rasch and Schott's Mathematical Statistics, building on that book's theoretical background and taking the lessons learned there to another level by adding instructions on how to employ the methods using R. Two important chapters, on Generalised Linear Models and Spatial Statistics, have no counterpart in that theoretical background.

* Offers a practical over theoretical approach to the subject of applied statistics
* Provides a pre-experimental as well as post-experimental approach to applied statistics
* Features classroom-tested material
* Applicable to a wide range of people working in experimental design and all empirical sciences
* Includes 300 different procedures with R and examples with R-programs for the analysis and for determining minimal experimental sizes

Applied Statistics: Theory and Problem Solutions with R will appeal to experimenters, statisticians, mathematicians, and all scientists using statistical procedures in the natural sciences, medicine, and psychology, amongst others.




Table of Contents

Cover

Preface

References

1 The R‐Package, Sampling Procedures, and Random Variables

1.1 Introduction

1.2 The Statistical Software Package R

1.3 Sampling Procedures and Random Variables

References

2 Point Estimation

2.1 Introduction

2.2 Estimating Location Parameters

2.3 Estimating Scale Parameters

2.4 Estimating Higher Moments

2.5 Contingency Tables

References

3 Testing Hypotheses – One‐ and Two‐Sample Problems

3.1 Introduction

3.2 The One‐Sample Problem

3.3 The Two‐Sample Problem

References

4 Confidence Estimations – One‐ and Two‐Sample Problems

4.1 Introduction

4.2 The One‐Sample Case

4.3 The Two‐Sample Case

References

5 Analysis of Variance (ANOVA) – Fixed Effects Models

5.1 Introduction

5.2 Planning the Size of an Experiment

5.3 One‐Way Analysis of Variance

5.4 Two‐Way Analysis of Variance

5.5 Three‐Way Classification

References

6 Analysis of Variance – Models with Random Effects

6.1 Introduction

6.2 One‐Way Classification

6.3 Two‐Way Classification

6.4 Three‐Way Classification

References

7 Analysis of Variance – Mixed Models

7.1 Introduction

7.2 Two‐Way Classification

7.3 Three‐Way Layout

References

8 Regression Analysis

8.1 Introduction

8.2 Regression with Non‐Random Regressors – Model I of Regression

8.3 Models with Random Regressors

References

9 Analysis of Covariance (ANCOVA)

9.1 Introduction

9.2 Completely Randomised Design with Covariate

9.3 Randomised Complete Block Design with Covariate

9.4 Concluding Remarks

References

10 Multiple Decision Problems

10.1 Introduction

10.2 Selection Procedures

10.3 The Subset Selection Procedure for Expectations

10.4 Optimal Combination of the Indifference Zone and the Subset Selection Procedure

10.5 Selection of the Normal Distribution with the Smallest Variance

10.6 Multiple Comparisons

References

11 Generalised Linear Models

11.1 Introduction

11.2 Exponential Families of Distributions

11.3 Generalised Linear Models – An Overview

11.4 Analysis – Fitting a GLM – The Linear Case

11.5 Binary Logistic Regression

11.6 Poisson Regression

11.7 The Gamma Regression

11.8 GLM for Gamma Regression

11.9 GLM for the Multinomial Distribution

References

12 Spatial Statistics

12.1 Introduction

12.2 Geostatistics

12.3 Special Problems and Outlook

References

Appendix A: List of Problems

Appendix B: Symbolism

Appendix C: Abbreviations

Appendix D: Probability and Density Functions

Index

End User License Agreement

List of Tables

Chapter 1

Table 1.1 Number of inhabitants in 23 municipalities of Vienna.

Chapter 2

Table 2.1 Number of noxious weed seeds.

Table 2.2 Some results of the first 20 steps in the iteration of the Heifer exam...

Table 2.3 A two‐by‐two contingency table – model I.

Table 2.4 A two‐by‐two contingency table – model II.

Table 2.5 A two‐by‐two contingency table – for calculating association measures.

Table 2.6 Mother tongue and marital status of the mother of 50 children.

Table 2.7 Hair and eye colour of 2000 German persons.

Chapter 3

Table 3.1 P-quantiles Z(P) of the standard normal distribution.

Table 3.2 Situations and decisions in hypotheses testing.

Table 3.3 Values of the power function for n = 9, 16, 25, σ = 1 and specia...

Table 3.4 The litter weights of mice (in grams) and the differences between the ...

Table 3.6 (γ1, γ2)-values and the corresponding coefficients used in th...

Chapter 4

Table 4.1 Values of …

Table 4.2 Confidence table of two kinds of smokers.

Chapter 5

Table 5.1 Observations yij of an experiment with a levels of a factor A.

Table 5.2 Theoretical ANOVA table: one‐way classification, model I.

Table 5.3 Empirical ANOVA table: one‐way classification, model I.

Table 5.4 Performances (milk fat in kg) yij of the daughters of three sires.

Table 5.5 ANOVA table for testing the hypothesis H0: a1 = a2 = a3 = 0...

Table 5.6 Results of testing pig fattening – fattening days (from 40 kg to 110 k...

Table 5.7 Empirical ANOVA table of a two‐way cross‐classification with equal sub...

Table 5.9 ANOVA Table of Example 5.9.

Table 5.8 Observations (loss in per cent of dry mass, during storage of 300 days...

Table 5.10 Analysis of variance table of a two‐way cross‐classification with equ...

Table 5.11 Observations of the carotene storage experiment of Example 5.12.

Table 5.12 ANOVA table for the carotene storage experiment of Example 5.12.

Table 5.13 Theoretical ANOVA table of the two‐way nested classification for mode...

Table 5.14 Observations of the example.

Table 5.15 ANOVA table of a three‐way cross‐classification with equal subclass n...

Table 5.16 Three‐way classification with factors kind of storage, packaging mate...

Table 5.17 ANOVA Table for data of Table 5.16.

Table 5.18 Water temperature (T), water salinity (S), and density of shrimp popu...

Table 5.19 ANOVA table of a three‐way nested classification for model I.

Table 5.20 Observations of a three‐way nested classification.

Table 5.21 Observations of a mixed classification type (A ≻ B) × C with a = ...

Table 5.22 ANOVA table for a balanced three-way mixed classification (B ≺ A) × C...

Table 5.23 ANOVA table and expectations of the MS for model I of a balanced thre...

Table 5.24 Observations of a mixed classification type (A × B) ≻ C with a = 2,...

Chapter 6

Table 6.1 Expected mean squares of the one‐way ANOVA model II.

Table 6.2 Milk fat performances yij of daughters of ten sires.

Table 6.3 ANOVA table of model II with E(MS) of the example of Problem 6.1.

Table 6.4 ANOVA table of model II of the unbalanced one‐way classification with ...

Table 6.5 Deviations of the specification y of three at random chosen products fr...

Table 6.6 The column E(MS) of the two-way nested classification for model II.

Table 6.7 Data of the example for Problem 6.9.

Table 6.8 The column E(MS) as supplement for model II to the analysis of variance...

Table 6.9 (Kuehl 1994) Observations of products produced by operator Ai on machin...

Table 6.10 Test statistics for testing hypotheses and distributions of these tes...

Table 6.11 Expectations of the MS of a three-way nested classification for model ...

Table 6.12 Observations yijkl of a three-way nested classification model II.

Table 6.13 ANOVA table with df and expected mean squares E(MS) of model (6.29).

Table 6.14 Data in a three-way mixed classification ((B ≺ A) × C).

Table 6.15 ANOVA table with df and expected mean squares E(MS) of model (6.30).

Table 6.16 Data in a three-way mixed classification C ≺ (A × B) m...

Chapter 7

Table 7.1 Expectations of the MS in Table 5.10 for a mixed model (levels of A fix...

Table 7.2 Yield per plot y in kilograms of a variety trial.

Table 7.3 Yield per plot y in kilograms of a variety-location, two-way classifica...

Table 7.4 Yields of 6 varieties tested on 12 randomly chosen farms.

Table 7.5 E(MS) for balanced nested mixed models.

Table 7.6 Data of an experiment to determine the content uniformity of film-coat...

Table 7.7 Data from Example 7.5 with a random factor A of batches and a fixed fac...

Table 7.8 ANOVA table – three‐way ANOVA – cross‐classification, balanced case.

Table 7.9 Expected mean squares for the three‐way cross‐classification – balance...

Table 7.10 ANOVA table of the three‐way nested classification – unbalanced case.

Table 7.11 Expected mean squares for the balanced case of model III.

Table 7.12 Expected mean squares for balanced case of model IV.

Table 7.13 Expected mean squares for model V.

Table 7.14 Expected mean squares for model VI.

Table 7.15 Expected mean squares for model VII.

Table 7.16 Expected mean squares for model VIII.

Table 7.17 ANOVA table for the balanced three‐way analysis of variance – mixed c...

Table 7.18 Expected mean squares for model III.

Table 7.19 Expected mean squares for model IV.

Table 7.20 Expected mean squares for the balanced model V.

Table 7.21 Expected mean squares for model VI.

Table 7.22 ANOVA table for the three‐way balanced analysis of variance – mixed c...

Table 7.23 Expected mean squares for balanced model III.

Table 7.24 Expected mean squares for balanced model IV.

Table 7.25 Expected mean squares for the balanced model V.

Table 7.26 Expected mean squares for model VI.

Table 7.27 Expected mean squares for the balanced model VII.

Table 7.28 Expected mean squares for model VIII.

Chapter 8

Table 8.1 The height of hemp plants (y in centimetres) during growth (x age in w...

Table 8.2 Shoe sizes (x in centimetres) and body heights (y in centimetres) from...

Table 8.3 Average withers heights of 112 cows in the first 60 months of life.

Table 8.4 Carotene content (in mg/100 g dry matter) y of grass in dependency of t...

Table 8.5 Lower and upper bounds of the realised 95%-confidence band for β0 ...

Table 8.6 Time in minutes (x) and change of rotation in degrees (y) from experim...

Table 8.7 Leaf surfaces of oil palms y in square metres and age x in years.

Table 8.8 Relative frequencies of 10 000 simulated samples for the correct accep...

Table 8.9 Leaf surface yi in m2 of oil palms on a trial area in dependency of age...

Table 8.10 D- and G-optimal designs for polynomial regression for x ∈ [a, b] and ...

Table 8.11 Locally optimal designs for the exponential regression.

Table 8.12 Locally optimal designs for the logistic regression.

Table 8.13 Locally optimal designs for the Bertalanffy regression.

Table 8.14 Locally optimal designs for the Gompertz regression.

Table 8.15 Optimal size of sub-samples (k) and optimal nominal type-II-risk (β...

Table 8.16 Optimal size of subsamples (k) and optimal nominal type-II-risk (β...

Chapter 9

Table 9.1 Nested ANOVA table for the test of the null hypothesis Hβ0: β...

Table 9.2 Nested ANOVA table for the test HA0: 'all ai are equal'.

Table 9.3 Strength of a monofilament fibre produced by three different machines M...

Table 9.4 Data of a randomised double‐blind study.

Table 9.5 Data of a randomised complete block design with four blocks (factor B)...

Chapter 10

Table 10.1 Sample means of Example 10.1.

Table 10.2 Values of nGu used in the simulation experiment.

Table 10.3 Values of average … found in the subset selection.

Table 10.4 Optimal values of PB used in the simulation experiment.

Table 10.5 Average total size of the simulation experiment (upper entry) and the...

Table 10.6 Relative frequencies of correct selection calculated from 100 000 run...

Table 10.7 Simulated observations of Example 5.7.

Table 10.8 Differences between means of Example 5.7.

Table 10.9 Minimal sample sizes for several multiple decision problems.

Chapter 11

Table 11.1 Link function, random and systematic components of some GLMs.

Table 11.2 Observations (loss during storage in percent of dry mass during stora...

Table 11.3 Analysis of variance table of Problem 11.4.

Table 11.4 Values of Nijk (nijk) of the block experiment.

Table 11.5 Number n of wasps per group and number k of these n wasps finding eggs...

Table 11.6 Number of soldiers dying from kicks by army mules.

Table 11.7 Values of kijk (nijk) on plots k = 1, …, mij in block i and genotype j ...

Table 11.8 Values kij and mij found in three strains.

Table 11.9 Clotting time of blood in seconds (y) for normal plasma diluted to ni...

Chapter 12

Table 12.1 Gauss–Krüger code numbers.

List of Illustrations

Chapter 3

Figure 3.1 The power functions of the t-test testing the null hypothesis H0: μ ...

Figure 3.2 Graphical representation of the two risks α and β and the...

Figure 3.3 Result of the example.

Figure 3.4 Graph of the triangular sequential two‐sample test of Problem 3.14.

Chapter 8

Figure 8.1 Scatter‐plot for the association between age and height of hemp pla...

Figure 8.2 Scatter‐plot for the observations of Example 8.2.

Figure 8.3 Scatter‐plot of the data in Example 8.3.

Figure 8.4 Scatter‐plot and estimated regression line of the carotene example ...

Figure 8.5 Estimated regression lines of the carotene example (sack …, glass …).

Figure 8.6 Scatter‐plot and estimated regression line of the carotene example ...

Figure 8.7 Scatter‐plot of the observations from Table 8.3 and the fitted func...

Figure 8.8 Scatter‐plot of the data and the fitted exponential function in the...

Figure 8.9 Fitted regression function of the example in Problem 8.28.

Figure 8.10 Fitted regression function of the example in Problem 8.32.

Figure 8.11 Graph of the triangle of the test of Example 8.7.

Chapter 9

Figure 9.1 Scatter-plot of the example in Problem 9.1 with M1 as 1, ...

Figure 9.2 Scatter‐plot with regression lines of the example in Problem 9.2.

Chapter 10

Figure 10.1 Relationship between the total experimental size N and ….

Chapter 11

Figure 11.1 Data fitted with the lm-function.

Figure 11.2 Data fitted with glm-functions.

Chapter 12

Figure 12.1 Exploratory spatial data analysis of dataset s100.

Figure 12.2 Exponential semi‐variogram model underlying the dataset s100.

Figure 12.3 Empirical directional variograms for dataset s100.

Figure 12.4 Variogram cloud and omnidirectional variogram of elevation data.

Figure 12.5 Empirical semi‐variogram (circles) and the fitted theoretical semi...

Figure 12.6 Predicted values and standard errors for s100.

Figure 12.7 Association between standard deviations and data locations.

Figure 12.8 Determining a 95% confidence interval for Box–Cox parameter.

Figure 12.9 Empirical and fitted semi‐variogram model for Meuse zinc data.

Figure 12.10 OK‐predicted values of zinc data (on a log‐scale).

Figure 12.11 OK variances of predicted zinc values.


Applied Statistics

Theory and Problem Solutions with R

Dieter Rasch, Rostock, Germany

Rob Verdooren, Wageningen, The Netherlands

Jürgen Pilz, Klagenfurt, Austria

Copyright

This edition first published 2020

© 2020 John Wiley & Sons Ltd

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.

The right of Dieter Rasch, Rob Verdooren and Jürgen Pilz to be identified as the authors of this work has been asserted in accordance with law.

Registered Offices

John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA

John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK

Editorial Office

9600 Garsington Road, Oxford, OX4 2DQ, UK

For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.

Wiley also publishes its books in a variety of electronic formats and by print‐on‐demand. Some content that appears in standard print versions of this book may not be available in other formats.

Limit of Liability/Disclaimer of Warranty

While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

Library of Congress Cataloging‐in‐Publication Data

Names: Rasch, Dieter, author. | Verdooren, L. R., author. | Pilz, Jürgen, 1951- author.

Title: Applied statistics : theory and problem solutions with R / Dieter Rasch (Rostock, GM), Rob Verdooren, Jürgen Pilz.

Description: Hoboken, NJ : Wiley, 2020. | Includes bibliographical references and index.

Identifiers: LCCN 2019016568 (print) | LCCN 2019017761 (ebook) | ISBN 9781119551553 (Adobe PDF) | ISBN 9781119551546 (ePub) | ISBN 9781119551522 (hardcover)

Subjects: LCSH: Mathematical statistics-Problems, exercises, etc. | R (Computer program language)

Classification: LCC QA276 (ebook) | LCC QA276 .R3668 2019 (print) | DDC 519.5-dc23

LC record available at https://lccn.loc.gov/2019016568

Cover design: Wiley

Cover Images: © Dieter Rasch, © whiteMocca/Shutterstock

Preface

We wrote this book for people who have to apply statistical methods in their research but whose main interest is not in theorems and proofs. Because of this approach, our aim is not to provide the detailed theoretical background of statistical procedures. While mathematical statistics as a branch of mathematics includes definitions as well as theorems and their proofs, applied statistics gives hints for the application of the results of mathematical statistics.

Sometimes applied statistics uses simulation results in place of results from theorems. An example: the normality assumption needed for many theorems in mathematical statistics can often be neglected in applications concerning location parameters such as the expectation; see Rasch and Tiku (1985). Nearly all statistical tests and confidence estimations for expectations have been shown by simulations to be very robust against violation of the normality assumption needed to prove the corresponding theorems.

We gave the present book an analogous structure to that of Rasch and Schott (2018) so that the reader can easily find the corresponding theoretical background there. Chapter 11 ‘Generalised Linear Models’ and Chapter 12 ‘Spatial Statistics’ of the present book have no prototype in Rasch and Schott (2018). Further, the present book contains no exercises; lecturers can either use the exercises (with solutions in the appendix) in Rasch and Schott (2018) or the exercises in the problems mentioned below.

Instead, our aim was to demonstrate the theory presented in Rasch and Schott (2018), and that underlying the new Chapters 11 and 12, using functions and procedures available in the statistical programming system R, which has become the gold standard for statistical computing.

Within the text, the reader will often find the sequence problem – solution – example, with problems numbered within the chapters. Readers interested only in special applications may in many cases find the corresponding procedure in the list of problems in Appendix A.

We thank Alison Oliver (Wiley, Oxford) and Mustaq Ahamed (Wiley) for their assistance in publishing this book.

We are very interested in the comments of readers. Please contact:

d_rasch@t‐online.de, [email protected], [email protected].

Rostock, Wageningen, and Klagenfurt, June 2019, the authors.

References

Rasch, D. and Tiku, M.L. (eds.) (1985). Robustness of Statistical Methods and Nonparametric Statistics. Proceedings of the Conference on Robustness of Statistical Methods and Nonparametric Statistics, held at Schwerin (DDR), May 29 - June 2, 1983. Dordrecht, Boston, Lancaster, Tokyo: Reidel Publishing Co.

Rasch, D. and Schott, D. (2018). Mathematical Statistics. Oxford: Wiley.

1 The R‐Package, Sampling Procedures, and Random Variables

1.1 Introduction

In this chapter we give an overview of the software package R and introduce basic knowledge about random variables and sampling procedures.

1.2 The Statistical Software Package R

In practical investigations, professional statistical software is used to design experiments or to analyse data already collected. We apply here the software package R. Anybody can extend the functionality of R without any restrictions using free software tools; moreover, it is also possible to implement special statistical methods as well as certain procedures written in C and FORTRAN. Such tools are offered on the internet in standardised archives. The most popular archive is probably CRAN (Comprehensive R Archive Network), a server net that is supervised by the R Development Core Team. This net also offers the package OPDOE (optimal design of experiments), which is thoroughly described in Rasch et al. (2011). Further, it offers the following packages used in this book: car, lme4, DunnettTests, VCA, lmerTest, mvtnorm, seqtest, faraway, MASS, glm2, geoR, gstat.
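Such a package can be installed from CRAN and loaded from within R itself; a minimal sketch for OPDOE (assuming an internet connection):

> install.packages("OPDOE")

> library(OPDOE)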

Apart from only a few exceptions, R contains implementations for all statistical methods concerning analysis, evaluation, and planning. We refer for details to Crawley (2013).

The software package R is available free of charge from http://cran.r‐project.org for the operating systems Linux, MacOS X, and Windows. The installation under Microsoft Windows takes place via ‘Windows’. Choosing ‘base’ the installation platform is reached. Using ‘Download R 2.X.X for Windows’ (X stands for the required version number) the setup file can be downloaded. After this file is started the setup assistant runs through the installation steps. In this book, all standard settings are adopted. The interested reader will find more information about R at http://www.r‐project.org or in Crawley (2013).

After starting R the input window opens, presenting the red coloured input prompt '>'. Here commands can be typed and carried out by pressing the enter key. The output is given directly below the command line. The user can also insert line breaks and indents to increase clarity; none of this influences how the commands are executed. A command to read, for instance, the data y = (1, 3, 8, 11) is as follows:

> y <- c(1,3,8,11)

The assignment operator in R is the two-character sequence '<-'; alternatively, '=' can be used.

The Workspace is a special working environment in R. There, certain objects can be stored that were obtained during the current work with R. Such objects contain the results of computations and data sets. A Workspace is loaded using the menu

File – Load Workspace...

In this book the R-commands start with >. Readers who want to use the R-commands need only type or copy the text after > into the R window.

An advantage of R is that, as with other statistical packages like SAS and IBM‐SPSS, we no longer need an appendix with tables in statistical books. Often tables of the density or distribution function of the standard normal distribution appear in such appendices. However, the values can be easily calculated using R.
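For instance, the 97.5% quantile of the standard normal distribution, traditionally looked up in a table, is obtained directly (the q-prefix commands for quantiles are explained below):

> qnorm(0.975)

[1] 1.959964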

The notation of this and the following chapters is just that of Rasch and Schott (2018).

Problem 1.1

Calculate the value ϕ(z) of the density function of the standard normal distribution for a given value z.

Solution

Use the command > dnorm(z, mean = 0, sd = 1). If the mean or sd is not specified they assume the default values of 0 and 1, respectively. Hence > dnorm(z) can be used in Problem 1.1.

Example

We calculate the value ϕ(1) of the density function of the standard normal distribution using

> dnorm(1)

[1] 0.2419707

Problem 1.2

Calculate the value Φ(z) of the distribution function of the standard normal distribution for a given value z.

Solution

Use the command > pnorm(z, mean = 0, sd = 1).

Example

We calculate the value Φ(1) of the distribution function of the standard normal distribution by > pnorm(1, mean = 0, sd = 1) or, with the default values, by > pnorm(1).

> pnorm(1)

[1] 0.8413447

Also, for other continuous distributions, we obtain the value of the density function using d followed by the R-name of the distribution, and the value of the distribution function using p followed by the R-name of the distribution. We demonstrate this in the next problem for the lognormal distribution.

Problem 1.3

Calculate the value of the density function of the lognormal distribution whose logarithm has mean equal to meanlog = 0 and standard deviation equal to sdlog = 1 for a given value z.

Solution

Use the command > dlnorm(z, meanlog = 0, sdlog = 1) or use the default values meanlog = 0 and sdlog = 1 using > dlnorm(z).

Example

We calculate the value of the density function of the lognormal distribution with meanlog = 0 and sdlog = 1 using

> dlnorm(1)

[1] 0.3989423

Problem 1.4

Calculate the value of the distribution function of the lognormal distribution whose logarithm has mean equal to meanlog = 0 and standard deviation equal to sdlog = 1 for a given value z.

Solution

Use the command > plnorm(z, meanlog = 0, sdlog = 1) or use the default values meanlog = 0 and sdlog = 1 using > plnorm(z).

Example

We calculate the value of the distribution function for z = 1 of the lognormal distribution with meanlog = 0 and sdlog = 1 using

> plnorm(1)

[1] 0.5

From most of the other distributions we need the quantiles (or percentiles) qP, defined by P(y ≤ qP) = P.

These can be obtained by writing q followed by the R-name of the distribution.

Problem 1.5

Calculate the P%‐quantile of the t‐distribution with df degrees of freedom and optional non‐centrality parameter ncp.

Solution

Use the command > qt(P,df, ncp) and for a central t‐distribution use the default by omitting ncp.

Example

Calculate the 95%‐quantile of the central t‐distribution with 10 degrees of freedom.

> qt(0.95,10)

[1] 1.812461

We demonstrate the procedure for the chi‐square and the F‐distribution.

Problem 1.6

Calculate the P%‐quantile of the χ2‐distribution with df degrees of freedom and optional non‐centrality parameter ncp.

Solution

Use the command > qchisq(P,df, ncp) and for the central χ2‐distribution with df degrees of freedom use > qchisq(P,df).

Example

Calculate the 95%‐quantile of the central χ2‐distribution with 10 degrees of freedom.

> qchisq(0.95,10)

[1] 18.30704

Problem 1.7

Calculate the P%‐quantile of the F‐distribution with df1 and df2 degrees of freedom and optional non‐centrality parameter ncp.

Solution

Use the command > qf(P,df1,df2, ncp), and for the central F‐distribution with df1 and df2 degrees of freedom use > qf(P,df1,df2).

Example

Calculate the 95%‐quantile of the central F‐distribution with 10 and 20 degrees of freedom!

> qf(0.95,10,20)

[1] 2.347878

For the calculation of further values of probability functions of discrete random variables or of distribution functions and quantiles the commands can be found by using the help function in the tool bar of R, and then you may call up the ‘manual’ or use Crawley (2013).
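For example, the documentation page of any of the functions used above can also be opened directly from the console:

> help(qchisq)

> ?qchisq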

1.3 Sampling Procedures and Random Variables

Even though in this book we mainly discuss how to plan experiments and how to analyse observed data, we still need basic knowledge about random variables because without it we could not explain unbiased estimators, the expected length of a confidence interval, or how to define the risks of a statistical test.

Definition 1.1

A sampling procedure without replacement (wor) or with replacement (wr) is a rule of selecting a proper subset, called a sample, from a well-defined finite basic set of objects (population, universe). It is said to be at random if each element of the basic set has the same probability p of being drawn into the sample. We can also say that in a random sampling procedure each possible sample has the same probability of being drawn.

A (concrete) sample is the result of a sampling procedure. Samples resulting from a random sampling procedure are said to be (concrete) random samples or shortly samples.

If we consider all possible samples from a given finite universe, then, from this definition, it follows that each possible sample has the same probability to be drawn.

There are several random sampling procedures that can be used in practice. Basic sets of objects are mostly called (statistical) populations or, synonymously, (statistical) universes.

Concerning random sampling procedures, we distinguish (among other cases):

Simple (or pure) random sampling with replacement (wr), where each of the N elements of the population is selected with probability 1/N at every draw.

Simple random sampling without replacement (wor), where each unordered sample of n different objects has the same probability to be chosen.

Cluster sampling, where the population is divided into disjoint subclasses (clusters). Random sampling without replacement is done among these clusters; in the selected clusters, all objects are taken into the sample. This kind of selection is often used in area sampling. It is only random in the sense of Definition 1.1 if the clusters contain the same number of objects.

Multi-stage sampling, where sampling is done in several steps. We restrict ourselves to two stages of sampling, where the population is decomposed into disjoint subsets (primary units). Part of the primary units is sampled randomly without replacement (wor), and within them pure random sampling without replacement (wor) is done with the secondary units. Multi-stage sampling is favourable if the population has a hierarchical structure (e.g. country, province, towns in the province). It is random in the sense of Definition 1.1 if the primary units contain the same number of secondary units.

Sequential sampling, where the sample size is not fixed at the beginning of the sampling procedure. At first, a small sample with replacement is taken and analysed. Then it is decided whether the obtained information is sufficient, e.g. to reject or to accept a given hypothesis (see Chapter 3), or whether more information is needed by selecting a further unit.

When in cluster or two-stage sampling the clusters or primary units have different sizes (numbers of elements or areas), more sophisticated methods are used (Rasch et al. 2008, Methods 1/31/2110, 1/31/3100).

Both a random sampling (procedure) and arbitrary sampling (procedure) can result in the same concrete sample. Hence, we cannot prove by inspecting the concrete sample itself whether or not the sample is randomly chosen. We have to check the sampling procedure used instead.

In mathematical statistics a random sampling with replacement procedure is modelled by a vector Y = (y1, y2, … , yn)T of random variables yi, i = 1, … , n, which are independently distributed as a random variable y, i.e. they all have the same distribution. The yi, i = 1, … , n are said to be independently and identically distributed (i.i.d.). This leads to the following definition.

Definition 1.2

A random sample of size n is a vector Y = (y1, y2, … , yn)T with n i.i.d. random variables yi, i = 1, … , n as elements.

Random variables are given in bold print (see Appendix A for motivation).

The vector Y = (y1, y2, … , yn)T is called a realisation of Y = (y1, y2, … , yn)T and is used as a model of a vector of observed values or values selected by a random selection procedure.

To explain this approach let us assume that we have a universe of 100 elements (the numbers 1–100). We like to draw a pure random sample without replacement (wor) of size n = 10 from this universe and model this by Y = (y1, y2, … , y10)T. When a random sample has been drawn it could be the vector Y = (y1, y2, … , y10)T = (3, 98, 12, 37, 2, 67, 33, 21, 9, 56)T = (2, 3, 9, 12, 21, 33, 37, 56, 67, 98)T. This means that it is only important which element has been selected and not at which place this has happened. All samples wor occur with the same probability p = 1/choose(100,10). The denominator can be calculated by R with the > choose() command

> choose(100,10)

[1] 1.731031e+13

and from this the probability is 1/1.731031e+13 ≈ 5.776904e-14.
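R can also do this division in one step, continuing the example:

> 1/choose(100,10)

[1] 5.776904e-14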

We can now write

P[Y = (2, 3, 9, 12, 21, 33, 37, 56, 67, 98)T] = 1/choose(100,10),

where the Y on the left-hand side is the random vector (printed in bold). In a probability statement, something must always be random. To write

P[(y1, y2, … , y10)T = (2, 3, 9, 12, 21, 33, 37, 56, 67, 98)T]

is nonsense because (y1, y2, … , y10)T, just like the vector on the right-hand side, is a vector of special numbers, and it is nonsense to ask for the probability that 5 equals 7.

To explain the situation again we consider the problem of throwing a fair die; this is a die where we know that each of the numbers 1, …, 6 occurs with the same probability 1/6. We ask for the probability that an even number is thrown. Because one half of the six numbers are even, this probability is 1/2. Assume we throw the die using a dice cup and let the result be hidden; then the probability is still 1/2. However, if we take the dice cup away, a realisation occurs, let us say a 5. Now it is stupid to ask what the probability is that 5 is even, or that an even number is even. Probability statements about realisations of random variables are senseless and not allowed. The reader of this book should only read a formula as a probability statement if something in it is in bold print; only in such a case is a probability statement possible.
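In R, a throw of such a fair die can be simulated with the sample() function introduced in the problems below; each run prints one of the numbers 1, …, 6 with probability 1/6, a realisation in the sense just described:

> sample(1:6, 1)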

We learn in Chapter 4 what a confidence interval is. It is defined as an interval with at least one random boundary and we can, for example, calculate with some small α the probability 1 − α that the expectation of some random variable is covered by this interval. However, when we have realised boundaries, then the interval is fixed and it either covers or does not cover the expectation. In applied statistics, we work with observed data modelled by realised random variables. Then the calculated interval does not allow a probability statement. We do not know, by using R or otherwise, whether the calculated interval covers the expectation or not. Why did we fix this probability before starting the experiment when we cannot use it in interpreting the result?

The answer is not easy, but we will try to give some reasons. If a researcher has to carry out many similar experiments and in each of them calculates for some parameter a (1 − α) confidence interval, then he can say that in about (1 − α)100% of all cases the interval has covered the parameter, but of course he does not know when this happened.

What should we do when only one experiment has to be done? Then we should choose (1 − α) so large (say 0.95 or 0.99) that we can take the risk of making an erroneous statement by saying that the interval covers the parameter. This is analogous to the situation of a person who has a severe disease and needs an operation in hospital. The person can choose between two hospitals and knows that in hospital A about 99% of people operated on survived a similar operation and in hospital B only about 80%. Of course (without further information) the person chooses A even without knowing whether she/he will survive. As in normal life, so also in science: we have to take risks and make decisions under uncertainty.

We now show how R can easily solve simple problems of sampling.

Problem 1.8

Draw a pure random sample without replacement of size n < N from N given objects represented by the numbers 1, …, N, without replacing the drawn objects. There are choose(N, n) = N!/[n!(N − n)!] possible unordered samples, each having the same probability p = 1/choose(N, n) of being selected.

Solution

Insert in R a data file y with N entries and continue in the next line with > sample(y, n, replace = FALSE) or > sample(y, n, replace = F) with n < N to create a sample of n < N different elements from y; when we insert replace = TRUE we get random sampling with replacement. The default is replace = FALSE, hence for sampling without replacement we can use > sample(y, n).

Example

We choose N = 9, and n = 5, with population values y = (1,2,3,4,5,6,7,8,9)

> y <- c(1,2,3,4,5,6,7,8,9)

> sample(y,5)

[1] 7 6 5 1 3

A pure random sampling with replacement also occurs if the random sample is obtained by replacing the objects immediately after drawing and each object has the same probability of coming into the sample using this procedure. Hence, the population always has the same number of objects before a new object is taken. This is only possible if the observation of the objects works without destroying or changing them (counterexamples are tensile breaking tests, medical examinations of killed animals, felling of trees, harvesting of food).

Problem 1.9

Draw with replacement a pure random sample of size n from N given objects represented by the numbers 1, …, N, replacing each drawn object. There are N^n possible ordered samples, each having the same probability 1/N^n of being selected.

Solution

Insert in R a data file y with N entries and continue in the next line with > sample(y, n, replace = TRUE) or > sample(y, n, replace = T) to create a sample of size n, not necessarily with different elements from y.

Examples

Example with n < N

> y<-c(1,2,3,4,5,6,7,8,9)

> sample(y,5,replace=T)

[1] 2 4 6 4 2

Example with n > N

> y<-c(1,2,3,4,5,6,7,8,9)

> sample(y,10,replace=T)

[1] 3 9 5 5 9 9 8 7 6 3

A method that can sometimes be realised more easily is systematic sampling with a random start. It is applicable if the objects of the finite sampling set are numbered from 1 to N, and the sequence is not related to the character considered. If the quotient m = N/n is a natural number, a value i between 1 and m is chosen at random, and the sample is collected from objects with numbers i, m + i, 2m + i, … , (n – 1)m + i. Detailed information about this case and the case where the quotient m is not an integer can be found in Rasch et al. (2008, method 1/31/1210).

Problem 1.10

From a set of N objects systematic sampling with a random start should choose a random sample of size n.

Solution

We assume that in the sequence 1, 2, …, N there is no trend. Let us assume that m = N/n is an integer and select by pure random sampling a value 1 ≤ x ≤ m (a sample of size 1) from the m numbers 1, …, m. Then the systematic sample with random start contains the numbers x, x + m, x + 2m, … , x + (n − 1)m.

Example

We choose N = 500 and n = 20; the quotient m = 500/20 = 25 is an integer. Analogous to Problem 1.8 we draw a random sample of size 1 from (1, 2, …, 25) using R.

> y<- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15, 16,17,18,19,20,21,22,23,24,25)

> sample(y,1)

[1] 9

The final systematic sample with random start of size n = 20 starts with number x = 9 and m = 25: (9, 34, 59, 84, 109, 134, 159, 184, 209, 234, 259, 284, 309, 334, 359, 384, 409, 434, 459, 484).
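The same arithmetic sequence can be generated directly with R's seq() function; a minimal sketch assuming the drawn start x = 9:

> x <- 9

> seq(from = x, by = 25, length.out = 20)

which reproduces the sample (9, 34, …, 484) above.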

Problem 1.11

By stratified sampling, from a population of size N decomposed into s disjoint subpopulations, so-called strata of sizes N1, N2, …, Ns, a random sample has to be drawn.

Solution

Partial samples of size ni are collected from the ith stratum (i = 1, 2, … , s), where pure random sampling procedures without replacement are used in each stratum. This leads to a random sampling without replacement procedure for the whole population if the numbers ni/n are chosen proportional to the numbers Ni/N. The final random sample contains n = n1 + n2 + … + ns elements.

Example

Vienna, the capital of Austria, is subdivided into 23 municipalities. We repeat a table with the numbers of inhabitants in these municipalities from Rasch et al. (2011) and, for demonstrating the example, round the numbers to values such that ni = nNi/N is an integer, where N = 1 700 000 and n = 1000.

Now we select by pure random sampling without replacement, as shown in Problem 1.8, ni of the Ni inhabitants from each municipality to reach a total random sample of 1000 inhabitants from the 1 700 000 people in Vienna.
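In R the within-stratum draws are again done with sample(); a minimal sketch for the first municipality, assuming its 17 000 (rounded) inhabitants are numbered 1, …, 17 000:

> sample(1:17000, 10)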

While for the stratified random sampling objects are selected without replacement from each subset, for two-stage sampling subsets or objects are selected at random without replacement at each stage, as described below. Let the population consist of s disjoint subsets (the primary units), each of size N0, in the two-stage case. Further, we suppose that the character values in the single primary units differ only at random, so that objects need not be selected from all primary units. If the desired sample size is n = r n0 with r < s, then in the first step r of the s given primary units are selected using a pure random sampling procedure. In the second step, n0 objects (secondary units) are chosen from each selected primary unit, again applying pure random sampling. The number of possible samples is choose(s, r) · [choose(N0, n0)]^r, and each object of the population has the same probability of reaching the sample, corresponding to Definition 1.1.

Problem 1.12

Draw a random sample of size n in a two‐stage procedure by selecting first from the s primary units having sizes Ni (i = 1, …, s) exactly r units.

Solution

To draw a random sample without replacement of size n we select a divisor r of n and from the s primary units we randomly select r units, with selection probabilities proportional to the relative sizes Ni/N (i = 1, …, s). From each of the selected r primary units we select n/r elements by pure random sampling without replacement; together these form the total sample of secondary units.

Example

We take again the values of Table 1.1 and select r = 5 from the s = 23 municipalities to take an overall sample of n = 1000. For this we split the interval (0, 1000] into 23 subintervals (cumi−1, cumi], i = 1, …, 23, with cum0 = 0, where cumi is the cumulated ni column of Table 1.1, and generate five uniformly distributed random numbers in (0, 1]. If a random number multiplied by 1000 falls into one of the 23 subintervals, the corresponding municipality is selected. If a further random number falls into the same interval it is replaced by another uniformly distributed random number. We generate five such random numbers as follows:

Table 1.1 Number of inhabitants in 23 municipalities of Vienna.

Source: From Statistik Austria (2009) Bevölkerungsstand inclusive Revision seit 1.1.2002, Wien, Statistik Austria.

Municipality    Ni          rounded Ni    ni     cum
Innere Stadt    16 958      17 000        10     10
Leopoldstadt    94 595      102 000       60     70
Landstraße      83 737      85 000        50     120
Wieden          30 587      34 000        20     140
Margarethen     52 548      51 000        30     170
Mariahilf       29 371      34 000        20     190
Neubau          30 056      34 000        20     210
Josefstadt      23 912      34 000        20     230
Alsergrund      39 422      34 000        20     250
Favoriten       173 623     170 000       100    350
Simmering       88 102      85 000        50     400
Meidling        87 285      85 000        50     450
Hietzing        51 147      51 000        30     480
Penzing         84 187      85 000        50     530
Rudolfsheim     70 902      68 000        40     570
Ottakring       94 735      102 000       60     630
Hernals         52 701      51 000        30     660
Währing         47 861      51 000        30     690
Döbling         68 277      68 000        40     730
Brigittenau     82 369      85 000        50     780
Floridsdorf     139 729     136 000       80     860
Donaustadt      153 408     153 000       90     950
Liesing         91 759      85 000        50     1 000
Total           N* = 1 687 271   N = 1 700 000   n = 1 000

Rounded numbers Ni, ni, and cumulated ni.

> runif(5)

[1] 0.18769112 0.78229430 0.09359499 0.46677904 0.51150546

The first number corresponds to Mariahilf, the second to Floridsdorf, the third to Landstraße, the fourth to Hietzing, and the last one to Penzing. To obtain a random sample of size 1000 we take pure random samples of size 200 from the people in Mariahilf, Floridsdorf, Landstraße, Hietzing, and Penzing, respectively.
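The mapping of random numbers to municipalities can also be automated with R's findInterval() function; a minimal sketch, assuming the cumulated ni of Table 1.1 are stored in the vector cum (duplicate hits would still have to be redrawn as described above):

> cum <- c(10,70,120,140,170,190,210,230,250,350,400,450,480,530,570,630,660,690,730,780,860,950,1000)

> u <- 1000*runif(5)

> findInterval(u, c(0, cum), left.open = TRUE)

The result is the vector of indices of the selected municipalities.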

References

Crawley, M.J. (2013). The R Book, 2nd edition. Chichester: Wiley.

Rasch, D. and Schott, D. (2018). Mathematical Statistics. Oxford: Wiley.

Rasch, D., Herrendörfer, G., Bock, J., Victor, N., and Guiard, V. (2008). Verfahrensbibliothek Versuchsplanung und -auswertung, 2. verbesserte Auflage in einem Band mit CD. München, Wien: R. Oldenbourg Verlag.

Rasch, D., Pilz, J., Verdooren, R., and Gebhardt, A. (2011). Optimal Experimental Design with R. Boca Raton: Chapman and Hall.

2 Point Estimation

2.1 Introduction

The theory of point estimation is described in most books about mathematical statistics, and we refer here, as in other chapters, mainly to Rasch and Schott (2018).

We describe the problem as follows. Let the distribution Pθ of a random variable y depend on a parameter (vector) θ ∈ Ω ⊆ Rp, p ≥ 1. With the help of a realisation Y = (y1, y2, … , yn)T, n ≥ 1, of a random sample Y = (y1, y2, … , yn)T we have to make a statement concerning the value of θ (or a function of it). The elements of a random sample Y are independently and identically distributed (i.i.d.) like y. Obviously the statement about θ should be as precise as possible. What this really means depends on the choice of the loss function defined in Section 1.4 of Rasch and Schott (2018). We define an estimator S(Y), i.e. a measurable mapping of Rn into Ω, taking the value S(Y) for the realisation Y = (y1, y2, … , yn)T of Y, where S(Y) is called the estimate of θ. The estimate is thus the realisation of the estimator. In this chapter, data are assumed to be realisations (y1, y2, … , yn) of one random sample, where n is called the sample size; the case of more than one sample is discussed in the following chapters. The random sample, i.e. the random variable y, stems from some distribution, which is specified when the method of estimation depends on the distribution, as in maximum likelihood estimation. For this distribution the rth central moment

μr = E[(y − μ)r] (2.1)

is assumed to exist where μ = E(y) is the expectation and σ2 = E[(y − μ)2] is the variance of y. The rth central sample moment mr is defined as

mr = (1/n)[(y1 − ȳ)r + (y2 − ȳ)r + … + (yn − ȳ)r] (2.2)

with

ȳ = (1/n)(y1 + y2 + … + yn) (2.3)

An estimator S(Y) based on a random sample Y = (y1, y2, … , yn)T of size n ≥ 1 is said to be unbiased with respect to θ if

E[S(Y)] = θ (2.4)

holds for all θ ∈ Ω.

The difference bn(θ) = E[S(Y)] − θ is called the bias of the estimator S(Y).

We show here how R can easily calculate estimates of location and scale parameters as well as higher moments from a data set. We first create a simple data set y in R. The following values are weights in kilograms and therefore non-negative.

> y <- c(5,7,1,7,8,9,13,9,10,10,18,10,15,10,10,11,8,11,12,13,15, 22,10,25,11)

If we consider y as a sample, the sample size n can be determined with R via

> length(y)

[1] 25

i.e. n = 25. We start with estimating the parameters of location.

In Sections 2.2, 2.3, and 2.4 we assume that we observe measurements in an interval scale or ratio scale; if they are in an ordinal or nominal scale we use the methods described in Section 2.5.

2.2 Estimating Location Parameters

When we estimate any parameter we assume that it exists; so, speaking about expectations, the skewness γ1 = μ3/σ3, the kurtosis γ2 = [μ4/σ4] − 3 and so on, we assume that the corresponding moments of the underlying distribution exist.

The arithmetic mean, or briefly, the mean

ȳ = (1/n)(y1 + y2 + … + yn) (2.5)

is an estimate of the expectation μ of some distribution.

Problem 2.1

Calculate the arithmetic mean of a sample.

Solution

Use the command > mean().

> mean(y)

Example

We use the sample y already defined above and obtain

> y<- c(5,7,1,7,8,9,13,9,10,10,18,10,15,10,10,11,8,11,12,13,15,22, 10,25,11)

> mean(y)

[1] 11.2

i.e. ȳ = 11.2.

The arithmetic mean is a least squares estimate of the expectation μ of y.

The corresponding least squares estimator ȳ = (1/n)(y1 + y2 + … + yn), built from the random variables yi, is unbiased.
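Unbiasedness can be illustrated by a small simulation; a sketch with hypothetical settings of ours (normal observations with expectation μ = 10, standard deviation 2, samples of size 25):

> set.seed(123)

> means <- replicate(10000, mean(rnorm(25, mean = 10, sd = 2)))

> mean(means)

The average of the 10 000 sample means lies very close to the true expectation 10.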

Problem 2.2

Calculate the extreme values y(1) = min(y) and y(n) = max(y) of a sample.

Solution

We obtain the extreme values using the R commands > min() and > max().

Example

Again, we use the sample y defined above and obtain

> min(y)

[1] 1

> max(y)

[1] 25

i.e. y(1) = 1 and y(25) = 25 if we denote the jth element of the ordered set of Y by y(j) such that y(1) ≤  …  ≤ y(n) holds. Note: you can get both values using the command > range(y).

Sometimes one or more elements of Y = (y1, y2, … , yn)T do not have the same distribution as the others and Y = (y1, y2, … , yn)T is not a random sample.

If only a few of the elements of Y have a different distribution we call them outliers. Often the minimum and the maximum values of y represent realisations of such outliers. If we conjecture the existence of such outliers we can use special L-estimators such as the trimmed or the Winsorised mean. Outliers in observed values can occur even if the corresponding element of Y is not an outlier; this can happen by incorrectly writing down an observed number or by an error in the measuring instrument.

L‐estimators are weighted means of order statistics (where L stands for linear combination). If we arrange the elements of the realisation Y of Y according to their magnitude, and if we denote the jth element of this ordered set by y(j) such that y(1) ≤  …  ≤ y(n) holds, then Y(.) = (y(1), … , y(n))T is a function of the realisation of Y, and S(Y) = Y(.) = (y(1), … , y(n))T is said to be the order statistic vector; the component y(i) is called the ith order statistic, and

L = c1y(1) + c2y(2) + … + cny(n) with given weights ci (2.6)

is said to be an L-estimator; its realisation is called an L-estimate.

If we put ci = 0 for i ≤ t and for i > n − t, and ci = 1/(n − 2t) otherwise, in (2.6) with 0 ≤ t < n/2, then

ȳt = [1/(n − 2t)][y(t+1) + y(t+2) + … + y(n−t)] (2.7)

is called the t-trimmed mean.
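In R a trimmed mean is obtained via the trim argument of mean(), which expects the fraction (not the number t) of observations removed from each end; for the weight data above, trim = 2/25 suppresses the t = 2 smallest and the t = 2 largest values:

> mean(y, trim = 2/25)

[1] 10.80952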

If we do not suppress the t smallest and the t largest observations, but concentrate them in the values y(t + 1) and y(n − t), respectively, then we get the so‐called  Winsorised mean

ȳW(t) = (1/n)[(t + 1)y(t+1) + y(t+2) + … + y(n−t−1) + (t + 1)y(n−t)] (2.8)

The median in samples of even size n = 2m can be defined as the 1/2 Winsorised mean

ỹ = (1/2)(y(m) + y(m+1)) (2.9)
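R computes the sample median directly with median(); for the weight data above (n = 25 is odd, so the median is the 13th order statistic):

> median(y)

[1] 10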