Instructs readers on how to use methods of statistics and experimental design with R software.

Applied statistics covers both the theory and the application of modern statistical and mathematical modelling techniques to applied problems in industry, public services, commerce, and research. It proceeds from a strong theoretical background, but it is practically oriented to develop one's ability to tackle new and non-standard problems confidently. Taking a practical approach, this user-friendly guide teaches readers how to use the methods of statistics and experimental design without going deep into the theory. Applied Statistics: Theory and Problem Solutions with R includes chapters that cover the R package, sampling procedures, analysis of variance, point estimation, and more. It follows on the heels of Rasch and Schott's Mathematical Statistics, taking the lessons learned from that book's theoretical background to another level by adding instructions on how to employ the methods using R. It also adds two important chapters that have no prototype in that theoretical background: Generalised Linear Models and Spatial Statistics.

* Offers a practical rather than theoretical approach to the subject of applied statistics
* Provides a pre-experimental as well as post-experimental approach to applied statistics
* Features classroom-tested material
* Applicable to a wide range of people working in experimental design and all empirical sciences
* Includes 300 different procedures with R and examples with R-programs for the analysis and for determining minimal experimental sizes

Applied Statistics: Theory and Problem Solutions with R will appeal to experimenters, statisticians, mathematicians, and all scientists using statistical procedures in the natural sciences, medicine, and psychology, amongst others.
Page count: 749
Year of publication: 2019
Cover
Preface
References
1 The R‐Package, Sampling Procedures, and Random Variables
1.1 Introduction
1.2 The Statistical Software Package R
1.3 Sampling Procedures and Random Variables
References
2 Point Estimation
2.1 Introduction
2.2 Estimating Location Parameters
2.3 Estimating Scale Parameters
2.4 Estimating Higher Moments
2.5 Contingency Tables
References
3 Testing Hypotheses – One‐ and Two‐Sample Problems
3.1 Introduction
3.2 The One‐Sample Problem
3.3 The Two‐Sample Problem
References
4 Confidence Estimations – One‐ and Two‐Sample Problems
4.1 Introduction
4.2 The One‐Sample Case
4.3 The Two‐Sample Case
References
5 Analysis of Variance (ANOVA) – Fixed Effects Models
5.1 Introduction
5.2 Planning the Size of an Experiment
5.3 One‐Way Analysis of Variance
5.4 Two‐Way Analysis of Variance
5.5 Three‐Way Classification
References
6 Analysis of Variance – Models with Random Effects
6.1 Introduction
6.2 One‐Way Classification
6.3 Two‐Way Classification
6.4 Three‐Way Classification
References
7 Analysis of Variance – Mixed Models
7.1 Introduction
7.2 Two‐Way Classification
7.3 Three‐Way Layout
References
8 Regression Analysis
8.1 Introduction
8.2 Regression with Non‐Random Regressors – Model I of Regression
8.3 Models with Random Regressors
References
9 Analysis of Covariance (ANCOVA)
9.1 Introduction
9.2 Completely Randomised Design with Covariate
9.3 Randomised Complete Block Design with Covariate
9.4 Concluding Remarks
References
10 Multiple Decision Problems
10.1 Introduction
10.2 Selection Procedures
10.3 The Subset Selection Procedure for Expectations
10.4 Optimal Combination of the Indifference Zone and the Subset Selection Procedure
10.5 Selection of the Normal Distribution with the Smallest Variance
10.6 Multiple Comparisons
References
11 Generalised Linear Models
11.1 Introduction
11.2 Exponential Families of Distributions
11.3 Generalised Linear Models – An Overview
11.4 Analysis – Fitting a GLM – The Linear Case
11.5 Binary Logistic Regression
11.6 Poisson Regression
11.7 The Gamma Regression
11.8 GLM for Gamma Regression
11.9 GLM for the Multinomial Distribution
References
12 Spatial Statistics
12.1 Introduction
12.2 Geostatistics
12.3 Special Problems and Outlook
References
Appendix A: List of Problems
Appendix B: Symbolism
Appendix C: Abbreviations
Appendix D: Probability and Density Functions
Index
End User License Agreement
List of Tables

Chapter 1
Table 1.1 Number of inhabitants in 23 municipalities of Vienna.
Chapter 2
Table 2.1 Number of noxious weed seeds.
Table 2.2 Some results of the first 20 steps in the iteration of the Heifer exam...
Table 2.3 A two‐by‐two contingency table – model I.
Table 2.4 A two‐by‐two contingency table – model II.
Table 2.5 A two‐by‐two contingency table – for calculating association measures.
Table 2.6 Mother tongue and marital status of the mother of 50 children.
Table 2.7 Hair and eye colour of 2000 German persons.
Chapter 3
Table 3.1 P-quantiles Z(P) of the standard normal distribution.
Table 3.2 Situations and decisions in hypotheses testing.
Table 3.3 Values of the power function for n = 9, 16, 25, σ = 1 and specia...
Table 3.4 The litter weights of mice (in grams) and the differences between the ...
Table 3.6 (γ1, γ2)-values and the corresponding coefficients used in th...
Chapter 4
Table 4.1 Values of ...
Table 4.2 Confidence table of two kinds of smokers.
Chapter 5
Table 5.1 Observations yij of an experiment with a levels of a factor A.
Table 5.2 Theoretical ANOVA table: one‐way classification, model I.
Table 5.3 Empirical ANOVA table: one‐way classification, model I.
Table 5.4 Performances (milk fat in kg) yij of the daughters of three sires.
Table 5.5 ANOVA table for testing the hypothesis H0: a1 = a2 = a3 = 0...
Table 5.6 Results of testing pig fattening – fattening days (from 40 kg to 110 k...
Table 5.7 Empirical ANOVA table of a two‐way cross‐classification with equal sub...
Table 5.8 Observations (loss in per cent of dry mass, during storage of 300 days...
Table 5.9 ANOVA Table of Example 5.9.
Table 5.10 Analysis of variance table of a two‐way cross‐classification with equ...
Table 5.11 Observations of the carotene storage experiment of Example 5.12.
Table 5.12 ANOVA table for the carotene storage experiment of Example 5.12.
Table 5.13 Theoretical ANOVA table of the two‐way nested classification for mode...
Table 5.14 Observations of the example.
Table 5.15 ANOVA table of a three‐way cross‐classification with equal subclass n...
Table 5.16 Three‐way classification with factors kind of storage, packaging mate...
Table 5.17 ANOVA Table for data of Table 5.16.
Table 5.18 Water temperature (T), water salinity (S), and density of shrimp popu...
Table 5.19 ANOVA table of a three‐way nested classification for model I.
Table 5.20 Observations of a three‐way nested classification.
Table 5.21 Observations of a mixed classification type (A≻B) × C with a = ...
Table 5.22 ANOVA table for a balanced three-way mixed classification (B ≺ A) × C...
Table 5.23 ANOVA table and expectations of the MS for model I of a balanced thre...
Table 5.24 Observations of a mixed classification type (A×B)≻C with a = 2,...
Chapter 6
Table 6.1 Expected mean squares of the one‐way ANOVA model II.
Table 6.2 Milk fat performances yij of daughters of ten sires.
Table 6.3 ANOVA table of model II with E(MS) of the example of Problem 6.1.
Table 6.4 ANOVA table of model II of the unbalanced one‐way classification with ...
Table 6.5 Deviations of the specification y of three at random chosen products fr...
Table 6.6 The column E(MS) of the two-way nested classification for model II.
Table 6.7 Data of the example for Problem 6.9.
Table 6.8 The column E(MS) as supplement for model II to the analysis of variance...
Table 6.9 (Kuehl 1994) Observations of products produced by operator Ai on machin...
Table 6.10 Test statistics for testing hypotheses and distributions of these tes...
Table 6.11 Expectations of the MS of a three-way nested classification for model ...
Table 6.12 Observations yijkl of a three-way nested classification model II.
Table 6.13 ANOVA table with df and expected mean squares E(MS) of model (6.29).
Table 6.14 Data in a three-way mixed classification ((B ≺ A) × C).
Table 6.15 ANOVA table with df and expected mean squares E(MS) of model (6.30).
Table 6.16 Data in a three-way mixed classification C ≺ (A×B) m...
Chapter 7
Table 7.1 Expectations of the MS in Table 5.10 for a mixed model (levels of A fix...
Table 7.2 Yield per plot y in kilograms of a variety trial.
Table 7.3 Yield per plot y in kilograms of a variety-location, two-way classifica...
Table 7.4 Yields of 6 varieties tested on 12 randomly chosen farms.
Table 7.5 E(MS) for balanced nested mixed models.
Table 7.6 Data of an experiment to determine the content uniformity of film‐coat...
Table 7.7 Data from Example 7.5 with a random factor A of batches and a fixed fac...
Table 7.8 ANOVA table – three‐way ANOVA – cross‐classification, balanced case.
Table 7.9 Expected mean squares for the three‐way cross‐classification – balance...
Table 7.10 ANOVA table of the three‐way nested classification – unbalanced case.
Table 7.11 Expected mean squares for the balanced case of model III.
Table 7.12 Expected mean squares for balanced case of model IV.
Table 7.13 Expected mean squares for model V.
Table 7.14 Expected mean squares for model VI.
Table 7.15 Expected mean squares for model VII.
Table 7.16 Expected mean squares for model VIII.
Table 7.17 ANOVA table for the balanced three‐way analysis of variance – mixed c...
Table 7.18 Expected mean squares for model III.
Table 7.19 Expected mean squares for model IV.
Table 7.20 Expected mean squares for the balanced model V.
Table 7.21 Expected mean squares for model VI.
Table 7.22 ANOVA table for the three‐way balanced analysis of variance – mixed c...
Table 7.23 Expected mean squares for balanced model III.
Table 7.24 Expected mean squares for balanced model IV.
Table 7.25 Expected mean squares for the balanced model V.
Table 7.26 Expected mean squares for model VI.
Table 7.27 Expected mean squares for the balanced model VII.
Table 7.28 Expected mean squares for model VIII.
Chapter 8
Table 8.1 The height of hemp plants (y in centimetres) during growth (x age in w...
Table 8.2 Shoe sizes (x in centimetres) and body heights (y in centimetres) from...
Table 8.3 Average withers heights of 112 cows in the first 60 months of life.
Table 8.4 Carotene content (in mg/100 g dry matter) y of grass in dependency of t...
Table 8.5 Lower and upper bounds of the realised 95%-confidence band for β0...
Table 8.6 Time in minutes (x) and change of rotation in degrees (y) from experim...
Table 8.7 Leaf surfaces of oil palms y in square metres and age x in years.
Table 8.8 Relative frequencies of 10 000 simulated samples for the correct accep...
Table 8.9 Leaf surface yi in m2 of oil palms on a trial area in dependency of age...
Table 8.10 D- and G-optimal designs for polynomial regression for x ∈ [a, b] and ...
Table 8.11 Locally optimal designs for the exponential regression.
Table 8.12 Locally optimal designs for the logistic regression.
Table 8.13 Locally optimal designs for the Bertalanffy regression.
Table 8.14 Locally optimal designs for the Gompertz regression.
Table 8.15 Optimal size of sub-samples (k) and optimal nominal type-II-risk (β...
Table 8.16 Optimal size of subsamples (k) and optimal nominal type-II-risk (β...
Chapter 9
Table 9.1 Nested ANOVA table for the test of the null hypothesis Hβ0: β...
Table 9.2 Nested ANOVA table for the test HA0: 'all ai are equal'.
Table 9.3 Strength of a monofilament fibre produced by three different machines M...
Table 9.4 Data of a randomised double‐blind study.
Table 9.5 Data of a randomised complete block design with four blocks (factor B)...
Chapter 10
Table 10.1 Sample means of Example 10.1.
Table 10.2 Values of nGu used in the simulation experiment.
Table 10.3 Values of average … found in the subset selection.
Table 10.4 Optimal values of PB used in the simulation experiment.
Table 10.5 Average total size of the simulation experiment (upper entry) and the...
Table 10.6 Relative frequencies of correct selection calculated from 100 000 run...
Table 10.7 Simulated observations of Example 5.7.
Table 10.8 Differences between means of Example 5.7.
Table 10.9 Minimal sample sizes for several multiple decision problems.
Chapter 11
Table 11.1 Link function, random and systematic components of some GLMs.
Table 11.2 Observations (loss during storage in percent of dry mass during stora...
Table 11.3 Analysis of variance table of Problem 11.4.
Table 11.4 Values of Nijk (nijk) of the block experiment.
Table 11.5 Number n of wasps per group and number k of these n wasps finding eggs...
Table 11.6 Number of soldiers dying from kicks by army mules.
Table 11.7 Values of kijk (nijk) on plots k = 1, …, mij in block i and genotype j ...
Table 11.8 Values kij and mij found in three strains.
Table 11.9 Clotting time of blood in seconds (y) for normal plasma diluted to ni...
Chapter 12
Table 12.1 Gauss–Krüger code numbers.
List of Figures

Chapter 3
Figure 3.1 The power functions of the t-test testing the null hypothesis H0: μ...
Figure 3.2 Graphical representation of the two risks α and β and the...
Figure 3.3 Result of the example.
Figure 3.4 Graph of the triangular sequential two‐sample test of Problem 3.14.
Chapter 8
Figure 8.1 Scatter‐plot for the association between age and height of hemp pla...
Figure 8.2 Scatter‐plot for the observations of Example 8.2.
Figure 8.3 Scatter‐plot of the data in Example 8.3.
Figure 8.4 Scatter‐plot and estimated regression line of the carotene example ...
Figure 8.5 Estimated regression lines of the carotene example (sack …, glass …).
Figure 8.6 Scatter‐plot and estimated regression line of the carotene example ...
Figure 8.7 Scatter‐plot of the observations from Table 8.3 and the fitted func...
Figure 8.8 Scatter‐plot of the data and the fitted exponential function in the...
Figure 8.9 Fitted regression function of the example in Problem 8.28.
Figure 8.10 Fitted regression function of the example in Problem 8.32.
Figure 8.11 Graph of the triangle of the test of Example 8.7.
Chapter 9
Figure 9.1 Scatter-plot of the example in Problem 9.1 with M1 as 1, ...
Figure 9.2 Scatter‐plot with regression lines of the example in Problem 9.2.
Chapter 10
Figure 10.1 Relationship between the total experimental size N and ….
Chapter 11
Figure 11.1 Data fitted with the lm-function.
Figure 11.2 Data fitted with glm-functions.
Chapter 12
Figure 12.1 Exploratory spatial data analysis of dataset s100.
Figure 12.2 Exponential semi‐variogram model underlying the dataset s100.
Figure 12.3 Empirical directional variograms for dataset s100.
Figure 12.4 Variogram cloud and omnidirectional variogram of elevation data.
Figure 12.5 Empirical semi‐variogram (circles) and the fitted theoretical semi...
Figure 12.6 Predicted values and standard errors for s100.
Figure 12.7 Association between standard deviations and data locations.
Figure 12.8 Determining a 95% confidence interval for Box–Cox parameter.
Figure 12.9 Empirical and fitted semi‐variogram model for Meuse zinc data.
Figure 12.10 OK‐predicted values of zinc data (on a log‐scale).
Figure 12.11 OK variances of predicted zinc values.
Dieter Rasch, Rostock, Germany
Rob Verdooren, Wageningen, The Netherlands
Jürgen Pilz, Klagenfurt, Austria
This edition first published 2020
© 2020 John Wiley & Sons Ltd
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.
The right of Dieter Rasch, Rob Verdooren and Jürgen Pilz to be identified as the authors of this work has been asserted in accordance with law.
Registered Offices
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK
Editorial Office
9600 Garsington Road, Oxford, OX4 2DQ, UK
For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.
Wiley also publishes its books in a variety of electronic formats and by print‐on‐demand. Some content that appears in standard print versions of this book may not be available in other formats.
Limit of Liability/Disclaimer of Warranty
While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
Library of Congress Cataloging‐in‐Publication Data
Names: Rasch, Dieter, author. | Verdooren, L. R., author. | Pilz, Jürgen,
1951- author.
Title: Applied statistics : theory and problem solutions with R / Dieter
Rasch (Rostock, GM), Rob Verdooren, Jürgen Pilz.
Description: Hoboken, NJ : Wiley, 2020. | Includes bibliographical references
and index. |
Identifiers: LCCN 2019016568 (print) | LCCN 2019017761 (ebook) | ISBN
9781119551553 (Adobe PDF) | ISBN 9781119551546 (ePub) | ISBN 9781119551522
(hardcover)
Subjects: LCSH: Mathematical statistics-Problems, exercises, etc. | R
(Computer program language)
Classification: LCC QA276 (ebook) | LCC QA276 .R3668 2019 (print) | DDC
519.5-dc23
LC record available at https://lccn.loc.gov/2019016568
Cover design: Wiley
Cover Images: © Dieter Rasch, © whiteMocca/Shutterstock
We wrote this book for people who have to apply statistical methods in their research but whose main interest is not in theorems and proofs. Accordingly, our aim is not to provide the detailed theoretical background of statistical procedures. While mathematical statistics, as a branch of mathematics, consists of definitions as well as theorems and their proofs, applied statistics gives hints for the application of the results of mathematical statistics.
Sometimes applied statistics uses simulation results in place of results from theorems. An example: the normality assumption needed for many theorems in mathematical statistics can often be neglected in applications concerning location parameters such as the expectation; see Rasch and Tiku (1985). Nearly all statistical tests and confidence estimations for expectations have been shown by simulations to be very robust against violation of the normality assumption needed to prove the corresponding theorems.
We gave the present book an analogous structure to that of Rasch and Schott (2018) so that the reader can easily find the corresponding theoretical background there. Chapter 11 ‘Generalised Linear Models’ and Chapter 12 ‘Spatial Statistics’ of the present book have no prototype in Rasch and Schott (2018). Further, the present book contains no exercises; lecturers can either use the exercises (with solutions in the appendix) in Rasch and Schott (2018) or the exercises in the problems mentioned below.
Instead, our aim was to demonstrate the theory presented in Rasch and Schott (2018), and that underlying the new Chapters 11 and 12, using functions and procedures available in the statistical programming system R, which has become the gold standard for statistical computing.
Within the text, the reader will often find the sequence problem – solution – example, with problems numbered within the chapters. Readers interested only in specific applications may in many cases find the corresponding procedure via the list of problems in Appendix A.
We thank Alison Oliver (Wiley, Oxford) and Mustaq Ahamed (Wiley) for their assistance in publishing this book.
We are very interested in the comments of readers. Please contact:
d_rasch@t‐online.de, [email protected], [email protected].
Rostock, Wageningen, and Klagenfurt, June 2019, the authors.
Rasch, D. and Tiku, M.L. (eds.) (1985). Robustness of Statistical Methods and Nonparametric Statistics. Proceedings of the Conference on Robustness of Statistical Methods and Nonparametric Statistics, held at Schwerin (GDR), May 29 – June 2, 1983. Dordrecht, Boston, Lancaster, Tokyo: Reidel.
Rasch, D. and Schott, D. (2018). Mathematical Statistics. Oxford: Wiley.
In this chapter we give an overview of the software package R and introduce basic knowledge about random variables and sampling procedures.
In practical investigations, professional statistical software is used to design experiments or to analyse data already collected. We apply here the software package R. Anybody can extend the functionality of R without restriction using free software tools; moreover, it is also possible to implement special statistical methods as well as to integrate procedures written in C and FORTRAN. Such tools are offered on the internet in standardised archives. The most popular archive is probably CRAN (Comprehensive R Archive Network), a server network that is supervised by the R Development Core Team. This network also offers the package OPDOE (optimal design of experiments), which is thoroughly described in Rasch et al. (2011). Further, it offers the following packages used in this book: car, lme4, DunnettTests, VCA, lmerTest, mvtnorm, seqtest, faraway, MASS, glm2, geoR, gstat.
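These packages can be installed from CRAN and loaded in the usual way; as a minimal sketch (shown here for OPDOE, the same pattern applies to the other packages listed above):
> install.packages("OPDOE")
> library(OPDOE)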
With only a few exceptions, R contains implementations of all statistical methods concerning analysis, evaluation, and planning. We refer to Crawley (2013) for details.
The software package R is available free of charge from http://cran.r-project.org for the operating systems Linux, MacOS X, and Windows. The installation under Microsoft Windows starts via the link 'Windows'; choosing 'base' leads to the installation page. Using 'Download R 2.X.X for Windows' (X stands for the required version number) the setup file can be downloaded. After this file is started, the setup assistant runs through the installation steps. In this book, all standard settings are adopted. The interested reader will find more information about R at http://www.r-project.org or in Crawley (2013).
After starting R, the input window opens, showing the red-coloured input prompt '>'. Commands are typed here and carried out by pressing the enter key. The output is given directly below the command line. The user can also insert line breaks and indentation to increase clarity; this does not influence how the commands are executed. For instance, a command to read the data y = (1, 3, 8, 11) is as follows:
> y <- c(1,3,8,11)
The assignment operator in R is the two-character sequence '<-'; alternatively, '=' can be used.
The Workspace is a special working environment in R. There, certain objects can be stored that were obtained during the current work with R. Such objects contain the results of computations and data sets. A Workspace is loaded using the menu
File – Load Workspace...
In this book the R-commands start with >. Readers who want to use the R-commands need only type or copy the text after > into the R-window.
An advantage of R is that, as with other statistical packages like SAS and IBM-SPSS, we no longer need an appendix with tables in statistical books. Such appendices often contain tables of the density or distribution function of the standard normal distribution; these values can easily be calculated using R.
The notation of this and the following chapters is just that of Rasch and Schott (2018).
Calculate the value ϕ(z) of the density function of the standard normal distribution for a given value z.
Use the command > dnorm(z, mean = 0, sd = 1). If the mean or sd is not specified they assume the default values of 0 and 1, respectively. Hence > dnorm(z) can be used in Problem 1.1.
We calculate the value ϕ(1) of the density function of the standard normal distribution using
> dnorm(1)
[1] 0.2419707
Calculate the value Φ(z) of the distribution function of the standard normal distribution for a given value z.
Use the command > pnorm(z, mean = 0, sd = 1).
We calculate the value Φ(1) of the distribution function of the standard normal distribution by > pnorm(1, mean = 0, sd = 1) or, using the default values, by > pnorm(1).
> pnorm(1)
[1] 0.8413447
Also, for other continuous distributions, we obtain the value of the density function by prefixing the R-name of the distribution with d, and the value of the distribution function by prefixing it with p. We demonstrate this in the next problem for the lognormal distribution.
Calculate the value of the density function of the lognormal distribution whose logarithm has mean equal to meanlog = 0 and standard deviation equal to sdlog = 1 for a given value z.
Use the command > dlnorm(z, meanlog = 0, sdlog = 1) or use the default values meanlog = 0 and sdlog = 1 using > dlnorm(z).
We calculate the value of the density function of the lognormal distribution with meanlog = 0 and sdlog = 1 using
> dlnorm(1)
[1] 0.3989423
Calculate the value of the distribution function of the lognormal distribution whose logarithm has mean equal to meanlog = 0 and standard deviation equal to sdlog = 1 for a given value z.
Use the command > plnorm(z, meanlog = 0, sdlog = 1) or use the default values meanlog = 0 and sdlog = 1 using > plnorm(z).
We calculate the value of the distribution function for z = 1 of the lognormal distribution with meanlog = 0 and sdlog = 1 using
> plnorm(1)
[1] 0.5
From most of the other distributions we need the quantiles (or percentiles) qP, defined by P(y ≤ qP) = P. These can be obtained by writing q followed by the R-name of the distribution.
Calculate the P%‐quantile of the t‐distribution with df degrees of freedom and optional non‐centrality parameter ncp.
Use the command > qt(P,df, ncp) and for a central t‐distribution use the default by omitting ncp.
Calculate the 95%‐quantile of the central t‐distribution with 10 degrees of freedom.
> qt(0.95,10)
[1] 1.812461
We demonstrate the procedure for the chi‐square and the F‐distribution.
Calculate the P%‐quantile of the χ2‐distribution with df degrees of freedom and optional non‐centrality parameter ncp.
Use the command > qchisq(P,df, ncp) and for the central χ2‐distribution with df degrees of freedom use > qchisq(P,df).
Calculate the 95%‐quantile of the central χ2‐distribution with 10 degrees of freedom.
> qchisq(0.95,10)
[1] 18.30704
Calculate the P%‐quantile of the F‐distribution with df1 and df2 degrees of freedom and optional non‐centrality parameter ncp.
Use the command > qf(P,df1,df2, ncp), and for the central F‐distribution with df1 and df2 degrees of freedom use > qf(P,df1,df2).
Calculate the 95%‐quantile of the central F‐distribution with 10 and 20 degrees of freedom!
> qf(0.95,10,20)
[1] 2.347878
For the calculation of further values of probability functions of discrete random variables, or of distribution functions and quantiles, the commands can be found via the help function in the tool bar of R (from which the 'manual' can be called up) or in Crawley (2013).
Even if we mainly discuss in this book how to plan experiments and to analyse observed data, we still need basic knowledge about random variables because, without it, we could not explain unbiased estimators, the expected length of a confidence interval, or how to define the risks of a statistical test.
A sampling procedure without replacement (wor) or with replacement (wr) is a rule for selecting a proper subset, called a sample, from a well-defined finite basic set of objects (population, universe). It is said to be at random if each element of the basic set has the same probability p of being drawn into the sample. We can also say that in a random sampling procedure each possible sample has the same probability of being drawn.
A (concrete) sample is the result of a sampling procedure. Samples resulting from a random sampling procedure are said to be (concrete) random samples or shortly samples.
If we consider all possible samples from a given finite universe, then, from this definition, it follows that each possible sample has the same probability to be drawn.
There are several random sampling procedures that can be used in practice. Basic sets of objects are mostly called (statistical) populations or, synonymously, (statistical) universes.
Concerning random sampling procedures, we distinguish (among other cases):
Simple (or pure) random sampling with replacement (wr), where each of the N elements of the population is selected with probability 1/N.
Simple random sampling without replacement (wor), where each unordered sample of n different objects has the same probability of being chosen.
In cluster sampling, the population is divided into disjoint subclasses (clusters). Random sampling without replacement is done among these clusters. In the selected clusters, all objects are taken into the sample. This kind of selection is often used in area sampling. It is only random corresponding to Definition 1.1 if the clusters contain the same number of objects.
In multi-stage sampling, sampling is done in several steps. We restrict ourselves to two stages of sampling where the population is decomposed into disjoint subsets (primary units). Part of the primary units is sampled randomly without replacement (wor) and within them pure random sampling without replacement (wor) is done with the secondary units. Multi-stage sampling is favourable if the population has a hierarchical structure (e.g. country, province, towns in the province). It is at random corresponding to Definition 1.1 if the primary units contain the same number of secondary units.
Sequential sampling, where the sample size is not fixed at the beginning of the sampling procedure. At first, a small sample with replacement is taken and analysed. Then it is decided whether the obtained information is sufficient, e.g. to reject or to accept a given hypothesis (see Chapter 3), or whether more information is needed by selecting a further unit.
When, in cluster or two-stage sampling, the clusters or primary units have different sizes (numbers of elements or areas), more sophisticated methods are used (Rasch et al. 2008, Methods 1/31/2110, 1/31/3100).
Both a random sampling (procedure) and arbitrary sampling (procedure) can result in the same concrete sample. Hence, we cannot prove by inspecting the concrete sample itself whether or not the sample is randomly chosen. We have to check the sampling procedure used instead.
In mathematical statistics, a random sampling with replacement procedure is modelled by a vector Y = (y1, y2, … , yn)T of random variables yi, i = 1, … , n, which are independently distributed as a random variable y, i.e. they all have the same distribution. The yi, i = 1, … , n are said to be independently and identically distributed (i.i.d.). This leads to the following definition.
A random sample of size n is a vector Y = (y1, y2, … , yn)T with n i.i.d. random variables yi, i = 1, … , n as elements.
Random variables are given in bold print (see Appendix A for motivation).
The vector Y = (y1, y2, … , yn)T is called a realisation of Y = (y1, y2, … , yn)T and is used as a model of a vector of observed values or values selected by a random selection procedure.
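In R, realisations of i.i.d. random variables can be generated with the r-prefixed counterparts of the d, p, and q functions used above; for instance, a realisation of a random sample of size n = 10 from the standard normal distribution (the output varies from run to run) is obtained by:
> rnorm(10)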
To explain this approach let us assume that we have a universe of 100 elements (the numbers 1–100). We like to draw a pure random sample without replacement (wor) of size n = 10 from this universe and model this by Y = (y1, y2, … , y10)T. When a random sample has been drawn it could be the vector Y = (y1, y2, … , y10)T = (3, 98, 12, 37, 2, 67, 33, 21, 9, 56)T = (2, 3, 9, 12, 21, 33, 37, 56, 67, 98)T. This means that it is only important which element has been selected and not at which place this has happened. All samples wor occur with the same probability 1/(100 choose 10). The denominator can be calculated by R with the > choose() command
> choose(100,10)
[1] 1.731031e+13
and from this the probability is 1/1.731031e+13 ≈ 5.776904e-14.
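The probability itself can be computed directly in R as the reciprocal:
> 1/choose(100,10)
[1] 5.776904e-14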
We can now write, for the random vector Y, P{Y = (2, 3, 9, 12, 21, 33, 37, 56, 67, 98)T} = 5.776904e-14. In a probability statement, something must always be random. To write the same statement for the realisation (y1, y2, … , y10)T = (2, 3, 9, 12, 21, 33, 37, 56, 67, 98)T is nonsense, because the realisation is a vector of specific numbers, and it is as nonsensical as asking for the probability that 5 equals 7.
To explain the situation again we consider the problem of throwing a fair die, i.e. a die where we know that each of the numbers 1, …, 6 occurs with the same probability 1/6. We ask for the probability that an even number is thrown. Because one half of the six numbers are even, this probability is 1/2. Assume we throw the die using a dice cup and keep the result hidden; then the probability is still 1/2. However, if we take the dice cup away, a realisation occurs, let us say a 5. Now it is stupid to ask what the probability is that 5 is even, or that an even number is even. Probability statements about realisations of random variables are senseless and not allowed. The reader of this book should only look at a probability statement in the form of a formula if something is in bold print; only in such a case is a probability statement possible.
We learn in Chapter 4 what a confidence interval is. It is defined as an interval with at least one random boundary, and we can, for example, calculate for some small α the probability 1 − α that the expectation of some random variable is covered by this interval. However, once we have realised boundaries, the interval is fixed and it either covers the expectation or it does not. In applied statistics, we work with observed data modelled by realised random variables; the calculated interval therefore does not allow a probability statement. We do not know, by using R or otherwise, whether the calculated interval covers the expectation or not. Why, then, did we fix this probability before starting the experiment when we cannot use it in interpreting the result?
The answer is not easy, but we will try to give some reasons. If a researcher has to carry out many similar experiments and in each of them calculates for some parameter a (1 − α) confidence interval, then he can say that in about (1 − α)100% of all cases the interval has covered the parameter, but of course he does not know when this happened.
What should we do when only one experiment has to be done? Then we should choose (1 − α) so large (say 0.95 or 0.99) that we can take the risk of making an erroneous statement by saying that the interval covers the parameter. This is analogous to the situation of a person who has a severe disease and needs an operation in hospital. The person can choose between two hospitals and knows that in hospital A about 99% of people operated on survived a similar operation and in hospital B only about 80%. Of course (without further information) the person chooses A even without knowing whether she/he will survive. As in normal life, also in science; we have to take risks and to make decisions under uncertainty.
We now show how R can easily solve simple problems of sampling.
Draw a pure random sample without replacement of size n < N from N given objects represented by the numbers 1, …, N, without replacing the drawn objects. There are (N choose n) possible unordered samples, each having the same probability p = 1/(N choose n) of being selected.
Insert in R a data file y with N entries and continue in the next line with > sample(y, n, replace = FALSE) or > sample(y, n, replace = F) with n < N to create a sample of n < N different elements from y; when we insert replace = TRUE we get random sampling with replacement. The default is replace = FALSE, hence for sampling without replacement we can use > sample(y, n).
We choose N = 9, and n = 5, with population values y = (1,2,3,4,5,6,7,8,9)
> y <- c(1,2,3,4,5,6,7,8,9)
> sample(y,5)
[1] 7 6 5 1 3
A pure random sampling with replacement also occurs if the random sample is obtained by replacing the objects immediately after drawing and each object has the same probability of coming into the sample using this procedure. Hence, the population always has the same number of objects before a new object is taken. This is only possible if the observation of objects works without destroying or changing them (counter-examples, where objects are destroyed or changed, are tensile breaking tests, medical examinations of killed animals, felling of trees, harvesting of food).
Draw with replacement a pure random sample of size n from N given objects represented by the numbers 1, …, N, replacing each object after it is drawn. There are N^n possible ordered samples, each having the same probability 1/N^n of being selected.
Insert in R a data file y with N entries and continue in the next line with > sample(y, n, replace = TRUE) or > sample(y, n, replace = T) to create a sample of size n, not necessarily with different elements from y.
Example with n < N
> y<-c(1,2,3,4,5,6,7,8,9)
> sample(y,5,replace=T)
[1] 2 4 6 4 2
Example with n > N
> y<-c(1,2,3,4,5,6,7,8,9)
> sample(y,10,replace=T)
[1] 3 9 5 5 9 9 8 7 6 3
A method that can sometimes be realised more easily is systematic sampling with a random start. It is applicable if the objects of the finite sampling set are numbered from 1 to N, and the sequence is not related to the character considered. If the quotient m = N/n is a natural number, a value i between 1 and m is chosen at random, and the sample is collected from objects with numbers i, m + i, 2m + i, … , (n – 1)m + i. Detailed information about this case and the case where the quotient m is not an integer can be found in Rasch et al. (2008, method 1/31/1210).
From a set of N objects systematic sampling with a random start should choose a random sample of size n.
We assume that in the sequence 1, 2, …, N there is no trend. Let us assume that m = N/n is an integer and select by pure random sampling a value 1 ≤ x ≤ m (sample of size 1) from the m numbers 1, …, m. Then the systematic sample with random start contains the numbers x, x + m, x + 2m, … , x + (n − 1)m.
We choose N = 500 and n = 20; the quotient m = 500/20 = 25 is an integer. Analogously to Problem 1.8 we draw a random sample of size 1 from (1, 2, …, 25) using R.
> y<- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15, 16,17,18,19,20,21,22,23,24,25)
> sample(y,1)
[1] 9
The final systematic sample with random start of size n = 20 starts with number x = 9 and m = 25: (9, 34, 59, 84, 109, 134, 159, 184, 209, 234, 259, 284, 309, 334, 359, 384, 409, 434, 459, 484).
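The whole systematic sample can also be generated in one step; a minimal sketch (x is the random start drawn as above):
> m <- 25; n <- 20
> x <- sample(1:m, 1)
> x + m*(0:(n-1))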
By stratified sampling, a random sample has to be drawn from a population of size N that is decomposed into s disjoint subpopulations (strata) of sizes N1, N2, … , Ns.
Partial samples of size ni are collected from the ith stratum (i = 1, 2, … , s), where pure random sampling procedures without replacement are used in each stratum. This leads to a random sampling without replacement procedure for the whole population if the numbers ni/n are chosen proportional to the numbers Ni/N. The final random sample contains n = n1 + n2 + ⋯ + ns elements.
Vienna, the capital of Austria, is subdivided into 23 municipalities. We repeat a table with the numbers of inhabitants in these municipalities from Rasch et al. (2011) and round the numbers for this example so that, with N = 1 700 000 and a total sample size of n = 1000, the values ni = nNi/N are integers.
Now we select by pure random sampling without replacement, as shown in Problem 1.8, from each municipality ni from the Ni inhabitants to reach a total random sample of 1000 inhabitants from the 1 700 000 people in Vienna.
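A minimal R sketch of this proportional allocation (with hypothetical stratum sizes; the inhabitants of each stratum are represented by their index numbers 1, …, Ni):
> Ni <- c(17000, 102000, 85000)   # rounded sizes of the first three strata
> N <- sum(Ni); n <- 120          # n chosen so that all ni are integers
> ni <- n*Ni/N                    # proportional allocation: 10, 60, 50
> lapply(seq_along(Ni), function(i) sample(Ni[i], ni[i]))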
While for the stratified random sampling objects are selected without replacement from each subset, for two-stage sampling, subsets or objects are selected at random without replacement at each stage, as described below. Let the population consist of s disjoint subsets of size N0, the primary units, in the two-stage case. Further, we suppose that the character values in the single primary units differ only at random, so that objects need not be selected from all primary units. If the desired sample size is n = r·n0 with r < s, then in the first step r of the s given primary units are selected using a pure random sampling procedure. In the second step, n0 objects (secondary units) are chosen from each selected primary unit, again applying pure random sampling. The number of possible samples is then (in R's notation) choose(s, r)·choose(N0, n0)^r, and each object of the population has the same probability to reach the sample corresponding to Definition 1.1.
Draw a random sample of size n in a two‐stage procedure by selecting first from the s primary units having sizes Ni (i = 1, …, s) exactly r units.
To draw a random sample without replacement of size n we select a divisor r of n and from the s primary units we randomly select r of them with probabilities proportional to the relative sizes Ni/N (i = 1, …, s). From each of the selected r primary units we select by pure random sampling without replacement n/r elements, which together give the total sample of secondary units.
We take again the values of Table 1.1 and select r = 5 from the s = 23 municipalities to take an overall sample of n = 1000. For this we split the interval (0, 1] into 23 subintervals, i = 1, …, 23, whose boundaries are the cumulated ni of Table 1.1 divided by 1000, and generate five uniformly distributed random numbers in (0, 1]. If a random number multiplied by 1000 falls into one of the 23 subintervals (which can easily be found using the 'cum' column in Table 1.1), the corresponding municipality is selected. If a further random number falls into the same interval, it is replaced by another uniformly distributed random number. We generate five such random numbers as follows:
Table 1.1 Number of inhabitants in 23 municipalities of Vienna.
Source: From Statistik Austria (2009) Bevölkerungsstand inclusive Revision seit 1.1. 2002, Wien, Statistik Austria.
Municipality      Ni             Ni (rounded)    ni     cum ni
Innere Stadt      16 958         17 000          10     10
Leopoldstadt      94 595         102 000         60     70
Landstraße        83 737         85 000          50     120
Wieden            30 587         34 000          20     140
Margarethen       52 548         51 000          30     170
Mariahilf         29 371         34 000          20     190
Neubau            30 056         34 000          20     210
Josefstadt        23 912         34 000          20     230
Alsergrund        39 422         34 000          20     250
Favoriten         173 623        170 000         100    350
Simmering         88 102         85 000          50     400
Meidling          87 285         85 000          50     450
Hietzing          51 147         51 000          30     480
Penzing           84 187         85 000          50     530
Rudolfsheim       70 902         68 000          40     570
Ottakring         94 735         102 000         60     630
Hernals           52 701         51 000          30     660
Währing           47 861         51 000          30     690
Döbling           68 277         68 000          40     730
Brigittenau       82 369         85 000          50     780
Floridsdorf       139 729        136 000         80     860
Donaustadt        153 408        153 000         90     950
Liesing           91 759         85 000          50     1 000
Total             N* = 1 687 271   N = 1 700 000   n = 1 000
Rounded numbers Ni, ni, and cumulated ni.
> runif(5)
[1] 0.18769112 0.78229430 0.09359499 0.46677904 0.51150546
The first number corresponds to Mariahilf, the second to Floridsdorf, the third to Landstraße, the fourth to Hietzing, and the last one to Penzing. To obtain a random sample of size 1000 we take pure random samples of size 200 from the people of Mariahilf, Floridsdorf, Landstraße, Hietzing, and Penzing, respectively.
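The assignment of the random numbers to municipalities can also be done programmatically; a sketch using the cumulated ni of Table 1.1 (findInterval() is base R; the municipalities are numbered 1–23 in table order):
> cum <- c(10,70,120,140,170,190,210,230,250,350,400,450,480,530,570,630,660,690,730,780,860,950,1000)
> u <- c(0.18769112, 0.78229430, 0.09359499, 0.46677904, 0.51150546)
> findInterval(u*1000, cum) + 1
[1]  6 21  3 13 14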
Crawley, M.J. (2013). The R Book, 2nd edition. Chichester: Wiley.
Rasch, D. and Schott, D. (2018). Mathematical Statistics. Oxford: Wiley.
Rasch, D., Herrendörfer, G., Bock, J., Victor, N., and Guiard, V. (2008). Verfahrensbibliothek Versuchsplanung und -auswertung, 2. verbesserte Auflage in einem Band mit CD. München, Wien: R. Oldenbourg Verlag.
Rasch, D., Pilz, J., Verdooren, R., and Gebhardt, A. (2011). Optimal Experimental Design with R. Boca Raton: Chapman and Hall.
The theory of point estimation is described in most books about mathematical statistics, and we refer here, as in other chapters, mainly to Rasch and Schott (2018).
We describe the problem as follows. Let the distribution Pθ of a random variable y depend on a parameter (vector) θ ∈ Ω ⊆ Rp, p ≥ 1. With the help of a realisation Y = (y1, y2, … , yn)T, n ≥ 1, of a random sample Y we have to make a statement concerning the value of θ (or a function of it). The elements of a random sample Y are independently and identically distributed (i.i.d.) like y. Obviously the statement about θ should be as precise as possible; what this really means depends on the choice of the loss function defined in Section 1.4 of Rasch and Schott (2018). We define an estimator S(Y), i.e. a measurable mapping of Rn onto Ω, taking the value S(Y) for the realisation Y = (y1, y2, … , yn)T of Y, where S(Y) is called the estimate of θ. The estimate is thus the realisation of the estimator. In this chapter, data are assumed to be realisations (y1, y2, … , yn) of one random sample, where n is called the sample size; the case of more than one sample is discussed in the following chapters. The random sample, i.e. the random variable y, stems from some distribution, which is described when the method of estimation depends on the distribution – as in maximum likelihood estimation. For this distribution the rth central moment μr = E[(y − μ)r] is assumed to exist, where μ = E(y) is the expectation and σ2 = E[(y − μ)2] is the variance of y. The rth central sample moment mr is defined as mr = (1/n) Σi=1…n (yi − ȳ)r with ȳ = (1/n) Σi=1…n yi.
An estimator S(Y) based on a random sample Y = (y1, y2, … , yn)T of size n ≥ 1 is said to be unbiased with respect to θ if E[S(Y)] = θ holds for all θ ∈ Ω.
The difference bn(θ) = E[S(Y)] − θ is called the bias of the estimator S(Y).
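A quick simulation sketch of this concept (an illustration with assumed values: 10 000 samples of size n = 25 from a normal distribution with expectation μ = 5; the average of the sample means should be close to 5, reflecting the unbiasedness of the mean):
> mean(replicate(10000, mean(rnorm(25, mean = 5))))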
We show here how R can easily calculate estimates of location and scale parameters as well as higher moments from a data set. We first create a simple data set y in R. The following values are weights in kilograms and therefore non-negative.
> y <- c(5,7,1,7,8,9,13,9,10,10,18,10,15,10,10,11,8,11,12,13,15, 22,10,25,11)
If we consider y as a sample, the sample size n can be determined with R via
> length(y)
[1] 25
i.e. n = 25. We start with estimating the parameters of location.
In Sections 2.2, 2.3, and 2.4 we assume that we observe measurements in an interval scale or ratio scale; if they are in an ordinal or nominal scale we use the methods described in Section 2.5.
When we estimate any parameter we assume that it exists; so, when speaking about expectations, the skewness γ1 = μ3/σ3, the kurtosis γ2 = [μ4/σ4] − 3, and so on, we assume that the corresponding moments of the underlying distribution exist.
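As a sketch, moment-based sample analogues of γ1 and γ2 can be computed by hand from the weights vector y defined above, without any additional package (mr here denotes the rth central sample moment):
> m <- mean(y)
> s <- sqrt(mean((y - m)^2))    # square root of the second central sample moment
> mean((y - m)^3)/s^3           # sample skewness g1
> mean((y - m)^4)/s^4 - 3       # sample kurtosis g2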
The arithmetic mean, or briefly the mean, ȳ = (1/n) Σi=1…n yi, is an estimate of the expectation μ of some distribution.
Calculate the arithmetic mean of a sample.
Use the command > mean().
> mean(y)
We use the sample y already defined above and obtain
> y<- c(5,7,1,7,8,9,13,9,10,10,18,10,15,10,10,11,8,11,12,13,15,22, 10,25,11)
> mean(y)
[1] 11.2
i.e. ȳ = 11.2.
The arithmetic mean is a least squares estimate of the expectation μ of y. The corresponding least squares estimator ȳ = (1/n) Σi=1…n yi is unbiased.
Calculate the extreme values y(1) = min(y) and y(n) = max(y) of a sample.
We receive the extreme values using the R commands >min() and >max().
Again, we use the sample y defined above and obtain
> min(y)
[1] 1
> max(y)
[1] 25
i.e. y(1) = 1 and y(25) = 25 if we denote the jth element of the ordered set of Y by y(j) such that y(1) ≤ … ≤ y(n) holds. Note: you can get both values using the command > range(y).
Sometimes one or more elements of Y = (y1, y2, … , yn)T do not have the same distribution as the others and Y = (y1, y2, … , yn)T is not a random sample.
If only a few of the elements of Y have a different distribution we call them outliers. Often the minimum and the maximum values of y represent realisations of such outliers. If we conjecture the existence of such outliers we can use special L‐estimators as the trimmed or the Winsorised mean. Outliers in observed values can occur even if the corresponding element of Y is not an outlier. This can happen by incorrectly writing down an observed number or by an error in the measuring instrument.
L-estimators are weighted means of order statistics (where L stands for linear combination). If we arrange the elements of the realisation Y of Y according to their magnitude, and if we denote the jth element of this ordered set by y(j) such that y(1) ≤ … ≤ y(n) holds, then Y(.) = (y(1), … , y(n))T is a function of the realisation of Y, and S(Y) = Y(.) = (y(1), … , y(n))T is said to be the order statistic vector; the component y(i) is called the ith order statistic. A weighted sum
S(Y) = Σi=1…n ci y(i)   (2.6)
with fixed weights ci is said to be an L-estimator, and its realisation is called an L-estimate.
If we put ci = 0 for i ≤ t and for i > n − t, and ci = 1/(n − 2t) for t + 1 ≤ i ≤ n − t in (2.6), with 0 ≤ t < n/2, then
ȳ(t) = (1/(n − 2t)) Σi=t+1…n−t y(i)
is called the t-trimmed mean.
If we do not suppress the t smallest and the t largest observations, but concentrate them in the values y(t + 1) and y(n − t), respectively, then we get the so-called Winsorised mean
ȳw(t) = (1/n)[(t + 1)y(t + 1) + Σi=t+2…n−t−1 y(i) + (t + 1)y(n − t)].
The median in samples of even size n = 2m can be defined as the 1/2 Winsorised mean
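As a small illustration with the weights vector y from above: the trim argument of the base R function mean() gives the trimmed mean directly, while the Winsorised mean is computed by hand here (t = 5 of the n = 25 ordered observations are replaced at each end):
> mean(y, trim = 0.2)      # 20%-trimmed mean, i.e. t = 5 observations trimmed per end
[1] 10.46667
> t <- 5; n <- length(y); ys <- sort(y)
> ys[1:t] <- ys[t + 1]; ys[(n - t + 1):n] <- ys[n - t]
> mean(ys)                 # Winsorised mean
[1] 10.48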