A much-needed reference on survey sampling and its applications that presents the latest advances in the field
Seeking to show that sampling theory is a living discipline with a very broad scope, this book examines the foundations of survey sampling and the modern development of its theory. It offers readers a critical approach to the subject and discusses putting theory into practice. It also explores the treatment of non-sampling errors, covering topics that range from problems of coverage to the treatment of nonresponse. In addition, the book includes real examples, applications, and a large set of exercises with solutions.
Sampling and Estimation from Finite Populations begins with a look at the history of survey sampling. It then offers chapters on: population, sample, and estimation; simple and systematic designs; stratification; sampling with unequal probabilities; balanced sampling; cluster and two-stage sampling; and other topics on sampling, such as spatial sampling, coordination in repeated surveys, and multiple survey frames. The book also includes sections on: post-stratification and calibration on marginal totals; calibration estimation; estimation of complex parameters; variance estimation by linearization; and much more.
Sampling and Estimation from Finite Populations is an excellent book for methodologists and researchers in survey agencies and advanced undergraduate and graduate students in social science, statistics, and survey courses.
Number of pages: 573
Year of publication: 2020
Cover
List of Figures
List of Tables
List of Algorithms
Preface
Preface to the First French Edition
Table of Notations
Chapter 1: A History of Ideas in Survey Sampling Theory
1.1 Introduction
1.2 Enumerative Statistics During the 19th Century
1.3 Controversy on the Use of Partial Data
1.4 Development of a Survey Sampling Theory
1.5 The US Elections of 1936
1.6 The Statistical Theory of Survey Sampling
1.7 Modeling the Population
1.8 Attempt to a Synthesis
1.9 Auxiliary Information
1.10 Recent References and Development
Notes
Chapter 2: Population, Sample, and Estimation
2.1 Population
2.2 Sample
2.3 Inclusion Probabilities
2.4 Parameter Estimation
2.5 Estimation of a Total
2.6 Estimation of a Mean
2.7 Variance of the Total Estimator
2.8 Sampling with Replacement
Chapter 3: Simple and Systematic Designs
3.1 Simple Random Sampling without Replacement with Fixed Sample Size
3.2 Bernoulli Sampling
3.3 Simple Random Sampling with Replacement
3.4 Comparison of the Designs with and Without Replacement
3.5 Sampling with Replacement and Retaining Distinct Units
3.6 Inverse Sampling with Replacement
3.7 Estimation of Other Functions of Interest
3.8 Determination of the Sample Size
3.9 Implementation of Simple Random Sampling Designs
3.10 Systematic Sampling with Equal Probabilities
3.11 Entropy for Simple and Systematic Designs
Chapter 4: Stratification
4.1 Population and Strata
4.2 Sample, Inclusion Probabilities, and Estimation
4.3 Simple Stratified Designs
4.4 Stratified Design with Proportional Allocation
4.5 Optimal Stratified Design for the Total
4.6 Notes About Optimality in Stratification
4.7 Power Allocation
4.8 Optimality and Cost
4.9 Smallest Sample Size
4.10 Construction of the Strata
4.11 Stratification Under Many Objectives
Chapter 5: Sampling with Unequal Probabilities
5.1 Auxiliary Variables and Inclusion Probabilities
5.2 Calculation of the Inclusion Probabilities
5.3 General Remarks
5.4 Sampling with Replacement with Unequal Inclusion Probabilities
5.5 Nonvalidity of the Generalization of the Successive Drawing without Replacement
5.6 Systematic Sampling with Unequal Probabilities
5.7 Deville's Systematic Sampling
5.8 Poisson Sampling
5.9 Maximum Entropy Design
5.10 Rao–Sampford Rejective Procedure
5.11 Order Sampling
5.12 Splitting Method
5.13 Choice of Method
5.14 Variance Approximation
5.15 Variance Estimation
Exercises
Chapter 6: Balanced Sampling
6.1 Introduction
6.2 Balanced Sampling: Definition
6.3 Balanced Sampling and Linear Programming
6.4 Balanced Sampling by Systematic Sampling
6.5 Method of Deville, Grosbras, and Roth
6.6 Cube Method
6.7 Variance Approximation
6.8 Variance Estimation
6.9 Special Cases of Balanced Sampling
6.10 Practical Aspects of Balanced Sampling
Exercise
Chapter 7: Cluster and Two‐stage Sampling
7.1 Cluster Sampling
7.2 Two‐stage Sampling
7.3 Multi‐stage Designs
7.4 Selecting Primary Units with Replacement
7.5 Two‐phase Designs
7.6 Intersection of Two Independent Samples
Exercises
Chapter 8: Other Topics on Sampling
8.1 Spatial Sampling
8.2 Coordination in Repeated Surveys
8.3 Multiple Survey Frames
8.4 Indirect Sampling
8.5 Capture–Recapture
Chapter 9: Estimation with a Quantitative Auxiliary Variable
9.1 The Problem
9.2 Ratio Estimator
9.3 The Difference Estimator
9.4 Estimation by Regression
9.5 The Optimal Regression Estimator
9.6 Discussion of the Three Estimation Methods
Chapter 10: Post‐Stratification and Calibration on Marginal Totals
10.1 Introduction
10.2 Post‐Stratification
10.3 The Post‐Stratified Estimator in Simple Designs
10.4 Estimation by Calibration on Marginal Totals
10.5 Example
Chapter 11: Multiple Regression Estimation
11.1 Introduction
11.2 Multiple Regression Estimator
11.3 Alternative Forms of the Estimator
11.4 Calibration of the Multiple Regression Estimator
11.5 Variance of the Multiple Regression Estimator
11.6 Choice of Weights
11.7 Special Cases
11.8 Extension to Regression Estimation
Chapter 12: Calibration Estimation
12.1 Calibrated Methods
12.2 Distances and Calibration Functions
12.3 Solving Calibration Equations
12.4 Calibrating on Households and Individuals
12.5 Generalized Calibration
12.6 Calibration in Practice
12.7 An Example
Chapter 13: Model‐Based Approach
13.1 Model Approach
13.2 The Model
13.3 Homoscedastic Constant Model
13.4 Heteroscedastic Model 1 Without Intercept
13.5 Heteroscedastic Model 2 Without Intercept
13.6 Univariate Homoscedastic Linear Model
13.7 Stratified Population
13.8 Simplified Versions of the Optimal Estimator
13.9 Completed Heteroscedasticity Model
13.10 Discussion
13.11 An Approach that is Both Model‐ and Design‐based
Chapter 14: Estimation of Complex Parameters
14.1 Estimation of a Function of Totals
14.2 Variance Estimation
14.3 Covariance Estimation
14.4 Implicit Function Estimation
14.5 Cumulative Distribution Function and Quantiles
14.6 Cumulative Income, Lorenz Curve, and Quintile Share Ratio
14.7 Gini Index
14.8 An Example
Chapter 15: Variance Estimation by Linearization
15.1 Introduction
15.2 Orders of Magnitude in Probability
15.3 Asymptotic Hypotheses
15.4 Linearization of Functions of Interest
15.5 Linearization by Steps
15.6 Linearization of an Implicit Function of Interest
15.7 Influence Function Approach
15.8 Binder's Cookbook Approach
15.9 Demnati and Rao Approach
15.10 Linearization by the Sample Indicator Variables
15.11 Discussion on Variance Estimation
Chapter 16: Treatment of Nonresponse
16.1 Sources of Error
16.2 Coverage Errors
16.3 Different Types of Nonresponse
16.4 Nonresponse Modeling
16.5 Treating Nonresponse by Reweighting
16.6 Imputation
16.7 Variance Estimation with Nonresponse
Chapter 17: Summary Solutions to the Exercises
Bibliography
Author Index
Subject Index
End User License Agreement
Established by WALTER A. SHEWHART and SAMUEL S. WILKS
Editors: Noel Cressie, Garrett Fitzmaurice, David Balding, Geert Molenberghs, Geof Givens, Harvey Goldstein, David Scott, Adrian Smith, Ruey Tsay.
Yves Tillé
Université de Neuchâtel Switzerland
Most of this book has been translated from French by Ilya Hekimi
Original French title: Théorie des sondages : Échantillonnage et estimation en populations finies
This edition first published 2020
© 2020 John Wiley & Sons Ltd
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.
The right of Yves Tillé to be identified as the author of this work has been asserted in accordance with law.
Registered Offices
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK
Editorial Office
9600 Garsington Road, Oxford, OX4 2DQ, UK
For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.
Wiley also publishes its books in a variety of electronic formats and by print‐on‐demand. Some content that appears in standard print versions of this book may not be available in other formats.
Limit of Liability/Disclaimer of Warranty
While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
Library of Congress Cataloging‐in‐Publication Data
Names: Tillé, Yves, author. | Hekimi, Ilya, translator.
Title: Sampling and estimation from finite populations / Yves Tillé ; most
of this book has been translated from French by Ilya Hekimi.
Other titles: Théorie des sondages. English
Description: Hoboken, NJ : Wiley, [2020] | Series: Wiley series in
probability and statistics applied. Probability and statistics section |
Translation of: Théorie des sondages : échantillonnage et estimation
en populations finies. | Includes bibliographical references and index.
Identifiers: LCCN 2019048451 | ISBN 9780470682050 (hardback) | ISBN
9781119071266 (adobe pdf) | ISBN 9781119071273 (epub)
Subjects: LCSH: Sampling (Statistics) | Public opinion polls – Statistical
methods. | Estimation theory.
Classification: LCC QA276.6 .T62813 2020 | DDC 519.5/2 – dc23
LC record available at https://lccn.loc.gov/2019048451
Cover Design: Wiley
Cover Image: © gremlin/Getty Images
Figure 1.1
Auxiliary information can be used before or after data collection to improve estimations
Figure 4.1
Stratified design: the samples are selected independently from one stratum to another
Figure 5.1
Systematic sampling: example with inclusion probabilities
and
Figure 5.2
Method of Deville
Figure 5.3
Splitting into two parts
Figure 5.4
Splitting in
parts
Figure 5.5
Minimum support design
Figure 5.6
Decomposition into simple random sampling designs
Figure 5.7
Pivotal method applied on vector
Figure 6.1
Possible samples in a population of size
3
Figure 6.2
Fixed size constraint: the three samples of size
2 are connected by an affine subspace
Figure 6.3
None of the vertices of
is a vertex of the cube
Figure 6.4
Two vertices of
are vertices of the cube, but the third is not
Figure 6.5
Flight phase in a population of size
3 with a constraint of fixed size
2
Figure 7.1
Cluster sampling: the population is divided into clusters. Clusters are randomly selected. All units from the selected clusters are included in the sample
Figure 7.2
Two‐stage sampling design: we randomly select primary units in which we select a sample of secondary units
Figure 7.3
Two‐phase design: a sample
is selected in sample
Figure 7.4
The sample
is the intersection of samples
and
Figure 8.1
In a
grid, a systematic sample and a stratified sample with one unit per stratum are selected
Figure 8.2
Recursive quadrant function used for the GRTS method with three subdivisions
Figure 8.3
Original function with four random permutations
Figure 8.4
Samples of 64 points in a grid of
points using simple designs, GRTS, the local pivotal method, and the local cube method
Figure 8.5
Sample of 64 points in a grid of
points and Voronoï polygons. Applications to simple, systematic, and stratified designs, the local pivotal method, and the local cube method
Figure 8.6
Interval corresponding to the first wave (extract from Qualité, 2009)
Figure 8.7
Positive coordination when
(extract from Qualité, 2009)
Figure 8.8
Positive coordination when
(extract from Qualité, 2009)
Figure 8.9
Negative coordination when
(extract from Qualité, 2009)
Figure 8.10
Negative coordination when
(extract from Qualité, 2009)
Figure 8.11
Coordination of a third sample (extract from Qualité, 2009)
Figure 8.12
Two survey frames
and
cover the population. In each one, we select a sample
Figure 8.13
In this example, the points represent contaminated trees. During the initial sampling, the shaded squares are selected. The borders in bold surround the final selected zones
Figure 8.14
Example of indirect sampling. In population
the units surrounded by a circle are selected. Two clusters (
and
) of population
each contain at least one unit that has a link with a unit selected in population
. Units of
surrounded by a circle are selected at the end
Figure 9.1
Ratio estimator: observations aligned along a line passing through the origin
Figure 9.2
Difference estimator: observations aligned along a line of slope equal to 1
Figure 10.1
Post‐stratification: the population is divided in post‐strata, but the sample is selected without taking post‐strata into account
Figure 12.1
Linear method: pseudo‐distance
with
and
Figure 12.2
Linear method: function
with
and
Figure 12.3
Linear method: function
with
Figure 12.4
Raking ratio: pseudo‐distance
with
and
Figure 12.5
Raking ratio: function
with
and
Figure 12.6
Raking ratio: function
with
Figure 12.7
Reverse information: pseudo‐distance
with
and
Figure 12.8
Reverse information: function
with
and
Figure 12.9
Reverse information: function
with
Figure 12.10
Truncated linear method: pseudo‐distance
with
, and
Figure 12.11
Truncated linear method: function
with
, and
Figure 12.12
Truncated linear method: calibration function
with
, and
Figure 12.13
Pseudo‐distances
with
and
Figure 12.14
Calibration functions
with
and
Figure 12.15
Logistic method: pseudo‐distance
with
, and
Figure 12.16
Logistic method: function
with
Table 3.1
Simple designs: summary table
Table 3.2
Example of sample sizes required for different population sizes and different values of
for
and
Table 4.1
Application of optimal allocation: the sample size is larger than the population size in the third stratum
Table 4.2
Second application of optimal allocation in strata 1 and 2
Table 5.1
Minimum support design
Table 5.2
Decomposition into simple random sampling designs
Table 5.3
Decomposition into
simple random sampling designs
Table 5.4
Properties of the methods
Table 6.1
Population of 20 students with variables, constant, gender (1, male, 2 female), age, and a mark of 20 in a statistics exam
Table 6.2
Totals and expansion estimators for balancing variables
Table 6.3
Variances of the expansion estimators of the means under simple random sampling and balanced sampling
Table 7.1
Block number, number of households, and total household income
Table 8.1
Means of spatial balancing measures based on Voronoï polygons
and modified Moran indices
for six sampling designs on 1000 simulations
Table 8.2
Selection intervals for negative coordination and selection indicators in the case where the PRN falls within the interval. On the left, the case where
(Figure 8.9). On the right, the case where
(Figure 8.10)
Table 8.3
Selection indicators for each selection interval for unit
Table 9.1
Estimation methods: summary table
Table 10.1
Population partition
Table 10.2
Totals with respect to two variables
Table 10.3
Calibration, starting table
Table 10.4
Salaries in Euros
Table 10.5
Estimated totals using simple random sampling without replacement
Table 10.6
Known margins using a census
Table 10.7
Iteration 1: row total adjustment
Table 10.8
Iteration 2: column total adjustment
Table 10.9
Iteration 3: row total adjustment
Table 10.10
Iteration 4: column total adjustment
Table 12.1
Pseudo‐distances for calibration
Table 12.2
Calibration functions and their derivatives
Table 12.3
Minima, maxima, means, and standard deviations of the weights for each calibration method
Table 14.1
Sample, variable of interest
, weights
, cumulative weights
and relative cumulative weights
Table 14.2
Table of fictitious incomes
, weights
, cumulative weights
Algorithm 1
Bernoulli sampling
Algorithm 2
Selection–rejection method
Algorithm 3
Reservoir method
Algorithm 4
Sequential algorithm for simple random sampling with replacement
Algorithm 5
Systematic sampling with equal probabilities
Algorithm 6
Systematic sampling with unequal probabilities
Algorithm 7
Algorithm for Poisson sampling
Algorithm 8
Sampford procedure
Algorithm 9
General algorithm for the cube method
Algorithm 10
Positive coordination using the Kish and Scott method
Algorithm 11
Negative coordination with the Rivière method
Algorithm 12
Negative coordination with EDS method
The first version of this book was published in 2001, the year I left the École Nationale de la Statistique et de l'Analyse de l'Information (ENSAI) in Rennes (France) to teach at the University of Neuchâtel in Switzerland. That version grew out of the materials for several sampling theory courses that I had taught in Rennes. At the ENSAI, the collaboration with Jean‐Claude Deville was particularly stimulating.
The editing of this new edition was laborious and was done in fits and starts. I thank all those who reviewed the drafts and provided me with their comments. Special thanks to Monique Graf for her meticulous re‐reading of some chapters.
The almost 20 years I spent in Neuchâtel were dotted with multiple adventures. I am particularly grateful to Philippe Eichenberger and Jean‐Pierre Renfer, who successively headed the Statistical Methods Section of the Federal Statistical Office. Their trust and professionalism helped to establish a fruitful exchange between the Institute of Statistics of the University of Neuchâtel and the Swiss Federal Statistical Office.
I am also very grateful to the PhD students whom I have had the pleasure of mentoring so far. Each thesis is an adventure that teaches both supervisor and doctoral student. Thank you to Alina Matei, Lionel Qualité, Desislava Nedyalkova, Erika Antal, Matti Langel, Toky Randrianasolo, Eric Graf, Caren Hasler, Matthieu Wilhelm, Mihaela Guinand‐Anastasiade, and Audrey‐Anne Vallée, who trusted me and whom I had the pleasure to supervise for a few years.
Yves Tillé
Neuchâtel, 2018
This book contains teaching material that I started to develop in 1994. Every chapter has served as support for teaching: a course, a training session, a workshop, or a seminar. By grouping this material, I hope to present a coherent and modern set of results on sampling, estimation, and the treatment of nonresponse, in other words, on all the statistical operations of a standard sample survey.
In producing this book, my goal is not to provide a comprehensive overview of survey sampling theory, but rather to show that sampling theory is a living discipline with a very broad scope. Where proofs have been omitted in several chapters, I have always been careful to refer the reader to the bibliographical references. The abundance of very recent publications attests to the fertility of the 1990s in this area. All the developments presented in this book are based on the so‐called “design‐based” approach. In theory, there is another point of view based on population modeling. I intentionally left this approach aside, not out of disinterest, but to propose an approach that I deem consistent and ethically acceptable to the public statistician.
I would like to thank all the people who, in one way or another, helped me produce this book: Laurence Broze, who entrusted me with my first sampling course at the University Lille 3, Carl Särndal, who encouraged me on several occasions, and Yves Berger, with whom I shared an office at the Université Libre de Bruxelles for several years and who gave me a multitude of relevant remarks. My thanks also go to Antonio Canedo, who taught me to use LaTeX, to Lydia Zaïd, who corrected the manuscript several times, and to Jean Dumais for his many constructive comments.
I wrote most of this book at the École Nationale de la Statistique et de l'Analyse de l'Information. The warm atmosphere that prevailed in the statistics department gave me a lot of support. I especially thank my colleagues Fabienne Gaude, Camelia Goga, and Sylvie Rousseau, who meticulously reread the manuscript, and Germaine Razé, who reproduced the proofs. Several exercises are due to Pascal Ardilly, Jean‐Claude Deville, and Laurent Wilms. I want to thank them for allowing me to reproduce them. My gratitude goes particularly to Jean‐Claude Deville for our fruitful collaboration within the Laboratory of Survey Statistics of the Center for Research in Economics and Statistics. The chapters on the splitting method and balanced sampling also reflect the research that we have done together.
Yves Tillé
Bruz, 2001
cardinal (number of elements in a set)
much less than
complement of
in
function
is the derivative of
factorial:
number of ways to choose
units from
units
interval
is approximately equal to
is proportional to
follows a specific probability distribution (for a random variable)
equals 1 if
is true and 0 otherwise
number of times unit
is in the sample
vector of
population regression coefficients
vector of population regression coefficients
regression coefficients for model
vector of regression coefficients of model
vector of estimated regression coefficients
vector of estimated regression coefficients of the model
cube whose vertices are samples
covariance between random variables
and
estimated covariance between random variables
and
population coefficient of variation
estimated coefficient of variation
expansion estimator survey weights
mathematical expectation under the sampling design
of estimator
mathematical expectation under the model
of estimator
mathematical expectation under the nonresponse mechanism
of estimator
mathematical expectation under the imputation mechanism
of estimator
mean square error
sampling fraction
pseudo‐distance derivative for calibration
adjustment factor after calibration called
‐weight
pseudo‐distance for calibration
strata or post‐strata index
confidence interval with confidence level
or
indicates a statistical unit,
or
intersection of the cube and constraint space for the cube method
number of clusters or primary units in the sample of clusters or primary units
number of clusters or primary units in the population
sample size (without replacement)
number of secondary units sampled in primary unit
size of the sample in
if the size is random
population size
sample size in stratum or post‐stratum
number of units in stratum or post‐stratum
number of secondary units in primary unit
population totals when
is a contingency table
set of natural numbers
set of positive natural numbers with zero
probability of selecting sample
probability of sampling unit
for sampling with replacement
or
proportion of units belonging to domain
probability that event
occurs
probability that event
occurs, given
occurred
subspace of constraints for the cube method
response indicator
set of real numbers
set of positive real numbers with zero
set of strictly positive real numbers
Sample or subset of the population,
Sample variance of variable
Sample variance of
in stratum or post‐stratum
covariance between variables
and
in the sample
random sample such that
variance of variable
in the population
covariance between variables
and
in the population
random sample selected in stratum or post‐stratum
population variance of
in the stratum or post‐stratum
vector
is the transpose of vector
finite population of size
stratum or post‐stratum
, where
linearized variable
Horvitz–Thompson estimator of the variance of estimator
Sen–Yates–Grundy estimator of the variance of estimator
variance of estimator
under the survey design
variance of estimator
Looking back, the debates that animated a scientific discipline often appear futile. However, the history of sampling theory is particularly instructive. Sampling theory is one of the specializations of statistics, which itself has a somewhat special position since it is used in almost all scientific disciplines. Statistics is inseparable from its fields of application since it determines how data should be processed. Statistics is the cornerstone of quantitative scientific methods. It is not possible to determine the relevance of the applications of a statistical technique without referring to the scientific methods of the disciplines in which it is applied.
Scientific truth is often presented as the consensus of a scientific community at a specific point in time. The history of a scientific discipline is the story of these consensuses and especially of their changes. Since the work of Thomas Samuel Kuhn (1970), we have considered that science develops around paradigms that are, according to Kuhn (1970, p. 10), “models from which spring particular coherent traditions of scientific research.” These models have two characteristics: “Their achievement was sufficiently unprecedented to attract an enduring group of adherents away from competing modes of scientific activity. Simultaneously, it was sufficiently open‐ended to leave all sorts of problems for the redefined group of practitioners to resolve.” (Kuhn, 1970, p. 10).
Many authors have proposed a chronology of discoveries in survey theory that reflects the major controversies that have marked its development (see among others Hansen & Madow, 1974; Hansen et al., 1983; Owen & Cochran, 1976; Sheynin, 1986; Stigler, 1986). Bellhouse (1988a) interprets this timeline as a story of the great ideas that contributed to the development of survey sampling theory. Statistics is a peculiar science. With mathematics for tools, it allows the methodology of the other disciplines to be finalized. Because of the close link between a method and the multiplicity of its fields of action, statistics is based on a multitude of different ideas from the various disciplines in which it is applied.
The theory of survey sampling plays a preponderant role in the development of statistics. However, the use of sampling techniques has been accepted only very recently. Among the controversies that have animated this theory, we find some of the classical debates of mathematical statistics, such as the role of modeling and a discussion of estimation techniques. Sampling theory was torn between the major currents of statistics and gave rise to multiple approaches: design‐based, model‐based, model‐assisted, predictive, and Bayesian.
Droesbeke et al. (1987) report several attempts, dating back to the Middle Ages, to extrapolate partial data to an entire population. In 1783, in France, Pierre Simon de Laplace (see Laplace, 1847) presented to the Academy of Sciences a method to determine the number of inhabitants from birth registers using a sample of regions. He proposed to calculate, from this sample of regions, the ratio of the number of inhabitants to the number of births and then to multiply it by the total number of births, which could be obtained with precision for the whole population. Laplace even suggested estimating “the error to be feared” by referring to the central limit theorem. In modern terms, he recommended a ratio estimator that uses the total number of births as auxiliary information. Survey methodology and probabilistic tools were thus known before the 19th century. However, at no time during this period was there a consensus about their validity.
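Laplace's procedure, as described above, can be sketched in a few lines. This is only an illustration: the function name and all figures are hypothetical and are not taken from Laplace's data.

```python
def ratio_estimate_population(sample_inhabitants, sample_births, total_births):
    """Ratio estimate of a population count in the spirit of Laplace's method.

    sample_inhabitants, sample_births: counts observed in the sampled regions.
    total_births: births known for the whole country (auxiliary information).
    """
    if len(sample_inhabitants) != len(sample_births):
        raise ValueError("need one inhabitant count and one birth count per region")
    # Ratio of inhabitants to births, computed on the sampled regions only.
    ratio = sum(sample_inhabitants) / sum(sample_births)
    # Multiply by the birth total, which is assumed to be known exactly.
    return ratio * total_births


# Hypothetical figures, chosen only to keep the arithmetic easy to follow.
inhabitants = [28_000, 52_000, 20_000]  # three sampled regions
births = [1_000, 2_000, 1_000]          # births registered in the same regions
estimate = ratio_estimate_population(inhabitants, births, total_births=40_000)
print(estimate)
```

With these made-up numbers the sampled regions show 25 inhabitants per birth, so the estimate is 25 times the national birth total. The quality of the result rests on the ratio being roughly stable from one region to another, which is precisely the assumption Quételet later criticized.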
The development of statistics (etymologically, from German: analysis of data about the state) is inseparable from the emergence of modern states in the 19th century. One of the most outstanding personalities in the official statistics of the 19th century is the Belgian Adolphe Quételet (1796–1874). He knew of Laplace's method and maintained a correspondence with him. According to Stigler (1986, pp. 164–165), Quételet was initially attracted to the idea of using partial data. He even tried to apply Laplace's method to estimate the population of the Netherlands in 1824 (of which Belgium was a part until 1830). However, it seems that he was then won over by a note from Keverberg (1827), which severely criticized the use of partial data in the name of precision and accuracy:
In my opinion, there is only one way to arrive at an exact knowledge of the population and the elements of which it is composed: it is that of an actual and detailed enumeration; that is to say, the formation of nominative states of all the inhabitants, with indication of their age and occupation. Only by this mode of operation can reliable documents be obtained on the actual number of inhabitants of a country, and at the same time on the statistics of the ages of which the population is composed, and the branches of industry in which it finds the means of comfort and prosperity.1
In one of his letters to the Duke of Saxe‐Coburg Gotha, Quételet (1846, p. 293) also advocates for an exhaustive statement:
La Place had proposed to substitute for the census of a large country, such as France, some special censuses in selected departments where this kind of operation might have more chances of success, and then to carefully determine the ratio of the population either at birth or at death. By means of these ratios of the births and deaths of all the other departments, figures which can be ascertained with sufficient accuracy, it is then easy to determine the population of the whole kingdom. This way of operating is very expeditious, but it supposes an invariable ratio passing from one department to another. [...] This indirect method must be avoided as much as possible, although it may be useful in some cases, where the administration would have to proceed quickly; it can also be used with advantage as a means of control.2
It is interesting to examine the argument used by Quételet (1846, p. 293) to justify his position.
To not obtain the faculty of verifying the documents that are collected is to fail in one of the principal rules of science. Statistics is valuable only by its accuracy; without this essential quality, it becomes null, dangerous even, since it leads to error.3
Again, accuracy is considered a basic principle of statistical science. Despite the existence of probabilistic tools and despite various applications of sampling techniques, the use of partial data was perceived as a dubious and unscientific method. Quételet had a great influence on the development of official statistics. He participated in the creation of a statistics section within the British Association for the Advancement of Science in 1833 with Thomas Malthus and Charles Babbage (see Horvàth, 1974). One of its objectives was to harmonize the production of official statistics. He organized the International Congress of Statistics in Brussels in 1853. Quételet was well acquainted with the administrative systems of France, the United Kingdom, the Netherlands, and Belgium. He probably contributed to spreading the idea that the use of partial data is unscientific.
Some figures, such as Malthus and Babbage in Great Britain and Quételet in Belgium, contributed greatly to the development of statistical methodology. On the other hand, the establishment of a statistical apparatus was a necessity in the construction of modern states, and it is probably not a coincidence that these figures came from the two countries most rapidly affected by the industrial revolution. At that time, the statistician's objective was mainly to make enumerations. The main concern was to inventory the resources of nations. In this context, the use of sampling was unanimously rejected as an inexact and fundamentally unscientific procedure. Throughout the 19th century, the discussions of statisticians focused on how to obtain reliable data and on the presentation, interpretation, and possibly modeling (adjustment) of these data.
In 1895, the Norwegian Anders Nicolai Kiær, Director of the Central Statistical Office of Norway, presented to the Congress of the International Statistical Institute (ISI) in Bern a work entitled Observations et expériences concernant des dénombrements représentatifs (Observations and experiments on representative enumerations) on a survey conducted in Norway. Kiær (1896) first selected a sample of cities and municipalities. Then, in each of these municipalities, he selected only some individuals using the first letter of their surnames. He thus applied a two‐stage design, but the choice of the units was not random. Kiær argued for the use of partial data provided that it is produced using a “representative method”. According to this method, the sample must be a reduced‐size representation of the population. Kiær's concept of representativeness is linked to the quota method. His speech was followed by a heated debate, and the proceedings of the Congress of the ISI reflect a long dispute. Let us take a closer look at the arguments from two opponents of Kiær's method (see ISI General Assembly Minutes, 1896).
Georg von Mayr (Prussia). [...] It is especially dangerous to call for this system of representative investigations within an assembly of statisticians. It is understandable that for legislative or administrative purposes such limited enumeration may be useful – but then it must be remembered that it can never replace complete statistical observation. It is all the more necessary to support this point, that there is among us in these days a current among mathematicians who, in many directions, would rather calculate than observe. But we must remain firm and say: no calculation where observation can be done.4
Guillaume Milliet (Switzerland). I believe that it is not right to grant the representative method (which can only be an expedient), through the voice of a congress, an importance that serious statistics will never recognize. No doubt, statistics made with this method, or, as I might call it, statistics, pars pro toto, has given us here and there interesting information; but its principle is so much in contradiction with the demands of the statistical method that as statisticians, we should not grant to imperfect things the same right of bourgeoisie, so to speak, that we accord to the ideal that scientifically we propose to reach.5
The content of these reactions can again be summarized as follows: since statistics is by definition exhaustive, renouncing complete enumeration denies the very mission of statistical science. The discussion did not concern the method proposed by Kiær but rather the definition of statistical science. However, Kiær did not give up and continued to defend the representative method in 1897 at the ISI Congress in St. Petersburg (see Kiær, 1899), in 1901 in Budapest, and in 1903 in Berlin (see Kiær, 1903, 1905). After this date, the issue was no longer mentioned at the ISI Congress. However, Kiær obtained the support of Arthur Bowley (1869–1957), who then played a decisive role in the development of sampling theory. Bowley (1906) presented an empirical verification of the application of the central limit theorem to sampling. He was the true promoter of random sampling techniques, developed stratified designs with proportional allocations, and used the law of total variance. It was necessary to wait for the end of the First World War and the emergence of a new generation of statisticians for the problem to be discussed again within the ISI. On this subject, we cannot help but quote Max Planck's reflection on the appearance of new scientific truths: “a new scientific truth does not triumph by convincing its opponents and making them see the light, but rather because its opponents eventually die, and a new generation grows up that is familiar with it” (quoted by Kuhn, 1970, p. 151).
In 1924, a commission (composed of Arthur Bowley, Corrado Gini, Adolphe Jensen, Lucien March, Verrijn Stuart, and Frantz Zizek) was created to evaluate the relevance of using the representative method. The results of this commission, entitled “Report on the representative method of statistics”, were presented at the 1925 ISI Congress in Rome. The commission accepted the principle of survey sampling as long as the methodology is respected. Thirty years after Kiær's communication, the idea of sampling was officially accepted. The commission laid the foundation for future research. Two methods are clearly distinguished: “random selection” and “purposive selection”. These two methods correspond to two fundamentally different scientific approaches. On the one hand, the validation of random methods is based on the calculation of probabilities, which allows confidence intervals to be built for certain parameters. On the other hand, the validation of the purposive selection method can only be obtained through experimentation by comparing the obtained estimates to census results. Therefore, random methods are validated by a strictly mathematical argument while purposive methods are validated by an experimental approach.
The report of the commission presented to the ISI Congress in 1925 marked the official recognition of the use of survey sampling. Most of the basic problems had already been posed, such as the use of random samples and the calculation of the variance of the estimators for simple and stratified designs. The acceptance of the use of partial data, and especially the recommendation to use random designs, led to a rapid mathematization of this theory. At that time, the calculation of probabilities was already known. In addition, statisticians had already developed a theory for experimental statistics. Everything was in place for the rapid progress of a fertile field of research: the construction of a statistical theory of survey sampling.
Jerzy Neyman (1894–1981) developed a large part of the foundations of the probabilistic theory of sampling for simple, stratified, and cluster designs. He also determined the optimal allocation of a stratified design. The optimal allocation method challenges the basic idea of the quota method, which is “representativeness”. Indeed, under the optimal allocation, the sample should not be a miniature of the population, as some strata must be overrepresented. The article published by Neyman (1934) in the Journal of the Royal Statistical Society is currently considered one of the founding texts of sampling theory. Neyman identified the main fields of research, and his work was to have a very important impact in later years. We now know that Tschuprow (1923) had already obtained some of the results that were attributed to Neyman, but the latter seems to have found them independently of Tschuprow. It is not surprising that such a discovery was made simultaneously in several places. From the moment that the use of random samples was considered a valid method, the theory would arise directly from the application of the theory of probability.
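A small numerical sketch may help to show why an optimally allocated sample is not a miniature of the population. It compares proportional allocation with the allocation usually attributed to Neyman, in which the sample size of a stratum is proportional to its size multiplied by its standard deviation. The stratum sizes and standard deviations are hypothetical, equal survey costs are assumed, and rounding to integers is ignored.

```python
def proportional_allocation(n, stratum_sizes):
    """Sample sizes proportional to the stratum sizes N_h."""
    total = sum(stratum_sizes)
    return [n * N_h / total for N_h in stratum_sizes]


def neyman_allocation(n, stratum_sizes, stratum_sds):
    """Sample sizes proportional to N_h * S_h (equal costs assumed)."""
    products = [N_h * S_h for N_h, S_h in zip(stratum_sizes, stratum_sds)]
    total = sum(products)
    return [n * prod / total for prod in products]


# Hypothetical population: a large homogeneous stratum and a small,
# very heterogeneous one.
sizes = [8000, 2000]  # N_h
sds = [5.0, 30.0]     # S_h, standard deviation of the variable in each stratum
n = 100

proportional = proportional_allocation(n, sizes)
optimal = neyman_allocation(n, sizes, sds)
print("proportional:", proportional)
print("optimal:     ", optimal)
```

In this made-up case the small stratum holds 20% of the population but receives 60% of the optimally allocated sample, because its values vary much more. A practical implementation would also need to round the sizes and to cap each one at the stratum size, which this sketch does not do.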
During the same period, the implementation of the quota method contributed much more to the development of the use of survey sampling methods than theoretical studies. The 1936 US election marked an important turning point in the handling of questionnaire surveys. The facts can be summarized as follows. The major American newspapers used to publish, before the elections, the results of empirical surveys produced from large samples (two million people polled for the Literary Digest) but without any method to select individuals. While most polls predicted Landon's victory, Roosevelt was elected. Surveys conducted by Crossley, Roper, and Gallup on smaller samples but using the quota method gave a correct prediction. This event helped to confirm the validity of the data provided by opinion polls.
This event, which encouraged the growing practice of sample surveys, occurred without reference to the probabilistic theory that had already been developed. The method of Crossley, Roper, and Gallup is indeed not probabilistic but empirical; the validation of its adequacy is therefore experimental and in no way mathematical.
The establishment of a new scientific consensus in 1925 and the identification of major lines of research in the following years led to a very rapid development of survey theory. During the Second World War, research continued in the United States. Important contributions are due to Deming & Stephan (1940), Stephan (1942, 1945, 1948) and Deming (1948, 1950, 1960), especially on the question of adjusting statistical tables to census data. Cornfield (1944) proposed using indicator variables for the presence of units in the sample. Cochran (1939, 1942, 1946, 1961) and Hansen & Hurwitz (1943, 1949) showed the interest of unequal probability sampling with replacement. Madow (1949) proposed unequal probability systematic sampling (see also Hansen et al., 1953a,b). It was quickly established that unequal probability sampling with fixed size without replacement is a complex problem. Narain (1951), Horvitz & Thompson (1952), Sen (1953), and Yates & Grundy (1953) presented several methods with unequal probabilities in articles that are certainly among the most cited in this field. Although devoted to the examination of several designs with unequal probabilities, these texts are mostly cited for the general estimator of the total (the expansion estimator), which they also propose and discuss. The expansion estimator is, in fact, a general unbiased estimator applicable to any sampling design without replacement. However, the proposed variance estimator has a flaw: Yates & Grundy (1953) showed that the variance estimator proposed by Horvitz and Thompson can be negative. They proposed a valid variant for samples of fixed size and gave sufficient conditions for it to be positive. As early as the 1950s, the problem of sampling with unequal probabilities attracted considerable interest, which was reflected in the publication of more than 200 articles. Before turning to rank statistics, Hájek (1981) discussed the problem in detail. A synthesis by Brewer & Hanif (1983) was devoted entirely to this subject, which seems far from exhausted, as evidenced by regular publications.
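The behavior of the expansion estimator and of the two variance estimators mentioned above can be checked on a toy example. The sketch below enumerates every sample of a hypothetical fixed-size design on three units, using the textbook forms of the Horvitz–Thompson and Sen–Yates–Grundy variance estimators. The design probabilities, the values of the variable, and all function names are assumptions made for illustration.

```python
from itertools import combinations

# Hypothetical design: three possible samples of fixed size 2, with their
# selection probabilities p(s). Samples are sorted tuples of unit labels.
design = {(0, 1): 0.5, (0, 2): 0.3, (1, 2): 0.2}
y = [9.0, 6.0, 5.0]  # hypothetical values of the variable of interest


def inclusion_probabilities(design, population_size):
    """First- and second-order inclusion probabilities implied by a design."""
    pi = [0.0] * population_size
    pi2 = {}
    for sample, prob in design.items():
        for k in sample:
            pi[k] += prob
        for pair in combinations(sample, 2):
            pi2[pair] = pi2.get(pair, 0.0) + prob
    return pi, pi2


def expansion_estimator(sample, y, pi):
    """Estimated total: each sampled value is weighted by 1 / pi_k."""
    return sum(y[k] / pi[k] for k in sample)


def ht_variance_estimator(sample, y, pi, pi2):
    """Horvitz-Thompson form of the variance estimator."""
    v = sum((y[k] / pi[k]) ** 2 * (1.0 - pi[k]) for k in sample)
    for k, j in combinations(sample, 2):
        # Factor 2: each unordered pair stands for both (k, j) and (j, k).
        weight = (pi2[(k, j)] - pi[k] * pi[j]) / pi2[(k, j)]
        v += 2.0 * weight * (y[k] / pi[k]) * (y[j] / pi[j])
    return v


def syg_variance_estimator(sample, y, pi, pi2):
    """Sen-Yates-Grundy form, intended for fixed-size designs."""
    return sum(
        (pi[k] * pi[j] - pi2[(k, j)]) / pi2[(k, j)]
        * (y[k] / pi[k] - y[j] / pi[j]) ** 2
        for k, j in combinations(sample, 2)
    )


pi, pi2 = inclusion_probabilities(design, len(y))
expected_total = sum(p * expansion_estimator(s, y, pi) for s, p in design.items())
print("true total:", sum(y), "expectation under the design:", expected_total)
for s in design:
    print(s, ht_variance_estimator(s, y, pi, pi2), syg_variance_estimator(s, y, pi, pi2))
```

Averaged over the three samples with their probabilities, the expansion estimator recovers the true total up to floating-point error, which illustrates its unbiasedness. For the sample made of the last two units, the Horvitz–Thompson variance estimate is negative, whereas the Sen–Yates–Grundy estimate stays nonnegative because, in this design, each joint inclusion probability is smaller than the product of the corresponding first-order probabilities.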
The theory of survey sampling, which makes abundant use of the calculation of probabilities, attracted the attention of university statisticians, who very quickly reviewed all aspects of this theory that have a mathematical interest. A coherent mathematical theory of survey sampling was constructed. The statisticians very quickly came up against a difficult problem: surveys with finite populations. The proposed model postulated the identifiability of the units. This component of the model makes the application of the reduction by sufficiency and of the maximum likelihood method irrelevant. Godambe (1955) showed that there is no optimal linear estimator. This result is one of the many pieces of evidence showing the impossibility of defining optimal estimation procedures for general sampling designs in finite populations. Next, Basu (1969) and Basu & Ghosh (1967) demonstrated that the reduction by sufficiency is limited to the suppression of the information concerning the multiplicity of the units and that this method is therefore not operational. Several approaches were examined, including one from decision theory. New properties, such as hyperadmissibility (see Hanurav, 1968
