Sampling and Estimation from Finite Populations - Yves Tillé - E-Book

Description

A much-needed reference on survey sampling and its applications that presents the latest advances in the field

Seeking to show that sampling theory is a living discipline with a very broad scope, this book examines the modern development of the theory of survey sampling and the foundations of survey sampling. It offers readers a critical approach to the subject and discusses putting theory into practice. It also explores the treatment of non-sampling errors featuring a range of topics from the problems of coverage to the treatment of non-response. In addition, the book includes real examples, applications, and a large set of exercises with solutions.

Sampling and Estimation from Finite Populations begins with a look at the history of survey sampling. It then offers chapters on: population, sample, and estimation; simple and systematic designs; stratification; sampling with unequal probabilities; balanced sampling; cluster and two-stage sampling; and other topics on sampling, such as spatial sampling, coordination in repeated surveys, and multiple survey frames. The book also includes sections on: post-stratification and calibration on marginal totals; calibration estimation; estimation of complex parameters; variance estimation by linearization; and much more.

  • Provides an up-to-date review of the theory of sampling
  • Discusses the foundation of inference in survey sampling, in particular, the model-based and design-based frameworks
  • Reviews the problems of putting the theory into practice
  • Also deals with the treatment of non-sampling errors

Sampling and Estimation from Finite Populations is an excellent book for methodologists and researchers in survey agencies and advanced undergraduate and graduate students in social science, statistics, and survey courses.


Number of pages: 573

Year of publication: 2020

Table of Contents

Cover

List of Figures

List of Tables

List of Algorithms

Preface

Preface to the First French Edition

Table of Notations

Chapter 1: A History of Ideas in Survey Sampling Theory

1.1 Introduction

1.2 Enumerative Statistics During the 19th Century

1.3 Controversy on the Use of Partial Data

1.4 Development of a Survey Sampling Theory

1.5 The US Elections of 1936

1.6 The Statistical Theory of Survey Sampling

1.7 Modeling the Population

1.8 Attempt at a Synthesis

1.9 Auxiliary Information

1.10 Recent References and Development

Notes

Chapter 2: Population, Sample, and Estimation

2.1 Population

2.2 Sample

2.3 Inclusion Probabilities

2.4 Parameter Estimation

2.5 Estimation of a Total

2.6 Estimation of a Mean

2.7 Variance of the Total Estimator

2.8 Sampling with Replacement

Chapter 3: Simple and Systematic Designs

3.1 Simple Random Sampling without Replacement with Fixed Sample Size

3.2 Bernoulli Sampling

3.3 Simple Random Sampling with Replacement

3.4 Comparison of the Designs with and Without Replacement

3.5 Sampling with Replacement and Retaining Distinct Units

3.6 Inverse Sampling with Replacement

3.7 Estimation of Other Functions of Interest

3.8 Determination of the Sample Size

3.9 Implementation of Simple Random Sampling Designs

3.10 Systematic Sampling with Equal Probabilities

3.11 Entropy for Simple and Systematic Designs

Chapter 4: Stratification

4.1 Population and Strata

4.2 Sample, Inclusion Probabilities, and Estimation

4.3 Simple Stratified Designs

4.4 Stratified Design with Proportional Allocation

4.5 Optimal Stratified Design for the Total

4.6 Notes About Optimality in Stratification

4.7 Power Allocation

4.8 Optimality and Cost

4.9 Smallest Sample Size

4.10 Construction of the Strata

4.11 Stratification Under Many Objectives

Chapter 5: Sampling with Unequal Probabilities

5.1 Auxiliary Variables and Inclusion Probabilities

5.2 Calculation of the Inclusion Probabilities

5.3 General Remarks

5.4 Sampling with Replacement with Unequal Inclusion Probabilities

5.5 Nonvalidity of the Generalization of the Successive Drawing without Replacement

5.6 Systematic Sampling with Unequal Probabilities

5.7 Deville's Systematic Sampling

5.8 Poisson Sampling

5.9 Maximum Entropy Design

5.10 Rao–Sampford Rejective Procedure

5.11 Order Sampling

5.12 Splitting Method

5.13 Choice of Method

5.14 Variance Approximation

5.15 Variance Estimation

Exercises

Chapter 6: Balanced Sampling

6.1 Introduction

6.2 Balanced Sampling: Definition

6.3 Balanced Sampling and Linear Programming

6.4 Balanced Sampling by Systematic Sampling

6.5 Method of Deville, Grosbras, and Roth

6.6 Cube Method

6.7 Variance Approximation

6.8 Variance Estimation

6.9 Special Cases of Balanced Sampling

6.10 Practical Aspects of Balanced Sampling

Exercise

Chapter 7: Cluster and Two‐stage Sampling

7.1 Cluster Sampling

7.2 Two‐stage Sampling

7.3 Multi‐stage Designs

7.4 Selecting Primary Units with Replacement

7.5 Two‐phase Designs

7.6 Intersection of Two Independent Samples

Exercises

Chapter 8: Other Topics on Sampling

8.1 Spatial Sampling

8.2 Coordination in Repeated Surveys

8.3 Multiple Survey Frames

8.4 Indirect Sampling

8.5 Capture–Recapture

Chapter 9: Estimation with a Quantitative Auxiliary Variable

9.1 The Problem

9.2 Ratio Estimator

9.3 The Difference Estimator

9.4 Estimation by Regression

9.5 The Optimal Regression Estimator

9.6 Discussion of the Three Estimation Methods

Chapter 10: Post‐Stratification and Calibration on Marginal Totals

10.1 Introduction

10.2 Post‐Stratification

10.3 The Post‐Stratified Estimator in Simple Designs

10.4 Estimation by Calibration on Marginal Totals

10.5 Example

Chapter 11: Multiple Regression Estimation

11.1 Introduction

11.2 Multiple Regression Estimator

11.3 Alternative Forms of the Estimator

11.4 Calibration of the Multiple Regression Estimator

11.5 Variance of the Multiple Regression Estimator

11.6 Choice of Weights

11.7 Special Cases

11.8 Extension to Regression Estimation

Chapter 12: Calibration Estimation

12.1 Calibrated Methods

12.2 Distances and Calibration Functions

12.3 Solving Calibration Equations

12.4 Calibrating on Households and Individuals

12.5 Generalized Calibration

12.6 Calibration in Practice

12.7 An Example

Chapter 13: Model‐Based Approach

13.1 Model Approach

13.2 The Model

13.3 Homoscedastic Constant Model

13.4 Heteroscedastic Model 1 Without Intercept

13.5 Heteroscedastic Model 2 Without Intercept

13.6 Univariate Homoscedastic Linear Model

13.7 Stratified Population

13.8 Simplified Versions of the Optimal Estimator

13.9 Completed Heteroscedasticity Model

13.10 Discussion

13.11 An Approach that is Both Model‐ and Design‐based

Chapter 14: Estimation of Complex Parameters

14.1 Estimation of a Function of Totals

14.2 Variance Estimation

14.3 Covariance Estimation

14.4 Implicit Function Estimation

14.5 Cumulative Distribution Function and Quantiles

14.6 Cumulative Income, Lorenz Curve, and Quintile Share Ratio

14.7 Gini Index

14.8 An Example

Chapter 15: Variance Estimation by Linearization

15.1 Introduction

15.2 Orders of Magnitude in Probability

15.3 Asymptotic Hypotheses

15.4 Linearization of Functions of Interest

15.5 Linearization by Steps

15.6 Linearization of an Implicit Function of Interest

15.7 Influence Function Approach

15.8 Binder's Cookbook Approach

15.9 Demnati and Rao Approach

15.10 Linearization by the Sample Indicator Variables

15.11 Discussion on Variance Estimation

Chapter 16: Treatment of Nonresponse

16.1 Sources of Error

16.2 Coverage Errors

16.3 Different Types of Nonresponse

16.4 Nonresponse Modeling

16.5 Treating Nonresponse by Reweighting

16.6 Imputation

16.7 Variance Estimation with Nonresponse

Chapter 17: Summary Solutions to the Exercises

Bibliography

Author Index

Subject Index

End User License Agreement

List of Tables

Chapter 3

Table 3.1 Simple designs: summary table.

Table 3.2 Example of sample sizes required for different population sizes and di...

Chapter 4

Table 4.1 Application of optimal allocation: the sample size is larger than the ...

Table 4.2 Second application of optimal allocation in strata 1 and 2.

Chapter 5

Table 5.1 Minimum support design.

Table 5.2 Decomposition into simple random sampling designs.

Table 5.3 Decomposition into simple random sampling designs.

Table 5.4 Properties of the methods.

Chapter 6

Table 6.1 Population of 20 students with variables, constant, gender (1, male, 2...

Table 6.2 Totals and expansion estimators for balancing variables.

Table 6.3 Variances of the expansion estimators of the means under simple random...

Chapter 7

Table 7.1 Block number, number of households, and total household income.

Chapter 8

Table 8.1 Means of spatial balancing measures based on Voronoï polygons and mod...

Table 8.2 Selection intervals for negative coordination and selection indicators...

Table 8.3 Selection indicators for each selection interval for unit.

Chapter 9

Table 9.1 Estimation methods: summary table.

Chapter 10

Table 10.1 Population partition.

Table 10.2 Totals with respect to two variables.

Table 10.3 Calibration, starting table.

Table 10.4 Salaries in Euros.

Table 10.5 Estimated totals using simple random sampling without replacement.

Table 10.6 Known margins using a census.

Table 10.7 Iteration 1: row total adjustment.

Table 10.8 Iteration 2: column total adjustment.

Table 10.9 Iteration 3: row total adjustment.

Table 10.10 Iteration 4: column total adjustment.

Table 1

Table 2

Chapter 12

Table 12.1 Pseudo‐distances for calibration.

Table 12.2 Calibration functions and their derivatives.

Table 12.3 Minima, maxima, means, and standard deviations of the weights for eac...

Chapter 14

Table 14.1 Sample, variable of interest, weights, cumulative weights and re...

Table 14.2 Table of fictitious incomes, weights, cumulative weights, relati...

Table 14.3 Totals necessary to estimate the Gini index.

Chapter 17

Table 1

Table 2

List of Illustrations

Chapter 1

Figure 1.1 Auxiliary information can be used before or after data collection t...

Chapter 4

Figure 4.1 Stratified design: the samples are selected independently from one ...

Chapter 5

Figure 5.1 Systematic sampling: example with inclusion probabilities

Figure 5.2 Method of Deville.

Figure 5.3 Splitting into two parts.

Figure 5.4 Splitting in parts.

Figure 5.5 Minimum support design.

Figure 5.6 Decomposition into simple random sampling designs.

Figure 5.7 Pivotal method applied on vector.

Chapter 6

Figure 6.1 Possible samples in a population of size 3.

Figure 6.2 Fixed size constraint: the three samples of size 2 are connected ...

Figure 6.3 None of the vertices of is a vertex of the cube.

Figure 6.4 Two vertices of are vertices of the cube, but the third is not.

Figure 6.5 Flight phase in a population of size 3 with a constraint of fixed...

Chapter 7

Figure 7.1 Cluster sampling: the population is divided into clusters. Clusters...

Figure 7.2 Two‐stage sampling design: we randomly select primary units in whic...

Figure 7.3 Two‐phase design: a sample is selected in sample.

Figure 7.4 The sample is the intersection of samples and.

Chapter 8

Figure 8.1 In a grid, a systematic sample and a stratified sample with one u...

Figure 8.2 Recursive quadrant function used for the GRTS method with three sub...

Figure 8.3 Original function with four random permutations.

Figure 8.4 Samples of 64 points in a grid of points using simple designs, GR...

Figure 8.5 Sample of 64 points in a grid of points and Voronoï polygons. App...

Figure 8.6 Interval corresponding to the first wave (extract from Qualité, 20...

Figure 8.7 Positive coordination when (extract from Qualité, 2009).

Figure 8.8 Positive coordination when (extract from Qualité, 2009).

Figure 8.9 Negative coordination when (extract from Qualité, 2009).

Figure 8.10 Negative coordination when (extract from Qualité, 2009).

Figure 8.11 Coordination of a third sample (extract from Qualité, 2009).

Figure 8.12 Two survey frames and cover the population. In each one, we se...

Figure 8.13 In this example, the points represent contaminated trees. During t...

Figure 8.14 Example of indirect sampling. In population the units surrounded...

Chapter 9

Figure 9.1 Ratio estimator: observations aligned along a line passing through ...

Figure 9.2 Difference estimator: observations aligned along a line of slope eq...

Chapter 10

Figure 10.1 Post‐stratification: the population is divided in post‐strata, but...

Chapter 12

Figure 12.1 Linear method: pseudo‐distance with and.

Figure 12.2 Linear method: function with and.

Figure 12.3 Linear method: function with.

Figure 12.4 Raking ratio: pseudo‐distance with and.

Figure 12.5 Raking ratio: function with and.

Figure 12.6 Raking ratio: function with.

Figure 12.7 Reverse information: pseudo‐distance with and.

Figure 12.8 Reverse information: function with and.

Figure 12.9 Reverse information: function with.

Figure 12.10 Truncated linear method: pseudo‐distance with, and.

Figure 12.11 Truncated linear method: function with, and.

Figure 12.12 Truncated linear method: calibration function with, and.

Figure 12.13 Pseudo‐distances with and.

Figure 12.14 Calibration functions with and.

Figure 12.15 Logistic method: pseudo‐distance with, and.

Figure 12.16 Logistic method: function with, and.

Figure 12.17 Logistic method: calibration function with and.

Figure 12.18 Deville calibration: pseudo‐distance with.

Figure 12.19 Deville calibration: calibration function with.

Figure 12.20 Pseudo‐distances of Roy and Vanheuverzwyn with, and.

Figure 12.21 Calibration function of Roy and Vanheuverzwyn with and.

Figure 12.22 Variation of the g‐weights for different calibration methods as a...

Chapter 13

Figure 13.1 Total taxable income in millions of euros with respect to the numb...

Chapter 14

Figure 14.1 Step cumulative distribution function with corresponding quartil...

Figure 14.2 Cumulative distribution function obtained by interpolation of po...

Figure 14.3 Cumulative distribution function obtained by interpolating the c...

Figure 14.4 Lorenz curve and the surface associated with the Gini index.

Chapter 16

Figure 16.1 Two‐phase approach for nonresponse. The set of respondents is a...

Figure 16.2 The reversed approach for nonresponse. The sample of nonrespondent...

WILEY SERIES IN PROBABILITY AND STATISTICS

Established by WALTER A. SHEWHART and SAMUEL S. WILKS

Editors: Noel Cressie, Garrett Fitzmaurice, David Balding, Geert Molenberghs, Geof Givens, Harvey Goldstein, David Scott, Adrian Smith, Ruey Tsay.

Sampling and Estimation from Finite Populations

Yves Tillé

Université de Neuchâtel, Switzerland

Most of this book has been translated from French by Ilya Hekimi

Original French title: Théorie des sondages : Échantillonnage et estimation en populations finies

Copyright

This edition first published 2020

© 2020 John Wiley & Sons Ltd

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.

The right of Yves Tillé to be identified as the author of this work has been asserted in accordance with law.

Registered Offices

John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA

John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK

Editorial Office

9600 Garsington Road, Oxford, OX4 2DQ, UK

For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.

Wiley also publishes its books in a variety of electronic formats and by print‐on‐demand. Some content that appears in standard print versions of this book may not be available in other formats.

Limit of Liability/Disclaimer of Warranty

While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

Library of Congress Cataloging‐in‐Publication Data

Names: Tillé, Yves, author. | Hekimi, Ilya, translator.

Title: Sampling and estimation from finite populations / Yves Tillé ; most of this book has been translated from French by Ilya Hekimi.

Other titles: Théorie des sondages. English

Description: Hoboken, NJ : Wiley, [2020] | Series: Wiley series in probability and statistics applied. Probability and statistics section | Translation of: Théorie des sondages : échantillonnage et estimation en populations finies. | Includes bibliographical references and index.

Identifiers: LCCN 2019048451 | ISBN 9780470682050 (hardback) | ISBN 9781119071266 (adobe pdf) | ISBN 9781119071273 (epub)

Subjects: LCSH: Sampling (Statistics) | Public opinion polls – Statistical methods. | Estimation theory.

Classification: LCC QA276.6 .T62813 2020 | DDC 519.5/2 – dc23

LC record available at https://lccn.loc.gov/2019048451

Cover Design: Wiley

Cover Image: © gremlin/Getty Images

List of Figures

Figure 1.1 Auxiliary information can be used before or after data collection to improve estimations

Figure 4.1 Stratified design: the samples are selected independently from one stratum to another

Figure 5.1 Systematic sampling: example with inclusion probabilities and

Figure 5.2 Method of Deville

Figure 5.3 Splitting into two parts

Figure 5.4 Splitting in parts

Figure 5.5 Minimum support design

Figure 5.6 Decomposition into simple random sampling designs

Figure 5.7 Pivotal method applied on vector

Figure 6.1 Possible samples in a population of size 3

Figure 6.2 Fixed size constraint: the three samples of size 2 are connected by an affine subspace

Figure 6.3 None of the vertices of is a vertex of the cube

Figure 6.4 Two vertices of are vertices of the cube, but the third is not

Figure 6.5 Flight phase in a population of size 3 with a constraint of fixed size 2

Figure 7.1 Cluster sampling: the population is divided into clusters. Clusters are randomly selected. All units from the selected clusters are included in the sample

Figure 7.2 Two‐stage sampling design: we randomly select primary units in which we select a sample of secondary units

Figure 7.3 Two‐phase design: a sample is selected in sample

Figure 7.4 The sample is the intersection of samples and

Figure 8.1 In a grid, a systematic sample and a stratified sample with one unit per stratum are selected

Figure 8.2 Recursive quadrant function used for the GRTS method with three subdivisions

Figure 8.3 Original function with four random permutations

Figure 8.4 Samples of 64 points in a grid of points using simple designs, GRTS, the local pivotal method, and the local cube method

Figure 8.5 Sample of 64 points in a grid of points and Voronoï polygons. Applications to simple, systematic, and stratified designs, the local pivotal method, and the local cube method

Figure 8.6 Interval corresponding to the first wave (extract from Qualité, 2009)

Figure 8.7 Positive coordination when (extract from Qualité, 2009)

Figure 8.8 Positive coordination when (extract from Qualité, 2009)

Figure 8.9 Negative coordination when (extract from Qualité, 2009)

Figure 8.10 Negative coordination when (extract from Qualité, 2009)

Figure 8.11 Coordination of a third sample (extract from Qualité, 2009)

Figure 8.12 Two survey frames and cover the population. In each one, we select a sample

Figure 8.13 In this example, the points represent contaminated trees. During the initial sampling, the shaded squares are selected. The borders in bold surround the final selected zones

Figure 8.14 Example of indirect sampling. In population the units surrounded by a circle are selected. Two clusters (and) of population each contain at least one unit that has a link with a unit selected in population. Units of surrounded by a circle are selected at the end

Figure 9.1 Ratio estimator: observations aligned along a line passing through the origin

Figure 9.2 Difference estimator: observations aligned along a line of slope equal to 1

Figure 10.1 Post‐stratification: the population is divided in post‐strata, but the sample is selected without taking post‐strata into account

Figure 12.1 Linear method: pseudo‐distance with and

Figure 12.2 Linear method: function with and

Figure 12.3 Linear method: function with

Figure 12.4 Raking ratio: pseudo‐distance with and

Figure 12.5 Raking ratio: function with and

Figure 12.6 Raking ratio: function with

Figure 12.7 Reverse information: pseudo‐distance with and

Figure 12.8 Reverse information: function with and

Figure 12.9 Reverse information: function with

Figure 12.10 Truncated linear method: pseudo‐distance with, and

Figure 12.11 Truncated linear method: function with, and

Figure 12.12 Truncated linear method: calibration function with, and

Figure 12.13 Pseudo‐distances with and

Figure 12.14 Calibration functions with and

Figure 12.15 Logistic method: pseudo‐distance with, and

Figure 12.16 Logistic method: function with

List of Tables

Table 3.1 Simple designs: summary table

Table 3.2 Example of sample sizes required for different population sizes and different values of for and

Table 4.1 Application of optimal allocation: the sample size is larger than the population size in the third stratum

Table 4.2 Second application of optimal allocation in strata 1 and 2

Table 5.1 Minimum support design

Table 5.2 Decomposition into simple random sampling designs

Table 5.3 Decomposition into simple random sampling designs

Table 5.4 Properties of the methods

Table 6.1 Population of 20 students with variables, constant, gender (1, male, 2 female), age, and a mark of 20 in a statistics exam

Table 6.2 Totals and expansion estimators for balancing variables

Table 6.3 Variances of the expansion estimators of the means under simple random sampling and balanced sampling

Table 7.1 Block number, number of households, and total household income

Table 8.1 Means of spatial balancing measures based on Voronoï polygons and modified Moran indices for six sampling designs on 1000 simulations

Table 8.2 Selection intervals for negative coordination and selection indicators in the case where the PRNs falls within the interval. On the left, the case where (Figure 8.9). On the right, the case where (Figure 8.10)

Table 8.3 Selection indicators for each selection interval for unit

Table 9.1 Estimation methods: summary table

Table 10.1 Population partition

Table 10.2 Totals with respect to two variables

Table 10.3 Calibration, starting table

Table 10.4 Salaries in Euros

Table 10.5 Estimated totals using simple random sampling without replacement

Table 10.6 Known margins using a census

Table 10.7 Iteration 1: row total adjustment

Table 10.8 Iteration 2: column total adjustment

Table 10.9 Iteration 3: row total adjustment

Table 10.10 Iteration 4: column total adjustment

Table 12.1 Pseudo‐distances for calibration

Table 12.2 Calibration functions and their derivatives

Table 12.3 Minima, maxima, means, and standard deviations of the weights for each calibration method

Table 14.1 Sample, variable of interest, weights, cumulative weights and relative cumulative weights

Table 14.2 Table of fictitious incomes, weights, cumulative weights

List of Algorithms

Algorithm 1 Bernoulli sampling

Algorithm 2 Selection–rejection method

Algorithm 3 Reservoir method

Algorithm 4 Sequential algorithm for simple random sampling with replacement

Algorithm 5 Systematic sampling with equal probabilities

Algorithm 6 Systematic sampling with unequal probabilities

Algorithm 7 Algorithm for Poisson sampling

Algorithm 8 Sampford procedure

Algorithm 9 General algorithm for the cube method

Algorithm 10 Positive coordination using the Kish and Scott method

Algorithm 11 Negative coordination with the Rivière method

Algorithm 12 Negative coordination with EDS method

Preface

The first version of this book was published in 2001, the year I left the École Nationale de la Statistique et de l'Analyse de l'Information (ENSAI) in Rennes (France) to teach at the University of Neuchâtel in Switzerland. That version grew out of several sets of course materials on sampling theory that I had taught in Rennes. At the ENSAI, the collaboration with Jean‐Claude Deville was particularly stimulating.

The editing of this new edition was laborious and was done in fits and starts. I thank all those who reviewed the drafts and provided me with their comments. Special thanks to Monique Graf for her meticulous re‐reading of some chapters.

The almost 20 years I spent in Neuchâtel were dotted with multiple adventures. I am particularly grateful to Philippe Eichenberger and Jean‐Pierre Renfer, who successively headed the Statistical Methods Section of the Federal Statistical Office. Their trust and professionalism helped to establish a fruitful exchange between the Institute of Statistics of the University of Neuchâtel and the Swiss Federal Statistical Office.

I am also very grateful to the PhD students that I have had the pleasure of mentoring so far. Each thesis is an adventure that teaches both supervisor and doctoral student. Thank you to Alina Matei, Lionel Qualité, Desislava Nedyalkova, Erika Antal, Matti Langel, Toky Randrianasolo, Eric Graf, Caren Hasler, Matthieu Wilhelm, Mihaela Guinand‐Anastasiade, and Audrey‐Anne Vallée, who trusted me and whom I had the pleasure to supervise for a few years.

Yves Tillé

Neuchâtel, 2018

Preface to the First French Edition

This book contains teaching material that I started to develop in 1994. Every chapter has served as support for a course, a training session, a workshop, or a seminar. By grouping this material, I hope to present a coherent and modern set of results on sampling, estimation, and the treatment of nonresponse, in other words, on all the statistical operations of a standard sample survey.

In producing this book, my goal is not to provide a comprehensive overview of survey sampling theory, but rather to show that sampling theory is a living discipline with a very broad scope. Where proofs have been omitted, as in several chapters, I have always been careful to refer the reader to the bibliographical references. The abundance of very recent publications attests to the fertility of the 1990s in this area. All the developments presented in this book are based on the so‐called “design‐based” approach. In theory, there is another point of view based on population modeling. I intentionally left this approach aside, not out of disinterest, but to propose an approach that I deem consistent and ethically acceptable to the public statistician.

I would like to thank all the people who, in one way or another, helped me to produce this book: Laurence Broze, who entrusted me with my first sampling course at the University Lille 3, Carl Särndal, who encouraged me on several occasions, and Yves Berger, with whom I shared an office at the Université Libre de Bruxelles for several years and who gave me many relevant remarks. My thanks also go to Antonio Canedo, who taught me to use LaTeX, to Lydia Zaïd, who corrected the manuscript several times, and to Jean Dumais for his many constructive comments.

I wrote most of this book at the École Nationale de la Statistique et de l'Analyse de l'Information. The warm atmosphere that prevailed in the statistics department gave me a lot of support. I especially thank my colleagues Fabienne Gaude, Camelia Goga, and Sylvie Rousseau, who meticulously reread the manuscript, and Germaine Razé, who did the work of reproduction of the proofs. Several exercises are due to Pascal Ardilly, Jean‐Claude Deville, and Laurent Wilms. I want to thank them for allowing me to reproduce them. My gratitude goes particularly to Jean‐Claude Deville for our fruitful collaboration within the Laboratory of Survey Statistics of the Center for Research in Economics and Statistics. The chapters on the splitting method and balanced sampling also reflect the research that we have done together.

Yves Tillé

Bruz, 2001

Table of Notations

cardinal (number of elements in a set)

much less than

complement of

in

function

is the derivative of

factorial:

number of ways to choose

units from

units

interval

is approximately equal to

is proportional to

follows a specific probability distribution (for a random value)

equals 1 if

is true and 0 otherwise

number of times unit 

is in the sample

vector of

population regression coefficients

vector of population regression coefficients

regression coefficients for model

vector of regression coefficients of model

vector of estimated regression coefficients

vector of estimated regression coefficients of the model

cube whose vertices are samples

covariance between random variables

and 

estimated covariance between random variables

and 

population coefficient of variation

estimated coefficient of variation

expansion estimator survey weights

mathematical expectation under the sampling design

of estimator 

mathematical expectation under the model

of estimator 

mathematical expectation under the nonresponse mechanism

of estimator 

mathematical expectation under the imputation mechanism

of estimator 

mean square error

sampling fraction

pseudo‐distance derivative for calibration

adjustment factor after calibration called

‐weight

pseudo‐distance for calibration

strata or post‐strata index

confidence interval with confidence level

or

indicates a statistical unit,

or

intersection of the cube and constraint space for the cube method

number of clusters or primary units in the sample of clusters or primary units

number of clusters or primary units in the population

sample size (without replacement)

number of secondary units sampled in primary unit

size of the sample in

if the size is random

population size

sample size in stratum or post‐stratum

number of units in stratum or post‐stratum

number of secondary units in primary unit

population totals when

is a contingency table

set of natural numbers

set of positive natural numbers with zero

probability of selecting sample

probability of sampling unit 

for sampling with replacement

or

proportion of units belonging to domain

probability that event

occurs

probability that event

occurs, given

occurred

subspace of constraints for the cube method

response indicator

set of real numbers

set of positive real numbers with zero

set of strictly positive real numbers

sample or subset of the population,

sample variance of variable 

sample variance of

in stratum or post‐stratum

covariance between variables

and

in the sample

random sample such that

variance of variable

in the population

covariance between variables

and

in the population

random sample selected in stratum or post‐stratum

population variance of

in the stratum or post‐stratum

vector

is the transpose of vector

finite population of size

stratum or post‐stratum

, where

linearized variable

Horvitz–Thompson estimator of the variance of estimator 

Sen–Yates–Grundy estimator of the variance of estimator 

variance of estimator 

under the survey design

variance of estimator 

Chapter 1 A History of Ideas in Survey Sampling Theory

1.1 Introduction

Looking back, the debates that animated a scientific discipline often appear futile. However, the history of sampling theory is particularly instructive. Sampling theory is one of the specializations of statistics, which itself holds a somewhat special position, since it is used in almost all scientific disciplines. Statistics is inseparable from its fields of application since it determines how data should be processed. Statistics is the cornerstone of quantitative scientific methods. It is not possible to determine the relevance of the applications of a statistical technique without referring to the scientific methods of the disciplines in which it is applied.

Scientific truth is often presented as the consensus of a scientific community at a specific point in time. The history of a scientific discipline is the story of these consensuses and especially of their changes. Since the work of Thomas Samuel Kuhn (1970), we have considered that science develops around paradigms that are, according to Kuhn (1970, p. 10), “models from which spring particular coherent traditions of scientific research.” These models have two characteristics: “Their achievement was sufficiently unprecedented to attract an enduring group of adherents away from competing modes of scientific activity. Simultaneously, it was sufficiently open‐ended to leave all sorts of problems for the redefined group of practitioners to resolve.” (Kuhn, 1970, p. 10).

Many authors have proposed a chronology of discoveries in survey theory that reflects the major controversies that have marked its development (see among others Hansen & Madow, 1974; Hansen et al., 1983; Owen & Cochran, 1976; Sheynin, 1986; Stigler, 1986). Bellhouse (1988a) interprets this timeline as a story of the great ideas that contributed to the development of survey sampling theory. Statistics is a peculiar science. With mathematics for tools, it allows the methodology of the other disciplines to be finalized. Because of the close correlation between a method and the multiplicity of its fields of action, statistics is based on a multitude of different ideas from the various disciplines in which it is applied.

The theory of survey sampling plays a preponderant role in the development of statistics. However, the use of sampling techniques has been accepted only very recently. Among the controversies that have animated this theory, we find some of the classical debates of mathematical statistics, such as the role of modeling and a discussion of estimation techniques. Sampling theory was torn between the major currents of statistics and gave rise to multiple approaches: design‐based, model‐based, model‐assisted, predictive, and Bayesian.

1.2 Enumerative Statistics During the 19th Century

Droesbeke et al. (1987) describe several attempts, dating back to the Middle Ages, to extrapolate partial data to an entire population. In 1783, in France, Pierre Simon de Laplace (see Laplace, 1847) presented to the Academy of Sciences a method to determine the number of inhabitants from birth registers using a sample of regions. He proposed to calculate, from this sample of regions, the ratio of the number of inhabitants to the number of births and then to multiply it by the total number of births, which could be obtained with precision for the whole population. Laplace even suggested estimating “the error to be feared” by referring to the central limit theorem. In modern terms, his method is a ratio estimator that uses the total number of births as auxiliary information. Both survey methodology and probabilistic tools were thus known before the 19th century. However, at no point during this period was there a consensus about their validity.
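Laplace's calculation can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above: the function name and all figures are hypothetical and are not Laplace's data.

```python
def ratio_estimate(sample_inhabitants, sample_births, total_births):
    """Estimate a population size by the ratio method described above.

    sample_inhabitants, sample_births: counts observed in the sampled regions.
    total_births: number of births known for the whole country.
    """
    if len(sample_inhabitants) != len(sample_births):
        raise ValueError("one birth count is needed per sampled region")
    # Ratio of inhabitants to births, computed on the sampled regions only.
    ratio = sum(sample_inhabitants) / sum(sample_births)
    # Multiply the ratio by the birth total known for the whole population.
    return ratio * total_births


# Hypothetical figures for three sampled regions (not historical data).
inhabitants = [52_000, 31_200, 72_800]
births = [2_000, 1_200, 2_800]
total_births = 1_000_000

estimate = ratio_estimate(inhabitants, births, total_births)
print(f"Estimated population: {estimate:,.0f}")
```

The estimate is only as good as the assumption that the ratio of inhabitants to births is roughly the same inside and outside the sampled regions.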

The development of statistics (etymologically, from German: analysis of data about the state) is inseparable from the emergence of modern states in the 19th century. One of the most outstanding personalities in the official statistics of the 19th century is the Belgian Adolphe Quételet (1796–1874). He knew of Laplace's method and maintained a correspondence with him. According to Stigler (1986, pp. 164–165), Quételet was initially attracted to the idea of using partial data. He even tried to apply Laplace's method to estimate the population of the Netherlands in 1824 (which Belgium was a part of until 1830). However, it seems that he then rallied to a note from Keverberg (1827) which severely criticized the use of partial data in the name of precision and accuracy:

In my opinion, there is only one way to arrive at an exact knowledge of the population and the elements of which it is composed: it is that of an actual and detailed enumeration; that is to say, the formation of nominative states of all the inhabitants, with indication of their age and occupation. Only by this mode of operation can reliable documents be obtained on the actual number of inhabitants of a country, and at the same time on the statistics of the ages of which the population is composed, and the branches of industry in which it finds the means of comfort and prosperity.1

In one of his letters to the Duke of Saxe‐Coburg Gotha, Quételet (1846, p. 293) also advocates for an exhaustive statement:

La Place had proposed to substitute for the census of a large country, such as France, some special censuses in selected departments where this kind of operation might have more chances of success, and then to carefully determine the ratio of the population either at birth or at death. By means of these ratios of the births and deaths of all the other departments, figures which can be ascertained with sufficient accuracy, it is then easy to determine the population of the whole kingdom. This way of operating is very expeditious, but it supposes an invariable ratio passing from one department to another. […] This indirect method must be avoided as much as possible, although it may be useful in some cases, where the administration would have to proceed quickly; it can also be used with advantage as a means of control.2

It is interesting to examine the argument used by Quételet (1846, p. 293) to justify his position.

To not obtain the faculty of verifying the documents that are collected is to fail in one of the principal rules of science. Statistics is valuable only by its accuracy; without this essential quality, it becomes null, dangerous even, since it leads to error.3

Again, accuracy is considered a basic principle of statistical science. Despite the existence of probabilistic tools and despite various applications of sampling techniques, the use of partial data was perceived as a dubious and unscientific method. Quételet had a great influence on the development of official statistics. He participated in the creation of a section for statistics within the British Association for the Advancement of Science in 1833 with Thomas Malthus and Charles Babbage (see Horvàth, 1974). One of its objectives was to harmonize the production of official statistics. He organized the International Congress of Statistics in Brussels in 1853. Quételet was well acquainted with the administrative systems of France, the United Kingdom, the Netherlands, and Belgium. He probably contributed to the idea that the use of partial data is unscientific.

Some personalities, such as Malthus and Babbage in Great Britain, and Quételet in Belgium, contributed greatly to the development of statistical methodology. On the other hand, the establishment of a statistical apparatus was a necessity in the construction of modern states, and it is probably not a coincidence that these personalities come from the two countries most rapidly affected by the industrial revolution. At that time, the statistician's objective was mainly to make enumerations. The main concern was to inventory the resources of nations. In this context, the use of sampling was unanimously rejected as an inexact and fundamentally unscientific procedure. Throughout the 19th century, the discussions of statisticians focused on how to obtain reliable data and on the presentation, interpretation, and possibly modeling (adjustment) of these data.

1.3 Controversy on the Use of Partial Data

In 1895, the Norwegian Anders Nicolai Kiær, Director of the Central Statistical Office of Norway, presented to the Congress of the International Statistical Institute (ISI) in Bern a work entitled Observations et expériences concernant des dénombrements représentatifs (Observations and experiments on representative enumeration) for a survey conducted in Norway. Kiær (1896) first selected a sample of cities and municipalities. Then, in each of these municipalities, he selected only some individuals using the first letter of their surnames. He applied a two‐stage design, but the choice of the units was not random. Kiær argued for the use of partial data if it is produced using a “representative method”. According to this method, the sample must be a reduced‐size representation of the population. Kiær's concept of representativeness is linked to the quota method. His speech was followed by a heated debate, and the proceedings of the Congress of the ISI reflect a long dispute. Let us take a closer look at the arguments from two opponents of Kiær's method (see ISI General Assembly Minutes, 1896).

Georg von Mayr (Prussia). […] It is especially dangerous to call for this system of representative investigations within an assembly of statisticians. It is understandable that for legislative or administrative purposes such limited enumeration may be useful – but then it must be remembered that it can never replace complete statistical observation. It is all the more necessary to support this point, that there is among us in these days a current among mathematicians who, in many directions, would rather calculate than observe. But we must remain firm and say: no calculation where observation can be done.4

Guillaume Milliet (Switzerland). I believe that it is not right to give a congressional voice to the representative method (which can only be an expedient) an importance that serious statistics will never recognize. No doubt, statistics made with this method, or, as I might call it, statistics, pars pro toto, has given us here and there interesting information; but its principle is so much in contradiction with the demands of the statistical method that as statisticians, we should not grant to imperfect things the same right of bourgeoisie, so to speak, that we accord to the ideal that scientifically we propose to reach.5

The content of these reactions can again be summarized as follows: since statistics is by definition exhaustive, renouncing complete enumeration denies the very mission of statistical science. The discussion does not concern the method proposed by Kiær, but the definition of statistical science. However, Kiær did not let go, and continued to defend the representative method in 1897 at the congress of the ISI at St. Petersburg (see Kiær, 1899), in 1901 in Budapest, and in 1903 in Berlin (see Kiær, 1903, 1905). After this date, the issue was no longer mentioned at the ISI Congress. However, Kiær obtained the support of Arthur Bowley (1869–1957), who then played a decisive role in the development of sampling theory. Bowley (1906) presented an empirical verification of the application of the central limit theorem to sampling. He was the true promoter of random sampling techniques, developed stratified designs with proportional allocations, and used the law of total variance. It was necessary to wait for the end of the First World War and the emergence of a new generation of statisticians for the problem to be discussed again within the ISI. On this subject, we cannot help but quote Max Planck's reflection on the appearance of new scientific truths: “a new scientific truth does not triumph by convincing its opponents and making them see the light, but rather because its opponents eventually die, and a new generation grows up that is familiar with it” (quoted by Kuhn, 1970, p. 151).

In 1924, a commission (composed of Arthur Bowley, Corrado Gini, Adolphe Jensen, Lucien March, Verrijn Stuart, and Frantz Zizek) was created to evaluate the relevance of using the representative method. The results of this commission, entitled “Report on the representative method of statistics”, were presented at the 1925 ISI Congress in Rome. The commission accepted the principle of survey sampling as long as the methodology is respected. Thirty years after Kiær's communication, the idea of sampling was officially accepted. The commission laid the foundation for future research. Two methods are clearly distinguished: “random selection” and “purposive selection”. These two methods correspond to two fundamentally different scientific approaches. On the one hand, the validation of random methods is based on the calculation of probabilities, which allows confidence intervals to be built for certain parameters. On the other hand, the validation of the purposive selection method can only be obtained through experimentation by comparing the obtained estimations to census results. Therefore, random methods are validated by a strictly mathematical argument while purposive methods are validated by an experimental approach.
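To illustrate how probability calculus yields a confidence interval under random selection, here is a minimal Python sketch for the mean under simple random sampling without replacement. The function name, the normal quantile 1.96 (an approximate 95% level), and the toy population are assumptions made for this illustration and do not come from the commission's report.

```python
import math
import random


def srs_mean_confidence_interval(population, n, z=1.96, seed=None):
    """Approximate confidence interval for the population mean.

    Draws a simple random sample of size n without replacement and uses
    the usual variance estimator with finite population correction.
    """
    N = len(population)
    if not 2 <= n <= N:
        raise ValueError("n must satisfy 2 <= n <= N")
    rng = random.Random(seed)
    sample = rng.sample(population, n)  # selection without replacement
    mean = sum(sample) / n
    # Sample variance with divisor n - 1.
    s2 = sum((value - mean) ** 2 for value in sample) / (n - 1)
    # Estimated variance of the sample mean: (1 - n/N) * s2 / n.
    standard_error = math.sqrt((1 - n / N) * s2 / n)
    return mean - z * standard_error, mean + z * standard_error


# Hypothetical population of 1000 values.
toy_population = [float(value % 37) for value in range(1000)]
lower, upper = srs_mean_confidence_interval(toy_population, n=100, seed=42)
print(f"Approximate 95% interval for the mean: [{lower:.2f}, {upper:.2f}]")
```

No comparable calculation is available for purposive selection, because the selection probabilities are unknown. This is why that method can only be checked against census results.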

1.4 Development of a Survey Sampling Theory

The report of the commission presented to the ISI Congress in 1925 marked the official recognition of the use of survey sampling. Most of the basic problems had already been posed, such as the use of random samples and the calculation of the variance of the estimators for simple and stratified designs. The acceptance of the use of partial data, and especially the recommendation to use random designs, led to a rapid mathematization of this theory. At that time, the calculation of probabilities was already known. In addition, statisticians had already developed a theory for experimental statistics. Everything was in place for the rapid progress of a fertile field of research: the construction of a statistical theory of survey sampling.

Jerzy Neyman (1894–1981) developed a large part of the foundations of the probabilistic theory of sampling for simple, stratified, and cluster designs. He also determined the optimal allocation of a stratified design. The optimal allocation method challenges the basic idea of the quota method, which is “representativeness”. Indeed, under the optimal allocation, the sample should not be a miniature of the population, as some strata must be overrepresented. The article published by Neyman (1934) in the Journal of the Royal Statistical Society is currently considered one of the founding texts of sampling theory. Neyman identified the main fields of research and his work was to have a very important impact in later years. We now know that Tschuprow (1923) had already obtained some of the results that were attributed to Neyman, but the latter seems to have found them independently of Tschuprow. It is not surprising that such a discovery was made simultaneously in several places. From the moment that the use of random samples was considered a valid method, the theory would arise directly from the application of the theory of probability.
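The contrast between a “miniature” sample and the optimal allocation can be sketched in Python. The sketch uses the standard form of the optimal (Neyman) allocation, in which the sample size in a stratum is proportional to the stratum size multiplied by the standard deviation of the variable in that stratum. The function names and the two‐stratum figures are hypothetical, and rounding to integers and the constraint that a stratum sample cannot exceed the stratum size are ignored.

```python
def proportional_allocation(n, stratum_sizes):
    """Allocate n units so that the sample is a miniature of the population."""
    N = sum(stratum_sizes)
    return [n * N_h / N for N_h in stratum_sizes]


def neyman_allocation(n, stratum_sizes, stratum_sds):
    """Allocate n units proportionally to N_h * S_h (optimal allocation)."""
    products = [N_h * S_h for N_h, S_h in zip(stratum_sizes, stratum_sds)]
    total = sum(products)
    return [n * product / total for product in products]


# Hypothetical population: a large homogeneous stratum and a small,
# highly dispersed one.
sizes = [8000, 2000]
sds = [1.0, 4.0]

print(proportional_allocation(100, sizes))
print(neyman_allocation(100, sizes, sds))
```

With these figures, the small but dispersed stratum receives a much larger share of the sample under the optimal allocation than under proportional allocation, so the sample is deliberately not a miniature of the population.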

1.5 The US Elections of 1936

During the same period, the implementation of the quota method contributed much more to the development of the use of survey sampling methods than theoretical studies. The 1936 US election marked an important turning point in the handling of questionnaire surveys. The facts can be summarized as follows. The major American newspapers used to publish, before the elections, the results of empirical surveys produced from large samples (two million people polled for the Literary Digest) but without any method to select individuals. While most polls predicted Landon's victory, Roosevelt was elected. Surveys conducted by Crossley, Roper, and Gallup on smaller samples but using the quota method gave a correct prediction. This event helped to confirm the validity of the data provided by opinion polls.

This event, which favored the increase in the practice of sample surveys, occurred without reference to the probabilistic theory that had already been developed. The method of Crossley, Roper, and Gallup is indeed not probabilistic but empirical; the validation of its adequacy is therefore experimental and in no way mathematical.

1.6 The Statistical Theory of Survey Sampling

The establishment of a new scientific consensus in 1925 and the identification of major lines of research in the following years led to a very rapid development of survey theory. During the Second World War, research continued in the United States. Important contributions are due to Deming & Stephan (1940), Stephan (1942, 1945, 1948) and Deming (1948, 1950, 1960), especially on the question of adjusting statistical tables to census data. Cornfield (1944) proposed using indicator variables for the presence of units in the sample. Cochran (1939, 1942, 1946, 1961) and Hansen & Hurwitz (1943, 1949) showed the interest of unequal probability sampling with replacement. Madow (1949) proposed unequal probability systematic sampling (see also Hansen et al., 1953a,b). It was quickly established that unequal probability sampling with fixed size without replacement is a complex problem. Narain (1951), Horvitz & Thompson (1952), Sen (1953), and Yates & Grundy (1953) presented several methods with unequal probabilities in two articles that are certainly among the most cited in this field. Although devoted to the examination of several designs with unequal probabilities, these texts are mainly cited for the general estimator (expansion estimator) of the total, which they also propose and discuss. The expansion estimator is, in fact, an unbiased general estimator applicable to any sampling design without replacement. However, the proposed estimator of variance has a flaw. Yates & Grundy (1953) showed that the variance estimator proposed by Horvitz and Thompson can be negative. They proposed a valid variant when the sample is of fixed size and gave sufficient conditions for it to be positive. As early as the 1950s, the problem of sampling with unequal probabilities attracted considerable interest, which was reflected in the publication of more than 200 articles. Before turning to rank statistics, Hájek (1981) discussed the problem in detail. A book of synthesis by Brewer & Hanif (1983) was devoted entirely to this subject, which seems far from exhausted, as evidenced by regular publications.
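The behavior of the expansion estimator and of the two variance estimators can be checked on a very small example. The Python sketch below enumerates a hypothetical fixed‐size design on three units, with sample probabilities and values of the variable invented for the illustration. It computes the expansion estimator, the variance estimator of Horvitz and Thompson, and the variant of Sen, Yates, and Grundy, using their usual textbook forms.

```python
from itertools import combinations

# Hypothetical design: every sample of size 2 from units {0, 1, 2},
# with its selection probability (the probabilities sum to 1).
design = {(0, 1): 0.5, (0, 2): 0.3, (1, 2): 0.2}
y = [8.0, 7.0, 6.0]  # hypothetical values of the variable of interest


def inclusion_probabilities(design, N):
    """Return first-order (list) and second-order (dict) inclusion probabilities."""
    pi = [0.0] * N
    pi2 = {}
    for sample, p in design.items():
        for k in sample:
            pi[k] += p
        for k, l in combinations(sample, 2):
            pi2[(k, l)] = pi2.get((k, l), 0.0) + p
            pi2[(l, k)] = pi2[(k, l)]
    return pi, pi2


def ht_total(sample, y, pi):
    """Expansion estimator of the total: sum of y_k / pi_k over the sample."""
    return sum(y[k] / pi[k] for k in sample)


def ht_variance(sample, y, pi, pi2):
    """Horvitz-Thompson variance estimator (can be negative)."""
    total = 0.0
    for k in sample:
        for l in sample:
            joint = pi[k] if k == l else pi2[(k, l)]
            total += (joint - pi[k] * pi[l]) / joint * (y[k] / pi[k]) * (y[l] / pi[l])
    return total


def syg_variance(sample, y, pi, pi2):
    """Sen-Yates-Grundy variance estimator (fixed sample size only)."""
    total = 0.0
    for k in sample:
        for l in sample:
            if k != l:
                weight = (pi[k] * pi[l] - pi2[(k, l)]) / pi2[(k, l)]
                total += weight * (y[k] / pi[k] - y[l] / pi[l]) ** 2
    return total / 2


pi, pi2 = inclusion_probabilities(design, len(y))
for sample, p in design.items():
    print(sample, p, ht_total(sample, y, pi),
          ht_variance(sample, y, pi, pi2), syg_variance(sample, y, pi, pi2))
```

In this design, the Horvitz–Thompson variance estimate is negative for the sample (1, 2). The Sen–Yates–Grundy estimate is nonnegative for every sample, because each joint inclusion probability is smaller than the product of the corresponding first‐order probabilities. Averaged over the design, both variance estimators equal the true variance of the expansion estimator.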

The theory of survey sampling, which makes abundant use of the calculation of probabilities, attracted the attention of university statisticians, who very quickly reviewed all aspects of this theory that have a mathematical interest. A coherent mathematical theory of survey sampling was constructed. The statisticians very quickly came up against a difficult problem: surveys with finite populations. The proposed model postulated the identifiability of the units. This component of the model makes the application of reduction by sufficiency and of the maximum likelihood method irrelevant. Godambe (1955) states that there is no optimal linear estimator. This result is one of the many pieces of evidence showing the impossibility of defining optimal estimation procedures for general sampling designs in finite populations. Next, Basu (1969) and Basu & Ghosh (1967) demonstrated that the reduction by sufficiency is limited to the suppression of the information concerning the multiplicity of the units, and that this method is therefore not operational. Several approaches were examined, including one from decision theory. New properties, such as hyperadmissibility (see Hanurav, 1968