Implementation of Large‐Scale Education Assessments

Description

Presents a comprehensive treatment of issues related to the inception, design, implementation and reporting of large-scale education assessments.

In recent years many countries have decided to become involved in international educational assessments to allow them to ascertain the strengths and weaknesses of their student populations. Assessments such as the OECD's Programme for International Student Assessment (PISA), the IEA's Trends in International Mathematics and Science Study (TIMSS) and Progress in International Reading Literacy Study (PIRLS) have provided opportunities for comparison between students of different countries on a common international scale.

This book is designed to give researchers, policy makers and practitioners a well-grounded knowledge in the design, implementation, analysis and reporting of international assessments. Readers will be able to gain a more detailed insight into the scientific principles employed in such studies allowing them to make better use of the results. The book will also give readers an understanding of the resources needed to undertake and improve the design of educational assessments in their own countries and regions.

Implementation of Large-Scale Education Assessments:

  • Brings together the editors’ extensive experience in creating, designing, implementing, analysing and reporting results on a wide range of assessments.
  • Emphasizes methods for implementing international studies of student achievement and obtaining high‐quality data from cognitive tests and contextual questionnaires.
  • Discusses the methods of sampling, weighting, and variance estimation that are commonly encountered in international large-scale assessments.
  • Provides direction and stimulus for improving global educational assessment and student learning.
  • Is written by experts in the field, with an international perspective.

Survey researchers, market researchers and practitioners engaged in comparative projects will all benefit from the unparalleled breadth of knowledge and experience in large-scale educational assessments gathered in this one volume.


Table of Contents

Cover

Title Page

Notes on Contributors

Foreword

Acknowledgements

Abbreviations

1 Implementation of Large‐Scale Education Assessments

1.1 Introduction

1.2 International, Regional and National Assessment Programmes in Education

1.3 Purposes of LSAs in Education

1.4 Key Areas for the Implementation of LSAs in Education

1.5 Summary and Outlook

Appendix 1.A

References

2 Test Design and Objectives

2.1 Introduction

2.2 PISA

2.3 TIMSS

2.4 PIRLS and Pre‐PIRLS

2.5 ASER

2.6 SACMEQ

2.7 Conclusion

References

3 Test Development

3.1 Introduction

3.2 Developing an Assessment Framework: A Collaborative and Iterative Process

3.3 Generating and Collecting Test Material

3.4 Refinement of Test Material

3.5 Beyond Professional Test Development: External Qualitative Review of Test Material

3.6 Introducing Innovation

3.7 Conclusion

References

4 Design, Development and Implementation of Contextual Questionnaires in Large‐Scale Assessments

4.1 Introduction

4.2 The Role of Questionnaires in LSAs

4.3 Steps in Questionnaire Design and Implementation

4.4 Questions and Response Options in LSAs

4.5 Alternative Item Formats

4.6 Computer‐Based/Online Questionnaire Instruments

4.7 Conclusion and Future Perspectives

Acknowledgements

References

5 Sample Design, Weighting, and Calculation of Sampling Variance

5.1 Introduction

5.2 Target Population

5.3 Sample Design

5.4 Weighting

5.5 Sampling Adjudication Standards

5.6 Estimation of Sampling Variance

References

6 Translation and Cultural Appropriateness of Survey Material in Large‐Scale Assessments

6.1 Introduction

6.2 Overview of Translation/Adaptation and Verification Approaches Used in Current Multilingual Comparative Surveys

6.3 Step‐by‐Step Breakdown of a Sophisticated Localisation Design

6.4 Measuring the Benefits of a Good Localisation Design

6.5 Checklist of Requirements for a Robust Localisation Design

References

7 Quality Assurance

7.1 Introduction

7.2 The Development and Agreement of Standardised Implementation Procedures

7.3 The Production of Manuals which Reflect Agreed Procedures

7.4 The Recruitment and Training of Personnel in Administration and Organisation: Especially the Test Administrator and the School Coordinator

7.5 The Quality Monitoring Processes: Recruiting and Training Quality Monitors to Visit National Centres and Schools

7.6 Other Quality Monitoring Procedures

7.7 Conclusion

Reference

8 Processing Responses to Open‐Ended Survey Questions

8.1 Introduction

8.2 The Fundamental Objective

8.3 Contextual Factors: Survey Respondents and Items

8.4 Administration of the Coding Process

8.5 Quality Assurance and Control: Ensuring Consistent and Reliable Coding

8.6 Conclusion

References

9 Computer‐Based Delivery of Cognitive Assessment and Questionnaires

9.1 Introduction

9.2 Why Implement CBAs?

9.3 Implementation of International Comparative CBAs

9.4 Assessment Architecture

9.5 Item Design Issues

9.6 State‐of‐the‐Art and Emerging Technologies

9.7 Summary and Conclusion

References

10 Data Management Procedures

10.1 Introduction

10.2 Historical Review: From Data Entry and Data Cleaning to Integration into the Entire Study Process

10.3 The Life Cycle of a LSA Study

10.4 Standards for Data Management

10.5 The Data Management Process

10.6 Outlook

References

11 Test Implementation in the Field: The Case of PASEC

11.1 Introduction

11.2 Test Implementation

11.3 Data Entry

11.4 Data Cleaning

11.5 Data Analysis

11.6 Governance and Financial Management of the Assessments

Acknowledgments

References

12 Test Implementation in the Field: The Experience of Chile in International Large‐Scale Assessments

12.1 Introduction

12.2 International Studies in Chile

References

13 Why Large‐Scale Assessments Use Scaling and Item Response Theory

13.1 Introduction

13.2 Item Response Theory

13.3 Test Development and Construct Validation

13.4 Rotated Test Booklets

13.5 Comparability of Scales Across Settings and Over Time

13.6 Construction of Performance Indicators

13.7 Conclusion

References

14 Describing Learning Growth

14.1 Background

14.2 Terminology: The Elements of a Learning Metric

14.3 Example of a Learning Metric

14.4 Issues for Consideration

14.5 PISA Described Proficiency Scales

14.6 Defining and Interpreting Proficiency Levels

14.7 Use of Learning Metrics

Acknowledgement

References

15 Scaling of Questionnaire Data in International Large‐Scale Assessments

15.1 Introduction

15.2 Methodologies for Construct Validation and Scaling

15.3 Classical Item Analysis

15.4 Exploratory Factor Analysis

15.5 Confirmatory Factor Analysis

15.6 IRT Scaling

15.7 Described IRT Questionnaire Scales

15.8 Deriving Composite Measures of Socio‐economic Status

15.9 Conclusion and Future Perspectives

References

16 Database Production for Large‐Scale Educational Assessments

16.1 Introduction

16.2 Data Collection

16.3 Cleaning, Recoding and Scaling

16.4 Database Construction

16.5 Assistance

References

17 Dissemination and Reporting

17.1 Introduction

17.2 Frameworks

17.3 Sample Items

17.4 Questionnaires

17.5 Video

17.6 Regional and International Reports

17.7 National Reports

17.8 Thematic Reports

17.9 Summary Reports

17.10 Analytical Services and Support

17.11 Policy Papers

17.12 Web‐Based Interactive Display

17.13 Capacity‐Building Workshops

17.14 Manuals

17.15 Technical Reports

17.16 Conclusion

References

Index

End User License Agreement

List of Tables

Chapter 02

Table 2.1 Cluster rotation design used to form standard test booklets for PISA 2012

Table 2.2 TIMSS 2015 booklet design for fourth and eighth grades

Table 2.3 TIMSS 2015 framework characteristics for fourth and eighth grade mathematics

Table 2.4 Blueprint for the PIRLS and pre‐PIRLS assessments

Table 2.5 ASER reading and arithmetic assessment task descriptions

Table 2.6 The test blueprint for the SACMEQ II pupil mathematics test

Chapter 03

Table 3.1 Blueprint for numeracy content areas in PIAAC

Chapter 04

Table 4.1 Questionnaire content

Table 4.2 Final design of rotated student context questionnaires in the PISA 2012 MS

Table 4.3 Unforeseen sources of error and example reactive probes

Table 4.4 Number of domains on top bookshelf by year level

Chapter 08

Table 8.1 Examples of an initially lenient result and a neutral result

Table 8.2 Examples of flagged cases in a country

Table 8.3 Hypothetical examples of percentages of flagged cases for one booklet

Chapter 09

Table 9.1 Rotated cluster design, PISA 2012 CBA

Chapter 11

Table 11.1 Reading objectives, subdomains and materials in the PASEC 2014 assessment

Table 11.2 Cognitive processes in the PASEC 2014 assessment

Table 11.3 Mathematical content measured by the PASEC 2014 assessment

Table 11.4 Assessed cognitive processes

Table 11.5 Allocation of item blocks across test booklets in PASEC 2014

Table 11.6 Language test organisation

Table 11.7 Mathematics test organisation

Chapter 12

Table 12.1 International studies of educational assessment in Chile (1998–2016)

Chapter 13

Table 13.1 Example of item statistics for a multiple‐choice test item with four response options

Table 13.2 Example of item statistics for a partial credit item

Table 13.3 TIMSS 2015 student achievement booklet design

Chapter 15

Table 15.1 Socio‐economic indicators in major international studies

Chapter 16

Table 16.1 International Standard Classification of Education (ISCED) categories (selected)

List of Illustrations

Chapter 01

Figure 1.1 Simplified model of the policy cycle

Figure 1.2 Key areas of a robust assessment programme

Chapter 02

Figure 2.1 Mapping of the PIRLS comprehension processes to the PISA reading aspects

Figure 2.2 ASER sample reading assessment instrument (English)

Chapter 04

Figure 4.1 Number of books at home item – TIMSS and PIRLS Year 4

Figure 4.2 Examples of forced‐choice items in PISA 2012 (ST48Q02 and ST48Q05)

Figure 4.3 Examples of situational judgement type items in PISA 2012 (ST104Q01, ST104Q04, ST104Q05, ST104Q06)

Figure 4.4 Example of ‘over‐claiming technique’ type question in PISA 2012 (ST62Q01‐ST62Q19)

Figure 4.5 The ‘who is close to me’ item from the Australian Child Wellbeing Project

Figure 4.6 The ‘bookshelf item’ from the Australian Child Wellbeing Project

Chapter 06

Figure 6.1 Images of a book for left‐to‐right and right‐to‐left languages

Chapter 08

Figure 8.1 Two examples of survey questions with their response coding instructions

Figure 8.2 A PISA item that aims to encapsulate all possible responses to an open‐ended question

Chapter 09

Figure 9.1 Cluster administration order

Figure 9.2 A progress indicator from a CBA

Figure 9.3 A timing bar from a CBA

Chapter 10

Figure 10.1 Flow of data

Chapter 13

Figure 13.1 Example of item characteristic curve for test item 88

Figure 13.2 Example of item characteristic curve for test item 107

Figure 13.3 Illustrative problematic test item

Figure 13.4 Item difficulty and achievement distributions

Figure 13.5 Comparison of the item pool information function for mathematics and the calibration sample proficiency distribution for a country participating in PISA 2009

Figure 13.6 DIF plot

Figure 13.7 Example item with gender DIF

Figure 13.8 Item S414Q04 PISA field trial 2006

Figure 13.9 An example of the item‐by‐country interaction report (item S414Q04, PISA 2006 field trial)

Figure 13.10 Trend oriented MTEG design

Figure 13.11 An example of the distribution of actual individual scale score for 5000 students

Figure 13.12 An example of the distribution of estimates of individual scale score for 5000 students

Chapter 14

Figure 14.1 Example learning metric for mathematics

Figure 14.2 Sample item allocated to the ‘number and algebra’ and ‘apply’ categories

Figure 14.3 Sample item allocated to the ‘number and algebra’ and ‘translate’ categories

Figure 14.4 Sample ACER ConQuest item map

Figure 14.5 What it might mean for an individual to ‘be at a level’ on a learning metric

Figure 14.6 Calculating the RP‐value used to define PISA proficiency levels (for dichotomous items)

Chapter 15

Figure 15.1 Category characteristic curves, for example, item with four categories

Figure 15.2 Expected item scores, for example, item with four categories

Figure 15.3 Accumulated category probabilities, for example, item with four categories

Figure 15.4 Example of ICCS 2009 item map to describe questionnaire items


Wiley Series in Survey Methodology

The Wiley Series in Survey Methodology covers topics of current research and practical interests in survey methodology and sampling. While the emphasis is on application, theoretical discussion is encouraged when it supports a broader understanding of the subject matter.

The authors are leading academics and researchers in methodology and sampling. The readership includes professionals in, and students of, the fields of applied statistics, biostatistics, public policy, and government and corporate enterprises.

ALWIN ‐ Margins of Error: A Study of Reliability in Survey Measurement

BETHLEHEM ‐ Applied Survey Methods: A Statistical Perspective

BIEMER, LEEUW, ECKMAN, EDWARDS, KREUTER, LYBERG, TUCKER, WEST (EDITORS) ‐ Total Survey Error in Practice: Improving Quality in the Era of Big Data

BIEMER ‐ Latent Class Analysis of Survey Error

BIEMER and LYBERG ‐ Introduction to Survey Quality

CALLEGARO, BAKER, BETHLEHEM, GORITZ, KROSNICK, LAVRAKAS (EDITORS) ‐ Online Panel Research: A Data Quality Perspective

CHAMBERS and SKINNER (EDITORS) ‐ Analysis of Survey Data

CONRAD and SCHOBER (EDITORS) ‐ Envisioning the Survey Interview of the Future

COUPER, BAKER, BETHLEHEM, CLARK, MARTIN, NICHOLLS, O'REILLY (EDITORS) ‐ Computer Assisted Survey Information Collection

D'ORAZIO, DI ZIO, SCANU ‐ Statistical Matching: Theory and Practice

FULLER ‐ Sampling Statistics

GROVES, DILLMAN, ELTINGE, LITTLE (EDITORS) ‐ Survey Nonresponse

GROVES, BIEMER, LYBERG, MASSEY, NICHOLLS, WAKSBERG (EDITORS) ‐ Telephone Survey Methodology

GROVES AND COUPER ‐ Nonresponse in Household Interview Surveys

GROVES ‐ Survey Errors and Survey Costs

GROVES ‐ The Collected Works of Robert M. Groves, 6 Book Set

GROVES, FOWLER, COUPER, LEPKOWSKI, SINGER, TOURANGEAU ‐ Survey Methodology, 2nd Edition

HARKNESS, VAN DE VIJVER, MOHLER ‐ Cross‐Cultural Survey Methods

HARKNESS, BRAUN, EDWARDS, JOHNSON, LYBERG, MOHLER, PENNELL, SMITH (EDITORS) ‐ Survey Methods in Multicultural, Multinational, and Multiregional Contexts

HEDAYAT, SINHA ‐ Design and Inference in Finite Population Sampling

HUNDEPOOL, DOMINGO‐FERRER, FRANCONI, GIESSING, NORDHOLT, SPICER, DE WOLF ‐ Statistical Disclosure Control

KALTON, HEERINGA (EDITORS) ‐ Leslie Kish: Selected Papers

KORN, GRAUBARD ‐ Analysis of Health Surveys

KREUTER (EDITOR) ‐ Improving Surveys with Paradata: Analytic Uses of Process Information

LEPKOWSKI, TUCKER, BRICK, DE LEEUW, JAPEC, LAVRAKAS, LINK, SANGSTER ‐ Advances in Telephone Survey Methodology

LEVY, LEMESHOW ‐ Sampling of Populations: Methods and Applications, 4th Edition

LIETZ, CRESSWELL, RUST, ADAMS (EDITORS) ‐ Implementation of Large‐Scale Education Assessments

LUMLEY ‐ Complex Surveys: A Guide to Analysis Using R

LYNN (EDITOR) ‐ Methodology of Longitudinal Surveys

MADANS, MILLER, MAITLAND, WILLIS ‐ Question Evaluation Methods: Contributing to the Science of Data Quality

MAYNARD, HOUTKOOP‐STEENSTRA, SCHAEFFER, VAN DER ZOUWEN (EDITORS) ‐ Standardization and Tacit Knowledge: Interaction and Practice in the Survey Interview

MILLER, CHEPP, WILLSON, PADILLA (EDITORS) ‐ Cognitive Interviewing Methodology

PRATESI (EDITOR) ‐ Analysis of Poverty Data by Small Area Estimation

PRESSER, ROTHGEB, COUPER, LESSLER, E. MARTIN, J. MARTIN, SINGER ‐ Methods for Testing and Evaluating Survey Questionnaires

RAO, MOLINA ‐ Small Area Estimation, 2nd Edition

SÄRNDAL, LUNDSTRÖM ‐ Estimation in Surveys with Nonresponse

SARIS, GALLHOFER ‐ Design, Evaluation, and Analysis of Questionnaires for Survey Research, 2nd Edition

SIRKEN, HERRMANN, SCHECHTER, SCHWARZ, TANUR, TOURANGEAU (EDITORS) ‐ Cognition and Survey Research

SNIJKERS, HARALDSEN, JONES, WILLIMACK ‐ Designing and Conducting Business Surveys

STOOP, BILLIET, KOCH, FITZGERALD ‐ Improving Survey Response: Lessons Learned from the European Social Survey

VALLIANT, DORFMAN, ROYALL ‐ Finite Population Sampling and Inference: A Prediction Approach

WALLGREN, A., WALLGREN B. ‐ Register‐based Statistics: Statistical Methods for Administrative Data, 2nd Edition

WALLGREN, A., WALLGREN B. ‐ Register‐based Statistics: Administrative Data for Statistical Purposes

Implementation of Large‐Scale Education Assessments

 

 

Edited by

 

Petra Lietz, John C. Cresswell, Keith F. Rust and Raymond J. Adams

 

 

 

 

 

 

 

This edition first published 2017
© 2017 by John Wiley and Sons Ltd

Registered Office
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom

For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com.

The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. It is sold on the understanding that the publisher is not engaged in rendering professional services and neither the publisher nor the author shall be liable for damages arising herefrom. If professional advice or other expert assistance is required, the services of a competent professional should be sought.

Library of Congress Cataloging‐in‐Publication Data

Names: Lietz, Petra, editor. | Cresswell, John, 1950– editor. | Rust, Keith, editor. | Adams, Raymond J., 1959– editor.
Title: Implementation of large‐scale education assessments / editors, Petra Lietz, John C. Cresswell, Keith F. Rust, Raymond J. Adams.
Other titles: Wiley Series in Survey Methodology
Description: Chichester, UK ; Hoboken, NJ : John Wiley & Sons, 2017. | Series: Wiley Series in Survey Methodology | Includes bibliographical references and index.
Identifiers: LCCN 2016035918 (print) | LCCN 2016050522 (ebook) | ISBN 9781118336090 (cloth) | ISBN 9781118762479 (pdf) | ISBN 9781118762493 (epub)
Subjects: LCSH: Educational tests and measurements.
Classification: LCC LB3051 .L473 2016 (print) | LCC LB3051 (ebook) | DDC 371.26–dc23
LC record available at https://lccn.loc.gov/2016035918

A catalogue record for this book is available from the British Library.

Cover design by Wiley
Cover image: ZaZa Studio/Shutterstock; (Map) yukipon/Getty Images

Notes on Contributors

Raymond J. Adams, Australian Council for Educational Research

Alla Berezner, Australian Council for Educational Research

Falk Brese, International Association for the Evaluation of Educational Achievement (IEA) Data Processing and Research Center

Mark Cockle, International Association for the Evaluation of Educational Achievement (IEA) Data Processing and Research Center

John C. Cresswell, Australian Council for Educational Research

Steve Dept, cApStAn Linguistic Quality Control

Andrea Ferrari, cApStAn Linguistic Quality Control

Eveline Gebhardt, Australian Council for Educational Research

Béatrice Halleux, HallStat

Oswald Koussihouèdé, Programme for the Analysis of Education Systems of CONFEMEN (PASEC)

Sheila Krawchuk, Westat

Ema Lagos Campos, Agencia de Calidad de la Educación

Petra Lietz, Australian Council for Educational Research

Antoine Marivin, Programme for the Analysis of Education Systems of CONFEMEN (PASEC)

Juliette Mendelovits, Australian Council for Educational Research

Christian Monseur, Université de Liège

Dara Ramalingam, Australian Council for Educational Research

Keith F. Rust, Westat

Wolfram Schulz, Australian Council for Educational Research

Vanessa Sy, Programme for the Analysis of Education Systems of CONFEMEN (PASEC)

Ross Turner, Australian Council for Educational Research

Maurice Walker, Australian Council for Educational Research

Foreword

The Science of Large‐Scale Assessment

Governments throughout the world recognise that the quality of schooling provided to children and young people will be an important determinant of a country’s social and economic success in the twenty‐first century. In every country, a central question is what governments and school systems can do to ensure that all students are equipped with the knowledge, skills and attributes necessary for effective participation in the future workforce and for productive future citizenship.

To answer this question, countries require quality information, including information on current levels of student achievement, the performances of subgroups of the student population − especially socio‐economically disadvantaged students, Indigenous students and new arrivals − and recent trends in achievement levels within a country. Also important is an understanding of how well a nation’s schools are performing in comparison with schools elsewhere in the world. Are some school systems producing better outcomes overall? Have some systems achieved superior improvements in achievement levels over time? Are some more effective in ameliorating the influence of socio‐economic disadvantage on educational outcomes? Are some doing a better job of developing the kinds of skills and attributes required for life and work in the twenty‐first century?

Some 60 years ago, a small group of educational researchers working in a number of countries conceived the idea of collecting data on the impact of countries’ educational policies and practices on student outcomes. With naturally occurring differences in countries’ school curricula, teaching practices, ways of organising and resourcing schools and methods of preparing and developing teachers and school leaders, they saw the possibility of studying the effectiveness of different educational policies and practices in ways that would be difficult or impossible in any one country. The cross‐national studies that these researchers initiated in the 1960s marked the beginning of large‐scale international achievement surveys.

In the decades since the 1960s, international comparative studies of student achievement and the factors underpinning differences in educational performance in different countries have evolved from a research interest of a handful of academics and educational research organisations to a major policy tool of governments across the globe. International surveys now include the OECD’s PISA implemented in 75 countries in 2015 and the IEA’s Trends in International Mathematics and Science Study implemented in 59 countries in 2015. Other international studies are conducted in areas such as primary school reading, civics and citizenship and ICT literacy. Complementing these international surveys are three significant regional assessment programmes, with a fourth under development. Governments use the results of these large‐scale international studies, often alongside results from their own national surveys, to monitor progress in improving quality and equity in school education and to evaluate the effectiveness of system‐wide policies and programmes.

The decades since the 1960s have also seen significant advances in methodologies for the planning, implementation and use of international surveys – in effect, the evolution of a science of large‐scale assessment.

This book maps an evolving methodology for large‐scale educational assessments. Advances in this field have drawn on advances in specific disciplines and areas of practice, including psychometrics, test development, statistics, sampling theory and the use of new technologies of assessment. The book identifies and discusses 13 elements of a complex, integrated science of large‐scale assessment – a methodology that begins with a consideration of the policy context and purpose of a study – proceeds through various steps in the design and implementation of a quality assessment programme and culminates in the reporting and dissemination of a study’s findings. Each chapter in the book is authored by one or more international authorities with experience in leading the implementation of an element of the described methodology.

As the contributors to this book explain, the science of large‐scale assessments is continuing to evolve. The challenges faced by the field and addressed by a number of contributors to this book include the collection of useful, internationally comparable data on a broader range of skills and attributes than have typically been assessed in large‐scale surveys. National education systems and governments are increasingly identifying skills and attributes such as collaboration, innovativeness, entrepreneurship and creativity as important outcomes of school education. The assessment of such attributes may require very different methods of observation and data gathering, including by capitalising on advances in assessment technologies.

An ongoing challenge will be to ensure that the results of large‐scale assessments continue to meet their essential purpose: to inform and lead effective educational policies and practices to better prepare all students for life and work in the twenty‐first century.

Professor Geoff Masters (AO)
CEO, Australian Council for Educational Research (ACER)
Camberwell, Victoria, January 2016

Acknowledgements

The editors gratefully acknowledge the Australian Council for Educational Research (ACER), the Australian Department of Foreign Affairs and Trade (DFAT) and Westat for their support of this book.

Particular thanks go to Juliet Young‐Thornton for her patient, friendly and effective assistance throughout the process of producing this book.

Abbreviations

ACER

Australian Council for Educational Research

ALL

Adult Literacy and Life Skills Survey

ASER

Annual Status of Education Report

BRR

Balanced repeated replication

CBA

Computer‐based assessment

CFA

Confirmatory factor analysis

CFI

Comparative fit index

CIVED

Civic Education Study

CONFEMEN

Conference of Education Ministers of Countries using French as the Language of Communication/Conférence des ministres de l'Éducation des Etats et gouvernements de la Francophonie

DIF

Differential item functioning

DPS

Described proficiency scale

EFA

Exploratory factor analysis

ESC

Expected score curves

ESCS

Economic, social and cultural status

ESS

European Social Survey

ETS

Educational Testing Service

FEGS

Functional Expert Groups

FIMS

First International Mathematics Study

FT

Field trial

ICC

Item characteristic curve

ICCS

International Civic and Citizenship Education Study

ICILS

International Computer and Information Literacy Study

ICT

Information and computer technology

IDB

International database

IDs

Identification variables

IEA

International Association for the Evaluation of Educational Achievement

IIEP

UNESCO International Institute for Educational Planning

ILO

International Labour Organization

IREDU

Institute for Research in the Sociology and Economics of Education

IRM

Item response models

IRT

Item response theory

ISCED

International Standard Classification of Education

ISCO

International Standard Classification of Occupations

ISEI

International Socio‐Economic Index of Occupational Status

ITC

International Test Commission

LAMP

Literacy Assessment and Monitoring Programme

LAN

Local area network

LGE

General Education Law/Ley General de Educación

LLECE

Latin American Laboratory for Assessment of the Quality of Education/Laboratorio Latinoamericano de Evaluación de la Calidad de la Educación

LSA

Large‐scale assessments

MOS

Measure of size

MS

Main survey

MTEG

Monitoring Trends in Educational Growth

NAEP

United States National Assessment of Educational Progress

NNFI

Non‐normed fit index

NPMs

National project managers

OCR

Optical character recognition

OECD

Organisation for Economic Co‐operation and Development

PASEC

The Programme for the Analysis of Education Systems of CONFEMEN/Programme d’Analyse des Systèmes Éducatifs de la CONFEMEN

PCA

Principal component analysis

PCM

Partial credit model

PIRLS

Progress in International Reading Literacy Study

PISA

Programme for International Student Assessment

PL

Parameter logistic model

PPS

Probability proportional to size

PSUs

Primary sampling units

RL

Reading Literacy Study

RMSEA

Root‐mean square error of approximation

RP

Response probability

SACMEQ

Southern and Eastern Africa Consortium for Monitoring Educational Quality

SDGs

Sustainable Development Goals

SEA‐PLM

Southeast Asian Primary Learning Metrics

SEM

Structural equation modelling

SERCE

Second Regional Comparative and Explanatory Study

SES

Socio‐economic status

SIGE

Students General Information System/Sistema Información General de Estudiantes

SIMCE

Sistema de Medición de la Calidad de la Educación

SIMS

Second International Mathematics Study

SISS

Second International Science Study

SITES

Second Information Technology in Education Study

SSUs

Secondary sampling units

TALIS

Teaching and Learning International Survey

TCMAs

Test‐Curriculum Matching Analyses

TERCE

Third Regional Comparative and Explanatory Study

TIMSS

Trends in International Mathematics and Science Study

TORCH

Test of Reading Comprehension

TRAPD

Translation, Review, Adjudication, Pretesting, and Documentation

UAENAP

United Arab Emirates (UAE) National Assessment Program

UNESCO

United Nations Educational, Scientific and Cultural Organization

UREALC

UNESCO’s Regional Bureau of Education for Latin America and the Caribbean

1Implementation of Large‐Scale Education Assessments

Petra Lietz, John C. Cresswell, Keith F. Rust and Raymond J. Adams

1.1 Introduction

The 60 years that followed a study of mathematics in 12 countries conducted by the International Association for the Evaluation of Educational Achievement (IEA) in 1964 have seen a proliferation of large‐scale assessments (LSAs) in education. In a recent systematic review of the impact of LSAs on education policy (Best et al., 2013), it was estimated that LSAs in education are now being undertaken in about 70% of the countries in the world.

The Programme for International Student Assessment (PISA) conducted by the Organisation for Economic Co‐operation and Development (OECD) was implemented in 75 countries in 2015 with around 510 000 participating students and their schools. Similarly, the Trends in International Mathematics and Science Study (TIMSS), conducted by the IEA, collected information from schools and students in 59 countries in 2015.

This book is about the implementation of LSAs in schools, which can be considered to involve 13 key areas. These start with the explication of policy goals and issues, assessment frameworks, test and questionnaire designs, item development, translation and linguistic control as well as sampling. They also cover field operations, technical standards, data collection, coding and management as well as quality assurance measures. Finally, test and questionnaire data have to be scaled and analysed while a database is produced, accompanied by dissemination and the reporting of results. While much of the book has been written from a central coordinating and management perspective, two chapters illustrate the actual implementation of LSAs and highlight the project teams and infrastructure required for participation in such assessments. Figure 1.2 in the concluding section of this chapter provides details regarding where each of these 13 key areas is covered in the chapters of this book.

Participation in these studies, on a continuing basis, is now widespread, as is indicated in Appendix 1.A. Furthermore, their results have become integral to the general public discussion of educational progress and international comparisons in a wide range of countries, with the impact of LSAs on education policy being demonstrated (e.g. Baker & LeTendre, 2005; Best et al., 2013; Breakspear, 2012; Gilmore, 2005). Therefore, it seems timely to bring together in one place the collective knowledge of those who routinely conduct these studies, with the aim of informing users of the results as to how such studies are conducted and providing a handbook for future practitioners of current and prospective studies.

While the emphasis throughout the book is on the practical implementation of LSAs, it is grounded in theories of psychometrics, statistics, quality improvement and survey communication. The chapters of this book seek to cover in one place almost every aspect of the design, implementation and analysis of LSAs (see Figure 1.2), with perhaps greater emphasis on the aspects of implementation than can be found elsewhere. This emphasis is intended to complement other recent texts with related content but a greater focus on the analysis of data from LSAs (e.g. Rutkowski, von Davier & Rutkowski, 2013).

This introductory chapter first provides some context in terms of the development of international, regional and national assessments and the policy context in which they occur. Then, the purposes for countries to undertake such assessments, particularly with a view to evidence‐based policymaking in education, are discussed. This is followed by a description of the content of the book. The chapter finishes with considerations as to where LSAs might be headed and what is likely to shape their development.

1.2 International, Regional and National Assessment Programmes in Education

The IEA first started a programme of large‐scale evaluation studies in education with a pilot study to explore the feasibility of such an endeavour in 1959–1961 (Foshay et al., 1962). After the feasibility study had shown that international comparative studies in education were indeed possible, the first content area to be tested was mathematics, with the First International Mathematics Study conducted in 12 countries in 1962–1967 (Husén, 1967; Postlethwaite, 1967), followed by the content areas of the six subject surveys, namely, civic education, English as a foreign language, French as a foreign language, literature education, reading comprehension and science, conducted in 18 countries in 1970–1971. Since then, as can be seen in Appendix 1.A, participation in international studies of education has grown considerably, with 59 and 75 countries and economies, respectively, participating in the latest administrations of the TIMSS by the IEA in 2015 and the PISA by the OECD in 2015.

In addition to international studies conducted by the IEA since the late 1950s and by the OECD since 2000, three assessment programmes with a regional focus have been designed and implemented, commencing in the mid‐1990s. First, the Conference of Education Ministers of Countries Using French as the Language of Communication (Conférence des ministres de l’Education des États et gouvernements de la Francophonie – CONFEMEN) conducts the Programme d’Analyse des Systèmes Educatifs de la CONFEMEN (PASEC). Since its first data collection in 1991, assessments have been undertaken in over 20 francophone countries, not only in Africa but also in other parts of the world (e.g. Cambodia, Laos and Vietnam). Second, the Southern and Eastern African Consortium for Monitoring Educational Quality (SACMEQ), with the support of the UNESCO International Institute for Educational Planning (IIEP) in Paris, has undertaken four data collections since 1995, with the latest assessment in 2012–2014 (SACMEQ IV) involving 15 countries in Southern and Eastern Africa. Third, the Latin‐American Laboratory for Assessment of the Quality of Education (LLECE is the Spanish acronym), with the assistance of UNESCO’s Regional Bureau for Education in Latin America and the Caribbean (UREALC), has undertaken three rounds of data collection since 1997, with 15 countries participating in the Third Regional Comparative and Explanatory Study (TERCE) in 2013. First steps towards an assessment in the Asia‐Pacific region are currently being undertaken through the Southeast Asian Primary Learning Metrics (SEA‐PLM) initiative.

In terms of LSAs of student learning, a distinction is made here between LSAs that are intended to be representative of an entire education system, which may measure and monitor learning outcomes for various subgroups (e.g. by gender or socio‐economic background), and large‐scale examinations that are usually national in scope and which report or certify individual student’s achievement (Kellaghan, Greaney & Murray, 2009). Certifying examinations may be used by education systems to attest achievement at the end of primary or secondary education, for example, or education systems may use examinations to select students and allocate placements for further or specialised study, such as university entrance or scholarship examinations. The focus of this book is on the implementation of LSAs of student learning that are representative of education systems, particularly international assessments that compare education systems and student learning across participating countries.

Parallel to the growth in international assessments, the number of countries around the world administering national assessments in any year has also increased – from 28 in 1995 to 57 in 2006 (Benavot & Tanner, 2007). For economically developing countries in the period from 1959 to 2009, Kamens and Benavot (2011) reported the highest number of national assessments in one year as 37 in 1999. Also in the 1990s, most of the countries in Central and South America introduced national assessments (e.g. Argentina, Bolivia, Brazil, Colombia, Dominican Republic, Ecuador, El Salvador, Guatemala, Paraguay, Peru, Uruguay and Venezuela) through the Partnership for Educational Revitalization in the Americas (PREAL) (Ferrer, 2006) although some introduced them earlier (e.g. Chile in 1982 and Costa Rica in 1986).

International, regional and national assessment programmes can all be considered as LSAs in education. While this book focuses mainly on international assessment programmes conducted in primary and secondary education, it also contains examples and illustrations from regional and national assessments where appropriate.

1.3 Purposes of LSAs in Education

Data from LSAs provide information regarding the extent to which students of a particular age or grade in an education system are learning what is expected in terms of certain content and skills. In addition, they assess differences in achievement levels by subgroups such as gender or region and factors that are correlated with different levels of achievement. Thus, a general purpose of participation in LSAs is to obtain information on a system’s educational outcomes and – if questionnaires are administered to obtain background information from students, teachers, parents and/or schools – the associated factors, which, in turn, can assist policymakers and other stakeholders in the education system in making policy and resourcing decisions for improvement (Anderson, Chiu & Yore, 2010; Benavot & Tanner, 2007; Braun, Kanjee & Bettinger, 2006; Grek, 2009; Postlethwaite & Kellaghan, 2008). This approach to education policymaking, based on evidence, including data from LSAs, has been adopted around the world, with Wiseman (2010, p. 2) stating that it is ‘the most frequently reported method used by politicians and policymakers’, which he argues can be considered a global norm for educational governance.

More specifically, Wiseman (2010) has put forward three main purposes for evidence‐based policymaking, namely, measuring and ensuring quality, ensuring equity and accountability. To fulfil the purpose of measuring quality, comparisons of performance across countries and over time tend to be undertaken. To provide indicators of equity, the performance of subgroups in terms of gender, socio‐economic status, school type or regions tends to be compared. Accountability refers to the use of assessment results to monitor and report achievement, sometimes publicly, in order to press schools and other stakeholders to improve practice and meet defined curricular and performance standards. In addition, assessment data may be used for accountability purposes to implement resource allocation policies (e.g. staff remuneration and contracts). Accountability is more frequently an associated goal of national assessment programmes than international assessment programmes.

To explicate further the way in which information from LSAs is used in education policymaking, models of the policy cycle are frequently put forward (e.g. Bridgman & Davis, 2004; Haddad & Demsky, 1995; Sutcliffe & Court, 2005). While most models include between six and eight stages, they seem to share four stages, namely, agenda setting, policy formulation, policy implementation and monitoring and evaluation. Agenda setting is the awareness of and priority given to an issue or problem whereas policy formulation refers to the analytical and political ways in which options and strategies are constructed. Policy implementation covers the forms and nature of policy administration and activities in the field. In the final step, monitoring and evaluation involves an appraisal of the extent to which implemented policies have achieved the intended aims and objectives. A model showing these four steps is shown in Figure 1.1.

Figure 1.1 Simplified model of the policy cycle

(Source: Sutcliffe and Court (2005). Reproduced with permission from the Overseas Development Institute)

Regardless of their purpose, data from LSAs are reported mainly through international, regional and national reports. However, these data are also used quite extensively in secondary data analyses (e.g. Hansen, Gustafsson & Rosén, 2014; Howie & Plomp, 2006; Owens, 2013), as well as meta‐analyses (e.g. Else‐Quest, Hyde & Linn, 2010; Lietz, 2006) which frequently lead to policy recommendations.

While recommendations are widespread, examples of the actual impact of these assessments on education policy are often provided in a more anecdotal or case study fashion (see Figazollo, 2009; Hanushek & Woessmann, 2010; McKinsey & Company, 2010) or by the main initiators of these assessments (e.g. Husén, 1967). Moreover, surveys have been conducted to ascertain the policy impact of these assessments. As these surveys have frequently been commissioned or initiated by the organisation responsible for the assessment (e.g. Breakspear, 2012 for the OECD; Gilmore, 2005 for the IEA), a certain positive predisposition regarding the effectiveness of the link between assessment and policy could be assumed. Similarly, surveys of and interviews with staff in ministries and entities that participate in such assessments (e.g. UNESCO, 2013), and that rely on future funding to continue their participation, are likely to report positively on the effects of assessment results on education policymaking.

Two systematic reviews that were conducted recently (Best et al., 2013; Tobin et al., 2015) took a different approach by systematically locating and analysing available evidence of links between LSA programmes and education policy. In other words, these reviews did not include reports or articles that resulted in policy recommendations or surveys of participating entities’ perceived impact of assessments on policy but looked for actual evidence of an assessment–policy link. In the review that focused on such a link in economically developing countries between 1990 and 2011 (Best et al., 2013), of 1325 uniquely identified materials only 54 were considered to provide such evidence. In the review that focused on all countries in the Asia‐Pacific between 1990 and 2013 (Tobin et al., 2015), 68 of the 1301 uniquely identified materials showed evidence of such a link.

Results of these systematic reviews revealed some interesting insights into the use of LSAs as follows:

  • Just under half of the assessment programmes in the review were national in coverage, followed by one‐third international programmes, while approximately one‐fifth were regional assessment programmes and only a few were subnational assessment programmes.
  • Of the regional assessment programmes, SACMEQ featured most often, followed by LLECE/SERCE and PASEC.
  • Of the international assessments, PISA featured most often, followed by TIMSS and the Progress in International Reading Literacy Study (PIRLS).
  • LSA programmes were most often intended to measure and ensure educational quality. Assessment programmes were less often used for the policy goals of equity or accountability for specific education matters.
  • The most frequent education policies impacted upon by the use of assessment data were system‐level policies regarding (i) curriculum standards and reform, (ii) performance standards and (iii) assessment policies.
  • The most common facilitators for assessment data to be used in policymaking, regardless of the type of assessment programme, were media and public opinion as well as appropriate and ongoing dissemination to stakeholders.
  • Materials which explicitly noted no impact on the policy process outlined barriers to the use of assessment data, which were thematically grouped as problems relating to (i) the (low) quality of an assessment programme and analyses, (ii) financial constraints, (iii) weak assessment bodies and fragmented government agencies and (iv) low technical capacity of assessment staff.
  • The high quality of the assessment programme was frequently seen as a facilitator to the use of regional assessment data, while a lack of quality was often regarded as a barrier to the use of subnational and national assessments. In international assessments, quality emerged as both a facilitator and a barrier: high quality was a facilitator in so far as the results were credible, robust and not questioned by stakeholders, but the requirement to adhere to internationally defined high‐quality standards was frequently a challenge for participating countries.

As the chapters throughout this book demonstrate, for assessment programmes to be of high quality, much effort, expertise, time and financial resources are required. While developing and maintaining the necessary funding and expertise continues to be a challenge, ultimately, the highest quality standards are required if information from LSAs is to be taken seriously by policymakers and other users of these data. Such high technical quality, combined with the ongoing integration of assessments into policy processes and an ongoing and varied media and communication strategy will increase the usefulness of evidence from LSAs for various stakeholders (Tobin et al., 2015).

1.3.1 Trend as a Specific Purpose of LSAs in Education

One‐off or cross‐sectional assessments can provide information about an outcome of interest at one point in time. This is of some interest in the comparative sense as participating systems can look at each other’s performance on the outcome and see what they can learn from those systems that (i) perform at a higher level, (ii) manage to produce greater homogeneity between the highest and lowest achievers or (iii) preferably do both (i) and (ii). These comparisons, however, are made across cultures, and it is frequently questioned which cultures or countries it is appropriate or reasonable to compare (e.g. Goldstein & Thomas, 2008). The relatively higher achievement of many Asian countries in PISA and TIMSS compared to other countries is often argued to be a consequence of differences in basic tenets and resulting dispositions, beliefs and behaviours across countries. Thus, various authors (e.g. Bracey, 2005; Leung, 2006; Minkov, 2011; Stankov, 2010) demonstrate cultural differences across societies regarding, for example, the value of education, students’ effort or respect for teachers, which make it difficult to see how countries can learn from each other to improve outcomes. Therefore, assessments that enable comparisons over time within countries are often considered to be more meaningful.

In England, the follow‐up study of the Plowden National Survey of 1964 was undertaken 4 years later in 1968 and was reported by Peaker (1967, 1971). This study followed the same students over the two occasions. Similarly, in Australia, a longitudinal Study of School Performance was carried out in 1975 with 10‐ and 14‐year‐old students in the fields of literacy and numeracy, with a subsample of students followed up 4 years later in 1979 (Bourke et al., 1981; Keeves & Bourke, 1976; Williams et al., 1980).

Both of these studies were longitudinal in kind, which is relatively rare in the realm of LSAs; most LSAs use repeated cross‐sectional assessments as a means to gauge changes over time across comparable cohorts, rather than looking at growth within a cohort by following the same individuals over time. The most substantial and continuing programme of this latter type of assessment of national scope is the National Assessment of Educational Progress (NAEP) in the United States. It was initiated in 1969 in order to assess achievement at the levels of Grade 4, Grade 8 and Grade 12 in reading, mathematics and science (see, e.g. Jones & Olkin, 2004; Tyler, 1985).

The main international assessments are cross‐sectional in kind and are repeated at regular intervals, with PIRLS conducted every 5 years, PISA every 3 years and TIMSS every 4 years. As the target population (e.g. 15‐year‐olds or Grade 4 students) remains the same on each occasion, this enables the monitoring of student outcomes for that target population over time. Notably, the importance of providing trend information was reflected in IEA’s change of what ‘TIMSS’ meant. In the 1995 assessment, the ‘T’ stood for ‘Third’, a meaning that was maintained in 1999 when the study was called the ‘Third International Mathematics and Science Study Repeat’. By the time of the 2003 assessment, however, the ‘T’ stood for ‘Trends’ in the ‘Trends in International Mathematics and Science Study’.

Now that PISA has assessed all major domains (i.e. reading, mathematics and science) twice, increasingly the attention paid to the results within each country is to national trends, both overall and for population subgroups, rather than to cross‐national comparisons. It is not news anymore that Korean students substantially outperform US students in mathematics. Indeed, if an implementation of PISA were suddenly to show this not to be the case, the results would not be believed, even though a different cohort is assessed each time. Generally, participating countries are most interested in whether or not there is evidence of improvement over time, both since the prior assessment and over the longer term. Such comparisons within a country over time are of great interest since they are not affected by the possible unique effects of culture, which can be seen as problematic for cross‐country comparisons.

Increasingly, countries that participate in PISA supplement their samples with additional students, not in a way that will appreciably improve the precision of comparisons with other countries but in ways that will improve the precision of trend measurements for key demographic groups within the country, such as ethnic or language minorities or students of lower socio‐economic status. Of course this does not preclude the occasional instances of political leaders who vow to show improvements in education over time through a rise in the rankings of the PISA or TIMSS ‘league tables’ (e.g. Ferrari, 2012).
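As a rough illustration of why oversampling a subgroup improves the precision of its trend estimates (this example is not from the book, and it ignores the design effects introduced by clustering, stratification and weighting), the standard error of a subgroup mean follows the familiar square‐root law:

$$\operatorname{SE}(\bar{x}) \approx \frac{s}{\sqrt{n}}$$

so quadrupling the number of sampled students from a particular subgroup roughly halves the standard error of that subgroup's mean score, and hence of the change estimated between two assessment cycles.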

1.4 Key Areas for the Implementation of LSAs in Education

As emphasised at the beginning of this introduction and found in the systematic reviews, for LSAs to be robust and useful, they need to be of high quality, technically sound, have a comprehensive communication strategy and be useful for education policy. To achieve this aim, 13 key areas need to be considered in the implementation of LSAs (see Figure 1.2).

Figure 1.2 Key areas of a robust assessment programme

While Figure 1.2 illustrates where these key areas are discussed in the chapters of this book, a brief summary of the content of each chapter is given below.

Chapter 2

– Test Design and Objectives

Given that all LSAs have to address the 13 elements of a robust assessment programme, why and how do these assessments differ from one another in practice? The answer to this question lies in the way that the purpose and guiding principles of an assessment guide decisions about who and what should be assessed. In this chapter, Dara Ramalingam outlines the key features of a selection of LSAs to illustrate the way in which their different purposes and assessment frameworks have led to key differences in decisions about test content, target population and sampling.

Chapter 3 – Test Development

All educational assessments that seek to provide accurate information about test takers' knowledge, skills and understanding in a domain of interest share a number of common characteristics. These include tasks that elicit responses contributing to a picture of the test takers' capacity in the domain. The tests also draw on knowledge and understanding intrinsic to the domain and are not likely to be more or less difficult for any individual or group because of knowledge or skills that are irrelevant to the domain. They must be in a format suited to the kinds of questions being asked, cover the area of learning under investigation and be practically manageable. Juliette Mendelovits describes the additional challenges that international LSAs face in meeting these general 'best practice' characteristics, since such assessments begin with frameworks that guide the development of tests subsequently administered to many thousands of students in diverse countries, cultures and contexts.

Chapter 4 – Design, Development and Implementation of Contextual Questionnaires in LSAs

In order to be relevant to education policy and practice, LSAs routinely collect contextual information through questionnaires to enable the examination of factors that are linked to differences in student performance. In addition, information obtained through contextual questionnaires is used independently of performance data to generate indicators of non-cognitive learning outcomes, such as students' attitudes towards reading, mathematics self-efficacy and interest in science, or indicators of teacher education, satisfaction and use of instructional strategies. In this chapter, Petra Lietz not only gives an overview of the content of questionnaires for students, parents, teachers and schools in LSAs but also discusses and illustrates the questionnaire design process, from questionnaire framework development to issues such as question order and length as well as question and response formats.

Chapter 5 – Sample Design, Weighting and Calculation of Sampling Variance

Since the goal of LSAs as we have characterised them is to measure the achievement of populations and specified subgroups rather than that of individual students and schools, it is neither necessary, nor in most cases feasible, to assess all students in the target population within each participating country. Hence, the selection of an appropriate sample of students, generally from a sample of schools, is a key technical requirement for these studies. In this chapter, Keith Rust, Sheila Krawchuk and Christian Monseur describe the steps involved in selecting such samples and the rationale for them. Given that a complex, stratified multistage sample is selected in most instances, those analysing the data must use methods of inference that take into account the effects of the sample design. Furthermore, some degree of school and student nonresponse is bound to occur, and methods are needed to mitigate any bias that such nonresponse might introduce.
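As a concrete illustration of the design-based inference discussed in Chapter 5, the minimal sketch below estimates a weighted mean and its sampling variance from replicate weights, in the style of balanced repeated replication with Fay's adjustment as used, for example, in PISA. The column names (score, w_final, w_rep1 to w_rep80) are hypothetical, and the number of replicates and the Fay factor should be taken from the relevant study's technical documentation rather than from this sketch.

# Minimal sketch of replicate-weight variance estimation for a weighted
# mean. Column names and the Fay factor are assumptions for illustration.
import numpy as np
import pandas as pd

def weighted_mean(values, weights):
    """Estimate a population mean from a weighted sample."""
    return np.sum(values * weights) / np.sum(weights)

def brr_fay_variance(df, score_col, final_weight, replicate_weights, fay=0.5):
    """Re-estimate the mean under each replicate weight and pool the
    squared deviations from the full-sample estimate (BRR with Fay)."""
    full = weighted_mean(df[score_col], df[final_weight])
    reps = np.array([weighted_mean(df[score_col], df[w]) for w in replicate_weights])
    variance = np.sum((reps - full) ** 2) / (len(replicate_weights) * (1.0 - fay) ** 2)
    return full, variance

# Hypothetical usage:
# df = pd.read_csv("student_file.csv")
# rep_cols = [f"w_rep{i}" for i in range(1, 81)]
# mean, var = brr_fay_variance(df, "score", "w_final", rep_cols)
# print(mean, var ** 0.5)  # estimate and its standard error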

Chapter 6 – Translation and Cultural Appropriateness of Survey Material in LSAs

Cross-linguistic, cross-national and cross-cultural equivalence is a fundamental requirement of LSAs in education, which seek to make comparisons across many different settings. While procedures for the translation, adaptation, verification and finalisation of survey materials – also called 'localisation' – cannot completely prevent linguistically or culturally induced bias, they aim to minimise the likelihood of such bias occurring. In this chapter, Steve Dept, Andrea Ferrari and Béatrice Halleux discuss the strengths and weaknesses of various approaches to the localisation of materials in different LSAs and single out practices that are more likely than others to yield satisfactory outcomes.

Chapter 7 – Quality Assurance

Quality assurance measures cover all aspects from test development to database production, as John Cresswell explains in this chapter. To ensure comparability of results across students, schools and countries, much work has gone into standardising cross-national assessments. The term 'standardised', in this context, refers not only to the scaling and scoring of the tests but also to consistency in their design, content and administration (de Landsheere, 1997). The extent of this standardisation is illustrated by the PISA technical standards for the 2012 administration (NPM(1003)9a), which covered three broad areas: data, management and national involvement. Data standards covered target population and sampling, language of testing, field trial participation, adaptation and translation of tests, implementation of national options, quality monitoring, printing, response coding and data submission. Management standards covered communication, notification of international and national options, the schedule for material submission, the drawing of samples, data management and the archiving of materials. National involvement standards covered feedback on appropriate mechanisms for promoting school participation and the dissemination of results among all national stakeholders.

Chapter 8 – Processing Responses to Open-ended Survey Questions

In this chapter, Ross Turner discusses the challenges associated with the consistent assessment of responses that students generate when answering questions other than multiple‐choice items. The methods described take into account the increased difficulty of this task when carried out in an international setting. Examples are given of the detailed sets of guidelines which are needed to code the responses and the processes involved in developing and implementing these guidelines.

Chapter 9 – Computer-based Delivery of Cognitive Assessment and Questionnaires

As digital technologies have advanced in the twenty-first century, the demand for using them in large-scale educational assessment has increased. In this chapter, Maurice Walker focuses on the substantive and logistical rationales for adopting or incorporating a computer-based approach to student assessment. He outlines assessment architecture and important item design options, with the view that well-planned computer-based assessment (CBA) should be a coherent, accessible, stimulating and intuitive experience for the test taker. Throughout the chapter, examples illustrate the differing degrees to which digital infrastructure has diffused into the schools of countries that participate in LSAs. The chapter also discusses the impact of these infrastructure issues on choices about whether and how to undertake CBA.

Chapter 10 – Data Management Procedures

In this chapter, Falk Brese and Mark Cockle discuss the data management procedures needed to minimise the error that might be introduced when converting responses from students, teachers, parents and school principals into electronic data. The chapter presents the various aspects of data management in international LSAs that need to be taken into account to meet this goal.

Chapter 11 – Test Implementation in the Field: The Case of PASEC

Oswald Koussihouèdé describes the implementation of one of the regional assessments – PASEC – which is undertaken in francophone countries in Africa and Asia. He outlines the significant changes recently made to this assessment programme to better describe the strengths and weaknesses of the student populations of the participating countries and to ensure that the assessment is implemented using the latest methodology.

Chapter 12 – Test Implementation in the Field: The Experience of Chile in International LSAs

Chile has participated in international LSAs undertaken by the IEA, OECD and UNESCO since 1998. Ema Lagos first explains the context in which these assessments have occurred, in terms of both the education system and the political circumstances. She then provides a comprehensive picture of all the tasks that need to be undertaken by a participating country, from input into instrument and item development, through sampling, the preparation of test materials and manuals and the conduct of field operations, to the coding, entry, management and analysis of data and the reporting of results.

Chapter 13 – Why LSAs Use Scaling and Item Response Theory (IRT)