Criterion-Referenced Test Development

Sharon A. Shrock

Description

Criterion-Referenced Test Development is designed specifically for training professionals who need to better understand how to develop criterion-referenced tests (CRTs). This important resource offers step-by-step guidance on how to make and defend Level 2 testing decisions, how to write test questions and performance scales that match jobs, and how to show that those certified as “masters” are truly masters. A comprehensive guide to the development and use of CRTs, the book provides information on a variety of topics, including different methods of test interpretation, test construction, item formats, test scoring, reliability and validation methods, test administration, and score reporting, as well as the legal and liability issues surrounding testing. New revisions include:

* Illustrative real-world examples
* Issues of test security
* Advice on the use of test creation software
* Expanded sections on performance testing
* Single-administration techniques for calculating reliability
* Updated legal and compliance guidelines

Order the third edition of this classic and comprehensive reference guide to the theory and practice of organizational tests today.




Contents

List of Figures, Tables, and Sidebars

Introduction: A Little Knowledge Is Dangerous

Why Test?

Why Read This Book?

A Confusing State of Affairs

Testing and Kirkpatrick’s Levels of Evaluation

Certification in the Corporate World

Corporate Testing Enters the New Millennium

What Is to Come . . .

Part I: Background: The Fundamentals

Chapter One: Test Theory

What Is Testing?

What Does a Test Score Mean?

Reliability and Validity: A Primer

Concluding Comment

Chapter Two: Types of Tests

Criterion-Referenced Versus Norm-Referenced Tests

Six Purposes for Tests in Training Settings

Three Methods of Test Construction (One of Which You Should Never Use)

Part II: Overview: The CRTD Model and Process

Chapter Three: The CRTD Model and Process

Relationship to the Instructional Design Process

The CRTD Process

Summary

Part III: The CRTD Process: Planning and Creating the Test

Chapter Four: Plan Documentation

Why Document?

What to Document

The Documentation

Chapter Five: Analyze Job Content

Job Analysis

Job Analysis Models

DACUM

Hierarchies

Bloom’s Original Taxonomy

Bloom’s Revised Taxonomy

Gagné’s Learned Capabilities

Merrill’s Component Design Theory

Data-Based Methods for Hierarchy Validation

Who Killed Cock Robin?

Chapter Six: Content Validity of Objectives

Overview of the Process

The Role of Objectives in Item Writing

A Word from the Legal Department About Objectives

The Certification Suite

How to Use the Certification Suite

Converting Job-Task Statements to Objectives

In Conclusion

Chapter Seven: Create Cognitive Items

What Are Cognitive Items?

Classification Schemes for Objectives

Types of Test Items

The Key to Writing Items That Match Jobs

The Certification Suite

Guidelines for Writing Test Items

How Many Items Should Be on a Test?

Summary of Determinants of Test Length

A Cookbook for the SME

Deciding Among Scoring Systems

Chapter Eight: Create Rating Instruments

What Are Performance Tests?

Product Versus Process in Performance Testing

Four Types of Rating Scales for Use in Performance Tests (Two of Which You Should Never Use)

Open Skill Testing

Chapter Nine: Establish Content Validity of Items and Instruments

The Process

Establishing Content Validity—The Single Most Important Step

Two Other Types of Validity

Summary Comment About Validity

Chapter Ten: Initial Test Pilot

Why Pilot a Test?

Six Steps in the Pilot Process

Preparing to Collect Pilot Test Data

Before You Administer the Test

When You Administer the Test

Honesty and Integrity in Testing

Chapter Eleven: Statistical Pilot

Standard Deviation and Test Distributions

Item Statistics and Item Analysis

Choosing Item Statistics and Item Analysis Techniques

Garbage In-Garbage Out

Chapter Twelve: Parallel Forms

Paper-and-Pencil Tests

Computerized Item Banks

Reusable Learning Objects

Chapter Thirteen: Cut-Off Scores

Determining the Standard for Mastery

The Outcomes of a Criterion-Referenced Test

The Necessity of Human Judgment in Setting a Cut-Off Score

Three Procedures for Setting the Cut-Off Score

Borderline Decisions

Problems with Correction-for-Guessing

The Problem of the Saltatory Cut-Off Score

Chapter Fourteen: Reliability of Cognitive Tests

The Concepts of Reliability, Validity, and Correlation

Types of Reliability

Single-Test-Administration Reliability Techniques

Calculating Reliability for Single-Test Administration Techniques

Two-Test-Administration Reliability Techniques

Calculating Reliability for Two-Test Administration Techniques

Comparison of ϕ, p_o, and κ

The Logistics of Establishing Test Reliability

Recommendations for Choosing a Reliability Technique

Summary Comments

Chapter Fifteen: Reliability of Performance Tests

Reliability and Validity of Performance Tests

Inter-Rater Reliability

Repeated Performance and Consecutive Success

Procedures for Training Raters

What if a Rater Passes Everyone Regardless of Performance?

What if You Get a High Percentage of Agreement among Raters but a Negative Phi Coefficient?

Chapter Sixteen: Report Scores

CRT Versus NRT Reporting

Summing Subscores

What Should You Report to a Manager?

Is There a Legal Reason to Archive the Tests?

A Final Thought About Testing and Teaching

Part IV: Legal Issues in Criterion-Referenced Testing

Chapter Seventeen: Criterion-Referenced Testing and Employment Selection Laws

What Do We Mean by Employment Selection Laws?

Who May Bring a Claim?

A Short History of the Uniform Guidelines on Employee Selection Procedures

Legal Challenges to Testing and the Uniform Guidelines

Balancing CRTs with Employment Discrimination Laws

Watch Out for Blanket Exclusions in the Name of Business Necessity

Adverse Impact, the Bottom Line, and Affirmative Action

Accommodating Test-Takers with Special Needs

Test Validation Criteria: General Guidelines

Test Validation: A Step-by-Step Guide

Keys to Maintaining Effective and Legally Defensible Documentation

Is Your Criterion-Referenced Testing Legally Defensible? A Checklist

A Final Thought

Epilogue: CRTD as Organizational Transformation

References

About the Authors

Index

Advertisements

End User License Agreement



About This Book

Why is this topic important?

Today’s organizations feel both external and internal pressures to test the competence of those who work for them. The current global, competitive, regulated, and litigation-savvy economic environment has increased external pressures, while the resulting increased investment in training and the escalating cost of high-tech instructional and human performance systems create internal pressure for accountability. Many products are now so complicated that human testing systems to ensure the product’s correct operation and maintenance have become virtually part of the product marketed to buyers. Valid, informative, and legally defensible competence testing has become essential to many organizations, yet the technology for creating these assessments has historically been shrouded in academic circles impenetrable to those without advanced degrees in measurement sciences.

What can you achieve with this book?

This book presents a straightforward model for the creation of legally defensible criterion-referenced tests designed to determine whether or not test-takers have mastered job-related knowledge and performance skills. Exercises with accompanying feedback allow you to monitor your own proficiency with the concepts and procedures introduced. Furthermore, the issues that are most likely to be encountered at each step in the test development process are fully elaborated to enable you to make sound decisions and actually complete and document the creation of valid testing systems.

How is this book organized?

This book is divided into five main sections. The first two introduce essential, fundamental testing concepts and present an overview of the entire CRTD Model. Part Three includes a chapter devoted to the elaboration of each step in the CRTD Model. A thorough discussion of the legal issues surrounding test creation and administration constitutes Part Four. Examples and exercises with feedback are used liberally throughout the book to facilitate understanding, engagement, and proficiency in the CRTD process. The final piece in the book is a brief epilogue reflecting on the profound and often unforeseen impact that testing can have on overall organizational performance.

About Pfeiffer

Pfeiffer serves the professional development and hands-on resource needs of training and human resource practitioners and gives them products to do their jobs better. We deliver proven ideas and solutions from experts in HR development and HR management, and we offer effective and customizable tools to improve workplace performance. From novice to seasoned professional, Pfeiffer is the source you can trust to make yourself and your organization more successful.

Essential Knowledge Pfeiffer produces insightful, practical, and comprehensive materials on topics that matter the most to training and HR professionals. Our Essential Knowledge resources translate the expertise of seasoned professionals into practical, how-to guidance on critical workplace issues and problems. These resources are supported by case studies, worksheets, and job aids and are frequently supplemented with CD-ROMs, websites, and other means of making the content easier to read, understand, and use.

Essential Tools Pfeiffer’s Essential Tools resources save time and expense by offering proven, ready-to-use materials–including exercises, activities, games, instruments, and assessments–for use during a training or team-learning event. These resources are frequently offered in looseleaf or CD-ROM format to facilitate copying and customization of the material.

Pfeiffer also recognizes the remarkable power of new technologies in expanding the reach and effectiveness of training. While e-hype has often created whizbang solutions in search of a problem, we are dedicated to bringing convenience and enhancements to proven training solutions. All our e-tools comply with rigorous functionality standards. The most appropriate technology wrapped around essential content yields the perfect solution for today’s on-the-go trainers and human resource professionals.

Essential resources for training and HR professionals

About ISPI

The International Society for Performance Improvement (ISPI) is dedicated to improving individual, organizational, and societal performance. Founded in 1962, ISPI is the leading international association dedicated to improving productivity and performance in the workplace. ISPI represents more than 10,000 international and chapter members throughout the United States, Canada, and forty other countries.

ISPI’s mission is to develop and recognize the proficiency of our members and advocate the use of Human Performance Technology. This systematic approach to improving productivity and competence uses a set of methods and procedures, and a strategy for solving problems and realizing opportunities, related to the performance of people. It is a systematic combination of performance analysis, cause analysis, intervention design and development, implementation, and evaluation that can be applied to individuals, small groups, and large organizations.

Website: www.ispi.org

Mail: International Society for Performance Improvement

1400 Spring Street, Suite 260

Silver Spring, Maryland 20910 USA

Phone: 1.301.587.8570

Fax: 1.301.587.8573

E-mail: [email protected]

Criterion-Referenced Test Development

Technical and Legal Guidelines for Corporate Training

3rd Edition

 

Sharon A. Shrock

William C. Coscarelli

 

 

Copyright © 2007 by John Wiley & Sons, Inc. All rights reserved.

Published by Pfeiffer

An Imprint of Wiley

989 Market Street, San Francisco, CA 94103-1741

www.pfeiffer.com

Wiley Bicentennial logo: Richard J. Pacifico

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400, fax 978-646-8600, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, 201-748-6011, fax 201-748-6008, or online at http://www.wiley.com/go/permissions.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

Readers should be aware that Internet websites offered as citations and/or sources for further information may have changed or disappeared between the time this was written and when it is read.

For additional copies/bulk purchases of this book in the U.S. please contact 800-274-4434.

Pfeiffer books and products are available through most bookstores. To contact Pfeiffer directly call our Customer Care Department within the U.S. at 800-274-4434, outside the U.S. at 317-572-3985, fax 317-572-4002, or visit www.pfeiffer.com.

Pfeiffer also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

Library of Congress Cataloging-in-Publication Data

Shrock, Sharon A.

Criterion-referenced test development : technical and legal guidelines for corporate training / Sharon A. Shrock and William C. Coscarelli. – 3rd ed.

p. cm.

Includes index.

ISBN 978-0-7879-8850-0 (pbk.)

1. Employees–Training of–Evaluation. 2. Criterion-referenced tests.

I. Coscarelli, William C. II. Title.

HF5549.5.T7S554 2007

658.3'12404–dc22

2007019607

Acquiring Editor: Matthew Davis

Marketing Manager: Jeanenne Ray

Director of Development: Kathleen Dolan Davies

Developmental Editor: Susan Rachmeler

Production Editor: Michael Kay

Editorial Assistant: Julie Rodriguez

Editor: Rebecca Taff

Manufacturing Supervisor: Becky Morgan

Dedicated to Rubye and Don and to Kate and Cyra and Maybeline

List of Figures, Tables and Sidebars

Figure 1.1a

Reliable, But Not Valid

Figure 1.1b

Neither Reliable Nor Valid

Figure 1.1c

Reliable and Valid

Figure 2.1

Example Frequency Distribution

Figure 2.2

Ideal NRT Frequency Distribution

Figure 2.3

The Normal Distribution

Figure 2.4

Mastery Curve

Figure 2.5

Example of an Objective with Corresponding Test Item

Figure 3.1

Designing Criterion-Referenced Tests

Figure 4.1

The CRTD Process and Documentation

Table 5.1

DACUM Research Chart for Computer Applications Programmer

Table 5.2

Standard Task Analysis Form

Figure 5.1

Hierarchical Relationship of Skills

Figure 5.2

Extended Hierarchical Analysis

Figure 5.3

Hierarchical Task Analysis, Production-Operations Manager

Figure 5.4

Cognitive Levels of Bloom’s Taxonomy

Figure 5.5

Hierarchy Illustrating Correct Bloom Sequence

Figure 5.6

Hierarchy Illustrating Incorrect Bloom Sequence

Figure 5.7

Bloom’s Levels Applied to the Production Manager Content Hierarchy

Table 5.3

Summary of the Revised Bloom’s Taxonomy

Figure 5.8

Application of Gagné’s Intellectual Skills to Hierarchy Validation

Figure 5.9

Application of Merrill’s Component Design Theory to Hierarchy Validation

Figure 5.10

Analysis of Posttest Scores to Validate a Hierarchy, Example of a Valid Hierarchy

Figure 5.11

Analysis of Posttest Scores to Validate a Hierarchy, Example of an Invalid Hierarchy

Table 6.1

Summary of Certification Suite

Figure 6.1

Selecting the Certification Level

Table 7.1

Bloom’s Taxonomy on the Battlefield: A Scenario of How Bloom’s Levels Occur in a Combat Environment

Table 7.2

Decision Table for Estimating the Number of Items Per Objective to Be Included on a Test

Table 7.3

Summary of SME Item Rating for Unit 1 Production Manager Test

Figure 8.1

Numerical Scale

Figure 8.2

Descriptive Scale

Figure 8.3

Behaviorally Anchored Rating Scale

Figure 8.4

Checklist

Figure 8.5

Criterion-Referenced Performance Test

Figure 8.6

Sample Form Used to Score the Task of Merging into Traffic

Figure 9.1

Test Content Validation Form

Figure 9.2

Test Content Validation Results Form

Figure 9.3

Content Validity Index Scales

Figure 9.4

CVI-Relevance for a Test Item

Table 9.1

Example of Concurrent Validity Data

Figure 9.5

Phi Table for Concurrent Validity

Figure 9.6

Example Phi Table for Concurrent Validity

Table 9.2

Example of Concurrent Validity Data

Figure 9.7

Blank Table for Practice Phi Calculation: Concurrent Validity

Figure 9.8

Answer for Practice Phi Calculation: Concurrent Validity

Figure 9.9

Phi Table for Predictive Validity

Figure 10.1

Kincaid Readability Index

Figure 11.1

Standard Normal Curve

Figure 11.2

Standard Deviations of a Normal Curve

Figure 11.3

Frequency Distributions with Standard Deviations of Various Sizes

Figure 11.4

Skewed Curves

Figure 11.5

Mastery Curve

Figure 11.6

The Upper/Lower Index for a CRT

Figure 11.7

Phi Table for Item Analysis

Table 12.1

Angoff Ratings for Items in an Item Bank

Figure 13.1

Outcomes of a Criterion-Referenced Test

Table 13.1

Judges’ Probability Estimates (Angoff Method)

Table 13.2

Possible Probability Estimates (Angoff Method)

Figure 13.2

Contrasting Groups Method of Cut-Off Score Estimation

Table 13.3

Example Test Results for Using the Contrasting Groups Method

Figure 13.3

Frequency Distributions for Using the Contrasting Groups Method

Figure 13.4

Application of the Standard Error of Measurement

Figure 13.5

The Test Score as a Range Rather Than a Point

Figure 13.6

Correction-for-Guessing Formula

Figure 14.1

Graphic Illustrations of Correlation

Table 14.1

Comparison of Three Single-Test Administration Reliability Estimates

Table 14.2

Example of Test-Retest Data for a CRT

Figure 14.2

Phi Table for Test-Retest Reliability

Figure 14.3

Example Phi Table for Test-Retest Reliability

Table 14.3

Sample Test-Retest Data

Figure 14.4

Blank Table for Practice Phi Calculation

Figure 14.5

Answer for Practice Phi Calculation, Test-Retest Reliability

Figure 14.6

Matrix for Determining p_o and p_chance

Figure 14.7

Example Matrix for Determining p_o and p_chance

Figure 14.8

Blank Matrix for Determining p_o and p_chance

Figure 14.9

Completed Practice Matrix for Determining p_o and p_chance

Table 15.1

Example Performance Test Data, Inter-Rater Reliability

Figure 15.1

Matrix for Determining p_o and p_chance

Figure 15.2

Example p_o and p_chance Matrix, Judges 1 & 2

Figure 15.3

Example p_o and p_chance Matrix, Judges 1 & 3

Figure 15.4

Example p_o and p_chance Matrix, Judges 2 & 3

Figure 15.5

Blank p_o and p_chance Matrix, Judges 1 & 2

Figure 15.6

Blank p_o and p_chance Matrix, Judges 1 & 3

Figure 15.7

Blank p_o and p_chance Matrix, Judges 2 & 3

Table 15.2

Sample Performance Test Data, Inter-Rater Reliability

Figure 15.8

Answer for p_o and p_chance Matrix, Judges 1 & 2

Figure 15.9

Answer for p_o and p_chance Matrix, Judges 1 & 3

Figure 15.10

Answer for p_o and p_chance Matrix, Judges 2 & 3

Figure 15.11

Matrix for Calculating Phi

Figure 15.12

Matrix for Calculating Phi, Judges 1 & 2

Figure 15.13

Matrix for Calculating Phi, Judges 1 & 3

Figure 15.14

Matrix for Calculating Phi, Judges 2 & 3

Table 15.3

Conversion Table for ϕ (r) into Z

Figure 15.15

Blank Matrix for Calculating Phi, Judges 1 & 2

Figure 15.16

Blank Matrix for Calculating Phi, Judges 1 & 3

Figure 15.17

Blank Matrix for Calculating Phi, Judges 2 & 3

Figure 15.18

Answer Matrix for Calculating Phi, Judges 1 & 2

Figure 15.19

Answer Matrix for Calculating Phi, Judges 1 & 3

Figure 15.20

Answer Matrix for Calculating Phi, Judges 2 & 3

Figure 15.21

Klaus and Lisa Phi Calculation When Klaus Passes All Test-Takers

Figure 15.22

Phi When Klaus Agrees with Lisa on One Performer She Failed

Figure 15.23

Phi When Klaus and Lisa Agree One Previously Passed Performer Failed

Figure 15.24

Phi When Klaus Fails One Performer Lisa Passed

Figure 15.25

Phi When Gina and Omar Agree on One More Passing Performer

Figure 15.26

Phi When Gina and Omar Agree on One Failing Performer

Figure 15.27

Phi When Gina and Omar Are More Consistent in Disagreeing

Table 16.1

Calculation of the Overall Course Cut-Off Score

Table 16.2

Calculation of an Individual’s Weighted Performance Score

Table 17.1

Sample Summary of Adverse Impact Figures

Table 17.2

Summary of Adverse Impact Figures for Practice

Sidebars

The Stakes of an Assessment

Using Assessments for Compliance

Documenting a Test Security Plan

The Man with the Multiple-Choice Mind

Why Computerized Testing Is Preferred in Business

The Difference Between Performance and Knowledge Tests

Should You Use an Independent Testing Center?

Cheating: It’s Not Just for Breakfast Anymore

Using Statistical Methods to Detect Cheating

Introduction: A Little Knowledge Is Dangerous

Why Test?
Why Read This Book?
A Confusing State of Affairs
Testing and Kirkpatrick’s Levels of Evaluation
Certification in the Corporate World
Corporate Testing Enters the New Millennium
What Is to Come . . .

Why Test?

Today’s business and technological environment has increased the need for assessment of human competence. Any competitive advantage in the global economy requires that the most competent workers be identified and retained. Furthermore, training and development, HRD, and performance technology agencies are increasingly required to justify their existence with evidence of effectiveness. These pressures have heightened the demand for better assessment and the distribution of assessment data to line managers to achieve organizational goals. These demands increasingly present us with difficult issues. For example, if you haven’t tested, how can you show that those graduates you certify as “masters” are indeed masters and can be trusted to perform competently while handling dangerous or expensive equipment or materials? What would you tell an EEO officer who presented you with a grievance from an employee who was denied a salary increase based on a test you developed? These and other important questions need to be answered for business, ethical, and legal reasons. And they can be answered through doable and cost-effective test systems.

So, as certification and competency testing are used increasingly in business and industry, correct testing practices provide the data needed for rational decision making.

Why Read This Book?

Corporate training, driven by competition and keen awareness of the “bottom line,” has a certain intensity about it. Errors in instructional design or employees’ failure to master skills or content can cause significant negative consequences. It is not surprising, then, that corporate trainers are strong proponents of the systematic design of criterion-referenced instructional systems. What is surprising is the general lack of emphasis on a parallel process for the assessment of instructional outcomes—in other words, testing.

All designers of instruction acknowledge the need for appropriate testing strategies, and non-instructional interventions also frequently require the assessment of human competence, whether in the interest of needs assessment, the formation of effective work teams, or the evaluation of the intervention.

Most training professionals have taken at least one intensive course in the design of instruction, but most have never had similar training in the development of criterion-referenced tests—tests that compare persons against a standard of competence, instead of against other persons (norm-referenced tests). It is not uncommon for a forty-hour workshop in the systematic design of instruction to devote less than four hours to the topic of test development—focusing primarily on item writing skills. With such minimal training, how can we make and defend our assessment decisions?

Without an understanding of the basic principles of test design, you can face difficult ethical, economic, or legal problems. For these and other reasons, test development should stand on an equal footing with instructional development—for if it doesn’t, how will you know whether your instructional objectives were achieved and how will you convince anyone else that they were?

Criterion-Referenced Test Development translates complex testing technology into sound technical practice within the grasp of a non-specialist. Hence, one of the themes we have woven into the book is that testing properly is often no more expensive or time-consuming than testing improperly. For example, we have been able to show how to create a defensible certification test for a forty-hour administrative training course using a test that takes fewer than fifteen minutes to administer and probably less than a half-day to create. It is no longer acceptable simply to write test items without regard to a defensible process. Specific knowledge of the strengths and limitations of both criterion-referenced and norm-referenced testing is required to address the information needs of the world today.

A Confusing State of Affairs

Grade schools, high schools, universities, and corporations share many similar reasons for not having adopted the techniques for creating sound criterion-referenced tests. We have found three reasons that seem to explain why those who might otherwise embrace the systematic process of test design have not: misleading familiarity, inaccessible technology, and procedural confusion. In each instance, it seems that a little knowledge about testing has proven dangerous to the quality of the criterion-referenced test.

Misleading Familiarity

As training professionals, few of us teach the way we were taught. However, most of us are still testing the way we were tested. Since every adult has taken many tests while in school, there is a misleading familiarity with them. There is a tendency to believe that everyone already knows how to write a test. This belief is an error, not only because exposure does not guarantee know-how, but because most of the tests to which we were exposed in school were poorly constructed. The exceptions—the well-constructed tests in our past—tend to be the group-administered standardized tests, for example, the Iowa Tests of Basic Skills or the SAT. Unfortunately for corporate trainers, these standardized tests are good examples of norm-referenced tests, not of criterion-referenced tests. Norm-referenced tests are designed for completely different purposes than criterion-referenced tests, and each is constructed and interpreted differently. Most teacher-made tests are “mongrels,” having characteristics of both norm-referenced and criterion-referenced tests—to the detriment of both.

Inaccessible Technology

Criterion-referenced testing technology is scarce in corporate training partly because the technology of creating these tests has been slow to develop. Even now with so much emphasis on minimal competency testing in the schools, the vast majority of college courses on tests and measurements are about the principles of creating norm-referenced tests. In other words, even if trainers want to “do the right thing,” answers to important questions are hard to come by. Much of the information about criterion-referenced tests has appeared only in highly technical measurement journals. The technology to improve practice in this area just hasn’t been accessible.

Procedural Confusion

A final pitfall in good criterion-referenced test development is that both norm-referenced tests and criterion-referenced tests share some of the same fundamental measurement concepts, such as reliability and validity. Test creators don’t always seem to know how these concepts must be modified to be applied to the two different kinds of tests.

Recently, we saw an article in a respected corporate training publication that purported to detail all the steps necessary to establish the reliability of a test. The procedures that were described, however, will work only for norm-referenced tests. Since the article appeared in a training journal, we question the applicability of the information to the vast majority of testing that its readers will conduct. Because the author was the head of a training department, we had to appreciate his sensitivity to the value of a reliability estimate in the test development process, yet the article provided a clear illustration of procedural confusion in test development, even among those with some knowledge of basic testing concepts.

Testing and Kirkpatrick’s Levels of Evaluation

In 1994 Donald Kirkpatrick presented a classification scheme of four levels of evaluation in business organizations that has permeated much of management’s current thinking about evaluation. We want to review these levels and then share two observations. First, the four levels:

Level 1, or Reaction evaluations, measure “how those who participate in the program react to it ... I call it a measure of customer satisfaction” (p. 21).

Level 2, or Learning evaluations, “can be defined as the extent to which participants change attitudes, improve knowledge, and/or increase skill as a result of attending the program” (p. 22). Criterion-referenced assessments of competence are the skill and knowledge assessments that typically take place at the end of training. They seek to measure whether desired competencies have been mastered and so typically measure against a specific set of course objectives.

Level 3, or Behavior evaluations, “are defined as the extent to which change in behavior has occurred because the participant attended the training program” (p. 23). These evaluations are usually designed to assess the transfer of training from the classroom to the job.

Level 4, or Results evaluation, is designed to determine “the final results that occurred because the participants attended the program” (p. 25). Typically, this level of evaluation is seen as an estimate of the return to the organization on its investment in training. In other words, what is the cost-benefit ratio to the organization from the use of training?

We would like to make two observations about criterion-referenced testing and this model. The first observation is:

Level 2 evaluation of skills and knowledge is synonymous with the criterion-referenced testing process described in this book.

The second observation is more controversial, but supported by Kirkpatrick:

You cannot do Level 3 and Level 4 evaluations until you have completed Level 2 evaluations.

Kirkpatrick argued:

Some trainers are anxious to get to Level 3 or 4 right away because they think the first two aren’t as important. Don’t do it. Suppose, for example, that you evaluate at Level 3 and discover that little or no change in behavior has occurred. What conclusions can you draw? The first conclusion is probably that the training program was no good, and we had better discontinue it or at least modify it. This conclusion may be entirely wrong ... the reason for no change in job behavior may be that the climate prevents it. Supervisors may have gone back to the job with the necessary knowledge, skills, and attitudes, but the boss wouldn’t allow change to take place. Therefore, it is important to evaluate at Level 2 so you can determine whether the reason for no change in behavior was lack of learning or negative job climate. (p. 72)

Here’s another perspective on this point, by way of an analogy:

Suppose your company manufactures sheet metal. Your factory takes resources, processes the resources to produce the metal, shapes the metal, and then distributes the product to your customers. One day you begin to receive calls. “Hey,” says one valued customer, “this metal doesn’t work! Some sheets are too fat, some too thin, some just right! I’m never quite sure when they’ll work on the job! What am I getting for my money?” “What?” you reply, “They ought to work! We regularly check with our workers, who are very good, and they all feel we do good work.” “I don’t care what they think,” says the customer, “the stuff just doesn’t work!”

Now, substitute the word “training” for “sheet metal” and we see the problem. Your company takes resources and produces training. Your trainees say that the training is good (Level 1—What did the learner think of the instruction?), but your customers report that what they are getting on the job doesn’t match their needs (Level 3—What is taken from training and applied on the job?), and as a result, they wonder what their return on investment is (Level 4—What is the return on investment [ROI] from training?). Your company has a problem because the quality of the process, that is, training (Level 2—What did the learner learn from instruction?) has not been assessed; as a result, you really don’t know what is going on during your processes. And now that you have evidence the product doesn’t work, you have no idea where to begin to fix the problem. No viable manufacturer would allow its products to be shipped without making sure they met product specifications. But training is routinely completed without a valid and reliable measure of its outcomes. Supervisors ask about on-the-job relevance, managers wonder about the ROI from training, but neither question can be answered until the outcomes of training have been assessed. If you don’t know what they learned in training, you can’t tell what they transferred from training to the job and what its costs and benefits are! (Coscarelli & Shrock, 1996, p. 210)

In conclusion, we agree completely with Kirkpatrick when he wrote “Some trainers want to bypass Levels 1 and 2. ... This is a serious mistake” (p. 23).

Certification in the Corporate World

In the 1970s, few organizations offered certification programs; examples included the Chartered Life Underwriter (CLU) and Certified Production and Inventory Management (CPIM) credentials. By the late 1990s certification had become, literally, a growth industry. Internal corporate certification programs proliferated, and profession-wide certification testing had become a profit center for some companies, including Novell, Microsoft, and others. The Educational Testing Service opened its first for-profit center, the Chauncey Group, to concentrate on certification test development and human resources issues. Sylvan became known in the business world as the primary provider of computer-based, proctored testing centers. There are many reasons why such an interest has developed. Thomas (1996) identifies seven elements and observes that the “theme underlying all of these elements is the need for accountability and communication, especially on a global basis” (p. 276). Because the business world remains market-driven, the classic academic definitions of terms related to testing have become blurred, so that various terms in the field of certification have different meanings. While a tonsil is a tonsil is a tonsil in the medical world, certification may not mean the same thing to each member in a discussion. While in Chapter 6 we present a tactical way to think about certification program design (The Certification Suite), here we want to clarify a few terms that are often ill-defined or confused.

Certification “is a formal validation of knowledge or skill ... based on performance on a qualifying examination ... the goal is to produce results that are as dependable or more dependable than those that could be gained by direct observation (on the job)” (Drake Prometric, 1995, p. 2). Certification should provide “an objective and consistent method of measuring competence and ensuring the qualifications of technical professionals” (Microsoft, 1995, p. 3). Certification usually means measuring a person’s competence against a given standard—a criterion-referenced test interpretation. The certification test seeks to measure an individual’s performance in terms of specific skills the individual has demonstrated and without regard to the performance of other test-takers. There is no limit to the number of test-takers who can succeed on a criterion-referenced test—everyone who scores beyond a given level is judged a “master” of the competencies covered by the test. (The term “master” doesn’t usually mean the rare individual who excels far beyond peers; the term simply means someone competent in the performance of the skills covered by the test.) “The intent of certification ... normally is to inform the public that individuals who have achieved certification have demonstrated a particular degree of knowledge and skill (and) is usually a voluntary process instituted by a nongovernmental agency” (Fabrey, 1996, p. 3).

Licensure, by contrast, “generally refers to the mandatory governmental requirement necessary to practice in a particular profession or occupation. Licensure implies both practice protection and title protection, in that only individuals who hold a license are permitted to practice and use a particular title” (Fabrey, 1996, p. 3). Licensure in the business world is rarely an issue in assessing employee competence but plays a major role in protecting society in areas of health care, teaching, law, and other professions.

Qualification is the assessment that a person understands the technology or processes of a system as it was designed or that he or she has a basic understanding of the system or process, but not to the level of certainty provided through certification testing. Qualification is the most problematic of the terms that are often used in business, and it is one we have seen develop primarily in the high-tech industries.

Qualification as a term has developed in many ways as a response to a problematic training situation. Customers (either internal or external to the business) demand that those sent for training be able to demonstrate competence on the job, while at the same time those doing the training and assessment have not been given a job task analysis that is specific to the organization’s need. Thus, the trainers cannot in good conscience represent that the trainees who have passed the tests in training can perform back at the work site. So, for example, if a company develops a new high-tech cell phone switching system, the same system can be configured in a variety of ways by each of the various regional telephone companies that purchase the switch. Without a training program customized to each company, the switch developer will offer training only in the characteristics of the switching system, or perhaps its most common configurations. That training would then “qualify” the trainee to configure and work with the switch within the idiosyncratic constraints of the particular employer. As you can see, the term is founded more on the practical realities of technology development and contract negotiation than on formal assessment. Organizations that provide training that cannot be designed to match the job requirement are often best served by drawing the distinction between certification and qualification early on in the contract negotiation stage, thus clarifying either formal or informal expectations.

Corporate Testing Enters the New Millennium

By early 2000 certification had become less a growth industry and more a mature one. A number of the larger programs, for example, Hewlett-Packard and Microsoft, were well-established and operating on a stable basis. In-house certification programs did continue, but management more acutely examined the cost-benefit ratio for these programs. Meanwhile, in the United States the 2001 Federal act, No Child Left Behind, was signed into law and placed a new emphasis on school accountability for student learning progress. Interestingly, the discussion that was sparked by this act created a distinction in testing that was assimilated by both the academic and business communities and helped guide resource allocations. This concept is “often referred to as the stakes of the testing,” according to the Standards for Educational and Psychological Testing (AERA/APA/NCME Joint Committee, 1999, p. 139), which described a classification of sorts for the outcomes of testing and the implied level of rigor associated with each type of test’s design.

High Stakes Tests. A high stakes test is one in which “significant educational paths or choices of an individual are directly affected by test performance. ... Testing programs for institutions can have high stakes when aggregate performance of a sample or of the entire population of test-takers is used to infer the quality of service provided, and decisions are made about institutional status, rewards, or sanctions based on the test results” (AERA/APA/NCME Joint Committee, 1999, p. 139). While the definition of high stakes was intended for the public schools, it was easily translated into a corporate culture, where individual promotion, bonuses, or employment might all be tied to test performance or where entire departments, such as the training department, might be affected by test-taker performance.

Low Stakes Tests. At the other end of the continuum, the Standards defined low stakes tests as those that are “administered for informational purposes or for highly tentative judgments such as when test results provide feedback to students...” (p. 139).

These two ends of the continuum implied different levels of rigor and resources in test construction. This distinction was also indicated by the Standards:

The higher the stakes associated with a given test use, the more important it is that test-based inferences are supported with strong evidence of technical quality. In particular, when the stakes for an individual are high, and important decisions depend substantially on test performance, the test needs to exhibit higher standards of technical quality for its avowed purposes than might be expected of tests used for lower-stakes purposes ... Although it is never possible to achieve perfect accuracy in describing an individual’s performance, efforts need to be made to minimize errors in estimating individual scores in classifying individuals in pass/fail or admit/reject categories. Further, enhancing validity for high-stakes purposes, whether individual or institutional, typically entails collecting sound collateral information both to assist in understanding the factors that contributed to test results and to provide corroborating evidence that supports the inferences based on test results. (pp. 139–140)

What Is to Come ...

In the following chapters, we will describe a systematic approach to the development of criterion-referenced tests. We recognize that not all tests are high-stakes tests, but the book does describe the steps you need to consider for developing a high-stakes criterion-referenced test. If your test doesn’t need to meet that standard, you can then decide which steps can be skipped, adapted, or adopted to meet your own particular needs. To help you do this, Criterion-Referenced Test Development (CRTD) is divided into five main sections:

In the Background, we provide a basic frame of reference for the entire test development process.

The Overview provides a detailed description of the Criterion-Referenced Test Development Process (CRTD) using the model we have created and tested in our work with more than forty companies.

Planning and Creating the Test describes how to proceed with the CRTD process using each of the thirteen steps in the model. Each step is explored as a separate chapter, and where appropriate, we have provided summary points that you may need to complete the CRTD documentation process.

Legal Issues in Criterion-Referenced Testing, authored by Patricia Eyres, a practicing attorney in the field, deals with some of the important legal issues in the CRTD process.

Our Epilogue is a reflection on our experiences with testing. In fact, those of you starting a testing program in an organization may wish to read this chapter first! When we first began our work in CRTD, we thought of the testing process as the last “box” in the Instructional Development process. We have since come to understand that testing, when done properly, often has significant consequences for the organization. These can be highly beneficial if the process is supported and well managed. In short, we now view effective CRT systems not simply as discrete assessment devices, but as systemic interventions.

Periodically, we have provided an opportunity for practice and feedback. You will find that many of the topics in the Background are reinforced by exercises with corresponding answers and that, throughout the book, opportunities to practice applying the most important or difficult concepts are similarly provided.

We are also including short sidebars from individuals and organizations associated with the world of CRT, when we feel they can help illustrate a point in the process. Interestingly, most of the sidebars reflect the two areas that have developed most rapidly since our last edition—computer-based testing and processes to reduce cheating on tests.

Part OneBackground: The Fundamentals

Chapter One: Test Theory

What Is Testing?
What Does a Test Score Mean?
Reliability and Validity: A Primer
Concluding Comment

What Is Testing?

There are four related terms that can be somewhat confusing at first: evaluation, assessment, measurement, and testing. These terms are sometimes used interchangeably; however, we think it is useful to make the following distinctions among them:

Testing

is the collection of quantitative (numerical) information about the degree to which a competence or ability is present in the test-taker. There are right and wrong answers to the items on a test, whether it be a test comprised of written questions or a performance test requiring the demonstration of a skill. A typical test question might be: “List the six steps in the selling process.”

Measurement

is the collection of quantitative data to determine the degree of whatever is being measured. There may or may not be right and wrong answers. A measurement inventory such as the

Decision-Making Style Inventory

might be used to determine a preference for using a Systematic style versus a Spontaneous one in making a sale. One style is not “right” and the other “wrong”; the two styles are simply different.

Assessment

is systematic information gathering without necessarily making judgments of worth. It may involve the collection of quantitative or qualitative (narrative) information. For example, by using a series of personality inventories and through interviewing, one might build a profile of “the aggressive salesperson.” (Many companies use Assessment Centers as part of their management training and selection process. However, as the results from these centers are usually used to make judgments of worth, they are more properly classed as evaluation devices.)

Evaluation

is the process of making judgments regarding the appropriateness of some person, program, process, or product for a specific purpose. Evaluation may or may not involve testing, measurement, or assessment. Most informed judgments of worth, however, would likely require one or more of these data gathering processes. Evaluation decisions may be based on either quantitative or qualitative data; the type of data that is most useful depends entirely on the nature of the evaluation question. An example of an evaluation issue might be, “Does our training department serve the needs of the company?”

Practice

Here are some statements related to these four concepts. See whether you can classify them as issues related to Testing, Measurement, Assessment, or Evaluation:

“She was able to install the air conditioner without error during the allotted time.”

“Personality inventories indicate that our programmers tend to have higher extroversion scores than introversion.”

“Does the pilot test process we use really tell us anything about how well our instruction works?”

“What types of tasks characterize the typical day of a submarine officer?”

Feedback

Testing

Measurement

Evaluation

Assessment

What Does a Test Score Mean?

Suppose you had to take an important test. In fact, this test was so important that you had studied intensively for five weeks. Suppose then that, when you went to take the test, the temperature in the room was 45 degrees. After 20 minutes, all you could think of was getting out of the room, never mind taking the test. On the other hand, suppose you had to take a test for which you never studied. By chance a friend dropped by the morning of the test and showed you the answer key. In both situations, the score you receive on the test probably doesn’t accurately reflect what you actually know. In the first instance, you may have known more than the test score showed, but the environment was so uncomfortable that you couldn’t attend to the test. In the second instance, you probably knew less than the test score showed due now to another type of “environmental” influence.

In either instance, the score you received on the test (your observed score) was a combination of what you really knew (your true score) and those factors that modified your true score (error). The relationship of these score components is the basis for all test theory and is usually expressed by a simple equation:

Xo = Xt + Xe

where Xo is the observed score, Xt the true score, and Xe the error component. It is very important to remember that in test theory “error” doesn’t mean a wrong answer. It means the factor that accounts for any mismatch between a test-taker’s actual level of knowledge (the true score) and the test score the person receives. Error can make a score higher (as we saw when your friend dropped by) or lower (when it got too cold to concentrate).

The primary purpose of a systematic approach to test design is to reduce the error component so that the observed score and the true score are as nearly identical as possible. All the procedures we will discuss and recommend in this book will be tied to a simple assumption: the primary purpose of test development is the reduction of error. We think of the results of test development like this:

Xo ≈ Xt

where error has been reduced to the lowest possible level.

Realistically, there will always be some error in a test score, but careful attention to the principles of test development and administration will help reduce the error component.
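To make the error component concrete, here is a minimal simulation sketch (not from the book; the 80 percent cut-off score, the number of test-takers, and the error spreads are illustrative assumptions). It generates hypothetical true scores, adds random error of different sizes to produce observed scores, and counts how often error alone changes a master/non-master decision.

```python
import random

random.seed(42)  # fixed seed so the illustration is reproducible

CUT_OFF = 80  # hypothetical mastery cut-off (percent correct)

def observed(true_score: float, error_sd: float) -> float:
    """Classical test theory: observed score Xo = true score Xt + error Xe."""
    score = true_score + random.gauss(0, error_sd)
    return max(0.0, min(100.0, score))  # keep the score on the 0-100 scale

# Hypothetical true scores for twenty test-takers
true_scores = [random.uniform(55, 95) for _ in range(20)]

for error_sd in (1, 5, 15):  # small, moderate, and large error components
    flipped = 0
    for xt in true_scores:
        xo = observed(xt, error_sd)
        # A decision "flips" when error pushes a true master below the
        # cut-off or a true non-master above it.
        if (xt >= CUT_OFF) != (xo >= CUT_OFF):
            flipped += 1
    print(f"error SD = {error_sd:>2}: {flipped} of {len(true_scores)} "
          "mastery decisions changed by error")
```

The exact counts will vary with the random seed; the point is simply that the smaller the error component, the more closely observed scores, and the mastery decisions based on them, track the true scores.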

Practice

See if you can list at least three situations that could inflate a test-taker’s score and three that could reduce the score:

Inflation Factors

1. Sees answer key
2. __________
3. __________
4. __________

Reduction Factors

1. Room too cold
2. __________
3. __________
4. __________

Feedback

Inflation Factors

1. Sees answer key
2. Looks at someone’s answers
3. Unauthorized job aid used
4. Answers are cued in test

Reduction Factors

1. Room too cold
2. Test scheduled too early
3. Noisy heating system in room
4. Can’t read test directions