Criterion-Referenced Test Development is designed specifically for training professionals who need to better understand how to develop criterion-referenced tests (CRTs). This important resource offers step-by-step guidance on how to make and defend Level 2 testing decisions, how to write test questions and performance scales that match jobs, and how to show that those certified as “masters” are truly masters. A comprehensive guide to the development and use of CRTs, the book covers a variety of topics, including methods of test interpretation, test construction, item formats, test scoring, reliability and validation methods, test administration, and score reporting, as well as the legal and liability issues surrounding testing. New to this edition: illustrative real-world examples; coverage of test security; advice on the use of test creation software; expanded sections on performance testing; single-administration techniques for calculating reliability; and updated legal and compliance guidelines. Order the third edition of this classic and comprehensive reference guide to the theory and practice of organizational testing today.
Page count: 634
Publication year: 2008
List of Figures, Tables, and Sidebars
Introduction: A Little Knowledge Is Dangerous
Why Test?
Why Read This Book?
A Confusing State of Affairs
Testing and Kirkpatrick’s Levels of Evaluation
Certification in the Corporate World
Corporate Testing Enters the New Millennium
What Is to Come . . .
Part I: Background: The Fundamentals
Chapter One: Test Theory
What Is Testing?
What Does a Test Score Mean?
Reliability and Validity: A Primer
Concluding Comment
Chapter Two: Types of Tests
Criterion-Referenced Versus Norm-Referenced Tests
Six Purposes for Tests in Training Settings
Three Methods of Test Construction (One of Which You Should Never Use)
Part II: Overview: The CRTD Model and Process
Chapter Three: The CRTD Model and Process
Relationship to the Instructional Design Process
The CRTD Process
Summary
Part III: The CRTD Process: Planning and Creating the Test
Chapter Four: Plan Documentation
Why Document?
What to Document
The Documentation
Chapter Five: Analyze Job Content
Job Analysis
Job Analysis Models
DACUM
Hierarchies
Bloom’s Original Taxonomy
Bloom’s Revised Taxonomy
Gagné’s Learned Capabilities
Merrill’s Component Design Theory
Data-Based Methods for Hierarchy Validation
Who Killed Cock Robin?
Chapter Six: Content Validity of Objectives
Overview of the Process
The Role of Objectives in Item Writing
A Word from the Legal Department About Objectives
The Certification Suite
How to Use the Certification Suite
Converting Job-Task Statements to Objectives
In Conclusion
Chapter Seven: Create Cognitive Items
What Are Cognitive Items?
Classification Schemes for Objectives
Types of Test Items
The Key to Writing Items That Match Jobs
The Certification Suite
Guidelines for Writing Test Items
How Many Items Should Be on a Test?
Summary of Determinants of Test Length
A Cookbook for the SME
Deciding Among Scoring Systems
Chapter Eight: Create Rating Instruments
What Are Performance Tests?
Product Versus Process in Performance Testing
Four Types of Rating Scales for Use in Performance Tests (Two of Which You Should Never Use)
Open Skill Testing
Chapter Nine: Establish Content Validity of Items and Instruments
The Process
Establishing Content Validity—The Single Most Important Step
Two Other Types of Validity
Summary Comment About Validity
Chapter Ten: Initial Test Pilot
Why Pilot a Test?
Six Steps in the Pilot Process
Preparing to Collect Pilot Test Data
Before You Administer the Test
When You Administer the Test
Honesty and Integrity in Testing
Chapter Eleven: Statistical Pilot
Standard Deviation and Test Distributions
Item Statistics and Item Analysis
Choosing Item Statistics and Item Analysis Techniques
Garbage In-Garbage Out
Chapter Twelve: Parallel Forms
Paper-and-Pencil Tests
Computerized Item Banks
Reusable Learning Objects
Chapter Thirteen: Cut-Off Scores
Determining the Standard for Mastery
The Outcomes of a Criterion-Referenced Test
The Necessity of Human Judgment in Setting a Cut-Off Score
Three Procedures for Setting the Cut-Off Score
Borderline Decisions
Problems with Correction-for-Guessing
The Problem of the Saltatory Cut-Off Score
Chapter Fourteen: Reliability of Cognitive Tests
The Concepts of Reliability, Validity, and Correlation
Types of Reliability
Single-Test-Administration Reliability Techniques
Calculating Reliability for Single-Test Administration Techniques
Two-Test-Administration Reliability Techniques
Calculating Reliability for Two-Test Administration Techniques
Comparison of ϕ, Po, and κ
The Logistics of Establishing Test Reliability
Recommendations for Choosing a Reliability Technique
Summary Comments
Chapter Fifteen: Reliability of Performance Tests
Reliability and Validity of Performance Tests
Inter-Rater Reliability
Repeated Performance and Consecutive Success
Procedures for Training Raters
What if a Rater Passes Everyone Regardless of Performance?
What if You Get a High Percentage of Agreement among Raters but a Negative Phi Coefficient?
Chapter Sixteen: Report Scores
CRT Versus NRT Reporting
Summing Subscores
What Should You Report to a Manager?
Is There a Legal Reason to Archive the Tests?
A Final Thought About Testing and Teaching
Part IV: Legal Issues in Criterion-Referenced Testing
Chapter Seventeen: Criterion-Referenced Testing and Employment Selection Laws
What Do We Mean by Employment Selection Laws?
Who May Bring a Claim?
A Short History of the Uniform Guidelines on Employee Selection Procedures
Legal Challenges to Testing and the Uniform Guidelines
Balancing CRTs with Employment Discrimination Laws
Watch Out for Blanket Exclusions in the Name of Business Necessity
Adverse Impact, the Bottom Line, and Affirmative Action
Accommodating Test-Takers with Special Needs
Test Validation Criteria: General Guidelines
Test Validation: A Step-by-Step Guide
Keys to Maintaining Effective and Legally Defensible Documentation
Is Your Criterion-Referenced Testing Legally Defensible? A Checklist
A Final Thought
Epilogue: CRTD as Organizational Transformation
References
About the Authors
Index
Advertisements
End User License Agreement
Figure 1.1a. Reliable, But Not Valid.
Figure 1.1b. Neither Reliable Nor Valid.
Figure 1.1c. Reliable and Valid.
Figure 2.1. Example Frequency Distribution.
Figure 2.2. Ideal NRT Frequency Distribution.
Figure 2.3. The Normal Distribution.
Figure 2.4. Mastery Curve.
Figure 2.5. Example of an Objective with Corresponding Test Item.
Figure 3.1. Designing Criterion-Referenced Tests.
Figure 4.1. The CRTD Process and Documentation.
Figure 5.1. Hierarchical Relationship of Skills.
Figure 5.2. Extended Hierarchical Analysis.
Figure 5.3. Hierarchical Task Analysis, Production-Operations Manager.
Figure 5.4. Cognitive Levels of Bloom’s Taxonomy.
Figure 5.5. Hierarchy Illustrating Correct Bloom Sequence.
Figure 5.6. Hierarchy Illustrating Incorrect Bloom Sequence.
Figure 5.7. Bloom’s Levels Applied to the Production Manager Content Hierarchy.
Figure 5.8. Application of Gagné’s Intellectual Skills to Hierarchy Validation.
Figure 5.9. Application of Merrill’s Component Design Theory to Hierarchy Validation.
Figure 5.10. Analysis of Posttest Scores to Validate a Hierarchy, Example of a Valid Hierarchy.
Figure 5.11. Analysis of Posttest Scores to Validate a Hierarchy, Example of an Invalid Hierarchy.
Figure 6.1. Selecting the Certification Level.
Figure 8.1. Numerical Scale.
Figure 8.2. Descriptive Scale.
Figure 8.3. Behaviorally Anchored Rating Scale.
Figure 8.4. Checklist.
Figure 8.5. Criterion-Referenced Performance Test.
Figure 8.6. Sample Form Used to Score the Task of Merging into Traffic.
Figure 9.1. Test Content Validation Form.
Figure 9.2. Test Content Validation Results Form.
Figure 9.3. Content Validity Index Scales.
Figure 9.4. CVI-Relevance for a Test Item.
Figure 9.5. Phi Table for Concurrent Validity.
Figure 9.6. Example Phi Table for Concurrent Validity.
Figure 9.7. Blank Table for Practice Phi Calculation: Concurrent Validity.
Figure 9.8. Answer for Practice Phi Calculation: Concurrent Validity.
Figure 9.9. Phi Table for Predictive Validity.
Figure 10.1. Kincaid Readability Index.
Figure 11.1. Standard Normal Curve.
Figure 11.2. Standard Deviations of a Normal Curve.
Figure 11.3. Frequency Distributions with Standard Deviations of Various Sizes.
Figure 11.4. Skewed Curves.
Figure 11.5. Mastery Curve.
Figure 11.6. The Upper/Lower Index for a CRT.
Figure 11.7. Phi Table for Item Analysis.
Figure 13.1. Outcomes of a Criterion-Referenced Test.
Figure 13.2. Contrasting Groups Method of Cut-Off Score Estimation.
Figure 13.3. Frequency Distributions for Using the Contrasting Groups Method.
Figure 13.4. Application of the Standard Error of Measurement.
Figure 13.5. The Test Score as a Range Rather Than a Point.
Figure 13.6. Correction-for-Guessing Formula.
Figure 14.1. Graphic Illustrations of Correlation.
Figure 14.2. Phi Table for Test-Retest Reliability.
Figure 14.3. Example Phi Table for Test-Retest Reliability.
Figure 14.4. Blank Table for Practice Phi Calculation.
Figure 14.5. Answer for Practice Phi Calculation, Test-Retest Reliability.
Figure 14.6. Matrix for Determining po and pchance.
Figure 14.7. Example Matrix for Determining po and pchance.
Figure 14.8. Blank Matrix for Determining po and pchance.
Figure 14.9. Completed Practice Matrix for Determining po and pchance.
Figure 15.1. Matrix for Determining po and pchance.
Figure 15.2. Example po and pchance Matrix, Judges 1 & 2.
Figure 15.3. Example po and pchance Matrix, Judges 1 & 3.
Figure 15.4. Example po and pchance Matrix, Judges 2 & 3.
Figure 15.5. Blank po and pchance Matrix, Judges 1 & 2.
Figure 15.6. Blank po and pchance Matrix, Judges 1 & 3.
Figure 15.7. Blank po and pchance Matrix, Judges 2 & 3.
Figure 15.8. Answer for po and pchance Matrix, Judges 1 & 2.
Figure 15.9. Answer for po and pchance Matrix, Judges 1 & 3.
Figure 15.10. Answer for po and pchance Matrix, Judges 2 & 3.
Figure 15.11. Matrix for Calculating Phi.
Figure 15.12. Matrix for Calculating Phi, Judges 1 & 2.
Figure 15.13. Matrix for Calculating Phi, Judges 1 & 3.
Figure 15.14. Matrix for Calculating Phi, Judges 2 & 3.
Figure 15.15. Blank Matrix for Calculating Phi, Judges 1 & 2.
Figure 15.16. Blank Matrix for Calculating Phi, Judges 1 & 3.
Figure 15.17. Blank Matrix for Calculating Phi, Judges 2 & 3.
Figure 15.18. Answer Matrix for Calculating Phi, Judges 1 & 2.
Figure 15.19. Answer Matrix for Calculating Phi, Judges 1 & 3.
Figure 15.20. Answer Matrix for Calculating Phi, Judges 2 & 3.
Figure 15.21. Klaus and Lisa Phi Calculation When Klaus Passes All Test-Takers.
Figure 15.22. Phi When Klaus Agrees with Lisa on One Performer She Failed.
Figure 15.23. Phi When Klaus and Lisa Agree One Previously Passed Performer Failed.
Figure 15.24. Phi When Klaus Fails One Performer Lisa Passed.
Figure 15.25. Phi When Gina and Omar Agree on One More Passing Performer.
Figure 15.26. Phi When Gina and Omar Agree on One Failing Performer.
Figure 15.27. Phi When Gina and Omar Are More Consistent in Disagreeing.
Table 5.1. DACUM Research Chart for Computer Applications Programmer.
Table 5.2. Standard Task Analysis Form.
Table 5.3. Summary of the Revised Bloom’s Taxonomy.
Table 6.1. Summary of Certification Suite.
Table 7.1. Bloom’s Taxonomy on the Battlefield: A Scenario of How Bloom’s Levels Occur in a Combat Environment.
Table 7.2. Decision Table for Estimating the Number of Items Per Objective to Be Included on a Test.
Table 7.3. Summary of SME Item Rating for Unit 1 Production Manager Test.
Table 9.1. Example of Concurrent Validity Data.
Table 9.2. Example of Concurrent Validity Data.
Table 12.1. Angoff Ratings for Items in an Item Bank.
Table 13.1. Judges’ Probability Estimates (Angoff Method).
Table 13.2. Possible Probability Estimates, Angoff Method.
Table 13.3. Example Test Results for Using the Contrasting Groups Method.
Table 14.1. Comparison of Three Single-Test Administration Reliability Estimates.
Table 14.2. Example of Test-Retest Data for a CRT.
Table 14.3. Sample Test-Retest Data.
Table 15.1. Example Performance Test Data, Inter-Rater Reliability.
Table 15.2. Sample Performance Test Data, Inter-Rater Reliability.
Table 15.3. Conversion Table for ϕ (r) into Z.
Table 16.1. Calculation of the Overall Course Cut-Off Score.
Table 16.2. Calculation of an Individual’s Weighted Performance Score.
Table 17.1. Sample Summary of Adverse Impact Figures.
Table 17.2. Summary of Adverse Impact Figures for Practice.
Today’s organizations feel both external and internal pressures to test the competence of those who work for them. The current global, competitive, regulated, and litigation savvy economic environment has increased external pressures, while the resulting increased investment in training and the escalating cost of high-tech instructional and human performance systems create internal pressure for accountability. Many products are now so complicated that human testing systems to ensure the product’s correct operation and maintenance have become virtually part of the product marketed to buyers. Valid, informative, and legally defensible competence testing has become essential to many organizations, yet the technology for creating these assessments has historically been shrouded in academic circles impenetrable to those without advanced degrees in measurement sciences.
This book presents a straightforward model for the creation of legally defensible criterion-referenced tests designed to determine whether or not test-takers have mastered job-related knowledge and performance skills. Exercises with accompanying feedback allow you to monitor your own proficiency with the concepts and procedures introduced. Furthermore, the issues that are most likely to be encountered at each step in the test development process are fully elaborated to enable you to make sound decisions and actually complete and document the creation of valid testing systems.
This book is divided into five main sections. The first two introduce essential, fundamental testing concepts and present an overview of the entire CRTD Model. Part Three includes a chapter devoted to the elaboration of each step in the CRTD Model. A thorough discussion of the legal issues surrounding test creation and administration constitutes Part Four. Examples and exercises with feedback are used liberally throughout the book to facilitate understanding, engagement, and proficiency in the CRTD process. The final piece in the book is a brief epilogue reflecting on the profound and often unforeseen impact that testing can have on overall organizational performance.
Pfeiffer serves the professional development and hands-on resource needs of training and human resource practitioners and gives them products to do their jobs better. We deliver proven ideas and solutions from experts in HR development and HR management, and we offer effective and customizable tools to improve workplace performance. From novice to seasoned professional, Pfeiffer is the source you can trust to make yourself and your organization more successful.
Essential Knowledge Pfeiffer produces insightful, practical, and comprehensive materials on topics that matter the most to training and HR professionals. Our Essential Knowledge resources translate the expertise of seasoned professionals into practical, how-to guidance on critical workplace issues and problems. These resources are supported by case studies, worksheets, and job aids and are frequently supplemented with CD-ROMs, websites, and other means of making the content easier to read, understand, and use.
Essential Tools Pfeiffer’s Essential Tools resources save time and expense by offering proven, ready-to-use materials–including exercises, activities, games, instruments, and assessments–for use during a training or team-learning event. These resources are frequently offered in looseleaf or CD-ROM format to facilitate copying and customization of the material.
Pfeiffer also recognizes the remarkable power of new technologies in expanding the reach and effectiveness of training. While e-hype has often created whizbang solutions in search of a problem, we are dedicated to bringing convenience and enhancements to proven training solutions. All our e-tools comply with rigorous functionality standards. The most appropriate technology wrapped around essential content yields the perfect solution for today’s on-the-go trainers and human resource professionals.
Essential resources for training and HR professionals
The International Society for Performance Improvement (ISPI) is dedicated to improving individual, organizational, and societal performance. Founded in 1962, ISPI is the leading international association dedicated to improving productivity and performance in the workplace. ISPI represents more than 10,000 international and chapter members throughout the United States, Canada, and forty other countries.
ISPI’s mission is to develop and recognize the proficiency of our members and advocate the use of Human Performance Technology. This systematic approach to improving productivity and competence uses a set of methods and procedures and a strategy for solving problems for realizing opportunities related to the performance of people. It is a systematic combination of performance analysis, cause analysis, intervention design and development, implementation, and evaluation that can be applied to individuals, small groups, and large organizations.
Website: www.ispi.org
Mail: International Society for Performance Improvement
1400 Spring Street, Suite 260
Silver Spring, Maryland 20910 USA
Phone: 1.301.587.8570
Fax: 1.301.587.8573
E-mail: [email protected]
3rd Edition
Sharon A. Shrock
William C. Coscarelli
Copyright © 2007 by John Wiley & Sons, Inc. All rights reserved.
Published by Pfeiffer
An Imprint of Wiley
989 Market Street, San Francisco, CA 94103-1741
www.pfeiffer.com
Wiley Bicentennial logo: Richard J. Pacifico
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400, fax 978-646-8600, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, 201-748-6011, fax 201-748-6008, or online at http://www.wiley.com/go/permissions.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
Readers should be aware that Internet websites offered as citations and/or sources for further information may have changed or disappeared between the time this was written and when it is read.
For additional copies/bulk purchases of this book in the U.S. please contact 800-274-4434.
Pfeiffer books and products are available through most bookstores. To contact Pfeiffer directly call our Customer Care Department within the U.S. at 800-274-4434, outside the U.S. at 317-572-3985, fax 317-572-4002, or visit www.pfeiffer.com.
Pfeiffer also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.
Library of Congress Cataloging-in-Publication Data
Shrock, Sharon A.
Criterion-referenced test development : technical and legal guidelines for corporate training / Sharon A. Shrock and William C. Coscarelli. – 3rd ed.
p. cm.
Includes index.
ISBN 978-0-7879-8850-0 (pbk.)
1. Employees–Training of–Evaluation. 2. Criterion-referenced tests.
I. Coscarelli, William C. II. Title.
HF5549.5.T7S554 2007
658.3'12404–dc22
2007019607
Acquiring Editor: Matthew Davis
Marketing Manager: Jeanenne Ray
Director of Development: Kathleen Dolan Davies
Developmental Editor: Susan Rachmeler
Production Editor: Michael Kay
Editorial Assistant: Julie Rodriguez
Editor: Rebecca Taff
Manufacturing Supervisor: Becky Morgan
Dedicated to Rubye and Don and to Kate and Cyra and Maybeline
The Stakes of an Assessment
Using Assessments for Compliance
Documenting a Test Security Plan
The Man with the Multiple-Choice Mind
Why Computerized Testing Is Preferred in Business
The Difference Between Performance and Knowledge Tests
Should You Use an Independent Testing Center?
Cheating: It’s Not Just for Breakfast Anymore
Using Statistical Methods to Detect Cheating
Today’s business and technological environment has increased the need for assessment of human competence. Any competitive advantage in the global economy requires that the most competent workers be identified and retained. Furthermore, training and development, HRD, and performance technology agencies are increasingly required to justify their existence with evidence of effectiveness. These pressures have heightened the demand for better assessment and the distribution of assessment data to line managers to achieve organizational goals. These demands increasingly present us with difficult issues. For example, if you haven’t tested, how can you show that those graduates you certify as “masters” are indeed masters and can be trusted to perform competently while handling dangerous or expensive equipment or materials? What would you tell an EEO officer who presented you with a grievance from an employee who was denied a salary increase based on a test you developed? These and other important questions need to be answered for business, ethical, and legal reasons. And they can be answered through doable and cost-effective test systems.
So, as certification and competency testing are increasingly used in business and industry, correct testing practices provide the data needed for rational decision making.
Corporate training, driven by competition and keen awareness of the “bottom line,” has a certain intensity about it. Errors in instructional design or employees’ failure to master skills or content can cause significant negative consequences. It is not surprising, then, that corporate trainers are strong proponents of the systematic design of criterion-referenced instructional systems. What is surprising is the general lack of emphasis on a parallel process for the assessment of instructional outcomes—in other words, testing.
All designers of instruction acknowledge the need for appropriate testing strategies, and non-instructional interventions also frequently require the assessment of human competence, whether in the interest of needs assessment, the formation of effective work teams, or the evaluation of the intervention.
Most training professionals have taken at least one intensive course in the design of instruction, but most have never had similar training in the development of criterion-referenced tests—tests that compare persons against a standard of competence, instead of against other persons (norm-referenced tests). It is not uncommon for a forty-hour workshop in the systematic design of instruction to devote less than four hours to the topic of test development—focusing primarily on item writing skills. With such minimal training, how can we make and defend our assessment decisions?
Without an understanding of the basic principles of test design, you can face difficult ethical, economic, or legal problems. For these and other reasons, test development should stand on an equal footing with instructional development—for if it doesn’t, how will you know whether your instructional objectives were achieved and how will you convince anyone else that they were?
Criterion-Referenced Test Development translates complex testing technology into sound technical practice within the grasp of a non-specialist. Hence, one of the themes that we have woven into the book is that testing properly is often no more expensive or time-consuming than testing improperly. For example, we have been able to show how to create a defensible certification test for a forty-hour administrative training course using a test that takes fewer than fifteen minutes to administer and probably less than a half-day to create. It is no longer acceptable simply to write test items without regard to a defensible process. Specific knowledge of the strengths and limitations of both criterion-referenced and norm-referenced testing is required to address the information needs of the world today.
Grade schools, high schools, universities, and corporations share many similar reasons for not having adopted the techniques for creating sound criterion-referenced tests. We have found three reasons that seem to explain why those who might otherwise embrace the systematic process of test design have not: misleading familiarity, inaccessible information, and procedural confusion. In each instance, it seems that a little knowledge about testing has proven dangerous to the quality of the criterion-referenced test.
As training professionals, few of us teach the way we were taught. However, most of us are still testing the way we were tested. Since every adult has taken many tests while in school, there is a misleading familiarity with them. There is a tendency to believe that everyone already knows how to write a test. This belief is an error, not only because exposure does not guarantee know-how, but because most of the tests to which we were exposed in school were poorly constructed. The exceptions—the well-constructed tests in our past—tend to be the group-administered standardized tests, for example, the Iowa Tests of Basic Skills or the SAT. Unfortunately for corporate trainers, these standardized tests are good examples of norm-referenced tests, not of criterion-referenced tests. Norm-referenced tests are designed for completely different purposes than criterion-referenced tests, and each is constructed and interpreted differently. Most teacher-made tests are “mongrels,” having characteristics of both norm-referenced and criterion-referenced tests—to the detriment of both.
Criterion-referenced testing technology is scarce in corporate training partly because the technology of creating these tests has been slow to develop. Even now with so much emphasis on minimal competency testing in the schools, the vast majority of college courses on tests and measurements are about the principles of creating norm-referenced tests. In other words, even if trainers want to “do the right thing,” answers to important questions are hard to come by. Much of the information about criterion-referenced tests has appeared only in highly technical measurement journals. The technology to improve practice in this area just hasn’t been accessible.
A final pitfall in good criterion-referenced test development is that both norm-referenced tests and criterion-referenced tests share some of the same fundamental measurement concepts, such as reliability and validity. Test creators don’t always seem to know how these concepts must be modified to be applied to the two different kinds of tests.
Recently, we saw an article in a respected corporate training publication that purported to detail all the steps necessary to establish the reliability of a test. The procedures that were described, however, will work only for norm-referenced tests. Since the article appeared in a training journal, we question the applicability of the information to the vast majority of testing that its readers will conduct. Because the author was the head of a training department, we had to appreciate his sensitivity to the value of a reliability estimate in the test development process, yet the article provided a clear illustration of procedural confusion in test development, even among those with some knowledge of basic testing concepts.
In 1994 Donald Kirkpatrick presented a classification scheme for four levels of evaluation in business organizations that has permeated much of management’s current thinking about evaluation. We want to review these levels and then share two observations. First, the four levels:
Level 1, or Reaction evaluations, measure “how those who participate in the program react to it ... I call it a measure of customer satisfaction” (p. 21).
Level 2, or Learning evaluations, “can be defined as the extent to which participants change attitudes, improve knowledge, and/or increase skill as a result of attending the program” (p. 22). Criterion-referenced assessments of competence are the skill and knowledge assessments that typically take place at the end of training. They seek to measure whether desired competencies have been mastered and so typically measure against a specific set of course objectives.
Level 3, or Behavior evaluations, “are defined as the extent to which change in behavior has occurred because the participant attended the training program” (p. 23). These evaluations are usually designed to assess the transfer of training from the classroom to the job.
Level 4, or Results evaluation, is designed to determine “the final results that occurred because the participants attended the program” (p. 25). Typically, this level of evaluation is seen as an estimate of the return to the organization on its investment in training. In other words, what is the cost-benefit ratio to the organization from the use of training?
We would like to make two observations about criterion-referenced testing and this model. The first observation is:
Level 2 evaluation of skills and knowledge is synonymous with the criterion-referenced testing process described in this book.
The second observation is more controversial, but supported by Kirkpatrick:
You cannot do Level 3 and Level 4 evaluations until you have completed Level 2 evaluations.
Kirkpatrick argued:
Some trainers are anxious to get to Level 3 or 4 right away because they think the first two aren’t as important. Don’t do it. Suppose, for example, that you evaluate at Level 3 and discover that little or no change in behavior has occurred. What conclusions can you draw? The first conclusion is probably that the training program was no good, and we had better discontinue it or at least modify it. This conclusion may be entirely wrong ... the reason for no change in job behavior may be that the climate prevents it. Supervisors may have gone back to the job with the necessary knowledge, skills, and attitudes, but the boss wouldn’t allow change to take place. Therefore, it is important to evaluate at Level 2 so you can determine whether the reason for no change in behavior was lack of learning or negative job climate. (p. 72)
Here’s another perspective on this point, by way of an analogy:
Suppose your company manufactures sheet metal. Your factory takes resources, processes the resources to produce the metal, shapes the metal, and then distributes the product to your customers. One day you begin to receive calls. “Hey,” says one valued customer, “this metal doesn’t work! Some sheets are too fat, some too thin, some just right! I’m never quite sure when they’ll work on the job! What am I getting for my money?” “What?” you reply, “They ought to work! We regularly check with our workers, who are very good, and they all feel we do good work.” “I don’t care what they think,” says the customer, “the stuff just doesn’t work!”
Now, substitute the word “training” for “sheet metal” and we see the problem. Your company takes resources and produces training. Your trainees say that the training is good (Level 1—What did the learner think of the instruction?), but your customers report that what they are getting on the job doesn’t match their needs (Level 3—What is taken from training and applied on the job?), and as a result, they wonder what their return on investment is (Level 4—What is the return on investment [ROI] from training?). Your company has a problem because the quality of the process, that is, training (Level 2—What did the learner learn from instruction?) has not been assessed; as a result, you really don’t know what is going on during your processes. And now that you have evidence the product doesn’t work, you have no idea where to begin to fix the problem. No viable manufacturer would allow its products to be shipped without making sure they met product specifications. But training is routinely completed without a valid and reliable measure of its outcomes. Supervisors ask about on-the-job relevance, managers wonder about the ROI from training, but neither question can be answered until the outcomes of training have been assessed. If you don’t know what they learned in training, you can’t tell what they transferred from training to the job and what its costs and benefits are! (Coscarelli & Shrock, 1996, p. 210)
In conclusion, we agree completely with Kirkpatrick when he wrote “Some trainers want to bypass Levels 1 and 2. ... This is a serious mistake” (p. 23).
In the 1970s, few organizations offered certification programs, for example, the Chartered Life Underwriter (CLU), Certified Production and Inventory Management (CPIM). By the late 1990s certification had become, literally, a growth industry. Internal corporate certification programs proliferated and profession-wide certification testing had become a profit center for some companies, including Novell, Microsoft, and others. The Educational Testing Service opened its first for-profit center, the Chauncey Group, to concentrate on certification test development and human resources issues. Sylvan became known in the business world as the primary provider of computer-based, proctored, testing centers. There are many reasons why such an interest has developed. Thomas (1996) identifies seven elements and observes that the “theme underlying all of these elements is the need for accountability and communication, especially on a global basis” (p. 276). Because the business world remains market-driven, the classic academic definitions of terms related to testing have become blurred so that various terms in the field of certification have different meanings. While a tonsil is a tonsil is a tonsil in the medical world, certification may not mean the same thing to each member in a discussion. While in Chapter 6 we present a tactical way to think about certification program design (The Certification Suite), here we want to clarify a few terms that are often ill-defined or confused.
Certification “is a formal validation of knowledge or skill ... based on performance on a qualifying examination ... the goal is to produce results that are as dependable or more dependable than those that could be gained by direct observation (on the job)” (Drake Prometric, 1995, p. 2). Certification should provide “an objective and consistent method of measuring competence and ensuring the qualifications of technical professionals” (Microsoft, 1995, p. 3). Certification usually means measuring a person’s competence against a given standard—a criterion-referenced test interpretation. The certification test seeks to measure an individual’s performance in terms of specific skills the individual has demonstrated and without regard to the performance of other test-takers. There is no limit to the number of test-takers who can succeed on a criterion-referenced test—everyone who scores beyond a given level is judged a “master” of the competencies covered by the test. (The term “master” doesn’t usually mean the rare individual who excels far beyond peers; the term simply means someone competent in the performance of the skills covered by the test.) “The intent of certification ... normally is to inform the public that individuals who have achieved certification have demonstrated a particular degree of knowledge and skill (and) is usually a voluntary process instituted by a nongovernmental agency” (Fabrey, 1996, p. 3).
Licensure, by contrast, “generally refers to the mandatory governmental requirement necessary to practice in a particular profession or occupation. Licensure implies both practice protection and title protection, in that only individuals who hold a license are permitted to practice and use a particular title” (Fabrey, 1996, p. 3). Licensure in the business world is rarely an issue in assessing employee competence but plays a major role in protecting society in areas of health care, teaching, law, and other professions.
Qualification is the assessment that a person understands the technology or processes of a system as it was designed or that he or she has a basic understanding of the system or process, but not to the level of certainty provided through certification testing. Qualification is the most problematic of the terms that are often used in business, and it is one we have seen develop primarily in the high-tech industries.
Qualification as a term has developed in many ways as a response to a problematic training situation. Customers (either internal or external to the business) demand that those sent for training be able to demonstrate competence on the job, while at the same time those doing the training and assessment have not been given a job task analysis that is specific to the organization’s need. Thus, the trainers cannot in good conscience represent that the trainees who have passed the tests in training can perform back at the work site. So, for example, if a company develops a new high-tech cell phone switching system, the same system can be configured in a variety of ways by each of the various regional telephone companies that purchase the switch. Without a training program customized to each company, the switch developer will offer training only in the characteristics of the switching system, or perhaps its most common configurations. That training would then “qualify” the trainee to configure and work with the switch within the idiosyncratic constraints of the particular employer. As you can see, the term is founded more on the practical realities of technology development and contract negotiation than on formal assessment. Organizations that provide training that cannot be designed to match the job requirement are often best served by drawing the distinction between certification and qualification early on in the contract negotiation stage, thus clarifying either formal or informal expectations.
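The criterion-referenced decision rule described above for certification (compare each candidate’s score with a fixed standard rather than with other test-takers, so any number of candidates can be judged masters) can be illustrated with a minimal sketch. The 85 percent cut-off score and the candidate data below are hypothetical illustrations, not values taken from this book.

```python
# Minimal sketch (hypothetical values) of a criterion-referenced mastery decision:
# each test-taker is compared with a fixed cut-off score, never with other test-takers,
# so there is no limit to how many candidates can be classified as "masters."

CUT_OFF = 0.85  # hypothetical proportion-correct standard for mastery


def classify(score: float, cut_off: float = CUT_OFF) -> str:
    """Return 'master' if the score meets or exceeds the standard, otherwise 'nonmaster'."""
    return "master" if score >= cut_off else "nonmaster"


if __name__ == "__main__":
    # Hypothetical candidate scores expressed as proportion correct.
    candidate_scores = {"Ana": 0.92, "Ben": 0.84, "Chao": 0.88}
    for name, score in candidate_scores.items():
        print(f"{name}: {score:.0%} -> {classify(score)}")
```

A norm-referenced interpretation, by contrast, would rank these same scores against one another before deciding who passes; the distinction is developed in Chapter Two.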
By early 2000 certification had become less a growth industry and more a mature one. A number of the larger programs, for example, Hewlett-Packard and Microsoft, were well-established and operating on a stable basis. In-house certification programs did continue, but management more acutely examined the cost-benefit ratio for these programs. Meanwhile, in the United States the 2001 Federal act, No Child Left Behind, was signed into law and placed a new emphasis on school accountability for student learning progress. Interestingly, the discussion that was sparked by this act created a distinction in testing that was assimilated by both the academic and business communities and helped guide resource allocations. This concept is “often referred to as the stakes of the testing,” according to the Standards for Educational and Psychological Testing (AERA/APA/NCME Joint Committee, 1999, p. 139), which described a classification of sorts for the outcomes of testing and the implied level of rigor associated with each type of test’s design.
High Stakes Tests. A high stakes test is one in which “significant educational paths or choices of an individual are directly affected by test performance. ... Testing programs for institutions can have high stakes when aggregate performance of a sample or of the entire population of test-takers is used to infer the quality of service provided, and decisions are made about institutional status, rewards, or sanctions based on the test results” (AERA/APA/NCME Joint Committee, 1999, p. 139). While the definition of high stakes was intended for the public schools, it was easily translated into a corporate culture, where individual promotion, bonuses, or employment might all be tied to test performance or where entire departments, such as the training department, might be affected by test-taker performance.
Low Stakes Tests. At the other end of the continuum, the Standards defined low stakes tests as those that are “administered for informational purposes or for highly tentative judgments such as when test results provide feedback to students...” (p. 139).
These two ends of the continuum implied different levels of rigor and resources in test construction. This distinction was also indicated by the Standards:
The higher the stakes associated with a given test use, the more important it is that test-based inferences are supported with strong evidence of technical quality. In particular, when the stakes for an individual are high, and important decisions depend substantially on test performance, the test needs to exhibit higher standards of technical quality for its avowed purposes than might be expected of tests used for lower-stakes purposes ... Although it is never possible to achieve perfect accuracy in describing an individual’s performance, efforts need to be made to minimize errors in estimating individual scores in classifying individuals in pass/fail or admit/reject categories. Further, enhancing validity for high-stakes purposes, whether individual or institutional, typically entails collecting sound collateral information both to assist in understanding the factors that contributed to test results and to provide corroborating evidence that supports the inferences based on test results. (pp. 139–140)
In the following chapters, we will describe a systematic approach to the development of criterion-referenced tests. We recognize that not all tests are high-stakes tests, but the book does describe the steps you need to consider for developing a high-stakes criterion-referenced test. If your test doesn’t need to meet that standard, you can then decide which steps can be skipped, adapted, or adopted to meet your own particular needs. To help you do this, Criterion-Referenced Test Development (CRTD) is divided into five main sections:
In the Background, we provide a basic frame of reference for the entire test development process.
The Overview provides a detailed description of the Criterion-Referenced Test Development Process (CRTD) using the model we have created and tested in our work with more than forty companies.
Planning and Creating the Test describes how to proceed with the CRTD process using each of the thirteen steps in the model. Each step is explored as a separate chapter, and where appropriate, we have provided summary points that you may need to complete the CRTD documentation process.
Legal Issues in Criterion-Referenced Testing is authored by Patricia Eyres, a practicing attorney in the field, and deals with some of the important legal issues in the CRTD process.
Our Epilogue is a reflection on our experiences with testing. In fact, those of you starting a testing program in an organization may wish to read this chapter first! When we first began our work in CRTD, we thought of the testing process as the last “box” in the Instructional Development process. We have since come to understand that testing, when done properly, will often have serious consequences for the organization. These can be highly beneficial if the process is supported and well managed. However, we now view effective CRT systems not simply as discrete assessment devices, but as systemic interventions.
Periodically, we have provided an opportunity for practice and feedback. You will find that many of the topics in the Background are reinforced by exercises with corresponding answers and that, throughout the book, opportunities to practice applying the most important or difficult concepts are similarly provided.
We are also including short sidebars from individuals and organizations associated with the world of CRT, when we feel they can help illustrate a point in the process. Interestingly, most of the sidebars reflect the two areas that have developed most rapidly since our last edition—computer-based testing and processes to reduce cheating on tests.
There are four related terms that can be somewhat confusing at first: evaluation, assessment, measurement, and testing. These terms are sometimes used interchangeably; however, we think it is useful to make the following distinctions among them:
Testing is the collection of quantitative (numerical) information about the degree to which a competence or ability is present in the test-taker. There are right and wrong answers to the items on a test, whether it be a test comprised of written questions or a performance test requiring the demonstration of a skill. A typical test question might be: “List the six steps in the selling process.”
Measurement is the collection of quantitative data to determine the degree of whatever is being measured. There may or may not be right and wrong answers. A measurement inventory such as the Decision-Making Style Inventory might be used to determine a preference for using a Systematic style versus a Spontaneous one in making a sale. One style is not “right” and the other “wrong”; the two styles are simply different.
Assessment is systematic information gathering without necessarily making judgments of worth. It may involve the collection of quantitative or qualitative (narrative) information. For example, by using a series of personality inventories and through interviewing, one might build a profile of “the aggressive salesperson.” (Many companies use Assessment Centers as part of their management training and selection process. However, as the results from these centers are usually used to make judgments of worth, they are more properly classed as evaluation devices.)
Evaluation is the process of making judgments regarding the appropriateness of some person, program, process, or product for a specific purpose. Evaluation may or may not involve testing, measurement, or assessment. Most informed judgments of worth, however, would likely require one or more of these data gathering processes. Evaluation decisions may be based on either quantitative or qualitative data; the type of data that is most useful depends entirely on the nature of the evaluation question. An example of an evaluation issue might be, “Does our training department serve the needs of the company?”
Here are some statements related to these four concepts. See whether you can classify them as issues related to Testing, Measurement, Assessment, or Evaluation:
“She was able to install the air conditioner without error during the allotted time.”
“Personality inventories indicate that our programmers tend to have higher extroversion scores than introversion.”
“Does the pilot test process we use really tell us anything about how well our instruction works?”
“What types of tasks characterize the typical day of a submarine officer?”
Answers: 1. Testing. 2. Measurement. 3. Evaluation. 4. Assessment.
Suppose you had to take an important test. In fact, this test was so important that you had studied intensively for five weeks. Suppose then that, when you went to take the test, the temperature in the room was 45 degrees. After 20 minutes, all you could think of was getting out of the room, never mind taking the test. On the other hand, suppose you had to take a test for which you never studied. By chance a friend dropped by the morning of the test and showed you the answer key. In both situations, the score you receive on the test probably doesn’t accurately reflect what you actually know. In the first instance, you may have known more than the test score showed, but the environment was so uncomfortable that you couldn’t attend to the test. In the second instance, you probably knew less than the test score showed due now to another type of “environmental” influence.
In either instance, the score you received on the test (your observed score) was a combination of what you really knew (your true score) and those factors that modified your true score (error). The relationship of these score components is the basis for all test theory and is usually expressed by a simple equation:

Xo = Xt + Xe

where Xo is the observed score, Xt the true score, and Xe the error component. It is very important to remember that in test theory “error” doesn’t mean a wrong answer. It means the factor that accounts for any mismatch between a test-taker’s actual level of knowledge (the true score) and the test score the person receives. Error can make a score higher (as we saw when your friend dropped by) or lower (when it got too cold to concentrate).
The primary purpose of a systematic approach to test design is to reduce the error component so that the observed score and the true score are as nearly identical as possible. All the procedures we will discuss and recommend in this book will be tied to a simple assumption: the primary purpose of test development is the reduction of error. We think of the results of test development like this:

Xo ≈ Xt

where error has been reduced to the lowest possible level.
Realistically, there will always be some error in a test score, but careful attention to the principles of test development and administration will help reduce the error component.
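To make the score-component idea concrete, here is a minimal simulation sketch (not from the book) in which each observed score is a true score plus a random error term; shrinking the error term pulls observed scores toward true scores, which is what careful test development and administration aim to do. The true score of 80 and the error magnitudes are arbitrary illustrations.

```python
import random


def observed_score(true_score: float, error_sd: float) -> float:
    """Xo = Xt + Xe: an observed score is the true score plus a random error component."""
    error = random.gauss(0.0, error_sd)  # Xe can inflate or deflate the observed score
    return max(0.0, min(100.0, true_score + error))


if __name__ == "__main__":
    random.seed(7)
    true_score = 80.0                  # what the test-taker actually knows (hypothetical)
    for error_sd in (15.0, 5.0, 1.0):  # smaller error reflects better test design and administration
        scores = [observed_score(true_score, error_sd) for _ in range(1000)]
        avg_gap = sum(abs(s - true_score) for s in scores) / len(scores)
        print(f"error SD {error_sd:>4.0f}: average |Xo - Xt| = {avg_gap:.1f} points")
```

As the error term shrinks, the average gap between observed and true scores approaches zero, mirroring the Xo ≈ Xt goal described above.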
See if you can list at least three situations that could inflate a test-taker’s score and three that could reduce the score:
Inflation Factors: 1. Sees answer key; 2. __________; 3. __________; 4. __________
Reduction Factors: 1. Room too cold; 2. __________; 3. __________; 4. __________
Inflation Factors: 1. Sees answer key; 2. Looks at someone’s answers; 3. Unauthorized job aid used; 4. Answers are cued in test
Reduction Factors: 1. Room too cold; 2. Test scheduled too early; 3. Noisy heating system in room; 4. Can’t read test directions
